In this blog post, I will create a deep learning model that uses the negative log-likelihood of a Gaussian distribution as its loss. For this purpose, I will use TensorFlow.
Why not Keras?¶
Keras has been my first-choice deep learning framework over the past year. However, if you want to create custom loss functions or layers, Keras requires you to use backend functions written in either TensorFlow or Theano. As the negative log-likelihood of a Gaussian distribution is not one of the losses available in Keras, I need to implement it in TensorFlow, which is usually my backend. This motivated me to learn TensorFlow and write everything in TensorFlow rather than mixing up the two frameworks.
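For illustration, here is a minimal sketch (my own, not from this post's code) of what such a custom loss would look like in Keras. Even there, the body must be written entirely with backend (K) functions, and the two-column layout of y_pred is a hypothetical convention:
from keras import backend as K
def nll_gaussian_keras(y_true, y_pred):
    ## hypothetical convention: column 0 holds the predicted mean,
    ## column 1 the predicted variance
    mu = y_pred[:, 0:1]
    var = K.clip(y_pred[:, 1:2], 1e-4, 1e10) ## keep the variance strictly positive
    return K.mean(K.log(var) / 2 + K.square(y_true - mu) / (2 * var))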
This blog post assumes that this is your first time using TensorFlow.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import tensorflow as tf
import os, sys
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
## restrict TensorFlow to a single GPU and cap the fraction of its memory we may use
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.95
config.gpu_options.visible_device_list = "1"
tf.Session(config=config)
print("python {}".format(sys.version))
print("tensorflow version {}".format(tf.__version__))
For those familiar with Keras's functional API, one of the roadblocks in using TensorFlow is understanding the concept of a "Tensor". In Keras, defining a deep learning model with the Sequential or functional API does not require any new data-structure concept: the input to a Keras model is a numpy array, and the output from the model is also a numpy array, so the "Tensor" never appears.
However, in TensorFlow, the computational graph, i.e. the network, is defined using the Tensor data structure:
TensorFlow programs use a tensor data structure to represent all data — only tensors are passed between operations in the computation graph. You can think of a TensorFlow tensor as an n-dimensional array or list.
As the purpose of a tensor is to define the graph, and it is an arbitrary array or matrix that represents the graph connections, it does not hold actual values. Therefore, you cannot print a Tensor the way you print a numpy array. To see this, let's start by creating a familiar numpy array and converting it to a Tensor.
y1_np = np.arange(3*2*1).reshape(3,2,1)
y1_np *= y1_np ## square each element in place
y2_np = np.arange(3*2*1).reshape(3,2,1)
print("-"*10)
print("y1_np.shape={}".format(y1_np.shape))
print(y1_np)
print("-"*10)
print("y2_np.shape={}".format(y2_np.shape))
print(y2_np)
I will create a very simple computational graph that simply converts a numpy array to a constant, immutable tensor. You need to specify the data type and shape of the tensor.
You will see that printing the tensors does not show any actual array values!
y1_tf = tf.constant(y1_np,shape=y1_np.shape, dtype="float64")
y2_tf = tf.constant(y2_np,shape=y2_np.shape, dtype="float64")
print("-"*10)
print("y1_tf.shape={}".format(y1_tf.get_shape()))
print(y1_tf)
print("-"*10)
print("y2_tf.shape={}".format(y2_tf.get_shape()))
print(y2_tf)
In order to see the values of a Tensor, you need to start a session and execute the graph (which here simply defines constant tensors). The outputs of the session are numpy arrays.
A session allows to execute graphs or part of graphs. It allocates resources (on one or more machines) for that and holds the actual values of intermediate results and variables.
with tf.Session() as sess:
y1_tf_back_to_np = sess.run(y1_tf)
y2_tf_back_to_np = sess.run(y2_tf)
print("-"*10)
print("y1_tf_back_to_np.shape={}".format(y1_tf_back_to_np.shape))
print(y1_tf_back_to_np)
print("-"*10)
print("y2_tf_back_to_np.shape={}".format(y2_tf_back_to_np.shape))
print(y2_tf_back_to_np)
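As an aside: instead of a with block, TensorFlow also provides tf.InteractiveSession, which installs itself as the default session so that you can call .eval() on a tensor directly. A minimal sketch:
sess = tf.InteractiveSession()
print(y1_tf.eval()) ## equivalent to sess.run(y1_tf)
sess.close()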
def mse(y_pred,y_test, verbose=True):
    '''
    y_pred : tensor
    y_test : tensor having the same shape as y_pred
    '''
    ## element-wise square; preserves the shape of y_pred
    square = tf.square(y_pred - y_test)
    ## mean across all dimensions, returning a scalar
    ms = tf.reduce_mean(square)
    return(ms)
y1_tf = tf.constant(y1_np,shape=y1_np.shape, dtype="float64")
y2_tf = tf.constant(y2_np,shape=y2_np.shape, dtype="float64")
mse_tf = mse(y1_tf,y2_tf)
with tf.Session() as sess:
mse_np = sess.run(mse_tf)
print("======")
print("Result")
print("======")
print("MSE={}".format(mse_np))
print("the numpy calculation must be the same as the tensor flow calculation:")
print(mse_np==np.mean((y1_np - y2_np)**2))
tf.placeholder tensor as input¶
You can also supply the input values to the graph at session run time by using tf.placeholder:
A placeholder is simply a variable that we will assign data to at a later date. It allows us to create our operations and build our computation graph, without needing the data. In TensorFlow terminology, we then feed data into the graph through these placeholders.
The first dimension of the placeholder can be left as None and determined during the session.
Nsample = 10
n_feature = 2
## Define Graph
y1_tf_var = tf.placeholder(tf.float32,[None, n_feature])
y2_tf_var = tf.placeholder(tf.float32,[None, n_feature])
mse_tf = mse(y1_tf_var,y2_tf_var)
## Input
y1_np = np.random.rand(Nsample, n_feature)
y2_np = np.random.rand(Nsample, n_feature)
## Define session
with tf.Session() as sess:
mse_np = sess.run(mse_tf,feed_dict={y1_tf_var:y1_np,
y2_tf_var:y2_np})
print("======")
print("Result")
print("======")
print("MSE={}".format(mse_np))
print("The numpy calculation must be the same as the tensor flow calculation:")
print(np.abs(mse_np-np.mean((y1_np - y2_np)**2)) < 1E-6)
A least squares regression (feed-forward network)¶
Before diving into a deep learning model, let's solve a simpler problem and fit a simple least squares regression model to very small data. A simple least squares regression can be thought of as a feed-forward neural network model with no hidden layer.
I will consider only 5 data points with 2 features.
Generate sample points¶
## input numpy array
N = 5
n_feature = 2
X_np = np.random.rand(N,n_feature)
y_np = (X_np[:,0]*(-3) + X_np[:,1]*2).reshape(-1,1)
fig = plt.figure(figsize=(10,5))
for i in range(X_np.shape[1]):
ax = fig.add_subplot(1,X_np.shape[1],i+1)
ax.plot(X_np[:,i],y_np,"p")
ax.set_xlabel("x{}".format(i))
ax.set_ylabel("y")
plt.show()
We will use tf.Variable for the learning parameters so that the tensors are mutable.
- $ X_i \in R^{(1,2)}$ : input tf.placeholder
- $ w \in R^{(2,1)}$ : tf.Variable
- $ b \in R^{(1,)}$ : tf.Variable
- $ y_i \in R^{(1,)}$ : input tf.placeholder
- $ h_i \in R^{(1,)}$
$$ h_i = X_i w + b $$
$$ MSE = \frac{1}{N} \sum_i (y_i - h_i)^2 $$
def weight_variable(shape):
## weight variable, initialized with truncated normal distribution
initial = tf.truncated_normal(shape, stddev=0.01, dtype="float32")
return tf.Variable(initial)
def bias_variable(shape):
initial = tf.constant(0.001, shape=shape, dtype="float32")
return tf.Variable(initial)
def fully_connected_layer(h0,n_h0,n_h1,verbose=True):
    '''
    h0   : tensor of shape (None, n_h0)
    n_h0 : scalar, the number of input units
    n_h1 : scalar, the number of output units
    '''
W1 = weight_variable([n_h0, n_h1])
b1 = bias_variable([n_h1])
if verbose:
print(" h0.shape={}".format(h0.get_shape()))
print(" W1.shape={}".format(W1.get_shape()))
print(" b1.shape={}".format(b1.get_shape()))
h1 = tf.matmul(h0, W1) + b1
return(h1, (W1,b1))
Define a graph¶
n_output = 1
## Define Graph
x_tf = tf.placeholder(tf.float32,[None, n_feature])
y_tf = tf.placeholder(tf.float32,[None, n_output])
h_tf, (W1_tf, b1_tf) = fully_connected_layer(x_tf,n_feature,n_output,verbose=True)
mse_tf = mse(y_tf,h_tf)
## a single step of the gradient descent to minimize the loss
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(mse_tf)
Define a session¶
Nepochs = 500
mses = []
with tf.Session() as sess:
## This line is necessary because the weights need to be initialized.
tf.global_variables_initializer().run()
for _ in range(Nepochs):
sess.run(train_step,feed_dict={x_tf:X_np,y_tf:y_np})
mse_np = sess.run(mse_tf,feed_dict={x_tf:X_np,y_tf:y_np})
mses.append(mse_np)
h_np = sess.run(h_tf,feed_dict={x_tf:X_np})
    W1_np = sess.run(W1_tf) ## variables do not depend on the input, so no feed_dict is needed
    b1_np = sess.run(b1_tf)
y_est_np = np.dot(X_np, W1_np) + b1_np
def check_equal(A,B,small=1E-5):
TF = np.all(np.abs(A - B) < small)
print("Numerical Check, 2 objects are numerically the same: {}".format(TF))
print("======")
print("Result")
print("======")
print("MSE is almost zero:")
print("MSE={}\n".format(mse_np))
print("Parameter:\n W1={}\n b1={}".format(W1_np,b1_np))
print("Estimate:\n h={}".format(h_np))
check_equal(h_np, y_est_np)
plt.plot(mses)
plt.title("mean square loss")
plt.show()
Evaluate the gradient of the MSE with respect to weight, bias and the input data¶
It turns out that you can readily evaluate gradients with TensorFlow, via tf.gradients!
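For reference, these are the closed-form gradients that the numpy checks below implement. With $ MSE = \frac{1}{N} \sum_i (y_i - h_i)^2 $, $ h = X W_1 + b_1 $ and $ N = 5 $:
$$ \frac{\partial MSE}{\partial h} = \frac{2}{N} (h - y), \quad \frac{\partial MSE}{\partial W_1} = X^T \frac{\partial MSE}{\partial h}, \quad \frac{\partial MSE}{\partial b_1} = \sum_i \left( \frac{\partial MSE}{\partial h} \right)_i, \quad \frac{\partial MSE}{\partial X} = \frac{\partial MSE}{\partial h} W_1^T $$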
grad_tf = tf.gradients(mse_tf,[h_tf,W1_tf,b1_tf,x_tf])
feed_dict={x_tf:X_np,
y_tf:y_np,
W1_tf : W1_np,
b1_tf: b1_np}
with tf.Session() as sess:
dh, dW1,db1,dx = sess.run(grad_tf,
feed_dict=feed_dict)
h_np = sess.run(h_tf,feed_dict=feed_dict)
print("=="*10)
print("dh:")
print(dh)
dh_np = 2*(h_np - y_np)/5.0 ## N = 5 data points
check_equal(dh, dh_np)
print("=="*10)
print("dW1:")
print(dW1)
dW1_np = np.dot(X_np.T, dh_np)
check_equal(dW1, dW1_np)
print("=="*10)
print("db1:")
print(db1)
db1_np = np.sum(dh_np)
check_equal(db1, db1_np)
print("=="*10)
print("dx:")
print(dx)
dx_np = np.dot(W1_np, dh_np.T).T
check_equal(dx, dx_np)
Fit a feed-forward network with negative log-likelihood as a loss¶
Now, let's generate more complex data and fit a more complex model to it. The input is a one-dimensional sequence ranging between -2 and 2, with a gap between -1.5 and -1. The output series is a function of the input at the corresponding point plus some noise. The noise level is highest around input = 0.
## Define x
inc = 0.001
x_train =np.concatenate([np.arange(-2,-1.5,inc),
np.arange(-1,2,inc)])
x_train = x_train.reshape(len(x_train),1)
## Define y
steps_per_cycle = 1
def sinfun(xs,noise=0.001):
import random, math
xs = xs.flatten()
def randomNoise(x):
        ## the noise amplitude is largest at x = 0 and shrinks linearly with |x|
        ax = 2 - np.abs(x)
wnoise = random.uniform(-noise*ax,
noise*ax)
return(math.sin(x * (2 * math.pi / steps_per_cycle) ) + wnoise)
vec = [randomNoise(x) - x for x in xs]
return(np.array(vec).flatten())
y_train0 = sinfun(x_train,noise=0.5)
y_train = y_train0.reshape(len(y_train0),1)
print(" x_train.shape={}".format(x_train.shape))
print(" y_train.shape={}".format(y_train.shape))
## Visualize the generated data (X,y)
plt.figure(figsize=(10,3))
plt.scatter(x_train,y_train,s=0.5)
plt.xlabel("x_train")
plt.ylabel("y_train")
plt.title("The x values of the synthetic data ranges between {:4.3f} and {:4.3f}".format(
np.min(x_train),np.max(x_train)))
plt.show()
Model¶
Now that we know TensorFlow, we are free to create and use any loss function for our model!
To model this data, I will use a neural network that outputs two values in the final layer, corresponding to the predicted mean $\mu(x)$ and the predicted variance $\sigma^2(x) > 0$.
This essentially means that I am treating the observed value as a sample from a (heteroscedastic) Gaussian distribution with the predicted mean and variance as:
$$ y_i | x_i \overset{i.i.d.}{\sim} N\left( \mu(x_i), \sigma^2(x_i)\right) $$
Then we minimize the negative log-likelihood criterion, instead of using MSE as a loss:
$$ NLL = \sum_i \frac{ \textrm{log} \left(\sigma^2(x_i)\right) }{2} + \frac{ \left(y_i - \mu(x_i) \right)^2 }{ 2 \sigma^2(x_i) } $$
Notice that when $\sigma^2(x_i)=1$, the first term of the NLL vanishes, and this loss function becomes essentially the same as the MSE:
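$$ NLL \big|_{\sigma^2(x_i)=1} = \sum_i \frac{ \left(y_i - \mu(x_i) \right)^2 }{2} = \frac{N}{2} \, MSE $$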
By modeling $\sigma^2(x_i)$, in theory, our model can put more weight on the data points with less noise. For example, my data above has more variability when the input variable is around the origin, so ideally we want our model to put less weight on this region.
Use of NLL as a loss function is discussed at:
- Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
- Estimating the mean and variance of the target probability distribution
For comparison purposes, we will fit two models with two losses:
- MSE (i.e., $\sigma^2(x_i)=1$)
- NLL
def nll_gaussian(y_pred_mean,y_pred_sd,y_test):
    '''
    y_pred_mean : tensor, the predicted mean mu(x)
    y_pred_sd   : tensor, the predicted variance sigma^2(x), despite the name
    y_test      : tensor having the same shape as y_pred_mean
    '''
    ## element-wise square; preserves the shape of y_pred_mean
    square = tf.square(y_pred_mean - y_test)
    ms = tf.add(tf.divide(square,y_pred_sd), tf.log(y_pred_sd))
    ## return a scalar; the factor 1/2 in the NLL formula is dropped and the sum
    ## is replaced by a mean, neither of which changes the location of the minimum
    ms = tf.reduce_mean(ms)
    return(ms)
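As with the MSE earlier, we can sanity-check nll_gaussian against a direct numpy computation (this check is my own addition; the array names are arbitrary):
mu_np  = np.random.rand(4,1)
var_np = np.random.rand(4,1) + 0.5 ## keep the variance strictly positive
yy_np  = np.random.rand(4,1)
nll_tf = nll_gaussian(tf.constant(mu_np), tf.constant(var_np), tf.constant(yy_np))
with tf.Session() as sess:
    nll_np = sess.run(nll_tf)
print(np.abs(nll_np - np.mean((mu_np - yy_np)**2/var_np + np.log(var_np))) < 1E-6)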
def define_model(n_feature,n_hs,n_output,verbose=True,NLL=True):
x_input = tf.placeholder(tf.float32, [None, n_feature])
y_input = tf.placeholder(tf.float32, [None, 1])
h_previous = x_input
n_h_previous = n_feature
paras = []
for ilayer,n_h in enumerate(n_hs,1):
if verbose:
print(" layer:{}".format(ilayer))
h, p = fully_connected_layer(h_previous,n_h_previous,n_h,verbose)
h_previous = tf.nn.relu(h)
n_h_previous = n_h
paras.append(p)
if verbose:
print(" output layer for y_mean")
y_mean,p = fully_connected_layer(h_previous,n_h_previous,n_output,verbose)
paras.append(p)
if NLL:
if verbose:
print(" output layer for y_sigma")
y_sigma, p = fully_connected_layer(h_previous,n_h_previous,n_output,verbose)
        ## for numerical stability, this enforces the variance to be at least 1E-4
y_sigma = tf.clip_by_value(t=tf.exp(y_sigma),
clip_value_min=tf.constant(1E-4),
clip_value_max=tf.constant(1E+100))
paras.append(p)
loss = nll_gaussian(y_mean, y_sigma,y_input)
y = [y_mean, y_sigma]
else:
loss = mse(y_mean, y_input)
y = [y_mean]
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
inputs = [x_input, y_input]
return(loss, train_step, inputs, y, paras)
print("___"*10)
print("loss=NLL")
loss, train_step, inputs, y, _ = define_model(n_feature=1,
n_hs=[500,300,100],
n_output = 1)
[x_input, _] = inputs
[y_mean, y_sigma] = y
print("___"*10)
print("loss=MSE")
## For comparison purpose, we also consider MSE as a loss
loss_mse, train_step_mse, inputs_mse, y_mse, _ = define_model(
n_feature=1,
n_hs=[500,300,100],
n_output = 1,
NLL=False)
[x_input_mse, _] = inputs_mse
[y_mean_mse] = y_mse
Training starts here¶
from sklearn.utils import shuffle
n_epochs = 8000
n_batch = 500
def train(train_step,loss,inputs,
x_train,y_train,n_epochs,n_batch):
x_input, y_input = inputs
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
lvalues = []
for count in range(n_epochs):
x_shuffle, y_shuffle = shuffle(x_train,y_train)
for i in range(0,x_train.shape[0],n_batch):
batch_xs = x_shuffle[i:i+n_batch]
batch_ys = y_shuffle[i:i+n_batch]
sess.run(train_step,
feed_dict={x_input: batch_xs,
y_input: batch_ys})
lv = sess.run(loss,
feed_dict={x_input: x_train,
y_input: y_train})
lvalues.append(lv)
if count % 1000 == 1:
print(" epoch={:05.0f}: {:5.3f}".format(count,lv))
return(sess,lvalues,lv)
print("loss=NLL")
sess, lvalues, lv = train(train_step,loss,
inputs,
x_train,y_train,
n_epochs,n_batch)
print("loss=MSE")
sess_mse, lvalues_mse, lv_mse = train(train_step_mse,
loss_mse,
inputs_mse,
x_train,y_train,
n_epochs,n_batch)
Training loss over epochs¶
MSE can be thought of as the NLL with the variance fixed at 1. The model trained with the NLL loss reaches a smaller NLL than the model trained with the MSE loss.
plt.plot(lvalues,label="loss=NLL")
plt.plot(lvalues_mse,label="loss=MSE")
plt.title("loss")
plt.xlabel("epochs")
plt.legend()
plt.show()
Prediction with testing set¶
The input of the testing set is a sequence ranging between -2.5 and 2.5 with an increment of 0.001. Notice that the RMSE on the test set is smaller for the model with the NLL loss than for the model with the MSE loss. This is because the model with the NLL loss makes a more reasonable assumption: the variance depends on the input value.
x_test = np.arange(-2.5,2.5,inc).reshape(-1,1)
y_test = sinfun(x_test,noise=0).reshape(-1,1) ## reshape to (N,1) so that (y_test - y_mean_pred) subtracts element-wise
print("**")
print("Model with NLL as a loss")
y_mean_pred = sess.run(y_mean,feed_dict={x_input: x_test})
y_sigma_pred = sess.run(y_sigma,feed_dict={x_input: x_test})
print("RMSE: {}".format(np.sqrt(np.mean((y_test - y_mean_pred)**2))))
print("**")
print("Model with MSE as a loss")
y_mean_pred_mse = sess_mse.run(y_mean_mse,feed_dict={x_input_mse: x_test})
print("RMSE: {}".format(np.sqrt(np.mean((y_test - y_mean_pred_mse)**2))))
Plot of estimated y¶
The NLL model can return not only the estimated mean but also the estimated variance, which tells us about the noise level in the training data.
xlim = (-2.5,2.5)
fig = plt.figure(figsize=(15,10))
ax = fig.add_subplot(4,1,1)
ax.plot(x_train,y_train,"p",alpha=0.1,c="red",label="data")
ax.plot(x_test,y_mean_pred_mse,label="estimated_y (loss=MSE)")
ax.plot(x_test,y_test, label="true")
ax.set_xlabel("x")
ax.set_xlim(*xlim)
ax.legend()
ax = fig.add_subplot(4,1,2)
ax.plot(x_train,y_train,"p",alpha=0.1,c="red",label="data")
ax.plot(x_test,y_mean_pred,label="estimated_y (loss=NLL)")
ax.plot(x_test,y_test, label="true")
ax.set_xlabel("x")
ax.set_xlim(*xlim)
ax.legend()
ax = fig.add_subplot(4,1,3)
ax.plot(x_test,y_sigma_pred,label="estimated_sigma2")
ax.set_xlabel("x")
ax.set_xlim(*xlim)
ax.legend()
ax = fig.add_subplot(4,1,4)
ax.plot(x_train,y_train,"p",alpha=0.1,label="data",c="red")
ax.errorbar(x_test.flatten(), y_mean_pred.flatten(),
            yerr=1.96*np.sqrt(y_sigma_pred.flatten()),
            capsize=0,label="estimated_y +- 1.96*sigma")
ax.set_xlabel("x")
ax.set_xlim(*xlim)
ax.legend()
plt.show()