Evaluating the quality of predictive uncertainties is challenging as "ground truth" uncertainty is usually not available. Yet, model's confidence about its estimation is often of interest for researchers. If the model can tell "what it knows" or what is "out of distribution", such infomation gives insights about when the researchers should take the point estimates as their face values.
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning recently proposed using Monte Carlo dropout to estimate predictive uncertainty by using Dropout at test time. Most of the recent work on uncertainty in deep learning is in this line. I study MC dropout method for evaluating model's confidence in Measure the uncertainty in deep learning models using dropout. I found that this MC dropout method requires the model to restrict the choice of network structures even for modeling simple data; activation functions seem to have to be relu and we seem to need to have quite a few dropout layers.
I recently found a new paper Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles in this line of uncertainty research. Their methods do not take Bayesian approaches and it has a lot of similarity with the maximum likelihood approach.
In this blog post, I will review this new method.
Model
The approach assumes that given available input $X$, the target has a normal distribution with mean and variance dependent on the values of $X$: $$ Y | X \overset{i.i.d.}{\sim} N \left(\mu_m (X) , \sigma_m^2(X) \right) $$ Here, $\mu_m (X)$ and $\sigma_m^2(X)$ are modelled non-linearly using neural network. Then the maximum likelihood methods are used to estimate unknown parameters $\mu_m (X)$, $\sigma_m^2(X)$.In practice, estimates become more accurate if $M$ models are ensembled together. Therefore, Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles also proposed to use the ensemble model for prediction. This essentially means that you are assuming that conditional distribution of $Y$ to be from the mixture: $$ Y \left| X \right. \sim \frac{1}{M}\sum_{m=1}^M N \left(\mu_m (X) , \sigma_m^2(X) \right) $$
Under this ensemble model structure, mean $\mu_{*} (X)$ and variance $\sigma_{*}^2(X)$ of the marginal distribution can be calcualted as:
\begin{array}{rcll} \mu_{*} (X) &=& \frac{1}{M}\sum_{m=1}^M \mu_m (X)\\ \sigma_{*}^2 (X) &=& \frac{1}{M} \sum_{m=1}^M \left( \sigma_m^2(X) + \mu_m (X)^2 \right) - \mu_{*} (X)^2\\ \end{array}
These are because: \begin{array}{rcll} \mu_{*} (X) & := & E(Y)\\ &=& \sum_{m=1}^M E(Y|m) \\ &=& \frac{1}{M}\sum_{m=1}^M \mu_m (X)\\ \sigma_{*}^2 (X) &:=& Var(Y) \\ &=& E_m(Var(Y|m)) + Var_m(E(Y|m)) \\ &=& \frac{1}{M}\sum_{m=1}^M \sigma_m^2(X) + Var_m \left( \mu_m (X) \right)\\ \textrm{ where } Var_m \left( \mu_m (X) \right) &=& E_m \left( \mu_m (X)^2 \right) - E_m \left( \mu_m (X) \right)^2\\ &=& \frac{1}{M} \sum_{m=1}^M \mu_m (X)^2 - \left( \frac{1}{M}\sum_{m=1}^M \mu_m (X) \right)^2\\ &=& \frac{1}{M} \sum_{m=1}^M \mu_m (X)^2 - \mu_{*} (X)^2 \end{array} See Wikipedia if you get lost dividing variance into its components.
Finally, the authors proposed to use adversarial training to improve the estimation of the likelihood:
Adversarial training can be interpreted as a computationally efficient solution to smooth the predictive distributions by increasing the likelihood of the target around an $\epsilon$ neighboorhood.I previously studied adversarial examples. So I will not discuss its details here. See Generate adversarial examples using TensorFlow.
In this blog, I would like to know how easy it is to implement this procedure and whether adversarial training is really necessary.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import tensorflow as tf
import os, sys
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.95
config.gpu_options.visible_device_list = "1"
tf.Session(config=config)
print("python {}".format(sys.version))
print("tensorflow version {}".format(tf.__version__))
Generate synthetic data¶
I will use the same simulated data as the previous blog: TensorFlow newbie creates a neural net with a negative log likelihood as a loss.
## Define x
inc = 0.001
x_train =np.concatenate([np.arange(-2,-1.5,inc),
np.arange(-1,2,inc)])
x_train = x_train.reshape(len(x_train),1)
## Define y
steps_per_cycle = 1
def sinfun(xs,noise=0.001):
import random, math
xs = xs.flatten()
def randomNoise(x):
ax = 2 - np.abs(x)
wnoise = random.uniform(-noise*ax,
noise*ax)
return(math.sin(x * (2 * math.pi / steps_per_cycle) ) + wnoise)
vec = [randomNoise(x) - x for x in xs]
return(np.array(vec).flatten())
y_train0 = sinfun(x_train,noise=0.5)
y_train = y_train0.reshape(len(y_train0),1)
print(" x_train.shape={}".format(x_train.shape))
print(" y_train.shape={}".format(y_train.shape))
## Visualize the generated data (X,y)
plt.figure(figsize=(10,3))
plt.scatter(x_train,y_train,s=0.5)
plt.xlabel("x_train")
plt.ylabel("y_train")
plt.title("The x values of the synthetic data ranges between {:4.3f} and {:4.3f}".format(
np.min(x_train),np.max(x_train)))
plt.show()
Define model¶
I will consider 3-hidden layer NN for modeling this synthetic data. I will consider negative log-likelihood as a loss function. This model was also used in the blog post: TensorFlow newbie creates a neural net with a negative log likelihood as a loss. Notice that all functions are the same as ones in the previous blog, except "define_model" function. The "define_model" function now has tensors to calcualte the adversarial examples. The codes for calculating adversarial examples are mostly borrowed from the previous blog Generate adversarial examples using TensorFlow.
def weight_variable(shape):
## weight variable, initialized with truncated normal distribution
initial = tf.truncated_normal(shape, stddev=0.01, dtype="float32")
return tf.Variable(initial)
def bias_variable(shape):
initial = tf.constant(0.001, shape=shape, dtype="float32")
return tf.Variable(initial)
def fully_connected_layer(h0,n_h0,n_h1,verbose=True):
'''
h0 : tensor of shape (n_h0, n_h1)
n_h0 : scalar
n_h1 : scalar
'''
W1 = weight_variable([n_h0, n_h1])
b1 = bias_variable([n_h1])
if verbose:
print(" h0.shape={}".format(h0.get_shape()))
print(" W1.shape={}".format(W1.get_shape()))
print(" b1.shape={}".format(b1.get_shape()))
h1 = tf.matmul(h0, W1) + b1
return(h1, (W1,b1))
def nll_gaussian(y_pred_mean,y_pred_sd,y_test):
## element wise square
square = tf.square(y_pred_mean - y_test)## preserve the same shape as y_pred.shape
ms = tf.add(tf.divide(square,y_pred_sd), tf.log(y_pred_sd))
## axis = -1 means that we take mean across the last dimension
## the output keeps all but the last dimension
## ms = tf.reduce_mean(ms,axis=-1)
## return scalar
ms = tf.reduce_mean(ms)
return(ms)
def mse(y_pred,y_test, verbose=True):
'''
y_pred : tensor
y_test : tensor having the same shape as y_pred
'''
## element wise square
square = tf.square(y_pred - y_test)## preserve the same shape as y_pred.shape
## mean across the final dimensions
ms = tf.reduce_mean(square)
return(ms)
def define_model(n_feature,n_hs,n_output,eps=0.1,verbose=True,NLL=True):
x_input_shape = [None, n_feature]
x_input = tf.placeholder(tf.float32, x_input_shape)
y_input = tf.placeholder(tf.float32, [None, 1])
h_previous = x_input
n_h_previous = n_feature
paras = []
for ilayer,n_h in enumerate(n_hs,1):
if verbose:
print(" layer:{}".format(ilayer))
h, p = fully_connected_layer(h_previous,n_h_previous,n_h,verbose)
h_previous = tf.nn.relu(h)
n_h_previous = n_h
paras.append(p)
if verbose:
print(" output layer for y_mean")
y_mean,p = fully_connected_layer(h_previous,n_h_previous,n_output,verbose)
paras.append(p)
if NLL:
if verbose:
print(" output layer for y_sigma")
y_sigma, p = fully_connected_layer(h_previous,n_h_previous,n_output,verbose)
## for numerical stability this enforce the variance to be more than 1E-1
y_sigma = tf.clip_by_value(t=tf.exp(y_sigma),
clip_value_min=tf.constant(1E-1),
clip_value_max=tf.constant(1E+100))
paras.append(p)
loss = nll_gaussian(y_mean, y_sigma,y_input)
y = [y_mean, y_sigma]
else:
loss = mse(y_mean, y_input)
y = [y_mean]
## tensor to calculate the adversarial image
eps_tf = tf.constant(float(eps),name="epsilon")
grad_tf = tf.gradients(loss,[x_input])
grad_sign_tf = tf.sign(grad_tf)
grad_sign_eps_tf = tf.scalar_mul(eps_tf,
grad_sign_tf)
aimage = tf.add(grad_sign_eps_tf,x_input)
aimage = tf.reshape(aimage,[-1,n_feature])
x_input_a = tf.concat([aimage,
x_input],axis=0)
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
inputs = (x_input,y_input, x_input_a)
tensors = (inputs, loss, train_step, y, paras)
return(tensors)
Functions to train the model¶
Notice that every batch is augmented by adversarial examples.
def train(x_train,y_train,
tensors,
n_batch = 500,
n_epochs = 5000,
adversarialTF = True):
from sklearn.utils import shuffle
import time
inputs, loss, train_step, _ , _ = tensors
x_input,y_input, x_input_a = inputs
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
lvalues = []
start = time.time()
for i_epoch in range(n_epochs):
x_shuffle, y_shuffle = shuffle(x_train,y_train)
for i in range(0,x_train.shape[0],n_batch):
batch_xs = x_shuffle[i:i+n_batch]
batch_ys = y_shuffle[i:i+n_batch]
## obtain adversarial samples
if adversarialTF:
batch_xs = sess.run(x_input_a,
feed_dict={x_input: batch_xs,
y_input: batch_ys})
batch_ys = np.concatenate([batch_ys,
batch_ys],axis=0)
sess.run(train_step,
feed_dict={x_input: batch_xs,
y_input: batch_ys})
lv = sess.run(loss,
feed_dict={x_input: x_train,
y_input: y_train})
lvalues.append(lv)
if (i_epoch % 1000) - 999 == 0:
print(" epoch {: 5.0f} loss {: 5.4f} {:3.0f}sec".format(
i_epoch,lv,time.time() - start))
start = time.time()
return(sess,lvalues)
Training multiple models¶
Here, I consider two training setting:
- training with adversarial examples
- training without adversarial examples
It seems that the training without adversarial examples takes 2/3 less times than with adversarial examples.
result_NLL, result_NLL_adv = [], []
n_hs = [500,300,100]
eps = 0.05
Nensemble = 5
n_epochs = 5000
for iens in range(Nensemble):
print("Adversarial training: {}th ensemble".format(iens))
tensors = define_model(n_feature=1,
n_hs=n_hs,
n_output = 1,
eps=eps,
verbose=False)
sess, lvalues = train(x_train,
y_train,
tensors,
n_epochs = n_epochs,
adversarialTF = True)
result_NLL_adv.append((sess, lvalues, tensors))
print("Regular training: {}th ensemble".format(iens))
tensors = define_model(n_feature=1,
n_hs=n_hs,
n_output = 1,
eps=eps,
verbose=False)
sess, lvalues = train(x_train,
y_train,
tensors,
n_epochs = n_epochs,
adversarialTF = False)
result_NLL.append((sess, lvalues, tensors))
Prediction on testing data¶
Here I estimate the mean $\mu_m (X)$ and variance $\sigma_m^2(X)$ of each model.
x_test = np.arange(-2.5,2.5,inc).reshape(-1,1)
y_test = sinfun(x_test,noise=0)
def get_predicted_values(x_test,result_NLL_adv):
y_mean_pred, y_sigma_pred = [], []
for iens,mod in enumerate(result_NLL_adv):
(sess, lvalues, tensors) = mod
(inputs, _ , _ , y, _) = tensors
(x_input, _ , _ ) = inputs
(y_mean, y_sigma) = y
y_mean_pred.append(sess.run(y_mean,feed_dict={x_input: x_test}))
y_sigma_pred.append(sess.run(y_sigma,feed_dict={x_input: x_test}))
y_mean_pred = np.array(y_mean_pred).squeeze()
y_sigma_pred = np.array(y_sigma_pred).squeeze()
return(y_mean_pred,y_sigma_pred)
prediction_adv = get_predicted_values(x_test,result_NLL_adv)
prediction_ = get_predicted_values(x_test,result_NLL)
From mean $\mu_m (X)$ and variance $\sigma_m^2(X)$, I will estimate the aggregate estimates:
\begin{array}{rcll}
\mu_{*} (X)
&=& \frac{1}{M}\sum_{m=1}^M \mu_m (X)\\
\sigma_{*}^2 (X)
&=& \frac{1}{M} \sum_{m=1}^M \left( \sigma_m^2(X) + \mu_m (X)^2 \right) - \mu_{*} (X)^2\\
\end{array}
def get_marginal_var(y_mean,y_sigma):
'''
y_mean : M by N
y_sigma : M by N
'''
y_mean_mar = np.mean(y_mean,axis=0) ## N by 1
y_sigma_mar = np.mean(y_mean**2 + y_sigma,axis=0) - y_mean_mar**2
return(y_mean_mar,y_sigma_mar)
y_mean_adv, y_sigma_adv = get_marginal_var(*prediction_adv)
y_mean_ , y_sigma_ = get_marginal_var(*prediction_)
def plot(x_train,y_train,prediction_adv,iensemble=3):
'''
y_mean_mar, y_sigma_mar = prediction_adv
'''
def plotdata(axs):
axs.scatter(x_train,y_train,s=0.5, alpha=0.5)
def ploterrorbar(x_test,y_test,y_sigma):
axs.plot(x_test,y_test)
axs.errorbar(x_test, y_test,
yerr=1.96*np.sqrt(y_sigma),
capsize=0,alpha=0.1,
label="estimated_y +- 1.96*sigma")
fig = plt.figure(figsize=(15,15))
## Plot training data with estimated y from each model
axs = fig.add_subplot(4,1,1)
plotdata(axs)
for i, y_pre in enumerate(prediction_adv[0]):
axs.plot(x_test,y_pre,label=" Ensemble {:}".format(i))
axs.set_title("Estimated y from each ensemble model")
axs.legend()
## Plot estimated sigma
axs = fig.add_subplot(4,1,2)
for i, y_sig in enumerate(prediction_adv[1]):
axs.plot(x_test,y_sig,label=" Ensemble {:}".format(i))
axs.set_title("Estimated sigma from each ensemble model")
## Plot a single estimated y with error bar
axs = fig.add_subplot(4,1,3)
plotdata(axs)
ploterrorbar(x_test,
prediction_adv[0][iensemble],
prediction_adv[1][iensemble])
axs.set_title("Estimated y with estimated sigma from one of the ensemble models")
## Plot
axs = fig.add_subplot(4,1,4)
axs.plot(x_test,y_mean_adv,label=" Marginal Mean {:}".format(i))
plotdata(axs)
ploterrorbar(x_test,
y_mean_adv,
y_sigma_adv)
axs.set_title("Aggregate Prediction")
plt.legend()
plt.show()
const = 80
iensemble = 1
print("~"*const)
print("Visualization of results with adversarial training")
print("~"*const)
plot(x_train,y_train,prediction_adv,iensemble=iensemble)
print("~"*const)
print("Visualization of results without adversarial training")
print("~"*const)
plot(x_train,y_train,prediction_ ,iensemble=iensemble)