# Evaluate uncertainty using ensemble models with likelihood loss and adverserial training

Evaluating the quality of predictive uncertainties is challenging as "ground truth" uncertainty is usually not available. Yet, model's confidence about its estimation is often of interest for researchers. If the model can tell "what it knows" or what is "out of distribution", such infomation gives insights about when the researchers should take the point estimates as their face values.

Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning recently proposed using Monte Carlo dropout to estimate predictive uncertainty by using Dropout at test time. Most of the recent work on uncertainty in deep learning is in this line. I study MC dropout method for evaluating model's confidence in Measure the uncertainty in deep learning models using dropout. I found that this MC dropout method requires the model to restrict the choice of network structures even for modeling simple data; activation functions seem to have to be relu and we seem to need to have quite a few dropout layers.

I recently found a new paper Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles in this line of uncertainty research. Their methods do not take Bayesian approaches and it has a lot of similarity with the maximum likelihood approach.

In this blog post, I will review this new method.

## Model

The approach assumes that given available input $X$, the target has a normal distribution with mean and variance dependent on the values of $X$: $$Y | X \overset{i.i.d.}{\sim} N \left(\mu_m (X) , \sigma_m^2(X) \right)$$ Here, $\mu_m (X)$ and $\sigma_m^2(X)$ are modelled non-linearly using neural network. Then the maximum likelihood methods are used to estimate unknown parameters $\mu_m (X)$, $\sigma_m^2(X)$.

In practice, estimates become more accurate if $M$ models are ensembled together. Therefore, Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles also proposed to use the ensemble model for prediction. This essentially means that you are assuming that conditional distribution of $Y$ to be from the mixture: $$Y \left| X \right. \sim \frac{1}{M}\sum_{m=1}^M N \left(\mu_m (X) , \sigma_m^2(X) \right)$$

Under this ensemble model structure, mean $\mu_{*} (X)$ and variance $\sigma_{*}^2(X)$ of the marginal distribution can be calcualted as:

\begin{array}{rcll} \mu_{*} (X) &=& \frac{1}{M}\sum_{m=1}^M \mu_m (X)\\ \sigma_{*}^2 (X) &=& \frac{1}{M} \sum_{m=1}^M \left( \sigma_m^2(X) + \mu_m (X)^2 \right) - \mu_{*} (X)^2\\ \end{array}

These are because: \begin{array}{rcll} \mu_{*} (X) & := & E(Y)\\ &=& \sum_{m=1}^M E(Y|m) \\ &=& \frac{1}{M}\sum_{m=1}^M \mu_m (X)\\ \sigma_{*}^2 (X) &:=& Var(Y) \\ &=& E_m(Var(Y|m)) + Var_m(E(Y|m)) \\ &=& \frac{1}{M}\sum_{m=1}^M \sigma_m^2(X) + Var_m \left( \mu_m (X) \right)\\ \textrm{ where } Var_m \left( \mu_m (X) \right) &=& E_m \left( \mu_m (X)^2 \right) - E_m \left( \mu_m (X) \right)^2\\ &=& \frac{1}{M} \sum_{m=1}^M \mu_m (X)^2 - \left( \frac{1}{M}\sum_{m=1}^M \mu_m (X) \right)^2\\ &=& \frac{1}{M} \sum_{m=1}^M \mu_m (X)^2 - \mu_{*} (X)^2 \end{array} See Wikipedia if you get lost dividing variance into its components.

Finally, the authors proposed to use adversarial training to improve the estimation of the likelihood:

Adversarial training can be interpreted as a computationally efficient solution to smooth the predictive distributions by increasing the likelihood of the target around an $\epsilon$ neighboorhood.
I previously studied adversarial examples. So I will not discuss its details here. See Generate adversarial examples using TensorFlow.

In this blog, I would like to know how easy it is to implement this procedure and whether adversarial training is really necessary.

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import tensorflow as tf
import os, sys

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.95
config.gpu_options.visible_device_list = "1"
tf.Session(config=config)

print("python {}".format(sys.version))
print("tensorflow version {}".format(tf.__version__))

python 2.7.13 |Anaconda 4.3.1 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
tensorflow version 1.2.0


## Generate synthetic data¶

I will use the same simulated data as the previous blog: TensorFlow newbie creates a neural net with a negative log likelihood as a loss.

In [2]:
## Define x
inc = 0.001
x_train =np.concatenate([np.arange(-2,-1.5,inc),
np.arange(-1,2,inc)])
x_train = x_train.reshape(len(x_train),1)

## Define y
steps_per_cycle = 1
def sinfun(xs,noise=0.001):
import random, math
xs = xs.flatten()
def randomNoise(x):
ax = 2 - np.abs(x)
wnoise = random.uniform(-noise*ax,
noise*ax)
return(math.sin(x * (2 * math.pi / steps_per_cycle) ) + wnoise)
vec = [randomNoise(x) - x  for x in xs]
return(np.array(vec).flatten())

y_train0 =  sinfun(x_train,noise=0.5)
y_train = y_train0.reshape(len(y_train0),1)

print(" x_train.shape={}".format(x_train.shape))
print(" y_train.shape={}".format(y_train.shape))

## Visualize the generated data (X,y)
plt.figure(figsize=(10,3))
plt.scatter(x_train,y_train,s=0.5)
plt.xlabel("x_train")
plt.ylabel("y_train")
plt.title("The x values of the synthetic data ranges between {:4.3f} and {:4.3f}".format(
np.min(x_train),np.max(x_train)))
plt.show()

 x_train.shape=(3500, 1)
y_train.shape=(3500, 1)


### Define model¶

I will consider 3-hidden layer NN for modeling this synthetic data. I will consider negative log-likelihood as a loss function. This model was also used in the blog post: TensorFlow newbie creates a neural net with a negative log likelihood as a loss. Notice that all functions are the same as ones in the previous blog, except "define_model" function. The "define_model" function now has tensors to calcualte the adversarial examples. The codes for calculating adversarial examples are mostly borrowed from the previous blog Generate adversarial examples using TensorFlow.

In [3]:
def weight_variable(shape):
## weight variable, initialized with truncated normal distribution
initial = tf.truncated_normal(shape, stddev=0.01, dtype="float32")
return tf.Variable(initial)

def bias_variable(shape):
initial = tf.constant(0.001, shape=shape, dtype="float32")
return tf.Variable(initial)

def fully_connected_layer(h0,n_h0,n_h1,verbose=True):
'''
h0   :  tensor of shape (n_h0, n_h1)
n_h0 :  scalar
n_h1 :  scalar
'''
W1 = weight_variable([n_h0, n_h1])
b1 =  bias_variable([n_h1])

if verbose:
print("    h0.shape={}".format(h0.get_shape()))
print("    W1.shape={}".format(W1.get_shape()))
print("    b1.shape={}".format(b1.get_shape()))

h1 = tf.matmul(h0, W1) + b1
return(h1, (W1,b1))
def nll_gaussian(y_pred_mean,y_pred_sd,y_test):

## element wise square
square = tf.square(y_pred_mean - y_test)## preserve the same shape as y_pred.shape
## axis = -1 means that we take mean across the last dimension
## the output keeps all but the last dimension
## ms = tf.reduce_mean(ms,axis=-1)
## return scalar
ms = tf.reduce_mean(ms)
return(ms)
def mse(y_pred,y_test, verbose=True):
'''
y_pred : tensor
y_test : tensor having the same shape as y_pred
'''
## element wise square
square = tf.square(y_pred - y_test)## preserve the same shape as y_pred.shape
## mean across the final dimensions
ms = tf.reduce_mean(square)
return(ms)

def define_model(n_feature,n_hs,n_output,eps=0.1,verbose=True,NLL=True):

x_input_shape = [None, n_feature]
x_input = tf.placeholder(tf.float32, x_input_shape)
y_input = tf.placeholder(tf.float32, [None, 1])

h_previous = x_input
n_h_previous = n_feature
paras = []
for ilayer,n_h in enumerate(n_hs,1):
if verbose:
print("  layer:{}".format(ilayer))
h, p = fully_connected_layer(h_previous,n_h_previous,n_h,verbose)
h_previous = tf.nn.relu(h)
n_h_previous = n_h
paras.append(p)
if verbose:
print("  output layer for y_mean")
y_mean,p = fully_connected_layer(h_previous,n_h_previous,n_output,verbose)
paras.append(p)

if NLL:
if verbose:
print("  output layer for y_sigma")
y_sigma, p = fully_connected_layer(h_previous,n_h_previous,n_output,verbose)
## for numerical stability this enforce the variance to be more than 1E-1
y_sigma = tf.clip_by_value(t=tf.exp(y_sigma),
clip_value_min=tf.constant(1E-1),
clip_value_max=tf.constant(1E+100))

paras.append(p)
loss = nll_gaussian(y_mean, y_sigma,y_input)
y = [y_mean, y_sigma]
else:
loss = mse(y_mean, y_input)
y = [y_mean]

## tensor to calculate the adversarial image
eps_tf = tf.constant(float(eps),name="epsilon")
aimage = tf.reshape(aimage,[-1,n_feature])
x_input_a = tf.concat([aimage,
x_input],axis=0)

inputs = (x_input,y_input, x_input_a)
tensors = (inputs, loss, train_step, y, paras)
return(tensors)


## Functions to train the model¶

Notice that every batch is augmented by adversarial examples.

In [4]:
def train(x_train,y_train,
tensors,
n_batch  = 500,
n_epochs = 5000,
from sklearn.utils import shuffle
import time
inputs, loss, train_step, _ , _ = tensors
x_input,y_input, x_input_a  = inputs

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

lvalues = []
start = time.time()
for i_epoch in range(n_epochs):

x_shuffle, y_shuffle = shuffle(x_train,y_train)
for i in range(0,x_train.shape[0],n_batch):
batch_xs = x_shuffle[i:i+n_batch]
batch_ys = y_shuffle[i:i+n_batch]
batch_xs = sess.run(x_input_a,
feed_dict={x_input: batch_xs,
y_input: batch_ys})
batch_ys = np.concatenate([batch_ys,
batch_ys],axis=0)
sess.run(train_step,
feed_dict={x_input: batch_xs,
y_input: batch_ys})
lv = sess.run(loss,
feed_dict={x_input: x_train,
y_input: y_train})
lvalues.append(lv)
if (i_epoch % 1000) - 999 == 0:
print("     epoch {: 5.0f} loss {: 5.4f}  {:3.0f}sec".format(
i_epoch,lv,time.time() - start))
start = time.time()
return(sess,lvalues)


## Training multiple models¶

Here, I consider two training setting:

It seems that the training without adversarial examples takes 2/3 less times than with adversarial examples.

In [5]:
result_NLL, result_NLL_adv = [], []
n_hs = [500,300,100]
eps = 0.05
Nensemble = 5
n_epochs = 5000
for iens in range(Nensemble):
tensors = define_model(n_feature=1,
n_hs=n_hs,
n_output = 1,
eps=eps,
verbose=False)

sess, lvalues = train(x_train,
y_train,
tensors,
n_epochs = n_epochs,

print("Regular training: {}th ensemble".format(iens))
tensors = define_model(n_feature=1,
n_hs=n_hs,
n_output = 1,
eps=eps,
verbose=False)

sess, lvalues = train(x_train,
y_train,
tensors,
n_epochs = n_epochs,

result_NLL.append((sess, lvalues, tensors))

Adversarial training: 0th ensemble
epoch   999 loss  0.1921   34sec
epoch  1999 loss -0.5653   33sec
epoch  2999 loss -0.9877   34sec
epoch  3999 loss -1.1007   35sec
epoch  4999 loss -1.1928   34sec
Regular training: 0th ensemble
epoch   999 loss  0.0817   19sec
epoch  1999 loss -0.3772   19sec
epoch  2999 loss -1.1049   19sec
epoch  3999 loss -1.1557   18sec
epoch  4999 loss -1.3086   19sec
epoch   999 loss  0.1046   34sec
epoch  1999 loss -0.3785   34sec
epoch  2999 loss -0.8608   34sec
epoch  3999 loss -1.1535   34sec
epoch  4999 loss -1.2078   34sec
Regular training: 1th ensemble
epoch   999 loss  0.0923   19sec
epoch  1999 loss -0.6870   19sec
epoch  2999 loss -1.2025   19sec
epoch  3999 loss -1.2581   19sec
epoch  4999 loss -1.3139   18sec
epoch   999 loss  0.1234   34sec
epoch  1999 loss -0.5084   34sec
epoch  2999 loss -0.8654   33sec
epoch  3999 loss -1.1636   34sec
epoch  4999 loss -1.2155   33sec
Regular training: 2th ensemble
epoch   999 loss  0.0873   19sec
epoch  1999 loss -0.3271   19sec
epoch  2999 loss -0.9850   18sec
epoch  3999 loss -1.2101   19sec
epoch  4999 loss -1.3164   17sec
epoch   999 loss  0.1084   34sec
epoch  1999 loss -0.2542   34sec
epoch  2999 loss -1.0142   34sec
epoch  3999 loss -1.1467   34sec
epoch  4999 loss -1.2141   33sec
Regular training: 3th ensemble
epoch   999 loss  0.0915   19sec
epoch  1999 loss -0.3137   19sec
epoch  2999 loss -1.1367   19sec
epoch  3999 loss -1.3088   19sec
epoch  4999 loss -1.2444   19sec
epoch   999 loss  0.1462   34sec
epoch  1999 loss -0.6342   34sec
epoch  2999 loss -1.1118   34sec
epoch  3999 loss -1.1936   34sec
epoch  4999 loss -1.1991   34sec
Regular training: 4th ensemble
epoch   999 loss  0.0635   19sec
epoch  1999 loss -0.6976   18sec
epoch  2999 loss -1.1754   18sec
epoch  3999 loss -1.2974   19sec
epoch  4999 loss -1.3013   19sec


## Prediction on testing data¶

Here I estimate the mean $\mu_m (X)$ and variance $\sigma_m^2(X)$ of each model.

In [6]:
x_test = np.arange(-2.5,2.5,inc).reshape(-1,1)
y_test = sinfun(x_test,noise=0)

y_mean_pred, y_sigma_pred = [], []
(sess, lvalues, tensors) = mod

(inputs, _ , _ , y, _) = tensors
(x_input, _ , _ ) = inputs
(y_mean, y_sigma) = y

y_mean_pred.append(sess.run(y_mean,feed_dict={x_input: x_test}))
y_sigma_pred.append(sess.run(y_sigma,feed_dict={x_input: x_test}))
y_mean_pred = np.array(y_mean_pred).squeeze()
y_sigma_pred = np.array(y_sigma_pred).squeeze()
return(y_mean_pred,y_sigma_pred)

prediction_     = get_predicted_values(x_test,result_NLL)


From mean $\mu_m (X)$ and variance $\sigma_m^2(X)$, I will estimate the aggregate estimates:
\begin{array}{rcll} \mu_{*} (X) &=& \frac{1}{M}\sum_{m=1}^M \mu_m (X)\\ \sigma_{*}^2 (X) &=& \frac{1}{M} \sum_{m=1}^M \left( \sigma_m^2(X) + \mu_m (X)^2 \right) - \mu_{*} (X)^2\\ \end{array}

In [7]:
def get_marginal_var(y_mean,y_sigma):
'''
y_mean  : M by N
y_sigma : M by N
'''

y_mean_mar = np.mean(y_mean,axis=0) ## N by 1
y_sigma_mar = np.mean(y_mean**2 + y_sigma,axis=0) - y_mean_mar**2
return(y_mean_mar,y_sigma_mar)

y_mean_   , y_sigma_    = get_marginal_var(*prediction_)


## Visualize the model performance¶

#### Results¶

• Ensemble is neccesary to have a sensible error bound. With only single model, the variance seems too small.
• Adversarial samples seem not to affect the error bound and estiamtes much for this simple data analysis.
In [8]:
def plot(x_train,y_train,prediction_adv,iensemble=3):
'''
'''
def plotdata(axs):
axs.scatter(x_train,y_train,s=0.5, alpha=0.5)
def ploterrorbar(x_test,y_test,y_sigma):
axs.plot(x_test,y_test)
axs.errorbar(x_test, y_test,
yerr=1.96*np.sqrt(y_sigma),
capsize=0,alpha=0.1,
label="estimated_y +- 1.96*sigma")

fig = plt.figure(figsize=(15,15))

## Plot training data with estimated y from each model
plotdata(axs)
axs.plot(x_test,y_pre,label=" Ensemble {:}".format(i))
axs.set_title("Estimated y from each ensemble model")
axs.legend()

## Plot estimated sigma
axs.plot(x_test,y_sig,label=" Ensemble {:}".format(i))
axs.set_title("Estimated sigma from each ensemble model")

## Plot a single estimated y with error bar
plotdata(axs)
ploterrorbar(x_test,
axs.set_title("Estimated y with estimated sigma from one of the ensemble models")

## Plot
plotdata(axs)
ploterrorbar(x_test,
axs.set_title("Aggregate Prediction")

plt.legend()
plt.show()
const = 80
iensemble = 1
print("~"*const)
print("Visualization of results with adversarial training")
print("~"*const)
print("~"*const)
print("Visualization of results without adversarial training")
print("~"*const)
plot(x_train,y_train,prediction_   ,iensemble=iensemble)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Visualization of results with adversarial training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Visualization of results without adversarial training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~