Gaussian Process

In this blog post, I would like to review the traditional Gaussian process modeling. This blog was motivated by the blog post Fitting Gaussian Process Models in Python by Christ at Domino which explains the basic of Gaussian process modeling.

When I was reading his blog post, I felt that some mathemtatical details are missing. Therefore, I am writing this blog to digest his blog post.

In [1]:

import numpy as np
import matplotlib.pylab as plt
import scipy, os
from scipy.stats import norm

import warnings 
warnings.filterwarnings("ignore")

print "scipy={}v, numpy={}v, matplotlib={}v".format(scipy.__version__,np.__version__,plt.__version__)

scipy=1.0.0v, numpy=1.14.0v, matplotlib=1.14.0v

Gaussian model assumption¶

Say you have a pair of data $(x_i, y_i),i=1,\cdots,n$ and you want to estimate the future value of $y_m$ for given features $x_m$, Gaussian process may be useful just like any other regression model e.g. linear regression, support vector machine or neural network model etc. One of the unique characteristics of Gaussian model is that it can give you the closed form expression for the distribution of the future target variable $y_m$ given observed data, AKA the posterior predictive distribution. I will get to this point later, but let's formulate the Gaussian process.

Let's get notation straight. I use bold symbol to express a vector.

$\boldsymbol{x}_n = [x_1,\cdots,x_n]$
$\boldsymbol{y}_n = [y_1,\cdots,y_n]$

Just like linear regression model, the Gaussian process assumes that your target variable $y_i$ can be modelled to have a mean $\mu_i$ and some unknown variance $\sigma^2$.

$$ y_i \overset{ind.}{\sim} N(\mu_i, \sigma^2) $$

We can express the same distribution in vector form using multivariate normal distribution.

$$ \boldsymbol{y}_n \sim N(\boldsymbol{\mu}_n, \sigma^2 \boldsymbol{I}) $$

The mean $\mu_i$ is modeled with a $x_i$. If this were simple linear regression model, we would assume that $\mu_i$ was modeled as a linear function of $x_i$ and it has unknown weights that linearly link $x_i$ and $\mu_i$ as $\mu_i = x_i^T w + b_0$. The unknown weights can be estimated by minimizing mean square error, or in other words, maximizing the likelihood function.

In Gaussian process, the mean $\mu_i$ is modeled with a $x_i$ but functional form of $\mu_i = f(x_i)$ is not known. It could be a linear function of $x_i$ or cubic function of $x_i$. Gaussian process allows us to keep it general. To model our uncertainty in the form of $\mu_i=f(x_i)$, Gaussian process assume that $\mu_i$ is random, and from a multivariate normal with mean $\boldsymbol{0}$ and covariance matrix $\boldsymbol{K}(\boldsymbol{x})$:

$$ \boldsymbol{\mu}_n | \boldsymbol{x}_n \sim N(\boldsymbol{0}, \boldsymbol{K}(\boldsymbol{x}_n)) $$

Kernel for covariance matrix¶

The covariance matrix is modeled with the feature vector $\boldsymbol{x}$: $$ \boldsymbol{K}(\boldsymbol{x}_n) = \begin{bmatrix} k(x_1,x_1)&\cdots&\cdots\\ k(x_1,x_2)&k;(x_2,x_2)&\cdots\\ k(x_1,x_3)&k;(x_2,x_3)&k;(x_3,x_3)\\ \vdots\\ \end{bmatrix} $$

Here, dependence between $\mu_i$ and $\mu_j$ are modelled with features $x_i$ and $x_j$ through kernel $k$, and I need to give functional form to kernel. For example, I can use the squared exponential covariance function. This kernel is also called as the Radial Basis Function kernel or the Gaussian kernel. $$ k(x,x') = \theta_1 \exp \left( -\frac{ \theta_2 }{ 2 } (x-x')^2 \right) $$ Here, $0 < k(x,x') <\theta_1$, and this kernel is infinitely differentiable with respect to $\theta_1$ and $\theta_2$. In this kernel, $\theta_1>0$ and $\theta_2>0$ are parameters. What are their intepretations?

When $x = x'$, $k(x,x') = \theta_1$. So $\theta_1=Var(y)$, so $\theta_1$ determines the marginal variability of $y$. If you are standardizing the training data before fitting, it may be reasonable to assume that $\theta_1=1$.
if $\theta_2$ is large then $k(x,x')$ are very small regardless of the distance between x and x' indicating that the correlation depends only on the distance between x and x'.

There are a lot of good discussion of the choice of kernel by David Kristjanson Duvenaud from CS department of University of Tronto. The presentation slides from Tutorial: Gaussian process models for machine learning was very comprehensive.

Marginal likelihood¶

That's all assumptions for the Gaussian model.

Gaussian model is really just a hiearchical model with two layers of ditribution assumption: $\boldsymbol{y}_n | \boldsymbol{\mu}_n$ and $\boldsymbol{\mu}_n|\boldsymbol{x}_n$. The distribution $\boldsymbol{y}_n|\boldsymbol{x}_n$ is called marginal likelihood because it can be obtained by integrating out $\boldsymbol{\mu}_n$. In the derivation below, I drop the subscript $n$ to avoid cluttered notations.

$$ p(\boldsymbol{y}|\boldsymbol{x}) \\ %% = \int p(\boldsymbol{y},\boldsymbol{\mu}|\boldsymbol{x}) d \boldsymbol{\mu} \textrm{ by total low of probability}\\ %% = \int p(\boldsymbol{y} | \boldsymbol{\mu})p(\boldsymbol{\mu}|\boldsymbol{x})d \boldsymbol{\mu} \textrm{ because $\boldsymbol{y}$ is independent of $\boldsymbol{x}$ given $\boldsymbol{\mu}$}\\ %% \propto \int exp \left( -\frac{1}{2\sigma^2}( \boldsymbol{y} - \boldsymbol{\mu})^T ( \boldsymbol{y} - \boldsymbol{\mu}) \right) exp \left( -\frac{1}{2}\boldsymbol{\mu}^T \boldsymbol{K}^{-1} \boldsymbol{\mu} \right) d\boldsymbol{\mu}\\ %% \propto exp \left( -\frac{1}{2}\boldsymbol{\mu}^T \left(\boldsymbol{K} + \sigma^2\boldsymbol{I})^{-1} \boldsymbol{\mu} \right) \right) $$

Since the final equation is proportional to the multivariate normal density with mean 0 and variance $\boldsymbol{K}(\boldsymbol{x}) + \sigma^2 \boldsymbol{I}$, after standardizing the final line to have area under the density = 1, we have: $$ \boldsymbol{y}_n | \boldsymbol{x}_n \sim N (\boldsymbol{0},\boldsymbol{K}(\boldsymbol{x}_n) + \sigma^2 \boldsymbol{I}) $$

The parameterers, $\theta_1$, $\theta_2$ and $\sigma^2$ are some unknown parameters that we need to estimate using available data. I will discuss the maximum likelihood procedure later.

In [2]:

def nlog_pdf_multivariate_gauss(x, mu, cov):
    '''
    Caculate the multivariate normal density (pdf)
    x  : n by 1
    mu : n by 1
    cov: n by n
    '''
    mu = np.array(mu).reshape(-1,1) # [n by 1] 
    x = np.array(x).reshape(-1,1)   # [n by 1] 
    term1 = - np.log( ((2.0* np.pi)**(len(mu)/2.0)) * (np.linalg.det(cov)**(1/2.0)) )
    term2 = (-1/2.0) * ((x - mu).T.dot(np.linalg.inv(cov))).dot((x-mu))
    term = float(term1 +  term2)
    
    return - np.sum(term)

Let's check the calculation is right for 1 dimentional normal scenario¶

The assert statement return error message if the hand-calculated density does not agree to the negative of the log density from stats.norm.

In [3]:

cov = np.array([[1.2]])
x   = np.array([3])
mu  = np.array([0])
nldf = nlog_pdf_multivariate_gauss(x   = x, 
                                 mu  = mu,
                                 cov = cov)
print(" 1/({sigma2:} * np.sqrt(2 * np.pi * {sigma2:}  )) * np.exp( - ({x:} - {mu:})**2 / (2.0 * ({sigma2:})) ) = {ldens:}".format(
        x = x[0] , mu = mu[0] , sigma2 = cov[0][0], ldens= nldf ))
assert np.abs(nldf + scipy.stats.norm.logpdf(x=x[0],scale=np.sqrt(cov[0][0]))) < 1E-10

 1/(1.2 * np.sqrt(2 * np.pi * 1.2  )) * np.exp( - (3 - 0)**2 / (2.0 * (1.2)) ) = 4.7600993116

Define the squared exponential covariance function¶

In [4]:

from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform

def exponential_cov(eucl_dist, para):
    '''
    the core of the exponential covariance defenition
    '''
    assert len(para) == 2, len(para)
    return para[0] * np.exp( -0.5 * para[1] *  eucl_dist **2)

def exponential_cov_mat(x,para):
    '''
    return matrix
    '''
    eucl_dist = squareform(pdist(x))
    return(exponential_cov(eucl_dist, para))

def exponential_cov_pair(x1,x2,para):
    '''
    return scalar
    '''
    x1, x2 = np.array(x1).flatten(), np.array(x2).flatten()
    eucl_dist = np.subtract.outer(x1, x2)
    return(exponential_cov(eucl_dist, para))

Let's see example values of the squared exponential covariance function¶

Looks fine!

In [5]:

para = [1,0.2]
x1s = [1,2,3,5]
x2s = [13,10,92,45]
for x1 , x2 in zip(x1s,x2s):
    ex = exponential_cov_pair(x1,x2,para)
    print( " {theta1:}*exp(-0.5*{theta2:}*({x1:}-{x2:})**2)={value:}".format(
        theta1=para[0],theta2=para[1],
        x1=x1,x2=x2,value=ex))

 1*exp(-0.5*0.2*(1-13)**2)=[[5.57390369e-07]]
 1*exp(-0.5*0.2*(2-10)**2)=[[0.00166156]]
 1*exp(-0.5*0.2*(3-92)**2)=[[0.]]
 1*exp(-0.5*0.2*(5-45)**2)=[[3.25748853e-70]]

Posterior predictive distribution of $y_m | \boldsymbol{y}_n$¶

So what do we get by assuming Gaussian distribution over $Y_1,Y_2,...$? We can get the distribution for the future $y_m$ given the available data $y_1,...,y_n$.

The predictive distribution is the distribution of the target variable given the observed data $y_m |data$ and it is used for inference in Bayesian statistics. Let $n$ be the observed data size (training data size) and consider the scenario where I have $data = \boldsymbol{y}_{n} \in R^n$ observed data.

Under the model assumption, we have a closed form expression for the posterior distribution of $y_m \in R^1$ is:

$$ \left. y_m \right| \boldsymbol{y}_n \sim N_1 \left( \Sigma_{m,n}\Sigma_{n,n}^{-1} \boldsymbol{y}_n, k(x_m,x_m) + \sigma^2- \Sigma_{m,n}\Sigma_{n,n}^{-1} \Sigma_{m,n}^T \right) $$

$$ \Sigma_{m,n} = \begin{bmatrix} k(x_1,x_m)\\ k(x_2,x_m)\\ k(x_3,x_m)\\ \vdots\\ k(x_n,x_m)\\ \end{bmatrix}^T $$ $$ \Sigma_{n,n} = \begin{bmatrix} k(x_1,x_1)+\sigma^2&\cdots&\cdots&\cdots&k;(x_n,x_1)\\ k(x_2,x_1)&k;(x_2,x_2)+\sigma^2&\cdots&\cdots&k;(x_n,x_2)\\ k(x_3,x_1)&k;(x_3,x_2)&k;(x_3,x_3)+\sigma^2&\cdots&k;(x_3,x_n)\\ \vdots\\ k(x_n,x_1)&k;(x_n,x_2)&k;(x_n,x_3)&\cdots&k;(x_n,x_n)+\sigma^2\\ \end{bmatrix} $$

Some observations¶

The Gaussian process becomes computationally very expensive to evaluate $E(y_m | \boldsymbol{y}_n)$, because we need to take an inverse of $\Sigma_{n,n}$ and this dimension depends on the observed data size. Imagine if $n =$ 1 million, how can we take an inverse of such function ?! Does this mean that the Gaussian process is not very scalable?

Posterior predictive mean¶

In Baysian statistics, the conditional mean of $y_m$, $E(y_m | \boldsymbol{y}_n)$, is called posterior predictive mean.

Let's study the property of this posterior predictive mean. Consider the scenario when the training data only have a single $(y_1,x_1)$ so $n=1$, and $(y_m, x_m)$ happened to have $x_1 = x_m$. Then you might expect that the posterior predicitve mean of $y_m$ becomes $y_1$. This is not the case!

$$ E(y_m|\boldsymbol{y}_n) = k(x_1,x_m') (k(x_1,x_1)+\sigma^2)^{-1} y_1 = \frac{\theta_1}{ \theta_1+\sigma^2} y_1\\ $$

Notice that $E(y_m|\boldsymbol{y}_n)=y_1$ only if $\sigma^2=0$. Remind you that $\sigma^2 = Var(y_i|\mu_i)$.
Nonzero $\sigma^2$ means that knowledge of $\mu_i$ is not quite enough to know the value of $y_i$ and this is because $\mu_i$ is also a rondom quantity. Since this random quantity $\mu_i$ has a prior mean at 0, your posterior predictive mean of $y_i$ is pulled toward 0 when $\sigma^2$ is large relative to $\theta_1$.

If you feel that $(y_m,x_m)$ should be the same as $(y_1,x_1)$, when $x_1= x_m$, you are essntially assuming that

$\sigma^2 = 0$
no noise in the training data and $x_1$ can completely explain $y_1$.
Your model is reduced to: $ \boldsymbol{y} | \boldsymbol{x} \sim N(\boldsymbol{0},\boldsymbol{K})$.

Generate toy data¶

We have enough theories and let's look at examples to understand the theories!

I will first genrate data $y_1,.....,y_{15}$ from Gaussian distribution.

In [6]:

import numpy as np 
np.random.seed(1)
para_true = [1,0.01,0.25] ## theta1, theta2, sigma^2

N = 15
means = np.array([0]*N)
xmin, xmax = -100, 100


observed_x = np.random.uniform(xmin, xmax, N).reshape(-1,1)
cov = exponential_cov_mat(observed_x,para_true[:2])
cov[range(cov.shape[0]),range(cov.shape[0])] += para_true[2] 


observed_y = np.random.multivariate_normal(means,cov,1).reshape(-1,1)
observed_data = (observed_x, observed_y)

I will use scipy.optimizer.minimize for finding the maximum likelihood estimates. In pratice, we do not need to define Gaussian likelihood by hand to find the MLE for the model. For example, one can use

However, for the sake of understanding I will use the optimizer routine and directly minimize the negative of log-likelihood to obtain MLE.

In [7]:

from scipy.stats import multivariate_normal as mn
from scipy.optimize import minimize
def nllk_gaussian(para, y, x,verbose=False, kernel_mat=exponential_cov_mat, gauss_mean=0):
    '''
    calculate negative log likelihood of Gaussian model
    para = [log(theta2),log(sigma^2)]
    '''
    if ~hasattr(gauss_mean,"__len__"): ## is this scalar ? 
        gauss_mean = [gauss_mean]*len(y)

    para = np.array([0] + list(para))
    para = np.exp(para)
    assert len(para) == 3
    assert len(y) == len(x)
    assert len(y) == len(gauss_mean)
    
    cov = kernel_mat(x.reshape(-1,1),para[:2])
    cov[range(cov.shape[0]),
         range(cov.shape[1])] +=  para[2] ## add sigma^2 for 
    nllk = nlog_pdf_multivariate_gauss(y, gauss_mean, cov)
        
    if verbose:
        print("theta1 = {:5.3f}, theta2 = {:5.3f}, sigma^2={:5.3f} -llk = {:6.5f}".format(para[0],para[1],para[2],nllk))
    if ~np.isfinite(nllk):
        nllk = np.inf
    return(nllk)

para0 = [np.log(1.0),np.log(1)] ## theta2 and sigma^2

res = minimize(fun = nllk_gaussian, 
               x0 = para0,
               args = (observed_y, observed_x,True),
               #method='BFGS',
               options={ 'disp': True, 'maxiter':100}
              )
para = [1] +list(np.exp(res["x"]))
print("The maximume likelihood estimates of theta1 = {:5.4f}, theta2 = {:5.3f}, sigma^2 = {:5.3f}".format(
    para[0],para[1],para[2]))

theta1 = 1.000, theta2 = 1.000, sigma^2=1.000 -llk = 23.64354
theta1 = 1.000, theta2 = 1.000, sigma^2=1.000 -llk = 23.64354
theta1 = 1.000, theta2 = 1.000, sigma^2=1.000 -llk = 23.64354
theta1 = 1.000, theta2 = 1.000, sigma^2=1.000 -llk = 23.64354
theta1 = 1.000, theta2 = 0.982, sigma^2=0.364 -llk = 22.91802
theta1 = 1.000, theta2 = 0.982, sigma^2=0.364 -llk = 22.91802
theta1 = 1.000, theta2 = 0.982, sigma^2=0.364 -llk = 22.91802
theta1 = 1.000, theta2 = 0.982, sigma^2=0.364 -llk = 22.91802
theta1 = 1.000, theta2 = 0.935, sigma^2=0.317 -llk = 22.89801
theta1 = 1.000, theta2 = 0.935, sigma^2=0.317 -llk = 22.89801
theta1 = 1.000, theta2 = 0.935, sigma^2=0.317 -llk = 22.89801
theta1 = 1.000, theta2 = 0.935, sigma^2=0.317 -llk = 22.89801
theta1 = 1.000, theta2 = 0.836, sigma^2=0.272 -llk = 22.88609
theta1 = 1.000, theta2 = 0.836, sigma^2=0.272 -llk = 22.88609
theta1 = 1.000, theta2 = 0.836, sigma^2=0.272 -llk = 22.88609
theta1 = 1.000, theta2 = 0.836, sigma^2=0.272 -llk = 22.88609
theta1 = 1.000, theta2 = 0.702, sigma^2=0.244 -llk = 22.87374
theta1 = 1.000, theta2 = 0.702, sigma^2=0.244 -llk = 22.87374
theta1 = 1.000, theta2 = 0.702, sigma^2=0.244 -llk = 22.87374
theta1 = 1.000, theta2 = 0.702, sigma^2=0.244 -llk = 22.87374
theta1 = 1.000, theta2 = 0.349, sigma^2=0.158 -llk = 22.66653
theta1 = 1.000, theta2 = 0.349, sigma^2=0.158 -llk = 22.66653
theta1 = 1.000, theta2 = 0.349, sigma^2=0.158 -llk = 22.66653
theta1 = 1.000, theta2 = 0.349, sigma^2=0.158 -llk = 22.66653
theta1 = 1.000, theta2 = 0.021, sigma^2=0.028 -llk = 22.93146
theta1 = 1.000, theta2 = 0.021, sigma^2=0.028 -llk = 22.93146
theta1 = 1.000, theta2 = 0.021, sigma^2=0.028 -llk = 22.93146
theta1 = 1.000, theta2 = 0.021, sigma^2=0.028 -llk = 22.93146
theta1 = 1.000, theta2 = 0.081, sigma^2=0.064 -llk = 21.69681
theta1 = 1.000, theta2 = 0.081, sigma^2=0.064 -llk = 21.69681
theta1 = 1.000, theta2 = 0.081, sigma^2=0.064 -llk = 21.69681
theta1 = 1.000, theta2 = 0.081, sigma^2=0.064 -llk = 21.69681
theta1 = 1.000, theta2 = 0.068, sigma^2=0.057 -llk = 21.67129
theta1 = 1.000, theta2 = 0.068, sigma^2=0.057 -llk = 21.67129
theta1 = 1.000, theta2 = 0.068, sigma^2=0.057 -llk = 21.67129
theta1 = 1.000, theta2 = 0.068, sigma^2=0.057 -llk = 21.67129
theta1 = 1.000, theta2 = 0.000, sigma^2=0.000 -llk = 4070932991425.09766
theta1 = 1.000, theta2 = 0.000, sigma^2=0.000 -llk = 4070932991425.09766
theta1 = 1.000, theta2 = 0.000, sigma^2=0.000 -llk = 4070932991425.09766
theta1 = 1.000, theta2 = 0.000, sigma^2=0.000 -llk = 4070932991425.09766
theta1 = 1.000, theta2 = 0.068, sigma^2=0.057 -llk = 21.67129
theta1 = 1.000, theta2 = 0.068, sigma^2=0.057 -llk = 21.67129
theta1 = 1.000, theta2 = 0.068, sigma^2=0.057 -llk = 21.67129
theta1 = 1.000, theta2 = 0.068, sigma^2=0.057 -llk = 21.67129
theta1 = 1.000, theta2 = 0.068, sigma^2=0.057 -llk = 21.67129
theta1 = 1.000, theta2 = 0.068, sigma^2=0.057 -llk = 21.67129
theta1 = 1.000, theta2 = 0.068, sigma^2=0.057 -llk = 21.67129
theta1 = 1.000, theta2 = 0.068, sigma^2=0.057 -llk = 21.67129
theta1 = 1.000, theta2 = 0.000, sigma^2=0.000 -llk = 25540347.18033
theta1 = 1.000, theta2 = 0.000, sigma^2=0.000 -llk = 25540347.18033
theta1 = 1.000, theta2 = 0.000, sigma^2=0.000 -llk = 25540347.18440
theta1 = 1.000, theta2 = 0.000, sigma^2=0.000 -llk = 25540346.80619
theta1 = 1.000, theta2 = 0.000, sigma^2=0.002 -llk = 2788.20045
theta1 = 1.000, theta2 = 0.000, sigma^2=0.002 -llk = 2788.20045
theta1 = 1.000, theta2 = 0.000, sigma^2=0.002 -llk = 2788.20043
theta1 = 1.000, theta2 = 0.000, sigma^2=0.002 -llk = 2788.20042
theta1 = 1.000, theta2 = 0.014, sigma^2=0.022 -llk = 24.86537
theta1 = 1.000, theta2 = 0.014, sigma^2=0.022 -llk = 24.86537
theta1 = 1.000, theta2 = 0.014, sigma^2=0.022 -llk = 24.86537
theta1 = 1.000, theta2 = 0.014, sigma^2=0.022 -llk = 24.86537
theta1 = 1.000, theta2 = 0.054, sigma^2=0.050 -llk = 21.69128
theta1 = 1.000, theta2 = 0.054, sigma^2=0.050 -llk = 21.69128
theta1 = 1.000, theta2 = 0.054, sigma^2=0.050 -llk = 21.69128
theta1 = 1.000, theta2 = 0.054, sigma^2=0.050 -llk = 21.69128
theta1 = 1.000, theta2 = 0.065, sigma^2=0.056 -llk = 21.66965
theta1 = 1.000, theta2 = 0.065, sigma^2=0.056 -llk = 21.66965
theta1 = 1.000, theta2 = 0.065, sigma^2=0.056 -llk = 21.66965
theta1 = 1.000, theta2 = 0.065, sigma^2=0.056 -llk = 21.66965
theta1 = 1.000, theta2 = 0.065, sigma^2=0.056 -llk = 21.66634
theta1 = 1.000, theta2 = 0.065, sigma^2=0.056 -llk = 21.66634
theta1 = 1.000, theta2 = 0.065, sigma^2=0.056 -llk = 21.66634
theta1 = 1.000, theta2 = 0.065, sigma^2=0.056 -llk = 21.66634
theta1 = 1.000, theta2 = 0.064, sigma^2=0.056 -llk = 21.65312
theta1 = 1.000, theta2 = 0.064, sigma^2=0.056 -llk = 21.65312
theta1 = 1.000, theta2 = 0.064, sigma^2=0.056 -llk = 21.65312
theta1 = 1.000, theta2 = 0.064, sigma^2=0.056 -llk = 21.65312
theta1 = 1.000, theta2 = 0.061, sigma^2=0.057 -llk = 21.60055
theta1 = 1.000, theta2 = 0.061, sigma^2=0.057 -llk = 21.60055
theta1 = 1.000, theta2 = 0.061, sigma^2=0.057 -llk = 21.60055
theta1 = 1.000, theta2 = 0.061, sigma^2=0.057 -llk = 21.60055
theta1 = 1.000, theta2 = 0.050, sigma^2=0.061 -llk = 21.39528
theta1 = 1.000, theta2 = 0.050, sigma^2=0.061 -llk = 21.39528
theta1 = 1.000, theta2 = 0.050, sigma^2=0.061 -llk = 21.39528
theta1 = 1.000, theta2 = 0.050, sigma^2=0.061 -llk = 21.39528
theta1 = 1.000, theta2 = 0.022, sigma^2=0.082 -llk = 20.68440
theta1 = 1.000, theta2 = 0.022, sigma^2=0.082 -llk = 20.68440
theta1 = 1.000, theta2 = 0.022, sigma^2=0.082 -llk = 20.68440
theta1 = 1.000, theta2 = 0.022, sigma^2=0.082 -llk = 20.68440
theta1 = 1.000, theta2 = 0.004, sigma^2=0.333 -llk = 19.45677
theta1 = 1.000, theta2 = 0.004, sigma^2=0.333 -llk = 19.45677
theta1 = 1.000, theta2 = 0.004, sigma^2=0.333 -llk = 19.45677
theta1 = 1.000, theta2 = 0.004, sigma^2=0.333 -llk = 19.45677
theta1 = 1.000, theta2 = 0.008, sigma^2=0.203 -llk = 19.59351
theta1 = 1.000, theta2 = 0.008, sigma^2=0.203 -llk = 19.59351
theta1 = 1.000, theta2 = 0.008, sigma^2=0.203 -llk = 19.59351
theta1 = 1.000, theta2 = 0.008, sigma^2=0.203 -llk = 19.59351
theta1 = 1.000, theta2 = 0.005, sigma^2=0.283 -llk = 19.38598
theta1 = 1.000, theta2 = 0.005, sigma^2=0.283 -llk = 19.38598
theta1 = 1.000, theta2 = 0.005, sigma^2=0.283 -llk = 19.38598
theta1 = 1.000, theta2 = 0.005, sigma^2=0.283 -llk = 19.38598
theta1 = 1.000, theta2 = 0.004, sigma^2=0.256 -llk = 19.38758
theta1 = 1.000, theta2 = 0.004, sigma^2=0.256 -llk = 19.38758
theta1 = 1.000, theta2 = 0.004, sigma^2=0.256 -llk = 19.38758
theta1 = 1.000, theta2 = 0.004, sigma^2=0.256 -llk = 19.38758
theta1 = 1.000, theta2 = 0.005, sigma^2=0.270 -llk = 19.37982
theta1 = 1.000, theta2 = 0.005, sigma^2=0.270 -llk = 19.37982
theta1 = 1.000, theta2 = 0.005, sigma^2=0.270 -llk = 19.37982
theta1 = 1.000, theta2 = 0.005, sigma^2=0.270 -llk = 19.37982
theta1 = 1.000, theta2 = 0.005, sigma^2=0.270 -llk = 19.37981
theta1 = 1.000, theta2 = 0.005, sigma^2=0.270 -llk = 19.37981
theta1 = 1.000, theta2 = 0.005, sigma^2=0.270 -llk = 19.37981
theta1 = 1.000, theta2 = 0.005, sigma^2=0.270 -llk = 19.37981
theta1 = 1.000, theta2 = 0.005, sigma^2=0.270 -llk = 19.37981
theta1 = 1.000, theta2 = 0.005, sigma^2=0.270 -llk = 19.37981
theta1 = 1.000, theta2 = 0.005, sigma^2=0.270 -llk = 19.37981
theta1 = 1.000, theta2 = 0.005, sigma^2=0.270 -llk = 19.37981
Optimization terminated successfully.
         Current function value: 19.379811
         Iterations: 11
         Function evaluations: 116
         Gradient evaluations: 29
The maximume likelihood estimates of theta1 = 1.0000, theta2 = 0.005, sigma^2 = 0.270

The negative log likelihood of the multivariate normal distribution evaluated at the true parameters¶

Because of the sampling error, the maximum likelihood estimates (MLE) are not completely the same as the true parameters. In fact, the lowest negative likelihood values evaluated at the true parameters are not the smallest. In theory, the sampling error will average out and when the sample size goes to infinity, the MLE should converge to the true parameter values.

In [8]:

nllk_gaussian(np.log(para_true[1:]),y=observed_y,x=observed_x,verbose=True)

theta1 = 1.000, theta2 = 0.010, sigma^2=0.250 -llk = 19.67072

Out[8]:

19.67071970911342

Using the MLE, let's illustrate the model's performance.¶

In [9]:

def predict(x, kernel_pair, para, observed_sigma, observed_data):
    '''
    calcualte the expected value of y | x, data
    m = 1 as this function predict E(y|x,data) for a scalar y 
    x must be scalar
    
    para = [theta1, theta2, sigma^2]
    '''
    if hasattr(x,"__len__"):
        x = x[0]
        
    (observed_x,observed_y) = observed_data 
    observed_y = np.array(observed_y).reshape(-1,1) ## [n by 1] 
    
    Smn = kernel_pair(x,observed_x,para[:2]) ## [m by n]
    Sinv = np.linalg.inv(observed_sigma)  ## [n by n]
    y_pred = np.dot(Smn, Sinv).dot(observed_y) ## [m by n] [n by n] by [n by 1] = [m by 1]
    ## variance 
    sigma_new = kernel_pair(x, x, para[:2]) + para[2] - np.dot(Smn, Sinv).dot(Smn.T)
    return y_pred, sigma_new



def get_predictive_y(newxs,kernel_pair,para,observed_data):
    '''
     evaluate the estimated E(y_i|x_i,data) and SD(y_i|x_i,data)
     at every i = 1, 2, 3,... in newx    
    '''
    (observed_x,observed_y) = observed_data 
    Snn = exponential_cov_mat(observed_x,para[:2])
    Snn[range(Snn.shape[0]),
        range(Snn.shape[1])] += para[2]
    y_preds, y_preds_sigma = [], []
    for newx in newxs:
        pr = predict(newx, kernel_pair, para, Snn, observed_data) 
        y_preds.append(pr[0])
        y_preds_sigma.append(pr[1])
    return(y_preds, y_preds_sigma)

Visualization of Gaussian process¶

In [10]:

## create a function for plotting

def plotGauss(ax, newxs, 
              y_preds, y_preds_sigma,
              observed_data, 
              title="noise model"
              ):
    (ox,oy) = observed_data
    ax.errorbar(newxs, y_preds, yerr=np.sqrt(y_preds_sigma), capsize=0)
    count = 0
    for x, y in zip(ox, oy):
        if count == 0:
            ax.plot(x, y, "ro",label="observed x,y")
            count += 1
        else:
            ax.plot(x, y, "ro")
    ax.set_ylim(-3,3)
    ax.set_xlabel("x", fontsize=30)
    ax.set_ylabel("y", fontsize=30)
    ax.set_title(title + " training data size: n={}".format(len(ox)),fontsize=30)
    ax.legend(fontsize=30)

In [11]:

dir_image = "./GP_images/"
from copy import copy

kernel_pair=exponential_cov_pair

newxs = np.linspace(xmin*1.5 , xmax*1.5 , 500)
para_no_noise = copy(para)
para_no_noise[2] = 0
for n in range(2,len(observed_x),1):

    observed_data = (observed_x[:n],observed_y[:n])
    y_preds, y_preds_sigma = get_predictive_y(newxs,kernel_pair,
                                              para,observed_data)
    
    y_preds0, y_preds_sigma0 = get_predictive_y(newxs,kernel_pair,
                                              para_no_noise,observed_data)
    
    
    fig = plt.figure(figsize=(20,10))
    fig.subplots_adjust(hspace=0,wspace=0)
    ax = fig.add_subplot(2,1,1)
    plotGauss(ax, newxs, 
              y_preds, y_preds_sigma,
              observed_data, 
              title = "GP regrssion with GP noise")
    
    ax = fig.add_subplot(2,1,2)
    plotGauss(ax, newxs, 
              y_preds0, y_preds_sigma0,
              observed_data, 
              title = "GP regression without noise (sigma2=0)"
              )
    
    plt.savefig(dir_image + "/fig{:04.0f}.png".format(n),
                bbox_inches='tight',pad_inches=0)
    if n % 5 == 4:
        plt.show()
    plt.close('all')

In [12]:

def create_gif(gifname,dir_image,duration=1):
    import imageio
    filenames = np.sort(os.listdir(dir_image))
    filenames = [ fnm for fnm in filenames if ".png" in fnm]

    with imageio.get_writer(dir_image + '/' + gifname + '.gif', 
                            mode='I',duration=duration) as writer:
        for filename in filenames:
            image = imageio.imread(dir_image + filename)
            writer.append_data(image)
create_gif("GP",dir_image,duration=0.5)

Use existing scikit learn routine for Gussian process fitting¶

In practice, people do not write their own likelihood to find parameters. For python user, scikit learn has nice Gaussian regrssion model API!

Let's try to fit the same model using scikit learn Gaussian modeling API.

In [13]:

from sklearn.gaussian_process import GaussianProcessRegressor

This Gaussian process regressor takes one argument, kernel. This kernel defines the covariance structure for the marginal likelihood $\boldsymbol{y}_n|\boldsymbol{x}_n$.

Remind you that the covariance for our marginal liklihood is: $$ Var(\boldsymbol{y}_n | \boldsymbol{x}_n) = \boldsymbol{K}(\boldsymbol{x}_n) + \boldsymbol{I}\sigma^2 $$ That is, the diagonal entires are: $$ Var(y_i | x_i) = \theta_1 + \sigma^2 $$ The off-diagonal entries are: $$ Cov(y_i,y_j | x_i,x_j) = \theta_1 exp\left( -\frac{1}{2} \theta_2 \left|\left| x_i - x_j \right|\right|^2 \right) $$ where $i\ne j$.

In my python script parameters are saved in para object as: $$ \texttt{para}=[\theta_1, \theta_2, \sigma^2] $$

In order to define this kernel in scikit learn API, we need to use following 3 modules.

In [14]:

from sklearn.gaussian_process.kernels import RBF,  ConstantKernel,WhiteKernel

The three kernels are defined as:

RBF kernel: $$ k_{\textrm{RBF}}(x_i, x_j) = exp \left( -\frac{1}{2} \left|\left| \frac{x_i}{\textrm{length_scale}} - \frac{x_j}{\textrm{length_scale}}\right|\right|^2 \right)\\ = exp \left( -\frac{1}{2\textrm{length_scale}^2} \left|\left| x_i -x_j\right|\right|^2 \right) $$ Constant kernel: $$ k_{\textrm{constant}}(x_i, x_j) = \theta $$

White-noise kernel: $$ k_{\textrm{white}}(x_i, x_j) = \sigma^2 \textrm{ if $x_i = x_j$ else 0} $$

So by adding and multipling these kernels, I can produce my kernel as:

$$ k(x_i, x_j) = k_{\textrm{constant}}(x_i, x_j) \textrm{ x } k_{\textrm{RBF}}(x_i, x_j) = \theta exp \left( -\frac{1}{2\textrm{length_scale}^2} \left|\left| x_i -x_j\right|\right|^2 \right) $$

The parameters in scikit learn API and the parameters in prevoius script are linked as:

$\texttt{para_true[0]} = \theta $
$\texttt{para_true[1]} = \frac{1}{\textrm{length_scale}^2}$
$\texttt{para_true[2]} = \sigma^2$

Let's fit the two Gaussian process model as before:

one, assuming noise in data and another without noise.

In [15]:

## specify the range of parameters 
kernel_with_noise = ( ConstantKernel(constant_value_bounds=(1e-2,2))*
                      RBF( length_scale_bounds=(5, 100.0))+
                      WhiteKernel(noise_level=1e-5, noise_level_bounds=(1e-10, 1)))

kernel0           = ( ConstantKernel(constant_value_bounds=(1e-2,2))* 
                      RBF( length_scale_bounds=(5, 100.0)))
    
gp_noise = GaussianProcessRegressor(
            kernel= kernel_with_noise,
            n_restarts_optimizer=10)
    
gp0 =  GaussianProcessRegressor(
        kernel= kernel0,
        n_restarts_optimizer=10)
gps = [gp_noise,gp0]

Fit the 2 Gaussian models to same observed_data and plot the results¶

Observation: the results of predicted distribution plots are more or less similar to the plots with MLEs previously obtaiend.

In [16]:

dir_image = "./GP_images_scikit/"
from copy import copy

kernel_pair=exponential_cov_pair

newxs = np.linspace(xmin*1.5 , xmax*1.5 , 500)
para_no_noise = copy(para)
para_no_noise[2] = 0
for n in range(2,len(observed_x),1):

    fig = plt.figure(figsize=(20,10))
    fig.subplots_adjust(hspace=0,wspace=0)
    
    observed_data = (observed_x[:n],observed_y[:n])
    for count, gp in enumerate(gps,1):
        ax = fig.add_subplot(len(gps),1,count)
        gp.fit(*observed_data)
        y_preds, y_preds_sigma = gp.predict(
            newxs.reshape(-1, 1),return_std=True)
        plotGauss(ax, newxs, 
                  y_preds, y_preds_sigma,
                  observed_data, 
                  title = "GP regrssion with GP noise")
        if n % 5 == 1:
            if count == 1:
                print("----\nEstimated parameters of Gaussian model:")
            print(gp.kernel_)
            
    plt.savefig(dir_image + "/fig{:04.0f}.png".format(n),
                bbox_inches='tight',pad_inches=0)
    if n % 5 == 4:
        plt.show()
        
    plt.close('all')
    
create_gif("GP",dir_image,duration=0.5)

----
Estimated parameters of Gaussian model:
1.09**2 * RBF(length_scale=16.5) + WhiteKernel(noise_level=9.94e-09)
1.09**2 * RBF(length_scale=16.5)

----
Estimated parameters of Gaussian model:
0.995**2 * RBF(length_scale=16.4) + WhiteKernel(noise_level=0.169)
1.41**2 * RBF(length_scale=5)

Yumi's Blog

Review on Gaussian process