Code and explanation are mostly based on Nando de Freitas's lectures, http://www.cs.ubc.ca/~nando/540-2013/lectures.html
import numpy as np
import matplotlib.pyplot as pl
%matplotlib inline
"Smoothness Prior"
$$ \mathcal D_{train} = \{ (\vec x^{(1)}, f^{(1)}), \dots, (\vec x^{(m)}, f^{(m)})\} $$
with training inputs $\vec x^{(i)} \in \mathbb R^n$ (collected as rows of the $m \times n$ design matrix $X$) and function values $f^{(i)} = f(\vec x^{(i)})$.
Goal:
Given a test set $X_*$ of size $m_* \times n$ we want to predict the (function) outputs $f_*$:
$$ \left(\begin{array}{c} \vec f \\ \vec f_* \end{array} \right) \sim \mathcal N \left(\left( \begin{array}{c} \vec \mu \\ \vec \mu_* \end{array} \right) , \left(\begin{array}{cc} K & K_* \\ K_*^T & K_{**} \end{array} \right) \right) $$
$$ \vec f^T = (f^{(1)}, f^{(2)}, \dots, f^{(m)}) $$ and analogously $\vec f_*^T$, $\vec \mu$ and $\vec \mu_*$,
with the kernel (Gram) matrices
$$ K = \kappa(X, X) \;\;(m \times m), \quad K_* = \kappa(X, X_*) \;\;(m \times m_*), \quad K_{**} = \kappa(X_*, X_*) \;\;(m_* \times m_*) $$
and the kernel $\kappa$, e.g. the squared exponential kernel
$$
\kappa (x, x') = \sigma_f^2 \exp \left( - \frac{(x-x')^2}{2 l^2}\right)
$$
def kernel(a, b, kernelParameter = 0.1, sigma_square = 1.):
""" GP squared exponential kernel
\kappa (x, x') = \sigma_f^2 \exp \left( - \frac{(x-x')^2}{2 l^2}\right)
Parameters
----------
a : numpy array (shape: m x n)
Design matrix: row index == data index; column index == feature index
b : numpy array (shape: m_* x n)
Design matrix: row index == data index; column index == feature index
kernelParameter: float
kernelParameter = l^2
sigma_square: float
\sigma_f^2
"""
# efficient implementation in numpy:
    # squared Euclidean distance between all row vectors of a and b.
sqdist = np.sum(a**2,1).reshape(-1,1) + np.sum(b**2,1) - 2*np.dot(a, b.T)
return sigma_square * np.exp(-.5 * (1/kernelParameter) * sqdist)
X = np.array([[1,2,3],[2,2,3],[3,2,3]])
X_ = np.array([[4,2,3],[2,2,4]])
kernel(X, X_, 2)
Let $\vec x_1$ and $\vec x_2$ be jointly Gaussian random vectors,
$$ \left( \begin{array}{c} \vec x_1 \\ \vec x_2 \end{array} \right) \sim \mathcal N \left( \left( \begin{array}{c} \vec \mu_1 \\ \vec \mu_2 \end{array} \right) , \left( \begin{array}{cc} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{array} \right) \right). $$
Then the conditional distribution is again Gaussian,
$$ p(\vec x_2 \mid \vec x_1) = \mathcal N \left( \vec x_2 \mid \vec \mu_{2 \mid 1}, \Sigma_{2 \mid 1} \right) $$
with
$$ \vec \mu_{2 \mid 1} = \vec \mu_2 + \Sigma_{21} \Sigma_{11}^{-1}(\vec x_1 - \vec \mu_1) $$
$$ \Sigma_{2 \mid 1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} $$
$$ p(\vec f_* \mid X_*, X, \vec f) = \mathcal N(\vec f_* \mid \vec \mu_*, \Sigma_*) $$
with (by the Multivariate Gaussian Theorem, see K.Murphy ML, Theorem 4.2.1): $$ \vec \mu_* = \vec \mu (X_*) + K_*^T K^{-1} ( \vec f - \vec \mu (X) ) $$ $$ \Sigma_{*} = K_{**} - K_*^T K^{-1} K_* $$
What are $\vec \mu (X_*)$ and $\vec \mu (X)$? They are the prior mean function evaluated at the test and training inputs, respectively; below we simply use a zero-mean prior.
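As a quick illustration, here is a minimal sketch of the noise-free predictive equations above, coded directly with the kernel function defined earlier; the helper name gp_predict_noise_free and the zero prior mean are assumptions of this sketch, not part of the implementation further below.
def gp_predict_noise_free(X_train, f_train, X_test, kernelParameter=2.):
    """Noise-free GP predictive mean and covariance, assuming a zero prior mean."""
    K    = kernel(X_train, X_train, kernelParameter=kernelParameter)   # K
    K_s  = kernel(X_train, X_test,  kernelParameter=kernelParameter)   # K_*
    K_ss = kernel(X_test,  X_test,  kernelParameter=kernelParameter)   # K_**
    # np.linalg.solve(K, .) applies K^{-1} without forming the inverse explicitly
    mu_star    = np.dot(K_s.T, np.linalg.solve(K, f_train))             # K_*^T K^{-1} f
    Sigma_star = K_ss - np.dot(K_s.T, np.linalg.solve(K, K_s))          # K_** - K_*^T K^{-1} K_*
    return mu_star, Sigma_star
In practice $K$ can be nearly singular (e.g. for close-by training points), which is one reason the noisy version $K_y$ below is preferred.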
If we only observe noisy function values
$$ y = f(\vec x) + \epsilon $$ with independent Gaussian noise $\epsilon \sim \mathcal N(0, \sigma_y^2)$, the distribution of the observations is
$$ p(\vec y \mid X) = \int p(\vec y \mid f, X) p(\vec f \mid X) d\vec f $$
with zero-mean prior: $$ p(\vec f \mid X) = \mathcal N (\vec f \mid \vec 0, K) $$
and independent noise at each data point: $$ p(\vec y \mid \vec f) = \prod_{i=1}^m \mathcal N(y^{(i)} \mid f^{(i)}, \sigma_y^2). $$
By the multivariate Gaussian theorem, $\vec y \mid X$ is again Gaussian with mean zero and
$$ \text{cov} \left[ \vec y \mid X \right] = K + \sigma_y^2 {\bf I}_m =: K_y, $$ so we just need to modify the kernel matrix: $K \rightarrow K_y$.
$$ \left( \begin{array}{c} \vec y \\ \vec f_* \end{array} \right) \sim \mathcal N \left( \left( \begin{array}{c} \vec \mu \\ \vec \mu_* \end{array} \right) , \left( \begin{array}{cc} K_y & K_* \\ K_*^T & K_{**} \end{array} \right) \right) $$
$$ p(\vec f_* \mid X_*, X, \vec y) = \mathcal N(\vec f_* \mid \vec \mu_*, \Sigma_*) $$
$$ \vec \mu_* = \vec \mu (X_*) + K_*^T K_y^{-1} ( \vec y - \vec \mu (X) ) $$ $$ \Sigma_{*} = K_{**} - K_*^T K_y^{-1} K_* $$
A common prior for the noise variance is the inverse Wishart distribution, which in 1-D reduces to the inverse Gamma distribution.
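As a small illustration (this prior is not used in the code below), a sketch of a few inverse Gamma densities via scipy.stats.invgamma; the shape and scale values are arbitrary example choices.
import scipy.stats
xs = np.linspace(1e-3, 2., 200)
for a, scale in [(1., 0.1), (2., 0.5), (3., 1.0)]:   # arbitrary illustrative hyperparameters
    pl.plot(xs, scipy.stats.invgamma.pdf(xs, a, scale=scale), label="a=%g, scale=%g" % (a, scale))
pl.legend()
pl.title('Inverse Gamma densities (possible priors for the noise variance)')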
# This is the true unknown function we are trying to approximate
f = lambda x: np.sin(0.9*x).flatten()
#f = lambda x: (0.25*(x**2)).flatten()
N = 5 # number of training points.
n = 100 # number of test points.
s = 0.00005 # noise standard deviation used to corrupt the training targets.
# Sample some input points and noisy versions of the function evaluated at
# these points.
X = np.random.uniform(-5, 5, size=(N,1))
y = f(X) + s*np.random.randn(N)
# the noise variance s_y is also a hyperparameter that must be estimated from the data,
# e.g. by maximizing the marginal likelihood (see the sketch at the end).
s_y = 0.001
var_y = np.eye(len(X)) * s_y
# points we're going to make predictions at.
Xtest = np.linspace(-5, 5, n).reshape(-1,1)
kernelParameter = 2.
K = kernel(X, X, kernelParameter = kernelParameter)
K_y = K + var_y
k_ = kernel(X, Xtest, kernelParameter = kernelParameter)
# naive, numerically unstable version: compute K_y^{-1} explicitly
K_inv = np.linalg.inv(K_y)
alpha = np.dot(K_inv, y)
mu = np.dot(k_.T, alpha)
K_ = kernel(Xtest, Xtest, kernelParameter = kernelParameter)
s_2 = K_ - np.dot( k_.T, np.dot(K_inv, k_) )
s_ = np.sqrt(np.diag(s_2))
pl.figure(1)
pl.clf()
pl.plot(X, y, 'r+', ms=20)
pl.plot(Xtest, f(Xtest), 'b-')
pl.gca().fill_between(Xtest.flat, mu-3*s_, mu+3*s_, color="#dddddd")
pl.plot(Xtest, mu, 'r--', lw=2)
pl.title('Mean prediction plus/minus 3 standard deviations')
pl.axis([-5, 5, -3, 3])
pl.savefig('predictive.png', bbox_inches='tight')
In the implementation above there is a matrix inversion of $K_y$. The matrix $K_y$ is symmetric and positive definite.
The inversion of such a matrix can be done more stably and faster by using the Cholesky decomposition:
a symmetric (more generally Hermitian) positive definite matrix $K$ can be factorized as $$ K = L L^T $$ with $L$ a lower triangular matrix.
# same code snippet as above
K = kernel(X, X, kernelParameter = kernelParameter)
K_y = K + var_y
L = np.linalg.cholesky(K_y)
L
np.allclose(L, np.tril(L)) # check if lower triangular
Solving a linear system with a triangular matrix is much easier (forward or back substitution), and efficient numerical routines exist; see http://www.mcs.csueastbay.edu/~malek/TeX/Triangle.pdf
So we use this to compute $K_y^{-1}$:
$$ K_{y}^{-1} = \left(L L^T \right)^{-1} = \left(L^T\right)^{-1} L^{-1} = L^{-T} L^{-1} $$
Note: $\left( A B\right)^{-1} = B^{-1} A^{-1}$
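To make the "triangular systems are easy" point concrete, here is a minimal forward-substitution sketch (the function name forward_substitution and the small demo matrices are purely illustrative), checked against scipy.linalg.solve_triangular:
import scipy.linalg

def forward_substitution(L, b):
    """Solve L x = b for a lower triangular matrix L by forward substitution."""
    x = np.zeros_like(b, dtype=float)
    for i in range(len(b)):
        # x[i] depends only on the already computed entries x[0..i-1]
        x[i] = (b[i] - np.dot(L[i, :i], x[:i])) / L[i, i]
    return x

L_demo = np.array([[2., 0., 0.], [1., 3., 0.], [4., 5., 6.]])
b_demo = np.array([2., 5., 32.])
x_demo = forward_substitution(L_demo, b_demo)
np.allclose(x_demo, scipy.linalg.solve_triangular(L_demo, b_demo, lower=True))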
$$ \vec \mu_* = \vec \mu (X_*) + K_*^T K_y^{-1} ( \vec y - \vec \mu (X) ) $$ with prior mean $\mu (\cdot) = 0$: $$ \vec \mu_* = K_*^T K_y^{-1} \vec y = K_*^T L^{-T} L^{-1} \vec y = \left( L^{-1} K_* \right)^T \left( L^{-1} \vec y \right) $$
For computing $L^{-1} K_*$ and $L^{-1} \vec y$ we exploit the fact that $L$ is a lower triangular matrix.
In SciPy this is done with scipy.linalg.solve_triangular.
import scipy.linalg
help(scipy.linalg.solve_triangular)
k_ = kernel(X, Xtest, kernelParameter = kernelParameter)
Lk = scipy.linalg.solve_triangular(L, k_, lower=True)
Ly = scipy.linalg.solve_triangular(L, y, lower=True)
mu = np.dot(Lk.T, Ly)
Analogously, $$ \Sigma_{*} = K_{**} - K_*^T K_y^{-1} K_* = K_{**} - K_*^T L^{-T}L^{-1} K_* = K_{**} - \left(L^{-1} K_*\right)^T \left(L^{-1} K_* \right) $$
K_ = kernel(Xtest, Xtest, kernelParameter = kernelParameter)
# we only need the diagonal elements of Sigma_*:
s2 = np.diag(K_) - np.sum(Lk**2, axis=0)
s_ = np.sqrt(s2)
# PLOTS:
pl.figure(1)
pl.clf()
pl.plot(X, y, 'r+', ms=20)
pl.plot(Xtest, f(Xtest), 'b-')
pl.gca().fill_between(Xtest.flat, mu-3*s_, mu+3*s_, color="#dddddd")
pl.plot(Xtest, mu, 'r--', lw=2)
pl.title('Mean prediction plus/minus 3 standard deviations')
pl.axis([-5, 5, -3, 3])
pl.savefig('predictive.png', bbox_inches='tight')
To draw samples from a multivariate Gaussian $\mathcal N(\vec \mu, \Sigma)$ we can again use the Cholesky decomposition: if $\Sigma = L L^T$ and $\vec z \sim \mathcal N(\vec 0, {\bf I})$, then $\vec \mu + L \vec z \sim \mathcal N(\vec \mu, \Sigma)$. For the prior we use $\vec \mu = \vec 0$ and $\Sigma = K_{**}$, for the posterior $\vec \mu = \vec \mu_*$ and $\Sigma = \Sigma_*$; in both cases a small jitter $s \, {\bf I}$ is added to the covariance before factorizing, for numerical stability.
# draw samples from the prior at our test points.
L = np.linalg.cholesky(K_ + s * np.eye(n) )
f_prior = np.dot(L, np.random.normal(size=(n,10)))
pl.figure(2)
pl.clf()
pl.plot(Xtest, f_prior)
pl.title('Ten samples from the GP prior')
pl.axis([-5, 5, -3, 3])
pl.savefig('prior.png', bbox_inches='tight')
# draw samples from the posterior at our test points.
L = np.linalg.cholesky(K_ + s * np.eye(n) - np.dot(Lk.T, Lk))
f_post = mu.reshape(-1,1) + np.dot(L, np.random.normal(size=(n,10)))
pl.figure(3)
pl.clf()
pl.plot(Xtest, f_post)
pl.title('Ten samples from the GP posterior')
pl.axis([-5, 5, -3, 3])
pl.savefig('post.png', bbox_inches='tight')
TODO: How to determine the hyperparameters (s_y, kernelParameter, sigma_square, ...), e.g. by maximizing the marginal likelihood; a rough sketch is given below.
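A minimal sketch, assuming the zero-mean model from above: the log marginal likelihood $$ \log p(\vec y \mid X) = -\tfrac{1}{2} \vec y^T K_y^{-1} \vec y - \tfrac{1}{2} \log |K_y| - \tfrac{m}{2} \log 2\pi $$ can be evaluated for different hyperparameter values. The helper name log_marginal_likelihood and the grid values are illustrative only; in practice one would use a gradient-based optimizer instead of a grid search.
import scipy.linalg

def log_marginal_likelihood(X, y, kernelParameter, s_y):
    # log p(y | X) = -1/2 y^T K_y^{-1} y - 1/2 log|K_y| - m/2 log(2 pi)
    K_y = kernel(X, X, kernelParameter=kernelParameter) + s_y * np.eye(len(X))
    L = np.linalg.cholesky(K_y)
    # alpha = K_y^{-1} y via two triangular solves
    alpha = scipy.linalg.solve_triangular(
        L.T, scipy.linalg.solve_triangular(L, y, lower=True), lower=False)
    return (-0.5 * np.dot(y, alpha)
            - np.sum(np.log(np.diag(L)))          # 1/2 log|K_y| = sum(log(diag(L)))
            - 0.5 * len(X) * np.log(2 * np.pi))

# simple illustrative grid search over the two hyperparameters
grid = [(kp, sy) for kp in (0.1, 0.5, 1., 2., 5.) for sy in (1e-4, 1e-3, 1e-2)]
best = max(grid, key=lambda p: log_marginal_likelihood(X, y, *p))
best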
Python Library: