Gaussian Distributions
A Gaussian distribution is defined over variables, i.e. the distribution describes how (relatively) frequently the values of those variables show up in observations. A Gaussian distribution over an n-dimensional vector variable is fully specified by a mean vector, μ, and a covariance matrix, Σ.
Now, given a standard univariate Normal distribution N(0,1), how can we represent any univariate Normal distribution in terms of the standard one? $$ \mathcal{N}(\mu,\sigma^2) = \mu + \sigma \, \mathcal{N}(0,1) $$ That is a pretty neat property to have, but how do we extend it to a multivariate Normal distribution? We need something analogous to the square root of a covariance matrix. The Cholesky decomposition provides exactly this: for any symmetric positive-definite matrix C (and a covariance matrix is exactly that) we can write the factorization as $$ C = L L^T $$ where L is lower triangular, allowing us to write for a multivariate Normal distribution $$ \mathcal{N}(\mu,\Sigma) = \mu + L \, \mathcal{N}(0,I), \quad \Sigma = L L^T $$
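To make this concrete, here is a minimal NumPy sketch (the mean and covariance values are made up for illustration) that draws samples from a multivariate Normal using its Cholesky factor:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])          # symmetric positive-definite

L = np.linalg.cholesky(Sigma)           # Sigma = L @ L.T, L lower triangular
z = rng.standard_normal((2, 10000))     # z ~ N(0, I)
samples = mu[:, None] + L @ z           # x = mu + L z  ~  N(mu, Sigma)

print(np.cov(samples))                  # should be close to Sigma
```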
Gaussian Processes
We can interpret this as meaning that the mean is now computed as a function of the observed x, and the elements of the covariance matrix are now produced by the covariance function given above. This covariance function can be any kernel and is used to measure the similarity between elements. One frequently used kernel is the Radial Basis Function (RBF) kernel.
$$
k(x_i,x_j) = \sigma^2 \exp\left( \frac{ - \| x_i - x_j \|^2}{2l^2} \right)
$$
where σ and l are the parameters of this function. σ² can be interpreted as the variance of the kernel; increasing it widens the envelope of uncertainty around the mean. The parameter l can be interpreted as the length scale of the kernel: a smaller value gives a narrower kernel, so that only nearby points have high similarity, while a larger value spreads the kernel out, so that points farther away can have high similarity as well. So why do we need a covariance function?
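Before answering that, here is a small sketch of the RBF kernel as code; the function name and the default parameter values are illustrative choices, not anything fixed by the text:

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0, length_scale=1.0):
    """RBF kernel matrix between the row vectors of X1 and X2.

    k(x_i, x_j) = sigma^2 * exp(-||x_i - x_j||^2 / (2 * length_scale^2))
    """
    # Squared Euclidean distances between every pair of rows.
    sq_dists = np.sum(X1**2, axis=1)[:, None] \
             + np.sum(X2**2, axis=1)[None, :] \
             - 2.0 * X1 @ X2.T
    return sigma**2 * np.exp(-sq_dists / (2.0 * length_scale**2))

# Example: covariance of 5 one-dimensional inputs with themselves.
X = np.linspace(0.0, 1.0, 5)[:, None]
K = rbf_kernel(X, X)
print(K.shape)   # (5, 5), symmetric, ones on the diagonal when sigma = 1
```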
When we perform regression, we have points x at which we know the value of the function, f, and points x∗ for which we do not know the output, f∗. Our goal is to use the joint distribution formula shown in the last section to infer the mean and covariance, and therefore the distribution, of the unknowns. If we represent the covariance using a parameterized kernel function that depends only on x and x∗, we can fit the parameters of this kernel during the regression procedure using the knowns:
- For the known points: inputs and their outputs
- The joint distribution of the knowns and the unknowns
Note that the covariance kernel is only a function of the input x and not of the output f(x), which allows us to compute it at every point (known and unknown). If we assume that the outputs f and f∗ are jointly drawn from a Gaussian distribution, we can proceed as follows. I am going to switch notation a little and refer to the elements of the covariance matrix by K to indicate that they are outputs of a kernel function. In other words, K will refer to the covariance between x and itself, K∗ will represent the covariance between x and x∗, while K∗∗ will represent the covariance between x∗ and itself. In matrix form we can write

$$ \begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} K & K_* \\ K_*^T & K_{**} \end{bmatrix} \right) $$
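The distribution of the unknown f∗ then follows from the standard result for conditioning a joint Gaussian on the observed block. The sketch below implements that conditioning directly (the helper name, the jitter term, and the idea of passing the kernel in as an argument are my own choices); it can reuse a covariance function such as the rbf_kernel sketched earlier:

```python
import numpy as np

def gp_posterior_noise_free(X_train, f_train, X_test, kernel, **kernel_params):
    """Condition the joint Gaussian on the observed f to get mean/cov at X_test.

    Standard Gaussian conditioning identities:
        mean = K_*^T K^{-1} f
        cov  = K_** - K_*^T K^{-1} K_*
    """
    K    = kernel(X_train, X_train, **kernel_params)   # cov(x, x)
    K_s  = kernel(X_train, X_test,  **kernel_params)   # cov(x, x*)
    K_ss = kernel(X_test,  X_test,  **kernel_params)   # cov(x*, x*)

    # A tiny jitter keeps the Cholesky factorization numerically stable.
    L = np.linalg.cholesky(K + 1e-10 * np.eye(len(X_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f_train))   # K^{-1} f
    v = np.linalg.solve(L, K_s)                                 # L^{-1} K_*

    mean = K_s.T @ alpha
    cov = K_ss - v.T @ v
    return mean, cov
```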
What about noisy data?
What happens when the data is corrupted by noise? This is pretty common during data acquisition, and we should have a way to represent this uncertainty. We assume that the data has some noise in it which can, again, be modeled as a Gaussian distribution. Let the acquired or observed data be represented by y (for input x)
$$ y = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0,\sigma_y^2) $$
where ϵ is the noise that we assume is added to the underlying function value. Now the conditional probability of y given a certain x can be computed by marginalizing, or integrating out, the other dependent variable, which in this case is f(x)
$$ p(y|x) = \int p(y|f,x) p(f|x) df $$
We have two factors in the integrand above. The second factor is simply the distribution we saw earlier
$$ p(f|x) = \mathcal{N}(f \,|\, 0,K) $$
while the first factor can be interpreted as independent observations drawn from a normal distribution centered on the latent function values. Since the noise is independent across observations, we have
$$ p(y|f) = \prod_i \mathcal{N}(y_i \,|\, f_i,\sigma_y^2) $$
Skipping the algebra, we arrive at the following expression for the covariance of y
$$ \operatorname{cov}(y|x) = K + \sigma_y^2 I = K_y $$
So what happened above is that the covariance of the noisy observed variable is simply a modified covariance matrix with the noise variance added to the original covariance matrix. How does this affect the fitted curve? With no noise, the uncertainty is zero at the observed points, i.e. the variance in the graph is zero there. When we assume a noisy signal this is no longer the case: at the observed points the fitted curve has a thickness proportional to the noise variance, representing our uncertainty at those points. If we rewrite the equations for the noisy GP, they look like the following

$$ \mu_* = K_*^T K_y^{-1} y, \quad \Sigma_* = K_{**} - K_*^T K_y^{-1} K_* $$
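A sketch of the noisy prediction, where the only change relative to the noise-free conditioning is that K_y = K + σ_y²I is factorized instead of K (the function name and the kernel argument are again my own choices):

```python
import numpy as np

def gp_predict_noisy(X_train, y_train, X_test, kernel, noise_var, **kernel_params):
    """Posterior mean and covariance at X_test for noisy observations y."""
    K    = kernel(X_train, X_train, **kernel_params)
    K_s  = kernel(X_train, X_test,  **kernel_params)
    K_ss = kernel(X_test,  X_test,  **kernel_params)

    K_y = K + noise_var * np.eye(len(X_train))   # K_y = K + sigma_y^2 I
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)

    mean = K_s.T @ alpha                 # K_*^T K_y^{-1} y
    cov = K_ss - v.T @ v                 # K_** - K_*^T K_y^{-1} K_*
    return mean, cov
```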
Learning the parameters
So far we have assumed that the value of l was fixed; however, the purpose of regression is to determine the parameters that best minimize the error between the observed y and the y computed from the model parameterized by l. We can write this as a Maximum Likelihood Estimation (MLE) problem for the following probability
$$ \operatorname*{argmax}_l \; \mathcal{L}(l|y) $$ or, in terms of the probability, $$ \operatorname*{argmax}_l \; p(y|x,l) $$
The problem then reduces to optimizing the integral that we saw earlier, which is defined in terms of the mean and covariance. Note that we have now parameterized the probability by l
$$ p(y|x,l) = \int p(y|f,x)\, p(f|x)\, df = \mathcal{N}(y \,|\, 0,K_y) $$
Skipping over all the algebra, the integral can be simplified by taking the log of the above probability distribution
$$ \log \; p(y|x,l) = -\dfrac{1}{2} y^T K_y^{-1} y - \dfrac{1}{2} \log |K_y| - \dfrac{N}{2} \log(2\pi) $$
and the derivative of the above with respect to l is
$$ \dfrac{\partial}{\partial l} \log \; p(y|x,l) = \dfrac{1}{2} y^T K_y^{-1} \dfrac{\partial K_y}{\partial l} K_y^{-1} y - \dfrac{1}{2} \operatorname{tr}\left(K_y^{-1} \dfrac{\partial K_y}{\partial l}\right) $$
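Before using these, a quick sanity check of the log-likelihood expression (a sketch with made-up data; SciPy is used purely for comparison): evaluating it directly should match scipy.stats.multivariate_normal.logpdf.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

# Made-up data and a made-up K_y; any symmetric positive-definite matrix works here.
N = 4
y = rng.standard_normal(N)
A = rng.standard_normal((N, N))
K_y = A @ A.T + N * np.eye(N)

# The closed-form expression from the text.
log_p = (-0.5 * y @ np.linalg.solve(K_y, y)
         - 0.5 * np.linalg.slogdet(K_y)[1]
         - 0.5 * N * np.log(2.0 * np.pi))

print(log_p)
print(multivariate_normal(mean=np.zeros(N), cov=K_y).logpdf(y))  # should match
```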
Now that we have the function and its derivative, we can use any gradient-based optimization scheme, e.g. L-BFGS, to obtain an optimal value of the kernel parameter l.
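Putting the pieces together, the sketch below minimizes the negative log marginal likelihood with respect to l using SciPy's L-BFGS-B, supplying the analytic gradient derived above. The 1-D data, the fixed values of σ and the noise variance, and the choice to optimize log l (to keep the length scale positive) are all illustrative assumptions, not prescriptions from the text:

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel_1d(x1, x2, sigma, l):
    """k(x_i, x_j) = sigma^2 exp(-(x_i - x_j)^2 / (2 l^2)) for 1-D inputs."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma**2 * np.exp(-d2 / (2.0 * l**2)), d2

def neg_log_marginal_likelihood(log_l, x, y, sigma=1.0, noise_var=0.1):
    """-log p(y | x, l) and its derivative w.r.t. log(l).

    sigma and noise_var are held fixed here purely to keep the example short.
    """
    l = np.exp(log_l[0])
    K, d2 = rbf_kernel_1d(x, x, sigma, l)
    K_y = K + noise_var * np.eye(len(x))

    K_inv_y = np.linalg.solve(K_y, y)
    nll = (0.5 * y @ K_inv_y
           + 0.5 * np.linalg.slogdet(K_y)[1]
           + 0.5 * len(x) * np.log(2.0 * np.pi))

    # dK_y/dl for the RBF kernel (the noise term does not depend on l),
    # plugged into the gradient expression from the text.
    dK_dl = K * d2 / l**3
    dlogp_dl = (0.5 * K_inv_y @ dK_dl @ K_inv_y
                - 0.5 * np.trace(np.linalg.solve(K_y, dK_dl)))
    # Chain rule: d/d(log l) = l * d/dl; negate because we minimize.
    return nll, np.array([-dlogp_dl * l])

# Illustrative data (not from the text).
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 5.0, 30))
y = np.sin(x) + 0.3 * rng.standard_normal(30)

res = minimize(neg_log_marginal_likelihood, x0=[0.0], args=(x, y),
               jac=True, method="L-BFGS-B")
print("optimal length scale:", np.exp(res.x[0]))
```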