Gaussian Distributions
A Gaussian distribution is defined over variables, i.e. the distribution describes how (relatively) frequently the values of those variables show up in observations. A Gaussian distribution over an n-dimensional vector variable is fully specified by a mean vector, μ, and a covariance matrix, Σ.
Now, given a standard univariate Normal distribution N(0,1), how can we represent any univariate Normal distribution in terms of the standard one? $$ \mathcal{N}(\mu,\sigma^2) = \mu + \sigma \, \mathcal{N}(0,1) $$ That is a pretty neat property to have, but how do we extend it to a multivariate Normal distribution? We need something analogous to the square root of a covariance matrix. The Cholesky decomposition provides exactly this: for any symmetric positive-definite matrix C (and a covariance matrix is exactly that) we can write the factorization as $$ C = L L^T $$ where L is lower triangular, allowing us to write for a multivariate Normal distribution $$ \mathcal{N}(\mu,\Sigma) = \mu + L \, \mathcal{N}(0,I), \quad \Sigma = L L^T $$
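To make this concrete, here is a minimal NumPy sketch (the mean and covariance values are made up for illustration) that draws samples from a multivariate Normal using its Cholesky factor:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])          # symmetric positive-definite

L = np.linalg.cholesky(Sigma)           # Sigma = L @ L.T, L lower triangular
z = rng.standard_normal((2, 10000))     # z ~ N(0, I)
samples = mu[:, None] + L @ z           # x = mu + L z  ~  N(mu, Sigma)

print(np.cov(samples))                  # should be close to Sigma
```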
Gaussian Processes
We can interpret this as meaning that the mean is now computed as a function of the observed x, and the elements of the covariance matrix are now produced by the covariance function given above. This covariance function can be any kernel and is used to measure the similarity between elements. One frequently used kernel is the Radial Basis Function (RBF) kernel.
$$
k(x_i,x_j) = \sigma^2 \exp\left( \frac{ - \| x_i - x_j \|^2}{2l^2} \right)
$$
where σ and l are the parameters of this function. σ² can be interpreted as the variance of the kernel; increasing it widens the envelope of uncertainty around the mean. The parameter l can be interpreted as the length scale of the kernel: a smaller value gives a narrower kernel, so that only nearby points have high similarity, while a larger value spreads the kernel out, so that points farther away can have high similarity as well. So why do we need a covariance function?
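Before answering that, here is a small sketch of the RBF kernel as code; the function name and the default parameter values are illustrative choices, not anything fixed by the text:

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0, length_scale=1.0):
    """RBF kernel matrix between the row vectors of X1 and X2.

    k(x_i, x_j) = sigma^2 * exp(-||x_i - x_j||^2 / (2 * length_scale^2))
    """
    # Squared Euclidean distances between every pair of rows.
    sq_dists = np.sum(X1**2, axis=1)[:, None] \
             + np.sum(X2**2, axis=1)[None, :] \
             - 2.0 * X1 @ X2.T
    return sigma**2 * np.exp(-sq_dists / (2.0 * length_scale**2))

# Example: covariance of 5 one-dimensional inputs with themselves.
X = np.linspace(0.0, 1.0, 5)[:, None]
K = rbf_kernel(X, X)
print(K.shape)   # (5, 5), symmetric, ones on the diagonal when sigma = 1
```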
When we perform regression, we have points x at which we know the value of the function, f, and points x∗ for which we do not know the output, f∗. Our goal is to use the joint distribution formula shown in the last section to infer the mean and covariance, and therefore the distribution, of the unknowns. If we represent the covariance using a parameterized kernel function that depends only on x and x∗, we can fit the parameters of this kernel during the regression procedure using the knowns:
- For the known points: inputs and their outputs
- The joint distribution of the knowns and the unknowns
Note that the covariance kernel is only a function of the input x and not of the output f(x), which allows us to compute it at every point (known and unknown). If we assume that the outputs f and f∗ are jointly drawn from a Gaussian distribution, we can proceed as follows. I am going to switch notation a little and refer to the elements of the covariance matrix by K to indicate that they are outputs of a kernel function. In other words, K will refer to the covariance between x and itself, K∗ will represent the covariance between x and x∗, while K∗∗ will represent the covariance between x∗ and itself. In matrix form we can write

$$ \begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} K & K_* \\ K_*^T & K_{**} \end{bmatrix} \right) $$
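The distribution of the unknown f∗ then follows from the standard result for conditioning a joint Gaussian on the observed block. The sketch below implements that conditioning directly (the helper name, the jitter term, and the idea of passing the kernel in as an argument are my own choices); it can reuse a covariance function such as the rbf_kernel sketched earlier:

```python
import numpy as np

def gp_posterior_noise_free(X_train, f_train, X_test, kernel, **kernel_params):
    """Condition the joint Gaussian on the observed f to get mean/cov at X_test.

    Standard Gaussian conditioning identities:
        mean = K_*^T K^{-1} f
        cov  = K_** - K_*^T K^{-1} K_*
    """
    K    = kernel(X_train, X_train, **kernel_params)   # cov(x, x)
    K_s  = kernel(X_train, X_test,  **kernel_params)   # cov(x, x*)
    K_ss = kernel(X_test,  X_test,  **kernel_params)   # cov(x*, x*)

    # A tiny jitter keeps the Cholesky factorization numerically stable.
    L = np.linalg.cholesky(K + 1e-10 * np.eye(len(X_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f_train))   # K^{-1} f
    v = np.linalg.solve(L, K_s)                                 # L^{-1} K_*

    mean = K_s.T @ alpha
    cov = K_ss - v.T @ v
    return mean, cov
```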
What about noisy data?
What happens when the data is corrupted by noise? This is pretty common during data acquisition, and we should have a way to represent this uncertainty. We assume that the data has some noise in it which can, again, be modeled as a Gaussian distribution. Let the acquired or observed data be represented by y (for input x)
$$ y = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0,\sigma_y^2) $$
where ϵ is the noise that we assume is added to the underlying function value. Now the conditional probability of y given a certain x can be computed by marginalizing, or integrating out, the other dependent variable, which in this case is f(x)
$$ p(y|x) = \int p(y|f,x) p(f|x) df $$
We have two factors in the integrand above. The second factor is simply the distribution we saw earlier
$$ p(f|x) = \mathcal{N}(f \,|\, 0,K) $$
while the first factor can be interpreted as independent observations drawn from a normal distribution centered on the latent function values. Since the noise is independent across observations, we have
$$ p(y|f) = \prod_i \mathcal{N}(y_i \,|\, f_i,\sigma_y^2) $$
Skipping the algebra, we arrive at the following expression for the covariance of y
$$ \operatorname{cov}(y|x) = K + \sigma_y^2 I = K_y $$
So what happened above is that the covariance of the noisy observed variable is simply a modified covariance matrix with the noise variance added to the original covariance matrix. How does this affect the fitted curve? With no noise, the uncertainty is zero at the observed points, i.e. the variance in the graph is zero there. When we assume a noisy signal this is no longer the case: at the observed points the fitted curve has a thickness proportional to the noise variance, representing our uncertainty at those points. If we rewrite the equations for the noisy GP, they look like the following

$$ \mu_* = K_*^T K_y^{-1} y, \quad \Sigma_* = K_{**} - K_*^T K_y^{-1} K_* $$
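A sketch of the noisy prediction, where the only change relative to the noise-free conditioning is that K_y = K + σ_y²I is factorized instead of K (the function name and the kernel argument are again my own choices):

```python
import numpy as np

def gp_predict_noisy(X_train, y_train, X_test, kernel, noise_var, **kernel_params):
    """Posterior mean and covariance at X_test for noisy observations y."""
    K    = kernel(X_train, X_train, **kernel_params)
    K_s  = kernel(X_train, X_test,  **kernel_params)
    K_ss = kernel(X_test,  X_test,  **kernel_params)

    K_y = K + noise_var * np.eye(len(X_train))   # K_y = K + sigma_y^2 I
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)

    mean = K_s.T @ alpha                 # K_*^T K_y^{-1} y
    cov = K_ss - v.T @ v                 # K_** - K_*^T K_y^{-1} K_*
    return mean, cov
```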
Learning the parameters
So far we have assumed that the value of l was fixed; however, the purpose of regression is to determine the parameters that best minimize the error between the observed y and the y computed from the model parameterized by l. We can write this as a Maximum Likelihood Estimation (MLE) problem for the following probability
$$ \operatorname*{argmax}_l \; \mathcal{L}(l|y) $$ or, in terms of the probability, $$ \operatorname*{argmax}_l \; p(y|x,l) $$
The problem then reduces to optimizing the integral that we saw earlier, which is defined in terms of the mean and covariance. Note that we have now parameterized the probability by l
$$ p(y|x,l) = \int p(y|f,x)\, p(f|x)\, df = \mathcal{N}(y \,|\, 0,K_y) $$
Skipping over all the algebra, the integral can be simplified by taking the log of the above probability distribution
$$ \log \; p(y|x,l) = -\dfrac{1}{2} y^T K_y^{-1} y - \dfrac{1}{2} \log |K_y| - \dfrac{N}{2} \log(2\pi) $$
and the derivative of the above with respect to l is
$$ \dfrac{\partial}{\partial l} \log \; p(y|x,l) = \dfrac{1}{2} y^T K_y^{-1} \dfrac{\partial K_y}{\partial l} K_y^{-1} y - \dfrac{1}{2} \operatorname{tr}\left(K_y^{-1} \dfrac{\partial K_y}{\partial l}\right) $$
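Before using these, a quick sanity check of the log-likelihood expression (a sketch with made-up data; SciPy is used purely for comparison): evaluating it directly should match scipy.stats.multivariate_normal.logpdf.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

# Made-up data and a made-up K_y; any symmetric positive-definite matrix works here.
N = 4
y = rng.standard_normal(N)
A = rng.standard_normal((N, N))
K_y = A @ A.T + N * np.eye(N)

# The closed-form expression from the text.
log_p = (-0.5 * y @ np.linalg.solve(K_y, y)
         - 0.5 * np.linalg.slogdet(K_y)[1]
         - 0.5 * N * np.log(2.0 * np.pi))

print(log_p)
print(multivariate_normal(mean=np.zeros(N), cov=K_y).logpdf(y))  # should match
```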
Now that we have the function and its derivative, we can use any gradient-based optimization scheme, e.g. L-BFGS, to obtain an optimal value of the kernel parameter l.
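Putting the pieces together, the sketch below minimizes the negative log marginal likelihood with respect to l using SciPy's L-BFGS-B, supplying the analytic gradient derived above. The 1-D data, the fixed values of σ and the noise variance, and the choice to optimize log l (to keep the length scale positive) are all illustrative assumptions, not prescriptions from the text:

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel_1d(x1, x2, sigma, l):
    """k(x_i, x_j) = sigma^2 exp(-(x_i - x_j)^2 / (2 l^2)) for 1-D inputs."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma**2 * np.exp(-d2 / (2.0 * l**2)), d2

def neg_log_marginal_likelihood(log_l, x, y, sigma=1.0, noise_var=0.1):
    """-log p(y | x, l) and its derivative w.r.t. log(l).

    sigma and noise_var are held fixed here purely to keep the example short.
    """
    l = np.exp(log_l[0])
    K, d2 = rbf_kernel_1d(x, x, sigma, l)
    K_y = K + noise_var * np.eye(len(x))

    K_inv_y = np.linalg.solve(K_y, y)
    nll = (0.5 * y @ K_inv_y
           + 0.5 * np.linalg.slogdet(K_y)[1]
           + 0.5 * len(x) * np.log(2.0 * np.pi))

    # dK_y/dl for the RBF kernel (the noise term does not depend on l),
    # plugged into the gradient expression from the text.
    dK_dl = K * d2 / l**3
    dlogp_dl = (0.5 * K_inv_y @ dK_dl @ K_inv_y
                - 0.5 * np.trace(np.linalg.solve(K_y, dK_dl)))
    # Chain rule: d/d(log l) = l * d/dl; negate because we minimize.
    return nll, np.array([-dlogp_dl * l])

# Illustrative data (not from the text).
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 5.0, 30))
y = np.sin(x) + 0.3 * rng.standard_normal(30)

res = minimize(neg_log_marginal_likelihood, x0=[0.0], args=(x, y),
               jac=True, method="L-BFGS-B")
print("optimal length scale:", np.exp(res.x[0]))
```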