How to turn a collection of small building blocks into a versatile tool for solving regression problems.

Even if you have spent some time reading about machine learning, chances are that you have never heard of Gaussian processes. And if you have, rehearsing the basics is always a good way to refresh your memory. With this blog post we want to give an introduction to Gaussian processes and make the mathematical intuition behind them more approachable.

Gaussian processes are a powerful tool in the machine learning toolbox*fitting* a function to the data.
This is called regression and is used, for example, in robotics or time series forecasting.
But Gaussian processes are not limited to regression — they can also be extended to classification and clustering tasks

We will first explore the mathematical foundation that Gaussian processes are built on — we invite you to follow along using the interactive figures and hands-on examples. They help to explain the impact of individual components, and show the flexibility of Gaussian processes. After following this article we hope that you will have a visual intuition on how Gaussian processes work and how you can configure them for different types of data.

Before we can explore Gaussian processes, we need to understand the mathematical concepts they are based on.
As the name suggests, the Gaussian distribution (which is often also referred to as *normal* distribution) is the basic building block of Gaussian processes.
In particular, we are interested in the multivariate case of this distribution, where each random variable is distributed normally and their joint distribution is also Gaussian.
The multivariate Gaussian distribution is defined by a mean vector $\mu$ and a covariance matrix $\Sigma$.
You can see an interactive example of such distributions in the figure below.

The mean vector $\mu$ describes the expected value of the distribution.
Each of its components describes the mean of the corresponding dimension.
$\Sigma$ models the variance along each dimension and determines how the different random variables are correlated.
The covariance matrix is always symmetric and positive semi-definite

We say $X$ follows a normal distribution. The covariance matrix $\Sigma$ describes the shape of the distribution. It is defined in terms of the expected value $E$:

$\Sigma = \text{Cov}(X_i, X_j) = E \left[ (X_i - \mu_i)(X_j - \mu_j)^T \right]$Visually, the distribution is centered around the mean and the covariance matrix defines its shape. The following figure shows the influence of these parameters on a two-dimensional Gaussian distribution. The variances for each random variable are on the diagonal of the covariance matrix, while the other values show the covariance between them.

Gaussian distributions are widely used to model the real world.
For example, we can employ them to describe errors of measurements or phenomena under the assumptions of the *central limit theorem*

Gaussian distributions have the nice algebraic property of being closed under conditioning and marginalization. Being closed under conditioning and marginalization means that the resulting distributions from these operations are also Gaussian, which makes many problems in statistics and machine learning tractable. In the following we will take a closer look at both of these operations, as they are the foundation for Gaussian processes.

*Marginalization* and *conditioning* both work on subsets of the original distribution and we will use the following notation:

With $X$ and $Y$ representing subsets of original random variables.

Through *marginalization* we can extract partial information from multivariate probability distributions. In particular, given a normal probability distribution $P(X,Y)$ over vectors of random variables $X$ and $Y$, we can determine their marginalized probability distributions in the following way:

The interpretation of this equation is that each partition $X$ and $Y$ only depends on its corresponding entries in $\mu $ and $\mathrm{\Sigma}$. To marginalize out a random variable from a Gaussian distribution we can simply drop the variables from $\mu$ and $\Sigma$.

$p_X(x) = \int_y p_{X,Y}(x,y)dy = \int_y p_{X|Y}(x|y) p_Y(y) dy$The way to interpret this equation is that if we are interested in the probability density of
$X = x$, we need to consider all possible outcomes of
$Y$ that can jointly lead to the result

Another important operation for Gaussian processes is *conditioning*.
It is used to determine the probability of one variable depending on another variable.
Similar to marginalization, this operation is also closed and yields a modified Gaussian distribution.
This operation is the cornerstone of Gaussian processes since it allows Bayesian inference, which we will talk about in the next section.
Conditioning is defined by:

Note that the new mean only depends on the conditioned variable, while the covariance matrix is independent from this variable.

Now that we have worked through the necessary equations, we will think about how we can understand the two operations visually. While marginalization and conditioning can be applied to multivariate distributions of many dimensions, it makes sense to consider the two-dimensional case as shown in the following figure. Marginalization can be seen as integrating along one of the dimensions of the Gaussian distribution, which is in line with the general definition of the marginal distribution. Conditioning also has a nice geometric interpretation — we can imagine it as making a cut through the multivariate distribution, yielding a new Gaussian distribution with fewer dimensions.

Now that we have recalled some of the basic properties of multivariate Gaussian distributions, we will combine them together to define Gaussian processes, and show how they can be used to tackle regression problems.

First, we will move from the continuous view to the discrete representation of a function:
rather than finding an implicit function, we are interested in predicting the function values at concrete points, which we call *test points* $X$.
So how do we derive this functional view from the multivariate normal distributions that we have considered so far?
Stochastic processes, such as Gaussian processes, are essentially a set of random variables.
In addition, each of these random variables has a corresponding index $i$.
We will use this index to refer to the $i$-th dimension of our $n$-dimensional multivariate distributions.
The following figure shows an example of this for two dimensions:

Now, the goal of Gaussian processes is to learn this underlying distribution from *training data*.
Respective to the test data $X$, we will denote the training data as $Y$.
As we have mentioned before, the key idea of Gaussian processes is to model the underlying distribution of $X$ together with $Y$ as a multivariate normal distribution.
That means that the joint probability distribution $P_{X,Y}$ spans the space of possible function values for the function that we want to predict.
Please note that this joint distribution of test and training data has $|X| + |Y|$ dimensions.

In order to perform regression on the training data, we will treat this problem as *Bayesian inference*.
The essential idea of Bayesian inference is to update the current hypothesis as new information becomes available.
In the case of Gaussian processes, this information is the training data.
Thus, we are interested in the conditional probability $P_{X|Y}$.
Finally, we recall that Gaussian distributions are closed under conditioning — so $P_{X|Y}$ is also distributed normally.

Now that we have the basic framework of Gaussian processes together, there is only one thing missing:
how do we set up this distribution and define the mean $\mu$ and the covariance matrix $\Sigma$?
The covariance matrix $\Sigma$ is determined by its *covariance function* $k$, which is often also called the *kernel* of the Gaussian process.
We will talk about this in detail in the next section.
But before we come to this, let us reflect on how we can use multivariate Gaussian distributions to estimate function values.
The following figure shows an example of this using ten test points at which we want to predict our function:

In Gaussian processes we treat each test point as a random variable. A multivariate Gaussian distribution has the same number of dimensions as the number of random variables. Since we want to predict the function values at $|X|=N$ test points, the corresponding multivariate Gaussian distribution is also $N$ -dimensional. Making a prediction using a Gaussian process ultimately boils down to drawing samples from this distribution. We then interpret the $i$-th component of the resulting vector as the function value corresponding to the $i$-th test point.

Recall that in order to set up our distribution, we need to define $\mu$ and $\Sigma$.
In Gaussian processes it is often assumed that $\mu = 0$, which simplifies the necessary equations for conditioning.
We can always assume such a distribution, even if $\mu \neq 0$, and add $\mu$ back to the resulting function values after the prediction step.
This process is also called *centering* of the data.
So configuring $\mu$ is straight forward — it gets more interesting when we look at the other parameter of the distribution.

The clever step of Gaussian processes is how we set up the covariance matrix $\Sigma$.
The covariance matrix will not only describe the shape of our distribution, but ultimately determines the characteristics of the function that we want to predict.
We generate the covariance matrix by evaluating the kernel $k$, which is often also called *covariance function*, pairwise on all the points.
The kernel receives two points $t,t’ \in \mathbb{R}^n$ as an input and returns a similarity measure between those points in the form of a scalar:

We evaluate this function for each pairwise combination of the test points to retrieve the covariance matrix. This step is also depicted in the figure above. In order to get a better intuition for the role of the kernel, let’s think about what the entries in the covariance matrix describe. The entry $\Sigma_{ij}$ describes how much influence the $i$-th and $j$-th point have on each other. This follows from the definition of the multivariate Gaussian distribution, which states that $\Sigma_{ij}$ defines the correlation between the $i$-th and the $j$-th random variable. Since the kernel describes the similarity between the values of our function, it controls the possible shape that a fitted function can adopt. Note that when we choose a kernel, we need to make sure that the resulting matrix adheres to the properties of a covariance matrix.

Kernels are widely used in machine learning, for example in *support vector machines*

Kernels can be separated into *stationary* and *non-stationary* kernels. *Stationary* kernels, such
as the RBF kernel or the periodic kernel, are functions invariant to translations, and the covariance of two points is only
dependent on their relative position. *Non-stationary* kernels, such as the linear kernel, do not have this
constraint and depend on an absolute location. The stationary nature of the RBF kernel can be observed in the
banding around the diagonal of its covariance matrix (as shown in this figure). Increasing the length parameter increases the banding, as
points further away from each other become more correlated. For the periodic kernel, we have an additional parameter
$P$ that determines the periodicity, which controls the distance between each repetition of the function.
In contrast, the parameter $C$ of the linear kernel allows us to change the point on which all functions hinge.

There are many more kernels that can describe different classes of functions, which can be used to model the desired shape of the function.
A good overview of different kernels is given by Duvenaud

We will now shift our focus back to the original task of regression.
As we have mentioned earlier, Gaussian processes define a probability distribution over possible functions.
In this figure above, we show this connection:
each sample of our multivariate normal distribution represents one realization of our function values.
Because this distribution is a multivariate Gaussian distribution, the distribution of functions is normal.
Recall that we usually assume $\mu=0$.
For now, let’s consider the case where we have not yet observed any training data.
In the context of Bayesian inference, this is called the *prior* distribution $P_X$.

If we have not yet observed any training examples, this distribution revolves around $\mu=0$, according to our original assumption. The prior distribution will have the same dimensionality as the number of test points $N = |X|$. We will use the kernel to set up the covariance matrix, which has the dimensions $N \times N$.

In the previous section we have looked at examples of different kernels. The kernel is used to define the entries of the covariance matrix. Consequently, the covariance matrix determines which type of functions from the space of all possible functions are more probable. As the prior distribution does not yet contain any additional information, it is perfect to visualize the influence of the kernel on the distribution of functions. The following figure shows samples of potential functions from prior distributions that were created using different kernels: