GP CaKe: An introduction to Granger causality with Gaussian processes, part II

In the previous installment of this series, I explained the ideas behind the GP CaKe model. In addition, I provided a toy example comparing GP CaKe with a VAR model. In the current blog post, I’ll go a bit deeper into the actual computations. It makes a lot of sense to read the first post before starting with this one, but hey, I’m not going to stop you from reading this one…

The model

Let me begin by restating the model. GP CaKe describes the dynamics of a time series of observations $x_j(t)$ as:
$$D_j\, x_j(t) = \sum_{i \neq j} \int c_{i \Rightarrow j}(\tau)\, x_i(t - \tau)\, \mathrm{d}\tau + w_j(t)$$
$$D_j = \sum_{p=0}^{P} \alpha_p \frac{\mathrm{d}^p}{\mathrm{d}t^p}$$
where $D_j$ is the differential operator (i.e. it describes the dynamics up to the $P$-th derivative), $w_j(t)$ is Gaussian white noise with variance $\sigma^2$ and $c_{i\Rightarrow j}(\tau)$ is the causal impulse response function (CIRF) describing the causal interaction from $i$ to $j$. The CIRFs are the unobserved quantities that we want to learn.
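
To make this concrete, here is a small, hypothetical numpy simulation of the generative model for two variables, assuming first-order (Ornstein-Uhlenbeck-like) local dynamics and a single connection $1 \Rightarrow 2$. All names, parameter values and the Euler discretization are illustrative choices of mine, not part of the GP CaKe package.

```python
import numpy as np

rng = np.random.default_rng(42)
dt, n = 0.01, 4000
relax, sigma = 2.0, 1.0                      # relaxation rate of D_j and noise scale

lags = np.arange(200) * dt                   # support of the causal response
cirf = 5.0 * lags * np.exp(-lags / 0.05)     # smooth, delayed, strictly causal CIRF

x1, x2 = np.zeros(n), np.zeros(n)
for k in range(1, n):
    eps1, eps2 = rng.normal(scale=sigma * np.sqrt(dt), size=2)
    # x1 evolves on its own: D_1 x_1 = w_1
    x1[k] = x1[k - 1] - relax * x1[k - 1] * dt + eps1
    # x2 has the same local dynamics, plus the convolution of x1 with the CIRF
    past = x1[max(0, k - len(lags)):k][::-1]             # most recent sample first
    drive = np.dot(cirf[:len(past)], past) * dt          # discretized convolution
    x2[k] = x2[k - 1] + (-relax * x2[k - 1] + drive) * dt + eps2
```

Running GP CaKe on x1 and x2 should then recover a response close to cirf for the $1 \Rightarrow 2$ connection, and something flat for the reverse direction.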

The computational aspects

Applying GP CaKe to your data means you want to infer the latent response functions given the time series, or in other words, to compute the posterior $p(c_{i \Rightarrow j}(t) \mid x)$. “Surely,” you think, “fitting a continuous function that is used in a convolution within a stochastic differential equation is more complex than fitting a VAR?” Well, you’ll be surprised. The icing on the GP CaKe comes from transforming these two simple equations into the frequency domain via the Fast Fourier Transform. The result is
$$P_j(\omega)\, x_j(\omega) = \sum_{i \neq j} c_{i \Rightarrow j}(\omega)\, x_i(\omega) + w_j(\omega), \qquad \text{with } P_j(\omega) = \sum_{p=0}^{P} \alpha_p\, (i\omega)^p$$
Here, $P_j(\omega)$ is the differential operator in the spectral domain, for frequencies $\omega$. Importantly, this transformation has a property that is convenient for us here: convolution in the temporal domain becomes multiplication in the spectral domain. Because of this, the spectral representation of the GP CaKe model is quite simple. We can rearrange it some more to obtain
$$x_j(\omega) = \sum_{i \neq j} \frac{x_i(\omega)}{P_j(\omega)}\, c_{i \Rightarrow j}(\omega) + \frac{w_j(\omega)}{P_j(\omega)}$$
Now wait a minute! This equation says that our vector $x_j(\omega)$ (the spectral representation of our original time series) is given by a weighted sum of functions $c_{i \Rightarrow j}(\omega)$, with weights $x_i(\omega) / P_j(\omega)$, plus some (scaled) noise. We encounter this type of equation much more often in the context of Bayesian modelling. It is known as Gaussian process regression, with the slight generalization that we have a weight term in front of the function we want to estimate.
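
The step that makes all of this possible is the convolution property mentioned above. Here is a quick numpy check that convolution in time really is multiplication in the frequency domain (zero-padding the FFTs so the circular convolution matches the ordinary one):

```python
import numpy as np

# sanity check: convolution in time equals multiplication in frequency
rng = np.random.default_rng(0)
x = rng.normal(size=256)                   # a toy signal
c = np.exp(-np.arange(64) / 10.0)          # a toy (causal) response function

n = len(x) + len(c) - 1                    # zero-pad to avoid circular wrap-around
conv_fft = np.real(np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(c, n)))
conv_direct = np.convolve(x, c)

print(np.allclose(conv_fft, conv_direct))  # True
```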

What is very convenient about GP regression, is that we know the closed-form solution of the posterior $p(c_{i\Rightarrow j}(\omega)\mid x)$. This means we do not have to sample this model using Markov chain Monte Carlo, or approximate it via Variational Inference (as with DCM). This makes GP CaKe very convenient to work with. All we need to do is to compute for each connection the expression
$$\bar{c}_{i \Rightarrow j} = \mathbf{K}_i \mathbf{\Gamma}_i^{\dagger} \Big( \sum_{k \neq j} \mathbf{\Gamma}_k \mathbf{K}_k \mathbf{\Gamma}_k^{\dagger} + \mathbf{D} \Big)^{-1} \mathbf{x}_j$$
This is the posterior mean of the causal impulse response function that describes the effect of variable $i$ on variable $j$. The matrices $\mathbf{K}_i$ are the kernels of the Gaussian process that describe how similar the frequencies of the spectrum are. I’ll come back to these in a second. Because of the reshuffling of terms, the $[\mathbf{\Gamma}_i]_{mm} = x_i(\omega_m) / P_j(\omega_m)$ are diagonal matrices that give the GP weight function for each frequency, and $[\mathbf{D}]_{mm} = \sigma^2 / | P_j(\omega_m)|^2$ is the diagonal matrix with the noise (co-)variance. Note that this simply follows from how we have rewritten the model into the frequency domain, and have divided both sides by $P_j(\omega)$. We can interpret what these terms mean in terms of filtering the signal, but for now we’ll simply consider them given.
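
To make the recipe concrete, here is a schematic numpy version of that computation for a single target variable $j$. The inputs (the FFT’d source signals, the FFT’d target signal, the spectral operator $P_j(\omega)$ on the same frequency grid, the noise variance and one kernel matrix per source) are assumed to be given; the function and variable names are my own scaffolding, not the GP CaKe API.

```python
import numpy as np

def posterior_cirf_means(x_sources, x_target, P_j, sigma2, kernels):
    """Posterior mean of each c_{i => j}(omega) for one target variable j."""
    Gammas = [np.diag(x_i / P_j) for x_i in x_sources]     # diagonal GP weight matrices
    D = np.diag(sigma2 / np.abs(P_j) ** 2)                 # scaled noise covariance

    # marginal covariance of x_j(omega): sum_k Gamma_k K_k Gamma_k^H + D
    C = sum(G @ K @ G.conj().T for G, K in zip(Gammas, kernels)) + D

    # posterior mean of c_{i => j}(omega): K_i Gamma_i^H C^{-1} x_j
    return [K @ G.conj().T @ np.linalg.solve(C, x_target)
            for G, K in zip(Gammas, kernels)]
```

An inverse FFT of each returned vector then gives the posterior mean of the corresponding response function back in the time domain.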

Constructing causal kernels

Much more interesting are the kernels of the Gaussian process that describe how similar subsequent points (here: subsequent frequencies, because remember that we are now working in the frequency domain!) are to each other. Here, they indicate how, a priori, we expect causal impulse response functions to behave. This is where the Bayesian modeler becomes excited, because thinking about prior beliefs is where we shine. There are three desiderata for causal response functions:

  1. The CIRF should be mostly temporally localized. By this we mean that a causal effect should be strongest after a short delay, and if the delay is very large the causal effect should diminish.
  2. The CIRF should be smooth. In most physical systems, we expect that a CIRF is not all over the place, but rather gradually increases, and then vanishes again.
  3. Finally, the CIRF should be causal. This implies that the effect from one variable to another should be zero for negative lags; the future of one variable should not influence the past of another. That’d just be weird.

In the spectral domain, we can conveniently express each of these criteria:
$$K_{\mathrm{loc}}(\omega_1, \omega_2) = \exp\!\left(-\vartheta\, (\omega_1 - \omega_2)^2\right) e^{-i\, t_s (\omega_1 - \omega_2)}$$
In this equation, parameter $\vartheta$ determines how correlated frequencies should be, i.e. how localized the CIRF is. The term $t_s$ is called the time shift, which makes sure the causal effect starts roughly at zero. The smoothness can be formalized by ensuring that the CIRF consists mostly of low frequencies:
$$K_{\mathrm{smooth}}(\omega_1, \omega_2) = \exp\!\left(-\nu\, (\omega_1^2 + \omega_2^2)\right)$$
Finally, causality is enforced by adding an imaginary part to the localized kernel that is equal to its Hilbert transform H, as follows:
$$K_{\mathrm{causal}}(\omega_1, \omega_2) = K_{\mathrm{loc}}(\omega_1, \omega_2) + i\, H\!\left[K_{\mathrm{loc}}\right](\omega_1, \omega_2)$$
Effectively, this forces the causal impulse response functions to be zero for $t_2<t_1$. The total kernel is a combination of each of these elements, as we can create new kernels by adding and multiplying existing ones (see the seminal work on GPs by Rasmussen & Williams, 2006):
$$K(\omega_1, \omega_2) = K_{\mathrm{smooth}}(\omega_1, \omega_2)\, K_{\mathrm{causal}}(\omega_1, \omega_2)$$
This kernel we can now compute for each pair of frequencies $\omega_1, \omega_2$. With that kernel, the Fourier-transformed time series $x(\omega)$, and the matrices $\mathbf{\Gamma}$ and $\mathbf{D}$, we can easily compute the posterior response functions. That is pretty much it!
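
As a rough illustration, this is how such a kernel could be assembled on a discrete frequency grid with numpy and scipy. The parametrization mirrors the expressions above, but the scalings, signs and the ordering of the Hilbert-transform and time-shift steps are simplifications of mine; the GP CaKe implementation may differ in these details.

```python
import numpy as np
from scipy.signal import hilbert

def causal_kernel(omega, theta=1.0, t_shift=0.05, nu=1e-4):
    """Toy causal kernel matrix K[m, n] for frequencies omega[m], omega[n]."""
    d_omega = omega[:, None] - omega[None, :]        # frequency differences

    # temporal localization: squared exponential over frequency differences
    k_loc = np.exp(-theta * d_omega ** 2)

    # causality: add an imaginary part given by the Hilbert transform
    # (scipy's hilbert returns x + i*H[x]; conventions here are illustrative)
    k_causal = hilbert(k_loc, axis=1)

    # time shift: a phase factor that translates the response along the lag axis
    k_causal = k_causal * np.exp(-1j * t_shift * d_omega)

    # smoothness: damp high frequencies in both arguments
    taper = np.exp(-nu * omega ** 2)
    return taper[:, None] * k_causal * taper[None, :]
```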

Now that we have constructed a proper kernel that expresses our requirements for (Granger) causality, we can use it to draw samples from a Gaussian process. We’ll use the kernel we’ve just constructed and a mean function that is zero everywhere:
$$z \sim \mathcal{GP}\!\left(0,\, K(\omega_1, \omega_2)\right)$$
The figure below shows five such samples $z$. As you can see, the bulk of the mass of the functions is after the zero-lag point. That means the functions are causal, because there is no influence on the past. Also, the functions are temporally smooth, and decay after a while. Of course, the functions are all over the place, but that is to be expected as we did not condition on any observations. If we do so, we recover causal response functions such as in this figure from the previous post on GP CaKe. Here, the combination of our prior and the observations results in smooth and localized causal responses.
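
Continuing the causal_kernel sketch from above, drawing such prior samples amounts to factorizing the kernel matrix, colouring complex white noise with it, and inverse-FFT’ing back to the lag domain (again an illustrative sketch, with the same caveats about conventions):

```python
import numpy as np

dt, n = 0.01, 512
omega = 2 * np.pi * np.fft.fftfreq(n, d=dt)
order = np.argsort(omega)                                 # build the kernel on sorted frequencies
K = causal_kernel(omega[order], theta=0.5, t_shift=0.05, nu=1e-5)
K = 0.5 * (K + K.conj().T)                                # enforce Hermitian symmetry numerically

eigval, eigvec = np.linalg.eigh(K)                        # robust factorization K ~ L @ L^H
L = eigvec * np.sqrt(np.clip(eigval, 0.0, None))          # clip tiny negative eigenvalues

rng = np.random.default_rng(1)
white = (rng.standard_normal((n, 5)) + 1j * rng.standard_normal((n, 5))) / np.sqrt(2)
z_freq = np.empty((n, 5), dtype=complex)
z_freq[order] = L @ white                                 # restore FFT frequency ordering
z_time = np.fft.ifft(z_freq, axis=0).real                 # five draws; taking .real is a shortcut
lag = np.arange(n) * dt                                   # lag axis, for plotting
```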

What about the hyperparameters?

Of course, there is a catch. We saw that the kernel depends on a number of hyperparameters: the localization $\vartheta$, the time shift $t_s$ and the smoothness $\nu$. Furthermore, we have the variance of the observation noise $\sigma^2$, which we typically do not know. Plus, we haven’t talked about the dynamics that are captured in $D_j$ (or $P_j$ in the spectral domain). Dynamical processes, such as the Ornstein-Uhlenbeck process (see previous post), have their own parameters, such as their mean value and the speed at which they return to this mean. Luckily, all these parameters can be estimated in standard frameworks. For instance, the hyperparameters of the GP kernel can be learned by maximizing the marginal likelihood, which has a convenient closed-form expression in the Gaussian process framework. In GP CaKe, we estimate the parameters of the dynamics as a pre-processing step using maximum likelihood.
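
For the kernel hyperparameters, here is a sketch of what that could look like: wrap a Gaussian (log) marginal likelihood of the spectral observations in a standard optimizer. The objective below reuses causal_kernel and the $\mathbf{\Gamma}$/$\mathbf{D}$ construction from the earlier sketches, shares one kernel across connections for simplicity, and is illustrative rather than the package’s actual estimation code.

```python
import numpy as np
from scipy.optimize import minimize

def negative_log_evidence(log_params, omega, x_sources, x_target, P_j):
    theta, t_shift, nu, sigma2 = np.exp(log_params)         # keep all parameters positive
    K = causal_kernel(omega, theta, t_shift, nu)             # one shared kernel, for simplicity
    Gammas = [np.diag(x_i / P_j) for x_i in x_sources]
    D = np.diag(sigma2 / np.abs(P_j) ** 2)
    C = sum(G @ K @ G.conj().T for G in Gammas) + D          # marginal covariance of x_j(omega)
    _, logdet = np.linalg.slogdet(C)
    quad = np.real(x_target.conj() @ np.linalg.solve(C, x_target))
    return quad + logdet                                     # negative log evidence, up to constants

# usage, assuming x_sources, x_target and P_j live on the frequency grid omega:
# res = minimize(negative_log_evidence, np.log([1.0, 0.05, 1e-4, 1.0]),
#                args=(omega, x_sources, x_target, P_j), method="Nelder-Mead")
```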

Stay tuned for the upcoming release of GP CaKe, in which we have implemented these approaches.

What’s next?

By now, you should have an intuition for the ideas behind GP CaKe, both in how it relates to e.g. VAR models and in the principles behind its efficient computation. In later posts, we’ll describe how the model can be extended to nonlinear forward models (e.g. when we observe action potentials, rather than a smooth signal), and demonstrate some of our empirical results using this method.