<p><em>MindCodec is the general-audience blog of the <a href="https://www.ru.nl/cai/">Cognitive Artificial Intelligence Department</a>, <a href="http://www.ru.nl/donders">Donders Institute for Brain, Cognition and Behaviour</a>, <a href="http://www.ru.nl/english">Radboud University</a>, Nijmegen, Netherlands.</em></p>
<h1>GP CaKe: An introduction to Granger causality with Gaussian processes, part II</h1>
<p>In <a href="../../../../2018/10/22/an-introduction-to-causal-inference-with-gaussian-processes-part-i/">the previous installment of this series</a>, I explained the ideas behind the GP CaKe model and provided a toy example comparing GP CaKe with a VAR model. In this post, I’ll go a bit deeper into the actual computations. It makes a lot of sense to read the first post before starting with this one, but hey, I’m not going to stop you from reading this one…</p>
<h1 id="the-model">The model</h1>
<p>Let me begin by restating the model. GP CaKe describes the dynamics of a time series of observations $x_j(t)$ as:</p>
<script type="math/tex; mode=display">D_j x_j (t) = \sum_{i\neq j} \int_0^{\infty} c_{i \Rightarrow j}(\tau) x_i(t - \tau) \text{d} \tau + w_j(t),</script>
<p>where $D_j$ is the differential operator (i.e. it describes the dynamics up to the $P$-th derivative), $w_j(t)$ is Gaussian white noise with variance $\sigma^2$, and $c_{i\Rightarrow j}(\tau)$ is the causal impulse response function (CIRF) describing the causal interaction from $i$ to $j$. The CIRFs are the unobserved quantities that we want to learn.</p>
<h1 id="the-computational-aspects">The computational aspects</h1>
<p>Applying GP CaKe to your data means you want to infer the latent response functions given the time series, or in other words, to compute the posterior $p(c_{i \Rightarrow j}(t) \mid x)$. “Surely,” you think, “fitting a continuous function that is used in a convolution within a stochastic differential equation is more complex than fitting a VAR?” Well, you’ll be surprised. The icing on the GP CaKe comes from transforming these two simple equations into the frequency domain via the <a href="https://en.wikipedia.org/wiki/Fast_Fourier_transform">Fast Fourier Transform</a>. The result is</p>
<script type="math/tex; mode=display">P_j(\omega) x_j(\omega) = \sum_{i \neq j} c_{i \Rightarrow j}(\omega)x_i(\omega) + w_j(\omega).</script>
<p>Here, $P_j(\omega)$ is the differential operator in the spectral domain, for frequencies $\omega$. Importantly, this transformation has a property that is convenient for us here: convolution in the temporal domain becomes multiplication in the spectral domain. Because of this, the spectral representation of the GP CaKe model is quite simple. We can rearrange it some more to obtain</p>
<script type="math/tex; mode=display">x_j(\omega) = \sum_{i\neq j} \frac{x_i(\omega)}{P_j(\omega)} c_{i \Rightarrow j}(\omega) + \frac{w_j(\omega)}{P_j(\omega)}.</script>
<p>Now wait a minute! This equation says that our vector $x_j(\omega)$ (the spectral representation of our original time series) is given by a weighted sum of functions $c_{i \Rightarrow j}(\omega)$, with weights $x_i(\omega) / P_j(\omega)$, plus some (scaled) noise. We encounter this type of equation much more often in the context of Bayesian modelling. It is known as <strong>Gaussian process regression</strong>, with the slight generalization that we have a weight term in front of the function we want to estimate.</p>
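<p>The convolution-to-multiplication property doing the heavy lifting here is easy to verify numerically. Below is a small self-contained check, using circular convolution, which is what the discrete Fourier transform actually implements:</p>

```python
import numpy as np

# Verify: convolution in the time domain equals elementwise multiplication
# in the frequency domain (for the DFT this holds for *circular* convolution).
rng = np.random.default_rng(0)
n = 256
x = rng.standard_normal(n)            # a signal x_i(t)
c = np.exp(-np.arange(n) / 10.0)      # a decaying impulse response c(tau)

# Via the frequency domain: multiply spectra, transform back
conv_freq = np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# Via direct summation in the time domain
conv_time = np.array([sum(c[tau] * x[(t - tau) % n] for tau in range(n))
                      for t in range(n)])

assert np.allclose(conv_freq, conv_time)
```

<p>With zero-padding instead of the modulo wrap-around, the same identity yields ordinary (linear) convolution.</p>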
<p>What is very convenient about GP regression, is that we know the closed-form solution of the posterior $p(c_{i\Rightarrow j}(\omega)\mid x)$. This means we do not have to sample this model using Markov chain Monte Carlo, or approximate it via Variational Inference (as with DCM). This makes GP CaKe very convenient to work with. All we need to do is to compute for each connection the expression</p>
<script type="math/tex; mode=display">\mathbb{E}\left[c_{i\Rightarrow j}(\omega) \mid \mathbf{x} \right] = \mathbf{K}_i \mathbf{\Gamma}_i^T \left(\sum_i \mathbf{\Gamma}_i \mathbf{K}_i \mathbf{\Gamma}_i^T + \mathbf{D}\right)^{-1} \mathbf{x}.</script>
<p>This is the posterior mean of the causal impulse response function that describes the effect of variable $i$ on variable $j$. The matrices $\mathbf{K}_i$ are the kernels of the Gaussian process that describe how similar the frequencies of the spectrum are; I’ll come back to these in a second. Because of the reshuffling of terms, the $[\mathbf{\Gamma}_i]_{mm} = x_i(\omega_m) / P_j(\omega_m)$ are diagonal matrices that give the GP weight function for each frequency, and $[\mathbf{D}]_{mm} = \sigma^2 / | P_j(\omega_m)|^2$ is the diagonal matrix with the noise (co-)variance. Note that this simply follows from how we have rewritten the model into the frequency domain and divided both sides by $P_j(\omega)$. We can interpret these terms in the language of signal filtering, but for now we’ll simply consider them given.</p>
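<p>In code, this posterior mean is only a few lines of linear algebra. Below is a minimal numpy sketch for the two-variable case (so the sum over $i$ has a single term); the function and argument names are mine, not those of the GP CaKe package:</p>

```python
import numpy as np

def cirf_posterior_mean(K, Gamma, D, x):
    """Posterior mean of one CIRF spectrum (hypothetical two-variable case,
    so the sum over sources reduces to a single term).

    K     : (m, m) kernel matrix k(omega_a, omega_b)
    Gamma : (m,)   GP weights x_i(omega) / P_j(omega)
    D     : (m,)   noise variances sigma^2 / |P_j(omega)|^2
    x     : (m,)   spectrum of the target series x_j(omega)
    """
    G = np.diag(Gamma)
    S = G @ K @ G.conj().T + np.diag(D)        # marginal covariance of x_j
    return K @ G.conj().T @ np.linalg.solve(S, x)
```

<p>For a full network one would stack the $\mathbf{\Gamma}_i \mathbf{K}_i \mathbf{\Gamma}_i^T$ contributions of all incoming connections inside the inverse.</p>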
<h1 id="constructing-causal-kernels">Constructing causal kernels</h1>
<p>Much more interesting are the kernels of the Gaussian process that describe how similar subsequent points (here: subsequent frequencies, because remember that we are now working in the frequency domain!) are to each other. Here, they indicate how, a priori, we expect causal impulse response functions to behave. This is where the Bayesian modeler becomes excited, because thinking about prior beliefs is where we shine. There are three desiderata for causal response functions:</p>
<ol>
<li>The CIRF should be mostly temporally localized. By this we mean that a causal effect should be strongest after a short delay, and if the delay is very large the causal effect should diminish.</li>
<li>The CIRF should be smooth. In most physical systems, we expect that a CIRF is not all over the place, but rather gradually increases, and then vanishes again.</li>
<li>Finally, the CIRF should be causal. This implies that the effect from one variable to another should be zero for negative lags; the future of one variable should not influence the past of another. That’d just be weird.</li>
</ol>
<p>In the spectral domain, we can conveniently express each of these criteria:</p>
<script type="math/tex; mode=display">k_{\text{local}}(\omega_1, \omega_2) = e^{-\frac{\vartheta}{2}(\omega_2-\omega_1)^2 + i t_s (\omega_2 - \omega_1)} .</script>
<p>In this equation, parameter $\vartheta$ determines how correlated frequencies should be, i.e. how localized the CIRF is. The term $t_s$ is called the time shift, which makes sure the causal effect starts roughly at zero. The smoothness can be formalized by ensuring that the CIRF consists mostly of low frequencies:</p>
<script type="math/tex; mode=display">k_{\text{smooth}}(\omega_1, \omega_2) = e^{-\frac{\nu}{2} (\omega_1^2 + \omega_2^2)}.</script>
<p>Finally, causality is enforced by adding an imaginary part to the localized kernel that is equal to its <a href="https://en.wikipedia.org/wiki/Hilbert_transform">Hilbert transform</a> H, as follows:</p>
<script type="math/tex; mode=display">k_{\text{causal}}(\omega_1, \omega_2) = k_{\text{local}}(\omega_1, \omega_2) + i H k_{\text{local}}(\omega_1, \omega_2).</script>
<p>Effectively, this enforces the causal impulse response functions to be zero for negative lags. The total kernel is a combination of each of these elements, as we can create new kernels by adding and multiplying existing ones (see <a href="http://www.gaussianprocess.org/gpml/">the seminal work on GPs by Rasmussen & Williams, 2006</a>):</p>
<script type="math/tex; mode=display">k_{\text{CIRF}}(\omega_1, \omega_2) = k_{\text{smooth}}(\omega_1, \omega_2)k_{\text{causal}}(\omega_1, \omega_2).</script>
<p>This kernel we can now compute for each pair of frequencies $\omega_1, \omega_2$. With that kernel, the Fourier transformed time series $x(\omega)$, the matrices $\mathbf{\Gamma}$ and $\mathbf{D}$, we can easily compute the posterior response functions. That is pretty much it!</p>
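<p>A hypothetical discretization of the three ingredients might look as follows. The parameter values, the frequency grid, and in particular the axis along which the Hilbert transform is taken are my assumptions, not the reference implementation:</p>

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_transform(a, axis=-1):
    # H(a): imaginary part of the analytic signal; applied to the real and
    # imaginary parts separately, since scipy.signal.hilbert expects real input
    return (np.imag(hilbert(a.real, axis=axis))
            + 1j * np.imag(hilbert(a.imag, axis=axis)))

def cirf_kernel(omega, theta=0.05, nu=0.01, t_s=0.05):
    """Discretized k_CIRF on a frequency grid (a sketch, not the reference code)."""
    w1, w2 = np.meshgrid(omega, omega, indexing="ij")
    # Localization (with time shift t_s) and smoothness, as in the formulas above
    k_local = np.exp(-0.5 * theta * (w2 - w1) ** 2 + 1j * t_s * (w2 - w1))
    k_smooth = np.exp(-0.5 * nu * (w1 ** 2 + w2 ** 2))
    # Causality: add i times the Hilbert transform of the localized kernel
    k_causal = k_local + 1j * hilbert_transform(k_local, axis=1)
    return k_smooth * k_causal

omega = 2 * np.pi * np.fft.fftfreq(128, d=0.01)   # assumed frequency layout
K = cirf_kernel(omega)
```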
<p>Now that we have constructed a proper kernel that expresses our requirements for (Granger) causality, we can use it to draw samples from a Gaussian process. We’ll use the kernel we’ve just constructed and a mean function that is zero everywhere:</p>
<script type="math/tex; mode=display">z \sim GP(0, k_{\text{CIRF}}(\omega_1, \omega_2)).</script>
<p>The figure below shows five such samples $z$. As you can see, the bulk of the mass of the functions lies after the zero-lag point. That means the functions are causal, because there is no influence on the past. Also, the functions are temporally smooth, and decay after a while. Of course, the functions are all over the place, but that is to be expected as we did not condition on any observations. If we do so, we recover causal response functions such as in this figure from the previous post on GP CaKe. Here, the combination of our prior and the observations results in smooth and localized causal responses.</p>
<p><img src="/images/artboard-1.png" alt="" /></p>
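<p>For intuition, here is how such samples are drawn in practice. This is a real-valued analogue with a squared-exponential kernel; sampling from the complex-valued causal kernel follows the same Cholesky recipe but is skipped here:</p>

```python
import numpy as np

# Draw five functions from GP(0, k) with a squared-exponential kernel
t = np.linspace(0.0, 1.0, 200)
K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / 0.05 ** 2)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(t)))   # jitter for numerical stability
rng = np.random.default_rng(1)
samples = L @ rng.standard_normal((len(t), 5))      # five draws, as in the figure
```

<p>Each column of <code>samples</code> is one smooth random function; conditioning on data only changes the mean and covariance fed into the same recipe.</p>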
<h1 id="what-about-the-hyperparameters">What about the hyperparameters?</h1>
<p>Of course, there is a catch. We saw that the kernel depends on a number of hyperparameters: the localization $\vartheta$, the time shift $t_s$ and the smoothness $\nu$. Furthermore, we have the variance of the observation noise $\sigma^2$, which we typically do not know. Plus, we haven’t talked about the dynamics that are captured in $D_j$ (or $P_j$ in the spectral domain). Dynamical processes, such as the Ornstein-Uhlenbeck process (see the previous post), have their own parameters, such as their mean value and the speed at which they return to this mean. Luckily, all these parameters can be estimated in standard frameworks. For instance, the hyperparameters of the GP kernel can be learned by maximizing the marginal likelihood, which also has a convenient closed-form expression in the Gaussian process framework. In GP CaKe, we estimate the parameters of the dynamics as a pre-processing step using maximum likelihood.</p>
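<p>To illustrate the mechanics of marginal-likelihood optimization, here is a sketch on a simpler real-valued GP with an RBF kernel (not GP CaKe’s complex causal kernel); the data and initial values are made up:</p>

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, t, y):
    ell, sigma2 = np.exp(log_params)          # length scale, noise variance
    K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / ell ** 2)
    Ky = K + sigma2 * np.eye(len(t))
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # Ky^{-1} y
    # -log p(y) = 0.5 y' Ky^{-1} y + 0.5 log|Ky| + (n/2) log 2*pi
    return (0.5 * y @ alpha + np.sum(np.log(np.diag(L)))
            + 0.5 * len(t) * np.log(2 * np.pi))

rng = np.random.default_rng(2)
t = np.linspace(0, 5, 60)
y = np.sin(t) + 0.1 * rng.standard_normal(len(t))

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 0.1]), args=(t, y))
ell_hat, sigma2_hat = np.exp(res.x)
```

<p>Optimizing in log-space keeps the length scale and noise variance positive without explicit constraints.</p>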
<p>Stay tuned for the upcoming new release of GP CaKe, in which we have implemented these approaches.</p>
<h1 id="whats-next">What’s next?</h1>
<p>By now, you should have an intuition for the ideas behind GP CaKe, both in how it relates to e.g. VAR models, and what the principles are behind its efficient computation. In later posts, we’ll describe how the model can be extended to nonlinear forward models (e.g. when we observe action potentials, rather than a smooth signal), and demonstrate some of our empirical results using this method.</p>
<p><em>09 Dec 2018 · <a href="https://mindcodec.ai/2018/12/09/an-introduction-to-granger-causality-using-gaussian-processes-part-ii/">permalink</a></em></p>
<h1>GP CaKe: An introduction to Granger causality with Gaussian processes, part I</h1>
<h1 id="introduction">Introduction</h1>
<p>Recently, we developed a novel approach for Granger causality in time series data [<a href="http://papers.nips.cc/paper/6696-gp-cake-effective-brain-connectivity-with-causal-kernels.pdf">Ambrogioni et al., 2017</a>]. We call this method ‘GP CaKe’, which stands for <em>Gaussian Processes with Causal Kernels</em>. It not only has a tasty acronym, but also provides an elegant combination of the attractive features of vector autoregression models (VAR) and dynamical systems theory (DST). Yes, you can have your cake and eat it too! We developed the method with applications for effective connectivity (i.e. the study of causal interactions between brain regions) in mind, but it is completely generic and can also be applied elsewhere (if you are applying GP CaKe to an interesting problem, please let us know!). In this first post of a series, I’ll explain the ideas behind GP CaKe. In forthcoming posts I’ll proceed with a step-by-step tutorial on how to use <a href="https://github.com/LucaAmbrogioni/GP-CaKe-project">the code we provide on GitHub</a>, and later on I’ll expand on extensions of this model.</p>
<h1 id="background-analysis-of-multivariate-time-series">Background: analysis of multivariate time series</h1>
<p>The context of this work is the study of complex systems with a temporal dimension. In our case, this could be successive measurements of activation of multiple brain regions, via e.g. EEG, MEG or fMRI (the latter brings some additional challenges, but we’ll ignore these for now), but you can also think of successive stock exchange listings, weather phenomena, changing protein concentrations and many more. Throughout statistics and machine learning, there are two prominent ways of modelling such time series of complex systems: vector autoregression (VAR) [<a href="https://www.springer.com/gp/book/9783662026915">Lütkepohl, 2005</a>] and dynamical systems theory (DST), which is usually implemented via (stochastic) differential equations (SDEs) or difference equations [<a href="https://www.amazon.com/Dynamics-Geometry-Behavior-Ralph-Abraham/dp/0201567172">Abraham & Shaw, 1983</a>]. We briefly recap each of these approaches, to <span style="text-decoration: line-through;">whet your appetite</span> illustrate how GP CaKe relates to them. If you are already familiar with these methods, feel free to skip to the part of this post about GP CaKe itself!</p>
<h2 id="vector-autoregression">Vector autoregression</h2>
<p>In its most basic formulation, VAR predicts the value for a particular variable $x_j(t)$ at time $t$ as a (stochastic) function of the <em>other</em> variables $x_i$ via the equation (often, this equation is written in matrix notation so that the interactions between all variables are described at once, but for the sake of exposition we consider here only one target variable at a time):</p>
<script type="math/tex; mode=display">x_j(t) = \sum_{\tau=1}^L \sum_{i \neq j} a_{ij}(\tau) x_i(t - \tau) + w(t) \enspace.</script>
<p>This equation has the following meaning. The signal of the variable $x_j(t)$ depends on the input this variable gets from all other variables. The strength of this dependence is captured by the autoregression coefficient $a_{ij}(\tau)$. The parameter $\tau$ is the <em>lag</em> between the signals of $x_j(t)$ and $x_i(t)$. Together, this captures that the effect of one variable on another can, for example, be zero at $\tau=0$, then slowly increase (i.e. larger $a_{ij}(\tau)$), only to decay again when the lag becomes large – this reflects that something that happened far in the past is no longer relevant. If we plot these coefficients as a function of the lag, we obtain what is known as an <em>impulse response function</em> (IRF). Finally, the term $w(t)$ describes the random ‘innovations’ or ‘shocks’ that drive the system. They can reflect the internal dynamics of $x_j(t)$. For instance, the weather in our country is influenced by the weather conditions of the surrounding countries (i.e. they have a causal effect on our climate), but also by processes internal to our country. If $a_{ij}(\tau)>0$, we say that $x_i$ has a causal effect on $x_j$ (a practical implementation of this idea would need some test for significance). This implies a <em>temporal</em> notion of causality; the past of one variable informs us about the future of another. This perspective on causality is known as <em>Wiener-Granger causality</em>, or sometimes simply Granger causality [<a href="https://www.sciencedirect.com/science/article/pii/S1053811910002272?via%3Dihub">Bressler & Seth, 2011</a>]. By looking at the IRF, we can see precisely what temporal shape a Granger-causal interaction takes:</p>
<p><img src="/images/an-introduction-to-causal-inference-with-gaussian-processes-part-i-2.png" alt="An-introduction-to-causal-inference-with-gaussian-processes-part-I-2.png" />
<em>Example of VAR analysis on three financial indices. The leftmost figure shows the time series of each of these variables, while the rightmost figure shows the impulse response functions for a maximum lag of 10 months. Note that the self-responses are also included.</em></p>
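<p>Fitting the VAR equation above for one target variable is ordinary least squares on a lagged design matrix. A minimal sketch (function and variable names are mine, and this is the plain, non-regularized estimator):</p>

```python
import numpy as np

def fit_var_coefficients(x_others, x_target, L=10):
    """Estimate a_ij(tau), tau = 1..L, for one target series by least squares."""
    T = len(x_target)
    # Design matrix: one column per source series and lag, x_i(t - tau)
    X = np.column_stack([x_others[i][L - tau : T - tau]
                         for i in range(len(x_others))
                         for tau in range(1, L + 1)])
    y = x_target[L:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef.reshape(len(x_others), L)   # one impulse response per source

# Example: x2 depends on x1 at lag 2
rng = np.random.default_rng(0)
x1 = rng.standard_normal(500)
x2 = np.zeros(500)
x2[2:] = 0.5 * x1[:-2] + 0.01 * rng.standard_normal(498)
coef = fit_var_coefficients([x1], x2, L=4)   # coef[0, 1] = a(tau=2), close to 0.5
```

<p>Plotting a row of <code>coef</code> against the lag gives exactly the kind of impulse response function shown in the figure.</p>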
<h2 id="dynamical-systems-theory">Dynamical systems theory</h2>
<p>As the name implies, DST specifically models the <em>dynamics</em> of a system. Consider for example the canonical Ornstein-Uhlenbeck process, which is given by</p>
<script type="math/tex; mode=display">\text{d}x = (\mu - x)\text{d}t + \text{d} w(t) \enspace,</script>
<p>and describes a random-walk process that as time progresses asymptotically reverts back to its mean $\mu$.</p>
<p><img src="/images/an-introduction-to-causal-inference-with-gaussian-processes-part-i-3.png" alt="An-introduction-to-causal-inference-with-gaussian-processes-part-I-3.png" />
<em>Five Ornstein-Uhlenbeck processes all returning (asymptotically) to the same mean μ=0.8, but with their own noise levels and their own speed at which to revert to the mean.</em></p>
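<p>Paths like those in the figure are straightforward to simulate with the Euler-Maruyama scheme. The relaxation rate <code>alpha</code> and noise scale <code>sigma</code> below are illustrative additions (the display equation above implicitly fixes both to one):</p>

```python
import numpy as np

def simulate_ou(mu=0.8, alpha=2.0, sigma=0.3, x0=0.0, dt=0.01, n=1000, seed=0):
    """Euler-Maruyama simulation of an Ornstein-Uhlenbeck process."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    x[0] = x0
    for t in range(1, n):
        # dx = alpha * (mu - x) dt + sigma dW
        x[t] = x[t-1] + alpha * (mu - x[t-1]) * dt \
               + sigma * np.sqrt(dt) * rng.standard_normal()
    return x

path = simulate_ou()   # reverts from x0 = 0 towards mu = 0.8
```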
<p>DST is also used in dynamic causal modelling (DCM) [<a href="https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1000033">Friston, 2009</a>]. While most implementations of DCM contain a forward model that is specific for fMRI, limiting the use of DCM to neuroimaging studies, at its core lies a generic system of differential equations:</p>
<script type="math/tex; mode=display">\frac{\text{d}\mathbf{x}}{\text{d}t} = \left( \mathbf{A} + \sum_j u_j \mathbf{B}_j \right)\mathbf{x} + \mathbf{C}u \enspace.</script>
<p>Note that here $\mathbf{x} = (x_1, \ldots, x_n)$. Furthermore, $\mathbf{A}$ is a matrix that contains the fixed interactions between the variables in $\mathbf{x}$. Its role is similar to the autoregression coefficients of the VAR model, but there is no lag modelled in the DCM. Instead, the dynamics are affected instantaneously. The other terms $\mathbf{B}$ and $\mathbf{C}$ account for (node-specific) exogenous input $u$, which we will not discuss in detail here as there isn’t (yet) an analogue for these terms in GP CaKe (If need be, they are easy to add in forthcoming extensions to GP CaKe. Stay tuned!).</p>
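<p>To make the state equation concrete, here is a simple Euler-integration sketch with made-up two-node matrices; this is a toy version of the equation above, not an actual DCM implementation:</p>

```python
import numpy as np

def simulate_dcm(A, B_list, C, u, x0, dt=0.01):
    """Euler integration of dx/dt = (A + sum_j u_j B_j) x + C u."""
    n_steps = u.shape[0]
    x = np.zeros((n_steps, len(x0)))
    x[0] = x0
    for t in range(1, n_steps):
        J = A + sum(u[t-1, j] * B_list[j] for j in range(len(B_list)))
        x[t] = x[t-1] + dt * (J @ x[t-1] + C @ u[t-1])
    return x

A = np.array([[-1.0, 0.0],
              [ 0.5, -1.0]])             # node 1 drives node 2, both self-decay
B_list = [np.zeros((2, 2))]              # no input-modulated coupling here
C = np.array([[1.0], [0.0]])             # exogenous input enters node 1 only
u = np.zeros((500, 1)); u[50:100] = 1.0  # a brief input pulse
x = simulate_dcm(A, B_list, C, u, x0=np.zeros(2))
```

<p>Note how node 2 responds to the pulse only through node 1, and instantaneously in rate: there is no lag term anywhere in these dynamics.</p>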
<h1 id="continuous-and-dynamic-vector-autoregression-gp-cake">Continuous and dynamic vector autoregression: GP CaKe</h1>
<p>Enough preliminaries, it’s time to shake & bake some cake. (Those who get that reference are cordially invited to movie night.) The difficulty with VAR models is that in practice we do not have enough observations to reliably estimate the autoregression coefficients. Consequently, our impulse response functions will be noisy and difficult to interpret. Furthermore, the VAR model only gives a crude description of the dynamics of the system: higher-order interactions are completely ignored. The DCM does consider dynamics more extensively, but it has the issue of not modelling the delay between the changes in one variable and the change in dynamics of the other variable. Some variants of DCM do include a lag term, but keep it constant, rather than estimating interaction coefficients over a range of lags. As you have probably guessed by now, GP CaKe combines lagged interactions with dynamical systems. Let’s take a look. The bread-and-butter of GP CaKe is given by</p>
<script type="math/tex; mode=display">D_j x_j(t) = C_j(t)+w_j(t) \enspace,</script>
<p>where $D_j$ is the differential operator (i.e. it describes the dynamics up to the $p$-th derivative), $w_j(t)$ is again an innovation or shock term and crucially, $C_j(t)$ is the sum of the causal effects from other variables $i\neq j$:</p>
<script type="math/tex; mode=display">C_j(t) = \sum_{i\neq j} \int_0^{\infty} c_{i\Rightarrow j}(\tau)\, x_i(t-\tau)\, \text{d}\tau \enspace,</script>
<p>in which $c_{i\Rightarrow j}(\tau)$ is the <em>causal impulse response function</em> (CIRF) describing the causal interaction from $i$ to $j$. You may recognize that $C_j(t)$ is a sum (over all input variables) of time series convolved with their impulse response functions. This definition is completely analogous to the first term on the right-hand side of the VAR model, but is continuous instead of discrete. However, GP CaKe is not simply a continuous variant of VAR. The differential operator $D_j$ seems innocuous, but plays a crucial role. It describes the <em>internal dynamics</em> of a variable, regardless of the input it receives from other variables – and we haven’t even stated yet what these dynamics are! Several (in fact: limitless) options are possible; for example, these dynamics can be the simple Ornstein-Uhlenbeck random walk we saw earlier, or an oscillatory process. In any case, it’s important to remember that GP CaKe assumes that the input from other variables affects, through the causal impulse response function, the <em>dynamics</em> $D_j x_j(t)$, not just $x_j(t)$ itself!</p>
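<p>As a toy illustration of this generative model, a discretized two-variable simulation might look as follows. All constants, and the unit-area normalization of the CIRF, are my own choices for the example:</p>

```python
import numpy as np

# x_B follows Ornstein-Uhlenbeck dynamics on its own; x_A's OU dynamics are
# additionally driven by x_B convolved with a CIRF c(tau) = s^2 tau exp(-s tau).
rng = np.random.default_rng(3)
dt, n, alpha, s = 0.01, 4000, 4.0, 20.0
tau = np.arange(0.0, 1.0, dt)
cirf = s ** 2 * tau * np.exp(-s * tau)      # normalized to (roughly) unit area

x_B = np.zeros(n)
for t in range(1, n):
    x_B[t] = x_B[t-1] - alpha * x_B[t-1] * dt + np.sqrt(dt) * rng.standard_normal()

# Causal input C_A(t): discretized convolution of x_B with the CIRF
drive = dt * np.convolve(x_B, cirf)[:n]

x_A = np.zeros(n)
for t in range(1, n):
    x_A[t] = x_A[t-1] + (drive[t-1] - alpha * x_A[t-1]) * dt \
             + 0.2 * np.sqrt(dt) * rng.standard_normal()
```

<p>Correlating $x_A$ with a slightly time-shifted $x_B$ now reveals the causal direction: the past of $x_B$ predicts the present of $x_A$, not the other way around.</p>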
<h2 id="lets-put-it-to-work">Let’s put it to work</h2>
<p>In the next post, I’ll explain how to <em>compute</em> the causal impulse response functions, which relates back to our earlier post about <a href="https://www.mindcodec.com/the-fourier-transform-through-the-lens-of-gaussian-process-regression/">the Fourier transform in Gaussian process regression</a>. For now, we’ll assume that we have a tool available to do this for us (<a href="https://github.com/LucaAmbrogioni/GP-CaKe-project">which we do</a>), without going into the details too much, and give a little demonstration. We’ll generate some data with a known impulse response function, and try to recover it using both VAR and GP CaKe. Note that there are better implementations available than standard, non-regularized, VAR, but for educational purposes this will suffice. We start with two variables, $x_A$ and $x_B$, with a causal effect</p>
<script type="math/tex; mode=display">% <![CDATA[
c_{B\Rightarrow A}(\tau) = \begin{cases} \tau \exp(-\tau s) & \text{if $\tau>0$,}\\ 0 & \text{otherwise,}\end{cases} \enspace %]]></script>
<p>where $\tau$ is again the time lag between the two variables, and $s$ is the length scale of the impulse response (the shape of this function is shown in red in the figure below). For our purposes right now, this is an arbitrary variable for which we just pick some value. Furthermore, we assume that the internal dynamics of both variables are Ornstein-Uhlenbeck processes, so that</p>
<script type="math/tex; mode=display">D_j = \frac{\text{d}}{\text{d}t} + \alpha, \quad j \in \{A, B\} \enspace.</script>
<p>Here, $\alpha$ is the relaxation coefficient of the process, indicating how quickly the time series revert back to their (zero) mean. We generate 100 samples for the dynamical system, with a total length of 4 seconds and a sampling frequency of 100 Hz. We then recover the impulse response function using a 100-lags VAR model (i.e. 1 second), and GP CaKe. There are three important parameters for GP CaKe, reflecting the temporal smoothness, temporal localization and noise level of the response function, but we’ll discuss them in detail in the next post as well. For now, I’ve simply manually set these parameters to reasonable values; in practice we would estimate them from data or set them via knowledge of the application context. Figure 3 shows the results of the simulation. As you can see, both approaches can distinguish between a present and absent connection reasonably well (note the different vertical axes of the plots). For the present connection, both methods recover its shape to some extent, but GP CaKe gives a much smoother and more reliable result than VAR. Also, the response function has no hard cut-off after 1 second.</p>
<p><img src="/images/an-introduction-to-causal-inference-with-gaussian-processes-part-i-4.png" alt="An-introduction-to-causal-inference-with-gaussian-processes-part-I-4.png" />
<em>The recovered causal impulse response functions for the 99-lags VAR model and GP CaKe. The red lines indicate the ground truth interaction, and the green lines the expectation of the recovery averaged over 100 samples. The shaded areas show the 95% confidence intervals.</em></p>
<p>This simulation gives a nice starting point for applying GP CaKe to empirical data. We see that the result is a much smoother, and more reliable, estimate. This does require that we learn the hyperparameters that determine the smoothness, localization and noise level of the response function. We’ll return to this topic, and to the actual <em>computation</em> of the response functions, in the next post!</p>
<p><em>22 Oct 2018 · <a href="https://mindcodec.ai/2018/10/22/an-introduction-to-causal-inference-with-gaussian-processes-part-i/">permalink</a></em></p>
<h1>Datasets for mind-reading</h1>
<p>Since 2017 there has been <a href="https://mindcodec.com/using-gans-brain-reading/">a new wave of visual reconstruction research</a> (which can be seen as a form of ‘mind reading’), mostly driven by the popularity of powerful new generative models. Due to the obvious link to modern neural networks and machine learning, these papers have also gained wide attention within the machine learning community. This article is a short introduction to the nature of functional MRI data and to open data sets that are available to people who want to contribute to this field but are new to neuroimaging research. Currently these data sets are spread all over the internet, hidden deep inside various repositories and lab websites, so it is not obvious that data on which new ideas can be developed is in fact easily accessible.</p>
<p><em>(‘mind reading’ defined as reconstruction of visual perception)</em></p>
<h3 id="the-nature-of-functional-mri-data">The nature of functional MRI data</h3>
<p>There are various neuroimaging techniques, but <a href="https://en.wikipedia.org/wiki/Functional_magnetic_resonance_imaging">functional MRI</a> is the most spatially accurate technique we have for healthy human participants (among non-invasive techniques, i.e. as long as you don’t stick electrodes into somebody’s brain). It measures the blood oxygenation level dependent (BOLD) signal in a 3D-localized piece of brain matter, which changes due to local energy consumption. You can measure it with MRI because deoxygenated blood has different magnetic properties than oxygenated blood. The BOLD signal seems to be interpretable as local brain activity, and you can make the same assumption when you use this data for machine learning.</p>
<p><img src="/images/datasets-for-mind-reading-2.png" alt="Datasets-for-mind-reading-2.png" /></p>
<p>Often the authors will already have sub-selected the necessary voxels (the local activity <em>features</em>, measured inside 3D pixels) for you, so every data point will just be an easy-to-handle long vector of local activity values. If you get a 3D (or, for video data, 4D) matrix in an obscure neuroimaging format like NIfTI (<code class="highlighter-rouge">.nii</code>) or DICOM, the authors provided you with the full functional MRI image. This is meant for sub-selecting either gray matter or interesting regions of interest (ROIs, e.g. various regions of the visual system), as much of that 3D box will not contain brain matter; the necessary masks should have been provided together with the data. In case you got NIfTIs, you can also get a blurry impression of the participant’s brain by loading the <code class="highlighter-rouge">.nii</code> file into viewers such as <a href="https://www.nitrc.org/projects/mricron">MRIcron</a>.</p>
<p>fMRI is far from an optimal measure of brain activity. As fMRI measures blood oxygenation properties that only occur after local energy expenditure, and as the peak of the signal is the influx of fresh oxygenated blood lagging behind by several seconds, you will only ever get an indirect measure. Next to low temporal accuracy, it also suffers from several other sources of physiological noise (e.g. heartbeat, the location of big brain arteries, participant movements at a scale of millimeters, the brain moving or being squished inside the skull…). Also, a typical voxel contains millions of actual neurons and glial cells and billions of synapses. However, as the brain and especially the visual system have complex and very localized blood supply, you still get a lot of fine-grained activity detail. The signal-to-noise ratio seems to be sufficient for reconstructing what somebody sees after all.</p>
<p>The delay of the BOLD signal is described by the hemodynamic response function (HRF). The function you see below is the canonical hemodynamic response function (as available in various open source neuroimaging software modules):</p>
<p><img src="/images/datasets-for-mind-reading-3.png" alt="Datasets-for-mind-reading-3.png" /></p>
<p>It can be described as a combination of two gamma distributions, one modeling the peak (overshoot) and one the undershoot:</p>
<script type="math/tex; mode=display">h(t) = A\left(\frac{t^5 e^{-t}}{\Gamma(6)} - \frac{t^{15} e^{-t}}{6\,\Gamma(16)}\right)</script>
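<p>This formula translates directly into code. In the sketch below, the sampling rate, stimulus timing and amplitude are made up for illustration:</p>

```python
import numpy as np
from scipy.special import gamma as Gamma

def canonical_hrf(t, A=1.0):
    """Double-gamma HRF from the formula above (t in seconds)."""
    return A * (t ** 5 * np.exp(-t) / Gamma(6)
                - t ** 15 * np.exp(-t) / (6 * Gamma(16)))

dt = 0.1                              # 10 Hz sampling, illustrative
t = np.arange(0.0, 30.0, dt)
h = canonical_hrf(t)

stimulus = np.zeros_like(t)
stimulus[20:40] = 1.0                 # stimulus on screen from 2 s to 4 s
bold = np.convolve(stimulus, h)[:len(t)] * dt   # s(t) = (s_0 * h)(t)
```

<p>The predicted BOLD response peaks several seconds after stimulus onset and shows the characteristic late undershoot.</p>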
<p>The peak of the BOLD signal (amplitude $A$) usually needs to be aligned with the stimulus occurring a few seconds before, which is done by temporal convolution with the HRF: $s(t) = (s_0 \ast h) (t)$ (with $s_0(t)$ being the unconvolved stimulus signal and $h(t)$ the canonical HRF, at time $t$). The HRF can vary quite a bit depending on local or temporal physiological differences, so the canonical HRF model is considered inflexible. If feasible, it is best to adjust (learn) it separately for every voxel. One paper where this has been done in a simple way is <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3326357/">(Nishimoto 2011)</a>.</p>
<p>When you are trying to reconstruct static images, these steps have usually already been done for you during preprocessing; for reproducibility, and as somebody unfamiliar with neuroimaging, you may just want to stick with the preprocessing provided by the original authors. For video stimuli, however, you have to take this delay into account yourself.</p>
<p>Now on to the list of data sets. When using these data sets, please check the terms of use. Usually you simply have to cite the original publication(s), which we tried to specify in the list (but verify for yourself!). Please leave a comment if you think something is missing.</p>
<h3 id="image-stimulus-data-sets">Image stimulus data sets</h3>
<p>In the following experiments, participants saw static images while fixating on the center of the screen. In several data sets the test-set images were presented repeatedly and the responses averaged, while images from the training set were only presented once or twice. This repetition provides a cleaner signal for the test set, while the training set recording is aimed at covering more variance (but will usually be very noisy).</p>
<h4 id="69--mnist-digits-6-and-9"><a href="http://hdl.handle.net/11633/di.dcc.DSC_2018.00112_485">69 – MNIST digits 6 and 9</a></h4>
<p><img src="/images/datasets-for-mind-reading-4.png" alt="Datasets-for-mind-reading-4.png" /></p>
<p>100 examples of MNIST handwritten digits 6 and 9 (80 in train, 20 in test, class-balanced), presented in a single-participant experiment. Original publications:</p>
<p><em>van Gerven, M., de Lange, F. P., & Heskes, T. (2010). Neural decoding with hierarchical generative models. Neural Computation, 22, 3127–42.</em> <a href="http://staging.csml.ucl.ac.uk/archive/talks/2ccad2886a6b6f93dd5b9eed456378b8/paper_19.pdf"><em>(direct link)</em></a>
<em>van Gerven, M., Cseke, B., de Lange, F. P., & Heskes, T. (2010). Efficient Bayesian multivariate fMRI analysis using a sparsifying spatio-temporal prior. NeuroImage, 50, 150–61.</em> [<em>(direct link)</em>]</p>
<h4 id="10-by-10-binary-pixel-patterns"><a href="http://www.cns.atr.jp/dni/en/downloads/fmri-data-set-for-visual-image-reconstruction/">10-by-10 binary pixel patterns</a></h4>
<p><img src="/images/datasets-for-mind-reading-5.png" alt="Datasets-for-mind-reading-5.png" /></p>
<p>32 10-by-10 binary patterns of geometric shapes, alphabet characters and random patterns in a single participant, recorded across the visual system. <a href="https://openneuro.org/datasets/ds000255/versions/00002">(raw data on OpenNeuro)</a></p>
<p><a href="http://nilearn.github.io/auto_examples/02_decoding/plot_miyawaki_reconstruction.html">Reproduction with code in python nilearn</a>
<a href="https://www.youtube.com/watch?v=70Jy0b_pD1s">Sample reconstruction</a> from <a href="https://www.sciencedirect.com/science/article/pii/S0896627308009586">(Miyawaki 2008)</a></p>
<p>Original publication:
<em>Miyawaki, Y., Uchida, H., Yamashita, O., Sato, M., Morito, Y., Tanabe, H. C., Sadato, N., and Kamitani, Y. (2008). Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron, 60(5):915–929.</em> <a href="https://www.sciencedirect.com/science/article/pii/S0896627308009586"><em>(direct link)</em></a></p>
<h4 id="brains--6-handwritten-characters"><a href="http://hdl.handle.net/11633/di.dcc.DSC_2018.00114_120">BRAINS – 6 handwritten characters</a></h4>
<p><img src="/images/datasets-for-mind-reading-6.png" alt="Datasets-for-mind-reading-6.png" /></p>
<p>360 class-balanced examples of 6 handwritten characters presented to three participants (288 in train, 72 in test). Measured in primary visual cortex (V1) and V2.</p>
<p>Original publications:
<em>Schoenmakers, S., Barth, M., Heskes, T., and van Gerven, M. A. J. (2013). Linear reconstruction of perceived images from human brain activity. NeuroImage, 83:951–961.</em> <a href="https://www.ncbi.nlm.nih.gov/pubmed/23886984"><em>(direct link)</em></a></p>
<p><em>Schoenmakers, S., Güçlü, U., van Gerven, M. A. J., and Heskes, T. (2015). Gaussian mixture models and semantic gating improve reconstructions from human brain activity. Frontiers in Computational Neuroscience, 8:173.</em> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4311641/"><em>(direct link)</em></a></p>
<h4 id="vim-1--naturalistic-grayscale-images"><a href="https://crcns.org/data-sets/vc/vim-1/about-vim-1">vim-1 – naturalistic grayscale images</a></h4>
<p><img src="/images/datasets-for-mind-reading-7.png" alt="Datasets-for-mind-reading-7.png" /></p>
<p>These are fMRI responses of 2 participants to 1750 masked training images (averaged over 2 repetitions) and 120 test images (averaged over 13 repetitions), measured across the larger visual system.</p>
<p>Original publications:
<em>Kay, K. N., Naselaris, T., Prenger, R. J., and Gallant, J. L. (2008). Identifying natural images from human brain activity. Nature, 452(7185):352–355.</em> <a href="http://gallantlab.org/_downloads/2008a.Kay.etal.pdf"><em>(direct link)</em></a></p>
<p><em>Naselaris, T., Prenger, R. J., Kay, K. N., Oliver, M., and Gallant, J. L. (2009). Bayesian reconstruction of natural images from human brain activity. Neuron, 63(6):902–915.</em> <a href="https://www.ncbi.nlm.nih.gov/pubmed/19778517"><em>(direct link)</em></a></p>
<p>Recent reconstructions: <a href="https://www.biorxiv.org/content/early/2018/06/29/226688">(Seeliger 2017</a>) and <a href="https://www.biorxiv.org/content/early/2018/04/20/304774">(Ghislain St-Yves 2018)</a></p>
<h4 id="generic-object-decoding--naturalistic-images-from-imagenet"><a href="https://github.com/KamitaniLab/GenericObjectDecoding">Generic Object Decoding – naturalistic images from ImageNet</a></h4>
<p><img src="/images/datasets-for-mind-reading-8.png" alt="Datasets-for-mind-reading-8.png" /></p>
<p>5 participants, 1200 training set images taken from 150 ImageNet categories, 50 test set images of different categories (not in the original ImageNet set). One cool thing about this data set is that it contains an imagery set. <a href="https://openneuro.org/datasets/ds001246/versions/1.0.1">(raw data on OpenNeuro)</a></p>
<p>Original publication:
<em>Horikawa, T. and Kamitani, Y. (2017). Generic decoding of seen and imagined objects using hierarchical visual features. Nature Communications, 8.</em> <a href="https://www.nature.com/articles/ncomms15037"><em>(direct link)</em></a></p>
<p>Recent reconstructions from <a href="https://www.biorxiv.org/content/early/2017/12/30/240317">(Shen 2017)</a>: <a href="https://www.youtube.com/watch?v=jsp1KaM-avU">Naturalistic images</a>, <a href="https://www.youtube.com/watch?v=JHj0xaoW84k">geometric shapes</a>, <a href="https://www.youtube.com/watch?v=b7kVwoN8Cx4&t=56s">imagined geometric shapes</a>.</p>
<h4 id="bold5000--brain-object-landscape-dataset--naturalistic-images"><a href="https://bold5000.github.io">BOLD5000 – Brain, Object, Landscape Dataset – naturalistic images</a></h4>
<p><img src="/images/datasets-for-mind-reading-9.png" alt="Datasets-for-mind-reading-9.png" /></p>
<p>A massive fMRI recording of 5254 images from ImageNet, Scene Understanding (SUN) and MS COCO in the brains of 4 participants. Publication: <em>Chang, N., Pyles, J. A., Gupta, A., Tarr, M. J., & Aminoff, E. M. (2018). BOLD5000: A public fMRI dataset of 5000 images. arXiv preprint arXiv:1809.01281.</em> <a href="https://arxiv.org/abs/1809.01281"><em>(direct link)</em></a></p>
<h3 id="video-stimulus-data-sets">Video stimulus data sets</h3>
<p>Unsurprisingly, the participants were exposed to videos here. Such spatiotemporal naturalistic stimulation seems to generate stronger brain responses, as videos are more engaging than flashing still images, but you have to find a way to handle the hemodynamic delay (as introduced above). Again, SNR can be increased by repeating and averaging, which is usually done for the test sets.</p>
<h4 id="vim-2--naturalistic-video-clips"><a href="https://crcns.org/data-sets/vc/vim-2/about-vim-2">vim-2 – naturalistic video clips</a></h4>
<p><img src="/images/datasets-for-mind-reading-10.png" alt="Datasets-for-mind-reading-10.png" /></p>
<p>If you are interested in this branch of research you have certainly seen <a href="https://www.youtube.com/watch?v=nsjDnYxJ0bo">this video</a>. The data for the underlying paper has been available for a long time now, and provides 7200 time points of training data and 540 time points of test data for 3 participants, measured across the visual system. Note that the videos were shown at half speed in the scanner, which may facilitate reconstruction given the slow hemodynamic response.</p>
<p>Original publication:
<em>Nishimoto, S., Vu, A. T., Naselaris, T., Benjamini, Y., Yu, B., and Gallant, J. L. (2011). Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology, 21(19):1641–1646.</em> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3326357/"><em>(direct link)</em></a></p>
<h4 id="studyforrest--forrest-gump-movie"><a href="http://studyforrest.org">studyforrest – Forrest Gump movie</a></h4>
<p><img src="/images/datasets-for-mind-reading-11.png" alt="Datasets-for-mind-reading-11.png" /></p>
<p>This is a large-scale project where 20 participants watched the movie Forrest Gump with audio narration (written for visually impaired people) inside the MRI scanner. Their data was recorded at high spatial fMRI resolution in one of the world’s few 7T scanners. Viewing conditions were natural: The movie was shown in real time and participants were not required to fixate. Due to copyright you will have to build the stimulus material yourself, but the authors provide detailed instructions for replicating the stimuli.</p>
<p>Original publication:
<em>Hanke, M., Baumgartner, F. J., Ibe, P., Kaule, F. R., Pollmann, S., Speck, O., Zinke, W. & Stadler, J. (2014). A high-resolution 7-Tesla fMRI dataset from complex natural stimulation with an audio movie. Scientific Data, 1, 140003.</em> <a href="https://www.nature.com/articles/sdata20143"><em>(direct link)</em></a></p>
03 Oct 2018
https://mindcodec.ai/2018/10/03/datasets-for-mind-reading/
https://mindcodec.ai/2018/10/03/datasets-for-mind-reading/An intuitive guide to optimal transport, part III: entropic regularization and the Sinkhorn iterations<p>One of the goals of this series of posts is to explain methods that many researchers use without really knowing the underlying theory. The topic of today’s post is the Sinkhorn iterative algorithm for efficiently solving regularized optimal transport problems. <a href="https://arxiv.org/abs/1306.0895">This algorithm</a> is one of the main driving forces behind the current wave of optimal transport theory in machine learning research, as it can be several orders of magnitude faster than exact solvers. I start by confessing an embarrassing fact. The Sinkhorn iterations are the fundamental ingredient behind our <a href="https://arxiv.org/abs/1805.11284">Wasserstein variational inference</a> paper that we are going to present at <a href="https://nips.cc/">NIPS 2018</a>. However, I did not really study the reason why this beautiful algorithm works until a few days ago, when I started the preparations for this post! This is particularly embarrassing since it turns out that the derivation of the Sinkhorn iterations is really simple. All in all this has definitely been very useful for me and I hope it will be useful for you as well. While in the case of the <a href="https://www.mindcodec.com/an-intuitive-guide-to-optimal-transport-part-ii-the-wasserstein-gan-made-easy/">Wasserstein GAN</a> our starting point was the dual formulation, we will now start from the primal formulation of the optimal transport problem (if these terms are confusing for you, go back to our <a href="https://www.mindcodec.com/an-intuitive-guide-to-optimal-transport-for-machine-learning/">first post</a> on optimal transport). Let’s get started! (If you are looking for a reference or if you are in a rush, you can find a simple Python implementation at the end of the post.)</p>
<h2 id="another-look-at-probabilistic-optimal-transport">Another look at probabilistic optimal transport</h2>
<p>I will start by refreshing the probabilistic interpretation of the optimal transport problem. Consider a discrete probability distribution $P$ defined on a finite set $X = \{ x_1,…,x_k,…,x_K \}$ and another discrete distribution $Q$ defined on the set $Y = \{y_1,…,y_k,…,y_K \}$. Stochastic transportation maps $\Gamma$ between $P$ and $Q$ are conditional distributions that fulfill the following marginalization property:</p>
<script type="math/tex; mode=display">\sum_k \Gamma(y_n\vert x_k) P(x_k) = Q(y_n)~.</script>
<p>This can be interpreted in the following way: Sampling an element $x \in X$ from $P$ followed by $y \in Y$ from the conditional $\Gamma(y\vert x)$ is equivalent to sampling $y$ from $Q$ directly. I will refer to the set of all possible stochastic transportation maps as $G$. Using this notation, we can write the optimal transport divergence between $P$ and $Q$ as the expected cost of the least expensive transportation map:</p>
<script type="math/tex; mode=display">OT_c[p,q] = \min_{\Gamma \in G} \sum_{n,k} \left[ c(x_k,y_n) \Gamma(y_n\vert x_k) P(x_k) \right] ~.</script>
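<p>For small discrete problems this minimization is just a linear program over the entries of the joint map, and can be solved exactly with a generic solver. The sketch below uses scipy’s <code>linprog</code>; the example distributions are illustrative toy choices.</p>

```python
import numpy as np
from scipy.optimize import linprog

def optimal_transport(C, P, Q):
    # C[n, k] is the cost c(x_k, y_n); the variable is the joint map
    # Gamma[n, k] = Gamma(y_n | x_k) P(x_k), flattened into a vector.
    N, K = C.shape
    A_eq = np.zeros((N + K, N * K))
    for n in range(N):                      # sum_k Gamma[n, k] = Q[n]
        A_eq[n, n * K:(n + 1) * K] = 1.0
    for k in range(K):                      # sum_n Gamma[n, k] = P[k]
        A_eq[N + k, k::K] = 1.0
    b_eq = np.concatenate([Q, P])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x.reshape(N, K), res.fun

# Toy example: two identical two-point distributions on a line
x = np.array([0.0, 1.0])
y = np.array([0.0, 1.0])
C = (y[:, None] - x[None, :]) ** 2          # squared-distance cost
gamma_map, cost = optimal_transport(C, np.array([0.5, 0.5]),
                                    np.array([0.5, 0.5]))
```

Since $P = Q$ here, the optimal map simply leaves the mass in place and the cost is zero. Generic LP solvers scale poorly with the number of support points, which is precisely what motivates the Sinkhorn iterations discussed in this post.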
<p>An important feature of optimal transportation maps is that they have a tendency to be “as deterministic as possible”. More formally, the optimal conditional probability $\Gamma(y\vert x)$ is non-zero only for a few values of $y$. The reason for this behavior is rather simple to understand. Given an element $\hat{x}$, there is usually a $\hat{y}$ for which the transportation cost is smallest:</p>
<script type="math/tex; mode=display">% <![CDATA[
c(\hat{x},\hat{y}) < c(\hat{x},y), ~~~ \forall y \in Y \setminus \{\hat{y}\} ~. %]]></script>
<p>In this case, the cheapest thing to do is clearly to transport all the probability mass from $\hat{x}$ to $\hat{y}$. Of course this is not always possible, because the probability mass $P(\hat{x})$ could be larger than the mass $Q(\hat{y})$, and this would violate the marginalization constraint. Moreover, even if $P(\hat{x}) \leq Q(\hat{y})$, there could be other elements of $X$ for which $\hat{y}$ is the least expensive target. These other elements compete for the probability mass $Q(\hat{y})$. In that case, the best thing to do is to allocate as much mass as possible to $\hat{y}$ and divert the remaining mass to the second cheapest element of $Y$. This second allocation may not be fully possible either, and the remainder is allocated by repeating the procedure. Clearly, if for each $x \in X$ there is a unique and distinct $y \in Y$ that minimizes the transportation cost and satisfies $P(x) = Q(y)$, the optimal transportation map will be fully deterministic.</p>
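<p>The spill-over behaviour described above can be illustrated with a greedy allocation: send each source’s mass to its cheapest target, and divert whatever does not fit to the next cheapest one. This is only an illustration of the tendency towards determinism, not an optimal algorithm in general.</p>

```python
import numpy as np

def greedy_allocation(C, P, Q):
    # C[n, k]: cost of moving mass from x_k to y_n
    remaining = Q.astype(float).copy()
    gamma = np.zeros_like(C, dtype=float)
    for k in range(C.shape[1]):
        mass = P[k]
        for n in np.argsort(C[:, k]):       # targets, cheapest first
            move = min(mass, remaining[n])  # as much mass as still fits
            gamma[n, k] = move
            remaining[n] -= move
            mass -= move
            if mass <= 0:
                break
    return gamma

C = np.array([[0.0, 1.0],
              [1.0, 0.0]])
gamma = greedy_allocation(C, np.array([0.5, 0.5]), np.array([0.5, 0.5]))
# Each source sends all of its mass to its cheapest target:
# the resulting map is fully deterministic, as the text predicts.
```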
<h2 id="some-bits-of-information-theory">Some bits of information theory</h2>
<p>Entropic regularization is the main theoretical concept behind this post. In order to explain this form of regularization I need to introduce some basics of information theory. Entropy is one of the deepest ideas of statistical physics and information theory and is a measure of the level of uncertainty or unpredictability in a probability distribution. Deterministic distributions have zero entropy since they are fully predictable. Conversely, uniform distributions have the maximal possible entropy. The entropy of a discrete distribution $P$ is defined by averaging negative log probabilities:</p>
<script type="math/tex; mode=display">\mathcal{H}[P] = - \sum_k \log{P(x_k)} P(x_k)~.</script>
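<p>The two limiting cases mentioned above are easy to check numerically (using the convention $0 \log 0 = 0$):</p>

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # convention: 0 * log(0) = 0
    return -np.sum(p * np.log(p))

h_det = entropy([1.0, 0.0, 0.0, 0.0])      # deterministic: zero entropy
h_uni = entropy([0.25, 0.25, 0.25, 0.25])  # uniform: log(4), the maximum
```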
<p>The entropy of the joint distribution $\Gamma(y_n, x_k) =\Gamma(y_n\vert x_k) P(x_k)$ quantifies the (average) randomness of the transportation maps:</p>
<script type="math/tex; mode=display">\mathcal{H}[\Gamma(y,x)] = - \sum_{nk} \log{\Gamma(y_n, x_k)}\Gamma(y_n, x_k) = \sum_k \mathcal{H}[\Gamma(y\vert x_k)] P(x_k) + \mathcal{H}[P]~.</script>
<h2 id="entropic-regularization">Entropic regularization</h2>
<p>Entropic regularization is a way to counteract the tendency of optimal transport to produce nearly deterministic transportation maps by adding a term that favors randomness. As the reader will have guessed by now, this term is given by the (averaged) entropy of the transportation maps. The result is a regularized transportation problem:</p>
<script type="math/tex; mode=display">OT_c^\epsilon[p,q] = \min_{\Gamma \in G} \left[\sum_{n,k} \left[ c(x_k,y_n) \Gamma(y_n\vert x_k) P(x_k) \right] - \epsilon\mathcal{H}[\Gamma(y,x)] \right]~.</script>
<p>The solution of the regularized transport between $P$ and $Q$ can be visualized by plotting the joint probability $\Gamma(y,x) = \Gamma(y\vert x) P(x)$. The unregularized transport problem has a sparse solution, meaning that $\Gamma(y,x)$ is different from zero only for few $(y,x)$ couples. When we start adding regularization the transportation maps become dense and the joint distribution becomes more spread out.</p>
<h2 id="the-sinkhorn-iterations">The Sinkhorn iterations</h2>
<p>The biggest advantage of including entropic regularization in an optimal transport problem is that the regularized solution can be found very efficiently using a simple iterative algorithm. The derivation of these iterations is quite simple and requires just some basics of calculus. I will start by introducing a simplified notation. I will denote the joint distribution $\Gamma(y_n\vert x_k) P(x_k)$ as $\Gamma_{nk}$. The probability $P(x_k)$ will be denoted as $P_k$ and $Q(y_n)$ will be denoted as $Q_n$. We need to optimize the following function with respect to the transportation maps:</p>
<script type="math/tex; mode=display">E[\Gamma] = \sum_{n,k} \left[C_{nk} \Gamma_{nk} \right] + \epsilon \sum_{n,k} \log{\left({\Gamma_{nk}} \right)\Gamma_{nk}} ~,</script>
<p>under the following set of constraints:</p>
<script type="math/tex; mode=display">\sum_k \Gamma_{nk} = Q_n</script>
<p>and</p>
<script type="math/tex; mode=display">\sum_n \Gamma_{nk} = P_k~.</script>
<p>This last constraint ensures that the solutions are proper conditional probability distributions. The constraints can be embedded into the loss function using the two sets of <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange multipliers</a> $\{\lambda_1,…,\lambda_N\}$ and $\{\chi_1,…,\chi_K\}$. This results in the following modified loss function that can be minimized without constraints:</p>
<script type="math/tex; mode=display">\mathcal{L}[\Gamma] = E[\Gamma] - \sum_n \lambda_n \left(Q_n - \sum_k \Gamma_{nk} \right) - \sum_k \chi_k \left(P_k - \sum_n \Gamma_{nk} \right)~.</script>
<p>This loss function is clearly smooth with respect to the transportation maps and it can easily be shown that it is convex. This implies that we can find the global minimum by differentiating the loss and setting the gradient equal to zero:</p>
<script type="math/tex; mode=display">\nabla_{n,k} \mathcal{L}[\Gamma] = C_{nk} +\epsilon \log{(\Gamma_{nk})} + \epsilon + \lambda_{n} + \chi_{k} ~,</script>
<p>where $\nabla_{n,k}$ denotes the derivative with respect to $\Gamma_{nk}$. By setting the derivatives to zero and applying the exponential function to both sides we obtain the following equation:</p>
<script type="math/tex; mode=display">\Gamma_{nk} = \exp{(-\lambda_n/\epsilon - 1)} \exp{\left(-C_{nk}/\epsilon \right)} \exp{(-\chi_k/\epsilon)}.</script>
<p>We are almost there! All we need to do is find the values of the Lagrange multipliers. First, let’s rewrite the equation as follows:</p>
<script type="math/tex; mode=display">\Gamma_{nk} = v_n K_{nk} u_k~,</script>
<p>where $K_{nk} =\exp\left(-C_{nk}/\epsilon \right)$, $v_n =\exp(-\lambda_n/\epsilon - 1)$ and $u_k = \exp(-\chi_k/\epsilon)$. In order to determine the constants $v_n$ and $u_k$ we need to enforce the constraints:</p>
<script type="math/tex; mode=display">\sum_k \Gamma_{nk} = v_n \left(\sum_k K_{nk} u_k \right) = Q_n</script>
<p>and</p>
<script type="math/tex; mode=display">\sum_n \Gamma_{nk} = u_k \left(\sum_n K_{nk} v_n \right) = P_k~.</script>
<p>These two equations have to be satisfied simultaneously. An obvious strategy for finding a solution is to iteratively update each set of variables while keeping the other set fixed:</p>
<script type="math/tex; mode=display">v_n^{(t+1)} = \frac{Q_n}{\left(\sum_k K_{nk} u_k^{(t)} \right)}</script>
<p>and</p>
<script type="math/tex; mode=display">u_k^{(t+1)} = \frac{P_k}{\left(\sum_n K_{nk} v_n^{(t+1)} \right) }~.</script>
<p>There you go, we just derived the Sinkhorn iterations! I will leave convergence proofs, complexity analysis and all this beautiful stuff to the mathematicians! There is still much to say about the Sinkhorn iterations from an applied point of view. In the next post I will explain how to use them in the context of deep learning and variational Bayesian inference.</p>
<h2 id="sinkhorn-in-few-lines-of-python">Sinkhorn in a few lines of Python</h2>
<p>The best algorithms in machine learning are often also the simplest to explain and to program, and I hope I have convinced you that the Sinkhorn iterative algorithm is among them. I will conclude the post with a simple Python implementation:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def sinkhorn(C, P, Q, epsilon, num_iter):
    # C[n, k]: cost of transporting mass from x_k to y_n
    scale = np.max(C)  # rescale the costs to avoid numerical underflow in exp
    K = np.exp(-C / (scale * epsilon))
    u = np.ones(shape=P.shape) / len(P)
    v = np.ones(shape=Q.shape) / len(Q)
    for itr in range(num_iter):
        v = Q / np.matmul(K, u)
        u = P / np.matmul(np.transpose(K), v)
    coupling = v[:, None] * K * u[None, :]  # Gamma = diag(v) K diag(u)
    optimal_cost = np.sum(coupling * C)
    return coupling, optimal_cost</code></pre></figure>
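<p>As a quick sanity check, the iterations can be run on a small random problem and the marginalization constraints verified on the resulting coupling. The snippet below is self-contained; the cost matrix and marginals are arbitrary toy choices.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 4, 5
C = rng.random((N, K))            # hypothetical cost matrix C[n, k]
P = np.full(K, 1.0 / K)           # marginal over x (uniform, toy choice)
Q = np.full(N, 1.0 / N)           # marginal over y
epsilon = 0.1

# The fixed-point iterations derived above
Kmat = np.exp(-C / epsilon)
u = np.ones(K) / K
for _ in range(1000):
    v = Q / (Kmat @ u)
    u = P / (Kmat.T @ v)
coupling = v[:, None] * Kmat * u[None, :]   # Gamma = diag(v) K diag(u)

row_ok = np.allclose(coupling.sum(axis=1), Q)   # summing over k recovers Q
col_ok = np.allclose(coupling.sum(axis=0), P)   # summing over n recovers P
```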
01 Oct 2018
https://mindcodec.ai/2018/10/01/an-intuitive-guide-to-optimal-transport-part-iii-entropic-regularization-and-the-sinkhorn-iterations/
https://mindcodec.ai/2018/10/01/an-intuitive-guide-to-optimal-transport-part-iii-entropic-regularization-and-the-sinkhorn-iterations/An intuitive guide to optimal transport, part II: the Wasserstein GAN made easy<p>No guide to optimal transport for machine learning would be complete without an explanation of the Wasserstein GAN (wGAN). In <a href="https://mindcodec.ai//2018/09/19/an-intuitive-guide-to-optimal-transport-part-i-formulating-the-problem/">the first post of this series</a> I explained the optimal transport problem in its primal and dual form. I concluded the post by proving the Kantorovich-Rubinstein duality, which provides the theoretical foundation of the wGAN. In this post I will provide an intuitive explanation of the concepts behind the wGAN and discuss their motivations and implications. Moreover, instead of using weight clipping like in the original wGAN paper, I will use a new form of regularization that is in my opinion closer to the original Wasserstein loss. Implementing the method using a deep learning framework should be relatively easy after reading the post. A simple Chainer implementation is available <a href="https://github.com/LucaAmbrogioni/Wasserstein-GAN-on-MNIST">here</a>. Let’s get started!</p>
<h2 id="why-to-use-a-wasserstein-divergence">Why use a Wasserstein divergence</h2>
<p>The original wGAN paper opens with a lengthy explanation of the advantages of the Wasserstein metric over other commonly used statistical divergences. While the discussion is rather technical, the take-home message is simple: the Wasserstein metric can be used for comparing probability distributions that are radically different. What do I mean by different? The most common example is when two distributions have different support, meaning that they assign zero probability to different families of sets. For example, assume that $P(x)$ is a usual probability distribution on a two-dimensional space, defined by a probability density. All sets of zero volume (such as individual points and curves) in this space have zero probability under $P$. Conversely, $Q(x)$ is a weirder distribution that concentrates all its probability mass on a curve $\alpha$. All sets that do not intersect the curve have zero probability under $Q$, while some sets with zero volume have non-zero probability as long as they “walk along” the curve. I visualized this behavior in the following picture:</p>
<p><img src="/images/an-intuitive-guide-to-optimal-transport-part-ii-the-wasserstein-gan-made-easy-1.png" alt="An-intuitive-guide-to-optimal-transport-part-II-the-Wasserstein-GAN-made-easy-1.png" /></p>
<p>Now, these two distributions are very different from each other and they are pretty difficult to compare. For example, in order to compute their KL divergence we would need to calculate the density ratio $p(x)/q(x)$ for all points, but $Q$ does not even have a density with respect to the ambient space! However, we can still transport one distribution into the other using the optimal transport formalism that I introduced in the <a href="https://mindcodec.ai//2018/09/19/an-intuitive-guide-to-optimal-transport-part-i-formulating-the-problem/">previous post</a>! The Wasserstein distance between the two distributions is given by:</p>
<script type="math/tex; mode=display">\mathcal{W}_2[p,q]^2 = \inf_\gamma \int_{x_1 \in \alpha} \left( \int \left\lVert x_2 - x_1 \right\rVert^2_2 \gamma(x_2\vert x_1)dx_2 \right) dQ(x_1)~.</script>
<p>Let’s analyze this expression in detail. The inner integral is the average cost of transporting a point $x_1$ of the curve to a point $x_2$ of the ambient space under the transport map $\gamma(x_2\vert x_1)$. The outer integral is the average of this expected cost under the distribution $Q$ defined on the curve. We can summarize this in four steps: 1) pick a point $x_1$ from the curve $\alpha$, 2) transport a particle from $x_1$ to $x_2$ with probability $\gamma(x_2\vert x_1)$, 3) compute the cost of transporting a particle from $x_1$ to $x_2$ and 4) repeat this many times and average the cost. Of course, in order to ensure that you are transporting $Q$ to the target distribution $P$ you need to check that the marginalization constraint is satisfied:</p>
<script type="math/tex; mode=display">\int_{x_1 \in \alpha} \gamma(x_2\vert x_1) dQ(x_1) = p(x_2)~,</script>
<p>meaning that sampling particles from $Q$ and then transporting them using $\gamma$ is equivalent to sampling particles directly from $P$. Note that the procedure does not care whether the distributions $P$ and $Q$ have the same support. Thus we can use the Wasserstein distance for comparing these extremely different distributions.</p>
<p><img src="/images/an-intuitive-guide-to-optimal-transport-part-ii-the-wasserstein-gan-made-easy-2.png" alt="An-intuitive-guide-to-optimal-transport-part-II-the-Wasserstein-GAN-made-easy-2.png" /></p>
<p>But is this relevant in real applications? Yes it definitely is. Actually most of the optimizations we perform in probabilistic machine learning involve distributions with different support. For example, the space of natural images is often assumed to live in a lower dimensional (hyper-)surface embedded in the pixel space. If this hypothesis is true, the distribution of natural images is analogous to our weird distribution $Q$. Training a generative model requires the minimization of some sort of divergence between the model and the real distribution of the data. The use of the KL divergence is very sub-optimal in this context since it is only defined for distributions that can be expressed in terms of a density. This could be one of the reasons why variational autoencoders perform worse than GANs on natural images.</p>
<h2 id="the-dual-formulation-of-the-wasserstein-distance">The dual formulation of the Wasserstein distance</h2>
<p>That was a long diversion, however I think it is important to properly understand the motivations behind the wGAN. Let’s now focus on the method! As I explained in the last post, the starting point for the wGAN is the dual formulation of the optimal transport problem. The dual formulation of the (1-)Wasserstein distance is given by the following formula:</p>
<script type="math/tex; mode=display">\mathcal{W}_1[p,q] = \sup_{f \in L} \left[ \int f(x) p(x) dx - \int f(x) q(x) dx \right] ~,</script>
<p>where $L$ is the set of Lipschitz continuous functions:</p>
<script type="math/tex; mode=display">L = \left\{ f:\Re \rightarrow \Re \,\middle\vert\, \vert f(x_2) - f(x_1)\vert \leq \vert x_2 - x_1\vert \right\}~.</script>
<p>The dual formulation of the Wasserstein distance has a very intuitive interpretation. The function $f$ has the role of a nonlinear feature map that maximally enhances the differences between the samples coming from the two distributions. For example, if $p$ and $q$ are distributions of images of male and female faces respectively, then $f$ will assign positive values to images with masculine features, and these values will get increasingly higher as the input gets closer to a caricatural hyper-male face. In other words, the optimal feature map $f$ will assign a continuous score on a masculinity/femininity spectrum. The role of the Lipschitz constraint is to prevent $f$ from arbitrarily enhancing small differences. The constraint ensures that if two input images are similar, the output of $f$ will be similar as well. In the previous example, a minor difference in hairstyle should not make an enormous difference on our masculine/feminine spectrum. Without this constraint the result would be zero when $p$ is equal to $q$ and $\infty$ otherwise, since the effect of any minor difference can be arbitrarily enhanced by an appropriate feature map.</p>
<h2 id="the-wasserstein-gan">The Wasserstein GAN</h2>
<p>The basic idea behind the wGAN is to minimize the Wasserstein distance between the sampling distribution of the data $p(x)$ and the distribution of images synthesized using a deep generator. Specifically, images are obtained by passing a latent variable $z$ through a deep generative model $g$ parameterized by the weights $\phi$. The resulting loss has the following form:</p>
<script type="math/tex; mode=display">\mathcal{L}[\phi] = \sup_{f \in L} \left[ \text{E}_{x \sim p(x)} \left[ f(x) \right] - \text{E}_{z \sim p(z)} \left[ f(g(z,\phi)) \right]\right] ~,</script>
<p>where $p(z)$ is a distribution over the latent space. As we saw in the last section, the dual formulation already contains the idea of a discriminator in the form of a nonlinear feature map $f$. Unfortunately it is not possible to obtain the optimal $f$ analytically. However, we can parameterize $f$ using a deep network and learn its parameters $\theta$ with stochastic gradient descent. This naturally leads to a min-max problem:</p>
<script type="math/tex; mode=display">\inf_\phi \left[ \sup_\theta\left[ \text{E}_{x \sim p(x)} \left[ f(x) \right] - \text{E}_{z \sim p(z)} \left[ f(g(z,\phi)) \right] \right] \right]~.</script>
<p>In theory, the discriminator should be fully optimized every time we make an optimization step in the generator. However, in practice we update $\phi$ and $\theta$ simultaneously. Isn’t this beautiful? The adversarial training naturally emerges from the abstract idea of minimizing the Wasserstein distance, together with some obvious approximations. The last thing to do is to enforce the Lipschitz constraint in our learning algorithm. In the original wGAN paper this is done by clipping the weights if they get bigger than a predefined constant. In my opinion, a more principled way is to relax the constraint and add an additional stochastic regularization term to the loss function:</p>
<script type="math/tex; mode=display">\tilde{\mathcal{L}}[\phi, \theta] = \mathcal{L}[\phi, \theta] + \lambda \, \text{ReLU}\left(\vert f(x_2) - f(x_1)\vert - \vert x_2 - x_1\vert \right)^2~.</script>
<p>This term is zero when the constraint is fulfilled and adds a positive value when it is not. The original strict constraint is formally recovered as $\lambda$ tends to infinity. In practice, we can optimize this loss using a finite value of $\lambda$.</p>
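<p>To make this concrete, here is a minimal NumPy sketch of the relaxed regularizer for a toy scalar critic. The choice of $f$, the sampled pairs $(x_1, x_2)$ and the value of $\lambda$ are illustrative assumptions (the formula above leaves the sampling of the pairs implicit, and in the actual wGAN $f$ is a deep network):</p>

```python
import numpy as np

def lipschitz_penalty(f, x1, x2, lam=10.0):
    # lambda * ReLU(|f(x2) - f(x1)| - |x2 - x1|)^2, averaged over sampled pairs
    excess = np.abs(f(x2) - f(x1)) - np.abs(x2 - x1)
    return lam * np.mean(np.maximum(excess, 0.0) ** 2)

rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=1000), rng.normal(size=1000)

print(lipschitz_penalty(lambda x: 0.5 * x, x1, x2))      # 1-Lipschitz map: 0.0
print(lipschitz_penalty(lambda x: 3.0 * x, x1, x2) > 0)  # violation: True
```

<p>A map with slope at most one never activates the penalty, while a map that stretches distances is pushed back towards the Lipschitz ball as $\lambda$ grows.</p>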
<h2 id="is-the-wasserstein-gan-really-minimizing-an-optimal-transport-divergence">Is the Wasserstein GAN really minimizing an optimal transport divergence?</h2>
<p>The Wasserstein GAN is clearly a very effective algorithm that naturally follows from a neat theoretical principle. But does it really work by minimizing the Wasserstein distance between the generator and the data distribution? The dual formulation of the Wasserstein distance crucially relies on the fact that we are using the optimal nonlinear feature map $f$ among all possible Lipschitz continuous functions. The constraint makes an enormous difference. For example, if we use the same difference-of-expectations loss but replace Lipschitz functions with continuous functions bounded between $-1$ and $1$, we obtain the total variation divergence. While training the wGAN we are actually restricting $f$ to be some kind of deep neural network with a fixed architecture. This constraint is enormously more restrictive than the sole Lipschitz constraint and it leads to a radically different divergence. The set of Lipschitz functions from images to real numbers is incredibly flexible and does not induce any relevant inductive bias. The theoretically optimal $f$ can detect differences that are invisible to the human eye and does not assign special importance to differences that are very obvious to humans. Conversely, deep convolutional networks have a very peculiar inductive bias that somehow matches the bias of the human visual system. Therefore, it is possible that the success of the wGAN is not really due to the mathematical properties of the Wasserstein distance but rather to the biases induced by the parameterization of the feature map (discriminator).</p>
<h2 id="the-anatomy-of-a-good-paper">The anatomy of a good paper</h2>
<p>The original Wasserstein GAN paper is a clear example of a beautiful machine learning publication, something all machine learning researchers should strive to write. The starting point of the paper is a simple and elegant mathematical idea motivated by observations about the distribution of real data. The algorithm follows very naturally from the theory, with the adversarial scheme popping out from the formulation of the loss. Finally, the experimental section is very well executed and the results are state-of-the-art. Importantly, there is nothing difficult in the paper, and most of us could have pulled it off with just some initial intuition followed by hard work. I feel that there are still many low-hanging fruits in our field and the key to grasping them is to follow the example of papers like this.</p>
23 Sep 2018
https://mindcodec.ai/2018/09/23/an-intuitive-guide-to-optimal-transport-part-ii-the-wasserstein-gan-made-easy/
<h1 id="an-intuitive-guide-to-optimal-transport-part-i">An intuitive guide to optimal transport, part I: formulating the problem</h1><p>Following the success of the <a href="https://arxiv.org/abs/1701.07875">Wasserstein GANs</a> and of the <a href="https://arxiv.org/abs/1306.0895">Sinkhorn divergences</a>, <a href="https://arxiv.org/abs/1803.00567">optimal transport theory</a> is rapidly becoming an essential theoretical tool for machine learning research. Optimal transport has been used for <a href="https://arxiv.org/abs/1706.00292">generative modeling</a>, <a href="https://arxiv.org/abs/1711.01558">probabilistic autoencoders</a>, <a href="https://arxiv.org/abs/1805.11284">variational inference</a>, <a href="https://arxiv.org/abs/1806.01265v2">reinforcement learning</a> and <a href="https://arxiv.org/abs/1806.09045v3">clustering</a>, among many other things. In other words, if you are a machine learning researcher you need to have some understanding of optimal transport or you risk being left behind. Unfortunately, optimal transport theory is often presented in heavily mathematical jargon that risks scaring away the non-mathematicians among us. This is a pity, since the parts of optimal transport theory that are most relevant for modern machine learning research are often very intuitive. In this series of posts, I’ll try to provide an intuitive guide to optimal transport methods without relying on complex mathematical concepts. In this first post I will formulate the (Kantorovich) optimal transport problem both from a deterministic and from a probabilistic viewpoint. I bet that the curiosity of most of my readers has been drawn to this topic by the celebrated Wasserstein GAN. While I will cover the <a href="https://arxiv.org/abs/1701.07875">Wasserstein GAN</a> in a future post of this series, I will conclude this post with an introduction to the dual optimal transport problem and give a simple proof of the Kantorovich-Rubinstein duality, which is the theoretical foundation of this adversarial method.</p>
<h2 id="optimal-transport-problems">Optimal transport problems</h2>
<p>Optimal transport problems can be formulated in a very intuitive way. Consider the following example: an online retailer has $N$ storage areas and there are $K$ customers who ordered e-book readers. The n-th storage area $x_n$ contains $m_n$ readers while the k-th customer $y_k$ ordered $h_k$ readers. The transport cost $c(x,y)$ is the distance between the storage area $x$ and the address of customer $y$. The optimal transport problem consists of finding the least expensive way of moving all the readers stored in the storage areas to the customers who ordered them. A transportation map $\Gamma$ is a matrix whose $\Gamma_{nk}$ entry represents the number of e-book readers sent from the n-th storage area to the k-th customer. For consistency, the sum of all the readers leaving the n-th storage area has to be equal to the total number of readers stored in that area, while the sum of all the readers arriving at a customer’s house has to be equal to the number of e-book readers she ordered. These are the hard constraints of the transport problem and can be written in formulas as follows:</p>
<script type="math/tex; mode=display">\sum_k \Gamma_{nk} = m_n~,</script>
<script type="math/tex; mode=display">\sum_n \Gamma_{nk} = h_k~.</script>
<p>The final constraint is that the entries of the matrix have to be positive-valued (for obvious reasons). The optimal solution $\hat{\Gamma}$ is the transportation matrix that minimizes the total cost while respecting the constraints:</p>
<script type="math/tex; mode=display">\hat{\Gamma} = \text{argmin}_\Gamma \sum_{n,k} \Gamma_{nk} c(x_n,y_k)~.</script>
<p>In this expression we are assuming that transporting $L$ e-readers from $x_n$ to $y_k$ is $L$ times more expensive than transporting one reader. Note that this assumption is not realistic in most real world transportation problems since the transportation cost usually does not scale linearly with the number of transported units. Nevertheless, this simplified problem gives rise to a very elegant and useful mathematical theory.</p>
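<p>For intuition, a tiny instance of the storage-area problem can be solved by brute force. The following Python sketch enumerates all feasible integer transportation matrices for two storage areas and two customers (the stocks, orders and unit costs are made-up numbers):</p>

```python
# Hypothetical tiny instance: stocks m, orders h, unit transport costs.
m = [3, 2]                      # readers stored in each storage area
h = [2, 3]                      # readers ordered by each customer
cost = [[1.0, 2.0],             # cost[n][k]: area n -> customer k
        [3.0, 0.5]]

best_plan, best_cost = None, float("inf")
# Row sums are fixed by the stocks; enumerate and keep column-feasible plans.
for g00 in range(m[0] + 1):
    for g10 in range(m[1] + 1):
        plan = [[g00, m[0] - g00], [g10, m[1] - g10]]
        if plan[0][0] + plan[1][0] == h[0] and plan[0][1] + plan[1][1] == h[1]:
            c = sum(plan[n][k] * cost[n][k] for n in range(2) for k in range(2))
            if c < best_cost:
                best_plan, best_cost = plan, c

print(best_plan, best_cost)     # [[2, 1], [0, 2]] 5.0
```

<p>The optimal plan ships as much as possible along the cheap routes while exactly honoring the row and column constraints; for larger instances one would use a linear programming solver instead of enumeration.</p>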
<h3 id="probabilistic-formulation">Probabilistic formulation</h3>
<p>In machine learning and statistics it is often useful to reformulate the optimal transport problem in probabilistic terms. Consider two finite probability spaces $(X, P)$ and $(Y,Q)$ where $X$ and $Y$ are finite sets and $P$ and $Q$ are probability functions assigning a probability to each element of their set. The optimal transport between $P$ and $Q$ is the conditional probability function $\Gamma(y\vert x)$ that minimizes the following loss function:</p>
<script type="math/tex; mode=display">\text{argmin}_\Gamma \sum_{n,k} \Gamma(y_k\vert x_n) P(x_n) c(x_n,y_k)~,</script>
<p>subject to the following marginalization constraint:</p>
<script type="math/tex; mode=display">\sum_n \Gamma(y_k\vert x_n) P(x_n) = Q(y_k)~.</script>
<p>This simply means that the marginal distribution of the joint probability $\Gamma(y\vert x) P(x)$ is $Q(y)$. In other words, $\Gamma(y\vert x)$ transports the distribution $P(x)$ into the distribution $Q(y)$. This transportation can be interpreted as a stochastic function that takes $x$ as input and outputs a $y$ with probability $\Gamma(y\vert x)$. The problem thus consists of finding a stochastic transport that maps the probability distribution $P$ into the probability distribution $Q$ while minimizing the expected transportation cost. It is easy to see that this problem is formally identical to the deterministic problem that I introduced in the previous section. The transportation matrix $\Gamma_{nk}$ is given by $\Gamma(y_k\vert x_n) P(x_n)$. This assures that the first constraint is automatically fulfilled while the second constraint still needs to be enforced.</p>
<h3 id="continuous-formulation">Continuous formulation</h3>
<p>It is straightforward to extend the definition of probabilistic optimal transport to continuous probability distributions. This can be done by replacing the probabilities $P(x)$ and $Q(x)$ with the probability densities $p(x)$ and $q(x)$ and the summation with an integration:</p>
<script type="math/tex; mode=display">\text{argmin}_\gamma \int \gamma(y\vert x) p(x) c(x,y) dx dy~.</script>
<p>Analogously, the marginalization constraint becomes:</p>
<script type="math/tex; mode=display">\int \gamma(y\vert x) p(x) dx = q(y)~.</script>
<p>This continuous optimal transport problem is usually introduced in a slightly different (and in my opinion less intuitive) form. I will denote the joint density $\gamma(y\vert x) p(x)$ as $\gamma(x,y)$. It is easy to see that the problem can be reformulated as follows:</p>
<script type="math/tex; mode=display">\text{arginf}_\gamma \int \gamma(x,y) c(x,y) dx dy~,</script>
<p>with the two marginalization constraints:</p>
<script type="math/tex; mode=display">\int \gamma(x,y) dx = q(y)</script>
<p>and</p>
<script type="math/tex; mode=display">\int \gamma(x,y) dy = p(x)~.</script>
<h2 id="optimal-transport-divergences">Optimal transport divergences</h2>
<p>In many situations the primary interest is not to obtain the optimal transportation map. Instead, we are often interested in using the optimal transportation cost as a statistical divergence between two probability distributions. A statistical divergence is a function that takes two probability distributions as input and outputs a non-negative number that is zero if and only if the two distributions are identical. Statistical divergences such as the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL divergence</a> are massively used in statistics and machine learning as a way of measuring dissimilarity between two probability distributions. Statistical divergences have a central role in several of the most active areas of statistical machine learning, such as generative modeling and variational Bayesian inference.</p>
<h3 id="optimal-transport-divergences-and-the-wasserstein-distance">Optimal transport divergences and the Wasserstein distance</h3>
<p>An optimal transport divergence is defined as the optimal transportation cost between two probability distributions:</p>
<script type="math/tex; mode=display">OT_c[p,q] = \text{inf}_\gamma \int \gamma(x,y) c(x,y) dx dy~,</script>
<p>where the optimization is subject to the usual marginalization constraints. This expression provides a valid divergence as long as the cost is always non-negative and $c(x,x)$ vanishes for all values of $x$. Clearly, the properties of an optimal transport divergence depend on its cost function. A common choice is the squared Euclidean distance:</p>
<script type="math/tex; mode=display">c(x,y) = \left\lVert x - y \right\rVert^2_2~.</script>
<p>Using the Euclidean distance as a cost function, we obtain the famous (squared) 2-Wasserstein distance:</p>
<script type="math/tex; mode=display">\mathcal{W}_2[p,q]^2 = \text{inf}_\gamma \int \gamma(x,y) \left\lVert x - y \right\rVert^2_2 dx dy~.</script>
<p>The square root of $\mathcal{W}_2[p,q]^2$ is a proper <a href="https://en.wikipedia.org/wiki/Metric_(mathematics)">metric function</a> between probability distributions as it respects the <a href="https://en.wikipedia.org/wiki/Triangle_inequality">triangle inequality</a>. Using a proper metric such as the Wasserstein distance instead of other kinds of optimal transport divergences is not crucial for most machine learning applications, but it often simplifies the mathematical treatment. Finally, given an integer $k$, the k-Wasserstein distance is defined as follows:</p>
<script type="math/tex; mode=display">\mathcal{W}_k[p,q]^k = \text{inf}_\gamma \int \gamma(x,y) \left\lVert x - y \right\rVert^k_k dx dy~,</script>
<p>where $\left\lVert \cdot \right\rVert_k$ denotes the $L_k$ norm.</p>
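<p>In one dimension these distances are easy to compute: for two empirical distributions with the same number of samples, the optimal plan simply matches sorted samples. A small NumPy sketch (the sample values are illustrative):</p>

```python
import numpy as np

def wasserstein_1d(x, y, k=2):
    # k-Wasserstein distance between two 1D empirical distributions
    # with equally many samples: the optimal coupling matches sorted samples.
    x, y = np.sort(x), np.sort(y)
    return np.mean(np.abs(x - y) ** k) ** (1.0 / k)

a = np.array([0.0, 1.0, 2.0])
b = np.array([1.0, 2.0, 3.0])      # a shifted by 1
print(wasserstein_1d(a, b, k=2))   # a shift by c gives W_2 = c, here 1.0
```
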
<h2 id="the-dual-problem-and-the-wasserstein-gan">The dual problem and the Wasserstein GAN</h2>
<p>Optimal transport problems are a special case of linear programming problems since both the function to be optimized and the constraints are linear functions of the transportation map. The theory behind <a href="https://en.wikipedia.org/wiki/Linear_programming">linear programming</a> dates back to the beginning of the last century and is one of the cornerstones of mathematical optimization. One of the most fundamental results of linear programming is that any linear problem has a dual problem whose solution provides an upper bound to the solution of the original (primal) problem. Fortunately, it turns out that in the case of optimal transport the solution of the dual problem does not simply provide a bound but is indeed identical to the solution of the primal problem. Furthermore, the dual formulation of the optimal transport problem is the starting point for adversarial algorithms and the Wasserstein GAN. The dual formulation of an optimal transport divergence is given by the following formula:</p>
<script type="math/tex; mode=display">OT_c[p,q] = \text{sup}_{f \in L_c} \left[ \int f(x) p(x) dx - \int f(y) q(y) dy \right] ~,</script>
<p>where $L_c$ is the set of functions whose growth is bounded by $c$:</p>
<script type="math/tex; mode=display">L_c = \left \{ f:\mathbb{R} \rightarrow \mathbb{R} \,\middle\vert\, f(x) - f(y) \leq c(x,y) \right\}~.</script>
<p>It is far from obvious why this expression is equivalent to the primal expression that I gave in the previous sections, and I will spend the rest of the post proving this result. However, the formula in itself has a rather intuitive interpretation. Clearly, if $p$ is equal to $q$, the difference between the expected values of any function $f$ under the two distributions will be zero and consequently the divergence will vanish. Now assume that $p$ and $q$ differ in some region of their domain. In this case the divergence is obtained by finding the function $f$ that maximizes the difference in expected values. In other words, $f$ acts like a feature detector that extracts the features that maximally differentiate $p$ from $q$. For example, imagine that $p$ is a distribution over landscape images without traces of human activity while $q$ is a distribution over landscape images with a plane in the sky. In this case, the optimal $f$ will be a plane detector. From this example you can see how $f$ plays the role of a discriminator in the Wasserstein GAN. Note that without any constraints on $f$ any small difference in the distributions can be magnified arbitrarily and the divergence would be infinite.</p>
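<p>The dual formula is easy to probe numerically. Any $f$ satisfying the constraint yields a lower bound on the transport cost, and for a pure one-dimensional shift with $c(x,y) = \vert x - y \vert$ the 1-Lipschitz function $f(x) = -x$ already attains the supremum. A NumPy sketch (sample size and distributions are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p_samples = rng.normal(0.0, 1.0, 100000)
q_samples = p_samples + 1.0        # q is p shifted to the right by 1

# W_1 between two 1D empirical distributions: match sorted samples.
w1 = np.mean(np.abs(np.sort(p_samples) - np.sort(q_samples)))

# The 1-Lipschitz function f(x) = -x gives the dual value E_p[f] - E_q[f].
f = lambda x: -x
dual_value = f(p_samples).mean() - f(q_samples).mean()

print(w1, dual_value)              # both are approximately 1.0
```

<p>Here the dual value matches the primal transport cost, as the Kantorovich-Rubinstein duality promises; a suboptimal Lipschitz $f$ would give a strictly smaller value.</p>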
<h3 id="proving-the-duality">Proving the duality</h3>
<p>In order to prove the duality, we need to reformulate the constrained optimization in the primal problem as an unconstrained one. Consider the following optimization:</p>
<script type="math/tex; mode=display">\text{sup}_{f} \left[ \int f(x') p(x') dx' - \int f(x) \gamma(x,y) dx dy \right] ~,</script>
<p>where $f$ can be any function. The term on the left-hand side is the expectation of $f$ under $p$ while the term on the right-hand side is the expectation of $f$ under the marginal distribution $\int \gamma(x,y) dy$. This expression is clearly zero for all possible $f$ if the marginal constraint over $p$ is met, since the two terms will be identical. However, if the constraint is not met the values of $f$ can be chosen to be arbitrarily large for the values of $x$ where the two marginals differ, and the result of the optimization will be infinity. Therefore, adding two terms of this form to the loss function of our optimization problem will not change the problem when the constraints are met but it will exclude all the possible solutions that do not satisfy the constraints. Also note that the term on the left ($\int f(x') p(x') dx'$) can be moved inside the expectation integral on the right since the expectation integral of a constant is the constant itself:</p>
<script type="math/tex; mode=display">\text{sup}_{f} \left[ \int f(x') p(x') dx' - \int f(x) \gamma(x,y) dx dy \right] = \text{sup}_{f} \int \left[ \int f(x') p(x') dx' - f(x) \right] \gamma(x,y) dx dy~.</script>
<p>We can now write a modified loss function that incorporates the constraints:</p>
<script type="math/tex; mode=display">OT_c[p,q] = \text{inf}_\gamma \left[ \int \gamma(x,y) c(x,y) dx dy + \text{sup}_{f} \mathcal{L}[f,\gamma] \right] ~,</script>
<p>where</p>
<script type="math/tex; mode=display">\mathcal{L}[f,\gamma] = \int \left[ \left(\int f(x') p(x') dx' - f(x)\right) - \left(\int f(y') q(y') dy' - f(y) \right) \right] \gamma(x,y) dx dy~.</script>
<p>The next thing to do is to exchange the order of the infimum and the supremum. This can be done by using <a href="https://en.wikipedia.org/wiki/Sion%27s_minimax_theorem">Sion’s minimax theorem</a> since the loss function is linear in both $f$ and $\gamma$:</p>
<script type="math/tex; mode=display">OT_c[p,q] = \text{sup}_{f} \text{inf}_\gamma \left[ \int \gamma(x,y) c(x,y) dx dy + \mathcal{L}[f,\gamma] \right] ~,</script>
<script type="math/tex; mode=display">= \text{sup}_{f} \left[ \int f(x') p(x') dx' - \int f(y') q(y') dy' + \text{inf}_\gamma \int l(x,y) \gamma(x,y) dx dy \right]~,</script>
<p>where</p>
<script type="math/tex; mode=display">l(x,y) = c(x,y) - (f(x) - f(y))~.</script>
<p>We are almost there! The optimization over $\gamma$ on the right-hand side of this expression can be converted into a constraint on $f$. In fact, if $l(x,y) \geq 0$ for all $x$ and $y$ then the infimum is zero and is reached by assigning the whole probability density to the $x = y$ subspace. Conversely, if there is a region where $l(x,y) < 0$, the cost can be made arbitrarily negative by assigning an arbitrarily large amount of density to that region. By converting this term into a constraint we arrive at the dual formulation of the optimal transport problem.</p>
19 Sep 2018
https://mindcodec.ai/2018/09/19/an-intuitive-guide-to-optimal-transport-part-i-formulating-the-problem/
<h1 id="joint-contrastive-inference-and-cycle-gans">Joint-contrastive inference and cycle GANs</h1><p>This is the second post of our series about joint-contrastive inference. I suggest reading <a href="https://www.mindcodec.com/joint-contrastive-variational-inference/">our previous post</a> and the <a href="https://www.inference.vc/variational-inference-using-implicit-models-part-iii-joint-contrastive-inference-ali-and-bigan/">seminal blog post by Ferenc Huszár</a> for the required background. This post is partially based on the probabilistic reformulation of cycle-consistent GANs introduced in the paper <a href="https://arxiv.org/abs/1806.01771">Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference</a>. In my opinion, this reformulation clearly shows the potential of the joint-contrastive inference framework, as it offers a very elegant theoretical derivation of a very popular deep learning method. The reformulation also suggests several possible further developments that I’ll cover at the end of this post.</p>
<h2 id="cycle-consistent-adversarial-learning">Cycle-consistent adversarial learning</h2>
<p><a href="https://arxiv.org/abs/1703.10593">Cycle-consistent adversarial training</a> is one of the latest hot topics in deep learning. The aim of the original cycle GAN is to perform unpaired image-to-image translation. Imagine having a set $X = \{x_1,\ldots,x_n\}$ of photographic portraits and a set $Y = \{y_1,\ldots,y_n\}$ of painted portraits. The goal is to convert photos into paintings and <em>vice versa</em>. We can summarize the models as follows:</p>
<script type="math/tex; mode=display">y = f(x), \quad x = g(y),</script>
<p>where $f$ transforms photos into paintings and $g$ transforms paintings into photos. Unfortunately, the people portrayed in one set are not the same people portrayed in the other set. This forbids the use of conventional supervised training. The idea is then to train each transformation independently using adversarial training with the two discriminators $D_1$ and $D_2$:</p>
<script type="math/tex; mode=display">\mathcal{L}_1[f, D_1] = E_{y \sim k(y)}\left[ \log{D_1(y)} \right] + E_{x \sim k(x)}\left[ \log\left(1 - D_1(f(x))\right) \right]~,</script>
<script type="math/tex; mode=display">\mathcal{L}_2[g,D_2] = E_{x \sim k(x)}\left[ \log{D_2(x)} \right] + E_{y \sim k(y)}\left[ \log\left(1 - D_2(g(y))\right) \right]~.</script>
<p>Optimizing the first loss minimizes the (<a href="https://arxiv.org/abs/1406.2661">Jensen-Shannon</a>) divergence between the distribution of the transformed photos $f(x)$ and the real paintings $y$, while the second loss minimizes the divergence between transformed paintings $g(y)$ and real photos $x$. We still need one fundamental ingredient. We want to achieve image-to-image translation, and any good translator should be consistent. If you translate something to a language and then back to the original language you should get something similar to what you started with. We can enforce this behavior with a cycle-consistency loss:</p>
<script type="math/tex; mode=display">\mathcal{L}_{\text{cyc}}[f, g] = E_{y \sim k(y)}\left[ \left\lVert y - f(g(y)) \right\rVert_2^2 \right] + E_{x \sim k(x)}\left[ \left\lVert x - g(f(x)) \right\rVert_2^2 \right]~.</script>
<p>This combination of losses defines the cycle GAN and works very well in practice, as you can see from this picture taken from the original paper.</p>
<p><img src="/images/joint-contrastive-inference-and-cycle-gans-2.png" alt="Joint-contrastive-inference-and-cycle-GANs-2.png" /></p>
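<p>The interplay between these losses can be illustrated with one-dimensional toy data and linear maps standing in for the deep generators (an illustrative sketch; $f(x) = a\,x$ and $g(y) = b\,y$ are hypothetical stand-ins): the cycle-consistency loss vanishes exactly when the two maps invert each other.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)   # toy "photos"
y = rng.normal(size=1000)   # toy "paintings"

def cycle_loss(a, b):
    # Cycle-consistency loss for toy linear generators f(x) = a*x, g(y) = b*y.
    f = lambda t: a * t
    g = lambda t: b * t
    return np.mean((y - f(g(y))) ** 2) + np.mean((x - g(f(x))) ** 2)

print(cycle_loss(2.0, 0.5))        # mutually inverse maps: 0.0
print(cycle_loss(1.0, 1.0))        # identity maps: 0.0
print(cycle_loss(2.0, 2.0) > 0)    # not mutually inverse: True
```

<p>Note that infinitely many pairs $(a, 1/a)$ achieve zero loss, which foreshadows the ill-posedness discussed in the next section.</p>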
<h3 id="one-problem-infinitely-many-bad-solutions-and-a-hidden-trick">One problem, infinitely many (bad) solutions and a hidden trick</h3>
<p>The loss functions described in the previous section can seem hopelessly ill-posed to the keen reader. In fact, any one-to-one mapping between $X$ and $Y$ minimizes those losses. A photo of myself can be converted into a portrait of Theodore Roosevelt as long as his portrait gets turned back into my photo. In other words, the cycle GAN loss doesn’t fully capture our intention of image-to-image translation since it assigns equal loss to a huge number of mappings that don’t preserve the identity of the portrayed person. So why do cycle GANs work so well in practice? Like almost all questions in deep learning, the (rather uninformative) answer is: “Because they tend to converge on good local minima”. However, there is a more informative answer in the case of cycle GANs. The secret is that the optimization is initialized around a very good guess: the identity mapping $f(x) = x$. This is very easy to understand. If the initial transformation of my photo is the very same photo, it will probably not be turned into a portrait of Donald Trump during optimization. In other words, the intended local minimum is very close to the identity mapping. It is therefore not a coincidence that cycle GANs are usually parameterized by residual architectures, since <a href="https://arxiv.org/abs/1512.03385">ResNets</a> are built as a perturbation of an identity mapping. Note that the key ingredient here is the initialization around a good guess, which does not necessarily need to be the identity. The extension of cycle GAN methods to situations where there is no obvious identity map (e.g. mapping a picture of myself to my voice saying my name) is in my opinion an interesting research topic.</p>
<h2 id="a-probabilistic-joint-contrastive-reformulation-of-cycle-gans">A probabilistic joint-contrastive reformulation of cycle GANs</h2>
<p>I can finally move to the probabilistic reformulation of cycle-consistent adversarial learning. If you are anything like me, everything will feel more natural from here (if things feel more complicated, you either need to read more or less of this blog). Our starting point is the empirical distributions of photos and paintings, respectively $k(x)$ and $k(y)$. The aim is to construct a joint probability $p(x,y)$ such that if you sample the pair $(x,y)$ from this joint you will get a photo and a painting of the same person. The joint distribution can be expressed in two different ways:</p>
<script type="math/tex; mode=display">p_1(x,y) = q(y\vert x) k(x)</script>
<p>and</p>
<script type="math/tex; mode=display">p_2(x,y) = h(x\vert y) k(y)~.</script>
<p>The aim is to find the right conditional distributions $q(y\vert x)$ and $h(x\vert y)$. These conditional distributions can be interpreted as random functions such as in <a href="https://arxiv.org/abs/1312.6114">variational autoencoders</a>. $p_1(x,y)$ and $p_2(x,y)$ are two ways of modeling the same joint distribution. Note that $p_1(x,y)$ has the correct marginal distribution on the photos while $p_2(x,y)$ has the correct marginal on the paintings. The idea is then to get a single self-consistent model by minimizing a joint-contrastive divergence between $p_1$ and $p_2$:</p>
<script type="math/tex; mode=display">\mathcal{L}_{\text{prob}} = D\left[p_1(x,y), p_2(x,y)\right]~.</script>
<p>This optimization results in two conditional distributions $q(y\vert x)$ and $h(x\vert y)$ that are (approximately) consistent (meaning that $q(y\vert x)$ is the “probabilistic inverse” of $h(x\vert y)$ in the sense given by Bayes’ theorem). Furthermore, the marginal $\int p_1(x,y) dx$ will be fitted to $k(y)$ and the marginal $\int p_2(x,y) dy$ will be fitted to $k(x)$. The first of these results can be interpreted as the probabilistic generalization of the requirement of cycle-consistency in cycle GANs while the second is analogous to the adversarial training. Of course, the details of the algorithm depend on the divergence we use and on the parameterization of the random functions $q(y\vert x)$ and $h(x\vert y)$. Note that the resulting probabilistic algorithm is going to be as initialization-dependent as regular cycle GANs since there are infinitely many joint distributions that are consistent with our two marginals. However, the role of the initial guess can be made explicit in the loss by including a prior over the parameters of the conditional probabilities $q(y\vert x)$ and $h(x\vert y)$ whose probability is maximized by the identity mapping.</p>
<h3 id="going-into-details-the-symmetrized-kl-divergence">Going into details: the symmetrized KL divergence</h3>
<p>What kind of divergence should we use? The form of the problem suggests the use of a symmetric divergence ($D(p_1,p_2) = D(p_2,p_1)$) since there isn’t an obvious asymmetry between $q(y\vert x)$ and $h(x\vert y)$. This is in sharp contrast to regular Bayesian inference where there is a clear interpretative difference between the likelihood and the posterior. There are many possible symmetric divergences to choose from. One interesting possibility is to use the Wasserstein distance like we did in our recent <a href="https://arxiv.org/abs/1805.11284">Wasserstein variational inference paper</a>. This choice is particularly interesting given the stability and good performance of <a href="https://arxiv.org/abs/1701.07875">Wasserstein GAN</a>s. For now I will consider the symmetrized KL divergence since it leads to a clear similarity with the standard cycle GAN algorithm:</p>
<script type="math/tex; mode=display">D_{\text{symKL}}\left[p_1(x,y), p_2(x,y)\right] = \frac{1}{2} \left( D_{\text{KL}}\left[p_1(x,y), p_2(x,y)\right] + D_{\text{KL}}\left[p_2(x,y), p_1(x,y)\right] \right)~.</script>
<p>If we expand this expression we obtain several interpretable terms. The term corresponding to the requirement of cycle-consistency is nothing less than the <a href="https://arxiv.org/abs/1805.11542">forward amortized loss</a> that I discussed in a <a href="https://www.mindcodec.com/joint-contrastive-variational-inference/">previous post</a>:</p>
<script type="math/tex; mode=display">E_{x,y \sim h(x\vert y)k(y)}\left[ -\log{q(y\vert x)}\right]~.</script>
<p>The optimization of this term enforces a variational probabilistic inversion (Bayesian inference) based on the forward KL divergence. The analogy with the original cycle-consistency loss can be seen by parameterizing $q(y\vert x)$ as a diagonal Gaussian with unit variance and mean given by the deep network $f(x)$, and $h(x\vert y)$ as the deterministic network $g(y)$. This leads to the cycle-consistency loss</p>
<script type="math/tex; mode=display">E_{x,y \sim h(x\vert y)k(y)}\left[ -\log{q(y\vert x)}\right] = E_{y \sim k(y)}\left[ \tfrac{1}{2} \left\lVert y - f(g(y)) \right\rVert_2^2 \right] + \text{const}~.</script>
<p>Note that this parameterization is not consistent since it assumes $q(y\vert x)$ to be Gaussian when used in one direction and deterministic when used in the other direction. Nevertheless the probabilistically correct way of modeling the conditionals symmetrically leads to a very similar kind of loss. Analogously, the adversarial losses follow from the following term of the symmetrized KL divergence:</p>
<script type="math/tex; mode=display">E_{x,y \sim q(y\vert x) k(x)}\left[ -\log{\frac{q(y\vert x)}{k(y)}}\right]~.</script>
<p>This term is the expectation of a log likelihood ratio and can be converted into an adversarial loss using <a href="http://blog.shakirm.com/2018/01/machine-learning-trick-of-the-day-7-density-ratio-trick/">this very neat trick</a>. Finally, the remaining terms of the symmetrized KL are entropy terms and can be interpreted as regularization.</p>
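<p>The symmetrized KL divergence is straightforward to compute for discrete joint distributions, which makes its symmetry and non-negativity easy to check. A small NumPy sketch with two hypothetical $2 \times 2$ joints over $(x, y)$ with strictly positive entries:</p>

```python
import numpy as np

def kl(p, q):
    # KL divergence between two discrete (joint) distributions.
    p, q = p.ravel(), q.ravel()
    return np.sum(p * np.log(p / q))

def sym_kl(p1, p2):
    return 0.5 * (kl(p1, p2) + kl(p2, p1))

p1 = np.array([[0.4, 0.1],
               [0.1, 0.4]])        # strong dependence between x and y
p2 = np.array([[0.25, 0.25],
               [0.25, 0.25]])      # independent uniform joint

print(sym_kl(p1, p1))              # identical joints: 0.0
print(sym_kl(p1, p2) > 0)          # different joints: True
```
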
<h3 id="lossy-cycle-consistency">Lossy cycle-consistency</h3>
<p>Some strange behaviors of cycle GANs become clear in light of the probabilistic viewpoint. An interesting example is the sneaky way in which cycle GANs store information in situations where we would have expected to have an information loss. For example, cycle GANs are often used to transform satellite images to stylized maps, as shown in the following figure:</p>
<p><img src="/images/joint-contrastive-inference-and-cycle-gans-3.png" alt="Joint-contrastive-inference-and-cycle-GANs-3.png" /></p>
<p>This raises a question: How is it possible that cycle GANs can reach near perfect cycle consistency when the stylized map doesn’t seem to contain enough information to recover a realistic satellite image? The answer has been found in <a href="https://arxiv.org/pdf/1712.02950.pdf">this lovely named paper</a> and it is very interesting. It turns out that the network sneakily encodes the key for its inversion in unnoticeable high-frequency components! The reason for this behavior is obvious in light of the probabilistic point of view. The transformation $q(y\vert x)$ from realistic images to stylized maps is many-to-one since several details should be discarded. In other words, $q(y\vert x)$ is lossy, meaning that it reduces the information content of the image. This implies that its probabilistic inverse $h(x\vert y)$ cannot be a deterministic function. However, the conventional cycle GAN training assumes both transformations to be deterministic, and this forces the generators to encode the information that should be discarded in a form that is not noticed by the discriminator. A more appropriate lossy cycle GAN can be obtained by parameterizing $h(x\vert y)$ as a random function and using the forward amortized loss in place of the original cycle-consistency loss.</p>
23 Jul 2018
https://mindcodec.ai/2018/07/23/joint-contrastive-inference-and-cycle-gans/
Joint-contrastive inference and model-based deep learning

<p>In this post I will discuss <a href="http://www.inference.vc/variational-inference-using-implicit-models-part-iii-joint-contrastive-inference-ali-and-bigan/">joint-contrastive variational inference</a>, a new form of stochastic variational inference that is gaining traction in the machine learning community. In this and in following posts, I will use the joint-contrastive inference framework to show that several commonly used deep learning methods are actually Bayesian inference methods in disguise. This post is mostly based on <a href="https://arxiv.org/abs/1805.11542">this paper from our lab about the connection between variational inference and model-based deep learning</a>. In future posts I will cover several other papers using the joint-contrastive framework.</p>
<p>In the last decade, deep neural networks have revolutionized machine learning with their surprising generalization properties. Deep learning methods pushed forward the state of the art in most machine learning problems, such as image recognition and machine translation. The rise of deep learning also led several researchers to abandon Bayesian methods and to revert to the deterministic and maximum likelihood methods that were an integral part of the neural network tradition. This happened in my lab as well and led to great works such as <a href="http://www.jneurosci.org/content/35/27/10005.short">this one</a> about the connection between deep learning and the human visual system. However, there isn’t any incompatibility between deep learning and Bayesian inference. Deep parametric models, stochastic gradient descent and backpropagation are just tools that can be used for constructing and training any kind of machine learning model. The key conceptual technology leading to the integration of Bayesian inference and deep learning is stochastic variational inference.</p>
<h2 id="conventional-deep-learning-as-model-free-inference">Conventional deep learning as model-free inference</h2>
<p>Deep neural networks are trained to approximate complex functional relationships between paired data. Consider a meteorological example. Let $y$ be ground measurements of the state of the earth atmosphere and $x$ be a set of images of our planet taken from a satellite. I will denote the empirical distribution of experimentally collected pairs $(x,y)$ as $k(x,y)$. We can train a deep neural network for recovering the probability of the ground measurements from the images. If our aim is to recover the full conditional distribution instead of a single point-estimate, the loss will have the following form:</p>
<script type="math/tex; mode=display">\mathcal{L}[w] = -E_{(x,y)\sim k(x,y)}\!\left[ \log{\mathfrak{q}(y\vert g_w(x))} \right]~,</script>
<p>where $\mathfrak{q}(y\vert g_w(x))$ is a probability distribution parameterized by the output $g_w(x)$ of a neural network. A common choice for $\mathfrak{q}$ is a diagonal Gaussian parameterized by a vector of means and variances. This is a form of model-free probabilistic inference. The architecture of the deep network usually does not encode any information about the process that generated the data and the causal relationships between variables. Conversely, the functional association is directly learned from the data by leveraging a huge training set.</p>
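As a toy illustration, here is a minimal numpy sketch of this kind of loss, written as a negative log-likelihood to be minimized. The array <code>net_out</code> is a stand-in for the network output $g_w(x)$ (first half means, second half log-variances); the architecture itself is not the point here.

```python
import numpy as np

def gaussian_nll(y, mean, log_var):
    """Per-pair negative log-density of a diagonal Gaussian q(y | g_w(x))."""
    return 0.5 * np.sum(log_var + (y - mean) ** 2 / np.exp(log_var)
                        + np.log(2 * np.pi), axis=-1)

def loss(y_batch, net_out):
    # net_out stands in for g_w(x): first half means, second half log-variances.
    d = y_batch.shape[-1]
    mean, log_var = net_out[..., :d], net_out[..., d:]
    return np.mean(gaussian_nll(y_batch, mean, log_var))

# Tiny check: for mean 0 and unit variance, the NLL of y = 0 is
# 0.5 * log(2*pi) per dimension.
y = np.zeros((1, 2))
out = np.zeros((1, 4))          # means [0, 0], log-variances [0, 0]
print(loss(y, out))             # = log(2*pi) ≈ 1.8379 for the two dimensions
```

Parameterizing log-variances rather than variances keeps the network output unconstrained, which is the usual choice for Gaussian heads.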
<h2 id="bayesian-statistics-as-model-based-inference">Bayesian statistics as model-based inference</h2>
<p>What is Bayesian inference? To put it simply, it is a mathematically sound method for model-based probabilistic reasoning. Consider again the meteorological example. Let $z$ be the state of the earth atmosphere and $x$ a satellite picture. The generative model is given by a factorized joint distribution of the variables:</p>
<script type="math/tex; mode=display">p(x,z) = p(x\vert z) p(z)~.</script>
<p>The most interesting feature of this expression is that it can be interpreted causally. The prior $p(z)$ describes our knowledge of the dynamics of the atmosphere as encoded by known <a href="https://en.wikipedia.org/wiki/Primitive_equations">fluid-dynamic equations</a>. Analogously, $p(x\vert z)$ encapsulates our knowledge about the causal relation between the atmospheric state and the generation of the images. This source of knowledge can for example be encoded in a 3D simulator that generates CG images. Bayes’ rule is a deceptively simple formula for combining the data $x$ and our knowledge of the process in order to infer the state of the latent variable $z$:</p>
<script type="math/tex; mode=display">p(z\vert x) = \frac{p(x\vert z) p(z)}{p(x)}~.</script>
<p>Does this mean we are done? We have the formula for the optimal model-based inference right here; who needs deep learning! Unfortunately, it is not so easy. Firstly, the term $p(x)$ is a high-dimensional integral that we are usually not able to solve:</p>
<script type="math/tex; mode=display">p(x) = \int_{\text{Space of all states of the atmosphere}} p(x\vert z) p(z) dz~.</script>
<p>Furthermore, in our example we cannot even compute the probabilities $p(x\vert z)$ and $p(z)$ since they come from very complex models. These distributions are said to be <a href="https://arxiv.org/abs/1702.08235">implicit</a> in the probabilistic inference jargon since we can sample from them but we cannot evaluate the probabilities.</p>
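In a one-dimensional toy model the evidence integral can still be brute-forced by naive Monte Carlo, which also makes clear why this approach collapses in high dimensions (the likelihood of almost every prior sample would be vanishingly small). A hedged numpy sketch, assuming a simple conjugate model $p(z) = \mathcal{N}(0,1)$, $p(x\vert z) = \mathcal{N}(z,1)$ in place of the atmosphere example:

```python
import numpy as np

rng = np.random.default_rng(1)

def prior_sample(n):           # p(z) = N(0, 1)
    return rng.normal(0.0, 1.0, n)

def likelihood(x, z):          # p(x | z) = N(z, 1)
    return np.exp(-0.5 * (x - z) ** 2) / np.sqrt(2 * np.pi)

# Naive Monte Carlo estimate of p(x) = integral of p(x|z) p(z) dz.
x = 0.7
z = prior_sample(200_000)
p_x_mc = np.mean(likelihood(x, z))

# For this conjugate toy model, p(x) = N(x; 0, 2) in closed form.
p_x_true = np.exp(-0.25 * x ** 2) / np.sqrt(4 * np.pi)
print(p_x_mc, p_x_true)   # the two agree to roughly three decimals
```

In one dimension 200,000 samples are plenty; for an atmospheric state with millions of dimensions, no feasible number of prior samples would ever land near the data.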
<h2 id="stochastic-variational-inference">Stochastic variational inference</h2>
<p>Since deep learning techniques are centered around non-convex optimization, it is not surprising that variational inference is the major force behind the unification between Bayesian inference and deep learning. In fact, variational inference is a family of methods that turn Bayesian inference problems into (usually non-convex) optimization problems. In variational inference, an approximate Bayesian posterior distribution is obtained by minimizing a statistical divergence between the intractable posterior $p(z\vert x)$ and a parameterized variational approximation $q_w(z\vert x)$:</p>
<script type="math/tex; mode=display">D\!\left(q_w(z\vert x), p(z\vert x)\right)~,</script>
<p>where $w$ is a set of parameters. Statistical divergences measure the dissimilarity between probability distributions. The most commonly used variational loss is the (reverse) KL divergence:</p>
<script type="math/tex; mode=display">D_{KL}\!\left(q_w(z\vert x), p(z\vert x)\right) = E_{z \sim q_w(z\vert x)}\!\left[ \log{q_w(z\vert x)} - \log{p(z\vert x)} \right]~.</script>
<p>At first glance this expression seems to be intractable since we cannot evaluate the true posterior $\log{p(z\vert x)}$. Fortunately, since $\log{p(z\vert x)} = \log{p(z,x)} - \log{p(x)}$, we can decompose the KL divergence as follows:</p>
<script type="math/tex; mode=display">D_{KL}\!\left(q_w(z\vert x), p(z\vert x)\right) = -\text{ELBO}(w,x) + \log{p(x)}~,</script>
<p>where the first term is the <em>evidence lower bound</em> (ELBO):</p>
<script type="math/tex; mode=display">\text{ELBO}(w,x) = -E_{z \sim q_w(z\vert x)}\!\left[\log{q_w(z\vert x)} - \log{p(z,x)} \right]~.</script>
<p>Note that $\log{p(x)}$ does not depend on $w$. Consequently, maximizing the ELBO is equivalent to minimizing the reverse KL divergence between the variational approximation and the real posterior distribution.</p>
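For a simple conjugate model every quantity above is available in closed form, so the decomposition can be checked numerically. The sketch below is a toy numerical check (assuming $p(z) = \mathcal{N}(0,1)$ and $p(x\vert z) = \mathcal{N}(z,1)$, so that $p(x) = \mathcal{N}(x; 0, 2)$ and the posterior is $\mathcal{N}(x/2, 1/2)$): it estimates the ELBO by Monte Carlo and compares it with $\log p(x)$ minus the KL divergence to the exact posterior.

```python
import numpy as np

rng = np.random.default_rng(2)

x = 0.7                       # observed data point
m, s = 0.1, 0.8               # variational approximation q(z) = N(m, s^2)

def log_norm(v, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var)

# Monte Carlo ELBO: E_{z~q}[log p(x|z) + log p(z) - log q(z)].
z = rng.normal(m, s, 500_000)
elbo = np.mean(log_norm(x, z, 1.0) + log_norm(z, 0.0, 1.0)
               - log_norm(z, m, s ** 2))

# Closed forms for this conjugate model: p(x) = N(0, 2), posterior N(x/2, 1/2).
log_px = log_norm(x, 0.0, 2.0)
mu_post, var_post = x / 2, 0.5
kl = (np.log(np.sqrt(var_post) / s)
      + (s ** 2 + (m - mu_post) ** 2) / (2 * var_post) - 0.5)

print(elbo, log_px - kl)      # the two agree up to Monte Carlo error
```

Since $\log p(x)$ is fixed by the data, pushing the ELBO up by changing $(m, s)$ necessarily pushes the KL term down, which is the equivalence stated above.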
<h3 id="amortized-inference">Amortized inference</h3>
<p>Maximizing the ELBO is somewhat wasteful since we need to re-optimize a variational posterior for each satellite image. This is where deep neural networks come into play. We can parameterize the whole variational conditional distribution using a deep network $g_w(x)$:</p>
<script type="math/tex; mode=display">q_w(z\vert x) = \mathfrak{q}(z\vert g_w(x))~,</script>
<p>where $\mathfrak{q}(z\vert g_w(x))$ is a distribution over $z$ parameterized by $g_w(x)$. Therefore, the deep network $g_w(x)$ maps each image $x$ to the variational posterior $q_w(z\vert x)$. In order to train $g_w(x)$ we can use an amortized variational loss obtained by averaging the negative ELBO over all the images in the training set:</p>
<script type="math/tex; mode=display">\mathcal{L}_{AVI}[w] = -E_{x \sim k(x)} \!\left[\text{ELBO}(w,x) \right] = E_{z,x \sim q_w(z\vert x)k(x)}\!\left[\log{q_w(z\vert x)} - \log{p(z,x)} \right]~.</script>
<p>This expression looks very similar to the kind of loss function that we are used to minimizing in deep learning. The main difference is that one of the variables is sampled from a generative model instead of being sampled from the empirical distribution of some dataset. In this sense, we can see that Bayesian variational inference is analogous to a form of model-based deep learning where we use a generative model instead of a set of paired data-points.</p>
<h2 id="joint-contrastive-variational-inference">Joint-contrastive variational inference</h2>
<p>I will call this form of variational inference posterior-contrastive since the divergence to minimize measures the difference between the posterior $p(z\vert x)$ and the variational approximation of the posterior. This kind of terminology was introduced in <a href="https://arxiv.org/abs/1702.08235">this neat paper</a> and in <a href="http://www.inference.vc/variational-inference-using-implicit-models-part-iii-joint-contrastive-inference-ali-and-bigan/">this insightful blog post</a>. Joint-amortized variational inference was introduced in <a href="https://arxiv.org/abs/1606.00704">the adversarially learned inference (ALI) paper</a>. Since there is already quite some material concerning the use of adversarial methods for variational inference, I will focus on non-adversarial methods. The loss of joint-contrastive variational inference is a divergence between the model joint distribution and a joint variational distribution:</p>
<script type="math/tex; mode=display">D\!\left(p(z,x), q(z,x)\right)~.</script>
<p>Without further constraints the minimization of this loss functional is not particularly useful as the model joint $p(z,x)$ is usually tractable and does not need to be approximated. The key idea for approximating the intractable posterior $p(z\vert x)$ by minimizing this loss is to factorize the variational joint as the product of a variational posterior $q_w(z\vert x)$ and the sampling distribution of the data:</p>
<script type="math/tex; mode=display">q(x,z) = q_w(z\vert x)k(x)~.</script>
<p>Given this factorization, the minimization of the joint-contrastive loss approximates the model posterior with $q_w(z\vert x)$. Furthermore, if the generative model $p(x)$ has some free parameters, this minimization will also fit $p(x)$ to the sampling distribution of the data.</p>
<h3 id="why-is-the-joint-contrastive-loss-useful">Why is the joint-contrastive loss useful?</h3>
<p>Now the reader might wonder why we should take the trouble to use a joint-contrastive loss when the posterior-contrastive one works just fine. There are two main reasons for that. First, the posterior-contrastive loss requires taking a divergence with respect to the intractable true posterior distribution. In the case of the reverse KL divergence we got lucky, since the intractable normalization constant $p(x)$ does not affect the gradient. Unfortunately, this magic does not happen for almost any other divergence. A common example is the forward KL divergence:</p>
<script type="math/tex; mode=display">D_{fKL}\!\left( p(z\vert x), q_w(z\vert x)\right) = E_{z \sim p(z\vert x)}\!\left[ \log{p(z\vert x)} - \log{q_w(z\vert x)} \right]~.</script>
<p>In this expression we need to sample the latent variable from $p(z\vert x)$. Needless to say, if we could sample from the posterior we would not need variational inference in the first place. This is a pity since the forward KL divergence has some very nice properties. For example, it is optimized by the exact marginal distributions even when the variational approximation is fully factorized (mean field approximation). The second reason for favoring a joint-contrastive variational loss is that it naturally leads to inference amortization.</p>
<h3 id="amortized-inference-revisited">Amortized inference revisited</h3>
<p>In the previous section I presented the amortized inference loss $\mathcal{L}_{AVI}[w]$ without any theoretical motivation. I will now show that $\mathcal{L}_{AVI}[w]$ follows directly from a joint-contrastive inference loss. Consider the following joint-contrastive loss:</p>
<script type="math/tex; mode=display">D_{rKL}\!\left( p(z,x), q_w(z,x)\right) = E_{z,x \sim q_w(z\vert x)k(x)}\!\left[ \log{q_w(z,x)} - \log{p(z,x)}\right]</script>
<script type="math/tex; mode=display">= E_{x \sim k(x)}\!\left[E_{z \sim q_w(z\vert x)}\!\left[\log{q_w(z\vert x)} - \log{p(z,x)}\right]\right] + E_{x \sim k(x)}\!\left[ \log{k(x)} \right]</script>
<script type="math/tex; mode=display">= \mathcal{L}_{AVI}[w] - \mathcal{H}_x~,</script>
<p>where $\mathcal{H}_x$ is the differential entropy of the sampling distribution $k(x)$, which does not depend on the parameters $w$. Therefore, the reverse KL joint-contrastive loss has the same gradient as the amortized ELBO and it leads to the same optimization problem.</p>
<h3 id="forward-amortized-inference-and-simulation-based-deep-learning">Forward amortized inference and simulation-based deep learning</h3>
<p>We can now try to use the forward KL divergence in a joint-contrastive inference loss:</p>
<script type="math/tex; mode=display">D_{fKL}\!\left( p(z,x), q_w(z,x)\right) = E_{z,x \sim p(x\vert z)p(z)}\!\left[\log{p(z,x)} - \log{q_w(z\vert x) k(x)}\right]~.</script>
<p>Note that neither $\log{p(z,x)}$ nor $k(x)$ depend on $w$. Therefore, we can minimize the forward KL divergence by minimizing the following loss:</p>
<script type="math/tex; mode=display">\mathcal{L}_{FAVI}[w] = -E_{z,x \sim p(x\vert z)p(z)}\!\left[\log{q_w(z\vert x)}\right]~.</script>
<p>This loss function has a very intuitive interpretation as a form of simulation-based deep learning. The pairs $(z,x)$ are sampled from the generative model. In our example, the atmospheric state $z$ is obtained by integrating fluid-dynamic equations, while the images $x$ are generated by a 3D environment that takes the state of the atmosphere as input and outputs a picture of our planet. The network is then trained to predict the probability of the atmospheric state given an image. Since the generative model can simulate as many data-points as needed, the network parameterizing $q_w(z\vert x)$ cannot overfit the training set.</p>
<p>The forward loss has a very important advantage over other forms of variational inference: in order to evaluate it, you only need to be able to sample from the generative model. Conversely, if you want to evaluate the reversed loss you need to explicitly evaluate the probability $p(z,x)$. This can be extremely difficult if your generative model is not defined in terms of simple mathematical formulas, as in the case of our 3D simulator generating images of our planet. Such generative models are said to be <a href="https://arxiv.org/abs/1702.08235">implicit</a> and are the focus of much research in machine learning. Most existing variational inference methods that can be used with implicit models are based on <a href="https://arxiv.org/abs/1406.2661">adversarial training</a>. Unfortunately, adversarial training only works when the generator is a differentiable program. You cannot just create a simulator in a 3D engine like <a href="https://unity3d.com/">Unity</a> and perform adversarial inference on it; you would need to re-implement Unity in something like <a href="https://www.tensorflow.org/">TensorFlow</a> and be sure to only use differentiable functions! Conversely, you can use forward amortized inference with any kind of simulator straight away, without any extra implementation effort.</p>
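Because the FAVI loss is just supervised learning on simulated pairs, a few lines of numpy suffice for a toy version. The sketch below assumes the same conjugate simulator as before ($p(z) = \mathcal{N}(0,1)$, $p(x\vert z) = \mathcal{N}(z,1)$) and a linear-Gaussian variational family, for which minimizing $E[-\log q_w(z\vert x)]$ reduces to least-squares regression; these modeling choices are illustrative, not part of the post's method.

```python
import numpy as np

rng = np.random.default_rng(3)

# "Simulator": sample the latent z from the prior, then an observation x.
n = 20_000
z = rng.normal(0.0, 1.0, n)          # p(z) = N(0, 1)
x = z + rng.normal(0.0, 1.0, n)      # p(x|z) = N(z, 1)

# Variational family q_w(z|x) = N(a*x + b, s2). Minimizing the FAVI loss
# E[-log q_w(z|x)] over simulated pairs is least-squares regression here.
A = np.stack([x, np.ones(n)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, z, rcond=None)
s2 = np.mean((z - (a * x + b)) ** 2)

# The exact posterior for this model is N(x/2, 1/2), so we expect
# a ≈ 0.5, b ≈ 0.0 and s2 ≈ 0.5.
print(a, b, s2)
```

Note that nothing in the fit required evaluating $p(z,x)$: the simulator was only ever sampled, which is exactly the property that makes forward amortized inference work with implicit models.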
07 Jul 2018
https://mindcodec.ai/2018/07/07/joint-contrastive-inference-and-model-based-deep-learning/
Information is in the eye of the beholder

<p>The formalization of information theory is arguably one of the most important scientific advancements of the twentieth century. In some sense, the most important scientific breakthroughs of the last century, such as the discovery of DNA and quantum mechanics, can be interpreted in terms of information. Information theory also plays a major role in machine learning (ML) and statistical learning theory, since any learning machine can be seen as integrating prior information (e.g. Bayesian priors, structural constraints) with information extracted from its environment. In an ML task such as regression or classification, the environment can be formalized as a probability space $\Omega = (\mathcal{X},P)$ over a (possibly infinite) set of possible states $\mathcal{X} = \{x_1, x_2, \dots\}$. We can think of the usual training, validation and test sets as being sampled from this hypothetical environment. Following the definition given by Shannon, we can quantify the information content of a state $x$ as the negative logarithm of its probability:</p>
<script type="math/tex; mode=display">\mathfrak{I}(x) = -\log_2 p_\mathcal{X}(x)~.</script>
<p>The intuitive explanation is that rare events convey more information than common events. The Shannon entropy of the environment is defined as its average information content:</p>
<script type="math/tex; mode=display">\mathcal{H}[\Omega] = \langle \mathfrak{I}(x) \rangle = -\sum_k p_\mathcal{X}( x_k ) \log_2 p_\mathcal{X}( x_k )~.</script>
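Both quantities are straightforward to compute for a small discrete source; a quick numpy illustration:

```python
import numpy as np

def information(p):
    """Shannon information content of an outcome with probability p, in bits."""
    return -np.log2(p)

def entropy(probs):
    """Average information content of a discrete source, in bits."""
    probs = np.asarray(probs)
    return float(-np.sum(probs * np.log2(probs)))

# A rare event carries more information than a common one ...
print(information(0.01), information(0.5))   # ≈ 6.64 bits vs 1.0 bit

# ... and a uniform source maximizes entropy.
print(entropy([0.25] * 4))                   # 2.0 bits
print(entropy([0.7, 0.1, 0.1, 0.1]))         # ≈ 1.357 bits
```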
<p>While entropy is a good measure of the objective average information content of an environment, it is intuitively rather dissatisfying, since it assigns the highest values to environments that we as humans would consider completely devoid of meaningful information. For example, the space of white noise realizations has a much higher entropy than the space of human vocalizations. However, an untrained human can rarely discriminate between two instances of white noise, while she would be extremely sensitive to differences between two similar human vocalizations. Arguably, the human brain magnifies the variability in the human vocalization space and suppresses the variability in the white noise space. This consideration suggests a new definition of the concept of entropy that is intrinsically relational, involving an environment (information source) and a perceptual system.</p>
<h2 id="relational-entropy">Relational entropy</h2>
<p>In this post we will operationalize a perceptual system as a deterministic function $f:\mathcal{X} \rightarrow \mathcal{Y}$ that maps the environment set $\mathcal{X}$ into the set $\mathcal{Y}$ of internal (mental) states. For example, $f$ can be a (trained) deterministic classifier that maps each image $x$ to its class label $y = f(x)$. We can now define the relational entropy between the environment and the perceptual system $f$ as the Shannon entropy of the stochastic variable $y = f(x)$:</p>
<script type="math/tex; mode=display">\mathfrak{R}_f[\mathcal{X}] = \langle -\log_2 p_\mathcal{Y}\big(f(x)\big) \rangle = -\sum_k p_\mathcal{X}( x_k ) \log_2 p_\mathcal{Y}\big( f(x_k) \big)~.</script>
<p>The probability distribution over the internal states can be obtained using the following well known formula:</p>
<script type="math/tex; mode=display">p_\mathcal{Y}(y) = \sum_{x^* \in \{x \,\vert\, f(x) = y\}} p_\mathcal{X}(x^*) ~.</script>
<p>When the perceptual mapping is a one-to-one function, the relational entropy is equal to the Shannon entropy of the environment. Furthermore, the relational entropy is always equal to or smaller than the entropy of the environment, since a deterministic mapping cannot increase the Shannon entropy. Specifically, the relational entropy $\mathfrak{R}_f[\mathcal{X}]$ is smaller than $\mathcal{H}[\Omega]$ when the perceptual system is not one-to-one, meaning that it cannot completely discriminate all the states in the environment.</p>
<h2 id="relational-entropy-and-data-analysis">Relational entropy and data analysis</h2>
<p>From a signal processing perspective, one of the intuitively disappointing features of classical information theory is that the Shannon entropy of the data can only be reduced or left unaltered by deterministic data analysis. Here, for simplicity, we will conceptualize a data analysis procedure as a function $\theta: \mathcal{X} \rightarrow \mathcal{X}$ that maps an element $x$ of the environment into another element $x’$. For example, if the environment is composed of noise-corrupted images of human faces, the function could perform denoising, mapping each noise-corrupted image into its noise-free version. Since this mapping is clearly not one-to-one, the denoising procedure always reduces the entropy of the source. However, a human would surely be better able to discriminate between different faces after the denoising procedure, and it’s likely that the range of her brain responses would be increased. In the following, using a toy example, we will show that the relational entropy can actually increase as a consequence of data analysis. Consider an environment consisting of quadruplets of binary random variables:</p>
<script type="math/tex; mode=display">\boldsymbol{\epsilon} = (\tilde{\epsilon}_1, \tilde{\epsilon}_2, \xi_1, \xi_2)~.</script>
<p>The two “noise” variables $\xi_1$ and $\xi_2$ are statistically independent Bernoulli variables with probability 0.5. The first two variables are generated hierarchically by sampling the “noise-free” variables $\epsilon_1$ and $\epsilon_2$ and then by adding the noise as follows:</p>
<script type="math/tex; mode=display">\tilde{\epsilon}_k = \epsilon_k + \xi_k \pmod{2}~.</script>
<p>The “noise-free” variables are statistically dependent. Specifically:</p>
<script type="math/tex; mode=display">p(\epsilon_1 = 0, \epsilon_2 = 0) = 1/2~,</script>
<p>while all other configurations have a probability of $1/6$. It is easy to see that after the injection of the “noise” all the configurations of $\tilde{\epsilon}_1$ and $\tilde{\epsilon}_2$ have probability equal to $1/4$. Consider a creature with a perceptual system that can only use the first two variables. In particular, its perceptual system $f(\boldsymbol{\epsilon})$ outputs $1$ if $\tilde{\epsilon}_1 = \tilde{\epsilon}_2 = 0$ and outputs $0$ otherwise. The relational entropy of this environment/perceptual system pair is</p>
<script type="math/tex; mode=display">-\frac{1}{4} \log_2{\frac{1}{4}} - 3 \times \frac{1}{4} \log_2{\frac{3}{4}}\ \text{bits} \approx 0.81\ \text{bits}~.</script>
<p>Now consider the data analysis procedure $\theta(\boldsymbol{\epsilon}) = (\epsilon_1,\epsilon_2,0,0)$. (This operation is well defined since the second pair of variables contains all the “noise” information.) It is easy to see that this procedure reduces the Shannon entropy of the environment. However, the relational entropy is actually increased, since</p>
<script type="math/tex; mode=display">-\frac{1}{2} \log_2{\frac{1}{2}} - 3 \times \frac{1}{6} \log_2{\frac{1}{2}}\ \text{bits} = 1\ \text{bit}~.</script>
<p>Therefore, the data analysis procedure $\theta$ increases the relational entropy by approximately $0.2$ bits.</p>
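The toy calculation can be verified directly by enumerating the environment states. In the sketch below the "denoised" environment is represented by the noise-free pair alone, which is equivalent to applying $\theta$, since the last two coordinates become constant and carry no probability mass of their own.

```python
import itertools
import numpy as np

# Noise-free pair (e1, e2): p(0,0) = 1/2, the other three configurations 1/6.
p_clean = {(0, 0): 1/2, (0, 1): 1/6, (1, 0): 1/6, (1, 1): 1/6}

def f(e1, e2):
    """The creature's perceptual system: outputs 1 iff both variables are 0."""
    return int(e1 == 0 and e2 == 0)

def relational_entropy(p_states, percept):
    """R_f[X] = -sum_k p(x_k) log2 p_Y(f(x_k))."""
    p_y = {}
    for state, p in p_states.items():
        y = percept(*state)
        p_y[y] = p_y.get(y, 0.0) + p
    return -sum(p * np.log2(p_y[percept(*state)])
                for state, p in p_states.items())

# Noisy environment: e_tilde = e + xi (mod 2), with xi uniform Bernoulli(1/2).
p_noisy = {}
for (e1, e2), p in p_clean.items():
    for xi1, xi2 in itertools.product([0, 1], repeat=2):
        state = ((e1 + xi1) % 2, (e2 + xi2) % 2)
        p_noisy[state] = p_noisy.get(state, 0.0) + p / 4

print(relational_entropy(p_noisy, f))   # ≈ 0.811 bits (before "denoising")
print(relational_entropy(p_clean, f))   # ≈ 1.0 bit    (after "denoising")
```

The enumeration reproduces the two numbers above: the deterministic analysis step raises the relational entropy from about $0.81$ to $1$ bit even though it lowers the Shannon entropy.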
<h2 id="relational-entropy-in-deep-neural-networks">Relational entropy in deep neural networks</h2>
<p>Deep convolutional neural networks have often been considered simplified models of the human visual system. The following figure shows how the relational entropy of images (MNIST digits) with respect to either a trained or an untrained network changes as we add white noise:</p>
<p><img src="/images/information-is-in-the-eye-of-the-beholder-2.png" alt="Information-is-in-the-eye-of-the-beholder-2.png" /></p>
<p>Note that the entropy of the data increases as we add noise. In spite of this, the relational entropy of a trained convolutional network sharply decreases. Interestingly, the relational entropy of an equivalent randomly initialized network slightly increases.</p>
<h2 id="relational-mutual-information">Relational mutual information</h2>
<p>Our definition of relational entropy can be used for defining other important information theoretic quantities. The relational mutual information between the random variable $\mathcal{X}$ and the random variable $\mathcal{Y}$ through the perceptual system $f$ is defined as follows:</p>
<script type="math/tex; mode=display">\mathfrak{I}_f[\mathcal{X}, \mathcal{Y}] = \mathfrak{R}_f[\mathcal{X}] - \mathfrak{R}_f[\mathcal{X}\vert \mathcal{Y}]~,</script>
<p>where</p>
<script type="math/tex; mode=display">\mathfrak{R}_f[\mathcal{X}\vert \mathcal{Y}] = \sum_y \mathfrak{R}_f[\mathcal{X}\vert \mathcal{Y} = y] \, p(y)</script>
<p>is the conditional relational entropy. Relational mutual information captures the similarity between two data sources through the lens of a perceptual system. For example, two paired streams of natural images, one unaltered and the other corrupted with low-amplitude high-frequency noise, have high mutual information in relation to the human visual system. Conversely, two sources can have high mutual information but vanishing relational mutual information when the perceptual system is insensitive to the features of the sources which are statistically dependent.</p>
<h2 id="towards-more-human-like-machines">Towards more human-like machines</h2>
<p>As a field, machine learning is deeply rooted in statistics. It is probably fair to say that the most successful ML methods are glorified versions of methods developed in the statistical literature. Probability theory is the theoretical bedrock of statistics. Information theory, which naturally follows from probability theory, is the natural way to connect probability theory with the real world. Consequently, when a (probabilistic) ML method is derived from first principles it necessarily reflects the fundamental assumptions of probability theory and information theory. Unfortunately, as modern physics has amply shown, mathematical naturalness does not imply human intuitiveness. This is not a big problem in physics: an intuitive physical theory has neither more value nor more credibility than a counter-intuitive one. Lack of intuitiveness of the underlying principles is also not a problem in general ML. Nevertheless, it is arguably a big problem in AI, since our very definition of intelligence is rooted in human intuition. We could rephrase the Turing test as follows: a machine is intelligent if it can make a human think that it is indeed intelligent. While the goal of deriving ML and AI techniques from first principles is lofty, there is no guarantee that this will result in something that a human would consider intelligent. Instead of reverting to a mainly empirical approach, my suggestion is to work on new mathematical principles which are both rigorous and embed human intuition. In our opinion, the concept of relational entropy is a small step in this direction.</p>
01 Jul 2018
https://mindcodec.ai/2018/07/01/information-is-in-the-eye-of-the-beholder/
Using GANs for brain-reading

<p>The current wave of generative machine learning models for image synthesis is impressively powerful. The Generative Adversarial Network (GAN) algorithm in particular has become popular because it yields near photo-realistic images. While research into GANs is scientifically and aesthetically intriguing, it is still quite unclear which real-world tasks are out there where powerful generative models could turn out to be indispensable. Among <a href="https://github.com/nashory/gans-awesome-applications">those mentioned</a> in research discussions one area is often missed, likely because for most of us it sounds like an obscure parascience: <em>reconstructing</em> what is happening in a human visual system – what someone is seeing or imagining. This is a small but real neuroscience research area, and you can confidently call it a variant of <em>brain reading</em>. While the reconstruction problem is cool in itself, neuroscience has not ruled out that advances will pave the way for recording vivid memories or even dreams.</p>
<p>Reconstruction methods exist for any data collected along the visual pathway (e.g. for the retina and the LGN); however, here we focus on fMRI data collected from the brain’s visual cortex. It can be collected non-invasively from humans in an MRI scanner. Its pattern-like character – its basic measurements are continuous, let’s say <em>activation strengths</em> of 3D pixels (voxels) – is quite suitable for machine learning. Consequently, machine learning methods found their way into <a href="http://www.sciencedirect.com/science/article/pii/S0896627315004328">fMRI pattern analysis early</a>, and there are various approaches around for gaining insight into the workings of the human brain, usually not far away from state-of-the-art methods (e.g. check out the popularity of <code class="highlighter-rouge">libsvm</code> in MVPA). Long before the current powerful generative models were around, <em>reconstruction</em> projects made use of machine learning methods and experimented with large data sets of visual system activity in response to visual stimuli. You may have seen the video reconstruction from a still very impressive study at one point, as it has appeared widely in media, in documentaries (most recently <a href="http://www.imdb.com/title/tt5275828/">Lo and Behold</a> by Werner Herzog) and even in <a href="http://www.imdb.com/title/tt1273722/">an episode of House M.D</a>. If not, please watch it:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/nsjDnYxJ0bo" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""> </iframe>
<p>The left clip shows a video presented to a subject in MRI. The right clip shows what researchers in Berkeley <a href="https://doi.org/10.1016/j.cub.2011.08.031">(Nishimoto, 2011)</a> reconstructed from the brain activity in response to the clip. The reconstruction method averages the most likely clips given the fMRI activations, from a very large library that did not contain the original video clip. The procedure is more complex – actually the aim of this research was testing whether a hypothesized Gabor-feature based representation (an encoding) is indeed used for representing the external world in the visual system. Their method resulted in a predictive model for brain activity given the presented video clip described in its encoding (we explain this in a little more detail below). This allowed them to build a likelihood model given the brain activity for the clips they want to reconstruct (a test set left out during any training). The aim of the reconstruction was demonstrating how powerful their hypothesis for the real representation was. This experiment was done around 2010 and – as those following current machine learning research will know – thanks to the rediscovery of convolutional networks, the availability of large image data sets and powerful ideas such as adversarial training we have much more impressive image synthesis methods now. Can we use these models to improve reconstruction of visual system content? In 2017, a handful of new studies that point towards the feasibility of this idea were published. We will briefly explain each of them in this post.</p>
<h3 id="reconstructing-highly-detailed-faces">Reconstructing highly-detailed faces</h3>
<p>Researchers in our group recently presented an approach to solve the reconstruction problem by combining probabilistic inference with deep learning <a href="http://papers.nips.cc/paper/7012-reconstructing-perceived-faces-from-brain-activations-with-deep-adversarial-neural-decoding">(Güçlütürk, 2017)</a>. They tested the approach with an fMRI experiment (two subjects passively viewing face stimuli in an fMRI scanner), showing that it can generate face reconstructions with a high amount of detail. In a nutshell, the approach first models the transformation from face images to fMRI responses, and then <em>inverts</em> it for reconstruction.</p>
<p><img src="/images/using-gans-for-brain-reading-2.gif" alt="Using-GANs-for-brain-reading-2.gif" /></p>
<p>The <em>images-responses transformation</em> is in principle the encoding idea also used in <a href="https://doi.org/10.1016/j.cub.2011.08.031">(Nishimoto, 2011)</a>. It consists of two parts: a transformation from the presented images to representation features, and a transformation from the representation features to the fMRI responses. The basic assumption of encoding models is that neurons represent the visual world via nonlinear latent features, and that fMRI linearly pools their responses. Encoding models seek to find the representation code of the brain (<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3037423/">Naselaris, 2011</a>).</p>
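<p>This two-part structure can be illustrated with a small sketch. Everything here is a toy assumption of ours (the sizes, names, and a random nonlinear feature map standing in for a real neural network), not the actual model from any of the cited papers; only the structure mirrors the text: nonlinear features, linearly pooled per voxel, fitted with ridge regression.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def nonlinear_features(images, W_feat):
    """Stand-in for a deep network: a fixed random nonlinear feature map."""
    return np.tanh(images @ W_feat)

# Toy data: 200 "images" of 64 pixels, 50 latent features, 30 voxels.
n_img, n_pix, n_feat, n_vox = 200, 64, 50, 30
W_feat = rng.standard_normal((n_pix, n_feat))
images = rng.standard_normal((n_img, n_pix))

# Simulated brain responses: linear pooling of nonlinear features plus noise.
B_true = rng.standard_normal((n_feat, n_vox))
features = nonlinear_features(images, W_feat)
responses = features @ B_true + 0.1 * rng.standard_normal((n_img, n_vox))

# Encoding model: ridge regression from features to every voxel at once.
lam = 1.0
B_hat = np.linalg.solve(features.T @ features + lam * np.eye(n_feat),
                        features.T @ responses)

# Predicted responses to the presented images.
predicted = features @ B_hat
```

<p>Each voxel gets its own weight vector over the shared nonlinear features, so fitting all voxels amounts to a single linear solve.</p>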
<p>In the face reconstruction model, the <em>images-representation features transformation</em> is modeled by passing the image through the VGG-Face convolutional neural network for face recognition to obtain a brain-like feature representation, following recent work that links representations learned by convolutional neural networks to neural representations (<a href="http://www.pnas.org/content/111/23/8619.long">Yamins, 2014</a>; <a href="http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003915">Khaligh-Razavi, 2014</a>; <a href="http://www.jneurosci.org/content/35/27/10005">Güçlü, 2015</a>). This feature representation is then further compressed with PCA, which has proven beneficial. The <em>representation features-response transformation</em> is modeled with standard linear regression, under the assumption that latent features and brain responses are Gaussian. At this point we have an encoding model that can predict brain responses to presented images; but how do we invert this to get to a reconstruction? First of all, the Gaussianity assumption makes it possible to derive a simple closed-form inverse of the <em>features-response</em> transformation: we can obtain the most likely latent representation features for the presented faces via maximum a posteriori estimation given the measured brain activity. The tricky part is inverting the initial transformation, i.e. going from representation features back to the presented face images. For this, we train a GAN in the typical set-up, with a generator and a discriminator competing against each other. The goal of the discriminator is to distinguish presented face images from reconstructed face images; the goal of the generator is to synthesize the reconstructions, taking the already estimated latent feature representation as input. Once this GAN is trained, we can use its generator to obtain reconstructed face images from the latent features estimated in the <em>response-features transformation</em>, which closes the circle. The approach recovers several face image details from brain responses, including aspects such as gender, skin color and facial features, which is difficult to achieve with previous reconstruction methods such as averaging over a large database. The images below show some reconstructions, and the animation shows what happens when we use more and more principal components to compress the latent representation:</p>
<p><img src="/images/using-gans-for-brain-reading-3.gif" alt="Using-GANs-for-brain-reading-3.gif" /></p>
<p><img src="/images/using-gans-for-brain-reading-4.png" alt="Using-GANs-for-brain-reading-4.png" /></p>
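<p>Under Gaussianity, the features-to-responses inversion has a closed form. A hedged toy sketch (with an identity prior on the features and isotropic noise, which only loosely mirrors the paper's actual derivation): for responses $y = Bz + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I)$ and prior $z \sim N(0, I)$, the MAP estimate is $z^{*} = (B^{\top}B + \sigma^2 I)^{-1} B^{\top} y$.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

n_feat, n_vox = 20, 100
# Learned linear encoding: responses = B @ z + noise.
B = rng.standard_normal((n_vox, n_feat))
sigma2 = 0.01  # assumed noise variance

# Simulate measured responses for a "presented face" with latent z_true.
z_true = rng.standard_normal(n_feat)
y = B @ z_true + np.sqrt(sigma2) * rng.standard_normal(n_vox)

# Closed-form MAP estimate of the latent features given the responses:
# z_map = (B^T B + sigma2 * I)^{-1} B^T y
z_map = np.linalg.solve(B.T @ B + sigma2 * np.eye(n_feat), B.T @ y)
```

<p>With more voxels than latent features and low noise, this estimate recovers the latent representation almost exactly, which is what makes the GAN generator's job feasible.</p>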
<p>Furthermore, thanks to adversarial training, the reconstructions also achieve the photo-realistic quality we hoped for. Given these successes, one wonders to what extent such models already allow reconstructing arbitrary naturalistic images. Natural images have specific statistical properties (e.g. the distribution of edges), and one thing GANs learn quite well is exactly such distributions. They could thus be a good prior for generating all kinds of natural images, too.</p>
<h3 id="reconstructing-naturalistic-grayscale-images">Reconstructing naturalistic grayscale images</h3>
<p>Another project from our lab <a href="https://doi.org/10.1101/226688">(Seeliger, 2017)</a> aimed at reconstructing natural grayscale photos presented in the scanner. For this, a regular deep convolutional GAN was trained on a grayscale version of ImageNet. Random generations of a deep convolutional GAN trained on a large database of grayscale photos turn out quite aesthetic, by the way:</p>
<p><img src="/images/using-gans-for-brain-reading-5.png" alt="Using-GANs-for-brain-reading-5.png" /></p>
<p>As the latent space of deep convolutional GANs appears to learn structure, we attempted to predict it from brain activity with a straightforward linear model. The learning objective was to reduce the distance, in pixel space and in lower-level convolutional neural network feature space, between the currently predicted reconstruction and the actually presented image (using higher-level features can lead to category matching or decoding, which is no longer an arbitrary reconstruction). The predicted reconstruction is obtained by passing the predicted latent vector through the previously trained GAN. Applying the same procedure to limited domains such as handwritten characters (with a GAN trained on the same set) leads to structurally almost perfect reconstructions:</p>
<p><img src="/images/using-gans-for-brain-reading-6.png" alt="Using-GANs-for-brain-reading-6.png" /></p>
<p>The near-infinite space of possible naturalistic images poses a much harder problem, though. Nevertheless, many reconstructions achieve considerable similarity to the presented image:</p>
<p><img src="/images/using-gans-for-brain-reading-7.png" alt="Using-GANs-for-brain-reading-7.png" /></p>
<p><img src="/images/using-gans-for-brain-reading-8.png" alt="Using-GANs-for-brain-reading-8.png" /></p>
<p>This is not the case for all reconstructions, though. As we mention in the manuscript, using the GAN as the main basis may both support and hinder reconstructing naturalistic images, since any small change of the predicted latent vector can strongly influence the result.</p>
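<p>The latent-prediction objective described above can be sketched in miniature. Everything below is a stand-in of ours: a linear "generator" instead of the trained GAN, a random linear map instead of low-level CNN features, and simulated voxel data. Only the structure of the loss mirrors the text: pixel distance plus low-level feature distance between the predicted reconstruction and the presented image, minimized over the weights of a linear brain-to-latent model.</p>

```python
import numpy as np

rng = np.random.default_rng(2)

n_lat, n_pix, n_vox, n_img = 8, 32, 40, 100
# Frozen "generator" (linear stand-in for the trained GAN) and a toy
# "low-level feature" map (stand-in for early CNN features).
G = rng.standard_normal((n_pix, n_lat)) / np.sqrt(n_lat)
F = rng.standard_normal((4, n_pix)) / np.sqrt(n_pix)

# Simulated training data: voxel responses X to images made from true latents.
Z_true = rng.standard_normal((n_lat, n_img))
imgs = G @ Z_true
A = rng.standard_normal((n_vox, n_lat))
X = A @ Z_true + 0.05 * rng.standard_normal((n_vox, n_img))

# Linear model: predicted latent z = W @ x, trained by gradient descent on
# L(W) = ||G(WX) - imgs||^2 + ||F(G(WX)) - F(imgs)||^2.
W = np.zeros((n_lat, n_vox))
lr = 1e-4
for _ in range(4000):
    R = G @ (W @ X) - imgs                                # pixel residual
    grad = 2 * (G.T @ R + G.T @ F.T @ (F @ R)) @ X.T      # pixel + feature terms
    W -= lr * grad / n_img

# Reconstructions: predicted latents pushed through the frozen generator.
recon = G @ (W @ X)
```

<p>In this fully linear toy the gradient is analytic; with a real GAN and CNN the same objective is minimized by backpropagating through the frozen networks.</p>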
<h3 id="reconstructing-naturalistic-color-images">Reconstructing naturalistic color images</h3>
<p>The visual system appears to learn a hierarchical code for representing the external world, ranging from low-level edge detectors via patterns to abstract object representations. At the moment, this hierarchy is described best and most completely by the feature hierarchy learned inside convolutional neural networks. We made use of these similarities in the <em>image-feature representation transformation</em> of the face reconstruction model, using VGG-Face as a basis. Using the encoding model idea once more, you can find out which voxels respond to which deep neural network features. Then, by measuring the voxel responses for the images you want to reconstruct, you can derive a set of image features the reconstruction must have, essentially reading the target image features directly from the brain activity. This is what <a href="https://doi.org/10.1101/240317">(Shen, 2017)</a> did in their new preprint. After building the model that provided them with the target features, they took a noise image and changed it in small steps until the reconstructed photo had a similar distribution of convolutional neural network features. They also used a pre-trained GAN as a natural image statistics prior. The results are impressive, and you can observe the optimization trajectory here:</p>
<p>https://www.youtube.com/watch?v=jsp1KaM-avU&t=37s</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/jsp1KaM-avU?start=37" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""> </iframe>
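<p>The core of this optimization is iteratively changing an image until its features match target features read from brain activity. A minimal sketch, with a toy linear feature map of our own in place of the CNN (and without the multi-layer features and GAN prior Shen et al. additionally use):</p>

```python
import numpy as np

rng = np.random.default_rng(3)

n_pix, n_feat = 64, 48
# Toy feature map (stand-in for CNN feature layers).
Phi = rng.standard_normal((n_feat, n_pix)) / np.sqrt(n_pix)

target_img = rng.standard_normal(n_pix)   # the image the brain "saw"
target_feat = Phi @ target_img            # features read out via the encoding model

# Start from a noise image and descend the feature-matching loss
# L(x) = ||Phi @ x - target_feat||^2 in small steps.
x = rng.standard_normal(n_pix)
lr = 0.05
losses = []
for _ in range(300):
    diff = Phi @ x - target_feat
    losses.append(float(diff @ diff))
    x -= lr * 2 * Phi.T @ diff
```

<p>With a deep nonlinear feature map the same loop runs via backpropagation, and the GAN prior constrains each step to stay on the manifold of natural-looking images.</p>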
<p>They could achieve similar results with geometric shapes and letters. They also went one step further and tried reconstructing from visual imagery. This did not work for photos, but for imagined geometric shapes you can definitely see that there is something:</p>
<p>https://www.youtube.com/watch?v=b7kVwoN8Cx4</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/b7kVwoN8Cx4" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""> </iframe>
<p>All this said, neuroscience is far away from reading private thoughts. The media tend to misrepresent the current, realistic capabilities of this research. As any fMRI experimenter will confirm, getting clean data from such experiments requires hours in a noisy scanner without even millimeters of head motion, while paying attention to a video screen like a robot. What is more, if an evil power ever forces you to undergo an fMRI scan, e.g. for lie detection, you can spoil the data collection with subtle movements, by falling asleep, or by thinking about pink elephants. We may be much closer to recording vivid imagery or dreams, and there have been impressive early efforts in this direction <a href="https://www.ncbi.nlm.nih.gov/pubmed/23558170">(Horikawa, 2013)</a>. The video below shows which concept a person half-asleep in the MRI scanner had in mind, illustrated with a collection of example images for that category. After the “Awakening” screen you get to see their dream protocol.</p>
<p>https://www.youtube.com/watch?v=inaH_i_TjV4</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/inaH_i_TjV4" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""> </iframe>
<p>Recording memories, imagery and dreams are ambitious long-term goals of research into the human mind. In fact, the first scientists observing the detailed pattern-like structure of fMRI responses in the 1990s realized that they might have a window into the human brain in front of them. Imagine the possibilities of a working dream recorder: given the recent advances in machine learning, such powerful, science-transforming instruments may be just within reach.</p>
12 Jan 2018
https://mindcodec.ai/2018/01/12/using-gans-for-brain-reading/