The current wave of generative machine learning models for image synthesis is impressively powerful. Generative Adversarial Networks (GANs) in particular have become popular because they produce near photo-realistic images. While research into GANs is scientifically and aesthetically intriguing, it is still quite unclear for which real-world tasks powerful generative models could turn out to be indispensable. Among the applications mentioned in research discussions, one area is often missed, likely because to most of us it sounds like an obscure parascience: reconstructing what is happening in the human visual system, that is, what someone is seeing or imagining. This is a small but real neuroscience research area, and you can confidently call it a variant of brain reading. While the reconstruction problem is cool in itself, neuroscience has not ruled out that advances will pave the way for recording vivid memories or even dreams.

Reconstruction methods exist for data collected all along the visual pathway (e.g. for the retina and the LGN); here, however, we focus on fMRI data collected from the brain’s visual cortex. It can be collected non-invasively from humans in an MRI scanner. Its pattern-like character (its basic measurements are continuous values, let’s say activation strengths of 3D pixels, or voxels) makes it quite suitable for machine learning. Consequently, machine learning methods found their way into fMRI pattern analysis early, and there are various approaches around for gaining insight into the workings of the human brain, usually not far away from state-of-the-art methods (e.g. check out the popularity of libsvm in MVPA). Long before the current powerful generative models were around, reconstruction projects made use of machine learning methods and experimented with large data sets of visual system activity in response to visual stimuli. You may have seen the video reconstruction from one study that is still very impressive, as it has appeared widely in the media, in documentaries (most recently Lo and Behold by Werner Herzog) and even in an episode of House M.D. If not, please watch it:

The left clip shows a video presented to a subject in the MRI scanner. The right clip shows what researchers in Berkeley (Nishimoto, 2011) reconstructed from the brain activity in response to the clip. The reconstruction method averages the most likely clips given the fMRI activations, drawn from a very large library that did not contain the original video clip. The actual procedure is more complex: the aim of this research was to test whether a hypothesized Gabor-feature based representation (an encoding) is indeed used for representing the external world in the visual system. Their method resulted in a predictive model for brain activity given the presented video clip described in its encoding (we explain this in a little more detail below). This allowed them to build a likelihood model, given the brain activity, for the clips they wanted to reconstruct (a test set left out during training). The aim of the reconstruction was to demonstrate how powerful their hypothesis for the real representation was. This experiment was done around 2010 and, as those following current machine learning research will know, thanks to the rediscovery of convolutional networks, the availability of large image data sets and powerful ideas such as adversarial training, we have much more impressive image synthesis methods now. Can we use these models to improve reconstruction of visual system content? In 2017, a handful of new studies pointing towards the feasibility of this idea were published. We will briefly explain each of them in this post.
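
To make the "average the most likely clips" idea from the video study concrete, here is a minimal NumPy sketch. It is a toy illustration and not the pipeline of (Nishimoto, 2011): the random features, the isotropic Gaussian noise model, and the names `reconstruct_by_averaging`, `top_k` and `noise_var` are all assumptions made for this example.

```python
import numpy as np

def reconstruct_by_averaging(response, library_features, library_clips,
                             weights, top_k=10, noise_var=1.0):
    """Average the library clips whose predicted responses best match `response`.

    response:         (n_voxels,) measured fMRI pattern
    library_features: (n_clips, n_features) encoding of every library clip
    library_clips:    (n_clips, n_pixels) flattened library clips
    weights:          (n_features, n_voxels) fitted encoding-model weights
    """
    predicted = library_features @ weights               # (n_clips, n_voxels)
    sq_err = np.sum((predicted - response) ** 2, axis=1)
    log_lik = -0.5 * sq_err / noise_var                  # assumed isotropic Gaussian noise
    best = np.argsort(log_lik)[::-1][:top_k]             # most likely clips first
    return library_clips[best].mean(axis=0), best


# Synthetic stand-ins for the clip library and the encoding model (hypothetical data).
rng = np.random.default_rng(0)
n_clips, n_features, n_voxels, n_pixels = 500, 20, 50, 64
library_features = rng.normal(size=(n_clips, n_features))
library_clips = rng.normal(size=(n_clips, n_pixels))
weights = rng.normal(size=(n_features, n_voxels))

# Simulate a response to an unseen clip whose features are close to library clip 7.
unseen_features = library_features[7] + 0.1 * rng.normal(size=n_features)
response = unseen_features @ weights + 0.1 * rng.normal(size=n_voxels)

reconstruction, best = reconstruct_by_averaging(
    response, library_features, library_clips, weights, top_k=10)
print(reconstruction.shape, best[0])
```

The encoding model is only ever run forward here: each library clip is pushed through it, and the measured response decides which predictions count towards the average.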

Reconstructing highly-detailed faces

Researchers in our group recently presented an approach to solve the reconstruction problem by combining probabilistic inference with deep learning (Güçlütürk, 2017). They tested the approach with an fMRI experiment (two subjects passively viewing face stimuli in an fMRI scanner), showing that it can generate face reconstructions with a high amount of detail. In a nutshell, the approach first models the transformation from face images to fMRI responses, and then inverts it for reconstruction.

The images-responses transformation is in principle the encoding idea also used in (Nishimoto, 2011). It consists of two parts: A transformation from the presented images to the representation features, and a transformation from the representation features to the fMRI responses. The basic assumption of encoding models is that neurons represent the visual world as nonlinear latent features, and fMRI linearly pools their responses. Encoding models seek to find the representation code of the brain (Naselaris, 2011).
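
As a rough sketch of this two-part structure, the following NumPy example uses rectified random filters as a stand-in for the nonlinear feature stage and ridge regression for the linear feature-to-response stage. The filters, the ridge penalty and all function names are assumptions for illustration; real encoding models use learned or hand-designed features such as Gabor filters or convolutional network activations.

```python
import numpy as np

def nonlinear_features(images, filters):
    """Stand-in for the nonlinear stage: rectified linear filter outputs."""
    return np.maximum(images @ filters, 0.0)

def fit_linear_readout(features, responses, ridge=1.0):
    """Ridge regression from features to voxel responses (closed form)."""
    n_features = features.shape[1]
    gram = features.T @ features + ridge * np.eye(n_features)
    return np.linalg.solve(gram, features.T @ responses)

def predict_responses(images, filters, weights):
    """Full encoding model: image -> nonlinear features -> linear voxel pooling."""
    return nonlinear_features(images, filters) @ weights


# Simulated experiment (hypothetical sizes and data).
rng = np.random.default_rng(0)
n_train, n_test, n_pixels, n_features, n_voxels = 400, 100, 64, 20, 30
filters = rng.normal(size=(n_pixels, n_features)) / np.sqrt(n_pixels)
true_weights = rng.normal(size=(n_features, n_voxels))

train_images = rng.normal(size=(n_train, n_pixels))
test_images = rng.normal(size=(n_test, n_pixels))
train_responses = predict_responses(train_images, filters, true_weights)
train_responses += 0.1 * rng.normal(size=train_responses.shape)   # measurement noise

weights = fit_linear_readout(nonlinear_features(train_images, filters),
                             train_responses, ridge=1.0)
test_pred = predict_responses(test_images, filters, weights)
test_true = predict_responses(test_images, filters, true_weights)
print(np.corrcoef(test_pred.ravel(), test_true.ravel())[0, 1])
```

How well such a model predicts held-out responses is what lets researchers judge whether the chosen feature space is a plausible representation code.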

In the face reconstruction model, the transformation from images to representation features is modeled by passing the image through the VGG-Face convolutional neural network for face recognition to obtain a brain-like feature representation (following some recent work that links representations learned by convolutional neural networks to neural representations (Yamins, 2014; Khaligh-Razavi, 2014; Güçlü, 2015)). This feature representation is then further compressed with PCA, as this has proven beneficial. The transformation from representation features to responses is modeled with standard linear regression under the assumption that latent features and brain responses are Gaussian. At this point, we have an encoding model that can predict brain responses to presented images. But how do we invert it to get to the reconstruction? First of all, the Gaussianity assumption makes it possible to derive a simple closed-form solution for inverting the feature-to-response transformation. So, given the measured brain activity, we can obtain the most likely latent representation features for the presented faces with maximum a posteriori estimation. The tricky part is inverting the initial transformation, i.e. going from representation features to the presented face images. For this, we train a GAN in the typical set-up: a generator and a discriminator competing against each other. Here, the goal of the discriminator is to distinguish the presented face images from reconstructed face images. The goal of the generator is to synthesize the reconstructions, given the already estimated latent feature representation as input. Once this GAN is trained, we can use its generator to obtain reconstructed face images from the latent features estimated by inverting the feature-to-response transformation, which closes the circle.
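
The closed-form inversion step can be sketched for a generic linear Gaussian model. The exact parameterization in (Güçlütürk, 2017) may differ. The sketch assumes a standard normal prior on the latent features, independent Gaussian noise per voxel, and hypothetical names (`map_latent`, `B`, `noise_var`).

```python
import numpy as np

def map_latent(response, B, noise_var):
    """MAP estimate of z for r = B z + eps, z ~ N(0, I), eps ~ N(0, diag(noise_var)).

    Closed form: z_map = (I + B^T S^-1 B)^-1 B^T S^-1 r, with S = diag(noise_var).
    response: (n_voxels,), B: (n_voxels, n_latent), noise_var: (n_voxels,)
    """
    n_latent = B.shape[1]
    Bt_prec = B.T / noise_var                    # B^T S^-1, shape (n_latent, n_voxels)
    A = np.eye(n_latent) + Bt_prec @ B
    return np.linalg.solve(A, Bt_prec @ response)


# Simulated example (hypothetical sizes and data).
rng = np.random.default_rng(0)
n_voxels, n_latent = 200, 10
B = rng.normal(size=(n_voxels, n_latent))        # stand-in for fitted regression weights
noise_var = np.full(n_voxels, 0.5)
z_true = rng.normal(size=n_latent)
response = B @ z_true + np.sqrt(noise_var) * rng.normal(size=n_voxels)

z_map = map_latent(response, B, noise_var)
print(np.corrcoef(z_true, z_map)[0, 1])
```

In the full approach, the estimated latent vector would then be passed to the trained generator. That step is omitted here because it requires a trained GAN.
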
The approach recovers several face image details from brain responses, including aspects such as gender, skin color and facial features, which is difficult to achieve with previous reconstruction methods such as averaging over a large database. The images below show some reconstructions, and the animation shows what happens when we use more and more principal components for compressing the latent representation:

Furthermore, thanks to adversarial training, the reconstructions also achieve the photo-realistic qualities that we hoped for. Given these successes, one wonders to what extent such models already allow reconstructing arbitrary naturalistic images. Overall, natural images have specific statistical properties (e.g. the distribution of edges), and such distributions are one thing GANs learn quite well. They could thus be a good prior for generating all kinds of natural images, too.

Reconstructing naturalistic grayscale images

Another project from our lab (Seeliger, 2017) aimed at reconstructing natural grayscale photos presented in the scanner. For this, a regular deep convolutional GAN was trained on a grayscale version of ImageNet. Random generations of a deep convolutional GAN trained on a large database of grayscale photos turn out quite aesthetic, by the way:

As the latent space of deep convolutional GANs appears to learn structure, we attempted to predict it from brain activity with a straightforward linear model. The learning objective was to reduce the distance between the currently predicted reconstruction and the actually presented image, measured on pixels and on lower-level convolutional neural network image features (using higher-level features can lead to category matching / decoding, which is then no longer arbitrary reconstruction). The predicted reconstruction is obtained by passing the predicted latent vector through the previously trained GAN. Using the same procedure for limited domains such as handwritten characters (with a GAN trained to create the same set) leads to structurally almost perfect reconstructions:

The near-infinite space of possible naturalistic images is a much more difficult problem, though. Nevertheless, many reconstructions achieve a fair degree of similarity:

This is not the case for all reconstructions, though. As we mention in the manuscript, using the GAN as the main basis may both support and hinder the reconstruction of naturalistic images, as any small change in the predicted latent vector can strongly influence the result.
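
The training idea for the grayscale reconstructions (a linear model predicts the latent vector, a frozen generator turns it into an image, and the image error drives the update of the linear model) can be sketched with a toy generator. This is not the model from (Seeliger, 2017): the `tanh` generator replaces a trained deep convolutional GAN, only a pixel loss is used (the feature loss is left out), and all names and sizes are assumptions.

```python
import numpy as np

def generator(latents, G):
    """Frozen toy generator standing in for a pretrained GAN: tanh(z G)."""
    return np.tanh(latents @ G)

def fit_latent_predictor(responses, images, G, lr=0.2, n_steps=500):
    """Fit M so that generator(responses @ M) matches the images in pixel space."""
    n_voxels = responses.shape[1]
    M = np.zeros((n_voxels, G.shape[0]))
    losses = []
    for _ in range(n_steps):
        out = generator(responses @ M, G)
        diff = out - images
        losses.append(np.mean(diff ** 2))                       # pixel-wise MSE
        grad_pre = 2.0 * diff / diff.size * (1.0 - out ** 2)    # back through tanh
        grad_latents = grad_pre @ G.T                           # back through frozen G
        M -= lr * responses.T @ grad_latents                    # update the linear model only
    return M, losses


# Simulated data (hypothetical): images are produced from a hidden linear latent code.
rng = np.random.default_rng(0)
n, n_voxels, n_latent, n_pixels = 200, 30, 5, 50
G = rng.normal(size=(n_latent, n_pixels)) / np.sqrt(n_latent)
M_true = 0.1 * rng.normal(size=(n_voxels, n_latent))
responses = rng.normal(size=(n, n_voxels))
images = generator(responses @ M_true, G)       # simulated "presented" images

M, losses = fit_latent_predictor(responses, images, G)
print(losses[0], losses[-1])
```

Because the generator stays fixed, everything the reconstruction can express is limited to what the generator can produce, which is one way to see why the GAN can both support and hinder the result.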

Reconstructing naturalistic color images

The visual system appears to learn a hierarchical code for representing the external world, ranging from low-level edge detectors through patterns to abstract object representations. At the moment, this hierarchy is described best and most completely by the feature hierarchy learned inside convolutional neural networks. We made use of these similarities in the image-to-feature transformation of the face reconstruction model, using VGG-Face as a basis. Using the encoding model idea again, you can find out which voxels respond most similarly to specific deep neural network features. Then, by measuring the voxel responses to the images you want to reconstruct, you can derive a set of image features the reconstruction must have, essentially reading the target image features directly from the brain activity. This is what (Shen, 2017) did in their new preprint. After building the model that provided them with the target features, they took a noise image and changed it in small steps until the reconstructed photo had a similar distribution of convolutional neural network features. They also used a pre-trained GAN model as a natural image statistics prior. The results are impressive, and you can observe the optimization trajectory here:

They could achieve similar results with geometric shapes and letters. They also went one step further and tried reconstructing from visual imagery. This did not work for photos, but for imagined geometric shapes you can definitely see that there is something: