This is the second post of our series about joint-contrastive inference. I suggest reading our previous post and the seminal blog post by Ferenc Huszár for the required background. This post is partially based on the probabilistic reformulation of cycle-consistent GANs introduced in the paper *Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference*. In my opinion, this reformulation clearly shows the potential of the joint-contrastive inference framework, as it offers a very elegant theoretical derivation of a very popular deep learning method. The reformulation also suggests several possible further developments that I’ll cover at the end of this post.

## Cycle-consistent adversarial learning

Cycle-consistent adversarial training is one of the latest hot topics in deep learning. The aim of the original cycle GAN is to perform unpaired image-to-image translation. Imagine having a set $X = \{x_1, \dots, x_n\}$ of photographic portraits and a set $Y = \{y_1, \dots, y_n\}$ of painted portraits. The goal is to convert photos into paintings and *vice versa*. We can summarize the models as follows:

$$f: X \to Y, \qquad g: Y \to X,$$

where $f$ transforms photos into paintings and $g$ transforms paintings into photos. Unfortunately, the people portrayed in one set are not the same people portrayed in the other set. This rules out conventional supervised training. The idea is then to train each transformation independently using adversarial training with the two discriminators $D_1$ and $D_2$:
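
In the standard GAN form, and in this post's notation, the two objectives can be sketched as follows. The exact expressions are an assumption rather than a quotation; $\mathbb{E}_x$ and $\mathbb{E}_y$ denote averages over the photo set and the painting set, $D_1$ judges paintings and $D_2$ judges photos, and each objective is maximized over its discriminator and minimized over its generator:

$$
\begin{aligned}
\mathcal{L}_1(f, D_1) &= \mathbb{E}_{y}\big[\log D_1(y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D_1(f(x))\big)\big], \\
\mathcal{L}_2(g, D_2) &= \mathbb{E}_{x}\big[\log D_2(x)\big] + \mathbb{E}_{y}\big[\log\big(1 - D_2(g(y))\big)\big].
\end{aligned}
$$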

Optimizing the first loss minimizes the (Jensen-Shannon) divergence between the distribution of the transformed photos $f(x)$ and the real paintings $y$, while the second loss minimizes the divergence between the transformed paintings $g(y)$ and the real photos $x$. We still need one fundamental ingredient. We want to achieve image-to-image translation, and any good translator should be consistent. If you translate something into another language and then back into the original language, you should get something similar to what you started with. We can enforce this behavior with a cycle-consistent loss:
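
A common way to write it is sketched below. The choice of norm is an assumption here: the original cycle GAN paper uses the L1 norm, and other norms lead to the same qualitative behavior.

$$
\mathcal{L}_{\text{cyc}}(f, g) = \mathbb{E}_{x}\big[\lVert g(f(x)) - x \rVert_1\big] + \mathbb{E}_{y}\big[\lVert f(g(y)) - y \rVert_1\big].
$$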

This combination of losses defines the cycle GAN and works very well in practice, as you can see from this picture taken from the original paper.
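
To make the three ingredients concrete, here is a minimal sketch on one-dimensional toy data. Everything in it is hypothetical and chosen for illustration: the Gaussian "photos" and "paintings", the shift generators `f` and `g`, and the fixed logistic discriminators `d1` and `d2`. Nothing is trained; the code only evaluates the adversarial values and the L1 cycle-consistency loss.

```python
import math
import random

random.seed(0)

# Toy 1-D "photos" and "paintings": unpaired samples from two shifted domains.
photos = [random.gauss(0.0, 1.0) for _ in range(200)]
paintings = [random.gauss(3.0, 1.0) for _ in range(200)]

def f(x, shift=3.0):
    """Hypothetical photo -> painting generator (a simple shift)."""
    return x + shift

def g(y, shift=3.0):
    """Hypothetical painting -> photo generator (the inverse shift)."""
    return y - shift

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def d1(y):
    """Hypothetical discriminator for paintings: P(y is a real painting)."""
    return sigmoid(y - 1.5)

def d2(x):
    """Hypothetical discriminator for photos: P(x is a real photo)."""
    return sigmoid(1.5 - x)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def adversarial_value(disc, real, fake):
    """GAN value E[log D(real)] + E[log(1 - D(fake))]."""
    return (mean(math.log(disc(r)) for r in real)
            + mean(math.log(1.0 - disc(v)) for v in fake))

def cycle_loss(f, g, xs, ys):
    """L1 cycle-consistency: E|g(f(x)) - x| + E|f(g(y)) - y|."""
    return (mean(abs(g(f(x)) - x) for x in xs)
            + mean(abs(f(g(y)) - y) for y in ys))

value_1 = adversarial_value(d1, paintings, [f(x) for x in photos])
value_2 = adversarial_value(d2, photos, [g(y) for y in paintings])
consistent = cycle_loss(f, g, photos, paintings)
inconsistent = cycle_loss(f, lambda y: y - 2.0, photos, paintings)
print(value_1, value_2, consistent, inconsistent)
```

The cycle loss vanishes (up to rounding) only when `g` undoes `f`; replacing `g` with a mismatched shift makes it strictly positive, even though each generator could still fool its own discriminator.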

### One problem, infinitely many (bad) solutions and a hidden trick

The loss functions described in the previous section can seem hopelessly ill-posed to the keen reader. In fact, any one-to-one mapping between $X$ and $Y$ minimizes those losses. A photo of myself can be converted into the portrait of Theodore Roosevelt as long as his portrait gets turned back into my photo. In other words, the cycle GAN loss doesn’t fully capture our intention of image-to-image translation, since it assigns equal loss to a huge number of mappings that don’t preserve the identity of the portrayed person. So why do cycle GANs work so well in practice? Like almost all questions in deep learning, the (rather uninformative) answer is: “Because they tend to converge on good local minima”. However, there is a more informative answer in the case of cycle GANs. The secret is that the optimization is initialized around a very good guess: the identity mapping $f(x) = x$. This is very easy to understand. If the initial transformation of my photo is the very same photo, it will probably not be turned into a portrait of Donald Trump during optimization. In other words, the intended local minimum is very close to the identity mapping. It is therefore not a coincidence that cycle GANs are usually parameterized by residual architectures, since ResNets are built as a perturbation of an identity mapping. Note that the key ingredient here is the initialization around a good guess, which does not necessarily need to be the identity. The extension of cycle GAN methods to situations where there is no obvious identity map (e.g. mapping a picture of myself to my voice saying my name) is in my opinion an interesting research topic.
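
The ill-posedness is easy to check on a toy example. The sketch below uses three hypothetical photos and three hypothetical paintings, enumerates every one-to-one assignment, and evaluates an idealized version of the two criteria: exact cycle consistency, and a perfect discriminator that only sees which outputs are produced, not who is depicted.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy data: the names only serve to track identities.
photos = ["photo_ann", "photo_bob", "photo_cat"]
paintings = ["painting_ann", "painting_bob", "painting_cat"]

def losses_for(assignment):
    """Return (cycle_errors, marginals_match) for one one-to-one mapping."""
    f = dict(zip(photos, assignment))      # photo -> painting
    g = {y: x for x, y in f.items()}       # painting -> photo (inverse of f)
    cycle_errors = (sum(g[f[x]] != x for x in photos)
                    + sum(f[g[y]] != y for y in paintings))
    # An ideal discriminator compares sets of outputs, not identities.
    marginals_match = (Counter(f.values()) == Counter(paintings)
                       and Counter(g.values()) == Counter(photos))
    return cycle_errors, marginals_match

results = {a: losses_for(a) for a in permutations(paintings)}
identity_preserving = tuple(paintings)

for assignment, outcome in results.items():
    print(assignment, outcome, assignment == identity_preserving)
```

All six assignments obtain zero cycle errors and matching marginals, but only one of them maps each person to their own portrait. Nothing in the losses singles it out, which is why the starting point of the optimization matters.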

## A probabilistic joint-contrastive reformulation of cycle GANs

I can finally move to the probabilistic reformulation of cycle-consistent adversarial learning. If you are anything like me, everything will feel more natural from here (if things feel more complicated, you either need to read more or less of this blog). Our starting point is the empirical distributions of photos and paintings, respectively $k(x)$ and $k(y)$. The aim is to construct a joint probability $p(x,y)$ such that if you sample the pair $(x,y)$ from this joint you will get a photo and a painting of the same person. The joint distribution can be expressed in two different ways:

$$p_1(x,y) = q(y\vert x)\, k(x)$$

and

$$p_2(x,y) = h(x\vert y)\, k(y).$$

The aim is to find the right conditional distributions $q(y\vert x)$ and $h(x\vert y)$. These conditional distributions can be interpreted as random functions, as in variational autoencoders. $p_1(x,y)$ and $p_2(x,y)$ are two ways of modeling the same joint distribution. Note that $p_1(x,y)$ has the correct marginal distribution on the photos while $p_2(x,y)$ has the correct marginal on the paintings. The idea is then to get a single self-consistent model by minimizing a joint-contrastive divergence between $p_1$ and $p_2$:

$$\min_{q,\,h}\; D(p_1, p_2).$$

This optimization results in two conditional distributions $q(y\vert x)$ and $h(x\vert y)$ that are (approximately) consistent, meaning that $q(y\vert x)$ is the “probabilistic inverse” of $h(x\vert y)$ in the sense given by Bayes' theorem. Furthermore, the marginal $\int p_1(x,y)\, dx$ will be fitted to $k(y)$ and the marginal $\int p_2(x,y)\, dy$ will be fitted to $k(x)$. The first of these results can be interpreted as the probabilistic generalization of the requirement of cycle-consistency in cycle GANs, while the second is analogous to the adversarial training. Of course, the details of the algorithm depend on the divergence we use and on the parameterization of the random functions $q(y\vert x)$ and $h(x\vert y)$. Note that the resulting probabilistic algorithm is going to be as initialization-dependent as regular cycle GANs, since there are infinitely many joint distributions that are consistent with our two marginals. However, the role of the initial guess can be made explicit in the loss by including a prior over the parameters of the conditional probabilities $q(y\vert x)$ and $h(x\vert y)$ whose probability is maximized by the identity mapping.
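
A small discrete example may help to see what "consistent" means here. The sketch below assumes the factorizations $p_1(x,y) = q(y\vert x)\,k(x)$ and $p_2(x,y) = h(x\vert y)\,k(y)$, a hypothetical problem with two photos and two paintings, and the symmetrized KL as the divergence. All tables and function names are illustrative.

```python
import math

# Hypothetical toy problem: two photo identities and two painting identities.
k_x = [0.5, 0.5]   # empirical distribution of photos, k(x)
k_y = [0.5, 0.5]   # empirical distribution of paintings, k(y)

def joint_p1(q):
    """p1(x, y) = q(y|x) k(x), with q[x][y] = q(y|x)."""
    return [[q[x][y] * k_x[x] for y in range(2)] for x in range(2)]

def joint_p2(h):
    """p2(x, y) = h(x|y) k(y), with h[y][x] = h(x|y)."""
    return [[h[y][x] * k_y[y] for y in range(2)] for x in range(2)]

def kl(p, r):
    """KL(p || r) for strictly positive 2x2 joint tables."""
    return sum(p[x][y] * math.log(p[x][y] / r[x][y])
               for x in range(2) for y in range(2))

def symmetrized_kl(p, r):
    return kl(p, r) + kl(r, p)

def bayes_inverse(q):
    """h(x|y) = q(y|x) k(x) / sum_x' q(y|x') k(x'), indexed as h[y][x]."""
    p1 = joint_p1(q)
    return [[p1[x][y] / sum(p1[xx][y] for xx in range(2)) for x in range(2)]
            for y in range(2)]

q = [[0.9, 0.1], [0.1, 0.9]]           # mostly identity-preserving
q_swapped = [[0.1, 0.9], [0.9, 0.1]]   # mostly identity-swapping
h_uniform = [[0.5, 0.5], [0.5, 0.5]]   # an inverse that ignores y

consistent = symmetrized_kl(joint_p1(q), joint_p2(bayes_inverse(q)))
also_consistent = symmetrized_kl(joint_p1(q_swapped),
                                 joint_p2(bayes_inverse(q_swapped)))
inconsistent = symmetrized_kl(joint_p1(q), joint_p2(h_uniform))
print(consistent, also_consistent, inconsistent)
```

The divergence vanishes (up to rounding) whenever `h` is the Bayes inverse of `q` and the induced marginal matches `k_y`. It does so for both the identity-preserving and the identity-swapping translation, which is the probabilistic face of the initialization dependence discussed above.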

### Going into details: the symmetrized KL divergence

What kind of divergence should we use? The form of the problem suggests the use of a symmetric divergence ($D(p_1,p_2) = D(p_2,p_1)$) since there isn’t an obvious asymmetry between $q(y\vert x)$ and $h(x\vert y)$. This is in sharp contrast to regular Bayesian inference where there is a clear interpretative difference between the likelihood and the posterior. There are many possible symmetric divergences to choose from. One interesting possibility is to use the Wasserstein distance like we did in our recent Wasserstein variational inference paper. This choice is particularly interesting given the stability and good performance of Wasserstein GANs. For now I will consider the symmetrized KL divergence since it leads to a clear similarity with the standard cycle GAN algorithm:

If we expand this expression we obtain several interpretable terms. The term corresponding to the requirement of cycle-consistency is nothing less than the forward amortized loss that I discussed in a previous post:

The optimization of this term enforces a variational probabilistic inversion (Bayesian inference) based on the forward KL divergence. The analogy with the original cycle-consistency loss can be seen by parameterizing $q(y\vert x)$ as a diagonal Gaussian with unit variance and mean given by the deep network $f(x)$, and $h(x\vert y)$ as the deterministic network $g(y)$. This leads to the cycle-consistent loss
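
To spell out the step, here is a sketch for one direction. It takes $h(x\vert y)$ to be the point mass at $g(y)$ and $q(y\vert x) = \mathcal{N}(y; f(x), I)$; the symbol $d$ for the dimensionality of $y$ and the explicit additive constant are additions for illustration:

$$
-\mathbb{E}_{k(y)}\,\mathbb{E}_{h(x\vert y)}\big[\log q(y\vert x)\big]
= -\mathbb{E}_{k(y)}\big[\log \mathcal{N}\big(y; f(g(y)), I\big)\big]
= \frac{1}{2}\,\mathbb{E}_{k(y)}\big[\lVert y - f(g(y)) \rVert_2^2\big] + \frac{d}{2}\log(2\pi).
$$

Up to the constant, this is a squared-error cycle loss for the $y \to x \to y$ loop. The $x \to y \to x$ term follows in the same way once the roles of the Gaussian and the deterministic map are exchanged.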

Note that this parameterization is not consistent since it assumes $q(y\vert x)$ to be Gaussian when used in one direction and deterministic when used in the other direction. Nevertheless the probabilistically correct way of modeling the conditionals symmetrically leads to a very similar kind of loss. Analogously, the adversarial losses follow from the following term of the symmetrized KL divergence:

This term is the expectation of a log likelihood ratio and can be converted into an adversarial loss using this very neat trick. Finally, the remaining terms of the symmetrized KL are entropy terms and can be interpreted as regularization.
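
For readers who want the algebra, the expansion can be written out explicitly. The following is a sketch that assumes $p_1(x,y) = q(y\vert x)\,k(x)$, $p_2(x,y) = h(x\vert y)\,k(y)$ and the plain sum of the two KL divergences (a factor of $1/2$ is sometimes included and would only rescale every term). The grouping is one possible arrangement:

$$
\begin{aligned}
D_{SKL}(p_1, p_2)
&= \mathbb{E}_{p_1}\!\left[\log\frac{q(y\vert x)\,k(x)}{h(x\vert y)\,k(y)}\right]
 + \mathbb{E}_{p_2}\!\left[\log\frac{h(x\vert y)\,k(y)}{q(y\vert x)\,k(x)}\right] \\
&= \underbrace{-\,\mathbb{E}_{p_1}\big[\log h(x\vert y)\big] - \mathbb{E}_{p_2}\big[\log q(y\vert x)\big]}_{\text{cycle-consistency}} \\
&\quad + \underbrace{\mathbb{E}_{p_1}\!\left[\log\frac{k(x)}{k(y)}\right] + \mathbb{E}_{p_2}\!\left[\log\frac{k(y)}{k(x)}\right]}_{\text{log ratios of the marginal densities}} \\
&\quad + \underbrace{\mathbb{E}_{p_1}\big[\log q(y\vert x)\big] + \mathbb{E}_{p_2}\big[\log h(x\vert y)\big]}_{\text{negative conditional entropies}}
\end{aligned}
$$

Here $\mathbb{E}_{p_1}$ means sampling $x \sim k(x)$ and then $y \sim q(y\vert x)$, while $\mathbb{E}_{p_2}$ means sampling $y \sim k(y)$ and then $x \sim h(x\vert y)$. The middle group is the only one in which the unknown densities $k(x)$ and $k(y)$ appear.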

### Lossy cycle-consistency

Some strange behaviors of cycle GANs become clear in light of the probabilistic viewpoint. An interesting example is the sneaky way in which cycle GANs store information in situations where we would have expected to have an information loss. For example, cycle GANs are often used to transform satellite images to stylized maps, as shown in the following figure:

This raises a question: how is it possible that cycle GANs can reach near-perfect cycle consistency when the stylized map doesn’t seem to contain enough information to recover a realistic satellite image? The answer was found in this delightfully named paper, and it is very interesting. It turns out that the network sneakily encodes the key for its inversion in unnoticeable high-frequency components! The reason for this behavior is obvious in light of the probabilistic point of view. The transformation $q(y\vert x)$ from realistic images to stylized maps is many-to-one, since several details should be discarded. In other words, $q(y\vert x)$ is lossy, meaning that it reduces the information content of the image. This implies that its probabilistic inverse $h(x\vert y)$ cannot be a deterministic function. However, conventional cycle GAN training assumes both transformations to be deterministic, and this forces the generators to encode the information that should be discarded in a form that is not noticed by the discriminator. A more appropriate lossy cycle GAN can be obtained by parameterizing $h(x\vert y)$ as a random function and using the forward amortized loss in place of the original cycle-consistency loss.
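
The argument can be checked on a tiny discrete example. The sketch below is purely illustrative: four hypothetical "satellite images" are mapped to two "maps" by a many-to-one function `q_map`, and we compare the best deterministic inverse with the Bayes inverse $h(x\vert y)$ scored by the forward amortized loss $-\mathbb{E}_{k(x)}[\log h(x\vert y)]$.

```python
import math

# Hypothetical toy: four "satellite images" x with a uniform distribution k(x).
k_x = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}

def q_map(x):
    """Deterministic, many-to-one translation x -> y (discards the low bit)."""
    return x // 2

def posterior(y):
    """Bayes inverse h(x|y) of q_map under k(x)."""
    mass = {x: p for x, p in k_x.items() if q_map(x) == y}
    total = sum(mass.values())
    return {x: p / total for x, p in mass.items()}

def deterministic_cycle_error(g):
    """Probability that a deterministic inverse g: y -> x fails to recover x."""
    return sum(p for x, p in k_x.items() if g[q_map(x)] != x)

def forward_amortized_loss(h):
    """-E_{k(x)} log h(x | q_map(x)) for a stochastic inverse h[y][x]."""
    return -sum(p * math.log(h[q_map(x)][x]) for x, p in k_x.items())

# Try every deterministic inverse: none can recover more than half of the x's.
best_error = min(deterministic_cycle_error({0: a, 1: b})
                 for a in range(4) for b in range(4))

# The Bayes inverse spreads its mass over the two images behind each map.
h = {y: posterior(y) for y in (0, 1)}
loss = forward_amortized_loss(h)
print(best_error, loss, math.log(2))
```

As long as the map really discards the low bit, no deterministic inverse achieves perfect cycle consistency. The stochastic inverse instead pays a finite price equal to the discarded information (here $\log 2$, one bit), with no need to hide that bit inside $y$.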