An intuitive guide to optimal transport, part III: entropic regularization and the sinkhorn iterations

One of the goals of this series of posts is to explain methods that many researchers are using without really knowing the underlying theory. The topic of today’s post is the Sinkhorn iterative algorithm for efficiently solving regularized optimal transport problems. This algorithm is one of the main driving forces behind the current wave of optimal transport theory in machine learning research as it can be several orders of magnitude faster than that of solvers. I start by confessing an embarrassing fact. The Sinkhorn iterations are the fundamental ingredient behind our Wasserstein variational inference paper that we are going to present at NIPS 2018. However, I did not really study the reason for why this beautiful algorithm works until a few days ago when I started the preparations for this post! This is particularly embarrassing since it turns out that the derivation of the Sinkhorn iterations is really simple. All in all this has definitively been very useful for me and I hope it will be useful for you as well. While in the case of the Wasserstein GAN our starting point was the dual formulation, we will now start from the primal formulation of the optimal transport problem (if these terms are confusing for you go back to our first post on optimal transport). Let’s get started! (If you are looking for a reference or if you are in a rush you can find a simple Python implementation at the end of the post).

Another look at probabilistic optimal transport

I will start by refreshing the probabilistic interpretation of the optimal transport problem. Consider a discrete probability distribution $P$ defined on a finite set $X = { x_1,…,x_k,…,x_K }$ and another discrete distribution $Q$ defined on the set $Y = {y_1,…,y_k,…,y_K }$. Stochastic transportation maps $\Gamma$ between $P$ and $Q$ are conditional distributions that fulfill the following marginalization property:

This can be interpreted in the following way: Sampling an element $x \in X$ from $P$ followed by $y \in Y$ from the conditional $\Gamma(y\vert x)$ is equivalent to sampling $y$ from $Q$ directly. I will refer to the set of all possible stochastic transportation maps as $G$. Using this notation, we can write the optimal transport divergence between $P$ and $Q$ as the expected cost of the least expensive transportation map:

An important feature of optimal transportation maps is that they have a tendency to be “as deterministic as possible”. More formally, the optimal conditional probability $\Gamma(y\vert x)$ is non-zero only for few values of $y$. The reason for this behavior is rather simple to understand. Given an element $\hat{x}$, there is usually a $\hat{y}$ such that the transportation cost is the smallest:

In this case, the cheapest thing to do is clearly to transport all the probability mass from $\hat{x}$ to $\hat{y}$. Of course this is not always possible because the probability mass $P(\hat{x})$ could be larger than the mass $Q(\hat{y})$ and this would violate the marginalization constraint. Moreover, even if $P(\hat{x}) \leq Q(\hat{y})$ there could be other elements of $X$ such that $(\hat{y})$ is the least expensive target. These other elements compete for the probability mass $Q(\hat{y})$. If this is the case then the best thing to do is to allocate as much mass as possible to $\hat{y}$ and allocate the remaining mass to the second cheapest element of $Y$. Clearly this second allocation could not be fully possible as well and the remaining mass will be allocated by repeating this procedure. Clearly, if for each $x \in X$ there is a unique and distinct $y \in Y$ that minimizes the transportation cost and such that $P(x) = Q(y)$, the optimal transportation map will be fully deterministic.

Some bits of information theory

Entropic regularization is the main theoretical concept behind this post. In order to explain this form of regularization I need to introduce some basics of information theory. Entropy is one of the deepest ideas of statistical physics and information theory and is a measure of the level of uncertainty or unpredictability in a probability distribution. Deterministic distributions have zero entropy since they are fully predictable. Conversely, uniform distributions have the maximal possible entropy. The entropy of a discrete distribution $P$ is defined by averaging negative log probabilities:

The entropy of the joint distribution $\Gamma(y_n, x_k) =\Gamma(y_n\vert x_k) P(x_k)$ quantifies the (average) randomness of the transportation maps:

Entropic regularization

Entropic regularization is a way to counteract the tendency of optimal transport to produce nearly deterministic transportation maps by adding a term that favors randomness. As the reader will have guessed by now, this term is given by the (averaged) entropy of the transportation maps. The result is a regularized transportation problem:

The solution of the regularized transport between $P$ and $Q$ can be visualized by plotting the joint probability $\Gamma(y,x) = \Gamma(y\vert x) P(x)$. The unregularized transport problem has a sparse solution, meaning that $\Gamma(y,x)$ is different from zero only for few $(y,x)$ couples. When we start adding regularization the transportation maps become dense and the joint distribution becomes more spread out.

The Sinkhorn iterations

The biggest advantage of including entropic regularization in an optimal transport problem is that the regularized solution can be found very efficiently using a simple iterative algorithm. The derivation of those iterations is quite simple and requires just some basics of calculus. I will start by introducing a simplified notation. I will denote the joint distribution $\Gamma(y_n\vert x_k) P(x_k)$ as $\Gamma_{nk}$. The probability $P(x_k)$ will be denoted as $P_k$ and $Q(x_k)$ will be denoted as $Q_k$. We need to optimize the following function with respect to the transportation maps:

under the following set of constraints:

and

This last constraint assures that the solutions are proper conditional probability distributions. The constraints can be embedded into the loss function using the two sets of Lagrange multipliers ${\lambda_1,…,\lambda_K}$ and ${\chi_1,…,\chi_N}$. This results in the following modified loss function that can be minimized without constraints:

This loss function is clearly smooth with respect to the transportation maps and it can easily be shown that it is convex. This implies that we can find the global minimum by differentiating the loss and setting the gradient equal to zero:

where $\nabla_{n,k}$ denotes the derivative with respect to $\Gamma_{nk}$. By setting the derivatives to zero and applying the exponential function to both sides we obtain the following equation:

We are almost there! All we need to do is find the values of the Lagrange multipliers. First, lets rewrite the equation as follows:

where $K_{nk} =\exp!\left(-C_{kn}/\epsilon \right)$, $v_n =\exp(-\chi_n/\epsilon)$ and $u_n = \exp(-\lambda_k/\epsilon - 1)$. In order to determine the constants $v_n$ and $u_k$ we need to enforce the constraints:

and

These two equations have to be satisfied simultaneously. An obvious strategy for finding a solution is to iteratively update each set of variables while keeping the other set fixed:

and

There you go, we just derived the Sinkhorn iterations! I will leave convergence proofs, complexity analysis and all this beautiful stuff to the mathematicians! There is still much to say about the Sinkhorn iterations from an applicational point of view. In the next post I will explain how to use them in the context of deep learning and variational Bayesian inference.

Sinkhorn in few lines of Python

The best algorithms in machine learning are also the simplest to explain and to program and I hope I convinced you that the Sinkhorn iterative algorithm is among them. I will conclude the post with a simple Python implementation:

def sinkhorn(C, P, Q, epsilon, num_iter):
	scale = np.max(C), K = np.exp(-np.transpose(C/scale)/epsilon)
	v = np.ones(shape=Q.shape)/len(Q)
	u = np.ones(shape=P.shape)/len(P)
	for itr in range(num_iter): 	
		v = Q/np.matmul(K, u)
		u = P/(np.matmul(np.transpose(K), v))
	coupling = np.outer(v,u)
	optimal_cost = np.sum(coupling)
	return coupling, scale*optimal_cost