The formalization of information theory is arguably one of the most important scientific advancements of the twentieth century. In some sense, the most important scientific breakthroughs of the last century, such as the discovery of the structure of DNA and quantum mechanics, can be interpreted in terms of information. Information theory also plays a major role in machine learning (ML) and statistical learning theory, since any learning machine can be seen as integrating prior information (e.g. Bayesian priors, structural constraints) with information extracted from its environment. In an ML task such as regression or classification, the environment can be formalized as a probability space $\Omega = (\mathcal{X},P)$ over a (possibly infinite) set of possible states $\mathcal{X} = \{x_1, x_2, \ldots\}$. We can think of the usual training, validation and test sets as being sampled from this hypothetical environment. Following the definition given by Shannon, we can quantify the information content of a state $x$ as the negative logarithm of its probability:

$$I(x) = -\log_2 P(x)$$

The intuition is that rare events convey more information than common ones. The Shannon entropy of the environment is defined as its average information content:

$$\mathcal{H}[\Omega] = -\sum_{x \in \mathcal{X}} P(x) \log_2 P(x)$$
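As a minimal sketch of these two quantities, the snippet below computes the information content $-\log_2 P(x)$ of a single state and the Shannon entropy of a finite distribution. The function names and the two coin distributions are illustrative assumptions, and base-2 logarithms are assumed so that the results are in bits.

```python
import math

def information_content(p: float) -> float:
    """Information content -log2(p), in bits, of a state with probability p."""
    return -math.log2(p)

def shannon_entropy(probs) -> float:
    """Average information content, in bits, of a finite distribution.

    States with zero probability are skipped, since they contribute nothing.
    """
    return sum(p * information_content(p) for p in probs if p > 0)

# Hypothetical two-state environments, used only for illustration.
fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]

print(information_content(0.1))      # a rare state carries about 3.32 bits
print(information_content(0.9))      # a common state carries about 0.15 bits
print(shannon_entropy(fair_coin))    # 1.0
print(shannon_entropy(biased_coin))  # about 0.469
```

The rare state of the biased coin is individually more informative, yet the biased coin has lower entropy than the fair one because that state occurs so seldom.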

While entropy is a good measure of the objective average information content of an environment, it is intuitively rather dissatisfying, since it assigns the highest values to environments that we as humans would consider completely devoid of meaningful information. For example, the space of white noise realizations has a much higher entropy than the space of human vocalizations. However, an untrained human can rarely discriminate between two instances of white noise, while she would be extremely sensitive to the differences between two similar human vocalizations. Arguably, the human brain magnifies the variability in the space of human vocalizations and suppresses the variability in the white noise space. This consideration suggests a new definition of entropy that is intrinsically relational, involving both an environment (information source) and a perceptual system.

## Relational entropy

In this post we will operationalize a perceptual system as a deterministic function $f:\mathcal{X} \rightarrow \mathcal{Y}$ that maps the environment set $\mathcal{X}$ into the set $\mathcal{Y}$ of internal (mental) states. For example, $f$ can be a (trained) deterministic classifier that maps each image $x$ to its class label $y = f(x)$. We can now define the relational entropy between the environment and the perceptual system $f$ as the Shannon entropy of the random variable $y = f(x)$:

$$\mathfrak{R}[f,\Omega] = -\sum_{y \in \mathcal{Y}} P_f(y) \log_2 P_f(y)$$

where $P_f(y)$ denotes the probability of the internal state $y$.

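The probability distribution $P_f(y)$ over the internal states can be obtained by summing the probabilities of all environment states that are mapped to the same internal state:

$$P_f(y) = \sum_{x \,:\, f(x) = y} P(x)$$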

When the perceptual mapping is a one-to-one function, the relational entropy is equal to the Shannon entropy of the environment. Furthermore, the relational entropy is always less than or equal to the entropy of the environment, since a deterministic mapping cannot increase the Shannon entropy. Specifically, the relational entropy $\mathfrak{R}[f,\Omega]$ is strictly smaller than $\mathcal{H}[\Omega]$ whenever the perceptual system maps two or more states with nonzero probability to the same internal state, meaning that it cannot completely discriminate all the states in the environment.
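The definition and the inequality above can be checked numerically on a small example. The sketch below assumes a finite environment represented as a dictionary from states to probabilities; the helper names (`pushforward`, `relational_entropy`) and the three-state environment are hypothetical choices made for illustration.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy, in bits, of a dict mapping states to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def pushforward(dist, f):
    """Distribution of y = f(x): add up P(x) over all x with the same f(x)."""
    out = defaultdict(float)
    for x, p in dist.items():
        out[f(x)] += p
    return dict(out)

def relational_entropy(dist, f):
    """Shannon entropy of the internal state y = f(x)."""
    return entropy(pushforward(dist, f))

# Hypothetical three-state environment.
env = {"a": 0.5, "b": 0.25, "c": 0.25}

def identity(x):
    """One-to-one perceptual system: every state is distinguished."""
    return x

def merge_bc(x):
    """Perceptual system that cannot tell "b" and "c" apart."""
    return "a" if x == "a" else "other"

print(entropy(env))                       # 1.5
print(relational_entropy(env, identity))  # 1.5, equal to the entropy
print(relational_entropy(env, merge_bc))  # 1.0, strictly smaller
```

Merging "b" and "c" removes exactly the half bit needed, on average, to tell them apart.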

## Relational entropy and data analysis

From a signal processing perspective, one of the intuitively disappointing features of classical information theory is that the Shannon entropy of the data can only be reduced or left unaltered by deterministic data analysis. Here, for simplicity, we will conceptualize a data analysis procedure as a function $\theta: \mathcal{X} \rightarrow \mathcal{X}$ that maps an element $x$ of the environment into another element $x'$. For example, if the environment is composed of noise-corrupted images of human faces, the function could perform denoising, mapping each noise-corrupted image into its noise-free version. Since this mapping is clearly not one-to-one, the denoising procedure always reduces the entropy of the source. However, a human would surely be better able to discriminate between different faces after the denoising procedure, and it is likely that the range of her brain responses would increase. In the following, using a toy example, we will show that the relational entropy can actually increase as a consequence of data analysis. Consider an environment consisting of quadruplets of binary random variables:

$$\boldsymbol{\epsilon} = (\tilde{\epsilon}_1, \tilde{\epsilon}_2, \xi_1, \xi_2)$$

The two “noise” variables $\xi_1$ and $\xi_2$ are statistically independent Bernoulli variables with probability 0.5. The first two variables are generated hierarchically by sampling the “noise-free” variables $\epsilon_1$ and $\epsilon_2$ and then adding the noise modulo 2:

$$\tilde{\epsilon}_j = \epsilon_j \oplus \xi_j, \qquad j = 1, 2$$

The “noise-free” variables are statistically dependent. Specifically, $P(\epsilon_1 = 0, \epsilon_2 = 0) = 1/2$, while all other configurations have a probability of $1/6$. It is easy to see that after the injection of the “noise” all the configurations of $\tilde{\epsilon}_1$ and $\tilde{\epsilon}_2$ have probability equal to $1/4$. Consider a creature with a perceptual system that can only use the first two variables. In particular, its perceptual system $f(\boldsymbol{\epsilon})$ outputs $1$ if $\tilde{\epsilon}_1 = \tilde{\epsilon}_2 = 0$ and outputs $0$ otherwise. The relational entropy of this environment/perceptual system pair is

$$\mathfrak{R}[f,\Omega] = -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} \approx 0.81 \text{ bits}$$

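Now consider the data analysis procedure $\theta(\boldsymbol{\epsilon}) = (\epsilon_1,\epsilon_2,0,0)$. (This operation is well defined since the second pair of variables contains all the “noise” information.) It is easy to see that this procedure reduces the Shannon entropy of the environment. However, the relational entropy actually increases, since after the analysis the perceptual system outputs $1$ with probability $P(\epsilon_1 = 0, \epsilon_2 = 0) = 1/2$:

$$\mathfrak{R}[f \circ \theta,\Omega] = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1 \text{ bit}$$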

Therefore, the data analysis procedure $\theta$ increases the relational entropy by approximately $0.2$ bits.
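The toy example can be verified by enumerating all sixteen states. The sketch below rests on two assumptions: the noise is added modulo 2 (XOR), and the configuration $\epsilon_1 = \epsilon_2 = 0$ has probability $1/2$. This is the reading consistent with the uniform $1/4$ probabilities and the roughly $0.2$ bit increase stated in the text. All function and variable names are illustrative.

```python
import itertools
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy, in bits, of a dict mapping states to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def pushforward(dist, g):
    """Distribution of g(x) when x is distributed according to dist."""
    out = defaultdict(float)
    for x, p in dist.items():
        out[g(x)] += p
    return dict(out)

# Assumed joint distribution of the noise-free pair (eps1, eps2).
p_clean = {(0, 0): 1 / 2, (0, 1): 1 / 6, (1, 0): 1 / 6, (1, 1): 1 / 6}

# Environment over (noisy1, noisy2, xi1, xi2), with independent fair noise bits
# and noisy_j = eps_j XOR xi_j (assumed form of the noise injection).
env = defaultdict(float)
for (e1, e2), p in p_clean.items():
    for xi1, xi2 in itertools.product((0, 1), repeat=2):
        env[(e1 ^ xi1, e2 ^ xi2, xi1, xi2)] += p * 0.25
env = dict(env)

def f(state):
    """Perceptual system: looks only at the first two variables."""
    return 1 if state[0] == 0 and state[1] == 0 else 0

def theta(state):
    """Data analysis: undo the XOR noise, then zero out the noise variables."""
    t1, t2, xi1, xi2 = state
    return (t1 ^ xi1, t2 ^ xi2, 0, 0)

h_env = entropy(env)                                    # entropy before analysis
h_denoised = entropy(pushforward(env, theta))           # entropy after analysis
r_before = entropy(pushforward(env, f))                 # relational entropy before
r_after = entropy(pushforward(env, lambda s: f(theta(s))))  # and after

print(round(h_env, 3), round(h_denoised, 3))   # 3.792 1.792
print(round(r_before, 3), round(r_after, 3))   # 0.811 1.0
```

Under these assumptions the Shannon entropy drops by two bits (the two discarded noise variables), while the relational entropy rises from about $0.81$ to $1$ bit.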

## Relational entropy in deep neural networks

Deep convolutional neural networks have often been considered as simplified models of the human visual system. The following figure shows how the relational entropy of images (MNIST digits) with respect to either a trained or an untrained network changes as we add white noise:

Note that the entropy of the data increases as we add noise. In spite of this, the relational entropy of a trained convolutional network sharply decreases. Interestingly, the relational entropy of an equivalent randomly initialized network slightly increases.

## Relational mutual information

Our definition of relational entropy can be used to define other important information-theoretic quantities. The relational mutual information between the random variable $\mathcal{X}$ and the random variable $\mathcal{Y}$ through the perceptual system $f$ is defined as follows:

where

is the conditional relational entropy. Relational mutual information captures the similarity between two data sources through the lens of a perceptual system. For example, two paired streams of natural images, one unaltered and the other corrupted with low-amplitude, high-frequency noise, have high relational mutual information with respect to the human visual system. Conversely, two sources can have high mutual information but vanishing relational mutual information when the perceptual system is insensitive to the features of the sources that are statistically dependent.

## Towards more human-like machines

As a field, machine learning is deeply rooted in statistics. It is probably fair to say that the most successful ML methods are glorified versions of methods developed in the statistical literature. Probability theory is the theoretical bedrock of statistics. Information theory, which follows naturally from probability theory, is the natural way to connect probability theory with the real world. Consequently, when a (probabilistic) ML method is derived from first principles, it necessarily reflects the fundamental assumptions of probability theory and information theory. Unfortunately, as modern physics has amply shown, mathematical naturalness does not imply human intuitiveness. This is not a big problem in physics: an intuitive physical theory has neither more value nor more credibility than a counter-intuitive one. Lack of intuitiveness of the underlying principles is also not a problem in general ML. Nevertheless, it is arguably a big problem in AI, since our very definition of intelligence is rooted in human intuition. We could rephrase the Turing test as follows: a machine is intelligent if it can make a human think that it is indeed intelligent. While the goal of deriving ML and AI techniques from first principles is lofty, there is no guarantee that this will result in something that a human would consider intelligent. Instead of reverting to a mainly empirical approach, our suggestion is to work on new mathematical principles that are both rigorous and embed human intuition. In our opinion, the concept of relational entropy is a small step in this direction.