Information Theory Primer

This document provides a conceptual introduction to the core ideas of information theory, with particular attention to the challenges of estimation and higher-order information measures. For formal treatments, see Shannon’s original work or Cover & Thomas’s Elements of Information Theory. Much of the content here is based on the review article “Information Theory for Complex Systems Scientists”.

Entropy: Quantifying Uncertainty

Shannon entropy is the foundational quantity of information theory. It quantifies the uncertainty associated with a random variable—how “surprised” we should expect to be when we observe its outcome.

The Intuition

There are two complementary ways to think about entropy:

Expected Surprise. Rare events are surprising; common events are not. The “surprise” or “self-information” of observing an outcome \(x\) with probability \(P(x)\) is:

\[h(x) = -\log P(x)\]

If an outcome is certain (\(P(x) = 1\)), there is no surprise (\(h(x) = 0\)). If an outcome is very rare, observing it is highly surprising. Entropy is simply the expected surprise across all possible outcomes:

\[H(X) = \mathbb{E}_X[h(x)] = -\sum_{x \in \mathcal{X}} P(x) \log P(x)\]

Minimum Questions. Entropy can also be interpreted as the minimum average number of yes/no questions needed to identify the state of a random variable. A fair 6-sided die requires about 2.58 questions on average, while a biased die (where one face appears most of the time) requires fewer questions—you can usually guess the common outcome first.

Key Properties

  • Non-negativity: \(H(X) \geq 0\) (for discrete variables).

  • Maximum at uniformity: Entropy is maximized when all outcomes are equally probable.

  • Additivity for independent variables: If \(X_1\) and \(X_2\) are independent, then \(H(X_1, X_2) = H(X_1) + H(X_2)\).

  • Subadditivity: In general, \(H(X_1, X_2) \leq H(X_1) + H(X_2)\), with equality only when the variables are independent.

Mutual Information: Quantifying Dependence

While entropy measures uncertainty about a single variable, mutual information measures the statistical dependence between two variables—how much knowing one reduces uncertainty about the other.

Definition and Interpretations

Mutual information can be expressed in several equivalent ways:

\[\begin{split}I(X; Y) &= H(X) - H(X|Y) \\ &= H(Y) - H(Y|X) \\ &= H(X) + H(Y) - H(X, Y)\end{split}\]

The first form says: mutual information is how much the entropy of \(X\) decreases when we learn \(Y\). The third form shows it as the “overlap” between the individual entropies, minus the joint entropy.

Bayesian Interpretation. Mutual information equals the Kullback-Leibler divergence between the true joint distribution and the product of marginals:

\[I(X; Y) = D_{KL}(P(X,Y) \| P(X) P(Y))\]

This measures how much information we gain when updating from an assumption of independence to the actual joint distribution.

Local Mutual Information

The pointwise or local mutual information for specific observations \(x\) and \(y\) is:

\[i(x; y) = \log \frac{P(x, y)}{P(x) P(y)} = \log \frac{P(x|y)}{P(x)}\]

Unlike the expected value, local mutual information can be negative—indicating that observing \(y\) makes \(x\) less likely than the marginal would suggest. This is sometimes called “misinformation.”

Conditional Mutual Information

Conditional mutual information measures the dependence between \(X\) and \(Y\) given knowledge of a third variable \(Z\):

\[I(X; Y | Z) = H(X|Z) - H(X|Y, Z)\]

This is essential for identifying direct versus indirect dependencies in complex systems.

The Estimation Problem

A central challenge in applied information theory is estimation: we rarely have access to the true probability distribution and must estimate it from finite samples.

Discrete Data

For discrete random variables, the naive approach is to estimate probabilities from observed frequencies (the “plug-in” estimator). However, this systematically underestimates the true entropy, especially when sample sizes are small relative to the number of possible states.

The problem is that rare events may not appear in a finite sample, leading us to assign them zero probability. Several bias-correction methods exist (Miller-Madow, Grassberger, Bayesian estimators), but the fundamental issue remains: estimating entropy from discrete data requires careful attention to sample size and the number of possible states.

Syntropy’s syntropy.discrete module computes exact information measures from fully specified probability distributions. If you only have samples, you must first estimate the distribution.

Continuous Data

Continuous random variables present additional challenges. The direct analog of Shannon entropy is differential entropy:

\[h(X) = -\int p(x) \log p(x) \, dx\]

Unlike discrete entropy, differential entropy can be negative and depends on the coordinate system. Several strategies exist for estimating information from continuous data:

Binning (Coarse-Graining). Discretize the continuous data into bins, then apply discrete entropy formulas. This is simple but problematic: results depend heavily on bin width, and the “true” differential entropy is only recovered in the limit of infinitely fine bins with infinite data.

Gaussian Assumption. If the data are (approximately) Gaussian, entropy depends only on the covariance matrix:

\[H(X) = \frac{1}{2} \log \left( (2 \pi e)^d |\Sigma| \right)\]

where \(d\) is the dimensionality and \(|\Sigma|\) is the determinant of the covariance matrix. This yields closed-form solutions and is computationally efficient, but obviously requires the Gaussianity assumption to hold. Syntropy’s syntropy.gaussian module implements these estimators.

K-Nearest Neighbor (KNN) Estimators. KNN estimators (e.g., Kraskov-Stogbauer-Grassberger) avoid explicit density estimation by using the distances to nearby points to estimate local density. They are:

  • Non-parametric (no distributional assumptions)

  • Adaptive to local data structure

  • Effective across varying dimensions

However, they can be computationally expensive for large datasets and sensitive to the choice of \(k\). Syntropy’s syntropy.knn module implements KNN-based estimators.

Neural Estimators. Modern approaches use neural networks (e.g., normalizing flows) to estimate densities or directly estimate information-theoretic quantities. These can be powerful for high-dimensional data but require careful training and more computational resources. Syntropy’s syntropy.neural module implements flow-based estimators.

Choosing an Estimator

There is no universally “best” estimator. The choice depends on:

  • Data type: Discrete data requires discrete estimators; continuous data requires continuous estimators.

  • Sample size: KNN and neural estimators generally need more samples than parametric methods.

  • Distributional assumptions: If your data are approximately Gaussian, Gaussian estimators will be more efficient. If not, they will be biased.

  • Dimensionality: High-dimensional estimation is inherently difficult. KNN estimators suffer from the curse of dimensionality, though less severely than binning approaches.

  • Computational budget: Gaussian estimators are fast; neural estimators are slow.

Higher-Order Information

Mutual information captures pairwise dependencies, but many complex systems exhibit irreducibly higher-order interactions—statistical dependencies that cannot be reduced to pairwise relationships.

Total Correlation

Total correlation (also called multi-information) generalizes mutual information to \(N\) variables:

\[TC(X_1, \ldots, X_N) = \sum_{i=1}^N H(X_i) - H(X_1, \ldots, X_N)\]

This measures the total amount of statistical dependence in the system—how much the joint entropy falls short of the sum of individual entropies.

Dual Total Correlation

Dual total correlation (also called binding information) takes a complementary perspective:

\[DTC(X_1, \ldots, X_N) = H(X_1, \ldots, X_N) - \sum_{i=1}^N H(X_i | X_{-i})\]

where \(X_{-i}\) denotes all variables except \(X_i\). This measures the information that is redundantly encoded across variables.

O-Information

The O-information combines total correlation and dual total correlation:

\[\Omega(X_1, \ldots, X_N) = TC(X_1, \ldots, X_N) - DTC(X_1, \ldots, X_N)\]

The sign of O-information indicates whether a system is dominated by redundancy (\(\Omega > 0\)) or synergy (\(\Omega < 0\)):

  • Redundancy-dominated: Variables carry overlapping, duplicated information about each other.

  • Synergy-dominated: Information emerges from the combination of variables that cannot be found in subsets.

S-Information

The S-information is the sum of total correlation and dual total correlation:

\[\Sigma(X_1, \ldots, X_N) = TC(X_1, \ldots, X_N) + DTC(X_1, \ldots, X_N)\]

This provides an overall measure of statistical structure in the system.

Partial Information Decomposition

Partial Information Decomposition (PID) attempts to decompose the mutual information between a set of sources and a target into distinct components:

  • Redundancy: Information provided identically by multiple sources.

  • Unique information: Information provided by only one source.

  • Synergy: Information available only when sources are considered jointly.

For two sources \(X_1, X_2\) and target \(Y\):

\[I(X_1, X_2; Y) = \text{Red}(X_1, X_2 \to Y) + \text{Unq}(X_1 \to Y) + \text{Unq}(X_2 \to Y) + \text{Syn}(X_1, X_2 \to Y)\]

The XOR function is the canonical example of synergy: neither input alone predicts the output, but together they determine it completely.

Further Reading

  • Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal.

  • Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley.

  • Varley, T. F. (2024). Information Theory for Complex Systems Scientists. https://www.sciencedirect.com/science/article/pii/S037015732500256X.

  • Timme, N. M., & Lapish, C. (2018). A Tutorial for Information Theory in Neuroscience. eNeuro.