Information Theory Primer
=========================

This document provides a conceptual introduction to the core ideas of information theory, with particular attention to the challenges of estimation and higher-order information measures. For formal treatments, see Shannon's original work or Cover & Thomas's *Elements of Information Theory*. Much of the content here is based on the review article `"Information Theory for Complex Systems Scientists" <https://www.sciencedirect.com/science/article/pii/S037015732500256X>`_.

Entropy: Quantifying Uncertainty
--------------------------------

Shannon entropy is the foundational quantity of information theory. It quantifies the uncertainty associated with a random variable—how "surprised" we should expect to be when we observe its outcome.

The Intuition
^^^^^^^^^^^^^

There are two complementary ways to think about entropy:

**Expected Surprise.** Rare events are surprising; common events are not. The "surprise" or "self-information" of observing an outcome :math:`x` with probability :math:`P(x)` is:

.. math::

   h(x) = -\log P(x)

If an outcome is certain (:math:`P(x) = 1`), there is no surprise (:math:`h(x) = 0`). If an outcome is very rare, observing it is highly surprising. Entropy is simply the *expected* surprise across all possible outcomes:

.. math::

   H(X) = \mathbb{E}_X[h(x)] = -\sum_{x \in \mathcal{X}} P(x) \log P(x)

**Minimum Questions.** Entropy can also be interpreted as the minimum average number of yes/no questions needed to identify the state of a random variable. A fair 6-sided die requires about 2.58 questions on average, while a biased die (where one face appears most of the time) requires fewer questions—you can usually guess the common outcome first.

Key Properties
^^^^^^^^^^^^^^

- **Non-negativity:** :math:`H(X) \geq 0` (for discrete variables).
- **Maximum at uniformity:** Entropy is maximized when all outcomes are equally probable.
- **Additivity for independent variables:** If :math:`X_1` and :math:`X_2` are independent, then :math:`H(X_1, X_2) = H(X_1) + H(X_2)`.
- **Subadditivity:** In general, :math:`H(X_1, X_2) \leq H(X_1) + H(X_2)`, with equality only when the variables are independent.

Mutual Information: Quantifying Dependence
------------------------------------------

While entropy measures uncertainty about a single variable, mutual information measures the statistical dependence between two variables—how much knowing one reduces uncertainty about the other.

Definition and Interpretations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Mutual information can be expressed in several equivalent ways:

.. math::

   I(X; Y) &= H(X) - H(X|Y) \\
           &= H(Y) - H(Y|X) \\
           &= H(X) + H(Y) - H(X, Y)

The first form says: mutual information is how much the entropy of :math:`X` decreases when we learn :math:`Y`. The third form shows it as the "overlap" between the individual entropies, minus the joint entropy.

**Bayesian Interpretation.** Mutual information equals the Kullback-Leibler divergence between the true joint distribution and the product of marginals:

.. math::

   I(X; Y) = D_{KL}(P(X,Y) \| P(X) P(Y))

This measures how much information we gain when updating from an assumption of independence to the actual joint distribution.

Local Mutual Information
^^^^^^^^^^^^^^^^^^^^^^^^

The *pointwise* or *local* mutual information for specific observations :math:`x` and :math:`y` is:

.. math::

   i(x; y) = \log \frac{P(x, y)}{P(x) P(y)} = \log \frac{P(x|y)}{P(x)}

Unlike the expected value, local mutual information can be negative—indicating that observing :math:`y` makes :math:`x` *less* likely than the marginal would suggest. This is sometimes called "misinformation."

Conditional Mutual Information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Conditional mutual information measures the dependence between :math:`X` and :math:`Y` given knowledge of a third variable :math:`Z`:

.. math::

   I(X; Y | Z) = H(X|Z) - H(X|Y, Z)

This is essential for identifying direct versus indirect dependencies in complex systems.

The Estimation Problem
----------------------

A central challenge in applied information theory is *estimation*: we rarely have access to the true probability distribution and must estimate it from finite samples.

Discrete Data
^^^^^^^^^^^^^

For discrete random variables, the naive approach is to estimate probabilities from observed frequencies (the "plug-in" estimator). However, this systematically **underestimates** the true entropy, especially when sample sizes are small relative to the number of possible states.

The problem is that rare events may not appear in a finite sample, leading us to assign them zero probability. Several bias-correction methods exist (Miller-Madow, Grassberger, Bayesian estimators), but the fundamental issue remains: estimating entropy from discrete data requires careful attention to sample size and the number of possible states.

Syntropy's ``syntropy.discrete`` module computes exact information measures from fully specified probability distributions. If you only have samples, you must first estimate the distribution.

Continuous Data
^^^^^^^^^^^^^^^

Continuous random variables present additional challenges. The direct analog of Shannon entropy is *differential entropy*:

.. math::

   h(X) = -\int p(x) \log p(x) \, dx

Unlike discrete entropy, differential entropy can be negative and depends on the coordinate system. Several strategies exist for estimating information from continuous data:

**Binning (Coarse-Graining).** Discretize the continuous data into bins, then apply discrete entropy formulas. This is simple but problematic: results depend heavily on bin width, and the "true" differential entropy is only recovered in the limit of infinitely fine bins with infinite data.

**Gaussian Assumption.** If the data are (approximately) Gaussian, entropy depends only on the covariance matrix:

.. math::

   H(X) = \frac{1}{2} \log \left( (2 \pi e)^d |\Sigma| \right)

where :math:`d` is the dimensionality and :math:`|\Sigma|` is the determinant of the covariance matrix. This yields closed-form solutions and is computationally efficient, but obviously requires the Gaussianity assumption to hold. Syntropy's ``syntropy.gaussian`` module implements these estimators.

**K-Nearest Neighbor (KNN) Estimators.** KNN estimators (e.g., Kraskov-Stogbauer-Grassberger) avoid explicit density estimation by using the distances to nearby points to estimate local density. They are:

- Non-parametric (no distributional assumptions)
- Adaptive to local data structure
- Effective across varying dimensions

However, they can be computationally expensive for large datasets and sensitive to the choice of :math:`k`. Syntropy's ``syntropy.knn`` module implements KNN-based estimators.

**Neural Estimators.** Modern approaches use neural networks (e.g., normalizing flows) to estimate densities or directly estimate information-theoretic quantities. These can be powerful for high-dimensional data but require careful training and more computational resources. Syntropy's ``syntropy.neural`` module implements flow-based estimators.

Choosing an Estimator
^^^^^^^^^^^^^^^^^^^^^

There is no universally "best" estimator. The choice depends on:

- **Data type:** Discrete data requires discrete estimators; continuous data requires continuous estimators.
- **Sample size:** KNN and neural estimators generally need more samples than parametric methods.
- **Distributional assumptions:** If your data are approximately Gaussian, Gaussian estimators will be more efficient. If not, they will be biased.
- **Dimensionality:** High-dimensional estimation is inherently difficult. KNN estimators suffer from the curse of dimensionality, though less severely than binning approaches.
- **Computational budget:** Gaussian estimators are fast; neural estimators are slow.

Higher-Order Information
------------------------

Mutual information captures pairwise dependencies, but many complex systems exhibit irreducibly higher-order interactions—statistical dependencies that cannot be reduced to pairwise relationships.

Total Correlation
^^^^^^^^^^^^^^^^^

Total correlation (also called multi-information) generalizes mutual information to :math:`N` variables:

.. math::

   TC(X_1, \ldots, X_N) = \sum_{i=1}^N H(X_i) - H(X_1, \ldots, X_N)

This measures the total amount of statistical dependence in the system—how much the joint entropy falls short of the sum of individual entropies.

Dual Total Correlation
^^^^^^^^^^^^^^^^^^^^^^

Dual total correlation (also called binding information) takes a complementary perspective:

.. math::

   DTC(X_1, \ldots, X_N) = H(X_1, \ldots, X_N) - \sum_{i=1}^N H(X_i | X_{-i})

where :math:`X_{-i}` denotes all variables except :math:`X_i`. This measures the information that is *redundantly* encoded across variables.

O-Information
^^^^^^^^^^^^^

The O-information combines total correlation and dual total correlation:

.. math::

   \Omega(X_1, \ldots, X_N) = TC(X_1, \ldots, X_N) - DTC(X_1, \ldots, X_N)

The sign of O-information indicates whether a system is dominated by redundancy (:math:`\Omega > 0`) or synergy (:math:`\Omega < 0`):

- **Redundancy-dominated:** Variables carry overlapping, duplicated information about each other.
- **Synergy-dominated:** Information emerges from the combination of variables that cannot be found in subsets.

S-Information
^^^^^^^^^^^^^

The S-information is the sum of total correlation and dual total correlation:

.. math::

   \Sigma(X_1, \ldots, X_N) = TC(X_1, \ldots, X_N) + DTC(X_1, \ldots, X_N)

This provides an overall measure of statistical structure in the system.

Partial Information Decomposition
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Partial Information Decomposition (PID) attempts to decompose the mutual information between a set of sources and a target into distinct components:

- **Redundancy:** Information provided identically by multiple sources.
- **Unique information:** Information provided by only one source.
- **Synergy:** Information available only when sources are considered jointly.

For two sources :math:`X_1, X_2` and target :math:`Y`:

.. math::

   I(X_1, X_2; Y) = \text{Red}(X_1, X_2 \to Y) + \text{Unq}(X_1 \to Y) + \text{Unq}(X_2 \to Y) + \text{Syn}(X_1, X_2 \to Y)

The XOR function is the canonical example of synergy: neither input alone predicts the output, but together they determine it completely.

Further Reading
---------------

- Shannon, C. E. (1948). A Mathematical Theory of Communication. *Bell System Technical Journal*.
- Cover, T. M., & Thomas, J. A. (2006). *Elements of Information Theory* (2nd ed.). Wiley.
- Varley, T. F. (2024). Information Theory for Complex Systems Scientists. *https://www.sciencedirect.com/science/article/pii/S037015732500256X*.
- Timme, N. M., & Lapish, C. (2018). A Tutorial for Information Theory in Neuroscience. *eNeuro*.