Conditional Mutual Information and Decomposition

In the preceding chapter, we established the foundational tools for quantifying the relationship between two variables. Mutual Information, $I(X;Y)$, provides a robust, non-linear measure of their statistical dependence, moving beyond simple correlation to capture any shared information. This two-variable analysis is the bedrock of information-theoretic inquiry. However, real-world systems are rarely so simple. The intricate web of dependencies that defines complex phenomena, from market dynamics and neural circuits to strategic gameplay, involves the interplay of multiple variables simultaneously.

To dissect these multivariate relationships, we must move beyond pairwise analysis and ask a more sophisticated question: How does the presence of a third variable, $Z$, alter the informational relationship between $X$ and $Y$?

This chapter introduces Conditional Mutual Information (CMI) as the principal tool for this task. CMI allows us to analyze relationships in context, quantifying the information shared between two variables while accounting for the influence of a third. By introducing this contextual layer, we unlock a deeper understanding of system dynamics, revealing how relationships can be either obscured by redundancy or illuminated by synergy.

Furthermore, we will explore the decomposition of information using the chain rule. This powerful technique provides a systematic framework for dissecting the total information that multiple sources provide about a target, attributing each component's contribution, and building a comprehensive model of multivariate information flow. This chapter thus marks the transition from measuring simple dependence to characterizing the architecture of complex informational interactions.

1. The Language of Coding and Divergence: Cross-Entropy and KL Divergence

Before defining the core measures of information, it is instructive to introduce two foundational concepts from coding theory that provide a rigorous mathematical basis for mutual information: Cross-Entropy and Kullback-Leibler (KL) Divergence. These concepts frame the measurement of information in terms of the efficiency of data encoding, offering a powerful, quantitative perspective on the cost of incorrect assumptions.

In the previous chapter we introduced entropy, which measures the uncertainty, or information content, of a random variable. It is defined as $H(p) = -\sum_x p(x)\log_2 p(x)$.

1.1 Cross-Entropy: The Cost of a Suboptimal Model

In practical scenarios, we rarely have access to the true data distribution $p$. Instead, we build a model $q$ to approximate it. Cross-Entropy, $H(p, q)$, measures the average number of bits we will actually need if we use an encoding scheme optimized for our model $q$ to encode messages that are, in reality, drawn from the true distribution $p$. It is defined as:

$$H(p, q) = \sum_x p(x)\, \log_2 \frac{1}{q(x)}$$

Here, $\log_2 \frac{1}{q(x)}$ is the code length our model $q$ assigns to event $x$, and this length is averaged over the true probability $p(x)$ of that event occurring. Because $H(p)$ is the theoretical minimum, the cross-entropy $H(p, q)$ is always greater than or equal to the entropy $H(p)$. The closer our model $q$ is to reality $p$, the closer $H(p, q)$ is to $H(p)$. If the assumed model perfectly matches the true distribution, the cross-entropy equals the entropy of $p$. Any mismatch results in a longer average code length.
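To make this concrete, here is a minimal Python sketch. The distributions `p` and `q` below are made-up illustrative examples (not from the text); the sketch compares the optimal average code length $H(p)$ with the cross-entropy $H(p, q)$ of a mismatched uniform model.

```python
from math import log2

# True distribution p and an approximate model q over four symbols (made-up numbers).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
q = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}   # a uniform, mismatched model

def entropy(p):
    """H(p): average code length, in bits, of the optimal code for p."""
    return sum(px * log2(1 / px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    """H(p, q): average code length when events drawn from p are coded with q's code."""
    return sum(p[x] * log2(1 / q[x]) for x in p if p[x] > 0)

print(f"H(p)    = {entropy(p):.3f} bits")           # 1.750 bits (the optimum)
print(f"H(p, q) = {cross_entropy(p, q):.3f} bits")  # 2.000 bits (>= H(p))
```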

1.2 Kullback-Leibler (KL) Divergence: Quantifying the Penalty

We also need to quantify exactly how much worse our model $q$ is than coding with the true distribution $p$. KL Divergence, also known as relative entropy, is the fundamental measure of how one probability distribution $p$ diverges from a second, expected distribution $q$. It is the penalty, the average number of extra bits required per message, for using the suboptimal model $q$ instead of the true distribution $p$.

Mathematically, it is the difference between the practical cost and the theoretical minimum:

$$D_{KL}(p \,\|\, q) = H(p, q) - H(p)$$

This highlights its role as a measure of "relative entropy." By substituting the formulas for entropy and cross-entropy, we arrive at its more common form:

$$D_{KL}(p \,\|\, q) = \sum_x p(x)\, \log_2 \frac{p(x)}{q(x)}$$

A key property of KL Divergence is that it is always non-negative ($D_{KL}(p \,\|\, q) \geq 0$) and is zero if and only if $p$ and $q$ are identical. This establishes a fundamental principle: you always incur an informational cost for using an incorrect model.

The key implication is that minimizing the KL Divergence is equivalent to minimizing the Cross-Entropy, as $H(p)$ is a constant during optimization. Both measures serve the same goal: to make the model distribution $q$ as indistinguishable from the true distribution $p$ as possible.
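Continuing with the same illustrative distributions as before, the sketch below computes $D_{KL}(p \,\|\, q)$ both directly from its definition and via the identity $H(p, q) - H(p)$; the two values agree and are non-negative.

```python
from math import log2

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # same made-up distributions as above
q = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

def entropy(p):
    return sum(px * log2(1 / px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    return sum(p[x] * log2(1 / q[x]) for x in p if p[x] > 0)

def kl_divergence(p, q):
    """D_KL(p || q): extra bits per symbol paid for coding p with q's code."""
    return sum(p[x] * log2(p[x] / q[x]) for x in p if p[x] > 0)

print(f"D_KL(p||q) from the definition = {kl_divergence(p, q):.3f} bits")               # 0.250
print(f"H(p, q) - H(p)                 = {cross_entropy(p, q) - entropy(p):.3f} bits")  # 0.250
```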

1.3 The Bridge to Mutual Information

This framework provides the definitive link to understanding Mutual Information. As we will see, Mutual Information is a direct application of KL Divergence. It measures the divergence of the true joint data distribution $p(x,y)$ from the idealized model of statistical independence, $p(x)p(y)$. This concept of quantifying the "cost of assuming independence" is the rigorous core of our analytical methodology.

2. Mutual Information (MI): Quantifying Statistical Dependence

2.1 Mutual Information

Having established KL Divergence as a measure of the cost of inaccurate assumptions, we now apply this principle to one of the most fundamental questions in data analysis: Are two variables statistically related, and if so, how strongly? Mutual Information, $I(X;Y)$, provides the definitive answer through several distinct yet complementary interpretations. This section explores these facets, moving from intuitive visualizations to rigorous mathematical definitions, to build a comprehensive understanding of its role in data analysis.

Interpretation 1: An Intuitive View from Venn Diagrams

The most accessible interpretation of Mutual Information is as the reduction in uncertainty. $I(X;Y)$ quantifies the amount of uncertainty about variable $X$ that is eliminated by observing variable $Y$. This concept is best visualized using a Venn diagram in which circles represent the entropy (total uncertainty) of each variable.

  • H(X)H(X) and H(Y)H(Y) are the total uncertainties of variables XX and YY.

  • H(X,Y)H(X,Y) is their joint uncertainty.

  • H(X∣Y)H(X|Y) is the uncertainty remaining in XX after YY is known.

Mutual Information, $I(X;Y)$, is the overlapping area between the two circles, representing the information they share. This visual analogy leads directly to its fundamental formulas in terms of entropy:

$$I(X;Y) = H(X) - H(X|Y)$$
$$I(X;Y) = H(Y) - H(Y|X)$$
$$I(X;Y) = H(X) + H(Y) - H(X,Y)$$
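As a quick numerical illustration of these identities, the following Python sketch computes $I(X;Y)$ from entropies for a small, made-up joint distribution over two binary variables (the numbers are arbitrary and chosen only for demonstration):

```python
from math import log2

# A made-up joint distribution p(x, y) over two binary variables.
p_xy = {(0, 0): 0.4, (0, 1): 0.1,
        (1, 0): 0.1, (1, 1): 0.4}

def H(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

# Marginal distributions p(x) and p(y).
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

mi = H(p_x) + H(p_y) - H(p_xy)   # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(f"H(X) = {H(p_x):.3f}, H(Y) = {H(p_y):.3f}, H(X,Y) = {H(p_xy):.3f}")
print(f"I(X;Y) = {mi:.3f} bits")   # about 0.278 bits for these numbers
```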

From this intuitive model, we can derive several of MI's most critical properties, which govern its behavior and analytical utility:

  • Boundary Conditions: Mutual information is always non-negative and is upper-bounded by the entropy of the less uncertain variable. This is expressed as 0≀I(X;Y)≀min(H(X),H(Y))0 ≀ I(X;Y) ≀ min(H(X), H(Y)). The shared information cannot exceed the total information contained in either of the constituent variables.

  • Symmetry: The overlapping area is identical regardless of which variable is considered first. This confirms the fundamental symmetry of mutual information: I(X;Y)=I(Y;X)I(X;Y) = I(Y;X). The information that XX provides about YY is precisely the same as the information YY provides about XX.

  • Perfect Dependence Condition: A special boundary case occurs when the mutual information is equal to the entire entropy of one variable, for example, I(X;Y)=H(X)I(X;Y) = H(X). In the Venn diagram, this means the circle for H(X)H(X) is completely contained within the circle for H(Y)H(Y). This implies that knowing YY completely resolves all uncertainty about XX, leading to a conditional entropy of zero: H(X∣Y)=0H(X|Y) = 0. This signifies a state of perfect (though not necessarily deterministic) dependence of XX on YY.

Interpretation 2: A Rigorous Definition via KL Divergence

For mathematical rigor, Mutual Information is formally defined as the Kullback-Leibler (KL) Divergence between the true joint distribution $p(x,y)$ and the idealized model of statistical independence, $p(x)p(y)$.

$$I(X;Y) = D_{KL}\big(\, p(x,y) \,\|\, p(x)p(y) \,\big) = \sum_{x,y} p(x,y)\, \log_2 \frac{p(x,y)}{p(x)\,p(y)}$$

This frames $I(X;Y)$ as the informational penalty incurred by incorrectly assuming two variables are independent. In coding theory, this translates to the average number of extra bits required to encode the pair $\{x,y\}$ if their dependency is ignored. A higher MI signifies a stronger statistical dependence, as the cost of the independence assumption becomes greater.
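The sketch below evaluates this KL-divergence form on the same made-up joint distribution used in the previous sketch; the result matches the entropy-based computation, as the two definitions are algebraically identical.

```python
from math import log2

# The same made-up joint distribution as in the entropy-based sketch above.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

# I(X;Y) = D_KL( p(x,y) || p(x)p(y) )
mi_kl = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)
print(f"I(X;Y) as a KL divergence = {mi_kl:.3f} bits")   # matches the entropy-based value
```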

Interpretation 3: A Statistical and Bayesian Perspective

By applying the product rule of probability, $p(x,y) = p(x|y)\,p(y)$, the KL divergence formula can be rewritten in a statistical form that highlights its role in measuring dependence:

$$I(X;Y) = \sum_{x,y} p(x,y)\, \log_2 \frac{p(x|y)}{p(x)}$$

This form reveals two key insights:

  1. A Test for Independence: $I(X;Y) = 0$ if and only if $p(x|y) = p(x)$ for all $x$ and $y$, which is the definition of statistical independence. This property makes MI a definitive, non-linear measure of correlation, capable of detecting any form of statistical relationship.

  2. A Measure of Belief Update: From a Bayesian perspective, it quantifies the average change between the prior belief $p(x)$ and the posterior belief $p(x|y)$. It measures how much, on average, observing $Y$ updates our knowledge about $X$.

Interpretation 4: The Identity of Self-Information and Uncertainty

A special case of mutual information is when a variable is compared to itself.

The self-information $I(X;X)$ is calculated as:

$$I(X;X) = H(X) + H(X) - H(X,X) = H(X)$$

This identity is profound: it establishes that the total uncertainty of a variable, $H(X)$, is precisely equal to the information it contains about itself, $I(X;X)$. Information and uncertainty are thus two sides of the same coin: complementary quantities where one is defined as the resolution of the other.

Interpretation 5: A Practical View from Kelly Gambling

A final, practical interpretation comes from the domain of investment and gambling theory. The Kelly criterion outlines an optimal strategy for capital growth over repeated trials. In this context, Mutual Information emerges as a direct measure of the financial value of information. If a gambler has access to side information $Y$ (e.g., a tip) about the outcome of an event $X$, then $I(X;Y)$ is the increase in the average exponential growth rate of their capital achievable by leveraging this side information, compared to gambling with only the baseline probabilities $p(x)$. In other words, mutual information $I(X;Y)$ quantifies the advantage of investing with the additional information $Y$ over investing without it. This interpretation grounds the abstract concept of "bits" in a tangible measure of strategic advantage.
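The following sketch illustrates this claim under simplifying assumptions: a two-outcome event with "fair" odds $o(x) = 1/p(x)$ and proportional (Kelly) betting, using a made-up joint distribution of outcome and tip. Under these assumptions, the increase in the expected log-growth rate from using the tip equals $I(X;Y)$.

```python
from math import log2

# Made-up joint distribution of the outcome X and a noisy tip Y (illustrative only).
p_xy = {("win", "tip_win"): 0.35, ("win", "tip_lose"): 0.15,
        ("lose", "tip_win"): 0.10, ("lose", "tip_lose"): 0.40}

p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

odds = {x: 1 / px for x, px in p_x.items()}   # "fair" odds with respect to p(x)

# Kelly betting: stake fractions equal to your beliefs about X.
# Without the tip, bet p(x); under fair odds the expected log-growth rate is zero.
rate_baseline = sum(p_x[x] * log2(odds[x] * p_x[x]) for x in p_x)

# With the tip, bet p(x|y) = p(x,y) / p(y) after observing y.
rate_with_tip = sum(p * log2(odds[x] * p / p_y[y]) for (x, y), p in p_xy.items())

mi = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(f"growth-rate gain = {rate_with_tip - rate_baseline:.4f} bits per bet")
print(f"I(X;Y)           = {mi:.4f} bits")   # the same number
```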

Together, these five interpretations provide a holistic and robust understanding of Mutual Information, equipping the analyst to apply it effectively across diverse analytical challenges.

2.2 The Relationship Between KL Divergence and Mutual Information

The bridge between the abstract concept of distributional distance and the practical measure of statistical dependence is both direct and profound. As introduced, Mutual Information $I(X;Y)$ is formally defined as the KL Divergence between the true joint distribution $p(x,y)$ and the distribution that assumes independence, $p(x)p(y)$.

$$I(X;Y) = D_{KL}\big(\, p(x,y) \,\|\, p(x)p(y) \,\big)$$

This identity is not merely a mathematical convenience; it provides the deepest interpretation of what mutual information represents. Let us unpack this:

  1. The "Reality": For any two variables $X$ and $Y$, their true relationship is captured by their joint probability distribution, $p(x,y)$. This is our ground truth.

  2. The "Simplifying Assumption": A common baseline model is to assume the variables are independent, meaning $p(x,y) = p(x)p(y)$. This model posits that there is no relationship between them.

  3. The "Penalty": The KL Divergence, $D_{KL}\big(p(x,y) \,\|\, p(x)p(y)\big)$, measures the "penalty" or "cost," in bits, of making this simplifying assumption. It quantifies how much our model of independence deviates from reality.

Therefore, Mutual Information is precisely the informational cost of incorrectly assuming that two variables are independent.

If the variables are truly independent, the cost is zero ($I(X;Y) = 0$). If they are strongly dependent, the model of independence is a very poor approximation of reality, resulting in a high cost and thus a high mutual information value. This perspective anchors the entire methodology: we analyze complex systems by quantifying the informational penalty of making simplifying assumptions about their structure.

2.3 Pointwise or Local Mutual Information: From Averages to Events

While Mutual Information $I(X;Y)$ provides a powerful summary of the average relationship between two variables, data analysis often requires drilling down to understand specific events. Pointwise Mutual Information (PMI), denoted $i(x;y)$, is the tool for this fine-grained analysis.

PMI measures the reduction in surprise about a single, specific outcome $x$ upon observing a single, specific outcome $y$. It moves the analysis from the level of variables $(X, Y)$ to the level of individual events $(x, y)$. Its formulas mirror those of its averaged counterpart:

$$i(x;y) = h(x) - h(x|y) = \log_2 \frac{p(x|y)}{p(x)}$$

The most critical distinction of PMI is that, unlike the always non-negative $I(X;Y)$, pointwise mutual information can be positive, negative, or zero. This sign provides crucial diagnostic information about the nature of the interaction for a specific event pair:

  • Positive PMI (i(x;y)>0i(x;y) > 0): Positive Information This occurs when observing yy makes the specific outcome xx more likely than it was a priori (p(x∣y)>p(x)p(x|y) > p(x)). The observation yy has positively informed us about xx, increasing our expectation that it would occur and thus reducing our surprise.

  • Negative PMI (i(x;y)<0i(x;y) < 0): Misinformation This occurs when observing yy makes the specific outcome xx less likely than it was a priori (p(x∣y)<p(x)p(x|y) < p(x)). The observation yy has misinformed us about xx. It updated our beliefs in a way that made the actual outcome seem even more surprising than it would have been without the information. For example, if a "sunshine" forecast is followed by rain, that forecast has provided misinformation for that specific event, resulting in a negative PMI value.

The average mutual information $I(X;Y)$ is the expectation of the pointwise mutual information over all possible event pairs: $I(X;Y) = \langle i(x;y) \rangle$. The fact that $I(X;Y)$ must be non-negative means that, on average, the instances of positive information must outweigh the instances of misinformation. PMI thus provides the essential tool to dissect this average and identify which specific events drive the overall statistical dependence.
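The sketch below lists the pointwise values $i(x;y)$ for every event pair of the same illustrative joint distribution used earlier; some are positive and some negative, and their probability-weighted average recovers $I(X;Y)$.

```python
from math import log2

# The same made-up joint distribution used in the earlier sketches.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

mi = 0.0
for (x, y), p in p_xy.items():
    pmi = log2((p / p_y[y]) / p_x[x])   # i(x;y) = log2( p(x|y) / p(x) )
    mi += p * pmi                        # weighting by p(x,y) gives the average
    print(f"i(x={x}; y={y}) = {pmi:+.3f} bits")
print(f"I(X;Y) = <i(x;y)> = {mi:.3f} bits")   # positive terms outweigh the negative ones
```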

3. Conditional Mutual Information

3.1 Conditional Mutual Information

While Mutual Information quantifies the relationship between two variables in isolation, real-world systems are networks of interconnected influences. To dissect these complex dependencies, we must move beyond pairwise analysis and ask a more sophisticated question: How does the informational relationship between $X$ and $Y$ change when we account for the context provided by a third variable, $Z$?

Conditional Mutual Information (CMI), denoted $I(X;Y|Z)$, is the principal tool for this task. CMI measures the reduction in uncertainty about variable $X$ gained from variable $Y$, given that the state of variable $Z$ is already known. It is the informational value of $Y$ about $X$, in the context of $Z$.

Interpretation 1: An Entropy-Based View in the Context of Z

The most direct way to define CMI is by adapting the entropy-based formulas for MI, with every term now conditioned on $Z$. This frames CMI as a measure of shared information within the context of $Z$.

$$I(X;Y|Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z)$$
$$I(X;Y|Z) = H(X|Z) - H(X|Y,Z)$$
$$I(X;Y|Z) = H(Y|Z) - H(Y|X,Z)$$
$$I(X;Y|Z) = I(X;Y,Z) - I(X;Z)$$

This perspective reveals several of CMI's core properties:

  • Boundary Conditions: CMI is non-negative and is upper-bounded by the conditional entropies: 0≀I(X;Y∣Z)≀min(H(X∣Z),H(Y∣Z))0 ≀ I(X;Y|Z) ≀ min(H(X|Z), H(Y|Z)). A crucial implication is that if ZZ completely explains XX (i.e., H(X∣Z)=0H(X|Z) = 0), then there is no residual uncertainty for YY to reduce, so I(X;Y∣Z)=0I(X;Y|Z) = 0.

  • Symmetry: CMI is symmetric in XX and YY, I(X;Y∣Z)=I(Y;X∣Z)I(X;Y|Z) = I(Y;X|Z).

  • Perfect Conditional Dependence: If I(X;Y∣Z)=H(X∣Z)I(X;Y|Z) = H(X|Z), it implies that, given ZZ, knowing YY completely resolves all remaining uncertainty about XX, meaning H(X∣Y,Z)=0H(X|Y,Z) = 0.

A particularly useful identity, often called the chain rule for mutual information, also follows from this view: $I(X;Y,Z) = I(X;Z) + I(X;Y|Z)$. This shows that CMI is precisely the additional information that $Y$ provides about $X$, beyond what $Z$ already provided.
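The following sketch evaluates $I(X;Y|Z)$ from joint entropies for a small, made-up three-variable distribution. It uses the equivalent unconditioned form $I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)$, obtained by expanding each conditional entropy as $H(\cdot|Z) = H(\cdot,Z) - H(Z)$.

```python
from math import log2
from itertools import product

# A made-up joint distribution p(x, y, z) over three binary variables:
# Z is a fair coin, and given Z the pair (X, Y) is weakly coupled.
p_xyz = {(x, y, z): 0.5 * (0.3 if x == y else 0.2)
         for x, y, z in product((0, 1), repeat=3)}

def H(dist):
    """Entropy in bits of a joint distribution given as {outcome tuple: probability}."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def marginal(dist, keep):
    """Marginalize a joint {tuple: prob} down to the axes listed in `keep`."""
    out = {}
    for outcome, p in dist.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0) + p
    return out

# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
cmi = (H(marginal(p_xyz, (0, 2))) + H(marginal(p_xyz, (1, 2)))
       - H(p_xyz) - H(marginal(p_xyz, (2,))))
print(f"I(X;Y|Z) = {cmi:.4f} bits")   # a small positive value for these numbers
```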

Interpretation 2: A Rigorous Definition via Conditional KL Divergence

For mathematical rigor, CMI is the KL Divergence between the true conditional joint distribution $p(x,y|z)$ and the model of conditional independence, $p(x|z)\,p(y|z)$, averaged over the values of $Z$.

$$I(X;Y|Z) = \sum_{x\in A_X,\, y \in A_Y,\, z \in A_Z} p(x,y,z)\, \log_2 \frac{p(x,y|z)}{p(x|z)\,p(y|z)}$$
$$I(X;Y|Z) = \sum_{x\in A_X,\, y \in A_Y,\, z \in A_Z} p(x,y,z)\, \log_2 \frac{p(x,y,z)\,p(z)}{p(x,z)\,p(y,z)}$$

This definition can be written compactly as:

$$I(X;Y|Z) = D_{KL}\big(\, p(x,y|z) \,\|\, p(x|z)\,p(y|z) \,\big)$$

It carries two important implications:

  β€’ It frames CMI as the cost of assuming $X$ and $Y$ are independent, given the context $Z$.

  β€’ In coding theory, this translates to the penalty in code length for encoding $\{x,y\}$ under the assumption of conditional independence, or for encoding $x$ using only knowledge of $z$ without the additional knowledge of $y$.

Interpretation 3: A Statistical View as Non-Linear Partial Correlation

By applying the definition of conditional probability, the CMI formula can be rewritten in a form that highlights its statistical meaning:

$$I(X;Y|Z) = \sum_{x\in A_X,\, y \in A_Y,\, z \in A_Z} p(x,y,z)\, \log_2 \frac{p(x|y,z)}{p(x|z)}$$

This form underscores two key properties:

  • A Test for Conditional Independence: I(X;Y∣Z)=0I(X;Y|Z) = 0 if and only if XX and YY are independent conditional on ZZ.

  • A Non-Linear Partial Correlation: This property establishes CMI as the information-theoretic analogue of partial correlation. It quantifies the direct statistical relationship between XX and YY while controlling for the non-linear influences of ZZ.

A crucial warning is in order regarding visualization: Venn diagrams should not be used to interpret three-variable information measures. Although the areas in such a diagram add up correctly, they give the misleading impression that every component (including the three-way interaction term) is non-negative, which is not true in general.

3.2 The Power of Context: Redundancy and Synergy

The most profound insight from Conditional Mutual Information comes from comparing it to the unconditional Mutual Information, $I(X;Y)$. This comparison reveals how a contextual variable $Z$ can fundamentally alter the perceived relationship between $X$ and $Y$. This effect manifests in three primary ways:

  1. No Effect: If all variables are independent, or if $Z$ is jointly independent of the pair $(X, Y)$, conditioning has no effect, and $I(X;Y|Z) = I(X;Y)$.

  2. Redundancy: $I(X;Y|Z) < I(X;Y)$. This occurs when the contextual variable $Z$ provides information about $X$ that is redundant with the information also provided by $Y$. In this scenario, $Z$ "explains away" some of the statistical relationship that was visible between $X$ and $Y$. The information from $Y$ becomes less valuable because $Z$ has already provided some of it.

    • Example: Consider XX, YY, and ZZ to be three identical, independent and identically distributed (i.i.d.) random bits. On its own, the relationship between XX and YY is perfect, so I(X;Y)=1I(X;Y) = 1 bit. However, once we are given the value of ZZ, which is identical to XX, we have learned everything there is to know about XX. The variable YY can provide no additional information. Therefore, the conditional mutual information drops to zero: I(X;Y∣Z)=0I(X;Y|Z) = 0.

  3. Synergy: $I(X;Y|Z) > I(X;Y)$. This occurs when $Y$ and $Z$ work together to provide synergistic information about $X$ that neither $Y$ nor $Z$ could provide alone. The context provided by $Z$ unlocks, reveals, or amplifies a hidden relationship between $X$ and $Y$.

    • Example: The classic case is the exclusive OR (XOR) function, where X=YX = Y XOR ZZ and YY and ZZ are i.i.d. random bits. Knowing YY alone tells us nothing about XX because its value is perfectly scrambled by ZZ. Consequently, I(X;Y)=0I(X;Y) = 0. However, once ZZ is known (the context is provided), the value of Y completely determines the value of XX. This results in maximum conditional information: I(X;Y∣Z)=1I(X;Y|Z) = 1 bit.

The difference $I(X;Y|Z) - I(X;Y)$ serves as an indicator: a positive value implies the presence of synergy, while a negative value implies redundancy. It is crucial to understand that a single variable $Z$ can simultaneously provide some redundant information while also creating synergistic context; the net change only reveals which of these two effects is dominant.
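Both regimes can be checked numerically. The sketch below builds the two toy joint distributions described above (identical copies of one bit, and the XOR triple) and computes $I(X;Y)$ and $I(X;Y|Z)$ for each; the numbers reproduce the redundancy and synergy patterns exactly.

```python
from math import log2
from itertools import product

def H(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def marginal(dist, keep):
    out = {}
    for outcome, p in dist.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0) + p
    return out

def mutual_info(p_xyz):
    """I(X;Y) with X and Y the first two axes of the joint distribution."""
    return (H(marginal(p_xyz, (0,))) + H(marginal(p_xyz, (1,)))
            - H(marginal(p_xyz, (0, 1))))

def cond_mutual_info(p_xyz):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (H(marginal(p_xyz, (0, 2))) + H(marginal(p_xyz, (1, 2)))
            - H(p_xyz) - H(marginal(p_xyz, (2,))))

# Redundancy: X, Y, Z are identical copies of a single fair bit.
copy = {(b, b, b): 0.5 for b in (0, 1)}

# Synergy: Y and Z are i.i.d. fair bits and X = Y XOR Z.
xor = {(y ^ z, y, z): 0.25 for y, z in product((0, 1), repeat=2)}

for name, dist in (("copy", copy), ("xor ", xor)):
    print(f"{name}: I(X;Y) = {mutual_info(dist):.1f} bit, "
          f"I(X;Y|Z) = {cond_mutual_info(dist):.1f} bit")
# copy: I(X;Y) = 1.0 bit, I(X;Y|Z) = 0.0 bit   (redundancy)
# xor : I(X;Y) = 0.0 bit, I(X;Y|Z) = 1.0 bit   (synergy)
```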

3.3 Pointwise Conditional Mutual Information: From Averages to Events

Just as Mutual Information can be dissected into its event-specific contributions, so too can Conditional Mutual Information. Pointwise (or Local) Conditional Mutual Information, denoted $i(x;y|z)$, is the tool for this fine-grained analysis.

It measures the reduction in surprise about a single, specific outcome $x$ upon observing a single outcome $y$, given a specific contextual outcome $z$. Its formulas are the direct, "pointwise" analogues of the average CMI formulas, using Shannon information content $h(\cdot)$ instead of entropy $H(\cdot)$:

$$i(x;y|z) = h(x|z) - h(x|y,z)$$
$$i(x;y|z) = \log_2 \frac{p(x|y,z)}{p(x|z)}$$

The critical property of $i(x;y|z)$ is that, like its unconditional counterpart, it can be positive or negative:

  • A positive value signifies that, in the context of zz, observing y made the outcome xx more likely, thus providing positive information.

  • A negative value signifies that, in the context of zz, observing y made the outcome xx less likely, thus providing misinformation.

The average Conditional Mutual Information $I(X;Y|Z)$ is the expectation of the pointwise values over all possible event triplets: $I(X;Y|Z) = \langle i(x;y|z) \rangle$. This tool allows analysts to move beyond assessing the average effect of context and instead identify the specific event combinations where context is most impactful.
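For the XOR example above, the pointwise values are easy to work out by hand, as this tiny sketch shows (it hard-codes the relevant conditional probabilities rather than deriving them):

```python
from math import log2

# Pointwise CMI for the XOR example: X = Y XOR Z, with Y and Z i.i.d. fair bits.
# For every event (x, y, z) that actually occurs:
p_x_given_z = 0.5     # knowing z alone leaves X a fair coin flip
p_x_given_yz = 1.0    # knowing y and z fixes x = y XOR z exactly

i_local = log2(p_x_given_yz / p_x_given_z)   # i(x;y|z) = log2( p(x|y,z) / p(x|z) )
print(f"i(x;y|z) = {i_local:.1f} bit for every realized event")
# Since every local value is +1 bit, the average I(X;Y|Z) is also 1 bit.
```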

4. The Chain Rule for Mutual Information: An Information Regression

4.1 The Chain Rule for Mutual Information

To analyze the information that multiple source variables $\{Y, Z\}$ provide about a target variable $X$, we must have a method for correctly accounting for their contributions without double-counting. The chain rule for mutual information provides this systematic framework for decomposition.

The rule states that the total information $X$ shares with the joint variable $\{Y, Z\}$ can be decomposed as the sum of the information it shares with $Y$ alone, plus the additional information it shares with $Z$ given $Y$.

$$I(X; Y,Z) = I(X;Y) + I(X;Z|Y)$$

The decomposition can begin with either source, so the order is interchangeable:

$$I(X; Y,Z) = I(X;Z) + I(X;Y|Z)$$

This principle generalizes to any number of source variables, allowing us to build a comprehensive model of multivariate information. For $n$ sources $X_1, \dots, X_n$ and a target $Y$, the total information is:

$$I(X_1, \dots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \dots, X_{i-1})$$

This decomposition is a powerful analytical tool. It is conceptually equivalent to a form of information regression: just as linear regression assesses the unique contribution of each predictor while controlling for the others, the chain rule allows us to attribute the unique informational contribution of each variable in the context of those already considered. The same additive principle also applies at the pointwise level: $i(x;y,z) = i(x;y) + i(x;z|y)$.
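The chain rule is straightforward to verify numerically. The sketch below uses an arbitrary, made-up joint distribution $p(x,y,z)$ with no special structure and confirms that $I(X;Y,Z)$ equals $I(X;Y) + I(X;Z|Y)$ when both sides are computed from entropies.

```python
from math import log2

def H(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def marginal(dist, keep):
    out = {}
    for outcome, p in dist.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0) + p
    return out

def I(dist, a, b):
    """I(A;B) between the axis groups a and b of a joint distribution."""
    return H(marginal(dist, a)) + H(marginal(dist, b)) - H(marginal(dist, a + b))

# An arbitrary made-up joint distribution p(x, y, z); probabilities sum to 1.
p_xyz = {(0, 0, 0): 0.10, (0, 0, 1): 0.15, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
         (1, 0, 0): 0.20, (1, 0, 1): 0.05, (1, 1, 0): 0.15, (1, 1, 1): 0.10}

total = I(p_xyz, (0,), (1, 2))        # I(X; Y,Z)
i_xy = I(p_xyz, (0,), (1,))           # I(X; Y)
i_xz_given_y = (H(marginal(p_xyz, (0, 1))) + H(marginal(p_xyz, (1, 2)))
                - H(p_xyz) - H(marginal(p_xyz, (1,))))   # I(X; Z | Y)

print(f"I(X;Y,Z)          = {total:.4f} bits")
print(f"I(X;Y) + I(X;Z|Y) = {i_xy + i_xz_given_y:.4f} bits")   # identical
```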

4.2 Aside: The Axiomatic Foundation of Mutual Information

A deeper question is why mutual information takes the mathematical form that it does. The answer lies in its axiomatic foundation. Pointwise Mutual Information, $i(x;y)$, is the unique functional form that satisfies a set of fundamental, desirable properties, or axioms, for a measure of information.

The core form is defined as $i(x;y) = \log_2 \frac{p(x|y)}{p(x)}$. This form is not arbitrary; it is uniquely determined by four axioms:

  1. Differentiability: The function must be differentiable with respect to the probabilities $p(x)$ and $p(x|y)$.

  2. Conditional Form: The conditional version $i(x;y|z)$ must have the same functional form as the unconditional $i(x;y)$, with all underlying probability distributions simply conditioned on $z$.

  3. Additivity (Chain Rule): The measure must be additive. The information about a joint event $\{y,z\}$ must be decomposable into the sum of information about its components in sequence: $i(x; y,z) = i(x;z) + i(x;y|z)$.

  4. Separation for Independent Ensembles: If two pairs of variables $\{x,y\}$ and $\{u,v\}$ come from independent systems (i.e., $p(x,y,u,v) = p(x,y)\,p(u,v)$), then the information shared across the combined systems must be the sum of the information shared within each system: $i(x,u;\, y,v) = i(x;y) + i(u;v)$.

These axioms ensure that our measure of information behaves consistently and logically across different analytical contexts. The average Mutual Information $I(X;Y)$ inherits these robust properties by being the expectation of the pointwise form, $I(X;Y) = \langle i(x;y) \rangle$.

5. Conclusion: From Uncertainty to Interaction Architecture

This chapter has extended our analytical toolkit from pairwise to multivariate systems, providing a rigorous mathematical framework to analyze relationships in context. Through the concepts of Conditional Mutual Information (CMI), redundancy, and synergy, we have moved beyond simply measuring dependence to understanding its underlying architecture. By learning how to decompose complex interactions using the chain rule, we can now dissect the flow of information in multifaceted systems, gaining deeper insights into how context shapes the informational landscapeβ€”a critical capability for fields ranging from strategic analysis and neuroscience to machine learning.
