📊 Statistical Significance and Undersampling
1. Introduction
Information-theoretic measures, such as Mutual Information (MI), offer a powerful lens for data analysis, capable of detecting complex, non-linear relationships that simpler correlation-based methods often miss. They provide a model-free way to quantify the statistical dependencies between variables. However, this power comes with a significant caveat. When applied to finite, real-world datasets, these measures are susceptible to statistical artifacts that can lead to misinterpretation. A non-zero measurement does not automatically imply a true underlying relationship. This chapter provides a rigorous methodological framework for applying information-theoretic measures to time series data, ensuring that analyses are both statistically sound and scientifically robust.
The central challenge addressed in this document stems from a discrepancy between theory and practice.
In theory, a Mutual Information of zero, I(X;Y)=0, holds if and only if the two variables X and Y are statistically independent.
In practice, when analyzing empirical data, it is possible to measure a non-zero MI, I(X;Y)>0, even when the variables are truly independent.
This discrepancy presents a critical question for every analyst:
"Is a given non-zero estimate of I(X;Y) meaningfully different from zero, or is it merely consistent with zero due to finite sampling effects?"
How many samples do we need to determine this, or indeed to obtain an accurate value of I(X;Y)? Without a formal method to answer these questions, any conclusion drawn from the analysis rests on uncertain ground.
To distinguish a genuine signal from statistical noise, a formal statistical testing framework is not just recommended; it is essential.
2. The Imperative of Statistical Significance: A Null Hypothesis Framework
To answer the first of these questions, we can adopt a formal statistical testing approach. It provides a structured methodology to determine whether an observed relationship in the data is statistically meaningful or likely the result of random chance inherent in finite sampling.
To understand the test, we must first look at the mathematical foundation of Mutual Information itself. MI is calculated as the expected value of the logarithmic ratio between the conditional probability (posterior) and the marginal probability (prior):
I(X;Y) = Σx,y p(x,y) × log[ p(x∣y) / p(x) ]
This formula quantifies, on average, how much observing a specific value of Y changes our knowledge about the probability of observing a specific value of X. This leads directly to the structure of our statistical test:
Null Hypothesis (H0): The variables X and Y are statistically independent. Under this hypothesis, the true Mutual Information is zero.
Alternative Hypothesis: A statistical dependence exists between X and Y.
The Test: The core of the test is to calculate the probability of observing the measured MI statistic under the assumption that the null hypothesis (H0) is true. This probability is the p-value. To calculate this probability, we must first understand what the distribution of MI values would look like if the null hypothesis were true. This is achieved by generating a surrogate distribution.
The Surrogate Distribution
The surrogate distribution represents the range and frequency of MI values one would expect to see from the data if the variables were completely independent. It is constructed by generating multiple sets of surrogate data. This process is designed to preserve the essential statistical properties of the original data (for instance, the marginal probability distribution p(y) of a variable Y is kept intact) while explicitly destroying any potential relationship to the other variable, X. This ensures that the conditional probability p(x∣y) in the surrogate data is distributed as the marginal p(x). This is effective because shuffling breaks the specific per-sample pairing (e.g., the temporal alignment in a time series) between X and Y, thereby nullifying any joint information, while leaving the overall statistical profile of Y untouched.
The p-value
Once the surrogate distribution is established, the p-value is calculated by comparing the originally measured MI value against this distribution. Specifically, the p-value is the probability that a value drawn randomly from the surrogate distribution would be greater than or equal to the actually measured value. In formal terms:
p = P( I(X;Ys) ≥ I(X;Y) )
Where I(X;Y) is the MI measured from the original data and I(X;Ys) is the MI measured from surrogate data. A small p-value (e.g., <0.05) suggests that the observed MI is unlikely to have occurred by chance alone, allowing us to reject the null hypothesis in favor of the alternative.
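As a minimal sketch of this definition (the surrogate values below are random placeholders standing in for a real null distribution; the names are ours), the p-value is simply the fraction of surrogate MI values that meet or exceed the observed one:

```python
import numpy as np

def permutation_p_value(mi_observed, mi_surrogates):
    """Fraction of surrogate MI values >= the observed MI (one-sided empirical p-value)."""
    mi_surrogates = np.asarray(mi_surrogates)
    return float(np.mean(mi_surrogates >= mi_observed))

# Hypothetical example: an observed MI of 0.08 nats against 1000 surrogate values
rng = np.random.default_rng(0)
fake_surrogates = rng.exponential(scale=0.02, size=1000)  # stand-in for a real null distribution
print(permutation_p_value(0.08, fake_surrogates))
```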
To construct this critical surrogate distribution, analysts can choose between two primary methodological paths: the empirical and the analytical.
3. Establishing the Null Distribution: Two Methodological Paths
Constructing the null distribution is the computational core of significance testing. There are two primary techniques to achieve this, each presenting a fundamental trade-off between computational speed and methodological accuracy. The choice between them depends on the specific estimator being used, the nature of the data, and the goals of the analysis.
3.1 The Empirical Approach: Surrogate Data Generation
The empirical approach, also known as permutation or resampling, builds the null distribution directly from the data. It is a robust method that makes minimal assumptions about the underlying data distributions. The process is as follows:
Start with the original paired samples of the time series data for variables X and Y.
Create a surrogate variable, Ys, by randomly permuting (shuffling) the samples of the Y variable. This action rigorously preserves the marginal distribution of Y but completely destroys the specific, sample-by-sample temporal relationship with X.
Compute the Mutual Information between the original variable X and the first surrogate variable Ys1. This calculation yields one sample point from the null distribution.
Repeat this process many times (e.g., 100 or 1000 times), each time with a new, independent permutation of Y. The resulting collection of MI values forms a histogram that represents the empirical surrogate distribution.
To make this concrete, let's use the example of investigating the relationship between students' study time (variable Y) and their final exam scores (variable X).
Calculate the Original Statistic: First, compute the Mutual Information, I(X;Y), from the original, correctly paired data. This gives us our benchmark value—the one we want to test for significance.
Generate a Surrogate Variable: Next, create a surrogate variable, Ys, by taking the entire column of study times (Y) and randomly shuffling its values. As shown in the data table example, this critical action means every student's original exam score (X) is now paired with a randomly chosen study time from a different student.
Preserve Marginals, Destroy Joint Relationship: This permutation rigorously preserves the overall statistical profile of study times (the marginal distribution p(y) is unchanged) but completely destroys the specific, sample-by-sample relationship with exam scores. The joint probability p(x,y) is broken, effectively forcing the data to conform to the null hypothesis of independence.
Calculate the Surrogate Statistic: Compute the Mutual Information between the original exam scores (X) and this first surrogate variable of study times (Ys1). This calculation yields one sample point for our null distribution.
Repeat to Build the Distribution: Repeat this process many times (e.g., 100 or 1,000 times), each time with a new, independent permutation of Y. The resulting collection of MI values, when plotted as a histogram, forms the empirical surrogate distribution. This histogram visualizes the range of MI values we would expect to see if there were no true relationship between study time and exam scores.
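The following sketch ties these steps together. It is an illustrative implementation only, not taken from any particular toolkit: the plug-in MI estimator, the synthetic exam-score and study-time data, and all names are assumptions made for the example.

```python
import numpy as np

def plugin_mi_nats(x, y):
    """Plug-in (discrete) MI in nats, computed from co-occurrence counts."""
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    counts = np.zeros((xs.size, ys.size))
    np.add.at(counts, (xi, yi), 1)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0                                  # empty cells contribute nothing
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x * p_y)[nz])))

rng = np.random.default_rng(0)
n = 500
exam_scores = rng.integers(0, 4, size=n)                      # X, discretized into 4 bins
study_time = (exam_scores + rng.integers(0, 3, size=n)) % 4   # Y, related to X purely for illustration

# Benchmark value from the correctly paired data
mi_observed = plugin_mi_nats(exam_scores, study_time)

# Build the surrogate (null) distribution by repeatedly shuffling Y
n_permutations = 1000
mi_surrogates = np.empty(n_permutations)
for i in range(n_permutations):
    study_time_s = rng.permutation(study_time)    # preserves p(y), destroys the pairing with X
    mi_surrogates[i] = plugin_mi_nats(exam_scores, study_time_s)

# Empirical p-value, as defined earlier
p_value = np.mean(mi_surrogates >= mi_observed)
print(f"I(X;Y) = {mi_observed:.4f} nats, p = {p_value:.3f}")
```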

Methodological Nuances: The Directionality of the Test
While the process for I(X;Y) seems symmetrical, the choice of which variable to shuffle introduces a subtle directionality. This concept becomes even clearer when we consider Conditional Mutual Information, I(X;Y∣Z), where the same principle holds but with an important modification.
For Conditional MI I(X;Y∣Z): To create the surrogate distribution, we must preserve the conditioning context. This means we perform a conditional permutation: for each unique value of Z, we shuffle the corresponding Y values among themselves. This process is designed to test if p(x∣y,z) is distributed as p(x∣z). In other words, we are asking, "Given the context Z, does knowing Y still provide any additional information about X?"
Difference from I(X;Y): By shuffling Y (conditionally or not) while keeping X fixed, we are specifically testing the dependence of X on Y. This becomes a directional test. We are assessing the flow of information from Y to X. While asymptotically (with infinite data) it makes no difference whether you resample X or Y, for finite datasets, the choice matters. Often, as is the case with measures like Transfer Entropy, researchers are specifically interested in such a directional test, making this the standard and methodologically sound approach.
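A minimal sketch of such a conditional permutation, assuming a discrete-valued conditioning variable Z and hypothetical variable names, shuffles Y only within groups that share the same value of Z, so that p(y∣z) is preserved while the pairing with X (given Z) is destroyed:

```python
import numpy as np

def conditional_permutation(y, z, rng):
    """Shuffle y within each group of samples that shares the same value of z."""
    y = np.asarray(y)
    y_s = y.copy()
    for value in np.unique(z):
        idx = np.flatnonzero(z == value)        # positions belonging to this conditioning context
        y_s[idx] = y[rng.permutation(idx)]      # permute y only among those positions
    return y_s

rng = np.random.default_rng(1)
z = rng.integers(0, 3, size=20)
y = rng.integers(0, 2, size=20)
y_surrogate = conditional_permutation(y, z, rng)
# Computing I(X; y_surrogate | Z) over many such surrogates builds the null distribution for CMI
```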
3.2 The Analytical Approach: The Chi-Squared (χ²) Approximation
For certain estimators, the surrogate distribution can be described by a known mathematical function, providing an analytical shortcut that bypasses the need for computationally intensive resampling.
Applicable Estimators: This method is valid for the Linear-Gaussian and the discrete (plug-in) estimators.
The Formula: For these specific estimators, the quantity 2N×I (where N is the number of samples and I is measured in nats) is known to follow a chi-squared (χ2) distribution.
The specific shape of the χ² distribution is determined by its "degrees of freedom," which depend on the properties of the variables being analyzed, as summarized below.
Linear-Gaussian estimator: dim(X) × dim(Y) degrees of freedom for MI, and dim(X) × dim(Y) for Conditional MI.
Discrete (plug-in) estimator: (∣AX∣−1) × (∣AY∣−1) degrees of freedom for MI, and (∣AX∣−1) × (∣AY∣−1) × ∣AZ∣ for Conditional MI.
Note: dim(X) is the number of dimensions of a multivariate X, and ∣AX∣ is the alphabet size of the discrete variable X.
Under the null hypothesis, the statistic is 2N×I(X;Ys) (or 2N×I(X;Ys∣Z) in the conditional case), and it follows the χ² distribution described above. Note that I must be measured in nats. The nat is a unit of information quantity analogous to the bit, but computed with the natural logarithm rather than log base 2: in nats, the information content of an outcome x is I(x)=−ln p(x).
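As a sketch of the analytical route for the discrete estimator (assuming SciPy is available; the function name and example values are illustrative), the p-value is the upper-tail probability of the χ² distribution evaluated at 2N×I:

```python
from scipy.stats import chi2

def chi2_p_value_discrete(mi_nats, n_samples, alphabet_x, alphabet_y, alphabet_z=None):
    """Analytical p-value for the discrete (plug-in) estimator via the chi-squared approximation."""
    dof = (alphabet_x - 1) * (alphabet_y - 1)
    if alphabet_z is not None:                 # conditional MI: multiply by |A_Z|
        dof *= alphabet_z
    statistic = 2.0 * n_samples * mi_nats      # 2N x I, with I in nats
    return float(chi2.sf(statistic, dof))      # P(chi-squared >= statistic)

# Example: I(X;Y) = 0.01 nats from N = 1000 samples of two 4-symbol variables
print(chi2_p_value_discrete(0.01, 1000, 4, 4))
```

For the Linear-Gaussian estimator, the same statistic is used with dim(X) × dim(Y) degrees of freedom instead.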
3.3 A Methodological Aside: Normalizing Measurements for Bias Correction
Beyond testing for statistical significance, the surrogate distribution provides another valuable utility: it allows us to calculate and correct for estimation bias. Bias refers to the systematic tendency of an estimator to produce a non-zero value from finite samples, even when the true underlying value is zero.
The mean of the surrogate distribution, denoted as ⟨I(X;Ys)⟩, represents exactly this expected estimation bias under the null hypothesis of independence. Some researchers use this insight to create a "normalized" or bias-corrected estimate of Mutual Information, In(X;Y). The formula is a simple subtraction:
In(X;Y) = I(X;Y) − ⟨I(X;Ys)⟩
The purpose of this procedure is to remove the baseline "noise floor" component that arises purely from having a finite sample size. The resulting In(X;Y) is often considered a more accurate point estimate of the true information shared between the variables.
It is important to note, however, that this step is not always necessary. If the chosen estimator is already designed to be inherently bias-corrected (for example, the KSG estimator aims to achieve this), performing this manual subtraction can be redundant. This normalization technique is most valuable when working with estimators that are known to have a positive bias, such as the basic discrete "plug-in" estimator.
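Given the surrogate MI values from a permutation test, the bias-corrected estimate is a one-line subtraction (a sketch using the notation above; the function name is ours):

```python
import numpy as np

def bias_corrected_mi(mi_observed, mi_surrogates):
    """In(X;Y) = I(X;Y) minus the mean of the surrogate (null) distribution."""
    return float(mi_observed - np.mean(mi_surrogates))

# e.g. with the values from the permutation-test sketch in Section 3.1:
# mi_corrected = bias_corrected_mi(mi_observed, mi_surrogates)
```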
3.4 Methodological Trade-offs: A Comparative Analysis
Analytical Approach
Pro: It is far faster than empirical generation, making it ideal for exploratory analyses or situations involving millions of calculations (e.g., all-pairs analysis in brain imaging data).
Con: The approximation is only exact asymptotically (as the number of samples N approaches infinity). For finite datasets, it can be inaccurate, especially when analyzing highly multivariate data, or when using the discrete estimator with heavily skewed distributions.
Empirical Approach
Pro: It is generally more accurate for finite sample sizes and complex, non-standard data distributions, as it derives the null distribution directly from the properties of the data itself.
Con: It is computationally intensive, as it requires recalculating the information-theoretic measure hundreds or thousands of times. It is best reserved for final, high-stakes analyses where accuracy is paramount.
While these methods provide a solid foundation for statistical testing, applying them to time series data introduces a critical complication that requires specific methodological adjustments.
4. A Critical Complication for Time Series: The Impact of Autocorrelation
Time series data requires special consideration because its samples are often not independent. Many time series exhibit autocorrelation, meaning the value of a signal at one point in time is correlated with its values at previous time points. The presence of significant autocorrelation violates a key assumption underlying many statistical estimators and significance tests, demanding a specific corrective approach.
4.1 How Autocorrelation Violates Core Assumptions
Autocorrelation introduces several distinct problems into the analysis pipeline:
Loss of Sample Independence: Both information-theoretic estimators and the significance tests themselves often assume that data samples are independent. Autocorrelation directly violates this assumption, meaning the effective number of independent samples is lower than the total number of samples, which can invalidate test results that rely on sample count (N).
Artificial Inflation of MI: For neighbor-based estimators like the Kraskov-Stögbauer-Grassberger (KSG) estimator, autocorrelation can be particularly problematic. Because adjacent samples in time are similar, they artificially inflate the number of nearby neighbors in the joint data space. This can cause the estimator to underestimate the distances to neighbors, leading to an overestimation of the final MI value.
Invalidation of Significance Tests: The standard empirical test (via permutation) generates a null distribution by creating surrogates that have the same marginal distribution as the original data but are, by design, not autocorrelated. This creates an invalid test, as the MI statistic is calculated from autocorrelated data, while the null distribution it's compared against is derived from non-autocorrelated data, violating the core principle of a null hypothesis test.
4.2 A Robust Solution: Dynamic Correlation Exclusion
The standard and most robust method to control for the effects of autocorrelation, particularly for KSG-type estimators, is Dynamic Correlation Exclusion. This technique is also known as a Theiler window or serial correlation exclusion.
The mechanism is straightforward: when the algorithm searches for the k-nearest neighbors for a given sample point, it is instructed to explicitly ignore any other samples that are too close in time. This prevents the artificially close, autocorrelated neighbors from biasing the MI estimate.
Implementing this correction involves a clear, data-driven process:
Plot the autocorrelation function for each time series involved in the analysis. This plot reveals the correlation of the signal with itself at increasing time lags. From this plot, determine the "autocorrelation length"—the time lag at which the autocorrelation drops to an insignificant level.
Identify the maximum autocorrelation length across all variables being analyzed. This ensures that the correction is sufficient for the "slowest" or most persistent variable.
Set the estimator's exclusion parameter to this maximum length. For example, in the JIDT toolkit, this parameter is DYN_CORR_EXCL.
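The sketch below illustrates this procedure on synthetic data, with an AR(1) process standing in for a real autocorrelated series; the 1/e cutoff used to define the autocorrelation length is one common heuristic rather than a fixed rule:

```python
import numpy as np

def autocorrelation_length(x, threshold=np.exp(-1)):
    """First lag at which the normalized autocorrelation drops below the threshold (1/e by default)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    acf = np.correlate(x, x, mode="full")[x.size - 1:]   # lags 0, 1, 2, ...
    acf = acf / acf[0]
    below = np.flatnonzero(acf < threshold)
    return int(below[0]) if below.size else x.size - 1

rng = np.random.default_rng(2)
n = 2000
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = 0.9 * x[t - 1] + rng.normal()        # strongly autocorrelated AR(1) series
y = rng.normal(size=n)                           # white noise, for comparison

# Take the maximum autocorrelation length across all variables
theiler_window = max(autocorrelation_length(x), autocorrelation_length(y))
print(theiler_window)
# Pass this value to the estimator's exclusion parameter,
# e.g. the DYN_CORR_EXCL property of a KSG calculator in JIDT.
```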
By correctly controlling for autocorrelation, the analyst ensures that the MI estimate is not artificially inflated and that the assumptions of the statistical framework remain valid. This correction is a crucial step before proceeding to the final practical consideration: ensuring the analysis is built on a sufficient amount of data.
5. Practical Considerations: Undersampling and Data Requirements
Beyond statistical significance and autocorrelation, a third critical issue is undersampling, which occurs when data is insufficient to reliably estimate the required probability distributions. This issue manifests differently depending on the estimator, resulting in distinct failure modes.
5.1 Discrete Estimators: The Risk of Catastrophic Failure
For discrete, or "plug-in," estimators that rely on counting occurrences in bins, undersampling leads to a catastrophic failure. If the number of samples is too small relative to the number of possible joint states (bins), many bins in the joint probability distribution will contain zero samples. This makes it impossible to reliably estimate the joint probability distribution p(x,y), rendering the resulting MI calculation meaningless.
Heuristic for Data Sufficiency: To avoid this failure mode, the number of samples (N) must be significantly larger than the number of possible joint state configurations. A common minimum guideline is N≥3×(∣AX∣×∣AY∣). For a robust estimate, 10 times this amount or more is preferable.
Implication: The required number of samples grows multiplicatively as variables become more fine-grained (larger alphabet size ∣A∣) or more multivariate. An analysis that is well-supported with two binary variables may become severely undersampled if those variables are discretized into ten bins each.
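A quick check of this heuristic (a sketch; the function name is ours) makes the multiplicative growth in data requirements explicit:

```python
def min_samples_discrete(alphabet_x, alphabet_y):
    """Bare-minimum sample count under the N >= 3 * |A_X| * |A_Y| heuristic."""
    return 3 * alphabet_x * alphabet_y

for bins in (2, 10):
    n_min = min_samples_discrete(bins, bins)
    # roughly 10x the minimum is preferable for a robust estimate
    print(f"{bins} bins per variable: at least {n_min} samples, ideally {10 * n_min} or more")
```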
5.2 Continuous Estimators: The Degradation of Sensitivity
For continuous, neighbor-based estimators like KSG, the issue of undersampling manifests as a degradation of sensitivity. The KSG estimator is designed to adapt to local data density by adjusting its effective resolution (i.e., the distance to the k-th nearest neighbor). This adaptive mechanism prevents the catastrophic failure seen in discrete estimators. However, this adaptation comes at a cost.
The Trade-off: If the data is sparse, either because there are too few samples (N) or the dimensionality is too high, the KSG estimator must use a larger effective resolution to find its neighbors.
Implication: By operating at a coarser resolution, the estimator may miss subtle relationships that only exist on a smaller spatial scale. The algorithm will still produce a value, but its ability to detect fine-grained dependencies will be compromised. Pushing the dimensionality of an analysis too far with a fixed amount of data will inevitably reduce the reliability and sensitivity of the results.
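To make this concrete, the sketch below (synthetic uniform data, assuming SciPy) tracks how far a KSG-style search must reach to find its k-th nearest neighbour as dimensionality grows while the sample size stays fixed; larger distances mean a coarser effective resolution:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
n, k = 1000, 4                                   # fixed sample size, KSG-style neighbour count

for dim in (2, 4, 8, 16):
    data = rng.uniform(size=(n, dim))
    tree = cKDTree(data)
    dists, _ = tree.query(data, k=k + 1)         # column 0 is the point itself
    print(f"dim={dim:2d}  median distance to {k}th neighbour = {np.median(dists[:, -1]):.3f}")
```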
6. Synthesis: A Framework for Best Practices
Calculating a raw information-theoretic value is only the first step in a sound analysis. Moving from that number to a robust, defensible conclusion requires a multi-faceted approach that carefully considers statistical significance, the inherent structure of the data, and the properties of the chosen estimators. By integrating null hypothesis testing, correcting for data-specific artifacts like autocorrelation, and ensuring data sufficiency, analysts can harness the full power of these measures while avoiding common pitfalls.
The following checklist distills the core principles of this framework into a set of actionable best practices for any researcher or data scientist working with information-theoretic measures on time series data.
Always Test for Significance: Never interpret a raw, non-zero MI value as evidence of a relationship without performing a formal statistical significance test. An observed value may be entirely consistent with the null hypothesis of independence due to finite sampling effects.
Choose the Right Null Distribution Method: Use the fast analytical (χ2) method for initial exploration with compatible estimators (Linear-Gaussian, discrete). For final, publication-quality conclusions, use the more accurate empirical (resampling) method, especially with complex data or non-analytical estimators like KSG where the χ2 approximation does not apply.
Check for Autocorrelation: Before beginning the analysis, always plot the autocorrelation function for your time series data. This will allow you to quantify the temporal dependencies and determine the autocorrelation length for each variable.
Control for Autocorrelation: If significant autocorrelation is present, you must use a corrective technique. For KSG estimators, this means applying a Dynamic Correlation Exclusion (Theiler window) by setting the appropriate parameter (e.g., DYN_CORR_EXCL) to the maximum observed autocorrelation length.
Assess Data Sufficiency: Use established heuristics to evaluate if you have enough samples for your chosen number of variables and discretization levels. Be mindful of the curse of dimensionality; increasing the number of variables without a massive, corresponding increase in sample size will reduce statistical power and reliability.