📐 Estimators
1. Introduction
Information theory provides a powerful, model-free lens for analyzing the intricate dynamics of complex systems. By quantifying uncertainty and shared information, these methods allow us to uncover relationships that traditional linear statistics might miss. However, a significant challenge arises when moving from theoretical concepts to empirical application: the robust estimation of information-theoretic quantities from continuous-valued, real-world data. Unlike discrete data, where probabilities can be estimated by simple counting, continuous data requires the estimation of underlying probability density functions, a non-trivial task fraught with potential pitfalls.
This document will first establish the foundational concepts of estimation, including bias and variance, and introduce differential entropy as the continuous counterpart to Shannon entropy. We will then provide a detailed analysis of key estimator types, moving from simple linear models to the state of the art in non-linear estimation. Finally, we will bring these threads together in a recommended workflow to help practitioners select the appropriate estimator for their analytical needs.
2. Estimating Information from Continuous Data
Before applying any computational tool, it is strategically important to understand the fundamental principles of estimation. A critical distinction must be made at the outset: all information-theoretic measures computed from finite, empirical data are estimates of a true underlying value, not the true value itself. This distinction has profound implications for the interpretation of results, as every estimate is subject to potential errors in the form of bias and variance.
2.1 The Nature of Estimation: Bias and Variance
When we work with a finite data sample, we are attempting to infer properties of a larger, underlying process. The value we compute, such as the estimated entropy H^, is an approximation of the true, unobservable value H. This process introduces two key types of error:
Bias: The expected difference between our estimated value and the true value, often expressed as B = E[Ĥ] − H. It represents a systematic error in one direction.
Variance: The range or spread of estimates one would expect to see across different finite realizations of the same data-generating process, expressed as Var(Ĥ). It represents the random error or instability of the estimate.
A crucial insight is that both bias and variance are typically more pronounced when an estimate is made from a limited amount of data. For information-theoretic measures, entropy tends to be systematically underestimated, while mutual information tends to be systematically overestimated.
2.2 The Simplest Case: The Plugin Estimator for Discrete Data
For discrete data, where variables take on a finite set of distinct symbols or categories, the most straightforward estimation method is the plugin estimator, also known as the maximum likelihood estimator. Its mechanism is intuitive.
Example: Rolling a Fair Die
Imagine a standard six-sided die. A fair die has maximum uncertainty (entropy) because each of the six outcomes is equally likely. However, if we only roll it 12 times, we are very unlikely to see each face appear exactly twice. We might see '5' appear four times and '2' appear zero times. When we calculate the entropy from this small sample, the uneven distribution of outcomes will lead to an estimated entropy that is lower than the true maximum entropy. We have underestimated the die's true uncertainty.
This "count-and-plug-in" approach can be summarized as:
We compute the probability of each symbol, x_j, by counting its frequency of occurrence (n_j) out of the total number of samples (N), such that p̂(X = x_j) = n_j / N.
We then "plug" these estimated probabilities directly into the relevant information-theoretic equations, such as the formula for Shannon entropy.
This systematic underestimation of entropy has been analytically derived. To first order, the expected bias (B) of the plugin entropy estimator, with entropy measured in bits, is:
$$B \approx -\frac{|A_x| - 1}{2N\ln 2}$$
Where |A_x| is the "alphabet size" (the number of possible symbols, which is 6 for our die) and N is the number of samples; the ln 2 factor simply converts nats to bits. This formula confirms our intuition: the bias is negative (an underestimation), and it shrinks as the number of samples (N) increases.
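To make this concrete, here is a minimal sketch (not from the source) that applies the count-and-plug-in recipe to the die example and compares the observed bias with the first-order prediction above. The helper name plugin_entropy and the use of NumPy are our own choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def plugin_entropy(samples, alphabet_size):
    """Plugin (maximum likelihood) entropy estimate, in bits."""
    counts = np.bincount(samples, minlength=alphabet_size)
    p = counts / counts.sum()      # p(X = x_j) = n_j / N
    p = p[p > 0]                   # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

N, A = 12, 6                       # 12 rolls of a fair six-sided die
true_H = np.log2(A)                # true entropy: log2(6) ~ 2.585 bits
estimates = [plugin_entropy(rng.integers(0, A, size=N), A) for _ in range(20_000)]

empirical_bias = np.mean(estimates) - true_H
predicted_bias = -(A - 1) / (2 * N * np.log(2))   # first-order bias formula, in bits

print(f"mean plugin estimate: {np.mean(estimates):.3f} bits (true: {true_H:.3f})")
print(f"empirical bias      : {empirical_bias:+.3f} bits")
print(f"first-order bias    : {predicted_bias:+.3f} bits")
print(f"spread (std. dev.)  : {np.std(estimates):.3f} bits")
```

With only 12 rolls the mean estimate sits below the true value, exactly as the die example describes, and the spread of the 20,000 repetitions illustrates the variance.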
2.3 The Challenge of Continuous Data: Binning
The fundamental problem with continuous data is that one cannot simply count occurrences. The most direct way to adapt our discrete method is through binning, a process that partitions the continuous range into a set of discrete bins.

Example: Measuring Daily Temperature
Consider measuring daily high temperatures. We could create bins such as [20-25°C), [25-30°C), etc. This allows us to count how many days fall into each category and use the plugin estimator. However, this approach has severe drawbacks.
Sensitivity: Imagine two days with temperatures of 24.9°C and 25.1°C. If our bin boundary is exactly at 25.0°C, these two almost identical days are forced into different categories, arbitrarily distorting the data's structure. The final result becomes highly dependent on where we choose to draw the lines, as the short numerical sketch after this list illustrates.
Loss of Subtleties: Within the [20-25°C) bin, a day at 20.1°C is treated as identical to a day at 24.9°C. This smoothing effect erases the fine-grained variations in the data and can easily obscure subtle but important non-linear patterns.
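To make the sensitivity drawback concrete, here is a small illustrative sketch; the temperature values and bin edges are toy choices of ours. Shifting every bin edge by a fraction of a degree changes which bins the 24.9°C and 25.1°C days fall into, and with them the estimated entropy.

```python
import numpy as np

temps = np.array([24.9, 25.1, 21.0, 24.8, 22.3, 25.3])  # toy daily highs in °C

edges_a = np.array([20.0, 25.0, 30.0])   # bins [20-25), [25-30)
edges_b = edges_a + 0.2                  # same bin width, edges shifted by 0.2 °C

def binned_plugin_entropy(x, edges):
    """Bin the continuous data, then apply the discrete plugin estimator (bits)."""
    counts, _ = np.histogram(x, bins=edges)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(f"edges at 20/25/30 °C : {binned_plugin_entropy(temps, edges_a):.3f} bits")
print(f"edges shifted 0.2 °C : {binned_plugin_entropy(temps, edges_b):.3f} bits")
```

The same six measurements yield noticeably different entropy estimates purely because of where the boundaries were drawn.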
2.4 A Natural Approach: Differential Entropy
The limitations of binning motivate the need for a more sophisticated and theoretically sound approach that handles continuous variables in their natural domain. This leads to the concept of differential entropy, the continuous counterpart to Shannon entropy. Instead of summing over discrete probabilities, it integrates over a continuous probability density function, f(x):
$$H_D(X) = -\int_{S_x} f(x)\,\log f(x)\,dx$$
Here, f(x) is the probability density function of the continuous variable X, and the integral is taken over the domain Sx where f(x)>0. This integral is the natural counterpart to the summation in Shannon's formula. But where does this definition come from, and how does it relate to our previous discussion of binning? The connection is revealed by examining what happens to the Shannon entropy of binned data, HΔ(X), as we make the bin size Δ infinitesimally small.
For a very small bin of width Δ, the probability of a sample falling within it can be approximated by p ≈ f(x)Δ. If we substitute this into the standard Shannon entropy formula and take the limit as Δ approaches zero, the sum splits into two distinct components. One component converges to the integral above, while the other becomes a diverging term that depends only on the bin size. This gives us the crucial bridge between the discrete and continuous worlds:
$$H_\Delta(X) \approx H_D(X) - \log \Delta$$
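For readers who want the intermediate steps, the limiting argument described above can be written out as follows (assuming f is smooth enough that p_i ≈ f(x_i)Δ within bin i):

$$
\begin{aligned}
H_\Delta(X) &= -\sum_i p_i \log p_i \;\approx\; -\sum_i f(x_i)\,\Delta\,\log\!\big(f(x_i)\,\Delta\big)\\
&= -\sum_i f(x_i)\log f(x_i)\,\Delta \;-\; \log\Delta \sum_i f(x_i)\,\Delta ,
\end{aligned}
$$

where, as Δ → 0, the first sum converges to the differential entropy integral and the factor multiplying −log Δ converges to ∫ f(x) dx = 1, leaving H_D(X) − log Δ.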
This relationship provides a profound interpretation. It shows that the entropy of the binned data (H_Δ(X)) is the sum of two parts:
A finite, meaningful component (H_D(X)): This is the differential entropy, which captures the uncertainty arising from the shape of the distribution, independent of the measurement resolution.
A resolution-dependent, diverging component (−log Δ): This term goes to infinity as Δ → 0, reflecting the intuitive idea that specifying a truly continuous variable requires an infinite amount of information.
Differential entropy, therefore, is the part of the uncertainty we can meaningfully work with once the infinite, resolution-dependent part has been mathematically separated out. This structure gives rise to its unique properties:
Scaling a Variable Changes Its Entropy: Unlike its discrete counterpart, scaling a continuous variable by a factor a adds a constant term to its entropy: H_D(aX) = H_D(X) + log a. This means the entropy value is not an absolute quantity but depends on the units of measurement.
Entropy Can Be Negative: A direct consequence of the scaling property is that differential entropy can be negative. This is perhaps the most striking difference from Shannon entropy, which is always non-negative.
It is a Relative Measure of Uncertainty: Because of these properties, differential entropy cannot be interpreted as an absolute measure of uncertainty. Instead, it must be understood as a relative measure. A negative value simply means the distribution is more concentrated, or less uncertain, than a reference distribution (e.g., the uniform distribution on the unit interval [0, 1], which has a differential entropy of zero).
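As a quick worked example of these properties (ours, not from the source), consider the uniform distribution on [0, a]:

$$
H_D(X) = -\int_0^a \frac{1}{a}\,\log\frac{1}{a}\,dx = \log a .
$$

With log base 2, a = 1 gives zero entropy (the reference case), a = 2 gives +1 bit, and a = 1/2 gives −1 bit: halving the support is a scaling by a factor of 1/2, which subtracts one bit and produces a perfectly legitimate negative value for a more concentrated distribution.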
2.5 Mutual Information: A Robust and Intuitive Measure
Fortunately, despite the interpretive complexities of differential entropy, the situation for mutual information is far more straightforward and robust. Mutual information for continuous variables is constructed from the same sums and differences of entropy terms as its discrete counterpart:
$$I(X;Y) = H_D(X) + H_D(Y) - H_D(X,Y)$$
The crucial insight is that when these differential entropy terms are combined, their problematic properties systematically cancel out. For example, the additive terms that arise from scaling the variables are eliminated in this sum-and-difference structure. As a result, the continuous mutual information measure retains all the desirable and intuitive properties of discrete mutual information:
It is always non-negative: I(X;Y)≥0.
It is invariant to the scaling of the variables: I(aX; bY) = I(X; Y). Changing the units or scale of X or Y does not change the information they share (see the short derivation at the end of this section).
This robustness makes mutual information the primary and most reliable tool for analyzing relationships in continuous data. The core challenge, therefore, shifts away from theoretical interpretation and back to a practical implementation problem: how to accurately estimate the underlying probability density functions needed for these calculations, without resorting to the flawed method of binning.
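Before turning to estimation, it is worth writing out the cancellation behind the scale-invariance property (for scaling factors a, b > 0, using the scaling rule from Section 2.4 together with its joint analogue, H_D(aX, bY) = H_D(X, Y) + log a + log b):

$$
\begin{aligned}
I(aX; bY) &= H_D(aX) + H_D(bY) - H_D(aX, bY)\\
&= \big(H_D(X) + \log a\big) + \big(H_D(Y) + \log b\big) - \big(H_D(X,Y) + \log a + \log b\big)\\
&= H_D(X) + H_D(Y) - H_D(X,Y) \;=\; I(X;Y).
\end{aligned}
$$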
3. A Taxonomy of Estimators for Continuous Data
Having established that mutual information provides a robust theoretical framework, the core challenge shifts to a practical one: how to accurately estimate the underlying probability density functions (PDFs) from finite data. This section dissects the primary methods for this task, presenting a hierarchy of estimators that range from simple, assumption-laden models to sophisticated, model-free techniques. Each presents a different set of trade-offs between computational speed, accuracy, and the types of relationships it can uncover.
3.1 Method 1: The Gaussian Model
Before introducing the Gaussian estimator, it is essential to understand its foundational concepts from linear statistics: covariance and correlation. These measures quantify the linear relationship between variables and form the core of the Gaussian model.
Covariance: Measures the joint variability of two variables. A positive covariance indicates that the variables tend to move in the same direction, while a negative covariance indicates they move in opposite directions.
Correlation Coefficient (ρ): A normalized version of covariance, ranging from -1 to +1. It measures not only the direction but also the strength of the linear relationship between two variables. A value of 1 or -1 signifies a perfect linear relationship, while 0 signifies no linear relationship.
Covariance Matrix (Ωx): For a set of d variables (a multivariate variable X), the covariance matrix is a d×d matrix that contains the pairwise covariances between all variables. The diagonal elements are the variances of each variable, and the off-diagonal elements are the covariances between pairs.
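As a brief illustrative sketch (ours, not from the source), these three quantities map directly onto standard NumPy routines:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 0.8 * x + 0.6 * rng.normal(size=1000)   # y co-varies with x, plus independent noise

omega = np.cov(x, y)              # 2x2 covariance matrix: variances on the diagonal,
print(omega)                      # covariances off the diagonal
rho = np.corrcoef(x, y)[0, 1]     # normalized to the range [-1, +1]
print(f"correlation coefficient rho = {rho:.2f}")
```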
With these concepts in place, we can now define the Gaussian model estimator. It operates on a strong, simplifying assumption: that the data follows a multivariate Gaussian (or "normal") distribution. This assumption allows for a direct, analytical calculation of differential entropy:
$$H_D(X) = \frac{1}{2}\log\!\big((2\pi e)^d\,|\Omega_x|\big)$$
Here, d is the number of dimensions (variables), and ∣Ωx∣ is the determinant of the covariance matrix. This powerful formula connects the geometric volume of the data's scatter plot (represented by the determinant) to its information-theoretic uncertainty.

From this, all other measures, such as mutual information, can be derived. For a univariate Gaussian X with standard deviation σ, the entropy reduces to:
$$H_D(X) = \frac{1}{2}\log\!\big(2\pi e\,\sigma^2\big)$$
The most crucial result is for the mutual information between two univariate Gaussian variables, X and Y, which is explicitly linked to their Pearson correlation coefficient (ρ):
$$I(X;Y) = -\frac{1}{2}\log\!\big(1-\rho^2\big)$$
This formula elegantly demonstrates that under Gaussian assumptions, mutual information is a direct and monotonic function of the squared linear correlation.
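The formula above translates into a few lines of code. The sketch below is our own (assuming NumPy): it estimates mutual information in bits from the sample correlation, and also previews the limitation discussed under Cons below, where a strong U-shaped dependence yields a near-zero estimate.

```python
import numpy as np

def gaussian_mi(x, y):
    """Gaussian-model MI estimate in bits: I = -0.5 * log2(1 - rho^2)."""
    rho = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log2(1.0 - rho**2)

rng = np.random.default_rng(7)
x = rng.normal(size=5000)
y_linear = x + 0.5 * rng.normal(size=5000)       # linear dependence
y_ushape = x**2 + 0.5 * rng.normal(size=5000)    # strong but purely non-linear dependence

print(f"linear relationship  : {gaussian_mi(x, y_linear):.3f} bits")
print(f"U-shaped relationship: {gaussian_mi(x, y_ushape):.3f} bits  (near zero)")
```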
Pros: This method is extremely fast and parameter-free, making it an excellent first-pass analysis tool to quickly identify the presence and strength of linear relationships.
Cons: It will only detect the linear component of any interaction. If a strong non-linear relationship exists (e.g., a U-shape), the Gaussian model will report a mutual information value near zero, completely missing the underlying dynamics. It provides a lower bound on the true mutual information.
3.2 Method 2: Box-Kernel Estimation
The box-kernel estimator is a model-free approach that moves beyond the rigid assumptions of the Gaussian model by estimating the local probability density directly from the data. Before diving into its mechanics, it is crucial to clarify a common point of confusion: the term "kernel" here refers to a function used in Kernel Density Estimation (KDE), not the kernel function used in machine learning methods like Support Vector Machines. In KDE, a kernel is a weighting function centered on each data point to estimate local density; in machine learning, it is typically used to compute dot products in a higher-dimensional feature space.
The operational principle of the box-kernel estimator is to build up a density estimate by examining the neighborhood around each sample point. This is a two-step process:
Step 1: Estimate Local Probability Mass (p^)
For each sample point (x_n, y_n), we estimate the local probability mass, p̂(x_n, y_n), by calculating the fraction of other points in the dataset that fall within a defined neighborhood. This is done using a kernel function, Θ, and a fixed resolution, or kernel width, r:
$$\hat{p}(x_n, y_n) = \frac{1}{N}\sum_{i} \Theta\!\big(\,\big|(x_i, y_i) - (x_n, y_n)\big| - r\,\big)$$
Let's break this formula down:
The summation (∑i) is over all points i in the dataset.
The kernel Θ is the box kernel, which acts like a simple switch. By default, it returns 1 if its argument is less than or equal to zero, and 0 otherwise.
The norm |···| is the maximum norm (Chebyshev distance). This means we are defining the neighborhood as a square or hyper-box, not a circle.
Putting it together: The expression |(x_i, y_i) − (x_n, y_n)| − r ≤ 0 is true only if the point (x_i, y_i) is within a box of half-width r around our central point (x_n, y_n). Therefore, the formula simply counts how many points are inside the box and divides by the total number of points, N.
Step 2: Convert to a Probability Density Estimate (f^)
The value p̂ represents a probability mass within a region. To convert it to a probability density, we must divide it by the volume (or area in 2D) of that region. For a 2D box of half-width r, the area is (2r)². This gives us the final density estimate:
$$\hat{f}(x_n, y_n) = \frac{\hat{p}(x_n, y_n)}{(2r)^2}$$
Intuitively, for a point in a dense region, many neighbors fall within the half-width r, leading to a high p̂ and thus a high density f̂. For a point in a sparse region, few neighbors are inside, resulting in a low density estimate.

Step 3: Compute the Final Entropy from the Density Estimates
Once we have a density estimate f̂(x_n) for each point in our dataset, how do we arrive at a single entropy value for the entire variable? The theoretical definition of differential entropy is an integral, H(X) = −∫ f(x) log f(x) dx, which represents the expected value of −log f(x) over the entire distribution. In practice, with a finite set of samples, we approximate this expectation by averaging over the samples. This gives us the final computational formula for the entropy:
$$\hat{H}(X) = -\frac{1}{N}\sum_{n=1}^{N} \log \hat{f}(x_n)$$
This is the key insight: the integral is the theoretical definition, but the summation is its practical, sample-based approximation. We calculate the "information content" (−logf^) at each observed data point and then average these values to get our final estimate of the overall entropy.
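Putting the three steps together, here is a minimal, brute-force sketch (our own, assuming NumPy; the name box_kernel_entropy is ours, and a practical implementation would use spatial indexing rather than a full distance matrix). It estimates the entropy of a d-dimensional sample with the box kernel and then combines entropies into a mutual information estimate.

```python
import numpy as np

def box_kernel_entropy(data, r):
    """Box-kernel differential entropy estimate (nats) for an (N, d) sample array."""
    data = np.atleast_2d(data)
    N, d = data.shape
    # Chebyshev (max-norm) distance between every pair of points
    dists = np.max(np.abs(data[:, None, :] - data[None, :, :]), axis=-1)
    # Step 1: probability mass = fraction of points inside the box of half-width r
    p_hat = np.sum(dists <= r, axis=1) / N
    # Step 2: convert mass to density by dividing by the box volume (2r)^d
    f_hat = p_hat / (2.0 * r) ** d
    # Step 3: entropy = average information content, -<log f_hat>
    return -np.mean(np.log(f_hat))

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = x + 0.5 * rng.normal(size=1000)
r = 0.25

H_x = box_kernel_entropy(x[:, None], r)
H_y = box_kernel_entropy(y[:, None], r)
H_xy = box_kernel_entropy(np.column_stack([x, y]), r)
print(f"MI estimate at r={r}: {H_x + H_y - H_xy:.3f} nats")
```

Re-running this with different values of r illustrates the sensitivity discussed below: the estimate shifts with the chosen resolution.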
This estimator can be framed in terms of the analytical question it answers: "How does knowing x within a resolution r help me predict y within the same resolution r?"
Pros: Being model-free, it is capable of capturing non-linear relationships that the Gaussian model would miss.
Cons: The final estimate is highly sensitive to the choice of the kernel width r. A small r can lead to a noisy, undersampled estimate, while a large r can oversmooth the data and miss important features. This makes the results difficult to interpret reliably.
Cons: The method is known to be biased and is significantly less time-efficient than the Gaussian model.
While the box-kernel estimator is a crucial conceptual step into model-free estimation, its practical flaws motivate the need for a more advanced solution.
3.3 Method 3: The Kraskov (KSG) Nearest-Neighbor Estimator
The Kraskov, Stögbauer, and Grassberger (KSG) estimator is a benchmark method for model-free estimation that represents a significant leap forward from the box-kernel approach. It is specifically designed to calculate mutual information (MI) and conditional mutual information, and it directly addresses the primary weaknesses of the box-kernel method—its sensitivity to the radius r and its inherent bias—through two key innovations:
Dynamic Radius via Nearest-Neighbor Counting: Instead of using a fixed radius r for all points, the KSG estimator uses a dynamic radius for each point. This radius is sized to contain a fixed number (K) of nearest neighbors in the full joint space. This adaptive, K-nearest-neighbor (KNN) approach is a major advantage: in dense regions of the data space, the radius will be small, capturing fine-grained structure; in sparse regions, it will be larger, preventing undersampling.
Systematic Bias Correction: The method harnesses the underlying principles of Kozachenko-Leonenko entropy estimators and a clever counting scheme to systematically reduce bias. By using the same radius (determined in the joint space) to count neighbors in the marginal spaces, it ensures that the biases in the individual entropy terms cancel out as much as possible when they are combined to compute mutual information.

The KSG estimator comes in two algorithmic variants. The final calculation for MI using Algorithm 1 is given by:
$$I^{(1)}(X;Y) = \psi(K) - \big\langle \psi(n_x + 1) + \psi(n_y + 1) \big\rangle + \psi(N)$$
Where:
ψ is the digamma function, a standard mathematical function related to the logarithm.
K is the number of nearest neighbors, a user-defined parameter.
n_x and n_y are the number of points found within the dynamic radius in each of the marginal spaces.
⟨...⟩ denotes an average over all sample points.
N is the total number of samples.
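To ground the formula, here is a compact, illustrative sketch of Algorithm 1 (our own, using SciPy's KD-tree and digamma function; the name ksg_mi is ours, and details such as tie handling are glossed over). In practice one would typically reach for an established toolkit rather than hand-rolling this.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=4):
    """KSG Algorithm 1 mutual information estimate, in nats."""
    x, y = x.reshape(-1, 1), y.reshape(-1, 1)
    N = len(x)
    joint = np.hstack([x, y])

    # Dynamic radius: max-norm distance to the k-th nearest neighbour in the
    # joint space (k + 1 because the query point itself is returned first).
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]

    # Count marginal-space neighbours strictly inside that radius (minus self).
    x_tree, y_tree = cKDTree(x), cKDTree(y)
    n_x = np.array([len(x_tree.query_ball_point(x[i], eps[i] - 1e-12, p=np.inf)) - 1
                    for i in range(N)])
    n_y = np.array([len(y_tree.query_ball_point(y[i], eps[i] - 1e-12, p=np.inf)) - 1
                    for i in range(N)])

    # psi(K) - <psi(n_x + 1) + psi(n_y + 1)> + psi(N)
    return digamma(k) + digamma(N) - np.mean(digamma(n_x + 1) + digamma(n_y + 1))

rng = np.random.default_rng(11)
x = rng.normal(size=2000)
y = x**2 + 0.5 * rng.normal(size=2000)        # non-linear relationship
print(f"KSG MI estimate: {ksg_mi(x, y, k=4):.3f} nats")
```

Unlike the Gaussian model, this estimator reports substantial mutual information for the purely non-linear (quadratic) relationship in the example.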
A Critical Choice: Algorithm 1 vs. Algorithm 2
The KSG method comes in two main variants, which present a critical trade-off for the practitioner:
Algorithm 1: Has smaller variance but larger bias. This makes it the superior choice for statistical significance testing, where a stable, low-variance estimate is crucial for distinguishing a true signal from a null distribution.
Algorithm 2: Has larger variance but smaller bias. This makes it the preferred choice when the goal is to obtain the most accurate possible single value for the mutual information itself.
Either way, the KSG approach changes the analytical question to a more adaptive one: "How does knowing our value of X within the sample's K closest neighbors in the full joint space help me predict Y?"
Pros: It is model-free, bias-corrected, and considered "best of breed" in terms of data efficiency and accuracy. Unlike the Gaussian estimator, it makes no assumption about the underlying probability distribution, so this non-parametric method applies to a wide range of distributions. Its nearest-neighbour density estimation corrects for the systematic biases of traditional mutual information estimation, keeping estimates closer to the true values, and it is a powerful tool for detecting complex non-linear relationships even from limited data.
Pros: It is effectively parameter-free. Although it depends on the number of neighbours K, results are very stable for K ≥ 4, so precise tuning is unnecessary, a stark contrast to the kernel estimator's high sensitivity to r. It also provides accurate mutual information estimates from relatively sparse data.
Cons: It is substantially less time-efficient than the Gaussian model, though this can be mitigated with fast nearest-neighbor search algorithms.
A note on negative values: Due to its powerful bias-correction, the KSG estimator can occasionally return small negative values. This is not an error; it should be interpreted as being consistent with a true value of zero MI.
4. Conclusion
This chapter has bridged the critical gap between the abstract theory of information and its practical application to continuous, real-world data. By navigating the foundational hurdles of bias and variance, and by progressing through a hierarchy of estimators, from the rapid linear baseline of the Gaussian model to the robust, non-linear power of the KSG method, we have established a clear and principled workflow. The central takeaway is that the choice of estimator is not merely a technical step, but a strategic decision that defines the analytical question being asked. Armed with this understanding, practitioners are empowered to move beyond the limitations of simple correlation, enabling them to confidently dissect the complex, non-linear interaction architectures that govern systems ranging from financial markets and neuroscience to machine learning.