Shannon Entropy Calculator
Compute information entropy H(X) in bits, nats, or hartleys from any probability distribution. Includes efficiency, redundancy, perplexity, and step-by-step breakdown.
Probability Distribution
Enter values between 0 and 1. They should sum to 1 (auto-normalized if not).
Binary Entropy Table
H(p) = −p·log₂(p) − (1−p)·log₂(1−p) for p from 0.01 to 0.99
| p | 1−p | H(p) (bits) | −p·log₂(p) | −(1−p)·log₂(1−p) |
|---|---|---|---|---|
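The rows of this table can be regenerated from the formula above. The following is a minimal sketch; the helper name `binary_entropy_terms` and the particular values of p are illustrative choices, not taken from the calculator itself.

```python
import math

def binary_entropy_terms(p):
    """Return the two terms of H(p) in bits, for 0 < p < 1."""
    q = 1.0 - p
    return -p * math.log2(p), -q * math.log2(q)

print("| p | 1−p | H(p) (bits) | −p·log₂(p) | −(1−p)·log₂(1−p) |")
print("|---|---|---|---|---|")
# A handful of sample points; any p strictly between 0 and 1 works.
for p in (0.01, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99):
    first, second = binary_entropy_terms(p)
    print(f"| {p:.2f} | {1 - p:.2f} | {first + second:.4f} | {first:.4f} | {second:.4f} |")
```

H(p) is symmetric around p = 0.5, where it peaks at exactly 1 bit.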
What Is Shannon Entropy?
Shannon entropy, named after mathematician and electrical engineer Claude E. Shannon, is the foundational measure of information theory. Introduced in Shannon's landmark 1948 paper "A Mathematical Theory of Communication," it quantifies the average amount of uncertainty or surprise inherent in a random variable's possible outcomes.
Intuitively, entropy answers the question: "How unpredictable is this information source?" A source that always produces the same outcome (e.g., a biased coin that always lands heads) has zero entropy — no information is gained from observing it. A perfectly random source (e.g., a fair coin) has maximum entropy — each observation provides the most possible information.
The Formula: H(X) = −Σ pᵢ log(pᵢ)
For a discrete random variable X with n possible outcomes, each occurring with probability pᵢ (where Σpᵢ = 1), Shannon entropy is defined as:
H(X) = −Σ pᵢ · log(pᵢ), summed over i = 1, …, n
Convention: 0 · log(0) = 0 (by continuity, since lim x→0 x·log(x) = 0)
Each term −pᵢ·log(pᵢ) represents the contribution of outcome i to total entropy. This quantity is always non-negative: since 0 ≤ pᵢ ≤ 1, log(pᵢ) ≤ 0, and the leading minus sign flips it to a value ≥ 0. The sum over all outcomes gives the average uncertainty per observation.
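As a rough sketch of how the definition translates to code, the function below sums the per-outcome terms, skips zero probabilities (the 0 · log(0) = 0 convention), and rescales inputs that do not sum to 1. The name `shannon_entropy` and its `base` parameter are illustrative assumptions, not the calculator's actual implementation.

```python
import math

def shannon_entropy(probs, base=2.0):
    """Return H(X) = -sum(p * log(p)) for a discrete distribution."""
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    total = sum(probs)
    if total <= 0:
        raise ValueError("at least one probability must be positive")
    # Rescale so the values sum to 1.
    normalized = [p / total for p in probs]
    # Skipping p == 0 applies the convention 0 * log(0) = 0.
    return -sum(p * math.log(p, base) for p in normalized if p > 0)

print(round(shannon_entropy([0.5, 0.5]), 3))  # 1.0
print(round(shannon_entropy([0.9, 0.1]), 3))  # 0.469
```

Passing `base=math.e` or `base=10` gives the same quantity in nats or hartleys.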
Understanding the Formula Intuitively
The term −log₂(pᵢ) can be interpreted as the "surprise" or "information content" of observing outcome i. Rare events (small pᵢ) carry high information — seeing something unexpected tells you a lot. Common events (large pᵢ) carry little information — seeing the expected tells you little. Entropy is the probability-weighted average of these individual surprises.
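The "weighted average of surprises" reading can be checked numerically. This is a small sketch with a made-up four-outcome distribution; `surprisal_bits` is a hypothetical helper name.

```python
import math

def surprisal_bits(p):
    """Information content -log2(p) of an outcome with probability p > 0."""
    return -math.log2(p)

probs = [0.5, 0.25, 0.125, 0.125]  # illustrative distribution
for p in probs:
    print(f"p = {p:<5} surprise = {surprisal_bits(p):.1f} bits")

# Entropy is the probability-weighted average of the individual surprises.
entropy = sum(p * surprisal_bits(p) for p in probs)
print(f"H = {entropy:.2f} bits")  # H = 1.75 bits
```

The rarest outcomes carry 3 bits each, but because they occur only one time in eight, they contribute less to the average than their surprise alone would suggest.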
Base 2 (Bits) vs Base e (Nats) vs Base 10 (Hartleys)
The choice of logarithm base determines the unit of entropy but does not change the relative comparisons between distributions:
| Base | Unit | Common Use | Equivalent to 1 bit |
|---|---|---|---|
| 2 | bits (shannons) | Information theory, computer science, data compression | 1 bit |
| e ≈ 2.718 | nats (natural units) | Physics, statistical mechanics, machine learning | ≈ 0.6931 nats |
| 10 | hartleys (dits/bans) | Communications engineering, some cryptography | ≈ 0.3010 hartleys |
In computing and data compression, bits are the natural choice since binary digits (0 and 1) are the fundamental unit. In physics (Boltzmann entropy, thermodynamics) and in many machine learning frameworks (cross-entropy loss), nats are preferred because the natural logarithm has the simplest derivative and integral, which keeps calculus-based derivations clean.
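Converting between units is a single multiplication, since changing the logarithm base only rescales the result: 1 bit = ln 2 ≈ 0.6931 nats = log₁₀ 2 ≈ 0.3010 hartleys (equivalently, 1 nat ≈ 1.4427 bits and 1 hartley ≈ 3.3219 bits). A minimal sketch, with `convert_bits` as a hypothetical helper name:

```python
import math

def convert_bits(h_bits):
    """Express an entropy given in bits in nats and hartleys as well."""
    return {
        "bits": h_bits,
        "nats": h_bits * math.log(2),        # 1 bit = ln 2 nats
        "hartleys": h_bits * math.log10(2),  # 1 bit = log10(2) hartleys
    }

# Entropy of a fair six-sided die in all three units.
fair_die = convert_bits(math.log2(6))
print({unit: round(value, 4) for unit, value in fair_die.items()})
```

Because every unit is a constant multiple of the others, the ranking of distributions by entropy never depends on the base.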
Maximum Entropy and Minimum Entropy
Maximum entropy occurs when all outcomes are equally probable — a uniform distribution. For n equally likely outcomes, H_max = log(n). This is the most disordered, unpredictable state. For example:
- Fair coin (n=2): H_max = log₂(2) = 1 bit
- Fair die (n=6): H_max = log₂(6) ≈ 2.585 bits
- Deck of 52 cards (n=52): H_max = log₂(52) ≈ 5.700 bits
Minimum entropy is 0 bits, occurring when one outcome has probability 1 and all others have probability 0. A deterministic event provides no information — you already know what will happen. This is the most ordered, predictable state.
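Both extremes are easy to confirm numerically. The sketch below uses a hypothetical helper, `entropy_bits`, and compares the entropy of uniform distributions against log₂(n); the skewed three-outcome example is made up for illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

for name, n in [("fair coin", 2), ("fair die", 6), ("deck of 52 cards", 52)]:
    uniform = [1 / n] * n
    print(f"{name}: H = {entropy_bits(uniform):.3f} bits, log2(n) = {math.log2(n):.3f}")

# Any non-uniform distribution over the same outcomes falls below log2(n).
print(entropy_bits([0.5, 0.3, 0.2]) < math.log2(3))  # True

# A deterministic source (one outcome with probability 1) sits at the minimum.
print(abs(entropy_bits([1.0, 0.0, 0.0])))  # 0.0
```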
Entropy Efficiency and Redundancy
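Efficiency (also called normalized entropy or entropy ratio; Shannon's own term was relative entropy, which today more often refers to the Kullback–Leibler divergence) measures how close a distribution is to maximum entropy:
Efficiency = H(X) / H_max = H(X) / log(n)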
An efficiency of 100% means the distribution is perfectly uniform. An efficiency of 0% means the outcome is completely deterministic. Taking an estimate of about 1.0–1.5 bits per character for English text against a maximum of log₂(26) ≈ 4.70 bits for 26 letters gives an efficiency of roughly 21–32%, meaning natural language is highly redundant.
Redundancy is the complement of efficiency in absolute terms: Redundancy = H_max − H(X). It represents the "wasted capacity" — how far the actual entropy falls short of the theoretical maximum. High redundancy in language allows humans to understand speech even with missing words or background noise.
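Efficiency and redundancy both follow directly from H(X) and H_max = log(n). A minimal sketch in bits, assuming n is the number of outcomes listed (including any with probability 0); the function names are illustrative.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def efficiency_and_redundancy(probs):
    """Return (H / H_max, H_max - H) for a distribution over len(probs) outcomes."""
    h = entropy_bits(probs)
    h_max = math.log2(len(probs))
    # With a single outcome H_max is 0; report 0% efficiency in that case.
    efficiency = h / h_max if h_max > 0 else 0.0
    return efficiency, h_max - h

eff, red = efficiency_and_redundancy([0.8, 0.2])
print(f"efficiency = {eff:.1%}, redundancy = {red:.3f} bits")
```

For the biased coin (0.8, 0.2), this gives an efficiency of about 72.2% and a redundancy of about 0.278 bits.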
Perplexity
Perplexity is a related measure commonly used in natural language processing to evaluate language models. It is defined as the exponentiation of entropy:
Perplexity = 2^H (for H in bits) = e^H (for H in nats)
Perplexity can be interpreted as the "effective number of equally likely choices" at each step. A fair die has H = log₂(6) ≈ 2.585 bits, so perplexity = 2^2.585 ≈ 6, matching the number of faces. A language model with perplexity 50 is, on average, as uncertain as choosing between 50 equally likely words.
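The base used for the exponent must match the base used for the entropy; either pairing gives the same perplexity. A short sketch (the function name `perplexity` is an illustrative choice):

```python
import math

def perplexity(probs):
    """Perplexity of a distribution: exp of its entropy in nats."""
    h_nats = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(h_nats)

fair_die = [1 / 6] * 6
h_bits = -sum(p * math.log2(p) for p in fair_die)

print(round(perplexity(fair_die), 6))  # 6.0, via e^H with H in nats
print(round(2 ** h_bits, 6))           # 6.0, via 2^H with H in bits
```

A skewed distribution such as (0.9, 0.1) has a perplexity below 2, reflecting that it behaves like fewer than two equally likely choices.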
Applications of Shannon Entropy
Data Compression (Huffman Coding)
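Shannon's source coding theorem shows that no lossless compression scheme can achieve an average code length below H(X) bits per symbol, and that this bound can be approached arbitrarily closely by encoding long blocks of symbols. Huffman coding assigns shorter binary codes to more probable symbols and longer codes to less probable ones; it is optimal among prefix codes that encode one symbol at a time, with an average length L satisfying H(X) ≤ L < H(X) + 1, and it meets the bound exactly only when every probability is a power of 1/2. ZIP files (DEFLATE), JPEG images (the entropy-coding stage that follows quantization), and MP3 audio all rely on entropy-based compression principles.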
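To compare Huffman code lengths with entropy, the sketch below builds code lengths with a priority queue and reports the average length next to H(X). It is a compact illustration under simple assumptions (symbols are identified by index and only lengths are computed, not the codewords themselves); `huffman_code_lengths` is a hypothetical name.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return the Huffman code length in bits for each symbol index."""
    if len(probs) == 1:
        return [1]
    # Heap entries: (subtree probability, unique tie-breaker, symbols in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1  # each merge adds one bit to every symbol beneath it
        heapq.heappush(heap, (p1 + p2, next_id, s1 + s2))
        next_id += 1
    return lengths

def average_length(probs):
    return sum(p * n for p, n in zip(probs, huffman_code_lengths(probs)))

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

for probs in ([0.5, 0.25, 0.125, 0.125], [0.9, 0.1]):
    print(f"H = {entropy_bits(probs):.3f} bits, Huffman average = {average_length(probs):.3f} bits")
```

In the first distribution every probability is a power of 1/2, so the average length equals the entropy (1.75 bits). In the second, a symbol-by-symbol code cannot use fewer than 1 bit per symbol even though the entropy is about 0.469 bits; closing that gap requires coding blocks of symbols or using arithmetic coding.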
Cryptography and Security
In cryptography, entropy measures the unpredictability of keys and passwords. A 128-bit key generated from a uniform random source has 128 bits of entropy — it is maximally unpredictable. Human-chosen passwords typically have far lower entropy (research suggests 1–2 bits per character for natural language passwords), making them vulnerable to dictionary and brute-force attacks. Password strength estimators use entropy as a core metric.
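For secrets drawn uniformly at random, entropy is simply length × log₂(alphabet size). The sketch below applies only under that uniform-randomness assumption; it overestimates the entropy of human-chosen passwords, which are far from uniform. The function name is illustrative.

```python
import math

def uniform_secret_entropy_bits(alphabet_size, length):
    """Entropy of a secret whose symbols are chosen uniformly and independently."""
    return length * math.log2(alphabet_size)

print(uniform_secret_entropy_bits(2, 128))            # 128.0 (a 128-bit key)
print(round(uniform_secret_entropy_bits(26, 8), 1))   # 37.6 (8 random lowercase letters)
print(round(uniform_secret_entropy_bits(94, 12), 1))  # 12 random printable ASCII characters
```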
Machine Learning — Decision Trees
The ID3 and C4.5 decision tree algorithms use entropy to select the most informative feature at each node. Information gain is computed as the reduction in entropy after splitting the training data on a given feature: IG = H(parent) − Σ (|child|/|parent|) · H(child). ID3 splits on the feature with the highest information gain; C4.5 uses the gain ratio, which divides information gain by the entropy of the split itself to avoid favoring features with many distinct values. Either way, the tree tests the most discriminative features first.
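The information gain formula can be sketched directly from class labels. The toy labels and the two candidate splits below are invented for illustration; this is not a full ID3 implementation.

```python
import math
from collections import Counter

def entropy_bits(labels):
    """Entropy in bits of the class distribution in a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = H(parent) - sum(|child| / |parent| * H(child))."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy_bits(child) for child in children)
    return entropy_bits(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes perfectly removes all uncertainty.
perfect = [["yes"] * 4, ["no"] * 4]
# A split that leaves each child as mixed as the parent gains nothing.
useless = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect))  # 1.0
print(information_gain(parent, useless))  # 0.0
```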
Thermodynamics — Boltzmann Entropy
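Shannon entropy has the same mathematical form as the Gibbs entropy of statistical mechanics, S = −kB Σ pᵢ ln pᵢ, where kB is Boltzmann's constant and pᵢ is the probability of microstate i; the two differ only by the constant factor kB and the choice of logarithm base. When all W microstates consistent with a macrostate are equally likely, the Gibbs formula reduces to Boltzmann's S = kB ln W, so both quantities measure how many microstates are compatible with what is known about the system. This deep connection between information theory and physics means that entropy is not just an abstract mathematical concept: it has physical meaning in terms of heat, temperature, and the second law of thermodynamics.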
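As a short worked step, setting every microstate probability to pᵢ = 1/W in the formula S = −kB Σ pᵢ ln pᵢ recovers the familiar S = kB ln W, and comparing with H(X) measured in bits gives the conversion factor between the two quantities. Here W denotes the number of equally likely microstates, a symbol introduced for this derivation.

```latex
\begin{aligned}
S &= -k_B \sum_{i=1}^{W} p_i \ln p_i
   = -k_B \sum_{i=1}^{W} \frac{1}{W} \ln \frac{1}{W}
   = k_B \ln W, \\
S &= (k_B \ln 2)\, H(X) \quad \text{with } H(X) = -\sum_{i} p_i \log_2 p_i \text{ in bits.}
\end{aligned}
```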
Linguistics — Letter Frequency Analysis
English text has an entropy of approximately 1.0–1.5 bits per character when taking into account long-range correlations, compared to a maximum of log₂(26) ≈ 4.70 bits for 26 equally likely letters. This redundancy is exploited in spelling correction, automatic speech recognition, and natural language processing. Languages with more complex morphology (like Finnish or Turkish) exhibit different entropy profiles due to agglutination.
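A single-letter frequency estimate can be computed from any text sample, as sketched below. Note the assumption: this counts letters independently and ignores all correlations between neighbouring characters, so it yields a much higher figure than the 1.0–1.5 bits per character quoted above, and short samples give noisy estimates. The function name and sample sentence are illustrative.

```python
import math
from collections import Counter

def letter_entropy_bits(text):
    """Entropy in bits per letter from single-letter frequencies (a-z only)."""
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    if not letters:
        raise ValueError("text contains no letters a-z")
    n = len(letters)
    return -sum((c / n) * math.log2(c / n) for c in Counter(letters).values())

sample = "the quick brown fox jumps over the lazy dog"
h = letter_entropy_bits(sample)
print(f"{h:.2f} bits per letter (maximum log2(26) = {math.log2(26):.2f})")
```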
Reference: Common Entropy Values
| Distribution | Entropy (bits) | Efficiency |
|---|---|---|
| Deterministic (p=1) | 0.000 | 0% |
| Biased coin (0.9, 0.1) | 0.469 | 46.9% |
| Biased coin (0.8, 0.2) | 0.722 | 72.2% |
| Biased coin (0.7, 0.3) | 0.881 | 88.1% |
| Fair coin (0.5, 0.5) | 1.000 | 100% |
| Uniform 4 outcomes | 2.000 | 100% |
| Fair die (6 outcomes) | 2.585 | 100% |
| Uniform 8 outcomes | 3.000 | 100% |
| English letter frequencies (26 letters) | ≈ 4.18 | ≈ 89% |