Houjun Liu

SU-ENGR76 APR092024

joint entropy

Suppose we have two information sources; joint entropy is the measure of joint surprise when we are defined over more than one information source.

For a pair of random variables, \(X, Y\), their joint entropy is defined over a new random variable of \(X \cup Y\).

\begin{equation} H(x,y) = \sum_{i \in X}^{} \sum_{j \in Y}^{} P(X=i, Y=j) \log_{2} \qty(\frac{1}{P(X=i, Y=j)}) \end{equation}

If \(X \perp Y\), we can write \(H(x,y) = H(x)+H(y)\) (shown below.) Further, we have for all \(X\) and \(Y\), \(H(x,y) \leq H(x) + H(y)\) because you can never be more surprised than if you got two completely independent pieces of information.

joint entropy of independent events

Recall independence; so if \(X \perp Y\), we can write \(P(X=i, Y=j) = P(X=i) \cdot P(Y=j)\) and plug in and chug.

Now, plugging that in and applying log laws:

\begin{align} H(x,y) &= \sum_{i \in X}^{} \sum_{j \in Y}^{} P(X=i)P(Y=j) \qty(\log_{2} \frac{1}{P(X=i)}+\log_{2} \frac{1}{P(Y=j)}) \\ &= \sum_{i \in X}^{} \sum_{j \in Y}^{} P(X=i)P(Y=j) \log_{2} \frac{1}{P(X=i)}+ \sum_{i \in X}^{} \sum_{j \in Y}^{} P(X=i)P(Y=j) \log_{2} \frac{1}{P(Y=j)} \\ &= \sum_{i \in X}^{} P(X=i)\log_{2}\frac{1}{P(X=i)}\sum_{j \in Y}^{} P(Y=j) + \sum_{i \in X}^{}P(X=i) \sum_{j \in Y}^{} P(Y=j) \log_{2} \frac{1}{P(Y=j)} \\ &= \sum_{i \in X}^{} P(X=i)\log_{2}\frac{1}{P(X=i)} 1 + 1 \sum_{j \in Y}^{} P(Y=j) \log_{2} \frac{1}{P(Y=j)} \\ &= H(X) + H(Y) \end{align}

Where we can factor the \(P(Y=j)\) out and the \(P(X=j)\) out on the left and right sides because they don’t depend on the other argument in the summation. After which, recall \(\sum_{k}^{}P(K=k)=1\), so each factored out term reduces to \(1\); multiplying in, we result in the entropy expressions for each random variable.

Source Coding

A (source) code is a mapping from symbol set \(X\) to the set of binary sequence \(c(x)\) which corresponds to each \(x \in X\).

For lossless compression, we need a uniquely decodeable encoding.

non-singular code

for all \(x, x’ \in X\) where \(x \neq x’\), a non-singular code assigns \(c(x) \neq c(x’)\).

Though lossless compression requires non-singular code, having it alone does NOT guarantee lossless encoding. You will want an uniquely decodeable encoding instead. For instance:


Is a non-singular code, but 0110 could be ATTA or AGA or …

uniquely decodeable

a sequence being uniquely decodeable means that any possible source sequence \(x_1, …, x_{n}\) is mapped into a unique binary representation.

One simple scheme would be the Prefix-Free scheme, which is uniquely decodeable and instantaneously decodable. Not all uniquely decodeable codes need to be Prefix-Free.

every uniquely decodeable sequence is a non-singular code

Suppose an encoding is uniquely decodeable. We desire that it is also non-singular code. Consider sequences of length 1, each with one symbol in it. Because the encoding is uniquely decodeable, each encoding of these sequences is unique. Now we can assign that sequence encoding to the corresponding symbol.

instantaneously decodable

We do not need to perform lookaheads to decode the sequence; once we see a valid code we can decode it without lookahead.


A Prefix-Free codes is a coding scheme for which no codeword is a prefix of another codeword.

any prefix free code for a source has at least entropy length

any Prefix-Free code for some source \(X\) satisfies that \(\bar{l} \geq H(x)\), where \(\bar{l}\) is the Average Codeword Length of the resulting code and \(H(x)\) is the entropy of the source \(X\).

prefix-free code has the same codeword lengths as any code

If there exists a uniquely decodeable code with codeword lengths \(\{l_1, …, l_{n}\}\) from some source \(X\); then there exists a Prefix-Free version of this code which has exactly the same codeword lengths \(\{l_1, …, l_{n}\}\) — so really we only need to generate the most optimal Prefix-Free code for us to generate a good code.

Average Codeword Length

For a specific code, we have a quality \(\bar{\ell}\). This is a quantity which is a probability-scaled average number of bits to encode each source symbol.

Consider, then, this code:


Our average codeword length her, then is:

\begin{equation} \frac{1}{2} + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3 = 1.75 \end{equation}

which would be a bit better than using 2 bits to encode everything.


“communication” needs to happen by formalizing your data into a sequence of symbols \(x_1, x_2, \dots\), where \(x \in X\) in some symbol set. For instance, an image is a long sequence of pixels; words is a long sequence of letters, etc.

After which, we will need to convert our sequence of symbols into bits using some encoder; then, we will need some decoder to convert it back.


A DNA is a piece of information with four symbols: \(X \in \{A, T, C, G\}\).

The most trivial encoding could just be assign a 2-bit sequence to each symbol:

XEncoding (“Code Words”)

Our symbol/encoding mapping has to be non-singular; otherwise, we will be very sad.

How do we improve this code?

Counting by Frequency

Suppose we got some empirical frequencies of the occurances:


We will encode each symbol by a number of bits inversely proportional to the frequency of its occurrence. So:


This is actually uniquely decodeable because its Prefix-Free.

Block Coding

Let’s encode a block of symbols together!

Say we have a sequence:

\begin{equation} X_1, \dots, X_{n} \end{equation}

assuming all \(X_{j}\) are independent of each other. We can devise a block code by coming with a code to encode pair of symbols together.

To calculate the Average Codeword Length, we have to now divide by the number of symbols each block includes.

Shannon’s Source-Coding Theorem

the entropy \(H(X)\) of an information source \(X\) equals the minimum number of bits per source symbol necessary to encode IID sequences from that source.

Meaning: therefore, entropy is the limit—the minimum Average Codeword Length—of lossless information communication.

For a Block Coding scheme with block size \(n\), we have:

\begin{equation} H(X_j) \leq \frac{\bar{\ell}}{n} \leq H(X_j) + \frac{1}{n} \end{equation}

Entropy Rate of the Source

An alternate formulation, whenever we are not drawing IID \(X_{j}\), is that:

\begin{equation} \lim_{n \to \infty} \frac{H(X_1, \dots, X_{n})}{n} = \frac{\bar{\ell}_{n}}{n} \end{equation}

which is the average number of bits per symbol. This is call the “Entropy Rate of the Source”. Most things in the world is not an IID sequence, so this makes sense.

Proof Sketch

Assuming we have IID sequence of symbols \(x_1, x_2, \ldots\), suppose we applied the Huffman Coding scheme with Block Coding block size \(n\). Recall that Huffman Coding is Bounded, so our new bound for a block would be:

\begin{equation} H(X_1, \dots, X_{n}) \leq \bar{\ell} \leq H(X_1, \dots, X_{n}) +1 \end{equation}

Recall now that joint entropy of independent events is the sum, so we get:

\begin{equation} \sum_{j} H(X_j) \leq \bar{\ell} \leq 1+ \sum_{j} H(X_j) \end{equation}

For IID sequences, then, the sum is just multiplication by \(n\):

\begin{equation} n H(X_j) \leq \bar{\ell} \leq 1+ n H(X_j) \end{equation}

dividing by \(n\):

\begin{equation} H(X_j) \leq \frac{\bar{\ell}}{n} \leq H(X_j) + \frac{1}{n} \end{equation}

(now we are tracking per symbol because we average our large blocked codeword length over the block size number of symbols)

Now, this says that Block Coding gets more efficient with larger block size. HOWEVER! this is impractical because the number of blocks scales exponentially based on \(M^{n}\) for \(M\) instantiations of each symbol.