Houjun Liu

Huffman Coding

The Huffman Coding scheme is a coding scheme which gives the best Prefix-Free code (with the smallest Average Codeword Length) for any source with any arbitrary distribution.

Huffman Coding is Bounded

the Huffman Coding has the property that:

\begin{equation} \bar{l} \leq H(x) +1 \end{equation}

So, recall that any prefix free code for a source has at least entropy length; this gives:

\begin{equation} H(x) \leq \bar{\ell} \leq H(x)+1 \end{equation}

for source \(X\), and entropy of \(X\) given by \(H(X)\), and a Average Codeword Length returned by Huffman Coding \(\bar{\ell}\).

Example

Suppose we have a distribution of a source sequence:

XP(x)
A1/2
T1/4
C1/8
G1/8

We iteratively generate a tree by “merging” two sources of the smallest probability into one node.

So after one merge:

XP(x)
A1/2
T1/4
C, G1/8 + 1/8 = 1/4

Doing that again:

XP(x)
A1/2
T, (C, G)1/4 + 1/4 = 1/2

Doing that again:

XP(x)
A, (T, (C, G))1/2 + 1/2 = 1

This now creates a binary tree where each symbol is a leaf!

     .
A         .
       T      .
            C    G

We now assign \(0\) to left edge and \(1\) to the right edge

     .
  0    1
A         .
         0  1
       T      .
             0  1
            C    G

To get the actual codes, we can just traverse to each leaf to get the code:

XCode
A0
T1 0
C1 1 0
G1 1 1

Notice how this is prefix free!