Huffman Coding is a scheme that produces an optimal prefix-free code (the one with the smallest average codeword length) for a source with any distribution.
Huffman Coding is Bounded
Huffman Coding has the property that:
\begin{equation} \bar{\ell} \leq H(X) + 1 \end{equation}
Recall also that any prefix-free code for a source has average codeword length at least the entropy; combining the two bounds gives:
\begin{equation} H(X) \leq \bar{\ell} \leq H(X)+1 \end{equation}
for a source \(X\) with entropy \(H(X)\), where \(\bar{\ell}\) is the average codeword length of the code returned by Huffman Coding.
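This bound can be checked numerically. Below is a minimal Python sketch; the helper names `entropy` and `avg_length` are our own, and the distribution chosen is dyadic (all probabilities are powers of 1/2), so the average length happens to meet the entropy exactly:

```python
import math

def entropy(probs):
    """Shannon entropy H(X) in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def avg_length(probs, lengths):
    """Average codeword length: sum of p(x) * length(x)."""
    return sum(p * l for p, l in zip(probs, lengths))

# A dyadic distribution with the codeword lengths Huffman produces for it.
probs = [1/2, 1/4, 1/8, 1/8]
lengths = [1, 2, 3, 3]

H = entropy(probs)
L = avg_length(probs, lengths)
assert H <= L <= H + 1   # the Huffman bound holds
```

For non-dyadic distributions \(\bar{\ell}\) sits strictly between \(H(X)\) and \(H(X)+1\).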
Huffman Coding is Greedy
Since the Huffman tree places the most frequent symbols closest to the root, swapping a symbol high in the tree with one lower down can only increase the expected codeword length.
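This exchange argument can be checked with a small Python sketch; the distribution and depth assignments below are our own concrete choices for illustration:

```python
# Cost of a code = expected codeword length = sum of p(x) * depth(x).
probs = {"A": 1/2, "T": 1/4, "C": 1/8, "G": 1/8}

huffman_depths = {"A": 1, "T": 2, "C": 3, "G": 3}  # frequent symbols near the root
swapped = {"A": 3, "T": 2, "C": 1, "G": 3}         # A and C exchanged

def cost(depths):
    """Expected codeword length for the given leaf depths."""
    return sum(probs[s] * depths[s] for s in probs)

# Moving the most frequent symbol deeper strictly increases the cost.
assert cost(swapped) > cost(huffman_depths)
```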
Example
Suppose we have a distribution of a source sequence:
| X | P(x) |
|---|---|
| A | 1/2 |
| T | 1/4 |
| C | 1/8 |
| G | 1/8 |
We iteratively generate a tree by “merging” two sources of the smallest probability into one node.
So after one merge:
| X | P(x) |
|---|---|
| A | 1/2 |
| T | 1/4 |
| C, G | 1/8 + 1/8 = 1/4 |
Doing that again:
| X | P(x) |
|---|---|
| A | 1/2 |
| T, (C, G) | 1/4 + 1/4 = 1/2 |
Doing that again:
| X | P(x) |
|---|---|
| A, (T, (C, G)) | 1/2 + 1/2 = 1 |
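The merging steps above can be automated with a min-heap, always popping the two least-probable nodes. A minimal Python sketch (`huffman_merge_order` is a hypothetical helper name; merged nodes are represented as nested tuples):

```python
import heapq

def huffman_merge_order(probs):
    """Repeatedly merge the two least-probable nodes, as in the tables above."""
    # Heap entries are (probability, tie-breaker, label); the unique integer
    # tie-breaker prevents Python from ever comparing labels directly.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, counter, (a, b)))
        counter += 1
    return heap[0]

root = huffman_merge_order({"A": 1/2, "T": 1/4, "C": 1/8, "G": 1/8})
# root[0] is the total probability 1.0; root[2] nests symbols in merge order
```

Running this on the example distribution reproduces the same merge order as the tables: first \((C, G)\), then \((T, (C, G))\), then \((A, (T, (C, G)))\).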
This now creates a binary tree where each symbol is a leaf!
          .
         / \
        A   .
           / \
          T   .
             / \
            C   G
We now assign \(0\) to the left edge and \(1\) to the right edge:
          .
       0 / \ 1
        A   .
         0 / \ 1
          T   .
           0 / \ 1
            C   G
To get the actual codes, we traverse from the root to each leaf, reading off the edge labels along the way:
| X | Code |
|---|---|
| A | 0 |
| T | 10 |
| C | 110 |
| G | 111 |
Notice how this code is prefix-free: no codeword is a prefix of any other!
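The traversal that reads off the codes can be sketched in a few lines of Python (the tree below is the one built in the example, written as nested tuples; `assign_codes` is our own helper name):

```python
def assign_codes(node, prefix=""):
    """Walk the tree: the left edge appends '0', the right edge appends '1'."""
    if isinstance(node, str):          # leaf: a single symbol
        return {node: prefix or "0"}   # "0" covers the degenerate one-symbol tree
    left, right = node
    codes = assign_codes(left, prefix + "0")
    codes.update(assign_codes(right, prefix + "1"))
    return codes

tree = ("A", ("T", ("C", "G")))        # the tree from the example above
codes = assign_codes(tree)
# codes == {"A": "0", "T": "10", "C": "110", "G": "111"}

# Prefix-free check: no codeword is a prefix of another.
words = list(codes.values())
assert all(not b.startswith(a) for a in words for b in words if a != b)
```

Because codes come from root-to-leaf paths, a codeword can only be a prefix of another if its node were an internal ancestor, which never happens when every symbol is a leaf.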
