Algorithms for optimization problems typically go through a sequence of steps, with a set of choices for each step. For many optimization problems, using dynamic programming to determine the best choices is overkill; simpler, more efficient algorithms are sufficient. A greedy algorithm always makes a choice that is locally optimal in the hope that it will lead to a globally optimal solution.
We design greedy algorithms according to the following sequence of steps:
Huffman codes are widely used for data compression. We consider the data to be a sequence of characters. Huffman's greedy algorithm uses a table of frequencies of occurrence of the characters to build up an optimal way to represent each character as a binary string.
We consider only codes that cannot be prefixes of another code. For example,
a 0
b 101
c 100
d 111
e 1101
f 1100
abc = 0 101 100
Prefix codes simplify decoding. For example, the string 00111101 parses uniquely as 0.0.111.101 = aadb. The mapping can be represented by the following binary tree containing frequencies and characters.
100%
(0) a:45% (1) 55%
(0)25% (1)30%
(0)c:12% (1)b:13% (0)14% (1)d:16%
(0)f:5% (1)e:9%
An optimal code for a file is always represented by a full binary tree, in which every non-leaf node has two children. If the tree is not full, then certain codes are unused, which makes the coding not optimal.
If C is the alphabet from which characters are drawn, then the tree for an optimal prefix code has exactly |C| leaves, one for each letter and exactly |C| -1 internal nodes. We prove the latter by induction. It is true for |C|=2. Assume it is true n=k. Then n=k+2, an internal node is needed for two more children.
Given a tree T corresponding to a prefix code, it is straightforward to compute the number of bits required to encode a file. Denote f(c) frequency of character c in the file, and d(c) is the depth of c's leaf in the tree. The number of bits to encode a file is:
B(T) = sum_{c in C}(f(c) d(c)),
which we define as the cost of the tree T.
Huffman developed a greedy algorithm that constructs an optimal prefix code. The algorithm builds the tree T corresponding to the optimal code in a bottom-up manner. It begins with a set of |C| leaves and performs a sequence of |C|-1 "merging" operations to create the final tree. At each step, the two least-frequent objects are merged. The result of the merger is a new object whose frequency is the sum of two children.
Huffman(C)
1 n = |C|
2 Q = C //
min-priority queue based on frequencies
3 for i = 1
to (n-1)
4 <allocate a new node z>
5 left[z] = x = Extract-Min(Q)
6 right[z] = y = Extract-Min(Q)
7 f[z] = f[x] + f[y]
8 Insert(Q, z)
9 return
Extract-Min(Q) // return the root of the tree
Assume that Q is implemented as binary min-heap. For a set C of n characters, building a min-heap (step 2) takes O(n) time. Each heap operation in loop 3-8 takes O(lgn) time. The loop is executed (n-1) times, hence the total running time of Huffman is O(nlgn).
We first show that the greedy-choice property holds.
Let C be an alphabet in which each character c has frequency f[c]. Let x and y be two characters in C having the lowest frequencies. Then there exists an optimal prefix code for C in which codewords for x and y have the same length and differ only in the last bit.
Proof.
The idea of the proof is to convert an arbitrary optimal prefix code to the one that has x and y as siblings.
Let a and b be two siblings at the maximum depth in T. Without the loss of generality we assume f(x) < f(y), f(a) < f(b). Create T' by exchanging positions of a and x. The difference in cost between T and T':
B(T) – B(T') = f[x]dt(x) + f[a]dt(a) – f[x]dt'(x) – f[a]dt'(a) =
f[x]dt(x) + f[a]dt(a) – f[x]dt(a) – f[a]dt(x) =
- f[x] (dt(a) – dt(x)) + f[a](dt(a) – dt(x)) =
(f[a] – f[x]) (dt(a) – dt(x)) >= 0
Note that, f[a] – f[x] >= 0, because f[x] is a minimum frequency leaf. Since a is the lowest leaf in T, dt(a) >= dt(x). Similarly, we can obtain T'' from T' by exchanging b and y. We can show that B(T'') <= B(T') <= B(T), which implies that B(T'') = B(T) meaning T'' is an optimal prefix tree. ¢
Lemma 16.2 implies that the process of building up an optimal tree by mergers can begin with the greedy choice of merging two characters with lowest frequency. This is a greedy choice since of all possible mergers at each step, Huffman chooses the one that incurs the least cost.
The next lemma shows that the problem has the optimal-substructure property.
Lemma 16.3
Let C be a given alphabet with frequency f[c]. Let x and y be two characters in C with minimum frequency. Let C' be the alphabet C with x and y removed, and a new character z added: C' = C – {x,y} È {z}, f[z] = f[x] + f[y]. Let T' be a tree representing an optimal prefix code for C'. Then the tree T, obtained from T' by replacing z with an internal node with children x and y, represents an optimal prefix code for C.
Proof.
We show that B(T) can be expressed in terms of B(T'). Note that, for each c in C – {x,y}, we have dt(c) = dt'(c), f[c]dt(c) = f[c]dt'(c). Since dt(x)=dt(y)=dt'(z)+1, we have
f[x]dT(x) + f[y]dT(y) = (f[x] + f[y]) (dT'(z)+1) =
(f[x]+f[y]) dT'(z) + (f[x] + f[y]) =
f[z] dT'(z) + f[x] + f[y]
Hence
B(T) = B(T') + f[x] + f[y], or
B(T') = B(T) – f[x] – f[y]
We now prove the lemma by contradiction. Suppose T is not an optimal prefix code. Then there exists T'': B(T'') < B(T). By Lemma 16.2, T'' has x and y as siblings. Let T''' be the tree T'' with the common parent of x and y replaced by a leaf z with f[z] = f[x] + f[y]. Then
B(T''') = B(T'') – f[x] – f[y]
< B(T) – f[x] – f[y]
= B(T')
yielding a contradiction to the assumption that T' is an optimal prefix code for C'. Thus, T must represent an optimal prefix code for C. ¢
Theorem 16.4
Procedure Huffman produces an optimal prefix code.
Proof.
Follows from Lemmas 16.2 and 16.3. ¢