Class 12: Greedy Algorithms

Definitions

 

Algorithms for optimization problems typically go through a sequence of steps, with a set of choices for each step. For many optimization problems, using dynamic programming to determine the best choices is overkill; simpler, more efficient algorithms are sufficient. A greedy algorithm always makes a choice that is locally optimal in the hope that it will lead to a globally optimal solution.

 

We design greedy algorithms according to the following sequence of steps:

 

  1. Cast the optimization problem as one in which we make a choice and are left with one subproblem to solve.
  2. Prove that there is always an optimal solution to the original problem that makes the greedy choice, so that the greedy choice is always safe.
  3. Demonstrate that, having made the greedy choice, what remains is a subproblem with the property that if we combine an optimal solution to the subproblem with the greedy choice that we have made, we arrive at an optimal solution to the original problem.

 

Huffman codes

 

Huffman codes are widely used for data compression. We consider the data to be a sequence of characters. Huffman's greedy algorithm uses a table of frequencies of occurrence of the characters to build up an optimal way to represent each character as a binary string.

 

Prefix codes

 

We consider only codes that cannot be prefixes of another code. For example,

 

a          0

b          101

c          100

d          111

e          1101

f           1100

 

abc = 0 101 100

 

Prefix codes simplify decoding. For example, the string 00111101 parses uniquely as 0.0.111.101 = aadb. The mapping can be represented by the following binary tree containing frequencies and characters.

 

 

                                                100%

                        (0) a:45%                                      (1) 55%

                                                            (0)25%                                                (1)30%

                                                (0)c:12%            (1)b:13%         (0)14%                        (1)d:16%

                                                                                    (0)f:5%            (1)e:9%

 

An optimal code for a file is always represented by a full binary tree, in which every non-leaf node has two children. If the tree is not full, then certain codes are unused, which makes the coding not optimal.

 

If C is the alphabet from which characters are drawn, then the tree for an optimal prefix code has exactly |C| leaves, one for each letter and exactly |C| -1 internal nodes. We prove the latter by induction. It is true for |C|=2. Assume it is true n=k. Then n=k+2, an internal node is needed for two more children.

 

Given a tree T corresponding to a prefix code, it is straightforward to compute the number of bits required to encode a file. Denote f(c) frequency of character c in the file, and d(c) is the depth of c's leaf in the tree. The number of bits to encode a file is:

 

B(T) = sum_{c in C}(f(c) d(c)),

 

which we define as the cost of the tree T.

 

Constructing a Huffman code

 

Huffman developed a greedy algorithm that constructs an optimal prefix code. The algorithm builds the tree T corresponding to the optimal code in a bottom-up manner. It begins with a set of |C| leaves and performs a sequence of |C|-1 "merging" operations to create the final tree. At each step, the two least-frequent objects are merged. The result of the merger is a new object whose frequency is the sum of two children.

 

Huffman(C)

1     n = |C|

2     Q = C // min-priority queue based on frequencies

3     for i = 1 to (n-1)

4           <allocate a new node z>

5           left[z] = x = Extract-Min(Q)

6           right[z] = y = Extract-Min(Q)

7           f[z] = f[x] + f[y]

8           Insert(Q, z)

9     return Extract-Min(Q) // return the root of the tree

 

Assume that Q is implemented as binary min-heap. For a set C of n characters, building a min-heap (step 2) takes O(n) time. Each heap operation in loop 3-8 takes O(lgn) time. The loop is executed (n-1) times, hence the total running time of Huffman is O(nlgn).

 

Correctness of Huffman's algorithm

 

We first show that the greedy-choice property holds.

 

Lemma 16.2

Let C be an alphabet in which each character c has frequency f[c]. Let x and y be two characters in C having the lowest frequencies. Then there exists an optimal prefix code for C in which codewords for x and y have the same length and differ only in the last bit.

 

Proof.

The idea of the proof is to convert an arbitrary optimal prefix code to the one that has x and y as siblings.

 

Let a and b be two siblings at the maximum depth in T. Without the loss of generality we assume f(x) < f(y), f(a) < f(b). Create T' by exchanging positions of a and x. The difference in cost between T and T':

 

B(T) – B(T') = f[x]dt(x) + f[a]dt(a) – f[x]dt'(x) – f[a]dt'(a) =

 

f[x]dt(x) + f[a]dt(a) – f[x]dt(a) – f[a]dt(x) =

 

- f[x] (dt(a) – dt(x)) + f[a](dt(a) – dt(x)) =

 

(f[a] – f[x]) (dt(a) – dt(x)) >= 0

 

Note that, f[a] – f[x] >= 0, because f[x] is a minimum frequency leaf. Since a is the lowest leaf in T, dt(a) >= dt(x). Similarly, we can obtain T'' from T' by exchanging b and y. We can show that B(T'') <= B(T') <= B(T), which implies that B(T'') = B(T) meaning T'' is an optimal prefix tree. ¢

 

Lemma 16.2 implies that the process of building up an optimal tree by mergers can begin with the greedy choice of merging two characters with lowest frequency. This is a greedy choice since of all possible mergers at each step, Huffman chooses the one that incurs the least cost.

 

The next lemma shows that the problem has the optimal-substructure property.

 

Lemma 16.3

Let C be a given alphabet with frequency f[c]. Let x and y be two characters in C with minimum frequency. Let C' be the alphabet C with x and y removed, and a new character z added: C' = C – {x,y} È {z}, f[z] = f[x] + f[y]. Let T' be a tree representing an optimal prefix code for C'. Then the tree T, obtained from T' by replacing z with an internal node with children x and y, represents an optimal prefix code for C.

 

Proof.

We show that B(T) can be expressed in terms of B(T'). Note that, for each c in C – {x,y}, we have dt(c) = dt'(c), f[c]dt(c) = f[c]dt'(c). Since dt(x)=dt(y)=dt'(z)+1, we have

 

f[x]dT(x) + f[y]dT(y) = (f[x] + f[y]) (dT'(z)+1) =

 

                                    (f[x]+f[y]) dT'(z) + (f[x] + f[y]) =

 

                                    f[z] dT'(z) + f[x] + f[y]

 

Hence

 

B(T) = B(T') + f[x] + f[y], or

 

B(T') = B(T) – f[x] – f[y]

 

We now prove the lemma by contradiction. Suppose T is not an optimal prefix code. Then there exists T'': B(T'') < B(T). By Lemma 16.2, T'' has x and y as siblings. Let T''' be the tree T'' with the common parent of x and y replaced by a leaf z with f[z] = f[x] + f[y]. Then

 

B(T''') = B(T'') – f[x] – f[y]

 

            < B(T) – f[x] – f[y]

 

            = B(T')

 

yielding a contradiction to the assumption that T' is an optimal prefix code for C'. Thus, T must represent an optimal prefix code for C. ¢

 

Theorem 16.4

Procedure Huffman produces an optimal prefix code.

 

Proof.

Follows from Lemmas 16.2 and 16.3. ¢