Class 8: Hash Tables

 

Hash tables are used in applications that rely on basic dictionary operations: insert, search, and delete. A hash table is a generalization of the simpler notion of an ordinary array. Directly addressing into an ordinary array makes effective use of ability to examine an arbitrary position in the array in O(1) time. The difference is that a hash table allows direct addressing of multiple elements that share the same key.

 

Direct-address tables

 

Direct addressing works well when the universe U={1, ... , m} of keys is reasonably small. We assume that no two elements have the same key. To represent the dynamic set we use an array T[0..m-1] in which each position represents a slot in the universe U. If the set contains no element with key k, T[k] = NIL. The dictionary operations are trivial to implement:

 

Direct-Address-Search(T,k)

1     return T(k)

 

Direct-Address-Insert(T,x)

1      T[key[x]] = x

 

Direct-Address-Delete(T,x)

1      T[key[x]] = NIL

 

All of these operations run in O(1) time.

 

Hash tables

 

When U is large, it is impractical to use direct-address tables. In this case, the smaller set K is used to store the dictionary. An element with key k is stored in slot h(k):

 

h: U -> {0, ... , m-1}

 

We say that an element with key k hashes to slot h(k), or h(k) is the hash value of key k. Two keys may share the same slot. This situation is called a collision. The goal of a hash function is to minimize the number of collisions.

 

In chaining all elements that hash to the same slot are put in a linked list. The dictionary operations are implemented as follows.

 

Chained-Hash-Insert(T,x)

<insert x at the head of list T[h(key[x])]>

 

Chained-Hash-Search(T,k)

      <search for an element with key k in list T[h(k)]>

 

Chained-Hash-Delete(T,x)

      <delete x from list T[h(key[x])]>

 

The worst-case running time for insertion is O(1). For searching, the worst-case running time is proportional to the list's size. Deletion assumes the element x is known. Its running is O(1) if the list is doubly linked. (In singly linked list we would need to know the previous element).

 

Chained algorithm analysis

 

Given a hash table T with m slots that store n elements, define the load factor a for T as n/m, which is an average size of a linked list.

 

The worst-case behavior of chained hashing is poor since some slot may contain all n keys. In that case, hashing is no better than a linked list. Hash tables are used for their average performance.

 

Assume that any given element is equally likely to hash into any of m slots. This assumption is called simple uniform hashing. For j = 0, ... , m-1, denote the size of list T[j] by n_j, so that:

 

n = n_0 + n_1 + ... + n_(m-1)

 

The average value of n_j = E[n_j] = a = n/m.

 

Assume h(k) can be computed in O(1) time, so that the time required to search for an element with key k linearly depends on n_{h(k)}. We consider the expected number of elements examined by the search algorithm.

 

Theorem 11.1

 

Unsuccessful Chained-Hash-Search takes expected time Q(1+a), under the assumption of simple uniform hashing.

 

Proof. The key k is not stored in the hash table, hence all n_k elements will need to be examined, which takes Q(n_k) time. Since E[n_k] = a, and it takes O(1) to compute h, we have the expected average time Q(1+a). ¢

 

 

 

Theorem 11.2

 

Successful Chained-Hash-Search takes expected time Q(1+a), under the assumption of simple uniform hashing.

 

Proof. We assume the element being searched for equality is equally likely to be any of the n elements in the list. The number of elements examined during the search is 1 more than the number of elements appearing before x in the list. These are elements that were inserted after x was inserted in the list. Denote x_i the ith element inserted in the table for i=1,...,n. Let k_i = key[x_i]. For keys k_i and k_j, define X_ij = I{h(k_i)=h(k_j)}. We have Pr{h(k_i)=h(k_j)} = 1/m, so E[X_ij] = 1/m.

 

We now average the number of searches required for each element inserted in the hash table (to find ith element we will need to search through at most (n-i) element plus 1):

 

E[1/n sum_{i=1,n}(1+sum_{j=i+1,n}(X_ij))] =

 

1/n sum_{i=1,n}(1 + sum_{j=i+1,n}(E[X_ij])) =

 

1/n sum_{i=1,n}(1 + sum_{j=i+1,n}(1/m)) =

 

1 + 1/nm sum_{i=1,n}(n-i) =

 

... = 1 + a - a/2n = Q(1+a)

 

Hash functions

 

A good hash function satisfies the assumption of simple uniform hashing: each key is likely to hash to any of m slots. Typically it is not possible to check this condition, since we rarely know the probability distribution by which the keys are drawn. Moreover, the keys may not be drawn independently.

 

Most hash functions assume that the universe of keys is the set N = {0, 1, 2, ...} of natural numbers. Thus, if keys are not natural numbers, they are converted to such.

 

The division method

 

h(k) = k mod m

 

For example, m=8. For key = 100, h(k) = 4.

 

Certain values of m are usually avoided. If m = 2^p, then h(k) is the p lowest-order bits of k. A prime not too close to an exact power of 2 is often a good choice for m.

 

The multiplication method

 

This method operates in two steps:

 

 

In short, the hash function is:

 

h(k) = floor(m(kA mod 1))

 

In this case the value of m is not critical. It is typically chosen as 2^p. The hash function can then be implemented as follows.

 

 

Open Addressing

 

In this method all elements are stored in the hash table itself. There are no lists stored outside the table; hence it can "fill up". Thus, the load factor a <= 1.

 

To perform insertion, we successively probe the hash table until we find an empty slot. To determine which slot to probe, the probe number becomes the second input:

 

h: U x {0, 1, ... , m-1} -> {0, 1, ... , m-1}

 

We require that for every key k, the probe sequence

 

<h(k,0), h(k,1), ... , h(k,m-1)>

 

be a permutation of <0,1, ... , m-1>.

 

In the following algorithms we assume the elements in T are keys with no satellite data:

 

Hash-Insert(T, k)

1     i = 0

2     repeat

3           j = h(k,i)

4           if T[j] = NIL

5                 T[j] = k

6                 return j

7           else

8                 i++

9     until i=m

10    throw "hash table overflow"

 

The worst-case time of the insertion algorithm is O(m).

The following is the search algorithm:

 

Hash-Search(T, k)

1     i = 0

2     repeat

3           j = h(k,i)

4           if T[j] = k

5                 return j

6           else

7                 i++

8     until T[j] = NIL or i=m

9     return NIL

 

The worst-case running time of the search algorithm is O(m).

 

Deleting from an open-address table is difficult, since marking an element NIL may make it impossible to retrieve another key. Hence in cases where deletions are allowed, chaining is used.

 

Linear probing

 

Given an ordinary hash function h': U -> {0,1, ... , m-1}, the linear probing uses the following function:

 

h(k,i) = (h'(k) + i) mod m

 

Note that, once the first probe is determined, h'(k), other probes follow in the"wrap around" mode.

 

Linear probing has a primary clustering problem. Long runs of occupied slots build up, increasing the average time. Clusters arise since an empty slot preceded by i full slots gets filled up with probability (i+1)/m (1/m probability for each of i slots plus 1/m).

 

Quadratic probing

 

h(k,i) = (h'(k) + c1 i + c2 i^2) mod m,

 

where h' is an ordinary hash function and c1,c2 <> 0. The initial position is T[h'(k)], other positions are offset by amounts determined by quadratic function. It works better than linear probing, but depends on the choice of c1, c2, and m.

 

If two keys have the same initial probe position, their probe sequences are the same. h(k1,0) = h(k2,0) => h(k1,i) = h(k2,i). This leads to a milder form of clustering, called secondary clustering.

 

Double hashing

 

This is one of the best methods because the permutations produced are similar to randomly chosen permutations.

 

h(k,i) = (h1(k) + ih2(k)) mod m,

 

where h1 and h2 are auxiliary hash functions. The initial probe is T[h1(k)]; the successive probes are offset from previous position by h2(k) modulo m.

 

The value h2(k) must be relatively prime to m for the entire hash table to be searched. For example:

 

m = 2^p; h2(k) = some odd number.

 

Double hashing uses Q(m^2) probe sequences, since each possible (h1(k),h2(k)) pair produces a distinct sequence. This is an improvement over Q(m) sequences in case of linear and quadratic probing.