INFS 519 - November 2, 2010

Reading

This lecture covers material from chapter 11.

Searching

Serial Search

Search for an item by iterating over every element until the target is found.
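A minimal Java sketch of serial search over an int array (class and method names here are our own, not the textbook's):

```java
public class SerialSearchDemo {
    // Returns the index of target in a, or -1 if it is not present.
    public static int serialSearch(int[] a, int target) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] == target) {
                return i;
            }
        }
        return -1; // every element was examined without finding the target
    }

    public static void main(String[] args) {
        int[] foo = {40, 10, 50, 30, 20};
        System.out.println(serialSearch(foo, 30)); // prints 3
        System.out.println(serialSearch(foo, 99)); // prints -1
    }
}
```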

Time Analysis

If the target is the last item in the collection, then all n elements must be accessed.

Consider an array with n = 10 elements, foo[10]. In the worst case, the target is stored in the last position, so all 10 elements must be accessed to find it.

Worst-case time
O(n)

Along with the worst-case, we can consider the average-case. That is, on average, how many accesses are needed to find the target using a serial search.

If the target is in the first position, then only 1 access is needed. If it is in the second, then two accesses are needed. In general, if the target is located at position i (zero-based), then i+1 accesses are required.

For i in [0, 9]: 0+1, 1+1, 2+1, …

To calculate the average we sum the access counts and divide by the total number of counts: (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10) / 10 = 5.5

Instead of enumerating all the counts, we can use the general formula for the sum of the integers from 1 to n.

1 + 2 + … + n = (n * (n + 1)) / 2

If we substitute in the general formula for calculating the average we get:

(1 + 2 + … + n) / n = (n * (n + 1) / 2) / n = (n + 1) / 2
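As a quick sanity check, we can compute the average by brute force and compare it to (n + 1) / 2 (a small illustrative helper, not from the text):

```java
public class AverageCaseDemo {
    // Average accesses = (1 + 2 + ... + n) / n, which simplifies to (n + 1) / 2.
    public static double averageAccesses(int n) {
        double sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += i; // i accesses when the target is at position i - 1
        }
        return sum / n;
    }

    public static void main(String[] args) {
        System.out.println(averageAccesses(10)); // prints 5.5, matching (10 + 1) / 2
    }
}
```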

While practically this average-case is better than the worst-case, the growth rate of the runtime is still O(n).

Average-case time
O(n)

We can also consider the best-case time, but unless we can show that the best case occurs with high probability, it isn't a useful metric for an algorithm. The best case for serial search is that the target is located in the first position of the collection.

Best-case time
O(1)

Binary Search

If a collection is sorted, then a faster search algorithm, binary search, may be used.

A binary search involves splitting the collection into halves, selecting the half which may contain the target value, then recursively calling binary search on that half.
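A recursive sketch in the spirit of the textbook's implementation (the actual code on p. 563 may differ in details; these names are our own):

```java
public class BinarySearchDemo {
    // Searches the sorted range a[first..first+size-1] for target;
    // returns the index of target, or -1 if it is not present.
    public static int binarySearch(int[] a, int first, int size, int target) {
        if (size <= 0) {
            return -1; // stopping case: empty range, target cannot be here
        }
        int middle = first + size / 2;
        if (a[middle] == target) {
            return middle;
        }
        if (target < a[middle]) {
            // Search the lower half (elements before the midpoint).
            return binarySearch(a, first, size / 2, target);
        }
        // Search the upper half (elements after the midpoint).
        return binarySearch(a, middle + 1, size - size / 2 - 1, target);
    }

    public static void main(String[] args) {
        int[] a = {10, 20, 30, 40, 50, 60, 70};
        System.out.println(binarySearch(a, 0, a.length, 30)); // prints 2
        System.out.println(binarySearch(a, 0, a.length, 65)); // prints -1
    }
}
```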

See implementation on p. 563.

See example on p. 563-564.

Time Analysis

In each iteration of the binary search, we:

  • Find the midpoint and stop if it's the target.
  • Use the midpoint to define two halves.
  • Run binary search on the half that may contain the target.
    • More specifically, if the target is less than the value at the midpoint, then choose the lower half; if the target is greater than the value at the midpoint, then choose the upper half.

We can see that each time we call binary search on a half that we are dividing the number of elements we're searching by two.

To calculate the worst-case running time, we assume that the collection does not contain the target we're searching for.

The worst-case running time for a collection of length n can be defined as:

T(n) = (8 + c) * length of longest recursive chain + number of operations in stopping case

The (8 + c) is the count of operations performed when assuming that the target will not be found.

There are 2 operations performed at the stopping case: the test size <= 0 and the returning of -1.

Revising our equation:

T(n) = (8 + c) * length of longest recursive chain + 2

So we're left to find an equation that will provide us with the longest recursive chain. To put it in different terms, we want an equation that calculates how many times the collection of n items can be divided in half. Let's define a halving function, H(n), which provides the number of times n can be divided by 2, stopping when the result is less than one. Substituting in H(n) we have:

T(n) = (8 + c) * H(n) + 2

Now we only need to find H(n). It should be obvious that it is close to log2 n.

H(n) = floor(log2 n) + 1

We finally have:

T(n) = (8 + c) * floor(log2 n) + 2

When we throw out the constants we see that the worst-case running time is logarithmic.
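We can verify H(n) = floor(log2 n) + 1 by counting the halvings directly (a small illustrative helper, not from the text):

```java
public class HalvingDemo {
    // H(n): how many times n can be divided by 2 before the result
    // drops below one.
    public static int halvings(int n) {
        int count = 0;
        double x = n;
        while (x >= 1) {
            x /= 2;
            count++;
        }
        return count;
    }

    // The closed form floor(log2 n) + 1, computed exactly with bit
    // operations for n >= 1.
    public static int closedForm(int n) {
        return (31 - Integer.numberOfLeadingZeros(n)) + 1;
    }

    public static void main(String[] args) {
        System.out.println(halvings(10));    // prints 4
        System.out.println(closedForm(10));  // prints 4
    }
}
```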

Worst-case time
O( log n)

Open-Address Hashing

Hashing maps a key to an index. Specifically, a hash function is used to calculate an index from a given key value.

A hash function must always return a valid index.

For example, consider mapping the keys: 0, 100, 200, … 4800, 4900; to indexes 0, 1, 2, … 48, 49.

That hash function is: f(k) = k/100
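That function can be written directly in Java (the class name is ours):

```java
public class DivisionHashExample {
    // f(k) = k / 100. Integer division maps 0..99 -> 0, 100..199 -> 1, and so on.
    public static int hash(int key) {
        return key / 100;
    }
}
```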

If a hash function maps multiple keys to the same index, a collision occurs. That is, there are multiple values that belong in data[i], where i is the index calculated for multiple keys.

To avoid collisions, one may use a data array (or other backing store) larger than necessary in order to increase the number of valid indexes.

While collisions can be made less likely by increasing the number of indexes, they may still occur. Open-address hashing is a method for handling collisions by using the next available, or open, index. The process of searching for the next available index is called linear probing.

The basic algorithm is:

  • For a key of some record compute the index as hash(key).
  • If data[hash(key)] is empty, we can store the object there and we're finished.
  • If the location data[hash(key)] is not empty, then try data[hash(key)+1]. If that is also unavailable, then check the next location data[hash(key)+2], and so on.
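The steps above can be sketched as a minimal open-address table with linear probing. This is illustrative only: it assumes int keys, no removal, and a table that never fills (all names are our own):

```java
public class LinearProbingDemo {
    private Integer[] keys;   // null marks an open slot
    private String[] data;

    public LinearProbingDemo(int capacity) {
        keys = new Integer[capacity];
        data = new String[capacity];
    }

    private int hash(int key) {
        return Math.abs(key) % keys.length;
    }

    public void put(int key, String element) {
        int i = hash(key);
        // Probe forward (wrapping around) until an open slot,
        // or the same key, is found.
        while (keys[i] != null && keys[i] != key) {
            i = (i + 1) % keys.length;
        }
        keys[i] = key;
        data[i] = element;
    }

    public String get(int key) {
        int i = hash(key);
        // Follow the same probe sequence used by put.
        while (keys[i] != null) {
            if (keys[i] == key) {
                return data[i];
            }
            i = (i + 1) % keys.length;
        }
        return null; // an open slot means the key is not in the table
    }
}
```

For example, with capacity 10 the keys 7 and 17 both hash to index 7; put stores the second one at index 8, and get finds it by probing past index 7.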

Non-Integer Keys

Often the key we want to use for a record in a hash table won't be an integer. For example, we could use last names as keys when storing contact records in an address book application. An encoding function can be used to convert a string, or any other data type, to an integer. In Java, every object has a method called hashCode which does just that.

Using hashCode, we would begin looking for an open position by checking this location first: data[hash(key.hashCode())]

Hash Tables

public class Table<K,V> {
    private int manyItems;
    private Object[] keys;
    private Object[] data;
    private boolean[] hasBeenUsed;
    
    public Table(int capacity);
    public V get(K key);
    public void put(K key, V element);
    public V remove(K key);
    public boolean containsKey(K key);
    
    private int findIndex(K key);
    private int hash(K key);
    private int nextIndex(int i);
}

See example on p. 582-584 of hash table use.

Choosing a Hash Function to Minimize Collisions

We have been using a common hash function:

Math.abs(key.hashCode()) % data.length

This is known as a division hash function.

According to research by Radke in 1970, a good table size choice for a division hash function is a prime number of the form 4k + 3.

Two other hash functions are:

Mid-square hash function
The key is converted to an integer and multiplied by itself. The hash function returns some middle digits of the result.
Multiplicative hash function
The key is converted to an integer and multiplied by a constant less than one. The hash function returns the first few digits of the fractional part of the result.
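Illustrative sketches of both alternatives follow. The exact choice of middle digits, and the multiplicative constant (the fractional part of the golden ratio, a common choice), are our own assumptions, not prescribed by the text:

```java
public class AltHashDemo {
    // Mid-square: square the key and take some middle digits of the result.
    public static int midSquare(int key, int tableSize) {
        long squared = (long) key * key;
        String digits = Long.toString(squared);
        int start = digits.length() / 3;                 // pick a middle slice
        int end = Math.min(start + 3, digits.length());
        int middle = Integer.parseInt(digits.substring(start, end));
        return middle % tableSize;
    }

    // Multiplicative: multiply by a constant less than one and use the
    // first digits of the fractional part.
    public static int multiplicative(int key, int tableSize) {
        double c = 0.6180339887;                         // constant < 1
        double fractional = (Math.abs((long) key) * c) % 1.0;
        return (int) (fractional * tableSize);
    }
}
```

Both functions return a value in [0, tableSize), so either can serve as the hash for an array of that length.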

Clustering

One problem with linear probing is that clustering can occur when several keys hash to the same index. A common technique for avoiding clustering is double hashing. In this technique we use a second hash function to calculate a step size. For example, if hash1(key) = 210, but position 210 is unavailable, then instead of looking at hash1(key) + 1 we add the value of hash2(key) to hash1(key) to find the next position to examine. For example, if hash2(key) = 9, then we would examine these positions, in order: hash1(key), hash1(key) + 9, hash1(key) + 18, etc.

Double hashing has one potential problem. What happens if, in the search for an open position, we cycle back to hash1(key)? A way to avoid this is to guarantee that hash2(key) is relatively prime to data.length.

Donald Knuth recommends the following solution:

  • Both data.length and data.length - 2 should be prime numbers. For example, 1231 and 1229. Such primes are known as twin primes.
  • hash1(key) = Math.abs(key.hashCode()) % data.length
  • hash2(key) = 1 + (Math.abs(key.hashCode()) % (data.length - 2))
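Knuth's scheme can be sketched with int keys for concreteness. The table length 13 (with 13 - 2 = 11 also prime, so they are twin primes) is our own example choice:

```java
public class DoubleHashDemo {
    static final int LENGTH = 13; // prime; LENGTH - 2 = 11 is also prime

    static int hash1(int key) {
        return Math.abs(key) % LENGTH;
    }

    static int hash2(int key) {
        // Always in 1..LENGTH-2, so never 0 and always relatively
        // prime to the prime table length.
        return 1 + (Math.abs(key) % (LENGTH - 2));
    }

    // The i-th position examined when probing for a key.
    static int probe(int key, int i) {
        return (hash1(key) + i * hash2(key)) % LENGTH;
    }
}
```

Because hash2 can never be 0 and LENGTH is prime, the probe sequence visits every one of the 13 slots before cycling back to hash1(key).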

Chaining

In open-address hashing a collision is handled by probing for an alternate, open position. But if the data array becomes full, then a new array must be created and all elements must be rehashed.

Chained hashing, or chaining, is an alternative to resizing and rehashing. In chaining, when a collision occurs the new element is placed in a list that contains all elements which hash to that position.

class ChainedHashNode<K,V> {
    V element;
    K key;
    ChainedHashNode<K,V> next;
}
class ChainedTable<K,V> {
    private ChainedHashNode<K,V>[] table;
    ...
}
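A fuller sketch of how put and get might traverse the chains (our own illustrative code, not the textbook's):

```java
import java.util.Objects;

public class ChainedTableDemo<K, V> {
    private static class Node<K, V> {
        K key;
        V element;
        Node<K, V> next;
    }

    private Node<K, V>[] table;

    @SuppressWarnings("unchecked")
    public ChainedTableDemo(int capacity) {
        table = (Node<K, V>[]) new Node[capacity];
    }

    private int hash(K key) {
        return Math.abs(key.hashCode()) % table.length;
    }

    public void put(K key, V element) {
        int i = hash(key);
        for (Node<K, V> n = table[i]; n != null; n = n.next) {
            if (Objects.equals(n.key, key)) {
                n.element = element; // key already present: replace its element
                return;
            }
        }
        Node<K, V> n = new Node<>(); // collision or empty slot: push onto the chain
        n.key = key;
        n.element = element;
        n.next = table[i];
        table[i] = n;
    }

    public V get(K key) {
        for (Node<K, V> n = table[hash(key)]; n != null; n = n.next) {
            if (Objects.equals(n.key, key)) {
                return n.element;
            }
        }
        return null;
    }
}
```

Note that a chain simply grows when a collision occurs, so unlike open addressing the table never "fills" and no rehashing is forced.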

Time Analysis

The worst-case scenario for search is every key getting hashed to the same index. This results in potentially having to search through every item in order to find the target. That is a linear runtime. However, the average-case is significantly brighter.

The load factor, alpha, is: (number of elements in the table) / (size of the table's array)

For open-addressing, the load factor will never exceed 1 since each array cell may only hold one item.

The average number of table elements examined during a successful search is:

Open-address hashing with linear probing

1/2 * (1 + (1 / (1 - alpha)))

Open-address hashing with double hashing

(-ln(1 - alpha)) / alpha

Chained hashing

1 + (alpha / 2)
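Evaluating the three formulas at a sample load factor, say alpha = 0.5, makes the comparison concrete (the expected values in the comments are rounded):

```java
public class LoadFactorDemo {
    // Open-address hashing with linear probing.
    static double linearProbing(double alpha) {
        return 0.5 * (1 + 1 / (1 - alpha));
    }

    // Open-address hashing with double hashing.
    static double doubleHashing(double alpha) {
        return -Math.log(1 - alpha) / alpha;
    }

    // Chained hashing.
    static double chaining(double alpha) {
        return 1 + alpha / 2;
    }

    public static void main(String[] args) {
        double alpha = 0.5;
        System.out.println(linearProbing(alpha)); // 1.5
        System.out.println(doubleHashing(alpha)); // about 1.39
        System.out.println(chaining(alpha));      // 1.25
    }
}
```

At this load factor a successful search examines fewer elements, on average, under chaining than under either open-addressing strategy.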
