INFS 519 - November 29, 2011

Reading
B-trees

Reading

This lecture covers material from Chapter 10, Section 2.

B-trees

The Problem with Unbalanced Trees

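In the worst case, an unbalanced tree degenerates into a linked list: every node has a single child, and searching takes linear rather than logarithmic time. This defeats the purpose of using a tree data structure.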
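To make the worst case concrete, here is a minimal sketch of an ordinary binary search tree with no rebalancing. The class and method names (DegenerateBstDemo, heightAfterSortedInserts) are made up for this illustration and are not from the lecture or the textbook. Inserting keys in ascending order sends every new key to the right, so the tree's height equals the number of keys.

```java
// Hypothetical demo class; not part of the lecture's code.
class DegenerateBstDemo {
    // A plain binary search tree node with no balancing information.
    static class Node {
        int value;
        Node left, right;
        Node(int value) { this.value = value; }
    }

    // Standard BST insertion without rebalancing (iterative, so deep chains are safe).
    static Node insert(Node root, int value) {
        Node fresh = new Node(value);
        if (root == null) return fresh;
        Node cur = root;
        while (true) {
            if (value < cur.value) {
                if (cur.left == null) { cur.left = fresh; break; }
                cur = cur.left;
            } else {
                if (cur.right == null) { cur.right = fresh; break; }
                cur = cur.right;
            }
        }
        return root;
    }

    // Height counted in nodes along the longest root-to-leaf path.
    static int height(Node n) {
        return n == null ? 0 : 1 + Math.max(height(n.left), height(n.right));
    }

    // Insert 1..n in ascending order and report the resulting height.
    static int heightAfterSortedInserts(int n) {
        Node root = null;
        for (int v = 1; v <= n; v++) root = insert(root, v);
        return height(root);
    }
}
```

With 100 sorted insertions the height is 100, the same as a linked list of 100 nodes, whereas a balanced binary tree would need only 7 levels.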

There are two solutions:

Periodically run a balancing algorithm on trees.
Use B-trees.

What Is a B-tree?

B-trees are similar to binary search trees, but they are not binary trees. A B-tree node may have many more than two children, and a single node may store multiple data elements.

Elements in B-trees must follow total-order semantics. That is, every element must be comparable with every other element, and the comparison must be transitive, antisymmetric, and total.

B-trees can be used to implement a set or multiset (bag).

A constant, MINIMUM, controls how many elements are stored in a single node. The value may be as small as 1 but is usually much larger.

Rule 1: The root can have as few as one element (or zero elements when it has no children); all other nodes have at least MINIMUM elements.
Rule 2: The max number of elements in a node is twice the MINIMUM.
Rule 3: The elements of each node are stored in a partially filled array, sorted from smallest to largest.
Rule 4: The number of subtrees below a nonleaf node is always one more than the number of elements in the node.
Rule 5: For all nonleaf nodes: an element at index i is greater than all the elements in subtree i of the node, and an element at index i is less than all the elements in subtree i + 1 of the node.
Rule 6: Every leaf in a B-tree has the same depth. A B-tree is balanced.

A Balanced Set

Here we define a BalancedSet class.

Configuration

private static final int MINIMUM = 200;
private static final int MAXIMUM = 2 * MINIMUM;

Instance variables

int dataCount;
int[] data = new int[MAXIMUM + 1];
int childCount;
BalancedSet[] subset = new BalancedSet[MAXIMUM + 2];

Note that both the data and subset arrays are allocated with one more slot than the rules allow (MAXIMUM elements and MAXIMUM + 1 children). The extra slot is useful for temporary storage when splitting or combining nodes, as we will see later.

Also note that every child of the root node is the root node of a smaller subtree. There is a recursive structure, where a B-tree is made up of many smaller B-trees.

Searching for an Element

In order to implement the contains method, our implementation must support searching for a particular element in the B-tree.

Make a local variable, i, equal to the first index such that data[i] >= target. If there is no such index, set i to dataCount, which indicates that every element in the node is less than the target.

Then:

If i < dataCount and data[i] equals the target, return true.
Else if the root has no children, return false.
Else recurse into the correct subtree: return subset[i].contains(target).

(See pp. 525-527 for an example.)
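The search steps can be sketched in Java roughly as follows. This is an illustrative sketch, not the textbook's code: the class name BTreeSearchSketch, the convenience constructor, and the sample() tree are assumptions made for the example, while the field names and firstGE follow the lecture.

```java
// Hypothetical sketch class; field names follow the lecture's BalancedSet.
class BTreeSearchSketch {
    int[] data;
    int dataCount;
    BTreeSearchSketch[] subset;
    int childCount;

    // Assumed helper constructor: build a node from sorted elements and its children.
    BTreeSearchSketch(int[] elements, BTreeSearchSketch... children) {
        data = elements.clone();
        dataCount = elements.length;
        subset = children;
        childCount = children.length;
    }

    // First index i with data[i] >= target, or dataCount if there is none.
    int firstGE(int target) {
        int i = 0;
        while (i < dataCount && data[i] < target) i++;
        return i;
    }

    boolean contains(int target) {
        int i = firstGE(target);
        if (i < dataCount && data[i] == target) return true;  // found in this node
        if (childCount == 0) return false;                     // leaf: nowhere else to look
        return subset[i].contains(target);                     // recurse into subtree i
    }

    // Small hand-built tree: root [10, 20] with leaves [3, 7], [12, 15], [25, 30].
    static BTreeSearchSketch sample() {
        return new BTreeSearchSketch(new int[] {10, 20},
            new BTreeSearchSketch(new int[] {3, 7}),
            new BTreeSearchSketch(new int[] {12, 15}),
            new BTreeSearchSketch(new int[] {25, 30}));
    }
}
```

In the sample tree, searching for 15 finds i = 1 at the root (20 is the first element >= 15), so the search descends into subset[1] and finds 15 there. The bounds check i < dataCount matters when the target is larger than every element in the node.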

Adding an Element

Adding an element to a B-tree may require several steps. We can use the extra space in the data array to break the work into smaller steps that are easier to reason about. These steps are: the loose add, fixing an excess in child nodes, and fixing an excess in the root node.

The first step is a "loose add", called so because after we have loosely added an element the B-tree may no longer be a proper B-tree.

The looseAdd method:

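Let i be the first index such that data[i] >= element (or dataCount if there is no such index). Then:

If the new element is already at data[i], return.
Else if the root has no children, add the element at data[i], shifting other items to the right as necessary.
Else run subset[i].looseAdd(element).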

Because the root of the subtree we added to may now violate the B-tree rules (it may hold MAXIMUM + 1 elements), we need to check for this after the recursive call and fix it if necessary. To fix it, we split the child node that has MAXIMUM + 1 elements into two nodes that each have MINIMUM elements and move the remaining (middle) element up into the parent node.

During the process of fixing the child nodes, it may happen that the root node gains too many elements. When this happens, a new root node is constructed and the original root node is split into two nodes, with the middle element rising up into the new root node. This is the only way a B-tree gains height: it grows at the root, so every leaf stays at the same depth.

Structure of calls for add method.

add
- looseAdd
  - firstGE
  - fixExcess
- fixExcess
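The call structure above might be realized along these lines. This is a sketch under stated assumptions rather than the textbook's implementation: the class name BTreeAddSketch is made up, MINIMUM is set to 1 so that splits happen after only a few insertions, the root excess is handled by pushing the root's contents into a single child and reusing fixExcess, and height() and leavesAtDepth() are extra helpers added only to check Rule 6.

```java
// Hypothetical sketch of add / looseAdd / fixExcess; not the textbook's code.
class BTreeAddSketch {
    static final int MINIMUM = 1;            // assumption: tiny value so splits happen early
    static final int MAXIMUM = 2 * MINIMUM;

    int dataCount = 0;
    int[] data = new int[MAXIMUM + 1];       // one spare slot for a temporary excess
    int childCount = 0;
    BTreeAddSketch[] subset = new BTreeAddSketch[MAXIMUM + 2];

    // First index i with data[i] >= target, or dataCount if there is none.
    int firstGE(int target) {
        int i = 0;
        while (i < dataCount && data[i] < target) i++;
        return i;
    }

    void add(int element) {
        looseAdd(element);
        if (dataCount > MAXIMUM) {
            // Root excess: push the root's contents down into a single child,
            // then split that child so the middle element rises into the root.
            BTreeAddSketch child = new BTreeAddSketch();
            System.arraycopy(data, 0, child.data, 0, dataCount);
            System.arraycopy(subset, 0, child.subset, 0, childCount);
            child.dataCount = dataCount;
            child.childCount = childCount;
            dataCount = 0;
            subset[0] = child;
            childCount = 1;
            fixExcess(0);
        }
    }

    private void looseAdd(int element) {
        int i = firstGE(element);
        if (i < dataCount && data[i] == element) return;   // already in the set
        if (childCount == 0) {
            System.arraycopy(data, i, data, i + 1, dataCount - i);  // shift right
            data[i] = element;
            dataCount++;
        } else {
            subset[i].looseAdd(element);
            if (subset[i].dataCount > MAXIMUM) fixExcess(i);
        }
    }

    // Split subset[i] (MAXIMUM + 1 elements) into two nodes of MINIMUM elements each.
    private void fixExcess(int i) {
        BTreeAddSketch left = subset[i];
        BTreeAddSketch right = new BTreeAddSketch();
        int middle = left.data[MINIMUM];
        System.arraycopy(left.data, MINIMUM + 1, right.data, 0, MINIMUM);
        left.dataCount = MINIMUM;
        right.dataCount = MINIMUM;
        if (left.childCount > 0) {
            // An overfull nonleaf has MAXIMUM + 2 children; each half keeps MINIMUM + 1.
            System.arraycopy(left.subset, MINIMUM + 1, right.subset, 0, MINIMUM + 1);
            left.childCount = MINIMUM + 1;
            right.childCount = MINIMUM + 1;
        }
        // Open a slot in this node for the middle element and the new right child.
        System.arraycopy(data, i, data, i + 1, dataCount - i);
        data[i] = middle;
        dataCount++;
        System.arraycopy(subset, i + 1, subset, i + 2, childCount - i - 1);
        subset[i + 1] = right;
        childCount++;
    }

    boolean contains(int target) {
        int i = firstGE(target);
        if (i < dataCount && data[i] == target) return true;
        return childCount > 0 && subset[i].contains(target);
    }

    // Assumed helper: number of levels below this node, following the leftmost path.
    int height() { return childCount == 0 ? 0 : 1 + subset[0].height(); }

    // Rule 6 check: true if every leaf is exactly `levels` levels below this node.
    boolean leavesAtDepth(int levels) {
        if (childCount == 0) return levels == 0;
        for (int k = 0; k < childCount; k++)
            if (!subset[k].leavesAtDepth(levels - 1)) return false;
        return true;
    }
}
```

With MINIMUM = 1, adding 1, 2, 3 to an empty tree overfills the root with [1, 2, 3]; the split leaves a root [2] with leaves [1] and [3]. Keeping the object that represents the root in place (instead of allocating a new root object) is a design choice of this sketch so that callers' references to the tree stay valid.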

Removing an Element

// Pseudocode for remove
answer = looseRemove(target);
if ((dataCount == 0) && (childCount == 1))
    // Fix the root of the entire tree so that it no longer has 0 elements.

return answer;

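If the root is left with zero elements and a single child, we fix it by merging it with that child: the child's elements and children move up into the root, and the tree becomes one level shorter.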

The looseRemove method:

Deal with one of four cases:

a) The root has no children and target was not found: return false
b) The root has no children and target was found: remove target, return true
c) The root has children and target was not found: recursive call, answer = subset[i].looseRemove(target)
d) The root has children and target was found: remove target by removing largest value from subset[i] and replacing target at data[i] with that largest value

In cases c and d, since we are removing elements from a child node, we may end up with a child that has MINIMUM - 1 elements and needs to be repaired by a fixShortage method. That is, when fixShortage(i) is called, its job is to fix subset[i], which has MINIMUM - 1 elements.

There are four cases to handle:

Transfer an extra element from subset[i-1]

Transfer data[i-1] down to the front of subset[i].data. Shift over the existing elements and add one to subset[i].dataCount.
Transfer the last element of subset[i-1].data up to data[i-1]. Subtract one from subset[i-1].dataCount.
If subset[i-1] has children, transfer its last child to the front of subset[i].subset. Shift over the existing children, add one to subset[i].childCount, and subtract one from subset[i-1].childCount.

Transfer an extra element from subset[i+1]

Similar to transferring an element from subset[i-1], but modified as necessary to move from the next subset.

Combine subset[i] with subset[i-1]

Transfer data[i-1] down to the end of subset[i-1].data, then shift data[i] and every element to its right one position leftward. Subtract one from dataCount and add one to subset[i-1].dataCount.
Transfer all elements and children from subset[i] to the end of subset[i-1]. Update subset[i-1].dataCount and subset[i-1].childCount.
Disconnect subset[i] from the B-tree by shifting subset[i+1] and every child to its right one position leftward. Reduce childCount by one.

Combine subset[i] with subset[i+1]

Similar to combining subset[i] with subset[i-1].
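The two left-sibling cases spelled out above can be sketched as a single method. This is an illustration only: the class name BTreeShortageSketch, the method name fixShortageFromLeft, the builder-style children() helper, and show() are assumptions made for the example, and MINIMUM is set to 1 to keep the trees small. The mirror-image cases that use subset[i+1] are omitted.

```java
import java.util.Arrays;

// Hypothetical sketch of the two left-sibling cases of fixShortage.
class BTreeShortageSketch {
    static final int MINIMUM = 1;            // assumption: tiny value for a small demo
    static final int MAXIMUM = 2 * MINIMUM;

    int dataCount;
    int[] data = new int[MAXIMUM + 1];
    int childCount;
    BTreeShortageSketch[] subset = new BTreeShortageSketch[MAXIMUM + 2];

    // Assumed helper constructor: a node holding the given sorted elements.
    BTreeShortageSketch(int... elements) {
        System.arraycopy(elements, 0, data, 0, elements.length);
        dataCount = elements.length;
    }

    // Assumed helper: attach children and return this node for chaining.
    BTreeShortageSketch children(BTreeShortageSketch... kids) {
        System.arraycopy(kids, 0, subset, 0, kids.length);
        childCount = kids.length;
        return this;
    }

    // Repair subset[i], which holds MINIMUM - 1 elements, using subset[i-1].
    void fixShortageFromLeft(int i) {
        BTreeShortageSketch left = subset[i - 1];
        BTreeShortageSketch lacking = subset[i];
        if (left.dataCount > MINIMUM) {
            // Transfer: data[i-1] moves down, the left sibling's last element moves up.
            System.arraycopy(lacking.data, 0, lacking.data, 1, lacking.dataCount);
            lacking.data[0] = data[i - 1];
            lacking.dataCount++;
            data[i - 1] = left.data[--left.dataCount];
            if (left.childCount > 0) {
                // The left sibling's last child becomes the first child of subset[i].
                System.arraycopy(lacking.subset, 0, lacking.subset, 1, lacking.childCount);
                lacking.subset[0] = left.subset[--left.childCount];
                lacking.childCount++;
            }
        } else {
            // Combine: data[i-1] moves down to the end of the left sibling ...
            left.data[left.dataCount++] = data[i - 1];
            System.arraycopy(data, i, data, i - 1, dataCount - i);
            dataCount--;
            // ... everything in subset[i] is appended to the left sibling ...
            System.arraycopy(lacking.data, 0, left.data, left.dataCount, lacking.dataCount);
            left.dataCount += lacking.dataCount;
            System.arraycopy(lacking.subset, 0, left.subset, left.childCount, lacking.childCount);
            left.childCount += lacking.childCount;
            // ... and subset[i] is disconnected from this node.
            System.arraycopy(subset, i + 1, subset, i, childCount - i - 1);
            childCount--;
        }
    }

    // Assumed helper: the node's elements as text, e.g. "[7, 20]".
    String show() { return Arrays.toString(Arrays.copyOf(data, dataCount)); }
}
```

For example, with a parent [10, 20] over children [3, 7], [] and [25], calling fixShortageFromLeft(1) transfers through the parent and yields a parent [7, 20] over [3], [10] and [25]. If the left sibling were only [3], the same call would combine instead, giving a parent [20] over [3, 10] and [25].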

Remove the Largest Element

Recursively call answer = subset[childCount-1].removeBiggest();

This recurses down the rightmost path. Once we reach the rightmost leaf, we remove its largest element and return it. There is one catch: after each recursive call returns, we need to check that the child still has at least MINIMUM elements and, if not, run fixShortage on it.

B-trees in the Real World

Because B-trees are good at storing items in sequential order and grouping many elements in a single node, they are useful to databases and filesystems. For example, with a MINIMUM of 1000, a B-tree of depth 2 can hold 10⁹ (one billion) elements. Assuming the root node is kept in memory, a search never needs to read more than two nodes from disk. Compare this to a balanced binary tree, which needs about 30 levels to store the same number of elements, so up to 30 nodes would need to be read from disk in the worst case.
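The arithmetic behind these figures can be checked with a short calculation. The class and method names below (BTreeCapacityDemo, maxElements, minBinaryLevels) are made up for this sketch. It assumes the root is at depth 0, that every B-tree node is completely full (MAXIMUM elements and MAXIMUM + 1 children), and that the binary tree is perfectly balanced.

```java
// Hypothetical demo class for the capacity arithmetic; not from the lecture.
class BTreeCapacityDemo {
    // Most elements a B-tree of the given depth can hold (root at depth 0)
    // when every node has MAXIMUM = 2 * minimum elements and MAXIMUM + 1 children.
    static long maxElements(int minimum, int depth) {
        long maximum = 2L * minimum;
        long nodesAtLevel = 1;
        long total = 0;
        for (int level = 0; level <= depth; level++) {
            total += nodesAtLevel * maximum;
            nodesAtLevel *= maximum + 1;
        }
        return total;
    }

    // Fewest levels of a binary tree (one element per node) that can hold n elements.
    // A perfect binary tree with L levels holds 2^L - 1 elements.
    static int minBinaryLevels(long n) {
        int levels = 0;
        long capacity = 0;
        while (capacity < n) {
            levels++;
            capacity = 2 * capacity + 1;
        }
        return levels;
    }
}
```

With MINIMUM = 1000, a depth-1 tree holds at most 4,004,000 elements, which is not enough, while a depth-2 tree holds up to 8,012,006,000, comfortably more than one billion. For one billion elements, minBinaryLevels returns 30.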

B-trees and their variants are used in the HFS+ (Mac), NTFS (Windows), and Ext4 (Linux) filesystems, among others.