Road to C++ Programmer #14

B Trees

In the last article, we discussed RBTs, which maintain balance in a BST and keep time complexity scaling with height $log_2(n)$ . However, we’re not limited to having only one key and two child nodes per node. Like a deque, we can store multiple keys in contiguous memory while utilizing fragmented memory. We can also have more than two child nodes, reducing the height of the tree and traversal time. The B-tree is an attempt to implement such a tree with a self-balancing mechanism during operations.

Each node in a B-tree contains n keys in sorted order and n+1 child nodes, which correspond to the ranges between each key. Below is an example of a B-tree with n=4 and a minimum number of keys set to 2. The root node is a special case that can have fewer than the minimum number of keys. When searching for an element, the process starts with the keys in the root. Then, each comparison will either find an element in a node or determine the appropriate range to continue searching. This process repeats until the key is found, resulting in a time complexity of $O(log(n))$ (or $O(log_5(n))$ ).

class BTreeNode {
public:
    int *keys;
    int minimum;
    BTreeNode **children;
    int n;
    bool isLeaf;
 
    BTreeNode(int minimum, bool isLeaf): maximum(minimum), isLeaf(isLeaf) {
        // Allocate memory
        keys = new int[2 * minimum - 1];
        children = new BTreeNode *[2 * minimum];
        n = 0;
    };
 
    ~BTreeNode() {
        // Delete child nodes recursively
        if (!isLeaf) {
            for (int i = 0; i <= n; ++i) {
                delete children[i];
            }
        }
        delete[] keys;
        delete[] children;
    };
 
    // Search within node
    BTreeNode* search(int k) {
        // Find the first key greater than or equal to k
        int i = 0;
        while (i < n && k > keys[i]) {
            i++;
        }
 
        // If the found key is equal to k, return this node
        if (i < n && keys[i] == k) {
            return this;
        }
 
        // If the key is not found here and this is a leaf node
        if (isLeaf) {
            return nullptr;
        }
 
        // Go to the appropriate child
        return children[i]->search(k);
    };
};
 
class BTree {
public:
    BTreeNode *root;
    int minimum;
 
    BTree(int minimum): minimum(minimum) {
        root = nullptr;
    };
 
    ~BTree() {
        delete root;
    };
 
    // Function to search a key in this tree
    BTreeNode* search(int k) {
        return (root == nullptr) ? nullptr : root->search(k);
    };
};

Above is the implementation of a B-tree for the int data type. Each node contains a list of keys, a list of pointers to its children, the minimum number of children, the current number of keys, and an indicator of whether the node is a leaf. The search function is implemented within the node to find the key among the node's keys or identify the appropriate child to continue the search. The tree's search function calls the root node’s search function to locate a given key.

Insertion

Insertion in a B-tree must be carefully managed to maintain the tree’s balance. When inserting an element into the root, we simply add keys in sorted order until the maximum number of keys is reached. If we need to insert an additional key into a full root node, we split the node in half and create left and right child nodes, each containing the minimum number of keys allowed in the tree. As the node is already sorted, this process does not violate any B-tree rules.

The key is always stored in the appropriate leaf node, and the same mechanism of splitting a full node at its midpoint works for other nodes as well. When a parent node has room for another key, we can place the middle key in the parent and split the node into two children, each representing the range created by adding the key to the parent. This algorithm can be applied recursively to maintain the balance of the B-tree.

void BTreeNode::splitChild(int i, BTreeNode *child) {
    // Making sibling
    BTreeNode *sibling = new BTreeNode(child->minimum, child->isLeaf);
    sibling->n = minimum - 1;
 
    // Transfering keys from child to sibling
    for (int j = 0; j < minimum - 1; j++) {
        sibling->keys[j] = child->keys[j + minimum];
    }
 
    // Transfering children from child to sibling
    if (!child->isLeaf) {
        for (int j = 0; j < minimum; j++) {
            sibling->children[j] = child->children[j + minimum];
        }
    }
 
    child->n = minimum - 1;
 
    // Making room for new child, sibling
    for (int j = n; j >= i + 1; j--) {
        children[j + 1] = children[j];
    }
 
    children[i + 1] = sibling;
 
    // making room for new key from child
    for (int j = n - 1; j >= i; j--) {
        keys[j + 1] = keys[j];
    }
 
    keys[i] = child->keys[minimum - 1];
    n = n + 1;
};
 
void BTreeNode::insertNonFull(int k) {
    int i = n - 1;  // Initialize index of rightmost element
 
    if (isLeaf) {
        // Insert the new key into this leaf node
        while (i >= 0 && keys[i] > k) {
            keys[i + 1] = keys[i];
            i--;
        }
        keys[i + 1] = k;
        n = n + 1;
    } else {
        // Find the child that will have the new key
        while (i >= 0 && keys[i] > k) {
            i--;
        }
 
        // Check if the found child is full
        if (children[i + 1]->n == 2 * minimum - 1) {
            splitChild(i + 1, children[i + 1]);
 
            if (keys[i + 1] < k) {
                i++;
            }
        }
        children[i + 1]->insertNonFull(k);
    }
};
 
int BTree::insert(int k) {
    if (root == nullptr) {
        root = new BTreeNode(minimum, true);
        root->keys[0] = k;
        root->n = 1;
        return 1;
    }
    
    if (root->n == 2 * minimum - 1) {
        BTreeNode *newRoot = new BTreeNode(minimum, false);
        newRoot->children[0] = root;
        newRoot->splitChild(0, root);
 
        int i = 0;
        if (newRoot->keys[0] < k) {
            i++;
        }
        newRoot->children[i]->insertNonFull(k);
 
        root = newRoot;
        return 1;
    }
 
    root->insertNonFull(k);
    return 1;
};

The code above implements insertion for the B-tree node and the B-tree. The splitChild function does not delete the keys split and transferred to the sibling. Instead, as the sibling is always created to the right, adjusting n effectively makes the tree ignore keys beyond this point. The insertNonFull function calls splitChild as needed to recursively insert a key into a non-full node. The insert function utilizes both splitChild and insertNonFull for insertion, handling the special case where the root node is full by creating a new root node. Since the algorithm only needs to traverse each level of the tree once, and the operations at each level do not scale with the number of keys stored, the time complexity of insertion is $O(\log(n))$ .

Deletion

Deleting a key from the B-tree can be as simple as traversing the tree, finding the key, and deleting it from the node without violating any rules. However, deletion may lead to several violations. The first case occurs when the leaf node from which we delete already has the minimum number of keys. In this case, we can take the rightmost or leftmost key from a right or left sibling, substitute it with the parent node separator, and place the separator in the node where the deletion occurred. This substitution maintains the B-tree properties while fixing the minimum key count violation.

However, the above solution only applies if the sibling has more than the minimum number of keys. The second case occurs when the sibling also has the minimum number of keys. This can be handled by merging the node we’re deleting from with its sibling and the parent separator, forming a single child node. If the parent node then has fewer than the minimum number of keys, we can either borrow from a sibling or merge recursively to resolve this violation.

Deleting from an intermediate node introduces additional challenges, as it involves deleting the child nodes’ separator. To fix this violation, we need to borrow the rightmost or leftmost key from a neighboring node and make it the new separator. Any resulting minimum key violation can be addressed by using the previously mentioned sibling borrowing or merging approach. Unfortunately, code implementation for deletion is beyond the scope of this article. For those interested in implementing it, a helpful hint is to create additional member functions like borrow and merge in the BTreeNode class to simplify the implementation. As with insertion, this algorithm only needs to traverse each level of the tree once, and the operations at each level do not scale with the number of stored keys, giving deletion a time complexity of $O(\log(n))$ .

B+ Trees

The B Tree performs exceptionally well in many scenarios, including databases, but it has a limitation when it comes to range-based queries. When querying a database, we often need to retrieve multiple keys at once. For example, if this blog queries 10 articles per page and the maximum number of keys per node is 4, navigating back and forth between parent nodes, grandparent nodes, and leaf nodes becomes costly.

To solve this problem, B+ trees duplicate all keys from parent nodes into the leaf nodes and create a linked list of leaf nodes by adding a pointer to each leaf’s sibling. Therefore, when querying 10 articles, we only need to traverse through the linked list of leaf nodes. This structure allows values to be retrieved in a single pass, making it much more efficient for range-based queries. As the range size increases, the benefit of faster range-based queries outweighs the cost of introducing duplicates and additional pointers.

Database Indexing

B+ trees offer another advantage in the context of databases: faster querying through indexing. Indexing involves attaching a pointer to each record so that queries do not have to inspect every record individually. In a B Tree, nodes at every level contain pointers to records. This is efficient when the database is small and can fit in memory, but as some nodes move to storage (which is much slower to access), queries may encounter unpredictable latency. The problem becomes worse with range-based queries.

However, in a B+ tree, all intermediate nodes exist solely for traversal and do not contain pointers to records; only the leaf nodes hold pointers. This setup makes the intermediate nodes much lighter (typically only 1% of the entire data) and more likely to fit into memory. As a result, queries tend to experience less latency as the database scales, making B+ trees a preferred data structure for databases.

Conclusion

In this article, we discussed how B Trees increase cache hits and reduce tree height and traversal time by enforcing relatively simple rules compared to RBT. B Trees also avoid constant pointer updates and tree restructuring with each insertion or deletion, making them a popular choice in various applications, especially databases and file systems. We also explored B+ trees and why they are preferred for database indexing. For a challenge, you might want to try using a B Tree to implement an ordered set or map in C++.

Resources

Nasser, H. 2021. B-tree vs B+ tree in Database Systems. YouTube.
Spanning Tree. 2024. Understanding B-Trees: The key Structure Behind Modern Databases. YouTube.