Wednesday, April 11, 2007

Trees, Hash Tables and Tries

As a computer scientist, I'm supposed to possess a thorough grounding in fundamental data structures and algorithms. I must be getting rusty, for lately I've been perturbed by basic facts taught to novices.

After some thought I've resolved my difficulties, but now I feel many textbooks need to be rewritten. Either that, or I've made a monumentally embarrassing mistake.

Let's say you want an associative array that maps strings to something.

I've known for years that hash table lookups take O(1) time, and that binary search trees have O(log n) lookup time. (In both cases I'm assuming certain conditions are satisfied, namely that there aren't too many hash collisions and that the binary tree is fairly well-balanced.)

More recently, I learned about crit-bit trees, also known as radix trees, Patricia trees or Patricia tries. (See D. J. Bernstein's crit-bit tree page for more references. Briefly, crit-bit trees are like binary trees with two major differences. Firstly, they are tries: keys are not stored in non-leaf nodes; instead, a node's position in the tree determines the prefix shared by the keys stored in its descendant leaves. Secondly, no node has only one child: a crit-bit tree is a trie in which any node with a single child has been merged with its child.)
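To make the structure concrete, here's a minimal sketch in C of what a crit-bit node and lookup might look like. The names and layout are my own invention for illustration, not Bernstein's actual code: each internal node records the position of the critical bit that distinguishes its two subtrees, and the leaves hold the actual keys.

    #include <stdint.h>
    #include <string.h>

    struct cbnode {
        int is_leaf;
        /* Internal nodes: position of the critical bit. */
        size_t byte;             /* which byte of the key to inspect */
        uint8_t mask;            /* single bit within that byte */
        struct cbnode *child[2];
        /* Leaves: the stored key and its value. */
        const char *key;
        void *value;
    };

    /* Walk from the root, inspecting one critical bit per internal node,
       then confirm the match with a single full-key comparison at the leaf. */
    void *cbtree_lookup(const struct cbnode *t, const char *key, size_t keylen)
    {
        while (t != NULL && !t->is_leaf) {
            uint8_t c = t->byte < keylen ? (uint8_t)key[t->byte] : 0;
            t = t->child[(c & t->mask) != 0];
        }
        if (t != NULL && strcmp(t->key, key) == 0)
            return t->value;
        return NULL;
    }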

What's the cost of a lookup in a radix tree? It sounds like it must be O(log n), because you need a tree of that depth to store n things.

But during a lookup you only need to touch each bit (or character, in a k-ary tree where k is the size of the alphabet) at most once to decide which child of a node to traverse. Isn't this the same amount of work as computing the hash of a string? To compute a hash correctly we must involve every bit of the key. (If traversing the tree is cheap enough, then crit-bit trees can be faster than hash tables. Furthermore, unlike a hash function, a crit-bit lookup examines only certain bits of the key, the aptly named critical bits.)
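For contrast, here's a sketch of a typical string hash (FNV-1a, chosen only to illustrate the point, not because any particular hash table uses it): the loop body runs once per byte, so computing the hash already costs time proportional to the key length, before the table is even probed.

    #include <stdint.h>
    #include <stddef.h>

    /* FNV-1a: every byte of the key is folded into the hash, so hashing a
       key of length m costs O(m) on its own. */
    uint64_t fnv1a(const char *key, size_t len)
    {
        uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= (uint8_t)key[i];
            h *= 1099511628211ULL;              /* FNV prime */
        }
        return h;
    }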

Thus the lookup cost of a radix tree must also be O(1). Therein lies a problem. Why is it that a binary tree of the same size has an O(log n) lookup time? We traverse the same number of nodes!

The answer is that when computer scientists analyze these data structures, we usually treat computing the hash function as a constant-time operation, as an approximation, and likewise a key comparison in the binary tree. But to distinguish n different objects, the keys must be at least log n bits long, which means it actually takes O(log n) time to hash a key or to perform a key comparison.
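A quick back-of-the-envelope check of that arithmetic (just an illustration, not a benchmark): giving n objects distinct keys requires at least ceil(log2 n) bits per key, so key length, and with it the cost of hashing or comparing a key, grows with log n.

    /* Compile with -lm. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Minimum key length needed to tell n objects apart: ceil(log2 n) bits. */
        unsigned long sizes[] = { 1000UL, 1000000UL, 1000000000UL };
        for (int i = 0; i < 3; i++)
            printf("n = %10lu  ->  at least %2.0f bits per key\n",
                   sizes[i], ceil(log2((double)sizes[i])));
        return 0;
    }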

So what we've been saying is that it's okay to ignore a factor of log n. This is sloppy and inconsistent. Why can't we ignore the log n steps required to walk down a binary tree as well? Why can we turn a blind eye to one log n yet scrutinize the other?

We should say that hash lookups and crit-bit tree lookups cost O(log n) (or, more accurately, O(m), where m is the length of the keys, which must be at least log n), and that binary tree lookups cost O((log n)²).

This inaccuracy caused me to overlook, at first, that crit-bit trees could match or even outperform hash tables. It's possible that others have suffered from the same affliction: despite radix trees being described (twice, independently) in 1968, the courses and textbooks from my time do not mention them at all, and I suspect the teachers and authors responsible may also not have realized how fast these data structures can be.

This is a shame, because they have other advantages over hash tables too: crit-bit trees are deterministic with good worst-case times, they grow and shrink as needed (they never require an expensive rehash, nor do they let vast tracts of an array lie fallow), and they support successor and predecessor operations.

The good news is that they seem to be gaining prominence. I feel they are mentioned more often on the web (and I'm helping with that as well!). I've seen evidence that some CS undergraduates these days are at least aware of their existence. The fact that even I know about them now means they can't be that obscure.