Wednesday, April 11, 2007

Trees, Hash Tables and Tries

As a computer scientist, I'm supposed to possess a thorough grounding in fundamental data structures and algorithms. I must be getting rusty, for lately I've been perturbed by basic facts taught to novices.

After some thought I've resolved my difficulties, but now I feel many textbooks need to be rewritten. Either that, or I've made a monumentally embarrassing mistake.

Let's say you want an associative array that maps strings to something.

I've known for years that hash table lookups take O(1) time. And that binary search trees have O(logn) lookup time. (In both cases I'm assuming certain conditions are satisfied, i.e. there aren't too many hash collisions and that the binary tree is fairly well-balanced.)

More recently, I learned about crit-bit trees, aka radix trees, Patricia trees or Patricia tries. (See D. J. Bernstein's crit-bit tree page for more references. Briefly, crit-bit trees are like binary trees with two major differences. Firstly, they are tries, which means keys are not stored in non-leaf nodes; instead, the position of a node in the tree represents a prefix shared by the keys stored in its descendant leaves. Secondly, no node has only one child: a crit-bit tree is a trie in which any node with a single child has been merged with its child.)
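
To make this concrete, here is a rough sketch in C of how a crit-bit lookup walks the tree, loosely modeled on Bernstein's critbit code. The struct layout, the pointer-tagging trick and the names are illustrative only, and insertion is omitted entirely.

    #include <stdint.h>
    #include <string.h>

    struct cb_node {
        void *child[2];     /* each entry is either another cb_node or a key (leaf) */
        uint32_t byte;      /* index of the byte containing the critical bit */
        uint8_t otherbits;  /* every bit set except the critical one */
    };

    /* Internal nodes are tagged by setting the low bit of the pointer. */
    static int is_internal(const void *p) { return ((uintptr_t)p & 1) != 0; }
    static const struct cb_node *as_node(const void *p) {
        return (const struct cb_node *)((uintptr_t)p - 1);
    }

    /* Returns the stored key if present, otherwise NULL. */
    const char *cb_lookup(const void *root, const char *key)
    {
        const uint8_t *ubytes = (const uint8_t *)key;
        size_t keylen = strlen(key);
        const void *p = root;

        if (!p)
            return NULL;

        /* Walk down, inspecting one critical bit per internal node. */
        while (is_internal(p)) {
            const struct cb_node *q = as_node(p);
            uint8_t c = (q->byte < keylen) ? ubytes[q->byte] : 0;
            int direction = (1 + (q->otherbits | c)) >> 8;
            p = q->child[direction];
        }

        /* The skipped bits were never examined, so confirm with one comparison. */
        return strcmp(key, (const char *)p) == 0 ? (const char *)p : NULL;
    }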

What's the cost of a lookup in a radix tree? It sounds like it must be O(logn) because you need a tree of that depth to store n things.

But you only need to touch each bit (or character, in a k-ary tree where k is the size of the alphabet) at most once during a lookup to decide which child of a node to traverse. Isn't this the same as determining the hash of a string? To compute it correctly we must involve every bit of the key. (If traversing the tree is cheap enough, then crit-bit trees are faster than hash tables. Furthermore, unlike a hashing operation, only certain bits, the aptly named critical bits, need be examined in a crit-bit tree.)
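
For comparison, here is the well-known djb2 string hash; the point is only that any such hash function has to loop over every byte of the key, so the "constant-time" hash actually costs time proportional to the key's length.

    #include <stddef.h>

    /* The classic djb2 string hash: the loop must visit every byte of the key,
       so hashing a key of m bytes takes time proportional to m, regardless of
       how many entries the table holds. */
    unsigned long hash_djb2(const char *key)
    {
        unsigned long h = 5381;
        for (const unsigned char *p = (const unsigned char *)key; *p; ++p)
            h = h * 33 + *p;   /* one step per character of the key */
        return h;
    }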

Thus the lookup cost of a radix tree must also be O(1). Therein lies a problem. Why is it that a binary tree of the same size has an O(logn) lookup time? We travel along the same number of nodes!

The answer is that usually when computer scientists analyze these data structures, as an approximation, we view computing the hash function as a constant-time operation, and similarly for a comparison in the binary tree. But to refer to n different objects, the keys must be at least logn bits long, which means it actually takes O(logn) time to hash a key or to perform a key comparison.
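
A quick sanity check of that bound: with keys of m bits you can distinguish at most 2^m objects, so to tell apart n = 1,000,000 of them each key needs at least log2(1,000,000) ≈ 20 bits, about three bytes.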

So what we've been saying is that it's okay to ignore a factor of logn. This is sloppy and inconsistent. Why can't we ignore the logn steps required to walk down a binary tree as well? Why can we turn a blind eye to one logn yet scrutinize the other?

We should say that hash lookups and crit-bit tree lookups cost O(logn) (or more accurately, O(m) where m is the size of the keys, which must be at least logn), and that binary tree lookups cost O((logn)²).
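
To put rough numbers on this: with n = 2^20 keys of the minimum possible length of 20 bits, a balanced binary search tree does about 20 comparisons, each of which may examine up to 20 bits, roughly 400 bit inspections in the worst case, whereas a crit-bit lookup examines at most 20 critical bits plus one final comparison against the stored key.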

This inaccuracy caused me to initially overlook that crit-bit trees could match or even outperform hash tables. It's possible that others have suffered from the same affliction: despite radix trees being described (twice, independently) in 1968, the courses and textbooks from my time did not mention them at all, and I suspect the teachers and authors responsible may also not have realized how fast these data structures can be.

This is a shame because they have other advantages over hash tables too: crit-bit trees are deterministic with good worst-case times, grow and shrink as needed (they never require expensive rehashing nor let vast tracts of an array lie fallow), and support successor and predecessor operations.

The good news is that they seem to be gaining prominence. I feel they are mentioned more often on the web (and I'm helping with that as well!). I've seen evidence that some CS undergraduates these days are at least aware of their existence. The fact that even I know about them now means they can't be that obscure.

13 comments:

Chad Austin said...

If your data fits in memory, O(1) = O(log n)

Juha-Mikko said...

Big O notation is for complexity, not speed. O(1) can take longer than O(n): it depends on the function.

O(1) means it takes constant time, regardless of how long that constant is. O(anything involving n) means the more elements there are to process, the more time is needed.

caezar said...

Go the other way with this; almost all O(1) operations are actually O(nlogn). For instance, "multiply" might appear to be a constant time operation but it actually grows with the number of bits in the value (rather than the number of bits available in the integer type).

If you expect a log distribution of symbols, it would be better to store them in a logn structure so that your most frequent elements are fastest.

A reserved symbol might take exactly one node of a trie or a tree if it is never ambiguously used as a sub-symbol of other symbols, making it faster than a lookup for other symbols. That might be a big win in practice.

I heard once that the definition of "systems programming" is simply "programming where the constant coefficients matter".

Vilhelm S said...

At the end of the day, we care about what is faster to actually run on a computer; complexity theory is a tool to help us figure out why something is fast or slow.

In practice, hash tables tend to be several times faster than search trees. Why is this? Well, it's because on a modern CPU, computing a hash of a short string is essentially free compared to the cost of doing a non-cached memory access. So the hash table, which needs to do O(1) memory accesses, is much faster than the tree, which needs to do O(log n) of them. The point being: we first need to figure out what operations are costly on our computers, then we can start counting how many such operations the algorithm does.

There is a similar example (I forget who pointed this out) that the number of comparisons done by a sorting algorithm predicts its running time really well, and so the classic results about O(n log n) comparisons are still relevant. This should sound quite fishy, because on modern computers you can compare numbers really really quickly (single cycle) compared to doing most other things. The answer turns out to be that what is really costing you is not the comparison operation but the mispredicted branches -- but if the outcome of the comparisons is random, the two things are proportional to each other.

Jay Freeman (saurik) said...

I seriously just spent an hour and a half writing a 4,096 character treatise on hash tries, only to go back and notice the article was talking about radix tries in general, not when you are doing a lookup using a hash function. Wow: go me! :(

lisper001 said...

Because the O(log N) for a binary tree lookup is a memory hit for each of the log N steps, while the 'hidden log N' for hashes is done entirely in the CPU and is blazingly fast.

Zemantic dreams said...

When you actually implement those solutions in performance-sensitive programs, you notice that random memory access is extremely expensive. Computing a hash is a much more efficient operation.

There might be a special case where you have a very small mapping that fits nicely in L1 cache and could be traversed faster than the hash value can be calculated. However, branch mispredictions at each node would quickly add up.

Jack said...

Hashing is most definitely NOT O(log n). This would mean that your cost to hash scales with the log of the number of elements you're hashing, which is wrong. If you want to define m (or k) as the average length of your input, then you might have an argument for hashing being an O(m) function, but as others have pointed out, that is a blazingly fast O(m), which disappears in comparison to a tree traversal, which wrecks your cache.

Also: crit-bit trees are pretty good, but they only shine when they are storing large numbers of elements. For small collections you have to both store and traverse more nodes than a good binary or 2-3-4 tree would give you.

Darren said...

The advantage of hashes is that you can design your hash function so that if you update a string (change one character) you do not need to go through the rest of the characters in order to calculate the new hash.

But you still make a good point that big-O notation is extremely inconsistent and ill-defined.

Gareth McCaughan said...

Darren: (1) Ben's point isn't that big-O notation is inconsistent or ill-defined (it's neither), it's that it's often used sloppily. (2) I don't believe I have ever encountered a system that makes use of the fact that some string hash functions can be updated incrementally; no doubt some such systems exist, but it's hardly "*the* advantage of hashes". (In computer chess, however, it is common to use hash functions with a similar incremental property and take advantage of it. But no one is suggesting that chess programs' transposition tables be implemented with tries.)

Gabriel C. said...

I think you're mixing things:
n in O notation represents the length of the input, but to compute the hash key or traverse the trie you need to do at most k operations, where k is the length of your longest key. So lookups are still constant = O(1) and don't vary with the size of the input.
Besides, if we assume that calculating the hash is not trivial and is O(log n), why can we assume that comparing two keys in the binary tree is O(1)? Comparing strings is also O(log n), being linear in the length of the string.

Reid Kleckner said...

Especially in Java or Python, where the native string data types cache their hash values, computing the hash can be equivalent to reading the value from memory. In the case that you're implementing a namespace (i.e., the majority of dictionary objects in Python) all the strings are interned, so you only have to resort to comparing the bytes of the string for equality in the event of a hash collision. So the common case is very good: two RAM accesses, one for the hash, and one for the table entry.

That said, I totally agree with your post! The mantra that hash tables have O(1) access time is so well repeated that we never consider the effects of extremely long keys, such as, say, C++ mangled names.

Jeffrey Yasskin said...

+1 to Vilhelm's point. Not thrashing the cache is really important, and tries (and all other linked containers) tend to fail this badly, while probed hash tables tend to do well. There's been a lot of research about this recently under the headings of cache-aware and cache-oblivious algorithms.