I’ve been pleased with my recent implementation of crit-bit trees based on Adam Langley’s code, but in the back of my mind, I’ve been bothered by Bernstein’s statement that the overhead on top of each string stored need only be two control words: one pointer and one integer.
To obtain relief, I updated my library so that internal nodes only have a single child pointer instead of one pointer each for the left and right child nodes. In a crit-bit tree, an internal node always has two children, so we may as well allocate sibling nodes in one contiguous chunk of memory.
Bernstein suggests storing strings directly in the crit-bit tree, which can be done by storing the left string backwards: it resides to the left of the integer control word. I’ve chosen an alternate scheme which he also mentions: expressing the strings as pointers in a separate storage pool.
Removing one pointer per internal node has several benefits on a 64-bit machine. Firstly, malloc is typically 16-byte aligned, which means that the 24-byte internal nodes of the original version were actually taking 32 bytes. In contrast, now we only need just one pointer and one integer, so we can fit internal nodes within 16 bytes.
Secondly, we no longer have to tag pointers. Instead, 8 bytes hold an unadulterated child pointer, and the other 8 bytes store the crit-bit position as well as one bit that indicates whether the current node is internal or not.
Thirdly, allocating two nodes at once only requires a single malloc() call. Before, we would call malloc() to allocate a new leaf node, and again to allocate a new internal node.
Again, "unrolling" the child pointer lookup produced better benchmarks on my system, namely, instead of p = p->kid + predicate(), we write p = predicate() ? p->kid + 1 : p->kid. This obviated the need for some fun but complex micro-optimizations. While I was at it, I unrolled a few other routines.
Benchmarks:
old | new | |
---|---|---|
insert | 0.066135 | 0.056116 |
get | 0.043584 | 0.040845 |
iterate | 0.031851 | 0.015135 |
allprefixed | 0.013285 | 0.011924 |
delete | 0.048030 | 0.041754 |
overhead | 3966824 | 3173464 |
old | new | |
---|---|---|
insert | 2.700949 | 2.359999 |
get | 2.304070 | 2.214699 |
iterate | 0.912950 | 0.472594 |
allprefixed | 0.295322 | 0.264326 |
overhead | 79999984 | 63999992 |
delete | 2.339102 | 2.177294 |
The insert benchmark of my old library is slightly worse than the one I originally posted because I have since tweaked the code to make it easy to carry out operations such as "insert or replace" or "insert if absent" in a single step.
2 comments:
Hi Ben, Do you have a version of blt for non-strings similar to your implementation for CBT?
Sorry, no. I haven't had time to work on this.
Post a Comment