Friday, April 18, 2014

Crit-bit trees yet again

I’ve been pleased with my recent implementation of crit-bit trees based on Adam Langley’s code, but in the back of my mind, I’ve been bothered by Bernstein’s statement that the overhead on top of each string stored need only be two control words: one pointer and one integer.

To obtain relief, I updated my library so that internal nodes only have a single child pointer instead of one pointer each for the left and right child nodes. In a crit-bit tree, an internal node always has two children, so we may as well allocate sibling nodes in one contiguous chunk of memory.

Bernstein suggests storing strings directly in the crit-bit tree, which can be done by storing the left string backwards: it resides to the left of the integer control word. I’ve chosen an alternate scheme which he also mentions: expressing the strings as pointers in a separate storage pool.

Removing one pointer per internal node has several benefits on a 64-bit machine. Firstly, malloc is typically 16-byte aligned, which means that the 24-byte internal nodes of the original version were actually taking 32 bytes. In contrast, now we only need just one pointer and one integer, so we can fit internal nodes within 16 bytes.

Secondly, we no longer have to tag pointers. Instead, 8 bytes hold an unadulterated child pointer, and the other 8 bytes store the crit-bit position as well as one bit that indicates whether the current node is internal or not.

Thirdly, allocating two nodes at once only requires a single malloc() call. Before, we would call malloc() to allocate a new leaf node, and again to allocate a new internal node.

Again, "unrolling" the child pointer lookup produced better benchmarks on my system, namely, instead of p = p->kid + predicate(), we write p = predicate() ? p->kid + 1 : p->kid. This obviated the need for some fun but complex micro-optimizations. While I was at it, I unrolled a few other routines.

Benchmarks:

Table 1. Keys: /usr/share/dict/words
old new

insert

0.066135

0.056116

get

0.043584

0.040845

iterate

0.031851

0.015135

allprefixed

0.013285

0.011924

delete

0.048030

0.041754

overhead

3966824

3173464

Table 2. Keys: output of seq 2000000
old new

insert

2.700949

2.359999

get

2.304070

2.214699

iterate

0.912950

0.472594

allprefixed

0.295322

0.264326

overhead

79999984

63999992

delete

2.339102

2.177294

The insert benchmark of my old library is slightly worse than the one I originally posted because I have since tweaked the code to make it easy to carry out operations such as "insert or replace" or "insert if absent" in a single step.

https://github.com/blynn/blt

2 comments:

Nisha said...

Hi Ben, Do you have a version of blt for non-strings similar to your implementation for CBT?

Ben Lynn said...

Sorry, no. I haven't had time to work on this.