Sunday, November 3, 2013

Crit-bit tree micro-optimizations

Long ago, I implemented crit-bit trees (one of the many names of this data structure) after skimming a few online descriptions. I made obvious choices and put little effort into optimization. It worked well enough for my projects.

Earlier this year, I studied Adam Langley’s crit-bit tree library, and was inspired to write a new crit-bit tree library from scratch. Micro-optimizations are fun! And surely my rebooted library would be faster thanks to ingenious techniques such as tagged pointers and fancy bit-twiddling (notably a SWAR hack to find the critical bit).

On my machine, the benchmarks show improvement in space and time. Iteration is much slower since I opted to forgo linked-list pointers as advocated by Bernstein. Without them, we must travel up and down the tree to go from one leaf node to the next.

However, an application wishing to visit every element should not do this: it should instead call the allprefixed routine with a blank prefix. The difference between allprefixed and the my original library’s iterate is small enough that I’m inclined to agree with Bernstein: we’re better off without auxiliary next and previous pointers.

When I modified the benchmark program to measure the classes of C++, I changed char* to string, and an array to a vector which are natural choices for C++ and should not significantly hurt running times. As theory suggests, map insertion and lookup is slower, while unordered_map is much faster; as long as the drawbacks of the latter are borne in mind, it may be a reasonable choice for certain applications.

I was pleased my rewritten library sometimes seemed a touch faster than the library that inspired it ("critbit" in the tables below), especially since my library implements a map and not just a set. The extra optimizations it performs are listed in comments in the C file. My guess is that most of them do almost nothing, and "unrolling" the child pointer lookup is largely responsible for the gains.

I’m naming my code the BLT library for now. I already used up my first choice in my old "cbt" library. Though immodest, "Ben Lynn Trees" is easy for me to remember. Also, the name coincides with a delicious sandwich.

In the following tables, all entries are in seconds, except for the overhead which is measured in bytes. I used tcmalloc to get the best numbers.

Table 1. Keys: /usr/share/dict/words
BLT cbt critbit map unordered_map

insert

0.062939

0.085081

0.070779

0.086181

0.050796

get

0.044096

0.046323

0.053218

0.077201

0.034249

allprefixed

0.013312

0.015073

iterate

0.031755

0.006420

0.006167

0.004334

delete

0.048093

0.050687

0.051923

0.009572

0.011820

overhead

3966824

6346992

Table 2. Keys: output of seq 2000000
BLT cbt critbit map unordered_map

insert

2.675797

3.048633

2.683036

3.175450

1.457466

get

2.328416

2.477317

2.510085

2.913757

0.919519

allprefixed

0.296389

0.303059

iterate

0.917041

0.171143

0.193081

0.131707

delete

2.352568

2.522654

2.250305

0.299835

0.316262

overhead

79999984

128000048

Try BLT now!

No comments: