Thursday, November 21, 2013

To Brute-Force A Mockingbird

To Mock a Mockingbird by Raymond M. Smullyan should be required reading for any fan of the programming language Haskell. We learn combinatory logic through a series of delightful puzzles, almost without realizing.

We’re asked to imagine a forest populated by talking birds. On encountering one of these birds, we may call out the name of any bird. In response, the bird will say the name of some bird in the forest. (The reply could be the same bird we named, or the bird’s own name, or any other bird.)

An enchanted forest populated by birds is disarmingly endearing. We’re almost unaware we’re actually dealing with a set of functions that take a function as input and return a function. The evocative backdrop also pays homage to Haskell Curry, who was an avid birdwatcher.

One puzzle challenges us to find an egocentric bird given that a lark lives in the forest. Or, using mundane terminology, given a combinator L such that (Lx)y = x(yy) for all x and y, construct a combinator E such that EE = E.

The author states his solution is “a bit tricky” and consists of 12 correctly parenthesized Ls. Furthermore, the author states he doesn’t know if a shorter solution exists.

To maximize the likelihood of solving this puzzle, the reader should take advantage of facts learned from previous puzzles, and build up to the solution in stages. But that’s only if you’re playing fair and using pencil and paper! Instead, I saw an opportunity to bust out one of my favourite snippets of code.

Seeing the forest for the trees

Let us first recast the problem in terms of trees. Instead of Ls and parentheses, we work with syntax trees. In other words, we work with full binary trees where each leaf node corresponds to an L, and to evaluate an internal node, we recursively evaluate its child nodes, then apply the left child to the right child. (In Smullyan’s terminology, we call out the name of the bird represented by the right child to the bird represented by the left child.)

In this setting, the puzzle is to find a full binary tree such that repeatedly transforming parts of the tree according to the rule (Lx)y = x(yy) produces a tree where both of the root node’s children are identical to the original tree.

Hence to solve with brute force, we need only generate all full binary trees containing up to 12 leaf nodes, and for each one see if we can transform the tree into two copies of itself.

Here’s where my treasured routine comes in. The following roughly describes how to call a function on every full binary tree with exactly n leaf nodes:

  • Allocate a node x.

  • If n is 1, then mark x as a leaf, call the function, then return.

  • Otherwise mark x as an internal node, and for every 0 < k < n:

    • For every full binary tree y with k leaf nodes:

      • Set the left child of x to y.

      • For every full binary tree z with n - k leaf nodes:

        • Set the right child of x to z.

        • Call the function.

We generate the left and right subtrees by calling this algorithm recursively. More precisely, in Go:

type node struct {
kind int // 0 = leaf, 1 = branch.
left, right *node

// For each full binary tree with n leaf nodes,
// sets '*p' to a pointer to the tree and calls the given function.
func forall_tree(p **node, n int, fun func()) {
var t node
*p = &t
if (n == 1) {
t.kind = 0
t.kind = 1
for k := 1; k < n; k++ {
forall_tree(&t.left, k, func() {
forall_tree(&t.right, n - k, fun)

I was proud when I found this gem a few years ago while working on a Project Euler problem, though I’d be shocked if it were original. [Actually, my first version preallocated an array of 2n - 1 nodes and used indices instead of pointers save a bit of time and space, but this is less elegant.]

For example, we can print the first 10 Catalan numbers:

func main() {
for k := 1; k <= 10; k++ {
var root *node
n := 0
forall_tree(&root, k, func() { n++ })

Or print all full binary trees with exactly 6 leaf nodes, as parenthesized expressions:

func tree_print(p *node) {
if p.kind == 1 {

func main() {
var root *node
forall_tree(&root, 6, func() { tree_print(root); println() })

With a little more effort, we can write a program that solves the puzzle. However, some care is needed: if we replace subtrees of the form (Lx)y with x(yy) and vice versa without rhyme nor reason, we’ll have no idea when we’ll finish and we’ll only stumble across a solution by chance.

Instead, we observe that (Lx)y is either strictly smaller than x(yy), or has the form (Lx)L. Let us say that we are reducing the tree when we replace x(yy) by (Lx)y, and expanding when we perform the reverse. Thus rather than start from a tree t and repeatedly applying the rule to obtain the tree tt, we do the reverse: we start from tt, and consider reductions only. The above observation implies that every sequence of reductions must terminate eventually.

But what if we need to temporarily expand tt before reducing it in order to reach t? Let’s optimistically hope that Smullyan’s 12-L solution was sufficiently expanded; that is, only reductions are needed to go from tt to t, where t is his solution.

Multiple subtrees may be reducible, and choosing the wrong one prevents future reductions necessary to reach the goal. We therefore try every path: for each possible reduction, we apply it and recurse. This leads to a problem of wastefully repeating many computations because there can be several ways to arrive at the same tree. We tackle this in an obvious manner: by remembering the trees we’ve seen so far.

I wrote solutions in GNU C and Go. The Go solution is a bit too slow for comfort. The C code is slightly clumsier mainly because I had to name the anonymous functions (though one can define a macro to work around this). C also lacks a map data structure, but this was no problem thanks to my recently released BLT library.

Results (Spoiler Alert)

Optimism paid off. On my laptop, my C program took well under a minute to find 4 solutions:


The Go version took about 10 minutes.

These all contain 12 Ls, so in a sense, Smullyan’s solution is minimal. Since no other strings are printed, these four 12-L solutions are minimal when only reductions are permitted.

If we allow expansions (that is, (Lx)y → x(yy)), then firstly, we have at least 24 = 16 solutions of length 12, since in this case (L(LL)) and (LL(L)) are interchangeable, and secondly, we can reduce the above strings to find shorter solutions. For example, the solution:


reduces to:


which further reduces to:


which only has 8 Ls. I doubt Smullyan missed this. My guess is he meant that if you solve the problem the clever way, then you arrive at an expression with 12 Ls; reductions should be ignored because they only obscure the ingenuity of the solution.

Is there, say, a 7-L expression that expands to some expression (that is necessarily longer than 24 Ls) which can be reduced to half of itself? I think not, but I have no proof.

Exercise: Four Fours

I wanted to do something with the Go version of the the forall_tree() routine, so I tweaked it to solve the four fours puzzle. I just ploughed through all possible trees and evaluated each one; there are only 320 to do. For larger variants of the puzzle, I’d use dynamic programming; that is, memoize subtrees and their values. Division by zero is annoying, but I managed to keep the tree evaluation function short and sweet by using Go’s panic-recover mechanism.

Recursion: the best thing since recursion

The forall_tree() function is but one example of the eloquence of anonymous functions and recursion. For similar reasons, nested functions are also indispensable. We attain economy of expression by letting the stack automatically take care of the heavy lifting.

Curiously, early structured languages including ALGOL, Simula, and Pascal supported nested functions, but C shied away from this beautiful feature.

Its absence is sorely missed, as can be seen in C-style callbacks. An ugly void * argument is inevitably passed and cast. For instance, take the udata parameter in the SDL audio API.

Its absence is also egregiously gratuitous. GCC supports nested functions by using this one weird trick [compiler writers hate him!]. One might complain this weird trick conflicts with executable space protection, but executable space protection is worse than pointless thanks to return-oriented programming.

Code Complete

Wednesday, November 13, 2013

Chocolate, Logic Puzzles, and Dancing Links

Early last year, I visited a cafe that awards chocolate for solving logic puzzles. Naturally, I couldn’t resist free chocolate, and afterwards, just as naturally, I couldn’t resist thinking about programs that solve logic puzzles.

I’ve had to write such programs before for homework assignments or to test out frameworks. But oddly, I had never put much effort into it. I loved logic puzzles as a kid, even going so far as to buy a magazine or two that were filled with grids and clues. Why hadn’t I already written a decent tool to help?

Better late than never. After a spate of intense coding, I had a program that read clues in a terse text format and used brute force to find the solution. I spent most of my time devising the mini-language for the clues rather than the algorithm, as I figured the bottleneck would be the human entering the clues.

My solver worked well enough on a few examples, including the puzzle I solved to get a chocolate. But then I tried my program on the Zebra Puzzle. I was too impatient to let it finish. After a little thought, it was clear why.

On the grid

Logic grid puzzles can be abstracted as follows. We are given a table with M rows and N columns, and each cell contains a unique symbol. Our goal is to rearrange the symbols within each row except for those in the first row so that given constraints are satisfied. To be clear, symbols must stay in the row they started in, but apart from the first row, they can change places with other symbols in their row.

Some examples of constraints:

  • symbols A and B must be in the same column.

  • symbols A, B, and C must be in distinct columns.

  • symbol A’s column must be exactly one to the left of symbol B’s column.

We fix the first row because of contraints such as the last example: clues often refer to the order of the elements in the first row in the input table. Let us call them order contraints. This inspires the following convenient relabeling. Without loss of generality, let the symbols of the first row be 1, …, N from left to right.

My brute force solver generates every possible table and prints the ones that satisfy all given constraints. That means it has to examine up to N!(M-1) cases: there are N! permutations of the symbols in each row, and we have M-1 rows to rearrange. For the Zebra Puzzle, this is 1205.

Got it covered

I needed a smarter approach. Since I had already coded a sudoku solver, I chose the same approach, namely, represent a puzzle as an instance of the exact cover problem, then use the dancing links algorithm.

Firstly, we populate a set X with all the symbols in the table. Next, instead of generating every table, generate every possible column. Each such column corresponds to the subset of X consisting of the symbols in the column. Generating every possible column means brute force is still present, but to a lesser degree.

An exact cover of X corresponds to a collection of columns such that no two columns contain the same symbol, and furthermore, each symbol appears in one of the columns. Thus these columns can be joined together to form a candidate solution to the logic puzzle: we merely order them so that the first row reads 1, …, N.

It remains to disallow covers that violate the constraints. For some constraints, we achieve this by omitting certain columns. For example, suppose A and B must be in the same column. When we generate a column that only contains one of A or B, we omit it from the exact cover instance. Similarly, if a constraint requires that A and B lie in distinct columns, we omit subsets that contain both symbols from the exact cover instance.

Out of order

The above suffices for many puzzles, but falls short for those with order constraints such as "A and B lie in adjacent columns". For this particular constraint, we add N elements to X. Let us call them x[1], …, x[N]. Given a generated column whose first row contains the number n (recall we have relabeled so that the first row of the input table is 1, …, N), if the column contains:

  • both A and B: eliminate the column from consideration.

  • A and not B: we add x[n] to the column’s corresponding subset.

  • B and not A: we add x[k] to the column’s corresponding subset for all k in 1..N where |n-k| is not 1.

Lastly, we mark x[1] ,…, x[N] as optional, meaning that they need not be covered by our solution. (We could instead augment our collection of subsets with singleton subsets {x[n]} for each n, but with dancing links there’s a faster way.)

Thus any exact cover must have A and B in adjacent columns to avoid conflicts in the x[1], …, x[N] elements. We can handle other order constraints in a similar fashion.

The size of X is the number of symbols, MN, plus N for each order constraint. The number of subsets is the number of possible columns, which is NM, because we have M rows, and each can be one of N different symbols. Each subset has size M, one for each row, plus up to N more elements for each order constraint that involves it.

That’ll do

Though NM may be exponential in input size, typical logic grid puzzles are so small that my program solves them in a few milliseconds on my laptop. The bottleneck is now indeed the human entering the clues. [I briefly thought about writing a program to automate this by searching for keywords in each sentence.]

I was annoyed the above algorithm is adequate. Firstly, trying entire columns in a single step bears no resemblance to actions performed by human solvers. Secondly, it was too easy: I wish I were forced to think harder about the algorithm! Perhaps I’ll devise a larger puzzle that demands a better approach.

Around this time I lost focus because I felt my old code could use a few edits. I got sidetracked rewriting my dancing-links sudoku solver and comparing it against other ways of solving sudoku, and soon after that I moved on.

Luckily for my code, I recently felt compelled to clean it up enough for public release. It seemed a pity to let it languish in a forgotten corner until rendered useless from bit rot and lack of documentation.

The DLX library is now available at a git repository near you:

Tuesday, November 12, 2013

Lies, damned lies, and frequentist statistics

Earlier this year I rekindled an interest in probability theory. In my classes, Bayes' theorem was little more than a footnote, and we drilled frequentist techniques. Browsing a few books led me to question this. In particular, though parts of Jaynes' "Probability Theory: The Logic of Science" sounded like a conspiracy theory at first, I was soon convinced that the author’s militant condemnation of frequentism was justified.

Today, I had the pleasure of reading a Nature article about a paper by Valen E. Johnson directly comparing Bayesian and frequentist methods in scientific publications, who suggests the latter is responsible for a plague of irreproducible findings. I felt vindicated; or rather, I felt I had several more decibels of evidence for the hypothesis that Bayesian methods produce far better results than frequentist methods when compared against the hypothesis that the two methods produce equivalent results!

This post explains it well. In short, frequentist methods have led to bad science.

An apologist might retort that it’s actually the fault of bad scientists, who are misusing the methods due to insufficient understanding of the theory. There may be some truth here, but I still argue that Bayesian probability should be taught instead. I need only look at my undergraduate probability and statistics textbook. On page 78, I see the 0.05 P-value convention castigated by Johnson, right after recipe-like instructions for computing a P-value. If other textbooks are similar, no wonder scientists are robotically misapplying frequentist procedures and generating garbage.

Johnson’s recommended fix of using 0.005 instead 0.05 is curious. I doubt it has firm theoretical grounding, but perhaps the nature of data that most scientists collect mean that this rule of thumb will usually work well enough. Though perhaps striving for the arbitrary 0.005 standard may require excessive data: a Bayesian method might yield similar results with less input. I guess it’s an expedient compromise. Those with poor understanding of statistical inference can still obtain decent results, at the cost of gathering more data than necessary.

The above post also mentions a paper describing how even a correctly applied frequentist technique leads to radically different inferences from a Bayesian one. The intriguing discussion within is beyond me, but I’m betting Bayesian is better; or rather, the prior I’d assign to the probability that Bayesian inference will one day shown to be better is extremly close to one!

Sunday, November 3, 2013

Crit-bit tree micro-optimizations

Long ago, I implemented crit-bit trees (one of the many names of this data structure) after skimming a few online descriptions. I made obvious choices and put little effort into optimization. It worked well enough for my projects.

Earlier this year, I studied Adam Langley’s crit-bit tree library, and was inspired to write a new crit-bit tree library from scratch. Micro-optimizations are fun! And surely my rebooted library would be faster thanks to ingenious techniques such as tagged pointers and fancy bit-twiddling (notably a SWAR hack to find the critical bit).

On my machine, the benchmarks show improvement in space and time. Iteration is much slower since I opted to forgo linked-list pointers as advocated by Bernstein. Without them, we must travel up and down the tree to go from one leaf node to the next.

However, an application wishing to visit every element should not do this: it should instead call the allprefixed routine with a blank prefix. The difference between allprefixed and the my original library’s iterate is small enough that I’m inclined to agree with Bernstein: we’re better off without auxiliary next and previous pointers.

When I modified the benchmark program to measure the classes of C++, I changed char* to string, and an array to a vector which are natural choices for C++ and should not significantly hurt running times. As theory suggests, map insertion and lookup is slower, while unordered_map is much faster; as long as the drawbacks of the latter are borne in mind, it may be a reasonable choice for certain applications.

I was pleased my rewritten library sometimes seemed a touch faster than the library that inspired it ("critbit" in the tables below), especially since my library implements a map and not just a set. The extra optimizations it performs are listed in comments in the C file. My guess is that most of them do almost nothing, and "unrolling" the child pointer lookup is largely responsible for the gains.

I’m naming my code the BLT library for now. I already used up my first choice in my old "cbt" library. Though immodest, "Ben Lynn Trees" is easy for me to remember. Also, the name coincides with a delicious sandwich.

In the following tables, all entries are in seconds, except for the overhead which is measured in bytes. I used tcmalloc to get the best numbers.

Table 1. Keys: /usr/share/dict/words
BLT cbt critbit map unordered_map






























Table 2. Keys: output of seq 2000000
BLT cbt critbit map unordered_map






























Try BLT now!