## Thursday, November 21, 2013

### To Brute-Force A Mockingbird

To Mock a Mockingbird by Raymond M. Smullyan should be required reading for any fan of the programming language Haskell. We learn combinatory logic through a series of delightful puzzles, almost without realizing it.

We’re asked to imagine a forest populated by talking birds. On encountering one of these birds, we may call out the name of any bird. In response, the bird will say the name of some bird in the forest. (The reply could be the same bird we named, or the bird’s own name, or any other bird.)

An enchanted forest populated by birds is disarmingly endearing. We’re almost unaware we’re actually dealing with a set of functions that take a function as input and return a function. The evocative backdrop also pays homage to Haskell Curry, who was an avid birdwatcher.

One puzzle challenges us to find an egocentric bird given that a lark lives in the forest. Or, using mundane terminology, given a combinator L such that (Lx)y = x(yy) for all x and y, construct a combinator E such that EE = E.

The author states his solution is “a bit tricky” and consists of 12 correctly parenthesized Ls. Furthermore, the author states he doesn’t know if a shorter solution exists.

To maximize the likelihood of solving this puzzle, the reader should take advantage of facts learned from previous puzzles, and build up to the solution in stages. But that’s only if you’re playing fair and using pencil and paper! Instead, I saw an opportunity to bust out one of my favourite snippets of code.

#### Seeing the forest for the trees

Let us first recast the problem in terms of trees. Instead of Ls and parentheses, we work with syntax trees. In other words, we work with full binary trees where each leaf node corresponds to an L, and to evaluate an internal node, we recursively evaluate its child nodes, then apply the left child to the right child. (In Smullyan’s terminology, we call out the name of the bird represented by the right child to the bird represented by the left child.)

In this setting, the puzzle is to find a full binary tree such that repeatedly transforming parts of the tree according to the rule (Lx)y = x(yy) produces a tree where both of the root node’s children are identical to the original tree.

Hence to solve with brute force, we need only generate all full binary trees containing up to 12 leaf nodes, and for each one see if we can transform the tree into two copies of itself.

Here’s where my treasured routine comes in. The following roughly describes how to call a function on every full binary tree with exactly n leaf nodes:

• Allocate a node x.

• If n is 1, then mark x as a leaf, call the function, then return.

• Otherwise mark x as an internal node, and for every 0 < k < n:

• For every full binary tree y with k leaf nodes:

• Set the left child of x to y.

• For every full binary tree z with n - k leaf nodes:

• Set the right child of x to z.

• Call the function.

We generate the left and right subtrees by calling this algorithm recursively. More precisely, in Go:

```go
type node struct {
    kind        int // 0 = leaf, 1 = branch.
    left, right *node
}

// For each full binary tree with n leaf nodes,
// sets '*p' to a pointer to the tree and calls the given function.
func forall_tree(p **node, n int, fun func()) {
    var t node
    *p = &t
    if n == 1 {
        t.kind = 0
        fun()
        return
    }
    t.kind = 1
    for k := 1; k < n; k++ {
        forall_tree(&t.left, k, func() {
            forall_tree(&t.right, n-k, fun)
        })
    }
}
```

I was proud when I found this gem a few years ago while working on a Project Euler problem, though I’d be shocked if it were original. [Actually, my first version preallocated an array of 2n - 1 nodes and used indices instead of pointers to save a bit of time and space, but this is less elegant.]

For example, we can print the first 10 Catalan numbers:

```go
func main() {
    for k := 1; k <= 10; k++ {
        var root *node
        n := 0
        forall_tree(&root, k, func() { n++ })
        println(n)
    }
}
```

Or print all full binary trees with exactly 6 leaf nodes, as parenthesized expressions:

```go
func tree_print(p *node) {
    if p.kind == 1 {
        print("(")
        tree_print(p.left)
        tree_print(p.right)
        print(")")
        return
    }
    print("L")
}

func main() {
    var root *node
    forall_tree(&root, 6, func() { tree_print(root); println() })
}
```

With a little more effort, we can write a program that solves the puzzle. However, some care is needed: if we replace subtrees of the form (Lx)y with x(yy) and vice versa without rhyme or reason, we’ll have no idea when we’ll finish and we’ll only stumble across a solution by chance.

Instead, we observe that (Lx)y is either strictly smaller than x(yy), or has the form (Lx)L. Let us say that we are reducing the tree when we replace x(yy) by (Lx)y, and expanding when we perform the reverse. Thus rather than starting from a tree t and repeatedly applying the rule to obtain the tree tt, we work backwards: we start from tt and consider reductions only. The above observation implies that every sequence of reductions must terminate eventually.

But what if we need to temporarily expand tt before reducing it in order to reach t? Let’s optimistically hope that Smullyan’s 12-L solution was sufficiently expanded; that is, only reductions are needed to go from tt to t, where t is his solution.

Multiple subtrees may be reducible, and choosing the wrong one can block the reductions needed to reach the goal. We therefore try every path: for each possible reduction, we apply it and recurse. This wastefully repeats many computations, because several reduction orders can arrive at the same tree. We tackle this in the obvious manner: by remembering the trees we’ve seen so far.

I wrote solutions in GNU C and Go. The Go solution is a bit too slow for comfort. The C code is slightly clumsier mainly because I had to name the anonymous functions (though one can define a macro to work around this). C also lacks a map data structure, but this was no problem thanks to my recently released BLT library.

Optimism paid off. On my laptop, my C program took well under a minute to find 4 solutions:

(((L(LL))(L(LL)))((L(LL))(L(LL))))
(((L(LL))((LL)L))((L(LL))((LL)L)))
(((L(LL))((LL)L))(((LL)L)((LL)L)))
((((LL)L)((LL)L))(((LL)L)((LL)L)))

The Go version took about 10 minutes.

These all contain 12 Ls, so in a sense, Smullyan’s solution is minimal. Since no other strings are printed, these four 12-L solutions are minimal when only reductions are permitted.

If we allow expansions (that is, (Lx)y → x(yy)), then firstly, we have at least 2^4 = 16 solutions of length 12, since in this case (L(LL)) and ((LL)L) are interchangeable, and secondly, we can reduce the above strings to find shorter solutions. For example, the solution:

(((L(LL))(L(LL)))((L(LL))(L(LL))))

reduces to:

(L((L(LL))(L(LL))))(L(LL))

which further reduces to:

((LL)(L(LL)))(L(LL))

which only has 8 Ls. I doubt Smullyan missed this. My guess is he meant that if you solve the problem the clever way, then you arrive at an expression with 12 Ls; at this point, reductions should be ignored because they only obscure the ingenuity of the solution.

Is there, say, a 7-L expression that expands to some expression (that is necessarily longer than 24 Ls) which can be reduced to half of itself? I think not, but I have no proof.

#### Exercise: Four Fours

I wanted to do something with the Go version of the forall_tree() routine, so I tweaked it to solve the four fours puzzle. I just ploughed through all possible trees and evaluated each one; there are only 320 to do. For larger variants of the puzzle, I’d use dynamic programming; that is, memoize subtrees and their values. Division by zero is annoying, but I managed to keep the tree evaluation function short and sweet by using Go’s panic-recover mechanism.
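In outline, the panic-recover trick looks something like this. This is a simplified sketch with integer arithmetic; the expression type, operator set, and names are illustrative rather than the exact program:

```go
// expr is an expression tree whose leaves are all the literal 4 and
// whose internal nodes hold one of the operators + - * /.
type expr struct {
    op          byte // 0 for a leaf (the literal 4).
    left, right *expr
}

// eval stays short and sweet: it does no error checking at all, and a
// division by zero simply panics at runtime.
func eval(e *expr) int {
    if e.op == 0 {
        return 4
    }
    l, r := eval(e.left), eval(e.right)
    switch e.op {
    case '+':
        return l + r
    case '-':
        return l - r
    case '*':
        return l * r
    }
    return l / r // Panics when r == 0; safeEval recovers.
}

// safeEval converts a division-by-zero panic into ok == false, so the
// search loop can skip invalid trees without eval carrying error plumbing.
func safeEval(e *expr) (n int, ok bool) {
    defer func() {
        if recover() != nil {
            ok = false
        }
    }()
    return eval(e), true
}
```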

#### Recursion: the best thing since recursion

The forall_tree() function is but one example of the eloquence of anonymous functions and recursion. For similar reasons, nested functions are also indispensable. We attain economy of expression by letting the stack automatically take care of the heavy lifting.

Curiously, early structured languages including ALGOL, Simula, and Pascal supported nested functions, but C shied away from this beautiful feature.

Its absence is sorely missed, as can be seen in C-style callbacks. An ugly void * argument is inevitably passed and cast. For instance, take the udata parameter in the SDL audio API.

Its absence is also egregiously gratuitous. GCC supports nested functions by using this one weird trick [compiler writers hate him!]. One might complain this weird trick conflicts with executable space protection, but executable space protection is worse than pointless thanks to return-oriented programming.


## Wednesday, November 13, 2013

### Chocolate, Logic Puzzles, and Dancing Links

Early last year, I visited a cafe that awards chocolate for solving logic puzzles. Naturally, I couldn’t resist free chocolate, and afterwards, just as naturally, I couldn’t resist thinking about programs that solve logic puzzles.

I’ve had to write such programs before for homework assignments or to test out frameworks. But oddly, I had never put much effort into it. I loved logic puzzles as a kid, even going so far as to buy a magazine or two that were filled with grids and clues. Why hadn’t I already written a decent tool to help?

Better late than never. After a spate of intense coding, I had a program that read clues in a terse text format and used brute force to find the solution. I spent most of my time devising the mini-language for the clues rather than the algorithm, as I figured the bottleneck would be the human entering the clues.

My solver worked well enough on a few examples, including the puzzle I solved to get a chocolate. But then I tried my program on the Zebra Puzzle. I was too impatient to let it finish. After a little thought, it was clear why.

#### On the grid

Logic grid puzzles can be abstracted as follows. We are given a table with M rows and N columns, and each cell contains a unique symbol. Our goal is to rearrange the symbols within each row except for those in the first row so that given constraints are satisfied. To be clear, symbols must stay in the row they started in, but apart from the first row, they can change places with other symbols in their row.

Some examples of constraints:

• symbols A and B must be in the same column.

• symbols A, B, and C must be in distinct columns.

• symbol A’s column must be exactly one to the left of symbol B’s column.

We fix the first row because of constraints such as the last example: clues often refer to the order of the elements in the first row of the input table. Let us call these order constraints. This inspires the following convenient relabeling. Without loss of generality, let the symbols of the first row be 1, …, N from left to right.

My brute force solver generates every possible table and prints the ones that satisfy all given constraints. That means it has to examine up to (N!)^(M-1) cases: there are N! permutations of the symbols in each row, and we have M-1 rows to rearrange. For the Zebra Puzzle, this is 120^5, or roughly 25 billion.
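As a throwaway sanity check on that count, with the Zebra Puzzle's M = 6 rows and N = 5 columns:

```go
// bruteForceBound computes (n!)^(m-1): the number of tables a naive
// solver must examine for a puzzle with m rows and n columns.
func bruteForceBound(m, n int) int64 {
    fact := int64(1)
    for i := int64(2); i <= int64(n); i++ {
        fact *= i
    }
    total := int64(1)
    for i := 0; i < m-1; i++ {
        total *= fact
    }
    return total
}
```

bruteForceBound(6, 5) yields 120^5 = 24,883,200,000, which is far too many tables to check one by one.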

#### Got it covered

I needed a smarter approach. Since I had already coded a sudoku solver, I chose the same approach, namely, represent a puzzle as an instance of the exact cover problem, then use the dancing links algorithm.

Firstly, we populate a set X with all the symbols in the table. Next, instead of generating every table, generate every possible column. Each such column corresponds to the subset of X consisting of the symbols in the column. Generating every possible column means brute force is still present, but to a lesser degree.

An exact cover of X corresponds to a collection of columns such that no two columns contain the same symbol, and furthermore, each symbol appears in one of the columns. Thus these columns can be joined together to form a candidate solution to the logic puzzle: we merely order them so that the first row reads 1, …, N.

It remains to disallow covers that violate the constraints. For some constraints, we achieve this by omitting certain columns. For example, suppose A and B must be in the same column. When we generate a column that only contains one of A or B, we omit it from the exact cover instance. Similarly, if a constraint requires that A and B lie in distinct columns, we omit subsets that contain both symbols from the exact cover instance.
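The column generation and filtering step can be sketched as follows. The callback shape and names here are illustrative, not the DLX library's actual API:

```go
// forallColumns enumerates every candidate column: one symbol from each
// row, where rows[i] lists the symbols of row i. Columns rejected by
// the keep filter are omitted from the exact cover instance.
func forallColumns(rows [][]string, keep func(col []string) bool, fun func(col []string)) {
    col := make([]string, len(rows))
    var rec func(i int)
    rec = func(i int) {
        if i == len(rows) {
            if keep(col) {
                fun(append([]string(nil), col...)) // Pass a copy.
            }
            return
        }
        for _, s := range rows[i] {
            col[i] = s
            rec(i + 1)
        }
    }
    rec(0)
}

func has(col []string, s string) bool {
    for _, x := range col {
        if x == s {
            return true
        }
    }
    return false
}

// sameColumn keeps only columns containing both of a, b or neither;
// a "distinct columns" constraint would instead reject columns
// containing both.
func sameColumn(a, b string) func([]string) bool {
    return func(col []string) bool { return has(col, a) == has(col, b) }
}
```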

#### Out of order

The above suffices for many puzzles, but falls short for those with order constraints such as "A and B lie in adjacent columns". For this particular constraint, we add N elements to X. Let us call them x[1], …, x[N]. Given a generated column whose first row contains the number n (recall we have relabeled so that the first row of the input table is 1, …, N), if the column contains:

• both A and B: eliminate the column from consideration.

• A and not B: we add x[n] to the column’s corresponding subset.

• B and not A: we add x[k] to the column’s corresponding subset for all k in 1..N where |n-k| is not 1.

Lastly, we mark x[1] ,…, x[N] as optional, meaning that they need not be covered by our solution. (We could instead augment our collection of subsets with singleton subsets {x[n]} for each n, but with dancing links there’s a faster way.)

Thus any exact cover must have A and B in adjacent columns to avoid conflicts in the x[1], …, x[N] elements. We can handle other order constraints in a similar fashion.

The size of X is the number of symbols, MN, plus N for each order constraint. The number of subsets is the number of possible columns, which is N^M, because we have M rows, and each can be one of N different symbols. Each subset has size M, one for each row, plus up to N more elements for each order constraint that involves it.

#### That’ll do

Though N^M may be exponential in input size, typical logic grid puzzles are so small that my program solves them in a few milliseconds on my laptop. The bottleneck is now indeed the human entering the clues. [I briefly thought about writing a program to automate this by searching for keywords in each sentence.]

I was annoyed that the above algorithm is adequate. Firstly, trying entire columns in a single step bears no resemblance to the actions of human solvers. Secondly, it was too easy: I wish I had been forced to think harder about the algorithm! Perhaps I’ll devise a larger puzzle that demands a better approach.

Around this time I lost focus because I felt my old code could use a few edits. I got sidetracked rewriting my dancing-links sudoku solver and comparing it against other ways of solving sudoku, and soon after that I moved on.

Luckily for my code, I recently felt compelled to clean it up enough for public release. It seemed a pity to let it languish in a forgotten corner until rendered useless from bit rot and lack of documentation.

The DLX library is now available at a git repository near you:

## Tuesday, November 12, 2013

### Lies, damned lies, and frequentist statistics

Earlier this year I rekindled an interest in probability theory. In my classes, Bayes' theorem was little more than a footnote, and we drilled frequentist techniques. Browsing a few books led me to question this. In particular, though parts of Jaynes' "Probability Theory: The Logic of Science" sounded like a conspiracy theory at first, I was soon convinced that the author’s militant condemnation of frequentism was justified.

Today, I had the pleasure of reading a Nature article about a paper by Valen E. Johnson, who directly compares Bayesian and frequentist methods in scientific publications and suggests the latter are responsible for a plague of irreproducible findings. I felt vindicated; or rather, I felt I had several more decibels of evidence for the hypothesis that Bayesian methods produce far better results than frequentist methods when compared against the hypothesis that the two methods produce equivalent results!

This post explains it well. In short, frequentist methods have led to bad science.

An apologist might retort that it’s actually the fault of bad scientists, who are misusing the methods due to insufficient understanding of the theory. There may be some truth here, but I still argue that Bayesian probability should be taught instead. I need only look at my undergraduate probability and statistics textbook. On page 78, I see the 0.05 P-value convention castigated by Johnson, right after recipe-like instructions for computing a P-value. If other textbooks are similar, no wonder scientists are robotically misapplying frequentist procedures and generating garbage.

Johnson’s recommended fix of using 0.005 instead of 0.05 is curious. I doubt it has firm theoretical grounding, but perhaps the nature of the data most scientists collect means that this rule of thumb will usually work well enough. Though striving for the arbitrary 0.005 standard may require excessive data: a Bayesian method might yield similar results with less input. I guess it’s an expedient compromise. Those with a poor understanding of statistical inference can still obtain decent results, at the cost of gathering more data than necessary.

The above post also mentions a paper describing how even a correctly applied frequentist technique leads to radically different inferences from a Bayesian one. The intriguing discussion within is beyond me, but I’m betting Bayesian is better; or rather, the prior I’d assign to the probability that Bayesian inference will one day be shown to be better is extremely close to one!

## Sunday, November 3, 2013

### Crit-bit tree micro-optimizations

Long ago, I implemented crit-bit trees (one of the many names of this data structure) after skimming a few online descriptions. I made obvious choices and put little effort into optimization. It worked well enough for my projects.

Earlier this year, I studied Adam Langley’s crit-bit tree library, and was inspired to write a new crit-bit tree library from scratch. Micro-optimizations are fun! And surely my rebooted library would be faster thanks to ingenious techniques such as tagged pointers and fancy bit-twiddling (notably a SWAR hack to find the critical bit).
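For example, the SWAR trick for locating the critical bit can be sketched like this. This is my paraphrase of the idea, not necessarily the exact code in the library:

```go
// criticalBit returns a mask isolating the most significant bit in
// which bytes a and b differ -- the "critical bit" of two keys that
// first differ at this byte. Smearing the xor to the right, then
// clearing everything below the top set bit, avoids a loop over bits.
func criticalBit(a, b byte) byte {
    x := a ^ b
    x |= x >> 1
    x |= x >> 2
    x |= x >> 4 // x now has every bit below its highest set bit on.
    return x &^ (x >> 1)
}
```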

On my machine, the benchmarks show improvement in space and time. Iteration is much slower since I opted to forgo linked-list pointers as advocated by Bernstein. Without them, we must travel up and down the tree to go from one leaf node to the next.

However, an application wishing to visit every element should not do this: it should instead call the allprefixed routine with a blank prefix. The difference between allprefixed and my original library’s iterate is small enough that I’m inclined to agree with Bernstein: we’re better off without auxiliary next and previous pointers.

When I modified the benchmark program to measure the C++ standard containers, I changed char* to string, and an array to a vector, which are natural choices for C++ and should not significantly hurt running times. As theory suggests, map insertion and lookup are slower, while unordered_map is much faster; as long as the drawbacks of the latter are borne in mind, it may be a reasonable choice for certain applications.

I was pleased my rewritten library sometimes seemed a touch faster than the library that inspired it ("critbit" in the tables below), especially since my library implements a map and not just a set. The extra optimizations it performs are listed in comments in the C file. My guess is that most of them do almost nothing, and "unrolling" the child pointer lookup is largely responsible for the gains.

I’m naming my code the BLT library for now. I already used up my first choice in my old "cbt" library. Though immodest, "Ben Lynn Trees" is easy for me to remember. Also, the name coincides with a delicious sandwich.

In the following tables, all entries are in seconds, except for the overhead which is measured in bytes. I used tcmalloc to get the best numbers.

Table 1. Keys: /usr/share/dict/words

|                  | BLT      | cbt      | critbit  | map      | unordered_map |
|------------------|----------|----------|----------|----------|---------------|
| insert           | 0.062939 | 0.085081 | 0.070779 | 0.086181 | 0.050796      |
| get              | 0.044096 | 0.046323 | 0.053218 | 0.077201 | 0.034249      |
| allprefixed      | 0.013312 |          | 0.015073 |          |               |
| iterate          | 0.031755 | 0.006420 |          | 0.006167 | 0.004334      |
| delete           | 0.048093 | 0.050687 | 0.051923 | 0.009572 | 0.011820      |
| overhead (bytes) | 3966824  |          | 6346992  |          |               |

Table 2. Keys: output of seq 2000000

|                  | BLT      | cbt      | critbit   | map      | unordered_map |
|------------------|----------|----------|-----------|----------|---------------|
| insert           | 2.675797 | 3.048633 | 2.683036  | 3.175450 | 1.457466      |
| get              | 2.328416 | 2.477317 | 2.510085  | 2.913757 | 0.919519      |
| allprefixed      | 0.296389 |          | 0.303059  |          |               |
| iterate          | 0.917041 | 0.171143 |           | 0.193081 | 0.131707      |
| delete           | 2.352568 | 2.522654 | 2.250305  | 0.299835 | 0.316262      |
| overhead (bytes) | 79999984 |          | 128000048 |          |               |

Try BLT now!

## Sunday, July 21, 2013

### Inheritance Quiz

1. This code uses implementation inheritance. Where’s the bug?

```java
class Point {
  int x, y;
  Point(int x, int y) {
    this.x = x;
    this.y = y;
    display();
  }
  void display() { System.out.println(x + " " + y); }
}

class CPoint extends Point {
  Color c;
  CPoint(int x, int y, Color c) {
    super(x, y);
    this.c = c;
  }
  void display() { System.out.println(x + " " + y + " " + c.name()); }
}
```
2. This code uses implementation inheritance. Where’s the bug?

```java
/** A version of Hashtable that lets you do
 * table.put("dog", "canine");, and then have
 * table.get("dogs") return "canine". **/
public class HashtableWithPlurals extends Hashtable {
  /** Make the table map both key and key + "s" to value. **/
  public Object put(Object key, Object value) {
    super.put(key + "s", value);
    return super.put(key, value);
  }
}
```

Hint: After put("dog", foo), although "dogs" is in the table as expected, sometimes "dogss" is also in the table. Why?

3. Consider this object buffer in Java:

```java
public class Buffer {
  protected Object[] buf;
  protected int MAX;
  protected int current = 0;

  Buffer(int max) {
    MAX = max;
    buf = new Object[MAX];
  }

  public synchronized Object get() throws Exception {
    while (current <= 0) { wait(); }
    current--;
    Object ret = buf[current];
    notifyAll();
    return ret;
  }

  public synchronized void put(Object v) throws Exception {
    while (current >= MAX) { wait(); }
    buf[current] = v;
    current++;
    notifyAll();
  }
}
```

Use inheritance to extend this class to support the gget() method, which is identical to get() except it cannot be executed immediately after a get(). In other words, it blocks until a put() or a gget() finishes.

```java
public class HistoryBuffer extends Buffer {
  // What goes here?
}
```

1. See "Masked types for sound object initialization" for the answer. Did you know initialization of superclasses in Java and C++ is unsound?

In Go, which lacks inheritance, we might write:

```go
type Point struct {
    x, y int
}

func NewPoint(x, y int) *Point {
    p := &Point{x, y}
    p.Display()
    return p
}

func (p Point) Display() { println(p.x, p.y) }

type CPoint struct {
    x, y int
    c    Color
}

func NewCPoint(x, y int, c Color) *CPoint {
    p := &CPoint{x, y, c}
    p.Display()
    return p
}

func (p CPoint) Display() { println(p.x, p.y, p.c.Name()) }
```

We must write factory methods instead of constructors. However, the language makes it easy to avoid the bug.

2. This question was taken from "The Java IAQ", which also explains the answer.

In Go, which lacks inheritance, we might write:

```go
type PluralsTable struct {
    tab *Table
}

func (t *PluralsTable) Put(key string, value interface{}) interface{} {
    t.tab.Put(key+"s", value)
    return t.tab.Put(key, value)
}

func (t *PluralsTable) Get(key string) interface{} {
    return t.tab.Get(key)
}

// ...
```

We must define a wrapper for each method of Table that we want PluralsTable to support. Additionally, if we have code that should work on both PluralsTable and Table, we must define an interface with Put and Get. However, the language makes it easy to avoid the surprise recursion bug.
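Such an interface is tiny; the names here are illustrative. Any type with these two methods satisfies it implicitly:

```go
// Putter is the interface shared code would be written against; both
// Table and PluralsTable satisfy it as long as they define these
// two methods.
type Putter interface {
    Put(key string, value interface{}) interface{}
    Get(key string) interface{}
}

// fill works on any Putter, plural-aware or not.
func fill(t Putter, pairs map[string]string) {
    for k, v := range pairs {
        t.Put(k, v)
    }
}

// mapTable is a toy implementation used only to demonstrate that
// interface satisfaction needs no declaration.
type mapTable map[string]interface{}

func (m mapTable) Put(key string, value interface{}) interface{} {
    old := m[key]
    m[key] = value
    return old
}

func (m mapTable) Get(key string) interface{} { return m[key] }
```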

We also avoid other kinds of bugs. For example, suppose we add PutFoo() to Table, which inserts "foo" into the table in a fast but strange way: the hash of "foo" is precomputed, so that the entry can be directly inserted into the underlying array.

With inheritance, calling PutFoo() on a PluralsTable will silently succeed, but neglect to insert "foos", a bug that might only be noticed long after a release. Without inheritance, the program will fail to compile when code calls PutFoo() on a PluralsTable, at which point a human can intervene and supply the PluralsTable edition of PutFoo().

3. Did you manage it without rewriting the entire class? If so, congratulations on solving the inheritance anomaly! Otherwise, you probably wrote something akin to the solution given in that paper:

```java
public class HistoryBuffer extends Buffer {
  boolean afterGet = false;

  public HistoryBuffer(int max) { super(max); }

  public synchronized Object gget() throws Exception {
    while ((current <= 0) || (afterGet)) { wait(); }
    afterGet = false;
    return super.get();
  }

  public synchronized Object get() throws Exception {
    Object o = super.get();
    afterGet = true;
    return o;
  }

  public synchronized void put(Object v) throws Exception {
    super.put(v);
    afterGet = false;
  }
}
```

Implementation inheritance mixes poorly with concurrent programming.

In Go, this question takes a different character because concurrent object buffers are built-in types:

```go
type Buffer chan interface{}

func NewBuffer(max int) Buffer {
    return make(chan interface{}, max)
}

func (buf Buffer) Get() interface{} {
    return <-buf
}

func (buf Buffer) Put(v interface{}) {
    buf <- v
}
```

One solution to the HistoryBuffer problem is:

```go
type HistoryBuffer struct {
    ch                   chan interface{}
    get, gget, put, done chan bool
}

func NewHistoryBuffer(max int) *HistoryBuffer {
    buf := &HistoryBuffer{
        make(chan interface{}, max),
        make(chan bool),
        make(chan bool),
        make(chan bool),
        make(chan bool),
    }
    go func() { // Synchronization logic.
        maybe_gget := buf.gget
        for {
            select {
            case <-maybe_gget:
            case <-buf.put:
                maybe_gget = buf.gget
            case <-buf.get:
                maybe_gget = nil
            }
            <-buf.done
        }
    }()
    return buf
}

func (buf *HistoryBuffer) Get() interface{} {
    buf.get <- true
    v := <-buf.ch
    buf.done <- true
    return v
}

func (buf *HistoryBuffer) Put(v interface{}) {
    buf.put <- true
    buf.ch <- v
    buf.done <- true
}

func (buf *HistoryBuffer) GGet() interface{} {
    buf.gget <- true
    v := <-buf.ch
    buf.done <- true
    return v
}
```

Most of the synchronization logic resides in the anonymous function. Outside, we only have channel writes at the beginning and end of each method, comparable to Java’s synchronized keyword. Thus behaviour is decoupled from synchronization.

We can effortlessly and independently change the synchronization logic. Suppose now GGet() can only be called after 3 Put() calls and 2 Get() calls have completed in some order since the previous GGet(), or, if there are no previous GGet() calls, since the program started. We simply change the anonymous function:

```go
go func() {
    var maybe_gget chan bool
    p, g := 0, 0
    for {
        select {
        case <-maybe_gget:
            p, g = 0, 0
            maybe_gget = nil
        case <-buf.put:
            p++
        case <-buf.get:
            g++
        }
        if p >= 3 && g >= 2 {
            maybe_gget = buf.gget
        }
        <-buf.done
    }
}()
```

The official Go documentation suggests channels of ints instead of bools, which leads to slightly smaller code (mainly because we can replace true with 1), and probably results in the same compiled binaries.

## Thursday, March 14, 2013

### MathJax

About a decade ago, I began putting my notes on my homepage for the reasons cloud computing proponents love to spout (though I did it without uttering any buzzwords).

But I hit a snag. How do I put equations on the web? Among the many awful workarounds, I picked the one I thought noblest: MathML. My pages would be static content, operable without JavaScript. Text is far slimmer than images, and is far more agreeable to things like searching. As for PDF? Over my dead <body> element!

I was optimistic back then. Mozilla supported MathML provided you also downloaded a font or two, and despite the crushing dominance of Internet Explorer, I felt that righteous Free Software would ultimately triumph. One day, I hoped, a typical browser would render my site perfectly, out of the box.

Turns out my predictions were half right. The web broke free of Internet Explorer’s chokehold. Now, more often than not, we use open source browsers. And one of them, Firefox, supports MathML out of the box.

However, my mathematics notes still render incorrectly on most browsers. Popular search engines appear to shun them, possibly because I zealously followed the arcane XHTML 1.1 plus MathML guidelines. And everything supports JavaScript.

Maybe they’re all going to support it real soon, but ten years is too long for me. I switched to MathJax, a clever JavaScript library that figures out what your system can do, then renders the equations using an appropriate technique. It just works.

## Monday, February 18, 2013

I’ve been leafing through a few books on probability, a subject I’ve mostly avoided since undergrad. I originally thought I’d just refresh what I had already learned, but to my surprise I was led to reconsider fundamental beliefs. What follows is my journey, told via book reviews.

## Hexaflexagons and Other Mathematical Diversions by Martin Gardner

As a kid, I devoured this book and the others in the series, which I later learned were collections of Mathematical Games columns from Scientific American magazine. I didn’t always understand the material, and the puzzles were often too difficult, but Gardner’s writing skill kept me reading on.

Among the many fascinating chapters was “Probability Paradoxes”. Gardner’s ability to communicate was so strong that after many years I still remember much of the content. In particular, he asked:

> Mr. Smith says, "I have two children and at least one of them is a boy." What is the probability that the other child is a boy?

and his explanation of 1/3 being the correct answer not only stuck in my mind, but shaped my early views on probability. For the details, see this New Scientist article on a Martin Gardner convention.

Only a few years ago, after a debate with a friend, did I reconsider the reasoning. It turns out Gardner’s statement of the problem is ambiguous. This revelation sparked a desire to hit the books and brush up on probability one day.

## A Primer of Statistics by M.C. Phipps and M.P. Quine

The second edition of this slim volume was the textbook for my first course on probability. I used it to cram for exams. For this purpose, it was good: I got decent grades.

Sadly, it wasn’t as good in other respects. I acquired a distaste for the subject. Why did Probability and Statistics seem like a bag of ad hoc tricks, with few explanations given? Do I have poor intuition for it? Or is it glorified guesswork that seems to work well enough with real-life data? Whatever the reason, I decided that for the rest of my degree I’d steer towards the Pure Mathematics offerings.

## The Signal and the Noise: Why So Many Predictions Fail — but Some Don’t by Nate Silver

My renewed interest in probability was sparked by the United States presidential election of 2012, or rather, its aftermath. Many had predicted its outcome but few were accurate.

It was only then I read about Nate Silver, who turned out to have been famous for his prowess with predictions for quite some time. Eager to learn more, I thumbed through his bestseller.

Though necessarily light on theory, the equations that do appear are correct and lucidly explained. Also, the pages are packed with interesting data sets and anecdotes. General pronouncements are often backed up with concrete tables and graphs, though, as Silver readily admits, some qualities are difficult to quantify, resulting in potentially dubious but novel yardsticks (such as measuring scientific progress by average research and development expenditure per patent).

But most of all, I was intrigued by the tale of an ongoing conflict that I never knew existed, with frequentists on one side and Bayesians on the other. They never told me this in school!

I soon found out why: Silver states that Fisher may almost be single-handedly to blame for the dominance of frequentism, the ideology foisted upon me when I was just out of high school. Sure enough, I went back and confirmed Phipps and Quine listed Fisher in the bibliography.

## Against the Gods: The Remarkable Story of Risk by Peter L. Bernstein

My dad told me about this book. Technical details are scant, as it too is aimed at the general public. But in contrast to Silver’s work, the little that does appear is laughably erroneous. In some sections, I felt the author was trying to trick himself into believing fallacies.

The misinformation might be mostly harmless. Those with weak mathematical ability are going to skip the equations out of fear, and those with strong mathematical ability are probably also going to skip them because they already know them.

But conceivably this book could be a gifted reader’s first introduction to probability, and it’d be a shame to start off on the wrong foot. As a sort of public service, I’ll explain some of the gaffes.

Exercises

Chapter 6 contains an example expected value calculation involving a coin flip.

> We multiply 50% by one for heads and do the same for the tails, take the sum---100%---and divide by two. The expected value of betting on a coin toss is 50%. You can expect either heads or tails, with equal likelihood.

Why is this wrong? How can we fix it?

The next example involves rolling two dice.

> If we add the 11 numbers that might come up…the total works out to 77. The expected value of rolling two dice is 77/11, or exactly 7.

Why is this wrong? How can we fix it?

What’s the difference?

Bernstein and Silver offer competing reasons why modern civilization differs from the past. Bernstein singles out our relatively newfound ability to quantify risk, and also suggests that key intermediate steps could only have occurred at certain points in history due to the overall mood of the era.

In contrast, Silver seems to place most importance on the printing press. In an early chapter, Silver suggests that after some teething trouble (lasting 330 years), the printing press paved the way for modern society. Apart from distribution of knowledge, perhaps more importantly the printing press helped with the preservation of knowledge; previously, writing would often be lost before it could be copied.

I’m inclined to side with Silver, partly because of Bernstein’s basic technical mistakes. After observing how fast and loose Bernstein was playing with mathematics, I’m tempted to believe some of his statements are gut feelings.

There is another glaring difference. Bernstein’s book lacks any mention of the frequentist-Bayesian war. Fisher’s name is conspicuously absent.

For or Against?

Against the Gods is riveting. My favourite feature is the backstories of famous scholars. For some of them, the only thing I knew before reading the book was their names, and I would have known even less had their names not been attached to their most famous discoveries (or at least, discoveries vaguely connected with them). Learning about their lives, motivations, temperaments, beliefs, and so on was illuminating. An intellectually superior form of gossip, I suppose.

However, the elementary mathematical mistakes ultimately cast a cloud of suspicion over the book. How reliable are the author’s assertions in general? Although I heartily recommend Against the Gods, I also recommend thorough fact-checking before using it as a reference.

So a tip for bestseller authors: if a section is technical, then ask an expert, be an expert, or cut it out. Too many howlers make readers like me wary of the whole, no matter how well-written and accurate the non-technical parts are.

As Bernstein himself implies, an expected value is a weighted average. We need weights, and we need numbers to sum. It takes two to tango; the expected value dance can only proceed if probabilities are accompanied by values.

One example neglects the values, and the other neglects the probabilities. The author only computes the sum of the weights for the coin flip, and the sum of the values for the dice roll. In both cases the author divides by the number of outcomes, which might be considered another error: we already divided by the number of outcomes to compute the weights (probabilities) in the first place.

Why are these blunders amusing? For the coin example, let’s ignore that the expected value is confused with a probability. Instead of a coin, consider winning the lottery. The probability of winning the lottery plus the probability of not winning the lottery sums to 100%. Dividing this by the number of outcomes, i.e. 2, yields 50%, so apparently we win or lose the lottery with equal likelihood! It’s almost like saying “either it happens or it doesn’t happen, so the chances it happens is 50%”.

For the dice example, imagine rolling 2 loaded dice, both of which almost always show 6. The expected value should be close to 12, but because the probabilities are completely ignored, the author’s procedure leads to the same expected value of 7. Surely your calculation should change if the dice are loaded?
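The contrast is easy to check mechanically. Here is a small sketch (Python; the `expected_value` helper and the particular loading of the dice are my own, not from either book): the correct expected value weights each outcome’s value by its probability, so it moves when the dice are loaded, while averaging the 11 distinct sums stays stuck at 7 no matter what.

```python
from fractions import Fraction

def expected_value(dist):
    """Weighted average: sum of value * probability over all outcomes."""
    return sum(value * prob for value, prob in dist.items())

# Fair pair of dice: build P(sum = s) for s in 2..12.
fair = {}
for a in range(1, 7):
    for b in range(1, 7):
        fair[a + b] = fair.get(a + b, Fraction(0)) + Fraction(1, 36)

print(expected_value(fair))             # 7

# The shortcut: average the 11 possible sums, ignoring probabilities.
print(Fraction(sum(range(2, 13)), 11))  # 77/11 = 7, right only by coincidence

# Loaded dice: each die shows 6 with probability 9/10, 1/50 per other face.
die = {face: Fraction(1, 50) for face in range(1, 6)}
die[6] = Fraction(9, 10)
loaded = {}
for a, pa in die.items():
    for b, pb in die.items():
        loaded[a + b] = loaded.get(a + b, Fraction(0)) + pa * pb

print(float(expected_value(loaded)))    # 11.4 -- the shortcut still says 7
```

The weighted average correctly reports roughly 11.4 for the loaded pair, while the author’s 77/11 procedure cannot notice the loading at all.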

How do we fix these problems? For the dice example, the author supplies the correct method in the very next paragraph. At last, both the probabilities and values are taken into account. Unfortunately, the author then concludes:

> The expected value…is exactly 7, confirming our calculation of 77/11. Now we can see why a roll of 7 plays such a critical role in the game of craps.

This should have never been written. The first sentence suggests both methods for computing the expected value are valid, when of course it just so happens the wrong method leads to the right answer.

The second sentence is difficult to interpret. Perhaps uncharitably, I’m guessing the sentence is an upgraded version of: “Look! Here’s a 7! Didn’t we see a 7 earlier?” What would have been written if we rolled a single die? The expected value is 3.5, but a roll of 3.5 obviously has no role in any game we play with one die.

As for fixing the coin example: computing an expected value requires us to attach a numerical value to each outcome. One does not simply plow ahead with “heads” versus “tails”. We need numbers; any numbers. We could assign 42 to heads, and 1001 to tails; here, the expected value of a fair coin toss would be 50% of 42 plus 50% of 1001, which is 521.5. Typically we pick values relevant to the problem at hand: for instance, in a game where we earn a dollar for flipping heads, and lose a dollar for tails, we’d assign the values 1 and -1 (here, our expected winnings would be 0).
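The arithmetic above is trivial to verify; the numbers 42 and 1001 are just the arbitrary labels from the paragraph, and the helper function is my own illustration.

```python
def expected_value(values, probs):
    """Weighted average of outcome values by their probabilities."""
    return sum(v * p for v, p in zip(values, probs))

# Fair coin with arbitrary numerical labels for heads and tails.
print(expected_value([42, 1001], [0.5, 0.5]))  # 521.5

# Win a dollar on heads, lose a dollar on tails: expected winnings.
print(expected_value([1, -1], [0.5, 0.5]))     # 0.0
```

Any assignment of numbers works; the expected value is only meaningful relative to the values we chose.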

[It may be possible to reinterpret the coin example as assigning the value 1 to both heads and tails. But if this were done, the expected value should also be 1, not “50%”. Furthermore, we learn nothing if the outcomes are indistinguishable.]

## Probability Theory: The Logic of Science by E. T. Jaynes

If only Jaynes' book had been my introduction to probability. Like a twist ending in a movie, reading it was a thought-provoking, eye-opening, earth-shattering experience that compelled me to re-evaluate what I thought I knew.

Whereas Silver presents whimsical examples that demonstrate the Bayesian approach, Jaynes forcefully argues for its theoretical soundness. From a few simple intuitive “desiderata” (too ill-defined to be axioms), Jaynes shows step-by-step how they imply more familiar probability axioms, and why the Bayesian approach is the natural choice. And all this happens within the first 3 chapters, which are free online.

I had been uneasy about probability because I thought it was a collection of mysterious hacks, perhaps because it had to deal with the real world. I was flabbergasted to learn probability could be put on the same footing as formal logic. All those hacks can be justified after all. Probability is not just intuition and duct tape: it can be as solid as any branch of mathematics.

Since there still exist competing philosophies of probability, presumably others find fault with Jaynes' arguments. I’m still working through it, but I’m convinced for now. If there’s another twist in this story, I’ll need another great book to show it to me.

Washington University in St. Louis maintains a page dedicated to Jaynes. It’s a shame he died before he finished writing. The remaining holes have been papered over with exercises, which explains their depth and difficulty.

It’s also a shame Jaynes left Stanford University many years ago. Had he stayed, with luck I would have discovered his work earlier, or even have met him. A backward look to the future describes his reasons for departure.

In short, Jaynes felt the “publish or perish” culture of academia was harmful and was taking over Stanford. I can’t tell if Jaynes was right because by the time I got into the game, this culture seemed universally well-established. I had no idea an alternative ever existed.