Sunday, April 2, 2017

Lambda Calculus Surprises

Much time has passed since my last entry. I’ve been frantically filling gaps in my education, so I’ve had little to say here. My notes are better off on my homepage, where I can better organize them, and incorporate interactive demos.

However, I want to draw attention to delightful surprises that seem unfairly obscure.

1. Succinct Turing-complete self-interpreters

John McCarthy’s classic paper showed how to write a Lisp interpreter in Lisp itself. By adding a handful of primitives (quote, atom, eq, car, cdr, cons, cond) to lambda calculus, we get a Turing-complete language where a self-interpreter is easy to write and understand. For contrast, see Turing’s universal machine of 1936.

Researchers have learned more about lambda calculus since 1960, but many resources seem stuck in the past. Writing a Turing-complete interpreter in 7 lines is ostensibly still a big deal. The Roots of Lisp by Paul Graham praises McCarthy’s self-interpreter but explores no further. The Limits of Mathematics by Gregory Chaitin chooses Lisp over plain lambda calculus for dubious reasons. Perhaps McCarthy’s work is so life-changing that some find it hard to notice new advances.


(I’ve suppressed the lambdas. Exercise: write a regex substitution that restores them.)

In fact, under some definitions, the program “λq.q(λx.x)” is a self-interpreter.

2. Hindley-Milner sort

Types and Programming Languages (TaPL) by Benjamin C. Pierce is a gripping action thriller. Types are the heroes, and we follow their epic struggle against the most ancient and powerful foes of computer science and mathematics.

When we first meet them, types are humble guardians of a barebones language that can only express the simplest of computations involving booleans and natural numbers. As the story progresses, types gain additional abilities, enabling them to protect more powerful languages.

However, there seems to be a plot hole when types level up from Hindley-Milner to System F. As a “nice demonstration of the expressive power of pure System F”, the book mentions a program that can sort lists.

The details are left as an exercise to the reader. Working through them, we realize a Hindley-Milner type system is already powerful enough to sort lists. Moreover, the details are far more pleasant in Hindley-Milner because we avoid the ubiquitous type spam of System F.

System F is indeed more powerful than Hindley-Milner and deserves admiration, but because of well-typed self-application and polymorphic identity functions, existential types, and other gems, not because lists can be sorted.

3. Self-interpreters for total languages

They said it couldn’t be done.

According to Breaking Through the Normalization Barrier: A Self-Interpreter for F-omega by Matt Brown and Jens Palsberg, “several books, papers, and web pages” assert self-interpreters for a strongly normalizing lambda calculus are impossible. The paper then shows that reports of their non-existence have been greatly exaggerated.

Indeed, famed researcher Robert Harper writes on his blog that “one limitation of total programming languages is that they are not universal: you cannot write an interpreter for T within T (see Chapter 9 of PFPL for a proof).” As of now (April 2017), the Wikipedia article that Brown and Palsberg cite still declares “it is impossible to define a self-interpreter in any of the calculi cited above”, referring to simply typed lambda calculus, System F, and the calculus of constructions.

I was shocked. Surely academics are proficient with diagonalization by now? Did they all overlook a hole in their proofs?

More shocking is the stark simplicity of what Brown and Palsberg call a shallow self-interpreter for System F and System Fω, which is essentially a typed version of “λq.q(λx.x)”.

It relies on a liberal definition of representation (we only require an injective map from legal terms to normal forms) and self-interpretation (mapping a representation of a term to its value) which is nonetheless still strong enough to upend conventional wisdom.

Which brings us to the most shocking revelation: there is no official agreement on the definition of representation or self-interpretation, or even what we should name these concepts.

Does this mean I should be wary of even the latest textbooks? Part of me hopes not, because I want to avoid learning falsehoods, but another part of me hopes so, for it means I’ve reached the cutting edge of research.

See for yourself!

Interactive demos of the above:

Tuesday, November 10, 2015

Neural Networks in Haskell

Long ago, when I first looked into machine learning, neural networks didn’t stand out from the crowd. They seemed on par with decision trees, genetic algorithms, genetic programming, and a host of other techniques. I wound up dabbling in genetic programming because it seemed coolest.

Neural networks have since distinguished themselves. Lately, they seem responsible for each newsworthy machine learning achievement I hear about. To name a few:

Inspired, I began reading Michael Nielsen’s online book on neural networks. We can whip up a neural network without straying beyond a Haskell base install, though we do have to implement the Box-Muller transform ourselves to avoid pulling in a library to sample from a normal distribution.

The following generates a neural network with 3 inputs, a hidden layer of 4 neurons, and 2 output neurons, and feeds it the inputs [0.1, 0.2, 0.3].

import Control.Monad
import Data.Functor
import Data.List
import System.Random

main = newBrain [3, 4, 2] >>= print . feed [0.1, 0.2, 0.3]

newBrain szs@(_:ts) = zip (flip replicate 1 <$> ts) <$>
  zipWithM (\m n -> replicateM n $ replicateM m $ gauss 0.01) szs ts

feed = foldl' (((max 0 <$>) . ) . zLayer)

zLayer as (bs, wvs) = zipWith (+) bs $ sum . zipWith (*) as <$> wvs

gauss :: Float -> IO Float
gauss stdev = do
  x <- randomIO
  y <- randomIO
  return $ stdev * sqrt (-2 * log x) * cos (2 * pi * y)

The tough part is training the network. The sane choice is to use a library to help with the matrix and vector operations involved in backpropagation by gradient descent, but where’s the fun in that?

It turns out even if we stay within core Haskell, we only need a few more lines, albeit some hairy ones:

relu = max 0
relu' x | x < 0      = 0
        | otherwise  = 1

revaz xs = foldl' (\(avs@(av:_), zs) (bs, wms) -> let
  zs' = zLayer av (bs, wms) in ((relu <$> zs'):avs, zs':zs)) ([xs], [])

dCost a y | y == 1 && a >= y = 0
          | otherwise        = a - y

deltas xv yv layers = let
  (avs@(av:_), zv:zvs) = revaz xv layers
  delta0 = zipWith (*) (zipWith dCost av yv) (relu' <$> zv)
  in (reverse avs, f (transpose . snd <$> reverse layers) zvs [delta0]) where
    f _ [] dvs = dvs
    f (wm:wms) (zv:zvs) dvs@(dv:_) = f wms zvs $ (:dvs) $
      zipWith (*) [sum $ zipWith (*) row dv | row <- wm] (relu' <$> zv)

descend av dv = zipWith (-) av ((0.002 *) <$> dv)

learn xv yv layers = let (avs, dvs) = deltas xv yv layers
  in zip (zipWith descend (fst <$> layers) dvs) $
    zipWith3 (\wvs av dv -> zipWith (\wv d -> descend wv ((d*) <$> av))
      wvs dv) (snd <$> layers) avs dvs

See my Haskell notes for details. In short: ReLU activation function; online learning with a rate of 0.002; an ad hoc cost function that felt right at the time.

Despite cutting many corners, after a few runs, I obtained a neural network that correctly classifies 9202 of 10000 handwritten digits in the MNIST test set in just one pass over the training set.

I found this result surprisingly good. Yet there is much more to explore: top of my must-see list are deep learning (also described in Nielsen’s book) and long short-term memory.

I turned the neural net into an online digit recognition demo: you can draw on the canvas and see how it affects the outputs.

Monday, February 23, 2015

Mighty Warp

  • Outperforms nginx.

  • Under 1300 lines of source.

  • Clear control flow: handles one request per thread using blocking calls.

  • Slowloris DoS protection.

The secret is GHC’s runtime system (RTS). Every Haskell program must spend time in the RTS, and maybe this does hurt performance in certain cases, but for web servers it is a huge win: the RTS automatically transforms code that seems to handle one request per thread into a server with multiple event-driven processes. This saves many a context switch while keeping the source simple.

Best of all, this magic technology is widely available. To start a webserver on port 3000 using the Warp library, run these commands on Ubuntu [original inspiration]:

sudo apt-get install cabal-install
cabal update
cabal install warp
cat > server.hs << EOF
#!/usr/bin/env runghc
{-# LANGUAGE OverloadedStrings #-}
import Network.Wai (responseLBS)
import Network.Wai.Handler.Warp (run)
import Network.HTTP.Types (status200)
import Network.HTTP.Types.Header (hContentType)

main = run 3000 $ \_ f -> f $ responseLBS
  status200 [(hContentType, "text/plain")] "Hello, world!\n"
EOF
chmod +x server.hs
./server.hs

Eliminating context switches is the best part of the story, but there’s more. Copying data can be avoided with a simple but clever trick the authors call splicing. Using conduits instead of lazy I/O solves the non-deterministic resource finalization problem. And a few judiciously placed lockless atomic operations can work wonders: in particular, for basic Slowloris protection and for a robust file descriptor cache.


Those who fear straying too far from a C-like language can still reap the benefits in Go:

  • Goroutines are like green threads.

  • Channels are like conduits.

  • Array slices are like splices.

If I didn’t know better, I would say the designers of Go emulated the inventor of the California roll: they took some of the best features of languages like Haskell and made them palatable to a wider audience.

I wonder how Go’s RTS compares. One innate advantage GHC may have is Haskell’s type system, which leads to largely non-destructive computation, which ultimately leads to a cheap and effective scheduling scheme (namely, context switching on memory allocation). Still, I expect a well-written Go web server could achieve similar results.
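For comparison, here is a rough Go counterpart to the Warp server above. It is only a sketch: the names hello and respond are mine, and to keep it self-contained it exercises the handler with an in-memory request from net/http/httptest rather than opening port 3000. The handler is plain blocking code; Go's net/http server runs each connection in its own goroutine.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// hello mirrors the Warp example: a plain-text greeting for every request.
// It is straight-line code with no callbacks or explicit event loop.
func hello(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain")
	fmt.Fprint(w, "Hello, world!\n")
}

// respond runs the handler against an in-memory request, without opening
// a socket, and returns the status code and body it produced.
func respond() (int, string) {
	rec := httptest.NewRecorder()
	hello(rec, httptest.NewRequest("GET", "/", nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := respond()
	fmt.Printf("%d %s", code, body)
	// To serve real traffic on port 3000 instead:
	//   http.ListenAndServe(":3000", http.HandlerFunc(hello))
}
```

This says nothing about performance; it only shows that the source stays as simple as the Haskell version.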

Saturday, December 20, 2014

Haskell for programming contests

Farewell, Dr. Dobb’s. In a way, my previous post proved prescient: in the old days, I relied on printed magazines like Dr. Dobb’s Journal to learn about computers. Now, most things are but a few search terms away. Coding is easier than ever.

Though I have a soft spot for this particular magazine, ultimately I’m glad information has become more organized and accessible. I like to think I played a part in this, however small, by posting my own tutorials, articles, rants, and code.

Here’s hoping the remainder of my previous post also ages well. That is, may Haskell live long and prosper. [A sentiment echoed by Dr. Dobb’s.] Again, I’d like to play a small part in this: Haskell for programming contests.

Tuesday, August 12, 2014

Let's Code!

A recent article in Dr. Dobb’s Journal bemoans the complexity of today’s development toolchains: “it’s hard to get any real programming done”. However, my own experience suggests the opposite: I find programming is now easier than ever, partly due to better tools.

I say “partly” because when I was a kid, it was difficult to obtain code, compilers, and documentation, let alone luxuries like an SCM. I scoured public libraries for books on programming and checked out what they had, which meant I studied languages which I could never use because I lacked the right compiler, or even the right computer. I nagged my parents to buy me expensive books, and occasionally they’d succumb. Perhaps the most cost-efficient were magazines containing program listings which of course had to be keyed in by hand. (One of my most treasured was an issue of Dr. Dobb’s Journal, back when it was in print, and only in print.)

Nowadays, a kid can get free high-quality compilers, code, tutorials, and more at the click of a button. But I believe even without this freer flow of information, programming would still be easier than ever because our tools have improved greatly.

got git?

The author singles out Git as a source of trouble, but the reasoning is suspect. For example, we’re told that with respect to other “SCMs you’ve used…Git almost certainly does those same actions differently.”

This suggests that the author used other SCMs, then tried Git and found it confusing. In contrast, I used Git, then tried other SCMs and found them confusing. I predict as time passes, more and more developers will learn Git first, and their opinions of SCMs will mirror mine.

Nevertheless, I’m leery of ranking the friendliness of tools by the order you picked them up. I hereby propose a different yardstick. Take Git, and a traditional SCM. Implement, or at least think about implementing, a clone of each from scratch; just enough so it is self-hosting. Then the one that takes less time to implement is simpler.

I wrote a self-hosting Git clone in a few hours: longer than expected because I spent an inordinate amount of time debugging silly mistakes. Though I haven’t attempted it, I would need more time to write a clone of Perforce or Subversion (pretty much the only other SCMs I have used). With Git, there are no transactions, revision numbers, rename tracking, central servers, and so on; Git is essentially SHA-1 hashes all the way down.
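To give a taste of the "hashes all the way down" idea, here is a small Go sketch of how Git names a file's contents. It is not taken from my clone; the function name blobID is made up for illustration. Git stores a file as a "blob" object whose ID is the SHA-1 of a short header followed by the raw bytes.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// blobID returns the object ID Git assigns to a file's contents:
// the SHA-1 of the header "blob <size>\x00" followed by the raw bytes.
func blobID(content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(content))
	h.Write(content)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	// Should agree with: printf 'hello\n' | git hash-object --stdin
	fmt.Println(blobID([]byte("hello\n")))
}
```

Trees and commits are hashed in the same manner with a different header, which is a large part of why a minimal clone is quick to write.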

But let’s humour the author and suppose Git is complex. Then why not use tarballs and patches? This was precisely how Linux was managed for 10 years, so should surely suffice for a budding developer. In fact, I say you should only bother with Git once you realize, firstly, you’re addicted to coding, and secondly, how annoying it is to manage source with tarballs and patches!

In other words, although Git is handy, you only really need it when your project grows beyond a certain point, by which time you’ve already had plenty of fun coding. Same goes for tools like defect trackers.

Apps and Oranges

I agree that developing for mobiles is painful. However, comparing this against those “simple programs of a few hundred lines of C++ long ago” is unfair. With mobile apps, the program usually runs on a system different to the one used to write the code.

It might be fairer to compare writing a mobile app with, say, programming a dot matrix printer of yesteryear, as in both cases the target is different to the system used to write the code. I once did the latter, for the venerable Epson MX-80: after struggling with a ton of hardware-specific low-level nonsense, I was rewarded with a handful of crummy pictures. I would say it involved less “real programming” than writing an Android app.

All the same, I concede that writing Android software is harder than it should be, largely due to Java. But firstly, a mobile phone involves security and privacy issues that would never arise with a dot matrix printer, which necessarily implies more bookkeeping, and secondly, the Java problem can be worked around: either via native code, or a non-Java compiler that generates Dalvik bytecode. [I’ve only mentioned Android throughout because it’s the only mobile platform I’ve developed on.]

Comparing server-side web apps with the good old days is similarly unfair unless the good old days also involved networks, in which case they were really the bad old days. PC gamers of a certain age may remember a myriad of mysterious network options to configure multiplayer mode; imagine the even more mysterious code behind it. As for cloud apps, I would rather work on a cloud app than on an old-school equivalent: BBS software, which involves renting out extra phone lines if you want high availability.

What about client-side web apps? As they can run on the same system used to develop them, it is therefore fair to compare developing them against writing equivalent code in those halcyon days of yore. Let’s look at a couple of examples.


I wrote a tic-tac-toe web app with an AI that plays perfectly because it searches the entire game tree; modern hardware and browsers are so fast that this is bearable (though we’re spared one ply because the human goes first). It works on desktops, laptops, tablets, phones: anything with a browser.

Here’s the minimax game tree search, based on code from John Hughes, Why Functional Programming Matters:

score (Game _ Won 'X') = -1
score (Game _ Won 'O') = 1
score _ = 0

maximize (Node leaf []) = score leaf
maximize (Node _ kids) = maximum (map minimize kids)

minimize (Node leaf []) = score leaf
minimize (Node _ kids) = minimum (map maximize kids)

Despite my scant Haskell knowledge and experience, the source consists of a single file containing fewer than 150 lines like the above, plus a small HTML file: hardly a “multiplicity of languages”. Writing it was enjoyable, and I did so with a text editor in a window 80 characters wide.

Let’s rewind ten to twenty years. I’d have a hard time achieving the brevity and clarity of the above code. The compiler I used didn’t exist, and depending on how far back we go, neither did the language. Not that I’d consider compiling to JavaScript in the first place: depending on how far back we go, it was too slow or didn’t exist.


In my student days, I developed a clone of a Windows puzzle game named Netwalk. I chose C, so users either ran untrusted binaries I supplied (one for each architecture), or built their own binaries from scratch. Forget about running it on phones and PDAs.

I managed my files with tarballs and patches. The source consisted of a few thousand lines, though admittedly much of it was GUI cruft: menus, buttons, textboxes, and so on. Lately, I hacked up a web version of Netwalk. The line count? About 150.

Thanks to Git, you can view the entire source right now on Google Code or GitHub, all dolled up with syntax highlighting and line numbers.

Building native binaries has a certain charm, but I have to admit that a client-side web app has less overhead for developers and users alike. I only need to build the JavaScript once, then anyone with a browser can play.

Thus in this case, my new tools are better than my old tools in every way.

Choose Wisely

The real problem perhaps is the sheer number of choices. Tools have multiplied and diversified, and some indeed impede creativity and productivity. But others are a boon for programmers: they truly just let you code.

Which tools are the best ones? The answer probably depends on the person as well as the application, but I will say for basic client-side web apps and native binaries, I heartily recommend my choices: Haskell, Haste, Git.

I’m confident the above would perform admirably for other kinds of projects. I intend to find out, but at the moment I’m having too much fun coding games.

Play Now!

Tuesday, July 29, 2014

15 Shades of Grey

John Carmack indirectly controlled significant chunks of my life. For hours at a time, I would fight in desperate gun battles in beautiful and terrifying worlds he helped create. On top of this, the technical wizardry of id Software’s games inspired me to spend yet more hours learning how they managed to run Wolfenstein 3D and Doom on PCs, in an era when clockspeeds were measured in megahertz and dedicated graphics cards were rare.

I read about cool tricks like binary space partitioning, and eventually wrote a toy 3D engine of my own. The process increased my respect for the programmers: it’s incredibly difficult to get all the finicky details right while sustaining good frame rates.

Accordingly, I paid close attention when John Carmack spoke about programming languages in his QuakeCon 2013 keynote. Many people, myself included, have strong opinions on programming languages, but few have a track record as impressive as his.

Carmack’s Sneaky Plan

I was flabbergasted by Carmack’s thoughts on the Haskell language. He starts by saying: “My big software evolution over the last, certainly three years and stretching back tendrils a little bit further than that, has been this move towards functional programming style and pure functions.”

He then states that Haskell is not only suitable for programming games, but may also beat typical languages by roughly “a factor of two”, which “would be monumental” and “a really powerful thing for game development”. He has even begun reimplementing Wolfenstein 3D in Haskell as part of a “sneaky plan” to convince others.

Wow! I had always thought Haskell was a pretty but impractical language. I loved composing elegant Haskell snippets to solve problems that one might encounter in interviews and programming contests, but for real stuff I resorted to C.

Among my concerns is garbage collection: I have bad memories of unexpected frequent pauses in Java programs. But Carmack notes that Haskell’s almost uncompromising emphasis on purity simplifies garbage collection to the point where it is a predictable fixed overhead.

A second concern is lazy evaluation. It’s easy to write clear and simple but inefficient Haskell: computing the average of a list of numbers comes to mind. Carmack is also “still not completely sold on the value of laziness”, but evidently it’s not a showstopper for him. I suppose it’s all good so long as there are ways of forcing strict evaluation.

A third concern (but probably not for Carmack) is that I don’t know how to write a Haskell compiler; I’m more at ease with languages when I know how their compilers work. I can ignore this discomfort, though I intend to overcome my ignorance one day. I’m hoping it’s mostly a matter of understanding Hindley-Milner type inference.

Speaking of types, Carmack is a fan of static strong typing, because in his experience, “if it’s syntactically legal, it will make it into the codebase”. He notes that during his recent foray into Haskell, the one time he was horribly confused was due to untyped data from the original Wolfenstein 3D.

My Obvious Plan

Once again, I’m inspired by Carmack. I plan to take Haskell more seriously to see if it really is twice as good. Although I lack the resources to develop a complex game, I may be able to slap together a few prototypes from time to time.

First up is the 15-Puzzle by Noyes Palmer Chapman with a cosmetic change: to avoid loading fonts and rendering text, I replaced the numbers 1 to 15 with increasingly darker shades of grey.

I began with a program depending on SDL. The result was surprisingly playable, and I found the source code surprisingly short in spite of my scant knowledge of Haskell. To better show off my work, I made a few edits to produce a version of my program suitable for the Haste compiler, which compiles Haskell to JavaScript. I added mouse support and tweaked the HTML so the game is tolerable on tablets and phones.

Play now!

Sunday, May 25, 2014

Straw Men in Black

There’s a phrase used to praise a book: “you can’t put it down”. Unfortunately, I felt the opposite while reading The Black Swan by Nassim N. Taleb.

I’ll admit some prejudice. We’re told not to judge a book by its cover, but review quotes in the blurb ought to be exempt. One such quote originated from Peter L. Bernstein, the author of Against the Gods. While I enjoyed reading it, his book contained a litany of elementary mathematical mistakes. Did this mean The Black Swan was similarly full of errors?

All the same, the book began well. Ideas were clear and well-expressed. The writing was confident: perhaps overly so, but who wants to read text that lacks conviction? It promised wonders: we would learn how statisticians have been fooling us, and then learn the right way to deal with uncertainty, with potentially enormous life-changing payoffs.

I failed to reach this part because several chapters in, I was exhausted by a multitude of issues. I had to put the book down. I intend to read further once I’ve recovered, and hopefully the book will redeem itself. Until then, here are a few observations.

One Weird Trick

What’s on the other end of those "one weird trick" online ads? You won’t find out easily. Click one, and you are forced to sit through a video that:

  • makes impressive claims about a product

  • takes pains to keep the product a secret

  • urges the viewer to wait until the end, when they will finally learn the secret

This recipe must be effective, because I couldn’t help feeling the book was similar. It took me on a long path, meandering from anecdote to anecdote, spiced with poorly constructed arguments and sprinkled with assurances that the best was yet to come.

Perhaps this sales tactic has become a necessary evil. With so much competition, how can a book distinguish itself? Additionally, I’m guessing fattening the book for any reason has a positive effect on sales.

Even so, the main idea of the book could be worth reading. I’ll post an update if I find out.

Lay Off Laplace

Chapter 4 features a story about a turkey. As days pass, a turkey’s belief in a proposition such as "I will be cared for tomorrow" grows ever stronger, right until the day of its execution, when its belief turns out to be false. This retelling of a parable about a chicken due to Bertrand Russell is supposed to warn us about inferring knowledge from observations, a repeated theme in the book.

But what about Laplace’s sunrise problem? By the Rule of Succession, if the sun rose every day for 5000 years, that is, for 5000 × 365.2426 days, the odds it will rise tomorrow are only 1826214 to 1. Ever since Laplace wrote about this, he has been mercilessly mocked because of this ludicrously small probability.

So which is it? Do repeated observations make our degrees of belief too strong (chicken) or too weak (sunrise)?

Live long and prosper

Much of this material is discussed in Chapter 18 of Probability Theory: The Logic of Science by Edwin T. Jaynes, which also contains the following story.

A boy turns 10 years old. The Rule of Succession implies the probability he lives one more year is (10 + 1) / (10 + 2), which is 11/12. A similar computation shows his 70-year-old grandfather will live one more year with probability 71/72.
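The arithmetic behind these numbers, and behind the sunrise odds quoted earlier, can be checked with a few lines of Go. This is only a sketch: the function name succession is mine, and I take 5000 × 365.2426 to be 1826213 days.

```go
package main

import "fmt"

// succession applies Laplace's Rule of Succession: after s successes in
// n trials, the probability of success on the next trial is (s+1)/(n+2).
// It returns the numerator and denominator of that fraction.
func succession(s, n int) (num, den int) {
	return s + 1, n + 2
}

func main() {
	cases := []struct {
		label string
		n     int // number of trials, all of them successes
	}{
		{"boy, 10 years", 10},
		{"grandfather, 70 years", 70},
		{"sunrise, 1826213 days", 1826213},
	}
	for _, c := range cases {
		num, den := succession(c.n, c.n)
		// Odds in favour are num : (den - num), which is num : 1 here.
		fmt.Printf("%s: %d/%d, odds %d to %d\n", c.label, num, den, num, den-num)
	}
}
```

The rule only ever sees two numbers, the count of trials and the count of successes, which is exactly why it cannot tell a boy from his grandfather.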

I like this example, because it contains both the chicken and the sunrise problem. Two for the price of one. Shouldn’t the old man’s number be lower than the young boy’s? One number seems too big and the other too small. How can the same rule be wrong in two different ways?

Ignorance is strength?

What should we do to avoid these ridiculous results?

Well, if the sun rose every day for 5000 years and that is all you know, then 1826214 to 1 is correct. The only reason we think this is too low is because we know a lot more than the number of consecutive sunrises: we know about stars, planets, orbits, gravity, and so on. If we take all this into account, our degree of belief that the sun rises tomorrow grows much stronger.

The same goes for the other examples. In each one:

  1. We ignored what we know about the real world.

  2. We calculated based on what little data was left.

  3. We un-ignored the real world so we could laugh at the results.

In other words, we have merely shown that ignoring data leads to bad results. It’s as obvious as noting that if you shut your eyes while driving a car, you’ll end up crashing.

Sadly, despite pointing this out, Laplace became a victim of this folly. Immediately after describing the sunrise problem, Laplace explains that the unacceptable answer arises because of wilfully neglected data. For some reason, his critics take his sunrise problem, ignore his explanation for the hilarious result, then savage his ideas.

The Black Swan joins the peanut gallery in condemning Laplace. However, its conclusion differs from those of most detractors. The true problem is that most of the data is ignored when computing probabilities. Taleb considers addressing this by ignoring even more data! This raises the question: why not toss out more? Why not throw away most of mathematics and assign arbitrary probabilities to arbitrary assertions?

Orthodox statistics is indeed broken, but not because more data should be ignored. It’s broken for the opposite reason: too much data is being ignored.

Poor Laplace. Give the guy a break.

Hempel’s Joke

Stop me if you’ve heard this one: 2 + 2 = 5 for sufficiently large values of 2. This is obviously a joke (though sometimes told so convincingly that the audience is unsure).

Hempel’s Paradox is a similar but less obvious joke that proceeds as follows. Consider the hypothesis: all ravens are black. This is logically equivalent to saying all non-black things are non-ravens. Therefore seeing a white shoe is evidence supporting the hypothesis.

The following Go program makes the attempted humour abundantly clear:

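package main

import "fmt"

func main() {
	// The hypothesis "all ravens are black" starts out true.
	state := true
	for {
		var colour, thing string
		// Read one "colour thing" observation; stop at end of input.
		if _, e := fmt.Scan(&colour, &thing); e != nil {
			break
		}
		// Only a non-black raven can change the state.
		if thing == "raven" && colour != "black" {
			state = false
		}
		fmt.Println(" hypothesis:", state)
	}
}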

A sample run:

black raven
hypothesis: true
white shoe
hypothesis: true
red raven
hypothesis: false
black raven
hypothesis: false
white shoe
hypothesis: false

The state of the hypothesis is represented by a boolean variable. Initially the boolean is true, and it remains true until we encounter a non-black raven. This is the only way to change the state of the program: neither "black raven" nor "white shoe" has any effect.

Saying we have "evidence supporting the hypothesis" is saying there are truer values of true. It’s like saying there are larger values of 2.

The original joke exploits the mathematical concept “sufficiently large” which has applications, but is absurd when applied to constants.

Similarly, Hempel’s joke exploits the concept "supporting evidence", which has applications, but is absurd when applied to a lone hypothesis.

Off by one

If we want to talk about evidence supporting or undermining a hypothesis mathematically, we’ll need to advance beyond boolean logic. Conventionally we represent degrees of belief with numbers between 0 and 1. The higher the number, the stronger the belief. We call these probabilities.

Next, we propose some mutually exclusive hypotheses and assign probabilities between 0 and 1 to each one. The sum of the probabilities must be 1.

If we take a single proposition by itself, such as "all ravens are black", then we’re forced to give it a probability of 1. We’re reduced to the situation above, where the only interesting thing that can happen is that we see a non-black raven and we realize we must restart with a different hypothesis. (In general, probability theory taken to extremes devolves into plain logic.)

We need at least two propositions with nonzero probabilities for the phrase "supporting evidence" to make sense. For example, we might have two propositions A and B, with probabilities of 0.2 and 0.8 respectively. If we find evidence supporting A, then its probability increases and the probability of B decreases accordingly, for their sum must always be 1. Naturally, as before, we may encounter evidence that implies all our propositions are wrong, in which case we must restart with a fresh set of hypotheses.

To avoid nonsense, we require at least two mutually exclusive propositions, such as A: "all ravens are black", and B: "there exists a non-black raven", and each must have a nonzero probability. Now it makes sense to ask if a white shoe is supporting evidence. Does it support A at B’s expense? Or B at A’s expense? Or neither?

The propositions as stated are too vague to answer one way or another. We can make the propositions more specific, but there are infinitely many ways to do so, and the choices we make change the answer. See Chapter 5 of Jaynes.

One Card Trick

Instead of trying to flesh out hypotheses involving ravens, let us content ourselves with a simpler scenario. Suppose a manufacturer of playing cards has a faulty process that sometimes uses black ink instead of red ink to print the entire suit of hearts. We estimate one in ten packs of cards has black hearts instead of red hearts but is otherwise normal, while the other nine packs are perfectly fine.

We’re given a pack of cards from this manufacturer. Thus we believe the hypothesis A: "all hearts are red" with probability 0.9, and B: "there exists a non-red heart" with probability 0.1. We draw a card. It’s the four of clubs. What does this do to our beliefs?

Nothing. Neither hypothesis is affected by this irrelevant evidence. I believe this is at least intuitively clear to most people, and furthermore, had Hempel spoken of hearts and clubs instead of ravens and shoes, his joke would have been more obvious.
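For the sceptical, here is the calculation as a Go sketch. The names update and near are mine, and I assume the card is drawn uniformly at random from a 52-card pack, so that any particular card has probability 1/52 whenever it exists in the pack.

```go
package main

import (
	"fmt"
	"math"
)

// update applies Bayes' rule to a set of mutually exclusive hypotheses:
// each posterior is prior × likelihood, rescaled so the results sum to 1.
func update(prior, likelihood []float64) []float64 {
	post := make([]float64, len(prior))
	total := 0.0
	for i := range prior {
		post[i] = prior[i] * likelihood[i]
		total += post[i]
	}
	for i := range post {
		post[i] /= total
	}
	return post
}

// near reports whether two probabilities agree up to rounding error.
func near(a, b float64) bool { return math.Abs(a-b) < 1e-9 }

func main() {
	// A: all hearts are red. B: there exists a non-red heart.
	prior := []float64{0.9, 0.1}

	// The four of clubs is one card in 52 under either hypothesis.
	clubs := update(prior, []float64{1.0 / 52, 1.0 / 52})
	fmt.Println("four of clubs leaves beliefs unchanged:",
		near(clubs[0], 0.9) && near(clubs[1], 0.1))

	// A black four of hearts cannot occur under A, but is one card in 52 under B.
	black := update(prior, []float64{0, 1.0 / 52})
	fmt.Println("after a black four of hearts:", black)
}
```

Evidence only shifts belief when the hypotheses disagree about how likely it is. The four of clubs is equally likely under A and B, so the common factor cancels; a black heart is impossible under A, so all the belief moves to B.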

Great Idea, Poor Execution

The Black Swan attacks orthodox statistics using Hempel’s paradox, alleging that it shows we should beware of evidence supporting a hypothesis.

It turns out orthodox statistics can be attacked with Hempel’s paradox, but not by claiming "supporting evidence" is meaningless. That would be like claiming "sufficiently large" is meaningless.

Instead, Hempel’s joke reminds us we must consider more than one hypothesis if we want to talk about supporting evidence. This may seem obvious; assigning a degree of belief in a lone proposition is like awarding points in a competition with only one contestant.

However, apparently it is not obvious enough. The Black Swan misses the point, and so did my university professors. My probability and statistics textbook instructs us to consider only one hypothesis. (Actually, it’s worse: one of the steps is to devise an alternate hypothesis, but this second hypothesis is never used in the procedure!)

Mathematics Versus Society

In an off-hand comment, Taleb begins a sentence with “Mathematicians will try to convince you that their science is useful to society by…”

By this point, I already found faults. First and foremost: how often do mathematicians talk about their usefulness to society? There are many jokes about mathematicians and real life, such as:

Engineers believe their equations approximate reality. Physicists believe reality approximates their equations. Mathematicians don’t care.

The truth is being exaggerated for humour, but asserting their work is useful in the real world is evidently a low priority for mathematicians. It is almost a point of pride. In fact, Taleb himself later quotes Hardy:

The “real” mathematics of the “real” mathematicians…is almost wholly “useless”.

This outlook is not new. Gauss called number theory “the queen of mathematics”, because it was pure and beautiful and had no applications in real life. (He had no way of foreseeing that number theory would one day be widely used in real life for secure communication!)

But sure, whatever, let’s suppose mathematicians go around trying to convince others that their field is useful to society. [Presumably Hardy would call such a mathematician “imaginary” or “complex”.] They are trivially right. If you try to talk about how useful things are to society, then you’ll want to measure and compare usefulness of things, all the while justifying your statements with sound logical arguments. Measuring and comparing and logic all lie squarely in the domain of mathematics.

Jumping to Conclusions

So far, I feel the author’s heart is in the right place but his reasoning is flawed. Confirmation bias is indeed pernicious, and orthodox statistics is indeed erroneous. However, The Black Swan knocks down straw men instead of hitting these juicy targets.

The above are but a few examples of the difficulties I ran into while reading the book. I had meant to pick apart more specious arguments but I’ve already written more than I had intended.

Again, I stress I have not read the whole work, and it may improve in the second half.