tag:blogger.com,1999:blog-42222675984598295442024-03-15T01:17:13.731-07:00Ben Lynn's Online GarbageGames, puzzles, coding, maths, etc.Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.comBlogger117125tag:blogger.com,1999:blog-4222267598459829544.post-89028284662334300402020-01-15T08:37:00.000-08:002020-01-15T08:37:24.186-08:00Laziness is next to godliness<div id="preamble">
<div class="sectionbody">
<div class="paragraph"><p>I’ve been working through <a href="https://doi.org/10.1017/CBO9780511576430"><em>Handbook of
Practical Logic and Automated Reasoning</em> by John Harrison</a>. I enjoy translating
its OCaml listings to Haskell: the two languages have much in common, so I can
devote most of my attention to experimentation and exploration.</p></div>
<div class="paragraph"><p>I largely focused on aesthetics. With the help of pattern synonyms, recursion
schemes lead to succinct code for manipulating abstract syntax trees. Monads
and typeclasses reduce clutter. Lazy evaluation simplifies some tasks such as
enumerating all ground instances: we produce a never-ending list rather than
manage drip-fed enumeration with a counter.</p></div>
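<div class="paragraph"><p>For a flavor of that, here is a sketch (my own illustrative code, not the book’s or my port’s) of enumerating all ground terms over a signature as a single lazy list, grouped by depth:</p></div>

```haskell
import Control.Monad (replicateM)

-- First-order terms: a function symbol applied to arguments.
-- Constants are nullary functions.
data Term = Fn String [Term] deriving (Eq, Show)

-- Ground terms grouped by depth, as a lazily built list of levels.
byDepth :: [String] -> [(String, Int)] -> [[Term]]
byDepth consts funcs = lvls
  where
    lvls = [Fn c [] | c <- consts] : map deeper [0 ..]
    -- a term of depth n+1 applies a function symbol to arguments of
    -- depth at most n, at least one of which has depth exactly n
    deeper n =
      [ Fn f args
      | (f, k) <- funcs
      , args <- replicateM k (concat (take (n + 1) lvls))
      , any (`elem` (lvls !! n)) args
      ]

-- One flat, never-ending stream of ground instances; consumers just
-- take as many terms as they need instead of managing a counter.
ground :: [String] -> [(String, Int)] -> [Term]
ground consts funcs = concat (byDepth consts funcs)
```

<div class="paragraph"><p>With one constant c and one unary function f, the stream begins c, f(c), f(f(c)), and so on, produced only as far as it is demanded.</p></div>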
<div class="paragraph"><p>As I’ve come to expect from Haskell, frequently my code just worked with little
or no debugging. But success bred suspicion; my code worked too well!</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_flattening_the_competition">Flattening the competition</h2>
<div class="sectionbody">
<div class="paragraph"><p>My first MESON port solved
<a href="https://www.aaai.org/Papers/AAAI/1984/AAAI84-024.pdf">Schubert’s Steamroller</a> in
about 15 seconds on my laptop in GHCi. I was pleased, but the end of section
3.15 showed how to more efficiently distribute the size bound over subgoals to
get a faster MESON that proves the steamroller "in a reasonable amount of
time".</p></div>
<div class="paragraph"><p>Wow! Did the author feel 15 seconds was sluggish? Would this optimization bring
the time down to 5 seconds? Half a second? I eagerly implemented it to find
out.</p></div>
<div class="paragraph"><p>The change ruined my program. It crawled so slowly that I interrupted it,
afraid it would eat all my system’s resources. In desperation I downloaded
<a href="https://www.cl.cam.ac.uk/~jrh13/atp/">the original OCaml code</a> to investigate
why I was experiencing the opposite of what the book said. I expected to be
awed by its speed, after which, in a fit of jealousy, I’d figure out how I’d
botched my rewrite.</p></div>
<div class="paragraph"><p>Instead, I was shocked to find the OCaml version was even worse. In other words,
my supposedly unoptimized MESON was an order of magnitude faster than the most
advanced MESON in the book. But how? I had merely translated from one language
to another, almost mechanically. Surely bugs were to blame.</p></div>
<div class="paragraph"><p>After spending hours looking for them in vain, it dawned on me that my code
might be correct after all, and I had fortuitously stumbled upon effective
optimizations. Further analysis supported this: I now believe my implementation
of MESON legitimately outperforms the book version due to lazy evaluation.</p></div>
<div class="paragraph"><p>Because of how we use continuations, Haskell memoizes expensive computations
and avoids repeating them during backtracking. It calls to mind a suggestion in
the book to "somehow remember lemmas encountered earlier in proof search".
Adding the sophisticated size bound distribution hampers the reuse of the
memoized continuations because of an additional parameter, which explains why
a purported optimization crippled my program.</p></div>
<div class="paragraph"><p>Normally, lazy evaluation surprises me with an unpleasant space leak. I’m
grateful that for once it surprised me by dramatically boosting performance!</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_got_something_to_prove">Got something to prove?</h2>
<div class="sectionbody">
<div class="paragraph"><p>Thanks to <a href="https://github.com/tweag/asterius">the Asterius GHC WebAssembly backend</a>,
we can use a web browser to confirm the unreasonable effectiveness of a lazy MESON:</p></div>
<div class="ulist"><ul>
<li>
<p>
<a href="https://crypto.stanford.edu/~blynn/compiler/fol.html">First-order logic theorem provers</a>
</p>
</li>
</ul></div>
<div class="paragraph"><p>Click on "Presets", select "steamroller", then click "Lazy MESON".</p></div>
</div>
</div>
Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-9532129948951544032018-11-16T10:39:00.000-08:002018-11-16T10:39:20.472-08:00Lambda the Penultimate<div id="preamble">
<div class="sectionbody">
<div class="paragraph"><p>Lambda expressions have proven so useful that even Java and C++ support them
nowadays. But how do we compile them for a machine to run? No CPU has a lambda
instruction.</p></div>
<div class="paragraph"><p>One strategy is to convert lambda terms into
<a href="https://en.wikipedia.org/wiki/Tacit_programming">point-free code</a>, a process
known as
<a href="https://en.wikipedia.org/wiki/Combinatory_logic#Completeness_of_the_S-K_basis"><em>bracket
abstraction</em></a>. One such algorithm rewrites any program in terms of two
functions: the S and K combinators.
<a href="https://crypto.stanford.edu/~blynn/lambda/sk.html">We can build a compiler by
assembling just two simple functions</a>.</p></div>
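<div class="paragraph"><p>For concreteness, here is a minimal sketch of that classic algorithm (my own hypothetical names, not the linked compiler’s code), together with a tiny evaluator:</p></div>

```haskell
-- Classic SK bracket abstraction: every lambda is eliminated in favor
-- of S and K, using SKK as the identity. A sketch for illustration.
infixl 5 :@
data Term = Var String | Term :@ Term | L String Term | S | K
  deriving (Eq, Show)

compile :: Term -> Term
compile (a :@ b)   = compile a :@ compile b
compile (L x body) = abstract x (compile body)
compile t          = t

-- abstract x t: rewrite t so the variable x no longer appears
abstract :: String -> Term -> Term
abstract x (Var y) | x == y = S :@ K :@ K                       -- I = SKK
abstract x (a :@ b)         = S :@ abstract x a :@ abstract x b
abstract _ t                = K :@ t         -- constants, other variables

-- weak head normal form; enough to run small closed examples
whnf :: Term -> Term
whnf (a :@ b) = case whnf a of
  K :@ x      -> whnf x
  S :@ x :@ y -> whnf (x :@ b :@ (y :@ b))
  a'          -> a' :@ b
whnf t = t
```

<div class="paragraph"><p>Note the unconditional S rule for applications: it fires even when the bound variable is absent, which is exactly why the output balloons without further rewrite rules.</p></div>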
<div class="paragraph"><p>Unfortunately, even with extra rewrite rules, classic bracket abstraction
yields monstrous, unwieldy expressions. Decades ago, researchers worked around
this problem by building custom combinators tailored to each input program, known
as <em>supercombinators</em>. Compilation is trickier, but at least the output is
reasonable.</p></div>
<div class="paragraph"><p>But recently,
<a href="http://okmij.org/ftp/tagless-final/ski.pdf">Oleg Kiselyov found <strong>a time-linear
and space-linear bracket abstraction algorithm</strong></a>, provided the De Bruijn
indices of the input term are encoded in unary. It takes only 20 lines:</p></div>
<div class="listingblock">
<div class="content"><!-- Generator: GNU source-highlight 3.1.8
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
<pre><tt><span style="font-weight: bold"><span style="color: #0000FF">data</span></span> <span style="color: #009900">Deb</span> <span style="color: #990000">=</span> <span style="color: #009900">Zero</span> <span style="color: #990000">|</span> <span style="color: #009900">Succ</span> <span style="color: #009900">Deb</span> <span style="color: #990000">|</span> <span style="color: #009900">Lam</span> <span style="color: #009900">Deb</span> <span style="color: #990000">|</span> <span style="color: #009900">App</span> <span style="color: #009900">Deb</span> <span style="color: #009900">Deb</span> <span style="font-weight: bold"><span style="color: #0000FF">deriving</span></span> <span style="color: #009900">Show</span>
<span style="font-weight: bold"><span style="color: #0000FF">infixl</span></span> <span style="color: #993399">5</span> <span style="color: #990000">:#</span>
<span style="font-weight: bold"><span style="color: #0000FF">data</span></span> <span style="color: #009900">Com</span> <span style="color: #990000">=</span> <span style="color: #009900">Com</span> <span style="color: #990000">:#</span> <span style="color: #009900">Com</span> <span style="color: #990000">|</span> <span style="color: #009900">S</span> <span style="color: #990000">|</span> <span style="color: #009900">I</span> <span style="color: #990000">|</span> <span style="color: #009900">C</span> <span style="color: #990000">|</span> <span style="color: #009900">K</span> <span style="color: #990000">|</span> <span style="color: #009900">B</span> <span style="color: #990000">|</span> <span style="color: #009900">Sn</span> <span style="color: #009900">Int</span> <span style="color: #990000">|</span> <span style="color: #009900">Bn</span> <span style="color: #009900">Int</span> <span style="color: #990000">|</span> <span style="color: #009900">Cn</span> <span style="color: #009900">Int</span>
ski <span style="color: #990000">::</span> <span style="color: #009900">Deb</span> <span style="color: #990000">-></span> <span style="color: #990000">(</span><span style="color: #009900">Int</span><span style="color: #990000">,</span> <span style="color: #009900">Com</span><span style="color: #990000">)</span>
ski deb <span style="color: #990000">=</span> <span style="font-weight: bold"><span style="color: #0000FF">case</span></span> deb <span style="font-weight: bold"><span style="color: #0000FF">of</span></span>
<span style="color: #009900">Zero</span> <span style="color: #990000">-></span> <span style="color: #990000">(</span><span style="color: #993399">1</span><span style="color: #990000">,</span> <span style="color: #009900">I</span><span style="color: #990000">)</span>
<span style="color: #009900">Succ</span> d <span style="color: #990000">|</span> x<span style="color: #990000">@(</span>n<span style="color: #990000">,</span> <span style="font-weight: bold"><span style="color: #0000FF">_</span></span><span style="color: #990000">)</span> <span style="color: #990000"><-</span> ski d <span style="color: #990000">-></span> <span style="color: #990000">(</span>n <span style="color: #990000">+</span> <span style="color: #993399">1</span><span style="color: #990000">,</span> f <span style="color: #990000">(</span><span style="color: #993399">0</span><span style="color: #990000">,</span> <span style="color: #009900">K</span><span style="color: #990000">)</span> x<span style="color: #990000">)</span>
<span style="color: #009900">App</span> d1 d2 <span style="color: #990000">|</span> x<span style="color: #990000">@(</span>a<span style="color: #990000">,</span> <span style="font-weight: bold"><span style="color: #0000FF">_</span></span><span style="color: #990000">)</span> <span style="color: #990000"><-</span> ski d1
<span style="color: #990000">,</span> y<span style="color: #990000">@(</span>b<span style="color: #990000">,</span> <span style="font-weight: bold"><span style="color: #0000FF">_</span></span><span style="color: #990000">)</span> <span style="color: #990000"><-</span> ski d2 <span style="color: #990000">-></span> <span style="color: #990000">(</span>max a b<span style="color: #990000">,</span> f x y<span style="color: #990000">)</span>
<span style="color: #009900">Lam</span> d <span style="color: #990000">|</span> <span style="color: #990000">(</span>n<span style="color: #990000">,</span> e<span style="color: #990000">)</span> <span style="color: #990000"><-</span> ski d <span style="color: #990000">-></span> <span style="font-weight: bold"><span style="color: #0000FF">case</span></span> n <span style="font-weight: bold"><span style="color: #0000FF">of</span></span>
<span style="color: #993399">0</span> <span style="color: #990000">-></span> <span style="color: #990000">(</span><span style="color: #993399">0</span><span style="color: #990000">,</span> <span style="color: #009900">K</span> <span style="color: #990000">:#</span> e<span style="color: #990000">)</span>
<span style="font-weight: bold"><span style="color: #0000FF">_</span></span> <span style="color: #990000">-></span> <span style="color: #990000">(</span>n <span style="color: #990000">-</span> <span style="color: #993399">1</span><span style="color: #990000">,</span> e<span style="color: #990000">)</span>
<span style="font-weight: bold"><span style="color: #0000FF">where</span></span>
f <span style="color: #990000">(</span>a<span style="color: #990000">,</span> x<span style="color: #990000">)</span> <span style="color: #990000">(</span>b<span style="color: #990000">,</span> y<span style="color: #990000">)</span> <span style="color: #990000">=</span> <span style="font-weight: bold"><span style="color: #0000FF">case</span></span> <span style="color: #990000">(</span>a<span style="color: #990000">,</span> b<span style="color: #990000">)</span> <span style="font-weight: bold"><span style="color: #0000FF">of</span></span>
<span style="color: #990000">(</span><span style="color: #993399">0</span><span style="color: #990000">,</span> <span style="color: #993399">0</span><span style="color: #990000">)</span> <span style="color: #990000">-></span> x <span style="color: #990000">:#</span> y
<span style="color: #990000">(</span><span style="color: #993399">0</span><span style="color: #990000">,</span> n<span style="color: #990000">)</span> <span style="color: #990000">-></span> <span style="color: #009900">Bn</span> n <span style="color: #990000">:#</span> x <span style="color: #990000">:#</span> y
<span style="color: #990000">(</span>n<span style="color: #990000">,</span> <span style="color: #993399">0</span><span style="color: #990000">)</span> <span style="color: #990000">-></span> <span style="color: #009900">Cn</span> n <span style="color: #990000">:#</span> x <span style="color: #990000">:#</span> y
<span style="color: #990000">(</span>n<span style="color: #990000">,</span> m<span style="color: #990000">)</span> <span style="color: #990000">|</span> n <span style="color: #990000">==</span> m <span style="color: #990000">-></span> <span style="color: #009900">Sn</span> n <span style="color: #990000">:#</span> x <span style="color: #990000">:#</span> y
<span style="color: #990000">|</span> n <span style="color: #990000"><</span> m <span style="color: #990000">-></span> <span style="color: #009900">Bn</span> <span style="color: #990000">(</span>m <span style="color: #990000">-</span> n<span style="color: #990000">)</span> <span style="color: #990000">:#</span> <span style="color: #990000">(</span><span style="color: #009900">Sn</span> n <span style="color: #990000">:#</span> x<span style="color: #990000">)</span> <span style="color: #990000">:#</span> y
<span style="color: #990000">|</span> otherwise <span style="color: #990000">-></span> <span style="color: #009900">Cn</span> <span style="color: #990000">(</span>n <span style="color: #990000">-</span> m<span style="color: #990000">)</span> <span style="color: #990000">:#</span> <span style="color: #990000">(</span><span style="color: #009900">Bn</span> <span style="color: #990000">(</span>n <span style="color: #990000">-</span> m<span style="color: #990000">)</span> <span style="color: #990000">:#</span> <span style="color: #009900">Sn</span> m <span style="color: #990000">:#</span> x<span style="color: #990000">)</span> <span style="color: #990000">:#</span> y</tt></pre></div></div>
<div class="paragraph"><p>Our <span class="monospaced">ski</span> function returns an integer and a combinatory logic term equivalent
to the input lambda term. The integer is zero if the given lambda term is
closed; for an open term, it’s the number of lambdas needed to close it.</p></div>
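<div class="paragraph"><p>A couple of sanity checks, using a self-contained copy of the code above with Eq derived on Com so results can be compared:</p></div>

```haskell
data Deb = Zero | Succ Deb | Lam Deb | App Deb Deb deriving Show
infixl 5 :#
data Com = Com :# Com | S | I | C | K | B | Sn Int | Bn Int | Cn Int
  deriving (Eq, Show)

ski :: Deb -> (Int, Com)
ski deb = case deb of
  Zero -> (1, I)
  Succ d | x@(n, _) <- ski d -> (n + 1, f (0, K) x)
  App d1 d2 | x@(a, _) <- ski d1
            , y@(b, _) <- ski d2 -> (max a b, f x y)
  Lam d | (n, e) <- ski d -> case n of
    0 -> (0, K :# e)
    _ -> (n - 1, e)
  where
  f (a, x) (b, y) = case (a, b) of
    (0, 0)          -> x :# y
    (0, n)          -> Bn n :# x :# y
    (n, 0)          -> Cn n :# x :# y
    (n, m) | n == m -> Sn n :# x :# y
           | n < m  -> Bn (m - n) :# (Sn n :# x) :# y
           | otherwise -> Cn (n - m) :# (Bn (n - m) :# Sn m :# x) :# y

-- λx.x becomes I; λx.λy.y becomes K I; λx.λy.x becomes B_1 K I,
-- which behaves like K: B_1 K I x y = K (I x) y = x.
checks :: [Bool]
checks =
  [ ski (Lam Zero) == (0, I)
  , ski (Lam (Lam Zero)) == (0, K :# I)
  , ski (Lam (Lam (Succ Zero))) == (0, Bn 1 :# K :# I)
  ]
```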
<div class="paragraph"><p>It uses bulk variants of the B, C, and S combinators:</p></div>
<div class="listingblock">
<div class="content">
<pre><tt>linBulk <span style="color: #990000">::</span> <span style="color: #009900">Com</span> <span style="color: #990000">-></span> <span style="color: #009900">Com</span>
linBulk b <span style="color: #990000">=</span> <span style="font-weight: bold"><span style="color: #0000FF">case</span></span> b <span style="font-weight: bold"><span style="color: #0000FF">of</span></span>
<span style="color: #009900">Bn</span> n <span style="color: #990000">-></span> iterate <span style="color: #990000">((</span><span style="color: #009900">B</span><span style="color: #990000">:#</span> <span style="color: #009900">B</span><span style="color: #990000">):#)</span> <span style="color: #009900">B</span> <span style="color: #990000">!!</span> <span style="color: #990000">(</span>n <span style="color: #990000">-</span> <span style="color: #993399">1</span><span style="color: #990000">)</span>
<span style="color: #009900">Cn</span> n <span style="color: #990000">-></span> iterate <span style="color: #990000">((</span><span style="color: #009900">B</span><span style="color: #990000">:#(</span><span style="color: #009900">B</span><span style="color: #990000">:#</span><span style="color: #009900">C</span><span style="color: #990000">):#</span><span style="color: #009900">B</span><span style="color: #990000">):#)</span> <span style="color: #009900">C</span> <span style="color: #990000">!!</span> <span style="color: #990000">(</span>n <span style="color: #990000">-</span> <span style="color: #993399">1</span><span style="color: #990000">)</span>
<span style="color: #009900">Sn</span> n <span style="color: #990000">-></span> iterate <span style="color: #990000">((</span><span style="color: #009900">B</span><span style="color: #990000">:#(</span><span style="color: #009900">B</span><span style="color: #990000">:#</span><span style="color: #009900">S</span><span style="color: #990000">):#</span><span style="color: #009900">B</span><span style="color: #990000">):#)</span> <span style="color: #009900">S</span> <span style="color: #990000">!!</span> <span style="color: #990000">(</span>n <span style="color: #990000">-</span> <span style="color: #993399">1</span><span style="color: #990000">)</span>
x <span style="color: #990000">:#</span> y <span style="color: #990000">-></span> linBulk x <span style="color: #990000">:#</span> linBulk y
<span style="font-weight: bold"><span style="color: #0000FF">_</span></span> <span style="color: #990000">-></span> b</tt></pre></div></div>
<div class="paragraph"><p>Linear complexity depends on memoizing these bulk combinators. If memoization
is undesirable, we can
<a href="https://crypto.stanford.edu/~blynn/lambda/logski.html">replace each bulk
combinator of order n with O(log n) ordinary combinators</a>.</p></div>
<div class="listingblock">
<div class="content">
<pre><tt>logBulk <span style="color: #990000">::</span> <span style="color: #009900">Com</span> <span style="color: #990000">-></span> <span style="color: #009900">Com</span>
logBulk b <span style="color: #990000">=</span> <span style="font-weight: bold"><span style="color: #0000FF">case</span></span> b <span style="font-weight: bold"><span style="color: #0000FF">of</span></span>
<span style="color: #009900">Bn</span> n <span style="color: #990000">-></span> go n <span style="color: #990000">(</span><span style="color: #009900">K</span><span style="color: #990000">:#</span><span style="color: #009900">I</span><span style="color: #990000">)</span> <span style="color: #990000">:#</span> <span style="color: #009900">B</span> <span style="color: #990000">:#</span> <span style="color: #009900">I</span>
<span style="color: #009900">Cn</span> n <span style="color: #990000">-></span> go n <span style="color: #990000">(</span><span style="color: #009900">K</span><span style="color: #990000">:#(</span><span style="color: #009900">C</span><span style="color: #990000">:#</span><span style="color: #009900">I</span><span style="color: #990000">:#</span><span style="color: #009900">I</span><span style="color: #990000">))</span> <span style="color: #990000">:#</span> <span style="color: #990000">(</span><span style="color: #009900">B</span><span style="color: #990000">:#(</span><span style="color: #009900">B</span><span style="color: #990000">:#</span><span style="color: #009900">C</span><span style="color: #990000">):#</span><span style="color: #009900">B</span><span style="color: #990000">)</span> <span style="color: #990000">:#</span> <span style="color: #009900">I</span>
<span style="color: #009900">Sn</span> n <span style="color: #990000">-></span> go n <span style="color: #990000">(</span><span style="color: #009900">K</span><span style="color: #990000">:#(</span><span style="color: #009900">C</span><span style="color: #990000">:#</span><span style="color: #009900">I</span><span style="color: #990000">:#</span><span style="color: #009900">I</span><span style="color: #990000">))</span> <span style="color: #990000">:#</span> <span style="color: #990000">(</span><span style="color: #009900">B</span><span style="color: #990000">:#(</span><span style="color: #009900">B</span><span style="color: #990000">:#</span><span style="color: #009900">S</span><span style="color: #990000">):#</span><span style="color: #009900">B</span><span style="color: #990000">)</span> <span style="color: #990000">:#</span> <span style="color: #009900">I</span>
x <span style="color: #990000">:#</span> y <span style="color: #990000">-></span> logBulk x <span style="color: #990000">:#</span> logBulk y
<span style="font-weight: bold"><span style="color: #0000FF">_</span></span> <span style="color: #990000">-></span> b
<span style="font-weight: bold"><span style="color: #0000FF">where</span></span>
go n base <span style="color: #990000">=</span> foldr <span style="color: #990000">(:#)</span> base <span style="color: #990000">$</span> <span style="color: #990000">([</span>b0<span style="color: #990000">,</span> b1<span style="color: #990000">]!!)</span> <span style="color: #990000"><$></span> bits <span style="color: #990000">[]</span> n
bits acc <span style="color: #993399">0</span> <span style="color: #990000">=</span> reverse acc
bits acc n <span style="color: #990000">|</span> <span style="color: #990000">(</span>q<span style="color: #990000">,</span> r<span style="color: #990000">)</span> <span style="color: #990000"><-</span> divMod n <span style="color: #993399">2</span> <span style="color: #990000">=</span> bits <span style="color: #990000">(</span>r<span style="color: #990000">:</span>acc<span style="color: #990000">)</span> q
b0 <span style="color: #990000">=</span> <span style="color: #009900">C</span><span style="color: #990000">:#</span><span style="color: #009900">B</span><span style="color: #990000">:#(</span><span style="color: #009900">S</span><span style="color: #990000">:#</span><span style="color: #009900">B</span><span style="color: #990000">:#</span><span style="color: #009900">I</span><span style="color: #990000">)</span>
b1 <span style="color: #990000">=</span> <span style="color: #009900">C</span><span style="color: #990000">:#(</span><span style="color: #009900">B</span><span style="color: #990000">:#</span><span style="color: #009900">S</span><span style="color: #990000">:#(</span><span style="color: #009900">B</span><span style="color: #990000">:#(</span><span style="color: #009900">B</span><span style="color: #990000">:#</span><span style="color: #009900">B</span><span style="color: #990000">):#(</span><span style="color: #009900">C</span><span style="color: #990000">:#</span><span style="color: #009900">B</span><span style="color: #990000">:#(</span><span style="color: #009900">S</span><span style="color: #990000">:#</span><span style="color: #009900">B</span><span style="color: #990000">:#</span><span style="color: #009900">I</span><span style="color: #990000">)))):#</span><span style="color: #009900">B</span></tt></pre></div></div>
<div class="paragraph"><p>For example:</p></div>
<div class="listingblock">
<div class="content monospaced">
<pre>λ print $ logBulk $ Sn 1234
CB(SBI)(C(BS(B(BB)(CB(SBI))))B(CB(SBI)(CB(SBI)(C(BS(B(BB)(CB
(SBI))))B(CB(SBI)(C(BS(B(BB)(CB(SBI))))B(C(BS(B(BB)(CB(SBI))
))B(CB(SBI)(CB(SBI)(C(BS(B(BB)(CB(SBI))))B(K(CII))))))))))))
(B(BS)B)I
λ print $ logBulk $ Bn 1234
CB(SBI)(C(BS(B(BB)(CB(SBI))))B(CB(SBI)(CB(SBI)(C(BS(B(BB)(CB
(SBI))))B(CB(SBI)(C(BS(B(BB)(CB(SBI))))B(C(BS(B(BB)(CB(SBI))
))B(CB(SBI)(CB(SBI)(C(BS(B(BB)(CB(SBI))))B(KI)))))))))))BI</pre>
</div></div>
<div class="paragraph"><p>For completeness, we include our pretty-printing code:</p></div>
<div class="listingblock">
<div class="content">
<pre><tt><span style="font-weight: bold"><span style="color: #0000FF">instance</span></span> <span style="color: #009900">Show</span> <span style="color: #009900">Com</span> <span style="font-weight: bold"><span style="color: #0000FF">where</span></span>
show <span style="color: #009900">S</span> <span style="color: #990000">=</span> <span style="color: #FF0000">"S"</span>
show <span style="color: #009900">I</span> <span style="color: #990000">=</span> <span style="color: #FF0000">"I"</span>
show <span style="color: #009900">C</span> <span style="color: #990000">=</span> <span style="color: #FF0000">"C"</span>
show <span style="color: #009900">K</span> <span style="color: #990000">=</span> <span style="color: #FF0000">"K"</span>
show <span style="color: #009900">B</span> <span style="color: #990000">=</span> <span style="color: #FF0000">"B"</span>
show <span style="color: #990000">(</span>l <span style="color: #990000">:#</span> r<span style="color: #990000">@(</span><span style="font-weight: bold"><span style="color: #0000FF">_</span></span> <span style="color: #990000">:#</span> <span style="font-weight: bold"><span style="color: #0000FF">_</span></span><span style="color: #990000">))</span> <span style="color: #990000">=</span> show l <span style="color: #990000">++</span> <span style="color: #FF0000">"("</span> <span style="color: #990000">++</span> show r <span style="color: #990000">++</span> <span style="color: #FF0000">")"</span>
show <span style="color: #990000">(</span>l <span style="color: #990000">:#</span> r<span style="color: #990000">)</span> <span style="color: #990000">=</span> show l <span style="color: #990000">++</span> show r
show <span style="color: #990000">(</span><span style="color: #009900">Bn</span> n<span style="color: #990000">)</span> <span style="color: #990000">=</span> <span style="color: #FF0000">"B_"</span> <span style="color: #990000">++</span> show n
show <span style="color: #990000">(</span><span style="color: #009900">Cn</span> n<span style="color: #990000">)</span> <span style="color: #990000">=</span> <span style="color: #FF0000">"C_"</span> <span style="color: #990000">++</span> show n
show <span style="color: #990000">(</span><span style="color: #009900">Sn</span> n<span style="color: #990000">)</span> <span style="color: #990000">=</span> <span style="color: #FF0000">"S_"</span> <span style="color: #990000">++</span> show n</tt></pre></div></div>
<div class="paragraph"><p>In other words, we can easily rewrite a lambda term of length N as a
combinatory logic term of length O(N log N).</p></div>
<div class="paragraph"><p><a href="https://www.youtube.com/watch?v=zhj_tUMwTe0">Edward Kmett outlines a different
approach (about 21 minutes into the video)</a> though the details are so
"horrific" that even he has yet to work through them. By the way, the slides
from this talk are packed with excellent references on combinators.</p></div>
<div class="paragraph"><p>Kiselyov laments that bracket abstraction has "many descriptions and
explanations and blogs", all of which take a syntactic approach. I’m one of the guilty
parties, and hope to redeem myself with this post. Also, I rewrote
<a href="https://crypto.stanford.edu/~blynn/lambda/crazyl.html">one of my toy compilers</a>
to demonstrate another algorithm from Kiselyov’s paper. Though not linear, the
algorithm avoids bulk combinators and often produces short and sweet programs.</p></div>
</div>
</div>
Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-57848121811306225042018-06-26T09:43:00.000-07:002018-08-31T04:33:02.566-07:00Why Laziness Matters<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>Should a programming language be lazy by default?
<a href="https://existentialtype.wordpress.com/2011/04/24/the-real-point-of-laziness/">Robert Harper says no</a>.
<a href="http://augustss.blogspot.com/2011/05/more-points-for-lazy-evaluation-in.html">Lennart Augustsson says yes</a>.
No matter who is right, I say all computer scientists should become fluent
in a lazy language, whether or not they speak it in daily life.</p>
</div>
<div class="paragraph">
<p>My evidence is <a href="https://research.swtch.com/yaccalive">a post by Russ Cox on
parsing with derivatives</a>: a very experienced programmer very convincingly
argues why a parsing algorithm has exponential time complexity. But the claims
are very wrong; <a href="https://arxiv.org/abs/1604.04695">Adams, Hollenbeck, and Might
proved the algorithm is cubic</a>.</p>
</div>
<div class="paragraph">
<p>How did he err so badly? Did he underestimate the power of lazy evaluation?</p>
</div>
<div class="paragraph">
<p>I once exclusively wrote eager code, and I imagine my younger self would have
agreed with his analysis without a second thought. Today I know better.
Marvel at these
<a href="http://www.cs.dartmouth.edu/~doug/powser.html">lines by Doug McIlroy</a>:</p>
</div>
<div class="listingblock">
<div class="content">
<pre>int fs = 0 : zipWith (/) fs [1..] -- integral from 0 to x
sins = int coss
coss = 1 - int sins</pre>
</div>
</div>
<div class="paragraph">
<p>It seems too good to be true. Indistinguishable from magic perhaps. But somehow
it all works when lazily evaluated. Beware of summarily dismissing lazy code
because it looks implausibly amazing.</p>
</div>
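<div class="paragraph"><p>McIlroy’s lines rely on a Num instance for coefficient lists, which his paper develops in full. To try them yourself, the following minimal instance is enough (my own stopgap: multiplication is included, division and composition are not):</p></div>

```haskell
-- Power series as (lazy, usually infinite) coefficient lists, with
-- just enough Num machinery for `1 - int sins` to typecheck.
instance Num a => Num [a] where
  fromInteger n       = fromInteger n : repeat 0
  negate              = map negate
  (f:fs) + (g:gs)     = f + g : fs + gs
  (f:fs) * gs@(g:gs') = f * g : map (f *) gs' + fs * gs
  abs    = undefined   -- unused in this demo
  signum = undefined   -- unused in this demo

int :: Fractional a => [a] -> [a]
int fs = 0 : zipWith (/) fs [1..]   -- integral from 0 to x

sins, coss :: [Rational]
sins = int coss
coss = 1 - int sins
```

<div class="paragraph"><p>In GHCi, take 6 sins yields 0, 1, 0, -1/6, 0, 1/120: the Taylor coefficients of sin x, conjured from two mutually recursive one-liners.</p></div>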
<div class="paragraph">
<p>Also consider <a href="https://swtch.com/~rsc/regexp/regexp1.html">an earlier article
by Cox on regular expressions</a>. Again, a very experienced
programmer very convincingly argues why a parsing algorithm has exponential
time complexity. In this post, however, the claims are solid, and backed up by
graphs of running times. (It’s worth reading by the way: it tells the tragedy
of how popular regular expression implementations became sluggish twisted
mockeries of true regular expressions, while offering hope for the future. My
only criticism is it fails to mention
<a href="https://crypto.stanford.edu/~blynn/haskell/re.html">regular expression
derivatives</a>.)</p>
</div>
<div class="paragraph">
<p>Why does the erroneous post lack similar graphs? Why didn’t the author throw
some code together and benchmark it to produce damning evidence?</p>
</div>
<div class="paragraph">
<p>Perhaps he thought it was too tedious. This would imply unfamiliarity with
lazy languages, because prototyping
<a href="http://matt.might.net/papers/might2011derivatives.pdf">parsing with derivatives</a>
in Haskell is easier than criticizing it.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_preliminaries">Preliminaries</h2>
<div class="sectionbody">
<div class="paragraph">
<p>We define a <code>Pe</code> data structure to represent parsing expressions, that is, the
right-hand sides of a grammar’s production rules.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-haskell" data-lang="haskell">import Control.Arrow
import Control.Monad.State
import qualified Data.Map as M
import qualified Data.Set as S
-- NT = non-terminal. (:.) = concatenation.
data Pe = NT String | Eps Char | Nul | Ch Char | Or [Pe] | Pe :. Pe | Del Pe</code></pre>
</div>
</div>
<div class="paragraph">
<p>Although it represents the empty string, the <code>Eps</code> (for epsilon) expression
holds a character that winds up in the abstract syntax tree (AST) returned by
the parser. Similarly, the <code>Del</code> (for delta) expression, which is only generated
internally, holds an expression which later helps build an AST.</p>
</div>
<div class="paragraph">
<p>A context-free grammar maps non-terminal symbols to parsing expressions:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-haskell" data-lang="haskell">type Grammar = M.Map String Pe</code></pre>
</div>
</div>
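<div class="paragraph"><p>For instance, the balanced-parentheses grammar S → "(" S ")" S | ε can be written with these types as follows (a hypothetical example; the breadcrumb character in the Eps is arbitrary):</p></div>

```haskell
import qualified Data.Map as M

data Pe = NT String | Eps Char | Nul | Ch Char | Or [Pe] | Pe :. Pe | Del Pe
type Grammar = M.Map String Pe

-- S -> '(' S ')' S | epsilon
parens :: Grammar
parens = M.fromList
  [ ("S", Or [ Ch '(' :. (NT "S" :. (Ch ')' :. NT "S"))
             , Eps '_' ])
  ]
```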
<div class="paragraph">
<p>Our ASTs are full binary trees whose leaf nodes are characters
(<a href="https://en.wikipedia.org/wiki/Magma_(algebra)#Free_magma">the free magma on the
alphabet</a>). The tree structure captures the order in which the production
rules are applied.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-haskell" data-lang="haskell">data Ast = Bad | Lf Char | Ast :@ Ast deriving Show
isBad :: Ast -> Bool
isBad Bad = True
isBad _ = False</code></pre>
</div>
</div>
<div class="paragraph">
<p>The <code>Bad</code> AST is returned for unparseable strings. An alternative is to drop
<code>Bad</code> and replace <code>Ast</code> with <code>Maybe Ast</code> throughout our code.</p>
</div>
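<div class="paragraph">
<p>To see why the <code>Maybe</code> alternative is attractive (a hypothetical
sketch; the code below keeps <code>Bad</code>): the helpers <code>mul</code> and
<code>chainsaw</code> defined later would collapse into standard combinators.</p>
</div>

```haskell
import Data.Foldable (asum)

-- Hypothetical Ast without a Bad constructor; failure becomes Nothing.
data Ast = Lf Char | Ast :@ Ast deriving (Eq, Show)

-- `mul` turns into the Applicative product: Nothing if either side fails.
mul :: Maybe Ast -> Maybe Ast -> Maybe Ast
mul x y = (:@) <$> x <*> y

-- `chainsaw` turns into asum, which picks the first Just in the list.
chainsaw :: [Maybe Ast] -> Maybe Ast
chainsaw = asum
```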
<div class="paragraph">
<p>A fancier parser might return a parse forest, that is, all parse
trees for a given input. Ours simply settles on one parse tree.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_parsing_with_derivatives">Parsing with derivatives</h2>
<div class="sectionbody">
<div class="paragraph">
<p>To parse an input string, we first take successive derivatives of the start
symbol with respect to each character of the input, taking care to leave bread
crumbs in the <code>Eps</code> and <code>Del</code> expressions to record consumed characters. (The
<code>Del</code> constructor is named for the delta symbol from the paper, but I also
think of it as "deleted", because it remembers what has just been deleted from
the input.)</p>
</div>
<div class="paragraph">
<p>Then the string is accepted if and only if the resulting expression is
nullable, that is, accepts the empty string. As we traverse the expression to
determine nullability, we also build an AST to return.</p>
</div>
<div class="paragraph">
<p>We memoize derivatives by adding entries to a state of type <code>Grammar</code>.
Initially, this cache contains only the input grammar, mapping nonterminal
symbols to <code>Pe</code> values. Later, we place a derivative at the key formed by
concatenating the characters involved in the derivative with the
nonterminal symbol being derived.</p>
</div>
<div class="paragraph">
<p>For example, if <code>S</code> is a nonterminal in the input grammar, then
<code>abS</code> maps to <code>derive 'a' (derive 'b' (NT "S"))</code>. We assume no nonterminal
symbol in the input grammar is a suffix of any other nonterminal symbol, which
is fine for a prototype.</p>
</div>
<div class="paragraph">
<p>It may help to imagine the grammar growing over time, gaining new production
rules as we process input characters. Indeed, we consider nonterminals to refer
to both the nonterminals in the input grammar and their derivatives.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-haskell" data-lang="haskell">parse :: Grammar -> String -> String -> Ast
parse g start s = evalState (parseNull $ NT $ reverse s ++ start) g</code></pre>
</div>
</div>
<div class="paragraph">
<p>Computing nullability requires finding a least fixed point. I found this
the toughest part of the algorithm, partly because they never taught fixed
point theory when I was in school. For some reason, the method reminds me of
<a href="https://en.wikipedia.org/wiki/DFA_minimization">Hopcroft’s algorithm to minimize
a DFA</a>, where we repeatedly refine a partition until we reach a stable answer.</p>
</div>
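<div class="paragraph">
<p>Stripped of parsing details, the computation is Kleene iteration: start from
a bottom guess and reapply a monotone step function until nothing changes. Here
is a standalone sketch (a hypothetical helper for illustration, not used by the
parser below) computing nullability for the toy grammar S → aS | T, T → ε:</p>
</div>

```haskell
import qualified Data.Map as M

-- Iterate a step function from an initial guess until it stabilizes.
fixpoint :: Eq a => (a -> a) -> a -> a
fixpoint step x = let x' = step x in if x' == x then x else fixpoint step x'

-- Nullability for S -> aS | T, T -> ε: guess all-False, then refine.
nullables :: M.Map String Bool
nullables = fixpoint step (M.fromList [("S", False), ("T", False)])
  where
    step m = M.fromList
      [ ("S", False || m M.! "T")  -- aS is never nullable; T might be
      , ("T", True)                -- ε is nullable
      ]
```

The first pass discovers T is nullable; the second propagates that to S; the third changes nothing, so we stop.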
<div class="paragraph">
<p>We initially guess each nonterminal is not nullable, which means it
corresponds to the <code>Bad</code> AST. On encountering a nonterminal, if we’ve already
seen it, then return our guess for that nonterminal. Otherwise, it’s the first
time we’ve seen it and instead of guessing, we recursively traverse its
corresponding expression. In doing so, we may discover our guess is wrong, so
we correct it if necessary before returning an AST.</p>
</div>
<div class="paragraph">
<p>We repeat until our guesses stabilize.
Guesses never change from a good AST to <code>Bad</code>, and the map of all
guesses only changes if a guess is revised from <code>Bad</code> to a good AST.
We exploit these facts to simplify our code slightly.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-haskell" data-lang="haskell">parseNull :: Pe -> State Grammar Ast
parseNull pe = leastFix M.empty where
leastFix guessed = do
(b, (_, guessed')) <- runStateT (visit pe) (S.empty, guessed)
if M.size guessed == M.size guessed' then pure b else leastFix guessed'
visit :: Pe -> StateT (S.Set String, M.Map String Ast) (State Grammar) Ast
visit pe = case pe of
Eps x -> pure $ Lf x
Del x -> visit x
Nul -> pure Bad
Ch _ -> pure Bad
Or xs -> chainsaw <$> mapM visit xs
x :. y -> mul <$> visit x <*> visit y
NT s -> do
(seen, guessed) <- get
case () of
() | Just x <- M.lookup s guessed -> pure x
| S.member s seen -> pure Bad
| otherwise -> do
modify $ first $ S.insert s
b <- visit =<< lift (memoDerive s)
unless (isBad b) $ modify $ second $ M.insert s b
pure b
mul :: Ast -> Ast -> Ast
mul Bad _ = Bad
mul _ Bad = Bad
mul x y = x :@ y
-- | Helps cut a non-empty parse forest down to one tree.
chainsaw :: [Ast] -> Ast
chainsaw xs | null xs' = Bad
| otherwise = head xs'
where xs' = filter (not . isBad) xs</code></pre>
</div>
</div>
<div class="paragraph">
<p>Memoized derivatives are straightforward. For computing derivatives, we
translate the rules given in the paper, and for memoization, on discovering
a missing entry, we insert a knot-tying value before recursing, and replace
it with the result of the recursion afterward.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-haskell" data-lang="haskell">memoDerive :: String -> State Grammar Pe
memoDerive cs@(c:s) = do
m <- get
unless (M.member cs m) $ do
modify $ M.insert cs $ NT cs
d <- derive c =<< memoDerive s
modify $ M.insert cs d
gets (M.! cs)
memoDerive _ = error "unreachable"
derive :: Char -> Pe -> State Grammar Pe
derive c pe = case pe of
NT s -> pure $ NT $ c:s
Ch x | x == c -> pure $ Eps x
Or xs -> Or <$> mapM (derive c) xs
Del x :. y -> (Del x :.) <$> derive c y
x :. y -> do
b <- parseNull x
dx <- derive c x
if isBad b then pure $ dx :. y else do
dy <- derive c y
pure $ Or [dx :. y, Del x :. dy]
_ -> pure Nul</code></pre>
</div>
</div>
<div class="paragraph">
<p>Here’s the grammar that Cox claims will grind our parser to a halt:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-haskell" data-lang="haskell">cox :: Grammar
cox = M.fromList
[ ("S", NT "T")
, ("T", Or [NT "T" :. (Ch '+' :. NT "T"), NT "N"])
, ("N", Ch '1')
]</code></pre>
</div>
</div>
<div class="paragraph">
<p>Let’s try it on a small input in an interactive interpreter:</p>
</div>
<div class="listingblock">
<div class="content">
<pre>parse cox "S" "1+1+1"</pre>
</div>
</div>
<div class="paragraph">
<p>The parser picks a particular parse tree:</p>
</div>
<div class="listingblock">
<div class="content">
<pre>(Lf '1' :@ (Lf '+' :@ Lf '1')) :@ (Lf '+' :@ Lf '1')</pre>
</div>
</div>
<div class="paragraph">
<p>How about all strings of length 7 consisting of <code>1</code> or <code>+</code>?</p>
</div>
<div class="listingblock">
<div class="content">
<pre>filter (not . isBad . parse cox "S") $ replicateM 7 "+1"</pre>
</div>
</div>
<div class="paragraph">
<p>Thankfully, we get:</p>
</div>
<div class="listingblock">
<div class="content">
<pre>["1+1+1+1"]</pre>
</div>
</div>
<div class="paragraph">
<p>At last, it’s time to demolish Cox’s claims.
We parse an 80-character input with a typo near the end:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-haskell" data-lang="haskell">main :: IO ()
main = print $ parse cox "S" $ concat (replicate 39 "1+") ++ "+1"</code></pre>
</div>
</div>
<div class="paragraph">
<p>Our prototype is awful. We really should:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Add a slimmed down version of <code>parseNull</code> that returns a boolean instead of
an AST, and call this in <code>derive</code>. We only want to recover the AST once the
whole string has been parsed; the rest of the time, we only care whether an
expression is nullable.</p>
</li>
<li>
<p>Use a better algorithm for finding the least fixed point. We’ve perhaps
chosen the clunkiest and most obvious method.</p>
</li>
<li>
<p>Remove a layer of indirection when tying the knot. That is, point to
a node directly rather than a string (which requires another lookup to get at
the node).</p>
</li>
<li>
<p>Apply algebraic identities to reduce the number of nodes in parsing
expressions and abstract syntax trees.</p>
</li>
</ul>
</div>
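<div class="paragraph">
<p>As a taste of the first item (my own hypothetical sketch, mirroring the
parser above): a <code>Bool</code>-valued twin of <code>parseNull</code> whose
guesses are a set of nullable nonterminals, with <code>derive</code> calling it
instead. The full parser would still call <code>parseNull</code> once at the
end to recover the AST.</p>
</div>

```haskell
import Control.Arrow (first, second)
import Control.Monad.State
import qualified Data.Map as M
import qualified Data.Set as S

data Pe = NT String | Eps Char | Nul | Ch Char | Or [Pe] | Pe :. Pe | Del Pe
type Grammar = M.Map String Pe

-- Bool-valued nullability: the same least-fixed-point search as parseNull,
-- but guesses are a set of nonterminals known nullable, and no AST is built.
nullBool :: Pe -> State Grammar Bool
nullBool pe = leastFix S.empty where
  leastFix guessed = do
    (b, (_, guessed')) <- runStateT (visit pe) (S.empty, guessed)
    if S.size guessed == S.size guessed' then pure b else leastFix guessed'
  visit :: Pe -> StateT (S.Set String, S.Set String) (State Grammar) Bool
  visit x = case x of
    Eps _ -> pure True
    Del y -> visit y
    Nul -> pure False
    Ch _ -> pure False
    Or ys -> or <$> mapM visit ys
    y :. z -> (&&) <$> visit y <*> visit z
    NT s -> do
      (seen, guessed) <- get
      case () of
        () | S.member s guessed -> pure True
           | S.member s seen -> pure False
           | otherwise -> do
               modify $ first $ S.insert s
               b <- visit =<< lift (memoDerive s)
               when b $ modify $ second $ S.insert s
               pure b

-- memoDerive and derive as in the article, except derive calls nullBool.
memoDerive :: String -> State Grammar Pe
memoDerive cs@(c:s) = do
  m <- get
  unless (M.member cs m) $ do
    modify $ M.insert cs $ NT cs
    d <- derive c =<< memoDerive s
    modify $ M.insert cs d
  gets (M.! cs)
memoDerive _ = error "unreachable"

derive :: Char -> Pe -> State Grammar Pe
derive c pe = case pe of
  NT s -> pure $ NT $ c:s
  Ch x | x == c -> pure $ Eps x
  Or xs -> Or <$> mapM (derive c) xs
  Del x :. y -> (Del x :.) <$> derive c y
  x :. y -> do
    b <- nullBool x  -- Bool check; no AST construction here
    dx <- derive c x
    if not b then pure $ dx :. y else do
      dy <- derive c y
      pure $ Or [dx :. y, Del x :. dy]
  _ -> pure Nul

-- Recognizer built on nullBool alone.
accepts :: Grammar -> String -> String -> Bool
accepts g start s = evalState (nullBool $ NT $ reverse s ++ start) g
```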
<div class="paragraph">
<p>And yet, on my laptop:</p>
</div>
<div class="listingblock">
<div class="content">
<pre>Bad
real 0m0.220s
user 0m0.215s
sys 0m0.005s</pre>
</div>
</div>
<div class="paragraph">
<p>Clearly, parsing with derivatives is efficient when run on the allegedly
exponential-running-time example given by Cox.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_the_moral_of_the_story">The moral of the story</h2>
<div class="sectionbody">
<div class="paragraph">
<p>It’s best to test drive an algorithm before condemning it.
If we see hilariously bad running times, then we can include them to hammer our
points home. If we see surprisingly good running times, then there’s a mistake
in our reasoning and we should keep quiet until we successfully attack the
algorithm from another angle. (Cox rightly notes parsing with derivatives
forgoes two key properties of yacc: linear running time and ambiguity
detection. If only he had focused on these trade-offs.)</p>
</div>
<div class="paragraph">
<p>Is this practicable for parsing with derivatives? Well, we have presented an
entire program, yet we have written less code than appears in
<a href="https://swtch.com/~rsc/regexp/regexp1.html">Cox’s excellent article on regular
expressions</a>, which quotes just a few choice cuts from a presumably complete
program. Indeed, with a splash of HTML, we can easily
<a href="https://crypto.stanford.edu/~blynn/haskell/pwd.html">build an
interactive online demo of parsing with derivatives</a>.</p>
</div>
<div class="paragraph">
<p>The existence of the flawed post indicates no such sanity check was done.
Perhaps this was due to a poor understanding of lazy evaluation, or because
implementing a lazy algorithm was deemed too troublesome. Both problems are
solved by learning a lazy language.</p>
</div>
<div class="paragraph">
<p>In sum, insufficient experience with lazy evaluation leads to faulty time
complexity analysis. Therefore we should all be comfortable with lazy languages
so computer science can progress unimpeded.</p>
</div>
</div>
</div>
Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com2tag:blogger.com,1999:blog-4222267598459829544.post-62561076287115412542018-06-04T11:53:00.001-07:002018-06-24T14:40:40.397-07:00Regex Derivatives<script type="text/x-mathjax-config">
MathJax.Hub.Config({
messageStyle: "none",
tex2jax: {
inlineMath: [["\\(", "\\)"]],
displayMath: [["\\[", "\\]"]],
ignoreClass: "nostem|nolatexmath"
},
asciimath2jax: {
delimiters: [["\\$", "\\$"]],
ignoreClass: "nostem|noasciimath"
},
TeX: { equationNumbers: { autoNumber: "none" } }
});
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.6.0/MathJax.js?config=TeX-MML-AM_HTMLorMML"></script>
<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>Like many of my generation, I was taught to use
<a href="https://en.wikipedia.org/wiki/Thompson%27s_construction">Thompson’s
construction</a> to convert a regular expression to a
deterministic finite automaton (DFA).
Namely, we draw tiny graphs for each component of a given
regular expression, and stitch them together to form a nondeterministic
finite automaton (NFA), which we then convert to a DFA.</p>
</div>
<div class="paragraph">
<p>The ideas are interesting. Sadly, there is no other reason to study them,
because there’s a simpler approach that:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Constructs a DFA directly from a regular expression. Forget NFAs.</p>
</li>
<li>
<p>Supports richer regular expressions. Behold, logical
AND and NOT: <code>[a-z]+&!(do|for|if|while)</code></p>
</li>
<li>
<p>Immediately obtains smaller and often minimal DFAs in realistic
applications.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>All this is expertly explained in
<a href="http://www.ccs.neu.edu/home/turon/re-deriv.pdf"><em>Regular-expression derivatives
reexamined</em> by Owens, Reppy, and Turon</a>. To my chagrin, I only stumbled
across it recently, almost a decade after its publication. And after I had
already written a regex tool.</p>
</div>
<div class="paragraph">
<p>But it could be worse: the authors note the superior method was published over
50 years ago by Brzozowski, before being "lost in the sands of time".</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_derive_to_succeed">Derive to succeed</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Take
<a href="https://en.wikipedia.org/wiki/Regular_expression#Formal_definition">"standard"
regular expressions</a>. We have the constants:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>\$\emptyset\$: accepts nothing; the empty language.</p>
</li>
<li>
<p>\$\epsilon\$: accepts the empty string.</p>
</li>
<li>
<p>\$c\$: accepts the character \$c\$.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>and regexes built from other regexes \$r\$ and \$s\$:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>\$rs\$: the language built from all pairwise concatenations of strings in \$r\$ and strings in \$s\$.</p>
</li>
<li>
<p>\(r\mid s\): logical or (alternation); the union of the two languages.</p>
</li>
<li>
<p>\$r\mbox{*}\$: Kleene closure; zero or more strings of \$r\$ concatenated together.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Then solve two problems:</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>Determine if a regex accepts the empty string.</p>
</li>
<li>
<p>For a character \$c\$ and a regex \$f\$, find a regex that accepts a
string \$s\$ precisely when \$f\$ accepts \$c\$ followed by \$s\$.
For example, feeding <code>a</code> to <code>ab*c|d*e*f|g*ah</code> results in the regex
<code>b*c|h</code>.</p>
</li>
</ol>
</div>
<div class="paragraph">
<p>The first problem is little more than a reading comprehension quiz. Going down
the list, we see the answers are: no; yes; no; exactly when \$r\$ and \$s\$ do;
exactly when \$r\$ or \$s\$ do; yes.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-haskell" data-lang="haskell">import Data.List
data Re = Nul | Eps | Lit Char | Kleene Re | Re :. Re | Alt [Re]
deriving (Eq, Ord)
nullable :: Re -> Bool
nullable re = case re of
Nul -> False
Eps -> True
Lit _ -> False
r :. s -> nullable r && nullable s
Alt rs -> any nullable rs
Kleene _ -> True</code></pre>
</div>
</div>
<div class="paragraph">
<p>In the second problem, the base cases remain easy: return \$\emptyset\$
except for the constant \$c\$, in which case return \$\epsilon\$.</p>
</div>
<div class="paragraph">
<p>The recursive cases are tougher. Given \(r\mid s\), solve the problem on
both alternatives to get \$r'\$ and \$s'\$ then return \(r'\mid s'\). For
\(r\mbox{*}\), return \(r'r\mbox{*}\).</p>
</div>
<div class="paragraph">
<p>The trickiest is concatenation: \$rs\$. First, determine if \$r\$ accepts the
empty string (the problem we just solved). If so, return
\(r's\mid s'\). If not, return \(r's\).</p>
</div>
<div class="paragraph">
<p>The answer to the second problem is the <em>derivative</em> of the regex \$f\$
with respect to the character \$c\$, and denoted \$\partial_c f\$.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-haskell" data-lang="haskell">derive :: Char -> Re -> Re
derive c f = case f of
Nul -> Nul
Eps -> Nul
Lit a | a == c -> Eps
| otherwise -> Nul
r :. s | nullable r -> mkAlt [dc r :. s, dc s]
| otherwise -> dc r :. s
Alt rs -> mkAlt $ dc <$> rs
Kleene r -> dc r :. f
where dc = derive c</code></pre>
</div>
</div>
<div class="paragraph">
<p>For now, pretend <code>mkAlt = Alt</code>. We shall soon reveal its true definition,
why we need it, and why we represent an alternation with a list.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_the_regex_is_the_state">The regex is the state</h2>
<div class="sectionbody">
<div class="paragraph">
<p>We can now directly construct a DFA for any regex \$r\$.</p>
</div>
<div class="paragraph">
<p>Each state of our DFA corresponds to a regex. The start state is the input
regex \$r\$. For each character \$c\$, create the state \$\partial_c r\$ if
it doesn’t already exist, then draw an arrow labeled \$c\$ from \$r\$ to
\$\partial_c r\$.</p>
</div>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzGwwO5HcDLazTZRKNkj8pYPuiyFTdVk00VXt6IRjs28HU_Tq7zrsbV0Hh7mzPQmz4LYi8uSlgVDl2l4vJWLkslryTVDeAYntI6XOJYCSB7HO5oTF-grztqxFNjknfq5gDcEfWOuCHcIM/s1600/a.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzGwwO5HcDLazTZRKNkj8pYPuiyFTdVk00VXt6IRjs28HU_Tq7zrsbV0Hh7mzPQmz4LYi8uSlgVDl2l4vJWLkslryTVDeAYntI6XOJYCSB7HO5oTF-grztqxFNjknfq5gDcEfWOuCHcIM/s1600/a.png" data-original-width="219" data-original-height="56" /></a></div>
<div class="paragraph">
<p>Repeat on all newly created states. The accepting states are those which accept
the empty string. Done!</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-haskell" data-lang="haskell">mkDfa :: Re -> ([Re], Re, [Re], [((Re, Re), Char)])
mkDfa r = (states, r, filter nullable states, edges) where
(states, edges) = explore ([r], []) r
explore gr q = foldl' (goto q) gr ['a'..'z']
goto q (qs, es) c | qc `elem` qs = (qs, es1)
| otherwise = explore (qc:qs, es1) qc
where qc = derive c q
es1 = ((q, qc), c):es</code></pre>
</div>
</div>
<div class="paragraph">
<p>So long as we’re mindful that the logical or operation is idempotent,
commutative, and associative, that is,
\(r\mid r = r\), \(r\mid s = s\mid r\), and
\((r\mid s)\mid t = r\mid (s\mid t)\),
the above is guaranteed to terminate.</p>
</div>
<div class="paragraph">
<p>This makes sense intuitively, because taking a derivative usually yields a
simpler regex. The glaring exception is the Kleene star, but on further
inspection, we ought to repeat ourselves eventually after taking enough
derivatives so long as we can cope with the proliferating logical ors.</p>
</div>
<div class="paragraph">
<p>We handle idempotence with <code>nub</code>, commutativity with <code>sort</code>, and
associativity by flattening lists:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-haskell" data-lang="haskell">mkAlt :: [Re] -> Re
mkAlt rs | [r] <- rs' = r
| otherwise = Alt rs'
where rs' = nub $ sort $ concatMap flatAlt rs
flatAlt (Alt as) = as
flatAlt a = [a]</code></pre>
</div>
</div>
<div class="paragraph">
<p>This ties off the loose ends mentioned above, and completes our regex compiler.
Not bad for 30 lines or so!</p>
</div>
<div class="paragraph">
<p>In practice, we apply more algebraic identities before comparing
regexes to produce smaller DFAs, which empirically are often optimal.
(Ideally, we’d like to tell if two given regexes describe the same language so
we could always generate the minimal DFA, but this is too costly.)</p>
</div>
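<div class="paragraph">
<p>For example (a hypothetical sketch; the code above only canonicalizes
alternation via <code>mkAlt</code>), smart constructors for concatenation and
the Kleene star:</p>
</div>

```haskell
-- Mirrors the Re type defined earlier.
data Re = Nul | Eps | Lit Char | Kleene Re | Re :. Re | Alt [Re]
  deriving (Eq, Ord, Show)

-- Concatenation identities: ∅r = r∅ = ∅ and εr = rε = r.
mkCat :: Re -> Re -> Re
mkCat Nul _ = Nul
mkCat _ Nul = Nul
mkCat Eps r = r
mkCat r Eps = r
mkCat r s = r :. s

-- Kleene star identities: ∅* = ε* = ε and (r*)* = r*.
mkKleene :: Re -> Re
mkKleene Nul = Eps
mkKleene Eps = Eps
mkKleene r@(Kleene _) = r
mkKleene r = Kleene r
```

<div class="paragraph">
<p>With <code>mkCat</code> used in <code>derive</code>, the derivative of
<code>a*</code> with respect to <code>b</code> collapses to \$\emptyset\$
instead of lingering as a distinct DFA state.</p>
</div>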
</div>
</div>
<div class="sect1">
<h2 id="_extending_regexes">Extending regexes</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Adding new features to the regex language is easy with derivatives. Given
an operation, we only need to:</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>Determine if it accepts the empty string.</p>
</li>
<li>
<p>Figure out the rules for its derivative.</p>
</li>
</ol>
</div>
<div class="paragraph">
<p>(We should prove the algorithm still terminates, but we’ll just eyeball it and
wave our hands.)</p>
</div>
<div class="paragraph">
<p>For example, we get the familiar \(r\mbox{+}\) by rejecting the empty string
and defining its derivative to be \(r' r\mbox{*}\). We obtain \$r?\$ by
accepting the empty string and defining its derivative to be \$r'\$. But let’s
do something more fun.</p>
</div>
<div class="paragraph">
<p>The <em>logical and</em> \$r&s\$ of regexes \$r\$ and \$s\$ accepts if and only if
both \$r\$ and \$s\$ match. Then \$r&s\$ accepts the empty string exactly when
both \$r\$ and \$s\$ do (similar to concatenation), and the derivative of
\$r&s\$ is \$r'&s'\$.</p>
</div>
<div class="paragraph">
<p>The <em>complement</em> \$!r\$ of a regex \$r\$ accepts if and only if \$r\$
rejects. Then \$!r\$ accepts the empty string if and only if \$r\$ rejects it,
and the derivative of \$!r\$ is \$!r'\$.</p>
</div>
<div class="paragraph">
<p>For example, if we write <code>()</code> for \$\epsilon\$ then <code>!()&[a-z]*</code> is the same as
<code>[a-z]+</code>.</p>
</div>
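<div class="paragraph">
<p>Concretely, here is a self-contained sketch restating the earlier
definitions with both new operations. Canonicalization via <code>mkAlt</code>
is omitted: iterated derivatives decide membership without it; it only matters
for termination of DFA construction.</p>
</div>

```haskell
-- Re extended with logical AND and complement.
data Re = Nul | Eps | Lit Char | Kleene Re | Re :. Re | Alt [Re]
        | And Re Re | Not Re

nullable :: Re -> Bool
nullable re = case re of
  Nul -> False
  Eps -> True
  Lit _ -> False
  r :. s -> nullable r && nullable s
  Alt rs -> any nullable rs
  Kleene _ -> True
  And r s -> nullable r && nullable s  -- both must accept ""
  Not r -> not (nullable r)            -- accepts "" iff r rejects it

derive :: Char -> Re -> Re
derive c f = case f of
  Nul -> Nul
  Eps -> Nul
  Lit a | a == c -> Eps
        | otherwise -> Nul
  r :. s | nullable r -> Alt [dc r :. s, dc s]
         | otherwise -> dc r :. s
  Alt rs -> Alt (dc <$> rs)
  Kleene r -> dc r :. f
  And r s -> And (dc r) (dc s)
  Not r -> Not (dc r)
  where dc = derive c

-- A string matches iff the iterated derivative is nullable.
matches :: Re -> String -> Bool
matches re = nullable . foldl (flip derive) re

-- Example: one or more of a|b, but not exactly the string "ab".
demo :: Re
demo = And (ab :. Kleene ab) (Not (Lit 'a' :. Lit 'b'))
  where ab = Alt [Lit 'a', Lit 'b']
```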
<div class="paragraph">
<p>As before, we can plug these operations into our DFA-maker right away. Good
luck doing this with NFAs! Well, I think it’s possible if we add weird rules,
e.g. "if we can reach state A <em>and</em> state B, then we can magically reach state
C", but then they’d no longer be true NFAs.</p>
</div>
<div class="paragraph">
<p>The unfortunate, undeserved, and hopefully soon-to-be unlamented prominence of
the NFA approach is why these useful operations are considered exotic.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_regularly_express_yourself">Regularly express yourself</h2>
<div class="sectionbody">
<div class="ulist">
<ul>
<li>
<p>Read <a href="http://www.ccs.neu.edu/home/turon/re-deriv.pdf"><em>Regular-expression
derivatives reexamined</em> by Owens, Reppy, and Turon</a>.</p>
</li>
<li>
<p>If you’re teaching regexes, teach derivatives.</p>
</li>
<li>
<p>If someone is teaching you to convert regexes to DFAs via NFAs only,
ask why they aren’t teaching derivatives as well.</p>
</li>
<li>
<p>If you’re implementing regexes, use derivatives.</p>
</li>
<li>
<p><a href="https://crypto.stanford.edu/~blynn/haskell/re.html">See my online regex-to-DFA demo</a>!</p>
</li>
</ul>
</div>
</div>
</div>
Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-84013375847368333492017-06-10T17:13:00.000-07:002017-06-10T17:13:20.994-07:00Solving the JavaScript Problem<div class="paragraph"><p><a href="https://wiki.haskell.org/The_JavaScript_Problem">The JavaScript Problem</a>
is a good problem to have.
Against the odds, “write once, run anywhere” is a reality for web browsers
because of a language governed by a standards organization.
Not so long ago, proprietary technologies such as Flash and Silverlight
threatened the openness of the web.</p></div>
<div class="paragraph"><p>So we should be grateful the JavaScript problem is merely a technical one, namely
that JavaScript is poorly designed. Though the language has improved over the
years, its numerous flaws are too deeply entrenched to remove.
<a href="https://www.slant.co/topics/1515/~solutions-to-the-javascript-problem">Transpilers
help by unlocking access to better languages</a>, but JavaScript was never
intended to be an assembly language. It’s only thanks to heroic efforts of many
talented engineers that JavaScript has gone so far.</p></div>
<div class="paragraph"><p>Ideally, the lingua franca of the web should be low-level, clean, and simple,
so we can develop in any language with little overhead.</p></div>
<div class="paragraph"><p><strong>WebAssembly</strong></p></div>
<div class="paragraph"><p><a href="http://caniuse.com/#feat=wasm">Recent versions of several popular browsers
have fulfilled our wishes with WebAssembly</a>, also known as wasm.
WebAssembly is an open standard, and well-designed. At last, works such as
<a href="https://en.wikipedia.org/wiki/Types_and_Programming_Languages">Benjamin Pierce’s
“Types and Programming Languages”</a> are mainstream enough that
<a href="https://github.com/WebAssembly/spec/blob/master/papers/pldi2017.pdf">WebAssembly
has formally specified reduction and typing rules, and even a proof of
soundness</a>. In contrast, weakly typed languages such as JavaScript
ignore a century or so of mathematical progress.</p></div>
<div class="paragraph"><p>In WebAssembly, nondeterministic behaviour can only arise from exhaustion,
external host functions, and the IEEE-754 floating-point standard, which fails
to specify the NaN bit pattern for all cases.
Recall in C and the many languages built upon it such as Go and Haskell,
<a href="http://blog.robertelder.org/signed-or-unsigned/">signed integer overflow causes
undefined behaviour</a>. WebAssembly fixes this by stipulating two’s complement
for negative numbers,
<a href="https://en.wikipedia.org/wiki/Signed_number_representations">as competing
representations of negative numbers are ultimately responsible for this defect
of C</a>. Endianness is similarly settled, though curiously by travelling the road
not taken by network byte order: numbers in WebAssembly are encoded in
little-endian.</p></div>
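<div class="paragraph"><p>(An aside, with my own illustrative code: integers in
the WebAssembly binary format use the LEB128 variable-length encoding, which
likewise emits the least significant bits first.)</p></div>

```haskell
import Data.Bits ((.&.), (.|.), shiftR)
import Data.Word (Word32, Word8)

-- Unsigned LEB128: emit 7 bits per byte, least significant group first,
-- setting the high bit on every byte except the last.
uleb128 :: Word32 -> [Word8]
uleb128 n
  | n < 0x80 = [fromIntegral n]
  | otherwise = (fromIntegral (n .&. 0x7f) .|. 0x80) : uleb128 (n `shiftR` 7)
```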
<div class="paragraph"><p>The WebAssembly virtual machine is stack-based. Years ago, I read that
<a href="http://flint.cs.yale.edu/jvmsem/doc/inferno.ps">register-based virtual machines
are faster</a>, but perhaps these results are now obsolete. Browsing briefly, I
found newer papers:</p></div>
<div class="ulist"><ul>
<li>
<p>
<a href="https://arxiv.org/abs/1611.00467">A Performance Survey on Stack-based and
Register-based Virtual Machines</a>
</p>
</li>
<li>
<p>
<a href="https://www.usenix.org/legacy/events/vee05/full_papers/p153-yunhe.pdf">Virtual
Machine Showdown: Stack Versus Registers</a>
</p>
</li>
</ul></div>
<div class="paragraph"><p>It’s the same old story. Register-based virtual machines are still faster after
all. It seems WebAssembly prioritizes code size, and trusts browsers will ship
with good JIT compilers.</p></div>
<div class="paragraph"><p><strong>Toy Compilers</strong></p></div>
<div class="paragraph"><p>Online demonstrations of WebAssembly compilers are fun to build, and fun to
describe: I compiled a Haskell program to JavaScript that when executed, reads
a Haskell-like program and compiles it to WebAssembly, which some JavaScript
loads and executes.</p></div>
<div class="paragraph"><p>Perhaps it’s easier to invite the reader to:</p></div>
<div class="ulist"><ul>
<li>
<p>
<a href="http://crypto.stanford.edu/~blynn/lambda/wasm.html">Type an expression and
click a button to compile it to WebAssembly and run it</a>.
</p>
</li>
<li>
<p>
<a href="http://crypto.stanford.edu/~blynn/lambda/sk.html">Try a WebAssembly compiler
for a Turing-complete language (lazily evaluated lambda calculus)</a>.
</p>
</li>
<li>
<p>
<a href="http://crypto.stanford.edu/~blynn/lambda/crazyl.html">Play with a
family of compilers for Lazy K and similar languages</a>.
</p>
</li>
</ul></div>
<div class="paragraph"><p>For the last 2 programs, I transformed the source to a tree of S and K
combinators, so apart from graph reduction, I only had to code 2 combinators in
assembly. The resulting binaries are excruciatingly slow, especially since
numbers are Church-encoded, but it all seems to work.</p></div>
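<div class="paragraph"><p>The two combinators amount to the rewrite rules
K x y → x and S x y z → x z (y z); a sketch of the reduction in Haskell (my
illustration, not the actual compiler output):</p></div>

```haskell
-- Combinator terms: S, K, and application.
data Term = S | K | Term :@ Term deriving (Eq, Show)
infixl 9 :@

-- One leftmost-outermost reduction step, if any applies.
step :: Term -> Maybe Term
step (K :@ x :@ _) = Just x
step (S :@ x :@ y :@ z) = Just (x :@ z :@ (y :@ z))
step (f :@ x) = (:@ x) <$> step f
step _ = Nothing

-- Repeatedly reduce the redex at the head of the spine.
eval :: Term -> Term
eval t = maybe t eval (step t)
```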
<div class="paragraph"><p>I look forward to a Haskell compiler that produces efficient WebAssembly,
though it may have to wait until
<a href="http://webassembly.org/docs/future-features/">WebAssembly gains a few more
features, such as threads and tail calls</a>.</p></div>
Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com1tag:blogger.com,1999:blog-4222267598459829544.post-85197040837191189752017-04-02T12:10:00.000-07:002017-04-02T13:04:25.617-07:00Lambda Calculus Surprises
<div id="preamble">
<div class="sectionbody">
<div class="paragraph"><p>Much time has passed since my last entry. I’ve been frantically filling gaps in
my education, so I’ve had little to say here. My notes are better off on
<a href="http://crypto.stanford.edu/~blynn/">my homepage</a>, where I can better organize them, and incorporate interactive demos.</p></div>
<div class="paragraph"><p>However, I want to draw attention to delightful surprises that seem unfairly
obscure.</p></div>
<div class="paragraph"><p><strong>1. Succinct Turing-complete self-interpreters</strong></p></div>
<div class="paragraph"><p><a href="http://www-formal.stanford.edu/jmc/recursive.html">John McCarthy’s classic
paper</a> showed how to write a Lisp interpreter in Lisp itself. By adding a
handful of primitives (quote, atom, eq, car, cdr, cons,
cond) to lambda calculus, we get a Turing-complete language where a
self-interpreter is easy to write and understand. For contrast, see
<a href="https://www.cs.virginia.edu/~robins/Turing_Paper_1936.pdf">Turing’s
universal machine of 1936</a>.</p></div>
<div class="paragraph"><p>Researchers have learned more about lambda calculus since 1960, but many
resources seem stuck in the past.
<a href="http://matt.might.net/articles/implementing-a-programming-language/">Writing a
Turing-complete interpreter in 7 lines</a> is ostensibly still a big deal.
<a href="http://www.paulgraham.com/rootsoflisp.html"><em>The Roots of Lisp</em> by Paul Graham</a>
praises McCarthy’s self-interpreter but explores no further.
<a href="https://www.cs.auckland.ac.nz/~chaitin/lm.html"><em>The Limits of Mathematics</em>
by Gregory Chaitin</a> chooses Lisp over plain lambda calculus for dubious reasons.
Perhaps <a href="http://homepages.inf.ed.ac.uk/wadler/topics/history.html">McCarthy’s
work is so life-changing</a> that some find it hard to notice new advances.</p></div>
<div class="paragraph"><p><a href="https://tromp.github.io/cl/cl.html">John Tromp’s fascinating website</a> is a
refreshing exception. It led me to
<a href="http://repository.readscheme.org/ftp/papers/topps/D-128.pdf">a paper by
Mogensen, who found a one-line self-interpreter without adding anything to
lambda calculus</a>:</p></div>
<div class="listingblock">
<div class="content monospaced">
<pre>(f.(x.f(xx))(x.f(xx)))(em.m(x.x)(mn.em(en))(mv.e(mv)))</pre>
</div></div>
<div class="paragraph"><p>(I’ve suppressed the lambdas. Exercise: write a regex substitution that
restores them.)</p></div>
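<div class="paragraph"><p>(Spoiler: one possible answer, written as a small
Haskell function rather than a literal regex substitution. Each maximal run of
variables before a dot becomes a chain of lambda binders.)</p></div>

```haskell
import Data.Char (isAlpha)

-- Rewrite each maximal run of variables preceding a '.' as a chain of
-- lambda binders, e.g. "em." becomes "λe.λm.".
restore :: String -> String
restore [] = []
restore cs = case span isAlpha cs of
  (vs@(_:_), '.':rest) -> concatMap (\v -> ['λ', v, '.']) vs ++ restore rest
  _ -> head cs : restore (tail cs)
```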
<div class="paragraph"><p>In fact, under some definitions, the program
“λq.q(λx.x)” is a self-interpreter.</p></div>
<div class="paragraph"><p><strong>2. Hindley-Milner sort</strong></p></div>
<div class="paragraph"><p><a href="https://en.wikipedia.org/wiki/Types_and_Programming_Languages"><em>Types
and Programming Languages</em> (TaPL)</a> by Benjamin C. Pierce is a gripping
action thriller. Types are the heroes, and we follow their epic struggle
against the most ancient and powerful foes of computer science and mathematics.</p></div>
<div class="paragraph"><p>When we first meet them, types are humble guardians of a barebones language
that can only express the simplest of computations involving booleans and
natural numbers. As the story progresses, types gain additional abilities,
enabling them to protect more powerful languages.</p></div>
<div class="paragraph"><p>However, there seems to be a plot hole when types level up from Hindley-Milner
to System F. As a “nice demonstration of the expressive power of pure System
F”, the book mentions a program that can sort lists.</p></div>
<div class="paragraph"><p>The details are left as an exercise to the reader. Working through them, we
realize a Hindley-Milner type system is already powerful enough to sort lists.
Moreover, the details are far more pleasant in Hindley-Milner because we avoid
the ubiquitous type spam of System F.</p></div>
<div class="paragraph"><p>System F is indeed more powerful than Hindley-Milner and deserves admiration,
but because of well-typed self-application and polymorphic identity
functions, existential types, and other gems; not because lists can be sorted.</p></div>
<div class="paragraph"><p><strong>3. Self-interpreters for total languages</strong></p></div>
<div class="paragraph"><p>They said it couldn’t be done.</p></div>
<div class="paragraph"><p>According to <a href="http://compilers.cs.ucla.edu/popl16/popl16-full.pdf"><em>Breaking
Through the Normalization Barrier: A Self-Interpreter for F-omega</em> by Matt
Brown and Jens Palsberg</a>, “several books, papers, and web pages” assert
self-interpreters for a strongly normalizing lambda calculus are
impossible. The paper then shows that reports of their non-existence have
been greatly exaggerated.</p></div>
<div class="paragraph"><p>Indeed, famed researcher
<a href="https://en.wikipedia.org/wiki/Robert_Harper_(computer_scientist)">Robert
Harper</a> writes on
<a href="https://existentialtype.wordpress.com/2014/03/20/old-neglected-theorems-are-still-theorems/">his
blog</a> that “one limitation of total programming languages is that they are not
universal: you cannot write an interpreter for T within T (see Chapter 9 of
PFPL for a proof).”, and as of now (April 2017),
<a href="https://en.wikipedia.org/wiki/Normalization_property_(abstract_rewriting)">the
Wikipedia article they cite</a> still declares “it is impossible to define a
self-interpreter in any of the calculi cited above”, referring to simply typed
lambda calculus, System F, and the calculus of constructions.</p></div>
<div class="paragraph"><p>I was shocked. Surely academics are proficient with diagonalization by now?
Did they all overlook a hole in their proofs?</p></div>
<div class="paragraph"><p>More shocking is the stark simplicity of what Brown and Palsberg call a
<em>shallow</em> self-interpreter for System F and System F<sub>ω</sub>, which is
essentially a typed version of “λq.q(λx.x)”.</p></div>
<div class="paragraph"><p>It relies on a liberal definition of representation (we only require an
injective map from legal terms to normal forms) and self-interpretation
(mapping a representation of a term to its value) which is nonetheless still
strong enough to upend conventional wisdom.</p></div>
<div class="paragraph"><p>Which brings us to the most shocking revelation: there is no official agreement
on the definition of representation or self-interpretation, or even what we
should name these concepts.</p></div>
<div class="paragraph"><p>Does this mean I should be wary of even the latest textbooks? Part of me
hopes not, because I want to avoid learning falsehoods, but another part of me
hopes so, for it means I’ve reached the cutting edge of research.</p></div>
<div class="paragraph"><p><strong>See for yourself!</strong></p></div>
<div class="paragraph"><p>Interactive demos of the above:</p></div>
<div class="ulist"><ul>
<li>
<p>
<a href="http://crypto.stanford.edu/~blynn/lambda/">A Turing-complete
interpreter in one line</a>.
</p>
</li>
<li>
<p>
<a href="http://crypto.stanford.edu/~blynn/lambda/hm.html">A Hindley-Milner well-typed
program that sorts lists</a>.
</p>
</li>
<li>
<p>
<a href="http://crypto.stanford.edu/~blynn/lambda/systemf.html">A self-interpreter in System F</a>.
</p>
</li>
</ul></div>
</div>
</div>
Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-3092716202179196012015-11-10T21:41:00.001-08:002015-11-10T21:41:59.936-08:00Neural Networks in Haskell
<div class="paragraph"><p>Long ago, when I first looked into machine learning, neural networks didn’t
stand out from the crowd. They seemed on par with decision trees, genetic
algorithms, genetic programming, and a host of other techniques. I wound up
dabbling in genetic programming because it seemed coolest.</p></div>
<div class="paragraph"><p>Neural networks have since distinguished themselves. Lately, they seem
responsible for each newsworthy machine learning achievement I hear about.
To name a few:</p></div>
<div class="ulist"><ul>
<li>
<p>
<a href="http://arxiv.org/abs/1509.01549">A neural net teaches itself Master-level chess in 72 hours</a>.
</p>
</li>
<li>
<p>
<a href="http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html">A neural net “dreams” surreal images</a>.
</p>
</li>
<li>
<p>
<a href="https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf">A neural net teaches itself to master several Atari games</a>.
</p>
</li>
</ul></div>
<div class="paragraph"><p>Inspired, I began reading <a href="http://neuralnetworksanddeeplearning.com/">Michael
Nielsen’s online book on neural networks</a>. We can whip up a neural network
without straying beyond a Haskell base install, though we do have to implement
<a href="https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform">the Box-Muller
transform</a> ourselves to avoid pulling in a library to sample from a normal
distribution.</p></div>
<div class="paragraph"><p>The following generates a neural network with 3 inputs, a hidden layer of 4
neurons, and 2 output neurons, then feeds it the inputs [0.1, 0.2, 0.3].</p></div>
<div class="listingblock">
<div class="content"><!-- Generator: GNU source-highlight 3.1.6
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
<pre><tt><span style="font-weight: bold"><span style="color: #0000FF">import</span></span> Control<span style="color: #990000">.</span><span style="color: #009900">Monad</span>
<span style="font-weight: bold"><span style="color: #0000FF">import</span></span> Data<span style="color: #990000">.</span><span style="color: #009900">Functor</span>
<span style="font-weight: bold"><span style="color: #0000FF">import</span></span> Data<span style="color: #990000">.</span><span style="color: #009900">List</span>
<span style="font-weight: bold"><span style="color: #0000FF">import</span></span> System<span style="color: #990000">.</span><span style="color: #009900">Random</span>
main <span style="color: #990000">=</span> newBrain <span style="color: #990000">[</span><span style="color: #993399">3</span><span style="color: #990000">,</span> <span style="color: #993399">4</span><span style="color: #990000">,</span> <span style="color: #993399">2</span><span style="color: #990000">]</span> <span style="color: #990000">>>=</span> print <span style="color: #990000">.</span> feed <span style="color: #990000">[</span><span style="color: #993399">0.1</span><span style="color: #990000">,</span> <span style="color: #993399">0.2</span><span style="color: #990000">,</span> <span style="color: #993399">0.3</span><span style="color: #990000">]</span>
newBrain szs<span style="color: #990000">@(</span><span style="font-weight: bold"><span style="color: #0000FF">_</span></span><span style="color: #990000">:</span>ts<span style="color: #990000">)</span> <span style="color: #990000">=</span> zip <span style="color: #990000">(</span>flip replicate <span style="color: #993399">1</span> <span style="color: #990000"><$></span> ts<span style="color: #990000">)</span> <span style="color: #990000"><$></span>
zipWithM <span style="color: #990000">(\</span>m n <span style="color: #990000">-></span> replicateM n <span style="color: #990000">$</span> replicateM m <span style="color: #990000">$</span> gauss <span style="color: #993399">0.01</span><span style="color: #990000">)</span> szs ts
feed <span style="color: #990000">=</span> foldl' <span style="color: #990000">(((</span>max <span style="color: #993399">0</span> <span style="color: #990000"><$>)</span> <span style="color: #990000">.</span> <span style="color: #990000">)</span> <span style="color: #990000">.</span> zLayer<span style="color: #990000">)</span>
zLayer as <span style="color: #990000">(</span>bs<span style="color: #990000">,</span> wvs<span style="color: #990000">)</span> <span style="color: #990000">=</span> zipWith <span style="color: #990000">(+)</span> bs <span style="color: #990000">$</span> sum <span style="color: #990000">.</span> zipWith <span style="color: #990000">(*)</span> as <span style="color: #990000"><$></span> wvs
gauss <span style="color: #990000">::</span> <span style="color: #009900">Float</span> <span style="color: #990000">-></span> <span style="color: #009900">IO</span> <span style="color: #009900">Float</span>
gauss stdev <span style="color: #990000">=</span> <span style="font-weight: bold"><span style="color: #0000FF">do</span></span>
x <span style="color: #990000"><-</span> randomIO
y <span style="color: #990000"><-</span> randomIO
return <span style="color: #990000">$</span> stdev <span style="color: #990000">*</span> sqrt <span style="color: #990000">(-</span><span style="color: #993399">2</span> <span style="color: #990000">*</span> log x<span style="color: #990000">)</span> <span style="color: #990000">*</span> cos <span style="color: #990000">(</span><span style="color: #993399">2</span> <span style="color: #990000">*</span> pi <span style="color: #990000">*</span> y<span style="color: #990000">)</span></tt></pre></div></div>
<div class="paragraph"><p>The tough part is training the network. The sane choice is to use a library
to help with the matrix and vector operations involved in backpropagation
by gradient descent, but where’s the fun in that?</p></div>
<div class="paragraph"><p>It turns out even if we stay within core Haskell, we only need a few more
lines, albeit some hairy ones:</p></div>
<div class="listingblock">
<div class="content"><!-- Generator: GNU source-highlight 3.1.6
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
<pre><tt>relu <span style="color: #990000">=</span> max <span style="color: #993399">0</span>
relu' x <span style="color: #990000">|</span> x <span style="color: #990000"><</span> <span style="color: #993399">0</span> <span style="color: #990000">=</span> <span style="color: #993399">0</span>
<span style="color: #990000">|</span> otherwise <span style="color: #990000">=</span> <span style="color: #993399">1</span>
revaz xs <span style="color: #990000">=</span> foldl' <span style="color: #990000">(\(</span>avs<span style="color: #990000">@(</span>av<span style="color: #990000">:</span><span style="font-weight: bold"><span style="color: #0000FF">_</span></span><span style="color: #990000">),</span> zs<span style="color: #990000">)</span> <span style="color: #990000">(</span>bs<span style="color: #990000">,</span> wms<span style="color: #990000">)</span> <span style="color: #990000">-></span> <span style="font-weight: bold"><span style="color: #0000FF">let</span></span>
zs' <span style="color: #990000">=</span> zLayer av <span style="color: #990000">(</span>bs<span style="color: #990000">,</span> wms<span style="color: #990000">)</span> <span style="font-weight: bold"><span style="color: #0000FF">in</span></span> <span style="color: #990000">((</span>relu <span style="color: #990000"><$></span> zs'<span style="color: #990000">):</span>avs<span style="color: #990000">,</span> zs'<span style="color: #990000">:</span>zs<span style="color: #990000">))</span> <span style="color: #990000">([</span>xs<span style="color: #990000">],</span> <span style="color: #990000">[])</span>
dCost a y <span style="color: #990000">|</span> y <span style="color: #990000">==</span> <span style="color: #993399">1</span> <span style="color: #990000">&&</span> a <span style="color: #990000">>=</span> y <span style="color: #990000">=</span> <span style="color: #993399">0</span>
<span style="color: #990000">|</span> otherwise <span style="color: #990000">=</span> a <span style="color: #990000">-</span> y
deltas xv yv layers <span style="color: #990000">=</span> <span style="font-weight: bold"><span style="color: #0000FF">let</span></span>
<span style="color: #990000">(</span>avs<span style="color: #990000">@(</span>av<span style="color: #990000">:</span><span style="font-weight: bold"><span style="color: #0000FF">_</span></span><span style="color: #990000">),</span> zv<span style="color: #990000">:</span>zvs<span style="color: #990000">)</span> <span style="color: #990000">=</span> revaz xv layers
delta0 <span style="color: #990000">=</span> zipWith <span style="color: #990000">(*)</span> <span style="color: #990000">(</span>zipWith dCost av yv<span style="color: #990000">)</span> <span style="color: #990000">(</span>relu' <span style="color: #990000"><$></span> zv<span style="color: #990000">)</span>
<span style="font-weight: bold"><span style="color: #0000FF">in</span></span> <span style="color: #990000">(</span>reverse avs<span style="color: #990000">,</span> f <span style="color: #990000">(</span>transpose <span style="color: #990000">.</span> snd <span style="color: #990000"><$></span> reverse layers<span style="color: #990000">)</span> zvs <span style="color: #990000">[</span>delta0<span style="color: #990000">])</span>
<span style="font-weight: bold"><span style="color: #0000FF">where</span></span>
f <span style="font-weight: bold"><span style="color: #0000FF">_</span></span> <span style="color: #990000">[]</span> dvs <span style="color: #990000">=</span> dvs
f <span style="color: #990000">(</span>wm<span style="color: #990000">:</span>wms<span style="color: #990000">)</span> <span style="color: #990000">(</span>zv<span style="color: #990000">:</span>zvs<span style="color: #990000">)</span> dvs<span style="color: #990000">@(</span>dv<span style="color: #990000">:</span><span style="font-weight: bold"><span style="color: #0000FF">_</span></span><span style="color: #990000">)</span> <span style="color: #990000">=</span> f wms zvs <span style="color: #990000">$</span> <span style="color: #990000">(:</span>dvs<span style="color: #990000">)</span> <span style="color: #990000">$</span>
zipWith <span style="color: #990000">(*)</span> <span style="color: #990000">[</span>sum <span style="color: #990000">$</span> zipWith <span style="color: #990000">(*)</span> row dv <span style="color: #990000">|</span> row <span style="color: #990000"><-</span> wm<span style="color: #990000">]</span> <span style="color: #990000">(</span>relu' <span style="color: #990000"><$></span> zv<span style="color: #990000">)</span>
descend av dv <span style="color: #990000">=</span> zipWith <span style="color: #990000">(-)</span> av <span style="color: #990000">((</span><span style="color: #993399">0.002</span> <span style="color: #990000">*)</span> <span style="color: #990000"><$></span> dv<span style="color: #990000">)</span>
learn xv yv layers <span style="color: #990000">=</span> <span style="font-weight: bold"><span style="color: #0000FF">let</span></span> <span style="color: #990000">(</span>avs<span style="color: #990000">,</span> dvs<span style="color: #990000">)</span> <span style="color: #990000">=</span> deltas xv yv layers
<span style="font-weight: bold"><span style="color: #0000FF">in</span></span> zip <span style="color: #990000">(</span>zipWith descend <span style="color: #990000">(</span>fst <span style="color: #990000"><$></span> layers<span style="color: #990000">)</span> dvs<span style="color: #990000">)</span> <span style="color: #990000">$</span>
zipWith3 <span style="color: #990000">(\</span>wvs av dv <span style="color: #990000">-></span> zipWith <span style="color: #990000">(\</span>wv d <span style="color: #990000">-></span> descend wv <span style="color: #990000">((</span>d<span style="color: #990000">*)</span> <span style="color: #990000"><$></span> av<span style="color: #990000">))</span>
wvs dv<span style="color: #990000">)</span> <span style="color: #990000">(</span>snd <span style="color: #990000"><$></span> layers<span style="color: #990000">)</span> avs dvs</tt></pre></div></div>
<div class="paragraph"><p>See
<a href="https://cs.stanford.edu/~blynn/haskell/brain.html">my Haskell notes</a> for
details. In short: ReLU activation function; online learning with a rate of
0.002; an ad hoc cost function that felt right at the time.</p></div>
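<div class="paragraph"><p>Spelled out, <tt>deltas</tt> and <tt>learn</tt> implement the usual
backpropagation recurrence, writing ⊙ for the elementwise product and
η = 0.002 for the learning rate:</p></div>
<div class="listingblock">
<div class="content monospaced">
<pre>δ[L] = dCost(a[L], y) ⊙ relu'(z[L])      -- output layer (delta0)
δ[l] = (W[l+1]ᵀ δ[l+1]) ⊙ relu'(z[l])    -- hidden layers (f)
b[l] ← b[l] − η δ[l]                      -- descend on biases
W[l] ← W[l] − η δ[l] a[l−1]ᵀ              -- descend on weights</pre>
</div></div>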
<div class="paragraph"><p>Despite cutting many corners, after a few runs, I obtained a neural network
that correctly classifies 9202 of 10000 handwritten digits in the
<a href="http://yann.lecun.com/exdb/mnist/">MNIST test set</a> in just one pass over the
training set.</p></div>
<div class="paragraph"><p>I found this result surprisingly good. Yet there is much more to explore: top
on my must-see list are deep learning (also described in Nielsen’s book)
and <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">long short-term
memory</a>.</p></div>
<div class="paragraph"><p>I turned the neural net into an
<a href="https://cs.stanford.edu/~blynn/mnist/"><strong>online digit recognition demo</strong></a>: you can
draw on the canvas and see how it affects the outputs.</p></div>
Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-59951747273467247552015-02-23T00:35:00.000-08:002015-02-23T00:35:55.280-08:00Mighty Warp<div id="preamble">
<div class="sectionbody">
<div class="paragraph"><p><a href="http://www.aosabook.org/en/posa/warp.html">Wow. There exists a Haskell web server
with the following properties</a>:</p></div>
<div class="ulist"><ul>
<li>
<p>
Outperforms <a href="http://www.nginx.org/">nginx</a>.
</p>
</li>
<li>
<p>
Under 1300 lines of source.
</p>
</li>
<li>
<p>
Clear control flow: handles one request per thread using blocking calls.
</p>
</li>
<li>
<p>
Slowloris DoS protection.
</p>
</li>
</ul></div>
<div class="paragraph"><p>The secret is GHC’s runtime system (RTS). Every Haskell program must spend
time in the RTS, and maybe this does hurt performance in certain cases, but for
web servers it is a huge win: the RTS automatically transforms code that seems
to handle one request per thread into a server with multiple event-driven
processes. This saves many a context switch while keeping the source simple.</p></div>
<div class="paragraph"><p>Best of all, this magic technology is widely available. To start a webserver
on port 3000 using the Warp library, run these commands on Ubuntu
[<a href="http://stackoverflow.com/questions/22620294/minimal-warp-webserver-example">original inspiration</a>]:</p></div>
<div class="listingblock">
<div class="content monospaced">
<pre>sudo apt-get install cabal-install
cabal update
cabal install warp
cat > server.hs << EOF
#!/usr/bin/env runghc
{-# LANGUAGE OverloadedStrings #-}
import Network.Wai (responseLBS)
import Network.Wai.Handler.Warp (run)
import Network.HTTP.Types (status200)
import Network.HTTP.Types.Header (hContentType)
main = run 3000 $ \_ f -> f $ responseLBS
status200 [(hContentType, "text/plain")] "Hello, world!\n"
EOF
chmod +x server.hs
./server.hs</pre>
</div></div>
<div class="paragraph"><p>Eliminating context switches is the best part of the story, but there’s more.
Copying data can be avoided with a simple but clever trick the authors call
<em>splicing</em>. Using <em>conduits</em> instead of lazy I/O solves the non-deterministic
resource finalization problem. And a few judiciously placed lockless atomic
operations can work wonders: in particular, for basic Slowloris protection and
for a robust file descriptor cache.</p></div>
<div class="paragraph"><p><strong>Uramaki</strong></p></div>
<div class="paragraph"><p>Those who fear straying too far from a C-like language can still reap the
benefits in <a href="http://www.golang.org/">Go</a>:</p></div>
<div class="ulist"><ul>
<li>
<p>
Goroutines are like green threads.
</p>
</li>
<li>
<p>
Channels are like conduits.
</p>
</li>
<li>
<p>
Array slices are like splices.
</p>
</li>
</ul></div>
<div class="paragraph"><p>If I didn’t know better, I would say the designers of Go emulated the
inventor of the <a href="http://en.wikipedia.org/wiki/California_roll">California roll</a>:
they took some of the best features of languages like Haskell and made them
palatable to a wider audience.</p></div>
<div class="paragraph"><p>I wonder how Go’s RTS compares. One innate advantage GHC may have is Haskell’s
type system, which leads to largely non-destructive computation, which
ultimately enables a cheap and effective scheduling scheme (namely,
<a href="http://www.aosabook.org/en/ghc.html">context switching on memory
allocation</a>). Still, I expect a well-written Go web server could achieve
similar results.</p></div>
</div>
</div>
Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-29977196339572333122014-12-20T03:54:00.000-08:002014-12-24T15:58:53.092-08:00Haskell for programming contests<a name="preamble"></a> <p><a href="http://www.drdobbs.com/architecture-and-design/farewell-dr-dobbs/240169421">Farewell, Dr. Dobb’s</a>. In a way, <a href="http://benlynn.blogspot.com.au/2014/08/let-code.html">my previous post</a> proved prescient: in the old days, I relied on printed magazines like Dr. Dobb’s Journal to learn about computers. Now, most things are but a few search terms away. Coding is easier than ever.</p> <p>Though I have a soft spot for this particular magazine, ultimately I’m glad information has become more organized and accessible. I like to think I played a part in this, however small, by posting my own tutorials, articles, rants, and code.</p> <p>Here’s hoping the remainder of my previous post also ages well. That is, may Haskell live long and prosper. [<a href="http://www.drdobbs.com/architecture-and-design/in-praise-of-haskell/240163246">A sentiment echoed by Dr. Dobb’s</a>.] Again, I’d like to play a small part in this: <b><a href="http://crypto.stanford.edu/~blynn/haskell/">Haskell for programming contests</a></b>.</p>Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-72056368162796628862014-08-12T23:15:00.000-07:002014-12-24T15:59:08.796-08:00Let's Code!<a name="preamble"></a> <p>A recent <a href="http://www.drdobbs.com/tools/just-let-me-code/240168735">article in Dr. Dobb’s Journal</a> bemoans the complexity of today’s development toolchains: “it’s hard to get any real programming done”. 
However, my own experience suggests the opposite: I find programming is now easier than ever, partly due to better tools.</p> <p>I say “partly” because when I was a kid, it was difficult to obtain code, compilers, and documentation, let alone luxuries like an SCM. I scoured public libraries for books on programming and checked out what they had, which meant I studied languages which I could never use because I lacked the right compiler, or even the right computer. I nagged my parents to buy me expensive books, and occasionally they’d succumb. Perhaps the most cost-efficient were magazines containing program listings which of course had to be keyed in by hand. (One of my most treasured was an issue of Dr. Dobb’s Journal, back when it was in print, and only in print.)</p> <p>Nowadays, a kid can get free high-quality compilers, code, tutorials, and more at the click of a button. But I believe even without this freer flow of information, programming would still be easier than ever because our tools have improved greatly.</p> <hr> <h2><a name="_got_git"></a>got git?</h2> <p>The author singles out Git as a source of trouble, but the reasoning is suspect. For example, we’re told that with respect to other “SCMs you’ve used…Git almost certainly does those same actions differently.”</p> <p>This suggests that the author used other SCMs, then tried Git and found it confusing. In contrast, I used Git, then tried other SCMs and found them confusing. I predict as time passes, more and more developers will learn Git first, and their opinions of SCMs will mirror mine.</p> <p>Nevertheless, I’m leery of ranking the friendliness of tools by the order you picked them up. I hereby propose a different yardstick. Take Git, and a traditional SCM. Implement, or at least think about implementing, a clone of each from scratch; just enough so it is self-hosting. 
Then the one that takes less time to implement is simpler.</p> <p>I wrote <a href="https://crypto.stanford.edu/~blynn/gg/">a self-hosting Git clone</a> in a few hours: longer than expected because I spent an inordinate amount of time debugging silly mistakes. Though I haven’t attempted it, I would need more time to write a clone of Perforce or Subversion (pretty much the only other SCMs I have used). With Git, there are no transactions, revision numbers, rename tracking, central servers, and so on; Git is essentially SHA-1 hashes all the way down.</p> <p>But let’s humour the author and suppose Git is complex. Then why not use tarballs and patches? This was precisely <a href="https://git.wiki.kernel.org/index.php/LinusTalk200705Transcript">how Linux was managed for 10 years</a>, so should surely suffice for a budding developer. In fact, I say you should only bother with Git once you realize, firstly, you’re addicted to coding, and secondly, how annoying it is to manage source with tarballs and patches!</p> <p>In other words, although Git is handy, you only really need it when your project grows beyond a certain point, by which time you’ve already had plenty of fun coding. Same goes for tools like defect trackers.</p> <hr> <h2><a name="_apps_and_oranges"></a>Apps and Oranges</h2> <p>I agree that developing for mobiles is painful. However, comparing this against those “simple programs of a few hundred lines of C++ long ago” is unfair. With mobile apps, the program usually runs on a system different to the one used to write the code.</p> <p>It might be fairer to compare writing a mobile app with, say, programming a dot matrix printer of yesteryear, as in both cases the target is different to the system used to write the code. I once did the latter, for the venerable Epson MX-80: after struggling with a ton of hardware-specific low-level nonsense, I was rewarded with a handful of crummy pictures. 
I would say it involved less “real programming” than writing an Android app.</p> <p>All the same, I concede that writing Android software is harder than it should be, <a href="http://www.youtube.com/watch?v=5kj5ApnhPAE">largely due to Java</a>. But firstly, a mobile phone involves security and privacy issues that would never arise with a dot matrix printer, which necessarily implies more bookkeeping, and secondly, the Java problem can be worked around: either via native code, or a non-Java compiler that generates Dalvik bytecode. [I’ve only mentioned Android throughout because it’s the only mobile platform I’ve developed on.]</p> <p>Comparing server-side web apps with the good old days is similarly unfair unless the good old days also involved networks, in which case they were really the bad old days. PC gamers of a certain age may remember a myriad of mysterious network options to configure multiplayer mode; imagine the even more mysterious code behind it. As for cloud apps, I would rather work on a cloud app than on an old-school equivalent: BBS software, which involves renting out extra phones lines if you want high availability.</p> <p>What about client-side web apps? As they can run on the same system used to develop them, it is therefore fair to compare developing them against writing equivalent code in those halcyon days of yore. Let’s look at a couple of examples.</p> <hr> <h2><a name="_tic_tac_toe"></a>Tic-tac-toe</h2> <p>I wrote <a href="https://crypto.stanford.edu/~blynn/play/tictactoe.html">a tic-tac-toe web app with an AI that plays perfectly</a> because it searches the entire game tree; modern hardware and browsers are so fast that this is bearable (though we’re spared one ply because the human goes first). 
It works on desktops, laptops, tablets, phones: anything with a browser.</p> <p>Here’s the minimax game tree search, based on code from <a href="http://www.cse.chalmers.se/~rjmh/Papers/whyfp.html">John Hughes, <em>Why Functional Programming Matters</em></a>:</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="4"><tr><td> <pre><code>score (Game _ Won 'X') = -1<br />score (Game _ Won 'O') = 1<br />score _ = 0<br /><br />maximize (Node leaf []) = score leaf<br />maximize (Node _ kids) = maximum (map minimize kids)<br /><br />minimize (Node leaf []) = score leaf<br />minimize (Node _ kids) = minimum (map maximize kids)</code></pre></td></tr></table> <p>Despite my scant Haskell knowledge and experience, the source consists of a single file containing less than 150 lines like the above, plus a small HTML file: hardly a “multiplicity of languages”. Writing it was enjoyable, and I did so with a text editor in a window 80 characters wide.</p> <p>Let’s rewind ten to twenty years. I’d have a hard time achieving the brevity and clarity of the above code. The compiler I used didn’t exist, and depending how far back we go, neither did the language. Not that I’d consider compiling to JavaScript in the first place: depending how far back we go, it was too slow or didn’t exist.</p> <hr> <h2><a name="_netwalk"></a>Netwalk</h2> <p>In my student days, I developed <a href="https://code.google.com/p/netwalk/">a clone of a Windows puzzle game named Netwalk</a>. I chose C, so users either ran untrusted binaries I supplied (one for each architecture), or built their own binaries from scratch. Forget about running it on phones and PDAs.</p> <p>I managed my files with tarballs and patches. The source consisted of a few thousand lines, though admittedly much of it is GUI cruft: menus, buttons, textboxes, and so on. Lately, I hacked up <a href="https://crypto.stanford.edu/~blynn/play/netwalk.html">a web version of Netwalk</a>. The line count? 
About 150.</p> <p>Thanks to Git, you can view the entire source right now on <a href="https://code.google.com/p/combotrain/source/browse/netwalk.hs">Google Code</a> or <a href="https://github.com/blynn/combotrain/blob/master/netwalk.hs">GitHub</a>, all dolled up with syntax highlighting and line numbers.</p> <p>Building native binaries has a certain charm, but I have to admit that a client-side web app has less overhead for developers and users alike. I only need to build the JavaScript once, then anyone with a browser can play.</p> <p>Thus in this case, my new tools are better than my old tools in every way.</p> <hr> <h2><a name="_choose_wisely"></a>Choose Wisely</h2> <p>The real problem perhaps is the sheer number of choices. Tools have multiplied and diversified, and some indeed impede creativity and productivity. But others are a boon for programmers: they truly just let you code.</p> <p>Which tools are the best ones? The answer probably depends on the person as well as the application, but I will say for basic client-side web apps and native binaries, I heartily recommend my choices: <a href="http://www.haskell.org/">Haskell</a>, <a href="http://haste-lang.org/">Haste</a>, <a href="http://git-scm.com/">Git</a>.</p> <p>I’m confident the above would perform admirably for other kinds of projects. I intend to find out, but at the moment I’m having too much fun coding games.</p> <p><a href="https://crypto.stanford.edu/~blynn/play/">Play Now!</a></p> Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-53323927324978221932014-07-29T22:52:00.000-07:002014-12-24T15:59:27.592-08:0015 Shades of Grey<a name="preamble"></a> <p><a href="http://en.wikipedia.org/wiki/John_Carmack">John Carmack</a> indirectly controlled significant chunks of my life. For hours at a time, I would fight in desperate gun battles in beautiful and terrifying worlds he helped create. 
On top of this, the technical wizardry of id Software’s games inspired me to spend yet more hours learning how they managed to run <em>Wolfenstein 3D</em> and <em>Doom</em> on PCs, in an era when clockspeeds were measured in megahertz and dedicated graphics cards were rare.</p> <p>I read about cool tricks like <a href="http://en.wikipedia.org/wiki/Binary_space_partitioning">binary space partitioning</a>, and eventually wrote a toy 3D engine of my own. The process increased my respect for the programmers: it’s incredibly difficult to get all the finicky details right while sustaining good frame rates.</p> <p>Accordingly, I paid close attention when <a href="http://www.youtube.com/watch?v=1PhArSujR_A">John Carmack spoke about programming languages in his QuakeCon 2013 keynote</a>. Many people, myself included, have strong opinions on programming languages, but few have a track record as impressive as his.</p> <hr> <h2><a name="_carmack_8217_s_sneaky_plan"></a>Carmack’s Sneaky Plan</h2> <p>I was flabbergasted by Carmack’s thoughts on the Haskell language. He starts by saying: “My big software evolution over the last, certainly three years and stretching back tendrils a little bit further than that, has been this move towards functional programming style and pure functions.”</p> <p>He then states that not only is Haskell suitable for programming games, but moreover, thinks Haskell may beat typical languages by roughly “a factor of two”, which “would be monumental” and “a really powerful thing for game development”. He has even begun reimplementing <em>Wolfenstein 3D</em> in Haskell as part of a “sneaky plan” to convince others.</p> <p>Wow! I had always thought Haskell was a pretty but impractical language. 
I loved composing elegant Haskell snippets to solve problems that one might encounter in interviews and programming contests, but for real stuff I resorted to C.</p> <p>Among my concerns is garbage collection: I have bad memories of unexpected frequent pauses in Java programs. But Carmack notes that Haskell’s almost uncompromising emphasis on purity simplifies garbage collection to the point where it is a predictable fixed overhead.</p> <p>A second concern is lazy evaluation. It’s easy to write clear and simple but inefficient Haskell: <a href="http://book.realworldhaskell.org/read/profiling-and-optimization.html">computing the average of a list of numbers</a> comes to mind. Carmack is also “still not completely sold on the value of laziness”, but evidently it’s not a showstopper for him. I suppose it’s all good so long as there are ways of forcing strict evaluation.</p> <p>A third concern (but probably not for Carmack) is that I don’t know how to write a Haskell compiler; I’m more at ease with languages when I know how their compilers work. I can ignore this discomfort, though I intend to overcome my ignorance one day. I’m hoping it’s mostly a matter of understanding Hindley-Milner type inference.</p> <p>Speaking of types, Carmack is a fan of static strong typing, because in his experience, “if it’s syntactically legal, it will make it into the codebase”. He notes during his recent foray into Haskell, the one time he was horribly confused was due to untyped data from the original <em>Wolfenstein 3D</em>.</p> <hr> <h2><a name="_my_obvious_plan"></a>My Obvious Plan</h2> <p>Once again, I’m inspired by Carmack. I plan to take Haskell more seriously to see if it really is twice as good. 
Although I lack the resources to develop a complex game, I may be able to slap together a few prototypes from time to time.</p> <p>First up is the <a href="http://en.wikipedia.org/wiki/15_puzzle">15-Puzzle by Noyes Palmer Chapman</a> with a cosmetic change: to avoid loading fonts and rendering text, I replaced the numbers 1 to 15 with increasingly darker shades of grey.</p> <div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAaEK6HVT59xIajIPkUgsVMWwPPFRxnI2Apf_ntf6EMx-guhkR6EmGUdysr0Mi7HMCCxL_OpboQr1uqXpKqMKoqPkvbG2TBieu6EY1aT17QHX6wQ6_RfxjLBJkt38UZFRZulN4hUEFzKI/s1600/15.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAaEK6HVT59xIajIPkUgsVMWwPPFRxnI2Apf_ntf6EMx-guhkR6EmGUdysr0Mi7HMCCxL_OpboQr1uqXpKqMKoqPkvbG2TBieu6EY1aT17QHX6wQ6_RfxjLBJkt38UZFRZulN4hUEFzKI/s1600/15.png" /></a></div><p>I began with a program depending on <a href="http://www.haskell.org/haskellwiki/SDL">SDL</a>. The result was surprisingly playable, and I found <a href="https://github.com/blynn/chapman">the source code</a> surprisingly short in spite of my scant knowledge of Haskell. To better show off my work, I made a few edits to produce a version of my program suitable for <a href="http://haste-lang.org/">the Haste compiler</a>, which compiles Haskell to JavaScript. I added mouse support and tweaked the HTML so the game is tolerable on tablets and phones.</p> <p><a href="http://crypto.stanford.edu/~blynn/15/">Play now!</a></p> Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com1tag:blogger.com,1999:blog-4222267598459829544.post-45922057514516539612014-05-25T10:49:00.000-07:002018-11-16T13:43:58.177-08:00Straw Men in Black<a name="preamble"></a> <p>There’s a phrase used to praise a book: “you can’t put it down”. 
Unfortunately, I felt the opposite while reading <em>The Black Swan</em> by Nassim N. Taleb.</p> <p>I’ll admit some prejudice. We’re told not to judge a book by its cover, but review quotes in the blurb ought to be exempt. One such quote originated from Peter L. Bernstein, the author of <a href="http://books.google.com/books/about/Against_the_Gods.html?id=uTje6PYAijUC"><em>Against the Gods</em></a>. While I enjoyed reading it, <a href="http://benlynn.blogspot.com/2013/02/probability-made-less-uneasy.html">his book contained a litany of elementary mathematical mistakes.</a> Did this mean <em>The Black Swan</em> was similarly full of errors?</p> <p>All the same, the book began well. Ideas were clear and well-expressed. The writing was confident: perhaps overly so, but who wants to read text that lacks conviction? It promised wonders: we would learn how statisticians have been fooling us, and then learn the right way to deal with uncertainty, with potentially enormous life-changing payoffs.</p> <p>I failed to reach this part because several chapters in, I was exhausted by a multitude of issues. I had to put the book down. I intend to read further once I’ve recovered, and hopefully the book will redeem itself. Until then, here are a few observations.</p> <hr> <h2><a name="_one_weird_trick"></a>One Weird Trick</h2> <p><a href="http://www.slate.com/articles/business/moneybox/2013/07/how_one_weird_trick_conquered_the_internet_what_happens_when_you_click_on.html">What’s on the other end of those "one weird trick" online ads</a>? You won’t find out easily. If clicked, one is forced to sit through a video that:</p> <ul> <li> <p> makes impressive claims about a product </p> </li> <li> <p> takes pains to keep the product a secret </p> </li> <li> <p> urges the viewer to wait until the end, when they will finally learn the secret </p> </li> </ul> <p>This recipe must be effective, because I couldn’t help feeling the book was similar. 
It took me on a long path, meandering from anecdote to anecdote, spiced with poorly constructed arguments and sprinkled with assurances that the best was yet to come.</p> <p>Perhaps this sales tactic has become a necessary evil. With so much competition, how can a book distinguish itself? Additionally, I’m guessing fattening the book for any reason has a positive effect on sales.</p> <p>Even so, the main idea of the book could be worth reading. I’ll post an update if I find out.</p> <hr> <h2><a name="_lay_off_laplace"></a>Lay Off Laplace</h2> <p>Chapter 4 features a story about a turkey. As days pass, a turkey’s belief in a proposition such as "I will be cared for tomorrow" grows ever stronger, right until the day of its execution, when its belief turns out to be false. This retelling of Bertrand Russell’s parable about a chicken is supposed to warn us about inferring knowledge from observations, a repeated theme in the book.</p> <p>But what about Laplace’s <a href="http://en.wikipedia.org/wiki/Sunrise_problem">sunrise problem</a>? By the <a href="http://en.wikipedia.org/wiki/Rule_of_succession">Rule of Succession</a>, if the sun rose every day for 5000 years, that is, for 5000 × 365.2426 days, the odds it will rise tomorrow are only 1826214 to 1. Ever since Laplace wrote about this, he has been mercilessly mocked because of this ludicrously small probability.</p> <p>So which is it? Do repeated observations make our degrees of belief too strong (chicken) or too weak (sunrise)?</p> <p><strong>Live long and prosper</strong></p> <p>Much of this material is discussed in Chapter 18 of <a href="http://books.google.com/books/about/Probability_Theory.html?id=tTN4HuUNXjgC"><em>Probability Theory: The Logic of Science</em></a> by Edwin T. Jaynes, which also contains the following story.</p> <p>A boy turns 10 years old. The Rule of Succession implies the probability he lives one more year is (10 + 1) / (10 + 2), which is 11/12.
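Both numbers are easy to check mechanically. Here is a quick Go sketch; the <tt>succession</tt> helper is my own name for the formula, not anything from the book:

```go
package main

import "fmt"

// succession applies Laplace's Rule of Succession: after n successes in n
// trials and no failures, the probability of one more success is (n+1)/(n+2).
func succession(n int64) (num, den int64) {
	return n + 1, n + 2
}

func main() {
	num, den := succession(10) // the 10-year-old boy
	fmt.Printf("boy: %d/%d\n", num, den)

	// 5000 years of sunrises: odds of (n+1) to 1 that the sun rises tomorrow.
	n := int64(5000 * 365.2426)
	num, _ = succession(n)
	fmt.Printf("sunrise odds: %d to 1\n", num)
}
```

Running it prints 11/12 for the boy and odds of 1826214 to 1 for the sunrise, as above.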
A similar computation shows his 70-year-old grandfather will live one more year with probability 71/72.</p> <p>I like this example, because it contains both the chicken and the sunrise problem. Two for the price of one. Shouldn’t the old man’s number be lower than the young boy’s? One number seems too big and the other too small. How can the same rule be wrong in two different ways?</p> <p><strong>Ignorance is strength?</strong></p> <p>What should we do to avoid these ridiculous results?</p> <p>Well, if the sun rose every day for 5000 years <em>and that is all you know</em>, then 1826214 to 1 is correct. The only reason we think this is too low is that we know a lot more than the number of consecutive sunrises: we know about stars, planets, orbits, gravity, and so on. If we take all this into account, our degree of belief that the sun rises tomorrow grows much stronger.</p> <p>The same goes for the other examples. In each one, we:</p> <ol type="1"> <li> <p> Ignored what we know about the real world. </p> </li> <li> <p> Calculated based on what little data was left. </p> </li> <li> <p> Un-ignored the real world so we could laugh at the results. </p> </li> </ol> <p>In other words, we have merely shown that ignoring data leads to bad results. It’s as obvious as noting that if you shut your eyes while driving a car, you’ll end up crashing.</p> <p>Sadly, despite pointing this out, Laplace became a victim of this folly. Immediately after describing the sunrise problem, Laplace explains that the unacceptable answer arises because of wilfully neglected data. For some reason, his critics take his sunrise problem, ignore his explanation for the hilarious result, then savage his ideas.</p> <p><em>The Black Swan</em> joins the peanut gallery in condemning Laplace. However, its conclusion differs from those of most detractors. The true problem is that most of the data is ignored when computing probabilities. Taleb considers addressing this by ignoring even more data!
But then why not toss out more? Why not throw away most of mathematics and assign arbitrary probabilities to arbitrary assertions?</p> <p>Orthodox statistics is indeed broken, but not because more data should be ignored. It’s broken for the opposite reason: too much data is being ignored.</p> <p>Poor Laplace. Give the guy a break.</p> <hr> <h2><a name="_hempel_8217_s_joke"></a>Hempel’s Joke</h2> <p>Stop me if you’ve heard this one: 2 + 2 = 5 for sufficiently large values of 2. This is obviously a joke (though sometimes told so convincingly that <a href="http://www.straightdope.com/columns/read/1382/does-2-2-5-for-very-large-values-of-2">the audience is unsure</a>).</p> <p>Hempel’s Paradox is a similar but less obvious joke that proceeds as follows. Consider the hypothesis: all ravens are black. This is logically equivalent to saying all non-black things are non-ravens. Therefore seeing a white shoe is evidence supporting the hypothesis.</p> <p>The following <a href="http://golang.org/">Go</a> program makes the attempted humour abundantly clear:</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>package main<br /><br />import "fmt"<br /><br />func main() {<br /> state := true<br /> for {<br /> var colour, thing string<br /> if _, e := fmt.Scan(&colour, &thing); e != nil {<br /> break<br /> }<br /> if thing == "raven" && colour != "black" {<br /> state = false<br /> }<br /> fmt.Println(" hypothesis:", state)<br /> }<br />}</pre></td></tr></table> <p>A sample run:</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>black raven<br /> hypothesis: true<br />white shoe<br /> hypothesis: true<br />red raven<br /> hypothesis: false<br />black raven<br /> hypothesis: false<br />white shoe<br /> hypothesis: false</pre></td></tr></table> <p>The state of the hypothesis is represented by a boolean variable. Initially the boolean is true, and it remains true until we encounter a non-black raven. 
This is the only way to change the state of the program: neither "black raven" nor "white shoe" has any effect.</p> <p>Saying we have "evidence supporting the hypothesis" is saying there are truer values of true. It’s like saying there are larger values of 2.</p> <p>The original joke exploits the mathematical concept “sufficiently large”, which has applications, but is absurd when applied to constants.</p> <p>Similarly, Hempel’s joke exploits the concept "supporting evidence", which has applications, but is absurd when applied to a lone hypothesis.</p> <p><strong>Off by one</strong></p> <p>If we want to talk about evidence supporting or undermining a hypothesis mathematically, we’ll need to advance beyond boolean logic. Conventionally we represent degrees of belief with numbers between 0 and 1. The higher the number, the stronger the belief. We call these <em>probabilities</em>.</p> <p>Next, we propose some mutually exclusive hypotheses and assign probabilities between 0 and 1 to each one. The sum of the probabilities must be 1.</p> <p>If we take a single proposition by itself, such as "all ravens are black", then we’re forced to give it a probability of 1. We’re reduced to the situation above, where the only interesting thing that can happen is that we see a non-black raven and we realize we must restart with a different hypothesis. (In general, probability theory taken to extremes devolves into plain logic.)</p> <p>We need at least two propositions with nonzero probabilities for the phrase "supporting evidence" to make sense. For example, we might have two propositions A and B, with probabilities of 0.2 and 0.8 respectively. If we find evidence supporting A, then its probability increases and the probability of B decreases accordingly, for their sum must always be 1.
Naturally, as before, we may encounter evidence that implies all our propositions are wrong, in which case we must restart with a fresh set of hypotheses.</p> <p>To avoid nonsense, we require at least two mutually exclusive propositions, such as A: "all ravens are black", and B: "there exists a non-black raven", and each must have a nonzero probability. Now it makes sense to ask if a white shoe is supporting evidence. Does it support A at B’s expense? Or B at A’s expense? Or neither?</p> <p>The propositions as stated are too vague to answer one way or another. We can make the propositions more specific, but there are infinitely many ways to do so, and the choices we make change the answer. See Chapter 5 of Jaynes.</p> <p><strong>One Card Trick</strong></p> <p>Instead of trying to flesh out hypotheses involving ravens, let us content ourselves with a simpler scenario. Suppose a manufacturer of playing cards has a faulty process that sometimes uses black ink instead of red ink to print the entire suit of hearts. We estimate one in ten packs of cards has black hearts instead of red hearts but is otherwise normal, while the other nine in ten are perfectly fine.</p> <p>We’re given a pack of cards from this manufacturer. Thus we believe the hypothesis A: "all hearts are red" with probability 0.9, and B: "there exists a non-red heart" with probability 0.1. We draw a card. It’s the four of clubs. What does this do to our beliefs?</p> <p>Nothing. Neither hypothesis is affected by this irrelevant evidence.
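The "nothing" can be made precise with Bayes' theorem. A minimal sketch in Go (the <tt>posterior</tt> helper is mine, not from the book): the four of clubs has probability 1/52 under either hypothesis, so the posterior equals the prior, while a black four of hearts is impossible under A and kills it outright.

```go
package main

import "fmt"

// posterior returns P(A|E) by Bayes' theorem, given the prior P(A) and the
// likelihoods P(E|A) and P(E|B), where B is the only rival hypothesis,
// so P(B) = 1 - P(A).
func posterior(priorA, likeA, likeB float64) float64 {
	return priorA * likeA / (priorA*likeA + (1-priorA)*likeB)
}

func main() {
	// Four of clubs: probability 1/52 whether or not the hearts are misprinted.
	fmt.Println(posterior(0.9, 1.0/52, 1.0/52)) // ~0.9: belief unchanged

	// Black four of hearts: impossible under A, probability 1/52 under B.
	fmt.Println(posterior(0.9, 0, 1.0/52)) // 0: restart with new hypotheses
}
```

Equal likelihoods cancel out of the ratio, which is exactly what "irrelevant evidence" means here.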
I believe this is at least intuitively clear to most people, and furthermore, had Hempel spoken of hearts and clubs instead of ravens and shoes, his joke would have been more obvious.</p> <p><strong>Great Idea, Poor Execution</strong></p> <p><em>The Black Swan</em> attacks orthodox statistics using Hempel’s paradox, alleging that it shows we should beware of evidence supporting a hypothesis.</p> <p>It turns out orthodox statistics can be attacked with Hempel’s paradox, but not by claiming "supporting evidence" is meaningless. That would be like claiming "sufficiently large" is meaningless.</p> <p>Instead, Hempel’s joke reminds us we must consider more than one hypothesis if we want to talk about supporting evidence. This may seem obvious; assigning a degree of belief in a lone proposition is like awarding points in a competition with only one contestant.</p> <p>However, apparently it is not obvious enough. <em>The Black Swan</em> misses the point, and so did my university professors. My probability and statistics textbook instructs us to consider only one hypothesis. (Actually, it’s worse: one of the steps is to devise an alternate hypothesis, but this second hypothesis is never used in the procedure!)</p> <hr> <h2><a name="_mathematics_versus_society"></a>Mathematics Versus Society</h2> <p>In an off-hand comment, Taleb begins a sentence with “Mathematicians will try to convince you that their science is useful to society by…”</p> <p>By this point, I had already found faults. First and foremost: how often do mathematicians talk about their usefulness to society? There are many jokes about mathematicians and real life, such as:</p> <blockquote> <p>Engineers believe their equations approximate reality. Physicists believe reality approximates their equations. Mathematicians don’t care.</p> <p align="right"> </p> </blockquote> <p>The truth is being exaggerated for humour, but asserting their work is useful in the real world is evidently a low priority for mathematicians.
It is almost a point of pride. In fact, Taleb himself later quotes Hardy:</p> <blockquote> <p>The “real” mathematics of the “real” mathematicians…is almost wholly “useless”.</p> <p align="right"> </p> </blockquote> <p>This outlook is not new. Gauss called number theory “the queen of mathematics”, because it was pure and beautiful and had no applications in real life. (He had no way of foreseeing that number theory would one day be widely used in real life for secure communication!)</p> <p>But sure, whatever, let’s suppose mathematicians go around trying to convince others that their field is useful to society. [Presumably Hardy would call such a mathematician “imaginary” or “complex”.] They are trivially right. If you try to talk about how useful things are to society, then you’ll want to measure and compare usefulness of things, all the while justifying your statements with sound logical arguments. Measuring and comparing and logic all lie squarely in the domain of mathematics.</p> <hr> <h2><a name="_jumping_to_conclusions"></a>Jumping to Conclusions</h2> <p>So far, I feel the author’s heart is in the right place but his reasoning is flawed. Confirmation bias is indeed pernicious, and orthodox statistics is indeed erroneous. However, <em>The Black Swan</em> knocks down straw men instead of hitting these juicy targets.</p> <p>The above are but a few examples of the difficulties I ran into while reading the book. 
I had meant to pick apart more specious arguments but I’ve already written more than I had intended.</p> <p>Again, I stress I have not read the whole work, and it may improve in the second half.</p> Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-19399393646954425902014-04-18T21:53:00.000-07:002014-04-18T21:53:29.150-07:00Crit-bit trees yet again<a name="preamble"></a> <p>I’ve been pleased with <a href="http://benlynn.blogspot.com/2013/11/crit-bit-tree-micro-optimizations_3.html">my recent implementation of crit-bit trees</a> based on <a href="https://www.imperialviolet.org/binary/critbit.pdf">Adam Langley’s code</a>, but in the back of my mind, I’ve been bothered by <a href="http://cr.yp.to/critbit.html">Bernstein’s statement</a> that the overhead on top of each string stored need only be two control words: one pointer and one integer.</p> <p>To obtain relief, I updated my library so that internal nodes only have a single child pointer instead of one pointer each for the left and right child nodes. In a crit-bit tree, an internal node always has two children, so we may as well allocate sibling nodes in one contiguous chunk of memory.</p> <p>Bernstein suggests storing strings directly in the crit-bit tree, which can be done by storing the left string backwards: it resides to the left of the integer control word. I’ve chosen an alternate scheme which he also mentions: expressing the strings as pointers in a separate storage pool.</p> <p>Removing one pointer per internal node has several benefits on a 64-bit machine. Firstly, malloc is typically 16-byte aligned, which means that the 24-byte internal nodes of the original version were actually taking 32 bytes. In contrast, now we only need just one pointer and one integer, so we can fit internal nodes within 16 bytes.</p> <p>Secondly, we no longer have to tag pointers. 
Instead, 8 bytes hold an unadulterated child pointer, and the other 8 bytes store the crit-bit position as well as one bit that indicates whether the current node is internal or not.</p> <p>Thirdly, allocating two nodes at once only requires a single malloc() call. Before, we would call malloc() to allocate a new leaf node, and again to allocate a new internal node.</p> <p>Again, "unrolling" the child pointer lookup produced better benchmarks on my system, namely, instead of <tt>p = p->kid + predicate()</tt>, we write <tt>p = predicate() ? p->kid + 1 : p->kid</tt>. This obviated the need for some fun but complex micro-optimizations. While I was at it, I unrolled a few other routines.</p> <p>Benchmarks:</p> <div> <table rules="all" width="100%" frame="border" cellspacing="0" cellpadding="4"> <caption><b>Table 1. </b>Keys: <tt>/usr/share/dict/words</tt></caption> <thead> <tr> <th align="left" width="33%" valign="top"></th> <th align="left" width="33%" valign="top">old</th> <th align="left" width="33%" valign="top">new</th> </tr> </thead> <tbody> <tr> <td align="left" width="33%" valign="top"><p>insert</p></td> <td align="left" width="33%" valign="top"><p>0.066135</p></td> <td align="left" width="33%" valign="top"><p>0.056116</p></td> </tr> <tr> <td align="left" width="33%" valign="top"><p>get</p></td> <td align="left" width="33%" valign="top"><p>0.043584</p></td> <td align="left" width="33%" valign="top"><p>0.040845</p></td> </tr> <tr> <td align="left" width="33%" valign="top"><p>iterate</p></td> <td align="left" width="33%" valign="top"><p>0.031851</p></td> <td align="left" width="33%" valign="top"><p>0.015135</p></td> </tr> <tr> <td align="left" width="33%" valign="top"><p>allprefixed</p></td> <td align="left" width="33%" valign="top"><p>0.013285</p></td> <td align="left" width="33%" valign="top"><p>0.011924</p></td> </tr> <tr> <td align="left" width="33%" valign="top"><p>delete</p></td> <td align="left" width="33%" valign="top"><p>0.048030</p></td> <td 
align="left" width="33%" valign="top"><p>0.041754</p></td> </tr> <tr> <td align="left" width="33%" valign="top"><p>overhead</p></td> <td align="left" width="33%" valign="top"><p>3966824</p></td> <td align="left" width="33%" valign="top"><p>3173464</p></td> </tr> </tbody> </table> </div> <div> <table rules="all" width="100%" frame="border" cellspacing="0" cellpadding="4"> <caption><b>Table 2. </b>Keys: output of <tt>seq 2000000</tt></caption> <thead> <tr> <th align="left" width="33%" valign="top"></th> <th align="left" width="33%" valign="top">old</th> <th align="left" width="33%" valign="top">new</th> </tr> </thead> <tbody> <tr> <td align="left" width="33%" valign="top"><p>insert</p></td> <td align="left" width="33%" valign="top"><p>2.700949</p></td> <td align="left" width="33%" valign="top"><p>2.359999</p></td> </tr> <tr> <td align="left" width="33%" valign="top"><p>get</p></td> <td align="left" width="33%" valign="top"><p>2.304070</p></td> <td align="left" width="33%" valign="top"><p>2.214699</p></td> </tr> <tr> <td align="left" width="33%" valign="top"><p>iterate</p></td> <td align="left" width="33%" valign="top"><p>0.912950</p></td> <td align="left" width="33%" valign="top"><p>0.472594</p></td> </tr> <tr> <td align="left" width="33%" valign="top"><p>allprefixed</p></td> <td align="left" width="33%" valign="top"><p>0.295322</p></td> <td align="left" width="33%" valign="top"><p>0.264326</p></td> </tr> <tr> <td align="left" width="33%" valign="top"><p>overhead</p></td> <td align="left" width="33%" valign="top"><p>79999984</p></td> <td align="left" width="33%" valign="top"><p>63999992</p></td> </tr> <tr> <td align="left" width="33%" valign="top"><p>delete</p></td> <td align="left" width="33%" valign="top"><p>2.339102</p></td> <td align="left" width="33%" valign="top"><p>2.177294</p></td> </tr> </tbody> </table> </div> <p>The insert benchmark of my old library is slightly worse than the one I originally posted because I have since tweaked the code to make it easy to 
carry out operations such as "insert or replace" or "insert if absent" in a single step.</p> <p><a href="https://github.com/blynn/blt">https://github.com/blynn/blt</a></p> Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com2tag:blogger.com,1999:blog-4222267598459829544.post-82166478200297711602013-12-19T23:41:00.000-08:002013-12-19T23:41:22.810-08:00Reentrant parsers with Flex and Bison<a name="preamble"></a> <p>By default, Flex and Bison generate old-school code with global variables. Trawling the manuals to find the options that generate re-entrant code is tedious, so I’m recording a small example that works on my system (which has Bison 2.7.12 and Flex 2.5.35).</p> <p><strong>Flex preamble</strong></p> <p>With these options, <tt>yylval</tt> is now a pointer. When converting existing Flex source, we mostly replace <tt>yylval</tt> with <tt>*yylval</tt>.</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>%option outfile="flex.c" header-file="flex.h"<br />%option reentrant bison-bridge<br />%option noyywrap nounput noinput<br /><br />%{<br />#include "bison.h"<br />%}</pre></td></tr></table> <p><strong>Bison preamble</strong></p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>%output "bison.c"<br />%defines "bison.h"<br />%define api.pure full<br />%lex-param { yyscan_t scanner }<br />%parse-param { yyscan_t scanner }<br />%parse-param { val_callback_t callback }<br /><br />%code requires {<br />#include "val.h"<br />#define YYSTYPE val_ptr<br />#ifndef YY_TYPEDEF_YY_SCANNER_T<br />#define YY_TYPEDEF_YY_SCANNER_T<br />typedef void *yyscan_t;<br />#endif<br />}<br /><br />%code {<br />#include "flex.h"<br />int yyerror(yyscan_t scanner, val_callback_t callback, const char *msg) {<br /> return 0;<br />}<br />}</pre></td></tr></table> <p><strong><tt>val.h</tt>: semantic values</strong></p> <p>Rather than use the <tt>%union</tt> Bison declaration or similar, I prefer to
define the type that holds the semantic values in a C source file. In general, I like to minimize the amount of C in the Bison and Flex source.</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>enum {<br /> T_INT,<br /> T_STRING,<br />};<br /><br />struct val_s {<br /> int type;<br /> struct {<br /> char *s;<br /> struct val_s **kid;<br /> int nkid;<br /> };<br />};<br />typedef struct val_s *val_ptr;<br />typedef int (*val_callback_t)(val_ptr);</pre></td></tr></table> <p><strong>Calling the parser</strong></p> <p>Because the parser is no longer global, we must initialize and pass a <tt>yyscan_t</tt> variable to Bison and Flex.</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre> yyscan_t scanner;<br /> if (yylex_init(&scanner)) exit(1);<br /> YY_BUFFER_STATE buf = NULL;<br /> // Uncomment to parse from a string instead of standard input.<br /> // buf = yy_scan_string("input string", scanner);<br /> int f(struct val_s *v) {<br /> val_print_pre(v);<br /> putchar('\n');<br /> val_print_tree("", v);<br /> val_free(v);<br /> return 0;<br /> }<br /> if (yyparse(scanner, f)) exit(1);<br /> yy_delete_buffer(buf, scanner);<br /> yylex_destroy(scanner);</pre></td></tr></table> <p><strong>Complete example</strong></p> <p>See <a href="https://github.com/blynn/symple/tree/1b838f658ef601c91b9478d12501918c29d3614d"><tt>https://github.com/blynn/symple/</tt></a>, which reads an expression and pretty-prints it:</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>$ ./main 'sin(x)*cos(y) + e^x'<br />+(*(sin(x), cos(y)), ^(e, x))<br />+─┬─*─┬─sin───x<br /> │ └─cos───y<br /> └─^─┬─e<br /> └─x</pre></td></tr></table> Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com2tag:blogger.com,1999:blog-4222267598459829544.post-35331593566816723162013-11-21T00:54:00.000-08:002014-04-19T10:08:42.999-07:00To Brute-Force A Mockingbird<a name="preamble"></a> <p><em>To Mock a 
Mockingbird</em> by Raymond M. Smullyan should be required reading for any fan of the programming language Haskell. We learn combinatory logic through a series of delightful puzzles, almost without realizing.</p> <p>We’re asked to imagine a forest populated by talking birds. On encountering one of these birds, we may call out the name of any bird. In response, the bird will say the name of some bird in the forest. (The reply could be the same bird we named, or the bird’s own name, or any other bird.)</p> <p>An enchanted forest populated by birds is disarmingly endearing. We’re almost unaware we’re actually dealing with a set of functions that take a function as input and return a function. The evocative backdrop also pays homage to Haskell Curry, who was an avid birdwatcher.</p> <p>One puzzle challenges us to find an egocentric bird given that a lark lives in the forest. Or, using mundane terminology, given a combinator L such that (Lx)y = x(yy) for all x and y, construct a combinator E such that EE = E.</p> <p>The author states his solution is “a bit tricky” and consists of 12 correctly parenthesized Ls. Furthermore, the author states he doesn’t know if a shorter solution exists.</p> <p>To maximize the likelihood of solving this puzzle, the reader should take advantage of facts learned from previous puzzles, and build up to the solution in stages. But that’s only if you’re playing fair and using pencil and paper! Instead, I saw an opportunity to bust out one of my favourite snippets of code.</p> <p><strong>Seeing the forest for the trees</strong></p> <p>Let us first recast the problem in terms of trees. Instead of Ls and parentheses, we work with <a href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">syntax trees</a>. In other words, we work with full binary trees where each leaf node corresponds to an L, and to evaluate an internal node, we recursively evaluate its child nodes, then apply the left child to the right child. 
(In Smullyan’s terminology, we call out the name of the bird represented by the right child to the bird represented by the left child.)</p> <p>In this setting, the puzzle is to find a full binary tree such that repeatedly transforming parts of the tree according to the rule (Lx)y = x(yy) produces a tree where both of the root node’s children are identical to the original tree.</p> <p>Hence to solve with brute force, we need only generate all full binary trees containing up to 12 leaf nodes, and for each one see if we can transform the tree into two copies of itself.</p> <p>Here’s where my treasured routine comes in. The following roughly describes how to call a function on every full binary tree with exactly <em>n</em> leaf nodes:</p> <ul> <li> <p> Allocate a node <em>x</em>. </p> </li> <li> <p> If <em>n</em> is 1, then mark <em>x</em> as a leaf, call the function, then return. </p> </li> <li> <p> Otherwise mark <em>x</em> as an internal node, and for every 0 < <em>k</em> < <em>n</em>: </p> <ul> <li> <p> For every full binary tree <em>y</em> with <em>k</em> leaf nodes: </p> <ul> <li> <p> Set the left child of <em>x</em> to <em>y</em>. </p> </li> <li> <p> For every full binary tree <em>z</em> with <em>n</em> - <em>k</em> leaf nodes: </p> <ul> <li> <p> Set the right child of <em>x</em> to <em>z</em>. </p> </li> <li> <p> Call the function. </p> </li> </ul> </li> </ul> </li> </ul> </li> </ul> <p>We generate the left and right subtrees by calling this algorithm recursively. 
More precisely, in Go:</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>type node struct {<br /> kind int // 0 = leaf, 1 = branch.<br /> left, right *node<br />}<br /><br />// For each full binary tree with n leaf nodes,<br />// sets '*p' to a pointer to the tree and calls the given function.<br />func forall_tree(p **node, n int, fun func()) {<br /> var t node<br /> *p = &t<br /> if (n == 1) {<br /> t.kind = 0<br /> fun()<br /> return<br /> }<br /> t.kind = 1<br /> for k := 1; k < n; k++ {<br /> forall_tree(&t.left, k, func() {<br /> forall_tree(&t.right, n - k, fun)<br /> })<br /> }<br />}</pre></td></tr></table> <p>I was proud when I found this gem a few years ago while working on a <a href="http://projecteuler.net/">Project Euler</a> problem, though I’d be shocked if it were original. [Actually, my first version preallocated an array of 2n - 1 nodes and used indices instead of pointers to save a bit of time and space, but this is less elegant.]</p> <p>For example, we can print the first 10 <a href="http://en.wikipedia.org/wiki/Catalan_number">Catalan numbers</a>:</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>func main() {<br /> for k := 1; k <= 10; k++ {<br /> var root *node<br /> n := 0<br /> forall_tree(&root, k, func() { n++ })<br /> println(n)<br /> }<br />}</pre></td></tr></table> <p>Or print all full binary trees with exactly 6 leaf nodes, as parenthesized expressions:</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>func tree_print(p *node) {<br /> if p.kind == 1 {<br /> print("(")<br /> tree_print(p.left)<br /> tree_print(p.right)<br /> print(")")<br /> return<br /> }<br /> print("L")<br />}<br /><br />func main() {<br /> var root *node<br /> forall_tree(&root, 6, func() { tree_print(root); println() })<br />}</pre></td></tr></table> <p>With a little more effort, we can write a program that solves the puzzle.
However, some care is needed: if we replace subtrees of the form (Lx)y with x(yy) and vice versa without rhyme or reason, we’ll have no idea when we’ll finish and we’ll only stumble across a solution by chance.</p> <p>Instead, we observe that (Lx)y is either strictly smaller than x(yy), or has the form (Lx)L. Let us say that we are reducing the tree when we replace x(yy) by (Lx)y, and expanding when we perform the reverse. Thus rather than starting from a tree t and repeatedly applying the rule to obtain the tree tt, we do the reverse: we start from tt, and consider reductions only. The above observation implies that every sequence of reductions must terminate eventually.</p> <p>But what if we need to temporarily expand tt before reducing it in order to reach t? Let’s optimistically hope that Smullyan’s 12-L solution was sufficiently expanded; that is, only reductions are needed to go from tt to t, where t is his solution.</p> <p>Multiple subtrees may be reducible, and choosing the wrong one can prevent future reductions necessary to reach the goal. We therefore try every path: for each possible reduction, we apply it and recurse. This wastefully repeats many computations, because there can be several ways to arrive at the same tree. We tackle this in an obvious manner: by remembering the trees we’ve seen so far.</p> <p>I wrote solutions in GNU C and Go. The Go solution is a bit too slow for comfort. The C code is slightly clumsier, mainly because I had to name the anonymous functions (though one can <a href="http://en.wikipedia.org/wiki/Anonymous_function#GCC">define a macro to work around this</a>). C also lacks a map data structure, but this was no problem thanks to my recently released <a href="http://github.com/blynn/blt">BLT library</a>.</p> <p><strong>Results (Spoiler Alert)</strong></p> <p>Optimism paid off. 
On my laptop, my C program took well under a minute to find 4 solutions:</p> <pre>(((L(LL))(L(LL)))((L(LL))(L(LL))))<br />(((L(LL))((LL)L))((L(LL))((LL)L)))<br />(((L(LL))((LL)L))(((LL)L)((LL)L)))<br />((((LL)L)((LL)L))(((LL)L)((LL)L)))</pre><p>The Go version took about 10 minutes.</p> <p>These all contain 12 Ls, so in a sense, Smullyan’s solution is minimal. Since no other strings are printed, these four 12-L solutions are minimal when only reductions are permitted.</p> <p>If we allow expansions (that is, (Lx)y → x(yy)), then firstly, we have at least 2<sup>4</sup> = 16 solutions of length 12, since in this case (L(LL)) and ((LL)L) are interchangeable, and secondly, we can reduce the above strings to find shorter solutions. For example, the solution:</p> <pre>(((L(LL))(L(LL)))((L(LL))(L(LL))))</pre> <p>reduces to:</p> <pre>(L((L(LL))(L(LL))))(L(LL))</pre> <p>which further reduces to:</p> <pre>((LL)(L(LL)))(L(LL))</pre> <p>which only has 8 Ls. I doubt Smullyan missed this. My guess is he meant that if you solve the problem the clever way, then you arrive at an expression with 12 Ls; reductions should be ignored because they only obscure the ingenuity of the solution.</p> <p>Is there, say, a 7-L expression that expands to some expression (that is necessarily longer than 24 Ls) which can be reduced to half of itself? I think not, but I have no proof.</p> <p><strong>Exercise: Four Fours</strong></p> <p>I wanted to do something with the Go version of the <tt>forall_tree()</tt> routine, so I tweaked it to solve <a href="http://en.wikipedia.org/wiki/Four_fours">the four fours puzzle</a>. I just ploughed through all possible trees and evaluated each one; there are only 320 to do. For larger variants of the puzzle, I’d use dynamic programming; that is, memoize subtrees and their values. 
Division by zero is annoying, but I managed to keep the tree evaluation function short and sweet by using Go’s panic-recover mechanism.</p> <p><strong>Recursion: the best thing since recursion</strong></p> <p>The <tt>forall_tree()</tt> function is but one example of the eloquence of anonymous functions and recursion. For similar reasons, nested functions are also indispensable. We attain economy of expression by letting the stack automatically take care of the heavy lifting.</p> <p>Curiously, <a href="http://en.wikipedia.org/wiki/Nested_function">early structured languages including ALGOL, Simula, and Pascal supported nested functions</a>, but C shied away from this beautiful feature.</p> <p>Its absence is sorely missed, as can be seen in C-style callbacks. An ugly <tt>void *</tt> argument is inevitably passed and cast. For instance, take <a href="http://www.libsdl.org/release/SDL-1.2.15/docs/html/guideaudioexamples.html">the <tt>udata</tt> parameter in the SDL audio API</a>.</p> <p>Its absence is also egregiously gratuitous. <a href="http://gcc.gnu.org/onlinedocs/gccint/Trampolines.html">GCC supports nested functions by using this one weird trick</a> [compiler writers hate him!]. 
One might complain this weird trick conflicts with executable space protection, but <a href="http://crypto.stanford.edu/~blynn/rop/">executable space protection is worse than pointless thanks to return-oriented programming</a>.</p> <p><strong>Code Complete</strong></p> <ul> <li> <p> <a href="https://github.com/blynn/mockingbird">https://github.com/blynn/mockingbird</a> </p> </li> </ul> Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-27393152555709556032013-11-13T22:55:00.000-08:002013-11-13T22:59:32.569-08:00Chocolate, Logic Puzzles, and Dancing Links<a name="preamble"></a> <p>Early last year, I visited <a href="https://plus.google.com/108526534018923770979/about?gl=us&hl=en">a cafe that awards chocolate for solving logic puzzles</a>. Naturally, I couldn’t resist free chocolate, and afterwards, just as naturally, I couldn’t resist thinking about programs that solve logic puzzles.</p> <p>I’ve had to write such programs before for homework assignments or to test out frameworks. But oddly, I had never put much effort into it. I loved logic puzzles as a kid, even going so far as to buy a magazine or two that were filled with grids and clues. Why hadn’t I already written a decent tool to help?</p> <p>Better late than never. After a spate of intense coding, I had a program that read clues in a terse text format and used brute force to find the solution. I spent most of my time devising the mini-language for the clues rather than the algorithm, as I figured the bottleneck would be the human entering the clues.</p> <p>My solver worked well enough on a few examples, including the puzzle I solved to get a chocolate. But then I tried my program on <a href="http://en.wikipedia.org/wiki/Zebra_Puzzle">the Zebra Puzzle</a>. I was too impatient to let it finish. After a little thought, it was clear why.</p> <p><strong>On the grid</strong></p> <p>Logic grid puzzles can be abstracted as follows. 
We are given a table with M rows and N columns, and each cell contains a unique symbol. Our goal is to rearrange the symbols within each row except for those in the first row so that given constraints are satisfied. To be clear, symbols must stay in the row they started in, but apart from the first row, they can change places with other symbols in their row.</p> <p>Some examples of constraints:</p> <ul> <li> <p> symbols A and B must be in the same column. </p> </li> <li> <p> symbols A, B, and C must be in distinct columns. </p> </li> <li> <p> symbol A’s column must be exactly one to the left of symbol B’s column. </p> </li> </ul> <p>We fix the first row because of constraints such as the last example: clues often refer to the order of the elements in the first row in the input table. Let us call them <em>order constraints</em>. This inspires the following convenient relabeling. Without loss of generality, let the symbols of the first row be 1, …, N from left to right.</p> <p>My brute force solver generates every possible table and prints the ones that satisfy all given constraints. That means it has to examine up to N!<sup>(M-1)</sup> cases: there are N! permutations of the symbols in each row, and we have M-1 rows to rearrange. For the Zebra Puzzle, this is 120<sup>5</sup>.</p> <p><strong>Got it covered</strong></p> <p>I needed a smarter approach. Since I had already coded <a href="http://benlynn.blogspot.com/2005/12/solving-sudoku.html">a sudoku solver</a>, I chose the same approach, namely, represent a puzzle as an instance of the exact cover problem, then use <a href="http://en.wikipedia.org/wiki/Dancing_Links">the dancing links algorithm</a>.</p> <p>Firstly, we populate a set X with all the symbols in the table. Next, instead of generating every table, generate every possible column. Each such column corresponds to the subset of X consisting of the symbols in the column. 
Generating every possible column means brute force is still present, but to a lesser degree.</p> <p>An exact cover of X corresponds to a collection of columns such that no two columns contain the same symbol, and furthermore, each symbol appears in one of the columns. Thus these columns can be joined together to form a candidate solution to the logic puzzle: we merely order them so that the first row reads 1, …, N.</p> <p>It remains to disallow covers that violate the constraints. For some constraints, we achieve this by omitting certain columns. For example, suppose A and B must be in the same column. When we generate a column that only contains one of A or B, we omit it from the exact cover instance. Similarly, if a constraint requires that A and B lie in distinct columns, we omit subsets that contain both symbols from the exact cover instance.</p> <p><strong>Out of order</strong></p> <p>The above suffices for many puzzles, but falls short for those with order constraints such as "A and B lie in adjacent columns". For this particular constraint, we add N elements to X. Let us call them x[1], …, x[N]. Given a generated column whose first row contains the number n (recall we have relabeled so that the first row of the input table is 1, …, N), if the column contains:</p> <ul> <li> <p> both A and B: eliminate the column from consideration. </p> </li> <li> <p> A and not B: we add x[n] to the column’s corresponding subset. </p> </li> <li> <p> B and not A: we add x[k] to the column’s corresponding subset for all k in 1..N where |n-k| is not 1. </p> </li> </ul> <p>Lastly, we mark x[1] ,…, x[N] as optional, meaning that they need not be covered by our solution. (We could instead augment our collection of subsets with singleton subsets {x[n]} for each n, but with dancing links there’s a faster way.)</p> <p>Thus any exact cover must have A and B in adjacent columns to avoid conflicts in the x[1], …, x[N] elements. 
We can handle other order constraints in a similar fashion.</p> <p>The size of X is the number of symbols, MN, plus N for each order constraint. The number of subsets is the number of possible columns, which is N<sup>M</sup>, because we have M rows, and each can be one of N different symbols. Each subset has size M, one for each row, plus up to N more elements for each order constraint that involves it.</p> <p><strong>That’ll do</strong></p> <p>Though N<sup>M</sup> may be exponential in input size, typical logic grid puzzles are so small that my program solves them in a few milliseconds on my laptop. The bottleneck is now indeed the human entering the clues. [I briefly thought about writing a program to automate this by searching for keywords in each sentence.]</p> <p>I was annoyed the above algorithm is adequate. Firstly, trying entire columns in a single step bears no resemblance to actions performed by human solvers. Secondly, it was too easy: I wish I were forced to think harder about the algorithm! Perhaps I’ll devise a larger puzzle that demands a better approach.</p> <p>Around this time I lost focus because I felt my old code could use a few edits. I got sidetracked rewriting my dancing-links sudoku solver and comparing it against other ways of solving sudoku, and soon after that I moved on.</p> <p>Luckily for my code, I recently felt compelled to clean it up enough for public release. 
It seemed a pity to let it languish in a forgotten corner until rendered useless from bit rot and lack of documentation.</p> <p>The DLX library is now available at a git repository near you:</p> <ul> <li> <p> <a href="https://github.com/blynn/dlx">https://github.com/blynn/dlx</a> </p> </li> <li> <p> <a href="https://code.google.com/p/blynn-dlx/">https://code.google.com/p/blynn-dlx/</a> </p> </li> </ul> Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-57871330027631163832013-11-12T23:24:00.000-08:002013-11-12T23:31:21.594-08:00Lies, damned lies, and frequentist statistics<a name="preamble"></a> <p>Earlier this year <a href="http://benlynn.blogspot.com/2013/02/probability-made-less-uneasy.html">I rekindled an interest in probability theory</a>. In my classes, Bayes' theorem was little more than a footnote, and we drilled frequentist techniques. Browsing a few books led me to question this. In particular, though parts of Jaynes' "Probability Theory: The Logic of Science" sounded like a conspiracy theory at first, I was soon convinced that the author’s militant condemnation of frequentism was justified.</p> <p>Today, I had the pleasure of reading <a href="http://www.nature.com/news/weak-statistical-standards-implicated-in-scientific-irreproducibility-1.14131">a <em>Nature</em> article</a> about <a href="http://www.pnas.org/content/early/2013/10/28/1313476110.full.pdf">a paper by Valen E. Johnson directly comparing Bayesian and frequentist methods in scientific publications</a>, who suggests the latter is responsible for a plague of irreproducible findings. 
I felt vindicated; or rather, I felt I had several more decibels of evidence for the hypothesis that Bayesian methods produce far better results than frequentist methods when compared against the hypothesis that the two methods produce equivalent results!</p> <p><a href="http://telescoper.wordpress.com/2013/11/12/the-curse-of-p-values/">This post explains it well</a>. In short, frequentist methods have led to bad science.</p> <p>An apologist might retort that it’s actually the fault of bad scientists, who are misusing the methods due to insufficient understanding of the theory. There may be some truth here, but I still argue that Bayesian probability should be taught instead. I need only look at my undergraduate probability and statistics textbook. On page 78, I see the 0.05 P-value convention castigated by Johnson, right after recipe-like instructions for computing a P-value. If other textbooks are similar, no wonder scientists are robotically misapplying frequentist procedures and generating garbage.</p> <p>Johnson’s recommended fix of using 0.005 instead of 0.05 is curious. I doubt it has firm theoretical grounding, but perhaps the nature of the data that most scientists collect means that this rule of thumb will usually work well enough. Then again, striving for the arbitrary 0.005 standard may require excessive data: a Bayesian method might yield similar results with less input. I guess it’s an expedient compromise. Those with a poor understanding of statistical inference can still obtain decent results, at the cost of gathering more data than necessary.</p> <p>The above post also mentions <a href="http://arxiv.org/abs/1310.3791">a paper describing how even a correctly applied frequentist technique leads to radically different inferences from a Bayesian one</a>. 
The intriguing discussion within is beyond me, but I’m betting Bayesian is better; or rather, the prior I’d assign to the probability that Bayesian inference will one day be shown to be better is extremely close to one!</p> Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-51002414808380960602013-11-03T23:33:00.000-08:002013-11-03T23:33:06.133-08:00Crit-bit tree micro-optimizations<a name="preamble"></a> <p>Long ago, I implemented <a href="http://en.wikipedia.org/wiki/Radix_tree">crit-bit trees (one of the many names of this data structure)</a> after skimming a few online descriptions. I made obvious choices and put little effort into optimization. It worked well enough for my projects.</p> <p>Earlier this year, I studied <a href="https://github.com/agl/critbit">Adam Langley’s crit-bit tree library</a>, and was inspired to write a new crit-bit tree library from scratch. Micro-optimizations are fun! And surely my rebooted library would be faster thanks to ingenious techniques such as <a href="http://en.wikipedia.org/wiki/Tagged_pointer">tagged pointers</a> and fancy <a href="http://graphics.stanford.edu/~seander/bithacks.html">bit-twiddling</a> (notably a <a href="http://en.wikipedia.org/wiki/SWAR">SWAR hack</a> to find the critical bit).</p> <p>On my machine, the benchmarks show improvement in space <strong>and</strong> time. Iteration is much slower since I opted to forgo linked-list pointers <a href="http://cr.yp.to/critbit.html">as advocated by Bernstein</a>. Without them, we must travel up and down the tree to go from one leaf node to the next.</p> <p>However, an application wishing to visit every element should not do this: it should instead call the <tt>allprefixed</tt> routine with a blank prefix. 
The difference between <tt>allprefixed</tt> and my original library’s <tt>iterate</tt> is small enough that I’m inclined to agree with Bernstein: we’re better off without auxiliary next and previous pointers.</p> <p>When I modified the benchmark program to measure the C++ standard classes, I changed <tt>char*</tt> to <tt>string</tt>, and an array to a vector, which are natural choices for C++ and should not significantly hurt running times. As theory suggests, <tt>map</tt> insertion and lookup are slower, while <tt>unordered_map</tt> is much faster; as long as the drawbacks of the latter are borne in mind, it may be a reasonable choice for certain applications.</p> <p>I was pleased my rewritten library sometimes seemed a touch faster than the library that inspired it ("critbit" in the tables below), especially since my library implements a map and not just a set. The extra optimizations it performs are listed in comments in the C file. My guess is that most of them do almost nothing, and "unrolling" the child pointer lookup is largely responsible for the gains.</p> <p>I’m naming my code <a href="https://github.com/blynn/blt">the BLT library</a> for now. I already used up my first choice in my old "cbt" library. Though immodest, "Ben Lynn Trees" is easy for me to remember. Also, the name coincides with <a href="http://en.wikipedia.org/wiki/BLT">a delicious sandwich</a>.</p> <p>In the following tables, all entries are in seconds, except for the overhead, which is measured in bytes. I used <a href="https://code.google.com/p/gperftools/">tcmalloc</a> to get the best numbers.</p> <div> <table rules="all" width="100%" frame="border" cellspacing="0" cellpadding="4"> <caption><b>Table 1. 
</b>Keys: <tt>/usr/share/dict/words</tt></caption> <thead> <tr> <th align="left" width="16%" valign="top"></th> <th align="left" width="16%" valign="top">BLT</th> <th align="left" width="16%" valign="top">cbt</th> <th align="left" width="16%" valign="top">critbit</th> <th align="left" width="16%" valign="top">map</th> <th align="left" width="16%" valign="top">unordered_map</th> </tr> </thead> <tbody> <tr> <td align="left" width="16%" valign="top"><p>insert</p></td> <td align="left" width="16%" valign="top"><p>0.062939</p></td> <td align="left" width="16%" valign="top"><p>0.085081</p></td> <td align="left" width="16%" valign="top"><p>0.070779</p></td> <td align="left" width="16%" valign="top"><p>0.086181</p></td> <td align="left" width="16%" valign="top"><p>0.050796</p></td> </tr> <tr> <td align="left" width="16%" valign="top"><p>get</p></td> <td align="left" width="16%" valign="top"><p>0.044096</p></td> <td align="left" width="16%" valign="top"><p>0.046323</p></td> <td align="left" width="16%" valign="top"><p>0.053218</p></td> <td align="left" width="16%" valign="top"><p>0.077201</p></td> <td align="left" width="16%" valign="top"><p>0.034249</p></td> </tr> <tr> <td align="left" width="16%" valign="top"><p>allprefixed</p></td> <td align="left" width="16%" valign="top"><p>0.013312</p></td> <td align="left" width="16%" valign="top"><p></p></td> <td align="left" width="16%" valign="top"><p>0.015073</p></td> <td align="left" width="16%" valign="top"><p></p></td> <td align="left" width="16%" valign="top"><p></p></td> </tr> <tr> <td align="left" width="16%" valign="top"><p>iterate</p></td> <td align="left" width="16%" valign="top"><p>0.031755</p></td> <td align="left" width="16%" valign="top"><p>0.006420</p></td> <td align="left" width="16%" valign="top"><p></p></td> <td align="left" width="16%" valign="top"><p>0.006167</p></td> <td align="left" width="16%" valign="top"><p>0.004334</p></td> </tr> <tr> <td align="left" width="16%" valign="top"><p>delete</p></td> <td 
align="left" width="16%" valign="top"><p>0.048093</p></td> <td align="left" width="16%" valign="top"><p>0.050687</p></td> <td align="left" width="16%" valign="top"><p>0.051923</p></td> <td align="left" width="16%" valign="top"><p>0.009572</p></td> <td align="left" width="16%" valign="top"><p>0.011820</p></td> </tr> <tr> <td align="left" width="16%" valign="top"><p>overhead</p></td> <td align="left" width="16%" valign="top"><p>3966824</p></td> <td align="left" width="16%" valign="top"><p>6346992</p></td> <td align="left" width="16%" valign="top"><p></p></td> <td align="left" width="16%" valign="top"><p></p></td> <td align="left" width="16%" valign="top"><p></p></td> </tr> </tbody> </table> </div> <div> <table rules="all" width="100%" frame="border" cellspacing="0" cellpadding="4"> <caption><b>Table 2. </b>Keys: output of <tt>seq 2000000</tt></caption> <thead> <tr> <th align="left" width="16%" valign="top"></th> <th align="left" width="16%" valign="top">BLT</th> <th align="left" width="16%" valign="top">cbt</th> <th align="left" width="16%" valign="top">critbit</th> <th align="left" width="16%" valign="top">map</th> <th align="left" width="16%" valign="top">unordered_map</th> </tr> </thead> <tbody> <tr> <td align="left" width="16%" valign="top"><p>insert</p></td> <td align="left" width="16%" valign="top"><p>2.675797</p></td> <td align="left" width="16%" valign="top"><p>3.048633</p></td> <td align="left" width="16%" valign="top"><p>2.683036</p></td> <td align="left" width="16%" valign="top"><p>3.175450</p></td> <td align="left" width="16%" valign="top"><p>1.457466</p></td> </tr> <tr> <td align="left" width="16%" valign="top"><p>get</p></td> <td align="left" width="16%" valign="top"><p>2.328416</p></td> <td align="left" width="16%" valign="top"><p>2.477317</p></td> <td align="left" width="16%" valign="top"><p>2.510085</p></td> <td align="left" width="16%" valign="top"><p>2.913757</p></td> <td align="left" width="16%" valign="top"><p>0.919519</p></td> </tr> <tr> <td 
align="left" width="16%" valign="top"><p>allprefixed</p></td> <td align="left" width="16%" valign="top"><p>0.296389</p></td> <td align="left" width="16%" valign="top"><p></p></td> <td align="left" width="16%" valign="top"><p>0.303059</p></td> <td align="left" width="16%" valign="top"><p></p></td> <td align="left" width="16%" valign="top"><p></p></td> </tr> <tr> <td align="left" width="16%" valign="top"><p>iterate</p></td> <td align="left" width="16%" valign="top"><p>0.917041</p></td> <td align="left" width="16%" valign="top"><p>0.171143</p></td> <td align="left" width="16%" valign="top"><p></p></td> <td align="left" width="16%" valign="top"><p>0.193081</p></td> <td align="left" width="16%" valign="top"><p>0.131707</p></td> </tr> <tr> <td align="left" width="16%" valign="top"><p>delete</p></td> <td align="left" width="16%" valign="top"><p>2.352568</p></td> <td align="left" width="16%" valign="top"><p>2.522654</p></td> <td align="left" width="16%" valign="top"><p>2.250305</p></td> <td align="left" width="16%" valign="top"><p>0.299835</p></td> <td align="left" width="16%" valign="top"><p>0.316262</p></td> </tr> <tr> <td align="left" width="16%" valign="top"><p>overhead</p></td> <td align="left" width="16%" valign="top"><p>79999984</p></td> <td align="left" width="16%" valign="top"><p>128000048</p></td> <td align="left" width="16%" valign="top"><p></p></td> <td align="left" width="16%" valign="top"><p></p></td> <td align="left" width="16%" valign="top"><p></p></td> </tr> </tbody> </table> </div> <p>Try BLT now!</p> <ul> <li> <p> <a href="https://github.com/blynn/blt">https://github.com/blynn/blt</a> </p> </li> </ul> Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-8145550492928780402013-07-21T21:16:00.000-07:002013-07-21T21:16:45.059-07:00Inheritance Quiz<a name="preamble"></a> <ol type="1"> <li> <p> This code uses implementation inheritance. Where’s the bug? 
</p> <div> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>class Point {<br /> int x, y;<br /> Point(int x, int y) {<br /> this.x = x;<br /> this.y = y;<br /> display();<br /> }<br /> void display() {<br /> System.out.println(x + " " + y);<br /> }<br />}<br /><br />class CPoint extends Point {<br /> Color c;<br /> CPoint(int x, int y, Color c) {<br /> super(x, y);<br /> this.c = c;<br /> }<br /> void display() {<br /> System.out.println(x + " " + y + " " + c.name());<br /> }<br />}</pre></td></tr></table> </div> </li> <li> <p> This code uses implementation inheritance. Where’s the bug? </p> <div> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>/** A version of Hashtable that lets you do<br /> * table.put("dog", "canine");, and then have<br /> * table.get("dogs") return "canine". **/<br /><br />public class HashtableWithPlurals extends Hashtable {<br /><br /> /** Make the table map both key and key + "s" to value. **/<br /> public Object put(Object key, Object value) {<br /> super.put(key + "s", value);<br /> return super.put(key, value);<br /> }<br />}</pre></td></tr></table> <p><strong>Hint</strong>: After <tt>put("dog", foo)</tt>, although "dogs" is in the table as expected, sometimes "dogss" is also in the table. 
Why?</p> </div> </li> <li> <p> Consider this object buffer in Java: </p> <div> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>public class Buffer {<br /> protected Object[] buf;<br /> protected int MAX;<br /> protected int current = 0;<br /><br /> Buffer(int max) {<br /> MAX = max;<br /> buf = new Object[MAX];<br /> }<br /> public synchronized Object get()<br /> throws Exception {<br /> while (current<=0) { wait(); }<br /> current--;<br /> Object ret = buf[current];<br /> notifyAll();<br /> return ret;<br /> }<br /> public synchronized void put(Object v)<br /> throws Exception {<br /> while (current>=MAX) { wait(); }<br /> buf[current] = v;<br /> current++;<br /> notifyAll();<br /> }<br />}</pre></td></tr></table> <p>Use inheritance to extend this class to support the <tt>gget()</tt> method, which is identical to <tt>get()</tt> except it cannot be executed immediately after a <tt>get()</tt>. In other words, it blocks until a <tt>put()</tt> or a <tt>gget()</tt> finishes.</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>public class HistoryBuffer extends Buffer {<br /> // What goes here?<br />}</pre></td></tr></table> </div> </li> </ol> <hr> <h2><a name="_answers"></a>Answers</h2> <ol type="1"> <li> <p> See "<a href="http://www.cs.cornell.edu/andru/papers/masked-types.html">Masked types for sound object initialization</a>" for the answer. Did you know initialization of superclasses in Java and C++ is unsound? 
</p> <div> <p>In Go, which lacks inheritance, we might write:</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>type Point struct {<br /> x, y int<br />}<br /><br />func NewPoint(x, y int) *Point {<br /> p := &Point{x, y}<br /> p.Display()<br /> return p<br />}<br /><br />func (p Point) Display() {<br /> println(p.x, p.y)<br />}<br /><br />type CPoint struct {<br /> x, y int<br /> c Color<br />}<br /><br />func NewCPoint(x, y int, c Color) *CPoint {<br /> p := &CPoint{x, y, c}<br /> p.Display()<br /> return p<br />}<br /><br />func (p CPoint) Display() {<br /> println(p.x, p.y, p.c.Name())<br />}</pre></td></tr></table> <p>We must write factory methods instead of constructors. However, the language makes it easy to avoid the bug.</p> </div> </li> <li> <p> This question was taken from "<a href="http://norvig.com/java-iaq.html">The Java IAQ</a>", which also explains the answer. </p> <div> <p>In Go, which lacks inheritance, we might write:</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>type PluralsTable struct {<br /> tab *Table<br />}<br /><br />func (t *PluralsTable) Put(key string, value interface{}) interface{} {<br /> t.tab.Put(key + "s", value)<br /> return t.tab.Put(key, value)<br />}<br /><br />func (t *PluralsTable) Get(key string) interface{} {<br /> return t.tab.Get(key)<br />}<br /><br />...</pre></td></tr></table> <p>We must define a wrapper for each method of Table that we want PluralsTable to support. Additionally, if we have code that should work on PluralsTable and Table, we must define an <tt>interface</tt> with Put and Get. However, the language makes it easy to avoid the surprise recursion bug.</p> <p>We also avoid other kinds of bugs. 
For example, suppose we add PutFoo() to Table, which inserts "foo" into the table in a fast but strange way: the hash of "foo" is precomputed, so that the entry can be directly inserted into the underlying array.</p> <p>With inheritance, calling PutFoo() on a PluralsTable will silently succeed, but neglect to insert "foos", a bug that might only be noticed long after a release. Without inheritance, the program will fail to compile when code calls PutFoo() on a PluralsTable, at which point a human can intervene and supply the PluralsTable edition of PutFoo().</p> </div> </li> <li> <p> Did you manage it without rewriting the entire class? If so, congratulations on solving <a href="http://eprints.soton.ac.uk/262297/1/anomalySurvey.pdf">the inheritance anomaly</a>! Otherwise, you probably wrote something akin to the solution given in that paper: </p> <div> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>public class HistoryBuffer extends Buffer {<br /> boolean afterGet = false;<br /><br /> public HistoryBuffer(int max) { super(max); }<br /><br /> public synchronized Object gget()<br /> throws Exception {<br /> while ((current<=0)||(afterGet)) {<br /> wait();<br /> }<br /> afterGet = false;<br /> return super.get();<br /> }<br /> public synchronized Object get()<br /> throws Exception {<br /> Object o = super.get();<br /> afterGet = true;<br /> return o;<br /> }<br /> public synchronized void put(Object v)<br /> throws Exception {<br /> super.put(v);<br /> afterGet = false;<br /> }<br />}</pre></td></tr></table> <p>Implementation inheritance mixes poorly with concurrent programming.</p> <p>In Go, this question takes a different character because concurrent object buffers are built-in types:</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>type Buffer chan interface{}<br /><br />func NewBuffer(max int) Buffer {<br /> return make(chan interface{}, max)<br />}<br /><br />func (buf Buffer) Get() interface{} 
{<br /> return <-buf<br />}<br /><br />func (buf Buffer) Put(v interface{}) {<br /> buf <- v<br />}</pre></td></tr></table> <p>One solution to the HistoryBuffer problem is:</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>type HistoryBuffer struct {<br /> ch chan interface{}<br /> get, gget, put, done chan bool<br />}<br /><br />func NewHistoryBuffer(max int) *HistoryBuffer {<br /> buf := &HistoryBuffer{<br /> make(chan interface{}, max),<br /> make(chan bool), make(chan bool), make(chan bool), make(chan bool),<br /> }<br /> go func() { // Synchronization logic.<br /> maybe_gget := buf.gget<br /> for {<br /> select {<br /> case <-maybe_gget:<br /> case <-buf.put:<br /> maybe_gget = buf.gget<br /> case <-buf.get:<br /> maybe_gget = nil<br /> }<br /> <-buf.done<br /> }<br /> }()<br /> return buf<br />}<br /><br />func (buf *HistoryBuffer) Get() interface{} {<br /> buf.get <- true<br /> v := <-buf.ch<br /> buf.done <- true<br /> return v<br />}<br /><br />func (buf *HistoryBuffer) Put(v interface{}) {<br /> buf.put <- true<br /> buf.ch <- v<br /> buf.done <- true<br />}<br /><br />func (buf *HistoryBuffer) GGet() interface{} {<br /> buf.gget <- true<br /> v := <-buf.ch<br /> buf.done <- true<br /> return v<br />}</pre></td></tr></table> <p>Most of the synchronization logic resides in the anonymous function. Outside, we only have channel writes at the beginning and end of each method, comparable to Java’s <tt>synchronized</tt> keyword. Thus behaviour is decoupled from synchronization.</p> <p>We can effortlessly and independently change the synchronization logic. Suppose now GGet() can only be called after 3 Put() calls and 2 Get() calls have completed in some order since the previous GGet(), or, if there are no previous GGet() calls, since the program started. 
We simply change the anonymous function:</p> <table border="0" bgcolor="#e8e8e8" width="100%" cellpadding="10"><tr><td> <pre>go func() {<br /> var maybe_gget chan bool<br /> p, g := 0, 0<br /> for {<br /> select {<br /> case <-maybe_gget:<br /> p, g = 0, 0<br /> maybe_gget = nil<br /> case <-buf.put:<br /> p++<br /> case <-buf.get:<br /> g++<br /> }<br /> if p >= 3 && g >= 2 {<br /> maybe_gget = buf.gget<br /> }<br /> <-buf.done<br /> }<br />}()</pre></td></tr></table> <p>The <a href="http://golang.org/doc/effective_go.html">official Go documentation</a> suggests channels of ints instead of bools, which leads to slightly smaller code (mainly because we can replace <tt>true</tt> with <tt>1</tt>), and probably results in the same compiled binaries.</p> </div> </li> </ol> Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com1tag:blogger.com,1999:blog-4222267598459829544.post-43089638311811701692013-03-14T00:04:00.000-07:002013-03-14T00:04:45.571-07:00MathJax<a name="preamble"></a> <p>About a decade ago, I began putting <a href="http://crypto.stanford.edu/pbc/notes/">my notes</a> on my homepage for the reasons cloud computing proponents love to spout (though I did it without uttering any buzzwords).</p> <p>But I hit a snag. How do I put equations on the web? Among the many awful workarounds, I picked the one which I thought was noblest: MathML. My pages would be static content, operable without JavaScript. Text is far slimmer than images, and far more agreeable to things like searching. As for PDF? Over my dead <body> element!</p> <p>I was optimistic back then. Mozilla supported MathML provided you also downloaded a font or two, and despite the crushing dominance of Internet Explorer, I felt that righteous Free Software would ultimately triumph. One day, I hoped, a typical browser would render my site perfectly, out of the box.</p> <p>Turns out my predictions were half right. The web broke free of Internet Explorer’s chokehold.
Now, more often than not, we use open source browsers. And one of them, Firefox, supports MathML out of the box.</p> <p>However, my mathematics notes still render incorrectly on most browsers. Popular search engines appear to shun them, possibly because I zealously followed the arcane XHTML 1.1 plus MathML guidelines. And everything supports JavaScript.</p> <p>Maybe they’re all going to support MathML real soon, but ten years is too long for me. I switched to <a href="http://www.mathjax.org">MathJax</a>, a clever JavaScript library that figures out what your system can do, then renders the equations using an appropriate technique. It just works.</p> Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-80518521699425683852013-02-18T01:32:00.000-08:002018-11-16T13:09:17.688-08:00Probability Made Less Uneasy<a name="preamble"></a> <p>I’ve been leafing through a few books on probability, a subject which I’ve mostly avoided since undergrad. I originally thought I’d just refresh what I had already learned, but to my surprise I was led to reconsider fundamental beliefs. What follows is my journey told via book reviews.</p> <hr> <h2><a name="_em_hexaflexagons_and_other_mathematical_diversions_em_by_martin_gardner"></a><em>Hexaflexagons and Other Mathematical Diversions</em> by Martin Gardner</h2> <p>As a kid, I devoured this book and the others in the series, which I later learned were collections of <em>Mathematical Games</em> columns from <em>Scientific American</em> magazine. I didn’t always understand the material, and the puzzles were often too difficult, but Gardner’s writing skill kept me reading on.</p> <p>Among the many fascinating chapters was “Probability Paradoxes”. Gardner’s ability to communicate was so strong that after many years I still remember much of the content. In particular, he asked:</p> <blockquote> <p>Mr. Smith says, "I have two children and at least one of them is a boy."
What is the probability that the other child is a boy?</p> <p align="right"> </p> </blockquote> <p>and his explanation of 1/3 being the correct answer not only stuck in my mind, but shaped my early views on probability. For the details, see this <a href="http://www.newscientist.com/article/dn18950-magic-numbers-a-meeting-of-mathemagical-tricksters.html"><em>New Scientist</em> article on a Martin Gardner convention</a>.</p> <p>Only a few years ago, after a debate with a friend, did I reconsider the reasoning. It turns out <a href="http://www.sciencenews.org/view/generic/id/60598/description/When_intuition_and_math_probably_look_wrong">Gardner’s statement of the problem is ambiguous</a>. This revelation sparked a desire to hit the books and brush up on probability one day.</p> <hr> <h2><a name="_em_a_primer_of_statistics_em_by_m_c_phipps_and_m_p_quine"></a><em>A Primer of Statistics</em> by M.C. Phipps and M.P. Quine</h2> <p>The second edition of this slim volume was the textbook for my first course on probability. I used it to cram for exams. For this purpose, it was good: I got decent grades.</p> <p>Sadly, it wasn’t as good in other respects. I acquired a distaste for the subject. Why did Probability and Statistics seem like a bag of ad hoc tricks, with few explanations given? Do I have poor intuition for it? Or is it glorified guesswork that seems to work well enough with real-life data? Whatever the reason, I decided that for the rest of my degree I’d steer towards the Pure Mathematics offerings.</p> <hr> <h2><a name="_em_the_signal_and_the_noise_why_so_many_predicitons_fail_8201_8212_8201_but_some_don_8217_t_em_by_nate_silver"></a><em>The Signal and the Noise: Why So Many Predictions Fail — but Some Don’t</em> by Nate Silver</h2> <p>My renewed interest in probability was also sparked by the United States presidential election of 2012, or rather, its aftermath.
Many had predicted its outcome but few were accurate.</p> <p>It was only then that I read about Nate Silver, who turned out to have been famous for his prowess with predictions for quite some time. Eager to learn more, I thumbed through his bestseller.</p> <p>Though the book is necessarily light on theory, the equations that do appear are correct and lucidly explained. Also, the pages are packed with interesting data sets and anecdotes. General pronouncements are often backed up with concrete tables and graphs, though, as Silver readily admits, some qualities are difficult to quantify, resulting in potentially dubious but novel yardsticks (such as measuring scientific progress by average research and development expenditure per patent).</p> <p>But most of all, I was intrigued by the tale of an ongoing conflict that I never knew existed, with frequentists on one side and Bayesians on the other. They never told me this in school!</p> <p>I soon found out why: Silver states that Fisher may be almost single-handedly to blame for the dominance of frequentism, the ideology foisted upon me when I was just out of high school. Sure enough, I went back and confirmed that Phipps and Quine listed Fisher in the bibliography.</p> <hr> <h2><a name="_em_against_the_gods_the_remarkable_story_of_risk_em_by_peter_l_bernstein"></a><em>Against the Gods: The Remarkable Story of Risk</em> by Peter L. Bernstein</h2> <p>My dad told me about this book. Technical details are scant as it is also aimed at the general public. But in contrast to Silver’s work, what little appears is laughably erroneous. In some sections, I felt the author was trying to trick himself into believing fallacies.</p> <p>The misinformation might be mostly harmless.
Those with weak mathematical ability are going to skip the equations out of fear, and those with strong mathematical ability are probably also going to skip them because they already know them.</p> <p>But conceivably this book could be a gifted reader’s first introduction to probability, and it’d be a shame to start off on the wrong foot. As a sort of public service, I’ll explain some of the gaffes.</p> <p><strong>Exercises</strong></p> <p>Chapter 6 contains an example expected value calculation involving a coin flip.</p> <blockquote> <p>We multiply 50% by one for heads and do the same for the tails, take the sum---100%---and divide by two. The expected value of betting on a coin toss is 50%. You can expect either heads or tails, with equal likelihood.</p> <p align="right"> </p> </blockquote> <p>Why is this wrong? How can we fix it?</p> <p>The next example involves rolling two dice.</p> <blockquote> <p>If we add the 11 numbers that might come up…the total works out to 77. The expected value of rolling two dice is 77/11, or exactly 7.</p> <p align="right"> </p> </blockquote> <p>Why is this wrong? How can we fix it?</p> <p><strong>What’s the difference?</strong></p> <p>Bernstein and Silver offer competing reasons why modern civilization differs from the past. Bernstein singles out our relatively newfound ability to quantify risk, and also suggests that key intermediate steps could only have occurred at certain points in history due to the overall mood of the era.</p> <p>In contrast, Silver seems to place most importance on the printing press. In an early chapter, Silver suggests that after some teething trouble (lasting 330 years), the printing press paved the way for modern society. Apart from distribution of knowledge, perhaps more importantly the printing press helped with the preservation of knowledge; previously, writing would often be lost before it could be copied.</p> <p>I’m inclined to side with Silver, partly because of Bernstein’s basic technical mistakes. 
After observing how fast and loose Bernstein was playing with mathematics, I’m tempted to believe some of his statements are gut feelings.</p> <p>There is another glaring difference. Bernstein’s book lacks any mention of the frequentist-Bayesian war. Fisher’s name is conspicuously absent.</p> <p><strong>For or Against?</strong></p> <p><em>Against the Gods</em> is riveting. My favourite feature is the backstories of famous scholars. For some of them, before reading the book, the only thing I knew about them was their names, and I would have known even less if their names weren’t attached to their most famous discoveries (or at least, discoveries vaguely connected with them). Learning about their lives, motivations, temperaments, beliefs, and so on was illuminating. An intellectually superior form of gossip, I suppose.</p> <p>However, the elementary mathematical mistakes ultimately cast a cloud of suspicion over the book. How reliable are the author’s assertions in general? Although I heartily recommend <em>Against the Gods</em>, I also recommend thorough fact-checking before using it as a reference.</p> <p>So a tip for bestseller authors: if a section is technical, then ask an expert, be an expert, or cut it out. Too many howlers make readers like me wary of the whole, no matter how well-written and accurate the non-technical parts are.</p> <p><strong>Answers to exercises</strong></p> <p>As Bernstein himself implies, an expected value is a weighted average. We need weights, <em>and</em> we need numbers to sum. It takes two to tango; the expected value dance can only proceed if probabilities are accompanied by values.</p> <p>One example neglects the values, and the other neglects the probabilities. The author only computes the sum of the weights for the coin flip, and the sum of the values for the dice roll.
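</p>
<p>For contrast, the correct recipe is short enough to write down in code. Here is a minimal Go sketch of a weighted average; the fair and loaded two-dice figures are my own illustration, not from the book:</p>

```go
package main

import "fmt"

// expected returns the weighted average of values, where probs[i] is the
// probability of observing values[i]. The probabilities must sum to 1.
func expected(values, probs []float64) float64 {
	sum := 0.0
	for i, v := range values {
		sum += probs[i] * v
	}
	return sum
}

func main() {
	// Sum of two fair dice: outcomes 2..12 with the usual triangular weights
	// (1/36, 2/36, ..., 6/36, ..., 1/36).
	values := []float64{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
	fair := []float64{1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1}
	for i := range fair {
		fair[i] /= 36
	}
	fmt.Printf("fair: %.2f\n", expected(values, fair))

	// Two loaded dice that almost always show 6: all weight sits on outcome 12.
	loaded := []float64{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1}
	fmt.Printf("loaded: %.2f\n", expected(values, loaded))
}
```

<p>The fair dice give 7.00 while the loaded pair gives 12.00: same values, very different weights, so the weights cannot be ignored.</p>
<p>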
In both cases the author divides by the number of outcomes, which might be considered another error: we already divided by the number of outcomes to compute the weights (probabilities) in the first place.</p> <p>Why are these blunders amusing? For the coin example, let’s ignore that the expected value is confused with a probability. Instead of a coin, consider winning the lottery. The probability of winning the lottery plus the probability of not winning the lottery sums to 100%. Dividing this by the number of outcomes, i.e. 2, yields 50%, so apparently we win or lose the lottery with equal likelihood! It’s almost like saying “either it happens or it doesn’t happen, so the chances it happens is 50%”.</p> <p>For the dice example, imagine rolling 2 loaded dice, both of which almost always show 6. The expected value should be close to 12, but because the probabilities are completely ignored, the author’s procedure leads to the same expected value of 7. Surely your calculation should change if the dice are loaded?</p> <p>How do we fix these problems? For the dice example, the author supplies the correct method in the very next paragraph. At last, both the probabilities and values are taken into account. Unfortunately, the author then concludes:</p> <blockquote> <p>The expected value…is exactly 7, confirming our calculation of 77/11. Now we can see why a roll of 7 plays such a critical role in the game of craps.</p> <p align="right"> </p> </blockquote> <p>This should have never been written. The first sentence suggests both methods for computing the expected value are valid, when of course it just so happens the wrong method leads to the right answer.</p> <p>The second sentence is difficult to interpret. Perhaps uncharitably, I’m guessing the sentence is an upgraded version of: “Look! Here’s a 7! Didn’t we see a 7 earlier?” What would have been written if we rolled a single die? 
The expected value is 3.5, but a roll of 3.5 obviously has no role in any game we play with one die.</p> <p>As for fixing the coin example: computing an expected value requires us to attach a numerical value to each outcome. One does not simply plow ahead with “heads” versus “tails”. We need numbers; any numbers. We could assign 42 to heads, and 1001 to tails; here, the expected value of a fair coin toss would be 50% of 42 plus 50% of 1001, which is 521.5. Typically we pick values relevant to the problem at hand: for instance, in a game where we earn a dollar for flipping heads, and lose a dollar for tails, we’d assign the values 1 and -1 (here, our expected winnings would be 0).</p> <p>[It may be possible to reinterpret the coin example as assigning the value 1 to both heads and tails. But if this were done, the expected value should also be 1, not “50%”. Furthermore, we learn nothing if the outcomes are indistinguishable.]</p> <hr> <h2><a name="_em_probability_theory_the_logic_of_science_em_by_e_t_jaynes"></a><em>Probability Theory: The Logic of Science</em> by E. T. Jaynes</h2> <p>If only Jaynes' book had been my introduction to probability. Like a twist ending in a movie, reading it was a thought-provoking eye-opening earth-shattering experience that compelled me to re-evaluate what I thought I knew.</p> <p>Whereas Silver presents whimsical examples that demonstrate the Bayesian approach, Jaynes forcefully argues for its theoretical soundness. From a few simple intuitive “desiderata” (too ill-defined to be axioms), Jaynes shows step-by-step how they imply more familiar probability axioms, and why the Bayesian approach is the natural choice. And all this happens within <a href="http://bayes.wustl.edu/etj/prob/book.pdf">the first 3 chapters, which are free online</a>.</p> <p>I had been uneasy about probability because I thought it was a collection of mysterious hacks, perhaps because it had to deal with the real world. 
I was flabbergasted to learn probability could be put on the same footing as formal logic. All those hacks can be justified after all. Probability is not just intuition and duct tape: it can be as solid as any branch of mathematics.</p> <p>Since there still exist competing philosophies of probability, presumably others find fault with Jaynes' arguments. I’m still working through it, but I’m convinced for now. If there’s another twist in this story, I’ll need another great book to show it to me.</p> <p>Washington University in St. Louis maintains <a href="http://bayes.wustl.edu/">a page dedicated to Jaynes</a>. It’s a shame he died before he finished writing. The remaining holes have been papered over with exercises, which explains their depth and difficulty.</p> <p>It’s also a shame Jaynes left Stanford University many years ago. Had he stayed, with luck I would have discovered his work earlier, or even have met him. <a href="http://bayes.wustl.edu/etj/articles/backward.look.pdf"><em>A backward look to the future</em></a> describes his reasons for departure.</p> <p>In short, Jaynes felt the “publish or perish” culture of academia was harmful and was taking over Stanford. I can’t tell if Jaynes was right because by the time I got into the game, this culture seemed universally well-established. I had no idea an alternative ever existed.</p> Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com2tag:blogger.com,1999:blog-4222267598459829544.post-88931346649571590632012-09-16T15:48:00.000-07:002012-09-16T15:48:22.738-07:00Programming Dominion<a name="preamble"></a> <p>I recently learned to play <a href="http://en.wikipedia.org/wiki/Dominion_(card_game)">Dominion</a>, a game that spawned a genre known as deck-building card games. I’m a terrible player. While suffering defeats at the hands of a simple AI, I realized I might have more fun writing a Dominion-playing program.</p> <p>Implementing just the basic rules is a boring exercise. 
Luckily, Dominion is a self-modifying game. For example, each turn, you’re supposed to start with one Action and 5 cards in your hand, but there are ways of increasing your Action count, or changing the number of cards in your hand.</p> <p>Moreover, rule modifications interact with one another, further increasing complexity. For example, playing Witch causes other players to gain a Curse card, but not if the supply of Curse cards is exhausted, or a player is holding a Moat. Or take Throne Room, which plays another Action card twice. How can we design software to handle so many special cases?</p> <p>Of course, sufficient spaghetti can get anything working. But we should try to minimize mess; ideally the logic for each card should be as isolated as possible. It’d be awful if, say, Throne Room required us to bury code somewhere in the Action-playing routine so it runs twice instead of once.</p> <p><strong>Dominion in Go</strong></p> <p>I’m reasonably pleased with my first attempt. For the simplest cards, the logic is completely contained in a string, in a tiny domain-specific language:</p> <pre>Village,3,Action,+C1,+A2<br />Woodcutter,3,Action,+B1,$2</pre><p>Less trivial cards require a bit more:</p> <pre>case "Feast":<br /> add(func(game *Game) {<br /> p := game.NowPlaying()<br /> game.trash = append(game.trash, p.played[len(p.played)-1])<br /> p.played = p.played[:len(p.played)-1]<br /> pickGain(game, 5)<br /> })</pre><p>And that’s it! To add a card, just one string, and maybe one block of code. As time passed, it became easier to add new cards. For some cards, it was more like data entry than programming.</p> <p>Moat is an exception. It is the only Reaction card in the Base set, so rather than figure out a clean way to implement it, I sprinkle ad hoc code here and there to get it working. If I were to add more Reaction cards, I’d factor out the common parts. There’s no reason to do so pre-emptively.
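</p>
<p>As an aside, to give a flavour of the little DSL above, here is a hypothetical Go sketch of how such card strings might be decoded. The names and structure are my own guesses, not the repo’s actual parser; I read the codes as +C for cards, +A for actions, +B for buys, and $ for coins:</p>

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Card is a hypothetical decoded form of one card string; the real repo's
// representation may differ.
type Card struct {
	Name                        string
	Cost                        int
	Kind                        string
	Cards, Actions, Buys, Coins int
}

// parseCard decodes a line like "Village,3,Action,+C1,+A2".
func parseCard(s string) (Card, error) {
	f := strings.Split(s, ",")
	if len(f) < 3 {
		return Card{}, fmt.Errorf("want at least 3 fields: %q", s)
	}
	cost, err := strconv.Atoi(f[1])
	if err != nil {
		return Card{}, err
	}
	c := Card{Name: f[0], Cost: cost, Kind: f[2]}
	for _, eff := range f[3:] {
		var dst *int // field the effect modifies
		var num string
		switch {
		case strings.HasPrefix(eff, "+C"):
			dst, num = &c.Cards, eff[2:]
		case strings.HasPrefix(eff, "+A"):
			dst, num = &c.Actions, eff[2:]
		case strings.HasPrefix(eff, "+B"):
			dst, num = &c.Buys, eff[2:]
		case strings.HasPrefix(eff, "$"):
			dst, num = &c.Coins, eff[1:]
		default:
			return Card{}, fmt.Errorf("unknown effect %q in %q", eff, s)
		}
		if *dst, err = strconv.Atoi(num); err != nil {
			return Card{}, err
		}
	}
	return c, nil
}

func main() {
	for _, line := range []string{"Village,3,Action,+C1,+A2", "Woodcutter,3,Action,+B1,$2"} {
		c, err := parseCard(line)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%+v\n", c)
	}
}
```

<p>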
In fact, that’s what happened with other cards: I would only refactor once there was duplicate code to eliminate.</p> <p>Intrepid readers can browse my git repo: <a href="https://github.com/blynn/gominion.git">https://github.com/blynn/gominion.git</a></p> <p>But beware. It’s all in one untidy monolithic file, the UI is horrible, and the AI is stupid, though it still beats me when I get too greedy with Action cards! The game state is shared by all players. If network play were added, to prevent cheating, information would need to be more tightly controlled.</p> <p>I have no plans to work much more on this, as many mature implementations already exist, and Rio Grande Games plans to release an official online version soon. All the same, I highly recommend learning to play Dominion, and then trying to program it. Both are enlightening experiences.</p> Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-58285654033401757132012-08-26T21:01:00.000-07:002012-08-26T21:01:47.478-07:00Smashing the non-executable stack for fun and profit<a name="preamble"></a> <p>In 1996, Elias Levy ("Aleph One") published "Smashing The Stack For Fun And Profit" in Phrack magazine. The article showed how to overflow a buffer to launch a shell.</p> <p>I’m almost ashamed I never took a closer look for over a decade. My background would suggest I’d be one of the early adopters. As a kid, I loved messing with assembly language and poking around the system. I collected computer viruses. I bypassed copy protection systems. I knew how to make free phone calls. In grad school, my advisor and my colleagues taught <a href="http://crypto.stanford.edu/cs155/">a computer security class</a>, where rooting a system by smashing the stack was a homework assignment.</p> <p>With pride, and relief, I can now announce that at long last, in 2012, I have exploited a buffer overflow. 
Moreover, I have written a truly marvelous step-by-step guide to this, which this post is too narrow to contain. (I’m afraid of overflowing it.) I took notes because I encountered difficulties with other tutorials:</p> <ul> <li> <p> 32-bit systems are often assumed. My system is 64-bit. </p> </li> <li> <p> Various countermeasures are now enabled on stock installs. </p> </li> <li> <p> I wanted to try a newer variant of the attack known as return-oriented programming, which defeats one of the countermeasures. </p> </li> </ul> <p>Luckily my website has ample room. <a href="http://cs.stanford.edu/~blynn/rop/">Read now</a>, and get a bonus <a href="http://cs.stanford.edu/~blynn/rop/rop.sh">shell script</a> that demonstrates the attack!</p> Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-53481887886931948612012-08-08T12:12:00.000-07:002012-08-08T12:12:37.396-07:00Isn't Algebra Necessary?<a name="preamble"></a> <p>A recent <a href="http://www.nytimes.com/2012/07/29/opinion/sunday/is-algebra-necessary.html">New York Times article ponders if we should downgrade mathematics taught to high school and college students</a>, and in particular, cut basic algebra.</p> <p>Seriously? A horizontal line may represent an unknown word in those fill-in-the-blank primary school comprehension tests ("The dog’s name is <em>__</em>."), but a letter should never represent an unknown number lest it cause undue mental stress?</p> <p>Among my first thoughts was that the article was a professional troll posting. After all, <a href="http://www.nytimes.com/2012/07/27/business/media/the-new-york-times-co-posts-a-loss.html">The New York Times is sadly going through a rough patch</a>, and I sympathize if they must occasionally stoop lower to catch some extra cash. (If it is a troll posting, hats off! 
You got me.)</p> <p>But the truth is probably mundane; it seems the author genuinely believes that algebra should be dropped.</p> <p>On the one hand, this benefits me. If the article is taken seriously, and algebra is withheld from the masses, then those of us who know it possess formidable advantages. (The conspiracy theorist in me wonders if the author actually finds elementary algebra, well, elementary, and the true intent is to get ahead by encouraging everyone else to dumb down.)</p> <p>On the other hand, the piece smacks of ignorance-is-strength propaganda, and thus is worth smacking down.</p> <p><strong>Inflation</strong></p> <p>The article suggests that, instead of algebra, classes should perhaps focus on how the Consumer Price Index is computed. I agree studying this is important: for example, I feel more attention should be drawn to the 1996 recommendations of the Boskin commission. If the Fed did indeed repeat <a href="http://www.economist.com/node/387789">the mistakes of the 1970s</a>, then I should bump up the official US inflation rate when analyzing my finances. However, this stuff belongs to disciplines outside mathematics.</p> <p>More importantly, what use is the CPI without algebra? Take a simple example: say I owe you $1000, and the inflation rate is 5%. If all you care about is keeping up with inflation, is it fair if I pay you back $120 annually for 10 years? If not, what is the right amount?</p> <p>Without algebra, you might be able to figure that $1000 today is the same as 1000×(1.05)<sup>10</sup> = $1628.89 in 10 years. But how are you going to figure out that the yearly payment should be 0.05×1628.9/(1.05<sup>10</sup> - 1)? The easiest way to arrive here is to temporarily treat 1.05 as an abstract symbol. In other words, elementary algebra.
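</p>
<p>To check the arithmetic, here is a small Go sketch of the same annuity formula, with payments assumed to fall at the end of each year (my own illustration, not from the article):</p>

```go
package main

import (
	"fmt"
	"math"
)

// payment returns the level annual payment that repays principal over years
// at the given interest rate, with payments at the end of each year.
func payment(principal, rate float64, years int) float64 {
	growth := math.Pow(1+rate, float64(years))
	// The future value of the debt must equal the future value of the
	// payment stream: principal*growth = p*(growth-1)/rate. Solve for p.
	return rate * principal * growth / (growth - 1)
}

func main() {
	fmt.Printf("%.2f\n", payment(1000, 0.05, 10)) // about 129.50
}
```

<p>Running it prints 129.50, so the $120 offer above short-changes you by nearly $10 a year.</p>
<p>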
One does need to play this ballgame for personal finance after all.</p> <p>You might counter that an amortized loan calculator can work out the answer for you; there’s no need to understand how it works, right?</p> <p><strong>Ignorance begets fraud</strong></p> <p>In the above calculation, do I make my first payment today, or a year from now? Don’t worry, I’ll figure it out for you. Or perhaps I’ll claim you’re using the wrong mode on the calculator and helpfully retrieve the "right" formula for you.</p> <p>Maybe you’d avoid these shenanigans by entrusting an accountant to oversee deals like this. Okay, but what if it’s not a loan? Say you’re making a policy recommendation and I’m a disingenuous lobbyist: can you tell if I’m fudging my figures?</p> <p>I heard a story about Reagan’s SDI program. Scientists estimated a space laser required 10<sup>20</sup> units of energy, and current technology could generate 10<sup>10</sup> units. They got funding by saying they were halfway there.</p> <p>I hope this tale is apocryphal. Nevertheless, one can gouge the mathematically challenged just as unscrupulous salesmen rip off unwitting buyers. Unfortunately, with finance and government policy, damage caused by bad decisions can be far worse and longer lasting.</p> <p><strong>Fermat’s Last … Dilemma?</strong></p> <p>One bright spot in the article was the mention of "the history and philosophy of [mathematics], as well as its applications in early cultures". While not required to solve problems, knowing the background to famous discoveries makes a subject more fun.</p> <p>It is inspiring that within a few short school years we enjoy the fruits of thousands of years of labour. Perhaps a student struggling with negative numbers would feel better knowing that it took many generations for them to be socially acceptable.
For instance, the Babylonians were forced to divide the quadratic equation into different cases because they rejected negative numbers on philosophical grounds.</p> <p>But at the same time, we see a mention of "Fermat’s dilemma", which charitably is a creative renaming of "Fermat’s Last Theorem" (though more likely there was some confusion with the "Prisoner’s Dilemma" from game theory). The author chose this example poorly, because the history of Fermat’s Last Theorem actually bolsters the case for algebra. It shows how a little notation goes a long way.</p> <p>For Fermat did not use symbolic algebra to state his famous conjecture. Instead, he wrote:</p> <blockquote> <p>Cubum autem in duos cubos, aut quadrato-quadratum in duos quadrato-quadratos, et generaliter nullam in infinitum ultra quadratum potestatem in duos eiusdem nominis fas est dividere cuius rei demonstrationem mirabilem sane detexi. Hanc marginis exiguitas non caperet.</p> <p align="right"> </p> </blockquote> <p>(If it took him that many words to state the theorem, no wonder he had no space for a proof!)</p> <p>We have it easy today. Mathematics would be considerably harder if you had to compute amortized loan payments with Latin sentences instead of algebra.</p> <p>How could a writer fail to appreciate algebra? Strunk taught that "vigorous writing is concise." Which is more concise: the above, or "x<sup>n</sup> + y<sup>n</sup> = z<sup>n</sup> has no positive integer solutions for n > 2"?</p> <p><strong>What should we learn?</strong></p> <p>Some time ago, I arrived at the opposite conclusion of the author, after reading confessions of professional academic ghostwriters. 
Algebra is fine; <a href="http://crypto.stanford.edu/~blynn/essay.html">the courses that need reform are those far removed from mathematics</a>.</p> <p>According to "Ed Dante", who is hopefully exaggerating, <a href="http://chronicle.com/article/The-Shadow-Scholar/125329/">you can pass such courses so long as you have Amazon, Google, Wikipedia, and a decent writing ability</a>. You get the same results and save money by paying for an internet connection instead of university tuition.</p> <p>I suppose I should also end on a positive note: I propose introducing ghostwriting courses, where the goal is to bluff your way through another course in the manner "Ed Dante" describes. The library would be off-limits, and you must not have previously studied the target subject. Perhaps the first 3 assignments can be admissions essays: one each for undergraduate, master’s and doctoral programs. Grading would be easy: if they fall for it, you get a good score.</p> <p>With luck, universities would be forced to either beef up the victim degrees (perhaps by assessing students with something besides essays, or by teaching something that cannot be immediately learned from the web), or withdraw them. Additionally, the students would learn the importance of writing, and be harder to fool.</p>Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com0tag:blogger.com,1999:blog-4222267598459829544.post-17843132002649263492012-08-05T17:39:00.000-07:002012-08-05T17:40:44.708-07:00Keeping up with yesterday<br />
My to-do list has grown frighteningly large. Perhaps I'll be more motivated to tackle it by publicly announcing a few of its entries.<br />
<br />
<ul>
<li>Apologies to those who sent me patches to my Git tutorial, or are awaiting email responses about the PBC library. I'll try to get around to them soon. And perhaps I'll even get back to working on the second edition of the printed version, which I originally planned to release 2 years ago!</li>
<li>I took notes on return-oriented programming on 64-bit Linux that I want to put up on my site somewhere. They've been almost ready for months.</li>
<li>Months ago, I also coded a logic puzzle solver that takes its input in a concise format. It's about ready for release.</li>
<li>In general, I want to rant and rave more over petty technical issues.</li>
</ul>
<br />
I'd better stop here, otherwise this list may also become too scary for me to look at.Ben Lynnhttp://www.blogger.com/profile/09117417699962852340noreply@blogger.com2