Friday, November 16, 2018

Lambda the Penultimate

Lambda expressions have proven so useful that even Java and C++ support them nowadays. But how do we compile them for a machine to run? No CPU has a lambda instruction.

One strategy is to convert lambda terms into point-free code, a process known as bracket abstraction. One such algorithm rewrites any program in terms of two functions: the S and K combinators. We can build a compiler by assembling just two simple functions.

Unfortunately, even with extra rewrite rules, classic bracket abstraction yields monstrous unwieldy expressions. Decades ago, they worked around this problem by building custom combinators tailored for each input program, known as supercombinators. Compilation is trickier, but at least the output is reasonable.

But recently, Oleg Kiselyov found a time-linear and space-linear bracket abstraction algorithm provided the De Bruijn indices of the input term are encoded in unary. It only takes 20 lines:

data Deb = Zero | Succ Deb | Lam Deb | App Deb Deb deriving Show
infixl 5 :#
data Com = Com :# Com | S | I | C | K | B | Sn Int | Bn Int | Cn Int
ski :: Deb -> (Int, Com)
ski deb = case deb of
  Zero                           -> (1,       I)
  Succ d    | x@(n, _) <- ski d  -> (n + 1,   f (0, K) x)
  App d1 d2 | x@(a, _) <- ski d1
            , y@(b, _) <- ski d2 -> (max a b, f x y)
  Lam d | (n, e) <- ski d -> case n of
                               0 -> (0,       K :# e)
                               _ -> (n - 1,   e)
  where
  f (a, x) (b, y) = case (a, b) of
    (0, 0)             ->         x :# y
    (0, n)             -> Bn n :# x :# y
    (n, 0)             -> Cn n :# x :# y
    (n, m) | n == m    -> Sn n :# x :# y
           | n < m     ->                Bn (m - n) :# (Sn n :# x) :# y
           | otherwise -> Cn (n - m) :# (Bn (n - m) :#  Sn m :# x) :# y

Our ski function returns an integer and a combinatory logic term equivalent to the input lambda term. The integer is zero if the given lambda term is closed; for an open term, it’s the number of lambdas needed to close it.

It uses bulk variants of the B, C, and S combinators:

linBulk :: Com -> Com
linBulk b = case b of
  Bn n   -> iterate ((B:#        B):#) B !! (n - 1)
  Cn n   -> iterate ((B:#(B:#C):#B):#) C !! (n - 1)
  Sn n   -> iterate ((B:#(B:#S):#B):#) S !! (n - 1)
  x :# y -> linBulk x :# linBulk y
  _      -> b

Linear complexity depends on memoizing these bulk combinators. If memoization is undesirable, we can replace each bulk combinator of order n with O(log n) ordinary combinators.

logBulk :: Com -> Com
logBulk b = case b of
  Bn n   -> go n (K:#I)         :# B              :# I
  Cn n   -> go n (K:#(C:#I:#I)) :# (B:#(B:#C):#B) :# I
  Sn n   -> go n (K:#(C:#I:#I)) :# (B:#(B:#S):#B) :# I
  x :# y -> logBulk x :# logBulk y
  _      -> b
  where
  go n base = foldr (:#) base $ ([b0, b1]!!) <$> bits [] n
  bits acc 0 = reverse acc
  bits acc n | (q, r) <- divMod n 2 = bits (r:acc) q
  b0 = C:#B:#(S:#B:#I)
  b1 = C:#(B:#S:#(B:#(B:#B):#(C:#B:#(S:#B:#I)))):#B

For example:

λ print $ logBulk $ Sn 1234
CB(SBI)(C(BS(B(BB)(CB(SBI))))B(CB(SBI)(CB(SBI)(C(BS(B(BB)(CB
(SBI))))B(CB(SBI)(C(BS(B(BB)(CB(SBI))))B(C(BS(B(BB)(CB(SBI))
))B(CB(SBI)(CB(SBI)(C(BS(B(BB)(CB(SBI))))B(K(CII))))))))))))
(B(BS)B)I
λ print $ logBulk $ Bn 1234
CB(SBI)(C(BS(B(BB)(CB(SBI))))B(CB(SBI)(CB(SBI)(C(BS(B(BB)(CB
(SBI))))B(CB(SBI)(C(BS(B(BB)(CB(SBI))))B(C(BS(B(BB)(CB(SBI))
))B(CB(SBI)(CB(SBI)(C(BS(B(BB)(CB(SBI))))B(KI)))))))))))BI

For completeness, we include our pretty-printing code:

instance Show Com where
  show S = "S"
  show I = "I"
  show C = "C"
  show K = "K"
  show B = "B"
  show (l :# r@(_ :# _)) = show l ++ "(" ++ show r ++ ")"
  show (l :# r)          = show l ++        show r
  show (Bn n) = "B_" ++ show n
  show (Cn n) = "C_" ++ show n
  show (Sn n) = "S_" ++ show n

In other words, we can easily rewrite a lambda term of length N as a combinatory logic term of length O(N log N).

Edward Kmett outlines a different approach (about 21 minutes into the video) though the details are so "horrific" that even he has yet to work through them. By the way, the slides from this talk are packed with excellent references on combinators.

Kiselyov laments bracket abstraction has "many descriptions and explanations and blogs", all of which take a syntactic approach. I’m one of the guilty parties, and hope to redeem myself with this post. Also, I rewrote one of my toy compilers to demonstrate another algorithm from Kiselyov’s paper. Though not linear, the algorithm avoids bulk combinators and often produces short and sweet programs.

Tuesday, June 26, 2018

Why Laziness Matters

Should a programming language be lazy by default? Robert Harper says no. Lennart Augustsson says yes. No matter who is right, I say all computer scientists should become fluent in a lazy language, whether or not they speak it in daily life.

My evidence is a post by Russ Cox on parsing with derivatives: a very experienced programmer very convincingly argues why a parsing algorithm has exponential time complexity. But the claims are very wrong; Adams, Hollenbeck, and Might proved the algorithm is cubic.

How did he err so badly? Did he underestimate the power of lazy evaluation?

I once exclusively wrote eager code, and I imagine my younger self would have agreed with his analysis without a second thought. Today I know better. Marvel at these lines by Doug McIlroy:

int fs = 0 : zipWith (/) fs [1..]    -- integral from 0 to x
sins = int coss
coss = 1 - int sins

It seems too good to be true. Indistinguishable from magic perhaps. But somehow it all works when lazily evaluated. Beware of summarily dismissing lazy code because it looks implausibly amazing.

Also consider an earlier article by Cox on regular expressions. Again, a very experienced programmer very convincingly argues why a parsing algorithm has exponential time complexity. In this post, however, the claims are solid, and backed up by graphs of running times. (It’s worth reading by the way: it tells the tragedy of how popular regular expression implementations became sluggish twisted mockeries of true regular expressions, while offering hope for the future. My only criticism is it fails to mention regular expression derivatives.)

Why does the erroneous post lack similar graphs? Why didn’t the author throw some code together and benchmark it to produce damning evidence?

Perhaps he thought it was too tedious. This would imply unfamiliarity with lazy languages, because prototyping parsing with derivatives in Haskell is easier than criticizing it.

Preliminaries

We define a Pe data structure to represent parsing expressions, that is, the right-hand side of the production rules of a grammar.

import Control.Arrow
import Control.Monad.State
import qualified Data.Map as M
import qualified Data.Set as S

-- NT = non-terminal. (:.) = concatenation.
data Pe = NT String | Eps Char | Nul | Ch Char | Or [Pe] | Pe :. Pe | Del Pe

Although it represents the empty string, the Eps (for epsilon) expression holds a character that winds up in the abstract syntax tree (AST) returned by the parser. Similarly, the Del (for delta) expression, which is only generated internally, holds an expression which later helps build an AST.

A context-free grammar maps non-terminal symbols to parsing expressions:

type Grammar = M.Map String Pe

Our ASTs are full binary trees whose leaf nodes are characters (the free magma on the alphabet). The tree structure captures the order the production rules are applied.

data Ast = Bad | Lf Char | Ast :@ Ast deriving Show

isBad :: Ast -> Bool
isBad Bad = True
isBad _   = False

The Bad AST is returned for unparseable strings. An alternative is to drop Bad and replace Ast with Maybe Ast throughout our code.

A fancier parser might return a parse forest, that is, all parse trees for a given input. Ours simply settles on one parse tree.

Parsing with derivatives

To parse an input string, we first take successive derivatives of the start symbol with respect to each character of the input, taking care to leave bread crumbs in the Eps and Del expressions to record consumed characters. (The Del constructor is named for the delta symbol from the paper, but I also think of it as "deleted", because it remembers what has just been deleted from the input.)

Then the string is accepted if and only if the resulting expression is nullable, that is, accepts the empty string. As we traverse the expression to determine nullability, we also build an AST to return.

We memoize derivatives by adding entries to a state of type Grammar. Initially, this cache contains only the input grammar, mapping nonterminal symbols to Pe values. Later, we place a derivative at the key formed by concatenating the characters involved in the derivative with the nonterminal symbol being derived.

For example, if S is a nonterminal in the input grammar, then abS maps to derive 'a' (derive 'b' (NT "S")). We assume no nonterminal symbol in the input grammar is a suffix of any other nonterminal symbol, which is fine for a prototype.

It may help to imagine the grammar growing over time, gaining new production rules as we process input characters. Indeed, we consider nonterminals to refer to both nonterminals in the input grammar as well as their derivatives.

parse :: Grammar -> String -> String -> Ast
parse g start s = evalState (parseNull $ NT $ reverse s ++ start) g

Computing nullability requires finding a least fixed point. I found this the toughest part of the algorithm, partly because they never taught fixed point theory when I was in school. For some reason, the method reminds me of Hopcroft’s algorithm to minimize a DFA, where we repeatedly refine a partition until we reach a stable answer.

We initially guess each nonterminal is not nullable, which means it corresponds to the Bad AST. On encountering a nonterminal, if we’ve already seen it, then return our guess for that nonterminal. Otherwise, it’s the first time we’ve seen it and instead of guessing, we recursively traverse its corresponding expression. In doing so, we may discover our guess is wrong, so we correct it if necessary before returning an AST.

We repeat until our guesses stabilize. Guesses never change from a good AST to Bad, and the map of all guesses only changes if a guess is revised from Bad to a good AST. We exploit these facts to simplify our code slightly.

parseNull :: Pe -> State Grammar Ast
parseNull pe = leastFix M.empty where
  leastFix guessed = do
    (b, (_, guessed')) <- runStateT (visit pe) (S.empty, guessed)
    if M.size guessed == M.size guessed' then pure b else leastFix guessed'

visit :: Pe -> StateT (S.Set String, M.Map String Ast) (State Grammar) Ast
visit pe = case pe of
  Eps x  -> pure $ Lf x
  Del x  -> visit x
  Nul    -> pure Bad
  Ch _   -> pure Bad
  Or xs  -> chainsaw <$> mapM visit xs
  x :. y -> mul <$> visit x <*> visit y
  NT s -> do
    (seen, guessed) <- get
    case () of
      () | Just x <- M.lookup s guessed -> pure x
         | S.member s seen -> pure Bad
         | otherwise -> do
           modify $ first $ S.insert s
           b <- visit =<< lift (memoDerive s)
           unless (isBad b) $ modify $ second $ M.insert s b
           pure b

mul :: Ast -> Ast -> Ast
mul Bad _ = Bad
mul _ Bad = Bad
mul x y   = x :@ y

-- | Helps cut a non-empty parse forest down to one tree.
chainsaw :: [Ast] -> Ast
chainsaw xs | null xs'   = Bad
            | otherwise  = head xs'
            where xs' = filter (not . isBad) xs

Memoized derivatives are straightforward. For computing derivatives, we translate the rules given in the paper, and for memoization, on discovering a missing entry, we insert a knot-tying value before recursing, and replace it with the result of the recursion afteward.

memoDerive :: String -> State Grammar Pe
memoDerive cs@(c:s) = do
  m <- get
  unless (M.member cs m) $ do
    modify $ M.insert cs $ NT cs
    d <- derive c =<< memoDerive s
    modify $ M.insert cs d
  gets (M.! cs)
memoDerive _ = error "unreachable"

derive :: Char -> Pe -> State Grammar Pe
derive c pe = case pe of
  NT s             -> pure $ NT $ c:s
  Ch x | x == c    -> pure $ Eps x
  Or xs            -> Or <$> mapM (derive c) xs
  Del x :. y       -> (Del x :.) <$> derive c y
  x :. y           -> do
    b <- parseNull x
    dx <- derive c x
    if isBad b then pure $ dx :. y else do
      dy <- derive c y
      pure $ Or [dx :. y, Del x :. dy]
  _                -> pure Nul

Here’s the grammar that Cox claims will grind our parser to a halt:

cox :: Grammar
cox = M.fromList
  [ ("S", NT "T")
  , ("T", Or [NT "T" :. (Ch '+' :. NT "T"), NT "N"])
  , ("N", Ch '1')
  ]

Let’s try it on a small input in an interactive interpreter:

parse cox "S" "1+1+1"

The parser picks a particular parse tree:

(Lf '1' :@ (Lf '+' :@ Lf '1')) :@ (Lf '+' :@ Lf '1')

How about all strings of length 7 consisting of 1 or +?

filter (not . isBad . parse cox "S") $ replicateM 7 "+1"

Thankfully, we get:

["1+1+1+1"]

At last, it’s time to demolish Cox’s claims. We parse an 80-character input with a typo near the end:

main :: IO ()
main = print $ parse cox "S" $ concat (replicate 39 "1+") ++ "+1"

Our prototype is awful. We really should:

  • Add a slimmed down version of parseNull that returns a boolean instead of an AST, and call this in derive. We only want to recover the AST once the whole string has been parsed; the rest of the time, we only care whether an expression is nullable.

  • Use a better algorithm for finding the least fixed point. We’ve perhaps chosen the clunkiest and most obvious method.

  • Remove a layer of indirection when tying the knot. That is, point to a node directly rather than a string (which requires another lookup to get at the node).

  • Apply algebraic identities to reduce the number of nodes in parsing expressions and abstract syntax trees.

And yet, on my laptop:

Bad

real    0m0.220s
user    0m0.215s
sys     0m0.005s

Clearly, parsing with derivatives is efficient when run on the allegedly exponential-running-time example given by Cox.

The moral of the story

It’s best to test drive an algorithm before condemning it. If we see hilariously bad running times, then we can include them to hammer our points home. If we see surprisingly good running times, then there’s a mistake in our reasoning and we should keep quiet until we successfully attack the algorithm from another angle. (Cox rightly notes parsing with derivatives forgoes two key properties of yacc: linear running time and ambiguity detection. If only he had focused on these trade-offs.)

Is this practicable for parsing with derivatives? Well, we have presented an entire program, yet we have written less code than appears in Cox’s excellent article on regular expressions, which quotes just a few choice cuts from a presumably complete program. Indeed, with a splash of HTML, we can easily build an interactive online demo of parsing with derivatives.

The existence of the flawed post indicates no such sanity check was done. This was caused by poor understanding of lazy evaluation, or because it was deemed too troublesome to implement a lazy algorithm. Both problems are solved by learning a lazy language.

In sum, insufficient experience with lazy evaluation leads to faulty time complexity analysis. Therefore we should all be comfortable with lazy languages so computer science can progress unimpeded.

Monday, June 4, 2018

Regex Derivatives

Like many of my generation, I was taught to use Thompson’s construction to convert a regular expression to a deterministic finite automaton (DFA). Namely, we draw tiny graphs for each component of a given regular expression, and stitch them together to form a nondeterministic finite automaton (NFA), which we then convert to a DFA.

The ideas are interesting. Sadly, there is no other reason to study them, because there’s a simpler approach that:

  • Constructs a DFA directly from a regular expression. Forget NFAs.

  • Supports richer regular expressions. Behold, logical AND and NOT: [a-z]+&!(do|for|if|while)

  • Immediately obtains smaller and often minimal DFAs in realistic applications.

All this is expertly explained in Regular-expression derivatives reexamined by Owens, Reppy, and Turon. To my chagrin, I only stumbled across it recently, almost a decade after its publication. And after I had already written a regex tool.

But it could be worse: the authors note the superior method was published over 50 years ago by Brzozowski, before being "lost in the sands of time".

Derive to succeed

Take "standard" regular expressions. We have the constants:

  • \$\emptyset\$: accepts nothing; the empty language.

  • \$\epsilon\$: accepts the empty string.

  • \$c\$: accepts the character \$c\$.

and regexes built from other regexes \$r\$ and \$s\$:

  • \$rs\$: the language built from all pairwise concatenations of strings in \$r\$ and strings in \$s\$.

  • \(r\mid s\): logical or (alternation); the union of the two languages.

  • \$r\mbox{*}\$: Kleene closure; zero or more strings of \$r\$ concatenated together.

Then solve two problems:

  1. Determine if a regex accepts the empty string.

  2. For a character \$c\$ and a regex \$f\$, find a regex that accepts a string \$s\$ precisely when \$f\$ accepts \$c\$ followed by \$s\$. For example, feeding a to ab*c|d*e*f|g*ah results in the regex b*c|h.

The first problem is little more than a reading comprehension quiz. Going down the list, we see the answers are: no; yes; no; exactly when \$r\$ and \$s\$ do; exactly when \$r\$ or \$s\$ do; yes.

import Data.List

data Re = Nul | Eps | Lit Char | Kleene Re | Re :. Re | Alt [Re]
  deriving (Eq, Ord)

nullable :: Re -> Bool
nullable re = case re of
  Nul      -> False
  Eps      -> True
  Lit _    -> False
  r :. s   -> nullable r && nullable s
  Alt rs   -> any nullable rs
  Kleene _ -> True

In the second problem, the base cases remain easy: return \$\emptyset\$ except for the constant \$c\$, in which case return \$\epsilon\$.

The recursive cases are tougher. Given \(r\mid s\), solve the problem on both alternatives to get \$r'\$ and \$s'\$ then return \(r'\mid s'\). For \(r\mbox{*}\), return \(r'r\mbox{*}\).

The trickiest is concatenation: \$rs\$. First, determine if \$r\$ accepts the empty string (the problem we just solved). If so, return \(r's\mid s'\). If not, return \(r's\).

The answer to the second problem is the derivative of the regex \$f\$ with respect to the character \$c\$, and denoted \$\partial_c f\$.

derive :: Char -> Re -> Re
derive c f = case f of
  Nul                 -> Nul
  Eps                 -> Nul
  Lit a  | a == c     -> Eps
         | otherwise  -> Nul
  r :. s | nullable r -> mkAlt [dc r :. s, dc s]
         | otherwise  -> dc r :. s
  Alt rs              -> mkAlt $ dc <$> rs
  Kleene r            -> dc r :. f
  where dc = derive c

For now, pretend mkAlt = Alt. We shall soon reveal its true definition, why we need it, and why we represent an alternation with a list.

The regex is the state

We can now directly construct a DFA for any regex \$r\$.

Each state of our DFA corresponds to a regex. The start state is the input regex \$r\$. For each character \$c\$, create the state \$\partial_c r\$ if it doesn’t already exist, then draw an arrow labeled \$c\$ from \$r\$ to \$\partial_c r\$.

Repeat on all newly created states. The accepting states are those which accept the empty string. Done!

mkDfa :: Re -> ([Re], Re, [Re], [((Re, Re), Char)])
mkDfa r = (states, r, filter nullable states, edges) where
  (states, edges) = explore ([r], []) r
  explore gr q = foldl' (goto q) gr ['a'..'z']
  goto q (qs, es) c | qc `elem` qs = (qs, es1)
                    | otherwise    = explore (qc:qs, es1) qc
                    where qc  = derive c q
                          es1 = ((q, qc), c):es

So long as we’re mindful that the logical or operation is idempotent, commutative, and associative, that is, \(r\mid r = r\), \(r\mid s = s\mid r\), and \((r\mid s)\mid t = r\mid (s\mid t)\), the above is guaranteed to terminate.

This makes sense intuitively, because taking a derivative usually yields a simpler regex. The glaring exception is the Kleene star, but on further inspection, we ought to repeat ourselves eventually after taking enough derivatives so long as we can cope with the proliferating logical ors.

We handle idempotence with nub, commutativity with sort, and associativity by flattening lists:

mkAlt :: [Re] -> Re
mkAlt rs | [r] <- rs' = r
         | otherwise  = Alt rs'
         where rs' = nub $ sort $ concatMap flatAlt rs
               flatAlt (Alt as) = as
               flatAlt a        = [a]

This ties off the loose ends mentioned above, and completes our regex compiler. Not bad for 30 lines or so!

In practice, we apply more algebraic identities before comparing regexes to produce smaller DFAs, which empirically are often optimal. (Ideally, we’d like to tell if two given regexes describe the same language so we could always generate the minimal DFA, but this is too costly.)

Extending regexes

Adding new features to the regex language is easy with derivatives. Given an operation, we only need to:

  1. Determine if it accepts the empty string.

  2. Figure out the rules for its derivative.

(We should prove the algorithm still terminates, but we’ll just eyeball it and wave our hands.)

For example, we get the familiar \(r\mbox{+}\) by rejecting the empty string and defining its derivative to be \(r' r\mbox{*}\). We obtain \$r?\$ by accepting the empty string and defining its derivative to be \$r'\$. But let’s do something more fun.

The logical and \$r&s\$ of regexes \$r\$ and \$s\$ accepts if and only if both \$r\$ and \$s\$ match. Then \$r&s\$ accepts the empty string exactly when both \$r\$ and \$s\$ do (similar to concatenation), and the derivative of \$r&s\$ is \$r'&s'\$.

The complement \$!r\$ of a regex \$r\$ to accepts if and only if \$r\$ rejects. Then \$!r\$ accepts the empty string if and only if \$r\$ rejects it, and the derivative of \$!r\$ is \$!r'\$.

For example, if we write () for \$\epsilon\$ then !()&[a-z]* is the same as [a-z]+.

As before, we can plug these operations into our DFA-maker right away. Good luck doing this with NFAs! Well, I think it’s possible if we add weird rules, e.g. "if we can reach state A and state B, then we can magically reach state C", but then they’d no longer be true NFAs.

The unfortunate, undeserved, and hopefully soon-to-be unlamented prominence of the NFA approach are why these useful operations are considered exotic.

Regularly express yourself

Saturday, June 10, 2017

Solving the JavaScript Problem

The JavaScript Problem is a good problem to have. Against the odds, “write once, run anywhere” is a reality for web browsers because of a language governed by a standards organization. Not so long ago, proprietary technologies such as Flash and Silverlight threatened the openness of the web.

So we should be grateful the JavaScript problem is merely a technical one, namely that JavaScript is poorly designed. Though the language has improved over the years, its numerous flaws are too deeply entrenched to remove. Transpilers help by unlocking access to better languages, but JavaScript was never intended to be an assembly language. It’s only thanks to heroic efforts of many talented engineers that JavaScript has gone so far.

Ideally, the lingua franca of the web should be low-level, clean, and simple, so we can develop in any language with little overhead.

WebAssembly

Recent versions of several popular browsers have fulfilled our wishes with WebAssembly, also known as wasm. WebAssembly is an open standard, and well-designed. At last, works such as Benjamin Pierce’s “Types and Programming Languages” are mainstream enough that WebAssembly has formally specified reduction and typing rules, and even a proof of soundness. In contrast, weakly typed languages such as JavaScript ignore a century or so of mathematical progress.

In WebAssembly, nondeterminstic behvaviour can only arise from exhaustion, external host functions, and the IEEE-754 floating-point standard, which fails to specify the NaN bit pattern for all cases. Recall in C and the many languages built upon it such as Go and Haskell, signed integer overflow causes undefined behaviour. WebAssembly fixes this by stipulating two’s complement for negative numbers, as competing representations of negative numbers are ultimately responsible for this defect of C. Endianness is similarly settled, though curiously by travelling the road not taken by network byte order: numbers in WebAssembly are encoded in little-endian.

The WebAssembly virtual machine is stack-based. Years ago, I read that register-based virtual machines are faster, but perhaps these results are now obsolete. Browsing briefly, I found newer papers:

It’s the same old story. Register-based virtual machines are still faster after all. It seems WebAssembly prioritizes code size, and trusts browsers will ship with good JIT compilers.

Toy Compilers

Online demonstrations of WebAssembly compilers are fun to build, and fun to describe: I compiled a Haskell program to JavaScript that when executed, reads a Haskell-like program and compiles it to WebAssembly, which some JavaScript loads and executes.

Perhaps it’s easier to invite the reader to:

For the last 2 programs, I transformed the source to a tree of S and K combinators, so apart from graph reduction, I only had to code 2 combinators in assembly. The resulting binaries are excruciatingly slow, especially since numbers are Church-encoded, but it all seems to work.

I look forward to a Haskell compiler that produces efficient WebAssembly, though it may have to wait until WebAssembly gains a few more features, such as threads and tail calls.

Sunday, April 2, 2017

Lambda Calculus Surprises

Much time has passed since my last entry. I’ve been frantically filling gaps in my education, so I’ve had little to say here. My notes are better off on my homepage, where I can better organize them, and incorporate interactive demos.

However, I want to draw attention to delightful surprises that seem unfairly obscure.

1. Succinct Turing-complete self-interpreters

John McCarthy’s classic paper showed how to write a Lisp interpreter in Lisp itself. By adding a handful of primitives (quote, atom, eq, car, cdr, cons, cond) to lambda calculus, we get a Turing-complete language where a self-interpreter is easy to write and understand. For contrast, see Turing’s universal machine of 1936.

Researchers have learned more about lambda calculus since 1960, but many resources seem stuck in the past. Writing a Turing-complete interpreter in 7 lines is ostensibly still a big deal. The Roots of Lisp by Paul Graham praises McCarthy’s self-interpreter but explores no further. The Limits of Mathematics by Gregory Chaitin chooses Lisp over plain lambda calculus for dubious reasons. Perhaps McCarthy’s work is so life-changing that some find it hard to notice new advances.

(f.(x.f(xx))(x.f(xx)))(em.m(x.x)(mn.em(en))(mv.e(mv)))

(I’ve suppressed the lambdas. Exercise: write a regex substitution that restores them.)

In fact, under some definitions, the program “λq.q(λx.x)” is a self-interpreter.

2. Hindley-Milner sort

Types and Programming Languages (TaPL) by Benjamin C. Pierce is a gripping action thriller. Types are the heroes, and we follow their epic struggle against the most ancient and powerful foes of computer science and mathematics.

When we first meet them, types are humble guardians of a barebones language that can only express the simplest of computations involving booleans and natural numbers. As the story progresses, types gain additional abilities, enabling them to protect more powerful languages.

However, there seems to be a plot hole when types level up from Hindley-Milner to System F. As a “nice demonstration of the expressive power of pure System F”, the book mentions a program that can sort lists.

The details are left as an exercise to the reader. Working through them, we realize a Hindley-Milner type system is already powerful enough to sort lists. Moreover, the details are far more pleasant in Hindley-Milner because we avoid the ubiquitous type spam of System F.

System F is indeed more powerful than Hindley-Milner and deserves admiration, but because of well-typed self-application and polymorphic identity functions, existential types, and other gems; not because lists can be sorted.

3. Self-interpreters for total languages

They said it couldn’t be done.

According to Breaking Through the Normalization Barrier: A Self-Interpreter for F-omega by Matt Brown and Jens Palsberg, “several books, papers, and web pages” assert self-interpreters for a strongly normalizing lambda calculus are impossible. The paper then shows that reports of their non-existence have been greatly exaggerated.

Indeed, famed researcher Robert Harper writes on his blog that “one limitation of total programming languages is that they are not universal: you cannot write an interpreter for T within T (see Chapter 9 of PFPL for a proof).”, and as of now (April 2017), the Wikipedia article they cite still declares “it is impossible to define a self-interpreter in any of the calculi cited above”, referring to simply typed lambda calculus, System F, and the calculus of constructions.

I was shocked. Surely academics are proficient with diagonalization by now? Did they all overlook a hole in their proofs?

More shocking is the stark simplicity of what Brown and Palsberg call a shallow self-interpreter for System F and System Fω, which is essentially a typed version of “λq.q(λx.x)”.

It relies on a liberal definition of representation (we only require an injective map from legal terms to normal forms) and self-interpretation (mapping a representation of a term to its value) which is nonetheless still strong enough to upend conventional wisdom.

Which brings us to the most shocking revelation: there is no official agreement on the definition of representation or self-interpretation, or even what we should name these concepts.

Does this mean I should be wary of even the latest textbooks? Part of me hopes not, because I want to avoid learning falsehoods, but another part of me hopes so, for it means I’ve reached the cutting edge of research.

See for yourself!

Interactive demos of the above:

Tuesday, November 10, 2015

Neural Networks in Haskell

Long ago, when I first looked into machine learning, neural networks didn’t stand out of the crowd. They seemed on par with decision trees, genetic algorithms, genetic programming, and a host of other techniques. I wound up dabbling in genetic programming because it seemed coolest.

Neural networks have since distinguished themselves. Lately, they seem responsible for each newsworthy machine learning achievement I hear about. To name a few:

Inspired, I began reading Michael Nielsen’s online book on neural networks. We can whip up a neural network without straying beyond a Haskell base install, though we do have to implement the Box-Muller transform ourselves to avoid pulling in a library to sample from a normal distribution.

The following generates a neural network with 3 inputs, a hidden layer of 4 neurons, and 2 output neurons, and feeds it the inputs [0.1, 0.2, 0.3].

import Control.Monad
import Data.Functor
import Data.List
import System.Random

main = newBrain [3, 4, 2] >>= print . feed [0.1, 0.2, 0.3]

newBrain szs@(_:ts) = zip (flip replicate 1 <$> ts) <$>
  zipWithM (\m n -> replicateM n $ replicateM m $ gauss 0.01) szs ts

feed = foldl' (((max 0 <$>) . ) . zLayer)

zLayer as (bs, wvs) = zipWith (+) bs $ sum . zipWith (*) as <$> wvs

gauss :: Float -> IO Float
gauss stdev = do
  x <- randomIO
  y <- randomIO
  return $ stdev * sqrt (-2 * log x) * cos (2 * pi * y)

The tough part is training the network. The sane choice is to use a library to help with the matrix and vector operations involved in backpropagation by gradient descent, but where’s the fun in that?

It turns out even if we stay within core Haskell, we only need a few more lines, albeit some hairy ones:

relu = max 0
relu' x | x < 0      = 0
        | otherwise  = 1

revaz xs = foldl' (\(avs@(av:_), zs) (bs, wms) -> let
  zs' = zLayer av (bs, wms) in ((relu <$> zs'):avs, zs':zs)) ([xs], [])

dCost a y | y == 1 && a >= y = 0
          | otherwise        = a - y

deltas xv yv layers = let
  (avs@(av:_), zv:zvs) = revaz xv layers
  delta0 = zipWith (*) (zipWith dCost av yv) (relu' <$> zv)
  in (reverse avs, f (transpose . snd <$> reverse layers) zvs [delta0])
  where
    f _ [] dvs = dvs
    f (wm:wms) (zv:zvs) dvs@(dv:_) = f wms zvs $ (:dvs) $
      zipWith (*) [sum $ zipWith (*) row dv | row <- wm] (relu' <$> zv)

descend av dv = zipWith (-) av ((0.002 *) <$> dv)

learn xv yv layers = let (avs, dvs) = deltas xv yv layers
  in zip (zipWith descend (fst <$> layers) dvs) $
    zipWith3 (\wvs av dv -> zipWith (\wv d -> descend wv ((d*) <$> av))
      wvs dv) (snd <$> layers) avs dvs

See my Haskell notes for details. In short: ReLU activation function; online learning with a rate of 0.002; an ad hoc cost function that felt right at the time.

Despite cutting many corners, after a few runs, I obtained a neural network that correctly classifies 9202 of 10000 handwritten digits in the MNIST test set in just one pass over the training set.

I found this result surprisingly good. Yet there is much more to explore: top on my must-see list are deep learning (also described in Nielsen’s book) and long short-term memory.

I turned the neural net into an online digit recognition demo: you can draw on the canvas and see how it affects the outputs.

Monday, February 23, 2015

Mighty Warp

  • Outperforms nginx.

  • Under 1300 lines of source.

  • Clear control flow: handles one request per thread using blocking calls.

  • Slowloris DoS protection.

The secret is GHC’s runtime system (RTS). Every Haskell program must spend time in the RTS, and maybe this does hurt performance in certain cases, but for web servers it is a huge win: the RTS automatically transforms code that seems to handle one request per thread into a server with multiple event-driven processes. This saves many a context switch while keeping the source simple.

Best of all, this magic technology is widely available. To start a webserver on port 3000 using the Warp library, run these commands on Ubuntu [original inspiration]:

sudo apt-get install cabal-install
cabal update
cabal install warp
cat > server.hs << EOF
#!/usr/bin/env runghc
{-# LANGUAGE OverloadedStrings #-}
import Network.Wai (responseLBS)
import Network.Wai.Handler.Warp (run)
import Network.HTTP.Types (status200)
import Network.HTTP.Types.Header (hContentType)

main = run 3000 $ \_ f -> f $ responseLBS
  status200 [(hContentType, "text/plain")] "Hello, world!\n"
EOF
chmod +x server.hs
./server.hs

Eliminating context switches is the best part of the story, but there’s more. Copying data can be avoided with a simple but clever trick the authors call splicing. Using conduits instead of lazy I/O solves the non-deterministic resource finalization problem. And a few judiciously placed lockless atomic operations can work wonders: in particular, for basic Slowloris protection and for a robust file descriptor cache.

Uramaki

Those who fear straying too far from a C-like language can still reap the benefits in Go:

  • Goroutines are like green threads.

  • Channels are like conduits.

  • Array slices are like splices.

If I didn’t know better, I would say the designers of Go emulated the inventor of the California roll: they took some of the best features of languages like Haskell and made them palatable to a wider audience.

I wonder how Go’s RTS compares. One innate advantage GHC may have is Haskell’s type system, which leads to largely non-destructive computation, which ultimately leads to a cheap and effective scheduling scheme (namely, context switching on memory allocation). Still, I expect a well-written Go web server could achieve similar results.