Tuesday, June 26, 2018

Why Laziness Matters

Should a programming language be lazy by default? Robert Harper says no. Lennart Augustsson says yes. No matter who is right, I say all computer scientists should become fluent in a lazy language, whether or not they speak it in daily life.

My evidence is a post by Russ Cox on parsing with derivatives: a very experienced programmer very convincingly argues that a parsing algorithm has exponential time complexity. But the claims are very wrong; Adams, Hollenbeck, and Might proved the algorithm is cubic.

How did he err so badly? Did he underestimate the power of lazy evaluation?

I once exclusively wrote eager code, and I imagine my younger self would have agreed with his analysis without a second thought. Today I know better. Marvel at these lines by Doug McIlroy:

int fs = 0 : zipWith (/) fs [1..]    -- integral from 0 to x
sins = int coss
coss = 1 - int sins

It seems too good to be true. Indistinguishable from magic perhaps. But somehow it all works when lazily evaluated. Beware of summarily dismissing lazy code because it looks implausibly amazing.
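
For the curious, here is one way to run those three lines: a minimal sketch of my own, not McIlroy’s original presentation, that represents a power series by the infinite list of its coefficients and gives such lists a Num instance. The instance and the choice of Rational are my assumptions.

-- Power series as infinite lists of coefficients (assumed representation).
instance Num a => Num [a] where
  fromInteger n   = fromInteger n : repeat 0             -- constant series
  negate          = map negate
  (x:xs) + (y:ys) = x + y : xs + ys                      -- coefficient-wise sum
  (x:xs) * (y:ys) = x * y : map (x *) ys + xs * (y:ys)   -- Cauchy product
  abs             = error "unused"
  signum          = error "unused"

int :: Fractional a => [a] -> [a]
int fs = 0 : zipWith (/) fs [1..]    -- integral from 0 to x

sins, coss :: [Rational]
sins = int coss
coss = 1 - int sins

In GHCi, take 9 sins should yield the first Maclaurin coefficients of sine (0, 1, 0, -1/6, 0, 1/120, 0, -1/5040, 0). Each coefficient of sins depends only on earlier coefficients of coss and vice versa, so lazy evaluation unravels the mutual recursion.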

Also consider an earlier article by Cox on regular expressions. Again, a very experienced programmer very convincingly argues that a parsing algorithm has exponential time complexity. In that article, however, the claims are solid, and backed up by graphs of running times. (It’s worth reading, by the way: it tells the tragedy of how popular regular expression implementations became sluggish twisted mockeries of true regular expressions, while offering hope for the future. My only criticism is that it fails to mention regular expression derivatives.)

Why does the erroneous post lack similar graphs? Why didn’t the author throw some code together and benchmark it to produce damning evidence?

Perhaps he thought it was too tedious. This would imply unfamiliarity with lazy languages, because prototyping parsing with derivatives in Haskell is easier than criticizing it.

Preliminaries

We define a Pe data structure to represent parsing expressions, that is, the right-hand side of the production rules of a grammar.

import Control.Arrow
import Control.Monad.State
import qualified Data.Map as M
import qualified Data.Set as S

-- NT = non-terminal reference. Eps = empty string (epsilon), carrying a consumed
-- character. Nul = the empty language. Ch = literal character. Or = alternation.
-- (:.) = concatenation. Del = delta, generated internally (see below).
data Pe = NT String | Eps Char | Nul | Ch Char | Or [Pe] | Pe :. Pe | Del Pe

Although it represents the empty string, the Eps (for epsilon) expression holds a character that winds up in the abstract syntax tree (AST) returned by the parser. Similarly, the Del (for delta) expression, which is only generated internally, holds an expression which later helps build an AST.

A context-free grammar maps non-terminal symbols to parsing expressions:

type Grammar = M.Map String Pe
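
For instance, here is a small grammar of my own (not one from Cox’s post) encoding A → 'a' A | 'a', which derives one or more copies of the letter a:

-- Hypothetical example: the nonterminal A derives one or more 'a's.
ex :: Grammar
ex = M.fromList [("A", Or [Ch 'a' :. NT "A", Ch 'a'])]

Once parse is defined below, parse ex "A" "aa" should produce Lf 'a' :@ Lf 'a'.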

Our ASTs are full binary trees whose leaf nodes are characters (the free magma on the alphabet). The tree structure captures the order in which the production rules are applied.

data Ast = Bad | Lf Char | Ast :@ Ast deriving Show

isBad :: Ast -> Bool
isBad Bad = True
isBad _   = False

The Bad AST is returned for unparseable strings. An alternative is to drop Bad and replace Ast with Maybe Ast throughout our code.
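
For reference, here is roughly how that variant could look, a sketch of my own in which Nothing plays the role of Bad; the helpers below would stand in for isBad, mul, and chainsaw rather than sit alongside them.

import Data.Maybe (catMaybes, listToMaybe)

data Ast = Lf Char | Ast :@ Ast deriving Show

-- Concatenation fails if either side failed to parse.
mul :: Maybe Ast -> Maybe Ast -> Maybe Ast
mul x y = (:@) <$> x <*> y

-- Keep the first successful parse among the alternatives, if any.
chainsaw :: [Maybe Ast] -> Maybe Ast
chainsaw = listToMaybe . catMaybes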

A fancier parser might return a parse forest, that is, all parse trees for a given input. Ours simply settles on one parse tree.

Parsing with derivatives

To parse an input string, we first take successive derivatives of the start symbol with respect to each character of the input, taking care to leave bread crumbs in the Eps and Del expressions to record consumed characters. (The Del constructor is named for the delta symbol from the paper, but I also think of it as "deleted", because it remembers what has just been deleted from the input.)

Then the string is accepted if and only if the resulting expression is nullable, that is, accepts the empty string. As we traverse the expression to determine nullability, we also build an AST to return.

We memoize derivatives by adding entries to a state of type Grammar. Initially, this cache contains only the input grammar, mapping nonterminal symbols to Pe values. Later, we place a derivative at the key formed by concatenating the characters involved in the derivative with the nonterminal symbol being derived.

For example, if S is a nonterminal in the input grammar, then abS maps to derive 'a' (derive 'b' (NT "S")). We assume no nonterminal symbol in the input grammar is a suffix of any other nonterminal symbol, which is fine for a prototype.

It may help to imagine the grammar growing over time, gaining new production rules as we process input characters. Indeed, we consider nonterminals to refer to both nonterminals in the input grammar as well as their derivatives.

parse :: Grammar -> String -> String -> Ast
parse g start s = evalState (parseNull $ NT $ reverse s ++ start) g

Computing nullability requires finding a least fixed point. I found this the toughest part of the algorithm, partly because they never taught fixed point theory when I was in school. For some reason, the method reminds me of Hopcroft’s algorithm to minimize a DFA, where we repeatedly refine a partition until we reach a stable answer.

We initially guess each nonterminal is not nullable, which means it corresponds to the Bad AST. On encountering a nonterminal, if we’ve already seen it, then return our guess for that nonterminal. Otherwise, it’s the first time we’ve seen it and instead of guessing, we recursively traverse its corresponding expression. In doing so, we may discover our guess is wrong, so we correct it if necessary before returning an AST.

We repeat until our guesses stabilize. Guesses never change from a good AST to Bad, and the map of all guesses only changes if a guess is revised from Bad to a good AST. We exploit these facts to simplify our code slightly.

parseNull :: Pe -> State Grammar Ast
parseNull pe = leastFix M.empty where
  leastFix guessed = do
    (b, (_, guessed')) <- runStateT (visit pe) (S.empty, guessed)
    if M.size guessed == M.size guessed' then pure b else leastFix guessed'

-- Extra state: nonterminals visited during this pass, and the current guesses,
-- which hold good ASTs only.
visit :: Pe -> StateT (S.Set String, M.Map String Ast) (State Grammar) Ast
visit pe = case pe of
  Eps x  -> pure $ Lf x
  Del x  -> visit x
  Nul    -> pure Bad
  Ch _   -> pure Bad
  Or xs  -> chainsaw <$> mapM visit xs
  x :. y -> mul <$> visit x <*> visit y
  NT s -> do
    (seen, guessed) <- get
    case () of
      () | Just x <- M.lookup s guessed -> pure x
         | S.member s seen -> pure Bad
         | otherwise -> do
           modify $ first $ S.insert s
           b <- visit =<< lift (memoDerive s)
           unless (isBad b) $ modify $ second $ M.insert s b
           pure b

mul :: Ast -> Ast -> Ast
mul Bad _ = Bad
mul _ Bad = Bad
mul x y   = x :@ y

-- | Helps cut a non-empty parse forest down to one tree.
chainsaw :: [Ast] -> Ast
chainsaw xs | null xs'   = Bad
            | otherwise  = head xs'
            where xs' = filter (not . isBad) xs

Memoized derivatives are straightforward. To compute derivatives, we translate the rules given in the paper; for memoization, on discovering a missing entry, we insert a knot-tying value before recursing, and replace it with the result of the recursion afterward.

memoDerive :: String -> State Grammar Pe
memoDerive cs@(c:s) = do
  m <- get
  unless (M.member cs m) $ do
    -- Tie the knot: insert a self-reference first, then overwrite it with
    -- the computed derivative once the recursion returns.
    modify $ M.insert cs $ NT cs
    d <- derive c =<< memoDerive s
    modify $ M.insert cs d
  gets (M.! cs)
memoDerive _ = error "unreachable"

derive :: Char -> Pe -> State Grammar Pe
derive c pe = case pe of
  NT s             -> pure $ NT $ c:s
  Ch x | x == c    -> pure $ Eps x
  Or xs            -> Or <$> mapM (derive c) xs
  Del x :. y       -> (Del x :.) <$> derive c y
  x :. y           -> do
    b <- parseNull x
    dx <- derive c x
    if isBad b then pure $ dx :. y else do
      dy <- derive c y
      pure $ Or [dx :. y, Del x :. dy]
  _                -> pure Nul

Here’s the grammar that Cox claims will grind our parser to a halt:

cox :: Grammar
cox = M.fromList
  [ ("S", NT "T")
  , ("T", Or [NT "T" :. (Ch '+' :. NT "T"), NT "N"])
  , ("N", Ch '1')
  ]

Let’s try it on a small input in an interactive interpreter:

parse cox "S" "1+1+1"

The parser picks a particular parse tree:

(Lf '1' :@ (Lf '+' :@ Lf '1')) :@ (Lf '+' :@ Lf '1')

How about all strings of length 7 consisting of 1 or +?

filter (not . isBad . parse cox "S") $ replicateM 7 "+1"

Thankfully, we get:

["1+1+1+1"]

At last, it’s time to demolish Cox’s claims. We parse an 80-character input with a typo near the end:

main :: IO ()
main = print $ parse cox "S" $ concat (replicate 39 "1+") ++ "+1"

Our prototype is awful. We really should:

  • Add a slimmed-down version of parseNull that returns a Boolean instead of an AST, and call that in derive (see the sketch after this list). We only want to recover the AST once the whole string has been parsed; the rest of the time, we only care whether an expression is nullable.

  • Use a better algorithm for finding the least fixed point. We’ve perhaps chosen the clunkiest and most obvious method.

  • Remove a layer of indirection when tying the knot. That is, point to a node directly rather than a string (which requires another lookup to get at the node).

  • Apply algebraic identities to reduce the number of nodes in parsing expressions and abstract syntax trees.
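
Here is a sketch of the first improvement, my own code rather than anything from the paper: a Boolean-only nullability check that mirrors parseNull but skips AST construction. The name nullable is mine.

nullable :: Pe -> State Grammar Bool
nullable pe = leastFix S.empty where
  leastFix guessed = do
    (b, (_, guessed')) <- runStateT (go pe) (S.empty, guessed)
    if S.size guessed == S.size guessed' then pure b else leastFix guessed'
  -- State: nonterminals visited this pass, and nonterminals guessed nullable.
  go :: Pe -> StateT (S.Set String, S.Set String) (State Grammar) Bool
  go x = case x of
    Eps _  -> pure True
    Del d  -> go d
    Nul    -> pure False
    Ch _   -> pure False
    Or xs  -> or <$> mapM go xs
    a :. b -> (&&) <$> go a <*> go b
    NT s   -> do
      (seen, guessed) <- get
      case () of
        () | S.member s guessed -> pure True
           | S.member s seen -> pure False
           | otherwise -> do
             modify $ first $ S.insert s
             b <- go =<< lift (memoDerive s)
             when b $ modify $ second $ S.insert s
             pure b

In derive, the test b <- parseNull x followed by isBad b would then become b <- nullable x followed by not b.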

And yet, on my laptop:

Bad

real    0m0.220s
user    0m0.215s
sys     0m0.005s

Clearly, parsing with derivatives is efficient when run on the allegedly exponential-running-time example given by Cox.

The moral of the story

It’s best to test drive an algorithm before condemning it. If we see hilariously bad running times, then we can include them to hammer our points home. If we see surprisingly good running times, then there’s a mistake in our reasoning and we should keep quiet until we successfully attack the algorithm from another angle. (Cox rightly notes parsing with derivatives forgoes two key properties of yacc: linear running time and ambiguity detection. If only he had focused on these trade-offs.)

Is such a test drive practical for parsing with derivatives? Well, we have presented an entire program, yet we have written less code than appears in Cox’s excellent article on regular expressions, which quotes just a few choice cuts from a presumably complete program. Indeed, with a splash of HTML, we can easily build an interactive online demo of parsing with derivatives.

The existence of the flawed post indicates no such sanity check was done, either because of a poor understanding of lazy evaluation or because implementing a lazy algorithm was deemed too troublesome. Both problems are solved by learning a lazy language.

In sum, insufficient experience with lazy evaluation leads to faulty time complexity analysis. Therefore we should all be comfortable with lazy languages so computer science can progress unimpeded.

2 comments:

Jon Harrop said...

Have you seen this? https://www.cs.ru.nl/bachelors-theses/2018/Timo_Maarse___4416295___Parsing_with_derivatives_in_Haskell.pdf

"The authors also provide a Haskell implementation [5], but it has the following issues: 1. It cannot handle, for example, the following grammar: A → AA | a"

Ben Lynn said...

Thanks! I skimmed it just now.

I'm happy to report my demo (https://crypto.stanford.edu/~blynn/haskell/pwd.html) seems to handle S = S S | "a"; correctly.