Tuesday, June 26, 2018

Why Laziness Matters

Should a programming language be lazy by default? Robert Harper says no. Lennart Augustsson says yes. No matter who is right, I say all computer scientists should become fluent in a lazy language, whether or not they speak it in daily life.

My evidence is a post by Russ Cox on parsing with derivatives: a very experienced programmer very convincingly argues why a parsing algorithm has exponential time complexity. But the claims are very wrong; Adams, Hollenbeck, and Might proved the algorithm is cubic.

How did he err so badly? Did he underestimate the power of lazy evaluation?

I once exclusively wrote eager code, and I imagine my younger self would have agreed with his analysis without a second thought. Today I know better. Marvel at these lines by Doug McIlroy:

int fs = 0 : zipWith (/) fs [1..]    -- integral from 0 to x
sins = int coss
coss = 1 - int sins

It seems too good to be true. Indistinguishable from magic perhaps. But somehow it all works when lazily evaluated. Beware of summarily dismissing lazy code because it looks implausibly amazing.

Also consider an earlier article by the same author on regular expressions. Again, a very experienced programmer very convincingly argues why a parsing algorithm has exponential time complexity. In this post, however, the claims are solid, and backed up by graphs of running times. (It’s worth reading by the way: it tells the tragedy of how popular regular expression implementations became sluggish twisted mockeries of true regular expressions, while offering hope for the future. My only criticism is it fails to mention regular expression derivatives.)

Why does the erroneous post lack similar graphs? Why didn’t the author throw some code together and benchmark it to produce damning evidence?

Perhaps he thought it was too tedious. This would imply unfamiliarity with lazy languages, because prototyping parsing with derivatives in Haskell is easier than criticizing it.

Preliminaries

We define a Pe data structure to represent parsing expressions, that is, the right-hand side of the production rules of a grammar.

import Control.Arrow
import Control.Monad.State
import qualified Data.Map as M
import qualified Data.Set as S

-- NT = non-terminal. (:.) = concatenation.
data Pe = NT String | Eps Char | Nul | Ch Char | Or [Pe] | Pe :. Pe | Del Pe

Although it represents the empty string, the Eps (for epsilon) expression holds a character that winds up in the abstract syntax tree (AST) returned by the parser. Similarly, the Del (for delta) expression, which is only generated internally, holds an expression which later helps build an AST.

A context-free grammar maps non-terminal symbols to parsing expressions:

type Grammar = M.Map String Pe

Our ASTs are full binary trees whose leaf nodes are characters (the free magma on the alphabet). The tree structure captures the order the production rules are applied.

data Ast = Bad | Lf Char | Ast :@ Ast deriving Show

isBad :: Ast -> Bool
isBad Bad = True
isBad _   = False

The Bad AST is returned for unparseable strings. An alternative is to drop Bad and replace Ast with Maybe Ast throughout our code.

A fancier parser might return a parse forest, that is, all parse trees for a given input. Ours simply settles on one parse tree.

Parsing with derivatives

To parse an input string, we first take successive derivatives of the start symbol with respect to each character of the input, taking care to leave bread crumbs in the Eps and Del expressions to record consumed characters. (The Del constructor is named for the delta symbol from the paper, but I also think of it as "deleted", because it remembers what has just been deleted from the input.)

Then the string is accepted if and only if the resulting expression is nullable, that is, accepts the empty string. As we traverse the expression to determine nullability, we also build an AST to return.

We memoize derivatives by adding entries to a state of type Grammar. Initially, this cache contains only the input grammar, mapping nonterminal symbols to Pe values. Later, we place a derivative at the key formed by concatenating the characters involved in the derivative with the nonterminal symbol being derived.

For example, if S is a nonterminal in the input grammar, then abS maps to derive 'a' (derive 'b' (NT "S")). We assume no nonterminal symbol in the input grammar is a suffix of any other nonterminal symbol, which is fine for a prototype.

It may help to imagine the grammar growing over time, gaining new production rules as we process input characters. Indeed, we consider nonterminals to refer to both nonterminals in the input grammar as well as their derivatives.

parse :: Grammar -> String -> String -> Ast
parse g start s = evalState (parseNull $ NT $ reverse s ++ start) g

Computing nullability requires finding a least fixed point. I found this the toughest part of the algorithm, partly because they never taught fixed point theory when I was in school. For some reason, the method reminds me of Hopcroft’s algorithm to minimize a DFA, where we repeatedly refine a partition until we reach a stable answer.

We initially guess each nonterminal is not nullable, which means it corresponds to the Bad AST. On encountering a nonterminal, if we’ve already seen it, then return our guess for that nonterminal. Otherwise, it’s the first time we’ve seen it and instead of guessing, we recursively traverse its corresponding expression. In doing so, we may discover our guess is wrong, so we correct it if necessary before returning an AST.

We repeat until our guesses stabilize. Guesses never change from a good AST to Bad, and the map of all guesses only changes if a guess is revised from Bad to a good AST. We exploit these facts to simplify our code slightly.

parseNull :: Pe -> State Grammar Ast
parseNull pe = leastFix M.empty where
  leastFix guessed = do
    (b, (_, guessed')) <- runStateT (visit pe) (S.empty, guessed)
    if M.size guessed == M.size guessed' then pure b else leastFix guessed'

visit :: Pe -> StateT (S.Set String, M.Map String Ast) (State Grammar) Ast
visit pe = case pe of
  Eps x  -> pure $ Lf x
  Del x  -> visit x
  Nul    -> pure Bad
  Ch _   -> pure Bad
  Or xs  -> chainsaw <$> mapM visit xs
  x :. y -> mul <$> visit x <*> visit y
  NT s -> do
    (seen, guessed) <- get
    case () of
      () | Just x <- M.lookup s guessed -> pure x
         | S.member s seen -> pure Bad
         | otherwise -> do
           modify $ first $ S.insert s
           b <- visit =<< lift (memoDerive s)
           unless (isBad b) $ modify $ second $ M.insert s b
           pure b

mul :: Ast -> Ast -> Ast
mul Bad _ = Bad
mul _ Bad = Bad
mul x y   = x :@ y

-- | Helps cut a non-empty parse forest down to one tree.
chainsaw :: [Ast] -> Ast
chainsaw xs | null xs'   = Bad
            | otherwise  = head xs'
            where xs' = filter (not . isBad) xs

Memoized derivatives are straightforward. For computing derivatives, we translate the rules given in the paper, and for memoization, on discovering a missing entry, we insert a knot-tying value before recursing, and replace it with the result of the recursion afteward.

memoDerive :: String -> State Grammar Pe
memoDerive cs@(c:s) = do
  m <- get
  unless (M.member cs m) $ do
    modify $ M.insert cs $ NT cs
    d <- derive c =<< memoDerive s
    modify $ M.insert cs d
  gets (M.! cs)
memoDerive _ = error "unreachable"

derive :: Char -> Pe -> State Grammar Pe
derive c pe = case pe of
  NT s             -> pure $ NT $ c:s
  Ch x | x == c    -> pure $ Eps x
  Or xs            -> Or <$> mapM (derive c) xs
  Del x :. y       -> (Del x :.) <$> derive c y
  x :. y           -> do
    b <- parseNull x
    dx <- derive c x
    if isBad b then pure $ dx :. y else do
      dy <- derive c y
      pure $ Or [dx :. y, Del x :. dy]
  _                -> pure Nul

Here’s the grammar that Cox claims will grind our parser to a halt:

cox :: Grammar
cox = M.fromList
  [ ("S", NT "T")
  , ("T", Or [NT "T" :. (Ch '+' :. NT "T"), NT "N"])
  , ("N", Ch '1')
  ]

Let’s try it on a small input in an interactive interpreter:

parse cox "S" "1+1+1"

The parser picks a particular parse tree:

(Lf '1' :@ (Lf '+' :@ Lf '1')) :@ (Lf '+' :@ Lf '1')

How about all strings of length 7 consisting of 1 or +?

filter (not . isBad . parse cox "S") $ replicateM 7 "+1"

Thankfully, we get:

["1+1+1+1"]

At last, it’s time to demolish Cox’s claims. We parse an 80-character input with a typo near the end:

main :: IO ()
main = print $ parse cox "S" $ concat (replicate 39 "1+") ++ "+1"

Our prototype is awful. We really should:

  • Add a slimmed down version of parseNull that returns a boolean instead of an AST, and call this in derive. We only want to recover the AST once the whole string has been parsed; the rest of the time, we only care whether an expression is nullable.

  • Use a better algorithm for finding the least fixed point. We’ve perhaps chosen the clunkiest and most obvious method.

  • Remove a layer of indirection when tying the knot. That is, point to a node directly rather than a string (which requires another lookup to get at the node).

  • Apply algebraic identities to reduce the number of nodes in parsing expressions and abstract syntax trees.

And yet, on my laptop:

Bad

real    0m0.220s
user    0m0.215s
sys     0m0.005s

Clearly, parsing with derivatives is efficient when run on the allegedly exponential-running-time example given by Cox.

The moral of the story

It’s best to test drive an algorithm before condemning it. If we see hilariously bad running times, then we can include them to hammer our points home. If we see surprisingly good running times, then there’s a mistake in our reasoning and we should keep quiet until we successfully attack the algorithm from another angle. (Cox rightly notes parsing with derivatives forgoes two key properties of yacc: linear running time and ambiguity detection. If only he had focused on these trade-offs.)

Is this practicable for parsing with derivatives? Well, we have presented an entire program, yet we have written less code than appears in Cox’s excellent article on regular expressions, which quotes just a few choice cuts from a presumably complete program. Indeed, with a splash of HTML, we can easily build an interactive online demo of parsing with derivatives.

The existence of the flawed post indicates no such sanity check was done. This was caused by poor understanding of lazy evaluation, or because it was deemed too troublesome to implement a lazy algorithm. Both problems are solved by learning a lazy language.

In sum, insufficient experience with lazy evaluation leads to faulty time complexity analysis. Therefore we should all be comfortable with lazy languages so computer science can progress unimpeded.

Monday, June 4, 2018

Regex Derivatives

Like many of my generation, I was taught to use Thompson’s construction to convert a regular expression to a deterministic finite automaton (DFA). Namely, we draw tiny graphs for each component of a given regular expression, and stitch them together to form a nondeterministic finite automaton (NFA), which we then convert to a DFA.

The ideas are interesting. Sadly, there is no other reason to study them, because there’s a simpler approach that:

  • Constructs a DFA directly from a regular expression. Forget NFAs.

  • Supports richer regular expressions. Behold, logical AND and NOT: [a-z]+&!(do|for|if|while)

  • Immediately obtains smaller and often minimal DFAs in realistic applications.

All this is expertly explained in Regular-expression derivatives reexamined by Owens, Reppy, and Turon. To my chagrin, I only stumbled across it recently, almost a decade after its publication. And after I had already written a regex tool.

But it could be worse: the authors note the superior method was published over 50 years ago by Brzozowski, before being "lost in the sands of time".

Derive to succeed

Take "standard" regular expressions. We have the constants:

  • \$\emptyset\$: accepts nothing; the empty language.

  • \$\epsilon\$: accepts the empty string.

  • \$c\$: accepts the character \$c\$.

and regexes built from other regexes \$r\$ and \$s\$:

  • \$rs\$: the language built from all pairwise concatenations of strings in \$r\$ and strings in \$s\$.

  • \(r\mid s\): logical or (alternation); the union of the two languages.

  • \$r\mbox{*}\$: Kleene closure; zero or more strings of \$r\$ concatenated together.

Then solve two problems:

  1. Determine if a regex accepts the empty string.

  2. For a character \$c\$ and a regex \$f\$, find a regex that accepts a string \$s\$ precisely when \$f\$ accepts \$c\$ followed by \$s\$. For example, feeding a to ab*c|d*e*f|g*ah results in the regex b*c|h.

The first problem is little more than a reading comprehension quiz. Going down the list, we see the answers are: no; yes; no; exactly when \$r\$ and \$s\$ do; exactly when \$r\$ or \$s\$ do; yes.

import Data.List

data Re = Nul | Eps | Lit Char | Kleene Re | Re :. Re | Alt [Re]
  deriving (Eq, Ord)

nullable :: Re -> Bool
nullable re = case re of
  Nul      -> False
  Eps      -> True
  Lit _    -> False
  r :. s   -> nullable r && nullable s
  Alt rs   -> any nullable rs
  Kleene _ -> True

In the second problem, the base cases remain easy: return \$\emptyset\$ except for the constant \$c\$, in which case return \$\epsilon\$.

The recursive cases are tougher. Given \(r\mid s\), solve the problem on both alternatives to get \$r'\$ and \$s'\$ then return \(r'\mid s'\). For \(r\mbox{*}\), return \(r'r\mbox{*}\).

The trickiest is concatenation: \$rs\$. First, determine if \$r\$ accepts the empty string (the problem we just solved). If so, return \(r's\mid s'\). If not, return \(r's\).

The answer to the second problem is the derivative of the regex \$f\$ with respect to the character \$c\$, and denoted \$\partial_c f\$.

derive :: Char -> Re -> Re
derive c f = case f of
  Nul                 -> Nul
  Eps                 -> Nul
  Lit a  | a == c     -> Eps
         | otherwise  -> Nul
  r :. s | nullable r -> mkAlt [dc r :. s, dc s]
         | otherwise  -> dc r :. s
  Alt rs              -> mkAlt $ dc <$> rs
  Kleene r            -> dc r :. f
  where dc = derive c

For now, pretend mkAlt = Alt. We shall soon reveal its true definition, why we need it, and why we represent an alternation with a list.

The regex is the state

We can now directly construct a DFA for any regex \$r\$.

Each state of our DFA corresponds to a regex. The start state is the input regex \$r\$. For each character \$c\$, create the state \$\partial_c r\$ if it doesn’t already exist, then draw an arrow labeled \$c\$ from \$r\$ to \$\partial_c r\$.

Repeat on all newly created states. The accepting states are those which accept the empty string. Done!

mkDfa :: Re -> ([Re], Re, [Re], [((Re, Re), Char)])
mkDfa r = (states, r, filter nullable states, edges) where
  (states, edges) = explore ([r], []) r
  explore gr q = foldl' (goto q) gr ['a'..'z']
  goto q (qs, es) c | qc `elem` qs = (qs, es1)
                    | otherwise    = explore (qc:qs, es1) qc
                    where qc  = derive c q
                          es1 = ((q, qc), c):es

So long as we’re mindful that the logical or operation is idempotent, commutative, and associative, that is, \(r\mid r = r\), \(r\mid s = s\mid r\), and \((r\mid s)\mid t = r\mid (s\mid t)\), the above is guaranteed to terminate.

This makes sense intuitively, because taking a derivative usually yields a simpler regex. The glaring exception is the Kleene star, but on further inspection, we ought to repeat ourselves eventually after taking enough derivatives so long as we can cope with the proliferating logical ors.

We handle idempotence with nub, commutativity with sort, and associativity by flattening lists:

mkAlt :: [Re] -> Re
mkAlt rs | [r] <- rs' = r
         | otherwise  = Alt rs'
         where rs' = nub $ sort $ concatMap flatAlt rs
               flatAlt (Alt as) = as
               flatAlt a        = [a]

This ties off the loose ends mentioned above, and completes our regex compiler. Not bad for 30 lines or so!

In practice, we apply more algebraic identities before comparing regexes to produce smaller DFAs, which empirically are often optimal. (Ideally, we’d like to tell if two given regexes describe the same language so we could always generate the minimal DFA, but this is too costly.)

Extending regexes

Adding new features to the regex language is easy with derivatives. Given an operation, we only need to:

  1. Determine if it accepts the empty string.

  2. Figure out the rules for its derivative.

(We should prove the algorithm still terminates, but we’ll just eyeball it and wave our hands.)

For example, we get the familiar \(r\mbox{+}\) by rejecting the empty string and defining its derivative to be \(r' r\mbox{*}\). We obtain \$r?\$ by accepting the empty string and defining its derivative to be \$r'\$. But let’s do something more fun.

The logical and \$r&s\$ of regexes \$r\$ and \$s\$ accepts if and only if both \$r\$ and \$s\$ match. Then \$r&s\$ accepts the empty string exactly when both \$r\$ and \$s\$ do (similar to concatenation), and the derivative of \$r&s\$ is \$r'&s'\$.

The complement \$!r\$ of a regex \$r\$ to accepts if and only if \$r\$ rejects. Then \$!r\$ accepts the empty string if and only if \$r\$ rejects it, and the derivative of \$!r\$ is \$!r'\$.

For example, if we write () for \$\epsilon\$ then !()&[a-z]* is the same as [a-z]+.

As before, we can plug these operations into our DFA-maker right away. Good luck doing this with NFAs! Well, I think it’s possible if we add weird rules, e.g. "if we can reach state A and state B, then we can magically reach state C", but then they’d no longer be true NFAs.

The unfortunate, undeserved, and hopefully soon-to-be unlamented prominence of the NFA approach are why these useful operations are considered exotic.

Regularly express yourself