Monday, June 4, 2018

Regex Derivatives

Like many of my generation, I was taught to use Thompson’s construction to convert a regular expression to a deterministic finite automaton (DFA). Namely, we draw tiny graphs for each component of a given regular expression, and stitch them together to form a nondeterministic finite automaton (NFA), which we then convert to a DFA.

The ideas are interesting. Sadly, there is no other reason to study them, because there’s a simpler approach that:

  • Constructs a DFA directly from a regular expression. Forget NFAs.

  • Supports richer regular expressions. Behold, logical AND and NOT: [a-z]+&!(do|for|if|while)

  • Immediately obtains smaller and often minimal DFAs in realistic applications.

All this is expertly explained in Regular-expression derivatives reexamined by Owens, Reppy, and Turon. To my chagrin, I only stumbled across it recently, almost a decade after its publication. And after I had already written a regex tool.

But it could be worse: the authors note the superior method was published over 50 years ago by Brzozowski, before being "lost in the sands of time".

Derive to succeed

Take "standard" regular expressions. We have the constants:

  • \$\emptyset\$: accepts nothing; the empty language.

  • \$\epsilon\$: accepts the empty string.

  • \$c\$: accepts the character \$c\$.

and regexes built from other regexes \$r\$ and \$s\$:

  • \$rs\$: the language built from all pairwise concatenations of strings in \$r\$ and strings in \$s\$.

  • \(r\mid s\): logical or (alternation); the union of the two languages.

  • \$r\mbox{*}\$: Kleene closure; zero or more strings of \$r\$ concatenated together.

Then solve two problems:

  1. Determine if a regex accepts the empty string.

  2. For a character \$c\$ and a regex \$f\$, find a regex that accepts a string \$s\$ precisely when \$f\$ accepts \$c\$ followed by \$s\$. For example, feeding a to ab*c|d*e*f|g*ah results in the regex b*c|h.

The first problem is little more than a reading comprehension quiz. Going down the list, we see the answers are: no; yes; no; exactly when \$r\$ and \$s\$ do; exactly when \$r\$ or \$s\$ do; yes.

import Data.List

data Re = Nul | Eps | Lit Char | Kleene Re | Re :. Re | Alt [Re]
  deriving (Eq, Ord)

nullable :: Re -> Bool
nullable re = case re of
  Nul      -> False
  Eps      -> True
  Lit _    -> False
  r :. s   -> nullable r && nullable s
  Alt rs   -> any nullable rs
  Kleene _ -> True

In the second problem, the base cases remain easy: return \$\emptyset\$ except for the constant \$c\$, in which case return \$\epsilon\$.

The recursive cases are tougher. Given \(r\mid s\), solve the problem on both alternatives to get \$r'\$ and \$s'\$ then return \(r'\mid s'\). For \(r\mbox{*}\), return \(r'r\mbox{*}\).

The trickiest is concatenation: \$rs\$. First, determine if \$r\$ accepts the empty string (the problem we just solved). If so, return \(r's\mid s'\). If not, return \(r's\).

The answer to the second problem is the derivative of the regex \$f\$ with respect to the character \$c\$, and denoted \$\partial_c f\$.

derive :: Char -> Re -> Re
derive c f = case f of
  Nul                 -> Nul
  Eps                 -> Nul
  Lit a  | a == c     -> Eps
         | otherwise  -> Nul
  r :. s | nullable r -> mkAlt [dc r :. s, dc s]
         | otherwise  -> dc r :. s
  Alt rs              -> mkAlt $ dc <$> rs
  Kleene r            -> dc r :. f
  where dc = derive c

For now, pretend mkAlt = Alt. We shall soon reveal its true definition, why we need it, and why we represent an alternation with a list.

The regex is the state

We can now directly construct a DFA for any regex \$r\$.

Each state of our DFA corresponds to a regex. The start state is the input regex \$r\$. For each character \$c\$, create the state \$\partial_c r\$ if it doesn’t already exist, then draw an arrow labeled \$c\$ from \$r\$ to \$\partial_c r\$.

Repeat on all newly created states. The accepting states are those which accept the empty string. Done!

mkDfa :: Re -> ([Re], Re, [Re], [((Re, Re), Char)])
mkDfa r = (states, r, filter nullable states, edges) where
  (states, edges) = explore ([r], []) r
  explore gr q = foldl' (goto q) gr ['a'..'z']
  goto q (qs, es) c | qc `elem` qs = (qs, es1)
                    | otherwise    = explore (qc:qs, es1) qc
                    where qc  = derive c q
                          es1 = ((q, qc), c):es

So long as we’re mindful that the logical or operation is idempotent, commutative, and associative, that is, \(r\mid r = r\), \(r\mid s = s\mid r\), and \((r\mid s)\mid t = r\mid (s\mid t)\), the above is guaranteed to terminate.

This makes sense intuitively, because taking a derivative usually yields a simpler regex. The glaring exception is the Kleene star, but on further inspection, we ought to repeat ourselves eventually after taking enough derivatives so long as we can cope with the proliferating logical ors.

We handle idempotence with nub, commutativity with sort, and associativity by flattening lists:

mkAlt :: [Re] -> Re
mkAlt rs | [r] <- rs' = r
         | otherwise  = Alt rs'
         where rs' = nub $ sort $ concatMap flatAlt rs
               flatAlt (Alt as) = as
               flatAlt a        = [a]

This ties off the loose ends mentioned above, and completes our regex compiler. Not bad for 30 lines or so!

In practice, we apply more algebraic identities before comparing regexes to produce smaller DFAs, which empirically are often optimal. (Ideally, we’d like to tell if two given regexes describe the same language so we could always generate the minimal DFA, but this is too costly.)

Extending regexes

Adding new features to the regex language is easy with derivatives. Given an operation, we only need to:

  1. Determine if it accepts the empty string.

  2. Figure out the rules for its derivative.

(We should prove the algorithm still terminates, but we’ll just eyeball it and wave our hands.)

For example, we get the familiar \(r\mbox{+}\) by rejecting the empty string and defining its derivative to be \(r' r\mbox{*}\). We obtain \$r?\$ by accepting the empty string and defining its derivative to be \$r'\$. But let’s do something more fun.

The logical and \$r&s\$ of regexes \$r\$ and \$s\$ accepts if and only if both \$r\$ and \$s\$ match. Then \$r&s\$ accepts the empty string exactly when both \$r\$ and \$s\$ do (similar to concatenation), and the derivative of \$r&s\$ is \$r'&s'\$.

The complement \$!r\$ of a regex \$r\$ to accepts if and only if \$r\$ rejects. Then \$!r\$ accepts the empty string if and only if \$r\$ rejects it, and the derivative of \$!r\$ is \$!r'\$.

For example, if we write () for \$\epsilon\$ then !()&[a-z]* is the same as [a-z]+.

As before, we can plug these operations into our DFA-maker right away. Good luck doing this with NFAs! Well, I think it’s possible if we add weird rules, e.g. "if we can reach state A and state B, then we can magically reach state C", but then they’d no longer be true NFAs.

The unfortunate, undeserved, and hopefully soon-to-be unlamented prominence of the NFA approach are why these useful operations are considered exotic.

Regularly express yourself

No comments: