Mon, 11 Jun 2018
Parsing with pictures
Derivatives == Earley?
In a cordial Twitter exchange with Matt Might
and David Darais, Prof. Might asked if I was interested
in looking at their derivatives-based approach.
I answered that I was already looking at it --
in fact, Marpa is an optimization of the Might/Darais approach.
This may sound strange.
At first glance, our two algorithms seem about as different as
parsing algorithms can get.
My Marpa parser is an Earley parser, table-driven,
and its parse engine is written in C.
The MDS (Might/Darais/Spiewak) parser is
an extension of regular expressions
which constructs states on the fly.
MDS uses combinators and has implementations
in several functional programming languages.
Grammar Flow Graphs
Why then do I imagine that Marpa is an optimized version of the MDS approach?
The reason is a paper sent to
me by Prof. Keshav Pingali of the University of Texas at Austin:
"Parsing with Pictures", which he wrote with Gianfranco Bilardi.
The title is a little misleading:
their approach is not as informal as it sounds,
and the paper requires a considerable amount of math.
But it is a lot easier than the traditional way
of learning the various approaches to parsing.
The basis of the Pingali-Bilardi approach
is the Grammar Flow Graph (GFG).
These GFGs are the "pictures" of their title.
GFGs are NFAs with recursion added.
As has long been known,
adding recursion to NFAs
allows them to represent any context-free language.
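To make this concrete, here is one way a GFG could be
written down as plain data.
This is my own sketch, in Python, with my own naming;
it loosely follows the paper's construction,
but it is not code from the paper.

    # A sketch of a GFG for the grammar  S ::= 'a' S 'b' | 'a' 'b'
    # (my own naming, not the paper's).  It is an NFA -- states and
    # labeled edges -- plus two new edge kinds: "call" edges, which
    # enter a nonterminal's subgraph, and "return" edges, which leave it.
    gfg = {
        # state: list of (label, next_state); None labels are epsilon moves
        "S.start": [(None, "r1.0"), (None, "r2.0")],
        # rule 1: S ::= 'a' S 'b'
        "r1.0":    [("a", "r1.1")],
        "r1.1":    [(("call", "S"), "S.start")],  # recursion enters here
        "r1.2":    [("b", "r1.3")],               # resumed by a return edge
        "r1.3":    [(None, "S.end")],
        # rule 2: S ::= 'a' 'b'
        "r2.0":    [("a", "r2.1")],
        "r2.1":    [("b", "r2.2")],
        "r2.2":    [(None, "S.end")],
        "S.end":   [(("return", "S"), "r1.2")],   # resume after the call
    }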
Pingali and Bilardi's next step is new.
A GFG can be "simulated" using the same algorithm
used to simulate an NFA.
However, the result is not immediately impressive.
The simulation does produce a recognizer, but not a good recognizer:
some of the strings recognized by
the GFG simulator
are not in the context-free language.
(For example, a naive simulation of a GFG for
S ::= 'a' S 'b' | 'a' 'b'
will accept "abb": after matching "ab", it can take a
"return" edge for a call that was never made,
and then consume the extra 'b'.)
To repair their "simulation",
Pingali and Bilardi add a tag to each state.
This tag tracks where the recursion began.
This not only fixes the recognizer,
but the added information is enough to allow the set
of parse trees to be efficiently recovered from the
sets of GFG states.
In other words, with tags, the GFG recognizer now is a parser.
It turns out that this recognizer-turned-parser is not new.
In fact, it is Earley's algorithm.
Pingali and Bilardi do not stop there.
Using their new framework,
they go on to show that all LL-based and LR-based algorithms
are simplifications of their Earley parser.
From this point of view,
Earley parsing is the foundation of all context-free parsing,
and LL- and LR-based algorithms are Earley optimizations.
Step 1: The MDS algorithm
To show that Marpa is an optimization of the MDS approach,
I will start with the MDS algorithm, and attempt to optimize it.
For its functional programming language,
the MDS paper uses Racket.
The MDS parser is described directly,
in the usual functional language manner,
as a matching operation.
In the MDS paper,
the MDS parser is optimized with laziness and memoization.
Nulls are dealt with by computing their fixed points on the fly.
Even with these three optimizations,
the result is still highly inefficient.
But as an additional step, MDS also
implements "deep recursive simplification" -- in effect,
strategically replacing laziness with eagerness.
With this final optimization,
the MDS paper conjectures that the algorithm's time complexity
is linear for a large class of practical grammars.
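For readers who have not seen the derivatives approach,
the following is a minimal sketch of its core idea,
in Python rather than the paper's Racket.
It handles only ordinary regular expressions:
the recursion that gives MDS its context-free power,
and that forces the laziness, memoization,
and fixed points just described, is left out.

    from dataclasses import dataclass

    # A minimal Brzozowski-derivative matcher for plain regular
    # expressions.  This is only the regular-language core of the
    # idea, not the MDS parser itself.

    class Empty: pass                 # matches no string at all
    class Eps: pass                   # matches only the empty string

    @dataclass
    class Char:
        c: str                        # matches the single character c

    @dataclass
    class Alt:
        left: object                  # matches either alternative
        right: object

    @dataclass
    class Cat:
        first: object                 # matches first, then second
        second: object

    @dataclass
    class Star:
        body: object                  # matches zero or more repetitions

    def nullable(r):
        """Does r match the empty string?"""
        match r:
            case Eps() | Star(_):
                return True
            case Alt(a, b):
                return nullable(a) or nullable(b)
            case Cat(a, b):
                return nullable(a) and nullable(b)
            case _:
                return False

    def derive(r, c):
        """The Brzozowski derivative: what matches after consuming c?"""
        match r:
            case Char(d) if d == c:
                return Eps()
            case Alt(a, b):
                return Alt(derive(a, c), derive(b, c))
            case Cat(a, b) if nullable(a):
                return Alt(Cat(derive(a, c), b), derive(b, c))
            case Cat(a, b):
                return Cat(derive(a, c), b)
            case Star(a):
                return Cat(derive(a, c), r)
            case _:
                return Empty()

    def matches(r, s):
        for c in s:
            r = derive(r, c)
        return nullable(r)

    # ('ab')* matches "abab" but not "aba"
    r = Star(Cat(Char('a'), Char('b')))
    assert matches(r, "abab") and not matches(r, "aba")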
Step 2: Extended regular expressions
Next, we notice that the context-free grammars
of the MDS algorithm
are regular expressions extended to allow recursion.
Essentially, they are
GFGs translated into Racket match expressions.
The equivalence is close enough that
you could imagine the MDS paper using GFGs
for its illustrations.
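To make the "regular expressions plus recursion" point concrete,
the sketch below adds a single reference node
to the regular-expression datatypes of the previous sketch.
Again, the naming is mine, not the MDS paper's.

    from dataclasses import dataclass

    @dataclass
    class Char: c: str
    @dataclass
    class Cat: first: object; second: object
    @dataclass
    class Alt: left: object; right: object
    @dataclass
    class Ref: name: str    # NEW: a by-name reference to a nonterminal

    # S ::= 'a' S 'b' | 'a' 'b' -- a "regular expression" that refers
    # back to itself through the grammar table:
    grammar = {
        "S": Alt(Cat(Char('a'), Cat(Ref("S"), Char('b'))),
                 Cat(Char('a'), Char('b'))),
    }
    # A naive derivative of Ref("S") must consult grammar["S"], whose
    # derivative consults Ref("S") again -- the regress that MDS cuts
    # off with laziness, memoization, and nullability fixed points.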
Step 3: Simulating an NFA
The MDS and GFG approaches are similar in their first step.
Each consumes a single character to produce a new partial parse.
A partial parse, for both of these algorithms,
can be represented as a duple.
One element of the duple is a string representing the unconsumed input.
The other is a representation of a set of parse trees.
In the case of MDS,
in keeping with its functional approach,
the set of parse trees is represented directly.
In the case of the GFG simulator,
the set of parse trees is compressed into a sequence of GFG state-sets.
There is one GFG state-set for the start of parsing,
and one for each consumed character.
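In code, the duple might look like the following sketch.
The names, and the transition function it assumes,
are stand-ins of my own, not from either paper.

    from dataclasses import dataclass

    # A partial parse as a duple (my naming).  For MDS, the second
    # element would hold parse trees directly; for the GFG simulator
    # it is the compressed form: state-set 0 for the start of parsing,
    # plus one state-set per consumed character.
    @dataclass
    class PartialParse:
        remaining: str       # the unconsumed input
        state_sets: list     # state_sets[i]: states live after i characters

    def consume(p, transition):
        """Advance by one character, deriving a new state-set from the last.
        transition(state, char) is an assumed GFG/NFA step function."""
        c, rest = p.remaining[0], p.remaining[1:]
        nxt = {s2 for s in p.state_sets[-1] for s2 in transition(s, c)}
        return PartialParse(rest, p.state_sets + [nxt])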
Step 4: Earley's Algorithm
At this point,
with the introduction of the GFG state-sets to represent parse trees,
the MDS algorithm and its optimized GFG equivalent have converged.
Recall from above that this GFG simulator has a bug --
it is over-liberal.
The bug is the one already described,
and our fix is the one already described:
Each GFG state either starts a recursion or is part of one.
We fix the bug by tagging each GFG state with
the index of the GFG state-set that starts its recursion.
Once these tags are added,
the GFG state-sets are exactly Earley sets.
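Here is a minimal Earley recognizer that makes the tagging concrete.
It is a sketch of the textbook algorithm, in Python,
and not Marpa's implementation;
for simplicity it ignores the extra care that nulled rules require.
Each item carries its origin:
the index of the Earley set where its rule was predicted,
which is exactly the tag described above.

    # A minimal Earley recognizer (a sketch, not Marpa's code).
    # Each item is (lhs, rhs, dot, origin); the origin is the tag.
    GRAMMAR = {  # S ::= 'a' S 'b' | 'a' 'b'
        "S": [("a", "S", "b"), ("a", "b")],
    }
    START = "S"

    def recognize(words):
        sets = [set() for _ in range(len(words) + 1)]
        for rhs in GRAMMAR[START]:
            sets[0].add((START, rhs, 0, 0))
        for i in range(len(words) + 1):
            items = list(sets[i])                # worklist for this set
            while items:
                lhs, rhs, dot, origin = items.pop()
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in GRAMMAR:           # PREDICT: recursion starts here
                        for prod in GRAMMAR[sym]:
                            item = (sym, prod, 0, i)
                            if item not in sets[i]:
                                sets[i].add(item); items.append(item)
                    elif i < len(words) and words[i] == sym:   # SCAN
                        sets[i + 1].add((lhs, rhs, dot + 1, origin))
                else:                            # COMPLETE: pop back to the origin set
                    for plhs, prhs, pdot, porigin in list(sets[origin]):
                        if pdot < len(prhs) and prhs[pdot] == lhs:
                            item = (plhs, prhs, pdot + 1, porigin)
                            if item not in sets[i]:
                                sets[i].add(item); items.append(item)
        return any(lhs == START and dot == len(rhs) and origin == 0
                   for lhs, rhs, dot, origin in sets[len(words)])

    assert recognize("aabb") and not recognize("aab")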
Step 5: The Leo optimization
Next we incorporate an optimization by Joop Leo,
which makes Earley parsing linear for all LR-regular grammars,
without using lookahead.
Since LR-regular is a superset of LR(k) and LL(k),
we do not bother with lookahead.
Step 6: Marpa
To get from an Earley/Leo parser to a Marpa parser,
we need to address one more major point.
In modern parsing practice,
programmers expect the ability to introduce procedural logic,
even to the point of switching parsers.
We accomplish this by ensuring that the processing of each
Earley set is complete before processing of the next Earley set begins.
This means that Marpa has available full information
about the parse so far -- it is left-eidetic.
Error reporting is unexcelled,
and procedural logic can use this information as well.
Full parse information implies full knowledge of which
tokens are expected next.
This means that you can write a liberal HTML parser,
starting with a very illiberal HTML grammar.
When the illiberal parser encounters a point where it cannot
continue because of a missing token,
procedural logic can ask
what the expected token is;
concoct that token on the fly;
supply that token to the illiberal parser;
and then ask the illiberal parser to continue.
This is called the "Ruby Slippers" technique,
and an HTML parser based on it has been implemented.
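In outline, the Ruby Slippers loop looks like the sketch below.
The parser object and its methods
are hypothetical stand-ins for illustration,
not Marpa's actual interface.

    # The shape of the Ruby Slippers technique.  All method names here
    # are hypothetical stand-ins, not Marpa's actual API.
    def make_virtual_token(expected):
        """Concoct a token the parser will accept, e.g. a missing end tag."""
        return sorted(expected)[0]    # a simplistic policy, for illustration

    def ruby_slippers_parse(parser, tokens):
        for tok in tokens:
            while not parser.read(tok):              # the illiberal parser balks
                expected = parser.expected_tokens()  # what would fit here?
                if not expected:
                    raise SyntaxError(f"cannot continue at {tok!r}")
                parser.offer(make_virtual_token(expected))  # supply it on the fly
        return parser.finish()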
The way forward
The MDS algorithm has its own approach to optimization,
one which takes maximum advantage of functional programming.
In contrast, Marpa relies on C-level coding.
One example of the contrast in optimization techniques
is null handling.
Recall from above that, to deal with null processing,
the MDS algorithm uses a fixed-point
algorithm on the fly.
Marpa, on the other hand,
precomputes a list of nullable symbols
using bit vectors before parsing begins.
In this particular case, Marpa will usually be the winner.
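As an illustration of this precomputation,
here is a sketch in Python, using integers as bit vectors.
This is my illustration of the idea;
Marpa's actual C implementation differs.

    # Precomputing nullable symbols with bit vectors (my illustration,
    # not Marpa's C code).  Python integers stand in for the vectors.
    def nullable_symbols(symbols, rules):
        """rules: list of (lhs, rhs) pairs; returns the nullable symbols."""
        index = {s: i for i, s in enumerate(symbols)}
        nullable = 0                               # one bit per symbol
        for lhs, rhs in rules:                     # seed: empty right-hand sides
            if not rhs:
                nullable |= 1 << index[lhs]
        changed = True
        while changed:                             # iterate to a fixed point
            changed = False
            for lhs, rhs in rules:
                bit = 1 << index[lhs]
                if not (nullable & bit) and all(
                        nullable & (1 << index[s]) for s in rhs):
                    nullable |= bit
                    changed = True
        return {s for s in symbols if nullable & (1 << index[s])}

    # S ::= A B;  A ::= 'a' | ();  B ::= ()
    syms = ["S", "A", "B", "a"]
    rules = [("S", ["A", "B"]), ("A", ["a"]), ("A", []), ("B", [])]
    assert nullable_symbols(syms, rules) == {"S", "A", "B"}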
A second case in which hand optimization wins is the
all-important Leo optimization.
Simon Peyton Jones is a very smart man,
but nonetheless I believe that
the day that GHC will look at a functional specification
of an Earley parser and rediscover the Leo optimization
at compile time is far off.
On the other hand,
I do not imagine that hand optimization and/or C language is the
winner in all cases.
A programmer is deceiving himself if he imagines that he can
spot all the cases where lazy evaluation or memoization will be effective.
And of course,
even an omniscient programmer is not going to be there at run-time
to do "just in time" optimization.
Perhaps the optimal parser of the future will combine important hand optimizations
with functional programming.
To learn about Marpa,
my Earley/Leo-based parsing project,
there is the
semi-official web site, maintained by Ron Savage.
The official, but more limited, Marpa website
is my personal one.
Comments on this post can be made in
Marpa's Google group,
or on our IRC channel: #marpa at freenode.net.