Fri, 18 Nov 2011
What is the Marpa algorithm?
I have referred to
"the Marpa algorithm"
What is that?
involves many details,
but the Marpa algorithm itself is basically four ideas.
Of these only the most recent is mine.
The other three come
spanning over 40 years.
Idea 1: Parse by determining which rules can be applied where
The first idea is to track the progress of the a parse by determining,
for each token, which rules can be applied and where.
Sounds pretty obvious.
Not-so-obvious is how
to do this efficiently.
most parsing these days uses
some sort of shortcut.
Regexes and LALR (yacc, etc.) require the grammar to take a restricted form,
so that they can convert the rules into a state machine.
Recursive descent, rather than list the possibilities, dives into
them one by one.
It, too, only works well with grammars of a certain kind.
Jay Earley described an parsing algorithm that
went from left to right,
and worked by determining which rules applied where.
Earley's was reasonably fast,
but the severe limits of 1970's hardware pushed less powerful
parsing algorithms to the forefront,
where they remain.
Jay Earley soon left the computer field to become a
His ideas remain the basis of much of general BNF parsing.
Marpa is an Earley's parser.
Idea 2: Right-recursion is left-recursion in the mirror
Earley's original algorithm could handle anything you could write
in BNF: ambiguous, infinitely ambiguous, left-recursive, right-recursive, you name it.
And Earley's was very good at left-recursion -- blazingly fast.
With right-recursion, however, Earley's original algorithm went quadratic.
For many applications, quadratic time is unacceptable.
Right-recursion is just left-recursion seen backwards.
Joop Leo figured out how
to take efficient advantage of this.
Leo modified Earley's to identify potential right-recursions
and parse them "in the mirror", as if they were left-recursions.
With this improvement, Earley's algorithm was now linear for
the LR-regular grammars, a vast class which includes every other
class of grammar in practical use, then and now.
Leo's result went 20 years with few implementations.
Marpa is its first implementation in a general-purpose utility.
Idea 3: Don't dance around the issue of nullables
A pesky problem with Earley's algorithm remained: nullable symbols.
is a nullable symbol
if it represents something that might be omitted,
Examples are the three expression's
in C language
any or all of these may be omitted.
To be considered practical,
a parsing algorithm must work well with grammars
that contain nullables.
Earley's original 1970 algorithm actually had a bug in its handling
There was an easy fix, but it made the algorithm slower.
Since efficiency was already seen as the reason to prefer other
parsers, this was a big deal.
Aycock & Horspool stopped the dancing around the nullable issue.
They rewrote Earley's parse engine, targeting nullables.
The result was an improvement in other respects.
Marpa's parse engine is not that of Aycock & Horspool.
But Marpa's strategy for handling nullables comes directly from their work.
And the design of Marpa's parse engine is heavily influenced by
that described in Aycock & Horspool 2002.
Idea 4: Use the Ruby Slippers
For Marpa to get the benefits of both Leo and Aycock&Horspool,
I needed to combine their
In the process, I realized that behind Earley's original algorithm
lay a promise,
If your parser knows which rules are applicable where,
then it should, in principle,
allow you to use this information
to guide the parsing process.
What was needed was a parse engine which was careful to do
all the other processing BEFORE
it looked at the input.
The Marpa parse engine does this ---
at each parse location, examining the input
is the last thing Marpa does.
By examining the input last,
Marpa makes the information
from the other processing available for determining
what that input should be.
To see how this can be useful,
take the example of liberal HTML ---
HTML in which tags might be missing.
it is possible to parse liberal HTML with an unrealistically simple
An HTML parser can use
a reference grammar which assumes that start and end tags are always
present, even when not required by the standard.
When the input does not conform to this grammar's unrealistic
and it very often will not,
the application can ask the parser what it DOES expect
according to that grammar.
If it's a missing end tag, the application can invent it and
pass it on to the parser.
The parse then continues, quite happily.
I call this technique, where the grammar demands
a perfect world and the application changes the input
to match the grammar's expectations,
Ruby Slippers parsing.
implemented an HTML parser
which uses the Ruby Slippers.
myself a good deal.
It is fast and flexible and IMHO the best way to parse liberal HTML.
Four ideas are essential to the
- Marpa is an Earley's parser,
in that it works by creating lists of possibilities at each token.
- Marpa handles right-recursion using Joop Leo's method.
This makes it O(n) for every class of grammar in practical use today.
- Marpa handles nullables using the ideas of Aycock&Horspool 2002.
As a side effect, this
opens the way to further improvements in the Earley parse engine.
- The Marpa algorithm allows the application to be fully aware,
at all times,
of what is going on in the parse.
Obviously, this benefits error-handling.
it opens the road to powerful new parsing strategies.
- "Earley 1970":
"An efficient context-free parsing algorithm",
Communications of the Association for Computing Machinery, 13:2:94-102, 1970.
- "Leo 1991":
Leo, Joop M.I.M.,
"A General Context-Free Parsing Algorithm Running in Linear Time on Every LR(k) Grammar Without Using Lookahead",
Theoretical Computer Science, Vol. 82, No. 1, 1991, pp 165-176.
- "Aycock&Horspool 2002":
Aycock, John and
Horspool, R. Nigel,
"Practical Earley Parsing",
The Computer Journal, Vol. 45, No. 6, 2002, pp. 620-630.
posted at: 18:09 |
direct link to this entry