Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website


Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Fri, 18 Nov 2011

What is the Marpa algorithm?

I have referred to "the Marpa algorithm" many times. What is that? The implementation involves many details, but the Marpa algorithm itself is basically four ideas. Of these only the most recent is mine. The other three come from papers spanning over 40 years.

Idea 1: Parse by determining which rules can be applied where

The first idea is to track the progress of the a parse by determining, for each token, which rules can be applied and where. Sounds pretty obvious. Not-so-obvious is how to do this efficiently.

In fact, most parsing these days uses some sort of shortcut. Regexes and LALR (yacc, etc.) require the grammar to take a restricted form, so that they can convert the rules into a state machine. Recursive descent, rather than list the possibilities, dives into them one by one. It, too, only works well with grammars of a certain kind.

In 1970, Jay Earley described an parsing algorithm that went from left to right, and worked by determining which rules applied where. Earley's was reasonably fast, but the severe limits of 1970's hardware pushed less powerful parsing algorithms to the forefront, where they remain. Jay Earley soon left the computer field to become a psychotherapist. His ideas remain the basis of much of general BNF parsing. Marpa is an Earley's parser.

Idea 2: Right-recursion is left-recursion in the mirror

Earley's original algorithm could handle anything you could write in BNF: ambiguous, infinitely ambiguous, left-recursive, right-recursive, you name it. And Earley's was very good at left-recursion -- blazingly fast. With right-recursion, however, Earley's original algorithm went quadratic. For many applications, quadratic time is unacceptable.

Right-recursion is just left-recursion seen backwards. In 1991, Joop Leo figured out how to take efficient advantage of this. Leo modified Earley's to identify potential right-recursions and parse them "in the mirror", as if they were left-recursions.

With this improvement, Earley's algorithm was now linear for the LR-regular grammars, a vast class which includes every other class of grammar in practical use, then and now. Startlingly, Leo's result went 20 years with few implementations. Marpa is its first implementation in a general-purpose utility.

Idea 3: Don't dance around the issue of nullables

A pesky problem with Earley's algorithm remained: nullable symbols. A symbol is a nullable symbol if it represents something that might be omitted, Examples are the three expression's in C language for statements: any or all of these may be omitted. To be considered practical, a parsing algorithm must work well with grammars that contain nullables.

Earley's original 1970 algorithm actually had a bug in its handling of nullables. There was an easy fix, but it made the algorithm slower. Since efficiency was already seen as the reason to prefer other parsers, this was a big deal.

In 2002, Aycock & Horspool stopped the dancing around the nullable issue. They rewrote Earley's parse engine, targeting nullables. The result was an improvement in other respects. Marpa's parse engine is not that of Aycock & Horspool. But Marpa's strategy for handling nullables comes directly from their work. And the design of Marpa's parse engine is heavily influenced by that described in Aycock & Horspool 2002.

Idea 4: Use the Ruby Slippers

For Marpa to get the benefits of both Leo and Aycock&Horspool, I needed to combine their quite different parse engines. In the process, I realized that behind Earley's original algorithm lay a promise, never-fulfilled. If your parser knows which rules are applicable where, then it should, in principle, allow you to use this information to guide the parsing process.

What was needed was a parse engine which was careful to do all the other processing BEFORE it looked at the input. The Marpa parse engine does this --- at each parse location, examining the input is the last thing Marpa does. By examining the input last, Marpa makes the information from the other processing available for determining what that input should be.

To see how this can be useful, take the example of liberal HTML --- HTML in which tags might be missing. With Marpa, it is possible to parse liberal HTML with an unrealistically simple grammar. An HTML parser can use a reference grammar which assumes that start and end tags are always present, even when not required by the standard. When the input does not conform to this grammar's unrealistic expectations, and it very often will not, the application can ask the parser what it DOES expect according to that grammar. If it's a missing end tag, the application can invent it and pass it on to the parser. The parse then continues, quite happily.

I call this technique, where the grammar demands a perfect world and the application changes the input to match the grammar's expectations, Ruby Slippers parsing. I've implemented an HTML parser which uses the Ruby Slippers. I use Marpa::HTML myself a good deal. It is fast and flexible and IMHO the best way to parse liberal HTML.


Four ideas are essential to the Marpa algorithm:


  1. "Earley 1970": Earley, Jay, "An efficient context-free parsing algorithm", Communications of the Association for Computing Machinery, 13:2:94-102, 1970.
  2. "Leo 1991": Leo, Joop M.I.M., "A General Context-Free Parsing Algorithm Running in Linear Time on Every LR(k) Grammar Without Using Lookahead", Theoretical Computer Science, Vol. 82, No. 1, 1991, pp 165-176.
  3. "Aycock&Horspool 2002": Aycock, John and Horspool, R. Nigel, "Practical Earley Parsing", The Computer Journal, Vol. 45, No. 6, 2002, pp. 620-630. Available online.

posted at: 18:09 | direct link to this entry

§         §         §