Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest


Wed, 15 Dec 2010


Killing Yacc: 1, 2 & 3

The Good, the Bad and the Ugly

The recent discussions about yacc made me feel a bit like Lee Van Cleef in an old spaghetti Western. Cast alongside Clint Eastwood, Van Cleef watched with great concern as one attempt after another was made on Eastwood's life. Van Cleef didn't mind Eastwood getting killed -- he just wanted to be the one to do it.

As some of you will have recognized, I am talking about a very interesting discussion started by a paper by Might and Darais entitled "Yacc is Dead". I am finding it very much worth reading as an example of clear and precise mathematical writing. With respect to the parser itself, my opinion is that Russ Cox's extremely well-informed blog post, "Yacc is not Dead", is an accurate assessment.

Might, Darais and Cox all devote considerable attention to exactly what it will take to send yacc on to its Final Reward. I see six requirements: three from Might and Darais, one suggested by Cox, and two that I have added. The rest of this post will be about three of these requirements, all of which focus on parsing speed.

Requirement 1: Handle Arbitrary Context-Free Grammars in O(n**3)

For a parser to be convenient, it should take anything you can write in BNF and parse it. And it should do this in "reasonable" time. This enables a programmer to work on a grammar that does not fit a restricted theoretical framework (LL, LR, LALR, etc.). The programmer then has the choice: she can tighten the grammar up to make it faster, or she can decide that, for her application, worse-than-linear speed is acceptable.

This requirement comes from the Might-Darais paper, but Russ Cox tightens it: "reasonable time" means O(n**3). Might and Darais do not characterize their algorithm's speed for arbitrary context-free grammars, but Cox says that it is exponential (O(e**n)).
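
To make the gap between those two bounds concrete, here is a throwaway Python snippet of mine, not anything from either paper; the absolute numbers are meaningless, only the growth shapes matter.

    import math

    # Compare cubic and exponential growth for a few input sizes.
    for n in (10, 20, 40, 80):
        print(f"n = {n:3d}   n**3 = {n ** 3:>10,d}   e**n = {math.exp(n):.3e}")

At n = 80 the cubic figure is roughly half a million; the exponential one already has 35 digits.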

Cox's tightening of this requirement makes sense. Depending on the application, exponential time can make a grammar unusable in practice. Several algorithms are known that parse arbitrary context-free grammars in O(n**3). Marpa is one of these. (Marpa, for those new to this blog, is a parsing algorithm I've been working on. It is based on Earley's algorithm and includes several major enhancements to it from the academic literature.)
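
To give a concrete picture of the kind of parser this requirement asks for, here is a minimal sketch of the textbook Earley recognizer in Python. This is my own illustration, not Marpa's code; the grammar, the names (recognize, GRAMMAR, START) and the data layout are invented for the example, and for simplicity the sketch assumes the grammar has no empty rules.

    # A textbook Earley recognizer, worst case O(n**3) in the input length.
    # The grammar below is ambiguous, so yacc would report conflicts on it,
    # but a general context-free parser accepts it without complaint.
    # This sketch assumes no empty (nullable) rules.

    GRAMMAR = {
        "E": [("E", "+", "E"), ("n",)],   # E ::= E '+' E | 'n'
    }
    START = "E"

    def recognize(tokens, grammar=GRAMMAR, start=START):
        n = len(tokens)
        # chart[i] holds Earley items: (lhs, rhs, dot position, origin)
        chart = [[] for _ in range(n + 1)]
        seen = [set() for _ in range(n + 1)]

        def add(i, item):
            if item not in seen[i]:
                seen[i].add(item)
                chart[i].append(item)

        for rhs in grammar[start]:
            add(0, (start, rhs, 0, 0))

        for i in range(n + 1):
            pos = 0
            while pos < len(chart[i]):              # worklist over column i
                lhs, rhs, dot, origin = chart[i][pos]
                pos += 1
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in grammar:              # predict a nonterminal
                        for prod in grammar[sym]:
                            add(i, (sym, prod, 0, i))
                    elif i < n and tokens[i] == sym:    # scan a terminal
                        add(i + 1, (lhs, rhs, dot + 1, origin))
                else:                               # complete: lhs spans origin..i
                    for plhs, prhs, pdot, porigin in list(chart[origin]):
                        if pdot < len(prhs) and prhs[pdot] == lhs:
                            add(i, (plhs, prhs, pdot + 1, porigin))

        return any((start, rhs, len(rhs), 0) in seen[n] for rhs in grammar[start])

    print(recognize(list("n+n+n")))   # True -- and the input is ambiguous

Marpa's actual implementation adds far more bookkeeping than this, but the chart-of-items structure is the same.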

Requirement 2: Handle "Average" Grammars in Linear Time

When it needs to parse long inputs, an algorithm has to run in linear (O(n)) time. "Average" grammars -- grammars for languages where the inputs are expected to be long -- should parse in linear time. For example, in production programming, the code files can become quite large. Perhaps HTML files should not be huge, but they often are. And it is quite reasonable to expect XML files to be arbitrarily long. For all of these, and for any language expected to fit into the same kinds of production environment, linear-time parsing is a must. This requirement is based on the second requirement in the Might-Darais paper. Might and Darais only say that such parsing should be "efficient". Russ Cox is more specific: he requires that the time be linear.

Marpa's behavior for "average" grammars is excellent, for any reasonable definition of "average". Marpa parses all LR-regular grammars in linear time. LR-regular is a huge class of grammars, and it includes the LR(k) grammars for all values of k. In practical terms, this means that, if a grammar is parseable by recursive descent, by yacc, or by a regular expression, then it is parsed by Marpa in linear time.

Requirement 3: A Sound Theoretical Basis

This third requirement is Russ Cox's, and it is a real insight on his part. A sound theoretical basis is more important than it may seem. Over the years I have seen many new parsing algorithms introduced, only to disappear. The algorithms which drop from sight are those whose speed claims are based on speculation and/or initial tests, but not on theory.

You might think that testing can replace theory, but typically it can't. The only real way to test that a new algorithm is efficient enough for production is to use it in production. Few production compiler writers are going to risk the use of a new algorithm before there is solid theory to back their leap of faith.

The Might-Darais paper is beautifully written mathematics. But, as Russ Cox notes, nowhere does it characterize the algorithm's speed in mathematical terms. We know that the Might-Darais algorithm will parse some grammars very quickly. We know that, for others, it will be painfully slow. Which ones will be which?

Marpa does well in backing up its speed claims. Marpa is an Earley parser. In his 40-year-old paper, Jay Earley proved that his algorithm parses arbitrary context-free grammars in O(n**3) time. Marpa incorporates Joop Leo's 1991 modification of the Earley parser. In that paper, Leo proves that his algorithm parses LR-regular grammars in linear time. When it comes to speed claims, Marpa is doing things by the book.
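
Right recursion is the classic case where the unmodified Earley algorithm bogs down: every new token extends a chain of completions, and the chart grows quadratically. To make that visible, here is the same textbook recognizer as in the earlier sketch, again my own illustration rather than Marpa's code, trimmed to return the total number of chart items for the right-recursive grammar S ::= 'a' S | 'a'.

    # Plain Earley on a right-recursive grammar: S ::= 'a' S | 'a'.
    # Same textbook recognizer as in the earlier sketch, trimmed to return
    # the total number of chart items instead of an accept/reject answer.

    GRAMMAR = {"S": [("a", "S"), ("a",)]}
    START = "S"

    def earley_item_count(tokens, grammar=GRAMMAR, start=START):
        n = len(tokens)
        chart = [[] for _ in range(n + 1)]
        seen = [set() for _ in range(n + 1)]

        def add(i, item):
            if item not in seen[i]:
                seen[i].add(item)
                chart[i].append(item)

        for rhs in grammar[start]:
            add(0, (start, rhs, 0, 0))

        for i in range(n + 1):
            pos = 0
            while pos < len(chart[i]):
                lhs, rhs, dot, origin = chart[i][pos]
                pos += 1
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in grammar:                    # predict
                        for prod in grammar[sym]:
                            add(i, (sym, prod, 0, i))
                    elif i < n and tokens[i] == sym:      # scan
                        add(i + 1, (lhs, rhs, dot + 1, origin))
                else:                                     # complete
                    for plhs, prhs, pdot, porigin in list(chart[origin]):
                        if pdot < len(prhs) and prhs[pdot] == lhs:
                            add(i, (plhs, prhs, pdot + 1, porigin))

        return sum(len(column) for column in chart)

    for n in (20, 40, 80):
        print(n, earley_item_count(["a"] * n))
    # Item totals grow quadratically, roughly n**2/2 plus a linear term.

Leo's technique, which Marpa incorporates, in effect memoizes the topmost item of each such completion chain, so on grammars like this the chart grows only linearly.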

posted at: 13:59
