Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website

Google+

Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Sun, 07 Sep 2014


Parsing: a timeline

[ Revised 22 Oct 2014 ]

1960: The ALGOL 60 spec comes out. It specifies, for the first time, a block structured language. The ALGOL committee is well aware that nobody knows how to parse such a language. But they believe that, if they specify a block-structured language, a parser for it will be invented. Risky as this approach is, it pays off ...

1961: Ned Irons publishes his ALGOL parser. In fact, the Irons parser is the first parser of any kind to be described in print. Ned's algorithm is a left parser -- a form of recursive descent. Unlike modern recursive descent, the Irons algorithm is general and syntax-driven. "General" means it can parse anything written in BNF. "Syntax-driven" (aka declarative) means that parser is actually created from the BNF -- the parser does not need to be hand-written.

1961: Almost simultaneously, hand-coded approaches to left parsing appear. These we would now recognize as recursive descent. Over the following years, hand-coding approaches will become more popular for left parsers than syntax-driven algorithms. Three factors are at work:

1965: Don Knuth invents LR parsing. Knuth is primarily interested in the mathematics. Knuth describes a parsing algorithm, but it is not thought practical.

1968: Jay Earley invents the algorithm named after him. Like the Irons algorithm, Earley's algorithm is syntax-driven and fully general. Unlike the Irons algorithm, it does not backtrack. Earley's core idea is to track everything about the parse in tables. Earley's algorithm is enticing, but it has three major issues:

1969: Frank DeRemer describes a new variant of Knuth's LR parsing. DeRemer's LALR algorithm requires only a stack and a state table of quite manageable size.

1972: Aho and Ullmann describe a straightforward fix to the zero-length rule bug in Earley's original algorithm. Unfortunately, this fix involves adding even more bookkeeping to Earley's.

1975: Bell Labs converts its C compiler from hand-written recursive descent to DeRemer's LALR algorithm.

1977: The first "Dragon book" comes out. This soon-to-be classic textbook is nicknamed after the drawing on the front cover, in which a knight takes on a dragon. Emblazoned on the knight's lance are the letters "LALR". From here on out, to speak lightly of LALR will be to besmirch the escutcheon of parsing theory.

1979: Bell Laboratories releases Version 7 UNIX. V7 includes what is, by far, the most comprehensive, useable and easily available compiler writing toolkit yet developed. Central to the toolkit is yacc, an LALR based parser generator. With a bit of hackery, yacc parses its own input language, as well as the language of V7's main compiler, the portable C compiler. After two decades of research, it seems that the parsing problem is solved.

1987: Larry Wall introduces Perl 1. Perl embraces complexity like no previous language. Larry uses LALR very aggressively -- to my knowledge more aggressively than anyone before or since.

1991: Joop Leo discovers a way of speeding up right recursions in Earley's algorithm. Leo's algorithm is linear for just about every unambiguous grammar of practical interest, and many ambiguous ones as well. In 1991 hardware is six orders of magnitude faster than 1968 hardware, so that the issue of bookkeeping overhead had receded in importance. This is a major discovery. When it comes to speed, the game has changed in favor of Earley algorithm. But Earley parsing is almost forgotten. It will be 20 years before anyone writes a practical implementation of Leo's algorithm.

1990's: Earley's is forgotten. So everyone in LALR-land is content, right? Wrong. Far from it, in fact. Users of LALR are making unpleasant discoveries. While LALR automatically generates their parsers, debugging them is so hard they could just as easily write the parser by hand. Once debugged, their LALR parsers are fast for correct inputs. But almost all they tell the users about incorrect inputs is that they are incorrect. In Larry's words, LALR is "fast but stupid".

2000: Larry Wall decides on a radical reimplementation of Perl -- Perl 6. Larry does not even consider using LALR again.

2002: Aycock&Horspool publish their attempt at a fast, practical Earley's parser. Missing from it is Joop Leo's improvement -- they seem not to be aware of it. Their own speedup is limited in what it achieves and the complications it introduces can be counter-productive at evaluation time. But buried in their paper is a solution to the zero-length rule bug. And this time the solution requires no additional bookkeeping.

2006: GNU announces that the GCC compiler's parser has been rewritten. For three decades, the industry's flagship C compilers have used LALR as their parser -- proof of the claim that LALR and serious parsing are equivalent. Now, GNU replaces LALR with the technology that it replaced a quarter century earlier: recursive descent.

2000 to today: With the retreat from LALR comes a collapse in the prestige of parsing theory. After a half century, we seem to be back where we started. If you took Ned Iron's original 1961 algorithm, changed the names and dates, and translated the code from the mix of assembler and ALGOL into Haskell, you would easily republish it today, and bill it as as revolutionary and new.

Marpa

Over the years, I had come back to Earley's algorithm again and again. Around 2010, I realized that the original, long-abandoned vision -- an efficient, practical, general and syntax-driven parser -- was now, in fact, quite possible. The necessary pieces had fallen into place.

Aycock&Horspool have solved the zero-length rule bug. Joop Leo had found the speedup for right recursion. And the issue of bookkeeping overhead had pretty much evaporated on its own. Machine operations are now a billion times faster than in 1968, and probably no longer relevant in any case -- caches misses are now the bottleneck.

But while the original issues with Earley's disappeared, a new issue emerged. With a parsing algorithm as powerful as Earley's behind it, a syntax-driven approach can do much more than it can with a left parser. But with the experience with LALR in their collective consciousness, few modern programmers are prepared to trust a purely declarative parser. As Lincoln said, "Once a cat's been burned, he won't even sit on a cold stove."

To be accepted, Marpa needed to allow procedure parsing, not just declarative parsing. So Marpa allows the user to specify events -- occurrences of symbols and rules -- at which declarative parsing pauses. While paused, the application can call procedural logic and single-step forward token by token. The procedural logic can hand control back over to syntax-driven parsing at any point it likes. The Earley tables can provide the procedural logic with full knowledge of the state of the parse so far: all rules recognized in all possible parses so far, and all symbols expected. Earley's algorithm is now a even better companion for hand-written procedural logic than recursive descent.

For more

For more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group.


posted at: 19:50 | direct link to this entry

§         §         §