Fri, 13 May 2011

Bovicide 6: The Final Requirement

This post in one of a series inspired by a web discussion about what it would take to replace yacc and its cousins as the industry standard for parser generators. My interest in the question grows out of Marpa, my project to take the many improvements made over the decades to Earley's algorithm, and turn them into a practical parser generator.

Requirement 6: Available as a Library

To be an adequate yacc replacement, an algorithm should be available as a library. In the Perl context, this means available as a module. I interpret this requirement loosely enough to regard the traditional implementations of LALR parsers, such as the yacc parser generator, as meeting this test. True, they usually generate the code for a custom parser, which then must be compiled as a separate step. But an LALR parser generator could be implemented as a library.

The reason that yacc and its relatives are traditionally not implemented as libraries is revealing: the chances of a practical grammar "just working" with yacc are pretty much zero. In practice you almost never go from writing a yacc grammar directly to a successful execution. In context, therefore, having to invoke the C compiler as a separate stage is only a small additional effort. And it does open up a lot of extra options, such as the use of alternative compilers, linkers and other tools; the full range of options to these tools; the opportunity to custom modify the C code; etc., etc.

Marpa fulfills Requirement 6. It is implemented as a Perl module, and its grammar pre-processing and parsing has been translated into a C library (libmarpa), in which form it runs much faster.

The major parsing technique that clearly fails this requirement is hand-written recursive descent. Hand-written recursive descent parsers are, by definition, not available in a library or any other ready-made form. A hand-written recursive descent parser is straightforward to write, so much so that writing one these days is often preferred to struggling with yacc. Nonetheless code it yourself is what you must do.

A Requirement I Dropped

In this list of requirements, I included the requirements in Russ Cox's post and all but one from the Might-Darais paper. Those I included I sometimes strengthened, and I added two more to the list, dealing with error detection. Error detection is important to include -- I attribute the present turn against the LALR-based family of parser generators (which includes yacc) mainly to LALR's grotesquely bad error detection properties.

What was the requirement I dropped? Might & Darais suggest that the ideal algorithm needs to fit on a single page -- in effect that a high-quality general parser must be something you could whiteboard for your Google interview. To be sure, this would be a useful and charming property in any algorithm. Unsurprisingly, the Might-Darais algorithm has it.

I say "unsurprisingly", because when authors of parsers list requirements for the perfect parser, they have a tendency to shoot the arrow first, then set up the target near where the arrow landed. I could be accused of this.

For Marpa to have fit on single page would have been a wonderful thing. Earley's original algorithm came close. But with each improvement to the Earley algorithm, while speed increased, so did the complexity of the logic. Marpa includes all the applicable improvements in the literature. There was never any hope it would fit on a whiteboard, not even if you follow Steve Yegge's suggestion and carry your own supply of fine-tip dry-erase markers.

Picasso and Angkor Wat

In this later years, Picasso could create a world in a few brush strokes. While writing Marpa, there were moments when a problem which appeared complex dissolved into a simple truth. The nicest of these was when the largely incompatible algorithms of Joop Leo and Aycock-Horspool turned out to be a perfect fit at all the points where it mattered. Even so, there were not enough of these moments to reverse the trend. Nobody will ever call the Marpa algorithm simple.

But disappointment is what happens when you search for the wrong thing. The straightness of the Parthenon's lines is an illusion. Simple straight lines do not look straight. The builders of the Parthenon wanted every line to appear perfectly straight to the human eye, and as a result there is not a prefectly straight line anywhere in the temple.

In South Asia, the great temple complexes glory in their complexity. And in its Frieze, the Parthenon did not even pretend to be simple. Their creators knew that beauty is beyond pattern.

Notes

Note 1: The previous posts in this series were Killing Yacc, Parse-time Error Reporting, and Why the Bovicidal Rage?,

Note 2: The discussion was started by Might and Darais's presentation of their algorithm, titled "Yacc is Dead". Russ Cox responded to that paper with an extremely well-informed blog post, "Yacc is not Dead,.

Note 3: Marpa is available on CPAN in Pure Perl and XS form. Marpa will parse anything you can write in BNF. If a grammar is one of the kinds in practical use, (such as yacc's LALR or recursive descent's LL(1)), you can expect Marpa to parse it in linear time -- O(n). To be specific, Marpa parses any LR-regular grammar in O(n).

Note 4: A yacc user must also create her own lexer. This was considered acceptable since there were tools specifically for creating yacc-compatible lexers, and since, in exchange for the trouble of writing a yacc parser, she expected a extremely efficient parser, one which often justified a hand-written lexer.

Note 5: I did not include every improvement to Earley's proposed in the literature. In particular, Marpa does not use lookahead. Prior to Marpa, Grune and Jacobs had already concluded that the tradeoffs for lookahead in Earley's parsers were dubious.

No amount of lookahead will make a parser faster than linear -- O(n). For the LR(k) grammars, that is, LR grammars with a lookahead of k, Marpa is already O(n) for all finite values of k. The same is true of the LL(k) grammars for all finite values of k. In fact, Marpa is O(n) for LR grammars with infinite lookahead, as long as the lookahead is a regular expression. The overhead from adding explicit lookahead to Marpa would be very real. The benefit would be hard to find.

posted at: 20:22 | direct link to this entry

§ § §