Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website


Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Sun, 26 Dec 2010

Why the Bovicidal Rage? (Killing Yacc: 4)

3299967437_6bae3ce6a8_z.jpg yacc was a major breakthrough. For the first time, automatic generation of of efficient, production-quality parsers was possible for languages of practical interest. Yacc-generated parsers had reasonable memory footprints. They ran in linear time.

But error reporting was overlooked. Then as now, the focus in analyzing algorithms was on power -- what kinds of grammar an algorithm can parse -- and on resource consumption. This leaves out something big.

Our frameworks for analyzing things affect what we believe. We find it hard to recognize a problem if our framework makes us unable to articulate it. Complaints about yacc tended to be kept to oneself. But while yacc's overt reputation flourished, programmers were undergoing an almost Pavlovian conditioning against it -- a conditioning through pain.

With yacc, you no longer need to write your own parser. If you put your grammar into a form that yacc will accept, yacc writes your parser for you. But over the years, people noticed -- it usually takes longer to put a grammar into a form that yacc will accept than it does to write a recursive descent parser from scratch. This is almost always true when someone was using yacc for the first time. But it is usually true for yacc experts as well.

Certain tools do have a high initial cost. They make this acceptable with a payoff over the lifetime of the software. But in this respect also, yacc does worse than hand-written recursive descent -- much worse.

The structure of a recursive descent parser reflects the structure of the language being parsed. Small changes in the language tend to require only small changes in the parser. Major changes usually affect only that part of the grammar actually changed. And how the recursive descent parser must change, and why, is usually obvious even to a programmer who does not make a specialty of parsing.

For a yacc grammar, the change process is not incremental -- each iteration is almost like starting from scratch. yacc parsers work using LALR automata. Small changes in the grammar often cause big changes to the LALR states. Tracing the logic behind these changes is a challenge even to those familiar with the underlying math -- one which they usually find not worth the time and effort. This is one reason that those experienced with yacc find it nearly as hard to use as beginners did -- for real-world problems, understanding the LALR states is simply too hard for anyone. Experts, like non-experts, are forced to fix yacc grammars using trial and error.

yacc's demands follow the nearly incomprehensible logic of the LALR automaton. Adapting a grammar to yacc is a struggle, during which you watch your grammar become less and less a reflection of what you were originally trying to do. One of the most difficult tasks in programming, it is almost one of the most unsatisifying and unrewarding.

The Fourth Requirement for Replacing yacc: Easy Diagnosis of Grammar Problems

The easiest way to deal with grammar problems is to arrange for them not to happen. You can do this if you have a notation for the grammar which is

  1. A natural and intuitive way to express the grammar, and which
  2. Makes it literally impossible to specify a problem grammar.

The notation for regular expressions has these two properties, and that is a major reason for the enduring popularity of regular expressions. Once you get used to its limits, regular expression notation becomes a natural way to describe languages. And regular expression notation makes it impossible to specify anything that is not a regular expression.

More powerful parsing algorithms have these same two properties when they accept all context-free grammars (as Marpa does). The context-free grammars are those you can write in BNF, and vice versa. BNF is (again, modulo some getting-used-to) a natural and intuitive way to express a language, and you don't have to worry about specifying a language which is harder than context-free -- the BNF notation makes that impossible.

yacc uses BNF as its notation for expressing grammars, but the most natural way to express a practical grammar in BNF is almost never an LALR grammar. No natural or intutive notation is known that describes only LALR grammars, even after 40 years of considerable interest in them. I have to doubt that such a notation ever will be found.


Note 1: It has definitely been the tradition to understate the importance of error reporting, or even to ignore it entirely. But I should point out that I have not consulted Knuth's and DeRemer's original papers, which are behind paywalls. Also, things seem to be getting better, particularly with the arrival of the very important textbook by Grune and Jacobs, which does devote significant attention to the error reporting properties of the various algorithms.

Note 2: Perhaps because reporting that you found it impossible to use one of the standard tools in our field was more likely to produce conclusions about your competence than about the tool, the tradition among yacc users was to suffer in silence. One good 1995 account of trials with yacc is this contribution to comp.compilers. And there is one person who I believe has an intuitive understanding of LALR: Larry Wall. Certainly I doubt there is anyone alive whose practical knowledge of LALR is better than Larry's. That makes it very significant that Perl 6 does not use yacc or any other LALR based parser.

Note 3: Here I am talking about pure regular expressions. Perl regexes are much more powerful than regular expressions, and the power comes with tradeoffs: The notation is no longer as simple or intuitive.

posted at: 08:02 | direct link to this entry

§         §         §