Sun, 26 Dec 2010
Why the Bovicidal Rage? (Killing Yacc: 4)
For the first time, automatic generation of
efficient, production-quality parsers was possible
for languages of practical interest.
Yacc-generated parsers had reasonable memory footprints.
They ran in linear time.
But error reporting was overlooked.
Then as now, the focus in analyzing algorithms was on power
-- what kinds of grammar an algorithm can parse --
and on resource consumption.
This leaves out something big.
Our frameworks for analyzing things affect what we believe.
We find it hard to recognize a problem if our
framework makes us unable to articulate it.
Complaints about yacc tended to stay private.
But while yacc's
overt reputation flourished,
programmers were undergoing an almost Pavlovian
conditioning against it --
a conditioning through pain.
With yacc, you no longer need to write
your own parser.
If you put your grammar into a form that yacc will accept,
yacc writes your parser for you.
But over the years, people noticed -- it usually takes longer
to put a grammar into a form that yacc will accept
than it does to write a recursive descent parser from scratch.
This is almost always true when someone is using yacc
for the first time.
But it is usually true for yacc
experts as well.
Certain tools do have a high initial cost.
They make this acceptable with a payoff over
the lifetime of the software.
But in this respect, too,
yacc does worse than a hand-written recursive descent parser.
The structure of a recursive descent parser reflects the structure
of the language being parsed.
Small changes in the language tend to require only small changes in the parser.
Major changes usually affect only that part of the grammar actually changed.
And how the recursive descent parser must change, and why,
is usually obvious even to a programmer
who does not make a specialty of parsing.
For a yacc grammar, the change process
is not incremental -- each iteration is almost like
starting from scratch.
yacc parsers work using LALR automata.
Small changes in the grammar often cause big changes to the LALR states.
Tracing the logic behind these changes is a challenge
even to those
familiar with the underlying math --
one which they usually find not worth the time and effort.
This is one reason that those experienced with yacc find it nearly
as hard to use as beginners do -- for real-world problems,
understanding the LALR states is simply too hard for anyone.
Experts, like non-experts, are forced to fix yacc
grammars by trial and error.
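The classic illustration is the "dangling else".
The grammar below is the natural way to write if-statements --
a textbook sketch, not taken from any real compiler.
Fed to yacc, it draws a shift/reduce conflict, and yacc reports
the problem in terms of its LALR states rather than in terms
of the BNF the programmer actually wrote:

```yacc
/* In "if e1 then if e2 then s1 else s2", which "if" owns
   the "else"?  The natural BNF leaves this ambiguous, and
   yacc announces it as a shift/reduce conflict buried in
   one of its LALR states. */
stmt : IF expr THEN stmt
     | IF expr THEN stmt ELSE stmt
     | other
     ;
```

Resolving this means either living with yacc's default resolution
or rewriting the rules into a shape no one would write by choice.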
Adapting a grammar to yacc is a
struggle, during which you
watch your grammar become less and less a
reflection of what you were originally trying to do.
One of the most difficult tasks
in programming, it is also one of the most unsatisfying.
The Fourth Requirement for Replacing yacc:
Easy Diagnosis of Grammar Problems
The easiest way to deal with grammar problems is to arrange for them
not to happen.
You can do this if you have a notation for the grammar which is
- A natural and intuitive way to express the grammar, and which
- Makes it literally impossible to specify a problem grammar.
The notation for regular expressions
has these two properties,
and that is a major reason for the enduring popularity
of regular expressions.
Once you get used to its limits,
regular expression notation becomes a natural way to express a language.
And regular expression notation
makes it impossible to specify
anything that is not a regular expression.
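To make this concrete, here is a sketch in Python.
Python's re module, like Perl's regexes, extends pure regular
expressions, but the example below stays within the regular subset;
the pattern for floating-point literals is my own illustration:

```python
import re

# Floating-point literals form a regular language, and the
# regex notation for them is short and reads naturally:
number = re.compile(r'-?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?')

assert number.fullmatch('3.14e-10')
assert not number.fullmatch('3.14e-')

# By contrast, balanced parentheses are not a regular language,
# and no pure regular expression for them can be written at all --
# the notation makes that mistake impossible to express.
```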
More powerful parsing algorithms have these same two properties
when they accept all context-free grammars.
The context-free grammars are those you can write in BNF, and vice versa.
BNF is (again, modulo some getting-used-to) a natural and intuitive way
to express a language, and you don't have to worry about specifying a language
which is harder than context-free -- the BNF notation makes that impossible.
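For example, balanced parentheses -- which no regular expression
can describe -- fall out naturally in BNF.
This is a sketch in one common BNF dialect, with nonterminal
names of my own choosing:

```bnf
<balanced> ::= ""
             | "(" <balanced> ")" <balanced>
```

Two short rules, each readable at a glance, and nothing in the
notation lets you stray beyond the context-free languages.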
yacc uses BNF as its notation
for expressing grammars, but the most natural way to express a practical
grammar in BNF is almost never an LALR grammar.
No natural or intuitive notation is known
that describes only LALR grammars, even after 40 years of
considerable interest in them.
I doubt that such a notation will ever be found.
It has definitely been the tradition to
understate the importance of error reporting,
or even to ignore it entirely.
But I should point out
that I have not consulted Knuth's and DeRemer's original papers,
which are behind paywalls.
Also, things seem to be getting better, particularly with the arrival
of the very important textbook by Grune and Jacobs,
which does devote significant attention to the error
reporting properties of the various algorithms.
Perhaps because reporting that you found it impossible to
use one of the standard
tools in our field
was more likely to produce conclusions
about your competence than about the tool,
the tradition among yacc users was to suffer in silence.
One good 1995 account of trials with yacc is
this contribution to comp.compilers.
And there is one person who I
believe has an intuitive understanding of LALR: Larry Wall.
Certainly I doubt there is anyone alive whose practical knowledge
of LALR is better than Larry's.
That makes it very significant that
Perl 6 does not use
yacc or any other LALR based parser.
Here I am talking about pure regular expressions.
Perl regexes are much more powerful than regular expressions,
and the power comes with tradeoffs:
The notation is no longer
as simple or intuitive.
posted at: 08:02