Fri, 13 May 2011
Bovicide 6: The Final Requirement
This post in one of
a web discussion
it would take to replace yacc
and its cousins as the industry standard
for parser generators.
My interest in the question grows out of
my project to take the many improvements made
over the decades to Earley's algorithm,
and turn them into a practical parser generator.
Requirement 6: Available as a Library
To be an adequate yacc replacement,
an algorithm should be available as a library.
In the Perl context,
this means available as a module.
I interpret this requirement loosely
enough to regard the traditional implementations
of LALR parsers,
such as the yacc parser generator,
as meeting this test.
True, they usually
generate the code for a custom parser,
which then must be
compiled as a separate step.
But an LALR parser generator could be implemented
as a library.
The reason that
yacc and its relatives
are traditionally not implemented
as libraries is revealing:
the chances of a practical grammar
"just working" with
are pretty much zero.
In practice you almost never go from writing a yacc
directly to a successful execution.
In context, therefore,
having to invoke
the C compiler as a separate stage is
only a small additional effort.
And it does open up a lot of extra options,
such as the
use of alternative compilers, linkers
and other tools;
the full range of options to
the opportunity to custom modify the C code; etc., etc.
Marpa fulfills Requirement 6.
It is implemented as a Perl
module, and its grammar pre-processing and parsing has been translated
into a C library (libmarpa), in which form it runs much faster.
The major parsing technique that clearly fails this requirement
is hand-written recursive descent.
Hand-written recursive descent parsers are,
by definition, not available in
a library or any other ready-made form.
A hand-written recursive descent parser is straightforward
so much so that writing one these days is often preferred
to struggling with yacc.
Nonetheless code it yourself is what you must do.
A Requirement I Dropped
In this list of requirements, I included the requirements
Russ Cox's post
and all but one from
the Might-Darais paper.
Those I included I sometimes strengthened,
and I added two more to the list,
dealing with error detection.
Error detection is important to include --
the present turn against the LALR-based family of
parser generators (which includes yacc)
mainly to LALR's grotesquely bad error detection properties.
What was the requirement I dropped?
Might & Darais suggest that the ideal algorithm
needs to fit on
a single page --
in effect that a high-quality general parser
must be something you could whiteboard for your
To be sure,
this would be a useful and charming property in any algorithm.
Unsurprisingly, the Might-Darais algorithm has it.
I say "unsurprisingly", because when authors of parsers list
requirements for the perfect parser, they have a tendency to
shoot the arrow first, then set up the target near where
the arrow landed.
I could be accused of this.
For Marpa to have fit on single page would have been a wonderful
thing. Earley's original algorithm came close.
But with each improvement to the Earley algorithm,
while speed increased, so did the complexity of the logic.
all the applicable improvements
in the literature.
There was never any hope it would fit on a whiteboard,
not even if you follow
Yegge's suggestion and carry your own supply of fine-tip dry-erase markers.
Picasso and Angkor Wat
In this later years, Picasso could create a world in a few brush strokes.
While writing Marpa, there were moments when
a problem which appeared complex
dissolved into a simple truth.
The nicest of these was when
the largely incompatible algorithms
of Joop Leo and Aycock-Horspool turned out to
be a perfect fit at all the points where it mattered.
Even so, there were not enough of these moments
to reverse the trend.
Nobody will ever call the Marpa algorithm simple.
But disappointment is what happens when you search
for the wrong thing.
The straightness of the Parthenon's lines is an illusion.
Simple straight lines do not look straight.
The builders of the Parthenon wanted every line to appear
perfectly straight to the human eye, and as a result
not a prefectly straight line
anywhere in the temple.
In South Asia,
the great temple complexes glory in their complexity.
And in its Frieze,
the Parthenon did not even pretend to be simple.
Their creators knew that
beauty is beyond pattern.
The previous posts in this series were
Parse-time Error Reporting, and
Why the Bovicidal Rage?,
The discussion was started by
Might and Darais's presentation of their algorithm,
titled "Yacc is Dead".
Russ Cox responded to that paper with an extremely well-informed blog post,
"Yacc is not Dead,.
Marpa is available on CPAN in
Marpa will parse anything you can write in BNF.
If a grammar is one of the kinds in practical use,
(such as yacc's LALR or recursive descent's LL(1)),
you can expect Marpa to parse it in linear time -- O(n).
To be specific, Marpa parses any
LR-regular grammar in O(n).
Note 4: A yacc
user must also create her own lexer.
This was considered acceptable since there were tools
specifically for creating yacc-compatible lexers,
and since, in exchange for the trouble of writing a yacc
parser, she expected a extremely efficient parser,
one which often justified a hand-written lexer.
I did not include every improvement to Earley's proposed
in the literature.
In particular, Marpa does not use lookahead.
Prior to Marpa,
Grune and Jacobs had already concluded
that the tradeoffs for lookahead in
Earley's parsers were dubious.
No amount of lookahead
will make a parser faster
than linear -- O(n).
For the LR(k) grammars,
that is, LR grammars with a lookahead of k,
Marpa is already O(n) for all finite values of k.
The same is true of the LL(k) grammars for all
finite values of k.
Marpa is O(n) for
LR grammars with infinite lookahead,
as long as the lookahead is a regular expression.
The overhead from adding
explicit lookahead to Marpa would be very real.
The benefit would be hard to find.
posted at: 17:22 |
direct link to this entry