Wed, 15 Dec 2010
Killing Yacc: 1, 2 & 3
The Good, the Bad and The Ugly
The recent discussions about yacc made me
feel a bit like Lee Van Cleef in an old
Cast alongside Clint Eastwood, Van Cleef watched
with great concern as one attempt after another
was made on Eastwood's life.
Van Cleef didn't mind Eastwood getting killed --
he just wanted to be the one to do it.
As some of you will have recognized, I am talking about
a very interesting discussion started by
by Might and Darais
entitled "Yacc is Dead".
I am finding it very much worth reading as an example
of clear and precise mathematical writing.
With respect to the parser itself,
my opinion is that
Russ Cox's extremely well-informed
blog post, "Yacc is not Dead",
is an accurate assessment.
Might, Darais and Cox all devote considerable attention to
exactly what it will take to send yacc on to its Final
I see six requirements:
Three from Might & Darais, one suggested by
Cox, and two that I have added.
The rest of this post will be about three of these requirements,
all of which focus on parsing speed.
Requirement 1: Handle Arbitrary Context-Free Grammars in O(n**3)
For a parser to be convenient, it should take anything you can write in BNF
and parse it.
And it should do this in "reasonable" time.
This enables a programmer to work on a grammar that does not fit a
restricted theoretical framework (LL, LR, LALR, etc.).
The programmer then has the choice:
she can tighten the grammar up to make it faster,
or she can decide that, for her application,
worse-than-linear speed is acceptable.
This requirement is one of those in the Might-Darais paper, but Russ Cox
adds a further requirement:
"Reasonable time" means O(n**3).
Might and Darais do not categorize their algorithm's speed for arbitrary context-free
grammars, but Cox says that it is exponential (O(e**n)).
Cox's tightening of this requirement makes sense.
Depending on the application, exponential time can make a grammar unuseable
Several algorithms are known which parse arbitrary context-free grammars in
is one of these.
(Marpa, for those new to this blog, is a parsing algorithm I've been working on.
It is based on Earley's algorithm, and includes several major enhancements to
it from the academic literature.)
Requirement 2: Handle "Average" Grammars in Linear Time
When it needs to parse long inputs, an algorithm has to run in linear (O(n)) time.
"Average" grammars -- grammars for languages where the inputs are expected to be long --
should parse in linear time.
For example, in production programming, the code files can become quite large.
Perhaps HTML files should not be huge, but they often are.
And it is quite reasonable to expect XML files to be arbitrarily long.
For all of these, and for any language expected to fit into the same kinds
of production environment, linear-time parsing is a must.
This requirement is based on the second requirement in the Might-Darais paper.
Might and Darais only say that such parsing should be "efficient".
Russ Cox is more specific: he requires that the time be linear.
Marpa's behavior for "average" grammars is excellent, for any reasonable
definition of "average".
Marpa parses all LR-regular grammars in linear time.
LR-regular is a huge class of grammars, and it includes the LR(k) grammars
for all values of k.
In practical terms, this means that, if a grammar is parseable by recursive descent,
by yacc, or by a regular expression, then it is parsed by Marpa in linear time.
Requirement 3: A Sound Theoretical Basis
This third requirement is Russ Cox's, and it is a real insight on his part.
A sound theoretical basis is more important than it may seem.
Over the years I have seen many new parsing algorithms introduced, only to
The algorithms which drop from sight
are those whose speed claims are based on speculation
and/or initial tests, but not on theory.
You might think that
testing can replace theory, but typically it can't.
The only real way to test
that a new algorithm is efficient enough for production is to use it in
Few production compiler writers are going to risk the use of a new algorithm
before there is solid theory to back their leap of faith.
The Might-Darais paper is
beautifully written mathematics.
But, as Russ Cox notes,
nowhere does it characterize its speed claims in mathematical terms.
We know that the Might-Darais algorithm will parse some grammars very quickly.
We know that, for others, it will be painfully slow.
Which ones will be which?
Marpa does well in
backing up its speed claims.
Marpa is an Earley parser.
In his 40-year old paper,
Jay Earley proved that his algorithm parses
arbitrary context-free grammars in O(n**3) time.
Marpa incorporates Joop Leo's 1991 modification of the Earley parser.
In that paper, Leo proves that his algorithm
parses LR-regular grammars in linear time.
When it comes to speed claims, Marpa is doing things by the book.
posted at: 13:59 |
direct link to this entry