Sun, 07 Sep 2014
Parsing: a timeline
[ Revised 22 Oct 2014 ]
The ALGOL 60 spec comes out.
It specifies, for the first time, a block structured
The ALGOL committee is well aware
nobody knows how to parse such a language.
But they believe that,
if they specify a block-structured
language, a parser for it will be invented.
Risky as this approach is, it pays off ...
1961: Ned Irons publishes his ALGOL parser.
In fact, the Irons parser
is the first parser of any kind to be described
Ned's algorithm is a left parser --
a form of recursive descent.
the Irons algorithm
is general and syntax-driven.
"General" means it can parse anything written in BNF.
"Syntax-driven" (aka declarative) means that parser is
actually created from the BNF --
the parser does not need
to be hand-written.
Almost simultaneously, hand-coded approaches to left parsing
These we would now recognize as recursive descent.
Over the following years, hand-coding approaches
will become more popular for left parsers
than syntax-driven algorithms.
Three factors are at work:
In 1960's, memory and CPU are both extremely limited.
Hand-coding pays off, even when the gains are small.
Pure left parsing is a very weak parsing technique.
Hand-coding is often necessary
to overcome its limits.
as true today as it is in 1961.
Left parsing works well in combination with hand-coding --
they are a very good fit.
Don Knuth invents LR parsing.
Knuth is primarily interested
in the mathematics.
Knuth describes a parsing algorithm,
but it is not thought practical.
1968: Jay Earley invents the algorithm named after him.
Like the Irons algorithm,
Earley's algorithm is syntax-driven and fully general.
Unlike the Irons algorithm, it does not backtrack.
Earley's core idea is to
track everything about the parse in tables.
Earley's algorithm is enticing, but it has three major issues:
- First, there is a bug in the handling of zero-length rules.
- Second, it is quadratic for right recursions.
- Third, the bookkeeping required to set up the tables is,
by the standards of 1968 hardware, daunting.
Frank DeRemer describes a new variant of Knuth's LR
DeRemer's LALR algorithm requires only
a stack and a state table of quite
Aho and Ullmann describe
a straightforward fix to the zero-length rule bug in Earley's original algorithm.
Unfortunately, this fix involves adding even more bookkeeping to Earley's.
Bell Labs converts its C compiler from hand-written recursive
descent to DeRemer's LALR algorithm.
The first "Dragon book" comes out.
This soon-to-be classic textbook is nicknamed after
the drawing on the front cover,
in which a knight takes on a dragon.
Emblazoned on the knight's lance are the letters "LALR".
From here on out,
to speak lightly of LALR will be to besmirch the escutcheon
of parsing theory.
1979: Bell Laboratories releases Version 7 UNIX.
V7 includes what is, by far,
the most comprehensive, useable and easily available
compiler writing toolkit yet developed.
Central to the toolkit is
yacc, an LALR based parser generator.
With a bit of hackery,
yacc parses its own input language,
as well as the language of V7's main compiler,
the portable C compiler.
After two decades of research,
it seems that the parsing problem is solved.
Larry Wall introduces Perl 1.
Perl embraces complexity like no previous language.
Larry uses LALR very aggressively --
to my knowledge more aggressively than anyone before
Joop Leo discovers a way of speeding up right
recursions in Earley's algorithm.
is linear for just about every unambiguous grammar of
practical interest, and many ambiguous ones as well.
In 1991 hardware is six orders of magnitude faster
than 1968 hardware, so that the
issue of bookkeeping overhead had receded
This is a major discovery.
When it comes to speed,
the game has changed in favor of Earley algorithm.
But Earley parsing is almost forgotten.
It will be 20 years before anyone writes a practical
implementation of Leo's algorithm.
Earley's is forgotten.
So everyone in LALR-land is content, right?
Wrong. Far from it, in fact.
Users of LALR are making unpleasant discoveries.
While LALR automatically
generates their parsers,
is so hard they could just as easily
write the parser by hand.
Once debugged, their LALR parsers are fast for correct inputs.
But almost all they tell the users about incorrect inputs
is that they are incorrect.
In Larry's words, LALR is "fast but stupid".
Larry Wall decides on a radical reimplementation
of Perl -- Perl 6.
Larry does not even consider using LALR again.
Aycock&Horspool publish their attempt at a fast, practical Earley's parser.
Missing from it is Joop Leo's improvement --
they seem not to be aware of it.
Their own speedup is limited in what it achieves
and the complications it introduces
can be counter-productive at evaluation time.
But buried in their paper is a solution to the zero-length rule bug.
And this time the solution requires no additional bookkeeping.
GNU announces that the GCC compiler's parser has been rewritten.
For three decades,
the industry's flagship C compilers have used
LALR as their parser --
proof of the claim that LALR and serious
parsing are equivalent.
Now, GNU replaces
LALR with the technology that
it replaced a quarter century earlier:
2000 to today:
With the retreat from LALR comes a collapse in the
prestige of parsing theory.
After a half century,
we seem to be back
where we started.
If you took Ned Iron's original 1961 algorithm,
changed the names and dates,
and translated the code from the mix of assembler and
ALGOL into Haskell,
you would easily republish it today,
and bill it as
as revolutionary and new.
Over the years,
I had come back to Earley's algorithm again and again.
Around 2010, I realized
that the original, long-abandoned vision --
an efficient, practical, general and syntax-driven parser --
was now, in fact, quite possible.
The necessary pieces had fallen into place.
Aycock&Horspool have solved the zero-length rule bug.
Joop Leo had found the speedup for right recursion.
And the issue of bookkeeping overhead had pretty much evaporated on its
Machine operations are now a billion times faster than in 1968,
and probably no longer relevant in any case --
caches misses are now the bottleneck.
But while the original issues with Earley's disappeared,
a new issue emerged.
With a parsing algorithm as powerful as Earley's behind it,
a syntax-driven approach can do much more than it can with
a left parser.
But with the experience with LALR in their collective consciousness,
few modern programmers are prepared
to trust a purely declarative parser.
As Lincoln said, "Once a cat's been burned,
he won't even sit on a cold stove."
To be accepted, Marpa needed to allow
not just declarative parsing.
So Marpa allows the user to specify events --
occurrences of symbols and rules --
at which declarative parsing pauses.
the application can call procedural logic
and single-step forward token by token.
The procedural logic can hand control back
over to syntax-driven parsing at any point it likes.
The Earley tables can provide the procedural logic with
full knowledge of the state of the
parse so far:
all rules recognized
in all possible parses so far,
and all symbols expected.
Earley's algorithm is now a even better companion
for hand-written procedural logic than recursive descent.
For more about Marpa,
official web site maintained by Ron Savage.
I also have
a Marpa web site.
Comments on this post can be made in
Marpa's Google group.
posted at: 19:50 |
direct link to this entry