Mon, 29 Apr 2013
Is Earley parsing fast enough?
"First we ask, what impact will our algorithm have on the parsing
done in production compilers for existing programming languages?
The answer is, practically none." -- Jay Earley's Ph.D thesis, p. 122.
In the above quote, the inventor of the Earley parsing
algorithm poses a question.
Is his algorithm fast enough for a production compiler? His answer is a
This is the verdict on Earley's that you often
hear repeated today, 45 years later.
Earley's, it is said, has a too high a "constant factor".
Verdicts tends to be repeated more often than examined.
This particular verdict originates with the inventor himself.
So perhaps it is not astonishing
that many treat the dismissal
of Earley's on grounds of speed to be as valid today as it
was in 1968.
But in the past 45 years,
computer technology has changed beyond recognition
have made several significant improvements to Earley's.
It is time to reopen this case.
What is a "constant factor"
The term "constant factor" here has a special meaning,
one worth looking at carefully.
Programmers talk about time efficiency in two ways:
time complexity and speed.
Speed is simple:
It's how fast the algorithm is against the clock.
To make comparison easy,
the clock can be an abstraction.
The clock ticks could be, for example, weighted instructions
on some convenient and mutually-agreed architecture.
By the time Earley was writing, programmers had discovered that simply comparing
even on well-chosen abstract clocks, was not enough.
Computers were improving very quickly.
A speed result
that was clearly significant when the comparison was made
could quickly become unimportant.
Researchers needed to
talk about time efficiency in a way that made what they said as true
decades later as on the day they said it.
To do this, researchers created the idea of time complexity.
Time complexity is measured using several notations, but the most
Here's the idea:
Assume we are comparing two algorithms, Algorithm A and Algorithm B.
Assume that algorithm A uses 42 weighted instructions for each input symbol.
Assume that algorithm B uses 1792 weighted instructions for each input symbol.
Where the count of input symbols is N,
A's speed is 42*N, and B's is 1792*N.
But the time complexity of both is the same: O(N).
The big-O notation throws away the two "constant factors", 42 and 1792.
Both are said to be "linear in N".
(Or more often, just "linear".)
It often happens that algorithms we need to compare for time efficiency
have different speeds,
but the same time complexity.
this usually this means we can treat them as having essentially
the same time efficiency.
But not always.
It sometimes happens that this difference is relevant.
When this happens, the rap against the slower algorithm is that it
has a "high constant factor".
OK, about that high constant factor
What is the "constant factor" between Earley and the current favorite
parsing algorithm, as a number?
(My interest is practical, not historic,
so I will be talking about Earley's
as modernized by Aycock, Horspool, Leo and myself.
But much of what I say applies to Earley's algorithm in general.)
What the current favorite parsing algorithm is
can be an interesting question.
When Earley wrote, it was hand-written recursive descent.
The next year (1969) LALR parsing was invented,
and the year after (1970) a tool that used it was introduced -- yacc.
At points over the next decades,
yacc chased both Earley's
and recursive descent almost completely out of the textbooks.
But as I have detailed elsewhere,
yacc had serious problems.
In 2006 things went full circle -- the industry's standard C
compiler, GCC, replaced LALR with recursive descent.
So back to 1970.
That year, Jay Earley wrote up his algorithm for
"Communications of the ACM",
and put a rough number on his "constant factor".
He said that his algorithm was an "order of magnitude" slower
than the current favorites -- a factor of 10.
Earley suggested ways to lower this 10-handicap,
and modern implementations have followed up on them
and found others.
But for this post,
let's concede the factor of ten and throw
Let's say Earley's is 100 times slower than the current favorite,
whatever that happens to be.
Moore's Law and beyond
Let's look at the handicap of 100
in the light of Moore's Law.
Since 1968, computers have gotten a billion times faster -- nine orders
of magnitude. Nine factors of ten.
This means that today Earley's runs
seven factors of ten faster than
the current favorite algorithm did in
Earley's is 10 million times as fast as the algorithm that was
then considered practical.
Of course, our standard of "fast enough to be practical" also evolves.
But it evolves a lot more slowly.
and say that "practical" meant "takes an hour" in 1968,
but that today we would demand that the same program take only a second.
Do the arithmetic and you find that Earley's is now
more than 2,000 times faster than it needs to be to be practical.
Bringing in Moore's Law is just the beginning.
The handicap Jay Earley gave his algorithm
is based on a straight comparison of CPU speeds.
But parsing, in practical cases, involves I/O.
And the "current favorite" needs to do as much I/O as Earley's.
I/O overheads, and the accompanying context switches,
swamp considerations of CPU speed,
and that is more true today
that it was in 1968.
When an application is I/O bound, CPU is in effect free.
Parsing may not be I/O bound in this sense, but neither
is it one of those applications where the comparison can be made
in raw CPU terms.
Finally, pipelining has changed
the nature of the CPU overhead itself radically.
In 1968, the time to run a series of CPU
instructions varied linearly with the number of instructions.
Today, that is no longer true,
and the change favors strategies like Earley's,
which require a higher instruction count,
but achieve efficiency in other ways.
So far, I've spoken in terms of theoretical speeds, not achievable ones.
That is, I've assumed that both Earley's
and the current favorite are producing their best speed, unimpeded by
Earley, writing in 1968 and thinking of hand-written recursive descent,
assumed that production compilers
could be, and in practice usually would be,
programmers with plenty of time to do
careful and well-thought-out hand-optimization.
After forty-five years of real-life experience,
we know better.
In those widely used practical compilers and interpreters
that rely on lots of procedural logic --
and these days that is almost all of them --
it is usually all the maintainers can do to keep the procedural logic correct.
In all but a few cases, optimization is opportunistic,
Programmers have been exposed to
the realities of parsing with
large amounts of complex procedural logic,
and hand-written recursive descent has acquired a
reputation for being slow.
LALR based compilers are less dependent on procedural
parsing and therefore easier to keep optimal.
In practice they are as bad or worse.
LALR parsers usually still need a considerable amount of procedural logic,
but procedural logic is harder to write for LALR than it
is for recursive descent.
Modern Earley parsing
has a much easier time actually delivering
its theoretical best speed in practice.
Earley's is powerful enough,
and in its modern version well-enough aware of the state of the parse,
that procedural logic can be kept to minimum or eliminated.
Most of the parsing is done by the mathematics at its core.
The math at Earley's core can be heavily optimized,
and any optimization benefits all applications.
Optimization of special-purpose procedural logic benefits
only the application that uses that logic.
But you might say,
"A lot of interesting points, Jeffrey, but all things being
equal, a factor of 10,
or even what's left from a factor of ten once I/O,
pipelining and implementation inefficiencies have all nibbled away at it,
is still worth having.
It may in a lot of instances not even be measurable, but why not grab
it for the sake of the cases where it is?"
Which is a good point.
The "implementation inefficiences" can be nasty enough that Earley's is in
fact faster in raw terms,
but let's assume
that some cost in speed is still being paid for the use of Earley's.
Why incur that cost?
The parsing algorithms currently favored,
in their quest for efficiency,
do not maintain full
information about the state of the parse.
This is fine when the source is 100% correct,
but in practice an important function of a parser is to find and
When the parse fails, the current favorites
often have little idea of why.
An Earley parser knows the full state of the parse.
This added knowledge can save a lot of
The more that a parser does from the grammar,
and the less procedural logic it uses,
the more readable the code will be.
This has a determining effect on maintainance costs
and the software's ability to evolve over time.
Procedural logic can produce inaccuracy -- inability
to describe or control the actual language begin parsed.
Some parsers, particularly LALR and PEG,
have a second major source of inaccuracy -- they use
a precedence scheme for conflict resolution.
In specific cases, this can work, but
precedence-driven conflict resolution
produces a language without
a "clean" theoretical description.
The obvious problem with not knowing what language you
are parsing is failure to parse correct source code.
But another, more subtle, problem can be worse over the
life cycle of a language ...
False positives are cases
where the input is in error,
and should be reported as such, but instead
the result is what you wanted.
This may sound like unexpected good news,
but when a false positive does surface,
it is quite possible that it cannot be fixed
without breaking code that, while incorrect, does work.
Over the life of a language, false positives are deadly.
False positives produce buggy and poorly understood code
which must be preserved and maintained forever.
The modern Earley implementation can parse vast classes
of grammar in linear time.
These classes include all those currently in practical use.
Modern Earley implementations
parse all context-free grammars in times that are, in practice,
With other parsers,
the class of grammars parsed is highly restricted,
and there is usually a real danger that a new change
will violate those restrictions.
the favorite alternatives to Earley's
make it hard to know exactly what language you are,
in fact, parsing.
A change can break one of these parsers
without there being any indication.
syntax changes and extensions to Earley's grammars
For more about Marpa
Above I've spoken of "modern Earley parsing",
by which I've meant Earley parsing as amended and improved
by the efforts of Aho, Horspool, Leo and myself.
At the moment, the only implementation that contains
all of these modernizations is Marpa.
Marpa's latest version is
which is available on CPAN.
a new interface,
which represents a major increase
in Marpa's "whipitupitude".
The SLIF has tutorials
a web page,
and of course it is the focus of
my "Ocean of Awareness" blog.
Comments on this post
can be sent to the Marpa's Google Group:
posted at: 11:12 |
direct link to this entry