Sat, 29 Aug 2015
Fast handy languages
Back around 1980, I had access to UNIX and a language I wanted to parse.
I knew that UNIX had all the latest CS tools.
So I expected to type in my BNF and "Presto, Language!".
Not so easy, I was told.
Languages were difficult things created with complex tools
written by experts who understood the issues.
I recall thinking that,
while English had a syntax that is
as hard as they come,
toddlers manage to parse it
But experts are experts,
and more so at second-hand.
I was steered to an LALR-based parser called yacc.
(Readers may be more familiar with bison, a yacc successor.)
LALR had extended the class of quickly parseable grammars a bit
beyond recursive descent.
But recursive descent was simple in principle,
and its limits were easy to discover and work around.
LALR, on the hand, was OK when it worked, but
figuring out why it failed when it failed
was more like decryption than debugging,
and this was the case both with parser development
and run-time errors.
I soon gave up on yacc
and found another way to solve my problem.
Few people complained about yacc on the Internet.
If you noise it about that you are unable
to figure out how to use
what everybody says is the state-of-the-art tool,
the conclusions drawn may not be the ones you want.
But my experience seems to have been more than common.
LALR's claim to fame was that it was the basis of the
industry-standard C compiler.
Over three decades,
its maintainers suffered amid the general silence.
But by 2006, they'd had enough.
GCC (the new industry standard)
ripped its LALR engine out.
trend back to recursive descent
was well underway.
A surprise discovery
Back in the 1970's,
there had been more powerful alternatives
to LALR and recursive descent.
But they were
to be slow.
For some applications slow is OK.
In 2007 I decided that a parsing tool that parsed
all context-free languages at state-of-the-art speeds,
slow or fast as the case might be,
would be a useful addition to programmer toolkits.
And I ran into a surprise.
Hidden in the literature was an amazing discovery --
an 1991 article by Joop Leo that
described how to modify Earley's
algorithm to be fast for every language class in practical use.
(When I say "fast" in this article, I will mean "linear".)
Leo's article had been almost completely ignored --
my project (Marpa)
would become its first
The implications of Leo's discovery go well beyond speed.
If you can rely on the BNF that you write always producing
a practical parser, you can auto-generate your language.
you can write languages which write languages.
Which languages are fast?
The Leo/Earley algorithm is not fast
for every BNF-expressible language.
BNF is powerful, and you can write exponentially
ambiguous languages in it.
But programmers these days
mostly care about unambiguous languages --
we are accustomed to tools and techniques
that parse only a subset of these.
As I've said, Marpa is fast for every language
class in practical use today.
Marpa is almost certainly fast for any language
that a modern programmer has in mind.
Unless you peek ahead at the hints I am about to give you,
in fact, it is actually
to write an unambiguous
grammar that goes non-linear on Marpa.
Simply mixing up lots of left, right and middle recursions
be enough to make an
unambiguous grammar go non-linear.
You will also need to violate a rule
in the set that
I am about to give you.
To guarantee that Marpa is fast for your BNF language,
follow three rules:
- Rule 1: Your BNF must be unambiguous.
- Rule 2: Your BNF must have no "unmarked" middle recursions.
- Rule 3: All of the right-recursive symbols
in your BNF must be dedicated
to right recursion.
Rule 3 turns out to be very easy to obey.
I discuss it in detail in the next section,
which will be about how to break these rules and
get away with it.
Before we do that,
let's look at what an "unmarked" middle recursion is.
Here's an example of a "marked" middle recursion:
M ::= 'b'
M ::= 'a' M 'a'
Here the "b" symbol is the marker.
This marked middle recursion generates sequences like
a b a
a a b a a
Now for an "unmarked" middle recursion:
M ::= 'a' 'a'
M ::= 'a' M 'a'
This unmarked middle recursion generates sequences like
a a a a
a a a a a a
In this middle recursion there is no marker.
To know where the middle is,
you have to scan all the way to the end,
and then count back.
A rule of thumb is that if you can "eyeball" the middle
of a long sequence,
the recursion is marked.
If you can't, it is unmarked.
Unfortunately, we can't characterize exactly what a marker
must look like -- a marker can encode the moves of a Turing machine,
so marker detection is undecidable.
How to get away with breaking the rules
The rules about ambiguity and recursions are "soft".
If you only use limited ambiguity and
keep your rule-breaking recursions short,
your parser will stay fast.
Above, I promised to explain rule 3, which insisted that
a right recursive symbol be "dedicated".
A right recursive symbol is "dedicated" if it appears only
as the recursive symbol in a right recursion.
If your grammar is unambiguous, but you've used an "undedicated"
right-recursive symbol, that is easy to fix.
Just rewrite the grammar, replacing the "undedicated" symbol
with two different symbols.
Dedicate one of the two to the right recursion,
and use the other symbol everywhere else.
When NOT to use Marpa
The languages I have described as "fast" for Marpa
include all those in practical use and many more.
But do you really want to use Marpa for all of them?
Here are four cases for which Marpa is probably not
your best alternative.
The first case: a language that parses easily with a regular
The regular expression will be much faster.
Don't walk away from a good thing.
The second case:
that is easily parsed using a single
loop and some state that fits into constant space.
This parser might be very easy to write and maintain.
If you are using a much slower higher level language,
Marpa's optimized C language
may be a win on CPU speed.
But, as before, if you have a good thing,
don't walk away from it.
The third case:
a variation on the second.
Here your single loop might be getting out of hand,
making you yearn for the syntax-driven convenience
but your state still fits into constant space.
In its current implementation, Marpa keeps all of its
parse tables forever, so Marpa does
parse in constant space.
Keeping the tables
allows Marpa to deal with the full structure of its
input, in a way that a SAX-ish approaches cannot.
But if space requirements are an issue,
and your application allows a simplified constant-space
Marpa's power and convenience may not be enough to
make up for that.
The fourth case:
a language that
- is very small;
- changes slowly or not at all, and does not grow in complexity;
- merits careful hand-optimization, and has available the staff
to do it;
- merits and has available the kind of on-going support that will
keep your code optimized under changing circumstances; and
- is easily parseable via recursive descent:
It is rare that all of these are the case,
but when that happens,
recursive descent is often preferable to Marpa.
Lua and JSON
are two languages which meet the above criteria.
In Lua's case, it targets platforms with very restricted memories,
which is an additional reason to prefer recursive descent --
Marpa has a relatively large footprint.
It was not good luck that made
both Lua and JSON good targets for recursive descent --
they were designed around its limits.
JSON is a favorite test target of Marpa for just these reasons.
There are carefully hand-optimized C language parsers for us to
We get closer and closer,
but Marpa will
never beat small hand-optimized JSON parsers in software.
However, while recursive descent is a technique for hand-writing parsers,
Marpa is a mathematical algorithm.
instructions for manipulating Earley items could be implemented directly
When and if that day comes,
Earley's algorithm will beat recursive descent even at
parsing the grammars that were designed for it.
Comments on this post can be made in
Marpa's Google group,
or on our IRC channel: #marpa at freenode.net.
To learn more about Marpa,
official web site maintained by Ron Savage.
I also have
a Marpa web site.
posted at: 16:36 |
direct link to this entry