Back around 1980, I had access to UNIX and a language I wanted to parse. I knew that UNIX had all the latest CS tools. So I expected to type in my BNF and "Presto, Language!".
Not so easy, I was told. Languages were difficult things created with complex tools written by experts who understood the issues. I recall thinking that, while English has a syntax that is as hard as they come, toddlers manage to parse it just fine. But experts are experts, and more so at second-hand.
I was steered to an LALR-based parser called yacc. (Readers may be more familiar with bison, a yacc successor.) LALR had extended the class of quickly parseable grammars a bit beyond recursive descent. But recursive descent was simple in principle, and its limits were easy to discover and work around. LALR, on the other hand, was OK when it worked, but figuring out why it failed when it failed was more like decryption than debugging, and this was the case with both parser development and run-time errors. I soon gave up on yacc and found another way to solve my problem.
Few people complained about yacc on the Internet. If you noise it about that you are unable to figure out how to use what everybody says is the state-of-the-art tool, the conclusions drawn may not be the ones you want. But my experience seems to have been a common one.
LALR's claim to fame was that it was the basis of the industry-standard C compiler. Over three decades, its maintainers suffered amid the general silence. But by 2006, they'd had enough. GCC (the new industry standard) ripped its LALR engine out. By then the trend back to recursive descent was well underway.
Back in the 1970's, there had been more powerful alternatives to LALR and recursive descent. But they were reputed to be slow.
For some applications slow is OK. In 2007 I decided that a parsing tool that parsed all context-free languages at state-of-the-art speeds, slow or fast as the case might be, would be a useful addition to programmer toolkits. And I ran into a surprise.
Hidden in the literature was an amazing discovery -- a 1991 article by Joop Leo that described how to modify Earley's algorithm to be fast for every language class in practical use. (When I say "fast" in this article, I will mean "linear".) Leo's article had been almost completely ignored -- my project (Marpa) would become its first practical implementation.
The implications of Leo's discovery go well beyond speed. If you can rely on the BNF that you write always producing a practical parser, you can auto-generate your language. In fact, you can write languages which write languages.
The Leo/Earley algorithm is not fast for every BNF-expressible language. BNF is powerful, and you can write exponentially ambiguous languages in it. But programmers these days mostly care about unambiguous languages -- we are accustomed to tools and techniques that parse only a subset of these.
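For example, this tiny grammar is exponentially ambiguous:

A ::= 'a'
A ::= A A

A string of three 'a's already has two parses -- group the first pair or the last -- and the number of parses grows exponentially (following the Catalan numbers) as the input gets longer.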
As I've said, Marpa is fast for every language class in practical use today. Marpa is almost certainly fast for any language that a modern programmer has in mind. Unless you peek ahead at the hints I am about to give you, in fact, it is actually hard to write an unambiguous grammar that goes non-linear on Marpa. Simply mixing up lots of left, right and middle recursions will not be enough to make an unambiguous grammar go non-linear. You will also need to violate a rule in the set that I am about to give you.
To guarantee that Marpa is fast for your BNF language, follow three rules:

Rule 1: Your BNF must be unambiguous.
Rule 2: Your BNF must have no unmarked middle recursions.
Rule 3: All of the right-recursive symbols in your BNF must be dedicated to right recursion.
Rule 3 turns out to be very easy to obey. I discuss it in detail in the next section, which will be about how to break these rules and get away with it.
Before we do that, let's look at what an "unmarked" middle recursion is. Here's an example of a "marked" middle recursion:
M ::= 'b'
M ::= 'a' M 'a'
Here the "b" symbol is the marker. This marked middle recursion generates sequences like
b
a b a
a a b a a
Now for an "unmarked" middle recursion:
M ::= 'a' 'a'
M ::= 'a' M 'a'
This unmarked middle recursion generates sequences like
a a
a a a a
a a a a a a
In this middle recursion there is no marker. To know where the middle is, you have to scan all the way to the end, and then count back.
A rule of thumb is that if you can "eyeball" the middle of a long sequence, the recursion is marked. If you can't, it is unmarked. Unfortunately, we can't characterize exactly what a marker must look like -- a marker can encode the moves of a Turing machine, so marker detection is undecidable.
The rules about ambiguity and recursions are "soft". If you only use limited ambiguity and keep your rule-breaking recursions short, your parser will stay fast.
Above, I promised to explain rule 3, which insisted that a right recursive symbol be "dedicated". A right recursive symbol is "dedicated" if it appears only as the recursive symbol in a right recursion. If your grammar is unambiguous, but you've used an "undedicated" right-recursive symbol, that is easy to fix. Just rewrite the grammar, replacing the "undedicated" symbol with two different symbols. Dedicate one of the two to the right recursion, and use the other symbol everywhere else.
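To sketch that rewrite with an invented example: in the grammar below, 'seq' is right recursive, but it also appears in the rule for 'pair', so it is not dedicated:

seq ::= 'a'
seq ::= 'a' seq
pair ::= seq ',' seq

Splitting 'seq' in two fixes this. 'seq' is dedicated to the right recursion, and 'seq2', which carries copies of the same rules, is used everywhere else:

seq ::= 'a'
seq ::= 'a' seq
seq2 ::= 'a'
seq2 ::= 'a' seq
pair ::= seq2 ',' seq2

Both grammars generate exactly the same language.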
The languages I have described as "fast" for Marpa include all those in practical use and many more. But do you really want to use Marpa for all of them? Here are four cases for which Marpa is probably not your best alternative.
The first case: a language that parses easily with a regular expression. The regular expression will be much faster. Don't walk away from a good thing.
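To make this concrete, here is a quick sketch in Python (the mini-language -- comma-separated integers -- and the function name are my own, purely for illustration):

import re

# Comma-separated integers, like "12,5,300", form a regular language,
# so one anchored regex both validates and splits the input.
CSV_INTS = re.compile(r'\A\d+(?:,\d+)*\Z')

def parse_int_list(text):
    if not CSV_INTS.match(text):
        raise ValueError("not a comma-separated list of integers")
    return [int(field) for field in text.split(',')]

print(parse_int_list("12,5,300"))   # prints [12, 5, 300]

No grammar and no parse engine needed.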
The second case: a language that is easily parsed using a single loop and some state that fits into constant space. This parser might be very easy to write and maintain. If you are using a much slower higher-level language, Marpa's optimized C engine may be a win on CPU speed. But, as before, if you have a good thing, don't walk away from it.
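A sketch of what I mean, again in Python and again an example of my own: a balanced-parentheses checker is one loop over the input with a single integer of state -- beyond the power of regular expressions, but hardly in need of a general parser:

def balanced(text):
    # One pass over the input; the only state is the nesting depth.
    depth = 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:          # a closer with no matching opener
                return False
    return depth == 0              # every opener must have been closed

print(balanced("(()(()))"))   # prints True
print(balanced("())("))       # prints False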
The third case: a variation on the second. Here your single loop might be getting out of hand, making you yearn for the syntax-driven convenience of Marpa, but your state still fits into constant space. In its current implementation, Marpa keeps all of its parse tables forever, so Marpa does not parse in constant space. Keeping the tables allows Marpa to deal with the full structure of its input, in a way that SAX-ish approaches cannot. But if space requirements are an issue, and your application allows a simplified constant-space approach, Marpa's power and convenience may not be enough to make up for that.
The fourth case: a language that is very small; that is stable, and unlikely to change; and that needs to be parsed at top speed.
It is rare that all of these are the case, but when that happens, recursive descent is often preferable to Marpa. Lua and JSON are two languages which meet the above criteria. Lua, in particular, targets platforms with very restricted memory, which is an additional reason to prefer recursive descent -- Marpa has a relatively large footprint.
It was not good luck that made both Lua and JSON good targets for recursive descent -- they were designed around its limits. JSON is a favorite test target of Marpa for just these reasons. There are carefully hand-optimized C language parsers for us to benchmark against.
We get closer and closer, but Marpa will never beat small hand-optimized JSON parsers in software. However, while recursive descent is a technique for hand-writing parsers, Marpa is a mathematical algorithm. Someday, instructions for manipulating Earley items could be implemented directly in silicon. When and if that day comes, Earley's algorithm will beat recursive descent even at parsing the grammars that were designed for it.
Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.