Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website


Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Tue, 10 Mar 2015

Linear? Yeah right.


I have claimed that my new parser, Marpa, is linear for vast classes of grammars, going well beyond what the traditional parsers can do. But skepticism is justified. When it comes to parsing algorithms, there have been a lot of time complexity claims that are hand-wavy, misleading or just plain false. This post describes how someone, who is exercising the appropriate degree of skepticism, might conclude that believing Marpa's claims is a reasonable and prudent thing to do.

Marpa's linearity claims seem to be, in comparison with the other parsers in practical use today, bold. Marpa claims linearity, not just for every class of grammar for which yacc/bison, PEG and recursive descent currently claim linearity, but for considerably more. (The mathematical details of these claims are in a section at the end.) It seems too good to be true.

Why should I believe you?

The most important thing to realize, in assessing the believability of Marpa's time complexity claims, is that they are not new. They were already proved in a long-accepted paper in the refereed literature. They are the time complexity claims proved by Joop Leo for his algorithm in 1991, over two decades ago. Marpa is derived from Leo's algorithm, and its time complexity claims are those proved for Leo's algorithm.

Above I said that Marpa's time complexity claims "seem" bold. On any objective assessment, they are in fact a bit of a yawn. The claims seem surprising only because a lot of people are unaware of Leo's results. That is, they are surprising in the same sense that someone who had avoided hearing about radio waves would be surprised to learn that he can communicate instantly with someone on the other side of the world.

So, if there's so little to prove, why does the Marpa paper have proofs? In Marpa, I made many implementation decisions about, and some changes to, the Leo/Earley algorithm. None of my changes produced better time complexity results -- my only claim is that I did not change the Leo/Earley algorithm in a way that slowed it down. To convince myself of this claim, I reworked the original proofs of Leo and Earley, changing them to reflect my changes, and demonstrated that the results that Leo had obtained still held.

Proofs of this kind, which introduce no new mathematical techniques, but simply take a previous result and march from here to there by well-know means, are called "tedious". In journals, where there's a need to conserve space, they are usually omitted, especially if, as is the case with Marpa's time complexity proofs, the results are intuitively quite plausible.

Getting from plausible to near-certain

So let's say you are not going to work through every line of Marpa's admittedly tedious proofs. We've seen that the results are intuitively plausible, as long as you don't reject the previous literature. But can we do better than merely "plausible"?

As an aside, many people misunderstand the phrase "mathematically proven", especially as it applies to branches of math like parsing theory. The fact is that proofs in papers often contain errors. Usually these are minor, and don't affect the result. On the other hand, Jay Earley's paper, while one of the best Computer Science papers ever published, also contained a very serious error. And this error slipped past his Ph.D. committee and his referees. Mathematical arguments and proofs do not allow us to achieve absolute certainty. They can only improve the degree of certainty.

There's a second way to dramatically increase your degree of conviction in Marpa's linearity claims, and it is quite simple. Create examples of problematic grammars, run them and time them. This is not as satisfying as a mathematical proof, because no set of test grammars can be exhaustive. But if you can't find a counter-example to Marpa's linearity claims among the grammars of most interest to you, that should help lift your level of certainty to "certain for all practical purposes".

Much of this increase in certainty can be achieved without bothering to run your own tests. Marpa is in wide use at this point. If Marpa was going quadratic on grammars for which it claimed to be linear, and these were grammars of practical interest, that would very likely have been noticed by now.

I'm still not convinced

Let's suppose all this has not brought you to the level of certainty you need to use Marpa. That means the reasonable thing is to continue to struggle to work with the restrictions of the traditional algorithms, right? No, absolutely not.

OK, so you don't believe that Marpa preserves the advances in power and speed made by Leo. Does that mean that parsers have to stay underpowered? No, it simply means that there should be a more direct implementation of Leo's 1991, bypassing Marpa.

But if you are looking for an implementation of Leo's 1991 algorithm, I think you may end up coming back to Marpa as the most reasonable choice. Marpa's additional features include the ability to use custom, procedural logic, as you can with recursive descent. And Marpa has worked out a lot of the implementation details for you.


Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Appendix: Some technical details

Above I talked about algorithms, classes of grammars and their linearity claims. I didn't give details because most folks aren't interested. For those who are, they are in this section.

yacc is linear for a grammar class called LALR, which is a subset of another grammar class called LR(1). If you are willing to hassle with GLR, bison claims linearity for all of LR(1). Recursive descent is a technique, not an algorithm, but it is top-down with look-ahead, and therefore can be seen as some form of LL(k), where k depends on how it is implemented. In practice, I suspect k is never much bigger than 3, and usually pretty close to 1. With packratting, PEG can be made linear for everything it parses but there is a catch -- only in limited cases do you know what language your PEG grammar actually parses. In current practice, that means your PEG grammar must be LL(1). Some of the PEG literature looks at techniques for extending this as far as LL-regular, but there are no implementations, and it remains to be seen if the algorithms described are practical.

The Marpa paper contains a proof, based on a proof of the same claim by Joop Leo, that Marpa is linear for LR-regular grammars. The LR-regular grammars include the LR(k) grammars for every k. So Marpa is linear for LR(1), LR(2), LR(8675309), etc. LR-regular also includes LL-regular. So every class of grammar under discussion in the PEG literature is already parsed in linear time by Marpa. From this, it is also safe to conclude that, if a grammar can be parsed by anything reasonably described as recursive descent, it can be parsed in linear time by Marpa.

posted at: 22:01 | direct link to this entry

§         §         §