Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website

Google+

Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Mon, 27 May 2013


Why Marpa works: table parsing

Marpa works very differently from the parsers in wide use today. Marpa is a table parser. And Marpa is unusual among table parsers -- its focus is on speed.

The currently favored parsers use stacks, in some cases together with a state machine. These have the advantage that it is easy to see how they can be made to run fast. They have the disadvantage of severely limiting what the parser can do and how much it can know.

What is table parsing?

Table parsing means parsing by constructing a table of all the possible parses. This is pretty clearly something you want -- anything less means not completely knowing what you're doing. It's like walking across the yard blindfolded. It's fine if you can make sure there are no walls, open pits, or other dangerous obstacles. But for the general case, it's a bad idea.

Where the analogy between walking blindfolded and parsing breaks down is that while taking off the blindfold has no cost, building a table of everything that is happening while you parse does have a cost. If you limit your parser to a stack and a state machine, you may be parsing with a blindfold on, but it is clear that your parser can be fast. How to make a table parser run fast is not so clear.

The advantages of table parsing

What are the advantages of taking off the blindfold? First, your parser can be completely general -- anything you can write in BNF it can parse. And second, you always know exactly what is going on -- what rules are possible, what rules have been recognized, how far into a rule you have recognized it, what symbols you expect next, etc. etc.

We programmers have gotten used to parsers which run blindfolded. When you bump into something while blindfolded you don't know what it was. When non-table parsers fail, they usually don't know why -- they can only guess. If you have a full parse table, built from left to right, you know what you were looking for and what you already think you found. This means that you can pinpoint and identify errors precisely.

Knowing where you are in a parse also allows certain tricks, like the one I call the Ruby Slippers. In this, you parse with an over-simplified grammar and, when it fails, you ask the parser what it was looking for. Then -- poof! -- you provide it.

The Ruby Slippers work beautifully when dealing with missing tags in HTML. You can define a simplified HTML grammar, one that lives in a non-existent world -- an ideal world where all start and end tags are always where they belong. Then you parse away. If, as will happen with real-world input, a tag is missing, you ask the table parser what it was looking for, and give it a virtual tag.

And as for fast ...

When I decided to write Marpa in 2007 my goal was to create a table parser that was as fast as possible. I was surprised to find that the academic literature contained a major improvement to table parsing by Joop Leo, an improvement which nobody had ever made a serious attempt to implement. Marpa is the first implementation of Joop Leo's 1991 improvement to table parsing which, as far as theory goes, makes Marpa as fast any parser in practical use today. Any class of grammar that recursive descent, bison, etc. will parse, Marpa will parse in linear time.

Table parsing has had a reputation for being slow due to a bad "constant factor". Theoreticians, when looking at speed as time complexity, throw away constant factors. What's left once the constant factor is ignored is always more important. Surprisingly often, time complexity results which ignore constant factors are also the most meaningful results in practical terms.

But not 100% of the time. Sometimes the constant factor makes all the difference. And deciding when a constant factor does make a difference, and when it does not, is a tricky matter, one that lies in that murky zone between practice and theory where neither practitioner or theorist feels fully at home.

It's time to look again, for two reasons. First, Aycock and Horspool did a lot of careful work on reducing the constant factor for table parsing, work which Marpa incorporates and builds on. Second, the judgment about the constant factor dates from 1968, when computers were literally a billion times slower then they are now.

Why has nobody re-examined this judgment as the years and the order of magnitude speed-ups marched by? When a judgment crosses sub-disciplines, it can be "sticky" beyond all reason, and this one is a good example. The decision that its "constant factor" made table parsing too slow for many practical applications is in part a practical take on a theoretical issue, and in part a theoretical call on a practical matter.

Since 1968, the theoretical results have improved. But the theoreticians did not change their mind about the speed of table parsing because it was a judgment about the practical application of the theory. The practitioners were actually out there writing compilers. When it comes down to practice, you have to assume that practitioners know what they are talking about, right?

Meanwhile the practice of writing software underwent revolution after revolution. But the practitioners continued to write off table parsing as impractical. Talking about the speed of table parsers quickly got you into some very heavy math. And some of the algorithms did not even exist except as mathematical notation on the pages of the journals and textbooks. When it comes down to theory about things that don't exist outside of theory, who do you listen to if not the theoreticians?

I look carefully at the issue of the "constant factor" in a previous post. Forty-five years have passed and cost of CPU has fallen nine orders of magnitude. (Others say the cost of CPU has dropped by 50% a year, in which case it's over 14 orders of magnitude. But why quibble?) It's reasonable to suspect that the constant factor that practitioners and theoreticians were worried about in 1968 stopped being a major obstacle many years ago.

For more about Marpa

Marpa's latest version is Marpa::R2, which is available on CPAN. A list of my Marpa tutorials can be found here. There is a new tutorial by Peter Stuifzand. This blog focuses on Marpa, and it has an annotated guide. Marpa also has a web page. For questions, support and discussion, there is a Google Group: marpa-parser@googlegroups.com. Comments on this post can be made there.


posted at: 04:37 | direct link to this entry

§         §         §