**Jeffrey Kegler's blog**
about Marpa, his new parsing algorithm,
and other topics of interest

The Ocean of Awareness blog: home page, chronological index, and annotated index.

I have just released Marpa::XS 0.008000. With this release the core Marpa algorithm has been converted to C, vastly speeding it up. Marpa::XS is still alpha, but the additional development needed at this point is a matter of packaging (See Note 1).

It is my hope that Marpa will become the standard parsing algorithm for problems too big for regular expressions.

- Marpa parses all classes of grammar that are currently in practical use in linear time. (See Note 2).
- Marpa is a general BNF parser -- that means if you feed it anything written in BNF, it "just parses" it. This includes grammars which are left-recursive, right-recursive and ambiguous -- even infinitely ambiguous.
- Marpa never goes exponential -- worst case, even for highly ambiguous grammars, is O(n**3), which is considered optimal (See Note 3).

The foremost limitation of Marpa is, of course, that it is alpha. Development is well advanced, but the interface remains subject to change.

There are several parsing tasks where current technology will be superior even to a stable, production Marpa. For problems which work well as regexes (compact and with no backtracking), Marpa will never replace a good regular expression engine. On the other hand, Marpa may well be the answer to a lot of problems which are forced to fit into the regular expression paradigm for lack of anything better. Hand-written recursive descent could compete with Marpa, but not having to write your own parser is a huge advantage.

I've made no secret that I consider yacc and LALR theoretical milestones and industry traditions which are best honored in the breach. I've pointed out LALR's horrendous error-handling properties elsewhere (See Note 4). LALR does offer some apparent speed advantages, but in the context of practical applications these are more apparent than real.

Note 1: Inside the code are many relics of the development process, and these need to be cleaned up. The user documentation is finished, but the internals documentation needs a lot of work.

Some additional C code may be written to help in the evaluation phase but, while this code could produce some significant additional efficiencies, I don't consider it part of the core Marpa development project for two reasons. First, evaluation is to a large degree a matter of calling higher-level and application code.

Second, the core Marpa algorithm was new and extremely difficult mathematics -- built on top of a large body of prior work to be sure, but new nonetheless. With the evaluation code, my original thinking involves making making choices of what to use in existing mathematics -- evaluation consists of problems like tree traversal, where there is lots of prior work to consult.

Note 2: To be specific, Marpa parses any LR-regular grammar in linear time -- O(n). LR-regular is a vast class of grammars that includes LALR, LR(k) for all k, and LL(k) for all k. This means that Marpa parses, in linear time, every grammar parsed by yacc, recursive descent and regular expressions.

Note 3: The phrase "considered optimal" elides some irrelevant bits of theory. It would be a bit of a surprise if it is possible to do general BNF parsing in O(n), but nobody has proved that it can't be done. The Valiant algorithm parses general BNF and is O(n**2.376...) or better. But Valiant's algorithm is only faster for huge problems, and for those it needs a machine with many terabytes of main memory to deliver on its speed claims. So it won't be competing with Marpa any time soon.

Note 4:
My series on yacc ran in four blog posts:
Killing Yacc,
Why the Bovicidal Rage?,
Bovicide 5: Parse-time Error Reporting,
and
Bovicide 6: The Final Requirement

posted at: 11:47 | direct link to this entry

§
§
§