Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website

Google+

Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Wed, 11 Apr 2012


Marpa v. Parse::RecDescent: some numbers

The application

In a recent blog post, Flavio Poletti described an unusual language, one which he'd written a parser for, using the very traditional method of recursive descent. Specifically he'd used Perl's Parse::RecDescent module. Several people asked me how Marpa would do on this language. This post contains a comparison, including a benchmark.

The reader who wants an exact description should look at Flavio's post and code. I call it a Dyck-Hollerith language because it combines Hollerith constants (strings preceded by a count), with balanced parentheses (what is called a Dyck language by mathematicians). Here's Flavio's example:
A2(A2(S3(Hey)S13(Hello, World!))S5(Ciao!))

When I did a Marpa versus Perl regexes comparison, I was very careful to choose an application which showed Marpa in a good light. Here I did not pick the application, and it is almost as if it was designed to show recursive descent in a good light.

Recursive descent tries to parse by treating each rule as a subroutine call. This is a very natural way for a programmer to look at a grammar. But, alas, it only works in certain cases, and most of those are already handled by regular expressions. However, Dyck languages, including balanced parentheses, are one of the boundary cases: languages beyond the abilities of regular expressions, but within the capabilities of recursive descent.

Hollerith constants highlight the other strength of recursive descent. Since recursive descent's basic idea is so intuitive, it is easy to add custom hacks to it. And Hollerith constants are beyond the capabilities of even general BNF parsing. Earley's, Marpa, CYK, GLR, you name it, none of them can handle Hollerith constants without some kind of custom hack. You might say that Hollerith constants force a parser to wrestle in the mud. And when it comes to wrestling in the mud, recursive descent is hard to beat.

The benchmark

LengthSeconds
Marpa::XS Parse::RecDescent
1000 2.938 13.616
2000 7.067 62.083
3000 13.953 132.549
10000 121.654 1373.171

For Parse::RecDescent, the code used in the benchmark is Flavio's, with light modifications such as the one that allows inputs of varying lengths. The Marpa::XS parser used in this benchmark is on github as a gist.

Since Marpa::XS has its parse engine in optimized C, while Parse::RecDescent is pure Perl, one would hope that Marpa::XS would put better numbers on the board, and it did. Marpa::XS has some startup overhead, but settles in at roughly 10 times faster than the Parse::RecDescent implementation.

To test inputs of varying lengths, I took the example above and created arrays of it in the Dyck-Hollerith language. The reported length is the length of the top-level array. The example is 42 characters long, so a reported length of 1000 means the input string is 6+1000*42+1=42007 characters long. (The 6 is the length of the "A1000(" prefix of the enclosing array and the 1 is for its closing parenthesis.)

Conclusions

Benchmarks are like traffic accidents. They have a draw and an excitement, one that the wiser and nobler sort know is improper and to be resisted. But you'll often catch even them slowing down to look.

Other factors than speed may determine the choice of a parser, particularly if large inputs do not occur. As noted, many programmers prefer the recursive descent approach of turning grammar rules into subroutines calls: for them, it is the way things SHOULD work. And given recursive descent's hacker friendliness, it is the way things often DO work.

But I think those accustomed to recursive descent may want to look at Marpa. Marpa has all the same context that recursive descent has, and allows the same tricks and more. In this benchmark, Marpa did NOT haul out the Ruby Slippers -- it limited itself to the context that recursive descent has.

Marpa is faster in this benchmark because it gives the application the best of both worlds -- custom hacks, *AND* a parse engine written in optimized C. Recursive descent is good at custom hacks, and for this grammar its underlying algorithm has the same big-O complexity. But by its nature, Parse::RecDescent is pure Perl, so Marpa::XS clocks a steady factor-of-ten advantage. And, in fact, of the time Marpa::XS does use, only about 20% is used by its parse engine. Overhead and lexing in Marpa::XS are in pure Perl, and they take 80% of the time.


posted at: 21:08 | direct link to this entry

§         §         §