Marpa v. Perl regexes: some numbers

Mon, 07 Nov 2011

Marpa v. Perl regexes: some numbers

In this post, I pit Marpa against the Perl regex engine. The example I will use is unanchored searching for balanced parentheses. I have claimed that many problems now tackled with regexes are better solved with a more powerful parser, like Marpa. I believe the numbers in this post back up that claim.

To be very clear, I am NOT claiming that Marpa should or can replace regexes in general. For each character, all an RE (regular expression) engine needs to do is to compute a transition from one "state" to another state based on that character -- essentially a simple lookup. It's the sort of thing a modern C compiler should optimize into a series of machine instructions that you can count on the fingers of one hand.

Marpa is much more powerful than an regular expression engine, and to deliver this power Marpa makes a list of all the possibilities for each token. Tricks are used to compress these per-token lists, and Marpa's code to process them is heavily optimized. But even so, Marpa's processing requires more than a handful of machine instructions.

In this context, I think some of the numbers below may surprise people. RE's beat everything else as long as they stick to their game. But these days they are often stretched beyond their limits, often without an appreciation of how quickly their efficiency deteriorates when outside those limits.

Unanchored searches for balanced parens

I chose the problem solved by Regexp::Common::balanced -- unanchored searches for balanced parens. "Unanchored" here means that the search is not anchored to the beginning, or any other specific point of the string. Unanchored searches need, not only to identify the matching string, but to determine where in the searched string to find the target.

When it comes to unanchored matches, most users want the "first longest" match. That is, they want the first match but, when one match is completely contained in another one, they want the longest match. This is the problem in its hardest form. It is simple to find the match which ends first. "First longest" needs the match which STARTS first. "First longest" is the problem that Regexp::Common::balanced addresses. For this benchmark, I programmed Marpa to return exactly the same results that Regexp::Common::balanced returns.

The examples I will use in this post are sets of parens of varying length. All the examples will have a prefix, a balanced paren "target", and a short, unbalanced, trailer. If the string is of length N, the prefix consists of N-8 left parens. The target is always this string: "(()())". The trailer is always two left parens: "((". Here, with spacing added for clarity, is the test string for length 20:

(((((((((( ((()()) ((

The Results

Number
of parens
in test
string Executions per second

libmarpa
(pure C) Marpa::XS
(mixed C
and Perl) Regexp::
Common::
balanced tchrist's
regex Marpa::PP
(Pure
Perl)

10 4524.89 111.71 3173.30 33429.33 47.39

100 1180.64 58.96 62.09 197.25 15.35

500 252.40 19.50 2.43 7.58 4.09

1000 117.16 10.28 0.53 1.84 2.14

2000 56.07 5.47 0.12 0.34 1.08

3000 36.35 3.72 0.05 0.13 0.74

Number of parens in test string	Executions per second
libmarpa (pure C)	Marpa::XS (mixed C and Perl)	Regexp:: Common:: balanced	tchrist's regex	Marpa::PP (Pure Perl)
10	4524.89	111.71	3173.30	33429.33	47.39
100	1180.64	58.96	62.09	197.25	15.35
500	252.40	19.50	2.43	7.58	4.09
1000	117.16	10.28	0.53	1.84	2.14
2000	56.07	5.47	0.12	0.34	1.08
3000	36.35	3.72	0.05	0.13	0.74

The above results are in executions per second -- a higher number means a faster algorithm. These numbers are what happens when regexes are pushed beyond their limits. Regex::Balanced::common goes quadratic on these examples, while all versions of Marpa stay linear. (Here linear and quadratic refer to speed. The results above are reported in executions per second, and you need to take the reciprocal to get the speed.)

libmarpa, the pure C version of Marpa, is faster than Regexp::Common::balanced on even the shortest example I tested. Marpa::XS, the XS (mixed C and Perl) version, catches up with it when the length of the test strings gets a little past 100 characters. You would expect that Marpa::PP, which is implemented entirely in Perl, would not have a chance against the Perl regex engine, which is implemented in C. But somewhere in the low 100's Marpa::PP also catches up and by the time we are testing strings of 500 characters, Marpa::PP is running twice as fast.

As the length of the test strings increases, Marpa's relative advantage grows. At 3000, Marpa::XS is over 74 times as fast as Regexp::Common::balanced, and libmarpa is well over 700 times as fast. Even Marpa::PP is nearly 15 times as fast, and steadily improving its advantage.

[ In the comments, Tom Christiansen shares a Perl regex which is faster than Regexp::Common::balanced. I've included results for it in the table above. For discussion of it, see the comments. ]

And I hardly even cheated!

In this comparison, I tried to be "fair". "Fair" can be hard to define in a benchmark.

The Choice of Tests

The test strings, and the problem (unanchored searching for balanced parens) were chosen to highlight the limits of Perl regexes. On the other hand, this problem is typical of the things programmers want to do, as well as of the sort of thing they ask Perl regexes to to.

Official versus Developer's Versions

For Marpa::XS and Marpa::PP, I insisted on using official distributions out on CPAN. My benchmarking script is available as a gist on github. These benchmarks were done using only the DOCUMENTED interfaces of Marpa::XS 0.020000 and Marpa::PP 0.010000. Both these versions had several useful capabilities which are not in the documented interface, and, especially with the shorter test strings, both Marpa::XS and Marpa::PP pay a real price for my decision not to use them. But I wanted to demonstrate the kinds of speeds that real users can get, using what is actually on CPAN now, as it is documented TODAY.

The libmarpa numbers, on the other hand, are for a developer's version. The libmarpa library has never had a separate release, and is not yet documented. A stable libmarpa is part of Marpa::XS and the latest code is in Marpa::R2. The version used for these tests is the one in Marpa::R2 and my benchmarking code is in the Marpa::R2 github repository.

Precompilation

Parsers, like Marpa and yacc, are designed for repeated use. Regular expression engines, on the other hand, are often used in "one-shot" applications. When the application is viewed as a regex, it seems fair to include any precompilation in the benchmark times. If the application is thought of as parsing, it seems fair to allow the algorithms to see the grammar first, without the clock running, and optimize the heck out of it. Which is fair here? The choice would make a real difference. Marpa does a lot of precomputation, more than any other parsing algorithm I know of.

I decided that this test would be about pitting Marpa versus regexes, on the regex's own turf and using their rules. In all the tests above, for every repetition, Marpa was forced to redo all its precomputations, while the clock was running. For shorter tests in particular, this put Marpa at a real disadvantage. But it makes the results clearly relevant to the use of Marpa inside a regex engine.

Reading the input string

Both libmarpa and Regexp::Common::balanced have a big advantage over Marpa::XS and Marpa::PP. Marpa::XS and Marpa::PP have to use Perl to convert the input string into their internal format. For Perl's regex engine and libmarpa, this is done in C. I decided to require both algorithms to do their string manipulation "with the clock running", very much to the disadvantage of Marpa::XS and Marpa::PP.

This disadvantage could be called unfair, since the choice of language for string manipulation is about interface and convenience, and does not really have anything to do with the speed of the underlying algorithms. But I felt that, when string conversion times were included, run times were more realistic -- more like what would be encountered in the actual applications that regexes are asked to deal with. This handicap makes Marpa::XS's performance all the more impressive.

A faster regex engine

I hope this post will encourage use of Marpa::XS. That is why I carefully limited the benchmark code for Marpa::XS to documented features of an already available, stable beta release.

I also want to suggest the possibility of using libmarpa to extend the Perl regex engine. It is possible to efficiently identify, for any regex, the presence of features that are problematic for an RE-based recognition engine. A regex implementation could check for these and select a recognition engine accordingly -- the tradition RE-based engine for simpler regexes or, when it seems beneficial, a Marpa-based recognition engine.

Notes

"regular expression": In the pure mathematical sense, a regular expression (RE) is a state machine that recognizes only patterns that use concatenation (ab), repetition (a*), alternation a|b), or some combination of these. Perl regexes are sometimes regular expressions or their equivalent. More often they are not. For example, any Perl regex which captures substrings is not equivalent to a regular expression.
"token": In the example in this post, "token" can be taken as a synonym for character. RE's typically work with the individual characters of character strings. In this post, so does Marpa. Typically, general purpose parsers like Marpa let a lower level gather one or more individual characters into "tokens".
"paren": Saying "parenthesis" gets tedious. I often abbreviate it to "paren".
"quadratic": By quadratic, I mean O(n^2) where n is the length of the test string. That Regexp::Common::balanced goes quadratic is my conjecture -- I have no proof. But many of the simpler approaches to unanchored search are quadratic, and the benchmark numbers suggest this is the case here.

posted at: 20:21 | direct link to this entry

§ § §

Ocean of Awareness

Marpa resources

Mon, 07 Nov 2011