Marpa v. Perl regexes: a rematch

Tue, 07 Aug 2012

Marpa v. Perl regexes: a rematch

Marpa::R2, the latest version of Marpa, has some significant speedups. Enough so, that it seems appropriate to revisit an old benchmark. (For those new to this blog Marpa is a new parser with a decades-long heritage. Marpa parses anything you can write in BNF and, if your grammar is in one of the classes currently in practical use, parses it in linear time.)

The benchmark I'll revisit compared Marpa to Perl regexes, testing the speed with which each could find balanced sets of parentheses in a string of parentheses, on a first longest match basis. It's an interesting test, because it has easy cases and hard cases, and because the dividing line between good applications for Marpa and good applications for regexes is somewhere between the two sets of cases.

In the "hard cases", the matches come toward the end. On these, Perl regexes go quadratic (O(n²)), while Marpa stays linear. When Perl regexes become unuseable depends on your hardware and your patience, but there does come such a point.

In the "easy cases", the matches come toward the start. Here Perl regexes stay linear. In my previous testing of easy cases, Perl regexes beat Marpa at a steady 40 to 1, even on some very long and non-trivial strings. So let's see what has changed ...

The hard cases

Finding balanced parentheses: the hard cases Executions per second by parsing method and length of input (a higher number means a faster method)
	3000	2000	1000	500	100	10
Marpa::R2, "thin" interface	30.88	46.14	89.81	176.98	803.23	2665.79
Marpa::R2, standard interface	7.79	11.69	22.47	41.03	140.00	305.61
Marpa::XS	3.72	5.47	10.28	19.50	58.96	111.71
Perl regex	0.13	0.34	1.84	7.58	197.25	33429.33
Regexp::Common::Balanced	0.05	0.12	0.53	2.43	62.09	3173.30

Marpa::R2 and Marpa::R2::Thin are the new version of Marpa, respectively its user-friendly and its "no-frills close to the C language" interfaces. Marpa::XS is an older version, now stable. The strings consist of left parentheses, except for a target: "(()())". In the "hard cases", the target is near the end of the string. Further details on the benchmarking methodology are given in the previous post and the code for this benchmark is a Github gist.

For regexes against which to benchmark, I first went to Regexp::Common::Balanced on CPAN. The Regexp::Common::Balanced solution was not all fast, and went quadratic in the hard cases. Tom Christiansen provided a better solution for Perl 5.10 and above:

(\\((?:[^()]++|(?-1))*+\\))

Tom's regex also goes quadratic in the hard cases, but it is roughly 3 times faster than the one in Regexp::Common::Balanced.

Marpa::XS already had a big lead in the hard cases, and Marpa::R2 and especially Marpa::R2::Thin have widened it. Marpa does extensive precomputation, so it loses for short strings, as you can see in the "10" column. But Marpa's precomputation pays off quickly. At around 125 parentheses, Marpa::R2 catches up. Marpa::R2::Thin catches up even faster -- at roughly 35 parentheses. When strings reach 3000 characters in length, Marpa::R2 is about 60 times faster, and Marpa::R2::Thin is more than 230 times as fast.

The easy cases

Finding balanced parentheses: the easy cases Executions per second by parsing method and length of input (a higher number means a faster method)
	10000	3000
Marpa::R2, "thin" interface	12.59	41.59
Marpa::R2, standard interface	2.59	8.60
Perl regex	20.43	78.33
Regexp::Common::Balanced	9.55	33.10

The "easy cases" are easy, relatively speaking, but they are not trivial. The string is the same as for the hard cases, except that the target starts at character position 2, near the beginning. This is not a trivial matching problem because, until the very end of the string, it cannot be known whether the match will start at character position 0, 1 or 2. (This code is also in a Github gist.)

For easy cases, I look at the numbers for longer strings. I've added a column for strings of 10,000 parentheses. (I had wanted to carry the "hard cases" test up to 10,000, but regex performance had deteriorated so badly by 3,000 that getting numbers for 10,000 seemed pointless.) At 3,000 and even 10,000, regexes are usually better than Marpa::R2. (Usually, but not always: for these long strings Marpa::R2::Thin beats the solution from Regexp::Common::Balanced, even on these "easy" cases.)

Let's focus on the real race -- the one with Tom's regex. Tom's regex had beat Marpa::XS by 40 to 1 on this test. And it still beats Marpa::R2 by almost 10 to 1. But, for the easy cases, its lead over Marpa::R2::Thin is now less than 2 to 1.

There will always be cases where regexes beat Marpa. For this balanced parentheses benchmark, Marpa's precomputations guarantee that short strings will be one of those cases. But much of regex's currently remaining advantage comes, not from faster pattern recognition, but from faster string handling. Perl regexes do their string handling in C. Marpa::R2 currently does its string handling in Perl, something which could be changed. Stay tuned.

posted at: 20:25 | direct link to this entry

§ § §

Ocean of Awareness

Marpa resources

Tue, 07 Aug 2012