Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website


Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Sun, 27 Nov 2011

Marpa and the Ruby Slippers

In a previous post, I listed the four ideas that are essential to Marpa. This post delves into one of them: Ruby Slippers parsing. In Ruby Slippers parsing, the parser imagines ("wishes") that the language it is parsing is easier to parse than it actually is. The part of the application that handles input (the "lexer") manipulates the input to make the parser's "wishes" come true.

As an example, take liberal HTML. "Liberal HTML" is HTML as it is found "in the wild", with missing and spurious tags. I've written a Marpa-powered liberal HTML parser which uses the Ruby Slippers as its primary technique. The grammar behind Marpa::HTML assumes a fantasy world, one where no element ever occurs out of place, and where all HTML elements have both start and end tags.

With Marpa as the parse engine, it is easy for the lexer to make wishes come true. All the lexer needs to do is wait until the parser is not happy with the input. When the parser sees the actual input as a problem, the lexer asks the parser what it would like to see instead. Marpa always knows exactly what it is looking for, so that it is no problem for the lexer to invent an input that makes the parser happy.

The Obstacles

This technique sounds a bit magical, which is why I named it "the Ruby Slippers". But the idea is simple enough, and the need for it great enough, that it has occurred to others over the now 50 year history of parsing techniques. In fact, the Perl lexer invents input to simplify the Perl language into something that its LALR-based parse engine can handle. Commenters on a previous post have mentioned other instances. I suspect that instances where the Ruby Slippers were used can be found going back to the 60's.

But previous use of the Ruby Slippers was difficult, and had to be rudimentary and infrequent. The parsers in standard use did not provide enough feedback. Recursive descent takes a worm's eye point of view of the parse -- it knows only where it is trying to burrow at that particular time.

When it came to feedback, LALR-based parsers were particularly bad -- they were based on a very abstract state machine. The Perl lexer, for its use of the Ruby Slippers, didn't even try looking at the LALR state -- instead the Perl lexer duplicated the LALR parser's work for the section of the parse that was of interest. If the Perl lexer tried to make frequent use of the Ruby Slippers it would raise a question: Why bother with the LALR parser at all?

The Solution

To really empower the Ruby Slippers, a parser needs to do two things. First, the parser must know what it wants. Second, the parser must know this in time to help guide the input.

Earley's algorithm met the first requirement from the beginning. Earley's algorithm works by creating "Earley sets". The original Earley sets were lists of rule applications, one list for every token in the input. A rule application was in the Earley set if and only if it was a real possibility.

The second requirement was not met in the original Earley's algorithm. For the second requirement to be met, all work on an Earley set at a parse position must be finished before the tokens that start at that position are read. Pre-Marpa, Earley parse engines intermixed work on a location's Earley set, with reading the input for that location. That meant that the information to guide Ruby Slippers parsing did not became available until just AFTER it was needed.

In creating Marpa, I needed to combine other researcher's improvements to Earley's into a single algorithm. Merging their parse engines into a new one forced me to write my own, new, parse engine. In the course of this, I saw that Earley's Algorithm could be rearranged so that, at every location in the parse, each list of possible rule applications was finished before the input starting at that location was read.

To make this happen, Marpa divides the single loop of previous Earley parse engines into two loops, and reverses their order. In other words, Marpa's parse engine turns the original Earley parse engine inside-out and upside-down. The result is provably equivalent to the original, and just as fast.

At the end of the Wizard of Oz movie, Dorothy tells Glenda she wants to get back to Kansas. Glenda points down to Dorothy's Ruby Slippers and explains their powers, adding that Dorothy could have gone to Kansas anytime she liked. The Ruby Slippers have always been implicit in Earley's Algorithm. We've just had to realize they were there.


  1. "possibility": Possiblity is defined the way you'd hope it would be: a rule application is "possible" at any point, if and only if, given the input up to that point, there is some input from that point on, that produces a parse which includes that rule application.

  2. "anytime she liked": As a child I wondered if Glenda was not being a bit tactless.

posted at: 15:48 | direct link to this entry

§         §         §