Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website


Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Sat, 19 Jun 2010

Parsing with Ruby Slippers

PPI, perly.y and Marpa

Recently I've been combining the two main approaches to parsing Perl. One of these is the PPI module. The other is the parser in the actual Perl distribution. This is usually what is meant when someone speaks simply of "the Perl parser". That can be confusing in this context, so I will call the parser in the distribution, the Traditional Perl Parser (TPP).

I used the grammar in the TPP (it's in perly.y). But I threw out TPP's lexer (toke.c). I replaced toke.c with PPI::Tokenizer, rewrote TPP's bison grammar to use Marpa and put the two together.

To my surprise, mating the PPI::Tokenizer and perly.y has proved almost easy. PPI preserves whitespace and the TPP does not, so I had to throw whitespace away. PPI combines some tokens that TPP separates. (Notably PPI often combines sigils with other things. The TPP never does.) Also, there was the slightly more complicated issue of the ghostly semicolons, which I'll deal with in a bit.

I've gotten this combination working on the Perl subset that Data::Dumper produces as output, and it passes an appropriately selected dozen of Data::Dumper's own test cases. Data::Dumper uses only part of Perl's semantics, and that allows me to avoid most of Perl's semantics. Not that the semantics in Data::Dumper's output are trivial. They include a lot of the logic of Perl expressions, and the test cases get into some very serious nested dereferencing of lvalues.

I don't ever intend to reimplement all (or even most) of Perl's semantics. Tricky lvalue indirections or not, what I've done is a toy in comparison with the Perl semantics, and it will stay that way. But I do think it possible to use Marpa to reimplement most of Perl's grammar. An application could then plug its own semantics into it.

I've adhered to the structure of perly.y's grammar. My intent is that this adaptation of perly.y's grammar to Marpa will work for cases much broader than Data::Dumper output. As this project progresses, I expect to be forced to confront a lot of interesting issues in Perl's grammar.

The Case of the Ghostly Semicolons

But for the moment, let's get back to interfacing the PPI lexer to perly.y's grammar, and what turned out to be relatively simple issues. For reasons which probably have to do with the limits of its underlying LALR algorithm, perly.y wants all expressions inside hash brackets to end in semicolons. That is, instead of

$p->{ func() }
perly.y needs to see
$p->{ func(); }
In actual Perl code that semicolon causes a syntax error. But perly.y not only allows the semicolon -- perly.y requires it. These semicolons are only allowed to exist as momentary "ghosts" inside the TPP, where they flit from toke.c to perly.y. When exactly are ghost semicolons required? Beats me. The toke.c code for handling curly braces is as convoluted as anything I've ever seen, and I don't know of any documentation for it -- certainly the comments do no more than hint at what's going on.

Now, you'd think to mate PPI, and perly.y I would need to figure out exactly when to insert "ghost" semicolons and when not. I'd have to then implement this, ending up with something that looks a lot like (shudder) toke.c rewritten in Perl. But I don't have to do any of this. You see, I've got Ruby Slippers.

Stepping Out with the Ruby Slippers

As most readers will recall, the Ruby Slippers are Dorothy's shoes in the movie Wizard of Oz, and they grant wishes. In particular, Dorothy wants to get back to Kansas, and she discovers that, so long as she keeps the Ruby Slippers away from wicked witches, all she has to do is wish she is back in Kansas and she is there.

Marpa's parse engine has the very nice property that at every point Marpa knows exactly what parses are possible, and it knows this information in a form that it can share with the lexer. This ability is a by-product of my rewrite of the Earley parse engine, and has been useful beyond my expectations. Here's how I deal with the ghostly semicolons.

Step 1: Marpa has an "interactive" parse mode -- interactive in the sense that the parser and lexer interact. What I do is just take the closing curly brace as I get it from PPI::Tokenizer and pass it on to the Marpa implementation of the perly.y grammar. In cases where an "un-semi-coloned" closing curly is acceptable, that's the end of the story.

Step 2: But what if Marpa /perly.y needs the semi-colon? In interactive mode, when the parser is not happy with an input, it stops, marking the point in the token stream where it failed. Marpa also makes available a list of tokens which are acceptable.

Step 3: Here's where Marpa's version of Ruby Slippers comes in. All the application/tokenizer has to do is pick a token which the parser does want, dummy up the input so that the wanted token is the next token, and restart the parse. Poof, you're in Kansas.

The Ruby Slippers method is quite remarkable. In many practical cases, there is only one acceptable token. In those cases, the Marpa parser tells the lexer exactly how to make a grammar work as it goes along, on a case by case basis.

Issues with Ruby Slippers parsing can arise. Sometimes there is more than one acceptable token. In those cases, you rank your choices, use lookahead, look at Marpa's ambiguous parsing capabilities, etc., etc. Also, in some situations it may be important that all illegal parses fail -- liberal acceptance of input is not always what is wanted. You might need to be fascist, rejecting everything which is not in exactly correct format. Fascists need to be careful with their Ruby Slippers.

The Ruby Slippers Strike Again

A more complex example of Ruby Slippers parsing (and a pretty nifty one if I say so myself) is in Marpa::HTML. This is Marpa adapted to parse HTML is a very liberal way. You can feed any file to Marpa::HTML and get an HTML parse for it. If the file is pathological HTML, or not HTML at all, the HTML parse will be pathological, but there will be an HTML parse. (This is not totally crazy. Most renderers accept any text whatsoever, doing their best to render it on the screen as HTML.)

As part of being very liberal, Marpa::HTML supplies missing end tags, even end tags which are required under the standards. For traditional parsing methods, this is a devilishly complex problem. Try writing BNF that allows for missing end tags when possible. Actually, just try defining "when possible". I think you'll see it's very, very hard.

But to solve this problem of end tags with Marpa I don't need to work out BNF or even define that slippery phrase "when possible". Marpa::HTML supplies missing end tags using the Ruby Slippers method.

It's basically the same trick as before. Marpa::HTML sets Marpa to parse in interactive mode. Then it simply feeds the tags it has to Marpa, as is, and waits for Marpa to have a problem. When Marpa can't use a token, Marpa stops. Marpa tells the tokenizer which token it could not use, and which tokens it could use instead -- its list of "expected" tokens.

The Marpa::HTML tokenizer looks in the expected tokens. Is there an end tag in there? If so, it dummies up a token for that end tag, sticks it in the input stream, and restarts.

Are there complications? If Marpa::HTML were parsing strict HTML, there wouldn't be. For strict HTML, whenever there is a missing end tag, there will be only one missing end tag. In the strict HTML case, the Marpa parser can always tell the Marpa::HTML exactly the right fix for any missing end tag.

If you want to parse liberalized and defective HTML, which Marpa::HTML does, things are more complex. In some cases the Marpa parser suggests more than one way to dummy up the input -- more than one pair of Ruby Slippers, so to speak. In these cases Marpa::HTML has to decide "which shoes to wear." Marpa::HTML uses a handful of simple rules and heuristics. The rules provide the right answer where there is one. If it comes down to guessing at the intent, the heuristics make good suggestions.


On this project, obvious thanks are due to Larry Wall and the other authors for perly.y and toke.c. as well as to Adam Kennedy for PPI. Not so obvious might be my debt to Randal Schwartz, whose Oct 2001 Linux Magazine column suggested the idea of using Data::Dumper output (and its test cases) to create an initial subset for a Perl parser.

posted at: 14:56 | direct link to this entry

§         §         §