Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website


Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Mon, 28 Jun 2010

Parsing Perl 2: Down the Garden Path

The Garden Path

In Perl 5.10, the following code is a syntax error:

If you try it, what you'll see is something like this:
syntax error at (eval 25) line 1, near ";4"
syntax error at (eval 25) line 1, near "}}

TPP is based on an algorithm called LALR. (To avoid confusion when discussing various programs which parse Perl, I will call the one that comes with the Perl distribution the Tradition Perl Parser, or TPP.) If you look at perly.y, you will find a real achievement in LALR parsing. and therefore in computer language parsing in general. But LALR has its limits and even in the TPP these show up every now and then.

Let's call our first example the Garden Path Code Block. What is happening here? Perl is being taken "down the garden path" between the inner set of curly braces. TPP starts out down the path and, based on what it sees, decides that it is parsing an anonymous hash. TPP commits to that choice and then trainwrecks on the semicolon. PPI does no better. PPI recognizes the Garden Path Code Block, but it parses it as an anonymous hash, which is wrong.

I am prototyping a new Perl syntax analyzer using Marpa, my new parser. Marpa parses anything you can write in BNF, using ideas based on those by Jay Earley. Marpa::Perl is still a prototype, but it has no difficulty in recognizing the Garden Path Code Block as a code block.

LALR's problem is that it is based on a state machine. LALR's state machine ran at speeds and used memory in keeping with the computer technology of 1969, which is when LALR was invented. But the LALR state machine needs to make decisions based on limited context. When the LALR parser in perly.y sees that semicolon, it is pretty much game over.

LALR has always been a tight squeeze for the grammars of practical programming languages. This is despite the fact that LALR has had decades to condition language designers to accept its idea of what "practical" means in a programming language.

For Marpa, the Garden Path Code Block is not even a serious challenge. Marpa is based on Jay Earley's ideas. These were the subject of my last post. Briefly, the core idea is to create a table of "Earley items". Earley items are short records indicating parsing progress, keyed to locations in the input. For an Earley-based parser, a "garden path" is dealt with by having Earley items record the multiple possibilities. This need to deal with alternative possibilities is not dissimilar to the one which motivates other parsers to use more expensive strategies, such as backtracking or parallel processing.

What the Man Page Says

Suppose we go all the way down the "garden path", and it still is not clear whether the curly braces enclose an anonymous hash or a code block? If we alter our example to replace the deadly semicolon with an insidious comma, this is what it looks like:
And for Perl 5.10, this is the result:
{'1' => 2,'3' => 4}
PPI seems to agree with Perl 5.10.

Marpa::Perl, on the other hand, sees that this code is ambiguous. It recognizes two parses. The first parse is the one Perl and PPI see: an anonymous hash. Marpa::Perl's second parse treats both sets of curlies as surrounding code blocks. In scalar context, the value which results from this second parse is the number 4.

In cases of desperation, we look at the documentation. It takes some searching to find it, but there is a very relevant discussion in a section of the perlref man page. It is worth reading. Of its several examples, this is the most interesting:

sub showem {       { @_ } }   # ambiguous (currently ok, but may change)
Here Perl's rules of thumb result in the "{ @_ }" being parsed as a code block -- the opposite of the behavior in our previous example.

The threat made in the comment -- that TPP's current behavior may change -- would be a hard thing to make good on. Assuming array context, the effect of having the perlref example flip semantics would be to change the return value from an array into a hash ref. That kind of silent change is not the kind of thing I like to see in production code.

Most Perl programmers are probably not aware of this gotcha. Even well-informed programmers can easily write ambiguous code by accident, and will receive no warning of their mistake. Perhaps a good application for Marpa::Perl would be spotting ambiguities like these.

How to Disambiguate

Once spotted, ambiguous curlies are easy to fix. A standard way to disambiguate an anonymous hash is to prefix the curly braces with a plus sign:

TPP, PPI and Marpa::Perl have no trouble recognizing that the inner curly brackets enclose an anonymous hash in the above.

A programmer can disambiguate code blocks by placing a semicolon as the first thing inside the curly brackets.

The inner curlies in the above are recognized as defining a code block by all three parsers: TPP, PPI and Marpa::Perl.

posted at: 22:29 | direct link to this entry

§         §         §