Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website

Google+

Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Sun, 09 Jan 2011


Perl and Parsing 6: Error Handling

Is the Perl interpreter psychic?

An important consequence of the choice of parsing algorithm is the handling of parse-time errors. It's often overlooked. Perl's use of LALR based parsing puts severe limits on its ability to locate errors, limits which the Perl interpreter is sometimes able to overcome. Consider the following erroneous piece of code.


my $lyric =
'Sloopy wears a red dress, yeah
As old as the hills
but when sloopy wears that red dress, yeah
you know it gives me the chills
Sloopy when I see you walking, 
walking down the street
I say don't worry sloopy, girl
You belong to me';
print "$lyric\n";

Some of you may have spotted this as the lost third verse of the October 1965 US #1 hit, "Hang on Sloopy". Others may notice that the single quote in "don't" is going to make things come out pear-shaped. On this latter point, the Perl interpreter's performance is uncanny, if a bit noisy:


Bareword found where operator expected at bomb.pl line 8, near "I say don't"
  (Might be a runaway multi-line '' string starting on line 2)
        (Do you need to predeclare I?)
syntax error at bomb.pl line 8, near "I say don't worry "
Bad name after me' at bomb.pl line 9.

I just said that Perl's LALR-based parsing made error detection difficult. So how do I explain the above, where the error detection is not merely good, it almost seems psychic?

Behind the Smoke and Mirrors

A closer look will show how Perl pulls off this bit of apparent mind-reading. If you look at toke.c, you'll see the trick. Globals in the tokenizer track the beginning and ending of the most recent multi-line object. Anytime a multi-line object ends just before an error, the Perl interpreter fingers it. It's a bit like police lieutenant Arthur Tragg in the old Perry Mason series, who on discovering a murder victim, always arrested the next person he saw -- usually Mason's client. It's a good guess and, unlike the indefatigable Arthur Tragg, who persisted despite turning out to be wrong week after week, it works for the Perl interpreter most of the time.

But not always. Let's rearrange this example, so that the string is OK, and we're depending on the LALR parser to locate an error.


my $lyric =
'Sloopy wears a red dress, yeah
As old as the hills
but when sloopy wears that red dress, yeah
you know it gives me the chills
Sloopy when I see you walking, 
walking down the street
I say don\'t worry sloopy, girl
You belong to me'~!@$%^&*()_+;
print "$lyric\n";


Bareword found where operator expected at bomb2.pl line 9, near ")_"
        (Missing operator before _?)
syntax error at bomb2.pl line 9, near "You belong to me'~"
  (Might be a runaway multi-line '' string starting on line 2)
Execution of bomb2.pl aborted due to compilation errors.


This example clearly illustrates the limits of LALR parsing. A multi-line string just ended, and the tokenizer duly drops the dime, but there's no way to make the charge stick. The multi-line string is fine, and the problem is somewhere in the special characters after it.

But where? Now that we are no longer dazzled by the psychic abilities of the tokenizer, let's carefully define what we mean by the actual error location. The error location is the first character which makes a valid parse impossible. Put another way, if a parser reads a character, and that character renders it impossible that any subsequent reading of characters will produce a valid parse, then the location of that character is the error location.

Where is the error location in this example? I think it may be at the '%' sign, but I frankly don't know. More to the point, neither does the Perl interpreter, though it throws out guesses all over the map.

The Immediate Error Detection Property

The error-locating ability of parsing algorithms divides them into two types. "Immediate error detection" is what you'd like to see in a parser -- the moment the parser encounters a character which makes a valid parse impossible, it knows it. Other parsers, and LALR is prominent among them, need to grind on, reading more characters. Only later do they realize something has gone wrong.

When recursive descent is full LL(1) it has the immediate error detection property. (In simplified terms, recursive descent is full LL(1) when it is smart about lookahead for empty rules.) Recursive descent implementations which are not full LL(1) do not have the immediate error detection property. This is the case with most actual implementations of recursive descent. On the other hand, recursive descent's major attraction is that it is hack-friendly, so that a recursive descent implementation which is less than full LL(1) in theory might well do immediate error detection when it counts.

Earley's algorithm has the immediate error detection property. My implementation of it in Marpa has the immediate error detection property and then some. When it is about to read a token, Marpa knows, and can inform the lexer, exactly which input symbols will result in a viable parse. This allows some very nice tricks.

Note 1: For the readers of my last post on yacc, the discussion of error reporting here changes the subject: That post dealt with the reporting of errors by a parser generator while the parser is being generated: generation-time error reporting. This post changes the topic to the reporting of errors in its input by a parser: parse-time error reporting. My next post in the yacc series will discuss yacc's parse-time error reporting.

For a hand-written parser, of course, there is no such thing as generation-time, and therefore generation-time error reporting is a non-issue. Parse-time error reporting is an issue for all parsers, whether generated or hand-written.

posted at: 23:30 | direct link to this entry

§         §         §