Thu, 24 Mar 2011

Perl and Parsing 8: The Where and Why of Rejection

Why Perl is Just Not That Into Your Syntax

In a previous post, I noted that Perl often cannot precisely locate syntax errors in its scripts. Still less can it identify the exact problem. In this post, I will demonstrate an experimental utility which does pinpoint Perl syntax errors, precisely indicating where and what the problem is.

Here's my example from the previous post.


my $lyric =
'Sloopy wears a red dress, yeah
As old as the hills
but when sloopy wears that red dress, yeah
you know it gives me the chills
Sloopy when I see you walking, 
walking down the street
I say don\'t worry sloopy, girl
You belong to me'~!@$%^&*()_+;
print "$lyric\n";

And here is Perl's output for the error:


Bareword found where operator expected at bomb2.pl line 9, near ")_"
        (Missing operator before _?)
syntax error at bomb2.pl line 9, near "You belong to me'~"
  (Might be a runaway multi-line '' string starting on line 2)
Execution of bomb2.pl aborted due to compilation errors.

As I said in that previous post, perl clearly has very little idea where things went wrong -- it's guessing.

Pinpointing the Error

At this point, let me give away the ending. The point of failure is the very first special symbol: the tilde. When I ran my fingers from left to right across the top of my keyboard, I was hoping to produce a more complicated example. But perhaps it is just as well I did not.

Here, from my experimental Marpa-based utility, is what Perl is looking for when it encounters the tilde. The dot in each rule indicates how far the parse has already progressed.

line -> label sideff . SEMI
sideff -> expr . IF expr
sideff -> expr . UNLESS expr
sideff -> expr . WHILE expr
sideff -> expr . UNTIL iexpr
sideff -> expr . FOR expr
sideff -> expr . WHEN expr
or_expr -> or_expr . OROP and_expr
or_expr -> or_expr . DOROP and_expr
and_expr -> and_expr . ANDOP argexpr
argexpr -> argexpr . COMMA
argexpr -> argexpr . COMMA term
term_listop -> term_cond . ASSIGNOP term_listop
term_assign -> term_cond . ASSIGNOP term_assign
term_cond -> term_dotdot . QUESTION term_cond COLON term_cond
term_dotdot -> term_oror . DOTDOT term_oror
term_oror -> term_oror . OROR term_andand
term_oror -> term_oror . DORDOR term_andand
term_andand -> term_andand . ANDAND term_bitorop
term_bitorop -> term_bitorop . BITOROP term_bitandop
term_bitandop -> term_bitandop . BITANDOP term_eqop
term_eqop -> term_relop . EQOP term_relop
term_relop -> term_uniop . RELOP term_uniop
term_shiftop -> term_shiftop . SHIFTOP term_addop
term_addop -> term_addop . ADDOP term_mulop
term_mulop -> term_mulop . MULOP term_matchop
term_matchop -> term_matchop . MATCHOP term_uminus
term_powop -> term_increment . POWOP term_powop
term_increment -> term_arrow . POSTINC
term_increment -> term_arrow . POSTDEC
term_arrow -> term_arrow . ARROW method LPAREN listexprcom RPAREN
term_arrow -> term_arrow . ARROW method
subscripted -> term_hi . ARROW LSQUARE expr RSQUARE
subscripted -> term_hi . ARROW LCURLY expr SEMI RCURLY
subscripted -> term_hi . ARROW LPAREN RPAREN
subscripted -> term_hi . ARROW LPAREN expr RPAREN

The names of the symbols are based on those in perly.y. Operators are not shown symbolically, but are indicated with the name in caps: "POSTINC" instead of "++". Terms are suffixed with their precedence: "term_assign" is the symbol for terms with the same precedence as the assignment operator. "term_hi" is the symbol for terms at the highest precedence level.
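To make the dot notation concrete, here is a minimal sketch of how a list of "expected" symbols can be read off dotted rules: the expected symbol is simply whatever follows the dot. This is not the code of the utility itself; the subroutine name expected_symbols and the choice of sample rules are assumptions made for illustration.

```perl
use strict;
use warnings;

# A few dotted rules copied from the listing above.
my @dotted_rules = (
    'line -> label sideff . SEMI',
    'sideff -> expr . IF expr',
    'term_increment -> term_arrow . POSTINC',
    'term_addop -> term_addop . ADDOP term_mulop',
    'argexpr -> argexpr . COMMA',
    'argexpr -> argexpr . COMMA term',
);

# Hypothetical helper: return the sorted, de-duplicated symbols that
# come right after the dot. A rule whose dot is at the very end is
# complete and predicts nothing.
sub expected_symbols {
    my @rules = @_;
    my %seen;
    for my $rule (@rules) {
        my ( $lhs, $rhs ) = split /\s*->\s*/, $rule, 2;
        next if not defined $rhs;
        my @symbols = split ' ', $rhs;
        for my $i ( 0 .. $#symbols - 1 ) {
            if ( $symbols[$i] eq '.' ) {
                $seen{ $symbols[ $i + 1 ] } = 1;
                last;
            }
        }
    }
    my @sorted = sort keys %seen;
    return @sorted;
}

my @expected = expected_symbols(@dotted_rules);
print join( q{ }, @expected ), "\n";    # prints: ADDOP COMMA IF POSTINC SEMI
```

Run over the full listing, the same idea yields the complete set of tokens that would have been acceptable in place of the tilde.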

A tilde, when it is a single-character Perl operator, is always a prefix unary operator. Tildes also form part of several multi-character operators, but that is not the case here. Here is what Perl is looking for when it encounters the tilde:

A semicolon that would end the statement. (perly.y internally calls a statement a "line".)
The postfix conditional keywords (IF, UNLESS, WHILE, UNTIL, FOR and WHEN in the listing above).
A lot of binary operators.
A couple of postfix unary operators. These, if found, will produce a compile-time complaint about incrementing a constant.

No prefix unary operator is in the above list, so the parse fails here.

About the Utility Used in This Post

Finding the exact point of failure and the exact reasons would seem like something that you'd want in a parser. But in fact, production languages have tended to be like Perl -- they try to indicate the general area of a syntax problem and to make a good guess as to its nature, and they leave it to the programmer to figure out exactly where they failed and why.

Marpa, then, is unusual, in that for any grammar you can write in BNF, and any input, it will either produce a parse, or a precise characterization of the failure. Marpa::XS::Perl is still experimental and under development. As I tackle tasks (like preparing this post, for example) I add the necessary capabilities. An example of what my utility cannot yet do is deal with floating point constants. (They're not hard, I just haven't encountered them yet in a test case.)

My original intent with Marpa::XS::Perl was to use it for snippets, and for academic and toy examples, and it cannot yet deal with production Perl code. My purpose so far has been to demonstrate that Marpa could be the basis of a practical Perl parsing utility.

Notes

Note 1: Of course, in one sense, the exact nature of the problem depends on what the person writing the script intended, and on this my utility has not a clue. In this post, "finding the exact problem" means finding the exact location of a parse failure, and finding exactly what perl was looking for when perl did not find what perl wanted to find.

In determining the "exact location of parse failure", I also avoid mind-reading. I use a definition taken from the parsing literature: In a rejected token stream, the point of failure is the first token which made a successful parse impossible. In other words, if you encounter a token which cannot possibly be part of a successful parse, given the input you've already read, that token is the point of failure. Looking at it from the opposite point of view, if you can find some additional input that makes the parse succeed, you have not yet found a point of failure.
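The definition can be illustrated with a toy recognizer. The sketch below is an assumption-laden stand-in: its three-state table describes a made-up miniature statement grammar, not perly.y, and it is not how Marpa works internally. It only shows the idea of reporting the first token for which no continuation exists, together with what would have been accepted instead.

```perl
use strict;
use warnings;

# Hypothetical toy grammar, NOT perly.y:
#   statement ::= operand ( BINOP operand )* SEMI
#   operand   ::= PREFIX_OP* TERM POSTFIX_OP*
# Each state maps the token types it accepts to the next state.
my %transitions = (
    want_term  => { TERM => 'after_term', PREFIX_OP => 'want_term' },
    after_term => {
        BINOP      => 'want_term',
        POSTFIX_OP => 'after_term',
        SEMI       => 'done',
    },
    done => {},
);

# Returns undef while the input read so far could still be extended
# into a successful parse. Otherwise returns a hash ref describing
# the first token that made success impossible.
sub point_of_failure {
    my @tokens = @_;
    my $state  = 'want_term';
    for my $position ( 0 .. $#tokens ) {
        my $token = $tokens[$position];
        my $next  = $transitions{$state}{$token};
        if ( not defined $next ) {
            return {
                position => $position,
                token    => $token,
                expected => [ sort keys %{ $transitions{$state} } ],
            };
        }
        $state = $next;
    }
    return;
}

# Simplified token types for:  $lyric = '...' ~ ! ...
my $failure = point_of_failure(qw(TERM BINOP TERM PREFIX_OP PREFIX_OP));
printf "rejected %s at token %d; expected one of: %s\n",
    $failure->{token}, $failure->{position},
    join q{, }, @{ $failure->{expected} };
# prints: rejected PREFIX_OP at token 3; expected one of: BINOP, POSTFIX_OP, SEMI
```

Note that an unfinished statement such as TERM BINOP TERM returns undef: more input could still complete it, so by the definition above no point of failure has been found yet.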

Note 2: To be precise, the output in this post was automatically generated by my utility, then edited for readability. Specifically, the edits removed those lines which were for rules with the dot at the end, and removed rule numbers and token numbers from the beginnings of the lines. Since rules with the dot at the end are completed, they do not generate any expectations for future tokens, and are irrelevant here. Similarly, in this context the internal rule numbers and location numbers would be clutter. While I made these readability edits by hand, they were rote and could easily have been automated.
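As a rough idea of what automating those edits might look like, here is a minimal sketch. The raw line format is an assumption: the utility's actual raw output is not shown in this post, so the sample lines below are made up, with two leading integers standing in for the rule number and token number. The subroutine name tidy is likewise hypothetical.

```perl
use strict;
use warnings;

# Made-up raw lines. Assumed layout: rule number, token number, rule.
my @raw = (
    '41 9 sideff -> expr . IF expr',
    '187 9 argexpr -> argexpr COMMA .',
    '3 9 line -> label sideff . SEMI',
);

# Hypothetical filter: strip leading numeric fields, then drop rules
# whose dot is at the end, since completed rules predict nothing.
sub tidy {
    my @lines = @_;
    my @kept;
    for my $line (@lines) {
        ( my $rule = $line ) =~ s/\A\s*(?:\d+\s+)+//;
        next if $rule =~ /\s\.\s*\z/;
        push @kept, $rule;
    }
    return @kept;
}

print "$_\n" for tidy(@raw);
```

With these sample lines, the two rules that still have symbols after the dot are printed without their leading numbers, and the completed argexpr rule is dropped.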

Note 3: Unlike in textbook BNF, the BNF in perly.y does not have a separate symbol for terms of each precedence. The BNF in perly.y is wildly ambiguous, unlike the Perl language. perly.y uses a tie-breaking technique, in combination with the BNF, to assign precedence. While some use of this kind of tie-breaking is standard in yacc, for the Perl parser, Larry used it far more boldly than had been the practice before or, for that matter, has been since. This is a very important aspect of Perl parsing, one on which I've been planning to post.

posted at: 13:21