Thu, 24 Mar 2011
Perl and Parsing 8: The Where and Why of Rejection
Why Perl is Just Not That Into Your Syntax
In a previous post,
I noted that Perl often cannot precisely
locate syntax errors in its scripts.
Still less can it identify the exact problem.
In this post, I will demonstrate an experimental
utility which does
pinpoint Perl syntax errors,
precisely indicating where and what the problem is.
Here's my example
from the previous post.
my $lyric =
'Sloopy wears a red dress, yeah
As old as the hills
but when sloopy wears that red dress, yeah
you know it gives me the chills
Sloopy when I see you walking,
walking down the street
I say don\'t worry sloopy, girl
You belong to me'~!@$%^&*()_+;
And here is Perl's output for the error:
Bareword found where operator expected at bomb2.pl line 9, near ")_"
(Missing operator before _?)
syntax error at bomb2.pl line 9, near "You belong to me'~"
(Might be a runaway multi-line '' string starting on line 2)
Execution of bomb2.pl aborted due to compilation errors.
As I said
in that previous post,
Perl clearly has very
little idea where things went wrong -- it's guessing.
Pinpointing the Error
At this point, let me give away the ending.
The point of failure is the very first special symbol: the tilde.
When I ran my fingers from left to right across the top of my keyboard,
I was hoping to produce a more complicated example.
But perhaps it is just as well I did not.
Here is the output from my experimental Marpa-based utility,
showing the rules Perl is working on when it encounters the tilde.
The dot in the rules indicates how far the parse has already progressed.
line -> label sideff . SEMI
sideff -> expr . IF expr
sideff -> expr . UNLESS expr
sideff -> expr . WHILE expr
sideff -> expr . UNTIL iexpr
sideff -> expr . FOR expr
sideff -> expr . WHEN expr
or_expr -> or_expr . OROP and_expr
or_expr -> or_expr . DOROP and_expr
and_expr -> and_expr . ANDOP argexpr
argexpr -> argexpr . COMMA
argexpr -> argexpr . COMMA term
term_listop -> term_cond . ASSIGNOP term_listop
term_assign -> term_cond . ASSIGNOP term_assign
term_cond -> term_dotdot . QUESTION term_cond COLON term_cond
term_dotdot -> term_oror . DOTDOT term_oror
term_oror -> term_oror . OROR term_andand
term_oror -> term_oror . DORDOR term_andand
term_andand -> term_andand . ANDAND term_bitorop
term_bitorop -> term_bitorop . BITOROP term_bitandop
term_bitandop -> term_bitandop . BITANDOP term_eqop
term_eqop -> term_relop . EQOP term_relop
term_relop -> term_uniop . RELOP term_uniop
term_shiftop -> term_shiftop . SHIFTOP term_addop
term_addop -> term_addop . ADDOP term_mulop
term_mulop -> term_mulop . MULOP term_matchop
term_matchop -> term_matchop . MATCHOP term_uminus
term_powop -> term_increment . POWOP term_powop
term_increment -> term_arrow . POSTINC
term_increment -> term_arrow . POSTDEC
term_arrow -> term_arrow . ARROW method LPAREN listexprcom RPAREN
term_arrow -> term_arrow . ARROW method
subscripted -> term_hi . ARROW LSQUARE expr RSQUARE
subscripted -> term_hi . ARROW LCURLY expr SEMI RCURLY
subscripted -> term_hi . ARROW LPAREN RPAREN
subscripted -> term_hi . ARROW LPAREN expr RPAREN
The names of the symbols are based on those in perly.y.
Operators are not shown symbolically, but are indicated with the name
in caps: "POSTINC" instead of "++".
Terms are suffixed with their precedence:
"term_assign" is the symbol
for terms with the same precedence as the assignment operator.
"term_hi" is the symbol for terms at the highest precedence level.
A tilde, when it is a single-character Perl operator,
is always a prefix unary operator.
Tildes also form part of several multi-character operators,
but that is not the case here.
Here is what Perl is looking for
when it encounters the tilde:
- A semicolon that would end the statement. (Statements
are called "line"s internally by perly.y.)
- The postfix conditional and loop keywords (IF, UNLESS, WHILE and so on).
- A lot of binary operators.
- A couple of postfix unary operators.
These, if found, will produce
a compile-time complaint about incrementing a constant.
No prefix unary operator is in the above list,
and the parse fails here.
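The step from dotted rules to expected tokens can be sketched in a few lines of Perl. This is only an illustration, not the utility used for this post: it takes a handful of the rules listed above, picks out the symbol after each dot, and asks whether a given token is in that set. The token name TILDE is an assumption made for the sketch; the listing does not show what the tilde's token is actually called.

```perl
use strict;
use warnings;

# A few of the dotted rules from the listing above.
my @dotted_rules = (
    'line -> label sideff . SEMI',
    'sideff -> expr . IF expr',
    'term_addop -> term_addop . ADDOP term_mulop',
    'term_increment -> term_arrow . POSTINC',
    'term_increment -> term_arrow . POSTDEC',
);

# Collect the token that follows the dot in each rule.
# A rule whose dot is at the very end expects nothing more.
sub expected_tokens {
    my @rules = @_;
    my %expected;
    for my $rule (@rules) {
        my ( undef, $rhs ) = split /\s*->\s*/, $rule, 2;
        my @symbols = split ' ', $rhs;
        for my $i ( 0 .. $#symbols - 1 ) {
            next if $symbols[$i] ne '.';
            my $next = $symbols[ $i + 1 ];
            # Token names are written in caps in the listing.
            $expected{$next} = 1 if $next =~ /\A[A-Z]+\z/;
            last;
        }
    }
    return sort keys %expected;
}

# True if the token is one the parse can accept at this point.
sub is_expected {
    my ( $token, @rules ) = @_;
    return scalar grep { $_ eq $token } expected_tokens(@rules);
}

# 'TILDE' is a made-up token name for "~", used only for this sketch.
for my $token (qw(SEMI POSTINC TILDE)) {
    my $verdict = is_expected( $token, @dotted_rules ) ? 'accepted' : 'rejected';
    print "$token: $verdict\n";
}
```

With the full list of rules the expected set is larger, but the reasoning is the same: the tilde is not in it, so the parse is rejected at that token.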
About the Utility Used in This Post
Finding the exact point of failure and the exact
reasons would seem like something that you'd want in a parser.
But in fact, production languages have tended to be like Perl --
they try to indicate the general area of a syntax problem
and to make a good guess as to its nature.
But they leave it to the programmer to figure out exactly where they failed.
Marpa, then, is unusual, in that for any grammar you can write in BNF,
and any input, it will either produce a parse, or a precise characterization
of the failure.
My utility is still experimental and under development.
As I tackle tasks (like preparing this post, for example)
I add the necessary capabilities.
An example of what my utility cannot yet do is deal with
floating point constants.
(They're not hard, I just haven't encountered them yet in a test case.)
My original intent with Marpa::XS::Perl was to use it for snippets,
and for academic and toy examples,
and it cannot yet deal with production Perl.
My purpose so far
has been to demonstrate that Marpa could be the basis
of a practical Perl parsing utility.
Of course, in one sense, the exact nature of the problem depends
on what the person writing the script intended, and on this
my utility has not a clue.
In this post, "finding the exact problem" means finding the exact
location of a parse failure, and finding exactly what perl
was looking for when
perl did not find what
perl wanted to find.
In determining the "exact location of parse failure",
I also avoid mind-reading.
I use a definition taken from the parsing literature:
In a rejected token stream,
the point of failure is the first token
which made a successful parse impossible.
In other words, if you encounter a token which cannot possibly
be part of a successful parse, given the input you've already
read, that token is the point of failure.
Looking at it from the opposite point of view,
if you can find some additional
input that makes the parse succeed,
you have not yet found a point of failure.
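To make the definition concrete, here is a minimal sketch for a toy language, far simpler than Perl: strings of balanced parentheses. The language and the function name point_of_failure are assumptions chosen for illustration. An unmatched ")" is a point of failure, because no later input can repair it. An unclosed "(" is not, because more input could still complete the parse.

```perl
use strict;
use warnings;

# Toy language: token streams of balanced "(" and ")".
# Returns the index of the first token that makes a successful
# parse impossible, or undef if there is no such token.
sub point_of_failure {
    my @tokens = @_;
    my $depth  = 0;
    for my $i ( 0 .. $#tokens ) {
        if ( $tokens[$i] eq '(' ) {
            $depth++;
        }
        elsif ( $tokens[$i] eq ')' and $depth > 0 ) {
            $depth--;
        }
        else {
            # An unmatched ")" or an unknown token:
            # no additional input can rescue the parse.
            return $i;
        }
    }
    # Every token so far can still be part of a successful parse.
    return;
}

for my $input ( '( ( )', '( ) ) (' ) {
    my $at = point_of_failure( split ' ', $input );
    printf "%-8s %s\n", $input,
        defined $at ? "fails at token $at" : 'no point of failure yet';
}
```

The first input is incomplete but not rejected: adding one more ")" makes it succeed. The second is rejected at its third token (index 2), whatever follows.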
To be precise, the output in this post was automatically generated
by my utility, then edited for readability.
Specifically, the edits removed
those lines which were for rules with the dot
at the end,
and removed rule numbers and token
numbers from the beginnings of the lines.
Since rules with the dot at the end are completed, they do not
generate any expectations for future tokens,
and are irrelevant here.
Similarly, in this context
the internal rule numbers and location numbers would be clutter.
While I made these readability edits by hand, they were rote
and could easily have been automated.
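As a rough idea of what that automation might look like, here is a sketch. The raw output format is not shown in this post, so the input format below is an assumption: each line is taken to start with a rule number and a token number, as plain integers, followed by the dotted rule. The sample lines and the function name clean_progress_lines are likewise made up for illustration.

```perl
use strict;
use warnings;

# ASSUMED raw format: "<rule number> <token number> <dotted rule>".
# Strips the two leading numbers and drops completed rules,
# which are the ones with the dot at the end.
sub clean_progress_lines {
    my @lines = @_;
    my @kept;
    for my $line (@lines) {
        ( my $rule = $line ) =~ s/\A\s*\d+\s+\d+\s+//;
        $rule =~ s/\s+\z//;
        next if $rule =~ /\s\.\z/;    # dot at the end: nothing expected
        push @kept, $rule;
    }
    return @kept;
}

# Invented sample lines in the assumed format.
my @raw = (
    '12 9 term_increment -> term_arrow . POSTINC',
    '30 9 term_arrow -> term_hi .',
    '3 9 line -> label sideff . SEMI',
);

print "$_\n" for clean_progress_lines(@raw);
```

The real format would need a different pattern for the leading fields, but the two steps are the ones described above.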
Unlike in textbook BNF,
the BNF in perly.y does not have a separate symbol for terms
of each precedence.
The BNF in
perly.y is wildly ambiguous,
unlike the Perl language.
perly.y uses a tie-breaking technique,
in combination with the BNF,
to assign precedence.
While some use of this kind of tie-breaking is standard in yacc,
for the Perl parser, Larry used it far more boldly than had been the practice before.
Or for that matter, has been since.
This is a very important aspect of Perl parsing, one on which I've been planning to post.
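For readers who have not seen the idea before, here is a small sketch of how a precedence table can disambiguate a grammar that is ambiguous on its own, such as "expr -> expr OP expr". It uses precedence climbing, which is a different technique from the tie-breaking yacc does, and is swapped in here only because it fits in a few lines. The table, the operators, and the function name parse_expr are all assumptions for illustration; none of this is taken from perly.y.

```perl
use strict;
use warnings;

# Hypothetical precedence table: higher binds tighter.
# It plays roughly the role that precedence declarations play in yacc.
my %PREC = (
    '+'  => [ 1, 'left' ],
    '-'  => [ 1, 'left' ],
    '*'  => [ 2, 'left' ],
    '/'  => [ 2, 'left' ],
    '**' => [ 3, 'right' ],
);

# Parse a flat token list (operands and binary operators) into a
# fully parenthesized string, consuming tokens from the array ref.
sub parse_expr {
    my ( $tokens, $min_prec ) = @_;
    my $left = shift @{$tokens};    # an operand
    while ( @{$tokens}
        and exists $PREC{ $tokens->[0] }
        and $PREC{ $tokens->[0] }[0] >= $min_prec )
    {
        my $op = shift @{$tokens};
        my ( $prec, $assoc ) = @{ $PREC{$op} };
        # Left-associative operators refuse an equal-precedence
        # operator on their right; right-associative ones accept it.
        my $next_min = $assoc eq 'left' ? $prec + 1 : $prec;
        my $right = parse_expr( $tokens, $next_min );
        $left = "($left $op $right)";
    }
    return $left;
}

for my $input ( '1 + 2 * 3', '1 - 2 - 3', '2 ** 3 ** 2' ) {
    print "$input => ", parse_expr( [ split ' ', $input ], 1 ), "\n";
}
```

The grammar here has a single expression symbol, and the table alone decides how the operators group. A textbook grammar would instead spell out one symbol per precedence level, as the "term_" symbols in the listing do.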
posted at: 13:21