Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest
Using Marpa's facilities for error reporting, a quickly written domain-specific language can, from its first draft, have error reporting whose helpfulness and precision exceed those of carefully hand-crafted production compilers. This post will show how, with an example.
Two techniques will be used. First and most basic, Marpa's knowledge of the point at which the parse can no longer proceed is 100% accurate and immediate. This is not the case with yacc-derived parsers, and is not the case with most recursive descent parsers.
However, even Marpa's 100% accuracy in pinpointing the problem location is only accuracy in the technical sense -- it cannot take into account what the programmer intended. A second technique allows the programmer to double-check his intentions against what the parser has actually seen. Marpa can tell the programmer exactly how it thinks the input parsed, up to the point at which it could no longer proceed. The Marpa parser can report the answer to questions like
"What was the last statement you successfully parsed?"
"What was the last expression you successfully parsed?"
"What was the last arithmetic expression you successfully parsed?"
"Where did the last successfully parsed block start? End?"
To focus on the logic of the error reporting, I looked for a language that was error-prone, but extremely simple. For this purpose, prefix arithmetic is like a gift from the dakinis. It is almost trivial in concept, and almost impossible to get right when it is more than a few characters long.
Two valid strings in this language are "say + 1 2" and "+++ 1 2 3 + + 1 2 4". Their results are, in order, 3 and 13.
I restricted the calculator to addition, because even with one operator, prefix notation is more than confusing enough to serve our purposes. I have included an optional say keyword, in order to illustrate rejection of a token by type. In pure prefix arithmetic, either all tokens are valid or none are. The say keyword is only valid as the first token.
The full code for this post is in a Github gist. It was run using a release candidate for the full release of Marpa::R2. Here is the grammar.
my $prefix_grammar = Marpa::R2::Grammar->new(
    {   start          => 'Script',
        actions        => 'My_Actions',
        default_action => 'do_arg0',
        rules          => [ <<'END_OF_RULES' ]
Script ::= Expression
    | kw_say Expression action => do_arg1
Expression ::= Number
    | op_add Expression Expression action => do_add
END_OF_RULES
    }
);
The rules are specified in another DSL, of the kind I've used in previous posts. This one is incorporated in Marpa::R2 itself, and is documented here. Here are its features relevant to this example:
The "action => do_add" adverb indicates that the semantics for the alternative are in the Perl closure named do_add.
The rest of the grammar's definition will be familiar to Marpa users. Script is the start symbol, the Perl closures implementing semantics are to be found in the My_Actions package, and where no semantics are explicitly specified, the Perl closure do_arg0 is the default.
The semantics for this example are easy.
sub My_Actions::do_add  { shift; return $_[1] + $_[2] }
sub My_Actions::do_arg0 { shift; return shift; }
sub My_Actions::do_arg1 { shift; return $_[1]; }
The first argument to a Marpa semantic closure is a "per-parse variable", which is not used in this application. The other arguments are the values of the child nodes, as determined recursively and in lexical order.
In this post, I am skipping around in the code -- the full code is in the gist. But lexical analysis is of particular interest to new Marpa users. The lexer I use for this example is overkill -- table-driven and using Perl's progressive matching capabilities, it is capable of serving a much more complex language. (I talked about lexing more in a previous example.) Here is the lexing table:
my @terminals = (
    [ Number => qr/\d+/xms,   'Number' ],
    [ op_add => qr/[+]/xms,   'Addition operator' ],
    [ kw_say => qr/say\b/xms, qq{"say" keyword} ],
);
The lexing table is an array of 3-element arrays. Each sub-array contains a symbol name, a regular expression used to recognize that symbol, and a "long name" -- a human-readable name more appropriate for error messages than the symbol name. For some languages, the order of entries in the lexing table may be significant, although for this language it makes no difference.
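To see why ordering could matter in a richer language, imagine a table with both a say keyword and a general identifier pattern. This Python sketch (the terminal names and patterns here are invented for illustration, not taken from the post's table) shows that whichever entry is tried first wins:

```python
import re

# Hypothetical lexing table: a keyword and an identifier pattern that overlap.
terminals = [
    ('kw_say',     re.compile(r'say\b')),
    ('identifier', re.compile(r'[a-z]\w*')),
]

def match_first(s, pos=0):
    # Try each table entry in order; the first entry that matches wins.
    for name, regex in terminals:
        m = regex.match(s, pos)
        if m:
            return name, m.group()
    return None

print(match_first('say + 1 2'))  # ('kw_say', 'say')
# With the two entries reversed, 'say' would instead lex as an identifier.
```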
Before plunging into the error-handling code, I will describe the forms parsing errors take and the messages they produce.
The lexer may reach a point in the input where it does not find any of the allowed tokens. An example in this language would be an input with an exclamation point. There is no need to say much about this kind of error, which has always been relatively easy to diagnose, pinpoint and, usually, fix.
In some cases the lexer finds a token, but it is not one that the parser will accept at that point, so the parser rejects the token. An example for this language would be the input "+ 1 say 2", which causes the following diagnostic:
Last expression successfully parsed was: 1
A problem occurred here: say 2
Parser rejected token ""say" keyword"
Marpa successfully determined that "1" is a valid expression of the language, but "+ 1" is not.
In other cases, the parser may "dead end" -- reach a point where no more input can be accepted. One example is with the input "+ 1 2 3 + + 1 2 4". This causes the following diagnostic:
Last expression successfully parsed was: + 1 2
The parse became exhausted here: " 3 + + 1 2 4"
The parser has completed a prefix expression. Unlike infix and postfix expressions, once a prefix expression has been allowed to end there is no way to "extend" or "restart" it. The parse is "exhausted".
A second example of an exhausted parse occurs with the input "1 + 2 +3 4 + 5 + 6 + 7". Here is the diagnostic:
Last expression successfully parsed was: 1
The parse became exhausted here: " + 2 +3 4 + 5 + 6 + 7"
Finally, it may happen that lexer and parser read and accept the entire input, but do not find a valid parse in it. For example, if the input is "+++", the diagnostic will be:
No expression was successfully parsed
No parse was found, after reading the entire input
The input was a good start for a prefix expression, but no numbers were ever found, and our DSL reports that it never recognized any prefix expressions.
A more complicated case is this input: "++1 2++". Here is what our DSL tells us:
Last expression successfully parsed was: +1 2
No parse was found, after reading the entire input
Our DSL did find a good expression, and tells us where it was. If there is more than one good expression, our DSL tells us the most recent. With input "++1 2++3 4++", the diagnostic becomes
Last expression successfully parsed was: +3 4
No parse was found, after reading the entire input
In fact, if we thought it would be helpful our DSL could show all the expressions found, or the last N expressions for some N. This is a simple language with nothing but expressions involving a single operator. More interesting languages will have statements and blocks, and layers of subexpressions. The logic below can be straightforwardly modified to show us as much about these as we think will be helpful.
sub my_parser {
    my ( $grammar, $string ) = @_;
    my @positions = (0);
    my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );
    my $self = bless {
        grammar   => $grammar,
        input     => \$string,
        recce     => $recce,
        positions => \@positions
    }, 'My_Error';
    my $length = length $string;
    pos $string = $positions[-1];

    ... "Reading the tokens" goes here ...

    my $value_ref = $recce->value;
    if ( not defined $value_ref ) {
        die $self->show_last_expression(), "\n",
            "No parse was found, after reading the entire input\n";
    }
    return ${$value_ref};
} ## end sub my_parser
The routine above takes a grammar and an input string, and either produces a parse value or a diagnostic telling us exactly why it could not. For truly helpful diagnostics, I find it necessary to be able to quote the input exactly. The @positions array will be used to map the locations that the Marpa parser uses back to positions in the original input string. Marpa location 0 is always before any input symbol, so the array is initialized with string position 0.
The $self object is a convenience. It collects the information the error handler needs, and allows an elegant syntax for the error-handling calls.
The loop for reading tokens will be described below. After it, but before the return, is our first error check. "No parse" errors show up after all the tokens have been read, when the $recce->value() call returns a Perl undef. In that case, we produce the message we showed above. The tricky details are hidden in the show_last_expression() method, which we will come to.
TOKEN: while ( pos $string < $length ) {
    next TOKEN if $string =~ m/\G\s+/gcxms;    # skip whitespace
    if ( $recce->exhausted() ) {
        die $self->show_last_expression(), "\n",
            q{The parse became exhausted here: "},
            $self->show_position( $positions[-1] ), qq{"\n};
    } ## end if ( $recce->exhausted() )

    ... "Looping through the lexing table" goes here ...

    die 'A problem occurred here: ',
        $self->show_position( $positions[-1] ), "\n",
        q{No valid token was found};
} ## end TOKEN: while ( pos $string < $length )
This loop implements part of our progressive matching within $string, and contains two of our four error checks. The exhausted() method checks whether the parse is exhausted; again, the hard work is done by the show_last_expression() method.
If we get through the lexing table without finding a token, we produce an invalid token message and report the position using the show_position() method. For invalid tokens, position should be all that the user needs to know. Position is also reported in the case of an exhausted parse. Implementation of the show_position() method presents no difficulties -- the code can be found in the gist.
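The actual show_position() is in the gist; the idea can be sketched in a few lines. This Python version is my paraphrase, not the post's code, and the truncation limit is an assumption made for the sketch:

```python
MAX_SHOWN = 72  # assumed truncation limit, for illustration only

def show_position(input_str, position):
    # Quote the input from the problem position onward,
    # truncating long tails so the diagnostic stays readable.
    tail = input_str[position:]
    return tail if len(tail) <= MAX_SHOWN else tail[:MAX_SHOWN] + '...'

# With the exhausted-parse example from above, position 5 reproduces
# the quoted text of that diagnostic (note the leading space):
print(show_position('+ 1 2 3 + + 1 2 4', 5))  # " 3 + + 1 2 4"
```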
TOKEN_TYPE: for my $t (@terminals) {
    my ( $token_name, $regex, $long_name ) = @{$t};
    next TOKEN_TYPE if not $string =~ m/\G($regex)/gcxms;
    if ( defined $recce->read( $token_name, $1 ) ) {
        my $latest_earley_set_ID = $recce->latest_earley_set();
        $positions[$latest_earley_set_ID] = pos $string;
        next TOKEN;
    }
    die $self->show_last_expression(), "\n",
        'A problem occurred here: ',
        $self->show_position( $positions[-1] ), "\n",
        qq{Parser rejected token "$long_name"\n};
} ## end TOKEN_TYPE: for my $t (@terminals)
Our innermost loop is through the lexing table, checking each table entry against the input string. If a match is found, the Marpa recognizer's read() method is called. This may fail due to our fourth and last type of error: a rejected token. Again, show_position() reports position and show_last_expression() does the interesting stuff.
sub My_Error::show_last_expression {
    my ($self) = @_;
    my $last_expression =
        $self->input_slice( $self->last_completed_range('Expression') );
    return
        defined $last_expression
        ? "Last expression successfully parsed was: $last_expression"
        : 'No expression was successfully parsed';
} ## end sub My_Error::show_last_expression
At its top level, show_last_expression() finds the parse locations of the last completed Expression symbol, using the last_completed_range() method. (In Marpa, as in other Earley parsers, a symbol or rule that has been recognized from start to finish is said to be "completed".) The parse locations are passed to the input_slice() method, which translates them into the corresponding substring of the input string.
sub My_Error::input_slice {
    my ( $self, $start, $end ) = @_;
    my $positions = $self->{positions};
    return if not defined $start;
    my $start_position = $positions->[$start];
    my $length         = $positions->[$end] - $start_position;
    return substr ${ $self->{input} }, $start_position, $length;
} ## end sub My_Error::input_slice
The last_completed_range() method does the complicated part of the error handling -- finding the last successfully recognized ("completed") occurrence of a symbol. The last_completed_range() method does not use any internals, but it certainly gets technical in its use of the external methods. It or something like it is a prime candidate to be folded into the Marpa interface someday.
Successful recognitions of a symbol are called, again following standard Earley parsing terminology, "completions". Completions are recorded by rule, so the first thing that must be done is to turn the symbol name into a list of those rules which have that symbol on their left hand side. These are called the @sought_rules. We also need to initialize the loop by recording the last parse location ("latest Earley set"). $earley_set will be our loop variable.
sub My_Error::last_completed_range {
    my ( $self, $symbol_name ) = @_;
    my $grammar      = $self->{grammar};
    my $recce        = $self->{recce};
    my @sought_rules = ();
    for my $rule_id ( $grammar->rule_ids() ) {
        my ($lhs) = $grammar->rule($rule_id);
        push @sought_rules, $rule_id if $lhs eq $symbol_name;
    }
    die "Looking for completion of non-existent rule lhs: $symbol_name"
        if not scalar @sought_rules;
    my $latest_earley_set = $recce->latest_earley_set();
    my $earley_set        = $latest_earley_set;

    ... "Traversing the Earley sets" goes here ...

    return if $earley_set < 0;
    return ( $first_origin, $earley_set );
} ## end sub My_Error::last_completed_range
Once we have traversed the Earley sets, we need only return the appropriate value. If the Earley set number fell below 0, we never found any completions of the "sought rules", a circumstance which we report with a bare return statement. Otherwise, $first_origin and $earley_set will be set to the first and last parse locations of the completion, and we return them.
This is our final code sample, and the buck stops here. Marpa::R2 introduced more detailed user access to the progress reporting information, and that interface is used here.
We traverse the Earley sets in reverse order, beginning with the latest and going back, if necessary, to Earley set 0. For each Earley set, there are "progress items", reports of the progress as of that Earley set. Of these, we are interested only in completions, which have a "dot position" of -1. (Those interested in a fuller explanation of "dot positions", progress items, and progress reports can look in the documentation for progress reports.) Of the completions, we are interested only in those for one of the @sought_rules.
For any given set of sought rules, more than one completion might end at a given Earley set. Usually we are most interested in the longest of these, and this logic assumes that we are interested only in the longest completion. We check whether the start of the completion (its "origin") is prior to our current match, and if so it becomes our new $first_origin.
$first_origin was initialized to a non-existent Earley set, higher in number than any actual one. Once out of the loop through the progress items, we check whether $first_origin is still at its initialized value. If so, we need to iterate backward one more Earley set. If not, we are done, and $first_origin and $earley_set contain the information that we were looking for -- the start and end locations of the most recent longest completion of one of the @sought_rules.
my $first_origin = $latest_earley_set + 1;
EARLEY_SET: while ( $earley_set >= 0 ) {
    my $report_items = $recce->progress($earley_set);
    ITEM: for my $report_item ( @{$report_items} ) {
        my ( $rule_id, $dot_position, $origin ) = @{$report_item};
        next ITEM if $dot_position != -1;
        next ITEM if not scalar grep { $_ == $rule_id } @sought_rules;
        next ITEM if $origin >= $first_origin;
        $first_origin = $origin;
    } ## end ITEM: for my $report_item ( @{$report_items} )
    last EARLEY_SET if $first_origin <= $latest_earley_set;
    $earley_set--;
} ## end EARLEY_SET: while ( $earley_set >= 0 )
Comments on this post can be sent to the Marpa Google Group:
marpa-parser@googlegroups.com
posted at: 12:30