Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest
The Ocean of Awareness blog: home page, chronological index, and annotated index.
In a previous post, I described a method of writing powerful domain-specific languages (DSLs), one that was simpler and faster than previous approaches. This post takes things significantly further.
The approach described in the previous post was not itself directly DSL-based, and it required the programmer to write a separate lexer. This post uses Marpa::R2's new Scanless interface. The Scanless interface is a DSL for writing DSL's and it incorporates the specification of the lexer into the language description.
When it comes to dealing with a programming problem, no tool is as powerful and flexible as a custom language targeted to the problem domain. But writing a domain specific language (DSL) is among the least used approaches, and for what has been a very good reason -- in the past, DSL's have been very difficult to write.
This post takes a tutorial approach. It does not assume knowledge of the previous tutorials on this blog.
The full code for this post is in a Github gist. Our example DSL is a calculator, one whose features are chosen for the purpose of illustration. It is not a "toy" example -- its error reporting is quite good and it has a test suite. Nonetheless, it is both short and easy to read, capable of being written quickly and maintained and extended easily.
The grammar for our calculator divides naturally into two parts. Here is the first:
:start ::= script script ::= expression script ::= script ';' expression action => do_arg2 <reduce op> ::= '+' | '-' | '/' | '*' expression ::= number | variable action => do_is_var | '(' expression ')' assoc => group action => do_arg1 || '-' expression action => do_negate || expression '^' expression action => do_caret assoc => right || expression '*' expression action => do_star | expression '/' expression action => do_slash || expression '+' expression action => do_plus | expression '-' expression action => do_minus || expression ',' expression action => do_array || <reduce op> 'reduce' expression action => do_reduce || variable '=' expression action => do_set_var
The format of the grammar is documented here. It consists of a series of rules. Each rule has a left hand side (LHS) and a right hand side (RHS), which are separated by a rule operator. In the rules above, the rule operator is the BNF operator (::=).
The first rule is a pseudo-rule -- its LHS is the pseudo-symbol :start, and indicates that script is the grammar's start symbol. The next two rules indicate that script is a series of one or more expression's, separated by a semicolon.
Rules can have action "adverbs" to describe the semantics. For example, the adverb "action => do_args" says that the semantics for the preceding RHS are implemented by a Perl closure named do_args.
The rule for <reduce op> introduces two new features: symbols names in angle brackets, and alternatives, separated by a veritcal bar, ("|").
The last and longest rule, defined an expression, is a precedence rule. It is a series of alternatives, some separated by a single vertical bar, and others separated by a double vertical bar ("||"). The double vertical bar indicates that the alternatives after it are at a looser ("lower") precedence than the alternatives before it. The single vertical bar separates alternatives at the same precedence level.
While Marpa's Scanless interface allows lexical and structural rules to be intermixed, it is usually convenient to have the lexical rules come after the structural rules:
number ~ [\d]+ variable ~ [\w]+ :discard ~ whitespace whitespace ~ [\s]+ # allow comments :discard ~ <hash comment> <hash comment> ~ <terminated hash comment> | <unterminated final hash comment> <terminated hash comment> ~ '#' <hash comment body> <vertical space char> <unterminated final hash comment> ~ '#' <hash comment body> <hash comment body> ~ <hash comment char>* <vertical space char> ~ [\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}] <hash comment char> ~ [^\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}] END_OF_GRAMMAR
Rules in this second set of rules have the same syntax as rules in the first set, but instead of the BNF operator (::=), they have a match operator (~) separating the LHS and RHS. The BNF operator can be seen as telling Marpa, "When it comes to whitespace and comments, do what I mean". The match operator tells Marpa to "Do exactly what I say on a literal, character-by-character basis."
The first two lines indicate how number's and variable's are formed. The square bracketed character classes accept anything acceptable to Perl. :discard is another pseudo-symbol -- any lexeme recognized as a :discard symbol is thrown away.
This is how whitespace and comments are dealt with. Note that our calculator recognizes "hash comments", and takes some care to do the right thing even when the hash comment is at the end of a string which does not end in vertical whitespace. It is interesting to compare the representation of hash comments here with the usual regular expression notation. Regular expressions are much more concise, but the BNF-ish form can be easier to read. In this example, long descriptive angle-bracketed symbol names save the reader the trouble of puzzling out the purpose of some of the more obscure cases.
Now that we have defined the grammar, we need to pre-process it:
my $grammar = Marpa::R2::Scanless::G->new( { action_object => 'My_Actions', default_action => 'do_arg0', source => \$rules, } );
The action_object named argument specifies a package to implement the semantics -- Marpa will look up the names of the Perl closures in that package. The default_action named argument specified the action name for RHS's which do not explicitly specify one with an action adverb.
The calculate() closure uses our grammar to parse a string.
sub calculate { my ($p_string) = @_; my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } ); my $self = bless { grammar => $grammar }, 'My_Actions'; $self->{recce} = $recce; $self->{symbol_table} = {}; local $My_Actions::SELF = $self; if ( not defined eval { $recce->read($p_string); 1 } ) { # Add last expression found, and rethrow my $eval_error = $EVAL_ERROR; chomp $eval_error; die $self->show_last_expression(), "\n", $eval_error, "\n"; } ## end if ( not defined eval { $recce->read($p_string); 1 }) my $value_ref = $recce->value(); if ( not defined $value_ref ) { die $self->show_last_expression(), "\n", "No parse was found, after reading the entire input\n"; } return ${$value_ref}, $self->{symbol_table}; } ## end sub calculate
Walking through the code, we first create a recognizer ("recce" for short) from our grammar. Next, we define a parse object named "$self". (Object enthusiasts will, I hope, forgive a certain awkwardness at this stage.)
Next, we call the read() method on the recognizer with our string. We then check the result of the read() method for errors.
Finally, we return our results. This calculator allows variables, whose values it keeps in a symbol table. Since these can be important side effects, the symbol table is returned as part of the results.
This calculator has error reporting that compares favorably with production languages. (Unfortunately, these often do not set the bar very high.) The methods of the Scanless interface return diagnostics that pinpoint where things went wrong from the technical point of view, and what the problem was from the technical point of view. As a diagnostic, this is often adequate, but not always. Marpa's diagnostics have 100% technical accuracy, but the parsing may have ceased to reflect the programmer's intent before there is a technical problem.
To help the programmer sync his intent to what Marpa is seeing, when there is a problem, this calculator reports to the user the text for the last expression that was successfully recognized. Here's the code that finds it:
sub show_last_expression { my ($self) = @_; my $recce = $self->{recce}; my ( $start, $end ) = $recce->last_completed_range('expression'); return 'No expression was successfully parsed' if not defined $start; my $last_expression = $recce->range_to_string( $start, $end ); return "Last expression successfully parsed was: $last_expression"; } ## end sub show_last_expression
Here is a snippet of the semantics, with a few of the simpler semantic closures.
package My_Actions; our $SELF; sub new { return $SELF } sub do_set_var { my ( $self, $var, undef, $value ) = @_; return $self->{symbol_table}->{$var} = $value; } sub do_negate { return -$_[2]; } sub do_arg0 { return $_[1]; } sub do_arg1 { return $_[2]; } sub do_arg2 { return $_[3]; }
Full code for this example can be found in a Github gist. Semantics, legalese, a test suite and other packaging bring its total length to not quite 300 lines. It uses the latest indexed CPAN release of Marpa::R2. Marpa also has a web page.
Comments on this post
can be sent to the Marpa Google Group:
marpa-parser@googlegroups.com
posted at: 11:02 | direct link to this entry