Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest
The Ocean of Awareness blog: home page, chronological index, and annotated index.
A declarative parser takes a description of your language and parses it for you. On the face of it, this sounds like the way you'd want to go, and Marpa offers that possibility -- it generates a parser from anything you can write in BNF and, if the parser is in one of the classes currently in practical use, that parser will run in linear time.
But practical grammars often have context-sensitive parts -- features which cannot be described in BNF. Nice as declarative parsing may sound, at least some procedural parsing can be a necessity in real-life. In this post, I take a problem for which procedural parsing is essential, and create a fast, short solution that mixes procedural and declarative.
This is a sample of the language:
A2(A2(S3(Hey)S13(Hello, World!))S5(Ciao!))
It describes strings in nested arrays. The strings are introduced by the letter 'S', followed by a length count and then, in parentheses, the string itself. Arrays are introduced by the letter 'A' followed by an element count and, inside parentheses, the array's contents. These contents are a concatenated series of strings and other arrays. I call this a Dyck-Hollerith language because it combines Hollerith constants (strings preceded by a count), with balanced parentheses (what is called a Dyck language by mathematicians).
The language is one I've dealt with before. It is apparently from "real life", and is described more fully in a blog post by Flavio Poletti. Several people, Gabor Szabo among them, prodded me to show how Marpa would do on this language. I did this a year ago, using Marpa's previous version, Marpa::XS. The result was well-received and quite satisfactory.
This time around, I used Marpa's latest version, Marpa::R2, and its new interface, the SLIF. The solution presented here was much easer to write, and will be easier to read. It is also several times faster.
The full code for this example is in a Github gist. In what follows, I will assume the reader is interested in the ideas. Details of the interface, along with more detail-oriented tutorials, can be found in Marpa's documentation. Other tutorials are on the Ocean of Awareness blog, and on the Marpa Guide, a new website being built due to the generosity of Peter Stuifzand and Ron Savage.
First off, let's look at the declarative part. The core of the parser is the following lines, containing the BNF for the language's top-level structure.
my $dsl = <<'END_OF_DSL'; # The BNF :start ::= sentence sentence ::= element array ::= 'A' <array count> '(' elements ')' action => check_array string ::= ( 'S' <string length> '(' ) text ( ')' ) elements ::= element+ action => ::array element ::= string | array
Details of this syntax are in Marpa's documentation, but it's a dialect of EBNF. Adverbs like action => semantics tell Marpa what the semantics will be. The default (which will be set below) is for a rule to return its first child. ::array semantics tell Marpa to return every element of elements in an array. And check_array is the name of a function providing the semantics, as will be seen below.
Single-quoted strings are looked for literally in the input. In the string declaration, you'll note some parentheses which are not in quotes. The unquoted parentheses are part of the Marpa DSL's own syntax, telling Marpa to "hide" the parenthesized symbols from the semantics. Here, the effect is that text is treated by the semantics as if it were the "first" symbol.
Marpa's SLIF provides a lexer for the user, and this Marpa-internal lexer will handle most of the symbols in this example. The single-quoted strings we saw in the BNF are actually instructions to the internal lexer. The next lines tell Marpa how to recognize <array count> and <string length>.
<array count> ~ [\d]+ <string length> ~ [\d]+ text ~ [\d\D] END_OF_DSL
<array_count> and <string length> are both declared to be a series of digits. text is a stub. The length of text depends on the numeric value of <string length>, and dealing with that is beyond the power of the BNF. When it comes time to count out the symbols needed for text, we will hand control over to an external lexer. For the purposes of Marpa's lexer, text is described as a single character of any kind. Marpa's internal scanner uses a longest tokens match algorithm, and since we don't want the internal scanner to read text lexemes, describing text and other purely external lexemes as single characters is the right thing to do.
Now comes the weld between declarative and procedural ...
:lexeme ~ <string length> pause => after :lexeme ~ text pause => before
These two statements tell Marpa that <string length> and <text> are two lexicals at which Marpa's own parsing should "pause", handing over control to external procedural parsing logic. In the case of <string length>, the pause should be after it is read. In the case of <text> the pause should be before. What happens during the "pause", we will soon see.
Next follows the code to read the DSL, and start the parser.
my $grammar = Marpa::R2::Scanless::G->new( { action_object => 'My_Actions', default_action => '::first', source => \$dsl } ); my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
The previous lines tell Marpa that when its semantics are provided by a Perl closure, it is to look for that closure in a package called My_Actions. The default semantics are ::first, which means simply pass the value of the first RHS symbol of a rule upwards.
We saw our input above:
$input = 'A2(A2(S3(Hey)S13(Hello, World!))S5(Ciao!))';
The block of code which follows is the main loop through the parse, including all the procedural parsing logic. Below, I will pull this procedural parsing logic out of the loop for separate examination.
Here the $recce->read() method performs the first read and sets up the input string. The interior of the loop is entered whenever Marpa "pauses". Once the procedural parsing logic is done, Marpa resumes with the $recce->resume() call. Throughout, $pos is used to track the current character in the input stream. The loop ends when $pos is after the last character of $input.
my $last_string_length; my $input_length = length $input; INPUT: for ( my $pos = $recce->read( \$input ); $pos < $input_length; $pos = $recce->resume($pos) ) { my $lexeme = $recce->pause_lexeme(); die q{Parse exhausted in front of this string: "}, substr( $input, $pos ), q{"} if not defined $lexeme; my ( $start, $lexeme_length ) = $recce->pause_span(); if ( $lexeme eq 'string length' ) { $last_string_length = $recce->literal( $start, $lexeme_length ) + 0; $pos = $start + $lexeme_length; next INPUT; } if ( $lexeme eq 'text' ) { my $text_length = $last_string_length; $recce->lexeme_read( 'text', $start, $text_length ); $pos = $start + $text_length; next INPUT; } ## end if ( $lexeme eq 'text' ) die "Unexpected lexeme: $lexeme"; } ## end INPUT: for ( my $pos = $recce->read( \$input ); $pos < $input_length...)
In this language, we need the procedural parsing logic to count the text strings properly. This is done in a very direct way. First we pull the count from <string length>:
if ( $lexeme eq 'string length' ) { $last_string_length = $recce->literal( $start, $lexeme_length ) + 0; $pos = $start + $lexeme_length; next INPUT; }
Above, we used pause_span() to set $start and $lexeme_length to the start and length of the lexeme that Marpa's internal scanner found. Passed to $recce->literal(), these two values return the "literal" string value of the lexeme, which will be the ASCII representation of a decimal number. We convert it to numeric, salt it away in $last_string_length, and set $pos to the location just after the <string length> lexeme.
if ( $lexeme eq 'text' ) { my $text_length = $last_string_length; $recce->lexeme_read( 'text', $start, $text_length ); $pos = $start + $text_length; next INPUT; } ## end if ( $lexeme eq 'text' )
Now we come to counting out the characters for the text lexeme. Recall that in the case of text, we pause before the lexeme, which means it will not have been read yet. With $recce->lexeme_read(), we tell Marpa that we want the next lexeme
We also set $pos to be just after the end of the lexeme.
We've focused on the string lengths, but the Dyck-Hollerith language has a count of the number of elements in its array. Marpa's BNF-driven parsing logic has no trouble determining the number of elements from the array contents, and it does not need the count. What to do with it?
package My_Actions;
sub check_array { my ( undef, undef, $declared_size, undef, $array ) = @_; my $actual_size = @{$array}; warn "Array size ($actual_size) does not match that specified ($declared_size)" if $declared_size != $actual_size; return $array; } ## end sub check_array
Recall that Marpa promised special semantics for the array rule in its DSL. Here they are. The first parameter to Marpa's semantic closures is a per-parse variable, here unused. The rest are the values of the RHS symbols, in order. We only care about the second (for <array count>), and the fourth (for elements). We determine a $declared_size from <array count>; and an $actual_size by looking at the array referenced by $array. If these differ, we choose to warn the user. Depending on your purposes, anything from ignoring the issue to throwing a fatal error may be equally or more reasonable.
And now we are ready to take the result of the parse.
my $result = $recce->value(); die 'No parse' if not defined $result;
The techniques described apply to problems considerably larger than the example of this post. Jean-Damien Durand is using them to create a C-to-AST tool. This takes C language and converts it to an AST, following the C11 specification carefully. The AST can then be manipulated as you wish.
Marpa::R2
is available on CPAN.
A list of my Marpa tutorials can be found
here.
There is
a new tutorial by Peter Stuifzand.
The Ocean of Awareness blog
focuses on Marpa,
and it has
an annotated guide.
Marpa also has
a web page.
For questions, support and discussion, there is a
Google Group:
marpa-parser@googlegroups.com
.
Comments on this post can be made there.
posted at: 10:17 | direct link to this entry