Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest
The Ocean of Awareness blog: home page, chronological index, and annotated index.
Marpa is an Earley-based parser, and Earley parsers are typically not good at procedural parsing. Many programmers are used to recursive descent (RD), which has been state-of-the-art in terms of its procedural programming capabilities -- it was these capabilities which led to RD's triumph over the now-forgotten Irons algorithm.
Marpa, however, has a parse engine expressly redesigned[1] to handle procedural logic well. In fact, Marpa is better at procedural logic than RD.
Marpa parses all LR-regular grammars in linear time, so the first challenge is to find a grammar that illustrates a need for procedural logic, even when Marpa is used. The following is the canonical example of a grammar that is context-sensitive, but not context-free:
a^n . b^n . c^n : n >= 1I will call this the "ABC grammar". It is a sequence of a's, b's, and c's, in alphabetical order, where the character counts are all equal to each other and greater than zero.
The ABC "grammar" is really a counting problem more than a natural parsing problem, and parsing is not the fastest or easiest way to solve it. Three tight loops, with counters, would do the same job nicely, and would be much faster. But I chose the ABC grammar for exactly this reason. It is simple in itself, but it is tricky when treated as a parsing problem.[2]
In picking the strategy below, I opted for one that illustrates a nice subset of Marpa's procedural parsing capabilities. Full code is on-line, and readers are encouraged to "peek ahead".
Our strategy will be to start with a context-free syntax, and then extend it with procedural logic. Here is the context-free grammar:
lexeme default = latm => 1 :default ::= action => [name,start,length,values] S ::= prefix ABC trailer ABC ::= ABs Cs ABs ::= A ABs B | A B prefix ::= A* trailer ::= C_extra* A ~ 'a' B ~ 'b' :lexeme ~ Cs pause => before event => 'before C' Cs ~ 'c' # dummy -- procedural logic readsC_extra ~ 'c'
The first line is boiler-plate: It turns off a default which was made pointless by a later enhancement to Marpa::R2. Marpa::R2 is stable, and backward-compatibility is a very high priority.
:default ::= action => [name,start,length,values]
We will produce a parse tree. The second line defines its format -- each node is an array whose elements are, in order, the node name, its start position, its length and its child nodes.
S ::= prefix ABC trailer
The symbol <ABC> is our "target" -- the counted a's, b's, and c's. To make things a bit more interesting, and to make the problem more like a parsing problem instead of a counting problem, we allow a prefix of a's and a trailer of c's.
ABC ::= ABs Cs
We divide the <ABC> target into two parts: <ABs>, which contains the a's, and b's; and <Cs>, which contains the c's.
The string
a^n . b^n
is context free, so that we can handle it without procedural logic, as follows:
ABs ::= A ABs B | A B
The line above recognizes a non-empty string of a's, followed by an equal number of b's.
prefix ::= A* trailer ::= C_extra*
As stated above, <prefix> is a series of a's and <trailer> is a series of c's.
A ~ 'a' B ~ 'b'
Marpa::R2 has a separate lexical and syntactic phase. Here we define our lexemes. The first two are simple enough: <A> is the character "a"; and <B> is the character "b".
:lexeme ~ Cs pause => before event => 'before C' Cs ~ 'c' # dummy -- procedural logic readsC_extra ~ 'c'
For the character "c", we need procedural logic. As hooks for procedural logic, Marpa allows a full range of events. Events can occur on prediction and completion of symbols; when symbols are nulled; before lexemes; and after lexemes. The first line in the above display declares a "before lexeme" event on the symbol <Cs>. The name of the event is "before C".
The second line is a dummy entry, which is needed to allow the "before C" event to trigger. The entry says that <Cs> is a single character "c". This is false -- <Cs> is a series of one or more c's, which needs to be counted. But when the "before C" event triggers, the procedural logic will make things right.
The third line defines <C_extra>, which is another lexeme for the character "c". We have two different lexemes for the character c, because we want some c's (those in the target) to trigger events; and we want other c's (those in the trailer) not to trigger events, but to be consumed by Marpa directly.
At this point, we have solved part of the problem with context-free syntax, and set up a Marpa event named "before C", which will solve the rest of it.
my $input_length = length ${$input}; for ( my $pos = $recce->read($input); $pos < $input_length; $pos = $recce->resume() ) { ... Process events ... }
Processing of events takes place inside a Marpa read loop. This is initialized with a read() method, and is continued with a resume() method. The read() and resume() methods both return the current position in the input. If the current position is end-of-input, we are done. If not, we were interrupted by an event, which we must process.
Process events EVENT: for ( my $event_ix = 0 ; my $event = $recce->event($event_ix) ; $event_ix++ ) { my $name = $event->[0]; if ( $name eq 'before C' ) { ... Process "before C" event ... } die qq{Unexpected event: name="$name"}; }
In this application, only one event can occur at any location, so the above loop is "overkill". It loops through the events, one by one. The event method returns a reference to an array of event data. The only element we care about is the event name. In fact, if we weren't being careful about error checking, we would not even care about the event name, since there can be only one.
If, as expected, the event name is "before C", we process it. In any other case, we die with an error message.
Process "before C" event my ( $start, $length ) = $recce->last_completed_span('ABs'); my $c_length = ($length) / 2; my $c_seq = ( 'c' x $c_length ); if ( substr( ${$input}, $pos, $c_length ) eq $c_seq ) { $recce->lexeme_read( 'Cs', $pos, $c_length, $c_seq ); next EVENT; } die qq{Too few C's};
This is the core part of our procedural logic, where we have a "before C" event. We must
my ( $start, $length ) = $recce->last_completed_span('ABs'); my $c_length = ($length) / 2;
Marpa claims to be "left-eidetic", that is, to have full knowledge of the parse so far, and to make this knowledge available to the programmer. How does a programmer cash in on this promise?
Of course, there is a fully general interface, which allows you to go through the Earley tables and extract the information in any form necessary. But another, more convenient interface, fits our purposes here. Specifically,
Marpa has a last_completed_span() method, and that is just what we need. This finds the most recent instance of a symbol. (If there had been more than one most recent instance, it would have found the longest.) The last_completed_span() method returns the start of the symbol instance (which we do not care about) and its length. The desired number of c characters, $c_length, is half the length of the <ABs> instance.
my $c_seq = ( 'c' x $c_length ); if ( substr( ${$input}, $pos, $c_length ) eq $c_seq ) { ... }
Marpa allows external parsing. You can pause Marpa, as we have done, and hand control over to another parser -- including another instance of Marpa.
Here external parsing is necessary to make our parser context-sensitive, but the external parser does not have to be fancy. All it needs to do is some counting -- not hard, but something that a context-free grammar cannot do.
$pos is the current position in the input, as returned by the read() or resume() method in the outer loop. Our input is the string referred to by $input. We just calculated $c_length as the number of c characters required. The above code checks to see that the required number of c characters is at $pos in the input.
$recce->lexeme_read( 'Cs', $pos, $c_length, $c_seq );
Our external logic is doing the parsing, but we need to let Marpa know what we are finding. We do this with the lexeme_read() method. lexeme_read() needs to know what symbol we are reading (Cs in our case); and its value ($c_seq in our case).
Marpa requires that every symbol be tied in some way to the input. The tie-in is only for error reporting, and it can be hack-ish or completely artificial, if necessary. In this application, our symbol instance is tied into the input in a very natural way -- it is the stretch of the input that we compared to $c_seq in the display before last. We therefore tell Marpa that the symbol is at $pos in the input, and of length $c_length.
next EVENT;
External parsing can go on quite a long time. In fact, an external parser never has to hand control back to Marpa. But in this case, we are done very quickly.
We ask for the next iteration of the EVENT loop. (In this code, there will not be a next iteration, unless there is an error.) Once done, the EVENT loop will hand control over to the outer loop. The outer loop will call the resume() method to return control back to Marpa.
The full code for this example is on-line. There is a lot more to Marpa, including more facilities for adding procedural logic to your Marpa parsers. To learn more about Marpa, a good first stop is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.
1. To handle procedural logic well, an Earley engine needs to complete its Earley sets in strict order -- that is, Earley set N cannot change after work on Earley set N+1 has begun. I have not looked at every Earley parse engine, and some may have had this strict-sequencing property. And many of the papers are agnostic about the order of operations. But Marpa is the first Earley parser to recognize and exploit strict-sequencing as a feature. ↩
2. The ABC grammar, in fact, is not all that easy or natural to describe even with a context-sensitive phrase structure description. A solution is given on Wikipedia: https://en.wikipedia.org/wiki/Context-sensitive_grammar#Examples. ↩
posted at: 20:02 | direct link to this entry