Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest
The Ocean of Awareness blog: home page, chronological index, and annotated index.
In
the previous post in this series,
I described the way
the Traditional Perl Parser (TPP) parses Perl's
use
statement.
This post will describe an IMHO attractive alternative approach,
using my own
Marpa::XS.
Marpa::XS has a prototype Perl parser as part of its test suite.
If that previous post is not fresh in your mind, it may be useful at this point to look back at it. Summarizing, the TPP uses the "Ruby Slippers" method: It first simplifies the grammar so that, while convenient for the parser, the grammar falls short of describing the actual language. The lexer then fakes the input to the parser so that the parser's "wishful thinking" comes true.
In the TPP, the parser requires all
use
statements
take this form:
Here the
use_statement ::= USE WORD WORD LIST
WORD
's may be either versions or
names of modules -- the semantics
looks at the internal representations to sort this out once
the parsing is finished.
Marpa is a lot better than
TPP's LALR parse engine at the Ruby Slippers thing.
Marpa will tell its lexers exactly what it wants at every point, whereas
the TPP's lexer can expect no help from the LALR parse engine
and is forced to do its own parsing of
use
statement.
But in Marpa::Perl, I decided NOT to use the Ruby Slippers
to parse Perl's use
statement.
Marpa is good at the Ruby Slippers, but it also allows even better and easier
ways to tackle this problem.
Marpa can parse far more grammars than the TPP's parse engine.
Marpa can even parse ambigious grammars.
The use
statement
comes in three forms, and it is easy to create a rule for
each one.
use_statement ::= USE NAME VERSION LIST
use_statement ::= USE VERSION
use_statement ::= USE NAME LIST
use
statement --
so natural that in fact, it's almost exactly
the way the Perl documentation
DOES describe it.
The one difference is that
the Perl documentation lists separate forms for
use
statements
with and
without LIST
's of arguments.
This is useful for clarity, but
LIST
's can be empty.
Separate rules in the parser's BNF
dedicated to module requests
without the LIST
's would create
a useless and inefficient ambiguity.
The three rules above demonstrate ambiguity of a helpful kind.
Consider
The TPP parses this as a request to use
the (so far non-existent) version 5 or higher of the
use Fatal 5;
Fatal
.
According to both TPP's and Marpa::Perl's BNF,
however, it could also be parsed as a request to
Fatal to wrap a Perl subroutine named 5
.
In fact, that's exactly how the TPP will interpret
use Fatal +5;
use
statement that way
whenever possible.
When a
use
statement cannot
possibly be the long form of a module request,
Marpa::Perl tries to parse it as a perl version request.
As a last resort,
Marpa::Perl interprets a
use
statement
as the short form (name only)
of a module request.
By using this rule ranking,
Marpa::Perl's lexer can avoid the issue of determining what is a
version number and what is a name.
Marpa allows tokens to be
ambiguous.
Marpa::Perl's lexer can tell the parser that
v5.0
is BOTH a name and a version,
and let the parser sort things out.
LALR is, to say the least,
not tolerant of
ambiguities, so
the TPP's lexer must tackle the issue
of identifying versions head-on.
When it comes to parsing Perl's
use
statement,
I believe I have shown that Marpa can do it more easily.
Parsing the
use
statement
in Marpa required the three BNF rules shown,
plus arranging to have the lexer recognize that numbers
and v-strings can be either a version or a name,
which is done with
a simple loop over a fixed list.
Look at
the corresponding code in
toke.c's tokenize_use()
and
op.c's
Perl_utilize(),
and you'll see some of the most difficult logic
that you're likely to find anywhere.
Perhaps more significant is what Marpa::Perl could do that the TPP
does not and could not do.
Imagine, for example, that you've written a module to make available
the regular expression logic as implemented in the classic
AT&T UNIX version 8,
affectionately remembered as v8 UNIX.
Imagine further that you decide, unhappily, to name this module,
v8.pm.
In that case you can simply forget about loading it with TPP's
use
statement,
which will insist that v8
is a version number,
even in a case like this:
The
use v8 0 qw(re);
use
statement
just shown can only have one parse.
A human being can quickly determine what that one parse must be.
Why have we come to believe this is too much
to expect of our parsing algorithms?
"Traditional Perl Parser": In this series, I call the Perl interpreter's parser, the one whose implementation centers on the files toke.c and perly.y, the Traditional Perl Parser, or TPP. Most others just call it the Perl parser, but in this series I discuss its alternatives (Marpa::Perl and PPI) a lot. This little bit of special terminology saves confusion.
"names of modules": Actually, these can also be pragma names, a fact that, for brevity's sake, I will not mention again.
"last resort":
Actually, the TPP accepts some undocumented syntaxes.
This gets into strange territory.
The TPP parses
as a request to use version 2.0 or higher
of the
use v2 Fatal;
Fatal
module.
You could take the point of view that the TPP should reject statements
like this
as syntax errors,
in which case TPP's behavior is a bug.
For the purposes of this blog post, I ignore this issue.
"a simple loop over a fixed list": For the code see pperl/Marpa/Perl.pm in the latest developer's version of Marpa::XS . The code discussed begins at line 1200.
posted at: 13:01 | direct link to this entry