Thu, 06 Oct 2011
Perl and Parsing 10: "Use" the Easier Way
the previous post in this series,
I described the way
the Traditional Perl Parser (TPP) parses Perl's
This post will describe an IMHO attractive alternative approach,
using my own
Marpa::XS has a prototype Perl parser as part of its test suite.
The TPP Way
that previous post
is not fresh in your mind, it may be useful
at this point to look back at it.
Summarizing, the TPP uses
the "Ruby Slippers" method:
It first simplifies the grammar so that, while convenient for the
parser, the grammar falls short of describing the actual language.
The lexer then fakes the input to the parser
so that the parser's "wishful thinking" comes true.
In the TPP, the parser requires all
take this form:
use_statement ::= USE WORD WORD LIST
WORD's may be either versions or
names of modules -- the semantics
looks at the internal representations to sort this out once
the parsing is finished.
The Marpa::Perl Way
Marpa is a lot better than
TPP's LALR parse engine at the Ruby Slippers thing.
Marpa will tell its lexers exactly what it wants at every point, whereas
the TPP's lexer can expect no help from the LALR parse engine
and is forced to do its own parsing of
But in Marpa::Perl, I decided NOT to use the Ruby Slippers
to parse Perl's
Marpa is good at the Ruby Slippers, but it also allows even better and easier
ways to tackle this problem.
Marpa can parse far more grammars than the TPP's parse engine.
Marpa can even parse ambigious grammars.
comes in three forms, and it is easy to create a rule for
use_statement ::= USE NAME VERSION LIST
use_statement ::= USE VERSION
use_statement ::= USE NAME LIST
This is a very natural way to describe
so natural that in fact, it's almost exactly
the way the Perl documentation
DOES describe it.
The one difference is that
the Perl documentation lists separate forms for
's of arguments.
This is useful for clarity, but
's can be empty.
Separate rules in the parser's BNF
dedicated to module requests
's would create
a useless and inefficient ambiguity.
The three rules above demonstrate ambiguity of a helpful kind.
The TPP parses this as a request to use
the (so far non-existent) version 5 or higher of the
use Fatal 5;
According to both TPP's and Marpa::Perl's BNF,
however, it could also be parsed as a request to
Fatal to wrap a Perl subroutine named
In fact, that's exactly how the TPP will interpret
use Fatal +5;
Marpa::Perl solves the ambiquity by using Marpa's new
ranking method, which allows every rule to have an
Marpa::Perl ranks the rules in the order I listed them
from highest to lowest.
The long form (both name and version)
of the module request ranks highest, meaning that Marpa::Perl
statement that way
possibly be the long form of a module request,
Marpa::Perl tries to parse it as a perl version request.
As a last resort
Marpa::Perl interprets a
as the short form (name only)
of a module request.
By using this rule ranking,
Marpa::Perl's lexer can avoid the issue of determining what is a
version number and what is a name.
Marpa allows tokens to be
Marpa::Perl's lexer can tell the parser that
v5.0 is BOTH a name and a version,
and let the parser sort things out.
LALR is, to say the least,
not tolerant of
the TPP's lexer must tackle the issue
of identifying versions head-on.
When it comes to parsing Perl's
I believe I have shown that Marpa can do it more easily.
in Marpa required the three BNF rules shown,
plus arranging to have the lexer recognize that numbers
and v-strings can be either a version or a name,
which is done with
a simple loop over a fixed list.
the corresponding code in
and you'll see some of the most difficult logic
that you're likely to find anywhere.
Perhaps more significant is what Marpa::Perl could do that the TPP
does not and could not do.
Imagine, for example, that you've written a module to make available
the regular expression logic as implemented in the classic
AT&T UNIX version 8,
affectionately remembered as v8 UNIX.
Imagine further that you decide, unhappily, to name this module,
In that case you can simply forget about loading it with TPP's
which will insist that
v8 is a version number,
even in a case like this:
use v8 0 qw(re);
just shown can only have one parse.
A human being can quickly determine what that one parse must be.
Why have we come to believe this is too much
to expect of our parsing algorithms?
"Traditional Perl Parser":
In this series, I call the Perl interpreter's parser,
the one whose implementation centers on the files
toke.c and perly.y,
the Traditional Perl Parser, or TPP.
Most others just call it the Perl parser, but in this
series I discuss its alternatives (Marpa::Perl
and PPI) a lot.
This little bit of special terminology
"names of modules":
Actually, these can also be pragma names, a fact that,
for brevity's sake,
I will not mention again.
Actually, the TPP accepts some undocumented syntaxes.
This gets into strange territory.
The TPP parses
as a request to use version 2.0 or higher
use v2 Fatal;
You could take the point of view that the TPP should reject statements
as syntax errors,
in which case TPP's behavior is a bug.
For the purposes of this blog post, I ignore this issue.
"a simple loop over a fixed list":
For the code see
in the latest developer's version of Marpa::XS
The code discussed begins at line 1200.
posted at: 13:01 |
direct link to this entry