Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest
The Ocean of Awareness blog: home page, chronological index, and annotated index.
In this post, I talk about how
Perl 5 parses its
use
statement.
The use
statement
is implemented with
what I have named
"Ruby Slippers" parsing.
The idea is that you parse with a convenient grammar,
but one which is too simple to actually describe the language
you are parsing.
For example, if you are parsing HTML, the grammar might assume
all start tags have end tags.
Whenever the simplified grammar has trouble parsing, the lexer fixes the situation by pretending the input is what the parser wants to see. The parser is like Dorothy in the Wizard of Oz, who really would like to be back in Kansas. The lexer is like the good witch, Glenda, who assures Dorothy that, because of her Ruby Slippers, Dorothy really can be wherever she wants to be.
Few "new" programming ideas are so new that they have no precedent in previous practice. Perl 5 put the Ruby Slippers technique to work well before I described and named it. Its code captures the two essential elements of Ruby Slippers parsing.
use
Statement
As a reminder, the
use
statement comes in several forms.
Most of them are module
requests --
that is, they request the loading of a module.
In the long form of the module request, a version number
is specified as well. The version number is usually
interpreted as the minimum acceptable version of
that module.
use Module VERSIONFor example,
use List::Util 1.21;
The short form of the module request is a module
request with no version specified.
use ModuleFor example,
use List::Util;
Module requests, both short form and long form, also allow a list argument.
use Module VERSION LIST use Module LISTHere's an example of the short form with a list of arguments.
use List::Util qw(reduce shuffle);
Because a number can be either a version or a single item list argument,
module requests of the long form are potentially ambiguous.
That is,
use Fatal v42;
could be a request to use at least version
42 of the Fatal module.
It can also be
a request to load any version of the Fatal module, catching errors
for the function named v42
.
As implemented, Perl disambiguates these in the lexer.
It parses the use
statement
as the long form, with version specified, whenever possible.
The above line, for example, will complain that there is no
version 42 of the Fatal
module.
If you want to use the
Fatal
module
with a function that you happened to have named
v42
, you can take advantage of the fact
that the lexer's idea of whether a lexeme is a version number
or not is a guess.
This guess is based entirely on the first character or,
in the case of a v-string, the first two.
For example,
will be parsed as the short form of a module request with
a single item list.
use Fatal +'v42';
Perl 5 has
to live within the limits of LALR parsing.
So you would think what I've already described
would be living plenty dangerously enough
for its designers.
But there is in fact
an additional form of the
use
statement:
use VERSIONFor example,
use 5.010;
This is the perl version request form.
The example requires that the version of Perl
used be at least 5.010.
So, to parse all these different forms, what is in Perl's BNF?
The BNF in perly.y has a single rule for
the use
statement,
one that specifies only the long form of the
module request.
The Perl 5 BNF rule
is the equivalent of
use WORD WORD LISTHere I have
WORD
instead of
Module
and VERSION
,
because that is what
perly.y has.
In this context, a WORD
is a token which
can be either a version number
or the name of a module.
Of the five forms of the use
command, only one
is represented in the BNF.
Of couse, a LIST can be empty,
and that accounts for two of the missing variants.
But one BNF rule still accounts
for three different syntaxes:
Collapsing three syntaxes into one rule makes for some gruesome code within toke.c, but Perl 5 has no better choice. Perl 5 is committed to LALR parsing. Restricting the grammar to a single BNF rule is the best hope of making sure the grammar does not go beyond LALR.
So the Perl 5 BNF assumes, contrary to fact, that Perl
use
statements always contain both module name
and version.
Here we see the first part of the Ruby Slippers strategy:
wishful thinking.
When a feature of the actual language is inconvenient
to parse,
the Ruby Slippers allow you to simply pretend it does not
exist.
Perl 5's parser will only handle grammars that are LALR,
but the Perl 5 language is not LALR.
Using the Ruby Slippers approach,
the Perl 5 BNF pretends that the language is LALR.
The second part of the Ruby Slippers strategy requires that wishful thinking be made to come true. That's easy in Marpa, where the parse engine knows, and can tell the lexer, exactly what it is wishing for and where. yacc is much less context-aware. As far as yacc is concerned, if the lexer really loved it, it would know what it wants.
The Perl 5 lexer winds up having to work very hard
to make this particular relationship work.
It looks ahead at the two WORD
tokens and decides what to
do based on what it sees.
This amounts to figuring out in the lexer which variant of the
use
statement is actually being used.
In effect Perl 5 parses every
use
statement twice:
once in the lexer, and then one more time with yacc.
If the lexer sees that the use
statement is the short form,
then it invents a second WORD
token
to fill in for the missing one.
The lexer make the short form
of the use
statement
look to yacc as if it was the full form.
This is the classic Ruby Slippers approach.
The perl version form of the
use
statement is essentially a completely different
statement with the same keyword.
As its first step, it also does the classic Ruby Slippers
move, only in reverse.
The lexer reads the version as the first WORD
token.
A second WORD
is then invented, with NULL
contents.
As a final step, Perl 5 needs to distinguish
perl version requests from module load requests.
Here things get quite hackish.
Because the lexer is actually creating the Perl scalars,
it has full control over their internal representations:
the semantics can rely on the internal representations of
the WORD
tokens.
Perl 5 assumes that WORD
's
are version numbers if and only if they have numeric
representations.
After the lexer reworks its input,
the parser always sees two WORD
's,
exactly one of which is a version number.
Based
on whether that version number is
the first or second WORD
,
Perl 5 decides whether a use
statement is
a module request or a perl version request.
Desperate men do desperate things.
"exactly what it is wishing for": As currently implemented, Marpa has a convenient call that, at any point in the parse, will return the list of expected terminals. An lexer willing to use the debug and trace functions can find a lot more information: which rules are in progress, how far they have been recognized, where they began, etc., etc. I could create convenient calls to access this information as well, but so far the list of expected terminals has been all that my lexers care to know.
"module":
Actually, the use
statement
implements pragmas as well as modules,
so the module request forms are also pragma
invocations,
and module names can be pragma names.
posted at: 14:34 | direct link to this entry