Mon, 19 Sep 2011
Perl and Parsing 9: "Use" and the Ruby Slippers
In this post, I talk about how
Perl 5 parses its
is implemented with
what I have named
"Ruby Slippers" parsing.
The idea is that you parse with a convenient grammar,
but one which is too simple to actually describe the language
you are parsing.
For example, if you are parsing HTML, the grammar might assume
all start tags have end tags.
Whenever the simplified grammar has trouble parsing,
the lexer fixes the situation by pretending the input
is what the parser wants to see.
The parser is like Dorothy in
the Wizard of Oz, who really would
like to be back in Kansas.
The lexer is like the good witch,
Glenda, who assures Dorothy that,
because of her Ruby Slippers,
Dorothy really can be wherever she wants to be.
Few "new" programming ideas are so new
that they have no
precedent in previous practice.
Perl 5 put the Ruby Slippers technique
to work well before I described and named it.
Its code captures the two essential elements of
Ruby Slippers parsing.
The Syntax of the
As a reminder, the
use statement comes in several forms.
Most of them are module
that is, they request the loading of a module.
In the long form of the module request, a version number
is specified as well. The version number is usually
interpreted as the minimum acceptable version of
use Module VERSION
use List::Util 1.21;
The short form of the module request is a module
request with no version specified.
Module requests, both short form and long form,
also allow a list argument.
use Module VERSION LIST
use Module LIST
Here's an example of the short form with a list of arguments.
use List::Util qw(reduce shuffle);
Because a number can be either a version or a single item list argument,
module requests of the long form are potentially ambiguous.
use Fatal v42;
could be a request to use at least version
42 of the Fatal module.
It can also be
a request to load any version of the Fatal module, catching errors
for the function named
As implemented, Perl disambiguates these in the lexer.
It parses the
as the long form, with version specified, whenever possible.
The above line, for example, will complain that there is no
version 42 of the
If you want to use the
with a function that you happened to have named
v42, you can take advantage of the fact
that the lexer's idea of whether a lexeme is a version number
or not is a guess.
This guess is based entirely on the first character or,
in the case of a v-string, the first two.
will be parsed as the short form of a module request with
a single item list.
use Fatal +'v42';
Perl 5 has
to live within the limits of LALR parsing.
So you would think what I've already described
would be living plenty dangerously enough
for its designers.
But there is in fact
an additional form of the
This is the perl version request form.
The example requires that the version of Perl
used be at least 5.010.
Out Come the Ruby Slippers
So, to parse all these different forms, what is in Perl's BNF?
The BNF in perly.y has a single rule for
one that specifies only the long form of the
The Perl 5 BNF rule
is the equivalent of
use WORD WORD LIST
Here I have
because that is what
In this context, a
is a token which
can be either a version number
or the name of a module.
Of the five forms of the
command, only one
is represented in the BNF.
Of couse, a LIST can be empty,
and that accounts for two of the missing variants.
But one BNF rule still accounts
for three different syntaxes:
- the long form of the module request,
- the short form of the module request, and
- the perl version request.
Collapsing three syntaxes into one rule makes
gruesome code within toke.c, but
Perl 5 has no better choice.
Perl 5 is committed to LALR parsing.
Restricting the grammar to a single BNF rule
is the best hope of making sure the grammar
does not go beyond LALR.
So the Perl 5 BNF assumes, contrary to fact, that Perl
use statements always contain both module name
Here we see the first part of the Ruby Slippers strategy:
When a feature of the actual language is inconvenient
the Ruby Slippers allow you to simply pretend it does not
Perl 5's parser will only handle grammars that are LALR,
but the Perl 5 language is not LALR.
Using the Ruby Slippers approach,
the Perl 5 BNF pretends that the language is LALR.
Making Wishes Come True
The second part of the Ruby Slippers strategy requires that
wishful thinking be made to come true.
That's easy in
where the parse engine knows,
and can tell the lexer,
exactly what it is wishing for
yacc is much less context-aware.
As far as yacc is concerned,
if the lexer really loved it,
it would know what it wants.
The Perl 5 lexer winds up having to work very hard
to make this particular relationship work.
It looks ahead at the two
tokens and decides what to
do based on what it sees.
This amounts to figuring out in the lexer which variant of the
use statement is actually being used.
In effect Perl 5 parses every
use statement twice:
once in the lexer, and then one more time with yacc.
If the lexer sees that the
statement is the short form,
then it invents a second
to fill in for the missing one.
The lexer make the short form
look to yacc as if it was the full form.
This is the classic Ruby Slippers approach.
The perl version form of the
use statement is essentially a completely different
statement with the same keyword.
As its first step, it also does the classic Ruby Slippers
move, only in reverse.
The lexer reads the version as the first
WORD is then invented, with
As a final step, Perl 5 needs to distinguish
perl version requests from module load requests.
Here things get quite hackish.
Because the lexer is actually creating the Perl scalars,
it has full control over their internal representations:
the semantics can rely on the internal representations of
Perl 5 assumes that
are version numbers if and only if they have numeric
After the lexer reworks its input,
the parser always sees two
exactly one of which is a version number.
on whether that version number is
the first or second
Perl 5 decides whether a
use statement is
a module request or a perl version request.
Desperate men do desperate things.
"exactly what it is wishing for":
As currently implemented, Marpa has a convenient call
at any point in the parse,
will return the list of expected terminals.
An lexer willing to use the debug and trace functions
a lot more information: which rules are in progress,
how far they have been recognized,
where they began, etc., etc.
I could create convenient calls to access this information as well,
but so far the list of expected terminals
has been all that my lexers care to know.
implements pragmas as well as modules,
so the module request forms are also pragma
and module names can be pragma names.
posted at: 14:34 |
direct link to this entry