Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website


Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Mon, 19 Sep 2011

Perl and Parsing 9: "Use" and the Ruby Slippers

In this post, I talk about how Perl 5 parses its use statement. The use statement is implemented with what I have named "Ruby Slippers" parsing. The idea is that you parse with a convenient grammar, but one which is too simple to actually describe the language you are parsing. For example, if you are parsing HTML, the grammar might assume all start tags have end tags.

Whenever the simplified grammar has trouble parsing, the lexer fixes the situation by pretending the input is what the parser wants to see. The parser is like Dorothy in the Wizard of Oz, who really would like to be back in Kansas. The lexer is like the good witch, Glenda, who assures Dorothy that, because of her Ruby Slippers, Dorothy really can be wherever she wants to be.

Few "new" programming ideas are so new that they have no precedent in previous practice. Perl 5 put the Ruby Slippers technique to work well before I described and named it. Its code captures the two essential elements of Ruby Slippers parsing.

The Syntax of the use Statement

As a reminder, the use statement comes in several forms. Most of them are module requests -- that is, they request the loading of a module. In the long form of the module request, a version number is specified as well. The version number is usually interpreted as the minimum acceptable version of that module.

use Module VERSION
For example,
use List::Util 1.21;

The short form of the module request is a module request with no version specified.
use Module
For example,
use List::Util;

Module requests, both short form and long form, also allow a list argument.

use Module LIST
Here's an example of the short form with a list of arguments.
use List::Util qw(reduce shuffle);
Because a number can be either a version or a single item list argument, module requests of the long form are potentially ambiguous. That is,

use Fatal v42;
could be a request to use at least version 42 of the Fatal module. It can also be a request to load any version of the Fatal module, catching errors for the function named v42. As implemented, Perl disambiguates these in the lexer. It parses the use statement as the long form, with version specified, whenever possible. The above line, for example, will complain that there is no version 42 of the Fatal module.

If you want to use the Fatal module with a function that you happened to have named v42, you can take advantage of the fact that the lexer's idea of whether a lexeme is a version number or not is a guess. This guess is based entirely on the first character or, in the case of a v-string, the first two. For example,

use Fatal +'v42';
will be parsed as the short form of a module request with a single item list.

Perl 5 has to live within the limits of LALR parsing. So you would think what I've already described would be living plenty dangerously enough for its designers. But there is in fact an additional form of the use statement:

For example,
use 5.010;
This is the perl version request form. The example requires that the version of Perl used be at least 5.010.

Out Come the Ruby Slippers

So, to parse all these different forms, what is in Perl's BNF? The BNF in perly.y has a single rule for the use statement, one that specifies only the long form of the module request. The Perl 5 BNF rule is the equivalent of

Here I have WORD instead of Module and VERSION, because that is what perly.y has. In this context, a WORD is a token which can be either a version number or the name of a module.

Of the five forms of the use command, only one is represented in the BNF. Of couse, a LIST can be empty, and that accounts for two of the missing variants. But one BNF rule still accounts for three different syntaxes:

Collapsing three syntaxes into one rule makes for some gruesome code within toke.c, but Perl 5 has no better choice. Perl 5 is committed to LALR parsing. Restricting the grammar to a single BNF rule is the best hope of making sure the grammar does not go beyond LALR.

So the Perl 5 BNF assumes, contrary to fact, that Perl use statements always contain both module name and version. Here we see the first part of the Ruby Slippers strategy: wishful thinking. When a feature of the actual language is inconvenient to parse, the Ruby Slippers allow you to simply pretend it does not exist. Perl 5's parser will only handle grammars that are LALR, but the Perl 5 language is not LALR. Using the Ruby Slippers approach, the Perl 5 BNF pretends that the language is LALR.

Making Wishes Come True

The second part of the Ruby Slippers strategy requires that wishful thinking be made to come true. That's easy in Marpa, where the parse engine knows, and can tell the lexer, exactly what it is wishing for and where. yacc is much less context-aware. As far as yacc is concerned, if the lexer really loved it, it would know what it wants.

The Perl 5 lexer winds up having to work very hard to make this particular relationship work. It looks ahead at the two WORD tokens and decides what to do based on what it sees. This amounts to figuring out in the lexer which variant of the use statement is actually being used. In effect Perl 5 parses every use statement twice: once in the lexer, and then one more time with yacc.

If the lexer sees that the use statement is the short form, then it invents a second WORD token to fill in for the missing one. The lexer make the short form of the use statement look to yacc as if it was the full form. This is the classic Ruby Slippers approach.

The perl version form of the use statement is essentially a completely different statement with the same keyword. As its first step, it also does the classic Ruby Slippers move, only in reverse. The lexer reads the version as the first WORD token. A second WORD is then invented, with NULL contents.

As a final step, Perl 5 needs to distinguish perl version requests from module load requests. Here things get quite hackish. Because the lexer is actually creating the Perl scalars, it has full control over their internal representations: the semantics can rely on the internal representations of the WORD tokens. Perl 5 assumes that WORD's are version numbers if and only if they have numeric representations. After the lexer reworks its input, the parser always sees two WORD's, exactly one of which is a version number. Based on whether that version number is the first or second WORD, Perl 5 decides whether a use statement is a module request or a perl version request. Desperate men do desperate things.


  1. "exactly what it is wishing for": As currently implemented, Marpa has a convenient call that, at any point in the parse, will return the list of expected terminals. An lexer willing to use the debug and trace functions can find a lot more information: which rules are in progress, how far they have been recognized, where they began, etc., etc. I could create convenient calls to access this information as well, but so far the list of expected terminals has been all that my lexers care to know.

  2. "module": Actually, the use statement implements pragmas as well as modules, so the module request forms are also pragma invocations, and module names can be pragma names.

      posted at: 11:34 | direct link to this entry

      §         §         §