Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website

Google+

Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Mon, 10 Mar 2014


Evolvable languages

Ideally, if a syntax is useful and clear, and a programmer can easily read it at a glance, you should be able to add it to an existing language. In this post, I will describe a modest incremental change to the Perl syntax.

It's one I like, because that's beside the point, for two reasons. First, it's simply intended as an example of language evolution. Second, regardless of its merits, it is unlikely to happen, because of the way that Perl 5 is parsed. In this post I will demonstrate a way of writing a parser, so that this change, or others, can be made in a straightforward way, and without designing your language into a corner.

When initializing a hash, Perl 5 allows you to use not just commas, but also the so-called "wide comma" (=>). The wide comma is suggestive visually, and it also has some smarts about what a hash key is: The hash key is always converted into a string, so that wide comma knows that in a key-value pair like this:

    key1 => 711,

that key1 is intended as a string.

But what about something like this?

  {
   company name => 'Kamamaya Technology',
   employee 1 => first name => 'Jane',
   employee 1 => last name => 'Doe',
   employee 1 => title => 'President',
   employee 2 => first name => 'John',
   employee 2 => last name => 'Smith',
   employee 3 => first name => 'Clarence',
   employee 3 => last name => 'Darrow',
  }

Here I think the intent is obvious -- to create an employee database in the form of a hash of hashes, allowing spaces in the keys. In Data::Dumper format, the result would look like:

{
              'employee 2' => {
                                'last name' => '\'Smith\'',
                                'first name' => '\'John\''
                              },
              'company name' => '\'Kamamaya Technology\'',
              'employee 3' => {
                                'last name' => '\'Darrow\'',
                                'first name' => '\'Clarence\''
                              },
              'employee 1' => {
                                'title' => '\'President\'',
                                'last name' => '\'Doe\'',
                                'first name' => '\'Jane\''
                              }
            }

And in fact, that is the output of the script in this Github gist, which parses the previous "extended Perl 5" snippet using a Marpa grammar before passing it on to Perl.

Perl 5 does not allow a syntax like this, and looking at its parsing code will tell you why -- it's already a maintenance nightmare. The extension I've described above could, in theory, be added to Perl 5, but doing so would aggravate an already desperate maintenance situation.

Now, depending on taste, you may be just as happy that you'll never see the extensions I have just outlined in Perl 5. But I don't think it is as easy to be happy about a parsing technology that quickly paints the languages which use it into a corner.

How it works

The code is in a Github gist. For the purposes of the example, I've implemented a toy subset of Perl. But this approach has been shown to scale. There are full Marpa-powered parsers of C, ECMAScript, XPath, and liberal HTML.

Marpa is a general BNF parser, which means that anything you can write in BNF, Marpa can parse. For practical parsing, what matters are those grammars that can be parsed in linear time, and with Marpa that class is vast, including all the classes of grammar currently in practical use. To describe the class of grammars that Marpa parses in linear time, assume that you have either a left or right parser, with infinite lookahead, that uses regular expressions. (A parser like this is called LR-regular.) Assume that this LR-regular parser parses your grammar. In that case, you can be sure that Marpa will parse that grammar in linear time, and without doing the lookahead. (Instead Marpa tracks possibilities in a highly-optimized table.) Marpa also parses many grammars that are not LR-regular in linear time, but just LR-regular is very likely to include any class of grammar that you will be interested in parsing. The LR-regular grammars easily include all those that can be parsed using yacc, recursive descent or regular expressions.

Marpa excels at those special hacks so necessary in recursive descent and other techniques. Marpa allows you to define events that will stop it at symbols or rules, both before and after. While stopped, you can hand processing over to your own custom code. Your custom code can feed your own tokens to the parse for as long as you like. In doing so, it can consult Marpa to determine exactly what symbols and rules have been recognized and which ones are expected. Once finished with custom processing, you can then ask Marpa to pick up again at any point you wish.

The craps game is over

The bottom line is that if you can describe your language extension in BNF, or in BNF plus some hacks, you can rely on Marpa parsing it in reasonable time. Language design has been like shooting crap in a casino that sets you up to win a lot of the first rolls before the laws of probability grind you down. Marpa changes the game.

To learn more

Marpa::R2 is available on CPAN. A list of my Marpa tutorials can be found here. There are new tutorials by Peter Stuifzand and amon. The Ocean of Awareness blog focuses on Marpa, and it has an annotated guide. Marpa has a web page that I maintain and Ron Savage maintains another. For questions, support and discussion, there is the "marpa parser" Google Group.

Comments

Comments on this post can be made in Marpa's Google group.


posted at: 22:48 | direct link to this entry

§         §         §