Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest


Sun, 07 Sep 2014


Parsing: a timeline

[ Revised 22 Oct 2014 ]

1960: The ALGOL 60 spec comes out. It specifies, for the first time, a block structured language. The ALGOL committee is well aware that nobody knows how to parse such a language. But they believe that, if they specify a block-structured language, a parser for it will be invented. Risky as this approach is, it pays off ...

1961: Ned Irons publishes his ALGOL parser. In fact, the Irons parser is the first parser of any kind to be described in print. Ned's algorithm is a left parser -- a form of recursive descent. Unlike modern recursive descent, the Irons algorithm is general and syntax-driven. "General" means it can parse anything written in BNF. "Syntax-driven" (aka declarative) means that the parser is actually created from the BNF -- the parser does not need to be hand-written.

1961: Almost simultaneously, hand-coded approaches to left parsing appear. These we would now recognize as recursive descent. Over the following years, hand-coding approaches will become more popular for left parsers than syntax-driven algorithms. Three factors are at work:

1965: Don Knuth invents LR parsing. Knuth is primarily interested in the mathematics. Knuth describes a parsing algorithm, but it is not thought practical.

1968: Jay Earley invents the algorithm named after him. Like the Irons algorithm, Earley's algorithm is syntax-driven and fully general. Unlike the Irons algorithm, it does not backtrack. Earley's core idea is to track everything about the parse in tables. Earley's algorithm is enticing, but it has three major issues: it mishandles zero-length rules; it is quadratic for right recursions; and the bookkeeping needed to maintain its tables is daunting by the standards of 1968 hardware.

1969: Frank DeRemer describes a new variant of Knuth's LR parsing. DeRemer's LALR algorithm requires only a stack and a state table of quite manageable size.

1972: Aho and Ullman describe a straightforward fix to the zero-length rule bug in Earley's original algorithm. Unfortunately, this fix involves adding even more bookkeeping to Earley's algorithm.

1975: Bell Labs converts its C compiler from hand-written recursive descent to DeRemer's LALR algorithm.

1977: The first "Dragon book" comes out. This soon-to-be classic textbook is nicknamed after the drawing on the front cover, in which a knight takes on a dragon. Emblazoned on the knight's lance are the letters "LALR". From here on out, to speak lightly of LALR will be to besmirch the escutcheon of parsing theory.

1979: Bell Laboratories releases Version 7 UNIX. V7 includes what is, by far, the most comprehensive, usable and easily available compiler-writing toolkit yet developed. Central to the toolkit is yacc, an LALR-based parser generator. With a bit of hackery, yacc parses its own input language, as well as the language of V7's main compiler, the portable C compiler. After two decades of research, it seems that the parsing problem is solved.

1987: Larry Wall introduces Perl 1. Perl embraces complexity like no previous language. Larry uses LALR very aggressively -- to my knowledge more aggressively than anyone before or since.

1991: Joop Leo discovers a way of speeding up right recursions in Earley's algorithm. Leo's algorithm is linear for just about every unambiguous grammar of practical interest, and for many ambiguous ones as well. By 1991, hardware is six orders of magnitude faster than 1968 hardware, so the issue of bookkeeping overhead has receded in importance. This is a major discovery. When it comes to speed, the game has changed in favor of the Earley algorithm. But Earley parsing is almost forgotten. It will be 20 years before anyone writes a practical implementation of Leo's algorithm.

1990's: Earley's is forgotten. So everyone in LALR-land is content, right? Wrong. Far from it, in fact. Users of LALR are making unpleasant discoveries. While LALR automatically generates their parsers, debugging them is so hard they could just as easily write the parsers by hand. Once debugged, their LALR parsers are fast for correct inputs. But almost all they can tell users about incorrect input is that it is incorrect. In Larry's words, LALR is "fast but stupid".

2000: Larry Wall decides on a radical reimplementation of Perl -- Perl 6. Larry does not even consider using LALR again.

2002: Aycock and Horspool publish their attempt at a fast, practical Earley's parser. Missing from it is Joop Leo's improvement -- they seem not to be aware of it. Their own speedup is limited in what it achieves, and the complications it introduces can be counter-productive at evaluation time. But buried in their paper is a solution to the zero-length rule bug. And this time the solution requires no additional bookkeeping.

2006: GNU announces that the GCC compiler's parser has been rewritten. For three decades, the industry's flagship C compilers have used LALR as their parser -- proof of the claim that LALR and serious parsing are equivalent. Now, GNU replaces LALR with the technology that it replaced a quarter century earlier: recursive descent.

2000 to today: With the retreat from LALR comes a collapse in the prestige of parsing theory. After a half century, we seem to be back where we started. If you took Ned Irons's original 1961 algorithm, changed the names and dates, and translated the code from its mix of assembler and ALGOL into Haskell, you could easily republish it today, and bill it as revolutionary and new.

Marpa

Over the years, I had come back to Earley's algorithm again and again. Around 2010, I realized that the original, long-abandoned vision -- an efficient, practical, general and syntax-driven parser -- was now, in fact, quite possible. The necessary pieces had fallen into place.

Aycock and Horspool had solved the zero-length rule bug. Joop Leo had found the speedup for right recursion. And the issue of bookkeeping overhead had pretty much evaporated on its own. Machine operations are now a billion times faster than in 1968, and probably no longer relevant in any case -- cache misses are now the bottleneck.

But while the original issues with Earley's disappeared, a new issue emerged. With a parsing algorithm as powerful as Earley's behind it, a syntax-driven approach can do much more than it can with a left parser. But with the LALR experience in their collective consciousness, few modern programmers are prepared to trust a purely declarative parser. As the saying goes, "Once a cat's been burned, he won't even sit on a cold stove."

To be accepted, Marpa needed to allow procedural parsing, not just declarative parsing. So Marpa allows the user to specify events -- occurrences of symbols and rules -- at which declarative parsing pauses. While paused, the application can call procedural logic and single-step forward token by token. The procedural logic can hand control back to syntax-driven parsing at any point it likes. The Earley tables can provide the procedural logic with full knowledge of the state of the parse so far: all rules recognized in all possible parses so far, and all symbols expected. Earley's algorithm is now an even better companion for hand-written procedural logic than recursive descent is.
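As a concrete illustration, here is a rough sketch of that pause-and-resume cycle using Marpa::R2's scanless interface. The grammar, the event name, and the driver below are my own invention for this example, not code from the Marpa distribution: a "pause before" event hands control to procedural logic, which inspects the parse state, reads the token itself, and then hands control back to the declarative parser.

#!/usr/bin/env perl
use strict;
use warnings;
use Marpa::R2;

# Illustrative grammar: pause before every <number> lexeme is read.
my $dsl = <<'END_OF_DSL';
:default ::= action => ::first
:start ::= sum
sum ::= number '+' number
number ~ [\d]+
:lexeme ~ number pause => before event => 'before_number'
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

my $input = '40 + 2';
my $pos   = $recce->read( \$input );
while ( $pos < length $input ) {
    for my $event ( @{ $recce->events() } ) {
        my ($name) = @{$event};

        # Procedural logic takes over.  The Earley tables know the
        # full state of the parse -- for example, which lexemes are
        # expected at this point.
        my @expected = @{ $recce->terminals_expected() };
        print "Paused at '$name'; expecting: @expected\n";

        # Single-step: read the paused lexeme ourselves ...
        my ( $start, $length ) = $recce->pause_span();
        $recce->lexeme_read( 'number', $start, $length,
            $recce->literal( $start, $length ) );
    }

    # ... then hand control back to syntax-driven parsing.
    $pos = $recce->resume();
}
print 'Parse value: ', ${ $recce->value() }, "\n";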

For more

For more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group.


posted at: 16:50 | direct link to this entry

§         §         §

Mon, 01 Sep 2014


Marpa has a new web page

Marpa has a new official public website, which Ron Savage has generously agreed to manage. For those who have not heard of it, Marpa is a parsing algorithm. It is new, but very much based on earlier work by Jay Earley, Joop Leo, John Aycock and R. Nigel Horspool. Marpa is intended to replace, and to go well beyond, recursive descent and the yacc family of parsers.

Comments

Comments on this post can be made in Marpa's Google group.


posted at: 20:17 | direct link to this entry

§         §         §

Mon, 18 Aug 2014


Language design: Exploiting ambiguity

Currently, in designing languages, we don't allow ambiguities -- not even potential ones. We insist that it must not even be possible to write an ambiguous program. This is unnecessarily restrictive.

This post is written in English, which is full of ambiguities. Natural languages are always ambiguous, because human beings find that that's the best way for versatile, rapid, easy communication. Human beings arrange things so that every sentence is unambiguous in context. Mistakes happen, and ambiguous sentences occur, but in practice the problem is manageable. In a conversation, for example, we would just ask for clarification.

If we allow our computer languages to take their most natural forms, they will often have the potential for ambiguity. This is even less of a problem on a computer than it is in conversation -- a computer can always spot an actual ambiguity immediately. When actual ambiguities occur, we can deal with them in exactly the same way that we deal with any other syntax problem: The computer catches it and reports it, and we fix it.

An example

To illustrate, I'll use a DSL-writing DSL. It'll be tiny -- just lexeme declarations and BNF rules. Newlines will not be significant. Statements can end with a semicolon, but that's optional. (The code for this post is in a Github gist.)

Here is a toy calculator written in our tiny DSL-writing language:

  Number matches '\d+'
  E ::= T '*' F
  E ::= T
  T ::= F '+' Number
  T ::= Number

Trying an improvement

With a grammar this small, just about anything is readable. But let's assume we want to improve it, and that we decide that the lexeme declaration of Number really belongs after the rules which use it. (If our grammar was longer, this could make a real difference.) So we move the lexeme declaration to the end:

  E ::= T '*' F
  E ::= T
  T ::= F '+' Number
  T ::= Number
  Number matches '\d+'

But there's an issue

It turns out the grammar for our toy DSL-writer is ambiguous. When a lexeme declaration follows a BNF rule, there's no way to tell whether it is actually a lexeme declaration or part of the BNF rule. Our parser catches that:

Parse of BNF/Scanless source is ambiguous
Length of symbol "Statement" at line 4, column 1 is ambiguous
  Choices start with: T ::= Number
  Choice 1, length=12, ends at line 4, column 12
  Choice 1: T ::= Number
  Choice 2, length=33, ends at line 5, column 20
  Choice 2: T ::= Number\nNumber matches '\\d

Here Marpa tells you why it thinks your script is ambiguous. Two different statements can start at line 4. Both of them are BNF rules, but one is longer than the other.
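For the record, here is roughly how such a check might look in code. This is only a sketch, not the actual code in the gist: it assumes a $meta_grammar object built from the DSL-writer's own grammar and a $dsl_source string holding the script above, and it relies on the Marpa::R2 scanless recognizer's ambiguity_metric() and ambiguous() methods.

# Sketch only: $meta_grammar is assumed to be the Marpa::R2::Scanless::G
# object for the DSL-writing DSL, and $dsl_source the script to check.
my $recce = Marpa::R2::Scanless::R->new( { grammar => $meta_grammar } );
$recce->read( \$dsl_source );
if ( $recce->ambiguity_metric() > 1 ) {

    # Report the ambiguity just as we would any other syntax error.
    die $recce->ambiguous();
}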

Just another syntax error

Instead of having to design a language where ambiguity was not even possible, we designed one where ambiguities can happen. This allows us to design a much more flexible language, like the ones we choose when we humans communicate with each other. The downside is that actual ambiguities will occur, but they can be reported, and fixed, just like any other syntax error.

In this case, we recall that we allowed a semicolon to terminate a statement, and our fix is easy:

  E ::= T '*' F
  E ::= T
  T ::= F '+' Number
  T ::= Number ;
  Number matches '\d+'

To learn more

The code for this post is a gist on Github. It was written using Marpa::R2, which is available on CPAN. A list of my Marpa tutorials can be found here. There are new tutorials by Peter Stuifzand and amon. The Ocean of Awareness blog focuses on Marpa, and it has an annotated guide. Marpa has a web page that I maintain and Ron Savage maintains another. For questions, support and discussion, there is a "marpa parser" Google Group and an IRC channel: #marpa at irc.freenode.net.

Comments

Comments on this post can be made in Marpa's Google group.


posted at: 14:12 | direct link to this entry

§         §         §

Mon, 10 Mar 2014


Evolvable languages

Ideally, if a syntax is useful and clear, and a programmer can easily read it at a glance, you should be able to add it to an existing language. In this post, I will describe a modest incremental change to the Perl syntax.

It's a change I like, but whether you like it is beside the point, for two reasons. First, it's simply intended as an example of language evolution. Second, regardless of its merits, it is unlikely to happen, because of the way that Perl 5 is parsed. In this post I will demonstrate a way of writing a parser so that this change, or others, can be made in a straightforward way, and without designing your language into a corner.

When initializing a hash, Perl 5 allows you to use not just commas, but also the so-called "wide comma" (=>). The wide comma is suggestive visually, and it also has some smarts about what a hash key is: a bareword on its left is automatically quoted as a string, so that in a key-value pair like this:

    key1 => 711,

key1 is intended as a string.
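In plain Perl terms (this little example is mine, not part of the proposed extension), the following two initializations build identical hashes:

# The wide comma auto-quotes the bareword on its left.
my %with_wide_comma  = ( key1 => 711 );
my %with_plain_comma = ( 'key1', 711 );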

But what about something like this?

  {
   company name => 'Kamamaya Technology',
   employee 1 => first name => 'Jane',
   employee 1 => last name => 'Doe',
   employee 1 => title => 'President',
   employee 2 => first name => 'John',
   employee 2 => last name => 'Smith',
   employee 3 => first name => 'Clarence',
   employee 3 => last name => 'Darrow',
  }

Here I think the intent is obvious -- to create an employee database in the form of a hash of hashes, allowing spaces in the keys. In Data::Dumper format, the result would look like:

{
              'employee 2' => {
                                'last name' => '\'Smith\'',
                                'first name' => '\'John\''
                              },
              'company name' => '\'Kamamaya Technology\'',
              'employee 3' => {
                                'last name' => '\'Darrow\'',
                                'first name' => '\'Clarence\''
                              },
              'employee 1' => {
                                'title' => '\'President\'',
                                'last name' => '\'Doe\'',
                                'first name' => '\'Jane\''
                              }
            }

And in fact, that is the output of the script in this Github gist, which parses the previous "extended Perl 5" snippet using a Marpa grammar before passing it on to Perl.

Perl 5 does not allow a syntax like this, and looking at its parsing code will tell you why -- it's already a maintenance nightmare. The extension I've described above could, in theory, be added to Perl 5, but doing so would aggravate an already desperate maintenance situation.

Now, depending on taste, you may be just as happy that you'll never see the extensions I have just outlined in Perl 5. But I don't think it is as easy to be happy about a parsing technology that quickly paints the languages which use it into a corner.

How it works

The code is in a Github gist. For the purposes of the example, I've implemented a toy subset of Perl. But this approach has been shown to scale. There are full Marpa-powered parsers of C, ECMAScript, XPath, and liberal HTML.
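To give the flavor of the approach, here is a rough BNF sketch of the extended hash syntax, written in the same SLIF notation used elsewhere on this blog. It is my own reconstruction for illustration -- not the gist's actual grammar -- and the symbol names are invented. No effort is made here to rule out potential ambiguities; Marpa can detect and report any that actually occur.

:start          ::= hash
hash            ::= ('{') pairs ('}')
pairs           ::= pair+ separator => comma proper => 0
pair            ::= key ('=>') value
value           ::= <quoted string> | hash | pair
key             ::= word+
comma           ~ [,]
word            ~ [\w]+
<quoted string> ~ ['] <string chars> [']
<string chars>  ~ [^']+
:discard        ~ whitespace
whitespace      ~ [\s]+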

Marpa is a general BNF parser, which means that anything you can write in BNF, Marpa can parse. For practical parsing, what matters are those grammars that can be parsed in linear time, and with Marpa that class is vast, including all the classes of grammar currently in practical use. To describe the class of grammars that Marpa parses in linear time, assume that you have either a left or right parser, with infinite lookahead, that uses regular expressions. (A parser like this is called LR-regular.) Assume that this LR-regular parser parses your grammar. In that case, you can be sure that Marpa will parse that grammar in linear time, and without doing the lookahead. (Instead Marpa tracks possibilities in a highly-optimized table.) Marpa also parses many grammars that are not LR-regular in linear time, but just LR-regular is very likely to include any class of grammar that you will be interested in parsing. The LR-regular grammars easily include all those that can be parsed using yacc, recursive descent or regular expressions.

Marpa excels at those special hacks so necessary in recursive descent and other techniques. Marpa allows you to define events that will stop it at symbols or rules, both before and after. While stopped, you can hand processing over to your own custom code. Your custom code can feed your own tokens to the parse for as long as you like. In doing so, it can consult Marpa to determine exactly what symbols and rules have been recognized and which ones are expected. Once finished with custom processing, you can then ask Marpa to pick up again at any point you wish.

The craps game is over

The bottom line is that if you can describe your language extension in BNF, or in BNF plus some hacks, you can rely on Marpa parsing it in reasonable time. Language design has been like shooting craps in a casino that sets you up to win a lot of the first rolls before the laws of probability grind you down. Marpa changes the game.

To learn more

Marpa::R2 is available on CPAN. A list of my Marpa tutorials can be found here. There are new tutorials by Peter Stuifzand and amon. The Ocean of Awareness blog focuses on Marpa, and it has an annotated guide. Marpa has a web page that I maintain and Ron Savage maintains another. For questions, support and discussion, there is the "marpa parser" Google Group.

Comments

Comments on this post can be made in Marpa's Google group.


posted at: 19:48 | direct link to this entry

§         §         §

Tue, 25 Feb 2014


Significant newlines? Or semicolons?

Should statements have explicit terminators, like the semicolon of Perl and the C language? Or should they avoid the clutter, and separate statements by giving whitespace syntactic significance and a real effect on the semantics, as is done in Python and JavaScript?

Actually we don't have to go either way. As an example, let's look at some BNF-ish DSL. It defines a small calculator. At first glance, it looks as if this language has taken the significant-whitespace route -- there certainly are no explicit statement terminators.

:default ::= action => ::first
:start ::= Expression
Expression ::= Term
Term ::=
      Factor
    | Term '+' Term action => do_add
Factor ::=
      Number
    | Factor '*' Factor action => do_multiply
Number ~ digits
digits ~ [\d]+
:discard ~ whitespace
whitespace ~ [\s]+

The rule is that there isn't one

If we don't happen to like the layout of the above DSL, and rearrange it in various ways, we'll find that everything we try works. If we become curious about exactly what the rules for newlines are, and look at the documentation, we won't find any. That's because there aren't any.

We can see this by thoroughly messing up the line structure:

:default ::= action => ::first :start ::= Expression Expression ::= Term
Term ::= Factor | Term '+' Term action => do_add Factor ::= Number |
Factor '*' Factor action => do_multiply Number ~ digits digits ~
[\d]+ :discard ~ whitespace whitespace ~ [\s]+

The script will continue to run just fine.

How does it work?

How does it work? Actually, pose the question this way: Can a human reader tell where the statements end? If the reader is not used to reading BNF, he might have trouble with this particular example but, for a language that he knows, the answer is simple: Yes, of course he can. So really the question is, why do we expect the parser to be so stupid that it cannot?

The only trick is that this is done without trickery. Marpa's DSL is written in itself, and Marpa's self-grammar describes exactly what a statement is and what it is not. The Marpa parser is powerful enough to simply take this self-describing DSL and act on it, finding where statements begin and end, much as a human reader is able to.
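For concreteness, here is a minimal driver for the calculator grammar above, along the lines of the Marpa::R2 synopsis on which this example is based. The semantics package name and the bodies of the two action subroutines are my own assumptions for illustration.

use strict;
use warnings;
use Marpa::R2;

# The grammar text from above -- in either layout, since the line
# structure makes no difference to the parse.
my $dsl = <<'END_OF_DSL';
:default ::= action => ::first
:start ::= Expression
Expression ::= Term
Term ::=
      Factor
    | Term '+' Term action => do_add
Factor ::=
      Number
    | Factor '*' Factor action => do_multiply
Number ~ digits
digits ~ [\d]+
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new(
    { grammar => $grammar, semantics_package => 'My_Actions' } );

my $input = '2 + 7 * 6';
$recce->read( \$input );
print ${ $recce->value() }, "\n";    # prints 44

sub My_Actions::do_add      { my ( undef, $t1, undef, $t2 ) = @_; return $t1 + $t2; }
sub My_Actions::do_multiply { my ( undef, $f1, undef, $f2 ) = @_; return $f1 * $f2; }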

To learn more

This example was produced with the Marpa parser. Marpa::R2 is available on CPAN. The code for this example is based on that in the synopsis for its top-level document, but it is isolated conveniently in a Github gist.

A list of my Marpa tutorials can be found here. There are new tutorials by Peter Stuifzand and amon. The Ocean of Awareness blog focuses on Marpa, and it has an annotated guide. Marpa has a web page that I maintain and Ron Savage maintains another. For questions, support and discussion, there is the "marpa parser" Google Group. Comments on this post can be made there.


posted at: 15:30 | direct link to this entry

§         §         §