Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website

Google+

Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Wed, 20 Mar 2013


The Interpreter Design Pattern

The influential Design Patterns book lays out 23 patterns for programming. One of them, the Interpreter Pattern, is rarely used. Steve Yegge puts it a bit more strikingly -- he says that the book contains 22 patterns and a practical joke.

That sounds (and in fact is) negative, but elsewhere Yegge says that "[t]ragically, the only [Go4] pattern that can help code get smaller (Interpreter) is utterly ignored by programmers". (The Design Patterns book has four authors, and is often called the Gang of Four book, or Go4.)

In fact, under various names and definitions, the Interpreter Pattern and its close relatives and/or identical twins are widely cited, much argued and highly praised[1]. As they should be. Languages are the most powerful and flexible design pattern of all. A language can include all, and only, the concepts relevent to your domain. A language can allow you to relate them in all, and only, the appropriate ways. A language can identify errors with pinpoint precision, hide implementation details, allow invisible "drop-in" enhancements, etc., etc., etc.

In fact languages are so powerful and flexible, that their use is pretty much universal. The choice is not whether or not to use a language to solve the problem, but whether to use a general-purpose language, or a domain-specific language. Put another way, if you decide not to use a language targeted to your domain, it almost always means that you are choosing to use another language that is not specifically fitted to your domain.

Why then, is the Interpreter Pattern so little used? Why does Yegge call it a practical joke?

There's a problem

The problem with the Interpreter Pattern is that you must turn your language into an AST -- that is, you must parse it somehow. Simplifying the language can help here. But if the point is to be simple at the expense of power and flexibility, you might as well stick with the other 22 design patterns.

On the other hand, creating a parser for anything but the simplest languages has been a time-consuming effort, and one of a kind known for disappointing results. In fact, language development efforts run a real risk of total failure.

How did the Go4 deal with this? They defined the problem away. They stated that the parsing issue was separate from the Interpreter Pattern, which was limited to what you did with the AST once you'd somehow come up with one.

But AST's don't (so to speak) grow on trees. You have to get one from somewhere. In their example, the Go4 simply built an AST in their code, node by node. In doing this, they bypassed the BNF and the problem of parsing. But they also bypassed their language and the whole point of the Interpreter Pattern.

Which is why Yegge characterized the chapter as a practical joke. And why other programming techniques and patterns are almost always preferred to the Interpreter Pattern.

Finding that one missing piece

So that's how the Go4 left things. A potentially great programming technique, made almost useless because of a missing piece. There was no easy, general, and practical way to generate AST's.

Few expected that to change. I was more optimistic than most. In 2007 I embarked on a full-time project: to create a parser based on Earley's algorithm. I was sure that it would fulfill two of the criteria -- it would be easy to use, and it would be general. As for practical -- well, a lot of parsing problems are small, and a lot of applications don't require a lot of speed, and for these I expected the result to be good enough.

What I didn't realize was that all of the problems preventing Earley's from seeing real, practical use has already been solved in the academic literature. I was not alone in not having put the picture together. The people who had solved the problems had focused on two disjoint sets of issues, and were unaware of each other's work. In 1991, in the Netherlands, the mathematican Joop Leo had arrived at an astounding result -- he showed how to make Earley's run in linear time for LR-regular grammars. LR-regular is a vast class of grammars. It easily includes, as a proper subset, every class of grammar now in practical use -- regular expressions, PEG, recursive descent, the LALR on which yacc and bison are based, you name it. (For those into the math, LR-regular includes LR(k) for all k, and therefore LL(k), also for all k.)

Leo's mathematical approach did not address some nagging practical issues, foremost among them the handling of nullable rules and symbols. But ten years later in Canada, Aycock and Horspool focused on exactly these issues, and solved them. Aycock-Horspool seem to have been unaware of Leo's earlier result. The time complexity of the Aycock-Horspool algorithm was essentially that of Earley's original algorithm.

Because of Leo's work, for any grammar in any class currently in practical use, an Earley's parser could be fast. If only it could be combined with the approach of Aycock and Horspool, I realized, Leo's speeds could be available in an everyday programming tool.

In changing the Earley parse engine, Aycock-Horspool and Leo had branched off in different directions. It was not obvious that their approaches could be combined, much less how. And in fact, the combination of the two is not a simple algorithm. But it is fast, and the new Marpa parse engine makes full information about the state of the parse (rules recognized, symbols expected, etc.) available as it proceeds. This is very convenient for, among other things, error reporting.

Eureka and all that

The result is an algorithm which parses anything you can write in BNF and does it in times considered optimal in practice. Unlike recursive descent, you don't have to write out the parser -- Marpa generates a parser for you, from the BNF. It's the easy, "drop-in" solution that the Go4 needed and did not have. A reworking of the Go4 example, with the missing parser added, is in a previous blog post, and the code for the reworking is in a Github gist.

More about Marpa

Marpa's latest version is Marpa::R2, which is available on CPAN. Recently, it has gained immensely in "whipitupitude" with a new interface, which has tutorials here and here. Marpa has a web page, and of course it is the focus of my "Ocean of Awareness" blog.

Comments on this post can be sent to the Marpa's Google Group: marpa-parser@googlegroups.com

Notes

Note 1: For example, the Wikipedia article on DSL's; Eric Raymond discussing mini-languages; "Notable Design Patterns for Domain-Specific Languages", Diomidis Spinellis; and the c2.com wiki.


posted at: 13:16 | direct link to this entry

§         §         §