Introduction to Marpa Book in progress

Tue, 22 Mar 2016

Introduction to Marpa Book in progress

What follows is a summary of the features of the Marpa algorithm, followed by a discussion of potential applications. It refers to itself as a "monograph", because it is a draft of part of the introduction to a technical monograph on the Marpa algorithm. I hope the entire monograph will appear in a few weeks.

The Marpa project

The Marpa project was intended to create a practical and highly available tool to generate and use general context-free parsers. Tools of this kind had long existed for LALR and regular expressions. But, despite an encouraging academic literature, no such tool had existed for context-free parsing. The first stable version of Marpa was uploaded to a public archive on Solstice Day 2011. This monograph describes the algorithm used in the most recent version of Marpa, Marpa::R2. It is a simplification of the algorithm presented in my earlier paper.

A proven algorithm

While the presentation in this monograph is theoretical, the approach is practical. The Marpa::R2 implementation has been widely available for some time, and has seen considerable use, including in production environments. Many of the ideas in the parsing literature satisfy theoretical criteria, but in practice turn out to face significant obstacles. An algorithm may be as fast as reported, but may turn out not to allow adequate error reporting. Or a modification may speed up the recognizer, but require additional processing at evaluation time, leaving no advantage to compensate for the additional complexity.

In this monograph, I describe the Marpa algorithm as it was implemented for Marpa::R2. In many cases, I believe there are better approaches than those I have described. But I treat these techniques, however solid their theory, as conjectures. Whenever I mention a technique that was not actually implemented in Marpa::R2, I will always explicitly state that that technique is not in Marpa as implemented.

Features

General context-free parsing

As implemented, Marpa parses all "proper" context-free grammars. The proper context-free grammars are those which are free of cycles, unproductive symbols, and inaccessible symbols. Worst case time bounds are never worse than those of Earley's algorithm, and therefore never worse than O(n**3).

Linear time for practical grammars

Currently, the grammars suitable for practical use are thought to be a subset of the deterministic context-free grammars. Using a technique discovered by Joop Leo, Marpa parses all of these in linear time. Leo's modification of Earley's algorithm is O(n) for LR-regular grammars. Leo's modification also parses many ambiguous grammars in linear time.

Left-eidetic

The original Earley algorithm kept full information about the parse --- including partial and fully recognized rule instances --- in its tables. At every parse location, before any symbols are scanned, Marpa's parse engine makes available its information about the state of the parse so far. This information is in useful form, and can be accessed efficiently.

Recoverable from read errors

When Marpa reads a token which it cannot accept, the error is fully recoverable. An application can try to read another token. The application can do this repeatedly as long as none of the tokens are accepted. Once the application provides a token that is accepted by the parser, parsing will continue as if the unsuccessful read attempts had never been made.

Ambiguous tokens

Marpa allows ambiguous tokens. These are often useful in natural language processing where, for example, the same word might be a verb or a noun. Use of ambiguous tokens can be combined with recovery from rejected tokens so that, for example, an application could react to the rejection of a token by reading two others.

Using the features

Error reporting

An obvious application of left-eideticism is error reporting. Marpa's abilities in this respect are ground-breaking. For example, users typically regard an ambiguity as an error in the grammar. Marpa, as currently implemented, can detect an ambiguity and report specifically where it occurred and what the alternatives were.

Event driven parsing

As implemented, Marpa::R2 allows the user to define "events". Events can be defined that trigger when a specified rule is complete, when a specified rule is predicted, when a specified symbol is nulled, when a user-specified lexeme has been scanned, or when a user-specified lexeme is about to be scanned. A mid-rule event can be defined by adding a nulling symbol at the desired point in the rule, and defining an event which triggers when the symbol is nulled.

Ruby slippers parsing

Left-eideticism, efficient error recovery, and the event mechanism can be combined to allow the application to change the input in response to feedback from the parser. In traditional parser practice, error detection is an act of desperation. In contrast, Marpa's error detection is so painless that it can be used as the foundation of new parsing techniques.

For example, if a token is rejected, the lexer is free to create a new token in the light of the parser's expectations. This approach can be seen as making the parser's "wishes" come true, and I have called it "Ruby Slippers Parsing".

One use of the Ruby Slippers technique is to parse with a clean but oversimplified grammar, programming the lexical analyzer to make up for the grammar's short-comings on the fly. As part of Marpa::R2, the author has implemented an HTML parser, based on a grammar that assumes that all start and end tags are present. Such an HTML grammar is too simple even to describe perfectly standard-conformant HTML, but the lexical analyzer is programmed to supply start and end tags as requested by the parser. The result is a simple and cleanly designed parser that parses very liberal HTML and accepts all input files, in the worst case treating them as highly defective HTML.

Ambiguity as a language design technique

In current practice, ambiguity is avoided in language design. This is very different from the practice in the languages humans choose when communicating with each other. Human languages exploit ambiguity in order to design highly flexible, powerfully expressive languages. For example, the language of this monograph, English, is notoriously ambiguous.

Ambiguity of course can present a problem. A sentence in an ambiguous language may have undesired meanings. But note that this is not a reason to ban potential ambiguity --- it is only a problem with actual ambiguity.

Syntax errors, for example, are undesired, but nobody tries to design languages to make syntax errors impossible. A language in which every input was well-formed and meaningful would be cumbersome and even dangerous: all typos in such a language would be meaningful, and parser would never warn the user about errors, because there would be no such thing.

With Marpa, ambiguity can be dealt with in the same way that syntax errors are dealt with in current practice. The language can be designed to be ambiguous, but any actual ambiguity can be detected and reported at parse time. This exploits Marpa's ability to report exactly where and what the ambiguity is. Marpa::R2's own parser description language, the SLIF, uses ambiguity in this way.

Auto-generated languages

In 1973, Čulik and Cohen pointed out that the ability to efficiently parse LR-regular languages opens the way to auto-generated languages. In particular, Čulik and Cohen note that a parser which can parse any LR-regular language will be able to parse a language generated using syntax macros.

Second order languages

In the literature, the term "second order language" is usually used to describe languages with features which are useful for second-order programming. True second-order languages --- languages which are auto-generated from other languages --- have not been seen as practical, since there was no guarantee that the auto-generated language could be efficiently parsed.

With Marpa, this barrier is raised. As an example, Marpa::R2's own parser description language, the SLIF, allows "precedenced rules". Precedenced rules are specified in an extended BNF. The BNF extensions allow precedence and associativity to be specified for each RHS.

Marpa::R2's precedenced rules are implemented as a true second order language. The SLIF representation of the precedenced rule is parsed to create a BNF grammar which is equivalent, and which has the desired precedence. Essentially, the SLIF does a standard textbook transformation. The transformation starts with a set of rules, each of which has a precedence and an associativity specified. The result of the transformation is a set of rules in pure BNF. The SLIF's advantage is that it is powered by Marpa, and therefore the SLIF can be certain that the grammar that it auto-generates will parse in linear time.

Notationally, Marpa's precedenced rules are an improvement over similar features in LALR-based parser generators like yacc or bison. In the SLIF, there are two important differences. First, in the SLIF's precedenced rules, precedence is generalized, so that it does not depend on the operators: there is no need to identify operators, much less class them as binary, unary, etc. This more powerful and flexible precedence notation allows the definition of multiple ternary operators, and multiple operators with arity above three.

Second, and more important, a SLIF user is guaranteed to get exactly the language that the precedenced rule specifies. The user of the yacc equivalent must hope their syntax falls within the limits of LALR.

References, comments, etc.

Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

posted at: 20:56 | direct link to this entry

§ § §

Ocean of Awareness

Marpa resources

Tue, 22 Mar 2016