Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Mon, 18 Jun 2018


Marpa and combinator parsing

The missing part

A previous post described how to use the current stable Marpa implementation as a better procedural parser. This post describes how the Marpa algorithm can be used as the basis of better combinator parsers.

In the post on procedural parsing, the subparsers[1] were like combinators, in that they could be called recursively, so that a parse could be built up from components. Like combinators, each child could return, not just a parse, but a set of parses. And, as in combinators, once a child combinator returned its value, the parent parser could resume parsing at a location specified by the child combinator. So what was missing?

A combinator, in order to handle ambiguity, returns not a subparse, but a set of subparses. In the full combinator model, each subparse can have its own "resume location".[2] The procedural parsing post did not provide for multiple resume locations. We will now proceed to make up for that.
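To make the difference concrete, here is a sketch of the two subparser shapes as Python type signatures. This is an illustration of the notation in footnote 2, not Marpa's API; the names FullCombinator and ProceduralSubparser are invented here, and Parse stands in for whatever parse representation (for example, a parse tree) an application uses.

    from typing import Callable, Set, Tuple

    Parse = str      # stand-in for a parse representation ("P" in footnote 2)
    Input = str      # a string over the alphabet A
    Location = int   # a position in the input

    # A full combinator returns a set of (parse, resume location) pairs,
    # so each alternative subparse carries its own resume location.
    # This mirrors A* -> ℙ( P × A* ).
    FullCombinator = Callable[[Input, Location], Set[Tuple[Parse, Location]]]

    # The subparsers of the procedural parsing post return a set of
    # parses, but only one shared resume location.
    # This mirrors A* -> ℙ( P ) × A*.
    ProceduralSubparser = Callable[[Input, Location], Tuple[Set[Parse], Location]]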

How it works

The Marpa parser has the ability to accept multiple subparses, each with its own length. This allows child subparses to overlap in any fashion, forming a mosaic as complex as the application needs.

An Earley parser is table-driven -- its parse table consists of Earley sets, with an initial Earley set and one Earley set per token. This makes for a very simple idea of location. Location 0 is the location of the initial Earley set. Location N is the location of the Earley set after the N'th token has been consumed.

Simplicity is great, but unfortunately this won't work for variable-length tokens. To handle those, Marpa introduces another idea of location: the earleme. Like Earley set locations, the earlemes begin at 0, and advance in integer sequence. Earley set 0 is always at earleme 0. Every Earley set has an earleme location. On the other hand, not every earleme has a corresponding Earley set -- there can be "empty" earlemes.
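As a quick illustration (a sketch of my own, not Marpa's internals), the relationship between the two location ideas can be pictured as a partial mapping from earlemes to Earley sets. Here a hypothetical two-earleme token spans earlemes 1 through 3, so no token ends at earleme 2, and earleme 2 is empty:

    # Illustrative only. Earley sets are numbered sequentially, and each
    # has an earleme location -- but not every earleme has an Earley set.
    earley_set_at_earleme = {
        0: "Earley set 0",   # the initial Earley set, always at earleme 0
        1: "Earley set 1",
        # earleme 2: empty -- no token ends here, so no Earley set
        3: "Earley set 2",
    }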

The lower-level interface for Marpa is Libmarpa. Every time Libmarpa adds a token, a length in earlemes must be specified. In the most-used higher-level Marpa interfaces, this "earleme length" is always 1, which makes the Libmarpa location model collapse into the traditional one.

The Libmarpa recognizer advances earleme-by-earleme. In the most-used higher-level Marpa interfaces, a token ends at every earleme (unless, of course, that earleme is after end-of-input). This means that the most-used Marpa interfaces create a new Earley set every time they advance one earleme. Again, in this case, the Libmarpa model collapses into the traditional one.

In Libmarpa and other lower-level interfaces, there may be cases where

- one or more tokens end after the current earleme, but
- no token ends at the current earleme.

In such cases the current earleme will be empty.
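The following toy model may help. It is a sketch for this post only, not Libmarpa code: the method names alternative() and earleme_complete() echo Libmarpa's calls of those names, but the internals here are invented for illustration. It shows length-1 tokens collapsing into the traditional model, and a length-2 token leaving an empty earleme behind.

    # Illustrative toy, not Libmarpa. Tracks which earlemes get Earley sets.
    class ToyRecognizer:
        def __init__(self):
            self.current_earleme = 0
            self.pending = {}                # end-earleme -> tokens ending there
            self.earley_set_earlemes = [0]   # the initial Earley set is at earleme 0

        def alternative(self, token, length):
            """Add a token starting at the current earleme and ending
            `length` earlemes later. Several alternatives, of different
            lengths, may be added at the same earleme -- this is how
            overlapping subparses are accepted."""
            end = self.current_earleme + length
            self.pending.setdefault(end, []).append(token)

        def earleme_complete(self):
            """Advance one earleme. An Earley set is built only if some
            token ends here; otherwise this earleme is empty."""
            self.current_earleme += 1
            ending = self.pending.pop(self.current_earleme, [])
            if ending:
                self.earley_set_earlemes.append(self.current_earleme)
            return ending

    r = ToyRecognizer()
    r.alternative("a", 1)    # a one-earleme token
    r.earleme_complete()     # "a" ends: Earley set at earleme 1
    r.alternative("bb", 2)   # a two-earleme token
    r.earleme_complete()     # nothing ends here: earleme 2 is empty
    r.earleme_complete()     # "bb" ends: Earley set at earleme 3
    assert r.earley_set_earlemes == [0, 1, 3]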

This is only an outline of the basic concepts behind the Marpa input model. The formalisms are in the Marpa theory paper.[3] The documentation for Libmarpa and Marpa's other low-level interfaces contains more accessible, but detailed, descriptions.[4]

Value added

Left-eidetic information

As readers of my previous posts[5] will know, Marpa is "left-eidetic" -- the application has access to everything to its left. This is an advantage over the traditional implementation of combinator parsing, where parse information about the left context may be difficult or impossible to access.[6]

More powerful linear-time combinators

Marpa parses a superset of LR-regular grammars in linear time, which makes it a more powerful "building block" than traditionally available for combinator parsing. This gives the programmer of a combinator parser more options.

State-of-the-art worse-than-linear combinators

In special circumstances, programmers may want to use subparsers whose time complexity is worse than linear -- for example, when they know the input string will be very short. Marpa parses context-free grammars in state-of-the-art time.[7]

The code, comments, etc.

To learn more about Marpa, a good first stop is the semi-official website, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Footnotes

1. In some of the descriptions of Marpa's procedural parsing, these subparsers are called "lexers". This emphasizes the usual case in current practice, where the subparsers are the bottom layer of the parsing application, and do not invoke their own child subparsers.

2. In notational terms, a full combinator is a function of the form
         A* → ℙ( P × A* ),
where A is the alphabet of the grammar; P is a representation of a single parse (for example, a parse tree); ℙ(X) is the power set of a set X; and X × Y is the Cartesian product of sets X and Y. The subparsers of the procedural parsing post were of the form
         A* → ℙ( P ) × A*.

3. Kegler, Jeffrey. "Marpa, a Practical General Parser: The Recognizer". 2013. Section 12, "The Marpa input model", pp. 39-40.

4. Libmarpa API document, the "Input" section. Marpa::R2's NAIF interface allows access to the full Libmarpa input model and its documentation contains a higher-level description of Marpa's alternative input models. There is also a thin Perl interface to Libmarpa, the THIF interface, which allows full access to the alternative input models.

5. For example, the post on procedural parsing contains a good, simple, example of the use of Marpa's left-eideticism.

6. For best effect, left-eideticism and functional purity probably should be used in combination. For the moment at least, I am focusing on explaining the capabilities, and leaving it to others to find the monadic or other solutions that will allow programmers to leverage this power in functionally pure ways.

7. Specifically O(n^2) for unambiguous grammars, and O(n^3) for ambiguous grammars.


posted at: 08:29

§         §         §