Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest


Mon, 11 Jun 2018


Parsing with pictures

Derivatives == Earley?

In a cordial Twitter exchange with Matt Might and David Darais, Prof. Might asked if I was interested in looking at their derivatives-based approach. I answered that I was looking at it -- Marpa is an optimization of the Might/Darais approach.

This may sound strange. At first glance, our two algorithms seem about as different as parsing algorithms can get. My Marpa parser is an Earley parser, table-driven, and its parse engine is written in C.[1]

The MDS (Might/Darais/Spiewak) parser is an extension of regular expressions which constructs states on the fly. MDS uses combinators and has implementations in several functional programming languages.[2]

Grammar Flow Graphs

Why then do I imagine that Marpa is an optimized version of the MDS approach? The reason is a paper sent to me by Prof. Keshav Pingali at Austin: "Parsing with Pictures".[3] The title is a little misleading: their approach is not that easy, and the paper requires a considerable amount of math. But it is a lot easier than the traditional way of learning the various approaches to parsing.

The basis of the Pingali-Bilardi approach is the Grammar Flow Graph (GFG). These GFGs are the "pictures" of their title. GFGs are NFAs with recursion added. As has long been known, adding recursion to NFAs allows them to represent any context-free language.

Pingali and Bilardi's next step is new. A GFG can be "simulated" using the same algorithm used to simulate an NFA. However, the result is not immediately impressive. The simulation does produce a recognizer, but not a good recognizer: Some of the strings recognized by the GFG simulator are not in the context-free language.[4]
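The over-liberal recognizer can be seen in a deliberately naive sketch. The Python below is my own illustration, not code from the paper: items are (rule, dot) pairs, "call" edges enter every rule of a nonterminal, and "return" edges go back to every position that references the completed nonterminal -- not just the position that made the call. That unrestricted return is exactly the bug.

```python
# Untagged simulation of a GFG for the grammar S -> a S b | c,
# whose language is { a^n c b^n }.  Representation is mine, for
# illustration only.
RULES = [('S', ('a', 'S', 'b')),   # rule 0
         ('S', ('c',))]            # rule 1
NONTERMINALS = {'S'}

def closure(items):
    items = set(items)
    stack = list(items)
    while stack:
        r, d = stack.pop()
        lhs, rhs = RULES[r]
        if d < len(rhs) and rhs[d] in NONTERMINALS:       # call edge
            new = {(r2, 0) for r2, (l2, _) in enumerate(RULES)
                   if l2 == rhs[d]}
        elif d == len(rhs):      # return edge -- buggy: ALL call sites
            new = {(r2, d2 + 1)
                   for r2, (_, rhs2) in enumerate(RULES)
                   for d2, sym in enumerate(rhs2) if sym == lhs}
        else:
            new = set()
        for item in new - items:
            items.add(item)
            stack.append(item)
    return items

def recognize(text):
    states = closure({(r, 0) for r, (lhs, _) in enumerate(RULES)
                      if lhs == 'S'})
    for ch in text:
        states = closure({(r, d + 1) for r, d in states
                          if d < len(RULES[r][1]) and RULES[r][1][d] == ch})
    return any(d == len(RULES[r][1]) and RULES[r][0] == 'S'
               for r, d in states)

print(recognize('aacbb'))  # True  -- correctly in the language
print(recognize('acbb'))   # True  -- WRONG: not in the language
```

Because a completed S can "return" to a call that was never made, the simulator accepts strings like "acbb", which are not in the language.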

To repair their "simulation", Pingali and Bilardi add a tag to each state. This tag tracks where the recursion began.[5] This not only fixes the recognizer, but the added information is enough to allow the set of parse trees to be efficiently recovered from the sets of GFG states. In other words, with tags, the GFG recognizer now is a parser.

It turns out that this recognizer-turned-parser is not new. In fact, it is exactly Earley's algorithm.

Pingali and Bilardi do not stop there. Using their new framework, they go on to show that all LL-based and LR-based algorithms are simplifications of their Earley parser. From this point of view, Earley parsing is the foundation of all context-free parsing, and LL- and LR-based algorithms are Earley optimizations.[6]

Step 1: The MDS algorithm

To show that Marpa is an optimization of the MDS approach, I will start with the MDS algorithm, and attempt to optimize it. For its functional programming language, the MDS paper uses Racket. The MDS parser is described directly, in the usual functional language manner, as a matching operation.

In the MDS paper, the MDS parser is optimized with laziness and memoization. Nulls are dealt with by computing their fixed points on the fly. Even with these three optimizations, the result is still highly inefficient. So, as an additional step, MDS also implements "deep recursive simplification" -- in effect, strategically replacing laziness with eagerness. With this the MDS paper conjectures that the algorithm's time is linear for a large class of practical grammars.

Step 2: Extended regular expressions

Next, we notice that the context-free grammars of the MDS algorithm are regular expressions extended to allow recursion. Essentially, they are GFGs translated into Racket match expressions. The equivalence is close enough that you could imagine the MDS paper using GFGs for its illustrations.

Step 3: Simulating an NFA

Unsurprisingly, then, the MDS and GFG approaches are similar in their first step. Each consumes a single character to produce a "partial parse". A partial parse, for both of these algorithms, can be represented as a duple. One element of the duple is a string representing the unconsumed input. The other is a representation of a set of parse trees. In the case of MDS, in keeping with its functional approach, the set of parse trees is represented directly. In the case of the GFG-simulator, the set of parse trees is compressed into a sequence of GFG-state-sets. There is one GFG-state-set for the start of parsing, and one for each consumed character.

Step 4: Earley's Algorithm

At this point, with the introduction of the GFG state-sets to represent parse-trees, the MDS algorithm and its optimized GFG equivalent have "forked". Recall from above that this GFG simulator has a bug -- it is over-liberal.

The bug is the one already described, and our fix is the one already described: Each GFG state either starts a recursion or is part of one. We fix the bug by tagging each GFG state with the index of the GFG state-set that starts its recursion. Once these tags are added, the GFG state-sets are exactly Earley sets.
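The tagged construction can be written out directly as a minimal Earley recognizer. This Python sketch is my own, not Marpa's code: each item carries an origin tag, and a completion "returns" only to items in the set where its recursion began.

```python
# Minimal Earley recognizer.  Items are (lhs, rhs, dot, origin);
# origin is the index of the Earley set where this rule's recursion
# began.  Terminals are any symbols that are not grammar keys.
GRAMMAR = {'S': [('a', 'S', 'b'), ('c',)]}   # L = { a^n c b^n }

def earley(grammar, start, tokens):
    sets = [set() for _ in range(len(tokens) + 1)]
    sets[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i, item_set in enumerate(sets):
        while True:
            added = set()
            for lhs, rhs, dot, origin in item_set:
                if dot < len(rhs) and rhs[dot] in grammar:   # predict
                    added |= {(rhs[dot], p, 0, i)
                              for p in grammar[rhs[dot]]}
                elif dot == len(rhs):            # complete: return ONLY
                    added |= {(l2, r2, d2 + 1, o2)   # to the tagged
                              for l2, r2, d2, o2 in sets[origin]
                              if d2 < len(r2) and r2[d2] == lhs}
            if added <= item_set:
                break
            item_set |= added
        if i < len(tokens):                                  # scan
            sets[i + 1] = {(l, r, d + 1, o)
                           for l, r, d, o in item_set
                           if d < len(r) and r[d] == tokens[i]}
    return any(l == start and d == len(r) and o == 0
               for l, r, d, o in sets[-1])

print(earley(GRAMMAR, 'S', 'aacbb'))  # True
print(earley(GRAMMAR, 'S', 'acbb'))   # False -- over-acceptance gone
```

With origin tags in place, a completed S can no longer return to a call site that never called it, and the recognizer accepts exactly the context-free language.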

Step 5: The Leo optimization

Next we incorporate an optimization by Joop Leo, which makes Earley parsing linear for all LR-regular grammars, without using lookahead. Since LR-regular is a superset of LR(k) and LL(k), including LL(*), we do not bother with lookahead.
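Leo's optimization targets the long completion chains that right recursion creates. The eligibility test at its heart can be sketched as follows; the item representation is mine, and a real implementation (Leo's, or Marpa's) does considerably more bookkeeping.

```python
# Sketch of Leo's eligibility condition.  Completions of a nonterminal
# can be memoized into a single "transitive item" when exactly one item
# in the Earley set is waiting for that nonterminal, AND that item is
# waiting at the penultimate position of its rule -- so completing it
# immediately triggers another completion, the chain that Leo's
# optimization short-circuits.  Items are (lhs, rhs, dot, origin).

def leo_eligible(item_set, nonterminal):
    waiting = [(lhs, rhs, dot, origin)
               for lhs, rhs, dot, origin in item_set
               if dot < len(rhs) and rhs[dot] == nonterminal]
    return (len(waiting) == 1
            and waiting[0][2] == len(waiting[0][1]) - 1)

# Right recursion A -> 'a' A: the one item waiting for A sits at its
# penultimate position, so Leo applies and the chain collapses.
print(leo_eligible({('A', ('a', 'A'), 1, 0)}, 'A'))       # True
# A second item waiting for A breaks uniqueness:
print(leo_eligible({('A', ('a', 'A'), 1, 0),
                    ('B', ('A', 'b'), 0, 2)}, 'A'))       # False
```

When the condition holds at every set along a right recursion, the completions that plain Earley performs one by one, set after set, are replaced by a single constant-time step, which is what makes the algorithm linear for LR-regular grammars.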

Step 6: Marpa

To get from an Earley/Leo parser to a Marpa parser, we need to address one more major point. In modern parsing practice, programmers expect the ability to introduce procedural logic, even to the point of switching parsers. We accomplish this by ensuring that the processing of each Earley set is complete before processing of the next Earley set begins.

This means that Marpa has available full information about the parse so far -- it is left-eidetic. Error reporting is unexcelled, and procedural logic can use this information as well. For example, full parse information implies full knowledge of which tokens are expected next.

This means that you can write a liberal HTML parser, starting with a very illiberal HTML grammar. When the illiberal parser encounters a point where it cannot continue because of a missing token, procedural logic can ask what the expected token is; concoct that token on the fly; supply that token to the illiberal parser; and then ask the illiberal parser to continue. This is called the "Ruby Slippers" technique, and an HTML parser based on it has been implemented.[7]
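Marpa's actual interface differs, but given left-eidetic state the core query is simple. This sketch, over the same (lhs, rhs, dot, origin) Earley items, uses a toy HTML-like grammar of my own invention:

```python
# Ruby Slippers sketch: when the strict parser stalls, ask the current
# Earley set which terminals it expects, concoct one, and resume.
# Terminals are any symbols that are not left-hand sides in the grammar.

def expected_terminals(item_set, grammar):
    return {rhs[dot] for lhs, rhs, dot, origin in item_set
            if dot < len(rhs) and rhs[dot] not in grammar}

GRAMMAR = {'element': [('<b>', 'text', '</b>')],
           'text': [('word',)]}

# Suppose the strict parser has consumed "<b> word" and stalled at
# end of input.  Its current Earley set might look like this:
stalled = {('element', ('<b>', 'text', '</b>'), 2, 0)}

print(expected_terminals(stalled, GRAMMAR))
# {'</b>'} -- invent this token, feed it in, and continue parsing
```

The procedural "wizard" picks one of the expected tokens, hands it to the illiberal parser as if it had appeared in the input, and the parse proceeds.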

The way forward

As mentioned, the MDS algorithm has its own approach to optimization, one which takes maximum advantage of functional programming. In contrast, Marpa relies on C level coding.

One example of the contrast in optimization techniques is null handling. Recall from above that, to deal with null processing, the MDS algorithm uses a fixed-point algorithm on the fly. Marpa, on the other hand, precomputes a list of nullable symbols using bit vectors before parsing begins. In this particular case, Marpa will usually be the winner.
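Marpa's precomputation is done in C with bit vectors, but the fixed point itself fits in a few lines. A Python sketch (representation mine):

```python
# Precompute the nullable symbols of a grammar by iterating to a fixed
# point: a nonterminal is nullable if some rule for it has an
# all-nullable RHS (an empty RHS is trivially all-nullable).

def nullable_symbols(grammar):
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, productions in grammar.items():
            if lhs not in nullable and any(
                    all(sym in nullable for sym in rhs)
                    for rhs in productions):
                nullable.add(lhs)
                changed = True
    return nullable

GRAMMAR = {'A': [()],                  # A -> (empty)
           'B': [('A', 'A')],          # B -> A A
           'C': [('c',), ('B', 'C')]}  # C -> c | B C
print(nullable_symbols(GRAMMAR))       # {'A', 'B'} (set order may vary)
```

Because this runs once, before any input is seen, its cost is amortized over every parse, whereas the on-the-fly fixed point pays as it goes.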

A second case in which hand optimization wins is the all-important Leo optimization. Simon Peyton-Jones is a very smart man, but nonetheless I believe that the day that GHC will look at a functional specification of an Earley parser and rediscover the Leo optimization at compile time is far off.

On the other hand, I do not imagine that hand optimization and/or C language is the winner in all cases. A programmer is deceiving himself if he imagines that, in the general case, he can spot all the places where lazy evaluation or memoization will be effective. And of course, even an omniscient programmer is not going to be there at run-time to do "just in time" optimization. Perhaps the optimal parser of the future will combine important hand optimizations with functional programming.

Comments, etc.

To learn about Marpa, my Earley/Leo-based parsing project, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Footnotes

1. Kegler, Jeffrey. "Marpa, A Practical General Parser: The Recognizer.", 2013. PDF accessed 24 April 2018.

Marpa has a stable implementation. For it, and for more information on Marpa, there are these resources: Marpa website, accessed 25 April 2018. Kegler's website, accessed 25 April 2018. Github repo, accessed 25 April 2018. MetaCPAN, accessed 30 April 2018.

2. Matthew Might, David Darais and Daniel Spiewak. "Functional Pearl: Parsing with Derivatives." International Conference on Functional Programming 2011 (ICFP 2011). Tokyo, Japan. September, 2011. pages 189--195. PDF accessed 9 Jun 2018. Slides accessed 9 June 2018. Video accessed 9 June 2018.

3. Keshav Pingali and Gianfranco Bilardi, "Parsing with Pictures", UTCS tech report TR-2012. 2012. PDF accessed 9 June 2018. Video accessed 9 June 2018.

Less accessible, but with more details about GFGs and GFG-based Earley parsing is Keshav Pingali and Gianfranco Bilardi, "A graphical model for context-free grammar parsing." Compiler Construction - 24th International Conference, CC 2015. Lecture Notes in Computer Science, Vol. 9031, pp. 3-27, Springer Verlag, 2015. PDF accessed 9 June 2018.

4. Pingali and Bilardi 2012, section 3.1.

5. Pingali and Bilardi 2015, p. 11.

6. Pingali and Bilardi 2012, Sections 4-7.

7. I have based a liberal HTML pretty-printer on that parser, one which I use quite frequently. I used it, for example, when writing this blog post. To find out more about Ruby Slippers parsing see the Marpa FAQ, questions 122 and 123; my blog series on parsing html; and my blog post "Marpa and the Ruby Slippers".


posted at: 03:06

§         §         §