Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest


Wed, 20 Jun 2018


Parsing left recursions

Left recursion

A lot has been written about parsing left recursion. Unfortunately, much of it simply adds to the mystery. In this post, I hope to frame the subject clearly and briefly.

I expect the reader has some idea of what left recursion is, and perhaps some experience of it as an issue. Informally, left recursion occurs when a symbol expands to something with itself on the left. This can happen directly, for example, if production (1) is in a grammar.


    (1) A ::= A B
    
Indirect left recursion happens when, for example, (2) and (3) are productions in a grammar.

    (2) A ::= B C
    (3) B ::= A D
    
A grammar with productions (4) and (5) has a "hidden" left recursion.

    (4) A ::= B A C
    (5) B ::= # empty
    
This is because <A> ends up leftmost in derivations like:

    (6) A  ⟶ B A C ⟶ A C
    
In derivation (6), production (4) was applied, then production (5).

For those into notation, a grammar is left recursive if and only if it allows a derivation of the form


    (7) A  ⟶+ β A γ  where  β = ε
    
In (7), ε is the empty string, and X ⟶+ Y indicates that X derives Y in one or more rule applications. The condition β = ε is what makes the recursion in (7) a left recursion: nothing may appear to the left of <A> in the derived string.

So, OK, what is the problem?

The problem with parsing left recursions is that if you are parsing using a derivation like


    (8) A  ⟶ A B 
    
then you have defined <A> in terms of <A>. All recursions can be a problem, but left recursions are a particular problem because almost all practical parsing methods[1] proceed left to right, and derivations like (8) will lead many of the most popular algorithms straight into an infinite regress.

Why do some algorithms not have a problem?

In a sense, all algorithms which solve the left recursion problem do it in the same way. It is just that in some, the solution appears in a much simpler form.

The solution is at its simplest in Earley's algorithm. That is no coincidence -- as Pingali and Bilardi[2] show, Earley's, despite its daunting reputation, is actually the most basic Chomskyan context-free parsing algorithm, the one from which all others derive.

Earley's builds a table. The Earley table contains an initial Earley set and an Earley set for each token. The Earley set for each token describes the state of the parse after consuming that token. The basic idea is not dissimilar to the Might/Darais/Spiewak (MDS) approach of parsing by derivatives, and the logic for building the Earley sets resembles that of MDS.[3]
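
In outline, the table might be represented as in this sketch. This is only an illustration of the data layout (the tokens variable is assumed for the example), not Marpa's internals:

    # One Earley set per position: earley_sets[0] is the initial set,
    # and earley_sets[i] holds the state after consuming token i.
    earley_sets = [set() for _ in range(len(tokens) + 1)]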

For the purpose of studying left recursion, what matters is that each Earley set contains Earley "items". Some of the items are called predictions because they predict the occurrence of a symbol at that location in the input.

To record a left recursion in an Earley set, the program adds a prediction item for the left recursive symbol. It is that simple.[4]

Multiple occurrences of a prediction item would be identical, and therefore useless. Therefore subsequent attempts to add the same prediction item are ignored, and recursion does not occur.
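
To make this concrete, here is a minimal sketch in Python of the prediction step. The grammar and item representation are invented for the example -- this illustrates the idea, not Marpa's implementation. Because the Earley set is a set, an attempt to add a prediction that is already present does nothing, and the loop terminates even for a left-recursive rule like A ::= A B.

    # Toy grammar: a dict from LHS symbol to a list of RHS tuples.
    # A ::= A B | C  makes <A> directly left recursive.
    GRAMMAR = {
        "A": [("A", "B"), ("C",)],
        "B": [("b",)],
        "C": [("c",)],
    }

    def predict(earley_set, location):
        """Add prediction items until no more can be added."""
        # An item is (lhs, rhs, dot, origin); in a prediction the dot
        # is at the start of the production and the origin is the
        # current location.
        worklist = list(earley_set)
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for production in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], production, 0, location)
                    if item not in earley_set:   # duplicates are ignored,
                        earley_set.add(item)     # so recursion stops here
                        worklist.append(item)

Seeding the set with an item whose dot is just before <A> predicts <A>'s two productions once each; the left-recursive production then tries to re-predict <A>, finds those items already present, and stops.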

If some have no problem, why do others?

Besides Earley's, a number of other algorithms handle left recursion without any issue -- notably LALR (aka yacc or bison) and LR. This re-raises the original question: why do some algorithms have a left recursion problem?

The worst afflicted algorithms are the "top-down" parsers. The best known of these is recursive descent -- a parsing methodology which, essentially, does parsing by calling a subroutine to handle each symbol. In the traditional implementation of recursive descent, left recursion is very problematic. Suppose that, as part of a recursive descent implementation, you are writing the function to parse the symbol <A>, which you are calling parse_A(). If your grammar has a rule


    (9) A ::= A B
    
the first thing you need to do in a naive implementation of parse_A() is to call parse_A(). And parse_A() will then call parse_A(). And so, in the naive implementation, on and on forever.
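
In code, the regress is easy to see. The sketch below uses hypothetical helpers (parse_B(), parse_C(), next_token_is_B()); the second function shows the usual top-down fix, in which the left recursion of A ::= A B | C is rewritten as a <C> followed by zero or more <B>s:

    # Naive recursive descent for  A ::= A B  -- the function recurses
    # before consuming any input, so it can never terminate.
    def parse_A():
        parse_A()      # infinite regress starts here
        parse_B()      # never reached

    # The classic workaround: rewrite the left recursion as iteration.
    # A ::= A B | C  derives a <C> followed by zero or more <B>s.
    def parse_A_rewritten():
        parse_C()
        while next_token_is_B():   # hypothetical one-token lookahead
            parse_B()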

The fixed-point solution to left recursion

Over the years, many ways to solve the top-down left recursion issue have been announced. The MDS solution is one of the more interesting -- interesting because it actually works[5], and because it describes all the others, including the Earley algorithm solution. MDS reduces the problem to the more general one of finding a "fixed point" of the recursion.

In math, the "fixed point" of a function is an argument of the function which is equal to its value for that argument -- that is, an x such that f(x) = x. MDS describes an algorithm which "solves" the left recursion for its fixed point. That "fixed point" can then be memoized. For example, the value of parse_A can be the memoized "fixed point" value of <A>.
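
As a reminder of the underlying idea, here is a tiny, self-contained illustration of finding a fixed point by iteration. It is just the concept, not the MDS algorithm:

    import math

    def fixed_point(f, x, tolerance=1e-12):
        """Iterate f until its output stops changing (within tolerance)."""
        while abs(f(x) - x) > tolerance:
            x = f(x)
        return x

    # cos() has a fixed point near 0.739: cos(0.739085...) = 0.739085...
    print(fixed_point(math.cos, 1.0))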

The Earley solution of left recursion was, in fact, an optimized "fixed point". The computation of an Earley set is the application of a set of rules for adding Earley items. This continues until no more Earley items can be added. In other words, the rules for building an Earley set are applied until they find their "fixed point".[6]
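
Restated in fixed-point terms, the closure of an Earley set might be sketched as below. Here add_rules stands in for the item-adding rules (prediction among them), with each rule returning the set of items it proposes; the names are invented, not Marpa's internals:

    def close_earley_set(earley_set, add_rules):
        """Apply the item-adding rules until they reach their fixed point."""
        changed = True
        while changed:
            changed = False
            proposed = set()
            for rule in add_rules:
                proposed |= rule(earley_set)   # items this rule would add
            new_items = proposed - earley_set
            if new_items:
                earley_set |= new_items
                changed = True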

Other top-down solutions

The MDS fixed point solution does the job, but as described in their paper it requires a functional programming language to implement, and it is expensive. In the worst case, the MDS approach is exponential, although they conjecture that it is linear for a large class of practical grammars.

Top-down algorithms can take an "in-between strategy" -- they can tackle those left recursions that are cheap to find, without computing the full "fixed point". Here a well-defined boundary is crucial: a programmer wants to know whether their particular grammar will work, and whether small tweaks to their grammar will break it.

Top-down can be seen as a "guessing" strategy with hacks. Hacks are always needed in top-down, because the input is at the bottom, not the top, and a useful top-down algorithm needs to look at the input. But the hacks can be as simple as lookahead, and lookahead can be implemented without seriously compromising the simplicity and flexibility of the original top-down approach.
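
A single token of lookahead, for instance, can be as simple as peeking at the next token before deciding which subroutine to call. The helper names here are hypothetical:

    # One-token lookahead in recursive descent: peek() examines the
    # next token without consuming it.
    def parse_statement():
        if peek() == "if":
            parse_if_statement()
        elif peek() == "while":
            parse_while_statement()
        else:
            parse_expression_statement()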

With detection and fixing of left-recursion, the "hack" part of the top-down strategy becomes very complicated. The attraction of top-down is its simplicity, and its resulting adaptability to procedural logic. The point can be reached where the original strategy comes into question.

After all, a recursive descent parser can straightforwardly take care of left recursion issues by calling an Earley parser. But in that case, why not simply use Earley's?

Comments, etc.

Marpa is my own implementation of an Earley parser.[7] Comments on this post can be made in Marpa's Google group, or on its IRC channel: #marpa at freenode.net.

Footnotes

1. I probably could have said "all practical parsing methods" instead of "almost all". Right-to-left parsing methods exist, but they see little use. In any case, they only reverse the problem. Parsing in both directions is certainly possible but, as I will show, we do not have to go to quite that much trouble.

2. Keshav Pingali and Gianfranco Bilardi, UTCS tech report TR-2012. 2012. PDF accessed 9 June 2018. Video accessed 9 June 2018. Less accessible is Keshav Pingali and Gianfranco Bilardi, "A graphical model for context-free grammar parsing." Compiler Construction - 24th International Conference, CC 2015. Lecture Notes in Computer Science, Vol. 9031, pp. 3-27, Springer Verlag, 2015. PDF accessed 9 June 2018.

3. Matthew Might, David Darais and Daniel Spiewak. "Functional Pearl: Parsing with Derivatives." International Conference on Functional Programming 2011 (ICFP 2011). Tokyo, Japan. September, 2011. pp. 189-195. PDF accessed 9 June 2018. Slides accessed 9 June 2018. Video accessed 9 June 2018.

4. Those familiar with Earley's algorithm may note that Earley items are traditionally in terms of productions, not symbols. Symbol predictions are therefore recorded indirectly.

Specifically, Earley items are traditionally duples of (dr, origin), where dr is a dotted production -- a production with a "dot location" marked -- and origin is a location in the input. In all predictions, the origin is the current location and the dot location is at the start of the production, so there can be at most one prediction per rule. A prediction of a symbol <A> is recorded as a prediction of every production which has <A> on its LHS.

The argument in the main text is made the way it is because it is simpler to speak of "predicted symbols" than to repeatedly refer to "sets of predictions of productions sharing a common LHS".

5. There have been many more attempts than implementations over the years, and even some of the most-widely used implementations have their issues.

6. Recall that potential left recursions are recorded as "predictions" in Earley's algorithm. Predictions recurse, but since they do not depend on the input, they can be precomputed. This means that Earley implementations can bring each Earley set to its fixed point very quickly.

7. Marpa has a stable implementation. For it, and for more information on Marpa, there are these resources:
Marpa website, accessed 25 April 2018.
Kegler's website, accessed 25 April 2018.
Github repo, accessed 25 April 2018.
MetaCPAN, accessed 30 April 2018.
There is also a theory paper for Marpa: Kegler, Jeffrey. "Marpa, A Practical General Parser: The Recognizer.", 2013. PDF accessed 24 April 2018.


posted at: 09:15

§         §         §