Wed, 20 Jun 2018
Parsing left recursions
A lot has been written about parsing left recursion.
Unfortunately, much of it simply adds to the mystery.
In this post, I hope to frame the subject clearly and briefly.
I expect the reader has some idea of what left recursion is,
and perhaps some experience of it as an issue.
Informally, left recursion occurs when a symbol expands to something
with itself on the left.
This can happen directly, for example, if
is in a grammar.
(1) A ::= A B
Indirect left recursion happens when,
are productions in a grammar.
(2) A ::= B C
(3) B ::= A D
A grammar with productions
has a "hidden" left recursion.
(4) A ::= B A C
(5) B ::= # empty
This is because
ends up leftmost in derivations like:
(6) A ⟶ B A C ⟶ A C
In derivation (6)
then production (5)
For those into notation,
a grammar is left recursive if and only if it allows a derivation of the
(7) A ⟶+ β A γ where β = ε
In (7) ε
is the empty string,
α ⟶+ β
indicates that α
in one or more rule applications.
So, OK, what is the problem?
The problem with parsing left recursions is that if you are parsing
using a derivation like
(8) A ⟶ A B
then you have defined
in terms of
All recursions can be a problem,
but left recursions are a particular problem because almost all practical
proceed left to right,
and derivations like (8)
will lead many of
the most popular algorithms straight into
an infinite regress.
Why do some algorithms not have a problem?
In a sense,
all algorithms which solve the left recursion problem do
it in the same way.
It is just that in some,
the solution appears in a much simpler form.
The solution is at most simple in Earley's algorithm.
That is no coincidence -- as Pingali and Bilardi
Earley's, despite its daunting reputation,
is actually the most basic Chomskyan context-free parsing algorithm,
the one from which all others derive.
Earley's builds a table.
The Earley table contains an initial Earley set
and an Earley set for each token.
The Earley set for each token
describes the state of the parse after consuming that token.
The basic idea is not dissimilar
to that of the Might/Darais/Spiewak (MDS) idea of parsing by derivatives,
and the logic for building the Earley sets resembles
that of MDS.
For the purpose of studying left recursion,
what matters is that
each Earley set contains Earley "items".
Some of the items are called predictions
because they predict the occurrence of a symbol
at that location in the input.
To record a left recursion in an Earley set,
the program adds
a prediction item for the left recursive symbol.
It is that simple.
Multiple occurrences of a prediction item would be identical,
and therefore useless.
Therefore subsequent attempts
to add the same prediction item are ignored,
and recursion does not occur.
If some have no problem, why do others?
a number of other algorithms handle left recursion without
any issue -- notably LALR
(aka yacc or bison) and LR.
This re-raises the original question:
why do some algorithms have a left recursion problem?
The worst afflicted algorithms are the "top-down"
The best known of these is recursive descent --
a parsing methodology which, essentially, does parsing
by calling a subroutine to handle each symbol.
In the traditional implementation of recursive descent,
left recursion is very problematic.
Suppose that, as part of a recursive descent implementation,
you are writing the function to parse the
which you are calling
If your grammar has a rule
(9) A ::= A B
the first thing you need to do in a naive
is to call parse_A()
then call parse_A()
And so, in the naive implementation, on and on forever.
The fixed-point solution to left recursion
Over the years,
many ways to solve the top-down left recursion issue have been
The MDS solution is one of the more interesting --
interesting because it actually works,
and because it describes all the others,
including the Earley algorithm solution.
MDS reduces the problem to
the more general one of finding a "fixed point" of the recursion.
In math, the "fixed point" of a function is an argument of
the function which is equal to its value for that argument --
that is, an x such that f(x) = x.
MDS describes an algorithm which "solves" the left recursion
for its fixed point.
That "fixed point" can then be memoized.
For example the value of parse_A
can be the memoized "fixed point" value of
The Earley solution of left recursion was, in fact, an optimized
The computation of an Earley is the application of a set
of rules for adding Earley items.
This continues until no more Earley items can be added.
In other words, the rules for building an Earley set
are applied until they find
their "fixed point".
Other top-down solutions
The MDS fixed point solution does
but as described in their paper it requires a functional
programming language to implement,
and it is expensive.
In the worst case, the MDS approach is exponential,
although they conjecture that it is linear for a large
class of practical grammars.
Top-down algorithms can take an "in-between strategy" --
they can tackle those left recursions that are cheap to
find, without computing the full "fixed point".
Here a well-defined boundary is crucial:
A programmer wants to know if their particular grammar will
and whether small tweaks to their grammar will break it.
Top-down can be seen as
a "guessing" strategy with hacks.
Hacks are always needed in top-down, because the input is at the bottom,
not the top, and a useful top-down algorithm needs to look at the input.
But the hacks can be as simple as lookahead,
and lookahead can be implemented without seriously compromising
the simplicity and flexibility of the original top-down approach.
With detection and fixing of left-recursion,
the "hack" part of the top-down strategy becomes very complicated.
The attraction of top-down is its simplicity,
and its resulting adapability to procedural logic.
The point can be reached where the original strategy
comes into question.
a recursive descent parser can straightforwardly take care of left recursion
issues by calling an Earley parser.
But in that case,
why not simply use Earley's?
Marpa is my own implementation of an Earley parser.
Comments on this post can be made in
Marpa's Google group
or on its IRC channel: #marpa at freenode.net.
posted at: 09:15 |
direct link to this entry