**Jeffrey Kegler's blog**
about Marpa, his new parsing algorithm,
and other topics of interest

The Ocean of Awareness blog: home page, chronological index, and annotated index.

The difference between theory and practice is that in theory there is no difference between theory and practice, but in practice, there is.[1]

Once it was taken seriously that humans might have the power to, for example, "read" a chessboard in a way that computers could not beat. This kind of "computational mysticism" has taken a beating. But it survives in one last stronghold -- parsing theory.

In a previous post, I asked "Why is parsing considered solved?" If the state of the art of computer parsing is taken as anything close to its ultimate solution, then it is a case of "human exceptionalism" -- the human brain has some power that makes it much better at parsing than computers can be. It is very unlikely resorting to human exceptionalism as an explanation would be accepted for any other problem in computer science. Why is it accepted for parsing theory?[2]

The question really requires two separate answers:

- "Why do practitioners accept the current state of the art as the solution?" and
- "Why do the theoreticians accept the current state of the art as the solution?"

In one sense, the answer to both questions is the same -- because of the consensus created by Knuth's 1965 paper "On the translation of languages from left to right". In a previous post, I looked at Knuth 1965 and I answered the practitioner question in detail. But, for the sake of brevity, I answered the question about the theoreticians in outline. This post expands on that outline.

To summarize, in 1965,
**practitioners**
accepted the parsing problem as solved
for the following reasons.

- In 1965, every practical parser was stack-driven.
- As of 1965, stacks themselves were quite leading edge. As recently as 1961, a leading edge article[3] could not assume that its readers knew what "pop" and "push" operations were.
- An algorithm that combined state transitions and stack operations was already a challenge to existing machines. In 1965, any more complicated algorithm was likely to be unuseable in practice.
- Last, but not least, the theoreticians assured the
practitioners that
`LR`-parsing was either state-of-the-art or beyond, so making more agressive use of hardware would be futile.

The practitioners of 1965, then,
were quite reasonable in feeling that
`LR`-parsing was as good as anything they were likely to be able
to implement any time soon.
And they were being told by the theorists that,
in fact,
it never would get any better --
there were theoretical limits on parsers that faster
hardware could not overcome.

We now know that the theorists were wrong --
there are non-`LR`
parsers which are better than the
`LR`
parsers are at
`LR`
grammars.
What made the theorists go astray?

As the epigraph for this post reminds us, theorists who hope to guide practitioners have to confront a big problem -- theory is practice only in theory. Theoreticians (or at least the better ones, like Knuth) know this, but they try to make theory as reliable a guide to practice as possible.

One of the most important examples of the theoretician's successes is asymptotic notation, which we owe to Knuth[4]. Asymptotic notation is more commonly referred to as big-O notation. The term "asymptotic notation" emphasizes its most dangerous aspect from a practical point of view: Asymptotic notation assumes that the behavior of most interest is the behavior for arbitrarily large inputs.

Practical inputs can be very large but, by definition, they are never arbitrarily large. Results in asymptotic terms might be what is called "galactic" -- they might have relevance only in situations which cannot possibly occur in practice.

Fortunately for computer science, asymptotic results usually are not "galactic". Most often asymptotic results are not only relevant to practice -- they are extremely relevant. Wikipedia pages for algorithms put the asymptotic complexities in special displays, and these displays are one of the first things that some practitioners look at.

Since coming up with a theoretical model that is equivalent to "practical" is impossible, theoreticians often work like artillerists. Artillerists often deliberately overshoot and undershoot, before they "fire for effect". "Bracketing" their target in this way has disadvantages -- it reduces the element of surprise, and can even allow the enemy to get their counter-fire in first. But, nasty as these consequences could be, the advantage in accuracy is usually held to outweigh them.

The practice of theoretical computer science is less risky, which makes "bracketing" a very attractive approach to tricky problems. Theoreticians often try to "bracket" practice between an "undershoot" and an "overshoot". The undershoots are models simple and efficient enough to be practical, but too weak to capture all the needs of practice. The overshoots are models which capture everything a practitioner needs, but which are too complicated and/or too resource-intensive for practice.

The P vs. NP problem is an active example of a bracketing technique.
You will sometimes read that
the P/NP boundary is expected to be
that between practical and impractical,
but this is an extreme simplification.
P includes complexities like
`O(n^1000000)`,
where the complexity for even
`n == 2`
is
a nunber which, in decimal form,
fills many pages.
Modulo bold advances in quantum computing,
I cannot imagine that
`O(n^1000000)`
will ever be
practical.
And you can make the complexities much harder
than
`O(n^1000000)`
without ever reaching P-hard.

So P-hard is beyond any reasonable definition of "practical" -- it is an "overshoot". But the P vs. NP question is almost certainly very relevant to what is "practical". Resolving the P vs. NP question is likely to be an important or even necessary step. It is a mystery that such a seemingly obvious question has resisted the best efforts of the theoreticians for so long, and the solution of P vs. NP is likely to bring new insights into asymptotic complexity.

When Knuth published his 1965, "practical parsing" was already bracketed. On the overshoot side, Irons had already published a parser for context-free grammars. Worst case, this ran in exponential time, and it was, and remains, expected that general context-free parsing was not going to be practical.[5]

On the undershoot side, there were regular expressions and recursive descent. Regular expressions are fast and very practical, but parse a very limited set of grammars. Recursive descent is also fast and, since it parses a larger set of grammars, was the closest undershoot.

To curry respect from the behaviourists, American linguistics for many years banned any reference to meaning. Behaviorists looked down on hypothesized mental states as not worthy of "science", and it's hard to have a theory of meaning without conjectures about mental states. Without mental states, language was just a set of utterances. So in 1926 the linguist Leonard Bloomfield dutifully defined a "language" as a set of "utterances" (for our purposes, "strings"), and through the 30s and 40s most American linguists followed him.

After a brief nod to this tradition, Noam Chomsky restored sanity to linguistics. But it was too late for computer science. Automata theory adopted the semantics-free definition. In 1965, Knuth inherited a lot of prior work, almost all of which ignored, not just meaning or semantics, but even syntax and structure.[6]

Knuth, of course, wanted to make contact with prior art. The definition he had inherited seemed to work well enough and Knuth's 1965 defines a language as a set of strings. Most subsequent work has refused to breach this tradition.

In most people's idea of what a language is, the utterances/strings mean something -- you cannot take just any set of meaningless strings and call it a language. So the parsing theorists and everybody else had two different definitions of language.

But parsing theory also hoped to produce results relevant to practice, and few people are interested in recognizing meaningless strings -- almost everybody who parses is interested in (at a minimum) finding some kind of structure in what they parse, in order to do something with the result of the parse. Parsing theorists ended up using the word "language" in one sense, but implying that results they found worked for the word "language" in the usual sense.

At this point both senses of the word "language"
have gotten entrenched in parsing theory.
Instead of making up a new terminology for this blog post,
I will borrow a distinction from linguistics
and speak of
**the extension of a language**
and
**the intension of a language**.
The extension of a language is the Bloomfieldian defintion --
the set of utterances/strings in the language.
The intension of a language, for our purposes here,
can be regarded as its BNF grammar.
Each language intension will have (if it is well-defined)
exactly one extension.
But multiple language intensions can have the same extension.

The temptation to use language extensions as
a proxy for
`LR`-grammars must have been overwhelming.
It turns out that the language extension of
deterministic stack machines
is
**exactly**
that of the
`LR`
grammars.
Further,
the language extension of the context-free grammars is
exactly that of the non-deterministic stack machines.
(Non-deterministic stack machines are
stack machines which can "fork" new instances of themselves on the fly.)

If you take language extensions as the proxy for grammars,
things fall into place very neatly:
the
`LR`-parsers are the deterministic subset of the
context-free parsers.
And "deterministic" seemed like a very good approximation
of practical.
Certainly non-deterministic parsing is probably not practical.
And the best practical parsers in 1965 were
deterministic stack parsers.

Viewed this way,
`LR`-parsing looked like the equivalent
of practical parsing.
It was a "direct hit",
or as close to a exact equivalent of practical parsing
as theory was going to get.

As we shall see, with this red herring, the reasoning went astray. But disaster was not inevitable. The whole point of bracketing, after all, is that it allows you to correct errors. Another red herring, however, resulted in parsing theory going on a decades-long wrong turn.

The second red herring led to the mis-bracketing of practical
parsing.
Having seemingly established that
`LR`-parsing is a natural boundary
in the hierarchy of languages,
Knuth discovered that general
`LR`-parsers were very far from practical.
`LR`
parsing goes out to
`LR(k)`
for arbitrary
`k`,
but even
`LR(1)`
parsing was impractical in 1965 --
in fact, it is rare in practical use today.
As the
`k`
in
`LR(k)`
grows, the size of the tables grows exponentially,
while the value of the additional lookahead rapidly diminishes.
It is not likely that
`LR(2)`
parsing will ever see much practical use,
never mind
`LR(k)` for any `k`
greater than 2.

From this it was concluded that
`LR`-parsing is an overshoot.
In reality,
as Joop Leo was to show,
it is an
**undershoot**,
and in practical terms a very large one.
If you mistake an undershoot for an overshoot,
bracketing no longer works,
and you are not likely to hit your target.

Summing up, parsing theorists concluded, based on the results of Knuth 1965, that

- LR-parsing is a good approximation to practical parsing -- it brackets it closely.
- LR-parsing is an overshoot.
- A subset of
`LR`-parsing will be the solution to the parsing problem.

There were, in hindsight, clear signs
that
`LR`
language extensions were not a good proxy for
`LR`
grammars.
`LR`
grammars form a hierarchy --
for every
`k≥0`,
there is an
`LR`
grammar which
is
`LR(k+1)`, but which is not
`LR(k)`.

But if you look at extensions
instead of grammars,
the hierarchy immediately
collapses --
every
`LR(k)`
language extension is also
an
`LR(1)`
language extension,
as long as
`k≥1`.
Only
`LR(0)`
remains distinct.

It gets worse.
In most practical applications,
you can add an end-of-input marker to a grammar.
If you do this the
`LR`
extension hierarchy collapses totally --
every
`LR(k)`
language extension is also an
`LR(0)`
language extension.

In short, it seems that,
as a proxy for
`LR`
grammars,
`LR`
language extensions are likely to be completely worthless.

Why didn't Knuth see the problem?
Knuth certainly noted the strange behavior of the
`LR`
hierarchy
in extensional terms -- he discovered it,
and devoted several dense pages of his 1965 to laying
out the complicated mathematics involved.

So why did Knuth expect to get away with punning intension and extension, even in the face of some very unsettling results? Here, the answer is very simple -- "punning" had always worked before.

Regular expressions are easily turned into parsers[7], so the language extension of a regular grammar is an adequate approximation to its intension. Context-free recognition has the same complexity, and in practice uses the same algorithms, as context-free parsing, so here again, language extension is a good approximation of language intension.

And the
`LL`
language extensions follow a strict hierarchy --
for every
`k≥0`,
`LL(k+1)`
is a proper superset of
`LL(k)`.
This fact forces
`LL`
grammars to follow the same
hierarchy[8].
So, when studying complexity,
`LL`
language extensions are an excellent proxy for
`LL`
grammars.

Based on past experience, Knuth had reason to believe he could use language extensions as a proxy for grammars, and that the result would be a theory that was a reliable guide to practice.

In
my timeline of parsing,
I describe what happened next.
Briefly,
theory focused on finding a useful subset of
`LR(1)`.
One,
`LALR`, became the favorite and
the basis of the
`yacc`
and
`bison`
tools.

Research into parsing of supersets of
`LR`
became rare.
The theorists were convinced the
`LR`
parsing
was the solution.
These were so convinced that when,
in 1991, Joop Leo discovered a practical way to
parse an
`LR` superset,
the result went unimplemented for decades.

In 1965, the theoreticians gave a lot of weight to the evidence from the world of practice, but probably not undue weight. Going forward, it was a different story.

Leo had,
in essence,
disproved the implied conjecture of Knuth 1965.
But the question is
not an explicit mathematical question,
like that of P vs. NP.
It is a slipprier one -- capturing practice.
Practitioners left it to the theoreticians to keep up with
the literature.
But theoreticians, as long as
`LR`-superset methods did not
come into use in the world of practice,
felt no need to revisit their conclusions.

I encourage
those who want to know more about the story of Parsing Theory
to look at my
Parsing: a timeline 3.0.
To learn about Marpa,
my Earley/Leo-based parsing project,
there is the
semi-official web site, maintained by Ron Savage.
The official, but more limited, Marpa website
is my personal one.
Comments on this post can be made in
Marpa's Google group,
or on our IRC channel:
`#marpa` at `freenode.net`.

1. Attributed to Jan L. A. van de Snepscheut and Yogi Berra. See https://en.wikiquote.org/wiki/Jan_L._A._van_de_Snepscheut, accessed 1 July 2018. I quote my preferred form of this -- the one it takes in Doug Rosenberg and Matt Stephens, Use Case Driven Object Modeling with UML: Theory and Practice, 2007, p. xxvii. Rosenberg and Stephens is also the accepted authority for its attribution. ↩

2. As an aside, I am open to the idea that the human mind has abilities that Turing machines cannot improve on or even duplicate. When it comes to survival heuristics tied to the needs of human bodies, for example, it seems very reasonable to at least entertain the conjecture that the human mind might be near-optimal, particularly in big-O terms. But when it comes to ability to solve problems which can be formalized as "puzzles" -- and syntactic analysis is one of these -- I think that resort to human exceptionalism is a sign of desperation. ↩

3. Oettinger, Anthony. "Automatic Syntactic Analysis and the Pushdown Store", Proceedings of Symposia in Applied Mathematics, Volume 12, American Mathematical Society, 1961. Oettinger describes "push" and "pop" stack operations in "Iversion notation" -- what later became APL. See the discussion of Oettinger in my "Why is parsing considered solved?" post. ↩

4. Knuth did not invent asymptotic notation -- it comes from calculus -- but he introduced it to computer science and motivated its use. ↩

5.
The best lower bound for context-free parsing is still
`O(n)`.
So it is even possible that there is a practical
linear-time general context-free
parser.
But its discovery would be a big surprise.
↩

6. In another blog post, I talk about the use of the word "language" in parsing theory in much more detail. ↩

7. For example, regular expressions can be extended with "captures". Captures cannot handle recursion, but neither can regular expressions, so captures are usually sufficient to provide all the structure an application wants. ↩

8.
The discussion of the
`LL(k)`
hierarchy is in a sense anachronistic --
the
`LL(k)`
hierachy was not studied until after 1965.
But Knuth certainly was aware of recursive descent,
and it seems reasonable to suppose that,
even in 1965,
he had a sense of what
the
`LL`
hierarchy would look like.
↩

posted at: 21:02 | direct link to this entry

§
§
§

A lot has been written about parsing left recursion. Unfortunately, much of it simply adds to the mystery. In this post, I hope to frame the subject clearly and briefly.

I expect the reader has some idea of what left recursion is, and perhaps some experience of it as an issue. Informally, left recursion occurs when a symbol expands to something with itself on the left. This can happen directly, for example, if production

Indirect left recursion happens when, for example,(1) A ::= A B

A grammar with productions(2) A ::= B C (3) B ::= A D

This is because(4) A ::= B A C (5) B ::= # empty

In derivation(6) A ⟶ B A C ⟶ A C

For those into notation, a grammar is left recursive if and only if it allows a derivation of the form

In(7) A ⟶where^{+}β A γβ = ε

The problem with parsing left recursions is that if you are parsing using a derivation like

then you have defined(8) A ⟶ A B

In a sense, all algorithms which solve the left recursion problem do it in the same way. It is just that in some, the solution appears in a much simpler form.

The solution is at most simple in Earley's algorithm. That is no coincidence -- as Pingali and Bilardi[2] show, Earley's, despite its daunting reputation, is actually the most basic Chomskyan context-free parsing algorithm, the one from which all others derive.

Earley's builds a table. The Earley table contains an initial Earley set and an Earley set for each token. The Earley set for each token describes the state of the parse after consuming that token. The basic idea is not dissimilar to that of the Might/Darais/Spiewak (MDS) idea of parsing by derivatives, and the logic for building the Earley sets resembles that of MDS.[3]

For the purpose of studying left recursion, what matters is that each Earley set contains Earley "items". Some of the items are called predictions because they predict the occurrence of a symbol at that location in the input.

To record a left recursion in an Earley set, the program adds a prediction item for the left recursive symbol. It is that simple.[4] Multiple occurrences of a prediction item would be identical, and therefore useless. Therefore subsequent attempts to add the same prediction item are ignored, and recursion does not occur.Besides Earley's,
a number of other algorithms handle left recursion without
any issue -- notably LALR
(aka `yacc` or `bison`) and LR.
This re-raises the original question:
why do some algorithms have a left recursion problem?

The worst afflicted algorithms are the "top-down"
parsers.
The best known of these is recursive descent --
a parsing methodology which, essentially, does parsing
by calling a subroutine to handle each symbol.
In the traditional implementation of recursive descent,
left recursion is very problematic.
Suppose that, as part of a recursive descent implementation,
you are writing the function to parse the
symbol `<A>`,
which you are calling
`parse_A()`.
If your grammar has a rule

the first thing you need to do in a naive implementation of(9) A ::= A B

Over the years, many ways to solve the top-down left recursion issue have been announced. The MDS solution is one of the more interesting -- interesting because it actually works[5], and because it describes all the others, including the Earley algorithm solution. MDS reduces the problem to the more general one of finding a "fixed point" of the recursion.

In math, the "fixed point" of a function is an argument of
the function which is equal to its value for that argument --
that is, an `x` such that `f(x) = x`.
MDS describes an algorithm which "solves" the left recursion
for its fixed point.
That "fixed point" can then be memoized.
For example the value of `parse_A`
can be the memoized "fixed point" value of
`<A>`.

The Earley solution of left recursion was, in fact, an optimized "fixed point". The computation of an Earley is the application of a set of rules for adding Earley items. This continues until no more Earley items can be added. In other words, the rules for building an Earley set are applied until they find their "fixed point".[6]

The MDS fixed point solution **does**
the job,
but as described in their paper it requires a functional
programming language to implement,
and it is expensive.
In the worst case, the MDS approach is exponential,
although they conjecture that it is linear for a large
class of practical grammars.

Top-down algorithms can take an "in-between strategy" -- they can tackle those left recursions that are cheap to find, without computing the full "fixed point". Here a well-defined boundary is crucial: A programmer wants to know if their particular grammar will work, and whether small tweaks to their grammar will break it.

Top-down can be seen as a "guessing" strategy with hacks. Hacks are always needed in top-down, because the input is at the bottom, not the top, and a useful top-down algorithm needs to look at the input. But the hacks can be as simple as lookahead, and lookahead can be implemented without seriously compromising the simplicity and flexibility of the original top-down approach.

With detection and fixing of left-recursion, the "hack" part of the top-down strategy becomes very complicated. The attraction of top-down is its simplicity, and its resulting adapability to procedural logic. The point can be reached where the original strategy comes into question.

After all, a recursive descent parser can straightforwardly take care of left recursion issues by calling an Earley parser. But in that case, why not simply use Earley's?

**1.**
I probably could have said "all practical parsing methods"
instead of "almost all".
Right-to-left parsing methods exist,
but they see little use.
In any case, they only reverse the problem.
Parsing in both directions is certainly possible but,
as I will show,
we do not have to go to quite that much trouble.
↩

**2.**
Keshav Pingali and Gianfranco Bilardi, UTCS tech report TR-2012.
2012.
PDF accessed 9 Junk 2018.
Video accessed 9 June 2018.
Less accessible is
Keshav Pingali and Gianfranco Bilardi,
"A graphical model for context-free grammar parsing."
Compiler Construction - 24th International Conference, CC 2015.
Lecture Notes in Computer Science,
Vol. 9031, pp. 3-27, Springer Verlag, 2015.
PDF accessed 9 June 2018.
↩

**3.**
Matthew Might, David Darais and Daniel Spiewak.
"Functional Pearl: Parsing with Derivatives."
International Conference on Functional Programming 2011 (ICFP 2011).
Tokyo, Japan. September, 2011. pages 189-195.
PDF accessed 9 Jun 2018.
Slides accessed 9 June 2018.
Video accessed 9 June 2018.
↩

**4.**
Those familiar with Earley's algorithm may note that
Earley items are traditionally in terms of productions,
not symbols.
Symbol predictions are therefore recorded indirectly.

Specifically, Earley items are traditionally duples
of `(dr, origin)`,
where `dr` is a dotted production -- a production
with a "dot location" marked;
and `origin` is a location in the input.
In all predictions `origin` is the current location,
and the dot location is at the start of the production,
so there can be at most one prediction per rule.
A prediction of a symbol
`<A>`
is recorded as a prediction of every
production which has
`<A>` on its LHS.

The argument in the main text is made the way it is
because it is simpler to speak
of "predicted symbols" than
to repeatedly refer to "sets of predictions
of productions sharing a common LHS".
↩

**5.**
There have been many more attempts than implementations
over the years,
and even some of the most-widely used
implementations
have
their issues.
↩

**6.**
Recall that potential left recursions are
recorded as "predictions" in Earley's algorithm.
Predictions recurse,
but since they do not depend on the input,
they can be precomputed.
This means that Earley implementations can
bring each Earley set to its fixed point
very quickly.
↩

**7.**
Marpa has a stable implementation.
For it, and for more information on Marpa, there are these resources:

Marpa website, accessed 25 April 2018.

Kegler's website, accessed 25 April 2018.

Github repo, accessed 25 April 2018.

MetaCPAN, accessed 30 April 2018..

There is also a theory paper for Marpa:
Kegler, Jeffrey.
"Marpa, A Practical General Parser: The Recognizer.", 2013.
PDF accessed 24 April 2018.
↩

posted at: 09:15 | direct link to this entry

§
§
§

A previous post described how to use the current stable Marpa implementation as a better procedural parser. This post describes how the Marpa algorithm can be used as the basis of better combinator parsers.

In the post on procedural parsing, the subparsers[1] were like combinators, in that they could be called recursively, so that a parse could be built up from components. Like combinators, each child could return, not just a parse, but a set of parses. And, as in combinators, once a child combinator returned its value, the parent parser could resume parsing at a location specified by the child combinator. So what was missing?

A combinator, in order to handle ambiguity, returns not a subparse, but a set of subparses. In the full combinator model, each subparse can have its own "resume location".[2] The procedural parsing post did not provide for multiple resume locations. We will now proceed to make up for that.

The Marpa parser has the ability to accept multiple subparses, each with its own length. This allows child subparses to overlap in any fashion, forming a mosaic as complex as the application needs.

An Earley parser is table-driven -- its parse tables consists of Earley sets, with an initial Earley set and one Earley set per token. This makes for a very simple idea of location. Location 0 is the location of the initial Earley set. LocationSimplicity is great, but unfortunately
this won't work for variable-length
tokens.
To handle those, Marpa introduces another idea of location:
the **earleme**.
Like Earley set locations,
the earlemes begin at 0,
and advance in integer sequence.
Earley set 0 is always at earleme 0.
Every Earley set has an earleme location.
On the other hand,
not every earleme has a corresponding Earley set --
there can be "empty" earlemes.

The lower-level interface for Marpa is Libmarpa. Every time Libmarpa adds a token, a length in earlemes must be specified. In the most-used higher level Marpa interfaces, this "earleme length" is always 1, which makes the Libmarpa location model collapse into the traditional one.

The Libmarpa recognizer advances earleme-by-earleme. In the most-used higher level Marpa interfaces, a token ends at every earleme (unless of course that earleme is after end-of-input). This means that the most-used Marpa interfaces create a new Earley set every time they advance one earleme. Again, in this case, the Libmarpa model collapses into the traditional one.

In Libmarpa and other lower-level interfaces, there may be cases where

- one or more tokens end after the current earleme, but
- no tokens end
**at**the current earleme.

This is only an outline of the basic concepts behind the Marpa input model. The formalisms are in the Marpa theory paper.[3] The documentation for Libmarpa and Marpa's other low-level interfaces contains more accessible, but detailed, descriptions.[4]

As readers of my previous posts[5] will know, Marpa is "left-eidetic" -- the application has access to everything to its left. This is an advantage over the traditional implementation of combinator parsing, where parse information about the left context may be difficult or impossible to access.[6]

Marpa parses a superset of LR-regular grammars in linear time, which makes it a more powerful "building block" than traditionally available for combinator parsing. This gives the programmer of a combinator parser more options.

In special circumstances, programmers may want to use subparsers which are worse than linear -- for example, they may know that the string is very short. Marpa parses context-free grammars in state of the art time.[7]

**1.**
In some of the descriptions of Marpa's procedural
parsing, these subparsers are called "lexers".
This emphasizes the usual case in current practice,
where the subparsers are the bottom layer of the
parsing application,
and do not invoke their own child subparsers.
↩

**2.**
In notational terms, a full combinator is a function of the form

`A* → ℙ( P × A* )`,

where `A` is the alphabet of the grammar;
`P` is a representation of a single parser
(for example, a parse tree);
`ℙ(X)` is the power set of a set `X`:
and
`X × Y` is the Cartesian product
of sets `X` and `Y`.
The subparsers
of the procedural parsing post
were of the form

`A* → ℙ( P ) × A*`.

↩

**3.**
Kegler, Jeffrey.
"Marpa, a Practical General Parser: The Recognizer".
2013.
Section 12, "The Marpa input model", pp. 39-40.
↩

**4.**
Libmarpa API document,
the "Input" section.
Marpa::R2's NAIF interface allows
access to the full Libmarpa input model
and its documentation contains
a higher-level description of Marpa's alternative input models.
There is also a thin Perl interface to Libmarpa,
the THIF interface,
which allows full access to the alternative input models.
↩

**5.**
For example, the
post on procedural parsing contains a good,
simple, example of the use of Marpa's left-eideticism.
↩

**6.**
For best effect,
left-eidetism and functional purity
probably should be used in combination.
For the moment at least,
I am focusing on explaining the capabilities,
and leaving it to others to find the monadic
or other solutions that will allow programmers to leverage
this power in functionally pure ways.
↩

**7.**
Specifically O(n^2) for unambiguous grammars,
and O(n^3) for ambiguous grammars.
↩

posted at: 08:29 | direct link to this entry

§
§
§

Marpa is an Earley-based parser, and Earley parsers are typically not good at procedural parsing. Many programmers are used to recursive descent (RD), which has been state-of-the-art in terms of its procedural programming capabilities -- it was these capabilities which led to RD's triumph over the now-forgotten Irons algorithm.

Marpa, however, has a parse engine expressly redesigned[1]
to handle procedural logic well.
In fact, Marpa is **better** at procedural logic
than RD.

Marpa parses all LR-regular grammars in linear time,
so the first challenge is to find a grammar
that illustrates a
**need** for procedural logic, even when Marpa is used.
The following is the canonical example of a grammar that is
context-sensitive, but not context-free:

a^n . b^n . c^n : n >= 1I will call this the "ABC grammar". It is a sequence of

The ABC "grammar" is really a counting problem more than
a natural parsing problem,
and parsing is not the fastest or easiest way to solve it.
Three tight loops, with counters, would do the same job nicely,
and would be much faster.
But I chose the ABC grammar for exactly this reason.
It **is** simple in itself,
but it is tricky when treated as a parsing problem.[2]

In picking the strategy below, I opted for one that illustrates a nice subset of Marpa's procedural parsing capabilities. Full code is on-line, and readers are encouraged to "peek ahead".

Our strategy will be to start with a context-free syntax, and then extend it with procedural logic. Here is the context-free grammar:

lexeme default = latm => 1 :default ::= action => [name,start,length,values] S ::= prefix ABC trailer ABC ::= ABs Cs ABs ::= A ABs B | A B prefix ::= A* trailer ::= C_extra* A ~ 'a' B ~ 'b' :lexeme ~ Cs pause => before event => 'before C' Cs ~ 'c' # dummy -- procedural logic readsC_extra ~ 'c'

The first line is boiler-plate: It turns off a default which was made pointless by a later enhancement to Marpa::R2. Marpa::R2 is stable, and backward-compatibility is a very high priority.

:default ::= action => [name,start,length,values]

We will produce a parse tree. The second line defines its format -- each node is an array whose elements are, in order, the node name, its start position, its length and its child nodes.

S ::= prefix ABC trailer

The symbol `<ABC>`
is our "target" -- the counted
`a`'s,
`b`'s,
and `c`'s.
To make things a bit more interesting,
and to make the problem more like a parsing problem instead of a counting problem,
we allow a prefix of `a`'s
and a trailer of `c`'s.

ABC ::= ABs Cs

We divide the
`<ABC>` target into two parts:
`<ABs>`, which contains the
`a`'s,
and `b`'s;
and
`<Cs>`, which contains
the `c`'s.

The string

a^n . b^n

is context free, so that we can handle it without procedural logic, as follows:

ABs ::= A ABs B | A B

The line above recognizes a non-empty string of
`a`'s,
followed by an equal number
of `b`'s.

prefix ::= A* trailer ::= C_extra*

As stated above,
`<prefix>`
is a series of `a`'s and
`<trailer>`
is a series of `c`'s.

A ~ 'a' B ~ 'b'

Marpa::R2 has a separate lexical and syntactic phase.
Here we define our lexemes.
The first two are simple enough:
`<A>` is the character "`a`"; and
`<B>` is the character "`b`".

:lexeme ~ Cs pause => before event => 'before C' Cs ~ 'c' # dummy -- procedural logic readsC_extra ~ 'c'

For the character "`c`",
we need procedural logic.
As hooks for procedural logic,
Marpa allows a full range of events.
Events can occur on prediction and completion of symbols;
when symbols are nulled;
before lexemes;
and after lexemes.
The first line in the above display
declares a "before lexeme" event
on the symbol
`<Cs>`.
The name of the event is "`before C`".

The second line is a dummy entry,
which is needed to allow the "`before C`"
event to trigger.
The entry says that
`<Cs>` is a single character "`c`".
This is false --
`<Cs>` is a series of one or more
`c`'s,
which needs to be counted.
But when
the "`before C`" event triggers,
the procedural
logic will make things right.

The third line defines
`<C_extra>`, which
is another lexeme for the character "`c`".
We have two different lexemes for
the character `c`, because we want some
`c`'s (those in the target)
to trigger events;
and we want other
`c`'s (those in the trailer)
not to trigger events,
but to be consumed by Marpa directly.

At this point, we have solved part of the problem with context-free syntax,
and set up a Marpa event named "`before C`",
which will solve the rest of it.

my $input_length = length ${$input}; for ( my $pos = $recce->read($input); $pos < $input_length; $pos = $recce->resume() ) {... Process events ...}

Processing of events takes place inside a Marpa read loop.
This is initialized with a `read()` method,
and is continued with a `resume()` method.
The `read()` and `resume()` methods
both return the current position
in the input.
If the current position is end-of-input, we are done.
If not, we were interrupted by an event, which we
must process.

Process eventsEVENT: for ( my $event_ix = 0 ; my $event = $recce->event($event_ix) ; $event_ix++ ) { my $name = $event->[0]; if ( $name eq 'before C' ) {... Process "before C" event ...} die qq{Unexpected event: name="$name"}; }

In this application, only one event can occur at any location,
so the above loop is "overkill".
It loops through the events, one by one.
The `event` method returns a reference to an array
of event data.
The only element we care about is the event name.
In fact, if we weren't being careful about error checking,
we would not even care about the event name,
since there can be only one.

If, as expected, the event name is "`before C`",
we process it.
In any other case, we die with an error message.

Process "before C" eventmy ( $start, $length ) = $recce->last_completed_span('ABs'); my $c_length = ($length) / 2; my $c_seq = ( 'c' x $c_length ); if ( substr( ${$input}, $pos, $c_length ) eq $c_seq ) { $recce->lexeme_read( 'Cs', $pos, $c_length, $c_seq ); next EVENT; } die qq{Too few C's};

This is the core part of our procedural logic,
where we have a "`before C`" event.
We must

- determine the right number of
`c`characters; - check that the input has
the right number of
`c`characters; - put together a lexeme to feed the Marpa parser; and
- return control to Marpa.

my ( $start, $length ) = $recce->last_completed_span('ABs'); my $c_length = ($length) / 2;

Marpa claims to be "left-eidetic", that is, to have full knowledge of the parse so far, and to make this knowledge available to the programmer. How does a programmer cash in on this promise?

Of course, there is a fully general interface, which allows you to go through the Earley tables and extract the information in any form necessary. But another, more convenient interface, fits our purposes here. Specifically,

- we want to determine how many
`c`characters we are looking for. - How many
`c`characters we are looking for depends on the number of`a`and`b`characters that we have already seen in the target. - The
`a`and`b`characters that we have already seen in the target are in the`<ABs>`symbol instance. - So, what we want to know is the length of the
most recent
`<ABs>`symbol instance.

Marpa has a `last_completed_span()` method,
and that is just what we need.
This finds the most recent instance of a symbol.
(If there had been more than one most recent instance,
it would have found the longest.)
The `last_completed_span()` method returns the start
of the symbol instance (which we do not care about)
and its length.
The desired number of `c` characters,
`$c_length`, is half the length of the
`<ABs>` instance.

my $c_seq = ( 'c' x $c_length ); if ( substr( ${$input}, $pos, $c_length ) eq $c_seq ) {...}

Marpa allows external parsing. You can pause Marpa, as we have done, and hand control over to another parser -- including another instance of Marpa.

Here external parsing is necessary to make our parser context-sensitive, but the external parser does not have to be fancy. All it needs to do is some counting -- not hard, but something that a context-free grammar cannot do.

`$pos` is the current position in the input,
as returned by the `read()` or `resume()`
method in the outer loop.
Our input is the string referred to by `$input`.
We just calculated `$c_length` as the number of
`c` characters required.
The above code checks to see that the required number of
`c` characters is at `$pos` in the input.

$recce->lexeme_read( 'Cs', $pos, $c_length, $c_seq );

Our external logic is doing the parsing,
but we need to let Marpa know what we are finding.
We do this with the `lexeme_read()` method.
`lexeme_read()` needs to know what symbol we are reading
(`Cs` in our case);
and its value
(`$c_seq` in our case).

Marpa requires that
every symbol be tied in some way to the input.
The tie-in is only for error reporting,
and it can be hack-ish or completely artificial,
if necessary.
In this application, our symbol instance is tied into
the input in a very natural way --
it is the stretch of the input that we compared
to `$c_seq` in the display before last.
We therefore tell Marpa
that the symbol is at `$pos` in the input,
and of length `$c_length`.

next EVENT;

External parsing can go on quite a long time.
In fact, an external parser **never** has to hand
control back to Marpa.
But in this case, we are done very quickly.

We ask for the next iteration of the `EVENT`
loop.
(In this code,
there will not be a next iteration, unless there is an error.)
Once done, the `EVENT` loop will hand control
over to the outer loop.
The outer loop will call the `resume()`
method to return control back to Marpa.

The full code for this example is on-line. There is a lot more to Marpa, including more facilities for adding procedural logic to your Marpa parsers. To learn more about Marpa, a good first stop is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

**1.**
To handle procedural logic well,
an Earley engine needs to complete its Earley sets
in strict order --
that is, Earley set `N`
cannot change after work on Earley set `N+1`
has begun.
I have not looked at every Earley parse engine,
and some may have had this strict-sequencing property.
And many of the papers are agnostic about the order
of operations.
But Marpa is the first Earley parser to recognize
and exploit strict-sequencing as a feature.
↩

**2.**
The ABC grammar, in fact,
is not all that easy or natural to describe
even with a context-sensitive phrase structure description.
A solution is given on Wikipedia:
https://en.wikipedia.org/wiki/Context-sensitive_grammar#Examples.
↩

posted at: 20:02 | direct link to this entry

§
§
§

In a cordial Twitter exchange with Matt Might and and David Darais, Prof. Might asked if I was interested in looking at their derivatives-based approach. I answered that I was looking at it -- Marpa is an optimization of the Might/Darais approach.

This may sound strange. At first glance, our two algorithms seem about as different as parsing algorithms can get. My Marpa parser is an Earley parser, table-driven, and its parse engine is written in C language.[1]

The MDS (Might/Darais/Spiewak) parser is an extension of regular expressions which constructs states on the fly. MDS uses combinators and has implementations in several functional programming languages.[2]

Why then do I imagine that Marpa is an optimized version of the MDS
approach?
The reason is a paper sent to
me by Prof. Keshav Pingali at Austin: "Parsing with Pictures".[3]
The title is a little misleading:
their approach is not
**that**
easy,
and the paper requires a considerable amount of math.
But it is a lot easier than the traditional way
of learning the various approaches to parsing.

The basis of the Pingali-Bilardi approach is the Grammar Flow Graph (GFG). These GFGs are the "pictures" of their title. GFGs are NFAs with recursion added. As has long been known, adding recursion to NFAs allows them to represent any context-free language.

Pingali and Bilardi's next step is new. A GFG can be "simulated" using the same algorithm used to simulate an NFA. However, the result is not immediately impressive. The simulation does produce a recognizer, but not a good recognizer: Some of the strings recognized by the GFG simulator are not in the context-free language.[4]

To repair their "simulation", Pingali and Bilardi add a tag to each state. This tag tracks where the recursion began.[5]. This not only fixes the recognizer, but the added information is enough to allow the set of parse trees to be efficiently recovered from the sets of GFG states. In other words, with tags, the GFG recognizer now is a parser.

It turns out that this recognizer-turned-parser is not new.
In fact, it is
**exactly**
Earley's algorithm.

Pingali and Bilardi do not stop there. Using their new framework, they go on to show that all LL-based and LR-based algorithms are simplifications of their Earley parser. From this point of view, Earley parsing is the foundation of all context-free parsing, and LL- and LR-based algorithms are Earley optimizations.[6]

To show that Marpa is an optimization of the MDS approach, I will start with the MDS algorithm, and attempt to optimize it. For its functional programming language, the MDS paper uses Racket. The MDS parser is described directly, in the usual functional language manner, as a matching operation.

In the MDS paper, the MDS parser is optimized with laziness and memoization. Nulls are dealt with by computing their fixed points on the fly. Even with these three optimizations, the result is still highly inefficient. So, as an additional step, MDS also implements "deep recursive simplication" -- in effect, strategically replacing laziness with eagerness. With this the MDS paper conjectures that the algorithm's time is linear for a large class of practical grammars.

Next, we notice that the context-free grammars of the MDS algorithm are regular expressions extended to allow recursion. Essentially, they are GFG's translated into Racket match expressions. The equivalence is close enough that you could imagine the MDS paper using GFG's for its illustrations.

Unsurprisingly, then, the MDS and GFG approaches are similar in their first step. Each consumes a single character to produce a "partial parse". A partial parse, for both of these algorithms, can be represented as a duple. One element of the duple is a string representing the unconsumed input. The other is a representation of a set of parse trees. In the case of MDS, in keeping with its functional approach, the set of parse trees is represented directly. In the case of the GFG-simulator, the set of parse trees is compressed into a sequence of GFG-state-sets. There is one GFG-state-set for the start of parsing, and one for each consumed character.

At this point, with the introduction of the GFG state-sets to represent parse-trees, the MDS algorithm and its optimized GFG equivalent have "forked". Recall from above, that this GFG simulator has a bug -- it is over-liberal.

The bug is the one already described, and our fix is the one already described: Each GFG state either starts a recursion or is part of one. We fix the bug by tagging each GFG state with the index of the GFG state-set that starts its recursion. Once these tags are added, the GFG state-sets are exactly Earley sets.

Next we incorporate an optimization by Joop Leo, which makes Earley parsing linear for all LR-regular grammars, without using lookahead. Since LR-regular is a superset of LR(k) and LL(k), including LL(*), we do not bother with lookahead.

To get from an Earley/Leo parser to a Marpa parser, we need to address one more major point. In modern parsing practice, programmers expect the ability to introduce procedural logic, even to the point of switching parsers. By ensuring that processing each Earley set is complete before processing on the next Earley set begins, we accomplish this.

This means that Marpa has available full information about the parse so far -- it is left-eidetic. Error reporting is unexcelled, and procedural logic can use this information as well. For example, full parse information implies full knowledge of which tokens are expected next.

This means that you can write an liberal HTML parser, starting with a very illiberal HTML grammar. When the illiberal parser encounters a point where it cannot continue because of a missing token, procedural logic can ask what the expected token is; concoct that token on the fly; supply that token to the illiberal parser; and then ask the illiberal parser to continue. This is called the "Ruby Slippers" technique, and an HTML parser based on it has been implemented.[7]

As mentioned, the MDS algorithm has its own approach to optmization, one which takes maximum advantage of functional programming. In contrast, Marpa relies on C level coding.

One example of the contrast in optimization techniques is null handling. Recall from above that, to deal with null processing, the MDS algorithm uses a fixed-point algorithm on the fly. Marpa, on the other hand, before parsing begins. precomputes a list of nullable symbols using bit vectors. In this particular case, Marpa will usually be the winner.

A second case in which hand optimization wins is the all-important Leo optimization. Simon Peyton-Jones is a very smart man, but nonetheless I believe that the day that GHC will look at a functional specification of an Earley parser and rediscover the Leo optimization at compile time is far off.

On the other hand, I do not imagine that hand optimization and/or C language is the winner in all cases. A programmer is deceiving himself if he imagines that he can spot all the cases where lazy evaluation or memoization will be effective in the general case. And of course, even an omniscient programmer is not going to be there at run-time to do "just in time" optimization. Perhaps the optimal parser of the future will combine important hand optimizations with functional programming.

To learn about Marpa, my Earley/Leo-based parsing project, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

**1.**
Kegler, Jeffrey.
"Marpa, A Practical General Parser: The Recognizer.", 2013.
PDF accessed 24 April 2018.

Marpa has a stable implementation.
For it, and for more information on Marpa, there are these resources:
Marpa website, accessed 25 April 2018.
Kegler's website, accessed 25 April 2018.
Github repo, accessed 25 April 2018.
MetaCPAN, accessed 30 April 2018..
↩

**2.**
Matthew Might, David Darais and Daniel Spiewak. "Functional Pearl: Parsing with Derivatives." International Conference on Functional Programming 2011 (ICFP 2011). Tokyo, Japan. September, 2011. pages 189--195.
PDF accessed 9 Jun 2018.
Slides accessed 9 June 2018.
Video accessed 9 June 2018.
↩

**3.**
Keshav Pingali and Gianfranco Bilardi, UTCS tech report TR-2012.
2012.
PDF accessed 9 Junk 2018.
Video accessed 9 June 2018.

Less accessible,
but with more details about GFGs and GFG-based
Earley parsing is
Keshav Pingali and Gianfranco Bilardi,
"A graphical model for context-free grammar parsing."
Compiler Construction - 24th International Conference, CC 2015.
Lecture Notes in Computer Science,
Vol. 9031, pp. 3-27, Springer Verlag, 2015.
PDF accessed 9 June 2018.
↩

**4.**
Pingali and Bilardi 2012, section 3.1.
↩

**5.**
Pingali and Bilardi 2015, p. 11.
↩

**6.**
Pingali and Bilardi 2012, Sections 4-7.
↩

**7.**
I have based an
liberal HTML pretty-printer on that parser,
one which I use quite frequently.
I used it, for example, when writing this blog post.
To find out more about Ruby Slippers parsing see the Marpa FAQ,
questions 122
and
123;
my
blog series on parsing html; and
my blog post
"Marpa and the Ruby Slippers".
↩

posted at: 03:06 | direct link to this entry

§
§
§