**Jeffrey Kegler's blog**
about Marpa, his new parsing algorithm,
and other topics of interest

The Ocean of Awareness blog: home page, chronological index, and annotated index.

A recent blog post by Michael Arntzenius ended with a friendly challenge to Marpa. Haskell list comprehensions are something that Haskell's own parser handles only with difficulty. A point of Michael's critique of Haskell's parsing was that Haskell's list comprehension could be even more powerful if not for these syntactic limits.

Michael wondered aloud if Marpa could do better. It can.

The problem syntax occurs with the "guards", a very powerful facility of Haskell's list comprehension. Haskell allows several kinds of "guards". Two of these "guards" can have the same prefix, and these ambiguous prefixes can be of arbitrary length. In other words, parsing Haskell's list comprehension requires either lookahead of arbitrary length, or its equivalent.

To answer Michael's challenge, I extended my Haskell subset parser to deal with list comprehension. That parser, with its test examples, is online.[1] I have run it for examples thousands of tokens long and, more to the point, have checked the Earley sets to ensure that Marpa will stay linear, no matter how long the ambiguous prefix gets.[2]

Earley parsing, which Marpa uses, accomplishes the seemingly impossible here. It does the equivalent of infinite lookahead efficiently, without actually doing any lookahead or backtracking. That Earley's algorithm can do this has been a settled fact in the literature for some time. But today Earley's algorithm is little known, even among those well acquainted with parsing, and to many, claiming the equivalent of infinite lookahead without actually doing any lookahead at all sounds like a boast of magical powers.

In the rest of this blog post, I hope to indicate how Earley parsing follows more than one potential parse at a time. I will not describe Earley's algorithm in full.[3] But I will show that no magic is involved, and that in fact the basic ideas behind Earley's method are intuitive and reasonable.

List comprehension in Haskell is impressive. Haskell allows you to build a list using a series of "guards", which can be of several kinds. The parsing issue arises because two of the guard types -- generators and boolean expressions -- must be treated quite differently, but can look the same over an arbitrarily long prefix.

Here is one example of a Haskell generator, from the test case for this blog post:

```haskell
list = [ x |
  [x, 1729, -- insert more here
    99 ] <- xss ]
```
[4]

This says to build a list of `x`'s
such that the guard
`[x, 1729, 99 ] <- xss`
holds.
The clue that this guard is a generator is the
`<-` operator.
The `<-` operator
will appear in every generator,
and means "draw from".

The LHS of the `<-` operator is a pattern
and the RHS is an expression.
This generator draws all the elements from `xss`
which match the pattern `[x, 1729, 99 ]`.
In other words, it draws out
all the elements of `xss`,
and tests that they
are lists of length 3
whose last two subelements are 1729 and 99.

The variable `x` is set to the 1st subelement.
`list` will be a list of all those `x`'s.
In the test suite, we have

```haskell
xss = [ [ 42, 1729, 99 ] ]
```

so that `list` becomes `[42]`.[5]
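The generator example can be tried as a standalone sketch. The type signatures are mine; `xss` and `list` are as in the test suite:

```haskell
-- Generator guard: draw each element of xss, keep only those
-- matching the pattern [x, 1729, 99], and bind x to the first
-- subelement of each match.
xss :: [[Int]]
xss = [ [ 42, 1729, 99 ] ]

list :: [Int]
list = [ x | [x, 1729, 99] <- xss ]

-- ghci> list
-- [42]
```

Elements of `xss` that fail the pattern match are silently skipped, which is exactly the filtering behavior a generator guard provides.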

Generators can share very long prefixes with Boolean guards.

```haskell
list2 = [ x |
  [x, 1729, 99] <- xss,
  [x, 1729, -- insert more here
    99 ] == ys,
  [ 42, 1729, 99 ] <- xss ]
```
[6]

The expression defining `list2`
has three comma-separated guards.
The first guard is a generator,
the same one as in the previous example.
The last guard is also a generator.

The middle guard is of a new type: it is a Boolean:
`[x, 1729, 99 ] == ys`.
This guard insists that `x` be such that the triple
`[x, 1729, 99 ]` is equal to `ys`.

In the test suite, we have

```haskell
ys = [ 42, 1729, 99 ]
```

so that the middle guard holds exactly when `x` is 42.[7]
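Putting the three guards together gives a runnable sketch of the `list2` example (the type signatures are mine; the values are from the test suite):

```haskell
xss :: [[Int]]
xss = [ [ 42, 1729, 99 ] ]

ys :: [Int]
ys = [ 42, 1729, 99 ]

-- Three guards: a generator, then a Boolean, then another generator.
list2 :: [Int]
list2 = [ x | [x, 1729, 99] <- xss
            , [x, 1729, 99] == ys
            , [ 42, 1729, 99 ] <- xss ]

-- ghci> list2
-- [42]
```

The first generator binds `x` to 42; the Boolean guard then compares `[42, 1729, 99]` against `ys` and succeeds; the final generator matches the literal pattern against the one element of `xss`.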

From the parser's point of view, Boolean guards
and generators start out looking the same --
in the examples above, three of our guards start out
the same -- with the string `[x, 1729, 99 ]`,
but

- in one case (the Boolean guard), `[x, 1729, 99 ]` is the beginning of an expression; and
- in the other two cases (the generators), `[x, 1729, 99 ]` is a pattern.

So Haskell list comprehension presents a problem for parsers. A parser must determine whether it is parsing an expression or a pattern, but it cannot know this for an arbitrarily long time. A parser must keep track of two possibilities at once -- something traditional parsing has refused to do. As I have pointed out[9], belief that traditional parsing "solves" the parsing problem is belief in human exceptionalism -- a belief that humans have calculating abilities that Turing machines do not. Keeping two possibilities in mind for a long time is trivial for human beings -- in one form we call it worrying, and try to prevent ourselves from doing it obsessively. But it has been the orthodoxy that practical parsing algorithms cannot do this.

Arntzenius has a nice summary of the attempts to parse this construct while only allowing one possibility at a time -- that is, deterministically. Lookahead clearly cannot work -- it would have to be arbitrarily long. Backtracking can work, but can be very costly and is a major obstacle to quality error reporting.

GHC avoids the problems with backtracking by using post-processing.
At parsing time, GHC treats an ambiguous guard as a
Boolean.
Then, if it turns out that it is a generator,
it rewrites it in post-processing.
This inelegance incurs some real technical debt --
either a pattern must **always** be a valid expression,
or even more trickery must be resorted to.[10]
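As a rough illustration of the kind of rewriting involved (the miniature types and the `exprToPat` function below are hypothetical, invented for this sketch, and not GHC's actual code), the parser commits to an expression and later tries to reinterpret it as a pattern:

```haskell
-- Hypothetical miniature ASTs; GHC's real types are far richer.
data Expr = EVar String | ENum Integer | EList [Expr]
  deriving (Eq, Show)

data Pat = PVar String | PNum Integer | PList [Pat]
  deriving (Eq, Show)

-- Post-processing step: reinterpret a parsed expression as a pattern.
-- This only works if every expression form that can occur in a guard
-- prefix is also a valid pattern form -- the technical debt noted
-- above.  Anything without a pattern equivalent yields Nothing.
exprToPat :: Expr -> Maybe Pat
exprToPat (EVar v)   = Just (PVar v)
exprToPat (ENum n)   = Just (PNum n)
exprToPat (EList es) = PList <$> traverse exprToPat es
```

For example, `exprToPat (EList [EVar "x", ENum 1729, ENum 99])` reinterprets the ambiguous prefix as the pattern `PList [PVar "x", PNum 1729, PNum 99]`.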

Earley parsing deals with this issue by doing what a human would do -- keeping both possibilities in mind at once. Jay Earley's innovation was to discover a way for a computer to track multiple possible parses that is compact, efficient to create, and efficient to read.

Earley's algorithm maintains an "Earley table"
which contains "Earley sets",
one for each token.
Each Earley set contains "Earley items".
Here are some Earley items from Earley set 25
in one of our test cases:

```
origin = 22; <atomic expression> ::= '[' <expression> '|' . <guards> ']'
origin = 25; <guards> ::= . <guard>
origin = 25; <guards> ::= . <guards> ',' <guard>
origin = 25; <guard> ::= . <pattern> '<-' <expression>
origin = 25; <guard> ::= . <expression>
```
[11]

In the code, these represent the state of the parse just after
the pipe symbol ("`|`") on line 4 of our test code.

In each Earley item, the dot ("`.`") shows how much of its rule has been recognized so far. In the first item above, the dot is just after the pipe symbol: the opening bracket, the expression, and the pipe have all been parsed, and `<guards>` is expected next. The other four rules have the dot at the beginning of the RHS. These four rules are "predictions" -- none of their symbols have been parsed yet, but we know that these rules might occur, starting at the location of this Earley set.

Each item also records an "origin": the location in the input where the rule described in the item began. For predictions the origin is always the same as the Earley set. For the first Earley item, the origin is 3 tokens earlier, in Earley set 22.

And now we have come to the secret of efficient non-deterministic parsing -- a "secret" which I hope to convince the reader is not magic, or even much of a mystery. Here, again, are two of the items from Earley set 25:

```
origin = 25; <guard> ::= . <pattern> '<-' <expression>
origin = 25; <guard> ::= . <expression>
```
[12]

At this point there are two possibilities going forward -- a generator guard or a Boolean expression guard. And there is an Earley item for each of these possibilities in the Earley set.

That is the basic idea -- that is all there is to it. Going forward in the parse, for as long as both possibilities stay live, Earley items for both will appear in the Earley sets.
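The idea can be sketched in a few lines of Haskell. This is a toy model with names of my own invention, not Marpa's internals: prediction adds one dot-at-start item per rule for the expected symbol, so both guard possibilities enter the Earley set together.

```haskell
-- A toy model of an Earley item: a rule, a dot position, and an origin.
data Rule = Rule { lhs :: String, rhs :: [String] }
  deriving (Eq, Show)

data Item = Item { rule :: Rule, dot :: Int, origin :: Int }
  deriving (Eq, Show)

-- The two <guard> rules from the Earley items shown above.
guardRules :: [Rule]
guardRules =
  [ Rule "<guard>" ["<pattern>", "'<-'", "<expression>"]  -- generator
  , Rule "<guard>" ["<expression>"]                       -- Boolean
  ]

-- Prediction: when symbol `sym` is expected at Earley set `here`,
-- add a dot-at-start item for every rule with that LHS.
predict :: [Rule] -> String -> Int -> [Item]
predict grammar sym here =
  [ Item r 0 here | r <- grammar, lhs r == sym ]

-- ghci> map (rhs . rule) (predict guardRules "<guard>" 25)
-- [["<pattern>","'<-'","<expression>"],["<expression>"]]
```

With `<guard>` expected at set 25, `predict` produces exactly the two prediction items listed above, one for each live possibility.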

From this point of view, it should now be clear why the Earley algorithm can keep track of several possibilities without lookahead or backtracking. No lookahead is needed because all possibilities are in the Earley set, and selection among them will take place as the rest of the input is read. And no backtracking is needed because every possibility was already recorded -- there is nothing new to be found by backtracking.

It may also be clearer why I claim that Marpa is left-eidetic, and how the Ruby Slippers work.[13] Marpa has perfect knowledge of everything in the parse so far, because it is all in the Earley tables. And, given left-eidetic knowledge, Marpa also knows what terminals are expected at the current location, and can "wish" them into existence as necessary.

A permalink to the full code and a test suite for this prototype, as described in this blog post, is on Github. In particular, the permalink of the test suite file for list comprehension is here. I expect to update this code, and the latest commit can be found here.

To learn more about Marpa, a good first stop is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

**1.**
If you are interested in my Marpa-driven Haskell subset parser,
this blog post
may be the best introduction.
The code is
on Github.
↩

**2.**
The Earley sets for the ambiguous prefix immediately reach a size
of 46 items, and then stay at that level.
This is experimental evidence that the Earley set
sizes stay constant.

And, if the Earley items are examined,
and their derivations traced,
it can be seen that
they must repeat the same Earley item count
for as long as the ambiguous prefix continues.
The traces I examined are
here,
and the code which generated them is
here,
for the
reader who wants to convince himself.

The guard prefixes of Haskell are ambiguous,
but (modulo mistakes in the standards)
the overall Haskell grammar is not.
In the literature on Earley's,
it has been shown that for an unambiguous grammar,
each Earley item has a constant amortized cost in time.
Therefore,
if a parse produces
Earley sets that are all below a constant size,
it must have linear time complexity.
↩

**3.**
There are many descriptions of Earley's algorithm out there.
The
Wikipedia page on Earley's algorithm
(accessed 27 August 2018)
is one good place to start.
I did
another very simple introduction to Earley's in
an
earlier blog post,
which may be worth looking at.
Note that Marpa contains
improvements to Earley's algorithm.
Particularly, to fulfill Marpa's claim of linear time for all
LR-regular grammars, Marpa uses Joop Leo's speed-up.
But Joop's improvement is **not** necessary or useful
for parsing
Haskell list comprehension,
is not used in this example,
and will not be described in this post.
↩

**4.**
Permalink to this code,
accessed 27 August 2018.
↩

**5.**
Permalink to this code,
accessed 27 August 2018.
↩

**6.**
Permalink to this code,
accessed 27 August 2018.
↩

**7.**
Permalink to this code,
accessed 27 August 2018.
↩

**8.**
Note that if the list is extended, the pattern matches and the Boolean
tests fail, so that 42 is no longer the answer.
From the parsing point of view, this is immaterial.
↩

**9.**
In several places, including
this blog post.
↩

**10.**
This account of the state of the art summarizes
Arntzenius's recent post,
which should be consulted for the details.
↩

**11.**
Adapted from
this trace output,
accessed 27 August 2018.
↩

**12.**
Adapted from
this trace output,
accessed 27 August 2018.
↩

**13.**
For more on the Ruby Slippers see
my
immediately previous blog post.
↩

posted at: 07:30 | direct link to this entry
