Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest
The Ocean of Awareness blog: home page, chronological index, and annotated index.
A recent blog post by Michael Arntzenius ended with a friendly challenge to Marpa. Haskell list comprehensions are something that Haskell's own parser handles only with difficulty. A point of Michael's critique of Haskell's parsing was that Haskell's list comprehension could be even more powerful if not for these syntactic limits.Michael wondered aloud if Marpa could do better. It can.
The problem syntax occurs with the "guards", a very powerful facility of Haskell's list comprehension. Haskell allows several kinds of "guards". Two of these "guards" can have the same prefix, and these ambiguous prefixes can be of arbitrary length. In other words, parsing Haskell's list comprehension requires either lookahead of arbitrary length, or its equivalent.
To answer Michael's challenge, I extended my Haskell subset parser to deal with list comprehension. That parser, with its test examples, is online. I have run it for examples thousands of tokens long and, more to the point, have checked the Earley sets to ensure that Marpa will stay linear, no matter how long the ambiguous prefix gets.Earley parsing, which Marpa uses, accomplishes the seemingly impossible here. It does the equivalent of infinite lookahead efficiently, without actually doing any lookahead or backtracking. That Earley's algorithm can do this has been a settled fact in the literature for some time. But today Earley's algorithm is little known even among those well acquainted with parsing, and to many claiming the equivalent of infinite lookahead, without actually doing any lookahead at all, sounds like a boast of magical powers.
In the rest of this blog post, I hope to indicate how Earley parsing follows more than one potential parse at a time. I will not describe Earley's algorithm in full. But I will show that no magic is involved, and that in fact the basic ideas behind Earley's method are intuitive and reasonable.
List comprehension in Haskell is impressive. Haskell allows you to build a list using a series of "guards", which can be of several kinds. The parsing issue arises because two of the guard types -- generators and boolean expressions -- must be treated quite differently, but can look the same over an arbitrarily long prefix.
Here is one example of a Haskell generator, from the test case for this blog post:
list = [ x | [x, 1729, -- insert more here 99 ] <- xss ] 
This says to build a lists of x's such that the guard [x, 1729, 99 ] <- xss holds. The clue that this guard is a generator is the <- operator. The <- operator will appear in every generator, and means "draw from".
The LHS of the <- operator is a pattern and the RHS is an expression. This generator draws all the elements from xss which match the pattern [x, 1729, 99 ]. In other words, it draws out all the elements of xss, and tests that they are lists of length 3 whose last two subelements are 1729 and 99.
The variable x is set to the 1st subelement. list will be a list of all those x's. In the test suite, we have
xss = [ [ 42, 1729, 99 ] ] so that list becomes  -- a list of one element whose value is 42.
Generators can share very long prefixes with Boolean guards.
list2 = [ x | [x, 1729, 99] <- xss, [x, 1729, -- insert more here 99 ] == ys, [ 42, 1729, 99 ] <- xss ] 
The expression defining list2 has 3 comma-separated guards: The first guard is a generator, the same one as in the previous example. The last guard is also a generator.
The middle guard is of a new type: it is a Boolean: [x, 1729, 99 ] == ys. This guard insists that x be such that the triple [x, 1729, 99 ] is equal to ys.
In the test suite, we have
ys = [ 42, 1729, 99 ] so that list2 is also .
From the parser's point of view, Boolean guards and generators start out looking the same -- in the examples above, three of our guards start out the same -- with the string [x, 1729, 99 ], but
So Haskell list comprehension presents a problem for parsers. A parser must determine whether it is parsing an expression or a pattern, but it cannot know this for an arbitrarily long time. A parser must keep track of two possibilities at once -- something traditional parsing has refused to do. As I have pointed out, belief that traditional parsing "solves" the parsing problem is belief in human exceptionalism -- that human have calculating abilities that Turing machines do not. Keeping two possibilites in mind for a long time is trivial for human beings -- in one form we call it worrying, and try to prevent ourselves from doing it obsessively. But it has been the orthodoxy that practical parsing algorithms cannot do this.
Arntzenius has a nice summary of the attempts to parse this construct while only allowing one possibility at a time -- that is, determistically. Lookahead clearly cannot work -- it would have to be arbitrarily long. Backtracking can work, but can be very costly and is a major obstacle to quality error reporting.
GHC avoids the problems with backtracking by using post-processing. At parsing time, GHC treats an ambiguous guard as a Boolean. Then, if it turns out that is a generator, it rewrites it in post-processing. This inelegance incurs some real technical debt -- either a pattern must always be a valid expression, or even more trickery must be resorted to.
Earley parsing deals with this issue by doing what a human would do -- keeping both possibilities in mind at once. Jay Earley's innovation was to discover a way for a computer to track multiple possible parses that is compact, efficient to create, and efficient to read.
Earley's algorithm maintains an "Earley table"
which contains "Earley sets",
one for each token.
Each Earley set contains "Earley items".
Here are some Earley items from Earley set 25
in one of our test cases:
origin = 22; <atomic expression> ::= '[' <expression> '|' . <guards> ']' origin = 25; <guards> ::= . <guard<> origin = 25; <guards> ::= . <guards> ',' <guard<> origin = 25; <guard<> ::= . <pattern> '< <expression> origin = 25; <guard<> ::= . <expression> 
In the code, these represent the state of the parse just after the pipe symbol ("|") on line 4 of our test code.Each Earley item describes progress in one rule of the grammar. There is a dot (".") in each rule, which indicates how far the parse has progressed inside the rule. One of the rules has the dot just after the pipe symbol, as you would expect, since we have just seen a pipe symbol.
The other four rules have the dot at the beginning of the RHS. These four rules are "predictions" -- none of their symbols have been parsed yet, but we know that these rules might occur, starting at the location of this Earley set.
Each item also records an "origin": the location in the input where the rule described in the item began. For predictions the origin is always the same as the Earley set. For the first Earley item, the origin is 3 tokens earlier, in Earley set 22.
And now we have come to the secret of efficient non-deterministic parsing -- a "secret" which I hope to convince the reader is not magic, or even much of a mystery. Here, again, are two of the items from Earley set 25:
origin = 25; <guard<> ::= . <pattern> '< <expression> origin = 25; <guard<> ::= . <expression> 
At this point there are two possibilities going forward -- a generator guard or a Boolean expression guard. And there is an Earley item for each of these possibilities in the Earley set.
That is the basic idea -- that is all there is to it. Going forward in the parse, for as long as both possibilities stay live, Earley items for both will appear in the Earley sets.
From this point of view, it should now be clear why the Earley algorithm can keep track of several possibilities without lookahead or backtracking. No lookahead is needed because all possibilities are in the Earley set, and selection among them will take place as the rest of the input is read. And no backtracking is needed because every possibility was already recorded -- there is nothing new to be found by backtracking.
It may also be clearer why I claim that Marpa is left-eidetic, and how the Ruby Slippers work. Marpa has perfect knowledge of everything in the parse so far, because it is all in the Earley tables. And, given left-eidetic knowledge, Marpa also knows what terminals are expected at the current location, and can "wish" them into existence as necessary.
A permalink to the full code and a test suite for this prototype, as described in this blog post, is on Github. In particular, the permalink of the the test suite file for list comprehension is here. I expect to update this code, and the latest commit can be found here.
To learn more about Marpa, a good first stop is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.
1. If you are interested in my Marpa-driven Haskell subset parser, this blog post may be the best introduction. The code is on Github. ↩
The Earley sets for the ambigious prefix immediately reach a size
of 46 items, and then stay at that level.
This is experimental evidence that the Earley set
sizes stay constant.
And, if the Earley items are examined, and their derivations traced, it can be seen that they must repeat the same Earley item count for as long as the ambiguous prefix continues. The traces I examined are here, and the code which generated them is here, for the reader who wants to convince himself.
The guard prefixes of Haskell are ambiguous, but (modulo mistakes in the standards) the overall Haskell grammar is not. In the literature on Earley's, it has been shown that for an unambiguous grammar, each Earley item has an constant amortized cost in time. Therefore, if a parse produces a Earley sets that are all of less than a constant size, it must have linear time complexity. ↩
3. There are many descriptions of Earley's algorithm out there. The Wikipedia page on Earley's algorithm (accessed 27 August 2018) is one good place to start. I did another very simple introduction to Earley's in an earlier blog post, which may be worth looking at. Note that Marpa contains improvements to Earley's algorithm. Particularly, to fulfill Marpa's claim of linear time for all LR-regular grammars, Marpa uses Joop Leo's speed-up. But Joop's improvement is not necessary or useful for parsing Haskell list comprehension, is not used in this example, and will not be described in this post. ↩
4. Permalink to this code, accessed 27 August 2018. ↩
5. Permalink to this code, accessed 27 August 2018. ↩
6. Permalink to this code, accessed 27 August 2018. ↩
7. Permalink to this code, accessed 27 August 2018. ↩
8. Note that if the list is extended, the patterns matches and Boolean tests fail, so that 42 is no longer the answer. From the parsing point of view, this is immaterial. ↩
9. In several places, including this blog post. ↩
10. This account of the state of the art summarizes Arntzenius's recent post, which should be consulted for the details. ↩
11. Adapted from this trace output, accessed 27 August 2018. ↩
12. Adapted from this trace output, accessed 27 August 2018. ↩
13. For more on the Ruby Slippers see my just previous blog post, ↩
posted at: 07:30 | direct link to this entry