Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest
The Ocean of Awareness blog: home page, chronological index, and annotated index.
In a previous post, I outlined a method for using the Marpa algorithm as the basis for better combinator parsing. This post follows up with a trial implementation.
For this trial, I choose the most complex example from the classic 1996 tutorial on combinator parsing by Hutton and Meijer[1]. Their example implements the offside-rule parsing of a functional language -- parsing where whitespace is part of the syntax.[2] The Hutton and Meijer example is for Gofer, a now obsolete implementation of Haskell. To make the example more relevant, I wrote a parser for Haskell layout according to the Haskell 2010 Language Report[3].
For tests, I used the five examples (2 long, 3 short) provided in the 2010 Report[4], and the four examples given in the "Gentle Introduction" to Haskell[5]. I implemented only enough of the Haskell syntax to run these examples, but this was enough to include a substantial subset of Haskell's syntax.
This description of the implementation includes many extracts from the code. For those looking for more detail, the full code and test suite for this example are on Github. While the comments in the code do not amount to a tutorial, they are extensive. Readers who like to "peek ahead" are encouraged to do so.
It won't be necessary to know Haskell to follow this post. This section will describe Haskell's layout informally. Briefly, these two code snippets should have the same effect:
let y = a*b f x = (x+y)/y in f c + f d [6]
let { y = a*b ; f x = (x+y)/y } [7]
In my test suite, both code snippets produce the same AST. The first code display uses Haskell's implicit layout parsing, and the second code display uses explicit layout. In each, the "let" is followed by a block of declarations (symbol <decls>). Each decls contains one or more declarations (symbol <decl>). For the purposes of determining layout, Haskell regards <decls> as a "block", and each <decl> as a block "item". In both displays, there are two items in the block. The first item is y = a*b, and the second <decl> item is f x = (x+y)/y.
In explicit layout, curly braces surround the block, and semicolons separate each item. Implicit layout follows the "offside rule": The first element of the laid out block determines the "block indent". The first non-whitespace character of every subsequent non-empty line determines the line indent. The line indent is compared to the block indent.
Explicit semicolons can be used in implicit layout: If a semicolon occurs in implicit layout, it separates block items. In our test suite, the example
let y = a*b; z = a/b f x = (x+y)/z in f c + f d [8]contains three <decl> items.
The examples in the displays above are simple. The two long examples from the 2010 Report are more complicated: 6 blocks of 4 different kinds, with nesting twice reaching a depth of 4. The two long examples in the 2010 Report are the same, except that one uses implicit layout and the other uses explicit layout. In the test of my Haskell subset parser, both examples produce identical ASTs.
There are additional rules, including for tabs, Unicode characters and multi-line comments. These rules are not relevant in the examples I took from the Haskell literature; they present no theoretical challenge to this parsing method; and they are not implemented in this prototype Haskell parser.
To tackle Haskell layout parsing, I use a separate combinator for each layout block. Every combinator, therefore, has its own block and item symbols, and its own block indent; and each combinator implements exactly one method of layout -- explicit or implicit.
From the point of view of its parent combinator, a child combinator is a lexeme, and the parse tree it produces is the value of the lexeme. Marpa can automatically produce an AST, and it adds lexeme values to the AST as leaves. The effect is that Marpa automatically assembles a parse tree for us from the tree segments returned by the combinators.
In outlining this algorithm, I will start by explaining where the "missing" semicolons come from in the implicit layout. Marpa allows various kinds of "events", including on discarded tokens. ("Discards" are tokens thrown away, and not used in the parse. The typical use of token discarding in Marpa is for the handling of whitespace and comments.)
The following code sets an event named 'indent', which happens when Marpa finds a newline followed by zero or more whitespace characters.[9] This does not capture the indent of the first line of a file, but that is not an issue -- the 2010 Report requires that the first indent be treated as a special case anyway.:discard ~ indent event => indent=off indent ~ newline whitechars [10]
Indent events, like others, occur in the main read loop of each combinator. Outdents and EOFs are dealt with by terminating the read loop.[11] Line indents deeper than the current block indent are dealt with by resuming the read loop. [12] Line indents equal to the block indent trigger the reading of a Ruby Slippers semicolon as shown in the following:
$recce->lexeme_read( 'ruby_semicolon', $indent_start, $indent_length, ';' ) [13]
In Marpa, a "Ruby Slippers" symbol is one which does not actually occur in the input. Ruby Slippers parsing is new with Marpa, and made possible because Marpa is left-eidetic. By left-eidetic, I mean that Marpa knows, in full detail, about the parse to the left of its current position, and can provide that information to the parsing app. This implies that Marpa also knows which tokens are acceptable to the parser at the current location, and which are not.
Ruby Slippers parsing enables a very important trick which is useful in "liberal" parsing -- parsing where certain elements might be in some sense "missing". With the Ruby Slippers you can design a "liberal" parser with a "fascist" grammar. This is, in fact, how the Haskell 2010 Report's context-free grammar is designed -- the official syntax requires explicit layout, but Haskell programmers are encouraged to omit most of the explicit layout symbols, and Haskell implementations are required to "dummy up" those symbols in some way. Marpa's method for doing this is left-eideticism and Ruby Slippers parsing.The term "Ruby Slippers" refers to a widely-known scene in the "Wizard of Oz" movie. Dorothy is in the fantasy world of Oz, desperate to return to Kansas. But, particularly after a shocking incident in which orthodox Oz wizardry is exposed as an affable fakery, she is completely at a loss as to how to escape. The "good witch" Glenda appears and tells Dorothy that in fact she's always had what she's been wishing for. The Ruby Slippers, which she had been wearing all through the movie, can return her to Kansas. All Dorothy needs to do is wish.
In Ruby Slippers parsing, the "fascist" grammar "wishes" for lots of things that may not be in the actual input. Procedural logic here plays the part of a "good witch" -- it tells the "fascist" grammar that what it wants has been there all along, and supplies it. To do this, the procedural logic has to have a reliable way of knowing what the parser wants. Marpa's left-eideticism provides this.
This brings us to a question I've postponed -- how do we know which combinator to call when? The answer is Ruby Slippers parsing. First, here are some lexer rules for "unicorn" symbols. We use unicorns when symbols need to appear in Marpa's lexer, but must never be found in actual input.
:lexeme ~ L0_unicorn L0_unicorn ~ unicorn unicorn ~ [^\d\D] ruby_i_decls ~ unicorn ruby_x_decls ~ unicorn [14]
<unicorn> is defined to match [^\d\D]. This pattern is all the symbols which are not digits and not non-digits -- in other words, it's impossible that this pattern will ever match any character. The rest of the statements declare other unicorn lexemes that we will need. <unicorn> and <L0_unicorn> are separate, because we need to use <unicorn> on the RHS of some lexer rules, and a Marpa lexeme can never occur on the RHS of a lexer rule.[15]
In the above Marpa rule,
laidout_decls ::= ('{') ruby_x_decls ('}') | ruby_i_decls | L0_unicorn decls L0_unicorn [16]
It is the expectation of a <laidout_decls> symbol that causes child combinators to be invoked. Because <L0_unicorn> will never be found in the input, the <decls> alternative will never match -- it is there for documentation and debugging reasons.[17] Therefore Marpa, when it wants a <laidout_decls>, will look for a <ruby_x_decls> if a open curly brace is read; and a <ruby_i_decls> otherwise. Neither <ruby_x_decls> or <ruby_i_decls> will ever be found in the input, and Marpa will reject the input, causing a "rejected" event.
In this code, as often, the "good witch" of Ruby Slippers does her work through "rejected" events. These events can be set up to happen when, at some parse location, none of the tokens that Marpa's internal lexer finds are acceptable.
In the "rejected" event handler, we can use Marpa's left eideticism to find out what lexemes Marpa would consider acceptable. Specifically, there is a terminals_expected() method which returns a list of the symbols acceptable at the current location.
my @expected = grep { /^ruby_/xms; } @{ $recce->terminals_expected() }; [18]
Once we "grep" out all but the symbols with the "ruby_" prefix, there are only 4 non-overlapping possibilities:
If Marpa does not expect any of the Ruby Slippers lexemes, there was a syntax error in the Haskell code.[19]
If a <ruby_i_decls> or a <ruby_x_decls> lexeme is expected, a child combinator is invoked. The Ruby Slippers symbol determines whether the child combinator looks for implicit or explicit layout. In the case of implicit layout, the location of the rejection determines the block indent.[20]
If a <ruby_semicolon> is expected, then the parser is at the point where a new block item could start, but none was found. Whether the block was implicit or explicit, this indicates we have reached the end of the block, and should return control to the parent combinator.[21]
To explain why <ruby_semicolon> indicates end-of-block, we look at both cases. In the case of an explicit layout combinator, the rejection should have been caused by a closing curly brace, and we return to the parent combinator and retry it. In the parent combinator, the closing curly brace will be acceptable.
If we experience a "rejected" event while expecting a <ruby_semicolon> in an implicit layout combinator, it means we did not find an explicit semicolon; and we also never found the right indent for creating a Ruby semicolon. In other words, the indentation is telling us that we are at the end of the block. We therefore return control to the parent combinator.
With this, we've covered the major points of this Haskell prototype parser. It produces an AST whose structure and node names are those of the 2010 Report. (The Marpa grammar introduces non-standard node names and rules, but these are pruned from the AST in post-processing.)
In the code, the grammars from the 2010 Report are included for comparison, so a reader can easily determine what syntax we left out. It might be tedious to add the rest, but I believe it would be unproblematic, with one interesting exception: fixity. To deal with fixity, we may haul out the Ruby Slippers again.
A permalink to the full code and a test suite for this prototype, as described in this blog post, is on Github. I expect to update this code, and the latest commit can be found here. Links for specific lines of code in this post are usually static permalinks to earlier commits.
To learn more about Marpa, a good first stop is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.
1. Graham Hutton and Erik Meijer, Monadic parser combinators, Technical Report NOTTCS-TR-96-4. Department of Computer Science, University of Nottingham, 1996, pp 30-35. http://eprints.nottingham.ac.uk/237/1/monparsing.pdf. Accessed 19 August 2018. ↩
2. I use whitespace-significant parsing as a convenient example for this post, for historical reasons and for reasons of level of complexity. This should not be taken to indicate that I recommend it as a language feature. ↩
3. Simon Marlow, Haskell 2010 Language Report, 2010. Online version accessed 21 August 2018. For layout, see in particular section 2.7 (pp. 12-14) and section 10.3 (pp. 131-134). ↩
4. 2010 Report. The short examples are on p. 13 and p. 134. The long examples are on p. 14. ↩
5. Paul Hudak, John Peterson and Joseph Fasel Gentle Introduction To Haskell, version 98. Revised June, 2000 by Reuben Thomas. Online version accessed 21 August 2018. The examples are in section 4.6, which is on pp. 20-21 of the October 1999 PDF. ↩
9. Single-line comments are dealt with properly by lexing them as a different token and discarding them separately. Handling multi-line comments is not yet implemented -- it is easy in principle but tedious in practice and the examples drawn from the Haskell literature did not provide any test cases. ↩
10. Github Permalink. ↩
11. Github Permalink. ↩
12. Github Permalink. ↩
13. Github Permalink. ↩
14. Github Permalink. ↩
15. The reason for this is that by default a Marpa grammar determines which of its symbols are lexemes using the presence of those symbol on the LHS and RHS of the rules in its lexical and context-free grammars. A typical Marpa grammar requires a minimum of explicit lexeme declarations. (Lexeme declarations are statements with the :lexeme pseudo-symbol on their LHS.) As an aside, the Haskell 2010 Report is not always careful about the lexer/context-free boundary, and adopting its grammar required more use of Marpa's explicit lexeme declarations than usual. ↩
16. Github Permalink. ↩
17. Specifically, the presense of a <decls> alternative silences the usual warnings about symbols inaccessible from the start symbol. These warnings can be silenced in other ways, but at the prototype stage it is convenient to check that all symbols supposed to be accessible through <decls> are in fact accessible. There is a small startup cost to allowing the extra symbols in the grammars, but the runtime cost is probably not measureable. ↩
18. Github Permalink. ↩
19. Currently the handling of these is simplistic. A practical implementation of this method would want better reporting. In fact, Marpa's left eideticism allows some interesting things to be done in this respect. ↩
20. Github Permalink. ↩
21. Github Permalink. ↩
posted at: 10:24 | direct link to this entry