Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website

Google+

Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Sat, 01 Jun 2019


Infinite Lookahead and Ruby Slippers

About this post

This post presents a practical, compact example which demonstrates a use case for both infinite lookahead and Ruby Slippers parsing. While the example itself is very simple, this post may not be a good first tutorial -- it focuses on Marpa implementation strategy, instead of basics.

About Urbit

The example described in this post is one part of hoonlint. hoonlint, currently under development, will be a "lint" program for a language called Hoon.

Hoon is part of the Urbit project. Urbit is an effort to return control of the Internet experience to the individual user. (The Urbit community has, generously, been supporting my work on Hoon.)

The original Internet and its predecessors were cosy places. Users controlled their experience. Authority was so light you could forget it was there, but so adequate to its task that you could forget why it was necessary. What we old timers do remember of the early Internet was the feeling of entering into a "brave new world".

The Internet grew beyond our imaginings, and our pure wonder of decades ago now seems ridiculous. But the price has been a shift of power which should be no laughing matter. Control of our Internet experience now resides in servers, run by entities which make no secret of having their own interests. Less overt, but increasingly obvious, is the single-mindedness with which they pursue those interests.

And the stakes have risen. In the early days, we used the Internet as a supplement in our intellectual lives. Today, we depend on it in our financial and social lives. Today, the server-sphere can be a hostile place. Going forward it may well become a theater of war.

We could try to solve this problem by running our own servers. But this is a lot of work, and only leaves us in touch with those willing and able to do that. In practice, this seems to be nobody.

Urbit seeks to solve these problems with hassle-free personal servers, called urbits. Urbits are journaling databases, so they are incorruptable. To make sure they can be run anywhere in the cloud[1], they are based on a tiny virtual machine, called Nock. To keep urbits compact and secure, Urbit takes on code bloat directly -- Urbit is an original design from a clean slate, with a new protocol stack.

About Hoon

Nock's "machine language" takes the form of trees of arbitrary precision integers. The integers can be interpreted as strings, floats, etc., as desired. And the trees can be interpreted as lists, giving Nock a resemblance to a LISP VM. Nock does its own memory management and takes care of its own garbage collection.[2]

Traditionally, there are two ways to enter machine language,

Like traditional machine language, Nock cannot be written directly. Hoon is Urbit's equivalent of C -- it is Urbit's "close to the metal" higher level language.

Not that Hoon looks much like C, or anything else you've ever seen. This is a Hoon program that takes an integer argument, call it n, and returns the first n counting numbers:


    |=  end=@                                               ::  1
    =/  count=@  1                                          ::  2
    |-                                                      ::  3
    ^-  (list @)                                            ::  4
    ?:  =(end count)                                        ::  5
      ~                                                     ::  6
    :-  count                                               ::  7
    $(count (add 1 count))                                  ::  8
    

Hoon comments begin with a "::" and run until the next newline. The above Hoon sample uses comments to show line numbers.

The example for this post will be a hoonlint subset: a multi-line comment linter. Multi-line comments are the only Hoon syntax we will talk about. (For those who want to know more about Hoon, there is a tutorial.)

About Hoon comments

In basic Hoon syntax, multi-line comments are free-form. In practice, Hoon authors tend to follow a set of conventions.

Pre-comments

In the simplest case, a comment must precede the code it describes, and be at the same indent. These simple cases are called "pre-comments".[3] For example, this code contains a pre-comment:


	  :: pre-comment 1
	  [20 (mug bod)]
    

Inter-comments

Hoon multi-line comments may also contain "inter-comments". The inter-comments are aligned depending on the syntax. In the display below, the inter-comments are aligned with the "rune" of the enclosing sequence. A "rune" is Hoon's rough equivalent of a "keyword". Runes are always digraphs of special ASCII characters. The rune in the following code is :~, and the sequence it introduces includes pre-comments, inter-comments and meta-comments.


      :~  [3 7]
      ::
	  :: pre-comment 1
	  [20 (mug bod)]
      ::
	  :: pre-comment 2
	  [2 yax]
      ::
	  :: pre-comment 3
	  [2 qax]
    ::::
    ::    :: pre-comment 4
    ::    [4 qax]
      ::
	  :: pre-comment 5
	  [5 tay]
      ==
    

When inter-comments are empty, as they are in the above, they are called "breathing comments", because they serve to separate, or allow some "air" between, elements of a sequence. For clarity, the pre-comments in the above are further indicated: all and only pre-comments contain the text "pre-comment".

Meta-comments

The above code also contains a third kind of comment -- meta-comments. Meta-comments must occur at the far left margin -- at column 1. These are called meta-comments, because they are allowed to be outside the syntax structure. One common use for meta-comments is "commenting out" other syntax. In the above display, the meta-comments "comment out" the comment labeled "pre-comment 4" and its associated code.

Staircase comments

Finally, there are "staircase comments", which are used to indicate the larger structure of Hoon sequences and other code. For example,


    ::                                                      ::
    ::::  3e: AES encryption  (XX removed)                  ::
      ::                                                    ::
      ::
    ::                                                      ::
    ::::  3f: scrambling                                    ::
      ::                                                    ::
      ::    ob                                              ::
      ::
     

Each staircase consists of three parts. In lexical order, these parts are an upper riser, a tread, and a lower riser. The upper riser is a sequence of comments at the same alignment as an inter-comment. The tread is also at the inter-comment alignment, but must be 4 colons ("::::") followed by whitespace. The lower riser is a sequence of comments indented two spaces more than the tread.

Hoon comment conventions

Hoon's basic syntax allows comments to be free-form. In practice, there are strict conventions for these comments, conventions we would like to enforce with hoonlint.

  1. A multi-line comment may contain an "inter-part", a "pre-part", or both.
  2. If both an inter-part and a pre-part are present, the inter-part must precede the pre-part.
  3. The inter-part is a non-empty sequence of inter-comments and staircases.
  4. A pre-part is a non-empty sequence of pre-comments.
  5. Meta-comments may be inserted anywhere in either the pre-part or the inter-part.
  6. Comments which do not obey the above rules are bad comments. A good comment is any comment which is not a bad comment.
  7. A comment is not regarded as a meta-comment if it can be parsed as structural comment. An structural comment is any good comment which is not a meta-comment.

Grammar

We will implement these conventions using the BNF of this section. The sections to follow outline the strategy behind the BNF.


    :start ::= gapComments
    gapComments ::= OptExceptions Body
    gapComments ::= OptExceptions
    Body ::= InterPart PrePart
    Body ::= InterPart
    Body ::= PrePart
    InterPart ::= InterComponent
    InterPart ::= InterruptedInterComponents
    InterPart ::= InterruptedInterComponents InterComponent

    InterruptedInterComponents ::= InterruptedInterComponent+
    InterruptedInterComponent ::= InterComponent Exceptions
    InterComponent ::= Staircases
    InterComponent ::= Staircases InterComments
    InterComponent ::= InterComments

    InterComments ::= InterComment+

    Staircases ::= Staircase+
    Staircase ::= UpperRisers Tread LowerRisers
    UpperRisers ::= UpperRiser+
    LowerRisers ::= LowerRiser+

    PrePart ::= ProperPreComponent OptPreComponents
    ProperPreComponent ::= PreComment
    OptPreComponents ::= PreComponent*
    PreComponent ::= ProperPreComponent
    PreComponent ::= Exception

    OptExceptions ::= Exception*
    Exceptions ::= Exception+
    Exception ::= MetaComment
    Exception ::= BadComment
    Exception ::= BlankLine
    

Technique: Combinator

Our comment linter is implemented as a combinator. The main hoonlint parser invokes this combinator when it encounters a multi-line comment. Because of the main parser, we do not have to worry about confusing comments with Hoon's various string and in-line text syntaxes.

Note that while combinator parsing is useful, it is a technique that can be oversold. Combinators have been much talked about in the functional programming literature[4], but the current flagship functional programming language compiler, the Glasgow Haskell Compiler, does not use combinators to parse its version of the Haskell -- instead it uses a parser in the yacc lineage.[5] As a parsing technique on its own, the use of combinators is simply another way of packaging recursive descent with backtracking, and the two techniques share the same power, the same performance, and the same downsides.

Marpa is much more powerful than either LALR (yacc-lineage) parsers or combinators, so we can save combinator parsing for those cases where combinator parsing really is helpful. One such case is lexer mismatch.

Lexer mismatch

The first programming languages, like BASIC and FORTRAN, were line-structured -- designed to be parsed line-by-line.[6] After ALGOL, new languages were usually block-structured. Blocks can start or end in the middle of a line, and can span multiple lines. And blocks are often nested.

A line-structured language requires its lexer to think in terms of lines, but this approach is completely useless for a block-structured language. Combining both line-structured and block-structured logic in the same lexer usually turns the lexer's code into a rat's nest.

Calling a combinator every time a line-structured block is encountered eliminates the problem. The main lexer can assume that the code is block-structured, and all the line-by-line logic can go into combinators.

Technique: Non-determinism

Our grammar is non-deterministic, but unambiguous. It is unambiguous because, for every input, it will produce no more than one parse.

It is non-deterministic because there is a case where it tracks two possible parses at once. The comment linter cannot immediately distinguish between a prefix of the upper riser of a staircase, and a prefix of a sequence of inter-comments. When a tread and lower riser is encountered, the parser knows it has found a staircase, but not until then. And if the parse is of an inter-comment sequence, the comment linter will not be sure of this until the end of the sequence.

Technique: Infinite lookahead

As just pointed out, the comment linter does not know whether it is parsing a staircase or an inter-comment sequence until either

To determine which of these two choices is the correct parse, the linter needs to read an arbitrarily long sequence of tokens -- in other words, the linter needs to perform infinite lookahead.

Humans deal with infinite lookaheads all the time -- natural languages are full of situations that require them.[7] Modern language designers labor to avoid the need for infinite lookahead, but even so cases where it is desirable pop up.[8]

Fortunately, in 1991, Joop Leo published a method that allows computers to emulate infinite lookahead efficiently. Marpa uses Joop's technique. Joop's algorithm is complex, but the basic idea is to do what humans do in the same circumstance -- keep all the possibilities in mind until the evidence comes in.

Technique: the Ruby Slippers

Recall that, according to our conventions, our parser does not recognize a meta-comment unless no structural comment can be recognized. We could implement this in BNF, but it is much more elegant to use the Ruby Slippers.[9]

As those already familiar with Marpa may recall, the Ruby Slippers are invoked when a Marpa parser finds itself unable to proceed with its current set of input tokens. At this point, the lexer can ask the Marpa parser what token it does want. Once the lexer is told what the "wished-for" token is, it can concoct one, out of nowhere if necessary, and pass it to the Marpa parser, which then proceeds happily. In effect, the lexer acts like Glenda the Good Witch of Oz, while the Marpa parser plays the role of Dorothy.

In our implementation, the Marpa parser, by default, looks only for structural comments. If the Marpa parser of our comment linter finds that the current input line is not a structural comment, the Marpa parser halts and tells the lexer that there is a problem. The lexer then asks the Marpa parser what it is looking for. In this case, the answer will always be the same: the Marpa parser will be looking for a meta-comment. The lexer checks to see if the current line is a comment starting at column 1. If there is a comment starting at column 1, the lexer tells the Marpa parser that its wish has come true -- there is a meta-comment.

Another way to view the Ruby Slippers is as a kind of exception mechanism for grammars. In this application, we treat inability to read an structural comment as an exception. When the exception occurs, if possible, we read a meta-comment.

Technique: Error Tokens

Error tokens are a specialized use of the Ruby Slippers. The application for this parser is "linting" -- checking that the comments follow conventions. As such, the main product of the parser is not the parse -- it is the list of errors gathered along the way. So stopping the parser at the first error does not make sense.

What is desirable is to treat all inputs as valid, so that the parsing always runs to the end of input, in the process producing a list of the errors. To do this, we want to set up the parser so that it reads special "error tokens" whenever it encounters a reportable error.

This is perfect for the Ruby Slippers. If an "exception" occurs, as above described for meta-comments, but no meta-comment is available, we treat it as a second level exception.

When would no meta-comment be available? There are two cases:

On the second exception level, the current line will be read as either a <BlankLine>, or a <BadComment>. We know that every line must lex as either a <BlankLine> or a <BadComment> because our comment linter is called as a combinator, and the parent Marpa parser guarantees this.

Technique: Ambiguity

Marpa allows ambiguity, which could have been exploited as a technique. For example, in a simpler BNF than that we used above, it might be ambiguous whether a meta-comment belongs to an <InterPart> which immediately precedes it; or to a <PrePart> which immediately follows it. We could solve the dilemma by noting that it does not matter: All we care about is spotting bad comments and blank lines, so that picking one of two ambiguous parses at random will work fine.

But efficiency issues are sometimes a problem with ambiguity and unambiguity can be a good way of avoiding them.[10] Also, requiring the grammar to be unambiguous allows an additional check that is useful in the development phase. In our code we test each parse for ambiguity. If we find one, we know that hoonlint has a coding error.

Keeping the parser unambiguous makes the BNF less elegant than it could be. To avoid ambiguity, we introduced extra symbols; introduced extra rules; and restricted the use of ambiguous tokens.

Recall that I am using the term "ambiguous" in the strict technical sense that it has in parsing theory, so that a parser is only ambiguous if it can produce two valid parses for one string. An unambiguous parser can allow non-deterministism and can have ambiguous tokens. In fact, our example grammar does both of these things, but is nonetheless unambiguous.

Extra symbols

One example of an extra symbol introduced to make this parser unambiguous is <ProperPreComment>. <ProperPreComment> is used to ensure that a <PrePart> never begins with a meta-comment.[11]

The BNF requires that the first line of a <PrePart> must be a <ProperPreComment>. This means that, if a <MetaComment> is found at the boundary between an <InterPart> and a <PrePart>, it cannot be the first line of the <PrePart> and so must be the last line of the <InterPart>.

Extra rules

In our informal explanation of the comment conventions, we stated that an inter-part is a sequence, each element of which is an inter-comment or a staircase. While BNF that directly implemented this rule would be correct, it would also be highly ambiguous: If an inter-comment occurs before a tread or an upper riser line, it could also be parsed as part of the upper riser.

To eliminate the ambiguity, we stipulate that if comment can be parsed as part of a staircase, then it must be parsed as part of a staircase. This stipulation still leaves the grammar non-deterministic -- we may not know if our comment could be part of a staircase until many lines later.

With our stipulation we know that, if an <InterComponent> contains a staircase, then that staircase must come before any of the inter-comments. In an <InterComponent> both staircases and inter-comments are optional, so the unambiguous representation of <InterComponent> is


    InterComponent ::= Staircases
    InterComponent ::= Staircases InterComments
    InterComponent ::= InterComments
    
Notice that, although both staircases and inter-comments are optional, we do not include the case where both are omitted. This is because we insist that an <InterComponent> contain at least one line.

Ambiguous tokens

Our parser is not ambiguous, but it does allow ambiguous tokens. For example, a comment with inter-comment alignment could be either an <InterComment> or an <UpperRiser>; and our lexer returns both. The parser remains unambiguous, however, because only one of these two tokens will wind up in the final parse.

Call the set of tokens returned by our parser for a single line, a "token set". If the token set contains more than one token, the tokenization is ambiguous for that line. If the token set contains only one token, the token set is called a "singleton", and tokenization is unambiguous for that line.

To keep this parser unambiguous, we restrict the ambiguity at the lexer level. For example, our lexer is set up so that a meta-comment is never one of the alternatives in a lexical ambiguity. If a token set contains a <MetaComment>, that token set must be a singleton. The Ruby Slippers are used to enforce this.[12] Similarly, the Ruby Slippers are used to guarantee that any set of tokens containing either a <BadComment> or a <BlankLine> is a singleton.

Code

This post did not walk the reader through the code. Instead, we talked in terms of strategy. The code is available on Github in unit test form. For those who want to see the comment-linter combinator in a context, a version of the code embedded in hoonlint in also on Github.[13]

Comments on this blog post, etc.

To learn about Marpa, my Earley/Leo-based parser, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Footnotes

1. In their present form, urbits run on top of Unix and UDP.

2. Garbage collection and arbitrary precision may seem too high-level for something considered a "machine language", but our concepts evolve. The earliest machine languages required programmers to write their own memory caching logic and to create their own floating point representations, both things we now regard as much too low-level to deal with even at the lowest software level.

3. This post attempts to follow standard Hoon terminology, but for some details of Hoon's whitespace conventions, there is no settled terminology, and I have invented terms as necessary. The term "pre-comment" is one of those inventions.

4. For a brief survey of this literature, see the entries from 1990 to 1996 in my "timeline" of parsing history.

5. This is the LALR grammar for GHC, from GHC's Github mirror.

6. This is simplified. There were provisions for line continuation, etc. But, nonetheless, the lexers for these languages worked in terms of lines, and had no true concept of a "block".

7. An example of a requirement for infinite lookahead is the sentence "The horse raced past the barn fell". Yes, this sentence is not, in fact, infinitely long, but the subclause "raced past the barn" could be anything, and therefore could be arbitrarily long. In isolation, this example sentence may seem unnatural, a contrived "garden path". But if you imagine the sentence as an answer to the question, "Which horse fell?", expectations are set so that the sentence is quite reasonable.

8. See my blog post "A Haskell challenge".

9. To find out more about Ruby Slippers parsing see the Marpa FAQ, questions 122 and 123; my blog series on parsing HTML; my recent blog post "Marpa and combinator parsing 2"; and my much older blog post "Marpa and the Ruby Slippers".

10. This, by the way, is where I believe parsing theory went wrong, beginning in the 1960's. In an understandable search for efficiency, mainstream parsing theory totally excluded not just ambiguity, but non-determinism as well. These draconian restrictions limited the search for practical parsers to a subset of techniques so weak that they cannot even duplicate human parsing capabilities. This had the bizarre effect of committing parsing theory to a form of "human exceptionalism" -- a belief that human beings have a special ability to parse that computers cannot emulate. For more on this story, see my "timeline" of parsing history.

11. This example illustrates the efficiency considerations involved in the decision to tolerate, or to exclude, efficiency. If n meta-comments occur between a <InterPart> and a <PrePart>, the dividing line is arbitrary, so that there are n+1 parses. This will, in theory, make the processing time quadratic. And, in fact, long sequences of meta-comments might occur between the inter- and pre-comments, so the inefficiency might be real.

12. Inter-comments and comments that are part of upper risers may start at column 1, so that, without special precautions in the lexer, an ambiguity between a structural comment and a meta-comment is entirely possible.

13. For the hoonlint-embedded form, the Marpa grammar is here and the code is here. These are snapshots -- permalinks. The application is under development, and probably will change considerably. Documentation is absent and testing is minimal, so that this pre-alpha embedded form of the code will mainly be useful for those who want to take a quick glance at the comment linter in context.


posted at: 13:03 | direct link to this entry

§         §         §