Sat, 01 Jun 2019
Infinite Lookahead and Ruby Slippers
About this post
This post presents a practical, compact example which
demonstrates a use case for both infinite lookahead
and Ruby Slippers parsing.
While the example itself is very simple,
this post may not be a good first tutorial --
it focuses on Marpa implementation strategy,
instead of basics.
The example described in this post is one part of
hoonlint, currently under development,
will be a "lint" program for a language called Hoon.
Hoon is part of
the Urbit project.
Urbit is an effort to return control of the Internet
experience to the individual user.
(The Urbit community has, generously, been supporting my work on Hoon.)
The original Internet and its predecessors were cosy places.
Users controlled their experience.
Authority was so light you
could forget it was there,
but so adequate to its task that you could forget why it
What we old timers do remember of the early Internet was the feeling of entering
into a "brave new world".
The Internet grew beyond our imaginings,
and our pure wonder of decades ago now seems ridiculous.
But the price has been a shift
of power which should be no laughing matter.
Control of our Internet experience now resides in
run by entities which make no secret of having their own interests.
Less overt, but increasingly obvious, is the single-mindedness with which they pursue
And the stakes have risen.
In the early days,
we used the Internet as a supplement in our intellectual lives.
Today, we depend on it in our financial and social lives.
Today, the server-sphere can be a hostile place.
Going forward it may well become a theater of war.
We could try to solve this problem by running our own servers.
But this is a lot of work, and only leaves us in touch with those
willing and able to do that. In practice, this seems to be nobody.
Urbit seeks to solve these problems with
hassle-free personal servers, called urbits.
Urbits are journaling databases, so they are incorruptable.
To make sure they can be run anywhere in the cloud,
they are based on a tiny virtual machine, called Nock.
To keep urbits compact and secure,
Urbit takes on code bloat directly --
Urbit is an original design from a clean slate,
with a new protocol stack.
Nock's "machine language" takes the form of trees of arbitrary precision integers.
The integers can be interpreted as strings, floats, etc.,
And the trees can be interpreted as lists,
giving Nock a resemblance to a LISP VM.
Nock does its own memory management
and takes care of its own garbage collection.
Traditionally, there are two ways to enter machine language,
- Physically, for example,
by toggling it into a machine's front panel.
Originally, entering it physically was the only way.
- Indirectly, using
assembler or some higher-level language, like C.
Once these indirect methods existed, they
rapidly took over as the most common way to create machine language.
machine language, Nock cannot be written directly.
Hoon is Urbit's equivalent of C -- it is Urbit's
"close to the metal" higher level language.
Not that Hoon looks much like C,
or anything else you've ever seen.
This is a Hoon program that takes an integer argument,
call it n,
and returns the first n counting numbers:
|= end=@ :: 1
=/ count=@ 1 :: 2
|- :: 3
^- (list @) :: 4
?: =(end count) :: 5
~ :: 6
:- count :: 7
$(count (add 1 count)) :: 8
Hoon comments begin with a "::" and run until the next
The above Hoon sample uses comments to show line numbers.
The example for this post will be
a hoonlint subset: a multi-line comment linter.
Multi-line comments are the only Hoon syntax we will talk about.
(For those who want to know more about Hoon,
there is a tutorial.)
About Hoon comments
In basic Hoon syntax, multi-line comments are free-form.
In practice, Hoon authors tend to follow a set of conventions.
In the simplest case, a comment must precede the code it
describes, and be at the same indent.
These simple cases are called "pre-comments".
For example, this code contains a pre-comment:
:: pre-comment 1
[20 (mug bod)]
Hoon multi-line comments may also
The inter-comments are aligned depending on the syntax.
In the display below, the inter-comments are aligned with the "rune" of the enclosing sequence.
A "rune" is Hoon's rough equivalent of a "keyword".
Runes are always digraphs of special ASCII characters.
The rune in the following code is
and the sequence it introduces
includes pre-comments, inter-comments and meta-comments.
:~ [3 7]
:: pre-comment 1
[20 (mug bod)]
:: pre-comment 2
:: pre-comment 3
:: :: pre-comment 4
:: [4 qax]
:: pre-comment 5
When inter-comments are empty, as they are in the above,
they are called "breathing comments", because they serve to separate,
or allow some "air" between, elements of a sequence.
the pre-comments in the above are further indicated:
all and only pre-comments contain the text "pre-comment".
The above code also contains a third kind of comment -- meta-comments.
Meta-comments must occur at the far left margin -- at column 1.
These are called meta-comments, because they are allowed
to be outside the syntax structure.
One common use for meta-comments is "commenting out" other syntax.
In the above display, the meta-comments "comment out"
the comment labeled "pre-comment 4"
and its associated code.
Finally, there are "staircase comments", which are used
to indicate the larger structure of Hoon sequences and other
:::: 3e: AES encryption (XX removed) ::
:::: 3f: scrambling ::
:: ob ::
Each staircase consists of three parts.
In lexical order, these parts are
an upper riser,
a tread, and a lower riser.
The upper riser is a sequence of comments at the same
alignment as an inter-comment.
The tread is also at the inter-comment alignment,
but must be 4 colons ("::::") followed
The lower riser is a sequence of comments
indented two spaces more than the tread.
Hoon comment conventions
Hoon's basic syntax allows comments to be free-form.
In practice, there are strict conventions for these comments,
conventions we would like to enforce with hoonlint.
- A multi-line comment may contain
an "inter-part", a "pre-part",
- If both an inter-part and a pre-part are present,
the inter-part must precede the pre-part.
- The inter-part is a non-empty sequence of inter-comments
- A pre-part is a non-empty sequence of pre-comments.
- Meta-comments may be inserted anywhere in either the pre-part
or the inter-part.
- Comments which do not obey the above rules are
A good comment is any comment which is not a bad comment.
- A comment is not regarded as a meta-comment
if it can be parsed as structural comment.
An structural comment is any good comment which is
not a meta-comment.
We will implement these conventions using the BNF
of this section.
The sections to follow outline the strategy behind the BNF.
:start ::= gapComments
gapComments ::= OptExceptions Body
gapComments ::= OptExceptions
Body ::= InterPart PrePart
Body ::= InterPart
Body ::= PrePart
InterPart ::= InterComponent
InterPart ::= InterruptedInterComponents
InterPart ::= InterruptedInterComponents InterComponent
InterruptedInterComponents ::= InterruptedInterComponent+
InterruptedInterComponent ::= InterComponent Exceptions
InterComponent ::= Staircases
InterComponent ::= Staircases InterComments
InterComponent ::= InterComments
InterComments ::= InterComment+
Staircases ::= Staircase+
Staircase ::= UpperRisers Tread LowerRisers
UpperRisers ::= UpperRiser+
LowerRisers ::= LowerRiser+
PrePart ::= ProperPreComponent OptPreComponents
ProperPreComponent ::= PreComment
OptPreComponents ::= PreComponent*
PreComponent ::= ProperPreComponent
PreComponent ::= Exception
OptExceptions ::= Exception*
Exceptions ::= Exception+
Exception ::= MetaComment
Exception ::= BadComment
Exception ::= BlankLine
Our comment linter is implemented as a combinator.
The main hoonlint
parser invokes this combinator when it encounters
a multi-line comment.
Because of the main parser,
we do not have to worry about confusing comments with
Hoon's various string and in-line text syntaxes.
Note that while combinator parsing is useful,
it is a technique that can be oversold.
Combinators have been much talked about in the functional programming
but the current flagship functional programming language compiler,
the Glasgow Haskell Compiler,
does not use combinators to parse its version of the Haskell --
instead it uses a parser in the yacc lineage.
As a parsing technique on its own,
the use of combinators is simply another way of packaging recursive
descent with backtracking,
and the two techniques share the same power,
the same performance,
and the same downsides.
Marpa is much more powerful than either LALR (yacc-lineage) parsers or combinators,
so we can save combinator parsing for those cases where
combinator parsing really is helpful.
One such case is lexer mismatch.
The first programming languages, like BASIC and FORTRAN,
were line-structured -- designed to be parsed line-by-line.
After ALGOL, new languages were usually block-structured.
Blocks can start or end in the middle of a line,
and can span multiple lines.
And blocks are often nested.
A line-structured language requires its lexer to think in
terms of lines,
but this approach is completely useless for a block-structured
Combining both line-structured and block-structured logic in the same lexer
usually turns the lexer's code into a rat's nest.
Calling a combinator every time
a line-structured block is encountered eliminates the problem.
The main lexer can assume that the code is block-structured,
and all the line-by-line logic can go into combinators.
Our grammar is non-deterministic,
It is unambiguous because,
for every input,
it will produce no more than one parse.
It is non-deterministic because there is a case
where it tracks two possible parses at once.
The comment linter cannot immediately distinguish between
a prefix of the upper riser of a staircase,
and a prefix of a sequence of inter-comments.
When a tread and lower riser is encountered,
the parser knows it has found a staircase,
but not until then.
And if the parse is of an inter-comment sequence,
the comment linter will
not be sure of this until the end of the sequence.
Technique: Infinite lookahead
As just pointed out,
the comment linter does not know whether it is parsing a staircase or
an inter-comment sequence until either
- it finds a tread and lower riser, in which case
it knows the correct parse will be a staircase; or
- it successfully reaches the end of the inter-comment sequence,
in which case it knows the correct parse is an inter-comment sequence.
To determine which of these two choices is the correct parse,
the linter needs to read
an arbitrarily long sequence of tokens --
in other words, the linter needs to perform infinite lookahead.
Humans deal with infinite lookaheads all the time --
natural languages are full of situations that require them.
Modern language designers labor to avoid the need
for infinite lookahead,
but even so
cases where it is desirable pop up.
Fortunately, in 1991, Joop Leo published a method that
allows computers to emulate infinite lookahead efficiently.
Marpa uses Joop's technique.
Joop's algorithm is complex,
but the basic idea is to do what humans do in the same circumstance --
keep all the possibilities in mind until the evidence comes in.
Technique: the Ruby Slippers
Recall that, according to our conventions,
our parser does not recognize a meta-comment unless
no structural comment can be recognized.
We could implement this in BNF,
but it is much more elegant to use the Ruby Slippers.
As those already familiar with Marpa may recall,
the Ruby Slippers are invoked when a Marpa parser finds itself
unable to proceed with its current set of input tokens.
At this point, the lexer can ask the Marpa parser what token it does want.
Once the lexer is told what the "wished-for" token is,
it can concoct one, out of nowhere if necessary, and pass it to the Marpa parser,
which then proceeds happily.
In effect, the lexer acts like Glenda the Good Witch of Oz,
while the Marpa parser plays the role of Dorothy.
In our implementation, the Marpa parser, by default,
looks only for structural comments.
If the Marpa parser of our
comment linter finds that the current input line is not
a structural comment,
the Marpa parser halts
and tells the lexer that there is a problem.
The lexer then asks the Marpa parser what it is looking for.
In this case, the answer will always be the same:
the Marpa parser will be looking for a meta-comment.
The lexer checks to see if the current line is a comment
starting at column 1.
If there is a comment starting at column 1,
the lexer tells the Marpa parser that its wish has come true --
there is a meta-comment.
Another way to view the Ruby Slippers is as a kind of exception
mechanism for grammars.
In this application, we treat inability to read an structural
comment as an exception.
When the exception occurs,
if possible, we read a meta-comment.
Technique: Error Tokens
Error tokens are a specialized use of the Ruby Slippers.
The application for this parser is "linting" --
checking that the comments follow conventions.
As such, the main product of the parser is not the parse --
it is the list of errors gathered along the way.
So stopping the parser at the first error does not make sense.
What is desirable is to treat all inputs as valid,
so that the parsing always runs to the end of input,
in the process producing a list of the errors.
To do this, we want to set up the parser so that it reads
special "error tokens" whenever it encounters a reportable error.
This is perfect for the Ruby Slippers.
If an "exception" occurs,
as above described for meta-comments,
but no meta-comment is available,
we treat it as a second level exception.
When would no meta-comment be available?
There are two cases:
- The line read is a comment,
but it does not start at column 1.
- The line read is a blank line (all whitespace).
On the second exception level, the current line
will be read as either a <BlankLine>,
or a <BadComment>.
We know that every line must lex as either a
or a <BadComment> because our comment linter
is called as a combinator,
and the parent Marpa parser guarantees this.
Marpa allows ambiguity,
which could have been exploited as a technique.
For example, in a simpler BNF than that we used above,
it might be ambiguous whether a meta-comment belongs to an <InterPart>
which immediately precedes it;
or to a <PrePart> which immediately follows it.
We could solve the dilemma by noting that it does not matter:
All we care about is spotting bad comments and blank lines,
so that picking one of two ambiguous parses at random will work fine.
But efficiency issues are sometimes a problem with ambiguity
and unambiguity can be a good way of avoiding them.
Also, requiring the grammar to be unambiguous allows
an additional check that is useful in the development phase.
In our code we test each parse for ambiguity.
If we find one, we know that hoonlint has a coding error.
Keeping the parser unambiguous makes the BNF
less elegant than it could be.
To avoid ambiguity,
we introduced extra symbols;
introduced extra rules;
and restricted the use of ambiguous tokens.
Recall that I am using the term "ambiguous" in the strict technical
sense that it has in parsing theory, so that a parser is only ambiguous
if it can produce two valid parses for one string.
An unambiguous parser
can allow non-deterministism and
can have ambiguous tokens.
In fact, our example grammar does both of these things,
but is nonetheless unambiguous.
One example of an extra symbol introduced to make this parser
unambiguous is <ProperPreComment>
is used to ensure that a
never begins with a meta-comment.
The BNF requires that the first line of a
must be a
This means that, if a
<MetaComment> is found
at the boundary between an
it cannot be the first line of the
and so must be the last line of the
In our informal explanation of the comment conventions,
we stated that an inter-part is a sequence, each element of
which is an inter-comment or a staircase.
While BNF that directly implemented this rule would be correct,
it would also be highly ambiguous:
If an inter-comment occurs before a tread or an upper riser line,
it could also be parsed as part of the upper riser.
To eliminate the ambiguity,
we stipulate that if comment can be parsed as part of a staircase,
then it must be parsed as part of a staircase.
This stipulation still leaves the grammar non-deterministic --
we may not know if our comment could be part of a staircase until
many lines later.
With our stipulation we know that, if an
a staircase, then that staircase must come before any of the inter-comments.
In an <InterComponent>
both staircases and inter-comments are optional, so the
unambiguous representation of
InterComponent ::= Staircases
InterComponent ::= Staircases InterComments
InterComponent ::= InterComments
Notice that, although
both staircases and inter-comments are optional,
we do not include the case where both are omitted.
This is because we insist that an
contain at least one line.
Our parser is not ambiguous, but
it does allow ambiguous tokens.
For example, a comment with inter-comment alignment
could be either an
and our lexer returns both.
The parser remains unambiguous, however, because
only one of these two tokens will wind up in the
Call the set of tokens returned
by our parser for a single line,
a "token set".
If the token set contains more than one token,
the tokenization is ambiguous for that line.
If the token set contains only one token,
the token set is called a "singleton",
and tokenization is unambiguous for that line.
this parser unambiguous, we restrict the
ambiguity at the lexer level.
our lexer is set up so
that a meta-comment is never one of the alternatives
in a lexical ambiguity.
If a token set contains a
that token set must be a singleton.
The Ruby Slippers are used to enforce this.
Similarly, the Ruby Slippers are used to guarantee that
any set of tokens containing either
<BlankLine> is a singleton.
This post did not walk the reader through the code.
Instead, we talked in terms of strategy.
The code is available on Github
in unit test form.
For those who want to see the comment-linter combinator in a context,
a version of the code embedded in hoonlint
in also on Github.
Comments on this blog post, etc.
To learn about Marpa,
my Earley/Leo-based parser,
there is the
semi-official web site, maintained by Ron Savage.
The official, but more limited, Marpa website
is my personal one.
Comments on this post can be made in
Marpa's Google group,
or on our IRC channel: #marpa at freenode.net.
posted at: 13:03 |
direct link to this entry