**Jeffrey Kegler's blog**
about Marpa, his new parsing algorithm,
and other topics of interest

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Back around 1980, I had access to UNIX and a language I wanted to parse. I knew that UNIX had all the latest CS tools. So I expected to type in my BNF and "Presto, Language!".

Not so easy, I was told. Languages were difficult things created with complex tools written by experts who understood the issues. I recall thinking that, while English had a syntax that is as hard as they come, toddlers manage to parse it just fine. But experts are experts, and more so at second-hand.

I was steered to an LALR-based parser called yacc. (Readers may be more familiar with bison, a yacc successor.) LALR had extended the class of quickly parseable grammars a bit beyond recursive descent. But recursive descent was simple in principle, and its limits were easy to discover and work around. LALR, on the hand, was OK when it worked, but figuring out why it failed when it failed was more like decryption than debugging, and this was the case both with parser development and run-time errors. I soon gave up on yacc and found another way to solve my problem.

Few people complained about yacc on the Internet. If you noise it about that you are unable to figure out how to use what everybody says is the state-of-the-art tool, the conclusions drawn may not be the ones you want. But my experience seems to have been more than common.

LALR's claim to fame was that it was the basis of the industry-standard C compiler. Over three decades, its maintainers suffered amid the general silence. But by 2006, they'd had enough. GCC (the new industry standard) ripped its LALR engine out. By then the trend back to recursive descent was well underway.

Back in the 1970's, there had been more powerful alternatives to LALR and recursive descent. But they were reputed to be slow.

For some applications slow is OK. In 2007 I decided that a parsing tool that parsed all context-free languages at state-of-the-art speeds, slow or fast as the case might be, would be a useful addition to programmer toolkits. And I ran into a surprise.

Hidden in the literature was an amazing discovery -- an 1991 article by Joop Leo that described how to modify Earley's algorithm to be fast for every language class in practical use. (When I say "fast" in this article, I will mean "linear".) Leo's article had been almost completely ignored -- my project (Marpa) would become its first practical implementation.

The implications of Leo's discovery go well beyond speed. If you can rely on the BNF that you write always producing a practical parser, you can auto-generate your language. In fact, you can write languages which write languages.

The Leo/Earley algorithm is not fast for every BNF-expressible language. BNF is powerful, and you can write exponentially ambiguous languages in it. But programmers these days mostly care about unambiguous languages -- we are accustomed to tools and techniques that parse only a subset of these.

As I've said, Marpa is fast for every language
class in practical use today.
Marpa is almost certainly fast for any language
that a modern programmer has in mind.
Unless you peek ahead at the hints I am about to give you,
in fact, it is actually
**hard**
to write an unambiguous
grammar that goes non-linear on Marpa.
Simply mixing up lots of left, right and middle recursions
will
**not**
be enough to make an
unambiguous grammar go non-linear.
You will also need to violate a rule
in the set that
I am about to give you.

To guarantee that Marpa is fast for your BNF language, follow three rules:

- Rule 1: Your BNF must be unambiguous.
- Rule 2: Your BNF must have no "unmarked" middle recursions.
- Rule 3: All of the right-recursive symbols in your BNF must be dedicated to right recursion.

Rule 3 turns out to be very easy to obey. I discuss it in detail in the next section, which will be about how to break these rules and get away with it.

Before we do that, let's look at what an "unmarked" middle recursion is. Here's an example of a "marked" middle recursion:

M ::= 'b' M ::= 'a' M 'a'

Here the "b" symbol is the marker. This marked middle recursion generates sequences like

b a b a a a b a a

Now for an "unmarked" middle recursion:

M ::= 'a' 'a' M ::= 'a' M 'a'

This unmarked middle recursion generates sequences like

a a a a a a a a a a a a

In this middle recursion there is no marker. To know where the middle is, you have to scan all the way to the end, and then count back.

A rule of thumb is that if you can "eyeball" the middle of a long sequence, the recursion is marked. If you can't, it is unmarked. Unfortunately, we can't characterize exactly what a marker must look like -- a marker can encode the moves of a Turing machine, so marker detection is undecidable.

The rules about ambiguity and recursions are "soft". If you only use limited ambiguity and keep your rule-breaking recursions short, your parser will stay fast.

Above, I promised to explain rule 3, which insisted that a right recursive symbol be "dedicated". A right recursive symbol is "dedicated" if it appears only as the recursive symbol in a right recursion. If your grammar is unambiguous, but you've used an "undedicated" right-recursive symbol, that is easy to fix. Just rewrite the grammar, replacing the "undedicated" symbol with two different symbols. Dedicate one of the two to the right recursion, and use the other symbol everywhere else.

The languages I have described as "fast" for Marpa include all those in practical use and many more. But do you really want to use Marpa for all of them? Here are four cases for which Marpa is probably not your best alternative.

The first case: a language that parses easily with a regular expression. The regular expression will be much faster. Don't walk away from a good thing.

The second case: a language that is easily parsed using a single loop and some state that fits into constant space. This parser might be very easy to write and maintain. If you are using a much slower higher level language, Marpa's optimized C language may be a win on CPU speed. But, as before, if you have a good thing, don't walk away from it.

The third case:
a variation on the second.
Here your single loop might be getting out of hand,
making you yearn for the syntax-driven convenience
of Marpa,
but your state still fits into constant space.
In its current implementation, Marpa keeps all of its
parse tables forever, so Marpa does
**not**
parse in constant space.
Keeping the tables
allows Marpa to deal with the full structure of its
input, in a way that a SAX-ish approaches cannot.
But if space requirements are an issue,
and your application allows a simplified constant-space
approach,
Marpa's power and convenience may not be enough to
make up for that.

The fourth case: a language that

- is very small;
- changes slowly or not at all, and does not grow in complexity;
- merits careful hand-optimization, and has available the staff to do it;
- merits and has available the kind of on-going support that will keep your code optimized under changing circumstances; and
- is easily parseable via recursive descent:

It is rare that all of these are the case, but when that happens, recursive descent is often preferable to Marpa. Lua and JSON are two languages which meet the above criteria. In Lua's case, it targets platforms with very restricted memories, which is an additional reason to prefer recursive descent -- Marpa has a relatively large footprint.

It was not good luck that made both Lua and JSON good targets for recursive descent -- they were designed around its limits. JSON is a favorite test target of Marpa for just these reasons. There are carefully hand-optimized C language parsers for us to benchmark against.

We get closer and closer, but Marpa will never beat small hand-optimized JSON parsers in software. However, while recursive descent is a technique for hand-writing parsers, Marpa is a mathematical algorithm. Someday, instructions for manipulating Earley items could be implemented directly in silicon. When and if that day comes, Earley's algorithm will beat recursive descent even at parsing the grammars that were designed for it.

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

posted at: 16:36 | direct link to this entry

§
§
§

I have claimed that my new parser, Marpa, is linear for vast classes of grammars, going well beyond what the traditional parsers can do. But skepticism is justified. When it comes to parsing algorithms, there have been a lot of time complexity claims that are hand-wavy, misleading or just plain false. This post describes how someone, who is exercising the appropriate degree of skepticism, might conclude that believing Marpa's claims is a reasonable and prudent thing to do.

Marpa's linearity claims seem to be, in comparison with the other parsers in practical use today, bold. Marpa claims linearity, not just for every class of grammar for which yacc/bison, PEG and recursive descent currently claim linearity, but for considerably more. (The mathematical details of these claims are in a section at the end.) It seems too good to be true.

The most important thing to realize, in assessing the believability of Marpa's time complexity claims, is that they are not new. They were already proved in a long-accepted paper in the refereed literature. They are the time complexity claims proved by Joop Leo for his algorithm in 1991, over two decades ago. Marpa is derived from Leo's algorithm, and its time complexity claims are those proved for Leo's algorithm.

Above I said that Marpa's time complexity claims "seem" bold.
On any objective assessment, they are in fact a bit of a yawn.
The claims
**seem**
surprising only because a lot of people
are unaware of Leo's results.
That is, they are surprising in the same sense that someone who
had avoided hearing about radio waves would be surprised
to learn that he can
communicate instantly
with someone on the other side of the world.

So, if there's so little to prove, why does the Marpa paper have proofs? In Marpa, I made many implementation decisions about, and some changes to, the Leo/Earley algorithm. None of my changes produced better time complexity results -- my only claim is that I did not change the Leo/Earley algorithm in a way that slowed it down. To convince myself of this claim, I reworked the original proofs of Leo and Earley, changing them to reflect my changes, and demonstrated that the results that Leo had obtained still held.

Proofs of this kind, which introduce no new mathematical techniques, but simply take a previous result and march from here to there by well-know means, are called "tedious". In journals, where there's a need to conserve space, they are usually omitted, especially if, as is the case with Marpa's time complexity proofs, the results are intuitively quite plausible.

So let's say you are not going to work through every line of Marpa's admittedly tedious proofs. We've seen that the results are intuitively plausible, as long as you don't reject the previous literature. But can we do better than merely "plausible"?

As an aside, many people misunderstand the phrase "mathematically proven", especially as it applies to branches of math like parsing theory. The fact is that proofs in papers often contain errors. Usually these are minor, and don't affect the result. On the other hand, Jay Earley's paper, while one of the best Computer Science papers ever published, also contained a very serious error. And this error slipped past his Ph.D. committee and his referees. Mathematical arguments and proofs do not allow us to achieve absolute certainty. They can only improve the degree of certainty.

There's a second way to dramatically increase your degree of conviction in Marpa's linearity claims, and it is quite simple. Create examples of problematic grammars, run them and time them. This is not as satisfying as a mathematical proof, because no set of test grammars can be exhaustive. But if you can't find a counter-example to Marpa's linearity claims among the grammars of most interest to you, that should help lift your level of certainty to "certain for all practical purposes".

Much of this increase in certainty can be achieved without bothering to run your own tests. Marpa is in wide use at this point. If Marpa was going quadratic on grammars for which it claimed to be linear, and these were grammars of practical interest, that would very likely have been noticed by now.

Let's suppose all this has not brought you to the level of certainty you need to use Marpa. That means the reasonable thing is to continue to struggle to work with the restrictions of the traditional algorithms, right? No, absolutely not.

OK, so you don't believe that Marpa preserves the advances in power and speed made by Leo. Does that mean that parsers have to stay underpowered? No, it simply means that there should be a more direct implementation of Leo's 1991, bypassing Marpa.

But if you are looking for an implementation of Leo's 1991 algorithm, I think you may end up coming back to Marpa as the most reasonable choice. Marpa's additional features include the ability to use custom, procedural logic, as you can with recursive descent. And Marpa has worked out a lot of the implementation details for you.

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Above I talked about algorithms, classes of grammars and their linearity claims. I didn't give details because most folks aren't interested. For those who are, they are in this section.

yacc is linear for a grammar class called LALR, which is a subset of another grammar class called LR(1). If you are willing to hassle with GLR, bison claims linearity for all of LR(1). Recursive descent is a technique, not an algorithm, but it is top-down with look-ahead, and therefore can be seen as some form of LL(k), where k depends on how it is implemented. In practice, I suspect k is never much bigger than 3, and usually pretty close to 1. With packratting, PEG can be made linear for everything it parses but there is a catch -- only in limited cases do you know what language your PEG grammar actually parses. In current practice, that means your PEG grammar must be LL(1). Some of the PEG literature looks at techniques for extending this as far as LL-regular, but there are no implementations, and it remains to be seen if the algorithms described are practical.

The Marpa paper contains a proof, based on a proof of the same claim by Joop Leo, that Marpa is linear for LR-regular grammars. The LR-regular grammars include the LR(k) grammars for every k. So Marpa is linear for LR(1), LR(2), LR(8675309), etc. LR-regular also includes LL-regular. So every class of grammar under discussion in the PEG literature is already parsed in linear time by Marpa. From this, it is also safe to conclude that, if a grammar can be parsed by anything reasonably described as recursive descent, it can be parsed in linear time by Marpa.

posted at: 22:01 | direct link to this entry

§
§
§

PEG parsing is a new notation for a notorously tricky algorithm that goes back to the earliest computers. In its PEG form, this algorithm acquired an seductive new interface, one that looks like the best of extended BNF combined with the best of regular expressions. Looking at a sample of it, you are tempted to imagine that writing a parser has suddenly become a very straightforward matter. Not so.

For those not yet in the know on this, I'll illustrate with a pair of examples from an excellent 2008 paper by Redziejowski. Let's start with these two PEG specifications.

("a"|"aa")"a" ("aa"|"a")"a"

One of these two PEG grammars accepts
the string "`aaa`" but not the string "`aa`".
The other does the opposite -- it accepts the string
the string "`aa`" but not the string "`aaa`".
Can you tell which one?
(For the answer,
see page 4 of
Redziejowski 2008.)

Here is another example:

A = "a"A"a"/"aa"

What language does this describe?
All the strings in the
language are obviously
the letter "`a`",
repeated some number of times.
But which string lengths are in the language,
and which are not?
Again the answer is on
page 4 of
Redziejowski 2008
-- it's exactly those strings
whose length is a power of 2.

With PEG, what you see in the extended BNF is not what you get. PEG parsing has been called "precise", apparently based on the idea that PEG parsing is in a certain sense unambiguous. In this case "precise" is taken as synonymous with "unique". That is, PEG parsing is precise in exactly the same sense that Jimmy Hoffa's body is at a precise location. There is (presumably) exactly one such place, but we are hard put to be any more specific about the matter.

The advantage of using a syntax-driven parser generator is that the syntax you specify describes the language that will be parsed. For most practical grammars, PEG is not syntax-driven in this sense. Several important PEG researchers understand this issue, and have tried to deal with it. I will talk about their work below. This is much more at stake than bragging rights over which algorithm is really syntax-driven and which is not.

When you do not know the language your parser is parsing, you of course have the problem that your parser might not parse all the strings in your language. That can be dealt with by fixing the parser to accept the correct input, as you encounter problems.

A second, more serious, problem is often forgotten.
Your PEG parser might accept strings that are
**not**
in your language.
At worst, this creates a security loophole.
At best, it leaves with a choice:
break compatiblity,
or leave the problem unfixed.

It's important to be able to convince yourself that your code is correct by examining it and thinking about it. Beginning programmers often simply hack things, and call code complete once it passes the test suite. Test suites don't catch everything, but there is a worse problem with the beginner's approach.

Since the beginner has no clear idea of why his code works, even when it does, it is unlikely to be well-organized or readable. Programming techniques like PEG, where the code can be made to work, but where it is much harder, and in practice usually not possible, to be sure why the code works, become maintenance nightmares.

The maintenance implications are especially worrisome if the PEG parser is for a language with a life cycle that may involve bug fixes or other changes. The impact of even small changes to a PEG specification is hard to predict and hard to discover after the fact.

PEG is not unambiguous in any helpful sense of that word. BNF allows you to specify ambiguous grammars, and that feature is tied to its power and flexibility and often useful in itself. PEG will only deliver one of those parses. But without an easy way of knowing which parse, the underlying ambiguity is not addressed -- it is just ignored.

My Marpa parser is a general BNF parser based on Earley's. It also can simply throw all but one of the parses in an ambiguous parse away. But I would not feel justified in saying to a user who has an issue with ambiguity, that Marpa has solved her problem by throwing all but one arbitrarily chosen result.

Sticking with Marpa for a moment, we can see one example of a more helpful approach to ambiguity. Marpa allows a user to rank rules, so that all but the highest ranking rules are not used in a parse. Marpa's rule rankings are specified in its BNF, and they work together with the BNF in an intuitive way. In every case, Marpa delivers precisely the parses its BNF and its rule rankings specify. And it is "precision" in this sense that a parser writer is looking for.

I'll return to Marpa at the end of this post. For now, let's assume that you are not interested in using Marpa -- you are committed to PEG, and you want to make the best of PEG. Several excellent programmers have focused on PEG, without blinding themselves to its limitations. I've already mentioned one important paper by Redziejowski. Many of Redziejowski's collected papers are about PEG, and Redziejowski, in his attempts to use PEG, does not sugarcoat its problems.

Roberto Ierusalimschy, author of Lua and one of the best programmers of our time, has written a PEG-based parser of his own. Roberto is fully aware of PEG's limits, but he makes a very good case for choosing PEG as the basis of LPEG, his parser generator. LPEG is intended for use with Lua, a ruthlessly minimal language. Roberto's minimalist implementation limits the power of his parser, but his aim is to extend regular expressions in a disciplined way, and a compact parser of limited power is quite acceptable for his purposes.

As Redziejowski and Ierusalimschy and the other authors of Mascarenhas et al, 2013 recognize, not knowing what language you are parsing is more than an annoyance. We can call a language "well-behaved for PEG" if the PEG spec delivers exactly the language the BNF describes.

Which languages are are well-behaved for PEG? According to Mascarenhas et al, 2013, the LL(1) languages are well-behaved. (The LL(1) languages are the languages a top-down parser can parse based on at most one character of input.) Syntax-driven parsers for LL(1) have been around for much longer than PEG -- one such parser is described in the first paper to describe recursive descent (Peter Lucas, 1961). But most practical languages are not LL(1). Redziejowski 2013 and Redziejowski 2014 seek to extend this result by defining the language class LL(1p) -- those top-down languages with one "parsing procedure" of lookahead. The LL(1p) languages are also well-behaved for PEG.

Mascarenhas et al, 2013 also look at a different approach -- instead of writing a PEG specification and trying to keep it well-behaved, they look at taking languages from larger top-down classes and translating them to PEG. I don't know of any followup, but it's possible this approach could produce well-behaved top-down parsers which are an improvement over direct-from-PEG parsing. But for those who are open to leaving top-down parsing behind, a parser which handles languages in all these classes and more is already available.

In this post, I have adopted the point of view of programmers using PEG, or thinking of doing so. My own belief in this matter is that very few programmers should want to bother with the issues I've just described. My reason for this is the Marpa parser -- a general BNF Earley-driven parser that

- has an implementation you can use today;
- allows the application to combine syntax-driven parsing with custom procedural logic;
- makes available full, left-eidetic knowledge of the parse to the procedural logic;
- and parses a vast class of grammars in linear time, including all the LR-regular grammars.

The LR-regular grammars include the LR(k) and LL(k)
grammars for all
*k*.
LR-regular includes all the languages
which are well-behaved under PEG,
and all of those that
Mascarenhas et al, 2013
consider translating into PEG.

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

posted at: 19:56 | direct link to this entry

§
§
§

Comparisons between top-down and bottom-up parsing are often either too high-level or too low-level. Overly high-level treatments reduce the two approaches to buzzwords, and the comparison to a recitation of received wisdom. Overly low-level treatments get immersed in the minutiae of implementation, and the resulting comparison is as revealing as placing two abstractly related code listings side by side. In this post I hope to find the middle level; to shed light on why advocates of bottom-up and top-down parsing approaches take the positions they do; and to speculate about the way forward.

The basic idea of top-down parsing is as brutally simple as anything in programming: Starting at the top, we add pieces. We do this by looking at the next token and deciding then and there where it fits into the parse tree. Once we've looked at every token, we have our parse tree.

In its purest form, this idea is too simple for practical parsing, so top-down parsing is almost always combined with lookahead. Lookahead of one token helps a lot. Longer lookaheads are very sparsely used. They just aren't that helpful, and since the number of possible lookaheads grows exponentially, they get very expensive very fast.

Top-down parsing has an issue with left recursion. It's straightforward to see why. Take an open-ended expression like

a + b + c + d + e + f + [....]

Here the plus signs continue off to the right, and adding any of them to the parse tree requires a dedicated node which must be above the node for the first plus sign. We cannot put that first plus sign into a top-down parse tree without having first dealt with all those plus signs that follow it. For a top-down strategy, this is a big, big problem.

Even in the simplest expression, there is no way of counting the plus signs without looking to the right, quite possibly a very long way to the right. When we are not dealing with simple expressions, this rightward-looking needs to get sophisticated. There are ways of dealing with this difficulty, but all of them share one thing in common -- they are trying to make top-down parsing into something that it is not.

Top-down parsing does not look at the right context in any systematic way, and in the 1970's it was hard to believe that top-down was as good as we can do. (It's not all that easy to believe today.) But its extreme simplicity is also top-down parsing's great strength. Because a top-down parser is extremely simple, it is very easy to figure out what it is doing. And easy to figure out means easy to customize.

Take another of the many constructs incomprehensible to a top-down parser:

2 * 3 * 4 + 5 * 6

How do top-down parsers typically handle this? Simple: as soon as they realize they are faced with an expression, they give up on top-down parsing and switch to a special-purpose algorithm.

These two properties -- easy to understand and easy to customize -- have catapulted top-down parsing to the top of the heap. Behind their different presentations, combinator parsing, PEG, and recursive descent are all top-down parsers.

Few theoreticians of the 1970's imagined that top-down parsing might be the end of the parsing story. Looking to the right in ad hoc ways clearly does help. It would be almost paradoxical if there was no systematic way to exploit the right context.

In 1965, Don Knuth found an algorithm to exploit right context. Knuth's LR algorithm was, like top-down parsing as I have described it, deterministic. Determinism was thought to be essential -- allowing more than one choice easily leads to a combinatorial explosion in the number of possibilities that have to be considered at once. When parsers are restricted to dealing with a single choice, it is much easier to guarantee that they will run in linear time.

Knuth's algorithm did **not**
try to hang
each token from a branch of a top-down parse tree
as soon as it was encountered.
Instead, Knuth suggested delaying that decision.
Knuth's algorithm collected
"subparses".

When I say "subparses" in this discussion, I mean pieces of the parse that contain all the decisions necessary to construct the part of the parse tree that is below them. But subparses do not contain any decisions about what is above them in the parse tree. Put another way, subparses know who they are, but not where they belong.

Subparses may not know where they belong, but knowing who they are is enough for them to be assembled into larger subparses. And, if we keep assembling the subparses, eventually we will have a "subparse" that is the full parse tree. And at that point we will know both who everyone is and where everyone belongs.Knuth's algorithm stored subparses by shifting them onto a stack. The operation to do this was called a "shift". (Single tokens of the input are treated as subparses with a single node.) When there was enough context to build a larger subparse, the algorithm popped one or more subparses off the stack, assembled a larger subparse, and put the resulting subparse back on the stack. This operation was called a "reduce", based on the idea that its repeated application eventually "reduces" the parse tree to its root node.

In handling the stack, we will often be faced with choices. One kind of choice is between using what we already have on top of the stack to assemble a larger subparse; or pushing more subparses on top of the stack instead ("shift/reduce"). When we decide to reduce, we may encounter the other kind of choice -- we have to decide which rule to use ("reduce/reduce").

Like top-down parsing, bottom-up parsing is usually combined with lookahead. For the same lookahead, a bottom-up parser parses everything that a top-down parser can handle, and more.

Formally, Knuth's approach is now called shift/reduce parsing. I want to demonstrate why theoreticians, and for a long time almost everybody else as well, was so taken with this method. I'll describe how it works on some examples, including two very important ones that stump top-down parsers: arithmetic expressions and left-recursion. My purpose here is bring to light the basic concepts, and not to guide an implementor. There are excellent implementation-oriented presentations in many other places. The Wikipedia article, for example, is excellent.

Bottom-up parsing solved the problem of left recursion. In the example from above,

a + b + c + d + e + f + [....]

we simply build one subparse after another, as rapidly as we can. In the terminology of shift/reduce, whenever we can reduce, we do. Eventually we will have run out of tokens, and will have reduced until there is only one element on the stack. That one remaining element is the subparse that is also, in fact, our full parse tree.

The top-down parser had a problem with left recursion precisely because it needed to build top-down. To build top-down, it needed to know about all the plus signs to come, because these needed to be fitted into the parse tree above the current plus sign. But when building bottom-up, we don't need to know anything about the plus signs that will be above the current one in the parse tree. We can afford to wait until we encounter them.

But if working bottom-up solves the left recursion problem, doesn't it create a right recursion problem? In fact, for a bottom-up parser, right recursion is harder, but not much. That's because of the stack. For a right recursion like this:

a = b = c = d = e = f = [....]

we use a strategy opposite to the one we used for the left recursion. For left recursion, we reduced whenever we could. For right recursion, when we have a choice, we always shift. This means we will immediately shift the entire input onto the stack. Once the entire input is on the stack, we have no choice but to start reducing. Eventually we will reduce the stack to a single element. At that point, we are done. Essentially, what we are doing is exactly what we did for left recursion, except that we use the stack to reverse the order.

Arithmetic expressions like

2 * 3 * 4 + 5 * 6

require a mixed strategy. Whenever we have a shift/reduce choice, and one of the operators is on the stack, we check to see if the topmost operator is a multiply or an addition operator. If it is a multiply operator, we reduce. In all other cases, if there is a shift/reduce choice, we shift.

In the discussion above, I have pulled the strategy for making stack decisions (shift/reduce and reduce/reduce) out of thin air. Clearly, if bottom-up parsing was going to be a practical parsing algorithm, the stack decisions would have to be made algorithmically. In fact, discovering a practical way to do this was a far from trivial task. The solution in Knuth's paper was considered (and apparently intended) to be mathematically provocative, rather than practical. But by 1979, it was thought a practical way to make stack decisions had been found and yacc, a parser generator based on bottom-up parsing, was released. (Readers today may be more familiar with yacc's successor, bison.)

With yacc, it looked as if the limitations of top-down parsing were past us. We now had a parsing algorithm that could readily and directly parse left and right recursions, as well as arithmetic expressions. Theoreticians thought they'd found the Holy Grail.

But not all of the medieval romances had happy endings. And as I've described elsewhere, this story ended badly. Bottom-up parsing was driven by tables which made the algorithm fast for correct inputs, but unable to accurately diagnose faulty ones. The subset of grammars parsed was still not quite large enough, even for conservative language designers. And bottom-up parsing was very unfriendly to custom hacks, which made every shortcoming loom large. It is much harder to work around a problem in a bottom-up parser than than it is to deal with a similar shortcoming in a top-down parser. After decades of experience with bottom-up parsing, top-down parsing has re-emerged as the algorithm of choice.

For many, the return to top-down parsing answers the question that we posed earlier: "Is there any systematic way to exploit right context when parsing?" So far, the answer seems to be a rather startling "No". Can this really be the end of the story?

It would be very strange if the best basic parsing algorithm we know is top-down. Above, I described at some length some very important grammars that can be parsed bottom-up but not top-down, at least not directly. Progress like this seems like a lot to walk away from, and especially to walk back all the way to what is essentially a brute force algorithm. This perhaps explains why lectures and textbooks persist in teaching bottom-up parsing to students who are very unlikely to use it. Because the verdict from practitioners seems to be in, and likely to hold up on appeal.

Fans of deterministic top-down parsing, and proponents of deterministic bottom-up parsing share an assumption: For a practical algorithm to be linear, it has to be deterministic. But is this actually the case?

It's not, in fact. To keep bottom-up parsing deterministic, we restricted ourselves to a stack. But what if we track all possible subpieces of parses? For efficiency, we can link them and put them into tables, making the final decisions in a second pass, once the tables are complete. (The second pass replaces the stack-driven see-sawing back and forth of the deterministic bottom-up algorithm, so it's not an inefficiency.) Jay Earley in 1968 came up with an algorithm to do this, and in 1991 Joop Leo added a memoization to Earley's algorithm which made it linear for all deterministic grammars.

The "deterministic grammars" are exactly the bottom-up parseable grammars with lookahead -- the set of grammars parsed by Knuth's algorithm. So that means the Earley/Leo algorithm parses, in linear time, everything that a deterministic bottom-up parser can parse, and therefore every grammar that a deterministic top-down parser can parse. (In fact, the Earley/Leo algorithm is linear for a lot of ambiguous grammars as well.)

Top-down parsing had the advantage that it was easy to know where you are. The Earley/Leo algorithm has an equivalent advantage -- its tables know where it is, and it is easy to query them programmatically. In 2010, this blogger modified the Earley/Leo algorithm to have the other big advantage of top-down parsing: The Marpa algorithm rearranges the Earley/Leo parse engine so that we can stop it, perform our own logic, and restart where we left off. A quite useable parser based on the Marpa algorithm is available as open source.

posted at: 17:53 | direct link to this entry

§
§
§

What makes a parsing algorithm successful? Two factors, I think. First, does the algorithm parse a workably-defined set of grammars in linear time? Second, does it allow the application to intervene in the parse with custom code? When parsing algorithms are compared, typically neither of these gets much attention. But the successful algorithms do one or the other.

"Workably-defined" means more than well-defined
in the mathematical sense --
the definition has to be **workable**.
That is, the
definition must be something that,
with reasonable effort,
a programmer can use in practice.

The algorithms in regular expression engines are workably-defined. A regular expression, in the pure sense consists of a sequence of symbols, usually shown by concatenation:

a b c

or a choice among sequences, usually shown by a vertical bar:

a | b | c

or a repetition of any of the above, typically shown with a star:

a*

or any recursive combination of these. True, if this definition is new to you, it can take time to get used to. But vast numbers of working programming are very much "used to it", can think in terms of regular expressions, and can determine if a particular problem will yield to treatment as a regular expression, or not.

Parsers in the LALR family (yacc, bison, etc.)
do **not**
have a workably defined set of grammars.
LALR is perfectly well-defined mathematically,
but even experts in parsing theory are hard put to decide
if a particular grammar is LALR.

Recursive descent also does not have a workably defined set of grammars. Recursive descent doesn't even have a precise mathematical description -- you can say that recursive descent is LL, but in practice LL tables are rarely used. Also in practice, the LL logic is extended with every other trick imaginable, up to and including switching to other parsing algorithms.

It is not easy for users to intervene in the processing of a regular expression, though some implementations attempt to allow such efforts. LALR parsers are notoriously opaque. Those who maintain the LALR-driven Perl parser have tried to supplement its abilities with custom code, with results that will not encourage others making the same attempt.

Recursive descent, on the other hand, has no parse engine -- it is 100% custom code. You don't get much friendlier than that.

Regular expressions are a success, and will remain so, because the set of grammars they handle is very workably-defined. Applications using regular expressions have to take what the algorithm gives them, but what it gives them is very predictable.

For example, an application can write regular expressions on the fly, and the programmer can be confident they will run as long as they are well-formed. And it is easy to determine if the regular expression is well-formed. (Whether it actually does what you want is a separate issue.)

Recursive descent does not handle a workably-defined set of grammars, and it also has to be hand-written. But it makes up for this by allowing the user to step into the parsing process anywhere, and "get his hands dirty". Recursive descent does nothing for you, but it does allow you complete control. This is enough to make recursive descent the current algorithm of choice for major parsing projects.

As I have
chronicled
elsewhere,
LALR was once,
on highly convincing theoretical grounds,
seen as
**the** solution to the parsing problem.
But while mathematically well-defined,
LALR was not workably defined.
And it was very hostile to applications that tried to alter,
or even examine, its syntax-driven workings.
After decades of trying to make it work,
the profession has abandoned LALR almost totally.

Marpa has both properties: its set of grammars is workably-defined. And, while Marpa is syntax-driven like LALR and regular expressions, it also allows the user to stop the parse engine, communicate with it about the state of the parse, do her own parsing for a while, and restart the parse engine at any point she wants.

Marpa's workable definition has a nuance that the one
for regular expressions does not.
For regular expressions, linearity is a given --
they parse in linear time or fail.
Marpa parses a much larger class of grammars, the context-free grammars --
anything that can be written in BNF.
BNF is used to describe languages in standards,
and is therefore itself a kind of "gold standard"
for a workable definition of a
set of grammars.
However, Marpa does **not**
parse everything that can be written in BNF in linear time.

Marpa linearly-parsed set of grammars is smaller than the context-free grammars, but it is still very large, and it is still workably-defined. Marpa will parse any unambiguous language in linear time, unless it contains unmarked middle recursions. An example of a "marked" middle recursion is the language described by

S ::= a S a | x

a string of which is "`aaaxaaa`",
where the "`x`" marks the middle.
An example of an "unmarked" middle recursion is the language described
by

S ::= a S a | a

a string of which is "`aaaaaaa`",
where nothing marks the middle, so that you don't know until the end where the
middle of the recursion is.
If a human can reliably find the middle by eyeball, the middle recursion is marked.
If a human can't, then the middle recursion might be unmarked.

Marpa also parses a large set of unambiguous grammars linearly, and this set of grammars is also workably-defined. Marpa parses an ambiguous grammar in linear time if

- It has no unmarked middle recursions.
- All right recursions are unambiguous.
- There are no cycles.
A cycle occurs, for example, if there is a rule
`A ::= A`in the grammar. - Marpa's level of ambiguity at any location is bounded by a constant.

The term "level of ambiguity" requires a bit of explanation. At any given location, there can be as many rules "in play" as you like, without affecting the level of ambiguity. The key question: What is the maximum number of different origins that a rule might have? (The "origin" of a rule is the location where it begins.) That is, can a rule currently in play have at most 20 different origins? Or could it have its origin at every location so far? If the maximum number of origins is 20 or any other fixed constant, the level of ambiguity is "bounded". But if the maximum number of origins keeps growing as the length of the input grows, the level of ambiguity is unbounded.

For the unambiguous case, Marpa's workable definition encompasses a much larger class of grammars, but is no more complex than that for regular expressions. If you want to extend even further, and work with ambiguous grammars, the definition remains quite workable. Of the four restrictions needed to ensure linearity, the one requiring a bounded level of ambiguity is the only one that might force you to exercise real vigliance -- once you get into ambiguity, unboundedness is easy to slip into.

As for the other three, cycles never occur in a practical grammars, and Marpa reports them, so that you simply fix them when they happen. Most recursions will be left recursions, which are unrestricted. My experience has been that, in practical grammars, unmarked middle recursions and ambiguous right recursions are not especially tempting features. If you note whenever you use a right recursion, checking that it is not ambiguous, and if you note whenever you use a middle recursion, checking that it is marked, then you will stay linear.

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

posted at: 21:10 | direct link to this entry

§
§
§