Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website

Google+

Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Mon, 02 Mar 2015


PEG: Ambiguity, precision and confusion

Precise?

PEG parsing is a new notation for a notorously tricky algorithm that goes back to the earliest computers. In its PEG form, this algorithm acquired an seductive new interface, one that looks like the best of extended BNF combined with the best of regular expressions. Looking at a sample of it, you are tempted to imagine that writing a parser has suddenly become a very straightforward matter. Not so.

For those not yet in the know on this, I'll illustrate with a pair of examples from an excellent 2008 paper by Redziejowski. Let's start with these two PEG specifications.

    ("a"|"aa")"a"
    ("aa"|"a")"a"
    

One of these two PEG grammars accepts the string "aaa" but not the string "aa". The other does the opposite -- it accepts the string the string "aa" but not the string "aaa". Can you tell which one? (For the answer, see page 4 of Redziejowski 2008.)

Here is another example:

    A = "a"A"a"/"aa"
    

What language does this describe? All the strings in the language are obviously the letter "a", repeated some number of times. But which string lengths are in the language, and which are not? Again the answer is on page 4 of Redziejowski 2008 -- it's exactly those strings whose length is a power of 2.

With PEG, what you see in the extended BNF is not what you get. PEG parsing has been called "precise", apparently based on the idea that PEG parsing is in a certain sense unambiguous. In this case "precise" is taken as synonymous with "unique". That is, PEG parsing is precise in exactly the same sense that Jimmy Hoffa's body is at a precise location. There is (presumably) exactly one such place, but we are hard put to be any more specific about the matter.

Syntax-driven?

The advantage of using a syntax-driven parser generator is that the syntax you specify describes the language that will be parsed. For most practical grammars, PEG is not syntax-driven in this sense. Several important PEG researchers understand this issue, and have tried to deal with it. I will talk about their work below. This is much more at stake than bragging rights over which algorithm is really syntax-driven and which is not.

When you do not know the language your parser is parsing, you of course have the problem that your parser might not parse all the strings in your language. That can be dealt with by fixing the parser to accept the correct input, as you encounter problems.

A second, more serious, problem is often forgotten. Your PEG parser might accept strings that are not in your language. At worst, this creates a security loophole. At best, it leaves with a choice: break compatiblity, or leave the problem unfixed.

It's important to be able to convince yourself that your code is correct by examining it and thinking about it. Beginning programmers often simply hack things, and call code complete once it passes the test suite. Test suites don't catch everything, but there is a worse problem with the beginner's approach.

Since the beginner has no clear idea of why his code works, even when it does, it is unlikely to be well-organized or readable. Programming techniques like PEG, where the code can be made to work, but where it is much harder, and in practice usually not possible, to be sure why the code works, become maintenance nightmares.

The maintenance implications are especially worrisome if the PEG parser is for a language with a life cycle that may involve bug fixes or other changes. The impact of even small changes to a PEG specification is hard to predict and hard to discover after the fact.

Is PEG unambiguous?

PEG is not unambiguous in any helpful sense of that word. BNF allows you to specify ambiguous grammars, and that feature is tied to its power and flexibility and often useful in itself. PEG will only deliver one of those parses. But without an easy way of knowing which parse, the underlying ambiguity is not addressed -- it is just ignored.

My Marpa parser is a general BNF parser based on Earley's. It also can simply throw all but one of the parses in an ambiguous parse away. But I would not feel justified in saying to a user who has an issue with ambiguity, that Marpa has solved her problem by throwing all but one arbitrarily chosen result.

Sticking with Marpa for a moment, we can see one example of a more helpful approach to ambiguity. Marpa allows a user to rank rules, so that all but the highest ranking rules are not used in a parse. Marpa's rule rankings are specified in its BNF, and they work together with the BNF in an intuitive way. In every case, Marpa delivers precisely the parses its BNF and its rule rankings specify. And it is "precision" in this sense that a parser writer is looking for.

Is there a sensible way to use PEG?

I'll return to Marpa at the end of this post. For now, let's assume that you are not interested in using Marpa -- you are committed to PEG, and you want to make the best of PEG. Several excellent programmers have focused on PEG, without blinding themselves to its limitations. I've already mentioned one important paper by Redziejowski. Many of Redziejowski's collected papers are about PEG, and Redziejowski, in his attempts to use PEG, does not sugarcoat its problems.

Roberto Ierusalimschy, author of Lua and one of the best programmers of our time, has written a PEG-based parser of his own. Roberto is fully aware of PEG's limits, but he makes a very good case for choosing PEG as the basis of LPEG, his parser generator. LPEG is intended for use with Lua, a ruthlessly minimal language. Roberto's minimalist implementation limits the power of his parser, but his aim is to extend regular expressions in a disciplined way, and a compact parser of limited power is quite acceptable for his purposes.

Matching the BNF to the PEG spec

As Redziejowski and Ierusalimschy and the other authors of Mascarenhas et al, 2013 recognize, not knowing what language you are parsing is more than an annoyance. We can call a language "well-behaved for PEG" if the PEG spec delivers exactly the language the BNF describes.

Which languages are are well-behaved for PEG? According to Mascarenhas et al, 2013, the LL(1) languages are well-behaved. (The LL(1) languages are the languages a top-down parser can parse based on at most one character of input.) Syntax-driven parsers for LL(1) have been around for much longer than PEG -- one such parser is described in the first paper to describe recursive descent (Peter Lucas, 1961). But most practical languages are not LL(1). Redziejowski 2013 and Redziejowski 2014 seek to extend this result by defining the language class LL(1p) -- those top-down languages with one "parsing procedure" of lookahead. The LL(1p) languages are also well-behaved for PEG.

Mascarenhas et al, 2013 also look at a different approach -- instead of writing a PEG specification and trying to keep it well-behaved, they look at taking languages from larger top-down classes and translating them to PEG. I don't know of any followup, but it's possible this approach could produce well-behaved top-down parsers which are an improvement over direct-from-PEG parsing. But for those who are open to leaving top-down parsing behind, a parser which handles languages in all these classes and more is already available.

Marpa

In this post, I have adopted the point of view of programmers using PEG, or thinking of doing so. My own belief in this matter is that very few programmers should want to bother with the issues I've just described. My reason for this is the Marpa parser -- a general BNF Earley-driven parser that

The LR-regular grammars include the LR(k) and LL(k) grammars for all k. LR-regular includes all the languages which are well-behaved under PEG, and all of those that Mascarenhas et al, 2013 consider translating into PEG.

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.


posted at: 22:56 | direct link to this entry

§         §         §