Mon, 02 Mar 2015
PEG: Ambiguity, precision and confusion
is a new notation
for a notorously tricky algorithm that goes back
to the earliest computers.
In its PEG form,
this algorithm acquired an seductive new interface,
one that looks like the best of
extended BNF combined with the best of regular expressions.
Looking at a sample of it, you are tempted to imagine
that writing a parser has suddenly become a very straightforward
matter. Not so.
For those not yet in the know on this,
I'll illustrate with a pair of examples from
an excellent 2008 paper by Redziejowski.
Let's start with these two PEG specifications.
One of these two PEG grammars accepts
the string "aaa" but not the string "aa".
The other does the opposite -- it accepts the string
the string "aa" but not the string "aaa".
Can you tell which one?
(For the answer,
see page 4 of
Here is another example:
A = "a"A"a"/"aa"
What language does this describe?
All the strings in the
language are obviously
the letter "a",
repeated some number of times.
But which string lengths are in the language,
and which are not?
Again the answer is on
page 4 of
-- it's exactly those strings
whose length is a power of 2.
With PEG, what you see in the extended BNF
is not what you get.
PEG parsing has been called "precise",
apparently based on
the idea that PEG parsing is in a certain sense unambiguous.
In this case "precise" is taken as synonymous with
That is, PEG parsing is precise in exactly the same
sense that Jimmy Hoffa's body
is at a precise location.
There is (presumably) exactly one such place,
but we are hard put to be any more specific about the matter.
The advantage of using
a syntax-driven parser generator is that
the syntax you specify describes
the language that will be parsed.
For most practical grammars, PEG is not syntax-driven
in this sense.
PEG researchers understand this issue,
and have tried to deal with it.
I will talk about their work below.
This is much more at stake than bragging rights
over which algorithm is really syntax-driven and
which is not.
When you do not know the language your parser is
parsing, you of course have the problem that
might not parse all the strings in your language.
That can be dealt with by
fixing the parser to accept the correct input,
as you encounter problems.
A second, more serious, problem is often forgotten.
Your PEG parser might accept strings that are
in your language.
At worst, this creates a security loophole.
At best, it leaves with a choice:
or leave the problem unfixed.
It's important to be able to convince
that your code is correct by examining it and thinking
Beginning programmers often simply hack things,
and call code complete once it passes the test suite.
Test suites don't catch everything,
but there is a worse problem with the beginner's approach.
Since the beginner has no clear idea of why his code
works, even when it does,
it is unlikely to be well-organized or readable.
Programming techniques like PEG,
where the code can be made to work,
but where it is much harder,
and in practice usually not possible,
to be sure why the code works,
become maintenance nightmares.
The maintenance implications are especially worrisome if
the PEG parser is for a language with a life cycle that may
involve bug fixes or other changes.
The impact of even
small changes to a PEG specification
is hard to predict and
hard to discover after the fact.
Is PEG unambiguous?
PEG is not unambiguous in any
helpful sense of that word.
BNF allows you to specify ambiguous grammars,
and that feature is tied to its power and flexibility
and often useful in itself.
PEG will only deliver one of those parses.
an easy way of knowing which parse,
the underlying ambiguity is not addressed --
it is just ignored.
My Marpa parser
is a general BNF parser based on Earley's.
It also can simply throw all but one of the parses in an ambiguous
But I would not feel justified in saying to a user who has an
issue with ambiguity,
that Marpa has solved her problem
by throwing all but one arbitrarily chosen result.
Sticking with Marpa for a moment,
we can see one
example of a more helpful approach
Marpa allows a user to rank rules,
so that all but the highest ranking rules are not used
in a parse.
Marpa's rule rankings are specified in its BNF,
and they work together with the BNF
in an intuitive way.
In every case,
Marpa delivers precisely the parses its BNF and its rule
is "precision" in this sense that a parser writer is looking for.
Is there a sensible way to use PEG?
I'll return to Marpa at the end of this post.
that you are not interested in using Marpa --
you are committed to PEG,
and you want to make the best of PEG.
Several excellent programmers have focused
without blinding themselves to its limitations.
I've already mentioned one important paper
Redziejowski's collected papers
are about PEG,
and Redziejowski, in his attempts to use PEG,
does not sugarcoat its problems.
Roberto Ierusalimschy, author of Lua and one of the best
programmers of our time,
has written a PEG-based parser of his own.
Roberto is fully aware of PEG's limits,
but he makes a very good case for choosing PEG
as the basis of LPEG, his parser generator.
LPEG is intended for use with Lua,
a ruthlessly minimal language.
Roberto's minimalist implementation limits the power of his parser,
but his aim is to extend regular expressions in a disciplined way,
and a compact parser of limited power is quite acceptable for his
Matching the BNF to the PEG spec
As Redziejowski and Ierusalimschy and the other authors
Mascarenhas et al, 2013
not knowing what language you are
parsing is more than an annoyance.
We can call a language
"well-behaved for PEG"
if the PEG spec delivers exactly
the language the BNF describes.
Which languages are are well-behaved for PEG?
Mascarenhas et al, 2013,
the LL(1) languages are well-behaved.
(The LL(1) languages are the languages
a top-down parser can parse based on at
most one character of input.)
Syntax-driven parsers for LL(1) have been around
for much longer than PEG --
one such parser is described
in the first paper to describe recursive descent
But most practical languages are not LL(1).
seek to extend this result by defining the language class LL(1p) --
those top-down languages with one
The LL(1p) languages are also well-behaved for PEG.
Mascarenhas et al, 2013
also look at a different approach -- instead of writing a PEG specification
and trying to keep it well-behaved,
they look at taking languages from larger top-down classes
and translating them to PEG.
I don't know of any followup,
but it's possible this approach could produce
well-behaved top-down parsers which are
an improvement over direct-from-PEG parsing.
But for those who are open to leaving top-down parsing behind,
a parser which handles languages in all these classes
and more is already available.
In this post,
I have adopted the point of view of programmers using PEG,
or thinking of doing so.
My own belief in this matter is that
very few programmers
should want to bother with the issues I've just described.
My reason for this is the Marpa parser --
a general BNF Earley-driven parser that
- has an implementation you can use today;
- allows the application to combine syntax-driven parsing
with custom procedural logic;
- makes available full, left-eidetic knowledge of the parse to
the procedural logic;
- and parses a vast class of grammars in linear time,
including all the LR-regular grammars.
The LR-regular grammars include the LR(k) and LL(k)
grammars for all
LR-regular includes all the languages
which are well-behaved under PEG,
and all of those that
Mascarenhas et al, 2013
consider translating into PEG.
Comments on this post can be made in
Marpa's Google group,
or on our IRC channel: #marpa at freenode.net.
To learn more about Marpa,
official web site maintained by Ron Savage.
I also have
a Marpa web site.
posted at: 19:56 |
direct link to this entry