Sun, 10 Apr 2011
Bovicide 5: Parse-time Error Reporting
This post is one of
an exchange of papers on the Internet.
That exchange discussed what it would take to produce
an acceptable replacement for yacc
and the entire family of LALR-based parsers.
My first post
concentrated on speed and
power -- the ability to parse all context-free grammars in O(n**3) time,
and the ability to parse just about every
grammar in practical use,
including all yacc-able grammars,
in linear (O(n)) time.
This is the second of two posts
about error reporting.
Error reporting is often ignored,
but it is extremely important.
a previous post,
I suggested that its poor error reporting properties are
the real reason why production parsers are abandoning the once-dominant LALR
in favor of hand-written recursive descent.
LALR's feedback on grammar errors could pass for
Theory heavily influences how programmers look
at their craft.
Error detection is a good example.
Great theoreticians are people who know what to ignore.
As Michangelo is alleged to have said,
the statue is already in the marble,
you just have to chip away the extra stuff.
Error detection is certainly grotty detail.
So the first examinations of parsing pretty much ignored
it. As did the second, third, fourth, ...
Things did eventually get better.
These days parsing texts often look carefully
at an algorithm's parse-time error detection properties.
Classification of the parsers by error detection currently
focuses on their
ability to determine where the error is.
It makes sense to focus on this because it's simpler than determining
why there is an error,
and in any case, you aren't likely to find the why if you don't know
A error is said to occur at the first token which is not part of a correct prefix.
(I will call that token the error location.)
A correct prefix is a string of tokens which is a prefix of some input that parses successfully.
Note that this definition of error location may not always match your intuition
of where the error is.
Intuitively, the error is the first token which cannot be part of
one of the inputs that you intended -- it is the point
where you "went wrong".
But neither the theoreticians or I have any idea of how to determine what
a programmer really intended,
short of a Vulcan Mind Meld.
So correct prefixes are going to have to be good enough.
Here's the theoretician's breakdown, using
- Clueless Parsers. After it's all over, these parsers realize
the something went wrong,
but they have no idea where.
- Clueish Parsers. These parsers don't always know it immediately
when things go wrong, but they clue in soon after.
- Clueful Parsers. These parsers know exactly when things go wrong.
The Fifth Requirement for Replacing yacc:
Good Reporting of Parse-time Errors
Clueless parsers stay confined to the pages of
textbooks, as you might expect.
You might also expect this of clueish parsers, but in fact
is clueish, which means clueish parsing
was the industry standard for production parsing
Generations of programmers got used to mysterious
error messages from even the best compilers,
and as a result standards for parse-time error reporting
I don't expect them to stay that way.
I hope in the future clueful parsing will be seen
as a minimum in production quality parsing.
Regular expressions are clueful.
Recursive descent is hard to characterize.
The underlying algorithm is usually less than full LL(1),
which makes recursive descent in theory clueish.
But hand-written recursive descent is attractive because it can be
extended as needed with hacks,
so in practice a hand-written recursive descent grammar might
be clueful when it counts.
is clueful and much more.
Marpa breaks new ground in other areas,
but I think better error reporting will be its most
The Ruby Slippers Property
Marpa has the Ruby Slippers Property -- Marpa not only knows
exactly where the parse went wrong, it knows why, and
can report that to the user in convenient form.
If a lexer passes Marpa a bad token,
Marpa can easily tell the lexer
which token or tokens it will accept.
The Ruby Slippers
are useful for much more than error reporting.
For example, writing BNF to parse very liberal HTML is
a difficult task using conventional methods.
Just dealing with the cases of missing
start or end tags makes the grammar very complex,
and very hard
to maintain and modify.
With Marpa, you can skip all that work.
You write the BNF for strict HTML, on the assumption
that all the start and end tags will be there.
Then, while running, if the parser rejects a token,
the lexer can ask Marpa what it wanted instead.
If it was an start or end tag, the lexer can invent
one and pass it on to keep the parse going.
And that's all you need to do to handle the issue of missing
HTML start and end tags.
Marpa knows not only what tokens it is looking for,
but what rules it is working on and how far it
has progressed into them.
the only application I've put this additional information to
I call the lists of rules in progress,
I created progress reports when I started to work on complex grammars in Marpa,
hoping to make a traditionally difficult task easier.
Right off the bat, these "progress reports" were a big improvement.
Instead of struggling with the internals of the parser generator,
as I then had to do with Marpa,
and as users of yacc still must,
I now had a parser which provided
a window directly into my grammar.
post addressed the reporting of grammar errors.
Why my terms? Because the theoretician's terms are cumbersome and misleading.
Theoreticians defined "the immediate error detection property" and
"the correct prefix property".
Clueless parsers lack both properties.
Clueish parsers have the correct prefix property, but lack the immediate
error detection property.
Clueful parsers have both properties.
The "immediate error detection property" is what you'd think it is,
which is why all parsers with that property are clueful.
What's not so clear is why the
"immediate detection property"
is different from
the "correct prefix property".
Parsers with the "correct prefix property" are those which reject as
soon as an incorrect prefix has been processed.
That is not the same as
"the immediate error detection property" because "processing"
often involves reading input well beyond the correct
prefix, as well as destroying useful evidence
about where the last correct prefix was.
Marpa's speed improvements
derive from one algorithm
published by John Aycock and Nigel Horspool,
and another published by Joop Leo.
Marpa is the first parser to combine the algorithms into one,
and Marpa is the first practical implementation of Leo's algorithm.
My Ruby Slippers HTML parser is Marpa::HTML.
There is more about Ruby Slippers parsing in
one of my previous posts.
posted at: 19:58 |
direct link to this entry