Mon, 27 May 2013
Why Marpa works: table parsing
Marpa works very differently from the parsers
in wide use today.
Marpa is a table parser.
And Marpa is unusual among table parsers --
its focus is on speed.
The currently favored parsers use stacks,
in some cases together with a state machine.
These have the advantage that it is easy
to see how they can be made
to run fast.
They have the disadvantage of severely limiting
what the parser can do and how much it can know.
What is table parsing?
Table parsing means parsing by constructing a table of all the possible parses.
This is pretty clearly something you want -- anything less means
not completely knowing what you're doing.
It's like walking across the yard blindfolded.
It's fine if you can make sure there are
no walls, open pits, or other dangerous obstacles.
But for the general case,
it's a bad idea.
Where the analogy between walking blindfolded and parsing breaks
down is that while taking off the blindfold has no cost,
building a table of everything that is happening while you parse
has a cost.
If you limit your parser to a stack and a state machine,
you may be parsing with a blindfold on,
but it is clear that your parser can be fast.
How to make a table parser run fast
is not so clear.
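To make "a table of all the possible parses" concrete, here is a minimal sketch of the kind of table parser Marpa builds on: an Earley recognizer. This is illustrative Python, not Marpa's actual code, and the arithmetic grammar is invented for the example. The "table" is the list of Earley sets, one per input position.

```python
# Minimal Earley recognizer: the "table" is one set of items per
# input position.  An item (lhs, rhs, dot, origin) means: we began
# matching rule lhs -> rhs at position `origin`, and have matched
# the first `dot` symbols of rhs so far.

GRAMMAR = {                      # toy arithmetic grammar, for illustration
    "S": [["S", "+", "M"], ["M"]],
    "M": [["M", "*", "T"], ["T"]],
    "T": [["1"], ["2"], ["3"]],
}
START = "S"

def earley_sets(tokens):
    n = len(tokens)
    sets = [set() for _ in range(n + 1)]
    for rhs in GRAMMAR[START]:
        sets[0].add((START, tuple(rhs), 0, 0))
    for i in range(n + 1):
        changed = True
        while changed:           # run predict/complete to a fixpoint
            changed = False
            for lhs, rhs, dot, origin in list(sets[i]):
                if dot < len(rhs) and rhs[dot] in GRAMMAR:    # predict
                    for alt in GRAMMAR[rhs[dot]]:
                        item = (rhs[dot], tuple(alt), 0, i)
                        if item not in sets[i]:
                            sets[i].add(item); changed = True
                elif dot == len(rhs):                         # complete
                    for pl, pr, pd, po in list(sets[origin]):
                        if pd < len(pr) and pr[pd] == lhs:
                            item = (pl, pr, pd + 1, po)
                            if item not in sets[i]:
                                sets[i].add(item); changed = True
        if i < n:                                             # scan
            for lhs, rhs, dot, origin in sets[i]:
                if dot < len(rhs) and rhs[dot] == tokens[i]:
                    sets[i + 1].add((lhs, rhs, dot + 1, origin))
    return sets

def recognized(tokens):
    return any(lhs == START and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in earley_sets(tokens)[-1])

print(recognized(["1", "+", "2", "*", "3"]))   # True
print(recognized(["1", "+", "+"]))             # False
```

Because every viable item is in the table, the parser at any position can report exactly which rules are in progress and which terminals it expects next -- the knowledge the rest of this post relies on. (This toy omits the optimizations, Leo's and Aycock-Horspool's among them, that make a real implementation fast.)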
The advantages of table parsing
What are the advantages of taking off the blindfold?
First, your parser can be completely general --
anything you can write in BNF it can parse.
Second, you always know exactly what is going on -- what rules
are possible, what rules have been recognized,
how far into a rule you have recognized it,
what symbols you expect next, and so on.
We programmers have gotten used to parsers which run blindfolded.
When you bump into something while
blindfolded you don't know
what it was.
When non-table parsers fail, they usually don't know why --
they can only guess.
If you have a full parse table,
built from left to right,
you know what you were looking for and what you already think you have found.
This means that you can pinpoint and identify errors precisely.
Knowing where you are in a parse also allows certain tricks,
like the one I call the Ruby Slippers.
In this, you parse with an over-simplified grammar and,
when it fails, you ask the parser what it was looking for.
Then -- poof! -- you provide it.
The Ruby Slippers work beautifully when dealing with
missing tags in HTML.
You can define a simplified HTML grammar,
one that lives in a non-existent world --
an ideal world where all start
and end tags are always where they belong.
Then you parse away.
If, as will happen with real-world input, a tag is missing,
you ask the table parser what it was looking for,
and give it a virtual tag.
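The Ruby Slippers can be sketched in a few lines of code. What follows is an illustrative Python toy, not Marpa's API: an Earley-style recognizer for an idealized grammar in which every start tag has its end tag, plus a driver that, whenever the table does not expect the next real token (or runs out of input mid-element), asks the table what end tag it was looking for and supplies it virtually. The grammar, the tag names, and the CLOSERS set are all invented for the example.

```python
# Ruby Slippers sketch: parse with an idealized grammar, and when the
# real input disappoints the parser, grant its wish -- invent the end
# tag it says it expects.

GRAMMAR = {                          # idealized tag grammar, for illustration
    "DOC":  [["ELEM"], ["ELEM", "DOC"]],
    "ELEM": [["<p>", "text", "</p>"], ["<b>", "text", "</b>"]],
}
START = "DOC"
CLOSERS = {"</p>", "</b>"}           # virtual tokens we are willing to invent

def add(s, item):
    if item in s:
        return False
    s.add(item)
    return True

def close(sets, i):
    """Run prediction and completion to a fixpoint on Earley set i."""
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot, origin in list(sets[i]):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:        # predict
                for alt in GRAMMAR[rhs[dot]]:
                    changed |= add(sets[i], (rhs[dot], tuple(alt), 0, i))
            elif dot == len(rhs):                             # complete
                for pl, pr, pd, po in list(sets[origin]):
                    if pd < len(pr) and pr[pd] == lhs:
                        changed |= add(sets[i], (pl, pr, pd + 1, po))

def expected(state_set):
    """Ask the table which terminals it is looking for right now."""
    return {rhs[dot] for _, rhs, dot, _ in state_set
            if dot < len(rhs) and rhs[dot] not in GRAMMAR}

def scan(sets, tok):
    i = len(sets) - 1
    sets.append({(l, r, d + 1, o) for l, r, d, o in sets[i]
                 if d < len(r) and r[d] == tok})
    close(sets, i + 1)

def accepted(sets):
    return any(l == START and d == len(r) and o == 0
               for l, r, d, o in sets[-1])

def ruby_slippers_parse(tokens):
    sets = [{(START, tuple(r), 0, 0) for r in GRAMMAR[START]}]
    close(sets, 0)
    repaired = []
    for tok in tokens:
        # The real token is not what the table expects: poof! -- supply
        # the end tag the parser wishes for.  (A truly unrepairable
        # token would raise KeyError here; fine for a sketch.)
        while tok not in expected(sets[-1]):
            wish = (expected(sets[-1]) & CLOSERS).pop()
            scan(sets, wish)
            repaired.append(wish)
        scan(sets, tok)
        repaired.append(tok)
    while not accepted(sets):        # close anything still open at EOF
        wish = (expected(sets[-1]) & CLOSERS).pop()
        scan(sets, wish)
        repaired.append(wish)
    return repaired

print(ruby_slippers_parse(["<p>", "text", "<b>", "text"]))
```

Run on input that is missing both end tags, the driver emits the repaired token stream `['<p>', 'text', '</p>', '<b>', 'text', '</b>']` -- the parse proceeds as if the input had lived in the ideal world all along.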
And as for fast ...
When I decided to write Marpa in 2007 my goal was to create a table parser
that was as fast as possible.
I was surprised to find that the academic literature contained a
major improvement to table parsing by Joop Leo,
an improvement which nobody had ever made a serious attempt to implement.
Marpa is the first implementation of Joop Leo's 1991 improvement to table parsing which,
as far as theory goes,
makes Marpa as fast as any parser
in practical use today.
Any class of grammar that
recursive descent, bison, etc. will parse,
Marpa will parse in linear time.
Table parsing has had a reputation for being slow due to a
bad "constant factor".
Theoreticians, when looking at speed as time complexity,
throw away constant factors.
Most of the time, that simplification does no harm:
time complexity results which ignore
constant factors are also the most
meaningful results in practical terms.
But not 100% of the time.
Sometimes the constant factor makes all the difference.
And deciding when a constant factor does make a difference,
and when it does not,
is a tricky matter,
one that lies in that murky zone between practice and theory
where neither practitioner or theorist feels fully at home.
It's time to look again, for two reasons.
First, Aycock and Horspool did a lot of careful work on
the constant factor for table parsing,
work which Marpa incorporates
and builds on.
Second, the judgment about the constant factor dates from 1968,
when computers were literally a billion
times slower than they are now.
Why has nobody re-examined this judgment as the years and the
orders-of-magnitude speed-ups marched by?
When a judgment crosses sub-disciplines, it can be "sticky"
beyond all reason,
and this one is a good example.
The decision that its "constant factor" made table parsing
too slow for many practical applications
is in part a practical take on a theoretical issue,
and in part a theoretical call on a practical matter.
Over the years, the theoretical results have improved.
But the theoreticians did not change
their mind about the speed of table parsing
because it was a judgment about the practical application
of the theory.
The practitioners were actually out there writing compilers.
When it comes down to practice,
you have to assume that practitioners know what they
are talking about, right?
Meanwhile the practice of writing software underwent
revolution after revolution.
But the practitioners continued to write off table parsing as too slow.
Talking about the speed of table parsers quickly got you
into some very heavy math.
And some of the algorithms
did not even exist except as mathematical
notation on the pages
of the journals and textbooks.
When it comes down to theory about things
that don't exist outside of theory,
who do you listen to, if not the theoreticians?
I looked carefully at the issue
of the "constant factor" in
a previous post.
Forty-five years have passed and
the cost of CPU has fallen
nine orders of magnitude.
(Others say the cost of CPU has dropped by 50% a year,
in which case it's more than 13 orders of magnitude.
But why quibble?)
It's reasonable to suspect that
the constant factor that practitioners
and theoreticians were worried about in 1968
stopped being a
major obstacle many years ago.
For more about Marpa
Marpa's latest version is
available on CPAN.
A list of my Marpa tutorials can be found online,
including a new tutorial by Peter Stuifzand.
This blog focuses on Marpa,
and it has
an annotated guide.
Marpa also has
a web page.
For questions, support and discussion, there is a discussion group.
Comments on this post can be made there.
posted at: 04:37