Mon, 21 Jun 2010
Jay Earley's Idea
Truth == Simplicity?
In other posts,
I talked about improvements to Jay Earley's parsing algorithm
-- some from Joop Leo, some from Aycock and Horspool,
some of mine
Here I'd like to talk about Jay Earley's original algorithm.
A common belief of great scientists is that,
if an idea is basic and true,
it is in essence also simple,
and therefore it must have a simple explanation.
Finding the simple explanation might be far from simple.
But a simple explanation there ought to be.
I like to look for those simple explanations.
Writing my mathematical novel, The God Proof,
me in this habit.
Whenever I am studying something,
and it seems important and true,
I look for the simple explanation.
The idea behind Earley's algorithm is that you can
parse by building a table of rules
and where you are in those rules.
"Where" means two things: location in the rule relative to the rule's
and location relative to the parse's input stream.
Let's look at an example of a rule in a context-free grammar.
Here's the rule for assignment in perl's
termbinop => term ASSIGNOP term
In parsing this rule, we can be at the beginning, before all of the symbols.
We can also be immediately after any of the three symbols.
That's a total of four possible locations.
Within a rule, position relative to the symbols of the rule
is traditionally indicated with a dot.
In fact, the symbol-relative rule position
is very often called the dot location.
Taken as a pair,
a rule and a dot location are called a dotted rule.
Here's our rule with dot location indicated:
termbinop => · term ASSIGNOP term
Here the dot is at the beginning of the rule.
This dotted rule says we have not recognized any symbols in the rule yet.
All we are doing is predicting that the rule will occur.
A dotted rule with the dot before all of its symbols is called a prediction
Here's another dotted rule:
termbinop => term · ASSIGNOP term
In this dotted rule,
we are saying we have seen a term
, but have not yet recognized
's internal name for
the assignment operator. In plain Perl terms, this is the "=
There's another special kind of dotted rule, a completion.
Here's the completed version of the example we've been using:
termbinop => term ASSIGNOP term ·
is a dotted rule with the dot after all of the symbols.
It indicates that a rule has been fully recognized.
In dotted rules, Earley's algorithm has all but one piece of the information it needs to track.
The final piece is the second of the two "wheres": where in the input stream.
To associate input stream location and dotted rules, Earley's algorithm uses what are now called Earley items.
Earley items match input streams locations with dotted rules
on a many-to-many basis.
A convenient way to think of an Earley item is as a triple, or 3-tuple,
consisting of dotted rule, origin and current location.
The origin is the location in the input stream where the dotted rule starts.
The current location is the location in the input stream which corresponds to the
Two noteworthy consequences
follow from the definitions in the previous paragraph.
if a dotted rule is a prediction, origin and current location will always be the same.
the input stream location where a rule ends is not necessarily tracked, unless the
dotted rule is a completion.
In other cases,
we don't necessarily know the location at which (or even if) the rule will
A Note on Terminology
Readers familiar with the literature on
Earley parsing will notice
that what I call the origin is usually called the "parent".
Parent" is a confusing and heavily overloaded
I avoid it when I can.
representation of Earley items
as pairs of dotted rule and origin.
Current location is represented by membership in an
Earley set, with the Earley sets corresponding,
one-to-one, with locations in the input stream.
the traditional representation tangles the basic idea
I find thinking in terms of the 3-tuple is cleaner.
And That's All! Kind of
Everything else about Earley's algorithm
is about methods of creating the Earley items,
about how to extract the parse from them,
and about ways of speeding things up.
It gets very complicated.
it gets so complicated it can be hard to remember that
there is a basic idea.
But there is,
and it is Jay Earley's.
posted at: 03:41 |
direct link to this entry