Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website


Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Mon, 21 Jun 2010

Jay Earley's Idea

Truth == Simplicity?

In other posts, I talked about improvements to Jay Earley's parsing algorithm -- some from Joop Leo, some from Aycock and Horspool, some of mine . Here I'd like to talk about Jay Earley's original algorithm. A common belief of great scientists is that, if an idea is basic and true, it is in essence also simple, and therefore it must have a simple explanation.

Finding the simple explanation might be far from simple. But a simple explanation there ought to be. I like to look for those simple explanations. Writing my mathematical novel, The God Proof, confirmed me in this habit. Whenever I am studying something, and it seems important and true, I look for the simple explanation.

Dotted Rules

The idea behind Earley's algorithm is that you can parse by building a table of rules and where you are in those rules. "Where" means two things: location in the rule relative to the rule's symbols, and location relative to the parse's input stream.

Let's look at an example of a rule in a context-free grammar. Here's the rule for assignment in perl's perly.y :

    termbinop => term ASSIGNOP term
In parsing this rule, we can be at the beginning, before all of the symbols. We can also be immediately after any of the three symbols. That's a total of four possible locations.

Within a rule, position relative to the symbols of the rule is traditionally indicated with a dot. In fact, the symbol-relative rule position is very often called the dot location. Taken as a pair, a rule and a dot location are called a dotted rule.

Here's our rule with dot location indicated:

    termbinop => · term ASSIGNOP term
Here the dot is at the beginning of the rule. This dotted rule says we have not recognized any symbols in the rule yet. All we are doing is predicting that the rule will occur. A dotted rule with the dot before all of its symbols is called a prediction.

Here's another dotted rule:

    termbinop => term · ASSIGNOP term
In this dotted rule, we are saying we have seen a term, but have not yet recognized an ASSIGNOP. (ASSIGNOP is perly.y's internal name for the assignment operator. In plain Perl terms, this is the "=" character.)

There's another special kind of dotted rule, a completion. Here's the completed version of the example we've been using:

    termbinop => term ASSIGNOP term ·
A completion is a dotted rule with the dot after all of the symbols. It indicates that a rule has been fully recognized.

The Wheres

In dotted rules, Earley's algorithm has all but one piece of the information it needs to track. The final piece is the second of the two "wheres": where in the input stream. To associate input stream location and dotted rules, Earley's algorithm uses what are now called Earley items. Earley items match input streams locations with dotted rules on a many-to-many basis.

A convenient way to think of an Earley item is as a triple, or 3-tuple, consisting of dotted rule, origin and current location. The origin is the location in the input stream where the dotted rule starts. The current location is the location in the input stream which corresponds to the dot position.

Two noteworthy consequences follow from the definitions in the previous paragraph. First, if a dotted rule is a prediction, origin and current location will always be the same. Second, the input stream location where a rule ends is not necessarily tracked, unless the dotted rule is a completion. In other cases, we don't necessarily know the location at which (or even if) the rule will be completed.

A Note on Terminology

Readers familiar with the literature on Earley parsing will notice that what I call the origin is usually called the "parent". Parent" is a confusing and heavily overloaded term. I avoid it when I can.

The traditional representation of Earley items defines them as pairs of dotted rule and origin. Current location is represented by membership in an Earley set, with the Earley sets corresponding, one-to-one, with locations in the input stream. For me, the traditional representation tangles the basic idea up in implementation details. I find thinking in terms of the 3-tuple is cleaner.

And That's All! Kind of

Everything else about Earley's algorithm is about methods of creating the Earley items, about how to extract the parse from them, and about ways of speeding things up. It gets very complicated. In Marpa it gets so complicated it can be hard to remember that there is a basic idea. But there is, and it is Jay Earley's.

posted at: 00:41 | direct link to this entry

§         §         §