Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest
The Ocean of Awareness blog: home page, chronological index, and annotated index.
Abstract Syntax Forests (ASF's) are my most recent project. I am adding ASF's to my Marpa parser. Marpa has long supported ambiguous parsing, and allowed users to iterate through, and examine, all the parses of an ambiguous parse. This was enough for most applications.
Even applications which avoid ambiguity benefit from better ways to detect and locate it. And there are applications that require the ability to select among and manipulate very large sets of ambiguous parses. Prominent among these is Natural Language Processing (NLP). This post will introduce an experiment. Marpa in fact seems to have some potential for NLP.
Writing an efficient ASF in not a simple matter. The naive implementation is to generate complete set of fully expanded abstract syntax trees (AST's). This approach consumes resources that can become exponential in the size of the input. Translation: the naive implementation quickly becomes unuseably slow. Marpa optimizes by aggressively identifying identical subtrees of the AST's. Especially in highly ambiguous parses, many subtrees are identical, and this optimization is often a big win.
My primary NLP example at this point is a quote from Ada Lovelace. It is a long sentence, possibly the longest, from her Notes -- 158 words. A disadvantage of this example is that it is not typical of normal NLP. By modern standards it is an unusually long and complex sentence. An advantage of it, and my reason for the choice, is that it stresses the parser.
The "Note A" from which this sentence is taken is one of Ada's notes on a translation of a paper on the work of her mentor and colleague, Charles Babbage. Ada's "Notes" are longer than the original paper, and far more important. In these "Notes" Ada makes the first distinction between a computer and a calculator, and between software and hardware. In their collaboration, Babbage did all of the hardware design, and he wrote most of the actual programs in her paper. But these two revolutionary ideas, and their elaboration, are Ada's.
Why would Babbage ignore obvious implications of his own invention? The answer is that, while these implications are obvious to us, they simply did not fit into the 1843 view of the world. In those days, algebra was leading-edge math. The ability to manipulate equations was considered an extremely advanced form of reason. For Babbage and his contemporaries, that sort of ability to reason certainly suggested the ability to distinguish between good and evil, and this in turn suggested possession of a soul. Ada's "Notes" were written 20 years after Mary Shelly, while visiting Ada's father in Switzerland, wrote the novel Frankenstein. For Ada's contemporaries, announcing that you planned to create a machine that composed music, or did advanced mathematical reasoning, was not very different from announcing that you planned to assemble a human being in your lab.
Ada was the daughter of the poet Byron. For her, pushing boundaries was a family tradition. Babbage was happy to leave these matters to Ada. As Babbage's son put it, his father
considered the Paper by Menabrea, translated with notes by Lady Lovelace, published in volume 3 of Taylor's 'Scientific Memoirs," as quite disposing of the mathematical aspect of the invention. My business now is not with that.
Ada's notes are worth reading, but the modern reader has to be prepared to face several layers of difficulty:
Those who view mathematical science, not merely as a vast body of abstract and immutable truths, whose intrinsic beauty, symmetry and logical completeness, when regarded in their connexion together as a whole, entitle them to a prominent place in the interest of all profound and logical minds, but as possessing a yet deeper interest for the human race, when it is remembered that this science constitutes the language through which alone we can adequately express the great facts of the natural world, and those unceasing changes of mutual relationship which, visibly or invisibly, consciously or unconsciously to our immediate physical perceptions, are interminably going on in the agencies of the creation we live amidst: those who thus think on mathematical truth as the instrument through which the weak mind of man can most effectually read his Creator's works, will regard with especial interest all that can tend to facilitate the translation of its principles into explicit practical forms.
Ada's sentence may look like what happens when two pickups carrying out-of-date dictionaries to the landfill run into each other on the way. But there is, in fact, a good deal of structure and meaning in all those words. Let's take it as bullet points:
Ada is connecting her new science of software to the history of thought in the West, something which readers of the time would expect her to do. Bullet point 1 alludes to the Greek view of mathematics, especially Plato's. Bullet point 2 alludes to the scientific view, as pioneered by Galileo and Newton. Bullet point 3 alludes to the post-Classical world view, especially the Christian one. But while the language is Christian, Ada's idea is one that Einstein would have had no trouble with. And bullet 4 is the call to action.
When we come to discuss the parse in detail, we'll see that it follows this structure. As an aside, note Ada's mention of "logical completeness" as one of the virtues of math. Gödel came along nearly a century later and showed this vision, which went back to the Greeks, was an illusion. So Ada did not predict everything. On the other hand, Gödel's result was also a complete surprise to Johnny von Neumann, who was in the room that day.
I've gotten Marpa to grind through this sentence, using the same framework as the Stanford NLP demo. That demo, in fact, refuses to even attempt any sentence longer than 70 words, so my Ada quote needs to be broken up. Even on the smaller pieces, the Stanford demo becomes quite slow. Marpa, by contrast, grinds through the whole thing quickly. The Stanford demo is based on a CYK parser, and presumably is O(n3) -- cubic. Marpa seems to be exhibiting linear behavior.
Promising as this seems for Marpa, its first results may not hold up as the experiment gets more realistic. So far, I've only given Marpa enough English grammar and vocabulary to parse this one sentence. That is enough to make the grammar very complex and ambiguous, but even so it must be far less complex and ambiguous than the one behind the Stanford demo. Marpa will never have time worse than O(n3), but it's quite possible that if Marpa's grammar were as ambiguous as the Stanford one, Marpa would be no faster. Marpa, in fact, could turn out to be slower by some linear factor.
There may never be a final decision based on speed. Marpa might turn out to represent one approach, good for certain purposes. And, especially when speed is indecisive, other abilities can prove more important.
Marpa::R2 is available on CPAN. A list of my Marpa tutorials can be found here. There are new tutorials by Peter Stuifzand and amon. The Ocean of Awareness blog focuses on Marpa, and it has an annotated guide. Marpa also has a web page. For questions, support and discussion, there is the "marpa parser" Google Group. Comments on this post can be made there.
posted at: 14:24 | direct link to this entry