Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest
When a solution has the same shape as the problem, it is a very good thing, and not just because it looks pretty. In previous posts, I have described Marpa::HTML, a Marpa-based, "Ruby Slippers" approach to parsing liberal and defective HTML. A major advantage of Marpa::HTML is that it looks like the problem it solves.
This outline of the solution follows the structure of the problem point for point. In turn, the code follows this outline. It may seem that I just stated the painfully obvious, but in fact the design of the parsers in use today typically does NOT reflect the structure of their target languages in any straightforward way. In particular, the more a parser is considered "production quality", the less likely its code will bear any resemblance to the problem it is solving.
A lot could be said about the aesthetics and philosophy of this. In this post, let me cut straight to the bottom line.
First and least important, it is usually easier to code a solution which looks like the problem. I say "least important," because this perspective views the problem as static, and if the problem is static you can code it up and forget it. It does not matter too much whether the coding effort is fast, if it only has to be done once. But what if the problem keeps changing?
You might say that most parsing is of the static type, and that's true. But that is because previous technology has left little choice in the matter. I believe that, if programmers had the option of hacking production-quality parsers, they'd be doing it all the time.
In the past, hacking production-quality parsers has been, for practical purposes, impossible. Look at the existing utilities that work with, for example, C, HTML or Perl. These usually do NOT even attempt to leverage the production parsers for these languages. Instead, these tools use a new parser, one created from scratch. One consequence is that they must tolerate a considerable amount of approximation in the parsing.
Why don't programmers take the production parsers for a language as the basis for tools working with that language? If you look at those production parsers, you'll see why. They reflect the structure of their languages so little, and are so complex, that they are simply unusable as a starting point for tools.
A Marpa-powered "Ruby Slippers" approach to HTML, like the one implemented in Marpa::HTML but with its HTML interpretation layer rewritten in C, would be very competitive as a production HTML parser. Not the least of its advantages would be that it would make an excellent basis for HTML utilities.
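To give a feel for what "Ruby Slippers" means here, the following is a toy sketch in Python. It is not Marpa::HTML's code and not Marpa's algorithm: it assumes a deliberately tiny HTML subset (bare start tags, end tags and text, with no attributes or void elements), and the name `ruby_slippers_tokens` is hypothetical. The idea it illustrates is that the strict side of the parser only ever wants properly nested tags, so when the real input fails to supply an end tag that is expected, the lexer invents a virtual one.

```python
import re

# Toy tokenizer: start tags, end tags, and runs of text (assumed subset of HTML).
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)")


def ruby_slippers_tokens(html):
    """Yield (kind, value, is_virtual) tokens with missing end tags supplied."""
    open_elements = []  # elements whose end tags the strict parser still expects
    for match in TOKEN.finditer(html):
        closing, name, text = match.groups()
        if text is not None:
            yield ("text", text, False)
        elif not closing:
            open_elements.append(name.lower())
            yield ("start", name.lower(), False)
        else:
            name = name.lower()
            if name not in open_elements:
                continue  # stray end tag: this toy simply drops it
            # The strict parser wants the innermost end tag first, so
            # invent virtual end tags until the real one is acceptable.
            while open_elements[-1] != name:
                yield ("end", open_elements.pop(), True)
            open_elements.pop()
            yield ("end", name, False)
    # At end of input, grant the parser every end tag it is still waiting for.
    while open_elements:
        yield ("end", open_elements.pop(), True)


if __name__ == "__main__":
    for token in ruby_slippers_tokens("<p><b>hi</p>"):
        print(token)
```

In this sketch the shape of the code mirrors the shape of the problem: one branch per kind of token, plus one rule for supplying what the strict parser is wishing for. A real implementation would take its list of expected tokens from the parser itself rather than from a hand-kept stack.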
"previous posts": The previous posts in this series were "How to parse HTML" and "How to parse HTML, part 2".
posted at: 20:44