Wed, 14 Dec 2011
How to parse HTML, part 3
When a solution has the same shape
as the problem,
it is a very good thing,
and not just because it looks pretty.
I have described
a Marpa-based, "Ruby Slippers"
approach to parsing liberal
and defective HTML.
A major advantage
is that it looks like
the problem it solves.
HTML parsing: the problem
The problem of parsing an HTML document
is the problem of finding
the hierarchy of its HTML elements.
HTML elements are delimited by start and end tags.
The HTML standards specify that certain of the
start and end tags can be omitted.
In liberal and defective HTML,
any HTML tag might be missing.
In liberal and defective HTML,
unknown and spurious tags
may be present in the physical input.
HTML parsing: the solution
The parse engine uses an over-strict grammar,
one which requires all HTML start and end tags.
When the parse engine runs into a token it cannot accept,
if there is exactly one start or end tag which it
could accept at that point,
the parser uses "the Ruby Slippers".
It invents a virtual token representing the desired
tag, and feeds it to the parse engine.
If more than one virtual token is possible,
the parser chooses one to pass on to the parse engine.
In the current implementation,
this is done using rules of thumb.
If no virtual token is possible,
the physical token is treated as "cruft".
The grammar allows cruft to be a part of the contents
of any HTML element,
and the application can decide what to do with it.
This outline of the solution
follows the structure of the problem point for point.
In turn, the code follows this outline.
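To make the outline concrete, here is a toy sketch in Python.
It is not the Marpa-based code, and every name in it (GRAMMAR, StrictEngine, ruby_slippers, rule_of_thumb) is hypothetical.
A small state table stands in for the parse engine and its over-strict grammar, and the rule of thumb shown is only an assumed tie-breaker.
One further assumption: the sketch tries virtual tokens on a scratch copy of the engine, and treats the physical token as cruft if the attempt starts going in circles.

```python
TEXT, END = "#text", "#end"  # non-tag symbols: a run of text, end of input

# Over-strict toy grammar as a state table: every start and end tag is required.
# Each state maps the symbols it accepts to the next state.
GRAMMAR = {
    "start":      {"<html>": "html"},
    "html":       {"<head>": "head"},
    "head":       {"</head>": "after_head"},
    "after_head": {"<body>": "body"},
    "body":       {"<p>": "p", "</body>": "after_body"},
    "p":          {TEXT: "p", "</p>": "body"},
    "after_body": {"</html>": "done"},
    "done":       {END: "done"},
}


class StrictEngine:
    """Hypothetical stand-in for the parse engine: accepts only what the grammar allows."""

    def __init__(self, state="start"):
        self.state = state

    def accepts(self, symbol):
        return symbol in GRAMMAR[self.state]

    def expected_tags(self):
        # Start and end tags (never text) that would be accepted right now.
        return [s for s in GRAMMAR[self.state] if s.startswith("<")]

    def read(self, symbol):
        self.state = GRAMMAR[self.state][symbol]


def symbol_of(token):
    # Anything that does not look like a tag is text.
    return token if token.startswith("<") else TEXT


def rule_of_thumb(candidates, symbol):
    # Assumed tie-breaker: open an element for text or a start tag,
    # close one for an end tag or end of input.
    opening = symbol == TEXT or (symbol.startswith("<") and not symbol.startswith("</"))
    for tag in candidates:
        if tag.startswith("</") != opening:
            return tag
    return candidates[0]


def ruby_slippers(engine, symbol):
    """Return the virtual tags to feed before `symbol`, or None if it is cruft."""
    trial = StrictEngine(engine.state)  # scratch copy; the real engine is untouched
    invented, seen = [], {trial.state}
    while not trial.accepts(symbol):
        candidates = trial.expected_tags()
        if not candidates:
            return None  # no virtual token is possible
        if len(candidates) == 1:
            tag = candidates[0]  # exactly one: the Ruby Slippers case
        else:
            tag = rule_of_thumb(candidates, symbol)
        trial.read(tag)
        if trial.state in seen:
            return None  # going in circles: nothing will make this token fit
        seen.add(trial.state)
        invented.append(tag)
    return invented


def parse(tokens):
    engine, events = StrictEngine(), []
    stream = [(symbol_of(t), t) for t in tokens] + [(END, None)]
    for symbol, token in stream:
        invented = ruby_slippers(engine, symbol)
        if invented is None:
            events.append(("cruft", token))  # cruft is allowed anywhere
            continue
        for tag in invented:
            engine.read(tag)
            events.append(("virtual", tag))
        engine.read(symbol)
        if token is not None:
            events.append(("physical", token))
    return events


for kind, token in parse(["hello", "<p>", "world", "</blink>"]):
    print(kind, token)
```

Given the physical tokens hello, <p>, world and </blink>, the sketch invents <html>, <head>, </head>, <body> and <p> before the first text.
It invents </p> before the physical <p>, reports </blink> as cruft, and closes the open elements at end of input.
The driver has one branch for each case in the outline: exactly one candidate, more than one, and none.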
It may seem
that I just stated the painfully obvious,
but in fact the design of
the parsers in use today typically
does NOT reflect the structure of their target languages
in any straightforward way.
In particular, the more a parser is
considered "production quality",
the less likely its code will bear any resemblance to
the problem it is solving.
Toward hackable parsers
A lot could be said about the
aesthetics and philosophy of this.
In this post,
let me cut straight to the bottom line.
First and least important,
it is usually easier to code a solution which looks like the problem.
I say "least important,"
because this perspective views the problem as static,
and if the problem is static you can code it up
and forget it.
It does not matter too much whether
the coding effort is fast,
if it only has to be done once.
But what if the problem keeps changing?
You might say that most parsing is of the static type,
and that's true.
But that is
because previous technology has left little
choice in the matter.
I believe that,
if programmers had the option of hacking production-quality
parsers, they'd be doing it all the time.
In the past,
hacking production-quality parsers has been,
for practical purposes, impossible.
Look at those existing utilities which do work with, for
example, C, HTML or Perl.
These usually do NOT even attempt to leverage the production parser
for these languages.
Instead these tools use a new parser,
one created from scratch.
One consequence is that
they must tolerate a considerable amount
of approximation in the parsing.
Why don't programmers take the production parsers for a language
as the basis for tools working with that language?
If you look at those production parsers,
you'll see why.
They reflect the structure of the languages so little,
and are so complex,
that they simply are unusable as a starting point.
A Marpa-powered "Ruby Slippers" approach to HTML,
like the one implemented in
but with its HTML interpretation layer rewritten in C,
would be very competitive as a production HTML parser.
Not the least of its advantages would be that it would make
an excellent basis for HTML utilities.
The previous posts in this series were
"How to parse HTML" and
"How to parse HTML, part 2".
posted at: 20:44