Wed, 07 Dec 2011
How to parse HTML, part 2
This is the second of a series of posts that details
a Marpa-based, "Ruby Slippers"
approach to parsing liberal
and defective HTML.
This post assumes you have
the first post.
First, reduce the HTML to a token stream
Most computer languages can be viewed
as a token stream.
HTML is not an exception.
HTML tokens can be blocks of text;
comments and various other SGML entities;
HTML element start tags;
and HTML element end tags.
The HTML token stream is unusual in that
some of its tokens can
be quite complex internally.
In parsing computer languages,
it is a frequent practice to divide the
work between a tokenizer ("lexer")
and a high-level parser
The lexer takes the raw input
and turns it into a token stream.
Tokenizing HTML is a difficult job,
and one for which there is an excellent CPAN module:
to do its tokenization.
determines the large scale structure of the HTML document --
what I will call in this post, "high-level" parsing.
The result of high-level parsing can be
seen as a hierarchical structure.
The goal of high-level parsing is
to build a hierarchy which reflects
the structure of the document.
Conventionally, this hierarchy is visualized
as an upside-down tree,
one where the "leaves" are the tokens,
and where the "root" node represents the
document as a whole.
An HTML document contains components
of many kinds,
but the hard part of determining its structure
is deciding where to begin and end its HTML elements.
Know this and you know how to arrange the HTML elements in
Know how to arrange the HTML elements in a hierarchy,
and you have produced essentially the entire parse
tree for the HTML document.
When it comes to knowing where to start
and end HTML elements
in liberal HTML,
there are complications aplenty.
Conceptually, every HTML element has a start
and end tag but,
even in fully standard-conforming input,
not all of these tags need to be in the physical input.
In real-life HTML, start and end tags are often
Additionally, a liberal HTML parser
is expected to be robust in
the face of spurious tags.
Second, do a census of the HTML elements
In HTML according to the standards,
there is a fixed list of HTML elements.
In theory, a liberal HTML parser could work
from a list which is a superset of all the standards,
plus a list of the known browser-specific tags.
takes another approach.
It makes a first pass over its input to "census"
the elements in the HTML document,
and builds its grammar based on that census.
This has several advantages:
- It is an easy way to deal with spurious tags.
- Vendor-specific tags can handled using the same mechanisms
that deal with spurious tags.
This is reasonable since the boundary between vendor-specific
and spurious tags is not sharp.
- It makes
robust in the
face of new HTML standards.
- It means
would be easy
to extend to deal with XML.
- The grammar avoids having lots of rules for
elements which do not actually exist in the input.
Third, create a grammar
a new grammar for every parse.
Much of the grammar is constant -- there is a "framework".
This framework already includes some of the HTML elements.
Into this framework,
inserts rules for
the additional elements discovered during the
Each element has only one rule,
which takes the form:
ELEMENT ::= START_TAG CONTENTS END_TAG
The grammar assumes that every HTML element
has both a start tag and an end tag.
From the grammar's point of view, these are never
grammar is therefore,
not just strict,
If there were no Ruby Slippers,
the over-strict grammar
would fail to parse many
standard-conforming HTML documents,
and would be hopeless at dealing with liberal HTML.
Fourth, determine which tokens are allowed to be virtual
A necessary complement to
is a list of the possible "virtual"
"Virtual" here means that the token is not present in the
Virtual tokens are tokens that the Ruby Slippers logic
is allowed to invent
in order to make the over-strict grammar's wishes
We don't want all tokens to be allowed as virtual tokens.
For example, we don't want the Ruby Slippers logic
What we want the Ruby Slippers to do is to supply
just enough structure to make the HTML document
conform to the over-strict grammar.
The list of virtual tokens in
consists entirely of element tags.
Every end tag is allowed as a virtual token.
This is necessary for liberal HTML parsing with
the over-strict grammar,
because it enables
to supply a missing end tag for any HTML element.
For start tags,
the list of allowed virtual tokens is much shorter.
Four start tags are allowed as virtual tokens,
because the HTML standards allow them to be
and because they are needed
according to the over-strict grammar.
The four are
goes beyond the standards in
using start tags to
repair a defective table.
The tags allowed as needed to repair a defective table are
Fifth, apply the Ruby Slippers
With all this set up,
we can do the actual parsing.
is used as the parse engine.
Almost certainly, the parse engine will soon
receive a token which is not acceptable to the over-strict grammar.
At that point the parse engine will complain
that it cannot continue.
allows the application to retry with another token.
very special ability
to produce, at any point in the parse,
a list of those tokens that it is willing to accept.
asks to see the list of acceptable tokens and compares it to its
list of virtual tokens.
If exactly one of the allowed virtual tokens is acceptable as input,
invents a token of that type
and passes it on to
When the physical input is standard-conforming HTML,
at every point where the over-strict grammar has a problem,
one and only one virtual token
will be acceptable as input.
is designed to handle
must deal with
- Cruft: tokens that nothing can make right; and
- Choices: situations where more than one
of the virtual tokens
is acceptable to the over-strict grammar.
Sometimes a token appears in the physical input that nothing can
One example of cruft
is a head element start tag
inside an HTML
There cannot be more than one HTML
head element in any HTML document, so this token
has to be classified as cruft.
has a special "cruft" pseudo-class.
The application is allowed to specify a cruft handler,
throw the cruft away,
repackage it as a comment, etc., etc.
The default behavior is simply to pass cruft on, untouched.
When there is a choice of virtual token
If an HTML document is standard-conforming HTML, it
is not possible for more one virtual token to be acceptable
at any point.
In real life,
and therefore in parsing liberal HTML,
such things are not just possible,
but quite common.
between virtual tokens
using rules of thumb.
These often involve tradeoffs.
when it comes to ending an element versus starting one,
the preference is, in general, to start one.
But there are exceptions to this.
One important exception is when starting an element
would lead to an infinite loop.
It would be nice if
behavior in dealing with choices among virtual tokens
was configurable -- as of this writing it is not.
Actually, what I call "high-level parsing" in this
post is, in most textbooks, simply called parsing.
In this kind of context,
lexical analysis is usually considered distinct from
Here, however, I have the problem that the lexing is
done by a module named
Parsing is a loose term.
Lexical analysis is called parsing in some contexts,
name is not actually wrong.
But it is certainly inconvenient,
when the context is a description of what
The solution adopted in
this discussion is to
make a distinction between
"low-level parsing" (lexical analysis or tokenization)
and "high-level" parsing (determining the
full structure of a document).
to XML would be easy,
but I have not done it.
is all about dealing
with the difficult and
varied HTML syntaxes,
and with liberal and defective input.
XML's syntax is well defined
and extremely easy-to-parse.
a conforming XML parser is NOT allowed
to be liberal --
at the first encounter with less-than-perfect
input, a conforming XML parser
must produce an error message
and thereafter it may produce nothing
else but error messages.
When it comes to parsing perfect input
for an easy-to-parse and well-defined language,
approach, frankly, offers
no advantage over the previous
state of the art.
"a new grammar for every parse":
As an alternative strategy,
could use a fixed
grammar with a lot of "fill in the blank"
But it is not clear
that this would be an improvement.
Actually, one can
imagine specific applications where filling
in missing content would be useful.
a site might have headers and/or footers
that need to be on every HTML page.
Or there could be obligatory disclaimers,
license information, etc.
"repair a defective table":
Repairing tables is an impressive trick,
but is not completely clear to me that this is the best
thing to do:
discarding stray table tags as cruft might be a better idea.
However, aggressive table repair is how
and in writing
I decided to follow
posted at: 14:10 |
direct link to this entry