How to parse HTML, part 2

Wed, 07 Dec 2011

How to parse HTML, part 2

This is the second of a series of posts that details a Marpa-based, "Ruby Slippers" approach to parsing liberal and defective HTML. This post assumes you have read the first post.

First, reduce the HTML to a token stream

Most computer languages can be viewed as a token stream. HTML is not an exception. HTML tokens can be blocks of text; comments and various other SGML entities; HTML element start tags; and HTML element end tags. The HTML token stream is unusual in that some of its tokens can be quite complex internally.

In parsing computer languages, it is a frequent practice to divide the work between a tokenizer ("lexer") and a high-level parser. The lexer takes the raw input and turns it into a token stream. Tokenizing HTML is a difficult job, and one for which there is an excellent CPAN module: HTML::Parser. Marpa::HTML relies on HTML::Parser to do its tokenization.

Marpa::HTML determines the large scale structure of the HTML document -- what I will call in this post, "high-level" parsing. The result of high-level parsing can be seen as a hierarchical structure. The goal of high-level parsing is to build a hierarchy which reflects the structure of the document. Conventionally, this hierarchy is visualized as an upside-down tree, one where the "leaves" are the tokens, and where the "root" node represents the document as a whole.

An HTML document contains components of many kinds, but the hard part of determining its structure is deciding where to begin and end its HTML elements. Know this and you know how to arrange the HTML elements in a hierarchy. Know how to arrange the HTML elements in a hierarchy, and you have produced essentially the entire parse tree for the HTML document.

When it comes to knowing where to start and end HTML elements in liberal HTML, there are complications aplenty. Conceptually, every HTML element has a start and end tag but, even in fully standard-conforming input, not all of these tags need to be in the physical input. In real-life HTML, start and end tags are often missing. Additionally, a liberal HTML parser is expected to be robust in the face of spurious tags.

Second, do a census of the HTML elements

In HTML according to the standards, there is a fixed list of HTML elements. In theory, a liberal HTML parser could work from a list which is a superset of all the standards, plus a list of the known browser-specific tags.

Marpa::HTML takes another approach. It makes a first pass over its input to "census" the elements in the HTML document, and builds its grammar based on that census. This has several advantages:

It is an easy way to deal with spurious tags.
Vendor-specific tags can handled using the same mechanisms that deal with spurious tags. This is reasonable since the boundary between vendor-specific and spurious tags is not sharp.
It makes Marpa::HTML robust in the face of new HTML standards.
It means Marpa::HTML would be easy to extend to deal with XML.
The grammar avoids having lots of rules for elements which do not actually exist in the input.

Third, create a grammar

Marpa::HTML creates a new grammar for every parse. Much of the grammar is constant -- there is a "framework". This framework already includes some of the HTML elements. Into this framework, Marpa::HTML inserts rules for the additional elements discovered during the census. Each element has only one rule, which takes the form:

      ELEMENT ::= START_TAG CONTENTS END_TAG

The grammar assumes that every HTML element has both a start tag and an end tag. From the grammar's point of view, these are never optional. Marpa::HTML's grammar is therefore, not just strict, but over-strict. If there were no Ruby Slippers, the over-strict grammar would fail to parse many standard-conforming HTML documents, and would be hopeless at dealing with liberal HTML.

Fourth, determine which tokens are allowed to be virtual

A necessary complement to Marpa::HTML's over-strict grammar is a list of the possible "virtual" tokens. "Virtual" here means that the token is not present in the physical input. Virtual tokens are tokens that the Ruby Slippers logic is allowed to invent in order to make the over-strict grammar's wishes come true.

We don't want all tokens to be allowed as virtual tokens. For example, we don't want the Ruby Slippers logic inventing text. What we want the Ruby Slippers to do is to supply just enough structure to make the HTML document conform to the over-strict grammar.

The list of virtual tokens in Marpa::HTML consists entirely of element tags. Every end tag is allowed as a virtual token. This is necessary for liberal HTML parsing with the over-strict grammar, because it enables Marpa::HTML to supply a missing end tag for any HTML element.

For start tags, the list of allowed virtual tokens is much shorter. Four start tags are allowed as virtual tokens, because the HTML standards allow them to be missing, and because they are needed according to the over-strict grammar. The four are html, head, body, and tbody. Marpa::HTML goes beyond the standards in using start tags to repair a defective table. The tags allowed as needed to repair a defective table are table, tr, and td.

Fifth, apply the Ruby Slippers

With all this set up, we can do the actual parsing. Marpa::XS is used as the parse engine. Almost certainly, the parse engine will soon receive a token which is not acceptable to the over-strict grammar. At that point the parse engine will complain that it cannot continue.

Marpa::XS allows the application to retry with another token. Marpa::XS also has the very special ability to produce, at any point in the parse, a list of those tokens that it is willing to accept. Marpa::HTML asks to see the list of acceptable tokens and compares it to its list of virtual tokens. If exactly one of the allowed virtual tokens is acceptable as input, Marpa::HTML invents a token of that type and passes it on to Marpa::XS.

When the physical input is standard-conforming HTML, at every point where the over-strict grammar has a problem, one and only one virtual token will be acceptable as input. However, Marpa::HTML is designed to handle arbitrary input. That means Marpa::HTML must deal with

Cruft: tokens that nothing can make right; and
Choices: situations where more than one of the virtual tokens is acceptable to the over-strict grammar.

Cruft

Sometimes a token appears in the physical input that nothing can make right. One example of cruft is a head element start tag inside an HTML body element. There cannot be more than one HTML head element in any HTML document, so this token has to be classified as cruft. For these, Marpa::HTML has a special "cruft" pseudo-class. The application is allowed to specify a cruft handler, which can throw the cruft away, repackage it as a comment, etc., etc. The default behavior is simply to pass cruft on, untouched.

When there is a choice of virtual token

If an HTML document is standard-conforming HTML, it is not possible for more one virtual token to be acceptable at any point. In real life, and therefore in parsing liberal HTML, such things are not just possible, but quite common.

Marpa::HTML currently chooses between virtual tokens using rules of thumb. These often involve tradeoffs. For example, when it comes to ending an element versus starting one, the preference is, in general, to start one. But there are exceptions to this. One important exception is when starting an element would lead to an infinite loop. It would be nice if Marpa::HTML's behavior in dealing with choices among virtual tokens was configurable -- as of this writing it is not.

Notes

"high-level parsing": Actually, what I call "high-level parsing" in this post is, in most textbooks, simply called parsing. In this kind of context, lexical analysis is usually considered distinct from parsing. Here, however, I have the problem that the lexing is done by a module named HTML::Parser.

Parsing is a loose term. Lexical analysis is called parsing in some contexts, so HTML::Parser's name is not actually wrong. But it is certainly inconvenient, when the context is a description of what Marpa::HTML does. The solution adopted in this discussion is to make a distinction between "low-level parsing" (lexical analysis or tokenization) and "high-level" parsing (determining the full structure of a document).
"XML": Extension of Marpa::HTML to XML would be easy, but I have not done it. Marpa::HTML is all about dealing with the difficult and varied HTML syntaxes, and with liberal and defective input. XML's syntax is well defined and extremely easy-to-parse. Furthermore, a conforming XML parser is NOT allowed to be liberal -- at the first encounter with less-than-perfect input, a conforming XML parser must produce an error message and thereafter it may produce nothing else but error messages. When it comes to parsing perfect input for an easy-to-parse and well-defined language, the Marpa::HTML approach, frankly, offers no advantage over the previous state of the art.
"a new grammar for every parse": As an alternative strategy, Marpa::HTML could use a fixed grammar with a lot of "fill in the blank" elements. But it is not clear that this would be an improvement.
"inventing text": Actually, one can imagine specific applications where filling in missing content would be useful. For example, a site might have headers and/or footers that need to be on every HTML page. Or there could be obligatory disclaimers, license information, etc.
"repair a defective table": Repairing tables is an impressive trick, but is not completely clear to me that this is the best thing to do: discarding stray table tags as cruft might be a better idea. However, aggressive table repair is how HTML::Tree handles things, and in writing Marpa::HTML I decided to follow its precedent.

posted at: 14:10 | direct link to this entry

§ § §

Ocean of Awareness

Marpa resources

Wed, 07 Dec 2011