Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Sun, 21 Oct 2012

Configuring the Ruby Slippers for HTML

This post is part of a series describing Marpa::R2::HTML, a configurable HTML parser. The last two posts described how to change the context and contents of the HTML elements, both new and existing. This post describes how to configure optional start tags: how to change which start tags are optional, and how to specify the circumstances in which they will be supplied.

How the parser works

In the first posts in this series I went into some detail describing my Marpa-based approach to HTML parsing. Briefly, it combines a parse engine using a "wishful thinking" grammar with a Ruby Slippers lexer. The "wishful thinking" grammar expects all elements, without exception, to have both start and end tags. This overstrict grammar demands tags even in cases where the HTML 4.01 Strict DTD mandates that they be treated as optional.

The overstrict grammar is liberalized by the Ruby Slippers. Marpa has an unusual property among parsers -- it is fully informed about the state of the parse at all points, and can conveniently and efficiently share that information with the application. In Marpa::R2::HTML, when the parse engine, with its overstrict grammar, grinds to a halt for lack of a tag that does not exist in the physical input, the lexer can ask the parse engine which tag it is looking for. It can then dummy one up, feed it to the parse engine, and start things back up. It's as simple as that.

For HTML end tags, the Ruby Slippers work stunningly well. Only one end tag will be expected at any point. In cases where a stack of elements must be properly terminated, the parse engine will request the end tags, one at a time, in proper order. The grammar can simplify life for itself by demanding a perfect world, and on the lexer's side, things are no harder -- it just has to do what it is told.
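To make the end-tag case concrete, here is a minimal sketch in Python. It is not Marpa::R2::HTML's code: StrictEngine, expected_end_tag and ruby_slippers_lex are hypothetical names, and the "engine" is just a stack of open elements standing in for a real over-strict parser. The point is only the shape of the loop: when a token is rejected, the lexer asks what is expected, supplies it, and tries again.

```python
class StrictEngine:
    """Toy stand-in for an over-strict parse engine (hypothetical, not Marpa).

    It insists that every element be closed by an explicit end tag.
    """

    def __init__(self):
        self.open_elements = []  # names of elements still awaiting an end tag
        self.accepted = []       # every token the engine has accepted so far

    def expected_end_tag(self):
        """Return the one end tag the engine would accept now, or None."""
        if self.open_elements:
            return "</" + self.open_elements[-1] + ">"
        return None

    def read(self, token):
        """Return True if the token is accepted, False if it is rejected."""
        if token.startswith("</"):
            if token != self.expected_end_tag():
                return False
            self.open_elements.pop()
        elif token.startswith("<"):
            self.open_elements.append(token[1:-1])
        # In this toy, text is always accepted.
        self.accepted.append(token)
        return True


def ruby_slippers_lex(tokens):
    """Feed tokens to the engine, supplying missing end tags on demand."""
    engine = StrictEngine()
    for token in tokens:
        while not engine.read(token):
            wanted = engine.expected_end_tag()
            if wanted is None:
                raise ValueError("cannot repair input at %r" % (token,))
            engine.read(wanted)  # dummy up the tag the engine asked for
    # At end of input, close whatever is still open, innermost first.
    while engine.expected_end_tag() is not None:
        engine.read(engine.expected_end_tag())
    return engine.accepted


print(ruby_slippers_lex(["<table>", "<tr>", "<td>", "cell", "</table>"]))
```

In the sample call, the physical input closes only the table. The lexer supplies the end tags for the cell and the row, one at a time and innermost first, because that is the order in which the engine asks for them.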

For the very few start tags that are optional according to the Strict HTML 4.01 DTD, things are just as simple -- they occur in places where only one at a time will be demanded, and the Ruby Slippers lexer need only do what it is told to. However, if you want to further liberalize HTML, there will be cases where there is a choice between start tags; or between starting one element and ending another.

Configuring the Ruby Slippers

In the last post, I showed how to configure Marpa::R2::HTML to allow or disallow text directly in the <body> element. If Marpa::R2::HTML was configured to disallow text directly in the <body> element, and it encountered such text, Marpa::R2::HTML would start a block. The block was started by supplying a <p> start tag in front of the text. In other words, Marpa::R2::HTML treated the <p> start tag as optional.

Let me give an example. Suppose the HTML document consisted of the string

Hello, world

and that, using the default configuration, we ran html_fmt as follows:

echo 'Hello, world' |
/Users/jeffreykegler/perl5/bin/marpa_r2_html_fmt --no-added-tag-comment

This would be our result:

<html>
  <head>
  </head>
  <body>
    <p>
      Hello, world
    </p>
  </body>
</html>

This was produced using the default configuration, which resides in the g/config/default.txt file. (All the examples in this post use version 2.022000 of Marpa::R2.)

First, the results

Let's change the behavior of Marpa::R2::HTML so that, instead of starting a new <p> element, it will reject the text as cruft. We create a new configuration, putting it into a file named g/config/reject_text.txt.

Creating the configuration will not be difficult, but it will perhaps be easiest to understand if we first see the result that we are aiming at. Again we run html_fmt:

echo 'Hello, world' |
/Users/jeffreykegler/perl5/bin/marpa_r2_html_fmt \
  --compile g/config/reject_text.txt --no-added-tag-comment

And this is our new result:

<html>
  <head>
  </head>
  <body>
    <!-- html_fmt: Next line is cruft -->
    Hello, world
  </body>
</html>

Note that in this second example, there are no tags for the <p> element, and that the text is now labeled as "cruft", as desired.

How it was done

How would we change the default configuration file to refuse to start a new <p> element in front of text? The three relevant lines are:

@block_rubies  = <html> <head> <body>
@inline_rubies = @block_rubies <tbody> <tr> <td> <p>
PCDATA -> @inline_rubies

The symbols with an "@" sigil are lists, which the configuration file uses as a convenient shorthand for groups of symbols that occur frequently. For convenience in this discussion, let's expand them, so that the relevant extract looks like this:

PCDATA -> <html> <head> <body> <tbody> <tr> <td> <p>
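The expansion can be done mechanically. The following Python sketch is an illustration only: expand_rubies is a hypothetical helper, it handles just the two line shapes shown in this post (list definitions with "=" and Ruby Slippers lines with "->"), and it assumes a list is defined before it is used. It is not the parser that Marpa::R2::HTML applies to its configuration files.

```python
def expand_rubies(config_lines):
    """Expand "@name" list references in Ruby Slippers configuration lines.

    Returns a dict mapping each rejected token (for example "PCDATA")
    to its fully expanded list of candidate start tags.
    """
    lists = {}  # "@name" -> expanded members
    rules = {}  # rejected token -> expanded candidates

    def expand(words):
        expanded = []
        for word in words:
            if word.startswith("@"):
                expanded.extend(lists[word])  # assumes the list is already defined
            else:
                expanded.append(word)
        return expanded

    for line in config_lines:
        line = line.strip()
        if not line:
            continue
        if "->" in line:
            token, candidates = line.split("->", 1)
            rules[token.strip()] = expand(candidates.split())
        elif "=" in line:
            name, members = line.split("=", 1)
            lists[name.strip()] = expand(members.split())
    return rules


DEFAULT_LINES = [
    "@block_rubies  = <html> <head> <body>",
    "@inline_rubies = @block_rubies <tbody> <tr> <td> <p>",
    "PCDATA -> @inline_rubies",
]
print("PCDATA -> " + " ".join(expand_rubies(DEFAULT_LINES)["PCDATA"]))
```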

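In the configuration file, PCDATA can be thought of as non-whitespace text, occurring in a context which is parsed for markup and entities. (Precisely, it is whatever HTML::Parser returns as text that is not whitespace and does not turn on the is_cdata flag.) What this line says is that, whenever a PCDATA token is rejected, Marpa::R2::HTML should try to fix the problem by working through the following alternatives:

1. If an <html> start tag is allowed, supply one.
2. If a <head> start tag is allowed, supply one.
3. If a <body> start tag is allowed, supply one.
4. If a <tbody> start tag is allowed, supply one.
5. If a <tr> start tag is allowed, supply one.
6. If a <td> start tag is allowed, supply one.
7. If a <p> start tag is allowed, supply one.
8. If an end tag is allowed, supply it.
9. Label the token as "cruft".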

Of these alternatives, the first three allow Marpa::R2::HTML to supply missing structural start tags, as required by the standards. Alternatives 4, 5 and 6 allow Marpa::R2::HTML to continue building a table if table-building is in progress. (But note that the line does not allow Marpa::R2::HTML to deal with rejected PCDATA by starting a new table.) Alternative 7 allows Marpa::R2::HTML to start a new <p> element if PCDATA is rejected.

Alternatives 8 and 9 are implicit. By default, after all the explicit Ruby Slippers alternatives have been tried, Marpa::R2::HTML will create a Ruby Slippers tag for any end tag that is allowed, with two exceptions: it will not create </body> and </html> end tags except at the end of file. Finally, Marpa::R2::HTML always reserves the possibility of, as a last resort, labeling a token as "cruft" and moving on.
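The order in which these alternatives are tried can be sketched in a few lines of Python. Everything here is an assumption made for illustration: choose_repair, its arguments, and the representation of "what the parse engine would accept" as a plain list of tags are all hypothetical, and the logic is read off the description above rather than taken from Marpa::R2::HTML itself.

```python
# End tags that are supplied only at the end of file.
END_OF_FILE_ONLY = {"</body>", "</html>"}


def choose_repair(candidates, acceptable, at_end_of_file=False):
    """Decide how to react to a rejected token.

    candidates: the explicit Ruby Slippers start tags, in configuration order.
    acceptable: the tags the parse engine reports it would accept right now.
    Returns ("start", tag), ("end", tag) or ("cruft", None).
    """
    # Explicit alternatives: the first configured start tag that is acceptable.
    for start_tag in candidates:
        if start_tag in acceptable:
            return ("start", start_tag)
    # Implicit alternative: any acceptable end tag, except that </body>
    # and </html> are held back until the end of file.
    for tag in acceptable:
        if tag.startswith("</"):
            if tag in END_OF_FILE_ONLY and not at_end_of_file:
                continue
            return ("end", tag)
    # Last resort: label the token as cruft and move on.
    return ("cruft", None)


DEFAULT = ["<html>", "<head>", "<body>", "<tbody>", "<tr>", "<td>", "<p>"]
WITHOUT_P = [tag for tag in DEFAULT if tag != "<p>"]

# Text directly inside <body>: assume the engine would accept <p> or </body>.
print(choose_repair(DEFAULT, ["<p>", "</body>"]))
print(choose_repair(WITHOUT_P, ["<p>", "</body>"]))
```

With the full candidate list the sketch picks the <p> start tag. With <p> removed from the candidates, no start tag qualifies and </body> is held back, so the text falls through to cruft.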

Once you understand how the Ruby Slippers configuration lines work, the fix in this case becomes obvious: eliminate <p> as one of the alternatives considered for the Ruby Slippers. In terms of the expanded line, this means changing it to

PCDATA -> <html> <head> <body> <tbody> <tr> <td>

In terms of the original set of lines, this means changing the one for the @inline_rubies list:

@inline_rubies = @block_rubies <tbody> <tr> <td>

In the Ruby Slippers configuration lines of the default configuration file, the @inline_rubies list is the only place that the <p> tag is mentioned. So changing @inline_rubies has the effect of eliminating <p> as an optional start tag. Only <p> tags actually in the physical input will be recognized. This is what was actually done in g/config/reject_text.txt, the configuration file used in our example.

Code and comments

Comments on this post can be sent to the Marpa Google Group: marpa-parser@googlegroups.com

posted at: 09:48