Sun, 21 Oct 2012
Configuring the Ruby Slippers for HTML
This post is part of a series on Marpa::R2::HTML,
a configurable HTML parser.
The last two posts described how to change
the context and contents of the HTML
elements, both new and existing.
This post describes how to configure
optional start tags: how to change
which start tags are optional,
and how to specify the circumstances
in which they will be supplied.
How the parser works
In the first posts in this series I went into some detail describing
my Marpa-based approach to HTML parsing.
Briefly, it combines a parse engine using a "wishful thinking" grammar
with a Ruby Slippers lexer.
The "wishful thinking" grammar expects all elements
to have both start and end tags.
This overstrict grammar demands tags even in cases
where the HTML 4.01 Strict DTD
mandates that they be treated as optional.
The overstrict grammar is liberalized by the Ruby Slippers.
Marpa has an unusual property among parsers -- it is fully
informed about the state of the parse at all points,
and can conveniently and efficiently share that information
with the application.
In Marpa::R2::HTML, when the parse engine, with its
overstrict grammar, grinds to a halt for lack
of a tag that does not exist
in the physical input,
the lexer can ask the parse engine which tag it is looking for.
It can then dummy one up, feed it to the parse engine,
and start things back up.
It's as simple as that.
For HTML end tags,
the Ruby Slippers work stunningly well.
Only one end tag will be expected at any point.
In cases where a stack of elements must be properly terminated,
the parse engine will request the end tags, one at a time,
in proper order.
The grammar can simplify life for itself by demanding a perfect
world, and on the lexer's side, things are no harder -- it just
has to do what it is told.
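
To make the division of labor concrete, here is a toy model in Python.
It is only a sketch: it does not use Marpa's API, every name in it is hypothetical,
and the "parse engine" is reduced to a stack of open elements.
What it illustrates is that at most one end tag is expected at a time,
and that the lexer either finds that tag in the input or invents it.

```python
# Toy model of the Ruby Slippers for end tags.
# Not Marpa's API: all names here are hypothetical.

def expected_end_tag(open_elements):
    """'Parse engine' side: the only end tag expected is the one
    that closes the innermost open element."""
    return f"</{open_elements[-1]}>" if open_elements else None

def close_all(open_elements, physical_input):
    """'Lexer' side: use an end tag from the input when it is the one
    expected, and otherwise dummy up the expected tag."""
    tokens = list(physical_input)
    stack = list(open_elements)
    supplied = []
    while stack:
        wanted = expected_end_tag(stack)
        if tokens and tokens[0] == wanted:
            tokens.pop(0)  # the tag really is in the physical input
            supplied.append((wanted, "physical"))
        else:
            supplied.append((wanted, "ruby slippers"))  # invented tag
        stack.pop()
    return supplied

# Five open elements, but only two end tags in the physical input.
result = close_all(["html", "body", "table", "tr", "td"], ["</td>", "</table>"])
for tag, origin in result:
    print(tag, origin)
```

The end tags come out innermost first, with the three missing ones filled in where they are needed.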
For the very few start tags
that are optional according to the Strict HTML 4.01 DTD,
things are just as simple -- they occur in places where only one
at a time will be demanded, and the Ruby Slippers lexer need
only do what it is told to.
However, if you want to further liberalize HTML, there will be
cases where there is a choice between start tags,
or between starting one element and ending another.
Configuring the Ruby Slippers
In the last post,
I showed how to configure Marpa::R2::HTML to allow or disallow
text directly in the <body> element.
When Marpa::R2::HTML was configured to disallow
text directly in the <body>
and it encountered such text,
Marpa::R2::HTML would start a block.
The block was started
by supplying a <p>
start tag in front of the text.
In other words, Marpa::R2::HTML treated the <p>
start tag as optional.
Let me give an example.
Suppose the HTML document consisted of the string 'Hello, world'
and that, using the default configuration,
we ran html_fmt as follows:
echo 'Hello, world' |
This would be our result:
This was produced using the default configuration,
which resides in
(All the examples in this post use version 2.022000 of Marpa::R2.)
First, the results
Let's change the behavior of
Marpa::R2::HTML so that,
instead of starting a new <p> element,
it will reject the text as cruft.
We create a new configuration,
putting it into a file named reject_pcdata.txt.
Creating this configuration will not be difficult,
but it will perhaps be easiest to understand
if we first see the result
that we are aiming at.
Again we run html_fmt:
echo 'Hello, world' |
--compile reject_pcdata.txt --no-added-tag-comment
And this is our new result:
<!-- html_fmt: Next line is cruft -->
Note that in this second example, there are no <p> tags
and that the text is now labeled as "cruft", as desired.
How it was done
How would we change the default configuration file to refuse to start a new
<p> element in front of text?
The three relevant lines are:
@block_rubies = <html> <head> <body>
@inline_rubies = @block_rubies <tbody> <tr> <td> <p>
PCDATA -> @inline_rubies
The symbols with an "@" sigil are lists,
which the configuration file uses as a convenient shorthand for groups
of symbols which occur frequently.
For convenience in this discussion,
let's expand them, so that the relevant extract looks like this:
PCDATA -> <html> <head> <body> <tbody> <tr> <td> <p>
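
The expansion is mechanical, and a few lines of Python can reproduce it.
This is a minimal sketch that assumes only the two line shapes shown above
(a list definition starting with "@", and a "token -> alternatives" line).
It is not how Marpa::R2::HTML itself reads its configuration,
and the function names are made up for this illustration.

```python
def resolve(words, lists):
    """Replace each '@name' word with the members of that list."""
    members = []
    for word in words:
        members.extend(lists[word] if word.startswith("@") else [word])
    return members

def expand_lists(lines):
    """Return the '->' lines with every '@' list written out in full."""
    lists, expanded = {}, []
    for line in lines:
        if line.lstrip().startswith("@"):
            # List definition: '@name = member member ...'
            name, _, body = line.partition("=")
            lists[name.strip()] = resolve(body.split(), lists)
        else:
            # Ruby Slippers line: 'TOKEN -> alternative alternative ...'
            token, _, body = line.partition("->")
            alternatives = resolve(body.split(), lists)
            expanded.append(f"{token.strip()} -> {' '.join(alternatives)}")
    return expanded

config = [
    "@block_rubies = <html> <head> <body>",
    "@inline_rubies = @block_rubies <tbody> <tr> <td> <p>",
    "PCDATA -> @inline_rubies",
]
print(expand_lists(config)[0])
# PCDATA -> <html> <head> <body> <tbody> <tr> <td> <p>
```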
In the configuration file, PCDATA
can be thought of as non-whitespace text,
occurring in a context which is parsed
for markup and entities.
(Precisely, it is whatever
HTML::Parser returns as text that is not whitespace
and does not turn on the is_cdata flag.)
What this line says is that, whenever
a PCDATA token is rejected,
Marpa::R2::HTML should try to fix the problem as follows:
- 1. If possible, start an <html> element.
- 2. Otherwise, if possible, start a <head> element.
- 3. Otherwise, if possible, start a <body> element.
- 4. Otherwise, if possible, start a <tbody> element.
- 5. Otherwise, if possible, start a <tr> element.
- 6. Otherwise, if possible, start a <td> element.
- 7. Otherwise, if possible, start a <p> element.
- 8. Otherwise, if it is possible to end
a non-structural or a <head>
element at this point, do so.
(At any point, it will be possible to end
at most one element.)
- 9. Finally, if nothing else works, mark the "PCDATA" as cruft.
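
The order of these alternatives can be sketched as a short Python function.
This is a simplified model, not Marpa::R2::HTML's implementation:
the function name and its arguments are hypothetical,
and the parser's state is reduced to a set of start tags it would accept
plus, optionally, the one end tag that may be supplied at this point.

```python
def ruby_slippers_choice(candidates, acceptable_start_tags, allowed_end_tag=None):
    """Decide what to do with a rejected token.

    candidates            -- start tags from the configuration line, in order
    acceptable_start_tags -- start tags the parse engine would accept right now
    allowed_end_tag       -- the one end tag that may be supplied here, if any
    """
    # Explicit alternatives: the first configured start tag that fits wins.
    for tag in candidates:
        if tag in acceptable_start_tags:
            return ("start", tag)
    # Implicit alternative: end an element if that is possible.
    if allowed_end_tag is not None:
        return ("end", allowed_end_tag)
    # Last resort: label the token as cruft and move on.
    return ("cruft", None)

PCDATA_RUBIES = ["<html>", "<head>", "<body>", "<tbody>", "<tr>", "<td>", "<p>"]

# Earlier entries in the line take priority over later ones.
print(ruby_slippers_choice(PCDATA_RUBIES, {"<p>", "<tr>"}))   # ('start', '<tr>')
# No start tag fits, but an end tag may be supplied.
print(ruby_slippers_choice(PCDATA_RUBIES, set(), "</b>"))     # ('end', '</b>')
# Nothing fits at all.
print(ruby_slippers_choice(PCDATA_RUBIES, set()))             # ('cruft', None)
```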
Of these alternatives, the first three allow Marpa::R2::HTML to supply missing
structural start tags, as required by the standards.
Alternatives 4, 5 and 6 allow Marpa::R2::HTML to continue building a table
if table-building is in progress.
(But note that the line does not allow Marpa::R2::HTML
to deal with rejected
PCDATA by starting a new table.)
Alternative 7 allows Marpa::R2::HTML to start a new <p>
element if PCDATA is rejected.
Alternatives 8 and 9 are implicit.
By default, after all the explicit Ruby Slippers
alternatives have been tried,
Marpa::R2::HTML will create a Ruby Slippers tag
for any end tag that is allowed,
with two exceptions:
Marpa::R2::HTML will not create
</body> or </html> end tags except at the end of file.
And Marpa::R2::HTML always reserves the possibility of,
as a last resort,
labeling a token as "cruft" and moving on.
Once you understand how the Ruby Slippers configuration lines work,
the fix in this case becomes obvious:
we need to eliminate <p>
as one of the alternatives considered for the Ruby Slippers.
In terms of the expanded line,
this means changing it to
PCDATA -> <html> <head> <body> <tbody> <tr> <td>
In terms of the original set of lines,
this means changing the one for the @inline_rubies list to
@inline_rubies = @block_rubies <tbody> <tr> <td>
In the Ruby Slippers configuration lines of
the default configuration file,
the @inline_rubies list is the only place that
the <p> tag is mentioned.
Removing it there eliminates <p>
as an optional start tag.
Only <p> tags actually in the physical
input will be recognized.
This is what was actually done to create reject_pcdata.txt,
the configuration file used in our example.
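
The effect of the edit on the 'Hello, world' example can be modeled in a few lines of Python.
This is a sketch under a stated assumption, not a run of Marpa::R2::HTML:
it assumes that at the point where the text is rejected,
the only start tag from the list that the parser would accept is <p>,
and it leaves out the implicit end-tag alternative for brevity.
The names are hypothetical.

```python
DEFAULT_RUBIES = "<html> <head> <body> <tbody> <tr> <td> <p>".split()
REJECT_PCDATA_RUBIES = "<html> <head> <body> <tbody> <tr> <td>".split()

def handle_rejected_pcdata(candidates, acceptable):
    """Return the first configured start tag that is acceptable,
    or 'cruft' if none of them is."""
    return next((tag for tag in candidates if tag in acceptable), "cruft")

# Assumption for this toy: only <p> would be accepted at this point.
acceptable_here = {"<p>"}

default_outcome = handle_rejected_pcdata(DEFAULT_RUBIES, acceptable_here)
edited_outcome = handle_rejected_pcdata(REJECT_PCDATA_RUBIES, acceptable_here)
print(default_outcome)  # <p>
print(edited_outcome)   # cruft
```

With <p> in the list, the text gets a paragraph started in front of it; with <p> removed, nothing in the list fits and the text falls through to cruft.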
Code and comments
Comments on this post can be sent to the Marpa Google Group:
posted at: 12:48