Mon, 15 Oct 2012
A configurable HTML parser, part 2
My last post introduced
a configurable HTML parser.
By editing a configuration file,
the user can change
the variant of HTML being parsed.
The changes allowed are very wide ranging.
The previous post started with simple changes --
the ability to specify the contents of new tags,
and the context in which they can appear.
In this post the changes get more aggressive.
I change the contents of an existing HTML element --
and not just any element, but
one of HTML's three "structural" elements.
Marpa::R2::HTML allows the configuration file to change
the contents of all pre-existing
elements, with the exception of the highest level of the three, the <html> element.
Can text appear directly in an HTML body?
This post will discuss changing the contents of the <body> element.
Fundamental to the HTML document as this element is,
the definition of its contents has been very much in play.
Let's start with the question posed in the title of this section:
Can text appear directly in an HTML <body>?
That is, must text inside an HTML <body>
be part of one of its child elements,
or can it be directly part of the contents of the <body> itself?
If you want an
answer strictly according to the standards,
then you get your choice in the matter.
According to the
HTML 4.01 Strict DTD,
the <body> contains a "block flow",
which means that
the answer is "No, text must be in the contents of a child element".
Implementations of HTML were encouraged to be liberal, however,
and in practice a lot of the HTML "out there"
has text directly in the <body>.
Users expect their browsers to render these pages
in the way that the writer intended them to look.
Recognizing existing practice,
HTML 5 changed to require conforming implementations to
allow text to be interspersed with the block flow,
in what I call a "mixed flow".
A mixed flow can directly contain blocks and text,
as well as inline elements.
(The inline vs. block element distinction is basic to HTML parsing.
See my earlier post or
the well-organized Wikipedia page on HTML elements.)
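To make the distinction concrete, here is a minimal sketch in Python. It is not how Marpa::R2::HTML decides anything; the function name, the token shapes, and the short list of block elements are all assumptions made for illustration. It treats whitespace-only text as harmless, which is also an assumption.

```python
# Illustrative subset only; the real list of block-level elements is longer.
BLOCK_ELEMENTS = {"p", "div", "ul", "ol", "table", "pre", "blockquote", "h1", "h2"}


def classify_flow(children):
    """Classify the direct contents of an element as "block" or "mixed".

    children is a list of ("text", string) or ("element", tag_name) pairs
    (a hypothetical representation chosen for this sketch).
    """
    for kind, value in children:
        # Non-blank text directly in the contents makes the flow mixed.
        if kind == "text" and value.strip():
            return "mixed"
        # So does an inline element such as <em> or <a>.
        if kind == "element" and value not in BLOCK_ELEMENTS:
            return "mixed"
    return "block"


print(classify_flow([("element", "p"), ("element", "div")]))
print(classify_flow([("text", "loose text"), ("element", "p")]))
```

The first call describes contents that fit a block flow; the second has text sitting directly beside a block, which only a mixed flow allows.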
Block or mixed flow?
When parsing HTML, do you want to treat the contents of the body
as a block flow or a mixed flow?
Here are some of the factors.
Common practice requires accepting a mixed flow.
Cautious practice suggests writing a block flow.
HTML 4.01 requires block, but suggests being liberal.
HTML 5 requires that a mixed flow be accepted.
But HTML 5 also requires that the mixed flow be displayed as if it were written
in blocks and
suggests that explicit blocking be used to eliminate
Body contains block flow
In this first example,
the <body> contains a block flow.
This is what is specified in
the default configuration file.
Here is the pertinent line:
<body> is *block
This line says that the <body>
element contains a block flow (*block).
Here the star is a sigil which suggests the repetition operator
of DTDs and regular expressions.
(Readers of my last post will notice I've changed the configuration
file syntax and will, I hope,
find the new format an improvement.)
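For readers who want to see the shape of such a line spelled out, here is a small Python sketch that splits it into an element name and a flow name. It is only an assumption about the one form of line shown in this post; the real configuration file has other kinds of lines, and this is not the code Marpa::R2::HTML uses to read it. The names LINE_RE and parse_contents_line are hypothetical.

```python
import re

# Assumed shape of a single line: "<tag> is *flow".
# Other line forms in the real configuration file are not handled.
LINE_RE = re.compile(r"^<(\w+)>\s+is\s+\*(\w+)$")


def parse_contents_line(line):
    """Return (element_name, flow_name) for a line like '<body> is *block'."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized line: " + repr(line))
    return match.group(1), match.group(2)


print(parse_contents_line("<body> is *block"))
print(parse_contents_line("<body> is *mixed"))
```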
For the examples in this post,
the HTML will be:
I cannot wait for a start tag<p>I can
We run this through the
Here is the output:
I cannot wait for a start tag</p><p>
The first thing the parser encounters is text,
which in this example is not allowed
to occur directly in the body.
As part of being a highly liberal HTML parser,
however, Marpa::R2::HTML will supply a start tag
in these situations.
(This behavior, by the way, is also configurable --
a change to the configuration file can
tell Marpa::R2::HTML not to do this.)
With its two <p> start tags,
one of them conjured up by the Ruby Slippers,
Marpa::R2::HTML breezes through its input.
Body contains mixed flow
In the second example, we liberalize the contents of <body>
to allow a mixed flow:
<body> is *mixed
Here is the result:
I cannot wait for a start tag<p>
In a mixed flow, no <p> start tag
is needed, and none is created.
Its matching end tag
(</p>) also does not
have to be created.
Otherwise, all is as before.
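The difference between the two runs can be mimicked with a toy Python function. This is a hand-written stand-in, not Marpa's grammar-driven Ruby Slippers logic: it knows about nothing except text and <p> start tags, and the names supply_tags and body_flow are assumptions for this sketch.

```python
def supply_tags(tokens, body_flow):
    """Toy model of the two examples: add the <p> and </p> tags a liberal
    parser would supply for the direct contents of a body.

    tokens is a list of ("text", string) or ("start", "p") pairs.
    body_flow is "block" or "mixed".
    """
    out = []
    in_p = False
    for kind, value in tokens:
        if kind == "start" and value == "p":
            if in_p:
                out.append("</p>")  # a new <p> ends the paragraph already open
            out.append("<p>")
            in_p = True
        elif kind == "text":
            if not in_p and body_flow == "block":
                out.append("<p>")  # text is not allowed here, so supply a start tag
                in_p = True
            out.append(value)
    if in_p:
        out.append("</p>")  # close whatever paragraph is still open at the end
    return "".join(out)


tokens = [("text", "I cannot wait for a start tag"), ("start", "p"), ("text", "I can")]
print(supply_tags(tokens, "block"))
print(supply_tags(tokens, "mixed"))
```

With "block", the leading text gets a start tag and a matching end tag of its own; with "mixed", it is left directly in the body and only the explicit <p> is closed.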
What I decided
Before I made my HTML parser configurable,
I was forced to decide the issue of
the <body> contents one way or the other.
My earlier implementation of the
utility was based on Marpa::XS,
and its grammar specified a mixed flow.
When I started a new version
of the utility
based on Marpa::R2,
I reopened the issue.
I decided that a stricter grammar produced a more precise parse,
and that it was best to leave it up
to the Ruby Slippers to "loosen things up"
when the grammar was too strict.
This was close, I hoped, to the best of both worlds.
So I changed the grammar to specify a block
flow for the contents of <body>.
This second choice -- strict block-flow-body grammar and liberal Ruby Slippers --
remains the default in the configurable version.
In current developer's releases of Marpa::R2,
and in its next indexed release,
both the grammar and the Ruby Slippers are configurable.
The true best of both worlds happens when
the user gets to decide.
Code and comments
The examples here were run using Marpa::R2 release 2.021_010.
They are part of its test suite and can be found in the
The configurable Marpa::R2::HTML does considerably more than
can be comfortably described in a single post.
This post is the second of a series.
Comments on this post can be sent to the Marpa Google Group: