Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest
The Ocean of Awareness blog: home page, chronological index, and annotated index.
This post introduces an HTML parser which is both liberal and configurable. Currently available as a part of a Marpa::R2 developer's release on CPAN, the new Marpa::R2::HTML allows detailed configuration of new tags and respecification of the behavior of existing tags.
To show how a configurable HTML parser works,
I will start with a simple task.
Let's suppose we have a new tag, call it
<acme>
.
The older, non-configurable version of Marpa, and most browsers,
would recognize this tag.
But they'd simply give it a default configuration,
one which is usually very liberal --
liberal meaning the tag is allowed to contain just about anything,
and to go just about anywhere.
A configurable parser allows us to specify the new tag's behavior
more explicitly and strictly.
In this post I am going to assume that the reader knows, or will look up what he needs to know, about HTML. But block vs. inline is a essential HTML concept which is often ignored -- even the excellent Wikipedia page on HTML does not define "inline", although it uses the term in the technical sense twice. Since the concept is also central to this post, let me briefly summarize it.
Quoting from the HTML 4.01 Strict DTD,
HTML has two basic content models: %inline; character level elements and text strings %block; block-like elements e.g. paragraphs and lists
There is also what I will call a "mixed flow", which can contain anything that can appear in either a block or inline flow. (What I call a mixed flow is simply called a "flow" in the HTML 4.01 DTD.)
Significant in the examples are
<p>
,
which is a block element that contains an inline flow and
<span>
,
which is an inline element that contains an inline flow.
The
<body>
tag is an important structural tag,
which contains an block flow.
For simplicity I am following HTML 4.01 DTD. HTML 5 uses radically different terminology and is more liberal in what it allows. Differences among standards are an important reason for an HTML parser to be configurable.
Let's define the
<acme>
tag
to be an inline tag with inline contents.
This is done by adding the following line to the
Marpa::R2::HTML grammar configuration file:
ELE_acme is a FLO_inline included in GRP_inline
Let's take as our HTML, the following:
<acme>-during-<span>-more inline stuff-<p>-new block-
With the following shell commands,
we add the new line to
default.cfg
,
the default grammar configuration file.
We then use the
marpa_r2_html_fmt
utility
that comes with Marpa::R2
to parse the HTML.
cp default.cfg test.cfg echo "ELE_acme is a FLO_inline included in GRP_inline" >> test.cfg echo '<acme>-during-<span>-more inline stuff-<p>-new block-' | marpa_r2_html_fmt --no-added-tag --compile test.cfg
marpa_r2_html_fmt
indents the HTML and
adds missing tags
This will show us the structure of our small HTML document.
Here is what we get:
<html> <head> </head> <body> <p> <acme> -during-<span> -more inline stuff-</span></acme></p><p> -new block- </p></body> </html>
We see from this that the configuration took proper effect.
Since the
<acme>
tag was configured as an inline element,
it cannot go directly inside the
the
<body>
.
So a new
<p>
is created to contain the
<acme>
element.
(A Marpa::R2::HTML configuration can also change
the contents that are acceptable inside the
<body>
.
For the moment, we'll keep it simple
and accept as given that
<body>
contains a block flow.)
Similarly,
since the
<acme>
tag contains an inline flow,
it must end before the next
<p>
tag.
The parser supplies an
end tag for the
<acme>
element,
and also
closes the
<p>
paragraph that was
started to hold the
<acme>
element.
Our new configuration line can also specify that
<acme>
is a block element:
ELE_acme is a FLO_inline included in GRP_block
The code to test this is very similar to that displayed above. It, and the scripts for all of this post's other examples, can be found as a gist. Here's the result:
<html> <head> </head> <body> <acme> -during-<span> -more inline stuff-</span></acme><p> -new block- </p></body> </html>
Here the
<acme>
element is allowed in a block,
so no new
<p>
element was needed to contain it.
Since
the
<acme>
element's contents are inline,
the
<acme>
element still needed to be ended before
the
<p>
tag in the actual input.
Those who click through to look at the grammar configuration file may notice its length. Three pages. And almost half of those three pages are single-line descriptions of elements which are one or more of deprecated, obsolete or proprietary. Arguably the configuration file should be even shorter.
Other implementations of liberal HTML parsers spread the logic specifying the HTML variant across a considerably larger body of code, sometimes a vastly larger. This very much affects the cost of evolving and maintain the parser.
As for the configuration file's format at the moment: it is experimental. I can state from experience that it is quite serviceable, and fairly readable, but it could be more elegant.
We can also configure the contents of
the
<acme>
element.
This configuration line specifies a mixed flow.
ELE_acme is a FLO_mixed included in GRP_block
And here is what we get:
<html> <head> </head> <body> <acme> -during-<span> -more inline stuff-</span><p> -new block- </p></acme></body> </html>
A mixed flow accepts any contents,
so that the
<acme>
element's contents expand to include the
entire body of the html document.
With this configuration line, we request a block flow for the contents:
ELE_acme is a FLO_block included in GRP_block
The results:
<html> <head> </head> <body> <acme> <p> -during-<span> -more inline stuff-</span></p><p> -new block- </p></acme></body> </html>
Here the
<acme>
element also spans the entire body of the HTML
element,
but because block flows are less liberal than
mixed flows,
the contents of the
<acme>
element have to be properly "packaged" inside
block elements.
We are getting progressively more restrictive with the contents
of the
<acme>
element.
We've already seen the example
of an
<acme>
block element for inline contents.
This configuration line specifies that the contents
must be PCDATA.
PCDATA allows text, but not child elements.
ELE_acme is a FLO_pcdata included in GRP_block
The result:
<html> <head> </head> <body> <acme> -during-</acme><p> <span> -more inline stuff-</span></p><p> -new block- </p></body> </html>
Note here that the
<acme>
tag ends as soon as another element is encountered.
The final restriction on the
<acme>
element is the insistance that it be empty:
ELE_acme is a FLO_empty included in GRP_block
And here is our result:
<html> <head> </head> <body> <acme> </acme><p> -during-<span> -more inline stuff-</span></p><p> -new block- </p></body> </html>
Any of these different configurations of
the
<acme>
tag could have a dramatic effect on what is displayed,
depending on your CSS file.
Whether or not
<acme>
is a block element also affects
what is displayed.
When
<acme>
is a block element,
its boundaries will typically display as paragraph breaks.
In the above examples,
the case where
<acme>
is configured as a block element containing PCDATA
typically displays as three paragraphs:
-during- -more inline stuff- -new block-
In the other cases, the end boundary of the
<acme>
element varies,
but always coincides with the beginning or
end of other block elements,
so that the visible display is as
two paragraphs:
-during- -more inline stuff- -new block-
The configurable Marpa::R2::HTML does considerably more than
could be mentioned in this post.
I hope to say more about it soon.
Comments on this post can be sent to the Marpa Google Group:
marpa-parser@googlegroups.com
posted at: 20:46 | direct link to this entry