Thu, 13 Sep 2012
A Marpa-based HTML reformatter
This post is about a Marpa-based
reformatter ("tidier") for liberal HTML.
It indents HTML according to the structure of the document,
which makes the HTML a lot easier to read.
In the process, it
adds missing start and end tags and identifies "cruft".
The reformatter is ultra-liberal about its input.
Like a browser's rendering engine,
it never rejects a file,
no matter how defective it is as an HTML document.
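To make the idea concrete, here is a minimal sketch of structure-driven indentation that never rejects its input.
It is not the Marpa-based script: it substitutes Python's standard `html.parser` for a real grammar, and the names `IndentingFormatter` and `tidy` are hypothetical.
It knows nothing about optional end tags, supplies no missing start tags, and freely adds whitespace, which is exactly the problem a real tidier has to solve.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take an end tag (partial list, enough for a sketch).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class IndentingFormatter(HTMLParser):
    """Indent by nesting depth, close unclosed elements, flag stray end tags."""

    def __init__(self):
        super().__init__()
        self.lines = []   # formatted output, one entry per line
        self.stack = []   # names of the elements currently open

    def emit(self, text):
        self.lines.append("  " * len(self.stack) + text)

    def handle_starttag(self, tag, attrs):
        self.emit(self.get_starttag_text())
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        self.emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            # Nothing open matches this end tag: report it instead of failing.
            self.emit(f"<!-- cruft: stray </{tag}> -->")
            return
        while self.stack:
            open_tag = self.stack.pop()
            if open_tag == tag:
                self.emit(f"</{tag}>")
                break
            self.emit(f"</{open_tag}><!-- added missing end tag -->")

    def handle_data(self, data):
        if data.strip():
            # Character references were decoded by the parser; re-escape them.
            self.emit(escape(data.strip(), quote=False))

    def handle_comment(self, data):
        self.emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.emit(f"<!{decl}>")

    def close(self):
        super().close()  # flush any buffered text first
        while self.stack:
            self.emit(f"</{self.stack.pop()}><!-- added missing end tag -->")


def tidy(source: str) -> str:
    formatter = IndentingFormatter()
    formatter.feed(source)
    formatter.close()
    return "\n".join(formatter.lines)


if __name__ == "__main__":
    print(tidy("<div><p>Hello <b>world</div>"))
```

Feeding it `<div><p>Hello <b>world</div>` yields an indented tree in which the `</b>` and `</p>` end tags have been supplied and marked with comments.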
An interesting experiment would be to compare what your
favorite browser does with a random text file fed to
it directly with what it does to the same file
after it has been passed through the reformatter.
The reformatter is a by-product of moving
this blog to Github.
In the course of bringing over
my old posts,
I wanted a filter that would tidy them up,
so I turned to an old demo script I'd written.
The old demo's usefulness was a pleasant surprise,
but it lacked two features.
First, it wouldn't read from standard input.
Second, in formatting the HTML, it introduced new whitespace.
The first problem was easy to fix.
Fixing the second involved coming up with a
"lowest common denominator" for whitespace treatment
among browsers and HTML variants.
The reformatter works very well as the first step in dealing with HTML
that you are rewriting by hand.
One quick pass-through and your file is much easier to read,
has all the proper tags,
and has comments pointing out any "cruft" that's there.
A production-quality "tidier" would need to be
bristling with options.
The reformatter so far has only two options,
one dealing with whitespace before end tags,
the other allowing
a choice of strategies for avoiding added whitespace.
(One strategy uses comments, while the other simply leaves
the whitespace-sensitive locations as-is.)
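The post does not show what the comment strategy looks like, so the following is only one plausible reading of it: put the line break and indentation inside an HTML comment, so that the browser sees no new whitespace text between the elements.
The function names `join_inline` and `strip_comments` are hypothetical and do not come from the script.

```python
import re


def join_inline(parts, indent="  ", strategy="comment"):
    """Join adjacent markup fragments without adding whitespace text.

    strategy="comment": hide each line break and its indentation in a comment.
    strategy="as-is":   leave the whitespace-sensitive location untouched.
    """
    if strategy == "comment":
        separator = "<!--\n" + indent + "-->"
        return separator.join(parts)
    if strategy == "as-is":
        return "".join(parts)
    raise ValueError(f"unknown strategy: {strategy!r}")


def strip_comments(markup):
    """Remove comments, leaving roughly the content a browser would lay out."""
    return re.sub(r"<!--.*?-->", "", markup, flags=re.DOTALL)


if __name__ == "__main__":
    parts = ["<b>bold</b>", "<i>italic</i>"]
    print(join_inline(parts))
    # Once comments are ignored, both strategies give the same markup.
    print(strip_comments(join_inline(parts)) == join_inline(parts, strategy="as-is"))
```

The comment form buys readable line breaks at the cost of noisier markup, while the as-is form keeps the markup clean but leaves long lines alone.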
These two options are not nearly
sufficient to deal with the full
range of whitespace issues,
never mind anything else.
But from a
"Worse is Better"
point of view,
the reformatter is a good start.
It is 600 lines,
short enough to find your
way around in,
particularly once you've deleted the parts you don't like.
And its underlying Marpa-based interface is documented:
Marpa::R2::HTML is beta, but has been stable for some time.
The script is now available as a gist.
In a future release of Marpa,
it will be bundled with the distribution.
But why wait until then to fork it?
posted at: 23:08