Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Thu, 13 Sep 2012

A Marpa-based HTML reformatter

This post is about html_fmt, a Marpa-based reformatter ("tidier") for liberal HTML. html_fmt indents HTML according to the structure of the document, which makes the HTML a lot easier to read. In the process html_fmt adds missing start and end tags and identifies "cruft".
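To make "indenting according to structure" and "adding missing end tags" concrete, here is a minimal sketch in Python. It is not html_fmt and does not use Marpa: it relies on Python's standard html.parser module, and the names indent_html and IndentingParser, as well as the wording of the marker comments, are made up for this illustration. It also knows none of HTML's rules about which elements implicitly close others, so it supplies an end tag only when a later end tag or the end of the input forces one.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class IndentingParser(HTMLParser):
    """Toy reformatter: one item per line, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements
        self.lines = []   # output lines

    def emit(self, text):
        self.lines.append("  " * len(self.stack) + text)

    def handle_starttag(self, tag, attrs):
        self.emit(self.get_starttag_text())
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit as-is, open nothing.
        self.emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            # An end tag with no matching start tag is flagged, not dropped.
            self.emit("<!-- cruft in sketch: stray </%s> -->" % tag)
            return
        while self.stack:
            open_tag = self.stack.pop()
            if open_tag == tag:
                self.emit("</%s>" % tag)
                return
            # Close any element left open inside the one being closed.
            self.emit("</%s><!-- added by sketch -->" % open_tag)

    def handle_data(self, data):
        # Stripping text changes whitespace, so this is NOT whitespace-safe.
        text = data.strip()
        if text:
            self.emit(escape(text, quote=False))

    def handle_comment(self, data):
        self.emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.emit("<!%s>" % decl)

    def finish(self):
        self.close()
        # Anything still open at end of input gets an end tag.
        while self.stack:
            self.emit("</%s><!-- added by sketch -->" % self.stack.pop())
        return "\n".join(self.lines)


def indent_html(source):
    parser = IndentingParser()
    parser.feed(source)
    return parser.finish()


if __name__ == "__main__":
    print(indent_html("<div><p>Hello <b>world</div>"))
```

Given `<div><p>Hello <b>world</div>`, the sketch puts each tag on its own line and supplies the missing `</b>` and `</p>`. Note that it freely changes whitespace inside the text, which is exactly the kind of side effect a careful reformatter has to avoid.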

html_fmt is ultra-liberal about its input. Like a browser's rendering engine, html_fmt never rejects a file, no matter how defective it is as an HTML document. An interesting experiment would be to compare what your favorite browser does with a random text file fed to it directly, with what it does to the same file after it has been passed through html_fmt.

html_fmt is a by-product of moving this blog to Github. In the course of bringing over my old posts, I wanted a filter that would tidy them up, so I turned to an old demo script I'd written. The old demo's usefulness was a pleasant surprise, but it lacked two features. First, it wouldn't read from standard input. Second, in formatting the HTML, it introduced new whitespace. The first problem was easy to fix. Fixing the second involved coming up with a "lowest common denominator" for whitespace treatment among browsers and HTML variants.

The result, html_fmt, works very well as the first step in dealing with HTML that you are rewriting by hand. One quick pass-through and your file is much easier to read, has all the proper tags, and has comments pointing out any "cruft" that's there.

A production-quality "tidier" would need to be something like GNU indent -- bristling with options. html_fmt so far has only two options, one dealing with whitespace before end tags, the other allowing a choice of strategies for avoiding added whitespace. (One strategy uses comments, while the other simply leaves the whitespace-sensitive locations as-is.) These two options are not nearly sufficient to deal with the full range of whitespace issues, never mind anything else.
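The comment strategy is easiest to see in a small example. The Python sketch below shows the general technique only, and is not taken from html_fmt: the function names are hypothetical, and html_fmt's actual output layout may differ. The idea is that a line break and indentation placed inside an HTML comment never become part of the document's text, so inline fragments can go on separate lines without the rendered whitespace changing.

```python
import re


def break_lines_with_comments(fragments, indent="  "):
    """Join fragments one per line, hiding each line break and indent
    inside an HTML comment so no whitespace text is introduced.
    Assumes the fragments themselves contain no comments."""
    separator = "<!--\n" + indent + "-->"
    return separator.join(fragments)


def strip_comments(html):
    """Remove HTML comments, approximating what a renderer ignores."""
    return re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)


if __name__ == "__main__":
    fragments = ["<b>bold</b>", "<i>italic</i>", "<code>code</code>"]
    formatted = break_lines_with_comments(fragments)
    print(formatted)
    # With the comments removed, the original run of markup is unchanged.
    assert strip_comments(formatted) == "".join(fragments)
```

The alternative strategy mentioned above corresponds to not breaking the line at all at such locations, which keeps the markup free of extra comments at the cost of longer lines.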

But from a "Worse is Better" point of view, html_fmt is a good start. It is 600 lines, short enough to find your way around in, particularly once you've deleted the parts you don't like. And its underlying Marpa-based interface is documented: Marpa::R2::HTML. Marpa::R2::HTML is beta, but has been stable for some time.

html_fmt is now available as a gist. In a future release of Marpa::R2, it will be available as the marpa_r2_html_fmt script. But why wait until then to fork it?

posted at: 20:08

§         §         §