Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website


Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Wed, 17 Nov 2010

Better than Literate!

I'm converting Marpa to XS and have started to use Cweb. Cweb is the C version of the "literate programming" system pioneered by Don Knuth. I'm pleasantly surprised by it. Cweb adds fun to the programming experience and is helping in more ways than the phrase "literate programming" would suggest.

One very important feature of Cweb is something that would seem to be a nuisance, or at best an implementation detail. The .w file which contains the Cweb is now the "source". The .c and .h files are now "built files". I am no longer working with the C language "translation units". ("Translation unit" is standards-committee-speak for the text which is directly input into the C compiler. The term is used in attempt to distance the standard from file conventions.)

Moving one step back from the "translation unit" separates the presentation of the code from the issues of linkage and scope. Over the years, I'd gotten used to organizing my code around the visibility rules for the compiler and the linker. With Cweb, I can lay out C code in the same way that I think about it. It surprises me how liberating this is.

Suppose we are adding an element to a C structure. Typically required might be:

  1. A typedef for a type that is particularly complex.
  2. The entries in the struct.
  3. Initialization of these entries.
  4. "Destruction" of those entries: freeing any memory or other resources they use.
  5. Definition of the function bodies for mutators, accessors, etc.
  6. Public prototypes for some of the functions.
  7. Private prototypes for other functions.
Often, each of these would be put at a different location. But all these bits of code are written and debugged together. If you ever want to change the data structures, all these bits of code would have to be tracked down again. True, skill at picking good names and at performing searches on the source can make this sort of thing tractable. But it is also true that some of the programming skills we've developed over the years could aptly be called symptoms.

Here is some C logic which, while very small, nonetheless has 6 of the 7 components listed above. In Cweb I was able to put them all together. Here's the bottom of one page of my "woven" Cweb code: callback1.png

... and here is the top of the next: callback2.png

For some readers, these samples may adequately illustrate my point. For those curious about the code, I'll close with a few explanations. The code, as I said, is for the XS version of Marpa. For those not familiar, Marpa is a general BNF parser generator -- it parses from any grammar that you can write in BNF. If the grammar is of any of the kinds currently in practical use (yacc, LR(k), LALR, LL, recursive descent, etc.), this parsing is in linear time.

Marpa uses Perl callbacks, and so the XS version must be able to call back to Perl closures. So where is all the Perl logic in my example?

For the XS conversion, I'm separating the code into three layers:

  1. libmarpa is a "pure C" library, which implements the core of the Marpa algorithm. libmarpa is agnostic about whether it is called from Perl, from another high level language, and even from other C code.
  2. There's a "glue" layer to tie the Perl code to libmarpa. This is in Perl's XS language. XS can be thought of as a C dialect. Special preprocessing converts XS into C code, which is then compiled.
  3. Finally, there's a "pure Perl" wrapper. User interface issues are handled at this level.

I am only using Cweb for libmarpa. (Cweb only understands C language.)

Since libmarpa has no Perl-specific logic, its role in dealing with Perl callbacks is very simple. libmarpa needs code to store the callback (which from its point of view is just a C function pointer), and some other code to perform the callback. C does not have closures, so to implement a poor man's version of these, the callback is allowed an argment, which can be examined and set. It comes out to a couple of dozen lines of code, the majority of which are declarations and data definitions.

One thing is missing from the example -- "destructor" logic. The pointers in the example do not "own" the things that they point to, so there is no need to deallocate anything. When allocation and deallocation are separate, it is easy to forget whether you omitted deallocation logic because it was unnecessary, or because you forgot. Using Cweb, I always put allocation and deallocation together. Perhaps that's not enough to make memory leaks a thing of the past, but it certainly makes life easier.

posted at: 19:40 | direct link to this entry

§         §         §