Domain-Specific Languages made simpler

Sun, 26 Aug 2012

Domain-Specific Languages made simpler

Writing your own language

Creating your own language has been A Big Deal (tm). What if you could create a simple language in hours or minutes? There's been a serious obstacle up to now. No practical parser "just parsed" BNF. With Marpa, that restriction is lifted.

In this post, I will describe a small, sample Marpa domain-specific language (DSL). In designing it I am inspired by Mark Dominus's description of the "Worse is Better" philosophy, and its implementation in the form of Blosxom. This DSL is feature-poor, but short, simple and extensible.

A calculator

This DSL is a calculator. Calculators are familiar and, after all, whatever tool you build this DSL into, it will probably be useful to have a calculator as part of it. What follows contains only the parts of the code relevant to the discussion, not necessarily in lexical order. If you find the following interesting, you'll almost certainly want the full code, which is available as a Github gist.

The grammar

Marpa allows you to build your DSL as a clean modular structure, with a separate grammar, tokenizer and semantics. If you're used to doing parsing with regexes or recursive descent, you expect to see things mixed together, and much as you might like modularity in other contexts, this cleaner approach may make you uneasy. And not without reason. Traditionally, parsing tools that took a modular approach were painful to use and, for practical grammars, often rewarded the extra effort they required by failing to work.

Here's the grammar for our calculator.

my $rules = Marpa::Demo::OP2::parse_rules(
    <<'END_OF_GRAMMAR'
reduce_op ::=
    '+'                   => do_arg0
  | '-'                   => do_arg0
  | '/'                   => do_arg0
  | '*'                   => do_arg0
script ::= e              => do_arg0
script ::= script ';' e   => do_arg2
e ::=
     NUM                  => do_arg0
   | VAR                  => do_is_var
   | :group '(' e ')'     => do_arg1
  || '-' e                => do_negate
  || :right e '^' e       => do_binop
  || e '*' e              => do_binop
   | e '/' e              => do_binop
  || e '+' e              => do_binop
   | e '-' e              => do_binop
  || e ',' e              => do_array
  || reduce_op 'reduce' e => do_reduce
  || VAR '=' e            => do_set_var
END_OF_GRAMMAR
);

This is a simple language, but it's already an advance over, say, shell arithmetic. And the reduce operator is even a bit of fanciness. It's a second-order binary operator, whose left operand is another operator.

The grammar is written in another DSL, Marpa::Demo::OP2, which is bundled into the same file. (OP2's grammar is defined directly in Marpa::R2.) Together, these two quite useable DSL's require 600 lines, self-testing included.

I'm using OP2 in this post, as it presents the idea of a grammar more clearly. Marpa::R2's lower level syntax, while more stable, flexible and efficient, is more cluttered. OP2 itself is interesting as an extension and generalization of precedence parsing, as I described in a previous post. Here's its syntax:

::=: A BNF rule in LHS ::= RHS form
'abc': A literal token.
|: Separates alternative RHS's at the same precedence level
||: Separates alternative RHS's at the different precedence levels. The tighter ("higher") precedence alternative is first, the looser ("lower") precedence alternative is second.
=>: rule => semantics, where semantics is a Perl closure.
:left: The alternative is left-associative (the default)
:right: The alternative is right-associative
:group: The alternative is grouping-associative -- that is, its operator(s), regardless of their own precedence, group expressions of the loosest precedence

my $grammar = Marpa::R2::Grammar->new(
    {   start          => 'script',
        actions        => __PACKAGE__,
        rules          => $rules,
    }
);
$grammar->precompute;

The code just above creates a new grammar from the OP2-generated rules. The only other information needed to fully define the grammar is the name of the start symbol ("script") and the name of the package where the semantics can be found (the current one, __PACKAGE__).

Marpa does a lot of precomputation to its grammars. Once a grammar is fully defined, and before a recognizer can be created from it, the precompute() method must be called.

The semantics

Those curious about the semantics of this calculator can look at the Github gist. They are somewhat interesting. But this post is about how to get your interesting semantics out easily and quickly, in the form of a powerful language specifically designed for them.

The lexer

The token table

The calculator's lexer is table-driven. The table is quite simple -- it's an array of two element arrays. In the inner arrays, the first element is the symbol name, as specified in the grammar, and the second is a regex which matches it.

my @terminals = (
    [ q{'reduce'}, qr/reduce\b/xms ],
    [ 'NUM',       qr/\d+/xms ],
    [ 'VAR',       qr/\w+/xms ],
    [ q{'='},      qr/[=]/xms ],
    [ q{';'},      qr/[;]/xms ],
    [ q{'*'},      qr/[*]/xms ],
    [ q{'/'},      qr/[\/]/xms ],
    [ q{'+'},      qr/[+]/xms ],
    [ q{'-'},      qr/[-]/xms ],
    [ q{'^'},      qr/[\^]/xms ],
    [ q{'('},      qr/[(]/xms ],
    [ q{')'},      qr/[)]/xms ],
    [ q{','},      qr/[,]/xms ],
);

Order in the above table matters when you have terminals, one of which can prefix another. An example would be the operators == and =. There is no such pair here, however, so that in this application, the order makes no difference.

As you can see, I am one of those who specify xms for every regex. The symbol names preserve the surrounding single quotes. This is convenient for processing, and it also makes diagnostic messages involving those symbols more comprehensible. Finally, note that the reduce operator is required to end on a word boundary.

The tokenizing engine

    my $rec = Marpa::R2::Recognizer->new( { grammar => $grammar } );

    my $length = length $string;
    pos $string = 0;
    TOKEN: while ( pos $string < $length ) {

        # skip whitespace
        next TOKEN if $string =~ m/\G\s+/gcxms;

        # read other tokens
        TOKEN_TYPE: for my $t (@terminals) {
            next TOKEN_TYPE if not $string =~ m/\G($t->[1])/gcxms;
            if ( not defined $rec->read( $t->[0], $1 ) ) {
                die_on_read_problem( $rec, $t, $1, $string, pos $string );
            }
            next TOKEN;
        } ## end TOKEN_TYPE: for my $t (@terminals)

        die q{No token at "}, ( substr $string, pos $string, 40 ),
            q{", position }, pos $string;
    } ## end TOKEN: while ( pos $string < $length )

The calculator's token engine creates a Marpa recognizer with the new() constructor, and feeds it tokens with the read() method. In this token engine, I use Perl's progressive matching capabilities: the g and c modifiers, the \G assertion and the pos function. When writing a token engine, there is, as the expression goes, more than one way to do it, many of them somewhat easier than this approach. But progressive matching is powerful, efficient, very flexible, and it has the advantage that it leaves the original string intact.

Those who go on to look at the code in the gist may find die_on_read_problem(), the DSL's function for handling read() errors, helpful. It produces a very specific and comprehensive error message. One of Marpa's greatest improvements over previous parsers is that, when a parse fails, Marpa can explain why in considerable detail. It makes sense to take full advantage of that ability.

Evaluating the parse

    my $value_ref = $rec->value;

    if ( !defined $value_ref ) {
        say $rec->show_progress() or die "say failed: $ERRNO";
        die 'Parse failed';
    }
    return ${$value_ref};

Evaluation of the parse is done with the value() method. This can return all the parse results of an ambiguous parse. We want only one parse here, so we call value() only once. value() returns a reference to the value of the parse, and a Perl undef if the parse failed. The error handling is worth noticing. One of Marpa's strengths is that it is fully aware of which rules are being tried at any point, and of how far into those rules recognition has progressed. The show_progress() method reports that information.

OP2

This ends our description of the calculator code. In the Github gist a second DSL immediately follows the calculator DSL. This second DSL is OP2, which is used to define the grammar for the calculator. OP2 is more complicated than the calculator, but its design is similar, and it can be used as a second DSL example.

Alernatives

Marpa::R2 verus Marpa::XS

This calculator uses Marpa::R2. Marpa::R2 is beta, while Marpa::XS is in a stable, bug-fix only release. On the other hand Marpa::R2 is somewhat faster, and its reporting of parse-time problems is better.

Specifying the grammar

The grammar of the calculator is specified in OP2, which is a clear and elegant way to do it. But OP2 is an experimental DSL created just for this one use.

A more robust way to specify grammars is to do it directly in Marpa::R2. OP2's grammar is specified directly in Marpa::R2. A compromise between elegance and stability would be to use OP2 (or a derivative) to generate the rules (or some of them). The OP2-generated rules can be used as is, or edited to taste. When you are happy with them, Data::Dumper can turn the OP2-generated rules into code, which you can then incorporate into your DSL program.

Error messages

It is hard to compare the quality of the messages from these DSL's, unfamiliar programs which explore new ground, against, for example, the comprehensibility of a C compiler's error messages. With the C compiler, I have the advantage of over 40 years of Pavlovian training in guessing what they really mean.

I believe that this DSL's error messages are already, on average, up to the level of typical production languages. My main reason for this bold assertion is that production parsers have set the bar, frankly, extremely low. I hasten to add, this is often not because of lack of care or effort by the implementers. The traditional parsing technologies simply do not provide enough information to support accurate and helpful error reporting.

Much more could be done in error message handling than is done by this calculator DSL. Marpa's situational awareness makes much easier to write usefully accurate error messages than has been the case. And I find better error messages often repay a high priority, even in programs that are strictly for personal use.

posted at: 12:20 | direct link to this entry

§ § §

Ocean of Awareness

Marpa resources

Sun, 26 Aug 2012