Precedence parsing made simpler

Wed, 22 Aug 2012

Precedence parsing made simpler

This post describes a new approach to precedence parsing, one that makes it simpler and more flexible. Many programmers find precedence is an intuitive way to look at problems. The traditional rules of arithmetic are a familiar example:

E ::= ( E )
E ::= n
E ::= E * E
E ::= E + E

Here, as in the rest of this post, the rules are ranked from tightest ("highest") precedence to loosest ("lowest"). The order in the above display indicates that multiplication takes precedence over addition, and parentheses take precedence over everything else.

The old way and the new way

The traditional way to deal with precedence centers on symbols. The symbols are divided sharply in two: those that define structure and those that carry information. The structural symbols (often called "operators") are assigned associativities and precedence. To help guide the parse, symbols may be further classified as infix, prefix, circumfix, etc.

Many rules, even those which intuitively seem part of the precedence order, do not fit into this symbol-oriented view of precedence. Implied operators are ruled out, as is any rule with two non-terminals in a row. Rules with an arity of 3 or above, when not also impossible, are a challenge.

The approach of this post is 100% rule-based. There is no attempt to identify operators or structural symbols, and no attempt to assign properties to them. This rule-based approach allows the convenient expression and efficient implementation of implied operators, of rules of arity 3 or higher, and of rules with any pattern of terminals or non-terminals.

Simpler

Before getting into new features, it is probably best to show the new approach as applied to a grammar that can be parsed with the traditional methods. My notation is mostly standard or transparent, but here are details:

    ::=       separates the lhs of a rule from its rhs alternatives
    |         separates alteratives at the same precedence level
    ||        separates alteratives at different precedence levels
    :group    indicates 'grouping' associativity
    :left     indicates left associativity (the default)
    :right    indicates right associativity

Here is the grammar:

e ::=
     NUM
   | VAR
   | :group '(' e ')'
  || '-' e
  || :right e '^' e
  || e '*' e
   | e '/' e
  || e '+' e
   | e '-' e
  || VAR '=' e

The above fully states the precedence and associativity for the grammar's rules. (As a reminder, the precedence follows the order of the rules, from tightest to loosest.) This is significantly simpler than what is required to set up a traditional precedence parser. On the other hand, intuitively, it looks like all the required information is there. And, in fact, this is the source from which Marpa::Demo::OP1 creates the grammar for a calculator. The code is a Github gist.

In real life, users of a calculator grammar, like the above, will be interested in a numeric result. However, in this post we are not interested in double-checking Perl's ability to do basic arithmetic, so instead we capture the syntactic structure that the calculator creates. Here are sample outputs, with square brackets added to show the parse.

Input: "4 * 3 + 42 / 1"
  Parse: [[4*3]+[42/1]]
Input: "4 * 3 / (a = b = 5) + 42 - 1"
  Parse: [[[[4*3]/[([a=[b=5]])]]+42]-1]
Input: "4 * 3 /  5 - - - 3 + 42 - 1"
  Parse: [[[[[4*3]/5]-[-[-3]]]+42]-1]
Input: "- a - b"
  Parse: [[-a]-b]
Input: "1 * 2 + 3 * 4 ^ 2 ^ 2 ^ 2 * 42 + 1"
  Parse: [[[1*2]+[[3*[4^[2^[2^2]]]]*42]]+1]

More flexible

In the next grammar, I'll introduce an implied operator. An implied operator is prominent among the features that traditional precedence parsers simply could not handle. In the grammar that follows, a missing operator will indicate multiplication, just as in algebra.

Traditional precedence parsers also were stymied by rules with an arity of 3 or more. For Marpa::Demo::OP1, these are no problem at all. I'll introduce two ternary operations, and a quaternary operation. (New in the notation below is the "=> xyz", which specifies a non-default semantics, in this case "xyz()".)

e ::=
     NUM
   | VAR
   | :group '(' e ')'
  || '-' e
  || :right e '^' e
  || e '*' e
   | e e                                          => implied_multiply
   | e '/' e
  || e '+' e
   | e '-' e
  || VAR '=' e
  || :right e '?' e ':' e                         => spaced
   | :right e '??' e ':' e ':' e                  => spaced
  || 'payment' 'on' e 'over' e 'years' 'at' e '%' => spaced

The code for this second example is also a Github gist. And here is the output. (To make it easy to spot them, implied multiplications are shown with an "x" instead of a "*".)

Input: "4 3 42 + 1"
  Parse: [[[4 x 3] x 42]+1]
Input: "e = m c^2"
  Parse: [e=[m x [c^2]]]
Input: "4 * 3 5 (6 7) 8 9 10"
  Parse: [[[[[[4*3] x 5] x [([6 x 7])]] x 8] x 9] x 10]
Input: "1 ? 42 : 2 ?? 3 : 4 : 5 ? 6 : 7"
  Parse: [1 ? 42 : [2 ?? 3 : 4 : [5 ? 6 : 7]]]
Input: "payment on 1000 + 1000 over months/12 years at 5 + 1 %"
  Parse: [payment on [1000+1000] over [months/12] years at [5+1] %]

How rule-based precedence works

Rule-based precedence parsing uses Marpa, a new and efficient general BNF parsing algorithm. With Marpa, the rest is straightforward. The grammars in the examples above are rewritten, using the included precedence and associativity information, into an "order-explicit grammar". The BNF of an order-explicit grammar enforces the precedence and associativity of the original, source grammar. Many textbooks on parsing describe how to write order-explicit BNF by hand.

Creating an algorithm to produce an order-explicit BNF grammar required some careful thought, but no flashes of brilliance. Previously, the major obstacle to this approach would have been the parse engine. Traditional parsers did not "just parse" arbitrary BNF -- far from it. Without (and often even with) programmer intervention, there would be little reason to hope that an LALR or LL parse engine would parse an arbitrary order-explicit BNF grammar. Marpa, on the other hand, does "just parse" arbitrary BNF, and a successful parse is guaranteed.

Any grammar which could have been parsed by yacc (LALR) or an operator-precedence parser will be parsed by Marpa in linear time. LALR and operator precedence are subsets of LR(1), while Marpa is linear for LR-regular, and for LR(k) for all k. This means that Marpa will stay linear for vast classes of grammars that the traditional techniques had no hope of ever parsing.

Acknowledgements

This post is the outcome of a line of thinking started by an exchange with Alberto Simões, begun when he graciously shared with me a pre-release copy of an article on lexical analysis, which he authored jointly. And, in creating the DSL used for the examples, I benefited immensely from studying the approaches used by Peter Stuifzand.

posted at: 12:58 | direct link to this entry

§ § §

Ocean of Awareness

Marpa resources

Wed, 22 Aug 2012