Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest
The Ocean of Awareness blog: home page, chronological index, and annotated index.
Currently, in designing languages, we don't allow ambiguities -- not even potential ones. We insist that it must not be even possible to write an ambiguous program. This is unnecessarily restrictive.
This post is written in English, which is full of ambiguities. Natural languages are always ambiguous, because human beings find that that's best way for versatile, rapid, easy communication. Human beings arrange things so that every sentence is unambiguous in context. Mistakes happen, and ambiguous sentences occur, but in practice, the problem is manageable. In a conversation, for example, we would just ask for clarification.
If we allow our computer languages to take their most natural forms, they will often have the potential for ambiguity. This is even less of a problem on a computer than it is in conversation -- a computer can always spot an actual ambiguity immediately. When actual ambiguities occur, we can deal with them in exactly the same way that we deal with any other syntax problem: The computer catches it and reports it, and we fix it.
To illustrate, I'll use a DSL-writing DSL language. It'll be tiny -- just lexeme declarations and BNF rules. Newlines will not be significant. Statements can end with a semicolon, but that's optional. (The code for this post is in a Github gist.)
Here is a toy calculator written in our tiny DSL-writing language:
Number matches '\d+' E ::= T '*' F E ::= T T ::= F '+' Number T ::= Number
With a grammar this small, just about anything is readable. But let's assume we want to improve it, and that we decide that the lexeme declaration of Number really belongs after the rules which use it. (If our grammar was longer, this could make a real difference.) So we move the lexeme declaration to the end:
E ::= T '*' F E ::= T T ::= F '+' Number T ::= Number Number matches '\d+'
It turns out the grammar for our toy DSL-writer is ambiguous. When a lexeme declaration follows a BNF rule, there's no way to tell whether or not it is actually a lexeme declaration, or part of the BNF rule. Our parser catches that:
Parse of BNF/Scanless source is ambiguous Length of symbol "Statement" at line 4, column 1 is ambiguous Choices start with: T ::= Number Choice 1, length=12, ends at line 4, column 12 Choice 1: T ::= Number Choice 2, length=33, ends at line 5, column 20 Choice 2: T ::= Number\nNumber matches '\\d
Here Marpa tells you why it thinks your script is ambiguous. Two different statements can start at line 4. Both of them are BNF rules, but one is longer than the other.
Instead of having to design a language where ambiguity was not even possible, we designed one where ambiguities can happen. This allows us to design a much more flexible language, like the ones we choose when we humans communicate with each other. The downside is that actual ambiguities will occur, but they can be reported, and fixed, just like any other syntax error.
In this case, we recall we allowed semi-colons to terminate a rule, and our fix is easy:
E ::= T '*' F E ::= T T ::= F '+' Number T ::= Number ; Number matches '\d+'
The code for this post is a gist on Github. It was written using Marpa::R2, which is available on CPAN. A list of my Marpa tutorials can be found here. There are new tutorials by Peter Stuifzand and amon. The Ocean of Awareness blog focuses on Marpa, and it has an annotated guide. Marpa has a web page that I maintain and Ron Savage maintains another. For questions, support and discussion, there is a "marpa parser" Google Group and an IRC channel: #marpa at irc.freenode.net.
Comments on this post can be made in Marpa's Google group.
posted at: 17:12 | direct link to this entry