Thu, 01 Aug 2013
A Marpa-powered C parser
Jean-Damien Durand has just released
which parses C language into an abstract syntax tree (AST).
MarpaX::Languages::C::AST has been tested against
Perl's C source code, as well as Marpa's own C source.
Because it is Marpa-powered,
MarpaX::Languages::C::AST works differently from other C parsers.
In the past,
C parsers have been syntax-driven -- parsing was based on a BNF description
of the C grammar.
More recently, C parsers have used hand-written
recursive descent -- they have been procedurally-driven.
MarpaX::Languages::C::AST uses both approaches.
Marpa has the advantage that it makes full knowledge of the state of the parse
available to the programmer,
so that procedural logic and syntax-driven parsing can reinforce each other.
The result is a combined lexer/parser
which is very compact and easy to understand.
Among the potential applications:
- Customized "lints".
You can write programs to enforce C language standards
and restrictions specific to an individual, a company or a project.
- C interpreters.
By taking the AST and adding your own back end, you can create a special-purpose C interpreter
or a special-purpose compiler.
- C variants.
Because the code for the parser is compact and easy to modify,
it lends itself to language extension and experimentation.
you could reasonably implement compilers
to try out the proposals submitted to
a standards committee.
- C supersets.
Would you like to see some of the syntax from a favorite language available in C?
Here's your chance.
A few of Jean-Damien's implementation choices are worth noting.
A C parser can take one of two strategies: approximate or precise.
A compiler has, of course, to be precise.
Tools, such as cross-referencers, often decide to be approximate, or sloppy.
Sloppiness is easier to implement and has other advantages:
A sloppy tool can tolerate missing C flags: what the C flags should be
can be one of the things it guesses at.
Of the two strategies, Jean-Damien decided to go with "precise".
MarpaX::Languages::C::AST follows the C11 standard,
with either GCC or Microsoft extensions.
This has the advantage that
MarpaX::Languages::C::AST could be used as the front end of a compiler.
MarpaX::Languages::C::AST purpose is to take things as far as an AST,
and let the user take over,
it does not implement those constraints usually implemented in post-processing.
One example of a post-syntactic constraint is the one that bans
"case" labels outside of switch statements.
Perhaps a future version can include a default "first phase" post-processor to enforce
the constraints from the standards.
As currently implemented, the user can check for and enforce these constraints in any
way he likes.
This makes it easier for extensions and customizations,
which I think of as the major purpose of MarpaX::Languages::C::AST.
The parsing strategy
Those familar with the C parsing and its special issues may be interested in
Jean-Damien's approach to them.
MarpaX::Languages::C::AST is, with a few exceptions, syntax-driven -- the parser works
from Marpa's SLIF, an extended BNF variant.
The SLIF-driven logic is sufficient to deal with the
Marpa handles right recursion in linear time,
so that the if-then-else issue could have been dealt with by rewriting the relevant rules.
But Jean-Damien wanted to have his BNF follow closely the grammar in the standards,
and he decided to use Marpa's rule ranking facility instead.
More complicated is the ambiguity in C between variable names and types,
which actually takes C beyond BNF and context-free grammars into context-sensitive
Most C parsers deal with this using
or post-processing hacks.
Marpa allows the parser to do this more elegantly.
Marpa knows the parsing context at all times and can comnunicate this to a user's
Marpa also has the ability to use the parsing context to
decide when to
switch control from the syntax-driven logic to
a user's customized procedural logic,
and for the syntax-driven logic to take control back
when the procedural logic wants to give it back.
This allows the variable-name-versus-type ambiguity to be handled by specifically targeted code
which knows the full context of the decisions it needs to make.
This code can be written more directly, simply and clearly
than was possible with previous parsing methods.
Above I mentioned special-purpose compilers.
What about production compilers?
MarpaX::Languages::C::AST's upper layers are in Perl,
so the speed, while acceptable for special-purpose tools,
will probably not be adequate for production.
Perhaps a future Marpa-powered C parser will rewrite those
upper layers in C, and make the race more interesting.
To learn more
is available on CPAN.
A list of my Marpa tutorials can be found
new tutorials by
The Ocean of Awareness blog
focuses on Marpa,
and it has
an annotated guide.
Marpa also has
a web page.
For questions, support and discussion, there is
the "marpa parser"
Comments on this post can be made there.
posted at: 18:57 |
direct link to this entry