Posts about Marpa: an annotated list
Tutorials
Practical general parsing makes writing a DSL much simpler than it has been in the past. The posts in this series of tutorials are among my most popular. Although you can join the series starting at a topic of special interest, the easiest way to read (or at least skim) them is in the order listed.
Making DSL's even simpler. The best place to start. An updated version of one of my most popular posts, it describes Marpa's Scanless interface (SLIF). (For the flavor of the SLIF, see the short sketch after this list.)
Announcing Marpa's Scanless interface. The announcement of the Scanless interface, this also includes a small tutorial.
A self-parsing and self-lexing grammar. A tutorial that walks through the Scanless interface's own self-defining and self-hosting grammar. Reading a grammar that is defining itself can cause dizziness, but if you tolerate that sort of thing, this makes an excellent advanced tutorial.
BNF to AST. This post is based on the chapter on the Language Design Pattern from the very influential Design Patterns book. In that book, the authors (widely known as the Gang of Four) did not attempt to parse the language of their example -- there was no simple way to produce an AST straight from BNF. Now there is. This tutorial contains the Gang of Four example worked out in full with a Marpa-based parser, and a sequel talks about some of the implications of easier parsing.
Marpa and procedural parsing 2013. "Procedural parsing" is a fancy term for giving up on your parsing algorithm and just hacking out the parse. With the currently favored parsing algorithms, a lot of this type of code is necessary, and programmers have come to expect a parser to allow for it. Marpa tries to offer the best of both worlds: a declarative parser which knows exactly where it is in your BNF at all times, but one which is willing to step out of the way and let custom code take over, at the same time sharing what it knows with the custom code. Perl heredocs are an example of a language feature which requires procedural parsing. Parsing these correctly requires jumping back and forth in the input. The tutorial in this post shows how to handle them straightforwardly using Marpa's SLIF. (A bare-bones sketch of the event mechanism involved appears at the end of this section.)
Mixing procedural and declarative parsing. This post is another example that combines procedural and syntax-driven parsing in order to get the best of both worlds.
Marpa and procedural parsing 2018. A more recent example of Marpa's procedural parsing, using a context-sensitive grammar.
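To give the flavor of the SLIF that the tutorials above teach, here is a minimal sketch. It is not code from any of the posts: the grammar and input are invented, and the sketch assumes only Marpa::R2's documented Scanless interface.

    use strict;
    use warnings;
    use Marpa::R2;
    # An invented toy grammar: sums like "1 + 2 + 3".
    my $grammar = Marpa::R2::Scanless::G->new( { source => \(<<~'END_OF_DSL') } );
        :default ::= action => [name,values]
        :start ::= Sum
        Sum ::= Number
        Sum ::= Sum '+' Number
        Number ~ [\d]+
        :discard ~ whitespace
        whitespace ~ [\s]+
        END_OF_DSL
    my $input = '1 + 2 + 3';
    my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    $recce->read( \$input );
    my $value_ref = $recce->value();
    die 'No parse' if not defined $value_ref;
    # $$value_ref is an AST built by the [name,values] default action.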
If you're looking for tutorials, you'll want to start, and probably to end, with those based on the SLIF interface. The best of mine for the beginner are above, in this section. The section on programming language design also contains some posts whose examples work as SLIF-based tutorials. More tutorials can be found in the section on combinator parsing.
While SLIF-based tutorials should be strongly preferred, my pre-SLIF tutorials, described below, may still have some value. They contain many useful examples, and most of the code in them carries over to the SLIF line-for-line.
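The procedural-parsing posts above all turn on the SLIF's pause-and-resume event mechanism. The following is a bare-bones, hedged sketch of that mechanism, not the heredoc example itself: an invented grammar pauses before each Word lexeme, custom code picks the token's span and value, and Marpa then resumes.

    use strict;
    use warnings;
    use Marpa::R2;
    my $grammar = Marpa::R2::Scanless::G->new( { source => \(<<~'END_OF_DSL') } );
        :default ::= action => [name,values]
        :start ::= Words
        Words ::= Word+
        :lexeme ~ Word pause => before event => 'before_word'
        Word ~ [\w]+
        :discard ~ whitespace
        whitespace ~ [\s]+
        END_OF_DSL
    my $input = 'hello brave new world';
    my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    my $pos   = $recce->read( \$input );
    while ( $pos < length $input ) {
        for my $event ( @{ $recce->events() } ) {
            next if $event->[0] ne 'before_word';
            # Custom code takes over: find the span Marpa paused at,
            # then read the lexeme ourselves.
            my ( $start, $length ) = $recce->pause_span();
            my $word = $recce->literal( $start, $length );
            $recce->lexeme_read( 'Word', $start, $length, $word )
                // die "Word rejected at position $start";
        }
        $pos = $recce->resume();
    }
    die 'No parse' if not defined $recce->value();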
About Marpa in general
Why Marpa works: table parsing. Marpa is different from the other widely used parsers. The others use stacks, state machines or both. Marpa is a table parser. This post explains the advantages of table parsing, but also the challenges faced in implementing it.
The solved problem that isn't, is. In an excellent post, Laurence Tratt described parsing as "the solved problem that isn't". Tratt expressed my reasons for creating Marpa very well. I claim that, with Marpa, the parsing problem is, in fact, now solved. In this post I state, point by point, why I believe I can accurately make that claim.
A grammar that exemplifies, describes and parses itself: This post contains the grammar for Marpa's new BNF interface, a grammar that is both self-description and self-example. In the 1970's self-reference was very much the fashion. But this grammar is not just for looks -- it drives its own compiler and will be the basis of its own development: it is handy enough as an interface to make dealing with a circular dependency worthwhile.
Marpa and the Ruby Slippers: An older post, but one with a nice explanation of one of Marpa's significant new features: Ruby Slippers parsing.
Fast, handy languages: Which languages are fast (that is, linear) for Marpa? Also, when should you NOT use Marpa?
Linear? Yeah, right: Marpa is linear for all the deterministic context-free grammars, and therefore for all the classes of grammar currently in practical use, or so I claim. Why should you believe that?
Why I Stuck with Perl: I liked Python very much, and at the time I started Marpa, Python looked like a better choice in terms of mindshare. In this post, I explain why the initial Marpa implementation uses Perl. This post has been very popular, even being translated into Japanese.
Parsing timeline and history
In the process of writing Marpa, I read and thought a lot about the various approaches to parsing, and how they developed. When I came to share my thoughts, I discovered quite a few people were also interested -- the posts in this section are my most popular.
Parsing: a timeline. A timeline of parsing research, describing how other algorithms pushed the parsers in the Earley lineage (like Marpa) aside, only for things to come down twisted. This is the most technical post I've ever done. Much to my surprise, its first version was a runaway hit: within its first two weeks it became my most popular post, by far. The development of the various parsing algorithms is an interesting story, one that has never really been told as a story before. Readers apparently liked this. [ Because of its length, the current version of the "timeline" is no longer a blog post -- it's a free-standing web page. ]
Parsing: top-down and bottom-up If "Parsing: a timeline" was about "when and how?", then this is about "why?" It describes the concepts behind the two major traditional approaches to parsing and confronts some questions: Why did the invention of bottom-up parsing create so much enthusiasm? Why, despite bottom-up's abandonment by practitioners, does it still play such a big role in classrooms and in textbooks? What does this say about the way forward? Is there one?
Like the "timeline", this is a popular post, in this case mostly thanks to the Google bots -- as of this writing it ranks high in organic searches. If this post interests you, you may also want to look at the blog post Top-down parsing is guessing.
Parsers and Useful Power. This post looks at the question "What do users want in a parser?" from a historical perspective. In particular, it looks at the contest between the Irons parser and recursive descent. The Irons parser is declarative and far more powerful than recursive descent. In terms of parsing at linear speed, there were tradeoffs, and in their pure forms, neither algorithm could parse the all-important arithmetic expressions.
But recursive descent could incorporate custom hacks, and those custom hacks could handle arithmetic expressions. Today recursive descent is everywhere, while to find a mention of the Irons parser, you need to make an exhaustive search of the more thorough textbooks.
Why is parsing considered solved? On what grounds would someone say that parsing is "solved"? To understand this, we need to look at the history of Parsing Theory. In fact, we'll have to start decades before computer Parsing Theory existed, with a now nearly-extinct school of linguistics and its desire to put itself on a strictly scientific basis. This post is a timeline in itself, and a revised version of it will eventually be incorporated into the v3 timeline.
Marpa and combinator parsing
Marpa is a better way of doing combinator parsing, as this set of tutorials sets out to show.
Marpa and combinator parsing. Marpa allows variable-length and ambiguous tokens. By combining this with external parsing, and generalizing lexers to combinators, Marpa becomes a better combinator parser -- more powerful and more flexible. (A small sketch of the mechanism follows this list.)
Marpa and combinator parsing 2. This post takes the ideas of "Marpa and combinator parsing", and provides an actual implementation.
A Haskell challenge. I received a friendly challenge to show that Marpa can do what Haskell's native parser cannot. This post answers it.
Measuring language popularity. Combinator parsing, in its full power, allows combinators, not just to be ambiguous and variable-length, but to overlap arbitrarily. It is hard to find a parsing problem which actually requires this capability, but language detection is one. In this post, Marpa shows off its full combinator parsing capabilities.
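To make the mechanism concrete, here is a hedged sketch, not code from the posts above: the main SLIF grammar pauses where a braced block starts, a "combinator" consumes the block, however long, and its result is handed back to the main parser as a single Block lexeme. For brevity the combinator is a regex; in the posts it is another full Marpa parser run over the same input. The grammar and all names are invented.

    use strict;
    use warnings;
    use Marpa::R2;
    my $grammar = Marpa::R2::Scanless::G->new( { source => \(<<~'END_OF_DSL') } );
        :default ::= action => [name,values]
        :start ::= Pairs
        Pairs ::= Pair+
        Pair ::= Name Block
        Name ~ [a-zA-Z]+
        # L0 only spots where a block starts; the combinator finds its extent.
        :lexeme ~ Block pause => before event => 'before_block'
        Block ~ '{'
        :discard ~ whitespace
        whitespace ~ [\s]+
        END_OF_DSL
    my $input = 'alpha { a b } beta { c }';
    my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    my $pos   = $recce->read( \$input );
    while ( $pos < length $input ) {
        for my $event ( @{ $recce->events() } ) {
            next if $event->[0] ne 'before_block';
            my ($start) = $recce->pause_span();
            # The "combinator": a sub-parser consuming an arbitrary-length span.
            pos($input) = $start;
            $input =~ m/ \G ( [{] [^{}]* [}] ) /xgc or die "No block at $start";
            $recce->lexeme_read( 'Block', $start, length $1, $1 )
                // die "Block rejected at position $start";
        }
        $pos = $recce->resume();
    }
    die 'No parse' if not defined $recce->value();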
Marpa versus Perl 6 grammars and PEG
In these blog posts, I discuss Perl 6 grammars, and compare them with Marpa. Perl 6 grammars are a PEG variant. I believe that Perl 6 grammars and PEG have a place when used as a "better" regular expression engine. But it can be shown that users (and language designers) whose hopes go beyond that are doomed to disappointment. In particular, if your interest is language-oriented programming, PEG and Perl 6 grammars lack the horsepower and the programmability to advance the state of the art.
Grammar reuse. Language-oriented programming requires the ability to piece grammars together from other grammars. The Perl 6 syntax makes this convenient, but the underlying parse engine does not adequately support it.
Top-down parsing is guessing. Perl 6 grammars have a top-down, PEG, parse engine. In this post I describe at a very basic level what it means to parse top-down, and argue that the current successes of top-down cannot be taken as evidence that the top-down approach can power language-oriented programming.
What are the reasonable computer languages? If you are doing language-oriented programming, what kinds of language should you support? A clue is provided by going back to the research of the 1960's and 1970's, when the researchers took an 'anything is possible' approach to language design, and asked for what they really wanted. These researchers did not "know", as we think we do today, that they were demanding the impossible.
PEG: ambiguity, precision and confusion. PEG is a new notation for a notoriously tricky algorithm that goes back to the earliest computers. In its PEG form, this algorithm acquired a seductive new interface, one that looks like the best of extended BNF combined with the best of regular expressions. Looking at a sample of it, you are tempted to imagine that writing a parser has suddenly become a very straightforward matter. Not so.
I'm a fan of the Perl 6 effort. And I certainly should be their supporter, after the many favors they've done for me and the Marpa community over the years. The considerations in these posts will disappoint those who hope to use the native Perl 6 parser for LOP applications. But Perl 6 is much more than that. It would be a mistake to read these posts as part of a dismissal of the Perl 6 effort as a whole.
Language-oriented programming
The Interpreter Design Pattern. The Language Design Pattern is from the very influential Design Patterns book, which was the subject of one of my tutorials. This post discusses the implications for the Language Design Pattern of making parsers much easier to write.
What if languages were free? In 1980, George Copeland wrote an article titled "What if Mass Storage were Free?" The cost of mass storage was about to fall dramatically, and Copeland, seeing the first signs of this, ran a thought experiment: what might happen if costs fell to the point where mass storage was, in effect, free? This post runs a similar thought experiment on languages. What might happen when the time and effort required to create a new language falls toward zero?
If you're interested in language-oriented programming with Marpa, the blog posts Fast, handy languages, Grammar reuse and What are the reasonable computer languages will also be of interest.
Designing languages
Marpa allows things to be done which could not be done before. Many of these things are significant for language design -- they allow languages to take radically new and aggressive approaches to their syntax.
Language Design: exploiting ambiguity. Actual ambiguity is a show-stopper. But in the present state of the art, even potential ambiguity is totally avoided in language design. Which only makes sense, right?
Well, actually, no. Our workhorse languages, the ones we humans use to communicate with each other, are full of potential ambiguities, and allowing ambiguity makes those languages more powerful and flexible. Could the same work for computer languages? This post explores that possibility. (A sketch of detecting actual ambiguity follows this section's list.)
Evolvable languages. The parsers in heaviest use today (LALR and recursive descent) parse only a very limited subset of what you can express in BNF. That's a barrier that a language starts bumping up against almost immediately, and bumps up against more and more frequently as it adds features.
But Marpa lets you keep extending your language: as long as you can describe your language extension in BNF, or in BNF plus some hacks, you can (essentially) still parse the language in linear time. "Language design has been like shooting crap in a casino that sets you up to win a lot of the first rolls before the laws of probability grind you down. Marpa changes the game."
Significant newlines? Or semicolons? Currently a programming language designer must choose: either make whitespace semantically significant, or use explicit punctuation to separate statements. But Lua does without either, by keeping the syntax restricted and tiny. And with the power of Marpa, you can also do without either, but at the same time have a language which is as large and as syntactically rich as you want.
Reporting mismatched delimiters. Delimiters are parentheses and brackets, but also quote marks and even HTML and XML tags -- anything that explicitly marks the beginning and end of some text. Delimiters are widely used, easy to get wrong and, in current practice, annoying to fix, in large part because current parsers are not good at diagnosing delimiter mismatches. Worse, a mismatched delimiter will almost certainly make the rest of a compiler's error messages so unreliable that they are effectively useless.
Marpa allows more accurate reporting of delimiter mismatches than traditional parsers, and in most cases Marpa can recover and go on to accurately report other errors. Read this blog post to find out how. Hint: the Ruby Slippers get involved.
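To make the distinction between potential and actual ambiguity concrete, here is a minimal, hedged sketch of detecting an actual ambiguity. The grammar is invented and deliberately ambiguous, and the sketch assumes Marpa::R2's documented SLIF behavior that successive value() calls iterate the parse trees.

    use strict;
    use warnings;
    use Marpa::R2;
    # No precedence, so "1 + 2 * 3" has two parses.
    my $grammar = Marpa::R2::Scanless::G->new( { source => \(<<~'END_OF_DSL') } );
        :default ::= action => [name,values]
        :start ::= E
        E ::= E '+' E
        E ::= E '*' E
        E ::= Number
        Number ~ [\d]+
        :discard ~ whitespace
        whitespace ~ [\s]+
        END_OF_DSL
    my $input = '1 + 2 * 3';
    my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    $recce->read( \$input );
    die 'No parse' if not defined $recce->value();
    if ( defined $recce->value() ) {
        # A second parse tree: the potential ambiguity became actual.
        # Report it as we would a syntax error; parentheses would fix it.
        die qq{"$input" is ambiguous; add parentheses};
    }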
How to parse HTML
This series describes a new strategy for parsing liberal HTML, using Marpa and the Ruby Slippers techniques it makes possible. It is the most fully worked out example of Marpa use that I've done to date. (A bare-bones sketch of the Ruby Slippers idea follows the list below.)
The Marpa-based HTML reformatter, html_fmt, is the Marpa application that I use the most. In fact, as I am typing this HTML, I am using it repeatedly as a filter, much in the same way that you'd use GNU indent for C code or perltidy for Perl.
When I left off with the first series, I mentioned that an HTML parser really should be configurable -- the user/application should be able to decide which variant of HTML they want to parse. The result is a parser driven by a configuration file that allows the user to reshape HTML. And, by the way, the configurable version is almost twice as fast.
- A configurable HTML parser
- A configurable HTML parser, part 2
- Configuring the Ruby Slippers for HTML
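In outline, the Ruby Slippers work like this. The sketch below is hedged and invented (it is not html_fmt's grammar), and it uses Marpa::R2's older token-by-token interface, on which the HTML posts are based: when input runs out while the grammar still wants a token, ask the parser what it expects and wish that token into existence.

    use strict;
    use warnings;
    use Marpa::R2;
    # Invented toy grammar: a paragraph must end with a p_end token,
    # which sloppy input often omits.
    my $grammar = Marpa::R2::Grammar->new(
        {   start => 'paragraph',
            rules => [ [ paragraph => [qw(text p_end)] ] ],
        }
    );
    $grammar->precompute();
    my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );
    $recce->read( 'text', 'Hello, world' );
    # Input has run out without a p_end.  Ask the parser what it wants
    # next and, if p_end is acceptable, supply one out of thin air:
    # the Ruby Slippers token takes us where we want to go.
    if ( grep { $_ eq 'p_end' } @{ $recce->terminals_expected() } ) {
        $recce->read( 'p_end', '' );
    }
    my $value_ref = $recce->value();
    die 'No parse' if not defined $value_ref;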
Marpa v. other parsers
These posts compare Marpa directly to other parsers, with benchmarks.
Marpa v. Perl regexes: some numbers pits Marpa directly against Perl regexes. Marpa will never beat regular expressions at their own game, but regular expressions are often taken out of their game, and in those cases the results can get interesting.
Marpa v. Perl regexes: a rematch revisits the comparison of the earlier post. In that benchmark, much of the advantage of Perl regexes came, not from faster parsing, but from lower overhead. This post contains a more direct comparison of the two parse engines.
Marpa v. Parse::RecDescent: some numbers. Recursive descent is currently very popular. This post takes an interesting example from another post, and rewrites it from Parse::RecDescent into Marpa. Marpa has equal time complexity, and its parse engine is written in heavily optimized C, while Parse::RecDescent is pure Perl. The result is what you'd expect.
The Bovicide Series
Yacc and its derivatives are considered the standard in parser generators. Yacc is part of a great series of traditions which it was my privilege to watch being formed, and I have the greatest respect for its inventors. But to my mind yacc is one tradition more honored in the breach than in the observance. In this series I describe what will be required for a parsing algorithm to displace yacc's LALR, and why I think the Marpa algorithm has what it takes and more. The first three requirements are described in my first post, Killing Yacc: 1, 2 & 3:
Requirement 1: Handle Arbitrary Context-Free Grammars in O(n³)
Requirement 2: Handle "Average" Grammars in Linear Time
Requirement 3: A Sound Theoretical Basis
Error reporting has long been overlooked, and it is something at which yacc was astonishingly bad. I looked at the issue of development-time error reporting in Why the Bovicidal Rage?:
Requirement 4: Easy Diagnosis of Grammar Problems
And in Bovicide 5: Parse-time Error Reporting, I looked at the other half of error reporting.
Requirement 5: Good Reporting of Parse-time Errors
Both hand-written recursive descent and Marpa meet, to various degrees, all five of the previous requirements. The last requirement, stated in Bovicide 6: The Final Requirement, was the tie-breaker. Hand-written recursive descent has many strengths, but it will never be available as a library.
Requirement 6: Available as a Library
The "Perl and parsing" series
My "Perl and parsing" series used Perl to illustrate parsing theory and parsing theory to investigate Perl. I am not one of those who believes that mathematics takes place in vacuum -- it follows trends and even what are more accurately called fashions. Therefore this series has a lot of discussion of how current parsing theory came to be.
Parsing with Ruby Slippers: This is the post where I debuted the "Ruby Slippers" parsing technique, using both HTML and an example from Perl to illustrate it.
Parsing Perl 2: Down the Garden Path: Larry Wall did an amazing job getting LALR to work for the Perl grammar, but occasionally the limits show through. The post contains an example.
Parsing Perl 3: Perl and Minimalism: I had intended a couple of posts about Perl & parsing. At this point I realized the topic was expanding on me. This post is about the influence of minimalism on computer languages and parsing theory. Larry Wall is perhaps the only language designer who has managed to escape the influence of minimalism, and that is the reason Perl is so different from other languages.
Minimalism: Philosophy, Tradition, or Lunacy? An explanation of why the programmers of my generation and prior -- the ones who shaped the field -- are, almost without exception, staunch minimalists. Not numbered as part of the series, but it fits into the discussion of minimalism.
Perl and Parsing 4: The Golden Age of Right Parsing: Having nibbled at philosophy, I now charge into a history of parsing theory. My approach is unusual. I treat parsing theory as what it is: a human endeavor. This post describes the rise of LR, or right parsing.
Perl and Parsing 5: Rewind: If you require evidence that parsing theory can be as much about trends, ideology and fashion as it is about mathematics and technology, here it is. The dominant parsing algorithm for production compilers today is hand-written recursive descent, exactly the one that was displaced in 1975, when the technology was almost unimaginably less powerful.
Perl and Parsing 6: Error Handling: A discussion of Perl error detection. Despite the limits of its underlying LALR parse engine, the Perl interpreter can seem psychic at times. I show how the conjuring is done, and give some other examples where the limits imposed by LALR parsing show up very clearly.
Perl and Parsing 7: Do List Operators have Left/Right Precedence?: Perl's documentation describes its list operators in terms of left- and right-precedence. Despite the fancy names, these are not terms from parsing theory and are not well-defined. In this post I show that, whatever the documentation means by these terms, it is wrong, and I propose an alternative.
Perl and Parsing 8: The Where and Why of Rejection: Sometimes Perl is just not that into your syntax. An illustration of the limits of LALR error detection, using Perl.
Perl and Parsing 9: "Use" and the Ruby Slippers: While Marpa is the first parser to allow it in anything close to its full power and generality, Ruby Slippers parsing is not unprecedented. Perl parses its use statement by dummying up tokens based on the grammar. But LALR is so unhelpful that, in order to use the Ruby Slippers, Perl has to simulate the parse inside the lexer.
Perl and Parsing 10: "Use" the Easier Way: In this post I use Marpa to parse Perl's use statement. Marpa is a lot better at the Ruby Slippers technique, but here the problem can be solved even more simply with another of Marpa's extended capabilities: ambiguous tokens. Ambiguous tokens are tokens whose type is left uncertain. It is left up to the parser to decide which token to use, something which Marpa can do easily. (A small sketch of ambiguous tokens follows this list.)
Perl and Parsing 11: Are all Perl programs parseable? The answer is no. Because I was the first to formally prove that Perl parsing is undecidable, I initially got the blame for the discovery, and now get the credit. While a very interesting result, its significance can be overstated. In particular, while undecidable parsing has definite benefits for the Perl user, no downside has ever been demonstrated.
Perl and Parsing 12: Beating up on the "use" statement. I had spotted a gotcha in the parsing of the "use" statement, which I saved up from the previous two posts. In the meantime chromatic noticed another. Piling on is really not fair but, hey, can you really blame me for unlucky timing?
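Here is a minimal, hedged sketch of ambiguous tokens in the SLIF (an invented grammar, not the code from the post): ModuleName and VersionNumber match exactly the same strings, so the lexer offers both token types for the same span, and the parser keeps the one its grammar allows.

    use strict;
    use warnings;
    use Marpa::R2;
    my $grammar = Marpa::R2::Scanless::G->new( { source => \(<<~'END_OF_DSL') } );
        :default ::= action => [name,values]
        :start ::= Stmts
        Stmts ::= Stmt+
        Stmt ::= ('mod') ModuleName
        Stmt ::= ('ver') VersionNumber
        # Identical patterns: every match is an ambiguous token.
        ModuleName ~ [\w.]+
        VersionNumber ~ [\w.]+
        :discard ~ whitespace
        whitespace ~ [\s]+
        END_OF_DSL
    my $input = 'mod Foo.Bar ver 5.010';
    my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    $recce->read( \$input );
    # After 'mod' only a ModuleName is expected, so of the two tokens
    # offered for "Foo.Bar" the parser takes the ModuleName.
    die 'No parse' if not defined $recce->value();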
Natural Language Processing
The syntax of English is undecidable. Many of those most interested in Marpa are from the natural language processing (NLP) community. In taking on natural language, as in taking on any other computing task, a reasonable first question is, "Is it decidable?". This post claims to answer that question.
Parsing Ada Lovelace. I wrote this post when I started exploring parsing ambiguity. Following this line of inquiry eventually resulted in a new way of handling ambiguity in programming languages -- allowing potential ambiguities in the grammar, but not allowing actual ambiguities. (An actual ambiguity, when it occurs, is handled in much the same way as a syntax error. And it can be fixed, for example, by adding parentheses, or whatever other punctuation eliminates the ambiguity.) My primary concern is programming language issues, not natural languages issues, but I find going back and forth between the two very fruitful.
What parser do birds use? Research shows the Japanese Great Tit, a bird whose brain weighs less than 1/20 of an ounce, uses a language that is beyond the capabilities of parsers in standard use today. I think future generations of programmers will be astonished at the way in which the current generation limits itself.
Theory
Behind Marpa is a lot of mathematics, a lot of it old, but also some new. The posts below are for those curious about the Theory side of Marpa. (The definitive source is my paper on the theory behind Marpa, but that assumes the reader is comfortable with the mathematics of Parsing Theory.)
Jay Earley's Idea: A user-friendly description of Earley's algorithm. Marpa's algorithm is much more complicated, but at heart, Marpa is an Earley parser.
Is Earley parsing fast enough?: In 1968, Earley's algorithm was considered too slow to be practical for production. Since then, Earley's algorithm has been improved, and computers have become over a billion times faster. Maybe, after 45 years, the old rule of thumb on Earley parsing speed needs revisiting.
What is the Marpa algorithm? lays out the essentials of the Marpa algorithm. Does not require the reader to be mathematically inclined, but even that reader might want to look here before digging into the actual theory paper.
The useful, the playful, the easy, the hard and the beautiful starts out describing a nice interface to Marpa written by Peter Stuifzand, and then gets into a discussion of how number series emerge from grammars. I especially enjoyed this post, as apparently did a number of readers.
Parsing left recursions. "A lot has been written about parsing left recursion. Unfortunately, much of it simply adds to the mystery. In this post, I hope to frame the subject clearly and briefly."
"Parsing with Pictures". This post compares my work on Marpa with the Might-Darais "derivatives" parsing. It uses a theoretical framework discovered by Pingali and Bilardi, and described in their 2012 paper, "Parsing with Pictures". Despite its title, the Pingali-Bilardi paper is heavy going, as is this blog post.
Visions
A warning to the reader: The posts in this section are my most misunderstood, and therefore, so far, can be called my least successful. These are posts in which I outline future directions, as plainly and simply as I know how, uncluttered by details. The problem seems to be that, without technical details and code examples, many readers take my words as pleasant speculations, and not, as I intend them, as descriptions of things that can be done now, today.
Parsing on your new hyper-quantum computer. Marpa is a non-deterministic parser, and those accustomed to coping with recursive descent are used to thinking in terms of a single thread. With non-determinism, you become free of all that backtracking and lookahead, and your parser becomes simpler and more modular. This requires a re-visioning of parsing, but it is very similar to the re-visioning you'd need to do if you were suddenly delivered new non-deterministic hardware and asked to program efficiently for it.
Making the parsing game safe: What Marpa means for Domain-Specific Languages (DSLs) and Language-Oriented Programming (LOP).
Marpa and OO: Object-oriented programming and how it relates to Marpa.
Older tutorials
These tutorials were written before the SLIF interface was created. Today, you'd want to use the SLIF in most of these cases. But most of their contents are still relevant, and most of the actual code would simply be carried over.
Domain-Specific Languages made simpler. This is the predecessor of "Making DSL's even simpler". It's an example of how to write a DSL without using Marpa's own DSL-writing DSL's, instead bootstrapping your own. It has two full DSL's in less than 600 lines of code.
Precedence parsing made simpler. Defining grammars in terms of precedence is an attractive technique. General parsing makes it a lot simpler. Here's a precedence parsing DSL where you don't have to define your operators, much less describe them as infix, postfix, circumfix, etc. Which means efficient precedence parsing can be conveniently extended to n-ary operations, and can even allow implied operators. (For how precedence looks in the present-day SLIF, see the sketch after this list.)
A Marpa DSL tutorial: Error reporting made easy. Your quickly written domain-specific language can, as of its first draft, have error reporting whose helpfulness and precision exceeds that of carefully hand-crafted production compilers. This tutorial shows how.
A Marpa tutorial: pattern searches. We use regular expressions for pattern searching these days. But what if your search target is not a regular expression? This tutorial shows how to use Marpa to search text files for arbitrary context-free expressions.
A Marpa tutorial: iterative parser development. Unanchored search and error reporting are examples of the same general problem: finding a desired pattern/language when it is embedded in other, arbitrary, text. This approach can also be used as a test-driven method for parser development.
You start with a small subset, and to test how far you've succeeded, you "search" for it in the text you are trying to parse. As you add syntax, you "find" more and larger "targets" in your texts. When all "searches" of valid texts find a single "target" which spans the entire text, the language is finished.
This tutorial gives an example of one iteration in the development cycle of a parser. It takes the language of the previous post (a small subset of Perl expressions) and extends it a bit.
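The precedence post above predates the SLIF. For comparison, today's SLIF lets you write a precedence scheme directly into the grammar source. Below is a hedged, calculator-style sketch (the grammar is invented for illustration), with '||' separating precedence tiers, loosest last, and assoc adverbs marking associativity.

    use strict;
    use warnings;
    use Marpa::R2;
    my $grammar = Marpa::R2::Scanless::G->new( { source => \(<<~'END_OF_DSL') } );
        :default ::= action => [name,values]
        :start ::= Expression
        Expression ::=
              Number
            | ('(') Expression (')') assoc => group
           || Expression '**' Expression assoc => right
           || Expression '*' Expression
            | Expression '/' Expression
           || Expression '+' Expression
            | Expression '-' Expression
        Number ~ [\d]+
        :discard ~ whitespace
        whitespace ~ [\s]+
        END_OF_DSL
    my $input = '4 + 5 * 3 ** 2';
    my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    $recce->read( \$input );
    die 'No parse' if not defined $recce->value();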
These posts, while not full-fledged tutorials, have a "how-to" focus.
Developing parsers incrementally with Marpa: Hints on how to get started with Marpa::XS.
What? No Lexer!: A generation of programmers have grown up identifying lexing with parsing. Marpa is not a lexer, and Marpa::XS itself does not contain a lexer. The tutorials above contain very nice lexers, written in a few short lines of Perl.