Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website

Google+

Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Wed, 06 Jan 2016


What are the reasonable computer languages?

"You see things; and you say 'Why?' But I dream things that never were; and I say 'Why not?'" -- George Bernard Shaw

In the 1960's and 1970's computer languages were evolving rapidly. It was not clear which way they were headed. Would most programming be done with general-purpose languages? Or would programmers create a language for every task domain? Or even for every project? And, if lots of languages were going to be created, what kinds of languages would be needed?

It was in that context that Čulik and Cohen, in a 1973 paper, outlined what they thought programmers would want and should have. In keeping with the spirit of the time, it was quite a lot:

Today, we think we know that Čulik and Cohen's vision was naive, because we think we know that parsing technology cannot support it. We think we know that parsing is much harder than they thought.

The eyeball grammars

As a thought problem, consider the "eyeball" class of grammars. The "eyeball" class of grammars contains all the grammars that a human can parse at a glance. If a grammar is in the eyeball class, but a computer cannot parse it, it presents an interesting choice. Either,

There are some people out there (I am one of them) who don't believe that everything the mind can do reduces to a machine computation. But even those people will tend to go for the choice in this case: There must be some practical computer parsing algorithm which can do at least as well at parsing as a human can do by "eyeball". In other words, the class of "reasonable grammars" should contain the eyeball class.

Čulik and Cohen's candidate for the class of "reasonable grammars" were the grammars that a deterministic parse engine could parse if it had a lookahead that was infinite, but restricted to distinguishing between regular expressions. They called these the LR-regular, or LRR, grammars. And the LRR grammars do in fact seem to be a good first approximation to the eyeball class. They do not allow lookahead that contains things that you have to count, like palindromes. And, while I'd be hard put to eyeball every possible string for every possible regular expression, intuitively the concept of scanning for a regular expression does seem close to capturing the idea of glancing through a text looking for a telltale pattern.

So what happened?

Alas, the algorithm in the Čulik and Cohen paper turned out to be impractical. But in 1991, Joop Leo discovered a way to adopt Earley's algorithm to parse the LRR grammars in linear time, without doing the lookahead. And Leo's algorithm does have a practical implementation: Marpa.

References, comments, etc.

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.


posted at: 10:24 | direct link to this entry

§         §         §

Sun, 20 Dec 2015


Top-down parsing is guessing

Top-down parsing is guessing. Literally. Bottom-up parsing is looking.

The way you'll often hear that phrased is that top-down parsing is looking, starting at the top, and bottom-up parsing is looking, starting at the bottom. But that is misleading, because the input is at the bottom -- at the top there is nothing to look at. A usable top-down parser must have a bottom-up component, even if that component is just lookahead.

A more generous, but still accurate, way to describe the top-down component of parsers is "prediction". And prediction is, indeed, a very useful component of a parser, when used in combination with other techniques.

Of course, if a parser does nothing but predict, it can predict only one input. Top-down parsing must always be combined with a bottom-up component. This bottom-up component may be as modest as lookahead, but it must be there or else top-down parsing is really not parsing at all.

So why is top-down parsing used so much?

Top-down parsing may be unusable in its pure form, but from one point of view that is irrelevant. Top-down parsing's biggest advantage is that it is highly flexible -- there's no reason to stick to its "pure" form.

A top-down parser can be written as a series of subroutine calls -- a technique called recursive descent. Recursive descent allows you to hook in custom-written bottom-up logic at every top-down choice point, and it is a technique which is completely understandable to programmers with little or no training in parsing theory. When dealing with recursive descent parsers, it is more useful to be a seasoned, far-thinking programmer than it is to be a mathematician. This makes recursive descent very appealing to seasoned, far-thinking programmers, and they are the audience that counts.

Switching techniques

You can even use the flexibility of top-down to switch away from top-down parsing. For example, you could claim that a top-down parser could do anything my own parser (Marpa) could do, because a recursive descent parser can call a Marpa parser.

A less dramatic switchoff, and one that still leaves the parser with a good claim to be basically top-down, is very common. Arithmetic expressions are essential for a computer language. But they are also among the many things top-down parsing cannot handle, even with ordinary lookahead. Even so, most computer languages these days are parsed top-down -- by recursive descent. These recursive descent parsers deal with expressions by temporarily handing control over to an bottom-up operator precedence parser. Neither of these parsers is extremely smart about the hand-over and hand-back -- it is up to the programmer to make sure the two play together nicely. But used with caution, this approach works.

Top-down parsing and language-oriented programming

But what about taking top-down methods into the future of language-oriented programming, extensible languages, and grammars which write grammars? Here we are forced to confront the reality -- that the effectiveness of top-down parsing comes entirely from the foreign elements that are added to it. Starting from a basis of top-down parsing is literally starting with nothing. As I have shown in more detail elsewhere, top-down techniques simply do not have enough horsepower to deal with grammar-driven programming.

Perl 6 grammars are top-down -- PEG with lots of extensions. These extensions include backtracking, backtracking control, a new style of tie-breaking and lots of opportunity for the programmer to intervene and customize everything. But behind it all is a top-down parse engine.

One aspect of Perl 6 grammars might be seen as breaking out of the top-down trap. That trick of switching over to a bottom-up operator precedence parser for expressions, which I mentioned above, is built into Perl 6 and semi-automated. (I say semi-automated because making sure the two parsers "play nice" with each other is not automated -- that's still up to the programmer.)

As far as I know, this semi-automation of expression handling is new with Perl 6 grammars, and it may prove handy for duplicating what is done in recursive descent parsers. But it adds no new technique to those already in use. And features like

all of which are available and in current use in Marpa, are impossible for the technology behind Perl 6 grammars.

I am a fan of the Perl 6 effort. Obviously, I have doubts about one specific set of hopes for Perl 6 grammars. But these hopes have not been central to the Perl 6 effort, and I will be an eager student of the Perl 6 team's work over the coming months.

Comments

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.


posted at: 15:21 | direct link to this entry

§         §         §

Sun, 13 Dec 2015


Grammar reuse

Every year the Perl 6 community creates an "Advent" series of posts. I always follow these, but one in particular caught my attention this year. It presents a vision of a future where programming is language-driven. A vision that I share. The post went on to encourage its readers to follow up on this vision, and suggested an approach. But I do not think the particular approach suggested would be fruitful. In this post I'll explain why.

Reuse

The focus of the Advent post was language-driven programming, and that is the aspect that excites me most. But the points that I wish to make are more easily understood if I root them in a narrower, but more familiar issue -- grammar reuse.

Most programmers will be very familiar with grammar reuse from regular expressions. In the regular expression ("RE") world, programming by cutting and pasting is very practical and often practiced.

For this post I will consider grammar reusability to be the ability to join two grammars and create a third. This is also sometimes called grammar composition. For this purpose, I will widen the term "grammar" to include RE's and PEG parser specifications. Ideally, when you compose two grammars, what you get is

Not all language representations are reusable. RE's are, and BNF is. PEG looks like a combination of BNF and RE's, but PEG, in fact, is its own very special form of parser specification. And PEG parser specifications are one of the least reusable language representations ever invented.

Reuse and regular expressions

RE's are as well-behaved under reuse as a language representation can get. The combination of two RE's is always another RE, and you can reasonably determine what language the combined RE recognizes by examining it. Further, every RE is parseable in linear time.

The one downside, often mentioned by critics, is that RE's do not scale in terms of readability. Here, however, the problem is not really one of reusability. The problem is that RE's are quite limited in their capabilities, and programmers often exploit the excellent behavior of RE's under reuse to push them into applications for which RE's just do not have the power.

Reuse and PEG

When programmers first look at PEG syntax, they often think they've encountered paradise. They see both BNF and RE's, and imagine they'll have the best of each. But the convenient behavior of RE's depends on their unambiguity. You simply cannot write an unambiguous RE -- it's impossible.

More powerful and more flexible, BNF allows you to describe many more grammars -- including ambiguous ones. How does PEG resolve this? With a Gordian knot approach. Whenever it encounters an ambiguity, it throws all but one of the choices away. The author of the PEG specification gets some control over what is thrown away -- he specifies an order of preference for the choices. But degree of control is less than it seems, and in practice PEG is the nitroglycerin of parsing -- marvelous when it works, but tricky and dangerous.

Consider these 3 PEG specifications:

	("a"|"aa")"a"
	("aa"|"a")"a"
	A = "a"A"a"/"aa"

All three clearly accept only strings which are repetitions of the letter "a". But which strings? For the answers, suggestions for dealing with PEG if you are committed to it, and more, look at my previous post on PEG.

When getting an RE or a BNF grammar to work, you can go back to the grammar and ask yourself "Does my grammar look like my intended language?". With PEG, this is not really possible. With practice, you might get used to figuring out single line PEG specs like the first two above. (If you can get the last one, you're amazing.) But tracing these through multiple rule layers required by useful grammars is, in practice, not really possible.

In real life, PEG specifications are written by hacking them until the test suite works. And, once you get a PEG specification to pass the test suite for a practical-sized grammar, you are very happy to leave it alone. Trying to compose two PEG specifications is rolling the dice with the odds against you.

Reuse and the native Perl 6 parser

The native Perl 6 parser is an extended PEG parser. The extensions are very interesting from the PEG point of view. The PEG "tie breaking" has been changed, and backtracking can be used. These features mean the Perl 6 parser can be extended to languages well beyond what ordinary PEG parsers can handle. But, if you use the extra features, reuse will be even trickier than if you stuck with vanilla PEG.

Reuse and general BNF parsing

As mentioned, general BNF is reusable, and so general BNF parsers like Marpa are as reusable as regular expressions, with two caveats. First, if the two grammars are not doing their own lexing, their lexers will have to be compatible.

Second, with regular expressions you had the advantage that every regular expression parses in linear time, so that speed was guaranteed to be acceptable. Marpa users reuse grammars and pieces of grammars all the time. The result is always the language specified by the merged BNF, and I've never heard anyone complain that performance deterioriated.

But, while it may not happen often, it is possible to combine two Marpa grammars that run in linear time and end up with one that does not. You can guarantee your merged Marpa grammar will stay linear if you follow 2 rules:

Unmarked middle recursions are not things you're likely to need a lot: they are those palindromes where you have to count to find the middle: grammars like "A ::= a | a A a". If you use a middle recursion at all, it is almost certainly going to be marked, like "A ::= b | a A a", which generates strings like "aabaa". With Marpa, as with RE's, reuse is easy and practical. And, as I hope to show in a future post, unlike RE's, Marpa opens the road to language-driven programming.

Perl 6

I'm a fan of the Perl 6 effort. I certainly should be a supporter, after the many favors they've done for me and the Marpa community over the years. The considerations of this post will disappoint some of the hopes for applications of the native Perl 6 parser. But these applications have not been central to the Perl 6 effort, of which I will be an eager student over the coming months.

Comments

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.


posted at: 10:13 | direct link to this entry

§         §         §

Sat, 29 Aug 2015


Fast handy languages

Back around 1980, I had access to UNIX and a language I wanted to parse. I knew that UNIX had all the latest CS tools. So I expected to type in my BNF and "Presto, Language!".

Not so easy, I was told. Languages were difficult things created with complex tools written by experts who understood the issues. I recall thinking that, while English had a syntax that is as hard as they come, toddlers manage to parse it just fine. But experts are experts, and more so at second-hand.

I was steered to an LALR-based parser called yacc. (Readers may be more familiar with bison, a yacc successor.) LALR had extended the class of quickly parseable grammars a bit beyond recursive descent. But recursive descent was simple in principle, and its limits were easy to discover and work around. LALR, on the hand, was OK when it worked, but figuring out why it failed when it failed was more like decryption than debugging, and this was the case both with parser development and run-time errors. I soon gave up on yacc and found another way to solve my problem.

Few people complained about yacc on the Internet. If you noise it about that you are unable to figure out how to use what everybody says is the state-of-the-art tool, the conclusions drawn may not be the ones you want. But my experience seems to have been more than common.

LALR's claim to fame was that it was the basis of the industry-standard C compiler. Over three decades, its maintainers suffered amid the general silence. But by 2006, they'd had enough. GCC (the new industry standard) ripped its LALR engine out. By then the trend back to recursive descent was well underway.

A surprise discovery

Back in the 1970's, there had been more powerful alternatives to LALR and recursive descent. But they were reputed to be slow.

For some applications slow is OK. In 2007 I decided that a parsing tool that parsed all context-free languages at state-of-the-art speeds, slow or fast as the case might be, would be a useful addition to programmer toolkits. And I ran into a surprise.

Hidden in the literature was an amazing discovery -- an 1991 article by Joop Leo that described how to modify Earley's algorithm to be fast for every language class in practical use. (When I say "fast" in this article, I will mean "linear".) Leo's article had been almost completely ignored -- my project (Marpa) would become its first practical implementation.

Second-order languages

The implications of Leo's discovery go well beyond speed. If you can rely on the BNF that you write always producing a practical parser, you can auto-generate your language. In fact, you can write languages which write languages.

Which languages are fast?

The Leo/Earley algorithm is not fast for every BNF-expressible language. BNF is powerful, and you can write exponentially ambiguous languages in it. But programmers these days mostly care about unambiguous languages -- we are accustomed to tools and techniques that parse only a subset of these.

As I've said, Marpa is fast for every language class in practical use today. Marpa is almost certainly fast for any language that a modern programmer has in mind. Unless you peek ahead at the hints I am about to give you, in fact, it is actually hard to write an unambiguous grammar that goes non-linear on Marpa. Simply mixing up lots of left, right and middle recursions will not be enough to make an unambiguous grammar go non-linear. You will also need to violate a rule in the set that I am about to give you.

To guarantee that Marpa is fast for your BNF language, follow three rules:

Rule 3 turns out to be very easy to obey. I discuss it in detail in the next section, which will be about how to break these rules and get away with it.

Before we do that, let's look at what an "unmarked" middle recursion is. Here's an example of a "marked" middle recursion:

       M ::= 'b'
       M ::= 'a' M 'a'
    

Here the "b" symbol is the marker. This marked middle recursion generates sequences like

       b
       a b a
       a a b a a
    

Now for an "unmarked" middle recursion:

       M ::= 'a' 'a'
       M ::= 'a' M 'a'
    

This unmarked middle recursion generates sequences like

       a a
       a a a a
       a a a a a a
    

In this middle recursion there is no marker. To know where the middle is, you have to scan all the way to the end, and then count back.

A rule of thumb is that if you can "eyeball" the middle of a long sequence, the recursion is marked. If you can't, it is unmarked. Unfortunately, we can't characterize exactly what a marker must look like -- a marker can encode the moves of a Turing machine, so marker detection is undecidable.

How to get away with breaking the rules

The rules about ambiguity and recursions are "soft". If you only use limited ambiguity and keep your rule-breaking recursions short, your parser will stay fast.

Above, I promised to explain rule 3, which insisted that a right recursive symbol be "dedicated". A right recursive symbol is "dedicated" if it appears only as the recursive symbol in a right recursion. If your grammar is unambiguous, but you've used an "undedicated" right-recursive symbol, that is easy to fix. Just rewrite the grammar, replacing the "undedicated" symbol with two different symbols. Dedicate one of the two to the right recursion, and use the other symbol everywhere else.

When NOT to use Marpa

The languages I have described as "fast" for Marpa include all those in practical use and many more. But do you really want to use Marpa for all of them? Here are four cases for which Marpa is probably not your best alternative.

The first case: a language that parses easily with a regular expression. The regular expression will be much faster. Don't walk away from a good thing.

The second case: a language that is easily parsed using a single loop and some state that fits into constant space. This parser might be very easy to write and maintain. If you are using a much slower higher level language, Marpa's optimized C language may be a win on CPU speed. But, as before, if you have a good thing, don't walk away from it.

The third case: a variation on the second. Here your single loop might be getting out of hand, making you yearn for the syntax-driven convenience of Marpa, but your state still fits into constant space. In its current implementation, Marpa keeps all of its parse tables forever, so Marpa does not parse in constant space. Keeping the tables allows Marpa to deal with the full structure of its input, in a way that a SAX-ish approaches cannot. But if space requirements are an issue, and your application allows a simplified constant-space approach, Marpa's power and convenience may not be enough to make up for that.

The fourth case: a language that

It is rare that all of these are the case, but when that happens, recursive descent is often preferable to Marpa. Lua and JSON are two languages which meet the above criteria. In Lua's case, it targets platforms with very restricted memories, which is an additional reason to prefer recursive descent -- Marpa has a relatively large footprint.

It was not good luck that made both Lua and JSON good targets for recursive descent -- they were designed around its limits. JSON is a favorite test target of Marpa for just these reasons. There are carefully hand-optimized C language parsers for us to benchmark against.

We get closer and closer, but Marpa will never beat small hand-optimized JSON parsers in software. However, while recursive descent is a technique for hand-writing parsers, Marpa is a mathematical algorithm. Someday, instructions for manipulating Earley items could be implemented directly in silicon. When and if that day comes, Earley's algorithm will beat recursive descent even at parsing the grammars that were designed for it.

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.


posted at: 16:36 | direct link to this entry

§         §         §

Tue, 10 Mar 2015


Linear? Yeah right.

Linear?

I have claimed that my new parser, Marpa, is linear for vast classes of grammars, going well beyond what the traditional parsers can do. But skepticism is justified. When it comes to parsing algorithms, there have been a lot of time complexity claims that are hand-wavy, misleading or just plain false. This post describes how someone, who is exercising the appropriate degree of skepticism, might conclude that believing Marpa's claims is a reasonable and prudent thing to do.

Marpa's linearity claims seem to be, in comparison with the other parsers in practical use today, bold. Marpa claims linearity, not just for every class of grammar for which yacc/bison, PEG and recursive descent currently claim linearity, but for considerably more. (The mathematical details of these claims are in a section at the end.) It seems too good to be true.

Why should I believe you?

The most important thing to realize, in assessing the believability of Marpa's time complexity claims, is that they are not new. They were already proved in a long-accepted paper in the refereed literature. They are the time complexity claims proved by Joop Leo for his algorithm in 1991, over two decades ago. Marpa is derived from Leo's algorithm, and its time complexity claims are those proved for Leo's algorithm.

Above I said that Marpa's time complexity claims "seem" bold. On any objective assessment, they are in fact a bit of a yawn. The claims seem surprising only because a lot of people are unaware of Leo's results. That is, they are surprising in the same sense that someone who had avoided hearing about radio waves would be surprised to learn that he can communicate instantly with someone on the other side of the world.

So, if there's so little to prove, why does the Marpa paper have proofs? In Marpa, I made many implementation decisions about, and some changes to, the Leo/Earley algorithm. None of my changes produced better time complexity results -- my only claim is that I did not change the Leo/Earley algorithm in a way that slowed it down. To convince myself of this claim, I reworked the original proofs of Leo and Earley, changing them to reflect my changes, and demonstrated that the results that Leo had obtained still held.

Proofs of this kind, which introduce no new mathematical techniques, but simply take a previous result and march from here to there by well-know means, are called "tedious". In journals, where there's a need to conserve space, they are usually omitted, especially if, as is the case with Marpa's time complexity proofs, the results are intuitively quite plausible.

Getting from plausible to near-certain

So let's say you are not going to work through every line of Marpa's admittedly tedious proofs. We've seen that the results are intuitively plausible, as long as you don't reject the previous literature. But can we do better than merely "plausible"?

As an aside, many people misunderstand the phrase "mathematically proven", especially as it applies to branches of math like parsing theory. The fact is that proofs in papers often contain errors. Usually these are minor, and don't affect the result. On the other hand, Jay Earley's paper, while one of the best Computer Science papers ever published, also contained a very serious error. And this error slipped past his Ph.D. committee and his referees. Mathematical arguments and proofs do not allow us to achieve absolute certainty. They can only improve the degree of certainty.

There's a second way to dramatically increase your degree of conviction in Marpa's linearity claims, and it is quite simple. Create examples of problematic grammars, run them and time them. This is not as satisfying as a mathematical proof, because no set of test grammars can be exhaustive. But if you can't find a counter-example to Marpa's linearity claims among the grammars of most interest to you, that should help lift your level of certainty to "certain for all practical purposes".

Much of this increase in certainty can be achieved without bothering to run your own tests. Marpa is in wide use at this point. If Marpa was going quadratic on grammars for which it claimed to be linear, and these were grammars of practical interest, that would very likely have been noticed by now.

I'm still not convinced

Let's suppose all this has not brought you to the level of certainty you need to use Marpa. That means the reasonable thing is to continue to struggle to work with the restrictions of the traditional algorithms, right? No, absolutely not.

OK, so you don't believe that Marpa preserves the advances in power and speed made by Leo. Does that mean that parsers have to stay underpowered? No, it simply means that there should be a more direct implementation of Leo's 1991, bypassing Marpa.

But if you are looking for an implementation of Leo's 1991 algorithm, I think you may end up coming back to Marpa as the most reasonable choice. Marpa's additional features include the ability to use custom, procedural logic, as you can with recursive descent. And Marpa has worked out a lot of the implementation details for you.

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Appendix: Some technical details

Above I talked about algorithms, classes of grammars and their linearity claims. I didn't give details because most folks aren't interested. For those who are, they are in this section.

yacc is linear for a grammar class called LALR, which is a subset of another grammar class called LR(1). If you are willing to hassle with GLR, bison claims linearity for all of LR(1). Recursive descent is a technique, not an algorithm, but it is top-down with look-ahead, and therefore can be seen as some form of LL(k), where k depends on how it is implemented. In practice, I suspect k is never much bigger than 3, and usually pretty close to 1. With packratting, PEG can be made linear for everything it parses but there is a catch -- only in limited cases do you know what language your PEG grammar actually parses. In current practice, that means your PEG grammar must be LL(1). Some of the PEG literature looks at techniques for extending this as far as LL-regular, but there are no implementations, and it remains to be seen if the algorithms described are practical.

The Marpa paper contains a proof, based on a proof of the same claim by Joop Leo, that Marpa is linear for LR-regular grammars. The LR-regular grammars include the LR(k) grammars for every k. So Marpa is linear for LR(1), LR(2), LR(8675309), etc. LR-regular also includes LL-regular. So every class of grammar under discussion in the PEG literature is already parsed in linear time by Marpa. From this, it is also safe to conclude that, if a grammar can be parsed by anything reasonably described as recursive descent, it can be parsed in linear time by Marpa.


posted at: 22:01 | direct link to this entry

§         §         §