Ocean of Awareness

Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest

Jeffrey's personal website


Marpa resources

The Marpa website

The Ocean of Awareness blog: home page, chronological index, and annotated index.

Mon, 28 May 2018

Is a language just a set of strings?

But to my mind, though I am native here
And to the manner born, it is a custom
More honor’d in the breach than the observance.[1]

Chomsky's "Three Models" paper

Important papers produce important mistakes. A paper can contain a great many errors, and they will have no effect if the paper is ignored. On the other hand, even the good methods of a great paper can go badly wrong when its methods outlive the reasons for using them.

Chomsky's "Three Models" paper[2] is about as influential as a paper can get. Just 12 pages, it's the paper in which the most-cited scholar of our time first outlined his ideas. Even at the time, linguists described its effect on their field as "Copernician".[3] Bringing new rigor into what had been seen as a "soft" science, it turned lots of heads outside linguistics. It belongs on anyone's list of the most important scientific papers ever.[4]

Given its significance, it is almost incidental that "Three models" is also the foundation paper of computer Parsing Theory, the subject of these blog posts. Chomsky does not consider himself a computer scientist and, after founding our field, has paid little attention to it. But in fact, the Chomskyan model has been even more dominant in computer parsing than in Chomsky's own field of linguistics.

"Three Models" places Chomksy among the great mathematicians of all time. True, the elegance and rigor of Chomsky's proofs better befit a slumming linguist than they would a professional mathematician. But at its heart, mathematics is not a technical field, or even about problem-solving -- at its most fundamental, it is about framing problems so that they can be solved. And Chomsky's skill at framing problems is astonishing.

A brilliant simplification

In 1956, Chomsky had a new approach to linguistics, and wanted to prove that his approach to language did things that the previous approach, based on finite-state models, could not. ("Finite-state" models, also known as Markov chains, are the predecessors of the regular expressions of today.) Brilliantly, Chomsky sets out to do this with extremely minimal definition of what a language is.

By a language then, we shall mean a set (finite or infinite) of sentences, each of finite length, all constructed from a finite alphabet of sysbols. If A is an alphabet, we shall say that anything formed by concatenating the symbols of A is a string in A. By a grammar of the language L we mean a device of some sort that produces all of the strings that are sentences of L and only these.[5]

Yes, you read that right -- Chomsky uses a definition of language which has nothing to do with language actually meaning anything. A language, for the purposes of the math in "Three Models", is nothing but a list of strings. Similarly, a grammar is just something that enumerates those strings. The grammar does not have to provide any clue as to what the strings might mean.

For example, Chomsky would require of a French grammar that one of the strings that it lists be

(42) Ceci n'est pas une phrase vraie.

But for the purposes of his demonstration, Chomsky does not require of his "grammar" that it give us any guidance as to what sentence (42) might mean.

Chomsky shows that there are English sentences that his "grammar" would list, and which a finite-state "grammar" would not list. Clearly if the finite-state grammar cannot even produce a sentence as one of a list, it is not adequate as a model of that language, at least as far as that sentence goes.

Chomsky shows that there is, in fact, a large, significant class of sentences that his "grammars" can list, but which the finite-state grammars cannot list. Chomsky presents this as very strong evidence that his grammars will make better models of language than finite-state grammars can.

Other considerations

In addition to simplifying the math, Chomsky has two other good reasons to avoid dealing with meaning. A second reason is that semantics is a treacherously dangerous field of study. If you can make your point, and don't have to drag in semantics, you are crazy to do otherwise. Sentence (42), above, is just one example of the pitfalls that await those tackle who semantic issues. It echoes a famous Magritte and translates to "This is not a true sentence".

A third reason is that most linguists of Chomsky's time were Bloomfieldians. Bloomfield defined language as follows:

The totality of utterances that can be made in a speech community is the language of that speech-community.[6]

Bloomfield says "totality" instead of "set" and "utterances" instead of "strings", but for our purposes in this post the idea is the same -- the definition is without regard to the meaning of the members of the set.[7]

Bloomfield's omission of semantics is not accidental. Bloomfield wanted to establish linguistics as a science, and for Bloomfield claiming to know the meaning of a sentence was dangerously close to claiming to be able to read minds. You cannot base your work on mind-reading and expect people to believe that you are doing science. Bloomfield therefore suggested avoiding, totally if possible, any discussion of semantics. Most readers of Chomsky's paper in 1956 were Bloomfieldians -- Chomsky has studied under a Bloomfieldian, and originally was seen as one.[8] By excluding semantics from his own model of language, Chomsky was making his paper maximally acceptable to his readership.

Semantics sneaks back in

But you did not have to read Chomsky's mind, or predict the future, to see that Chomsky was a lot more interested in semantics than Bloomfield was. Already in "Three Models", he is suggesting that his model is superior to its predecessors, because his model, when an utterance is ambiguous, produces multiple derivations to reflect that. Even better, these multiple derivations "look" like natural representations of the difference between meanings.[9] These insights, which dropped effortlessly out of Chomsky's grammars, were well beyond what the finite-state models were providing.

By itself, Chomsky's argument, that his grammars were better because they could list more sentences, might have carried the day. With the demonstration that his grammars could do more than list sentences, but also could proivde insight into the structure and semantics of sentences, Chomsky's case was compelling. Young linguists wanted theoretical tools with this kind of power and those few older linguists not convinced struggled to find reasons why the young linguists could not have what they wanted.

In later years, Chomsky made it quite clear what his position was:

[...] it would be absurd to develop a general syntactic theory without assigning an absolutely crucial role to semantic considerations, since obviously the necessity to support semantic interpretation is one of the primary requirements that the structures generated by the syntactic component of a grammar must meet.[10]
Compare this to Bloomfield:
The statement of meanings is therefore the weak point in language-study, and will remain so until human knowledge advances very far beyond its present state. In practice, we define the meaning of a linguistic form, wherever we can, in terms of some other science.[11]
It is easy to see why linguists found Chomsky's expansion of their horizons irresistable.

The tradition

Given the immense prestige of "Three models", it is unsurprising that it was closely studied by the pioneers of parsing theory. Unfortunately, what they picked up was not Chomsky's emphasis on the overriding importance of semantics, but the narrow definition of language that Chomsky had adopted from Bloomfield for tactical purposes. In the classic Aho and Ullman 1972 textbook, we have

A language over an alphabet Σ is a set of strings over an alphabet Σ. This definition encompasses almost everyone's notion of a language.[12]
If this "encompasses" my notion of a language, it does so only in the sense that an avalanche encompasses a skier.

From 1988, thirty years after Chomsky, here is another authoritative textbook of Parsing Theory defining "language":

A set V is an alphabet (or a vocabulary) if it is finite and nonempty. The elements of an alphabet V are called the symbols (or letters or characters) of V. A language L over V is any subset of the free monoid V*. The elements of a language L are called sentences of L.[13]

The language is now that of abstract algebra, but the idea is the same -- pure Bloomfield.[14]

The problem

Interesting, you might be saying, that some textbook definitions are not everything they could be, but is there any effect on the daily practice of programming?

The languages human beings use with each other are powerful, varied, flexible and endlessly retargetable. The parsers we use to communicate with computers are restrictive, repetitive in form, difficult to reprogram, and prohibitively hard to retarget. Is this because humans have a preternatural language ability?

Or is there something wrong with the way we go about talking to computers? How the Theory of Parsing literature defines the term "language" may seem of only pedantic interest. But I will argue that it is a mistake which has everything to do with the limits of modern computer languages.

What is the problem with defining a language as a set of strings? Here is one example of how the textbook definition affects daily practice. Call one grammar SENSE:

      SENSE ::= E
      E ::= E + T
      E ::= T
      T ::= T * P
      T ::= P
      P ::= number

And call another grammar STRING:

      STRING ::= E
      E  ::= P OP E
      OP ::= '*'
      OP ::= '+'
      P  ::= number

If you define a language as a set of strings, both SENSE and STRING recognize the same language. But it's a very different story if you take the intended meaning as that of traditional arithmetic expressions, and consider the meaning of the two grammars.

SENSE recognizes the associativity and precedence of the two operators -- the parse tree it produces could be used directly to evaluate an arithmetic expression and the answer would always be correct. The parse tree that STRING produces, if evaluated directly, will very often produce a wrong answer -- it does not capture the structure of an arithmetic expression. In order to produce correct results, the output of STRING could be put through a second phase, but that is the point -- STRING left crucial parts of the job of parsing undone, and either some other logic does the job STRING did not do, or a wrong answer results.

It is much easier to write a parser for STRING than it is for SENSE. Encouraged by a theory that minimizes the difference, many implementations attempt to make do with STRING.

But that is not the worst of it. The idea that a language is a set of strings has guided research, and steered it away from the most promising lines. How, I hope to explain in the future.[15]

Comments, etc.

The background material for this post is in my Parsing: a timeline 3.0, and this post may be considered a supplement to "Timelime". To learn about Marpa, my Earley/Leo-based parsing project, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.


1. Hamlet, Act I, scene iv.

2. Chomsky, Noam. "Three models for the description of language." IRE Transactions on information theory, vol. 2, issue 3, September 1956, pp. 113-124.

3. Harris, Randy Allen, The Linguistics Wars, Oxford University Press, 1993, pp 33.

4. In this post I am treating the "Three models" paper as the "first" work of Chomskyan linguistics. Other choices can be justified. The next year, 1957, Chomsky published a book covering the same material: Syntactic Structures. Syntactic Structures was much more accessible, and attracted much more attention -- the Chomskyan revolution did not really begin before it came out. On the other hand, both of these draw their material from Chomsky's 1000-page Logical Structure of Linguistic Theory, which was completed in June 1955. But Logical Structure of Linguistic Theory was not published until 1975 and then only in part. (See Radick, Gregory, "The Unmaking of a Modern Synthesis: Noam Chomsky, Charles Hockett, and the Politics of Behaviorism, 1955–1965" .)

5. Chomsky 1956, p. 114.

6. Bloomfield, Leonard, "A set of Postulates for the Science of Language", Language, Vol. 2, No. 3 (Sep., 1926), pp. 153-164. The quote is definition 4 on p. 154.

7. In case there is any doubt as to the link between the Chomsky and Bloomfield definitions, Chomsky also calls his strings, "utterances". See Chomsky, Noam, Syntactic Structures, 2nd ed., Mouton de Gruyter, 2002, p. 49

8. Harris 1993, pp 31-34, p. 37.

9. Chomsky 1956, p. 118, p. 123.

10. Chomsky, Noam. Topics in the Theory of Generative Grammar. De Gruyter, 1978, p. 20. (The quote occurs in footnote 7 starting on p. 19.)

11. Bloomfield, Leonard. Language. Holt, Rinehart and Winston, 1933, p. 140.

12. Aho, Alfred V., and Jeffrey D. Ullman. The theory of parsing, translation, and compiling. Vol. 1. Prentice-Hall, 1972, p. 16.

13. Sippu, Seppo and Soisalon-Soininen, Eljas. Parsing Theory, Volume I, Springer-Verlag, 1988, p. 11.

14. A welcome errancy from tradition, however, arrives with Grune, D. and Jacobs, C. J. H., Parsing Techniques: A Practical Guide, 2nd edition, Springer, 2008. On pp. 5-7, they attribute the traditional "set of strings" definition to "formal linguistics". They point out that the computer scientist requires a grammar to not only list a set of strings, but provide a "structure" for each of them.

As an aside, Grune and Jacobs often depart from the "just stick to the math" approach taken by other textbooks parsing theory. They often give the history and motivation behind the math. My own work owes much to them.

15. Readers who want to peek ahead can look at my Parsing: a timeline 3.0. The tale is told there is from a somewhat different point of view, but no reader of "Timeline" will be much surprised by where I take this line of thought.

posted at: 19:59 | direct link to this entry

§         §         §