Sherlock Holmes and the Case of the Missing Parsing Solution

Sun, 31 Mar 2019

Always approach a case with an absolutely blank mind. It is always an advantage. Form no theories, just simply observe and draw inferences from your observations. — Sherlock Holmes, quoted in "The Adventure of the Cardboard Box".

It is a capital mistake to theorize before one has data. — Holmes, in "A Scandal in Bohemia".

I make a point of never having any prejudices, and of following docilely wherever fact may lead me. — Holmes, in "The Reigate Puzzle".

When you have eliminated the impossible, whatever remains, no matter how improbable, must be the truth. — Holmes, in "The Sign of Four".

In imagination there exists the perfect mystery story. Such a story presents the essential clues, and compels us to form our own theory of the case. If we follow the plot carefully, we arrive at the complete solution for ourselves just before the author's disclosure at the end of the book. The solution itself, contrary to those of inferior mysteries, does not disappoint us; moreover, it appears at the very moment we expect it. Can we liken the reader of such a book to the scientists, who throughout successive generations continue to seek solutions of the mysteries in the book of nature? The comparison is false and will have to be abandoned later, but it has a modicum of justification which may be extended and modified to make it more appropriate to the endeavour of science to solve the mystery of the universe. — Albert Einstein and Leopold Infeld. [1]

The Sherlock Holmes approach

My timeline history of parsing theory is my most popular writing, but it is not without its critics. Many of them accuse the timeline of a lack of objectivity, or of bias.

Einstein assumed his reader's idea of methods of proper investigation, in science as elsewhere, would be similar to those of Conan Doyle's Sherlock Holmes. I will follow Einstein's lead in starting there.

The deductions recorded in the Holmes canon often involve a lot of theorizing. To make it a matter of significance what the dog in "Silver Blaze" did in the night, Holmes needs a theory of canine behavior, and Holmes's theory sometimes outpaces its pack of facts by a considerable distance. Is it really true that only dangerous people own dangerous dogs?[2]

Holmes's methods, at least as stated in the Conan Doyle stories, are incapable of solving anything but the fictional problems he encounters. In real life, a "blank mind" can observe nothing. There is no "data" without theory, just white noise. Every "fact" gathered relies on many prejudgements about what is relevant and what is not. And you certainly cannot characterize anything as "impossible", unless you have, in advance, a theory about what is possible.

The Einstein approach

Einstein, in his popular account of the evolution of physics, finds the Doyle stories "admirable"[3]. But to solve real-life mysteries, more is needed. Einstein begins his description of his methods at the start of his Chapter II:

The following pages contain a dull report of some very simple experiments. The account will be boring not only because the description of experiments is uninteresting in comparison with their actual performance, but also because the meaning of the experiments does not become apparent until theory makes it so. Our purpose is to furnish a striking example of the role of theory in physics. [4]

Einstein follows with a series of the kind of experiments that are performed in high school physics classes. One might imagine these experiments allowing an observer to deduce the basics of electromagnetism using materials and techniques available for centuries.

But, and this is Einstein's point, this is not how it happened. The theory came first, and the experiments were devised afterwards.

In the first pages of our book we compared the role of an investigator to that of a detective who, after gathering the requisite facts, finds the right solution by pure thinking. In one essential this comparison must be regarded as highly superficial. Both in life and in detective novels the crime is given. The detective must look for letters, fingerprints, bullets, guns, but at least he knows that a murder has been committed. This is not so for a scientist. It should not be difficult to imagine someone who knows absolutely nothing about electricity, since all the ancients lived happily enough without any knowledge of it. Let this man be given metal, gold foil, bottles, hard-rubber rod, flannel, in short, all the material required for performing our three experiments. He may be a very cultured person, but he will probably put wine into the bottles, use the flannel for cleaning, and never once entertain the idea of doing the things we have described. For the detective the crime is given, the problem formulated: who killed Cock Robin? The scientist must, at least in part, commit his own crime, as well as carry out the investigation. Moreover, his task is not to explain just one case, but all phenomena which have happened or may still happen. — Einstein and Infeld [5]

Committing our own crime

If then, we must commit the crime of theorizing before the facts, where does our theory come from?

Science is not just a collection of laws, a catalogue of unrelated facts. It is a creation of the human mind, with its freely invented ideas and concepts. Physical theories try to form a picture of reality and to establish its connection with the wide world of sense impressions. Thus the only justification for our mental structures is whether and in what way our theories form such a link. — Einstein and Infeld [6]

In the case of planets moving around the sun it is found that the system of mechanics works splendidly. Nevertheless we can well imagine that another system, based on different assumptions, might work just as well.
Physical concepts are free creations of the human mind, and are not, however it may seem, uniquely determined by the external world. In our endeavor to understand reality we are somewhat like a man trying to understand the mechanism of a closed watch. He sees the face and the moving hands, even hears its ticking, but he has no way of opening the case. If he is ingenious he may form some picture of a mechanism which could be responsible for all the things he observes, but he may never be quite sure his picture is the only one which could explain his observations. He will never be able to compare his picture with the real mechanism and he cannot even imagine the possibility or the meaning of such a comparison. But he certainly believes that, as his knowledge increases, his picture of reality will become simpler and simpler and will explain a wider and wider range of his sensuous impressions. He may also believe in the existence of the ideal limit of knowledge and that it is approached by the human mind. He may call this ideal limit the objective truth. -- Einstein and Infeld [7]

It may sound as if Einstein believed that the soundness of our theories is a matter of faith. In fact, Einstein was quite comfortable with putting it exactly that way:

However, it must be admitted that our knowledge of these laws is only imperfect and fragmentary, so that, actually the belief in the existence of basic all-embracing laws in Nature also rests on a sort of faith. All the same this faith has been largely justified so far by the success of scientific research. — Einstein [8]

I believe that every true theorist is a kind of tamed metaphysicist, no matter how pure a "positivist" he may fancy himself. The metaphysicist believes that the logically simple is also the real. The tamed metaphysicist believes that not all that is logically simple is embodied in experienced reality, but that the totality of all sensory experience can be "comprehended" on the basis of a conceptual system built on premises of great simplicity. The skeptic will say this is a "miracle creed." Admittedly so, but it is a miracle creed which has been borne out to an amazing extent by the development of science. — Einstein [9]

The liberty of choice, however, is of a special kind; it is not in any way similar to the liberty of a writer of fiction. Rather, it is similar to that of a man engaged in solving a well-designed puzzle. He may, it is true, propose any word as the solution; but, there is only one word which really solves the puzzle in all its parts. It is a matter of faith that nature — as she is perceptible to our five senses — takes the character of such a well-formulated puzzle. The successes reaped up to now by science do, it is true, give a certain encouragement for this faith. -- Einstein [10]

The puzzle metaphor of the last quote is revealing. Einstein believes there is a single truth, but that we will never know what it is. Existence is a crossword puzzle whose answer we will never know, and even the existence of an answer must be taken as a matter of faith.

The very fact that the totality of our sense experience is such that by means of thinking (operations with concepts, and the creation and use of definite functional relations between them, and the coordination of sense experiences to these concepts) it can be put in order, this fact is one which leaves us in awe, but which we shall never understand. One may say that "the eternal mystery of the world is its comprehensibility". — Einstein [11]

In my opinion, nothing can be said a priori concerning the manner in which the concepts are to be formed and connected, and how we are to coordinate them to sense experiences. In guiding us in the creation of such an order of sense experiences, success alone is the determining factor. All that is necessary is to fix a set of rules, since without such rules the acquisition of knowledge in the desired sense would be impossible. One may compare these rules with the rules of a game in which, while the rules themselves are arbitrary, it is their rigidity alone which makes the game possible. However, the fixation will never be final. It will have validity only for a special field of application. — Einstein [12]

There are no eternal theories in science. It always happens that some of the facts predicted by a theory are disproved by experiment. Every theory has its period of gradual development and triumph, after which it may experience a rapid decline. — Einstein and Infeld [13]

In our great mystery story there are no problems wholly solved and settled for all time. — Einstein and Infeld [14]

This great mystery story is still unsolved. We cannot even be sure that it has a final solution. — Einstein and Infeld [15]

Choosing a "highway"

In most of the above, Einstein is focusing on his work in a "hard" science: physics. Are his methods relevant to "softer" fields of study? Einstein thinks so:

The whole of science is nothing more than a refinement of everyday thinking. It is for this reason that the critical thinking of the physicist cannot possibly be restricted to the examination of the concepts of his own specific field. He cannot proceed without considering critically a much more difficult problem, the problem of analyzing the nature of everyday thinking. — Einstein [16]

Einstein's collaboration with Infeld was, like the "Timeline", a description of the evolution of ideas, and in the Einstein–Infeld book they describe their approach:

Through the maze of facts and concepts we had to choose some highway which seemed to us most characteristic and significant. Facts and theories not reached by this road had to be omitted. We were forced, by our general aim, to make a definite choice of facts and ideas. The importance of a problem should not be judged by the number of pages devoted to it. Some essential lines of thought have been left out, not because they seemed to us unimportant, but because they do not lie along the road we have chosen. — Einstein and Infeld [17]

Truth and success

Einstein says that objective truth, while it exists, is not to be attained in the hard sciences, so it is not likely he thought that a historical account could outdo physics in this respect. For Einstein, as quoted above, "success alone is the determining factor".

Success, of course, varies with what the audience for a theory wants. In a very real sense, I consider a theory that can predict the stock market more successful than one which can predict perturbations of planetary orbits invisible to the naked eye. But this is not a reasonable expectation when applied to the theory of general relativity.

Among the expectations reasonable for a timeline of parsing might be these:

It helps choose the right parsing algorithm for practical applications.
It helps a reader to understand articles in the literature of parsing.
It helps guide future research.
It predicts the outcome of future research.

When I wrote the first version of Timeline, its goal was none of these. Instead I intended it to explain the sources behind my own research in the Earley/Leo lineage.

With such a criterion of "success", I wondered if Timeline would have an audience much larger than one, and was quite surprised when it started getting thousands of web hits a day. The large audience Timeline 1.0 drew was a sign that there is a large appetite out there for accounts of parsing theory, an appetite so strong that anything resembling a coherent account was quickly devoured.

In response to the unexpectedly large audience, later versions of the Timeline widened their focus. Timeline 3.1 was broadened to give good coverage of mainstream parsing practice including a lot of new material and original analysis. This brought in a lot of material on topics which had little or no influence on my Earley/Leo work. The parsing of arithmetic expressions, for example, is trivial in the Earley/Leo context, and before my research for Timeline 3.0 I had devoted little attention to approaches that I felt amounted to needlessly doing things the hard way. But arithmetic expressions are at the borderline of power for traditional approaches and parsing arithmetic expressions was a central motivation for the authors of the algorithms that have so far been most influential on mainstream parsing. So in Timeline 3.1 arithmetic expressions became a recurring theme, being brought back for detailed examination time and time again.

Is the "Timeline" false?

Is the "Timeline" false? The answer is yes, in three increasingly practical senses.

As Einstein makes clear, every theory that is about reality will eventually be proved false. The best a theory can hope for is the fate of Newton's physics: to be shown to be a subcase of a larger theory.

In a more specific sense, the truth of any theory of parsing history depends on its degree of success in explaining the facts. This means that the truth of the "Timeline" depends on which facts you require it to explain. If arbitrary choices of facts to be explained are allowed, the "Timeline" will certainly be seen to be false.

But can the "Timeline" be shown to be false for criteria of success which are non-arbitrary? In the next section, I will describe four non-arbitrary criteria of success, all of which are of practical interest, and for all of which the "Timeline" is false.

The Forever Five

"Success" depends a lot on judgement, but my studies have led me to conclude that all but five algorithms are "unsuccessful" in the sense that, for everything that they do, at least one other algorithm does it better in practice. But this means there are five algorithms which do solve some practical problems better than any other algorithm, including each of the other four. I call these the "forever five" because, if I am correct, these algorithms will be of permanent interest.

My "Forever Five" are regular expressions, recursive descent, PEG, Earley/Leo and Sakai's algorithm.[18] Earley/Leo is the focus of my Timeline, so that an effective critique of my "Timeline" could be a parsing historiography centering on any of the other four.

For example, of the five, regular expressions are the most limited in parsing power. On the other hand, most of the parsing problems you encounter in practice are handled quite nicely by regular expressions.[19] Good implementations of regular expressions are widely available. And, for speed, they are literally unbeatable: if a parsing problem can be stated as a regular expression, no other algorithm will beat a dedicated regular expression engine for parsing it.
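To make the point concrete, here is a minimal sketch in Python, using the standard `re` module, of the kind of everyday problem that regular expressions handle nicely. The task (recognizing a comma-separated list of integers), the pattern, and the name `is_int_list` are my own illustrative assumptions. In keeping with the first quibble in footnote 18, the function only recognizes: it answers yes or no and finds no structure.

```python
import re

# Hypothetical example pattern: a comma-separated list of integers,
# such as "1, 22,-3". The language it describes is regular.
INT_LIST = re.compile(r"\s*-?\d+\s*(?:,\s*-?\d+\s*)*")

def is_int_list(text: str) -> bool:
    """Return True if the whole of `text` is a comma-separated integer list."""
    # fullmatch() requires the pattern to cover the entire input.
    return INT_LIST.fullmatch(text) is not None

print(is_int_list("1, 22,-3"))  # True
print(is_int_list("1,,2"))      # False
```

A problem stated this way needs nothing beyond the regular expression engine that ships with the language.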

Could a Timeline competitor be written which centered on regular expressions? Certainly. And if immediate usefulness to the average programmer is the criterion (and it is a very good criterion), then the Regular Expressions Timeline would certainly give my timeline a run for the money.

What about a PEG Timeline?

The immediate impetus for this article was a very collegial inquiry from Nicolas Laurent, a researcher whose main interest is PEG. Could a PEG Timeline challenge mine? Again, very certainly.

There are at least some problems for which PEG is superior to everything else, my own Earley/Leo approach included. As one example, PEG could be a more powerful alternative to regular expressions.
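As a rough illustration of what "more powerful" can mean here, consider balanced parentheses, a language that no classical regular expression can describe. The sketch below hand-codes a tiny PEG in Python. The grammar, the function names, and the position-passing convention are my own assumptions for illustration, and are not drawn from any particular PEG implementation.

```python
from typing import Optional

# Hypothetical PEG, written out by hand:
#   Start    <- Balanced !.
#   Balanced <- ( '(' Balanced ')' )*
# Each function takes an input position and returns the position just
# after its match. group() returns None when it fails to match.

def balanced(text: str, pos: int) -> int:
    # PEG repetition is greedy and never backtracks into what it has
    # consumed. A star always succeeds, possibly matching nothing.
    while True:
        nxt = group(text, pos)
        if nxt is None:
            return pos
        pos = nxt

def group(text: str, pos: int) -> Optional[int]:
    # Sequence: '(' Balanced ')'
    if pos >= len(text) or text[pos] != "(":
        return None
    pos = balanced(text, pos + 1)
    if pos >= len(text) or text[pos] != ")":
        return None
    return pos + 1

def is_balanced(text: str) -> bool:
    # "!." (not followed by any character) is the PEG idiom for end of input.
    return balanced(text, 0) == len(text)

print(is_balanced("(()())"))  # True
print(is_balanced("(()"))     # False
```

The recursion in `group` is what takes this beyond regular expressions: the matcher has to remember how many parentheses are still open, and a finite automaton cannot.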

That does not mean that I might not come back with a counter-critique. Among the questions that I might ask:

Is the PEG algorithm being proposed a future prospect, or does it already have an implementation?
What claims of speed and time complexity are made? Is there a way of determining in advance of runtime how fast your algorithm will run? Or is the expectation of practical speed on an "implement and pray" basis?
Does the proposed PEG algorithm match human parsing capabilities? If not, it is a claim for human exceptionalism, of a kind not usually accepted in modern computer science. How is exceptionalism justified in this case?

The search for truth is more precious than its possession. -- Einstein, quoting Lessing[20]

Comments, etc.

The background material for this post is in my Parsing: a timeline 3.0, and this post may be considered a supplement to "Timeline". To learn about Marpa, my Earley/Leo-based parsing project, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Footnotes

1. Einstein, Albert and Infeld, Leopold, The Evolution of Physics, Simon and Schuster, 2007, p. 3. ↩

2. "A dog reflects the family life. Whoever saw a frisky dog in a gloomy family, or a sad dog in a happy one? Snarling people have snarling dogs, dangerous people have dangerous ones." From "The Adventure of the Creeping Man". ↩

3. Einstein and Infeld, p. 4. ↩

4. Einstein and Infeld, p. 71. ↩

5. Einstein and Infeld, p. 78. ↩

6. Einstein and Infeld, p. 294. ↩

7. Einstein and Infeld, p. 31. See also Einstein, "On the Method of Theoretical Physics", Ideas and Opinions, Wings Books, New York, no publication date, p. 272. ↩

8. Dukas and Hoffmann, Albert Einstein: The Human Side, Princeton University Press, 2013, pp. 32-33. ↩

9. "On the Generalized Theory of Gravitation", in Ideas and Opinions, p. 342. ↩

10. "Physics and Reality", in Ideas and Opinions, pp. 294-295. ↩

11. "Physics and Reality", in Ideas and Opinions, p. 292. ↩

12. "Physics and Reality", in Ideas and Opinions, p. 292. ↩

13. Einstein and Infeld, p. 75. ↩

14. Einstein and Infeld, p. 35. ↩

15. Einstein and Infeld, pp. 7-8. ↩

16. "Physics and Reality", in Ideas and Opinions, p. 290. ↩

17. Einstein and Infeld, p. 78. ↩

18. Three quibbles: Regular expressions do not find structure, so pedantically they are recognizers, not parsers. Recursive descent is a technique for creating a family of algorithms, not an algorithm. And the algorithm first described by Sakai is more commonly called CYK, from the initials of three other researchers who re-discovered it over the years. ↩

19. A lot of this is because programmers learn to formulate problems in ways which avoid complex parsing so that, in practice, the alternatives are using regular expressions or rationalizing away the need for parsing. ↩

20. "The Fundaments of Theoretical Physics", in Ideas and Opinions, p. 335. ↩

posted at: 21:31

Ocean of Awareness