Jeffrey Kegler's blog about Marpa, his new parsing algorithm, and other topics of interest
Github's linguist is seen as the most trustworthy tool for estimating language popularity[1], in large part because it reports its result as the proportion of code in a very large dataset, instead of web hits or searches.[2] It is ironic, in this context, that linguist avoids looking at the code, preferring to use metadata -- file name and the vim and shebang lines. Scanning the actual code is linguist's last resort.[3]
How accurate is this? For files that are mostly in a single programming language, currently the majority of them, linguist's method is probably very accurate.
But literate programming often requires mixing languages. It is perhaps an extreme example, but much of the code used in this blog post comes from a Markdown file, which contains both C and Lua. This code is "untangled" from the Lua by ad-hoc scripts[4]. In my codebase, linguist identifies this code simply as Markdown.[5] linguist then ignores it, as it does all documentation files.[6]
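The actual untangling scripts are ad hoc and not shown here, but the idea can be sketched. The following Python sketch assumes a hypothetical convention where code blocks are tagged Markdown fences; the real scripts may well work differently:

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built up to avoid a literal fence here

def untangle(markdown_text, language):
    """Return the bodies of fenced code blocks tagged with `language`."""
    pattern = re.compile(
        "^" + FENCE + re.escape(language) + r"[ \t]*\n(.*?)^" + FENCE,
        re.MULTILINE | re.DOTALL,
    )
    return [m.group(1) for m in pattern.finditer(markdown_text)]

doc = "Prose about the code.\n\n" + FENCE + "c\nint main(void) { return 0; }\n" + FENCE + "\n"
print(untangle(doc, "c"))  # ['int main(void) { return 0; }\n']
```

A real untangler would also need to preserve line numbers for error reporting, but the core operation is just this extraction.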
Currently, this kind of homegrown literate programming may be so rare that it is not worth taking into account. But if literate programming becomes more popular, that trend might well slip under linguist's radar. And even those with a lot of faith in linguist's numbers should be happy to know they could be confirmed by more careful methods.
linguist avoids reporting results based on looking at the code, because careful line counting for multiple languages cannot be done with traditional parsing methods.[7] To do careful line counting, a parser must be able to handle ambiguity in several forms -- ambiguous parses, ambiguous tokens, and overlapping variable-length tokens.
The ability to deal with "overlapping variable-length tokens" may sound like a bizarre requirement, but it is not. Line-by-line languages (BASIC, FORTRAN, JSON, .ini files, Markdown) and token-by-token languages (C, Java, Javascript, HTML) are both common, and even today commonly occur in the same file (POD and Perl, Haskell's Bird notation, Knuth's CWeb).
Deterministic parsing can switch back and forth, though at the cost of some very hack-ish code. But for careful line counting, you need to parse line-by-line and token-by-token simultaneously. Consider this example:
    int fn () { /* for later
\begin{code}
  */ int fn2();
    int a = fn2(); int b = 42;
    return a + b; /* for later
\end{code}
*/ }
A reader can imagine that this code is part of a test case using code pulled from a LaTeX file. The programmer wanted to indicate the copied portion of code, and did so by commenting out its original LaTeX delimiters. GCC compiles this code without warnings.
It is not really the case that LaTeX is a line-by-line language. But in literate programming systems[8], it is usually required that the \begin{code} and \end{code} delimiters begin at column 0, and that the code block between them be a set of whole lines, so for our purposes in this post we can treat LaTeX as line-by-line. For LaTeX, our parser finds
L1c1-L1c29 LaTeX line: "    int fn () { /* for later"
L2c1-L2c13 \begin{code}
L3c1-L5c31 [A CODE BLOCK]
L6c1-L6c10 \end{code}
L7c1-L7c5 LaTeX line: "*/ }"[9]
Note that in the LaTeX parse, line alignment is respected perfectly: The first and last are ordinary LaTeX lines, the 2nd and 6th are commands bounding the code, and lines 3 through 5 are a code block.
The C tokenization, on the other hand, shows no respect for lines. Most tokens are a small part of their line, and the two comments start in the middle of a line and end in the middle of one. For example, the first comment starts at column 17 of line 1 and ends at column 5 of line 3.[10]
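To make the two views concrete, here is a rough sketch of my own (in Python, not the post's Marpa-based code) of both parses over the example: a line-by-line LaTeX view and a token-by-token scan for C comments that ignores line boundaries entirely:

```python
import re

# The post's example: LaTeX whose code-block delimiters have been
# commented out, so that the whole file also compiles as C.
TEXT = """\
    int fn () { /* for later
\\begin{code}
  */ int fn2();
    int a = fn2(); int b = 42;
    return a + b; /* for later
\\end{code}
*/ }
"""

def latex_view(text):
    """Line-by-line view: classify each whole line, LaTeX-style."""
    state, out = "latex", []
    for line in text.splitlines():
        if line.startswith(r"\begin{code}"):
            out.append(("begin", line))
            state = "code"
        elif line.startswith(r"\end{code}"):
            out.append(("end", line))
            state = "latex"
        else:
            out.append((state, line))
    return out

def c_comment_spans(text):
    """Token-by-token view: C comment tokens, crossing line boundaries."""
    return [(m.start(), m.end()) for m in re.finditer(r"/\*.*?\*/", text, re.DOTALL)]

for tag, line in latex_view(TEXT):
    print(f"{tag:5} | {line}")
print(c_comment_spans(TEXT))
```

The line-based view yields exactly the seven-line LaTeX parse above, while the comment scan finds two tokens, each starting mid-line and ending mid-line. A real implementation must run both views over the same input at once, which is where variable-length subparsing comes in.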
What language is our example in? Our example is long enough to justify classification, and it compiles as C code. So it seems best to classify this example as C code[11]. Our parses give us enough data for a heuristic to make a decision capturing this intuition.[12]
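For instance, a toy version of such a heuristic (my own illustration, far cruder than what a real tool would use) might attribute each byte of the file to one language or the other and let the majority decide:

```python
def classify(text):
    """Toy byte-attribution heuristic: delimiter lines that only the LaTeX
    parse claims count as LaTeX; everything else here counts as C."""
    latex_bytes = sum(
        len(line) for line in text.splitlines(keepends=True)
        if line.lstrip().startswith((r"\begin{code}", r"\end{code}"))
    )
    c_bytes = len(text) - latex_bytes
    return "C" if c_bytes >= latex_bytes else "LaTeX"

sample = "int fn () { /* x\n\\begin{code}\n*/ int a;\n/* y\n\\end{code}\n*/ }\n"
print(classify(sample))  # C
```

A production heuristic would weigh much more evidence, such as whether the file lexes cleanly as C, but the point stands: the dual parse supplies the per-line and per-token data the decision needs.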
In a series of previous posts[13], I have been developing a parsing method that integrates Earley/Leo parsing and combinator parsing. Everything in my previous posts is available in Marpa::R2, which was in Debian stable as of jessie.
The final piece, added in this post, is the ability to use variable length subparsing[14], which I have just added to Marpa::R3, Marpa::R2's successor. Releases of Marpa::R3 pass a full test suite, and the documentation is kept up to date, but R3 is alpha, and the usual cautions[15] apply.
Earley/Leo parsing is linear for a superset of the LR-regular grammars, which includes all other grammar classes in practical use, and Earley/Leo allows the equivalent of infinite lookahead.[16] When the power of Earley/Leo gives out, Marpa allows combinators (subparsers) to be invoked. The subparsers can be anything, including other Earley/Leo parsers, and they can be called recursively[17]. Rare will be the grammar of practical interest that cannot be parsed with this combination of methods.
The code that ran this example is available on Github. In previous posts, we gave larger examples[18], and our tools and techniques have scaled. We expect that the variable-length subparsing feature will also scale -- while it was not available in Marpa::R2, it is not in itself new. Variable-length tokens have been available in other Marpa interfaces for years, and they were described in Marpa's theory paper.[19]
The grammars used in the example of this post are minimal. Only enough LaTeX is implemented to recognize code blocks, and only enough C syntax to recognize comments.
To learn more about Marpa, a good first stop is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.
1. The Github repo for linguist is https://github.com/github/linguist/. ↩
2. Their methodology is often left vague, but it seems safe to say the careful line-by-line counting discussed in this post goes well beyond the techniques used in the widely-publicized lists of "most popular programming languages". In fact, it seems likely these measures do not use line counts at all, but instead report the sum of blob sizes. Github's linguist does give a line count, but Github does not vouch for its accuracy: "if you really need to know the lines of code of an entire repo, there are much better tools for this than Linguist." (Quoted from the resolution of Github linguist issue #1331.) The Github API's list-languages command reports language sizes in bytes. The API documentation is vague, but it seems the counts are the sum of blob sizes, with each blob classed as one and only one language. Some tallies seem even more coarsely grained than this -- they are not even blob-by-blob, but assign entire repos to the "primary language". For more, see Jon Evans's Techcrunch article; and Ben Frederickson's project. ↩
3. linguist's methodology is described in its README.md (permalink as of 30 September 2018). ↩
4. This custom literate programming system is not documented or packaged, but those who cannot resist taking a look can find the Markdown file it processes here, and its own code here (permalinks accessed 2 October 2018). ↩
5. For those who care about getting linguist as accurate as possible, there is a workaround: the linguist-language git attribute. This still requires that each blob be reported as containing lines of only one language. ↩
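For example, an entry like the following in .gitattributes (the path pattern here is hypothetical) tells linguist to count matching files as C:

```
# .gitattributes: override linguist's classification for these files
code/*.md linguist-language=C
```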
6. For the treatment of Markdown, see linguist README.md (permalink accessed as of 30 September 2018). ↩
7. Another possibility is a multi-scan approach -- one pass per language. But that is likely to be expensive. At last count there were 381 languages in linguist's database. Worse, it won't solve the problem: "liberal" recognition even of a single language requires more power than is available from traditional parsers. ↩
8. For example, these line-alignment requirements match those in Section 10.4 of the 2010 Haskell Language Report. ↩
9. Adapted from test code in Github repo, permalink accessed 2 October 2018. ↩
10. See the test file on Github. ↩
11. Some might think the two LaTeX lines should be counted as LaTeX and, using subparsing of comments, that heuristic can be implemented. ↩
12. To be sure, a useful tool would want to include considerably more of C's syntax. It is perhaps not necessary to be sure that a file compiles before concluding it is C. And we might want to class a file as C in spite of a fleeting failure to compile. But we do want to lower the probability of a false positive. ↩
13. Marpa and procedural parsing; Marpa and combinator parsing; and Marpa and combinator parsing 2 ↩
14. There is documentation of the interface, but it is not a good starting point for a reader who has just started to look at the Marpa::R3 project. Once a user is familiar with Marpa::R3's standard DSL-based interface, they can start to learn about its alternatives here. ↩
15. Specifically, since Marpa::R3 is alpha, its features are subject to change without notice, even between micro releases, and changes are made without concern for backward compatibility. This makes R3 unsuitable for a production application. Add to this that, while R3 is tested, it has seen much less usage and testing than R2, which has been very stable for some time. ↩
16. Technically, a grammar is LR-regular if it can be parsed deterministically using a regular set as its lookahead. A "regular set" is a set of regular expressions. The regular set itself must be finite, but the regular expressions it contains can match lookaheads of arbitrary length. ↩
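A standard textbook illustration (my example, not from the post): the grammar

```
S → C a | D b
C → C c | c
D → D c | c
```

is not LR(k) for any fixed k, because at the very first c the parser must already choose between reducing toward C or toward D, and the deciding a or b may be arbitrarily far away. But the two possible lookaheads form the finite regular set { c*a, c*b }, so the grammar is LR-regular.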
17. See Marpa and combinator parsing 2 ↩
18. The largest example is in Marpa and combinator parsing 2 ↩
19. Kegler, Jeffrey. Marpa, A Practical General Parser: The Recognizer. Online version accessed 24 April 2018. The link is to the 19 June 2013 revision of the 2012 original. ↩
posted at: 20:16 | direct link to this entry