Parsing theory preliminaries (Libmarpa 11.0.10)

5.2 Parsing theory preliminaries

This document assumes the reader is familiar with parsing theory. The following exposition is not intended as either an introduction or a reference. Instead, it is intended to serve as a guide to the definitions of parsing terms as used in this document.

The definitions given are intended to be applicable within Libmarpa, rather than to reflect general usage. For example, Marpa sometimes uses a standard term with a definition that is different from the standard definition. “Ambiguous grammar” is one example: See Ambiguity. The term “terminal” is another. See Terminals. When a standard term is defined in a non-standard way. this is explicitly pointed out.

Readers who want a textbook or tutorial in parsing theory can look at Mark Jason Dominus’s excellent chapter on parsing in the Perl context. See Dominus 2005. It is available on-line. Wikipedia is also an excellent place to start. See Wikipedia BNF article.

A grammar is a set of rules, associated with a set of symbols, one of which is distinguished as the start symbol. A symbol string, or simply string where the meaning is clear, is an ordered series of symbols. The length of a string is the number of symbols in it. A symbol string is also called a sentential form.

Some of the symbols are terminals. For the the moment, we will say that a terminal is a symbol which may occur in an input to a parse of a grammar. A careful definition appropriate specifically to Marpa is below (see Terminals). In a parse, an input is either accepted or rejected. A potential input string, that is, a sentential form which is made up entirely of terminal symbols, is called a sentence. The set of sentences that a grammar accepts is the language of the grammar.

It is important to note that the term language, as it is used in parsing theory, means something very different from what it means in ordinary use. The meaning of the strings is an essential part of the ordinary idea of what a language is. In parsing terminology, meaning (or semantics as it is called) is a separate issue. For parsing theory a language is exactly a set of strings – that and nothing more.

A recognizer is a program that determines whether its input is in the language of a grammar. A parser is a program which finds the structure of that input.

Parsers often use a lexical analyzer to convert raw input, usually input text, into a token stream, which is a series of tokens. Each token represents one or more symbols of the grammar and has a value. (Libmarpa tokens may be ambiguous. See Ambiguous input.) A token is also sometimes called a lexeme. A lexical analyzer is often called a lexer or a scanner, and lexical analysis is often called lexing or scanning.

The series of symbols represented by the series of tokens becomes the symbol string input seen by the recognizer. The symbol string input is more often called the input sentence, or simply the input.