Gazelle is a system for parsing formal languages. A formal language is any language with formally-specified, non-ambiguous syntax. This includes but is not limited to:

  • programming languages (C, C++, Java, etc.)

  • data languages (XML, JSON, YAML, etc.)

  • text-based protocols (HTTP, SMTP, IMAP, etc.)

  • configuration files

  • logfiles

Gazelle defines a grammar language that is intended to give you enough tools to parse real-world languages and all of their idiosynchracies, while still being efficient enough to use in time-sensitive and/or memory-constrained situations. Gazelle takes a multi-paradigm approach. While Gazelle is based around context-free grammars, it also supports:

  • repetition operators like ? (optional), * (zero or more), and + (one or more).

  • ambiguity-resolution operators like / (prioritized choice, as in Parsing Expression Grammars) and greedy/non-greedy repetition.

  • (planned) operator-precedence parsing, so that you can define infix expressions in terms of their operators, precedences, and associativity.

  • (planned) an imperative language (Lua) to use as an "escape hatch" when none of the other formalisms address a language's idiosynchracy.

Gazelle also puts a primary focus on making its grammars reusable. Once a grammar is written for a language, it should be possible to use it unmodified, from any host language that Gazelle supports, for anything from a compiler to the syntax highlighting component of a text editor. That is Gazelle's single most fundamental design criterion.

An Introductory Tour

This section offers a quick tour of Gazelle and its capabilities. It assumes you have already compiled and instaled Gazelle successfully; the instructions for doing so are in the README.

Hello, World

For our "Hello, World" example, we'll write a grammar for a language that is simply an integer surrounded by an arbitrary number of balanced parentheses. To begin, use a text editor to write this grammar to hello.gzl.

hello -> "(" hello ")" | .digits=/[0-9]+/;

This text is a complete Gazelle grammar. It defines a single rule called "hello", which expands to either another (recursive) hello surrounded by parentheses, or a string of one or more numeric digits (and this terminal is named "digits").

This language will match any of the following inputs:

(5)
(((((((((((100)))))))))))
0

But none of these inputs will match:

(()
)(
0)

The first step is to compile the text version of the grammar to bytecode. The Gazelle compiler analyzes the input grammar and builds a series of finite automata (state machines) that can do the real parsing. It writes this compiled version of the grammar to bytecode.

$ gzlc --help
gzlc -- Gazelle grammar compiler.
Gazelle v0.3  http://www.reverberate.org/gazelle/

Usage: gzlc [options] input-file

  -h, --help         you're looking at it.

  -d,                dump detailed output about the grammar to
                     html/index.html.

  -k <depth>         Maximum LL(k) to consider (by default, uses a
                     heuristic that attempts to determine if the
                     grammar is LL(k) for *any* k).

  --no-minimize-rtns
                     skip the minimization step for RTNs.  This results
                     in larger RTNs, but may be necessary if minimization
                     is taking too long (this should only occur in
                     artificially-complicated grammars).

  -o <file>          output filename.  Default is input filename
                     with extension replaced with .gzc

  -v, --verbose      dump information about compilation process and
                     output statistics.

  --version          dump Gazelle version
$ gzlc hello.gzl
$

You won't see any output — that means that the grammar was compiled successfully. Just like gcc, gzlc is silent when things succeed. The default output filename is the input filename with gzl replaced with gzc.

Tallis:gazelle joshua$ ls -l hello*
-rw-r--r--  1 joshua  joshua  212 Sep 30 21:11 hello.gzc
-rw-r--r--  1 joshua  joshua   43 Sep 30 21:11 hello.gzl

If you want to see a bit more verbose output, run gzlc with -v.

$ gzlc -v hello.gzl
Gazelle v0.3
Opening input file 'hello.gzl'...
Parsing grammar...
Assigning RTN transition priorities...
Convering RTN NFAs to DFAs...
Minimizing RTN DFAs...
Doing LL(*) lookahead calculations...
Generating lexer DFAs...
Writing to output file 'hello.gzc'...
Writing 4 strings...
Writing 1 IntFAs...
  4 states, 4 transitions
Writing 0 GLAs...
Writing 1 RTNs...

Now, the magic moment where we do our first actual parsing. We'll write some valid input text for this language, then use gzlparse to actually parse it.

$ echo -n '((((500))))' > hello_text
$ gzlparse --help
gzlparse -- A command-line tool for parsing input text.
Gazelle 0.3  http://www.reverberate.org/gazelle/.

Usage: gzlparse [OPTIONS] GRAMMAR.gzc INFILE
Input file can be '-' for stdin.

  --dump-json    Dump a parse tree in JSON as text is parsed.
  --dump-total   When parsing finishes, print the number of bytes parsed.
  --help         You're looking at it.

$ gzlparse hello.gzc hello_text
$

Again, by default Gazelle doesn't print any output when things were successful. If there had been a parse error, Gazelle would have printed an error and exited. If you want to see more output, the flags —dump-json and —dump-total will print a parse tree in JSON format and a total byte count, respectively.

$ gzlparse --dump-json --dump-total hello.gzc hello_text
{"parse_tree":
  {"rule":"hello", "start": 0, "children": [
    {"terminal": "(", "slotname": "(", "slotnum": 0, "offset": 0, "len": 1},
    {"rule":"hello", "start": 1, "slotname":"hello", "slotnum":1, "children": [
      {"terminal": "(", "slotname": "(", "slotnum": 0, "offset": 1, "len": 1},
      {"rule":"hello", "start": 2, "slotname":"hello", "slotnum":1, "children": [
        {"terminal": "(", "slotname": "(", "slotnum": 0, "offset": 2, "len": 1},
        {"rule":"hello", "start": 3, "slotname":"hello", "slotnum":1, "children": [
          {"terminal": "(", "slotname": "(", "slotnum": 0, "offset": 3, "len": 1},
          {"rule":"hello", "start": 4, "slotname":"hello", "slotnum":1, "children": [
            {"terminal": "digits", "slotname": "digits", "slotnum": 3, "offset": 4, "len": 3}
          ], "len": 3},
          {"terminal": ")", "slotname": ")", "slotnum": 2, "offset": 4, "len": 4}
        ], "len": 5},
        {"terminal": ")", "slotname": ")", "slotnum": 2, "offset": 8, "len": 1}
      ], "len": 7},
      {"terminal": ")", "slotname": ")", "slotnum": 2, "offset": 9, "len": 1}
    ], "len": 9},
    {"terminal": ")", "slotname": ")", "slotnum": 2, "offset": 10, "len": 1}
  ], "len": 11}
}
11 bytes parsed.
$

The next thing you will want to do is use gzlc to produce an HTML dump of the grammar as the compiler sees it. This is an invaluable way to check and be sure that the compiler is seeing things the way you meant it to. If not, you might have misspecified the grammar, or you may have found a bug in the compiler (and these certainly exist).

Doing this HTML dump requires that Graphviz is installed. If ImageMagick is installed, this will also be used to create thumbnails.

$ gzlc -d hello.gzl

Now open html/index.html in a web browser to view the HTML dump. You will see an illustration of the "hello" rule, and you should see that it corresponds to your intuitive understanding of this rule. There are two paths through the rule — either "(", hello, ")" or digits.

You will also see the state machine that does the lexing: it recognizes three different tokens: a left parenthesis, a right parenthesis, and one called "digits" that is a string of repeated numbers. Notice that Gazelle built a lexer for you based on the grammar you entered. You don't have to give Gazelle a separate lexer specification.

Creating a lexer was trivial in the above example because none of the different token types conflict. Because left parenthesis, right parenthesis, and "digits" are all distinct and do not conflict, Gazelle simply combined them all into a single DFA that breaks the input into tokens. But consider a more complicated example: a grammar to describe files of comma-separated values, where the values can either be strings or numbers. For example:

1,5,"X","42","Hello, world!\n",31415926

What makes this example more complicated is that outside of quotes, commas and digits mean one thing, but inside quotes they mean another. Here is a grammar to describe this format:

// A CSV line is any number of comma-separated values separated by commas,
// ending with a newline.
csv_line -> csv_value *(,);

// A comma-separated value is either a number or a quoted string.
csv_value -> .number=/[0-9]+/ | '"' string_fragment* '"';

// Inside a string, there can be either regular characters or a
// backslash-escaped character.
string_fragment -> .chars=/[^\\"]+/ | .escaped_char=/\\./;

If you compile this grammar with gzlc -d and view the HTML output, you will see that Gazelle has generated two different IntFAs (DFAs used by the lexer) — one that is used inside strings that one that is used the rest of the time. Inside a string the text "42" will be lexed as the token "chars", but outside a string it will be lexed as the token "number."

Where to go from here

This concludes the tour. Next you can read more about Gazelle's input language and experiment with your own grammars.

Gazelle's major limitation at the moment is that it cannot support removing whitespace or comments from the input. A feature to address this is coming, but for the moment please bear with me!

I hope you enjoy Gazelle!

Writing Grammars for Gazelle

This chapter focuses on the syntax of Gazelle grammars, and the considerations of making your grammar easily parseable.

Rules

Rules are at the heart of every grammar; they describe the patterns of all the constructs in your language. Here is a simple example that describes an assignment statement.

assign -> ident "=" expr;

This declares that an assignment statement is an identifier followed by the string “=” followed by an expression. We say that assign is the rule's name; ident, "=", and expr are its components. We assume here that “ident” and “expr” are defined elsewhere in the grammar.

For every rule you write, Gazelle builds a graph that describes the possible paths that the input can take through this rule. Since this rule is very simple and doesn't have any kind of repetition, the graph is just a single path.

simplerule1.png
graph 1: assign -> ident "=" expr;

Alternation

Often you will have a rule that can be expanded in more than one way. For example, you might have a boolean expression that can be either “true” or “false.” You can specify alternation to Gazelle with the vertical bar (|).

boolexpr -> "true" | "false";

As you would expect, this produces a rule graph with two edges — one for each option.

altrule1.png
graph for: boolexpr -> "true" | "false";

You can use alternation freely throughout your rules. Alternation has the lowest precedence, but you can override that with parentheses. Here is a more complicated rule that demonstrates more extensive use of alternation.

name -> (fname (mname | minitial) | nickname) surname;

The graph for this rule is naturally a bit more complicated, but looking at it should convince you that the possible paths through this rule are what you intended.

altrule2.png
graph for: name -> (fname (mname | minitial) | nickname) surname

Repetition

Repetition is our other main tool for rule-writing. Repetition isn't strictly necessary for writing context-free grammars (it can always be expressed as recursion), but using it can make grammars easier to write, understand, and ultimately parse.

Gazelle offers these repetition modifiers; place them after the component you want to modify.

?

The ? modifier specifies 0 or 1 occurrences of the previous component. It corresponds to square brackets ([]) in EBNF.

*

The * modifier specifies 0 or more occurrences of the previous component. It corresponds to curly brackets ({}) in EBNF.

+

The + modifier specifies 1 or more occurrences of the previous component.

*(sep)

The *(sep) modifier specifies 0 or more occurrences of the previous component, where each occurrence is separated by sep. It is a more straightforward way of writing (component (sep component)*)?. sep can be any valid component (or in unusual cases, expression of components) that can appear on a right-hand-side of a rule.

+(sep)

The +(sep) modifier specifies 1 or more occurrences of the previous component, where each occurrence is separated by sep. It is a more straightforward way of writing component (sep component)*. sep can be any valid component (or in unusual cases, expression of components) that can appear on a right-hand-side of a rule.

Here is an example that uses repetition; it is the definition of an object in JSON:

object   -> "{" (string ":" value) *(",") "}";
reprule1.png
Graph for the above JSON object expression.
Important One important limitation of the repetition operators is that they cannot be nested within a single rule. For example, you cannot write expr -> (mult_term +("*")) +("+"); This isn't a weakness — it is a design decision that makes the process of pulling a parse tree apart much more sane. If you are tempted to nest repetitions, make the innermost repetition into its own rule.

Components

So far we have neglected to describe in detail what can appear as a component on the right-hand-side of a rule. In the previous examples we have seen strings and nonterminals — in this section we flesh out exactly what can appear there.

There are four fundamental kinds of components that can appear on a right-hand-side.

nonterminals

Nonterminals are symbols that are defined by rules, as described in the Rules section. Each nonterminal that appears on the right-hand-side of a rule must at some point be defined by appearing in the left-hand-side of a rule. The definition need not precede the use.

strings

A literal string that we expect to see at this point in the input.

regular expressions

A pattern of characters we expect to see at this point in the input.

named regular expressions

You can name a regex, and later refer to that regex by name.

Strings and regular expressions are referred to collectively as “tokens” or “nonterminals” because they represent actual strings of the input text. Gazelle can take the tokens defined in the grammar and use them to create a lexer-like component of the parser automatically.

Component Names

All rule components must have a name that is unique within that rule. Names can be specified explicitly with the syntax:

rule -> .name=other_rule;
rule -> .name="string";
rule -> .name=/regex/;
rule -> .name = /regex/;  // the whitespace is insignificant.

The component names are important when it comes to pulling apart the resulting parse tree, which is why every component must have one. Strings, rules, and named regular expressions have default names:

rule -> other_rule;        // name defaults to "other_rule"
rule -> "string";          // name defaults to "string"
rule -> .name=/regex/;     // no default!  name must be specified!

The last case is important: regular expressions must have explicit names. This is for two reasons: one is that we don't have a good way to generate a reasonable default name for a regular expression. The second is that if regular expressions could be specified without an explicit name, Gazelle's grammar would be ambiguous:

s -> a / b;
a -> c / d;

Are these two rules, or one rule that has a long, multi-line regex in it? To avoid this ambiguity, every regular expression inside a rule must have an explicit name.

Strings

When a string appears on a right-hand-side, it specifies exactly what string we expect to see at this point in the input. Strings can be either single or double quoted. Backslashes can be used to quote the next character, but no special backslash sequences are currently recognized (this clearly needs to be fixed in future versions of Gazelle).

Regular Expressions

Regular expressions (regexes for short) describe patterns of text. Users of languages like Perl, Python, or Ruby, or of tools like grep, will find the regular expressions in Gazelle very familiar. The main difference that you will notice is that whitespace in Gazelle regular expressions is insignificant (as you would get in Perl with /x).

Regular expressions in Gazelle are delimited with / (forward slash):

/<body of regex>/

In a regular expression, characters match themselves with the following exceptions:

.

A single dot matches any character (including a newline — as if doing a multiline match in Perl, Python, or Ruby).

\

A backslash always escapes the next character, causing it to lose any special meaning it would otherwise have.

spaces and newlines

To increase readability, whitespace inside regular expressions is insignificant . You are encouraged to use whitespace to make your regular expressions more readable.

[]

A left square bracket begins a character class, and a right square bracket ends it.

{}

Curly brackets allow you to specify repetition with limits. The high and low limits are separated by a comma, and either limit can be omitted to mean 0 and infinite, respectively.

?

A question mark specifies 0 or 1 repetition.

*

An asterisk specifies 0 or more repetition.

+

A plus sign specifies 1 or more repetition.

Named Regular Expressions

A regular expression can be named and referred to later by using the named regular expression syntax.

regex: /<some regex>/;

This appears to be almost identical to declaring a rule with a single regex on the right-hand-side, like so:

rule -> .regex=/<some regex>/;

The primary difference is that the named regular expression syntax avoids creating a trivial and useless nonterminal graph. While Gazelle could detect the case where a nonterminal graph is trivial, it is important that Gazelle preserves the structure of the grammar as specified, so that client programs can hook into this structure at parse time.

Special Commands

Definitions of nonterminals and tokens will be the bulk of most syntax files; however there are several commands that give Gazelle important information about the grammar that isn't expressed in rules.

@start

The @start command tells Gazelle what the top-level nonterminal for a grammar is. If @start is not specified, Gazelle assumes that the first nonterminal in the grammar is the start symbol. The syntax of this command is simply:

@start nonterm;

@start is not required, and may not appear more than once per grammar.

@allow

The @allow command lets you tell Gazelle about syntax constructs that can appear in many different places througout your grammar. Its primary use case is whitespace — it lets you say that whitespace can appear between any two components of certain rules in your grammar. Its syntax is:

@allow nonterm_to_allow in start_nonterm...end_nonterm[, other_end_nonterm]*;

For example:

@allow whitespace in program...string, number;

start_nonterm is very likely to be the top-level rule in your grammar — the rule you specified to @start. end_nonterm specifies what rule or rules do not allow whitespace (their child rules are automatically excluded also). All rules that are sub-rules start_nonterm (either directly or indirectly), but are not sub-rules of end_nonterm, will allow nonterm_to_allow in between any two components of the rule.

Ambiguity Resolution

Some grammars are ambiguous, meaning that some input strings could be parsed in more than one way. The most common example of this is the classic if/then/else case:

stmt -> "if" expr "then" stmt ("else" stmt)? | expr;
expr -> "true" | "false";

If you try to compile this grammar, Gazelle will give you this error message:

Gazelle cannot handle this grammar.
It is not Strong-LL or full-LL (and may be ambiguous, but we don't know).

Unfortunately, Gazelle cannot tell you with certainty that the problem is an ambiguous grammar — detecting whether a grammar is ambiguous is undecidable (impossible to compute in all cases). In the future a more detailed error message will make clear the root of the problem, which is that Gazelle cannot decide whether to match an "else" or not with input text like:

if true then
  if true then
    true
else false

As the grammar is specified above, the "else" in this example could go with either if. This ambiguity exists in many language's grammars, but is explicitly resolved by saying "an else goes with the most recent if". You can express this ambiguity resolution directly to Gazelle by using its ambiguity resolution operators.

/

A slash indicates prioritized choice, as in Parsing Expression Grammars (PEGs). Use it in place of | (non-prioritized choice) when you want to indicate that the first alternative should be used in cases where both alternatives match.

+

Greedy repetition: add to another repetition operator (+, *, or ?) to make the repetition prefer to keep matching.

-

Non-greedy repetition: add to another repetition operator (+, *, or ?) to make the repetition prefer to stop matching.

Here are examples of these operators in use:

// Match "else" to its most recent "if"
stmt -> "if" expr "then" stmt ("else" stmt)?+ | expr;

// "X" "Y" matches the first alternative, but with more than one "X"
// match the second alternative instead.
example -> "X" "Y" / "X"+ "Y";

Gazelle can only support some resolutions of ambiguity. For example, we can't use Gazelle to resolve the if/then/else problem the other way:

// WON'T WORK.  Throws the error below:
stmt -> "if" expr "then" stmt ("else" stmt)?- | expr;
expr -> "true" | "false";

Gazelle cannot support this resolution of the ambiguity in rule stmt.

Because of Gazelle's overriding goal to ensure that the resulting parser is both efficient and as "on-line" as possible, Gazelle can't support all grammars or ambiguity resolutions. If you wanted to resolve the "else" problem the opposite way, you would need to write custom code that makes the decision of whether or not to match the "else", because the logic to do so is nontrivial.

Unfortunately, the same is true if you try to resolve if/then/else with prioritized choice:

// WON'T WORK.  Throws the error below:
stmt -> "if" expr "then" stmt "else" stmt / "if" expr "then" stmt | expr;
expr -> "true" | "false";

Language is probably not LL(k) or LL(*): when calculating lookahead for
a state in stmt, one lookahead language was nonregular, others were not
all fixed

The problem in this case is that you're trying to make Gazelle figure out before it's parsed an "if" whether or not it will have an "else", but that requires unbounded lookahead that has to count (it has to match the "ifs" that it sees with "elses").

The moral of the story is this: in most cases you should avoid ambiguities in your grammar. In some cases you can't avoid them and these ambiguity resolution operators will be a satisfactory solution (if/then/else is the most likely case for this). In other cases ambiguity resolution won't suffice and you'll have to get down and dirty with some imperative Lua code.

Regular Expressions vs. Repetition Within Rules

You will notice that regular expressions and rules both give you the ?, *, and + operators. Given this, there are are many situations where you will find that you can use either regular expressions or rules. For example, consider this regular expression that describes a number:

number:  /(-)?  [0-9]*/

You could express the same pattern using a rule instead:

number -> "-"? ("0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9")*;

In this case, the rule version is much longer and more verbose, because the regular expression offers a useful tool that strings in rules do not offer: the character class. In addition, the second form could make the grammar as a whole more expensive to parse by requiring more tokens of lookahead. Since a number is no longer a single token, it cannot be used as a single unit of lookahead.

Does this mean that we should prefer regular expressions whenever we can? Not always. Consider this example of a quoted string that contain backslash escapes:

"Hello, world!\n\tThis string will be parsed by Gazelle!"

We could write a Gazelle grammar to parse this using a single regular expression.

string -> .string=/" ([^\\"]|\\.)* "/;

However, if we write it this way, we lose the information about where each backslash escape was in the string, because regular expressions in Gazelle do not have capture groups. In Gazelle, you find boundaries for subexpressions by making those subexpressions their own components. In this case, writing the pattern using rules is superior, because it preserves all the important information about where each sequence was seen.

// Written this way, each str_frag is either a backslash sequence
// (like \n) or a string of characters.  You don't have to re-parse
// the str_frag's later to find the backslash sequences.
string   -> '"' str_frag* '"';
str_frag -> /[^\\"]+/ | /\\./;

So what general rule can we extrapolate from this? In general, you should only use regular expressions to match strings that don't have any logical sub-parts. If you find that your regular expression has sub-parts that you might want to pull apart later, make each sub-part into its own component.

Dealing with Errors in your Grammars

Gazelle uses an algorithm known as LL(*) for parsing. The benefit of this algorithm is that it is very fast and uses very little memory. The drawback is that it cannot parse all grammars. Using all the tools that Gazelle gives you, it should be able to parse all real-world languages, it just might require you to rearrange your grammar a bit.

For anyone with unpleasant memories of trying to debug LALR(1) parsers like yacc or Bison, it bears mentioning that debugging Gazelle grammars should eventually be significantly easier than debugging Bison grammars. The LL(*) grammar analysis that Gazelle performs is far easier for humans to understand than LALR tables. However currently some of Gazelle's error messages do not give you the amount of information you really need to debug the problem. This will improve in future versions of Gazelle.

What follows is a list of errors your grammar can have, what they mean, and how to fix them.

Left Recursion

As a top-down parser, Gazelle cannot handle left-recursion. Left-recursion is simply a rule whose first component is a recursive invocation of the same rule. Examples are:

// Right hand side begins with "list".
list -> list "," "item" | "item";

// still left-recursive: even though "list" isn't the first component
// on the right-hand side, it still can occur first in the rule's pattern.
list -> "item" | list "," "item";

// still left-recursive: even though "prefix" can come before list, it
// can also be omitted, leaving "list" as the first component in the rule.
list -> "item" | "prefix"? list "," "item";

// Two or more rules can be left-recursive together.
rule_a -> rule_b "foo";
rule_b -> rule_a | "bar";

Trying to compile any of these grammars will cause Gazelle to throw the error:

Grammar is not LL(*): it is left-recursive!

In the future this message will give more information about which rule(s) were left-recursive. For the moment you have to figure this out yourself.

Eliminating left-recursion is not difficult. The most common reason for writing left-recursion is to express some kind of list:

list -> list "," "item" | "item";

The way to fix this is to express the list using repetition operators instead of left-recursion.

list -> "item" *(,);

Ambiguity

The earlier section on ambiguity resolution discussed ambiguity some. A grammar is ambiguous if there is input text that could be interpreted more than one way according to the grammar. Examples are:

s -> "X" | "X";    // Does a single X match the first or the second component?
s -> "X"? "X"?;    // Does a single X match the first or the second component?

// When X is matched, have we gone a -> b -> c, or just a -> c?
a -> b | c;
b -> c;
c -> "X";

In these cases, Gazelle knows that the grammar is ambiguous and can tell you so in the error message:

Ambiguous grammar for state starting in rule s, paths={{"term", "X"}} AND {{"term", "X"}}

This is telling you simply that in matching a single terminal "X" Gazelle was able to take two different paths through the grammar and end up in the same place (the definition of an ambiguity). Here is the error message from the last example above of three rules:

Ambiguous grammar for state starting in rule a,
paths={{"enter", "b"}, {"enter", "c"}, {"term", "X"}} AND {{"enter", "c"}, {"term", "X"}}

As you can see, Gazelle saw that it could either enter b and then c or just enter c.

To fix problems of ambiguity, you need to rewrite the grammar such that the given input text cannot be interpreted in more than one way. Another option is to use the ambiguity resolution operators.

Note that while Gazelle can detect some cases where the grammar is ambiguous (such as the above examples), in other cases it may only detect that the grammar is not LL(*) but not realize that this is because of an ambiguity. This is the case in the if/then/else example.

Rule has no non-recursive alternative

Every rule must have at least one non-recursive alternative. For example this rule is invalid, because it can never stop recursing:

s -> "X" s;

Gazelle complains with this error message:

Invalid grammar: rule s had no non-recursive alternative

The solution is simple: every rule must have one alternative that stops recursing. For example, you can fix the above rule like so:

s -> "X" s?;

Like left-recursion, two or more rules can work together to create this error:

a -> "X" b;
b -> "Y" a;

Grammar is not LL(*)

These can be the trickiest errors to solve — more information about solving them will be in future versions of the manual.

The C Runtime

The C Runtime is documentated in the header files in runtime/include/gazelle. You can browse them online at GitHub: the header files for the latest version of Gazelle. For an example of how to use the runtime, see utilities/gzlparse.c, which you can also view online at GitHub.

The Gazelle Algorithm

This section describes Gazelle's parsing algorithm. Unlike LALR(1) tables with their reputation for being mysterious and hard to understand, the top-down algorithm that Gazelle uses should be easy for the average programmer to understand. Gazelle users are encouraged to read this section to understand the way your grammars are analyzed and parsed.

Gazelle's algorithm takes a great deal of inspiration from ANTLR, the LL(*) parser generator written by Terence Parr. While Gazelle seeks to augment the theory that Terence developed, the foundations of Gazelle are strongly based in ANTLR's achievements. Like ANTLR, Gazelle is a top-down parser that models its lookahead as possibly-cyclic state machines. Gazelle's algorithm for generating this lookahead is based on ANTLR's algorithm.

Compared to ANTLR, Gazelle adds the following capabilities:

Compared to ANTLR, Gazelle does not yet support syntactic or semantic predicates, but these (or capabilities like them) are planned for the future.

High Level Overview

Gazelle's parsing algorithm boils down to nothing more than a stack of state machines. Each byte of the input transtions these state machines one or more times, and the entire state of the parse can be represented by the stack of state machines that are currently active and which state each one is in.

These are the four different kinds of state machines. Each is explained in detail in the following sections:

Rule Graph or Recursive Transition Network (RTN)

a DFA is built representing each grammar rule, and this DFA transitions once for each token of input. A runtime stack maintains the list of recursed nonterminals.

Grammar Lookahead Automaton (GLA)

The LL(k) lookahead information for any nontrivial RTN state is encoded in a DFA known as a GLA. The final states of GLAs indicate what RTN transition should be taken. We choose this approach rather than using a table because lookup information is typically extremely sparse, not even remotely approaching its theoretical O(2^k) upper limit.

Character-level Integer DFA (IntFA)

Serving the role usually played by separate lexers, IntFAs transition on individual characters of the input stream and hit a final state when a full token is recognized. The appropriate IntFA is chosen based on the current state of the current GLA or RTN For example, if the RTN state indicates we are expecting a variable name, we can select an IntFA that will not recognize keywords, thereby accommodating languages that allow variables named after keywords.

Character decoding DFA

For any multi-byte encoding, the lowest-level DFA is one that examines the input byte-by-byte and assembles the bytes into characters. (TODO: is this the right mechanism for handling things like C trigraphs also?)

In addition, operator-precedence parsing is supported by sandwiching the shunting yard algorithm between layers of the RTN stack.

Language rules that cannot be represented using pure context-free grammars (and all but the simplest languages have such complications) are accommodated with a variety of strategies that let the grammar implementor "hook in" to the above process with imperative code.

Rules and Rule Graphs

Gazelle's algorithm is based around making each rule in a grammar into a graph that resembles a syntax diagram. Syntax diagrams are a common way to describe syntax in a way that is easy for humans to understand visually. Here are some examples of such diagrams:

Example: JSON object from json.org
JSON Object
Example: UPDATE statement from the SQLite documentation
SQLite UPDATE

Syntax diagrams like the ones above visually specify what syntax patterns are allowed by the language. You read them by following the the lines from one grammar element to the next. If there is a fork in the road you can take either path, since both options are valid according to the language. Syntax diagrams are also known as "railroad diagrams" because they are like a network of railroad tracks that have switches at all the decision points.

Gazelle turns each rule into a similar kind of diagram, but Gazelle's diagrams are represented a little bit differently. Here is Gazelle's rendering of the same JSON rule as the first example:

json.png
Gazelle's graph of the JSON rule above

Gazelle calls these "rule graphs" or RTNs (for "Recursive Transition Network"). Rule graphs are state machines, and are one of the several kinds of state machines Gazelle uses to track its parsing state as a whole. You can view the rule graphs for any Gazelle grammar by compiling the grammar with gzlc -d and viewing the HTML output.

In rule graphs the grammar elements (like "string", "value," etc.) are on the edges instead of being the nodes. Also, in rule graphs the lines (the graph edges) don't have decisions or "forks in the road" — the decisions come from deciding what edge to take out of a state. Finally, in a rule graph you reach the "end" of the rule by transitioning to a final state, which is drawn as a state with a double circle. When the parser reaches this state, the rule it was parsing can be considered finished (but need not be — final states can also have transitions out of them).

While the rule graph may look more chaotic than the nicely-laid-out railroad diagram above, Gazelle's rule graph has one important advantage. In Gazelle's rule graph, the states represent real states that the parser can be in. The states in a railroad diagram are not as easy to identify. At any time you can inspect Gazelle's parsing state and see what rule graph state it is in. This is also the basis of how Gazelle can save and restore its parsing state.

The edges of rule graphs can be both terminals and nonterminals. Terminals are physical tokens of input like { and }, while nonterminals are references to other rules like string and value above.

When the parser takes a nonterminal transition, it temporarily stops processing the current rule and moves to a sub-rule corresponding to that nonterminal. To remember where it was, the parser pushes the destination state of the nonterminal transition onto a runtime stack. This whole process is easy to understand when viewed in the context of an example, which will be included in future versions of the manual.

Grammar Lookahead Automaton

You may have noticed that we glossed over one important detail in the previous section. How do we decide what transition in the rule graph to take? Rule graphs can be nondeterministic: there can be more than one transition out of a state for the same terminal. There can also be final states that have transitions out of them. This means that while Gazelle is parsing, there can be more that one option for what transition to take. How does Gazelle pick the right one?

There are two situations where knowing what transition to take is difficult. First, if any of the transitions that lead out of our RTN state are nonterminals (references to other rules), we need a way to know what strings in the input imply what transitions. For example, consider the rule:

hardrule1.png
graph 3: word -> numbers | letters;

If we are in the start state for word and we see a "5" in the input, what transition should we take? You might guess that we should take numbers, but that's entirely dependent on the definition of numbers and letters. We need a way to "see into" those definitions, so we know what transition to take based on what we see in the input.

The second difficulty is similar, but slightly different. What if we are in a final state that has transitions out of it? Consider this rule:

hardrule2.png
graph 3: sentence -> word*;

If we are in the start state for sentence, should we return from the current rule since we are in a final state, or should we stay and match more words? To answer this question we need not only the ability to "see into" word, but also the ability to "see up from" whatever set of rules we are currently recursed into.

This set of problems turns out to be somewhat difficult to solve — in fact, it is exactly the problem that LL parsing aims to solve. Part of what makes it tricky is that depending on how the grammar is written, we might not have enough information at this point in the text to possibly know what transition to take. We always have to look at the next token of input to decide, but in some cases we might have to look ahead 2, 3, or even more tokens before we have enough information to make a decision. It is even possible to write grammars where the token that tells us the difference is arbitrarily far away. These classes of grammars are known as LL(k) if we have to look k tokens ahead to make a decision, or LL(*) if that decision can be arbitrarily far away.

Note This description of LL(k) only tells part of the story. LL algorithms are further divided into "full" LL, Strong-LL, and Simple-LL. Nearly all practical "LL" algorithms are actually Strong-LL, and this is also the case for Gazelle. Though Strong-LL parsers can only parse a subset of the grammars of a full LL parser, they are significantly simpler and more efficient, and in practice provide enough power. For more information on the subject, I heartily recommend the book Parsing Techniques: A Practical Guide by Grune and Jacobs.

So how do we decide what transition to take in the rule graph? The answer is that Gazelle analyzes the grammar at compile time, and figures out for each rule graph state what tokens we expect to see here, and what transition each series of tokens implies. This information is encoded into a DFA known as a GLA — a Grammar Lookahead Automaton. If after analyzing the grammar Gazelle determines that it is possible to build a GLA for every rule graph state, it builds them and encodes it into the byte-code. If it is not possible to build the GLA (which implies that that the grammar is not LL(k)), Gazelle interrupts the compilation with an error.

When Gazelle is actually parsing, it stops at every rule graph state and tries to determine what transition to take next. To make this decision, Gazelle finds the GLA for this rule graph state and runs it for the tokens of the input until the it hits a final state. That final state will tell the rule graph what transition to take.

TODO: flesh out this section with more examples

Character-level Integer DFA

In the previous section, we assumed that we had a stream of tokens to feed into the GLA. But where do these tokens come from? The answer is that another layer of DFA's processes the characters of the input text and breaks them into tokens. These DFAs are known as IntFAs, since the characters of the input text are represented as integers (this is not a particularly good name, and may change). IntFAs in Gazelle fulfill the role traditionally performed by separate lexers.

Most programmers have some familiarity with regular expressions, because they are pervasive in modern programming. With Gazelle you describe tokens using strings and regular expressions, and Gazelle compiles these into DFAs using well-known algorithms (unlike what most scripting languages do today — they use an algorithm with much worse worst-case performance, as Russ Cox argues in his article Regular Expression Matching Can Be Simple And Fast).

Lexing using Gazelle has one key advantage over independent lexers: the Gazelle lexing component knows what sorts of language constructs it is expecting to see at this point in the input. A traditional lexer just breaks the input into words without having any idea what sort of construct it's in. For example, suppose you are a traditional C lexer and you see the input:

x = while;

A traditional C lexer in this case will return the tokens:

It treats "while" as a keyword, even though it is in a context where "while" doesn't make sense. Of course, this is required by the C standard, since keywords cannot be variable names, but suppose we were dealing with a language that did not have this restriction. What do we do? Our traditional lexer design is too weak to return the token WHILE in contexts where it makes sense and an IDENTIFIER named "while" in contexts where an identifier is expected. Another strategy will be required, adding complexity to the parser.

Gazelle, on the other hand, can handle this situation just fine, because parsing and lexing are combined. The key is that every GLA state knows what tokens are valid at this point in the input, so it can choose an IntFA that only recognizes the valid tokens. In the above example, when a Gazelle parser/lexer reached while in the input, the current GLA state would select an IntFA that recognizes IDENTIFIER tokens (because they make sense here) but not a WHILE keyword token (because it does not make sense here). Therefore languages that allow keywords as variables names are easily and naturally supported.

Of course, Gazelle also needs to support languages where keywords are not allowed to be variable names. This can be easily achieved by making the regular expression for identifiers explicitly exclude all keywords. This more explicitly describes the rules of the language anyway — when a language specifications includes text like "variable names may not be any of these reserved words," you can express this directly to Gazelle by explicitly listing the patterns an identifier is not allowed to have.

Character decoding DFA

There is one final detail that is not handled by any of the previous layers: where do the characters come from that are fed into the IntFAs? In the case of ASCII or other single-byte encodings, the answer is simple: the bytes are the characters. However, more and more programming languages are moving towards allowing their source to be in encodings other than ASCII, most commonly Unicode. We need a character decoding step to handle the intracacies of multi-byte encodings.

Many applications handle Unicode encodings such as UTF-8 without representing their character decoding stage as a DFA, so you may wonder why Gazelle takes this route. The benefit of using a DFA is that it provides a way to sanely process partial characters. If we are parsing text that is coming off the network, and the final byte of the buffer ends mid-character, what should we do? We could drop it and re-process it later, but that requires either Gazelle or the application to store it for later re-processing. Worse, it throws away the work we have already done making sense of this byte or bytes. By using a DFA, we can advance the state of the collective machine for every single byte of input, whether or not it is a full character. When we are fed more bytes, we pick up exactly where we left off.

It is also possible that introducing custom processing at this stage could provide an easy and efficient way to process very difficult language corner cases like C's trigraphs.