hello -> "(" hello ")" | .digits=/[0-9]+/;
Gazelle is a system for parsing context-free grammars. It takes inspiration from parser generator tools like Yacc/Bison and ANTLR, but makes several design decisions that set it apart — namely, the focus on making grammars reusable and making them language-agnostic.
Once a grammar is written for a language, it should be possible to use it for anything from a compiler to the syntax highlighting component of a text editor. That is Gazelle's single most fundamental design criterion.
|
This manual is very much a work in progress, just as Gazelle itself it. There is a lot of information in here, and it's a great way to get acquainted with Gazelle, especially if you are already familiar with parsing tools in general. However, there isn't a lot of friendly introductory material yet, and some of the information may be out of sync with the implementation. As Gazelle stabilizes, so will the relationship between the manual and reality. |
This section offers a quick tour of Gazelle and its capabilities. It assumes you have already compiled Gazelle successfully.
For our "Hello, World" example, we'll write a grammar for a language that is simply an integer surrounded by an arbitrary number of balanced parentheses. To begin, use a text editor to write this grammar to hello.gzl.
hello -> "(" hello ")" | .digits=/[0-9]+/;
This text is a complete Gazelle grammar. It defines a single rule called "hello", which expands to either another (recursive) hello surrounded by parentheses, or a string of one or more numeric digits (and this term is named "digits").
This language will match any of the following inputs:
(5) (((((((((((100))))))))))) 0
But none of these inputs will match:
(() )( 0)
The first step is to compile the text version of the grammar to bytecode. The Gazelle compiler analyzes the input grammar and builds a series of finite automata (state machines) that can do the real parsing. It writes this compiled version of the grammar to bytecode.
$ . lua_path
$ ./compiler/gzlc --help
gzlc -- Gazelle grammar compiler.
Gazelle v0.2 http://www.reverberate.org/gazelle/
Usage: gzlc [options] input-file
-h, --help you're looking at it.
-d, dump detailed output about the grammar to
html/index.html.
-o <file> output filename. Default is input filename
with extension replaced with .gzc
-v, --verbose dump information about compilation process and
output statistics.
--version dump Gazelle version
$ ./compiler/gzlc hello.gzl
$
You won't see any output — that means that the grammar was compiled successfully. Just like gcc, gzlc is silent when things succeed. The default output filename is the input filename with gzl replaced with gzc.
Tallis:gazelle joshua$ ls -l hello* -rw-r--r-- 1 joshua joshua 212 Jun 28 18:57 hello.gzc -rw-r--r-- 1 joshua joshua 43 Jun 28 18:43 hello.gzl
If you want to see a bit more verbose output, run gzlc with -v.
$ ./compiler/gzlc -v hello.gzl Gazelle v0.2 Opening input file 'hello.gzl'... Parsing grammar... Convering RTN NFAs to DFAs... Minimizing RTN DFAs... Doing LL(k) lookahead calculations... Generating lexer DFAs... Writing to output file 'hello.gzc'... Writing 4 strings... Writing 1 IntFAs... 4 states, 4 transitions Writing 0 GLAs... Writing 1 RTNs...
Now, the magic moment where we do our first actual parsing. We'll write some valid input text for this language, then use gzlparse to actually parse it.
$ echo '((((500))))' > hello_text $ ./utilities/gzlparse --help gzlparse -- A command-line tool for parsing input text. Gazelle 0.2 http://www.reverberate.org/gazelle/. Usage: gzlparse [OPTIONS] GRAMMAR.gzc INFILE Input file can be '-' for stdin. --dump-json Dump a parse tree in JSON as text is parsed. --dump-total When parsing finishes, print the number of bytes parsed. $ ./utilities/gzlparse hello.gzc hello_text $
Again, by default Gazelle doesn't print any output when things were successful. If there had been a parse error, Gazelle would have printed an error and exited. If you want to see more output, the flags —dump-json and —dump-total will print a parse tree in JSON format and a total byte count, respectively.
$ ./utilities/gzlparse --dump-json --dump-total hello.gzc hello_text
{"parse_tree":
{"rule":"hello", "start": 0, "children": [
{"terminal": "(", "slotname": "(", "slotnum": 0, "offset": 0, "len": 1},
{"rule":"hello", "start": 1, "slotname":"hello", "slotnum":1, "children": [
{"terminal": "(", "slotname": "(", "slotnum": 0, "offset": 1, "len": 1},
{"rule":"hello", "start": 2, "slotname":"hello", "slotnum":1, "children": [
{"terminal": "(", "slotname": "(", "slotnum": 0, "offset": 2, "len": 1},
{"rule":"hello", "start": 3, "slotname":"hello", "slotnum":1, "children": [
{"terminal": "(", "slotname": "(", "slotnum": 0, "offset": 3, "len": 1},
{"rule":"hello", "start": 4, "slotname":"hello", "slotnum":1, "children": [
{"terminal": "digits", "slotname": "digits", "slotnum": 3, "offset": 4, "len": 3}
], "len": 3},
{"terminal": ")", "slotname": ")", "slotnum": 2, "offset": 4, "len": 4}
], "len": 5},
{"terminal": ")", "slotname": ")", "slotnum": 2, "offset": 8, "len": 1}
], "len": 7},
{"terminal": ")", "slotname": ")", "slotnum": 2, "offset": 9, "len": 1}
], "len": 9},
{"terminal": ")", "slotname": ")", "slotnum": 2, "offset": 10, "len": 1}
], "len": 11}
}
11 bytes parsed.
$
The next thing you will want to do is use gzlc to produce an HTML dump of the grammar as the compiler sees it. This is an invaluable way to check and be sure that the compiler is seeing things the way you meant it to. If not, you might have misspecified the grammar, or you may have found a bug in the compiler (and these certainly exist).
Doing this HTML dump requires that Graphviz is installed. If ImageMagick is installed, this will also be used to create thumbnails.
$ ./compiler/gzlc -d hello.gzl
Now open html/index.html in a web browser to view the HTML dump. You will see an illustration of the "hello" rule, and you should see that it corresponds to your intuitive understanding of this rule. There are two paths through the rule — either "(", hello, ")" or digits. You will also see the state machine that does the lexing: it recognizes either parentheses or a series of one or more digits.
Gazelle is still quite rough around the edges. We'll look at a few examples of things Gazelle can't handle yet and how the failures manifest themselves, so you can know what to expect if you run into these problems yourself.
$ echo 'bad -> bad "X" | "Y";' > bad1.gzl $ ./compiler/gzlc bad1.gzl <hangs forever>
This grammar is left-recursive, and there is no way for Gazelle (a top-down parser) to successfully analyze it. However, Gazelle doesn't detect this case yet, and will happy spend until eternity trying. So if you see this behavior from gzlc, you've likely written a grammar that cannot be parsed with an LL(k) parser. This isn't always as obvious as left-recursion — here is another example of a grammar that will hang if you feed it to gzlc:
bad -> a | b; a -> "X"* "Y"; b -> "X"* "Z";
There are also some grammars that will be compiled incorrectly. For example:
bad -> "X" "Y" | "X";
Gazelle should (and in the future will) notice that "X" followed by EOF should trigger the second alternative, but at the moment EOF does not get this special handling. So Gazelle will always wait to see a "Y" and not be satisfied with just an "X". You can look at the gzlc -d input to get a better picture of why this fails.
$ echo 'bad -> "X" "Y" | "X";' > bad2.gzl $ echo -n 'X' > bad2_text $ ./compiler/gzlc bad2.gzl $ ./utilities/gzlparse bad2.gzc bad2_text Premature end-of-file.
This is an incorrect error from Gazelle — "X" alone should be valid input.
This concludes the tour. Next you can read more about Gazelle's input language and experiment with your own grammars.
Gazelle's major limitation at the moment is that it cannot support removing whitespace or comments from the input. A feature to address this is coming, but for the moment please bear with me!
I hope you enjoy Gazelle!
Gazelle has been designed to make writing grammars as easy as possible. Whether you are working from a published language specification or designing your own language, the Gazelle grammar format is designed to let you write grammars in the way you think about them.
This chapter focuses on the syntax of Gazelle grammars, and the considerations of making your grammar easily parseable.
Rules are at the heart of every grammar; they describe the patterns of all the constructs in your language. Here is a simple example that describes an assignment statement.
assign -> ident "=" expr;
This declares that an assignment statement is an identifier followed by the string “=” followed by an expression. We assume here that “ident” and “expr” are defined elsewhere in the grammar.
For every rule you write, Gazelle builds a graph that describes the possible paths that the input can take through this rule. Since this rule is very simple and doesn't have any kind of repetition, the graph is just a single path.
Often you will have a rule that can be expanded in more than one way. For example, you might have a boolean expression that can be either “true” or “false.” You can specify alternation to Gazelle with the vertical bar (|).
boolexpr -> "true" | "false";
As you would expect, this produces a rule graph with two edges — one for each option.
You can use alternation freely throughout your rules. Alternation has the lowest precedence, but you can override that with parentheses. Here is a more complicated rule that demonstrates more extensive use of alternation.
name -> (fname (mname | minitial) | nickname) surname;
The graph for this rule is naturally a bit more complicated, but looking at it should convince you that the possible paths through this rule are what you intended.
Repetition is our other main tool for rule-writing. Repetition isn't strictly necessary for writing context-free grammars (it can always be expressed as recursion), but using it can make grammars easier to write, understand, and ultimately parse.
Gazelle offers these repetition modifiers; place them after the term you want to modify.
The ? modifier specifies 0 or 1 occurrences of the previous term. It corresponds to square brackets ([]) in EBNF.
The * modifier specifies 0 or more occurrences of the previous term. It corresponds to curly brackets ({}) in EBNF.
The + modifier specifies 1 or more occurrences of the previous term.
The *(sep) modifier specifies 0 or more occurrences of the previous term, where each occurrence is separated by sep. It is a more straightforward way of writing (term (sep term)*)?. sep can be any valid term (or in unusual cases, expression of terms) that can appear on a right-hand-side of a rule.
The +(sep) modifier specifies 1 or more occurrences of the previous term, where each occurrence is separated by sep. It is a more straightforward way of writing term (sep term)*. sep can be any valid term (or in unusual cases, expression of terms) that can appear on a right-hand-side of a rule.
Here is an example that uses repetition; it is the definition of an object in JSON:
object -> "{" (string ":" value) *(",") "}";
|
One important limitation of the repetition operators is that they cannot be nested within a single rule. For example, you cannot write expr -> (mult_term +("*")) +("+"); This isn't a weakness — it is a design decision that makes the process of pulling a parse tree apart much more sane. If you are tempted to nest repetitions, make the innermost repetition into its own rule. |
So far we have neglected to describe in detail what can appear on the right-hand-side of a rule. In the previous examples we have seen strings and nonterminals — in this section we flesh out exactly what can appear there.
There are four fundamental kinds of terms that can appear on a right-hand-side.
Nonterminals are symbols that are defined by rules, as described in the Rules section. Each nonterminal that appears on the right-hand-side of a rule must at some point be defined by appearing in the left-hand-side of a rule. The definition need not precede the use.
A literal string that we expect to see at this point in the input.
A pattern of characters we expect to see at this point in the input.
You can name a string or regex, and later refer to the string or regex by name.
Strings and regular expressions are referred to collectively as “terminals,” because they represent actual strings of the input text. Gazelle can take the terminals defined in the grammar and use them to create a lexer-like component of the parser automatically. Unlike a separated lexer however, the lexer component of a Gazelle parser knows exactly what terminals it is expecting. Using this information, a Gazelle parser could, for example, parse a language that allowed keywords as variable names, as long as the context was enough to disambiguate the two.
A more practical application of this mechanism is the ability to nest languages easily. For example, many web templating systems nest a language like Perl, PHP, or Ruby inside of regular HTML, by using special tags to delimit the two languages. In this situation, Gazelle automatically knows when to switch to the lexer that parses the nested language's tokens. In many cases the same could be accomplished with a traditional lexer by using lexer states, but doing so would require far more manual effort; Gazelle can do this automatically.
When a string appears on a right-hand-side, it specifies exactly what string we expect to see at this point in the input. Strings can be either single or double quoted. They interpret some backslash sequences (TODO: there are some decisions left to make here — do both interpret the same sequences?)
Regular expressions (regexes for short) describe patterns of text. Users of languages like Perl, Python, or Ruby, or of tools like grep, will find the regular expressions in Gazelle very familiar. The main difference that you will notice is that whitespace in Gazelle regular expressions is insignificant (as you would get in Perl with /x).
Regular expressions in Gazelle are delimited with / (forward slash):
name: /<body of regex>/
In a regular expression, characters match themselves with the following exceptions:
A single dot matches any character (including a newline — as if doing a multiline match in Perl, Python, or Ruby).
A backslash always escapes the next character, causing it to lose any special meaning it would otherwise have.
To increase readability, whitespace inside regular expressions is insignificant . You are encouraged to use whitespace to make your regular expressions more readable.
A left square bracket begins a character class, and a right square bracket ends it.
Curly brackets allow you to specify repetition with limits. The high and low limits are separated by a comma, and either limit can be omitted to mean 0 and infinite, respectively.
A question mark specifies 0 or 1 repetition.
An asterisk specifies 0 or more repetition.
A plus sign specifies 1 or more repetition.
A string or regular expression can be named and referred to later by using the named terminal syntax.
term: /<some regex>/;
This appears to be almost identical to declaring a rule with a single regex on the right-hand-side, like so:
nonterm -> /<some regex>/;
The primary difference is that the named terminal syntax avoids creating a trivial and useless nonterminal graph.
Definitions of nonterminals and terminals will be the bulk of most syntax files; however there are several commands that give Gazelle important information about the grammar that isn't expressed in rules.
The @start command tells Gazelle what the top-level nonterminal for a grammar is. This frees you from having to follow a rule like “the top level nonterminal must be listed first.” The syntax of this command is simply:
@start nonterm;
@start must appear once and exactly once in every grammar.
You will notice that regular expressions and rules both give you the ?, *, and + operators. Given this, there are are many situations where you will find that you can use either regular expressions or rules. For example, consider this regular expression that describes a number:
number: /(-)? [0-9]*/
You could express the same pattern using a rule instead:
number -> "-"? ("0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9")*;
In this case, the rule version is much longer and more verbose, because the regular expression offers a useful tool that strings in rules do not offer: the character class. In addition, the second form could make the grammar as a whole more expensive to parse by requiring more tokens of lookahead. Since a number is no longer a single terminal, it cannot be used as a single unit of lookahead.
Does this mean that we should prefer regular expressions whenever we can? Not always. Consider this example of a quoted string that contain backslash escapes:
string -> '"' str_frag* '"'; str_frag -> /[^\\"]+/ | /\\./
We could rewrite this using a single regular expression:
string: /" ([^\\"]|\\.)* "/
However, if we write it this way, we lose the information about where each backslash escape was in the string, because capture groups in a regular expression cannot keep track of multiple occurrences. We would have to re-parse the string after the fact to find the escape sequences. In this case, writing the pattern using rules is superior, because it preserves all the important information about where each sequence was seen.
So what general rule can we extrapolate from this? In general, you should prefer using a regular expression unless it will match a string that can contain an arbitrary number of points of interest.
In the number example, there were two fixed points of interest: the optional negative sign and the string of numbers. In the string example, the string could contain an arbitrary number of escape sequences inside it. If you stick to this guideline, you can't go far wrong.
So far we have covered all the components of describing the syntax of your language to Gazelle. At this point there is good news and bad news.
The bad news is that in some cases, even if your grammar is perfectly valid Gazelle syntax, Gazelle will not be able to accept your grammar. The good news is that dealing with these problems is not a big deal, and Gazelle will work to give you as much information as it possibly can about how to fix them.
|
At this stage in Gazelle development, Gazelle may not be particularly helpful in dealing with these bad cases; it may fail to detect the problem, lie to you about it, complain when there is no problem, or even insult your mother. The tone of this section reflects where I want Gazelle to be — not where it is right now. |
There are two main reasons you might have to adjust your grammar. One possible problem is that your grammar could be ambiguous. It is very easy to write a grammar that is perfectly valid Gazelle syntax, but could yield more than one parse tree for the same input. Here is a very simple example:
# With this grammar, 1+2+3 can be parsed as either (1+2)+3 or 1+(2+3) expr -> expr "+" expr | /[0-9]+/;
Gazelle can't deal with this ambiguity, and will give you a message explaining the source of the problem.
The other reason is this: while there are a few algorithms that can parse any nonambiguous grammar, they are much slower than algorithms than algorithms that restrict the language somewhat. Since speed is another focus of Gazelle, and because the grammar restrictions are not that onerous, Gazelle focuses on grammars that can be parsed using the faster algorithms.
This section will document the API for the parsing runtime, and explained how it can be used for event-based parsing, syntax trees, ASTs, and whitespace-preserving transformations. It will also discuss best practices for wrapping the C runtime in other languages. It will not offer documentation about each language's wrappers — I think that belongs in separate manuals, but I may change my mind about that.
This section describes Gazelle's parsing algorithm. It will primarily be of interest to parsing aficionados, but may also provide insight to people who are trying to do tricky or advanced things with Gazelle.
Gazelle's parsing algorithm stands on the shoulders of giants. The field of Computer Science has seen decades of high-quality research in parsing, and the author has spent many months (and many trips to the library) working to understand the existing literature. There are also many open source tools that have provided key insights into the practical applications of parsing theory.
That said, the Gazelle algorithm combines some of these concepts in ways that have not been previously done, to the author's knowledge. Gazelle offers these key innovations over existing top-down parsing tools.
Gazelle combines parsing and lexing into a single unit: users of Gazelle implement both lexing and parsing in a single, unified grammar file. This provides significant benefits: grammars are easier to read and write, it is easy to embed one language's grammar into another (for example, embedding an imperative language into HTML, as the many templating engines do), and it makes it possible to support context-sensitive lexing decisions, like languages that allow keywords to be variable names. This technique was debuted in Eelco Visser's 1997 paper Scannerless Generalized-LR Parsing, but has not, to the author's knowledge, been attempted in a top-down parsing tool.
Gazelle supports operator-precedence expression parsing, which makes it much easier to describe infix operator expressions, especially if there are many levels of precedence. It is already known (if not widely advertised) that it is possible to "sandwich" an operator precedence parsing algorithm such as the shunting yard algorithm between two top-down parsers — indeed, this strategy is used in the current version of GCC's C/C++ parser — but this feature has not been offered (to the author's knowledge) in a top-down parsing tool.
What follows is a high-level overview of Gazelle's algorithm that speaks using the language of the field. A gentler explanation follows in the next section.
Gazelle is a top-down parser that supports Strong-LL(k) lookahead for fixed k. In the future this may be extended to LL(*) lookahead using ANTLR's algorithm (devised by Terence Parr) — this would expand the set of grammars Gazelle can handle, but the parsing algorithm itself would remain the same.
Gazelle parses using a stack of discrete finite automata (DFAs) that serve distinct purposes. Four kinds of automata work together to do both lexing and parsing. Starting from the top level and working down we have:
a DFA is built representing each grammar rule, and this DFA transitions once for each token of input. A runtime stack maintains the list of recursed nonterminals.
The LL(k) lookahead information for any nontrivial RTN state is encoded in a DFA known as a GLA. The final states of GLAs indicate what RTN transition should be taken. We choose this approach rather than using a table because lookup information is typically extremely sparse, not even remotely approaching its theoretical O(2^k) upper limit.
Serving the role usually played by separate lexers, IntFAs transition on individual characters of the input stream and hit a final state when a full token is recognized. The appropriate IntFA is chosen based on the current state of the current GLA or RTN For example, if the RTN state indicates we are expecting a variable name, we can select an IntFA that will not recognize keywords, thereby accommodating languages that allow variables named after keywords.
For any multi-byte encoding, the lowest-level DFA is one that examines the input byte-by-byte and assembles the bytes into characters. (TODO: is this the right mechanism for handling things like C trigraphs also?)
In addition, operator-precedence parsing is supported by sandwiching the shunting yard algorithm between layers of the RTN stack.
Language rules that cannot be represented using pure context-free grammars (and all but the simplest languages have such complications) are accommodated with a variety of strategies that let the grammar implementor "hook in" to the above process with imperative code.
This section provides a gentle tour of Gazelle's parsing stack. It is intended that any programmer can follow it, even if you have no parsing experience.
Gazelle parses using four layers of state machines, and each layer serves a distinct purpose. While this may sound complicated at first, seeing each of the layers in action should convince you that this is a simple, natural, and easy-to-understand model.
We saw in the Rules section of the manual that every time you write a rule, Gazelle creates a graph that describes the paths the parser can take through that rule. Here is an example, describing a part of an arithmetic expression grammar using the rule prim -> /0-9+/ | "("expr ")";.
You should find that you can look at the graph for a rule and intuitively understand what it does. In this case, you should be able to easily see that a prim starts with either a number (the regular expression /0-9+/) or a left parenthesis. If it was a number, the rule is done; if it was a left parenthesis, we expect an expr followed by a right parenthesis. You should be able to understand the graph for any rule you write in this way.
Notice that there are two kinds of transitions in this graph: one that matches a string or pattern of strings (/0-9+/, "(" and ")" above) and one that refers to another rule (expr). For the grammar to be complete, let us also define expr and a companion rule factor, to make a small but complete set of rules:
expr -> factor +("+");
factor -> prim +("*");
prim -> /0-9+/ | "(" expr ")";
Recall that the + repetition operator can include a separator as seen here, so expr is essentially equivalent to expr -> factor ("+" factor) *. You wouldn't actually write the grammar this way in Gazelle because expression parsing provides a more natural way of writing the same thing, but this example will serve well to illustrate the concepts.
Because expr and factor have repetition in them, the graphs that Gazelle builds for them have cycles in them, as you can see:
In the field of parsing, this set of graphs (one for each rule in the grammar) is known as a "Recursive Transition Network." We overload this terminology slightly by also referring to a single graph in this set as an RTN. It's recursive because every time we take a transition that refers to another rule, we push our state within the current rule onto a stack. When the sub-rule "returns," we pop our saved state off the stack. This concept will become more clear in the illustration below.
Once we have the recursive transition network, we can use it to parse input text by starting in the "Start" state for the top-level rule and taking whatever transitions are indicated by the input text. Let's walk through an example.
TODO: invest the time to make a friendly graphical walkthrough.
You may have noticed that we glossed over one important detail in the previous section. How do we decide what transition in the RTN to take? In some cases it seems obvious: if all the outgoing transitions in our RTN state are terminals (strings or regular expressions), we can tell by comparing each transition with the next characters of the input text. In some cases, however, the decision can be a lot more complicated to make.
There are two situations where knowing what transition to take is difficult. First, if any of the transitions that lead out of our RTN state are nonterminals (references to other rules), we need a way to know what strings in the input imply what transitions. For example, consider the rule:
If we are in the start state for word and we see a "5" in the input, what transition should we take? You might guess that we should take numbers, but that's entirely dependent on the definition of numbers and letters. We need a way to "see into" those definitions, so we know what transition to take based on what we see in the input.
The second difficulty is similar, but slightly different. What if we are in a final state that has transitions out of it? Consider this rule:
If we are in the start state for sentence, should we return from the current rule since we are in a final state, or should we stay and match more words? To answer this question we need not only the ability to "see into" word, but also the ability to "see up from" whatever set of rules we are currently recursed into.
This set of problems turns out to be somewhat difficult to solve — in fact, it is exactly the problem that LL parsing aims to solve. Part of what makes it tricky is that depending on how the grammar is written, we might not have enough information at this point in the text to possibly know what transition to take. We always have to look at the next token of input to decide, but in some cases we might have to look ahead 2, 3, or even more tokens before we have enough information to make a decision. It is even possible to write grammars where the token that tells us the difference is arbitrarily far away. These classes of grammars are known as LL(k) if we have to look k tokens ahead to make a decision, or LL(*) if that decision can be arbitrarily far away.
|
This description of LL(k) only tells part of the story. LL algorithms are further divided into "full" LL, Strong-LL, and Simple-LL. Nearly all practical "LL" algorithms are actually Strong-LL, and this is also the case for Gazelle. Though Strong-LL parsers can only parse a subset of the grammars of a full LL parser, they are significantly simpler and more efficient, and in practice provide enough power. For more information on the subject, I heartily recommend the book Parsing Techniques: A Practical Guide by Grune and Jacobs. |
So how do we decide what transition to take in the RTN? The answer is that Gazelle analyzes the grammar at compile time, and figures out for each RTN state what tokens we expect to see here, and what RTN transition each series of tokens implies. This information is encoded into a DFA known as a GLA — a Grammar Lookahead Automaton. If after analyzing the grammar Gazelle determines that it is possible to build a GLA for every RTN state, it builds them and encodes it into the byte-code. If it is not possible to build the GLA (which implies that that the grammar is not LL(k)), Gazelle interrupts the compilation with an error.
When Gazelle is actually parsing, it stops at every RTN state and tries to determine what transition to take next. To make this decision, Gazelle finds the GLA for this RTN state and runs it for the tokens of the input until the it hits a final state. That final state will tell the RTN what transition to take.
TODO: flesh out this section with more examples
In the previous section, we assumed that we had a stream of tokens to feed into the GLA. But where do these tokens come from? The answer is that another layer of DFA's processes the characters of the input text and breaks them into tokens. These DFAs are known as IntFAs, since the characters of the input text are represented as integers (this is not a particularly good name, and may change). IntFAs in Gazelle fulfill the role traditionally performed by separate lexers.
Most programmers have some familiarity with regular expressions, because they are pervasive in modern programming. With Gazelle you describe tokens using strings and regular expressions, and Gazelle compiles these into DFAs using well-known algorithms (unlike what most scripting languages do today — they use an algorithm with much worse worst-case performance, as Russ Cox argues in his article Regular Expression Matching Can Be Simple And Fast).
Lexing using Gazelle has one key advantage over independent lexers: the Gazelle lexing component knows what sorts of language constructs it is expecting to see at this point in the input. A traditional lexer just breaks the input into words without having any idea what sort of construct it's in. For example, suppose you are a traditional C lexer and you see the input:
x = while;
A traditional C lexer in this case will return the tokens:
IDENTIFIER (x)
EQUALS
WHILE
SEMICOLON
It treats "while" as a keyword, even though it is in a context where "while" doesn't make sense. Of course, this is required by the C standard, since keywords cannot be variable names, but suppose we were dealing with a language that did not have this restriction. What do we do? Our traditional lexer design is too weak to return the token WHILE in contexts where it makes sense and an IDENTIFIER named "while" in contexts where an identifier is expected. Another strategy will be required, adding complexity to the parser.
Gazelle, on the other hand, can handle this situation just fine, because parsing and lexing are combined. The key is that every GLA state knows what tokens are valid at this point in the input, so it can choose an IntFA that only recognizes the valid tokens. In the above example, when a Gazelle parser/lexer reached while in the input, the current GLA state would select an IntFA that recognizes IDENTIFIER tokens (because they make sense here) but not a WHILE keyword token (because it does not make sense here). Therefore languages that allow keywords as variables names are easily and naturally supported.
Of course, Gazelle also needs to support languages where keywords are not allowed to be variable names. This can be easily achieved by making the regular expression for identifiers explicitly exclude all keywords. This more explicitly describes the rules of the language anyway — when a language specifications includes text like "variable names may not be any of these reserved words," you can express this directly to Gazelle by explicitly listing the patterns an identifier is not allowed to have.
There is one final detail that is not handled by any of the previous layers: where do the characters come from that are fed into the IntFAs? In the case of ASCII or other single-byte encodings, the answer is simple: the bytes are the characters. However, more and more programming languages are moving towards allowing their source to be in encodings other than ASCII, most commonly Unicode. We need a character decoding step to handle the intracacies of multi-byte encodings.
Many applications handle Unicode encodings such as UTF-8 without representing their character decoding stage as a DFA, so you may wonder why Gazelle takes this route. The benefit of using a DFA is that it provides a way to sanely process partial characters. If we are parsing text that is coming off the network, and the final byte of the buffer ends mid-character, what should we do? We could drop it and re-process it later, but that requires either Gazelle or the application to store it for later re-processing. Worse, it throws away the work we have already done making sense of this byte or bytes. By using a DFA, we can advance the state of the collective machine for every single byte of input, whether or not it is a full character. When we are fed more bytes, we pick up exactly where we left off.
It is also possible that introducing custom processing at this stage could provide an easy and efficient way to process very difficult language corner cases like C's trigraphs.