That said, ANTLR differs fundamentally from most other research projects--my single voice has dictated the direction, coding, and philosophy for its ten-year history. Too many promising projects wither and die after a student or professor leaves the project. The opposite can happen as well. Some projects suffer because 20 students have come through and added a bunch of random junk.
The focus of my efforts has been helping the common programmer solve real recognition and translation problems. This means that ANTLR must be lean, fast, powerful, and most importantly must employ mechanisms that average programmers understand. All additions and changes to ANTLR are evaluated in light of this mission.
In this document, I list the major innovations and contributions of ANTLR. I am pleased to see a shift in the industry towards LL(k) parsing and to see that other tools are using some of ANTLR's technology and ideas.
In theory, computing lookahead sets for LL(k) parsers for k>1 is easy. It sure is if you simply write out the definitions or algorithms in a book using FIRST/FOLLOW notation. At the time I started thinking about k>1 lookahead, I had only seen real algorithms for k=1. I had a heck of a time computing lookahead sets for k>1 even by hand, even for simple grammars.
In August 1990, right before I started my Ph.D. at Purdue University, I drew out a funny-looking syntax diagram / pushdown automaton from a grammar. Once built, I realized that the data structure was just what I needed, and after playing around I realized that computing lookahead sets was no more complicated than a bounded NFA-to-DFA conversion! Lookahead sets were, in fact, DFAs that generated finite lookahead languages. At any decision point, you can compute the lookahead sets by walking the funny data structure and collecting the symbols you see up to depth k.
So I had a convenient data structure and slick algorithm. Because I was blessed with ignorance, I erroneously defined lookahead such that all symbols at depth n (1<=n<=k) were combined into a single set. So, lookahead for k=4 was an array of 4 sets. While this is "wrong", I had stumbled upon my Ph.D. thesis.
No one had tried building k>1 parsers before because the problem was known to be exponentially complex. It takes O(|T|^k) space just to store a single lookahead set, where |T| is the size of the symbol vocabulary. My new definition of lookahead, linear approximate lookahead as I would call it later, reduced the complexity to O(|T|×k); in other words, the exponent became a multiplier!
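To make the compression concrete, consider a made-up decision between two alternatives over an invented two-token vocabulary {A, B} (this example is mine, not from the original derivation):

    r : ( A A | B B )     // alternative 1
      | ( A B | B A )     // alternative 2
      ;

Full LL(2) lookahead stores tuples: alternative 1 predicts on {AA, BB} and alternative 2 on {AB, BA}, which are disjoint, so the decision is deterministic. Linear approximate lookahead collapses each alternative to one set per depth, so both alternatives become [{A,B}, {A,B}] and the compressed decision looks ambiguous even though full lookahead resolves it. The payoff is storage: for k=4 you keep 4 sets rather than up to |T|^4 tuples.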
Linear approximate lookahead, while drastically smaller and faster to compute, renders your parser much weaker in theory. Fortunately, in practice I showed that full lookahead almost never bought you anything over linear approximate lookahead; perhaps 5% more strength. To grab that last 5%, I found a way to use linear approximate lookahead to attenuate the cost of doing full lookahead. In late 1991, I released PCCTS 1.00, the first practical LL(k) parser generator for k>1.
At the time of this writing, Josef Grosch has decoded my thesis and implemented k>1 lookahead in his CoCo LR-based parser generator. Also, Etienne Gagnon is looking into k>1 lookahead for SableCC. JavaCC computes full k>1 lookahead without using linear approximate lookahead, but avoids the exponentiality by letting the programmer specify the lookahead for any decision.
ANTLR 2 uses purely linear approximate lookahead. It has bitten me only a few times, but I plan to implement full LL(k) lookahead in the future.
LL(k)-based parsers are often weaker than LR(k)-based parsers because LR parsers make decisions after having seen more of the input. On the other hand, any deterministic parsing strategy can fail to handle some really nasty problems such as those found when parsing C++. Furthermore, sometimes syntactic information alone is insufficient to recognize certain language structures, namely context-sensitive structures. While the underlying concepts were not new, in 1993 I finalized some nice extensions to semantic predicates and invented the idea of a syntactic predicate (an infinite-lookahead language operator).
Semantic predicates are run-time tests that can resolve finite lookahead conflicts and syntactic ambiguities with semantic information. To my knowledge, parser generators that allowed semantic predicates go way back to the late 1970s (though the concepts go back further). PCCTS's (ANTLR 1.xx) approach enhanced semantic predicates in several ways.
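For readers who have never seen one, here is a minimal sketch of a disambiguating semantic predicate in ANTLR 2 notation; the rule, the token names, and the isTypeName() symbol-table query are illustrative assumptions, not taken from the original text:

    // the {...}? predicate consults the symbol table to decide whether
    // "ID ID" starts a declaration or the input is an expression statement
    statement
        :   {isTypeName(LT(1).getText())}? ID ID SEMI   // e.g. "MyType x;"
        |   expr SEMI
        ;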
Syntactic predicates are grammar fragments that describe a syntactic context that must be satisfied before application of the associated production is authorized. Backtracking is an extremely common technique even in the parsing world; however, backtracking parsers were not very useful prior to PCCTS. Parser generators took the decision to backtrack out of the hands of the programmer--most of them either backtracked all the time or none of the time.
PCCTS formalized the idea of selective backtracking as an operator, a predicate, that let you resolve finite lookahead conflicts with a possibly infinite, possibly nonregular, lookahead language. Importantly, PCCTS generates hybrid finite/backtracking parsing decisions that avoid the backtrack when finite lookahead provides an obvious path at parse-time. This hybrid approach significantly reduces the amount of backtracking done by the parser.
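Here is a hedged sketch of what such a predicate looks like in PCCTS/ANTLR-style notation; the rule names declaration and expression are placeholders:

    // (declaration) => is the syntactic predicate: attempt to match a declaration
    // with arbitrary lookahead; only if that attempt fails, take the second alternative
    stat
        :   (declaration) => declaration
        |   expression
        ;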
At a boring parallel computing conference in 1991, I came up with a really terse way to construct trees. Using two operators, ^ and !, I could tell ANTLR how to generate trees for a surprisingly large portion of my grammar. For example, suffixing the operator tokens in an expression grammar with the ^ operator makes the parser build the usual abstract syntax trees.
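A small, assumed expression grammar showing both operators (the token names are invented for illustration):

    expr
        :   mexpr (PLUS^ mexpr)*        // ^ makes PLUS the root of the new subtree
        ;
    mexpr
        :   atom (STAR^ atom)*
        ;
    atom
        :   INT
        |   LPAREN! expr RPAREN!        // ! keeps the parentheses out of the tree
        ;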
As a Ph.D. student, and later as a postdoctoral fellow in 1993 at the Army High Performance Computing Research Center at the U of MN, I built lots of Fortran 77 translators. I had a parser construct a tree and then walked it with a bunch of recursive routines. I did not use a generic depth-first tree walker function because I needed lots of context information; e.g., was this ID node on the left-hand side of an assignment?
After having built a number of these tree walking translation phases, I realized that they were so mechanical that I could build them automatically! Tree walking was nothing more than tree parsing and, hence, I should describe my translation phases with a tree grammar. SORCERER, the first tree-parser generator, was born (though you can claim that LISP sort of does the same thing). The grammar inheritance of ANTLR 2 makes building multiple-phase translation grammars even easier.
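To give a flavor of what such a phase looks like, here is a minimal ANTLR 2-style tree grammar that evaluates the expression trees built by the sketch above; the class, rule, and token names are assumptions:

    class ExprWalker extends TreeParser;

    expr returns [int r]
    { int a, b; r = 0; }
        :   #(PLUS a=expr b=expr)   { r = a + b; }
        |   #(STAR a=expr b=expr)   { r = a * b; }
        |   i:INT                   { r = Integer.parseInt(i.getText()); }
        ;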
When I was consulting for NeXT Computer (all hail Steve Jobs and NeXT) in 1994 on their C++ parser, I had the (now obvious) idea that error handling should mirror the functionality of exception handling. You could instruct the parser to handle errors where it made the most sense, as opposed to where the error was generated. For example, it is better to say "bad while-expression" than "error in expression".
ANTLR has always been pretty good at automatic error recovery (though there is much better research on the subject than I have implemented), but parser exceptions gave a huge amount of control to the programmer.
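To ground the idea, here is a hedged sketch of ANTLR 2's rule-level exception handler; the rule, the message, and the recovery action are illustrative assumptions:

    whileStat
        :   WHILE LPAREN expr RPAREN stat
        ;
        exception
        catch [RecognitionException ex] {
            // report the error at the rule where it is most meaningful...
            reportError("bad while-expression");
            // ...then resynchronize via the ANTLR 2 runtime's consumeUntil()
            consumeUntil(RPAREN);
        }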
The YACC++ LR-based parser generator is the first tool I saw that unified the specification of lexer and parser. You could reference characters and tokens in the same grammar. That was many years ago.
ANTLR takes this a bit further to generalize the notion of recognizing a stream of input whether that input be characters, tokens, or tree nodes. Lexers, parsers, and tree parsers are specified using the same syntax and the same code generator is used for all three!
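As an illustration, a lexer and a parser written in the same ANTLR 2 notation (class, rule, and token names are invented):

    class SimpleLexer extends Lexer;

    PLUS :   '+' ;
    INT  :   ('0'..'9')+ ;
    WS   :   (' ' | '\t')+  { $setType(Token.SKIP); } ;   // same syntax, over characters

    class SimpleParser extends Parser;

    sum  :   INT (PLUS INT)* ;                             // same syntax, over tokens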
While I remember discussing this idea as early as 1993/1994 at one of "Dr. T's Traveling Parsing Revival and Beer Tasting Festivals", it was not until 1997 that I implemented the idea of generating top-down lexers instead of DFAs. Now, ANTLR-generated parsers and lexers look something like what you would generate by hand.
At the 1997 Dr. T's Traveling Parsing Revival and Beer Tasting Festival (held at JavaSoft), Sriram Sankar gave a talk on JavaCC. He had a nice trick (special tokens) for situations where you want to ignore but not discard certain tokens in the input stream, such as comments. After Sriram left, the group generalized the idea to the notion of channels or streams of tokens, where a lexer might broadcast tokens on any number of channels.
In 1999, after working with a number of colleagues, I finally got around to introducing the notion of token streams, which are reminiscent of Java I/O streams, where chains of stream filters provide more and more highly processed data streams. The JavaCC trick is implemented via a simple object, called TokenStreamHiddenTokenFilter, that sits between the lexer and the parser.
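A hedged Java wiring sketch, assuming an ANTLR 2-generated MyLexer/MyParser pair with WS and COMMENT token types and a startRule() start rule (all of these names are assumptions):

    import antlr.TokenStreamHiddenTokenFilter;

    public class RunParser {
        public static void main(String[] args) throws Exception {
            MyLexer lexer = new MyLexer(System.in);
            // the filter sits between lexer and parser; hidden tokens are
            // skipped over by the parser but not thrown away
            TokenStreamHiddenTokenFilter filter = new TokenStreamHiddenTokenFilter(lexer);
            filter.hide(MyParserTokenTypes.WS);
            filter.hide(MyParserTokenTypes.COMMENT);
            MyParser parser = new MyParser(filter);
            parser.startRule();
        }
    }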
Streams can be filtered to produce many other interesting results, such as splitting off comments, declarations, functions, and so on into separate streams. A lexer is a stream producer, a parser is a stream consumer, and a stream filter is both a stream producer and a stream consumer.