1.33-to-2.0 Conversion Tips

I've been converting 1.33 grammars to 2.0 and thought I'd pass along the
following tips that may help folks avoid some of the problems I've had doing
it. I hope these tips help -- happy parsing!

Tom Nurkkala, PhD
tom.nurkkala@powercerv.com

Note that several of the EBNF notations have changed. In particular, the
optional clause "{...}" has become "(...)?". This new notation for
optional clauses conflicts with the old way to express syntactic predicates,
which have become "(...)=>". Because you'll probably have more optional
clauses than syntactic predicates, convert the optional clauses first, then
go back to your old grammar, find the syntactic predicates and change them
appropriately in the new grammar.
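
As a rough before-and-after sketch of these two notation changes (the rule
and token names are made up for illustration, and the 1.33 half is written
from memory of the old syntax, so check it against your own grammar):

```antlr
/* PCCTS 1.33 */
varDecl : type ID { ASSIGN expr } SEMI ;      /* optional initializer */
stat    : ( varDecl )? varDecl                /* syntactic predicate  */
        | expr SEMI
        ;

/* ANTLR 2.x */
varDecl : type ID ( ASSIGN expr )? SEMI ;     /* optional initializer */
stat    : ( varDecl )=> varDecl               /* syntactic predicate  */
        | expr SEMI
        ;
```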

Semantic actions are now delimited with "{...}" rather than the old-style
"<<...>>" notation. This is an easy replacement to make, as there are
probably few "<<" or ">>" shift operators in your old C++ code, so you can
do a simple search-and-replace. Note that you should change optional
clauses from "{...}" to "(...)?" _before_ changing semantic action
delimiters, when the old optional clauses are still easy to distinguish from
the new semantic action delimiters.
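
A minimal sketch of the delimiter change, assuming a hypothetical rule
"item" and a counter variable "count" declared elsewhere in your own code:

```antlr
/* PCCTS 1.33: action in << ... >> */
item : ID << count++; >> ;

/* ANTLR 2.x: the same action in { ... } */
item : ID { count++; } ;
```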

Probably the most challenging part of the conversion will be moving from
the DLG-based scanner to the LL(k) scanner. Most of the conversions are
quite mechanical, but some are not. In particular, you now have to address
left factoring in those productions of the scanner that will return tokens
to the parser.
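
One way the left factoring can look, as a sketch only. The names DemoLexer,
LESS and LESS_EQ are invented here, and I'm assuming LESS_EQ is already a
known token type (e.g., because the parser refers to it). The "_ttype"
assignment is the same trick described in the numeric-literal tip in this
note.

```antlr
class DemoLexer extends Lexer;

/* Separate rules such as
       LESS    : '<'  ;
       LESS_EQ : "<=" ;
   both begin with '<', so with one character of lookahead the
   scanner can't choose between them. Left-factored into one rule: */
LESS
    : '<'
      ( '=' { _ttype = LESS_EQ; } )?    /* saw "<=", so return LESS_EQ */
    ;
```

Raising the lexer's lookahead depth (k) may also resolve a conflict like
this one; which approach is cleaner will depend on your token set.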

ANTLR is happiest when you use quoted strings directly in the grammar for
keywords. Under 1.33, I had defined all my keywords as lexical tokens
(using something like "#token K_WORD "keyword""). Although doing this avoids
misspelling problems (e.g., using "while" in one place and "whiel" another),
ANTLR 2.x is best-suited to using literals directly in the grammar because
of the way it generates the token hash table, etc. in the resulting code.
Watch carefully for misspellings.
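
Roughly, the change looks like this (K_WHILE, whileStat, expr and block are
placeholder names, not taken from any real grammar):

```antlr
/* PCCTS 1.33: keyword declared as a named token */
#token K_WHILE "while"
whileStat : K_WHILE expr block ;

/* ANTLR 2.x: keyword written as a literal in the parser rule */
whileStat : "while" expr block ;
```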

There is no #tokenclass in ANTLR 2.x. The best way to handle such cases
appears to be to create a new production in the _parser_ that mimics the
old-style token class (e.g., changing "#tokenclass SQLVerbs { K_SELECT,
K_DELETE, ...}" to something like "sqlVerbs : "select" | "delete" | ...").
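
Spelled out as a grammar fragment, with only the two verbs named above and
a hypothetical "statement" rule standing in for wherever the old token class
was referenced (DemoParser, statement, tableName and ID are assumptions):

```antlr
class DemoParser extends Parser;

/* Stands in for the old token class. */
sqlVerbs
    : "select"
    | "delete"
    ;

/* Hypothetical rule that used to reference the token class directly. */
statement : sqlVerbs tableName ;

tableName : ID ;
```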

Handling numeric literals is more problematic in 2.x. In particular, if
you have a language that has "similar" literals (e.g., integers, reals,
dates, times, etc. as are present in a database-focused language), you'll
have more work to do in the LL(k) scanner environment. It appears easiest
to collect these literals into a single scanner production and either
left-factor or make use of syntactic predicates. You can set the token type
in each alternative using a specific semantic action in each disjunct of the
production (e.g., "{ _ttype = NUM_FLOAT; }"). (Note that if you use
the -diagnostic switch on antlr.Tool, the scanner's ".txt" file includes
what seem like spurious complaints about setting _ttype in this manner. The
warnings can apparently be safely ignored.) See the sample Java grammars
(particularly Scott's new one) for examples of how to do this type of thing.
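
A stripped-down sketch of the single-production approach, handling only
integers and simple reals. DemoLexer and NUM_INT are names chosen for the
example, and NUM_FLOAT is assumed to be a token type the lexer already
knows about; a real grammar with dates, times, etc. will need more
alternatives and possibly syntactic predicates.

```antlr
class DemoLexer extends Lexer;

/* One rule covers both kinds of number. It returns NUM_INT unless a
   fractional part follows, in which case the action switches the type. */
NUM_INT
    : ('0'..'9')+
      ( '.' ('0'..'9')+ { _ttype = NUM_FLOAT; } )?
    ;
```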

Use the "protected" flag on lexer rules that are only being used as
"helpers" (e.g., on a "DIGIT" production that's used in other lexer
productions for integers, floats, etc.). Not only does this make the
resulting method in the output protected, it is also used by ANTLR to modify
its test for ambiguous rules in the scanner, eliminating some
"non-deterministic" warnings. See examples of this in Scott's new Java
parser.
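
For example (DemoLexer and INT are illustrative names; DIGIT is the helper
mentioned above):

```antlr
class DemoLexer extends Lexer;

/* Helper only: never returned to the parser as a token of its own. */
protected
DIGIT : '0'..'9' ;

/* Real token, built from the helper. */
INT : (DIGIT)+ ;
```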

When generating ASTs, it's often helpful to create "dummy" nodes that
have a token type that's used only to make AST traversal unambiguous (i.e.,
"flag" various subtrees so that the tree parser doesn't have to fool with
resolving ambiguous tree structures). Under 1.3x, such dummy token types
could be created using #token with no pattern (e.g., "#token D_DUMMY").
Under 2.x, you can create dummy token types with a production that simply
has the dummy values as disjuncts (for example, "dummyTokens : D_RED |
D_GREEN | D_BLUE;"). Such a production will cause the tokens to be created,
added to the TokenTypes output and so on. You can then refer to the dummy
types in semantic actions used to build ASTs. Be sure NOT to refer to the
"dummyTokens" production elsewhere in your grammar!
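
A sketch of how this might fit together. The rule "redThing" and the tokens
ID and COMMA are hypothetical, and the tree-construction action is written
in the 2.x action notation as I understand it, so check it against the
documentation for your release:

```antlr
class DemoParser extends Parser;
options { buildAST = true; }

/* Never referenced elsewhere; exists only so that D_RED, D_GREEN
   and D_BLUE are assigned token types. */
dummyTokens : D_RED | D_GREEN | D_BLUE ;

/* Hypothetical rule: hang the matched IDs under a D_RED root so a
   tree parser can recognize this subtree by its root type. */
redThing
    : ID ( COMMA! ID )*
      { ## = #( #[D_RED, "D_RED"], ## ); }
    ;
```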

Make use of the "-diagnostic" flag on antlr.Tool. The ".txt" output for
your parser(s) and scanner(s) is very helpful in diagnosing conflicts and
ambiguities. Using the txt files in conjunction with the ANTLR output
itself is the easiest way to figure out which alternatives are conflicting
with which when there are ambiguities. Note that when the ANTLR output
refers to "line 0", it's really talking about the "nextToken" function, the
alternatives for which will appear first in the scanner txt file.