A whitepaper that advocates hand-written code for walking an AST rather than using ANTLR Tree Grammars. |
To a large degree, how you want to design some AST transformation code is a matter of taste. If you're the type of person who likes to look at some example code first, you're in luck! We have two sets of example code, each of which simply walks through an AST and prints it out as nicely formated Java code. We have the "Tree Grammar" approach in ANTLR and Java syntax by Terence Parr. And we have the the "plain AST walker" approach by Andy Tripp.
If you don't want to go into too much detail and skip all the mumbo-jumbo, you may want to just compare those two and decide for yourself which one you prefer.
...Now, on to the mumbo-jumbo...
When someone is building a language translator, having an AST and wondering what to do with it is not a reasonable starting point. Rather, a reasonable starting point is having a set of requirements; a fairly detailed description of needs to be done in order to perform the translation,
Start with requirements for translation. |
What difference does the starting point make? Well, these are more than two different mindsets, they are also two different approaches. On the one hand, the approach is "Let's do as much as we can with the AST". The other approach is "We must do all of the things in our requirements, and we'll leverage the AST as best we can." And so the two approaches will likely lead us to two different solutions. The goal of language translation is not "What to do with this AST?", it's "What's the best way to do all those things in that huge list above?" Starting with this second goal, you may never even want to have an AST, as I argue in another article. But for the purposes of this paper, I'll assume that you have chosen to use ASTs as the basis for your language translation tool.
The point here is that we should not approach the problem with an AST in hand, wondering how to best traverse it, but rather that we have a set of given requirements that we need to implement and an AST in hand, and where should we go from here? One solution may be to simply walk the AST, but more likely, when we look in detail at our requirements, we'll find that we'll need to do many walks of the AST, we may need to do some very complex searching and manipulation of the AST, we may need multiple ASTs as well as other data structures, and so on.
So it may be more complex than simply walking the AST once, so what? Good question. Just to give one example, if we have to walk the tree many times, doing transformations in each pass, the whole idea of what it means to "validate the tree structure" becomes very complex. So if one strategy has the advantage that it "validates the tree structure", we'll now have to wonder if that's even something that's possible or desirable. More generally, we need to approach the problem with a firm grasp on the real-world complexities that we need to solve. That way, we won't come up with solution that seem very simple on the surface (e.g. "be sure to initialize variables at their declaration...") but turn out to be devilishly complex in reality ("...but only if control flow analysis shows that we need to.").
Anyway, we've got a set of requirements for our language translator. We've decided to ignore my advice and architect it around the idea of AST manipulation. We've used a lexer and parser to build our AST. Now what?
The "Manually walk the AST" approach may or may not involve the Vistor pattern. |
First, let me just outline what it is we're talking about here. In this approach, we simply write some generic code in, say, Java, that traverses a tree. It may involve use of the Visitor pattern, but doesn't have to. Here is some code that does not use it:
void walk(AST ast) { switch(ast.getType()) { case Types.PACKAGE_DEF: print("package "); walk(ast.getFirstChild()); print(";\n\n"); break; case Types.IMPORT: print("import "); walk(ast.getFirstChild()); print(";"); break; case Types.CLASS_DEF: walk(getChild(ast, MODIFIERS)); print("class "); walk(getChild(ast, IDENT)); print(" "); walk(getChild(ast, EXTENDS_CLAUSE)); walk(getChild(ast, IMPLEMENTS_CLAUSE)); startBlock(); walk(getChild(ast, OBJBLOCK)); endBlock(); break; // ...handle every other possible node type in the AST... } }Here, for example, when we encounter a "PACKAGE_DEF" node in our AST, we print out "package ", then print the first child of the AST (via a recursive call to walk()), and then print ";" and a couple of newline characters.
The article lists three problems with visitors:
On the other hand, if we look at the CLASS_DEF processing in the code above, we see that it apparently will not check the order of the children nodes (e.g. that any MODIFIERS child comes first). And with a little effort, we could tell that this code:
walk(getChild(ast, IDENT));Is not actually requiring that the CLASS_DEF node have a IDENT child.
So this code is really doing only some tree structure validation...just about as much as it needs to in order to do its job. Is that really a disadvantage?
Let's take a step back and look at the normal chain of events for a typical compiler:
preprocessor --> lexer ---> parser --> code generator --> assembler --> linker
Our translator probably will be very similar, something like:
preprocessor --> lexer ---> parser --> AST transformer --> prettyprinter
Looking at the compiler steps and all the existing compilers out there, how much input validation does each step do, and what is the result? The answer is that in general, all steps do some sort of validation of their input as part of their normal course of processing. But none really "goes out of it's way" to attempt to completely verify the integrity of its input. The parser assumes that the stream of tokens it got from the lexer is valid, the code generator assumes that the AST produced by the parser is valid, and so on.
So looking at the chain for our translator, do we really need or even want our "AST Transformer" step to be doing any extra work to validate that the parser produced a valid AST? The answer, to me, is a clear "no".
We don't need each step to completely validate its input. |
Aside from making your actions dependent on the physical structure of the tree, manually looking upwards is extremely inconvenient and very inefficient.
As for depending on the physical structure of the AST, there's simply no getting around that. If you're going to do one thing when below a WHILE node, and something else when below an ASSIGN node, you're by definition depending on the physical structure of the tree. You may take the approach of building up some data structure and passing it around, but that's still depending on the structure of the tree. There's nothing wrong with depending on the physical structure of the AST. In fact, that's what our "AST Transformer" - the heart of our translator - does for a living. It massages the tree based on its structure.So I don't see how TreeWalker approach somehow gives you context information "for free". If we look at the java.tree.g example which uses the TreeWalker approach to print a Java AST, we see many calls to getFirstChild() and getNextSibling() every time we need context information - information about children of the current AST node. We also see methods such as indent(), undent(), and nl(), which are written in Java, and preserve context information (how much we are currently indented and whether the last output was a newline or not). I can find nothing in that java.tree.g that "looks up the tree" or uses information that was passed down during the walk.
So rather than this example showing how the TreeWalker approach provides context information for free, it only shows that this example is simple enough not to need it.
As for looking upwards being inconvenient and inefficient, this would only be true if you've done a poor job of designing your AST data structure.
Walking up the tree is fast and easy. |
class MyAST { boolean searchUp(int type) { if (this.type == type) return true; if (getParent() == null) return false; return getParent().searchUp(type); } }
Node actions may compute data needed by future actions. Because each action is isolated..., you cannot use the usual programming constructs like local variables, parameters, and return values to pass information
Let's split "future actions" into two groups and deal with each group. One case is where you save data during a tree walk for some "future action" that's under the current node in the tree. There are several solutions here, none of which is inherently "bad". You could simply not save any data, and look up the tree when you need to, as we showed in the previous section. A second approach is to build up some stack-type data structure pushing when you step down the tree and popping when you step up. And a third approach would be to have some sort of non-local variable that stores this information. For example, you might maintain a boolean "isUnderWhile" variable. Yes, these second two options are bit messy, and the "search up the tree as needed" is almost certainly better. But the point is that there's no glaring problem here: the AST has all the information that we need, and any use of parameters and return values is probably not needed.
The other group of "future actions" is the case where some part of the AST affects some other part that's not under it. For example, suppose we're analysing some AST node that represents a function call and we want to search the AST for the function definition. Here again, we simply need simple routines that would search the AST. No local variables, parameters, and return values is going to help us with this situation no matter how we structure things.
The AST *is* the context. |
The article says "You must use instance variables which are clumsy and violate data hiding principles.". A I said above, the response is "yes, you use the AST, which is an instance variable", but that (assuming done reasonably) is not clumsy nor violate data hiding principles.
The article then goes on to explain that there's nothing inherently wrong with the visitor pattern and the author is not biased against them, which I agree with completely. As a side note, it's interesting to note that the original Design Patterns ("Gang Of Four") book actually uses the walking of ASTs as it's example to illustrate the need for the Visitor Pattern.
class PlusNode extends ExprNode { ExprNode left; ExprNode right; public PlusNode(ExprNode left, ExprNode right) {...} public void walk() { left.walk(); right.walk(); ... plus node action ... }It is indeed a bit scary to think anyone might consider doing this by hand. The only time you tend to see this kind of code these days is in a compiler book that is showing you how you might do a parser by hand. This is in the chapter just before the one that says "and now we will use a generator to generate all that, because no one would want to write that by hand."
So I agree completely with the list of drawbacks given in the article:
And the article continues on to talk about the drawbacks of writing a treewalker "by hand".
Generating lots of classes to do the walking is indeed a bad idea. |
In other words, the three drawbacks listed above do apply to a treewalker that's written "by hand" and has a class for each node type. But it does not apply to a treewalker that's written "by hand" that looks like the JavaEmitter code that I showed earlier - the one big switch statement inside a walk() method.
I realized that tree parsing is identical to text parsing, albeit in two dimensions instead of one.
Go ahead and take an few minutes, a few hours, or a full PhD program to really absorb that. I'll wait right here....
...done? OK. Good. You see how parsing a linear stream of tokens is just like parsing a tree data structure, but with one less dimension? If so, you're a better man than I am. While I can see how they're a little analogous, I can't clearly see that they're really "the same thing" in that I'd want to then use the same language and approach to do these two things.
A great parser-generator language does not necessarily make a great tree-transformation language. |
In the other case, we're taking an AST as input, and perhaps a different AST as output. Here, the situation is not nearly as clear. The mapping from one AST structure to another is quite complex (see XSLT, TXL, or TreeDL for example), and a single BNF-like grammar doesn't capture it.
In any case, let's get right to the supposed advantages of tree grammars over writing your own code (visitor pattern and all that) that walks an AST. From the article:
As for the first item, we've covered this issue already. It's a formal specification of the input AST structure, not any output. We don't want a complete specification of the input tree structure in our AST-transformation phase. It's enough for the phase to assume that the previous phase (the parser) is doing its job correctly, just as all phases do. Having a complete specification here is actually harmful, because now we have that same "specification" in two places: one is the heart of the parser, and the second is here at the start of the transformer. Yes, certainly we'll have to deal with error cases in our AST-transformation phase if it gets an AST that it can't handle. But I really don't want to have anything like a complete specification of the AST structure - that's the parser's job.
As for the second item: "actions having explicit context by virtue of the location in the grammar"...obviously this makes no sense to me since I don't even want the grammar here. We also covered this earlier under "Context Information".
This idea of "having explicit context" goes to what we said earlier about instance variables. In my view, the AST is the context. We are in the midst of walking an AST, and if we want to find out our context, we simply look up the tree. There's no need for passing variables and return values. And there's no need to have the code structure match the AST structure.
And the third issue "data can be passed around easily between actions", is really the same issue. I don't see a need to be passing around data, whether implicit ("context") or explicit ("parameters"). Just use the AST itself.
I have always hated it when people mix two languages together, whether it's HTML and
Mixing de langues es muy feo* |
If it doesn't, imagine an analogous situation. Imagine trying to read about 10 pages of text, of which about 50% is in English, 25% is in Spanish, and 25% is in a hybrid of English and Spanish.
Finally, and most important, is the issue of the structure of our AST transformation code. Comparing the "Tree Grammar" approach and the "by-hand walker" approach, the crucial issue is that the structure of the Tree Grammar approach matches the input AST structure. It essentially says "Here is what the AST looks like, and actions to print it out are embedded within". Whereas the "by-hand" approach says "Here is how to print out each of the node types". Which is the better way?
Well, at the risk of over-analyzing things... the question comes down to whether we really care what the overall structure of the input AST is. Specifically, do we want our AST-transformation phase to be tied tightly (to "look like", really) the AST structure of the parser? In my opinion, the answer is "no". By not having the AST-transformer tightly bound to the AST structure, it's easier to maintain, enhance, and modify to work with other languages and AST structures.
Structuring the code to match the input doesn't help anything. |
Sentence : [Noun {Action}] Verb {VerbAction} [DirectObject {DirectObjectAction}] *(Conjunction {ConjunctionAction} Sentence) EndPunctuation {EndPunctuationAction};...or do we want it to look like this...
switch (nodeType) { case Noun: nounAction; break; case Verb: VerbAction; break; case DirectObject: DirectObjectAction; break; case Conjunction: ConjunctionAction; break; case EndPunctuation: EndPunctuationAction; break; }The first case is more terse and reflects the structure of the input, but is less readable and mixes two languages (although we only see one language here, presumably we'll have to embed some Java code as things get complicated).
I think the approach that you'll prefer depends on your background. If you're already intimately familiar with the ":", "[". "]", "*" , "{", "}", and ";" constructs in the first approach, you will prefer the terse approach. On the other hand, if you don't want to be bothered with all that "extra syntax", you'll prefer the more verbose "plain Java" approach.
Not sure which one you are? Well, I'll give you a hint. People who actually like dealing with
*"Mixing of languages is very ugly" |
By: Andy Tripp CTO, Founder, Jazillian, Inc. February 22, 2006 |