Building Development Tools
With Greg Voss from JavaSoft.
Tue Feb 11, 1997
gvoss: |
Hello everyone.
Welcome to this week's JDC Discussion Forum,
"Building Java Development
Tools." We'll be talking today with Terence Parr,
President and "Lead
Mage" (cofounder) of the MageLang Institute, which
specializes in Java training.
Hi Terence,
By way of introduction, I'll give some background on what
led us to decide
to do a series of articles on parser technology and
programmer development
tools.
Parser technology is a subject near and dear to my heart.
While
working at Borland, I was surprised to see how much of
their work in UI
design was made possible by integrating parser technology
into the
development environment. For instance, Borland received
many accolades
for its syntax highlighting (keywords, method
declarations, comments
appearing in alternate colors). It took some time for
popular
programming editors such as Brief to follow suit. I
realized that
third parties were at a distinct disadvantage, especially
with languages
like C++, because they didn't have access to a parser.
After seeing environments like Smalltalk and NeXTStep
with their wonderful
browsers and source code editors, I realized you could
build all kinds of
amazing tools if you only had a parser. In fact, many of
the shell scripts
I was writing with grep, sed, awk, and perl would have
been better written as small
programs built on a separate parser.
Can you talk a little bit about the kinds of tools you
can build with parsers,
especially for folks who don't think parsers have
anything to do with
the kinds of problems they solve on a day-to-day basis?
If anyone has any questions, now would be a good time to
submit them.
I'll continue with some of my own questions in the
meantime. |
parrty: |
As it turns out, the latest
C/C++/Objective-C
browser under NeXTStep uses a PCCTS-generated
parser. Anyway, when you're a language guy,
all the World is a language recognition
or translation problem. For example, when I
did a stint at an Army supercomputing center,
even the physicists had to write parsers (in
Fortran...ack) to read in initial conditions
for their computational problems. Any program
that reads any kind of data or properties
can often take advantage of a "real" parser.
Tools such as grep, sed, awk, and perl are
very nice and I use them a lot. However, they
are generally only useful on a line-by-line
basis. A parser can naturally span any
lexical boundaries you want.
From a software engineering point of view,
anytime I can describe the structure of
my program in a meta-language (such as a grammar),
I will do so. I view grammars as a
shorthand for what I could write manually.
Note: I take all this grammar stuff pretty
far--I use grammars to describe tree data
structures so I don't have to build complicated
tree walks by hand. |
gvoss: |
But is this really relevant to me, Mr.
Enterprise Programmer, or an
ISV in a small shop? Isn't there a lot of overhead
involved in building
a parser, or in learning how to use one? |
parrty: |
Greg, "parsers aren't just for
breakfast
anymore". ;)
People often have a phobia concerning parsers
and translators because they are a pain to
build by hand. Further, the common LALR-based
parsing tools such as YACC require an MS degree
to really understand. The approach I have
taken is to help you build what you could
manually. The parsers generated from PCCTS
are recursive-descent C++ or Java programs
that you can debug and step through with your
favorite debugger. Believe me, there are
even biologists in the UK right now doing DNA
sequence pattern matching with PCCTS. If
they can do it without a big background in
CS, CS folks should easily manage it.
The production of parsers and translators
is aided by cutting-n-pasting rules or
whole grammars from public-domain sources,
or from grammars you've written before.
We are even adding "grammar inheritance" to
say that a new grammar is just like another
except for the following rules...
As for applications of parsing in noncompiler development
shops, consider our CEO, Tom Burns, who starts to build
a large piece of software by first developing a language
or library to allow his group to speak in the problem
domain. (For example, a guy doing lots of set
manipulations built a language called SETL--set
language--to let him describe his problems more
efficiently.)
Consider a shop that has lots of HTML documents to
process and store, etc...
Think of all those stale links. Sure, there are tools to
help you
manage this, but if you want to make true "symbolic
links" to
documents, you must parse the HTML to grab these links.
Also, think
of building your own indexer for a search engine. Walking
all the
links in all documents requires an HTML parser. There are
boatloads
of examples...
|
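To make the HTML example concrete, here is a minimal Java sketch (a hand-rolled scanner for illustration only, not PCCTS output) that pulls href targets out of a document. Because it scans a character stream rather than lines, a tag split across a line break is still found--the usual stumbling block for line-oriented tools.

import java.io.*;

// Illustrative only: extract href="..." targets from an HTML file.
// A real tool would use a full HTML grammar; this sketch just shows
// that a stream-based scanner spans line boundaries naturally.
public class LinkLister {
    public static void main(String[] args) throws IOException {
        StringBuilder buf = new StringBuilder();
        try (Reader in = new BufferedReader(new FileReader(args[0]))) {
            int c;
            while ((c = != -1) buf.append((char) c);
        }
        String text = buf.toString();
        int pos = 0;
        while ((pos = text.indexOf("href=\"", pos)) != -1) {
            int start = pos + 6;                // skip past href="
            int end = text.indexOf('"', start); // find the closing quote
            if (end == -1) break;
            System.out.println(text.substring(start, end));
            pos = end + 1;
        }
    }
}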
gvoss: |
I was hoping you would bring up some of
the programming tools, like
the source code migration tool we're building, to help
identify (and
suggest or make changes) when moving JDK 1.0.2 source
code to run
under JDK 1.1. There are many other examples (code
mungers to help
protect your classes from being decompiled, class
hierarchy browsers,
source code analysis tools and databases). How does
having a parser
help you build a better source code migration tool than,
say, an awk
script or a perl program?
|
parrty: |
The answer lies in two key points:
flexibility and recognition
strength (although ease of development also plays a big
role).
A "real" parser is much more flexible than
something built from
a line-by-line tool such as awk or perl. For example, you
can
have the parser build an intermediate representation such
as
a parse tree or syntax tree that can be walked a second
time.
Note that because Java allows you to reference a class or
member
before seeing its definition, a two-pass parser is
required. Doing
this in perl would require two different perl scripts,
with the second
script reading some output generated by the first, plus
the original
input. Kinda gross. Further, if you want a dialog box to
pop up
when the migration tool finds a potential trouble spot,
you need
a real language...you can't have perl fork off a GUI very
easily.
The second key issue here relates to recognition
strength. There
are some language constructs that simply require a great
deal of
parser lookahead or context information to correctly
parse. There
are even constructs that require symbol table information
to parse
in a useful manner. A real parser has the infrastructure
to solve
all of these issues, whereas perl was written as a report
generator.
It is surely a force-fit to parse complicated languages
with perl.
In summary, you might say that perl is great at picking
off a few
definitions here and there as long as you don't need
context information.
It is not good at two-pass parsing or at building tree
structures.
|
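To make the two-pass point concrete, here is a small hypothetical Java sketch (the "decl"/"use" statement stream is an invented stand-in for real tokens): pass one collects every declared name into a table, and pass two resolves references against the completed table, so forward references fall out naturally.

import java.util.*;

// Hypothetical two-pass checker. Pass 1 collects declarations; pass 2
// resolves references, so a use that appears before its declaration
// still resolves. Doing this line-by-line in a single pass is awkward.
public class TwoPass {
    public static void main(String[] args) {
        // Stand-in for a parsed token stream: "decl name" or "use name".
        String[] input = { "use b", "decl a", "use a", "decl b", "use c" };

        // Pass 1: build the symbol table.
        Set<String> declared = new HashSet<>();
        for (String stmt : input)
            if (stmt.startsWith("decl ")) declared.add(stmt.substring(5));

        // Pass 2: check every reference against the completed table.
        for (String stmt : input)
            if (stmt.startsWith("use ") && !declared.contains(stmt.substring(4)))
                System.out.println("undefined reference: " + stmt.substring(4));
        // Prints only "undefined reference: c"; the forward use of "b" is fine.
    }
}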
gvoss: |
I want to get back to symbol tables, and
how to manipulate them,
later. For now, let's talk a little bit about the
specifics of parser technology.
(Readers, keep in mind that you don't have to really
understand how
parsers work or how to build them in order to use them in
your own
custom development tools. However, it does help to know
what's
happening behind the scenes.)
Can you briefly explain the difference between a
"parser" and a "parser generator."
|
parrty: |
Okay, sure. A parser is a program that
attempts to apply a grammatical
structure to an input stream of vocabulary symbols, such
as programming
language keywords like "int" or
"while". Note I consider tree-walkers
parsers also--they simply apply a two-dimensional
grammatical structure. ;)
Fundamentally, a parser answers "yes" or
"no" as to whether the input
stream conforms to a grammatical structure.
Now a parser-GENERATOR is a tool that generates a parser
from a high-level
description, such as a grammar. For example, the overall
structure of
simple English language sentences might look like:
sentence : subject predicate object ;
A parser generator would read this description and
generate a program to
check an input stream for conformance. You might get a
method out like
this:
void sentence() {
    subject();    // go match a sentence subject such as "I"
    predicate();  // match a predicate such as "ran"
    object();     // match an object such as "to the store"
}
|
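As a rough illustration (the token codes and match() helper below are simplified assumptions, not actual ANTLR output), here is what such a generated recursive-descent parser boils down to in Java:

// Hand-written recognizer for:  sentence : subject predicate object ;
public class SentenceParser {
    public static final int SUBJECT = 1, PREDICATE = 2, OBJECT = 3, EOF = 0;

    private final int[] tokens; // token stream from some lexer
    private int p = 0;          // index of the lookahead token

    public SentenceParser(int[] tokens) { this.tokens = tokens; }

    private void match(int type) {
        if (tokens[p] != type)
            throw new RuntimeException("expected " + type + ", found " + tokens[p]);
        p++; // consume the token
    }

    public void sentence() {
        subject();   // match a sentence subject such as "I"
        predicate(); // match a predicate such as "ran"
        object();    // match an object such as "to the store"
    }

    private void subject()   { match(SUBJECT); }
    private void predicate() { match(PREDICATE); }
    private void object()    { match(OBJECT); }

    public static void main(String[] args) {
        new SentenceParser(new int[]{SUBJECT, PREDICATE, OBJECT, EOF}).sentence();
        System.out.println("input conforms to the sentence grammar");
    }
}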
gvoss: |
You mentioned that building languages
applies to nonprogramming
languages--languages for controlling machine tools, or
robots for
example. Have you had experience with special-purpose
languages,
such as those used in machine automation or graphics
generation? |
parrty: |
Actually, there are lots of fun things
you can do with languages
as they apply to graphics. I built a really cool
application that takes in
trees in LISP-like notation such as "(+ 3 4)"
and generates PostScript
output to draw the tree visually (that tree has
"+" at the root and two
children "3" and "4"). All the trees
appearing in the PCCTS book were
generated automatically using this tool.
Another groovy language you might be interested in is
called IVL: Inventor
Language (VRML is basically the ASCII output of the SGI
Inventor library).
IVL has English commands to build up complicated VRML
scenes. For example:
floor = "http://some URL of the floor VRML
object"
table = ...
...
draw floor
draw table above floor
draw workstation above table
draw refrigerator above floor and 3 to the left of table
See http://www.ocnus.com/ivl.html for more info. Oh, IVL
is pronounced "evil" ;)
|
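A recognizer for that LISP-like tree notation is small enough to sketch by hand. Here is a hypothetical Java illustration (not the actual PostScript drawing tool) that parses the text into a tree and prints it back out:

import java.util.*;

// Parses LISP-like trees such as "(+ 3 (* 4 5))" with the grammar:
//   tree : "(" atom tree* ")" | atom ;
public class LispTree {
    static class Node {
        final String text;
        final List<Node> kids = new ArrayList<>();
        Node(String t) { text = t; }
        public String toString() {
            if (kids.isEmpty()) return text;
            StringBuilder b = new StringBuilder("(" + text);
            for (Node k : kids) b.append(' ').append(k);
            return b.append(')').toString();
        }
    }

    private final List<String> toks = new ArrayList<>();
    private int p = 0; // lookahead position

    LispTree(String s) { // trivial lexer: parens and whitespace-separated atoms
        StringTokenizer st = new StringTokenizer(s, "() ", true);
        while (st.hasMoreTokens()) {
            String t = st.nextToken().trim();
            if (t.length() > 0) toks.add(t);
        }
    }

    Node tree() {
        if (toks.get(p).equals("(")) {
            p++;                                 // consume "("
            Node root = new Node(toks.get(p++)); // operator at the root
            while (!toks.get(p).equals(")")) root.kids.add(tree());
            p++;                                 // consume ")"
            return root;
        }
        return new Node(toks.get(p++));          // leaf atom such as "3"
    }

    public static void main(String[] args) {
        System.out.println(new LispTree("(+ 3 (* 4 5))").tree());
    }
}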
gvoss: |
Let's talk about ANTLR, your parser
generator. First of all, what
does ANTLR stand for?
|
parrty: |
ANTLR stands for ANother Tool for
Language Recognition. A "funny" term, as there
was YACC, then BISON, then ANTLR. ;)
ANTLR is the parser generator packaged in PCCTS (Purdue
Compiler-Construction Tool Set)--the practical
application of
my PhD dissertation at Purdue.
|
gvoss: |
ANTLR takes a Java grammar (essentially
compatible with JDK 1.1 Java
code, except for inner classes) and generates C++ source
code for a
parser.
I guess we can get back to symbol tables from here. The
output of
ANTLR can either be a symbol table (a database full of
identifiers
such as methods and variables, paired with their
definitions), or it
could be just a bunch of print statements, for example a
list of
filenames and line numbers followed by lines with the
matched
identifiers.
When would you want symbol table output versus
printf-style
text output? |
parrty: |
Just to clarify a bit...ANTLR doesn't
generate a symbol table, it generates a parser that can
construct a symbol table for later use by your program.
Another thing to consider is that the parser can also
generate an intermediate representation in the form of a
tree data structure that can be easily walked to obtain
information or manipulated for eventual translation back
to text.
Anyway, the task at hand really defines what you want to
output. Can you be more specific about printf versus
symbol table stuff? |
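A minimal sketch of the kind of intermediate representation he means (illustrative only, not ANTLR's actual tree classes): a node the parser builds during the parse, which later passes can walk a second time or print back to text.

import java.util.*;

// Illustrative AST node. The parser builds these while matching input;
// later passes walk the tree freely, in any order, as many times as needed.
public class Ast {
    final String text;
    final List<Ast> children = new ArrayList<>();

    Ast(String text) { this.text = text; }
    Ast add(Ast kid) { children.add(kid); return this; }

    // One possible "walk": print the tree back out as indented text.
    void print(String indent) {
        System.out.println(indent + text);
        for (Ast kid : children) kid.print(indent + "  ");
    }

    public static void main(String[] args) {
        Ast classDef = new Ast("class Point")
            .add(new Ast("field int x"))
            .add(new Ast("field int y"))
            .add(new Ast("method move"));
        classDef.print("");
    }
}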
gvoss: |
What I'm really after here is an
understanding of how much a
programmer would need to customize the generated parser
source code to
get output that is going to be useful for the custom tool
she or he
needs to build.
|
parrty: |
Ah. Well, you never touch the output of
the parser; rather, you embed Java or C++ actions in the
grammar, which are then executed at that point during the
parse. For example, here is the rule for class
definitions in our Java grammar:
classDefinition
    : "class" id:IDENT extends implements
      << currentClassOrInterface = id->getText();
         fprintf(out, "class %s %d\n",
                 id->getText(), id->getLine());
      >>
      classBlock
    ;
The printf will dump out the class def right before the
classBlock is matched.
This type of output is easy to obtain, but is often not
what you want. For difficult problems, such as
translations that require reordering the output, you
cannot just use a bunch of printfs. You must build a
symbol table and generate a tree (perhaps ordering the
list of declarations such that no member is referenced
before it is used). The tree is then manipulated and
eventually printed back to text.
|
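A symbol table of the kind he describes can start out very simple. Here is a hypothetical Java sketch of what such grammar actions might populate (the names are illustrative, not from the actual JDC migration tool):

import java.util.*;

// Hypothetical symbol table filled in by embedded grammar actions, e.g.
//   table.define(id.getText(), "class", id.getLine());
// A later pass can then resolve references or reorder declarations.
public class SymbolTable {
    public static class Entry {
        public final String name, kind; // kind: "class", "method", "field", ...
        public final int line;
        public Entry(String name, String kind, int line) {
            this.name = name; this.kind = kind; this.line = line;
        }
    }

    private final Map<String, List<Entry>> entries = new HashMap<>();

    public void define(String name, String kind, int line) {
        entries.computeIfAbsent(name, k -> new ArrayList<>())
               .add(new Entry(name, kind, line));
    }

    public List<Entry> lookup(String name) {
        return entries.getOrDefault(name, Collections.emptyList());
    }
}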
gvoss: |
What does a programmer gain by having a parser generator
that generates
C/C++ code as opposed to a parser generator that
generates Java code?
Is it a disadvantage, for example, that ANTLR generates
C/C++ parser
source code rather than Java source code?
|
parrty: |
There are two main advantages to using
C++ as the implementation vehicle for a parser. First,
C++ is currently faster than Java and, second, most
development tools right now are written in C++, allowing
these parsers to be easily folded into their
environments. Note, however, that Java can still run
these parsers using Runtime.exec(). We have demonstrated
this notion by connecting a Java-based source code
browser to a Java parser written in C++
(future article at this site).
At the moment we are using the current 1.33 version of
PCCTS, which generates parsers implemented in C and C++
only. ANTLR 2.00, under development, is written in Java
and generates Java. A prerelease version will be
available shortly. Join our ANTLR 2.00 discussion group
if you want to influence its design (send a body of
"subscribe" to
antlr-interest-request@java.magelang.com, then post by
emailing to antlr-interest@java.magelang.com).
|
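The Runtime.exec() arrangement he mentions might look roughly like this from the Java side (the executable name "javaparser" and its line-oriented output format are assumptions for illustration):

import java.io.*;

// Launch an external (e.g., C++) parser as a subprocess and read its
// stdout. The program name and output format here are assumptions.
public class ExternalParser {
    public static void main(String[] args) throws Exception {
        Process p = Runtime.getRuntime().exec(new String[] { "javaparser", args[0] });
        BufferedReader out = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = out.readLine()) != null)
            System.out.println("parser says: " + line); // e.g. "class Foo 12"
        System.out.println("parser exited with status " + p.waitFor());
    }
}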
gvoss: |
Where can programmers get ANTLR, PCCTS,
and your new rewrite of ANTLR that generates a Java
parser?
|
parrty: |
Programmers should take a look at
http://java.magelang.com/antlr/entry.html
to get started.
Watch comp.compilers.tools.pccts or the antlr-interest
list for an announcement
concerning the latest PCCTS.
|
gvoss: |
Also, readers should know about the
series of articles written for
the JDC Web site. I think the third installment is going
out this
week. There's an interesting example there of a source
code browser--
at least as a proof-of-concept example.
Can you explain grammar terms like LALR, LL(k), and
LL(1)? I'm
sure these terms must really seem intimidating if you
haven't studied
parser technology in a formal setting.
|
parrty: |
You're right. These terms are confusing
and often difficult to explain succinctly and correctly.
People often say, "okay, I don't care what LALR or
LL parsers are, but YACC is LALR and PCCTS
is LL. What is the *difference* between the two in
practical terms?" There are three primary
differences: recognition strength, semantic flexibility
(you can embed actions anywhere you want in the grammar),
and complexity (LALR is a lot harder to understand
because it requires actual parsing theory knowledge).
Recognition strength is considered by many to be the
most important issue.
After thinking about parsers for a long time, I've
settled on the idea that strength is all about context
and lookahead. In other words, the more information
available to the parser to make decisions, the more
complicated the language it can recognize.
|
gvoss: |
Can you say a bit more about LALR and
LL(k)--what are the advantages
of one over the other, how do they work? FYI, Terence is
composing a (lengthy) answer. Give him a beer and ask
him to talk about LALR, and you're set for the evening.
This will be
the last answer posted today. |
parrty: |
Well, I'll start by bludgeoning you with
LALR-based parsers so that you
get a nice feeling when you think about LL ;)
LALR parsers are a simplified version of LR parsers
(L==parse input from left to right, R==make decisions at
the right edge). LALR's recognition strength relies upon
large amounts of context. The context of an LR or LALR
parser consists of all grammar alternatives consistent
with the previously seen input. This context often
includes several "pending" alternatives.
Intuitively, an LR parser attempts to match multiple
productions at the same time and postpones making a
decision until sufficient input has been seen.
Consider a simple grammar to illustrate
what that means.
rule : A B C
     | A B D
     ;
The capital letters are vocabulary symbols (token
references). An LALR parser consumes input symbols until
it finds a (viable) complete production; here, we can
assume all productions are viable. The parser reads input
symbols until it has "A B C", for example, at
which point it decides that it has matched the first
alternative of the rule. Similarly, for input "A B
D", the parser would not decide between alternatives
one and two until it had seen the "D". The fact
that the parser delays making decisions until right edges
is the source of the R in LR and LALR.
In contrast, the context for an LL parser is restricted
to the sequence of previously matched productions, and
the position within the current grammar production being
matched. An LL parser must make decisions about which
production to match without having seen any portion of
the pending productions--it has access to less context
information. You might say that an LALR parser asks
"What alternative does the input match?" and an
LL parser says "I'm looking to match rule foo. Is
the input consistent with that rule?" For our simple
grammar, you would say that an LL parser needs to decide
between the two alternatives without having matched any
part of either one. The parser needs to see three symbols
ahead (past the "A B" to the "D" or
"C") in order to make a correct decision--the
second L in LL comes from the fact that decisions are
made at the left-edge of alternatives.
Because LL parsers have access to less
context information, they rely heavily on
lookahead. Note that our grammar above is LALR(0) (look
ahead 0 tokens) and LL(3) (look ahead 3 tokens) as a case
in point. Parser lookahead is like going into a dark maze
that has words written on the floor at regular
intervals--given the right word sequence, an adventurer
can successfully traverse the maze. At each fork in the
maze, the adventurer must compare the input (the visible
words on the floor) with the grammar (key word sequence
allowing escape). What happens if at a fork both paths
start with the same word on the floor? You have a
nondeterminism. More lookahead (or backtracking) is
required to make a correct decision. Naturally, the
answer is to get a bigger flashlight. ;)
Tough to get a correct and readable answer on the fly,
but...
|
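To make the lookahead point concrete, here is a hand-written Java sketch of the LL(3) decision for that grammar (the token codes and lookahead buffer are simplified assumptions):

// Hand-written LL(3) decision for:
//   rule : A B C
//        | A B D
//        ;
public class LL3 {
    static final int A = 1, B = 2, C = 3, D = 4;

    static void rule(int[] tokens) {
        // Peek three tokens ahead BEFORE committing to an alternative:
        // the decision is made at the left edge, using pure lookahead.
        if (tokens[0] == A && tokens[1] == B && tokens[2] == C)
            System.out.println("matched alternative 1: A B C");
        else if (tokens[0] == A && tokens[1] == B && tokens[2] == D)
            System.out.println("matched alternative 2: A B D");
        else
            throw new RuntimeException("syntax error");
    }

    public static void main(String[] args) {
        rule(new int[]{A, B, D});
    }
}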
gvoss: |
Wow! That's some answer, Terence. I like
the analogy that you
use at the end about going through a maze with words
written on
the floor. Makes a token stream more graphic. Maybe we
could write
a game program to get kids messing with parsers like they
do by
combining blasteroids with spelling programs :)
We've gone far beyond our allotted time. Hope no one
minded too
much. :) Thanks, Terence for giving us some insight into
parsing and
its use in building custom programming tools. I'm looking
forward to
seeing the Java version of ANTLR. We'll post it to the
JDC web site
when it's ready. We'll also be posting some pretty cool
tools that
use the parser to do their magic.
Anything you want to add, Terence?
|
parrty: |
Nope. Glad to get a chance to talk about
language tools and such. I look forward to hearing from
anybody with further questions on development tools
(parrt@MageLang.com).
-- parrt signing off
|
gvoss: |
Thanks everyone for attending. See you
next week.
|
echarne: |
Sorry I'm way late, but is there a
publicly available parser for
Java?
|
gvoss: |
Yes, there is. Keep following the JDC
article series. There's one available
for download now (ANTLR) through JDC or through the URL
Terence
mentioned earlier. There is both a parser and a parser
generator
available. The current version of the parser is in C++,
so it is hard to
use with an applet. But a regular Java program can call
it as a stand-alone
executable.
Shortly, we'll post a new version of ANTLR that generates
a parser written
in Java (that also recognizes Java source). This would
work in applets
as well as programs. It also has the advantage of being
easier to integrate
with your application--be it a Java source code editor,
class browser,
or whatever. Hope that helps. So long, everyone; time for
lunch here on the West Coast. Last moderator (me) signing
off. The forum is now unmoderated. |