Building Development Tools

With Greg Voss from JavaSoft.

Tue Feb 11, 1997

gvoss: Hello everyone.

Welcome to this week's JDC Discussion Forum, "Building Java Development
Tools." We'll be talking today with Terence Parr, President and "Lead
Mage" (cofounder) of the MageLang Institute, which specializes in Java training.

Hi Terence,

By way of introduction I'll give some background on what led us to decide
to do a series of articles on parser technology and programmer development
tools.

Parser technology is a subject near and dear to my heart. While
working at Borland, I was surprised to see how much of their work in UI
design was made possible by integrating parser technology into the
development environment. For instance, Borland received many accolades
for its syntax highlighting (keywords, method declarations, and comments
appearing in alternate colors). It took some time for popular
programming editors such as Brief to follow suit. I realized that
third parties were at a distinct disadvantage, especially with languages
like C++, because they didn't have access to a parser.

After seeing environments like Smalltalk and NeXTStep, with their wonderful
browsers and source code editors, I realized you could build all kinds of
amazing tools if you only had a parser. In fact, many of the jobs I was
handling with shell scripts built on grep, sed, awk, and perl were much
better suited to small programs built on a separate parser.

Can you talk a little bit about the kinds of tools you can build with parsers,
especially for folks who don't think parsers have anything to do with
the kinds of problems they solve on a day-to-day basis?

If anyone has any questions, now would be a good time to submit them.
I'll continue with some of my own questions in the meantime.
parrty: As it turns out, the latest C/C++/Objective-C
browser under NeXTStep uses a PCCTS-generated
parser. Anyway, when you're a language guy,
all the World is a language recognition
or translation problem. For example, when I
did a stint at an Army supercomputing center,
even the physicists had to write parsers (in
Fortran...ack) to read in initial conditions
for their computational problems. Any program
that reads any kind of data or properties
can often take advantage of a "real" parser.

Tools such as grep, sed, awk, and perl are
very nice and I use them a lot. However, they
are generally only useful on a line-by-line
basis. A parser can naturally span any
lexical boundaries you want.

From a software engineering point of view,
anytime I can describe the structure of
my program in a meta-language (such as a grammar),
I will do so. I view grammars as a
shorthand for what I could write manually.

Note: I take all this grammar stuff pretty
far--I use grammars to describe tree data
structures so I don't have to build complicated
tree walks by hand.
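
To make that concrete, here is a minimal Java sketch of the kind of tree walk a tree grammar would generate for you (the ExprNode class and its names are illustrative inventions, not PCCTS output):

// A minimal expression-tree node; the names are illustrative only.
class ExprNode {
    String text;          // token text, e.g. "+" or "3"
    ExprNode left, right; // children; null for leaf nodes

    ExprNode(String text, ExprNode left, ExprNode right) {
        this.text = text;
        this.left = left;
        this.right = right;
    }

    // The hand-built walk a tree grammar would let you avoid writing:
    // evaluate "(+ 3 4)"-style trees bottom-up.
    int eval() {
        if (left == null && right == null)
            return Integer.parseInt(text);     // leaf: a number
        if (text.equals("+"))
            return left.eval() + right.eval(); // interior: an operator
        throw new IllegalArgumentException("unknown operator: " + text);
    }

    public static void main(String[] args) {
        ExprNode tree = new ExprNode("+",
            new ExprNode("3", null, null), new ExprNode("4", null, null));
        System.out.println(tree.eval()); // prints 7
    }
}

Every new tree shape means another hand-written walk like eval(); a tree grammar is a shorthand that generates such walks for you.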
gvoss: But is this really relevant to me, Mr. Enterprise Programmer, or to an
ISV in a small shop? Isn't there a lot of overhead involved in building
a parser, or in learning how to use one?
parrty: Greg, "parsers aren't just for breakfast
anymore". ;)

People often have a phobia concerning parsers
and translators because they are a pain to
build by hand. Further, the common LALR-based
parsing tools such as YACC require an MS degree
to really understand. The approach I have
taken is to help you build what you could
manually. The parsers generated from PCCTS
are recursive-descent C++, or Java programs
that you can debug and step through with your
favorite debugger. Believe me, there are
even biologists in the UK right now doing DNA
sequence pattern matching with PCCTS. If
they can do it without a big background in
CS, CS folks should easily manage it.

The production of parsers and translators
is aided by cutting-n-pasting rules or
whole grammars from public-domain sources,
or from grammars you've written before.
We are even adding "grammar inheritance" to
say that a new grammar is just like another
except for the following rules...

As for applications of parsing in noncompiler development
shops, consider our CEO, Tom Burns, who starts to build
a large piece of software by first developing a language or
library to allow his group to speak in the problem domain.
(For example, a guy doing lots of set manipulations built a language
called SETL--set language--to allow him to more efficiently
describe his problems).

Consider a shop that has lots of HTML documents to process and store, etc...
Think of all those stale links. Sure, there are tools to help you
manage this, but if you want to make true "symbolic links" to
documents, you must parse the HTML to grab these links. Also, think
of building your own indexer for a search engine. Walking all the
links in all documents requires an HTML parser. There are boatloads
of examples...
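
As a rough illustration of the link-grabbing step, here is a toy Java sketch (a regex scanner, not a real HTML parser; it misses single-quoted and multi-line attributes, which is exactly why a grammar-based parser is the better tool):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy href extractor: enough to show the idea, not a real HTML parser.
class LinkExtractor {
    // Matches href="..." inside a tag, case-insensitively.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> links(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find())
            out.add(m.group(1)); // the captured URL
        return out;
    }

    public static void main(String[] args) {
        System.out.println(links("<a href=\"index.html\">home</a>"));
    }
}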

gvoss: I was hoping you would bring up some of the programming tools, like
the source code migration tool we're building, to help identify (and
suggest or make changes) when moving JDK 1.0.2 source code to run
under JDK 1.1. There are many other examples (code mungers to help
protect your classes from being decompiled, class hierarchy browsers,
source code analysis tools and databases). How does having a parser
help you build a better source code migration tool than, say, an awk
script or a perl program?
parrty: The answer lies in two key points: flexibility and recognition
strength (although ease of development also plays a big role).

A "real" parser is much more flexible than something built from
a line-by-line tool such as awk or perl. For example, you can
have the parser build a intermediate representation such as
a parse tree or syntax tree that can be walked a second time.
Note that because Java allows you to reference a class or member
before seeing its definition, a two-pass parser is required. Doing
this in perl would require two different perl scripts with the second
script reading some output generated by the first plus, the original
input. Kinda gross. Further, if you want a dialog box to pop up
when the migration tool finds a potential trouble spot, you need
a real language...you can't have perl fork off a GUI very easily.

The second key issue here relates to recognition strength. There
are some language constructs that simply require a great deal of
parser lookahead or context information to correctly parse. There
are even constructs that require symbol table information to parse
in a useful manner. A real parser has the infrastructure to solve
all of these issues, whereas perl was written as a report generator.
It is surely a force-fit to parse complicated languages with perl.

In summary, you might say that perl is great at picking off a few
definitions here and there, as long as you don't need context information.
It is not good at two-pass parsing or at building tree structures.
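
Here is a hedged Java illustration of the two-pass idea (the event array is a hypothetical stand-in for what a real parse over the source would produce): pass one records every definition in a symbol table, and pass two checks each reference, including forward ones.

import java.util.HashMap;
import java.util.Map;

// Sketch of two-pass resolution over (kind, name) events -- hypothetical
// stand-ins for what a real parser would emit while reading Java source.
class TwoPassSketch {
    public static void main(String[] args) {
        String[][] events = {
            {"ref", "helper"},  // Java allows use before definition...
            {"def", "helper"},  // ...so one pass can't resolve the line above
            {"ref", "missing"}
        };

        // Pass 1: collect all definitions into a symbol table.
        Map<String, Boolean> symbols = new HashMap<>();
        for (String[] e : events)
            if (e[0].equals("def")) symbols.put(e[1], Boolean.TRUE);

        // Pass 2: every reference can now be checked, even forward ones.
        for (String[] e : events)
            if (e[0].equals("ref") && !symbols.containsKey(e[1]))
                System.out.println("unresolved reference: " + e[1]);
    }
}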
gvoss: I want to get back to symbol tables, and how to manipulate them,
later. For now, let's talk about basic technology issues first.

Let's talk a little bit about the specifics of parser technology.
(Readers, keep in mind that you don't have to really understand how
parsers work or how to build them in order to use them in your own
custom development tools. However, it does help to know what's
happening behind the scenes.)

Can you briefly explain the difference between a "parser" and a "parser generator"?
parrty: Okay, sure. A parser is a program that attempts to apply a grammatical
structure to an input stream of vocabulary symbols, such as programming
language keywords like "int" or "while". Note that I consider tree-walkers
to be parsers, too--they simply apply a two-dimensional grammatical structure. ;)
Fundamentally, a parser answers "yes" or "no" as to whether the input
stream conformed to a grammatical structure.

Now a parser-GENERATOR is a tool that generates a parser from a high-level
description, such as a grammar. For example, the overall structure of
simple English language sentences might look like:

sentence : subject predicate object ;

A parser generator would read this description and generate a program to
check an input stream for conformance. You might get a method out like
this:

void sentence() {
    subject();   // go match a sentence subject such as "I"
    predicate(); // match a predicate such as "ran"
    object();    // match an object such as "to the store"
}
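
Fleshed out by hand in Java (an illustrative sketch of the shape of generated code, not actual ANTLR output), the whole parser might look like this:

// Hand-written version of what a generator might emit (illustrative).
class SentenceParser {
    private final String[] tokens;
    private int pos = 0;

    SentenceParser(String[] tokens) { this.tokens = tokens; }

    private void match(String expected) {
        if (pos >= tokens.length || !tokens[pos].equals(expected))
            throw new RuntimeException("expected " + expected + " at token " + pos);
        pos++;
    }

    void sentence()  { subject(); predicate(); object(); }
    void subject()   { match("I"); }
    void predicate() { match("ran"); }
    void object()    { match("to"); match("the"); match("store"); }

    public static void main(String[] args) {
        new SentenceParser(new String[] {"I", "ran", "to", "the", "store"}).sentence();
        System.out.println("input conforms to the grammar"); // the "yes" answer
    }
}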

gvoss: You mentioned that building languages applies to nonprogramming
languages--languages for controlling machine tools or robots, for
example. Have you had experience with special-purpose languages
such as those used in machine automation or graphics generation?
parrty: Actually, there are lots of fun things you can do with languages
as they apply to graphics. I built a really cool application that takes in
trees in LISP-like notation such as "(+ 3 4)" and generates PostScript
output to draw the tree visually (that tree has "+" at the root and two
children, "3" and "4"). All the trees appearing in the PCCTS book were
generated automatically using this tool.

Another groovy language you might be interested in is called IVL: Inventor
Language (VRML is basically the ASCII output of the SGI Inventor library).
IVL has English commands to build up complicated VRML scenes. For example:

floor = "http://some URL of the floor VRML object"
table = ...
...
draw floor
draw table above floor
draw workstation above table
draw refrigerator above floor and 3 to the left of table

See http://www.ocnus.com/ivl.html for more info. Oh, IVL is pronounced "evil" ;)

gvoss: Let's talk about ANTLR, your parser generator. First of all, what
does ANTLR stand for?
parrty: ANTLR stands for ANother Tool for Language Recognition--a "funny" name, as first there was YACC, then BISON, then ANTLR. ;)
ANTLR is the parser generator packaged in PCCTS (Purdue Compiler-Construction Tool Set)--the practical application of
my PhD dissertation at Purdue.
gvoss: ANTLR takes a Java grammar (essentially compatible with JDK 1.1 Java
code, except for inner classes) and generates C++ source code for a
parser.


I guess we can get back to symbol tables from here. The output of
ANTLR can either be a symbol table (a database full of identifiers,
such as methods and variables, paired with their definitions), or it
could be just a bunch of print statements--for example, a list of
filenames and line numbers followed by lines with the matched
identifiers.

When would you want symbol table output versus printf-style
text output?
parrty: Just to clarify a bit...ANTLR doesn't generate a symbol table; it generates a parser that can construct a symbol table for later use by your program. Another thing to consider is that the parser can also build an intermediate representation in the form of a tree data structure that can easily be walked to obtain information, or manipulated for eventual translation back to text.

Anyway, the task at hand really defines what you want to output. Can you be more specific about printf versus symbol table stuff?
gvoss: What I'm really after here is an understanding of how much a
programmer would need to customize the generated parser source code to
get output that is going to be useful for the custom tool she or he
needs to build.
parrty: Ah. Well, you never touch the output of the parser; rather, you embed Java or C++ actions in the grammar, which are then executed at that point during the parse. For example, here is the rule for class definitions in our Java grammar:

classDefinition
    : "class" id:IDENT extends implements
      <<
      currentClassOrInterface = id->getText();
      fprintf(out, "class %s %d\n", id->getText(), id->getLine());
      >>
      classBlock
    ;

The fprintf will dump out the class def right before the classBlock is matched.

This type of output is easy to obtain, but it is often not what you want. For difficult problems, such as translations that require the output to appear in a different order, you cannot just use a bunch of printfs. You must build a symbol table and generate a tree (perhaps ordering the list of declarations such that no member is referenced before it is used). The tree is then manipulated and eventually printed back to text.
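
A toy Java sketch of that last point (the declarations and the reordering rule are hypothetical): printf output is locked to parse order, but an intermediate form can be reordered before translation back to text.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy illustration: collect declarations first, reorder, then print.
class ReorderSketch {
    public static void main(String[] args) {
        // Pretend the parser saw these member declarations in this order.
        List<String> decls = new ArrayList<>();
        decls.add("int useCount;");
        decls.add("void reset() { useCount = 0; }");
        decls.add("boolean active;");

        // Manipulate the intermediate form; a simple sort stands in
        // for a real dependency-ordering pass over a tree.
        Collections.sort(decls);

        // Only now translate back to text.
        for (String d : decls)
            System.out.println(d);
    }
}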
gvoss:
What does a programmer gain by having a parser generator that generates
C/C++ code as opposed to a parser generator that generates Java code?

Is it a disadvantage, for example, that ANTLR generates C/C++ parser
source code rather than Java source code?
parrty: There are two main advantages to using C++ as the implementation vehicle for a parser. First, C++ is currently faster than Java and, second, most development tools right now are written in C++, allowing these parsers to be easily folded into their environments. Note, however, that Java can still run these parsers using Runtime.exec(). We have demonstrated this notion by connecting a Java-based source code browser to a Java parser written in C++ (future article at this site).
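
A minimal sketch of that Runtime.exec() idea (the executable name "javaparse" and its output format are assumptions for illustration, not the actual tool):

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Sketch: Java driving an external C++ parser as a child process.
class ExternalParserDriver {
    public static void main(String[] args) throws Exception {
        Process p = Runtime.getRuntime().exec(
            new String[] {"javaparse", "MyClass.java"}); // hypothetical tool
        BufferedReader in = new BufferedReader(
            new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = in.readLine()) != null)
            System.out.println("parser says: " + line); // e.g. symbol info
        p.waitFor();
    }
}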

At the moment we are using the current 1.33 version of PCCTS, which generates parsers implemented in C and C++ only. ANTLR 2.00, under development, is written in Java and generates Java. A prerelease version will be available shortly. Join our ANTLR 2.00 discussion group if you want to influence its design (send a body of "subscribe" to antlr-interest-request@java.magelang.com, then post by emailing antlr-interest@java.magelang.com).

gvoss: Where can programmers get ANTLR, PCCTS, and your new rewrite of ANTLR that generates a Java parser?
parrty: Programmers should take a look at http://java.magelang.com/antlr/entry.html
to get started.

Watch comp.compilers.tools.pccts or the antlr-interest list for an announcement
concerning the latest PCCTS.
gvoss: Also, readers should know about the series of articles written for
the JDC Web site. I think the third installment is going out this
week. There's an interesting example there of a source code browser--
at least as a proof-of-concept example.
Can you explain terms about grammars like LALR, LL(k), and LL(1)? I'm
sure these terms must really seem intimidating if you haven't studied
parser technology in a formal setting.
parrty: You're right. These terms are confusing and often difficult to explain succinctly and correctly.
People often say, "Okay, I don't care what LALR or LL parsers are, but YACC is LALR and PCCTS
is LL. What is the *difference* between the two in practical terms?" There are three primary differences: recognition strength, semantic flexibility (you can embed actions anywhere you want in the grammar), and complexity (LALR is a lot harder to understand because it requires actual parsing theory knowledge).

Recognition strength is considered by many to be the important issue.
After thinking about parsers for a long time, I've settled on the idea that strength is all about context and lookahead. In other words, the more information available to the parser to make decisions, the more complicated the language it can recognize.
gvoss: Can you say a bit more about LALR and LL(k)--what are the advantages
of one over the other, and how do they work? FYI, Terence is composing a lengthy
answer. Give him a beer and ask him to talk about LALR, and you're set for the
evening. This will be the last answer posted today.
parrty: Well, I'll start by bludgeoning you with LALR-based parsers so that you
get a nice feeling when you think about LL ;)

LALR parsers are a simplified version of LR parsers (L==parse input from left to right, R==make decisions at the right edge). LALR's recognition strength relies upon large amounts of context. The context of an LR or LALR parser consists of all grammar alternatives consistent with the previously seen input. This context often includes several "pending" alternatives. Intuitively, an LR parser attempts to match multiple productions at the same time and postpones making a decision until sufficient input has been seen.

Consider a simple grammar so I can illustrate what that means.

rule : A B C
| A B D
;

The capital letters are vocabulary symbols (token references). An LALR parser consumes input symbols until it finds a (viable) complete production; here, we can assume all productions are viable. The parser reads input symbols until it has "A B C", for example, at which point it decides that it has matched the first alternative of the rule. Similarly, for input "A B D", the parser would not decide between alternatives one and two until it had seen the "D". The fact that the parser delays making decisions until right edges provides the source of the R in LR and LALR.

In contrast, the context for an LL parser is restricted to the sequence of previously matched productions and the position within the current grammar production being matched. An LL parser must make decisions about which production to match without having seen any portion of the pending productions--it has access to less context information. You might say that an LALR parser asks "What alternative does the input match?" whereas an LL parser says "I'm looking to match rule foo. Is the input consistent with that rule?" For our simple grammar, you would say that an LL parser needs to decide between both alternatives without having matched any part of either alternative. The parser needs to see three symbols ahead (past the "A B" to the "D" or "C") in order to make a correct decision--the second L in LL comes from the fact that decisions are made at the left edge of alternatives.

Because LL parsers have access to less context information, they rely heavily on lookahead. Note that our grammar above is LALR(0) (look ahead 0 tokens) but LL(3) (look ahead 3 tokens), as a case in point. Parser lookahead is like going into a dark maze that has words written on the floor at regular intervals--given the right word sequence, an adventurer can successfully traverse the maze. At each fork in the maze, the adventurer must compare the input (the visible words on the floor) with the grammar (the key word sequence allowing escape). What happens if at a fork both paths start with the same word on the floor? You have a nondeterminism. More lookahead (or backtracking) is required to make a correct decision. Naturally, the answer is to get a bigger flashlight. ;)
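
To ground the LL(3) point in code, here is a hand-written Java sketch of the decision for the grammar above (simplified and hypothetical; generated parsers organize lookahead differently):

// LL(3) decision for:  rule : A B C | A B D ;
class LookaheadSketch {
    private final String[] tokens;
    private int pos = 0;

    LookaheadSketch(String[] tokens) { this.tokens = tokens; }

    private String LA(int k) { // k-th lookahead token (1-based)
        int i = pos + k - 1;
        return i < tokens.length ? tokens[i] : "<EOF>";
    }

    private void match(String t) {
        if (!LA(1).equals(t)) throw new RuntimeException("expected " + t);
        pos++;
    }

    void rule() {
        // The decision is made at the LEFT edge, by peeking 3 tokens ahead.
        if (LA(3).equals("C")) {
            match("A"); match("B"); match("C"); // alternative 1
        } else {
            match("A"); match("B"); match("D"); // alternative 2
        }
    }

    public static void main(String[] args) {
        new LookaheadSketch(new String[] {"A", "B", "D"}).rule();
        System.out.println("matched"); // no exception: input conformed
    }
}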

Tough to get a correct and readable answer on the fly, but...
gvoss: Wow! That's some answer, Terence. I like the analogy you
use at the end about going through a maze with words written on
the floor. It makes a token stream more graphic. Maybe we could write
a game program to get kids messing with parsers, the way they do by
combining blasteroids with spelling programs. :)
We've gone far beyond our allotted time. Hope no one minded too
much. :) Thanks, Terence, for giving us some insight into parsing and
its use in building custom programming tools. I'm looking forward to
seeing the Java version of ANTLR. We'll post it to the JDC web site
when it's ready. We'll also be posting some pretty cool tools that
use the parser to do their magic.

Anything you want to add, Terence?
parrty: Nope. Glad to get a chance to talk about language tools and such. I look forward to hearing from
anybody with further questions on development tools (parrt@MageLang.com).

-- parrt signing off
gvoss: Thanks everyone for attending. See you next week.
echarne: Sorry I'm way late, but is there a publicly available parser for
Java?
gvoss: Yes, there is. Keep following the JDC article series. There's one available
for download now (ANTLR) through JDC or through the URL Terence
mentioned earlier. There are both a parser and a parser generator
available. The current version of the parser is in C++, so it is hard to
use with an applet. But a regular Java program can call it as a stand-alone
executable.

Shortly, we'll post a new version of ANTLR that generates a parser written
in Java (one that also recognizes Java source). This would work in applets
as well as programs. It also has the advantage of being easier to integrate
with your application--be it a Java source code editor, class browser,
or whatever. Hope that helps. So long, everyone; time for lunch here on the West Coast. Last moderator (me) signing off. The forum is now unmoderated.