Building Development Tools
With Greg Voss from JavaSoft.
Tue Feb 11, 1997
gvoss: |
Hello everyone.
Welcome to this week's JDC Discussion Forum,
"Building Java Development
Tools." We'll be talking today with Terence Parr,
President and "Lead
Mage" (cofounder) of the MageLang Institute, which
specializes in Java training.
Hi Terence,
By way of introduction, I'll give some background on what
led us to decide
to do a series of articles on parser technology and
programmer development
tools.
Parser technology is a subject near and dear to my heart.
While
working at Borland, I was surprised to see how much of
their work in UI
design was made possible by integrating parser technology
into the
development environment. For instance, Borland received
many accolades
for its syntax highlighting (keywords, method
declarations, comments
appearing in alternate colors). It took some time for
popular
programming editors such as Brief to follow suit. I
realized that
third parties were at a distinct disadvantage, especially
with languages
like C++, because they didn't have access to a parser.
After seeing environments like Smalltalk and NeXTStep
with their wonderful
browsers and source code editors, I realized you could
build all kinds of
amazing tools if you only had a parser. In fact, many of
the shell scripts
I was writing with grep, sed, awk, and perl would have
been better written as small
programs built on a separate parser.
Can you talk a little bit about the kinds of tools you
can build with parsers,
especially for folks who don't think parsers have
anything to do with
the kinds of problems they solve on a day-to-day basis?
If anyone has any questions, now would be a good time to
submit them.
I'll continue with some of my own questions in the
meantime. |
parrty: |
As it turns out, the latest
C/C++/Objective-C
browser under NeXTStep uses a PCCTS-generated
parser. Anyway, when you're a language guy,
all the World is a language recognition
or translation problem. For example, when I
did a stint at an Army supercomputing center,
even the physicists had to write parsers (in
Fortran...ack) to read in initial conditions
for their computational problems. Any program
that reads any kind of data or properties
can often take advantage of a "real" parser.
Tools such as grep, sed, awk, and perl are
very nice and I use them a lot. However, they
are generally only useful on a line-by-line
basis. A parser can naturally span any
lexical boundaries you want.
From a software engineering point of view,
anytime I can describe the structure of
my program in a meta-language (such as a grammar),
I will do so. I view grammars as a
shorthand for what I could write manually.
Note: I take all this grammar stuff pretty
far--I use grammars to describe tree data
structures so I don't have to build complicated
tree walks by hand. |
gvoss: |
But is this really relevant to me, Mr.
Enterprise Programmer, or an
ISV in a small shop? Isn't there a lot of overhead
involved in building
a parser, or in learning how to use one? |
parrty: |
Greg, "parsers aren't just for
breakfast
anymore". ;)
People often have a phobia concerning parsers
and translators because they are a pain to
build by hand. Further, the common LALR-based
parsing tools such as YACC require an MS degree
to really understand. The approach I have
taken is to help you build what you could
manually. The parsers generated from PCCTS
are recursive-descent C++ or Java programs
that you can debug and step through with your
favorite debugger. Believe me, there are
even biologists in the UK right now doing DNA
sequence pattern matching with PCCTS. If
they can do it without a big background in
CS, CS folks should easily manage it.
The production of parsers and translators
is aided by cutting-n-pasting rules or
whole grammars from public-domain sources,
or from grammars you've written before.
We are even adding "grammar inheritance" to
say that a new grammar is just like another
except for the following rules...
As for applications of parsing in noncompiler development
shops, consider our CEO, Tom Burns, who starts to build
a large piece of software by first developing a language
or library to allow his group to speak in the problem
domain. (For example, a guy doing lots of set
manipulations built a language called SETL--set
language--to let him describe his problems more
efficiently.)
Consider a shop that has lots of HTML documents to
process and store, etc...
Think of all those stale links. Sure, there are tools to
help you
manage this, but if you want to make true "symbolic
links" to
documents, you must parse the HTML to grab these links.
Also, think
of building your own indexer for a search engine. Walking
all the
links in all documents requires an HTML parser. There are
boatloads
of examples...
|
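To make the HTML example concrete, here is a minimal Java sketch (a hand-rolled scanner for illustration only, not PCCTS output) that pulls href targets out of a document. Because it scans a character stream rather than lines, a tag split across a line break is still found--the usual stumbling block for line-oriented tools.

import java.io.*;

// Illustrative only: extract href="..." targets from an HTML file.
// A real tool would use a full HTML grammar; this sketch just shows
// that a stream-based scanner spans line boundaries naturally.
public class LinkLister {
    public static void main(String[] args) throws IOException {
        StringBuilder buf = new StringBuilder();
        try (Reader in = new BufferedReader(new FileReader(args[0]))) {
            int c;
            while ((c = != -1) buf.append((char) c);
        }
        String text = buf.toString();
        int pos = 0;
        while ((pos = text.indexOf("href=\"", pos)) != -1) {
            int start = pos + 6;                // skip past href="
            int end = text.indexOf('"', start); // find the closing quote
            if (end == -1) break;
            System.out.println(text.substring(start, end));
            pos = end + 1;
        }
    }
}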
gvoss: |
I was hoping you would bring up some of
the programming tools, like
the source code migration tool we're building, to help
identify (and
suggest or make changes) when moving JDK 1.0.2 source
code to run
under JDK 1.1. There are many other examples (code
mungers to help
protect your classes from being decompiled, class
hierarchy browsers,
source code analysis tools and databases). How does
having a parser
help you build a better source code migration tool than,
say, an awk
script or a perl program?
|
parrty: |
The answer lies in two key points:
flexibility and recognition
strength (although ease of development also plays a big
role).
A "real" parser is much more flexible than
something built from
a line-by-line tool such as awk or perl. For example, you
can
have the parser build an intermediate representation such
as
a parse tree or syntax tree that can be walked a second
time.
Note that because Java allows you to reference a class or
member
before seeing its definition, a two-pass parser is
required. Doing
this in perl would require two different perl scripts,
with the second
script reading some output generated by the first, plus
the original
input. Kinda gross. Further, if you want a dialog box to
pop up
when the migration tool finds a potential trouble spot,
you need
a real language...you can't have perl fork off a GUI very
easily.
The second key issue here relates to recognition
strength. There
are some language constructs that simply require a great
deal of
parser lookahead or context information to correctly
parse. There
are even constructs that require symbol table information
to parse
in a useful manner. A real parser has the infrastructure
to solve
all of these issues, whereas perl was written as a report
generator.
It is surely a force-fit to parse complicated languages
with perl.
In summary, you might say that perl is great at picking
off a few
definitions here and there as long as you don't need
context information.
It is not good at two-pass parsing or at building tree
structures.
|
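To make the two-pass point concrete, here is a small hypothetical Java sketch (the "decl"/"use" statement stream is an invented stand-in for real tokens): pass one collects every declared name into a table, and pass two resolves references against the completed table, so forward references fall out naturally.

import java.util.*;

// Hypothetical two-pass checker. Pass 1 collects declarations; pass 2
// resolves references, so a use that appears before its declaration
// still resolves. Doing this line-by-line in a single pass is awkward.
public class TwoPass {
    public static void main(String[] args) {
        // Stand-in for a parsed token stream: "decl name" or "use name".
        String[] input = { "use b", "decl a", "use a", "decl b", "use c" };

        // Pass 1: build the symbol table.
        Set<String> declared = new HashSet<>();
        for (String stmt : input)
            if (stmt.startsWith("decl ")) declared.add(stmt.substring(5));

        // Pass 2: check every reference against the completed table.
        for (String stmt : input)
            if (stmt.startsWith("use ") && !declared.contains(stmt.substring(4)))
                System.out.println("undefined reference: " + stmt.substring(4));
        // Prints only "undefined reference: c"; the forward use of "b" is fine.
    }
}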
gvoss: |
I want to get back to symbol tables, and
how to manipulate them,
later. For now, let's talk a little bit about the
specifics of parser technology.
(Readers, keep in mind that you don't have to really
understand how
parsers work or how to build them in order to use them in
your own
custom development tools. However, it does help to know
what's
happening behind the scenes.)
Can you briefly explain the difference between a
"parser" and a "parser generator."
|
parrty: |
Okay, sure. A parser is a program that
attempts to apply a grammatical
structure to an input stream of vocabulary symbols, such
as programming
language keywords like "int" or
"while". Note I consider tree-walkers
parsers also--they simply apply a two-dimensional
grammatical structure. ;)
Fundamentally, a parser answers "yes" or
"no" as to whether the input
stream conforms to a grammatical structure.
Now a parser-GENERATOR is a tool that generates a parser
from a high-level
description, such as a grammar. For example, the overall
structure of
simple English language sentences might look like:
sentence : subject predicate object ;
A parser generator would read this description and
generate a program to
check an input stream for conformance. You might get a
method out like
this:
void sentence() {
    subject();    // go match a sentence subject such as "I"
    predicate();  // match a predicate such as "ran"
    object();     // match an object such as "to the store"
}
|
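As a rough illustration (the token codes and match() helper below are simplified assumptions, not actual ANTLR output), here is what such a generated recursive-descent parser boils down to in Java:

// Hand-written recognizer for:  sentence : subject predicate object ;
public class SentenceParser {
    public static final int SUBJECT = 1, PREDICATE = 2, OBJECT = 3, EOF = 0;

    private final int[] tokens; // token stream from some lexer
    private int p = 0;          // index of the lookahead token

    public SentenceParser(int[] tokens) { this.tokens = tokens; }

    private void match(int type) {
        if (tokens[p] != type)
            throw new RuntimeException("expected " + type + ", found " + tokens[p]);
        p++; // consume the token
    }

    public void sentence() {
        subject();   // match a sentence subject such as "I"
        predicate(); // match a predicate such as "ran"
        object();    // match an object such as "to the store"
    }

    private void subject()   { match(SUBJECT); }
    private void predicate() { match(PREDICATE); }
    private void object()    { match(OBJECT); }

    public static void main(String[] args) {
        new SentenceParser(new int[]{SUBJECT, PREDICATE, OBJECT, EOF}).sentence();
        System.out.println("input conforms to the sentence grammar");
    }
}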
gvoss: |
You mentioned that building languages
applies to nonprogramming
languages--languages for controlling machine tools, or
robots for
example. Have you had experience with special-purpose
languages,
such as those used in machine automation or graphics
generation? |
parrty: |
Actually, there are lots of fun things
you can do with languages
as they apply to graphics. I built a really cool
application that takes in
trees in LISP-like notation such as "(+ 3 4)"
and generates PostScript
output to draw the tree visually (that tree has
"+" at the root and two
children "3" and "4"). All the trees
appearing in the PCCTS book were
generated automatically using this tool.
Another groovy language you might be interested in is
called IVL: Inventor
Language (VRML is basically the ASCII output of the SGI
Inventor library).
IVL has English commands to build up complicated VRML
scenes. For example:
floor = "http://some URL of the floor VRML
object"
table = ...
...
draw floor
draw table above floor
draw workstation above table
draw refrigerator above floor and 3 to the left of table
See http://www.ocnus.com/ivl.html for more info. Oh, IVL
is pronounced "evil" ;)
|
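A recognizer for that LISP-like tree notation is small enough to sketch by hand. Here is a hypothetical Java illustration (not the actual PostScript drawing tool) that parses the text into a tree and prints it back out:

import java.util.*;

// Parses LISP-like trees such as "(+ 3 (* 4 5))" with the grammar:
//   tree : "(" atom tree* ")" | atom ;
public class LispTree {
    static class Node {
        final String text;
        final List<Node> kids = new ArrayList<>();
        Node(String t) { text = t; }
        public String toString() {
            if (kids.isEmpty()) return text;
            StringBuilder b = new StringBuilder("(" + text);
            for (Node k : kids) b.append(' ').append(k);
            return b.append(')').toString();
        }
    }

    private final List<String> toks = new ArrayList<>();
    private int p = 0; // lookahead position

    LispTree(String s) { // trivial lexer: parens and whitespace-separated atoms
        StringTokenizer st = new StringTokenizer(s, "() ", true);
        while (st.hasMoreTokens()) {
            String t = st.nextToken().trim();
            if (t.length() > 0) toks.add(t);
        }
    }

    Node tree() {
        if (toks.get(p).equals("(")) {
            p++;                                 // consume "("
            Node root = new Node(toks.get(p++)); // operator at the root
            while (!toks.get(p).equals(")")) root.kids.add(tree());
            p++;                                 // consume ")"
            return root;
        }
        return new Node(toks.get(p++));          // leaf atom such as "3"
    }

    public static void main(String[] args) {
        System.out.println(new LispTree("(+ 3 (* 4 5))").tree());
    }
}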
gvoss: |
Let's talk about ANTLR, your parser
generator. First of all, what
does ANTLR stand for?
|
parrty: |
ANTLR stands for ANother Tool for
Language Recognition. A "funny" term, as there
was YACC, then BISON, then ANTLR. ;)
ANTLR is the parser generator packaged in PCCTS (Purdue
Compiler-Construction Tool Set)--the practical
application of
my PhD dissertation at Purdue.
|
gvoss: |
ANTLR takes a Java grammar (essentially
compatible with JDK 1.1 Java
code, except for inner classes) and generates C++ source
code for a
parser.
I guess we can get back to symbol tables from here. The
output of
ANTLR can either be a symbol table (a database full of
identifiers
such as methods and variables, paired with their
definitions), or it
could be just a bunch of print statements, for example a
list of
filenames and line numbers followed by lines with the
matched
identifiers.
When would you want symbol table output versus
printf-style
text output? |
parrty: |
Just to clarify a bit...ANTLR doesn't
generate a symbol table, it generates a parser that can
construct a symbol table for later use by your program.
Another thing to consider is that the parser can also
generate an intermediate representation in the form of a
tree data structure that can be easily walked to obtain
information or manipulated for eventual translation back
to text.
Anyway, the task at hand really defines what you want to
output. Can you be more specific about printf versus
symbol table stuff? |
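A minimal sketch of the kind of intermediate representation he means (illustrative only, not ANTLR's actual tree classes): a node the parser builds during the parse, which later passes can walk a second time or print back to text.

import java.util.*;

// Illustrative AST node. The parser builds these while matching input;
// later passes walk the tree freely, in any order, as many times as needed.
public class Ast {
    final String text;
    final List<Ast> children = new ArrayList<>();

    Ast(String text) { this.text = text; }
    Ast add(Ast kid) { children.add(kid); return this; }

    // One possible "walk": print the tree back out as indented text.
    void print(String indent) {
        System.out.println(indent + text);
        for (Ast kid : children) kid.print(indent + "  ");
    }

    public static void main(String[] args) {
        Ast classDef = new Ast("class Point")
            .add(new Ast("field int x"))
            .add(new Ast("field int y"))
            .add(new Ast("method move"));
        classDef.print("");
    }
}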
gvoss: |
What I'm really after here is an
understanding of how much a
programmer would need to customize the generated parser
source code to
get output that is going to be useful for the custom tool
she or he
needs to build.
|
parrty: |
Ah. Well, you never touch the output of
the parser; rather, you embed Java or C++ actions in the
grammar, which are then executed at that point during the
parse. For example, here is the rule for class
definitions in our Java grammar:
classDefinition
    : "class" id:IDENT extends implements
      << currentClassOrInterface = id->getText();
         fprintf(out, "class %s %d\n",
                 id->getText(), id->getLine());
      >>
      classBlock
    ;
The printf will dump out the class def right before the
classBlock is matched.
This type of output is easy to obtain, but is often not
what you want. For difficult problems, such as
translations that require reordering the output, you
cannot just use a bunch of printfs. You must build a
symbol table and generate a tree (perhaps ordering the
list of declarations such that no member is referenced
before it is used). The tree is then manipulated and
eventually printed back to text.
|
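A symbol table of the kind he describes can start out very simple. Here is a hypothetical Java sketch of what such grammar actions might populate (the names are illustrative, not from the actual JDC migration tool):

import java.util.*;

// Hypothetical symbol table filled in by embedded grammar actions, e.g.
//   table.define(id.getText(), "class", id.getLine());
// A later pass can then resolve references or reorder declarations.
public class SymbolTable {
    public static class Entry {
        public final String name, kind; // kind: "class", "method", "field", ...
        public final int line;
        public Entry(String name, String kind, int line) {
            this.name = name; this.kind = kind; this.line = line;
        }
    }

    private final Map<String, List<Entry>> entries = new HashMap<>();

    public void define(String name, String kind, int line) {
        entries.computeIfAbsent(name, k -> new ArrayList<>())
               .add(new Entry(name, kind, line));
    }

    public List<Entry> lookup(String name) {
        return entries.getOrDefault(name, Collections.emptyList());
    }
}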
gvoss: |
What does a programmer gain by having a parser generator
that generates
C/C++ code as opposed to a parser generator that
generates Java code?
Is it a disadvantage, for example, that ANTLR generates
C/C++ parser
source code rather than Java source code?
|
parrty: |
There are two main advantages to using
C++ as the implementation vehicle for a parser. First,
C++ is currently faster than Java and, second, most
development tools right now are written in C++, allowing
these parsers to be easily folded into their
environments. Note, however, that Java can still run
these parsers using Runtime.exec(). We have demonstrated
this notion by connecting a Java-based source code
browser to a Java parser written in C++
(future article at this site).
At the moment we are using the current 1.33 version of
PCCTS, which generates parsers implemented in C and C++
only. ANTLR 2.00, under development, is written in Java
and generates Java. A prerelease version will be
available shortly. Join our ANTLR 2.00 discussion group
if you want to influence its design (send a body of
"subscribe" to
antlr-interest-request@java.magelang.com, then post by
emailing to antlr-interest@java.magelang.com).
|
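The Runtime.exec() arrangement he mentions might look roughly like this from the Java side (the executable name "javaparser" and its line-oriented output format are assumptions for illustration):

import java.io.*;

// Launch an external (e.g., C++) parser as a subprocess and read its
// stdout. The program name and output format here are assumptions.
public class ExternalParser {
    public static void main(String[] args) throws Exception {
        Process p = Runtime.getRuntime().exec(new String[] { "javaparser", args[0] });
        BufferedReader out = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = out.readLine()) != null)
            System.out.println("parser says: " + line); // e.g. "class Foo 12"
        System.out.println("parser exited with status " + p.waitFor());
    }
}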
gvoss: |
Where can programmers get ANTLR, PCCTS,
and your new rewrite of ANTLR that generates a Java
parser?
|
parrty: |
Programmers should take a look at
http://java.magelang.com/antlr/entry.html
to get started.
Watch comp.compilers.tools.pccts or the antlr-interest
list for an announcement
concerning the latest PCCTS.
|
gvoss: |
Also, readers should know about the
series of articles written for
the JDC Web site. I think the third installment is going
out this
week. There's an interesting example there of a source
code browser--
at least as a proof-of-concept example.
Can you explain grammar terms like LALR, LL(k), and
LL(1)? I'm
sure these terms must really seem intimidating if you
haven't studied
parser technology in a formal setting.
|
parrty: |
You're right. These terms are confusing
and often difficult to explain succinctly and correctly.
People often say, "okay, I don't care what LALR or
LL parsers are, but YACC is LALR and PCCTS
is LL. What is the *difference* between the two in
practical terms?" There are three primary
differences: recognition strength, semantic flexibility
(you can embed actions anywhere you want in the grammar),
and complexity (LALR is a lot harder to understand
because it requires actual parsing theory knowledge).
Recognition strength is considered by many to be the
most important issue.
After thinking about parsers for a long time, I've
settled on the idea that strength is all about context
and lookahead. In other words, the more information
available to the parser to make decisions, the more
complicated the language it can recognize.
|
gvoss: |
Can you say a bit more about LALR and
LL(k)--what are the advantages
of one over the other, how do they work? FYI, Terence is
composing a (lengthy) answer. Give him a beer and ask
him to talk about LALR, and you're set for the evening.
This will be
the last answer posted today. |
parrty: |
Well, I'll start by bludgeoning you with
LALR-based parsers so that you
get a nice feeling when you think about LL ;)
LALR parsers are a simplified version of LR parsers
(L==parse input from left to right, R==make decisions at
the right edge). LALR's recognition strength relies upon
large amounts of context. The context of an LR or LALR
parser consists of all grammar alternatives consistent
with the previously seen input. This context often
includes several "pending" alternatives.
Intuitively, an LR parser attempts to match multiple
productions at the same time and postpones making a
decision until sufficient input has been seen.
Consider a simple grammar to illustrate
what that means.
rule : A B C
     | A B D
     ;
The capital letters are vocabulary symbols (token
references). An LALR parser consumes input symbols until
it finds a (viable) complete production; here, we can
assume all productions are viable. The parser reads input
symbols until it has "A B C", for example, at
which point it decides that it has matched the first
alternative of the rule. Similarly, for input "A B
D", the parser would not decide between alternatives
one and two until it had seen the "D". The fact
that the parser delays making decisions until right edges
is the source of the R in LR and LALR.
In contrast, the context for an LL parser is restricted
to the sequence of previously matched productions, and
the position within the current grammar production being
matched. An LL parser must make decisions about which
production to match without having seen any portion of
the pending productions--it has access to less context
information. You might say that an LALR parser asks
"What alternative does the input match?" and an
LL parser says "I'm looking to match rule foo. Is
the input consistent with that rule?" For our simple
grammar, you would say that an LL parser needs to decide
between the two alternatives without having matched any
part of either one. The parser needs to see three symbols
ahead (past the "A B" to the "D" or
"C") in order to make a correct decision--the
second L in LL comes from the fact that decisions are
made at the left-edge of alternatives.
Because LL parsers have access to less
context information, they rely heavily on
lookahead. Note that our grammar above is LALR(0) (look
ahead 0 tokens) and LL(3) (look ahead 3 tokens) as a case
in point. Parser lookahead is like going into a dark maze
that has words written on the floor at regular
intervals--given the right word sequence, an adventurer
can successfully traverse the maze. At each fork in the
maze, the adventurer must compare the input (the visible
words on the floor) with the grammar (key word sequence
allowing escape). What happens if at a fork both paths
start with the same word on the floor? You have a
nondeterminism. More lookahead (or backtracking) is
required to make a correct decision. Naturally, the
answer is to get a bigger flashlight. ;)
Tough to get a correct and readable answer on the fly,
but...
|
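To make the lookahead point concrete, here is a hand-written Java sketch of the LL(3) decision for that grammar (the token codes and lookahead buffer are simplified assumptions):

// Hand-written LL(3) decision for:
//   rule : A B C
//        | A B D
//        ;
public class LL3 {
    static final int A = 1, B = 2, C = 3, D = 4;

    static void rule(int[] tokens) {
        // Peek three tokens ahead BEFORE committing to an alternative:
        // the decision is made at the left edge, using pure lookahead.
        if (tokens[0] == A && tokens[1] == B && tokens[2] == C)
            System.out.println("matched alternative 1: A B C");
        else if (tokens[0] == A && tokens[1] == B && tokens[2] == D)
            System.out.println("matched alternative 2: A B D");
        else
            throw new RuntimeException("syntax error");
    }

    public static void main(String[] args) {
        rule(new int[]{A, B, D});
    }
}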
gvoss: |
Wow! That's some answer, Terence. I like
the analogy that you
use at the end about going through a maze with words
written on
the floor. Makes a token stream more graphic. Maybe we
could write
a game program to get kids messing with parsers like they
do by
combining blasteroids with spelling programs :)
We've gone far beyond our allotted time. Hope no one
minded too
much. :) Thanks, Terence for giving us some insight into
parsing and
its use in building custom programming tools. I'm looking
forward to
seeing the Java version of ANTLR. We'll post it to the
JDC web site
when it's ready. We'll also be posting some pretty cool
tools that
use the parser to do their magic.
Anything you want to add, Terence?
|
parrty: |
Nope. Glad to get a chance to talk about
language tools and such. I look forward to hearing from
anybody with further questions on development tools
(parrt@MageLang.com).
-- parrt signing off
|
gvoss: |
Thanks everyone for attending. See you
next week.
|
echarne: |
Sorry I'm way late, but is there a
publicly available parser for
Java?
|
gvoss: |
Yes, there is. Keep following the JDC
article series. There's one available
for download now (ANTLR) through JDC or through the URL
Terence
mentioned earlier. There is both a parser and a parser
generator
available. The current version of the parser is in C++,
so it is hard to
use with an applet. But a regular Java program can call
it as a stand-alone
executable.
Shortly, we'll post a new version of ANTLR that generates
a parser written
in Java (that also recognizes Java source). This would
work in applets
as well as programs. It also has the advantage of being
easier to integrate
with your application--be it a Java source code editor,
class browser,
or whatever. Hope that helps. So long, everyone; time for
lunch here on the West Coast. Last moderator (me) signing
off. The forum is now unmoderated. |