By Madhu Siddalingaiah
Reflecting on Binary Files
Reading formatted binary files can be a real chore. A typical approach is to translate the file format into a sequence of
read() calls for each primitive data type and build up an object that represents the file in a logical manner. This works, but it's very tedious and error-prone.
This article demonstrates a simple, effective parser for reading Java class files. The parser uses a grammar defined by Java classes.
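As a minimal sketch of the read()-per-primitive approach described above, the following reads only the first three fields of a class file. The class name ClassFileHeader is hypothetical and is not part of the article's framework; it assumes the standard java.io.DataInput interface, which reads big-endian values as the class file format requires.

```java
import java.io.DataInput;
import java.io.IOException;

// Hypothetical holder for the first three class file fields; the name is
// illustrative and not taken from the article's framework.
class ClassFileHeader {
    int magic;          // u4: 0xCAFEBABE in a valid class file
    int minorVersion;   // u2
    int majorVersion;   // u2

    // One explicit read call per primitive, in file order.
    static ClassFileHeader read(DataInput in) throws IOException {
        ClassFileHeader h = new ClassFileHeader();
        h.magic = in.readInt();                  // 4 bytes, big-endian
        h.minorVersion = in.readUnsignedShort(); // 2 bytes, unsigned
        h.majorVersion = in.readUnsignedShort(); // 2 bytes, unsigned
        return h;
    }
}
```

Wrapping a FileInputStream in a DataInputStream and passing it to ClassFileHeader.read() would fill in the header. Every further field in the format needs another hand-written line like these.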
Many years ago I developed a framework for reading and writing Java class files, similar to
the Bytecode Engineering Library (BCEL). A Java class file is a binary file whose format is described in the
Java Virtual Machine Specification.
To get an idea of the structure of a class file, here's the high-level
description of the contents:
ClassFile {
    u4 magic;
    u2 minor_version;
    u2 major_version;
    u2 constant_pool_count;
    cp_info constant_pool[constant_pool_count-1];
    u2 access_flags;
    u2 this_class;
    u2 super_class;
    u2 interfaces_count;
    u2 interfaces[interfaces_count];
    u2 fields_count;
    field_info fields[fields_count];
    u2 methods_count;
    method_info methods[methods_count];
    u2 attributes_count;
    attribute_info attributes[attributes_count];
}
The types u1, u2, and u4 represent unsigned 1-, 2-, and 4-byte quantities. These, along with 8-byte quantities for long and double types, represent the primitives within a class file. In addition to primitives,
class files also support homogeneous aggregate types (arrays) and heterogeneous aggregate types (structures). This is very similar to the type model supported by the Java language. In fact, a Java class could almost exactly represent the binary class file structure.
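To illustrate how closely the two type models line up, here is a sketch of the ClassFile structure written as a plain Java class. The name ClassFileModel and the nested stub classes are hypothetical, and using int for u2 and u4 values is an assumption made because Java has no unsigned 16- or 32-bit integer types.

```java
// Illustrative mirror of the ClassFile structure; all names are hypothetical.
class ClassFileModel {
    // Stubs standing in for the nested structures; their own fields are omitted.
    static class CpInfo { }
    static class FieldInfo { }
    static class MethodInfo { }
    static class AttributeInfo { }

    int magic;                  // u4 (the 32-bit pattern is kept in a signed int)
    int minorVersion;           // u2
    int majorVersion;           // u2
    int constantPoolCount;      // u2
    CpInfo[] constantPool;      // constant_pool_count - 1 entries
    int accessFlags;            // u2
    int thisClass;              // u2
    int superClass;             // u2
    int interfacesCount;        // u2
    int[] interfaces;           // u2 entries, interfaces_count of them
    int fieldsCount;            // u2
    FieldInfo[] fields;         // fields_count entries
    int methodsCount;           // u2
    MethodInfo[] methods;       // methods_count entries
    int attributesCount;        // u2
    AttributeInfo[] attributes; // attributes_count entries
}
```

The match is only "almost" exact: a Java field declaration cannot say that an array's length comes from an earlier count field, so that relationship survives here only as comments.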
My initial approach started as a straightforward class model, one class for each structure. For example, there were classes for ClassFile, CPInfo, FieldInfo and so on. Each class had a
read() method, which read all of the fields for that class. This worked, but
I had to code every read operation for every field in every class. It doesn't look like much, but it turned out to be really tedious and error-prone.
More important, reading another format meant I'd have to start over from scratch.
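The one-class-per-structure approach might have looked roughly like the sketch below. It is not the original framework's code. The field layouts follow the field_info and attribute_info definitions in the Java Virtual Machine Specification, while the class name AttributeInfo and the exact read() signature are assumptions made for illustration.

```java
import java.io.DataInput;
import java.io.IOException;

// Hypothetical class for attribute_info: u2 name index, u4 length, u1 info[length].
class AttributeInfo {
    int attributeNameIndex;
    int attributeLength;
    byte[] info;

    void read(DataInput in) throws IOException {
        attributeNameIndex = in.readUnsignedShort();
        attributeLength = in.readInt();
        info = new byte[attributeLength];
        in.readFully(info);               // raw attribute bytes, left uninterpreted
    }
}

// Hypothetical class for field_info: three u2 values, a count, then attributes.
class FieldInfo {
    int accessFlags;
    int nameIndex;
    int descriptorIndex;
    int attributesCount;
    AttributeInfo[] attributes;

    void read(DataInput in) throws IOException {
        accessFlags = in.readUnsignedShort();
        nameIndex = in.readUnsignedShort();
        descriptorIndex = in.readUnsignedShort();
        attributesCount = in.readUnsignedShort();
        attributes = new AttributeInfo[attributesCount];
        for (int i = 0; i < attributesCount; i++) {
            attributes[i] = new AttributeInfo();
            attributes[i].read(in);       // each nested structure reads itself
        }
    }
}
```

Each read() method restates, line by line, the same layout the field declarations already describe. That duplication is where the tedium and the errors come from.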
If you have experience with compiler-compiler tools such as Yacc, Bison, or ANTLR, it might
occur to you that it's much simpler to specify the structure of a class file as a grammar. Any of these tools can read a grammar and generate the source code and tables needed to parse input that matches the grammar. In fact, Andreas Rueckert has developed an ANTLR grammar for reading Java class files.
ANTLR is designed to support a large class of grammars, known as LL(k) grammars. Most programming languages can be described and parsed using LL(k) grammars. Java class files, like most binary files, are not nearly as complicated as typical programming languages, so a high-powered tool
such as ANTLR is probably overkill. It seemed a simplified parser could do the job
and, more important, would avoid having to generate code. Generating code is not
a big deal, but it's more work and one more step in the build process that I would prefer to avoid.