Tuesday, 9 August 2011

Java Compilation Process

Java is "semi-interpreted" language and it differs from C/C++ and the process described above. What do we mean by "semi-interpreted" language? Java programs execute in the Java Virtual Machine (or JVM), which makes it an interpreted language. On the other hand Java unlike pure interpreted languages passes through an intermediate compilation step. Java code does not compile to native code that the operating system executes on the CPU, rather the result of Java program compilation is intermediate bytecode. This bytecode runs in the virtual machine. Let us take a look at the process through which the source code is turned into executable code and the execution of it.


Figure : The Java Compile/Execute Path
The Java Compile/Execute Path

Java requires each class to be placed in its own source file, named with the same name as the class name and added suffix .java. This basicaly forces any medium sized program to be split in several source files. When compiling source code, each class is placed in its own .class file that contains the bytecode. The java compiler differs from gcc/g++ in the fact that if the class you are compiling is dependent on a class that is not compiled or is modified since it was last compiled, it will compile those additional classes for you. It acts similarly to make, but is nowhere close to it. After compiling all source files, the result will be at least as much class files as the sources, which will combine to form your Java program. This is where the class loader comes into picture along with the bytecode verifier - two unique steps that distinguish Java from languages like C/C++.

The class loader is responsible for loading each class' bytecode. Java provides developers with the opportunity to write their own class loader, which gives developers great flexibility. One can write a loader that fetches the class from everywhere, even IRC DCC connection. Now let us look at the steps a loader takes to load a class.

When a class is needed by the JVM the loadClass(String name, boolean resolve); method is called passing the class name to be loaded. Once it finds the file that contains the bytecode for the class, it is read into memory and passed to the defineClass. If the class is not found by the loader, it can delegate the loading to a parent class loader or try to use findSystemClass to load the class from local filesystem. The Java Virtual Machine Specification is vague on the subject of when and how the ByteCode verifier is invoked, but by a simple test we can infer that the defineClass performs the bytecode verification. (FIXME maybe show the test). The verifier does four passes over the bytecode to make sure it is safe. After the class is successfully verified, its loading is completed and it is available for use by the runtime.

The nature of the Java bytecode allows people to easily decompile class files to source. In the case where default compilation is performed, even variable and method names are recovered. There are bunch of decompilers out there, but a free one that works well is Jad. FIXME: add a link. Because of this we will not discuss how to reverese engineer software written in Java.