JLang: Developer Guide

This document is up to date as of March 2020.

Overview
Related Documentation
Building and Workflow
LLVM API
Other LLVM Version Support
Desugaring Passes
Barrier Pass
Translation Pass
Object Layouts
Method Calls
instanceof
Arrays
Strings
Native Runtime Code
Class Loading
Control Flow Translation
Unneeded AST Nodes
Concurrency and Synchronization
Debugging Tips

Overview

JLang is built as an extension to the Polyglot compiler. Since JLang is a backend only, it extends neither the parser nor the type system built into Polyglot. JLang simply adds compiler passes for desugaring and translating Java ASTs into LLVM IR.

The project also contains native code for supporting Java semantics at runtime, and support for compiling OpenJDK 7. Compiling the JDK is particularly difficult because it requires a large amount of JVM functionality (e.g., reflection), which we must implement ourselves.

Related Documentation

The Polyglot tutorial should at least be skimmed to get an idea of how Polyglot works. For the purposes of JLang, the most important things to know about Polyglot are its type system (especially how it handles generics), its scheduler framework, its AST visitor framework, and its AST node extension framework (NodeFactory, ExtFactory, etc.).
See the LLVM language reference manual to learn how to read and write LLVM IR.
See the LLVM documentation homepage for links to documentation on exception handling, debug information, FAQ, optimization passes, garbage collection (if we ever want to move to a non-conservative GC), coroutines, and much more.
The JLS should be used to implement Java language semantics closely.
See the Java Native Interface (JNI) specification for the native interface that the runtime must support.
The LLVM C API should be referenced whenever writing translations to LLVM IR. The Instruction Builder module is particularly useful, since that’s used to create LLVM instructions. Please be sure to reference the correct LLVM API version, since there have been significant changes in portions of this API between 5.0, 7.0, and current mainstream releases.

Building and Workflow

Makefiles

There is a top-level makefile which uses ant to build the compiler, and then delegates recursively into other makefiles for the JDK and runtime.

The makefile in the runtime directory compiles native C++ code and a few supporting Java classes into a shared library called libjvm. The name of this library is important, because native code in OpenJDK assumes that this library exists, and that it contains the methods defined in runtime/native/jvm.cpp.

The jdk-lite directory can be used to build a minimal “bare-bones” JDK. The Java sources in jdk-lite are compiled down to LLVM IR, then linked together into a shared library arbitrarily called libjdk. This can be built with the command JDK=jdk-lite make.

By default, the full OpenJDK is compiled instead. The makefile in the jdk directory will unzip OpenJDK 7 source files, apply a small number of temporary patches that help work around unimplemented features in JLang, and then compile everything into libjdk as before. It will also put your local JDK 7 installation on the dynamic library search path of libjdk, so that JDK code has access to the native code that is part of OpenJDK 7. Note: This linking doesn’t work on all systems, and the final link step for executables must also link against the OpenJDK 7 native code libraries.

Note: Not every source file in the JDK is compiled, only those required to initialize the java.lang.System class and run a HelloWorld-like Java program. This comprises approximately 1500 source files, which suffices for all of our unit tests and provided example programs. It is ongoing work to compile the remainder of the JDK source and add that functionality to the libjdk build.

The makefile in tests/isolated will compile each unit test and create an executable by linking with libjvm from the runtime and libjdk from the JDK. By default it will also run each test case and store the output in a .output file.

The Makefiles themselves are the best source of documentation for how to compile Java files with JLang, create shared libraries, and link against the native code in your local system JDK.

Scripts

The bin/jlangc script is the primary script used to launch JLang. It was originally auto-generated by Polyglot. It automatically adds classes from the runtime to the JLang classpath, which is necessary because some JLang desugar transformations refer directly to runtime classes. The output of jlangc is LLVM IR in the form of a .ll file for each compiled Java file.

The bin/plc script is intended to automate the linking part of building an executable, though it is currently out of date. Refer to the makefiles above for how to link things together.

Testing

The unit tests in tests/isolated are thorough, and should be your primary resource for checking correctness after making changes to the compiler or runtime. These tests can be run from the top-level Makefile via the make tests command. A file called expected_fails tracks the currently failing tests; the makefile uses it to report regressions and newly passing tests in the success/failure summary printed by make tests.

The makefile in tests/isolated also makes it easy to run individual tests manually from the command line. You can run commands like make Add.ll to compile just Add.java down to LLVM IR, or make Add.sol to generate the expected output using javac, or make Add.output to compile, link, and run. This workflow is currently slightly broken and needs some makefile hacking love.

LLVM API

The LLVM C API is used through a JavaCPP JNI bridge. JavaCPP is a program that essentially parses C/C++ header files and creates ready-to-use Java stubs and jar files automatically. Normally this requires some careful configuration, but someone has already done most of that work as part of javacpp-presets, a repository hosting JNI bridges for popular C++ libraries.

The LLVM C API (v5.x) is limited in that it does not have a stable API for debug information. Other languages (Go, Rust, etc.) get around this by manually creating their own C bindings. Our solution: start with the LLVM Go bindings, and create custom additional bindings as needed. This process is automated through a fork of javacpp-presets, which is tracked as a git submodule. Cloning with --depth 1 is recommended. To build, cd into the llvm subdirectory and run mvn install. This will produce the needed .jar files in the llvm/target directory. For convenience we provide up-to-date .jar files in the JLang repository directly, for OS X and Linux.

Other LLVM Version Support

The LLVM C API has changed significantly between versions 5.0, 7.0, and mainline LLVM (currently 10). There is currently a branch called llvm7 dedicated to making JLang compatible with LLVM 7.0.

Due to the number of API behavioral changes, this requires new javacpp-presets jars and rewriting portions of the JLang source code to use the new APIs. This is ongoing work and ♥needs some love♥.

Desugaring Passes

There are currently several desugaring passes that run prior to translation, executed as part of the JLangDesugared scheduler goal. For example:

The DesugarEnums pass converts Java enums into normal classes.
The DesugarInstanceInitializers pass searches for class initializer blocks and inline field initializers, and prepends these to object constructors.
The DesugarLocalClasses pass rewrites local classes so that captured local variables are stored as instance fields.

There are more, and they are listed in JLangDesugared. Each pass has Javadoc documentation.

There is also the special DesugarLocally pass, run at the end, which gives each Java AST node a chance to desugar itself locally into something simpler. For example, try-with-resources statements are desugared within JLangTryWithResourcesExt down to normal try-catch blocks as specified by the JLS.

Barrier Pass

There is a single barrier pass which forces all Desugar transformations to complete before any translation is executed. This is critical since Desugar transformations can add new fields and methods, which can otherwise leave different Jobs' representations of Class objects in an inconsistent state. With the barrier, the max-runs option for Polyglot must be set fairly high; this is expected, since there will be (# of Desugar passes) * (# of compilation units) outstanding runs simultaneously. The current jdk/Makefile already takes care of this; the jdk-lite/Makefile does not, since the number of compiled classes is still fairly small.

Translation Pass

The translation pass is implemented as a Polyglot visitor. The main data structure the translator uses is a map from Java Node objects to LLVMValueRef objects. When translating a Java node, the translation for sub-nodes is retrieved using the getTranslation method.

The translations themselves are implemented in each of the JLangExt subclasses (roughly one per Java AST node). Most of these translations are documented with Javadoc or inline comments. For example:

JLangBinaryExt translates all Java binary expressions.
JLangIfExt translates if-statements.
JLangClassDeclExt emits runtime type information for the current class (to be used by the runtime).
JLangTryExt translates try-catch blocks, implementing Itanium ABI zero-cost exception handling.

In addition to traversing the AST, the translator keeps track of the state needed for translation, such as the current function, the current enclosing try-catch block, the current LLVM module, etc.

The translator also exposes various utility classes to aid translation. For example:

LLVMUtils contains helper methods for common LLVM IR constructs such as structs, calls, and LLVM types.
DebugInfo contains methods to construct LLVM debug information.
ObjectStruct specifies the layout of Java objects, and provides methods for accessing fields with LLVM IR.
DispatchVector specifies the layout of Java dispatch vectors, and provides methods for indexing into the dispatch vector based on the method desired.
JLangMangler mangles symbol names.

Object Layouts

See ObjectStruct_c for the definitive layout of Java objects used by JLang. This layout must be kept in sync with the layouts in runtime/rep.h, which is used by native code to work with Java objects.

A Java object currently looks like this:

Dispatch vector pointer
Pointer to synchronization variables (e.g., mutex, condition variable)
Field 1
Field 2
…

See DispatchVector_c for the definitive layout of dispatch vectors generated by JLang. These currently look like this:

Pointer to the java.lang.Class object for this class.
Pointer to the interface method dispatch hash table.
Pointer to a contiguous array of super types, used for relatively fast instanceof checks.
Inline array of function pointers, one for each instance method in this class.
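
As a rough illustration, the two layouts above could be written as C++ structs along the following lines. All struct, field, and function names here (SketchObject, SketchDispatchVector, MethodSlotOffset, and so on) are hypothetical and chosen only for this sketch; the definitive definitions are in ObjectStruct_c, DispatchVector_c, and runtime/rep.h, and may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names throughout; see runtime/rep.h for the real definitions.
struct SketchSuperTypes {
    int32_t size;           // number of entries in super_type_ids
    void** super_type_ids;  // type ids of the super types
};

struct SketchDispatchVector {
    void* class_obj;                // java.lang.Class object for this class
    void* interface_table;          // interface method dispatch hash table
    SketchSuperTypes* super_types;  // used for instanceof checks
    void* methods[1];               // inline function pointers; real length varies per class
};

struct SketchSyncVars;  // mutex, condition variable, etc. (left opaque here)

struct SketchObject {
    SketchDispatchVector* dv;   // word 0: dispatch vector pointer
    SketchSyncVars* sync_vars;  // word 1: synchronization variables
    // Instance fields (Field 1, Field 2, ...) follow the header.
};

static_assert(sizeof(SketchObject) == 2 * sizeof(void*),
              "the object header is two pointer-sized words in this sketch");

// Byte offset of the first instance field, assuming fields start right after the header.
inline std::size_t FirstFieldOffset() { return sizeof(SketchObject); }

// Byte offset of method slot `index` in a dispatch vector with `num_methods` slots.
inline std::size_t MethodSlotOffset(std::size_t index, std::size_t num_methods) {
    assert(index < num_methods);
    return offsetof(SketchDispatchVector, methods) + index * sizeof(void*);
}
```

Because native code and compiler-generated IR both compute offsets like these independently, any change to one side has to be mirrored on the other.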

Method Calls

Static methods are invoked directly, using an appropriately mangled symbol name.

Instance methods are invoked by indexing into the dispatch vector of the receiver using a constant index generated at compile time.
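
The following is a minimal sketch of what such an instance method call amounts to at runtime: load the dispatch vector from the receiver, load the function pointer at a fixed slot, and call it indirectly. The types, slot constants, and function names are hypothetical and simplified (a single method signature, no class or interface pointers in the dispatch vector); JLang emits the equivalent logic as LLVM IR rather than C++.

```cpp
#include <cassert>
#include <cstdint>

struct Obj;
using Method = int32_t (*)(Obj* self);  // hypothetical signature: int m()

struct DispatchVector {
    Method methods[2];  // one slot per instance method
};

struct Obj {
    const DispatchVector* dv;  // first word of every object
    int32_t value;
};

// Slot indices are fixed at compile time. A subclass reuses the slot of a
// method it overrides, so call sites never need to know the dynamic type.
constexpr int kGetSlot = 0;
constexpr int kTwiceSlot = 1;

int32_t Base_get(Obj* self) { return self->value; }
int32_t Base_twice(Obj* self) { return 2 * self->value; }
int32_t Derived_get(Obj* self) { return self->value + 100; }  // overrides get()

const DispatchVector kBaseDv = {{Base_get, Base_twice}};
const DispatchVector kDerivedDv = {{Derived_get, Base_twice}};

// What a call site for `receiver.m()` boils down to.
int32_t CallVirtual(Obj* receiver, int slot) {
    assert(receiver != nullptr);  // Java semantics require a NullPointerException here
    return receiver->dv->methods[slot](receiver);
}
```

With `Obj d{&kDerivedDv, 7}`, calling `CallVirtual(&d, kGetSlot)` runs `Derived_get`, while `CallVirtual(&d, kTwiceSlot)` runs the inherited `Base_twice`.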

Interface methods are invoked by delegating to native runtime code in runtime/native/interface.cpp, which finds the appropriate method to call with the help of a hash generated at compile time.

instanceof

The following native code (from runtime/native/reflect.cpp) is used to execute an instanceof check at runtime.

extern "C" {

bool InstanceOf(jobject obj, void* type_id) {
    if (obj == nullptr)
        return false;
    type_info* type_info = Unwrap(obj)->Cdv()->SuperTypes();
    for (int32_t i = 0, end = type_info->size; i < end; ++i)
        if (type_info->super_type_ids[i] == type_id)
            return true;
    return false;
}

} // extern "C"

The function accesses the dispatch vector of obj to retrieve a table containing all super-classes and super-interfaces, and scans it for an entry equal to type_id. These “type id” pointers are just the addresses of global variables generated for each compiled class. Each type id is unique because the linker ensures that different global symbols receive different addresses.
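
To make the "address as type id" idea concrete, here is a small self-contained sketch. The global names and the SuperTypes struct are hypothetical stand-ins for what the compiler emits, and the sketch assumes that a class's table also lists the class itself.

```cpp
#include <cassert>
#include <cstdint>

// One global per class; only its address matters. Names are hypothetical.
char Object_type_id;
char Comparable_type_id;
char String_type_id;
char Integer_type_id;

struct SuperTypes {
    int32_t size;
    void* const* ids;
};

// Table for String: itself (an assumption of this sketch), Object, Comparable.
void* const kStringSupers[] = {&String_type_id, &Object_type_id, &Comparable_type_id};
const SuperTypes kStringTypes = {3, kStringSupers};

// Linear scan over the table, comparing addresses only.
bool IsSubtype(const SuperTypes* types, void* type_id) {
    assert(types != nullptr);
    for (int32_t i = 0; i < types->size; ++i)
        if (types->ids[i] == type_id)
            return true;
    return false;
}
```

No string comparison or registry lookup is needed: two classes have the same type id exactly when they refer to the same global symbol.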

Arrays

A Java array (e.g., int[3]) is implemented as a contiguous region of memory, with one word at the beginning to point to a dispatch vector, and the next word to hold the array length. Arrays must behave as standard Java objects with respect to type information, so for simplicity arrays are implemented as a Java class (see Array.java in the runtime directory). The catch is that JLang allocates extra memory for Array instances in order to store data elements.

Arrays are packed, so that an array of chars (for example) uses only two bytes per element. The one exception is that boolean arrays use one byte per element as opposed to one bit. Packed arrays are implemented by casting the array data pointer (in LLVM IR) to the appropriate type before offsetting with an index.
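
A minimal sketch of the packing scheme, assuming a simplified header with just a dispatch vector pointer and a length. ArrayHeader, NewArray, and ElementAddr are hypothetical names; the real layout is determined by Array.java and the object layout code, and JLang performs the pointer cast in LLVM IR rather than C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Simplified, hypothetical header: dispatch vector pointer, then length.
struct ArrayHeader {
    void* dv;
    int32_t length;
};

// Allocates the header plus `length` packed elements of `elem_size` bytes each,
// zero-initialized as Java requires. Returns nullptr if allocation fails.
ArrayHeader* NewArray(int32_t length, std::size_t elem_size) {
    assert(length >= 0);
    std::size_t bytes = sizeof(ArrayHeader) + static_cast<std::size_t>(length) * elem_size;
    ArrayHeader* arr = static_cast<ArrayHeader*>(std::calloc(1, bytes));
    if (arr != nullptr) arr->length = length;
    return arr;
}

// Cast the raw data pointer to the element type first, then offset by the index,
// so that the stride is sizeof(T) rather than a fixed word size.
template <typename T>
T* ElementAddr(ArrayHeader* arr, int32_t index) {
    assert(index >= 0 && index < arr->length);  // Java would throw on a bad index
    T* data = reinterpret_cast<T*>(reinterpret_cast<char*>(arr) + sizeof(ArrayHeader));
    return data + index;
}
```

For a char array, `ElementAddr<uint16_t>(arr, 1)` is exactly two bytes past `ElementAddr<uint16_t>(arr, 0)`.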

Strings

Strings do not require significant special handling from the compiler; they simply rely on a backing char array. The exception is that string literals are translated into global constants. The linkage for string literals is such that there will only be one copy of a given string among files that are linked together; so, "hello" == "hello" will evaluate to true.

Native Runtime Code

We use native C++ code in many parts of the runtime, including

Converting command-line arguments to Java strings
Calling the Java entry point
Implementing reflection-like features such as InstanceOf
Interface method calls
etc.

Wherever possible, native C++ code should be preferred over handwritten or compiler-generated LLVM IR. Native code currently resides in the runtime/native directory.

For an example, consider the native code used to implement instanceof. When translating a reference to instanceof in Java source code, JLang emits a call to this native code with the correct arguments. The runtime build system is responsible for compiling runtime code into a shared library which should be linked with user programs.

The runtime is also responsible for keeping track of runtime type information, and implementing much of the functionality that the JVM would normally implement.

Class Loading

Java classes are normally loaded by the JVM just before they are used. This is also when the static initializers for the class are run. In order to implement this behavior, we emit class loading checks before every static field access, static method call, or new instance creation. If a class has not been loaded yet, then we call its “class loading function”: a special function emitted by JLang for each class that will allocate a new java.lang.Class object, ensure the super class has been loaded, run all static initializers, and register runtime type information with native runtime code.
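
The guard-then-load pattern can be sketched as follows. Everything here is hypothetical and simplified: ClassState, LoadClass, and the globals are illustration-only names, allocation of the java.lang.Class object and type registration are reduced to a comment, and thread safety is ignored. Marking the class as loaded before running its initializers is a choice made in this sketch to avoid infinite recursion, not a statement about the generated code.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct ClassState {
    const char* name;
    ClassState* super;      // nullptr for the root class
    bool loaded;
    void (*static_init)();  // runs the static initializers; may be nullptr
};

std::vector<std::string> g_load_order;  // for illustration only

// Sketch of a per-class "class loading function".
void LoadClass(ClassState* cls) {
    assert(cls != nullptr);
    if (cls->loaded) return;
    cls->loaded = true;  // set early so references during initialization do not recurse forever
    // Real code would allocate the java.lang.Class object here.
    if (cls->super != nullptr) LoadClass(cls->super);
    if (cls->static_init != nullptr) cls->static_init();
    // Real code would also register runtime type information with the runtime.
    g_load_order.push_back(cls->name);
}

int g_counter = 0;  // stands in for a static field Derived.counter
void Derived_static_init() { g_counter = 42; }

ClassState g_base = {"Base", nullptr, false, nullptr};
ClassState g_derived = {"Derived", &g_base, false, Derived_static_init};

// A read of the static field, preceded by the kind of check the compiler emits.
int ReadDerivedCounter() {
    if (!g_derived.loaded) LoadClass(&g_derived);
    return g_counter;
}
```

The first call to `ReadDerivedCounter` loads Base and then Derived; later calls only pay for the flag check.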

Control Flow Translation

The LLVM C API requires that code be emitted as a collection of basic blocks. The key invariant while translating control flow is as follows:

After traversing an AST subtree, all paths through the corresponding CFG end at a common block, and the instruction builder is positioned at the end of this block.

For example, an if-statement will (1) build the conditional branch, (2) position the builder at the true block, (3) recurse into the consequent child, (4) position the builder at the false block, and (5) recurse into the alternative child. After each recursion it adds a branch to the end block (unless there is already a terminating instruction). Finally, it positions the builder at the end block. See JLangIfExt to see this in action.
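
The sketch below mimics those five steps with a toy builder, to show how the invariant composes when if-statements are nested. Block, Builder, Stmt, and Translate are hypothetical stand-ins written for this example; they are not the LLVM C API and not JLang's translator, and "instructions" are just strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins for LLVM basic blocks and the instruction builder.
struct Block {
    std::string name;
    std::vector<std::string> insts;

    bool Terminated() const {
        if (insts.empty()) return false;
        const std::string& last = insts.back();
        return last.compare(0, 3, "br ") == 0 || last.compare(0, 3, "ret") == 0;
    }
};

struct Builder {
    std::vector<Block> blocks;
    int current = -1;

    int NewBlock(const std::string& prefix) {
        int id = static_cast<int>(blocks.size());
        Block bb;
        bb.name = prefix + "." + std::to_string(id);
        blocks.push_back(bb);
        return id;
    }
    void PositionAtEnd(int id) { current = id; }
    void Emit(const std::string& inst) {
        assert(current >= 0);
        blocks[current].insts.push_back(inst);
    }
};

struct Stmt {
    enum Kind { kSimple, kIf } kind;
    std::string text;       // kSimple: the instruction; kIf: the condition value
    const Stmt* then_stmt;  // kIf only
    const Stmt* else_stmt;  // kIf only; may be nullptr
};

void Translate(Builder& b, const Stmt& s) {
    if (s.kind == Stmt::kSimple) {
        b.Emit(s.text);
        return;
    }
    int then_bb = b.NewBlock("if.true");
    int else_bb = b.NewBlock("if.false");
    int end_bb = b.NewBlock("if.end");
    b.Emit("br " + s.text + ", " + b.blocks[then_bb].name + ", " + b.blocks[else_bb].name);  // (1)
    const Stmt* children[2] = {s.then_stmt, s.else_stmt};
    int targets[2] = {then_bb, else_bb};
    for (int i = 0; i < 2; ++i) {
        b.PositionAtEnd(targets[i]);                             // (2) and (4)
        if (children[i] != nullptr) Translate(b, *children[i]);  // (3) and (5)
        // After recursion the builder may sit in a nested end block, not targets[i].
        if (!b.blocks[b.current].Terminated())
            b.Emit("br " + b.blocks[end_bb].name);
    }
    b.PositionAtEnd(end_bb);  // re-establish the invariant for the caller
}
```

Note that the branch to the end block is emitted at the builder's current position rather than at the original true or false block. This is what makes nesting work: an inner if leaves the builder in its own end block, and the outer if simply continues from there.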

Unneeded AST Nodes

Some AST extensions are unneeded, either because they do not require a translation, or because they can reuse the translation of another extension. Examples are listed below.

ArrayAccessAssign (uses AssignExt)
LocalAssign (uses AssignExt)
FieldAssign (uses AssignExt)
MethodDecl (uses ProcedureDeclExt)
Eval (the child translation suffices)

Concurrency and Synchronization

Every Java thread is backed by a native thread (pthread) after it starts. Unlike the HotSpot JVM, there is no separate JVM thread or runtime thread in our implementation. The Java main Thread is run by the native main thread. To know which Java Thread is currently executing, the runtime stores the current Java Thread object in a thread_local variable.

Synchronization is also implemented with pthread primitives. Every object stores a pointer to synchronization variables, which contain a pthread mutex and a condition variable. These variables are used to implement synchronized, notify, wait, etc. In addition, Java synchronized code blocks are translated into try-finally blocks to make sure the acquired monitor is always released.
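
The try-finally shape of a synchronized block can be sketched in C++ as below. SyncVars, MonitorEnter, MonitorExit, and Synchronized are hypothetical names for this example, and the sketch makes simplifying assumptions: it uses a default (non-reentrant) pthread mutex, whereas Java monitors are reentrant, and the depth counter exists only to make the behavior observable.

```cpp
#include <cassert>
#include <pthread.h>
#include <stdexcept>

// Hypothetical per-object synchronization variables.
struct SyncVars {
    pthread_mutex_t mutex;
    pthread_cond_t cond;  // would back wait/notify; unused in this sketch
    int depth;            // bookkeeping for this sketch only
};

void InitSyncVars(SyncVars* s) {
    pthread_mutex_init(&s->mutex, nullptr);
    pthread_cond_init(&s->cond, nullptr);
    s->depth = 0;
}

void MonitorEnter(SyncVars* s) {
    pthread_mutex_lock(&s->mutex);
    ++s->depth;
}

void MonitorExit(SyncVars* s) {
    assert(s->depth > 0);
    --s->depth;
    pthread_mutex_unlock(&s->mutex);
}

// Equivalent of: synchronized (obj) { body(); }
// C++ has no finally, so the release is written on both the normal
// and the exceptional path.
template <typename Body>
void Synchronized(SyncVars* s, Body body) {
    MonitorEnter(s);
    try {
        body();
    } catch (...) {
        MonitorExit(s);
        throw;
    }
    MonitorExit(s);
}
```

If the body throws, the monitor is released before the exception propagates, so a later synchronized block on the same object does not deadlock.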

To have the garbage collector work correctly in multi-threaded code, we define a macro variable GC_THREADS before including gc.h but after pthread.h, as its documentation specifies. Note that gc.h must be included after pthread.h even if functions in gc.h are not used in the current source file.

Debugging Tips

If you have a JLang-compiled executable that crashes at runtime, the first thing to do is use lldb or gdb. With lldb you can find exactly where the program crashes, and see the source code that corresponds to each stack frame.

It is also possible to debug the program in vscode. Install the Native Debug plugin and configure it to use lldb or gdb. A sample gdb config is provided in .vscode/launch.json.

Once you find where the program is crashing, it’s usually helpful to find the corresponding LLVM IR (within the .ll files corresponding to the Java class of interest).

Also take a look at JLang’s -dump-desugared flag, which will print Java ASTs after the desugar passes have run. This desugared output will explicitly show many of Java’s implicit language semantics (such as implicit type conversions).

If you are debugging compiler crashes or type errors while compiling OpenJDK 7, it can be helpful to try compiling with vanilla Polyglot first, as a control. There are some small differences between Polyglot and javac (especially relating to type inference), and sometimes a patch will be needed to work around quirky JDK code.
