This document is up to date as of March 2020.
Contents
- Overview
- Related Documentation
- Building and Workflow
- LLVM API
- Other LLVM Version Support
- Desugaring Passes
- Barrier Pass
- Translation Pass
- Object Layouts
- Method Calls
- instanceof
- Arrays
- Strings
- Native Runtime Code
- Class Loading
- Control Flow Translation
- Unneeded AST Nodes
- Concurrency and Synchronization
- Debugging Tips
Overview
JLang is built as an extension to the Polyglot compiler. Since JLang is a backend only, it extends neither the parser nor the type system built into Polyglot. JLang simply adds compiler passes for desugaring and translating Java ASTs into LLVM IR.
The project also contains native code for supporting Java semantics at runtime, and support for compiling OpenJDK 7. Compiling the JDK is particularly difficult because it requires a large amount of JVM functionality (e.g., reflection), which we must implement ourselves.
Related Documentation
- The Polyglot tutorial should at least be skimmed to get an idea of how Polyglot works. For the purposes of JLang, the most important things to know about Polyglot are its type system (especially how it handles generics), its scheduler framework, its AST visitor framework, and its AST node extension framework (NodeFactory, ExtFactory, etc.).
- See the LLVM language reference manual to learn how to read and write LLVM IR.
- See the LLVM documentation homepage for links to documentation on exception handling, debug information, FAQ, optimization passes, garbage collection (if we ever want to move to a non-conservative GC), coroutines, and much more.
- The JLS should be used to implement Java language semantics closely.
- See the Java Native Interface (JNI) specification for the native interface.
- The LLVM C API should be referenced whenever writing translations to LLVM IR. The Instruction Builder module is particularly useful, since that’s used to create LLVM instructions. Please be sure to reference the correct LLVM API version, since there have been significant changes in portions of this API between 5.0, 7.0 and current mainstream releases.
Building and Workflow
Makefiles
There is a top-level makefile which uses ant to build the compiler, and then delegates recursively into other makefiles for the JDK and runtime.
The makefile in the runtime directory compiles native C++ code and a few supporting Java classes into a shared library called libjvm. The name of this library is important, because native code in OpenJDK assumes that this library exists, and that it contains the methods defined in runtime/native/jvm.cpp.
The jdk-lite directory can be used to build a minimal “bare-bones” JDK. The Java sources in jdk-lite are compiled down to LLVM IR, then linked together into a shared library arbitrarily called libjdk. This can be built with the command JDK=jdk-lite make. By default, the full OpenJDK is compiled instead.
The makefile in the jdk directory will unzip OpenJDK 7 source files, apply a small number of temporary patches that help work around unimplemented features in JLang, and then compile everything into libjdk as before. It will also put your local JDK 7 installation on the dynamic library search path of libjdk, so that JDK code has access to the native code that is part of OpenJDK 7. Note: This linking doesn’t work on all systems, and the final binary compilation of executables must also link the OpenJDK 7 native code libraries.
Note: Not every source file in the JDK is compiled, only those required to initialize the java.lang.System class and run a HelloWorld-like Java program. This comprises approximately 1500 source files, which suffices for all of our unit tests and provided example programs. Compiling the remainder of the JDK source and adding that functionality to the libjdk build is ongoing work.
The makefile in tests/isolated will compile each unit test and create an executable by linking with libjvm from the runtime and libjdk from the JDK. By default it will also run each test case and store the output in a .output file.
The Makefiles themselves are the best source of documentation for how to compile Java files with JLang, create shared libraries, and link against the native code in your local system JDK.
Scripts
The bin/jlangc script is the primary script used to launch JLang. It was originally auto-generated by Polyglot. It automatically adds classes from the runtime to the JLang classpath, which is necessary because some JLang desugar transformations refer directly to runtime classes. The result of jlangc is LLVM IR in the form of a .ll file for each compiled Java file.
The bin/plc script is intended to automate the linking part of building an executable, though it is currently out of date. Refer to the makefiles above for how to link things together.
Testing
The unit tests in tests/isolated are thorough, and should be your primary resource for checking correctness after making changes to the compiler or runtime. These tests can be run from the top-level Makefile via the make tests command.
There is also a file called expected_fails which tracks currently failing tests. The makefile uses this file to detect regressions or newly passing tests in its success/failure report when running make tests.
The makefile in tests/isolated also makes it easy to run individual tests manually from the command line. You can run commands like make Add.ll to compile just Add.java down to LLVM IR, or make Add.sol to generate the expected output using javac, or make Add.output to compile, link and run. This is currently slightly broken and needs some makefile hacking love.
LLVM API
The LLVM C API is used through a JavaCPP JNI bridge. JavaCPP is a program that essentially parses C/C++ header files and creates ready-to-use Java stubs and jar files automatically. Normally this requires some careful configuration, but someone has already done most of that work as part of javacpp-presets, a repository hosting JNI bridges for popular C++ libraries.
The LLVM C API (v5.x) is limited in that it does not have a stable API for debug information. Other languages (Go, Rust, etc.) get around this by manually creating their own C bindings. Our solution: start with the LLVM Go bindings, and create custom additional bindings as needed. This process is automated through a fork of javacpp-presets, which is tracked as a git submodule. Cloning with --depth 1 is recommended. To build, cd into the llvm subdirectory and run mvn install. This will produce the needed .jar files in the llvm/target directory. For convenience we provide up-to-date .jar files in the JLang repository directly, for OS X and Linux.
Other LLVM Version Support
The LLVM C API has changed significantly between version 5.0, 7.0 and mainline LLVM (currently 10). There is currently a branch called llvm7 dedicated to making JLang LLVM 7.0 compatible.
Due to the number of API behavioral changes, this requires new javacpp-preset jars and rewriting portions of the JLang source code to use the new APIs. This is ongoing work and ♥needs some love♥.
Desugaring Passes
There are currently several desugaring passes that run prior to translation, executed as part of the JLangDesugared scheduler goal. For example,
- The DesugarEnums pass converts Java enums into normal classes.
- The DesugarInstanceInitializers pass searches for class initializer blocks and inline field initializers, and prepends these to object constructors.
- The DesugarLocalClasses pass rewrites local classes so that captured local variables are stored as instance fields.
There are more, and they are listed in JLangDesugared. Each pass has Javadoc documentation.
There is also the special DesugarLocally pass run at the end, which gives each Java AST node a chance to desugar itself locally into something simpler. For example, try-with-resources statements are desugared within JLangTryWithResourcesExt down to normal try-catch blocks as specified by the JLS.
Barrier Pass
There is a single barrier pass which forces all desugar transformations to complete before any translation is executed. This is critical since desugar transformations can add new fields and methods, which can leave different Jobs with inconsistent representations of the same class. With the barrier, the max-runs option for Polyglot must be set fairly high; this is expected, since there will be (number of desugar passes) * (number of compilation units) outstanding runs simultaneously.
The current jdk/Makefile already takes care of this; the jdk-lite/Makefile does not, since the number of compiled classes is still fairly small.
Translation Pass
The translation pass is implemented as a Polyglot visitor. The main data structure the translator uses is a map from Java Node objects to LLVMValueRef objects. When translating a Java node, the translation for sub-nodes is retrieved using the getTranslation method.
The translations themselves are implemented in each of the JLangExt subclasses (roughly one per Java AST node). Most of these translations are documented with Javadoc or inline comments. For example:
- JLangBinaryExt translates all Java binary expressions.
- JLangIfExt translates if-statements.
- JLangClassDeclExt emits runtime type information for the current class (to be used by the runtime).
- JLangTryExt translates try-catch blocks, implementing Itanium ABI zero-cost exception handling.
In addition to traversing the AST, the translator keeps track of various states needed for translation, such as the current function, the current enclosing try-catch block, the current LLVM module, etc.
The translator also exposes various utility classes to aid translation. For example:
- LLVMUtils contains helper methods for common LLVM IR constructs such as structs, calls, and LLVM types.
- DebugInfo contains methods to construct LLVM debug information.
- ObjectStruct specifies the layout of Java objects, and provides methods for accessing fields with LLVM IR.
- DispatchVector specifies the layout of Java dispatch vectors, and provides methods for indexing into the dispatch vector based on the method desired.
- JLangMangler mangles symbol names.
Object Layouts
See ObjectStruct_c for the definitive layout of Java objects used by JLang. This layout must be kept in sync with the layouts in runtime/rep.h, which is used by native code to work with Java objects.
A Java object currently looks like this:
- Dispatch vector pointer
- Pointer to synchronization variables (e.g., mutex, condition variable)
- Field 1
- Field 2
- …
See DispatchVector_c for the definitive layout of dispatch vectors generated by JLang. These currently look like this:
- Pointer to the java.lang.Class object for this class.
- Pointer to the interface method dispatch hash table.
- Pointer to a contiguous array of super types, used for relatively fast instanceof checks.
- Inline array of function pointers, one for each instance method in this class.
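The two layouts above can be pictured as plain C++ structs. This is only a sketch for orientation: every struct and field name below is hypothetical, and the definitive definitions remain ObjectStruct_c, DispatchVector_c and runtime/rep.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// All names here are illustrative, not the ones used in runtime/rep.h.
struct JavaClass;       // stands in for a java.lang.Class instance
struct InterfaceTable;  // interface method dispatch hash table
struct SuperTypes;      // contiguous array of super type ids
struct SyncVars;        // mutex, condition variable, ...

struct DispatchVectorSketch {
    JavaClass* class_obj;
    InterfaceTable* interface_table;
    SuperTypes* super_types;
    void* methods[1];  // inline array; real length = number of instance methods
};

struct ObjectSketch {
    DispatchVectorSketch* dv;
    SyncVars* sync_vars;
    // Instance fields follow the two header words.
};

// A Java class with two int fields, laid out after the object header.
struct PointSketch {
    ObjectSketch header;
    int32_t x;
    int32_t y;
};
```

With this picture, the first instance field sits two pointer-widths into the object, and the first method slot sits three pointer-widths into the dispatch vector.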
Method Calls
Static methods are invoked directly, using an appropriately mangled symbol name.
Instance methods are invoked by indexing into the dispatch vector of the receiver using a constant index generated at compile time.
Interface methods are invoked by delegating to native runtime code in runtime/native/interface.cpp, which finds the appropriate method to call with the help of a hash generated at compile time.
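As a rough illustration of instance method dispatch, the sketch below models a dispatch vector as an array of function pointers and calls through a fixed slot index. The types, the slot numbering and the function names are all assumptions made for this example; JLang emits the equivalent logic as LLVM IR rather than C++.

```cpp
#include <cassert>

struct Object;
// Simplification: every method in this sketch takes only the receiver.
using Method = int (*)(Object*);

struct DispatchVector {
    Method methods[2];  // the header words described earlier are omitted
};

struct Object {
    const DispatchVector* dv;
    int value;
};

int base_get(Object* self) { return self->value; }
int base_twice(Object* self) { return 2 * self->value; }
int derived_twice(Object* self) { return 2 * self->value + 1; }  // an override

// A subclass reuses the parent's slots and replaces overridden entries.
const DispatchVector kBaseDv = {{base_get, base_twice}};
const DispatchVector kDerivedDv = {{base_get, derived_twice}};

// The slot index is a constant chosen at compile time.
constexpr int kTwiceSlot = 1;

int call_twice(Object* receiver) {
    return receiver->dv->methods[kTwiceSlot](receiver);
}
```

The call site never inspects the receiver's class; it only loads the function pointer in the agreed slot, so an overriding subclass is picked up automatically.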
instanceof
The following native code (from runtime/native/reflect.cpp) is used to execute an instanceof check at runtime.
extern "C" {
bool InstanceOf(jobject obj, void* type_id) {
    if (obj == nullptr)
        return false;
    type_info* type_info = Unwrap(obj)->Cdv()->SuperTypes();
    for (int32_t i = 0, end = type_info->size; i < end; ++i)
        if (type_info->super_type_ids[i] == type_id)
            return true;
    return false;
}
} // extern "C"
The function accesses the dispatch vector of obj to retrieve a table containing all super-classes and super-interfaces, and looks for an entry equal to type_id. These “type id” pointers are just the addresses of global variables generated for each compiled class. Each type id is unique because the linker ensures that different global symbols receive different addresses.
Arrays
A Java array (e.g., int[3]) is implemented as a contiguous region of memory, with one word at the beginning to point to a dispatch vector, and the next word to hold the array length. Arrays must behave as standard Java objects with respect to type information, so for simplicity arrays are implemented as a Java class (see Array.java in the runtime directory). The catch is that JLang allocates extra memory for Array instances in order to store data elements.
Arrays are packed, so that an array of chars (for example) uses only two bytes per element. The one exception is that boolean arrays use one byte per element as opposed to one bit. Packed arrays are implemented by casting the array data pointer (in LLVM IR) to the appropriate type before offsetting with an index.
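The following sketch shows one way to picture packed element access: allocate the header plus the element storage in one block, then cast the data pointer to the element type before indexing. The header struct follows the simplified two-word description above, and the function names are assumptions for this example, not the runtime's actual API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Simplified header from the description above (names are illustrative).
struct ArrayHeader {
    void* dv;        // dispatch vector pointer
    int32_t length;  // number of elements
};

// Allocates header and element storage as one zeroed, contiguous block.
ArrayHeader* alloc_array(int32_t length, std::size_t elem_size) {
    std::size_t bytes = sizeof(ArrayHeader) + static_cast<std::size_t>(length) * elem_size;
    auto* arr = static_cast<ArrayHeader*>(std::calloc(1, bytes));
    if (arr != nullptr)
        arr->length = length;
    return arr;
}

// Casts the data region to T* before indexing, so the stride is sizeof(T).
template <typename T>
T* element_ptr(ArrayHeader* arr, int32_t index) {
    char* data = reinterpret_cast<char*>(arr) + sizeof(ArrayHeader);
    return reinterpret_cast<T*>(data) + index;
}
```

A Java char array would use element_ptr<uint16_t>, giving two bytes per element, while a boolean array would use a one-byte element type.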
Strings
Strings do not require significant special handling from the compiler; they simply rely on a backing char array. The exception is that string literals are translated into global constants. The linkage for string literals is such that there will only be one copy of a given string among files that are linked together; so, "hello" == "hello" will evaluate to true.
Native Runtime Code
We use native C++ code in many parts of the runtime, including
- Converting command-line arguments to Java strings
- Calling the Java entry point
- Implementing reflection-like features such as InstanceOf
- Interface method calls
- etc.
Wherever possible, native C++ code should be preferred over handwritten or compiler-generated LLVM IR. Native code currently resides in the runtime/native directory.
For an example, consider the native code used to implement instanceof. When translating a reference to instanceof in Java source code, JLang emits a call to this native code with the correct arguments. The runtime build system is responsible for compiling runtime code into a shared library which should be linked with user programs.
The runtime is also responsible for keeping track of runtime type information, and implementing much of the functionality that the JVM would normally implement.
Class Loading
Java classes are normally loaded by the JVM just before they are used. This is also when the static initializers for the class are run.
In order to implement this behavior, we emit class loading checks before every static field access, static method call, or new instance creation. If a class has not been loaded yet, then we call its “class loading function”: a special function emitted by JLang for each class that will allocate a new java.lang.Class object, ensure the super class has been loaded, run all static initializers, and register runtime type information with native runtime code.
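The check-then-load pattern described above can be modeled as follows. This is a simplified sketch: the struct, its fields and the function names are hypothetical, allocation of the java.lang.Class object and type registration are reduced to comments, and marking the class as loaded before running its initializers is an assumption of this example.

```cpp
#include <cassert>

// Hypothetical per-class record; JLang's real bookkeeping differs.
struct ClassInfo {
    ClassInfo* super;    // nullptr for the root class
    bool loaded;
    int init_runs;       // how many times the static initializers ran
    int static_counter;  // stands in for a static field
};

// Models the per-class "class loading function".
void load_class(ClassInfo* cls) {
    cls->loaded = true;  // assumption: set first so re-entrant checks pass
    // (The real function would allocate a java.lang.Class object here.)
    if (cls->super != nullptr && !cls->super->loaded)
        load_class(cls->super);  // ensure the super class is loaded
    cls->static_counter = 42;    // "run all static initializers"
    ++cls->init_runs;
    // (The real function would register runtime type information here.)
}

// Models a static field read with the emitted class loading check.
int read_static_counter(ClassInfo* cls) {
    if (!cls->loaded)
        load_class(cls);
    return cls->static_counter;
}
```

After the first access the check reduces to a single flag test, so repeated static accesses do not rerun initializers.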
Control Flow Translation
The LLVM C API requires that code be emitted as a collection of basic blocks. The key invariant while translating control flow is as follows:
After traversing an AST subtree, all paths through the corresponding CFG end at a common block, and the instruction builder is positioned at the end of this block.
For example, an if-statement will (1) build the conditional branch, (2) position the builder at the true block, (3) recurse into the consequent child, (4) position the builder at the false block, and (5) recurse into the alternative child. After each recursion it adds a branch to the end block (unless there is already a terminating instruction). Finally, it positions the builder at the end block. See JLangIfExt to see this in action.
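The five steps can be traced with a toy model of basic blocks and an instruction builder. Nothing below uses the LLVM API: Block, Builder and emit_if are stand-ins invented for this sketch, and instructions are recorded as plain strings.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy basic block: a name, a list of instructions, and a terminator flag.
struct Block {
    std::string name;
    std::vector<std::string> insts;
    bool terminated = false;
    explicit Block(std::string n) : name(std::move(n)) {}
};

// Toy instruction builder that appends to its current block.
struct Builder {
    Block* current = nullptr;
    void position_at_end(Block* b) { current = b; }
    void emit(const std::string& inst) { current->insts.push_back(inst); }
    void ret() { emit("ret"); current->terminated = true; }
    // Adds a branch unless the block already ends in a terminator.
    void branch_to(Block* target) {
        if (current->terminated) return;
        emit("br " + target->name);
        current->terminated = true;
    }
};

// The callbacks stand in for recursing into the child AST nodes.
template <typename Then, typename Else>
void emit_if(Builder& b, Block& true_bb, Block& false_bb, Block& end_bb,
             Then consequent, Else alternative) {
    b.emit("condbr " + true_bb.name + ", " + false_bb.name);  // (1)
    b.current->terminated = true;
    b.position_at_end(&true_bb);   // (2)
    consequent(b);                 // (3)
    b.branch_to(&end_bb);
    b.position_at_end(&false_bb);  // (4)
    alternative(b);                // (5)
    b.branch_to(&end_bb);
    b.position_at_end(&end_bb);    // invariant: builder ends at the common block
}
```

Note how a branch that already ends in a return gets no extra jump, yet the builder still finishes at the end block, which is what lets the enclosing node continue emitting code without knowing what the if-statement contained.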
Unneeded AST Nodes
Some AST extensions are unneeded, either because they do not require a translation, or because they can reuse the translation of another extension. Examples are listed below.
- ArrayAccessAssign (uses AssignExt)
- LocalAssign (uses AssignExt)
- FieldAssign (uses AssignExt)
- MethodDecl (uses ProcedureDeclExt)
- Eval (the child translation suffices)
Concurrency and Synchronization
Every Java thread is backed by a native thread (pthread) after it starts. Unlike the HotSpot JVM, there is no JVM thread or runtime thread in our implementation. The Java main Thread is run by the native main thread. In order to know which Java Thread is currently executing, the current Java Thread object is stored as a thread_local variable in the runtime.
Synchronization is also implemented by pthread primitives. Every object stores a pointer to synchronization variables which contain pthread mutex and condition variable primitives. These variables are used to implement synchronized, notify, wait, etc. In addition, Java synchronized code blocks are translated into try-finally blocks to make sure the acquired monitor is always released.
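The effect of wrapping a synchronized block in try-finally can be approximated in C++ with a scope guard around a pthread mutex. This is a sketch under stated assumptions: the SyncVars and MonitorGuard types are hypothetical, and the use of a recursive mutex (to mirror the reentrancy of Java monitors) is a choice made for this example rather than a description of the runtime.

```cpp
#include <cassert>
#include <pthread.h>
#include <stdexcept>

// Hypothetical per-object synchronization variables.
struct SyncVars {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    SyncVars() {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        // Assumption: recursive, since Java monitors are reentrant.
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&mutex, &attr);
        pthread_mutexattr_destroy(&attr);
        pthread_cond_init(&cond, nullptr);
    }
    ~SyncVars() {
        pthread_cond_destroy(&cond);
        pthread_mutex_destroy(&mutex);
    }
    SyncVars(const SyncVars&) = delete;
    SyncVars& operator=(const SyncVars&) = delete;
};

// Plays the role of try { body } finally { monitor exit }.
class MonitorGuard {
public:
    explicit MonitorGuard(SyncVars& s) : sync_(s) { pthread_mutex_lock(&sync_.mutex); }
    ~MonitorGuard() { pthread_mutex_unlock(&sync_.mutex); }
    MonitorGuard(const MonitorGuard&) = delete;
    MonitorGuard& operator=(const MonitorGuard&) = delete;
private:
    SyncVars& sync_;
};

// Models: synchronized (obj) { counter++; if (fail) throw ...; }
void synchronized_increment(SyncVars& s, int& counter, bool fail) {
    MonitorGuard guard(s);
    ++counter;
    if (fail)
        throw std::runtime_error("thrown inside synchronized block");
}
```

Whether the body returns normally or throws, the guard's destructor releases the monitor, which is the guarantee the try-finally translation provides in the generated code.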
To have the garbage collector work correctly in multi-threaded code, we define the macro GC_THREADS before including gc.h, and include gc.h after pthread.h, as its documentation specifies. Note that gc.h must be included after pthread.h even if functions in gc.h are not used in the current source file.
Debugging Tips
If you have a JLang-compiled executable that crashes at runtime, the first thing to do is use lldb or gdb. With lldb you can find exactly where the program crashes, and see the source code that corresponds to each stack frame.
It is also possible to debug the program in VS Code. Install the Native Debug plugin and configure it to use lldb or gdb. A sample gdb config is provided in .vscode/launch.json.
Once you find where the program is crashing, it’s usually helpful to find the corresponding LLVM IR (within the .ll files corresponding to the Java class of interest).
Also take a look at JLang’s -dump-desugared flag, which will print Java ASTs after the desugar passes have run. This desugared output will explicitly show many of Java’s implicit language semantics (such as implicit type conversions).
If you are debugging compiler crashes or type errors while compiling OpenJDK 7, it can be helpful to try compiling with vanilla Polyglot first, as a control. There are some small differences between Polyglot and javac (especially relating to type inference), and sometimes a patch will be needed to work around quirky JDK code.