Hail Native Code Generation
Richard Cownie, Hail Team, 2018-03-22
Introduction
This document sketches initial steps towards generating native-x86_64
code for performance-critical operations in Hail.
With a small team and a substantial existing Scala codebase, we need
to evolve the codebase piece-by-piece, so for a considerable time we
expect to have a mixture of Scala, dynamically-generated JVM bytecode,
compiled C++, and dynamically-generated x86_64 code. The TStruct/TArray/
RegionValue infrastructure should provide a good basis for all flavors of
code to interoperate.
C++ source code vs low-level interface (LLVM IR ?)
One major design decision is whether to generate native (x86_64) code
through some kind of compiler-IR representation (e.g. LLVM IR), or to
simply generate C++ source code and compile it. Experience with both
approaches suggests that dynamically-generated C++ gives good productivity,
an easy way to combine static-compiled and template libraries with
dynamically-generated code, and excellent runtime performance, but with
some risk of noticeable compilation delays. I propose that we should
use this approach, but try to reduce or eliminate the impact of compilation
time in several ways:
a) Caching of compiled object-code files with ccache, which can maintain
a persistent cache so that if you do similar operations across several
Hail sessions, the code may only be compiled once.
b) Where it isn’t critical to performance, constant values can be hoisted
out of the compiled code into dynamically-passed parameters, which
then allows the compiled code to be cached and re-used with different
parameter values (a sketch follows this list).
c) Codegen/compilation can be started eagerly in the background, thus
overlapping with time taken for the data processing of earlier
operators/stages. [Aside: in the big picture, I think we probably
want to keep the Scala/JVM code single-threaded, but allow multiple
threads in the C++ in situations where it offers a compelling performance
advantage without too much complexity of data-sharing].
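As a rough sketch of (b) (all names here are hypothetical), the Scala codegen could emit the filter threshold as a runtime parameter rather than a literal, and key the compilation cache on the generated source text alone:

object FilterCodegen {
  // The generated C++ reads one long field from a row and compares it against a
  // threshold passed at call time, instead of a constant baked into the code.
  def emitFilterSource(fieldOffset: Int): String =
    s"""extern "C" long hail_filter_gt(const char* row, long threshold) {
       |  long value = *reinterpret_cast<const long*>(row + $fieldOffset);
       |  return value > threshold ? 1 : 0;
       |}""".stripMargin

  // The cache key depends only on the source text, so pipelines that differ only
  // in the threshold constant share one compiled library.
  def sourceCacheKey(source: String): String =
    java.security.MessageDigest.getInstance("MD5")
      .digest(source.getBytes("UTF-8"))
      .map("%02x".format(_)).mkString
}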
And of course the big win of using C++ is that we don’t have to design
a new language/AST.
This should also help with interfacing to existing linear algebra/GPGPU
code.
To avoid confusion, we would probably want to standardize on the same C++
compiler and version for MacOSX and (Ubuntu) Linux. Past experience is that
g++ gets very slow for heavily-templated code, so llvm/clang is preferred.
The very latest is llvm-6.0, but I’m going to start out with llvm-5.0 and
see how it goes.
Interfacing between Scala and C++
It is supposedly easy to call from Scala into compiled C or C++, e.g.
Scala:
object LongDistance {
  @native def callSquareNative(x: Int): Int
}
// need to do System.loadLibrary("squarenative") somewhere
C++:
#include <jni.h>
// Note: a Scala object's methods live on the class LongDistance$, so the '$' is
// mangled to _00024 in the JNI symbol; extern "C" prevents C++ name mangling.
extern "C" JNIEXPORT jint JNICALL
Java_LongDistance_00024_callSquareNative(JNIEnv* env, jobject obj, jint x) {
  return x * x;
}
So far so good. The JNI also allows the C++ code to access members and call methods
on a Java object or array. But if you do that it starts to get very expensive, in much
the same way as interactions across the py4j boundary.
As long as we stick to passing around Int/Long/Boolean base types, and never making
callbacks from C++ into JVM, it should go fast.
Aside: exceptions don’t propagate from C++ back into JVM in any obvious way, so
the C++ code should catch any exceptions, clean up, and pass back any information
in an efficient way - viz in a base-type return value, or by modification of a
RegionValue visible to both sides. The current plan is that our C++ code should
be exception-safe but should be designed not to use exceptions.
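As a sketch of that convention (method names are made up), the Scala wrapper can treat a base-type return value as a status code and turn it back into a JVM exception:

object NativeCalls {
  // Hypothetical native entry point: returns 0 on success, or a small error code.
  // The C++ implementation catches everything and never lets an exception escape.
  @native def mapPartitionNative(regionPtr: Long, nRows: Long): Long

  def mapPartition(regionPtr: Long, nRows: Long): Unit = {
    val status = mapPartitionNative(regionPtr, nRows)
    if (status != 0)
      throw new RuntimeException(s"native mapPartition failed with status $status")
  }
}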
This all seems plausible as a way to handle data in RegionValues. However, it seems
likely that the C++ side will also sometimes need to access the Type/TStruct/TArray
metadata to make runtime decisions, and if that only existed as JVM objects it would
be expensive to access.
I think we can finesse this by having Scala objects which are mirrored in C++ -
i.e. for each instance of a mirrored object we create a corresponding object on
the C++ heap, and the Scala object holds a reference to the C++ object (the
JNIEnv has some magic for managing references, which I haven’t understood yet).
Of course in such a scheme we have to worry about coherency between the two
copies of the same information. But for the Type/TStruct/TArray metadata that
may be fairly simple.
Do we ever modify a metadata object ?
Do we need to be able to modify metadata objects from the C++ side ?
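A minimal sketch of the mirroring scheme (names hypothetical): each Scala metadata object builds its C++ counterpart once and keeps the raw pointer as a Long handle, which is what gets passed on subsequent JNI calls.

// Sketch only: a Scala-side wrapper owning a pointer to a C++-heap mirror of a TStruct.
class NativeStructMirror(fieldOffsets: Array[Long]) {
  // Hypothetical natives: build/free the C++ copy of the metadata.
  @native private def constructMirror(offsets: Array[Long]): Long
  @native private def destroyMirror(ptr: Long): Unit

  // Passing this handle down means the C++ code never has to reach back into
  // JVM objects for metadata.
  val handle: Long = constructMirror(fieldOffsets)

  def close(): Unit = destroyMirror(handle)
}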
An alternative would be to flatten metadata into a Region, and then reconstruct
it on the C++ side. That is probably simpler, but I’m not sure whether it
would hurt performance in some cases. Probably not, since metadata is tiny.
Where to handle dynamic generation of C++ ?
Right now all the IR infrastructure is being built as Scala objects, and the
IR is evolving rapidly. I don’t think we want to force the IR to be mirrored
as C++ data structures; and Scala is a good concise language for handling
the AST and doing code-generation. So the most obvious short-term approach
is to write Scala code which generates one or more C++ source files from
an IR tree. That code can have the same kind of structure as the
IR-to-JVM-bytecode translation.
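For concreteness, a toy sketch (with made-up IR node names) of that translation:

// Toy sketch: recursively emit a C++ expression for a tiny hypothetical IR.
sealed trait Expr
case class I64(v: Long) extends Expr
case class Ref(name: String) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

object CxxEmit {
  def emit(e: Expr): String = e match {
    case I64(v)    => s"${v}L"
    case Ref(name) => name
    case Add(l, r) => s"(${emit(l)} + ${emit(r)})"
  }

  // Wrap the expression in a function with C linkage so it can be found via JNI/dlsym.
  def emitFunction(name: String, arg: String, body: Expr): String =
    s"""extern "C" long $name(long $arg) { return ${emit(body)}; }"""
}

So CxxEmit.emitFunction("f", "x", Add(Ref("x"), I64(1))) yields
extern "C" long f(long x) { return (x + 1L); }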
From past experience, the code-generation phase tends to be fast compared to
C++ compilation, so I don’t anticipate that doing this in Scala would be
noticeably slow. And the AST is very small relative to data, so the impact
on JVM heap and GC should also be small. The big win is that, since
Scala is a good language for this problem and we have a team with more Scala
experience than C++ experience, we should get it done with less effort.
However, writing new code in Scala does tend to defer the possible goal of
eliminating the JVM at some point.
Propagating dynamic code across the Spark cluster
I propose to try the easy way first: broadcast the contents of the
compiled+linked dynamic library to all nodes in the cluster, store it to
a node-local file, then call System.load on that absolute path to get it loaded.
We can measure the performance for various sizes of foo.so, and see whether
it’s likely to be a problem.
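A minimal Scala sketch of that easy path (names and paths are placeholders):

import java.nio.file.Files
import org.apache.spark.SparkContext

object NativeLibBroadcast {
  // Broadcast the raw bytes of the compiled .so, write them to a node-local temp
  // file on each executor, and load by absolute path (System.load, since the temp
  // file is not on java.library.path).
  def loadEverywhere(sc: SparkContext, soBytes: Array[Byte], nPartitions: Int): Unit = {
    val bc = sc.broadcast(soBytes)
    sc.parallelize(0 until nPartitions, nPartitions).foreachPartition { _ =>
      val path = Files.createTempFile("hail-codegen", ".so")
      Files.write(path, bc.value)
      System.load(path.toAbsolutePath.toString)
    }
  }
}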
There may be some decisions about when/whether to unload the dynamic library
corresponding to code which is no longer being used.
Batching and parallelism
If we can give the C++ a batch of RegionValues at a time (maybe up to 64?)
then there are several potential advantages:
a) There may be enough work to exploit multiple threads in the C++ code.
A rule of thumb is that if you can split it into tasks of about 2-5msec
duration, you’ll get good efficiency. Below that it needs more care.
b) The costs of JNI JVM-to-C++ transitions can be amortized over more actual work.
c) Eventually there’s more potential to optimize data layout for SIMD
and/or GPGPU. But that’s fairly far in the future.
Within the Spark framework, we’re mostly building iterators which call
other iterators. So it should be easy to build in a modest amount of
eagerness and batching, i.e. an iterator which eagerly pulls up to 64
RegionValues from its child, passes that batch to C++, and gets back
some number of rows from C++. Though we need some care to keep the total
memory usage under control when there’s an operator which can multiply
the size of the data (e.g. “explode” can produce many output rows for
each input).
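A rough sketch of such a batching iterator (the native method and batch size are placeholders):

// Sketch: pull up to 64 row offsets from the upstream iterator, hand the whole
// batch to C++ in one JNI call, then replay the returned offsets downstream.
class NativeBatchIterator(child: Iterator[Long], regionPtr: Long) extends Iterator[Long] {
  private val batchSize = 64
  private var out: Iterator[Long] = Iterator.empty

  // Hypothetical native: processes a batch of RegionValue offsets within one region
  // and returns the offsets of the rows it produced (possibly fewer, possibly more).
  @native private def processBatch(regionPtr: Long, offsets: Array[Long], n: Int): Array[Long]

  private def fill(): Unit = {
    val buf = scala.collection.mutable.ArrayBuffer.empty[Long]
    while (buf.length < batchSize && child.hasNext) buf += child.next()
    if (buf.nonEmpty) out = processBatch(regionPtr, buf.toArray, buf.length).iterator
  }

  // Loop in case an entire batch gets filtered out on the C++ side.
  def hasNext: Boolean = { while (!out.hasNext && child.hasNext) fill(); out.hasNext }
  def next(): Long = { if (!hasNext) throw new NoSuchElementException; out.next() }
}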
However, note that if the goal is to reduce the number of JNI calls,
then we still want to implement the iterator methods on the JVM side,
rather than having it call down to C++ for each row.
We should measure the costs of various kinds of JNI call/return before
trying to get too detailed in designing this.
Managing lifetimes of Regions and RegionValues
Regions will always be allocated on the C++ heap, so that the JVM doesn’t
get tempted to move them around. However, ownership of a Region passes
back and forth between C++ code and Scala code: it will be allocated in C++,
may be populated either in Scala or C++ (depending on the IR node), may be
passed down to C++ for the consumer, and then back up to a Scala iterator.
To manage the lifetime correctly and avoid memory leaks, I think this
means that we’ll need the Scala side to be able to tell the C++ side
when and if it owns an object. But we don’t necessarily need to
synchronously call down into C++ to destroy the object: instead the
Scala code could add it to a list of scheduled-for-deletion objects,
and the next call down to C++ could destroy it.
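As a sketch of that deferred-free scheme (names hypothetical): the Scala side queues region pointers, and any subsequent down-call into C++ drains the queue first.

// Sketch: regions scheduled for deletion are not freed synchronously; their pointers
// are drained and handed to C++ on the next call that crosses the JNI boundary anyway.
object RegionReaper {
  private val pending = new java.util.concurrent.ConcurrentLinkedQueue[java.lang.Long]()

  def scheduleFree(regionPtr: Long): Unit = pending.add(regionPtr)

  // Hypothetical native that frees each C++-heap Region in the array.
  @native private def freeRegions(ptrs: Array[Long]): Unit

  // Called at the start of any other call down into C++.
  def drain(): Unit = {
    val ptrs = Iterator.continually(pending.poll()).takeWhile(_ != null).map(_.longValue).toArray
    if (ptrs.nonEmpty) freeRegions(ptrs)
  }
}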
More detail needed here, just wanted to flag the issue early.
First steps
- Experiment with integrating non-dynamic C++ code into the build
  system and calling it through JNI. Measure overheads with various
  numbers of arguments.
- Experiment with compilation/loading of dynamically-generated C++ code.
- Experiment with Spark broadcast/loading of dynamically-generated C++ code.
- Do a prototype implementation of something moderately interesting.
At each step, we should instrument and analyze performance to see
whether there are any gotchas.