Hail Native Code Generation (C++)

Richard Cownie, Hail Team, 2018-03-22

Introduction

This document sketches initial steps towards generating native-x86_64
code for performance-critical operations in Hail.

With a small team and a substantial existing Scala codebase, we need
to evolve the codebase piece-by-piece, so for a considerable time we
expect to have a mixture of Scala, dynamically-generated JVM bytecode,
compiled C++, and dynamically-generated x86_64 code. The TStruct/TArray/
RegionValue infrastructure should provide a good basis for all flavors of
code to interoperate.

C++ source code vs low-level interface (LLVM IR ?)

One major design decision is whether to generate native (x86_64) code
through some kind of compiler-IR representation (e.g. LLVM IR), or to
simply generate C++ source code and compile it. Experience with both
approaches suggests that dynamically-generated C++ gives good productivity,
an easy way to combine static-compiled and template libraries with
dynamically-generated code, and excellent runtime performance, but with
some risk of noticeable compilation delays. I propose that we should
use this approach, but try to reduce or eliminate the impact of compilation
time in several ways:

a) Caching of compiled object-code files with ccache, which can maintain
a persistent cache so that if you do similar operations across several
Hail sessions, the code may only be compiled once.

b) Where it isn’t critical to performance, constant values can be hoisted
out of the compiled code into dynamically-passed parameters, which
then allows the compiled code to be cached and re-used with different
parameter values (see the sketch after this list).

c) Codegen/compilation can be started eagerly in the background, thus
overlapping with time taken for the data processing of earlier
operators/stages. [Aside: in the big picture, I think we probably
want to keep the Scala/JVM code single-threaded, but allow multiple
threads in the C++ in situations where it offers a compelling performance
advantage without too much complexity of data-sharing].
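
As a hypothetical illustration of (b), compare a generated kernel with a
constant baked into the source against one with the constant hoisted into an
argument; the function names and the filter operation are invented for
illustration:

  // Threshold baked into the generated source: any new threshold
  // forces a fresh compile (and a ccache miss).
  extern "C" long filterAbove_2_5(const double* values, long n, long* outIdx) {
    long m = 0;
    for (long i = 0; i < n; ++i)
      if (values[i] > 2.5) outIdx[m++] = i;
    return m;
  }

  // Threshold hoisted into a parameter: compiled once, cached,
  // and re-used for any threshold value.
  extern "C" long filterAbove(const double* values, long n, long* outIdx,
                              double threshold) {
    long m = 0;
    for (long i = 0; i < n; ++i)
      if (values[i] > threshold) outIdx[m++] = i;
    return m;
  }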

And of course the big win of using C++ is that we don’t have to design
a new language/AST.

This should also help with interfacing to existing linear algebra/GPGPU
code.

To avoid confusion, we would probably want to standardize on the same C++
compiler and version for macOS and (Ubuntu) Linux. Past experience is that
g++ gets very slow for heavily-templated code, so llvm/clang is preferred.
The very latest is llvm-6.0, but I’m going to start out with llvm-5.0 and
see how it goes.

Interfacing between Scala and C++

It is supposedly easy to call from Scala into compiled C or C++, e.g.

Scala:

  object LongDistance {
    @native def callSquareNative(x: Int): Int
  }
  
  // need to call System.loadLibrary("squarenative") somewhere
  
C++:

  #include <jni.h>
  
  // extern "C" prevents C++ name-mangling, so the JVM can find the symbol
  extern "C" JNIEXPORT jint JNICALL
  Java_LongDistance_callSquareNative(JNIEnv* env, jobject obj, jint x) {
    return x * x;
  }

So far so good. The JNI also allows the C++ code to access members and call methods
on a Java object or array. But if you do that it starts to get very expensive, in much
the same way as interactions across the py4j boundary.

As long as we stick to passing around Int/Long/Boolean base types, and never making
callbacks from C++ into JVM, it should go fast.

Aside: exceptions don’t propagate from C++ back into the JVM in any obvious way, so
the C++ code should catch any exceptions, clean up, and pass back any information
in an efficient way - namely in a base-type return value, or by modification of a
RegionValue visible to both sides. The current plan is that our C++ code should
be exception-safe but should be designed not to use exceptions.
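
A minimal sketch of that convention, with invented names and status codes: the
JNI entry point catches everything as a backstop and maps failures to an
integer status the Scala side can inspect, so no C++ exception ever crosses
the boundary.

  #include <jni.h>
  #include <exception>

  // Hypothetical status codes shared by both sides.
  enum { kOk = 0, kError = 1, kUnknownError = 2 };

  extern "C" JNIEXPORT jlong JNICALL
  Java_NativeOps_transformRows(JNIEnv* env, jclass cls, jlong regionAddr) {
    try {
      // ... do the real work against the Region at regionAddr ...
      return kOk;
    } catch (const std::exception& ex) {
      // cleanup happens via RAII destructors during unwinding
      return kError;
    } catch (...) {
      return kUnknownError;
    }
  }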

This all seems plausible as a way to handle data in RegionValues. However, it seems
likely that the C++ side will also sometimes need to access the Type/TStruct/TArray
metadata to make runtime decisions, and if that only existed as JVM objects it would
be expensive to access.

I think we can finesse this by having Scala objects which are mirrored in C++ -
i.e. for each instance of a mirrored object we create a corresponding object on
the C++ heap, and the Scala object holds a reference to the C++ object (the
JNIEnv has some magic for managing references, which I haven’t understood yet).

Of course in such a scheme we have to worry about coherency between the two
copies of the same information. But for the Type/TStruct/TArray metadata that
may be fairly simple.

Do we ever modify a metadata object ?
Do we need to be able to modify metadata objects from the C++ side ?

An alternative would be to flatten metadata into a Region, and then reconstruct
it on the C++ side. That is probably simpler, but I’m not sure whether it
would hurt performance in some cases. Probably not, since metadata is tiny.
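
For concreteness, a sketch of the mirroring scheme under hypothetical names:
Scala calls a native constructor once per Type, stores the returned C++
address as a Long, and passes that opaque handle back on later calls.

  #include <jni.h>
  #include <vector>

  // Hypothetical C++ mirror of a TStruct: just what the generated
  // code needs for runtime decisions.
  struct NativeTStruct {
    std::vector<jlong> fieldOffsets;
    int missingBytes;
  };

  extern "C" JNIEXPORT jlong JNICALL
  Java_NativeMeta_makeTStruct(JNIEnv* env, jclass cls,
                              jlongArray offsets, jint missingBytes) {
    jsize n = env->GetArrayLength(offsets);
    auto* s = new NativeTStruct();
    s->fieldOffsets.resize(n);
    env->GetLongArrayRegion(offsets, 0, n, s->fieldOffsets.data());
    s->missingBytes = missingBytes;
    return reinterpret_cast<jlong>(s);  // opaque handle held by Scala
  }

  extern "C" JNIEXPORT void JNICALL
  Java_NativeMeta_freeTStruct(JNIEnv* env, jclass cls, jlong handle) {
    delete reinterpret_cast<NativeTStruct*>(handle);
  }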

Where to handle dynamic generation of C++ ?

Right now all the IR infrastructure is being built as Scala objects, and the
IR is evolving rapidly. I don’t think we want to force the IR to be mirrored
as C++ data structures; and Scala is a good concise language for handling
the AST and doing code-generation. So the most obvious short-term approach
is to write Scala code which generates one or more C++ source files from
an IR tree. That code can have the same kind of structure as the
IR-to-JVM-bytecode translation.

From past experience, the code-generation phase tends to be fast compared to
C++ compilation, so I don’t anticipate that doing this in Scala would be
noticeably slow. And the AST is very small relative to data, so the impact
on JVM heap and GC should also be small. The big win is that since
Scala is a good language for this problem, and we have a team with more Scala
experience than C++ experience, we should get it done with less effort.

However, writing new code in Scala does tend to defer the possible goal of
eliminating the JVM at some point.

Propagating dynamic code across the Spark cluster

I propose to try the easy way first: broadcast the contents of the
compiled+linked dynamic library to all nodes in the cluster, store it to
a file, then call System.loadLibrary to get it loaded.

We can measure the performance for various sizes of foo.so, and see whether
it’s likely to be a problem.

There may be some decisions about when/whether to unload the dynamic library
corresponding to code which is no longer being used.

Batching and parallelism

If we can give the C++ a batch of RegionValues at a time (maybe up to 64 ?)
then there are several potential advantages:

a) There may be enough work to exploit multiple threads in the C++ code.
A rule of thumb is that if you can split it into tasks of about 2-5msec
duration, you’ll get good efficiency. Below that it needs more care.

b) The costs of JNI JVM-to-C++ calls can be amortized over more actual work.

c) Eventually there’s more potential to optimize data layout for SIMD
and/or GPGPU. But that’s fairly far in the future.

Within the Spark framework, we’re mostly building iterators which call
other iterators. So it should be easy to build in a modest amount of
eagerness and batching, i.e. an iterator which eagerly pulls up to 64
RegionValues from its child, passes that batch to C++, and gets back
some number of rows from C++. Though we need some care to keep the total
memory usage under control when there’s an operator which can multiply
the size of the data (e.g. “explode” can produce many output rows for
each input).
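
As a sketch of what the boundary might look like (hypothetical signature): a
single JNI call hands down the offsets of a whole batch of RegionValues and
gets back the number of surviving output rows.

  #include <jni.h>

  // Hypothetical batched entry point: one JNI crossing per batch of up
  // to 64 rows. rowOffsets holds the Region offsets of the input
  // RegionValues; output offsets are written back in place.
  extern "C" JNIEXPORT jint JNICALL
  Java_NativeBatch_processBatch(JNIEnv* env, jclass cls, jlong regionAddr,
                                jlongArray rowOffsets, jint nRows) {
    jlong offsets[64];            // assumes nRows <= 64
    env->GetLongArrayRegion(rowOffsets, 0, nRows, offsets);
    jint nOut = 0;
    for (jint i = 0; i < nRows; ++i) {
      // ... apply the compiled row transformation at offsets[i] ...
      offsets[nOut++] = offsets[i];  // this sketch keeps every row
    }
    env->SetLongArrayRegion(rowOffsets, 0, nOut, offsets);
    return nOut;
  }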

However, note that if the goal is to reduce the number of JNI calls,
then we still want to implement the iterator methods on the JVM side,
rather than having it call down to C++ for each row.

We should measure the costs of various kinds of JNI call/return before
trying to get too detailed in designing this.

Managing lifetimes of Regions and RegionValues

Regions will always be allocated on the C++ heap, so that the JVM doesn’t
get tempted to move them around. However, ownership of a Region passes
back and forth between C++ code and Scala code: it will be allocated in C++,
may be populated either in Scala or C++ (depending on the IR node), may be
passed down to C++ for the consumer, and then back up to a Scala iterator.

To manage the lifetime correctly and avoid memory leaks, I think this
means that we’ll need the Scala side to be able to tell the C++ side
when and if it owns an object. But we don’t necessarily need to
synchronously call down into C++ to destroy the object: instead the
Scala code could add it to a list of scheduled-for-deletion objects,
and the next call down to C++ could destroy it.
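
A sketch of that deferred scheme (all names invented): Scala accumulates the
handles of Regions it has finished with, and each downcall drains the list
before doing its real work.

  #include <jni.h>
  #include <vector>

  struct Region;                  // the real C++ Region type

  void destroyRegion(Region* r);  // frees the Region and its memory

  // Hypothetical helper called at the top of every JNI downcall:
  // free the Regions whose ownership Scala relinquished since the
  // previous call.
  void drainDeleteList(JNIEnv* env, jlongArray scheduledForDeletion) {
    jsize n = env->GetArrayLength(scheduledForDeletion);
    std::vector<jlong> handles(n);
    env->GetLongArrayRegion(scheduledForDeletion, 0, n, handles.data());
    for (jlong h : handles)
      destroyRegion(reinterpret_cast<Region*>(h));
  }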

More detail needed here, just wanted to flag the issue early.

First steps

  1. Experiment with integrating non-dynamic C++ code into the build
    system and calling it through JNI. Measure overheads with various
    numbers of arguments.

  2. Experiment with compilation/loading of dynamically-generated C++ code.

  3. Experiment with Spark broadcast/loading of dynamically-generated C++ code.

  4. Do a prototype implementation of something moderately interesting.

At each step, we should instrument and analyze performance to see
whether there are any gotchas.

never making callbacks from C++ into JVM

I don’t think that’s going to be easy :-(.

Hadoop is 3 things:

  1. An implementation of a distributed file system,
  2. An API for doing IO against distributed file systems, and
  3. An implementation of MapReduce (which we don’t use)

We pervasively rely on (2), the distributed file system API, for all file IO: accessing Google storage, Amazon S3, the Hadoop filesystem (1) itself, and local files.

It would be quite a bit of work to build the breadth of support that Hadoop has. So my plan, loosely, was to call back into the JVM to read/write blocks of files. The hope is that the JVM-crossing overhead would be low relative to the cost of going to a distributed file system, since we’re doing IO on whole blocks.

The read/write would move data to/from off-heap buffers so in either case we’re still only passing scalar values (but it does incur an extra copy).
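
A sketch of what such an upcall might look like (the class and method names
are hypothetical; the JNI calls are real): the jclass and jmethodID are looked
up once and cached, and the data lands in an off-heap buffer, so the per-call
traffic is just scalars.

  #include <jni.h>
  #include <cstdint>

  // Looked up once at initialization, not per call.
  static jclass gReaderClass;
  static jmethodID gReadBlock;

  void initReadBlock(JNIEnv* env) {
    // Hypothetical JVM-side helper: long readBlock(long bufAddr, long len)
    jclass local = env->FindClass("is/hail/io/BlockReader");
    gReaderClass = (jclass)env->NewGlobalRef(local);
    gReadBlock = env->GetMethodID(gReaderClass, "readBlock", "(JJ)J");
  }

  // Upcall: ask the JVM to fill an off-heap buffer; returns bytes read.
  long readBlock(JNIEnv* env, jobject reader, void* buf, long len) {
    return env->CallLongMethod(reader, gReadBlock,
                               (jlong)(intptr_t)buf, (jlong)len);
  }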

There are a couple of ways to attack this:

  1. The file-reader can be turned inside out so that the current state is held in an
    explicit stack in an object on the heap, rather than an implicit stack of local values
    (and program counters) in multiple call frames on the stack. Once you decouple
    that state from the thread’s stack, it becomes possible to return from C++ to JVM
    at the point where you need another block, let the JVM get the block, then
    resume from the correct state.

  2. C++-to-JVM calls are probably not that bad in terms of performance (I need to
    measure that). So it won’t hurt too much if we do a few, under tightly-controlled
    conditions. But my intuition, backed up by what I’ve read about best practices for
    using JNI, is that if you have a thread going back and forth across the JNI
    boundary in deep and complicated ways, it devolves into poor performance and a
    swamp of hard-to-debug interactions.

On the issue of debugging, I need to explore whether we can debug C++-called-from-JVM by attaching gdb to the running process (or some other way).

More discussion copied from emails:

Email 1 - Strategy

In the short-to-medium term, my proposal is that all of HDFS and Spark stays in place,
but we focus our effort on moving the cpu-intensive code of transforming/operating on
RegionValues down into C++.

Then the JVM Spark/HDFS code would still be responsible for reading/writing blocks
and managing the higher levels of the computation. But inside an RDD operation, all the
expensive parts of a transformation from an input sequence of RegionValue’s to an
output set of RegionValue’s (and joins and sorting) would be handled down in C++.

It’s also true that calling out from C++ to the JVM is probably not that expensive, at least
relative to the cost of doing IO, though it involves some hairiness about accessing and
caching metadata about Java classes and methods in the JNIEnv. C++-to-JVM calls
don’t have to be completely verboten. But we shouldn’t be in a hurry to move code which isn’t
performance-critical (or heap-garbage-critical) from Scala into C++ - for example, in reading
a MatrixTable from a file, I expect it would be near-optimal to have a mixture of Scala for parsing
the header/metadata, then based on the metadata generate C++ for unpacking/decoding
records from globalType, columnType, and rvRowType into RegionValues. And then run
that C++-optimized-for-this-MatrixType code for the bulk of the data.

We could contemplate moving more of the Spark infrastructure down into C++, but I think
that would require more work to get to a payoff. And if/when it proves to be a problem,
that can be reconsidered.

I’m basing this proposal on my guesses about balancing time and effort against the curve
of likely improvements in speed/memory-usage/reliability. But we should also consider any
not-so-quantifiable reasons for wanting to have less Scala and more C++.

Email 2 - turning MatrixRead inside out

In general terms (warning: stand back, I’m about to wave my arms …) you’ve got a top-down
recursive-descent parser for the file, which at the lowest level makes a readBlock call when
it needs some more data to carry on.

At each point where it calls readBlock, the current state of the parse is held in local variables
(and the program counter) of a number of stack frames.

But you can instead represent a current-parse-state as an explicit stack held in a
“parsing engine” state-machine object. The methods for that parsing engine might
look like this.

abstract class ParserEngine {

  // 0 => need a block, 1 => have one or more RVs, 2 => exhausted
  def curState: Int

  // advance the parse until exhausted, we have an RV, or we need a block
  def run(): Unit

  // curState 0 - need the next block: call either putNextBlock or putNoMoreBlocks
  def putNextBlock(block: Array[Byte]): Unit

  def putNoMoreBlocks(): Unit

  // curState 1 - we have one or more complete output RVs
  def nextAndAdvance(): RegionValue

  // curState 2 - no more RVs
}

So then we’ve turned it inside out and the actual reading of blocks is elsewhere.

Not super elegant - and maybe it’s neater to do the C++-to-JVM call for readBlock -
but you see it can be twisted around this way if necessary.

Email 3 - relation to Actor model

This gets a bit more elegant if we make it more Actor-like, with the parser-actor
sending a request message to the readblock-actor when it needs a block, and
getting a response message which is either oneBlockHere or noMoreBlocks.

And similarly on the RV side, with the request wantAnRV giving a response
oneRVHere or noMoreRVs (but sometimes with some messages to/from the
readblock-actor before the response).

But then there might be some C++-to-JVM calls creeping in. Either way, the
principle is that if your state is in call-frames on the stack at the point where you
need a block, then it’s going to need to do a C++-to-JVM upcall, but if the state
is all on the heap, you can stop and wait until something hands you a block.

[I’m generally intrigued by the Actor model because it works across networks while allowing loose synchronization - and making the states and protocols explicit gives you a shot at proving useful global properties of the system, e.g. can it deadlock ? is memory usage bounded ? etc. But that goes beyond the issue of the JVM/C++ split]

Email 4 - more detail on MatrixRead

If I’m reading this right, it looks as though the performance-critical code
for MatrixRead is all under PackDecoder.readRegionValue(), which has the TBaseStruct
type and interprets it to scan the packed binary data (sucked from a possibly-
compressed stream).

Once I’ve got some JNI/cpp infrastructure in place, I’ll prototype trying to do that
all with codegen’ed C++ and see if I get stuck.
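
To make that concrete, here is a tiny sketch of the shape such a codegen’ed
decoder might take; the layout (one missing-bits byte, then an int32 and a
double at fixed offsets) is invented for illustration and is not Hail’s actual
encoding.

  #include <cstdint>
  #include <cstring>

  // Hypothetical generated decoder for struct { int32 a; double b; }:
  // reads one missing-bits byte, then each present field in order,
  // writing into the RegionValue at `row`. Returns bytes consumed.
  long decodeRow(const char* in, char* row) {
    const char* p = in;
    uint8_t missing = *p++;            // bit i set => field i missing
    row[0] = (char)missing;            // missing bits live at offset 0
    if (!(missing & 1)) {              // field a present
      std::memcpy(row + 4, p, 4); p += 4;
    }
    if (!(missing & 2)) {              // field b present
      std::memcpy(row + 8, p, 8); p += 8;
    }
    return p - in;
  }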

JNI status and performance measurements

I got some JNI calls working. Once on the C++ side, I’m doing an indirect call through a
function pointer, so that we don’t need to create new JVM-side objects and methods each
time we compile a new C++ function. All calls return a Long. A call which passes one Long
parameter takes about 0.011usecs ~ 11nsec; with 8 Long parameters the call-return takes
about 19nsec. So it’s all quite snappy as long as you can avoid the more complicated
features of JNI.

There was a certain amount of gnarliness involved, since most examples are Java-to-C,
and the jni.h header actually declares various things differently for C++ (guarded by
#ifdef __cplusplus).

This also involved passing a Scala String and doing env->GetStringUTFChars(javaString, 0)
down in a native method - used for passing the function/symbol name for dynamic lookup
with dlsym(). So that does work, at least for this use, which is not performance-critical
(once per function, not once per record).
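
That lookup method is roughly this shape (a sketch; the class and method
names are invented):

  #include <jni.h>
  #include <dlfcn.h>

  // Resolve a symbol in a library already loaded via System.loadLibrary,
  // returning its address as a Long for later indirect calls.
  extern "C" JNIEXPORT jlong JNICALL
  Java_NativeModule_findSymbol(JNIEnv* env, jclass cls, jstring name) {
    const char* s = env->GetStringUTFChars(name, 0);
    void* addr = dlsym(RTLD_DEFAULT, s);
    env->ReleaseStringUTFChars(name, s);
    return reinterpret_cast<jlong>(addr);
  }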

And I jumped forward to clang-6.0.0 with --std=c++14 just to be sure we can get a common
version on macOS and Linux. I’ll have to build a little more infrastructure for managing
dynamic-library files, loading, and compiling before getting into real work in C++
(e.g. MatrixRead).

19ns

Awesome! This is plenty fast for block IO, and orders of magnitude faster than my fears. Are you calling a static function or a member on an object? Can you vary the object without doing additional setup with the JVM?

The 19ns figure is measured using a JNI’ed “public static native” method which takes 8 “long”
arguments, plus a 9th “long” specifying the C++ function address. This goes to a
corresponding static-compiled C++ function conforming to the JNI standard, which
then calls indirectly through funcAddr.
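
The trampoline itself is then a single static-compiled entry point (a sketch,
with invented Scala-side names):

  #include <jni.h>

  // One static-compiled JNI entry point serves every dynamically-generated
  // function: the extra argument is the address obtained from dlsym.
  typedef jlong (*NativeFunc8)(jlong, jlong, jlong, jlong,
                               jlong, jlong, jlong, jlong);

  extern "C" JNIEXPORT jlong JNICALL
  Java_NativeCall_call8(JNIEnv* env, jclass cls,
                        jlong a0, jlong a1, jlong a2, jlong a3,
                        jlong a4, jlong a5, jlong a6, jlong a7,
                        jlong funcAddr) {
    return reinterpret_cast<NativeFunc8>(funcAddr)(a0, a1, a2, a3,
                                                   a4, a5, a6, a7);
  }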

This little zigzag seems like the easiest way to dynamically generate and load C++
functions without also needing to dynamically create corresponding JVM class/method
stuff (which would then also need to be torn down when/if we unloaded a dynamic library).

Since the JVM is the JVM, it insists that everything is a method, not a function, so even
though the JNI’ed method is declared static, it still passes a reference - for a static
method, the class itself - which turns up in C++ as a “jclass” (a kind of “jobject”).

I haven’t yet measured anything to do with C++-to-JVM upcalls. My current thinking is that
anything which needs to be visible to both C++ and the JVM should be on the malloc heap
rather than the JVM heap, so that it can be accessed quickly from both sides (on the JVM
side, through Unsafe accesses at an offset from a “long” address). But I should experiment
with accessing Java object fields from C++ before committing to that choice.

This little zigzag

I quite like this. I’m all for minimizing the width of the C++/JVM interface.

I haven’t yet measured anything to do with C++-to-JVM upcalls.

Ah, my confusion, I thought this was C++ to JVM. Thanks for clarifying. Will be quite interested when you have some timing for this.

anything which needs to be visible to both C++ and JVM should be on the malloc heap

I agree, that’s exactly what I was thinking when I put together region values. The only exception I can think of is things which must be Java objects which we already discussed (e.g. objects for IO), although I’d also be happy if we allocate IDs for those and stick them in a table somewhere on the JVM side and pass scalar IDs or malloc heap objects back and forth for IO. Certainly the IO buffers should be in the malloc heap.

From C++ I can now do GetLongField/SetLongField on a Scala object:
SetLongField average time is 26.5nsec
GetLongField average time is 1.6nsec

So probably not great to be modifying Scala objects from C++ if we can
avoid it. I read some stuff suggesting that nio.ByteBuffer is a good way to
manage stuff visible from both sides, so I’ll try that next. Will also try an upcall
from C++ to a JVM object method.
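
For reference, the JNI hook for that is NewDirectByteBuffer, which wraps
malloc-heap memory as a java.nio.ByteBuffer visible to both sides without
copying (usage sketch; the wrapper name is invented):

  #include <jni.h>
  #include <cstdlib>

  // Expose a malloc-heap buffer to the JVM as a direct java.nio.ByteBuffer;
  // both sides then read/write the same bytes with no copying.
  extern "C" JNIEXPORT jobject JNICALL
  Java_NativeBuffers_allocBuffer(JNIEnv* env, jclass cls, jlong capacity) {
    void* mem = std::malloc(capacity);
    return env->NewDirectByteBuffer(mem, capacity);
  }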

Call from C++ to a Scala object method - average time 157.4 nsec.
So a good deal more expensive than the JVM-to-C++ call, but still fast relative
to any kind of IO.