NDArray loops are a severe performance pain point in lowered methods like linear_regression_rows_nd. John and I propose a new kind of NDArray emitter that should relieve some of this performance pressure.
Current design
The current NDArrayEmitter intermediate looks like this:
abstract class NDArrayEmitter(val outputShape: IndexedSeq[Value[Long]], val elementType: SType) {
  val nDims = outputShape.length

  def outputElement(cb: EmitCodeBuilder, idxVars: IndexedSeq[Value[Long]]): PCode
}
While simple, this design forces implementations to do a lot of work in the inner loop, since the code generated by outputElement must locate each element from the full index vector idxVars on every iteration.
Proposed design
Instead, we propose the following design:
abstract class NDArrayProducer {
  def elementType: SType
  val shape: IndexedSeq[Value[Long]]

  // global initialization
  val init: EmitCodeBuilder => Unit

  // initialize or reset an axis
  val initAxis: IndexedSeq[(EmitCodeBuilder) => Unit]

  // step an axis by some number of elements
  val stepAxis: IndexedSeq[(EmitCodeBuilder, Value[Int]) => Unit]

  // load the element at the current address
  def loadElementAtCurrentAddr(cb: EmitCodeBuilder): SCode
}
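For intuition, here is a runnable sketch of the same contract over plain JVM values rather than emitted bytecode; the class and field names below (ToyProducer, RowMajorProducer) are illustrative stand-ins, not Hail's actual types. A producer reading a row-major flat buffer can keep one running offset per axis, so resetting an inner axis leaves the outer axes' progress intact:

```scala
// Toy analogue of NDArrayProducer: the real interface threads an
// EmitCodeBuilder and returns SCode; this sketch acts on values directly.
abstract class ToyProducer {
  val shape: IndexedSeq[Long]
  def init(): Unit                       // global initialization
  def initAxis(axis: Int): Unit          // initialize or reset one axis
  def stepAxis(axis: Int, n: Int): Unit  // step an axis by n elements
  def loadElementAtCurrentAddr(): Double
}

// Reads a row-major buffer; the current address is the sum of per-axis
// offsets, each advanced by that axis's stride.
class RowMajorProducer(data: Array[Double], val shape: IndexedSeq[Long])
    extends ToyProducer {
  private val strides: IndexedSeq[Long] =
    shape.scanRight(1L)(_ * _).tail      // e.g. shape (2, 3) -> strides (3, 1)
  private val offsets = Array.fill(shape.length)(0L)
  def init(): Unit = java.util.Arrays.fill(offsets, 0L)
  def initAxis(axis: Int): Unit = offsets(axis) = 0L
  def stepAxis(axis: Int, n: Int): Unit = offsets(axis) += n * strides(axis)
  def loadElementAtCurrentAddr(): Double = data(offsets.sum.toInt)
}
```

Because each axis owns its offset, a consumer can step the column axis twice, reset it with initAxis, then step the row axis once, and the address lands on the second row's first element; no per-element index arithmetic over a full index vector is needed.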
This producer is “consumed” through the interface above. For example, to visit every element, a consumer generates one loop per axis, calls initAxis(axisIndex)(cb) before entering each loop, advances the element pointer with stepAxis(axisIndex)(cb, const(1)) at the bottom of each loop body, and reads the current element with loadElementAtCurrentAddr in the innermost loop.
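Concretely, that nested-loop pattern can be sketched over plain JVM values (closures here stand in for emitted code, and recursion stands in for the unrolled loops the consumer would generate; all names are illustrative):

```scala
// Toy stand-ins for the producer operations, closing over a flat row-major
// 2x3 buffer: one running offset per axis, summed to form the address.
val shape = IndexedSeq(2L, 3L)
val data = Array(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
val strides = IndexedSeq(3L, 1L)       // row-major strides for a 2x3 array
val offsets = Array.fill(shape.length)(0L)

val init: () => Unit = () => java.util.Arrays.fill(offsets, 0L)
val initAxis: IndexedSeq[() => Unit] =
  shape.indices.map(axis => () => offsets(axis) = 0L)
val stepAxis: IndexedSeq[Int => Unit] =
  shape.indices.map(axis => (n: Int) => offsets(axis) += n * strides(axis))
def loadElementAtCurrentAddr(): Double = data(offsets.sum.toInt)

// One loop per axis: initAxis before entering the loop, stepAxis(axis, 1)
// at the bottom of its body, and a load in the innermost loop.
val out = collection.mutable.ArrayBuffer[Double]()
init()
def loop(axis: Int): Unit =
  if (axis == shape.length) out += loadElementAtCurrentAddr()
  else {
    initAxis(axis)()
    var i = 0L
    while (i < shape(axis)) { loop(axis + 1); stepAxis(axis)(1); i += 1 }
  }
loop(0)
// out now holds the elements in row-major order
```

Note that the element address is only ever adjusted incrementally, one axis at a time, rather than recomputed from a full index vector inside the innermost loop.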
What this solution is not:
This solution is not an end-all solution to NDArray performance problems in the JVM.
What this solution is:
This design is an easy (< 1 week of work) way to drastically improve the generated bytecode, and hopefully get NDArray performance to a place where we can start ripping out Spark implementations.