NDArray emitter redesign

NDArray loops are a severe performance pain point in lowered methods like linear_regression_rows_nd. John and I propose a new kind of NDArray emitter that should alleviate some of this pressure.

Current design

The current NDArrayEmitter intermediate looks like this:

abstract class NDArrayEmitter(val outputShape: IndexedSeq[Value[Long]], val elementType: SType) {
  val nDims = outputShape.length

  def outputElement(cb: EmitCodeBuilder, idxVars: IndexedSeq[Value[Long]]): PCode
}

While simple, this design forces implementations to do a lot of work in the inner loop, where the code generated by outputElement is placed.
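
For intuition, here is a rough plain-Scala analogue of that pattern (illustrative only, not Hail's actual generated code; Strided, the outputElement signature used here, and sum2D are made up for this sketch). Because the consumer can only ask for elements by full index vector, the stride arithmetic is redone for every element in the inner loop:

object CurrentStyleAnalogue {
  // made-up stand-in for an ndarray: a flat buffer plus shape and strides (in elements)
  final case class Strided(data: Array[Double], shape: IndexedSeq[Long], strides: IndexedSeq[Long])

  // analogue of outputElement: rebuilds the flat offset from the full index vector on every call
  def outputElement(nd: Strided, idxVars: IndexedSeq[Long]): Double = {
    var offset = 0L
    var k = 0
    while (k < idxVars.length) { offset += idxVars(k) * nd.strides(k); k += 1 }
    nd.data(offset.toInt)
  }

  // a consumer's loop nest over a 2-D array: the multiply/add chain above runs once per element
  def sum2D(nd: Strided): Double = {
    var acc = 0.0
    var i = 0L
    while (i < nd.shape(0)) {
      var j = 0L
      while (j < nd.shape(1)) {
        acc += outputElement(nd, IndexedSeq(i, j))
        j += 1
      }
      i += 1
    }
    acc
  }
}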

Proposed design

Instead, we propose the following design:

abstract class NDArrayProducer {

  def elementType: SType

  val shape: IndexedSeq[Value[Long]]
  
  // global initialization
  val init: EmitCodeBuilder => Unit

  // initialize or reset an axis.
  val initAxis: IndexedSeq[(EmitCodeBuilder) => Unit]

  // step an axis by some number of elements
  val stepAxis: IndexedSeq[(EmitCodeBuilder, Value[Int]) => Unit]

  // load the element at the current address
  def loadElementAtCurrentAddr(cb: EmitCodeBuilder): SCode
}
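
As a sketch of what a concrete producer could look like, here is a plain-Scala analogue over a strided in-memory buffer (StridedProducer and its fields are hypothetical; a real producer would emit this logic through EmitCodeBuilder rather than run it directly). The per-element cost collapses to a pointer bump and a load; all of the index arithmetic lives in initAxis and stepAxis:

final class StridedProducer(data: Array[Double], val shape: IndexedSeq[Long], strides: IndexedSeq[Long]) {
  // current flat offset into `data`, in elements
  private var offset = 0L
  // running contribution of each axis to `offset`, so initAxis can rewind one axis at a time
  private val axisContribution = Array.fill(shape.length)(0L)

  def init(): Unit = {
    offset = 0L
    var k = 0
    while (k < axisContribution.length) { axisContribution(k) = 0L; k += 1 }
  }

  // reset axis k to its start without disturbing the other axes
  def initAxis(k: Int): Unit = {
    offset -= axisContribution(k)
    axisContribution(k) = 0L
  }

  // advance axis k by `steps` elements: one multiply and two adds, no index vector needed
  def stepAxis(k: Int, steps: Long): Unit = {
    val delta = steps * strides(k)
    offset += delta
    axisContribution(k) += delta
  }

  def loadElementAtCurrentAddr(): Double = data(offset.toInt)
}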

This producer is “consumed” through the interface above. For example, to iterate over every element, a consumer generates one loop per axis, calls initAxis(axisIndex)(cb) before entering each loop, and advances the element pointer by calling stepAxis(axisIndex)(cb, const(1)) in the body of each loop. The element itself is read by calling loadElementAtCurrentAddr inside the innermost loop.
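
Concretely, using the StridedProducer sketch above, a 2-D consumer would look like the following (sum2D is again just an illustration):

def sum2D(p: StridedProducer): Double = {
  var acc = 0.0
  p.init()
  p.initAxis(0)                           // before entering the outer loop
  var i = 0L
  while (i < p.shape(0)) {
    p.initAxis(1)                         // before entering the inner loop: rewind axis 1 to the row start
    var j = 0L
    while (j < p.shape(1)) {
      acc += p.loadElementAtCurrentAddr() // only the innermost loop touches the element
      p.stepAxis(1, 1L)                   // step axis 1 by one element
      j += 1
    }
    p.stepAxis(0, 1L)                     // step axis 0 by one element (one full row)
    i += 1
  }
  acc
}

// For a 2x3 row-major buffer (strides (3, 1) in elements):
// sum2D(new StridedProducer(Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0), IndexedSeq(2L, 3L), IndexedSeq(3L, 1L))) == 21.0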

What this solution is not:

This solution is not a cure-all for NDArray performance problems on the JVM.

What this solution is:

This design is an easy (< 1 week of work) way to drastically improve the generated bytecode and, hopefully, to get NDArray performance to a place where we can start ripping out the Spark implementations.