NDArray Strides

I’m making a list here of all the ways NumPy uses strides to perform optimization tricks. If there are no compelling examples in this list, I’m going to take strides off the canonical PNDArray PType and just stick to column-major layout. I was discussing with Tim whether maintaining strides is more trouble than it’s worth, since I don’t want to have to maintain strides on TNDArray.

  • Windows into matrices.
    • Imagine you have a row-major matrix and ask for only the first c columns. If you change the shape but not the strides, this works naturally without copying, because the strides still record the distance between two rows of the original matrix.
    • People also use this to implement sliding windows over an array, viewing a series of small (possibly overlapping) sections without ever copying.
  • Transposition
    • A transposition can be performed in constant time by flipping the elements of the stride tuple.
  • Broadcasting
    • If you do something like add an ndarray to a constant in numpy, the constant gets expanded by wrapping it in an ndarray with the same shape as the original ndarray, but with all strides zeroed out. See:
    x = np.arange(10)
    X, Y = np.broadcast_arrays(x, 3) # Read as "broadcast the second thing to match the shape of the first"
    print(X.strides)
    (8,)
    print(Y.strides)
    (0,)
    

So Y looks like this:

ndarray(data=[3], shape=(10,), strides=(0,))

It stores only a single element in the data array, but the zero stride specifies that the way to get from one element to the next is to step forward 0 bytes, meaning you stay in one spot the whole time.
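A sketch of building that same view by hand with as_strided (we'd never construct it this way in production; this is just to make the representation above concrete, and the explicit dtype is my addition so the byte counts are deterministic):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Hand-build the view described above: one stored element, ten logical ones.
data = np.array([3], dtype=np.int64)           # the single stored element
Y = as_strided(data, shape=(10,), strides=(0,))

print(Y.shape, Y.strides)   # (10,) (0,)
print(Y[0], Y[9])           # 3 3
print(data.nbytes)          # 8 -- one int64 backs all ten logical entries
```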

Going point by point for Hail implementations:

  1. Windows into matrices: I think this is the least relevant to us, since we are going to end up copying data around rather than returning a window into a matrix to a user. And if we filter out entries internally, we won’t want to copy filtered-out data around. We also won’t want to write data to disk with filtered-out entries still in it.

  2. Transposition: We do currently benefit from the fast-transpose trick when it’s the top-level operation in an IR. If it’s an IR node in the middle of a deforested computation, we never manifest the transposed result, which I think makes sense. Worth noting that BLAS doesn’t have a standalone transpose operation (GEMM and friends take transpose flags instead).

  3. The NDArrayEmitter handles all striding issues right now. The current NDArrayEmitter is a thing that knows how to produce an entry at a particular coordinate. If you have an NDArrayEmitter that represents the addition of a matrix and a vector, both the row and column coordinates are used to look up an entry in the matrix, but the column coordinate is thrown out when looking up an entry in the vector, which has the effect of broadcasting. However, if we don’t keep track of strides, a user will not be able to pass us a broadcasted numpy literal unless we first “densify” it by manifesting the entire data array in Python.
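For reference, points 1 and 2 look like this in NumPy (a small sketch; the shape and dtype are arbitrary choices of mine):

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)  # C-order, strides (32, 8)

# Point 1: a window (the first 2 columns) is a view: same strides, no copy.
w = a[:, :2]
print(w.shape, w.strides)        # (3, 2) (32, 8) -- strides unchanged
print(np.shares_memory(a, w))    # True

# Point 2: transpose swaps the stride tuple in constant time; no data moves.
t = a.T
print(t.shape, t.strides)        # (4, 3) (8, 32)
print(np.shares_memory(a, t))    # True
```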

I suspect we are going to rely on BLAS more and more for performance reasons, perhaps even for simple things like adding two matrices elementwise or scaling a matrix by a constant (DAXPY covers the add, DSCAL the scale, treating the matrix as a flat vector). Some of these routines take stride arguments (incx/incy), so it’s possible that if we moved further in this direction there’d be benefits to using them, but I don’t currently see those benefits.
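As a sketch of why those stride arguments exist: DAXPY’s incx/incy let BLAS walk non-contiguous inputs, which in NumPy terms means a DAXPY-shaped update (y ← a·x + y) can run on a strided view directly. The array and constants below are mine, purely illustrative:

```python
import numpy as np

# A column of a row-major matrix is a non-contiguous strided view --
# exactly the kind of input BLAS's incx argument lets DAXPY consume
# without a copy.
m = np.arange(12, dtype=np.float64).reshape(3, 4)
col = m[:, 1]                     # view with strides (32,)
print(col.flags['C_CONTIGUOUS'])  # False

y = np.zeros(3)
y += 2.0 * col                    # the DAXPY-shaped update, on the view
print(y)                          # [ 2. 10. 18.]
```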

The only other benefit I thought of is representing ndarrays of constants by using a data array of one element with all strides set to 0. Like:

ndarray(data=[1], shape=(50, 40), strides=(0, 0))

would make an ndarray with 50 rows and 40 columns of all 1s. The zero strides specify that moving along a row or column of the matrix does not result in any movement in the data array (i.e. the strides specify that we should keep looking at the single element of the data array over and over). But I suspect this would actually be harmful for things like vectorized instructions, since there won’t be an actual array of the constants manifested. This also seems like something that could just be handled with some sort of PConstantNDArray PType in the future.
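A sketch of that constant representation using NumPy’s as_strided (the shape comes from the example above; the view must be treated as read-only, since every logical entry aliases the same storage):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# A 50x40 matrix of all 1s backed by a single stored float64.
one = np.array([1.0])
ones = as_strided(one, shape=(50, 40), strides=(0, 0))

print(ones.shape)    # (50, 40)
print(ones.strides)  # (0, 0)
print(ones.sum())    # 2000.0 -- reductions see all 2000 logical entries
print(one.nbytes)    # 8 -- but only one element is actually stored
```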

I’m curious if anyone else has thoughts.