Last night while feverishly thinking of the world of opportunities opened up by
DNDArray, I realized that 1:1 partitioners are far too restrictive! I can
achieve good key-change performance with a general class of partitioners that I
call Rectangular Partitioners (they’re similar to but more general than
BlockMatrix Grid Partitioners).
What DNDArrays need is cheap permutation and prefixation of keys. When is this
cheap?
Define a partitioner as a key type and a sequence of intervals. An interval’s
end-points are structs of type the partitioner’s key type.
Define “partitioner, p, restricted to field f” as a new partitioner whose
sequence of intervals is p’s intervals where each end-point has been projected
to just the field f. For example:
Partitioner(
Interval(hl.struct(x=0, y=0, z=0), hl.struct(x=0, y=3, z=5)),
Interval(hl.struct(x=0, y=0, z=6), hl.struct(x=0, y=3, z=10)),
Interval(hl.struct(x=0, y=3, z=0), hl.struct(x=0, y=10, z=5)),
Interval(hl.struct(x=0, y=3, z=6), hl.struct(x=0, y=10, z=10)),
If we restrict to “z” we have:
Partitioner(
Interval(hl.struct(z=0), hl.struct(z=5)),
Interval(hl.struct(z=6), hl.struct(z=10)),
Interval(hl.struct(z=0), hl.struct(z=5)),
Interval(hl.struct(z=6), hl.struct(z=10)),
If we restrict to “y” we have:
Partitioner(
Interval(hl.struct(y=0), hl.struct(y=3)),
Interval(hl.struct(y=0), hl.struct(y=3)),
Interval(hl.struct(y=3), hl.struct(y=10)),
Interval(hl.struct(y=3), hl.struct(y=10)),
A partitioner is “rectangular” if for every field, f, in its key type, the
partitioner restricted to f contains no partially overlapping intervals. Two
intervals partially overlap if they overlap but are not equal.
If a partitioner is rectangular, then permuting the key fields can be achieved
with only a local sort. The order of the partitions changes in the expected
way. For two-key rectangular partitioners, this corresponds to switching the
ordering between row-major and column-major.
This avoids a shuffle, which is already a huge win, but we can do even better. A
local sort can be avoided if our data structure has a cheap way to permute the
sort-by-indices. hl.nd.array
is one such data structure! If we open our mind
to partitions that are not just arrays of records but ndarrays of records,
then key permutations are just matrix transpositions (or whatever the tensor
version of that is called)!
Rectangular partitioners permit filtering the rows and columns of ndarrays
without repacking the blocks! Hail’s table join logic will find the
corresponding partitions. All that remains is to teach Hail how to work with
ndarrays as partitions instead of arrays.
This essay was inspired by working on DNDArray: