I’ve been thinking a bit about how we could do faster filtering. We’re all pretty disappointed with the user experience of the hack I wrote up for Caitlin. Here are some of my not fully developed thoughts.
As suggested by Cotton, I think a good interface would be to have RDDs contain an iterator that can advance to the next record with a matching key. Unfortunately, the Java Iterator interface isn’t well suited to this because it has no such advance method. Using a flip book iterator, we could add:
def advance(
  rv: RegionValue,
  t: OrderingView[RegionValue],
  orderingRelevantFields: Array[String]
): Unit
which advances the iterator to the next value that is equal to rv under t, if one exists. If no such value exists, the next call to isValid will return false. This could be leveraged by joins.
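As a very rough sketch of the join loop this would enable (with plain values and an Ordering standing in for RegionValue and OrderingView; the names here are purely illustrative, not from the codebase):

object AdvanceJoinSketch {
  trait AdvanceableIterator[A] {
    def isValid: Boolean                          // false once the iterator is exhausted
    def value: A                                  // the record the iterator currently points at
    def advance(key: A, ord: Ordering[A]): Unit   // move to the next record equal to `key` under `ord`, if any
  }

  // Inner join of a sorted left stream against an advanceable right iterator,
  // assuming at most one right record per key.
  def innerJoin[A](left: Iterator[A], right: AdvanceableIterator[A], ord: Ordering[A]): Iterator[(A, A)] =
    left.flatMap { l =>
      right.advance(l, ord)
      if (right.isValid && ord.equiv(right.value, l)) Iterator.single((l, right.value))
      else Iterator.empty
    }
}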
I see two options to do this efficiently when the ordering view only looks at a subset of the fields:
- If the source can produce the subset of fields more efficiently than producing the full row, then we can clearly scan past irrelevant records without fully decoding them.
- If the source has an index with a superset of the relevant fields, we can look up the location of the next matching record in the index and seek to that position.
Reading A Subset of Fields
We’d want to add a decoder that can read the encoded bytes of a type T into a record of a type U whose fields are a subset of T’s. Ideally, we would be able to retroactively add the unread fields once we realize the record matches. This would be easy with physical types, because we could place the “key” fields first even if they are not first in the virtual type. Even without physical types, we could allocate space for the full record and populate the fields only as needed (i.e., the key fields first and the rest only if the key matches), but this seems easier to get wrong.
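A hypothetical shape for such a decoder, just to pin down the interface (Region stands in for the buffer a record is decoded into; none of these names exist today):

object PartialDecodeSketch {
  final class Region // placeholder for the region/buffer a record is decoded into

  trait PartialDecoder {
    // Decode only the key fields of the next record, allocating space for the
    // full record and returning its offset.
    def readKeyFields(region: Region): Long

    // Retroactively decode the remaining fields of the record at `offset`.
    def readRemainingFields(region: Region, offset: Long): Unit

    // Skip the remaining encoded bytes of the current record without decoding them.
    def skipRemainingFields(): Unit
  }

  // Usage sketch: pay for the non-key fields only when the key actually matches.
  def readIfMatching(dec: PartialDecoder, region: Region, keyMatches: Long => Boolean): Option[Long] = {
    val off = dec.readKeyFields(region)
    if (keyMatches(off)) { dec.readRemainingFields(region, off); Some(off) }
    else { dec.skipRemainingFields(); None }
  }
}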
But how does the source skip over an irrelevant record efficiently? I think we could write a method that skips the encoded bytes of a value more efficiently than we could read them, because we never need to look at the contents of a string or of an array of fixed-size elements.
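A rough sketch of what that skip could look like, assuming a length-prefixed encoding (the InputBuffer and EType names are made up for illustration, and missingness is ignored):

object SkipSketch {
  trait InputBuffer {
    def readInt(): Int
    def skipBytes(n: Int): Unit
  }

  sealed trait EType
  case object EInt64 extends EType
  case object EFloat64 extends EType
  case object EString extends EType
  case class EArray(elem: EType, fixedElemSize: Option[Int]) extends EType

  def skip(t: EType, in: InputBuffer): Unit = t match {
    case EInt64 | EFloat64 =>
      in.skipBytes(8)                            // fixed-size value: skip without inspecting the bytes
    case EString =>
      in.skipBytes(in.readInt())                 // read only the length prefix, then jump past the contents
    case EArray(_, Some(elemSize)) =>
      in.skipBytes(in.readInt() * elemSize)      // fixed-size elements: one jump for the whole array
    case EArray(elem, None) =>
      var n = in.readInt()
      while (n > 0) { skip(elem, in); n -= 1 }   // variable-size elements: skip them one at a time
  }
}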
A Matrix Table Index
When a matrix table is written, it could optionally produce an index by tracking the offset and key of each record in a partition as the records are written. After writing all the records, the keys and offsets are written, sorted, to a separate file, one per partition. This works best if the dataset is already sorted by the key. If there are enough records in a partition to make it worth it, we can build a b-tree for that partition.

To produce a dataset-wide indexing data structure, we’d want the write to be a tree-reduce that builds the b-tree’s internal nodes as it reduces. Assuming the index data is quite large, we’d want to write each internal node of the b-tree as a separate GS file. There’s a balance here between file size and the amount of data we need to collect on a single machine. Having a b-tree distributed across a bunch of GS files seems complicated, too.
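To make the per-partition piece concrete, the index entries might look something like this (all names illustrative; the b-tree and tree-reduce parts are not sketched here):

object PartitionIndexSketch {
  // One (key, offset) pair per record, collected as the partition is written.
  case class IndexEntry[K](key: K, fileOffset: Long)

  final class PartitionIndexWriter[K] {
    private val entries = scala.collection.mutable.ArrayBuffer.empty[IndexEntry[K]]

    // Called once per record, with the record's key and its offset in the partition file.
    def record(key: K, offset: Long): Unit =
      entries += IndexEntry(key, offset)

    // After the last record: sort by key (a no-op if the partition was already
    // key-sorted) and return the entries for writing to the sidecar index file
    // or b-tree leaves.
    def finish(implicit ord: Ordering[K]): IndexedSeq[IndexEntry[K]] =
      entries.sortBy(_.key).toIndexedSeq
  }
}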