Should RVD be keyed?

I’m having a bit of difficulty finding the right way to extend the ordered join implementation to Table. Joins on Table and MatrixTable should share almost all of their code. It currently seems to me like the most natural way to achieve that is to implement joins on RVD. The only problem is that RVD isn’t keyed, so joins don’t make sense.

One option would be to put keys in the parameters of RVD.join, but that doesn’t feel right. As far as I’m aware, all of our uses of RVD carry key information already, so it seems natural to add keys to RVD directly. Tim mentioned on slack that we might want to allow MatrixTable to be backed by general RVD, in which case it definitely seems like we’d want RVD to be keyed.

It seems reasonable to move the OrderedRVDType member from OrderedRVD to RVD (and rename OrderedRVDType to RVDType). It’s a little odd to have a partition key for something which isn’t partitioned by keys, but I think the point is that it could be partitioned by keys, and that we want to repartition in order to join with an OrderedRVD with compatible keys.

If we did this, my join strategy would be:

  • Add a method to RVD called conformToPartitioner (or something) which takes an OrderedRVDPartitioner compatible with the RVDs key schema. If the RVD is unordered, this will do a shuffle-sort; if it is already ordered it will do a narrow-dependency repartitioning. In both cases, any data falling outside the new partitioner’s partitions get dropped.

  • Add a join method to RVD. If both left and right RVDs are already ordered, choose a new partitioner (depending on whether the join is left, outer, etc.), repartition both using the previous method so they now have a common partitioner, then use zipPartitions and the existing iterator joins.

    If they aren’t both ordered, then we might choose the new partitioner differently (e.g. if left is ordered and right is not, just use lefts partitioner), and repartitioning the unordered one will require a shuffle, but otherwise the idea is the same.

As an aside, I think that the conformToPartitioner method should be put into the MatrixIR and TableIR. That way the optimizer could bubble repartitioning steps up as early as possible. In effect, this just means using knowledge of the later steps of the pipeline to make smart choices about partitioning in the beginning, to avoid the need to repartition as much as possible. E.g. the inputs to a join should be computed with the same partitioner in the first place, so that no shuffling is needed.

Before I start making fundamental changes, I’d like feedback on:

  • Should RVD carry a list of key fields?
  • Should RVD also carry a list of partition key fields? (In which case RVD and OrderedRVD have the same type data, which makes things easier).
  • Is there a better approach to making a common implementation of joins for Table and MatrixTable?

I think I’ve changed my mind on this.

When I asked this question on Slack before writing the previous post, Dan commented that he wasn’t sure what keys really are anymore. My first thought was that keys are some metadata kept with RVDs that tell you how joins (and other methods like groupByKey) should behave, which is why it made sense that all RVDs should have keys.

While I think that is the right answer for what keys mean on Table and MatrixTable, now I think I like a different answer better for the lower-level RVD: keys are metadata on an OrderedRVD that tells you what ordering the data is guaranteed to conform to. Given this view, I think all operations on RVD which require keys—such as joins—should take the key information as parameters, since these join-keys are semantically unrelated to the sort-keys. The implementation of these operations on OrderedRVD can use the knowledge of the order guarantee to choose a more efficient algorithm: if the join key is a prefix of the sort key, a merge join can be used, otherwise either the OrderedRVD needs to be resorted first or a hash join can be used.

If we’re considering dropping the requirement that the RVD of a MatrixTable is always ordered, I think it will be helpful to clarify the semantics of RVD. What I’m proposing here is that RVD (including OrderedRVD) is like an RDD and not a PairRDD. We could always add convenience classes KeyedRVD and KeyedOrderedRVD that are lightweight pairs of an RVD and an array of key fields, that are more like PairRDD and that collect those functions which depend on a choice of key.

I have some thoughts:

  1. I think having a non-keyed RVD is a useful abstract even if we aren’t directly using it right now, although I could be convinced otherwise. In particular, when things aren’t sensibly keyed, we set key=[] which is wrong because if we enforced such a partitioning, everything would go into the same partition (which would usually catastrophic). So I think things like import_table should have key=None by default.

  2. I don’t think join should be an operation on RVD. The presence of keys is only part of the problem (I don’t think I would want it on an abstract KeyedRVD, either): an RVD is not a joinable object because join needs additional information about the partitioning and ordering of the data to be implementable in a distributed setting. So RVD.join(right: RVD, …) doesn’t make sense because you can’t join two general RVDs, e.g., the ordered join strategies requires both sides be ordered.

  3. RDD was a user-facing structure that was built to be convenient to use. I think of RVD as a backend structure and the target of the query optimizer. The main principle I follow when doing IR backend design is that everything that you want to optimize must be explicit, not easy to use. Repartitioning datasets is the most expensive operation we run, so it should never be that it can be applied implicitly. Therefore, I don’t want join to automatically order things in the background. The only join operations we should write should be efficient ones and they should be correctly typed. Therefore, OrderedRVD.join(OrderedRVD, …) is OK, but RVD.join(RVD, …) is not.

  4. My sense is that everywhere it is possible and efficient to implement join, the specialized classes will carry information about partitioning and ordering in a way that will be sufficient to implement the natural join interface.

3 really helps with how I should think about design in the lower layers, thanks!

This is a semi-non-sequitur, but I think the notion of “preserves partitioning” is wrong as we use it/call it. In particular, OrdreredRVD depends on both the order and the partitioning, but we assume we don’t change the ordering in mapPartitionsPreservesPartitioning. Perhaps we want the operation to mean both for now. I don’t think we have something that changes the ordering but not the partitioning (e.g. permuting within partitions) but we do have operations that preserve the partitioning but not the ordering (coalesce/repartition) … although that is never going to be a packaged operation at the RVD level.

I think you mean operations that preserve the ordering but not the partitioning.

I suppose if we want to be completely precise we could rename it monotonicMapPartitionsPreservesPartitioning.