TSORTED in OrderedRVD needs to go


#1

This is no longer safe. For example, if you have a large number of rows with the same key, the queue in OrderedRVD.localKeySort will exhaust memory. The T-sorted path should be removed. The difficulty is that, if the keys are K-sorted but the partition key does not break up along partition boundaries, it would be nice to “fix up” the partition edges to avoid a shuffle. rangesAndAdjustments does this, but there are cases where it must reduce the key ordering to TSORTED. I think if OrderedRVPartitionInfo carries the min/max for K instead of T, then fixing this will be straightforward. There might be some additional work to figure out the correct bounds after the fix-up.
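
To illustrate the memory concern, here is a simplified Python model of a local sort over rows that arrive ordered by partition key only. It is not the actual OrderedRVD.localKeySort code; the row shape `(pk, rest_of_key)` and both function names are made up for the sketch, and it assumes the buffered unit is the run of rows sharing a partition key.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple

# Hypothetical row shape: (partition key, rest of the key).
Row = Tuple[int, str]


def local_key_sort(rows: Iterable[Row]) -> Iterator[Row]:
    """Sort by full key, assuming rows already arrive sorted by partition key."""
    for _, group in groupby(rows, key=lambda r: r[0]):
        buffered = list(group)  # every row with this partition key is held at once
        buffered.sort()         # order by the full key within the group
        yield from buffered


def peak_buffer_size(rows: Iterable[Row]) -> int:
    """Largest number of rows the sort above would hold in memory at one time."""
    sizes = (sum(1 for _ in g) for _, g in groupby(rows, key=lambda r: r[0]))
    return max(sizes, default=0)
```

The peak buffer size equals the size of the largest group, so a single heavily repeated partition key determines the memory requirement no matter how the data is partitioned.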

Note, T here is an old convention that refers to the partition key. We should really say PKSORTED.


#2

We see a lot of VCFs that aren’t sorted by alleles, though. We need to make sure we don’t trigger shuffles everywhere.


#3

A possible solution to both problems is to count the number of keys with the same PK that would need to be sorted, and shuffle if that count exceeds some threshold.
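
A rough sketch of that heuristic in Python, under one possible reading: "the count" is the size of the largest run of rows that share a PK and are not already in full-key order. The names `max_unsorted_group`, `should_shuffle`, and `threshold` are hypothetical, and rows are modeled as `(pk, rest_of_key)` tuples arriving sorted by PK.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical row shape: (partition key, rest of the key).
Row = Tuple[int, str]


def max_unsorted_group(rows: Iterable[Row]) -> int:
    """Size of the largest same-PK run whose rows are out of full-key order.

    Streams over the rows, keeping only the previous row and a few counters.
    """
    worst = 0
    size = 0
    unsorted = False
    prev: Optional[Row] = None
    for row in rows:
        if prev is not None and row[0] == prev[0]:
            size += 1
            if row < prev:  # full key went backwards within this PK
                unsorted = True
        else:
            if unsorted:
                worst = max(worst, size)
            size = 1
            unsorted = False
        prev = row
    if unsorted:
        worst = max(worst, size)
    return worst


def should_shuffle(rows: Iterable[Row], threshold: int) -> bool:
    """Shuffle only when a local sort would have to buffer too many rows."""
    return max_unsorted_group(rows) > threshold
```

With this reading, a VCF that is unsorted by alleles only within small per-locus groups stays on the local-sort path, while a large unsorted group falls back to a shuffle.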