General MatrixTable keys

I’m slightly worried* about the transition to generic keys in MatrixTable and I wanted to start a thread to talk through some of the options.

*I’m not worried we can get it done, but worried about finding an efficient strategy to get where we want to be.

I admit, I made a mistake in designing the region value row representation when converting from annotations to region values in the last step. I’m happy with the generality of OrderedRVD, and in retrospect I should have done the same thing with MatrixTable. I followed the (v, (va, gs)) representation to closely to make the conversion between rdd <-> rdd2 easy instead of avoiding a second round of conversion.

However, I knew there would need to be a second round of conversion because I think eventually manipulating region values explicitly in Scala needs to go away for IR that manipulate region values, and I was hoping the conversion to generic keys and building things with the IR could happen at the same time.

My proposed MatrixType interface for generic key is here: https://github.com/hail-is/hail/commit/d72c0e84bdea76228fd674dae9370af166c752df#diff-191ea9c8cfc065f421b851203ddcd2a5R44

  • Option A: So, that suggests, instead of pushing on generic keys, we jump on the compiler and help @dking get it to the point where AST.toIR is complete, and then make a plan for key conversion with that infrastructure in place.

  • Option 1: This is the direct option. We need to change the MatrixTable rvRowType from the legacy 4-tuple (pk, v, va, gs) to row ++ entries. There are two ways to do this:

    • Option 1A: have the rvRowType include the row indexed-fields as top-level fields, along with a single additional “hidden” entries field of type Array[entryType]. This is the design I envisioned when building OrderedRVD and the row keys map exactly onto OrderedRVD as written.

      This disadvantage of this approach is you have to generate a unique identifier for, and rename as necessary, the entries field. If you call it entries, what happens when a user inserts a new field named entries? This is all a bit tedious.

      An alternative is to change Field.name for structs from a String to and abstract Identifier class (and identifiers throughout in the IR). It would look something like:

       abstract class Identifier
       case class StringId(name: String) extends Identifier { def toString = name }
       case class InternalId(name: String) extends Identifier {
         def toString = "<" + name + ">"
       }
      

      and use InternalId(“entries”) as the entries field.

      or even

       case class Entries extends Identifier { def toString = "<entries>" }
      

      Major downside (which is universal): all the MatrixTable routines that rewrite the row will need to get adapted to this new rvRowType.

    • Option 1B: have the rvRowType go from (pk, v, va, gs) to (va, gs) by pushing the key into the va field. I think this is @tpoterba’s favored strategy.

      The attraction here is this is a less radical change to rvRowType and the eval context* in which expressions are evaluated, and therefore will require fewer code changes.

      *Option 1A does can expose the same interface at the level of the eval context by treating the entire row as the va field.

      The downside to this is that the row partition key and key are nested inside the va field of the row region value. Currently, OrderedRVD does not support nested keys:

       class OrderedRVType(
         val partitionKey: Array[String],
         val key: Array[String], // full key
         val rowType: TStruct) ...
      

      The above key fields would need to change to Array[List[String]]. WritableRegionValue.setSelect, OrderedRVType.selectUnsafeOrdering, and TStruct.select would need to be updated. OrderedRVPartitioner would also need some work.

      Then updating the MatrixTable routines resolves around: not building the first two fields, and updating the handling of keys where necessary.

    I am slightly hesitant to decide between these two options based on the ease of making the change, because these will determine the final form of the rvRowType going forward. If there were no other constraints, what would the right choice be?

  • Option 2: Leave the rvRowType as is. Put the partition key as a struct in pk and the remainder of the key as a struct in v. Then, for MatrixTable.rvd, we always have partitionKey = Array(“pk”) and key = Array(“pk”, “v”).

    When referring to a symbol, have AST.to_hql generate references to the right prefix. @tpoterba already has support for this in api2 with, e.g., construct_reference(…, prefix=‘v’). What other changes are necessary?

    pk needs to be added to the EvalContext for most commands. The existing implementation of most MatrixTable methods are unchanged. This is the main advantage of this approach!

    The final piece is that we want to be able to update the keys either by annotating new values or changing the keys with the equivalent of key_by.

    For annotateVariantsExpr, we could allow three prefixes: pk, v and va, and build three groups of inserters.

    annotateSamplesExpr could do the same thing, or it might be easier to fuse sampleIds and sampleAnnotations and provide a routine to extract (as an annotation) the sample ID from the sample annotations since the situation there is so much less complex.

    Finally, key_by would have to rearrange the fields in the row, but that should be straightforward.

    Then we have breathing room to get Option A in place and then change the rvRowType to 1A or 1B when we IR-ify MatrxTable.

What other options are we not considering? What’s Option 3?

General thought: not permitted nested fields to be keys seems annoying for the user.