File format changes

Here is the list of changes I want to pile up for the next file format change. Hopefully this will bring us through 0.2, but if we need to force another reimport before we go out of 0.2 beta, that’s life.

  • split row annotations and entries on disk so we can just load the row annotations if the entries are not used.
  • have “encoded types” and store the row encoded type in the row store metadata. To start, we could have a placeholder “default” encoded type which corresponds to the current encoding.
  • store information about compression and encoding so we can plug in different schemes: no LZ4, other compression schemes, pack, no encoding, etc. This is row store level metadata.
  • compound keys in matrix table. Matrix type should have four structs (global, row, col, entry), plus a row partition key, a row key, and a col key. We should have a parser for matrix type and store that in the MatrixTable metadata.
  • clean up metadata: remove the split stuff. On the code side, there are a ton of redundant metadata classes now: VDSMetadata, VSMMetadata, VSMFileMetadata, etc.
  • partitioning should include upper and lower bounds for every partition (note, not intervals, because the upper bound should be inclusive). This is possible now that OrderedRDD is gone.
  • call should include phase information, or at least a placeholder in the format for phase information, even if it isn’t created or used yet. For more, see: Haploids and generic genotypes
  • store the sample annotations as a row store instead of JSON
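
To make a few of the bullets above concrete, here is a rough sketch of what row store level metadata could look like with an encoded type, a compression codec, and inclusive lower/upper bounds per partition. Everything here is made up for illustration (RowStoreMetadata, PartitionBounds, the field names, the "default" and "lz4" strings, and the tuple keys); it is not existing Hail code. It also assumes the partitions are sorted by key and do not overlap.

```python
import bisect
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple

Key = Tuple  # hypothetical: a compound key, compared lexicographically


@dataclass
class PartitionBounds:
    lower: Key  # smallest key in the partition (inclusive)
    upper: Key  # largest key in the partition (inclusive)


@dataclass
class RowStoreMetadata:
    encoded_type: str = "default"  # placeholder name for the current encoding
    codec: str = "lz4"             # hypothetical codec name; could be "none", etc.
    bounds: List[PartitionBounds] = field(default_factory=list)

    def partition_for(self, key: Key) -> Optional[int]:
        """Index of the partition whose [lower, upper] contains key, else None."""
        uppers = [b.upper for b in self.bounds]
        i = bisect.bisect_left(uppers, key)  # first partition with upper >= key
        if i < len(self.bounds) and self.bounds[i].lower <= key:
            return i
        return None

    def to_json(self) -> str:
        # Tuples are written as JSON arrays.
        return json.dumps(asdict(self))


meta = RowStoreMetadata(bounds=[
    PartitionBounds(("1", 100), ("1", 500)),
    PartitionBounds(("1", 501), ("2", 50)),
])
```

With both bounds inclusive, a key equal to a partition's upper bound (here ("1", 500)) resolves to that partition and not to the next one, and a key that falls below the first lower bound or above the last upper bound resolves to no partition.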

Did I miss anything? Also see: Stabilizing the file format, but I think everything there is included here.

Once the api2 liftover calms down, I will start farming these out. @tpoterba is already working on compound keys in MatrixTable.

Couple more things:

  • the partitioner type should be stored in the metadata so we can add, say, a HashedRVD
  • col data should not be stored as part of the metadata. Right now we load it when we load the matrix metadata, but it is simply discarded if, say, we drop columns. ReadMatrix shouldn’t carry a MatrixLocalValue (and perhaps that class should just be folded into MatrixValue). This is largely enabled by storing the column data as a row store instead of JSON.
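
A minimal sketch of the column data point, assuming the metadata only records where the column row store lives and the columns are read on first use. LazyMatrixRead, the "cols_path" key, and the loader function are hypothetical names for illustration, not the actual ReadMatrix or MatrixValue classes.

```python
from typing import Callable, Dict, List, Optional


class LazyMatrixRead:
    """Hypothetical reader that holds only metadata until columns are requested."""

    def __init__(self, metadata: Dict, load_cols: Callable[[str], List[Dict]]):
        self.metadata = metadata     # small: types, keys, paths
        self._load_cols = load_cols  # reads the column row store from disk
        self._cols: Optional[List[Dict]] = None

    def cols(self) -> List[Dict]:
        # Load the column data at most once, and only if someone asks for it.
        if self._cols is None:
            self._cols = self._load_cols(self.metadata["cols_path"])
        return self._cols


loads = []  # records every call to the loader, to show when it runs


def fake_load_cols(path: str) -> List[Dict]:
    loads.append(path)
    return [{"s": "sample1"}, {"s": "sample2"}]


mt = LazyMatrixRead({"cols_path": "cols/"}, fake_load_cols)
```

A pipeline that drops columns never calls cols(), so the column data is never read from disk.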

Another one: use + instead of ! for requiredness.
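
For files or type strings written with the old marker, the switch could be a simple rewrite. This is only a sketch: it assumes the marker sits directly in front of a type name and that ! appears nowhere else in a type string, and the example type strings in the usage note are made up.

```python
import re

# Assumption: '!' occurs only as a requiredness marker directly before a type name.
_LEGACY_REQUIRED = re.compile(r"!(?=[A-Za-z])")


def upgrade_requiredness(type_str: str) -> str:
    """Rewrite the legacy '!' requiredness marker to '+'."""
    return _LEGACY_REQUIRED.sub("+", type_str)
```

Under those assumptions, upgrade_requiredness("Struct{a:!Int32,b:String}") gives "Struct{a:+Int32,b:String}", and a string with no marker comes back unchanged.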

We also need to change the vds and kt extensions. mt and table? hmt and ht? ds and df? 2ds and 1ds? Or just hail, with a single read command and the type stored in some metadata, so read_matrix and read_table => read? Ohh, I like that. Or don’t require extensions at all? Thoughts?

No file extension means tensors, etc. don’t get new file extensions, which seems nice.

You get the same advantage by using the same extension for all Hail files, like ‘.hail’ or ‘.hds’. I like that.
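
The single read command floated above could be sketched like this, assuming each dataset is a directory with a small metadata file that records what kind of object it holds. The metadata.json file name, the "kind" field, and the stub readers are all hypothetical; nothing here depends on the file extension.

```python
import json
import os


def read_matrix(path, metadata):
    # Stub standing in for the real matrix reader.
    return ("matrix", path)


def read_table(path, metadata):
    # Stub standing in for the real table reader.
    return ("table", path)


# New kinds (tensors, etc.) would register here instead of getting a new extension.
_READERS = {"matrix": read_matrix, "table": read_table}


def read(path):
    """Open the metadata and dispatch on the stored kind, ignoring the extension."""
    with open(os.path.join(path, "metadata.json")) as f:
        metadata = json.load(f)
    kind = metadata["kind"]
    if kind not in _READERS:
        raise ValueError(f"unknown dataset kind: {kind!r}")
    return _READERS[kind](path, metadata)
```

Either naming choice from the thread (no extension, or one shared extension) works with this, since the dispatch never looks at the path name.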