Here is the list of changes I want to pile up for the next file format change. Hopefully this will bring us through 0.2, but if we need to force another reimport before we go out of 0.2 beta, that’s life.
- split row annotations and entries on disk so we can just load the row annotations if the entries are not used.
- have “encoded types” and store the row encoded type in the row store metadata. To start, we could have a placeholder “default” encoded type which corresponds to the current encoding.
- store information about compression so we can plug in different schemes (no lz4, other compression schemes, pack, no encoding, etc.). This is row store level metadata.
- compound keys in matrix table. Matrix type should have 4 structs (global, row, col, entry), plus a row partition key, a row key, and a col key. We should have a parser for matrix type and store that in the MatrixTable metadata.
- clean up metadata: remove the split stuff. On the code side, there are a ton of redundant metadata classes now: VDSMetadata, VSMMetadata, VSMFileMetadata, etc.
- partitioning should include upper and lower bounds for every partition (note: explicit bounds, not intervals, because the upper bound should be inclusive). This is possible now that OrderedRDD is gone.
- call should include phase information, or at least a placeholder in the format for phase information, even if it isn’t created or used yet. For more, see: Haploids and generic genotypes
- store the sample annotations as a row store instead of JSON
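
To make the row store metadata items a bit more concrete, here is a rough sketch of what that metadata could carry: an encoded type (with the placeholder "default"), a compression setting, and per-partition bounds where both ends are inclusive. It is written in Python only for illustration. Every class, field, and function name here is hypothetical and not an existing Hail class, and using tuples for compound keys is an assumption of the sketch.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical: a compound key modeled as a tuple, compared lexicographically.
Key = Tuple


@dataclass
class PartitionBounds:
    lower: Key  # inclusive
    upper: Key  # inclusive (this is why these are bounds, not intervals)

    def contains(self, key: Key) -> bool:
        return self.lower <= key <= self.upper


@dataclass
class RowStoreMetadata:
    # Placeholder encoded type corresponding to the current encoding.
    encoded_type: str = "default"
    # Hypothetical compression tag; None would mean no compression.
    compression: Optional[str] = "lz4"
    # Assumed sorted by key and non-overlapping.
    partitions: List[PartitionBounds] = field(default_factory=list)

    def partition_of(self, key: Key) -> Optional[int]:
        """Index of the partition whose inclusive bounds contain key, or None."""
        uppers = [p.upper for p in self.partitions]
        # First partition whose upper bound is >= key.
        i = bisect_left(uppers, key)
        if i < len(self.partitions) and self.partitions[i].contains(key):
            return i
        return None


meta = RowStoreMetadata(
    partitions=[
        PartitionBounds(lower=(1, 100), upper=(1, 500)),
        PartitionBounds(lower=(1, 501), upper=(2, 50)),
    ]
)
```

With this layout, a key equal to a partition's upper bound, such as `(1, 500)`, belongs to that partition, and a key that falls outside every partition's bounds resolves to `None`.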
Did I miss anything? Also see: Stabilizing the file format, but I think everything there is included here.
Once the api2 liftover calms down, I will start farming these out. @tpoterba is already working on compound keys in MatrixTable.