File format changes

cseed · January 19, 2018, 4:33pm

Here is the list of changes I want to pile up for the next file format change. Hopefully this will bring us through 0.2, but if we need to force another reimport before we go out of 0.2 beta, that’s life.

split row annotations and entries on disk so we can just load the row annotations if the entries are not used.
have “encoded types” and store the row encoded type in the row store metadata. To start, we could have a placeholder “default” encoded type which corresponds to the current encoding.
store information about compression so we can plug in different schemes: no lz4, other compression codecs, pack, no encoding, etc. This is row-store-level metadata.
compound keys in matrix table. The matrix type should have 4 structs (global, row, col, entry), plus a row partition key, a row key, and a col key. We should have a parser for the matrix type and store the type in the MatrixTable metadata.
clean up metadata: remove the split stuff. On the code side, there are a ton of redundant metadata classes now: VDSMetadata, VSMMetadata, VSMFileMetadata, etc.
partitioning should include upper and lower bounds for every partition (note, not intervals, because the upper bound should be inclusive). This is possible now that OrderedRDD is gone.
call should include phase information, or at least a placeholder for it in the format, even if it isn’t created or used yet. For more, see: Haploids and generic genotypes
store the sample annotations as a row store instead of json
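
As a rough illustration of the row store items above (encoded type, compression info, and inclusive partition bounds kept in metadata), here is a minimal Python sketch. None of these names come from the Hail codebase: `RowStoreMetadata`, `CODECS`, and `find_partition` are hypothetical, and zlib stands in for lz4 only because it ships with the standard library.

```python
import bisect
import json
import zlib
from dataclasses import dataclass, asdict

# Hypothetical codec registry: the name stored in metadata maps to
# a (compress, decompress) pair. zlib is a stand-in for lz4.
CODECS = {
    "none": (lambda b: b, lambda b: b),
    "zlib": (zlib.compress, zlib.decompress),
}


@dataclass
class RowStoreMetadata:
    encoded_type: str       # e.g. the placeholder "default" encoding
    codec: str              # key into CODECS
    partition_bounds: list  # one [lower, upper] pair per partition, both inclusive

    def to_json(self):
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(text):
        return RowStoreMetadata(**json.loads(text))


def write_block(meta, payload):
    """Compress a block of bytes with the codec named in the metadata."""
    compress, _ = CODECS[meta.codec]
    return compress(payload)


def read_block(meta, data):
    """Decompress a block of bytes with the codec named in the metadata."""
    _, decompress = CODECS[meta.codec]
    return decompress(data)


def find_partition(meta, key):
    """Return the index of the partition whose inclusive range holds key, else None."""
    lowers = [lo for lo, _ in meta.partition_bounds]
    i = bisect.bisect_right(lowers, key) - 1
    if i >= 0 and key <= meta.partition_bounds[i][1]:
        return i
    return None
```

Because the upper bound is inclusive, a key equal to a partition's upper bound belongs to that partition, and a key that falls in a gap between partitions belongs to none.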

Did I miss anything? Also see: Stabilizing the file format, but I think everything there is included here.

Once the api2 liftover calms down, I will start farming these out. @tpoterba is already working on compound keys in MatrixTable.

cseed · January 20, 2018, 7:49pm

Couple more things:

the partitioner type should be stored in the metadata so we can add, say, a HashedRVD
col data should not be stored as part of the metadata. Right now we load that when we load the matrix metadata, but it is simply discarded if, say, we drop columns. ReadMatrix shouldn’t carry a MatrixLocalValue. (And perhaps that class should just be folded into MatrixValue). This is largely enabled by storing the column data as a row store instead of json.
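
A small Python sketch of both points, under stated assumptions: the metadata names the partitioner type, and column data is only loaded when something asks for it. All names here (`PARTITIONERS`, `load_partitioner`, `MatrixReader`, the `"ordered"`/`"hashed"` tags, and the metadata layout) are made up for illustration and are not the actual Hail classes.

```python
import zlib


class OrderedPartitioner:
    """Partitions by inclusive [lower, upper] key ranges."""

    def __init__(self, bounds):
        self.bounds = bounds

    def partition_of(self, key):
        for i, (lo, hi) in enumerate(self.bounds):
            if lo <= key <= hi:
                return i
        raise KeyError(key)


class HashedPartitioner:
    """Partitions by a hash of the key that is stable across runs."""

    def __init__(self, n_partitions):
        self.n_partitions = n_partitions

    def partition_of(self, key):
        # crc32 is deterministic, unlike Python's built-in hash() for strings
        return zlib.crc32(repr(key).encode("utf-8")) % self.n_partitions


# Hypothetical registry keyed by the type name stored in the metadata.
PARTITIONERS = {"ordered": OrderedPartitioner, "hashed": HashedPartitioner}


def load_partitioner(metadata):
    spec = metadata["partitioner"]
    return PARTITIONERS[spec["type"]](**spec["args"])


class MatrixReader:
    """Reads metadata eagerly but defers the column data until it is requested."""

    def __init__(self, metadata, load_cols):
        self.partitioner = load_partitioner(metadata)
        self._load_cols = load_cols  # callable that reads the column row store
        self._cols = None

    def cols(self):
        if self._cols is None:
            self._cols = self._load_cols()
        return self._cols
```

A pipeline that drops columns never calls `cols()`, so in this sketch the column store is never read.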

cseed · January 28, 2018, 4:20am

Another one: use + instead of ! for requiredness.

cseed · February 1, 2018, 3:16pm

We also need to change the vds and kt extensions. mt and table? hmt and ht? ds and df? 2ds and 1ds? Just hail with a single read command and store the type in some metadata, so read_matrix and read_table => read? Ohh, I like that. Don’t require extensions? Thoughts?

dking · February 1, 2018, 3:20pm

No file extension means tensors, etc. don’t get new file extensions, which seems nice.

pschultz · February 1, 2018, 3:21pm

You get the same advantage by using the same extension for all Hail files, like ‘.hail’ or ‘.hds’. I like that.
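
To make the single-`read` idea concrete, here is a minimal Python sketch of dispatching on a type recorded in metadata rather than on the file extension. The `metadata.json` file name, the `"kind"` field, and the reader functions are all assumptions for illustration, not the real Hail layout or API.

```python
import json
import os

# Hypothetical registry mapping the kind stored in metadata to a reader function.
READERS = {}


def reader(kind):
    """Decorator that registers a reader function for one dataset kind."""
    def register(fn):
        READERS[kind] = fn
        return fn
    return register


@reader("matrix_table")
def _read_matrix(path, meta):
    # Placeholder: a real reader would build a MatrixTable here.
    return ("MatrixTable", path)


@reader("table")
def _read_table(path, meta):
    # Placeholder: a real reader would build a Table here.
    return ("Table", path)


def read(path):
    """Open any dataset directory; the metadata decides what it is, not the extension."""
    with open(os.path.join(path, "metadata.json")) as f:
        meta = json.load(f)
    kind = meta["kind"]
    try:
        fn = READERS[kind]
    except KeyError:
        raise ValueError(f"unknown dataset kind: {kind!r}") from None
    return fn(path, meta)
```

With this shape, supporting a new kind such as tensors means registering one more reader; neither a new extension nor a new top-level read function is needed.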
