Stabilizing the file format

The state of the file format is strong.

Here are some size measurements I made after I changed the compression scheme to compress across rows:

$ ls -lh  gnomad.genomes.r2.0.2.sites.chr21.vcf.bgz 
-rw-rw-r-- 1 cotton cotton 1.2G Oct  7 01:36 gnomad.genomes.r2.0.2.sites.chr21.vcf.bgz
$ du -sh sites.vds
1.2G	sites.vds

$ ls -lh genotypes.unfiltered.vcf.bgz 
-rw-r--r-- 1 cotton cotton 4.8G Oct  7 02:37 genotypes.unfiltered.vcf.bgz
$ du -sh geno.vds
4.7G	geno.vds

genotypes.unfiltered.vcf.bgz is a shard of the gnomAD exomes. In other words, we basically have parity in file format size for both site files and wide VCF files, even without the improvements from required fields or dependent sized arrays. (I will follow up with 0.1 file sizes. If memory serves, they were 10-20% smaller than the corresponding VCFs.)

There are three changes I’d like to see for early next week (so the gnomAD reimport can take advantage of them):

  1. Don’t split variable-length encoded integers across compressed block boundaries. This will save a bounds check in the varint decode loop (see the sketch after this list). I will make a PR for this.

  2. Push support for required fields into the encoder and decoder. In particular, arrays with required elements shouldn’t emit any missing bits. This should make a measurable difference for VCF files with genotypes. @catoverdrive has already started on this.

  3. Treat all integers as signed in the varint encoding. Currently we treat them as unsigned; this was a mistake on my part. I will PR this as part of (1).
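
For (1) and (3), here is a minimal sketch of what a signed varint codec could look like, assuming a zigzag mapping (as protobuf uses) so small negative values still encode as a single byte. The names are illustrative, not the actual encoder/decoder classes. The payoff of (1) is that when a whole varint is guaranteed not to straddle a compressed block boundary, the per-byte availability check can be hoisted out of the decode loop below.

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream}

    object ZigZagVarInt {
      // zigzag maps signed to unsigned so small negative values stay short:
      // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
      def zigZagEncode(i: Int): Int = (i << 1) ^ (i >> 31)
      def zigZagDecode(u: Int): Int = (u >>> 1) ^ -(u & 1)

      // LEB128-style varint: 7 payload bits per byte, high bit is the continuation flag
      def writeVarInt(out: ByteArrayOutputStream, i: Int): Unit = {
        var u = zigZagEncode(i)
        while ((u & ~0x7f) != 0) {
          out.write((u & 0x7f) | 0x80)
          u >>>= 7
        }
        out.write(u)
      }

      def readVarInt(in: ByteArrayInputStream): Int = {
        var u = 0
        var shift = 0
        var b = 0
        do {
          b = in.read()
          u |= (b & 0x7f) << shift
          shift += 7
        } while ((b & 0x80) != 0)
        zigZagDecode(u)
      }
    }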

If you have other suggestions, now is the time to make them for the short term.

Additional things I’d like to do for the 0.2 beta:

  1. We should store the variant annotations and the genotypes separately for matrices. Reading these into the same region value should basically have zero overhead with a struct concat physical type*. That will allow us to automatically avoid loading the genotypes when working just with the variant annotations. This should perform even better than the Parquet counterpart in 0.1.

    *Discussion of physical types coming in another post.

  2. We should store sample annotations in a rowstore instead of JSON. This will be one of the tasks for the region-value-ification of sample annotations.

  3. I’m not sure about this one, but we could use a dictionary encoding for string types. This could be added post-hoc if we have physical encoded types; see the discussion below.
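
To make (3) concrete: dictionary encoding writes each distinct string once and replaces every occurrence with a small integer index, which composes nicely with the varint encoding above. A rough sketch of the idea, not any existing encoder API:

    import scala.collection.mutable

    object DictEncodeStrings {
      // Replace each string with an index into a dictionary of distinct values.
      // For fields like filters or contig names, where a few distinct strings
      // repeat across many rows, the indices compress far better than the raw text.
      def encode(values: IndexedSeq[String]): (IndexedSeq[String], IndexedSeq[Int]) = {
        val dict = mutable.LinkedHashMap.empty[String, Int]
        val indices = values.map(s => dict.getOrElseUpdate(s, dict.size))
        (dict.keys.toIndexedSeq, indices)
      }

      def decode(dict: IndexedSeq[String], indices: IndexedSeq[Int]): IndexedSeq[String] =
        indices.map(dict)
    }

Whether the dictionary would live per compressed block or per partition is exactly the kind of choice that physical encoded types (below) would let us revisit without a format break.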

I’m not sure what we can do to make the file format more easily forward extensible (that is, to continue to be able to read files generated by older versions of Hail while supporting additional features in newer versions of Hail).

One observation is that each type (currently, in my thinking, each virtual type) defines its own decoder. One option would be to have physical encoded types. That would allow us to add new encodings in a forward-compatible manner. As things currently stand, we can only add new encodings by adding new (virtual) types, for example, fixed-length arrays.
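
To sketch what I mean (all names below are hypothetical, not existing classes): the virtual type says what a value is, an encoded type says how it is laid out on disk and carries its own decoder, and the file metadata records which encoded types were written.

    import java.io.DataInputStream

    // Hypothetical: new encodings are added by defining new ETypes; any version
    // of Hail can read a file as long as it knows the ETypes named in its metadata.
    sealed trait EType {
      def decode(in: DataInputStream): Any
    }

    case object EInt32 extends EType {
      def decode(in: DataInputStream): Any = in.readInt()
    }

    case class EArray(elem: EType) extends EType {
      def decode(in: DataInputStream): Any = {
        val n = in.readInt()
        IndexedSeq.fill(n)(elem.decode(in))
      }
    }

    // Same virtual type (String), different on-disk encoding: values stored as
    // indices into a dictionary.
    case class EDictString(dict: IndexedSeq[String]) extends EType {
      def decode(in: DataInputStream): Any = dict(in.readInt())
    }

Under that scheme, something like a dictionary-encoded string or a fixed-length array encoding is just another encoded type rather than a new virtual type.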

One example I’ve been thinking about in this direction is column-stored genotypes: take the array-of-structs we currently have and store it as a struct-of-arrays (with an extra block of missing bits for the structs that were missing in the original formulation). This could also be a physical type for storing the genotypes and would get us most of the standard column-store benefits for genotypes.
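
As a rough illustration of that layout change (the field names are just examples, not the full genotype schema):

    object GenotypeLayout {
      case class Genotype(gt: Int, dp: Int, gq: Int)

      // array-of-structs: each genotype's fields are interleaved
      type RowAoS = IndexedSeq[Option[Genotype]]

      // struct-of-arrays: one array per field, plus a block of missing bits for
      // the structs that were missing in the original array-of-structs layout
      case class RowSoA(
        structMissing: IndexedSeq[Boolean],
        gt: IndexedSeq[Int],
        dp: IndexedSeq[Int],
        gq: IndexedSeq[Int])

      def toSoA(row: RowAoS): RowSoA =
        RowSoA(
          structMissing = row.map(_.isEmpty),
          gt = row.flatMap(_.map(_.gt)),
          dp = row.flatMap(_.map(_.dp)),
          gq = row.flatMap(_.map(_.gq)))
    }

A query that only touches GQ can then skip the GT and DP arrays entirely, which is where most of the column-store win comes from.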

Finally, there are some larger feature additions I think we should leave for post-0.2: an index so we can efficiently seek to a given row, and storing summary statistics of data values to perform fast sketch computations (something to think about, @pschultz) and to do cost estimation in the query optimizer.