MatrixTable file format reference

Hey, have y’all documented the MatrixTable file format, or considered making a utility like parquet-tools?

We haven’t documented the file format (and probably don’t intend to), since our work on encoded types (post 1, post 2) means we’ll keep introducing new ways to encode data, which means external tools that read those files would need constant updates or would break.

However, making a command-line utility like parquet-tools is something we’ve floated before – I think it would be great to have some way to do head or similar operations on our files.

have y’all documented the MatrixTable file format

No. In particular, we don’t want to fix the format, so we can continue to extend it for performance and functionality. Things that get documented have a way of getting stuck, and we don’t want that yet. Documentation would also enable other good things like interoperability, so it’s a trade-off, but the format isn’t yet in a place where we think we can stop innovating on it.

I don’t think documenting it would mean the same thing it means for something like Parquet. Our format is designed to be infinitely extensible: it is basically a directory with a file called metadata.json(.gz) containing a JSON object with a “name” field, and the interpretation of the entire directory is driven by that name. So we can completely redesign the whole thing and stay backward compatible, as long as we keep the old decoders around.
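To make that concrete, here is a minimal Python sketch of how a reader driven by the “name” field could be organized. The `DECODERS` registry and `open_matrix_table` are hypothetical names invented for the example, not Hail’s actual API:

```python
import gzip
import json
import os

def read_metadata(mt_path):
    # The spec lives at the root of the MatrixTable directory as
    # metadata.json or metadata.json.gz.
    for fname in ("metadata.json.gz", "metadata.json"):
        path = os.path.join(mt_path, fname)
        if os.path.exists(path):
            opener = gzip.open if fname.endswith(".gz") else open
            with opener(path, "rt") as f:
                return json.load(f)
    raise FileNotFoundError(f"no metadata.json(.gz) under {mt_path}")

# Hypothetical registry mapping the metadata "name" field to a decoder
# implementation.  Keeping old decoders registered is what lets the format
# be redesigned while staying backward compatible.
DECODERS = {}

def open_matrix_table(mt_path):
    md = read_metadata(mt_path)
    return DECODERS[md["name"]](md, mt_path)
```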

An intermediate solution might be a reference implementation for reading and writing the format. Our backend would presumably be that reference implementation; however, it can’t be called directly: it is a code generator that generates schema-specific codecs to/from a custom unsafe/off-heap value representation … which is itself a pluggable part of our code generator. If other people want to hack on that, awesome, I want to talk to them.

At our scale, if you’re not generating custom code that avoids type-based dispatch and primitive boxing (and, increasingly, exploits column-store layouts and sparsity, which is what we’re working on now), I think it is going to be too slow to be useful.
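To illustrate why generated, schema-specialized code beats generic interpretation, here is a toy Python sketch that compiles a schema into a flat decode function with no per-value type dispatch. Everything here (`make_row_decoder`, the two-type schema vocabulary) is invented for the example; the real generator emits JVM bytecode against an off-heap representation, not Python source:

```python
import struct

def make_row_decoder(schema):
    # Generate a decoder specialized to one schema, so the hot loop has no
    # per-value type dispatch or boxing.  Toy sketch: fixed-width types only.
    fmt_for = {"int32": "<i", "float64": "<d"}
    lines = ["def decode(buf):", "    off = 0", "    row = []"]
    for _field, typ in schema:
        fmt = fmt_for[typ]
        lines.append(f"    row.append(struct.unpack_from('{fmt}', buf, off)[0])")
        lines.append(f"    off += {struct.calcsize(fmt)}")
    lines.append("    return tuple(row)")
    ns = {"struct": struct}
    exec("\n".join(lines), ns)
    return ns["decode"]

decode = make_row_decoder([("pos", "int32"), ("qual", "float64")])
assert decode(struct.pack("<id", 42, 99.5)) == (42, 99.5)
```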

or considered making a utility like parquet-tools?

Basically any of this functionality is a trivial 1-2 lines of Python, but as Tim says, it would be a useful utility. It would be slightly more complicated given the multi-dimensional nature of MatrixTable. What in particular would you want to see?
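For concreteness, here is the kind of one-liner meant above, using Hail’s public Python API (the path is illustrative):

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/dataset.mt')  # illustrative path
mt.rows().show(5)  # a parquet-tools-style "head" over the row data
```

The catch is that even this spins up a full JVM/Spark session, which is exactly what the next reply gets at.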

I’d really like a sans-JVM way to run head. Alternatively, if we could keep Hail/Spark loaded in a background JVM, gradle-daemon style, for these queries, that could make response times reasonable.

I’d really like a sans-JVM way to run Hail!

Quoting @tpoterba from a separate thread:

how our 3-type system works

I’m going to go ahead and collect some links here to think out loud. I’ll come back to this issue when I have time to document my understanding of the file format.

3 levels of types

Virtual

  • Type definition
  • Python equivalents documented at Types?
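As a small illustration of the virtual level: these are the user-visible types, built with their documented Python constructors (the particular struct is made up for the example):

```python
import hail as hl

# Virtual types describe what a value *is*, independent of its in-memory
# (physical) layout or on-disk (encoded) representation.
row_type = hl.tstruct(
    locus=hl.tlocus('GRCh37'),
    alleles=hl.tarray(hl.tstr),
    qual=hl.tfloat64,
)
print(row_type)  # struct{locus: locus<GRCh37>, alleles: array<str>, qual: float64}
```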

Physical

Encoded

Example MTs

Writing an MT
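A quick way to produce a small MT whose on-disk layout can be inspected by hand, using documented Hail API (the parameters and path are arbitrary):

```python
import hail as hl

# Write a small synthetic dataset; the resulting example.mt directory, with
# metadata.json(.gz) at its root, is the format discussed above.
mt = hl.balding_nichols_model(n_populations=3, n_samples=10, n_variants=100)
mt.write('/tmp/example.mt', overwrite=True)
```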

Quoting @cseed:

If other people want to hack on that, awesome, I want to talk to them.

You are talking to a person who is hiring such people. As we discussed in person, I’m quite familiar with off-heap data representations (cf. Offheaping the Read Path in Apache HBase), container file formats (cf. Announcing Parquet 1.0: Columnar Storage for Hadoop), and distributed data processing systems that make use of code generation (cf. Runtime Code Generation in Cloudera Impala).

I don’t know for sure that we will select Hail for our work at Related Sciences, but being able to understand and hack on the Hail implementation would certainly increase the chances that we do.

I appreciate y’all bearing with us as we poke around the code and try to figure out how things work. We definitely appreciate your rapid responses so far.