Encoded types design

ETypes Update

This document gives a more complete overview of the current state and planned
development of encoded types, or ETypes. I describe the current API, which
I believe to be mostly complete, and sketch out proposed or in-progress work
on new ETypes.

What is an EType?

ETypes are our way of interpreting bytes received from or written to some
stream. There probably needs to be a distinction between the EType subclasses
and our notion of an encoded type, as our EType subclasses currently delegate
some elements of encoding to other parts of Hail’s reader/writer stack. For
example, LEB integer compression and uniform compression (like LZ4) are not
currently EType classes.
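
To make the LEB example concrete, here is a minimal Python sketch of unsigned
LEB128, the common variable-length integer scheme that "LEB" usually refers to.
It is an illustration only: the function names are hypothetical, and it is not
Hail's implementation, which may differ (for example in how it handles signed
values).

```python
from typing import Tuple


def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (illustrative only)."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        low7 = value & 0x7F      # take the next 7 bits, least significant first
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: this is the last byte
            return bytes(out)


def decode_uleb128(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one integer starting at offset; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7
```

Small values take one byte and larger values grow seven bits at a time, which
is why this kind of compression pays off when most integers are small.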

EType.scala

From a programming perspective, ETypes are the new way to build encoders
and decoders. To create an EType, we use defaultFromPType or the parser from
some RVDSpec.

When creating encoders or decoders, use the buildEncoder and buildDecoder
methods on the EType object, not the instance methods with identical names.
We cache the results of these functions so we don’t need to compile the same
method over and over again.
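
The caching idea can be sketched in a few lines of Python. Everything here is
hypothetical: the names, the string placeholders for types, and the assumption
that the cache is keyed on an (encoded type, in-memory type) pair. The real
cache key and compilation step live in EType.scala and are not described in
this document.

```python
from typing import Callable, Dict, List, Tuple

# In this sketch an "encoder" is just a Python function from a value to bytes.
Encoder = Callable[[object], bytes]

# Hypothetical cache, keyed on (encoded type, in-memory type) placeholders.
_encoder_cache: Dict[Tuple[str, str], Encoder] = {}

# Records every expensive "compilation" so the effect of the cache is visible.
compile_calls: List[Tuple[str, str]] = []


def _compile_encoder(etype: str, ptype: str) -> Encoder:
    """Stand-in for the expensive step of generating and compiling code."""
    compile_calls.append((etype, ptype))

    def encode(value: object) -> bytes:
        return f"{etype}:{value!r}".encode("utf-8")

    return encode


def build_encoder(etype: str, ptype: str) -> Encoder:
    """Return a cached encoder for this pair, compiling it only on first use."""
    key = (etype, ptype)
    encoder = _encoder_cache.get(key)
    if encoder is None:
        encoder = _compile_encoder(etype, ptype)
        _encoder_cache[key] = encoder
    return encoder
```

Calling build_encoder twice with the same pair returns the same function and
triggers only one compilation; a different pair triggers a second one.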

The EType objects contain the methods used to generate the code needed to
encode/decode the bytes involved. The abstract EType class contains common
functionality, while the subclasses contain internal methods that return Code.

One feature of the implementation is that every invocation of buildEncoder or
buildDecoder for an EType creates a new method. This gives us excellent
visibility into where Hail is spending its time when profiling.

Future ETypes

Our current ETypes map very closely onto the underlying memory representations
of our types. There is no EType notion of a string; we use binary instead.
There are no encoded sets; we use arrays. We would like to add ETypes for
encodings that currently have no PType that represents them. For example:

  • Packed integer encoding like Stream VByte.
  • Transposition of arrays of structs to structs of arrays (local column store).
  • Occupancy list encoding for sparse data.
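
As an illustration of the second item, the following Python sketch transposes
an array of structs into a struct of arrays and back. It operates on plain
dicts and lists rather than on encoded bytes, and the function names are
hypothetical; it only shows the data layout change that such an EType would
perform, not a proposed implementation.

```python
from typing import Any, Dict, List


def aos_to_soa(rows: List[Dict[str, Any]], fields: List[str]) -> Dict[str, List[Any]]:
    """Turn an array of structs into one array per field (column layout)."""
    return {name: [row[name] for row in rows] for name in fields}


def soa_to_aos(columns: Dict[str, List[Any]], fields: List[str]) -> List[Dict[str, Any]]:
    """Inverse of aos_to_soa: rebuild the array of structs from the columns."""
    if not fields:
        return []
    length = len(columns[fields[0]])
    return [{name: columns[name][i] for name in fields} for i in range(length)]
```

Grouping the values of each field together places similar values next to each
other in the stream, which tends to help both per-field encodings and a
general-purpose compressor applied afterwards.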