Encoded types design

Input needed.

The current stack of specs is as follows:

(above: off-heap data structures)
  CodecSpec
    BufferSpec
      BlockBufferSpec
        (below: bytes on disk)

As we introduce encoded types, it makes sense (thanks, Patrick!) to think of this entire stack as the encoded type: an encoded type is the way we interpret bytes on disk.

There is a problem with the current design: the CodecSpec doesn’t have the information it needs to decode the bytes on disk. Instead, it takes the virtual/physical type of the encoded data as an argument.

The right solution is to put the EType in the CodecSpec. This will require a bit of an infrastructural lift, unfortunately.

I’m currently working on implementing some of this. Here’s the high level overview of what I’ve done:

  • Introduce EType, another parallel type hierarchy.
  • ETypes can do the following:
    1. Check whether they can encode/decode a given PType.
    2. Given a compatible PType, generate code to either encode values of
       that PType into an output buffer, or decode encoded bytes from an
       input buffer into values of that PType.
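A minimal sketch of this division of responsibility (all names here are illustrative; Hail’s real ETypes emit staged Code rather than returning plain Scala functions):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Toy stand-ins for the physical and encoded type hierarchies.
sealed trait PType
case object PInt32 extends PType
case object PFloat64 extends PType

trait EType {
  // 1. Can this EType encode/decode the given PType?
  def decodeCompatible(pt: PType): Boolean
  // 2. In Hail these would generate Code; here they return plain functions.
  def buildEncoder(pt: PType): (Any, DataOutputStream) => Unit
  def buildDecoder(pt: PType): DataInputStream => Any
}

case object EInt32 extends EType {
  def decodeCompatible(pt: PType): Boolean = pt == PInt32
  def buildEncoder(pt: PType): (Any, DataOutputStream) => Unit =
    (v, out) => out.writeInt(v.asInstanceOf[Int])
  def buildDecoder(pt: PType): DataInputStream => Any =
    in => in.readInt()
}

// Encode a value with a compatible EType, then decode it back.
def roundTrip(et: EType, pt: PType, v: Any): Any = {
  require(et.decodeCompatible(pt), s"$et cannot decode $pt")
  val baos = new ByteArrayOutputStream()
  val out = new DataOutputStream(baos)
  et.buildEncoder(pt)(v, out)
  out.flush()
  et.buildDecoder(pt)(new DataInputStream(new ByteArrayInputStream(baos.toByteArray)))
}
```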

That’s it. All the Encoder/Decoder codegen responsibility is moving to ETypes. This will make it
straightforward to add new ETypes, since supporting one is as simple as defining how to encode
and decode it.

I’m in the process of replacing EmitPackEncoder/Decoder with EType-generated Encoders/Decoders, and the Encoder side passes all the tests (or at least it did). PR here:

After the above is done, I plan to extend this work by adding the following two ETypes.

  1. An Array of Structs to a Struct of Arrays encoding
  2. Packed integer arrays, possibly a staged, scalar version of Stream VByte.
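To make the first of these concrete, here is a hedged, self-contained sketch of transposing an array of structs into a struct of arrays, the layout such an EType would write (Row and Columns are made-up names, not part of Hail):

```scala
// A toy "struct" and its columnar counterpart.
case class Row(id: Int, score: Double)
case class Columns(ids: Array[Int], scores: Array[Double])

// Array of structs -> struct of arrays: one array per field.
def toSoA(rows: Seq[Row]): Columns =
  Columns(rows.map(_.id).toArray, rows.map(_.score).toArray)

// Struct of arrays -> array of structs: zip the columns back up.
def toAoS(c: Columns): Seq[Row] =
  c.ids.indices.map(i => Row(c.ids(i), c.scores(i)))
```

The point of the transposition is that each field's values become contiguous, which is friendlier to both compression and per-field access.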

ETypes Update

This document is intended to give a more complete overview of the state and further
development of encoded types, or ETypes. I seek to describe the current API, which
I believe to be mostly complete, and to sketch out proposed or in-progress work
on new ETypes.

What is an EType?

ETypes are our way of interpreting bytes received from or written to some
stream. There probably needs to be a distinction between the EType subclasses
and our notion of an encoded type, since our EType subclasses currently delegate
some elements of encoding to other parts of Hail’s reader/writer stack. For
example, LEB integer compression and uniform compression (like LZ4) are not
currently EType classes.

EType.scala

From a programming perspective, ETypes are the new way to build encoders
and decoders. To create an EType, we use defaultFromPType or the parser from
some RVDSpec.

When creating encoders or decoders, EType's buildEncoder and buildDecoder
object methods should be used (there are also instance methods with identical
names). We cache the results of these methods so we don’t need to compile the
same method over and over again.
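The caching can be pictured with the following memoization sketch. This is illustrative only: the real cache keys on the actual EType/PType pair and stores compiled methods, whereas EncoderCache and its string keys are invented for the example.

```scala
import scala.collection.mutable

object EncoderCache {
  // Key: (EType name, PType name); value: the "compiled" encoder.
  private val cache = mutable.HashMap.empty[(String, String), Int => Array[Byte]]
  var compilations = 0 // exposed so the sketch can show compilation happens once

  def buildEncoder(eType: String, pType: String): Int => Array[Byte] =
    cache.getOrElseUpdate((eType, pType), {
      // This block runs only on a cache miss, i.e. the expensive "compile".
      compilations += 1
      (v: Int) => BigInt(v).toByteArray
    })
}
```

Repeated calls with the same key return the same compiled function without recompiling.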

The EType objects contain the methods used to generate the code needed to
encode/decode the bytes involved. The abstract EType class contains common
functionality, while the subclasses contain internal methods that return Code.

One feature of the implementation is that every invocation of buildEncoder or
buildDecoder for an EType creates a new method. This gives us excellent
visibility, when profiling, into where Hail is spending its time.

Future ETypes

Our current ETypes map very closely onto our underlying memory representations
for our types. There is no EType notion of a string; we use binary instead.
There are no encoded sets; we use arrays. We would like to extend this with
ETypes that have no corresponding PType. For example:

  • Packed integer encoding like Stream VByte.
  • Transposition of arrays of structs to structs of arrays (local column store)
  • Occupancy list encoding for sparse data.
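As one hedged sketch of the third idea, an occupancy-list encoding for a sparse array could store only the defined indices and their values, alongside the total length (SparseVec and the function names are illustrative, not a proposed API):

```scala
// Sparse representation: which slots are occupied, and their values.
case class SparseVec(length: Int, indices: Array[Int], values: Array[Double])

// Dense-with-missingness -> occupancy list.
def encodeSparse(dense: Array[Option[Double]]): SparseVec = {
  val present = dense.zipWithIndex.collect { case (Some(v), i) => (i, v) }
  SparseVec(dense.length, present.map(_._1), present.map(_._2))
}

// Occupancy list -> dense-with-missingness.
def decodeSparse(s: SparseVec): Array[Option[Double]] = {
  val out = Array.fill[Option[Double]](s.length)(None)
  s.indices.indices.foreach(i => out(s.indices(i)) = Some(s.values(i)))
  out
}
```

For mostly-missing data this stores O(occupied) entries rather than O(length), at the cost of indirection on decode.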