Encoded types design

Input needed.

The current stack of specs is as follows:

(above: off-heap data structures)
  CodecSpec
    BufferSpec
      BlockBufferSpec
        (below: bytes on disk)

As we introduce encoded types, it makes sense (thanks Patrick!) to think of this entire stack as the encoded type – an encoded type is the way we interpret bytes on disk.

There is a problem with the current design: the CodecSpec doesn’t carry the information it needs to decode the bytes on disk. Instead, it takes the virtual/physical type of the encoded data as an argument.

The right solution is to put the encoded type (EType) in the CodecSpec, so the spec alone describes how to interpret the bytes. This will require a bit of an infrastructural lift, unfortunately.

I’m currently working on implementing some of this. Here’s a high-level overview of what I’ve done:

  • Introduce EType, another parallel type hierarchy.
  • ETypes can do the following:
    1. Check whether they can encode/decode a given PType.
    2. Given a compatible PType, generate code to either encode values of
      the EType into an output buffer, or decode values of the EType from an
      input buffer into a PType.

That’s it. All the Encoder/Decoder codegen responsibility is moving to ETypes. This will make it straightforward to add new ETypes, since supporting one only requires defining how to encode and decode it.
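
To make the two responsibilities concrete, here is a rough sketch in Python (not the real implementation). Everything in it is an assumption made for illustration: the method names can_encode, can_decode, build_encoder and build_decoder, the stand-in PInt32/PFloat64 classes, and the little-endian 4-byte layout of the example EInt32 are all hypothetical. Returning a closure stands in for generating code.

```python
import struct
from abc import ABC, abstractmethod
from typing import Callable, Tuple


class PType:
    """Hypothetical stand-in for a physical (in-memory) type."""


class PInt32(PType):
    pass


class PFloat64(PType):
    pass


class EType(ABC):
    """Describes how values are laid out in the encoded byte stream."""

    @abstractmethod
    def can_encode(self, ptype: PType) -> bool:
        """True if values of `ptype` can be written in this encoding."""

    @abstractmethod
    def can_decode(self, ptype: PType) -> bool:
        """True if this encoding can be read back into `ptype`."""

    @abstractmethod
    def build_encoder(self, ptype: PType) -> Callable[[int, bytearray], None]:
        """Return a function that appends one encoded value to a buffer."""

    @abstractmethod
    def build_decoder(self, ptype: PType) -> Callable[[bytes, int], Tuple[int, int]]:
        """Return a function mapping (buffer, offset) to (value, new offset)."""


class EInt32(EType):
    """Illustrative EType: 4-byte little-endian signed integers (assumed layout)."""

    def can_encode(self, ptype: PType) -> bool:
        return isinstance(ptype, PInt32)

    def can_decode(self, ptype: PType) -> bool:
        return isinstance(ptype, PInt32)

    def build_encoder(self, ptype: PType) -> Callable[[int, bytearray], None]:
        if not self.can_encode(ptype):
            raise TypeError(f"EInt32 cannot encode {type(ptype).__name__}")

        def encode(value: int, out: bytearray) -> None:
            out.extend(struct.pack("<i", value))

        return encode

    def build_decoder(self, ptype: PType) -> Callable[[bytes, int], Tuple[int, int]]:
        if not self.can_decode(ptype):
            raise TypeError(f"EInt32 cannot decode into {type(ptype).__name__}")

        def decode(buf: bytes, offset: int) -> Tuple[int, int]:
            (value,) = struct.unpack_from("<i", buf, offset)
            return value, offset + 4

        return decode


if __name__ == "__main__":
    etype = EInt32()
    buffer = bytearray()
    encode = etype.build_encoder(PInt32())
    encode(7, buffer)
    decode = etype.build_decoder(PInt32())
    print(decode(bytes(buffer), 0))
```

In this shape, adding a new EType would mean writing one more subclass; the compatibility check runs once when the encoder or decoder is built, not per value.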

I’m in the process of replacing EmitPackEncoder/Decoder with EType-generated Encoders/Decoders, and the Encoder side passes all the tests (or at least it did). PR here:

After the above is done, I plan on extending the work by adding the following two ETypes:

  1. An Array of Structs to a Struct of Arrays encoding
  2. Packed integer arrays, possibly a staged, scalar version of Stream VByte.
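
For readers unfamiliar with the two encodings, here is a small Python sketch of the underlying ideas, not of the planned ETypes themselves. The function names are hypothetical, structs are modeled as dicts, and the packing follows the general Stream VByte scheme as I understand it (one control byte per group of four unsigned 32-bit integers, two bits per integer giving its byte length minus one, data bytes kept in a separate little-endian stream). It is purely scalar and unstaged.

```python
from typing import Dict, List, Sequence, Tuple


def aos_to_soa(rows: Sequence[dict], fields: Sequence[str]) -> Dict[str, list]:
    """Transpose an array of structs (dicts) into one array per field."""
    return {f: [row[f] for row in rows] for f in fields}


def soa_to_aos(columns: Dict[str, list], fields: Sequence[str]) -> List[dict]:
    """Inverse of aos_to_soa: rebuild one struct per row."""
    length = len(columns[fields[0]]) if fields else 0
    return [{f: columns[f][i] for f in fields} for i in range(length)]


def _byte_length(value: int) -> int:
    """Smallest number of bytes (1 to 4) that holds an unsigned 32-bit value."""
    if value < (1 << 8):
        return 1
    if value < (1 << 16):
        return 2
    if value < (1 << 24):
        return 3
    return 4


def vbyte_encode(values: Sequence[int]) -> Tuple[bytes, bytes]:
    """Pack unsigned 32-bit integers into (control bytes, data bytes)."""
    control = bytearray()
    data = bytearray()
    for start in range(0, len(values), 4):
        key = 0
        for slot, value in enumerate(values[start:start + 4]):
            if not 0 <= value < (1 << 32):
                raise ValueError(f"{value} does not fit in an unsigned 32-bit integer")
            n = _byte_length(value)
            key |= (n - 1) << (2 * slot)  # 2-bit length code for this slot
            data.extend(value.to_bytes(n, "little"))
        control.append(key)
    return bytes(control), bytes(data)


def vbyte_decode(control: bytes, data: bytes, count: int) -> List[int]:
    """Unpack `count` integers produced by vbyte_encode."""
    out = []
    pos = 0
    for i in range(count):
        key = control[i // 4]
        n = ((key >> (2 * (i % 4))) & 0b11) + 1
        out.append(int.from_bytes(data[pos:pos + n], "little"))
        pos += n
    return out


if __name__ == "__main__":
    rows = [{"pos": 1, "depth": 300}, {"pos": 2, "depth": 70000}]
    columns = aos_to_soa(rows, ["pos", "depth"])
    control, data = vbyte_encode(columns["depth"])
    print(columns, len(control), len(data))
```

The two ideas compose: once the structs are transposed, each integer field becomes a contiguous array that can be packed on its own. The element count has to be stored separately, since the last control byte may describe fewer than four values.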