Interesting ETypes/PTypes we want

Integers

  • Variable-bit-sized, e.g. int2, int4, int8, etc.
  • Variable sized with delta (for values with small dynamic range)
  • Jeff Dean / Daniel Lemire integer encoding (https://arxiv.org/abs/1709.08990)
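
To make the "variable sized with delta" idea concrete, here is a minimal sketch, not a proposed implementation. It stores the minimum value once and packs each offset from it into just enough bits. The names `delta_pack` and `delta_unpack` are made up for illustration, the input is assumed to be non-empty, and a Python int stands in for the packed bit buffer.

```python
def delta_pack(values):
    """Encode non-empty ints as (base, bit width, count, packed bits)."""
    base = min(values)
    deltas = [v - base for v in values]
    # Bits needed for the largest offset; use at least 1 bit per value.
    width = max(1, max(deltas).bit_length())
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * width)
    return base, width, len(values), packed


def delta_unpack(base, width, n, packed):
    """Recover the original values from the packed representation."""
    mask = (1 << width) - 1
    return [base + ((packed >> (i * width)) & mask) for i in range(n)]
```

Values clustered around 1000 then cost 2 bits each plus one shared base, instead of a full-width integer per element.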

Floats

  • Store as integer with a divisor (BGEN!)
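
A rough sketch of the integer-with-divisor idea, assuming a value in [0, 1] quantized to `bits` bits with divisor 2^bits - 1 (the scheme BGEN uses for probabilities, as far as I understand it). `encode_fixed` and `decode_fixed` are hypothetical names.

```python
def encode_fixed(p, bits=8):
    """Quantize a float in [0, 1] to an unsigned integer of `bits` bits."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    divisor = (1 << bits) - 1
    return round(p * divisor)


def decode_fixed(q, bits=8):
    """Map the stored integer back to a float by dividing by the divisor."""
    return q / ((1 << bits) - 1)
```

The round trip is lossy: with 8 bits the decoded value is only accurate to about 1/255, which is the trade-off the ptype would be making.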

Call

  • 2-bit diploid unphased call
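
For a biallelic site, an unphased diploid call has four states (0, 1, or 2 alternate alleles, or missing), so it fits in 2 bits. The sketch below packs four calls per byte. The code assignment (3 = missing) and the function names are assumptions for illustration, not PLINK's or Hail's actual layout.

```python
MISSING = 3  # assumed code for a missing call; 0/1/2 = number of alt alleles


def pack_calls(calls):
    """Pack 2-bit call codes, four per byte, lowest bits first."""
    out = bytearray((len(calls) + 3) // 4)
    for i, c in enumerate(calls):
        if not 0 <= c <= 3:
            raise ValueError("call code must fit in 2 bits")
        out[i >> 2] |= c << ((i & 3) * 2)
    return bytes(out)


def get_call(packed, i):
    """Read the i-th call code without unpacking the whole array."""
    return (packed[i >> 2] >> ((i & 3) * 2)) & 0b11
```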

String

  • Length + pointer (allows O(1) slicing)
  • Factor type (store as int, look up in array of possible values)
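
The factor type could look roughly like this: keep each distinct string once in a `levels` array and store one small integer code per element. `FactorColumn` is a hypothetical name, and sorting the levels is just one possible choice.

```python
class FactorColumn:
    """Strings stored as integer codes into an array of distinct values."""

    def __init__(self, strings):
        self.levels = sorted(set(strings))          # possible values
        index = {s: i for i, s in enumerate(self.levels)}
        self.codes = [index[s] for s in strings]    # one int per element

    def __len__(self):
        return len(self.codes)

    def __getitem__(self, i):
        # Decoding is a single array lookup.
        return self.levels[self.codes[i]]
```

With few distinct values, the codes could themselves use one of the small integer encodings listed under Integers.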

Struct/Tuple

  • Struct by reference
  • InsertFieldsPStruct
  • SelectFieldsPStruct

Array

  • Fixed-length

Locus

Dict

  • Sorted
  • Hash

Set

  • Sorted
  • Hash

Misc

  • Array[Struct] => Struct{arrays…} !!!
  • Union types to support if, makearray, etc.
  • Lazily-decoded values (e.g. BGEN genotypes)
  • Bit sets
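
The Array[Struct] => struct-of-arrays transformation, sketched with plain Python dicts and lists standing in for structs and arrays. `to_struct_of_arrays` and `row_at` are illustrative names, and every row is assumed to have all the listed fields.

```python
def to_struct_of_arrays(rows, fields):
    """Turn a list of structs into one struct holding an array per field."""
    return {f: [row[f] for row in rows] for f in fields}


def row_at(soa, i):
    """Rebuild the i-th struct from the per-field arrays."""
    return {f: column[i] for f, column in soa.items()}
```

Reading a single field across all elements then touches one contiguous array rather than striding over whole structs.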

Feel free to add!

These are great!

Some ideas:

  • An array range ptype.
  • Filtered array: store a pointer to the original array a plus an array x of the indices to keep, so filter(a)[i] becomes a[x[i]].
  • Arrays that cannot be indexed: we can choose this representation when we see that an array is never indexed and doesn’t escape. This is a case where ptype inference can do something clever with analysis. For example, sparse arrays (length + array of values + array of indices). Are there other cases where specializing existing types leads to simplifications? I’m not sure.
  • I want ptypes for values that are already serialized: a reference into one of these will be a pointer plus a base pointer, and internal pointers will be stored as offsets (relative to the base pointer). If we’re going to serialize something, we should push “serializability” down into the ptypes.
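
Here is a small sketch of the filtered-array idea from the list above: the view keeps a reference to the original array and an index array, and element access goes through a[x[i]]. `FilteredArray` and its `keep` predicate are hypothetical names.

```python
class FilteredArray:
    """A filtered view: the base array plus the indices that survive."""

    def __init__(self, base, keep):
        self.base = base  # original array a, not copied
        # Index array x: positions of the elements the predicate keeps.
        self.indices = [i for i, v in enumerate(base) if keep(v)]

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        # filter(a)[i] becomes a[x[i]]
        return self.base[self.indices[i]]
```

No element is copied; the cost is one extra indirection per access and one integer per kept element.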

Some comments:

Store as integer with a divisor (BGEN!)

A BGEN ptype! We should have something that works directly against the decoded BGEN representation. Same for PLINK, but that’s just int2.

Length + pointer (allows O(1) slicing)

This will be a nice performance experiment, but I think maybe we should use length + pointer for strings and arrays everywhere.
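
To show why length + pointer gives O(1) slicing, here is a rough sketch in which a (buffer, start) pair plays the role of the pointer. `StrView` is a made-up name, and offsets are in bytes, not code points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrView:
    buf: bytes    # shared backing storage
    start: int    # "pointer": byte offset into buf
    length: int   # number of bytes in this view

    def slice(self, lo, hi):
        """Return a sub-view in O(1): only start and length change."""
        if not 0 <= lo <= hi <= self.length:
            raise IndexError("slice out of range")
        return StrView(self.buf, self.start + lo, hi - lo)

    def decode(self):
        """Materialize the view as a Python str (this step does copy)."""
        return self.buf[self.start:self.start + self.length].decode("utf-8")
```

Slices of slices compose the same way, and all of them share the one backing buffer.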

Struct by reference

This is a special case of a parameterized “pointer to some other ptype”.