We should separate the interface that types define (e.g. access fields, get lengths of arrays, etc.), which I’m calling virtual types, from particular realizations of those types in `Region`s, which I’m calling physical types.
We should add a `PhysicalType` hierarchy to parallel virtual `Type`. `byteSize`, `alignment` and the value accessors should move from `Type` to `PhysicalType`. Each virtual type should define an interface that must be implemented by any physical type realizing that virtual type. Each virtual type should have a canonical or default physical implementation (which is the only one we use now). Each physical type should know the corresponding virtual type it implements.
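Concretely, the split might look like the following minimal Python sketch. All names here (`VirtualType`, `canonical_ptype`, `PInt32`, etc.) are hypothetical, chosen to illustrate the shape of the hierarchy rather than any actual API:

```python
class VirtualType:
    """The interface a value presents: fields, element types, etc.
    No layout questions are answered here."""
    def canonical_ptype(self) -> "PhysicalType":
        raise NotImplementedError

class PhysicalType:
    """A concrete in-region layout; knows the virtual type it implements."""
    def __init__(self, virtual_type: VirtualType):
        self.virtual_type = virtual_type

class TInt32(VirtualType):
    def canonical_ptype(self) -> "PInt32":
        return PInt32(self)

class PInt32(PhysicalType):
    # byteSize and alignment live on the physical type, not the virtual one.
    byte_size = 4
    alignment = 4
```

The key invariant: virtual types never answer layout questions; they only hand out a default physical implementation, and every physical type points back at the virtual type it realizes.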
Why do this? I imagine a few applications:
- It makes it easy to change the default implementation for a type. For example, if we wanted structs to be pointed to instead of stored inline.
- Operations that return new types can create custom representations for their results. Two examples: inserting into or appending onto a struct can just store a pointer to the original struct and the value being inserted/appended, so there’s no need to copy the struct. And when filtering an array of large values (for example, genotypes), instead of copying the elements that are kept, just make an array of indices of the values being kept and use an extra indirection on access. This would be great for `filter_samples`, for example.
- We can choose optimized physical representations based on the structure of the data, e.g., sparse vs. dense vectors. We could have done this before with a complex `Vector` type, but this gives us infrastructure for easily building such representations. (See the comment about `if` below.)
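To illustrate the filtered-array case, a filter result could be a physical array type that reuses the base array’s storage. A hypothetical Python sketch (`load` and `length` stand in for the real accessors):

```python
class PCanonicalArray:
    """Default representation: elements stored directly."""
    def __init__(self, elements):
        self.elements = elements

    def length(self):
        return len(self.elements)

    def load(self, i):
        return self.elements[i]

class PFilteredArray:
    """Filter result: the base array plus the indices that were kept.
    Elements are never copied; each access pays one extra indirection."""
    def __init__(self, base, kept_indices):
        self.base = base
        self.kept = kept_indices

    def length(self):
        return len(self.kept)

    def load(self, i):
        return self.base.load(self.kept[i])
```

Filtering `[10, 11, 12, 13]` down to positions 0 and 2, say, produces `PFilteredArray(base, [0, 2])`, whose `load(1)` reads 12 straight out of the base storage.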
The reason this works is that our pipelines are small, static, and don’t produce aggregate values in a recursive way. (In the cases where they do, we’d have to canonicalize values to the default representation.)
But @cseed, what about `if`?! Well, if the physical types returned by each arm are different, you can have a physical type that selects between them by storing a pointer to the original physical value as well as a flag indicating which physical type the value actually is. Again, this works because, usually, the structure of the values mirrors the structure of the expressions.
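Such a selecting physical type might look like this sketch (Python, hypothetical names), where the two arms of an `if` produced an inline struct and an appended struct respectively:

```python
class PInlineStruct:
    """Canonical struct: fields stored directly (modeled as a dict)."""
    def load_field(self, value, field):
        return value[field]

class PAppendedStruct:
    """Append result: a pointer to the base struct plus the new field."""
    def load_field(self, value, field):
        base, name, appended = value
        return appended if field == name else base[field]

class PIfUnion:
    """Physical result of an `if` whose arms have different physical
    types: the value carries a flag for which arm ran, plus that arm's
    value unchanged; accesses dispatch on the flag."""
    def __init__(self, p_consequent, p_alternate):
        self.arms = {True: p_consequent, False: p_alternate}

    def load_field(self, tagged, field):
        took_consequent, value = tagged
        return self.arms[took_consequent].load_field(value, field)
```

No arm value is ever canonicalized; only field access pays the dispatch.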
Finally, if, instead of requiring each virtual type to correspond to a physical type in a parallel manner, we allow physical types to implement multiple levels of virtual types, I think we get much of the infrastructure for in-memory column stores for free. It should be relatively easy to apply this to genotypes.
Consider the `gs` from a VSM row:
```
gs: Array[Struct {
  GT: Call,
  ...,
  PL: Array[Int]
}]
```
I want to represent this array-of-struct idiom monolithically with a struct-of-arrays as:
```
gsRep: Struct {
  _missing: Array[Boolean], // a packed bitset
  GT: Array[Call],
  ...,
  PL: Array[Array[Int]]
}
```
Here `_missing` records whether the ith element of the original array was missing (and hence all of its fields). This means like values are stored together and will likely compress better. It should be easier to use special `Array[Int]` encodings or run-length encoding than with the array-of-struct layout. Even if we load the entire object into memory, queries that only touch one (or a few) fields will have better cache behavior and touch less memory.
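The transposition itself is straightforward. A Python sketch, using a plain list of booleans where the real representation would pack `_missing` into a bitset, and dicts where the real representation would be region values:

```python
def to_struct_of_arrays(gs):
    """Transpose an array-of-structs (None marks a missing genotype)
    into a struct-of-arrays with an explicit missingness mask."""
    rep = {"_missing": [], "GT": [], "PL": []}
    for g in gs:
        missing = g is None
        rep["_missing"].append(missing)
        rep["GT"].append(None if missing else g["GT"])
        rep["PL"].append(None if missing else g["PL"])
    return rep
```

After transposing, each column (`GT`, `PL`, ...) is a homogeneous array that can be encoded and scanned independently.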
The question is, how do we represent the struct value `gs[i]` at runtime? This is an aggregate value, and is normally stored as a `Long`. However, I’m reluctant to create a region value every time we create this object as an intermediate. I see two complementary options:
- Rewrite virtual accesses in the IR to make use of the physical layout. For example, rewrite `gs[i].GT` as `if (gsRep._missing[i]) NA: Int else gsRep.GT[i]`. This works fine as long as we don’t manipulate the intermediate struct value directly, as in `if (cond) gs[i] else gs[j]`.
- Relax the condition in the IR compiler that aggregate types must be represented as `Long`: let the runtime representation be determined by the physical type. In this case, an intermediate value could be represented by a pair of stack slots: the offset to `gsRep` and the index `i`.
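For the first option, the rewritten access never materializes the intermediate struct at all. A Python sketch of the rewrite (hypothetical names, modeling `gsRep` as a dict of column arrays and `NA` as `None`):

```python
def load_gt(gs_rep, i):
    """The IR access gs[i].GT rewritten against the struct-of-arrays
    layout: check the missingness column, then read the GT column."""
    if gs_rep["_missing"][i]:
        return None  # NA
    return gs_rep["GT"][i]
```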