We allow Hail values to be missing. In general, this means that if I have, say, a value of Hail type tint32, I can expect the value to be any 32-bit integer or NA (missing). Whether or not a Hail value is missing is a runtime concept: we can’t expect to know, at compile time, the actual value of a Hail expression. What we can say at compile time, however, is that a value is guaranteed to be non-missing if the expression is, say, a literal, non-NA expression, or if the result of a computation is guaranteed to be non-missing regardless of whether any of its inputs are missing, as with hl.is_defined(expr). If we can make that guarantee at compile time, we can treat the type as “required”, which just means we don’t have to track the extra bit of information about whether or not this value is NA. This looks different in different contexts:
In the (JVM) code generator, we carry around an EmitTriplet, which stores the generated code that computes the value. It basically consists of two staged primitives: a Code[Boolean], which tells us whether the value is NA or not, and a Code[_], which is what the value of the evaluated IR would be if it’s not NA. (As an aside, we call the second thing the value, which I think is fine as shorthand but semantically misleading, because the actual Hail value as we think of it throughout the rest of the stack includes the missingness; this is probably more correctly thought of as "value, if not NA".) Knowing that the type of a computation is required lets us skip tracking the Code[Boolean], although in practice we don’t currently do this because requiredness in the IR has caused type unification problems.
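To make the idea of eliding the missingness check concrete, here is a minimal Python sketch of a toy code generator. It is not Hail's EmitTriplet or its JVM backend: the Triplet class and the emit_ref, emit_add, emit_is_defined, and compile_triplet helpers are hypothetical names invented for this illustration, and "generated code" is just Python source text with None standing in for NA.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    """Toy stand-in for a (missing, value) pair of staged code; not Hail's class."""
    missing: str    # source text that evaluates to True when the result is NA
    value: str      # source text for the value, meaningful only when not NA
    required: bool  # True if the result is provably never NA

def emit_ref(name: str, required: bool) -> Triplet:
    # A required input never needs a runtime check, so its check is the constant False.
    missing = "False" if required else f"({name} is None)"
    return Triplet(missing, name, required)

def emit_add(a: Triplet, b: Triplet) -> Triplet:
    # The sum is required only if both operands are required.
    checks = [t.missing for t in (a, b) if not t.required]
    missing = " or ".join(checks) if checks else "False"
    return Triplet(missing, f"({a.value} + {b.value})", a.required and b.required)

def emit_is_defined(t: Triplet) -> Triplet:
    # The result is required even when the input is not.
    return Triplet("False", f"(not {t.missing})", True)

def compile_triplet(t: Triplet, params: list):
    """Build a Python lambda from a triplet; None plays the role of NA."""
    args = ", ".join(params)
    if t.required:
        src = f"lambda {args}: {t.value}"  # missingness check elided
    else:
        src = f"lambda {args}: None if {t.missing} else {t.value}"
    return src, eval(src)

src_opt, f_opt = compile_triplet(
    emit_add(emit_ref("x", False), emit_ref("y", True)), ["x", "y"])
src_req, f_req = compile_triplet(
    emit_add(emit_ref("x", True), emit_ref("y", True)), ["x", "y"])
src_def, f_def = compile_triplet(emit_is_defined(emit_ref("x", False)), ["x"])

print(src_opt)  # lambda x, y: None if (x is None) else (x + y)
print(src_req)  # lambda x, y: (x + y)
print(src_def)  # lambda x: (not (x is None))
```

In this sketch the required version of the addition carries no check at all, which is the saving described above.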
When we store values inside containers, the container needs to be able to track these two pieces of information as well: whether or not the stored value is NA, and what the value is, if not NA. We currently do this by allocating a header to store the bit of information about missingness for each element. If a given element type is required, we can avoid having to store the missingness bit in the container.
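The space saving can be illustrated with a simplified array encoding. This is a sketch under stated assumptions, not Hail's actual memory layout: the encode_int32_array function is hypothetical, it ignores alignment, and it assumes a 4-byte length, one missingness bit per element rounded up to whole bytes, and 4 bytes per element.

```python
import struct

def encode_int32_array(values, element_required: bool) -> bytes:
    """Toy layout: 4-byte length, optional missing-bit header, 4 bytes per element."""
    n = len(values)
    out = bytearray(struct.pack("<i", n))
    if element_required:
        # No header at all: a required element type cannot hold NA.
        if any(v is None for v in values):
            raise ValueError("required element type cannot hold NA")
    else:
        header = bytearray((n + 7) // 8)  # one bit per element
        for i, v in enumerate(values):
            if v is None:
                header[i // 8] |= 1 << (i % 8)
        out += header
    for v in values:
        # NA slots still occupy space and hold a placeholder value.
        out += struct.pack("<i", 0 if v is None else v)
    return bytes(out)

print(len(encode_int32_array([1, 2, 3], element_required=True)))      # 16
print(len(encode_int32_array([1, None, 3], element_required=False)))  # 17
```

With three elements the header costs only one byte here, but it is paid for every array of a non-required element type, and reading an element also requires consulting the header first.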
Some concrete examples:
I’m generating code for a
PStruct(true, "a" -> PInt32(true), "b" -> PArray(false, PInt64(true)), "c" -> PArray(true, PFloat64(false)))
- The struct is required. I don’t have to track whether or not it’s going to have an NA value, so I can avoid carrying the Code[Boolean] around for this value in the code generation.
- Fields “a” and “c” have required types; “b” does not. The struct’s memory layout needs to store a missingness bit only for field “b”, since the other fields don’t need one.
- Field “c”'s element type is not required. The memory layout for the array in field “c” needs to allocate a header of missingness bits to track whether or not each element is NA.
- Field “b”'s element type is required. The memory layout for the array in field “b” does not need to allocate any space to track whether elements are NA, since we know that this is not possible.
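The same walk-through can be checked mechanically with a small model of the example type. The ToyPrim, ToyArray, and ToyStruct classes and the missing_bit_report function below are hypothetical names made up for this sketch and are not Hail's physical type classes; they only mirror the required flags written in the PStruct above.

```python
from dataclasses import dataclass

@dataclass
class ToyPrim:
    name: str
    required: bool

@dataclass
class ToyArray:
    required: bool
    element: object

@dataclass
class ToyStruct:
    required: bool
    fields: dict

def missing_bit_report(t, path="<root>"):
    """Yield (path, slots) for each container, listing the slots that need a missing bit."""
    if isinstance(t, ToyStruct):
        yield path, [name for name, ft in t.fields.items() if not ft.required]
        for name, ft in t.fields.items():
            yield from missing_bit_report(ft, f"{path}.{name}")
    elif isinstance(t, ToyArray):
        slots = [] if t.element.required else ["<elements>"]
        yield path, slots
        yield from missing_bit_report(t.element, f"{path}[]")
    # Primitives contain nothing, so they report nothing.

example = ToyStruct(True, {
    "a": ToyPrim("int32", True),
    "b": ToyArray(False, ToyPrim("int64", True)),
    "c": ToyArray(True, ToyPrim("float64", False)),
})

report = dict(missing_bit_report(example))
print(example.required)  # True: no missingness check needed for the struct itself
print(report)
# {'<root>': ['b'], '<root>.b': [], '<root>.c': ['<elements>']}
```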