Requiredness, missingness, and physical type layout

We allow Hail values to be missing. In general, this means that if I have, say, a value of Hail type tint32, I can expect the value to be either any 32-bit integer or NA (missing). Whether or not a Hail value is missing is a runtime concept: we can’t expect to know, at compile time, the actual value of a Hail expression.

What we can sometimes say at compile time, however, is that a value is guaranteed to be non-missing: for example, if the expression is a literal, non-NA expression, or if the result of a computation is guaranteed to be non-missing regardless of whether any of its inputs are NA, like hl.is_defined(expr). If we can make that guarantee at compile time, we can treat the type as “required”, which just means we don’t have to track the extra bit of information about whether or not this value is NA. This looks different in different contexts:

  • In the (JVM) code generator, we carry around an EmitTriplet, which stores the generated code that computes the value. It basically consists of two staged primitives: a Code[Boolean], which tells us whether the value is NA, and a Code[_], which is what the value of the evaluated IR would be if it’s not NA. (As an aside, we call the second thing the value, which I think is fine as shorthand but semantically misleading, because the actual Hail value as we think of it throughout the rest of the stack includes the missingness; this is probably more correctly thought of as "value, if not NA".) Knowing that the type of a computation is required would let us elide the Code[Boolean], although in practice we don’t currently do this because requiredness in the IR has caused type unification problems.

  • When we store values inside containers, the container needs to be able to track these two pieces of information as well: whether or not the stored value is NA, and what the value is, if not NA. We currently do this by allocating a header that stores a missingness bit for each element. If a given element type is required, we can avoid storing that element’s missingness bit in the container.
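
To make the container case concrete, here is a minimal sketch in Python of a toy array encoding. It is not Hail's actual memory layout: the function name encode_int32_array, the 4-byte length prefix, the bit order inside the header, and the use of 0 as a placeholder for NA elements are all assumptions made for illustration. The only point is that the missingness header exists when the element type is optional and disappears when it is required.

```python
import struct
from typing import List, Optional


def encode_int32_array(values: List[Optional[int]], element_required: bool) -> bytes:
    """Toy encoding (hypothetical, not Hail's real layout):
    [int32 length][missingness header, only if elements are optional][int32 elements]
    """
    n = len(values)
    out = bytearray(struct.pack("<i", n))

    if element_required:
        # No header at all: a required element type can never hold NA.
        if any(v is None for v in values):
            raise ValueError("required element type cannot hold NA")
    else:
        # One bit per element, rounded up to whole bytes.
        header = bytearray((n + 7) // 8)
        for i, v in enumerate(values):
            if v is None:
                header[i // 8] |= 1 << (i % 8)
        out += header

    for v in values:
        # NA elements still occupy a slot; its contents are a placeholder.
        out += struct.pack("<i", 0 if v is None else v)
    return bytes(out)
```

With three elements, the required encoding is 4 + 12 bytes, while the optional encoding spends one extra byte on the header, whether or not any element is actually NA.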

Some concrete examples:

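Suppose I’m generating code for a value with the following physical type, where the first argument to each type constructor is its requiredness: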

PStruct(true, 
       "a" -> PInt32(true), 
       "b" -> PArray(false,
                     PInt64(true)), 
       "c" -> PArray(true, 
                     PFloat64(false)))
  • The struct is required. I don’t have to track whether or not it’s going to have an NA value—I can avoid carrying the Code[Boolean] around for this value in the code generation.
  • Fields “a” and “c” have required types; “b” does not. The struct’s memory layout needs to store a missingness bit only for field “b”, since the other fields can never be NA.
  • Field “c”'s element type is not required. The memory layout for the array in field “c” needs to allocate a header of missingness bits to track whether or not each element is NA.
  • Field “b”'s element type is required. The memory layout for the array in field “b” does not need to allocate any space to track whether elements are NA, since we know that this is not possible.
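
The four observations above can be checked mechanically. The following is a rough Python model of the example type, not Hail's PType classes: the PType dataclass and both helper functions are hypothetical names introduced here, and the model only records requiredness, not sizes or alignment.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PType:
    """Hypothetical stand-in for a physical type; only tracks requiredness."""
    kind: str
    required: bool
    element: Optional["PType"] = None          # set for arrays
    fields: Optional[Dict[str, "PType"]] = None  # set for structs


def fields_needing_missing_bit(struct_type: PType) -> List[str]:
    # A struct stores a missingness bit only for fields whose type is optional.
    return [name for name, t in struct_type.fields.items() if not t.required]


def array_needs_element_missing_bits(array_type: PType) -> bool:
    # An array allocates a missingness header only if its elements can be NA.
    return not array_type.element.required


example = PType("struct", True, fields={
    "a": PType("int32", True),
    "b": PType("array", False, element=PType("int64", True)),
    "c": PType("array", True, element=PType("float64", False)),
})

print(example.required)                                          # struct itself
print(fields_needing_missing_bit(example))                       # struct header
print(array_needs_element_missing_bits(example.fields["b"]))     # array in "b"
print(array_needs_element_missing_bits(example.fields["c"]))     # array in "c"
```

Note that the requiredness of a container and the requiredness of its elements are independent: field “b” is an optional array of required elements, while field “c” is a required array of optional elements.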