The future of types

I thought I would outline some ways I see builtin types evolving as we pull things out of Scala into the expression language.

  1. We need to be able to specify that elements of container types (struct fields, array elements, etc.) are required (non-missing). @catoverdrive already has a PR for arrays: https://github.com/hail-is/hail/pull/2356.

    Non-missing arrays will find immediate application for the AD and PL fields in genotypes. Strictly speaking, I think the VCF spec allows arrays with missing elements, but this is not used in practice and isn’t supported by GATK/HTSJDK’s internal format; see, for example, htsjdk.variant.variantcontext.Genotype.getPL:

    public abstract int[] getPL();
    

    I’m imagining something like import_vcf(..., array_elements_optional=True).

    Non-missing struct fields will find immediate application in the representation of complex types like Variant and Locus, whose fields are naturally required.

  2. We should have unsigned integral types as well as 8- and 16-bit types.

  3. We need an n-dimensional array type a: NDArray[T]. It should also support required elements. It should mirror numpy and have a.ndim and a.shape: Array[Int]. Things like agreement of the number of dimensions in index expressions a[i, j, k] should be checked dynamically. It should support slicing, transpose, and basic BLAS operations efficiently. I think this will be sufficient to implement linreg (and maybe even lmmreg) fully in Python (see the sketch after this list).

    Once we have this, we can drop the existing broadcast and 1-level deep vectorization operations on numeric array types.

  4. We should have static fixed-length arrays: Array[T, N]. I’m not sure about the literal syntax. I don’t think this will get much use, but (1) it is a building block for dependently-sized arrays and (2) we also want to support fixed-size variants, like type: Variant(b37, 2). This will allow us to get rid of wasSplit from VariantSampleMatrix.

  5. Dependently-sized arrays. Currently the length of the PL array is stored for each genotype of a variant, but they are all the same: nGenotypes. To avoid this, the size of an array should be able to depend on a previous field (or a simple expression of a previous field) in a structure type. For example:

     struct {
       v: Variant,
       va: ...,
       gs: Array[Struct {
         GT: Call,
         AD: Array[Int; v.nAlleles],
         ...,
         PL: Array[Int; v.nGenotypes]
       }]
     }
    

    The dependently-typed array type’s loadLength accessor will use the size expression to compute the length. I see two difficulties. First, how are the dependent expressions scoped? What’s the “root” of the struct? Second, getLength and accessors will be non-local and will require a pointer not just to the array but to the “root” value.
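
To make item 3 concrete, here’s a minimal sketch of the kind of computation NDArray should enable, with numpy standing in for the proposed type. None of this is existing API; it just illustrates the operations (shape, transpose, matrix multiply, solve) the type would need to support efficiently:

import numpy as np
from scipy import stats

def linreg(y, X):
    # Ordinary least squares: effect sizes, standard errors, t-stats, p-values.
    n, k = X.shape                      # a.shape: Array[Int]
    XtX = X.T @ X                       # transpose + matrix multiply (BLAS gemm)
    Xty = X.T @ y
    beta = np.linalg.solve(XtX, Xty)
    residuals = y - X @ beta
    sigma2 = (residuals @ residuals) / (n - k)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(XtX)))
    t = beta / se
    p = 2 * stats.t.sf(np.abs(t), n - k)
    return beta, se, t, p

# Illustrative usage on random data.
y = np.random.normal(size=100)
X = np.column_stack([np.ones(100), np.random.normal(size=(100, 2))])
beta, se, t, p = linreg(y, X)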

I’ve been thinking a bit about the dependently-sized arrays.

Without dependency, types and expressions behave rather differently. An expression can use variables which are only defined (bound) in a containing expression, so while parsing we have to keep track of the unbound variables and their types. I’ll call such a list of typed variables a context (aka a scope). On the other hand, variables can’t appear in types, so the meaning of a type expression never depends on the containing expression.

With dependency, types behave more like expressions. So a type like Array[Int; v.nAlleles] carries with it a context of free variables, in this case the context is [v: Variant]. It also carries with it a function recording how each “hole” is computed from the context. In this case, the only hole is the length of the array, so we record the function (v: Variant) => (v.nAlleles), where the return type is a singleton tuple since there is only one hole.
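
To make this concrete, here’s one way that bookkeeping could be represented. This is a minimal Python sketch; DependentType, context and fill are names I made up for illustration, not a proposal for the actual implementation:

from typing import Callable, Dict, Tuple

class DependentType:
    # A type together with its context (free typed variables) and a
    # function that computes its "holes" from values of those variables.
    def __init__(self, base: str, context: Dict[str, str],
                 fill: Callable[..., Tuple]):
        self.base = base        # e.g. "Array[Int; <len>]"
        self.context = context  # e.g. {"v": "Variant"}
        self.fill = fill        # fills the holes given the context variables

# Array[Int; v.nAlleles]: one free variable v, one hole (the array length).
# Here v stands for whatever runtime representation of a Variant carries
# nAlleles; a plain dict is used below just so the sketch runs.
ad_type = DependentType(
    base="Array[Int; <len>]",
    context={"v": "Variant"},
    fill=lambda v: (v["nAlleles"],))   # (v: Variant) => (v.nAlleles)

assert ad_type.fill({"nAlleles": 3}) == (3,)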

As another example, the type

Array[Struct {
  GT: Call,
  AD: Array[Int; v.nAlleles],
  ...,
  PL: Array[Int; v.nGenotypes]
}]

still has context v: Variant, but now there are two holes, which are filled by the function (v: Variant) => (v.nAlleles, v.nGenotypes).

Note that even if Array is the only directly dependent type, compound types like Struct and Array can now also be dependent, and hence carry non-empty contexts. A context says “this type will only be meaningful in an environment in which these variables have been given meaning”.

Now the parser can assemble smaller type expressions into larger type expressions by composing hole-filling functions. The context of a compound expression will usually be the union of the contexts of the components (and a parse error is raised if a variable occurs in two sub-contexts with different types), with one exception: the context of a Struct is the union of the sub-contexts minus its fields. For example, when embedding the above type into the larger type

Struct {
  v: Variant,
  va: ...,
  gs: Array[Struct {
    GT: Call,
    AD: Array[Int; v.nAlleles],
    ...,
    PL: Array[Int; v.nGenotypes]
  }]
}

the variable v is removed from the sub-context, because it gets bound to the field v of the parent struct.

Under this proposal, the semantics of type expressions is thus that variables used in sub-expressions are bound to the most recent ancestor struct field of the same name (with an error if the types don’t match).
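
In code, the parser-side rule might look something like this, where a context is just a dict from variable name to type (again, illustrative names, not the actual implementation):

def union_contexts(*contexts):
    # Union of sub-contexts; a parse error if a variable occurs in two
    # sub-contexts with different types.
    combined = {}
    for ctx in contexts:
        for name, typ in ctx.items():
            if name in combined and combined[name] != typ:
                raise ValueError(f"{name} occurs with types {combined[name]} and {typ}")
            combined[name] = typ
    return combined

def struct_context(field_contexts):
    # field_contexts maps each field name to that field's context. A Struct's
    # context is the union of its fields' contexts, minus the variables bound
    # by the struct's own fields.
    ctx = union_contexts(*field_contexts.values())
    for field_name in field_contexts:
        ctx.pop(field_name, None)
    return ctx

# In the example above, gs has context {"v": "Variant"}, and the enclosing
# struct binds v with its own field of the same name, so its context is empty.
assert struct_context({"v": {}, "va": {}, "gs": {"v": "Variant"}}) == {}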

As for the values of these dependent array types, Cotton said one motivation (the only motivation?) was to avoid having to store redundant array lengths. So given a pointer to the field AD in the above type, how does one compute its length? If the array has to store a pointer to some enclosing struct, we haven’t gained much. My feeling is that just as the type Array[Int; v.nAlleles] isn’t well-formed in isolation, a pointer to an AD field should be considered incomplete without a pointer to an enclosing struct that contains the context of the field (the outermost struct in the example). Then the “hole-filling functions” stored in the types provide the way to compute the length of the AD array from the enclosing struct.

One way to encode this might be to support reading from a region value only through “concrete” types (types with empty contexts). Array[Int; v.nAlleles] isn’t concrete, so we can’t make use of a direct pointer to a value of that type. On the other hand, the outermost struct is concrete, so we can follow a pointer to a value of that type, together with a sequence of fields and array indices, to access a value of one of the AD fields, and have all the needed information to compute its length.
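
As a rough sketch of what reading through a concrete type could look like, with plain Python dicts standing in for region values and the recorded hole-filling functions passed in explicitly (illustrative only):

def load_length(root, size_expr):
    # size_expr is the hole-filling function recorded in the array's type;
    # its context here is just v, which the root (concrete) struct supplies.
    return size_expr(root["v"])

root = {
    "v": {"nAlleles": 3, "nGenotypes": 6},
    "va": None,
    "gs": [{"GT": 1, "AD": [7, 2, 0], "PL": [45, 0, 12, 99, 37, 80]}],
}

# The AD and PL lengths are never stored with the arrays; they are computed
# from the root value via the size expressions v.nAlleles and v.nGenotypes.
assert load_length(root, lambda v: v["nAlleles"]) == len(root["gs"][0]["AD"])
assert load_length(root, lambda v: v["nGenotypes"]) == len(root["gs"][0]["PL"])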

I realized that the proposal in my last post is equivalent to something that fits nicely into standard dependent type theory. I thought that elaborating on this different—but equivalent—presentation, and filling in a bit of basic dependent type theory, might be useful for the discussion.

For illustration, I’ll present a simplified version of the expression language type system. We start out with a few atomic types, such as Int, Double, maybe an unsigned integer type UInt, Bool, Genotype, Variant, etc. Then there are a few constructors of new types from old (aka parametric types) like Set[T] and Dict[T, U].

Array is also a type constructor, but dependently sized arrays require a parameter value n: UInt in addition to a parameter type T. As I said before, this change means that types can now contain variables, and so require a context. For this post, I’ll be a bit more formal about contexts. Take the type Array[T; n + m]. This isn’t meaningful in isolation, but in a context (n: Int, m: Int) it is a well defined type. To keep the context paired with the type, I’ll use the standard notation
n: Int, m: Int ⊢ Array[T; n + m].
Of course, this type still makes sense in a larger context, with more variables declared in the environment than are being used in the type, so I’ll write
𝚪, n: Int, m: Int ⊢ Array[T; n + m],
where 𝚪 is an arbitrary list of typed variables (not containing n or m).

Now instead of adding a struct type constructor (which is different from the others like Dict[T, U] or Array[T; n] since it takes a variable number of type parameters), I’ll add one of the most fundamental type constructors from dependent type theory: the dependent sum type (also known variously as a 𝚺-type, an existential type, or a dependent pair type). There are a few common notations for dependent sums; I’ll use one that emphasizes the “dependent pair” perspective.

Suppose 𝚪 ⊢ T is a type, and 𝚪, x: T ⊢ U[x] is a type parameterized by values of T (note that T and U both might depend on other variables from the context 𝚪). Then I’ll write (x: T) × U[x] for the dependent pair type, which is now in the context 𝚪, the variable x having been bound. A value of this type is a pair of values t: T and u: U[t], where the type of the second component depends on the first component.

By repeated use of the dependent pair type constructor, types equivalent to the dependent structs from my last post can be built. To keep the notation readable, in the case where the second type U doesn’t actually contain the variable x, so doesn’t depend on T, we could emphasize that fact by writing ( _: T) × U, or even more concisely T × U. In that way, the usual tuple type with two components (a biple?) is a special case of the dependent pair type. I’ll also make × be right associative, so that (x: T) × (y: U[x]) × V[x, y] is well defined. (To write this in a left associative way, you would have to say z: ((x: T) × U[x]) × V[z._1, z._2].)

Then in this notation, the struct type from Cotton’s post could be written
(v: Variant) × VAType × Array[Call × Array[Int; v.nAlleles] × Array[Int; v.nGenotypes]].

Note that decomposing struct types into nested dependent pairs like this gives an equivalent semantics to the proposal in my last post: in particular, variables used in sub-expressions are bound to the most recent ancestor struct field of the same name. That’s because in the pair type (x: T) × U, all free occurrences of x in U are bound, and so are no longer available to be bound by enclosing constructors like in (x: V) × ((x: T) × U).

I also want to add (6.) sum types and pattern matching. There are a few places I’d like to use them already (anyone have other use cases?)

  1. For enums like CopyState, AltAlleleType, the Mendel errors code (which is just an Int now), and the genotype type (HomRef, Het, HomVar).

  2. The pattern match in Mendel errors to determine the code.

  3. If we add a 2-bit (or 4-bit) genomic sequence encoding and support for symbolic alleles (needed), Allele would look something like:

enum Allele {
  Bases(sequence: BaseSequence),
  Star,
  Symbolic(id: String),
  Breakend(...) # I don't actually know what this is
}

Given the difficulty we’re having with binding forms in moving the expr language to Python, I think the hardest part would be figuring out the Python syntax for pattern matching.
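
Just to make that concrete, here’s one possible stand-in encoding in plain Python, with classes mirroring the Allele cases above and an isinstance chain playing the role of the match. This isn’t a syntax proposal for the expression language:

from dataclasses import dataclass

class Allele:
    pass

@dataclass
class Bases(Allele):
    sequence: str      # stand-in for BaseSequence

@dataclass
class Star(Allele):
    pass

@dataclass
class Symbolic(Allele):
    id: str

# Breakend omitted, since its payload is left unspecified above.

def describe(a: Allele) -> str:
    # The kind of pattern match we would want to express in the language.
    if isinstance(a, Bases):
        return "bases: " + a.sequence
    elif isinstance(a, Star):
        return "*"
    elif isinstance(a, Symbolic):
        return "<" + a.id + ">"
    raise ValueError("unhandled Allele case")

print(describe(Symbolic(id="DEL")))   # <DEL>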