I thought I would outline some ways I see builtin types evolving as we pull things out of Scala into the expression language.
-
We need to be able to specify that elements of container types (struct fields, array elements, etc.) are required (non-missing). @catoverdrive already has a PR for arrays: https://github.com/hail-is/hail/pull/2356.
Non-missing arrays will find immediate application for AD and PL fields in genotypes. I think strictly speaking the VCF spec supports arrays with missing elements, but this is not used in practice and isn’t supported by GATK/HTSJDK’s internal format; see, for example, `htsjdk.variant.variantcontext.Genotype.getPL`: `public abstract int[] getPL();`
I’m imagining something like `import_vcf(..., array_elements_optional=True)`.
Non-missing field elements will find immediate application in the representation of complex types like Variant and Locus, whose fields are naturally required.
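To make the optional/required distinction concrete, here is a minimal Python sketch. The `TInt` and `TArray` descriptors and the `check` function are hypothetical names invented for illustration; they are not Hail's actual type classes. The sketch only shows how a `required` flag on the element type would rule out missing elements while still allowing the array itself to be missing.

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical type descriptors for illustration; not Hail's real classes.
@dataclass(frozen=True)
class TInt:
    required: bool = False


@dataclass(frozen=True)
class TArray:
    element: Any
    required: bool = False


def check(value, typ):
    """Return True if value conforms to typ, honoring the required flags."""
    if value is None:
        # A missing value is acceptable only when the type is optional.
        return not typ.required
    if isinstance(typ, TInt):
        return isinstance(value, int)
    if isinstance(typ, TArray):
        return isinstance(value, list) and all(
            check(x, typ.element) for x in value
        )
    raise TypeError(f"unknown type descriptor: {typ!r}")


# PL: the array itself may be missing, but its elements may not.
pl_type = TArray(TInt(required=True))
```

With this, `check([0, 10, 100], pl_type)` and `check(None, pl_type)` both pass, while `check([0, None, 100], pl_type)` fails because the element type is required.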
-
We should have unsigned integral types, as well as 8- and 16-bit types.
-
We need an n-dimensional array type, `a: NDArray[T]`. It should also support required elements. It should mirror numpy, and have `a.ndim`, `a.shape: Array[Int]`. Things like agreement between the number of dimensions and the number of indices in index expressions such as `a[i, j, k]` should be checked dynamically. It should support slicing, transpose and basic BLAS operations efficiently. I think this will be sufficient to implement `linreg` (and maybe even `lmmreg`) fully in Python.
Once we have this, we can drop the existing broadcast and 1-level-deep vectorization operations on numeric array types.
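As a rough illustration of the dynamic checks described here, the following is a toy pure-Python `NDArray` with `ndim`, `shape`, indexing and transpose. The class and its row-major strided layout are assumptions made for this sketch, not a proposed implementation; a real version would sit on top of an efficient native representation and BLAS.

```python
from typing import Sequence


class NDArray:
    """Toy row-major n-dimensional array (hypothetical sketch)."""

    def __init__(self, data: Sequence, shape: Sequence[int], strides=None):
        self.data = list(data)
        self.shape = list(shape)
        if strides is None:
            # Row-major strides: the last axis varies fastest.
            strides = [1] * len(self.shape)
            for axis in range(len(self.shape) - 2, -1, -1):
                strides[axis] = strides[axis + 1] * self.shape[axis + 1]
        self.strides = list(strides)
        size = 1
        for n in self.shape:
            size *= n
        if size != len(self.data):
            raise ValueError("shape does not match number of elements")

    @property
    def ndim(self) -> int:
        return len(self.shape)

    def __getitem__(self, index):
        if not isinstance(index, tuple):
            index = (index,)
        # The number of indices is checked at run time, not by the type.
        if len(index) != self.ndim:
            raise IndexError(
                f"expected {self.ndim} indices, got {len(index)}"
            )
        offset = 0
        for i, n, stride in zip(index, self.shape, self.strides):
            if not 0 <= i < n:
                raise IndexError(f"index {i} out of bounds for size {n}")
            offset += i * stride
        return self.data[offset]

    def transpose(self) -> "NDArray":
        # Only shape and strides are reversed; element order is unchanged.
        return NDArray(self.data, self.shape[::-1], self.strides[::-1])


a = NDArray(range(6), [2, 3])
```

Here `a[1, 2]` and `a.transpose()[2, 1]` read the same element, while `a[1]` raises `IndexError` because a 2-dimensional array needs two indices.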
-
We should have static fixed-length arrays: `Array[T, N]`. I’m not sure about the literal syntax. I don’t think this will get much use, but (1) it is a building block for dependently-sized arrays and (2) we also want to support fixed-size variants, with a type like `Variant(b37, 2)`. This will allow us to get rid of `wasSplit` from `VariantSampleMatrix`.
-
Dependently-sized arrays. Currently the length of the `PL` array is stored for each genotype of a variant, but they are all the same: `nGenotypes`. To avoid this, the size of an array should be able to depend on a previous field (or a simple expression of a previous field) in a structure type. For example:
`struct { v: Variant, va: ..., gs: Array[Struct { GT: Call, AD: Array[Int; v.nAlleles], ..., PL: Array[Int; v.nGenotypes] }] }`
The dependently-typed array type’s `loadLength` accessor will use the size expression to compute the length. I see two difficulties. First, how are the dependent expressions scoped? What’s the “root” of the struct? Second, `getLength` and the accessors will be non-local and require a pointer not just to the array but also to the “root” value.
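The non-locality can be seen in a small Python sketch. Everything here is hypothetical: the `Row` class stands in for the "root" value, the PLs are packed into one flat list with no per-array length, and the length is recomputed from the variant's allele count. The sketch assumes diploid genotypes, for which the number of unordered genotypes over n alleles is n(n+1)/2.

```python
from typing import List


def n_genotypes(n_alleles: int) -> int:
    """Number of unordered diploid genotypes for n_alleles alleles."""
    return n_alleles * (n_alleles + 1) // 2


class Row:
    """Hypothetical root value: one variant plus packed per-sample PLs."""

    def __init__(self, n_alleles: int, pls: List[List[int]]):
        self.n_alleles = n_alleles
        expected = n_genotypes(n_alleles)
        for pl in pls:
            if len(pl) != expected:
                raise ValueError(
                    f"PL has length {len(pl)}, expected {expected}"
                )
        self.n_samples = len(pls)
        # Packed storage: no length is stored alongside each PL array.
        self.packed = [x for pl in pls for x in pl]

    def pl_length(self) -> int:
        # The size expression is evaluated against the root, not the array.
        return n_genotypes(self.n_alleles)

    def load_pl(self, sample: int) -> List[int]:
        if not 0 <= sample < self.n_samples:
            raise IndexError(f"sample {sample} out of range")
        n = self.pl_length()
        start = sample * n
        return self.packed[start:start + n]


row = Row(2, [[0, 10, 100], [20, 0, 50]])
```

Note that `load_pl` cannot be written as a function of the packed array alone: it needs `n_alleles` from the enclosing row, which is exactly the second difficulty above.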