Expression compiler use cases

As we move towards fully generic matrix cell (genotype) schemas, I’m pushing various functions from Scala into the expression language. The performance of these use cases shouldn’t be worse than 0.1, so I’m recording them here so we can keep them in mind as we develop the expression compiler.

The per-allele operations (split_multi, filter_allles) are going to be fully generic. The user will give expressions to update the generic cell (genotype) schemas based on the alleles being filtered/split. These are currently hardcoded in Scala (with a few options, e.g., downcoding vs subsetting). The most complex on of these is updating the PL. From filter_alleles:

  def downcodePx(px: Array[Int]): Array[Int] = {
    coalesce(px)(triangle(newCount), (_, gt) => downcodeGt(gt), Int.MaxValue) { (oldNll, nll) =>
      min(oldNll, nll)
  def subsetPx(px: Array[Int]): Array[Int] = {
    val (newPx, minP) = px.zipWithIndex
        case (p, i) =>
          val gTPair = Genotype.gtPair(i)
          (gTPair.j == 0 || oldToNew(gTPair.j) != 0) && (gTPair.k == 0 || oldToNew(gTPair.k) != 0)
      .foldLeft((Array.fill(triangle(newCount))(0), Int.MaxValue))({
        case ((newPx, minP), (p, i)) =>
          newPx(downcodeGt(i)) = p
          (newPx, min(p, minP))
      }) - minP)

and from split_multi: { gplx =>
        val plx = gplx.iterator.zipWithIndex
          .map { case (p, k) => (splitGT(k, i), p) }
          .reduceByKeyToArray(3, Int.MaxValue)(_ min _)

There are a few things that will be necessary: we will have to expose some routines for working with genotypes vs alleles (e.g. splitGT), we will need a general groupBy operation at the level of collections, and we will need some kind of deforestation to avoid the overhead of allocating intermediate structures (esp. if we’re using region allocation), a feature the Scala code does not have.

Second use case is loading BGEN files. The UKB is encoding fixed-point using 1 byte per genotype, storing (nGenotype - 1) values (the last value being implicit because they sum to 255. We would like to import the BGEN natively with schema struct { GP0: Int8, GP1: Int8 } and then, in the expression language, compute the float-based GP and/or the dosage (expected genotype). Something like this:

  .annotate_genotypes_expr('g.GP = [g.GP0 / 255.0, g.GP1 / 255.0, (255 - g.GP0 - g.GP1) / 255.0], g.GT = ...')
  .linreg('sa.y', 'dosage(g.GP)', ...)

and in this case the construction of GP should never be realized and the dosage should be computed directly as (g.GP1 + 2*(255 - g.GP0 - g.GP1)) / 255.0 (or some simplification thereof). This should get the same performance as the shortcut I coded for the UKB GWAS.