Proposed Python expr syntax

Current devel interface:

We have good IDE support and typing throughout, made possible by lambda functions. There’s really no way to
get rid of the second “scope” argument, though.

Example 1: Annotate number of heterozygotes

vds = vds.annotate_variants(
    nCaseHets = vds.gs.filter(lambda g, scope: g.isHet() & scope.sa.isCase).count())

Example 2: Given a key table with schema a: String, b: Int32, c: Int32, group by a and compute the sum of 1 / b over rows where c is greater than 5.

grp = kt.group_by(kt.a)
grp.aggregate(sum = grp.b.filter(lambda b, scope: scope.c > 5.0).map(lambda b, _: 1 / b).sum())

Proposed interface 1

In this interface, we use an anonymous indicator “X” to hold everything in the current scope. You lose
IDE support / tab completion because X is untyped until runtime.

Example 1: Annotate number of heterozygotes

vds = vds.annotate_variants(
    nCaseHets = vds.gs.filter(lambda g: g.isHet() & X.sa.isCase).count())

Example 2: Given a key table with schema a: String, b: Int32, c: Int32, group by a and compute the sum of 1 / b over rows where c is greater than 5.

grp = kt.group_by(kt.a)
grp.aggregate(sum = grp.b.filter(lambda b: X.c > 5.0).map(lambda b: 1 / b).sum())
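
To make the "untyped until runtime" point concrete, here is a minimal sketch of how a placeholder like X could work. The Placeholder class, its resolve method, and the nested-dict scope are all hypothetical, invented for illustration; nothing here is an existing Hail API.

```python
class Placeholder:
    """Hypothetical stand-in for X: records attribute accesses instead of evaluating them."""

    def __init__(self, path=()):
        self._path = path

    def __getattr__(self, name):
        # Only called for names not found normally. Any public name is accepted,
        # so a misspelled field cannot be detected until the path is resolved.
        if name.startswith('_'):
            raise AttributeError(name)
        return Placeholder(self._path + (name,))

    def resolve(self, scope):
        """Look the recorded path up in a concrete scope (nested dicts in this sketch)."""
        value = scope
        for name in self._path:
            value = value[name]
        return value


X = Placeholder()

expr = X.sa.isCase   # nothing is looked up yet
typo = X.sa.isCsae   # also accepted: there is no schema to check against

scope = {'sa': {'isCase': True}}
print(expr.resolve(scope))  # True
```

Calling typo.resolve(scope) raises a KeyError only at that point, which is the trade-off described above: an editor cannot offer completions for X because every attribute name is valid until it is resolved.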

Proposed interface 2

In this model, aggregables must be constructed explicitly with agg (either an apply-style function call, as in Example 2, or an instance method, as in Example 1, would work). This preserves typing and IDE support.

Example 1: Annotate number of heterozygotes

vds = vds.annotate_variants(
    nCaseHets = vds.g.agg().filter(vds.g.isHet() & vds.sa.isCase).count())

Example 2: Given a key table with schema a: String, b: Int32, c: Int32, group by a and compute the sum of 1 / b over rows where c is greater than 5.

kt.group_by(kt.a)\
  .aggregate(sum = agg(1 / kt.b).filter(kt.c > 5.0).sum())
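
One way the lambda-free expressions in interface 2 could be captured is with operator overloading on a column type. The sketch below is only an assumption about the mechanism: Expr, Aggregable, and this agg are hypothetical names, and the rendered string is an illustrative format rather than anything Hail produces.

```python
class Expr:
    """Hypothetical typed column expression; operators build a description, not a value."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return Expr(f'({self.text} & {_text(other)})')

    def __gt__(self, other):
        return Expr(f'({self.text} > {_text(other)})')

    def __rtruediv__(self, other):
        # Handles `1 / expr`, where the left operand is a plain Python number.
        return Expr(f'({_text(other)} / {self.text})')


def _text(value):
    """Render either a nested expression or a Python literal."""
    return value.text if isinstance(value, Expr) else repr(value)


class Aggregable:
    """Hypothetical result of agg(): the only place filter and sum are available."""

    def __init__(self, expr, predicates=()):
        self.expr = expr
        self.predicates = predicates

    def filter(self, predicate):
        return Aggregable(self.expr, self.predicates + (predicate,))

    def sum(self):
        where = ' & '.join(p.text for p in self.predicates) or 'True'
        return Expr(f'sum({self.expr.text} where {where})')


def agg(expr):
    """Explicitly turn a column expression into an aggregable."""
    return Aggregable(expr)


b, c = Expr('b'), Expr('c')
result = agg(1 / b).filter(c > 5.0).sum()
print(result.text)  # sum((1 / b) where (c > 5.0))
```

Because b and c are ordinary objects with declared methods, an editor can complete on them, and filter and sum appear only on the object returned by agg.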

In proposed interface 1, why can’t we use X for the genotype as well?

vds = vds.annotate_variants(
  nCaseHets = vds.gs.filter(X.g.isHet() & X.sa.isCase).count())

I think you could do either. But consider this example in 0.1:

vds.annotate_variants_expr(
    'va.foo = gs.map(g => g.nNonRefAlleles()).filter(n => n >= 2).count() > 10')

How do you do that with X? The second lambda’s argument is the output of map, which has no name in the scope that X could refer to.

vds.annotate_variants_expr(
    foo = vds.g.map(X.g.nNonRefAlleles()).filter(X.??? >= 2).count() > 10)

Proposed interface 3. I have a new idea:

  1. Get rid of aggregable columns. That means operations like filter and flat_map, and the aggregators, can’t be used with dot notation: they need to be functions. All columns are now value types.

  2. The only thing you can do with the return value of filter or flat_map is aggregate it. For example, this expression seems nonsensical:

    filter(vds.g.DP > 50, vds.g.GQ) + f(vds.g.PL)
    

In particular, the summands have a different number of elements.

  3. Columns will need to track indices and whether an aggregation has happened for error checking. A few well-typed expressions that fail on index/aggregation grounds:

    vds.annotate_variants(va.foo = vds.g.DP)

    vds.annotate_variants(va.foo = count(vds.g.DP + count(vds.g.PL.filter(lambda x: x == 0).length)))


    while:

    vds.annotate_genotypes(af = vds.va.info.af)


    should be OK.

  4. I think this means that lambdas are only used for manipulating scalar values, and don’t appear in the VDS- and KT-level interface.

  5. filter should take either a boolean column or a lambda that is a predicate on the elements being aggregated over.
    Here’s a stupid example:

    vds.annotate_variants(
      goodDPMean = mean(filter(lambda x: x > 50.0, vds.g.DP + rnorm(0.0, 1.0))))


    Note that rnorm is evaluated only once per genotype.

  6. We need a Hail package prefix to avoid naming collisions with Python and NumPy functions. In hail2, the common stuff is imported at the top level and the convention is that functions are imported as a module named hf. The other option is to use hl for everything. (Or something equally short but easier to type.)
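
The index and aggregation tracking described in the list above can be sketched roughly as follows. This is an illustration under assumptions, not a proposed implementation: the Column class, the axis names 'variant' and 'sample', the `over` parameter, and check_annotation are all hypothetical.

```python
class Column:
    """Hypothetical value-typed column that tracks its axes and aggregation state."""

    def __init__(self, name, indices, aggregated=False):
        self.name = name
        self.indices = frozenset(indices)
        self.aggregated = aggregated

    def __add__(self, other):
        # The result is indexed by the union of the operands' axes and
        # remembers whether either operand came from an aggregation.
        return Column(f'({self.name} + {other.name})',
                      self.indices | other.indices,
                      self.aggregated or other.aggregated)


def count(col, over='sample'):
    """Hypothetical aggregator: removes the axis it aggregates over."""
    if col.aggregated:
        raise TypeError(f'{col.name} already contains an aggregation')
    if over not in col.indices:
        raise TypeError(f'{col.name} is not indexed by {over}')
    return Column(f'count({col.name})', col.indices - {over}, aggregated=True)


def check_annotation(col, target_indices):
    """Reject a column that is indexed by an axis the target does not have."""
    extra = col.indices - frozenset(target_indices)
    if extra:
        raise TypeError(f'{col.name} is still indexed by {sorted(extra)}')


dp = Column('g.DP', {'variant', 'sample'})
n_zero_pl = Column('g.PL.filter(...).length', {'variant', 'sample'})
af = Column('va.info.af', {'variant'})

check_annotation(af, {'variant', 'sample'})  # OK: variant annotation used per genotype
check_annotation(count(dp), {'variant'})     # OK: aggregated over samples first

for bad in (lambda: check_annotation(dp, {'variant'}),
            lambda: count(dp + count(n_zero_pl))):
    try:
        bad()
    except TypeError as err:
        print('rejected:', err)
```

The two rejected cases correspond to the failing examples in the list: a genotype-indexed column used as a variant annotation, and an aggregation nested inside another aggregation.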

Would it be too presumptuous for us to use h as our prefix?