Proposed python expr syntax

tpoterba · November 1, 2017, 1:28pm

Current `devel` interface:

We have good IDE support and typing throughout, made possible by lambda functions. There’s really no way to
get rid of the second “scope” argument, though.

Example 1: Annotate number of heterozygotes

vds = vds.annotate_variants(
    nCaseHets = vds.gs.filter(lambda g, scope: g.isHet() & scope.sa.isCase).count())

Example 2: Given a key table with schema `a: String`, `b: Int32`, `c: Int32`, group by `a` and compute the sum of `1 / b` over rows where `c` is larger than 5.

grp = kt.group_by(kt.a)
grp.aggregate(sum = grp.b.filter(lambda b, scope: scope.c > 5.0).map(lambda b, _: 1 / b).sum())
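To make the role of the second argument concrete, here is a toy, eager sketch in plain Python. It is not Hail's implementation: `ToyAggregable` and the sample rows are made up for illustration. Each element travels together with its enclosing scope, and every lambda receives both:

```python
from types import SimpleNamespace


class ToyAggregable:
    """Hypothetical stand-in for an aggregable: a list of (element, scope) pairs."""

    def __init__(self, pairs):
        self.pairs = list(pairs)

    def filter(self, pred):
        # pred sees the element and the scope it came from
        return ToyAggregable((x, s) for x, s in self.pairs if pred(x, s))

    def map(self, f):
        # the scope is carried along unchanged next to the mapped element
        return ToyAggregable((f(x, s), s) for x, s in self.pairs)

    def sum(self):
        return sum(x for x, _ in self.pairs)


rows = [
    SimpleNamespace(b=2, c=6),
    SimpleNamespace(b=4, c=3),
    SimpleNamespace(b=5, c=9),
]
b_agg = ToyAggregable((row.b, row) for row in rows)

# Same shape as Example 2: the filter needs `c`, which is only reachable via scope.
total = b_agg.filter(lambda b, scope: scope.c > 5.0).map(lambda b, _: 1 / b).sum()
```

The scope argument is what lets the filter on `b` look at `c`; dropping it means the lambda has no typed handle on the rest of the row.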

Proposed interface 1

In this interface, we use an anonymous indicator “X” to hold everything in the current scope. You lose
IDE support / tab completion because X is untyped until runtime.

Example 1: Annotate number of heterozygotes

vds = vds.annotate_variants(
    nCaseHets = vds.gs.filter(lambda g: g.isHet() & X.sa.isCase).count())

Example 2: Given a key table with schema `a: String`, `b: Int32`, `c: Int32`, group by `a` and compute the sum of `1 / b` over rows where `c` is larger than 5.

grp = kt.group_by(kt.a)
grp.aggregate(sum = grp.b.filter(lambda b: X.c > 5.0).map(lambda b: 1 / b).sum())
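A rough sketch of why `X` costs you tab completion, in plain Python. The `Placeholder` class here is hypothetical and only supports attribute access and `>`. It records what was asked for and resolves it later against whatever scope is supplied, so a misspelled field is not noticed until evaluation:

```python
from types import SimpleNamespace


class Placeholder:
    """Hypothetical untyped placeholder: builds a deferred lookup chain."""

    def __init__(self, resolve=lambda scope: scope):
        self._resolve = resolve

    def __getattr__(self, name):
        # Only reached for names not found normally, i.e. field names.
        if name.startswith('_'):
            raise AttributeError(name)
        return Placeholder(lambda scope: getattr(self._resolve(scope), name))

    def __gt__(self, other):
        return Placeholder(lambda scope: self._resolve(scope) > other)

    def evaluate(self, scope):
        return self._resolve(scope)


X = Placeholder()

scope = SimpleNamespace(sa=SimpleNamespace(isCase=True), c=6)
keep = (X.c > 5.0).evaluate(scope)      # resolved only now, against `scope`

typo = X.sa.isCsae                      # misspelled field: builds without complaint
try:
    typo.evaluate(scope)
except AttributeError as err:
    print('caught only at evaluation time:', err)
```

Nothing about `X` tells an editor which fields exist, which is the trade-off described above.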

Proposed interface 2

In this model, aggregables must be constructed explicitly with `agg` (either an `apply`-style function or an instance method would work). This preserves typing and IDE support.

Example 1: Annotate number of heterozygotes

vds = vds.annotate_variants(
    nCaseHets = vds.g.agg().filter(vds.g.isHet() & vds.sa.isCase).count())

Example 2: Given a key table with schema `a: String`, `b: Int32`, `c: Int32`, group by `a` and compute the sum of `1 / b` over rows where `c` is larger than 5.

kt.group_by(kt.a)\
  .aggregate(sum = agg(1 / kt.b).filter(kt.c > 5.0).sum())
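For a feel of how this could hang together, here is a minimal toy in plain Python. Everything in it (`Expr`, `Agg`, `ToyTable`, `ToyGrouped`) is invented for illustration and is not the Hail implementation. Columns are real attributes that build deferred expressions through operator overloading, and `agg(...)` wraps one of them so it can be filtered and summed:

```python
from collections import defaultdict


class Expr:
    """Hypothetical deferred per-row computation."""

    def __init__(self, fn):
        self.fn = fn

    def __gt__(self, other):
        return Expr(lambda row: self.fn(row) > other)

    def __rtruediv__(self, other):
        # supports `1 / kt.b`
        return Expr(lambda row: other / self.fn(row))


class Agg:
    """Hypothetical explicit aggregable built from an Expr."""

    def __init__(self, expr, keep=None):
        self.expr, self.keep = expr, keep

    def filter(self, cond):
        # toy limitation: a second filter replaces the first
        return Agg(self.expr, cond)

    def sum(self):
        def run(rows):
            return sum(self.expr.fn(r) for r in rows
                       if self.keep is None or self.keep.fn(r))
        return run


def agg(expr):
    return Agg(expr)


class ToyGrouped:
    def __init__(self, groups):
        self.groups = groups

    def aggregate(self, **named):
        return {key: {name: run(rows) for name, run in named.items()}
                for key, rows in self.groups.items()}


class ToyTable:
    def __init__(self, rows):
        self.rows = rows
        # fields are declared attributes, so they are visible to tooling
        self.a = Expr(lambda r: r['a'])
        self.b = Expr(lambda r: r['b'])
        self.c = Expr(lambda r: r['c'])

    def group_by(self, key):
        groups = defaultdict(list)
        for r in self.rows:
            groups[key.fn(r)].append(r)
        return ToyGrouped(groups)


kt = ToyTable([
    {'a': 'x', 'b': 2, 'c': 6},
    {'a': 'x', 'b': 4, 'c': 3},
    {'a': 'y', 'b': 5, 'c': 9},
])
result = kt.group_by(kt.a).aggregate(sum=agg(1 / kt.b).filter(kt.c > 5.0).sum())
```

No lambdas and no scope argument appear in the user-facing call, because `kt.b` and `kt.c` already carry enough information to be evaluated per row.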

cseed · November 1, 2017, 1:37pm

In proposed interface 1, why can’t we use X for the genotype as well?

vds = vds.annotate_variants(
  nCaseHets = vds.gs.filter(X.g.isHet() & X.sa.isCase).count())

?

tpoterba · November 1, 2017, 1:45pm

I think you could do either. But consider this example in 0.1:

vds.annotate_variants_expr(
    'va.foo = gs.map(g => g.nNonRefAlleles()).filter(n => n >= 2).count() > 10')

How do you do that with the X? The second lambda doesn’t have a variable name bound.

vds.annotate_variants_expr(
    foo = vds.g.map(X.g.nNonRefAlleles()).filter(X.??? >= 2).count() > 10)

cseed · November 1, 2017, 5:29pm

Proposed interface 3. I have a new idea:

Get rid of aggregable columns. That means operations like filter and flat_map, and the aggregators, can’t be used with `.` notation: they need to be functions. All columns are now value types.
The only thing you can do with the return value of filter or flat_map is aggregate it. For example, this expression seems nonsensical:
```
filter(vds.g.DP > 50, vds.g.GQ) + f(vds.g.PL)
```

In particular, the summands have a different number of elements.
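The length mismatch is easy to see with plain Python lists standing in for the per-genotype columns (the values are made up, and the last list stands in for `f(vds.g.PL)`, with `f` left unspecified):

```python
dp = [60, 40, 75, 30]      # per-genotype depth
gq = [99, 20, 80, 10]      # per-genotype quality
f_pl = [0, 5, 0, 7]        # stand-in for f(vds.g.PL), one value per genotype

# filter(vds.g.DP > 50, vds.g.GQ): keep GQ only where DP > 50
kept_gq = [q for d, q in zip(dp, gq) if d > 50]

# Two elements on the left, four on the right: there is no element-wise sum.
lengths = (len(kept_gq), len(f_pl))
```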

Columns will need to track indices and whether an aggregation has happened for error checking. A few well-typed expressions that fail on index/aggregation grounds:

vds.annotate_variants(va.foo = vds.g.DP)

vds.annotate_variants(va.foo = count(vds.g.DP + count(vds.g.PL.filter(lambda x: x == 0).length)))

while:

vds.annotate_genotypes(af = vds.va.info.af)

should be OK.
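One way the index/aggregation bookkeeping could look, as a toy in plain Python. All names here (`Col`, `ExprCheckError`, the `check_*` helpers) are hypothetical, and the rules are one plausible reading of the examples above rather than a spec. Each column records which axes it varies over (`'v'` for variant, `'s'` for sample) and whether an aggregation has already been applied:

```python
class ExprCheckError(Exception):
    pass


class Col:
    def __init__(self, name, indices, aggregated=False):
        self.name = name
        self.indices = frozenset(indices)   # axes this value varies over
        self.aggregated = aggregated

    def __add__(self, other):
        return Col(f'({self.name} + {other.name})',
                   self.indices | other.indices,
                   self.aggregated or other.aggregated)


def count(col):
    # assumed rule: an aggregator may not contain another aggregation
    if col.aggregated:
        raise ExprCheckError(f'nested aggregation in count({col.name})')
    return Col(f'count({col.name})', col.indices, aggregated=True)


def check_annotate_variants(col):
    # per-variant context: anything that varies per sample must be aggregated
    if 's' in col.indices and not col.aggregated:
        raise ExprCheckError(f'{col.name} varies per sample; aggregate it first')


def check_annotate_genotypes(col):
    # per-genotype context: both axes are in scope, so variant-level
    # values can simply be broadcast
    if not col.indices <= {'v', 's'}:
        raise ExprCheckError(f'{col.name} uses an axis that is not in scope')


g_DP = Col('g.DP', {'v', 's'})
g_PL = Col('g.PL', {'v', 's'})
va_af = Col('va.info.af', {'v'})

check_annotate_variants(count(g_DP))    # fine: aggregated over samples
check_annotate_genotypes(va_af)         # fine: broadcast to every genotype

for bad in (lambda: check_annotate_variants(g_DP),
            lambda: count(g_DP + count(g_PL))):
    try:
        bad()
    except ExprCheckError as err:
        print('rejected:', err)
```

All four expressions are well-typed in the usual sense; the two rejections come purely from the extra index and aggregation tracking.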

I think this means that lambdas are only used for manipulating scalar values, and don’t appear in the VDS- and KT-level interface.
filter should take either a boolean column or a lambda that is a predicate on the elements being aggregated over.
Here’s a stupid example:
```
vds.annotate_variants(
  goodDPMean = mean(filter(lambda x: x > 50.0, vds.g.DP + rnorm(0.0, 1.0))))
```
Note, rnorm is only evaluated once (per genotype).
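A plain-Python sketch of those two points, with lists standing in for columns. `filter_values` is a hypothetical helper, and `random.Random.gauss` plays the part of `rnorm`. The filter accepts either a predicate or a boolean mask, and the noise is drawn once per genotype, so the value tested by the predicate is the same value that gets averaged:

```python
import random
from statistics import mean


def filter_values(pred_or_mask, values):
    """Keep elements by predicate (callable) or by boolean mask (iterable)."""
    values = list(values)
    if callable(pred_or_mask):
        return [x for x in values if pred_or_mask(x)]
    mask = list(pred_or_mask)
    if len(mask) != len(values):
        raise ValueError('mask and values must have the same length')
    return [x for x, keep in zip(values, mask) if keep]


rng = random.Random(0)                  # seeded only to make the sketch repeatable
dp = [40.0, 55.0, 62.0, 48.0, 70.0]     # made-up per-genotype depths

# one draw per genotype, computed before filtering
noisy_dp = [d + rng.gauss(0.0, 1.0) for d in dp]

good_dp_mean = mean(filter_values(lambda x: x > 50.0, noisy_dp))

# the same selection expressed as a boolean column
mask = [x > 50.0 for x in noisy_dp]
same_mean = mean(filter_values(mask, noisy_dp))
```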
We need a Hail package prefix to avoid name collisions with Python and NumPy functions. In hail2, the common pieces are imported at the top level, and the convention is that functions are imported as a module named `hf`. The other option is to use `hl` for everything (or something equally short but easier to type).

dking · November 1, 2017, 9:29pm

Would it be too presumptuous for us to use h as our prefix?
