Hail2 Python interface discussion

tpoterba · November 17, 2017, 3:07pm

See here: https://github.com/tpoterba/hail/blob/py-expr-2/python/hail2/dataset.py

Some thoughts

KeyTables

KeyTables have the following manipulation operations:

def annotate(**named_exprs):

def filter(expr):

def select(expr):

def transmute(**named_exprs): # not yet implemented

def group_by(*exprs, **named_exprs):
  -> def aggregate(**named_exprs)

def aggregate(**named_exprs): # the new 'query'

VariantDataset / Matrix

Matrices have the following manipulation operations:

def annotate_{rows, cols, entries}(**named_exprs):

def filter_{rows, cols, entries}(expr):

def select_{rows, cols, entries}(expr):

def transmute_{rows, cols, entries}(**named_exprs): # not yet implemented

def group_{rows, cols}_by(*exprs, **named_exprs): # not yet implemented
  -> def aggregate(**named_exprs): # group_rows_by.aggregate produces a new Matrix

def aggregate_{rows, cols, entries}(**named_exprs): # the new 'query' methods. Could be removed.

In order to unify the KeyTable and Matrix interfaces, it seems like a good idea to purge references to “va”, “sa”, “g”, and “globals”. Instead, all fields in any annotation category are available in the top-level namespace:

vds = vds.annotate_rows(
    mean_nfe_gq = f.mean(f.filter(vds.GQ, vds.pop == 'NFE')))

How does this work? Each field / expression knows its index. KeyTables only have one index, row, but matrices have two: row and column. In the above expression, GQ is indexed by [row, col], and pop is indexed by col. We should be able to use this information in combination with runtime analysis of the expression DAG / AST to generate nice error messages when undefined operations are invoked.
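A minimal sketch of the index-tracking idea, in plain Python rather than the actual hail2 code. The names `Expr`, `mean`, and `check_annotate_rows` are hypothetical and exist only for illustration; the sketch assumes that a binary operation takes the union of its operands' indices and that an aggregation removes the axis it collapses.

```python
class Expr:
    """Toy expression node that records which axes it is indexed by."""

    def __init__(self, name, indices):
        self.name = name
        self.indices = frozenset(indices)

    def __add__(self, other):
        # A binary operation is indexed by the union of its operands' indices.
        return Expr(f"({self.name} + {other.name})", self.indices | other.indices)


def mean(expr, over):
    """Aggregate away one axis, e.g. collapse 'col' to get a per-row value."""
    if over not in expr.indices:
        raise ValueError(f"cannot aggregate {expr.name!r} over {over!r}")
    return Expr(f"mean({expr.name})", expr.indices - {over})


def check_annotate_rows(expr):
    """Reject expressions that still vary by column."""
    extra = expr.indices - {"row"}
    if extra:
        raise TypeError(
            f"annotate_rows: {expr.name!r} is indexed by {sorted(expr.indices)}; "
            f"aggregate over {sorted(extra)} first"
        )
    return expr


GQ = Expr("GQ", {"row", "col"})   # entry field
pop = Expr("pop", {"col"})        # column field

per_row = check_annotate_rows(mean(GQ, "col"))  # accepted: indexed by row only
```

Passing `GQ` directly to `check_annotate_rows` raises a `TypeError` that names the offending axis, which is the kind of error message the index information makes possible.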

dking · November 17, 2017, 5:26pm

This is great!

I also really want implicit joins, a la:

left.annotate_rows(new_row_annotation = left.row_indexed_thing + right.row_indexed_thing)
left.annotate_cols(new_col_annotation = left.col_indexed_thing + right.col_indexed_thing)
left.annotate_entries(...)

Pulling the thread a bit more:

left.annotate_rows(foo = left.a + f.mean(right.entries))

cseed · November 17, 2017, 6:10pm

Very nice. I also want implicit joins with alternate keys:

  left.annotate_rows(x = other[left.y].z)

Let’s do it (although maybe not in this initial prototype). It seems like an Expression can carry a list of things to join (and where to join them, a UUID or something).
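One way the "Expression carries a list of things to join" idea could look, as a sketch only. `Join`, `Expr`, `Table`, and `JoinedRow` are hypothetical stand-ins, not hail2 classes, and `.field("z")` is used in place of attribute access to keep the example short. The UUID serves as the temporary name under which the joined row would be attached.

```python
import uuid


class Join:
    """Records one pending join: which table, keyed by which expression."""

    def __init__(self, table_name, key_text):
        self.table_name = table_name
        self.key_text = key_text
        self.uid = uuid.uuid4().hex  # temporary field name for the joined row


class Expr:
    """Expression text plus the joins that must run before it is evaluated."""

    def __init__(self, text, joins=()):
        self.text = text
        self.joins = tuple(joins)


class JoinedRow:
    def __init__(self, join, joins):
        self.join = join
        self.joins = joins

    def field(self, name):
        # The field is read from the temporary location the join will fill in.
        return Expr(f"{self.join.uid}.{name}", self.joins)


class Table:
    def __init__(self, name):
        self.name = name

    def field(self, name):
        return Expr(f"{self.name}.{name}")

    def __getitem__(self, key_expr):
        join = Join(self.name, key_expr.text)
        # Joins needed by the key come first, then this one.
        return JoinedRow(join, key_expr.joins + (join,))


left, other = Table("left"), Table("other")
x = other[left.field("y")].field("z")
```

Here `x.joins` holds a single `Join` against `other` keyed by `left.y`; an `annotate_rows` implementation could perform those joins first and then evaluate `x.text` against the widened rows.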

tpoterba · November 17, 2017, 6:12pm

For 0.2? I think we’re not going to have the optimizer to make this efficient.

tpoterba · November 17, 2017, 6:15pm

Hmm, maybe it’s possible to make this work. I’ll think about it.

cseed · November 17, 2017, 6:34pm

I don’t see how this will be less efficient than what we do now. Can you elaborate?

tpoterba · November 17, 2017, 9:51pm

Pasting here from Slack: here’s an adversarial example of why doing implicit joins may not be the right thing:

# correct: 
vds = vds.annotate_rows(x = 5)
vds = vds.annotate_rows(y=vds.x * 2)

# produces an implicit join!
vds = vds.annotate_rows(x = 5)\
         .annotate_rows(y=vds.x * 2)
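The difference between the two forms comes from ordinary Python name binding: in the chained version, `vds` still refers to the old object while the second `annotate_rows` runs on the new one. A toy sketch, assuming each field remembers the table it came from (the `Table` and `Field` classes here are hypothetical, not hail2 code):

```python
class Field:
    def __init__(self, source, name):
        self.source = source  # the table object this field was read from
        self.name = name

    def __mul__(self, other):
        # Arithmetic keeps the same source table.
        return Field(self.source, f"({self.name} * {other})")


class Table:
    def __init__(self, fields=(), implicit_joins=()):
        self.fields = tuple(fields)
        self.implicit_joins = tuple(implicit_joins)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name in self.__dict__.get("fields", ()):
            return Field(self, name)
        raise AttributeError(name)

    def annotate_rows(self, **named):
        # Any field read from a different table object would need a join.
        joins = [k for k, v in named.items()
                 if isinstance(v, Field) and v.source is not self]
        new_fields = self.fields + tuple(n for n in named if n not in self.fields)
        return Table(new_fields, joins)


vds = Table().annotate_rows(x=5)

# Two statements: vds.x is read from the same table being annotated.
two_step = vds.annotate_rows(y=vds.x * 2)

# Chained: vds.x is read from the old table, not the intermediate one.
chained = vds.annotate_rows(x=5).annotate_rows(y=vds.x * 2)
```

In this sketch `two_step.implicit_joins` is empty while `chained.implicit_joins` contains `y`, which is the surprise described above.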

tpoterba · November 18, 2017, 12:12am

In [4]: kt.show()
+-------+--------+
| index | bar    |
+-------+--------+
| Int32 | String |
+-------+--------+
|     0 | foo10  |
|     1 | foo9   |
|     2 | foo8   |
|     3 | foo7   |
|     4 | foo6   |
|     5 | foo5   |
|     6 | foo4   |
|     7 | foo3   |
|     8 | foo2   |
|     9 | foo1   |
+-------+--------+

In [5]: kt2.show()
+-------+--------+
| index | foo    |
+-------+--------+
| Int32 | String |
+-------+--------+
|     0 | foo0   |
|     1 | foo1   |
|     2 | foo2   |
|     3 | foo3   |
|     4 | foo4   |
|     5 | foo5   |
|     6 | foo6   |
|     7 | foo7   |
|     8 | foo8   |
|     9 | foo9   |
+-------+--------+

In [6]: kt.annotate(kt2_index = kt2[kt.bar].index).show()
+-------+--------+-----------+
| index | bar    | kt2_index |
+-------+--------+-----------+
| Int32 | String |     Int32 |
+-------+--------+-----------+
|     8 | foo2   |         2 |
|     0 | foo10  |        NA |
|     4 | foo6   |         6 |
|     3 | foo7   |         7 |
|     1 | foo9   |         9 |
|     7 | foo3   |         3 |
|     2 | foo8   |         8 |
|     5 | foo5   |         5 |
|     6 | foo4   |         4 |
|     9 | foo1   |         1 |
+-------+--------+-----------+
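For readers following along without the prototype, the result above can be reproduced with a plain-Python left lookup. This is only an emulation under the assumption that `kt2` is keyed by `foo`; the helpers `index_by` and `left_lookup` are hypothetical, and `None` stands in for `NA`. Row order differs from the output above, which is not sorted.

```python
kt = [{"index": i, "bar": f"foo{10 - i}"} for i in range(10)]
kt2 = [{"index": i, "foo": f"foo{i}"} for i in range(10)]


def index_by(rows, key):
    """Build a key -> row mapping (assumes keys are unique)."""
    return {row[key]: row for row in rows}


def left_lookup(rows, key_field, lookup, value_field, new_name):
    """Left join: keep every row, fill None where the key has no match."""
    out = []
    for row in rows:
        match = lookup.get(row[key_field])
        value = None if match is None else match[value_field]
        out.append({**row, new_name: value})
    return out


kt2_by_foo = index_by(kt2, "foo")
result = left_lookup(kt, "bar", kt2_by_foo, "index", "kt2_index")
```

The row with `bar == "foo10"` gets `None` because `kt2` has no such key, matching the `NA` in the table.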

cseed · November 18, 2017, 12:19am

I don’t even. Nice work, @tpoterba!

tpoterba · November 21, 2017, 2:04pm

Here are some thoughts about the explicit joins. To discuss, please mention the point by number!

Key tables

NB: We assume that global fields have been added to tables

1. Reference a row of another table with one key

table1 = table1.annotate(x = table2[table1.key1].x)

2. Using expressions to index is also acceptable:

table1 = table1.annotate(x = table2[f.str(table1.key1 % 100)].x)

3. What about global fields in another table?

table1 = table1.annotate(y = table1.field1 * table2[:].x)

4. What about tables with multiple keys?

table2 = table2.key_by(table2.f1, table2.f2)
table1 = table1.annotate(x = table2[table1.y, table1.z].x)

5. Should this be allowed? I think not…

table2 = table2.key_by(table2.f1, table2.f2)
table1 = table1.annotate(x = table2[table1.y, :].x)

This would roughly translate to:

table2 = table2.key_by(table2.f1, table2.f2)
table_anon = table2.key_by(table2.f1)
table1 = table1.annotate(x = table_anon[table1.y].x)

Matrices

Things get interesting now that we have multiple keys to throw around!
NB: we assume that we’ve implemented methods to join globals and left join on entries

6. Do the equivalent of an annotate_variants_vds

m1 = m1.annotate_rows(x = m2[m1.v, :].x)

7. This follows for globals, cols, entries:

m1 = m1.annotate_globals(foo = m2[:, :].foo)
m1 = m1.annotate_cols(bar = m2[:, m1.s].bar)
m1 = m1.annotate_entries(gt2 = m2[m1.v, m1.s].GT)

8. What happens if the row key of `m2` includes two fields?

m1 = m1.annotate_rows(x = m2[(m1.gene, m1.consequence), :].x)

tpoterba · November 21, 2017, 2:44pm

There’s a somewhat subtler point about the indices of an expression derived from joins. Generally, the index is going to be the union of indices of expressions used. With this model, the following work:

vds2[:, :] has no indices
vds2[vds.v, :] has index {'row'}
vds2[:, vds.s] has index {'col'}
vds2[vds.v, vds.s] has indices {'row', 'col'}

But what if vds2 has row keys Variant, String and some other column key Int32?

vds2[(vds.v, vds.s), :] should expose the row annotations of vds2 as a per-entry value in vds. This is going to be tricky to get right.
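The union rule can be written down in a few lines. This is a sketch of the rule only, not of the join itself: key expressions are represented just by their index sets, `:` contributes nothing, and `Matrix` and `key_indices` are hypothetical names.

```python
def key_indices(key):
    """Indices contributed by one key position of m[row_key, col_key]."""
    if isinstance(key, slice):
        return frozenset()  # ':' means "not joined on this axis"
    if isinstance(key, tuple):
        # A compound key contributes the union of its parts.
        return frozenset().union(*(key_indices(k) for k in key))
    return frozenset(key)


class Matrix:
    def __getitem__(self, key):
        row_key, col_key = key
        return key_indices(row_key) | key_indices(col_key)


v = frozenset({"row"})  # stands in for vds.v
s = frozenset({"col"})  # stands in for vds.s
vds2 = Matrix()

no_index = vds2[:, :]
by_row = vds2[v, :]
by_col = vds2[:, s]
by_entry = vds2[v, s]
compound = vds2[(v, s), :]
```

Under this rule `compound` comes out indexed by both row and column, which is the per-entry behaviour described above for a two-field row key.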

tpoterba · November 21, 2017, 4:34pm

9. Multiple joins

This should be possible:

kt = kt.annotate(x = kt4[kt3[kt2[kt.foo].bar].baz].x)

tpoterba · November 21, 2017, 8:27pm

Re: 9, this actually just works!
(the AST is constructed inside out, so I just need to do the joins in reverse order!)

from hail2 import *
hc = HailContext()
import hail as h
kt = h.KeyTable.range(1).drop('index').to_hail2()
kt = kt.annotate(a='foo')

kt1 = h.KeyTable.range(1).drop('index').to_hail2()
kt1 = kt1.annotate(a= 'foo', b = 'bar').key_by('a')

kt2 = h.KeyTable.range(1).drop('index').to_hail2()
kt2 = kt2.annotate(b = 'bar', c = 'baz').key_by('b')

kt3 = h.KeyTable.range(1).drop('index').to_hail2()
kt3 = kt3.annotate(c = 'baz', d = 'qux').key_by('c')

kt4 = h.KeyTable.range(1).drop('index').to_hail2()
kt4 = kt4.annotate(d='qux', e='quam').key_by('d')

kt.annotate(test = kt4[kt3[kt2[kt1[kt.a].b].c].d].e).show()
+--------+--------+
| a      | e      |
+--------+--------+
| String | String |
+--------+--------+
| foo    | quam   |
+--------+--------+
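The same chain can be traced in plain Python, which may help when reasoning about the order of the joins. This sketch uses dicts in place of keyed tables and hypothetical helpers `lookup` and `chain`; the steps are listed innermost first, and a missing key propagates as `None`.

```python
kt1 = {"foo": {"a": "foo", "b": "bar"}}   # keyed by a
kt2 = {"bar": {"b": "bar", "c": "baz"}}   # keyed by b
kt3 = {"baz": {"c": "baz", "d": "qux"}}   # keyed by c
kt4 = {"qux": {"d": "qux", "e": "quam"}}  # keyed by d


def lookup(table, key, field):
    """One left-join step: None in, or no match, gives None out."""
    if key is None:
        return None
    row = table.get(key)
    return None if row is None else row[field]


def chain(start, steps):
    """Apply (table, field) steps in order, innermost lookup first."""
    value = start
    for table, field in steps:
        value = lookup(table, value, field)
    return value


steps = [(kt1, "b"), (kt2, "c"), (kt3, "d"), (kt4, "e")]
test = chain("foo", steps)
```

With the data above `test` is `"quam"`, and starting from a key that is absent from `kt1` yields `None` at every later step.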

konradjk · November 22, 2017, 1:44pm

Is number 5 intended to be a left join (on only one of the keys of table2)? If so, I agree with the requirement to re-key and join again (assuming it’s as fast).

Re: your subtler point about multiple row keys, I assume this will nearly never happen with a straight vds, but will probably happen plenty with matrices?

tpoterba · November 22, 2017, 1:46pm

5 would be a left join, yeah. I think we just shouldn’t support that – it would look too similar to the matrix stuff, too.

tpoterba · November 24, 2017, 3:40am

So this works great for expressing a left join: kt2[kt1.field1]. But how do we replicate the product functionality? I would want to treat that as an array sometimes, and I have absolutely no idea what that syntax should look like.

konradjk · November 25, 2017, 4:18pm

I imagine it will be a separate argument to annotate. To clarify: if product=False, would it simply select an arbitrary entry to annotate (the last seen, I assume)? If so, then I might advocate for changing it to pick or something like that (and have the default be the Array[...]).

tpoterba · November 25, 2017, 4:49pm

I don’t want any other argument to annotate, because it takes arbitrary **kwargs. What if you wanted to define a new field product, which is table.x * table.y?

I think the right place for this option is on the join syntax, but I’m not sure what it would look like.

cseed · November 26, 2017, 3:00am

What about kt1.index(kt2.key, product=True)? What’s the type of the result? An array of structs? Or a struct of arrays?

tpoterba · November 27, 2017, 1:19am

That could work. Do we keep the kt2[kt1.k] syntax at all? Is kt2[kt1.k] just syntactic sugar for kt2.index(kt1.k, product=False)?

The type of kt1.index(kt2.key, product=True) should be Array[Struct], I think.
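To make the two candidate result types concrete, here is a plain-Python sketch of an `index` function with a `product` flag. It is not the hail2 method: rows are dicts, the function takes the rows and key field explicitly, and returning the first match for `product=False` is an arbitrary assumption, since the thread has not settled which row should be picked.

```python
def index(rows, key_field, key, product=False):
    """Look up rows by key.

    product=True  -> list of all matching rows (Array[Struct])
    product=False -> a single matching row, or None if there is none
    """
    matches = [row for row in rows if row[key_field] == key]
    if product:
        return matches
    return matches[0] if matches else None  # assumption: first match wins


def to_struct_of_arrays(structs, fields):
    """The alternative shape: one array per field instead of one struct per row."""
    return {f: [s[f] for s in structs] for f in fields}


kt2 = [
    {"k": "a", "x": 1},
    {"k": "a", "x": 2},
    {"k": "b", "x": 3},
]

single = index(kt2, "k", "a")               # {"k": "a", "x": 1}
many = index(kt2, "k", "a", product=True)   # both rows with k == "a"
columns = to_struct_of_arrays(many, ["k", "x"])
```

`many` is the Array[Struct] shape and `columns` is the struct-of-arrays shape; the first keeps each matched row intact, which is easier to filter or sort afterwards.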
