Hail2 Python interface discussion

Very nice. I also want implicit joins with alternate keys:

  left.annotate_rows(x = other[left.y].z)

Let’s do it (although maybe not in this initial prototype). It seems like an Expression can carry a list of things to join (and where to join them, a UUID or something).
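
To make the "Expression carries a list of joins" idea a bit more concrete, here is a rough pure-Python sketch. It is not Hail code: `Join`, `Expression`, and `index` are hypothetical names made up for illustration, and expressions are modelled as plain strings. Each indexing operation records a pending join under a fresh UUID, and combined expressions inherit the pending joins of their parts.

```python
import uuid
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Join:
    """A pending join: which table, on what key expression, under what temporary name."""
    table_name: str
    key_expr: str
    uid: str


@dataclass(frozen=True)
class Expression:
    """Hypothetical expression node: source text plus the joins it depends on."""
    text: str
    joins: Tuple[Join, ...] = ()

    def __mul__(self, other):
        # Combining two expressions concatenates their pending joins.
        return Expression(f"({self.text} * {other.text})", self.joins + other.joins)


def index(table_name: str, key: Expression, field_name: str) -> Expression:
    """Model other[key].field: record a join and refer to its result by UUID."""
    uid = "j_" + uuid.uuid4().hex[:8]
    join = Join(table_name, key.text, uid)
    # Joins needed to compute the key come before the join that uses it.
    return Expression(f"{uid}.{field_name}", key.joins + (join,))


y = Expression("left.y")
z = index("other", y, "z")          # stands in for other[left.y].z
expr = z * Expression("left.w")     # the product still carries the join
```

Under this sketch, whatever consumes `expr` (an `annotate_rows`, say) would perform the joins in `expr.joins` in order, evaluate the expression, and then drop the temporary fields.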

For 0.2? I think we’re not going to have the optimizer to make this efficient.

Hmm, maybe it’s possible to make this work. I’ll think about it.

I don’t see how this will be less efficient than what we do now. Can you elaborate?

To paste here from Slack, here’s an antagonistic example of why doing implicit joins may not be the right thing:

# correct: the second call sees the reassigned vds
vds = vds.annotate_rows(x = 5)
vds = vds.annotate_rows(y = vds.x * 2)

# produces an implicit join! vds.x still refers to the old vds
vds = vds.annotate_rows(x = 5)\
         .annotate_rows(y = vds.x * 2)
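
For anyone who wants to poke at why the chained form behaves differently, here is a toy model in plain Python (not Hail; `Table` and `Expr` are hypothetical stand-ins). It assumes tables are immutable and that an expression remembers the table object it was built from, so an expression from any other table object counts as needing a join.

```python
class Expr:
    """Toy expression that remembers which table object it was built from."""
    def __init__(self, source, text):
        self.source = source
        self.text = text

    def __mul__(self, other):
        return Expr(self.source, f"({self.text} * {other})")


class Table:
    """Toy immutable table; annotate_rows returns a new Table."""
    def __init__(self, fields=(), implicit_joins=()):
        self.fields = tuple(fields)
        self.implicit_joins = tuple(implicit_joins)

    def __getattr__(self, name):
        # Reached only when normal lookup fails: treat the name as a field reference.
        if name not in self.__dict__.get("fields", ()):
            raise AttributeError(name)
        return Expr(self, name)

    def annotate_rows(self, **exprs):
        # An expression built from a different table object would need a join.
        joins = [n for n, e in exprs.items()
                 if isinstance(e, Expr) and e.source is not self]
        fields = dict.fromkeys(self.fields + tuple(exprs))
        return Table(fields, joins)


vds = Table().annotate_rows(x=5)

# Stepwise: vds.x is built from the same object that gets annotated.
stepwise = vds.annotate_rows(y=vds.x * 2)

# Chained: the name vds is still bound to the old object when vds.x is evaluated.
chained = vds.annotate_rows(x=5).annotate_rows(y=vds.x * 2)
```

In this model `stepwise.implicit_joins` is empty, while `chained.implicit_joins` contains `'y'`, which is the surprise the example above is pointing at.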
In [4]: kt.show()
+-------+--------+
| index | bar    |
+-------+--------+
| Int32 | String |
+-------+--------+
|     0 | foo10  |
|     1 | foo9   |
|     2 | foo8   |
|     3 | foo7   |
|     4 | foo6   |
|     5 | foo5   |
|     6 | foo4   |
|     7 | foo3   |
|     8 | foo2   |
|     9 | foo1   |
+-------+--------+
In [5]: kt2.show()
+-------+--------+
| index | foo    |
+-------+--------+
| Int32 | String |
+-------+--------+
|     0 | foo0   |
|     1 | foo1   |
|     2 | foo2   |
|     3 | foo3   |
|     4 | foo4   |
|     5 | foo5   |
|     6 | foo6   |
|     7 | foo7   |
|     8 | foo8   |
|     9 | foo9   |
+-------+--------+
In [6]: kt.annotate(kt2_index = kt2[kt.bar].index).show()
+-------+--------+-----------+
| index | bar    | kt2_index |
+-------+--------+-----------+
| Int32 | String |     Int32 |
+-------+--------+-----------+
|     8 | foo2   |         2 |
|     0 | foo10  |        NA |
|     4 | foo6   |         6 |
|     3 | foo7   |         7 |
|     1 | foo9   |         9 |
|     7 | foo3   |         3 |
|     2 | foo8   |         8 |
|     5 | foo5   |         5 |
|     6 | foo4   |         4 |
|     9 | foo1   |         1 |
+-------+--------+-----------+
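
As a sanity check on the semantics, the same lookup can be written with plain dictionaries. This is only a sketch, not how Hail executes it, and it assumes kt2 is keyed by `foo` (which the output above suggests but the session does not show). Row order is preserved here, unlike in the shuffled output.

```python
# Rebuild the two tables from the session as lists of dicts.
kt = [{"index": i, "bar": f"foo{10 - i}"} for i in range(10)]
kt2 = [{"index": i, "foo": f"foo{i}"} for i in range(10)]

# Assumption: kt2 is keyed by `foo`, so kt2[kt.bar] looks rows up by that field.
kt2_by_foo = {row["foo"]: row for row in kt2}

# Left join: every row of kt is kept; a missing key gives None (NA in the output).
joined = [
    {**row, "kt2_index": kt2_by_foo.get(row["bar"], {}).get("index")}
    for row in kt
]
```

The row with `bar == "foo10"` has no match in kt2, so its `kt2_index` comes out missing, matching the NA in the table.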

I don’t even. Nice work, @tpoterba!

Here are some thoughts about the explicit joins. To discuss, please mention the point by number!

Key tables

NB: We assume that global fields have been added to tables

1. Reference a row of another table with one key

table1 = table1.annotate(x = table2[table1.key1].x)

2. Using expressions to index is also acceptable:

table1 = table1.annotate(x = table2[f.str(table1.key1 % 100)].x)

3. What about global fields in another table?

table1 = table1.annotate(y = table1.field1 * table2[:].x)

4. What about tables with multiple keys?

table2 = table2.key_by(table2.f1, table2.f2)
table1 = table1.annotate(x = table2[table1.y, table1.z].x)

5. Should this be allowed? I think not…

table2 = table2.key_by(table2.f1, table2.f2)
table1 = table1.annotate(x = table2[table1.y, :].x)

This would roughly translate to:

table2 = table2.key_by(table2.f1, table2.f2)
table_anon = table2.key_by(table2.f1)
table1 = table1.annotate(x = table_anon[table1.y].x)

Matrices

Things get interesting now that we have multiple keys to throw around!
NB: we assume that we’ve implemented methods to join globals and left join on entries

6. Do the equivalent of an annotate_variants_vds

m1 = m1.annotate_rows(x = m2[m1.v, :].x)

7. This follows for globals, cols, entries:

m1 = m1.annotate_globals(foo = m2[:, :].foo)
m1 = m1.annotate_cols(bar = m2[:, m1.s].bar)
m1 = m1.annotate_entries(gt2 = m2[m1.v, m1.s].GT)

8. What happens if the row key of m2 includes two fields?

m1 = m1.annotate_rows(x = m2[(m1.gene, m1.consequence), :].x)

There’s a somewhat subtler point about the indices of an expression derived from joins. Generally, the index is going to be the union of indices of expressions used. With this model, the following work:

vds2[:, :] has no indices
vds2[vds.v, :] has index {'row'}
vds2[:, vds.s] has index {'col'}
vds2[vds.v, vds.s] has indices {'row', 'col'}

But what if vds2 has two row keys, of types Variant and String, and some other column key of type Int32?

vds2[(vds.v, vds.s), :] should expose the row annotations of vds2 as a per-entry value in vds. This is going to be tricky to get right.
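
The "union of indices" rule is easy to state in code. Below is a small sketch in plain Python (the function name `join_indices` and the set representation are assumptions for illustration only): each key expression carries a set of indices, a bare `:` contributes nothing, and the result of the join is the union.

```python
ROW, COL = "row", "col"


def join_indices(row_key=(), col_key=()):
    """Union of the indices of every expression used as a key; `:` is passed as ()."""
    indices = set()
    for expr_indices in tuple(row_key) + tuple(col_key):
        indices |= expr_indices
    return indices


v = {ROW}  # indices of vds.v
s = {COL}  # indices of vds.s

globals_idx = join_indices()                    # vds2[:, :]
rows_idx = join_indices(row_key=[v])            # vds2[vds.v, :]
cols_idx = join_indices(col_key=[s])            # vds2[:, vds.s]
entries_idx = join_indices([v], [s])            # vds2[vds.v, vds.s]
tricky_idx = join_indices(row_key=[v, s])       # vds2[(vds.v, vds.s), :]
```

The last case is the tricky one: both key expressions sit in the row-key slot of vds2, yet the union is `{'row', 'col'}`, so a row annotation of vds2 becomes a per-entry value in vds.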

9. Multiple joins

This should be possible:

kt = kt.annotate(x = kt4[kt3[kt2[kt.foo].bar].baz].x)

Re: 9, this actually just works!
(the AST is constructed inside out, so I just need to do the joins in reverse order!)

import hail as h
from hail2 import *

hc = HailContext()
kt = h.KeyTable.range(1).drop('index').to_hail2()
kt = kt.annotate(a='foo')

kt1 = h.KeyTable.range(1).drop('index').to_hail2()
kt1 = kt1.annotate(a= 'foo', b = 'bar').key_by('a')

kt2 = h.KeyTable.range(1).drop('index').to_hail2()
kt2 = kt2.annotate(b = 'bar', c = 'baz').key_by('b')

kt3 = h.KeyTable.range(1).drop('index').to_hail2()
kt3 = kt3.annotate(c = 'baz', d = 'qux').key_by('c')

kt4 = h.KeyTable.range(1).drop('index').to_hail2()
kt4 = kt4.annotate(d='qux', e='quam').key_by('d')

kt.annotate(test = kt4[kt3[kt2[kt1[kt.a].b].c].d].e).show()
+--------+--------+
| a      | e      |
+--------+--------+
| String | String |
+--------+--------+
| foo    | quam   |
+--------+--------+
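
For comparison, here is the same chain written as eager dictionary lookups in plain Python. This is just a sketch of the semantics, not of Hail's AST handling, and the `Keyed` wrapper is a hypothetical helper that records the order in which tables are consulted. My reading of "inside out" is that the innermost index has to be resolved first, which is what falls out here.

```python
lookup_order = []


class Keyed:
    """Toy keyed table: a dict of key -> row that logs each lookup."""
    def __init__(self, name, rows):
        self.name = name
        self.rows = rows

    def __getitem__(self, key):
        lookup_order.append(self.name)
        return self.rows[key]


kt = [{"a": "foo"}]
kt1 = Keyed("kt1", {"foo": {"a": "foo", "b": "bar"}})
kt2 = Keyed("kt2", {"bar": {"b": "bar", "c": "baz"}})
kt3 = Keyed("kt3", {"baz": {"c": "baz", "d": "qux"}})
kt4 = Keyed("kt4", {"qux": {"d": "qux", "e": "quam"}})

# Same shape as kt4[kt3[kt2[kt1[kt.a].b].c].d].e, evaluated once per row of kt.
result = [kt4[kt3[kt2[kt1[row["a"]]["b"]]["c"]]["d"]]["e"] for row in kt]
```

Here `result` holds the single value `'quam'`, and `lookup_order` shows kt1 consulted first and kt4 last.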

Is number 5 intended to be a left join (on only one of the keys of table2)? If so, I agree with the requirement to re-key and join again (assuming it’s as fast).

Re: your subtler point about multiple row keys, I assume this will nearly never happen with a straight vds, but will probably happen plenty with matrices?

5 would be a left join, yeah. I think we just shouldn’t support that – it would look too similar to the matrix stuff, too.

So this works great for expressing a left join: kt2[kt1.field1]. But how do we replicate the product functionality? I would want to treat that as an array sometimes, and I have absolutely no idea what that syntax should look like.

I imagine it will be a separate argument to annotate. To clarify: if product=False, would it simply select an arbitrary entry to annotate (the last seen, I assume)? If so, I might advocate for changing it to pick or something like that (and have the default be the Array[...]).

I don’t want any other argument to annotate, because it takes arbitrary **kwargs. What if you wanted to define a new field named product, computed as table.x * table.y?

I think the right place for this option is on the join syntax, but I’m not sure what it would look like.

What about kt1.index(kt2.key, product=True)? What’s the type of the result? An array of structs? Or a struct of arrays?

That could work. Do we keep the kt2[kt1.k] syntax at all? Is kt2[kt1.k] just syntactic sugar for kt2.index(kt1.k, product=False)?

The type of kt1.index(kt2.key, product=True) should be Array[Struct], I think.
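
A quick plain-Python sketch of the two behaviours, to make the types concrete. This is a toy `index` function over a list of dicts, not the proposed Hail method. Which row is returned when product=False is an assumption here (the last one seen, as guessed earlier in the thread); nothing above settles it.

```python
def index(rows, key_field, key, product=False):
    """Look up `key` in `rows` by `key_field`.

    product=True  -> list of all matching rows (the Array[Struct] shape)
    product=False -> a single row or None; picking the last match is an assumption
    """
    matches = [row for row in rows if row[key_field] == key]
    if product:
        return matches
    return matches[-1] if matches else None


kt2 = [
    {"k": "a", "x": 1},
    {"k": "a", "x": 2},
    {"k": "b", "x": 3},
]

all_a = index(kt2, "k", "a", product=True)   # both rows with k == "a"
one_a = index(kt2, "k", "a")                 # a single row
missing = index(kt2, "k", "z")               # no match
```

With product=True a missing key gives an empty list rather than a missing value, which is one more thing to decide for the real API.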

Extending this:

Table:
kt2[kt.k] === kt2.index(kt.k, product=False)
kt2[:] === kt2.index_globals()

Matrix:
m2[:, :] === m2.index_globals()
m2[m.v, :] === m2.index_rows(m.v, product=False)
m2[:, m.s] === m2.index_cols(m.s, product=False)
m2[m.v, m.s] === m2.index_entries(m.v, m.s, product=False)
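
The desugaring for matrices can be sketched with `__getitem__` in plain Python. This is a toy `Matrix` whose index methods just report which one was called; the method names follow the list above, everything else is made up for illustration. Inside square brackets a bare `:` arrives as `slice(None)`, which is what the dispatch checks for.

```python
FULL = slice(None)  # what a bare `:` becomes inside [...]


def _is_full(key):
    return isinstance(key, slice) and key == FULL


class Matrix:
    """Toy matrix: each index_* method returns a tuple naming the call."""
    def index_globals(self):
        return ("index_globals",)

    def index_rows(self, row_key, product=False):
        return ("index_rows", row_key, product)

    def index_cols(self, col_key, product=False):
        return ("index_cols", col_key, product)

    def index_entries(self, row_key, col_key, product=False):
        return ("index_entries", row_key, col_key, product)

    def __getitem__(self, item):
        # m[a, b] passes the tuple (a, b) to __getitem__.
        row_key, col_key = item
        if _is_full(row_key) and _is_full(col_key):
            return self.index_globals()
        if _is_full(col_key):
            return self.index_rows(row_key, product=False)
        if _is_full(row_key):
            return self.index_cols(col_key, product=False)
        return self.index_entries(row_key, col_key, product=False)


m2 = Matrix()
calls = [m2[:, :], m2["v", :], m2[:, "s"], m2["v", "s"]]
```

A compound row key such as `m2[("gene", "consequence"), :]` also goes through `index_rows` here, since the tuple sits in the first slot.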

I kinda like this, actually. We can clearly document it as a shorthand, and full documentation can go on the index things.

We need to think about how locus / interval joins work in this model, too.