The original aggregator interface, which assumed one aggregable input per aggregator, has broken down with multiple-input aggregators (`linreg`, `correlation`, `group_by`) and, I would argue, `count` (which takes zero or one value, depending). The problem is the semantics of `filter` and `explode`. Right now `correlation` supports neither; `linreg` supports them in the first argument (and implicitly applies them to the second); and `group_by` … has unexpected semantics, but I don’t remember the details. (@jigold?)
I propose the following. `filter` and `explode` go outside the aggregation expression:

```python
hl.agg.filter(hl.len(x.alleles) == 2, hl.agg.count())
```
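To make the proposed semantics concrete, here is a minimal plain-Python sketch (an illustration, not Hail's implementation) that models an aggregator as a hypothetical `(zero, seq_op, result)` triple, so a filter can wrap any aggregator uniformly:

```python
# Sketch only: model an aggregator as a (zero, seq_op, result) triple.
def count_agg():
    return 0, (lambda state, _x: state + 1), (lambda state: state)

def filter_agg(pred, agg):
    zero, seq, res = agg
    # only elements passing the predicate reach the wrapped aggregator
    return zero, (lambda state, x: seq(state, x) if pred(x) else state), res

def aggregate(agg, rows):
    state, seq, res = agg
    for r in rows:
        state = seq(state, r)
    return res(state)

rows = [  # made-up rows standing in for table rows with an `alleles` field
    {"alleles": ["A", "T"]},
    {"alleles": ["A", "T", "C"]},
    {"alleles": ["G", "C"]},
]
# analogue of hl.agg.filter(hl.len(x.alleles) == 2, hl.agg.count())
biallelic_count = aggregate(
    filter_agg(lambda r: len(r["alleles"]) == 2, count_agg()), rows
)
# → 2
```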
`explode` now takes a lambda over the exploded value:

```python
hl.agg.explode(lambda gene: hl.agg.counter(hl.tuple([gene, x.GT])), x.genes)
```

where the lambda argument is an individual value of the collection. In particular, `x.genes` inside the counter would still have the (constant) array value while `gene` iterates over its values.
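The same sketch style can illustrate the explode semantics: the lambda receives one element of the collection at a time, while the enclosing row (and hence its full array) is unchanged. The row shape and field names below are made up for illustration, not Hail code:

```python
def counter_agg():
    # dict-valued state counting occurrences of each key
    return {}, (lambda state, k: {**state, k: state.get(k, 0) + 1}), (lambda s: s)

def explode_agg(f, get_collection, agg):
    zero, seq, res = agg  # wrapped aggregator as (zero, seq_op, result)
    def seq2(state, row):
        for elem in get_collection(row):
            # f sees one exploded element at a time; the row itself is
            # untouched, so row["genes"] inside f is still the whole array
            state = seq(state, f(elem, row))
        return state
    return zero, seq2, res

def aggregate(agg, rows):
    state, seq, res = agg
    for r in rows:
        state = seq(state, r)
    return res(state)

rows = [
    {"genes": ["BRCA1", "TP53"], "GT": "0/1"},
    {"genes": ["TP53"], "GT": "1/1"},
]
per_gene_gt = aggregate(
    explode_agg(lambda gene, row: (gene, row["GT"]),
                lambda row: row["genes"],
                counter_agg()),
    rows,
)
# → {("BRCA1", "0/1"): 1, ("TP53", "0/1"): 1, ("TP53", "1/1"): 1}
```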
It should also be possible to apply `filter` and `explode` to the second argument of `group_by` in the obvious way.
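Under these semantics, a filter inside the second argument of `group_by` would drop elements from the inner aggregator but not from group creation, so empty groups still appear. A plain-Python sketch (not Hail code), again encoding aggregators as hypothetical `(zero, seq_op, result)` triples:

```python
def count_agg():
    return 0, (lambda s, _x: s + 1), (lambda s: s)

def filter_agg(pred, agg):
    zero, seq, res = agg
    return zero, (lambda s, x: seq(s, x) if pred(x) else s), res

def group_by_agg(key_fn, zero, seq, res):
    def seq2(states, x):
        k = key_fn(x)
        # the group entry is created from the key regardless of the inner
        # filter, so groups whose elements are all filtered still appear
        states[k] = seq(states.get(k, zero), x)
        return states
    return {}, seq2, (lambda states: {k: res(s) for k, s in states.items()})

def aggregate(agg, rows):
    state, seq, res = agg
    for r in rows:
        state = seq(state, r)
    return res(state)

rows = [{"group": "a", "x": 1}, {"group": "a", "x": 3}, {"group": "b", "x": 5}]
zero, seq, res = filter_agg(lambda r: r["x"] < 2, count_agg())
result = aggregate(group_by_agg(lambda r: r["group"], zero, seq, res), rows)
# → {"a": 1, "b": 0}
```

Grouping on the key before applying the (possibly filtering) inner seq_op is what keeps group "b" in the result with a count of 0.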
For `group_by`, I tried a couple of things:
First, just rewriting all of the SeqOps in ApplyAggOp and ApplyScanOp. For `count_where`, this caused groups with a count of 0 to not appear in the dictionary result, because the aggregable is written in terms of filter/count. The same problem applies to `mean`, which was also implemented with filters. I also had to compute the keySet of all keys across the aggregators and iterate over that set, because each aggregation result could have different keys.
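For illustration, the keySet workaround amounts to unioning the keys of each aggregator's per-group result and filling in the missing groups (a sketch with made-up numbers, not Hail internals):

```python
# Each rewritten aggregator produces its own per-group dict; groups whose
# elements were all filtered out are simply absent from that dict.
per_agg_results = [
    {"a": 1},          # count_where(x < 2): group "b" never passed the filter
    {"a": 2, "b": 1},  # plain count: both groups present
]

# Union the keys over all aggregators, then fill gaps with missing (None).
key_set = set().union(*(r.keys() for r in per_agg_results))
combined = {k: tuple(r.get(k) for r in per_agg_results) for k in sorted(key_set)}
# → {"a": (1, 2), "b": (None, 1)}
```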
Second, changing the semantics of filter to set the aggregable to NA rather than filtering out the value, and rewriting the Count aggregator to not count missing values. This was wrong because aggregators like `collect` will fail even though `count_where` is now correct. However, the benefit was that all keys would appear in the dictionary for `group_by`, and I could get rid of computing the keySet.
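A small plain-Python sketch of why this variant breaks: mapping filtered-out elements to NA (here `None`) lets a missing-aware count stay correct, but `collect` now sees the placeholder values:

```python
values = [1, 5, 2]
# filter-as-NA semantics: failing elements become missing instead of disappearing
na_stream = [v if v < 3 else None for v in values]

count = sum(v is not None for v in na_stream)  # Count rewritten to skip missing
collected = list(na_stream)                    # but collect now includes the Nones
# count → 2; collected → [1, None, 2]
```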
On the Scala side, I thought about pulling out the Keyed part of the SeqOp into a separate IR node or as extra arguments to ApplyAggOp, but I can’t remember why I didn’t think that would solve the problems I was encountering.
I added a `CountWhere` aggregator and rewrote `mean` in terms of the new `count_where`, whose implementation doesn’t use a filter. This required checking whether the count == 0 in `mean` so I don’t get `NaN` values. This works fine for a single filtered aggregable, as it will not have the group in the result. But there are still a couple of problems:
- We need to account for multiple aggregables being filtered, like `hl.agg.group_by(t.group, hl.agg.count(hl.agg.filter(t.x < 2)) + hl.agg.count(t.x))`, so we still have to compute the keySet of all possible groups for the result, which is linear in the number of ApplyAggOps. I think this might be okay, as most people won’t have a super complicated aggregation expression.
- We can’t filter the groups, i.e. `hl.agg.group_by(hl.agg.filter(t.group), ...)`, as it’s completely dependent on the SeqOp. This may be possible to add; I haven’t thought about it at all.
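For illustration, the `count_where`-based `mean` amounts to the following plain-Python sketch (names and shapes are mine, not the actual implementation): divide only when the count is nonzero, so an empty group yields missing instead of `NaN`:

```python
def count_where(values, pred):
    # CountWhere counts matches directly; no filter over the aggregable
    return sum(1 for v in values if pred(v))

def mean(values):
    n = count_where(values, lambda _v: True)
    if n == 0:
        return None  # missing rather than 0/0 = NaN for an empty group
    return sum(values) / n
```

So `mean([])` is `None` rather than `NaN`, and `mean([1, 2, 3])` is `2.0`.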
Basically, from my experience with `group_by`, I don’t think we should implement any aggregators in terms of `filter`, as there is an implicit change to the aggregable that is not visible to the user.
Also, I like the proposal! Devil’s advocate question – do we still allow nested filters and explodes?