Group-by Interface

Problem:
The current Python interface for group_{rows,cols}_by().aggregate() only allows aggregation over entries. The corresponding row or column annotations are dropped in the new matrix table.

dplyr:
dplyr has a group_by function that is analogous to our group_by. It also has group_by_if, group_by_at, and group_by_all variants that give the user more control over which columns are grouped. The counterpart to our aggregate function is summarise, and the two interfaces are basically identical. Lastly, dplyr has an ungroup function, which removes the groupings. Calling summarise does not necessarily convert the grouped data frame to a regular data frame: by default it drops only the last grouping variable, so a data frame grouped by more than one variable is still grouped afterwards.

Pandas:
Pandas is similar to Hail. The semantics are groupby().aggregate(), and the result of aggregate is a regular data frame.

PySpark:
The PySpark interface is similar to the Scala Spark interface. There’s a groupBy function, and there’s aggregateByKey, which takes a zero value, a seqop, and a combop. There’s also a reduceByKey function, which is similar to aggregateByKey but takes a single function (used both within and across partitions) and no zero value.
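To make the zero value / seqop / combop split concrete, here is a minimal plain-Python sketch of the semantics described above. It is not PySpark: the helpers `aggregate_by_key` and `reduce_by_key` are hypothetical, and "partitions" are just lists of (key, value) pairs.

```python
def aggregate_by_key(partitions, zero_value, seq_op, comb_op):
    """Emulate aggregateByKey over a list of partitions of (key, value) pairs."""
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            # seq_op folds one value into the per-key accumulator within a partition.
            acc[key] = seq_op(acc.get(key, zero_value), value)
        per_partition.append(acc)

    merged = {}
    for acc in per_partition:
        for key, partial in acc.items():
            # comb_op merges accumulators that came from different partitions.
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


def reduce_by_key(partitions, func):
    """Emulate reduceByKey: one function, no zero value, result type == value type."""
    merged = {}
    for part in partitions:
        for key, value in part:
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Made-up data: two partitions of (gene, count) pairs.
partitions = [
    [("BRCA1", 2), ("TP53", 1)],
    [("BRCA1", 4), ("TP53", 3), ("BRCA1", 6)],
]

# Accumulator is a (sum, count) tuple, which differs from the value type (int).
sums_counts = aggregate_by_key(
    partitions,
    zero_value=(0, 0),
    seq_op=lambda acc, v: (acc[0] + v, acc[1] + 1),
    comb_op=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)

totals = reduce_by_key(partitions, lambda a, b: a + b)
```

The difference that matters for this discussion is that aggregateByKey lets the accumulator have a different type from the values, while reduceByKey does not.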

Proposal:
Follow dplyr’s semantics, which require an additional function call to return a single matrix table or table. I’m not wedded to the name ungroup; it could also be result, as in the SplitMulti and case builder interfaces. aggregate_entries is the same as the current aggregate. aggregate_{rows,cols} is new and specifies how to aggregate the row and column annotations.

# Group rows by gene; aggregate the entries and the row annotations separately.
(dataset.group_rows_by(dataset.gene)
        .aggregate_entries(n_non_ref=agg.count_where(dataset.GT.is_non_ref()))
        .aggregate_rows(transcripts=agg.collect_as_set(dataset.transcript))
        .ungroup())

# Same pattern for columns: aggregate the entries and the column annotations.
(dataset.group_cols_by(dataset.gene)
        .aggregate_entries(n_non_ref=agg.count_where(dataset.GT.is_non_ref()))
        .aggregate_cols(phenotypes=agg.collect_as_set(dataset.qPheno))
        .ungroup())
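For readers who want to see the call flow end to end, here is a toy plain-Python sketch of the proposed builder pattern. It is not Hail code: `GroupedRows`, the `entries` list, and the lambda aggregators are all hypothetical stand-ins, and the final call is spelled `result` (one of the two names under discussion).

```python
from collections import defaultdict


class GroupedRows:
    """Hypothetical stand-in for a row-grouped matrix table (not Hail's class)."""

    def __init__(self, rows, key):
        self._key = key
        self._groups = defaultdict(list)
        for row in rows:
            self._groups[row[key]].append(row)
        self._entry_aggs = {}
        self._row_aggs = {}

    def aggregate_entries(self, **named):
        # Record entry-level aggregators; nothing is computed yet.
        self._entry_aggs.update(named)
        return self

    def aggregate_rows(self, **named):
        # Record row-annotation aggregators; nothing is computed yet.
        self._row_aggs.update(named)
        return self

    def result(self):
        # Only here is the grouped object turned back into a regular table.
        out = []
        for group_key, rows in self._groups.items():
            new_row = {self._key: group_key}
            for name, fn in self._row_aggs.items():
                new_row[name] = fn(rows)
            n_cols = len(rows[0]["entries"])
            new_row["entries"] = [
                {name: fn([r["entries"][j] for r in rows])
                 for name, fn in self._entry_aggs.items()}
                for j in range(n_cols)
            ]
            out.append(new_row)
        return out


# Made-up data: each entry is the number of alternate alleles for one sample.
rows = [
    {"gene": "A", "transcript": "t1", "entries": [0, 1]},
    {"gene": "A", "transcript": "t2", "entries": [2, 0]},
    {"gene": "B", "transcript": "t3", "entries": [0, 0]},
]

table = (GroupedRows(rows, "gene")
         .aggregate_entries(n_non_ref=lambda gts: sum(1 for g in gts if g > 0))
         .aggregate_rows(transcripts=lambda rs: {r["transcript"] for r in rs})
         .result())
```

In this sketch the output rows carry only the group key plus whatever fields the two aggregate calls define.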

Making sure I understand: after a group_rows_by, the entry and row fields will be exactly the fields defined by the aggregate_entries and aggregate_rows methods?

If I understand the semantics correctly, I find ungroup very confusing. I like result. And I like the proposed interface.

What about adding optional aggregate_rows and aggregate_cols methods on the grouped matrix table? Then .aggregate is always the last call.

That would be good for backwards compatibility, but I don’t love that aggregate_{rows,cols} has to come before aggregate, or that aggregate implies entry-level aggregations. I think that will be confusing.

That is an obvious drawback, but it makes the common case nice, I think. There aren’t many cases where the row/col fields need to be aggregated as well.

I suppose this proposed interface is symmetric to the existing aggregate_rows, aggregate_cols, and aggregate_entries, which is a plus.

What if aggregate were an alias for .aggregate_entries().result()? That would mean these changes aren’t breaking.

+1 to aggregate as an alias. That makes me very happy.
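A minimal sketch of what the alias could look like, in plain Python rather than Hail. `GroupedTable` and its dict-of-lists storage are hypothetical; only the delegation from `aggregate` to `aggregate_entries(...).result()` is the point.

```python
class GroupedTable:
    """Hypothetical grouped object; only the call structure matters here."""

    def __init__(self, groups):
        self._groups = groups      # dict: group key -> list of entry values
        self._entry_aggs = {}

    def aggregate_entries(self, **named):
        self._entry_aggs.update(named)
        return self

    def result(self):
        return {
            key: {name: fn(values) for name, fn in self._entry_aggs.items()}
            for key, values in self._groups.items()
        }

    def aggregate(self, **named):
        # Backwards-compatible alias: existing callers keep working unchanged.
        return self.aggregate_entries(**named).result()


def count_non_ref(gts):
    # Entries are made-up alternate-allele counts; non-ref means count > 0.
    return sum(1 for g in gts if g > 0)


groups = {"A": [0, 1, 2], "B": [0, 0]}

old_style = GroupedTable(groups).aggregate(n_non_ref=count_non_ref)
new_style = (GroupedTable(groups)
             .aggregate_entries(n_non_ref=count_non_ref)
             .result())
```

Both spellings produce the same table here, which is what makes the change non-breaking.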