Problem:
The current Python interface for group_{rows,cols}_by().aggregate() only allows aggregation over entries. The corresponding row or column annotations are dropped in the new matrix table.
dplyr:
dplyr has a group_by function which is analogous to our group_by. It also has extra group_by_if, group_by_at, and group_by_all functions that give the user more control over the grouping. The function corresponding to our aggregate is summarise; the interfaces are basically identical. Lastly, dplyr has an ungroup function which removes the groupings. Calling summarise does not by itself guarantee a regular data frame: by default it drops only the last grouping variable, so a data frame grouped by several variables is still grouped afterwards.
Pandas:
Pandas is similar to Hail. The semantics are groupby().aggregate(), and the result of aggregate is a regular data frame.
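A minimal sketch of the Pandas behaviour described above, on made-up toy data (the column names gene, transcript, and n_non_ref are illustrative assumptions, not taken from a real dataset). It shows that groupby() returns an intermediate grouped object, while aggregate() returns an ordinary data frame with no extra "ungroup" step:

```python
import pandas as pd

# Toy data: one row per variant, annotated with a gene and a transcript.
df = pd.DataFrame({
    "gene": ["A", "A", "B"],
    "transcript": ["t1", "t2", "t3"],
    "n_non_ref": [2, 3, 5],
})

# groupby() yields a DataFrameGroupBy, not a DataFrame.
grouped = df.groupby("gene")

# Named aggregation: each keyword is an output column,
# given as (input column, aggregation function).
result = grouped.aggregate(
    n_non_ref=("n_non_ref", "sum"),
    n_transcripts=("transcript", "nunique"),
)

# aggregate() hands back a regular DataFrame indexed by the group key.
print(type(result).__name__)
print(result)
```

Note that every output column here is produced by the same aggregate call; Pandas does not distinguish between "entries" and "annotations" the way a matrix table does.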
PySpark:
The PySpark interface is similar to the Scala Spark interface. There’s a groupBy function and an aggregateByKey function, which takes a zero value, a seqop, and a combop. There’s also a reduceByKey function, which is similar to aggregateByKey but takes a single merge function and no zero value.
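To make the roles of the zero value, seqop, and combop concrete, here is a pure-Python emulation of the two operations. It is only a sketch: aggregate_by_key and reduce_by_key are hypothetical helper names, the "partitions" are plain lists, and none of this is PySpark code. It assumes the usual contract that seqop folds one value into a per-partition accumulator and combop merges accumulators across partitions.

```python
def aggregate_by_key(partitions, zero, seqop, combop):
    """Toy stand-in for aggregateByKey over lists of (key, value) pairs."""
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            # seqop: fold one value into this partition's accumulator,
            # starting from the zero value the first time a key is seen.
            acc[key] = seqop(acc.get(key, zero), value)
        per_partition.append(acc)

    merged = {}
    for acc in per_partition:
        for key, partial in acc.items():
            # combop: merge accumulators produced by different partitions.
            merged[key] = combop(merged[key], partial) if key in merged else partial
    return merged


def reduce_by_key(partitions, func):
    """Toy stand-in for reduceByKey: one function, no zero value."""
    merged = {}
    for part in partitions:
        for key, value in part:
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


partitions = [[("A", 1), ("B", 2)], [("A", 3)]]

# Accumulator is a (sum, count) pair, so its type differs from the values.
sum_count = aggregate_by_key(
    partitions,
    zero=(0, 0),
    seqop=lambda acc, v: (acc[0] + v, acc[1] + 1),
    combop=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
totals = reduce_by_key(partitions, lambda a, b: a + b)
```

The design difference this illustrates: with a zero value and separate seqop and combop, the accumulator can have a different type from the input values, whereas the single function of reduce_by_key must map two values to a value of the same type.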
Proposal:
Follow dplyr’s semantics, which require an additional function call to return a single matrix table or table. I’m not wedded to the name ungroup; it could also be result, as in the SplitMulti and case builder interfaces. aggregate_entries is the same as the current aggregate. aggregate_{rows,cols} is new and specifies how to aggregate the row and column annotations.
(dataset.group_rows_by(dataset.gene)
    .aggregate_entries(n_non_ref=agg.count_where(dataset.GT.is_non_ref()))
    .aggregate_rows(transcripts=agg.collect_as_set(dataset.transcript))
    .ungroup())

(dataset.group_cols_by(dataset.gene)
    .aggregate_entries(n_non_ref=agg.count_where(dataset.GT.is_non_ref()))
    .aggregate_cols(phenotypes=agg.collect_as_set(dataset.qPheno))
    .ungroup())
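The builder semantics of the proposal can be sketched with a small pure-Python toy. Everything here is hypothetical: GroupedRows, its list-of-dicts "rows", and the use of plain functions in place of agg expressions are illustrative assumptions, not Hail's implementation, and entries are left out entirely. The point is only that aggregate_rows records what to compute and returns the grouped object, while ungroup() performs the grouping and returns a regular result.

```python
class GroupedRows:
    """Hypothetical toy model of the proposed builder; not Hail code."""

    def __init__(self, rows, key):
        self._rows = rows      # list of dicts, one per row
        self._key = key        # function: row -> group key
        self._row_aggs = {}    # output name -> function(list of rows) -> value

    def aggregate_rows(self, **named_aggs):
        # Only record the aggregations; nothing is computed until ungroup().
        self._row_aggs.update(named_aggs)
        return self

    def ungroup(self):
        # Bucket the rows by key, then apply every recorded aggregation.
        groups = {}
        for row in self._rows:
            groups.setdefault(self._key(row), []).append(row)
        return [
            {"key": k, **{name: f(members) for name, f in self._row_aggs.items()}}
            for k, members in groups.items()
        ]


rows = [
    {"gene": "A", "transcript": "t1"},
    {"gene": "A", "transcript": "t2"},
    {"gene": "B", "transcript": "t3"},
]

result = (
    GroupedRows(rows, key=lambda r: r["gene"])
    .aggregate_rows(transcripts=lambda rs: {r["transcript"] for r in rs})
    .ungroup()
)
```

Because each aggregate_* call returns the grouped object, calls can be chained in any order, and the explicit final call makes it unambiguous when the user gets back an ordinary table.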