Group-by Interface

jigold · July 16, 2018, 1:12pm

Problem:
The current Python interface for group_{rows,cols}_by().aggregate() only allows aggregation over entries. The corresponding row or column annotations are dropped in the new matrix table.

dplyr:
dplyr has a group_by function that is analogous to our group_by. It also has group_by_if, group_by_at, and group_by_all variants that give the user more control over the grouping. The counterpart to our aggregate function is summarise; the interfaces are basically identical. Lastly, dplyr has an ungroup function that removes the groupings. Calling summarise does not necessarily convert the grouped data frame to a regular data frame: it drops only the last grouping level, so a frame grouped by several variables stays grouped.

Pandas:
Pandas is similar to Hail. The semantics are groupby().aggregate(), and the result of aggregate is a regular data frame.

PySpark:
The PySpark interface is similar to the Scala Spark interface. There’s a groupBy function, plus aggregateByKey, which takes a zero value, a seqop, and a combop. There’s also a reduceByKey function, which is similar to aggregateByKey but takes no zero value and uses a single function as both seqop and combop.
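
To make the zero value / seqop / combop split concrete, here is a minimal pure-Python sketch of the semantics. It does not use Spark: aggregate_by_key and reduce_by_key are hypothetical helpers written for illustration, partitions are modelled as plain lists, and the keys and values are made-up sample data.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Toy model of aggregateByKey over a list of partitions of (key, value) pairs."""
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            # seq_op folds one value into the per-partition accumulator,
            # starting from the zero value the first time a key is seen
            acc[key] = seq_op(acc.get(key, zero), value)
        partials.append(acc)

    merged = {}
    for acc in partials:
        for key, partial in acc.items():
            # comb_op merges accumulators that came from different partitions
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


def reduce_by_key(partitions, func):
    """Toy model of reduceByKey: one function does both jobs, with no zero value."""
    merged = {}
    for part in partitions:
        for key, value in part:
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Illustrative data: two partitions of (key, value) pairs.
partitions = [[("BRCA1", 1), ("TP53", 2)], [("BRCA1", 3)]]

# Accumulate a (sum, count) pair per key; the accumulator type differs from the value type.
sum_count = aggregate_by_key(
    partitions,
    zero=(0, 0),
    seq_op=lambda acc, v: (acc[0] + v, acc[1] + 1),
    comb_op=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)

# Plain per-key sum; accumulator and value have the same type.
totals = reduce_by_key(partitions, lambda a, b: a + b)
```

The point of the sketch is that aggregateByKey lets the accumulator have a different type from the values, which reduceByKey cannot do.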

Proposal:
Follow dplyr’s semantics, which require an additional function call to return a single matrix table or table. I’m not wedded to the name ungroup; it could also be result, as in the SplitMulti and case builder interfaces. aggregate_entries is the same as the current aggregate. aggregate_{rows,cols} is new and specifies how to aggregate the row and column annotations.

(dataset.group_rows_by(dataset.gene)
        .aggregate_entries(n_non_ref=agg.count_where(dataset.GT.is_non_ref()))
        .aggregate_rows(transcripts=agg.collect_as_set(dataset.transcript))
        .ungroup())

(dataset.group_cols_by(dataset.gene)
        .aggregate_entries(n_non_ref=agg.count_where(dataset.GT.is_non_ref()))
        .aggregate_cols(phenotypes=agg.collect_as_set(dataset.qPheno))
        .ungroup())
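
For readers who want to see the intended semantics without Hail, here is a minimal pure-Python sketch of the row-grouping case. Everything in it is an assumption made for illustration: GroupedRows is a hypothetical stand-in rather than Hail's implementation, aggregations are plain functions over lists instead of agg expressions, the data is made up, and the final call is named result (ungroup would behave identically).

```python
from collections import defaultdict


class GroupedRows:
    """Hypothetical stand-in for a row-grouped matrix table; not Hail's implementation."""

    def __init__(self, rows, key):
        # rows: list of dicts holding row fields plus an "entries" list (one per column)
        self._key = key
        self._groups = defaultdict(list)
        for row in rows:
            self._groups[row[key]].append(row)
        self._entry_aggs = {}
        self._row_aggs = {}

    def aggregate_entries(self, **aggs):
        # each function maps the entries of one (group, column) cell to a value
        self._entry_aggs.update(aggs)
        return self

    def aggregate_rows(self, **aggs):
        # each function maps the rows of one group to a new row field
        self._row_aggs.update(aggs)
        return self

    def result(self):
        out = []
        for group_key, rows in self._groups.items():
            # the new row keeps only the key and the explicitly aggregated fields
            new_row = {self._key: group_key}
            for name, fn in self._row_aggs.items():
                new_row[name] = fn(rows)
            n_cols = len(rows[0]["entries"])
            new_row["entries"] = [
                {name: fn([r["entries"][j] for r in rows])
                 for name, fn in self._entry_aggs.items()}
                for j in range(n_cols)
            ]
            out.append(new_row)
        return out


# Made-up data: three variants (rows), two samples (columns).
rows = [
    {"gene": "G1", "transcript": "T1", "entries": ["0/1", "0/0"]},
    {"gene": "G1", "transcript": "T2", "entries": ["1/1", "0/0"]},
    {"gene": "G2", "transcript": "T3", "entries": ["0/0", "0/1"]},
]

grouped = (
    GroupedRows(rows, "gene")
    .aggregate_entries(n_non_ref=lambda gts: sum(gt != "0/0" for gt in gts))
    .aggregate_rows(transcripts=lambda rs: {r["transcript"] for r in rs})
    .result()
)
```

In this sketch the original transcript field is gone after result(); only the key, the aggregated row fields, and the aggregated entry fields remain.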

pschultz · July 16, 2018, 1:41pm

Making sure I understand: after a group_rows_by, the entry and row fields will be exactly the fields defined by the aggregate_entries and aggregate_rows methods?

If I understand the semantics correctly, I find ungroup very confusing. I like result. And I like the proposed interface.

tpoterba · July 16, 2018, 1:41pm

what about adding optional aggregate_rows and aggregate_cols methods on the grouped matrix table? Then .aggregate is always the last call

jigold · July 16, 2018, 2:01pm

That would be good for backwards compatibility, but I don’t love that aggregate_{rows,cols} has to come before aggregate, or that aggregate implies entry-level aggregations. I think that will be confusing.

tpoterba · July 16, 2018, 2:04pm

that is an obvious drawback, but it makes the common thing nice, I think. There aren’t as many cases when the row/col fields need to be aggregated as well.

tpoterba · July 16, 2018, 2:12pm

I suppose this proposed interface is symmetric to the existing aggregate_rows, aggregate_cols, and aggregate_entries, which is a plus

jigold · August 14, 2018, 4:52pm

What if aggregate were an alias for .aggregate_entries().result()? Then these changes wouldn’t be breaking.
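
A rough sketch of what the alias could look like, assuming a builder-style grouped object. GroupedTable here is hypothetical and only records which methods were called; a real implementation would build the new matrix table in result().

```python
class GroupedTable:
    """Hypothetical builder; only the call structure matters in this sketch."""

    def __init__(self):
        self.calls = []

    def aggregate_entries(self, **aggs):
        self.calls.append(("aggregate_entries", sorted(aggs)))
        return self

    def aggregate_rows(self, **aggs):
        self.calls.append(("aggregate_rows", sorted(aggs)))
        return self

    def result(self):
        # a real implementation would construct the aggregated table here
        return list(self.calls)

    def aggregate(self, **aggs):
        # backwards-compatible shorthand: entry aggregation, then finish
        return self.aggregate_entries(**aggs).result()


# Existing call sites keep working and mean the same thing as the longer form.
old_style = GroupedTable().aggregate(n_non_ref=len)
new_style = GroupedTable().aggregate_entries(n_non_ref=len).result()
```

With this shape, code written against the current aggregate is unchanged, and the row/column aggregations are opt-in through the longer chain.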

tpoterba · August 14, 2018, 5:05pm

+1 to aggregate as an alias. That makes me very happy.
