Proposal: scans

(This is motivated by a use case from @konradjk. I will write it up in a separate thread on “use cases Hail doesn’t handle”.)

I propose to add a new scan facility. I will describe it for Tables. I’ll discuss generalization to MatrixTables at the end.

Scans are a aggregator-like generalization of Table.index. A scan applies an aggregator A: T => V (by this I mean an aggregator that aggregates an expression of type T into a value of type V) to an expression of type T to produce a new field of type V.

It works as follows: aggregate each partition with the seqOp to produce a per-partition aggregation intermediate. Collect these. Perform a scanLeft to produce a running aggregation for each partition. The first partition starts with the zero value. The aggregation of the last partition isn’t needed. Broadcast these. Scan through the partitions again, this time, for each row, insert the result into the row and update the aggregator state using the seqOp.

Then Table.index just becomes Table.scan(index = hl.agg.count()). (Maybe hl.scan.count?)

There should be a reverse scan.

There should be scan_rows and scan_columns for MatrixTable. We could even have scan_entries, but there’s a question of total order for the entries. There could be two versions, row major and column major? Not sure if this is useful.