(This is motivated by a use case from @konradjk. I will write it up in a separate thread on “use cases Hail doesn’t handle”.)
I propose to add a new scan facility. I will describe it for
Tables. I’ll discuss generalization to
MatrixTables at the end.
Scans are a aggregator-like generalization of
Table.index. A scan applies an aggregator A: T => V (by this I mean an aggregator that aggregates an expression of type T into a value of type V) to an expression of type T to produce a new field of type V.
It works as follows: aggregate each partition with the seqOp to produce a per-partition aggregation intermediate. Collect these. Perform a
scanLeft to produce a running aggregation for each partition. The first partition starts with the zero value. The aggregation of the last partition isn’t needed. Broadcast these. Scan through the partitions again, this time, for each row, insert the result into the row and update the aggregator state using the seqOp.
Table.index just becomes
Table.scan(index = hl.agg.count()). (Maybe
There should be a reverse scan.
There should be
MatrixTable. We could even have
scan_entries, but there’s a question of total order for the entries. There could be two versions, row major and column major? Not sure if this is useful.