Merging ref blocks in sparse MT

This is a quick note to record an algorithm that may be useful for working with sparse matrix tables. It can be used to fuse adjacent ref blocks, e.g. to coarsen the GQ binning or get rid of it entirely. It can also be used to construct an index which, given a locus l, looks up the earlier locus at which you must start reading ref blocks in order to find all blocks containing l, or simply to compute stats about how efficient such an index would be (on average, how many blocks not containing l must be skipped over).

Basic algorithm

We stream through ref blocks ordered by start locus. Say we’re currently processing blocks starting at l_cur. Split previously seen blocks into three groups. Blocks containing l_cur are “open”. Let l_trailing be the minimum start locus among all open blocks. Non-open blocks which start before l_trailing are “closed”, and non-open blocks which start at or after l_trailing are “pending”. We need to keep open and pending blocks in memory, in separate data structures, but closed blocks are no longer needed.

For example, if we’re constructing an index, then we should add l_cur -> l_trailing to the index, as l_trailing is where we need to start reading to find all open blocks.

There can be at most one open block per sample, so we keep open blocks in an array open_blocks of length num_samples. Each time we advance l_cur, we read all blocks starting at l_cur and update open_blocks. If the average number of blocks starting at any locus is sufficiently high relative to the number of samples, we can simply scan over open_blocks and compute l_trailing from scratch. Otherwise, we can maintain a tournament tree (a simple kind of priority queue, best suited to a fixed number of elements) over the open blocks. I suspect the ref blocks are dense enough that we just want to do the scan each time.
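As a concrete sketch of the scan (assuming blocks are represented as (start, end, sample) tuples with inclusive bounds; all names here are illustrative, not from any real API):

```python
from typing import Optional

def trailing_start(open_blocks: list) -> Optional[int]:
    """Scan the per-sample array (one slot per sample, None if that sample
    currently has no open block) and return l_trailing, the minimum start
    locus among open blocks, or None if no block is open."""
    starts = [b[0] for b in open_blocks if b is not None]
    return min(starts) if starts else None
```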

When we advance l_cur, we also want to kick out from open_blocks any block which ends before l_cur. (If we’re fusing blocks, we want to delay kicking out until we’ve seen if a new block can be fused with the current one.)
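A sketch of the eviction step, under the same assumptions ((start, end, sample) tuples with inclusive end; names illustrative):

```python
def evict_ended(open_blocks: list, l_cur: int, kicked: list) -> None:
    """Remove from open_blocks every block that ends before l_cur and
    append it to kicked (which the caller pushes onto pending_blocks)."""
    for i, b in enumerate(open_blocks):
        if b is not None and b[1] < l_cur:  # block ended before l_cur
            open_blocks[i] = None
            kicked.append(b)
```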

Blocks which are kicked out go to a priority queue pending_blocks, keyed by start locus. Each time we advance l_cur and update l_trailing, some of the pending blocks may have become closed. We can simply pop blocks from pending_blocks until the start locus of the head block is >= l_trailing. This maintains the invariant that blocks popped from pending_blocks are in increasing order by start locus. (We can’t pop blocks which start at or after l_trailing, because there are still open blocks starting no later than l_trailing which will need to be popped in the future.)
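With pending_blocks as a min-heap keyed on start locus (blocks again as (start, end, sample) tuples; this sketch assumes that layout), the popping step looks like:

```python
import heapq

def pop_closed(pending_blocks: list, l_trailing: int, emit) -> None:
    """Pop now-closed blocks, i.e. those whose start locus is < l_trailing,
    and hand each to emit. Because pending_blocks is a min-heap on start
    locus, emitted blocks come out in nondecreasing start order."""
    while pending_blocks and pending_blocks[0][0] < l_trailing:
        emit(heapq.heappop(pending_blocks))
```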

Special cases

If we’re fusing blocks, we can write blocks to a new file as they’re popped from pending_blocks: they are guaranteed to be popped in increasing order of start locus. Depending on the distribution of block sizes, pending_blocks may grow too big for memory. To handle that possibility, we could set a max size for pending_blocks. When the max size is exceeded, just pop blocks from pending_blocks until the size is small enough. Let l_buffered be the min start locus of all blocks in pending_blocks.
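A sketch of the spill step (the max size and the writer are illustrative stand-ins):

```python
import heapq

def spill_to_fit(pending_blocks: list, max_size: int, write):
    """If pending_blocks (a min-heap on start locus) exceeds max_size, pop
    and write blocks until it fits, then return l_buffered: the minimum
    start locus still buffered (None if the queue drained completely)."""
    while len(pending_blocks) > max_size:
        write(heapq.heappop(pending_blocks))
    return pending_blocks[0][0] if pending_blocks else None
```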

It is now possible that blocks kicked out of open_blocks may start earlier than l_buffered. We can’t add those to pending_blocks without violating the invariant that blocks popped from pending_blocks are in increasing order of start locus. There are two options. First, we can push such blocks to a second priority queue, which writes out to a second ordered file; the two files can then be joined in a second pass. This scheme could be repeated to any number of levels, but it’s not clear how to choose the max size for each level. Alternatively (or in addition), when a block can’t be pushed to another queue, we can just write it to an unordered file. The unordered file could then be shuffled before joining. We should use knowledge of the distribution of block sizes to ensure that this file will be small enough to shuffle.
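One way to sketch the routing of evicted blocks under this scheme (all names hypothetical; overflow is a stand-in for whichever sink is used, a second queue or an unordered file to be shuffled later):

```python
import heapq

def route_evicted(block, pending_blocks: list, l_buffered, overflow) -> None:
    """Push block (a (start, end, sample) tuple) onto pending_blocks unless
    its start locus is earlier than l_buffered, which would break the
    ordering invariant; such blocks are handed to overflow instead."""
    if l_buffered is None or block[0] >= l_buffered:
        heapq.heappush(pending_blocks, block)
    else:
        overflow(block)
```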

I believe similar tricks could be used to cap the memory usage when building an index or computing stats.