# Merging ref blocks in sparse MT

This is a quick note to record an algorithm that may be useful for working with sparse matrix tables. It can be used to fuse adjacent ref blocks, e.g. if we want to coarsen the GQ binning or get rid of it entirely. It can also be used to construct an index which, given a locus `l`, looks up the earliest locus from which you must start reading ref blocks in order to find all blocks containing `l`, or just to compute stats about how efficient such an index would be (how many blocks not containing `l` you have to skip over, on average).

# Basic algorithm

We stream through ref blocks ordered by start locus. Say we’re currently processing blocks starting at `l_cur`. Split previously seen blocks into three groups. Blocks containing `l_cur` are “open”. Let `l_trailing` be the minimum start locus among all open blocks. Blocks which start before `l_trailing` are “closed”, and blocks which start at or after `l_trailing` are “pending”. We need to keep open and pending blocks in memory, in separate data structures, but closed blocks are no longer needed.
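A minimal sketch of this classification, assuming a toy block record (the `Block` fields and the `classify` helper are hypothetical names, not Hail’s schema; the real algorithm maintains these groups incrementally rather than reclassifying from scratch):

```python
from dataclasses import dataclass

# Hypothetical minimal ref-block record; field names are assumptions.
@dataclass(frozen=True)
class Block:
    sample: int
    start: int  # start locus (inclusive)
    end: int    # end locus (inclusive)

def classify(seen_blocks, l_cur):
    """Split previously seen blocks into open / closed / pending.

    A block is "open" if it contains l_cur. l_trailing is the minimum
    start locus among open blocks. Non-open blocks are "closed" if they
    start before l_trailing and "pending" otherwise.
    """
    open_ = [b for b in seen_blocks if b.start <= l_cur <= b.end]
    l_trailing = min((b.start for b in open_), default=l_cur)
    closed = [b for b in seen_blocks if b not in open_ and b.start < l_trailing]
    pending = [b for b in seen_blocks if b not in open_ and b.start >= l_trailing]
    return open_, closed, pending, l_trailing
```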

For example, if we’re constructing an index, then we should add `l_cur -> l_trailing` to the index, as `l_trailing` is where we need to start reading to find all open blocks.

There can be at most one open block per sample, so we keep open blocks in a length `num_samples` array `open_blocks`. Each time we advance `l_cur`, read all blocks starting at `l_cur` and update `open_blocks`. If the average number of blocks starting at each locus is sufficiently high, relative to the number of samples, we can just scan over `open_blocks` and compute `l_trailing` from scratch. Otherwise, we can maintain a tournament tree (a simple kind of priority queue well suited to a fixed set of elements) over the open blocks. I suspect the ref blocks are dense enough that we just want to do the scan each time.
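The scan variant is straightforward. A sketch, again with a hypothetical `Block` record (`None` marks a sample with no open block):

```python
from collections import namedtuple

# Hypothetical block record; fields are assumptions.
Block = namedtuple("Block", ["sample", "start", "end"])

def l_trailing_by_scan(open_blocks):
    """Recompute l_trailing from scratch with a linear scan over the
    per-sample open_blocks array (entries are Block or None)."""
    starts = [b.start for b in open_blocks if b is not None]
    return min(starts, default=None)

num_samples = 4
open_blocks = [None] * num_samples
# advance to l_cur = 7: blocks starting at 7 replace their sample's slot
for b in [Block(sample=1, start=7, end=9), Block(sample=3, start=7, end=20)]:
    open_blocks[b.sample] = b
open_blocks[0] = Block(sample=0, start=2, end=15)  # still open from earlier
```

The tournament-tree alternative would replace the scan with an `O(log num_samples)` update per changed slot, at the cost of extra bookkeeping.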

When we advance `l_cur`, we also want to kick out from `open_blocks` any block which ends before `l_cur`. (If we’re fusing blocks, we want to delay kicking a block out until we’ve seen whether a new block starting at `l_cur` can be fused with it.)

Blocks which are kicked out go to a priority queue `pending_blocks`, keyed by start locus. Each time we advance `l_cur` and update `l_trailing`, some of the pending blocks may have become closed. We can simply pop blocks from `pending_blocks` until the start locus of the head block is `>= l_trailing`. This maintains the invariant that blocks popped from `pending_blocks` are in increasing order by start locus. (We can’t pop blocks which start after `l_trailing`, because there are still open blocks that start earlier which will need to be popped in the future.)
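The pop loop can be sketched with `heapq`, pushing `(start, block)` pairs so the heap is keyed by start locus (`drain_closed` and the `Block` record are hypothetical names):

```python
import heapq
from collections import namedtuple

# Hypothetical block record; fields are assumptions.
Block = namedtuple("Block", ["sample", "start", "end"])

def drain_closed(pending_blocks, l_trailing, emit):
    """Pop blocks whose start locus is now below l_trailing: no open block
    starts before them, so they are closed. Because l_trailing only moves
    forward, pops across calls come out in increasing start-locus order."""
    while pending_blocks and pending_blocks[0][0] < l_trailing:
        _, block = heapq.heappop(pending_blocks)
        emit(block)

pending = []
for b in [Block(0, 4, 6), Block(1, 2, 3), Block(2, 9, 11)]:
    heapq.heappush(pending, (b.start, b))

closed = []
drain_closed(pending, l_trailing=8, emit=closed.append)
```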

# Special cases

If we’re fusing blocks, we can write blocks to a new file as they’re popped from `pending_blocks`: they are guaranteed to be popped in increasing order of start locus. Depending on the distribution of block sizes, `pending_blocks` may grow too big for memory. To handle that possibility, we could set a max size for `pending_blocks`. When the max size is exceeded, just pop blocks from `pending_blocks` until the size is small enough. Let `l_buffered` be the min start locus of all blocks in `pending_blocks`.
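A sketch of the capped queue, under the same toy `Block` record (the cap value and `push_with_cap` are illustrative; a real cap would come from a memory budget):

```python
import heapq
from collections import namedtuple

# Hypothetical block record; fields are assumptions.
Block = namedtuple("Block", ["sample", "start", "end"])

MAX_PENDING = 2  # illustrative cap, not a recommended value

def push_with_cap(pending_blocks, block, write_out):
    """Push a kicked-out block; if the queue exceeds its cap, pop (and
    write out) the earliest-starting blocks until it fits. Returns
    l_buffered, the minimum start locus still buffered."""
    heapq.heappush(pending_blocks, (block.start, block))
    while len(pending_blocks) > MAX_PENDING:
        _, b = heapq.heappop(pending_blocks)
        write_out(b)
    return pending_blocks[0][0]

pending, out = [], []
for b in [Block(0, 5, 9), Block(1, 1, 2), Block(2, 7, 12)]:
    l_buffered = push_with_cap(pending, b, out.append)
```

Note that blocks written out early this way are still emitted in increasing start order relative to each other; the complication described next is only for blocks that arrive later with starts before `l_buffered`.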

It is now possible that blocks kicked out of `open_blocks` may start earlier than `l_buffered`. We can’t add those to `pending_blocks` without violating the invariant that blocks popped from `pending_blocks` are in increasing order of start locus. There are two options. First, we can push such blocks to a second priority queue, which writes out to a second ordered file; the two files can then be joined in a second pass. This could be repeated for any number of levels, but it’s not clear how to choose max sizes for each level. Alternatively (or in addition), when a block can’t be pushed to another queue, we can just write it to an unordered file, which could then be shuffled before joining. We should use knowledge of the distribution of block sizes to ensure that this file will be small enough to be shuffled.
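The routing decision might look like the following sketch (all names hypothetical; a full implementation would also give the second-level queue its own cap-and-spill logic, which is omitted here):

```python
import heapq
from collections import namedtuple

# Hypothetical block record; fields are assumptions.
Block = namedtuple("Block", ["sample", "start", "end"])

MAX_SECOND = 1  # illustrative cap for the second-level queue

def route_evicted(block, l_buffered, pending_blocks, second_queue, unordered):
    """Route a block kicked out of open_blocks. Blocks starting at or after
    l_buffered can safely join pending_blocks; earlier ones would break its
    ordering invariant, so they go to a second-level queue (feeding a second
    ordered file), or to an unordered spill file when that queue is full."""
    if block.start >= l_buffered:
        heapq.heappush(pending_blocks, (block.start, block))
    elif len(second_queue) < MAX_SECOND:
        heapq.heappush(second_queue, (block.start, block))
    else:
        unordered.append(block)

pending, second, spill = [], [], []
for b in [Block(0, 12, 14), Block(1, 3, 4), Block(2, 2, 8)]:
    route_evicted(b, l_buffered=10, pending_blocks=pending,
                  second_queue=second, unordered=spill)
```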

I believe similar tricks could be used to cap the memory usage when building an index or computing stats.