This is a quick note to record an algorithm that may be useful for working with sparse matrix tables. It can be used to fuse adjacent ref blocks, e.g. if we want to coarsen the GQ binning or get rid of it entirely. It can also be used to construct an index which, given a locus l, looks up the earliest locus you must start reading ref blocks from in order to find all blocks containing l, or just to compute stats about how efficient such an index would be (how many blocks not containing l do you have to skip over, on average).
We stream through ref blocks ordered by start locus. Say we’re currently processing blocks starting at l_cur. Split previously seen blocks into three groups. Blocks containing l_cur are “open”. Let l_trailing be the minimum start locus among all open blocks. Blocks which start before l_trailing are “closed”, and blocks which start after are “pending”. We need to keep open and pending blocks in memory, in separate data structures, but closed blocks are no longer needed.
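A minimal sketch of this classification for a single value of l_cur, assuming blocks are (start, end) pairs with inclusive ends and integer loci (the names here are illustrative, not from any real API):

```python
def classify(seen_blocks, l_cur):
    """Split previously seen blocks (all with start <= l_cur) into
    open / closed / pending relative to l_cur. Assumes at least one
    block contains l_cur (we are processing blocks starting there)."""
    open_ = [b for b in seen_blocks if b[1] >= l_cur]   # contain l_cur
    l_trailing = min(b[0] for b in open_)               # earliest open start
    ended = [b for b in seen_blocks if b[1] < l_cur]
    closed = [b for b in ended if b[0] < l_trailing]    # no longer needed
    pending = [b for b in ended if b[0] >= l_trailing]  # must stay in memory
    return open_, closed, pending, l_trailing
```

A streaming implementation never materializes all three lists at once; this just pins down which group each block belongs to.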
For example, if we’re constructing an index, then we should add l_cur -> l_trailing to the index, as l_trailing is where we need to start reading to find all open blocks.
There can be at most one open block per sample, so we keep open blocks in an array open_blocks with one slot per sample. Each time we advance l_cur, read all blocks starting at l_cur and update open_blocks. If the average number of blocks starting at any locus is sufficiently high relative to the number of samples, we can just scan over open_blocks and compute l_trailing from scratch. Otherwise, we can maintain a tournament tree (a simple kind of priority queue best for a fixed number of elements) over the open blocks. I suspect the ref blocks are dense enough that we just want to do the scan each time.
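A sketch of the scan, assuming open_blocks is a list with one slot per sample holding a (start, end) pair, or None for samples with no open block:

```python
def scan_trailing(open_blocks):
    """Recompute l_trailing by scanning every sample's open block."""
    starts = [slot[0] for slot in open_blocks if slot is not None]
    return min(starts) if starts else None
```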
When we advance l_cur, we also want to kick out from open_blocks any block which ends before l_cur. (If we’re fusing blocks, we want to delay kicking a block out until we’ve seen whether a new block can be fused with it.)
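The eviction step might look like the following, again with hypothetical (start, end) pairs; since tuples compare by their first element, heapq keys them by start locus for free:

```python
import heapq

def evict_ended(open_blocks, pending_blocks, l_cur):
    """Move blocks that end before l_cur out of open_blocks and into the
    pending_blocks heap, which is keyed by start locus."""
    for sample, slot in enumerate(open_blocks):
        if slot is not None and slot[1] < l_cur:
            heapq.heappush(pending_blocks, slot)  # tuples sort by start first
            open_blocks[sample] = None
```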
Blocks which are kicked out go to a priority queue pending_blocks, keyed by start locus. Each time we advance l_cur and update l_trailing, some of the pending blocks may have become closed. We can simply pop blocks from pending_blocks until the start locus of the head block is >= l_trailing. This maintains the invariant that blocks popped from pending_blocks are in increasing order by start locus. (We can’t pop blocks which start after l_trailing, because there are still open blocks that start earlier which will need to be popped in the future.)
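Popping the newly closed blocks is then a standard drain of the heap head, sketched here with heapq:

```python
import heapq

def pop_closed(pending_blocks, l_trailing):
    """Pop blocks that start before l_trailing. They come out in increasing
    order of start locus, so they can be processed or written immediately."""
    closed = []
    while pending_blocks and pending_blocks[0][0] < l_trailing:
        closed.append(heapq.heappop(pending_blocks))
    return closed
```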
If we’re fusing blocks, we can write blocks to a new file as they’re popped from pending_blocks: they are guaranteed to be popped in increasing order of start locus. Depending on the distribution of block sizes, pending_blocks may grow too big for memory. To handle that possibility, we could set a max size for pending_blocks. When the max size is exceeded, just pop blocks from pending_blocks until the size is small enough. Let l_buffered be the min start locus of all blocks in pending_blocks after these forced pops. It is now possible that blocks kicked out of open_blocks may start earlier than l_buffered. We can’t add those to pending_blocks without violating the invariant that blocks popped from pending_blocks are in increasing order of start locus. There are two options. First, we can push such blocks to a second priority queue, which writes out to a second ordered file; the two files can then be joined in a second pass. This could be repeated for any number of levels, but it’s not clear how to choose max sizes for each level. Alternatively (or in addition), when a block can’t be pushed to another queue, we can just write it to an unordered file. The unordered file could then be shuffled before joining. We should use knowledge of the distribution of block sizes to ensure that this file will be small enough to be shuffled.
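To make the spill policy concrete, here is a sketch of a bounded pending queue. The class name, the write callback, and the overflow list are all illustrative, standing in for the ordered output file and the unordered spill file; it assumes max_size >= 1:

```python
import heapq

class BoundedPending:
    """Pending queue with a max size. Forced pops go to the ordered output;
    blocks that would break the output ordering go to an unordered overflow
    (to be shuffled and joined in a later pass)."""

    def __init__(self, max_size, write):
        self.heap = []          # (start, end) pairs; a min-heap on start
        self.max_size = max_size
        self.write = write      # callback: write a block to the ordered file
        self.l_buffered = None  # min start locus still admissible to the heap
        self.overflow = []      # stand-in for the unordered spill file

    def push(self, block):
        if self.l_buffered is not None and block[0] < self.l_buffered:
            # Admitting this block would let it pop after an already-written
            # block with a later start locus, breaking the ordering invariant.
            self.overflow.append(block)
            return
        heapq.heappush(self.heap, block)
        if len(self.heap) > self.max_size:
            while len(self.heap) > self.max_size:
                self.write(heapq.heappop(self.heap))  # forced pop, still ordered
            # Everything still buffered starts at or after this locus.
            self.l_buffered = self.heap[0][0]
```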
I believe similar tricks could be used to cap the memory usage when building an index or computing stats.