This is a quick note to record an algorithm that may be useful for working with sparse matrix tables. It can be used to fuse adjacent ref blocks, e.g. if we want to coarsen the GQ binning or get rid of it entirely. It can also be used to construct an index which, given a locus l, can look up the earlier locus you must start reading ref blocks from in order to find all blocks containing l, or just to compute stats about how efficient such an index would be (how many blocks not containing l do you have to skip over, on average).
Basic algorithm
We stream through ref blocks ordered by start locus. Say we're currently processing blocks starting at l_cur. Split previously seen blocks into three groups. Blocks containing l_cur are "open". Let l_trailing be the minimum start locus among all open blocks. Blocks which are not open and start before l_trailing are "closed", and those which start at or after l_trailing are "pending". We need to keep open and pending blocks in memory, in separate data structures, but closed blocks are no longer needed.
For example, if we're constructing an index, then we should add l_cur -> l_trailing to the index, as l_trailing is where we need to start reading to find all open blocks.
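To make this concrete, here's a minimal Python sketch. The Block type is a hypothetical stand-in for whatever record the table actually stores, and open_blocks is the per-sample array described in the next paragraph.

```python
from typing import NamedTuple

class Block(NamedTuple):
    start: int   # first locus the block covers
    end: int     # last locus the block covers, inclusive
    sample: int  # sample index

def index_entry(l_cur: int, open_blocks: list) -> int:
    # l_trailing: earliest start locus among blocks containing l_cur;
    # if nothing is open, a reader can start at l_cur itself
    return min((b.start for b in open_blocks if b is not None), default=l_cur)
```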
There can be at most one open block per sample, so we keep open blocks in a length-num_samples array open_blocks. Each time we advance l_cur, we read all blocks starting at l_cur and update open_blocks. If the average number of blocks starting at any locus is sufficiently high relative to the number of samples, we can just scan over open_blocks and compute l_trailing from scratch. Otherwise, we can maintain a tournament tree (a simple kind of priority queue, best for a fixed number of elements) over the open blocks. I suspect the ref blocks are dense enough that we just want to do the scan each time.
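For reference, here's what a tournament tree over the open blocks might look like. This is an illustrative sketch, not tuned code; empty slots hold None and lose to any real key.

```python
class TournamentTree:
    """Min-tournament tree over a fixed set of slots, keyed by start locus.

    Hypothetical sketch: update() replays one leaf-to-root path, so both
    updates and min queries are O(log num_slots).
    """

    def __init__(self, num_slots: int):
        size = 1
        while size < max(num_slots, 2):
            size *= 2
        self.size = size
        self.key = [None] * size                   # key per slot; None = empty
        self.win = [0] * (2 * size)                # winning slot index per node
        for i in range(size):
            self.win[size + i] = i                 # leaf i wins its own "match"
        for node in range(size - 1, 0, -1):
            self.win[node] = self._play(self.win[2 * node], self.win[2 * node + 1])

    def _play(self, a: int, b: int) -> int:
        ka, kb = self.key[a], self.key[b]
        if ka is None:
            return b
        if kb is None:
            return a
        return a if ka <= kb else b

    def update(self, slot: int, key) -> None:
        # set or clear (key=None) one slot, then replay its path to the root
        self.key[slot] = key
        node = (self.size + slot) // 2
        while node >= 1:
            self.win[node] = self._play(self.win[2 * node], self.win[2 * node + 1])
            node //= 2

    def min_start(self):
        # start locus of the earliest open block, or None if all slots are empty
        return self.key[self.win[1]]
```

Maintaining l_trailing then amounts to tree.update(sample, block.start) when a block opens, tree.update(sample, None) when it's kicked out, and tree.min_start() in place of the scan.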
When we advance l_cur, we also want to kick out from open_blocks any block which ends before l_cur. (If we're fusing blocks, we want to delay kicking a block out until we've seen whether a new block starting at l_cur can be fused with it.)
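A sketch of that bookkeeping, reusing the hypothetical Block type from above. The fusibility test here (the new block starts immediately after the open one ends) is a stand-in; real ref blocks would also need compatible fields, e.g. the same GQ bin.

```python
import heapq

def install(open_blocks, pending_blocks, new_block, fuse=True):
    # Install new_block for its sample, fusing with the still-open block
    # for that sample if they abut; otherwise the old block is kicked out.
    old = open_blocks[new_block.sample]
    if old is not None:
        if fuse and old.end + 1 == new_block.start:  # plus field-compatibility checks
            new_block = old._replace(end=new_block.end)
        else:
            heapq.heappush(pending_blocks, old)
    open_blocks[new_block.sample] = new_block

def evict_ended(open_blocks, pending_blocks, l_cur):
    # After installing all blocks starting at l_cur, kick out anything
    # that ended before l_cur (delayed so fusion gets first refusal).
    for sample, b in enumerate(open_blocks):
        if b is not None and b.end < l_cur:
            heapq.heappush(pending_blocks, b)
            open_blocks[sample] = None
```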
Blocks which are kicked out go to a priority queue pending_blocks, keyed by start locus. Each time we advance l_cur and update l_trailing, some of the pending blocks may have become closed. We can simply pop blocks from pending_blocks until the start locus of the head block is >= l_trailing. This maintains the invariant that blocks popped from pending_blocks are in increasing order by start locus. (We can't pop blocks which start at or after l_trailing, because there are still open blocks starting as early as l_trailing which will need to be popped in the future.)
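Putting the pieces together, a sketch of the whole loop using the helpers above. block_groups is assumed to yield (l_cur, blocks starting at l_cur) in increasing locus order; emit is whatever consumes closed blocks, e.g. a writer for the fused output file. For index construction you'd instead record (l_cur, l_trailing) pairs inside the loop.

```python
import heapq

def process(block_groups, num_samples, emit, fuse=True):
    open_blocks = [None] * num_samples
    pending_blocks = []                            # heap keyed by start locus
    for l_cur, new_blocks in block_groups:
        for b in new_blocks:
            install(open_blocks, pending_blocks, b, fuse)
        evict_ended(open_blocks, pending_blocks, l_cur)
        l_trailing = min((b.start for b in open_blocks if b is not None),
                         default=l_cur)
        # pending blocks starting before l_trailing are now closed;
        # they pop in increasing order of start locus
        while pending_blocks and pending_blocks[0].start < l_trailing:
            emit(heapq.heappop(pending_blocks))
    # end of stream: everything still open or pending is closed
    for b in open_blocks:
        if b is not None:
            heapq.heappush(pending_blocks, b)
    while pending_blocks:
        emit(heapq.heappop(pending_blocks))
```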
Special cases
If we're fusing blocks, we can write blocks to a new file as they're popped from pending_blocks: they are guaranteed to be popped in increasing order of start locus. Depending on the distribution of block sizes, pending_blocks may grow too big for memory. To handle that possibility, we could set a max size for pending_blocks. When the max size is exceeded, just pop blocks from pending_blocks until the size is small enough. Let l_buffered be the min start locus of all blocks in pending_blocks.
It is now possible that blocks kicked out of open_blocks may start earlier than l_buffered. We can't add those to pending_blocks without violating the invariant that blocks popped from pending_blocks are in increasing order of start locus. There are two options. First, we can push such blocks to a second priority queue, which writes out to a second ordered file; the two files can then be joined in a second pass. This could be repeated for any number of levels, but it's not clear how to choose max sizes for each level. Alternatively (or in addition), when a block can't be pushed to another queue, we can just write it to an unordered file. The unordered file can then be shuffled before joining. We should use knowledge of the distribution of block sizes to ensure that this file will be small enough to be shuffled.
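Here's a sketch of that routing logic for kicked-out blocks, with in-memory lists standing in for the output files. Per the definition above, l_buffered is tracked as the min start locus remaining in pending_blocks after a force-pop; the admission rule for the second queue (l_buffered2) is my extrapolation of the same idea.

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class SpillState:
    max_size: int
    pending: list = field(default_factory=list)    # primary heap
    pending2: list = field(default_factory=list)   # heap for the second ordered file
    out: list = field(default_factory=list)        # stand-in for the primary ordered file
    unordered: list = field(default_factory=list)  # stand-in for the unordered file
    l_buffered: float = float("-inf")              # min start admissible to pending
    l_buffered2: float = float("-inf")             # likewise for pending2

def push_pending(st: SpillState, block) -> None:
    if block.start >= st.l_buffered:
        heapq.heappush(st.pending, block)
        while len(st.pending) > st.max_size:       # over budget: force-pop early blocks
            st.out.append(heapq.heappop(st.pending))
            # nothing starting before the new head may enter this queue later
            st.l_buffered = st.pending[0].start
    elif block.start >= st.l_buffered2:
        heapq.heappush(st.pending2, block)         # a real version would cap this queue
        # the same way, updating l_buffered2 on its force-pops
    else:
        st.unordered.append(block)                 # shuffle this file before joining
```

Since force-pops always remove the heap minimum, they interleave safely with ordinary closed-block pops and the primary file stays ordered.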
I believe similar tricks could be used to cap the memory usage when building an index or computing stats.