Overview of Method
Given an r^2 threshold and a window (specifying distance between variants), the method prunes variants in linkage disequilibrium, ultimately returning a set of independent variants.
The method starts with a local prune that removes variants over the r^2 threshold on each partition (Jackie’s method). Next, the genotypes are mean-imputed and standardized, so that a correlation matrix can be computed via a matrix multiplication of block matrices.
Currently, this entire correlation matrix is computed, squared, and then filtered to just those entries that we care about (variants with same contig and positions close to each other, as specified by the window).
The entries table of this correlation matrix has the schema [variant i, variant j, r2]. We filter the entries table to just those entries over the r2 threshold, then pass the edges (i, j) to the maximal_independent_set method, which will return a list of variants to prune out. We filter out those variants, leaving a table of independent variants to return to the user.
Performance / Areas for Improvement
On a dataset of 25,956 variants (profile.vcf), Jackie’s method takes 200-300 seconds. My implementation takes around 800-900 seconds.
Around 85% of the time for my implementation is spent on the maximal independent set (MIS) method. This method uses a greedy algorithm with a binary heap for the edge set.
I have been thinking about ways to speed up MIS. The MIS method could possibly be sped up if I could split the graph into several connected components, put all the edges from each component onto the same partition, then run MIS in parallel. But splitting the edge set into the connected components might be slow, so I’m not sure if the time saved from running MIS in parallel would be worth the time spent getting the components.
A banded block matrix, as mentioned in the original proposal, would also offer some improvement, but doesn’t seem to be the main bottleneck right now.
I have experimented a bit with using persist() to speed up my code, which helps somewhat, so perhaps I could be placing my persist() calls in better places to speed things up. This wouldn’t really help with the time spent on MIS, though.