This project concerns multi-nucleotide variants (MNVs): for each sample, it collects data from nearby variants to look for alternate alleles in cis.
Grouping the window
Proposal 1:
def collect_window(mt: MatrixTable, window_size: int) -> MatrixTable
would add an extra row field prev_rows of type array<mt.row.dtype> and an extra entry field prev_entries of type array<mt.entry.dtype>, where mt.prev_rows[0] is the previous row and mt.prev_rows[-1] is the first row in the window.
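The ordering described above can be illustrated with a minimal plain-Python sketch. It is a model of the proposed semantics only, not Hail code: rows are assumed to be dicts with a hypothetical "pos" field and an "entries" list holding one value per sample, and everything except "entries" is treated as the row fields.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

def collect_window(rows: List[Row], window_size: int) -> List[Row]:
    """Attach prev_rows and prev_entries to each row, nearest row first."""
    out = []
    for i, row in enumerate(rows):
        # Up to window_size preceding rows, reversed so index 0 is the previous row.
        prev = rows[max(0, i - window_size):i][::-1]
        # Row fields only (the hypothetical "entries" key is excluded).
        prev_rows = [{k: v for k, v in p.items() if k != "entries"} for p in prev]
        # One array per sample, in the same order as prev_rows.
        n_samples = len(row["entries"])
        prev_entries = [[p["entries"][s] for p in prev] for s in range(n_samples)]
        out.append({**row, "prev_rows": prev_rows, "prev_entries": prev_entries})
    return out

rows = [
    {"pos": 100, "entries": ["0|1", "0|0"]},
    {"pos": 101, "entries": ["0|1", "1|1"]},
    {"pos": 105, "entries": ["0|0", "0|1"]},
]
windowed = collect_window(rows, window_size=2)
```

Here windowed[2]["prev_rows"][0] is the row at position 101 (the previous row) and windowed[2]["prev_rows"][-1] is the row at position 100 (the first row in the window).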
Proposal 2:
def collect_window(mt: MatrixTable, bp_window_size: int) -> MatrixTable
Same result semantics, but the window is defined by base-pair distance rather than by row count. For partitioning considerations, we can limit the search to at most 3 * bp_window_size previous rows.
Proposal 2 is preferred.
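A rough plain-Python sketch of the base-pair variant follows. Again, this is a model and not Hail code. The "contig" and "pos" field names are assumptions, as is the choice to treat the window as inclusive (distance <= bp_window_size) and to stop at contig boundaries. The cap of 3 * bp_window_size previous rows is taken from the proposal.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

def collect_window_bp(rows: List[Row], bp_window_size: int) -> List[Row]:
    """For each row, collect preceding rows within bp_window_size base pairs."""
    max_rows = 3 * bp_window_size  # cap on how many previous rows are examined
    out = []
    for i, row in enumerate(rows):
        prev = []
        # Walk backwards from the previous row, at most max_rows steps.
        for j in range(i - 1, max(-1, i - 1 - max_rows), -1):
            cand = rows[j]
            too_far = row["pos"] - cand["pos"] > bp_window_size
            if cand["contig"] != row["contig"] or too_far:
                break  # rows are sorted, so nothing earlier can qualify
            prev.append(cand)
        out.append({**row, "prev_rows": prev})
    return out

rows = [{"contig": "1", "pos": p} for p in (100, 101, 105, 120)]
windowed = collect_window_bp(rows, bp_window_size=5)

# Five rows at one position: the cap (3 * 1 = 3) limits prev_rows for the last row.
dense = [{"contig": "1", "pos": 100} for _ in range(5)]
dense_windowed = collect_window_bp(dense, bp_window_size=1)
```

As in Proposal 1, prev_rows[0] is the nearest preceding row. The cap only matters when many rows fall inside the window, as in the dense example.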
Within the grouped window
Find pairs of genotypes in the same sample that are phased (cis) heterozygotes or hom-var.
The result is a set of (variant1, variant2, sample) triples, one for each such pair of phased genotypes.
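One way to read the cis check is sketched below in plain Python. The Call tuple, the "variant", "calls" and "prev" field names, and the rule itself are assumptions made for illustration: two calls are treated as cis when both are phased and they carry a non-ref allele on at least one shared haplotype. Unphased calls, including unphased hom-var, are ignored in this sketch.

```python
from typing import Any, Dict, List, NamedTuple, Set, Tuple

class Call(NamedTuple):
    a0: int        # allele on haplotype 0 (0 = ref)
    a1: int        # allele on haplotype 1
    phased: bool

def alt_haplotypes(call: Call) -> Set[int]:
    """Haplotype indices carrying a non-ref allele; empty if unphased."""
    if not call.phased:
        return set()
    return {h for h, allele in enumerate((call.a0, call.a1)) if allele != 0}

def cis_pairs(windowed_rows: List[Dict[str, Any]]) -> List[Tuple[str, str, str]]:
    """Return (variant1, variant2, sample) triples for cis non-ref pairs."""
    triples = []
    for row in windowed_rows:
        for sample, call in row["calls"].items():
            haps = alt_haplotypes(call)
            if not haps:
                continue
            for prev in row["prev"]:
                prev_call = prev["calls"].get(sample)
                # Cis if the two calls share at least one alt-carrying haplotype.
                if prev_call is not None and haps & alt_haplotypes(prev_call):
                    triples.append((prev["variant"], row["variant"], sample))
    return triples

v1 = {"variant": "1:100:A:T", "prev": [], "calls": {
    "s1": Call(0, 1, True),    # 0|1
    "s2": Call(1, 0, True),    # 1|0
    "s3": Call(1, 1, True),    # 1|1 (hom-var)
    "s4": Call(0, 1, False),   # 0/1 (unphased)
}}
v2 = {"variant": "1:101:C:G", "prev": [v1], "calls": {
    s: Call(0, 1, True) for s in ("s1", "s2", "s3", "s4")
}}
triples = cis_pairs([v1, v2])
```

In this toy data, s1 (0|1 with 0|1) and s3 (1|1 with 0|1) yield triples, while s2 (1|0 with 0|1, in trans) and s4 (unphased at the first variant) do not.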
Optimizations
We only care about phased non-ref calls, so filtering entries before building the window drastically reduces the memory requirements.
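The effect of filtering first can be sketched in plain Python as follows. The row layout (a "calls" dict per row, with each call an (a0, a1, phased) tuple or None for missing) is an assumption for illustration, not the Hail representation.

```python
from typing import Any, Dict, List, Optional, Tuple

CallT = Optional[Tuple[int, int, bool]]  # (a0, a1, phased), or None if missing

def is_phased_non_ref(call: CallT) -> bool:
    """True for phased calls carrying at least one non-ref allele."""
    if call is None:
        return False
    a0, a1, phased = call
    return phased and (a0 != 0 or a1 != 0)

def filter_calls(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only phased non-ref calls, stored sparsely per row."""
    return [
        {**row, "calls": {s: c for s, c in row["calls"].items() if is_phased_non_ref(c)}}
        for row in rows
    ]

rows = [
    {"variant": "1:100:A:T", "calls": {
        "s1": (0, 0, True), "s2": (0, 1, True), "s3": (0, 1, False), "s4": None}},
    {"variant": "1:101:C:G", "calls": {
        "s1": (1, 1, True), "s2": (0, 0, False), "s3": (1, 0, True), "s4": (0, 0, True)}},
]
filtered = filter_calls(rows)
total = sum(len(r["calls"]) for r in rows)
kept = sum(len(r["calls"]) for r in filtered)
```

Only the surviving calls would then need to be copied into each window's prev_entries. In this toy data, 3 of 8 entries remain after filtering.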