MNV project requirements

This project concerns multi-nucleotide variants (MNVs). For each sample, it collects data from nearby variants to look for alternate alleles that occur in cis.

Grouping the window

Proposal 1:

def collect_window(mt: MatrixTable, window_size: int) -> MatrixTable

This would add an extra row field prev_rows of type array<mt.row.dtype> and an extra entry field prev_entries of type array<mt.entry.dtype>, where mt.prev_rows[0] is the immediately preceding row and mt.prev_rows[-1] is the first (earliest) row in the window.
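The ordering of prev_rows can be illustrated with a minimal pure-Python sketch. It models rows as plain dicts rather than a MatrixTable, and the helper name collect_window_rows is hypothetical, used only for illustration; it is not the proposed implementation.

```python
from typing import Any, Dict, List


def collect_window_rows(rows: List[Dict[str, Any]], window_size: int) -> List[Dict[str, Any]]:
    """Attach up to `window_size` preceding rows to each row as `prev_rows`.

    Hypothetical stand-in for the row-count window: plain dicts, not Hail.
    """
    out = []
    for i, row in enumerate(rows):
        start = max(0, i - window_size)
        # Reverse the slice so index 0 is the immediately preceding row
        # and index -1 is the earliest row in the window.
        prev = rows[start:i][::-1]
        out.append({**row, "prev_rows": prev})
    return out


rows = [{"pos": p} for p in (100, 101, 105, 110)]
windowed = collect_window_rows(rows, window_size=2)
print([r["pos"] for r in windowed[3]["prev_rows"]])  # [105, 101]
```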

Proposal 2:

def collect_window(mt: MatrixTable, bp_window_size: int) -> MatrixTable

Same result semantics, but the window is measured in base pairs rather than rows. For partitioning reasons, we can cap the number of previous rows considered at 3 * bp_window_size.
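A rough sketch of the base-pair window with the row cap, again in plain Python. It assumes positions are sorted and on a single contig, and that a previous row at exactly bp_window_size bases away is included; both are assumptions, as is the helper name prev_positions_in_bp_window.

```python
from typing import List


def prev_positions_in_bp_window(positions: List[int], bp_window_size: int) -> List[List[int]]:
    """For each position, list the previous positions within the bp window.

    Assumes `positions` is sorted and on one contig (illustrative only).
    """
    max_rows = 3 * bp_window_size  # cap on previous rows, as in the proposal
    result = []
    for i, pos in enumerate(positions):
        prev: List[int] = []
        j = i - 1
        # Walk backwards until we leave the window or hit the row cap.
        while j >= 0 and pos - positions[j] <= bp_window_size and len(prev) < max_rows:
            prev.append(positions[j])
            j -= 1
        result.append(prev)
    return result


positions = [100, 101, 101, 103, 110]
print(prev_positions_in_bp_window(positions, bp_window_size=2))
# [[], [100], [101, 100], [101, 101], []]
```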

Proposal 2 is preferred.

Within the grouped window

Find pairs of phased (cis) heterozygotes or hom-var genotypes.

The output is a set of (variant1, variant2, sample) triples, one for each pair of phased genotypes found.
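One way to sketch the pair search is shown here in plain Python. It assumes each call is a phased allele pair (hap0, hap1), that all calls in a window belong to the same phase set, and that two calls are in cis when some haplotype carries a non-ref allele at both sites. The names in_cis and cis_pairs and the variant keys are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Phased alleles as (hap0, hap1); None stands for a missing or unphased call.
Call = Optional[Tuple[int, int]]


def in_cis(a: Call, b: Call) -> bool:
    """True if some haplotype is non-ref in both calls (assumes a shared phase set)."""
    if a is None or b is None:
        return False
    return any(a[h] != 0 and b[h] != 0 for h in (0, 1))


def cis_pairs(window: List[Tuple[str, Dict[str, Call]]]) -> List[Tuple[str, str, str]]:
    """Return (variant1, variant2, sample) triples for cis genotype pairs."""
    out = []
    for i, (v1, calls1) in enumerate(window):
        for v2, calls2 in window[i + 1:]:
            for sample, c1 in calls1.items():
                if in_cis(c1, calls2.get(sample)):
                    out.append((v1, v2, sample))
    return out


window = [
    ("1:100:A:T", {"s1": (0, 1), "s2": (1, 0), "s3": (1, 1)}),
    ("1:101:C:G", {"s1": (0, 1), "s2": (0, 1), "s3": (0, 1)}),
]
print(cis_pairs(window))
# [('1:100:A:T', '1:101:C:G', 's1'), ('1:100:A:T', '1:101:C:G', 's3')]
```

Here s2 is excluded because its two alternate alleles sit on different haplotypes (trans), while the hom-var call for s3 pairs with the phased het at the second site.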

Optimizations

We care about phased non-ref calls. Filtering entries first drastically reduces the memory requirements.
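The filtering step can be sketched as follows, in plain Python over a single row's entries. The (alleles, phased) representation and the helper name keep_phased_non_ref are assumptions for illustration; on a real MatrixTable this would presumably be done with an entry filter before the window is collected.

```python
from typing import Dict, Tuple

Alleles = Tuple[int, int]


def keep_phased_non_ref(entries: Dict[str, Tuple[Alleles, bool]]) -> Dict[str, Alleles]:
    """Keep only samples whose call is phased and has at least one non-ref allele.

    `entries` maps sample -> (alleles, phased); a hypothetical layout.
    """
    return {
        sample: alleles
        for sample, (alleles, phased) in entries.items()
        if phased and any(a != 0 for a in alleles)
    }


row = {
    "s1": ((0, 0), True),   # phased hom-ref: dropped
    "s2": ((0, 1), True),   # phased het: kept
    "s3": ((0, 1), False),  # unphased het: dropped
    "s4": ((1, 1), True),   # phased hom-var: kept
}
print(keep_phased_non_ref(row))  # {'s2': (0, 1), 's4': (1, 1)}
```

Because most entries at a typical site are hom-ref, dropping them before building prev_entries means each window stores only a small fraction of the original calls.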

Pull request is open!

Should go in this week, I think.