impute_sex
probabilistically guesses if samples in a dataset are XX, XY, or if there’s insufficient evidence to support either of those two states. The algorithm is rather simple:
- filter to the non psuedo-autosomal X chromosome loci
- compute the alternate allele frequency for each row (or use a supplied allele frequency per row)
- remove rows outside a given allele frequency
- compute the inbreeding statistics per column
- guess XX, XY, or NA for each column based on Fstat
The above algorithm has three important inputs: locus
, call
, and population_frequency
. All three should be expressions and they don’t necessarily come from the same matrix table. The important bit is that call
should be row & col indexed while locus
and population_frequency
should have the same row index as call
.
I achieved this as follows:
gr = locus.dtype.reference_genome
lsource = locus._indices.source
csource = call._indices.source
analyze('methods.impute_sex/locus', locus, lsource._row_indices)
analyze('methods.impute_sex/call', call, csource._entry_indices)
# FIXME: assert that row indices of locus are same as row indices of call?
lds, _ = lsource._process_joins(locus)
cds, _ = csource._process_joins(call)
# put call in an entry field so we can access it later
ds = cds.annotate_entries(call = call)
# put locus in an row field so we can access it later
lds = lds.annotate_rows(locus = locus)
# join call and locus into one matrix (FIXME: what happens if lds is a table?!)
ds = ds.annotate_rows(locus = lds[ds.v,:].locus)
The population_frequency is even more complicated because I provide a default expression:
if (population_frequency is None):
ds = ds.annotate_rows(
aaf = (agg.sum(ds.call.num_alt_alleles()).to_float64() /
agg.count_where(f.is_defined(ds.call)) / 2))
else:
pf_source = population_frequency._indices.source
analyze('methods.impute_sex/locus', population_frequency,
pf_source._row_indices, {pf_source._col_axis})
pf_kt, _ = pf_source._process_joins(population_frequency)
pf_kt = pf_kt.annotate_rows(aaf = population_frequency)
# FIXME: what happens if pf_kt is a table, like a sites key table from gnomad?
ds = ds.annotate_rows(aaf = pf_kt[ds.v, :].aaf)