Thoughts on entry filtering

The entry filtering bugfix has been a bit of a shitshow. I remain steadfastly convinced that the current (fixed) model is correct, especially given examples like the below where the denominator of fraction should clearly be the non-filtered entries:

mt = mt.filter_entries(thing_is_bad) 
mt = mt.annotate_rows(fraction_gq_over_20 = hl.agg.fraction(mt.GQ > 20))

However, we still have some lingering issues:

  • Computing call rate is counterintuitive if people are using filter_entries for genotype qc. I still haven’t fixed sample_qc to divide the the number of variants yet: https://github.com/hail-is/hail/issues/5561. It’s also extremely difficult to compute grouped call rate, say, grouped by population – as far as I can tell the only way to do this is to compute global denominators first with aggregate_cols, make it a literal, then divide explicitly.
  • People are getting errors from BlockMatrix.from_entry_expr calls (which error if an entry is missing) inside things like pc_relate. There need to be unfilter_entries calls inside these methods, and that needs to be documented.
  • It’s very hard to ask about the number of filtered entries. Right now all you can do is subtract the number of entries present (hl.agg.count()) from whatever total number is relevant (mt.count_rows()). This feels insufficient, but I have no idea how to support something like hl.agg.n_filtered().

Can you share the explanation for why the current model is correct?

My thoughts on call rate –

If:
a) filter_entries is going to be used to filter genotypes, and
b) we’re going to distinguish between missing entries and filtered entries, then
I’d like to again express my opinion that call rate ought to be (# called) / (total), where total = # called + # missing + # filtered.

My typical workflow involves filtering out questionable genotypes based on things like GQ, and then filtering out variants with a low call rate (the ones that were full of questionable genotypes). In this sequence of events, it’s imperative that the call rate take the filtered genotypes into account.

To point a) above, I agree with Lea when she says (in one of the Zulip threads) that if we use filter_rows to filter variants, and filter_cols to filter samples, then it seems intuitive to use filter_entries to filter genotypes.

Thanks for your thoughts, Kyle! I think that’s totally reasonable. I’ll fix sample_qc to divide by the total number of variants, and fix the tutorials to do the same.

Dan - this is the right model due to consistency. To clarify, the “this” I mean is that we do not aggregate over filtered entries.

  • The following pairs should be the same:

     mt.aggregate_entries(hl.agg.count())
     mt.entries().aggregate(hl.agg.count())
    
     mt.annotate_rows(x = hl.agg.count())
     mt.entries().group_by(*mt.row_key).aggregate(x = hl.agg.count())
    
  • Things like aggregate_entries(hl.agg.count_where(hl.is_missing(mt.GQ)) should not include filtered entries as True.

  • Likewise, I would be very confused if hl.agg.fraction(mt.GQ > 20) included filtered entries in the denominator. After filtering rows or columns, those values will never appear in future aggregations.