The entry filtering bugfix has been a bit of a shitshow. I remain steadfastly convinced that the current (fixed) model is correct, especially given examples like the one below, where the denominator of `fraction` should clearly be the non-filtered entries:
    mt = mt.filter_entries(thing_is_bad)
    mt = mt.annotate_rows(fraction_gq_over_20 = hl.agg.fraction(mt.GQ > 20))
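To make the denominator question concrete, here is a toy model in plain Python. It is not Hail code: the function names, the use of `None` to stand for a filtered entry, and the choice to return `None` for a row with no remaining entries are all assumptions made for illustration.

```python
# Toy model of one matrix-table row: one GQ value per sample.
# None stands for an entry removed by a filter_entries-style step (an assumption
# of this sketch, not how Hail represents filtered entries internally).

def fraction_gq_over_20(row):
    """Fraction of the non-filtered entries with GQ > 20.

    Filtered entries drop out of both numerator and denominator.
    """
    present = [gq for gq in row if gq is not None]
    if not present:
        return None  # assumption: undefined when nothing remains
    return sum(gq > 20 for gq in present) / len(present)


def fraction_over_all_columns(row):
    """The alternative reading: filtered entries still count in the denominator."""
    return sum(gq is not None and gq > 20 for gq in row) / len(row)


row = [30, 25, None, 10]  # the third entry has been filtered out

print(fraction_gq_over_20(row))        # denominator is the 3 remaining entries
print(fraction_over_all_columns(row))  # denominator is all 4 columns
```

The first function matches the behaviour argued for here. The second is the quantity people expect when they think of a call rate, which is why the two readings come apart as soon as entries are filtered.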
However, we still have some lingering issues:
- Computing call rate is counterintuitive if people are using `filter_entries` for genotype QC. I still haven’t fixed `sample_qc` to divide by the number of variants yet: https://github.com/hail-is/hail/issues/5561. It’s also extremely difficult to compute grouped call rate, say, grouped by population. As far as I can tell, the only way to do this is to compute global denominators first with `aggregate_cols`, make the result a literal, then divide explicitly.
- People are getting errors from `BlockMatrix.from_entry_expr` calls (which error if an entry is missing) inside things like `pc_relate`. There need to be `unfilter_entries` calls inside these methods, and that needs to be documented.
- It’s very hard to ask about the number of filtered entries. Right now all you can do is subtract the number of entries present (`hl.agg.count()`) from whatever total number is relevant (`mt.count_rows()`). This feels insufficient, but I have no idea how to support something like