The curious case of `impute_sex`

`impute_sex` probabilistically guesses whether each sample in a dataset is XX or XY, or reports that there is insufficient evidence for either of those two states. The algorithm is rather simple:

  1. filter to the non-pseudoautosomal X chromosome loci
  2. compute the alternate allele frequency for each row (or use a supplied allele frequency per row)
  3. remove rows whose allele frequency falls outside a given range
  4. compute the inbreeding statistics per column
  5. guess XX, XY, or NA for each column based on Fstat (steps 4 and 5 are sketched just below)
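
For concreteness, here is a hedged sketch of steps 4 and 5 in the style of the snippets below. It is not the actual implementation: ds, aaf, f_thresh, and m_thresh are placeholder names, and the statistic is the usual inbreeding coefficient F = (observed hom - expected hom) / (n called - expected hom).

# steps 4-5 sketched: per-column inbreeding coefficient on the X, then a threshold call
# (assumes ds.aaf is the per-row alternate allele frequency and ds.call is the entry call)
exp_hom = agg.sum(f.cond(f.is_defined(ds.call), 1 - 2 * ds.aaf * (1 - ds.aaf), 0.0))
obs_hom = agg.count_where(f.is_defined(ds.call) & (ds.call.num_alt_alleles() != 1))
n_called = agg.count_where(f.is_defined(ds.call))
ds = ds.annotate_cols(Fstat = (obs_hom - exp_hom) / (n_called - exp_hom))
ds = ds.annotate_cols(
    isFemale = f.cond(ds.Fstat < f_thresh,
                      True,
                      f.cond(ds.Fstat > m_thresh,
                             False,
                             f.null(TBoolean()))))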

The above algorithm has three important inputs: locus, call, and population_frequency. All three should be expressions, and they need not come from the same matrix table. The important bit is that call should be row- and col-indexed, while locus and population_frequency should have the same row index as call.

I achieved this as follows:

gr = locus.dtype.reference_genome

lsource = locus._indices.source
csource = call._indices.source


analyze('methods.impute_sex/locus', locus, lsource._row_indices)
analyze('methods.impute_sex/call', call, csource._entry_indices)

# FIXME: assert that row indices of locus are same as row indices of call?

lds, _ = lsource._process_joins(locus)
cds, _ = csource._process_joins(call)

# put call in an entry field so we can access it later
ds = cds.annotate_entries(call = call)
# put locus in a row field so we can access it later
lds = lds.annotate_rows(locus = locus)
# join call and locus into one matrix (FIXME: what happens if lds is a table?!)
ds = ds.annotate_rows(locus = lds[ds.v,:].locus)

The population_frequency argument is even more complicated because I provide a default expression:

if (population_frequency is None):
    ds = ds.annotate_rows(
        aaf = (agg.sum(ds.call.num_alt_alleles()).to_float64() /
               agg.count_where(f.is_defined(ds.call)) / 2))
else:
    pf_source = population_frequency._indices.source
    analyze('methods.impute_sex/population_frequency', population_frequency,
            pf_source._row_indices, {pf_source._col_axis})
    pf_kt, _ = pf_source._process_joins(population_frequency)
    pf_kt = pf_kt.annotate_rows(aaf = population_frequency)
    # FIXME: what happens if pf_kt is a table, like a sites key table from gnomad?
    ds = ds.annotate_rows(aaf = pf_kt[ds.v, :].aaf)

A few things that would make this easier:

  1. how do I require that the indices of one expression are compatible with another? i.e. row indices of call are joinable with indices of locus?
  2. how do I join against something that’s either a matrix or a table?
  3. rows and columns sort of don’t matter, this whole method should work fine if the locus and population frequency indices are compatible with the call col-index.
  4. I shouldn’t have to state the indexes and aggregation context. I just want to use population_frequency as if it is a singly-indexed thing. I’m happy to state the type of that index. Something like:
pf = population_frequency.process([ds.rowkey_schema])
ds = ds.annotate_rows(aaf = pf_kt[ds.v].aaf)

Here’s my ideal API. to_matrix_table should convert an expression to a matrix table. I shouldn’t need to call to_matrix_table or to_table if I’m just using an expression in an expression context. The indices are given by the context. Any aggregation is permitted.

gr = locus.dtype.reference_genome

ds = call.to_matrix_table()
ds = ds.annotate_rows(locus = locus[ds.v])

# ...

if (population_frequency is None):
    population_frequency = (agg.sum(ds.call.num_alt_alleles()).to_float64() /
                            agg.count_where(f.is_defined(ds.call)) / 2)

ds = ds.annotate_rows(aaf = population_frequency[ds.v])

Interface to impute_sex then looks like:

my_data = hc.read('...')
gnomad = hc.read('...')
crosswalk = hc.read_table('...')
methods.impute_sex(my_data.GT,
                   crosswalk[my_data.rsid].locus,
                   population_frequency=agg.mean(gnomad.GT.num_alt_alleles()))

There’s no way this last one can be valid – the population frequency argument is an unaggregated function on another dataset. This will fail the source check, and I think that’s a good thing.
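
Something like the following sketch is what I mean by a source check (the helper and its error message are hypothetical; only the _indices.source attribute comes from the snippets above):

def check_same_source(method, call, *exprs):
    src = call._indices.source
    for e in exprs:
        if e._indices.source is not src:
            raise ValueError(
                '{}: all argument expressions must derive from the same source '
                'as the call expression'.format(method))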

Talked with Tim; this is the interface I want:

gnomad = gnomad.annotate_rows(af = agg.mean(gnomad.GT.num_alt_alleles()))
methods.impute_sex(my_data.GT,
                   crosswalk[my_data.rsid].locus,
                   population_frequency=gnomad[my_data.v, :].af)

This is what I want. All three exprs there have my_data as the source, which I think we should check, erroring if it’s not the case.

More out of band conversations were had. I see two options which are characterized by who states the joins: the caller or the callee.

Three Hail objects are referenced in the examples below:

my_data    : Matrix { rowKeys:     [ RSID: String ]
                    , colKeys:     [ Sample: String ]
                    , entryFields: [ GT: TCall ]
                    }
other_data : Matrix { rowKeys:   [ RSID: String ]
                    , colKeys:   [ Sample: String ]
                    , rowFields: [ af: Double ]
                    }
crosswalk  : Table  { keys:   [ RSID: String ]
                    , fields: [ locus: Locus ]
                    }

Caller Joins

methods.impute_sex(my_data.GT,
                   crosswalk[my_data.rsid].locus,
                   population_frequency=other_data[my_data.rsid, :].af)

impute_sex requires that the three expressions are all “based” (sourced?) on
the same table, in this case: my_data. The caller must specify (via joins) how
to connect the extra data to the table hosting the call expression (in this
case: my_data.GT).

Callee Joins

methods.impute_sex(my_data.GT,
                   crosswalk.locus,
                   population_frequency=other_data[_, :].af)

Here, impute_sex states in the documentation that the first axis of GT is used
to key into the locus and population_frequency expressions. NB: I need a way
to convert a two-axis object to a one-axis object; I have introduced the
o[_, :] syntax for this. We can, of course, also write o[:, _].

The following is a sketch of the body of impute_sex. Note that I use o.1 to
refer to the first axis’ keys of a MatrixTable o.

gr = locus.dtype.reference_genome

call = call.filter_rows(f.capture(gr.x_contigs).contains(locus[call.1].contig))
call = call.annotate_cols(ib = agg.inbreeding(call, population_frequency[call.1]))

# use Fstat to guess sex

This looks weird, but bear with me. An expression, like call, is a hail object
(matrix or table, as the case may be) that is “focused” on a particular field or
expression. If I use the object in a location where a field or expression is
expected, it refers to the “focused” field or expression. For a first cut: when
an object is “focused”, we re-annotate the focused value under a UUID-named field
so it cannot be clobbered (a sketch follows the bullets below). Some part of our
system (optimizer or elsewhere) should elide this operation when:

  • if focused on a field, elide when no annotations of that field occur

  • if focused on an expression, elide when no annotations of any component field
    occur, and no filtering of joined tables occurs (I think that’s sufficient?
    this case is definitely more complicated than focusing on a field.)
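
A minimal sketch of what focus could do, assuming the private _indices / axis attributes from the earlier snippets (the helper itself is hypothetical):

import uuid

def focus(expr):
    src = expr._indices.source
    name = '__focus_' + uuid.uuid4().hex   # UUID field name, safe from later clobbering
    axes = expr._indices.axes
    if isinstance(src, Table):
        focused = src.annotate(**{name: expr})
    elif src._row_axis in axes and src._col_axis in axes:
        focused = src.annotate_entries(**{name: expr})
    elif src._row_axis in axes:
        focused = src.annotate_rows(**{name: expr})
    else:
        focused = src.annotate_cols(**{name: expr})
    focused._focused_field = name   # consulted wherever a field or expression is expected
    return focused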

I don’t understand why you need the [_, :] here. Isn’t it clear from context?

call.filter_rows

Do expressions have the full suite of MatrixTable / Table methods?

If you can join on exprs with [], then how do I know I’m joining an array expression versus indexing into it?

What do error messages look like?

I don’t think so. The body of the method should not care if the data is coming from a Table or a MatrixTable. If I index it with a single value (as I do: population_frequency[call.1]), then it’s only unambiguous if the MatrixTable doesn’t have two keys with the same type. I prefer the explicit o[_, :] syntax that drops one axis; it’s shorthand for o.rows_table().
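
In other words, using other_data from the schemas above:

other_data[_, :].af          # proposed: drop the column axis
other_data.rows_table().af   # the equivalent spelling today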

Singly-indexed “focused” expressions have all the Table methods, double-indexed “focused” expressions have all the MatrixTable methods.

Yeah there doesn’t seem to be a good answer here. We can’t continue to pun on indexing and joining if we go this route. Maybe joining should use function call syntax?

gr = locus.dtype.reference_genome

call = call.filter_rows(f.capture(gr.x_contigs).contains(locus(call.1).contig))
call = call.annotate_cols(ib = agg.inbreeding(call, population_frequency(call.1)))

Kind of reads nicely.

Are you worried about a particular error case? The messages should probably include the “focused” field, but otherwise I want it to look the same as if the error occurred on the source dataset.

Actually, I would be fine if my method had to explicitly focus an
expression. After focusing, you can’t use array indexing. A call:

impute_sex(mydata.GT, crosswalk.locus, gnomad[_,:].af)

The impl:

def impute_sex(_call, _locus, _maf, include_par, maf_threshold, f_thresh, m_thresh):
    f = functions
    gr = _locus.dtype.reference_genome

    if _maf is None:
        _maf = (agg.sum(_call.num_alt_alleles()).to_float64() /
                agg.count_where(f.is_defined(_call)) / 2)

    locus = focus(_locus)
    call = focus(_call)
    maf = focus(_maf)

    call = call.filter_rows(f.capture(set(gr.x_contigs)).contains(locus[call.1].contig))

    if (not include_par):
        call = call.filter_rows(f.capture(gr.par).exists(lambda par: par.contains(locus[call.1])),
                                keep=False)

    call = call.filter_rows(maf[call.1] > maf_threshold)
    call = call.annotate_cols(ib = agg.inbreeding(call, maf[call.1]))
    return call.select_cols(
        isFemale = f.cond(call.ib.Fstat < f_thresh,
                          True,
                          f.cond(call.ib.Fstat > m_thresh,
                                 False,
                                 f.null(TBoolean()))),
        **call.ib).cols_table()

This will produce some unnecessary joins that we’ll have to clean up in an
optimizer pass.

The alternative (AFAICT) is:

def impute_sex(call, locus, _maf, include_par, maf_threshold, f_thresh, m_thresh):
    f = functions
    gr = locus.dtype.reference_genome

    # FIXME: not sure how to do this because I need to reference the call
    if _maf is None:
        maf = 0.0
    else:
        maf = _maf

    ds = locus._indices.source
    ds = ds.annotate_all(
        row = {'locus': locus, 'maf': maf},
        entries = {'call': call})

    if _maf is None:
        ds = ds.annotate_rows(maf = (agg.sum(ds.call.num_alt_alleles()).to_float64() /
                                     agg.count_where(f.is_defined(ds.call)) / 2))

    ds = ds.filter_rows(f.capture(set(gr.x_contigs)).contains(ds.locus.contig))

    if (not include_par):
        ds = ds.filter_rows(f.capture(gr.par).exists(lambda par: par.contains(ds.locus)),
                            keep=False)

    ds = ds.filter_rows(ds.maf > maf_threshold)
    ds = ds.annotate_cols(ib = agg.inbreeding(ds.call, ds.maf))
    return ds.select_cols(
        isFemale = f.cond(ds.ib.Fstat < f_thresh,
                          True,
                          f.cond(ds.ib.Fstat > m_thresh,
                                 False,
                                 f.null(TBoolean()))),
        **ds.ib).cols_table()

Another way around the array syntax issue: Allow me to refer to the expression’s value as e._ so a join that takes the second element of an array looks like: e[ds.v]._[1]. An expression wouldn’t have any other fields anyway (it’s an expression rather than a fully fledged table).
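
For example (hypothetical syntax, with e an array-valued expression joinable on ds.v):

ds = ds.annotate_rows(second = e[ds.v]._[1])   # join on ds.v, then index into the array value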

A few unorganized thoughts:

I still don’t understand what gnomad[_, :].AF is doing. gnomad.AF is an expression that is indexed by the row axis of gnomad, so if you were to do any implicit indexing, you would just need a type compatible with the row key of gnomad.

My larger concern is that I feel this implicit joining means that we can’t really specify how expressions work with each other.

Take this example:

>>> x = ds.row_idx
>>> y = ds.col_idx
>>> z = x * y

What is z? What is its index? In our current API, the index of z is the index of x unioned with the index of y. In this case, the index of z is both the row and col axes of ds. Using any operator that combines expressions of different indices does a product – each unique value of x with each unique value of y – to produce z. There’s no way to get z back to the index of x or y without aggregation.
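
Concretely (a hedged illustration with the fields from the snippet above):

z = ds.row_idx * ds.col_idx                                      # indexed by {row, col}
ds = ds.annotate_entries(z = ds.row_idx * ds.col_idx)            # fine: the entry context matches
ds = ds.annotate_rows(z_sum = agg.sum(ds.row_idx * ds.col_idx))  # back to the row index only via aggregation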

This is why I’m extremely concerned about doing implicit joins like the above impute_sex interface where we use mydata and crosswalk expressions. In my mind, the index of any expression that uses mydata.GT and crosswalk.locus should be the row and column axes of mydata and the row axis of crosswalk – it has to be a 3-tensor. Since we don’t have any operations that support this kind of data representation in our current interface, it makes sense to me to forbid it.

Here’s another example:

ds.filter_entries(f.cond(crosswalk.locus.contig == "X",
                         ds.GT,
                         f.null(TCall())))

What does this do? Does it do an implicit join of crosswalk and ds along rows? How do I know it’s along rows and not entries? Should it be allowed? Is it meaningful?

Ah, yeah, I think you’re right, this syntax is unnecessary.

I think this should be invalid syntax. I think I gave the wrong impression that I want implicit joins. I don’t. I agree that those are error prone.

I just want to use expressions like they’re matrix tables or tables with exactly one field.

Let me try my example again now that you’ve helped me focus my thinking:

impute_sex(mydata.GT, crosswalk.locus, gnomad.af)

def impute_sex(call, locus, maf, include_par, maf_threshold, f_thresh, m_thresh):
    f = functions
    gr = locus.dtype.reference_genome

    if maf is None:
        maf = (agg.sum(call.num_alt_alleles()).to_float64() /
               agg.count_where(f.is_defined(call)) / 2)

    locus = tablify(locus)
    call = tablify(call)
    maf = tablify(maf)

    call = call.filter_rows(f.capture(set(gr.x_contigs)).contains(locus[call.1]._.contig))

    if (not include_par):
        call = call.filter_rows(f.capture(gr.par).exists(lambda par: par.contains(locus[call.1]._)),
                                keep=False)

    call = call.filter_rows(maf[call.1]._ > maf_threshold)
    call = call.annotate_cols(ib = agg.inbreeding(call._, maf[call.1]._))
    return call.select_cols(
        isFemale = f.cond(call.ib.Fstat < f_thresh,
                          True,
                          f.cond(call.ib.Fstat > m_thresh,
                                 False,
                                 f.null(TBoolean()))),
        **call.ib).cols_table()

def tablify(expr):
    s = expr._indices.source
    l = len(expr._indices.axes)
    if   l == 0:
        return s.select_globals(_=expr).globals()._
    elif l == 1 and isinstance(s, Table):
        return s.select(_=expr)
    elif l == 1 and s._row_axis in expr._indices.axes:
        return s.select_rows(_=expr).rows_table()
    elif l == 1 and s._col_axis in expr._indices.axes:
        return s.select_cols(_=expr).cols_table()
    elif l == 2:
        return s.select_entries(_=expr)

I guess this isn’t so nice if you have to explicitly call tablify. We can add annotate / filter / select methods to expressions, but join and array indexing share the same syntax; therefore, I don’t see a way to get rid of explicit calls to tablify.

I should define what I mean by implicit join – a join that the user didn’t write down with their own fingers. I think they’re bad because they make assumptions about join keys and will likely have significant performance implications for the life of the 0.2 stable version.
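
Side by side, reusing the calls from earlier in this note:

impute_sex(mydata.GT, crosswalk.locus, gnomad.af)                # the joins happen inside the method

methods.impute_sex(my_data.GT,
                   crosswalk[my_data.rsid].locus,
                   population_frequency=gnomad[my_data.v, :].af) # the joins are written by the caller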

This implementation of impute_sex does joins that the user didn’t type, and that’s the primary reason I’m pushing back. I also think that the proposed syntax with lots of _ everywhere isn’t especially readable.

I do want to pull on this thread, though:

I just want to use expressions like they’re matrix tables or tables with exactly one field.

I want to know the contexts where our current interface falls short. Here’s one:

gnomad # dataset with row fields AF, aindex
ds # my dataset

I want to add a row field to ds for gnomad.AF[gnomad.aindex]
Here’s what we can do now:

join = gnomad[ds.row_key, :]
ds = ds.annotate_rows(gnomad_af = join.AF[join.aindex])

It doesn’t, however, work if aindex is a global field, because only row fields are exposed in join.
So you’d need to do one of the following:

gnomad_2 = gnomad.annotate_rows(AF = gnomad.AF[gnomad.aindex])
ds = ds.annotate_rows(gnomad_af = gnomad_2[ds.row_key, :].AF)

or

ds = ds.annotate_rows(gnomad_af = gnomad[ds.row_key, :].AF[gnomad[:, :].aindex])

Just to confirm, are the following equivalent?

gnomad[ds.row_key, :]
gnomad.rows_table()

It seems very reasonable for rows_table to keep the globals. If we think about our cartoon:

+-+----+
| |    |
+-+----+
| |    |
| |    |
+-+----+

It’s reasonable for rows_table to mean “drop the column index”, ergo we get the whole left-hand-side.

I’d call this “reducing dimensionality” and I think it is more natural than “get things indexed by X”. If we ever have 3 dimensions, I suspect we will want “remove things keyed by axis X” more than “give me only the things keyed by axis Y”.

(Aside: I now think the name entries_table sounds a bit weird to me. It’s really about taking the product of two axes to get a single axis, rather than about throwing keys away)

I agree on the latter point, and perhaps that’s all there is to say about this. (I now follow with thoughts about your other point, but I guess, assuming we have annotate_all, I’m fine letting this issue die for now)

On the former point, you are correct, but I don’t think that’s a bad thing. Allowing a library method to perform a series of joins for you could be helpfully succinct. In general, our library methods might become very complex! The method should precisely document the relationships, but otherwise that seems fine.

I guess if we have annotate_all, I can write:

impute_sex(my_data.GT, crosswalk.locus[my_data.rsid], gnomad[my_data.rsid, :].AF)

and this seems fine. I’m just not bothered by the method having well-documented, expected relationships between the axes of the inputs.