Proposed changes to Python object hierarchy

Proposal 1: rename KeyTable to Table

This makes much more sense.

Proposal 2: Break up current VariantDataset

The huge, bloated, vertically-integrated VariantDataset Python object has caught the eye of the trust busters. I propose breaking it up into (a) a general Matrix object with the general functionality described in the Hail2 Python interface discussion, and (b) a VariantSampleMatrix inheritor where the genetics-specific methods live.

We can even bind annotate_variants and annotate_samples to annotate_rows and annotate_cols, but have them return a VariantDataset instead. This could also introduce more confusion than it’s worth, though.

There can be a constructor for VariantDataset that takes a Matrix, and checks the types first.
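
To make that concrete, here's a rough sketch of what (a) and (b) might look like. Sketch only: Matrix, VariantSampleMatrix, and the _looks_genetic check are placeholder names, not anything that exists today.

# Sketch only; all names here are placeholders, not existing Hail API.
def _looks_genetic(m):
    # stand-in for the type check: row key is Variant, column key is Sample, etc.
    return True

class Matrix(object):
    # (a) the general object: generic row/column operations live here
    def annotate_rows(self, **exprs):
        ...  # would return a new Matrix with the added row fields
    def annotate_cols(self, **exprs):
        ...  # would return a new Matrix with the added column fields

class VariantSampleMatrix(Matrix):
    # (b) the inheritor where genetics-specific methods live
    def __init__(self, m):
        # the constructor takes a Matrix and checks the types first
        if not _looks_genetic(m):
            raise TypeError("row key must be Variant, column key must be Sample")
        self._m = m

    # annotate_variants binds to annotate_rows but returns the subclass
    def annotate_variants(self, **exprs):
        return VariantSampleMatrix(self.annotate_rows(**exprs))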

NB: I’m not sold on Matrix, since we might want to reserve that for numerical matrices. StructuredMatrix? That’s even worse.

I agree a rename is needed. Table is a good choice. What about VariantDataset? Some other options that were floated:

  • Table and Matrix
  • Table and MatrixTable
  • DataFrame and DataMatrix

Anything else?

My argument against Matrix is that it sounds too numerical and we’ll have a separate Matrix for linalg stuff. But structured matrix and numeric matrix should be the same, so if we do that sooner rather than later, Matrix is more attractive.

For proposal 2, I had two suggestions:

  • Keep the mega object for now. The reason I don’t like another object is that you actually get a proliferation (VariantDataset vs VariantKeyDataset vs GenericDataset) and you have to worry about entering and leaving the VariantDataset world as a user.

  • Don’t use . chaining in Python; instead use something like the >> of dplython, so operations can be added independently of the Matrix class (rough sketch just below). This avoids the mega class, avoids monkey patching, lets you collect groups of common operations into namespaces, and means we add operations the same way users do.
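
To illustrate that second suggestion, here's a tiny dplython-style sketch; pipeable, select, and the toy table are made up for illustration, not real API:

class pipeable(object):
    def __init__(self, f):
        self.f = f
    def __rrshift__(self, ds):
        # evaluated for `ds >> op`, so ops never need to be methods on Matrix
        return self.f(ds)

def select(*fields):
    # this can live in any module and be added the same way a user would add it
    return pipeable(lambda ds: {k: ds[k] for k in fields})

table = {'AC': 5, 'AF': 0.01, 'AN': 500}   # toy stand-in for a Table/Matrix
table >> select('AC', 'AF')                # {'AC': 5, 'AF': 0.01}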

I’m conflicted. On one hand, I feel totally compelled by the organizational benefits of >>. On the other hand, I feel like users might be totally confused about what’s going on.

Aren’t these the same users that are comfortable hacking together a bash pipeline?

This is totally gross:

(table >> select(...)).schema

While we’re on renaming a bunch of things, thoughts on aIndex -> a_index and the like to shift more towards pythonic style? Might be a huge pain for people to shift over, but certainly better earlier than later if ever. I don’t have a strong opinion, but just noticed that it’s popping up a lot more (e.g. newV)

I agree on this one. aIndex and newV are both holdovers from early Scala influence. a_index? allele_index? allele_idx? aindex? iallele?

newv?

I agree that we should py-style every annotation we generate.

This brings me back to the conversation we never had about a_index – what should the interface for filter_alleles look like? How do we expose locally-scoped variables like that?

Incidentally, I like allele_index, but new_v (unless v is going to change to variant, but that just feels excessive).

The only way I know how to handle this in the current design is to create a new Python object with the relevant bindings. So something like:

  filt = vds.filter_alleles() # AlleleFilteredVariantDataset
  filt.filter(filt.va.AC[filt.aIndex - 1] == 0, ...)

Something like that. Thoughts?

I do not like that.

I do not want to document or explain AlleleFilteredDataset, or any other similar method-specific VDS inheritor. What operations does it have? It will be extremely difficult for users to reason about it.

Lambdas aren’t ideal, but they would be much clearer:

vds = vds.filter_alleles(lambda ai: vds.AC[ai - 1] != 0,
                         AC = lambda a_indices: a_indices[1:].map(lambda i: vds.AC[i]),
                         AF = lambda a_indices: a_indices[1:].map(lambda i: vds.AF[i]))

Although here, I think it would make sense to just add an annotation field old_to_new or something. We don’t expose annotation expressions in split multi.

wait, I guess we do now?

We now expose oldToNew and newToOld. Generic is … totally generic: https://github.com/hail-is/hail/pull/2465/files#diff-080e7219e05c35f86da998c2c1a1ef2aR1816

I can’t say I much like the lambda, either. And with wasSplit, newV, oldToNew, etc., it’s going to need a lot of arguments.

Another option is to create a DataMatrix with the necessary additions as annotations. So m.prep_split_multi() would return a DataMatrix with newv and allele_index (variant) annotations, duplicating rows once per allele. This could even be implemented as a variant-parallel explode on the set of alleles*. Then I see two problems.

*We probably want to require such an operation be monotonic.

One: we don’t want to leave a DataMatrix in a partially updated, invalid state (in particular I’m thinking about dependently typed genotype fields, so we can’t update v, say, without simultaneously updating g.PL). So then we need a way to annotate fields everywhere (v, va, g) simultaneously. I’m just not sure how to write that.

Two: we need to make sure we can make that efficient.
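
For concreteness, here's a toy sketch of the row-level part of prep_split_multi. Field names are made up, and genotype fields are deliberately ignored, which is exactly problem one:

# Toy sketch: explode each row once per alt allele, adding newv and allele_index.
def prep_split_multi(rows):
    out = []
    for row in rows:
        for i, alt in enumerate(row['alt_alleles']):
            new_row = dict(row)
            new_row['allele_index'] = i + 1                    # 0 is the reference
            new_row['newv'] = (row['locus'], row['ref'], alt)  # the biallelic variant
            out.append(new_row)
    return out

rows = [{'locus': '1:1000', 'ref': 'A', 'alt_alleles': ['T', 'G']}]
prep_split_multi(rows)   # two rows: T with allele_index 1, G with allele_index 2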

What about defining them on the filter_alleles object:

vds = vds >> filter_alleles(vds.AC[filter_alleles.a_index - 1] != 0,
                            AC = filter_alleles.a_indices[1:].map(lambda i: vds.AC[i]),
                            AF = filter_alleles.a_indices[1:].map(lambda i: vds.AF[i]))

Where’s the genotype updates? filter_alleles needs to update both: https://github.com/hail-is/hail/pull/2465/files#diff-080e7219e05c35f86da998c2c1a1ef2aR1743

Maybe this is another argument for separating variant and sample annotations from the genotypes.

ah, good point. Man, this is hard.

I want something like

(m.atomic()
  .annotate_variants(x = m.a)
  .annotate_genotypes(y = m.b)
  .commit())

And the right-hand sides in both cases refer to m, including in the annotate_genotypes. But then you’re going to say “What’s the return type of atomic? I don’t want to document that.”
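
To be concrete about that objection: atomic() could just return a tiny builder that queues the updates and applies them together in commit(). Toy sketch only, with plain callables standing in for expressions and a dict standing in for the matrix:

class _AtomicUpdate(object):
    # the only things this has are annotate_* and commit; m.atomic() would return one
    def __init__(self, m):
        self._m = m
        self._row_fns = {}
        self._entry_fns = {}
    def annotate_variants(self, **fns):
        self._row_fns.update(fns)
        return self
    def annotate_genotypes(self, **fns):
        self._entry_fns.update(fns)
        return self
    def commit(self):
        # one pass applies row and genotype updates together, so nothing is
        # ever observable in a half-updated state
        rows = [dict(r, **{k: f(r) for k, f in self._row_fns.items()})
                for r in self._m['rows']]
        entries = [dict(g, **{k: f(g) for k, f in self._entry_fns.items()})
                   for g in self._m['entries']]
        return {'rows': rows, 'entries': entries}

m = {'rows': [{'a': 1}], 'entries': [{'b': 2}]}
(_AtomicUpdate(m)
   .annotate_variants(x=lambda r: r['a'])
   .annotate_genotypes(y=lambda g: g['b'])
   .commit())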

I do not want to document or explain AlleleFilteredDataset, or any other similar method-specific VDS inheritor. What operations does it have? It will be extremely difficult for users to reason about it.

Yeah, I think most users don’t want to reason about it; they just want a recipe, and the sophisticated ones won’t have trouble understanding what’s really going on.

I guess this isn’t too far from the GroupedVariantDataset. That seems OK.