The huge, bloated, vertically-integrated VariantDataset Python object has caught the eye of the trust busters. I propose breaking it up into (a), a general Matrix object with the general functionality described in Hail2 Python interface discussion, and (b) a VariantSampleMatrix inheritor where the genetics-specific methods live.
We can even bind annotate_variants and annotate_samples to annotate_rows and annotate_cols, but have them return a VariantDataset instead. This could also introduce more confusion than it’s worth, though.
There can be a constructor for VariantDataset that takes a Matrix, and checks the types first.
NB: I’m not sold on Matrix, since we might want to reserve that for numerical matrices. StructuredMatrix? That’s even worse.
I agree a rename is need. Table is a good choice. What about VariantDataset? Some other options that were floated:
Table and Matrix
Table and MatrixTable
DataFrame and DataMatrix
Anything else?
My argument against Matrix is that it sounds too numerical and we’ll have a separate Matrix for lingalg stuff. But structured matrix and numeric matrix should be the same, so if we do that sooner rather than later, Matrix is more attractive.
For proposal 2, I had too suggestions:
Keep the mega object for now. The reason I don’t like another object is because you actually get a proliferation (VariantDataset vs VariantKeyDataset vs GenericDataset) and you have to worry about entering and leaving the VariantDataset world as a user.
Don’t use . chaining in Python, instead use something like >> of dplython so operations can be added independently of the Matrix class. This avoids the mega class, avoids monkey patching, lets you collect groups of common operations into namespaces and we add operations the same way the users do.
I’m conflicted. On one hand, I feel totally compelled by the organization benefits of >>. On the other hand, I feel like users might be totally confused with what’s going on.
While we’re on renaming a bunch of things, thoughts on aIndex -> a_index and the like to shift more towards pythonic style? Might be a huge pain for people to shift over, but certainly better earlier than later if ever. I don’t have a strong opinion, but just noticed that it’s popping up a lot more (e.g. newV)
I agree that we should py-style every annotation we generate.
This brings me back to the conversation we never had about a_index – what should the interface for filter_alleles look like? How do we expose locally-scoped variables like that?
I do not want to document or explain AlleleFilteredDataset, or any other similar method-specific VDS inheritor. What operations does it have? It will be extremely difficult for users to reason about it.
Lambdas aren’t ideal, but they would be much clearer:
Although here, I think it would make sense to just add an annotation fieldold_to_new or something. We don’t expose annotation expressions in split multi.
I can’t say I much like the lambda, either. And with wasSplit, newV, oldToNew, etc., it’s going to need a lot of arguments.
Another option is to create a DataMatrix with the necessary additions as annotations. So m.prep_split_multi() would return a DataMatrix with newv and allele_index (variant) annotations, duplicating rows once per allele. This could even be implemented as a variant-parallel explode on the set of alleles*. Then I see two problems.
*We probably want to require such an operation be monotonic.
One: we don’t want to leave a DataMatrix in a partially updated, invalid state (in particular I’m thinking about dependently typed genotype fields, so we can’t update v, say, without simultaneously updating g.PL.) So then we need a way to annotate field everywhere (v, va, g) simultaneously. I’m just not sure how to write that.
Two: we need to make sure we can make that efficient.
And the right hand sides in both cases refer to m, including the annotate_genotypes. But then you’re going to say “What’s the return type of atomic, I don’t want to document that.”
I do not want to document or explain AlleleFilteredDataset, or any other similar method-specific VDS inheritor. What operations does it have? It will be extremely difficult for users to reason about it.
Yeah, I think most users don’t want to reasonable about it, they just want a recipe, and the sophisticated ones won’t have trouble understanding what’s really going on.