Proposal 1: rename
This makes much more sense.
Proposal 2: Break up current VariantDataset
The huge, bloated, vertically-integrated VariantDataset Python object has caught the eye of the trust busters. I propose breaking it up into (a) a general Matrix object with the general functionality described in the Hail2 Python interface discussion, and (b) a VariantSampleMatrix inheritor where the genetics-specific methods live.
We can even bind annotate_cols, but have them return a VariantDataset instead. This could also introduce more confusion than it’s worth, though. There can be a constructor for VariantDataset that takes a Matrix, and checks the types first.
NB: I’m not sold on Matrix, since we might want to reserve that for numerical matrices. StructuredMatrix? That’s even worse.
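To make the proposal concrete, here is a toy sketch of the split in Python. All names (Matrix, VariantSampleMatrix, the required types) and the type-checking constructor are illustrative stand-ins, not the real Hail API:

```python
# Sketch of proposal 2: a general Matrix base class plus a genetics-specific
# inheritor. Names and types are illustrative only.

class Matrix:
    """General structured matrix: row fields, column fields, entry fields."""
    def __init__(self, row_type, col_type, entry_type):
        self.row_type = row_type
        self.col_type = col_type
        self.entry_type = entry_type

    def annotate_cols(self, **exprs):
        # General column-annotation machinery would live here.
        return Matrix(self.row_type, self.col_type, self.entry_type)


class VariantSampleMatrix(Matrix):
    """Genetics-specific inheritor; variant/sample methods live here."""
    REQUIRED = ('variant', 'sample', 'genotype')

    def __init__(self, matrix):
        # The constructor takes a Matrix and checks the types first.
        types = (matrix.row_type, matrix.col_type, matrix.entry_type)
        if types != self.REQUIRED:
            raise TypeError('expected %s, got %s' % (self.REQUIRED, types))
        super().__init__(*types)
```

The open question from above still applies: methods like annotate_cols inherited from Matrix would return a Matrix unless the subclass re-binds them.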
I agree a rename is needed. Table is a good choice. What about VariantDataset? Some other options that were floated:
- Table and Matrix
- Table and MatrixTable
- DataFrame and DataMatrix
My argument against Matrix is that it sounds too numerical and we’ll have a separate Matrix for linalg stuff. But structured matrix and numeric matrix should be the same, so if we do that sooner rather than later, Matrix is more attractive.
For proposal 2, I had two suggestions:
1. Keep the mega object for now. The reason I don’t like another object is that you get a proliferation (VariantDataset vs VariantKeyDataset vs GenericDataset) and you have to worry about entering and leaving the VariantDataset world as a user.
2. Instead of . chaining in Python, use something like the >> of dplython so operations can be added independently of the Matrix class. This avoids the mega class, avoids monkey patching, lets you collect groups of common operations into namespaces, and we add operations the same way the users do.
I’m conflicted. On one hand, I feel totally compelled by the organizational benefits of >>. On the other hand, I feel like users might be totally confused about what’s going on.
Aren’t these the same users that are comfortable hacking together a bash pipeline?
This is totally gross:
(table >> select(...)).schema
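For concreteness, here is a toy version of the >> style, loosely modeled on dplython; Table and select are stand-ins, not real APIs. It shows how operations can live outside the class, and also why the parentheses in the example above are forced:

```python
# Toy >>-pipeline in the dplython style; all names are illustrative.

class Table:
    def __init__(self, schema):
        self.schema = schema

    def __rshift__(self, op):
        # table >> op applies a free-standing operation to the table.
        return op(self)


class select:
    """An operation defined independently of Table; new operations can be
    added the same way, without touching the Table class."""
    def __init__(self, *fields):
        self.fields = fields

    def __call__(self, table):
        return Table([f for f in table.schema if f in self.fields])


t = Table(['v', 'va', 'gs'])
# The parentheses are required: .schema binds tighter than >>,
# which is exactly the grossness complained about above.
result = (t >> select('v', 'va')).schema
```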
While we’re on renaming a bunch of things, thoughts on renaming a_index and the like to shift more towards pythonic style? Might be a huge pain for people to shift over, but certainly better earlier than later, if ever. I don’t have a strong opinion, but just noticed that it’s popping up a lot more (e.g., incidentally, I like v, but it’s going to change to variant, and that just feels excessive).
I agree on this one. aIndex and newV are both holdovers from early Scala influence. I agree that we should py-style every annotation we generate.
This brings me back to the conversation we never had about a_index: what should the interface for filter_alleles look like? How do we expose locally-scoped variables like that?
The only way I know how to handle this in the current design is to create a new Python object with the relevant bindings. So something like:

filt = vds.filter_alleles()  # AlleleFilteredVariantDataset
filt.filter(filt.va.AC[filt.aIndex - 1] == 0, ...)
Something like that. Thoughts?
I do not like that.
I do not want to document or explain
AlleleFilteredDataset, or any other similar method-specific VDS inheritor. What operations does it have? It will be extremely difficult for users to reason about it.
Lambdas aren’t ideal, but they would be much clearer:
vds = vds.filter_alleles(lambda ai: vds.AC[ai - 1] != 0,
AC = lambda a_indices: a_indices[1:].map(lambda i: vds.AC[i]),
AF = lambda a_indices: a_indices[1:].map(lambda i: vds.AF[i]))
Although here, I think it would make sense to just add an annotation field old_to_new or something. We don’t expose annotation expressions in split_multi.
I can’t say I much like the lambda, either. And with wasSplit, newV, oldToNew, etc., it’s going to need a lot of arguments.
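A minimal sketch of what an old_to_new annotation could carry, with hypothetical names and shapes: after filtering a multi-allelic variant’s alleles, map each original allele index to its new index (or None if dropped).

```python
# Hypothetical old_to_new annotation. split_alleles and its signature are
# illustrative, not a real Hail function.

def split_alleles(alleles, keep):
    """alleles: alt alleles of one variant; keep: predicate on an allele.
    Returns the kept alleles plus an old_to_new index map."""
    old_to_new = [0]  # index 0 (the reference allele) always maps to itself
    new_alleles = []
    for a in alleles:
        if keep(a):
            new_alleles.append(a)
            old_to_new.append(len(new_alleles))
        else:
            old_to_new.append(None)  # this allele was dropped
    return new_alleles, old_to_new


# Drop the spanning-deletion allele '*':
new_alleles, old_to_new = split_alleles(['A', '*', 'T'], lambda a: a != '*')
```

Downstream expressions could then re-index annotations like AC and AF through old_to_new instead of taking a lambda per field.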
Another option is to create a DataMatrix with the necessary additions as annotations. So m.prep_split_multi() would return a DataMatrix with allele_index (variant) annotations, duplicating rows once per allele. This could even be implemented as a variant-parallel explode on the set of alleles*. Then I see two problems.
*We probably want to require such an operation be monotonic.
One: we don’t want to leave a DataMatrix in a partially updated, invalid state (in particular, I’m thinking about dependently typed genotype fields, so we can’t update v, say, without simultaneously updating g.PL). So then we need a way to annotate fields everywhere (v, va, g) simultaneously. I’m just not sure how to write that.
Two: we need to make sure we can make that efficient.
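A rough, illustrative version of the variant-parallel explode over alleles (plain Python over dicts, not the real implementation): each variant row is duplicated once per alt allele, with allele_index attached as a new variant annotation, emitted in order to keep the operation monotonic within each input row.

```python
# Toy explode-by-allele; row shapes and field names are hypothetical.

def explode_by_allele(rows):
    """Duplicate each variant row once per alt allele, attaching
    allele_index (1-based, matching the alt-allele convention)."""
    out = []
    for row in rows:
        for i, allele in enumerate(row['alt_alleles'], start=1):
            new_row = dict(row)          # shallow copy of the variant row
            new_row['allele_index'] = i  # the new (variant) annotation
            new_row['alt_allele'] = allele
            out.append(new_row)
    return out


rows = [{'locus': '1:100', 'alt_alleles': ['A', 'T']},
        {'locus': '1:200', 'alt_alleles': ['G']}]
exploded = explode_by_allele(rows)
```

This only touches variant-level fields, which is exactly problem one above: the genotype fields (g.PL etc.) would still need a simultaneous update.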
What about defining them on the filter_alleles function object itself?

vds = vds >> filter_alleles(vds.AC[filter_alleles.a_index - 1] != 0,
                            AC = filter_alleles.a_indices[1:].map(lambda i: vds.AC[i]),
                            AF = filter_alleles.a_indices[1:].map(lambda i: vds.AF[i]))
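A toy of this function-object idea: filter_alleles owns symbolic placeholders (a_index here) that are only resolved per-allele at evaluation time. Everything below is hypothetical; a real version would overload operators so expressions read naturally instead of calling resolve explicitly.

```python
# Toy sketch of locally-scoped variables exposed as attributes of the
# operation object. All names are illustrative.

class Placeholder:
    """Stands in for a locally-scoped variable like a_index."""
    def resolve(self, env):
        return env['a_index']


class FilterAlleles:
    def __init__(self):
        # The function object doubles as a namespace for its local variables.
        self.a_index = Placeholder()

    def __call__(self, predicate):
        # predicate: env -> bool, built by the caller from placeholders.
        def run(alt_alleles):
            # Evaluate the predicate once per alt allele (1-based index).
            return [i for i in range(1, len(alt_alleles) + 1)
                    if predicate({'a_index': i})]
        return run


filter_alleles = FilterAlleles()

# Keep alt alleles whose AC entry is nonzero:
AC = [5, 0, 3]
keep = filter_alleles(
    lambda env: AC[filter_alleles.a_index.resolve(env) - 1] != 0)
kept_indices = keep(['A', '*', 'T'])
```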
Where are the genotype updates? filter_alleles needs to update both: https://github.com/hail-is/hail/pull/2465/files#diff-080e7219e05c35f86da998c2c1a1ef2aR1743
Maybe this is another argument for separating variant and sample annotations from the genotypes.
Ah, good point. Man, this is hard. I want something like

.annotate_variants(x = m.a)
.annotate_genotypes(y = m.b)

and the right-hand sides in both cases refer to m, including the annotate_genotypes. But then you’re going to say “What’s the return type of atomic? I don’t want to document that.”
(Quoting from above:) “I do not want to document or explain AlleleFilteredDataset, or any other similar method-specific VDS inheritor. What operations does it have? It will be extremely difficult for users to reason about it.”
Yeah, I think most users don’t want to reason about it; they just want a recipe, and the sophisticated ones won’t have trouble understanding what’s really going on.
I guess this isn’t too far from the GroupedVariantDataset. That seems OK.