We currently “handle” haploids by recording them as diploid homozygous and embedding the PL into the diploid diagonal making the off-diagonal entries haploidNonsensePL. In the full-generic world, I think we need to stop this. What are our options?
Don’t support diploid. I think that will be problematic.
Support true haploid in Call and don’t record the PL. I think this is the way to go. What’s involved?
use 1 bit of a call to indicate the ploidy. In the haploid case, the other bits code for the allele. In the diploid case, they code for the triangle number for the call.
add GT.ploidy which returns 1 or 2.
GT.j works in either case. (Or it should only be used in the diploid case?) In the haploid case, GT.gt == GT.j. GT.k is an error if the ploidy is not 2 (or should GT.j == GT.k for haploid calls?)
haploid is either homref or homvar, depending.
oneHotAlleles returns a vector with a single 1.
nNonRefAlleles returns 0 or 1, depending. What should PCA or linreg do for haploid? Just code them as 0 or 1? Generate an error?
I think this pretty naturally leads to two generalizations: phase (using a second bit) and multi-ploidy indicating alleles are stored out of line in an array.
I think this sounds reasonable, but we should plan for ploidy > 2 as well, even if we don’t implement it right away.
Call right now is stored as a 32 bit integer. We can have a maximum triangle number we’re willing to support. 16 bits will give 65535 triangle numbers, which seems like a sufficient amount to me. Then we can have the upper 16 bits to code for ploidy and anything else.
Some questions:
What do the PL’s look like for a haploid call in an actual VCF?
Should we be considering the case where ploidy could be 0 in an input file format such as a deletion?
GT.j and GT.k don’t make sense if there’s more than 2 alleles. Maybe it should be an array with length equal to ploidy?
What do the PL’s look like for a haploid call in an actual VCF?
There should be one for each genotype, but in the haploid case, that’s n, one for each allele (instead of n*(n + 1)/2 in the diploid case).
Here are diploid and haploid examples for a triallelic variant:
T A,G GT:PL 2/2:396,402,411,33,33,0 2:24,40,0
Should we be considering the case where ploidy could be 0 in an input file format such as a deletion?
Hmm, not sure I understand this. A deletion is still an allele, right? Wouldn’t 0-ploid mean there is no chromosome? I admit, I’m slightly fuzzy about when an indel becomes SV or a change in ploidy. Is there a way to encode this in a VCF? Is an empty call string allowed? But I’m OK with leaving space for 0 ploidy.
GT.j and GT.k don’t make sense if there’s more than 2 alleles. Maybe it should be an array with length equal to ploidy?
Good point. How about just GT[i] returns the allele of the ith chromosome set in the call GT? And And leave GT.j = GT[0] and GT.k = GT[1] for convenience.
I propose the following physical representation:
the low two bits represent the ploidy: haploid, diploid phased/unphased and stored out of line (4 cases),
in the first three cases, the upper 30 bits represent the call,
and when stored out of line, the upper 30 bits represent the 4-byte aligned offset to the calls stored out of line, with the first 32-bit word indicating the ploidy + phasing of the call.
there is no need to leave room for ploidy=0; in the case of a haploid deletion, the allele is D (or whatever you use to code for a deletion in Hail), and a homozygous deletion in a diploid set is DD, etc. In cells which do not have any DNA (red blood cells for example), well, there is no DNA information available at all, so no genotype.
Haploid support is of course the most urgent to attend to, allowing import and analysis of X and Y chromosomes, and Mitonchondria. Not being able to analyse the X chromosome in the UK Biobank dataset is a major drawback at the moment.
Even if, at least for now, you only update Hail’s methods to deal with haploid cases, I would argue that thinking from the beginning of ploidy in a general way is the best course of action, as these situations will arise, either when studying humans or other organisms. In humans, several conditions will require Hail to be resilient to polyploidy: partial aneuploidy (unbalanced translocations), trisomy (concerns 6 human autosomes), aneuploidy of sex chromosomes (such a klinefelter syndrome, X tri, tetra and pentasomy).
Most VCFs do not have the correct representation of haploid calls. I’m not yet familiar with the operations on genotypes, but most people will probably want to compute haploid from diploid on sex chromosomes within hail.
I think phase is a different beast as there is absolute and relative phase to consider. Absolute phase can be represented as 1 bit and is useful in the case of families where each allele can be assigned to its corresponding chromosome based on transmission. Now, in most sequencing cases, we don’t have family members and thus phase is computed statistically based on co-occurence of alleles in the reads and/or in the population. In this case, the result is blocks of variants phased relative to one another, but independent of the other blocks on the chromosome. So there, a single bit doesn’t do, as you also need to keep track of the phase block.
Hi @lfrancioli, thanks for the feedback. To clarify, this discussion only pertains to how the GT (call) fields are represented within Hail. So for call fields I’m proposing roughly:
0 or 1 becomes a haploid call (there is no way to write a missing haploid call),
0/0, 0/1, etc. become diploid unphased calls,
0|0, 0|1, etc. become diploid phased calls.
So in your second point, the haploid sex chromosome calls would be loaded as diploid, and then they can be called and rewritten as haploid.
(There’s also a question of phase is ploidy > 2. Should 0|1/2 be allowed? I think it is in by VCF spec, but GATK/HTSJDK uses a single bit for phase internally and I’m inclined to do the same.)
Therefore, I think phasing at the call level can be represented by as single bit.
For relative phasing, I expect the genotype to be encoded with the phase bit set (e.g. 0|1) and a separate genotype-level field (possible with the new generic genotype representation) to store the relative phase block information. I think that’s how your data is currently represented? What’s the FORMAT field for your existing phased data?
So I think we’re in agreement and the relative phase information can be stored with 1 bit in the call and another genotype field. Sound right?