Hello
I have a requirement to ingest a huge number of samples and do analysis on them. However, when I try to import VCFs from multiple samples, the import fails with an exception. Another question: is it possible to “append” data to an existing matrix table?
We’ll need a lot more information to help you out. What format are your samples in? GVCF, project VCF, something else? How many samples: 1k, 10k, 100k, 1M?
If your samples are in project VCF, what do you want to do about sites that exist in one project VCF but not the other?
There are 50k samples in normal VCFs (generated by Illumina on human samples), and more are coming every day. We need to append new VCFs to the matrix table and do variant quality control and other queries and analyses.
Moreover, each VCF has its own sample ID. Whenever I try import_vcf([path1, path2]) I get an exception saying the VCFs have different sample IDs.
hl.import_vcf([path1, path2, ...])
is intended for datasets that are partitioned along the variant axis, i.e. each file contains the same samples but a different range of variants. That’s why it expects the sample IDs to be shared across all files.
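For example, a minimal sketch of the intended use (the file paths here are hypothetical), where each file covers a different chromosome of the *same* cohort:

```python
import hail as hl

hl.init()

# Hypothetical paths: each VCF holds the SAME samples,
# but a different slice of the variant axis (e.g. one chromosome each).
paths = [
    'data/cohort.chr1.vcf.bgz',
    'data/cohort.chr2.vcf.bgz',
]

# import_vcf concatenates the files along the variant axis;
# it raises an error if the sample IDs differ between files,
# which is exactly the exception you are seeing with per-sample VCFs.
mt = hl.import_vcf(paths, reference_genome='GRCh38')
mt.describe()
```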
I’m still not sure whether you have “project” VCFs with one sample per file or “GVCFs” (which are necessarily one sample per file). Illumina has some docs on GVCFs, and there are some examples of GVCF files in the GATK docs. In particular, GVCF files preserve non-variant site records. Without these records, you cannot combine multiple samples in a principled way. Consider two samples: one that is heterozygous at a site, and one that doesn’t have many reads of any allele at that site. You don’t want to call the second sample “homozygous reference” because you don’t have enough evidence for that call.
If you have GVCFs, then you should take a look at the Hail VDS interface and the VDS Combiner. This is very new functionality and we are actively working on a more user-friendly interface. However, it has been successfully used to combine 955k whole exome GVCFs into one Hail VDS. The gnomAD team is in the process of analyzing those 955k whole exomes using the Hail VDS interface.
Critically, the VDS supports incremental addition, so you can add more samples at a later point in time without reprocessing all the original GVCF files.
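If it turns out you do have GVCFs, the combiner workflow looks roughly like this. This is a hedged sketch, not a definitive recipe: the bucket paths and sample names are placeholders, and since this interface is still new you should check the current Hail docs for `hl.vds.new_combiner` before running it.

```python
import hail as hl

hl.init()

# Initial combine: hypothetical GCS paths — substitute your own locations.
combiner = hl.vds.new_combiner(
    output_path='gs://my-bucket/dataset.vds',
    temp_path='gs://my-bucket/tmp/',
    gvcf_paths=['gs://my-bucket/gvcfs/sample1.g.vcf.bgz',
                'gs://my-bucket/gvcfs/sample2.g.vcf.bgz'],
    reference_genome='GRCh38',
)
combiner.run()

# Incremental addition: combine the existing VDS with newly arrived
# GVCFs instead of reprocessing all of the original files.
combiner = hl.vds.new_combiner(
    output_path='gs://my-bucket/dataset.v2.vds',
    temp_path='gs://my-bucket/tmp/',
    vds_paths=['gs://my-bucket/dataset.vds'],
    gvcf_paths=['gs://my-bucket/gvcfs/sample3.g.vcf.bgz'],
    reference_genome='GRCh38',
)
combiner.run()

vds = hl.vds.read_vds('gs://my-bucket/dataset.v2.vds')
```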
We have one file per sample, but they are not GVCFs — they are plain VCFs. We need to use Hail to analyze these 50k samples, and more samples are coming every day. I was not able to import them into a matrix table. So, can you suggest a solution?