I have a question about scaling when using Hail in a Jupyter notebook on Terra. Are there ways to improve the write time for saving a matrix? I don’t believe the write function benefits from a Spark cluster. But, for example, with export_vcf you can save as shards. I am currently trying to create a tutorial that uses Hail, and saving a matrix with 1,000 individuals takes 50 minutes; we expect users to scale this analysis to 10,000+ individuals. The notebook imports a VCF, converts it to a Hail matrix, filters through various steps, and then writes the matrix. So I am not sure whether a batch job is best for these interactive filtering steps.
Write functions certainly benefit from parallelism. In fact, those ‘files’ are actually folders within which the data is written in shards; the distribution of the write is hidden from the user, which makes for a much nicer user experience!
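The shape of this can be illustrated in plain Python. This is a toy analogy, not Hail’s actual writer: the point is just that a “write” targets a directory, and several workers each write their own part file into it concurrently (the path 'toy_shards' and the JSON format are made up for the sketch).

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor

def write_shard(args):
    """Write one shard (part file) inside the output directory."""
    out_dir, shard_id, rows = args
    path = os.path.join(out_dir, f'part-{shard_id:05d}.json')
    with open(path, 'w') as f:
        for row in rows:
            f.write(json.dumps(row) + '\n')
    return path

def parallel_write(rows, out_dir, n_shards=4):
    # The "file" the caller names is really a directory; each worker
    # writes its own shard into it, so the writes happen in parallel.
    os.makedirs(out_dir, exist_ok=True)
    chunks = [rows[i::n_shards] for i in range(n_shards)]
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        return list(pool.map(write_shard,
                             [(out_dir, i, c) for i, c in enumerate(chunks)]))

# 'toy_shards' is a hypothetical output path for this sketch.
paths = parallel_write([{'id': i} for i in range(100)], 'toy_shards')
print(len(paths))  # prints 4
```

In Hail the same pattern means a .mt written on a well-configured cluster is saved by many executors at once, so write time should drop as the cluster grows.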
My guess is that, the way things are configured in Terra, it isn’t actually using the cluster at all (running in Spark local mode instead). If we rule that out, we can look into other issues, like importing a non-block-gzipped VCF (with import_vcf(..., force=True)).
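You can rule out the non-block-gzip case without re-running the import by inspecting the file header directly. This is a sketch based on the BGZF spec (block-gzip files are gzip files whose header sets FEXTRA and carries a ‘BC’ extra subfield), not a Hail API; the paths are hypothetical.

```python
def is_bgzf(path):
    """Return True if `path` looks like a BGZF (block-gzip) file.

    Per the BGZF spec, the header starts with the gzip magic bytes,
    sets the FEXTRA flag (0x04), and carries a 'BC' extra subfield
    at byte offsets 12-13.
    """
    with open(path, 'rb') as f:
        head = f.read(14)
    return (len(head) == 14
            and head[:4] == b'\x1f\x8b\x08\x04'
            and head[12:14] == b'BC')
```

If this returns False for a .vcf.gz, Hail has to decompress it on a single thread (hence force=True); re-compressing the file with bgzip restores parallel import.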
Beth, can you share the output of
OK, it must be something else. Can you make a post over at the user forum (https://discuss.hail.is) with the first post you made here plus the code you ran?