Improve matrix write time?

bethsheets · October 24, 2019, 4:49pm

I have a question about scaling when using Hail in a Jupyter notebook on Terra. Are there ways to improve the write time when saving a matrix? I don’t believe the write function benefits from a Spark cluster, whereas export_vcf, for example, can save its output as shards. I am currently creating a tutorial that uses Hail, and saving a matrix with 1,000 individuals takes 50 minutes; we expect users to scale this analysis to 10,000+ individuals. The notebook imports a VCF, converts it to a Hail matrix, filters through various steps, and then writes the Hail matrix. Since these filtering steps are interactive, I am not sure a batch job is the best fit.

tpoterba · October 24, 2019, 5:05pm

Hi Beth,
Hail table/MatrixTable write functions certainly benefit from parallelism – in fact, those ‘files’ are actually folders within which data is written in shards. The distribution of saving things is hidden from the user, which makes for a much nicer user experience!
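To see the sharding described above for yourself, you can count the shard files inside a written MatrixTable folder. This is only a sketch: it assumes the folder is on a local filesystem (not a gs:// bucket) and that shard files have names starting with `part-`; both the helper name `count_part_files` and that prefix are assumptions made for illustration, not a documented Hail interface.

```python
import os


def count_part_files(mt_path, prefix="part-"):
    """Count shard files per subdirectory of a written table folder.

    `prefix` is an assumed naming convention for shard files; adjust it
    if the files in your folder are named differently.
    """
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(mt_path):
        n = sum(1 for name in filenames if name.startswith(prefix))
        if n:
            # Key by path relative to the table folder for readability.
            counts[os.path.relpath(dirpath, mt_path)] = n
    return counts
```

Calling `count_part_files("my_data.mt")` on a hypothetical local path returns a dict mapping each subfolder to its shard count. A count of one per subfolder would suggest the data was written as a single partition, which limits how much a cluster can help.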

My guess is that, the way things are configured in Terra, Hail isn’t actually using the cluster at all (it’s running in Spark local mode instead). If we rule that out, we can look into other issues, like importing a non-block-gzipped VCF (with import_vcf(..., force=True) ), which has to be read serially rather than in parallel.
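One way to check the second possibility is to look at the first bytes of the .vcf.gz file. A block-gzipped (BGZF) file is a gzip file whose header carries an extra `BC` subfield, while plain gzip output normally has none. The helper below is a minimal sketch using only the standard library; the name `is_bgzf` is made up for this example, and it reads a local file, so a file in a bucket would need to be copied down (or its first few hundred bytes fetched) first.

```python
import struct


def is_bgzf(path):
    """Return True if the file begins with a BGZF (block gzip) header."""
    with open(path, "rb") as f:
        header = f.read(12)
        # Need gzip magic (1f 8b), deflate (08), and the FEXTRA flag bit.
        if len(header) < 12 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
            return False
        # XLEN: little-endian length of the extra field that follows.
        xlen = struct.unpack("<H", header[10:12])[0]
        extra = f.read(xlen)
    # Walk the extra subfields looking for 'B','C' with a 2-byte payload.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if si1 == 66 and si2 == 67 and slen == 2:
            return True
        pos += 4 + slen
    return False
```

If this returns False for a .vcf.gz file, recompressing it with bgzip before import should let the import run in parallel without force=True.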

dking · October 24, 2019, 6:37pm

Beth, can you share the output of hl.spark_context().master?

bethsheets · October 24, 2019, 6:44pm

hl.spark_context().master outputs
'yarn'
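For anyone repeating this check: Spark local mode shows up as a master of `local` or `local[...]` (for example `local[*]`), while `yarn` means jobs are submitted to a YARN cluster. A small sketch of that interpretation follows; the function name `is_local_master` is hypothetical and not part of Hail or Spark.

```python
def is_local_master(master: str) -> bool:
    """True when a Spark master string denotes local mode (no cluster)."""
    return master == "local" or master.startswith("local[")


# Intended use in a notebook where Hail is already initialized:
#     import hail as hl
#     is_local_master(hl.spark_context().master)
```

With the value reported here, `is_local_master('yarn')` is False, so local mode is ruled out as the cause.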

tpoterba · October 24, 2019, 6:45pm

OK, must be something else. Can you make a post over at the user forum (https://discuss.hail.is) with the first post you made here + the code you ran?

bethsheets · October 24, 2019, 6:50pm
