Vds.new_combiner

The vds.new_combiner overview is missing on this page. Could that section be updated?
https://hail.is/docs/0.2/vds/index.html

Also, is there a tutorial for the new variant dataset API?

Should vds.new_combiner now be used instead of hail.experimental.run_combiner for combining GVCFs?

Yes, definitely. The VCF combiner / VDS combiner took off a bit faster than we anticipated! I expect some basic docs to be up before the end of the week.

Docs for the combiner can now be found here. Documentation on the VDS representation and associated classes can be found here.
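
For anyone skimming this thread later, basic usage looks roughly like the following sketch (all paths are placeholders; note that new_combiner builds a plan and run() executes it, which comes up below):

import hail as hl

# Build a combiner plan from a list of GVCFs (placeholder paths).
combiner = hl.vds.new_combiner(
    gvcf_paths=['/path/to/sample1.g.vcf.gz', '/path/to/sample2.g.vcf.gz'],
    output_path='/path/to/dataset.vds',
    temp_path='/path/to/tmp',
    reference_genome='GRCh38',
    use_genome_default_intervals=True,
)
combiner.run()  # new_combiner only builds the plan; run() does the combining

vds = hl.vds.read_vds('/path/to/dataset.vds')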

Thanks @dking! I just tried the new_combiner and I’m getting the following warning, after which the job suddenly stops. Any suggestions on this?

Running on Apache Spark version 3.1.2
SparkUI available at http://scc-hadoop.bu.edu:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.97-ccc85ed41b15
LOGGING: writing to /restricted/projectnb/kageproj/gatk/combiner/combiner.log
n gvcfs: 150
vds:/project/kageproj/dragen/kage.gvcf.chr22.vds
2022-09-21 13:20:25 Hail: INFO: Coerced sorted dataset
2022-09-21 13:20:40 Hail: WARN: generated combiner save path of tmp/combiner-plans/vds-combiner-plan_fef8d273ee43de7affbc317e5237cef1fa4af84bda70d343a40dc1841d971c42_0.2.97.json
2022-09-21 13:20:42 Hail: WARN: gvcf_batch_size of 100 would produce too many tasks using 58 instead

I decreased the number of GVCFs from ~3200 to 150 and still get this warning before the job stops.

The Hail script is this:

import os
import sys
import hail as hl
chr = sys.argv[1]
hl.init(log='/restricted/projectnb/kageproj/gatk/combiner/combiner.log', tmp_dir='/project/kageproj/tmp')
os.system("hdfs dfs -ls -C /project/kageproj/gvcf.rb/chr"+chr+"/*.rb.g.vcf.gz |head -150 >combiner_"+chr+".list")
os.system("hdfs dfs -put -f combiner_"+chr+".list /project/kageproj/gvcf.rb")

path_to_input_list = '/project/kageproj/gvcf.rb/combiner_'+chr+'.list'  # a file with one GVCF path per line
gvcfs = []
with hl.hadoop_open(path_to_input_list, 'r') as f:
    for line in f:
        gvcfs.append(line.strip())
print("n gvcfs: "+str(len(gvcfs)))
vds_path = "/project/kageproj/dragen/kage.gvcf.chr"+chr+".vds"  # output destination
print("vds:"+vds_path)
temp_bucket = '/project/kageproj/tmp'  # bucket for storing intermediate files

hl.vds.new_combiner(reference_genome='GRCh38',
                    temp_path='tmp',
                    gvcf_paths=gvcfs,
                    output_path=vds_path,
                    use_genome_default_intervals=True)

new_combiner doesn’t actually run the combiner; it only builds the plan. Do this instead:

combiner = hl.vds.new_combiner(reference_genome='GRCh38',
                               temp_path='tmp',
                               gvcf_paths=gvcfs,
                               output_path=vds_path,
                               use_genome_default_intervals=True)

combiner.run()

The warning is about the batch size, not the number of input GVCF files. The batch size controls how many GVCF files are combined in one iteration of the combiner; I believe we’re attempting to defend against Spark’s task-graph size limitations. You shouldn’t need to take any action. If you want to silence the warning, pass batch_size=50 to new_combiner.
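
For example (reusing the gvcfs and vds_path variables from your script; 50 here is just an illustrative value):

combiner = hl.vds.new_combiner(reference_genome='GRCh38',
                               temp_path='tmp',
                               gvcf_paths=gvcfs,
                               output_path=vds_path,
                               use_genome_default_intervals=True,
                               batch_size=50)  # combine at most 50 GVCFs per merge round
combiner.run()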

Thanks! I then tried to convert the VDS to a matrix table and a VCF with the following steps. During the VCF export, an Invalid type for format field ‘gvcf_info’ error was generated (see below). How should this error be dealt with? Is there a better approach to create a VCF from a VDS?

import os
import sys
import hail as hl
chr = sys.argv[1]
hl.init(log='/restricted/projectnb/kageproj/gatk/combiner/combiner.log', tmp_dir='/project/kageproj/tmp')
os.system("hdfs dfs -ls -C /project/kageproj/gvcf.rb/chr"+chr+"/*.rb.g.vcf.gz >combiner_"+chr+".list")
os.system("hdfs dfs -put -f combiner_"+chr+".list /project/kageproj/gvcf.rb")

path_to_input_list = '/project/kageproj/gvcf.rb/combiner_'+chr+'.list'  # a file with one GVCF path per line
print(path_to_input_list)
gvcfs = []
with hl.hadoop_open(path_to_input_list, 'r') as f:
    for line in f:
        gvcfs.append(line.strip())
print("n gvcfs: "+str(len(gvcfs)))
vds_path = "/project/kageproj/dragen/kage.gvcf.chr"+chr+".vds"  # output destination
print("vds:"+vds_path)
temp_bucket = '/project/kageproj/tmp'  # bucket for storing intermediate files

combiner = hl.vds.new_combiner(reference_genome='GRCh38',
                               temp_path='tmp',
                               gvcf_paths=gvcfs,
                               output_path=vds_path,
                               use_genome_default_intervals=True)
combiner.run()
vds = hl.vds.read_vds(vds_path)
print("before split")
vds_split = hl.vds.split_multi(vds)
print("after split")
mt = hl.vds.to_dense_mt(vds_split)
print("after to dense")
mt.count()
mt.describe()
mt.summarize()
mt.show(10)
mt.write("/project/kageproj/mt/gard.chr"+chr+".raw.mt")
mt = hl.read_matrix_table("/project/kageproj/mt/gard.chr"+chr+".raw.mt")
print(mt.count())
mt.describe()
mt.show()
hl.export_vcf(mt, "/project/kageproj/pVCF/gard.chr"+chr+".raw.vcf.bgz")

During the export, the following error occurred:

showing top 10 rows
showing the first 0 of 3449 columns
2022-09-22 10:43:43 Hail: WARN: export_vcf: ignored the following fields:
    'a_index' (row)
    'was_split' (row)
Traceback (most recent call last):
  File "/restricted/projectnb/kageproj/gatk/vds.pipeline/./check_vds.py", line 44, in <module>
    hl.export_vcf(mt, "/project/kageproj/pVCF/gard.chr"+chr+".raw.vcf.bgz")
  File "", line 2, in export_vcf
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return original_func(*args, **kwargs)
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/methods/impex.py", line 554, in export_vcf
    Env.backend().execute(ir.MatrixWrite(dataset._mir, writer))
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 104, in execute
    self._handle_fatal_error_from_backend(e, ir)
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/backend.py", line 181, in _handle_fatal_error_from_backend
    raise err
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec)
  File "/share/pkg.7/spark/3.1.2/install/spark-3.1.2-bin-scc-spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 31, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: HailException: Invalid type for format field 'gvcf_info'. Found 'struct{AC: array, AF: array, AN: int32, AS_BaseQRankSum: array, AS_FS: array, AS_InbreedingCoeff: array, AS_MQ: array, AS_MQRankSum: array, AS_QD: array, AS_QUALapprox: array, AS_RAW_BaseQRankSum: str, AS_RAW_MQ: array, AS_RAW_MQRankSum: array<tuple(float64, int32)>, AS_RAW_ReadPosRankSum: array<tuple(float64, int32)>, AS_ReadPosRankSum: array, AS_SB_TABLE: array<array>, AS_SOR: array, AS_VarDP: array, BaseQRankSum: float64, DRAGstrInfo: array, DRAGstrParams: array, ExcessHet: float64, FS: float64, InbreedingCoeff: float64, MQ: float64, MQRankSum: float64, MQ_DP: int32, QD: float64, QUALapprox: int32, RAW_GT_COUNT: array, RAW_MQandDP: array, ReadPosRankSum: float64, SOR: float64, VarDP: int32}'.

Java stack trace:
is.hail.utils.HailException: Invalid type for format field 'gvcf_info'. Found 'struct{AC: array, AF: array, AN: int32, AS_BaseQRankSum: array, AS_FS: array, AS_InbreedingCoeff: array, AS_MQ: array, AS_MQRankSum: array, AS_QD: array, AS_QUALapprox: array, AS_RAW_BaseQRankSum: str, AS_RAW_MQ: array, AS_RAW_MQRankSum: array<tuple(float64, int32)>, AS_RAW_ReadPosRankSum: array<tuple(float64, int32)>, AS_ReadPosRankSum: array, AS_SB_TABLE: array<array>, AS_SOR: array, AS_VarDP: array, BaseQRankSum: float64, DRAGstrInfo: array, DRAGstrParams: array, ExcessHet: float64, FS: float64, InbreedingCoeff: float64, MQ: float64, MQRankSum: float64, MQ_DP: int32, QD: float64, QUALapprox: int32, RAW_GT_COUNT: array, RAW_MQandDP: array, ReadPosRankSum: float64, SOR: float64, VarDP: int32}'.
at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:17)
at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:17)
at is.hail.utils.package$.fatal(package.scala:78)
at is.hail.io.vcf.ExportVCF$.$anonfun$checkFormatSignature$1(ExportVCF.scala:91)
at is.hail.io.vcf.ExportVCF$.$anonfun$checkFormatSignature$1$adapted(ExportVCF.scala:85)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at is.hail.io.vcf.ExportVCF$.checkFormatSignature(ExportVCF.scala:85)
at is.hail.expr.ir.MatrixVCFWriter.lower(MatrixWriter.scala:411)
at is.hail.expr.ir.WrappedMatrixWriter.lower(MatrixWriter.scala:54)
at is.hail.expr.ir.lowering.LowerTableIR$.apply(LowerTableIR.scala:679)
at is.hail.expr.ir.lowering.LowerToCDA$.lower(LowerToCDA.scala:73)
at is.hail.expr.ir.lowering.LowerToCDA$.apply(LowerToCDA.scala:18)
at is.hail.expr.ir.lowering.LowerToDistributedArrayPass.transform(LoweringPass.scala:77)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:27)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:67)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:15)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:13)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:416)
at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:452)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:70)
at is.hail.utils.package$.using(package.scala:640)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:70)
at is.hail.utils.package$.using(package.scala:640)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:59)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:310)
at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:449)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:448)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)

Hail version: 0.2.97-ccc85ed41b15
Error summary: HailException: Invalid type for format field 'gvcf_info'. Found 'struct{AC: array, AF: array, AN: int32, AS_BaseQRankSum: array, AS_FS: array, AS_InbreedingCoeff: array, AS_MQ: array, AS_MQRankSum: array, AS_QD: array, AS_QUALapprox: array, AS_RAW_BaseQRankSum: str, AS_RAW_MQ: array, AS_RAW_MQRankSum: array<tuple(float64, int32)>, AS_RAW_ReadPosRankSum: array<tuple(float64, int32)>, AS_ReadPosRankSum: array, AS_SB_TABLE: array<array>, AS_SOR: array, AS_VarDP: array, BaseQRankSum: float64, DRAGstrInfo: array, DRAGstrParams: array, ExcessHet: float64, FS: float64, InbreedingCoeff: float64, MQ: float64, MQRankSum: float64, MQ_DP: int32, QD: float64, QUALapprox: int32, RAW_GT_COUNT: array, RAW_MQandDP: array, ReadPosRankSum: float64, SOR: float64, VarDP: int32}'.

VCFs cannot have struct-typed FORMAT fields. In particular, you can’t represent an entry like:

{"some_struct_entry_field": {
    "a_nested_entry_field": 123, 
    "another_nested_entry_field": 456
 },
 "GT": "0/1"
}

You can .drop("gvcf_info") to throw away all that information, or you can flatten the structure by lifting all of its fields to the top level:

mt = mt.select_entries(**mt.entry.flatten())
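
Or, if you don’t need any of that information in the VCF, a minimal sketch of the drop option (the output path is a placeholder):

mt = mt.drop('gvcf_info')  # discard the struct-typed entry field entirely
hl.export_vcf(mt, '/path/to/output.vcf.bgz')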

I should warn you that generating a dense project VCF will create a file much larger than the VDS! That’s fine if it’s what you need, but it will be rather large, particularly as your sample sizes get into the hundreds of thousands.