vds.new_combiner

The vds.new_combiner overview is missing on this page. Could that section be updated?
https://hail.is/docs/0.2/vds/index.html

Also, is there a tutorial for the new variant dataset API?

Should vds.new_combiner now be used instead of hail.experimental.run_combiner for combining GVCFs?

Yes, definitely. The VCF combiner / VDS combiner took off a bit faster than we anticipated! I expect some basic docs to be up before the end of the week.

Docs for the combiner can now be found here. Documentation on the VDS representation and associated classes can be found here.
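
As a quick illustration of that representation (a minimal sketch; the path here is hypothetical): a VariantDataset bundles two MatrixTables, reference_data (the homozygous-reference blocks imported from the GVCFs) and variant_data (the entries at sites where at least one sample has a variant call).

import hail as hl

vds = hl.vds.read_vds('/path/to/dataset.vds')  # hypothetical path to an existing VDS
vds.reference_data.describe()  # hom-ref reference blocks from the GVCFs
vds.variant_data.describe()    # calls at variant sites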

Thanks @dking! I just tried the new_combiner and am getting the following warning, after which the job suddenly stops. Any suggestions?

Running on Apache Spark version 3.1.2
SparkUI available at http://scc-hadoop.bu.edu:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.97-ccc85ed41b15
LOGGING: writing to /restricted/projectnb/kageproj/gatk/combiner/combiner.log
n gvcfs: 150
vds:/project/kageproj/dragen/kage.gvcf.chr22.vds
2022-09-21 13:20:25 Hail: INFO: Coerced sorted dataset (0 + 1) / 1]
2022-09-21 13:20:40 Hail: WARN: generated combiner save path of tmp/combiner-plans/vds-combiner-plan_fef8d273ee43de7affbc317e5237cef1fa4af84bda70d343a40dc1841d971c42_0.2.97.json
2022-09-21 13:20:42 Hail: WARN: gvcf_batch_size of 100 would produce too many tasks using 58 instead

I decreased the number of GVCFs from ~3200 to 150 and still get this warning, and then it stops.

The Hail script is this:

import os
import sys
import hail as hl

chr = sys.argv[1]
hl.init(log='/restricted/projectnb/kageproj/gatk/combiner/combiner.log', tmp_dir='/project/kageproj/tmp')
os.system("hdfs dfs -ls -C /project/kageproj/gvcf.rb/chr" + chr + "/*.rb.g.vcf.gz | head -150 > combiner_" + chr + ".list")
os.system("hdfs dfs -put -f combiner_" + chr + ".list /project/kageproj/gvcf.rb")

path_to_input_list = '/project/kageproj/gvcf.rb/combiner_' + chr + '.list'  # a file with one GVCF path per line
gvcfs = []
with hl.hadoop_open(path_to_input_list, 'r') as f:
    for line in f:
        gvcfs.append(line.strip())
print("n gvcfs: " + str(len(gvcfs)))
vds_path = "/project/kageproj/dragen/kage.gvcf.chr" + chr + ".vds"  # output destination
print("vds:" + vds_path)
temp_bucket = '/project/kageproj/tmp'  # bucket for storing intermediate files

hl.vds.new_combiner(reference_genome='GRCh38',
                    temp_path='tmp',
                    gvcf_paths=gvcfs,
                    output_path=vds_path,
                    use_genome_default_intervals=True)

new_combiner doesn’t actually run the combiner – it builds the plan.

do:

combiner = hl.vds.new_combiner(reference_genome='GRCh38',
                               temp_path='tmp',
                               gvcf_paths=gvcfs,
                               output_path=vds_path,
                               use_genome_default_intervals=True)

combiner.run()
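
Once run() finishes, the combined output at vds_path can be loaded back for a quick sanity check (a sketch reusing the names from the script above):

result = hl.vds.read_vds(vds_path)
print(result.variant_data.count_cols())  # number of samples that were combined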

The warning is about the batch size, not the number of input GVCF files. The batch size controls how many GVCF files are combined in one iteration of the combiner. I believe we’re attempting to defend against Spark’s task-graph size limitations. You shouldn’t need to take any action. If you want to silence the warning, pass batch_size=50 to new_combiner.
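
For example, a sketch reusing the arguments from the script above (the keyword name follows the suggestion here; check the new_combiner documentation for your Hail version):

combiner = hl.vds.new_combiner(reference_genome='GRCh38',
                               temp_path='tmp',
                               gvcf_paths=gvcfs,
                               output_path=vds_path,
                               use_genome_default_intervals=True,
                               batch_size=50)  # smaller batch to stay under the task-count limit
combiner.run()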

6 posts were split to a new topic: I cannot export a VDS created with the VDS combiner as a VCF