Yes, definitely. The VCF combiner / VDS combiner took off a bit faster than we anticipated! I expect some basic docs to be up before the end of the week.
Thanks @dking! I just tried new_combiner and I'm getting the following warning, and then the job suddenly stops. Any suggestions?
Running on Apache Spark version 3.1.2
SparkUI available at http://scc-hadoop.bu.edu:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.97-ccc85ed41b15
LOGGING: writing to /restricted/projectnb/kageproj/gatk/combiner/combiner.log
n gvcfs: 150
vds:/project/kageproj/dragen/kage.gvcf.chr22.vds
2022-09-21 13:20:25 Hail: INFO: Coerced sorted dataset (0 + 1) / 1]
2022-09-21 13:20:40 Hail: WARN: generated combiner save path of tmp/combiner-plans/vds-combiner-plan_fef8d273ee43de7affbc317e5237cef1fa4af84bda70d343a40dc1841d971c42_0.2.97.json
2022-09-21 13:20:42 Hail: WARN: gvcf_batch_size of 100 would produce too many tasks using 58 instead
I decreased the number of GVCFs from ~3200 to 150 and I still get this warning, and then the job stops. Here is the relevant part of my script:
import hail as hl

# chr is the chromosome label (e.g. '22'), set earlier in the script
path_to_input_list = '/project/kageproj/gvcf.rb/combiner_' + chr + '.list'  # a file with one GVCF path per line

gvcfs = []
with hl.hadoop_open(path_to_input_list, 'r') as f:
    for line in f:
        gvcfs.append(line.strip())
print('n gvcfs: ' + str(len(gvcfs)))

vds_path = '/project/kageproj/dragen/kage.gvcf.chr' + chr + '.vds'  # output destination
print('vds:' + vds_path)

temp_bucket = '/project/kageproj/tmp'  # bucket for storing intermediate files
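
For reference, here is a sketch of the hl.vds.new_combiner call these variables feed into; the post doesn't show that part of the script, so the keyword arguments (in particular the interval option) are assumptions rather than the code as actually run:

combiner = hl.vds.new_combiner(
    output_path=vds_path,       # destination VDS
    temp_path=temp_bucket,      # scratch space for intermediate files
    gvcf_paths=gvcfs,           # the GVCF paths read from the list file above
    use_genome_default_intervals=True,  # assumed; any valid interval setting could be used here
)
combiner.run()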
The warning is about the “batch size”, not the number of input GVCF files. The batch size controls how many GVCF files are combined in one iteration of the combiner. I believe we’re attempting to defend against Spark’s task graph size limitations. You shouldn’t need to take any action. If you want to hide the warning, specify batch_size=50 to new_combiner.
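
For example, with the variables from the snippet above, that would look roughly like this (the interval option is again just an assumption):

combiner = hl.vds.new_combiner(
    output_path=vds_path,
    temp_path=temp_bucket,
    gvcf_paths=gvcfs,
    use_genome_default_intervals=True,
    batch_size=50,  # combine at most 50 GVCFs per round, so the warning is no longer emitted
)
combiner.run()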