I cannot export a VDS created with the VDS combiner as a VCF

Thanks! I then tried to convert the VDS to a matrix table and then a VCF with the steps below. During the VCF export, an "Invalid type for format field 'gvcf_info'" error was raised (see below). How should this error be dealt with? Is there a better approach to creating a VCF from a VDS?

import os
import sys
import hail as hl

chr = sys.argv[1]
hl.init(log='/restricted/projectnb/kageproj/gatk/combiner/combiner.log', tmp_dir='/project/kageproj/tmp')
os.system("hdfs dfs -ls -C /project/kageproj/gvcf.rb/chr" + chr + "/*.rb.g.vcf.gz > combiner_" + chr + ".list")
os.system("hdfs dfs -put -f combiner_" + chr + ".list /project/kageproj/gvcf.rb")

path_to_input_list = '/project/kageproj/gvcf.rb/combiner_' + chr + '.list'  # a file with one GVCF path per line
print(path_to_input_list)
gvcfs = []
with hl.hadoop_open(path_to_input_list, 'r') as f:
    for line in f:
        gvcfs.append(line.strip())
print("n gvcfs: " + str(len(gvcfs)))
vds_path = "/project/kageproj/dragen/kage.gvcf.chr" + chr + ".vds"  # output destination
print("vds:" + vds_path)
temp_bucket = '/project/kageproj/tmp'  # bucket for storing intermediate files

combiner = hl.vds.new_combiner(reference_genome='GRCh38',
                               temp_path=temp_bucket,
                               gvcf_paths=gvcfs,
                               output_path=vds_path,
                               use_genome_default_intervals=True)
combiner.run()
vds = hl.vds.read_vds(vds_path)
print("before split")
vds_split = hl.vds.split_multi(vds)
print("after split")
mt = hl.vds.to_dense_mt(vds_split)
print("after to dense")
mt.count()
mt.describe()
mt.summarize()
mt.show(10)
mt.write("/project/kageproj/mt/gard.chr" + chr + ".raw.mt")
mt = hl.read_matrix_table("/project/kageproj/mt/gard.chr" + chr + ".raw.mt")
print(mt.count())
mt.describe()
mt.show()
hl.export_vcf(mt, "/project/kageproj/pVCF/gard.chr" + chr + ".raw.vcf.bgz")

During the export, the following error occurred:

showing top 10 rows
showing the first 0 of 3449 columns
2022-09-22 10:43:43 Hail: WARN: export_vcf: ignored the following fields:
    'a_index' (row)
    'was_split' (row)
Traceback (most recent call last):
  File "/restricted/projectnb/kageproj/gatk/vds.pipeline/./check_vds.py", line 44, in <module>
    hl.export_vcf(mt,"/project/kageproj/pVCF/gard.chr"+chr+".raw.vcf.bgz")
  File "<decorator-gen-...>", line 2, in export_vcf
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/methods/impex.py", line 554, in export_vcf
    Env.backend().execute(ir.MatrixWrite(dataset._mir, writer))
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 104, in execute
    self._handle_fatal_error_from_backend(e, ir)
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/backend.py", line 181, in _handle_fatal_error_from_backend
    raise err
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec)
  File "/share/pkg.7/spark/3.1.2/install/spark-3.1.2-bin-scc-spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 31, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: HailException: Invalid type for format field 'gvcf_info'. Found 'struct{AC: array<int32>, AF: array<float64>, AN: int32, AS_BaseQRankSum: array<float64>, AS_FS: array<float64>, AS_InbreedingCoeff: array<float64>, AS_MQ: array<float64>, AS_MQRankSum: array<float64>, AS_QD: array<float64>, AS_QUALapprox: array<int32>, AS_RAW_BaseQRankSum: str, AS_RAW_MQ: array<float64>, AS_RAW_MQRankSum: array<tuple(float64, int32)>, AS_RAW_ReadPosRankSum: array<tuple(float64, int32)>, AS_ReadPosRankSum: array<float64>, AS_SB_TABLE: array<array<int32>>, AS_SOR: array<float64>, AS_VarDP: array<int32>, BaseQRankSum: float64, DRAGstrInfo: array<int32>, DRAGstrParams: array<float64>, ExcessHet: float64, FS: float64, InbreedingCoeff: float64, MQ: float64, MQRankSum: float64, MQ_DP: int32, QD: float64, QUALapprox: int32, RAW_GT_COUNT: array<int32>, RAW_MQandDP: array<int32>, ReadPosRankSum: float64, SOR: float64, VarDP: int32}'.

Java stack trace:
is.hail.utils.HailException: Invalid type for format field 'gvcf_info'. Found 'struct{AC: array<int32>, AF: array<float64>, AN: int32, AS_BaseQRankSum: array<float64>, AS_FS: array<float64>, AS_InbreedingCoeff: array<float64>, AS_MQ: array<float64>, AS_MQRankSum: array<float64>, AS_QD: array<float64>, AS_QUALapprox: array<int32>, AS_RAW_BaseQRankSum: str, AS_RAW_MQ: array<float64>, AS_RAW_MQRankSum: array<tuple(float64, int32)>, AS_RAW_ReadPosRankSum: array<tuple(float64, int32)>, AS_ReadPosRankSum: array<float64>, AS_SB_TABLE: array<array<int32>>, AS_SOR: array<float64>, AS_VarDP: array<int32>, BaseQRankSum: float64, DRAGstrInfo: array<int32>, DRAGstrParams: array<float64>, ExcessHet: float64, FS: float64, InbreedingCoeff: float64, MQ: float64, MQRankSum: float64, MQ_DP: int32, QD: float64, QUALapprox: int32, RAW_GT_COUNT: array<int32>, RAW_MQandDP: array<int32>, ReadPosRankSum: float64, SOR: float64, VarDP: int32}'.
at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:17)
at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:17)
at is.hail.utils.package$.fatal(package.scala:78)
at is.hail.io.vcf.ExportVCF$.$anonfun$checkFormatSignature$1(ExportVCF.scala:91)
at is.hail.io.vcf.ExportVCF$.$anonfun$checkFormatSignature$1$adapted(ExportVCF.scala:85)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at is.hail.io.vcf.ExportVCF$.checkFormatSignature(ExportVCF.scala:85)
at is.hail.expr.ir.MatrixVCFWriter.lower(MatrixWriter.scala:411)
at is.hail.expr.ir.WrappedMatrixWriter.lower(MatrixWriter.scala:54)
at is.hail.expr.ir.lowering.LowerTableIR$.apply(LowerTableIR.scala:679)
at is.hail.expr.ir.lowering.LowerToCDA$.lower(LowerToCDA.scala:73)
at is.hail.expr.ir.lowering.LowerToCDA$.apply(LowerToCDA.scala:18)
at is.hail.expr.ir.lowering.LowerToDistributedArrayPass.transform(LoweringPass.scala:77)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:27)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:67)
at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:15)
at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:13)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:416)
at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:452)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:70)
at is.hail.utils.package$.using(package.scala:640)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:70)
at is.hail.utils.package$.using(package.scala:640)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:59)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:310)
at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:449)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:448)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)

Hail version: 0.2.97-ccc85ed41b15
Error summary: HailException: Invalid type for format field 'gvcf_info'. Found 'struct{AC: array<int32>, AF: array<float64>, AN: int32, AS_BaseQRankSum: array<float64>, AS_FS: array<float64>, AS_InbreedingCoeff: array<float64>, AS_MQ: array<float64>, AS_MQRankSum: array<float64>, AS_QD: array<float64>, AS_QUALapprox: array<int32>, AS_RAW_BaseQRankSum: str, AS_RAW_MQ: array<float64>, AS_RAW_MQRankSum: array<tuple(float64, int32)>, AS_RAW_ReadPosRankSum: array<tuple(float64, int32)>, AS_ReadPosRankSum: array<float64>, AS_SB_TABLE: array<array<int32>>, AS_SOR: array<float64>, AS_VarDP: array<int32>, BaseQRankSum: float64, DRAGstrInfo: array<int32>, DRAGstrParams: array<float64>, ExcessHet: float64, FS: float64, InbreedingCoeff: float64, MQ: float64, MQRankSum: float64, MQ_DP: int32, QD: float64, QUALapprox: int32, RAW_GT_COUNT: array<int32>, RAW_MQandDP: array<int32>, ReadPosRankSum: float64, SOR: float64, VarDP: int32}'.

VCFs cannot have structure-typed fields. In particular, you can’t represent something like:

{"some_struct_entry_field": {
    "a_nested_entry_field": 123, 
    "another_nested_entry_field": 456
 },
 "GT": "0/1"
}

You can .drop("gvcf_info") to throw away all that information or you can flatten the structure by lifting all the fields to the top level:

mt = mt.select_entries(**mt.entry.flatten())
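
For the first option, the drop is a one-liner against your mt:

mt = mt.drop('gvcf_info')  # discard the per-sample gvcf_info struct entirely before export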

I should warn you that generating a dense project VCF will create a file much larger than the VDS! That's fine if it's what you need, but expect it to be very large, particularly as your sample sizes get into the hundreds of thousands.

@dking @tpoterba So the original gVCF INFO field is stored in each sample's entry fields in the VDS. When creating the gVCFs, I included allele-specific annotations (see Allele-specific annotation and filtering of germline short variants). I would like to eventually recreate the VCF that the GATK GenotypeGVCFs tool generates, with the allele-specific annotations in the INFO field. I have already run the GATK pipeline on 3,200 samples to compare the results.

As the number of samples increases, so does the proportion of sites with multiallelic variants. The GATK VQSR step was modified to QC specific alleles instead of the whole site, so for a site with two alt alleles, one may pass and the other may not. Previously, VQSR assigned a pass to the site itself and did not distinguish the quality of the individual alt alleles; as a result, multiallelic variants had poorer quality.

For the allele-specific annotations in the VDS entry gvcf_info struct, would the allele-specific fields get split correctly if the fields were moved to the top level of the entry and the VDS were split with vds_split = hl.vds.split_multi(vds)?

My initial attempt to export the VCF with the allele-specific annotations resulted in the following error. I used

mt = mt.select_entries(**mt.entry.flatten())

to move the gvcf_info fields to the top level. It is having trouble with gvcf_info.AS_RAW_MQRankSum, found as array<tuple(float64, int32)>.

The original gVCF declares the AS_RAW_MQRankSum format as a String, not a tuple, so I am not sure why it appears in the mt as an array<tuple(float64, int32)>.

##INFO=<ID=AS_RAW_MQRankSum,Number=1,Type=String,Description="Allele-specfic raw data for Mapping Quality Rank Sum">
chr1    10329   .       AC      A,<NON_REF>     41.60   .       AS_QUALapprox=|49|0;AS_RAW_BaseQRankSum=|-0.4,1|1.4,1;AS_RAW_MQ=13383.00|5864.00|585.00;AS_RAW_MQRankSum=|-0.6,1|-1.9,1;AS_RAW_ReadPosRankSum=|0.6,1|1.4,1;AS_SB_TABLE=0,16|1,8|3,5;AS_VarDP=16|9|0;BaseQRankSum=0.436;DP=36;MQRankSum=-1.450;QUALapprox=49;RAW_GT_COUNT=0,1,0;RAW_MQandDP=22003,36;ReadPosRankSum=1.167;VarDP=25        GT:AD:DP:GQ:PGT:PID:PL:PS:SB    0|1:16,9,0:33:37:0|1:10321_C_T:49,0,37,172,325,212:10321:0,16,4,13
chr1    10492   .       C       T,<NON_REF>     58.75   .       AS_QUALapprox=|66|0;AS_RAW_BaseQRankSum=|-0.2,1|0.0,1;AS_RAW_MQ=441.00|1686.00|49.00;AS_RAW_MQRankSum=|-0.6,1|-0.7,1;AS_RAW_ReadPosRankSum=|1.4,1|0.6,1;AS_SB_TABLE=0,1|5,1|0,1;AS_VarDP=1|6|0;BaseQRankSum=-0.157;DP=8;MQRankSum=-0.674;QUALapprox=66;RAW_GT_COUNT=0,1,0;RAW_MQandDP=2176,8;ReadPosRankSum=1.534;VarDP=7        GT:AD:DP:GQ:PL:SB       0/1:1,6,0:8:6:66,0,6,166,32,97:0,1,5,2
Entry fields:
    'GT': call
    'DP': int32
    'GQ': int32
    'RGQ': int32
    'gvcf_info.AC': array<int32>
    'gvcf_info.AF': array<float64>
    'gvcf_info.AN': int32
    'gvcf_info.AS_BaseQRankSum': array<float64>
    'gvcf_info.AS_FS': array<float64>
    'gvcf_info.AS_InbreedingCoeff': array<float64>
    'gvcf_info.AS_MQ': array<float64>
    'gvcf_info.AS_MQRankSum': array<float64>
    'gvcf_info.AS_QD': array<float64>
    'gvcf_info.AS_QUALapprox': array<int32>
    'gvcf_info.AS_RAW_BaseQRankSum': str
    'gvcf_info.AS_RAW_MQ': array<float64>
    'gvcf_info.AS_RAW_MQRankSum': array<tuple (
        float64,
        int32
    )>
    'gvcf_info.AS_RAW_ReadPosRankSum': array<tuple (
        float64,
        int32
    )>
    'gvcf_info.AS_ReadPosRankSum': array<float64>
    'gvcf_info.AS_SB_TABLE': array<array<int32>>
    'gvcf_info.AS_SOR': array<float64>
    'gvcf_info.AS_VarDP': array<int32>
    'gvcf_info.BaseQRankSum': float64
    'gvcf_info.DRAGstrInfo': array<int32>
    'gvcf_info.DRAGstrParams': array<float64>
    'gvcf_info.ExcessHet': float64
    'gvcf_info.FS': float64
    'gvcf_info.InbreedingCoeff': float64
    'gvcf_info.MQ': float64
    'gvcf_info.MQRankSum': float64
    'gvcf_info.MQ_DP': int32
    'gvcf_info.QD': float64
    'gvcf_info.QUALapprox': int32
    'gvcf_info.RAW_GT_COUNT': array<int32>
    'gvcf_info.RAW_MQandDP': array<int32>
    'gvcf_info.ReadPosRankSum': float64
    'gvcf_info.SOR': float64
    'gvcf_info.VarDP': int32
    'GP': array<float64>
    'MIN_DP': int32
    'PG': array<float64>
    'PID': str
    'PS': int32
    'SB': array<int32>
    'PGT': call
    'AD': array<int32>
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------
2022-09-30 09:07:14 Hail: WARN: export_vcf: ignored the following fields:
    'a_index' (row)
    'was_split' (row)
Traceback (most recent call last):
  File "/restricted/projectnb/kageproj/gatk/vds.pipeline/./vds_combiner.py", line 46, in <module>
    hl.export_vcf(mt,"/project/kageproj/pVCF/gard.chr"+chr+".raw.vcf.bgz")
  File "<decorator-gen-1338>", line 2, in export_vcf
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/methods/impex.py", line 554, in export_vcf
    Env.backend().execute(ir.MatrixWrite(dataset._mir, writer))
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 104, in execute
    self._handle_fatal_error_from_backend(e, ir)
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/backend.py", line 181, in _handle_fatal_error_from_backend
    raise err
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec)
  File "/share/pkg.7/spark/3.1.2/install/spark-3.1.2-bin-scc-spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/share/pkg.7/hail/0.2.97/install/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 31, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: HailException: Invalid type for format field 'gvcf_info.AS_RAW_MQRankSum'. Found 'array<tuple(float64, int32)>'.

Java stack trace:
is.hail.utils.HailException: Invalid type for format field 'gvcf_info.AS_RAW_MQRankSum'. Found 'array<tuple(float64, int32)>'.
        at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:17)
        at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:17)
        at is.hail.utils.package$.fatal(package.scala:78)
        at is.hail.io.vcf.ExportVCF$.$anonfun$checkFormatSignature$1(ExportVCF.scala:91)
        at is.hail.io.vcf.ExportVCF$.$anonfun$checkFormatSignature$1$adapted(ExportVCF.scala:85)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
        at is.hail.io.vcf.ExportVCF$.checkFormatSignature(ExportVCF.scala:85)
        at is.hail.expr.ir.MatrixVCFWriter.lower(MatrixWriter.scala:411)
        at is.hail.expr.ir.WrappedMatrixWriter.lower(MatrixWriter.scala:54)
        at is.hail.expr.ir.lowering.LowerTableIR$.apply(LowerTableIR.scala:679)
        at is.hail.expr.ir.lowering.LowerToCDA$.lower(LowerToCDA.scala:73)
        at is.hail.expr.ir.lowering.LowerToCDA$.apply(LowerToCDA.scala:18)
        at is.hail.expr.ir.lowering.LowerToDistributedArrayPass.transform(LoweringPass.scala:77)
        at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:27)
        at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:67)
        at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
        at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
        at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
        at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
        at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
        at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
        at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
        at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
        at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
        at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:15)
        at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:13)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
        at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
        at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
        at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:416)
        at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:452)
        at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:70)
        at is.hail.utils.package$.using(package.scala:640)
        at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:70)
        at is.hail.utils.package$.using(package.scala:640)
        at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
        at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:59)
        at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:310)
        at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:449)
        at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
        at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:448)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:750)

Hey @jjfarrell !

This appears to be a limitation in export_vcf. Let me take a closer look and get back to you.

Hey @jjfarrell !

OK, so, the issue is that the vds_combiner is trying too hard to be helpful. It’s pretty clear that:

AS_RAW_MQRankSum=|-0.6,1|-1.9,1

Should be interpreted as an array of tuples of floats and ints, not an array of strings. However, obviously, VCF doesn’t support that. The VDS combiner opts to produce the most useful format for you when working within the VDS-world, but this creates an issue when you try to round-trip back to VCF. We’re gonna take some time to think about the right thing to do here. Clearly, VDS users want arrays of pairs, but we don’t want bad error messages and a painful export experience either!

In the meantime, you can do this:

mt = mt.annotate_entries(
    gvcf_info=mt.gvcf_info.annotate(
        AS_RAW_MQRankSum=mt.gvcf_info.AS_RAW_MQRankSum.map(hl.delimit),
        AS_RAW_ReadPosRankSum=mt.gvcf_info.AS_RAW_ReadPosRankSum.map(hl.delimit),
    )
)

This just converts the arrays of pairs into arrays of strings where the pair is represented as "first,second".
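
If hl.delimit balks at tuple expressions in your Hail version, an explicit per-pair formatting does the same thing (a sketch, shown for one field):

mt = mt.annotate_entries(
    gvcf_info=mt.gvcf_info.annotate(
        AS_RAW_MQRankSum=mt.gvcf_info.AS_RAW_MQRankSum.map(
            # build "first,second" by hand from each (float64, int32) pair
            lambda t: hl.str(t[0]) + ',' + hl.str(t[1])
        )
    )
)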


So, all together, what you want is this:

mt = mt.annotate_entries(
    gvcf_info=mt.gvcf_info.annotate(
        AS_RAW_MQRankSum=mt.gvcf_info.AS_RAW_MQRankSum.map(hl.delimit),
        AS_RAW_ReadPosRankSum=mt.gvcf_info.AS_RAW_ReadPosRankSum.map(hl.delimit),
    )
)
mt = mt.select_entries(**mt.entry.flatten())
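
With the pairs stringified and the struct flattened, the export from your script should now get past the format-type check; the flattened names like gvcf_info.AS_RAW_MQRankSum become the FORMAT IDs in the output:

hl.export_vcf(mt, "/project/kageproj/pVCF/gard.chr" + chr + ".raw.vcf.bgz")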

@dking

Thanks for the workaround.

I think the bigger issue is that some of these gvcf_info AS_ fields are not being split correctly. Hail's vds.split_multi works fine when there is a single variant at the site, but when a multiallelic site is split with vds_split = hl.vds.split_multi(vds), these AS gvcf_info fields are not split across the resulting biallelic rows. A hypothetical workaround is sketched below.
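
Since split_multi does add the a_index row field (the 1-based index of the alt allele each biallelic row kept), one could imagine subsetting the Number=A arrays manually. This is only a sketch for a single field; it assumes the combiner's gvcf_info arrays are ordered by alt allele, which is exactly what I have not verified:

mt = mt.annotate_entries(
    gvcf_info=mt.gvcf_info.annotate(
        # keep only the retained alt allele's value for one Number=A field;
        # assumes the array is ordered by alt allele (unverified)
        AS_MQ=[mt.gvcf_info.AS_MQ[mt.a_index - 1]],
    )
)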

The information on which fields are per-allele or per-genotype is stored in the header (Number=A or G). If the VCF header INFO lines could be parsed into a Hail table, that information could be used to split the multiallelic fields as needed.

header table example:

field, Number, Type, Description
AS_BaseQRankSum, A, Float, allele specific Z-score from Wilcoxon rank sum test of each Alt vs. Ref base qualities

##INFO=<ID=AS_BaseQRankSum,Number=A,Type=Float,Description="allele specific Z-score from Wilcoxon rank sum test of each Alt Vs. Ref base qualities">
##INFO=<ID=AS_FS,Number=A,Type=Float,Description="allele specific phred-scaled p-value using Fisher's exact test to detect strand bias of each alt allele">
##INFO=<ID=AS_InbreedingCoeff,Number=A,Type=Float,Description="Allele-specific inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=AS_MQ,Number=A,Type=Float,Description="Allele-specific RMS Mapping Quality">
##INFO=<ID=AS_MQRankSum,Number=A,Type=Float,Description="Allele-specific Mapping Quality Rank Sum">
##INFO=<ID=AS_QD,Number=A,Type=Float,Description="Allele-specific Variant Confidence/Quality by Depth">
##INFO=<ID=AS_QUALapprox,Number=1,Type=String,Description="Allele-specific QUAL approximations">
##INFO=<ID=AS_RAW_BaseQRankSum,Number=1,Type=String,Description="raw data for allele specific rank sum test of base qualities">
##INFO=<ID=AS_RAW_MQ,Number=1,Type=String,Description="Allele-specfic raw data for RMS Mapping Quality">
##INFO=<ID=AS_RAW_MQRankSum,Number=1,Type=String,Description="Allele-specfic raw data for Mapping Quality Rank Sum">
##INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
##INFO=<ID=AS_ReadPosRankSum,Number=A,Type=Float,Description="allele specific Z-score from Wilcoxon rank sum test of each Alt vs. Ref read position bias">
##INFO=<ID=AS_SB_TABLE,Number=1,Type=String,Description="Allele-specific forward/reverse read counts for strand bias tests. Includes the reference and alleles separated by |.">
##INFO=<ID=AS_SOR,Number=A,Type=Float,Description="Allele specific strand Odds Ratio of 2x|Alts| contingency table to detect allele specific strand bias">
##INFO=<ID=AS_VarDP,Number=1,Type=String,Description="Allele-specific (informative) depth over variant genotypes -- including ref, RAW format">
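
Incidentally, Hail can already parse these header lines: hl.get_vcf_metadata returns the ##INFO/##FORMAT metadata as a dict, which could drive this kind of per-field split logic. A sketch (the gVCF path is a placeholder):

meta = hl.get_vcf_metadata('/project/kageproj/gvcf.rb/chr21/sample1.rb.g.vcf.gz')  # placeholder path
# e.g. meta['info']['AS_BaseQRankSum'] -> {'Description': '...', 'Number': 'A', 'Type': 'Float'}
per_allele_fields = [name for name, attrs in meta['info'].items() if attrs.get('Number') == 'A']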

Hey @jjfarrell !

Thanks for the note. I’ll talk to the team about preserving or surfacing the INFO metadata from VCFs.