Why does it take such a long time to do an operation as simple as counting the dimensions of a matrix table?

I’m doing this on Compute Canada with the resource allocation --ntasks=1 --cpus-per-task=8 --mem=250G. It has been running for a while, but I still haven’t seen a result.

    Singularity> python
    Python 3.6.9 (default, Nov 23 2019, 07:02:27) 
    [GCC 6.3.0 20170516] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import hail as hl
    >>> ds = hl.import_plink("/scratch/zhupy/ukb_imp_chr1_v3_pruned.bed","/scratch/zhupy/ukb_imp_chr1_v3_pruned.bim","/scratch/zhupy/ukb_imp_chr1_v3_pruned.fam")
    Initializing Hail with default parameters...
    2020-12-02 16:30:31 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    2020-12-02 16:30:33 WARN  Hail:37 - This Hail JAR was compiled for Spark 2.4.5, running with Spark 2.4.1.
      Compatibility is not guaranteed.
    Running on Apache Spark version 2.4.1
    SparkUI available at http://cdr860.int.cedar.computecanada.ca:4040
    Welcome to
         __  __     <>__
        / /_/ /__  __/ /
       / __  / _ `/ / /
      /_/ /_/\_,_/_/_/   version 0.2.60-de1845e1c2f6
    LOGGING: writing to /scratch/zhupy/hail-20201202-0830-0.2.60-de1845e1c2f6.log
    2020-12-02 16:30:48 Hail: INFO: Found 487409 samples in fam file.
    2020-12-02 16:30:48 Hail: INFO: Found 1220764 variants in bim file.
    >>> typeof(10)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'typeof' is not defined
    >>> type(10)
    <class 'int'>
    >>> type(ds)
    <class 'hail.matrixtable.MatrixTable'>
    >>> ds.count()
    [Stage 0:>                                                          (0 + 1) / 1]

The configuration you’re using is going to make this take longer. I’m assuming that --ntasks and --cpus-per-task are being passed through to the Spark configuration. The extra threads won’t help here (so --cpus-per-task=8 is wasting 7 of its 8 cores), and you’re manually setting 1 task, so the parallelism is 1-way. Try increasing the number of tasks a bit (e.g. --ntasks=8) and setting --cpus-per-task=1.
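
As a quick sanity check (a sketch, assuming the interactive session above is still open), you can ask Spark and Hail directly how much parallelism you actually got:

    # How many cores Spark believes it can schedule work on:
    sc = hl.spark_context()
    print(sc.defaultParallelism)

    # How many partitions the imported MatrixTable has. A progress bar
    # ending in "/ 1" means a single partition, i.e. one core of work
    # regardless of how many cores the job reserved.
    print(ds.n_partitions())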

Separately, we should have an optimizer rule that can statically compute the dimensions of a matrix coming from PLINK, but this isn’t the general solution.
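
As a workaround in the meantime: for standard PLINK files, the dimensions are just the line counts of the .bim (one line per variant) and .fam (one line per sample) files, which is exactly what import_plink reported in your log above, so you can recover the counts without running a Spark job at all. A minimal sketch (the function name is made up for illustration):

    # Compute the dimensions of a PLINK dataset from its text sidecar files,
    # assuming one variant per .bim line and one sample per .fam line.
    def plink_dimensions(bim_path, fam_path):
        with open(bim_path) as bim:
            n_variants = sum(1 for _ in bim)
        with open(fam_path) as fam:
            n_samples = sum(1 for _ in fam)
        return n_variants, n_samples

For your files this should return (1220764, 487409), matching the “Found … in fam/bim file” lines in the log above.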

I used another job configuration, salloc --time=24:00:00 --ntasks=10 --cpus-per-task=1 --mem=250G, but I got the error message below. Do you know what the reason is?

    >>> mt.head(10).show()
    [Stage 0:>                                                          (0 + 0) / 1]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<decorator-gen-1233>", line 2, in show
      File "/usr/local/lib/python3.6/site-packages/hail/typecheck/check.py", line 614, in wrapper
        return __original_func(*args_, **kwargs_)
      File "/usr/local/lib/python3.6/site-packages/hail/matrixtable.py", line 2628, in show
        handler(MatrixTable._Show(t, n_rows, actual_n_cols, displayed_n_cols, width, truncate, types))
      File "/usr/local/lib/python3.6/site-packages/IPython/core/display.py", line 282, in display
        print(*objs)
      File "/usr/local/lib/python3.6/site-packages/hail/matrixtable.py", line 2537, in __str__
        s = self.table_show.__str__()
      File "/usr/local/lib/python3.6/site-packages/hail/table.py", line 1294, in __str__
        return self._ascii_str()
      File "/usr/local/lib/python3.6/site-packages/hail/table.py", line 1320, in _ascii_str
        rows, has_more, dtype = self.data()
      File "/usr/local/lib/python3.6/site-packages/hail/table.py", line 1304, in data
        rows, has_more = t._take_n(self.n)
      File "/usr/local/lib/python3.6/site-packages/hail/table.py", line 1451, in _take_n
        rows = self.take(n + 1)
      File "<decorator-gen-1109>", line 2, in take
      File "/usr/local/lib/python3.6/site-packages/hail/typecheck/check.py", line 614, in wrapper
        return __original_func(*args_, **kwargs_)
      File "/usr/local/lib/python3.6/site-packages/hail/table.py", line 2121, in take
        return self.head(n).collect(_localize)
      File "<decorator-gen-1103>", line 2, in collect
      File "/usr/local/lib/python3.6/site-packages/hail/typecheck/check.py", line 614, in wrapper
        return __original_func(*args_, **kwargs_)
      File "/usr/local/lib/python3.6/site-packages/hail/table.py", line 1920, in collect
        return Env.backend().execute(e._ir)
      File "/usr/local/lib/python3.6/site-packages/hail/backend/py4j_backend.py", line 98, in execute
        raise e
      File "/usr/local/lib/python3.6/site-packages/hail/backend/py4j_backend.py", line 74, in execute
        result = json.loads(self._jhc.backend().executeJSON(jir))
      File "/usr/local/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "/usr/local/lib/python3.6/site-packages/hail/backend/py4j_backend.py", line 32, in deco
        'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
    hail.utils.java.FatalError: OutOfMemoryError: Java heap space

    Java stack trace:
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3181)
        at java.util.ArrayList.grow(ArrayList.java:267)
        at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:241)
        at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:233)
        at java.util.ArrayList.add(ArrayList.java:464)
        at com.esotericsoftware.kryo.util.MapReferenceResolver.nextReadId(MapReferenceResolver.java:51)
        at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:850)
        at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:780)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.read(DefaultArraySerializers.java:286)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.read(DefaultArraySerializers.java:258)
        at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
        at com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:36)
        at com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:23)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:391)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:302)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
        at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
        at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
        at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
        at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:391)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:302)
        at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
        at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:278)
        at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)
        at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
        at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
        at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
        at org.apache.spark.storage.BlockManager.maybeCacheDiskValuesInMemory(BlockManager.scala:1312)

    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
        at is.hail.sparkextras.ContextRDD.runJob(ContextRDD.scala:351)
        at is.hail.rvd.RVD$$anonfun$13.apply(RVD.scala:526)
        at is.hail.rvd.RVD$$anonfun$13.apply(RVD.scala:526)
        at is.hail.utils.PartitionCounts$.incrementalPCSubsetOffset(PartitionCounts.scala:73)
        at is.hail.rvd.RVD.head(RVD.scala:525)
        at is.hail.expr.ir.TableSubset$class.execute(TableIR.scala:1326)
        at is.hail.expr.ir.TableHead.execute(TableIR.scala:1332)
        at is.hail.expr.ir.TableRename.execute(TableIR.scala:2703)
        at is.hail.expr.ir.TableOrderBy.execute(TableIR.scala:2629)
        at is.hail.expr.ir.TableMapRows.execute(TableIR.scala:1845)
        at is.hail.expr.ir.TableOrderBy.execute(TableIR.scala:2629)
        at is.hail.expr.ir.TableSubset$class.execute(TableIR.scala:1324)
        at is.hail.expr.ir.TableHead.execute(TableIR.scala:1332)
        at is.hail.expr.ir.TableMapRows.execute(TableIR.scala:1845)
        at is.hail.expr.ir.Interpret$.run(Interpret.scala:819)
        at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
        at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
        at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
        at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
        at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at is.hail.expr.ir.InterpretNonCompilable$.rewriteChildren$1(InterpretNonCompilable.scala:25)
        at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:54)
        at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
        at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:67)
        at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
        at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
        at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
        at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
        at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
        at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
        at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
        at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:62)
        at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:14)
        at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:12)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
        at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
        at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
        at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:354)
        at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:338)
        at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:335)
        at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:25)
        at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:23)
        at is.hail.utils.package$.using(package.scala:618)
        at is.hail.annotations.Region$.scoped(Region.scala:18)
        at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:23)
        at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:247)
        at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:335)
        at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:379)
        at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:377)
        at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
        at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:377)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

    java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3181)
        at java.util.ArrayList.grow(ArrayList.java:267)
        at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:241)
        at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:233)
        at java.util.ArrayList.add(ArrayList.java:464)
        at com.esotericsoftware.kryo.util.MapReferenceResolver.nextReadId(MapReferenceResolver.java:51)
        at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:850)
        at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:780)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.read(DefaultArraySerializers.java:286)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.read(DefaultArraySerializers.java:258)
        at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
        at com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:36)
        at com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:23)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:391)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:302)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
        at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
        at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
        at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
        at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:391)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:302)
        at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
        at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:278)
        at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)
        at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
        at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
        at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
        at org.apache.spark.storage.BlockManager.maybeCacheDiskValuesInMemory(BlockManager.scala:1312)

    Hail version: 0.2.60-de1845e1c2f6
    Error summary: OutOfMemoryError: Java heap space

I think Tim was mistaken. It looks like you’re using a standard SLURM-style cluster. Hail doesn’t have native support for this type of cluster, and it can’t scale as large or run as fast in this environment as it does on Amazon or Google.


If I understand correctly, you intend to submit a single job to the compute cluster that starts Python and executes a Hail script.

The main issue you’re facing is that your PLINK file appears to be imported as a single partition (notice the (0 + 1) / 1 progress bar: one partition in total). This means you’re not benefiting from any parallelism; you’re only using one CPU core. Can you share the whole script that you submitted to the cluster? In what file format is your data stored? Some file formats are not splittable, which means you’re stuck with a single core. In that case, you should import your data once and write it out in Hail MatrixTable format, as sketched below; all subsequent analyses will then be fast.
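
Here is a rough sketch of that one-time conversion, reusing the paths from your session above (the .mt output path is an assumption):

    import hail as hl

    hl.init()

    # One-time conversion: import the PLINK data (slow, single partition)...
    mt = hl.import_plink(
        bed='/scratch/zhupy/ukb_imp_chr1_v3_pruned.bed',
        bim='/scratch/zhupy/ukb_imp_chr1_v3_pruned.bim',
        fam='/scratch/zhupy/ukb_imp_chr1_v3_pruned.fam',
    )
    # ...and write it out in Hail's native, partitioned MatrixTable format.
    mt.write('/scratch/zhupy/ukb_imp_chr1_v3_pruned.mt', overwrite=True)

    # Every later session reads the native format instead of re-importing:
    mt = hl.read_matrix_table('/scratch/zhupy/ukb_imp_chr1_v3_pruned.mt')
    print(mt.count())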


Also, be certain you are telling Hail how much memory it can use: SLURM’s --mem controls what the node allocates to your job, but the Spark JVM under Hail defaults to a small heap regardless, which would explain the OutOfMemoryError above.
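
A minimal sketch of one way to do that when Hail runs Spark in local mode; the 230g figure is an assumption, chosen to leave some headroom under your 250G SLURM allocation:

    import os

    # Must be set before Hail starts the JVM, i.e. before the first
    # hl.init() / Hail call in the process. 230g is an assumed value.
    os.environ['PYSPARK_SUBMIT_ARGS'] = (
        '--driver-memory 230g --executor-memory 230g pyspark-shell'
    )

    import hail as hl

    hl.init()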