Support for Hail accessing an AWS S3 bucket in Spark local mode

Hi all,
I am currently developing a Python package for GWAS pipelines using Hail. It works fine when I run it in Spark cluster mode, but when I use Hail in Spark local mode it throws the error below.

Is it not possible to read Hail matrix tables in Spark local mode?
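For context, this is roughly what the failing call boils down to (a simplified sketch, not the actual package code; the bucket and matrix table path are placeholders):

```python
# Simplified reproduction sketch; bucket/path names are placeholders.
import hail as hl

# Spark local mode
hl.init(master='local[*]')

# The existence check that fails with IllegalArgumentException: null
print(hl.hadoop_exists('s3a://my-bucket/cohorts/genotypes.mt'))
```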

Stack trace
Traceback (most recent call last):
  File "tests/test.py", line 118, in test_use_cohort_from_source
    self.assertTrue(expr=ct.output().mt.exists(), msg=f'{ct.output().mt} does not exist!')
  File "/home/users/ab904123/piranha_package/piranha/piranha/workflows/target.py", line 223, in exists
    return self.spark_state == 'complete'
  File "/home/users/ab904123/piranha_package/piranha/piranha/workflows/target.py", line 215, in spark_state
    if self.fs.exists(self.completeness_file):
  File "/home/users/ab904123/piranha_package/piranha/piranha/bmrn_luigi_ext/config.py", line 186, in wrapped
    return fn(*args, **kwargs)
  File "/home/users/ab904123/piranha_package/piranha/piranha/workflows/filesystem.py", line 17, in exists
    return hl.hadoop_exists(path)
  File "/bmrn/apps/hail/0.2.42/python/hail-0.2.42-py3-none-any.egg/hail/utils/hadoop_utils.py", line 128, in hadoop_exists
    return Env.fs().exists(path)
  File "/bmrn/apps/hail/0.2.42/python/hail-0.2.42-py3-none-any.egg/hail/fs/hadoop_fs.py", line 30, in exists
    return self._jfs.exists(path)
  File "/bmrn/apps/spark/2.4.5/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/bmrn/apps/hail/0.2.42/python/hail-0.2.42-py3-none-any.egg/hail/backend/spark_backend.py", line 41, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: IllegalArgumentException: null

Java stack trace:
java.lang.IllegalArgumentException: null
        at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1314)
        at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1237)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:280)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at is.hail.io.fs.HadoopFS.fileStatus(HadoopFS.scala:148)
        at is.hail.io.fs.FS$class.exists(FS.scala:114)
        at is.hail.io.fs.HadoopFS.exists(HadoopFS.scala:57)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

We’re not super familiar with AWS Spark tooling. Google provides a Cloud Storage connector for Hadoop/Spark (see the "Cloud Storage connector" page in the Dataproc documentation) which lets you access GS files in local mode. I’m assuming you have something similar installed for S3, or you’d be getting a “no file scheme exists for s3a …” error instead.
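For reference, this is very roughly what such a setup tends to look like for S3 in local mode; treat it as an untested sketch, since we don’t run this ourselves (the hadoop-aws version has to match the Hadoop build your Spark 2.4.5 ships with, and the keys/bucket are placeholders):

```python
# Untested sketch: make the S3A connector and credentials visible to a local-mode
# SparkContext, then hand that context to Hail.
import pyspark
import hail as hl

conf = (
    pyspark.SparkConf()
    .setMaster('local[*]')
    # hadoop-aws must match the Hadoop version bundled with your Spark build
    .set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:2.7.7')
    .set('spark.hadoop.fs.s3a.access.key', '<AWS_ACCESS_KEY>')
    .set('spark.hadoop.fs.s3a.secret.key', '<AWS_SECRET_KEY>')
)
sc = pyspark.SparkContext(conf=conf)
hl.init(sc=sc)

print(hl.hadoop_exists('s3a://my-bucket/cohorts/genotypes.mt'))
```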

It’s also possible that the S3 connector class fs.s3a.S3AFileSystem isn’t thread-safe – we parallelize listStatus inside HadoopFS.fileStatus.

You might try removing that par call in HadoopFS.fileStatus and rebuilding.