Hail2 Python to-do items

Necessary code work:

  • Add the rest of the core methods from VDS/KT to api2. #2591 does most of this for KT; order_by is the only outstanding KT method that’s not moved to table there. The same needs to be done for VDS, which isn’t too hard.
  • Add the non-core methods to hail.methods / hail.genetics.methods
    • Some of these are much harder than the rest, like filter_alleles
    • This is mostly just labor, but some methods require more thought than others, like moving TDT to use hail2 expr
  • Support intervals in the index_* methods. It’s possible now to join by locus, but not using the annotateLociTable fast path.
  • Move to Python 3 (3.6 or later) so that keyword argument order is preserved
  • Test the hail2 api much more rigorously than we do now (at the very least, call each parameter branch for each method!)
  • Typecheck the expression language. This isn’t super trivial, and making a nice system to integrate our typecheck module and expressions will require some thoughtful design work.
  • Some more organization around the package: monkey patching with import hail.genetics is an idea I like, but I want to think about the edge cases first.
  • Implement history for hail2
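
To illustrate the Python 3 item above: since Python 3.6 (PEP 468), **kwargs keeps the order in which keyword arguments were passed, which is what a method taking named expressions relies on. The sketch below uses a hypothetical stand-in function named annotate; it is not the api2 method, only a demonstration of the language behaviour.

```python
def annotate(**named_exprs):
    """Hypothetical stand-in for a method that takes named expressions."""
    # On Python 3.6+ (PEP 468), named_exprs iterates in call order.
    # On Python 2 the order here was arbitrary.
    return list(named_exprs)


# The field names come back in the order they were written at the call site.
fields = annotate(z=1, a=2, m=3)
print(fields)  # ['z', 'a', 'm'] on Python 3.6+
```

On Python 2 the same call could return the names in any order, so the order of new fields would not match the order the user wrote.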


Necessary documentation work:

  • Document the index_* methods / joins
  • Translate the Hail Overview tutorial
  • Make new tutorials to replace the two expr tutorials we have
  • Fill in docs on api2 methods (they’re not all there yet)
  • Fill in docs on expression language (things like __mul__ on NumericExpression haven’t been documented)
  • Write “integrative docs” that provide how-tos for common types of workflows. Show the power of annotate / select / group_by/aggregate, etc.

Longer term QoL:

  • Move tests over to Python as much as possible. I looked at the linear regression suite and it can be moved entirely into Python without many problems.
  • Write a type parser in Python. The nested calls into the JVM for Type._from_java make the library feel extremely sluggish on teensy data.
  • Integrate RV with C/C++, so we can transmit data much more efficiently between Python and Java.
  • Rethink the expr language function registry, because many functions there can be implemented in terms of others in Python.
  • Add de novo back in


  • Add infoScore aggregator to Python
  • Improve PCA docs and add mean_center=True parameter
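
As a rough idea of what the pure-Python type parser item above could look like, here is a minimal recursive-descent sketch. The grammar is a simplified assumption loosely modelled on type strings such as Struct{a:Int32,b:Array[String]}; it is not Hail's actual type syntax, and the tuple-based output is only a placeholder for real type objects.

```python
import re

# Hypothetical, simplified grammar (an assumption, not Hail's real syntax):
#   type  := NAME | NAME '[' type ']' | 'Struct' '{' [field (',' field)*] '}'
#   field := NAME ':' type
TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*|[\[\]{}:,])")
PUNCT = set("[]{}:,")


def tokenize(text):
    """Split a type string into name and punctuation tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _expect(tokens, pos, expected):
    """Check that tokens[pos] is the expected punctuation and step past it."""
    if pos >= len(tokens) or tokens[pos] != expected:
        raise ValueError("expected %r at token %d" % (expected, pos))
    return pos + 1


def _name(tokens, pos):
    """Read a name token (anything that is not punctuation)."""
    if pos >= len(tokens) or tokens[pos] in PUNCT:
        raise ValueError("expected a name at token %d" % pos)
    return tokens[pos], pos + 1


def _parse(tokens, pos):
    """Parse one type starting at tokens[pos]; return (type, next position)."""
    name, pos = _name(tokens, pos)
    if name == "Struct":
        pos = _expect(tokens, pos, "{")
        fields = []
        while pos < len(tokens) and tokens[pos] != "}":
            if fields:
                pos = _expect(tokens, pos, ",")
            field_name, pos = _name(tokens, pos)
            pos = _expect(tokens, pos, ":")
            field_type, pos = _parse(tokens, pos)
            fields.append((field_name, field_type))
        pos = _expect(tokens, pos, "}")
        return ("Struct", fields), pos
    if pos < len(tokens) and tokens[pos] == "[":
        element, pos = _parse(tokens, pos + 1)
        pos = _expect(tokens, pos, "]")
        return (name, element), pos
    return name, pos


def parse_type(text):
    """Parse a whole type string without any calls into the JVM."""
    tokens = tokenize(text)
    result, pos = _parse(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing tokens: %r" % tokens[pos:])
    return result


parsed = parse_type("Struct{a:Int32,b:Array[String]}")
```

The point of the sketch is only that a parser like this runs entirely in Python, so the cost of building a nested type no longer scales with the number of JVM round trips.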

Fix the names of generated annotations everywhere to follow Python conventions (underscores, not camel case).
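
One way to audit or migrate the existing names is a small converter from camel case to underscores. The helper below is a hypothetical sketch, not existing Hail code; apart from infoScore, the example names are made up for illustration.

```python
import re


def to_snake_case(name):
    """Convert a camelCase annotation name to snake_case (hypothetical helper)."""
    # Put an underscore between a lowercase letter or digit and a capital.
    step = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Split a run of capitals from a following capitalised word,
    # e.g. "PCAScores" -> "PCA_Scores".
    step = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", step)
    return step.lower()


print(to_snake_case("infoScore"))  # info_score
```

Names that are already in underscore form pass through unchanged, so the helper can be run over every generated annotation name to list the ones that still need renaming.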

Sphinx style:

Use :class: and :meth: rather than :py:class: and :py:meth:.
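Use short, dot-prefixed identifiers where possible: :meth:`.linreg` instead of :meth:`hail.methods.linreg`. (The leading dot tells Sphinx to look the name up by suffix across modules, so the full path is not needed.)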
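
A docstring written in this style might look like the sketch below. The function summarize and the reference to a class named Table are placeholders chosen for illustration; only the role syntax is the point.

```python
def summarize(table):
    """Summarize a table (hypothetical function, for style illustration only).

    Takes a :class:`.Table` and returns a short description of it.
    See also :meth:`.linreg`.
    """
    return "summary of %r" % (table,)


# The docstring uses the short roles and dot-prefixed targets throughout.
doc = summarize.__doc__
```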