Lifting over methods from 0.1 to 0.2

One of the high-priority agenda items right now is completing the Hail 0.2 API. We’ve all been working super hard over the last few months, and we have a chance to get all of that work into the hands of beta-testers very soon. A few (brave) users are already pushing on devel, and as soon as the API is complete and stable we can start bringing more people in.

The generic interface is getting pretty complete, but the majority of the statistical and genetics methods are conspicuously absent. We need to change that! To that end, I’m farming out the method liftover task to the group. I think this post has what you’ll need to do that, but let me know if something is unclear so I can add more info.

Organization

In Hail 0.2, the Table and MatrixTable APIs are totally generic. That means no genetics-specific stuff goes there, which has an added benefit of giving us an excuse not to have the über-object we had in 0.1.

Instead, genetics methods will go in a separate subpackage called methods. The user will not write something like:

vds = vds.variant_qc().sample_qc()

but instead

ds = methods.variant_qc(ds)
ds = methods.sample_qc(ds)

Although this is certainly more verbose, other design decisions have already guaranteed that the big pipelines of chained method calls are dead, though in some cases smaller pipelines will still be possible.

Numpy style docs

As part of the liftover to Hail 0.2, we’re rewriting our Python docs in Numpy style. This docstring style is much more readable in plain text (like when someone uses the built-in IPython help), and the Sphinx plugin napoleon renders it into the same HTML docs. You’ll need to pip install sphinxcontrib-napoleon if you want to build the docs locally.

Good examples of Numpy style docs are available here. You can also see examples by looking at the Hail code already lifted over.

Part of readability is keeping lines short enough (<= 80 characters) to print in reasonably-sized terminals. I wrote a JavaScript tool to make it easier to format doc paragraphs. PRs with long doc lines will be rejected!
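If you’d rather not use the JavaScript tool, the same reflow is easy to sketch in Python with the standard library’s textwrap (the reflow helper here is just for illustration, not part of Hail):

```python
import textwrap

def reflow(paragraph, width=80):
    """Rewrap a doc paragraph to at most `width` columns.

    Illustrative helper, not part of Hail: collapses internal
    whitespace, then re-wraps the text.
    """
    return textwrap.fill(" ".join(paragraph.split()), width=width)

long_line = ("This docstring paragraph is far too long to print nicely in a "
             "reasonably-sized terminal, so we rewrap it before committing.")
wrapped = reflow(long_line)
assert all(len(line) <= 80 for line in wrapped.splitlines())
```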

Other documentation liftover tasks

Our numeric types were renamed (Double => Float64, etc.), and few of the 0.1 methods were updated in devel. You’ll probably need to lift over the types if your method includes annotation docs.

Ensure that the language is as generic as possible: use row / column / entry instead of variant / sample / genotype where possible. This may be a bit confusing for users in the short term, but having everything documented consistently will be better in the long term.

In general, we’re using ‘field’ instead of ‘annotation’ now. A 0.1 “variant annotation” is now a “row field”, and so on.
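As a quick reference, the terminology liftover can be sketched as a plain mapping (this dict is illustrative, not part of the codebase):

```python
# 0.1 "annotation" terminology -> 0.2 "field" terminology
TERM_LIFTOVER = {
    "variant annotation": "row field",
    "sample annotation": "column field",
    "genotype annotation": "entry field",
}

assert TERM_LIFTOVER["variant annotation"] == "row field"
```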

Steps

  1. Decide where the method should live in hail.methods. The files qc.py, statgen.py,
    and family_methods.py currently exist. Organization within methods doesn’t really matter; feel free to add a new file if these don’t fit.
  2. Update the method signature and code. Instead of returning a VariantDataset or KeyTable,
    the method should return a MatrixTable or Table.
  3. Add the @require_biallelic decorator if necessary.
  4. Update docs. This is where I expect most of the time to be spent. You’ll need to:
    a. Translate the doc to Numpy style
    b. Follow the doc liftover tasks above
  5. Update tests in hail.methods.tests.py. Look at the Scala test suite for your method – is it easy to rewrite in Python using the Table/MatrixTable primitives? If so, do it! If not, port
    over any tests from hail.api1.tests.py, and maybe add a few trivial examples.
  6. Add the method to __all__ in hail.methods.__init__.py (don’t forget to change the import too).
  7. Add the method to docs/methods/index.rst. Alternatively, if someone wants to figure out whether Sphinx can do this automatically (a few hours spent trying to get .. automodule:: to work turned up nothing), that would be awesome!
  8. Important: make a judgment call about whether the method is sufficiently generic, or if it is a candidate for more work. Amanda’s work to make PCA generic by having it take an expression for the numeric matrix entry was a huge win – can we do this to your method too? If yes, please communicate that to the group but feel free to leave it for a separate PR (unless it’s easy or you’re excited).
  9. Also important: communicate any feedback about interface pain points, learning barriers, or inflexibilities to the group!
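Put together, a lifted method might look like the following skeleton (the Hail decorators and imports are omitted so the sketch is self-contained; the body is a placeholder, not the real implementation):

```python
def variant_qc(dataset):
    """Compute common variant-level quality-control metrics.

    Parameters
    ----------
    dataset : :class:`.MatrixTable`
        Dataset to annotate.

    Returns
    -------
    :class:`.MatrixTable`
        Dataset with a new row field containing the metrics.
    """
    # The real method would do the work here and return a MatrixTable;
    # this skeleton only illustrates the Numpy-style docstring layout.
    raise NotImplementedError
```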

Assignments

I didn’t randomly assign each method, because last time that left Jackie with all the most time-consuming ones. Instead, I grouped the methods into chunks I thought would be roughly the same difficulty (trying to keep methods like grm/rrm/etc together, so our docs look consistent for similar methods), and randomly assigned those chunks (aside from reassigning Jon after he got all the regressions).

Cotton

  • filter_alleles[_hts]
  • annotate_alleles_expr[_hts]

Jackie

  • mendel errors
  • rename_duplicates
  • concordance
  • vep
  • nirvana

Jon

  • tdt (will take more time than others, because you’ll need to lift over 0.1 expr to 0.2)
  • export_gen
  • export_plink
  • export_vcf
  • export_solr
  • export_cassandra

Patrick

  • reorder_samples
  • lmmreg
  • logreg
  • sample_variants
  • KeyTable.maximal_independent_set
  • skat

Amanda

  • grm
  • ibd
  • rrm
  • pcrelate
  • KeyTable.import_interval_list
  • KeyTable.import_bed
  • KeyTable.import_fam

Dan

  • split_multi[_hts]
  • ld_prune
  • imputesex
  • to_pandas (should remain a Table instance method)
  • to_dataframe (should remain a Table instance method, but maybe call it to_spark?)

Meredith

  • Hail overview tutorial

Make sure you refer to classes as :class:`.MatrixTable` etc. with the period before the class name. Otherwise, it won’t link properly!

Also, please make sure to double-check the docs preview when reviewing PRs. It can be found in TeamCity under Artifacts; click on index.html to open the preview.

Be wary of .. math:: sections. When moving the linreg docs, I found that LaTeX names like \beta, which rendered just fine before, now cause a strange error (SEVERE: Unexpected section title.). You can check whether this is the cause by deleting the block and compiling the docs again. The reason some names don’t need the doubled backslash seems to be that their first letter does not form a valid escape character.

The Sphinx docs say: “Keep in mind that when you put math markup in Python docstrings read by autodoc, you either have to double all backslashes, or use Python raw strings (r"raw").”

Indeed, using two backslashes \\beta fixed the issue.
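A raw-string docstring sidesteps the problem entirely; this toy function (not a real Hail method) shows the idea:

```python
def linreg_doc_example():
    r"""Toy docstring containing math markup.

    Because this docstring is a raw string (note the ``r`` prefix),
    ``\beta`` below reaches Sphinx with its backslash intact; in a
    normal string you would have to write ``\\beta``.

    .. math::

        y = X \beta + \varepsilon
    """
    return None
```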

Annotating a variant-keyed MatrixTable with interval-keyed data still doesn’t have a great solution in api2; e.g., the examples

>>> bed = KeyTable.import_bed('data/file1.bed')
>>> vds_result = vds.annotate_variants_table(bed, root='va.cnvRegion')

>>> bed = KeyTable.import_bed('data/file2.bed')
>>> vds_result = vds.annotate_variants_table(bed, root='va.cnvID')

can’t yet be implemented in api2. I’ve commented them out for the time being to be added back in once that part is more fleshed out.

Sphinx style:

  • Use :class: and :meth: rather than :py:class: and :py:meth:.
  • Use relative identifiers where possible: :meth:`.linreg` instead of :meth:`hail.methods.linreg`.
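For example, a docstring in a lifted method would use the short forms (this function and its references are illustrative):

```python
def example_method(dataset):
    """Illustrative docstring using relative Sphinx references.

    Returns `dataset`, a :class:`.MatrixTable`, unchanged. Prefer the
    short forms :class:`.MatrixTable` and :meth:`.linreg` over
    :class:`hail.api2.MatrixTable` and :meth:`hail.methods.linreg`;
    the leading period makes the cross-reference link properly.
    """
    return dataset
```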

Testing:
You can run a single test (or a whole module) at a time with:

python -m unittest hail.api2.tests.MatrixTests.test_joins

Things @konradjk asked for so he can drop api1 (in other words, things that need to go into breaking before it gets merged in):

  • filter_intervals
  • hc.import_matrix (@wang)
  • docs for hc methods (@tpoterba can you farm these out)
  • annotate_alleles (@cseed)
  • impute_sex (@dking)

already done (or with pending PRs): to_pandas, to_spark (dataframe), ld_prune (stub, needs docs), min_rep, MatrixTable.from_rows_table.

There’s also filter_intervals, but I’m not sure what your plans are in that space. Otherwise looks good!

Assignments for method liftover:

Jon

balding_nichols_model

Patrick

import_plink
read_matrix
read_table

Jackie

import_bgen
import_gen

Dan

index_bgen
import_table

Amanda

get_vcf_metadata
grep


Directions

These methods all now live in the methods package, but call out to the Hail1 HailContext in their code and are undocumented. We need to a) put the necessary code inside the method so we can safely delete the HailContext, and b) port the docs. You know what to do!

Also, do document the sample and variant fields as produced fields – v and s are no longer assumed!

“put the necessary code inside the method so we can safely delete the HailContext”

I already did this in the noapi1 branch. I’ll pull it out into a separate PR. Just update the docs.

This PR updates the methods code: #2837. I’d already done get_vcf_metadata. @wang can you do the docs for filter_intervals? I have a PR for the code: #2836. Thanks all!