Lifting over methods from 0.1 to 0.2

tpoterba · January 9, 2018, 1:36pm

Lifting over methods from Hail 0.1 to Hail 0.2

One of the high-priority agenda items right now is completing the Hail 0.2 API. We’ve all been working super hard in the last months, and have a chance to get all of that work into the hands of some beta-testers very soon. There are already some (brave) users pushing on devel, and as soon as the API is complete and stable we can start pushing more people in.

The generic interface is getting pretty complete, but the majority of the statistical and genetics methods are conspicuously absent. We need to change that! To that end, I’m farming out the method liftover task to the group. I think this post has what you’ll need to do that, but let me know if something is unclear so I can add more info.

Organization

In Hail 0.2, the Table and MatrixTable APIs are totally generic. That means no genetics-specific stuff goes there, which has an added benefit of giving us an excuse not to have the über-object we had in 0.1.

Instead, genetics methods will go in a separate subpackage called methods. The user will not write something like:

vds = vds.variant_qc().sample_qc()

but instead

ds = methods.variant_qc(ds)
ds = methods.sample_qc(ds)

Although this is certainly more verbose, other design decisions have already guaranteed that the big pipelines of chained method calls are dead, though in some cases smaller pipelines will still be possible.

Numpy style docs

As part of the liftover to Hail 0.2, we’re rewriting our Python docs in Numpy style. This docstring style is much more readable in plaintext format (like when someone uses built in ipython help methods), and can be interpreted into the same html docs using the sphinx plugin napoleon. You’ll need to pip install sphinxcontrib-napoleon if you want to build the docs locally.

Good examples of Numpy style docs are available here. You can also see examples by looking at the Hail code already lifted over.

Part of readability is keeping lines short enough (<= 80 char) to print in reasonably-sized terminals. I wrote a javascript tool to make it easier to format doc paragraphs. PRs with long doc lines will be rejected!

Steps

Decide where the method should live in hail.methods. There currently exist files
qc.py, statgen.py, family_methods.py. Organization within methods doesn’t really matter; feel free to add a new file if these don’t fit.
Update the method signature and code. Instead of returning a VariantDataset or KeyTable,
the method should return a MatrixTable or Table.
Add the @require_biallelic decorator if necessary.
Update docs. This is where I expect most of the time to be spent. You’ll need to:
a. Translate the doc to Numpy style
b. Follow the doc liftover tasks above
Update tests in hail.methods.tests.py. Look at the Scala test suite for your method – is it easy to rewrite in Python using the Table/MatrixTable primitives? If so, do it! If not, port
over any tests from hail.api1.tests.py, and maybe add a few trivial examples.
Add the method to __all__ in hail.methods.__init__.py (don’t forget to change the import too).
Add the method to docs/methods/index.rst. Alternatively, if someone wants to figure out whether sphinx can do this automatically (a few hours spent trying to get ..automodule:: to work turned up nothing), that would be awesome!
Important: make a judgment call about whether the method is sufficiently generic, or if it is a candidate for more work. Amanda’s work to make PCA generic by having it take an expression for the numeric matrix entry was a huge win – can we do this to your method too? If yes, please communicate that to the group but feel free to leave it for a separate PR (unless it’s easy or you’re excited).
Also important: communicate any feedback about interface pain points, learning barriers, or inflexibilities to the group!

tpoterba · January 9, 2018, 1:43pm

Assignments

I didn’t randomly assign each method, because last time that left Jackie with all the most time-consuming ones. Instead, I grouped the methods into chunks I thought would be roughly the same difficulty (trying to keep methods like grm/rrm/etc together, so our docs look consistent for similar methods), and randomly assigned those chunks (aside from reassigning Jon after he got all the regressions).

Cotton

filter_alleles[_hts]
annotate_alleles_expr[_hts]

Jackie

mendel errors
rename_duplicates
concordance
vep
nirvana

Jon

tdt (will take more time than others, because you’ll need to lift over 0.1 expr to 0.2)
export_gen
export_plink
export_vcf
export_solr
export_cassandra

Patrick

reorder_samples
lmmreg
logreg
sample_variants
KeyTable.maximal_independent_set
skat

Amanda

grm
ibd
rrm
pcrelate
KeyTable.import_interval_list
KeyTable.import_bed
KeyTable.import_fam

Dan

split_multi[_hts]
ld_prune
imputesex
to_pandas (should remain a Table instance method)
to_dataframe (should remain a Table instance method, but maybe call it to_spark?)

Meredith

Hail overview tutorial

jigold · January 9, 2018, 9:32pm

Make sure you refer to classes as :class:`.MatrixTable` etc. with the period before the class name. Otherwise, it won’t link properly!

jigold · January 10, 2018, 4:46pm

Also, please make sure to double check the docs preview when reviewing PRs. They can be found in team city under Artifacts. Click on index.html to open the preview.

jbloom · January 11, 2018, 9:01pm

Be wary of ..math :: sections. I found when moving linreg docs that latex names like \beta that rendered just fine before now were causing a strange error (SEVERE: Unexpected section title.). You can check this is the case by deleting the block and compiling docs again. The reason some names don’t need the second backslash seems to be that the first letter is not a valid escape character.

The sphinx docs say: “Keep in mind that when you put math markup in Python docstrings read by autodoc, you either have to double all backslashes, or use Python raw strings (r"raw”).

Indeed, using two backslashes \\beta fixed the issue.

wang · January 16, 2018, 11:37pm

annotating a variant-keyed MatrixTable with Interval-keyed data still doesn’t have a great solution in api2, e.g. the examples

>>> bed = KeyTable.import_bed('data/file1.bed')
 >>> vds_result = vds.annotate_variants_table(bed, root='va.cnvRegion')

>>> bed = KeyTable.import_bed('data/file2.bed')
>>> vds_result = vds.annotate_variants_table(bed, root='va.cnvID')

can’t yet be implemented in api2. I’ve commented them out for the time being to be added back in once that part is more fleshed out.

tpoterba · January 18, 2018, 7:20pm

Sphinx style:

Use :class: and :meth: rather than :py:class: and :py:meth:.
Use relative identifiers where possible: :meth:`.linreg` instead of :meth:`hail.methods.linreg`

Testing:
you can test one module at a time with:

python -m unittest hail.api2.tests.MatrixTests.test_joins

cseed · January 28, 2018, 11:31pm

Things @konradjk asked for so he can drop api1 (in other words, things that need to go into breaking before it gets merged in):

filter_intervals
hc.import_matrix (@wang)
docs for hc methods (@tpoterba can you farm these out)
annotate_alleles (@cseed)
impute_sex (@dking)

already done (or with pending PRs): to_pandas, to_spark (dataframe), ld_prune (stub, needs docs), min_rep, MatrixTable.from_rows_table.

konradjk · January 28, 2018, 11:53pm

There’s also filter_intervals, but not sure what your plans are in that space? Otherwise looks good!

tpoterba · January 29, 2018, 12:22am

Assignments for method liftover:

Jon

balding_nichols_model

patrick

import_plink
read_matrix
read_table

jackie

import_bgen
import_gen

dan

index_bgen
import_table

Amanda

get_vcf_metadata
grep

Directions

These methods all now live in the methods package, but call out to the Hail1 HailContext in their code and are undocumented. We need to a) put the necessary code inside the method so we can safely delete the HailContext, and b) port the docs. You know what to do!

tpoterba · January 29, 2018, 12:25am

Also, do document the sample and variant fields as produced fields – v and s are no longer assumed!

cseed · January 29, 2018, 12:49am

put the necessary code inside the method so we can safely delete the HailContext

I already did this in the noapi1 branch. I’ll pull it out into a separate PR. Just update the docs.

cseed · January 29, 2018, 2:23am

This PR updates the methods code: #2837. I’d already done get_vcf_metadata. @wang can you do the docs for filter_intervals? I have a PR for the code: #2836. Thanks all!

Topic		Replies	Views
Hail2 Python to-do items	3	614	January 17, 2018
Hail2 python docs style	1	604	December 11, 2017
Python organizational update	0	658	February 13, 2018
Proposed changes to Python object hierarchy	21	1130	November 22, 2017
Hail2 Python interface discussion	21	1315	November 27, 2017

Lifting over methods from 0.1 to 0.2

Lifting over methods from Hail 0.1 to Hail 0.2

Organization

Numpy style docs

Other documentation liftover tasks

Steps

Assignments

Cotton

Jackie

Jon

Patrick

Amanda

Dan

Meredith

Jon

patrick

jackie

dan

Amanda

Directions

Related topics