See here: https://github.com/tpoterba/hail/blob/py-expr-2/python/hail2/dataset.py
Some thoughts
KeyTables
KeyTables have the following manipulation operations:
def annotate(**named_exprs):
def filter(expr):
def select(expr):
def transmute(**named_exprs): # not yet implemented
def group_by(*exprs, **named_exprs):
-> def aggregate(**named_exprs)
def aggregate(**named_exprs): # the new 'query'
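To make the chaining of group_by into aggregate concrete, here is a minimal toy sketch in plain Python. It is not the hail2 implementation: ToyTable and ToyGroupedTable are hypothetical names, rows are plain dicts, and ordinary callables stand in for the expression objects that the real methods would take.

```python
from collections import defaultdict


class ToyTable:
    """Hypothetical stand-in for a KeyTable: a list of dict rows."""

    def __init__(self, rows):
        self.rows = [dict(r) for r in rows]

    def annotate(self, **named_fns):
        # Add (or overwrite) one field per keyword, computed from each row.
        return ToyTable(
            {**r, **{name: fn(r) for name, fn in named_fns.items()}}
            for r in self.rows
        )

    def filter(self, pred):
        # Keep only the rows for which the predicate is true.
        return ToyTable(r for r in self.rows if pred(r))

    def group_by(self, **named_fns):
        # Returns an intermediate object whose only operation is aggregate.
        return ToyGroupedTable(self.rows, named_fns)

    def aggregate(self, **named_fns):
        # The "query" form: collapse the whole table to a dict of results.
        return {name: fn(self.rows) for name, fn in named_fns.items()}


class ToyGroupedTable:
    """Hypothetical result of group_by; aggregate yields a new table."""

    def __init__(self, rows, key_fns):
        self.key_names = list(key_fns)
        self.groups = defaultdict(list)
        for r in rows:
            key = tuple(fn(r) for fn in key_fns.values())
            self.groups[key].append(r)

    def aggregate(self, **named_fns):
        out = []
        for key, rows in self.groups.items():
            row = dict(zip(self.key_names, key))
            row.update({name: fn(rows) for name, fn in named_fns.items()})
            out.append(row)
        return ToyTable(out)


t = ToyTable([
    {"pop": "NFE", "gq": 30},
    {"pop": "NFE", "gq": 50},
    {"pop": "AFR", "gq": 40},
])
by_pop = t.group_by(pop=lambda r: r["pop"]).aggregate(
    mean_gq=lambda rows: sum(r["gq"] for r in rows) / len(rows)
)
summary = t.aggregate(n=len)
```

The point of the sketch is only the shape of the interface: group_by does not return a table, so the only thing a caller can do with its result is call aggregate, which produces a table again, while aggregate called directly on a table returns plain values.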
VariantDataset / Matrix
Matrices have the following manipulation operations:
def annotate_{rows, cols, entries}(**named_exprs):
def filter_{rows, cols, entries}(expr):
def select_{rows, cols, entries}(expr):
def transmute_{rows, cols, entries}(**named_exprs): # not yet implemented
def group_{rows, cols}_by(*exprs, **named_exprs): # not yet implemented
-> def aggregate(**named_exprs): # group_rows_by.aggregate produces a new Matrix
def aggregate_{rows, cols, entries}(**named_exprs): # the new 'query' methods. Could be removed.
To unify the KeyTable and Matrix interfaces, it seems like a good idea to purge references to “va”, “sa”, “g”, and “globals”. Instead, all fields in any annotation category are available in the top-level namespace:
vds = vds.annotate_rows(
mean_nfe_gq = f.mean(f.filter(vds.GQ, vds.pop == 'NFE')))
How does this work? Each field / expression knows its index. KeyTables only have one index, row, but matrices have two: row and column. In the above expression, GQ is indexed by [row, col], and pop is indexed by col. We should be able to use this information in combination with runtime analysis of the expression DAG / AST to generate nice error messages when undefined operations are invoked.
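One way the index bookkeeping could look is sketched below. This is an illustration under assumptions, not the hail2 code: Expr, aggregate_over, and check_indices are hypothetical names, and the rule that aggregating over an axis removes that axis from the result's indices is my reading of the example above.

```python
ROW, COL = "row", "col"


class Expr:
    """Hypothetical expression node that records the axes it is indexed by."""

    def __init__(self, text, indices=()):
        self.text = text
        self.indices = frozenset(indices)

    def _combine(self, op, other):
        # Literals carry no indices; a binary op is indexed by the union.
        if not isinstance(other, Expr):
            other = Expr(repr(other))
        return Expr(f"({self.text} {op} {other.text})",
                    self.indices | other.indices)

    def __add__(self, other):
        return self._combine("+", other)

    def __eq__(self, other):
        # Builds a new expression instead of comparing Python objects.
        return self._combine("==", other)


def aggregate_over(expr, axis):
    """Assumed rule: aggregating over an axis drops it from the indices."""
    return Expr(f"agg_{axis}({expr.text})", expr.indices - {axis})


def check_indices(expr, allowed, caller):
    """Raise a readable error if expr depends on an axis not available here."""
    extra = expr.indices - frozenset(allowed)
    if extra:
        raise ValueError(
            f"{caller}: expression {expr.text} is indexed by "
            f"{sorted(expr.indices)}, but only {sorted(allowed)} is allowed; "
            f"aggregate over {sorted(extra)} first"
        )
    return expr


GQ = Expr("GQ", {ROW, COL})   # one value per (row, col) entry
pop = Expr("pop", {COL})      # one value per column

cond = pop == "NFE"                      # indexed by col only
mean_gq = aggregate_over(GQ + cond, COL)  # stand-in for mean(filter(...))
check_indices(mean_gq, {ROW}, "annotate_rows")  # passes: indexed by row only

try:
    check_indices(GQ, {ROW}, "annotate_rows")   # GQ still varies by column
except ValueError as err:
    message = str(err)
```

Here GQ + cond merely stands in for combining the two fields inside the aggregation; the sketch tracks indices only and does not evaluate anything. Because every node carries its index set, the check runs when the expression is built, before any data is touched.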