Path to get rid of AST

I think we’re getting close to getting rid of AST. Things I see left:

  • Support for strings. Actually, I think these will work now if you do registerCode and take Longs pointing to (string) region values, and convert between region and JVM strings.

  • We need a way to capture types (for str, json). We could emit code instead, but I think capturing will be easier, and these operations aren’t particularly performance sensitive. Basically, I want to generate the following sort of code in Emit for types:

      var _t: Type = null
      def t: Type = { if (_t == null) _t = Type.parse("Struct { f: Float64, ... }"); _t }
    
  • The same thing for the reference genome, as JSON.

  • We need to generate comparison code for types in Emit.

  • We need support for dict and set. With deforestation, I think this is easy: dicts and sets are just (sorted) array region values. We need ToSet and ToDict, which take containers and sort them. Operations ArrayMap, ArrayFilter and ArrayFold can just be ContainerMap, ContainerFilter and ContainerFold. We need to implement SetContains and DictGet. I think that’s it.

  • We need to move everything out of FunctionRegistry. Most stuff can go as-is with the existing IRFunctionRegistry support. This includes region value aggregators that haven’t been implemented yet.

  • Remaining methods on MatrixTable and Table that use AST and don’t have IR alternatives need them. In many cases, these can be simplified by pushing functionality into Python and not evaluating expressions at all. These include:

    • annotateColsTable
    • selectCols
    • queryEntries, queryGlobal, queryCols, queryRows
    • maximalIndependentSet
    • the regression methods
    • filterEntries
    • annotateGlobals variants
    • Table.aggregate
    • MatrixTable.aggregateRowsByKey
    • ExportPlink (wip @jigold)
    • MatrixTable.groupColsBy
    • MatrixTable.makeKT
    • ibd maf
    • FilterAlleles
    • Table.aggregate (aggregate is overloaded, this is the group_by variant)
    • ExportGen (uses queryVA)
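
A minimal sketch of the dict and set item above, assuming plain Scala arrays stand in for sorted array region values. The names here (SortedContainers, toSet, toDict, setContains, dictGet) are hypothetical and only illustrate the idea, they are not the IR nodes themselves: ToSet and ToDict sort (and deduplicate) a container, and SetContains and DictGet become a binary search over the sorted array.

```scala
// Hypothetical illustration only: Array[Int] and Array[(Int, String)] stand in
// for sorted array region values; none of these names come from the codebase.
object SortedContainers {
  // Binary search over indices [0, n). `cmpAt(i)` compares the element at
  // index i to the target: negative if smaller, zero if equal, positive if larger.
  // Returns the index of a match, or -1 if the target is absent.
  private def search(n: Int, cmpAt: Int => Int): Int = {
    var lo = 0
    var hi = n - 1
    var found = -1
    while (found < 0 && lo <= hi) {
      val mid = lo + (hi - lo) / 2
      val c = cmpAt(mid)
      if (c == 0) found = mid
      else if (c < 0) lo = mid + 1
      else hi = mid - 1
    }
    found
  }

  // ToSet: drop duplicates, then sort.
  def toSet(xs: Array[Int]): Array[Int] = xs.distinct.sorted

  // ToDict: sort by key. Keeping the last value for a duplicate key is an
  // assumption made for this sketch.
  def toDict(kvs: Array[(Int, String)]): Array[(Int, String)] =
    kvs.toMap.toArray.sortBy(_._1)

  // SetContains: binary search on the sorted elements.
  def setContains(set: Array[Int], x: Int): Boolean =
    search(set.length, i => Integer.compare(set(i), x)) >= 0

  // DictGet: binary search on the sorted keys, then read the paired value.
  def dictGet(dict: Array[(Int, String)], k: Int): Option[String] = {
    val i = search(dict.length, j => Integer.compare(dict(j)._1, k))
    if (i >= 0) Some(dict(i)._2) else None
  }
}
```

Because the containers are already just arrays, map, filter and fold need no special handling in this sketch. Only construction (sorting) and lookup (binary search) are specific to sets and dicts.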

Finally, we need a way to build the IR from Python without going through the Parser/AST. I’ve already started that.

Just so we don’t duplicate work:

  • @wang is doing comparisons, infrastructure for capturing types and the genome reference, dict and set, and overseeing the FunctionRegistry conversion
  • @jigold is doing the MT.query* functions

If you take something on, please note it here.

I’m doing Table.annotateGlobals and MatrixTable.annotateGlobals.

I’ll do filterEntries. I want to learn how to implement it as an annotate node.

@tpoterba My bad, I did filterEntries this morning before seeing this: https://github.com/hail-is/hail/pull/3354 . I should have posted/checked here first. Sorry!

I’ll do export_gen and export_plink.

I’ll do regression. I may regret this.

I already did Table.aggregate with the interpreter stuff.

Doing annotateColsTable

In the process of working on ExportPlink, I put MatrixTable.colsTable into the IR as well as Table.export.

I am working on selectCols.

I’m working on MaximalIndependentSet.