I think we’re getting close to getting rid of the AST. Things I see left:
-
Support for strings. Actually, I think these will work now if you do registerCode and take Longs pointing to (string) region values, and convert between region and JVM strings.
-
We need a way to capture types (for str, json). We could also emit code, but I think this will be easier and these operations aren’t particularly performance sensitive. Basically, I want to generate the following sort of code in Emit for types:
var _t: Type = null
def t: Type = { if (_t == null) _t = Type.parse("Struct { f: Float64, ... }"); _t }
-
The same thing for the reference genome, as JSON.
-
We need to generate comparison code for types in Emit.
-
We need support for dict and set. With deforestation, I think this is easy: dicts and sets are just (sorted) array region values. We need ToSet and ToDict, which take containers and sort them. Operations ArrayMap, ArrayFilter and ArrayFold can just be ContainerMap, ContainerFilter and ContainerFold. We need to implement SetContains and DictGet. I think that’s it.
-
We need to move everything out of FunctionRegistry. Most stuff can go as-is with the existing IRFunctionRegistry support. This includes region value aggregators that haven’t been implemented yet.
-
Remaining methods on MatrixTable and Table that use AST and don’t have IR alternatives need them. In many cases, these can be simplified by pushing functionality into Python and not evaluating expressions at all. These include:
- annotateColsTable
- selectCols
- queryEntries, queryGlobal, queryCols, queryRows
- maximalIndependentSet
- the regression methods
- filterEntries
- annotateGlobals variants
- Table.aggregate
- MatrixTable.aggregateRowsByKey
- ExportPlink (wip @jigold)
- MatrixTable.groupColsBy
- MatrixTable.makeKT
- ibd maf
- FilterAlleles
- Table.aggregate (aggregate is overloaded, this is the group_by variant)
- ExportGen (uses queryVA)
Finally, we need a way to build the IR from Python without going through the Parser/AST. I’ve already started that.
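
To make the dict and set item above a bit more concrete, here is a minimal sketch of the sorted-array idea. It uses plain JVM arrays as a stand-in for region values, and the object name SortedContainers and its helper names are hypothetical, not existing Hail code. The point is only that ToSet and ToDict are a sort (plus dropping duplicates), and SetContains and DictGet are a binary search over the sorted result.

```scala
// Sketch only: plain arrays stand in for (sorted) array region values.
// SortedContainers and all method names here are hypothetical.
object SortedContainers {
  // ToSet: sort the container and drop duplicate elements.
  def toSet(xs: Array[Int]): Array[Int] = xs.sorted.distinct

  // ToDict: sort key/value pairs by key.
  // Assumption: for a duplicated key, the last value seen wins.
  def toDict(kvs: Array[(Int, String)]): Array[(Int, String)] =
    kvs.toMap.toArray.sortBy(_._1)

  // Binary search over indices 0 until n.
  // cmp(i) compares the element at index i to the target
  // (negative: element is smaller, zero: match, positive: larger).
  // Returns the matching index, or -1 if there is none.
  private def search(n: Int, cmp: Int => Int): Int = {
    var lo = 0
    var hi = n - 1
    while (lo <= hi) {
      val mid = (lo + hi) >>> 1
      val c = cmp(mid)
      if (c == 0) return mid
      else if (c < 0) lo = mid + 1
      else hi = mid - 1
    }
    -1
  }

  // SetContains: membership test on a sorted, deduplicated array.
  def setContains(set: Array[Int], x: Int): Boolean =
    search(set.length, i => Integer.compare(set(i), x)) >= 0

  // DictGet: look up a key in an array of pairs sorted by key.
  def dictGet(dict: Array[(Int, String)], k: Int): Option[String] = {
    val i = search(dict.length, j => Integer.compare(dict(j)._1, k))
    if (i >= 0) Some(dict(i)._2) else None
  }
}
```

With this layout, map, filter and fold over a set or dict are just the array versions applied to the sorted array, which is why ContainerMap, ContainerFilter and ContainerFold should be enough. Whether a missing key in DictGet should be a missing value or an error is left open here; the sketch returns an Option.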