This thread is to keep a list of ideas for the IR optimizer. Compare: Ideas for improving generated code.
-
After dead field elimination (which @tpoterba is working on), I think common subexpression elimination will have the next highest impact. @konradjk’s pipelines regularly slow massive amounts of duplicated code.
-
Push filters down. (AggFilter through AggMap: PRed in: https://github.com/hail-is/hail/pull/3429) Relational filters through maps and esp. joins.
-
Downward code motion. If an expresison is bound in a let in an outer context but only referenced in one part of a conditional, it can be moved into the conditional. It might be natural to do this as part of CSE.
-
Hoist loop invariants. Expressions inside Array* operations that don’t depend on the elements should be bound by a let outside.
-
Code hoisting. If you’re doing a per-element operation (that is, indexed by row and col axis), but there is a subexpression that only depends on the row, it should be compute first per-row. Same for global (invariant) expressions in table operations. As a simple case, we saw this as a problem with string constants (although that can be fixed in Python … by doing this to begin with).
-
InsertFields(MakeStruct(…), …) can be turned into a MakeStruct. (@tpoterba did you have this somewhere?)
-
replace simple regular expressions with String operations like startsWith, endsWith, contains, etc.
-
ArrayMap(MakeArray(…), …) should be simplified if the individual terms are constant-foldable. Same for ArrayFilter.
-
fuse Matrix/Table filter and maps if aggregators don’t get in the way. (Should filters even have aggregators?) Esp. important for MatrixMapCols.