- Benchmarks and examples
- Reference Interpreter + Random IR generation for testing
- Function registry
- GroupBy Aggregations
I’m working on the aggregators now; I hope to have them PR’ed by January 10th.
Dictionaries and Sets
The current implementation strategy for Dictionaries and Sets requires a notion of ordering. Unfortunately, the ordering on Variants requires the genome reference to be transferred to the worker nodes. My current plan is to capture the reference in the thunk sent to the worker nodes. At the worker nodes, the captured reference will be assigned to a field of the object holding the compiled expression bytecode. I do not know how many times a function used in an RDD.map is deserialized. Hopefully it is deserialized once per JVM; if so, only one genome reference will be allocated per Hail expression, per JVM.
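The per-JVM sharing described above can be sketched as follows. This is a hypothetical sketch, not Hail's actual API: the names (CompiledExpr, GenomeReference, refName) are illustrative, and it assumes the reference can be rebuilt from a serialized name via a per-JVM cache, so that even repeated deserialization allocates at most one reference per name per JVM.

```java
import java.io.Serializable;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a serializable compiled-expression holder that
// rebuilds its genome reference at most once per JVM, no matter how many
// times the closure itself is deserialized on the workers.
class CompiledExpr implements Serializable {
    // Illustrative stand-in for the real genome reference object.
    static class GenomeReference {
        final String name;
        GenomeReference(String name) { this.name = name; }
    }

    // One cache per JVM: two deserialized copies of the same expr share the entry.
    private static final ConcurrentHashMap<String, GenomeReference> perJvmCache =
        new ConcurrentHashMap<>();

    private final String refName;          // serialized with the closure
    private transient GenomeReference ref; // rebuilt lazily on each JVM

    CompiledExpr(String refName) { this.refName = refName; }

    GenomeReference reference() {
        if (ref == null)
            ref = perJvmCache.computeIfAbsent(refName, GenomeReference::new);
        return ref;
    }
}
```

Even if the thunk is deserialized once per task rather than once per JVM, the static cache keeps the allocation count at one per reference name.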
The physical strategy for sets is sorted arrays. We need a custom in-place sort implementation that sorts region values in a region (no allocation in the Java heap). We also need an indexing method. Binary search suffices, though for small arrays CS tribal knowledge says linear search is faster.
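A membership test along these lines might look like the following sketch on a plain int[] (the real version would index region values, not a Java array); the cutoff of 16 is a guess to be tuned by benchmarking.

```java
import java.util.Arrays;

// Sketch of set membership on a sorted array: linear scan below a small
// cutoff, binary search above it.
final class SortedIntSet {
    static final int LINEAR_CUTOFF = 16;  // assumed; tune with benchmarks

    static boolean contains(int[] sorted, int key) {
        if (sorted.length <= LINEAR_CUTOFF) {
            for (int v : sorted)
                if (v == key) return true;
                else if (v > key) return false;  // sorted: can stop early
            return false;
        }
        return Arrays.binarySearch(sorted, key) >= 0;
    }
}
```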
The physical strategy for dictionaries is probably also sorted arrays of structs. For this we need sort and search routines that look at a particular field of a struct.
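The key-field search can be sketched like this; Entry is an illustrative Java class standing in for what would be a region-value struct, and the binary search compares only the key field.

```java
// Sketch: a dict stored as a key-sorted array of (key, value) structs;
// lookup binary-searches on the key field only.
final class SortedDict {
    static final class Entry {
        final int key; final String value;
        Entry(int k, String v) { key = k; value = v; }
    }

    // entries must be sorted by key
    static String get(Entry[] entries, int key) {
        int lo = 0, hi = entries.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (entries[mid].key < key) lo = mid + 1;
            else if (entries[mid].key > key) hi = mid - 1;
            else return entries[mid].value;
        }
        return null;  // missing key
    }
}
```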
The ordering on all of our types needs to be staged so it can be used in the compiler. For primitive types this is trivial (e.g. emit a static call to
Integer.compare). This conversion is increasingly annoying for more complex types, and particularly annoying for types with a genome reference, as noted above.
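Unstaged, the compositional structure of these orderings looks like the following sketch: primitives bottom out in static calls like Integer.compare, and compound types recurse element-wise. The staged version would emit the same logic as bytecode rather than build Comparator objects.

```java
import java.util.Comparator;

// Sketch of compositional orderings mirroring what a staged compiler
// would emit: primitives use static compares, arrays are lexicographic.
final class Orderings {
    static final Comparator<Integer> INT = Integer::compare;

    // Lexicographic ordering on arrays, given an element ordering.
    static <T> Comparator<T[]> array(Comparator<T> elem) {
        return (a, b) -> {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                int c = elem.compare(a[i], b[i]);
                if (c != 0) return c;
            }
            // All shared elements equal: the shorter array sorts first.
            return Integer.compare(a.length, b.length);
        };
    }
}
```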
Strings
No particular plan. Store strings as arrays of 16-bit Unicode code units. Equality can be done in-region, and substring can be done in-region. Most interesting operations require allocation on the Java heap.
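In-region equality can be sketched as below; the buffer plays the role of a region, and the (offset, length-in-code-units) layout is illustrative, not Hail's actual string format.

```java
import java.nio.ByteBuffer;

// Sketch: strings stored as 16-bit code-unit arrays inside a buffer;
// equality compares in place, allocating no java.lang.String.
final class RegionStrings {
    // Compare the strings at (offA, lenA) and (offB, lenB) in buf.
    // Offsets are byte offsets; lengths are in 16-bit code units.
    static boolean equals(ByteBuffer buf, int offA, int lenA, int offB, int lenB) {
        if (lenA != lenB) return false;
        for (int i = 0; i < lenA; i++)
            if (buf.getChar(offA + 2 * i) != buf.getChar(offB + 2 * i))
                return false;
        return true;
    }
}
```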
Optimizer
First goal is a constant folding pass: convert expressions like 3 + 3 into 6. We should probably constant fold using known values of globals as well. Another classic compiler transformation is common subexpression elimination: find repeated expressions and lift them into a let. For example,
(x * x + 1) / (x * x)
becomes
let xx = x * x in (xx + 1) / xx
Another basic pass is dead variable elimination. If a variable is bound but never used, remove it entirely.
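Constant folding and dead variable elimination can be sketched on a toy IR (CSE omitted for brevity); the node classes here are illustrative, not Hail's actual IR.

```java
// Toy IR sketch of two of the passes above: constant folding and
// dead-let (dead variable) elimination.
abstract class Expr {}
final class Lit extends Expr { final int v; Lit(int v) { this.v = v; } }
final class Ref extends Expr { final String name; Ref(String n) { name = n; } }
final class Add extends Expr { final Expr l, r; Add(Expr l, Expr r) { this.l = l; this.r = r; } }
final class Let extends Expr {
    final String name; final Expr value, body;
    Let(String n, Expr v, Expr b) { name = n; value = v; body = b; }
}

final class Opt {
    static Expr fold(Expr e) {
        if (e instanceof Add) {
            Expr l = fold(((Add) e).l), r = fold(((Add) e).r);
            if (l instanceof Lit && r instanceof Lit)
                return new Lit(((Lit) l).v + ((Lit) r).v);  // 3 + 3 => 6
            return new Add(l, r);
        }
        if (e instanceof Let) {
            Let let = (Let) e;
            Expr body = fold(let.body);
            if (!uses(body, let.name))  // bound but never used: drop the let
                return body;
            return new Let(let.name, fold(let.value), body);
        }
        return e;  // Lit and Ref fold to themselves
    }

    static boolean uses(Expr e, String name) {
        if (e instanceof Ref) return ((Ref) e).name.equals(name);
        if (e instanceof Add) return uses(((Add) e).l, name) || uses(((Add) e).r, name);
        if (e instanceof Let) {
            Let let = (Let) e;
            return uses(let.value, name)
                || (!let.name.equals(name) && uses(let.body, name));  // shadowing
        }
        return false;
    }
}
```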
Benchmarks and Examples
Writing the optimizer blind is a bad idea. We need a way to measure our success, and to set a goal for our work. We should collect some examples from Konrad, Laurent, TJ, Mary, Xiao Li, etc. and distill them into benchmarks that we can use to measure our progress.
Reference Interpreter and Random Generation of IR
I mostly fleshed out a reference interpreter that operates on the Annotation representation of Hail values. Someone needs to think about how to generate random IRs in a reasonable fashion (not too big, not too small, and all generated expressions should be well-typed). Then we should just compare the interpreter and the compiler (with optimizations enabled) for equality on the same inputs.
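The shape of that differential test can be sketched like this. Everything here is illustrative: a direct tree interpreter stands in for the reference interpreter, and a compile-to-postfix stack evaluator stands in for the real compiler; bounded-depth generation keeps expressions neither too big nor too small, and everything generated is trivially well-typed (all int).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Random;

// Sketch of the proposed differential test: generate random expressions of
// bounded depth, then check that "interpreter" and "compiler" agree.
final class DiffTest {
    abstract static class E {}
    static final class Lit extends E { final int v; Lit(int v) { this.v = v; } }
    static final class Bin extends E {
        final char op; final E l, r;
        Bin(char op, E l, E r) { this.op = op; this.l = l; this.r = r; }
    }

    // Bounded-depth random generation.
    static E gen(Random rng, int depth) {
        if (depth == 0 || rng.nextInt(3) == 0) return new Lit(rng.nextInt(10));
        char op = rng.nextBoolean() ? '+' : '*';
        return new Bin(op, gen(rng, depth - 1), gen(rng, depth - 1));
    }

    static int interpret(E e) {  // the "reference interpreter"
        if (e instanceof Lit) return ((Lit) e).v;
        Bin b = (Bin) e;
        int l = interpret(b.l), r = interpret(b.r);
        return b.op == '+' ? l + r : l * r;
    }

    static void compile(E e, List<Object> code) {  // "compiler": emit postfix
        if (e instanceof Lit) { code.add(((Lit) e).v); return; }
        Bin b = (Bin) e;
        compile(b.l, code); compile(b.r, code); code.add(b.op);
    }

    static int run(List<Object> code) {  // stack machine executes the "bytecode"
        Deque<Integer> stack = new ArrayDeque<>();
        for (Object instr : code) {
            if (instr instanceof Integer) stack.push((Integer) instr);
            else {
                int r = stack.pop(), l = stack.pop();
                stack.push((Character) instr == '+' ? l + r : l * r);
            }
        }
        return stack.pop();
    }

    static boolean agree(long seed, int trials) {
        Random rng = new Random(seed);
        for (int i = 0; i < trials; i++) {
            E e = gen(rng, 4);
            List<Object> code = new ArrayList<>();
            compile(e, code);
            if (interpret(e) != run(code)) return false;
        }
        return true;
    }
}
```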
Function Registry
As Tim mentioned elsewhere on the dev forum, I think a lot of these can be implemented in python. I’d like to avoid repeating work that Tim does. We should convert fundamental functionality into IR primitives (a BinaryOp or
TernaryOp, as the case may be). Everything else should be implemented in python. An outstanding problem is higher-order functions, for example uniroot, which searches for a root of a user-supplied function. Adding true closures to the language is a possibility. We clearly need to be able to send said closures to JVM functions (so we need a shim that presents a
Function1[T, U] interface). We could also implement
uniroot ourselves. In that case, we don’t need true closures; we can add a form like:
ApplyHOTernaryPrimOp(op, arg1, arg2, name, body)
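If we implement uniroot ourselves, the user's body becomes the function argument to a JVM root finder. A sketch using bisection (the signature and the choice of bisection are assumptions for illustration; the iteration count is chosen to converge to machine precision):

```java
import java.util.function.DoubleUnaryOperator;

// Sketch of a self-implemented uniroot via bisection: finds a root of f in
// [lo, hi], assuming f(lo) and f(hi) have opposite signs.
final class Uniroot {
    static double uniroot(DoubleUnaryOperator f, double lo, double hi) {
        double flo = f.applyAsDouble(lo);
        if (flo == 0) return lo;
        for (int i = 0; i < 200; i++) {  // halve the bracket each iteration
            double mid = 0.5 * (lo + hi);
            double fmid = f.applyAsDouble(mid);
            if (fmid == 0) return mid;
            if ((flo < 0) == (fmid < 0)) { lo = mid; flo = fmid; }
            else hi = mid;
        }
        return 0.5 * (lo + hi);
    }
}
```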
Requiredness
The compiler needs to treat requiredness properly. In particular, if a type is required, we should not emit missingness checks, nor store any missingness bits.
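The codegen difference can be illustrated with two hand-written loops (missingness is modeled here as a parallel boolean array, which is only a stand-in for the real missingness bits):

```java
// Sketch of why requiredness matters: the optional version must branch on a
// missingness bit per element; the required version is straight-line code.
final class RequirednessDemo {
    // Optional elements: missingness modeled as a parallel bitset.
    static int sumOptional(int[] values, boolean[] missing) {
        int acc = 0;
        for (int i = 0; i < values.length; i++)
            if (!missing[i])       // per-element missingness check
                acc += values[i];
        return acc;
    }

    // Required elements: no missingness bits stored, no checks emitted.
    static int sumRequired(int[] values) {
        int acc = 0;
        for (int v : values) acc += v;
        return acc;
    }
}
```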
GroupBy Aggregations
We should be able to do something like:
gs.map(x => ...).group_by(x => sa.population, gs => gs.sum())
To achieve this, we must add a GroupByRegionValueAggregator, we must have dictionaries, and the GroupByRegionValueAggregator must be able to allocate new instances of the downstream Aggregator(s). A particular snag here is:
gs.group_by(x => sa.population, gs => gs.sum() + gs.sum())
So we need to recursively extract the aggregators and hand their constructors to the GroupByRegionValueAggregator.
I think the implementation strategy should be that the GroupByRegionValueAggregator stores the key in a local variable and sends a modified continuation to the downstream aggregators. The modified continuation checks the key variable to determine which instance of the downstream aggregator to use.
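That strategy can be sketched as follows: the extracted downstream constructor is a Supplier, and the key lookup in seq plays the role of the modified continuation selecting the per-key downstream instance. Names are illustrative, not Hail's RegionValueAggregator API, and a sum aggregator stands in for the arbitrary downstream aggregator(s).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch of a group-by aggregator: it holds a constructor for the downstream
// aggregator and keeps one instance per key, routing each value by key.
final class GroupByAggregator<K> {
    interface Aggregator { void seq(int value); int result(); }

    static final class SumAggregator implements Aggregator {
        private int acc = 0;
        public void seq(int value) { acc += value; }
        public int result() { return acc; }
    }

    private final Supplier<Aggregator> newDownstream;  // extracted constructor
    private final Map<K, Aggregator> perKey = new LinkedHashMap<>();

    GroupByAggregator(Supplier<Aggregator> newDownstream) {
        this.newDownstream = newDownstream;
    }

    void seq(K key, int value) {
        // The "modified continuation": dispatch on the key to pick (or
        // allocate) the downstream aggregator instance for this group.
        perKey.computeIfAbsent(key, k -> newDownstream.get()).seq(value);
    }

    Map<K, Integer> result() {
        Map<K, Integer> out = new LinkedHashMap<>();
        perKey.forEach((k, agg) -> out.put(k, agg.result()));
        return out;
    }
}
```

A usage mirroring gs.group_by(x => sa.population, gs => gs.sum()) would construct the aggregator with SumAggregator::new and feed it (population, value) pairs.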