Compiler Roadmap

  • Aggregators
  • Dictionaries
  • Sets
  • Strings
  • Optimizer
  • Benchmarks and examples
  • Reference Interpreter + Random IR generation for testing
  • Function registry
  • Requiredness
  • GroupBy Aggregations


I’m working on the aggregators now, I hope to have them PR’ed by January 10th.

Dictionaries and Sets

The current implementation strategy for Dictionaries and Sets requires a notion of ordering. Unfortunately, the ordering on Variants requires the genome reference to be transferred to the worker nodes. My current plan for that is to capture the reference in the thunk sent to the worker nodes. At the worker nodes the capture reference will be assigned to a field of the object holding the compiled expr byte code. I do not know how many times a function used in an is deserialized. Hopefully it is deserialized once per JVM. If such a function is deserialized once per JVM, then only one genome reference will be allocated per-hail-expr, per-JVM.

The physical strategy for sets is sorted arrays. We need a custom in-place sort implementation that sorts region values in a region (no allocation in the Java heap). We also need an indexing method. Binary search suffices, though for small arrays CS tribal knowledge says linear search is faster.

The physical strategy for dictionaries is probably also sorted arrays of structs. For this we need sort and search that looks at a particular field of a struct.

The ordering on all of our types needs to be staged so it can be used in the compiler. For the primitives types this is trivial (e.g. emit a static call to This conversion is increasingly annoying for more complex types. It’s particularly annoying for types with a genome reference as noted above.


No particular plan. Store as 16-bit unicode character point arrays. Do equality in-region. Substring can be done in-region. Most interesting operations require allocation to Java heap.


First goal is a constant folding pass. Convert things like 3 + 3 into 6. Probably should constant fold using known values of globals. Another classic compiler transformation is common sub-expression elimination. Find repeated expressions and lift them into a let. For example,

(x * x + 1) / x * x


let xx = x * x
in (xx + 1) / xx

Another basic pass is dead variable elimination. If a variable is bound but never used, remove it entirely.

Benchmarks and Examples

Writing the optimizer blind is a bad idea. We need a way to measure our success, and to set a goal for our work. We should collect some examples from Konrad, Laurent, TJ, Mary, Xiao Li, etc. and distill them into benchmarks that we can use to measure our progress.

Reference Interpreter and Random Generation of IR

I mostly fleshed out a reference interpreter which operates on the Annotation-representation of Hail values. Someone needs to think about how to generate random IRs in a reasonable fashion (not too big, not too small, all generated expressions should be well-typed). Then we should just compare the interpreter and the compiler (with optimizations enabled) for equality on the same input.

Function Registry

As Tim mentioned elsewhere on the dev forum, I think a lot of these can be implemented in python. I’d like to avoid repeating work that Tim does. We should convert fundamental functionality into UnaryOp or BinaryOp (or TernaryOp as the case may be). Everything else should be implemented in python. An outstanding problem is higher-order functions. For example, uniroot, which searches for a root of a user-supplied function. Adding true closures to the language is a possibility. We clearly need to be able to send said closures to JVM functions (so we need a shim that presents a Function1[T, U] interface). We could also implement uniroot ourselves. In that case, we don’t need true closures, we can add a form like:

ApplyHOTernaryPrimOp(op, arg1, arg2, name, body)


The compiler needs to treat requiredness properly. In particular, if a type is required, we should not emit missingness checks, nor store any missingness bits.

GroupBy Aggregations

We should be able to do something like: => ...).group_by(x => sa.population, gs => gs.sum())

To achieve this, we must add a GroupByRegionValueAggregator, we must have dictionaries, and the GroupByRegionValueAggregator must be able to allocate new instances of the downstream Aggregator(s). A particular snag here is:

gs.group_by(x => sa.population, gs => gs.sum() + gs.sum())

So we need to recursively extract aggregators and hand the constructors to GroupByAggregator.

I think the implementation strategy should be that GroupByAggregator stores the key in a local variable and sends a modified continuation to the downstream aggregators. The modified continuation checks the key variable to determine which instance of the downstream aggregator to use.


One point on comparing reference interpreter vs compiler. I saw this approach used in a previous job, and it was very effective in finding lots of bugs. It was taken a step further: once it hit a random query which showed a discrepancy between interpreter and compiler, it attempted to simplify the query in various ways, until it had an irreducible query showing the issue - which was then usually much simpler and thus easier to debug.