LowerTableIR and nested CDAs

Here is another perspective:

I’d call that a subquery (albeit a trivial one), and I’d say it should error because we don’t support subqueries. The user should explicitly use rbind to write a subquery-free pipeline:

hl.rbind(ht1.aggregate(hl.agg.count(), _localize=False),
  lambda ht1count: ht2.aggregate(hl.agg.count() + ht1count))

extracts these nested patterns as RelationalLet IRs

I’m confused about why these would be relational lets. I thought relational lets were only needed when pulling values out of “freestanding” matrix or table IR. The IR in lowering is always a value IR at the top level, so subqueries can be pulled up to the value level and bound in a normal let.

Proposal 2

We need to recompute requiredness for each of these nodes. This scales quadratically with the size of an IR.

I don’t understand why this is necessary. Just pass in the requiredness analysis recursively.
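To make "pass it in recursively" concrete, here is a minimal sketch on a toy IR. Everything in it is hypothetical (`Node`, `analyze`, `lower`, and the simplified rule that a node is required only if neither it nor any child can be missing); it is not the actual requiredness analysis or the LowerTableIR code. The point is only that the analysis runs once over the whole tree and the recursive lowering reads from that result instead of re-running it at each node.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)
class Node:
    """Hypothetical toy IR node; not a real Hail IR class."""
    name: str
    children: List["Node"] = field(default_factory=list)
    may_be_missing: bool = False


def analyze(root: Node) -> Dict[int, bool]:
    """One bottom-up pass that records a toy 'required' flag for every node."""
    result: Dict[int, bool] = {}

    def go(n: Node) -> bool:
        # Visit all children first so every node gets an entry.
        child_req = [go(c) for c in n.children]
        req = (not n.may_be_missing) and all(child_req)
        result[id(n)] = req
        return req

    go(root)
    return result


def lower(n: Node, req: Dict[int, bool]) -> str:
    """Toy 'lowering': the same analysis result is threaded through the recursion."""
    mark = "!" if req[id(n)] else "?"
    if not n.children:
        return f"{n.name}{mark}"
    args = ", ".join(lower(c, req) for c in n.children)
    return f"{n.name}{mark}({args})"


tree = Node("add", [Node("x"), Node("mul", [Node("y", may_be_missing=True), Node("z")])])
req = analyze(tree)        # computed once, up front
print(lower(tree, req))    # lowering only looks results up
```

In this sketch the analysis visits each node exactly once, regardless of how deep the lowering recursion goes.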

We can emit CDAs inside of loops, and we currently don’t have a code motion pass to clean this up.

I’m also confused by this. If you have a relational node in a loop, you need to run it on each iteration. Or do you mean a loop inside, e.g., a TableFilter? That would have to get pulled out.

To implement proposal 2, you need lowering to keep track of and return the set of relational nodes that have been pulled out so they can be placed in globals when you hit the value level again.
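A rough sketch of that bookkeeping, again on a made-up toy IR: `Lit`, `Add`, `Subquery`, `GlobalRef`, and this `lower` are all hypothetical names for illustration, not the real lowering API. Each call returns the rewritten expression together with the list of subqueries it pulled out, so the caller can bind them (for example in globals) once it is back at the value level.

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterator, List, Tuple, Union


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Subquery:
    # Stand-in for a relational node nested inside a value expression.
    body: "Expr"


@dataclass(frozen=True)
class GlobalRef:
    # Stand-in for a reference to a field placed in globals.
    name: str


Expr = Union[Lit, Add, Subquery, GlobalRef]
Binding = Tuple[str, Expr]


def lower(expr: Expr, fresh: Iterator[int]) -> Tuple[Expr, List[Binding]]:
    """Return (expression without Subquery nodes, pulled-out bindings)."""
    if isinstance(expr, Subquery):
        body, pulled = lower(expr.body, fresh)
        name = f"__sq{next(fresh)}"
        # Inner subqueries are listed first, so bindings are in dependency order.
        return GlobalRef(name), pulled + [(name, body)]
    if isinstance(expr, Add):
        left, pulled_left = lower(expr.left, fresh)
        right, pulled_right = lower(expr.right, fresh)
        return Add(left, right), pulled_left + pulled_right
    return expr, []


expr = Add(Subquery(Add(Lit(1), Subquery(Lit(2)))), Lit(3))
lowered, pulled = lower(expr, count())
print(lowered)
for name, body in pulled:
    print(name, "=", body)
```

Because the bindings come back in dependency order, the caller can evaluate them front to back before evaluating the lowered expression.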

These proposals essentially seem the same to me: either you pull things out beforehand, or you pull them out while you do the lowering. I’d probably do proposal 2 since LowerTableIR already has all the infrastructure to do this. You can just throw the lowered subqueries in the globals, right?