It would maybe be nice to be able to use/access multiple regions in the IR. I think Dan mentioned that it would be useful to have one writeable region and have the rest of the regions be read-only. One of the nice things about this is that it lets us do things like:
- avoid copying e.g. globals and sample annotations into a row to be able to access them within IR-generated functions
- handle annotate-type-operations and other things that involve changing the structure of a row by constructing the new row in a different region, instead of appending another RegionValue in the same region.
IR expressions don’t care about the physical representation of the value, and it would be nice to keep it that way. As it stands, keeping the one-writeable-region model, the only IRs that could possibly exist in different regions are In(…) and anything descended from it, e.g. ArrayRef(In(1), …).
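To make that concrete, here is a rough sketch of how the region of a node could be worked out statically. It uses a toy IR, not the real one: the node definitions, `MainRegion`, `regionOf`, and the map from input index to region index are all assumptions made up for illustration.

```scala
object RegionOfSketch {
  // Toy IR with just enough node types to illustrate the idea.
  sealed trait IR
  final case class In(i: Int) extends IR
  final case class I32(x: Int) extends IR
  final case class ArrayRef(a: IR, i: IR) extends IR
  final case class MakeStruct(fields: Seq[(String, IR)]) extends IR

  // Assumption for this sketch: index 0 is the single writeable region.
  val MainRegion = 0

  // Statically determine the region index a node's value lives in, given
  // the region index of each input.
  def regionOf(ir: IR, inputRegion: Map[Int, Int]): Int = ir match {
    case In(i) => inputRegion(i)
    // An element of an array lives wherever the array lives.
    case ArrayRef(a, _) => regionOf(a, inputRegion)
    // Newly constructed values always go in the writeable region.
    case MakeStruct(_) => MainRegion
    // Primitives are not stored in a region; report the main region.
    case I32(_) => MainRegion
  }
}
```

With `In(1)` mapped to region 2, `regionOf(ArrayRef(In(1), I32(0)), ...)` is 2, while a `MakeStruct` wrapping the same input is in the main region.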
I think we can, within Compile(), define the number of regions, a way to set the regions, and a way to specify which regions the inputs to the compiled functions (In(0), In(1), …) are coming from. Maybe it would look vaguely like this:
Compile(regions: Array[Region], in0, in1, ..., ret, bodyIR)
where in0 and in1 are either single arguments or groups of arguments that specify name, region, and type. I think ret would either use the main region by default or we could populate the return region as we generate the code for the function in Emit. (This is mostly relevant if we’re just using the IR to reference something in a read-only region.)
This is kind of nice because it lets us provide some regions and specify where we expect our inputs to come from, potentially check where we expect our return value to be, and then trust the compiler to get the rest of it right, as long as we gave it the correct regions and region mapping. On the generated code side, it’s also nice because the regions are known statically at compile time and we don’t have to generate code that needs to look up the region and potentially combine things from the same/different regions in different ways, so the generated code shouldn’t look very different except where we need to construct structs and arrays from values in different regions.
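A minimal sketch of what the region mapping could look like, using toy stand-ins rather than the real classes. `InputSpec`, `RegionLayout`, representing types as strings, and the convention that region 0 is the writeable one are all assumptions for illustration only.

```scala
object RegionSpecSketch {
  // Hypothetical description of one compiled-function input: In(i) has a
  // name, a type (just a string here), and the index of the region it lives in.
  final case class InputSpec(name: String, regionIdx: Int, typ: String)

  // Assumption for this sketch: region 0 is the single writeable region,
  // every other region is read-only.
  final case class RegionLayout(nRegions: Int, inputs: IndexedSeq[InputSpec]) {
    require(nRegions >= 1, "need at least the writeable region")
    inputs.foreach { in =>
      require(in.regionIdx >= 0 && in.regionIdx < nRegions,
        s"input ${in.name} refers to unknown region ${in.regionIdx}")
    }

    // Region that In(i) is read from, known statically at compile time.
    def regionOfInput(i: Int): Int = inputs(i).regionIdx

    def isWriteable(regionIdx: Int): Boolean = regionIdx == 0
  }
}
```

The point of validating the mapping up front is that everything after it can trust the indices, so no region lookup has to happen in generated code.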
Here’s what I think it means for how the function generating stuff would change:
FunctionBuilder:
- The FunctionBuilder object needs a way to specify and store multiple regions. I want to do this by creating settable fields on the class, and setting them in Compile (perhaps still having the main region passed in as arg0 to maintain as-is compatibility with srvb, but I kind of want to move that over to use MethodBuilder so that the main function in the FunctionBuilder can have more flexibility anyways).
- I think the return signature of the function can stay the same, since we can figure out what region the return value is in outside of the function itself.
Compile:
- Compile would need to take the regions, set them within the function builder, and then pass them to Emit.
- If we wanted to be real cute, we could change the representation of values at this level to be primitive values that the IR functions return directly and RegionValues, so you could pass those in to Compile and wrap the returned function to do the same. Compile would then have to deal with collecting distinct regions and tracking the regions that each input is in. Maybe we would still want to pass in the writeable region.
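The "collecting distinct regions" step could be sketched roughly as below. `Region` here is a toy class compared by reference identity, and `collectRegions` is a hypothetical helper, not something that exists today; the sketch also assumes the writeable region is still passed in explicitly and always ends up at index 0.

```scala
object CollectRegionsSketch {
  // Toy stand-in for a region; compared by reference identity, not contents.
  final class Region(val label: String)

  // Given the writeable region and the region each input lives in, return the
  // distinct regions (writeable region first, at index 0) together with the
  // region index of each input.
  def collectRegions(
    writeable: Region,
    inputRegions: IndexedSeq[Region]
  ): (IndexedSeq[Region], IndexedSeq[Int]) = {
    val distinct = scala.collection.mutable.ArrayBuffer[Region](writeable)
    val indices = inputRegions.map { r =>
      val existing = distinct.indexWhere(_ eq r)
      if (existing >= 0) existing
      else {
        distinct += r
        distinct.length - 1
      }
    }
    (distinct.toIndexedSeq, indices)
  }
}
```

Inputs that share a region get the same index, so something like a row in the main region plus globals and sample annotations in a second region would only need two region fields on the function builder.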
Emit:
Emit needs to know which regions all the In IRs come from, and deal with things accordingly. (This could be something like passing a map in as an argument.) Emit.emit would probably have to return 4 things: the three it currently returns (setup: Code[Unit], missing: Code[Boolean], and value: Code[_]) and also the region to which it belongs (Code[Region]).
It might be nice to bundle this into an actual class, like:
case class StagedIRValue(setup: Code[Unit],
  region: Code[Region],
  value: Code[_],
  missing: Code[Boolean])
which would maybe have the side effect of making things easier to read. We could also consider creating new LocalRefs to hold value and missing, since they tend to get stored immediately into LocalRefs anyways. For example, if we have emit return the following object:
case class StagedIRValue(setup: Code[Unit],
  region: Code[Region],
  value: LocalRef[_],
  missing: SettableBit) {
  // value if present, otherwise the supplied default
  def getOrElse(default: Code[_]): Code[_] = missing.mux(default, value)
}
ArrayRef would maybe look something like this:
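case ArrayRef(a, i, typ) =>
  val ti = typeToTypeInfo(typ)
  val tarray = TArray(typ)
  val va = emit(a)
  val vi = emit(i)
  // locals for the array offset and the index (assumed to be declared like this)
  val xa = mb.newLocal[Long]
  val xi = mb.newLocal[Int]
  val xmv = mb.newBit()
  val setup = Code(
    va.setup,
    vi.setup,
    xa := coerce[Long](va.getOrElse(defaultValue(tarray))),
    xi := coerce[Int](vi.getOrElse(defaultValue(TInt32()))),
    // missing if either input is missing or the element itself is undefined
    xmv := va.missing || vi.missing || !tarray.isElementDefined(va.region, xa, xi))
  // the element lives in the same region as the array it came from
  StagedIRValue(setup,
    va.region,
    va.region.loadIRIntermediate(typ)(tarray.loadElement(va.region, xa, xi)),
    xmv)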
But what happens when we need to pull things from multiple regions into one structure? For example, MakeStruct:
case x@MakeStruct(fields, _) =>
  val initializers = fields.map { case (_, v) => (v.typ, emit(v)) }
  val srvb = new StagedRegionValueBuilder(fb, x.typ)
  present(Code(
    srvb.start(init = true),
    Code(initializers.map { case (t, (dov, mv, vv)) =>
      Code(
        dov,
        mv.mux(srvb.setMissing(), srvb.addIRIntermediate(t)(vv)),
        srvb.advance()) }: _*),
    srvb.offset))
This would probably need some logic to copy over any struct fields that live in regions other than the main region. This might look like adding some things to srvb that allow us to do something like srvb.addIRIntermediate(t)(region, offset), which would do a deep copy into the region that is being built on.
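Roughly, the deep copy could behave like the toy model below. Nothing here is the real region or builder API: `Region`, `Value`, `deepCopy`, and `addField` are hypothetical names, and a region is modelled as an append-only buffer addressed by integer offsets.

```scala
import scala.collection.mutable.ArrayBuffer

object DeepCopySketch {
  // Toy value model: primitives are stored inline, structs store the offsets
  // of their fields within the same region.
  sealed trait Value
  final case class IntValue(v: Int) extends Value
  final case class StructValue(fieldOffsets: IndexedSeq[Int]) extends Value

  // Toy region: an append-only buffer addressed by integer offsets.
  final class Region {
    private val cells = ArrayBuffer.empty[Value]
    def append(v: Value): Int = { cells += v; cells.length - 1 }
    def load(offset: Int): Value = cells(offset)
    def size: Int = cells.length
  }

  // Copy the value at `offset` in `src` into `dst`, recursively copying any
  // nested fields, and return its offset in `dst`. `src` is never mutated.
  def deepCopy(src: Region, offset: Int, dst: Region): Int =
    src.load(offset) match {
      case p: IntValue => dst.append(p)
      case StructValue(fields) =>
        val copied = fields.map(f => deepCopy(src, f, dst))
        dst.append(StructValue(copied))
    }

  // Hypothetical analogue of srvb.addIRIntermediate(t)(region, offset):
  // reuse the offset when the field is already in the target region,
  // deep copy it otherwise.
  def addField(target: Region, fieldRegion: Region, offset: Int): Int =
    if (fieldRegion eq target) offset else deepCopy(fieldRegion, offset, target)
}
```

The read-only region is only ever loaded from, which fits the one-writeable-region model: every append goes to the region being built on.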
I originally got super turned around because I was thinking it might be nice to be able to write to all the regions, but Dan pointed out that a) we try not to mutate RegionValues and b) generally we only have one thing we’re trying to construct at a time.