Regions in the IR

It would be nice to have the ability to use/access multiple regions in the IR. I think Dan mentioned that it would be useful to have one writeable region and have the rest of the regions be read-only. One of the nice things about this is that it lets us do things like

  • avoid copying e.g. globals and sample annotations into a row to be able to access them within IR-generated functions
  • handle annotate-type-operations and other things that involve changing the structure of a row by constructing the new row in a different region, instead of appending another RegionValue in the same region.

IR expressions don’t care about the physical representation of the value, and it would be nice to keep it that way. As it stands, keeping the one-writeable-region model, the only IRs that could possibly exist in different regions are In(…) and anything descended from it, e.g. ArrayRef(In(1), …).

I think we can, within Compile(), define the number of regions, a way to set the regions, and a way to specify which regions the inputs to the compiled function (In(0), In(1), …) are coming from. Maybe it would look vaguely like this:

Compile(regions: Array[Region], in0, in1, ..., ret, bodyIR)

where in0 and in1 are either single arguments or groups of arguments that specify name, region, and type. I think ret would either use the main region by default or we could populate the return region as we generate the code for the function in Emit. (This is mostly relevant if we’re just using the IR to reference something in a read-only region).
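
For concreteness, here is one way the grouped arguments could be spelled. This is just a sketch: CompileInput and regionIndex are made-up names, not existing API, and it assumes Hail's Region, Type, and IR classes are in scope.

// Sketch only: CompileInput and regionIndex are hypothetical names.
case class CompileInput(name: String, typ: Type, regionIndex: Int)

// regions(0) could be the single writeable ("main") region, the rest read-only.
// inputs(i) describes In(i): its name, its type, and which of regions it lives in.
def Compile(regions: Array[Region],
            inputs: IndexedSeq[CompileInput],
            returnType: Type,
            body: IR): Any = ???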

This is kind of nice because it lets us provide some regions, specify where we expect our inputs to come from, potentially check where we expect our return value to be, and then trust the compiler to get the rest of it right, as long as we gave it the correct regions and region mapping. On the generated code side, it’s also nice because the regions are known statically at compile time, so we don’t have to generate code that looks up regions at runtime or combines things from the same/different regions in different ways. The generated code shouldn’t look very different, except where we need to construct structs and arrays from values in different regions.

Here’s what I think this means for how the function-generating machinery would change:

FunctionBuilder:

  • The FunctionBuilder object needs a way to specify and store multiple regions. I want to do this by creating settable fields on the class and setting them in Compile (perhaps still having the main region passed in as arg0 to maintain as-is compatibility with srvb, though I’d like to move that over to use MethodBuilder anyway so that the main function in the FunctionBuilder has more flexibility); see the sketch after this list.
  • I think the return signature of the function can stay the same, since we can figure out what region the return value is in outside of the function itself.
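
A minimal sketch of what those fields could look like, assuming the existing FunctionBuilder and Code/Region types; the field and method names here are made up:

class FunctionBuilder /* ...existing constructor arguments... */ {
  // Hypothetical additions: the regions are stored once, at compile time,
  // so emitted code can refer to the right region statically.
  private var _regions: Array[Code[Region]] = _

  def setRegions(rs: Array[Code[Region]]): Unit = { _regions = rs }

  def getRegion(i: Int): Code[Region] = _regions(i)
}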

Compile:

  • Compile would need to take the regions, set them within the function builder, and then pass them to Emit.
  • If we wanted to be real cute, we could change the representation of values at this level to be either primitive values (which the compiled functions return directly) or RegionValues; you could pass those in to Compile and wrap the returned function to do the same. Compile would then have to deal with collecting the distinct regions and tracking which region each input is in. Maybe we would still want to pass in the writeable region.
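
A rough sketch of that wrapping, assuming inputs arrive either as boxed primitives or as RegionValues (which carry a .region and an .offset); all names here are hypothetical:

import scala.collection.mutable

// Collect the distinct regions the inputs live in and record, for each input
// index, which region it came from.  Primitives have no region; here they are
// (arbitrarily) attributed to the writeable region at index 0.
def collectRegions(writeable: Region, inputs: IndexedSeq[Any]): (Array[Region], Map[Int, Int]) = {
  val regions = mutable.ArrayBuffer[Region](writeable)
  val inputRegion = mutable.Map[Int, Int]()
  inputs.zipWithIndex.foreach {
    case (rv: RegionValue, i) =>
      var idx = regions.indexOf(rv.region)
      if (idx < 0) { idx = regions.length; regions += rv.region }
      inputRegion(i) = idx
    case (_, i) =>
      inputRegion(i) = 0
  }
  (regions.toArray, inputRegion.toMap)
}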

Emit:

Emit needs to know which regions all the In IRs come from, and deal with things accordingly. (This could be something like passing a map in as an argument.) Emit.emit would probably have to return 4 things: the three it currently returns (setup: Code[Unit], missing: Code[Boolean], and value: Code[_]) and also the region to which the value belongs (Code[Region]).
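
Concretely, the mapping could just be a parameter threaded through emit; inRegions is a hypothetical name, and the four components are those described above:

// Sketch: emit is told which region each In(i) comes from, and returns
// (setup, missing, value, region) instead of the current three things.
def emit(ir: IR, inRegions: Map[Int, Code[Region]]): (Code[Unit], Code[Boolean], Code[_], Code[Region]) = ???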

It might be nice to bundle this into an actual class, like:

case class StagedIRValue(setup: Code[Unit],
    region: Code[Region],
    value: Code[_],
    missing: Code[Boolean])

which would maybe have the side effect of making things easier to read. We could also consider creating new LocalRefs to hold value and missing, since they tend to get stored immediately into LocalRefs anyways. For example, if we have emit return the following object:

case class StagedIRValue(setup: Code[Unit],
    region: Code[Region],
    value: LocalRef[_],
    missing: SettableBit) {
  def getOrElse(default: Code[_]): Code[_] = missing.mux(default, value)
}

ArrayRef would maybe look something like this:

case ArrayRef(a, i, typ) =>
    val ti = typeToTypeInfo(typ)
    val tarray = TArray(typ)
    val va = emit(a)
    val vi = emit(i)
    val xa = mb.newLocal[Long]   // offset of the array value
    val xi = mb.newLocal[Int]    // the element index
    val xmv = mb.newBit()
    val setup = Code(
      va.setup,
      vi.setup,
      xa := coerce[Long](va.getOrElse(defaultValue(tarray))),
      xi := coerce[Int](vi.getOrElse(defaultValue(TInt32()))),
      xmv := va.missing || vi.missing || !tarray.isElementDefined(va.region, xa, xi))

    StagedIRValue(setup,
      va.region,
      va.region.loadIRIntermediate(typ)(tarray.loadElement(va.region, xa, xi)),
      xmv)

But what happens when we need to pull things from multiple regions into one structure? For example, MakeStruct:

case x@MakeStruct(fields, _) =>
    val initializers = fields.map { case (_, v) => (v.typ, emit(v)) }
    val srvb = new StagedRegionValueBuilder(fb, x.typ)
    present(Code(
      srvb.start(init = true),
      Code(initializers.map { case (t, (dov, mv, vv)) =>
        Code(
          dov,
          mv.mux(srvb.setMissing(), srvb.addIRIntermediate(t)(vv)),
          srvb.advance()) }: _*),
      srvb.offset))

would probably need some logic to copy over any struct fields that belong to regions other than the main region. This might look like adding something to srvb that lets us do srvb.addIRIntermediate(t)(region, offset), which would do a deep copy into the region being built on.
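
Very loosely, that overload might dispatch on the type: primitives can be loaded from the source region and added as usual, while pointer-containing types need a type-directed copy. This is just a sketch; copyStruct and copyArray are hypothetical helpers, not existing srvb methods.

// Sketch only: an addIRIntermediate overload that deep-copies a value living
// in another region into the region this builder is constructing in.
def addIRIntermediate(t: Type)(fromRegion: Code[Region], fromOff: Code[Long]): Code[Unit] =
  t match {
    case _: TInt32 | _: TInt64 | _: TFloat32 | _: TFloat64 | _: TBoolean =>
      // primitives have no pointers, so they can be loaded and stored directly
      addIRIntermediate(t)(fromRegion.loadIRIntermediate(t)(fromOff))
    case ts: TStruct =>
      copyStruct(ts, fromRegion, fromOff)   // hypothetical: copy fields recursively
    case ta: TArray =>
      copyArray(ta, fromRegion, fromOff)    // hypothetical: copy length, missing bits, elements
    case _ =>
      ???
  }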

I originally got super turned around because I was thinking it might be nice to be able to write to all the regions, but Dan pointed out that (1) we try not to mutate RegionValues and (2) generally we only have one thing we’re trying to construct at a time.

Overall your proposal looks good. I have some small local comments that I’ll put in a separate response, but first I want to consider the alternatives from a broader perspective.

The motivation for the region-based design had three parts: (1) to minimize memory allocation overhead for data values, (2) to have a value representation that isn’t tied to the Java object representation so we can easily interoperate with C-like languages, and (3) to have a value representation with low (zero?) serialization overhead.

The hope with (3) was that we could write the value representation directly to disk (with some light/fast compression). That failed because the file size with that approach was much larger than what people are used to (they’re used to .vcf.gz files). Here are some numbers I cooked up recently:

  • consider a 4.8GB .vcf.gz file (a shard of gnomAD exomes, ~200K samples)
  • with the current master (encoding using leb128 integers followed by lz4 compression), the file is ~5GB
  • the encoded (but not lz4 compressed) intermediate is ~25GB of data
  • lz4 decompresses ~3.5GB/s, so it takes ~7s to decompress this
  • the decoded data (region representation) is ~125GB
  • lz4 compression only (no encoding) is ~4x larger than the original VCF (deal breaker?)
  • that takes ~35s to decompress
  • the region representation is about 80% zeros (small integers, alignment padding, etc.). This is why leb128 helps a lot
  • one option is to not use encoding (or leb128) but to compress away the zeros. I wrote a packer/unpacker that unpacks at ~5GB/s, see: https://github.com/hail-is/hail3/blob/master/cpp/pack.cc (I can describe the format elsewhere); a simple zero-packing scheme is sketched after this list
  • I don’t seem to have notes on the size/performance of the packer
  • another option is to encode but without leb128 and let lz4 handle the zeros. Decoding leb128 is slow (lots of branches). Again, I don’t have good notes, but I think this was the fastest solution I tested (decoding in ~1m10s), though the file size increased ~40% (maybe OK?)
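
For illustration only, here is one naive zero-packing scheme (not necessarily the format pack.cc uses): for every 8 input bytes, write a one-byte mask of which of them are nonzero, followed by just the nonzero bytes.

// Illustrative only; not necessarily the pack.cc format.
def pack(in: Array[Byte]): Array[Byte] = {
  val out = new java.io.ByteArrayOutputStream()
  var i = 0
  while (i < in.length) {
    val n = math.min(8, in.length - i)
    var mask = 0
    var j = 0
    while (j < n) { if (in(i + j) != 0) mask |= (1 << j); j += 1 }
    out.write(mask)                        // which of the next n bytes are nonzero
    j = 0
    while (j < n) { if (in(i + j) != 0) out.write(in(i + j)); j += 1 }
    i += n
  }
  out.toByteArray
}

def unpack(packed: Array[Byte], unpackedLength: Int): Array[Byte] = {
  val out = new Array[Byte](unpackedLength)   // zeros by default
  var i = 0                                   // position in packed input
  var o = 0                                   // position in output
  while (o < unpackedLength) {
    val mask = packed(i) & 0xff
    i += 1
    val n = math.min(8, unpackedLength - o)
    var j = 0
    while (j < n) {
      if ((mask & (1 << j)) != 0) { out(o + j) = packed(i); i += 1 }
      j += 1
    }
    o += n
  }
  out
}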

The takeaway here is that in no case could we reasonably write out the region format directly. (Also, sadly, in no case did we decompress at the gs:// per-core bandwidth (100MB/s), although we did get close (~50s vs ~1m10s).) Therefore, I don’t think we need to maintain the ability to relocate (which we don’t have, actually, since we use region-absolute offsets) or memcpy or directly serialize (the latter may have more application if/when we use the region value format as the wire format for communication).

We use offsets (instead of pointers) for two reasons: (1) for serializability, as discussed above, and (2) because Region stores an Array[Byte] instead of memory allocated on the C heap.

A Region is a pool of memory that will be deallocated en masse. Each region has its own lifetime (global, for globals and sample annotations, and per-row for the processing of each record in an RVD). Therefore, it should be OK to have pointers from regions with shorter lifetimes to regions with longer lifetimes. Thus, I have an alternative proposal:

Use pointers inside of regions and allow pointers from regions with shorter lifetimes to regions with longer lifetimes.

(Aside: it should be relatively efficient to make regions with pointers directly serializable by rewriting just the pointers to be relative offsets, since this involves touching only the pointers.)

In the current setup, I see a few challenges to implementing this idea:

  • when we grow a region, the addresses might change. We can’t do this anymore. Regions should instead allocate memory in blocks (instead of one giant block) and move all the blocks to the free list when the region is cleared; a minimal sketch is at the end of this list.

  • since we’re using Array[Byte], we can’t use actual pointers. I see two options here:

    • stop using Array[Byte]. To do this, we need to know when we can free the memory of a region. Originally I had intended to rely on the Java finalizer mechanism, but that will never work: we are allocating memory off-heap but that isn’t tracked by Java and therefore doesn’t create memory pressure to cause a GC, so you blow out the memory.

      The reason we don’t know when to deallocate is that regions are allocated by producers and live inside Java iterators (inside Spark tasks), and we don’t know when the iterators are finished being used. (One signal is !hasNext, but this is insufficient; e.g. consider take(5).)

      Therefore, we need to make regions consumer-allocated: the consumer always knows when the iterator is no longer needed. However, this flips the Spark control flow on its head: the consumer now needs to pass the region to the producer so it knows where to store the region value. I think we can implement this control flow pattern in RVD once OrderedRDD is gone (instead of having an RDD[RegionValue], we’ll have an RDD[Region => Iterator[RegionValue]], where each partition is a singleton function that takes the region for that partition and returns the iterator of region values in that partition).

    • Second, we use symbolic, multi-part pointers: use 16 bits to identify the region we’re pointing to, and the remaining 48 bits as the offset into that region.
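
A small sketch of what such a packed pointer could look like (just an illustration of the bit layout):

// Sketch: high 16 bits pick the region, low 48 bits are the offset into it.
object SymbolicPointer {
  def pack(regionIndex: Int, offset: Long): Long = {
    require(regionIndex >= 0 && regionIndex < (1 << 16))
    require(offset >= 0 && offset < (1L << 48))
    (regionIndex.toLong << 48) | offset
  }

  def regionIndex(p: Long): Int = (p >>> 48).toInt

  def offset(p: Long): Long = p & ((1L << 48) - 1)
}

And, for the block-allocation point above (the first bullet in this list), a minimal sketch of a region that allocates fixed-size blocks from a free list and hands them all back when cleared, so addresses of already-allocated values never move. This is not Hail's actual Region implementation, just an illustration; alignment and oversized allocations are ignored.

// Sketch only: block-based allocation with a shared free list.
object BlockPool {
  val blockSize: Int = 64 * 1024
  private val free = new java.util.ArrayDeque[Array[Byte]]()

  def getBlock(): Array[Byte] =
    if (free.isEmpty) new Array[Byte](blockSize) else free.pop()

  def returnBlocks(blocks: Iterable[Array[Byte]]): Unit =
    blocks.foreach(b => free.push(b))
}

class BlockedRegion {
  private var blocks: List[Array[Byte]] = Nil
  private var current: Array[Byte] = _
  private var used: Int = 0

  // Allocate n bytes (assumed <= blockSize) and return (block, offset within block).
  def allocate(n: Int): (Array[Byte], Int) = {
    if (current == null || used + n > BlockPool.blockSize) {
      current = BlockPool.getBlock()
      blocks = current :: blocks
      used = 0
    }
    val off = used
    used += n
    (current, off)
  }

  // Clearing returns every block to the pool rather than freeing one big buffer,
  // so growing the region never moves previously allocated values.
  def clear(): Unit = {
    BlockPool.returnBlocks(blocks)
    blocks = Nil
    current = null
    used = 0
  }
}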