From Offsets to Pointers

First, @rcownie, thanks for challenging the larger design. I look forward to more of the discussions and I think they can only improve the design.

I think @dking and I might have slightly different expectations about where things are going. Let me give my take.

I wanted to keep the IR functional because I think it greatly simplifies most of what we want to do in the query optimizer. In the initial draft of the IR, @dking included struct and array mutation operations and general looping constructs which I pushed back on.

I don’t think the implementation of the IR needs to be immutable, nor should it. I’m totally fine with modifying unshared structures and have described a few examples where I’m explicitly hoping for it: to use your struct example, if you’re going to insert into a struct downstream, you should allocate space for the insertion upstream and modify it downstream.

The existing Scala/Spark model is totally iterators over immutable data. That’s where we inherited it from. In walking from Spark + annotation => C++ + region values, I thought changing it would be too disruptive. (I also think it has simplifying value.) I’m happy to revisit this if you have an alternate proposal.

I think GC and iterators are both wrong for us, so moving away from them is a feature, not a bug. The work @dking is doing on ContextRDD to invert the iterator model will allow us to manage memory explicitly rather than relying on GC.

Absolutely, we want to vectorize our computations in the database sense. My model for this was to have the caller request multiple rows delivered in a region, and have the producer return either an array of pointers to the rows, or a pointer to an array of rows in the region. If you do the latter, none of the existing interfaces change.
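A minimal sketch of the second option (a single pointer to an array of rows in the region), written in Python for brevity rather than in the real Scala/C++ code. `Region`, `produce_batch` and the `DP`-only rows are hypothetical stand-ins, not Hail's actual interfaces.

```python
from typing import Dict, Iterator, List

_DONE = object()  # sentinel marking an exhausted source


class Region:
    """Hypothetical region: an append-only store addressed by offset."""

    def __init__(self) -> None:
        self.values: List[object] = []

    def append(self, value: object) -> int:
        """Store a value and return its offset in the region."""
        self.values.append(value)
        return len(self.values) - 1


def produce_batch(source: Iterator[Dict[str, int]], region: Region, n: int) -> int:
    """Write up to n rows into the region as one array; return the array's offset."""
    rows = []
    for _ in range(n):
        row = next(source, _DONE)
        if row is _DONE:
            break
        rows.append(row)
    return region.append(rows)


# The caller asks for two rows at a time and still receives a single offset,
# so the shape of the producer/consumer interface is unchanged.
region = Region()
source = iter([{"DP": 10}, {"DP": 20}, {"DP": 30}])
first = region.values[produce_batch(source, region, 2)]
second = region.values[produce_batch(source, region, 2)]
total_dp = sum(row["DP"] for row in first)
```

The consumer can then loop over the whole batch in one call, which is where the vectorization benefit would come from.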

But that would require decoupling the Region-lifetime (and frame-on-Region lifetime) from the timing of the iterator operations

I’m not sure I understand. The region lifetimes and the iterator operations are completely decoupled in the current design. There is nothing to stop a consumer from calling a producer multiple times to add values to a region without resetting the region, or from returning values in the same region multiple times in a consumer (exactly the flatMap case).
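To make the decoupling concrete, here is a small Python sketch under simplifying assumptions. `Region`, `producer` and `flat_map_consumer` are hypothetical names, and a Python list stands in for region memory; the point is only that neither stage ever resets the region, so its lifetime is decided by whoever owns it.

```python
from typing import Callable, Iterable, Iterator, List


class Region:
    """Hypothetical region: values stay alive until clear() is called explicitly."""

    def __init__(self) -> None:
        self.values: List[object] = []

    def append(self, value: object) -> int:
        self.values.append(value)
        return len(self.values) - 1

    def clear(self) -> None:
        self.values.clear()


def producer(region: Region, rows: Iterable[int]) -> Iterator[int]:
    """Write each row into the region and yield its offset; never clears the region."""
    for row in rows:
        yield region.append(row)


def flat_map_consumer(
    region: Region, offsets: Iterable[int], f: Callable[[int], List[int]]
) -> Iterator[int]:
    """Yield zero or more output offsets per input row, all in the same region."""
    for off in offsets:
        for out in f(region.values[off]):
            yield region.append(out)


region = Region()
offsets = list(flat_map_consumer(region, producer(region, [1, 2, 3]), lambda x: [x] * x))
outputs = [region.values[o] for o in offsets]  # read results while the region is live
live_values = len(region.values)               # 3 inputs + 6 outputs share one region
region.clear()                                 # the owner, not the iterators, ends the lifetime
```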

Along with that, I have a question. What immutable data is it that we’re hoping to share between upstream and downstream RegionValues on the same Region? My guess is that it’s mostly string-contents. When you mutate one element in a TArray (or change its missingness), you’ve got to copy the whole TArray. When you mutate one field in a TStruct, you’ve got to copy the whole TStruct. So what actually does get preserved, other than string contents?

These are great and important questions. A few comments:

  1. Almost none of our data is strings. The most common entry schema is:

    struct {
      GT: call,
      AD: array<int>,
      DP: int,
      GQ: int,
      PL: array<int>
    }
    

    This accounts for 99% of our data (by volume). If by strings you mean non-scalar values, then yes, you’re right: strings, arrays, sets and dicts. Remember, the entries are stored in the region value row as a column-indexed array of the above entry structs. When you insert a top-level field into the region value row, that array (99% of the data) gets shared.

  2. For inserting fields into structs, I see three possible implementations: (1) copy the struct (bad!); (2) pre-allocate the space for the inserted field upstream and assign the field (and missing bits) downstream; or (3) save the value to be inserted along with a pointer to the original struct.

    To implement (3), as compared to the “default” layout for a struct, we need a way to (statically) distinguish between multiple physical implementations of the struct type. I call these “physical types”, and they are on the near-term roadmap. I wrote up the idea a while back here: Future of types 2: virtual and physical types.

    From this perspective, I think of user-level types as the “interface” to values inside Hail, and of physical types as “implementations” of those interfaces, but these “implementations” should be known and resolved statically.

  3. We don’t make it easy to mutate a single array element at the user level because that just isn’t efficient unless we make mutation part of the user model, which I’d like to avoid for lots of reasons. (Does SQL?)
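The three struct-insertion strategies from point 2 can be sketched as follows. This is an illustration only: `PreallocatedStruct`, `OverlayStruct`, `insert_by_copy` and the `locus`/`qc`/`filters` field names are hypothetical, and Python objects stand in for region memory. In every strategy the large `entries` array is shared by reference rather than copied.

```python
from typing import Any, List

MISSING = object()  # stands in for a set missing bit


class PreallocatedStruct:
    """(2) Upstream reserves a slot for every field, including ones assigned later."""

    def __init__(self, field_names: List[str]) -> None:
        self.index = {name: i for i, name in enumerate(field_names)}
        self.slots: List[Any] = [MISSING] * len(field_names)

    def set(self, name: str, value: Any) -> None:
        self.slots[self.index[name]] = value

    def get(self, name: str) -> Any:
        return self.slots[self.index[name]]


class OverlayStruct:
    """(3) Keep a reference to the original struct plus the one inserted field."""

    def __init__(self, base: Any, name: str, value: Any) -> None:
        self.base, self.name, self.value = base, name, value

    def get(self, name: str) -> Any:
        return self.value if name == self.name else self.base.get(name)


def insert_by_copy(base: Any, field_names: List[str], name: str, value: Any) -> PreallocatedStruct:
    """(1) Allocate a new struct and copy every existing field slot into it."""
    out = PreallocatedStruct(field_names + [name])
    for n in field_names:
        out.set(n, base.get(n))
    out.set(name, value)
    return out


entries = [{"GT": "0/1", "DP": 12, "GQ": 45}]  # the bulk of the data
row = PreallocatedStruct(["locus", "entries", "qc"])  # "qc" reserved upstream
row.set("locus", "1:100")
row.set("entries", entries)
row.set("qc", 0.98)  # downstream fills the reserved slot in place

overlay = OverlayStruct(row, "filters", "PASS")  # no field slots copied
copied = insert_by_copy(row, ["locus", "entries", "qc"], "filters", "PASS")
```

Callers only ever use `get`, so which implementation sits behind a given struct could be decided ahead of time, which is the role physical types are meant to play.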

I think it would be reasonable to allow structs to be pointed to rather than always stored inline. In fact, we should have both: the choice should be a physical type.
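As a rough illustration of "both layouts, chosen as a physical type", the Python sketch below stores the same two-field struct either inline or behind an offset in a byte buffer. The function names, the little-endian `int32` fields and the 8-byte offset are assumptions made for the example, not Hail's actual encoding.

```python
import struct
from typing import Tuple


def write_inline(buf: bytearray, dp: int, gq: int) -> int:
    """Append the two int32 fields directly; return the offset where they start."""
    off = len(buf)
    buf.extend(struct.pack("<ii", dp, gq))
    return off


def write_pointed(buf: bytearray, dp: int, gq: int) -> int:
    """Append the fields, then an 8-byte offset to them; return the pointer slot's offset."""
    target = write_inline(buf, dp, gq)
    off = len(buf)
    buf.extend(struct.pack("<q", target))
    return off


def read_inline(buf: bytearray, off: int) -> Tuple[int, int]:
    """Read the fields stored at off."""
    return struct.unpack_from("<ii", buf, off)


def read_pointed(buf: bytearray, off: int) -> Tuple[int, int]:
    """Follow the stored offset, then read the fields it points at."""
    (target,) = struct.unpack_from("<q", buf, off)
    return read_inline(buf, target)


buf = bytearray()
inline_off = write_inline(buf, 30, 99)
pointer_off = write_pointed(buf, 12, 45)
inline_value = read_inline(buf, inline_off)
pointed_value = read_pointed(buf, pointer_off)
```

The reader for each layout is fixed when the layout is chosen, so no per-value tag or runtime dispatch is needed.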