So when we were rewriting the select/annotate/drop/transmute operations to be in terms of a more general MapStruct operation, we ran into some interesting questions about how to handle keys in these interfaces, since changing select from taking per-field expression to a struct-level expression means it is more difficult to distinguish expressions that are changing the keys from expressions that aren’t. The easy (but moderately messy) fix is to analyze the resulting IR/AST to figure out whether or not the key fields are touched in a given expression, which is what we’ve done for now.
@pschultz pointed out that maybe we could have more conceptual and logical separation between keys and values. Currently, keys are treated like any other value field, which means that it’s really easy to change them, for better or for worse, with
drop, and we silently drop key fields from the list of keys if they get dropped from the underlying struct, or pick up a new key expression if the old one gets overwritten.
I think this gets a little confusing with the way that we use keys; it’s pretty easy/intuitive to think about, for example, a table struct as a map (or multi-map, since we don’t enforce uniqueness?) from key fields to value fields, and that creates a distinct asymmetry between how we think about and want to manipulate key fields vs how we want to manipulate value fields.
It feels like we now have this implicit notion of a “keyed struct”—the rows of a table can be thought of as a keyed
StructExpression, as can
MatrixTable.col—and it might be nice to make this concept more explicit since it seems useful in how we & users think about keys. In the meantime, though, I’ve been thinking about what it would mean for the Python interface and (maybe) the IR representation by treating keys distinctly from value fields.
I’ve outlined a little bit of what I’ve been thinking below, and I have a branch [diff here] that implements the Python interface changes for the
rows interface of
MatrixTable. (It won’t currently compile, since I still need to rebase on the MapRows changes.)
is the only method that can modify key fields. The interface is identical to the current
select interface, where non-named exprs must be field references (not necessarily top-level), but non-field-reference expressions can still be used if a name is provided. All unused former key expressions are retained as value fields.
I’m not super sure what to do with the
partition_key argument that exists right now on
key_rows_by; for now, I’ve removed it since it was only added a short while ago for
GroupedMatrixTable, and afaict no-one is using it. I’ve added a private
_key_rows_by(pk_dict, rest_of_k_dict) so the functionality is retained for that use. If we want to keep it as an option, I kind of like some thing by which someone can do e.g.
key_rows_by(x, y, z).partition_rows_by(x, y).
should deal only with value fields. An attempt to annotate over a key field will cause an error. If you actually want to annotate over a key field, a separate call to
key_by has to be made.
This is a little annoying for things like
annotate(key3=expr) in the old interface, since you have to do something like
key_by(key1=t.key1, key2=t.key2, key3=expr, key4=t.key4) or
key_by(**t.key.annotate(key3=expr)). I don’t know how often this would come up (I suspect not super often) but I think it’s still better at reinforcing what the new key would look like. (this could possibly be mitigated by something like
annotate_key, which just adds/replaces a key field).
likewise deals only with value fields. All key fields are automatically preserved; if you wanted to do something like
select(k2, k4, v1=expr), you’d do something like
I currently have this so that key fields can’t be dropped (
key_by(**t.key.drop(k4)).drop('k4') would drop a key field).
I haven’t implemented this yet since none of them are implemented in MatrixTable, but I imagine it would be identical to the current transmute except that you wouldn’t be able to overwrite key fields and any of the key fields used in a
transmute expression would not be dropped.
@tpoterba thought this could be all the non-key row fields? I don’t really know what to do here, except maybe it would be nice to have one property that’s the values struct and one property that’s the entire struct (keys + values)?
Taking the MatrixTable stuff as an example, there’s something conceptually neat about the symmetry that arises between
mt.entries when we define
mt.row as “the set of row-indexed value fields”. This means that we have:
mt.rows() => hl.struct(**mt.row_key, **mt.row)
mt.cols() => hl.struct(**mt.col_key, **mt.col)
mt.entries() => hl.struct(**mt.entry_key, **mt.entry)
mt.entry_key doesn’t actually exist; it’s kind of just shorthand for
**mt.row_key, **mt.col_key since
entry is the row and col indexed stuff.
I haven’t really thought this part through. Right now, I’ve written out
keyRowsBy s.t. it just uses the existing
selectRows implementation, so the only thing that exists on the IR side is the
MapRows concept. A couple of options that I think are reasonable without changing too much:
- two separate nodes:
MapRowWithExistingKeys, the latter of which preserves the keys and doesn’t shuffle, the former of which will change the keys (and therefore needs to shuffle).
- one IR node,
MapRow, with the separate key mapping and values mapping. We’d still need to figure out whether or not the keys are being changed (or maybe
keyswould just be Option[IR] with
None=> preserve keys).
I think having a notion of
KeyedStruct would help a lot in defining the IR representation.