Proposed changes to the python select/annotate/drop interface for keys


#1

So when we were rewriting the select/annotate/drop/transmute operations to be in terms of a more general MapStruct operation, we ran into some interesting questions about how to handle keys in these interfaces, since changing select from taking per-field expression to a struct-level expression means it is more difficult to distinguish expressions that are changing the keys from expressions that aren’t. The easy (but moderately messy) fix is to analyze the resulting IR/AST to figure out whether or not the key fields are touched in a given expression, which is what we’ve done for now.

@pschultz pointed out that maybe we could have more conceptual and logical separation between keys and values. Currently, keys are treated like any other value field, which means that it’s really easy to change them, for better or for worse, with select/annotate/drop, and we silently drop key fields from the list of keys if they get dropped from the underlying struct, or pick up a new key expression if the old one gets overwritten.

I think this gets a little confusing with the way that we use keys; it’s pretty easy/intuitive to think about, for example, a table struct as a map (or multi-map, since we don’t enforce uniqueness?) from key fields to value fields, and that creates a distinct asymmetry between how we think about and want to manipulate key fields vs how we want to manipulate value fields.

It feels like we now have this implicit notion of a “keyed struct”—the rows of a table can be thought of as a keyed StructExpression, as can MatrixTable.row and MatrixTable.col—and it might be nice to make this concept more explicit since it seems useful in how we & users think about keys. In the meantime, though, I’ve been thinking about what it would mean for the Python interface and (maybe) the IR representation by treating keys distinctly from value fields.

I’ve outlined a little bit of what I’ve been thinking below, and I have a branch [diff here] that implements the Python interface changes for the rows interface of MatrixTable. (It won’t currently compile, since I still need to rebase on the MapRows changes.)

Python

key_by(exprs, named_exprs)

is the only method that can modify key fields. The interface is identical to the current select interface, where non-named exprs must be field references (not necessarily top-level), but non-field-reference expressions can still be used if a name is provided. All unused former key expressions are retained as value fields.

I’m not super sure what to do with the partition_key argument that exists right now on key_rows_by; for now, I’ve removed it since it was only added a short while ago for GroupedMatrixTable, and afaict no-one is using it. I’ve added a private _key_rows_by(pk_dict, rest_of_k_dict) so the functionality is retained for that use. If we want to keep it as an option, I kind of like some thing by which someone can do e.g. key_rows_by(x, y, z).partition_rows_by(x, y).

annotate(named_exprs)

should deal only with value fields. An attempt to annotate over a key field will cause an error. If you actually want to annotate over a key field, a separate call to key_by has to be made.

This is a little annoying for things like annotate(key3=expr) in the old interface, since you have to do something like key_by(key1=t.key1, key2=t.key2, key3=expr, key4=t.key4) or key_by(**t.key.annotate(key3=expr)). I don’t know how often this would come up (I suspect not super often) but I think it’s still better at reinforcing what the new key would look like. (this could possibly be mitigated by something like annotate_key, which just adds/replaces a key field).

select(exprs, named_exprs)

likewise deals only with value fields. All key fields are automatically preserved; if you wanted to do something like select(k2, k4, v1=expr), you’d do something like key_by(k2, k4).select(v1=expr).

drop

I currently have this so that key fields can’t be dropped (key_by(**t.key.drop(k4)).drop('k4') would drop a key field).

transmute

I haven’t implemented this yet since none of them are implemented in MatrixTable, but I imagine it would be identical to the current transmute except that you wouldn’t be able to overwrite key fields and any of the key fields used in a transmute expression would not be dropped.

self.row (property)

@tpoterba thought this could be all the non-key row fields? I don’t really know what to do here, except maybe it would be nice to have one property that’s the values struct and one property that’s the entire struct (keys + values)?

Taking the MatrixTable stuff as an example, there’s something conceptually neat about the symmetry that arises between mt.row/mt.rows(), mt.col/mt.cols(), and mt.entry/mt.entries when we define mt.row as “the set of row-indexed value fields”. This means that we have:

  • mt.rows() => hl.struct(**mt.row_key, **mt.row)
  • mt.cols() => hl.struct(**mt.col_key, **mt.col)
  • mt.entries() => hl.struct(**mt.entry_key, **mt.entry)

…well, mt.entry_key doesn’t actually exist; it’s kind of just shorthand for **mt.row_key, **mt.col_key since entry is the row and col indexed stuff.

IR/Scala representation

I haven’t really thought this part through. Right now, I’ve written out keyRowsBy s.t. it just uses the existing selectRows implementation, so the only thing that exists on the IR side is the MapRows concept. A couple of options that I think are reasonable without changing too much:

  • two separate nodes: MapRow and MapRowWithExistingKeys, the latter of which preserves the keys and doesn’t shuffle, the former of which will change the keys (and therefore needs to shuffle).
  • one IR node, MapRow, with the separate key mapping and values mapping. We’d still need to figure out whether or not the keys are being changed (or maybe keys would just be Option[IR] with None => preserve keys).

I think having a notion of KeyedStruct would help a lot in defining the IR representation.