Type.scala cleanup

  • There are two kinds of Type: ComplexType and FundamentalType

    • Every ComplexType has a representation in terms of another Type (possibly another ComplexType).
    • FundamentalTypes have a physical representation in a region and methods to access that physical representation.
  • A subset of FundamentalTypes are scalar types. The physical representations of scalar types are not defined in terms of the physical representations of other types.

  • Every trait in a sealed hierarchy must be sealed; every class in a sealed hierarchy must be sealed or final.

Proposed Type Hierarchy

Fixed-width names correspond to Scala traits/classes; italic names refer to concepts not reified in Scala.

  • ComplexType
    • TSet
    • TDict
    • TString
    • genetics types
      • TCall
      • TGenotype
      • TAltAllele
      • TVariant
      • TLocus
      • TInterval
  • FundamentalType
    • TStruct
    • TArray
    • scalar types
      • TBinary
      • TBoolean
      • TInt32
      • TInt64
      • TFloat32
      • TFloat64
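
As a rough illustration only (member names and signatures are invented here, not a spec), the hierarchy above plus the sealed/final rule would give Scala along these lines:

sealed trait Type

// Every ComplexType is defined in terms of another Type.
sealed trait ComplexType extends Type {
  def representation: Type
}

// FundamentalTypes have a physical representation in a region.
sealed trait FundamentalType extends Type

final case class TSet(elementType: Type) extends ComplexType {
  def representation: Type = TArray(elementType)
}

final case class TArray(elementType: Type) extends FundamentalType
case object TInt32 extends FundamentalType
case object TBoolean extends FundamentalType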

Thoughts / Notes / Questions

Every Type is ordered.

Every Type is printable (via pretty and str).

Every Type can generate random inhabitants of itself.

Every Type corresponds to a Spark type. Question: should the injection into Spark types for ComplexTypes be defined in terms of the corresponding fundamental type?

Every Type corresponds to a JSON value. Question: should the injection into JSON values for ComplexTypes be defined in terms of the corresponding fundamental type?

Every Type has an alignment and byteSize. Question: should this move to FundamentalType?

Re alignment and byteSize: I don’t think it’s unreasonable to leave things as they are now, such that there’s a single version of those defined on ComplexType that just calls the corresponding method on the underlying representation.

On that note, we might also want to think about using final on methods with more prejudice, so that it’s immediately apparent when a subclass overrides a method it isn’t supposed to, e.g. alignment and byteSize on ComplexType.
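
Something like this sketch (assuming ComplexType exposes its fundamental representation, here called fundamentalType; names are illustrative): the single final definition on ComplexType delegates to the representation, and any subclass that tries to override it fails to compile.

sealed trait Type {
  def byteSize: Long
  def alignment: Long
}

// Each FundamentalType supplies its own byteSize/alignment.
sealed trait FundamentalType extends Type

sealed trait ComplexType extends Type {
  def fundamentalType: FundamentalType

  // One definition for all complex types, delegated to the representation;
  // final means no subclass can accidentally override it.
  final def byteSize: Long = fundamentalType.byteSize
  final def alignment: Long = fundamentalType.alignment
}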

I agree it’s not unreasonable, but I get the sense that we are violating a barrier. Like, if we want to talk about physical representations, we should be forced to switch over to FundamentalType land.


Strong Agree.


Aside:

I think all ComplexTypes need to define field accessors so as to create an abstraction barrier between the type and its representation.
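
For instance (purely a sketch, with made-up Region and accessor signatures): a complex type like TLocus would expose contig/position accessors and keep its struct representation private, so callers never compute offsets into the representation themselves.

class Region // stand-in for the real region of raw bytes

sealed trait Type
case object TString extends Type
case object TInt32 extends Type
final case class TStruct(fields: (String, Type)*) extends Type

final class TLocus extends Type {
  // the representation is an implementation detail, not part of the public interface
  private val representation = TStruct("contig" -> TString, "position" -> TInt32)

  // field accessors are the abstraction barrier: callers never touch struct layout
  def contig(region: Region, offset: Long): String = ???  // load via representation
  def position(region: Region, offset: Long): Int = ???   // load via representation
}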

That makes sense. I think the only places we want to see alignment and byteSize are in RegionValueBuilder and the encoder/decoder anyway, right?

Maybe this was too obvious to state, but also when loading array elements of those types.

Another thing that bothers me:

We keep using views to pretend one struct is another, even though the fields are mismatched and disordered. What does it mean for a Hail expression to have type Variant? I suppose this behavior mimics most of the expression language because our users never write functions wherein they would be forced to decide what they will accept as input. When users start defining functions, are we gonna auto-coerce structures to meet these views?

Perhaps views are a distinct feature of the language, some sort of field-polymorphic Struct type. They seem related to subtyping.

I’ve been having similar thoughts. My thought was that a view is like a typeclass. So if V = View(GT: Int32, GQ: Int32), then a function f(x: V) applies to any Struct with two Int32 fields GT and GQ. And a value of Array[V] is an array of S, where S is some concrete struct type with those two fields.
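
Not a full typeclass encoding, but a rough structural sketch of what V would demand (View and satisfiedBy are invented names): a view records the required fields, and a concrete struct type satisfies it if matching fields are present, regardless of order or extra fields.

sealed trait Type
case object TInt32 extends Type
case object TString extends Type
final case class TStruct(fields: Map[String, Type]) extends Type

// A view constrains which fields must exist, not the struct's full layout.
final case class View(required: Map[String, Type]) {
  def satisfiedBy(s: TStruct): Boolean =
    required.forall { case (name, typ) => s.fields.get(name).contains(typ) }
}

object ViewExample {
  // V = View(GT: Int32, GQ: Int32) from the example above
  val v = View(Map("GT" -> TInt32, "GQ" -> TInt32))

  val hardcalls = TStruct(Map("GT" -> TInt32))
  val hts = TStruct(Map("GT" -> TInt32, "GQ" -> TInt32, "DP" -> TInt32))

  assert(!v.satisfiedBy(hardcalls)) // missing GQ
  assert(v.satisfiedBy(hts))        // extra fields and different order are fine
}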

I think the typeclass approach would be easier to get the language design details right than for subtyping. Also, correct me if I’m wrong, but as I understand it we can’t have an Array[V], where the elements of the array are values of heterogeneous struct types all having two particular fields, but possibly having different byte sizes. I think that rules out subtyping.

I’m having trouble thinking of any cases in the expression language where you really want a concrete struct type and not a view/typeclass. What if Struct just meant view, with the one exception of directly constructing a new value of the struct type?

A subset of FundamentalTypes are scalar types

I think you’ve been calling these primitive. I’m happy to use scalar instead; we should just be consistent. I think non-scalar fundamental types should be called compound.

Every Type corresponds to a Spark type. Question: should the injection into Spark types for ComplexTypes be defined in terms of the corresponding fundamental type?

Spark type is too strong here. We use types in annotations that are not supported by Spark (e.g. Variant). The answer to this question is currently no (e.g. for Variant). I think the fundamental type is a “region value” world thing and the JVM realization of typed values is an “annotation” world thing, with the latter going away.

Every Type corresponds to a JSON value. Question: should the injection into JSON values for ComplexTypes be defined in terms of the corresponding fundamental type?

This is fine. This is reasonable to do when the JSON import/export code is RV-ified.

Every Type has an alignment and byteSize. Question: should this move to FundamentalType?

There are many comments about types vs representations in this thread. I have been using the name virtual type to refer to a conceptual datatype with an accessor interface and physical type to its concrete realization (not just layout, but implementation of the accessor interface). I will write this up in another post. byteSize and alignment should both move to physical types.
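
To make the terminology concrete, a tiny sketch of the split (PInt32 and these members are placeholders for illustration, not an existing API): the virtual type is the conceptual datatype, and byteSize/alignment live only on the physical side.

// Virtual types: conceptual datatypes with an accessor interface, no layout.
sealed trait VirtualType
case object TInt32 extends VirtualType
final case class TStruct(fields: IndexedSeq[(String, VirtualType)]) extends VirtualType

// Physical types: a concrete realization of a virtual type; layout lives here.
sealed trait PhysicalType {
  def virtualType: VirtualType
  def byteSize: Long
  def alignment: Long
}

case object PInt32 extends PhysicalType {
  val virtualType: VirtualType = TInt32
  val byteSize: Long = 4
  val alignment: Long = 4
}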

There are cases this won’t work, though. Variant, for instance: as soon as the reference stuff is a bit more built up, we can represent a variant as a struct where contig is an integer (reference contig), but if we export to Spark / JSON / anywhere else, we want that to be a string.

We keep using views to pretend one struct is another, even though the fields are mismatched and disordered. What does it mean for a Hail expression to have type Variant?

We only do this for genotype. A variant has type Variant in the code and will not match a struct.

Before we talk about what we should do (e.g. structural subtyping or typeclasses), let’s make sure we’re clear on why we’re doing this.

  1. Of the common data formats, some methods only need a subset of the data (e.g. GT), and since that is relatively common, for efficiency reasons it is nice to store only the subset (hardcalls). It is for performance.

  2. For genotype data, there are several overlapping schemas (HTS vs Genotype, Michigan vs GATK’s HTS schema, hardcalls, HTS + phasing, etc.) and the user doesn’t want to have to do type conversions when calling generic methods. This is a convenience and performance issue.

  3. This is a familiar model in data analysis tools. If you write a function on dataframes or a SQL query, you don’t declare the “type” of your dataframe/table inputs, you just use the fields you need and if they are present it works just fine.

We rarely (ever?) match structs at the level of the expression language. We do it at the level of VSM and KeyTable. Are you also proposing to statically type KeyTable and VSM operations? Or is this a conceptual conversation about how we think about these views that we’re taking?

suppose this behavior mimics most of the expression language because our users never write functions wherein they would be forced to decide what they will accept as input.

Well, that’s just true of Python (duck typing). No reason they couldn’t use our @typecheck decorator to declare and statically (at Hail compile time) verify inputs.

When users start defining functions, are we gonna auto-coerce structures to meet these views?

I think I need a concrete example here. But no, I was not imagining we would do anything more about this than we’re doing now.

I wholeheartedly agree we should think about the conceptual model for our data and operations, and even the formal model that underlies it. But I think the instinct to type everything for typing’s sake is wrong. If you (and/or Patrick) are proposing building new type machinery to be able to statically declare and verify these “struct view” relationships, I’m very skeptical of the value to the user given the work that would be required to implement such a thing. What problem is this solving?

In addition, I think we already fully type check the relational level (that is, VSM and KeyTable) operations when the user executes the pipeline and we build and verify the IR at compile time.

I think we misunderstand one another. For every type, we have a way to convert annotations of that type to values of a “Spark type” (SparkAnnotationImpex). We currently define this mapping for every type. The question/suggestion I made is: the correspondence between Hail values (whether realized as region values or annotations) and Spark values should only be defined on FundamentalType. A ComplexType can be imported-to/exported-from Spark by using the ComplexType’s realization. I suppose this is irrelevant if we are imminently removing the ability to convert a KeyTable to/from a Spark DataFrame.
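
Spelled out, the suggestion is roughly this (method names invented; the real conversions live in SparkAnnotationImpex, not on Type): only fundamental types define the correspondence with Spark values, and a complex type exports by first rewriting its value into its fundamental realization.

sealed trait Type {
  def fundamentalType: FundamentalType
  // rewrite a value of this type into a value of its fundamental realization
  def toFundamentalValue(a: Any): Any
  def exportToSpark(a: Any): Any
}

sealed trait FundamentalType extends Type {
  final def fundamentalType: FundamentalType = this
  final def toFundamentalValue(a: Any): Any = a
  // each fundamental type supplies its own Spark correspondence
  def exportToSpark(a: Any): Any
}

sealed trait ComplexType extends Type {
  // complex types never define a Spark correspondence of their own
  final def exportToSpark(a: Any): Any =
    fundamentalType.exportToSpark(toFundamentalValue(a))
}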

I agree that this is about physical types. If physical types will be added co-temporally with the above proposed changes, then these operations belong there. If not, I think it simplifies the code and the conception to treat FundamentalTypes as the only types that have alignment and byteSize (i.e. until further notice FundamentalTypes are one-to-one with the physical types).

I envision a Variant’s JSON-representation including the genome reference. A la:

{ gr: "GRCh38"
, contig: 1
, pos: 1
, ref: "A"
, alt: "B"
}

I suppose there are situations wherein we want to define the genome reference once for an array of Variants, but that behavior seems somewhat custom and would require rather more complicated logic than the current JSONAnnotationImpex contains.

Ah, I misunderstood your comment on my PR to rv-ify export_plink. I thought you wanted to create a VariantView so that we could treat anything with the right fields as a Variant.

Agreed. I’m referring to user-defined functions, wherein the user must explicitly annotate types on functions. It seems reasonable that I might write a function on structs with a GT and a PL field. The user can rewrite their function to take two arguments, gt and pl. This is fine, but certainly not the nicest possible programming interface. This is not an urgent matter (UDFs don’t currently exist), but it comes to mind when I think about how our users can best interact with Hail.