Required Types: Syntax and Semantics

Picking up from an email:

It seems like every type should have a notion of requiredness. Storing that knowledge on the struct field or array adds an extra step of propagating the bit into the ExtendedType.

How to produce non-missing values:

transcripts.map(x => orElse(x, "NA")): Array[!String]
Array("foo", "bar", "baz"): Array[!String]

Probably should have type annotations:

"foo": String

and an explicit lifting function:

orNull("foo")

We should probably call it orNA, but I prefer:

nullable("foo")

because orElse does a condition check, so it seems weird that orNull is a no-op.

or maybe:

?"foo"

What about using the same notation on the type and term levels: given "foo": String, lift it to !"foo": !String. I read !"foo" as "definitely "foo"".

Edit: Sorry, I had that backwards.

That’s nicely consistent. Not sure what the right default is for values. It seems like "foo" should have the tightest correct type.

Here’s a first crack at what putting requiredness on types would mean. Mostly I’m concerned that this clutters up Type.scala a bunch, and that with the extractor for, e.g.

TArray(elementType, required)

the required flag is almost never going to be relevant (which leaves a bunch of TArray(elementType, _) nonsense everywhere).

Also, there’s another problem where anything that used to be

t match {
    case TInt32 => ...
    case TInt64 => ...
}

will now have to be

t match {
    case _: TInt32 => ...
    case _: TInt64 => ...
}

or

t match {
    case TInt32 => ...
    case TInt32Required => ...
    case TInt64 => ...
    case TInt64Required => ...
}

and t == TInt32 becomes t.isInstanceOf[TInt32].

Also, I’m not sure what some of the types should be doing w.r.t. requiredness—e.g. TFunction, etc.

The nice thing about putting requiredness on types is it makes it pretty straightforward to differentiate between things like TDict(!K, !V) and TDict(!K, V) if we want to.

You didn’t say this explicitly, but I expect you have:

class Type {
  ...
  def required: Boolean
}

and each type will implement with a val required: Boolean parameter, right?

re: TArray.unapply. This is the classic tension between conciseness and explicitness. The downside of silently ignoring req is the danger that you miss cases when it is actually required. I lean towards explicitness and don’t super mind TArray(elementType, _).

You can’t overload unapply in Scala (since the overloads would vary only by return type), but you can create alternate extractors using other names, e.g. TArrayT(elementType) (T for type only?) alongside TArray(elementType, req).
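For what it's worth, a minimal sketch of the alternate-extractor idea might look like the following. The stripped-down Type hierarchy and the describe helper are made up for illustration and are not the actual definitions in Type.scala; only the names TArrayT and TArray(elementType, req) come from the discussion above.

```scala
object ExtractorSketch {
  // Simplified stand-ins for the real types (assumption: not the Type.scala definitions).
  sealed trait Type { def required: Boolean }
  case class TInt32(required: Boolean) extends Type
  case class TArray(elementType: Type, required: Boolean) extends Type

  // Alternate extractor under a different name: matches any TArray and
  // exposes only the element type, ignoring the required flag.
  object TArrayT {
    def unapply(t: TArray): Option[Type] = Some(t.elementType)
  }

  // Hypothetical helper showing both extractors side by side.
  def describe(t: Type): String = t match {
    case TArrayT(TInt32(_)) => "array of int32" // requiredness ignored
    case TArray(_, true)    => "required array" // requiredness inspected
    case _                  => "other"
  }
}
```

Call sites that don't care about requiredness use TArrayT, and the ones that do still have to spell out the flag with TArray(elt, req), so the explicit form stays the default.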

I also don’t find _: TInt32 onerous; we’ve already had to make that change in a few cases and it wasn’t bad (e.g. TVariant => TVariant(gr) or _: TVariant).

If you support case TInt32 => ... and case TInt32Required => ..., they should probably share a common supertype. What’s it called? I would probably have TInt32, with class TInt32Optional extends TInt32 and TInt32Required extends TInt32. (Maybe Opt and Req? Too short?)

TFunction can never be missing in the current design, e.g., syntactically you can’t say a.map(NA: TFunction...).

I guess my overall comment would be to not be too attached to the current code/design. If a new feature requires a change to the code, we should make it universally and consistently in light of the new design.

Currently, I have it such that it’s

class Type {
    ...
    def required: Boolean = false
}

and types only have override val required: Boolean if they’re case classes or if they’re required = true, although I could get rid of that and make all types define their own default requiredness.

I’ve also changed some stuff around so that typeCheck, toString, and pretty all automatically reflect the requiredness: instead of subclasses defining those methods directly, they define e.g. _typeCheck(a), and typeCheck(a) wraps that.
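A rough sketch of that wrapping pattern, for readers who haven't seen the branch. Modeling a missing value as null, the "!" prefix in toString, and the _toString hook are assumptions for illustration; the real code may differ on all three.

```scala
object TypeCheckSketch {
  abstract class Type {
    def required: Boolean

    // Subclasses describe only the underlying type, ignoring requiredness.
    protected def _typeCheck(a: Any): Boolean
    protected def _toString: String

    // Assumption: a missing value is modeled as null; only optional types accept it.
    final def typeCheck(a: Any): Boolean =
      if (a == null) !required else _typeCheck(a)

    // Required types are printed with a leading "!".
    final override def toString: String =
      (if (required) "!" else "") + _toString
  }

  final class TInt32(val required: Boolean) extends Type {
    protected def _typeCheck(a: Any): Boolean = a.isInstanceOf[Int]
    protected def _toString: String = "Int32"
  }

  final class TArray(val elementType: Type, val required: Boolean) extends Type {
    protected def _typeCheck(a: Any): Boolean = a match {
      // Each element goes through the wrapper, so element requiredness is enforced too.
      case xs: Seq[_] => xs.forall(x => elementType.typeCheck(x))
      case _          => false
    }
    protected def _toString: String = s"Array[$elementType]"
  }
}
```

Because the public methods are final, a subclass cannot forget to account for requiredness; it only supplies the underscore-prefixed hooks.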

The way I’ve structured the types that used to be case object is like this:

sealed abstract class TInt32 extends Type {
    ... [all definitions here]
}

case object TInt32 extends TInt32
case object TInt32Required extends TInt32 {
    override val required = true
}

but maybe it would make more sense to rename the object TInt32Optional? I have it currently named TInt32 so that you can just write, e.g., TArray(TInt32) and TArray(!TInt32) instead of having to write TArray(TInt32Required) due to the method unary_!() defined on the base class Type.

types only have override val required: Boolean if they’re case classes or if they’re required = true, although I could get rid of that and make all types define their own default requiredness

I prefer the explicitness of no defaults and all types define required. At least internally, I think we shouldn’t express a preference for required vs optional. Don’t make a value judgement. Moreover, things that are conceptually symmetric should be symmetric in the code.

maybe it would make more sense to rename the object TInt32Optional?

By the same argument, I think so.

case object TInt32 extends TInt32

I think this is going to cause confusion.

Another possible syntax is Int32(true), which can return interned type objects so we avoid creating tons of Int32s. This has the advantage of also working with pattern matching (you can say Int32(_) or _: Int32) and of being consistent with other type constructors.
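A minimal sketch of the interned-constructor idea, assuming the TInt32 spelling used elsewhere in this thread rather than Int32, and assuming a default of required = false. The describe helper is hypothetical.

```scala
object InternedSketch {
  // The class carries the behavior; the two case objects are the only instances.
  sealed abstract class TInt32(val required: Boolean)
  case object TInt32Optional extends TInt32(false)
  case object TInt32Required extends TInt32(true)

  object TInt32 {
    // Returns one of the two interned objects instead of allocating a new instance.
    def apply(required: Boolean = false): TInt32 =
      if (required) TInt32Required else TInt32Optional

    // Lets callers write `case TInt32(req)` or `case TInt32(_)`.
    def unapply(t: TInt32): Option[Boolean] = Some(t.required)
  }

  // Hypothetical helper exercising both matching styles.
  def describe(t: Any): String = t match {
    case TInt32(true) => "required int32"
    case _: TInt32    => "optional int32"
    case _            => "not an int32"
  }
}
```

Since apply always hands back one of the two objects, TInt32(true) and TInt32Required are the same reference, so == and pattern matching on the case objects keep working.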

Syntax notes:

Non-singleton Type objects (the ones implemented as case classes) now have an extra field in the constructor for requiredness, e.g. case class TArray(elementType: Type) -> case class TArray(elementType: Type, required: Boolean = false). Things that this changes syntax-wise:

  • uses of the extractor function will now need an extra field for the flag, i.e. TArray(elt) = ... -> TArray(elt, req) or TArray(elt, _).
  • things that expect the apply method to take only the one input will need to be written out more explicitly, such as in typ.py

Previously-singleton Type objects (TInt32, TInt64, TFloat32, TFloat64, TBinary, TBoolean, TString, TGenotype, TCall, TAltAllele) have been broken up in order to accommodate requiredness. Each is now split into 4 parts: the class that defines the underlying behavior, the companion object that defines the apply and unapply methods, and two case objects, one with required = true and one with required = false. Syntax changes (using TInt32 as an example):

  • TInt32 (the case object) becomes either TInt32Optional or TInt32() (or TInt32(true) for the required variant)
  • the unary_! operator defined on Type flips the required flag, so !TInt32() is the same as TInt32Required and !!TInt32() = TInt32().
  • case TInt32 becomes case TInt32(req) or case t: TInt32
  • t == TInt32 becomes, if you care about requiredness, t == TInt32()
  • if you don’t care about the requiredness, but only about the underlying types, you can use t.isInstanceOf[TInt32] or, for nested structs and containers for which it can be annoying to check all the cases, t.isOfType(TInt32) will recursively check that the types are the same while ignoring requiredness.
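To make the last two bullets concrete, here is a small sketch of unary_! and a requiredness-blind isOfType. It is a simplification under stated assumptions: plain case classes stand in for the interned case objects, only TInt32, TString, and TArray are modeled, and isOfType takes a type value such as TInt32() rather than whatever the real signature accepts.

```scala
object IsOfTypeSketch {
  sealed abstract class Type {
    def required: Boolean
    // Flips the required flag.
    def unary_! : Type
    // Structural comparison that ignores requiredness at every level.
    def isOfType(other: Type): Boolean
  }

  final case class TInt32(required: Boolean = false) extends Type {
    def unary_! : Type = copy(required = !required)
    def isOfType(other: Type): Boolean = other.isInstanceOf[TInt32]
  }

  final case class TString(required: Boolean = false) extends Type {
    def unary_! : Type = copy(required = !required)
    def isOfType(other: Type): Boolean = other.isInstanceOf[TString]
  }

  final case class TArray(elementType: Type, required: Boolean = false) extends Type {
    def unary_! : Type = copy(required = !required)
    def isOfType(other: Type): Boolean = other match {
      case TArray(elt, _) => elementType.isOfType(elt) // recurse into the element type
      case _              => false
    }
  }
}
```

With these definitions, TArray(!TInt32()) and TArray(TInt32()) are unequal under == but satisfy isOfType, which is the distinction the bullets describe.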