[wip] PType Design Document [wip]

What are Physical Types?

Physical types are the classes that create in-memory representations (or code generate the in-memory representations), of Hail Types (Virtual Types). They serve as the implementations of Virtual Types.

Where possible Physical Type behavior should follow Python type behavior.

This proposal deals with the architectural goals of the PType implementation for 2020 Q1.

Motivation:

  • Improve performance by building specialized memory representations for data (improve developer velocity / enable performance optimizations in the future).

Project technical goals:

  • Abstract PType interfaces define code-generation and interpretation primitives (for example, PCanonicalArray concretely implements the PArray interface).
  • Remove requiredness from virtual types

Future directions:

  • Introduce the following invariant in the codebase: All region methods / Memory methods are used only in the ptypes hierarchy when dealing with values of Hail types.

Conceptual class hierarchy:

PArray

PSet

PDict

PNDArray

  • Specialized implementations (canonical/non)

PTuple

PStruct

PLocus

PCall

  • Specialized implementations (canonical/non)

PInterval

  • Specialized implementations (canonical/non)

PFloat32

PFloat64

PInt32

PInt64

PString

PBinary

PVoid


PArray

An abstract class for immutable ordered collections where all elements are of a single type. Does not contain the value constructor (e.g allocate)

Core Methods

(Each method has a staged version)

def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...

def isElementMissing(arrayAddress: Long, index: Int): Boolean= ...
def isElementMissing(arrayAddress: Long, index: Code[Int]): Code[Boolean] = ...

def loadElementAddress(arrayAddress: Long, index: Int): Long = ...
def loadElementAddress(arrayAddress: Code[Long], index: Code[Int]): Code[Long] = ...
  • Renamed from loadElement because this function only returns the address of the element, not the element itself
  • Does not take a region instance because memory addresses are valid across regions. In the current loadElement signatures that take a region instance, we do not use that region instance. The only cases that a region instance would be needed is if loadElement needs to allocate memory off-heap, but this seems semantically inconsistent with loading (instead that would be a value construction, which happens in allocate)

Questions:

  1. Do we want to allow a range of maximum array lengths (not just 32 bit)
  2. Do we want to have a loadElement that returns the actual data stored at that address? Currently the caller always needs to be perform a second step, at the cost of more allocations (and the number of bytes returned will be greater for an address than any primitive besides Long and Double)
  3. Do we want a loadElements that returns an iterable? This would save the caller boilerplate: currently they need to store the length of the array, an index variable, and manually construct a while loop, check whether an element is missing, (typically over non-null elements)
  4. Do we want, instead of isElementMissing a version of

Core methods

def allocate(region: Region length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...
  • Allocates the value array (e.g code-generate allocation and insertion) and returns the memory address of [the start of] the array

TODO: need some construction method


PSet

An abstract class for immutable (potentially unordered) collections of values where all values are unique and of one type. Does not contain the value constructor (e.g allocate)

  • TODO: This is wrong I think. Not sure what the intended semantics of our sets is, besides uniqueness. Should we be able to access them by index?

Core Methods

(Each method has a staged version)

def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...
  • Returns the array length

def loadElement(arrayAddress: Long, item: AnyVal): Option[AnyVal]
def loadElement(arrayAddress: Code[Long], item: Hashable): Code[Optional[AnyVal]] = ...
  • The return semantics for PSet’s loadElement instance method are identical to PCanonicaArray’s loadElement instance method

PSet concrete implementations

PCanonicalSet

Signature

PSet(elementType: PType, required: Boolean = false)

Core methods

def allocate(region: Region length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...
  • Construct the value array (e.g code-generate allocation and insertion) and returns the memory address of [the start of] the array

PDict

An abstract class for immutable unordered collections of key:value pairs where keys are unique. Keys must all be of the same type, and values must all be of the same type (though can be different than the key type). Does not contain the value constructor (e.g allocate)

Core Methods

(Each method has a staged version)

def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...
  • Returns the array length

def loadElement(arrayAddress: Long, key: Hashable): Option[AnyVal]
def loadElement(arrayAddress: Code[Long], key: Hashable): Code[Optional[AnyVal]] = ...
  • Returns the key’s corresponding value, if present. In the interpreted version, uses Scala’s Option, and requires matching on Some/None. In staged version, uses Java’s Optional semantics, match on v.isNull, just like PArray’s loadElement.

PDict concrete implementations

PCanonicalDict

Signature

PCanonicalDict((keyType: PType, valueType: PType, required: Boolean = false)

Core methods

def allocate(region: Region length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...

PTuple

An abstract class for immutable ordered collections of values that may be of different types. Does not contain the value constructor (e.g allocate)

Core Methods

(Each method has a staged version)

def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...
  • Returns the array length

def loadElement(arrayAddress: Long, index: Long): Option[AnyVal]
def loadElement(arrayAddress: Code[Long], index: Long): Code[Optional[AnyVal]] = ...

PTuple concrete implementations

PCanonicalTuple

Signature

PCanonicalTuple(fields: IndexeSeq[PType], required: Boolean = false)

Core methods

def allocate(region: Region length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...

PStruct

An abstract class for immutable collections of (key, value) pairs of (potentially) different types. Keys are always strings. Values are looked up by key only. Does not contain the value constructor (e.g allocate)

Core Methods

(Each method has a staged version)

def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...
  • Returns the array length

def loadElement(arrayAddress: Long, fieldName: String): Option[AnyVal]
def loadElement(arrayAddress: Code[Long], fieldName: String): Code[Optional[AnyVal]] = ...
  • Same return value semantics as PArray with regard to missingness

PStruct concrete implementations

PCanonicalSturct

Signature

PCanonicalStruct(fields: Seq[String, PType], required: Boolean = false)

Core methods

def allocate(region: Region length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...

PLocus

A representation of a chromosomal locus, encapsulating the reference genome, chromosome (called contig in our documentation), and position.

Core Methods

(Each method has a staged version)

def reference(arrayAddress: Long): String = ...
def reference(arrayAddress: Code[Long]): Code[String] = ...

def chromosome(arrayAddress: Long): String = ...
def chromosome(arrayAddress: Code[Long]): Code[String] = ...

def position(arrayAddress: Long): Long = ...
def position(arrayAddress: Code[Long]): Code[Long] = ...

Locus concrete implementations

PCanonicalLocus

Signature

PCanonicalLocus(reference: PString, chromosome: PString, position: PInt64)

Questions:

  1. Do we need to have a value constructor for PLocus?

You put stack representations in future directions. This is something we definitely want, but it is a non-trivial change from what we have, so I want to makes sure we don’t want to include it in this proposal.

(Each method has a staged version)

I’m not sure the staged signatures are clear to me.

def loadElement(arrayAddress: Long, index: Long): (Boolean, Any)

I’m not sure Any makes sense here. I don’t like the tuple, it allocates. A tuple of Code might be OK for the staged functions, although we should probably keep them as close together as possible.

Do we want to make these generic on Numeric

What are these? I don’t understand the question.

Do we want to allow mutable arrays?

No.

Construct the value array (e.g code-generate allocation and insertion)

Can we make this concrete with a signature?

loadLength’s return value and loadElement’s index. I chose Long here to allow for large arrays, but concrete implementations of these classes could be for much smaller arrays, and therefore could use less memory to represent these values. If we don’t care then Scala’s implicit conversions will handle this (Int.scala for instance defines implicit def int2long(x: Int): Long = x.toLong)