What are Physical Types?
Physical types are the classes that create in-memory representations (or code generate the in-memory representations), of Hail Types (Virtual Types). They serve as the implementations of Virtual Types.
Where possible Physical Type behavior should follow Python type behavior.
This proposal deals with the architectural goals of the PType implementation for 2020 Q1.
Motivation:
- Improve performance by building specialized memory representations for data (improve developer velocity / enable performance optimizations in the future).
Project technical goals:
- Abstract PType interfaces define code-generation and interpretation primitives (for example, PCanonicalArray concretely implements the PArray interface).
- Remove requiredness from virtual types
Future directions:
- Introduce the following invariant in the codebase: All region methods / Memory methods are used only in the ptypes hierarchy when dealing with values of Hail types.
Conceptual class hierarchy:
PNDArray
- Specialized implementations (canonical/non)
PCall
- Specialized implementations (canonical/non)
PInterval
- Specialized implementations (canonical/non)
PFloat32
PFloat64
PInt32
PInt64
PString
PBinary
PVoid
PArray
An abstract class for immutable ordered collections where all elements are of a single type. Does not contain the value constructor (e.g allocate)
Core Methods
(Each method has a staged version)
def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...
def isElementMissing(arrayAddress: Long, index: Int): Boolean= ...
def isElementMissing(arrayAddress: Long, index: Code[Int]): Code[Boolean] = ...
def loadElementAddress(arrayAddress: Long, index: Int): Long = ...
def loadElementAddress(arrayAddress: Code[Long], index: Code[Int]): Code[Long] = ...
- Renamed from
loadElement
because this function only returns the address of the element, not the element itself - Does not take a region instance because memory addresses are valid across regions. In the current
loadElement
signatures that take a region instance, we do not use that region instance. The only cases that a region instance would be needed is if loadElement needs to allocate memory off-heap, but this seems semantically inconsistent with loading (instead that would be a value construction, which happens in allocate)
Questions:
- Do we want to allow a range of maximum array lengths (not just 32 bit)
- Do we want to have a loadElement that returns the actual data stored at that address? Currently the caller always needs to be perform a second step, at the cost of more allocations (and the number of bytes returned will be greater for an address than any primitive besides Long and Double)
- Do we want a
loadElements
that returns an iterable? This would save the caller boilerplate: currently they need to store the length of the array, an index variable, and manually construct a while loop, check whether an element is missing, (typically over non-null elements) - Do we want, instead of isElementMissing a version of
Core methods
def allocate(region: Region length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...
- Allocates the value array (e.g code-generate allocation and insertion) and returns the memory address of [the start of] the array
TODO: need some construction method
PSet
An abstract class for immutable (potentially unordered) collections of values where all values are unique and of one type. Does not contain the value constructor (e.g allocate)
- TODO: This is wrong I think. Not sure what the intended semantics of our sets is, besides uniqueness. Should we be able to access them by index?
Core Methods
(Each method has a staged version)
def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...
- Returns the array length
def loadElement(arrayAddress: Long, item: AnyVal): Option[AnyVal]
def loadElement(arrayAddress: Code[Long], item: Hashable): Code[Optional[AnyVal]] = ...
- The return semantics for PSet’s loadElement instance method are identical to PCanonicaArray’s loadElement instance method
PSet concrete implementations
PCanonicalSet
Signature
PSet(elementType: PType, required: Boolean = false)
Core methods
def allocate(region: Region length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...
- Construct the value array (e.g code-generate allocation and insertion) and returns the memory address of [the start of] the array
PDict
An abstract class for immutable unordered collections of key:value pairs where keys are unique. Keys must all be of the same type, and values must all be of the same type (though can be different than the key type). Does not contain the value constructor (e.g allocate)
Core Methods
(Each method has a staged version)
def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...
- Returns the array length
def loadElement(arrayAddress: Long, key: Hashable): Option[AnyVal]
def loadElement(arrayAddress: Code[Long], key: Hashable): Code[Optional[AnyVal]] = ...
- Returns the key’s corresponding value, if present. In the interpreted version, uses Scala’s Option, and requires matching on Some/None. In staged version, uses Java’s Optional semantics, match on v.isNull, just like PArray’s loadElement.
PDict concrete implementations
PCanonicalDict
Signature
PCanonicalDict((keyType: PType, valueType: PType, required: Boolean = false)
Core methods
def allocate(region: Region length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...
PTuple
An abstract class for immutable ordered collections of values that may be of different types. Does not contain the value constructor (e.g allocate)
Core Methods
(Each method has a staged version)
def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...
- Returns the array length
def loadElement(arrayAddress: Long, index: Long): Option[AnyVal]
def loadElement(arrayAddress: Code[Long], index: Long): Code[Optional[AnyVal]] = ...
- Same semantics as PArray
PTuple concrete implementations
PCanonicalTuple
Signature
PCanonicalTuple(fields: IndexeSeq[PType], required: Boolean = false)
Core methods
def allocate(region: Region length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...
PStruct
An abstract class for immutable collections of (key, value) pairs of (potentially) different types. Keys are always strings. Values are looked up by key only. Does not contain the value constructor (e.g allocate)
Core Methods
(Each method has a staged version)
def loadLength(arrayAddress: Long): Long = ...
def loadLength(arrayAddress: Code[Long]): Code[Long] = ...
- Returns the array length
def loadElement(arrayAddress: Long, fieldName: String): Option[AnyVal]
def loadElement(arrayAddress: Code[Long], fieldName: String): Code[Optional[AnyVal]] = ...
- Same return value semantics as PArray with regard to missingness
PStruct concrete implementations
PCanonicalSturct
Signature
PCanonicalStruct(fields: Seq[String, PType], required: Boolean = false)
Core methods
def allocate(region: Region length: Long): Long = ...
def allocate(region: Code[Region], length: Code[Long]): Code[Long] = ...
PLocus
A representation of a chromosomal locus, encapsulating the reference genome, chromosome (called contig in our documentation), and position.
Core Methods
(Each method has a staged version)
def reference(arrayAddress: Long): String = ...
def reference(arrayAddress: Code[Long]): Code[String] = ...
def chromosome(arrayAddress: Long): String = ...
def chromosome(arrayAddress: Code[Long]): Code[String] = ...
def position(arrayAddress: Long): Long = ...
def position(arrayAddress: Code[Long]): Code[Long] = ...
Locus concrete implementations
PCanonicalLocus
Signature
PCanonicalLocus(reference: PString, chromosome: PString, position: PInt64)
Questions:
- Do we need to have a value constructor for PLocus?