I just opened https://github.com/hail-is/hail/pull/3016, which introduces new iterator abstractions I think will simplify a lot of the existing iterator code. Since this potentially affects anybody who will work with iterators, I thought it worth copying the overview I wrote for the PR here. I welcome any feedback/concerns/questions.
I am proposing a new iterator abstraction that I think should be preferred to Scala Iterator
throughout most of the codebase, especially for iterators of region values. This is a low-level change, which could affect all code involving iterators, so I welcome feedback from everybody.
The new abstractions are what I called FlipbookIterator
and StagingIterator
(I’m open to name suggestions). My goal was to simplify and raise the level of abstraction of most of the iterator manipulating code in the codebase—which can be subtle and bug-prone, and difficult to read—while paying as minimal as possible a performance overhead for the abstraction. This was surprisingly subtle to find the right abstractions and get their implementation right, and my hope is that all the non-obvious iterator code will now be concentrated in a small, well tested, component.
FlipbookIterator
solves the confusing behavior where hasNext
potentially wipes out the current value. (All methods on FlipbookIterator
and StagingIterator
should obey the rule that methods defined without trailing ()
do not change the state of the iterator in any way detectable through the API.) The core interface of FlipbookIterator[A]
consists of the methods
isValid: Boolean
value: A
advance(): Unit
The metaphor is a flipbook, where when you turn the page, you no longer have access to the previous page; where you can read the current page as many times as you want (no need to copy it); and where you only know you’ve reached the end of the flipbook when you turn the page and find that the next page is empty. In FlipbookIterator
, advance()
turns the page, isValid
asks if the page you are on is non-empty, and value
gives you the value on the current page (which is an error if the page is empty).
StagingIterator
is a subtype of FlipbookIterator
which adds a bit of state to each page, together with the methods consume()
and stage()
. The bit of state on each page tracks whether the value has been “consumed” yet. consume()
can only be called on a valid page which has not yet been consumed, and marks it as consumed. Note that consume()
does not actually turn the page, so that the value just consumed is kept alive for the consumer to use. stage()
brings the iterator forward to the next state in which we are on an unconsumed page: if the current page has been consumed, it flips to the next page, but if the current page has not been consumed it does nothing.
My goal with the StagingIterator
interface was to find something simple—to make it easier to write iterator code that is easily understandable and obviously correct and bug-free—and which covers most of the low-level (and bug-prone) “bookkeeping” I could find in current iterator code. The StagingIterator
interface is also needed to implement the Scala Iterator
methods, and for that reason I made it so that every FlipbookIterator
is really a StagingIterator
with a restricted interface, so that FlipbookIterator
is able to subtype (BufferedIterator
, and therefore) Iterator
.
The tiny abstract class StateMachine
can be thought of as a “naked” FlipbookIterator
, with only the core interface methods and without any runtime checking. It is intended to be used only for defining a new FlipbookIterator
or StagingIterator
, through the factory methods on their respective companion objects which take a StateMachine
.
I implemented fliterWhere
, map
, and flatMap
on FlipbookIterator
, allowing the use of for–comprehensions, and on top of that I defined all the varieties of join. Some of the join methods take OrderingView
arguments. An OrderingView
is a small abstraction on top of an Ordering
which can take one element a
, copy data (such as key-fields) from a
if necessary, then later (after a
might have been destroyed or mutated) compare a
to other elements using the copied data. The potentially producting (non-distinct assuming) join methods also take a buffer argument, which is anything that can make a copy of an iterator and then iterate over the copy multiple times.
To avoid allocating tuples in the output iterators of the join methods, I made Muple
, which is just a mutable tuple. I turned the existing JoinedRegionValue
into an alias of Muple[RegionValue, RegionValue]
.
All of this core FlipbookIterator
and StagingIterator
behavior has no dependencies on anything else in Hail, so I want to thoroughly test everything at this level, and treat it like a small external iterator library living inside the repo. As such, I think this level should be quite stable going forwards.
At the higher level, I lifted all the join methods on FlipbookIterator
to OrderedRVIterator
, which is a Iterator[RegionValue]
together with an OrderedRVDType
. I think OrderedRVIterator
should be replaced by something better soon: see future work below. I’ve replaced the old implementations of innerJoinDistinct
, leftJoinDistinct
, and orderedZipJoin
using the new infrastructure (see OrderedRDD2.scala), which I think is a good example of the kind of simplifications possible. The existing JoinSuite tests also serve as a secondary test of the new code.
In future PRs, I intend to:
- Implement all the join varieties on
OrderedRVD
using theFlipbookIterator
joins defined here. - Update other uses of
Iterator
to use theFlipbookIterator
infrastructure, if doing so improves the code (and I expect it will, or I’ve done something wrong) - Beef up the
FlipbookIterator
test suite. I could do that for this PR if requested.