I just opened https://github.com/hail-is/hail/pull/3016, which introduces new iterator abstractions I think will simplify a lot of the existing iterator code. Since this potentially affects anybody who will work with iterators, I thought it worth copying the overview I wrote for the PR here. I welcome any feedback/concerns/questions.
I am proposing a new iterator abstraction that I think should be preferred to Scala
Iterator throughout most of the codebase, especially for iterators of region values. This is a low-level change, which could affect all code involving iterators, so I welcome feedback from everybody.
The new abstractions are what I called
StagingIterator (I’m open to name suggestions). My goal was to simplify and raise the level of abstraction of most of the iterator manipulating code in the codebase—which can be subtle and bug-prone, and difficult to read—while paying as minimal as possible a performance overhead for the abstraction. This was surprisingly subtle to find the right abstractions and get their implementation right, and my hope is that all the non-obvious iterator code will now be concentrated in a small, well tested, component.
FlipbookIterator solves the confusing behavior where
hasNext potentially wipes out the current value. (All methods on
StagingIterator should obey the rule that methods defined without trailing
() do not change the state of the iterator in any way detectable through the API.) The core interface of
FlipbookIterator[A] consists of the methods
The metaphor is a flipbook, where when you turn the page, you no longer have access to the previous page; where you can read the current page as many times as you want (no need to copy it); and where you only know you’ve reached the end of the flipbook when you turn the page and find that the next page is empty. In
advance() turns the page,
isValid asks if the page you are on is non-empty, and
value gives you the value on the current page (which is an error if the page is empty).
StagingIterator is a subtype of
FlipbookIterator which adds a bit of state to each page, together with the methods
stage(). The bit of state on each page tracks whether the value has been “consumed” yet.
consume() can only be called on a valid page which has not yet been consumed, and marks it as consumed. Note that
consume() does not actually turn the page, so that the value just consumed is kept alive for the consumer to use.
stage() brings the iterator forward to the next state in which we are on an unconsumed page: if the current page has been consumed, it flips to the next page, but if the current page has not been consumed it does nothing.
My goal with the
StagingIterator interface was to find something simple—to make it easier to write iterator code that is easily understandable and obviously correct and bug-free—and which covers most of the low-level (and bug-prone) “bookkeeping” I could find in current iterator code. The
StagingIterator interface is also needed to implement the Scala
Iterator methods, and for that reason I made it so that every
FlipbookIterator is really a
StagingIterator with a restricted interface, so that
FlipbookIterator is able to subtype (
BufferedIterator, and therefore)
The tiny abstract class
StateMachine can be thought of as a “naked”
FlipbookIterator, with only the core interface methods and without any runtime checking. It is intended to be used only for defining a new
StagingIterator, through the factory methods on their respective companion objects which take a
FlipbookIterator, allowing the use of for–comprehensions, and on top of that I defined all the varieties of join. Some of the join methods take
OrderingView arguments. An
OrderingView is a small abstraction on top of an
Ordering which can take one element
a, copy data (such as key-fields) from
a if necessary, then later (after
a might have been destroyed or mutated) compare
a to other elements using the copied data. The potentially producting (non-distinct assuming) join methods also take a buffer argument, which is anything that can make a copy of an iterator and then iterate over the copy multiple times.
To avoid allocating tuples in the output iterators of the join methods, I made
Muple, which is just a mutable tuple. I turned the existing
JoinedRegionValue into an alias of
All of this core
StagingIterator behavior has no dependencies on anything else in Hail, so I want to thoroughly test everything at this level, and treat it like a small external iterator library living inside the repo. As such, I think this level should be quite stable going forwards.
At the higher level, I lifted all the join methods on
OrderedRVIterator, which is a
Iterator[RegionValue] together with an
OrderedRVDType. I think
OrderedRVIterator should be replaced by something better soon: see future work below. I’ve replaced the old implementations of
orderedZipJoin using the new infrastructure (see OrderedRDD2.scala), which I think is a good example of the kind of simplifications possible. The existing JoinSuite tests also serve as a secondary test of the new code.
In future PRs, I intend to:
- Implement all the join varieties on
FlipbookIteratorjoins defined here.
- Update other uses of
Iteratorto use the
FlipbookIteratorinfrastructure, if doing so improves the code (and I expect it will, or I’ve done something wrong)
- Beef up the
FlipbookIteratortest suite. I could do that for this PR if requested.