New iterator abstractions

I just opened https://github.com/hail-is/hail/pull/3016, which introduces new iterator abstractions I think will simplify a lot of the existing iterator code. Since this potentially affects anybody who will work with iterators, I thought it worth copying the overview I wrote for the PR here. I welcome any feedback/concerns/questions.

I am proposing a new iterator abstraction that I think should be preferred to Scala Iterator throughout most of the codebase, especially for iterators of region values. This is a low-level change, which could affect all code involving iterators, so I welcome feedback from everybody.

The new abstractions are what I called FlipbookIterator and StagingIterator (I’m open to name suggestions). My goal was to simplify and raise the level of abstraction of most of the iterator manipulating code in the codebase—which can be subtle and bug-prone, and difficult to read—while paying as minimal as possible a performance overhead for the abstraction. This was surprisingly subtle to find the right abstractions and get their implementation right, and my hope is that all the non-obvious iterator code will now be concentrated in a small, well tested, component.

FlipbookIterator solves the confusing behavior where hasNext potentially wipes out the current value. (All methods on FlipbookIterator and StagingIterator should obey the rule that methods defined without trailing () do not change the state of the iterator in any way detectable through the API.) The core interface of FlipbookIterator[A] consists of the methods

  • isValid: Boolean
  • value: A
  • advance(): Unit

The metaphor is a flipbook, where when you turn the page, you no longer have access to the previous page; where you can read the current page as many times as you want (no need to copy it); and where you only know you’ve reached the end of the flipbook when you turn the page and find that the next page is empty. In FlipbookIterator, advance() turns the page, isValid asks if the page you are on is non-empty, and value gives you the value on the current page (which is an error if the page is empty).

StagingIterator is a subtype of FlipbookIterator which adds a bit of state to each page, together with the methods consume() and stage(). The bit of state on each page tracks whether the value has been “consumed” yet. consume() can only be called on a valid page which has not yet been consumed, and marks it as consumed. Note that consume() does not actually turn the page, so that the value just consumed is kept alive for the consumer to use. stage() brings the iterator forward to the next state in which we are on an unconsumed page: if the current page has been consumed, it flips to the next page, but if the current page has not been consumed it does nothing.

My goal with the StagingIterator interface was to find something simple—to make it easier to write iterator code that is easily understandable and obviously correct and bug-free—and which covers most of the low-level (and bug-prone) “bookkeeping” I could find in current iterator code. The StagingIterator interface is also needed to implement the Scala Iterator methods, and for that reason I made it so that every FlipbookIterator is really a StagingIterator with a restricted interface, so that FlipbookIterator is able to subtype (BufferedIterator, and therefore) Iterator.

The tiny abstract class StateMachine can be thought of as a “naked” FlipbookIterator, with only the core interface methods and without any runtime checking. It is intended to be used only for defining a new FlipbookIterator or StagingIterator, through the factory methods on their respective companion objects which take a StateMachine.

I implemented fliterWhere, map, and flatMap on FlipbookIterator, allowing the use of for–comprehensions, and on top of that I defined all the varieties of join. Some of the join methods take OrderingView arguments. An OrderingView is a small abstraction on top of an Ordering which can take one element a, copy data (such as key-fields) from a if necessary, then later (after a might have been destroyed or mutated) compare a to other elements using the copied data. The potentially producting (non-distinct assuming) join methods also take a buffer argument, which is anything that can make a copy of an iterator and then iterate over the copy multiple times.

To avoid allocating tuples in the output iterators of the join methods, I made Muple, which is just a mutable tuple. I turned the existing JoinedRegionValue into an alias of Muple[RegionValue, RegionValue].

All of this core FlipbookIterator and StagingIterator behavior has no dependencies on anything else in Hail, so I want to thoroughly test everything at this level, and treat it like a small external iterator library living inside the repo. As such, I think this level should be quite stable going forwards.

At the higher level, I lifted all the join methods on FlipbookIterator to OrderedRVIterator, which is a Iterator[RegionValue] together with an OrderedRVDType. I think OrderedRVIterator should be replaced by something better soon: see future work below. I’ve replaced the old implementations of innerJoinDistinct, leftJoinDistinct, and orderedZipJoin using the new infrastructure (see OrderedRDD2.scala), which I think is a good example of the kind of simplifications possible. The existing JoinSuite tests also serve as a secondary test of the new code.

In future PRs, I intend to:

  • Implement all the join varieties on OrderedRVD using the FlipbookIterator joins defined here.
  • Update other uses of Iterator to use the FlipbookIterator infrastructure, if doing so improves the code (and I expect it will, or I’ve done something wrong)
  • Beef up the FlipbookIterator test suite. I could do that for this PR if requested.