Merging Multiple Scans


#1

Sometimes a query involves multiple scans of the same large dataset, with no dependency
relationship forcing the two scans to be separate.

If the dataset is big, it might be best to combine the two scans into a single scan that
accesses the union of all required rows and the union of all required fields. This is difficult
to express in a pure-pull Volcano execution strategy. The most general model of evaluation
would have nodes which could take multiple inputs (e.g. for a join) and/or produce multiple
outputs (e.g. a merged scan producing several outputs for unrelated downstream processing).
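The field-union/row-union idea can be sketched concretely. Below is a minimal, hypothetical Python sketch (the `ScanRequest`/`merge_scans` names are illustrative, not Hail API): two unrelated queries over the same table are merged into one scan that reads the union of their fields and keeps any row that at least one query needs; each consumer then re-applies its own filter and projection to the shared stream.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

@dataclass(frozen=True)
class ScanRequest:
    """Hypothetical description of one scan: the fields it reads and a
    predicate selecting the rows it needs."""
    fields: FrozenSet[str]
    predicate: Callable[[Dict], bool]

def merge_scans(requests: List[ScanRequest]):
    """Combine several scans of the same dataset into one: the union of
    required fields, and the OR of the row predicates."""
    merged_fields = frozenset().union(*(r.fields for r in requests))
    def merged_predicate(row):
        return any(r.predicate(row) for r in requests)
    return merged_fields, merged_predicate

# Two unrelated queries against the same table.
q1 = ScanRequest(frozenset({"chrom", "pos"}), lambda r: r["chrom"] == "1")
q2 = ScanRequest(frozenset({"pos", "qual"}), lambda r: r["qual"] > 30)

fields, keep = merge_scans([q1, q2])
table = [
    {"chrom": "1", "pos": 100, "qual": 10},
    {"chrom": "2", "pos": 200, "qual": 50},
    {"chrom": "3", "pos": 300, "qual": 5},
]
# One pass over storage produces the shared stream; downstream
# consumers re-filter and re-project it for their own query.
shared = [{f: row[f] for f in fields} for row in table if keep(row)]
```

Note the cost trade-off this makes explicit: each consumer sees some rows and fields it did not ask for, in exchange for the dataset being read from storage only once.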

But this concept becomes more interesting in a future always-on Hail service, where
different users may be running different queries against the same huge dataset. In that
case the merge-multiple-unrelated-scans-of-huge-dataset-into-one-scan optimization
can be applied across queries from different users and different Hail sessions,
yielding more useful processing per TB of storage read. It's like carpooling for
queries: each car moves more people to their destination.

This introduces a throughput-vs-latency scheduling policy decision at the higher
levels: when a query arrives that requires a scan of a huge dataset, how long do we wait
for other queries that can piggyback on the same scan? But that's a really good
problem to have, compared to being stuck with "each query has to do its own scan",
when the time and money costs of each scan are substantial.
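One simple form of that policy is a batching window: the first request for a dataset opens a short window, later requests piggyback, and when the window closes a single merged scan runs for the whole batch. A minimal sketch, assuming a hypothetical `ScanBatcher` (not an existing Hail component) and a caller-supplied `run_merged_scan` callback:

```python
import threading
import time

class ScanBatcher:
    """Hypothetical sketch: trade latency for throughput by waiting up to
    `window_s` seconds for more requests against the same dataset, then
    running one merged scan for the whole batch."""

    def __init__(self, window_s, run_merged_scan):
        self.window_s = window_s
        self.run_merged_scan = run_merged_scan
        self.pending = {}               # dataset -> list of requests
        self.lock = threading.Lock()

    def submit(self, dataset, request):
        with self.lock:
            batch = self.pending.setdefault(dataset, [])
            batch.append(request)
            if len(batch) == 1:
                # First request opens the window; later ones piggyback.
                threading.Timer(self.window_s, self._flush, (dataset,)).start()

    def _flush(self, dataset):
        with self.lock:
            batch = self.pending.pop(dataset, [])
        if batch:
            self.run_merged_scan(dataset, batch)

# Demo: two queries arrive within the window and share one scan.
results = []
batcher = ScanBatcher(window_s=0.05,
                      run_merged_scan=lambda ds, reqs: results.append((ds, reqs)))
batcher.submit("huge_dataset", "query_a")
batcher.submit("huge_dataset", "query_b")
time.sleep(0.2)  # let the window close and the merged scan run
```

The window length is the policy knob: zero recovers "each query does its own scan", while a longer window raises per-scan utilization at the cost of added latency for the first query in each batch.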

At some scale the economics and scheduling potentially become much more like air travel:
you can charter a plane, but it's very expensive, so more often you buy a ticket on a
regularly-scheduled flight along with a lot of other people, with all the attendant
issues of filling as many seats as possible while tweaking pricing to spread demand
away from peak periods.