Future - Numpy analyses on groups

Quick summary of discussion w/ Jon on 2018-05-10

There are a whole lot of different methods that users might want to run on smallish groups, with parallel independent processing of many groups. If we try to add these piecemeal, then there would be no end to it … so it would be really nice if for this kind of
parallel analysis-of-many-smallish-groups we could run anything that already exists in
Numpy - and of course get the results back into Hail for further processing.

It turns out that both CPython and Cython have a Global Interpreter Lock (GIL) and that
both the Python runtime and extension packages may make assumptions about holding the GIL. Everyone agrees to release the GIL while waiting for IO; and Cython
allows threads to explicitly release the GIL - I think that can be used roughly when you’re sure you’re doing pure compute without any object allocate/free or changes in object references. But if our intention is to run lots of different Numpy-based
extensions, then we shouldn’t rely on changing the GIL behavior.

After some discussion of possible ways to run multiple Numpy methods in parallel within a single Cython runtime instance, it seems this is probably a can of worms and
that a more promising approach is to run each method in a separate process, with
data structures (ndarray’s) visible to both Hail and Numpy being (somehow, waving
hands …) allocated in shared-memory segments.

Orthogonal to this is the question of how to get good fast linear algebra (and ML) for
big data. A couple of relevant links here:

Sandia National Lab “Kokkos”

Samsung “dMATH”

Also noting here that users can already “run” Python stack on workers from Hail via PySpark: http://discuss.hail.is/t/lasso-elastic-net-regression-on-blocks-of-variants/492/5

But from @cseed: Spark’s PythonRDD invokes python on each worker and serializes records from the RDD to Python and back. At least in earlier versions, this was notoriously slow. Wes had some demos using Arrow for serialization where he sped things up like 100x, but still was far from optimal (shared layouts and shared memory).