Here are some references on the least squares and PCA algorithms I think are a good fit for Hail. While they use randomness, they give results that are at least as accurate as any other method, but are competitive with (sometimes much faster than) classical methods on a single core, and are easily parallelizable.
Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions
- A nice overview of the basic framework of randomized PCA methods.
An implementation of a randomized algorithm for principal component
- More details on making the above method work in practice, plus more empirical evaluation.
An algorithm for the principal component analysis of large data sets
- Another presentation of the same basic framework, specialized to the distributed case.
Optimal Principal Component Analysis in Distributed and Streaming Models
- A different approach that is able to project both dimensions down to O(k) (where k is the number of desired PCs), but is more involved and less proven.
LSRN: A Parallel Iterative Solver for Strongly Over- or Under-Determined Systems
- Paper that introduced the basic method. Improved on in next papers.
Sketching as a Tool for Numerical Linear Algebra
- A book with a good overview of the use of random projections in linear algebra. Section 2.6 improves the method in the previous paper by using a more efficient family of random projections.
The Chebyshev iteration revisited
- Overview of the Chebyshev iterative method for least squares. This is a very old method which has not seen much use, since it’s preconditions are hard to satisfy, but which we can use because we have tight bounds on the singular values of the preconditioned matrix.