Randomized big linear algebra

pschultz · January 16, 2020, 7:52pm

Here are some references on the least squares and PCA algorithms I think are a good fit for Hail. While they use randomness, they give results that are at least as accurate as any other method, but are competitive with (sometimes much faster than) classical methods on a single core, and are easily parallelizable.

PCA

Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions
- A nice overview of the basic framework of randomized PCA methods.
An implementation of a randomized algorithm for principal component
- More details on making the above method work in practice, plus more empirical evaluation.
An algorithm for the principal component analysis of large data sets
- Another presentation of the same basic framework, specialized to the distributed case.
Optimal Principal Component Analysis in Distributed and Streaming Models
- A different approach that is able to project both dimensions down to O(k) (where k is the number of desired PCs), but is more involved and less proven.

Least squares

LSRN: A Parallel Iterative Solver for Strongly Over- or Under-Determined Systems
- Paper that introduced the basic method. Improved on in next papers.
Sketching as a Tool for Numerical Linear Algebra
- A book with a good overview of the use of random projections in linear algebra. Section 2.6 improves the method in the previous paper by using a more efficient family of random projections.
The Chebyshev iteration revisited
- Overview of the Chebyshev iterative method for least squares. This is a very old method which has not seen much use, since it’s preconditions are hard to satisfy, but which we can use because we have tight bounds on the singular values of the preconditioned matrix.

johnc1231 · June 25, 2020, 2:52pm

FastPCA, a paper describing randomized PCA on genotype matrices.

pschultz · July 1, 2020, 9:36pm

I’ll give a summary of the randomized PCA method here, with the main variations we might want to evaluate, along with pointers to papers that performed empirical evaluations on some of the variations.

Method summary

We’re given an m x n matrix A, and a number k of desired PCs. Assume that k << n < m.

Choose parameters l and q, both positive integers, where l > k, and (q+1)l <= (m - k). I’ve only seen k <= l <= 2k, most often l = k + 2, and 0 <= q <= 10, most often q = 1 or 2 (though it seems genotype data may require the upper end for both). We assume matrices of dimension n x ((q+1)l) or smaller are small enough to work with in memory, but that m x n is too big and must be distributed. A will be the only m x n matrix in the process.

Besides the choices of parameters, there are two primary variations on the method. I’ll refer to these as “standard” and “blanczos” (a portmanteau of “block Lanczos”; yes, this seems to be how it’s referred to in the literature).

Step 1

Form a n x l matrix G whose entries are i.i.d. standard normal variates. Note: n x l will be the smallest matrix we consider. Empirical tests have found this method to be robust to the distribution here, and to the quality of the RNG, so we might try using uniform variates in [-1, 1], or even simpler distributions, though I think it will be a while before this is a bottleneck.

Step 2

If the spectrum of A (the distribution of singular values) decays slowly, the simplest version of this method can perform poorly. This is definitely a concern in genetic data, where the top singular values capture the effect of ancestry, but there is still a lot of variation caused by the long tail of smaller singular values. This step uses a power iteration to exponentially amplify the large singular values and suppress the small ones. In practice, it seems only a handful of iterations are needed, though one paper using this method for GWAS uses 10 iterations (the second highest number I’ve seen is 3). Importantly, the number of iterations needed is not dependent on the size of A.

Form the m x l products
H0 = AG
H1 = A (A^T H0)
…
Hq = A (A^T H(q-1))

This step contains most of the work of the method, but is trivially parallelizable. Suppose A is partitioned by rows, with r blocks: A^T = (A0^T | ... | Ar^T). Let G0 = G. Then we can compute Gj = Sum_i Ai^T Ai G(j-1) and Hj = A Gj. Thus this is just q rounds of map-reduce.

This step produces a matrix H to be used in step 3. In the standard variant, we let H = Hq. But in the blanczos version, which I suspect we will ultimately need to ensure sufficient accuracy, we let H = (H0 | ... | Hq) be the concatenation of all the intermediate products, which all must be cached somehow.

There is some concern about numerical stability and under/overflow in this step. The fix is to insert a QR step between each multiplication, and use the normalized Q in the next step. Obviously this introduces a lot of extra work (though it doesn’t change the asymptotic complexity). However, from what I’ve been able to gather, in the blanczos version the stability is less of an issue (because you keep the low powers as well). None of the evaluations of the blanczos method I’ve seen include a renormalization step. Anyways, something to be on the lookout for in our testing.

Finally we collect the matrix H (either m x l or m x ((q+1)l)) to work with locally.

Step 3

Perform a QR decomposition H = QR (locally, using a library function). Only Q is needed, which has the same dimensions as H. The columns of Q are a basis for the range of H, and hence an approximation of the range of A. The core property of this method is that Q Q^T A is a good low-rank approximation of A.

Step 4

Broadcast Q, and compute and collect the l x n or ((q+1)l) x n matrix T = Q^T A.

Step 5

Compute (locally, using a library function) the SVD of T, T = V'' S' W'^T. Then, computing V' = Q V'', it follows that V' S' W'^T = Q Q^T A is an approximate SVD of A.

Finally, let V = V'(:, :k), S = S'(:k, :k), and W = W'(:, :k). Then V S W^T is the (approximate) truncated SVD of A we were after. Depending on the application, we may not need to materialize all these pieces.

References

These all include empirical evaluations that might be good for us to try to reproduce.

A randomized algorithm for principal component analysis
One of the original sources for this method. Section 4.1 describes the standard variant, and section 4.4 describes the blanczos variant (and I believe is the source of that name). Section 5 is a brief empirical test of both variants (and one other).
An Algorithm for the Pricipal Component Analysis of Large Data Sets
An empirical evaluation of the blanczos variant in the setting where A is too big to be kept in memory.
An implementation of a randomized algorithm for principal component analysis
An empirical evaluation of the standard variant, implemented for MATLAB.
Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia
Uses the blanczos variant for GWAS. One interesting thing is that instead of using the L2 matrix norm to quantify the accuracy of the SVD (as everyone else does), they use the “mean of exlained variances” to compare its closeness to a reference SVD. This is the mean length of one of our PC vectors projected into the subspace spanned by the reference PCs.

They also claim that they needed q=10 for sufficient accuacy on real genetic data, while in simulated samples q=3 worked well.

I haven’t read this very closely, but it definitely looks worth deeper study. Thanks John for finding it!

pschultz · July 1, 2020, 10:03pm

Actually, I take this back, there’s no need to store these in a distributed cache.

For the moment, we’re assuming that H is small enough to store locally. So we can just collect the Hjs and keep them locally while we do the rest of the distributed iterations.

So the partition containing Ai will compute both Ai G(j-1) and Ai^T Ai G(j-1), collecting all the former to make H(j-1), and summing up all the later to compute Gj. The driver then stores Hj for step 3, and broadcasts Gj for the next iteration.

So I think the only caching concern is making sure the repeated reads of A aren’t a bottleneck.

Topic		Replies	Views
Linear Algebra Syntax Example: PC Relate	5	817	February 8, 2019
Seeking some input on current implementation of LD Prune method	12	994	March 12, 2018
Future - Numpy analyses on groups	1	682	May 17, 2018
Blocked MT Woes	0	587	May 22, 2020
makeSimilar: Idea for Reproducible Debugging	0	462	February 26, 2021