I'll give a summary of the randomized PCA method here, with the main variations we might want to evaluate, along with pointers to papers that performed empirical evaluations on some of the variations.

### Method summary

We're given an `m x n` matrix `A`, and a number `k` of desired PCs. Assume that `k << n < m`.

Choose parameters `l` and `q`, both positive integers, where `l > k` and `(q+1)l <= (m - k)`. I've only seen `k <= l <= 2k`, most often `l = k + 2`, and `0 <= q <= 10`, most often `q = 1 or 2` (though it seems genotype data may require the upper end for both). We assume matrices of dimension `n x ((q+1)l)` or smaller are small enough to work with in memory, but that `m x n` is too big and must be distributed. `A` will be the only `m x n` matrix in the process.

Besides the choices of parameters, there are two primary variations on the method. I'll refer to these as "standard" and "blanczos" (a portmanteau of "block Lanczos"; yes, this seems to be how it's referred to in the literature).

#### Step 1

Form an `n x l` matrix `G` whose entries are i.i.d. standard normal variates. Note: `n x l` will be the smallest matrix we consider. Empirical tests have found this method to be robust to the distribution here, and to the quality of the RNG, so we might try using uniform variates in [-1, 1], or even simpler distributions, though I think it will be a while before this is a bottleneck.
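A minimal sketch of this step in NumPy (the sizes and seed here are placeholders, not values from the method):

```python
import numpy as np

# Placeholder sizes: e.g. k = 10, l = k + 2.
n, l = 1000, 12
rng = np.random.default_rng(0)

# Step 1 as described: i.i.d. standard normal entries.
G = rng.standard_normal((n, l))

# A variant worth evaluating: uniform variates in [-1, 1].
G_uniform = rng.uniform(-1.0, 1.0, size=(n, l))
```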

#### Step 2

If the spectrum of `A` (the distribution of singular values) decays slowly, the simplest version of this method can perform poorly. This is definitely a concern in genetic data, where the top singular values capture the effect of ancestry, but there is still a lot of variation caused by the long tail of smaller singular values. This step uses a power iteration to exponentially amplify the large singular values and suppress the small ones. In practice, it seems only a handful of iterations are needed, though one paper using this method for GWAS uses 10 iterations (the second highest number I've seen is 3). Importantly, the number of iterations needed is not dependent on the size of `A`.

Form the `m x l` products

`H0 = AG`

`H1 = A (A^T H0)`

...

`Hq = A (A^T H(q-1))`

This step contains most of the work of the method, but is trivially parallelizable. Suppose `A` is partitioned by rows, with `r` blocks: `A^T = (A0^T | ... | Ar^T)`. Let `G0 = G`. Then we can compute `Gj = Sum_i Ai^T Ai G(j-1)` and `Hj = A Gj`. Thus this is just `q` rounds of map-reduce.
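A sketch of the blocked computation, with the map-reduce rounds faked as an in-memory loop over row blocks (the function name and block representation are mine, just for illustration):

```python
import numpy as np

def blocked_power_products(blocks, G, q):
    """Sketch of step 2 over row blocks A0, ..., Ar of A.

    Each round computes Gj = Sum_i Ai^T (Ai G_{j-1}) -- the per-block
    products are the "map", the sum is the "reduce". Afterwards,
    Hq = A Gq is formed by stacking the per-block products Ai Gq.
    """
    Gj = G
    for _ in range(q):
        Gj = sum(Ai.T @ (Ai @ Gj) for Ai in blocks)
    # Hq = A Gq, assembled from the row blocks.
    return np.vstack([Ai @ Gj for Ai in blocks])
```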

This step produces a matrix `H` to be used in step 3. In the standard variant, we let `H = Hq`. But in the blanczos version, which I suspect we will ultimately need to ensure sufficient accuracy, we let `H = (H0 | ... | Hq)` be the concatenation of all the intermediate products, which all must be cached somehow.

There is some concern about numerical stability and under/overflow in this step. The fix is to insert a QR step between each multiplication, and use the normalized `Q` in the next step. Obviously this introduces a lot of extra work (though it doesn't change the asymptotic complexity). However, from what I've been able to gather, stability is less of an issue in the blanczos version (because you keep the low powers as well). None of the evaluations of the blanczos method I've seen include a renormalization step. Anyway, something to be on the lookout for in our testing.
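One way the renormalized iteration could look, assuming we orthonormalize both the `m x l` and `n x l` factors each round (the exact placement of the QR steps is my reading, not something pinned down above):

```python
import numpy as np

def stabilized_power_products(A, G, q):
    """Sketch of step 2 with QR renormalization between multiplications.

    Returns [Q0, ..., Qq], where each Qj has orthonormal columns and spans
    the same subspace as Hj. The standard variant would take the last
    entry as H; the blanczos variant would concatenate all of them.
    """
    Q, _ = np.linalg.qr(A @ G)        # orthonormal replacement for H0
    Qs = [Q]
    for _ in range(q):
        F, _ = np.linalg.qr(A.T @ Q)  # renormalize the n x l factor too
        Q, _ = np.linalg.qr(A @ F)
        Qs.append(Q)
    return Qs
```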

Finally we collect the matrix `H` (either `m x l` or `m x ((q+1)l)`) to work with locally.

#### Step 3

Perform a QR decomposition `H = QR` (locally, using a library function). Only `Q` is needed, which has the same dimensions as `H`. The columns of `Q` are a basis for the range of `H`, and hence an approximation of the range of `A`. The core property of this method is that `Q Q^T A` is a good low-rank approximation of `A`.

#### Step 4

Broadcast `Q`, and compute and collect the `l x n` or `((q+1)l) x n` matrix `T = Q^T A`.

#### Step 5

Compute (locally, using a library function) the SVD of `T`, `T = V'' S' W'^T`. Then, computing `V' = Q V''`, it follows that `V' S' W'^T = Q Q^T A` is an approximate SVD of `A`.

Finally, let `V = V'(:, :k)`, `S = S'(:k, :k)`, and `W = W'(:, :k)`. Then `V S W^T` is the (approximate) truncated SVD of `A` we were after. Depending on the application, we may not need to materialize all these pieces.
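Putting steps 1–5 together, here is an end-to-end sketch of the standard variant with everything in memory (the distributed products are just plain matrix multiplies here, and the defaults `l = k + 2`, `q = 2` follow the common choices above):

```python
import numpy as np

def randomized_svd(A, k, l=None, q=2):
    """Illustrative end-to-end sketch of the standard variant.

    Not a tuned implementation: no renormalization, no blanczos
    concatenation, and all work happens on one machine.
    """
    m, n = A.shape
    if l is None:
        l = k + 2
    rng = np.random.default_rng(0)
    G = rng.standard_normal((n, l))            # step 1
    H = A @ G                                  # step 2: H0
    for _ in range(q):
        H = A @ (A.T @ H)                      #         Hj = A (A^T H_{j-1})
    Q, _ = np.linalg.qr(H)                     # step 3: basis for range of H
    T = Q.T @ A                                # step 4: l x n
    V2, S, Wt = np.linalg.svd(T, full_matrices=False)  # step 5
    V = Q @ V2                                 # V' = Q V''
    return V[:, :k], S[:k], Wt[:k, :]          # truncate to rank k
```

On a matrix of exact rank `k`, this should recover `A` up to numerical error, which makes a convenient sanity check for testing.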

### References

These all include empirical evaluations that might be good for us to try to reproduce.