# Randomized big linear algebra

Here are some references on the least squares and PCA algorithms I think are a good fit for Hail. While they use randomness, they give results that are at least as accurate as any other method, but are competitive with (sometimes much faster than) classical methods on a single core, and are easily parallelizable.

# Least squares

1 Like

FastPCA, a paper describing randomized PCA on genotype matrices.

Iâ€™ll give a summary of the randomized PCA method here, with the main variations we might want to evaluate, along with pointers to papers that performed empirical evaluations on some of the variations.

### Method summary

Weâ€™re given an `m x n` matrix `A`, and a number `k` of desired PCs. Assume that `k << n < m`.

Choose parameters `l` and `q`, both positive integers, where `l > k`, and `(q+1)l <= (m - k)`. Iâ€™ve only seen `k <= l <= 2k`, most often `l = k + 2`, and `0 <= q <= 10`, most often `q = 1 or 2` (though it seems genotype data may require the upper end for both). We assume matrices of dimension `n x ((q+1)l)` or smaller are small enough to work with in memory, but that `m x n` is too big and must be distributed. A will be the only `m x n` matrix in the process.

Besides the choices of parameters, there are two primary variations on the method. Iâ€™ll refer to these as â€śstandardâ€ť and â€śblanczosâ€ť (a portmanteau of â€śblock Lanczosâ€ť; yes, this seems to be how itâ€™s referred to in the literature).

#### Step 1

Form a `n x l` matrix `G` whose entries are i.i.d. standard normal variates. Note: `n x l` will be the smallest matrix we consider. Empirical tests have found this method to be robust to the distribution here, and to the quality of the RNG, so we might try using uniform variates in [-1, 1], or even simpler distributions, though I think it will be a while before this is a bottleneck.

#### Step 2

If the spectrum of `A` (the distribution of singular values) decays slowly, the simplest version of this method can perform poorly. This is definitely a concern in genetic data, where the top singular values capture the effect of ancestry, but there is still a lot of variation caused by the long tail of smaller singular values. This step uses a power iteration to exponentially amplify the large singular values and suppress the small ones. In practice, it seems only a handful of iterations are needed, though one paper using this method for GWAS uses 10 iterations (the second highest number Iâ€™ve seen is 3). Importantly, the number of iterations needed is not dependent on the size of `A`.

Form the `m x l` products
`H0 = AG`
`H1 = A (A^T H0)`
â€¦
`Hq = A (A^T H(q-1))`

This step contains most of the work of the method, but is trivially parallelizable. Suppose `A` is partitioned by rows, with `r` blocks: `A^T = (A0^T | ... | Ar^T)`. Let `G0 = G`. Then we can compute `Gj = Sum_i Ai^T Ai G(j-1)` and `Hj = A Gj`. Thus this is just `q` rounds of map-reduce.

This step produces a matrix `H` to be used in step 3. In the standard variant, we let `H = Hq`. But in the blanczos version, which I suspect we will ultimately need to ensure sufficient accuracy, we let `H = (H0 | ... | Hq)` be the concatenation of all the intermediate products, which all must be cached somehow.

There is some concern about numerical stability and under/overflow in this step. The fix is to insert a QR step between each multiplication, and use the normalized `Q` in the next step. Obviously this introduces a lot of extra work (though it doesnâ€™t change the asymptotic complexity). However, from what Iâ€™ve been able to gather, in the blanczos version the stability is less of an issue (because you keep the low powers as well). None of the evaluations of the blanczos method Iâ€™ve seen include a renormalization step. Anyways, something to be on the lookout for in our testing.

Finally we collect the matrix `H` (either `m x l` or `m x ((q+1)l)`) to work with locally.

#### Step 3

Perform a QR decomposition `H = QR` (locally, using a library function). Only `Q` is needed, which has the same dimensions as `H`. The columns of `Q` are a basis for the range of `H`, and hence an approximation of the range of `A`. The core property of this method is that `Q Q^T A` is a good low-rank approximation of `A`.

#### Step 4

Broadcast `Q`, and compute and collect the `l x n` or `((q+1)l) x n` matrix `T = Q^T A`.

#### Step 5

Compute (locally, using a library function) the SVD of `T`, `T = V'' S' W'^T`. Then, computing `V' = Q V''`, it follows that `V' S' W'^T = Q Q^T A` is an approximate SVD of `A`.

Finally, let `V = V'(:, :k)`, `S = S'(:k, :k)`, and `W = W'(:, :k)`. Then `V S W^T` is the (approximate) truncated SVD of `A` we were after. Depending on the application, we may not need to materialize all these pieces.

### References

These all include empirical evaluations that might be good for us to try to reproduce.

Actually, I take this back, thereâ€™s no need to store these in a distributed cache.

For the moment, weâ€™re assuming that `H` is small enough to store locally. So we can just collect the `Hj`s and keep them locally while we do the rest of the distributed iterations.

So the partition containing `Ai` will compute both `Ai G(j-1)` and `Ai^T Ai G(j-1)`, collecting all the former to make `H(j-1)`, and summing up all the later to compute `Gj`. The driver then stores `Hj` for step 3, and broadcasts `Gj` for the next iteration.

So I think the only caching concern is making sure the repeated reads of `A` arenâ€™t a bottleneck.