Linear Algebra Syntax Example: PC Relate


#1

Reposting from https://github.com/hail-is/hail/pull/5270

@cseed @jbloom @akotlar @pschultz

Continuing in the vein of Tensor Types https://github.com/hail-is/hail/pull/5190, I’m not sure how to disseminate these thoughts and receive criticism. I like the idea of checking in files so we can collaborate on them and track history.

I think implementing pc_relate in this manner should be an explicit target of the linear algebra project.

import hail as hl

def gram(x):
    return x @ x.T

def pc_relate(mt, k):
    # scores is samples-by-k matrix
    _, scores, _ = hl.hwe_normalized_pca(mt.gt, k)

    # mean impute missing data
    mt = mt.select_entries(x = mt.gt.n_alt_alleles / 2)
    mt = mt.annotate_rows(mean = hl.agg.mean(mt.x))
    mt = mt.select_entries(x = hl.or_else(mt.x, mt.mean))
    g = mt.to_ndarray(mt.x)

    # X is k-by-variants matrix
    X = hl.lstsq(scores, g.T)

    mu = scores @ X

    centered = g - mu

    variance = mu * (1.0 - mu)
    stddev = hl.sqrt(variance)

    phi = gram(centered) / gram(stddev)

    # self_kin is a 1-tensor of length samples, it is a measure of how related
    # each sample is to itself under our model. The "inbreeding coefficient" is
    # given by (2 * self_kin - 1).
    self_kin = hl.diagonal(phi)

    dominance = (hl.case()
        .when(g == 0.0, mu)
        .when(g == 0.5, 0.0)
        .default(1.0 - mu))

    # I want to scale the variance (sample-by-variant) for each sample by the
    # sample's self-kinship. Numpy-style broadcasting requires that I use the
    # double transpose, thoughts?
    normalized_dominance = dominance - (variance.T * self_kin).T

    ibd2 = gram(normalized_dominance) / gram(variance)

    ibd0_for_close_kins = 1.0 - 4.0 * phi + k2

    # calculate identity-by-state-0
    ibs0 = hl.inner_product(
               g.T, g,
               lambda l, r: hl.agg.count(hl.abs(l - r) == 1.0))

    temp = (mu * mu) @ ((1.0 - mu) * (1.0 - mu))
    ibd0_for_distant_kins = ibs0 / (temp + temp.T)

    ibd0 = hl.cond(phi < pow(2, -5/2),
                   ibd0_for_close_kins,
                   ibd0_for_distant_kins)

    ibd1 = 1.0 - ibd2 - ibd0

    return (phi, ibd0, ibd1, ibd2)

#2

Reposting @cseed’s comments.

  1. I feel like like this was what dev Discourse was intended for. Should it go there?

  2. I’m not sure what the standards for going in are. If you’re happy, I’ll approve. The thing is, if we want to maintain the discussion, having the file checked in but the discussion on Github doesn’t seem ideal.

g = mt.to_ndarray(mt.x)

Is this hl.to_ndarray? Right now we have
BlockMatrix.from_entry_expr. mt.x.to_ndarray() seems nice, too.

# X is k-by-variants matrix
  X = hl.lstsq(scores, g.T)

Is linear algebra going to be in its own sub-module? I feel top-level is going
to get crowded.

I don’t think we want to import our linear algebra stuff as np, since
collected local matrices will be actual numpy matrices and the users are going
to want to still use numpy.

# self_kin is a 1-tensor of length samples, it is a measure of how related

A 1-tensor is normally called a vector.

dominance = (hl.case()
      .when(g == 0.0, mu)
      .when(g == 0.5, 0.0)
      .default(1.0 - mu))

Floating point equality makes me uncomfortable

# I want to scale the variance (sample-by-variant) for each sample by the
# sample's self-kinship. Numpy-style broadcasting requires that I use the
# double transpose, thoughts?
normalized_dominance = dominance - (variance.T * self_kin).T

You can use reshape to extend where you want:

>>> a = np.array([1, 2])
>>> a.reshape(1, 2)
array([[1, 2]])
>>> a.reshape(2, 1)
array([[1],
       [2]])

#3

Hmm. Let’s try it. My gut says discuss is not a good medium for collaborating on and discussing a document be it code or otherwise.

For a design doc, I think the people responsible for the design should agree on it before it’s merged. Those people seem to be you, me, Daniel G, @jbloom, @akotlar, & @pschultz.


Ah, good point. We’ve focused on functions over methods. This should be hl.ndarray(mt.x). I’m not sure I want mt.x.to_ndarray(). We have avoided adding such functions.


Ah, good question. Maybe hl.tensor.lstsq? And we can standardize on import hail.tensor as hlt.


:+1:


Hmm yeah. I explicitly work in powers of two here. Let me see what it would look like if I have two g matrices: before and after mean imputation.


Ah, I had not considered this approach. I don’t like specifying the dimensions explicitly.


#4

Here’s an updated proposal.

  • hlt.contract is how I specify the dimensions over which to aggregate
  • tensors have named dimensions t.dims.row, @pschultz and I talked briefly about this and I really like the idea. The alternative seems to be hlt.contract(1, hl.agg.sum(n_alts)).
  • I imagine hlt.contract can also take an iterable of dimensions: hlt.contract(n_alts.dims, hl.agg.sum(n_alts)) sums over all the entries of n_alts

I feel like I’m using an unusual definition for contract here. I don’t know how to define @ with hlt.contract.

Instead, we need two operations with different names and signatures. The first permits dimension elimination by aggregating along said dimension. The second one is the standard meaning of tensor contraction.

def contract(dims, agg_expr):
    pass
def inner_product(dim, left, right, lambda l, r: agg_expr):
    pass
# examples
row_mean = hlt.contract(g.dims.row, hl.agg.mean(g))
gram = hlt.inner_product(g.variant, g, g, lambda l, r: hl.agg.sum(l * r))

I think we define __matmul__ like this:

def __matmul__(self, r):
    inner = self.dims[len(self.dims) -1]
    r2 = r.rename(**{r.dims[0]: inner})
    return hlt.inner_product(inner, self, r2, lambda l, r: hl.agg.sum(l * r))

import hail as hl
import hail.tensor as hlt

def gram(x):
    return x @ x.T

def pc_relate(mt, k):
    # scores.shape = [samples, k]
    _, scores, _ = hl.hwe_normalized_pca(mt.gt, k)

    # n_alts.shape = [samples, variants]
    n_alts = hlt.ndarray(mt.gt.n_alt_alleles()).T
    # row_mean.shape = [variants]
    variant_mean = hlt.contract(n_alts.dims.variants, hl.agg.sum(n_alts))
    # mean impute missing data (variant_mean is broadcasted)
    # g.shape = [samples, variants]
    g = hl.or_else((n_alts / 2), variant_mean)

    # X.shape = [k, variants]
    X = hlt.lstsq(scores, g.T)
    # mu.shape = [samples, variants]
    mu = scores @ X

    centered = g - mu

    variance = mu * (1.0 - mu)
    stddev = hl.sqrt(variance)

    # phi = [samples, samples]
    phi = gram(centered) / gram(stddev)

    # self_kin is a vector (1-tensor) of length samples, it is a measure of how
    # related each sample is to itself under our model. The "inbreeding
    # coefficient" is given by (2 * self_kin - 1).
    self_kin = hlt.diagonal(phi)

    # dominance.shape = [samples, variants]
    dominance = (hl.switch(n_alts)
                 .when(0, mu)
                 .when(1, 0.0)
                 .default(1.0 - mu))

    # I want to scale the variance (sample-by-variant) for each sample by the
    # sample's self-kinship. Numpy-style broadcasting requires that I use the
    # double transpose, thoughts?
    normalized_dominance = dominance - (variance.T * self_kin).T

    ibd2 = gram(normalized_dominance) / gram(variance)

    ibd0_for_close_kins = 1.0 - 4.0 * phi + k2

    # calculate identity-by-state-0
    ibs0 = hlt.inner_product(
               g.variants
               g, g,
               lambda l, r: hl.agg.count(hl.abs(l - r) == 1.0))

    temp = (mu * mu) @ ((1.0 - mu) * (1.0 - mu))
    ibd0_for_distant_kins = ibs0 / (temp + temp.T)

    ibd0 = hl.cond(phi < pow(2, -5/2),
                   ibd0_for_close_kins,
                   ibd0_for_distant_kins)

    ibd1 = 1.0 - ibd2 - ibd0

    return (phi, ibd0, ibd1, ibd2)

#5

tensors have named dimensions

This might be relevant: http://nlp.seas.harvard.edu/NamedTensor.

I had wanted to implement the numpy interface to start, but it looks like this is starting to diverge. That might be OK, but we should be conscious about the decision. For example, the numpy way would be to use np.where instead of our switch/when/default.


#6

Yeah, I found that “Tensor considered harmful” post and shared it with Dan when we talked about the tensor interface. I also found http://xarray.pydata.org/, which I really like. Their data model is very close to the picture that’s been forming in my mind of what TensorTables might look like. Since they are built on numpy and pandas, and are trying to fit in to that ecosystem like we are, I think they could be a good source of ideas.

Here’s a rewrite of Dan’s PCRelate in an xarray style.

import hail as hl
import hail.tensor as hlt

def dot(self, r, dims=None):
    if dims is None:
        dims = [dim for dim in self.dims if dim in r.dims]
    return (self * r2).sum(dims)

def __matmul__(self, r):
    inner = self.dims[len(self.dims) -1]
    r2 = r.rename({r.dims[0]: inner})
    return self.dot(r2)

def gram(x, inner):
    return x.dot(x, inner)

def pc_relate(mt, k):
    # scores.dims = ['samples', 'k']
    _, scores, _ = hl.hwe_normalized_pca(mt.gt, k)

    # n_alts.dims = ['variants', 'samples']
    n_alts = hlt.ndarray(mt.gt.n_alt_alleles())
    # variant_mean.dims = ['variants']
    variant_mean = n_alts.sum('samples')
    # mean impute missing data (variant_mean is broadcasted)
    # g.dims = ['variants', 'samples']
    g = (n_alts / 2).where(n_alts.notnull(), variant_mean)
    # or pandas style (n_alts / 2).combine_first(variant_mean)

    # X.dims = ['k', 'variants']
    X = hlt.lstsq(scores, g)
    # mu.dims = ['variants', samples]
    mu = scores @ X

    centered = g - mu