Memory efficient spearman correlation on sparse matrices
Utilizing a simple property of covariances to minimize memory usage
An important property of the covariance function is that it is invariant under shifts, i.e., for any two random variables and , you get the same covariance if you add constant quantitie to either $X$ or $Y$:
where and are real valued quantities. Essentially is a measure of the product of how much $\mathbf{X}$ and $\mathbf{Y}$ are deviating from their respective means so adding a constant does not change anything (because the deviation from the mean remains the same).
Two commonly used correlations are
The spearman correlation on the other hand asseses if the relationship between $\textit{X,Y}$ is monotonic (either increasing or decreasing). It is equivalent to running pearson correlation between the ranks of values in $X$ and $Y$ instead of the actual value themselves. So it essentially asks if $X$ is increasing (decreasing) would values in $Y$ would be increasing (decreasing) as well? A perfect score of 1 (-1) would result in a yes (no). Both the types of correlation are often employed in genomics to assess relationship between two variables of interest.
One particular context, where correlations are employed is in multi-omics experiments, say where we are profiling RNA and open chromatin regions (ATAC) in the same cells. For example,
a recent study used correlations to find potential gene-enhancer links (Ma et al., 2020). The idea is simple: we have a bunch of cells
in which we simultaneously profiled both the transcriptome (RNA) and the open chromatin regions (ATAC). We then ask, for each gene, which open chromatin regions are highly correlated (after
necessary adjustment for background) to predict potential gene-enhancer links. The default correlation function in R
cor(RNA, ATAC, method="pearson")
or cor(RNA, ATAC, method="spearman")
would ideally be sufficient to do this. Here, RNA
and ATAC
are vectors of equal length with entries summarizing the transcriptome signal and ATAC signal at a gene and potential enhancer, respectively.
However, both RNA and ATAC matrices are often sparse matrices, i.e. they have lots of entries that zeroes,
which are not explicitly stored to save space. The default cor()
method does not work on sparse matrices. The problem here is a simple one then: convert the RNA and ATAC sparse matrices to a usual (dense) matrix using as.matrix()
and run the correlation function. However, converting to denser matrix format will take loads of memory, especially if you are searching for link between 10,000 genes and say only about 5,000
potential enhancers in around 10,000 cells all at once, parallely.
Sparsity makes it easier
The solution to avoid this is rather easy and has been previously discussed for pearson correlation.
A detailed description is available in the documentation of qlcMatrix::corSparse()
.
But in short, the idea is to utilize the sparsity in a vector and avoid doing operations that would make a sparse matrix dense. For example, the variance calulation for a sparse vect the essential idea here is that we do not want to lose the sparsity
structure during our calculations. For example, for a sparse vector, if we are interested in calculating the variance $Var(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]$, if we do the $X-\mathbb{E}[X]$ operation first,
the sparsity structure of X is now destroyed and we land up with a dense matrix. Instead, we can use the fact that the variance can equivalentyl be written as $\text{Var}(X) = \mathbb{E}[X^2] - E[X]^2$, retaining the sparity
throughout. That solves our problem of calculating pearson correlation on sparse vectors (or matrices).
The next question is then, what about sparse matrices and spearman correlation? cor(X, Y, kind="spearman")
does not work for sparse matrices and we do not want to convert them to dense form.
The solution is again simple, but took me a while to figure out. A naive idea would be to use the definition of spearman correlation - we calculate ranks of $X$ and $Y$ and then run it through cor()
with method="spearman"
as the ranks are not sparse. The problem however is again the same - the rank matrix is not sparse. But if you think about ranks in a sparse matrix, it does have some
interesting properties that we can utilize to make it sparse.
We can look at a sparse vector for an example. Consider a vector y <- c(0,0,0,42,21,10)
with 3 non zero entries. We will use $n_z$ to denote the number of non-zero
entries in a vector. But if we know the number of non-zero entries, we also know what these ranks are going to be - they are fixed.
For a vector with $n_z$ entries, the rank(ties.method="average")
method will set them all to $\frac{1}{n_z}\sum_{i=1}^{n_z} i = \frac{(n_z+1)}{2}$. We also know that the lowest non-zero entry in such a vector would have a
rank of $(n_z+1)$. For example, rank vector rank(y) = c(2,2,2,6,5,4)
- by default the ranks of tied entries are averaged. So the rank of 0s is $\frac{1+2+3}{3}= \frac{(n_z+1)}{2}$. Our rank vector
is not sparse, but we can retain its sparsity if we were to subtract $\frac{(n_z+1)}{2}$ from each of the entries. Since a shift operation will not change the (co)variance, the variance of
c(0,0,0,4,3,2)
which we called the “sparsified rank vector” is the same as original rank vector c(2,2,2,6,5,4)
. So we should aim to get our “sparsified rank” vector somehow.
The trick to arrive at “sparsified rank” vector is to use calculate ranks on the non-zero entries in our vector. We will forget about the zero entries in such a vector and only focus on the non-zero entries - they are few and it is fast to calculate ranks of just these.
In this version of the vector (where there are no zeros) the lowest non-zero entry has a rank of $1$ (assuming there are no ties, but the following arguments hold without loss of generality). To arrive at the “sparsified rank” vector,
we subtracted $\frac{(n_z+1)}{2}$ from the original rank vector, so the non-zero entry’s rank will now be $n_z + 1 - \frac{(n_z+1)}{2} = 1 + \frac{n_z}{2} - \frac{1}{2}$ which is equivalent to adding $\frac{(n_z-1)}{2}$ to the rank of the
non-zero entries! By this way, we retain the sparsity in ranks and can then just use corSparse()
to calculate pearson correlation on sparsified rank vectors, resulting in spearman correlation.
While this approach is memory efficient, it unfortunately is not always the fastest. See this notebook for some time benchmarks. I did not explicitly
perform memory benchmarks.
Update: The approach is both memory efficient and fast. See an updated post and associated notebook
Example
y <- c(0,0,0,42,21,10)
rank(y) = c(2,2,2,6,5,4)
sparsified_rank(y) <- c(0,0,0,4,3,2)
(Subtract $\frac{(n_z+1)}{2}=2$ from all entries to make the previous vector a sparse vector)
rank(y[y!=0]) = rank(c(42,21,10)) = c(3,2,1)
.
If we now add $\frac{(n_z-1)}{2} = \frac{(3-1)}{2}$ to all the entries of the last vector, we get c(4,3,2)
which are the non-zero ranks
from our <code<sparisifed_rank</code> vector which will be the input to corSparse
.
- Ma, S., Zhang, B., LaFave, L. M., Earl, A. S., Chiang, Z., Hu, Y., Ding, J., Brack, A., Kartha, V. K., Tay, T., & others. (2020). Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell, 183(4), 1103–1116.