Everything about these parties was simple. The people were simple. The invitations were simple - you visited each house and verbally invited everyone. The food was simple. There were no starters; a couple of large bottles of Thums Up would suffice, often served in steel glasses that came out just for this occasion. There was only one simple course - the main course of pulao (rice + sautéed veggies), poori (fried flatbread), chole (chickpeas), dessert (gulab jamun or kheer), and, well, a piece of cake. Paneer had not entered the party menu yet, though it was sometimes on the menu if you were invited home for dinner.
The music was simple and the dance was simple. Bollywood’s top four dance numbers would play on a loop from a Philips cassette player that everyone had - it was available at a discount to everyone, or so I recall. And yes, there were simple “return gifts” - a Nataraj pencil and a Non-dust eraser. Simple times.
There were no bakeries in Rawatbhata for a long time, and almost all cakes were home-baked. No one in the town had ovens! There were no microwaves either. The only electrical gadget in the kitchen used to be a mixer grinder. How do you “bake” without an “oven”? The technique uses a simple idea - use sand to provide controlled heating. Too much digression. But the cakes were simple too.
While we would get invited to all the birthdays, I do remember sneaking in uninvited to a few, once in a while. I also distinctly remember August being a month of celebrations. Of course, there was Independence Day on the 15th lined up with five birthdays in August! Five! Most other months would have one or two. Naturally, August was also my favourite month - loads of good food and return gifts. People with birthdays in August had to be extra diligent. While the simple birthday menu rarely changed, the “return gifts” required deeper thought. You did not want to be called out for repeating a return gift in the same month.
Caught in the nostalgia of food, birthdays, and 90s music, I had a question going around in my head: were the five birthdays in August a rare event? There are 365 days in a year, and assuming no day is special, there is no reason August should be a non-simple birthday month. How do I find out whether there is anything special about August? My first thought was to poll my odd group of college friends and ask them for their birthdays. However hard I try to avoid it, though, this would be a convenience sample, and I don’t think I would have learned anything mechanistic about the “why” or “how”.
What do I need to answer my question? My question can be asked more simply: when do most Indians celebrate their birthdays? If I had access to all the Aadhaar data, this question could probably be answered with a few lines of code. Aadhaar is, of course, closed for such use cases. A little bit of searching landed me on a wonderful resource: HMIS. The description on the website is self-explanatory:
This portal will be a gateway to a wealth of information regarding the health indicators of India. The information available on this portal is derived from data uploaded by the States/UTs. HMIS data is specifically designed to support planning, management, and decision making based on grading of facilities and various indicators at the Block, District, and State as well as National level.
While the emphasis is on “health indicators”, HMIS has district- and state-level data on how many births happen in private and public hospitals. Getting this data was quite an exercise and taught me several tricks for parsing Excel/HTML. After struggling disproportionately with weirdly formatted files, I could extract all the birth data between 2008 and 2020 across states. With this data in hand, I could finally answer the simple question: was August the special birthday month?
I first looked at the distribution of births. September, October, and August have the highest number of births with approximately 1.9 million average births over 2008-2020. Since births peak in these months, there is an inherent “seasonality” attached to birthdays in India. In a hypothetical world, all the months would have roughly equal births, rather than having a range from 1.38 million births (April) to 1.98 million births (September).
Taking an average can sometimes hide a tonne of information that lies in time-series data. If we look at the entire period between January 2008 and December 2020, the “seasonality” is easier to spot. From 2008 to 2020, the birth curve rises and dips over the year. The peaks happen around September/October, while April registers a deep dip.
While the data from HMIS is on births, we can infer the time of conception by simple arithmetic - subtracting 9 months. Of course, the seasonality remains intact. The peak of conceptions happens in December of the previous year or January of the same year.
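That nine-month shift is simple modular arithmetic on month numbers. A tiny sketch in Python (the function name is mine, not from the post):

```python
def conception_month(birth_month):
    """Shift a birth month (coded 1-12) back nine months, wrapping around the year."""
    return (birth_month - 9 - 1) % 12 + 1

print(conception_month(9))   # 12: September births point to December of the previous year
print(conception_month(10))  # 1: October births point to January of the same year
```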
Summarizing information at the country level is a good starting point. Overall, September is the month with the highest number of birthdays. But does that hold across all states? What about Rajasthan? I next broke it down at the state level and annotated the month with the highest number of birthdays.
From the figure above, August is the month with the highest number of birthdays in Rajasthan. But more importantly, the figure highlights the diversity that underlies India. While September, the month with the highest number of total births, is the month of peak births in 9 states, October is the peak month in 10 states. But there are also Meghalaya and Tripura which peak in January and December respectively. On the other extreme, July is the month of the least conceptions across multiple states. We can flip the births and look at the conception curve.
While my actual question was answered, and I discovered that there is heterogeneity within the country in how it celebrates birthdays, my related question of understanding why this happens “mechanistically” remained unanswered. I must admit beforehand that it is also a hard question to answer without access to a tonne of data.
Why are conceptions higher in a month? Why do they vary across states? Is it driven by the “wedding season” in the country?
I did not have access to the wedding registration data. So I asked a simple question: does temperature affect the rate of conception (and hence birth) in India? Surprisingly, getting temperature data for a city across a time span without paying anyone remains a non-trivial task. However, after a bit of a tussle and jumping a few language hoops, I was able to download the gridded temperature data from IMD, Pune.
To understand the relationship between temperature and the rate of conception, I looked at the average temperature over 2008-2020 in a state and calculated its correlation with the number of estimated conceptions. For the autumn months (October/November), the correlation is the strongest (-0.56). Even more importantly, this figure hinted at the seasonality being associated with temperature. The rate of conceptions across seasons varies as Winter > Autumn > Summer > Monsoon.
I am sure you are thinking about heterogeneity: what does this relationship look like if we focus on each state individually?
When I broke down the association analysis for each state individually and arranged the states by the strength of the correlation (between the relative percentage of conceptions every month and the mean temperature in each state), a beautiful pattern emerged. 24 out of the 28 states for which I had data show a correlation coefficient of -0.5 or lower (that is, the absolute strength of the correlation exceeds 0.5). For states like Manipur, Bihar, and Haryana, the correlation between temperature and the rate of conception is as strong as -0.91, implying that higher temperatures are associated with fewer conceptions - and, assuming this relationship is indeed causal, that a one-degree drop in temperature would lead to a 0.91% increase in conceptions. States like Jammu and Kashmir and Uttarakhand, which are usually colder, have weak associations, while Kerala, which has a tropical climate, has a stronger association with a correlation coefficient of -0.82. Thus, the association is weaker in colder climates, an observation I previously made at the season level in my previous plot.
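The per-state association itself boils down to a correlation between two 12-point series. A sketch with made-up numbers (these are hypothetical values for illustration, not the actual HMIS/IMD data):

```python
import numpy as np

# Hypothetical monthly mean temperatures (deg C) and conception shares (%) for one state;
# the numbers are made up to mimic the inverse pattern described in the text.
temperature = np.array([15, 18, 24, 30, 34, 33, 30, 29, 28, 26, 21, 16], dtype=float)
conception_pct = np.array([10.5, 9.8, 8.6, 7.2, 6.4, 6.6, 7.0, 7.4, 7.8, 8.4, 9.6, 10.7])

# Pearson correlation between the two series; strongly negative for this toy state
r = np.corrcoef(temperature, conception_pct)[0, 1]
```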
Causality is hard to prove here. On one side, we have strong associations that reproduce across states, and Occam’s razor; on the other, the golden rule that correlation does not imply causation. I wish the answer was as simple as those birthday parties.
Summer has come and passed
The innocent can never last
Wake me up when September ends
- Billie Joe Armstrong
However, the time benchmarks look awful - though we ended up saving memory, `SparseSpearmanCor()` was at least 2x slower than the naive approach of densifying the matrix and calculating the correlation using `cor(as.matrix(X.sparse), method="spearman")`. This in practice defeats the motivation - we are saving memory at the cost of speed.
The costliest step in my original implementation of `SparseSpearmanCor()` was a simple lookup operation, `which(j == column)`, where I fetch the non-zero entries in a column for calculating the rank; this happens for all the columns (`j` stores the indices of the columns where there are non-zero entries).
I tried other ways of making this faster, such as using `fastmatch`. But the actual speedup came from a simple thought - if we care about the non-zero entries, I should just deal with them separately. So instead of doing repeated lookups, I separate the non-zero entries out, do the rank sparsification operations on them, and put them back into the sparse matrix.
I call this implementation `SparseSpearmanCor2()`. You can find the implementation in the notebook, but here are some comparisons with the dense approach and the previous implementation `SparseSpearmanCor()`.
The result is a function that is roughly 10x faster than either of the other approaches on large matrices (10,000 × 5,000). `SparseSpearmanCor2()` and the time benchmarks are available in this notebook.
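To illustrate the idea (in Python rather than the post's R, with my own function name): in compressed-sparse-column storage, the non-zero values of column `k` sit contiguously in `data[indptr[k]:indptr[k+1]]`, so ranking a column's non-zeros needs no `which(j == column)`-style search at all.

```python
import numpy as np

def rank_nonzeros_csc(data, indptr):
    """Replace each column's non-zero values by their within-column ranks (1-based).
    Assumes distinct values within a column; real code would average tied ranks."""
    out = data.astype(float).copy()
    for k in range(len(indptr) - 1):
        # the non-zeros of column k are exactly out[lo:hi] - no lookup needed
        lo, hi = indptr[k], indptr[k + 1]
        col = out[lo:hi]
        ranks = np.empty(hi - lo)
        ranks[np.argsort(col)] = np.arange(1, hi - lo + 1)
        out[lo:hi] = ranks
    return out

# CSC pieces of the 3x2 matrix [[0, 7], [5, 0], [9, 3]]
data = np.array([5, 9, 7, 3])
indptr = np.array([0, 2, 4])
ranked = rank_nonzeros_csc(data, indptr)  # column-wise ranks of the non-zeros
```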
$\begin{aligned} \text{Cov}(\mathbf{X} + a, \mathbf{Y} + b) &= \text{Cov}(\mathbf{X}, \mathbf{Y}) \end{aligned}$
where $\text{Cov}(\mathbf{X}, \mathbf{Y}) = \mathbb{E}[(\mathbf{X} - \mathbb{E}[\mathbf{X}])(\mathbf{Y} - \mathbb{E}[\mathbf{Y}])]$ and $a,b$ are real-valued constants. Essentially, $\text{Cov}(\mathbf{X}, \mathbf{Y})$ is a measure of the product of how much $\mathbf{X}$ and $\mathbf{Y}$ deviate from their respective means, so adding a constant does not change anything (because the deviations from the means remain the same).
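A quick numerical sanity check of this shift invariance (a sketch with arbitrary shifts):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = 0.5 * x + rng.standard_normal(1000)

cov = np.cov(x, y)[0, 1]
cov_shifted = np.cov(x + 3.0, y - 7.0)[0, 1]  # shift both vectors by constants
# cov and cov_shifted agree up to floating-point error
```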
Two commonly used correlations are the Pearson and the Spearman correlation. The Pearson correlation inherits this shift invariance:
$\begin{aligned} \text{Cor}(\mathbf{X} + a, \mathbf{Y} + b) &= \frac{\text{Cov}(\mathbf{X}, \mathbf{Y})}{\sigma_X \sigma_Y},\\ \sigma^2_X &= \mathbb{E}[(X-\mathbb{E}[X])^2],\\ \sigma^2_Y &= \mathbb{E}[(Y-\mathbb{E}[Y])^2]. \end{aligned}$
The Spearman correlation, on the other hand, assesses whether the relationship between $X$ and $Y$ is monotonic (either increasing or decreasing). It is equivalent to running a Pearson correlation between the ranks of the values in $X$ and $Y$ instead of the actual values themselves. It essentially asks: as $X$ increases, do the values in $Y$ consistently increase (or decrease) as well? A perfect score of 1 (or -1) indicates a perfectly increasing (or decreasing) monotonic relationship. Both types of correlation are often employed in genomics to assess the relationship between two variables of interest.
One particular context where correlations are employed is multi-omics experiments, say where we are profiling RNA and open chromatin regions (ATAC) in the same cells. For example, a recent study used correlations to find potential gene-enhancer links (Ma et al., 2020). The idea is simple: we have a bunch of cells in which we simultaneously profiled both the transcriptome (RNA) and the open chromatin regions (ATAC). We then ask, for each gene, which open chromatin regions are highly correlated with it (after the necessary adjustment for background) to predict potential gene-enhancer links. The default correlation function in R, `cor(RNA, ATAC, method="pearson")` or `cor(RNA, ATAC, method="spearman")`, would ideally be sufficient to do this. Here, `RNA` and `ATAC` are vectors of equal length with entries summarizing the transcriptome signal and the ATAC signal at a gene and a potential enhancer, respectively.
However, both RNA and ATAC matrices are often sparse matrices, i.e. they have lots of entries that are zeroes, which are not explicitly stored in order to save space. The default `cor()` method does not work on sparse matrices. The naive solution is then a simple one: convert the RNA and ATAC sparse matrices to usual (dense) matrices using `as.matrix()` and run the correlation function. However, converting to the dense matrix format takes loads of memory, especially if you are searching for links between 10,000 genes and, say, about 5,000 potential enhancers in around 10,000 cells all at once, in parallel.
The solution to avoid this is rather easy and has been previously discussed for the Pearson correlation. A detailed description is available in the documentation of `qlcMatrix::corSparse()`. In short, the idea is to utilize the sparsity of a vector and avoid operations that would make a sparse matrix dense - we do not want to lose the sparsity structure during our calculations. For example, for a sparse vector, if we are interested in calculating the variance $\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]$ and we perform the $X-\mathbb{E}[X]$ operation first, the sparsity structure of $X$ is destroyed and we end up with a dense vector. Instead, we can use the fact that the variance can equivalently be written as $\text{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$, retaining the sparsity throughout. That solves our problem of calculating the Pearson correlation on sparse vectors (or matrices).
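A sketch of this idea in Python (the post itself relies on R's `qlcMatrix::corSparse()`; the function below is my own illustration): since zero entries contribute nothing to $\sum x$, $\sum x^2$, or $\sum xy$, the Pearson correlation can be computed from the non-zero entries alone, without ever materializing the dense vectors.

```python
import numpy as np

def pearson_from_nonzeros(idx_x, val_x, idx_y, val_y, n):
    """Pearson correlation of two length-n sparse vectors, given only their
    non-zero positions (idx_*) and values (val_*)."""
    sx, sxx = val_x.sum(), (val_x ** 2).sum()
    sy, syy = val_y.sum(), (val_y ** 2).sum()
    # sum(x*y) only needs positions where both vectors are non-zero
    x_map = dict(zip(idx_x, val_x))
    sxy = sum(v * x_map[i] for i, v in zip(idx_y, val_y) if i in x_map)
    cov = sxy / n - (sx / n) * (sy / n)   # Cov(X, Y) = E[XY] - E[X]E[Y]
    var_x = sxx / n - (sx / n) ** 2       # Var(X) = E[X^2] - E[X]^2
    var_y = syy / n - (sy / n) ** 2
    return cov / np.sqrt(var_x * var_y)

# x = (0,0,0,42,21,10) and y = (5,0,0,0,2,9) stored sparsely
r = pearson_from_nonzeros(np.array([3, 4, 5]), np.array([42.0, 21.0, 10.0]),
                          np.array([0, 4, 5]), np.array([5.0, 2.0, 9.0]), n=6)
```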
The next question then is: what about sparse matrices and the Spearman correlation? `cor(X, Y, method="spearman")` does not work for sparse matrices, and we do not want to convert them to dense form. The solution is again simple, but it took me a while to figure out. A naive idea would be to use the definition of the Spearman correlation - calculate the ranks of $X$ and $Y$ and then run them through `cor()` with `method="pearson"`. The problem, however, is again the same - the rank matrix is not sparse. But if you think about the ranks of a sparse matrix, they have some interesting properties that we can utilize to make the rank matrix sparse.
We can look at a sparse vector for an example. Consider a vector `y <- c(0,0,0,42,21,10)` with 3 zero entries. We will use $n_z$ to denote the number of zero entries in a vector. If we know the number of zero entries, we also know what their ranks are going to be - they are fixed. For a vector with $n_z$ zeros, `rank(ties.method="average")` will set them all to $\frac{1}{n_z}\sum_{i=1}^{n_z} i = \frac{(n_z+1)}{2}$. We also know that the lowest non-zero entry in such a vector (assuming all entries are non-negative, as with counts) will have a rank of $(n_z+1)$. For our example, `rank(y) = c(2,2,2,6,5,4)` - by default the ranks of tied entries are averaged, so the rank of the 0s is $\frac{1+2+3}{3} = \frac{(n_z+1)}{2} = 2$. This rank vector is not sparse, but we can make it sparse by subtracting $\frac{(n_z+1)}{2}$ from each of its entries. Since a shift operation does not change the (co)variance, the (co)variance of `c(0,0,0,4,3,2)`, which we call the “sparsified rank vector”, is the same as that of the original rank vector `c(2,2,2,6,5,4)`. So we should aim to obtain this “sparsified rank” vector somehow.
The trick to arrive at the “sparsified rank” vector is to calculate ranks on only the non-zero entries of our vector. We forget about the zero entries and focus on the non-zero entries alone - they are few, and it is fast to calculate the ranks of just these.
In this version of the vector (where there are no zeros), the lowest non-zero entry has a rank of $1$ (assuming there are no ties, though the following argument holds without loss of generality). To arrive at the “sparsified rank” vector, we subtracted $\frac{(n_z+1)}{2}$ from the original rank vector, so the lowest non-zero entry’s rank becomes $n_z + 1 - \frac{(n_z+1)}{2} = 1 + \frac{(n_z-1)}{2}$ - which is equivalent to adding $\frac{(n_z-1)}{2}$ to the ranks computed over the non-zero entries alone! This way, we retain sparsity in the ranks and can then simply use `corSparse()` to calculate the Pearson correlation on the sparsified rank vectors, which yields the Spearman correlation.
While this approach is memory efficient, it unfortunately is not always the fastest. See this notebook for some time benchmarks. I did not explicitly perform memory benchmarks.
Update: the approach is now both memory efficient and fast. See the updated post and the associated notebook.
To summarize the worked example:

`y <- c(0,0,0,42,21,10)`
`rank(y) = c(2,2,2,6,5,4)`
`sparsified_rank(y) <- c(0,0,0,4,3,2)` (subtract $\frac{(n_z+1)}{2}=2$ from all entries to make the rank vector sparse)
`rank(y[y!=0]) = rank(c(42,21,10)) = c(3,2,1)`

If we now add $\frac{(n_z-1)}{2} = \frac{(3-1)}{2} = 1$ to all the entries of the last vector, we get `c(4,3,2)`, which are exactly the non-zero ranks from our `sparsified_rank` vector - the input to `corSparse()`.
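The whole recipe can be sketched end-to-end (in Python here; the post's actual implementation is in R, and the helper names below are my own): ranks with the zeros left at zero, shifted by $(n_z-1)/2$, reproduce the Spearman correlation when fed to a plain Pearson correlation.

```python
import numpy as np

def average_ranks(v):
    """Ranks with ties averaged, like R's rank(x, ties.method='average')."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(v, kind="stable")
    ranks = np.empty(len(v))
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # average of 1-based positions i+1..j+1
        i = j + 1
    return ranks

def sparsified_ranks(x):
    """Rank transform that leaves zeros at zero (assumes non-negative data, e.g. counts)."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    nz = x != 0
    n_zeros = np.count_nonzero(~nz)
    # rank only the non-zeros, then shift them up by (n_z - 1) / 2
    out[nz] = average_ranks(x[nz]) + (n_zeros - 1) / 2.0
    return out

x = np.array([0, 0, 0, 42, 21, 10], dtype=float)
y = np.array([5, 0, 0, 0, 2, 9], dtype=float)

# Spearman = Pearson on (full) ranks; the sparsified ranks give the same answer
spearman_dense = np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]
spearman_sparse = np.corrcoef(sparsified_ranks(x), sparsified_ranks(y))[0, 1]
```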
scvi-tools uses generative modeling to model counts originating from a scRNA-seq experiment, with different underlying models catering to different experiments. “Generative modeling” is a broad term that refers to models of distributions $P(X)$, defined over some collection of datapoints $X$ that live in a high-dimensional space. In scRNA-seq, each datapoint corresponds to a cell $c$ with a multidimensional vector $X_c \in \mathbb{R}^{20000}$ containing the read counts or UMIs of 20,000 genes. A scRNA-seq dataset contains not one but a few thousand, if not millions of, cells. The generative model’s task is to capture the underlying representation of these cells. “Representation” here is a loose term, but more formally: given a $\text{gene} \times \text{cell}$ matrix whose distribution $P_{\text{truth}}(X)$ is unknown, the generative model tries to learn a distribution $P(X)$ that is as close to $P_{\text{truth}}(X)$ as possible.
In order to obtain $P(X)$, the model should be able to exploit the underlying structure in the data. Neural networks are powerful function approximators given their ability to capture non-linearities. Variational autoencoders utilize neural networks to build generative models that can approximate $P_{\text{truth}}(X)$ decently quickly. The reason this works is that any $d$-dimensional distribution can be approximated by starting with $d$ Gaussian random variables and passing them through a sufficiently complicated function (Devroye, 1986). A famous example is generating a 2D circle from a 2D Gaussian blob.
scvi-tools also starts from a Gaussian random variable and propagates it through its various layers such that the output count for a particular gene and cell is close to its observed value. It does this in four main steps:
Generate a Gaussian random variable
Pass it through a neural network to approximate gene-cell proportions ($\rho_{c,g}$)
Generate a count $y_{c,g}$ for each gene-cell pair using the estimated proportion from step 2, the total sequencing depth, and an estimated dispersion $\phi_g$
Calculate the reconstruction error between the generated count $y_{c,g}$ and the observed count $x_{c,g}$
The aim is to minimize the reconstruction error in step 4 by optimizing the neural network weights and the estimated parameters $\rho_{c,g}$ and $\phi_g$.
$\begin{aligned} {\color{purple}z_c} &\sim \mathcal{N}(0,I) & \text{\color{purple}Cell embedding} \\ {\color{red}\rho_{c,g}} &= \text{softmax}(f_w(z_c))_g & \text{\color{red}Normalized expression } \\ y_{c,g} &\sim \text{NB}({\color{blue} l_c} {\color{red}\rho_{c,g}}, \phi_g) & \text{Observed counts} \end{aligned}$
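The three lines above can be mimicked with a toy NumPy sketch (an illustration with made-up sizes and a random linear map standing in for the neural network $f_w$ - this is not scvi-tools code); the negative binomial draw uses the standard gamma-Poisson construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, latent_dim = 4, 6, 2

z = rng.standard_normal((n_cells, latent_dim))  # z_c ~ N(0, I), the cell embedding
W = rng.standard_normal((latent_dim, n_genes))  # random linear map standing in for f_w
logits = z @ W
rho = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax -> rho_{c,g}

l = np.full(n_cells, 1000.0)  # observed library sizes l_c
phi = np.full(n_genes, 5.0)   # gene-wise dispersions phi_g

mu = l[:, None] * rho         # NB mean l_c * rho_{c,g}
# NB(mu, phi) drawn as a gamma-Poisson mixture: rate ~ Gamma(phi, mu/phi), y ~ Poisson(rate)
y = rng.poisson(rng.gamma(shape=phi, scale=mu / phi))
```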
The total sequencing depth for a cell can also be learned by the network, but the latest version (0.8.0) of scVI supports using the observed library size. I started using observed library sizes before this became part of the implementation. Training is faster, and in my limited testing the downstream clustering results look slightly better with the observed library size, though that could also be due to other reasons.
The latent distribution $Z$ thus learned is a reduced-dimensional latent representation of the data. I will use the [PBMC3k dataset](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k) for all the analysis here. We can do a UMAP visualization, and the clusters tend to match up pretty well with the ground truth, though there is room for improvement.
We now have $P(Y)$ and access to all intermediate values, so we can do a ton of things. But the first thing should be to check whether $P(Y)$ is indeed correct. One way of performing validity checks on this model is posterior predictive checks (PPCs). I learned of PPCs through Richard McElreath’s Statistical Rethinking (McElreath, 2020), where they form an integral part of all his discussions.
The idea of a PPC is very simple: simulate replicate data from the learned model and compare it to the observed data. In a way, you are using your data twice - once to learn the model, and then to check the learned model against the same data. A better-designed check would use a held-out dataset, but it is perfectly valid to test the model against the observations used to train it.
The simplest check for scRNA-seq counts is the mean-variance relationship. The simulated means and variances from the learned model should match those of the observed data at both the cell and the gene level.
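As a sketch of such a check (toy Poisson counts standing in for both the observed matrix and the model's simulated counts - not the scvi-tools API):

```python
import numpy as np

rng = np.random.default_rng(1)
mu = rng.uniform(1.0, 10.0, size=20)         # toy per-gene means
observed = rng.poisson(mu, size=(100, 20))   # stand-in for the real count matrix
simulated = rng.poisson(mu, size=(100, 20))  # stand-in for counts drawn from the fitted model

# If the model fits, the simulated per-gene means/variances should track the observed ones
obs_mean, obs_var = observed.mean(axis=0), observed.var(axis=0)
sim_mean, sim_var = simulated.mean(axis=0), simulated.var(axis=0)
```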
The simulated mean-variance relationship aligns very well with the observed relationship.
Let’s compare how the dispersion looks:
Variation with gene detection rate:
The loss function being minimized to infer the parameters is the reconstruction loss between the generated counts $y_{c,g}$ and the observed counts $x_{c,g}$.
One thing I still need to wrap my head around is how informative the reconstruction error itself is. For example, a UMAP of the reconstruction error mimics that of the latent representation: