# Experiments with scVI

#### I - Posterior Predictive Checks

scvi-tools is a suite of tools for performing dimensionality reduction, data harmonization, and differential expression. One key advantage of using scvi-tools is that it inherently supports loading and training data in mini-batches which makes it practically infinitely scalable (Lopez et al., 2018).

scvi-tools uses generative modeling to model counts originating from a single cell RNA-seq or a CITE-seq experiment. “Generative modeling” is a broad term that implies learning models of distributions $P(X)$ , defined over some collection of datapoints $X$ that exist in a high dimensional space. In scRNA-seq, each datapoint corresponds to a cell $c$ which has a multidimensional vector $X_{c,g} \in \mathbb{R}^{20000}$ containing read counts or UMIs of 20,000 genes. A scRNA-seq datasets contains not one but a few thousand if not millions of cells. The generative model’s task is to capture the underlying latent representation of these cells. “Representation” here is a loose term, but more formally given a $\text{cell} \times \text{gene}$ matrix whose distribution $P_{\text{truth}}(X)$ is unknown, the generative model tries to learn a distribution $P(X)$ which is as close to $P_{\text{truth}}(X)$ as possible.

In order to obtain $P(X)$ , the model should be able to exploit the underlying structure in data. At the core, scvi relies on a variational autoencoder (Kingma & Welling, 2013) to learn $P(X)$ . Neural networks are powerful functional approximators given their ability to capture non-linearity in the data. Variational autoencoders utilize neural networks to build generative models that can approximate $P_{truth}(X)$ in a decently quick fashion. The reason this works is because any $d$ dimensional distribution can be approximated by starting with $d$ gaussian random variables and passing them through a complicated function (Devroye, 1986). A famous example of this is generating a 2D circle from a 2D gaussian blob.

scvi-tools also starts from a gaussian random variable and propogates it through its various layers of non-linear functions. The goal is to be able to learn a generator model $P(X)$ that has the capacity to generate counts similar to the observed ones and in this process learn a latent representation of the data. In the most simplistic setting, it learns this over four main steps:

1. Generate a gaussian

2. Pass the gaussian through a neural network to approximate gene-cell proportions ($\rho_{c,g}$)

3. Generate a count $y_{c,g}$ for each gene-cell using the estimated proportion in step 2 and and the total sequencing depth along with an estimated dispersion $\phi_g$ .

4. Calculate reconstruction error between generated count $y_{c,g}$ and observed coun $x_{c,g}$

The aim is to minimize the reconstruction error in step 4 by optimizing the neural network weights and the estimated parameters $\rho_{c,g}$ and $\theta_g$ .

\begin{aligned} {\color{purple}z_c} &\sim \mathcal{N}(0,I) & \text{\color{purple}Cell embedding} \\ {\color{red}\rho_{c,g}} &\sim \text{softmax}(f_w(z_c)) & \text{\color{red}Normalized expression } \\ y_{c,g} &\sim \text{NB}({\color{blue} l_c} {\color{red}\rho_{c,g}}, \phi_g) & \text{Observed counts} \end{aligned}

The total sequencing depth for a cell $l_c$ can be also be learned by the network inherently, but the latest version (0.8.0) of scVI supports directly using the observed library size. $\rho_{c,g}$ is one of the most useful parameter inferred by scvi that summarizes the probability of obtaining a molecule from gene $g$ in a given cell $c$ ( $\sum_g \rho_{c,g}=1$ ). I was interested in observing how the effect of library size on the relative abundance of highly and lowly expressed genes. We can use the $\rho_{c,g}$ values learned by scvi and contrast their sum across cells over highly and lowly expressed genes. It can also be used to perform differential expression.

I will use PBMC3k dataset for all the analysis here.

The above plot shows there is no correlation between the difference of latent frequencies $\rho_{g}$ among highly and lowly expressed genes cells. I used only CD8 naive cells for comparing the differences to contorl for any heterogeneity arising from different celltypes.

In the process of learning the generating distribution $P(X)$ , the model also learns a latent distribution $Z$ which is a reduced dimensional representation of $X$ and can generate $X$ when propoageted through the network. $Z$ can be used to represent the underlying structure in the data. If we perform a UMAP based visualization on the $Z$ values of PBMC3k dataset, the clusters match up pretty well with the annotated celltypes.

The latent distribution $Z$ seems to be a more or less correct one given that the UMAP clusters match up the celltypes, though there is some room for improvement. We also have access to a generating distribution $P(X)$ that can generate gene-cell counts matrix which is very powerful. But how do we know that the estimated $P(Y)$ is correct? One such way of performing validity checks on this model is posterior predictive checks (PPCs). I learned about PPCs through Richard McElreath’s fantastic Statistical Rethinking (McElreath, 2020), which forms an integral part of all his discussions.

The idea behind a PPC is very simple: simulate replicate data from the learned model and compare it to the observed. In a way you are using your data twice, to learn the model and then using the learned model to check it against the same data. A better designed check would be done on held out dataset, but it is perfectly valid to test the model against the observations used to train the model.

The simplest checks for scRNA-seq counts is the mean-variance relationships. The simulated means and variances from the learned model should match that of the observed data on cell- as well as gene-level. Generating a posterior predictive sample is as simple as model.posterior_predictive_sample(). This can essentially be considered as sampling from the estimated distribution $P(X)$ .

The simulated mean-variance relationship aligns very well with the observed relationship. The mean and variance of generated counts matches to the means of observed counts at both gene and cell level. And the mean-variance relationship is preserved as well, both at cell and gene levels.

scvi can fit other distributions to counts but I used the nb option to fit a negative binomial (NB) to each gene-cell count. Additionaly, I used the dispersion=gene option to learn one dispersion parameter per gene, though it is possible to learn a per cell per gene dispersion parameter. There are no posterior predictive checks for the dispersion parameter as such, but given the over-dispersed nature of scRNA-seq data we do expect the dispersion parameters to be non zero in a few if not all cases.

The estimated library size matches the observed library size. Most genes have a small dispersion parameter, though there are genes that show high values of dispersion. The estimated dispersion does not seem to have any dependence on the estimated mean which is a bit surprising. It is difficult to compare the mean-dispersion relationship here with the more traditional negative binomial models either in the bulk (Love et al., 2014) or scRNA-seq (Hafemeister & Satija, 2019) settings given the non-linearity involved at the encoding and decoding steps. It is possible that a reduced dispersion estimate of the negative-binomial is enough to capture the observed overdispersion since it has possibly been handled at the latent representation level.

We can also compare the generated proportions of non-zeroes with the observed ones (detection rate).

The fact that we have a generator $P(X)$ that allows us to generate counts does not necessarily mean its a universal simulator, at least not in the current setting. PBMC33k dataset also comes from the same v1 chemistry kit as PBMC3k. We can use a subset of cells, say CD8 naive from PBMC33k and use it as an input the generator learned on PBMC3k and see if the generated counts still capture the mean-variance relationship. This is PPC but on a different dataset from a different batch.

The generated counts have similar mean as the observed counts in CD naive cells in PBMC33k, but the variance is highly overestimated. It is however possible to learn a joint model for the two datasets by defining a batch variable which is not discussed here, but the scvi-model allows conditioning on such covariates.

The loss is a measure of how good an approximation $P(X)$ is to $P_{\text{truth}}(Y)$ . The loss function here is NB, where we measure the likelihood of observed counts given the learned estimates of mean and dispersion.

If we trace the loss per gene or per cell and contrast it with the mean counts, at both gene and cell level, the reconstruction a error is dependent on the mean counts. In a way, the model finds it hard to reconstruct things that have high expression. Things that are harder to reconstruct, i.e. when the loss is high implies these gene-cell combination are hard to synthesize possibly due to their underlying variability. But in this case it can also be explained by the behavior of the NB distribution itself. The NB likelihood behaves differently at low and high counts in contrast to something like a gaussian. For example, compare the log likelihood distribution for these two families of distribution at high and low mean settings in the figure below. A high mean negative binomial has lower log likelihood than a low mean negative binomial with same dispersion. On the other hand, a gaussian would assign similar log likelihoods to a high mean and a low mean gaussian near their respective means.

Moreover, there is structure in the reconstruction error as revealed by its UMAP. It seems to capture the structure to some degree as in the latent representation UMAP. I could not find a discussion on this phenomenon in the scVI paper, but it would be intersting to see if there is a direct relationship between $Z$ and the reconstrution error.

Thanks to Dr. Rahul Satija, Adam Gayoso, Romain Lopez, and Dr. Valentine Svensson for an insightful discussion. The discussion thread is here and a notebook for all the analysis in this post is here.

1. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., & Yosef, N. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12), 1053–1058.
2. Devroye, L. (1986). Sample-based non-uniform random variate generation. Proceedings of the 18th Conference on Winter Simulation, 260–265.
3. McElreath, R. (2020). Statistical Rethinking: A Bayesian course with examples in R and Stan. CRC press.
4. Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. ArXiv Preprint ArXiv:1312.6114.
5. Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.
6. Hafemeister, C., & Satija, R. (2019). Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biology, 20(1), 1–15.
1. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., & Yosef, N. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12), 1053–1058.
2. Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. ArXiv Preprint ArXiv:1312.6114.
3. Devroye, L. (1986). Sample-based non-uniform random variate generation. Proceedings of the 18th Conference on Winter Simulation, 260–265.
4. McElreath, R. (2020). Statistical Rethinking: A Bayesian course with examples in R and Stan. CRC press.
5. Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.
6. Hafemeister, C., & Satija, R. (2019). Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biology, 20(1), 1–15.