Experiments with scVI
I. Posterior Predictive Checks
scvi-tools is a suite of tools for performing dimensionality reduction, data harmonization, and differential expression. One key advantage of scvi-tools is that it natively supports loading and training on data in minibatches, which allows it to scale to very large datasets (Lopez et al., 2018).
scvi-tools uses generative modeling to model counts originating from a single-cell RNA-seq or a CITE-seq experiment. "Generative modeling" is a broad term that implies learning models of distributions $P(X)$, defined over some collection of datapoints $X$ that exist in a high-dimensional space. In scRNA-seq, each datapoint corresponds to a cell $c$ with a vector $X_c \in \mathbb{R}^{20000}$ containing read counts or UMIs for roughly 20,000 genes. A scRNA-seq dataset contains not one but a few thousand, if not millions, of cells. The generative model's task is to capture the underlying latent representation of these cells. "Representation" here is a loose term; more formally, given a $\text{cell} \times \text{gene}$ matrix whose distribution $P_{\text{truth}}(X)$ is unknown, the generative model tries to learn a distribution $P(X)$ that is as close to $P_{\text{truth}}(X)$ as possible.
In order to obtain $P(X)$, the model should be able to exploit the underlying structure in the data. At its core, scVI relies on a variational autoencoder (Kingma & Welling, 2013) to learn $P(X)$. Neural networks are powerful function approximators given their ability to capture nonlinearity in the data. Variational autoencoders utilize neural networks to build generative models that can approximate $P_{\text{truth}}(X)$ reasonably quickly. The reason this works is that any $d$-dimensional distribution can be approximated by starting with $d$ Gaussian random variables and passing them through a sufficiently complicated function (Devroye, 1986). A famous example of this is generating a 2D circle from a 2D Gaussian blob.
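The circle example takes only a few lines: draw points from a 2D Gaussian blob and push them through a deterministic nonlinear map (here, projection onto the unit circle; the map is illustrative, any function with the right image would do):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw points from a 2D standard Gaussian blob
z = rng.standard_normal((1000, 2))

# A simple nonlinear map: normalize each point onto the unit circle
x = z / np.linalg.norm(z, axis=1, keepdims=True)

# Every transformed point now lies (up to float error) on the circle
radii = np.linalg.norm(x, axis=1)
print(radii.min(), radii.max())  # both ~1.0
```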
scvi-tools also starts from a Gaussian random variable and propagates it through its various layers of nonlinear functions. The goal is to learn a generator model $P(X)$ that has the capacity to generate counts similar to the observed ones, and in this process learn a latent representation of the data. In the simplest setting, it learns this over four main steps:

1. Generate a Gaussian sample.

2. Pass the Gaussian through a neural network to approximate gene-cell proportions ($\rho_{c,g}$).

3. Generate a count $y_{c,g}$ for each gene-cell pair using the estimated proportion from step 2, the total sequencing depth, and an estimated dispersion $\phi_g$.

4. Calculate the reconstruction error between the generated count $y_{c,g}$ and the observed count $x_{c,g}$.
The aim is to minimize the reconstruction error in step 4 by optimizing the neural network weights and the estimated parameters $\rho_{c,g}$ and $\phi_g$.
$\begin{aligned} {\color{purple}z_c} &\sim \mathcal{N}(0,I) & \text{\color{purple}Cell embedding} \\ {\color{red}\rho_{c,g}} &\sim \text{softmax}(f_w(z_c)) & \text{\color{red}Normalized expression } \\ y_{c,g} &\sim \text{NB}({\color{blue} l_c} {\color{red}\rho_{c,g}}, \phi_g) & \text{Observed counts} \end{aligned}$
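The three lines above can be simulated directly with NumPy. A random linear map stands in for the decoder network $f_w$, and the negative binomial is drawn via its gamma-Poisson mixture representation; all sizes and parameter values here are illustrative, not scVI's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_latent = 100, 50, 10

# z_c ~ N(0, I): one latent embedding per cell
z = rng.standard_normal((n_cells, n_latent))

# rho_cg = softmax(f_w(z_c)); a random linear map stands in for the decoder
W = rng.standard_normal((n_latent, n_genes))
logits = z @ W
rho = np.exp(logits - logits.max(axis=1, keepdims=True))
rho /= rho.sum(axis=1, keepdims=True)  # each cell's proportions sum to 1

# y_cg ~ NB(l_c * rho_cg, phi_g), drawn as a gamma-Poisson mixture
l = rng.integers(1000, 5000, size=(n_cells, 1))  # library sizes
phi = rng.gamma(2.0, 1.0, size=n_genes)          # per-gene dispersion
mu = l * rho
y = rng.poisson(rng.gamma(shape=phi, scale=mu / phi))

print(y.shape)  # (100, 50) counts matrix
```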
The total sequencing depth for a cell, $l_c$, can also be learned by the network, but the latest version (0.8.0) of scVI supports directly using the observed library size. $\rho_{c,g}$ is one of the most useful parameters inferred by scVI: it summarizes the probability of obtaining a molecule from gene $g$ in a given cell $c$ ($\sum_g \rho_{c,g}=1$). I was interested in observing the effect of library size on the relative abundance of highly and lowly expressed genes. We can take the $\rho_{c,g}$ values learned by scVI and contrast their sums across cells between highly and lowly expressed genes. $\rho_{c,g}$ can also be used to perform differential expression.
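With the $\rho_{c,g}$ matrix in hand, the comparison boils down to summing $\rho$ over two gene sets per cell and correlating the difference with library size. A sketch assuming `rho` is a cells × genes array whose rows sum to 1 (here a synthetic Dirichlet stand-in, since training scVI is outside the scope of this snippet):

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_genes = 200, 100

# Synthetic stand-in for the rho matrix learned by scVI (rows sum to 1)
rho = rng.dirichlet(np.linspace(0.1, 5.0, n_genes), size=n_cells)
library_size = rng.integers(1000, 10000, size=n_cells)

# Split genes by their average latent frequency across cells
mean_rho = rho.mean(axis=0)
high = mean_rho >= np.median(mean_rho)

# Per cell: summed frequency of highly vs lowly expressed genes
diff = rho[:, high].sum(axis=1) - rho[:, ~high].sum(axis=1)

# Correlation of this difference with library size
r = np.corrcoef(library_size, diff)[0, 1]
print(r)
```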
I will use the PBMC3k dataset for all the analyses here.
The above plot shows no correlation between library size and the difference of latent frequencies $\rho_{g}$ between highly and lowly expressed genes. I used only CD8 naive cells for the comparison to control for any heterogeneity arising from different cell types.
In the process of learning the generating distribution $P(X)$, the model also learns a latent distribution $Z$, which is a reduced-dimensional representation of $X$ and can generate $X$ when propagated through the network. $Z$ can be used to represent the underlying structure in the data. If we perform a UMAP-based visualization of the $Z$ values of the PBMC3k dataset, the clusters match up well with the annotated cell types.
The latent distribution $Z$ seems to be a more or less correct one, given that the UMAP clusters match up with the cell types, though there is some room for improvement. We also have access to a generating distribution $P(X)$ that can generate gene-cell count matrices, which is very powerful. But how do we know that the estimated $P(X)$ is correct? One way of performing validity checks on this model is posterior predictive checks (PPCs). I learned about PPCs through Richard McElreath's fantastic Statistical Rethinking (McElreath, 2020), where they form an integral part of all his discussions.
The idea behind a PPC is very simple: simulate replicate data from the learned model and compare it to the observed data. In a way you are using your data twice: once to learn the model, and then to check the learned model against the same data. A better-designed check would use a held-out dataset, but it is perfectly valid to test the model against the observations used to train it.
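As a toy illustration of the idea (a simple Poisson model, not scVI itself): fit a rate to observed counts, simulate replicate datasets from the fitted model, and check whether a summary statistic of the observed data falls inside the replicate distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Observed" counts (simulated here purely for the example)
observed = rng.poisson(4.0, size=500)

# Fit: the Poisson MLE for the rate is just the sample mean
rate_hat = observed.mean()

# PPC: simulate replicates from the fitted model and compare a
# test statistic (here, the variance) against the observed value
rep_vars = np.array([
    rng.poisson(rate_hat, size=observed.size).var()
    for _ in range(1000)
])
obs_var = observed.var()

# Fraction of replicates with variance at least as large as observed;
# values near 0 or 1 would flag a statistic the model fails to capture
p = (rep_vars >= obs_var).mean()
print(p)
```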
The simplest checks for scRNA-seq counts are the mean-variance relationships. The simulated means and variances from the learned model should match those of the observed data at both the cell and the gene level. Generating a posterior predictive sample is as simple as calling `model.posterior_predictive_sample()`. This can essentially be considered sampling from the estimated distribution $P(X)$.
The simulated mean-variance relationship aligns very well with the observed one: the means and variances of the generated counts match those of the observed counts at both the gene and the cell level, and the mean-variance relationship itself is preserved at both levels.
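Concretely, given the observed matrix and one posterior predictive sample, the check reduces to comparing per-gene and per-cell summary statistics. A sketch with synthetic matrices standing in for the observed counts and the replicate (both drawn from the same Poisson means, mimicking a well-fitting model):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 300, 100

# Stand-in ground truth: per-cell depths times per-gene rates
depth = rng.gamma(20.0, 1.0, size=(n_cells, 1))
rate = rng.gamma(0.5, 1.0, size=(1, n_genes))
mu = depth * rate

# "Observed" counts and a "posterior predictive" replicate
observed = rng.poisson(mu)
simulated = rng.poisson(mu)

# Gene-level statistics (reduce over cells) and cell-level (reduce over genes)
gene_r = np.corrcoef(observed.mean(axis=0), simulated.mean(axis=0))[0, 1]
cell_r = np.corrcoef(observed.mean(axis=1), simulated.mean(axis=1))[0, 1]

print(gene_r, cell_r)  # both close to 1 when the model fits well
```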
scVI can fit other distributions to the counts, but I used the `nb` option to fit a negative binomial (NB) to each gene-cell count. Additionally, I used the `dispersion="gene"` option to learn one dispersion parameter per gene, though it is possible to learn a per-cell, per-gene dispersion parameter. There are no posterior predictive checks for the dispersion parameter as such, but given the overdispersed nature of scRNA-seq data we do expect the dispersion parameters to be nonzero in a few if not all cases.
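Under the NB parameterization used here, a mean $\mu$ and dispersion $\phi$ imply a variance of $\mu + \mu^2/\phi$, so overdispersion relative to a Poisson appears whenever $\phi$ is finite. A quick numerical check via the gamma-Poisson construction (parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, phi = 50.0, 2.0  # NB mean and dispersion

# NB(mu, phi) as a gamma-Poisson mixture
lam = rng.gamma(shape=phi, scale=mu / phi, size=200_000)
y = rng.poisson(lam)

print(y.mean())  # ~ mu = 50
print(y.var())   # ~ mu + mu**2 / phi = 1300, far above the Poisson's 50
```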
The estimated library size matches the observed library size. Most genes have a small dispersion parameter, though some genes show high dispersion values. The estimated dispersion does not seem to depend on the estimated mean, which is a bit surprising. It is difficult to compare the mean-dispersion relationship here with the more traditional negative binomial models in either the bulk (Love et al., 2014) or scRNA-seq (Hafemeister & Satija, 2019) settings, given the nonlinearity involved in the encoding and decoding steps. It is possible that a reduced dispersion estimate of the negative binomial is enough to capture the observed overdispersion, since it has possibly been handled at the latent representation level.
We can also compare the generated proportions of nonzero counts with the observed ones (the detection rate).
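Under the NB, the expected detection rate is available in closed form, $P(y > 0) = 1 - \left(\tfrac{\phi}{\phi + \mu}\right)^{\phi}$, so the simulated nonzero proportion can also be sanity-checked against the analytic value (parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, phi = 2.0, 0.5

# Analytic probability of a nonzero count under NB(mu, phi)
p_detect = 1.0 - (phi / (phi + mu)) ** phi

# Empirical check via the gamma-Poisson construction
lam = rng.gamma(shape=phi, scale=mu / phi, size=200_000)
y = rng.poisson(lam)
print(p_detect, (y > 0).mean())  # the two should agree closely
```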
The fact that we have a generator $P(X)$ that allows us to generate counts does not necessarily mean it is a universal simulator, at least not in the current setting. The PBMC33k dataset comes from the same v1 chemistry kit as PBMC3k. We can take a subset of cells, say CD8 naive cells from PBMC33k, use them as input to the generator learned on PBMC3k, and see if the generated counts still capture the mean-variance relationship. This is a PPC, but on a different dataset from a different batch.
The generated counts have a similar mean to the observed counts in CD8 naive cells in PBMC33k, but the variance is highly overestimated. It is, however, possible to learn a joint model for the two datasets by defining a batch variable, which is not discussed here, but the scVI model allows conditioning on such covariates.
The loss is a measure of how good an approximation $P(X)$ is to $P_{\text{truth}}(X)$. The loss function here is the NB likelihood: we measure the likelihood of the observed counts given the learned estimates of the mean and dispersion.
If we trace the loss per gene or per cell and contrast it with the mean counts, the reconstruction error depends on the mean counts at both the gene and the cell level. In a way, the model finds it hard to reconstruct things that have high expression. When the loss is high, these gene-cell combinations are hard to synthesize, possibly due to their underlying variability. But in this case it can also be explained by the behavior of the NB distribution itself: the NB likelihood behaves differently at low and high counts, in contrast to something like a Gaussian. For example, compare the log-likelihood distributions for these two families at high and low mean settings in the figure below. A high-mean negative binomial has a lower log-likelihood than a low-mean negative binomial with the same dispersion. A Gaussian, on the other hand, assigns similar log-likelihoods to a high-mean and a low-mean distribution near their respective means.
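This contrast can be made concrete with scipy: for a fixed dispersion, the NB's maximum attainable log-pmf drops as the mean grows, while a Gaussian with fixed variance has the same peak log-density regardless of its mean. (The parameter values below are arbitrary.)

```python
import numpy as np
from scipy.stats import nbinom, norm

phi = 2.0  # shared dispersion

def nb_logpmf_peak(mu, phi):
    """Maximum log-pmf of an NB with mean mu and dispersion phi."""
    n, p = phi, phi / (phi + mu)  # scipy's (n, p) parameterization
    k = np.arange(0, int(mu * 10) + 50)  # support range covering the mode
    return nbinom.logpmf(k, n, p).max()

low, high = nb_logpmf_peak(5.0, phi), nb_logpmf_peak(100.0, phi)
print(low, high)  # the high-mean NB peaks at a lower log-likelihood

# A Gaussian's peak log-density depends only on its variance, not its mean
print(norm.logpdf(5.0, loc=5.0, scale=3.0),
      norm.logpdf(100.0, loc=100.0, scale=3.0))
```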
Moreover, there is structure in the reconstruction error, as revealed by its UMAP. It seems to capture the structure to some degree, as in the latent representation UMAP. I could not find a discussion of this phenomenon in the scVI paper, but it would be interesting to see if there is a direct relationship between $Z$ and the reconstruction error.
Thanks to Dr. Rahul Satija, Adam Gayoso, Romain Lopez, and Dr. Valentine Svensson for an insightful discussion. The discussion thread is here and a notebook for all the analysis in this post is here.
 Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., & Yosef, N. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12), 1053–1058.
 Devroye, L. (1986). Sample-based non-uniform random variate generation. Proceedings of the 18th Conference on Winter Simulation, 260–265.
 McElreath, R. (2020). Statistical Rethinking: A Bayesian course with examples in R and Stan. CRC Press.
 Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. ArXiv Preprint ArXiv:1312.6114.
 Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.
 Hafemeister, C., & Satija, R. (2019). Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biology, 20(1), 1–15.