1 Premise
For this session, we will work with the human liver single-cell RNA-seq dataset profiled in http://livercellatlas.org/, published as part of Guilliams et al. 2022. There is a lot of data in this paper to familiarise you with different data modalities. To quote:
The liver is the largest solid organ in the body, yet it remains incompletely characterized. Here we present a spatial proteogenomic atlas of the healthy and obese human and murine liver combining single-cell CITE-seq, single-nuclei sequencing, spatial transcriptomics, and spatial proteomics
We will focus on single-cell RNA-sequencing for this tutorial.
2 Downloading data
The authors have deposited their data on GEO and on their own website. We will work with the dataset deposited on their website to keep things uniform, but we will nevertheless walk you through how to download data and metadata from GEO/SRA.
SRA stores the raw sequencing data, while GEO stores the processed data - in our case, the count matrices, the Seurat objects and the metadata.
We will fetch the metadata using pysradb:
```bash
# run this on your bash terminal
# install pysradb
pip install git+https://github.com/saketkc/pysradb@develop -q
# get metadata for the GEO records
pysradb metadata GSE192740 --detailed --saveto data/metadata/gex_GSE192740_metadata.tsv
# get metadata for the SRA records
pysradb metadata SRP352824 --detailed --saveto data/metadata/fastq_GSE192740_metadata.tsv
```
2.1 Why are we fetching both the SRA and GEO metadata?
In most cases, you will work directly with the GEO metadata. However, since we are also going to process the raw fastq files, we want to keep the SRA metadata handy as well.
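2.2 Download count matrices from GEO
The processed count matrices deposited on GEO can be fetched with pysradb:
```bash
pysradb download -g GSE192740 --out-dir data/geo
```
2.3 Download count matrices from the website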
```bash
# run in bash terminal
wget --content-disposition -c "https://livercellatlas.org/data_files/toDownload/rawData_human.zip" -O "data/rawData_human.zip"
wget --content-disposition -c "https://livercellatlas.org/data_files/toDownload/annot_humanAll.csv" -O "data/annot_humanAll.csv"
```
We will work with the human data for the rest of the workflow.
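We load the count matrix and the cell annotations, create a Seurat object, and subset it to five patients (H06, H07, H14, H16, H23):
```r
suppressPackageStartupMessages({
  library(Seurat)
  library(tidyverse)
  library(ggpubr)
  library(sparseMatrixStats)
})
theme_set(theme_pubr())

counts <- Read10X(data.dir = "/NAS/qscb2025/Guilliams_2022_livercellatlas/Raw_data/rawData_human/countTable_human/", gene.column = 1)
counts.metadata <- read.csv("/NAS/qscb2025/Guilliams_2022_livercellatlas/Raw_data/annot_humanAll.csv")
rownames(counts.metadata) <- counts.metadata$cell
sample.scrna <- counts.metadata %>% filter(typeSample == "scRnaSeq")

# keep only cells present in both the count matrix and the annotation
barcodes.scrna <- intersect(rownames(counts.metadata), colnames(counts))
object.scrna <- CreateSeuratObject(counts = counts[, barcodes.scrna], meta.data = sample.scrna[barcodes.scrna, ])
rm(counts)

object.subset <- subset(object.scrna, patient %in% c("H06", "H07", "H14", "H16", "H23"))
```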
Warning: Removing 84662 cells missing data for vars requested
```r
rm(object.scrna)
rm(counts.metadata)
```
3 Exploring the object
Before we do any downstream processing, it is always a good idea to explore the object at hand.
The counts (UMIs) are stored in the "counts" layer of the "RNA" assay. You can read about the details of how Seurat stores its information on this wiki page.
```r
counts <- LayerData(object.subset, layer = "counts", assay = "RNA")
# show the counts for the first 5 genes and first 5 cells
counts[1:5, 1:5]
```
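4 QC checks
We will calculate the percentage of mitochondrial and ribosomal counts in each cell:
```r
object.subset[["percent.mt"]] <- PercentageFeatureSet(object.subset, pattern = "^MT-")
object.subset[["percent.ribo"]] <- PercentageFeatureSet(object.subset, pattern = "^RP[SL][[:digit:]]")
Idents(object.subset) <- "patient"
```
Before we proceed, let's look at the distribution of the number of UMIs and the number of genes detected in each sample: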
```r
VlnPlot(object.subset, features = c("nCount_RNA", "nFeature_RNA", "percent.mt", "percent.ribo"))
```
Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.
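Cells with a high mitochondrial percentage are under stress (or about to die), and ideally we want to keep them out of the rest of the analysis:
```r
object.subset <- subset(object.subset, percent.mt < 10)
VlnPlot(object.subset, features = c("nCount_RNA", "nFeature_RNA", "percent.mt", "percent.ribo"))
```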
4.1 Normalisation
Before proceeding with any biological interpretation of scRNA-seq data, we need to remove technical factors that might lead to differences across cells. One such technical difference is the total number of UMIs in each cell.
In an ideal scenario, we would want the 'sequencing depth' of each cell, i.e. its total UMIs, to be the same. We also expect that a gene's mean expression and its variance should not be related.
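Let's look at the mean-variance relationship of the raw counts:
```r
counts.subset <- LayerData(object = object.subset, layer = "counts", assay = "RNA")
df <- data.frame(mean = rowMeans(counts.subset), variance = rowVars(counts.subset))
ggplot(df) +
  geom_point(aes(mean, variance)) +
  scale_x_log10() +
  scale_y_log10() +
  geom_abline() +
  theme_bw(base_size = 14)
```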
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
A standard approach to normalisation is to divide the counts in each cell by that cell's total counts and multiply by a large number (preferably the median sequencing depth) so that the counts of all cells sum to the same (large) value. Additionally, a log transform is applied to reduce the impact of large outliers (which also stabilises the variance):
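```r
object.subset <- NormalizeData(object.subset, scale.factor = median(object.subset$nCount_RNA))
```
The normalised data is stored in the "data" layer; let's revisit the mean-variance relationship:
```r
layer.data <- LayerData(object = object.subset, layer = "data", assay = "RNA")
df <- data.frame(mean = rowMeans(layer.data), variance = rowVars(layer.data))
ggplot(df) +
  geom_point(mapping = aes(mean, variance)) +
  geom_abline() +
  theme_bw(base_size = 14)
```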
So the mean-variance relationship has now "stabilized"; however, you will notice that the variance stabilization is not uniform - there is a bend in the curve at the beginning.
4.2 Feature selection
Many genes in this dataset are "not informative" - they are either lowly expressed overall, or expressed in all the celltypes. For downstream analysis, we want to identify "informative" genes - this increases the signal-to-noise ratio for downstream dimensionality reduction.
One way to select these genes would be to calculate a per-gene variance and select the ones with the highest variance. Let's explore the mean-variance relationship of these genes:
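```r
# get UMI counts from the object
counts <- LayerData(object = object.subset, assay = "RNA", layer = "counts")
# calculate per-gene mean and variance
gene_mean <- rowMeans(x = counts)
gene_variance <- rowVars(x = counts)
ggplot(data.frame(gene_mean = gene_mean, gene_variance = gene_variance), aes(gene_mean, gene_variance)) +
  geom_point() +
  geom_abline(color = "red") +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Gene mean") +
  ylab("Gene variance") +
  theme_bw(base_size = 14)
```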
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
So genes with higher means also have higher variance: there is a strong mean-variance relationship across all genes. One way to adjust for this is to explicitly model the mean-variance relationship by fitting a smooth curve and then identifying genes that deviate strongly from this "expected" pattern. This is essentially what is done inside the FindVariableFeatures function:
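```r
object.subset <- FindVariableFeatures(object = object.subset, selection.method = "vst")
VariableFeaturePlot(object.subset)
```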
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
In the plot above, the "variable genes" are highlighted in red and, as is clear from their mean values, they are not necessarily the most highly expressed (notice the genes with average expression > 10).
The total number of UMIs observed for a cell is attributable to 1) cell size (larger cells => more RNA content => more expected UMIs) and 2) sequencing depth. Let's look at the expression of a ubiquitously expressed gene (MALAT1, a long noncoding RNA):
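```r
total_umis <- colSums(x = counts)
gene <- "MALAT1"
gene_umi <- as.numeric(counts[gene, ])
ggplot(data.frame(total_cell_umi = total_umis, gene_umi = gene_umi), aes(total_cell_umi, gene_umi)) +
  geom_point() +
  xlab("Total sequencing depth") +
  ylab(paste0("Gene UMI (", gene, ")")) +
  theme_bw(base_size = 14)
```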
This gene (MALAT1) shows an almost linear relationship with the sequencing depth of a cell. In this case, the "heterogeneity across cells" in MALAT1 expression is primarily attributable to the fact that these cells were sequenced to different depths, and is therefore uninteresting.
If all genes showed this linear behaviour, we could simply have "regressed out" the effect of sequencing depth by performing a linear regression of the observed gene count against the total sequencing depth of the cell. But this is not a good idea, because there are also marker genes such as ALB (a marker for hepatocytes), MS4A1 (a marker for B cells) and CD2 (a marker for T cells):
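```r
data.to.plot <- FetchData(object = object.subset, layer = "counts", vars = c("ALB", "MS4A1", "CD2", "nCount_RNA"))
data.to.plot.genewise <- reshape2::melt(data.to.plot, id.vars = "nCount_RNA")
ggplot(data.to.plot.genewise, aes(nCount_RNA, value)) +
  geom_point() +
  facet_wrap(~variable) +
  xlab("Sequencing depth") +
  ylab("Gene UMI") +
  theme_bw(base_size = 14)
```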
Given that these are marker genes, they are only expressed in a subset of cells. A simple linear regression that removes the effect of sequencing depth from each gene's UMI counts will affect these marker genes negatively, dampening their expression.
4.2.1 Log Normalization
Instead of regressing out sequencing depth, an alternative is to use a global scale factor (say 10,000) and "scale" all cells to this common sequencing depth, accounting for each cell's own sequencing depth. To reduce the impact of outliers, these scaled values are log transformed (after adding a pseudocount of 1). To perform this normalization in Seurat, you can use the NormalizeData function:
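```r
object <- object.subset
object <- NormalizeData(object = object, scale.factor = median(object$nCount_RNA))
object <- ScaleData(object = object)
```
4.2.2 What is "good normalization"?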
There are two key properties that any “good normalization” method should achieve:
1. The normalized expression values of a gene should have minimal dependence on sequencing depth.
2. The variance of a gene's normalized expression values across cells should primarily reflect biological variation rather than technical noise, and should be independent of the gene's mean expression level and of the cell's sequencing depth.
Let’s look into the expression pattern of the same highly expressed gene after running Log Normalization:
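```r
gene <- "MALAT1"
lognormalized.data <- LayerData(object = object, assay = "RNA", layer = "data")[gene, ]
ggplot(data.frame(sequencing_depth = colSums(x = counts), normalized_values = lognormalized.data), aes(sequencing_depth, normalized_values)) +
  geom_point() +
  theme_bw(base_size = 14)
```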
4.3 One step normalization and feature selection: SCTransform
A second way to achieve variance stabilization (and variable feature selection) is SCTransform. In SCTransform, a regularized negative binomial model is used to directly model the single-cell counts. The normalized counts are then calculated by reversing this learned relationship, substituting the median sequencing depth for each cell's observed sequencing depth. You can read the SCTransform publications for more details (SCTransform, SCTransform-v2).
We will use SCTransform-based analysis for the rest of the workflow.
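```r
object <- SCTransform(object = object, verbose = FALSE)
# the most variable features selected by SCTransform
VariableFeatures(object) %>% head()
```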
4.4 Linear dimensionality reduction
While we measured around 30,000 genes, the "intrinsic" dimensionality of the data is smaller - we do not expect all of these genes to be orthogonal ("independent" measurements). There will be "modules" of genes that are co-expressed. We want to reduce the dimensionality of this dataset to a) enhance the signal-to-noise ratio and b) visualize the information. A common strategy is Principal Component Analysis (PCA), a linear dimensionality reduction method that outputs a smaller number of orthogonal dimensions (usually 20-50, but up to n−1 PCs in an n-dimensional dataset).
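```r
object <- RunPCA(object = object, npcs = 50, verbose = FALSE)
```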
Rather than including all the PCs, we want to retain only the most informative ones. The goal is to avoid adding extra PCs that might reflect other (technical) sources of variation, assuming the major source of variation in the dataset is biological.
One way to visualize the "intrinsic dimensionality" of this dataset is to see how the variance explained by individual PCs falls off, in the form of an elbow plot:
```r
ElbowPlot(object, ndims = 50)
```
The standard deviation (square root of variance) seems to plateau around PC30, so we will retain 30 PCs for downstream visualisation and clustering.
```r
DimPlot(object, group.by = "patient")
```
```r
DimPlot(object, group.by = "annot")
```
```r
DimPlot(object)
```
4.5 Non-linear dimensionality reduction
While the previous PCA analysis resulted in 30 dimensions, we want to further reduce the dimension of our dataset to 2 to be able to visualize it. While there are different methods to do this, we will utilize Uniform Manifold Approximation and Projection (UMAP). The goal of UMAP and related dimensionality reduction methods is to find a set of 2-D coordinates for the multidimensional input dataset such that the difference in cell-cell similarities between the low-dimensional and original-dimension dataset is minimised.
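```r
object <- RunUMAP(object, reduction = "pca", dims = 1:30, verbose = FALSE)
```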
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
```r
DimPlot(object)
```
The above plot shows how individual cells are located in a 2D space (defined by the UMAP coordinates). Our dataset is finally ready for a deeper dive into the biology.
4.6 Clustering
We can now find "clusters" of cells that likely represent the same celltype. For single-cell data, a popular choice is graph-based community detection, which scales to millions of cells (as opposed to tree-based methods such as hierarchical clustering):
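```r
# build the shared nearest-neighbour (SNN) graph and cluster it
object <- FindNeighbors(object, reduction = "pca", dims = 1:30, verbose = FALSE)
object <- FindClusters(object, resolution = 0.3, verbose = FALSE)
DimPlot(object, label = TRUE)
DimPlot(object, label = TRUE, group.by = "patient")
```
5 Performing integration
Given that the samples do not mix that well, we will consider integrating them:
```r
DefaultAssay(object) <- "RNA"
merged <- JoinLayers(object)
merged[["RNA"]] <- split(merged[["RNA"]], f = merged$patient)
merged <- SCTransform(merged, verbose = FALSE)
merged <- RunPCA(merged, verbose = FALSE)

integrated <- IntegrateLayers(
  object = merged, method = HarmonyIntegration, normalization.method = "SCT",
  orig.reduction = "pca", new.reduction = "integrated.harmony", verbose = FALSE
)
integrated <- RunUMAP(integrated, reduction = "integrated.harmony", dims = 1:20, reduction.name = "umap.harmony", verbose = FALSE)
```
We can visualise whether the integration has worked:
```r
DimPlot(integrated, reduction = "umap.harmony", group.by = c("patient", "annot"), shuffle = TRUE)
```
We can also use CCA for integration (takes longer):
```r
integrated <- IntegrateLayers(
  object = merged, method = CCAIntegration, normalization.method = "SCT",
  orig.reduction = "pca", new.reduction = "integrated.cca", verbose = FALSE
)
integrated <- RunUMAP(integrated, reduction = "integrated.cca", dims = 1:20, reduction.name = "umap.cca", verbose = FALSE)
DimPlot(integrated, reduction = "umap.cca", group.by = c("patient", "annot"), shuffle = TRUE)
```
Once we are happy with our integration, we can proceed with clustering:
```r
integrated <- FindNeighbors(integrated, reduction = "integrated.cca", dims = 1:20, verbose = FALSE)
integrated <- FindClusters(integrated, resolution = 0.1, cluster.name = "cca_clusters", verbose = FALSE)
DimPlot(integrated, label = TRUE, reduction = "umap.cca")
```
6 Identifying the hepatocyte cluster
We can identify the hepatocyte cluster using the expression of ALB, a canonical hepatocyte marker: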
```r
FeaturePlot(integrated, features = "ALB", reduction = "umap.cca")
```
```r
VlnPlot(integrated, features = "ALB", sort = TRUE)
```
7 Find differentiating markers between obese and lean individuals in macrophages
Our next goal is to decipher which genes are differentially expressed in obese individuals as compared to lean individuals.
To find differentially expressed (DE) genes, we first set the "identity" of the cells to the "diet" metadata column and then identify genes differentially expressed in obese individuals as compared to lean individuals using the FindMarkers command:
```r
integrated <- PrepSCTFindMarkers(integrated)
```
Found 5 SCT models. Recorrecting SCT counts using minimum median counts: 2188
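```r
macro <- subset(integrated, annot %in% c("Macrophages"))
Idents(macro) <- "diet"
markers.obese_vs_lean <- FindMarkers(macro, ident.1 = "Obese", ident.2 = "Lean", recorrect_umi = FALSE)
markers.obese_vs_lean$gene <- rownames(markers.obese_vs_lean)
head(markers.obese_vs_lean)
```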
To visualize these genes we will make use of a "volcano plot": the log fold change of each gene is plotted on the x-axis and the associated p-value on the y-axis. Additionally, we will highlight the top 10 DE genes in each direction (positive, i.e. upregulated in obese, and negative, i.e. downregulated in obese):
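```r
library(ggrepel) # to label points with gene names
markers.obese_vs_lean.top10.pos <- markers.obese_vs_lean %>%
  filter(avg_log2FC > 0, p_val_adj < 0.05) %>%
  top_n(n = 10, wt = avg_log2FC)
markers.obese_vs_lean.top10.neg <- markers.obese_vs_lean %>%
  filter(avg_log2FC < 0, p_val_adj < 0.05) %>%
  top_n(n = 10, wt = abs(avg_log2FC))
markers.obese_vs_lean.top10 <- rbind(markers.obese_vs_lean.top10.pos, markers.obese_vs_lean.top10.neg)

ggplot(markers.obese_vs_lean, aes(avg_log2FC, -log10(p_val))) +
  geom_point() +
  geom_point(data = markers.obese_vs_lean.top10, color = "red") +
  geom_text_repel(data = markers.obese_vs_lean.top10, aes(label = gene)) +
  xlab("average log2 fold change") +
  ylab("-log10(p-value)")
```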
Warning: ggrepel: 7 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
Can you notice what is going on? Why is XIST there? Why are there Y chromosome genes? Did we capture all the nuances of the experiment design?
8 Functional enrichment
An obvious question to ask of the list of DE genes is: what functions do these genes carry out? We can perform a gene ontology (GO) enrichment on each gene list to see which functional categories the DE lists are enriched in:
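```r
suppressPackageStartupMessages({
  library(clusterProfiler)
  library(org.Hs.eg.db)
})

# use all genes in the SCT assay as the background ("universe")
universe.genes <- rownames(integrated@assays$SCT@data)

DoEnrichment <- function(genes, universe.genes) {
  enrichGO(
    gene = genes,
    universe = universe.genes,
    OrgDb = org.Hs.eg.db,
    keyType = "SYMBOL",
    ont = "BP",
    pvalueCutoff = 0.05,
    qvalueCutoff = 0.05,
    readable = TRUE
  )
}

de.pos <- markers.obese_vs_lean %>%
  filter(avg_log2FC > 0, p_val_adj < 0.1) %>%
  pull(gene)
de.neg <- markers.obese_vs_lean %>%
  filter(avg_log2FC < 0, p_val_adj < 0.1) %>%
  pull(gene)

enrichment.pos <- DoEnrichment(genes = de.pos, universe.genes = universe.genes)
enrichment.neg <- DoEnrichment(genes = de.neg, universe.genes = universe.genes)

barplot(enrichment.pos) + ggtitle("Enrichment of Obese upregulated genes")
barplot(enrichment.neg, showCategory = 7) + ggtitle("Enrichment of Obese downregulated genes")
```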
9 Homework
1. Integrate all the different patients that we left out in this analysis and annotate the hepatocyte cluster.
2. Using the hepatocyte cluster, identify zonation markers for the different zones of the liver. You can use the following markers as a guide:

| Zone | Markers |
|------|---------|
| Zone 1 (periportal) | ALB, CPS1, ARG1, PCK1, SDH, SULT1A1, IGF1 |
| Zone 2 (midlobular) | HAMP, IGFB2, CYP8B1 |
| Zone 3 (perivenous) | GS, GLT, CAR, AHR, CYP1A, CYP2E1, CYP3A, GSTM1, GK, IGFB1, B-CATENIN |