Before we begin, please make sure you have all the necessary packages
installed. In addition to Seurat
, tidyverse
,
ggrepel
, and glmGamPoi
, we will also install
reshape2
and enrichR
pacakges
setRepositories(ind = 1:3) # set repositories to install Bioconductor dependencies
install.packages(c("Seurat", "tidyverse", "ggrepel", "glmGamPoi", "enrichR", "reshape2"))
Alzheimer’s disease (AD) is the most common form of dementia. Demential is a broad umbrella term for all major major neurocognitive disorders that impairs a person’s ability to remember, think, or make decisions. Pathologically, the disease is believed to occur when amyloid beta (A\(\beta\)) peptides accumulates abnormally in the extracellular space. This is followed by intraneuronal tau hyperphosphorylation and aggregation which causes neuronal and synaptic dysfunction and cell death.
The brain is made of different celltypes which can broadly be classified into three categories: excitatory, inhibitory and non-neuronal cells. Non-neuronal cells comprise of glial cells, ependymal cells, endothelial cells and pericytes. Glial cells can be further subdivided into macroglia (astrocytes and oligdendrocytes) and microglia - the primary immune cells of the central nervous system.
A number of studies have shown that among all the celltypes, microglia undergo the most prominent changes in AD brains. This is evident in two forms: 1) a change in cellular composition with AD brains exhibiting higher number of microglia and 2) a robust transcitiptional activation signal in the microglia cells. The activated microglia is referred to as DAM (Disease associated microglia) and has a distinct transcriptional signature than the homoeostatic microglia (control). This signal is conserved in mouse models (5xFAD) of AD. Multiple studies have also characterized the impact of genetic variants in the TREM2 gene. Loss of TREM2 in mice AD model (5XFAD) restricts the ability of microglia to surround A\(\beta\) plaques.
In this vignette, following Zhou et al., 2020 we will investigate the changes associated with AD pathology and TREM2 deficit brains from 5XFAD mice undergoing A\(\beta\) accumulation.
The raw counts matrices were deposited by authors on GEO.
To download all the files, you can either download the .zip
or use wget
:
mkdir -p data/GSE140511 && cd data/GSE140511 && wget -c --content-disposition "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE140511&format=file"
cd data/GSE140511 && tar -xvf GSE140511_RAW.tar
Once the files have been downloaded, we use ReadMtx
to
read the raw files and create one object per sample using the
CreateSeuratObject
function. Ultimately, we merge these
objects into one object with appropriate steps for adding metadata and
use the criteria defined by original authors for retaining high quality
cells:
pysradb metadata --detailed SRP230293 --saveto data/GSE140511/SRP230293.tsv
pysradb metadata --detailed SRP229978 --saveto data/GSE140511/SRP229978.tsv
library(tidyverse)
options(future.globals.maxSize = 100000 * 1024^2) # allow more RAM usage
library(Seurat)
## 15 Month design
# GSM4160643 WT_Cor
# GSM4160644 Trem2_KO_Cor
# GSM4160645 WT_5XFAD_Cor
# GSM4160646 Trem2_KO_5XFAD_Cor
# GSM4160647 WT_Hip
# GSM4160648 Trem2_KO_Hip
# GSM4160649 WT_5XFAD_Hip
# GSM4160650 Trem2_KO_5XFAD_Hip
## 7 Month sample:
# GSM4173504 WT1
# GSM4173505 WT2
# GSM4173506 WT3
# GSM4173507 Trem2_KO1
# GSM4173508 Trem2_KO2
# GSM4173509 Trem2_KO3
# GSM4173510 WT_5XFAD1
# GSM4173511 WT_5XFAD2
# GSM4173512 WT_5XFAD3
# GSM4173513 Trem2_KO_5XFAD1
# GSM4173514 Trem2_KO_5XFAD2
# GSM4173515 Trem2_KO_5XFAD3
sample_names <- c( # 15 Month
"GSM4160643_WT_Cor", "GSM4160644_Trem2_KO_Cor",
"GSM4160645_WT_5XFAD_Cor", "GSM4160646_Trem2_KO_5XFAD_Cor",
"GSM4160647_WT_Hip", "GSM4160648_Trem2_KO_Hip",
"GSM4160649_WT_5XFAD_Hip", "GSM4160650_Trem2_KO_5XFAD_Hip",
# 7 month
"GSM4173504_WT_1", "GSM4173505_WT_2", "GSM4173506_WT_3",
"GSM4173507_Trem2_KO_1", "GSM4173508_Trem2_KO_2", "GSM4173509_Trem2_KO_3",
"GSM4173510_WT_5XFAD_1", "GSM4173511_WT_5XFAD_2", "GSM4173512_WT_5XFAD_3",
"GSM4173513_Trem2_KO_5XFAD_1", "GSM4173514_Trem2_KO_5XFAD_2", "GSM4173515_Trem2_KO_5XFAD_3"
)
sample_names.15 <- c(
"GSM4160643_WT_Cor", "GSM4160644_Trem2_KO_Cor",
"GSM4160645_WT_5XFAD_Cor", "GSM4160646_Trem2_KO_5XFAD_Cor",
"GSM4160647_WT_Hip", "GSM4160648_Trem2_KO_Hip",
"GSM4160649_WT_5XFAD_Hip", "GSM4160650_Trem2_KO_5XFAD_Hip"
)
sample_names.7 <- c(
"GSM4173504_WT_1", "GSM4173505_WT_2", "GSM4173506_WT_3",
"GSM4173507_Trem2_KO_1", "GSM4173508_Trem2_KO_2", "GSM4173509_Trem2_KO_3",
"GSM4173510_WT_5XFAD_1", "GSM4173511_WT_5XFAD_2", "GSM4173512_WT_5XFAD_3",
"GSM4173513_Trem2_KO_5XFAD_1", "GSM4173514_Trem2_KO_5XFAD_2", "GSM4173515_Trem2_KO_5XFAD_3"
)
objects.list.15 <- list()
# For 15-monthold mouse brain snRNA-seq analysis,
# genes expressed in fewer than 3 nuclei and nuclei that expressed <400 or >3,500 genes were removed for downstream analysis.
# After filtering, 38,230 nuclei were remained.
for (sample in sample_names.15) {
counts <- ReadMtx(
mtx = file.path("data/GSE140511/", paste0(sample, "_matrix.mtx.gz")),
cells = file.path("data/GSE140511/", paste0(sample, "_barcodes.tsv.gz")),
features = file.path("data/GSE140511/", paste0(sample, "_genes.tsv.gz"))
)
object <- CreateSeuratObject(counts = counts, project = sample, min.cells = 3, min.features = 1)
object[["percent.mt"]] <- PercentageFeatureSet(object = object, pattern = "^mt-")
object <- RenameCells(object = object, add.cell.id = paste0(sample, "_"))
object$sample_name <- sample
objects.list.15[[sample]] <- object
}
library(patchwork)
plots <- list()
for (sample in names(objects.list.15)) {
plot <- VlnPlot(objects.list.15[[sample]], features = c("nCount_RNA", "nFeature_RNA", "percent.mt"))
plots[[sample]] <- plot
}
wrap_plots(plots)
From the methods section of the paper
For quality control, nuclei with mitochondrial content >5% were removed.
For 15-month-old mouse brain snRNA-seq analysis, genes expressed in fewer than 3 nuclei and nuclei that expressed <400 or >3,500 genes were removed for downstream analysis. After filtering, 38,230 nuclei were remained.
for (sample in names(objects.list.15)) {
object <- objects.list.15[[sample]]
# if nothing was specified, we could use the bottom and top quantile of nCount_RNA
# ncount_boundaries <- quantile(object$nCount_RNA, probs = c(0.01, 0.99))
# object <- subset(object, nCount_RNA > ncount_boundaries[1] & nCount_RNA < ncount_boundaries[2] & percent.mt < 5)
object <- subset(object,
nFeature_RNA >= 400 & nFeature_RNA <= 3500 & percent.mt <= 5)
objects.list.15[[sample]] <- object
}
merged.object.15 <- merge(objects.list.15[[1]], objects.list.15[2:length(objects.list.15)])
merged.object.15
## An object of class Seurat
## 17953 features across 27999 samples within 1 assay
## Active assay: RNA (17953 features, 0 variable features)
## 8 layers present: counts.GSM4160643_WT_Cor, counts.GSM4160644_Trem2_KO_Cor, counts.GSM4160645_WT_5XFAD_Cor, counts.GSM4160646_Trem2_KO_5XFAD_Cor, counts.GSM4160647_WT_Hip, counts.GSM4160648_Trem2_KO_Hip, counts.GSM4160649_WT_5XFAD_Hip, counts.GSM4160650_Trem2_KO_5XFAD_Hip
After filtering, 38,230 nuclei were remained
So we retain slightly lesser number of samples (even after following) their recommendation of cutoffs.
merged.object.15$sample_name2 <- gsub(pattern = "Trem2_KO", replacement = "Trem2KO", x = merged.object.15$sample_name)
merged.object.15$genotype <- stringr::str_split_fixed(string = merged.object.15$sample_name2, pattern = "_", n = Inf)[, 2]
merged.object.15$region <- "NA"
merged.object.15$region[grepl(pattern = "_Cor", x = merged.object.15$sample_name2)] <- "Cortex"
merged.object.15$region[grepl(pattern = "_Hip", x = merged.object.15$sample_name2)] <- "Hippocampus"
merged.object.15$mouse_model <- "WT"
merged.object.15$mouse_model[grepl(pattern = "5XFAD", x = merged.object.15$sample_name2)] <- "5XFAD"
saveRDS(merged.object.15, "data/GSE140511/Stage15_all.rds")
By default different samples are stored as different layers - but sometimes we want all data in a single layer.
# merged.object.15 <- readRDS("data/GSE140511/Stage15_all.rds")
merged.object.15.joined <- merged.object.15
DefaultAssay(merged.object.15.joined) <- "RNA"
merged.object.15.joined <- JoinLayers(merged.object.15.joined)
Before we do any downstream processing, it is always a good idea to explore the object at hand.
The counts (UMIs) are stored in the “counts” layer of the “RNA” assay”.
object <- merged.object.15.joined
counts <- LayerData(object = object, layer = "counts", assay = "RNA")
# how the first 5 gene' and first 5 cell counts
counts[1:5, 1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
## GSM4160643_WT_Cor__AAACCTGAGACCCACC-1
## Xkr4 .
## Sox17 .
## Mrpl15 .
## Lypla1 .
## Tcea1 .
## GSM4160643_WT_Cor__AAACCTGAGAGTAAGG-1
## Xkr4 .
## Sox17 .
## Mrpl15 .
## Lypla1 .
## Tcea1 .
## GSM4160643_WT_Cor__AAACCTGAGCGATGAC-1
## Xkr4 .
## Sox17 .
## Mrpl15 .
## Lypla1 .
## Tcea1 1
## GSM4160643_WT_Cor__AAACCTGAGTCAAGCG-1
## Xkr4 .
## Sox17 .
## Mrpl15 .
## Lypla1 .
## Tcea1 .
## GSM4160643_WT_Cor__AAACCTGCAAAGCGGT-1
## Xkr4 .
## Sox17 .
## Mrpl15 .
## Lypla1 .
## Tcea1 .
In total we have 27999 cells (columns) and 17187 genes (rows):
dim(counts)
## [1] 17953 27999
Among these entries, we can calculate how many proportion of entries are non-zero:
100 * round(sum(counts > 0) / (dim(counts)[1] * dim(counts)[2]), 2)
## [1] 4
So 96% of entries in the counts
matrix are zeroes (or
only 4% of the values in the entire matrix are non-zeros).
# take a look at the object metadata
head(object)
orig.ident | nCount_RNA | nFeature_RNA | percent.mt | sample_name | sample_name2 | genotype | region | mouse_model | |
---|---|---|---|---|---|---|---|---|---|
GSM4160643_WT_Cor__AAACCTGAGACCCACC-1 | GSM4160643_WT_Cor | 2018 | 708 | 0.3964321 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGAGAGTAAGG-1 | GSM4160643_WT_Cor | 952 | 403 | 0.1050420 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGAGCGATGAC-1 | GSM4160643_WT_Cor | 1447 | 769 | 0.5528680 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGAGTCAAGCG-1 | GSM4160643_WT_Cor | 1893 | 878 | 0.4754358 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGCAAAGCGGT-1 | GSM4160643_WT_Cor | 2023 | 875 | 0.1482946 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGCAAGCGATG-1 | GSM4160643_WT_Cor | 1258 | 746 | 0.3974563 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGCAAGTAGTA-1 | GSM4160643_WT_Cor | 1131 | 512 | 0.2652520 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGCAGTATAAG-1 | GSM4160643_WT_Cor | 1790 | 868 | 0.3910615 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGCATCCGTGG-1 | GSM4160643_WT_Cor | 1253 | 831 | 4.5490822 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGCATGCGCAC-1 | GSM4160643_WT_Cor | 1459 | 563 | 1.3022618 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
You can add a custom column to the metadata using:
# add a new metadata column
object$my_column <- "test"
head(object)
orig.ident | nCount_RNA | nFeature_RNA | percent.mt | sample_name | sample_name2 | genotype | region | mouse_model | my_column | |
---|---|---|---|---|---|---|---|---|---|---|
GSM4160643_WT_Cor__AAACCTGAGACCCACC-1 | GSM4160643_WT_Cor | 2018 | 708 | 0.3964321 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT | test |
GSM4160643_WT_Cor__AAACCTGAGAGTAAGG-1 | GSM4160643_WT_Cor | 952 | 403 | 0.1050420 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT | test |
GSM4160643_WT_Cor__AAACCTGAGCGATGAC-1 | GSM4160643_WT_Cor | 1447 | 769 | 0.5528680 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT | test |
GSM4160643_WT_Cor__AAACCTGAGTCAAGCG-1 | GSM4160643_WT_Cor | 1893 | 878 | 0.4754358 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT | test |
GSM4160643_WT_Cor__AAACCTGCAAAGCGGT-1 | GSM4160643_WT_Cor | 2023 | 875 | 0.1482946 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT | test |
GSM4160643_WT_Cor__AAACCTGCAAGCGATG-1 | GSM4160643_WT_Cor | 1258 | 746 | 0.3974563 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT | test |
GSM4160643_WT_Cor__AAACCTGCAAGTAGTA-1 | GSM4160643_WT_Cor | 1131 | 512 | 0.2652520 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT | test |
GSM4160643_WT_Cor__AAACCTGCAGTATAAG-1 | GSM4160643_WT_Cor | 1790 | 868 | 0.3910615 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT | test |
GSM4160643_WT_Cor__AAACCTGCATCCGTGG-1 | GSM4160643_WT_Cor | 1253 | 831 | 4.5490822 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT | test |
GSM4160643_WT_Cor__AAACCTGCATGCGCAC-1 | GSM4160643_WT_Cor | 1459 | 563 | 1.3022618 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT | test |
Or delete the column from the metadata:
object$my_column <- NULL
head(object)
orig.ident | nCount_RNA | nFeature_RNA | percent.mt | sample_name | sample_name2 | genotype | region | mouse_model | |
---|---|---|---|---|---|---|---|---|---|
GSM4160643_WT_Cor__AAACCTGAGACCCACC-1 | GSM4160643_WT_Cor | 2018 | 708 | 0.3964321 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGAGAGTAAGG-1 | GSM4160643_WT_Cor | 952 | 403 | 0.1050420 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGAGCGATGAC-1 | GSM4160643_WT_Cor | 1447 | 769 | 0.5528680 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGAGTCAAGCG-1 | GSM4160643_WT_Cor | 1893 | 878 | 0.4754358 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGCAAAGCGGT-1 | GSM4160643_WT_Cor | 2023 | 875 | 0.1482946 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGCAAGCGATG-1 | GSM4160643_WT_Cor | 1258 | 746 | 0.3974563 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGCAAGTAGTA-1 | GSM4160643_WT_Cor | 1131 | 512 | 0.2652520 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGCAGTATAAG-1 | GSM4160643_WT_Cor | 1790 | 868 | 0.3910615 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGCATCCGTGG-1 | GSM4160643_WT_Cor | 1253 | 831 | 4.5490822 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
GSM4160643_WT_Cor__AAACCTGCATGCGCAC-1 | GSM4160643_WT_Cor | 1459 | 563 | 1.3022618 | GSM4160643_WT_Cor | GSM4160643_WT_Cor | WT | Cortex | WT |
You can plotsome QC metrics using the VlnPlot
command:
VlnPlot(object, features = c("nCount_RNA", "nFeature_RNA", "percent.mt"))
You can seee the help associated with this (or any function) function by typing ‘?VlnPlot’ in the RStudio console.
Many genes in this dataset are “not informative” - they are either lowly expressed overall, or expressed in all the celltypes. For downstream analysis, we want to identify “informative” genes - this increases the signal to noise ratio for downstream dimensionaloty reduction.
One way to select these genes would be to calculate a per-gene variance and select the ones with the highest variance. Let’s explore the mean-variance relationship of these genes:
# get UMI counts from object
counts <- LayerData(object = object, assay = "RNA", layer = "counts")
# calculate per gene mean
gene_mean <- rowMeans(x = counts)
gene_variance <- apply(X = counts, MARGIN = 1, FUN = var)
ggplot(data = data.frame("gene_mean" = gene_mean, "gene_variance" = gene_variance), aes(gene_mean, gene_variance)) +
geom_point() +
geom_abline(color = "red") +
scale_x_log10() +
scale_y_log10() +
xlab("Gene mean") +
ylab("Gene variance") +
theme_bw(base_size = 14)
So genes with higher means will also have higher variance because
there is a strong mean-variance relationship exhibited by all genes. One
way to adjust for this relationship is to explicitly model the
mean-variance relationship by fitting a smooth curve and then identify
genes that show high deviation from this “expected” pattern. This is
essentially what is done inside the FindVariableFeatures
function:
object <- FindVariableFeatures(object = object, selection.method = "vst")
VariableFeaturePlot(object)
In the plot above, the “variable genes” are highlhted in red and as clear from their mean values are not necessarily the most highly expressed (notice the genes with average expression > 10)
The total number of UMIs observed for a cell are attributable to 1) cell size (higher cell size => more RNA content => more expected UMIs) and 2) sequencing depth. Let’s look at the expression of a ubiquoulsy expressed gene (Malat1- a long noncoding RNA):
total_umis <- colSums(x = counts)
gene <- "Malat1"
gene_umi <- as.numeric(counts[gene, ])
ggplot(data = data.frame("total_cell_umi" = total_umis, "gene_umi" = gene_umi), aes(total_cell_umi, gene_umi)) +
geom_point() +
theme_bw(base_size = 14) +
xlab("Total sequencing depth") +
ylab(paste0("Gene UMI(", gene))
This gene (Malat1) shows an almost linear relationship with sequencing depth of a cell. In this case the “heterogeneity across cells” in expression levels of Malat1 are primarly attributable to the fact that these cells were sequenced to different deths and is un interesting.
If all the genes showed this linear behavior we could have simply “regressed” out the effect of sequencing depth by performing a linear regression of observed gene count agaisnt the total sequencing depth of the cell. But it is not a good idea because there are also marker genes such as Apoe (an astrocytes marker), Vip (a marker for Vip interneurons - these are a type of inhibitory neurons) and C1qa (a marker of microglia)
data.to.plot <- FetchData(object = object, layer = "counts", vars = c("Apoe", "Vip", "C1qa", "nCount_RNA"))
head(data.to.plot)
Apoe | Vip | C1qa | nCount_RNA | |
---|---|---|---|---|
GSM4160643_WT_Cor__AAACCTGAGACCCACC-1 | 0 | 0 | 0 | 2018 |
GSM4160643_WT_Cor__AAACCTGAGAGTAAGG-1 | 0 | 0 | 0 | 952 |
GSM4160643_WT_Cor__AAACCTGAGCGATGAC-1 | 0 | 0 | 0 | 1447 |
GSM4160643_WT_Cor__AAACCTGAGTCAAGCG-1 | 0 | 0 | 0 | 1893 |
GSM4160643_WT_Cor__AAACCTGCAAAGCGGT-1 | 0 | 0 | 1 | 2023 |
GSM4160643_WT_Cor__AAACCTGCAAGCGATG-1 | 0 | 0 | 0 | 1258 |
data.to.plot.genewise <- reshape2::melt(data.to.plot, id.vars = "nCount_RNA")
ggplot(data.to.plot.genewise, aes(nCount_RNA, value)) +
geom_point() +
facet_wrap(~variable) +
xlab("Sequencing depth") +
ylab("Gene UMI") +
theme_bw(base_size = 14)
Given these are marker genes, they are only expressed in a subset of cells. A simple linear regression where we regress out the sequencing depth from each gene’s UMI will end up affecting these marker genes negatively - dampening their expression.
Instead of regressing out sequencing depth, an alternate idea is to
to use a global scale factor (say 10,000) and “scale” all cells to have
this sequencing depth while also accounting for the sequencing depth of
the cell. To reduce the impact of outliers, these scaled values are log
transformed (after adding a pseudocount of 1). To perform this
normalizing using Seurat, you can use the NormalizeData
function:
object <- NormalizeData(object = object, scale.factor = 10000)
There are two key properties that any “good normalization” method should achieve:
The normalized expression values of a gene should have minimal dependence on the seuquencing depth
The variance of normalized gene expression values across should cells should primarily reflect biological variation and not technical noise and should be independent of gene’s mean expression level or sequencing depth of the cell.
Let’s look into the expression pattern of the same highly expressed gene after running Log Normalization:
gene <- "Malat1"
# normalized values are stored in the "data" layer
object <- NormalizeData(object = object, scale.factor = 10000)
lognormalized.data <- LayerData(object = object, assay = "RNA", layer = "data")[gene, ]
ggplot(data.frame("sequencing_depth" = colSums(x = counts), "normalized_values" = lognormalized.data), aes(sequencing_depth, normalized_values)) +
geom_point() +
theme_bw(base_size = 14)
Compare this relationship with an earlier figure where we were plotting the raw expression values - the correlation between y and x axis has reduced significantly. But as we will see in the next figure, this correlation is not zero.
counts <- LayerData(object = object, assay = "RNA", layer = "counts")
lognormalized.data <- LayerData(object = object, assay = "RNA", layer = "data")
# compute per gene correlation with sequencing depth of each cell
var.features <- VariableFeatures(object = object)
corr.counts <- apply(X = counts[var.features, ], MARGIN = 1, FUN = function(x) cor(x = x, y = colSums(x = counts)))
corr.lognorm <- apply(X = lognormalized.data[var.features, ], MARGIN = 1, FUN = function(x) cor(x = x, y = colSums(x = counts)))
df <- data.frame(corr = c(corr.counts, corr.lognorm), type = c(
rep("Raw counts", length(corr.counts)),
rep("Log normalized", length(corr.lognorm))
))
ggplot(data = df, aes(corr, fill = type)) +
geom_histogram() +
theme_bw(base_size = 14)
## One step normalization and feature selection: SCTransform
A second way to achieve variance stabilization (and variable feature selection) is by using SCTransform. In SCTransform, we use a regularized negative binomial model to directly model the single-cell counts. The normalized counts are then cacluclated by reversing this general reltaionship by substituting the observed count depth as the median sequencing depth. You can read the SCTransform publication for more details (SCTransform SCTransform-v2).
We will use SCTransform based analysis for the rest of the work flow.
object <- SCTransform(
object = object, verbose = FALSE
)
Compare the correlation of SCT normalized data with sequencing depth:
sct.data <- LayerData(object = object, assay = "SCT", layer = "data")
corr.sct <- apply(X = sct.data[intersect(var.features, rownames(sct.data)), ], MARGIN = 1, FUN = function(x) cor(x = x, y = colSums(x = counts)))
df <- data.frame(corr = c(corr.counts, corr.lognorm, corr.sct), type = c(
rep("counts", length(corr.counts)),
rep("Log normalized", length(corr.lognorm)), rep("SCT normalized", length(corr.sct))
))
ggplot(data = df, aes(corr, fill = type)) +
geom_histogram() +
theme_bw(base_size = 14)
While we measured around 17000 genes, the “intrinsic” dimensionality of data is smaller - we do not expect all 17000 genes to be orthogonal (“independent” measurements). There will be “modules” of genes that are co-expressed. We want to reduce the dimensioanloty of this dataset to a) enhance the signal to noise ratio and b) visualize this information. A common strategy in this case is to use Principle Component Analysis (PCA) which is a linear dimensionalty reduction method that outputs a smaller (20-50 usually, but upto n-1 PCs in a n dimensional dataset) orthologonal dimensions.
object <- RunPCA(object = object, npcs = 50, verbose = FALSE)
Rather than including all the PCs, we want to retain the most informative PCs. The goal here is to avoid adding extra PCs that might reflect other sources of variation (technical) assuming the major source of variation in the dataset is biological.
One way to visualize the “intrinsic dimensionality” of this dataset is to to see how the variance explained by individual PCs varies in the form of an elbow plot:
ElbowPlot(object, ndims = 30)
DimPlot(object)
The standard deviation (sequare root of variance) seems to plateau around PC30, so we will retain 30 PCs for downstream visualization and clustering.
While he previous PCA analysis resulted in 30 dimensions, we want to further reduce the dimension of our dataset to 2 to be able to visualize it. While there are different methods to do this, we will utilize Uniform Manifold Approximation and Projection (UMAP). The goal of UMAP and related dimensionality reduction methods is to find a set of 2-D coordinates for the multidimensional input dataset such that the difference in cell-cell similarities between the low-dimensional and original-dimension dataset is minimized.
object <- RunUMAP(object, reduction = "pca", dims = 1:30)
DimPlot(object)
DimPlot(object, group.by = "region")
The above plot shows how individual cells are located in a 2D space (defined by the UMAP coordinates). Our dataset is finally ready for a deeper dive in the biology.
We can now find “clusters” of cells that likely represent the same celltype. For single-cell data, a popular choice of methods for clustering is graph-based community detection methods that can scale to millions of cells (as opposed to tree-based methods such as hierechial clustering):
# build SNN graph
object <- FindNeighbors(object, reduction = "pca", dims = 1:30)
object <- FindClusters(object, resolution = 0.3)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 27999
## Number of edges: 1008819
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9347
## Number of communities: 17
## Elapsed time: 5 seconds
DimPlot(object, label = TRUE)
We can visualize some of the marker genes that define the Microglia (See Figure 2 in the associated paper)
FeaturePlot(object, features = c("C1qa", "Fcrls", "Tyrobp", "Csf1r", "Cx3cr1", "P2ry12"))
While we can clearly see which cells have highe expression of these
genes on the UMAP, it is not clear which “cluster” these cells belong
to. You can use VlnPlot
to plot the expression pattern of
these genes:
VlnPlot(object, features = c("C1qa", "Fcrls", "Tyrobp", "Csf1r", "Cx3cr1", "P2ry12"))
So Cluster 12 is our microglia cluster, we can separate it out to do a deeper analysis
# subset out cluster 13
all.microglia <- subset(object, idents = "12")
DimPlot(all.microglia, group.by = "genotype")
DimPlot(all.microglia, group.by = "mouse_model")
Our next goal is to decipher what genes are differentially expressed in the 5XFAD mouse as compared to the WT mouse (we expect there to be some genes that will be highly expressed in WT and lowly expressed in 5X5AD and vice-versa).
To find differentially expressed (DE) genes, we first set the
“identity” of the cells to be the “mouse_model” metadata column and then
identify diferentially expressed genes in the 5XFAD mouse as compared to
the “WT” mouse using the FindMarkers
command:
# Assign the identity to be the mouse model
# among the 'mouse_model` labeled 5XFAD we have both the original 5XFAD (labeled as 'WT' under `genotype` column) and a 'TREM2KO' reflecting the TREM2KO which we remove for this part of the analysis
microglia.notrem2ko <- subset(all.microglia, genotype %in% c("WT"))
DimPlot(microglia.notrem2ko, group.by = "mouse_model")
Idents(microglia.notrem2ko) <- "mouse_model"
markers.wt_vs_5ad <- FindMarkers(microglia.notrem2ko, ident.1 = "5XFAD", ident.2 = "WT")
markers.wt_vs_5ad$gene <- rownames(markers.wt_vs_5ad)
head(markers.wt_vs_5ad)
p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | gene | |
---|---|---|---|---|---|---|
Cst7 | 0 | 5.732861 | 0.455 | 0.011 | 0.00e+00 | Cst7 |
Thy1 | 0 | 2.095309 | 0.750 | 0.422 | 0.00e+00 | Thy1 |
Ctsd | 0 | 1.400086 | 0.955 | 0.911 | 2.00e-07 | Ctsd |
Apoe | 0 | 1.177130 | 0.932 | 0.794 | 1.42e-05 | Apoe |
Lyz2 | 0 | 1.906891 | 0.489 | 0.156 | 5.21e-05 | Lyz2 |
Cd74 | 0 | 2.814830 | 0.261 | 0.028 | 8.45e-05 | Cd74 |
To visualize these genes we will make use of a “volcano plot” where on the x-axis we will plot the log fold change of each gene and the associated p-value will be plotted on the y-axis. Additinally we will also highlight top 10 DE genes in both directions (positive, i.e. upregulated in 5XFAD and negative, i.e. downregulated in 5XFAD).
library(ggrepel) # load this library to plot gene names
markers.wt_vs_5ad.top10.pos <- markers.wt_vs_5ad %>%
filter(avg_log2FC > 0) %>%
filter(p_val_adj < 0.05) %>%
top_n(n = 10, wt = avg_log2FC)
markers.wt_vs_5ad.top10.neg <- markers.wt_vs_5ad %>%
filter(avg_log2FC < 0) %>%
filter(p_val_adj < 0.05) %>%
top_n(n = 10, wt = abs(avg_log2FC))
markers.wt_vs_5ad.top10 <- rbind(markers.wt_vs_5ad.top10.pos, markers.wt_vs_5ad.top10.neg)
ggplot(markers.wt_vs_5ad, aes(avg_log2FC, -log10(p_val))) +
geom_point() +
geom_point(data = markers.wt_vs_5ad.top10, aes(avg_log2FC, -log10(p_val)), color = "red") +
geom_text_repel(data = markers.wt_vs_5ad.top10, aes(avg_log2FC, -log10(p_val), label = gene)) +
xlab("average log2 Fold Change") +
ylab("-log10(Pvalue)") +
theme_classic(base_size = 14)
On obvious question to ask with the list of DE genes is what function are these carrying? We can perform a gene ontology enrichment using the gene lists to see what functional categories is each of the DE list enriched in.
library(enrichR) # make sure this library is installed to do GO enrichments
# defne a custom function to carry out enrichment
DoEnrichment <- function(genes) {
library(enrichR)
dbs <- c("GO_Molecular_Function_2021", "GO_Cellular_Component_2021", "GO_Biological_Process_2021")
enriched <- enrichR::enrichr(genes, dbs)
return(enriched)
}
de.pos <- markers.wt_vs_5ad %>%
filter(avg_log2FC > 0) %>%
filter(p_val_adj < 0.1) %>%
pull(gene)
de.neg <- markers.wt_vs_5ad %>%
filter(avg_log2FC < 0) %>%
filter(p_val_adj < 0.1) %>%
pull(gene)
enrichment.pos <- DoEnrichment(genes = de.pos)
## Uploading data to Enrichr... Done.
## Querying GO_Molecular_Function_2021... Done.
## Querying GO_Cellular_Component_2021... Done.
## Querying GO_Biological_Process_2021... Done.
## Parsing results... Done.
enrichment.neg <- DoEnrichment(genes = de.neg)
## Uploading data to Enrichr... Done.
## Querying GO_Molecular_Function_2021... Done.
## Querying GO_Cellular_Component_2021... Done.
## Querying GO_Biological_Process_2021... Done.
## Parsing results... Done.
plotEnrich(df = enrichment.pos$GO_Biological_Process_2021, showTerms = 7) + ggtitle("Enrichment of 5XFAD upregulated genes")
plotEnrich(df = enrichment.neg$GO_Biological_Process_2021, showTerms = 7) + ggtitle("Enrichment of 5XFAD downregulated genes")
We should compare your results with the following claim in the paper:
DAM genes, including Cst7, Csf1, Apoe, Trem2, Lpl, Lilrb4a, H2-d1 (an MHC-I gene), Cd74 (an MHC-II-related gene) and various cathepsin genes were notably upregulated in 5XFAD compared with WT mice. Homeostatic genes, such as P2ry12, Selplg, Tmem119 and Cx3cr1, were downregulated. These results were highly concordant between the 7-month- and 15-month-old datasets (Fig. 2d) and with previously published single-cell RNAseq data of sorted microglia.
List of genes upregulated in 5XFAD:
de.pos
## [1] "Cst7" "Thy1" "Ctsd" "Apoe" "Lyz2" "Cd74" "B2m" "H2-K1" "Gpnmb"
## [10] "Ctsb" "Spp1" "H2-D1" "Cpne7" "Ctsz" "Trem2"
List of genes downregulated in 5XFAD:
de.neg
## [1] "Cst3" "Selplg"
Identify what genes are differentially expressed in 5XFAD mouse as compared to the WT mouse.
The mouse_model
column stores the information about it
being a WT or an AD (5XFAD) mouse. In this study the authors were also
interested in asking what changes happen to the microglial gene program
if TREM2 is knocked out (given the prior knowledge about microglial
transcriptional response dependence on TREM2). In the above analysis we
removed cells that were not labeled as “WT” under their “genotype”
column. Here we will subset a different set of cells which are all
“5XFAD” but can have different genotype
(“WT” or
“TREM2KO”)
microglia.5xfad <- subset(all.microglia, mouse_model %in% c("5XFAD"))
# now we set the identities to the genotype
Idents(microglia.5xfad) <- "genotype"
markers.5ad_vs_trem2ko <- FindMarkers(microglia.5xfad, ident.1 = "WT", ident.2 = "Trem2KO")
markers.5ad_vs_trem2ko$gene <- rownames(markers.5ad_vs_trem2ko)
head(markers.5ad_vs_trem2ko)
p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | gene | |
---|---|---|---|---|---|---|
Trem2 | 0.0000000 | 4.479658 | 0.920 | 0.105 | 1.63e-05 | Trem2 |
Cfap20 | 0.0001355 | -3.211504 | 0.023 | 0.263 | 1.00e+00 | Cfap20 |
Ubqln2 | 0.0001776 | -4.211504 | 0.000 | 0.158 | 1.00e+00 | Ubqln2 |
Col16a1 | 0.0001776 | -4.211504 | 0.000 | 0.158 | 1.00e+00 | Col16a1 |
Cep135 | 0.0001776 | -4.211504 | 0.000 | 0.158 | 1.00e+00 | Cep135 |
Ddx25 | 0.0001776 | -4.211504 | 0.000 | 0.158 | 1.00e+00 | Ddx25 |
markers.5ad_vs_trem2ko.top10.pos <- markers.5ad_vs_trem2ko %>%
filter(avg_log2FC > 0) %>%
filter(p_val_adj < 0.05) %>%
top_n(n = 10, wt = avg_log2FC)
markers.5ad_vs_trem2ko.top10.neg <- markers.5ad_vs_trem2ko %>%
filter(avg_log2FC < 0) %>%
filter(p_val_adj < 0.05) %>%
top_n(n = 10, wt = abs(avg_log2FC))
markers.5ad_vs_trem2ko.top10 <- rbind(markers.5ad_vs_trem2ko.top10.pos, markers.5ad_vs_trem2ko.top10.neg)
ggplot(markers.5ad_vs_trem2ko, aes(avg_log2FC, -log10(p_val))) +
geom_point() +
geom_point(data = markers.5ad_vs_trem2ko.top10, aes(avg_log2FC, -log10(p_val)), color = "red") +
geom_text_repel(data = markers.5ad_vs_trem2ko.top10, aes(avg_log2FC, -log10(p_val), label = gene)) +
xlab("average log2 Fold Change") +
ylab("-log10(Pvalue)")
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin23.4.0
## Running under: macOS 15.0.1
##
## Matrix products: default
## BLAS: /opt/homebrew/Cellar/openblas/0.3.28/lib/libopenblasp-r0.3.28.dylib
## LAPACK: /opt/homebrew/Cellar/r/4.4.1/lib/R/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Kolkata
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] enrichR_3.2 ggrepel_0.9.6 patchwork_1.3.0 Seurat_5.1.0
## [5] SeuratObject_5.0.2 sp_2.1-4 lubridate_1.9.3 forcats_1.0.0
## [9] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
## [13] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.16.0
## [3] jsonlite_1.8.9 magrittr_2.0.3
## [5] spatstat.utils_3.1-0 ggbeeswarm_0.7.2
## [7] farver_2.1.2 rmarkdown_2.28
## [9] zlibbioc_1.50.0 vctrs_0.6.5
## [11] ROCR_1.0-11 DelayedMatrixStats_1.26.0
## [13] spatstat.explore_3.3-2 S4Arrays_1.4.1
## [15] htmltools_0.5.8.1 curl_5.2.3
## [17] SparseArray_1.4.8 sass_0.4.9
## [19] sctransform_0.4.1 parallelly_1.38.0
## [21] KernSmooth_2.23-24 bslib_0.8.0
## [23] htmlwidgets_1.6.4 ica_1.0-3
## [25] plyr_1.8.9 plotly_4.10.4
## [27] zoo_1.8-12 cachem_1.1.0
## [29] igraph_2.0.3 mime_0.12
## [31] lifecycle_1.0.4 pkgconfig_2.0.3
## [33] Matrix_1.7-0 R6_2.5.1
## [35] fastmap_1.2.0 GenomeInfoDbData_1.2.12
## [37] MatrixGenerics_1.16.0 fitdistrplus_1.2-1
## [39] future_1.34.0 shiny_1.9.1
## [41] digest_0.6.37 colorspace_2.1-1
## [43] S4Vectors_0.42.1 tensor_1.5
## [45] RSpectra_0.16-2 irlba_2.3.5.1
## [47] GenomicRanges_1.56.1 WriteXLS_6.7.0
## [49] labeling_0.4.3 progressr_0.14.0
## [51] fansi_1.0.6 spatstat.sparse_3.1-0
## [53] timechange_0.3.0 httr_1.4.7
## [55] polyclip_1.10-7 abind_1.4-8
## [57] compiler_4.4.1 withr_3.0.1
## [59] fastDummies_1.7.4 highr_0.11
## [61] MASS_7.3-61 DelayedArray_0.30.1
## [63] rjson_0.2.23 tools_4.4.1
## [65] vipor_0.4.7 lmtest_0.9-40
## [67] beeswarm_0.4.0 httpuv_1.6.15
## [69] future.apply_1.11.2 goftest_1.2-3
## [71] glmGamPoi_1.16.0 glue_1.8.0
## [73] nlme_3.1-166 promises_1.3.0
## [75] grid_4.4.1 Rtsne_0.17
## [77] cluster_2.1.6 reshape2_1.4.4
## [79] generics_0.1.3 gtable_0.3.5
## [81] spatstat.data_3.1-2 tzdb_0.4.0
## [83] data.table_1.16.0 hms_1.1.3
## [85] XVector_0.44.0 utf8_1.2.4
## [87] BiocGenerics_0.50.0 spatstat.geom_3.3-3
## [89] RcppAnnoy_0.0.22 RANN_2.6.2
## [91] pillar_1.9.0 limma_3.60.4
## [93] spam_2.10-0 RcppHNSW_0.6.0
## [95] later_1.3.2 splines_4.4.1
## [97] lattice_0.22-6 survival_3.7-0
## [99] deldir_2.0-4 tidyselect_1.2.1
## [101] miniUI_0.1.1.1 pbapply_1.7-2
## [103] knitr_1.48 gridExtra_2.3
## [105] IRanges_2.38.1 SummarizedExperiment_1.34.0
## [107] scattermore_1.2 stats4_4.4.1
## [109] xfun_0.47 Biobase_2.64.0
## [111] statmod_1.5.0 matrixStats_1.4.1
## [113] UCSC.utils_1.0.0 stringi_1.8.4
## [115] lazyeval_0.2.2 yaml_2.3.10
## [117] evaluate_1.0.0 codetools_0.2-20
## [119] cli_3.6.3 uwot_0.2.2
## [121] xtable_1.8-4 reticulate_1.39.0
## [123] munsell_0.5.1 jquerylib_0.1.4
## [125] GenomeInfoDb_1.40.1 Rcpp_1.0.13
## [127] globals_0.16.3 spatstat.random_3.3-2
## [129] png_0.1-8 ggrastr_1.0.2
## [131] spatstat.univar_3.0-1 parallel_4.4.1
## [133] dotCall64_1.1-1 sparseMatrixStats_1.16.0
## [135] listenv_0.9.1 viridisLite_0.4.2
## [137] scales_1.3.0.9000 ggridges_0.5.6
## [139] crayon_1.5.3 leiden_0.4.3.1
## [141] rlang_1.1.4 cowplot_1.1.3