Installing packages

Before we begin, please make sure you have all the necessary packages installed. In addition to Seurat, tidyverse, ggrepel, and glmGamPoi, we will also install reshape2 and enrichR pacakges

setRepositories(ind = 1:3) # set repositories to install Bioconductor dependencies
install.packages(c("Seurat", "tidyverse", "ggrepel", "glmGamPoi", "enrichR", "reshape2"))

Biological background

Alzheimer’s disease (AD) is the most common form of dementia. Demential is a broad umbrella term for all major major neurocognitive disorders that impairs a person’s ability to remember, think, or make decisions. Pathologically, the disease is believed to occur when amyloid beta (A\(\beta\)) peptides accumulates abnormally in the extracellular space. This is followed by intraneuronal tau hyperphosphorylation and aggregation which causes neuronal and synaptic dysfunction and cell death.

The brain is made of different celltypes which can broadly be classified into three categories: excitatory, inhibitory and non-neuronal cells. Non-neuronal cells comprise of glial cells, ependymal cells, endothelial cells and pericytes. Glial cells can be further subdivided into macroglia (astrocytes and oligdendrocytes) and microglia - the primary immune cells of the central nervous system.

A number of studies have shown that among all the celltypes, microglia undergo the most prominent changes in AD brains. This is evident in two forms: 1) a change in cellular composition with AD brains exhibiting higher number of microglia and 2) a robust transcitiptional activation signal in the microglia cells. The activated microglia is referred to as DAM (Disease associated microglia) and has a distinct transcriptional signature than the homoeostatic microglia (control). This signal is conserved in mouse models (5xFAD) of AD. Multiple studies have also characterized the impact of genetic variants in the TREM2 gene. Loss of TREM2 in mice AD model (5XFAD) restricts the ability of microglia to surround A\(\beta\) plaques.

In this vignette, following Zhou et al., 2020 we will investigate the changes associated with AD pathology and TREM2 deficit brains from 5XFAD mice undergoing A\(\beta\) accumulation.

Downloading raw files from GEO

The raw counts matrices were deposited by authors on GEO. To download all the files, you can either download the .zip or use wget:

mkdir -p data/GSE140511 && cd data/GSE140511 &&  wget -c --content-disposition "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE140511&format=file"

cd data/GSE140511 && tar -xvf GSE140511_RAW.tar

Quality control to discard low-quality cells

Once the files have been downloaded, we use ReadMtx to read the raw files and create one object per sample using the CreateSeuratObject function. Ultimately, we merge these objects into one object with appropriate steps for adding metadata and use the criteria defined by original authors for retaining high quality cells:

pysradb metadata --detailed SRP230293 --saveto data/GSE140511/SRP230293.tsv

pysradb metadata --detailed SRP229978 --saveto data/GSE140511/SRP229978.tsv

library(tidyverse)
options(future.globals.maxSize = 100000 * 1024^2) # allow more RAM usage
library(Seurat)

## 15 Month design
# GSM4160643    WT_Cor
# GSM4160644    Trem2_KO_Cor
# GSM4160645    WT_5XFAD_Cor
# GSM4160646    Trem2_KO_5XFAD_Cor
# GSM4160647    WT_Hip
# GSM4160648    Trem2_KO_Hip
# GSM4160649    WT_5XFAD_Hip
# GSM4160650    Trem2_KO_5XFAD_Hip


## 7 Month sample:
# GSM4173504    WT1
# GSM4173505    WT2
# GSM4173506    WT3
# GSM4173507    Trem2_KO1
# GSM4173508    Trem2_KO2
# GSM4173509    Trem2_KO3
# GSM4173510    WT_5XFAD1
# GSM4173511    WT_5XFAD2
# GSM4173512    WT_5XFAD3
# GSM4173513    Trem2_KO_5XFAD1
# GSM4173514    Trem2_KO_5XFAD2
# GSM4173515    Trem2_KO_5XFAD3


sample_names <- c( # 15 Month
  "GSM4160643_WT_Cor", "GSM4160644_Trem2_KO_Cor",
  "GSM4160645_WT_5XFAD_Cor", "GSM4160646_Trem2_KO_5XFAD_Cor",
  "GSM4160647_WT_Hip", "GSM4160648_Trem2_KO_Hip",
  "GSM4160649_WT_5XFAD_Hip", "GSM4160650_Trem2_KO_5XFAD_Hip",
  # 7 month
  "GSM4173504_WT_1", "GSM4173505_WT_2", "GSM4173506_WT_3",
  "GSM4173507_Trem2_KO_1", "GSM4173508_Trem2_KO_2", "GSM4173509_Trem2_KO_3",
  "GSM4173510_WT_5XFAD_1", "GSM4173511_WT_5XFAD_2", "GSM4173512_WT_5XFAD_3",
  "GSM4173513_Trem2_KO_5XFAD_1", "GSM4173514_Trem2_KO_5XFAD_2", "GSM4173515_Trem2_KO_5XFAD_3"
)

sample_names.15 <- c(
  "GSM4160643_WT_Cor", "GSM4160644_Trem2_KO_Cor",
  "GSM4160645_WT_5XFAD_Cor", "GSM4160646_Trem2_KO_5XFAD_Cor",
  "GSM4160647_WT_Hip", "GSM4160648_Trem2_KO_Hip",
  "GSM4160649_WT_5XFAD_Hip", "GSM4160650_Trem2_KO_5XFAD_Hip"
)

sample_names.7 <- c(
  "GSM4173504_WT_1", "GSM4173505_WT_2", "GSM4173506_WT_3",
  "GSM4173507_Trem2_KO_1", "GSM4173508_Trem2_KO_2", "GSM4173509_Trem2_KO_3",
  "GSM4173510_WT_5XFAD_1", "GSM4173511_WT_5XFAD_2", "GSM4173512_WT_5XFAD_3",
  "GSM4173513_Trem2_KO_5XFAD_1", "GSM4173514_Trem2_KO_5XFAD_2", "GSM4173515_Trem2_KO_5XFAD_3"
)



objects.list.15 <- list()
# For 15-monthold mouse brain snRNA-seq analysis,
# genes expressed in fewer than 3 nuclei and nuclei that expressed <400 or >3,500 genes were removed for downstream analysis.
# After filtering, 38,230 nuclei were remained.
for (sample in sample_names.15) {
  counts <- ReadMtx(
    mtx = file.path("data/GSE140511/", paste0(sample, "_matrix.mtx.gz")),
    cells = file.path("data/GSE140511/", paste0(sample, "_barcodes.tsv.gz")),
    features = file.path("data/GSE140511/", paste0(sample, "_genes.tsv.gz"))
  )
  object <- CreateSeuratObject(counts = counts, project = sample, min.cells = 3, min.features = 1)
  object[["percent.mt"]] <- PercentageFeatureSet(object = object, pattern = "^mt-")
  object <- RenameCells(object = object, add.cell.id = paste0(sample, "_"))
  object$sample_name <- sample
  objects.list.15[[sample]] <- object
}

library(patchwork)
plots <- list()
for (sample in names(objects.list.15)) {
  plot <- VlnPlot(objects.list.15[[sample]], features = c("nCount_RNA", "nFeature_RNA", "percent.mt"))
  plots[[sample]] <- plot
}
wrap_plots(plots)

Thresholding

From the methods section of the paper

For quality control, nuclei with mitochondrial content >5% were removed.

For 15-month-old mouse brain snRNA-seq analysis, genes expressed in fewer than 3 nuclei and nuclei that expressed <400 or >3,500 genes were removed for downstream analysis. After filtering, 38,230 nuclei were remained.

for (sample in names(objects.list.15)) {
  object <- objects.list.15[[sample]]
  # if nothing was specified, we could use the bottom and top quantile of nCount_RNA
  # ncount_boundaries <- quantile(object$nCount_RNA, probs = c(0.01, 0.99))
  # object <- subset(object, nCount_RNA > ncount_boundaries[1] & nCount_RNA < ncount_boundaries[2] & percent.mt < 5)
  object <- subset(object, 
                   nFeature_RNA >= 400 & nFeature_RNA <= 3500 & percent.mt <= 5)
  objects.list.15[[sample]] <- object
}

merged.object.15 <- merge(objects.list.15[[1]], objects.list.15[2:length(objects.list.15)])
merged.object.15

## An object of class Seurat 
## 17953 features across 27999 samples within 1 assay 
## Active assay: RNA (17953 features, 0 variable features)
##  8 layers present: counts.GSM4160643_WT_Cor, counts.GSM4160644_Trem2_KO_Cor, counts.GSM4160645_WT_5XFAD_Cor, counts.GSM4160646_Trem2_KO_5XFAD_Cor, counts.GSM4160647_WT_Hip, counts.GSM4160648_Trem2_KO_Hip, counts.GSM4160649_WT_5XFAD_Hip, counts.GSM4160650_Trem2_KO_5XFAD_Hip

After filtering, 38,230 nuclei were remained

So we retain slightly lesser number of samples (even after following) their recommendation of cutoffs.

merged.object.15$sample_name2 <- gsub(pattern = "Trem2_KO", replacement = "Trem2KO", x = merged.object.15$sample_name)

merged.object.15$genotype <- stringr::str_split_fixed(string = merged.object.15$sample_name2, pattern = "_", n = Inf)[, 2]
merged.object.15$region <- "NA"
merged.object.15$region[grepl(pattern = "_Cor", x = merged.object.15$sample_name2)] <- "Cortex"
merged.object.15$region[grepl(pattern = "_Hip", x = merged.object.15$sample_name2)] <- "Hippocampus"
merged.object.15$mouse_model <- "WT"
merged.object.15$mouse_model[grepl(pattern = "5XFAD", x = merged.object.15$sample_name2)] <- "5XFAD"

saveRDS(merged.object.15, "data/GSE140511/Stage15_all.rds")

By default different samples are stored as different layers - but sometimes we want all data in a single layer.

Joining layers

# merged.object.15 <- readRDS("data/GSE140511/Stage15_all.rds")

merged.object.15.joined <- merged.object.15
DefaultAssay(merged.object.15.joined) <- "RNA"

merged.object.15.joined <- JoinLayers(merged.object.15.joined)

Exploring the object

Before we do any downstream processing, it is always a good idea to explore the object at hand.

The counts (UMIs) are stored in the “counts” layer of the “RNA” assay”.

object <- merged.object.15.joined
counts <- LayerData(object = object, layer = "counts", assay = "RNA")
# how the first 5 gene' and first 5 cell counts
counts[1:5, 1:5]

## 5 x 5 sparse Matrix of class "dgCMatrix"
##        GSM4160643_WT_Cor__AAACCTGAGACCCACC-1
## Xkr4                                       .
## Sox17                                      .
## Mrpl15                                     .
## Lypla1                                     .
## Tcea1                                      .
##        GSM4160643_WT_Cor__AAACCTGAGAGTAAGG-1
## Xkr4                                       .
## Sox17                                      .
## Mrpl15                                     .
## Lypla1                                     .
## Tcea1                                      .
##        GSM4160643_WT_Cor__AAACCTGAGCGATGAC-1
## Xkr4                                       .
## Sox17                                      .
## Mrpl15                                     .
## Lypla1                                     .
## Tcea1                                      1
##        GSM4160643_WT_Cor__AAACCTGAGTCAAGCG-1
## Xkr4                                       .
## Sox17                                      .
## Mrpl15                                     .
## Lypla1                                     .
## Tcea1                                      .
##        GSM4160643_WT_Cor__AAACCTGCAAAGCGGT-1
## Xkr4                                       .
## Sox17                                      .
## Mrpl15                                     .
## Lypla1                                     .
## Tcea1                                      .

In total we have 27999 cells (columns) and 17187 genes (rows):

dim(counts)

## [1] 17953 27999

Among these entries, we can calculate how many proportion of entries are non-zero:

100 * round(sum(counts > 0) / (dim(counts)[1] * dim(counts)[2]), 2)

## [1] 4

So 96% of entries in the counts matrix are zeroes (or only 4% of the values in the entire matrix are non-zeros).

# take a look at the object metadata
head(object)

	orig.ident	nCount_RNA	nFeature_RNA	percent.mt	sample_name	sample_name2	genotype	region	mouse_model
GSM4160643_WT_Cor__AAACCTGAGACCCACC-1	GSM4160643_WT_Cor	2018	708	0.3964321	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGAGAGTAAGG-1	GSM4160643_WT_Cor	952	403	0.1050420	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGAGCGATGAC-1	GSM4160643_WT_Cor	1447	769	0.5528680	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGAGTCAAGCG-1	GSM4160643_WT_Cor	1893	878	0.4754358	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGCAAAGCGGT-1	GSM4160643_WT_Cor	2023	875	0.1482946	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGCAAGCGATG-1	GSM4160643_WT_Cor	1258	746	0.3974563	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGCAAGTAGTA-1	GSM4160643_WT_Cor	1131	512	0.2652520	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGCAGTATAAG-1	GSM4160643_WT_Cor	1790	868	0.3910615	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGCATCCGTGG-1	GSM4160643_WT_Cor	1253	831	4.5490822	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGCATGCGCAC-1	GSM4160643_WT_Cor	1459	563	1.3022618	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT

You can add a custom column to the metadata using:

# add a new metadata column
object$my_column <- "test"
head(object)

	orig.ident	nCount_RNA	nFeature_RNA	percent.mt	sample_name	sample_name2	genotype	region	mouse_model	my_column
GSM4160643_WT_Cor__AAACCTGAGACCCACC-1	GSM4160643_WT_Cor	2018	708	0.3964321	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT	test
GSM4160643_WT_Cor__AAACCTGAGAGTAAGG-1	GSM4160643_WT_Cor	952	403	0.1050420	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT	test
GSM4160643_WT_Cor__AAACCTGAGCGATGAC-1	GSM4160643_WT_Cor	1447	769	0.5528680	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT	test
GSM4160643_WT_Cor__AAACCTGAGTCAAGCG-1	GSM4160643_WT_Cor	1893	878	0.4754358	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT	test
GSM4160643_WT_Cor__AAACCTGCAAAGCGGT-1	GSM4160643_WT_Cor	2023	875	0.1482946	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT	test
GSM4160643_WT_Cor__AAACCTGCAAGCGATG-1	GSM4160643_WT_Cor	1258	746	0.3974563	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT	test
GSM4160643_WT_Cor__AAACCTGCAAGTAGTA-1	GSM4160643_WT_Cor	1131	512	0.2652520	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT	test
GSM4160643_WT_Cor__AAACCTGCAGTATAAG-1	GSM4160643_WT_Cor	1790	868	0.3910615	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT	test
GSM4160643_WT_Cor__AAACCTGCATCCGTGG-1	GSM4160643_WT_Cor	1253	831	4.5490822	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT	test
GSM4160643_WT_Cor__AAACCTGCATGCGCAC-1	GSM4160643_WT_Cor	1459	563	1.3022618	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT	test

Or delete the column from the metadata:

object$my_column <- NULL
head(object)

	orig.ident	nCount_RNA	nFeature_RNA	percent.mt	sample_name	sample_name2	genotype	region	mouse_model
GSM4160643_WT_Cor__AAACCTGAGACCCACC-1	GSM4160643_WT_Cor	2018	708	0.3964321	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGAGAGTAAGG-1	GSM4160643_WT_Cor	952	403	0.1050420	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGAGCGATGAC-1	GSM4160643_WT_Cor	1447	769	0.5528680	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGAGTCAAGCG-1	GSM4160643_WT_Cor	1893	878	0.4754358	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGCAAAGCGGT-1	GSM4160643_WT_Cor	2023	875	0.1482946	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGCAAGCGATG-1	GSM4160643_WT_Cor	1258	746	0.3974563	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGCAAGTAGTA-1	GSM4160643_WT_Cor	1131	512	0.2652520	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGCAGTATAAG-1	GSM4160643_WT_Cor	1790	868	0.3910615	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGCATCCGTGG-1	GSM4160643_WT_Cor	1253	831	4.5490822	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT
GSM4160643_WT_Cor__AAACCTGCATGCGCAC-1	GSM4160643_WT_Cor	1459	563	1.3022618	GSM4160643_WT_Cor	GSM4160643_WT_Cor	WT	Cortex	WT

You can plotsome QC metrics using the VlnPlot command:

VlnPlot(object, features = c("nCount_RNA", "nFeature_RNA", "percent.mt"))

You can seee the help associated with this (or any function) function by typing ‘?VlnPlot’ in the RStudio console.

Feature Selection

Many genes in this dataset are “not informative” - they are either lowly expressed overall, or expressed in all the celltypes. For downstream analysis, we want to identify “informative” genes - this increases the signal to noise ratio for downstream dimensionaloty reduction.

One way to select these genes would be to calculate a per-gene variance and select the ones with the highest variance. Let’s explore the mean-variance relationship of these genes:

# get UMI counts from object
counts <- LayerData(object = object, assay = "RNA", layer = "counts")

# calculate per gene mean
gene_mean <- rowMeans(x = counts)
gene_variance <- apply(X = counts, MARGIN = 1, FUN = var)

ggplot(data = data.frame("gene_mean" = gene_mean, "gene_variance" = gene_variance), aes(gene_mean, gene_variance)) +
  geom_point() +
  geom_abline(color = "red") +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Gene mean") +
  ylab("Gene variance") +
  theme_bw(base_size = 14)

So genes with higher means will also have higher variance because there is a strong mean-variance relationship exhibited by all genes. One way to adjust for this relationship is to explicitly model the mean-variance relationship by fitting a smooth curve and then identify genes that show high deviation from this “expected” pattern. This is essentially what is done inside the FindVariableFeatures function:

object <- FindVariableFeatures(object = object, selection.method = "vst")
VariableFeaturePlot(object)

In the plot above, the “variable genes” are highlhted in red and as clear from their mean values are not necessarily the most highly expressed (notice the genes with average expression > 10)

Normalization

The total number of UMIs observed for a cell are attributable to 1) cell size (higher cell size => more RNA content => more expected UMIs) and 2) sequencing depth. Let’s look at the expression of a ubiquoulsy expressed gene (Malat1- a long noncoding RNA):

total_umis <- colSums(x = counts)
gene <- "Malat1"
gene_umi <- as.numeric(counts[gene, ])

ggplot(data = data.frame("total_cell_umi" = total_umis, "gene_umi" = gene_umi), aes(total_cell_umi, gene_umi)) +
  geom_point() +
  theme_bw(base_size = 14) +
  xlab("Total sequencing depth") +
  ylab(paste0("Gene UMI(", gene))

This gene (Malat1) shows an almost linear relationship with sequencing depth of a cell. In this case the “heterogeneity across cells” in expression levels of Malat1 are primarly attributable to the fact that these cells were sequenced to different deths and is un interesting.

If all the genes showed this linear behavior we could have simply “regressed” out the effect of sequencing depth by performing a linear regression of observed gene count agaisnt the total sequencing depth of the cell. But it is not a good idea because there are also marker genes such as Apoe (an astrocytes marker), Vip (a marker for Vip interneurons - these are a type of inhibitory neurons) and C1qa (a marker of microglia)

data.to.plot <- FetchData(object = object, layer = "counts", vars = c("Apoe", "Vip", "C1qa", "nCount_RNA"))
head(data.to.plot)

	C1qa	nCount_RNA
GSM4160643_WT_Cor__AAACCTGAGACCCACC-1	0	2018
GSM4160643_WT_Cor__AAACCTGAGAGTAAGG-1	0	952
GSM4160643_WT_Cor__AAACCTGAGCGATGAC-1	0	1447
GSM4160643_WT_Cor__AAACCTGAGTCAAGCG-1	0	1893
GSM4160643_WT_Cor__AAACCTGCAAAGCGGT-1	1	2023
GSM4160643_WT_Cor__AAACCTGCAAGCGATG-1	0	1258

data.to.plot.genewise <- reshape2::melt(data.to.plot, id.vars = "nCount_RNA")
ggplot(data.to.plot.genewise, aes(nCount_RNA, value)) +
  geom_point() +
  facet_wrap(~variable) +
  xlab("Sequencing depth") +
  ylab("Gene UMI") +
  theme_bw(base_size = 14)

Given these are marker genes, they are only expressed in a subset of cells. A simple linear regression where we regress out the sequencing depth from each gene’s UMI will end up affecting these marker genes negatively - dampening their expression.

Log Normalization

Instead of regressing out sequencing depth, an alternate idea is to to use a global scale factor (say 10,000) and “scale” all cells to have this sequencing depth while also accounting for the sequencing depth of the cell. To reduce the impact of outliers, these scaled values are log transformed (after adding a pseudocount of 1). To perform this normalizing using Seurat, you can use the NormalizeData function:

object <- NormalizeData(object = object, scale.factor = 10000)

What is “good normalization”?

There are two key properties that any “good normalization” method should achieve:

The normalized expression values of a gene should have minimal dependence on the seuquencing depth
The variance of normalized gene expression values across should cells should primarily reflect biological variation and not technical noise and should be independent of gene’s mean expression level or sequencing depth of the cell.

Let’s look into the expression pattern of the same highly expressed gene after running Log Normalization:

gene <- "Malat1"
# normalized values are stored in the "data" layer
object <- NormalizeData(object = object, scale.factor = 10000)
lognormalized.data <- LayerData(object = object, assay = "RNA", layer = "data")[gene, ]
ggplot(data.frame("sequencing_depth" = colSums(x = counts), "normalized_values" = lognormalized.data), aes(sequencing_depth, normalized_values)) +
  geom_point() +
  theme_bw(base_size = 14)

Compare this relationship with an earlier figure where we were plotting the raw expression values - the correlation between y and x axis has reduced significantly. But as we will see in the next figure, this correlation is not zero.

counts <- LayerData(object = object, assay = "RNA", layer = "counts")
lognormalized.data <- LayerData(object = object, assay = "RNA", layer = "data")

# compute per gene correlation with sequencing depth of each cell
var.features <- VariableFeatures(object = object)

corr.counts <- apply(X = counts[var.features, ], MARGIN = 1, FUN = function(x) cor(x = x, y = colSums(x = counts)))
corr.lognorm <- apply(X = lognormalized.data[var.features, ], MARGIN = 1, FUN = function(x) cor(x = x, y = colSums(x = counts)))

df <- data.frame(corr = c(corr.counts, corr.lognorm), type = c(
  rep("Raw counts", length(corr.counts)),
  rep("Log normalized", length(corr.lognorm))
))

ggplot(data = df, aes(corr, fill = type)) +
  geom_histogram() +
  theme_bw(base_size = 14)

## One step normalization and feature selection: SCTransform

A second way to achieve variance stabilization (and variable feature selection) is by using SCTransform. In SCTransform, we use a regularized negative binomial model to directly model the single-cell counts. The normalized counts are then cacluclated by reversing this general reltaionship by substituting the observed count depth as the median sequencing depth. You can read the SCTransform publication for more details (SCTransform SCTransform-v2).

We will use SCTransform based analysis for the rest of the work flow.

object <- SCTransform(
  object = object, verbose = FALSE
)

Compare the correlation of SCT normalized data with sequencing depth:

sct.data <- LayerData(object = object, assay = "SCT", layer = "data")
corr.sct <- apply(X = sct.data[intersect(var.features, rownames(sct.data)), ], MARGIN = 1, FUN = function(x) cor(x = x, y = colSums(x = counts)))

df <- data.frame(corr = c(corr.counts, corr.lognorm, corr.sct), type = c(
  rep("counts", length(corr.counts)),
  rep("Log normalized", length(corr.lognorm)), rep("SCT normalized", length(corr.sct))
))

ggplot(data = df, aes(corr, fill = type)) +
  geom_histogram() +
  theme_bw(base_size = 14)

Linear Dimensionality reduction

While we measured around 17000 genes, the “intrinsic” dimensionality of data is smaller - we do not expect all 17000 genes to be orthogonal (“independent” measurements). There will be “modules” of genes that are co-expressed. We want to reduce the dimensioanloty of this dataset to a) enhance the signal to noise ratio and b) visualize this information. A common strategy in this case is to use Principle Component Analysis (PCA) which is a linear dimensionalty reduction method that outputs a smaller (20-50 usually, but upto n-1 PCs in a n dimensional dataset) orthologonal dimensions.

object <- RunPCA(object = object, npcs = 50, verbose = FALSE)

Rather than including all the PCs, we want to retain the most informative PCs. The goal here is to avoid adding extra PCs that might reflect other sources of variation (technical) assuming the major source of variation in the dataset is biological.

One way to visualize the “intrinsic dimensionality” of this dataset is to to see how the variance explained by individual PCs varies in the form of an elbow plot:

ElbowPlot(object, ndims = 30)

DimPlot(object)

The standard deviation (sequare root of variance) seems to plateau around PC30, so we will retain 30 PCs for downstream visualization and clustering.

Non linear dimensionality reduction

While he previous PCA analysis resulted in 30 dimensions, we want to further reduce the dimension of our dataset to 2 to be able to visualize it. While there are different methods to do this, we will utilize Uniform Manifold Approximation and Projection (UMAP). The goal of UMAP and related dimensionality reduction methods is to find a set of 2-D coordinates for the multidimensional input dataset such that the difference in cell-cell similarities between the low-dimensional and original-dimension dataset is minimized.

object <- RunUMAP(object, reduction = "pca", dims = 1:30)

DimPlot(object)

DimPlot(object, group.by = "region")

The above plot shows how individual cells are located in a 2D space (defined by the UMAP coordinates). Our dataset is finally ready for a deeper dive in the biology.

Clustering

We can now find “clusters” of cells that likely represent the same celltype. For single-cell data, a popular choice of methods for clustering is graph-based community detection methods that can scale to millions of cells (as opposed to tree-based methods such as hierechial clustering):

# build SNN graph
object <- FindNeighbors(object, reduction = "pca", dims = 1:30)
object <- FindClusters(object, resolution = 0.3)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 27999
## Number of edges: 1008819
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9347
## Number of communities: 17
## Elapsed time: 5 seconds

DimPlot(object, label = TRUE)

Identifying Microglia cluster

We can visualize some of the marker genes that define the Microglia (See Figure 2 in the associated paper)

FeaturePlot(object, features = c("C1qa", "Fcrls", "Tyrobp", "Csf1r", "Cx3cr1", "P2ry12"))

While we can clearly see which cells have highe expression of these genes on the UMAP, it is not clear which “cluster” these cells belong to. You can use VlnPlot to plot the expression pattern of these genes:

VlnPlot(object, features = c("C1qa", "Fcrls", "Tyrobp", "Csf1r", "Cx3cr1", "P2ry12"))

So Cluster 12 is our microglia cluster, we can separate it out to do a deeper analysis

# subset out cluster 13
all.microglia <- subset(object, idents = "12")

DimPlot(all.microglia, group.by = "genotype")

DimPlot(all.microglia, group.by = "mouse_model")

Find differentiating markers between the 5XFAD (AD) mouse and the WT mouse

Our next goal is to decipher what genes are differentially expressed in the 5XFAD mouse as compared to the WT mouse (we expect there to be some genes that will be highly expressed in WT and lowly expressed in 5X5AD and vice-versa).

To find differentially expressed (DE) genes, we first set the “identity” of the cells to be the “mouse_model” metadata column and then identify diferentially expressed genes in the 5XFAD mouse as compared to the “WT” mouse using the FindMarkers command:

# Assign the identity to be the mouse model
# among the 'mouse_model` labeled 5XFAD we have both the original 5XFAD (labeled as 'WT' under `genotype` column) and a 'TREM2KO' reflecting the TREM2KO which we remove for this part of the analysis

microglia.notrem2ko <- subset(all.microglia, genotype %in% c("WT"))
DimPlot(microglia.notrem2ko, group.by = "mouse_model")

Idents(microglia.notrem2ko) <- "mouse_model"

markers.wt_vs_5ad <- FindMarkers(microglia.notrem2ko, ident.1 = "5XFAD", ident.2 = "WT")
markers.wt_vs_5ad$gene <- rownames(markers.wt_vs_5ad)
head(markers.wt_vs_5ad)

	avg_log2FC	pct.1	pct.2	p_val_adj	gene
Cst7	5.732861	0.455	0.011	0.00e+00	Cst7
Thy1	2.095309	0.750	0.422	0.00e+00	Thy1
Ctsd	1.400086	0.955	0.911	2.00e-07	Ctsd
Apoe	1.177130	0.932	0.794	1.42e-05	Apoe
Lyz2	1.906891	0.489	0.156	5.21e-05	Lyz2
Cd74	2.814830	0.261	0.028	8.45e-05	Cd74

To visualize these genes we will make use of a “volcano plot” where on the x-axis we will plot the log fold change of each gene and the associated p-value will be plotted on the y-axis. Additinally we will also highlight top 10 DE genes in both directions (positive, i.e. upregulated in 5XFAD and negative, i.e. downregulated in 5XFAD).

library(ggrepel) # load this library to plot gene names

markers.wt_vs_5ad.top10.pos <- markers.wt_vs_5ad %>%
  filter(avg_log2FC > 0) %>%
  filter(p_val_adj < 0.05) %>%
  top_n(n = 10, wt = avg_log2FC)
markers.wt_vs_5ad.top10.neg <- markers.wt_vs_5ad %>%
  filter(avg_log2FC < 0) %>%
  filter(p_val_adj < 0.05) %>%
  top_n(n = 10, wt = abs(avg_log2FC))
markers.wt_vs_5ad.top10 <- rbind(markers.wt_vs_5ad.top10.pos, markers.wt_vs_5ad.top10.neg)

ggplot(markers.wt_vs_5ad, aes(avg_log2FC, -log10(p_val))) +
  geom_point() +
  geom_point(data = markers.wt_vs_5ad.top10, aes(avg_log2FC, -log10(p_val)), color = "red") +
  geom_text_repel(data = markers.wt_vs_5ad.top10, aes(avg_log2FC, -log10(p_val), label = gene)) +
  xlab("average log2 Fold Change") +
  ylab("-log10(Pvalue)") +
  theme_classic(base_size = 14)

Functional enrichment

On obvious question to ask with the list of DE genes is what function are these carrying? We can perform a gene ontology enrichment using the gene lists to see what functional categories is each of the DE list enriched in.

library(enrichR) # make sure this library is installed  to do GO enrichments

# defne a custom function to carry out enrichment
DoEnrichment <- function(genes) {
  library(enrichR)
  dbs <- c("GO_Molecular_Function_2021", "GO_Cellular_Component_2021", "GO_Biological_Process_2021")
  enriched <- enrichR::enrichr(genes, dbs)
  return(enriched)
}

de.pos <- markers.wt_vs_5ad %>%
  filter(avg_log2FC > 0) %>%
  filter(p_val_adj < 0.1) %>%
  pull(gene)

de.neg <- markers.wt_vs_5ad %>%
  filter(avg_log2FC < 0) %>%
  filter(p_val_adj < 0.1) %>%
  pull(gene)


enrichment.pos <- DoEnrichment(genes = de.pos)

## Uploading data to Enrichr... Done.
##   Querying GO_Molecular_Function_2021... Done.
##   Querying GO_Cellular_Component_2021... Done.
##   Querying GO_Biological_Process_2021... Done.
## Parsing results... Done.

enrichment.neg <- DoEnrichment(genes = de.neg)

## Uploading data to Enrichr... Done.
##   Querying GO_Molecular_Function_2021... Done.
##   Querying GO_Cellular_Component_2021... Done.
##   Querying GO_Biological_Process_2021... Done.
## Parsing results... Done.

plotEnrich(df = enrichment.pos$GO_Biological_Process_2021, showTerms = 7) + ggtitle("Enrichment of 5XFAD upregulated genes")

plotEnrich(df = enrichment.neg$GO_Biological_Process_2021, showTerms = 7) + ggtitle("Enrichment of 5XFAD downregulated genes")

We should compare your results with the following claim in the paper:

DAM genes, including Cst7, Csf1, Apoe, Trem2, Lpl, Lilrb4a, H2-d1 (an MHC-I gene), Cd74 (an MHC-II-related gene) and various cathepsin genes were notably upregulated in 5XFAD compared with WT mice. Homeostatic genes, such as P2ry12, Selplg, Tmem119 and Cx3cr1, were downregulated. These results were highly concordant between the 7-month- and 15-month-old datasets (Fig. 2d) and with previously published single-cell RNAseq data of sorted microglia.

List of genes upregulated in 5XFAD:

de.pos

##  [1] "Cst7"  "Thy1"  "Ctsd"  "Apoe"  "Lyz2"  "Cd74"  "B2m"   "H2-K1" "Gpnmb"
## [10] "Ctsb"  "Spp1"  "H2-D1" "Cpne7" "Ctsz"  "Trem2"

List of genes downregulated in 5XFAD:

de.neg

## [1] "Cst3"   "Selplg"

Exercise

Identify what genes are differentially expressed in 5XFAD mouse as compared to the WT mouse.

Reveal solution

The mouse_model column stores the information about it being a WT or an AD (5XFAD) mouse. In this study the authors were also interested in asking what changes happen to the microglial gene program if TREM2 is knocked out (given the prior knowledge about microglial transcriptional response dependence on TREM2). In the above analysis we removed cells that were not labeled as “WT” under their “genotype” column. Here we will subset a different set of cells which are all “5XFAD” but can have different genotype (“WT” or “TREM2KO”)

microglia.5xfad <- subset(all.microglia, mouse_model %in% c("5XFAD"))

# now we set the identities to the genotype
Idents(microglia.5xfad) <- "genotype"
markers.5ad_vs_trem2ko <- FindMarkers(microglia.5xfad, ident.1 = "WT", ident.2 = "Trem2KO")
markers.5ad_vs_trem2ko$gene <- rownames(markers.5ad_vs_trem2ko)
head(markers.5ad_vs_trem2ko)

	p_val	avg_log2FC	pct.1	pct.2	p_val_adj	gene
Trem2	0.0000000	4.479658	0.920	0.105	1.63e-05	Trem2
Cfap20	0.0001355	-3.211504	0.023	0.263	1.00e+00	Cfap20
Ubqln2	0.0001776	-4.211504	0.000	0.158	1.00e+00	Ubqln2
Col16a1	0.0001776	-4.211504	0.000	0.158	1.00e+00	Col16a1
Cep135	0.0001776	-4.211504	0.000	0.158	1.00e+00	Cep135
Ddx25	0.0001776	-4.211504	0.000	0.158	1.00e+00	Ddx25

markers.5ad_vs_trem2ko.top10.pos <- markers.5ad_vs_trem2ko %>%
  filter(avg_log2FC > 0) %>%
  filter(p_val_adj < 0.05) %>%
  top_n(n = 10, wt = avg_log2FC)
markers.5ad_vs_trem2ko.top10.neg <- markers.5ad_vs_trem2ko %>%
  filter(avg_log2FC < 0) %>%
  filter(p_val_adj < 0.05) %>%
  top_n(n = 10, wt = abs(avg_log2FC))
markers.5ad_vs_trem2ko.top10 <- rbind(markers.5ad_vs_trem2ko.top10.pos, markers.5ad_vs_trem2ko.top10.neg)

ggplot(markers.5ad_vs_trem2ko, aes(avg_log2FC, -log10(p_val))) +
  geom_point() +
  geom_point(data = markers.5ad_vs_trem2ko.top10, aes(avg_log2FC, -log10(p_val)), color = "red") +
  geom_text_repel(data = markers.5ad_vs_trem2ko.top10, aes(avg_log2FC, -log10(p_val), label = gene)) +
  xlab("average log2 Fold Change") +
  ylab("-log10(Pvalue)")

Session Info

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin23.4.0
## Running under: macOS 15.0.1
## 
## Matrix products: default
## BLAS:   /opt/homebrew/Cellar/openblas/0.3.28/lib/libopenblasp-r0.3.28.dylib 
## LAPACK: /opt/homebrew/Cellar/r/4.4.1/lib/R/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Kolkata
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] enrichR_3.2        ggrepel_0.9.6      patchwork_1.3.0    Seurat_5.1.0      
##  [5] SeuratObject_5.0.2 sp_2.1-4           lubridate_1.9.3    forcats_1.0.0     
##  [9] stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2        readr_2.1.5       
## [13] tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1      tidyverse_2.0.0   
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          rstudioapi_0.16.0          
##   [3] jsonlite_1.8.9              magrittr_2.0.3             
##   [5] spatstat.utils_3.1-0        ggbeeswarm_0.7.2           
##   [7] farver_2.1.2                rmarkdown_2.28             
##   [9] zlibbioc_1.50.0             vctrs_0.6.5                
##  [11] ROCR_1.0-11                 DelayedMatrixStats_1.26.0  
##  [13] spatstat.explore_3.3-2      S4Arrays_1.4.1             
##  [15] htmltools_0.5.8.1           curl_5.2.3                 
##  [17] SparseArray_1.4.8           sass_0.4.9                 
##  [19] sctransform_0.4.1           parallelly_1.38.0          
##  [21] KernSmooth_2.23-24          bslib_0.8.0                
##  [23] htmlwidgets_1.6.4           ica_1.0-3                  
##  [25] plyr_1.8.9                  plotly_4.10.4              
##  [27] zoo_1.8-12                  cachem_1.1.0               
##  [29] igraph_2.0.3                mime_0.12                  
##  [31] lifecycle_1.0.4             pkgconfig_2.0.3            
##  [33] Matrix_1.7-0                R6_2.5.1                   
##  [35] fastmap_1.2.0               GenomeInfoDbData_1.2.12    
##  [37] MatrixGenerics_1.16.0       fitdistrplus_1.2-1         
##  [39] future_1.34.0               shiny_1.9.1                
##  [41] digest_0.6.37               colorspace_2.1-1           
##  [43] S4Vectors_0.42.1            tensor_1.5                 
##  [45] RSpectra_0.16-2             irlba_2.3.5.1              
##  [47] GenomicRanges_1.56.1        WriteXLS_6.7.0             
##  [49] labeling_0.4.3              progressr_0.14.0           
##  [51] fansi_1.0.6                 spatstat.sparse_3.1-0      
##  [53] timechange_0.3.0            httr_1.4.7                 
##  [55] polyclip_1.10-7             abind_1.4-8                
##  [57] compiler_4.4.1              withr_3.0.1                
##  [59] fastDummies_1.7.4           highr_0.11                 
##  [61] MASS_7.3-61                 DelayedArray_0.30.1        
##  [63] rjson_0.2.23                tools_4.4.1                
##  [65] vipor_0.4.7                 lmtest_0.9-40              
##  [67] beeswarm_0.4.0              httpuv_1.6.15              
##  [69] future.apply_1.11.2         goftest_1.2-3              
##  [71] glmGamPoi_1.16.0            glue_1.8.0                 
##  [73] nlme_3.1-166                promises_1.3.0             
##  [75] grid_4.4.1                  Rtsne_0.17                 
##  [77] cluster_2.1.6               reshape2_1.4.4             
##  [79] generics_0.1.3              gtable_0.3.5               
##  [81] spatstat.data_3.1-2         tzdb_0.4.0                 
##  [83] data.table_1.16.0           hms_1.1.3                  
##  [85] XVector_0.44.0              utf8_1.2.4                 
##  [87] BiocGenerics_0.50.0         spatstat.geom_3.3-3        
##  [89] RcppAnnoy_0.0.22            RANN_2.6.2                 
##  [91] pillar_1.9.0                limma_3.60.4               
##  [93] spam_2.10-0                 RcppHNSW_0.6.0             
##  [95] later_1.3.2                 splines_4.4.1              
##  [97] lattice_0.22-6              survival_3.7-0             
##  [99] deldir_2.0-4                tidyselect_1.2.1           
## [101] miniUI_0.1.1.1              pbapply_1.7-2              
## [103] knitr_1.48                  gridExtra_2.3              
## [105] IRanges_2.38.1              SummarizedExperiment_1.34.0
## [107] scattermore_1.2             stats4_4.4.1               
## [109] xfun_0.47                   Biobase_2.64.0             
## [111] statmod_1.5.0               matrixStats_1.4.1          
## [113] UCSC.utils_1.0.0            stringi_1.8.4              
## [115] lazyeval_0.2.2              yaml_2.3.10                
## [117] evaluate_1.0.0              codetools_0.2-20           
## [119] cli_3.6.3                   uwot_0.2.2                 
## [121] xtable_1.8-4                reticulate_1.39.0          
## [123] munsell_0.5.1               jquerylib_0.1.4            
## [125] GenomeInfoDb_1.40.1         Rcpp_1.0.13                
## [127] globals_0.16.3              spatstat.random_3.3-2      
## [129] png_0.1-8                   ggrastr_1.0.2              
## [131] spatstat.univar_3.0-1       parallel_4.4.1             
## [133] dotCall64_1.1-1             sparseMatrixStats_1.16.0   
## [135] listenv_0.9.1               viridisLite_0.4.2          
## [137] scales_1.3.0.9000           ggridges_0.5.6             
## [139] crayon_1.5.3                leiden_0.4.3.1             
## [141] rlang_1.1.4                 cowplot_1.1.3

Analysis of mouse brain data

Compiled: October 15, 2024