Day 02 - Single-Cell RNA-seq Analysis Workshop

Author

Saket Choudhary for QSCB 2025

Published

December 18, 2025

1 Premise

For this session, we will work with the human liver single-cell RNA-seq dataset profiled in http://livercellatlas.org/ published as part of Guilliams et al. 2022. There is a lot of data in this paper to familiarise you with different data modalities. To quote:

The liver is the largest solid organ in the body, yet it remains incompletely characterized. Here we present a spatial proteogenomic atlas of the healthy and obese human and murine liver combining single-cell CITE-seq, single-nuclei sequencing, spatial transcriptomics, and spatial proteomics

We will focus on single-cell RNA-sequencing for this tutorial.

2 Downloading data

The authors have deposited their data on GEO and on their own website. We will work with the dataset deposited on their website to keep things uniform. But irrespective, we will walk you through how to download data and metadata from GEO/SRA.

SRA stores the raw sequencing data. While GEO stores the processed data - in our case, these include the count matrices, the Seurat objects and the metadata.

We will fetch the metadata using pysradb:

Code
# run this on your bash terminal
pip install git+https://github.com/saketkc/pysradb@develop -q #install pysradb
pysradb metadata GSE192740 --detailed --saveto data/metadata/gex_GSE192740_metadata.tsv # get metadata for the GEO records
pysradb metadata SRP352824 --detailed --saveto data/metadata/fastq_GSE192740_metadata.tsv # get metadata for the SRA records

2.1 Why are we fetching both the SRA and GEO metadata?

In most cases, you will directly work with the GEO metadata. However, since we are already going to process the raw fastq files, we would want to keep this handy as well

Code
 head gex_GSE192740_metadata.tsv  
study_accession study_title     study_summary   organism_name   platform_accession      platform_title  experiment_type bioproject      submission_date     supplementary_files     series_ftp      sample_accession        sample_title    sample_ftp      sample_supplementary    sample_geo2r        sample_type     sample_source_name      sex     age     tissue  cell_type       disease treatment       extract_protocol   label_protocol   sample_geo_accession    sample_status   sample_submission_date  sample_last_update_date sample_channel_count    sample_source_name_ch1      sample_organism_ch1     sample_taxid_ch1        sample_molecule_ch1     sample_extract_protocol_ch1     sample_description sample_data_processing   sample_platform_id      sample_contact_name     sample_contact_email    sample_contact_department       sample_contact_institute    sample_contact_address  sample_contact_city     sample_contact_zip/postal_code  sample_contact_country  sample_instrument_model     sample_library_selection        sample_library_source   sample_library_strategy sample_relation sample_supplementary_file_1     sample_series_id    sample_data_row_count   shortfilename   strain  platform        digestion method        number of added abs     number of cells     ena_fastq_http_1        ena_fastq_http_2        ena_fastq_ftp_1 ena_fastq_ftp_2

2.2 Download counts matrices from geo

Code
pysradb download -g GSE192740 --out-dir data/geo

2.3 Download counts matrices from website

Code
# run in bash terminal
wget --content-disposition -c "https://livercellatlas.org/data_files/toDownload/rawData_human.zip" -O "data/rawData_human.zip"
wget --content-disposition -c "https://livercellatlas.org/data_files/toDownload/annot_humanAll.csv" -O "data/annot_humanAll.csv"

We will work with the human data for the rest of the workflow.

Code
suppressPackageStartupMessages({
  library(Seurat)
  library(tidyverse)
  library(ggpubr)
  library(sparseMatrixStats)
})
theme_set(theme_pubr())

counts <- Read10X(data.dir = "/NAS/qscb2025/Guilliams_2022_livercellatlas/Raw_data/rawData_human/countTable_human/", gene.column = 1)

counts.metadata <- read.csv("/NAS/qscb2025/Guilliams_2022_livercellatlas/Raw_data/annot_humanAll.csv")
rownames(counts.metadata) <- counts.metadata$cell
sample.scrna <- counts.metadata %>% filter(typeSample == "scRnaSeq")

Let’s look at what the metadata has:

Code
head(sample.scrna)
Code
table(sample.scrna$patient) %>% sort()

  H04   H10   H38   H14   H22   H07   H21   H16   H02   H25   H23   H06 
 1602  1664  3981  4669  6077  6295  8524  8870  9214  9350 10891 11799 

We will take few samples: H07, H14, H06, H16, H23

Code
barcodes.scrna <- intersect(rownames(counts.metadata), colnames(counts))
object.scrna <- CreateSeuratObject(counts = counts[, barcodes.scrna], meta.data = sample.scrna[barcodes.scrna, ])
Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')
Code
rm(counts)
Code
object.subset <- subset(object.scrna, patient %in% c("H06", "H07", "H14", "H16", "H23"))
Warning: Removing 84662 cells missing data for vars requested
Code
rm(object.scrna)
rm(counts.metadata)

3 Exploring the object

Before we do any downstream processing, it is always a good idea to explore the object at hand.

The counts (UMIs) are stored in the “counts” layer of the “RNA” assay”. You can read about the details of how Seurat stores its information on this wiki page.

Code
counts <- LayerData(object.subset, layer = "counts", assay = "RNA")


# how the first 5 gene' and first 5 cell counts
counts[1:5, 1:5]
5 x 5 sparse Matrix of class "dgCMatrix"
             AAACCTGAGCCACCTG-11 AAACCTGAGCTGAAAT-11 AAACCTGAGGACATTA-11
MIR1302-10                     .                   .                   .
FAM138A                        .                   .                   .
OR4F5                          .                   .                   .
RP11-34P13.7                   .                   .                   .
RP11-34P13.8                   .                   .                   .
             AAACCTGAGTGAACGC-11 AAACCTGCACACCGCA-11
MIR1302-10                     .                   .
FAM138A                        .                   .
OR4F5                          .                   .
RP11-34P13.7                   .                   .
RP11-34P13.8                   .                   .

In total we have 42524 cells (columns) and 32738 genes (rows):

Code
dim(counts)
[1] 32738 42524

Among these entries, we can calculate how many proportion of entries are non-zero:

Code
100 * round(sum(counts > 0) / (dim(counts)[1] * dim(counts)[2]), 2)
[1] 4

So 96% of entries in the counts matrix are zeroes (or only 4% of the values in the entire matrix are non-zeros).

We can ask whether these zeroes are ‘technical’ or something that would be observed biological.

Code
# for each gene count the number of  zeroes

num_zeros <- ncol(counts) - Matrix::rowSums(counts != 0)
prop_zeroes <- num_zeros / ncol(counts)
df <- data.frame(gene_mean = rowMeans(counts), gene_variance = rowVars(counts), prop_zeroes = prop_zeroes)
ggplot(df, aes(gene_mean, prop_zeroes)) +
  geom_point() +
  scale_x_log10()
Warning in scale_x_log10(): log-10 transformation introduced infinite values.

Let’s try to fit a distribution here. One common distribution that is often used is poisson. The mean of a poisson with

Code
df$expected_prop_zeros <- exp(-df$gene_mean) # this is P(X=0| lambda) for a poisson
ggplot(df, aes(x = gene_mean, y = prop_zeroes)) +
  geom_point(alpha = 0.3, size = 1) +
  geom_line(aes(y = expected_prop_zeros),
    color = "red",
    linewidth = 1.5
  ) +
  scale_x_log10() +
  scale_y_continuous(limits = c(0, 1)) +
  labs(
    x = "Gene mean (log10 scale)",
    y = "Proportion of zeros",
  ) +
  annotate("text",
    x = max(df$gene_mean) / 100, y = 0.9,
    label = "Red line: Poisson expectation",
    color = "red", hjust = 0
  ) +
  theme_bw()
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.

Code
rm(counts)
gc()
           used  (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  4458213 238.1    7493985   400.3    7493985   400.3
Vcells 94371522 720.0 1473065000 11238.6 1786966387 13633.5

3.1 Metadata

Code
# take a look at the object metadata
head(object.subset)

You can add a custom column to the metadata using:

Code
# add a new metadata column
object.subset$my_column <- "test"
head(object.subset)

Or delete the column from the metadata:

Code
object.subset$my_column <- NULL
head(object.subset)

4 QC checks

We will calculate the percentage of mitochondria in all samples:

Code
object.subset[["percent.mt"]] <- PercentageFeatureSet(object.subset, pattern = "^MT-")
object.subset[["percent.ribo"]] <- PercentageFeatureSet(object.subset, pattern = "^RP[SL][[:digit:]]")

Before we proceed, let’s look at the distribution of number of UMIs and number of genes detected in each samples

Code
Idents(object.subset) <- "patient"
VlnPlot(object.subset, features = c("nCount_RNA", "nFeature_RNA", "percent.mt", "percent.ribo"))
Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.

Higher mitochondrial percentage cells are under stress (or about to die) and ideally we would want to keep them out for the rest of the analysis

Code
object.subset <- subset(object.subset, percent.mt < 10)
Code
VlnPlot(object.subset, features = c("nCount_RNA", "nFeature_RNA", "percent.mt", "percent.ribo"))
Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.

4.1 Normalisation

Before processding with any biological interpretation with scRNA-seq data, we need to get rid of any technical factors that might lead to differences across cells. One such technical difference is the total UMIs in each cell.

In an ideal case scenario, we would have wanted the ‘sequencing depth’ of each cell, i.e. the total UMIs for each cell to be the same. We also expect that the mean of gene expression and variance should not be related.

Code
library(sparseMatrixStats)
counts.subset <- LayerData(object = object.subset, layer = "counts", assay = "RNA")
df <- data.frame(mean = rowMeans(counts.subset), variance = rowVars(counts.subset))
ggplot(df) +
  geom_point(aes(mean, variance)) +
  scale_x_log10() +
  scale_y_log10() +
  geom_abline() +
  theme_bw(base_size = 14)
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Warning in scale_y_log10(): log-10 transformation introduced infinite values.

A standard approach of achieving normalisation is to divide the counts per cell by the total counts and multiply by a large number (preferably the median sequencing depth) so that all the cell counts sum to the same (large) value. Additionally, a log is taken to reduce the impact of the large outliers (which also stabilises the variance):

Code
object.subset <- NormalizeData(object.subset, scale.factor = median(object.subset$nCount_RNA))
Normalizing layer: counts

The normalised data is stored in the data layer:

Code
layer.data <- LayerData(object = object.subset, layer = "data", assay = "RNA")
df <- data.frame(mean = rowMeans(layer.data), variance = rowVars(layer.data))
ggplot(df) +
  geom_point(mapping = aes(mean, variance)) +
  geom_abline() +
  theme_bw(base_size = 14)

So the mean variance has no “stabilized”. , however you will notice that the variance stabilization is not uniform - there is a bend in the curve at the beginning.

4.2 Feature selection

Many genes in this dataset are “not informative” - they are either lowly expressed overall, or expressed in all the celltypes. For downstream analysis, we want to identify “informative” genes - this increases the signal to noise ratio for downstream dimensionaloty reduction.

One way to select these genes would be to calculate a per-gene variance and select the ones with the highest variance. Let’s explore the mean-variance relationship of these genes:

Code
# get UMI counts from object
counts <- LayerData(object = object.subset, assay = "RNA", layer = "counts")

# calculate per gene mean
gene_mean <- rowMeans(x = counts)
# gene_variance <- apply(X = counts, MARGIN = 1, FUN = var)
gene_variance <- rowVars(x = counts)


ggplot(data = data.frame("gene_mean" = gene_mean, "gene_variance" = gene_variance), aes(gene_mean, gene_variance)) +
  geom_point() +
  geom_abline(color = "red") +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Gene mean") +
  ylab("Gene variance") +
  theme_bw(base_size = 14)
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Warning in scale_y_log10(): log-10 transformation introduced infinite values.

So genes with higher means will also have higher variance because there is a strong mean-variance relationship exhibited by all genes. One way to adjust for this relationship is to explicitly model the mean-variance relationship by fitting a smooth curve and then identify genes that show high deviation from this “expected” pattern. This is essentially what is done inside the FindVariableFeatures function:

Code
object.subset <- FindVariableFeatures(object = object.subset, selection.method = "vst")
Finding variable features for layer counts
Code
VariableFeaturePlot(object.subset)
Warning in scale_x_log10(): log-10 transformation introduced infinite values.

In the plot above, the “variable genes” are highlhted in red and as clear from their mean values are not necessarily the most highly expressed (notice the genes with average expression > 10)

The total number of UMIs observed for a cell are attributable to 1) cell size (higher cell size => more RNA content => more expected UMIs) and 2) sequencing depth. Let’s look at the expression of a ubiquoulsy expressed gene (Malat1- a long noncoding RNA):

Code
total_umis <- colSums(x = counts)
gene <- "MALAT1"
gene_umi <- as.numeric(counts[gene, ])

ggplot(data = data.frame("total_cell_umi" = total_umis, "gene_umi" = gene_umi), aes(total_cell_umi, gene_umi)) +
  geom_point() +
  theme_bw(base_size = 14) +
  xlab("Total sequencing depth") +
  ylab(paste0("Gene UMI(", gene))

This gene (MALAT1) shows an almost linear relationship with sequencing depth of a cell. In this case the “heterogeneity across cells” in expression levels of Malat1 are primarly attributable to the fact that these cells were sequenced to different depths and is uninteresting.

If all the genes showed this linear behaviour, we could have simply “regressed” out the effect of sequencing depth by performing a linear regression of observed gene count against the total sequencing depth of the cell. But it is not a good idea because there are also marker genes such as ALB, (hepatocytes), MS4A1 (a marker for BCells) and CD2 (a marker of T Cells)

Code
data.to.plot <- FetchData(object = object.subset, layer = "counts", vars = c("ALB", "MS4A1", "CD2", "nCount_RNA"))
head(data.to.plot)
Code
data.to.plot.genewise <- reshape2::melt(data.to.plot, id.vars = "nCount_RNA")
ggplot(data.to.plot.genewise, aes(nCount_RNA, value)) +
  geom_point() +
  facet_wrap(~variable) +
  xlab("Sequencing depth") +
  ylab("Gene UMI") +
  theme_bw(base_size = 14)

Given these are marker genes, they are only expressed in a subset of cells. A simple linear regression where we regress out the sequencing depth from each gene’s UMI will end up affecting these marker genes negatively - dampening their expression.

4.2.1 Log Normalization

Instead of regressing out sequencing depth, an alternate idea is to to use a global scale factor (say 10,000) and “scale” all cells to have this sequencing depth while also accounting for the sequencing depth of the cell. To reduce the impact of outliers, these scaled values are log transformed (after adding a pseudocount of 1). To perform this normalizing using Seurat, you can use the NormalizeData function:

Code
object <- object.subset
object <- NormalizeData(object = object, scale.factor = median(object$nCount_RNA))
Normalizing layer: counts
Code
object <- ScaleData(object = object)
Centering and scaling data matrix

4.2.2 What is “good normalization”?

There are two key properties that any “good normalization” method should achieve:

  1. The normalized expression values of a gene should have minimal dependence on the seuquencing depth

  2. The variance of normalized gene expression values across should cells should primarily reflect biological variation and not technical noise and should be independent of gene’s mean expression level or sequencing depth of the cell.

Let’s look into the expression pattern of the same highly expressed gene after running Log Normalization:

Code
gene <- "MALAT1"
lognormalized.data <- LayerData(object = object, assay = "RNA", layer = "data")[gene, ]
ggplot(data.frame("sequencing_depth" = colSums(x = counts), "normalized_values" = lognormalized.data), aes(sequencing_depth, normalized_values)) +
  geom_point() +
  theme_bw(base_size = 14)

4.3 One step normalization and feature selection: SCTransform

A second way to achieve variance stabilization (and variable feature selection) is by using SCTransform. In SCTransform, we use a regularized negative binomial model to directly model the single-cell counts. The normalized counts are then cacluclated by reversing this general reltaionship by substituting the observed count depth as the median sequencing depth. You can read the SCTransform publication for more details (SCTransform SCTransform-v2).

We will use SCTransform based analysis for the rest of the work flow.

Code
object <- SCTransform(object = object, verbose = FALSE)
Code
VariableFeatures(object) %>% head()
[1] "FCN3"   "IGJ"    "C1QB"   "S100A8" "S100A9" "C1QA"  

Have we achieved normalisation?

Code
sct.residual <- LayerData(object = object, assay = "SCT", layer = "scale.data")
sct.data <- LayerData(object = object, assay = "SCT", layer = "data")
sct.data.variance <- rowVars(sct.data)
log.data <- LayerData(object = object, assay = "RNA", layer = "data")
lognorm.data.variance <- rowVars(log.data)

sct.residual.variance <- rowVars(sct.residual)

df <- data.frame(gene_mean = rowMeans(counts)[names(sct.data.variance)], data_variance = sct.data.variance)
df$type <- "SCT"
df2 <- data.frame(gene_mean = rowMeans(counts)[names(sct.data.variance)], data_variance = lognorm.data.variance[names(sct.data.variance)])
df2$type <- "LogNorm"
df3 <- rbind(df, df2)

ggplot(df3) +
  geom_point(aes(gene_mean, data_variance, color = type), size = 0.1) +
  scale_x_log10()

They look similarish (for this dataset)

4.4 Linear dimensionality reduction

While we measured around 30000 genes, the “intrinsic” dimensionality of data is smaller - we do not expect all 17000 genes to be orthogonal (“independent” measurements). There will be “modules” of genes that are co-expressed. We want to reduce the dimensioanloty of this dataset to a) enhance the signal to noise ratio and b) visualize this information. A common strategy in this case is to use Principle Component Analysis (PCA) which is a linear dimensionalty reduction method that outputs a smaller (20-50 usually, but upto n-1 PCs in a n dimensional dataset) orthologonal dimensions.

Code
object <- RunPCA(object = object, npcs = 50, verbose = FALSE)

Rather than including all the PCs, we want to retain the most informative PCs. The goal here is to avoid adding extra PCs that might reflect other sources of variation (technical) assuming the major source of variation in the dataset is biological.

One way to visualize the “intrinsic dimensionality” of this dataset is to to see how the variance explained by individual PCs varies in the form of an elbow plot:

Code
ElbowPlot(object, ndims = 50)

The standard deviation (square root of variance) seems to plateau around PC30, so we will retain 30 PCs for downstream visualisation and clustering.

Code
DimPlot(object, group.by = "patient")

Code
DimPlot(object, group.by = "annot")

The standard deviation (sequare root of variance) seems to plateau around PC30, so we will retain 30 PCs for downstream visualization and clustering.

Code
DimPlot(object)

4.5 Non linear dimensionality reduction

While the previous PCA analysis resulted in 30 dimensions, we want to further reduce the dimension of our dataset to 2 to be able to visualize it. While there are different methods to do this, we will utilize Uniform Manifold Approximation and Projection (UMAP). The goal of UMAP and related dimensionality reduction methods is to find a set of 2-D coordinates for the multidimensional input dataset such that the difference in cell-cell similarities between the low-dimensional and original-dimension dataset is minimised.

Code
object <- RunUMAP(object, reduction = "pca", dims = 1:30, verbose = FALSE)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
Code
DimPlot(object)

The above plot shows how individual cells are located in a 2D space (defined by the UMAP coordinates). Our dataset is finally ready for a deeper dive in the biology.

4.6 Clustering

We can now find “clusters” of cells that likely represent the same celltype. For single-cell data, a popular choice of methods for clustering is graph-based community detection methods that can scale to millions of cells (as opposed to tree-based methods such as hierechial clustering):

Code
# build SNN graph
object <- FindNeighbors(object, reduction = "pca", dims = 1:30, verbose = FALSE)
object <- FindClusters(object, resolution = 0.3, verbose = FALSE)
Code
DimPlot(object, label = TRUE)

Code
DimPlot(object, label = TRUE, group.by = "patient")

5 Performing integration

Given the “samples do not mix that well, we will consider integrating these:

Code
DefaultAssay(object) <- "RNA"
merged <- JoinLayers(object)
merged[["RNA"]] <- split(merged[["RNA"]], f = merged$patient)
Splitting 'counts', 'data' layers. Not splitting 'scale.data'. If you would like to split other layers, set in `layers` argument.
Code
merged <- SCTransform(merged, verbose = FALSE)
Warning: Different cells and/or features from existing assay SCT
Code
merged <- RunPCA(merged, verbose = FALSE)
Code
integrated <- IntegrateLayers(
  object = merged, method = HarmonyIntegration, normalization.method = "SCT",
  orig.reduction = "pca", new.reduction = "integrated.harmony",
  verbose = FALSE
)
The `features` argument is ignored by `HarmonyIntegration`.
This message is displayed once per session.
Code
integrated <- RunUMAP(integrated,
  reduction = "integrated.harmony", dims = 1:20,
  reduction.name = "umap.harmony", verbose = FALSE
)

We can visualise whether integration has worked or not by doing:

Code
DimPlot(integrated, reduction = "umap.harmony", group.by = c("patient", "annot"), shuffle = T)

We can also use CCA for integration (takes longer):

Code
integrated <- IntegrateLayers(
  object = merged, method = CCAIntegration, normalization.method = "SCT",
  orig.reduction = "pca", new.reduction = "integrated.cca",
  verbose = FALSE
)
integrated <- RunUMAP(integrated,
  reduction = "integrated.cca", dims = 1:20,
  reduction.name = "umap.cca", verbose = FALSE
)

DimPlot(integrated, reduction = "umap.cca", group.by = c("patient", "annot"), shuffle = T)

Once we are happy with our integration, we can proceed with clustering:

Code
integrated <- FindNeighbors(integrated, reduction = "integrated.cca", dims = 1:20, verbose = FALSE)
integrated <- FindClusters(integrated,
  resolution = 0.1,
  cluster.name = "cca_clusters", verbose = FALSE
)
Code
DimPlot(integrated, label = T, reduction = "umap.cca")

6 Identifying Hepatocyte cluster

Code
FeaturePlot(integrated, features = "ALB", reduction = "umap.cca")

Code
VlnPlot(integrated, features = "ALB", sort = T)

7 Find differentiating markers between the obese and lean diet in Macrophages

Our next goal is to decipher what genes are differentially expressed in the obese individuals as compared to the lean individauls.

To find differentially expressed (DE) genes, we first set the “identity” of the cells to be the “mouse_model” metadata column and then identify diferentially expressed genes in the Obese individuals as compared to the lean individuals using the FindMarkers command:

Code
integrated <- PrepSCTFindMarkers(integrated)
Found 5 SCT models. Recorrecting SCT counts using minimum median counts: 2188
Code
macro <- subset(integrated, annot %in% c("Macrophages"))


Idents(macro) <- "diet"

markers.obese_vs_lean <- FindMarkers(macro, ident.1 = "Obese", ident.2 = "Lean", recorrect_umi = F)
markers.obese_vs_lean$gene <- rownames(markers.obese_vs_lean)
head(markers.obese_vs_lean)

To visualize these genes we will make use of a “volcano plot” where on the x-axis we will plot the log fold change of each gene and the associated p-value will be plotted on the y-axis. Additinally we will also highlight top 10 DE genes in both directions (positive, i.e. upregulated in Obese and negative, i.e. downregulated in Obese).

Code
library(ggrepel) # load this library to plot gene names

markers.obese_vs_lean.top10.pos <- markers.obese_vs_lean %>%
  filter(avg_log2FC > 0) %>%
  filter(p_val_adj < 0.05) %>%
  top_n(n = 10, wt = avg_log2FC)
markers.obese_vs_lean.top10.neg <- markers.obese_vs_lean %>%
  filter(avg_log2FC < 0) %>%
  filter(p_val_adj < 0.05) %>%
  top_n(n = 10, wt = abs(avg_log2FC))
markers.obese_vs_lean.top10 <- rbind(markers.obese_vs_lean.top10.pos, markers.obese_vs_lean.top10.neg)

ggplot(markers.obese_vs_lean, aes(avg_log2FC, -log10(p_val))) +
  geom_point() +
  geom_point(data = markers.obese_vs_lean.top10, aes(avg_log2FC, -log10(p_val)), color = "red") +
  geom_text_repel(data = markers.obese_vs_lean.top10, aes(avg_log2FC, -log10(p_val), label = gene)) +
  xlab("average log2 Fold Change") +
  ylab("-log10(Pvalue)")
Warning: ggrepel: 7 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Can you notice what is going on? Why is there XIST? Why are they Ychromosome genes? Did we capture all the nuances of the experiement design?

8 Functional enrichment

The abovce analysis clearly denotes there is On obvious question to ask with the list of DE genes is what function are these carrying? We can perform a gene ontology enrichment using the gene lists to see what functional categories is each of the DE list enriched in.

Code
suppressPackageStartupMessages({
  library(clusterProfiler)
  library(org.Hs.eg.db)
})


universe.genes <- rownames(integrated@assays$SCT@data)
DoEnrichment <- function(genes, universe.genes) {
  enrich <- enrichGO(
    gene = genes,
    universe = universe.genes,
    OrgDb = org.Hs.eg.db,
    keyType = "SYMBOL",
    ont = "BP",
    pvalueCutoff = 0.05,
    qvalueCutoff = 0.05,
    readable = TRUE
  )
  return(enrich)
}
de.pos <- markers.obese_vs_lean %>%
  filter(avg_log2FC > 0) %>%
  filter(p_val_adj < 0.1) %>%
  pull(gene)

de.neg <- markers.obese_vs_lean %>%
  filter(avg_log2FC < 0) %>%
  filter(p_val_adj < 0.1) %>%
  pull(gene)


enrichment.pos <- DoEnrichment(genes = de.pos, universe.genes = universe.genes)
enrichment.neg <- DoEnrichment(genes = de.neg, universe.genes = universe.genes)
Code
barplot(enrichment.pos) + ggtitle("Enrichment of Obese upregulated genes")
Warning in (function (model, data, ...) : Arguments in `...` must be used.
✖ Problematic argument:
• by = "Count"
ℹ Did you misspell an argument name?

Code
barplot(enrichment.neg, showTerms = 7) + ggtitle("Enrichment of Obese downregulated genes")
Warning in (function (model, data, ...) : Arguments in `...` must be used.
✖ Problematic argument:
• by = "Count"
ℹ Did you misspell an argument name?

9 Homework

  1. Integrate all the different patients that we left out in this analysis and annotate the hepatocyte cluster.

  2. Using the hepatocyte cluster, identify zonation markers for the different zones of the liver. You can use the following markers as a guide:

Zone 1 (periportal) ALB, CPS1, ARG1, PCK1, SDH, SULT1A1, IGF1
Zone 2 (Midlobular) HAMP, IGFB2, CYP8B1
Zone 3(perivenous) GS,GLT,CAR,AHR,CYP1A, CYP2E1, CYP3A, GSTM1, GK,IGFB1, B-CATENIN