Last updated: 2021-12-17

Checks: 7 passed, 0 failed

Knit directory: sct2_revision/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20210706) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
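
The effect of this check can be illustrated with a minimal base R sketch. It reuses the seed reported above; the `sample()` call is only a stand-in for any random step (subsampling, permutations) and is not taken from the analysis itself.

```r
# Draw five numbers after setting the seed reported by workflowr
set.seed(20210706)
first_draw <- sample(1:100, 5)

# Resetting the same seed replays the same random number stream
set.seed(20210706)
second_draw <- sample(1:100, 5)

# TRUE: both draws are identical because they start from the same seed
identical(first_draw, second_draw)
```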

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 8afc486. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/raw_data/
    Ignored:    data/rds_filtered/
    Ignored:    data/rds_raw/
    Ignored:    data/sampled_counts/
    Ignored:    output/snakemake_output/

Untracked files:
    Untracked:  code/02_run_seurat_noclip.R
    Untracked:  code/07AA_deseq2_muscat_simulate.R
    Untracked:  code/07A_muscat_simulate.R
    Untracked:  code/07A_simulate_muscat.R
    Untracked:  code/07BB_deseq2_muscat_process.R
    Untracked:  code/07B_muscat_process.R
    Untracked:  code/07B_process_muscat.R
    Untracked:  code/08_run_presto.R
    Untracked:  code/17A_HEK_SS3_dropseq.Rmd
    Untracked:  code/17A_HEK_SS3_dropseq_files/
    Untracked:  code/17C_HEK_Quartzeseq2_dropseq.Rmd
    Untracked:  code/17C_HEK_Quartzeseq2_dropseq_files/
    Untracked:  code/17_HEK_SS3_ChromiumV3.Rmd
    Untracked:  code/17_HEK_SS3_ChromiumV3.nb.html
    Untracked:  code/17_HEK_SS3_ChromiumV3_files/
    Untracked:  code/AA_process_muscat.R
    Untracked:  code/BB_process_muscat.R
    Untracked:  code/DD_simulate_muscat.R
    Untracked:  code/EE_simulate_muscat.R
    Untracked:  code/XX_process_muscat.R
    Untracked:  code/XX_simulate_muscat.R
    Untracked:  code/YY_simulate_muscat.R
    Untracked:  code/ZZ_simulate_muscat.R
    Untracked:  code/kang_muscat.R
    Untracked:  code/prep_sce.R
    Untracked:  code/prep_sce_ss3_dropseq.R
    Untracked:  data/azimuth_predictions/
    Untracked:  junk/
    Untracked:  mamba_update_changes.txt
    Untracked:  output/11C_VST/
    Untracked:  output/AAmuscat_simulated/
    Untracked:  output/BBmuscat_simulated/
    Untracked:  output/CCmuscat_simulated/
    Untracked:  output/CD4_NK_downsampling_DE.rds
    Untracked:  output/DDmuscat_simulated/
    Untracked:  output/EEmuscat_simulated/
    Untracked:  output/KANGmuscat_simulated/
    Untracked:  output/NK_downsampling/
    Untracked:  output/XXmuscat_simulated/
    Untracked:  output/YYmuscat_simulated/
    Untracked:  output/ZZmuscat_simulated/
    Untracked:  output/figures/
    Untracked:  output/kang_prepsce.rds
    Untracked:  output/muscat_simulated/
    Untracked:  output/muscat_simulation/
    Untracked:  output/seu_sct2_sim.rds
    Untracked:  output/simulation_HEK_QuartzSeq2_Dropseq_downsampling/
    Untracked:  output/simulation_HEK_SS3_ChromiumV3_downsampling/
    Untracked:  output/simulation_HEK_SS3_Dropseq_downsampling/
    Untracked:  output/simulation_HEK_downsampling/
    Untracked:  output/simulation_NK_downsampling/
    Untracked:  output/ss3_dropseq_prepsim.rds
    Untracked:  output/tables/
    Untracked:  output/vargenes/
    Untracked:  snakemake/.snakemake/
    Untracked:  snakemake/Snakefile_noclip.smk
    Untracked:  snakemake/Snakefile_presto.smk
    Untracked:  snakemake/cluster.yaml
    Untracked:  snakemake/install_glm.R
    Untracked:  snakemake/jobscript.sh
    Untracked:  snakemake/jobscript_ncells.sh
    Untracked:  snakemake/local_run_downsampling.sh
    Untracked:  snakemake/local_run_glm.sh
    Untracked:  snakemake/local_run_ncells.sh
    Untracked:  snakemake/local_run_noclip.sh
    Untracked:  snakemake/local_run_presto.sh
    Untracked:  snakemake/local_run_time.sh
    Untracked:  snakemake/run_glm.sh
    Untracked:  snakemake/run_ncells.sh
    Untracked:  snakemake/sct2_revision_env.yml
    Untracked:  temp_figures/

Unstaged changes:
    Deleted:    analysis/04_PBMC68k.Rmd
    Modified:   code/02_run_seurat.R
    Modified:   code/03_run_vst2_downsample.R
    Modified:   code/04_run_vst_ncells.R
    Modified:   code/06_run_sct.R
    Modified:   data/datasets.csv
    Modified:   snakemake/Snakefile_downsampling.smk
    Modified:   snakemake/Snakefile_glm_seurat.smk
    Modified:   snakemake/Snakefile_metacell.smk

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/12_SuppFigure-DataStats.Rmd) and HTML (docs/12_SuppFigure-DataStats.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File  Version  Author           Date        Message
Rmd   8afc486  Saket Choudhary  2021-12-17  workflowr::wflow_publish("analysis/*")
html  d736ec8  Saket Choudhary  2021-07-07  Build site.
Rmd   400797a  Saket Choudhary  2021-07-06  workflowr::wflow_git_commit(all = TRUE)
html  400797a  Saket Choudhary  2021-07-06  workflowr::wflow_git_commit(all = TRUE)

suppressPackageStartupMessages({
  library(dplyr)
  library(ggplot2)
  library(kableExtra)
  library(ggpubr)
  library(ggridges)
  library(here)
  library(patchwork)
  library(RColorBrewer)
  library(readr)
  library(reshape2)
  library(scattermore)
  library(Seurat)
  library(sparseMatrixStats)
  library(xtable)
})
`%notin%` <- Negate(`%in%`)
theme_set(theme_pubr(base_size = 9))
knitr::opts_chunk$set(warning = FALSE, message = FALSE)


# Convert a dataset key into a file-name stem:
# spaces and "(" become "_", and ")" is dropped.
clean_keys <- function(key) {
  gsub(
    pattern = "\\)", replacement = "",
    x = gsub(pattern = " |\\(", replacement = "_", x = key)
  )
}
# Per-cell summary statistics for a genes x cells count matrix
CellSummary <- function(cm) {
  total_umi_per_cell <- colSums(cm)
  expressed_features_per_cell <- colSums(x = cm > 0)
  n_features <- dim(cm)[1]
  nonexpressed_features_per_cell <- n_features - expressed_features_per_cell
  median_umi_per_cell <- median(total_umi_per_cell)
  avg_umi_per_cell <- total_umi_per_cell / n_features
  avg_umi_per_cell_expressedgenes <- total_umi_per_cell / expressed_features_per_cell
  cell_amean <- colMeans2(cm)
  cell_variance <- colVars(cm)
  cell_attr <- data.frame(
    total_umi = total_umi_per_cell,
    n_expressed_genes = expressed_features_per_cell,
    # counts genes (not cells) with zero UMIs in each cell
    n_nonexpressed_genes = nonexpressed_features_per_cell,
    prop_expressed_genes = expressed_features_per_cell / n_features,
    prop_nonexpressed_genes = nonexpressed_features_per_cell / n_features,
    avg_umi = avg_umi_per_cell,
    avg_umi_expressedgenes = avg_umi_per_cell_expressedgenes,
    cell_amean = cell_amean,
    cell_variance = cell_variance
  )

  return(cell_attr)
}

GeneSummary <- function(cm) {

  # remove genes and cells with zero counts

  cm <- cm[rowSums(cm) > 0, colSums(cm) > 0]

  total_umi_per_gene <- rowSums(cm)
  expressed_cells_per_gene <- rowSums(cm > 0)
  n_cells <- dim(cm)[2]
  nonexpressed_cells_per_gene <- n_cells - expressed_cells_per_gene

  median_umi_per_gene <- median(total_umi_per_gene)

  avg_umi_per_gene <- total_umi_per_gene / n_cells
  avg_umi_per_gene_expressedcells <- total_umi_per_gene / expressed_cells_per_gene

  gene_amean <- rowMeans(cm)
  gene_var <- rowVars(cm)
  gene_gmean <- sctransform:::row_gmean(cm)

  gene_attr <- data.frame(
    total_umi = total_umi_per_gene,
    n_expressed_cells = expressed_cells_per_gene,
    n_nonexpressed_cells = nonexpressed_cells_per_gene,
    prop_expressed_cells = expressed_cells_per_gene / n_cells,
    prop_nonexpressed_cells = nonexpressed_cells_per_gene / n_cells,
    avg_umi = avg_umi_per_gene,
    avg_umi_expressedcells = avg_umi_per_gene_expressedcells,
    gene_amean = gene_amean,
    gene_gmean = gene_gmean,
    gene_variance = gene_var
  )

  return(gene_attr)
}

GetGeneCellSummary <- function(dataset_name, mode = "gene") {
  cm <- GetAssayData(
    readRDS(here::here("data", "rds_filtered", paste0(clean_keys(dataset_name), ".rds"))),
    assay = "RNA", slot = "counts"
  )
  if (mode == "gene") {
    gc_attr <- GeneSummary(cm)
  } else {
    gc_attr <- CellSummary(cm)
  }

  cm <- NULL
  gc()
  return(gc_attr)
}
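
As a rough illustration of what CellSummary() and GeneSummary() compute, the sketch below reproduces a few of their columns on a made-up 3 x 4 count matrix. It uses base R only, whereas the functions above operate on the sparse count matrices stored in the Seurat objects; the toy matrix and the object names `toy_cm`, `toy_cell_attr` and `toy_gene_attr` are illustrative assumptions, not part of the analysis.

```r
# Toy count matrix: 3 genes (rows) x 4 cells (columns); values are made up
toy_cm <- matrix(
  c(0, 2, 0, 1,
    3, 0, 0, 0,
    1, 1, 4, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4))
)

# Per-cell statistics: summarise each column
toy_cell_attr <- data.frame(
  total_umi = colSums(toy_cm),
  n_expressed_genes = colSums(toy_cm > 0),
  prop_expressed_genes = colSums(toy_cm > 0) / nrow(toy_cm)
)

# Per-gene statistics: summarise each row
toy_gene_attr <- data.frame(
  total_umi = rowSums(toy_cm),
  n_expressed_cells = rowSums(toy_cm > 0),
  prop_expressed_cells = rowSums(toy_cm > 0) / ncol(toy_cm),
  avg_umi = rowSums(toy_cm) / ncol(toy_cm)
)

toy_cell_attr
toy_gene_attr
```

Here `prop_expressed_genes` is the per-cell detection rate (the fraction of genes with at least one UMI in that cell), and `prop_expressed_cells` is the per-gene analogue.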
datasets <- readr::read_csv(here::here("data", "datasets.csv"), col_types = readr::cols())
dataset_keys <- datasets$key
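# Load the raw count matrix of every dataset into a named list.
# Note: `counts` is not referenced again on this page, and GetGeneCellSummary()
# re-reads each file itself, so this step mainly costs memory.
counts <- sapply(dataset_keys,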
  FUN = function(x) {
    GetAssayData(
      readRDS(here::here("data", "rds_filtered", paste0(clean_keys(x), ".rds"))),
      assay = "RNA", slot = "counts"
    )
  },
  simplify = FALSE, USE.NAMES = TRUE
)

cell_attrs <- sapply(dataset_keys,
  FUN = function(x) {
    message(x)
    GetGeneCellSummary(x, "cell")
  },
  simplify = FALSE, USE.NAMES = TRUE
)

cell_attrs_df <- bind_rows(cell_attrs, .id = "key")
cell_attrs_df <- left_join(cell_attrs_df, datasets)

gene_attrs <- sapply(dataset_keys,
  FUN = function(x) {
    GetGeneCellSummary(x, "gene")
  },
  simplify = FALSE, USE.NAMES = TRUE
)
gene_attrs_df <- bind_rows(gene_attrs, .id = "key")
gene_attrs_df <- left_join(gene_attrs_df, datasets)
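
The mapping from dataset keys to `.rds` file names used above can be sketched as follows. This is a standalone illustration: it redefines `clean_keys()` on the assumption that its second `gsub()` is meant to drop the closing parenthesis, it assumes (hypothetically) that keys look like the sample names in the summary table further down, and it uses `file.path()` in place of `here::here()` so that it runs without extra packages.

```r
# Standalone copy of the key-cleaning helper (assumed intent: drop ")")
clean_keys <- function(key) {
  gsub(
    pattern = "\\)", replacement = "",
    x = gsub(pattern = " |\\(", replacement = "_", x = key)
  )
}

# Hypothetical keys, modelled on the sample names shown in the table
example_keys <- c("PBMC-r1 (inDrops)", "HEK-m (CEL-seq2)")

# Space and "(" each become "_", so a double underscore appears
example_stems <- clean_keys(example_keys)

# Relative paths of the filtered objects, as built in GetGeneCellSummary()
example_paths <- file.path("data", "rds_filtered", paste0(example_stems, ".rds"))
example_paths
```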

UMI statistics

gene_attrs_df$datatype <- factor(gene_attrs_df$datatype, levels = c("technical-control", "cell line", "heterogeneous"))
cell_attrs_df$datatype <- factor(cell_attrs_df$datatype, levels = c("technical-control", "cell line", "heterogeneous"))

gene_attrs_df_summary <- gene_attrs_df %>%
  group_by(sample_name) %>%
  summarize(
    median.zero_prop = round(median(prop_nonexpressed_cells), 4),
    median.detection_rate = round(median(prop_expressed_cells), 4)
  )

gene_attrs_df_summary <- left_join(gene_attrs_df_summary, datasets)

# Ridge plot of mean UMI per gene; samples are ordered by their median
pgeneavg <- ggplot(gene_attrs_df, aes(
  x = avg_umi,
  y = reorder(sample_name, avg_umi, FUN = median),
  fill = datatype
)) +
  scale_x_continuous(
    trans = "log10",
    breaks = c(0.0001, 0.01, 1, 100, 10000),
    labels = c("0.0001", "0.01", "1", "100", "10000")
  ) +
  # quantiles = 2 draws a single line at the median of each ridge
  stat_density_ridges(quantile_lines = TRUE, quantiles = 2) +
  scale_fill_manual(values = brewer.pal(3, "Set2"), name = "") +
  labs(title = "") +
  ylab("") +
  theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.background = element_blank()
  ) +
  # the legend belongs to the fill aesthetic, not colour
  guides(fill = guide_legend(ncol = 3)) +
  xlab("Mean UMI per gene")




cell_attrs_df_summary <- cell_attrs_df %>%
  group_by(sample_name, datatype) %>%
  summarize(
    median_umi = median(total_umi),
    median_detection_rate = round(median(prop_expressed_genes), 3)
  )
# cell_attrs_df_summary
# Ridge plot of total UMI per cell; samples are ordered by their median
pcelltot <- ggplot(cell_attrs_df, aes(
  x = total_umi,
  y = reorder(sample_name, total_umi, FUN = median),
  fill = datatype
)) +
  scale_x_continuous(
    trans = "log10",
    breaks = c(0.0001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000),
    labels = c("0.0001", "0.01", "0.1", "1", "10", "100", "1000", "10000", "100000")
  ) +
  # quantiles = 2 draws a single line at the median of each ridge
  stat_density_ridges(quantile_lines = TRUE, quantiles = 2) +
  scale_fill_manual(values = brewer.pal(3, "Set2"), name = "") +
  labs(title = "") +
  ylab("") +
  theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.background = element_blank()
  ) +
  # the legend belongs to the fill aesthetic, not colour
  guides(fill = guide_legend(ncol = 3)) +
  xlab("Total UMI per cell")

wrap_plots(pcelltot, pgeneavg, ncol = 2) +
  plot_annotation(tag_levels = "A") +
  plot_layout(guides = "collect", tag_level = "new") &
  theme(legend.position = "bottom", plot.tag = element_text(face = "bold"))
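
Both ridge plots order the samples on the y axis with `reorder(sample_name, <value>, FUN = median)`. The base R sketch below shows what that call does to the factor levels; the data frame `toy` and its values are invented for illustration.

```r
# Made-up per-cell totals for three samples
toy <- data.frame(
  sample_name = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  total_umi = c(900, 1000, 1100, 90, 100, 5000, 400, 500, 600)
)

# Levels are sorted by increasing per-sample median (A = 1000, B = 100, C = 500)
ordered_by_median <- reorder(toy$sample_name, toy$total_umi, FUN = median)
levels(ordered_by_median)
# → "B" "C" "A"

# The default FUN = mean gives a different order, because the single
# large value in sample B pulls its mean above the others
ordered_by_mean <- reorder(toy$sample_name, toy$total_umi)
levels(ordered_by_mean)
# → "C" "A" "B"
```

Because ggplot2 places the first factor level at the bottom of a discrete y axis, the sample with the lowest median ends up in the bottom ridge.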

Version Author Date
400797a Saket Choudhary 2021-07-06
dir.create(here::here("output", "figures"), showWarnings = FALSE)

ggsave(here::here("output", "figures", "data_stats.pdf"), width = 12, height = 12, dpi = "print")
cell_attrs_df_summary <- cell_attrs_df_summary %>% arrange(median_umi)
kbl(cell_attrs_df_summary, booktabs = TRUE) %>%
  kable_styling(latex_options = "striped")
sample_name               datatype           median_umi  median_detection_rate
PBMC-r1 (inDrops)         heterogeneous           375.0  0.009
Fetal (sci-RNA-seq3)      heterogeneous           499.0  0.008
PBMC-r2 (Seq-Well)        heterogeneous           521.0  0.011
PBMC-r1 (Seq-Well)        heterogeneous           846.0  0.017
PBMC-r1 (Drop-seq)        heterogeneous          1199.0  0.022
PBMC-r2 (inDrops)         heterogeneous          1247.0  0.020
PBMC68k (ChromiumV1)      heterogeneous          1292.0  0.026
PBMC-r2 (Drop-seq)        heterogeneous          1850.5  0.030
Cortex-r2 (sci-RNA-seq)   heterogeneous          1899.0  0.044
HEK-m (Drop-seq)          cell line              1907.5  0.051
PBMC-r1 (ChromiumV2A)     heterogeneous          2032.0  0.027
Cortex-r1 (DroNc-seq)     heterogeneous          2092.0  0.053
3T3-r1 (inDrops)          cell line              2213.0  0.061
Bone Marrow (CITE-seq)    heterogeneous          2294.0  0.045
TechCtrl1 (ChromiumV1)    technical-control      2308.5  0.031
TechCtrl2 (ChromiumV1)    technical-control      2566.0  0.031
PBMC-r2 (ChromiumV2)      heterogeneous          2626.0  0.036
HEK-m (inDrops)           cell line              2943.0  0.073
HEK-r1 (inDrops)          cell line              3019.0  0.057
PBMC-r1 (ChromiumV2B)     heterogeneous          3050.0  0.037
3T3-r1 (Drop-seq)         cell line              3072.0  0.082
Cortex-r2 (DroNc-seq)     heterogeneous          3094.0  0.071
HEK-m (mcSCRB-seq)        cell line              3266.5  0.063
3T3-r2 (Drop-seq)         cell line              3345.0  0.091
Cortex-r1 (sci-RNA-seq)   heterogeneous          3524.0  0.060
Cortex-r2 (ChromiumV2)    heterogeneous          3527.0  0.073
HEK-r2 (inDrops)          cell line              3904.0  0.073
3T3-r2 (inDrops)          cell line              4666.5  0.107
HEK-m (ChromiumV2_sn)     cell line              4967.0  0.128
HEK-r2 (Drop-seq)         cell line              4968.5  0.085
PBMC-r1 (ChromiumV3)      heterogeneous          5066.0  0.054
HEK-m (ddSeq)             cell line              5304.5  0.108
HEK-r1 (Drop-seq)         cell line              5328.0  0.092
PBMC-r2 (CEL-seq2)        heterogeneous          5917.0  0.087
3T3-r1 (sci-RNA-seq)      cell line              6609.5  0.136
PBMC-r1 (CEL-seq2)        heterogeneous          6848.0  0.096
PBMC (ChromiumV3)         heterogeneous          6992.0  0.107
Cortex-r1 (ChromiumV2)    heterogeneous          6993.5  0.122
3T3-r2 (sci-RNA-seq)      cell line              8256.0  0.160
PBMC (Smart-seq3)         heterogeneous          8288.0  0.058
3T3-r1 (ChromiumV2)       cell line              9548.0  0.140
HEK-r2 (sci-RNA-seq)      cell line             11045.0  0.160
HEK-m (MARS-seq)          cell line             11207.5  0.180
HEK-r1 (sci-RNA-seq)      cell line             11490.0  0.158
3T3-r2 (ChromiumV2)       cell line             13776.5  0.185
3T3 (ChromiumV3)          cell line             15577.0  0.180
HEK-r2 (ChromiumV2)       cell line             22986.5  0.171
HEK-r1 (ChromiumV2)       cell line             23388.0  0.169
TechCtrl (inDrops)        technical-control     32905.0  0.391
3T3-r1 (CEL-seq2)         cell line             34291.0  0.321
HEK (ChromiumV3)          cell line             40547.0  0.246
HEK-r2 (CEL-seq2)         cell line             43670.0  0.287
HEK-r1 (CEL-seq2)         cell line             52973.0  0.308
3T3-r2 (CEL-seq2)         cell line             53036.0  0.367
HEK-m (CEL-seq2)          cell line             60592.5  0.479
HEK-m (ChromiumV2)        cell line             73333.5  0.434
HEK (Smart-seq3)          cell line            106994.0  0.381
HEK-m (Quartz-Seq2)       cell line            167199.0  0.548
Fibroblasts (Smart-seq3)  heterogeneous        196626.0  0.380
dir.create(here::here("output", "tables"), showWarnings = FALSE)
# `type` is an argument of print.xtable(), not xtable()
print(
  xtable(cell_attrs_df_summary, digits = 3),
  type = "latex",
  include.rownames = FALSE,
  file = here::here("output", "tables", "datasets_umi_stats.tex")
)

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] xtable_1.8-4            sparseMatrixStats_1.4.2 MatrixGenerics_1.4.3   
 [4] matrixStats_0.61.0      SeuratObject_4.0.4      Seurat_4.0.5           
 [7] scattermore_0.7         reshape2_1.4.4          readr_2.1.1            
[10] RColorBrewer_1.1-2      patchwork_1.1.1         here_1.0.1             
[13] ggridges_0.5.3          ggpubr_0.4.0            kableExtra_1.3.4       
[16] ggplot2_3.3.5           dplyr_1.0.7             workflowr_1.6.2        

loaded via a namespace (and not attached):
  [1] backports_1.4.1        systemfonts_1.0.3      plyr_1.8.6            
  [4] igraph_1.2.9           lazyeval_0.2.2         splines_4.1.2         
  [7] listenv_0.8.0          digest_0.6.29          htmltools_0.5.2       
 [10] fansi_0.5.0            magrittr_2.0.1         tensor_1.5            
 [13] cluster_2.1.2          ROCR_1.0-11            tzdb_0.2.0            
 [16] globals_0.14.0         vroom_1.5.7            svglite_2.0.0         
 [19] spatstat.sparse_2.0-0  colorspace_2.0-2       rvest_1.0.2           
 [22] ggrepel_0.9.1          xfun_0.28              crayon_1.4.2          
 [25] jsonlite_1.7.2         spatstat.data_2.1-0    survival_3.2-13       
 [28] zoo_1.8-9              glue_1.5.1             polyclip_1.10-0       
 [31] gtable_0.3.0           webshot_0.5.2          leiden_0.3.9          
 [34] car_3.0-12             future.apply_1.8.1     abind_1.4-5           
 [37] scales_1.1.1           DBI_1.1.1              rstatix_0.7.0         
 [40] miniUI_0.1.1.1         Rcpp_1.0.7             viridisLite_0.4.0     
 [43] reticulate_1.22        spatstat.core_2.3-2    bit_4.0.4             
 [46] htmlwidgets_1.5.4      httr_1.4.2             ellipsis_0.3.2        
 [49] ica_1.0-2              farver_2.1.0           pkgconfig_2.0.3       
 [52] sass_0.4.0             uwot_0.1.11            deldir_1.0-6          
 [55] utf8_1.2.2             tidyselect_1.1.1       rlang_0.4.12          
 [58] later_1.3.0            munsell_0.5.0          tools_4.1.2           
 [61] generics_0.1.1         broom_0.7.10           evaluate_0.14         
 [64] stringr_1.4.0          fastmap_1.1.0          yaml_2.2.1            
 [67] goftest_1.2-3          knitr_1.36             bit64_4.0.5           
 [70] fs_1.5.2               fitdistrplus_1.1-6     purrr_0.3.4           
 [73] RANN_2.6.1             pbapply_1.5-0          future_1.23.0         
 [76] nlme_3.1-152           whisker_0.4            mime_0.12             
 [79] xml2_1.3.3             compiler_4.1.2         rstudioapi_0.13       
 [82] plotly_4.10.0          png_0.1-7              ggsignif_0.6.3        
 [85] spatstat.utils_2.3-0   tibble_3.1.6           bslib_0.3.1           
 [88] stringi_1.7.6          highr_0.9              lattice_0.20-45       
 [91] Matrix_1.4-0           vctrs_0.3.8            pillar_1.6.4          
 [94] lifecycle_1.0.1        spatstat.geom_2.3-1    lmtest_0.9-39         
 [97] jquerylib_0.1.4        RcppAnnoy_0.0.19       data.table_1.14.2     
[100] cowplot_1.1.1          irlba_2.3.5            httpuv_1.6.3          
[103] R6_2.5.1               promises_1.2.0.1       KernSmooth_2.23-20    
[106] gridExtra_2.3          parallelly_1.29.0      codetools_0.2-18      
[109] MASS_7.3-54            assertthat_0.2.1       rprojroot_2.0.2       
[112] withr_2.4.3            sctransform_0.3.2.9008 mgcv_1.8-38           
[115] parallel_4.1.2         hms_1.1.1              grid_4.1.2            
[118] rpart_4.1-15           tidyr_1.1.4            rmarkdown_2.11        
[121] carData_3.0-4          Rtsne_0.15             git2r_0.29.0          
[124] shiny_1.7.1