Case Studies

Case Study 1

Consider a scenario where someone is interested in searching for single-cell RNA-seq datasets, in particular datasets from the retina:

$ pysradb search --query "single-cell rna-seq retina"

 study_accession experiment_accession    experiment_title    sample_taxon_id sample_scientific_name  experiment_library_strategy experiment_library_source   experiment_library_selection    sample_accession    sample_alias    experiment_instrument_model pool_member_spots   run_1_size  run_1_accession run_1_total_spots   run_1_total_bases
 SRP299803   SRX9756769  GSM4995565: scATAC_Retina_WT; Mus musculus; ATAC-seq    10090   Mus musculus    ATAC-seq    GENOMIC other   SRS7946094  GSM4995565  Illumina NovaSeq 6000   55435867    2637580797  SRR13329759 55435867    6874047508
 SRP299803   SRX9756768  GSM4995564: scRNA_Retina_VSX2SEKO_Rep2; Mus musculus; RNA-Seq   10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7946093  GSM4995564  Illumina NovaSeq 6000   96123725    4107807391  SRR13329758 96123725    12688331700
 SRP299803   SRX9756767  GSM4995563: scRNA_Retina_VSX2SEKO_Rep1; Mus musculus; RNA-Seq   10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7946092  GSM4995563  Illumina NovaSeq 6000   94345783    4056010488  SRR13329757 94345783    12453643356
 SRP299803   SRX9756766  GSM4995562: scRNA_Retina_WT_Rep2; Mus musculus; RNA-Seq 10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7946091  GSM4995562  Illumina NovaSeq 6000   99487074    4240172698  SRR13329756 99487074    13132293768
 SRP299803   SRX9756765  GSM4995561: scRNA_Retina_WT_Rep1; Mus musculus; RNA-Seq 10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7946090  GSM4995561  Illumina NovaSeq 6000   88048461    3817540828  SRR13329755 88048461    11622396852
 SRP257758   SRX9537754  GSM4916438: Pou4f2-tdTomato/+ E17.5 scRNA-seq; Mus musculus; RNA-Seq    10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7743995  GSM4916438  Illumina HiSeq 2500 364683840   8246658699  SRR13091939 364683840   32456861760
 SRP257758   SRX9537753  GSM4916437: Atoh7-zsGreen/lacZ E17.5 scRNA-seq; Mus musculus; RNA-Seq   10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7743994  GSM4916437  Illumina HiSeq 2500 530456067   11895864680 SRR13091938 530456067   47210589963
 SRP257758   SRX9537752  GSM4916436: Atoh7-zsGreen/+ E17.5 scRNA-seq; Mus musculus; RNA-Seq  10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7743993  GSM4916436  Illumina HiSeq 2500 389849416   8671923722  SRR13091937 389849416   34696598024
 SRP257758   SRX9537751  GSM4916435: Atoh7-zsGreen/lacZ E14.5 scRNA-seq; Mus musculus; RNA-Seq   10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7743992  GSM4916435  Illumina HiSeq 2500 328878355   7875737709  SRR13091936 328878355   29270173595
 SRP257758   SRX9537750  GSM4916434: Atoh7-zsGreen/+ E14.5 scRNA-seq; Mus musculus; RNA-Seq  10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7743991  GSM4916434  Illumina HiSeq 2500 522040155   12760941656 SRR13091935 522040155   46461573795
 ERP118072   ERX3614517  NextSeq 500 sequencing; 3' mRNA-seq of protrusions and cell bodies of BJ, PC-3M, RPE-1, U-87 and WM-266.4 cells 9606    Homo sapiens    OTHER   TRANSCRIPTOMIC  Oligo-dT    ERS3920269  SAMEA6120013    NextSeq 500 5818488 43355751    ERR3619129  1457318 109897743
 ERP118072   ERX3614516  NextSeq 500 sequencing; 3' mRNA-seq of protrusions and cell bodies of BJ, PC-3M, RPE-1, U-87 and WM-266.4 cells 9606    Homo sapiens    OTHER   TRANSCRIPTOMIC  Oligo-dT    ERS3920268  SAMEA6120012    NextSeq 500 5422441 40645479    ERR3619125  1359663 102468758
 SRP288715   SRX9369597  RPE1_SS119_p10  9606    Homo sapiens    OTHER   GENOMIC other   SRS7591452  RPE1_SS119_p10.bam  Illumina HiSeq 2000 5062938 88426773    SRR12904705 5062938 202517520
 SRP288715   SRX9369596  RPE1_SS119_p0   9606    Homo sapiens    OTHER   GENOMIC other   SRS7591451  RPE1_SS119_p0.bam   Illumina HiSeq 2000 978835  19219630    SRR12904706 978835  39153400
 SRP288715   SRX9369595  RPE1_SS111_p10  9606    Homo sapiens    OTHER   GENOMIC other   SRS7591450  RPE1_SS111_p10.bam  Illumina HiSeq 2000 6205827 108129733   SRR12904707 6205827 248233080
 SRP288715   SRX9369594  RPE1_SS111_p0   9606    Homo sapiens    OTHER   GENOMIC other   SRS7591449  RPE1_SS111_p0.bam   Illumina HiSeq 2000 928703  18488436    SRR12904708 928703  37148120
 SRP288715   SRX9369593  RPE1_SS51_p10   9606    Homo sapiens    OTHER   GENOMIC other   SRS7591448  RPE1_SS51_p10.bam   Illumina HiSeq 2000 6088168 106065537   SRR12904709 6088168 243526720
 SRP288715   SRX9369592  RPE1_SS51_p0    9606    Homo sapiens    OTHER   GENOMIC other   SRS7591447  RPE1_SS51_p0.bam    Illumina HiSeq 2000 1624227 30610200    SRR12904710 1624227 64969080
 SRP288715   SRX9369591  RPE1_SS48_p10   9606    Homo sapiens    OTHER   GENOMIC other   SRS7591446  RPE1_SS48_p10.bam   Illumina HiSeq 2000 8117881 139408135   SRR12904711 8117881 324715240
 SRP288715   SRX9369590  RPE1_SS48_p0    9606    Homo sapiens    OTHER   GENOMIC other   SRS7591445  RPE1_SS48_p0.bam    Illumina HiSeq 2000 776140  15821200    SRR12904712 776140  31045600

By default, search returns the first 20 hits. SRP299803 seems like a project of interest. However, the information returned by the search command is fairly limited, so we look up more detailed information about this project:

$ pysradb metadata SRP299803 | head
 study_accession experiment_accession    experiment_title    experiment_desc organism_taxid  organism_name   library_name    library_strategy    library_source  library_selection   library_layout  sample_accession    sample_title    instrument  instrument_model    instrument_model_desc   total_spots total_size  run_accession   run_total_spots run_total_bases
 SRP299803   SRX9756769  GSM4995565: scATAC_Retina_WT; Mus musculus; ATAC-seq    GSM4995565: scATAC_Retina_WT; Mus musculus; ATAC-seq    10090   Mus musculus        ATAC-seq    GENOMIC other   PAIRED  SRS7946094      Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA    55435867    2637580797  SRR13329759 55435867    6874047508
 SRP299803   SRX9756768  GSM4995564: scRNA_Retina_VSX2SEKO_Rep2; Mus musculus; RNA-Seq   GSM4995564: scRNA_Retina_VSX2SEKO_Rep2; Mus musculus; RNA-Seq   10090   Mus musculus        RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946093      Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA    96123725    4107807391  SRR13329758 96123725    12688331700
 SRP299803   SRX9756767  GSM4995563: scRNA_Retina_VSX2SEKO_Rep1; Mus musculus; RNA-Seq   GSM4995563: scRNA_Retina_VSX2SEKO_Rep1; Mus musculus; RNA-Seq   10090   Mus musculus        RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946092      Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA    94345783    4056010488  SRR13329757 94345783    12453643356
 SRP299803   SRX9756766  GSM4995562: scRNA_Retina_WT_Rep2; Mus musculus; RNA-Seq GSM4995562: scRNA_Retina_WT_Rep2; Mus musculus; RNA-Seq 10090   Mus musculus        RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946091      Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA    99487074    4240172698  SRR13329756 99487074    13132293768
 SRP299803   SRX9756765  GSM4995561: scRNA_Retina_WT_Rep1; Mus musculus; RNA-Seq GSM4995561: scRNA_Retina_WT_Rep1; Mus musculus; RNA-Seq 10090   Mus musculus        RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946090      Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA    88048461    3817540828  SRR13329755 88048461    11622396852

It is also possible to get more detailed information using the --detailed flag:

$ pysradb metadata SRP299803 --detailed

 run_accession   study_accession experiment_accession    experiment_title    experiment_desc organism_taxid  organism_name   library_name    library_strategy    library_source  library_selection   library_layout  sample_accession    sample_title    instrument  instrument_model    instrument_model_desc   total_spots total_size  run_total_spots run_total_bases run_alias   sra_url experiment_alias    source_name strain background   genotype    tissue/cell type    molecule subtype    ena_fastq_http  ena_fastq_http_1    ena_fastq_http_2    ena_fastq_ftp   ena_fastq_ftp_1 ena_fastq_ftp_2
 SRR13329759 SRP299803   SRX9756769  GSM4995565: scATAC_Retina_WT; Mus musculus; ATAC-seq    GSM4995565: scATAC_Retina_WT; Mus musculus; ATAC-seq    10090   Mus musculus        ATAC-seq    GENOMIC other   PAIRED  SRS7946094      Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA    55435867    2637580797  55435867    6874047508  GSM4995565_r1   https://sra-download.ncbi.nlm.nih.gov/traces/sra77/SRR/013017/SRR13329759   GSM4995565  wild type_retina    C57BL/6 wild type   retina          http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/059/SRR13329759/SRR13329759_1.fastq.gz   http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/059/SRR13329759/SRR13329759_2.fastq.gz       era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/059/SRR13329759/SRR13329759_1.fastq.gz    era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/059/SRR13329759/SRR13329759_2.fastq.gz
 SRR13329758 SRP299803   SRX9756768  GSM4995564: scRNA_Retina_VSX2SEKO_Rep2; Mus musculus; RNA-Seq   GSM4995564: scRNA_Retina_VSX2SEKO_Rep2; Mus musculus; RNA-Seq   10090   Mus musculus        RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946093      Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA    96123725    4107807391  96123725    12688331700 GSM4995564_r1   https://sra-download.ncbi.nlm.nih.gov/traces/sra70/SRR/013017/SRR13329758   GSM4995564  Vsx2SE Δ/Δ_retina   C57BL/6 Vsx2SE {delta}/{delta}  retina  3' RNA      http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/058/SRR13329758/SRR13329758_1.fastq.gz   http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/058/SRR13329758/SRR13329758_2.fastq.gz       era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/058/SRR13329758/SRR13329758_1.fastq.gz    era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/058/SRR13329758/SRR13329758_2.fastq.gz
 SRR13329757 SRP299803   SRX9756767  GSM4995563: scRNA_Retina_VSX2SEKO_Rep1; Mus musculus; RNA-Seq   GSM4995563: scRNA_Retina_VSX2SEKO_Rep1; Mus musculus; RNA-Seq   10090   Mus musculus        RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946092      Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA    94345783    4056010488  94345783    12453643356 GSM4995563_r1   https://sra-download.ncbi.nlm.nih.gov/traces/sra79/SRR/013017/SRR13329757   GSM4995563  Vsx2SE Δ/Δ_retina   C57BL/6 Vsx2SE {delta}/{delta}  retina  3' RNA      http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/057/SRR13329757/SRR13329757_1.fastq.gz   http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/057/SRR13329757/SRR13329757_2.fastq.gz       era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/057/SRR13329757/SRR13329757_1.fastq.gz    era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/057/SRR13329757/SRR13329757_2.fastq.gz
 SRR13329756 SRP299803   SRX9756766  GSM4995562: scRNA_Retina_WT_Rep2; Mus musculus; RNA-Seq GSM4995562: scRNA_Retina_WT_Rep2; Mus musculus; RNA-Seq 10090   Mus musculus        RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946091      Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA    99487074    4240172698  99487074    13132293768 GSM4995562_r1   https://sra-download.ncbi.nlm.nih.gov/traces/sra77/SRR/013017/SRR13329756   GSM4995562  wild type_retina    C57BL/6 wild type   retina  3' RNA      http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/056/SRR13329756/SRR13329756_1.fastq.gz   http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/056/SRR13329756/SRR13329756_2.fastq.gz       era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/056/SRR13329756/SRR13329756_1.fastq.gz    era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/056/SRR13329756/SRR13329756_2.fastq.gz
 SRR13329755 SRP299803   SRX9756765  GSM4995561: scRNA_Retina_WT_Rep1; Mus musculus; RNA-Seq GSM4995561: scRNA_Retina_WT_Rep1; Mus musculus; RNA-Seq 10090   Mus musculus        RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946090      Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA    88048461    3817540828  88048461    11622396852 GSM4995561_r1   https://sra-download.ncbi.nlm.nih.gov/traces/sra72/SRR/013017/SRR13329755   GSM4995561  wild type_retina    C57BL/6 wild type   retina  3' RNA      http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/055/SRR13329755/SRR13329755_1.fastq.gz   http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/055/SRR13329755/SRR13329755_2.fastq.gz       era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/055/SRR13329755/SRR13329755_1.fastq.gz    era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/055/SRR13329755/SRR13329755_2.fastq.gz

Having made sure this dataset is indeed of interest, we want to save some work and see if the processed dataset has been made available on GEO by the authors:

$ pysradb srp-to-gse SRP299803

study_accession  study_alias
SRP299803        GSE164044

So indeed a GEO project exists for this SRA dataset.

Notice that the GEO information was also visible in the output of metadata --detailed. Suppose instead that we started with the GSM id of one of the experiments, say GSM4995565. Starting from this GSM id, we want to get the following information:

  • SRP id of the project

  • GSE id of the project

  • SRX id of the experiment

  • SRR id(s) corresponding to the experiment

Get SRP id:

$ pysradb gsm-to-srp GSM4995565

experiment_alias study_accession
GSM4995565       SRP299803

Get GSE id:

$ pysradb gsm-to-gse GSM4995565

experiment_alias study_alias
GSM4995565       GSE164044

Get SRX id:

$ pysradb gsm-to-srx GSM4995565

experiment_alias experiment_accession
GSM4995565       SRX9756769

Get SRR id(s):

$ pysradb gsm-to-srr GSM4995565

experiment_alias run_accession
GSM4995565       SRR13329759
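The four lookups above each return a small two-column table keyed on experiment_alias, so they can be merged into one record per GSM id. The following is a minimal sketch of that merge, not part of pysradb: `parse_table` and `records` are hypothetical names, the input strings are the outputs copied from above, and splitting on whitespace assumes that no value contains a space.

```python
def parse_table(text):
    """Parse whitespace-aligned output with one header line into a list of dicts."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


# Outputs copied from the four gsm-to-* commands above.
outputs = [
    "experiment_alias study_accession\nGSM4995565       SRP299803",
    "experiment_alias study_alias\nGSM4995565       GSE164044",
    "experiment_alias experiment_accession\nGSM4995565       SRX9756769",
    "experiment_alias run_accession\nGSM4995565       SRR13329759",
]

# Merge the per-command rows into one record per GSM id.
records = {}
for text in outputs:
    for row in parse_table(text):
        records.setdefault(row["experiment_alias"], {}).update(row)

print(records["GSM4995565"])
```

Keying on experiment_alias works here because every gsm-to-* output repeats the GSM id in its first column.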

Case Study 2

Our first case study included metadata search. Next, we explore downloading datasets.

We have an SRP id to start off with: SRP000941. We want to quickly check out its contents:

$ pysradb metadata SRP000941 --detailed | head

study_accession experiment_accession    experiment_title    experiment_desc organism_taxid  organism_name   library_name    library_strategy    library_source  library_selection   library_layout  sample_accession    sample_title    instrument  instrument_model    instrument_model_desc   total_spots total_size  run_accession   run_total_spots run_total_bases
SRP000941   SRX056722   Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells  Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells  9606    Homo sapiens    SAK270  ChIP-Seq    GENOMIC ChIP    SINGLE  SRS184466       Illumina HiSeq 2000 Illumina HiSeq 2000 ILLUMINA    26900401    531654480   SRR179707   26900401    807012030
SRP000941   SRX027889   Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells 9606    Homo sapiens    SAK201  ChIP-Seq    GENOMIC ChIP    SINGLE  SRS116481       Illumina Genome Analyzer II Illumina Genome Analyzer II ILLUMINA    37528590    779578968   SRR067978   37528590    1351029240
SRP000941   SRX027888   Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606    Homo sapiens    LLH1U   ChIP-Seq    GENOMIC RANDOM  SINGLE  SRS116483       Illumina Genome Analyzer II Illumina Genome Analyzer II ILLUMINA    13603127    3232309537  SRR067977   13603127    489712572
SRP000941   SRX027887   Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606    Homo sapiens    DM219   ChIP-Seq    GENOMIC RANDOM  SINGLE  SRS116562       Illumina Genome Analyzer II Illumina Genome Analyzer II ILLUMINA    22430523    506327844   SRR067976   22430523    807498828

This project is a collection of multiple assays, as a count of rows per library strategy shows:

$ pysradb metadata SRP000941 --detailed  | tr -s '  ' | cut -f5 -d ' ' | sort | uniq -c

999 Bisulfite-Seq
768 ChIP-Seq
  1 library_strategy
121 OTHER
353 RNA-Seq
 28 WGS
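The tr/cut pipeline above selects the column by position, which is why the header line shows up in the tally as "1 library_strategy". A sketch of the same count keyed on the column name is given below. It assumes the metadata table is available as tab-separated text (an assumption about the input, not a statement about what pysradb prints); `count_by_column` is a hypothetical helper, and the sample rows are a handful of entries taken from the outputs in this document, trimmed to three columns.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_text, column):
    """Count how many rows carry each value of `column` in a TSV table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[column] for row in reader)


# Rows taken from the outputs shown in this document, trimmed to three columns.
sample = (
    "study_accession\trun_accession\tlibrary_strategy\n"
    "SRP000941\tSRR179707\tChIP-Seq\n"
    "SRP000941\tSRR067978\tChIP-Seq\n"
    "SRP299803\tSRR13329759\tATAC-seq\n"
    "SRP299803\tSRR13329758\tRNA-Seq\n"
    "SRP299803\tSRR13329757\tRNA-Seq\n"
)

for strategy, n in count_by_column(sample, "library_strategy").most_common():
    print(n, strategy)
```

Because the header is consumed as column names, it never appears in the counts.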

However, we only want to download the RNA-seq samples:

$ pysradb metadata SRP000941 --detailed | grep 'study\|RNA-Seq' | pysradb download
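The grep filter above keeps any line that contains "study" or "RNA-Seq" anywhere, including in free-text fields such as the experiment title. A stricter filter compares only the library_strategy column. The sketch below assumes the metadata is available as tab-separated text. `keep_rows` is a hypothetical helper, not a pysradb function, and the sample rows are copied from the outputs in this document and trimmed to five columns.

```python
import csv
import io


def keep_rows(tsv_text, column, value):
    """Return a TSV string with the header plus rows where `column` equals `value`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(row for row in reader if row[column] == value)
    return out.getvalue()


# Rows copied from the outputs shown in this document, trimmed to five columns.
sample = (
    "study_accession\texperiment_accession\texperiment_title\tlibrary_strategy\trun_accession\n"
    "SRP299803\tSRX9756769\tGSM4995565: scATAC_Retina_WT; Mus musculus; ATAC-seq\tATAC-seq\tSRR13329759\n"
    "SRP299803\tSRX9756768\tGSM4995564: scRNA_Retina_VSX2SEKO_Rep2; Mus musculus; RNA-Seq\tRNA-Seq\tSRR13329758\n"
    "SRP000941\tSRX056722\tReference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells\tChIP-Seq\tSRR179707\n"
)

print(keep_rows(sample, "library_strategy", "RNA-Seq"), end="")
```

The header row is always written back, which plays the same role as the 'study' pattern in the grep command.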

This will download all RNA-seq samples from this project using aspera-client, if available. Alternatively, it can use wget.

Downloading an entire project is easy:

$ pysradb download -p SRP000941

Downloads are organized by SRP/SRX/SRR, mimicking the hierarchy of SRA projects.
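To illustrate the SRP/SRX/SRR nesting, the sketch below groups run accessions under their experiment and study accessions. It only mirrors the hierarchy as a nested mapping and makes no claim about the actual directory or file names written to disk. `group_runs` is a hypothetical helper, and the accessions are taken from the SRP000941 metadata shown above.

```python
from collections import defaultdict


def group_runs(rows):
    """Nest run accessions under their experiment and study accessions."""
    tree = defaultdict(lambda: defaultdict(list))
    for srp, srx, srr in rows:
        tree[srp][srx].append(srr)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {srp: dict(experiments) for srp, experiments in tree.items()}


# (study, experiment, run) triples from the SRP000941 metadata above.
rows = [
    ("SRP000941", "SRX056722", "SRR179707"),
    ("SRP000941", "SRX027889", "SRR067978"),
    ("SRP000941", "SRX027888", "SRR067977"),
]

for srp, experiments in group_runs(rows).items():
    print(srp)
    for srx, runs in experiments.items():
        print("  " + srx)
        for srr in runs:
            print("    " + srr)
```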