Case Studies

Case Study 1

Consider a scenario where someone is searching for single-cell RNA-seq datasets, specifically datasets that study the retina:

$ pysradb search --query "single-cell rna-seq retina"

 study_accession     experiment_accession    experiment_title        sample_taxon_id sample_scientific_name  experiment_library_strategy     experiment_library_source       experiment_library_selection    sample_accession        sample_alias    experiment_instrument_model     pool_member_spots       run_1_size      run_1_accession run_1_total_spots       run_1_total_bases
 SRP299803   SRX9756769      GSM4995565: scATAC_Retina_WT; Mus musculus; ATAC-seq    10090   Mus musculus    ATAC-seq        GENOMIC other   SRS7946094      GSM4995565      Illumina NovaSeq 6000   55435867        2637580797      SRR13329759     55435867        6874047508
 SRP299803   SRX9756768      GSM4995564: scRNA_Retina_VSX2SEKO_Rep2; Mus musculus; RNA-Seq   10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7946093      GSM4995564      Illumina NovaSeq 6000   96123725        4107807391      SRR13329758     96123725        12688331700
 SRP299803   SRX9756767      GSM4995563: scRNA_Retina_VSX2SEKO_Rep1; Mus musculus; RNA-Seq   10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7946092      GSM4995563      Illumina NovaSeq 6000   94345783        4056010488      SRR13329757     94345783        12453643356
 SRP299803   SRX9756766      GSM4995562: scRNA_Retina_WT_Rep2; Mus musculus; RNA-Seq 10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7946091      GSM4995562      Illumina NovaSeq 6000   99487074        4240172698      SRR13329756     99487074        13132293768
 SRP299803   SRX9756765      GSM4995561: scRNA_Retina_WT_Rep1; Mus musculus; RNA-Seq 10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7946090      GSM4995561      Illumina NovaSeq 6000   88048461        3817540828      SRR13329755     88048461        11622396852
 SRP257758   SRX9537754      GSM4916438: Pou4f2-tdTomato/+ E17.5 scRNA-seq; Mus musculus; RNA-Seq    10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7743995      GSM4916438      Illumina HiSeq 2500     364683840       8246658699      SRR13091939     364683840       32456861760
 SRP257758   SRX9537753      GSM4916437: Atoh7-zsGreen/lacZ E17.5 scRNA-seq; Mus musculus; RNA-Seq   10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7743994      GSM4916437      Illumina HiSeq 2500     530456067       11895864680     SRR13091938     530456067       47210589963
 SRP257758   SRX9537752      GSM4916436: Atoh7-zsGreen/+ E17.5 scRNA-seq; Mus musculus; RNA-Seq      10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7743993      GSM4916436      Illumina HiSeq 2500     389849416       8671923722      SRR13091937     389849416       34696598024
 SRP257758   SRX9537751      GSM4916435: Atoh7-zsGreen/lacZ E14.5 scRNA-seq; Mus musculus; RNA-Seq   10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7743992      GSM4916435      Illumina HiSeq 2500     328878355       7875737709      SRR13091936     328878355       29270173595
 SRP257758   SRX9537750      GSM4916434: Atoh7-zsGreen/+ E14.5 scRNA-seq; Mus musculus; RNA-Seq      10090   Mus musculus    RNA-Seq TRANSCRIPTOMIC  cDNA    SRS7743991      GSM4916434      Illumina HiSeq 2500     522040155       12760941656     SRR13091935     522040155       46461573795
 ERP118072   ERX3614517      NextSeq 500 sequencing; 3' mRNA-seq of protrusions and cell bodies of BJ, PC-3M, RPE-1, U-87 and WM-266.4 cells 9606    Homo sapiens    OTHER   TRANSCRIPTOMIC  Oligo-dT        ERS3920269      SAMEA6120013    NextSeq 500     5818488 43355751        ERR3619129      1457318 109897743
 ERP118072   ERX3614516      NextSeq 500 sequencing; 3' mRNA-seq of protrusions and cell bodies of BJ, PC-3M, RPE-1, U-87 and WM-266.4 cells 9606    Homo sapiens    OTHER   TRANSCRIPTOMIC  Oligo-dT        ERS3920268      SAMEA6120012    NextSeq 500     5422441 40645479        ERR3619125      1359663 102468758
 SRP288715   SRX9369597      RPE1_SS119_p10  9606    Homo sapiens    OTHER   GENOMIC other   SRS7591452      RPE1_SS119_p10.bam      Illumina HiSeq 2000     5062938 88426773        SRR12904705     5062938 202517520
 SRP288715   SRX9369596      RPE1_SS119_p0   9606    Homo sapiens    OTHER   GENOMIC other   SRS7591451      RPE1_SS119_p0.bam       Illumina HiSeq 2000     978835  19219630        SRR12904706     978835  39153400
 SRP288715   SRX9369595      RPE1_SS111_p10  9606    Homo sapiens    OTHER   GENOMIC other   SRS7591450      RPE1_SS111_p10.bam      Illumina HiSeq 2000     6205827 108129733       SRR12904707     6205827 248233080
 SRP288715   SRX9369594      RPE1_SS111_p0   9606    Homo sapiens    OTHER   GENOMIC other   SRS7591449      RPE1_SS111_p0.bam       Illumina HiSeq 2000     928703  18488436        SRR12904708     928703  37148120
 SRP288715   SRX9369593      RPE1_SS51_p10   9606    Homo sapiens    OTHER   GENOMIC other   SRS7591448      RPE1_SS51_p10.bam       Illumina HiSeq 2000     6088168 106065537       SRR12904709     6088168 243526720
 SRP288715   SRX9369592      RPE1_SS51_p0    9606    Homo sapiens    OTHER   GENOMIC other   SRS7591447      RPE1_SS51_p0.bam        Illumina HiSeq 2000     1624227 30610200        SRR12904710     1624227 64969080
 SRP288715   SRX9369591      RPE1_SS48_p10   9606    Homo sapiens    OTHER   GENOMIC other   SRS7591446      RPE1_SS48_p10.bam       Illumina HiSeq 2000     8117881 139408135       SRR12904711     8117881 324715240
 SRP288715   SRX9369590      RPE1_SS48_p0    9606    Homo sapiens    OTHER   GENOMIC other   SRS7591445      RPE1_SS48_p0.bam        Illumina HiSeq 2000     776140  15821200        SRR12904712     776140  31045600
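
The hits above come from several different studies. As a rough aid for choosing a project, the following Python sketch tallies hits per study and library strategy. It does not call pysradb: the `hits` list is copied by hand from a few rows of the table above, and the name `summarize_hits` is invented for this illustration.

```python
from collections import Counter, defaultdict

# A few (study_accession, experiment_library_strategy) pairs copied by hand
# from the search output above; this is not the complete list of hits.
hits = [
    ("SRP299803", "ATAC-seq"),
    ("SRP299803", "RNA-Seq"),
    ("SRP299803", "RNA-Seq"),
    ("SRP257758", "RNA-Seq"),
    ("ERP118072", "OTHER"),
    ("SRP288715", "OTHER"),
]


def summarize_hits(pairs):
    """Count hits per study, broken down by library strategy."""
    summary = defaultdict(Counter)
    for study, strategy in pairs:
        summary[study][strategy] += 1
    # Convert to plain dicts so the result prints and compares simply.
    return {study: dict(counts) for study, counts in summary.items()}


if __name__ == "__main__":
    for study, counts in summarize_hits(hits).items():
        print(study, counts)
```

A study that mixes several strategies, as SRP299803 does here, stands out immediately in such a summary.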

By default, search returns the first 20 hits. SRP299803 looks like a project of interest, but the search output carries only limited information. We want to look up more detailed information about this project:

$ pysradb metadata SRP299803 | head
 study_accession     experiment_accession    experiment_title        experiment_desc organism_taxid  organism_name   library_name    library_strategy        library_source  library_selection       library_layout  sample_accession        sample_title    instrument      instrument_model        instrument_model_desc   total_spots     total_size      run_accession   run_total_spots run_total_bases
 SRP299803   SRX9756769      GSM4995565: scATAC_Retina_WT; Mus musculus; ATAC-seq    GSM4995565: scATAC_Retina_WT; Mus musculus; ATAC-seq    10090   Mus musculus            ATAC-seq        GENOMIC other   PAIRED  SRS7946094              Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA        55435867        2637580797      SRR13329759     55435867        6874047508
 SRP299803   SRX9756768      GSM4995564: scRNA_Retina_VSX2SEKO_Rep2; Mus musculus; RNA-Seq   GSM4995564: scRNA_Retina_VSX2SEKO_Rep2; Mus musculus; RNA-Seq   10090   Mus musculus            RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946093              Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA        96123725        4107807391      SRR13329758     96123725        12688331700
 SRP299803   SRX9756767      GSM4995563: scRNA_Retina_VSX2SEKO_Rep1; Mus musculus; RNA-Seq   GSM4995563: scRNA_Retina_VSX2SEKO_Rep1; Mus musculus; RNA-Seq   10090   Mus musculus            RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946092              Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA        94345783        4056010488      SRR13329757     94345783        12453643356
 SRP299803   SRX9756766      GSM4995562: scRNA_Retina_WT_Rep2; Mus musculus; RNA-Seq GSM4995562: scRNA_Retina_WT_Rep2; Mus musculus; RNA-Seq 10090   Mus musculus            RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946091              Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA        99487074        4240172698      SRR13329756     99487074        13132293768
 SRP299803   SRX9756765      GSM4995561: scRNA_Retina_WT_Rep1; Mus musculus; RNA-Seq GSM4995561: scRNA_Retina_WT_Rep1; Mus musculus; RNA-Seq 10090   Mus musculus            RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946090              Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA        88048461        3817540828      SRR13329755     88048461        11622396852

It is also possible to get more detailed information using the --detailed flag:

$ pysradb metadata SRP299803 --detailed

 run_accession       study_accession experiment_accession    experiment_title        experiment_desc organism_taxid  organism_name   library_name    library_strategy        library_source  library_selection       library_layout  sample_accession        sample_title    instrument      instrument_model        instrument_model_desc   total_spots     total_size      run_total_spots run_total_bases run_alias       sra_url experiment_alias        source_name     strain background       genotype        tissue/cell type        molecule subtype        ena_fastq_http  ena_fastq_http_1        ena_fastq_http_2        ena_fastq_ftp   ena_fastq_ftp_1 ena_fastq_ftp_2
 SRR13329759 SRP299803       SRX9756769      GSM4995565: scATAC_Retina_WT; Mus musculus; ATAC-seq    GSM4995565: scATAC_Retina_WT; Mus musculus; ATAC-seq    10090   Mus musculus            ATAC-seq        GENOMIC other   PAIRED  SRS7946094              Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA        55435867        2637580797      55435867        6874047508      GSM4995565_r1   https://sra-download.ncbi.nlm.nih.gov/traces/sra77/SRR/013017/SRR13329759       GSM4995565      wild type_retina        C57BL/6 wild type       retina                  http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/059/SRR13329759/SRR13329759_1.fastq.gz       http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/059/SRR13329759/SRR13329759_2.fastq.gz               era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/059/SRR13329759/SRR13329759_1.fastq.gz    era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/059/SRR13329759/SRR13329759_2.fastq.gz
 SRR13329758 SRP299803       SRX9756768      GSM4995564: scRNA_Retina_VSX2SEKO_Rep2; Mus musculus; RNA-Seq   GSM4995564: scRNA_Retina_VSX2SEKO_Rep2; Mus musculus; RNA-Seq   10090   Mus musculus            RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946093              Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA        96123725        4107807391      96123725        12688331700     GSM4995564_r1   https://sra-download.ncbi.nlm.nih.gov/traces/sra70/SRR/013017/SRR13329758       GSM4995564      Vsx2SE Δ/Δ_retina       C57BL/6 Vsx2SE {delta}/{delta}  retina  3' RNA          http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/058/SRR13329758/SRR13329758_1.fastq.gz       http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/058/SRR13329758/SRR13329758_2.fastq.gz               era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/058/SRR13329758/SRR13329758_1.fastq.gz    era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/058/SRR13329758/SRR13329758_2.fastq.gz
 SRR13329757 SRP299803       SRX9756767      GSM4995563: scRNA_Retina_VSX2SEKO_Rep1; Mus musculus; RNA-Seq   GSM4995563: scRNA_Retina_VSX2SEKO_Rep1; Mus musculus; RNA-Seq   10090   Mus musculus            RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946092              Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA        94345783        4056010488      94345783        12453643356     GSM4995563_r1   https://sra-download.ncbi.nlm.nih.gov/traces/sra79/SRR/013017/SRR13329757       GSM4995563      Vsx2SE Δ/Δ_retina       C57BL/6 Vsx2SE {delta}/{delta}  retina  3' RNA          http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/057/SRR13329757/SRR13329757_1.fastq.gz       http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/057/SRR13329757/SRR13329757_2.fastq.gz               era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/057/SRR13329757/SRR13329757_1.fastq.gz    era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/057/SRR13329757/SRR13329757_2.fastq.gz
 SRR13329756 SRP299803       SRX9756766      GSM4995562: scRNA_Retina_WT_Rep2; Mus musculus; RNA-Seq GSM4995562: scRNA_Retina_WT_Rep2; Mus musculus; RNA-Seq 10090   Mus musculus            RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946091              Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA        99487074        4240172698      99487074        13132293768     GSM4995562_r1   https://sra-download.ncbi.nlm.nih.gov/traces/sra77/SRR/013017/SRR13329756       GSM4995562      wild type_retina        C57BL/6 wild type       retina  3' RNA          http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/056/SRR13329756/SRR13329756_1.fastq.gz       http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/056/SRR13329756/SRR13329756_2.fastq.gz               era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/056/SRR13329756/SRR13329756_1.fastq.gz    era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/056/SRR13329756/SRR13329756_2.fastq.gz
 SRR13329755 SRP299803       SRX9756765      GSM4995561: scRNA_Retina_WT_Rep1; Mus musculus; RNA-Seq GSM4995561: scRNA_Retina_WT_Rep1; Mus musculus; RNA-Seq 10090   Mus musculus            RNA-Seq TRANSCRIPTOMIC  cDNA    PAIRED  SRS7946090              Illumina NovaSeq 6000   Illumina NovaSeq 6000   ILLUMINA        88048461        3817540828      88048461        11622396852     GSM4995561_r1   https://sra-download.ncbi.nlm.nih.gov/traces/sra72/SRR/013017/SRR13329755       GSM4995561      wild type_retina        C57BL/6 wild type       retina  3' RNA          http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/055/SRR13329755/SRR13329755_1.fastq.gz       http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133/055/SRR13329755/SRR13329755_2.fastq.gz               era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/055/SRR13329755/SRR13329755_1.fastq.gz    era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR133/055/SRR13329755/SRR13329755_2.fastq.gz
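
The detailed output includes ENA FASTQ links for each run. Here is a minimal sketch of collecting those links per run, assuming the detailed table has already been loaded as a list of dictionaries keyed by the column names shown in the header. The two records are copied from the output above and reduced to the columns used here; `fastq_urls` is a name invented for this illustration.

```python
# Two records copied from the detailed output above, reduced to four columns.
BASE = "http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR133"
rows = [
    {
        "run_accession": "SRR13329759",
        "ena_fastq_http": "",  # empty in the output above
        "ena_fastq_http_1": BASE + "/059/SRR13329759/SRR13329759_1.fastq.gz",
        "ena_fastq_http_2": BASE + "/059/SRR13329759/SRR13329759_2.fastq.gz",
    },
    {
        "run_accession": "SRR13329758",
        "ena_fastq_http": "",
        "ena_fastq_http_1": BASE + "/058/SRR13329758/SRR13329758_1.fastq.gz",
        "ena_fastq_http_2": BASE + "/058/SRR13329758/SRR13329758_2.fastq.gz",
    },
]


def fastq_urls(records):
    """Map each run accession to its non-empty ENA FASTQ HTTP links."""
    columns = ("ena_fastq_http", "ena_fastq_http_1", "ena_fastq_http_2")
    urls = {}
    for record in records:
        # Skip columns that are missing or empty for this run.
        urls[record["run_accession"]] = [record[c] for c in columns if record.get(c)]
    return urls


if __name__ == "__main__":
    for run, links in fastq_urls(rows).items():
        print(run, len(links), "file(s)")
```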

Having made sure this dataset is indeed of interest, we want to save some work and see if the processed dataset has been made available on GEO by the authors:

$ pysradb srp-to-gse SRP299803

study_accession  study_alias
SRP299803        GSE164044

So indeed a GEO project exists for this SRA dataset.

Notice that the GEO information was also visible in the output of metadata --detailed. Suppose we had started with the GSM id of one of the experiments, say GSM4995565. Starting from this GSM id, we want to get the following information:

  • SRP id of the project

  • GSE id of the project

  • SRX id of the experiment

  • SRR id(s) corresponding to the experiment

Get SRP id:

$ pysradb gsm-to-srp GSM4995565

experiment_alias study_accession
GSM4995565       SRP299803

Get GSE id:

$ pysradb gsm-to-gse GSM4995565

experiment_alias study_alias
GSM4995565       GSE164044

Get SRX id:

$ pysradb gsm-to-srx GSM4995565

experiment_alias experiment_accession
GSM4995565       SRX9756769

Get SRR id(s):

$ pysradb gsm-to-srr GSM4995565

experiment_alias run_accession
GSM4995565       SRR13329759
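
The four lookups above each print a small two-column table. The sketch below merges such outputs into a single record of ids. It works on the text pasted from the outputs above rather than calling pysradb, and it assumes whitespace-separated columns whose values contain no spaces; `parse_table` and `collect_ids` are names invented for this illustration.

```python
# Outputs of the four conversion commands, pasted from above.
OUTPUTS = [
    "experiment_alias study_accession\nGSM4995565       SRP299803",
    "experiment_alias study_alias\nGSM4995565       GSE164044",
    "experiment_alias experiment_accession\nGSM4995565       SRX9756769",
    "experiment_alias run_accession\nGSM4995565       SRR13329759",
]


def parse_table(output):
    """Parse a whitespace-separated table (header line first) into dicts."""
    lines = [line for line in output.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def collect_ids(outputs):
    """Gather the distinct values seen for every column across all tables."""
    ids = {}
    for output in outputs:
        for row in parse_table(output):
            for column, value in row.items():
                values = ids.setdefault(column, [])
                if value not in values:
                    values.append(value)
    return ids


if __name__ == "__main__":
    for column, values in collect_ids(OUTPUTS).items():
        print(column, values)
```

Values are kept as lists because an experiment can have more than one run.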

Case Study 2

Our first case study included metadata search. Next, we explore downloading datasets.

We have an SRP id to start off with: SRP000941. We want to quickly check out its contents:

$ pysradb metadata SRP000941 --detailed | head

study_accession     experiment_accession    experiment_title        experiment_desc organism_taxid  organism_name   library_name    library_strategy        library_source  library_selection       library_layout  sample_accession        sample_title    instrument      instrument_model        instrument_model_desc   total_spots     total_size      run_accession   run_total_spots run_total_bases
SRP000941   SRX056722       Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells      Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells      9606    Homo sapiens    SAK270  ChIP-Seq        GENOMIC ChIP    SINGLE  SRS184466               Illumina HiSeq 2000     Illumina HiSeq 2000     ILLUMINA        26900401        531654480       SRR179707       26900401        807012030
SRP000941   SRX027889       Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells 9606    Homo sapiens    SAK201  ChIP-Seq        GENOMIC ChIP    SINGLE  SRS116481               Illumina Genome Analyzer II     Illumina Genome Analyzer II     ILLUMINA        37528590        779578968       SRR067978       37528590        1351029240
SRP000941   SRX027888       Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606    Homo sapiens    LLH1U   ChIP-Seq        GENOMIC RANDOM  SINGLE  SRS116483               Illumina Genome Analyzer II     Illumina Genome Analyzer II     ILLUMINA        13603127        3232309537      SRR067977       13603127        489712572
SRP000941   SRX027887       Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606    Homo sapiens    DM219   ChIP-Seq        GENOMIC RANDOM  SINGLE  SRS116562               Illumina Genome Analyzer II     Illumina Genome Analyzer II     ILLUMINA        22430523        506327844       SRR067976       22430523        807498828

This project is a collection of multiple assays, as a tally of the library_strategy column shows:

$ pysradb metadata SRP000941 --detailed  | tr -s '  ' | cut -f5 -d ' ' | sort | uniq -c

999 Bisulfite-Seq
768 ChIP-Seq
  1 library_strategy
121 OTHER
353 RNA-Seq
 28 WGS
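
The pipeline above selects the column by position and counts the header row as data (the "1 library_strategy" line). As an alternative, here is a minimal Python sketch that tallies a column by name. It assumes the metadata is available as tab-separated text with a header row, for example a file written with pysradb's --saveto option if your version provides it. `count_column` and `SAMPLE_TSV` are names invented for this illustration, and the sample rows are copied from tables shown earlier on this page.

```python
import csv
import io
from collections import Counter

# Three columns from rows shown earlier on this page, as tab-separated text.
SAMPLE_TSV = (
    "study_accession\texperiment_accession\tlibrary_strategy\n"
    "SRP000941\tSRX056722\tChIP-Seq\n"
    "SRP000941\tSRX027889\tChIP-Seq\n"
    "SRP299803\tSRX9756769\tATAC-seq\n"
    "SRP299803\tSRX9756768\tRNA-Seq\n"
    "SRP299803\tSRX9756767\tRNA-Seq\n"
)


def count_column(handle, column):
    """Tally the values of one named column in a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    # DictReader consumes the header itself, so it is never counted as data.
    return Counter(row[column] for row in reader)


if __name__ == "__main__":
    counts = count_column(io.StringIO(SAMPLE_TSV), "library_strategy")
    for strategy, n in sorted(counts.items()):
        print(n, strategy)
```

For a saved file, pass `open(path, newline="")` instead of the `io.StringIO` wrapper.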

However, we want to download only the RNA-seq samples:

$ pysradb metadata SRP000941 --detailed | grep 'study\|RNA-Seq' | pysradb download

This will download all RNA-seq samples from this project using aspera-client, if available. Alternatively, it can use wget.
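
Note that grep matches "RNA-Seq" anywhere in a line, including experiment titles. Where an exact match on the library_strategy column is preferred, the selection can be sketched in Python as follows. This assumes tab-separated metadata with a header row; `filter_rows` and `SAMPLE_TSV` are names invented for this illustration, and the sample rows are copied from tables shown earlier on this page.

```python
import csv
import io

# Three rows from earlier tables, reduced to three columns.
SAMPLE_TSV = (
    "experiment_accession\texperiment_title\tlibrary_strategy\n"
    "SRX9756769\tGSM4995565: scATAC_Retina_WT; Mus musculus; ATAC-seq\tATAC-seq\n"
    "SRX9756768\tGSM4995564: scRNA_Retina_VSX2SEKO_Rep2; Mus musculus; RNA-Seq\tRNA-Seq\n"
    "SRX056722\tReference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells\tChIP-Seq\n"
)


def filter_rows(tsv_text, column, wanted):
    """Return TSV text with the header and the rows where column == wanted."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Compare the whole cell, not a substring of the line.
        if row[column] == wanted:
            writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    print(filter_rows(SAMPLE_TSV, "library_strategy", "RNA-Seq"), end="")
```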

Downloading an entire project is easy:

$ pysradb download -p SRP000941

Downloads are organized by SRP/SRX/SRR, mimicking the hierarchy of SRA projects.
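
To make the SRP/SRX/SRR nesting concrete, the sketch below groups metadata rows into the same three levels and prints them as an indented tree. It only illustrates the hierarchy and says nothing about the actual file names or download directory that pysradb uses. The rows are copied from the SRP000941 table above; `build_hierarchy` and `format_tree` are names invented for this illustration.

```python
# (study, experiment, run) triples copied from the SRP000941 table above.
ROWS = [
    ("SRP000941", "SRX056722", "SRR179707"),
    ("SRP000941", "SRX027889", "SRR067978"),
    ("SRP000941", "SRX027888", "SRR067977"),
]


def build_hierarchy(rows):
    """Nest runs under their experiment, and experiments under their study."""
    tree = {}
    for study, experiment, run in rows:
        tree.setdefault(study, {}).setdefault(experiment, []).append(run)
    return tree


def format_tree(tree):
    """Render the nested mapping as an indented listing."""
    lines = []
    for study, experiments in tree.items():
        lines.append(study)
        for experiment, runs in experiments.items():
            lines.append("  " + experiment)
            lines.extend("    " + run for run in runs)
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_tree(build_hierarchy(ROWS)))
```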