
Converting SRA to FASTQ using Conda

This notebook demonstrates how to download SRA files with pysradb and convert them to FASTQ format using parallel-fastq-dump installed through conda.

[1]:
# Install pysradb if not already installed
try:
    import pysradb

    print(f"pysradb {pysradb.__version__} is already installed")
except ImportError:
    print("Installing pysradb from GitHub...")
    import sys

    !{sys.executable} -m pip install -q git+https://github.com/saketkc/pysradb
    print("pysradb installed successfully!")
pysradb 3.0.0.dev0 is already installed
/home/runner/work/pysradb/pysradb/pysradb/download.py:15: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
  from tqdm.autonotebook import tqdm

Install Conda

[2]:
# `conda config --add` prepends, so the channel added last (conda-forge) gets the highest priority
print("Install Miniforge or Mambaforge locally, then configure channels:")
print("conda config --add channels bioconda")
print("conda config --add channels conda-forge")
Install Miniforge or Mambaforge locally, then configure channels:
conda config --add channels bioconda
conda config --add channels conda-forge

Install parallel-fastq-dump

[3]:
print("conda install -y parallel-fastq-dump")
conda install -y parallel-fastq-dump

Install latest pysradb

[4]:
print("python -m pip install pysradb")
python -m pip install pysradb
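
Before moving on, it can help to confirm that the tools from the install steps above are actually reachable. The following is a minimal sketch, assuming the executables are named `conda`, `parallel-fastq-dump`, and `pysradb`; the `missing_tools` helper and the `REQUIRED_TOOLS` list are hypothetical names introduced here for illustration.

```python
import shutil

# Executables this notebook relies on. The list is an assumption drawn
# from the install steps above, not something pysradb defines.
REQUIRED_TOOLS = ["conda", "parallel-fastq-dump", "pysradb"]


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools are available.")
```

If anything is reported as missing, activate the conda environment in which the packages were installed and rerun the check.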

Get metadata

[5]:
!pysradb metadata --detailed SRP063852
Displaying 54 columns in chunks of 5

Columns 1-5:
                                                       experime               
 run_acce  study_ac                                    nt_acces  experiment_t 
 ssion     cession   study_title                       sion      itle         
 SRR24337  SRP06385  A spectral analysis approach to   SRX12544  GSM1887643:
 94        2         detects actively translated open  13        ribosome
                     reading frames in high                      profiling;
                     resolution ribosome profiling               Homo
                     data                                        sapiens;
                                                                 miRNA-Seq

Columns 6-10:
                                  organism_  organism_  library_na  library_s 
 experiment_desc                  taxid      name       me          trategy   
 GSM1887643: ribosome profiling;  9606       Homo       -           miRNA-Seq
 Homo sapiens; miRNA-Seq                     sapiens

Columns 11-15:
 library_sourc                     library_layo  sample_accessi               
 e              library_selection  ut            on              sample_title 
 TRANSCRIPTOMI  size               SINGLE        SRS1072728      -
 C              fractionation

Columns 16-20:
                                            instrument_mode  instrument_model 
 biosample    bioproject    instrument      l                _desc            
 SAMN0409381  PRJNA296059   Illumina HiSeq  Illumina HiSeq   ILLUMINA
 8                          2000            2000

Columns 21-25:
 total_spots    total_size    run_total_spots  run_total_bases  run_alias     
 31967082       626381849     31967082         916773615        GSM1887643_r1

Columns 26-30:
 public_filen                                                     public_vers 
 ame           public_size  public_date   public_md5              ion         
 SRR2433794.s  278092969    2020-06-21    3caeb3f88cce8b4dce5d0e  1
 ralite                     10:31:41      51caa1d9f2

Columns 31-35:
 public_sem  public_sup  public_srat                              aws_free_eg 
 antic_name  ertype      oolkit       aws_url                     ress        
 SRA Lite    Primary     1            s3://sra-pub-zq-7/SRR24337  s3.us-east-
             ETL                      94/SRR2433794.sralite.1     1

Columns 36-40:
                                                                      ncbi_ac 
 aws_acce                                                   ncbi_fre  cess_ty 
 ss_type   public_url               ncbi_url                e_egress  pe      
 aws       https://sra-downloadb.b  https://sra-downloadb.  worldwid  anonymo
 identity  e-md.ncbi.nlm.nih.gov/s  be-md.ncbi.nlm.nih.gov  e         us
           os5/sra-pub-zq-11/SRR00  /sos5/sra-pub-zq-11/SR
           2/433/SRR2433794/SRR243  R002/433/SRR2433794/SR
           3794.sralite.1           R2433794.sralite.1

Columns 41-45:
                               gcp_free_e  gcp_access  experiment  source_nam 
 gcp_url                       gress       _type       _alias      e          
 gs://sra-pub-zq-101/SRR24337  gs.us-east  gcp         GSM1887643  HEK293
 94/SRR2433794.zq.1            1           identity

Columns 46-50:
 cell                              ena_fast  ena_fast                         
 line      ena_fastq_http          q_http_1  q_http_2  ena_fastq_ftp          
 HEK293    http://ftp.sra.ebi.ac.  -         -         era-fasp@fasp.sra.ebi.
           uk/vol1/fastq/SRR243/0                      ac.uk:vol1/fastq/SRR24
           04/SRR2433794/SRR24337                      3/004/SRR2433794/SRR24
           94.fastq.gz                                 33794.fastq.gz

Columns 51-54:
                  ena_fastq_ftp_  study_geo_accessio  experiment_geo_accessio 
 ena_fastq_ftp_1  2               n                   n                       
 -                -               GSE73136            GSM1887643
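
The same metadata can also be retrieved from Python, which makes it easier to filter runs before downloading. This is a sketch rather than the notebook's own method: `fetch_metadata` wraps the `SRAweb.sra_metadata` call from pysradb and needs network access, while `runs_by_layout` is a hypothetical helper that works on plain dictionaries. The column names `run_accession` and `library_layout` are taken from the table above.

```python
def fetch_metadata(study_accession):
    """Fetch detailed metadata as a pandas DataFrame (needs network access)."""
    # Imported lazily so the helper below works without pysradb installed.
    from pysradb.sraweb import SRAweb

    return SRAweb().sra_metadata(study_accession, detailed=True)


def runs_by_layout(records, layout):
    """Return run accessions whose library_layout equals `layout`."""
    return [r["run_accession"] for r in records if r["library_layout"] == layout]


# One record copied from the table above, reduced to the relevant columns.
example_records = [
    {"run_accession": "SRR2433794", "library_layout": "SINGLE"},
]
print(runs_by_layout(example_records, "SINGLE"))

# With network access, the records could come from the live query instead:
# records = fetch_metadata("SRP063852").to_dict("records")
```

Knowing the layout up front is useful because it determines how many FASTQ files to expect per run after conversion.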

Download data

[6]:
print("pysradb download -y -p SRP063852")
pysradb download -y -p SRP063852
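
The cell above only prints the command. A minimal sketch of running it from Python with `subprocess` follows, assuming `pysradb` is on PATH; `build_download_command` and `run_download` are hypothetical helper names, and the flags are exactly those shown above.

```python
import subprocess


def build_download_command(project):
    """Build the pysradb download command shown above as an argument list."""
    return ["pysradb", "download", "-y", "-p", project]


def run_download(project):
    """Run the download; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_download_command(project), check=True)


print(" ".join(build_download_command("SRP063852")))
# run_download("SRP063852")  # uncomment to actually download the project
```

Passing the command as a list avoids shell quoting issues, and `check=True` surfaces a failed download as an exception instead of letting later steps run on missing files.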

Inspect downloaded files

[7]:
print("ls -ltrh pysradb_downloads")
ls -ltrh pysradb_downloads

[8]:
print("ls -ltrh pysradb_downloads/SRP063852/SRX1254413")
ls -ltrh pysradb_downloads/SRP063852/SRX1254413
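
The directory listings above can also be done from Python. This sketch assumes the layout used in this notebook's paths, `pysradb_downloads/<study>/<experiment>/<run>.sra`; the `.sra` extension and both helper names are assumptions for illustration, and the actual file name may differ depending on what was downloaded.

```python
from pathlib import Path


def expected_sra_path(study, experiment, run, root="pysradb_downloads"):
    """Path where this notebook expects a downloaded run to be stored."""
    return Path(root) / study / experiment / f"{run}.sra"


def find_sra_files(root="pysradb_downloads"):
    """Return all .sra files under `root`, or an empty list if it is missing."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.sra"))


print(expected_sra_path("SRP063852", "SRX1254413", "SRR2433794"))
print(find_sra_files())
```

An empty list from `find_sra_files` indicates that the download step has not been run in the current working directory.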

SRA to FASTQ

[9]:
print("mkdir -p sratofastq tmpdir")
print("parallel-fastq-dump --threads 4 --outdir sratofastq/ --split-files --tmpdir tmpdir --gzip -s pysradb_downloads/SRP063852/SRX1254413/SRR2433794.sra")
mkdir -p sratofastq tmpdir
parallel-fastq-dump --threads 4 --outdir sratofastq/ --split-files --tmpdir tmpdir --gzip -s pysradb_downloads/SRP063852/SRX1254413/SRR2433794.sra
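
The two commands above can be wrapped in a small Python function so that the output and temporary directories are created before the conversion starts. This is a sketch under the assumption that `parallel-fastq-dump` is on PATH; `build_convert_command` and `convert` are hypothetical names, and the flags mirror the command printed above.

```python
import subprocess
from pathlib import Path


def build_convert_command(sra_path, outdir="sratofastq", tmpdir="tmpdir", threads=4):
    """Build the parallel-fastq-dump command shown above as an argument list."""
    return [
        "parallel-fastq-dump",
        "--threads", str(threads),
        "--outdir", f"{outdir}/",
        "--split-files",
        "--tmpdir", str(tmpdir),
        "--gzip",
        "-s", str(sra_path),
    ]


def convert(sra_path, outdir="sratofastq", tmpdir="tmpdir", threads=4):
    """Create the directories (like `mkdir -p`) and run the conversion."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    Path(tmpdir).mkdir(parents=True, exist_ok=True)
    command = build_convert_command(sra_path, outdir, tmpdir, threads)
    subprocess.run(command, check=True)


sra = "pysradb_downloads/SRP063852/SRX1254413/SRR2433794.sra"
print(" ".join(build_convert_command(sra)))
# convert(sra)  # uncomment to actually run the conversion
```
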
[10]:
!ls -ltrh sratofastq
ls: cannot access 'sratofastq': No such file or directory
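
The listing above fails when `sratofastq` does not exist, for example when the conversion command was only printed and not run. A minimal sketch of a more forgiving check follows; `list_fastq` is a hypothetical helper, and the `*.fastq.gz` pattern assumes the `--gzip` flag used above.

```python
from pathlib import Path


def list_fastq(outdir="sratofastq"):
    """Return gzipped FASTQ files in `outdir`; empty list if it does not exist."""
    outdir = Path(outdir)
    if not outdir.is_dir():
        return []
    return sorted(outdir.glob("*.fastq.gz"))


files = list_fastq()
if files:
    for path in files:
        print(path.name, path.stat().st_size, "bytes")
else:
    print("No FASTQ files found; run the conversion step first.")
```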