Converting SRA to FASTQ using Conda¶
This notebook demonstrates how to download SRA files with pysradb and convert them to FASTQ format using parallel-fastq-dump installed through conda.
[1]:
# Install pysradb if not already installed
try:
    import pysradb
    print(f"pysradb {pysradb.__version__} is already installed")
except ImportError:
    print("Installing pysradb from GitHub...")
    import sys
    !{sys.executable} -m pip install -q git+https://github.com/saketkc/pysradb
    print("pysradb installed successfully!")
pysradb 3.0.0.dev0 is already installed
Install Conda¶
[2]:
print("Install Miniforge or Mambaforge locally, then configure channels:")
print("conda config --add channels conda-forge")
print("conda config --add channels bioconda")
Install Miniforge or Mambaforge locally, then configure channels:
conda config --add channels conda-forge
conda config --add channels bioconda
Install parallel-fastq-dump¶
[3]:
print("conda install -y parallel-fastq-dump")
conda install -y parallel-fastq-dump
Install latest pysradb¶
[4]:
print("python -m pip install pysradb")
python -m pip install pysradb
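Before moving on, it can help to confirm that the tools installed above are actually visible from the notebook's environment. The following is a minimal sketch using only the standard library; the helper name `missing_tools` is hypothetical, and the executable names are assumed to match the package names used above.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # Executable names are assumed to match the packages installed above.
    required = ["conda", "pysradb", "parallel-fastq-dump"]
    missing = missing_tools(required)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools are available.")
```

If a tool is reported as missing, the usual cause is that the notebook kernel runs in a different conda environment from the one where the package was installed.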
Get metadata¶
[5]:
!pysradb metadata --detailed SRP063852
Displaying 54 columns in chunks of 5
Columns 1-5:
run_accession: SRR2433794
study_accession: SRP063852
study_title: A spectral analysis approach to detects actively translated open reading frames in high resolution ribosome profiling data
experiment_accession: SRX1254413
experiment_title: GSM1887643: ribosome profiling; Homo sapiens; miRNA-Seq
Columns 6-10:
experiment_desc: GSM1887643: ribosome profiling; Homo sapiens; miRNA-Seq
organism_taxid: 9606
organism_name: Homo sapiens
library_name: -
library_strategy: miRNA-Seq
Columns 11-15:
library_source: TRANSCRIPTOMIC
library_selection: size fractionation
library_layout: SINGLE
sample_accession: SRS1072728
sample_title: -
Columns 16-20:
biosample: SAMN04093818
bioproject: PRJNA296059
instrument: Illumina HiSeq 2000
instrument_model: Illumina HiSeq 2000
instrument_model_desc: ILLUMINA
Columns 21-25:
total_spots: 31967082
total_size: 626381849
run_total_spots: 31967082
run_total_bases: 916773615
run_alias: GSM1887643_r1
Columns 26-30:
public_filename: SRR2433794.sralite
public_size: 278092969
public_date: 2020-06-21 10:31:41
public_md5: 3caeb3f88cce8b4dce5d0e51caa1d9f2
public_version: 1
Columns 31-35:
public_semantic_name: SRA Lite
public_supertype: Primary ETL
public_sratoolkit: 1
aws_url: s3://sra-pub-zq-7/SRR2433794/SRR2433794.sralite.1
aws_free_egress: s3.us-east-1
Columns 36-40:
aws_access_type: aws identity
public_url: https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos5/sra-pub-zq-11/SRR002/433/SRR2433794/SRR2433794.sralite.1
ncbi_url: https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos5/sra-pub-zq-11/SRR002/433/SRR2433794/SRR2433794.sralite.1
ncbi_free_egress: worldwide
ncbi_access_type: anonymous
Columns 41-45:
gcp_url: gs://sra-pub-zq-101/SRR2433794/SRR2433794.zq.1
gcp_free_egress: gs.us-east1
gcp_access_type: gcp identity
experiment_alias: GSM1887643
source_name: HEK293
Columns 46-50:
cell line: HEK293
ena_fastq_http: http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR243/004/SRR2433794/SRR2433794.fastq.gz
ena_fastq_http_1: -
ena_fastq_http_2: -
ena_fastq_ftp: era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR243/004/SRR2433794/SRR2433794.fastq.gz
Columns 51-54:
ena_fastq_ftp_1: -
ena_fastq_ftp_2: -
study_geo_accession: GSE73136
experiment_geo_accession: GSM1887643
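The wide console table is hard to scan, so for scripting it is often easier to work from a saved file. The sketch below assumes the metadata was first written to a tab-separated file, for example with `pysradb metadata --detailed SRP063852 --saveto metadata.tsv`; both the `--saveto` option and the tab-separated layout should be checked against the installed pysradb version. The helper name `summarize_runs` is hypothetical, and the example row reuses values from the output above.

```python
import csv
import io


def summarize_runs(
    tsv_text,
    fields=("run_accession", "experiment_accession", "library_layout"),
):
    """Pick a few columns from tab-separated metadata text.

    Returns one dict per run; columns absent from the input are
    returned as empty strings.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [{field: row.get(field, "") for field in fields} for row in reader]


if __name__ == "__main__":
    # Inline example row; in practice, read the text from the saved file.
    example = (
        "run_accession\texperiment_accession\tlibrary_layout\n"
        "SRR2433794\tSRX1254413\tSINGLE\n"
    )
    for record in summarize_runs(example):
        print(record)
```

The `library_layout` field is worth checking before conversion, since it indicates whether a run is single-end or paired-end.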
Download data¶
[6]:
print("pysradb download -y -p SRP063852")
pysradb download -y -p SRP063852
Run parallel-fastq-dump¶
[7]:
print("ls -ltrh pysradb_downloads")
ls -ltrh pysradb_downloads
[8]:
print("ls -ltrh pysradb_downloads/SRP063852/SRX1254413")
ls -ltrh pysradb_downloads/SRP063852/SRX1254413
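The same inspection can be done from Python, which is convenient when the list of runs feeds directly into the conversion step. This is a minimal sketch that assumes the `pysradb_downloads/<study>/<experiment>/<run>.sra` layout implied by the paths above; the helper name `find_sra_files` is hypothetical.

```python
from pathlib import Path


def find_sra_files(root="pysradb_downloads"):
    """Return all .sra files below `root`, sorted by path.

    An empty list is returned when `root` does not exist, so the
    caller does not have to handle a missing download directory.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.sra"))


if __name__ == "__main__":
    files = find_sra_files()
    if not files:
        print("No .sra files found; run the download step first.")
    for path in files:
        print(path)
```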
SRA to FASTQ¶
[9]:
print("mkdir -p sratofastq tmpdir")
print("parallel-fastq-dump --threads 4 --outdir sratofastq/ --split-files --tmpdir tmpdir --gzip -s pysradb_downloads/SRP063852/SRX1254413/SRR2433794.sra")
mkdir -p sratofastq tmpdir
parallel-fastq-dump --threads 4 --outdir sratofastq/ --split-files --tmpdir tmpdir --gzip -s pysradb_downloads/SRP063852/SRX1254413/SRR2433794.sra
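To run the conversion from Python rather than a shell, the command can be assembled as an argument list and passed to `subprocess`. The sketch below mirrors the flags shown above; the helper names `build_pfd_command` and `convert` are hypothetical, and `convert` is only defined here, not called, because it needs parallel-fastq-dump and a downloaded `.sra` file.

```python
import shlex
import shutil
import subprocess
from pathlib import Path


def build_pfd_command(sra_path, outdir="sratofastq", tmpdir="tmpdir", threads=4):
    """Build the parallel-fastq-dump argument list used in this notebook."""
    return [
        "parallel-fastq-dump",
        "--threads", str(threads),
        "--outdir", str(outdir),
        "--split-files",
        "--tmpdir", str(tmpdir),
        "--gzip",
        "-s", str(sra_path),
    ]


def convert(sra_path, outdir="sratofastq", tmpdir="tmpdir", threads=4):
    """Create the output and temp directories, then run the conversion."""
    if shutil.which("parallel-fastq-dump") is None:
        raise FileNotFoundError("parallel-fastq-dump not found on PATH")
    Path(outdir).mkdir(parents=True, exist_ok=True)
    Path(tmpdir).mkdir(parents=True, exist_ok=True)
    cmd = build_pfd_command(sra_path, outdir, tmpdir, threads)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    cmd = build_pfd_command(
        "pysradb_downloads/SRP063852/SRX1254413/SRR2433794.sra"
    )
    print(shlex.join(cmd))  # show the command without running it
```

Passing a list instead of a shell string avoids quoting problems if a path contains spaces.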
[10]:
!ls -ltrh sratofastq
ls: cannot access 'sratofastq': No such file or directory
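The error above appears to arise because the preceding cell only prints the `mkdir` and `parallel-fastq-dump` commands instead of running them, so `sratofastq` was never created in this rendering. The following sketch checks for output and reports a missing directory instead of failing; the helper name `list_fastq_outputs` is hypothetical, and the `*.fastq.gz` pattern is an assumption based on the `--gzip` flag.

```python
from pathlib import Path


def list_fastq_outputs(outdir="sratofastq"):
    """Return sorted names of gzipped FASTQ files in `outdir`.

    Returns None when the directory does not exist, which separates
    "conversion never ran" from "conversion produced no files".
    """
    outdir = Path(outdir)
    if not outdir.is_dir():
        return None
    return sorted(p.name for p in outdir.glob("*.fastq.gz"))


if __name__ == "__main__":
    outputs = list_fastq_outputs()
    if outputs is None:
        print("Output directory not found; run the conversion step first.")
    elif not outputs:
        print("Output directory exists but contains no .fastq.gz files.")
    else:
        print("\n".join(outputs))
```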