Python API

Use Case 1: Fetch the metadata table (SRA-runtable)

The simplest use case of pysradb is when you know the SRA project ID (SRP) and simply want to fetch the metadata associated with it. This is generally the information found in the SraRunTable.txt that you get from NCBI’s website. See an example of a SraRunTable.

from pysradb import SRAWeb
db = SRAWeb()
df = db.sra_metadata('SRP098789')
df.head()
===============  ====================  ======================================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  ========  ============  ===============
study_accession  experiment_accession                             experiment_title                             run_accession  taxon_id  library_selection  library_layout  library_strategy  library_source  library_name    bases      spots    adapter_spec  avg_read_length
===============  ====================  ======================================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  ========  ============  ===============
SRP098789        SRX2536403            GSM2475997: 1.5 µM PF-067446846, 10 min, rep 1; Homo sapiens; OTHER  SRR5227288         9606  other              SINGLE -        OTHER             TRANSCRIPTOMIC                2104142750  42082855                             50
SRP098789        SRX2536404            GSM2475998: 1.5 µM PF-067446846, 10 min, rep 2; Homo sapiens; OTHER  SRR5227289         9606  other              SINGLE -        OTHER             TRANSCRIPTOMIC                2082873050  41657461                             50
SRP098789        SRX2536405            GSM2475999: 1.5 µM PF-067446846, 10 min, rep 3; Homo sapiens; OTHER  SRR5227290         9606  other              SINGLE -        OTHER             TRANSCRIPTOMIC                2023148650  40462973                             50
SRP098789        SRX2536406            GSM2476000: 0.3 µM PF-067446846, 10 min, rep 1; Homo sapiens; OTHER  SRR5227291         9606  other              SINGLE -        OTHER             TRANSCRIPTOMIC                2057165950  41143319                             50
SRP098789        SRX2536407            GSM2476001: 0.3 µM PF-067446846, 10 min, rep 2; Homo sapiens; OTHER  SRR5227292         9606  other              SINGLE -        OTHER             TRANSCRIPTOMIC                3027621850  60552437                             50
===============  ====================  ======================================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  ========  ============  ===============

The metadata is returned as a pandas DataFrame, so all the regular select/query operations available through pandas apply to it.
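
As a minimal sketch of such select/query operations, the snippet below rebuilds three rows of the SRP098789 table by hand (so it runs without network access) and filters them with a boolean mask and with `DataFrame.query`. The 41.5 million spot threshold is an arbitrary value chosen for illustration, and the snippet assumes the `spots` column is numeric; if your metadata returns it as strings, convert it with `pd.to_numeric` first.

```python
import pandas as pd

# Three rows copied from the SRP098789 table; in practice `df` would
# come from db.sra_metadata('SRP098789').
df = pd.DataFrame(
    {
        "experiment_accession": ["SRX2536403", "SRX2536404", "SRX2536405"],
        "run_accession": ["SRR5227288", "SRR5227289", "SRR5227290"],
        "library_layout": ["SINGLE -", "SINGLE -", "SINGLE -"],
        "spots": [42082855, 41657461, 40462973],
    }
)

# Boolean-mask selection: keep runs with more than 41.5 million spots
# (illustrative threshold, not a recommendation).
deep_runs = df[df["spots"] > 41_500_000]

# The same selection expressed with DataFrame.query.
deep_runs_query = df.query("spots > 41500000")

print(deep_runs["run_accession"].tolist())
```

Both forms return an ordinary DataFrame, so the result can be passed on to any other pandas operation.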

Use Case 2: Downloading an entire project arranged experiment-wise

Once you have fetched the metadata and made sure this is the project you were looking for, you will want to download everything at once. NCBI follows this hierarchy: SRP => SRX => SRR. Each SRP (project) has multiple SRX (experiments), and each SRX in turn has one or more SRR (runs) inside it. We want to mimic this hierarchy in our downloads. The reason is simple: in most cases you care about SRX the most and will want to “merge” your SRRs in one way or another. Having this hierarchy ensures your downstream code can handle such cases easily, without worrying about which runs (SRR) need to be merged.

We strongly recommend installing aspera-client, which transfers data over UDP and is designed for faster downloads.

from pysradb import SRAWeb
db = SRAWeb()
df = db.sra_metadata('SRP017942')
db.download(df)
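
The SRP => SRX => SRR layout described above can be sketched without touching the network. The snippet below is only an illustration: `plan_layout` is a hypothetical helper, not part of pysradb, and the `.sra` file names and the `pysradb_downloads` directory are assumptions about what a download tree might look like. The accessions are from project SRP017942.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# (study, experiment, run) triples from project SRP017942.
records = [
    ("SRP017942", "SRX217028", "SRR648667"),
    ("SRP017942", "SRX217029", "SRR648668"),
    ("SRP017942", "SRX217030", "SRR648669"),
]


def plan_layout(records, out_dir="pysradb_downloads"):
    """Group runs by experiment and return {SRX: [target paths]}.

    Hypothetical helper: the directory and file names are assumptions,
    not the names pysradb itself is guaranteed to write.
    """
    layout = defaultdict(list)
    for srp, srx, srr in records:
        layout[srx].append(PurePosixPath(out_dir) / srp / srx / f"{srr}.sra")
    return dict(layout)


layout = plan_layout(records)
for srx, paths in layout.items():
    print(srx, [str(p) for p in paths])
```

Grouping by SRX first means that an experiment with several runs ends up with all of them in one list, which is the unit you would merge downstream.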

Use Case 3: Downloading a subset of experiments

Often you need to process only a subset of the samples in a project (SRP). Consider this project, which has data spanning four named assays plus runs labeled OTHER.

df = db.sra_metadata('SRP000941')
print(df.library_strategy.unique())
['ChIP-Seq' 'Bisulfite-Seq' 'RNA-Seq' 'WGS' 'OTHER']

However, you might be interested only in the RNA-seq samples and want to download just that subset. This is simple with pysradb, since the metadata can be subset just as you would subset a dataframe in pandas.

# Keep only the RNA-Seq rows, then download just those runs
df_rna = df[df.library_strategy == 'RNA-Seq']
db.download(df=df_rna, out_dir='/pysradb_downloads')

Use Case 4: Getting cell-type/treatment information from sample_attribute

Cell type/tissue information is usually hidden in the sample_attribute column, which can be expanded into separate columns:

from pysradb.filter_attrs import expand_sample_attribute_columns
df = db.sra_metadata('SRP017942')
expand_sample_attribute_columns(df).head()

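===============  ====================  =====================================================================  =========================  ========================================================================================================================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  =========  ============  ===============  ==========  ==========  ===========  ================  ===============================
study_accession  experiment_accession  experiment_title                                                       experiment_attribute       sample_attribute                                                                                                                                          run_accession  taxon_id  library_selection  library_layout  library_strategy  library_source  library_name  bases       spots      adapter_spec  avg_read_length  assay_type  cell_line   source_name  transfected_with  treatment
===============  ====================  =====================================================================  =========================  ========================================================================================================================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  =========  ============  ===============  ==========  ==========  ===========  ================  ===============================
SRP017942        SRX217028             GSM1063575: 293T_GFP; Homo sapiens; RNA-Seq                            GEO Accession: GSM1063575  source_name: 293T cells || cell line: 293T cells || transfected with: 3XFLAG-GFP || assay type: Riboseq                                                   SRR648667      9606      other              SINGLE -        RNA-Seq           TRANSCRIPTOMIC                1806641316  50184481                 36               riboseq     293t cells  293t cells   3xflag-gfp        NaN
SRP017942        SRX217029             GSM1063576: 293T_GFP_2hrs_severe_Heat_Shock; Homo sapiens; RNA-Seq     GEO Accession: GSM1063576  source_name: 293T cells || cell line: 293T cells || transfected with: 3XFLAG-GFP || treatment: severe heat shock (44C 2 hours) || assay type: Riboseq     SRR648668      9606      other              SINGLE -        RNA-Seq           TRANSCRIPTOMIC                3436984836  95471801                 36               riboseq     293t cells  293t cells   3xflag-gfp        severe heat shock (44c 2 hours)
SRP017942        SRX217030             GSM1063577: 293T_Hspa1a; Homo sapiens; RNA-Seq                         GEO Accession: GSM1063577  source_name: 293T cells || cell line: 293T cells || transfected with: 3XFLAG-Hspa1a || assay type: Riboseq                                                SRR648669      9606      other              SINGLE -        RNA-Seq           TRANSCRIPTOMIC                3330909216  92525256                 36               riboseq     293t cells  293t cells   3xflag-hspa1a     NaN
SRP017942        SRX217031             GSM1063578: 293T_Hspa1a_2hrs_severe_Heat_Shock; Homo sapiens; RNA-Seq  GEO Accession: GSM1063578  source_name: 293T cells || cell line: 293T cells || transfected with: 3XFLAG-Hspa1a || treatment: severe heat shock (44C 2 hours) || assay type: Riboseq  SRR648670      9606      other              SINGLE -        RNA-Seq           TRANSCRIPTOMIC                3622123512  100614542                36               riboseq     293t cells  293t cells   3xflag-hspa1a     severe heat shock (44c 2 hours)
SRP017942        SRX217956             GSM794854: 3T3-Control-Riboseq; Mus musculus; RNA-Seq                  GEO Accession: GSM794854   source_name: 3T3 cells || treatment: control || cell line: 3T3 cells || assay type: Riboseq                                                               SRR649752      10090     cDNA               SINGLE -        RNA-Seq           TRANSCRIPTOMIC                594945396   16526261                 36               riboseq     3t3 cells   3t3 cells    NaN               control
===============  ====================  =====================================================================  =========================  ========================================================================================================================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  =========  ============  ===============  ==========  ==========  ===========  ================  ===============================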
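
The expanded columns in the output above (assay_type, cell_line, source_name, transfected_with, treatment) mirror the `key: value` pairs packed into sample_attribute. As a rough illustration of that idea, and not pysradb's actual implementation, the sketch below splits one such string on `||`, normalizes the keys to snake_case, and lowercases the values. `parse_sample_attribute` is a hypothetical helper name.

```python
def parse_sample_attribute(text):
    """Split a 'key: value || key: value' string into a dict.

    Hypothetical helper for illustration only; keys are snake_cased and
    values lowercased to match the expanded columns shown above.
    """
    parsed = {}
    for field in text.split("||"):
        # Split on the first colon only, so values may contain colons.
        key, sep, value = field.partition(":")
        if not sep:
            continue  # skip fragments that carry no key/value pair
        parsed[key.strip().lower().replace(" ", "_")] = value.strip().lower()
    return parsed


example = (
    "source_name: 293T cells || cell line: 293T cells || "
    "transfected with: 3XFLAG-GFP || assay type: Riboseq"
)
parsed = parse_sample_attribute(example)
print(parsed)
```

A record without a given key (for example, no `treatment` entry) simply lacks that key in the dict, which corresponds to the NaN cells in the expanded table.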

Use Case 5: Searching for datasets

Another common operation on SRA is plain-text search.

To look up all projects where “ribosome profiling” appears somewhere in the description:

df = db.search_sra(search_str='"ribosome profiling"')
df.head()

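===============  ====================  =======================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  ========
study_accession  experiment_accession  experiment_title                                         run_accession  taxon_id  library_selection  library_layout  library_strategy  library_source  library_name  bases       spots
===============  ====================  =======================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  ========
DRP003075        DRX019536             Illumina Genome Analyzer IIx sequencing of SAMD00018584  DRR021383      83333     other              SINGLE -        OTHER             TRANSCRIPTOMIC  GAII05_3      978776480   12234706
DRP003075        DRX019537             Illumina Genome Analyzer IIx sequencing of SAMD00018585  DRR021384      83333     other              SINGLE -        OTHER             TRANSCRIPTOMIC  GAII05_4      894201680   11177521
DRP003075        DRX019538             Illumina Genome Analyzer IIx sequencing of SAMD00018586  DRR021385      83333     other              SINGLE -        OTHER             TRANSCRIPTOMIC  GAII05_5      931536720   11644209
DRP003075        DRX019540             Illumina Genome Analyzer IIx sequencing of SAMD00018588  DRR021387      83333     other              SINGLE -        OTHER             TRANSCRIPTOMIC  GAII07_4      2759398700  27593987
DRP003075        DRX019541             Illumina Genome Analyzer IIx sequencing of SAMD00018589  DRR021388      83333     other              SINGLE -        OTHER             TRANSCRIPTOMIC  GAII07_5      2386196500  23861965
===============  ====================  =======================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  ========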

Again, the results are returned as a pandas DataFrame, so you can apply any subsetting operation after the query. Your query doesn’t need to be exact.
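
As one example of working with search results after the query, the sketch below rebuilds the five hits shown above by hand (so it runs offline), derives the number of bases per spot, and keeps only the runs with longer reads. The `bases_per_spot` column name and the 100-base cutoff are illustrative assumptions, and the snippet assumes `bases` and `spots` are numeric.

```python
import pandas as pd

# Five hits copied from the "ribosome profiling" search results.
hits = pd.DataFrame(
    {
        "run_accession": ["DRR021383", "DRR021384", "DRR021385", "DRR021387", "DRR021388"],
        "bases": [978776480, 894201680, 931536720, 2759398700, 2386196500],
        "spots": [12234706, 11177521, 11644209, 27593987, 23861965],
    }
)

# Bases per spot (integer division); for these single-end libraries
# this corresponds to the read length.
hits["bases_per_spot"] = hits["bases"] // hits["spots"]

# Keep only the runs whose reads are at least 100 bases long.
long_reads = hits[hits["bases_per_spot"] >= 100]
print(long_reads["run_accession"].tolist())
```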