Python API
Use Case 1: Fetch the metadata table (SRA-runtable)
The simplest use case for pysradb is when you know the SRA project ID (SRP) and simply want to fetch the metadata associated with it. This metadata generally corresponds to the SraRunTable.txt that you can download from NCBI’s website. See an example of a SraRunTable.
from pysradb import SRAWeb
db = SRAWeb()
df = db.sra_metadata('SRP098789')
df.head()
study_accession | experiment_accession | experiment_title | run_accession | taxon_id | library_selection | library_layout | library_strategy | library_source | library_name | bases | spots | adapter_spec | avg_read_length |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SRP098789 | SRX2536403 | GSM2475997: 1.5 µM PF-067446846, 10 min, rep 1; Homo sapiens; OTHER | SRR5227288 | 9606 | other | SINGLE - | OTHER | TRANSCRIPTOMIC | | 2104142750 | 42082855 | | 50 |
SRP098789 | SRX2536404 | GSM2475998: 1.5 µM PF-067446846, 10 min, rep 2; Homo sapiens; OTHER | SRR5227289 | 9606 | other | SINGLE - | OTHER | TRANSCRIPTOMIC | | 2082873050 | 41657461 | | 50 |
SRP098789 | SRX2536405 | GSM2475999: 1.5 µM PF-067446846, 10 min, rep 3; Homo sapiens; OTHER | SRR5227290 | 9606 | other | SINGLE - | OTHER | TRANSCRIPTOMIC | | 2023148650 | 40462973 | | 50 |
SRP098789 | SRX2536406 | GSM2476000: 0.3 µM PF-067446846, 10 min, rep 1; Homo sapiens; OTHER | SRR5227291 | 9606 | other | SINGLE - | OTHER | TRANSCRIPTOMIC | | 2057165950 | 41143319 | | 50 |
SRP098789 | SRX2536407 | GSM2476001: 0.3 µM PF-067446846, 10 min, rep 2; Homo sapiens; OTHER | SRR5227292 | 9606 | other | SINGLE - | OTHER | TRANSCRIPTOMIC | | 3027621850 | 60552437 | | 50 |
The metadata is returned as a pandas DataFrame, so you can perform all the regular select/query operations that pandas provides.
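As a minimal sketch of such operations, the snippet below rebuilds three of the rows shown above by hand (so it runs without network access) and then selects columns, filters rows, and sorts. With a live connection, `df` would instead come from `db.sra_metadata('SRP098789')`; the integer dtypes used here are an assumption about the returned frame.

```python
import pandas as pd

# Hand-built stand-in for db.sra_metadata('SRP098789');
# values are copied from the table above, dtypes are assumed.
df = pd.DataFrame(
    {
        "experiment_accession": ["SRX2536403", "SRX2536404", "SRX2536407"],
        "run_accession": ["SRR5227288", "SRR5227289", "SRR5227292"],
        "spots": [42082855, 41657461, 60552437],
        "avg_read_length": [50, 50, 50],
    }
)

# Select a subset of columns.
runs = df[["run_accession", "spots"]]

# Filter rows with a query expression: runs with more than 42 million spots.
deep = df.query("spots > 42000000")

# Sort by sequencing depth and pick the deepest run.
largest = df.sort_values("spots", ascending=False).iloc[0]["run_accession"]

print(runs)
print(deep["run_accession"].tolist())
print(largest)
```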
Use Case 2: Downloading an entire project arranged experiment wise
Once you have fetched the metadata and confirmed that this is the project you were looking for, you will usually want to download everything at once. NCBI follows this hierarchy: SRP => SRX => SRR. Each SRP (project) has multiple SRX (experiments), and each SRX in turn has one or more SRR (runs). We want to mimic this hierarchy in our downloads. The reason is simple: in most cases you care about the SRX the most and will want to “merge” its SRRs in one way or another. Keeping this hierarchy lets your downstream code handle such cases easily, without having to work out which runs (SRR) need to be merged.
We strongly recommend installing aspera-client, which transfers data over UDP and is designed to be faster than TCP-based transfers such as FTP or HTTP.
from pysradb import SRAWeb
db = SRAWeb()
df = db.sra_metadata('SRP017942')
db.download(df)
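To make the SRP => SRX => SRR hierarchy concrete, here is a small sketch that groups runs by experiment and derives one nested path per run. `plan_layout` is a hypothetical helper written for this example, not part of pysradb; the `out_dir/SRP/SRX/SRR.sra` layout is an assumption for illustration, and the accessions are placeholders rather than real SRA records.

```python
from collections import defaultdict
from pathlib import PurePosixPath

import pandas as pd


def plan_layout(df, out_dir="pysradb_downloads"):
    """Map each experiment (SRX) to the nested paths of its runs (SRR)."""
    layout = defaultdict(list)
    for row in df.itertuples(index=False):
        path = (
            PurePosixPath(out_dir)
            / row.study_accession
            / row.experiment_accession
            / f"{row.run_accession}.sra"
        )
        layout[row.experiment_accession].append(str(path))
    return dict(layout)


# Placeholder accessions (hypothetical): one experiment with two runs, one with a single run.
df = pd.DataFrame(
    {
        "study_accession": ["SRP_A", "SRP_A", "SRP_A"],
        "experiment_accession": ["SRX_1", "SRX_1", "SRX_2"],
        "run_accession": ["SRR_1a", "SRR_1b", "SRR_2a"],
    }
)

layout = plan_layout(df)
for srx, paths in layout.items():
    print(srx, paths)
```

Because the runs are already grouped under their experiment, a downstream step can merge everything inside one SRX directory without consulting the metadata again.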
Use Case 3: Downloading a subset of experiments
Often, you need to process only a subset of the samples in a project (SRP). Consider this project, which has data spanning several assays.
df = db.sra_metadata('SRP000941')
print(df.library_strategy.unique())
['ChIP-Seq' 'Bisulfite-Seq' 'RNA-Seq' 'WGS' 'OTHER']
You might, however, be interested in analyzing only the RNA-seq samples and want to download just that subset. This is simple with pysradb, since the metadata can be subset just as you would subset any pandas DataFrame.
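# Keep only the rows whose library strategy is RNA-Seq
df_rna = df[df.library_strategy == 'RNA-Seq']
# Download just that subset into the given output directory
db.download(df=df_rna, out_dir='/pysradb_downloads')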
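Filters can also be combined or widened. The sketch below uses a small hand-built frame in place of the real `db.sra_metadata('SRP000941')` result to count runs per assay and keep the runs from two assays at once. The run IDs are placeholders; only the `library_strategy` values come from the output above.

```python
import pandas as pd

# Placeholder run IDs (hypothetical); library_strategy values mirror the output shown above.
df = pd.DataFrame(
    {
        "run_accession": ["run1", "run2", "run3", "run4", "run5"],
        "library_strategy": ["ChIP-Seq", "RNA-Seq", "RNA-Seq", "WGS", "Bisulfite-Seq"],
    }
)

# How many runs does each assay contribute?
counts = df["library_strategy"].value_counts()

# Keep runs from either of two assays with isin().
wanted = ["RNA-Seq", "ChIP-Seq"]
df_subset = df[df["library_strategy"].isin(wanted)]

print(counts)
print(df_subset["run_accession"].tolist())
```

A frame such as `df_subset` can then be handed to `db.download(df=...)` in the same way as `df_rna`.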
Use Case 4: Getting cell-type/treatment information from sample_attributes
Cell type and tissue information is usually buried in the sample_attribute column, which can be expanded into separate columns:
from pysradb.filter_attrs import expand_sample_attribute_columns
df = db.sra_metadata('SRP017942')
expand_sample_attribute_columns(df).head()
study_accession | experiment_accession | experiment_title | experiment_attribute | sample_attribute | run_accession | taxon_id | library_selection | library_layout | library_strategy | library_source | library_name | bases | spots | adapter_spec | avg_read_length | assay_type | cell_line | source_name | transfected_with | treatment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SRP017942 | SRX217028 | GSM1063575: 293T_GFP; Homo sapiens; RNA-Seq | GEO Accession: GSM1063575 | source_name: 293T cells \|\| cell line: 293T cells \|\| transfected with: 3XFLAG-GFP \|\| assay type: Riboseq | SRR648667 | 9606 | other | SINGLE - | RNA-Seq | TRANSCRIPTOMIC | | 1806641316 | 50184481 | | 36 | riboseq | 293t cells | 293t cells | 3xflag-gfp | NaN |
SRP017942 | SRX217029 | GSM1063576: 293T_GFP_2hrs_severe_Heat_Shock; Homo sapiens; RNA-Seq | GEO Accession: GSM1063576 | source_name: 293T cells \|\| cell line: 293T cells \|\| transfected with: 3XFLAG-GFP \|\| treatment: severe heat shock (44C 2 hours) \|\| assay type: Riboseq | SRR648668 | 9606 | other | SINGLE - | RNA-Seq | TRANSCRIPTOMIC | | 3436984836 | 95471801 | | 36 | riboseq | 293t cells | 293t cells | 3xflag-gfp | severe heat shock (44c 2 hours) |
SRP017942 | SRX217030 | GSM1063577: 293T_Hspa1a; Homo sapiens; RNA-Seq | GEO Accession: GSM1063577 | source_name: 293T cells \|\| cell line: 293T cells \|\| transfected with: 3XFLAG-Hspa1a \|\| assay type: Riboseq | SRR648669 | 9606 | other | SINGLE - | RNA-Seq | TRANSCRIPTOMIC | | 3330909216 | 92525256 | | 36 | riboseq | 293t cells | 293t cells | 3xflag-hspa1a | NaN |
SRP017942 | SRX217031 | GSM1063578: 293T_Hspa1a_2hrs_severe_Heat_Shock; Homo sapiens; RNA-Seq | GEO Accession: GSM1063578 | source_name: 293T cells \|\| cell line: 293T cells \|\| transfected with: 3XFLAG-Hspa1a \|\| treatment: severe heat shock (44C 2 hours) \|\| assay type: Riboseq | SRR648670 | 9606 | other | SINGLE - | RNA-Seq | TRANSCRIPTOMIC | | 3622123512 | 100614542 | | 36 | riboseq | 293t cells | 293t cells | 3xflag-hspa1a | severe heat shock (44c 2 hours) |
SRP017942 | SRX217956 | GSM794854: 3T3-Control-Riboseq; Mus musculus; RNA-Seq | GEO Accession: GSM794854 | source_name: 3T3 cells \|\| treatment: control \|\| cell line: 3T3 cells \|\| assay type: Riboseq | SRR649752 | 10090 | cDNA | SINGLE - | RNA-Seq | TRANSCRIPTOMIC | | 594945396 | 16526261 | | 36 | riboseq | 3t3 cells | 3t3 cells | NaN | control |
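To illustrate what this expansion involves, the sketch below parses a `key: value || key: value` string into separate columns with plain pandas. `parse_sample_attribute` is a hypothetical helper written for this example, not pysradb's actual implementation; the lowercasing of values and the underscore-joined column names are assumptions inferred from the output above.

```python
import pandas as pd


def parse_sample_attribute(attr):
    """Split 'key: value || key: value' into a dict with normalized keys and values."""
    parsed = {}
    for pair in attr.split("||"):
        key, sep, value = pair.partition(":")
        if not sep:
            continue  # skip fragments that are not of the form 'key: value'
        parsed[key.strip().lower().replace(" ", "_")] = value.strip().lower()
    return parsed


# Two sample_attribute strings copied from the table above.
df = pd.DataFrame(
    {
        "run_accession": ["SRR648667", "SRR648668"],
        "sample_attribute": [
            "source_name: 293T cells || cell line: 293T cells || "
            "transfected with: 3XFLAG-GFP || assay type: Riboseq",
            "source_name: 293T cells || cell line: 293T cells || "
            "transfected with: 3XFLAG-GFP || "
            "treatment: severe heat shock (44C 2 hours) || assay type: Riboseq",
        ],
    }
)

# One dict per row becomes one column per key; keys missing from a row become NaN.
expanded = pd.DataFrame(
    df["sample_attribute"].map(parse_sample_attribute).tolist(), index=df.index
)
result = pd.concat([df, expanded], axis=1)
print(result[["run_accession", "cell_line", "treatment"]])
```

Rows that lack a given key, such as `treatment` for the first run, end up with NaN in that column, which matches the pattern seen in the expanded table.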
Use Case 5: Searching for datasets
Another common operation on SRA is plain-text search.
To look up all projects where “ribosome profiling” appears somewhere in the description:
df = db.search_sra(search_str='"ribosome profiling"')
df.head()
study_accession | experiment_accession | experiment_title | run_accession | taxon_id | library_selection | library_layout | library_strategy | library_source | library_name | bases | spots |
---|---|---|---|---|---|---|---|---|---|---|---|
DRP003075 | DRX019536 | Illumina Genome Analyzer IIx sequencing of SAMD00018584 | DRR021383 | 83333 | other | SINGLE - | OTHER | TRANSCRIPTOMIC | GAII05_3 | 978776480 | 12234706 |
DRP003075 | DRX019537 | Illumina Genome Analyzer IIx sequencing of SAMD00018585 | DRR021384 | 83333 | other | SINGLE - | OTHER | TRANSCRIPTOMIC | GAII05_4 | 894201680 | 11177521 |
DRP003075 | DRX019538 | Illumina Genome Analyzer IIx sequencing of SAMD00018586 | DRR021385 | 83333 | other | SINGLE - | OTHER | TRANSCRIPTOMIC | GAII05_5 | 931536720 | 11644209 |
DRP003075 | DRX019540 | Illumina Genome Analyzer IIx sequencing of SAMD00018588 | DRR021387 | 83333 | other | SINGLE - | OTHER | TRANSCRIPTOMIC | GAII07_4 | 2759398700 | 27593987 |
DRP003075 | DRX019541 | Illumina Genome Analyzer IIx sequencing of SAMD00018589 | DRR021388 | 83333 | other | SINGLE - | OTHER | TRANSCRIPTOMIC | GAII07_5 | 2386196500 | 23861965 |
Again, the results are returned as a pandas DataFrame, so you can apply any subset operation to them after the query. Your query doesn’t need to be exact.
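Since each row of the search result describes a single run, a common follow-up is to collapse the hits to the project level. The sketch below does this on a hand-built frame holding three of the rows shown above; with a live connection, `df` would come from `db.search_sra(...)` instead, and the integer dtype of `bases` and `spots` is an assumption.

```python
import pandas as pd

# Hand-built stand-in for the search result; values copied from the table above.
df = pd.DataFrame(
    {
        "study_accession": ["DRP003075", "DRP003075", "DRP003075"],
        "run_accession": ["DRR021383", "DRR021384", "DRR021385"],
        "bases": [978776480, 894201680, 931536720],
        "spots": [12234706, 11177521, 11644209],
    }
)

# One row per project: number of distinct runs and total spots across them.
summary = df.groupby("study_accession").agg(
    n_runs=("run_accession", "nunique"),
    total_spots=("spots", "sum"),
)
print(summary)
```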