Downloading Subsets of a Project
This notebook shows how to filter and download specific samples from a larger SRA project.
[ ]:
# Install pysradb if not already installed
try:
    import pysradb
    print(f"pysradb {pysradb.__version__} is already installed")
except ImportError:
    print("Installing pysradb from GitHub...")
    import sys
    !{sys.executable} -m pip install -q git+https://github.com/saketkc/pysradb
    print("pysradb installed successfully!")
[1]:
pip install git+https://github.com/saketkc/pysradb.git
Collecting git+https://github.com/saketkc/pysradb.git
Cloning https://github.com/saketkc/pysradb.git to /tmp/pip-req-build-3dlg9hp3
Running command git clone -q https://github.com/saketkc/pysradb.git /tmp/pip-req-build-3dlg9hp3
Requirement already satisfied: pandas==0.25.3 in /usr/local/lib/python3.6/dist-packages (from pysradb==0.10.3.dev0) (0.25.3)
Collecting tqdm==4.41.1
Using cached https://files.pythonhosted.org/packages/72/c9/7fc20feac72e79032a7c8138fd0d395dc6d8812b5b9edf53c3afd0b31017/tqdm-4.41.1-py2.py3-none-any.whl
Collecting requests==2.22.0
Using cached https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl
Collecting xmltodict==0.12.0
Using cached https://files.pythonhosted.org/packages/28/fd/30d5c1d3ac29ce229f6bdc40bbc20b28f716e8b363140c26eff19122d8a5/xmltodict-0.12.0-py2.py3-none-any.whl
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas==0.25.3->pysradb==0.10.3.dev0) (2.6.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas==0.25.3->pysradb==0.10.3.dev0) (2018.9)
Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.6/dist-packages (from pandas==0.25.3->pysradb==0.10.3.dev0) (1.17.5)
Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests==2.22.0->pysradb==0.10.3.dev0) (2.8)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests==2.22.0->pysradb==0.10.3.dev0) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests==2.22.0->pysradb==0.10.3.dev0) (2019.11.28)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests==2.22.0->pysradb==0.10.3.dev0) (3.0.4)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas==0.25.3->pysradb==0.10.3.dev0) (1.12.0)
Building wheels for collected packages: pysradb
Building wheel for pysradb (setup.py) ... done
Created wheel for pysradb: filename=pysradb-0.10.3.dev0-cp36-none-any.whl size=147411 sha256=6ccd6874b7cde11cb10eae96cb14e86f9cdfe5f1b02b16a3c7eb20879afd6a62
Stored in directory: /tmp/pip-ephem-wheel-cache-z9xalsuu/wheels/d5/24/42/81dccabc3a4aac9757e23b7175ad7270090a4b3c203cd4fc8f
Successfully built pysradb
ERROR: google-colab 1.0.0 has requirement requests~=2.21.0, but you'll have requests 2.22.0 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
Installing collected packages: tqdm, requests, xmltodict, pysradb
Found existing installation: tqdm 4.28.1
Uninstalling tqdm-4.28.1:
Successfully uninstalled tqdm-4.28.1
Found existing installation: requests 2.21.0
Uninstalling requests-2.21.0:
Successfully uninstalled requests-2.21.0
Successfully installed pysradb-0.10.3.dev0 requests-2.22.0 tqdm-4.41.1 xmltodict-0.12.0
[2]:
!pysradb --version
pysradb 0.10.3-dev0
[ ]:
from pysradb.sraweb import SRAweb
db = SRAweb()
Example of a record missing “SAMPLE_ATTRIBUTES”
It also has an “auxiliary” contig file: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5146869
[4]:
df = db.sra_metadata("SRP096127", detailed=True)
df
[4]:
| | study_accession | experiment_accession | experiment_title | experiment_desc | organism_taxid | organism_name | library_strategy | library_source | library_selection | sample_accession | sample_title | instrument | total_spots | total_size | run_accession | run_total_spots | run_total_bases | run_alias | sra_url_alt | sra_url | experiment_alias | source_name | cell type | group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SRP096127 | SRX2467007 | GSM2448483: normal.ct-970; Homo sapiens; Bisul... | GSM2448483: normal.ct-970; Homo sapiens; Bisul... | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1899466 | N/A | Illumina HiSeq 2500 | 559547 | 50734487 | SRR5149059 | 559547 | 83216675 | GSM2448483_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2448483 | blood serum | blood serum | healthy control |
| 1 | SRP096127 | SRX2467006 | GSM2448482: normal.ct-969; Homo sapiens; Bisul... | GSM2448482: normal.ct-969; Homo sapiens; Bisul... | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1899465 | N/A | Illumina HiSeq 2500 | 441577 | 40899268 | SRR5149058 | 441577 | 65549383 | GSM2448482_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2448482 | blood serum | blood serum | healthy control |
| 2 | SRP096127 | SRX2467005 | GSM2448481: normal.ct-968; Homo sapiens; Bisul... | GSM2448481: normal.ct-968; Homo sapiens; Bisul... | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1899464 | N/A | Illumina HiSeq 2500 | 563378 | 50951134 | SRR5149057 | 563378 | 83839813 | GSM2448481_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2448481 | blood serum | blood serum | healthy control |
| 3 | SRP096127 | SRX2467004 | GSM2448480: normal.ct-967; Homo sapiens; Bisul... | GSM2448480: normal.ct-967; Homo sapiens; Bisul... | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1899463 | N/A | Illumina HiSeq 2500 | 422878 | 39223860 | SRR5149056 | 422878 | 62753430 | GSM2448480_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2448480 | blood serum | blood serum | healthy control |
| 4 | SRP096127 | SRX2467003 | GSM2448479: normal.ct-966; Homo sapiens; Bisul... | GSM2448479: normal.ct-966; Homo sapiens; Bisul... | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1899461 | N/A | Illumina HiSeq 2500 | 517254 | 46881651 | SRR5149055 | 517254 | 77004865 | GSM2448479_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2448479 | blood serum | blood serum | healthy control |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2186 | SRP096127 | SRX2464821 | GSM2446284: HCC.ct-5; Homo sapiens; Bisulfite-Seq | GSM2446284: HCC.ct-5; Homo sapiens; Bisulfite-Seq | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1897280 | N/A | Illumina HiSeq 2500 | 1033204 | 83576370 | SRR5146873 | 1033204 | 196635123 | GSM2446284_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2446284 | blood serum | blood serum | hepatocellular carcinoma patient |
| 2187 | SRP096127 | SRX2464820 | GSM2446283: HCC.ct-4; Homo sapiens; Bisulfite-Seq | GSM2446283: HCC.ct-4; Homo sapiens; Bisulfite-Seq | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1897279 | N/A | Illumina HiSeq 2500 | 840853 | 68410342 | SRR5146872 | 840853 | 159822416 | GSM2446283_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2446283 | blood serum | blood serum | hepatocellular carcinoma patient |
| 2188 | SRP096127 | SRX2464819 | GSM2446282: HCC.ct-3; Homo sapiens; Bisulfite-Seq | GSM2446282: HCC.ct-3; Homo sapiens; Bisulfite-Seq | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1897278 | N/A | Illumina HiSeq 2500 | 885724 | 71407675 | SRR5146871 | 885724 | 166270272 | GSM2446282_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2446282 | blood serum | blood serum | hepatocellular carcinoma patient |
| 2189 | SRP096127 | SRX2464818 | GSM2446281: HCC.ct-2; Homo sapiens; Bisulfite-Seq | GSM2446281: HCC.ct-2; Homo sapiens; Bisulfite-Seq | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1897277 | N/A | Illumina HiSeq 2500 | 775684 | 62094237 | SRR5146870 | 775684 | 145671062 | GSM2446281_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2446281 | blood serum | blood serum | hepatocellular carcinoma patient |
| 2190 | SRP096127 | SRX2464817 | GSM2446280: HCC.ct-1; Homo sapiens; Bisulfite-Seq | GSM2446280: HCC.ct-1; Homo sapiens; Bisulfite-Seq | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1897276 | N/A | Illumina HiSeq 2500 | 1124031 | 89769302 | SRR5146869 | 1124031 | 212986785 | GSM2446280_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2446280 | blood serum | blood serum | hepatocellular carcinoma patient |
2191 rows × 24 columns
[5]:
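# Keep only runs whose alternate URL points at a contig file; na=False treats missing URLs as non-matches
df_contig_subset = df.loc[df["sra_url_alt"].str.contains("contig", na=False)]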
df_contig_subset
[5]:
| | study_accession | experiment_accession | experiment_title | experiment_desc | organism_taxid | organism_name | library_strategy | library_source | library_selection | sample_accession | sample_title | instrument | total_spots | total_size | run_accession | run_total_spots | run_total_bases | run_alias | sra_url_alt | sra_url | experiment_alias | source_name | cell type | group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SRP096127 | SRX2467007 | GSM2448483: normal.ct-970; Homo sapiens; Bisul... | GSM2448483: normal.ct-970; Homo sapiens; Bisul... | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1899466 | N/A | Illumina HiSeq 2500 | 559547 | 50734487 | SRR5149059 | 559547 | 83216675 | GSM2448483_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2448483 | blood serum | blood serum | healthy control |
| 1 | SRP096127 | SRX2467006 | GSM2448482: normal.ct-969; Homo sapiens; Bisul... | GSM2448482: normal.ct-969; Homo sapiens; Bisul... | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1899465 | N/A | Illumina HiSeq 2500 | 441577 | 40899268 | SRR5149058 | 441577 | 65549383 | GSM2448482_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2448482 | blood serum | blood serum | healthy control |
| 2 | SRP096127 | SRX2467005 | GSM2448481: normal.ct-968; Homo sapiens; Bisul... | GSM2448481: normal.ct-968; Homo sapiens; Bisul... | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1899464 | N/A | Illumina HiSeq 2500 | 563378 | 50951134 | SRR5149057 | 563378 | 83839813 | GSM2448481_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2448481 | blood serum | blood serum | healthy control |
| 3 | SRP096127 | SRX2467004 | GSM2448480: normal.ct-967; Homo sapiens; Bisul... | GSM2448480: normal.ct-967; Homo sapiens; Bisul... | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1899463 | N/A | Illumina HiSeq 2500 | 422878 | 39223860 | SRR5149056 | 422878 | 62753430 | GSM2448480_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2448480 | blood serum | blood serum | healthy control |
| 4 | SRP096127 | SRX2467003 | GSM2448479: normal.ct-966; Homo sapiens; Bisul... | GSM2448479: normal.ct-966; Homo sapiens; Bisul... | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1899461 | N/A | Illumina HiSeq 2500 | 517254 | 46881651 | SRR5149055 | 517254 | 77004865 | GSM2448479_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2448479 | blood serum | blood serum | healthy control |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2186 | SRP096127 | SRX2464821 | GSM2446284: HCC.ct-5; Homo sapiens; Bisulfite-Seq | GSM2446284: HCC.ct-5; Homo sapiens; Bisulfite-Seq | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1897280 | N/A | Illumina HiSeq 2500 | 1033204 | 83576370 | SRR5146873 | 1033204 | 196635123 | GSM2446284_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2446284 | blood serum | blood serum | hepatocellular carcinoma patient |
| 2187 | SRP096127 | SRX2464820 | GSM2446283: HCC.ct-4; Homo sapiens; Bisulfite-Seq | GSM2446283: HCC.ct-4; Homo sapiens; Bisulfite-Seq | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1897279 | N/A | Illumina HiSeq 2500 | 840853 | 68410342 | SRR5146872 | 840853 | 159822416 | GSM2446283_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2446283 | blood serum | blood serum | hepatocellular carcinoma patient |
| 2188 | SRP096127 | SRX2464819 | GSM2446282: HCC.ct-3; Homo sapiens; Bisulfite-Seq | GSM2446282: HCC.ct-3; Homo sapiens; Bisulfite-Seq | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1897278 | N/A | Illumina HiSeq 2500 | 885724 | 71407675 | SRR5146871 | 885724 | 166270272 | GSM2446282_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2446282 | blood serum | blood serum | hepatocellular carcinoma patient |
| 2189 | SRP096127 | SRX2464818 | GSM2446281: HCC.ct-2; Homo sapiens; Bisulfite-Seq | GSM2446281: HCC.ct-2; Homo sapiens; Bisulfite-Seq | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1897277 | N/A | Illumina HiSeq 2500 | 775684 | 62094237 | SRR5146870 | 775684 | 145671062 | GSM2446281_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2446281 | blood serum | blood serum | hepatocellular carcinoma patient |
| 2190 | SRP096127 | SRX2464817 | GSM2446280: HCC.ct-1; Homo sapiens; Bisulfite-Seq | GSM2446280: HCC.ct-1; Homo sapiens; Bisulfite-Seq | 9606 | Homo sapiens | Bisulfite-Seq | GENOMIC | RANDOM | SRS1897276 | N/A | Illumina HiSeq 2500 | 1124031 | 89769302 | SRR5146869 | 1124031 | 212986785 | GSM2446280_r1 | https://sra-download.ncbi.nlm.nih.gov/traces/s... | https://sra-download.st-va.ncbi.nlm.nih.gov/so... | GSM2446280 | blood serum | blood serum | hepatocellular carcinoma patient |
1654 rows × 24 columns
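The same boolean-mask pattern can be combined with other metadata columns, for example to restrict the download to one patient group. The following is a minimal sketch on a toy frame: the accessions and URLs are hypothetical placeholders, and only the column names (`run_accession`, `sra_url_alt`, `group`) are taken from the table above.

```python
import pandas as pd

# Hypothetical toy metadata; the real frame comes from db.sra_metadata(...)
toy = pd.DataFrame(
    {
        "run_accession": ["SRR0000001", "SRR0000002", "SRR0000003", "SRR0000004"],
        "sra_url_alt": [
            "https://example.org/contigs/SRR0000001.contig",
            None,  # a run with no alternate URL
            "https://example.org/runs/SRR0000003",
            "https://example.org/contigs/SRR0000004.contig",
        ],
        "group": [
            "healthy control",
            "healthy control",
            "hepatocellular carcinoma patient",
            "hepatocellular carcinoma patient",
        ],
    }
)

# Mask 1: alternate URL mentions "contig"; missing URLs count as non-matches
has_contig = toy["sra_url_alt"].str.contains("contig", na=False)
# Mask 2: sample belongs to the group of interest
is_hcc = toy["group"] == "hepatocellular carcinoma patient"

# Combine both masks with & to select the subset
subset = toy.loc[has_contig & is_hcc]
selected_runs = subset["run_accession"].tolist()
print(selected_runs)
```

A frame filtered this way can be passed to `db.download` in the same manner as `df_contig_subset`.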
[ ]:
# Download only the filtered runs, taking the URLs from the "sra_url_alt" column
db.download(df=df_contig_subset, url_col="sra_url_alt")
Example with a fastq file (submitted through ENA)
https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR1520686
[ ]:
df = db.sra_metadata("ERP015299", detailed=True)
df
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-11-1792c85e0c4f> in <module>()
----> 1 df = db.sra_metadata('ERP015299', detailed=True)
2 df
/usr/local/lib/python3.6/dist-packages/pysradb/sraweb.py in sra_metadata(self, srp, sample_attribute, detailed, expand_sample_attributes, output_read_lengths, **kwargs)
396 # detailed_record["run_total_spots"] = run_set["@total_spots"]
397 for sample_attribute in sample_attributes:
--> 398 dict_values = list(sample_attribute.values())
399 detailed_record[dict_values[0]] = dict_values[1]
400 detailed_records.append(detailed_record)
AttributeError: 'str' object has no attribute 'values'
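The traceback shows the loop receiving a string where it expects a dict. One plausible cause, not confirmed for this record, is that the parsed XML holds a single sample attribute as one dict rather than a list of dicts, so iterating over it yields string keys. The sketch below uses toy data and hypothetical helper names (`as_list`, `collect_attributes`) to show how normalizing the value to a list avoids that failure; it is an illustration, not pysradb's actual code or fix.

```python
def as_list(value):
    """Wrap value in a list unless it already is one; None becomes an empty list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def collect_attributes(sample_attributes):
    """Build a {tag: value} record from one attribute dict or a list of them."""
    record = {}
    for attribute in as_list(sample_attributes):
        values = list(attribute.values())
        if len(values) >= 2:  # skip attributes that lack a value
            record[values[0]] = values[1]
    return record


# Hypothetical parsed attributes: several entries arrive as a list of dicts ...
many = [
    {"TAG": "source_name", "VALUE": "blood serum"},
    {"TAG": "group", "VALUE": "healthy control"},
]
# ... while a lone entry may arrive as a bare dict
single = {"TAG": "strain", "VALUE": "K-12"}

# Reproduce the failure mode: iterating a dict yields its keys, which are strings
try:
    for attribute in single:
        attribute.values()
except AttributeError as err:
    failure = str(err)

print(failure)
print(collect_attributes(many))
print(collect_attributes(single))
```

With the normalization in place, both shapes produce a flat record, and a missing attribute block yields an empty one.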