Downloading Subsets of a Project

This notebook shows how to filter and download specific samples from a larger SRA project.

[ ]:
# Install pysradb if not already installed
try:
    import pysradb

    print(f"pysradb {pysradb.__version__} is already installed")
except ImportError:
    print("Installing pysradb from GitHub...")
    import sys

    !{sys.executable} -m pip install -q git+https://github.com/saketkc/pysradb
    print("pysradb installed successfully!")
[1]:
pip install git+https://github.com/saketkc/pysradb.git
Collecting git+https://github.com/saketkc/pysradb.git
  Cloning https://github.com/saketkc/pysradb.git to /tmp/pip-req-build-3dlg9hp3
  Running command git clone -q https://github.com/saketkc/pysradb.git /tmp/pip-req-build-3dlg9hp3
Requirement already satisfied: pandas==0.25.3 in /usr/local/lib/python3.6/dist-packages (from pysradb==0.10.3.dev0) (0.25.3)
Collecting tqdm==4.41.1
  Using cached https://files.pythonhosted.org/packages/72/c9/7fc20feac72e79032a7c8138fd0d395dc6d8812b5b9edf53c3afd0b31017/tqdm-4.41.1-py2.py3-none-any.whl
Collecting requests==2.22.0
  Using cached https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl
Collecting xmltodict==0.12.0
  Using cached https://files.pythonhosted.org/packages/28/fd/30d5c1d3ac29ce229f6bdc40bbc20b28f716e8b363140c26eff19122d8a5/xmltodict-0.12.0-py2.py3-none-any.whl
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas==0.25.3->pysradb==0.10.3.dev0) (2.6.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas==0.25.3->pysradb==0.10.3.dev0) (2018.9)
Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.6/dist-packages (from pandas==0.25.3->pysradb==0.10.3.dev0) (1.17.5)
Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests==2.22.0->pysradb==0.10.3.dev0) (2.8)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests==2.22.0->pysradb==0.10.3.dev0) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests==2.22.0->pysradb==0.10.3.dev0) (2019.11.28)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests==2.22.0->pysradb==0.10.3.dev0) (3.0.4)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas==0.25.3->pysradb==0.10.3.dev0) (1.12.0)
Building wheels for collected packages: pysradb
  Building wheel for pysradb (setup.py) ... done
  Created wheel for pysradb: filename=pysradb-0.10.3.dev0-cp36-none-any.whl size=147411 sha256=6ccd6874b7cde11cb10eae96cb14e86f9cdfe5f1b02b16a3c7eb20879afd6a62
  Stored in directory: /tmp/pip-ephem-wheel-cache-z9xalsuu/wheels/d5/24/42/81dccabc3a4aac9757e23b7175ad7270090a4b3c203cd4fc8f
Successfully built pysradb
ERROR: google-colab 1.0.0 has requirement requests~=2.21.0, but you'll have requests 2.22.0 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
Installing collected packages: tqdm, requests, xmltodict, pysradb
  Found existing installation: tqdm 4.28.1
    Uninstalling tqdm-4.28.1:
      Successfully uninstalled tqdm-4.28.1
  Found existing installation: requests 2.21.0
    Uninstalling requests-2.21.0:
      Successfully uninstalled requests-2.21.0
Successfully installed pysradb-0.10.3.dev0 requests-2.22.0 tqdm-4.41.1 xmltodict-0.12.0

[2]:
!pysradb --version
pysradb 0.10.3-dev0
[ ]:
from pysradb.sraweb import SRAweb

db = SRAweb()

Example of a record missing “SAMPLE_ATTRIBUTES”

It also has an “auxiliary” contig file: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5146869

[4]:
df = db.sra_metadata("SRP096127", detailed=True)
df
[4]:
study_accession experiment_accession experiment_title experiment_desc organism_taxid organism_name library_strategy library_source library_selection sample_accession sample_title instrument total_spots total_size run_accession run_total_spots run_total_bases run_alias sra_url_alt sra_url experiment_alias source_name cell type group
0 SRP096127 SRX2467007 GSM2448483: normal.ct-970; Homo sapiens; Bisul... GSM2448483: normal.ct-970; Homo sapiens; Bisul... 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1899466 N/A Illumina HiSeq 2500 559547 50734487 SRR5149059 559547 83216675 GSM2448483_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2448483 blood serum blood serum healthy control
1 SRP096127 SRX2467006 GSM2448482: normal.ct-969; Homo sapiens; Bisul... GSM2448482: normal.ct-969; Homo sapiens; Bisul... 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1899465 N/A Illumina HiSeq 2500 441577 40899268 SRR5149058 441577 65549383 GSM2448482_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2448482 blood serum blood serum healthy control
2 SRP096127 SRX2467005 GSM2448481: normal.ct-968; Homo sapiens; Bisul... GSM2448481: normal.ct-968; Homo sapiens; Bisul... 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1899464 N/A Illumina HiSeq 2500 563378 50951134 SRR5149057 563378 83839813 GSM2448481_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2448481 blood serum blood serum healthy control
3 SRP096127 SRX2467004 GSM2448480: normal.ct-967; Homo sapiens; Bisul... GSM2448480: normal.ct-967; Homo sapiens; Bisul... 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1899463 N/A Illumina HiSeq 2500 422878 39223860 SRR5149056 422878 62753430 GSM2448480_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2448480 blood serum blood serum healthy control
4 SRP096127 SRX2467003 GSM2448479: normal.ct-966; Homo sapiens; Bisul... GSM2448479: normal.ct-966; Homo sapiens; Bisul... 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1899461 N/A Illumina HiSeq 2500 517254 46881651 SRR5149055 517254 77004865 GSM2448479_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2448479 blood serum blood serum healthy control
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2186 SRP096127 SRX2464821 GSM2446284: HCC.ct-5; Homo sapiens; Bisulfite-Seq GSM2446284: HCC.ct-5; Homo sapiens; Bisulfite-Seq 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1897280 N/A Illumina HiSeq 2500 1033204 83576370 SRR5146873 1033204 196635123 GSM2446284_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2446284 blood serum blood serum hepatocellular carcinoma patient
2187 SRP096127 SRX2464820 GSM2446283: HCC.ct-4; Homo sapiens; Bisulfite-Seq GSM2446283: HCC.ct-4; Homo sapiens; Bisulfite-Seq 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1897279 N/A Illumina HiSeq 2500 840853 68410342 SRR5146872 840853 159822416 GSM2446283_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2446283 blood serum blood serum hepatocellular carcinoma patient
2188 SRP096127 SRX2464819 GSM2446282: HCC.ct-3; Homo sapiens; Bisulfite-Seq GSM2446282: HCC.ct-3; Homo sapiens; Bisulfite-Seq 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1897278 N/A Illumina HiSeq 2500 885724 71407675 SRR5146871 885724 166270272 GSM2446282_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2446282 blood serum blood serum hepatocellular carcinoma patient
2189 SRP096127 SRX2464818 GSM2446281: HCC.ct-2; Homo sapiens; Bisulfite-Seq GSM2446281: HCC.ct-2; Homo sapiens; Bisulfite-Seq 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1897277 N/A Illumina HiSeq 2500 775684 62094237 SRR5146870 775684 145671062 GSM2446281_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2446281 blood serum blood serum hepatocellular carcinoma patient
2190 SRP096127 SRX2464817 GSM2446280: HCC.ct-1; Homo sapiens; Bisulfite-Seq GSM2446280: HCC.ct-1; Homo sapiens; Bisulfite-Seq 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1897276 N/A Illumina HiSeq 2500 1124031 89769302 SRR5146869 1124031 212986785 GSM2446280_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2446280 blood serum blood serum hepatocellular carcinoma patient

2191 rows × 24 columns

[5]:
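# Keep only the runs whose alternative URL points to a contig file.
# na=False counts a missing URL as a non-match instead of producing NaN in the mask.
df_contig_subset = df.loc[df["sra_url_alt"].str.contains("contig", na=False)]
df_contig_subset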
[5]:
study_accession experiment_accession experiment_title experiment_desc organism_taxid organism_name library_strategy library_source library_selection sample_accession sample_title instrument total_spots total_size run_accession run_total_spots run_total_bases run_alias sra_url_alt sra_url experiment_alias source_name cell type group
0 SRP096127 SRX2467007 GSM2448483: normal.ct-970; Homo sapiens; Bisul... GSM2448483: normal.ct-970; Homo sapiens; Bisul... 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1899466 N/A Illumina HiSeq 2500 559547 50734487 SRR5149059 559547 83216675 GSM2448483_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2448483 blood serum blood serum healthy control
1 SRP096127 SRX2467006 GSM2448482: normal.ct-969; Homo sapiens; Bisul... GSM2448482: normal.ct-969; Homo sapiens; Bisul... 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1899465 N/A Illumina HiSeq 2500 441577 40899268 SRR5149058 441577 65549383 GSM2448482_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2448482 blood serum blood serum healthy control
2 SRP096127 SRX2467005 GSM2448481: normal.ct-968; Homo sapiens; Bisul... GSM2448481: normal.ct-968; Homo sapiens; Bisul... 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1899464 N/A Illumina HiSeq 2500 563378 50951134 SRR5149057 563378 83839813 GSM2448481_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2448481 blood serum blood serum healthy control
3 SRP096127 SRX2467004 GSM2448480: normal.ct-967; Homo sapiens; Bisul... GSM2448480: normal.ct-967; Homo sapiens; Bisul... 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1899463 N/A Illumina HiSeq 2500 422878 39223860 SRR5149056 422878 62753430 GSM2448480_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2448480 blood serum blood serum healthy control
4 SRP096127 SRX2467003 GSM2448479: normal.ct-966; Homo sapiens; Bisul... GSM2448479: normal.ct-966; Homo sapiens; Bisul... 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1899461 N/A Illumina HiSeq 2500 517254 46881651 SRR5149055 517254 77004865 GSM2448479_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2448479 blood serum blood serum healthy control
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2186 SRP096127 SRX2464821 GSM2446284: HCC.ct-5; Homo sapiens; Bisulfite-Seq GSM2446284: HCC.ct-5; Homo sapiens; Bisulfite-Seq 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1897280 N/A Illumina HiSeq 2500 1033204 83576370 SRR5146873 1033204 196635123 GSM2446284_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2446284 blood serum blood serum hepatocellular carcinoma patient
2187 SRP096127 SRX2464820 GSM2446283: HCC.ct-4; Homo sapiens; Bisulfite-Seq GSM2446283: HCC.ct-4; Homo sapiens; Bisulfite-Seq 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1897279 N/A Illumina HiSeq 2500 840853 68410342 SRR5146872 840853 159822416 GSM2446283_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2446283 blood serum blood serum hepatocellular carcinoma patient
2188 SRP096127 SRX2464819 GSM2446282: HCC.ct-3; Homo sapiens; Bisulfite-Seq GSM2446282: HCC.ct-3; Homo sapiens; Bisulfite-Seq 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1897278 N/A Illumina HiSeq 2500 885724 71407675 SRR5146871 885724 166270272 GSM2446282_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2446282 blood serum blood serum hepatocellular carcinoma patient
2189 SRP096127 SRX2464818 GSM2446281: HCC.ct-2; Homo sapiens; Bisulfite-Seq GSM2446281: HCC.ct-2; Homo sapiens; Bisulfite-Seq 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1897277 N/A Illumina HiSeq 2500 775684 62094237 SRR5146870 775684 145671062 GSM2446281_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2446281 blood serum blood serum hepatocellular carcinoma patient
2190 SRP096127 SRX2464817 GSM2446280: HCC.ct-1; Homo sapiens; Bisulfite-Seq GSM2446280: HCC.ct-1; Homo sapiens; Bisulfite-Seq 9606 Homo sapiens Bisulfite-Seq GENOMIC RANDOM SRS1897276 N/A Illumina HiSeq 2500 1124031 89769302 SRR5146869 1124031 212986785 GSM2446280_r1 https://sra-download.ncbi.nlm.nih.gov/traces/s... https://sra-download.st-va.ncbi.nlm.nih.gov/so... GSM2446280 blood serum blood serum hepatocellular carcinoma patient

1654 rows × 24 columns
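
The same boolean-mask pattern can be combined with other metadata columns, for instance to keep only one sample group. The sketch below uses a small made-up table rather than the real project metadata: the accessions, URLs, and group labels are placeholders, and only the column names `sra_url_alt`, `run_accession`, and `group` are taken from the output above.

```python
import pandas as pd

# Toy metadata table; accessions and URLs are invented for illustration.
toy = pd.DataFrame(
    {
        "run_accession": ["RUN_A", "RUN_B", "RUN_C", "RUN_D"],
        "sra_url_alt": [
            "https://example.org/RUN_A.contigs.sra",
            "https://example.org/RUN_B.sra",
            None,  # some runs may have no alternative URL at all
            "https://example.org/RUN_D.contigs.sra",
        ],
        "group": ["healthy control", "healthy control", "patient", "patient"],
    }
)

# na=False treats missing URLs as non-matches, so the mask stays purely boolean.
has_contig = toy["sra_url_alt"].str.contains("contig", na=False)
is_patient = toy["group"] == "patient"

# Masks combine with & (and), | (or), and ~ (not).
contig_runs = toy.loc[has_contig]
patient_contig_runs = toy.loc[has_contig & is_patient]

print(contig_runs["run_accession"].tolist())
print(patient_contig_runs["run_accession"].tolist())
```

Any DataFrame filtered this way keeps all of its original columns, so it can be passed on to a download step that reads URLs from one of them.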

[ ]:
# Download the filtered subset, reading the file locations from the "sra_url_alt" column
db.download(df=df_contig_subset, url_col="sra_url_alt")

Example with a fastq file (submitted through ENA)

https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR1520686

[ ]:
df = db.sra_metadata("ERP015299", detailed=True)
df
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-1792c85e0c4f> in <module>()
----> 1 df = db.sra_metadata('ERP015299', detailed=True)
      2 df

/usr/local/lib/python3.6/dist-packages/pysradb/sraweb.py in sra_metadata(self, srp, sample_attribute, detailed, expand_sample_attributes, output_read_lengths, **kwargs)
    396                 # detailed_record["run_total_spots"] = run_set["@total_spots"]
    397                 for sample_attribute in sample_attributes:
--> 398                     dict_values = list(sample_attribute.values())
    399                     detailed_record[dict_values[0]] = dict_values[1]
    400                     detailed_records.append(detailed_record)

AttributeError: 'str' object has no attribute 'values'
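
The traceback shows that the loop received a string where it expected a dict. One plausible cause, which is not confirmed by anything in this notebook, is that the sample attributes arrived as a single dict instead of a list of dicts: iterating over a dict yields its keys, which are strings. The sketch below shows a defensive way to normalize such input using plain Python data. The helper names `as_attribute_list` and `attributes_to_record` are hypothetical, the `TAG`/`VALUE` keys are an assumption about the record layout, and this is not pysradb's actual fix.

```python
def as_attribute_list(value):
    """Return sample attributes as a list of dicts, whatever shape they arrived in."""
    if value is None or isinstance(value, str):
        # Missing or free-text attributes carry no TAG/VALUE pairs.
        return []
    if isinstance(value, dict):
        # A single attribute: wrap it so callers can always iterate over dicts.
        return [value]
    # Already a sequence: keep only the well-formed (dict) entries.
    return [item for item in value if isinstance(item, dict)]


def attributes_to_record(value):
    """Flatten TAG/VALUE pairs into one dict, skipping entries without a TAG."""
    record = {}
    for attribute in as_attribute_list(value):
        tag = attribute.get("TAG")  # "TAG"/"VALUE" key names are assumed here
        if tag is not None:
            record[tag] = attribute.get("VALUE")
    return record


# Hypothetical inputs covering the shapes discussed above.
many = [
    {"TAG": "source_name", "VALUE": "blood serum"},
    {"TAG": "group", "VALUE": "healthy control"},
]
single = {"TAG": "group", "VALUE": "healthy control"}

print(attributes_to_record(many))
print(attributes_to_record(single))
print(attributes_to_record(None))
```

Normalizing the input once, before the loop, keeps the record-building code identical for projects with many attributes, one attribute, or none.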