Selecting Subsets of a Project

This notebook shows how to select specific records from a larger SRA or ENA project before downloading data. The live metadata and download commands are shown in full but are not executed during documentation builds, because they contact external services, can take a long time, and may download large files.

[1]:
from pysradb.sraweb import SRAweb

db = SRAweb()
db
[1]:
<pysradb.sraweb.SRAweb at 0x7f2dc83b3ad0>

1. Fetch Detailed Metadata

Start with detailed project metadata. For example, SRP096127 contains the run SRR5146869, which has an auxiliary contig file listed in NCBI SRA.

df = db.sra_metadata("SRP096127", detailed=True)
df.head()

NCBI record: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5146869

2. Select Rows by URL Content

Once the metadata table is available, identify the URL columns and keep only the rows in which at least one download URL contains the string "contig". The match is case-insensitive.

download_url_cols = [col for col in df.columns if "url" in col.lower()]
contig_mask = df[download_url_cols].apply(
    lambda row: row.astype(str).str.contains("contig", case=False, na=False).any(),
    axis=1,
)
df_contig_subset = df.loc[contig_mask]
df_contig_subset.head()
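
The filter can be tried offline on a small stand-in table before running it against live metadata. The sketch below is only an illustration: the accessions, column names, and URLs in `toy_df` are invented, and it uses a column-wise variant of the mask (the vectorized `.str.contains` per column, then `any(axis=1)`) that should select the same rows as the row-wise `apply` above.

```python
import pandas as pd

# Toy stand-in for the detailed metadata table. The accessions, column
# names, and URLs are invented for illustration; they are not SRA records.
toy_df = pd.DataFrame(
    {
        "run_accession": ["RUN_A", "RUN_B", "RUN_C"],
        "ncbi_url": [
            "https://example.org/RUN_A/RUN_A.sra",
            "https://example.org/RUN_B/RUN_B.sra",
            None,
        ],
        "aux_url": [
            None,
            "https://example.org/RUN_B/RUN_B.Contigs.fasta",
            None,
        ],
    }
)

# Same column discovery rule as in the notebook: any column name with "url".
url_cols = [col for col in toy_df.columns if "url" in col.lower()]

# Test each URL column with the vectorized string accessor, then keep the
# rows in which at least one column matched.
toy_mask = (
    toy_df[url_cols]
    .apply(lambda col: col.astype(str).str.contains("contig", case=False, na=False))
    .any(axis=1)
)
toy_subset = toy_df.loc[toy_mask]
print(toy_subset["run_accession"].tolist())
```

Only the row whose auxiliary URL mentions contigs is kept; rows with missing URLs are treated as non-matches rather than raising an error.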

3. Preview Before Downloading

Inspect the accessions and the first URL column before starting a transfer, so you can confirm that the subset contains the rows you expect.

preview_cols = ["run_accession", "study_accession", "experiment_accession"]
preview_cols = [col for col in preview_cols if col in df_contig_subset.columns]
preview_cols += download_url_cols[:1]
df_contig_subset[preview_cols].head()

4. Download the Selected Rows

Pass only the selected rows to db.download, and set url_col to the URL column that contains the files you want. Setting skip_confirmation=False leaves the confirmation step enabled.

db.download(
    df=df_contig_subset,
    url_col=download_url_cols[0],
    out_dir="pysradb_downloads",
    skip_confirmation=False,
)
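
Indexing `download_url_cols[0]` depends on column order and fails when no URL column exists. One possible way to make the choice explicit is a small helper like the one below. This is a sketch, not part of pysradb: `pick_url_column` and the column names in `example_columns` are hypothetical.

```python
def pick_url_column(columns, preferred=()):
    """Return the first preferred URL column that exists, else the first URL-like column.

    Hypothetical helper for illustration; it is not part of pysradb.
    """
    url_like = [col for col in columns if "url" in col.lower()]
    if not url_like:
        raise ValueError("no URL column found in the metadata table")
    for name in preferred:
        if name in url_like:
            return name
    return url_like[0]


# Hypothetical column names, used only to exercise the helper.
example_columns = ["run_accession", "mirror_a_url", "mirror_b_url"]
chosen = pick_url_column(example_columns, preferred=("mirror_b_url",))
print(chosen)
```

With a real table, the result of `pick_url_column(df_contig_subset.columns, preferred=...)` could be passed as `url_col`, with the preferred names taken from the columns actually present in the metadata.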

5. ENA FASTQ Example

The same pattern works for ENA-backed FASTQ links. For example, ERP015299 includes the run ERR1520686.

fastq_df = db.sra_metadata("ERP015299", detailed=True)
# Columns whose names mention both "fastq" and "url"
fastq_url_cols = [
    col for col in fastq_df.columns if "fastq" in col.lower() and "url" in col.lower()
]
fastq_df[["run_accession", "study_accession", *fastq_url_cols[:1]]].head()

NCBI record: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR1520686
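
Before downloading FASTQ files, it can help to drop rows that have no FASTQ link at all. The sketch below shows one way to do that on an invented table; `toy_fastq_df` and the column name `toy_fastq_url` are assumptions for illustration, and the real column names depend on what the metadata query returns.

```python
import pandas as pd

# Invented rows for illustration; the second run has no FASTQ link.
toy_fastq_df = pd.DataFrame(
    {
        "run_accession": ["RUN_X", "RUN_Y"],
        "study_accession": ["STUDY_1", "STUDY_1"],
        "toy_fastq_url": ["https://example.org/RUN_X.fastq.gz", None],
    }
)

# Same discovery rule as above: names mentioning both "fastq" and "url".
toy_fastq_cols = [
    col
    for col in toy_fastq_df.columns
    if "fastq" in col.lower() and "url" in col.lower()
]

# Keep rows with at least one non-missing FASTQ URL.
has_fastq = toy_fastq_df[toy_fastq_cols].notna().any(axis=1)
toy_fastq_subset = toy_fastq_df.loc[has_fastq]
print(toy_fastq_subset["run_accession"].tolist())
```

The filtered table can then be previewed and passed on in the same way as the contig subset in the earlier steps.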