Extract Identifiers from PMC/DOI¶
This notebook shows how to extract SRA identifiers from PubMed Central articles and DOI references.
[ ]:
# Install pysradb if not already installed
try:
    import pysradb
    print(f"pysradb {pysradb.__version__} is already installed")
except ImportError:
    print("Installing pysradb from GitHub...")
    import sys
    !{sys.executable} -m pip install -q git+https://github.com/saketkc/pysradb
    print("pysradb installed successfully!")
Extracting Database Identifiers from Literature (PMC/DOI)¶
This notebook demonstrates how to extract database identifiers (GSE, PRJNA, SRP, SRR, SRX, SRS) from:
PubMed Central (PMC) articles
PubMed IDs (PMIDs)
Digital Object Identifiers (DOIs)
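Each of these input types is handled by a different method in the examples that follow. As a minimal sketch, the helper below guesses which method name fits a given string based on its shape. `pick_method` and its patterns are hypothetical and not part of pysradb; the DOI pattern in particular is a simplification.

```python
import re

# Hypothetical helper (not part of pysradb): map an identifier's shape to the
# name of the SRAweb method used for that input type in this notebook.
_PATTERNS = [
    (re.compile(r"^PMC\d+$", re.IGNORECASE), "pmc_to_identifiers"),
    (re.compile(r"^10\.\d{4,9}/\S+$"), "doi_to_identifiers"),  # simplified DOI shape
    (re.compile(r"^\d+$"), "pmid_to_identifiers"),
]


def pick_method(identifier: str) -> str:
    """Return the method name matching the identifier, or raise ValueError."""
    cleaned = identifier.strip()
    for pattern, method_name in _PATTERNS:
        if pattern.match(cleaned):
            return method_name
    raise ValueError(f"Unrecognised literature identifier: {identifier!r}")


for example in ["39528918", "PMC10802650", "10.12688/f1000research.18676.1"]:
    print(example, "->", pick_method(example))
```

With an `SRAweb` instance named `db`, the result could be used as `getattr(db, pick_method(x))(x)`.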
Setup¶
First, let’s import pysradb and create a connection:
[1]:
from pysradb.sraweb import SRAweb
import pandas as pd
db = SRAweb()
Example 1: Extract Identifiers from PMID¶
Let’s extract all database identifiers from a PubMed article using its PMID:
[2]:
# Extract all identifiers from PMID 39528918
pmid = "39528918"
df = db.pmid_to_identifiers(pmid)
print(f"Identifiers found for PMID {pmid}:")
df
Identifiers found for PMID 39528918:
[2]:
| | pmid | pmc_id | gse_ids | prjna_ids | srp_ids | srr_ids | srx_ids | srs_ids |
|---|---|---|---|---|---|---|---|---|
| 0 | 39528918 | PMC11552371 | GSE253406 | None | SRP484103 | None | None | None |
Example 2: Get Only GSE IDs from PMID¶
Sometimes you only need specific identifier types:
[3]:
# Get only GSE identifiers
gse_df = db.pmid_to_gse("39528918")
print("GSE identifiers:")
gse_df
GSE identifiers:
[3]:
| | pmid | pmc_id | gse_ids |
|---|---|---|---|
| 0 | 39528918 | PMC11552371 | GSE253406 |
Example 3: Get Only SRP IDs from PMID¶
Important: even if the paper mentions only GSE IDs, pysradb automatically converts them to the corresponding SRP IDs.
[4]:
# Get only SRP identifiers
# This works even if the paper only mentions GSE253406!
srp_df = db.pmid_to_srp("39528918")
print("SRP identifiers (auto-converted from GSE if needed):")
srp_df
SRP identifiers (auto-converted from GSE if needed):
[4]:
| | pmid | pmc_id | srp_ids |
|---|---|---|---|
| 0 | 39528918 | PMC11552371 | SRP484103 |
Example 4: Process Multiple PMIDs¶
You can batch process multiple PMIDs at once:
[5]:
# Process multiple PMIDs
pmids = ["39528918", "27373336"]
df_multiple = db.pmid_to_identifiers(pmids)
print(f"Identifiers from {len(pmids)} PMIDs:")
df_multiple
Identifiers from 2 PMIDs:
[5]:
| | pmid | pmc_id | gse_ids | prjna_ids | srp_ids | srr_ids | srx_ids | srs_ids |
|---|---|---|---|---|---|---|---|---|
| 0 | 39528918 | PMC11552371 | GSE253406 | None | SRP484103 | None | None | None |
| 1 | 27373336 | PMC4958508 | None | None | None | None | None | None |
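For longer PMID lists, it may help to drop duplicates and work in smaller batches. The notebook does not say whether pysradb requires this, so treat the sketch below as an optional convenience. `batched_unique` is a hypothetical helper, not a pysradb function.

```python
from typing import Iterator, List, Sequence


def batched_unique(pmids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield de-duplicated PMIDs in order-preserving batches of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # dict.fromkeys keeps the first occurrence of each PMID and preserves order.
    unique = list(dict.fromkeys(p.strip() for p in pmids if p.strip()))
    for start in range(0, len(unique), size):
        yield unique[start:start + size]


batches = list(batched_unique(["39528918", "27373336", "39528918", "30873266"], size=2))
print(batches)
```

Each batch can then be passed to `db.pmid_to_identifiers`, and the resulting DataFrames combined with `pd.concat`.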
Example 5: Extract from PMC ID¶
If you have a PMC ID directly, you can use it:
[6]:
# Extract from PMC ID
pmc_df = db.pmc_to_identifiers("PMC10802650")
print("Identifiers from PMC article:")
pmc_df
Identifiers from PMC article:
[6]:
| | pmc_id | gse_ids | prjna_ids | srp_ids | srr_ids | srx_ids | srs_ids |
|---|---|---|---|---|---|---|---|
| 0 | PMC10802650 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
Example 6: Extract from DOI¶
You can also extract identifiers directly from a DOI:
[7]:
# Extract from DOI
doi = "10.12688/f1000research.18676.1"
doi_df = db.doi_to_identifiers(doi)
print(f"Identifiers from DOI {doi}:")
doi_df
Identifiers from DOI 10.12688/f1000research.18676.1:
[7]:
| | doi | pmid | pmc_id | gse_ids | prjna_ids | srp_ids | srr_ids | srx_ids | srs_ids |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.12688/f1000research.18676.1 | 31114675 | PMC6505635 | GSE100007,GSE24355,GSE25842,GSE41637 | None | SRP000941,SRP003870,SRP005378,SRP010679,SRP109126 | SRR403882,SRR403883,SRR403884,SRR403885,SRR403... | SRX118285,SRX118286,SRX118287,SRX118288,SRX118... | SRS2282390,SRS2282391,SRS2282392,SRS2282393,SR... |
Example 7: Get GSE from DOI¶
[8]:
# Get only GSE from DOI
doi_gse_df = db.doi_to_gse("10.12688/f1000research.18676.1")
print("GSE identifiers from DOI:")
doi_gse_df
GSE identifiers from DOI:
[8]:
| | doi | pmid | pmc_id | gse_ids |
|---|---|---|---|---|
| 0 | 10.12688/f1000research.18676.1 | 31114675 | PMC6505635 | GSE100007,GSE24355,GSE25842,GSE41637 |
Example 8: Complete Workflow - DOI to Data Download¶
Here’s a complete workflow from DOI to downloading the actual data:
[9]:
# Step 1: Extract SRP from DOI
doi = "10.12688/f1000research.18676.1"
print(f"Step 1: Extracting identifiers from DOI: {doi}")
doi_results = db.doi_to_srp(doi)
print(doi_results)
print()
Step 1: Extracting identifiers from DOI: 10.12688/f1000research.18676.1
                              doi      pmid      pmc_id  \
0  10.12688/f1000research.18676.1  31114675  PMC6505635

                                             srp_ids
0  SRP000941,SRP003870,SRP005378,SRP010679,SRP109126
[10]:
# Step 2: Extract SRP ID from results
if not doi_results.empty and not pd.isna(doi_results.iloc[0]["srp_ids"]):
    srp_id = doi_results.iloc[0]["srp_ids"].split(",")[0]  # Take first SRP if multiple
    print(f"Step 2: Found SRP ID: {srp_id}")
    print()
    # Step 3: Get metadata for this SRP
    print(f"Step 3: Getting metadata for {srp_id}")
    metadata = db.sra_metadata(srp_id)
    print(f"Found {len(metadata)} samples")
    print(metadata.head())
    # Optional: Download the data
    # db.download(df=metadata, out_dir='./downloads')
else:
    print("No SRP IDs found for this DOI")
Step 2: Found SRP ID: SRP000941
Step 3: Getting metadata for SRP000941
Found 2207 samples
study_accession study_title \
16 SRP000941 UCSD Human Reference Epigenome Mapping Project
1055 SRP000941 UCSD Human Reference Epigenome Mapping Project
1053 SRP000941 UCSD Human Reference Epigenome Mapping Project
1054 SRP000941 UCSD Human Reference Epigenome Mapping Project
968 SRP000941 UCSD Human Reference Epigenome Mapping Project
experiment_accession experiment_title \
16 SRX027874 Reference Epigenome: ChIP-Seq Analysis of H3K9...
1055 SRX006235 Reference Epigenome: ChIP-Seq Analysis of H3K3...
1053 SRX006237 Reference Epigenome: ChIP-Seq Analysis of H3K4...
1054 SRX006236 Reference Epigenome: ChIP-Seq Analysis of H3K4...
968 SRX006240 Whole genome shotgun bisulfite sequencing of t...
experiment_desc organism_taxid \
16 Reference Epigenome: ChIP-Seq Analysis of H3K9... 9606
1055 Reference Epigenome: ChIP-Seq Analysis of H3K3... 9606
1053 Reference Epigenome: ChIP-Seq Analysis of H3K4... 9606
1054 Reference Epigenome: ChIP-Seq Analysis of H3K4... 9606
968 Whole genome shotgun bisulfite sequencing of t... 9606
organism_name library_name library_strategy library_source ... \
16 Homo sapiens LL218 ChIP-Seq GENOMIC ...
1055 Homo sapiens LL220 ChIP-Seq GENOMIC ...
1053 Homo sapiens LL227 ChIP-Seq GENOMIC ...
1054 Homo sapiens LL228 ChIP-Seq GENOMIC ...
968 Homo sapiens methylC-seq_h1_r2b Bisulfite-Seq GENOMIC ...
biosample bioproject instrument \
16 SAMN00004698 <NA> Illumina Genome Analyzer II
1055 SAMN00004459 <NA> Illumina Genome Analyzer II
1053 SAMN00004459 <NA> Illumina Genome Analyzer II
1054 SAMN00004459 <NA> Illumina Genome Analyzer II
968 SAMN00004462 <NA> Illumina Genome Analyzer II
instrument_model instrument_model_desc total_spots \
16 Illumina Genome Analyzer II ILLUMINA 59041190
1055 Illumina Genome Analyzer II ILLUMINA 12466602
1053 Illumina Genome Analyzer II ILLUMINA 7763232
1054 Illumina Genome Analyzer II ILLUMINA 7110370
968 Illumina Genome Analyzer II ILLUMINA 544393574
total_size run_accession run_total_spots run_total_bases
16 1555248627 SRR018453 11642838 419142168
1055 223017616 SRR018454 12466602 448797672
1053 204130918 SRR018455 7763232 279476352
1054 185053545 SRR018456 7110370 255973320
968 32242220269 SRR018975 16729745 1455487815
[5 rows x 24 columns]
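Step 2 above keeps only the first SRP, although this DOI resolves to five. A minimal sketch for iterating over all of them is shown below. `split_ids` is a hypothetical helper, not part of pysradb; it treats any non-string cell (such as `None`, `NaN`, or `<NA>`) as "no identifiers".

```python
from typing import List


def split_ids(cell: object) -> List[str]:
    """Split a comma-separated identifier cell into a list of IDs.

    Any non-string value is treated as missing and yields an empty list.
    """
    if not isinstance(cell, str):
        return []
    return [part.strip() for part in cell.split(",") if part.strip()]


# Value copied from the srp_ids column in Step 1.
srp_cell = "SRP000941,SRP003870,SRP005378,SRP010679,SRP109126"
for srp_id in split_ids(srp_cell):
    print(srp_id)
```

Inside the loop, each `srp_id` could be passed to `db.sra_metadata` in the same way as in Step 3.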
Example 9: Batch Processing Literature¶
Process multiple papers from a literature search:
[11]:
# Simulate PMIDs from a literature search
pmids_from_search = ["39528918", "27373336", "30873266"]
# Extract identifiers from all papers
results = db.pmid_to_identifiers(pmids_from_search)
# Filter to papers with SRA data
papers_with_sra = results[~pd.isna(results["srp_ids"])]
print(f"Processed {len(pmids_from_search)} papers")
print(f"Found {len(papers_with_sra)} papers with SRA data")
print()
print("Papers with SRA data:")
papers_with_sra
Processed 3 papers
Found 1 papers with SRA data
Papers with SRA data:
[11]:
| | pmid | pmc_id | gse_ids | prjna_ids | srp_ids | srr_ids | srx_ids | srs_ids |
|---|---|---|---|---|---|---|---|---|
| 0 | 39528918 | PMC11552371 | GSE253406 | None | SRP484103 | None | None | None |
Example 10: Extract Identifiers from Custom Text¶
You can also extract identifiers from any text:
[12]:
# Custom text with database identifiers
text = """
This study analyzed RNA-seq data from GSE81903 and SRP075720.
The BioProject accession is PRJNA319707 with representative samples
SRR3587529, SRX1800089, and SRS1467635.
"""
identifiers = db.extract_identifiers_from_text(text)
print("Identifiers found in text:")
for id_type, ids in identifiers.items():
    if ids:
        print(f" {id_type.upper()}: {', '.join(ids)}")
Identifiers found in text:
GSE: GSE81903
PRJNA: PRJNA319707
SRP: SRP075720
SRR: SRR3587529
SRX: SRX1800089
SRS: SRS1467635
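To give a rough idea of how this kind of extraction can work, here is a regex-based sketch using only the standard library. It is not pysradb's implementation: the patterns, the lowercase keys, and the `find_accessions` name are assumptions made for illustration, and real accession formats may have rules these patterns do not capture.

```python
import re
from typing import Dict, List

# Assumed patterns: a fixed prefix followed by digits, bounded by word breaks.
ACCESSION_PATTERNS = {
    "gse": re.compile(r"\bGSE\d+\b"),
    "prjna": re.compile(r"\bPRJNA\d+\b"),
    "srp": re.compile(r"\bSRP\d+\b"),
    "srr": re.compile(r"\bSRR\d+\b"),
    "srx": re.compile(r"\bSRX\d+\b"),
    "srs": re.compile(r"\bSRS\d+\b"),
}


def find_accessions(text: str) -> Dict[str, List[str]]:
    """Return sorted, de-duplicated matches for each accession type."""
    return {
        kind: sorted(set(pattern.findall(text)))
        for kind, pattern in ACCESSION_PATTERNS.items()
    }


sample = """
This study analyzed RNA-seq data from GSE81903 and SRP075720.
The BioProject accession is PRJNA319707 with representative samples
SRR3587529, SRX1800089, and SRS1467635.
"""
found = find_accessions(sample)
for kind, ids in found.items():
    print(f"{kind.upper()}: {', '.join(ids)}")
```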
Example 11: ID Conversion Utilities¶
Convert between different identifier types:
[14]:
# Convert PMID to PMC
print("PMID to PMC conversion:")
pmid_pmc = db.pmid_to_pmc("27373336")
print(pmid_pmc)
print()
# Convert DOI to PMID
print("DOI to PMID conversion:")
doi_pmid = db.doi_to_pmid("10.12688/f1000research.18676.1")
print(doi_pmid)
PMID to PMC conversion:
{'27373336': 'PMC4958508'}
DOI to PMID conversion:
{'10.12688/f1000research.18676.1': '31114675'}
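Both utilities return plain dictionaries, so their results can be chained without further API calls. The sketch below composes a DOI to PMID mapping with a PMID to PMC mapping. `chain_lookup` is a hypothetical helper, and the sample values are copied from the outputs shown in this notebook.

```python
from typing import Dict, Optional


def chain_lookup(first: Dict[str, str], second: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Compose two mappings: each key of `first` maps to second[first[key]].

    Keys whose intermediate value is absent from `second` map to None.
    """
    return {key: second.get(middle) for key, middle in first.items()}


# Shapes follow the printed results of db.doi_to_pmid and db.pmid_to_pmc.
doi_to_pmid = {"10.12688/f1000research.18676.1": "31114675"}
pmid_to_pmc = {"31114675": "PMC6505635"}

doi_to_pmc = chain_lookup(doi_to_pmid, pmid_to_pmc)
print(doi_to_pmc)
```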