Extract Identifiers from PMC/DOI

This notebook shows how to extract database identifiers (GEO, BioProject, and SRA accessions) from PubMed Central articles, PubMed IDs, and DOIs.

[1]:

# Install pysradb if not already installed
try:
    import pysradb

    print(f"pysradb {pysradb.__version__} is already installed")
except ImportError:
    print("Installing pysradb from GitHub...")
    import sys

    !{sys.executable} -m pip install -q git+https://github.com/saketkc/pysradb
    print("pysradb installed successfully!")

pysradb 3.0.0.dev0 is already installed


Extracting Database Identifiers from Literature (PMC/DOI)

This notebook demonstrates how to extract database identifiers (GSE, PRJNA, SRP, SRR, SRX, SRS) from:

PubMed Central (PMC) articles
PubMed IDs (PMIDs)
Digital Object Identifiers (DOIs)

Setup

First, let’s import pysradb and create an SRAweb instance:

[2]:

from pysradb.sraweb import SRAweb
import pandas as pd

db = SRAweb()

Example 1: Extract Identifiers from PMID

Let’s extract all database identifiers from a PubMed article using its PMID:

[3]:

# Extract all identifiers from PMID 39528918
pmid = "39528918"
df = db.pmid_to_identifiers(pmid)

print(f"Identifiers found for PMID {pmid}:")
df

Identifiers found for PMID 39528918:

[3]:

	pmid	pmc_id	gse_ids	prjna_ids	srp_ids	srr_ids	srx_ids	srs_ids
0	39528918	PMC11552371	GSE253406	PRJNA1065472	SRP484103	None	None	None

Example 2: Get Only GSE IDs from PMID

Sometimes you only need specific identifier types:

[4]:

# Get only GSE identifiers
gse_df = db.pmid_to_gse("39528918")
print("GSE identifiers:")
gse_df

GSE identifiers:

[4]:

	pmid	pmc_id	gse_ids
0	39528918	PMC11552371	GSE253406

Example 3: Get Only SRP IDs from PMID

Important: Even if the paper mentions only GSE IDs, pysradb automatically converts them to the corresponding SRP IDs.

[5]:

# Get only SRP identifiers
# This works even if the paper only mentions GSE253406!
srp_df = db.pmid_to_srp("39528918")
print("SRP identifiers (auto-converted from GSE if needed):")
srp_df

SRP identifiers (auto-converted from GSE if needed):

[5]:

	pmid	pmc_id	srp_ids
0	39528918	PMC11552371	SRP484103
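
The fallback described above can be pictured roughly as follows. This is an illustrative sketch only, not pysradb's actual implementation: resolve_srp and the gse_to_srp_lookup callable are hypothetical names, and the lookup is stubbed with the GSE/SRP pair from the output above so that the code runs without network access.

```python
from typing import Callable, Iterable, List


def resolve_srp(
    srp_ids: Iterable[str],
    gse_ids: Iterable[str],
    gse_to_srp_lookup: Callable[[str], List[str]],
) -> List[str]:
    """Return SRP IDs found directly, else SRP IDs derived from GSE IDs."""
    # De-duplicate the directly mentioned SRP IDs while keeping their order.
    direct = list(dict.fromkeys(srp_ids))
    if direct:
        return direct

    # No SRP IDs in the paper: fall back to converting each GSE ID.
    converted: List[str] = []
    for gse in gse_ids:
        for srp in gse_to_srp_lookup(gse):
            if srp not in converted:
                converted.append(srp)
    return converted


# Hypothetical stub standing in for a live GSE-to-SRP query.
stub = {"GSE253406": ["SRP484103"]}
print(resolve_srp([], ["GSE253406"], lambda gse: stub.get(gse, [])))
```

SRP IDs that appear in the paper take priority in this sketch; the conversion only runs when none are found.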

Example 4: Process Multiple PMIDs

Pass a list to process several PMIDs in a single call:

[6]:

# Process multiple PMIDs
pmids = ["39528918", "27373336"]
df_multiple = db.pmid_to_identifiers(pmids)

print(f"Identifiers from {len(pmids)} PMIDs:")
df_multiple

Identifiers from 2 PMIDs:

[6]:

	pmid	pmc_id	gse_ids	prjna_ids	srp_ids	srr_ids	srx_ids	srs_ids
0	39528918	PMC11552371	GSE253406	PRJNA1065472	SRP484103	None	None	None
1	27373336	PMC4958508	NaN	NaN	NaN	None	None	None

Example 5: Extract from PMC ID

If you already have a PMC ID, you can pass it directly:

[7]:

# Extract from PMC ID
pmc_df = db.pmc_to_identifiers("PMC10802650")
print("Identifiers from PMC article:")
pmc_df

Identifiers from PMC article:

[7]:

	pmc_id	gse_ids	prjna_ids	srp_ids	srr_ids	srx_ids	srs_ids
0	PMC10802650	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>

Example 6: Extract from DOI

You can also extract identifiers directly from a DOI:

[8]:

# Extract from DOI
doi = "10.12688/f1000research.18676.1"
doi_df = db.doi_to_identifiers(doi)

print(f"Identifiers from DOI {doi}:")
doi_df

Identifiers from DOI 10.12688/f1000research.18676.1:

[8]:

	doi	pmid	pmc_id	gse_ids	prjna_ids	srp_ids	srr_ids	srx_ids	srs_ids
0	10.12688/f1000research.18676.1	31114675	PMC6505635	GSE100007,GSE24355,GSE25842,GSE41637	PRJNA132809,PRJNA135759,PRJNA152347,PRJNA390315	SRP000941,SRP003870,SRP005378,SRP010679,SRP109126	SRR403882,SRR403883,SRR403884,SRR403885,SRR403...	SRX118285,SRX118286,SRX118287,SRX118288,SRX118...	SRS2282390,SRS2282391,SRS2282392,SRS2282393,SR...
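
Each identifier column above holds a single comma-separated string. The trailing "..." most likely reflects pandas display truncation, not missing data. A minimal sketch for turning such a cell into a Python list is shown below; split_ids is a hypothetical helper, not part of pysradb.

```python
from typing import List


def split_ids(cell: object) -> List[str]:
    """Split a comma-separated ID cell into a list of IDs.

    Anything that is not a string (None, NaN, pandas.NA) is treated
    as "no identifiers" and yields an empty list.
    """
    if not isinstance(cell, str):
        return []
    return [part.strip() for part in cell.split(",") if part.strip()]


gse_cell = "GSE100007,GSE24355,GSE25842,GSE41637"
print(split_ids(gse_cell))
# ['GSE100007', 'GSE24355', 'GSE25842', 'GSE41637']
print(split_ids(None))
# []
```

Applied to the DataFrame above, something like doi_df["srr_ids"].map(split_ids) should give the full list of run accessions for each row.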

Example 7: Get GSE from DOI

As with PMIDs, you can request a single identifier type from a DOI:

[9]:

# Get only GSE from DOI
doi_gse_df = db.doi_to_gse("10.12688/f1000research.18676.1")
print("GSE identifiers from DOI:")
doi_gse_df

GSE identifiers from DOI:

[9]:

	doi	pmid	pmc_id	gse_ids
0	10.12688/f1000research.18676.1	31114675	PMC6505635	GSE100007,GSE24355,GSE25842,GSE41637

Example 8: Complete Workflow - DOI to Data Download

Here’s a complete workflow that goes from a DOI to SRA metadata; the final download step is left commented out:

[10]:

# Step 1: Extract SRP from DOI
doi = "10.12688/f1000research.18676.1"
print(f"Step 1: Extracting identifiers from DOI: {doi}")
doi_results = db.doi_to_srp(doi)
print(doi_results)
print()

Step 1: Extracting identifiers from DOI: 10.12688/f1000research.18676.1
                              doi      pmid      pmc_id  \
0  10.12688/f1000research.18676.1  31114675  PMC6505635

                                             srp_ids
0  SRP000941,SRP003870,SRP005378,SRP010679,SRP109126

[11]:

# Step 2: Extract SRP ID from results
if not doi_results.empty and not pd.isna(doi_results.iloc[0]["srp_ids"]):
    srp_id = doi_results.iloc[0]["srp_ids"].split(",")[0]  # Take first SRP if multiple
    print(f"Step 2: Found SRP ID: {srp_id}")
    print()

    # Step 3: Get metadata for this SRP
    print(f"Step 3: Getting metadata for {srp_id}")
    metadata = db.sra_metadata(srp_id)
    print(f"Found {len(metadata)} samples")
    print(metadata.head())

    # Optional: Download the data
    # db.download(df=metadata, out_dir='./downloads')
else:
    print("No SRP IDs found for this DOI")

Step 2: Found SRP ID: SRP000941

Step 3: Getting metadata for SRP000941
Found 2207 samples
     study_accession                                     study_title  \
16         SRP000941  UCSD Human Reference Epigenome Mapping Project
1055       SRP000941  UCSD Human Reference Epigenome Mapping Project
1053       SRP000941  UCSD Human Reference Epigenome Mapping Project
1054       SRP000941  UCSD Human Reference Epigenome Mapping Project
968        SRP000941  UCSD Human Reference Epigenome Mapping Project

     experiment_accession                                   experiment_title  \
16              SRX027874  Reference Epigenome: ChIP-Seq Analysis of H3K9...
1055            SRX006235  Reference Epigenome: ChIP-Seq Analysis of H3K3...
1053            SRX006237  Reference Epigenome: ChIP-Seq Analysis of H3K4...
1054            SRX006236  Reference Epigenome: ChIP-Seq Analysis of H3K4...
968             SRX006240  Whole genome shotgun bisulfite sequencing of t...

                                        experiment_desc organism_taxid  \
16    Reference Epigenome: ChIP-Seq Analysis of H3K9...           9606
1055  Reference Epigenome: ChIP-Seq Analysis of H3K3...           9606
1053  Reference Epigenome: ChIP-Seq Analysis of H3K4...           9606
1054  Reference Epigenome: ChIP-Seq Analysis of H3K4...           9606
968   Whole genome shotgun bisulfite sequencing of t...           9606

     organism_name        library_name library_strategy library_source  ...  \
16    Homo sapiens               LL218         ChIP-Seq        GENOMIC  ...
1055  Homo sapiens               LL220         ChIP-Seq        GENOMIC  ...
1053  Homo sapiens               LL227         ChIP-Seq        GENOMIC  ...
1054  Homo sapiens               LL228         ChIP-Seq        GENOMIC  ...
968   Homo sapiens  methylC-seq_h1_r2b    Bisulfite-Seq        GENOMIC  ...

         biosample bioproject                   instrument  \
16    SAMN00004698       <NA>  Illumina Genome Analyzer II
1055  SAMN00004459       <NA>  Illumina Genome Analyzer II
1053  SAMN00004459       <NA>  Illumina Genome Analyzer II
1054  SAMN00004459       <NA>  Illumina Genome Analyzer II
968   SAMN00004462       <NA>  Illumina Genome Analyzer II

                 instrument_model instrument_model_desc total_spots  \
16    Illumina Genome Analyzer II              ILLUMINA    59041190
1055  Illumina Genome Analyzer II              ILLUMINA    12466602
1053  Illumina Genome Analyzer II              ILLUMINA     7763232
1054  Illumina Genome Analyzer II              ILLUMINA     7110370
968   Illumina Genome Analyzer II              ILLUMINA   544393574

       total_size run_accession run_total_spots run_total_bases
16     1555248627     SRR018453        11642838       419142168
1055    223017616     SRR018454        12466602       448797672
1053    204130918     SRR018455         7763232       279476352
1054    185053545     SRR018456         7110370       255973320
968   32242220269     SRR018975        16729745      1455487815

[5 rows x 24 columns]
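
The workflow above keeps only the first SRP ID, although this DOI resolves to five. One way to cover all of them is sketched below, under assumptions: fetch_all_studies is a hypothetical helper, and fake_fetch stands in for db.sra_metadata so the example runs offline.

```python
from typing import Any, Callable, Dict


def fetch_all_studies(srp_cell: str, fetch: Callable[[str], Any]) -> Dict[str, Any]:
    """Call fetch once per distinct SRP ID in a comma-separated cell."""
    results: Dict[str, Any] = {}
    for srp_id in (part.strip() for part in srp_cell.split(",")):
        # Skip empty fragments and IDs that were already fetched.
        if srp_id and srp_id not in results:
            results[srp_id] = fetch(srp_id)
    return results


def fake_fetch(srp_id: str) -> str:
    """Offline stand-in (hypothetical) for a metadata query."""
    return f"metadata for {srp_id}"


studies = fetch_all_studies("SRP000941,SRP003870,SRP000941", fake_fetch)
print(list(studies))
# ['SRP000941', 'SRP003870']
```

In the notebook, passing db.sra_metadata as fetch would be the natural substitution. Each call then queries NCBI, so large studies such as SRP000941 may take a while.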

Example 9: Batch Processing Literature

Process multiple papers from a literature search:

[12]:

# Simulate PMIDs from a literature search
pmids_from_search = ["39528918", "27373336", "30873266"]

# Extract identifiers from all papers
results = db.pmid_to_identifiers(pmids_from_search)

# Filter to papers with SRA data
papers_with_sra = results[results["srp_ids"].notna()]

print(f"Processed {len(pmids_from_search)} papers")
print(f"Found {len(papers_with_sra)} papers with SRA data")
print()
print("Papers with SRA data:")
papers_with_sra

Processed 3 papers
Found 1 papers with SRA data

Papers with SRA data:

[12]:

	pmid	pmc_id	gse_ids	prjna_ids	srp_ids	srr_ids	srx_ids	srs_ids
0	39528918	PMC11552371	GSE253406	PRJNA1065472	SRP484103	None	None	None
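
PMID lists exported from a real literature search often contain duplicates, stray whitespace, or non-numeric entries. A small clean-up step before the batch call can be sketched as below; clean_pmids is a hypothetical helper, and the rule that a PMID is a string of ASCII digits is an assumption of this sketch.

```python
from typing import Iterable, List


def clean_pmids(raw: Iterable[object]) -> List[str]:
    """Normalize PMIDs to digit-only strings, dropping blanks and duplicates."""
    seen = set()
    cleaned: List[str] = []
    for item in raw:
        pmid = str(item).strip()
        # Keep only ASCII-digit strings that have not been seen before.
        if pmid.isascii() and pmid.isdigit() and pmid not in seen:
            seen.add(pmid)
            cleaned.append(pmid)
    return cleaned


print(clean_pmids(["39528918", " 27373336 ", 30873266, "39528918", "", "n/a"]))
# ['39528918', '27373336', '30873266']
```

The cleaned list can then be passed to db.pmid_to_identifiers as in the cell above.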

Example 10: Extract Identifiers from Custom Text

You can also extract identifiers from arbitrary text:

[13]:

# Custom text with database identifiers
text = """
This study analyzed RNA-seq data from GSE81903 and SRP075720.
The BioProject accession is PRJNA319707 with representative samples
SRR3587529, SRX1800089, and SRS1467635.
"""

identifiers = db.extract_identifiers_from_text(text)

print("Identifiers found in text:")
for id_type, ids in identifiers.items():
    if ids:
        print(f"  {id_type.upper()}: {', '.join(ids)}")

Identifiers found in text:
  GSE: GSE81903
  PRJNA: PRJNA319707
  SRP: SRP075720
  SRR: SRR3587529
  SRX: SRX1800089
  SRS: SRS1467635
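
For intuition, this kind of extraction can be approximated with regular expressions. The sketch below is not pysradb's implementation. The patterns assume each accession is a fixed prefix followed by digits, find_accessions is a hypothetical name, and ENA/DDBJ prefixes such as ERR or DRR are deliberately ignored.

```python
import re
from typing import Dict, List

# Assumed accession shapes: a fixed prefix followed by one or more digits.
PATTERNS = {
    "gse": r"\bGSE\d+\b",
    "prjna": r"\bPRJNA\d+\b",
    "srp": r"\bSRP\d+\b",
    "srr": r"\bSRR\d+\b",
    "srx": r"\bSRX\d+\b",
    "srs": r"\bSRS\d+\b",
}


def find_accessions(text: str) -> Dict[str, List[str]]:
    """Return sorted, de-duplicated matches for each accession type."""
    found: Dict[str, List[str]] = {}
    for name, pattern in PATTERNS.items():
        found[name] = sorted(set(re.findall(pattern, text)))
    return found


sample = (
    "This study analyzed RNA-seq data from GSE81903 and SRP075720. "
    "The BioProject accession is PRJNA319707 with representative samples "
    "SRR3587529, SRX1800089, and SRS1467635."
)
for id_type, ids in find_accessions(sample).items():
    print(f"{id_type.upper()}: {', '.join(ids)}")
```

The word boundaries (\b) keep the patterns from matching inside longer tokens, and the set removes accessions that are mentioned more than once.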

Example 11: ID Conversion Utilities

Convert between different identifier types:

[14]:

# Convert PMID to PMC
print("PMID to PMC conversion:")
pmid_pmc = db.pmid_to_pmc("27373336")
print(pmid_pmc)
print()

# Convert DOI to PMID
print("DOI to PMID conversion:")
doi_pmid = db.doi_to_pmid("10.12688/f1000research.18676.1")
print(doi_pmid)

PMID to PMC conversion:
{'27373336': 'PMC4958508'}

DOI to PMID conversion:
{'10.12688/f1000research.18676.1': '31114675'}
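
Both utilities return plain dictionaries keyed by the input ID, so they can be chained by hand. The sketch below uses literal dictionaries shaped like the outputs in this notebook, not live calls, and chain_lookup is a hypothetical helper.

```python
from typing import Dict, Optional


def chain_lookup(
    doi: str,
    doi_to_pmid: Dict[str, str],
    pmid_to_pmc: Dict[str, str],
) -> Optional[str]:
    """Follow DOI -> PMID -> PMC ID, returning None if either step is missing."""
    pmid = doi_to_pmid.get(doi)
    if pmid is None:
        return None
    return pmid_to_pmc.get(pmid)


# Literal mappings taken from the outputs shown in this notebook.
doi_map = {"10.12688/f1000research.18676.1": "31114675"}
pmc_map = {"31114675": "PMC6505635"}

print(chain_lookup("10.12688/f1000research.18676.1", doi_map, pmc_map))
# PMC6505635
```

Returning None when a step has no entry avoids a KeyError for papers without a PMID or without a PMC record.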
