Extract Identifiers from PMC/DOI

This notebook shows how to extract SRA identifiers from PubMed Central articles and DOI references.

[ ]:
# Install pysradb if not already installed
try:
    import pysradb

    print(f"pysradb {pysradb.__version__} is already installed")
except ImportError:
    print("Installing pysradb from GitHub...")
    import sys

    !{sys.executable} -m pip install -q git+https://github.com/saketkc/pysradb
    print("pysradb installed successfully!")

Extracting Database Identifiers from Literature (PMC/DOI)

This notebook demonstrates how to extract database identifiers (GSE, PRJNA, SRP, SRR, SRX, SRS) from:

  • PubMed Central (PMC) articles

  • PubMed IDs (PMIDs)

  • Digital Object Identifiers (DOIs)

Setup

First, import pysradb and create an SRAweb client:

[1]:
from pysradb.sraweb import SRAweb
import pandas as pd

db = SRAweb()

Example 1: Extract Identifiers from PMID

Let’s extract all database identifiers from a PubMed article using its PMID:

[2]:
# Extract all identifiers from PMID 39528918
pmid = "39528918"
df = db.pmid_to_identifiers(pmid)

print(f"Identifiers found for PMID {pmid}:")
df
Identifiers found for PMID 39528918:
[2]:
       pmid       pmc_id    gse_ids prjna_ids    srp_ids srr_ids srx_ids srs_ids
0  39528918  PMC11552371  GSE253406      None  SRP484103    None    None    None

Example 2: Get Only GSE IDs from PMID

Sometimes you only need specific identifier types:

[3]:
# Get only GSE identifiers
gse_df = db.pmid_to_gse("39528918")
print("GSE identifiers:")
gse_df
GSE identifiers:
[3]:
       pmid       pmc_id    gse_ids
0  39528918  PMC11552371  GSE253406

Example 3: Get Only SRP IDs from PMID

Important: pysradb automatically converts GSE IDs to their corresponding SRP accessions, so this works even when the paper mentions only GSE IDs.

[4]:
# Get only SRP identifiers
# This works even if the paper only mentions GSE253406!
srp_df = db.pmid_to_srp("39528918")
print("SRP identifiers (auto-converted from GSE if needed):")
srp_df
SRP identifiers (auto-converted from GSE if needed):
[4]:
       pmid       pmc_id    srp_ids
0  39528918  PMC11552371  SRP484103
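
The fallback described in this example can be pictured as a two-step lookup: use SRP accessions found in the article if there are any, otherwise convert each GSE accession. The sketch below only illustrates that logic and is not pysradb's actual implementation; `resolve_srps` and the dict-backed `lookup` are hypothetical names, and the single GSE/SRP pair is copied from the output above.

```python
from typing import Callable, Iterable, List, Optional


def resolve_srps(
    srp_ids: Iterable[str],
    gse_ids: Iterable[str],
    gse_to_srp: Callable[[str], Optional[str]],
) -> List[str]:
    """Return SRP accessions, converting GSE accessions only when no SRP is given."""
    direct = list(dict.fromkeys(srp_ids))  # de-duplicate while keeping order
    if direct:
        return direct
    converted: List[str] = []
    for gse in gse_ids:
        srp = gse_to_srp(gse)
        if srp is not None and srp not in converted:
            converted.append(srp)
    return converted


# Hypothetical stand-in for a real GSE -> SRP lookup service.
lookup = {"GSE253406": "SRP484103"}

srps = resolve_srps(srp_ids=[], gse_ids=["GSE253406"], gse_to_srp=lookup.get)
print(srps)
```

Passing the converter in as a callable keeps the sketch free of network access; a real lookup function could be substituted for `lookup.get`.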

Example 4: Process Multiple PMIDs

Pass a list to process several PMIDs in one call:

[5]:
# Process multiple PMIDs
pmids = ["39528918", "27373336"]
df_multiple = db.pmid_to_identifiers(pmids)

print(f"Identifiers from {len(pmids)} PMIDs:")
df_multiple
Identifiers from 2 PMIDs:
[5]:
       pmid       pmc_id    gse_ids prjna_ids    srp_ids srr_ids srx_ids srs_ids
0  39528918  PMC11552371  GSE253406      None  SRP484103    None    None    None
1  27373336   PMC4958508       None      None       None    None    None    None

Example 5: Extract from PMC ID

If you have a PMC ID directly, you can use it:

[6]:
# Extract from PMC ID
pmc_df = db.pmc_to_identifiers("PMC10802650")
print("Identifiers from PMC article:")
pmc_df
Identifiers from PMC article:
[6]:
        pmc_id gse_ids prjna_ids srp_ids srr_ids srx_ids srs_ids
0  PMC10802650    <NA>      <NA>    <NA>    <NA>    <NA>    <NA>

Example 6: Extract from DOI

You can also extract identifiers directly from a DOI:

[7]:
# Extract from DOI
doi = "10.12688/f1000research.18676.1"
doi_df = db.doi_to_identifiers(doi)

print(f"Identifiers from DOI {doi}:")
doi_df
Identifiers from DOI 10.12688/f1000research.18676.1:
[7]:
                              doi      pmid      pmc_id  \
0  10.12688/f1000research.18676.1  31114675  PMC6505635

                                gse_ids  prjna_ids  \
0  GSE100007,GSE24355,GSE25842,GSE41637       None

                                             srp_ids  \
0  SRP000941,SRP003870,SRP005378,SRP010679,SRP109126

                                             srr_ids  \
0  SRR403882,SRR403883,SRR403884,SRR403885,SRR403...

                                             srx_ids  \
0  SRX118285,SRX118286,SRX118287,SRX118288,SRX118...

                                             srs_ids
0  SRS2282390,SRS2282391,SRS2282392,SRS2282393,SR...

Example 7: Get GSE from DOI

As with PMIDs, you can restrict the result to GSE identifiers:

[8]:
# Get only GSE from DOI
doi_gse_df = db.doi_to_gse("10.12688/f1000research.18676.1")
print("GSE identifiers from DOI:")
doi_gse_df
GSE identifiers from DOI:
[8]:
                              doi      pmid      pmc_id                               gse_ids
0  10.12688/f1000research.18676.1  31114675  PMC6505635  GSE100007,GSE24355,GSE25842,GSE41637

Example 8: Complete Workflow - DOI to Data Download

Here’s a complete workflow from DOI to downloading the actual data:

[9]:
# Step 1: Extract SRP from DOI
doi = "10.12688/f1000research.18676.1"
print(f"Step 1: Extracting identifiers from DOI: {doi}")
doi_results = db.doi_to_srp(doi)
print(doi_results)
print()
Step 1: Extracting identifiers from DOI: 10.12688/f1000research.18676.1
                              doi      pmid      pmc_id  \
0  10.12688/f1000research.18676.1  31114675  PMC6505635

                                             srp_ids
0  SRP000941,SRP003870,SRP005378,SRP010679,SRP109126

[10]:
# Step 2: Extract SRP ID from results
if not doi_results.empty and not pd.isna(doi_results.iloc[0]["srp_ids"]):
    srp_id = doi_results.iloc[0]["srp_ids"].split(",")[0]  # Take first SRP if multiple
    print(f"Step 2: Found SRP ID: {srp_id}")
    print()

    # Step 3: Get metadata for this SRP
    print(f"Step 3: Getting metadata for {srp_id}")
    metadata = db.sra_metadata(srp_id)
    print(f"Found {len(metadata)} samples")
    print(metadata.head())

    # Optional: Download the data
    # db.download(df=metadata, out_dir='./downloads')
else:
    print("No SRP IDs found for this DOI")
Step 2: Found SRP ID: SRP000941

Step 3: Getting metadata for SRP000941
Found 2207 samples
     study_accession                                     study_title  \
16         SRP000941  UCSD Human Reference Epigenome Mapping Project
1055       SRP000941  UCSD Human Reference Epigenome Mapping Project
1053       SRP000941  UCSD Human Reference Epigenome Mapping Project
1054       SRP000941  UCSD Human Reference Epigenome Mapping Project
968        SRP000941  UCSD Human Reference Epigenome Mapping Project

     experiment_accession                                   experiment_title  \
16              SRX027874  Reference Epigenome: ChIP-Seq Analysis of H3K9...
1055            SRX006235  Reference Epigenome: ChIP-Seq Analysis of H3K3...
1053            SRX006237  Reference Epigenome: ChIP-Seq Analysis of H3K4...
1054            SRX006236  Reference Epigenome: ChIP-Seq Analysis of H3K4...
968             SRX006240  Whole genome shotgun bisulfite sequencing of t...

                                        experiment_desc organism_taxid  \
16    Reference Epigenome: ChIP-Seq Analysis of H3K9...           9606
1055  Reference Epigenome: ChIP-Seq Analysis of H3K3...           9606
1053  Reference Epigenome: ChIP-Seq Analysis of H3K4...           9606
1054  Reference Epigenome: ChIP-Seq Analysis of H3K4...           9606
968   Whole genome shotgun bisulfite sequencing of t...           9606

     organism_name        library_name library_strategy library_source  ...  \
16    Homo sapiens               LL218         ChIP-Seq        GENOMIC  ...
1055  Homo sapiens               LL220         ChIP-Seq        GENOMIC  ...
1053  Homo sapiens               LL227         ChIP-Seq        GENOMIC  ...
1054  Homo sapiens               LL228         ChIP-Seq        GENOMIC  ...
968   Homo sapiens  methylC-seq_h1_r2b    Bisulfite-Seq        GENOMIC  ...

         biosample bioproject                   instrument  \
16    SAMN00004698       <NA>  Illumina Genome Analyzer II
1055  SAMN00004459       <NA>  Illumina Genome Analyzer II
1053  SAMN00004459       <NA>  Illumina Genome Analyzer II
1054  SAMN00004459       <NA>  Illumina Genome Analyzer II
968   SAMN00004462       <NA>  Illumina Genome Analyzer II

                 instrument_model instrument_model_desc total_spots  \
16    Illumina Genome Analyzer II              ILLUMINA    59041190
1055  Illumina Genome Analyzer II              ILLUMINA    12466602
1053  Illumina Genome Analyzer II              ILLUMINA     7763232
1054  Illumina Genome Analyzer II              ILLUMINA     7110370
968   Illumina Genome Analyzer II              ILLUMINA   544393574

       total_size run_accession run_total_spots run_total_bases
16     1555248627     SRR018453        11642838       419142168
1055    223017616     SRR018454        12466602       448797672
1053    204130918     SRR018455         7763232       279476352
1054    185053545     SRR018456         7110370       255973320
968   32242220269     SRR018975        16729745      1455487815

[5 rows x 24 columns]
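
Step 2 keeps only the first SRP, but this DOI resolves to five. A small helper such as the hypothetical `split_ids` below turns the comma-separated cell into a list so that each study can be handled in turn. It is a sketch that assumes the cell is either a comma-separated string or a missing value (`None` or a float NaN); a pandas `pd.NA` would need `pd.isna` instead. The example string is copied from the Step 1 output.

```python
import math
from typing import List


def split_ids(cell) -> List[str]:
    """Split a comma-separated ID cell into a list; missing values give []."""
    if cell is None:
        return []
    if isinstance(cell, float) and math.isnan(cell):
        return []
    return [part.strip() for part in str(cell).split(",") if part.strip()]


srp_cell = "SRP000941,SRP003870,SRP005378,SRP010679,SRP109126"
srp_list = split_ids(srp_cell)
print(len(srp_list), srp_list[0])
```

Each entry of `srp_list` could then be passed to `db.sra_metadata` in the same way as in Step 3.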

Example 9: Batch Processing Literature

Process multiple papers from a literature search:

[11]:
# Simulate PMIDs from a literature search
pmids_from_search = ["39528918", "27373336", "30873266"]

# Extract identifiers from all papers
results = db.pmid_to_identifiers(pmids_from_search)

# Filter to papers with SRA data
papers_with_sra = results[results["srp_ids"].notna()]

print(f"Processed {len(pmids_from_search)} papers")
print(f"Found {len(papers_with_sra)} papers with SRA data")
print()
print("Papers with SRA data:")
papers_with_sra
Processed 3 papers
Found 1 papers with SRA data

Papers with SRA data:
[11]:
       pmid       pmc_id    gse_ids prjna_ids    srp_ids srr_ids srx_ids srs_ids
0  39528918  PMC11552371  GSE253406      None  SRP484103    None    None    None
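
A real literature search can return many more PMIDs than the three used here. One option, sketched below, is to split the list into fixed-size batches and call `pmid_to_identifiers` once per batch. The helper name `chunked` and the batch size are illustrative assumptions, and whether batching is needed at all depends on how pysradb issues its requests, which this notebook does not show.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


pmids_from_search = ["39528918", "27373336", "30873266"]
batches = list(chunked(pmids_from_search, size=2))
print(batches)

# Each batch could then be processed and the results combined, for example:
# frames = [db.pmid_to_identifiers(batch) for batch in batches]
# results = pd.concat(frames, ignore_index=True)
```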

Example 10: Extract Identifiers from Custom Text

You can also extract identifiers from any text:

[12]:
# Custom text with database identifiers
text = """
This study analyzed RNA-seq data from GSE81903 and SRP075720.
The BioProject accession is PRJNA319707 with representative samples
SRR3587529, SRX1800089, and SRS1467635.
"""

identifiers = db.extract_identifiers_from_text(text)

print("Identifiers found in text:")
for id_type, ids in identifiers.items():
    if ids:
        print(f"  {id_type.upper()}: {', '.join(ids)}")
Identifiers found in text:
  GSE: GSE81903
  PRJNA: PRJNA319707
  SRP: SRP075720
  SRR: SRR3587529
  SRX: SRX1800089
  SRS: SRS1467635
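
Accession numbers follow simple prefix-plus-digits patterns, so this kind of extraction can be approximated with regular expressions. The sketch below is not pysradb's implementation; the patterns and the function name `find_accessions` are assumptions chosen to match the six identifier types shown above.

```python
import re
from typing import Dict, List

# Assumed patterns: a known prefix followed by one or more digits.
PATTERNS = {
    "gse": re.compile(r"\bGSE\d+\b"),
    "prjna": re.compile(r"\bPRJNA\d+\b"),
    "srp": re.compile(r"\bSRP\d+\b"),
    "srr": re.compile(r"\bSRR\d+\b"),
    "srx": re.compile(r"\bSRX\d+\b"),
    "srs": re.compile(r"\bSRS\d+\b"),
}


def find_accessions(text: str) -> Dict[str, List[str]]:
    """Return unique matches per identifier type, in order of first appearance."""
    return {
        id_type: list(dict.fromkeys(pattern.findall(text)))
        for id_type, pattern in PATTERNS.items()
    }


sample = (
    "This study analyzed RNA-seq data from GSE81903 and SRP075720. "
    "The BioProject accession is PRJNA319707 with representative samples "
    "SRR3587529, SRX1800089, and SRS1467635."
)
found = find_accessions(sample)
for id_type, ids in found.items():
    print(f"{id_type.upper()}: {', '.join(ids)}")
```

The word boundaries (`\b`) keep a pattern from matching inside a longer token, and `dict.fromkeys` removes repeated mentions without reordering them.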

Example 11: ID Conversion Utilities

Convert between different identifier types:

[14]:
# Convert PMID to PMC
print("PMID to PMC conversion:")
pmid_pmc = db.pmid_to_pmc("27373336")
print(pmid_pmc)
print()

# Convert DOI to PMID
print("DOI to PMID conversion:")
doi_pmid = db.doi_to_pmid("10.12688/f1000research.18676.1")
print(doi_pmid)
PMID to PMC conversion:
{'27373336': 'PMC4958508'}

DOI to PMID conversion:
{'10.12688/f1000research.18676.1': '31114675'}
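
In the output above, both helpers return a dictionary keyed by the input identifier, so lookups can be chained. The sketch below follows a DOI to its PMID and then to its PMC ID using plain dictionaries copied from outputs earlier in this notebook rather than live calls; `chain_lookup` is a hypothetical helper, not part of pysradb.

```python
from typing import Dict, Optional


def chain_lookup(key: str, *mappings: Dict[str, str]) -> Optional[str]:
    """Follow `key` through each mapping in turn; return None if any step is missing."""
    current: Optional[str] = key
    for mapping in mappings:
        if current is None:
            return None
        current = mapping.get(current)
    return current


# Values copied from outputs shown in this notebook (no network calls are made here).
doi_to_pmid = {"10.12688/f1000research.18676.1": "31114675"}
pmid_to_pmc = {"31114675": "PMC6505635", "27373336": "PMC4958508"}

pmc_id = chain_lookup("10.12688/f1000research.18676.1", doi_to_pmid, pmid_to_pmc)
print(pmc_id)
```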