Metadata Enrichment with Optional LLM Backends

pysradb can enrich detailed metadata by combining a sentence-transformer embedding model with an LLM backend. The workflow infers standardized columns for age, sex, ethnicity, phenotype, cell_type, tissue, strain, and disease when enough source metadata is available.

This feature is optional. It requires the pysradb[enrichment] extra, an embedding model that is downloaded on first use, and a configured Ollama, LM Studio, or vLLM-compatible endpoint.
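Because the feature depends on optional packages, a quick import check can save a failed run later. The following is a minimal sketch, not part of pysradb: the helper `find_missing_modules` is hypothetical, and the assumption that the enrichment extra provides a `sentence_transformers` module is inferred from the description above rather than confirmed.

```python
import importlib.util

# Assumed module names: "pysradb" is the package itself, and
# "sentence_transformers" is an assumption about what backs the embedding step.
REQUIRED_MODULES = ["pysradb", "sentence_transformers"]


def find_missing_modules(names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
    print('Try: python -m pip install "pysradb[enrichment]"')
else:
    print("All checked modules are importable.")
```

The check only confirms that the modules can be found; it does not download the embedding model or contact an LLM endpoint.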

[1]:
# Install pysradb if not already installed
try:
    import pysradb

    print(f"pysradb {pysradb.__version__} is already installed")
except ImportError:
    print("Installing pysradb from GitHub...")
    import sys

    !{sys.executable} -m pip install -q git+https://github.com/saketkc/pysradb
    print("pysradb installed successfully!")
pysradb 3.0.0.dev0 is already installed
[2]:
from pysradb import SRAweb

Quick Start

One-line enrichment.

Prerequisites: install the enrichment extra and prepare a local backend, for example Ollama with ollama pull granite4:3b.

The easiest way to enrich metadata is to pass enrich=True to metadata():

[3]:
# Optional local setup for LLM enrichment:
#   1. Install pysradb with: python -m pip install "pysradb[enrichment]"
#   2. Install Ollama from https://ollama.com/download
#   3. Run: ollama pull granite4:3b
# The documentation build does not install or start system services.
[4]:
print("Ollama setup is optional and is intentionally not run in this notebook build.")
Ollama setup is optional and is intentionally not run in this notebook build.
[5]:
print("Start Ollama locally with: ollama serve")
Start Ollama locally with: ollama serve
[6]:
print("Pull a model locally with: ollama pull granite4:3b")
Pull a model locally with: ollama pull granite4:3b
[7]:
db = SRAweb()

df = db.metadata("GSE155673")
display(df.head())

print('To enrich this table with Ollama, run:')
print('db.metadata("GSE155673", enrich=True, enrich_backend="ollama/granite4:3b")')
| | study_accession | study_title | study_summary | organism_name | platform_accession | platform_title | experiment_type | bioproject | submission_date | supplementary_files | series_ftp | sample_accession | sample_title |
| 0 | GSE155673 | Systems biological assessment of immunity to s... | CITE-seq of 7 patients with COVID-19 and 5 hea... | Homo sapiens | GPL20301 | NaN | Expression profiling by high throughput sequen... | PRJNA655740 | 2020/08/07 | CSV, MTX, TSV | ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE155nn... | GSM4712885 | cov_01_RNA |
| 1 | GSE155673 | Systems biological assessment of immunity to s... | CITE-seq of 7 patients with COVID-19 and 5 hea... | Homo sapiens | GPL20301 | NaN | Expression profiling by high throughput sequen... | PRJNA655740 | 2020/08/07 | CSV, MTX, TSV | ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE155nn... | GSM4712886 | cov_01_antibody |
| 2 | GSE155673 | Systems biological assessment of immunity to s... | CITE-seq of 7 patients with COVID-19 and 5 hea... | Homo sapiens | GPL20301 | NaN | Expression profiling by high throughput sequen... | PRJNA655740 | 2020/08/07 | CSV, MTX, TSV | ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE155nn... | GSM4712887 | cov_02_RNA |
| 3 | GSE155673 | Systems biological assessment of immunity to s... | CITE-seq of 7 patients with COVID-19 and 5 hea... | Homo sapiens | GPL20301 | NaN | Expression profiling by high throughput sequen... | PRJNA655740 | 2020/08/07 | CSV, MTX, TSV | ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE155nn... | GSM4712888 | cov_02_antibody |
| 4 | GSE155673 | Systems biological assessment of immunity to s... | CITE-seq of 7 patients with COVID-19 and 5 hea... | Homo sapiens | GPL20301 | NaN | Expression profiling by high throughput sequen... | PRJNA655740 | 2020/08/07 | CSV, MTX, TSV | ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE155nn... | GSM4712889 | cov_03_RNA |
To enrich this table with Ollama, run:
db.metadata("GSE155673", enrich=True, enrich_backend="ollama/granite4:3b")
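Before running the enrichment call printed above, it can help to confirm that a local Ollama server is reachable and that the model has been pulled. The sketch below uses only the standard library and Ollama's model-listing endpoint (`/api/tags`) on its default port 11434. The helper name `list_ollama_models` is hypothetical, and the base URL is an assumption to adjust if your server runs elsewhere.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=3):
    """Return model names reported by an Ollama server, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a response that is not valid JSON.
        return None
    return [model.get("name", "") for model in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama is not reachable; start it with: ollama serve")
elif "granite4:3b" not in models:
    print("Model not found; pull it with: ollama pull granite4:3b")
else:
    print("Ollama is running and granite4:3b is available.")
```

This check is independent of pysradb, so it can also be used to debug LM Studio or vLLM setups by swapping in the corresponding health or model-listing URL for those servers.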

Backend Configuration

Use enrich_backend to select the LLM endpoint and embedding_model to select the sentence-transformer model used to identify relevant metadata fields.
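The backend strings in this notebook follow a `provider/model` shape, as in `ollama/granite4:3b`. The sketch below shows one way to validate such a string before passing it on. The helper `split_backend` is hypothetical, and the format itself is inferred from the examples here; pysradb may parse or accept backend strings differently.

```python
def split_backend(spec):
    """Split a 'provider/model' string at the first slash (hypothetical helper)."""
    provider, sep, model = spec.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {spec!r}")
    return provider, model


print(split_backend("ollama/granite4:3b"))
# → ('ollama', 'granite4:3b')
```

Splitting at the first slash keeps any colon-separated tag, such as `:3b`, attached to the model name.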

LLM Backend Example

[8]:
enrichment_example = '''
from pysradb import SRAweb

db = SRAweb()
df = db.metadata(
    "GSE155673",
    enrich=True,
    enrich_backend="ollama/granite4:3b",
)
'''
print(enrichment_example)

from pysradb import SRAweb

db = SRAweb()
df = db.metadata(
    "GSE155673",
    enrich=True,
    enrich_backend="ollama/granite4:3b",
)

[9]:
expected_cols = [
    "age",
    "sex",
    "ethnicity",
    "phenotype",
    "cell_type",
    "tissue",
    "strain",
    "disease",
]
print("Enrichment adds standardized columns such as:")
print(", ".join(expected_cols))
Enrichment adds standardized columns such as:
age, sex, ethnicity, phenotype, cell_type, tissue, strain, disease
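Since the columns are inferred only when enough source metadata is available, it is worth checking which ones actually appear in the result. The following is a small sketch with a hypothetical helper, `missing_enrichment_columns`, run on made-up column names rather than on real enrichment output.

```python
EXPECTED_COLUMNS = [
    "age",
    "sex",
    "ethnicity",
    "phenotype",
    "cell_type",
    "tissue",
    "strain",
    "disease",
]


def missing_enrichment_columns(columns):
    """Return the expected enrichment columns absent from `columns`, in order."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# Made-up column names standing in for an enriched DataFrame's columns.
toy_columns = ["study_accession", "sample_title", "age", "sex", "tissue"]
print(missing_enrichment_columns(toy_columns))
# → ['ethnicity', 'phenotype', 'cell_type', 'strain', 'disease']
```

With a real result, pass `df.columns` instead of the toy list.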

Embedding Model Example

An embedding model is required even when an LLM backend is configured, because pysradb uses it to find relevant metadata fields before invoking the LLM backend.

[10]:
embedding_example = '''
df = db.metadata(
    "GSE155673",
    enrich=True,
    enrich_backend="ollama/granite4:3b",
    embedding_model="abhinand/MedEmbed-large-v0.1",
)
'''
print(embedding_example)

df = db.metadata(
    "GSE155673",
    enrich=True,
    enrich_backend="ollama/granite4:3b",
    embedding_model="abhinand/MedEmbed-large-v0.1",
)
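To give a feel for what finding relevant metadata fields with embeddings can look like, the sketch below ranks field names by cosine similarity to a target concept. It is an illustration only, not pysradb's implementation: the field names and the 3-dimensional vectors are made up, whereas a real run would obtain vectors from the configured embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up vectors standing in for embeddings of hypothetical metadata field names.
field_vectors = {
    "sample_title": [0.1, 0.9, 0.2],
    "donor_age_years": [0.9, 0.1, 0.1],
    "library_layout": [0.0, 0.2, 0.9],
}
# Made-up vector standing in for the embedding of the target concept "age".
target_vector = [1.0, 0.0, 0.1]

# Rank fields from most to least similar to the target.
ranked = sorted(
    field_vectors,
    key=lambda name: cosine(field_vectors[name], target_vector),
    reverse=True,
)
print(ranked)
# → ['donor_age_years', 'sample_title', 'library_layout']
```

In a setup like this, only the top-ranked fields would need to be sent to the LLM, which is one plausible reason for running the embedding step first.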