Metadata Enrichment with Optional LLM Backends

pysradb can enrich detailed metadata by combining a sentence-transformer embedding model with an LLM backend. The workflow infers standardized columns for age, sex, ethnicity, phenotype, cell_type, tissue, strain, and disease when enough source metadata is available.

This feature is optional. It requires pysradb[enrichment], an embedding model download on first use, and a configured Ollama, LM Studio, or vLLM-compatible endpoint.

[1]:

# Install pysradb if not already installed
try:
    import pysradb

    print(f"pysradb {pysradb.__version__} is already installed")
except ImportError:
    print("Installing pysradb from GitHub...")
    import sys

    !{sys.executable} -m pip install -q git+https://github.com/saketkc/pysradb
    print("pysradb installed successfully!")

pysradb 3.0.0.dev0 is already installed

[2]:

from pysradb import SRAweb

Quick Start

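Enrichment takes a single call.

Prerequisites: install the enrichment extra (pysradb[enrichment]) and prepare a local backend, for example Ollama with the granite4:3b model (ollama pull granite4:3b).

The easiest way to enrich metadata is to pass enrich=True to metadata(). The cells below first outline the optional local setup and then show the call.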

[3]:

# Optional local setup for LLM enrichment:
#   1. Install pysradb with: python -m pip install "pysradb[enrichment]"
#   2. Install Ollama from https://ollama.com/download
#   3. Run: ollama pull granite4:3b
# The documentation build does not install or start system services.

[4]:

print("Ollama setup is optional and is intentionally not run in this notebook build.")

Ollama setup is optional and is intentionally not run in this notebook build.

[5]:

print("Start Ollama locally with: ollama serve")

Start Ollama locally with: ollama serve

[6]:

print("Pull a model locally with: ollama pull granite4:3b")

Pull a model locally with: ollama pull granite4:3b
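Before requesting enrichment, it can help to confirm that the local Ollama server is running and that the model has been pulled. The sketch below is not part of pysradb. It assumes Ollama's default address (http://localhost:11434) and uses its /api/tags endpoint, which lists locally available models. The helper name ollama_models is made up for this example.

```python
import json
import urllib.error
import urllib.request


def ollama_models(base_url="http://localhost:11434", timeout=2.0):
    """Return the names of locally pulled Ollama models, or None if unreachable.

    Hypothetical helper for illustration; base_url is Ollama's default address.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a response that is not valid JSON.
        return None
    if not isinstance(payload, dict):
        return None
    return [model.get("name") for model in payload.get("models", [])]


models = ollama_models()
if models is None:
    print("Ollama is not reachable; start it with: ollama serve")
elif "granite4:3b" not in models:
    print("Model missing; pull it with: ollama pull granite4:3b")
else:
    print("Ollama is running and granite4:3b is available.")
```

Returning None rather than raising keeps the check safe to run in environments, such as a documentation build, where no server is expected.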

[7]:

db = SRAweb()

df = db.metadata("GSE155673")
display(df.head())

print("To enrich this table with Ollama, run:")
print('db.metadata("GSE155673", enrich=True, enrich_backend="ollama/granite4:3b")')

	study_accession	study_title	study_summary	organism_name	platform_accession	platform_title	experiment_type	bioproject	submission_date	supplementary_files	series_ftp	sample_accession	sample_title
0	GSE155673	Systems biological assessment of immunity to s...	CITE-seq of 7 patients with COVID-19 and 5 hea...	Homo sapiens	GPL20301	NaN	Expression profiling by high throughput sequen...	PRJNA655740	2020/08/07	CSV, MTX, TSV	ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE155nn...	GSM4712885	cov_01_RNA
1	GSE155673	Systems biological assessment of immunity to s...	CITE-seq of 7 patients with COVID-19 and 5 hea...	Homo sapiens	GPL20301	NaN	Expression profiling by high throughput sequen...	PRJNA655740	2020/08/07	CSV, MTX, TSV	ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE155nn...	GSM4712886	cov_01_antibody
2	GSE155673	Systems biological assessment of immunity to s...	CITE-seq of 7 patients with COVID-19 and 5 hea...	Homo sapiens	GPL20301	NaN	Expression profiling by high throughput sequen...	PRJNA655740	2020/08/07	CSV, MTX, TSV	ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE155nn...	GSM4712887	cov_02_RNA
3	GSE155673	Systems biological assessment of immunity to s...	CITE-seq of 7 patients with COVID-19 and 5 hea...	Homo sapiens	GPL20301	NaN	Expression profiling by high throughput sequen...	PRJNA655740	2020/08/07	CSV, MTX, TSV	ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE155nn...	GSM4712888	cov_02_antibody
4	GSE155673	Systems biological assessment of immunity to s...	CITE-seq of 7 patients with COVID-19 and 5 hea...	Homo sapiens	GPL20301	NaN	Expression profiling by high throughput sequen...	PRJNA655740	2020/08/07	CSV, MTX, TSV	ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE155nn...	GSM4712889	cov_03_RNA

To enrich this table with Ollama, run:
db.metadata("GSE155673", enrich=True, enrich_backend="ollama/granite4:3b")

Backend Configuration

Use enrich_backend to select the LLM endpoint and embedding_model to select the sentence-transformer model used to identify relevant metadata fields.

LLM Backend Example

[8]:

enrichment_example = """
from pysradb import SRAweb

db = SRAweb()
df = db.metadata(
    "GSE155673",
    enrich=True,
    enrich_backend="ollama/granite4:3b",
)
"""
print(enrichment_example)


from pysradb import SRAweb

db = SRAweb()
df = db.metadata(
    "GSE155673",
    enrich=True,
    enrich_backend="ollama/granite4:3b",
)
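Because the backend is optional, a script may want to fall back to plain metadata when enrichment cannot run. The following is a minimal sketch, assuming that a missing dependency or an unreachable backend surfaces as a Python exception from metadata(); this page does not confirm how pysradb reports such failures. The wrapper name metadata_with_fallback is hypothetical.

```python
def metadata_with_fallback(db, accession, backend="ollama/granite4:3b"):
    """Try enriched metadata first; fall back to plain metadata on failure.

    Returns a (result, was_enriched) tuple. Hypothetical wrapper, not pysradb API.
    """
    try:
        enriched = db.metadata(accession, enrich=True, enrich_backend=backend)
        return enriched, True
    except Exception as exc:
        # Broad on purpose: the exact exception types are an assumption here.
        print(f"Enrichment skipped ({type(exc).__name__}: {exc}); using plain metadata.")
        return db.metadata(accession), False


# Usage (requires network access, so it is left commented out):
# from pysradb import SRAweb
# df, was_enriched = metadata_with_fallback(SRAweb(), "GSE155673")
```

The returned flag lets downstream code decide whether the standardized columns can be expected in the table.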

[9]:

expected_cols = [
    "age",
    "sex",
    "ethnicity",
    "phenotype",
    "cell_type",
    "tissue",
    "strain",
    "disease",
]
print("Enrichment adds standardized columns such as:")
print(", ".join(expected_cols))

Enrichment adds standardized columns such as:
age, sex, ethnicity, phenotype, cell_type, tissue, strain, disease
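Since these columns are inferred only when enough source metadata is available, it is worth checking how well each one is populated after enrichment. The sketch below uses only pandas and a made-up toy table. It assumes that values which could not be inferred appear as missing values (None/NaN), which this page does not confirm. The function name enrichment_coverage is hypothetical.

```python
import pandas as pd

EXPECTED_COLS = [
    "age", "sex", "ethnicity", "phenotype",
    "cell_type", "tissue", "strain", "disease",
]


def enrichment_coverage(df, columns=EXPECTED_COLS):
    """Report, per expected column, whether it exists and how much is filled in."""
    rows = []
    for col in columns:
        present = col in df.columns
        # Fraction of non-missing values; 0.0 when the column is absent.
        filled = float(df[col].notna().mean()) if present else 0.0
        rows.append({"column": col, "present": present, "non_null_fraction": filled})
    return pd.DataFrame(rows).set_index("column")


# Toy stand-in for an enriched table (values are invented for illustration).
toy = pd.DataFrame(
    {
        "sample_title": ["sample_a", "sample_b"],
        "sex": ["female", None],
        "tissue": ["blood", "blood"],
    }
)
report = enrichment_coverage(toy)
print(report)
```

Running the same function on the DataFrame returned with enrich=True shows at a glance which standardized columns were added and which are mostly empty.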

Embedding Model Example

An embedding model is required even when an LLM backend is configured, because pysradb uses it to find the relevant metadata fields before invoking the LLM. Pass embedding_model to choose a specific sentence-transformer model.

[10]:

embedding_example = """
df = db.metadata(
    "GSE155673",
    enrich=True,
    enrich_backend="ollama/granite4:3b",
    embedding_model="abhinand/MedEmbed-large-v0.1",
)
"""
print(embedding_example)


df = db.metadata(
    "GSE155673",
    enrich=True,
    enrich_backend="ollama/granite4:3b",
    embedding_model="abhinand/MedEmbed-large-v0.1",
)
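To give some intuition for how an embedding model can pick out relevant metadata fields, the toy sketch below ranks field names by cosine similarity to a target concept. This illustrates the general technique only; it is not pysradb's implementation, and this page does not describe pysradb's matching logic. The three-dimensional vectors are hand-made stand-ins for real embeddings, and the field names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hand-made vectors standing in for sentence-transformer embeddings.
field_vectors = {
    "sample_title": [0.2, 0.1, 0.9],
    "characteristics: gender": [0.9, 0.1, 0.1],
    "source_name": [0.1, 0.9, 0.2],
}
target_vectors = {
    "sex": [1.0, 0.0, 0.0],
    "tissue": [0.0, 1.0, 0.0],
}


def best_field(target):
    """Return the field whose vector is most similar to the target concept."""
    t = target_vectors[target]
    return max(field_vectors, key=lambda name: cosine(field_vectors[name], t))


for target in target_vectors:
    print(f"{target}: {best_field(target)}")
```

With a real model, the vectors would come from encoding the field names or values, and only the best-matching fields would need to be passed on to the LLM.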