Metadata Enrichment with Optional LLM Backends¶
pysradb can enrich detailed metadata by combining a sentence-transformer embedding model with an LLM backend. The workflow infers standardized columns for age, sex, ethnicity, phenotype, cell_type, tissue, strain, and disease when enough source metadata is available.
This feature is optional. It requires pysradb[enrichment], an embedding model download on first use, and a configured Ollama, LM Studio, or vLLM-compatible endpoint.
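Because the feature is optional, it can help to check for the extra dependency before requesting enrichment. The sketch below is a minimal, hedged example: `has_module` is a hypothetical helper, and the assumption that `pysradb[enrichment]` provides the `sentence_transformers` package is inferred from the description above rather than confirmed here.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module `name` is importable here."""
    return importlib.util.find_spec(name) is not None


# Assumption: the enrichment extra installs `sentence_transformers`.
enrichment_ready = has_module("sentence_transformers")
print("sentence-transformers available:", enrichment_ready)
```

If the check prints `False`, installing the extra as described above is the likely next step.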
[1]:
# Install pysradb if not already installed
try:
    import pysradb
    print(f"pysradb {pysradb.__version__} is already installed")
except ImportError:
    print("Installing pysradb from GitHub...")
    import sys
    !{sys.executable} -m pip install -q git+https://github.com/saketkc/pysradb
    print("pysradb installed successfully!")
pysradb 3.0.0.dev0 is already installed
[2]:
from pysradb import SRAweb
Quick Start¶
One-line enrichment.
Prerequisites: install the enrichment extra and prepare a local backend, for example Ollama with ollama pull granite4:3b.
The simplest way to enrich metadata is to pass enrich=True to metadata():
[3]:
# Optional local setup for LLM enrichment:
# 1. Install pysradb with: python -m pip install "pysradb[enrichment]"
# 2. Install Ollama from https://ollama.com/download
# 3. Run: ollama pull granite4:3b
# The documentation build does not install or start system services.
[4]:
print("Ollama setup is optional and is intentionally not run in this notebook build.")
Ollama setup is optional and is intentionally not run in this notebook build.
[5]:
print("Start Ollama locally with: ollama serve")
Start Ollama locally with: ollama serve
[6]:
print("Pull a model locally with: ollama pull granite4:3b")
Pull a model locally with: ollama pull granite4:3b
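Before requesting enrichment, it may be useful to confirm that a local Ollama server is running and that the model has been pulled. The following is a minimal sketch, not part of pysradb: `list_ollama_models` is a hypothetical helper, and it assumes Ollama listens on its default address `http://localhost:11434` and exposes its model list at `/api/tags`.

```python
import json
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=2.0):
    """Return names of locally pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON response.
        return None
    return [m.get("name") for m in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama is not reachable; start it with: ollama serve")
elif "granite4:3b" not in models:
    print("Model missing; pull it with: ollama pull granite4:3b")
else:
    print("Ollama is running and granite4:3b is available")
```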
[7]:
db = SRAweb()
df = db.metadata("GSE155673")
display(df.head())
print('To enrich this table with Ollama, run:')
print('db.metadata("GSE155673", enrich=True, enrich_backend="ollama/granite4:3b")')
| | study_accession | study_title | study_summary | organism_name | platform_accession | platform_title | experiment_type | bioproject | submission_date | supplementary_files | series_ftp | sample_accession | sample_title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GSE155673 | Systems biological assessment of immunity to s... | CITE-seq of 7 patients with COVID-19 and 5 hea... | Homo sapiens | GPL20301 | NaN | Expression profiling by high throughput sequen... | PRJNA655740 | 2020/08/07 | CSV, MTX, TSV | ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE155nn... | GSM4712885 | cov_01_RNA |
| 1 | GSE155673 | Systems biological assessment of immunity to s... | CITE-seq of 7 patients with COVID-19 and 5 hea... | Homo sapiens | GPL20301 | NaN | Expression profiling by high throughput sequen... | PRJNA655740 | 2020/08/07 | CSV, MTX, TSV | ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE155nn... | GSM4712886 | cov_01_antibody |
| 2 | GSE155673 | Systems biological assessment of immunity to s... | CITE-seq of 7 patients with COVID-19 and 5 hea... | Homo sapiens | GPL20301 | NaN | Expression profiling by high throughput sequen... | PRJNA655740 | 2020/08/07 | CSV, MTX, TSV | ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE155nn... | GSM4712887 | cov_02_RNA |
| 3 | GSE155673 | Systems biological assessment of immunity to s... | CITE-seq of 7 patients with COVID-19 and 5 hea... | Homo sapiens | GPL20301 | NaN | Expression profiling by high throughput sequen... | PRJNA655740 | 2020/08/07 | CSV, MTX, TSV | ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE155nn... | GSM4712888 | cov_02_antibody |
| 4 | GSE155673 | Systems biological assessment of immunity to s... | CITE-seq of 7 patients with COVID-19 and 5 hea... | Homo sapiens | GPL20301 | NaN | Expression profiling by high throughput sequen... | PRJNA655740 | 2020/08/07 | CSV, MTX, TSV | ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE155nn... | GSM4712889 | cov_03_RNA |
To enrich this table with Ollama, run:
db.metadata("GSE155673", enrich=True, enrich_backend="ollama/granite4:3b")
Backend Configuration¶
Use enrich_backend to select the LLM endpoint and embedding_model to select the sentence-transformer model used to identify relevant metadata fields.
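The backend value used throughout this notebook, `"ollama/granite4:3b"`, reads as a provider name followed by a model name. The sketch below illustrates that reading only; `split_backend` is a hypothetical helper and not pysradb's own parser, and how other backends are spelled is not covered here.

```python
def split_backend(spec: str) -> tuple[str, str]:
    """Split a 'provider/model' string at the first slash."""
    provider, sep, model = spec.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {spec!r}")
    return provider, model


print(split_backend("ollama/granite4:3b"))  # ('ollama', 'granite4:3b')
```

Splitting at the first slash only keeps any later slashes in the model part intact.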
LLM Backend Example¶
[8]:
enrichment_example = '''
from pysradb import SRAweb
db = SRAweb()
df = db.metadata(
    "GSE155673",
    enrich=True,
    enrich_backend="ollama/granite4:3b",
)
'''
print(enrichment_example)
from pysradb import SRAweb
db = SRAweb()
df = db.metadata(
    "GSE155673",
    enrich=True,
    enrich_backend="ollama/granite4:3b",
)
[9]:
expected_cols = [
    "age",
    "sex",
    "ethnicity",
    "phenotype",
    "cell_type",
    "tissue",
    "strain",
    "disease",
]
print("Enrichment adds standardized columns such as:")
print(", ".join(expected_cols))
Enrichment adds standardized columns such as:
age, sex, ethnicity, phenotype, cell_type, tissue, strain, disease
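Since the standardized columns are only inferred when enough source metadata is available, a result may lack some of them. The sketch below shows one way to check which expected columns are absent; `missing_columns` is a hypothetical helper, and the column list passed to it is a made-up stand-in for a real table's columns.

```python
EXPECTED_COLS = [
    "age", "sex", "ethnicity", "phenotype",
    "cell_type", "tissue", "strain", "disease",
]


def missing_columns(columns):
    """Return the expected enrichment columns that are absent from `columns`."""
    available = set(columns)
    return [col for col in EXPECTED_COLS if col not in available]


# Made-up column names standing in for an enriched table.
toy_columns = ["sample_accession", "sample_title", "sex", "tissue", "disease"]
print(missing_columns(toy_columns))
```

With a real table, the same check could be run as `missing_columns(df.columns)`.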
Embedding Model Example¶
The embedding model is required even when an LLM backend is configured, because pysradb uses it to find relevant metadata fields before invoking the LLM backend.
[10]:
embedding_example = '''
df = db.metadata(
    "GSE155673",
    enrich=True,
    enrich_backend="ollama/granite4:3b",
    embedding_model="abhinand/MedEmbed-large-v0.1",
)
'''
print(embedding_example)
df = db.metadata(
    "GSE155673",
    enrich=True,
    enrich_backend="ollama/granite4:3b",
    embedding_model="abhinand/MedEmbed-large-v0.1",
)
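To give a rough sense of how an embedding model can help find relevant metadata fields, the sketch below ranks field names by cosine similarity to a query vector. This is a generic illustration, not pysradb's actual selection logic, which is not described here. The three-dimensional vectors are made up; in a real run they would come from the embedding model. `cosine` and `rank_fields` are hypothetical helpers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_fields(query_vec, field_vecs):
    """Return field names ordered from most to least similar to the query."""
    return sorted(
        field_vecs,
        key=lambda name: cosine(query_vec, field_vecs[name]),
        reverse=True,
    )


# Toy vectors standing in for real embeddings (values are made up).
field_vecs = {
    "sample_title": [0.9, 0.1, 0.0],
    "platform_title": [0.0, 0.2, 0.9],
    "study_summary": [0.6, 0.6, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedding of a target attribute

print(rank_fields(query_vec, field_vecs))
```

In this toy setup, only the top-ranked fields would be passed on to the LLM backend, which keeps the prompt focused on the most relevant text.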