pysradb package¶
Submodules¶
pysradb.cli module¶
Command line interface for pysradb
- class pysradb.cli.ArgParser(prog=None, usage=None, description=None, epilog=None, parents=[], formatter_class=<class 'argparse.HelpFormatter'>, prefix_chars='-', fromfile_prefix_chars=None, argument_default=None, conflict_handler='error', add_help=True, allow_abbrev=True, exit_on_error=True)[source]¶
Bases:
ArgumentParser
- class pysradb.cli.CustomFormatterArgP(prog, indent_increment=2, max_help_position=24, width=None)[source]¶
Bases:
ArgumentDefaultsHelpFormatter,RawDescriptionHelpFormatter
- pysradb.cli.download(out_dir, srx, srp, geo, skip_confirmation, col='public_url', use_ascp=False, threads=1)[source]¶
- pysradb.cli.metadata(srp_id, assay, desc, detailed, expand, saveto, enrich=False, enrich_backend=None)[source]¶
pysradb.download module¶
Utility function to download data
- pysradb.download.download_file(url, file_path, md5_hash=None, timeout=10, block_size=1048576, show_progress=False)[source]¶
Resumable download. Expects the server to support byte ranges.
- Parameters:
- url: string
URL
- file_path: string
Local file path to store the downloaded file
- md5_hash: string
Expected MD5 string of downloaded file
- timeout: int
Seconds to wait before terminating request
- block_size: int
Chunk size in bytes to read (default: 1024 * 1024 = 1 MB)
- show_progress: bool
Show progress bar
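A minimal sketch of the two ideas behind a resumable, checksum-verified download, assuming the standard HTTP `Range` header and a chunked MD5 comparison. The helper names `resume_headers` and `md5_matches` are hypothetical and are not part of pysradb; the library's own implementation may differ.

```python
import hashlib
import os


def resume_headers(file_path):
    """Build a Range header that asks only for the bytes not yet on disk."""
    existing = os.path.getsize(file_path) if os.path.exists(file_path) else 0
    # "bytes=N-" requests everything from offset N to the end of the resource.
    return {"Range": f"bytes={existing}-"} if existing else {}


def md5_matches(file_path, md5_hash, block_size=1024 * 1024):
    """Hash the file in block_size chunks and compare with the expected MD5."""
    digest = hashlib.md5()
    with open(file_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(block_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == md5_hash.lower()
```

Reading in fixed-size blocks keeps memory use bounded regardless of file size, which is presumably why `block_size` is exposed as a parameter.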
- pysradb.download.get_file_size(row, url_col)[source]¶
Get size of file to be downloaded.
- Parameters:
- row: pd.DataFrame row
- url_col: str
url_column
- Returns:
- content_length: int
pysradb.exceptions module¶
This file contains custom Exceptions for pysradb
pysradb.filter_attrs module¶
- pysradb.filter_attrs.expand_sample_attribute_columns(metadata_df)[source]¶
Expand sample attribute columns to individual columns.
Since the sample_attribute column content can differ between rows, even when they come from the same project (SRP), we explicitly iterate through the rows first to determine which additional columns need to be created.
- Parameters:
- metadata_df: DataFrame
Dataframe as obtained from sra_metadata or equivalent
- Returns:
- expanded_df: DataFrame
Dataframe with additional columns pertaining to sample_attribute appended
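The two-pass idea described above can be sketched with plain dictionaries. This is an illustration only: the function name `expand_attributes` is hypothetical, and the ` || ` and `: ` delimiters are assumptions about how a sample_attribute string is laid out, not something this page specifies.

```python
def expand_attributes(rows, sep=" || ", kv_sep=": "):
    """Split each sample_attribute string into its own columns.

    `sep` and `kv_sep` are assumed delimiters for this illustration.
    """
    parsed = []
    columns = []  # union of attribute names, in first-seen order
    # Pass 1: parse every row and collect the columns that must exist.
    for row in rows:
        attrs = {}
        for item in (row.get("sample_attribute") or "").split(sep):
            key, found, value = item.partition(kv_sep)
            if found:
                key = key.strip().lower().replace(" ", "_")
                attrs[key] = value.strip()
                if key not in columns:
                    columns.append(key)
        parsed.append(attrs)
    # Pass 2: give every row every column, using None where a value is absent.
    return [
        {**row, **{col: attrs.get(col) for col in columns}}
        for row, attrs in zip(rows, parsed)
    ]
```

Collecting the column union before filling values is what lets rows with different attribute sets end up with the same set of keys.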
- pysradb.filter_attrs.guess_cell_type(sample_attribute)[source]¶
Guess possible cell line from sample_attribute data.
- Parameters:
- sample_attribute: string
sample_attribute string as in the metadata column
- Returns:
- cell_type: string
Possible cell type of sample. Returns None if no match found.
pysradb.geoweb module¶
Utilities to interact with GEO online
- class pysradb.geoweb.GEOweb[source]¶
Bases:
object
- download(links, root_url, gse, verbose=False, out_dir=None)[source]¶
Download GEO files.
- Parameters:
- links: list
List of all valid downloadable links present for a GEO ID
- root_url: string
url for root directory for a GEO ID
- gse: string
GEO ID
- verbose: bool
Print file list
- out_dir: string
Directory location for download
- pysradb.geoweb.download_geo_matrix(accession, output_dir='.')[source]¶
Download a GEO Matrix file for a given GEO accession ID.
- Args:
accession (str): GEO accession ID (e.g., ‘GSE234190’).
output_dir (str): Directory to save the downloaded file (default: current directory).
- Returns:
str: Path to the downloaded file.
- Raises:
Exception: If the download fails.
- pysradb.geoweb.parse_geo_matrix_to_tsv(input_file, output_file)[source]¶
Parse a GEO Matrix file to a TSV file, extracting the expression data.
- Args:
input_file (str): Path to the input GEO Matrix file (gzipped).
output_file (str): Path to save the output TSV file.
- Returns:
pandas.DataFrame: The parsed expression data.
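As a rough illustration of what extracting the expression data involves: in a GEO series matrix file, metadata lines start with `!`, and the data table sits between the `!series_matrix_table_begin` and `!series_matrix_table_end` markers. The sketch below uses only the standard library and returns a list of rows instead of a DataFrame; `read_matrix_table` is a hypothetical name, not the library's function.

```python
import csv
import gzip
import io


def read_matrix_table(gz_bytes):
    """Return the expression table rows from gzipped series matrix content."""
    rows = []
    in_table = False
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("!series_matrix_table_begin"):
                in_table = True
            elif line.startswith("!series_matrix_table_end"):
                break
            elif in_table and line:
                # Fields are tab-separated; text fields may be double-quoted.
                rows.append(next(csv.reader([line], delimiter="\t")))
    return rows
```

The first returned row is the header (`ID_REF` followed by sample accessions), so writing the rows back out with a tab delimiter would give a TSV of the expression values.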
pysradb.metadata_enrichment module¶
Metadata enrichment for SRA/GEO datasets using LLMs and embeddings.
- class pysradb.metadata_enrichment.EmbeddingMetadataExtractor(model_name: str = 'FremyCompany/BioLORD-2023', backend: str = 'sentence-transformers', reference_categories: Dict[str, List[str]] = None, **kwargs)[source]¶
Bases:
MetadataExtractor
Extract metadata using embedding-based similarity matching.
- class pysradb.metadata_enrichment.LLMMetadataExtractor(backend: str = 'ollama/phi3', model: str | None = None, api_key: str | None = None, base_url: str | None = None, temperature: float = 0.0, max_retries: int = 3, **kwargs)[source]¶
Bases:
MetadataExtractor
Extract metadata using Large Language Models via Instructor.
- class pysradb.metadata_enrichment.MetadataExtractor[source]¶
Bases:
ABC
Base class for metadata extraction from experiment descriptions.
- enrich_dataframe(df: DataFrame, text_column: str | None = None, fields: List[str] | None = None, prefix: str = 'guessed_', show_progress: bool = True) DataFrame[source]¶
Enrich a DataFrame with extracted metadata.
- Args:
df: Input DataFrame
text_column: Column containing text to analyze. If None, combines sample text columns.
fields: List of metadata fields to extract
prefix: Prefix for new columns
show_progress: Show progress bar (default: True)
- Returns:
DataFrame with additional metadata columns
- pysradb.metadata_enrichment.apply_dataframe_enrichment(df: DataFrame, method: str = 'embedding', backend: str | None = None, model: str | None = None, text_column: str | None = None, show_progress: bool = True, prefix: str = 'guessed_') DataFrame[source]¶
Utility function to apply metadata enrichment to a DataFrame.
This is a convenience function that handles:
- Column auto-detection
- Extractor initialization
- Error handling
- Progress display
- Args:
df: Input DataFrame
method: Enrichment method (‘llm’ or ‘embedding’)
backend: Backend for the method
model: Model name
text_column: Column to use (auto-detected if None)
show_progress: Show progress bar
prefix: Prefix for new columns
- Returns:
Enriched DataFrame
- Example:
>>> from pysradb.metadata_enrichment import apply_dataframe_enrichment
>>> enriched_df = apply_dataframe_enrichment(
...     df,
...     method="embedding",
...     text_column="experiment_title"
... )
- pysradb.metadata_enrichment.create_metadata_extractor(method: str = 'llm', backend: str | None = None, model: str | None = None, **kwargs) MetadataExtractor[source]¶
Factory function to create metadata extractor.
- Args:
method: Extraction method (llm or embedding)
backend: Backend for the method
model: Model name
kwargs: Additional parameters (as keyword arguments)
- Returns:
MetadataExtractor instance
- Examples:
>>> # LLM-based with Instructor (default provider)
>>> extractor = create_metadata_extractor(method="llm")
>>>
>>> # Embedding-based (default: BioLORD-2023 for biomedical text)
>>> extractor = create_metadata_extractor(method="embedding")
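To give a feel for embedding-based similarity matching, here is a toy sketch that picks, for each field, the reference category closest to the input text by cosine similarity. It deliberately substitutes a bag-of-words count vector for a real sentence embedding, so it is not how `EmbeddingMetadataExtractor` or BioLORD-2023 computes vectors; `embed`, `cosine`, and `guess_fields` are hypothetical names, and only the `reference_categories` shape (field name mapped to a list of labels) follows the signature above.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def guess_fields(text, reference_categories):
    """For each field, return the reference label most similar to the text."""
    query = embed(text)
    return {
        field: max(labels, key=lambda label: cosine(query, embed(label)))
        for field, labels in reference_categories.items()
    }
```

With a real embedding model the structure stays the same: embed the text once, embed each candidate label, and keep the label with the highest similarity.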
pysradb.search module¶
This file contains the search classes for the search feature.
- class pysradb.search.EnaSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False)[source]¶
Bases:
QuerySearch
Subclass of QuerySearch that implements search via querying ENA API
Methods
search()
sends the user query via requests to ENA API and stores search result as an instance attribute in the form of a pandas dataframe
show_result_statistics()
Shows summary information about search results.
visualise_results()
Generate graphs that visualise the search results.
get_plot_objects():
Get the plot objects for plots generated.
_format_query_string()
formats the input user query into a string
_format_request()
formats the request payload
_format_result(content)
formats the search query output and converts it into a pandas dataframe
See also
QuerySearch: Superclass of EnaSearch
- class pysradb.search.GeoSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, geo_query=None, geo_dataset_type=None, geo_entry_type=None, suppress_validation=False)[source]¶
Bases:
SraSearch
Subclass of SraSearch that can query both GEO DataSets and SRA API.
Methods
search()
sends the user query via requests to SRA, GEO DataSets, or both depending on the search query. If the query is sent to both APIs, the intersection of the two sets of query results is returned.
show_result_statistics()
Shows summary information about search results.
visualise_results()
Generate graphs that visualise the search results.
get_plot_objects():
Get the plot objects for plots generated.
_format_geo_query_string()
formats the GEO DataSets portion of the input user query into a string.
_format_geo_request()
formats the GEO DataSets request payload
_format_result(content)
formats the search query output and converts it into a pandas dataframe
See also
GeoSearch.info: GeoSearch usage details
SraSearch: Superclass of GeoSearch
QuerySearch: Superclass of SraSearch
- class pysradb.search.QuerySearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False)[source]¶
Bases:
object
This is the base class for the search feature.
This class takes as input the user’s search query, which has been tokenized by the ArgParser. The query will be sent to either SRA or ENA depending on the user’s input, and the results will be returned as a pandas dataframe.
- Parameters:
- verbosity: integer
The level of details of the search result.
- return_max: int
The maximum number of entries to be returned.
- query: str
The main query string.
- accession: str
A relevant study / experiment / sample / run accession number.
- organism: str
Scientific name of the sample organism
- layout: str
Library layout. Possible inputs: single, paired
- mbases: int
Size of the sample of interest rounded to the nearest megabase.
- publication_date: str
The publication date of the run in the format dd-mm-yyyy. If a date range is desired, input should be in the format of dd-mm-yyyy:dd-mm-yyyy
- platform: str
Sequencing platform used for the run. Some possible inputs include: illumina, ion torrent, oxford nanopore
- selection: str
Library selection. Some possible inputs: cdna, chip, dnase, pcr
- source: str
Library source. Some possible inputs: genomic, metagenomic, transcriptomic
- strategy: str
Library Preparation strategy. Some possible inputs: wgs, amplicon, rna seq
- title: str
Title of the experiment associated with the run
- suppress_validation: bool
Defaults to False. If this is set to True, the user input format checks will be skipped. Setting this to True may cause the program to behave in unexpected ways, but allows the user to search queries that do not pass the format check.
- Attributes:
- self.df: Pandas DataFrame
The search result belonging to this search instance
Methods
get_df()
Returns the dataframe storing this search result.
search()
Executes the search.
show_result_statistics()
Shows summary information about search results.
visualise_results()
Generate graphs that visualise the search results.
get_plot_objects():
Get the plot objects for plots generated.
- visualise_results(graph_types=('all',), show=False, saveto='./search_plots/')[source]¶
Generate graphs that visualise the search results.
This method will only work if the optional dependency, matplotlib, is installed in the system.
- Parameters:
- graph_types: tuple
tuple containing strings representing types of graphs to generate. Possible strings: all, daterange, organism, source, selection, platform, basecount
- saveto: str
directory name where the generated graphs are saved.
- show: bool
Whether plotted graphs are immediately shown.
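The `publication_date` parameter of QuerySearch accepts either a single dd-mm-yyyy date or a dd-mm-yyyy:dd-mm-yyyy range. A minimal sketch of checking that format before building a query is shown below. `parse_publication_date` is a hypothetical helper, and the library's own format check (the one skipped by `suppress_validation`) may apply different rules.

```python
from datetime import datetime


def parse_publication_date(value):
    """Return (start, end) dates for 'dd-mm-yyyy' or 'dd-mm-yyyy:dd-mm-yyyy'."""
    parts = value.split(":")
    if len(parts) > 2:
        raise ValueError(f"expected one date or a start:end pair, got {value!r}")
    # strptime raises ValueError for anything that is not dd-mm-yyyy.
    dates = [datetime.strptime(part, "%d-%m-%Y").date() for part in parts]
    start, end = dates[0], dates[-1]
    if start > end:
        raise ValueError("start date must not be after end date")
    return start, end
```

A single date is returned as a range whose start and end are equal, so callers can treat both input forms the same way.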
- class pysradb.search.SraSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False)[source]¶
Bases:
QuerySearch
Subclass of QuerySearch that implements search by querying NCBI Entrez API
Methods
search()
sends the user query via requests to NCBI Entrez API and returns search results as a pandas dataframe.
show_result_statistics()
Shows summary information about search results.
visualise_results()
Generate graphs that visualise the search results.
get_plot_objects():
Get the plot objects for plots generated.
get_uids():
Get NCBI uids retrieved during this search query.
_format_query_string()
formats the input user query into a string
_format_request()
formats the request payload
_format_result(content)
formats the search query output.
See also
QuerySearch: Superclass of SraSearch
pysradb.sraweb module¶
Utilities to interact with SRA online
- class pysradb.sraweb.SRAweb(api_key=None)[source]¶
Bases:
object
- bioproject_to_srp(bioproject)[source]¶
Convert PRJNA BioProject ID to SRP accession
- Parameters:
- bioproject: str
BioProject ID (e.g., ‘PRJNA810439’)
- Returns:
- srp_accessions: list
List of SRP accessions found
- doi_to_gse(dois)[source]¶
Get GSE identifiers from DOI(s)
- Parameters:
- dois: list or str
DOI(s)
- Returns:
- results_df: pandas.DataFrame
DataFrame with DOIs and GSE identifiers
- doi_to_identifiers(dois)[source]¶
Extract database identifiers from articles via DOI
- Parameters:
- dois: list or str
DOI(s)
- Returns:
- results_df: pandas.DataFrame
DataFrame with DOIs, PMIDs, PMC IDs, and extracted identifiers
- doi_to_pmid(dois)[source]¶
Convert DOI(s) to PMID(s)
- Parameters:
- dois: list or str
DOI(s)
- Returns:
- doi_pmid_mapping: dict
Mapping of DOI to PMID
- doi_to_srp(dois)[source]¶
Get SRP identifiers from DOI(s)
- Parameters:
- dois: list or str
DOI(s)
- Returns:
- results_df: pandas.DataFrame
DataFrame with DOIs and SRP identifiers
- extract_external_sources(metadata_df)[source]¶
Extract external source identifiers from SRA metadata
- Parameters:
- metadata_df: pandas.DataFrame
DataFrame containing SRA metadata
- Returns:
- external_sources: list
List of external source identifiers found
- extract_identifiers_from_text(text)[source]¶
Extract GSE, PRJNA, SRP, and other identifiers from text
- Parameters:
- text: str
Text to search for identifiers
- Returns:
- identifiers: dict
Dictionary with lists of found identifiers by type
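A hedged sketch of how identifiers could be pulled out of free text with regular expressions. The three patterns and the key names below are assumptions made for illustration; the library's method may recognise more identifier types and use different dictionary keys.

```python
import re

# Assumed accession patterns for this sketch; the library may match more types.
PATTERNS = {
    "gse": re.compile(r"\bGSE\d+\b"),
    "prjna": re.compile(r"\bPRJNA\d+\b"),
    "srp": re.compile(r"\bSRP\d+\b"),
}


def find_identifiers(text):
    """Return a dict of sorted, de-duplicated identifiers keyed by type."""
    return {name: sorted(set(p.findall(text))) for name, p in PATTERNS.items()}
```

The word-boundary anchors keep a pattern from matching inside a longer token, and the set removes accessions that an article mentions more than once.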
- fetch_bioproject_pmids(bioprojects)[source]¶
Fetch PMIDs for given BioProject accessions
- Parameters:
- bioprojects: list or str
BioProject accession(s)
- Returns:
- bioproject_pmids: dict
Mapping of BioProject to list of PMIDs
- fetch_ena_fastq(srp)[source]¶
Fetch FASTQ records from ENA (EXPERIMENTAL)
- Parameters:
- srp: string
Study accession
- Returns:
- srr_url: list
List of SRR fastq urls
- fetch_gsm_soft(gsm_ids)[source]¶
Fetch detailed GSM metadata in SOFT format.
- Args:
gsm_ids: List of GSM accessions
- Returns:
Dictionary mapping GSM accession to parsed SOFT metadata
- fetch_pmc_fulltext(pmc_id)[source]¶
Fetch full text from PMC article
- Parameters:
- pmc_id: str
PMC ID (can be with or without ‘PMC’ prefix)
- Returns:
- fulltext: str
Full text of the article, or None if unavailable
- static format_xml(string)[source]¶
Create a fake root to make ‘string’ a valid xml
- Parameters:
- string: str
- Returns:
- xml: str
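The reason for a fake root is that an XML document must have exactly one top-level element, so a fragment with several sibling elements cannot be parsed as-is. The sketch below shows the idea with the standard library. `wrap_in_root` and the `root` tag name are assumptions for illustration, not the library's actual choice.

```python
import xml.etree.ElementTree as ET


def wrap_in_root(string, tag="root"):
    """Wrap an XML fragment in one enclosing element so it can be parsed."""
    return f"<{tag}>{string}</{tag}>"


# Two sibling elements: not a valid document until they share a parent.
fragment = "<Run acc='SRR1'/><Run acc='SRR2'/>"
runs = [el.get("acc") for el in ET.fromstring(wrap_in_root(fragment))]
```

Parsing `fragment` directly with `ET.fromstring` raises `ET.ParseError`, while the wrapped string parses into a single element with two children.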
- geo_metadata(gse, sample_attribute=False, detailed=False, expand_sample_attributes=False, include_pmids=False, enrich=False, enrich_backend='ollama/phi3', **kwargs)[source]¶
- gse_to_pmid(gse_accessions)[source]¶
Get PMIDs for GSE accessions by searching PubMed Central
- Parameters:
- gse_accessions: list or str
GSE accession(s)
- Returns:
- gse_pmid_df: pandas.DataFrame
DataFrame with GSE accessions and associated PMIDs
- metadata(accession, **kwargs)[source]¶
Unified method to fetch metadata for SRA or GEO accessions.
Automatically detects accession type and calls appropriate method.
- Args:
accession: SRP/GSE accession(s) - can be string or list
kwargs: Additional parameters passed to sra_metadata() or geo_metadata() (e.g., detailed, enrich, enrich_backend, sample_attribute, etc.)
- Returns:
DataFrame with metadata (enriched if enrich=True)
- Examples:
>>> client = SRAweb()
>>> df = client.metadata("GSE286254", detailed=True, enrich=True)
>>> df = client.metadata("SRP253951", detailed=True, enrich=True)
>>> df = client.metadata(["GSE286254", "GSE147507"], enrich=True)
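One plausible way to detect the accession type is to dispatch on the accession prefix. The sketch below is only an assumption about the approach: this page does not show the actual detection rules, and `accession_kind` and `group_by_kind` are hypothetical helpers.

```python
def accession_kind(accession):
    """Classify an accession by its prefix: 'geo' for GSE, 'sra' for SRP."""
    prefix = accession.strip().upper()[:3]
    if prefix == "GSE":
        return "geo"
    if prefix == "SRP":
        return "sra"
    raise ValueError(f"unrecognised accession: {accession!r}")


def group_by_kind(accessions):
    """Accept a string or list and bucket the accessions by detected kind."""
    if isinstance(accessions, str):
        accessions = [accessions]
    groups = {"geo": [], "sra": []}
    for accession in accessions:
        groups[accession_kind(accession)].append(accession)
    return groups
```

Normalising a single string to a one-element list up front lets the rest of the code handle both accepted input forms identically.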
- pmc_to_identifiers(pmc_ids, convert_missing=True)[source]¶
Extract database identifiers from PMC articles
- Parameters:
- pmc_ids: list or str
PMC ID(s) (can be with or without ‘PMC’ prefix)
- convert_missing: bool
If True, automatically convert GSE↔SRP when one is found but not the other. Default: True
- Returns:
- results_df: pandas.DataFrame
DataFrame with PMC IDs and extracted identifiers
- pmid_to_gse(pmids)[source]¶
Get GSE identifiers from PMID(s)
- Parameters:
- pmids: list or str
PMID(s)
- Returns:
- results_df: pandas.DataFrame
DataFrame with PMIDs and GSE identifiers
- pmid_to_identifiers(pmids)[source]¶
Extract database identifiers from PubMed articles via PMC
- Parameters:
- pmids: list or str
PMID(s)
- Returns:
- results_df: pandas.DataFrame
DataFrame with PMIDs, PMC IDs, and extracted identifiers
- pmid_to_pmc(pmids)[source]¶
Convert PMID(s) to PMC ID(s)
- Parameters:
- pmids: list or str
PMID(s)
- Returns:
- pmid_pmc_mapping: dict
Mapping of PMID to PMC ID
- pmid_to_srp(pmids)[source]¶
Get SRP identifiers from PMID(s)
- Parameters:
- pmids: list or str
PMID(s)
- Returns:
- results_df: pandas.DataFrame
DataFrame with PMIDs and SRP identifiers
- search_pmc_for_external_sources(external_sources)[source]¶
Search PubMed Central for PMIDs using external source identifiers
- Parameters:
- external_sources: list
List of external source identifiers
- Returns:
- pmids: list
List of PMIDs found
- sra_metadata(srp, sample_attribute=False, detailed=False, expand_sample_attributes=False, output_read_lengths=False, include_pmids=False, enrich=False, enrich_backend='ollama/phi3', **kwargs)[source]¶
- sra_to_pmid(sra_accessions)[source]¶
Get PMIDs for SRA accessions (backward compatibility wrapper)
- Parameters:
- sra_accessions: list or str
SRA accession(s) - can be SRP, SRR, SRX, or SRS
- Returns:
- sra_pmid_df: pandas.DataFrame
DataFrame with SRA accessions and associated PMIDs
pysradb.taxid2name module¶
pysradb.utils module¶
- class pysradb.utils.TqdmUpTo(*_, **__)[source]¶
Bases:
tqdm
Alternative Class-based version of the above. Provides update_to(n) which uses tqdm.update(delta_n). Inspired by [twine#242](https://github.com/pypa/twine/pull/242), [here](https://github.com/pypa/twine/commit/42e55e06).
Credits: https://github.com/tqdm/tqdm/blob/69326b718905816bb827e0e66c5508c9c04bc06c/examples/tqdm_wget.py
- pysradb.utils.confirm(preceeding_text)[source]¶
Confirm user input.
- Parameters:
- preceeding_text: str
Text to print
- Returns:
- response: bool
- pysradb.utils.copyfileobj(fsrc, fdst, bufsize=16384, filesize=None, desc='')[source]¶
Copy file object with a progress bar.
- Parameters:
- fsrc: filehandle
Input file handle
- fdst: filehandle
Output file handle
- bufsize: int
Length of output buffer
- filesize: int
Input file size
- desc: string
Description for tqdm status
- pysradb.utils.get_gzip_uncompressed_size(filepath)[source]¶
Get uncompressed size of a .gz file
- Parameters:
- filepath: string
Path to input file
- Returns:
- filesize: int
Uncompressed file size
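One common way to get this value without decompressing the file is to read the ISIZE field, which the gzip format stores in the last four bytes as a little-endian unsigned integer. The sketch below assumes that technique and may not match the library's implementation. `gzip_isize` is a hypothetical name.

```python
import struct


def gzip_isize(filepath):
    """Read the ISIZE trailer: uncompressed length modulo 2**32."""
    with open(filepath, "rb") as handle:
        handle.seek(-4, 2)  # position at the last four bytes of the file
        return struct.unpack("<I", handle.read(4))[0]
```

Because ISIZE is stored modulo 2**32, the value is only reliable for single-member archives whose uncompressed size is under 4 GiB.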
- pysradb.utils.mkdir_p(path)[source]¶
Python version of mkdir -p
- Parameters:
- path: string
Path to directory to create
- pysradb.utils.order_dataframe(df, columns)[source]¶
Order a dataframe
Order a dataframe by moving the given columns to the front
- Parameters:
- df: Dataframe
Dataframe
- columns: list
List of columns that need to be put in front
- pysradb.utils.path_leaf(path)[source]¶
Get path’s tail from a filepath.
- Parameters:
- path: string
Filepath
- Returns:
- tail: string
Filename
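A small sketch of extracting a path's tail so that it works for both forward-slash and backslash separators, and for paths ending in a separator. It relies on the standard library's `ntpath` module, which accepts both separator styles; `leaf` is a hypothetical name and the library may implement this differently.

```python
import ntpath


def leaf(path):
    """Return the final component of a path, ignoring a trailing separator."""
    head, tail = ntpath.split(path)
    # A trailing separator leaves tail empty, so fall back to the head's basename.
    return tail or ntpath.basename(head)
```

Using `ntpath` instead of `os.path` makes the result independent of the operating system the code runs on.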
- pysradb.utils.requests_3_retries()[source]¶
Generates a requests session object that allows 3 retries.
- Returns:
- session: requests.Session
requests session object that allows 3 retries for server-side errors.
Module contents¶
Top-level package for pysradb.