pysradb package

Submodules

pysradb.cli module

Command line interface for pysradb

class pysradb.cli.ArgParser(prog=None, usage=None, description=None, epilog=None, parents=[], formatter_class=<class 'argparse.HelpFormatter'>, prefix_chars='-', fromfile_prefix_chars=None, argument_default=None, conflict_handler='error', add_help=True, allow_abbrev=True, exit_on_error=True)[source]

Bases: ArgumentParser

error(message: string)[source]

Prints a usage message incorporating the message to stderr and exits.

If you override this in a subclass, it should not return – it should either exit or raise an exception.

class pysradb.cli.CustomFormatterArgP(prog, indent_increment=2, max_help_position=24, width=None)[source]

Bases: ArgumentDefaultsHelpFormatter, RawDescriptionHelpFormatter

pysradb.cli.doi_to_gse(doi_ids, saveto)[source]
pysradb.cli.doi_to_identifiers(doi_ids, saveto)[source]
pysradb.cli.doi_to_srp(doi_ids, saveto)[source]
pysradb.cli.download(out_dir, srx, srp, geo, skip_confirmation, col='public_url', use_ascp=False, threads=1)[source]
pysradb.cli.geo_matrix(accession, to_tsv, output_dir)[source]
pysradb.cli.get_geo_search_info()[source]
pysradb.cli.gse_to_gsm(gse_ids, saveto, detailed, desc, expand)[source]
pysradb.cli.gse_to_pmid(gse_ids, saveto)[source]
pysradb.cli.gse_to_srp(gse_ids, saveto, detailed, desc, expand)[source]
pysradb.cli.gsm_to_gse(gsm_ids, saveto, detailed, desc, expand)[source]
pysradb.cli.gsm_to_srp(gsm_ids, saveto, detailed, desc, expand)[source]
pysradb.cli.gsm_to_srr(gsm_ids, saveto, detailed, desc, expand)[source]
pysradb.cli.gsm_to_srs(gsm_ids, saveto, detailed, desc, expand)[source]
pysradb.cli.gsm_to_srx(gsm_ids, saveto, detailed, desc, expand)[source]
pysradb.cli.metadata(srp_id, assay, desc, detailed, expand, saveto, enrich=False, enrich_backend=None)[source]
pysradb.cli.parse_args(args=None)[source]

Argument parser

pysradb.cli.pmc_to_identifiers(pmc_ids, saveto)[source]
pysradb.cli.pmid_to_gse(pmid_ids, saveto)[source]
pysradb.cli.pmid_to_identifiers(pmid_ids, saveto)[source]
pysradb.cli.pmid_to_srp(pmid_ids, saveto)[source]
pysradb.cli.pretty_print_df(df, include_header=True)[source]

Pretty print dataframe using rich formatting

pysradb.cli.search(saveto, db, verbosity, return_max, fields)[source]
pysradb.cli.sra_to_pmid(sra_ids, saveto)[source]

Backward compatibility wrapper for sra_to_pmid

pysradb.cli.srp_to_gse(srp_id, saveto, detailed, desc, expand)[source]
pysradb.cli.srp_to_pmid(srp_ids, saveto)[source]
pysradb.cli.srp_to_srr(srp_id, saveto, detailed, desc, expand)[source]
pysradb.cli.srp_to_srs(srp_id, saveto, detailed, desc, expand)[source]
pysradb.cli.srp_to_srx(srp_id, saveto, detailed, desc, expand)[source]
pysradb.cli.srr_to_gsm(srr_ids, saveto, detailed, desc, expand)[source]
pysradb.cli.srr_to_srp(srr_ids, saveto, detailed, desc, expand)[source]
pysradb.cli.srr_to_srs(srr_ids, saveto, detailed, desc, expand)[source]
pysradb.cli.srr_to_srx(srr_ids, saveto, detailed, desc, expand)[source]
pysradb.cli.srs_to_gsm(srs_ids, saveto, detailed, desc, expand)[source]
pysradb.cli.srs_to_srx(srs_ids, saveto, detailed, desc, expand)[source]
pysradb.cli.srx_to_srp(srx_ids, saveto, detailed, desc, expand)[source]
pysradb.cli.srx_to_srr(srx_ids, saveto, detailed, desc, expand)[source]
pysradb.cli.srx_to_srs(srx_ids, saveto, detailed, desc, expand)[source]

pysradb.download module

Utility function to download data

pysradb.download.download_file(url, file_path, md5_hash=None, timeout=10, block_size=1048576, show_progress=False)[source]

Resumable download. Expect the server to support byte ranges.

Parameters:
url: string

URL

file_path: string

Local file path to store the downloaded file

md5_hash: string

Expected MD5 string of downloaded file

timeout: int

Seconds to wait before terminating request

block_size: int

Number of bytes to read per chunk (default: 1024 * 1024 = 1 MB)

show_progress: bool

Show progress bar
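
A resumable download typically asks the server only for the bytes that are still missing. The sketch below is not pysradb's implementation; `resume_headers` is a hypothetical helper that shows how an HTTP Range header could be derived from a partially downloaded file.

```python
import os


def resume_headers(file_path):
    """Build request headers that skip bytes already on disk.

    Hypothetical helper for illustration; not part of pysradb.
    """
    # Size of the partial file, or 0 when nothing has been downloaded yet.
    offset = os.path.getsize(file_path) if os.path.exists(file_path) else 0
    if offset == 0:
        return {}
    # "bytes=N-" requests everything from byte N to the end of the resource.
    return {"Range": "bytes={}-".format(offset)}
```

A server that honors the header answers with status 206 (Partial Content), and the response body can then be appended to the existing file.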

pysradb.download.get_file_size(row, url_col)[source]

Get size of file to be downloaded.

Parameters:
row: pd.DataFrame row
url_col: str

url_column

Returns:
content_length: int
pysradb.download.md5_validate_file(file_path, md5_hash)[source]

Check file content against an MD5.

Parameters:
file_path: string

Path to file

md5_hash: string

Expected md5 hash

Returns:
valid: bool

True if expected and observed md5 match
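
A check of this kind can be sketched with the standard library alone. The helper names `file_md5` and `md5_matches` below are assumptions made for illustration, and reading in blocks is one reasonable way to avoid loading a large file into memory.

```python
import hashlib


def file_md5(file_path, block_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size blocks."""
    digest = hashlib.md5()
    with open(file_path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def md5_matches(file_path, expected_md5):
    """Compare the observed digest with the expected one, ignoring case."""
    return file_md5(file_path) == expected_md5.strip().lower()
```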

pysradb.download.millify(n)[source]

Convert integer to human readable format.

Parameters:
n: int
Returns:
millidx: str

Formatted integer
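
The exact output format of millify is not documented here. As a rough illustration of the idea, the hypothetical `humanize_count` below divides by 1000 until the value fits and appends a suffix; the suffix table and the one-decimal formatting are assumptions.

```python
# Assumed suffix table; pysradb may use different labels.
SUFFIXES = ["", "K", "M", "B", "T"]


def humanize_count(n):
    """Format an integer with a thousands suffix, keeping one decimal place."""
    value = float(n)
    index = 0
    # Step up one suffix for every factor of 1000, stopping at the last one.
    while abs(value) >= 1000 and index < len(SUFFIXES) - 1:
        value /= 1000.0
        index += 1
    return "{:.1f}{}".format(value, SUFFIXES[index])
```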

pysradb.exceptions module

This file contains custom Exceptions for pysradb

exception pysradb.exceptions.IncorrectFieldException[source]

Bases: Exception

Exception raised when the user enters incorrect inputs for a flag.

exception pysradb.exceptions.MissingQueryException[source]

Bases: Exception

Exception raised when the user did not supply any query fields.

Attributes:
message: string

Error message for this Exception

pysradb.filter_attrs module

pysradb.filter_attrs.expand_sample_attribute_columns(metadata_df)[source]

Expand sample attribute columns to individual columns.

Since the sample_attribute column content can differ between rows, even rows from the same project (SRP), we explicitly iterate through the rows first to determine which additional columns need to be created.

Parameters:
metadata_df: DataFrame

Dataframe as obtained from sra_metadata or equivalent

Returns:
expanded_df: DataFrame

Dataframe with additional columns pertaining to sample_attribute appended
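
The two-pass idea described above (first collect every key, then fill every row) can be sketched on plain dictionaries. This is not the library's code: the `key: value || key: value` layout of sample_attribute and both helper names are assumptions made for the example.

```python
def parse_sample_attribute(text):
    """Split 'key: value || key: value' into a dict (assumed format)."""
    pairs = {}
    for chunk in text.split("||"):
        key, sep, value = chunk.partition(":")
        if sep:  # skip fragments that lack a 'key: value' shape
            pairs[key.strip().lower().replace(" ", "_")] = value.strip()
    return pairs


def expand_rows(rows, column="sample_attribute"):
    """Two-pass expansion over a list of dict rows."""
    parsed = [parse_sample_attribute(row.get(column) or "") for row in rows]
    # Pass 1: collect the union of keys, preserving first-seen order.
    all_keys = list(dict.fromkeys(k for p in parsed for k in p))
    # Pass 2: give every row every key, using None where a row lacks it.
    return [
        {**row, **{k: p.get(k) for k in all_keys}}
        for row, p in zip(rows, parsed)
    ]
```

Rows that lack a given attribute receive None for it, so every output row has the same set of keys.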

pysradb.filter_attrs.guess_cell_type(sample_attribute)[source]

Guess possible cell line from sample_attribute data.

Parameters:
sample_attribute: string

sample_attribute string as in the metadata column

Returns:
cell_type: string

Possible cell type of sample. Returns None if no match found.

pysradb.filter_attrs.guess_strain_type(sample_attribute)[source]

Guess strain type from sample_attribute data.

Parameters:
sample_attribute: string

sample_attribute string as in the metadata column

Returns:
strain_type: string

Possible strain type of sample. Returns None if no match found.

pysradb.filter_attrs.guess_tissue_type(sample_attribute)[source]

Guess tissue type from sample_attribute data.

Parameters:
sample_attribute: string

sample_attribute string as in the metadata column

Returns:
tissue_type: string

Possible tissue type of sample. Returns None if no match found.

pysradb.geoweb module

Utilities to interact with GEO online

class pysradb.geoweb.GEOweb[source]

Bases: object

download(links, root_url, gse, verbose=False, out_dir=None)[source]

Download GEO files.

Parameters:
links: list

List of all valid downloadable links present for a GEO ID

root_url: string

URL of the root directory for a GEO ID

gse: string

GEO ID

verbose: bool

Print file list

out_dir: string

Directory location for download

Obtain all links from the GEO FTP page.

Parameters:
gse: string

GSE ID

Returns:
links: list

List of all valid downloadable links present for a GEO ID

pysradb.geoweb.download_geo_matrix(accession, output_dir='.')[source]

Download a GEO Matrix file for a given GEO accession ID.

Args:

accession (str): GEO accession ID (e.g., ‘GSE234190’).
output_dir (str): Directory to save the downloaded file (default: current directory).

Returns:

str: Path to the downloaded file.

Raises:

Exception: If the download fails.

pysradb.geoweb.parse_geo_matrix_to_tsv(input_file, output_file)[source]

Parse a GEO Matrix file to a TSV file, extracting the expression data.

Args:

input_file (str): Path to the input GEO Matrix file (gzipped).
output_file (str): Path to save the output TSV file.

Returns:

pandas.DataFrame: The parsed expression data.
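
In a GEO series matrix file, metadata lines start with `!` and the expression table sits between the `!series_matrix_table_begin` and `!series_matrix_table_end` markers. The sketch below extracts that table using only the standard library; `read_series_matrix_table` is a hypothetical name, and the function returns plain lists rather than the DataFrame that the library function returns.

```python
import csv
import gzip


def read_series_matrix_table(input_file):
    """Return (header, rows) for the expression table of a gzipped matrix file."""
    header, rows, in_table = None, [], False
    with gzip.open(input_file, "rt", newline="") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line.startswith("!series_matrix_table_begin"):
                in_table = True
            elif line.startswith("!series_matrix_table_end"):
                break
            elif in_table and line:
                # Fields are tab-separated and may be wrapped in double quotes.
                fields = next(csv.reader([line], delimiter="\t"))
                if header is None:
                    header = fields
                else:
                    rows.append(fields)
    return header, rows
```

Writing the result as TSV is then a matter of passing the header and rows to `csv.writer` with `delimiter="\t"`.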

pysradb.metadata_enrichment module

Metadata enrichment for SRA/GEO datasets using LLMs and embeddings.

class pysradb.metadata_enrichment.EmbeddingMetadataExtractor(model_name: str = 'FremyCompany/BioLORD-2023', backend: str = 'sentence-transformers', reference_categories: Dict[str, List[str]] = None, **kwargs)[source]

Bases: MetadataExtractor

Extract metadata using embedding-based similarity matching.

extract_metadata(text: str, fields: List[str] | None = None) Dict[str, Any][source]

Extract metadata using embedding similarity.

Args:

text: Input text
fields: List of fields to extract

Returns:

Dictionary with extracted metadata

class pysradb.metadata_enrichment.LLMMetadataExtractor(backend: str = 'ollama/phi3', model: str | None = None, api_key: str | None = None, base_url: str | None = None, temperature: float = 0.0, max_retries: int = 3, **kwargs)[source]

Bases: MetadataExtractor

Extract metadata using Large Language Models via Instructor.

extract_metadata(text: str, fields: List[str] | None = None) Dict[str, Any][source]

Extract metadata from text using LLM.

Args:

text: Input text
fields: List of fields to extract

Returns:

Dictionary with extracted metadata

class pysradb.metadata_enrichment.MetadataExtractor[source]

Bases: ABC

Base class for metadata extraction from experiment descriptions.

enrich_dataframe(df: DataFrame, text_column: str | None = None, fields: List[str] | None = None, prefix: str = 'guessed_', show_progress: bool = True) DataFrame[source]

Enrich a DataFrame with extracted metadata.

Args:

df: Input DataFrame
text_column: Column containing text to analyze. If None, combines sample text columns.
fields: List of metadata fields to extract
prefix: Prefix for new columns
show_progress: Show progress bar (default: True)

Returns:

DataFrame with additional metadata columns

extract_batch(texts: List[str], fields: List[str] | None = None) List[Dict[str, Any]][source]

Extract metadata from multiple texts.

Args:

texts: List of input texts
fields: List of metadata fields to extract

Returns:

List of dictionaries with extracted metadata

abstract extract_metadata(text: str, fields: List[str] | None = None) Dict[str, Any][source]

Extract metadata from text.

Args:

text: Input text (experiment description, title, etc.)
fields: List of metadata fields to extract. If None, extract all.

Returns:

Dictionary with extracted metadata
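
The base-class pattern above (one abstract per-text method, with batch behaviour built on top of it) can be illustrated without importing pysradb. Both classes below are stand-ins written for this example: the keyword table is invented, and the way `extract_batch` loops over `extract_metadata` is an assumption about how such a base class might work.

```python
from abc import ABC, abstractmethod


class ToyExtractor(ABC):
    """Stand-in for the base-class pattern; not the pysradb class itself."""

    @abstractmethod
    def extract_metadata(self, text, fields=None):
        """Return a dict of extracted fields for one text."""

    def extract_batch(self, texts, fields=None):
        # The batch method is written once in terms of the abstract one.
        return [self.extract_metadata(text, fields) for text in texts]


class KeywordExtractor(ToyExtractor):
    """Hypothetical subclass that matches a tiny keyword table."""

    KEYWORDS = {"tissue": ["liver", "brain"], "cell_type": ["hela", "hek293"]}

    def extract_metadata(self, text, fields=None):
        lowered = text.lower()
        result = {}
        # With fields=None, every field in the keyword table is extracted.
        for field in fields or self.KEYWORDS:
            hits = [kw for kw in self.KEYWORDS.get(field, []) if kw in lowered]
            result[field] = hits[0] if hits else None
        return result
```

A subclass only has to implement `extract_metadata`; instantiating the abstract base directly raises TypeError.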

pysradb.metadata_enrichment.apply_dataframe_enrichment(df: DataFrame, method: str = 'embedding', backend: str | None = None, model: str | None = None, text_column: str | None = None, show_progress: bool = True, prefix: str = 'guessed_') DataFrame[source]

Utility function to apply metadata enrichment to a DataFrame.

This is a convenience function that handles:
- Column auto-detection
- Extractor initialization
- Error handling
- Progress display

Args:

df: Input DataFrame
method: Enrichment method (‘llm’ or ‘embedding’)
backend: Backend for the method
model: Model name
text_column: Column to use (auto-detected if None)
show_progress: Show progress bar
prefix: Prefix for new columns

Returns:

Enriched DataFrame

Example:
>>> from pysradb.metadata_enrichment import apply_dataframe_enrichment
>>> enriched_df = apply_dataframe_enrichment(
...     df,
...     method="embedding",
...     text_column="experiment_title"
... )
pysradb.metadata_enrichment.create_metadata_extractor(method: str = 'llm', backend: str | None = None, model: str | None = None, **kwargs) MetadataExtractor[source]

Factory function to create metadata extractor.

Args:

method: Extraction method (llm or embedding)
backend: Backend for the method
model: Model name
kwargs: Additional parameters (as keyword arguments)

Returns:

MetadataExtractor instance

Examples:
>>> # LLM-based with Instructor (default provider)
>>> extractor = create_metadata_extractor(method="llm")
>>>
>>> # Embedding-based (default: BioLORD-2023 for biomedical text)
>>> extractor = create_metadata_extractor(method="embedding")
pysradb.metadata_enrichment.load_ontology_reference() Dict[str, List[str]][source]

Returns comprehensive reference categories from UBERON, MONDO, and CL ontologies.

Returns:

Dictionary with ontology terms (organs, tissues, anatomical_systems, cell_types, diseases)

pysradb.search module

This file contains the search classes for the search feature.

class pysradb.search.EnaSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False)[source]

Bases: QuerySearch

Subclass of QuerySearch that implements search via querying ENA API

Methods

search()

sends the user query via requests to ENA API and stores search result as an instance attribute in the form of a pandas dataframe

show_result_statistics()

Shows summary information about search results.

visualise_results()

Generate graphs that visualise the search results.

get_plot_objects()

Get the plot objects for plots generated.

_format_query_string()

formats the input user query into a string

_format_request()

formats the request payload

_format_result(content)

formats the search query output and converts it into a pandas dataframe

See also

QuerySearch

Superclass of EnaSearch

search()[source]
class pysradb.search.GeoSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, geo_query=None, geo_dataset_type=None, geo_entry_type=None, suppress_validation=False)[source]

Bases: SraSearch

Subclass of SraSearch that can query both GEO DataSets and SRA API.

Methods

search()

sends the user query via requests to SRA, GEO DataSets, or both depending on the search query. If the query is sent to both APIs, the intersection of the two sets of query results is returned.

show_result_statistics()

Shows summary information about search results.

visualise_results()

Generate graphs that visualise the search results.

get_plot_objects()

Get the plot objects for plots generated.

_format_geo_query_string()

formats the GEO DataSets portion of the input user query into a string.

_format_geo_request()

formats the GEO DataSets request payload

_format_result(content)

formats the search query output and converts it into a pandas dataframe

See also

GeoSearch.info

GeoSearch usage details

SraSearch

Superclass of GeoSearch

QuerySearch

Superclass of SraSearch

classmethod info()[source]

Information on how to use GeoSearch.

Displays information on how to query GEO DataSets / SRA via GeoSearch, including accepted inputs for geo_query, geo_dataset_type and geo_entry_type.

Returns:
info: str

Information on how to use GeoSearch.

search()[source]

Sends the user query via requests to SRA, GEO DataSets, or both

class pysradb.search.QuerySearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False)[source]

Bases: object

This is the base class for the search feature.

This class takes as input the user’s search query, which has been tokenized by the ArgParser. The query will be sent to either SRA or ENA depending on the user’s input, and the results will be returned as a pandas dataframe.

Parameters:
verbosity: int

The level of details of the search result.

return_max: int

The maximum number of entries to be returned.

query: str

The main query string.

accession: str

A relevant study / experiment / sample / run accession number.

organism: str

Scientific name of the sample organism

layout: str

Library layout. Possible inputs: single, paired

mbases: int

Size of the sample of interest rounded to the nearest megabase.

publication_date: str

The publication date of the run in the format dd-mm-yyyy. If a date range is desired, input should be in the format of dd-mm-yyyy:dd-mm-yyyy

platform: str

Sequencing platform used for the run. Some possible inputs include: illumina, ion torrent, oxford nanopore

selection: str

Library selection. Some possible inputs: cdna, chip, dnase, pcr

source: str

Library source. Some possible inputs: genomic, metagenomic, transcriptomic

strategy: str

Library Preparation strategy. Some possible inputs: wgs, amplicon, rna seq

title: str

Title of the experiment associated with the run

suppress_validation: bool

Defaults to False. If set to True, the user input format checks are skipped. This may cause the program to behave in unexpected ways, but allows the user to run queries that do not pass the format check.

Attributes:
self.df: Pandas DataFrame

The search result belonging to this search instance

Methods

get_df()

Returns the dataframe storing this search result.

search()

Executes the search.

show_result_statistics()

Shows summary information about search results.

visualise_results()

Generate graphs that visualise the search results.

get_plot_objects()

Get the plot objects for plots generated.
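
The publication_date parameter described above accepts either a single dd-mm-yyyy date or a colon-separated range. The sketch below shows one way such input could be checked before a query is sent; `parse_publication_date` is a hypothetical helper and is not the validation that pysradb itself performs.

```python
from datetime import datetime


def parse_publication_date(value):
    """Parse 'dd-mm-yyyy' or 'dd-mm-yyyy:dd-mm-yyyy' into a (start, end) pair.

    Hypothetical helper; raises ValueError on malformed input.
    """
    parts = value.split(":")
    if len(parts) > 2:
        raise ValueError("expected a single date or start:end")
    # strptime rejects wrong separators, wrong field order and impossible dates.
    dates = [datetime.strptime(p.strip(), "%d-%m-%Y").date() for p in parts]
    start, end = dates[0], dates[-1]
    if start > end:
        raise ValueError("start date is after end date")
    return start, end
```

A single date is returned as a range whose start and end are equal, which keeps downstream code uniform.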

get_df()[source]

Getter for the search result dataframe.

get_plot_objects()[source]

Get the plot objects for plots generated.

search()[source]
show_result_statistics()[source]

Shows search result statistics.

visualise_results(graph_types=('all',), show=False, saveto='./search_plots/')[source]

Generate graphs that visualise the search results.

This method will only work if the optional dependency, matplotlib, is installed in the system.

Parameters:
graph_types: tuple

tuple containing strings representing types of graphs to generate. Possible strings: all, daterange, organism, source, selection, platform, basecount

saveto: str

directory name where the generated graphs are saved.

show: bool

Whether plotted graphs are immediately shown.

class pysradb.search.SraSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False)[source]

Bases: QuerySearch

Subclass of QuerySearch that implements search by querying NCBI Entrez API

Methods

search()

sends the user query via requests to NCBI Entrez API and returns search results as a pandas dataframe.

show_result_statistics()

Shows summary information about search results.

visualise_results()

Generate graphs that visualise the search results.

get_plot_objects()

Get the plot objects for plots generated.

get_uids()

Get NCBI uids retrieved during this search query.

_format_query_string()

formats the input user query into a string

_format_request()

formats the request payload

_format_result(content)

formats the search query output.

See also

QuerySearch

Superclass of SraSearch

get_uids()[source]

Get NCBI uids retrieved during this search query.

Note: There is a chance that some uids retrieved do not appear in the search result output (Refer to #88)

search()[source]

pysradb.sraweb module

Utilities to interact with SRA online

class pysradb.sraweb.SRAweb(api_key=None)[source]

Bases: object

bioproject_to_srp(bioproject)[source]

Convert PRJNA BioProject ID to SRP accession

Parameters:
bioproject: str

BioProject ID (e.g., ‘PRJNA810439’)

Returns:
srp_accessions: list

List of SRP accessions found

create_esummary_params(esearchresult, db='sra')[source]
doi_to_gse(dois)[source]

Get GSE identifiers from DOI(s)

Parameters:
dois: list or str

DOI(s)

Returns:
results_df: pandas.DataFrame

DataFrame with DOIs and GSE identifiers

doi_to_identifiers(dois)[source]

Extract database identifiers from articles via DOI

Parameters:
dois: list or str

DOI(s)

Returns:
results_df: pandas.DataFrame

DataFrame with DOIs, PMIDs, PMC IDs, and extracted identifiers

doi_to_pmid(dois)[source]

Convert DOI(s) to PMID(s)

Parameters:
dois: list or str

DOI(s)

Returns:
doi_pmid_mapping: dict

Mapping of DOI to PMID

doi_to_srp(dois)[source]

Get SRP identifiers from DOI(s)

Parameters:
dois: list or str

DOI(s)

Returns:
results_df: pandas.DataFrame

DataFrame with DOIs and SRP identifiers

extract_external_sources(metadata_df)[source]

Extract external source identifiers from SRA metadata

Parameters:
metadata_df: pandas.DataFrame

DataFrame containing SRA metadata

Returns:
external_sources: list

List of external source identifiers found

extract_identifiers_from_text(text)[source]

Extract GSE, PRJNA, SRP, and other identifiers from text

Parameters:
text: str

Text to search for identifiers

Returns:
identifiers: dict

Dictionary with lists of found identifiers by type
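
Identifier extraction of this kind is commonly done with regular expressions. The patterns below (a fixed prefix followed by digits) and the name `find_identifiers` are assumptions for illustration; the library may recognise more identifier types or use stricter patterns.

```python
import re

# Assumed patterns: a fixed prefix followed by one or more digits.
PATTERNS = {
    "gse": re.compile(r"\bGSE\d+\b"),
    "prjna": re.compile(r"\bPRJNA\d+\b"),
    "srp": re.compile(r"\bSRP\d+\b"),
}


def find_identifiers(text):
    """Return {type: [unique matches in order of first appearance]}."""
    return {
        name: list(dict.fromkeys(pattern.findall(text)))
        for name, pattern in PATTERNS.items()
    }
```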

fetch_bioproject_pmids(bioprojects)[source]

Fetch PMIDs for given BioProject accessions

Parameters:
bioprojects: list or str

BioProject accession(s)

Returns:
bioproject_pmids: dict

Mapping of BioProject to list of PMIDs

fetch_ena_fastq(srp)[source]

Fetch FASTQ records from ENA (EXPERIMENTAL)

Parameters:
srp: string

Study accession

Returns:
srr_url: list

List of SRR fastq urls

fetch_gds_results(gse, **kwargs)[source]
fetch_gsm_soft(gsm_ids)[source]

Fetch detailed GSM metadata in SOFT format.

Args:

gsm_ids: List of GSM accessions

Returns:

Dictionary mapping GSM accession to parsed SOFT metadata

fetch_pmc_fulltext(pmc_id)[source]

Fetch full text from PMC article

Parameters:
pmc_id: str

PMC ID (can be with or without ‘PMC’ prefix)

Returns:
fulltext: str

Full text of the article, or None if unavailable

static format_xml(string)[source]

Create a fake root to make ‘string’ a valid xml

Parameters:
string: str
Returns:
xml: str
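
A string that holds several sibling elements is not well-formed XML on its own, which is why a wrapping root helps. The sketch below shows the idea with the standard library; `wrap_in_root` and `parse_fragment` are hypothetical names, and the tag name `root` is an arbitrary choice.

```python
import xml.etree.ElementTree as ET


def wrap_in_root(fragment, root_tag="root"):
    """Wrap sibling elements in one enclosing element so they parse as XML."""
    return "<{0}>{1}</{0}>".format(root_tag, fragment)


def parse_fragment(fragment):
    """Parse a multi-root fragment and return its top-level elements."""
    return list(ET.fromstring(wrap_in_root(fragment)))
```

A fragment that begins with an XML declaration would need that declaration removed first, since a declaration is only allowed at the very start of a document.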
geo_metadata(gse, sample_attribute=False, detailed=False, expand_sample_attributes=False, include_pmids=False, enrich=False, enrich_backend='ollama/phi3', **kwargs)[source]
get_efetch_response(db, term, usehistory='y')[source]
get_esummary_response(db, term, usehistory='y')[source]
gse_to_gsm(gse, **kwargs)[source]
gse_to_pmid(gse_accessions)[source]

Get PMIDs for GSE accessions by searching PubMed Central

Parameters:
gse_accessions: list or str

GSE accession(s)

Returns:
gse_pmid_df: pandas.DataFrame

DataFrame with GSE accessions and associated PMIDs

gse_to_srp(gse, **kwargs)[source]
gsm_to_gse(gsm, **kwargs)[source]
gsm_to_srp(gsm, **kwargs)[source]
gsm_to_srr(gsm, **kwargs)[source]
gsm_to_srs(gsm, **kwargs)[source]

Get SRS for a GSM

gsm_to_srx(gsm, **kwargs)[source]

Get SRX for a GSM

metadata(accession, **kwargs)[source]

Unified method to fetch metadata for SRA or GEO accessions.

Automatically detects accession type and calls appropriate method.

Args:

accession: SRP/GSE accession(s); can be a string or a list
kwargs: Additional parameters passed to sra_metadata() or geo_metadata() (e.g., detailed, enrich, enrich_backend, sample_attribute, etc.)

Returns:

DataFrame with metadata (enriched if enrich=True)

Examples:
>>> client = SRAweb()
>>> df = client.metadata("GSE286254", detailed=True, enrich=True)
>>> df = client.metadata("SRP253951", detailed=True, enrich=True)
>>> df = client.metadata(["GSE286254", "GSE147507"], enrich=True)
pmc_to_identifiers(pmc_ids, convert_missing=True)[source]

Extract database identifiers from PMC articles

Parameters:
pmc_ids: list or str

PMC ID(s) (can be with or without ‘PMC’ prefix)

convert_missing: bool

If True, automatically convert GSE↔SRP when one is found but not the other. Default: True

Returns:
results_df: pandas.DataFrame

DataFrame with PMC IDs and extracted identifiers

pmid_to_gse(pmids)[source]

Get GSE identifiers from PMID(s)

Parameters:
pmids: list or str

PMID(s)

Returns:
results_df: pandas.DataFrame

DataFrame with PMIDs and GSE identifiers

pmid_to_identifiers(pmids)[source]

Extract database identifiers from PubMed articles via PMC

Parameters:
pmids: list or str

PMID(s)

Returns:
results_df: pandas.DataFrame

DataFrame with PMIDs, PMC IDs, and extracted identifiers

pmid_to_pmc(pmids)[source]

Convert PMID(s) to PMC ID(s)

Parameters:
pmids: list or str

PMID(s)

Returns:
pmid_pmc_mapping: dict

Mapping of PMID to PMC ID

pmid_to_srp(pmids)[source]

Get SRP identifiers from PMID(s)

Parameters:
pmids: list or str

PMID(s)

Returns:
results_df: pandas.DataFrame

DataFrame with PMIDs and SRP identifiers

search(*args, **kwargs)[source]
search_pmc_for_external_sources(external_sources)[source]

Search PubMed Central for PMIDs using external source identifiers

Parameters:
external_sources: list

List of external source identifiers

Returns:
pmids: list

List of PMIDs found

sra_metadata(srp, sample_attribute=False, detailed=False, expand_sample_attributes=False, output_read_lengths=False, include_pmids=False, enrich=False, enrich_backend='ollama/phi3', **kwargs)[source]
sra_to_pmid(sra_accessions)[source]

Get PMIDs for SRA accessions (backward compatibility wrapper)

Parameters:
sra_accessions: list or str

SRA accession(s) - can be SRP, SRR, SRX, or SRS

Returns:
sra_pmid_df: pandas.DataFrame

DataFrame with SRA accessions and associated PMIDs

srp_to_gse(srp, **kwargs)[source]

Get GSE for a SRP

srp_to_pmid(srp_accessions)[source]

Get PMIDs associated with SRP accessions

Parameters:
srp_accessions: list or str

SRP accession(s)

Returns:
srp_pmid_df: pandas.DataFrame

DataFrame with SRP accessions and associated PMIDs

srp_to_srr(srp, **kwargs)[source]

Get SRR for a SRP

srp_to_srs(srp, **kwargs)[source]

Get SRS for a SRP

srp_to_srx(srp, **kwargs)[source]

Get SRX for a SRP

srr_to_gsm(srr, **kwargs)[source]

Get GSM for a SRR

srr_to_pmid(srr)[source]

Get PMIDs for Run Accessions (SRR)

srr_to_srp(srr, **kwargs)[source]

Get SRP for a SRR

srr_to_srs(srr, **kwargs)[source]

Get SRS for a SRR

srr_to_srx(srr, **kwargs)[source]

Get SRX for a SRR

srs_to_gsm(srs, **kwargs)[source]

Get GSM for a SRS

srs_to_pmid(srs)[source]

Get PMIDs for Sample Accessions (SRS)

srs_to_srx(srs, **kwargs)[source]

Get SRX for a SRS

srx_to_gsm(srx, **kwargs)[source]
srx_to_pmid(srx)[source]

Get PMIDs for Experiment Accessions (SRX)

srx_to_srp(srx, **kwargs)[source]

Get SRP for a SRX

srx_to_srr(srx, **kwargs)[source]

Get SRR for a SRX

srx_to_srs(srx, **kwargs)[source]

Get SRS for a SRX

static xml_to_json(xml)[source]

Convert xml to json.

Parameters:
xml: str

Input XML

Returns:
xml_dict: dict

Parsed xml as dict

pysradb.sraweb.get_retmax(n_records, retmax=500)[source]

Get retstart and retmax till n_records are exhausted
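
Paging through a result set with retstart and retmax amounts to walking offsets in fixed steps, with a shorter final page. The generator below is a sketch of that idea only; the name `paginate` and the shape of what it yields are assumptions and may differ from what get_retmax returns.

```python
def paginate(n_records, page_size=500):
    """Yield (retstart, retmax) pairs that together cover n_records."""
    for start in range(0, n_records, page_size):
        # The last page is shortened so it never runs past n_records.
        yield start, min(page_size, n_records - start)
```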

pysradb.sraweb.xmlescape(data)[source]

pysradb.taxid2name module

pysradb.utils module

class pysradb.utils.TqdmUpTo(*_, **__)[source]

Bases: tqdm

Alternative Class-based version of the above. Provides update_to(n) which uses tqdm.update(delta_n). Inspired by [twine#242](https://github.com/pypa/twine/pull/242), [here](https://github.com/pypa/twine/commit/42e55e06).

Credits: https://github.com/tqdm/tqdm/blob/69326b718905816bb827e0e66c5508c9c04bc06c/examples/tqdm_wget.py

update_to(b=1, bsize=1, tsize=None)[source]
b: int, optional

Number of blocks transferred so far [default: 1].

bsize: int, optional

Size of each block (in tqdm units) [default: 1].

tsize: int, optional

Total size (in tqdm units). If [default: None] remains unchanged.

pysradb.utils.confirm(preceeding_text)[source]

Confirm user input.

Parameters:
preceeding_text: str

Text to print

Returns:
response: bool
pysradb.utils.copyfileobj(fsrc, fdst, bufsize=16384, filesize=None, desc='')[source]

Copy file object with a progress bar.

Parameters:
fsrc: filehandle

Input file handle

fdst: filehandle

Output file handle

bufsize: int

Length of output buffer

filesize: int

Input file size

desc: string

Description for tqdm status

pysradb.utils.get_gzip_uncompressed_size(filepath)[source]

Get uncompressed size of a .gz file

Parameters:
filepath: string

Path to input file

Returns:
filesize: int

Uncompressed file size
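
One common way to get this number without decompressing is to read the ISIZE field, the last four bytes of a gzip file, which stores the uncompressed length modulo 2**32 as a little-endian integer. Whether pysradb uses this approach is an assumption; `gzip_isize` is a hypothetical name.

```python
import struct


def gzip_isize(filepath):
    """Read the ISIZE trailer: uncompressed length modulo 2**32."""
    with open(filepath, "rb") as handle:
        handle.seek(-4, 2)  # whence=2 seeks relative to the end of the file
        return struct.unpack("<I", handle.read(4))[0]
```

The value is only reliable for single-member archives whose uncompressed size is below 4 GiB.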

pysradb.utils.mkdir_p(path)[source]

Python version of mkdir -p

Parameters:
path: string

Path to directory to create

pysradb.utils.order_dataframe(df, columns)[source]

Order a dataframe

Order a dataframe by moving the given columns to the front

Parameters:
df: Dataframe

Dataframe

columns: list

List of columns that need to be put in front

pysradb.utils.path_leaf(path)[source]

Get path’s tail from a filepath.

Parameters:
path: string

Filepath

Returns:
tail: string

Filename
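
Taking the tail of a path has two small pitfalls: trailing separators and Windows-style backslashes. The sketch below handles both by using `ntpath`, which splits on either separator; `leaf` is a hypothetical name and this is not necessarily how path_leaf is written.

```python
import ntpath


def leaf(path):
    """Return the last path component, tolerating a trailing separator."""
    head, tail = ntpath.split(path)
    # An empty tail means the path ended with a separator; fall back to head.
    return tail or ntpath.basename(head)
```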

pysradb.utils.requests_3_retries()[source]

Generates a requests session object that allows 3 retries.

Returns:
session: requests.Session

requests session object that allows 3 retries for server-side errors.

pysradb.utils.run_command(command, verbose=False)[source]

Run a shell command

pysradb.utils.scientific_name_to_taxid(name)[source]

Converts a scientific name to its corresponding taxonomy ID.

Parameters:
name: str

Scientific name of interest.

Returns:
taxid: str

Taxonomy ID of the scientific name.

Raises:
IncorrectFieldException

If the scientific name cannot be found.

pysradb.utils.unique(sequence)[source]

Get unique elements from a list maintaining the order.

Parameters:
sequence: list
Returns:
unique_list: list

List with unique elements maintaining the order
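
Order-preserving de-duplication can be sketched in one line, because dictionaries keep insertion order. `unique_in_order` is a hypothetical name, and this version assumes the elements are hashable.

```python
def unique_in_order(items):
    """Drop repeated elements while keeping the first occurrence of each."""
    # dict.fromkeys keeps one key per distinct element, in first-seen order.
    return list(dict.fromkeys(items))
```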

Module contents

Top-level package for pysradb.