Query and Search¶
This notebook demonstrates advanced search capabilities to find SRA studies based on specific criteria.
[ ]:
# Install pysradb if not already installed
try:
    import pysradb
    print(f"pysradb {pysradb.__version__} is already installed")
except ImportError:
    print("Installing pysradb from GitHub...")
    import sys
    !{sys.executable} -m pip install -q git+https://github.com/saketkc/pysradb
    print("pysradb installed successfully!")
pysradb search¶
The pysradb search module supports querying the Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA) databases for sequencing data. The module also includes several built-in flags that can be used to fine-tune a search query.¶
[1]:
%%html
<style>
th {font-size: 16px;}
td {font-size: 14px;}
td:first-child {font-size: 15px; font-weight: 500;}
</style>
Terminal flags for the pysradb search module:¶
| Flags | Explanation |
|---|---|
| -h, --help | Displays the help message |
| --saveto | Saves the result to the file specified by the user. Supported file types: txt, tsv, csv |
| --db | Selects the database (SRA, ENA, or both SRA and GEO DataSets) to query. Default database is SRA. Accepted inputs: sra, ena, geo |
| -v, --verbosity | Determines how much detail is retrieved and shown in the search result. 0: run_accession only. 1: run_accession and experiment_description only. 2 (default): study_accession, experiment_accession, experiment_title, description, tax_id, scientific_name, library_strategy, library_source, library_selection, sample_accession, sample_title, instrument_model, run_accession, read_count, base_count. 3: everything in verbosity level 2, followed by all other retrievable information from the database |
| -m, --max | Maximum number of returned entries. Default is 20. Note: if the maximum is large, querying the SRA and GEO DataSets databases will take significantly longer due to API limits |
| -q, --query | The main query string. Note: if this flag is not used, at least one of the following flags must be supplied: |
| --accession | A relevant study / experiment / sample / run accession number |
| --organism | Scientific name of the sample organism |
| --layout | Library layout. Accepted inputs: single, paired |
| --mbases | Size of the sample rounded to the nearest megabase |
| --publication-date | The publication date of the run in the format dd-mm-yyyy. For a date range, enter the start date followed by the end date, separated by a colon ':', in the format dd-mm-yyyy:dd-mm-yyyy. Example: 01-01-2010:31-12-2010 |
| --platform | Sequencing platform used for the run. Possible inputs: illumina, ion torrent, oxford nanopore |
| --selection | Library selection. Possible inputs: cdna, chip, dnase, pcr, polya |
| --source | Library source. Possible inputs: genomic, metagenomic, transcriptomic |
| --strategy | Library preparation strategy. Possible inputs: wgs, amplicon, rna seq |
| --title | Title of the experiment associated with the run |
| --geo-query | The main query string to be sent to GEO DataSets |
| --geo-dataset-type | Dataset type. Possible inputs: expression profiling by array, expression profiling by high throughput sequencing, non coding rna profiling by high throughput sequencing |
| --geo-entry-type | Entry type. Accepted inputs: gds, gpl, gse, gsm |
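The dd-mm-yyyy and dd-mm-yyyy:dd-mm-yyyy formats expected by --publication-date are easy to mistype by hand. The following is a minimal sketch of building the string from Python date objects; format_publication_date is a hypothetical helper written for this notebook, not part of pysradb.

```python
from datetime import date
from typing import Optional


def format_publication_date(start: date, end: Optional[date] = None) -> str:
    """Return 'dd-mm-yyyy' for a single date or 'dd-mm-yyyy:dd-mm-yyyy' for a range.

    Hypothetical helper (not part of pysradb).
    """
    fmt = "%d-%m-%Y"
    if end is None:
        return start.strftime(fmt)
    if end < start:
        raise ValueError("end date must not be earlier than start date")
    return f"{start.strftime(fmt)}:{end.strftime(fmt)}"


print(format_publication_date(date(2010, 1, 1), date(2010, 12, 31)))
# prints 01-01-2010:31-12-2010
```

The returned string can be passed to --publication-date on the terminal or to the publication_date parameter in Python.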
Using pysradb search in python:¶
pysradb search organises each search query as an instance of the SraSearch, EnaSearch or GeoSearch class. These classes take the following parameters in their constructors:¶
SraSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False)
EnaSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False)
GeoSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, geo_query=None, geo_dataset_type=None, geo_entry_type=None, suppress_validation=False)
| Parameters | Explanations |
|---|---|
| verbosity | Determines how much detail is retrieved and shown in the search result (default=2). Same as -v / --verbosity on terminal |
| return_max | Maximum number of returned entries (default=20). Same as -m / --max on terminal |
| suppress_validation | Defaults to False. If set to True, the user input format checks are skipped. This may cause the program to behave in unexpected ways, but allows the user to run queries that do not pass the format check. |
Other parameters match the command line flags of the same name.
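Because the three classes share the same construct, search() and get_df() sequence, the steps can be wrapped once and reused with keyword arguments, which are easier to read than positional ones. This is only a sketch: run_search is a hypothetical helper and not part of pysradb.

```python
def run_search(search_cls, **params):
    """Build a search instance from keyword arguments, run it, and return the DataFrame.

    Hypothetical helper: search_cls is expected to be SraSearch, EnaSearch or
    GeoSearch, and params are the constructor parameters listed above.
    """
    instance = search_cls(**params)
    instance.search()
    return instance.get_df()


# Intended use (requires pysradb and network access):
#   from pysradb.search import SraSearch
#   df = run_search(SraSearch, verbosity=2, return_max=5, query="ribosome profiling")
```

Naming verbosity and return_max explicitly avoids having to remember that they are the first two positional parameters.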
To query the SRA database for ribosome profiling, expecting an output of verbosity level 2, and returning at most 5 entries, we can do the following:¶
[2]:
from pysradb.search import SraSearch
instance = SraSearch(2, 5, query="ribosome profiling")
instance.search()
df = instance.get_df()
print(df)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 5.58it/s]
study_accession experiment_accession \
0 SRP630668 SRX30794203
1 SRP630668 SRX30794202
2 SRP630668 SRX30794201
3 SRP630668 SRX30794200
4 ERP179251 ERX14883219
experiment_title sample_taxon_id \
0 shRNA treated replicate 2, Ribo-seq 9606
1 shRNA treated replicate 1, Ribo-seq 9606
2 negative control replicate 2, Ribo-seq 9606
3 negative control replicate 1, Ribo-seq 9606
4 Ribosome profiling and RNA-seq of Salmonella i... 10090
sample_scientific_name experiment_library_strategy \
0 Homo sapiens Ribo-seq
1 Homo sapiens Ribo-seq
2 Homo sapiens Ribo-seq
3 Homo sapiens Ribo-seq
4 Mus musculus OTHER
experiment_library_source experiment_library_selection sample_accession \
0 TRANSCRIPTOMIC size fractionation SRS26807700
1 TRANSCRIPTOMIC size fractionation SRS26807699
2 TRANSCRIPTOMIC size fractionation SRS26807698
3 TRANSCRIPTOMIC size fractionation SRS26807697
4 OTHER other ERS26530900
sample_alias experiment_instrument_model pool_member_spots run_1_size \
0 sh2_ribo Illumina NovaSeq X Plus 57288768 1354504305
1 sh1_ribo Illumina NovaSeq X Plus 61293382 1431873887
2 nc2_ribo Illumina NovaSeq X Plus 85858060 2060965292
3 nc1_ribo Illumina NovaSeq X Plus 61189912 1474261116
4 SAMEA119980013 NextSeq 2000 14962757 <NA>
run_1_accession run_1_total_spots run_1_total_bases
0 SRR35744894 57288768 4296657600
1 SRR35744895 61293382 4597003650
2 SRR35744896 85858060 6439354500
3 SRR35744897 61189912 4589243400
4 ERR15479323 <NA> <NA>
Quickstart¶
To query ENA instead, replace the SraSearch class with the EnaSearch class:¶
[3]:
from pysradb.search import EnaSearch
instance = EnaSearch(2, 5, "ribosome profiling")
instance.search()
df = instance.get_df()
print(df)
Empty DataFrame
Columns: []
Index: []
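A search can come back empty, as the ENA query above did, so downstream code should check df.empty before using the result. One possible pattern, sketched here with a hypothetical helper first_non_empty (not part of pysradb), is to try several search classes in order and keep the first one that returns rows.

```python
def first_non_empty(search_classes, **params):
    """Try each search class in order; return (class name, DataFrame) for the first hit.

    Hypothetical helper. Returns (None, None) if every search comes back empty.
    """
    for search_cls in search_classes:
        instance = search_cls(**params)
        instance.search()
        df = instance.get_df()
        if not df.empty:
            return search_cls.__name__, df
    return None, None


# Intended use (requires pysradb and network access):
#   from pysradb.search import EnaSearch, SraSearch
#   source, df = first_non_empty(
#       [EnaSearch, SraSearch], verbosity=2, return_max=5, query="ribosome profiling"
#   )
```

Note that the databases do not use identical column names (compare the SRA and ENA outputs in this notebook), so code that consumes the result should check which source was used.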
To query GEO DataSets instead and retrieve the metadata of linked entries in SRA:¶
[4]:
from pysradb.search import GeoSearch
instance = GeoSearch(2, 5, geo_query="ribosome profiling")
instance.search()
df = instance.get_df()
print(df)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 5.53it/s]
study_accession experiment_accession \
0 SRP619527 SRX30454639
1 SRP619527 SRX30454638
2 SRP619527 SRX30454637
3 SRP619527 SRX30454636
4 SRP618293 SRX30400919
experiment_title sample_taxon_id \
0 GSM9233466: Eef1g-cKO leptotene and zygotene_r... 10090
1 GSM9233465: Eef1g-cKO leptotene and zygotene_r... 10090
2 GSM9233464: Control leptotene and zygotene_rep... 10090
3 GSM9233463: Control leptotene and zygotene_rep... 10090
4 GSM9224243: Eef1g cKO leptotene/zygotene sperm... 10090
sample_scientific_name experiment_library_strategy \
0 Mus musculus OTHER
1 Mus musculus OTHER
2 Mus musculus OTHER
3 Mus musculus OTHER
4 Mus musculus RNA-Seq
experiment_library_source experiment_library_selection sample_accession \
0 TRANSCRIPTOMIC other SRS26489458
1 TRANSCRIPTOMIC other SRS26489457
2 TRANSCRIPTOMIC other SRS26489456
3 TRANSCRIPTOMIC other SRS26489455
4 TRANSCRIPTOMIC cDNA SRS26437528
sample_alias experiment_instrument_model pool_member_spots run_1_size \
0 GSM9233466 Illumina NovaSeq 6000 69632843 8660602662
1 GSM9233465 Illumina NovaSeq 6000 98921214 12243831179
2 GSM9233464 Illumina NovaSeq 6000 38336833 4708346735
3 GSM9233463 Illumina NovaSeq 6000 88212319 11000694382
4 GSM9224243 Illumina NovaSeq X 15236947 1826954111
run_1_accession run_1_total_spots run_1_total_bases
0 SRR35360226 69632843 20889852900
1 SRR35360227 98921214 29676364200
2 SRR35360228 38336833 11501049900
3 SRR35360229 88212319 26463695700
4 SRR35298842 15236947 4571084100
7. Querying GEO DataSets with publication_date filter and displaying publication dates in results:¶
[5]:
from pysradb.search import GeoSearch
# Search for RNA-Seq datasets published in September 2024
# Using verbosity=3 to get all available fields including publication_date
instance = GeoSearch(
    verbosity=3,
    return_max=5,
    geo_query="RNA-Seq",
    publication_date="01-09-2024:30-09-2024",
)
instance.search()
df = instance.get_df()
# Display select columns including publication_date
if not df.empty and "publication_date" in df.columns:
    cols_to_show = [
        "study_accession",
        "experiment_accession",
        "sample_scientific_name",
        "experiment_library_strategy",
        "publication_date",
    ]
    available_cols = [c for c in cols_to_show if c in df.columns]
    print(df[available_cols])
else:
    print(df)
No results found for the following search query:
SRA: {'query': 'sra gds[Filter]', 'accession': None, 'organism': None, 'layout': None, 'mbases': None, 'publication_date': '01-09-2024:30-09-2024', 'platform': None, 'selection': None, 'source': None, 'strategy': None, 'title': None}
GEO DataSets: {'query': 'RNA-Seq AND gds sra[Filter]', 'dataset_type': None, 'entry_type': None, 'publication_date': '01-09-2024:30-09-2024', 'organism': None}
Empty DataFrame
Columns: []
Index: []
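An empty result like the one above may simply mean that no records matched, but it is worth ruling out a mistyped date range first. The sketch below is a local sanity check using only the standard library; parse_publication_date is a hypothetical helper and is not how pysradb itself validates the field.

```python
from datetime import datetime


def parse_publication_date(value: str):
    """Parse 'dd-mm-yyyy' or 'dd-mm-yyyy:dd-mm-yyyy' into a (start, end) tuple of dates.

    Hypothetical helper. Raises ValueError for malformed dates or reversed ranges.
    """
    parts = value.split(":")
    if len(parts) not in (1, 2):
        raise ValueError(f"expected one date or start:end, got {value!r}")
    dates = [datetime.strptime(part.strip(), "%d-%m-%Y").date() for part in parts]
    start, end = dates[0], dates[-1]
    if end < start:
        raise ValueError(f"range ends before it starts: {value!r}")
    return start, end


start, end = parse_publication_date("01-09-2024:30-09-2024")
print((end - start).days + 1, "days in range")
```

A single date is returned as a range whose start and end are equal.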
Error Handling¶
When suppress_validation is not set to True, a query field with an invalid entry raises an IncorrectFieldException. Its message lists the acceptable inputs for the offending field (for example, “selection”) or, for an ambiguous entry, the candidate matches:¶
[6]:
# 1. Invalid query entered for "selection"
SraSearch(selection="Mudkip")
---------------------------------------------------------------------------
IncorrectFieldException Traceback (most recent call last)
Cell In[6], line 2
1 # 1. Invalid query entered for "selection"
----> 2 SraSearch(selection="Mudkip")
File /data/github/pysradb/pysradb/search.py:735, in SraSearch.__init__(self, verbosity, return_max, query, accession, organism, layout, mbases, publication_date, platform, selection, source, strategy, title, suppress_validation)
718 def __init__(
719 self,
720 verbosity=2,
(...)
733 suppress_validation=False,
734 ):
--> 735 super().__init__(
736 verbosity,
737 return_max,
738 query,
739 accession,
740 organism,
741 layout,
742 mbases,
743 publication_date,
744 platform,
745 selection,
746 source,
747 strategy,
748 title,
749 suppress_validation,
750 )
751 self.entries = {}
752 self.number_entries = 0
File /data/github/pysradb/pysradb/search.py:152, in QuerySearch.__init__(self, verbosity, return_max, query, accession, organism, layout, mbases, publication_date, platform, selection, source, strategy, title, suppress_validation)
150 raise MissingQueryException()
151 if not suppress_validation:
--> 152 self._validate_fields()
153 self.stats = {
154 "study": "-",
155 "experiment": "-",
(...)
167 "count_stdev": "-",
168 }
169 self.plot_objects = {}
File /data/github/pysradb/pysradb/search.py:449, in QuerySearch._validate_fields(self)
447 self.fields["strategy"] = output[0]
448 if message:
--> 449 raise IncorrectFieldException(message)
IncorrectFieldException: Incorrect selection: Mudkip
--selection must be one of the following:
5-methylcytidine antibody, CAGE, ChIP, ChIP-Seq, DNase, HMPR, Hybrid Selection,
Inverse rRNA, Inverse rRNA selection, MBD2 protein methyl-CpG binding domain,
MDA, MF, MNase, MSLL, Oligo-dT, PCR, PolyA, RACE, RANDOM, RANDOM PCR, RT-PCR,
Reduced Representation, Restriction Digest, cDNA, cDNA_oligo_dT, cDNA_randomPriming
other, padlock probes capture method, repeat fractionation, size fractionation,
unspecified
[7]:
# 2. Ambiguous query entered for "source":
EnaSearch(source="metagenomic viral rna ")
---------------------------------------------------------------------------
IncorrectFieldException Traceback (most recent call last)
Cell In[7], line 2
1 # 2. Ambiguous query entered for "source":
----> 2 EnaSearch(source="metagenomic viral rna ")
File /data/github/pysradb/pysradb/search.py:152, in QuerySearch.__init__(self, verbosity, return_max, query, accession, organism, layout, mbases, publication_date, platform, selection, source, strategy, title, suppress_validation)
150 raise MissingQueryException()
151 if not suppress_validation:
--> 152 self._validate_fields()
153 self.stats = {
154 "study": "-",
155 "experiment": "-",
(...)
167 "count_stdev": "-",
168 }
169 self.plot_objects = {}
File /data/github/pysradb/pysradb/search.py:449, in QuerySearch._validate_fields(self)
447 self.fields["strategy"] = output[0]
448 if message:
--> 449 raise IncorrectFieldException(message)
IncorrectFieldException: Multiple potential matches have been identified for metagenomic viral rna :
['METAGENOMIC', 'VIRAL RNA']
Please check your input.
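In a script, it is usually preferable to catch the validation error and report its message rather than let the full traceback through. The sketch below shows one way to do that; try_build is a hypothetical helper, and the import path pysradb.exceptions in the usage comment is an assumption that should be checked against the installed version.

```python
def try_build(search_cls, error_types, **params):
    """Return (instance, None) on success or (None, error message) on a validation error.

    Hypothetical helper. error_types is an exception class or a tuple of classes.
    """
    try:
        return search_cls(**params), None
    except error_types as exc:
        return None, str(exc)


# Intended use (requires pysradb):
#   from pysradb.search import SraSearch
#   from pysradb.exceptions import IncorrectFieldException  # assumed import path
#   instance, error = try_build(SraSearch, IncorrectFieldException, selection="Mudkip")
#   if error is not None:
#       print(error)  # the message lists the accepted values for --selection
```

Only the exception types passed in are caught, so unrelated errors still surface normally.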
Usage Examples:¶
1. Checking the help message on terminal:¶
[8]:
!pysradb search -h
usage: pysradb search [-h] [-o SAVETO] [-s] [-g [GRAPHS]] [-d {ena,geo,sra}]
[-v {0,1,2,3}] [--run-description] [--detailed] [-m MAX]
[-q QUERY [QUERY ...]] [-A ACCESSION]
[-O ORGANISM [ORGANISM ...]] [-L {SINGLE,PAIRED}]
[-M MBASES] [-D PUBLICATION_DATE]
[-P PLATFORM [PLATFORM ...]]
[-E SELECTION [SELECTION ...]] [-C SOURCE [SOURCE ...]]
[-S STRATEGY [STRATEGY ...]] [-T TITLE [TITLE ...]] [-I]
[-G GEO_QUERY [GEO_QUERY ...]]
[-Y GEO_DATASET_TYPE [GEO_DATASET_TYPE ...]]
[-Z GEO_ENTRY_TYPE [GEO_ENTRY_TYPE ...]]
options:
-h, --help show this help message and exit
-o SAVETO, --saveto SAVETO
Save search result dataframe to file
-s, --stats Displays some useful statistics for the search
results.
-g [GRAPHS], --graphs [GRAPHS]
Generates graphs to illustrate the search result. By
default all graphs are generated. Alternatively,
select a subset from the options below in a space-
separated string: daterange, organism, source,
selection, platform, basecount
-d {ena,geo,sra}, --db {ena,geo,sra}
Select the db API (sra, ena, or geo) to query, default
= sra. Note: pysradb search works slightly differently
when db = geo. Please refer to 'pysradb search --geo-
info' for more details.
-v {0,1,2,3}, --verbosity {0,1,2,3}
Level of search result details (0, 1, 2 or 3), default
= 2 0: run accession only 1: run accession and
experiment title 2: accession numbers, titles and
sequencing information 3: records in 2 and other
information such as download url, sample attributes,
etc
--run-description Displays run accessions and descriptions only.
Equivalent to --verbosity 1
--detailed Displays detailed search results. Equivalent to
--verbosity 3.
-m MAX, --max MAX Maximum number of entries to return, default = 20
-q QUERY [QUERY ...], --query QUERY [QUERY ...]
Main query string. Note that if no query is supplied,
at least one of the following flags must be present:
-A ACCESSION, --accession ACCESSION
Accession number
-O ORGANISM [ORGANISM ...], --organism ORGANISM [ORGANISM ...]
Scientific name of the sample organism
-L {SINGLE,PAIRED}, --layout {SINGLE,PAIRED}
Library layout. Accepts either SINGLE or PAIRED
-M MBASES, --mbases MBASES
Size of the sample rounded to the nearest megabase
-D PUBLICATION_DATE, --publication-date PUBLICATION_DATE
Publication date of the run in the format dd-mm-yyyy.
If a date range is desired, enter the start date,
followed by end date, separated by a colon ':'.
Example: 01-01-2010:31-12-2010
-P PLATFORM [PLATFORM ...], --platform PLATFORM [PLATFORM ...]
Sequencing platform
-E SELECTION [SELECTION ...], --selection SELECTION [SELECTION ...]
Library selection
-C SOURCE [SOURCE ...], --source SOURCE [SOURCE ...]
Library source
-S STRATEGY [STRATEGY ...], --strategy STRATEGY [STRATEGY ...]
Library preparation strategy
-T TITLE [TITLE ...], --title TITLE [TITLE ...]
Experiment title
-I, --geo-info Displays information on how to query GEO DataSets via
'pysradb search --db geo ...', including accepted
inputs for -G/--geo-query, -Y/--geo-dataset-type and
-Z/--geo-entry-type.
-G GEO_QUERY [GEO_QUERY ...], --geo-query GEO_QUERY [GEO_QUERY ...]
Main query string for GEO DataSet. This flag is only
used when db is set to be geo.Please refer to 'pysradb
search --geo-info' for more details.
-Y GEO_DATASET_TYPE [GEO_DATASET_TYPE ...], --geo-dataset-type GEO_DATASET_TYPE [GEO_DATASET_TYPE ...]
GEO DataSet Type. This flag is only used when --db is
set to be geo.Please refer to 'pysradb search --geo-
info' for more details.
-Z GEO_ENTRY_TYPE [GEO_ENTRY_TYPE ...], --geo-entry-type GEO_ENTRY_TYPE [GEO_ENTRY_TYPE ...]
GEO Entry Type. This flag is only used when --db is
set to be geo.Please refer to 'pysradb search --geo-
info' for more details.
5. More complex example:¶
[13]:
from pysradb.search import EnaSearch
instance = EnaSearch(
3,
100000,
organism="Escherichia coli",
layout="Paired",
mbases=10,
publication_date="01-01-2019:31-12-2021",
platform="Illumina",
selection="random",
source="Genomic",
strategy="WGS",
)
instance.search()
df = instance.get_df()
df
[13]:
|  | study_accession | experiment_accession | experiment_title | description | tax_id | scientific_name | library_strategy | library_source | library_selection | sample_accession | ... | tag | target_gene | tax_lineage | taxonomic_classification | taxonomic_identity_marker | temperature | tissue_lib | tissue_type | transposase_protocol | variety |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PRJNA549801 | SRX10904916 | Illumina MiSeq sequencing: E. coli library mad... | Illumina MiSeq sequencing: E. coli library mad... | 562 | Escherichia coli | WGS | GENOMIC | RANDOM | SAMN11656391 | ... | env_tax;env_tax:freshwater;env_tax:terrestrial... | <NA> | 1;131567;2;3379134;1224;1236;91347;543;561;562 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
| 1 | PRJNA530580 | SRX5658689 | Illumina MiSeq sequencing: Sequencing of Esche... | Illumina MiSeq sequencing: Sequencing of Esche... | 562 | Escherichia coli | WGS | GENOMIC | RANDOM | SAMN11319482 | ... | env_tax;env_tax:freshwater;env_tax:terrestrial... | <NA> | 1;131567;2;3379134;1224;1236;91347;543;561;562 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
| 2 | PRJEB34285 | ERX3552102 | Illumina MiSeq paired end sequencing: Raw read... | Illumina MiSeq paired end sequencing: Raw read... | 562 | Escherichia coli | WGS | GENOMIC | RANDOM | SAMEA5957573 | ... | env_tax;env_tax:freshwater;env_tax:terrestrial... | <NA> | 1;131567;2;3379134;1224;1236;91347;543;561;562 | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> |
3 rows × 192 columns
[ ]:
sorted(df.columns)
[16]:
# https://github.com/saketkc/pysradb/issues/221
instance = GeoSearch(
publication_date="05-09-2024:06-09-2024", return_max=100, verbosity=3
)
instance.search()
df = instance.get_df()
df
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:02<00:00, 44.88it/s]
[16]:
|  | study_accession | experiment_accession | experiment_title | sample_taxon_id | sample_scientific_name | experiment_library_strategy | experiment_library_source | experiment_library_selection | sample_accession | sample_alias | ... | study_link_1_type | study_link_1_value_1 | study_link_1_value_2 | study_study_abstract | study_study_title | study_study_type_existing_study_type | submission_accession | submission_alias | submission_center_name | submission_lab_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SRP531137 | SRX25997822 | GSM8501051: HEK293, Prm1 negative, Replicate 3... | 9606 | Homo sapiens | Hi-C | GENOMIC | other | SRS22571584 | GSM8501051 | ... | <NA> | <NA> | <NA> | Although the spatial organization of the genom... | Large-scale manipulation of radial positioning... | Other | SRA1964865 | SUB14711595 | Technion - Israel Institute of Technology | <NA> |
| 1 | SRP531137 | SRX25997821 | GSM8501050: HEK293, Prm1 negative, Replicate 2... | 9606 | Homo sapiens | Hi-C | GENOMIC | other | SRS22571583 | GSM8501050 | ... | <NA> | <NA> | <NA> | Although the spatial organization of the genom... | Large-scale manipulation of radial positioning... | Other | SRA1964865 | SUB14711595 | Technion - Israel Institute of Technology | <NA> |
| 2 | SRP531137 | SRX25997820 | GSM8501049: HEK293, Prm1 negative, Replicate 1... | 9606 | Homo sapiens | Hi-C | GENOMIC | other | SRS22571582 | GSM8501049 | ... | <NA> | <NA> | <NA> | Although the spatial organization of the genom... | Large-scale manipulation of radial positioning... | Other | SRA1964865 | SUB14711595 | Technion - Israel Institute of Technology | <NA> |
| 3 | SRP531137 | SRX25997819 | GSM8501048: HEK293, Prm1 positive DAPI low, Hi... | 9606 | Homo sapiens | Hi-C | GENOMIC | other | SRS22571581 | GSM8501048 | ... | <NA> | <NA> | <NA> | Although the spatial organization of the genom... | Large-scale manipulation of radial positioning... | Other | SRA1964865 | SUB14711595 | Technion - Israel Institute of Technology | <NA> |
| 4 | SRP531137 | SRX25997818 | GSM8501047: HEK293, Prm1 positive DAPI high, H... | 9606 | Homo sapiens | Hi-C | GENOMIC | other | SRS22571580 | GSM8501047 | ... | <NA> | <NA> | <NA> | Although the spatial organization of the genom... | Large-scale manipulation of radial positioning... | Other | SRA1964865 | SUB14711595 | Technion - Israel Institute of Technology | <NA> |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | SRP530052 | SRX25934604 | GSM8493072: HSS-3; Homo sapiens; RNA-Seq | 9606 | Homo sapiens | RNA-Seq | TRANSCRIPTOMIC | cDNA | SRS22520997 | GSM8493072 | ... | XREF_LINK | DB: pubmed | ID: 39752487 | Fluid shear stress (FSS) from blood flow sense... | cSTAR analysis identifies endothelial cell cyc... | Transcriptome Analysis | SRA1960858 | SUB14701816 | Systems Biology Ireland, University College Du... | <NA> |
| 96 | SRP530052 | SRX25934603 | GSM8493071: HSS-2; Homo sapiens; RNA-Seq | 9606 | Homo sapiens | RNA-Seq | TRANSCRIPTOMIC | cDNA | SRS22520996 | GSM8493071 | ... | XREF_LINK | DB: pubmed | ID: 39752487 | Fluid shear stress (FSS) from blood flow sense... | cSTAR analysis identifies endothelial cell cyc... | Transcriptome Analysis | SRA1960858 | SUB14701816 | Systems Biology Ireland, University College Du... | <NA> |
| 97 | SRP530052 | SRX25934602 | GSM8493070: HSS-1; Homo sapiens; RNA-Seq | 9606 | Homo sapiens | RNA-Seq | TRANSCRIPTOMIC | cDNA | SRS22520995 | GSM8493070 | ... | XREF_LINK | DB: pubmed | ID: 39752487 | Fluid shear stress (FSS) from blood flow sense... | cSTAR analysis identifies endothelial cell cyc... | Transcriptome Analysis | SRA1960858 | SUB14701816 | Systems Biology Ireland, University College Du... | <NA> |
| 98 | SRP529928 | SRX25928352 | GSM8492057: NC1 group2; Homo sapiens; RNA-Seq | 9606 | Homo sapiens | RNA-Seq | TRANSCRIPTOMIC | cDNA | SRS22515729 | GSM8492057 | ... | <NA> | <NA> | <NA> | Adrenoleukodystrophy (ALD) is a rare X-linked ... | Transcriptomic Analysis of Identical Twins wit... | Transcriptome Analysis | SRA1960558 | SUB14700372 | Southwest Hospital, Army Medical University (T... | <NA> |
| 99 | SRP529928 | SRX25928351 | GSM8492056: NC1 group1; Homo sapiens; RNA-Seq | 9606 | Homo sapiens | RNA-Seq | TRANSCRIPTOMIC | cDNA | SRS22515728 | GSM8492056 | ... | <NA> | <NA> | <NA> | Adrenoleukodystrophy (ALD) is a rare X-linked ... | Transcriptomic Analysis of Identical Twins wit... | Transcriptome Analysis | SRA1960558 | SUB14700372 | Southwest Hospital, Army Medical University (T... | <NA> |
100 rows × 644 columns
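At verbosity 3 the result has several hundred columns, which is hard to scan in one go. Many of the column names shown above share a leading word (study_, submission_, sample_, experiment_), so grouping by that prefix gives a quick overview. This is a rough sketch; group_columns_by_prefix is a hypothetical helper, not part of pysradb, and the prefix rule is only a heuristic.

```python
from collections import defaultdict


def group_columns_by_prefix(columns):
    """Group column names by the text before the first underscore (hypothetical helper)."""
    groups = defaultdict(list)
    for name in columns:
        prefix = name.split("_", 1)[0]
        groups[prefix].append(name)
    return dict(groups)


# Column names taken from the table above; with a real result, pass df.columns.
columns = ["study_accession", "study_study_title", "submission_alias", "sample_alias"]
for prefix, names in sorted(group_columns_by_prefix(columns).items()):
    print(prefix, len(names))
```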
[17]:
instance = GeoSearch(
publication_date="04-09-2024:06-09-2024", return_max=1000, verbosity=3
)
instance.search()
df = instance.get_df()
print(df["study_alias"].unique())
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:48<00:00, 20.72it/s]
['GSE276554' 'GSE276553' 'GSE267301' 'GSE276379' 'GSE276473' 'GSE276365'
'GSE276447' 'GSE276438' 'GSE219269' 'GSE276372' 'GSE276338' 'GSE276337'
'GSE276206' 'GSE276309' 'GSE276304' 'GSE276245' 'GSE276204' 'GSE276185'
'GSE276195' 'GSE276192' 'GSE276153' 'GSE276130' 'GSE276122' 'GSE276065'
'GSE276058' 'GSE276038' 'GSE276037' 'GSE275962' 'GSE275896' 'GSE275863'
'GSE275777' 'GSE275778' 'GSE275571' 'GSE253407' 'GSE275211' 'GSE274586'
'GSE274408' 'GSE273907' 'GSE273844' 'GSE273813' 'GSE253145' 'GSE272793'
'GSE272635' 'GSE272394' 'GSE247707' 'GSE247706' 'GSE272252' 'GSE271996'
'GSE271746' 'GSE271667' 'GSE271653' 'GSE264193' 'GSE245033' 'GSE268837'
'GSE267343' 'GSE267342' 'GSE266976' 'GSE266471' 'GSE264212' 'GSE264012'
'GSE263804' 'GSE263798' 'GSE263549' 'GSE263414' 'GSE263441' 'GSE262699'
'GSE262282' 'GSE262272' 'GSE262127' 'GSE262125' 'GSE262126']
6. Corresponding terminal command example, with max set to 20:¶
[ ]:
!pysradb search --db ena -m 20 -v 3 --organism Escherichia coli --layout Paired --mbases 100 --publication-date 01-01-2019:31-12-2019 --platform illumina --selection random --source Genomic --strategy wgs
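When the same search has to be launched from a script rather than a notebook cell, the command can be assembled as an argument list. The sketch below relies on the long flag names from the table of terminal flags matching the Python parameter names with underscores replaced by hyphens; build_search_command is a hypothetical helper, not part of pysradb.

```python
import shlex


def build_search_command(db="sra", max_entries=20, verbosity=2, **filters):
    """Return the argument list for a `pysradb search` call (hypothetical helper).

    Filter names use the Python parameter spelling (e.g. publication_date) and
    are converted to long flags (e.g. --publication-date). Values containing
    spaces are split into separate arguments, as in the command above.
    """
    cmd = ["pysradb", "search", "--db", db, "-m", str(max_entries), "-v", str(verbosity)]
    for name, value in filters.items():
        if value is None:
            continue  # skip unset filters
        cmd.append("--" + name.replace("_", "-"))
        cmd.extend(str(value).split())
    return cmd


cmd = build_search_command(
    db="ena",
    max_entries=20,
    verbosity=3,
    organism="Escherichia coli",
    layout="Paired",
    mbases=100,
    publication_date="01-01-2019:31-12-2019",
)
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) without going through a shell.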