{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/saketkc/pysradb/blob/develop/notebooks/07.Query_Search.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Query and Search\n", "\n", "This notebook demonstrates advanced search capabilities to find SRA studies based on specific criteria." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install pysradb if not already installed\n", "try:\n", " import pysradb\n", "\n", " print(f\"pysradb {pysradb.__version__} is already installed\")\n", "except ImportError:\n", " print(\"Installing pysradb from GitHub...\")\n", " import sys\n", "\n", " !{sys.executable} -m pip install -q git+https://github.com/saketkc/pysradb\n", " print(\"pysradb installed successfully!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## pysradb search\n", "##### The pysradb search module supports querying the Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA) databases for sequencing data. The module also includes several built-in flags that can be used to fine-tune a search query." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Terminal flags for the pysradb search module:\n", "\n", "|Flags | Explanation|\n", "|----------|------------|\n", "| -h, --help | Displays the help message |\n", "| --saveto | Saves the result in the file specified by the user.
Supported file types: txt, tsv, csv |\n", "| --db | Selects the database (SRA, ENA, or both SRA and GEO DataSets) to query. Default database is SRA. Accepted inputs: sra, ena, geo|\n", "| -v, --verbosity | This determines how much detail is retrieved and shown in the search result:
0: run_accession only
1: run_accession and experiment_description only
2: (default) study_accession, experiment_accession, experiment_title, description, tax_id, scientific_name, library_strategy,
library_source, library_selection, sample_accession, sample_title, instrument_model, run_accession, read_count, base_count
3: Everything in verbosity level 2, followed by all other retrievable information from the database|\n", "| -m, --max | Maximum number of returned entries. Default number is 20.
Note: If the maximum number set is large, querying the SRA and GEO DataSets databases will take significantly longer due to API limits|\n", "| -q, --query | The main query string.
Note: if this flag is not used, at least one of the following flags must be supplied: |\n", "| --accession | A relevant study / experiment / sample / run accession number|\n", "| --organism | Scientific name of the sample organism |\n", "| --layout | Library layout. Accepted inputs: single, paired|\n", "| --mbases | Size of the sample rounded to the nearest megabase|\n", "| --publication-date | The publication date of the run in the format dd-mm-yyyy. If a date range is desired,
enter the start date, followed by end date, separated by a colon ':' in the format dd-mm-yyyy:dd-mm-yyyy
Example: 01-01-2010:31-12-2010|\n", "| --platform | Sequencing platform used for the run. Possible inputs: illumina, ion torrent, oxford nanopore |\n", "| --selection | Library selection. Possible inputs: cdna, chip, dnase, pcr, polya |\n", "| --source | Library source. Possible inputs: genomic, metagenomic, transcriptomic |\n", "| --strategy | Library preparation strategy. Possible inputs: wgs, amplicon, rna seq |\n", "| --title | Title of the experiment associated with the run |\n", "| --geo-query | The main query string to be sent to GEO DataSets |\n", "| --geo-dataset-type | Dataset type. Possible inputs: expression profiling by array, expression profiling by high throughput sequencing, non coding rna profiling by high throughput sequencing |\n", "| --geo-entry-type | Entry type. Accepted inputs: gds, gpl, gse, gsm |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using pysradb search in Python:\n", "\n", "##### pysradb search organises each search query as an instance of one of the SraSearch, EnaSearch or GeoSearch classes.
These classes take the following parameters in their constructors: \n", "\n", "\n", "\n", "`SraSearch (verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False,) ` \n", "\n", "\n", "`EnaSearch (verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False,) ` \n", "\n", "`GeoSearch (verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, geo_query=None, geo_dataset_type=None, geo_entry_type=None, suppress_validation=False,) ` \n", "\n", "| Parameters | Explanations|\n", "|----------|------------|\n", "| verbosity | This determines how much detail is retrieved and shown in the search result (default=2). Same as -v / --verbosity on terminal |\n", "| return_max | Maximum number of returned entries (default=20). Same as -m / --max on terminal |\n", "| suppress_validation | Defaults to False. If this is set to True, the user input format checks will be skipped. Setting this to True may cause the program to behave in unexpected ways, but allows the user to run queries that do not pass the format check.|\n", "\n", "Other parameters match the command line flags of the same name.\n", "
\n", "
\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "##### To query the SRA database for ribosome profiling, expecting an output of verbosity level 2, and returning at most 5 entries, we can do the following:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/data/github/pysradb/pysradb/utils.py:14: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from tqdm.autonotebook import tqdm\n", "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 5.58it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ " study_accession experiment_accession \\\n", "0 SRP630668 SRX30794203 \n", "1 SRP630668 SRX30794202 \n", "2 SRP630668 SRX30794201 \n", "3 SRP630668 SRX30794200 \n", "4 ERP179251 ERX14883219 \n", "\n", " experiment_title sample_taxon_id \\\n", "0 shRNA treated replicate 2, Ribo-seq 9606 \n", "1 shRNA treated replicate 1, Ribo-seq 9606 \n", "2 negative control replicate 2, Ribo-seq 9606 \n", "3 negative control replicate 1, Ribo-seq 9606 \n", "4 Ribosome profiling and RNA-seq of Salmonella i... 
10090 \n", "\n", " sample_scientific_name experiment_library_strategy \\\n", "0 Homo sapiens Ribo-seq \n", "1 Homo sapiens Ribo-seq \n", "2 Homo sapiens Ribo-seq \n", "3 Homo sapiens Ribo-seq \n", "4 Mus musculus OTHER \n", "\n", " experiment_library_source experiment_library_selection sample_accession \\\n", "0 TRANSCRIPTOMIC size fractionation SRS26807700 \n", "1 TRANSCRIPTOMIC size fractionation SRS26807699 \n", "2 TRANSCRIPTOMIC size fractionation SRS26807698 \n", "3 TRANSCRIPTOMIC size fractionation SRS26807697 \n", "4 OTHER other ERS26530900 \n", "\n", " sample_alias experiment_instrument_model pool_member_spots run_1_size \\\n", "0 sh2_ribo Illumina NovaSeq X Plus 57288768 1354504305 \n", "1 sh1_ribo Illumina NovaSeq X Plus 61293382 1431873887 \n", "2 nc2_ribo Illumina NovaSeq X Plus 85858060 2060965292 \n", "3 nc1_ribo Illumina NovaSeq X Plus 61189912 1474261116 \n", "4 SAMEA119980013 NextSeq 2000 14962757 \n", "\n", " run_1_accession run_1_total_spots run_1_total_bases \n", "0 SRR35744894 57288768 4296657600 \n", "1 SRR35744895 61293382 4597003650 \n", "2 SRR35744896 85858060 6439354500 \n", "3 SRR35744897 61189912 4589243400 \n", "4 ERR15479323 \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "from pysradb.search import SraSearch\n", "\n", "instance = SraSearch(2, 5, query=\"ribosome profiling\")\n", "instance.search()\n", "df = instance.get_df()\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### Quickstart\n", "\n", "##### To query ENA instead, replace SraSearch class with the EnaSearch class:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Empty DataFrame\n", "Columns: []\n", "Index: []\n" ] } ], "source": [ "from pysradb.search import EnaSearch\n", "\n", "instance = EnaSearch(2, 5, \"ribosome profiling\")\n", "instance.search()\n", "df = instance.get_df()\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### To query GEO DataSets instead and retrieve the metadata of linked entries in SRA:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 5.53it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ " study_accession experiment_accession \\\n", "0 SRP619527 SRX30454639 \n", "1 SRP619527 SRX30454638 \n", "2 SRP619527 SRX30454637 \n", "3 SRP619527 SRX30454636 \n", "4 SRP618293 SRX30400919 \n", "\n", " experiment_title sample_taxon_id \\\n", "0 GSM9233466: Eef1g-cKO leptotene and zygotene_r... 10090 \n", "1 GSM9233465: Eef1g-cKO leptotene and zygotene_r... 10090 \n", "2 GSM9233464: Control leptotene and zygotene_rep... 10090 \n", "3 GSM9233463: Control leptotene and zygotene_rep... 10090 \n", "4 GSM9224243: Eef1g cKO leptotene/zygotene sperm... 
10090 \n", "\n", " sample_scientific_name experiment_library_strategy \\\n", "0 Mus musculus OTHER \n", "1 Mus musculus OTHER \n", "2 Mus musculus OTHER \n", "3 Mus musculus OTHER \n", "4 Mus musculus RNA-Seq \n", "\n", " experiment_library_source experiment_library_selection sample_accession \\\n", "0 TRANSCRIPTOMIC other SRS26489458 \n", "1 TRANSCRIPTOMIC other SRS26489457 \n", "2 TRANSCRIPTOMIC other SRS26489456 \n", "3 TRANSCRIPTOMIC other SRS26489455 \n", "4 TRANSCRIPTOMIC cDNA SRS26437528 \n", "\n", " sample_alias experiment_instrument_model pool_member_spots run_1_size \\\n", "0 GSM9233466 Illumina NovaSeq 6000 69632843 8660602662 \n", "1 GSM9233465 Illumina NovaSeq 6000 98921214 12243831179 \n", "2 GSM9233464 Illumina NovaSeq 6000 38336833 4708346735 \n", "3 GSM9233463 Illumina NovaSeq 6000 88212319 11000694382 \n", "4 GSM9224243 Illumina NovaSeq X 15236947 1826954111 \n", "\n", " run_1_accession run_1_total_spots run_1_total_bases \n", "0 SRR35360226 69632843 20889852900 \n", "1 SRR35360227 98921214 29676364200 \n", "2 SRR35360228 38336833 11501049900 \n", "3 SRR35360229 88212319 26463695700 \n", "4 SRR35298842 15236947 4571084100 \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "from pysradb.search import GeoSearch\n", "\n", "instance = GeoSearch(2, 5, geo_query=\"ribosome profiling\")\n", "instance.search()\n", "df = instance.get_df()\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "##### 7. Querying GEO DataSets with publication_date filter and displaying publication dates in results:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No results found for the following search query: \n", " SRA: {'query': 'sra gds[Filter]', 'accession': None, 'organism': None, 'layout': None, 'mbases': None, 'publication_date': '01-09-2024:30-09-2024', 'platform': None, 'selection': None, 'source': None, 'strategy': None, 'title': None}\n", "GEO DataSets: {'query': 'RNA-Seq AND gds sra[Filter]', 'dataset_type': None, 'entry_type': None, 'publication_date': '01-09-2024:30-09-2024', 'organism': None}\n", "Empty DataFrame\n", "Columns: []\n", "Index: []\n" ] } ], "source": [ "from pysradb.search import GeoSearch\n", "\n", "# Search for RNA-Seq datasets published in September 2024\n", "# Using verbosity=3 to get all available fields including publication_date\n", "instance = GeoSearch(\n", " verbosity=3,\n", " return_max=5,\n", " geo_query=\"RNA-Seq\",\n", " publication_date=\"01-09-2024:30-09-2024\",\n", ")\n", "instance.search()\n", "df = instance.get_df()\n", "\n", "# Display select columns including publication_date\n", "if not df.empty and \"publication_date\" in df.columns:\n", " cols_to_show = [\n", " \"study_accession\",\n", " \"experiment_accession\",\n", " \"sample_scientific_name\",\n", " \"experiment_library_strategy\",\n", " \"publication_date\",\n", " ]\n", " available_cols = [c for c in cols_to_show if c in df.columns]\n", " print(df[available_cols])\n", "else:\n", " print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### Error Handling\n", "\n", "##### When suppress_validation is not set to True, query fields with incorrect entries will raise IncorrectFieldException, which provides the complete list of acceptable inputs for fields such as \"selection\", etc:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "ename": "IncorrectFieldException", "evalue": "Incorrect selection: Mudkip\n--selection must be one of the following: \n5-methylcytidine antibody, CAGE, ChIP, ChIP-Seq, DNase, HMPR, Hybrid Selection, \nInverse rRNA, Inverse rRNA selection, MBD2 protein methyl-CpG binding domain, \nMDA, MF, MNase, MSLL, Oligo-dT, PCR, PolyA, RACE, RANDOM, RANDOM PCR, RT-PCR, \nReduced Representation, Restriction Digest, cDNA, cDNA_oligo_dT, cDNA_randomPriming \nother, padlock probes capture method, repeat fractionation, size fractionation, \nunspecified\n\n", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mIncorrectFieldException\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[6], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# 1. 
Invalid query entered for \"selection\"\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m \u001b[43mSraSearch\u001b[49m\u001b[43m(\u001b[49m\u001b[43mselection\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mMudkip\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m/data/github/pysradb/pysradb/search.py:735\u001b[0m, in \u001b[0;36mSraSearch.__init__\u001b[0;34m(self, verbosity, return_max, query, accession, organism, layout, mbases, publication_date, platform, selection, source, strategy, title, suppress_validation)\u001b[0m\n\u001b[1;32m 718\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21m__init__\u001b[39m(\n\u001b[1;32m 719\u001b[0m \u001b[38;5;28mself\u001b[39m,\n\u001b[1;32m 720\u001b[0m verbosity\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m2\u001b[39m,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 733\u001b[0m suppress_validation\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m,\n\u001b[1;32m 734\u001b[0m ):\n\u001b[0;32m--> 735\u001b[0m \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[38;5;21;43m__init__\u001b[39;49m\u001b[43m(\u001b[49m\n\u001b[1;32m 736\u001b[0m \u001b[43m \u001b[49m\u001b[43mverbosity\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 737\u001b[0m \u001b[43m \u001b[49m\u001b[43mreturn_max\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 738\u001b[0m \u001b[43m \u001b[49m\u001b[43mquery\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 739\u001b[0m \u001b[43m \u001b[49m\u001b[43maccession\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 740\u001b[0m \u001b[43m \u001b[49m\u001b[43morganism\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 741\u001b[0m \u001b[43m \u001b[49m\u001b[43mlayout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 742\u001b[0m \u001b[43m \u001b[49m\u001b[43mmbases\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 743\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mpublication_date\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 744\u001b[0m \u001b[43m \u001b[49m\u001b[43mplatform\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 745\u001b[0m \u001b[43m \u001b[49m\u001b[43mselection\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 746\u001b[0m \u001b[43m \u001b[49m\u001b[43msource\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 747\u001b[0m \u001b[43m \u001b[49m\u001b[43mstrategy\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 748\u001b[0m \u001b[43m \u001b[49m\u001b[43mtitle\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 749\u001b[0m \u001b[43m \u001b[49m\u001b[43msuppress_validation\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 750\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 751\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mentries \u001b[38;5;241m=\u001b[39m {}\n\u001b[1;32m 752\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mnumber_entries \u001b[38;5;241m=\u001b[39m \u001b[38;5;241m0\u001b[39m\n", "File \u001b[0;32m/data/github/pysradb/pysradb/search.py:152\u001b[0m, in \u001b[0;36mQuerySearch.__init__\u001b[0;34m(self, verbosity, return_max, query, accession, organism, layout, mbases, publication_date, platform, selection, source, strategy, title, suppress_validation)\u001b[0m\n\u001b[1;32m 150\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m MissingQueryException()\n\u001b[1;32m 151\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m suppress_validation:\n\u001b[0;32m--> 152\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_validate_fields\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 153\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mstats \u001b[38;5;241m=\u001b[39m {\n\u001b[1;32m 154\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstudy\u001b[39m\u001b[38;5;124m\"\u001b[39m: 
\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m-\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 155\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mexperiment\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m-\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 167\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcount_stdev\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m-\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 168\u001b[0m }\n\u001b[1;32m 169\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mplot_objects \u001b[38;5;241m=\u001b[39m {}\n", "File \u001b[0;32m/data/github/pysradb/pysradb/search.py:449\u001b[0m, in \u001b[0;36mQuerySearch._validate_fields\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 447\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mfields[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstrategy\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m output[\u001b[38;5;241m0\u001b[39m]\n\u001b[1;32m 448\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m message:\n\u001b[0;32m--> 449\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m IncorrectFieldException(message)\n", "\u001b[0;31mIncorrectFieldException\u001b[0m: Incorrect selection: Mudkip\n--selection must be one of the following: \n5-methylcytidine antibody, CAGE, ChIP, ChIP-Seq, DNase, HMPR, Hybrid Selection, \nInverse rRNA, Inverse rRNA selection, MBD2 protein methyl-CpG binding domain, \nMDA, MF, MNase, MSLL, Oligo-dT, PCR, PolyA, RACE, RANDOM, RANDOM PCR, RT-PCR, \nReduced Representation, Restriction Digest, cDNA, cDNA_oligo_dT, cDNA_randomPriming \nother, padlock probes capture method, repeat fractionation, size fractionation, \nunspecified\n\n" ] } ], "source": [ "# 1. 
Invalid query entered for \"selection\"\n", "SraSearch(selection=\"Mudkip\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "ename": "IncorrectFieldException", "evalue": "Multiple potential matches have been identified for metagenomic viral rna :\n['METAGENOMIC', 'VIRAL RNA']\nPlease check your input.\n\n", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mIncorrectFieldException\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[7], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# 2. Ambiguous query entered for \"source\":\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m \u001b[43mEnaSearch\u001b[49m\u001b[43m(\u001b[49m\u001b[43msource\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mmetagenomic viral rna \u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m/data/github/pysradb/pysradb/search.py:152\u001b[0m, in \u001b[0;36mQuerySearch.__init__\u001b[0;34m(self, verbosity, return_max, query, accession, organism, layout, mbases, publication_date, platform, selection, source, strategy, title, suppress_validation)\u001b[0m\n\u001b[1;32m 150\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m MissingQueryException()\n\u001b[1;32m 151\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m suppress_validation:\n\u001b[0;32m--> 152\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_validate_fields\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 153\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mstats \u001b[38;5;241m=\u001b[39m {\n\u001b[1;32m 154\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstudy\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m-\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 
155\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mexperiment\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m-\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 167\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcount_stdev\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m-\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 168\u001b[0m }\n\u001b[1;32m 169\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mplot_objects \u001b[38;5;241m=\u001b[39m {}\n", "File \u001b[0;32m/data/github/pysradb/pysradb/search.py:449\u001b[0m, in \u001b[0;36mQuerySearch._validate_fields\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 447\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mfields[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstrategy\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m output[\u001b[38;5;241m0\u001b[39m]\n\u001b[1;32m 448\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m message:\n\u001b[0;32m--> 449\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m IncorrectFieldException(message)\n", "\u001b[0;31mIncorrectFieldException\u001b[0m: Multiple potential matches have been identified for metagenomic viral rna :\n['METAGENOMIC', 'VIRAL RNA']\nPlease check your input.\n\n" ] } ], "source": [ "# 2. Ambiguous query entered for \"source\":\n", "EnaSearch(source=\"metagenomic viral rna \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### Usage Examples:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 1. Checking the help message on terminal:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: pysradb search [-h] [-o SAVETO] [-s] [-g [GRAPHS]] [-d {ena,geo,sra}]\n", " [-v {0,1,2,3}] [--run-description] [--detailed] [-m MAX]\n", " [-q QUERY [QUERY ...]] [-A ACCESSION]\n", " [-O ORGANISM [ORGANISM ...]] [-L {SINGLE,PAIRED}]\n", " [-M MBASES] [-D PUBLICATION_DATE]\n", " [-P PLATFORM [PLATFORM ...]]\n", " [-E SELECTION [SELECTION ...]] [-C SOURCE [SOURCE ...]]\n", " [-S STRATEGY [STRATEGY ...]] [-T TITLE [TITLE ...]] [-I]\n", " [-G GEO_QUERY [GEO_QUERY ...]]\n", " [-Y GEO_DATASET_TYPE [GEO_DATASET_TYPE ...]]\n", " [-Z GEO_ENTRY_TYPE [GEO_ENTRY_TYPE ...]]\n", "\n", "options:\n", " -h, --help show this help message and exit\n", " -o SAVETO, --saveto SAVETO\n", " Save search result dataframe to file\n", " -s, --stats Displays some useful statistics for the search\n", " results.\n", " -g [GRAPHS], --graphs [GRAPHS]\n", " Generates graphs to illustrate the search result. By\n", " default all graphs are generated. Alternatively,\n", " select a subset from the options below in a space-\n", " separated string: daterange, organism, source,\n", " selection, platform, basecount\n", " -d {ena,geo,sra}, --db {ena,geo,sra}\n", " Select the db API (sra, ena, or geo) to query, default\n", " = sra. Note: pysradb search works slightly differently\n", " when db = geo. 
Please refer to 'pysradb search --geo-\n", " info' for more details.\n", " -v {0,1,2,3}, --verbosity {0,1,2,3}\n", " Level of search result details (0, 1, 2 or 3), default\n", " = 2 0: run accession only 1: run accession and\n", " experiment title 2: accession numbers, titles and\n", " sequencing information 3: records in 2 and other\n", " information such as download url, sample attributes,\n", " etc\n", " --run-description Displays run accessions and descriptions only.\n", " Equivalent to --verbosity 1\n", " --detailed Displays detailed search results. Equivalent to\n", " --verbosity 3.\n", " -m MAX, --max MAX Maximum number of entries to return, default = 20\n", " -q QUERY [QUERY ...], --query QUERY [QUERY ...]\n", " Main query string. Note that if no query is supplied,\n", " at least one of the following flags must be present:\n", " -A ACCESSION, --accession ACCESSION\n", " Accession number\n", " -O ORGANISM [ORGANISM ...], --organism ORGANISM [ORGANISM ...]\n", " Scientific name of the sample organism\n", " -L {SINGLE,PAIRED}, --layout {SINGLE,PAIRED}\n", " Library layout. 
Accepts either SINGLE or PAIRED\n", " -M MBASES, --mbases MBASES\n", " Size of the sample rounded to the nearest megabase\n", " -D PUBLICATION_DATE, --publication-date PUBLICATION_DATE\n", " Publication date of the run in the format dd-mm-yyyy.\n", " If a date range is desired, enter the start date,\n", " followed by end date, separated by a colon ':'.\n", " Example: 01-01-2010:31-12-2010\n", " -P PLATFORM [PLATFORM ...], --platform PLATFORM [PLATFORM ...]\n", " Sequencing platform\n", " -E SELECTION [SELECTION ...], --selection SELECTION [SELECTION ...]\n", " Library selection\n", " -C SOURCE [SOURCE ...], --source SOURCE [SOURCE ...]\n", " Library source\n", " -S STRATEGY [STRATEGY ...], --strategy STRATEGY [STRATEGY ...]\n", " Library preparation strategy\n", " -T TITLE [TITLE ...], --title TITLE [TITLE ...]\n", " Experiment title\n", " -I, --geo-info Displays information on how to query GEO DataSets via\n", " 'pysradb search --db geo ...', including accepted\n", " inputs for -G/--geo-query, -Y/--geo-dataset-type and\n", " -Z/--geo-entry-type.\n", " -G GEO_QUERY [GEO_QUERY ...], --geo-query GEO_QUERY [GEO_QUERY ...]\n", " Main query string for GEO DataSet. This flag is only\n", " used when db is set to be geo.Please refer to 'pysradb\n", " search --geo-info' for more details.\n", " -Y GEO_DATASET_TYPE [GEO_DATASET_TYPE ...], --geo-dataset-type GEO_DATASET_TYPE [GEO_DATASET_TYPE ...]\n", " GEO DataSet Type. This flag is only used when --db is\n", " set to be geo.Please refer to 'pysradb search --geo-\n", " info' for more details.\n", " -Z GEO_ENTRY_TYPE [GEO_ENTRY_TYPE ...], --geo-entry-type GEO_ENTRY_TYPE [GEO_ENTRY_TYPE ...]\n", " GEO Entry Type. This flag is only used when --db is\n", " set to be geo.Please refer to 'pysradb search --geo-\n", " info' for more details.\n" ] } ], "source": [ "!pysradb search -h" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "##### 2. Searching for 5 illumina sequences related to the covid-19 pandemic on ENA, using the terminal:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "study_accession\texperiment_accession\texperiment_title\tdescription\ttax_id\tscientific_name\tlibrary_strategy\tlibrary_source\tlibrary_selection\tsample_accession\tsample_title\tinstrument_model\trun_accession\tread_count\tbase_count\n", "PRJEB65477\tERX11285566\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\t9606\tHomo sapiens\tRNA-Seq\tTRANSCRIPTOMIC\tInverse rRNA\tSAMEA114308724\tSample 1\tIllumina HiSeq 4000\tERR11901279\t131354572\t19834540372\n", "PRJEB65477\tERX11285567\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\t9606\tHomo sapiens\tRNA-Seq\tTRANSCRIPTOMIC\tInverse rRNA\tSAMEA114308725\tSample 10\tIllumina HiSeq 4000\tERR11901280\t95225496\t14379049896\n", "PRJEB65477\tERX11285573\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy 
controls\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\t9606\tHomo sapiens\tRNA-Seq\tTRANSCRIPTOMIC\tInverse rRNA\tSAMEA114308731\tSample 105\tIllumina HiSeq 4000\tERR11901286\t128864216\t19458496616\n", "PRJEB65477\tERX11285574\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\t9606\tHomo sapiens\tRNA-Seq\tTRANSCRIPTOMIC\tInverse rRNA\tSAMEA114308732\tSample 106\tIllumina HiSeq 4000\tERR11901287\t123505478\t18649327178\n", "PRJEB65477\tERX11285577\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\t9606\tHomo sapiens\tRNA-Seq\tTRANSCRIPTOMIC\tInverse rRNA\tSAMEA114308735\tSample 109\tIllumina HiSeq 4000\tERR11901290\t120866712\t18250873512\n" ] } ], "source": [ "!pysradb search -q covid19 --platform illumina --db ena -m 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "##### 3. Searching for illumina sequences related to the covid-19 pandemic on ENA, using the terminal, and saving the results in a nicely formatted text file:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "!pysradb search -q covid19 --platform illumina --db ena --saveto query.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
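Results saved with `--saveto` can be loaded back for further filtering. The following is a minimal sketch, assuming the results were saved with a `.tsv` extension (for example `--saveto query.tsv`) and that the file is tab-separated with a header row matching the terminal output shown earlier. The helper name `read_results` and the in-memory sample are illustrative only and are not part of pysradb.

```python
import csv
import io


def read_results(handle):
    # Parse tab-separated search results into a list of dicts, one per row.
    # Assumes the first row of the file is a header (an assumption, see above).
    return list(csv.DictReader(handle, dialect='excel-tab'))


# Stand-in for a saved file: a tiny in-memory table holding three of the
# columns and two of the rows from the ENA output shown earlier.
buffer = io.StringIO()
writer = csv.writer(buffer, dialect='excel-tab')
writer.writerow(['study_accession', 'run_accession', 'read_count'])
writer.writerow(['PRJEB65477', 'ERR11901279', '131354572'])
writer.writerow(['PRJEB65477', 'ERR11901280', '95225496'])
buffer.seek(0)

rows = read_results(buffer)
runs = [row['run_accession'] for row in rows]
print(runs)
```

With a real file, the same helper could be called as `read_results(open('query.tsv', newline=''))`, provided the file layout matches the assumption above.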
\n", "\n", "##### 4. Searching for illumina sequences related to the covid-19 pandemic on ENA, within python: (outputs a pandas dataframe)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " study_accession experiment_accession \\\n", "0 PRJEB65477 ERX11285566 \n", "1 PRJEB65477 ERX11285567 \n", "2 PRJEB65477 ERX11285573 \n", "3 PRJEB65477 ERX11285574 \n", "4 PRJEB65477 ERX11285577 \n", "5 PRJEB65477 ERX11285583 \n", "6 PRJEB65477 ERX11285586 \n", "7 PRJEB65477 ERX11285592 \n", "8 PRJEB65477 ERX11285595 \n", "9 PRJEB65477 ERX11285599 \n", "10 PRJEB65477 ERX11285600 \n", "11 PRJEB65477 ERX11285605 \n", "12 PRJEB65477 ERX11285607 \n", "13 PRJEB65477 ERX11285609 \n", "14 PRJEB65477 ERX11285611 \n", "15 PRJEB65477 ERX11285620 \n", "16 PRJEB65477 ERX11285623 \n", "17 PRJEB65477 ERX11285625 \n", "18 PRJEB65477 ERX11285629 \n", "19 PRJEB65477 ERX11285633 \n", "\n", " experiment_title \\\n", "0 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "1 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "2 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "3 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "4 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "5 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "6 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "7 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "8 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "9 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "10 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "11 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "12 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "13 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "14 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "15 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "16 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 
\n", "17 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "18 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "19 Illumina HiSeq 4000 sequencing: Blood RNA sequ... \n", "\n", " description tax_id scientific_name \\\n", "0 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "1 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "2 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "3 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "4 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "5 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "6 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "7 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "8 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "9 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "10 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "11 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "12 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "13 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "14 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "15 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "16 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "17 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "18 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 9606 Homo sapiens \n", "19 Illumina HiSeq 4000 sequencing: Blood RNA sequ... 
9606 Homo sapiens \n", "\n", " library_strategy library_source library_selection sample_accession \\\n", "0 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308724 \n", "1 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308725 \n", "2 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308731 \n", "3 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308732 \n", "4 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308735 \n", "5 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308741 \n", "6 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308744 \n", "7 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308750 \n", "8 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308753 \n", "9 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308757 \n", "10 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308758 \n", "11 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308763 \n", "12 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308765 \n", "13 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308767 \n", "14 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308769 \n", "15 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308778 \n", "16 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308781 \n", "17 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308783 \n", "18 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308787 \n", "19 RNA-Seq TRANSCRIPTOMIC Inverse rRNA SAMEA114308791 \n", "\n", " sample_title instrument_model run_accession read_count base_count \n", "0 Sample 1 Illumina HiSeq 4000 ERR11901279 131354572 19834540372 \n", "1 Sample 10 Illumina HiSeq 4000 ERR11901280 95225496 14379049896 \n", "2 Sample 105 Illumina HiSeq 4000 ERR11901286 128864216 19458496616 \n", "3 Sample 106 Illumina HiSeq 4000 ERR11901287 123505478 18649327178 \n", "4 Sample 109 Illumina HiSeq 4000 ERR11901290 120866712 18250873512 \n", "5 Sample 114 Illumina HiSeq 4000 ERR11901296 126581586 19113819486 \n", "6 Sample 117 Illumina HiSeq 4000 ERR11901299 126421026 19089574926 \n", "7 Sample 122 Illumina HiSeq 4000 ERR11901305 132194842 19961421142 \n", "8 Sample 125 Illumina HiSeq 4000 ERR11901308 
128325648 19377172848 \n", "9 Sample 129 Illumina HiSeq 4000 ERR11901312 113973238 17209958938 \n", "10 Sample 13 Illumina HiSeq 4000 ERR11901313 103563334 15638063434 \n", "11 Sample 134 Illumina HiSeq 4000 ERR11901318 125759128 18989628328 \n", "12 Sample 15 Illumina HiSeq 4000 ERR11901320 102246922 15439285222 \n", "13 Sample 17 Illumina HiSeq 4000 ERR11901322 103868320 15684116320 \n", "14 Sample 19 Illumina HiSeq 4000 ERR11901324 102445734 15469305834 \n", "15 Sample 27 Illumina HiSeq 4000 ERR11901333 103177112 15579743912 \n", "16 Sample 3 Illumina HiSeq 4000 ERR11901336 104470270 15775010770 \n", "17 Sample 31 Illumina HiSeq 4000 ERR11901338 95555226 14428839126 \n", "18 Sample 35 Illumina HiSeq 4000 ERR11901342 104135584 15724473184 \n", "19 Sample 39 Illumina HiSeq 4000 ERR11901346 103208434 15584473534 \n" ] } ], "source": [ "from pysradb.search import EnaSearch\n", "\n", "instance = EnaSearch(2, 20, query=\"covid19\", platform=\"illumina\")\n", "instance.search()\n", "df = instance.get_df()\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
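Because `get_df()` returns a pandas DataFrame, ordinary pandas operations can be applied to the search results. The sketch below totals reads and bases per study. It uses a hand-built stand-in DataFrame with two rows copied from the output above, so it runs without network access. The helper name `summarize_runs` is hypothetical, and the conversion with `pd.to_numeric` is there on the assumption that count columns may arrive as strings.

```python
import pandas as pd


def summarize_runs(df):
    # Total reads and bases per study; coerce counts in case they are strings.
    counts = df[['read_count', 'base_count']].apply(pd.to_numeric)
    return counts.groupby(df['study_accession']).sum()


# Stand-in for the DataFrame returned by instance.get_df().
sample = pd.DataFrame(
    {
        'study_accession': ['PRJEB65477', 'PRJEB65477'],
        'run_accession': ['ERR11901279', 'ERR11901280'],
        'read_count': ['131354572', '95225496'],
        'base_count': ['19834540372', '14379049896'],
    }
)
summary = summarize_runs(sample)
print(summary)
```

The same function could be applied to the `df` obtained from `EnaSearch`, as long as the `study_accession`, `read_count` and `base_count` columns are present at the chosen verbosity.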
\n", "\n", "##### 5. A more complex example, combining several search filters in a single ENA query within python:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>study_accession</th><th>experiment_accession</th><th>experiment_title</th><th>description</th><th>tax_id</th><th>scientific_name</th><th>library_strategy</th><th>library_source</th><th>library_selection</th><th>sample_accession</th><th>...</th><th>tag</th><th>target_gene</th><th>tax_lineage</th><th>taxonomic_classification</th><th>taxonomic_identity_marker</th><th>temperature</th><th>tissue_lib</th><th>tissue_type</th><th>transposase_protocol</th><th>variety</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>PRJNA549801</td><td>SRX10904916</td><td>Illumina MiSeq sequencing: E. coli library mad...</td><td>Illumina MiSeq sequencing: E. coli library mad...</td><td>562</td><td>Escherichia coli</td><td>WGS</td><td>GENOMIC</td><td>RANDOM</td><td>SAMN11656391</td><td>...</td><td>env_tax;env_tax:freshwater;env_tax:terrestrial...</td><td>&lt;NA&gt;</td><td>1;131567;2;3379134;1224;1236;91347;543;561;562</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td></tr>\n",
"<tr><th>1</th><td>PRJNA530580</td><td>SRX5658689</td><td>Illumina MiSeq sequencing: Sequencing of Esche...</td><td>Illumina MiSeq sequencing: Sequencing of Esche...</td><td>562</td><td>Escherichia coli</td><td>WGS</td><td>GENOMIC</td><td>RANDOM</td><td>SAMN11319482</td><td>...</td><td>env_tax;env_tax:freshwater;env_tax:terrestrial...</td><td>&lt;NA&gt;</td><td>1;131567;2;3379134;1224;1236;91347;543;561;562</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td></tr>\n",
"<tr><th>2</th><td>PRJEB34285</td><td>ERX3552102</td><td>Illumina MiSeq paired end sequencing: Raw read...</td><td>Illumina MiSeq paired end sequencing: Raw read...</td><td>562</td><td>Escherichia coli</td><td>WGS</td><td>GENOMIC</td><td>RANDOM</td><td>SAMEA5957573</td><td>...</td><td>env_tax;env_tax:freshwater;env_tax:terrestrial...</td><td>&lt;NA&gt;</td><td>1;131567;2;3379134;1224;1236;91347;543;561;562</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>3 rows × 192 columns</p>\n",
"</div>
" ], "text/plain": [ " study_accession experiment_accession \\\n", "0 PRJNA549801 SRX10904916 \n", "1 PRJNA530580 SRX5658689 \n", "2 PRJEB34285 ERX3552102 \n", "\n", " experiment_title \\\n", "0 Illumina MiSeq sequencing: E. coli library mad... \n", "1 Illumina MiSeq sequencing: Sequencing of Esche... \n", "2 Illumina MiSeq paired end sequencing: Raw read... \n", "\n", " description tax_id scientific_name \\\n", "0 Illumina MiSeq sequencing: E. coli library mad... 562 Escherichia coli \n", "1 Illumina MiSeq sequencing: Sequencing of Esche... 562 Escherichia coli \n", "2 Illumina MiSeq paired end sequencing: Raw read... 562 Escherichia coli \n", "\n", " library_strategy library_source library_selection sample_accession ... \\\n", "0 WGS GENOMIC RANDOM SAMN11656391 ... \n", "1 WGS GENOMIC RANDOM SAMN11319482 ... \n", "2 WGS GENOMIC RANDOM SAMEA5957573 ... \n", "\n", " tag target_gene \\\n", "0 env_tax;env_tax:freshwater;env_tax:terrestrial... \n", "1 env_tax;env_tax:freshwater;env_tax:terrestrial... \n", "2 env_tax;env_tax:freshwater;env_tax:terrestrial... 
\n", "\n", " tax_lineage taxonomic_classification \\\n", "0 1;131567;2;3379134;1224;1236;91347;543;561;562 \n", "1 1;131567;2;3379134;1224;1236;91347;543;561;562 \n", "2 1;131567;2;3379134;1224;1236;91347;543;561;562 \n", "\n", " taxonomic_identity_marker temperature tissue_lib tissue_type \\\n", "0 \n", "1 \n", "2 \n", "\n", " transposase_protocol variety \n", "0 \n", "1 \n", "2 \n", "\n", "[3 rows x 192 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pysradb.search import EnaSearch\n", "\n", "instance = EnaSearch(\n", " 3,\n", " 100000,\n", " organism=\"Escherichia coli\",\n", " layout=\"Paired\",\n", " mbases=10,\n", " publication_date=\"01-01-2019:31-12-2021\",\n", " platform=\"Illumina\",\n", " selection=\"random\",\n", " source=\"Genomic\",\n", " strategy=\"WGS\",\n", ")\n", "instance.search()\n", "df = instance.get_df()\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sorted(df.columns)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:02<00:00, 44.88it/s]\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>study_accession</th><th>experiment_accession</th><th>experiment_title</th><th>sample_taxon_id</th><th>sample_scientific_name</th><th>experiment_library_strategy</th><th>experiment_library_source</th><th>experiment_library_selection</th><th>sample_accession</th><th>sample_alias</th><th>...</th><th>study_link_1_type</th><th>study_link_1_value_1</th><th>study_link_1_value_2</th><th>study_study_abstract</th><th>study_study_title</th><th>study_study_type_existing_study_type</th><th>submission_accession</th><th>submission_alias</th><th>submission_center_name</th><th>submission_lab_name</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>SRP531137</td><td>SRX25997822</td><td>GSM8501051: HEK293, Prm1 negative, Replicate 3...</td><td>9606</td><td>Homo sapiens</td><td>Hi-C</td><td>GENOMIC</td><td>other</td><td>SRS22571584</td><td>GSM8501051</td><td>...</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>Although the spatial organization of the genom...</td><td>Large-scale manipulation of radial positioning...</td><td>Other</td><td>SRA1964865</td><td>SUB14711595</td><td>Technion - Israel Institute of Technology</td><td>&lt;NA&gt;</td></tr>\n",
"<tr><th>1</th><td>SRP531137</td><td>SRX25997821</td><td>GSM8501050: HEK293, Prm1 negative, Replicate 2...</td><td>9606</td><td>Homo sapiens</td><td>Hi-C</td><td>GENOMIC</td><td>other</td><td>SRS22571583</td><td>GSM8501050</td><td>...</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>Although the spatial organization of the genom...</td><td>Large-scale manipulation of radial positioning...</td><td>Other</td><td>SRA1964865</td><td>SUB14711595</td><td>Technion - Israel Institute of Technology</td><td>&lt;NA&gt;</td></tr>\n",
"<tr><th>2</th><td>SRP531137</td><td>SRX25997820</td><td>GSM8501049: HEK293, Prm1 negative, Replicate 1...</td><td>9606</td><td>Homo sapiens</td><td>Hi-C</td><td>GENOMIC</td><td>other</td><td>SRS22571582</td><td>GSM8501049</td><td>...</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>Although the spatial organization of the genom...</td><td>Large-scale manipulation of radial positioning...</td><td>Other</td><td>SRA1964865</td><td>SUB14711595</td><td>Technion - Israel Institute of Technology</td><td>&lt;NA&gt;</td></tr>\n",
"<tr><th>3</th><td>SRP531137</td><td>SRX25997819</td><td>GSM8501048: HEK293, Prm1 positive DAPI low, Hi...</td><td>9606</td><td>Homo sapiens</td><td>Hi-C</td><td>GENOMIC</td><td>other</td><td>SRS22571581</td><td>GSM8501048</td><td>...</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>Although the spatial organization of the genom...</td><td>Large-scale manipulation of radial positioning...</td><td>Other</td><td>SRA1964865</td><td>SUB14711595</td><td>Technion - Israel Institute of Technology</td><td>&lt;NA&gt;</td></tr>\n",
"<tr><th>4</th><td>SRP531137</td><td>SRX25997818</td><td>GSM8501047: HEK293, Prm1 positive DAPI high, H...</td><td>9606</td><td>Homo sapiens</td><td>Hi-C</td><td>GENOMIC</td><td>other</td><td>SRS22571580</td><td>GSM8501047</td><td>...</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>Although the spatial organization of the genom...</td><td>Large-scale manipulation of radial positioning...</td><td>Other</td><td>SRA1964865</td><td>SUB14711595</td><td>Technion - Israel Institute of Technology</td><td>&lt;NA&gt;</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>95</th><td>SRP530052</td><td>SRX25934604</td><td>GSM8493072: HSS-3; Homo sapiens; RNA-Seq</td><td>9606</td><td>Homo sapiens</td><td>RNA-Seq</td><td>TRANSCRIPTOMIC</td><td>cDNA</td><td>SRS22520997</td><td>GSM8493072</td><td>...</td><td>XREF_LINK</td><td>DB: pubmed</td><td>ID: 39752487</td><td>Fluid shear stress (FSS) from blood flow sense...</td><td>cSTAR analysis identifies endothelial cell cyc...</td><td>Transcriptome Analysis</td><td>SRA1960858</td><td>SUB14701816</td><td>Systems Biology Ireland, University College Du...</td><td>&lt;NA&gt;</td></tr>\n",
"<tr><th>96</th><td>SRP530052</td><td>SRX25934603</td><td>GSM8493071: HSS-2; Homo sapiens; RNA-Seq</td><td>9606</td><td>Homo sapiens</td><td>RNA-Seq</td><td>TRANSCRIPTOMIC</td><td>cDNA</td><td>SRS22520996</td><td>GSM8493071</td><td>...</td><td>XREF_LINK</td><td>DB: pubmed</td><td>ID: 39752487</td><td>Fluid shear stress (FSS) from blood flow sense...</td><td>cSTAR analysis identifies endothelial cell cyc...</td><td>Transcriptome Analysis</td><td>SRA1960858</td><td>SUB14701816</td><td>Systems Biology Ireland, University College Du...</td><td>&lt;NA&gt;</td></tr>\n",
"<tr><th>97</th><td>SRP530052</td><td>SRX25934602</td><td>GSM8493070: HSS-1; Homo sapiens; RNA-Seq</td><td>9606</td><td>Homo sapiens</td><td>RNA-Seq</td><td>TRANSCRIPTOMIC</td><td>cDNA</td><td>SRS22520995</td><td>GSM8493070</td><td>...</td><td>XREF_LINK</td><td>DB: pubmed</td><td>ID: 39752487</td><td>Fluid shear stress (FSS) from blood flow sense...</td><td>cSTAR analysis identifies endothelial cell cyc...</td><td>Transcriptome Analysis</td><td>SRA1960858</td><td>SUB14701816</td><td>Systems Biology Ireland, University College Du...</td><td>&lt;NA&gt;</td></tr>\n",
"<tr><th>98</th><td>SRP529928</td><td>SRX25928352</td><td>GSM8492057: NC1 group2; Homo sapiens; RNA-Seq</td><td>9606</td><td>Homo sapiens</td><td>RNA-Seq</td><td>TRANSCRIPTOMIC</td><td>cDNA</td><td>SRS22515729</td><td>GSM8492057</td><td>...</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>Adrenoleukodystrophy (ALD) is a rare X-linked ...</td><td>Transcriptomic Analysis of Identical Twins wit...</td><td>Transcriptome Analysis</td><td>SRA1960558</td><td>SUB14700372</td><td>Southwest Hospital, Army Medical University (T...</td><td>&lt;NA&gt;</td></tr>\n",
"<tr><th>99</th><td>SRP529928</td><td>SRX25928351</td><td>GSM8492056: NC1 group1; Homo sapiens; RNA-Seq</td><td>9606</td><td>Homo sapiens</td><td>RNA-Seq</td><td>TRANSCRIPTOMIC</td><td>cDNA</td><td>SRS22515728</td><td>GSM8492056</td><td>...</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>&lt;NA&gt;</td><td>Adrenoleukodystrophy (ALD) is a rare X-linked ...</td><td>Transcriptomic Analysis of Identical Twins wit...</td><td>Transcriptome Analysis</td><td>SRA1960558</td><td>SUB14700372</td><td>Southwest Hospital, Army Medical University (T...</td><td>&lt;NA&gt;</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>100 rows × 644 columns</p>\n",
"</div>
" ], "text/plain": [ " study_accession experiment_accession \\\n", "0 SRP531137 SRX25997822 \n", "1 SRP531137 SRX25997821 \n", "2 SRP531137 SRX25997820 \n", "3 SRP531137 SRX25997819 \n", "4 SRP531137 SRX25997818 \n", ".. ... ... \n", "95 SRP530052 SRX25934604 \n", "96 SRP530052 SRX25934603 \n", "97 SRP530052 SRX25934602 \n", "98 SRP529928 SRX25928352 \n", "99 SRP529928 SRX25928351 \n", "\n", " experiment_title sample_taxon_id \\\n", "0 GSM8501051: HEK293, Prm1 negative, Replicate 3... 9606 \n", "1 GSM8501050: HEK293, Prm1 negative, Replicate 2... 9606 \n", "2 GSM8501049: HEK293, Prm1 negative, Replicate 1... 9606 \n", "3 GSM8501048: HEK293, Prm1 positive DAPI low, Hi... 9606 \n", "4 GSM8501047: HEK293, Prm1 positive DAPI high, H... 9606 \n", ".. ... ... \n", "95 GSM8493072: HSS-3; Homo sapiens; RNA-Seq 9606 \n", "96 GSM8493071: HSS-2; Homo sapiens; RNA-Seq 9606 \n", "97 GSM8493070: HSS-1; Homo sapiens; RNA-Seq 9606 \n", "98 GSM8492057: NC1 group2; Homo sapiens; RNA-Seq 9606 \n", "99 GSM8492056: NC1 group1; Homo sapiens; RNA-Seq 9606 \n", "\n", " sample_scientific_name experiment_library_strategy \\\n", "0 Homo sapiens Hi-C \n", "1 Homo sapiens Hi-C \n", "2 Homo sapiens Hi-C \n", "3 Homo sapiens Hi-C \n", "4 Homo sapiens Hi-C \n", ".. ... ... \n", "95 Homo sapiens RNA-Seq \n", "96 Homo sapiens RNA-Seq \n", "97 Homo sapiens RNA-Seq \n", "98 Homo sapiens RNA-Seq \n", "99 Homo sapiens RNA-Seq \n", "\n", " experiment_library_source experiment_library_selection sample_accession \\\n", "0 GENOMIC other SRS22571584 \n", "1 GENOMIC other SRS22571583 \n", "2 GENOMIC other SRS22571582 \n", "3 GENOMIC other SRS22571581 \n", "4 GENOMIC other SRS22571580 \n", ".. ... ... ... \n", "95 TRANSCRIPTOMIC cDNA SRS22520997 \n", "96 TRANSCRIPTOMIC cDNA SRS22520996 \n", "97 TRANSCRIPTOMIC cDNA SRS22520995 \n", "98 TRANSCRIPTOMIC cDNA SRS22515729 \n", "99 TRANSCRIPTOMIC cDNA SRS22515728 \n", "\n", " sample_alias ... study_link_1_type study_link_1_value_1 \\\n", "0 GSM8501051 ... 
\n", "1 GSM8501050 ... \n", "2 GSM8501049 ... \n", "3 GSM8501048 ... \n", "4 GSM8501047 ... \n", ".. ... ... ... ... \n", "95 GSM8493072 ... XREF_LINK DB: pubmed \n", "96 GSM8493071 ... XREF_LINK DB: pubmed \n", "97 GSM8493070 ... XREF_LINK DB: pubmed \n", "98 GSM8492057 ... \n", "99 GSM8492056 ... \n", "\n", " study_link_1_value_2 study_study_abstract \\\n", "0 Although the spatial organization of the genom... \n", "1 Although the spatial organization of the genom... \n", "2 Although the spatial organization of the genom... \n", "3 Although the spatial organization of the genom... \n", "4 Although the spatial organization of the genom... \n", ".. ... ... \n", "95 ID: 39752487 Fluid shear stress (FSS) from blood flow sense... \n", "96 ID: 39752487 Fluid shear stress (FSS) from blood flow sense... \n", "97 ID: 39752487 Fluid shear stress (FSS) from blood flow sense... \n", "98 Adrenoleukodystrophy (ALD) is a rare X-linked ... \n", "99 Adrenoleukodystrophy (ALD) is a rare X-linked ... \n", "\n", " study_study_title \\\n", "0 Large-scale manipulation of radial positioning... \n", "1 Large-scale manipulation of radial positioning... \n", "2 Large-scale manipulation of radial positioning... \n", "3 Large-scale manipulation of radial positioning... \n", "4 Large-scale manipulation of radial positioning... \n", ".. ... \n", "95 cSTAR analysis identifies endothelial cell cyc... \n", "96 cSTAR analysis identifies endothelial cell cyc... \n", "97 cSTAR analysis identifies endothelial cell cyc... \n", "98 Transcriptomic Analysis of Identical Twins wit... \n", "99 Transcriptomic Analysis of Identical Twins wit... \n", "\n", " study_study_type_existing_study_type submission_accession submission_alias \\\n", "0 Other SRA1964865 SUB14711595 \n", "1 Other SRA1964865 SUB14711595 \n", "2 Other SRA1964865 SUB14711595 \n", "3 Other SRA1964865 SUB14711595 \n", "4 Other SRA1964865 SUB14711595 \n", ".. ... ... ... 
\n", "95 Transcriptome Analysis SRA1960858 SUB14701816 \n", "96 Transcriptome Analysis SRA1960858 SUB14701816 \n", "97 Transcriptome Analysis SRA1960858 SUB14701816 \n", "98 Transcriptome Analysis SRA1960558 SUB14700372 \n", "99 Transcriptome Analysis SRA1960558 SUB14700372 \n", "\n", " submission_center_name submission_lab_name \n", "0 Technion - Israel Institute of Technology \n", "1 Technion - Israel Institute of Technology \n", "2 Technion - Israel Institute of Technology \n", "3 Technion - Israel Institute of Technology \n", "4 Technion - Israel Institute of Technology \n", ".. ... ... \n", "95 Systems Biology Ireland, University College Du... \n", "96 Systems Biology Ireland, University College Du... \n", "97 Systems Biology Ireland, University College Du... \n", "98 Southwest Hospital, Army Medical University (T... \n", "99 Southwest Hospital, Army Medical University (T... \n", "\n", "[100 rows x 644 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pysradb.search import GeoSearch\n", "\n", "# https://github.com/saketkc/pysradb/issues/221\n", "instance = GeoSearch(\n", "    publication_date=\"05-09-2024:06-09-2024\", return_max=100, verbosity=3\n", ")\n", "instance.search()\n", "df = instance.get_df()\n", "df" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:48<00:00, 20.72it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "['GSE276554' 'GSE276553' 'GSE267301' 'GSE276379' 'GSE276473' 'GSE276365'\n", " 'GSE276447' 'GSE276438' 'GSE219269' 'GSE276372' 'GSE276338' 'GSE276337'\n", " 'GSE276206' 'GSE276309' 'GSE276304' 'GSE276245' 'GSE276204' 'GSE276185'\n", " 'GSE276195' 'GSE276192' 'GSE276153' 'GSE276130' 
'GSE276122' 'GSE276065'\n", " 'GSE276058' 'GSE276038' 'GSE276037' 'GSE275962' 'GSE275896' 'GSE275863'\n", " 'GSE275777' 'GSE275778' 'GSE275571' 'GSE253407' 'GSE275211' 'GSE274586'\n", " 'GSE274408' 'GSE273907' 'GSE273844' 'GSE273813' 'GSE253145' 'GSE272793'\n", " 'GSE272635' 'GSE272394' 'GSE247707' 'GSE247706' 'GSE272252' 'GSE271996'\n", " 'GSE271746' 'GSE271667' 'GSE271653' 'GSE264193' 'GSE245033' 'GSE268837'\n", " 'GSE267343' 'GSE267342' 'GSE266976' 'GSE266471' 'GSE264212' 'GSE264012'\n", " 'GSE263804' 'GSE263798' 'GSE263549' 'GSE263414' 'GSE263441' 'GSE262699'\n", " 'GSE262282' 'GSE262272' 'GSE262127' 'GSE262125' 'GSE262126']\n" ] }, { "data": { "text/plain": [ "np.int64(0)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "instance = GeoSearch(\n", " publication_date=\"04-09-2024:06-09-2024\", return_max=1000, verbosity=3\n", ")\n", "instance.search()\n", "df = instance.get_df()\n", "print(df[\"study_alias\"].unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
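The `publication_date` filters in the examples above are written as a day-first range of the form `DD-MM-YYYY:DD-MM-YYYY`. The sketch below builds such a string from `datetime.date` objects, which avoids typos in hand-written ranges. The helper name `publication_date_range` is hypothetical and not part of pysradb; the format is inferred from the examples in this notebook.

```python
from datetime import date


def publication_date_range(start, end):
    # Format two dates as DD-MM-YYYY:DD-MM-YYYY, the shape used in the examples.
    if end < start:
        raise ValueError('end date must not precede start date')
    return f'{start:%d-%m-%Y}:{end:%d-%m-%Y}'


date_filter = publication_date_range(date(2019, 1, 1), date(2021, 12, 31))
print(date_filter)
```

The resulting string could then be passed as the `publication_date` argument of a search class, or after `--publication-date` on the command line.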
\n", "\n", "##### 6. A similar query from the terminal, with the maximum number of results (`-m`) set to 20:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pysradb search --db ena -m 20 -v 3 --organism Escherichia coli --layout Paired --mbases 100 --publication-date 01-01-2019:31-12-2019 --platform illumina --selection random --source Genomic --strategy wgs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.18" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }