{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/saketkc/pysradb/blob/develop/notebooks/07.Query_Search.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Query and Search\n",
    "\n",
    "This notebook demonstrates advanced search capabilities to find SRA studies based on specific criteria."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install pysradb if not already installed\n",
    "try:\n",
    "    import pysradb\n",
    "\n",
    "    print(f\"pysradb {pysradb.__version__} is already installed\")\n",
    "except ImportError:\n",
    "    print(\"Installing pysradb from GitHub...\")\n",
    "    import sys\n",
    "\n",
    "    !{sys.executable} -m pip install -q git+https://github.com/saketkc/pysradb\n",
    "    print(\"pysradb installed successfully!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## pysradb search\n",
    "##### The pysradb search module supports querying the Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA) databases for sequencing data. The module also includes several built-in flags that can be used to fine-tune a search query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "th {font-size: 16px;}\n",
       "td {font-size: 14px;}\n",
       "td:first-child {font-size: 15px; font-weight: 500;}\n",
       "</style>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<style>\n",
    "th {font-size: 16px;}\n",
    "td {font-size: 14px;}\n",
    "td:first-child {font-size: 15px; font-weight: 500;}\n",
    "</style>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Terminal flags for the pysradb search module:\n",
    "\n",
    "|Flags | Explanation|\n",
    "|----------|------------|\n",
    "| -h, --help | Displays the help message |\n",
    "| --saveto | Saves the result in the file specified by the user.<br>Supported file types: txt, tsv, csv |\n",
    "| --db   | Selects the database (SRA, ENA, or both SRA and GEO DataSets) to query. Default database is SRA. Accepted inputs: sra, ena, geo|\n",
    "| -v, --verbosity  | Determines how much detail is retrieved and shown in the search result: <br>0: run_accession only <br>1: run_accession and experiment_description only <br>2: <u>(default)</u> study_accession, experiment_accession, experiment_title, description, tax_id, scientific_name, library_strategy, <br>library_source, library_selection, sample_accession, sample_title, instrument_model, run_accession, read_count, base_count <br>3: Everything in verbosity level 2, followed by all other retrievable information from the database|\n",
    "| -m, --max | Maximum number of returned entries. Default number is 20.<br>Note: If the maximum number set is large, querying the SRA and GEO DataSets databases will take significantly longer due to API limits|\n",
    "| -q, --query | The main query string. <br><u>Note: if this flag is not used, at least one of the following flags must be supplied</u>: |\n",
    "| --accession | A relevant study / experiment / sample / run accession number|\n",
    "| --organism | Scientific name of the sample organism |\n",
    "| --layout | Library layout. Accepted inputs: single, paired|\n",
    "| --mbases | Size of the sample rounded to the nearest megabase|\n",
    "| --publication-date | The publication date of the run in the format dd-mm-yyyy. If a date range is desired, <br>enter the start date, followed by end date, separated by a colon ':' in the format dd-mm-yyyy:dd-mm-yyyy <br>Example: 01-01-2010:31-12-2010|\n",
    "| --platform | Sequencing platform used for the run. Possible inputs: illumina, ion torrent, oxford nanopore |\n",
    "| --selection | Library selection. Possible inputs: cdna, chip, dnase, pcr, polya |\n",
    "| --source | Library source. Possible inputs: genomic, metagenomic, transcriptomic |\n",
    "| --strategy | Library preparation strategy. Possible inputs: wgs, amplicon, rna seq |\n",
    "| --title | Title of the experiment associated with the run |\n",
    "| --geo-query | The main query string to be sent to GEO DataSets |\n",
    "| --geo-dataset-type | Dataset type. Possible inputs: expression profiling by array, expression profiling by high throughput sequencing, non coding rna profiling by high throughput sequencing |\n",
    "| --geo-entry-type | Entry type. Accepted inputs: gds, gpl, gse, gsm |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using pysradb search in python:\n",
    "\n",
    "##### pysradb search organises each search query as an instance of the SraSearch, EnaSearch or GeoSearch class. These classes take the following parameters in their constructors: \n",
    "\n",
    "\n",
    "\n",
    "<b>`SraSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False)` </b>\n",
    "\n",
    "\n",
    "<b>`EnaSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False)` </b>\n",
    "\n",
    "<b>`GeoSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, geo_query=None, geo_dataset_type=None, geo_entry_type=None, suppress_validation=False)` </b>\n",
    "\n",
    "| Parameters | Explanations|\n",
    "|----------|------------|\n",
    "| verbosity | Determines how much detail is retrieved and shown in the search result (default=2). Same as -v / --verbosity on terminal |\n",
    "| return_max | Maximum number of returned entries (default=20). Same as -m / --max on terminal |\n",
    "| suppress_validation | Defaults to False. If set to True, the input format checks are skipped. This may cause the program to behave in unexpected ways, but it allows searching with queries that do not pass the format check.|\n",
    "\n",
    "Other parameters match the command line flags of the same name.\n",
    "<br>\n",
    "<br>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "##### To query the SRA database for ribosome profiling, expecting an output of verbosity level 2, and returning at most 5 entries, we can do the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/data/github/pysradb/pysradb/utils.py:14: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from tqdm.autonotebook import tqdm\n",
      "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  5.58it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  study_accession experiment_accession  \\\n",
      "0       SRP630668          SRX30794203   \n",
      "1       SRP630668          SRX30794202   \n",
      "2       SRP630668          SRX30794201   \n",
      "3       SRP630668          SRX30794200   \n",
      "4       ERP179251          ERX14883219   \n",
      "\n",
      "                                    experiment_title sample_taxon_id  \\\n",
      "0                shRNA treated replicate 2, Ribo-seq            9606   \n",
      "1                shRNA treated replicate 1, Ribo-seq            9606   \n",
      "2             negative control replicate 2, Ribo-seq            9606   \n",
      "3             negative control replicate 1, Ribo-seq            9606   \n",
      "4  Ribosome profiling and RNA-seq of Salmonella i...           10090   \n",
      "\n",
      "  sample_scientific_name experiment_library_strategy  \\\n",
      "0           Homo sapiens                    Ribo-seq   \n",
      "1           Homo sapiens                    Ribo-seq   \n",
      "2           Homo sapiens                    Ribo-seq   \n",
      "3           Homo sapiens                    Ribo-seq   \n",
      "4           Mus musculus                       OTHER   \n",
      "\n",
      "  experiment_library_source experiment_library_selection sample_accession  \\\n",
      "0            TRANSCRIPTOMIC           size fractionation      SRS26807700   \n",
      "1            TRANSCRIPTOMIC           size fractionation      SRS26807699   \n",
      "2            TRANSCRIPTOMIC           size fractionation      SRS26807698   \n",
      "3            TRANSCRIPTOMIC           size fractionation      SRS26807697   \n",
      "4                     OTHER                        other      ERS26530900   \n",
      "\n",
      "     sample_alias experiment_instrument_model pool_member_spots  run_1_size  \\\n",
      "0        sh2_ribo     Illumina NovaSeq X Plus          57288768  1354504305   \n",
      "1        sh1_ribo     Illumina NovaSeq X Plus          61293382  1431873887   \n",
      "2        nc2_ribo     Illumina NovaSeq X Plus          85858060  2060965292   \n",
      "3        nc1_ribo     Illumina NovaSeq X Plus          61189912  1474261116   \n",
      "4  SAMEA119980013                NextSeq 2000          14962757        <NA>   \n",
      "\n",
      "  run_1_accession run_1_total_spots run_1_total_bases  \n",
      "0     SRR35744894          57288768        4296657600  \n",
      "1     SRR35744895          61293382        4597003650  \n",
      "2     SRR35744896          85858060        6439354500  \n",
      "3     SRR35744897          61189912        4589243400  \n",
      "4     ERR15479323              <NA>              <NA>  \n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from pysradb.search import SraSearch\n",
    "\n",
    "instance = SraSearch(verbosity=2, return_max=5, query=\"ribosome profiling\")\n",
    "instance.search()\n",
    "df = instance.get_df()\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "### Quickstart\n",
    "\n",
    "##### To query ENA instead, replace the SraSearch class with the EnaSearch class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Empty DataFrame\n",
      "Columns: []\n",
      "Index: []\n"
     ]
    }
   ],
   "source": [
    "from pysradb.search import EnaSearch\n",
    "\n",
    "instance = EnaSearch(verbosity=2, return_max=5, query=\"ribosome profiling\")\n",
    "instance.search()\n",
    "df = instance.get_df()\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### To query GEO DataSets instead and retrieve the metadata of linked entries in SRA:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  5.53it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  study_accession experiment_accession  \\\n",
      "0       SRP619527          SRX30454639   \n",
      "1       SRP619527          SRX30454638   \n",
      "2       SRP619527          SRX30454637   \n",
      "3       SRP619527          SRX30454636   \n",
      "4       SRP618293          SRX30400919   \n",
      "\n",
      "                                    experiment_title sample_taxon_id  \\\n",
      "0  GSM9233466: Eef1g-cKO leptotene and zygotene_r...           10090   \n",
      "1  GSM9233465: Eef1g-cKO leptotene and zygotene_r...           10090   \n",
      "2  GSM9233464: Control leptotene and zygotene_rep...           10090   \n",
      "3  GSM9233463: Control leptotene and zygotene_rep...           10090   \n",
      "4  GSM9224243: Eef1g cKO leptotene/zygotene sperm...           10090   \n",
      "\n",
      "  sample_scientific_name experiment_library_strategy  \\\n",
      "0           Mus musculus                       OTHER   \n",
      "1           Mus musculus                       OTHER   \n",
      "2           Mus musculus                       OTHER   \n",
      "3           Mus musculus                       OTHER   \n",
      "4           Mus musculus                     RNA-Seq   \n",
      "\n",
      "  experiment_library_source experiment_library_selection sample_accession  \\\n",
      "0            TRANSCRIPTOMIC                        other      SRS26489458   \n",
      "1            TRANSCRIPTOMIC                        other      SRS26489457   \n",
      "2            TRANSCRIPTOMIC                        other      SRS26489456   \n",
      "3            TRANSCRIPTOMIC                        other      SRS26489455   \n",
      "4            TRANSCRIPTOMIC                         cDNA      SRS26437528   \n",
      "\n",
      "  sample_alias experiment_instrument_model pool_member_spots   run_1_size  \\\n",
      "0   GSM9233466       Illumina NovaSeq 6000          69632843   8660602662   \n",
      "1   GSM9233465       Illumina NovaSeq 6000          98921214  12243831179   \n",
      "2   GSM9233464       Illumina NovaSeq 6000          38336833   4708346735   \n",
      "3   GSM9233463       Illumina NovaSeq 6000          88212319  11000694382   \n",
      "4   GSM9224243          Illumina NovaSeq X          15236947   1826954111   \n",
      "\n",
      "  run_1_accession run_1_total_spots run_1_total_bases  \n",
      "0     SRR35360226          69632843       20889852900  \n",
      "1     SRR35360227          98921214       29676364200  \n",
      "2     SRR35360228          38336833       11501049900  \n",
      "3     SRR35360229          88212319       26463695700  \n",
      "4     SRR35298842          15236947        4571084100  \n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from pysradb.search import GeoSearch\n",
    "\n",
    "instance = GeoSearch(verbosity=2, return_max=5, geo_query=\"ribosome profiling\")\n",
    "instance.search()\n",
    "df = instance.get_df()\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "##### To query GEO DataSets with a publication_date filter and display publication dates in the results:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "No results found for the following search query: \n",
      " SRA: {'query': 'sra gds[Filter]', 'accession': None, 'organism': None, 'layout': None, 'mbases': None, 'publication_date': '01-09-2024:30-09-2024', 'platform': None, 'selection': None, 'source': None, 'strategy': None, 'title': None}\n",
      "GEO DataSets: {'query': 'RNA-Seq AND gds sra[Filter]', 'dataset_type': None, 'entry_type': None, 'publication_date': '01-09-2024:30-09-2024', 'organism': None}\n",
      "Empty DataFrame\n",
      "Columns: []\n",
      "Index: []\n"
     ]
    }
   ],
   "source": [
    "from pysradb.search import GeoSearch\n",
    "\n",
    "# Search for RNA-Seq datasets published in September 2024\n",
    "# Using verbosity=3 to get all available fields including publication_date\n",
    "instance = GeoSearch(\n",
    "    verbosity=3,\n",
    "    return_max=5,\n",
    "    geo_query=\"RNA-Seq\",\n",
    "    publication_date=\"01-09-2024:30-09-2024\",\n",
    ")\n",
    "instance.search()\n",
    "df = instance.get_df()\n",
    "\n",
    "# Display select columns including publication_date\n",
    "if not df.empty and \"publication_date\" in df.columns:\n",
    "    cols_to_show = [\n",
    "        \"study_accession\",\n",
    "        \"experiment_accession\",\n",
    "        \"sample_scientific_name\",\n",
    "        \"experiment_library_strategy\",\n",
    "        \"publication_date\",\n",
    "    ]\n",
    "    available_cols = [c for c in cols_to_show if c in df.columns]\n",
    "    print(df[available_cols])\n",
    "else:\n",
    "    print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "### Error Handling\n",
    "\n",
    "##### Unless suppress_validation is set to True, a query field with an invalid entry raises IncorrectFieldException. For fields with a fixed vocabulary, such as \"selection\", the message lists every acceptable input:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "ename": "IncorrectFieldException",
     "evalue": "Incorrect selection: Mudkip\n--selection must be one of the following: \n5-methylcytidine antibody, CAGE, ChIP, ChIP-Seq, DNase, HMPR, Hybrid Selection,  \nInverse rRNA, Inverse rRNA selection, MBD2 protein methyl-CpG binding domain, \nMDA, MF, MNase, MSLL, Oligo-dT, PCR, PolyA, RACE, RANDOM, RANDOM PCR, RT-PCR,  \nReduced Representation, Restriction Digest, cDNA, cDNA_oligo_dT, cDNA_randomPriming \nother, padlock probes capture method, repeat fractionation, size fractionation, \nunspecified\n\n",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mIncorrectFieldException\u001b[0m                   Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[6], line 2\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[38;5;66;03m# 1. Invalid query entered for \"selection\"\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m \u001b[43mSraSearch\u001b[49m\u001b[43m(\u001b[49m\u001b[43mselection\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mMudkip\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n",
      "File \u001b[0;32m/data/github/pysradb/pysradb/search.py:735\u001b[0m, in \u001b[0;36mSraSearch.__init__\u001b[0;34m(self, verbosity, return_max, query, accession, organism, layout, mbases, publication_date, platform, selection, source, strategy, title, suppress_validation)\u001b[0m\n\u001b[1;32m    718\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21m__init__\u001b[39m(\n\u001b[1;32m    719\u001b[0m     \u001b[38;5;28mself\u001b[39m,\n\u001b[1;32m    720\u001b[0m     verbosity\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m2\u001b[39m,\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m    733\u001b[0m     suppress_validation\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m,\n\u001b[1;32m    734\u001b[0m ):\n\u001b[0;32m--> 735\u001b[0m     \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[38;5;21;43m__init__\u001b[39;49m\u001b[43m(\u001b[49m\n\u001b[1;32m    736\u001b[0m \u001b[43m        \u001b[49m\u001b[43mverbosity\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    737\u001b[0m \u001b[43m        \u001b[49m\u001b[43mreturn_max\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    738\u001b[0m \u001b[43m        \u001b[49m\u001b[43mquery\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    739\u001b[0m \u001b[43m        \u001b[49m\u001b[43maccession\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    740\u001b[0m \u001b[43m        \u001b[49m\u001b[43morganism\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    741\u001b[0m \u001b[43m        \u001b[49m\u001b[43mlayout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    742\u001b[0m \u001b[43m        \u001b[49m\u001b[43mmbases\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    743\u001b[0m \u001b[43m        \u001b[49m\u001b[43mpublication_date\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    744\u001b[0m \u001b[43m        \u001b[49m\u001b[43mplatform\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    745\u001b[0m \u001b[43m        
\u001b[49m\u001b[43mselection\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    746\u001b[0m \u001b[43m        \u001b[49m\u001b[43msource\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    747\u001b[0m \u001b[43m        \u001b[49m\u001b[43mstrategy\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    748\u001b[0m \u001b[43m        \u001b[49m\u001b[43mtitle\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    749\u001b[0m \u001b[43m        \u001b[49m\u001b[43msuppress_validation\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    750\u001b[0m \u001b[43m    \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m    751\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mentries \u001b[38;5;241m=\u001b[39m {}\n\u001b[1;32m    752\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mnumber_entries \u001b[38;5;241m=\u001b[39m \u001b[38;5;241m0\u001b[39m\n",
      "File \u001b[0;32m/data/github/pysradb/pysradb/search.py:152\u001b[0m, in \u001b[0;36mQuerySearch.__init__\u001b[0;34m(self, verbosity, return_max, query, accession, organism, layout, mbases, publication_date, platform, selection, source, strategy, title, suppress_validation)\u001b[0m\n\u001b[1;32m    150\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m MissingQueryException()\n\u001b[1;32m    151\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m suppress_validation:\n\u001b[0;32m--> 152\u001b[0m     \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_validate_fields\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m    153\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mstats \u001b[38;5;241m=\u001b[39m {\n\u001b[1;32m    154\u001b[0m     \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstudy\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m-\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m    155\u001b[0m     \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mexperiment\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m-\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m    167\u001b[0m     \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcount_stdev\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m-\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m    168\u001b[0m }\n\u001b[1;32m    169\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mplot_objects \u001b[38;5;241m=\u001b[39m {}\n",
      "File \u001b[0;32m/data/github/pysradb/pysradb/search.py:449\u001b[0m, in \u001b[0;36mQuerySearch._validate_fields\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m    447\u001b[0m         \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mfields[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstrategy\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m output[\u001b[38;5;241m0\u001b[39m]\n\u001b[1;32m    448\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m message:\n\u001b[0;32m--> 449\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m IncorrectFieldException(message)\n",
      "\u001b[0;31mIncorrectFieldException\u001b[0m: Incorrect selection: Mudkip\n--selection must be one of the following: \n5-methylcytidine antibody, CAGE, ChIP, ChIP-Seq, DNase, HMPR, Hybrid Selection,  \nInverse rRNA, Inverse rRNA selection, MBD2 protein methyl-CpG binding domain, \nMDA, MF, MNase, MSLL, Oligo-dT, PCR, PolyA, RACE, RANDOM, RANDOM PCR, RT-PCR,  \nReduced Representation, Restriction Digest, cDNA, cDNA_oligo_dT, cDNA_randomPriming \nother, padlock probes capture method, repeat fractionation, size fractionation, \nunspecified\n\n"
     ]
    }
   ],
   "source": [
    "# 1. Invalid query entered for \"selection\"\n",
    "SraSearch(selection=\"Mudkip\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "ename": "IncorrectFieldException",
     "evalue": "Multiple potential matches have been identified for metagenomic viral rna :\n['METAGENOMIC', 'VIRAL RNA']\nPlease check your input.\n\n",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mIncorrectFieldException\u001b[0m                   Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[7], line 2\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[38;5;66;03m# 2. Ambiguous query entered for \"source\":\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m \u001b[43mEnaSearch\u001b[49m\u001b[43m(\u001b[49m\u001b[43msource\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mmetagenomic viral rna \u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n",
      "File \u001b[0;32m/data/github/pysradb/pysradb/search.py:152\u001b[0m, in \u001b[0;36mQuerySearch.__init__\u001b[0;34m(self, verbosity, return_max, query, accession, organism, layout, mbases, publication_date, platform, selection, source, strategy, title, suppress_validation)\u001b[0m\n\u001b[1;32m    150\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m MissingQueryException()\n\u001b[1;32m    151\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m suppress_validation:\n\u001b[0;32m--> 152\u001b[0m     \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_validate_fields\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m    153\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mstats \u001b[38;5;241m=\u001b[39m {\n\u001b[1;32m    154\u001b[0m     \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstudy\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m-\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m    155\u001b[0m     \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mexperiment\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m-\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m    167\u001b[0m     \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcount_stdev\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m-\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m    168\u001b[0m }\n\u001b[1;32m    169\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mplot_objects \u001b[38;5;241m=\u001b[39m {}\n",
      "File \u001b[0;32m/data/github/pysradb/pysradb/search.py:449\u001b[0m, in \u001b[0;36mQuerySearch._validate_fields\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m    447\u001b[0m         \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mfields[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstrategy\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m output[\u001b[38;5;241m0\u001b[39m]\n\u001b[1;32m    448\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m message:\n\u001b[0;32m--> 449\u001b[0m     \u001b[38;5;28;01mraise\u001b[39;00m IncorrectFieldException(message)\n",
      "\u001b[0;31mIncorrectFieldException\u001b[0m: Multiple potential matches have been identified for metagenomic viral rna :\n['METAGENOMIC', 'VIRAL RNA']\nPlease check your input.\n\n"
     ]
    }
   ],
   "source": [
    "# 2. Ambiguous query entered for \"source\":\n",
    "EnaSearch(source=\"metagenomic viral rna \")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "### Usage Examples:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 1. Checking the help message on terminal:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usage: pysradb search [-h] [-o SAVETO] [-s] [-g [GRAPHS]] [-d {ena,geo,sra}]\n",
      "                      [-v {0,1,2,3}] [--run-description] [--detailed] [-m MAX]\n",
      "                      [-q QUERY [QUERY ...]] [-A ACCESSION]\n",
      "                      [-O ORGANISM [ORGANISM ...]] [-L {SINGLE,PAIRED}]\n",
      "                      [-M MBASES] [-D PUBLICATION_DATE]\n",
      "                      [-P PLATFORM [PLATFORM ...]]\n",
      "                      [-E SELECTION [SELECTION ...]] [-C SOURCE [SOURCE ...]]\n",
      "                      [-S STRATEGY [STRATEGY ...]] [-T TITLE [TITLE ...]] [-I]\n",
      "                      [-G GEO_QUERY [GEO_QUERY ...]]\n",
      "                      [-Y GEO_DATASET_TYPE [GEO_DATASET_TYPE ...]]\n",
      "                      [-Z GEO_ENTRY_TYPE [GEO_ENTRY_TYPE ...]]\n",
      "\n",
      "options:\n",
      "  -h, --help            show this help message and exit\n",
      "  -o SAVETO, --saveto SAVETO\n",
      "                        Save search result dataframe to file\n",
      "  -s, --stats           Displays some useful statistics for the search\n",
      "                        results.\n",
      "  -g [GRAPHS], --graphs [GRAPHS]\n",
      "                        Generates graphs to illustrate the search result. By\n",
      "                        default all graphs are generated. Alternatively,\n",
      "                        select a subset from the options below in a space-\n",
      "                        separated string: daterange, organism, source,\n",
      "                        selection, platform, basecount\n",
      "  -d {ena,geo,sra}, --db {ena,geo,sra}\n",
      "                        Select the db API (sra, ena, or geo) to query, default\n",
      "                        = sra. Note: pysradb search works slightly differently\n",
      "                        when db = geo. Please refer to 'pysradb search --geo-\n",
      "                        info' for more details.\n",
      "  -v {0,1,2,3}, --verbosity {0,1,2,3}\n",
      "                        Level of search result details (0, 1, 2 or 3), default\n",
      "                        = 2 0: run accession only 1: run accession and\n",
      "                        experiment title 2: accession numbers, titles and\n",
      "                        sequencing information 3: records in 2 and other\n",
      "                        information such as download url, sample attributes,\n",
      "                        etc\n",
      "  --run-description     Displays run accessions and descriptions only.\n",
      "                        Equivalent to --verbosity 1\n",
      "  --detailed            Displays detailed search results. Equivalent to\n",
      "                        --verbosity 3.\n",
      "  -m MAX, --max MAX     Maximum number of entries to return, default = 20\n",
      "  -q QUERY [QUERY ...], --query QUERY [QUERY ...]\n",
      "                        Main query string. Note that if no query is supplied,\n",
      "                        at least one of the following flags must be present:\n",
      "  -A ACCESSION, --accession ACCESSION\n",
      "                        Accession number\n",
      "  -O ORGANISM [ORGANISM ...], --organism ORGANISM [ORGANISM ...]\n",
      "                        Scientific name of the sample organism\n",
      "  -L {SINGLE,PAIRED}, --layout {SINGLE,PAIRED}\n",
      "                        Library layout. Accepts either SINGLE or PAIRED\n",
      "  -M MBASES, --mbases MBASES\n",
      "                        Size of the sample rounded to the nearest megabase\n",
      "  -D PUBLICATION_DATE, --publication-date PUBLICATION_DATE\n",
      "                        Publication date of the run in the format dd-mm-yyyy.\n",
      "                        If a date range is desired, enter the start date,\n",
      "                        followed by end date, separated by a colon ':'.\n",
      "                        Example: 01-01-2010:31-12-2010\n",
      "  -P PLATFORM [PLATFORM ...], --platform PLATFORM [PLATFORM ...]\n",
      "                        Sequencing platform\n",
      "  -E SELECTION [SELECTION ...], --selection SELECTION [SELECTION ...]\n",
      "                        Library selection\n",
      "  -C SOURCE [SOURCE ...], --source SOURCE [SOURCE ...]\n",
      "                        Library source\n",
      "  -S STRATEGY [STRATEGY ...], --strategy STRATEGY [STRATEGY ...]\n",
      "                        Library preparation strategy\n",
      "  -T TITLE [TITLE ...], --title TITLE [TITLE ...]\n",
      "                        Experiment title\n",
      "  -I, --geo-info        Displays information on how to query GEO DataSets via\n",
      "                        'pysradb search --db geo ...', including accepted\n",
      "                        inputs for -G/--geo-query, -Y/--geo-dataset-type and\n",
      "                        -Z/--geo-entry-type.\n",
      "  -G GEO_QUERY [GEO_QUERY ...], --geo-query GEO_QUERY [GEO_QUERY ...]\n",
      "                        Main query string for GEO DataSet. This flag is only\n",
      "                        used when db is set to be geo.Please refer to 'pysradb\n",
      "                        search --geo-info' for more details.\n",
      "  -Y GEO_DATASET_TYPE [GEO_DATASET_TYPE ...], --geo-dataset-type GEO_DATASET_TYPE [GEO_DATASET_TYPE ...]\n",
      "                        GEO DataSet Type. This flag is only used when --db is\n",
      "                        set to be geo.Please refer to 'pysradb search --geo-\n",
      "                        info' for more details.\n",
      "  -Z GEO_ENTRY_TYPE [GEO_ENTRY_TYPE ...], --geo-entry-type GEO_ENTRY_TYPE [GEO_ENTRY_TYPE ...]\n",
      "                        GEO Entry Type. This flag is only used when --db is\n",
      "                        set to be geo.Please refer to 'pysradb search --geo-\n",
      "                        info' for more details.\n"
     ]
    }
   ],
   "source": [
    "!pysradb search -h"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "##### 2. Searching for 5 illumina sequences related to the covid-19 pandemic on ENA, using the terminal:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "study_accession\texperiment_accession\texperiment_title\tdescription\ttax_id\tscientific_name\tlibrary_strategy\tlibrary_source\tlibrary_selection\tsample_accession\tsample_title\tinstrument_model\trun_accession\tread_count\tbase_count\n",
      "PRJEB65477\tERX11285566\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\t9606\tHomo sapiens\tRNA-Seq\tTRANSCRIPTOMIC\tInverse rRNA\tSAMEA114308724\tSample 1\tIllumina HiSeq 4000\tERR11901279\t131354572\t19834540372\n",
      "PRJEB65477\tERX11285567\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\t9606\tHomo sapiens\tRNA-Seq\tTRANSCRIPTOMIC\tInverse rRNA\tSAMEA114308725\tSample 10\tIllumina HiSeq 4000\tERR11901280\t95225496\t14379049896\n",
      "PRJEB65477\tERX11285573\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\t9606\tHomo sapiens\tRNA-Seq\tTRANSCRIPTOMIC\tInverse rRNA\tSAMEA114308731\tSample 105\tIllumina HiSeq 4000\tERR11901286\t128864216\t19458496616\n",
      "PRJEB65477\tERX11285574\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\t9606\tHomo sapiens\tRNA-Seq\tTRANSCRIPTOMIC\tInverse rRNA\tSAMEA114308732\tSample 106\tIllumina HiSeq 4000\tERR11901287\t123505478\t18649327178\n",
      "PRJEB65477\tERX11285577\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\tIllumina HiSeq 4000 sequencing: Blood RNA sequencing was performed on a cohort of adults attending the Emergency Department with suspected infection who had subsequently-confirmed viral, bacterial, COVID19 and healthy controls\t9606\tHomo sapiens\tRNA-Seq\tTRANSCRIPTOMIC\tInverse rRNA\tSAMEA114308735\tSample 109\tIllumina HiSeq 4000\tERR11901290\t120866712\t18250873512\n"
     ]
    }
   ],
   "source": [
    "!pysradb search -q covid19 --platform illumina --db ena -m 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "##### 3. Searching for illumina sequences related to the covid-19 pandemic on ENA, using the terminal, and saving the results in a nicely formatted text file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pysradb search -q covid19 --db ena --saveto query.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "##### 4. Searching for illumina sequences related to the covid-19 pandemic on ENA, within python: (outputs a pandas dataframe)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   study_accession experiment_accession  \\\n",
      "0       PRJEB65477          ERX11285566   \n",
      "1       PRJEB65477          ERX11285567   \n",
      "2       PRJEB65477          ERX11285573   \n",
      "3       PRJEB65477          ERX11285574   \n",
      "4       PRJEB65477          ERX11285577   \n",
      "5       PRJEB65477          ERX11285583   \n",
      "6       PRJEB65477          ERX11285586   \n",
      "7       PRJEB65477          ERX11285592   \n",
      "8       PRJEB65477          ERX11285595   \n",
      "9       PRJEB65477          ERX11285599   \n",
      "10      PRJEB65477          ERX11285600   \n",
      "11      PRJEB65477          ERX11285605   \n",
      "12      PRJEB65477          ERX11285607   \n",
      "13      PRJEB65477          ERX11285609   \n",
      "14      PRJEB65477          ERX11285611   \n",
      "15      PRJEB65477          ERX11285620   \n",
      "16      PRJEB65477          ERX11285623   \n",
      "17      PRJEB65477          ERX11285625   \n",
      "18      PRJEB65477          ERX11285629   \n",
      "19      PRJEB65477          ERX11285633   \n",
      "\n",
      "                                     experiment_title  \\\n",
      "0   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "1   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "2   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "3   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "4   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "5   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "6   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "7   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "8   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "9   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "10  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "11  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "12  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "13  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "14  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "15  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "16  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "17  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "18  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "19  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   \n",
      "\n",
      "                                          description tax_id scientific_name  \\\n",
      "0   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "1   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "2   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "3   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "4   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "5   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "6   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "7   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "8   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "9   Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "10  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "11  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "12  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "13  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "14  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "15  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "16  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "17  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "18  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "19  Illumina HiSeq 4000 sequencing: Blood RNA sequ...   9606    Homo sapiens   \n",
      "\n",
      "   library_strategy  library_source library_selection sample_accession  \\\n",
      "0           RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308724   \n",
      "1           RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308725   \n",
      "2           RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308731   \n",
      "3           RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308732   \n",
      "4           RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308735   \n",
      "5           RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308741   \n",
      "6           RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308744   \n",
      "7           RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308750   \n",
      "8           RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308753   \n",
      "9           RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308757   \n",
      "10          RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308758   \n",
      "11          RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308763   \n",
      "12          RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308765   \n",
      "13          RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308767   \n",
      "14          RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308769   \n",
      "15          RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308778   \n",
      "16          RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308781   \n",
      "17          RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308783   \n",
      "18          RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308787   \n",
      "19          RNA-Seq  TRANSCRIPTOMIC      Inverse rRNA   SAMEA114308791   \n",
      "\n",
      "   sample_title     instrument_model run_accession read_count   base_count  \n",
      "0      Sample 1  Illumina HiSeq 4000   ERR11901279  131354572  19834540372  \n",
      "1     Sample 10  Illumina HiSeq 4000   ERR11901280   95225496  14379049896  \n",
      "2    Sample 105  Illumina HiSeq 4000   ERR11901286  128864216  19458496616  \n",
      "3    Sample 106  Illumina HiSeq 4000   ERR11901287  123505478  18649327178  \n",
      "4    Sample 109  Illumina HiSeq 4000   ERR11901290  120866712  18250873512  \n",
      "5    Sample 114  Illumina HiSeq 4000   ERR11901296  126581586  19113819486  \n",
      "6    Sample 117  Illumina HiSeq 4000   ERR11901299  126421026  19089574926  \n",
      "7    Sample 122  Illumina HiSeq 4000   ERR11901305  132194842  19961421142  \n",
      "8    Sample 125  Illumina HiSeq 4000   ERR11901308  128325648  19377172848  \n",
      "9    Sample 129  Illumina HiSeq 4000   ERR11901312  113973238  17209958938  \n",
      "10    Sample 13  Illumina HiSeq 4000   ERR11901313  103563334  15638063434  \n",
      "11   Sample 134  Illumina HiSeq 4000   ERR11901318  125759128  18989628328  \n",
      "12    Sample 15  Illumina HiSeq 4000   ERR11901320  102246922  15439285222  \n",
      "13    Sample 17  Illumina HiSeq 4000   ERR11901322  103868320  15684116320  \n",
      "14    Sample 19  Illumina HiSeq 4000   ERR11901324  102445734  15469305834  \n",
      "15    Sample 27  Illumina HiSeq 4000   ERR11901333  103177112  15579743912  \n",
      "16     Sample 3  Illumina HiSeq 4000   ERR11901336  104470270  15775010770  \n",
      "17    Sample 31  Illumina HiSeq 4000   ERR11901338   95555226  14428839126  \n",
      "18    Sample 35  Illumina HiSeq 4000   ERR11901342  104135584  15724473184  \n",
      "19    Sample 39  Illumina HiSeq 4000   ERR11901346  103208434  15584473534  \n"
     ]
    }
   ],
   "source": [
    "from pysradb.search import EnaSearch\n",
    "\n",
    "instance = EnaSearch(2, 20, query=\"covid19\", platform=\"illumina\")\n",
    "instance.search()\n",
    "df = instance.get_df()\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "##### 5. More complex example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>study_accession</th>\n",
       "      <th>experiment_accession</th>\n",
       "      <th>experiment_title</th>\n",
       "      <th>description</th>\n",
       "      <th>tax_id</th>\n",
       "      <th>scientific_name</th>\n",
       "      <th>library_strategy</th>\n",
       "      <th>library_source</th>\n",
       "      <th>library_selection</th>\n",
       "      <th>sample_accession</th>\n",
       "      <th>...</th>\n",
       "      <th>tag</th>\n",
       "      <th>target_gene</th>\n",
       "      <th>tax_lineage</th>\n",
       "      <th>taxonomic_classification</th>\n",
       "      <th>taxonomic_identity_marker</th>\n",
       "      <th>temperature</th>\n",
       "      <th>tissue_lib</th>\n",
       "      <th>tissue_type</th>\n",
       "      <th>transposase_protocol</th>\n",
       "      <th>variety</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>PRJNA549801</td>\n",
       "      <td>SRX10904916</td>\n",
       "      <td>Illumina MiSeq sequencing: E. coli library mad...</td>\n",
       "      <td>Illumina MiSeq sequencing: E. coli library mad...</td>\n",
       "      <td>562</td>\n",
       "      <td>Escherichia coli</td>\n",
       "      <td>WGS</td>\n",
       "      <td>GENOMIC</td>\n",
       "      <td>RANDOM</td>\n",
       "      <td>SAMN11656391</td>\n",
       "      <td>...</td>\n",
       "      <td>env_tax;env_tax:freshwater;env_tax:terrestrial...</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>1;131567;2;3379134;1224;1236;91347;543;561;562</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>PRJNA530580</td>\n",
       "      <td>SRX5658689</td>\n",
       "      <td>Illumina MiSeq sequencing: Sequencing of Esche...</td>\n",
       "      <td>Illumina MiSeq sequencing: Sequencing of Esche...</td>\n",
       "      <td>562</td>\n",
       "      <td>Escherichia coli</td>\n",
       "      <td>WGS</td>\n",
       "      <td>GENOMIC</td>\n",
       "      <td>RANDOM</td>\n",
       "      <td>SAMN11319482</td>\n",
       "      <td>...</td>\n",
       "      <td>env_tax;env_tax:freshwater;env_tax:terrestrial...</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>1;131567;2;3379134;1224;1236;91347;543;561;562</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>PRJEB34285</td>\n",
       "      <td>ERX3552102</td>\n",
       "      <td>Illumina MiSeq paired end sequencing: Raw read...</td>\n",
       "      <td>Illumina MiSeq paired end sequencing: Raw read...</td>\n",
       "      <td>562</td>\n",
       "      <td>Escherichia coli</td>\n",
       "      <td>WGS</td>\n",
       "      <td>GENOMIC</td>\n",
       "      <td>RANDOM</td>\n",
       "      <td>SAMEA5957573</td>\n",
       "      <td>...</td>\n",
       "      <td>env_tax;env_tax:freshwater;env_tax:terrestrial...</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>1;131567;2;3379134;1224;1236;91347;543;561;562</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3 rows × 192 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "  study_accession experiment_accession  \\\n",
       "0     PRJNA549801          SRX10904916   \n",
       "1     PRJNA530580           SRX5658689   \n",
       "2      PRJEB34285           ERX3552102   \n",
       "\n",
       "                                    experiment_title  \\\n",
       "0  Illumina MiSeq sequencing: E. coli library mad...   \n",
       "1  Illumina MiSeq sequencing: Sequencing of Esche...   \n",
       "2  Illumina MiSeq paired end sequencing: Raw read...   \n",
       "\n",
       "                                         description tax_id   scientific_name  \\\n",
       "0  Illumina MiSeq sequencing: E. coli library mad...    562  Escherichia coli   \n",
       "1  Illumina MiSeq sequencing: Sequencing of Esche...    562  Escherichia coli   \n",
       "2  Illumina MiSeq paired end sequencing: Raw read...    562  Escherichia coli   \n",
       "\n",
       "  library_strategy library_source library_selection sample_accession  ...  \\\n",
       "0              WGS        GENOMIC            RANDOM     SAMN11656391  ...   \n",
       "1              WGS        GENOMIC            RANDOM     SAMN11319482  ...   \n",
       "2              WGS        GENOMIC            RANDOM     SAMEA5957573  ...   \n",
       "\n",
       "                                                 tag target_gene  \\\n",
       "0  env_tax;env_tax:freshwater;env_tax:terrestrial...        <NA>   \n",
       "1  env_tax;env_tax:freshwater;env_tax:terrestrial...        <NA>   \n",
       "2  env_tax;env_tax:freshwater;env_tax:terrestrial...        <NA>   \n",
       "\n",
       "                                      tax_lineage taxonomic_classification  \\\n",
       "0  1;131567;2;3379134;1224;1236;91347;543;561;562                     <NA>   \n",
       "1  1;131567;2;3379134;1224;1236;91347;543;561;562                     <NA>   \n",
       "2  1;131567;2;3379134;1224;1236;91347;543;561;562                     <NA>   \n",
       "\n",
       "  taxonomic_identity_marker temperature tissue_lib tissue_type  \\\n",
       "0                      <NA>        <NA>       <NA>        <NA>   \n",
       "1                      <NA>        <NA>       <NA>        <NA>   \n",
       "2                      <NA>        <NA>       <NA>        <NA>   \n",
       "\n",
       "  transposase_protocol variety  \n",
       "0                 <NA>    <NA>  \n",
       "1                 <NA>    <NA>  \n",
       "2                 <NA>    <NA>  \n",
       "\n",
       "[3 rows x 192 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from pysradb.search import EnaSearch\n",
    "\n",
    "instance = EnaSearch(\n",
    "    3,\n",
    "    100000,\n",
    "    organism=\"Escherichia coli\",\n",
    "    layout=\"Paired\",\n",
    "    mbases=10,\n",
    "    publication_date=\"01-01-2019:31-12-2021\",\n",
    "    platform=\"Illumina\",\n",
    "    selection=\"random\",\n",
    "    source=\"Genomic\",\n",
    "    strategy=\"WGS\",\n",
    ")\n",
    "instance.search()\n",
    "df = instance.get_df()\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sorted(df.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:02<00:00, 44.88it/s]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>study_accession</th>\n",
       "      <th>experiment_accession</th>\n",
       "      <th>experiment_title</th>\n",
       "      <th>sample_taxon_id</th>\n",
       "      <th>sample_scientific_name</th>\n",
       "      <th>experiment_library_strategy</th>\n",
       "      <th>experiment_library_source</th>\n",
       "      <th>experiment_library_selection</th>\n",
       "      <th>sample_accession</th>\n",
       "      <th>sample_alias</th>\n",
       "      <th>...</th>\n",
       "      <th>study_link_1_type</th>\n",
       "      <th>study_link_1_value_1</th>\n",
       "      <th>study_link_1_value_2</th>\n",
       "      <th>study_study_abstract</th>\n",
       "      <th>study_study_title</th>\n",
       "      <th>study_study_type_existing_study_type</th>\n",
       "      <th>submission_accession</th>\n",
       "      <th>submission_alias</th>\n",
       "      <th>submission_center_name</th>\n",
       "      <th>submission_lab_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>SRP531137</td>\n",
       "      <td>SRX25997822</td>\n",
       "      <td>GSM8501051: HEK293, Prm1 negative, Replicate 3...</td>\n",
       "      <td>9606</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>Hi-C</td>\n",
       "      <td>GENOMIC</td>\n",
       "      <td>other</td>\n",
       "      <td>SRS22571584</td>\n",
       "      <td>GSM8501051</td>\n",
       "      <td>...</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>Although the spatial organization of the genom...</td>\n",
       "      <td>Large-scale manipulation of radial positioning...</td>\n",
       "      <td>Other</td>\n",
       "      <td>SRA1964865</td>\n",
       "      <td>SUB14711595</td>\n",
       "      <td>Technion - Israel Institute of Technology</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>SRP531137</td>\n",
       "      <td>SRX25997821</td>\n",
       "      <td>GSM8501050: HEK293, Prm1 negative, Replicate 2...</td>\n",
       "      <td>9606</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>Hi-C</td>\n",
       "      <td>GENOMIC</td>\n",
       "      <td>other</td>\n",
       "      <td>SRS22571583</td>\n",
       "      <td>GSM8501050</td>\n",
       "      <td>...</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>Although the spatial organization of the genom...</td>\n",
       "      <td>Large-scale manipulation of radial positioning...</td>\n",
       "      <td>Other</td>\n",
       "      <td>SRA1964865</td>\n",
       "      <td>SUB14711595</td>\n",
       "      <td>Technion - Israel Institute of Technology</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>SRP531137</td>\n",
       "      <td>SRX25997820</td>\n",
       "      <td>GSM8501049: HEK293, Prm1 negative, Replicate 1...</td>\n",
       "      <td>9606</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>Hi-C</td>\n",
       "      <td>GENOMIC</td>\n",
       "      <td>other</td>\n",
       "      <td>SRS22571582</td>\n",
       "      <td>GSM8501049</td>\n",
       "      <td>...</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>Although the spatial organization of the genom...</td>\n",
       "      <td>Large-scale manipulation of radial positioning...</td>\n",
       "      <td>Other</td>\n",
       "      <td>SRA1964865</td>\n",
       "      <td>SUB14711595</td>\n",
       "      <td>Technion - Israel Institute of Technology</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>SRP531137</td>\n",
       "      <td>SRX25997819</td>\n",
       "      <td>GSM8501048: HEK293, Prm1 positive DAPI low, Hi...</td>\n",
       "      <td>9606</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>Hi-C</td>\n",
       "      <td>GENOMIC</td>\n",
       "      <td>other</td>\n",
       "      <td>SRS22571581</td>\n",
       "      <td>GSM8501048</td>\n",
       "      <td>...</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>Although the spatial organization of the genom...</td>\n",
       "      <td>Large-scale manipulation of radial positioning...</td>\n",
       "      <td>Other</td>\n",
       "      <td>SRA1964865</td>\n",
       "      <td>SUB14711595</td>\n",
       "      <td>Technion - Israel Institute of Technology</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>SRP531137</td>\n",
       "      <td>SRX25997818</td>\n",
       "      <td>GSM8501047: HEK293, Prm1 positive DAPI high, H...</td>\n",
       "      <td>9606</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>Hi-C</td>\n",
       "      <td>GENOMIC</td>\n",
       "      <td>other</td>\n",
       "      <td>SRS22571580</td>\n",
       "      <td>GSM8501047</td>\n",
       "      <td>...</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>Although the spatial organization of the genom...</td>\n",
       "      <td>Large-scale manipulation of radial positioning...</td>\n",
       "      <td>Other</td>\n",
       "      <td>SRA1964865</td>\n",
       "      <td>SUB14711595</td>\n",
       "      <td>Technion - Israel Institute of Technology</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>SRP530052</td>\n",
       "      <td>SRX25934604</td>\n",
       "      <td>GSM8493072: HSS-3; Homo sapiens; RNA-Seq</td>\n",
       "      <td>9606</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>RNA-Seq</td>\n",
       "      <td>TRANSCRIPTOMIC</td>\n",
       "      <td>cDNA</td>\n",
       "      <td>SRS22520997</td>\n",
       "      <td>GSM8493072</td>\n",
       "      <td>...</td>\n",
       "      <td>XREF_LINK</td>\n",
       "      <td>DB: pubmed</td>\n",
       "      <td>ID: 39752487</td>\n",
       "      <td>Fluid shear stress (FSS) from blood flow sense...</td>\n",
       "      <td>cSTAR analysis identifies endothelial cell cyc...</td>\n",
       "      <td>Transcriptome Analysis</td>\n",
       "      <td>SRA1960858</td>\n",
       "      <td>SUB14701816</td>\n",
       "      <td>Systems Biology Ireland, University College Du...</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>SRP530052</td>\n",
       "      <td>SRX25934603</td>\n",
       "      <td>GSM8493071: HSS-2; Homo sapiens; RNA-Seq</td>\n",
       "      <td>9606</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>RNA-Seq</td>\n",
       "      <td>TRANSCRIPTOMIC</td>\n",
       "      <td>cDNA</td>\n",
       "      <td>SRS22520996</td>\n",
       "      <td>GSM8493071</td>\n",
       "      <td>...</td>\n",
       "      <td>XREF_LINK</td>\n",
       "      <td>DB: pubmed</td>\n",
       "      <td>ID: 39752487</td>\n",
       "      <td>Fluid shear stress (FSS) from blood flow sense...</td>\n",
       "      <td>cSTAR analysis identifies endothelial cell cyc...</td>\n",
       "      <td>Transcriptome Analysis</td>\n",
       "      <td>SRA1960858</td>\n",
       "      <td>SUB14701816</td>\n",
       "      <td>Systems Biology Ireland, University College Du...</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>SRP530052</td>\n",
       "      <td>SRX25934602</td>\n",
       "      <td>GSM8493070: HSS-1; Homo sapiens; RNA-Seq</td>\n",
       "      <td>9606</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>RNA-Seq</td>\n",
       "      <td>TRANSCRIPTOMIC</td>\n",
       "      <td>cDNA</td>\n",
       "      <td>SRS22520995</td>\n",
       "      <td>GSM8493070</td>\n",
       "      <td>...</td>\n",
       "      <td>XREF_LINK</td>\n",
       "      <td>DB: pubmed</td>\n",
       "      <td>ID: 39752487</td>\n",
       "      <td>Fluid shear stress (FSS) from blood flow sense...</td>\n",
       "      <td>cSTAR analysis identifies endothelial cell cyc...</td>\n",
       "      <td>Transcriptome Analysis</td>\n",
       "      <td>SRA1960858</td>\n",
       "      <td>SUB14701816</td>\n",
       "      <td>Systems Biology Ireland, University College Du...</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>SRP529928</td>\n",
       "      <td>SRX25928352</td>\n",
       "      <td>GSM8492057: NC1 group2; Homo sapiens; RNA-Seq</td>\n",
       "      <td>9606</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>RNA-Seq</td>\n",
       "      <td>TRANSCRIPTOMIC</td>\n",
       "      <td>cDNA</td>\n",
       "      <td>SRS22515729</td>\n",
       "      <td>GSM8492057</td>\n",
       "      <td>...</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>Adrenoleukodystrophy (ALD) is a rare X-linked ...</td>\n",
       "      <td>Transcriptomic Analysis of Identical Twins wit...</td>\n",
       "      <td>Transcriptome Analysis</td>\n",
       "      <td>SRA1960558</td>\n",
       "      <td>SUB14700372</td>\n",
       "      <td>Southwest Hospital, Army Medical University (T...</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>SRP529928</td>\n",
       "      <td>SRX25928351</td>\n",
       "      <td>GSM8492056: NC1 group1; Homo sapiens; RNA-Seq</td>\n",
       "      <td>9606</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>RNA-Seq</td>\n",
       "      <td>TRANSCRIPTOMIC</td>\n",
       "      <td>cDNA</td>\n",
       "      <td>SRS22515728</td>\n",
       "      <td>GSM8492056</td>\n",
       "      <td>...</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>Adrenoleukodystrophy (ALD) is a rare X-linked ...</td>\n",
       "      <td>Transcriptomic Analysis of Identical Twins wit...</td>\n",
       "      <td>Transcriptome Analysis</td>\n",
       "      <td>SRA1960558</td>\n",
       "      <td>SUB14700372</td>\n",
       "      <td>Southwest Hospital, Army Medical University (T...</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>100 rows × 644 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   study_accession experiment_accession  \\\n",
       "0        SRP531137          SRX25997822   \n",
       "1        SRP531137          SRX25997821   \n",
       "2        SRP531137          SRX25997820   \n",
       "3        SRP531137          SRX25997819   \n",
       "4        SRP531137          SRX25997818   \n",
       "..             ...                  ...   \n",
       "95       SRP530052          SRX25934604   \n",
       "96       SRP530052          SRX25934603   \n",
       "97       SRP530052          SRX25934602   \n",
       "98       SRP529928          SRX25928352   \n",
       "99       SRP529928          SRX25928351   \n",
       "\n",
       "                                     experiment_title sample_taxon_id  \\\n",
       "0   GSM8501051: HEK293, Prm1 negative, Replicate 3...            9606   \n",
       "1   GSM8501050: HEK293, Prm1 negative, Replicate 2...            9606   \n",
       "2   GSM8501049: HEK293, Prm1 negative, Replicate 1...            9606   \n",
       "3   GSM8501048: HEK293, Prm1 positive DAPI low, Hi...            9606   \n",
       "4   GSM8501047: HEK293, Prm1 positive DAPI high, H...            9606   \n",
       "..                                                ...             ...   \n",
       "95           GSM8493072: HSS-3; Homo sapiens; RNA-Seq            9606   \n",
       "96           GSM8493071: HSS-2; Homo sapiens; RNA-Seq            9606   \n",
       "97           GSM8493070: HSS-1; Homo sapiens; RNA-Seq            9606   \n",
       "98      GSM8492057: NC1 group2; Homo sapiens; RNA-Seq            9606   \n",
       "99      GSM8492056: NC1 group1; Homo sapiens; RNA-Seq            9606   \n",
       "\n",
       "   sample_scientific_name experiment_library_strategy  \\\n",
       "0            Homo sapiens                        Hi-C   \n",
       "1            Homo sapiens                        Hi-C   \n",
       "2            Homo sapiens                        Hi-C   \n",
       "3            Homo sapiens                        Hi-C   \n",
       "4            Homo sapiens                        Hi-C   \n",
       "..                    ...                         ...   \n",
       "95           Homo sapiens                     RNA-Seq   \n",
       "96           Homo sapiens                     RNA-Seq   \n",
       "97           Homo sapiens                     RNA-Seq   \n",
       "98           Homo sapiens                     RNA-Seq   \n",
       "99           Homo sapiens                     RNA-Seq   \n",
       "\n",
       "   experiment_library_source experiment_library_selection sample_accession  \\\n",
       "0                    GENOMIC                        other      SRS22571584   \n",
       "1                    GENOMIC                        other      SRS22571583   \n",
       "2                    GENOMIC                        other      SRS22571582   \n",
       "3                    GENOMIC                        other      SRS22571581   \n",
       "4                    GENOMIC                        other      SRS22571580   \n",
       "..                       ...                          ...              ...   \n",
       "95            TRANSCRIPTOMIC                         cDNA      SRS22520997   \n",
       "96            TRANSCRIPTOMIC                         cDNA      SRS22520996   \n",
       "97            TRANSCRIPTOMIC                         cDNA      SRS22520995   \n",
       "98            TRANSCRIPTOMIC                         cDNA      SRS22515729   \n",
       "99            TRANSCRIPTOMIC                         cDNA      SRS22515728   \n",
       "\n",
       "   sample_alias  ... study_link_1_type study_link_1_value_1  \\\n",
       "0    GSM8501051  ...              <NA>                 <NA>   \n",
       "1    GSM8501050  ...              <NA>                 <NA>   \n",
       "2    GSM8501049  ...              <NA>                 <NA>   \n",
       "3    GSM8501048  ...              <NA>                 <NA>   \n",
       "4    GSM8501047  ...              <NA>                 <NA>   \n",
       "..          ...  ...               ...                  ...   \n",
       "95   GSM8493072  ...         XREF_LINK           DB: pubmed   \n",
       "96   GSM8493071  ...         XREF_LINK           DB: pubmed   \n",
       "97   GSM8493070  ...         XREF_LINK           DB: pubmed   \n",
       "98   GSM8492057  ...              <NA>                 <NA>   \n",
       "99   GSM8492056  ...              <NA>                 <NA>   \n",
       "\n",
       "   study_link_1_value_2                               study_study_abstract  \\\n",
       "0                  <NA>  Although the spatial organization of the genom...   \n",
       "1                  <NA>  Although the spatial organization of the genom...   \n",
       "2                  <NA>  Although the spatial organization of the genom...   \n",
       "3                  <NA>  Although the spatial organization of the genom...   \n",
       "4                  <NA>  Although the spatial organization of the genom...   \n",
       "..                  ...                                                ...   \n",
       "95         ID: 39752487  Fluid shear stress (FSS) from blood flow sense...   \n",
       "96         ID: 39752487  Fluid shear stress (FSS) from blood flow sense...   \n",
       "97         ID: 39752487  Fluid shear stress (FSS) from blood flow sense...   \n",
       "98                 <NA>  Adrenoleukodystrophy (ALD) is a rare X-linked ...   \n",
       "99                 <NA>  Adrenoleukodystrophy (ALD) is a rare X-linked ...   \n",
       "\n",
       "                                    study_study_title  \\\n",
       "0   Large-scale manipulation of radial positioning...   \n",
       "1   Large-scale manipulation of radial positioning...   \n",
       "2   Large-scale manipulation of radial positioning...   \n",
       "3   Large-scale manipulation of radial positioning...   \n",
       "4   Large-scale manipulation of radial positioning...   \n",
       "..                                                ...   \n",
       "95  cSTAR analysis identifies endothelial cell cyc...   \n",
       "96  cSTAR analysis identifies endothelial cell cyc...   \n",
       "97  cSTAR analysis identifies endothelial cell cyc...   \n",
       "98  Transcriptomic Analysis of Identical Twins wit...   \n",
       "99  Transcriptomic Analysis of Identical Twins wit...   \n",
       "\n",
       "   study_study_type_existing_study_type submission_accession submission_alias  \\\n",
       "0                                 Other           SRA1964865      SUB14711595   \n",
       "1                                 Other           SRA1964865      SUB14711595   \n",
       "2                                 Other           SRA1964865      SUB14711595   \n",
       "3                                 Other           SRA1964865      SUB14711595   \n",
       "4                                 Other           SRA1964865      SUB14711595   \n",
       "..                                  ...                  ...              ...   \n",
       "95               Transcriptome Analysis           SRA1960858      SUB14701816   \n",
       "96               Transcriptome Analysis           SRA1960858      SUB14701816   \n",
       "97               Transcriptome Analysis           SRA1960858      SUB14701816   \n",
       "98               Transcriptome Analysis           SRA1960558      SUB14700372   \n",
       "99               Transcriptome Analysis           SRA1960558      SUB14700372   \n",
       "\n",
       "                               submission_center_name submission_lab_name  \n",
       "0           Technion - Israel Institute of Technology                <NA>  \n",
       "1           Technion - Israel Institute of Technology                <NA>  \n",
       "2           Technion - Israel Institute of Technology                <NA>  \n",
       "3           Technion - Israel Institute of Technology                <NA>  \n",
       "4           Technion - Israel Institute of Technology                <NA>  \n",
       "..                                                ...                 ...  \n",
       "95  Systems Biology Ireland, University College Du...                <NA>  \n",
       "96  Systems Biology Ireland, University College Du...                <NA>  \n",
       "97  Systems Biology Ireland, University College Du...                <NA>  \n",
       "98  Southwest Hospital, Army Medical University (T...                <NA>  \n",
       "99  Southwest Hospital, Army Medical University (T...                <NA>  \n",
       "\n",
       "[100 rows x 644 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
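    "# See https://github.com/saketkc/pysradb/issues/221\n",
    "# Search GEO by publication-date range (DD-MM-YYYY:DD-MM-YYYY), returning at most 100 records\n",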
    "instance = GeoSearch(\n",
    "    publication_date=\"05-09-2024:06-09-2024\", return_max=100, verbosity=3\n",
    ")\n",
    "instance.search()\n",
    "df = instance.get_df()\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 1000/1000 [00:48<00:00, 20.72it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['GSE276554' 'GSE276553' 'GSE267301' 'GSE276379' 'GSE276473' 'GSE276365'\n",
      " 'GSE276447' 'GSE276438' 'GSE219269' 'GSE276372' 'GSE276338' 'GSE276337'\n",
      " 'GSE276206' 'GSE276309' 'GSE276304' 'GSE276245' 'GSE276204' 'GSE276185'\n",
      " 'GSE276195' 'GSE276192' 'GSE276153' 'GSE276130' 'GSE276122' 'GSE276065'\n",
      " 'GSE276058' 'GSE276038' 'GSE276037' 'GSE275962' 'GSE275896' 'GSE275863'\n",
      " 'GSE275777' 'GSE275778' 'GSE275571' 'GSE253407' 'GSE275211' 'GSE274586'\n",
      " 'GSE274408' 'GSE273907' 'GSE273844' 'GSE273813' 'GSE253145' 'GSE272793'\n",
      " 'GSE272635' 'GSE272394' 'GSE247707' 'GSE247706' 'GSE272252' 'GSE271996'\n",
      " 'GSE271746' 'GSE271667' 'GSE271653' 'GSE264193' 'GSE245033' 'GSE268837'\n",
      " 'GSE267343' 'GSE267342' 'GSE266976' 'GSE266471' 'GSE264212' 'GSE264012'\n",
      " 'GSE263804' 'GSE263798' 'GSE263549' 'GSE263414' 'GSE263441' 'GSE262699'\n",
      " 'GSE262282' 'GSE262272' 'GSE262127' 'GSE262125' 'GSE262126']\n"
     ]
    }
   ],
   "source": [
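    "# Start the range one day earlier, raise return_max to 1000, and list the distinct GEO series (study_alias)\n",
    "instance = GeoSearch(\n",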
    "    publication_date=\"04-09-2024:06-09-2024\", return_max=1000, verbosity=3\n",
    ")\n",
    "instance.search()\n",
    "df = instance.get_df()\n",
    "print(df[\"study_alias\"].unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "##### 6. Equivalent terminal command, with the maximum number of results (`-m`) set to 20:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pysradb search --db ena -m 20 -v 3 --organism Escherichia coli --layout Paired --mbases 100 --publication-date 01-01-2019:31-12-2019 --platform illumina --selection random --source Genomic --strategy wgs"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.18"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}