This notebook demonstrates how to search for Parse Biosciences single-cell sequencing datasets in the Gene Expression Omnibus (GEO) database.
Parse Biosciences provides single-cell RNA sequencing technology based on a combinatorial barcoding approach. This example shows how to find Parse Biosciences datasets, with an optional filter for those published from January 1, 2020 onward.
Parse Biosciences Search¶
This notebook runs the search with pysradb and then summarizes, groups, and saves the results.
[1]:
# Install pysradb if not already installed
try:
    import pysradb
    print(f"pysradb {pysradb.__version__} is already installed")
except ImportError:
    print("Installing pysradb from GitHub...")
    import sys
    !{sys.executable} -m pip install -q git+https://github.com/saketkc/pysradb
    print("pysradb installed successfully!")
pysradb 3.0.0.dev0 is already installed
Setup¶
First, let’s import the necessary modules and set up our search parameters:
[2]:
from datetime import datetime
import pandas as pd
from pysradb.search import GeoSearch
# Set pandas display options for better readability
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", 100)
Search for Parse Bioscience Datasets¶
We’ll search GEO for all datasets that mention “Parse Biosciences” or “Evercode” (the name of Parse’s platform).
Note: The GEO search date filtering can be restrictive and may miss some results. We’ll search without date constraints first, then you can filter the results by date manually if needed.
[3]:
# Search for Parse Bioscience datasets
# Note: Date filtering with GeoSearch can be restrictive and may miss results
# We'll first search without date filter, then filter results manually if needed
print("Searching for Parse Bioscience datasets...")
print("This may take a few minutes...\n")
# Create search instance
# We use verbosity=3 to get all available metadata
# Starting without date filter to ensure we capture all datasets
instance = GeoSearch(
    verbosity=3,
    return_max=500,  # Adjust if needed
    geo_query="Parse Biosciences OR Evercode",
)
# Execute the search
instance.search()
# Get results as DataFrame
df = instance.get_df()
print(f"\nFound {len(df)} total entries matching the search criteria")
# Optional: Filter by date manually if needed
# Uncomment the following to filter by publication date
# if not df.empty and 'run_1_published' in df.columns:
#     df['pub_date'] = pd.to_datetime(df['run_1_published'], errors='coerce')
#     df = df[df['pub_date'] >= '2020-01-01']
#     print(f"After date filtering (>=2020-01-01): {len(df)} entries")
Searching for Parse Bioscience datasets...
This may take a few minutes...
No results found for the following search query:
SRA: {'query': None, 'accession': None, 'organism': None, 'layout': None, 'mbases': None, 'publication_date': None, 'platform': None, 'selection': None, 'source': None, 'strategy': None, 'title': None}
GEO DataSets: {'query': 'Parse Biosciences OR Evercode AND gds sra[Filter]', 'dataset_type': None, 'entry_type': None, 'publication_date': None, 'organism': None}
Found 0 total entries matching the search criteria
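The optional date filter in the commented lines above can be tried offline on a small stand-in table. This is a minimal sketch rather than pysradb's own behavior: the rows and the helper name `filter_by_publication_date` are hypothetical, and only the `run_1_published` column name comes from the notebook.

```python
import pandas as pd

# Hypothetical rows standing in for GeoSearch output; only the
# 'run_1_published' column name is taken from the notebook.
toy_df = pd.DataFrame(
    {
        "study_alias": ["GSE_A", "GSE_B", "GSE_C"],
        "run_1_published": [
            "2019-06-30 10:00:00",
            "2021-03-15 08:30:00",
            "not a date",
        ],
    }
)


def filter_by_publication_date(df, start="2020-01-01", column="run_1_published"):
    """Keep rows whose `column` parses to a date on or after `start`."""
    if df.empty or column not in df.columns:
        return df
    # Unparseable values become NaT, which never satisfies the comparison
    pub_date = pd.to_datetime(df[column], errors="coerce")
    return df[pub_date >= pd.Timestamp(start)]


filtered = filter_by_publication_date(toy_df)
print(filtered["study_alias"].tolist())
```

Rows with unparseable dates are dropped silently here; counting the `NaT` values before filtering may be worth adding if that matters for your analysis.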
[4]:
df
[4]:
Filter and Explore Results¶
Let’s examine the unique studies and get an overview of the datasets:
[5]:
if not df.empty:
    # Get unique GSE IDs (study accessions)
    if "study_alias" in df.columns:
        unique_studies = df["study_alias"].unique()
        print(f"Number of unique studies (GSE IDs): {len(unique_studies)}")
        print(f"\nUnique GSE IDs:\n{sorted(unique_studies)}")
    # Display summary information
    print(f"\n{'='*80}")
    print("DATASET SUMMARY")
    print(f"{'='*80}")
    print(f"Total runs: {len(df)}")
    if "sample_scientific_name" in df.columns:
        print("\nOrganisms represented:")
        print(df["sample_scientific_name"].value_counts())
    if "experiment_library_strategy" in df.columns:
        print("\nLibrary strategies:")
        print(df["experiment_library_strategy"].value_counts())
else:
    print("No results found. This could mean:")
    print("1. No Parse Bioscience datasets match the date range")
    print("2. The search query needs adjustment")
    print("3. Try searching with a broader query or different date range")
No results found. This could mean:
1. No Parse Bioscience datasets match the date range
2. The search query needs adjustment
3. Try searching with a broader query or different date range
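When a query comes back empty, one option is to try progressively broader queries until one returns rows. The sketch below shows that fallback pattern with a stand-in search function so it runs offline. The helper `first_nonempty_search` and the canned results are hypothetical. With pysradb, `run_search` could be a small wrapper that builds a `GeoSearch`, calls `search()`, and returns `get_df()` as in the cell above.

```python
def first_nonempty_search(queries, run_search):
    """Try each query in order; return (query, results) for the first non-empty hit."""
    for query in queries:
        results = run_search(query)
        if len(results) > 0:
            return query, results
    return None, []


def fake_search(query):
    """Offline stand-in for a real search (hypothetical canned data)."""
    canned = {"combinatorial barcoding": ["row1", "row2"]}
    return canned.get(query, [])


query, results = first_nonempty_search(
    ["Parse Biosciences", "Parse Evercode", "combinatorial barcoding"],
    fake_search,
)
print(query, len(results))
```

Because the helper only relies on `len()`, it works the same for lists and for DataFrames.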
Display Key Metadata¶
Let’s create a clean summary table with the most important information:
[6]:
if not df.empty:
    # Select key columns for display
    key_columns = [
        "study_accession",
        "study_alias",
        "experiment_accession",
        "sample_accession",
        "sample_scientific_name",
        "experiment_library_strategy",
        "experiment_title",
        "run_1_accession",
    ]
    # Filter to only include columns that exist
    available_columns = [col for col in key_columns if col in df.columns]
    summary_df = df[available_columns].copy()
    print("\nParse Bioscience Dataset Summary:")
    print(f"{'='*80}\n")
    display(summary_df.head(20))  # Show first 20 entries
    if len(summary_df) > 20:
        print(f"\n... and {len(summary_df) - 20} more entries")
Organize by Study (GSE)¶
Group the data by study to see how many samples/runs belong to each GSE:
[7]:
if not df.empty and "study_alias" in df.columns:
    # Group by study and count entries
    studies_summary = (
        df.groupby("study_alias")
        .agg(
            {
                "experiment_accession": "count",
                # Drop missing values and cast to str so join() cannot fail on NaN
                "sample_scientific_name": lambda x: ", ".join(
                    x.dropna().astype(str).unique()
                ),
                "study_accession": "first",
            }
        )
        .rename(
            columns={
                "experiment_accession": "num_experiments",
                "sample_scientific_name": "organisms",
                "study_accession": "SRP_accession",
            }
        )
        .reset_index()
    )
    # Sort by number of experiments (descending)
    studies_summary = studies_summary.sort_values("num_experiments", ascending=False)
    print("\nStudy-level Summary:")
    print(f"{'='*80}\n")
    display(studies_summary)
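The same study-level grouping can also be written with pandas named aggregation, which names the output columns directly and avoids the separate `rename` step. This is a sketch on hypothetical toy rows. The accession values are placeholders, while the column names match those used in the cell above.

```python
import pandas as pd

# Hypothetical toy rows; accessions are placeholders, not real records
toy = pd.DataFrame(
    {
        "study_alias": ["GSE_A", "GSE_A", "GSE_B"],
        "study_accession": ["SRP_A", "SRP_A", "SRP_B"],
        "experiment_accession": ["SRX_1", "SRX_2", "SRX_3"],
        "sample_scientific_name": ["Homo sapiens", "Mus musculus", "Homo sapiens"],
    }
)

toy_summary = (
    toy.groupby("study_alias")
    .agg(
        # output_column=(input_column, aggregation)
        num_experiments=("experiment_accession", "count"),
        organisms=(
            "sample_scientific_name",
            lambda s: ", ".join(s.dropna().astype(str).unique()),
        ),
        SRP_accession=("study_accession", "first"),
    )
    .reset_index()
    .sort_values("num_experiments", ascending=False)
)
print(toy_summary)
```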
Get Study Titles and Descriptions¶
Extract study titles to understand what each dataset is about:
[8]:
if not df.empty:
    title_cols = [
        col
        for col in df.columns
        if "study_title" in col.lower() or "study_abstract" in col.lower()
    ]
    if title_cols and "study_alias" in df.columns:
        # Get unique study information
        study_info_cols = ["study_alias", "study_accession"] + title_cols
        study_info_cols = [col for col in study_info_cols if col in df.columns]
        study_info = df[study_info_cols].drop_duplicates(subset=["study_alias"])
        print("\nStudy Titles and Descriptions:")
        print(f"{'='*80}\n")
        for _, row in study_info.iterrows():
            print(f"GSE: {row['study_alias']}")
            # study_accession may have been filtered out above if it is missing
            print(f"SRP: {row.get('study_accession', 'N/A')}")
            for col in title_cols:
                if pd.notna(row[col]):
                    text = str(row[col])
                    # Truncate long descriptions; add "..." only when text was cut
                    suffix = "..." if len(text) > 300 else ""
                    print(f"{col}: {text[:300]}{suffix}")
            print(f"{'-'*80}\n")
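As an alternative to slicing strings at a fixed character count, the standard library's `textwrap.shorten` truncates at word boundaries and appends the placeholder only when text was actually removed. A minimal sketch follows; the wrapper name `truncate` is hypothetical. Note that `shorten` also collapses runs of whitespace, so its output can differ slightly from plain slicing.

```python
import textwrap


def truncate(text, width=300):
    """Shorten text to at most `width` characters, cutting at a word boundary."""
    return textwrap.shorten(str(text), width=width, placeholder="...")


print(truncate("Hello world", width=10))
print(truncate("A short study title"))
```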
Save Results¶
Save the complete results to a file for further analysis:
[9]:
if not df.empty:
    # Save full results
    output_file = "parse_bioscience_datasets.tsv"
    df.to_csv(output_file, sep="\t", index=False)
    print(f"Full results saved to: {output_file}")
    # Also save the study-level summary
    if "studies_summary" in locals():
        summary_file = "parse_bioscience_studies_summary.tsv"
        studies_summary.to_csv(summary_file, sep="\t", index=False)
        print(f"Study summary saved to: {summary_file}")
    print(f"\nTotal datasets found: {len(df)}")
    print(
        f"Total unique studies: {len(unique_studies) if 'unique_studies' in locals() else 'N/A'}"
    )
Alternative Search Strategies¶
Since Parse Biosciences datasets do not always mention “Parse Biosciences” explicitly in their metadata, here are alternative search strategies:
1. Search for SPLiT-seq (Parse’s technology)¶
Parse Biosciences’ technology is based on SPLiT-seq (split-pool ligation-based transcriptome sequencing), so some datasets may be annotated with the method name rather than the company name.
[10]:
# Search for SPLiT-seq technology
print("Searching for SPLiT-seq / combinatorial indexing datasets...")
print("This may include Parse Bioscience and similar technologies.\n")
split_seq_instance = GeoSearch(
    verbosity=2,
    return_max=100,
    geo_query="SPLiT-seq OR split-seq OR combinatorial indexing single cell",
)
split_seq_instance.search()
split_seq_df = split_seq_instance.get_df()
if not split_seq_df.empty:
    print(f"Found {len(split_seq_df)} entries with SPLiT-seq/combinatorial indexing")
    if "study_alias" in split_seq_df.columns:
        split_seq_studies = split_seq_df["study_alias"].unique()
        print(f"Unique studies: {len(split_seq_studies)}")
        print(f"\nGSE IDs: {sorted(split_seq_studies)}")
    # Show organisms
    if "sample_scientific_name" in split_seq_df.columns:
        print("\nOrganisms:")
        print(split_seq_df["sample_scientific_name"].value_counts())
else:
    print("No results found with SPLiT-seq search")
# You can also try other search terms
print("\n" + "=" * 80)
print("Other useful search terms for Parse/single-cell datasets:")
print("- 'Parse Evercode' (Parse's platform name)")
print("- 'combinatorial barcoding'")
print("- Specific organisms: 'Parse Bioscience Homo sapiens'")
print("=" * 80)
Searching for SPLiT-seq / combinatorial indexing datasets...
This may include Parse Bioscience and similar technologies.
No results found for the following search query:
SRA: {'query': None, 'accession': None, 'organism': None, 'layout': None, 'mbases': None, 'publication_date': None, 'platform': None, 'selection': None, 'source': None, 'strategy': None, 'title': None}
GEO DataSets: {'query': 'SPLiT-seq OR split-seq OR combinatorial indexing single cell AND gds sra[Filter]', 'dataset_type': None, 'entry_type': None, 'publication_date': None, 'organism': None}
No results found with SPLiT-seq search
================================================================================
Other useful search terms for Parse/single-cell datasets:
- 'Parse Evercode' (Parse's platform name)
- 'combinatorial barcoding'
- Specific organisms: 'Parse Bioscience Homo sapiens'
================================================================================
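The search terms listed above can be combined into a single query string instead of being run one at a time. The sketch below joins terms with `OR`, quotes multi-word terms as phrases, and wraps the group in parentheses. It assumes that `geo_query` is passed through to NCBI's Entrez search, which accepts quoted phrases and parenthesized Boolean groups. The helper `build_or_query` is hypothetical and not part of pysradb.

```python
def build_or_query(terms):
    """Join search terms with OR, quoting multi-word terms as phrases."""
    parts = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # skip blank entries
        parts.append(f'"{term}"' if " " in term else term)
    if not parts:
        return ""
    # Parentheses keep the OR group together if a filter is appended with AND
    return "(" + " OR ".join(parts) + ")"


geo_query = build_or_query(["Parse Evercode", "combinatorial barcoding", "SPLiT-seq"])
print(geo_query)
```

The resulting string could then be supplied as the `geo_query` argument shown earlier. Whether phrase quoting narrows or widens the hits depends on how GEO indexes each record, so comparing result counts with and without quotes is advisable.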
Command Line Usage¶
You can also perform this search from the command line:
# Basic search for Parse Biosciences datasets
pysradb search --db geo \
    --geo-query "Parse Biosciences" \
    --max 500 \
    --verbosity 3 \
    --saveto parse_bioscience_results.tsv

# Search for SPLiT-seq technology
pysradb search --db geo \
    --geo-query "SPLiT-seq OR split-seq" \
    --max 200 \
    --verbosity 2 \
    --saveto split_seq_results.tsv

# Search with organism filter
pysradb search --db geo \
    --geo-query "Parse Biosciences" \
    --organism "Homo sapiens" \
    --max 200 \
    --saveto parse_human_results.tsv

# Note: Date filtering can be added with the --publication-date flag,
# but it may miss some results due to how GEO indexes publication dates.
# Example: --publication-date 01-01-2020:31-12-2024
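The `--publication-date` example above uses a `dd-mm-yyyy:dd-mm-yyyy` range. A small helper can build that string from `date` objects so the day and month are not transposed by hand. This is a sketch under that format assumption, and `format_date_range` is a hypothetical name rather than a pysradb function.

```python
from datetime import date


def format_date_range(start, end):
    """Format two dates as 'dd-mm-yyyy:dd-mm-yyyy' (day first, then month)."""
    if start > end:
        raise ValueError("start date must not be after end date")
    return f"{start:%d-%m-%Y}:{end:%d-%m-%Y}"


date_range = format_date_range(date(2020, 1, 1), date(2024, 12, 31))
print(date_range)
```

Passing `date.today()` as the end date would cover the "to the present day" case described in the introduction.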