from pysradb import SRAweb
= SRAweb()
db = db.sra_metadata("SRP510596", detailed=True)
df "metadata.csv", index=False)
df.to_csv( df.head()
DH 607 Autumn 2025 DNA-seq Hands On Session
Sickle cell disease.
Recall Lecture1 where we looked at this picture frok Kato et al 2018.Sickle cell anemia is characterized by mutations in the HBB gene which encodes the \(\beta\) suunut if hemoglobin (Hb). The underlying patho-phyisology of the disease stems from a single(!) A>T point mutation in exon 1 of the HBB gene.
We will be working with only one sample from the dataset deposited by the authors of Moani et al. 2023, Nature Communications, Non-viral DNA delivery and TALEN editing correct the sickle cell mutation in hematopoietic stem cells. The authors report a preclinical proof of concept of an ex vivo gene therapy approach that leverages a TALE nuclease (TALEN) and a DNA repair template to previsely correct the the \(\beta\) gene responsible for sickle cell disease.
The data is available on SRA.
We are asking a very simple question, how does the sample (“mock”) with sickle-cell disease look like. We would like to verify if the sample indeed has the A>T mutation at the HBB locus.
Checking the metadata
pysradb is a python pacakge that can retrieve metadata associated with a sequencing dataset that has been deposited to a public sequencing database such as SRA or ENA
In a Jupyter notebook, type the following:
"library_strategy"].value_counts() df[
= df.loc[df["library_strategy"]!="RNA-Seq"].sort_values(by=["experiment_desc", "isolate"]).reset_index(drop=True)
df_non_rna df_non_rna
"experiment_desc"].value_counts() df_non_rna[
Downloading the sickle cell sequences
There are a lot of datasets in this study. We will just analyze the “mock” sample, which is a sample with
= df_non_rna[df_non_rna["experiment_desc"] == "CAST-seq on HBB of Human: in vitro HSPC Hbs Mock"]
mocks mocks
= "SCD10-Mock-2"
library_name_to_download
= mocks[mocks["library_name"] == library_name_to_download]
mocks_to_download mocks_to_download
"ena_fastq_http_1"].values mocks_to_download[
"ena_fastq_http_2"].values mocks_to_download[
-c -O SCD10-Mock-2_R1.fastq.gz http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/083/SRR29275383/SRR29275383_1.fastq.gz
wget -c -O SCD10-Mock-2_R2.fastq.gz http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/083/SRR29275383/SRR29275383_2.fastq.gz wget
Exploring quality of data
Once we obtain any sequencing dataset, we should
-c bioconda fastqc fastp mamba install
Quick metrics
Before doing any pre-processing it is always a good idea to get a “sense” of your data.
- What is the size of your fastq.gz files?
- How many reads does your data have?
- Is your data single-end or paired-end?
- What is the length of reads? Are R1 and R2 read lengths same?
- What technology was used to generate the data? (Not only instrument, but the assay)
What is the size of your fastq.gz files?
-ltrha *.fastq.gz ls
How many reads does your data have?
-Mock-2_R1.fastq.gz|wc -l)/4|bc echo $(zcat SCD10
-Mock-2_R2.fastq.gz |wc -l)/4|bc echo $(zcat SCD10
What is the length of reads? Are R1 and R2 read lengths same?
-Mock-2_R1.fastq.gz | head -n 5 zcat SCD10
-Mock-2_R2.fastq.gz | head -n 5 zcat SCD10
len("AAAACTCTCCTTAAACCAGTCTAGTAACCTTGATACCAACCTGCCCAGGGCCTCACCACCAACTTCATCCACGTTCACCTTGCCCCACAGGGCAGTAACGGCAGACTTCAGTCCCTTAAGCGGAGCCCTAGATCGGAAGAGCGTCGTGTAT")
-Mock-2_R1.fastq.gz | head -n 2 | tail -n 1 | wc -c zcat SCD10
-Mock-2_R2.fastq.gz | head -n 2 | tail -n 1 | wc -c zcat SCD10
Quality control checks
Before we do any pre-processing it is always a good idea to take a quick look the quality score description. We will run FastQC to run quality checks for our sequencing data
-Mock-2_R1.fastq.gz SCD10-Mock-2_R2.fastq.gz fastqc SCD10
multiqc .
Download “multiqc_report.html” from the left document panel. You will find multiple metrics reported for both the read files. You will find the following adapter plot:
Removing adapters
Often illumina sequencing data needs to be trimmed to remove adapter sequences. But why do we waste sequencing efforts at sequencing the adapters?
The diagram below shows sites of primer annealing at each stage of sequencing run: Read 1, Index 1, Index 2 and Read 2. You can ignore Index 1 and Index 2, but very briefly these are reads that allow multiple samples to be pooled and sequenced together. The example below is of dual-indexed sequencing (i5 and i7). You can read about indexing details here
Consider the reads R1 and R2 in a typicall illumina flowcell as shown below. Our goal is to sequence the “DNA insert” in gray. In both Reads 1 and 2, the sequencing primer anneals to the DNA insert upstream of it. Sequencing begins at base 1 of the DNA insert and hence 5’ adapters are usually not sequenced. Ideally, if the DNA insert sequence is long enough, the sequencing adapters at the 3’end will not get sequenced. But if the insert size is smaller than the read length, the adapters will also get sequenced. These adapter sequences are “fixed” sequences , and ideally we do not want them to be part of our reads because they will not match to the sequences in our reference genome thereby making the alignment step less accurate. So we would like to remove these.
Adapter and quality trimming using fastp
fastp is a tool designed to provide fast all-in-one preprocessing for FastQ files.
Let’s run fastp
--in1 SCD10-Mock-2_R1.fastq.gz --in2 SCD10-Mock-2_R2.fastq.gz --out1 trimmed_SCD10-Mock-2_R1.fastq.gz --out2 trimmed_SCD10-Mock-2_R2.fastq.gz --detect_adapter_for_pe fastp
Run quality control using FastQC on the trimmed data
-Mock-2_R1.fastq.gz trimmed_SCD10-Mock-2_R2.fastq.gz fastqc trimmed_SCD10
multiqc .
Download “multiqc_report.html” from the left document panel. You will see that we have two samples now: 1. SCD10-Mock-2 (original ) 2. trimmed_SCD10-Mock-2 (processed)
If you compare the two samples, you’ll clearly see that the quality scores have improved
Mapping
Now that we have done initial preprocessing, we are ready to map data to human genome reference.
In an ideal world we will map the data to the entire genome, but since we are working with memory restrictions, we will just map all the data to chromosome 11 as HBB gene is in chromosome 11. While this is not incorrect, it is also not the most appropriate mapping strategy - you should in all scenarios map to the entire human genome.
We can download the human genome from UCSC:
Reference genome
We will use only chromosome 11 to map the data. First we download the fasta file.
#| eval: false
wget -c https://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr11.fa.gz && gunzip chr11.fa.gz
Alignment using bwa
Index the reference genome
#| eval: false
# this will tske some time
bwa index chr11.fa
#| eval: false
# Run BWA mem on the trimmed reads
bwa mem -t 32 chr11.fa trimmed_SCD10-Mock-2_R1.fastq.gz trimmed_SCD10-Mock-2_R2.fastq.gz > SCD10-Mock-2.sam
Let’s check the size of the SAM file
#| eval: false
ls -lrtha SCD10-Mock-2.sam
#| eval: false
head SCD10-Mock-2.sam
Convert the SAM to BAM
-Sb SCD10-Mock-2.sam > SCD10-Mock-2.bam samtools view
-lrtha SCD10-Mock-2.bam ls
The BAM file is smaller than the SAM file due to compression. Let’s get some alignment metrics from samtools
-Mock-2.bam samtools flagstat SCD10
Now let’s sort the BAM file
-Mock-2.bam -o SCD10-Mock-2-sorted.bam samtools sort SCD10
-Mock-2-sorted.bam samtools index SCD10
-la *sorted* *bai* ls
Visualization
Recall why we started this excercise. We were interested in whether the “mock” samples are really sicle cell disease samples. To detect this, we need to look into the mutation data for HBB gene and see if one of the bases appears mutated.
To visualize reads mapped to genome, we will use igv. You can download IGV as a desktop application (recommended) and use it to visualize the alignment and mutations.
We have now aligned our reads to the reference genome and have a sorted BAM file. Since we are interested in sickle cell disease which is caused by the HBB gene let’s make sure that the reads in the HBB gene regions looks alright.
Where is the HBB gene? Let’s go to the Genome Browser and search the HBB gene.
We can see the HBB gene in chromosome 11. The genomic co-ordinates are listed as: chr11:5,225,464-5,227,071
Let’s use the samtools view to see reads overlapping the HBB gene chr11:5,225,464-5,227,071
-Mock-2-sorted.bam chr11:5225464-522707111 | wc -l samtools view SCD10
To make it easy to visualize we will downsample the bam so that it retains only 5% of original reads:
-s 0.05 -b SCD10-Mock-2.bam | samtools sort -o SCD10-Mock-2_sample05.bam && samtools index SCD10-Mock-2_sample05.bam samtools view
import igv_notebook
igv_notebook.init()
= igv_notebook.Browser(
b
{"genome": "hg38",
"locus": "chr11:5,225,464-5,227,071",
} )
b.load_track(
{"name": "SCD10-Mock-2-sorted.bam",
"url": "SCD10-Mock-2-sorted.bam",
"indexURL": "SCD10-Mock-2-sorted.bam.bai",
"format": "bam",
"type": "alignment",
"colorBy": "strand",
"height": 500,
} )
You’ll see now that the IGV is showing the alignment of the SCD10-Mock-2 sample against the HBB gene region.
Let’s navigate to the position where the sickle mutation is expected to be found. based on the literature, the sickle cell allele is located at hg38 coordinates: chr11:5227002 A>T
"chr11:5227002") b.search(
Exercise
We chose one mock sample. Can you go back and try this with a different sample?
How would you verify that the therapy has indeded worked? (you might not be able to do this analysis completely on colab)