DH 607 Autumn 2025 DNA-seq Hands On Session

Sickle cell disease.

Recall Lecture1 where we looked at this picture frok Kato et al 2018.Sickle cell anemia is characterized by mutations in the HBB gene which encodes the \(\beta\) suunut if hemoglobin (Hb). The underlying patho-phyisology of the disease stems from a single(!) A>T point mutation in exon 1 of the HBB gene.

We will be working with only one sample from the dataset deposited by the authors of Moani et al. 2023, Nature Communications, Non-viral DNA delivery and TALEN editing correct the sickle cell mutation in hematopoietic stem cells. The authors report a preclinical proof of concept of an ex vivo gene therapy approach that leverages a TALE nuclease (TALEN) and a DNA repair template to previsely correct the the \(\beta\) gene responsible for sickle cell disease.

The data is available on SRA.

We are asking a very simple question, how does the sample (“mock”) with sickle-cell disease look like. We would like to verify if the sample indeed has the A>T mutation at the HBB locus.

Checking the metadata

pysradb is a python pacakge that can retrieve metadata associated with a sequencing dataset that has been deposited to a public sequencing database such as SRA or ENA

In a Jupyter notebook, type the following:

from pysradb import SRAweb
db = SRAweb()
df = db.sra_metadata("SRP510596", detailed=True)
df.to_csv("metadata.csv", index=False)
df.head()

df["library_strategy"].value_counts()

df_non_rna = df.loc[df["library_strategy"]!="RNA-Seq"].sort_values(by=["experiment_desc", "isolate"]).reset_index(drop=True)
df_non_rna

df_non_rna["experiment_desc"].value_counts()

Downloading the sickle cell sequences

There are a lot of datasets in this study. We will just analyze the “mock” sample, which is a sample with

mocks = df_non_rna[df_non_rna["experiment_desc"] == "CAST-seq on HBB of Human: in vitro HSPC Hbs Mock"] 
mocks

library_name_to_download = "SCD10-Mock-2" 

mocks_to_download = mocks[mocks["library_name"] == library_name_to_download] 
mocks_to_download

mocks_to_download["ena_fastq_http_1"].values

mocks_to_download["ena_fastq_http_2"].values

wget -c -O SCD10-Mock-2_R1.fastq.gz http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/083/SRR29275383/SRR29275383_1.fastq.gz
wget -c -O SCD10-Mock-2_R2.fastq.gz http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR292/083/SRR29275383/SRR29275383_2.fastq.gz

Exploring quality of data

Once we obtain any sequencing dataset, we should

mamba install -c bioconda fastqc fastp

Quick metrics

Before doing any pre-processing it is always a good idea to get a “sense” of your data.

What is the size of your fastq.gz files?
How many reads does your data have?
Is your data single-end or paired-end?
What is the length of reads? Are R1 and R2 read lengths same?
What technology was used to generate the data? (Not only instrument, but the assay)

What is the size of your fastq.gz files?

ls -ltrha *.fastq.gz

How many reads does your data have?

echo $(zcat SCD10-Mock-2_R1.fastq.gz|wc -l)/4|bc

echo $(zcat SCD10-Mock-2_R2.fastq.gz |wc -l)/4|bc

What is the length of reads? Are R1 and R2 read lengths same?

zcat SCD10-Mock-2_R1.fastq.gz | head -n 5

zcat SCD10-Mock-2_R2.fastq.gz | head -n 5

len("AAAACTCTCCTTAAACCAGTCTAGTAACCTTGATACCAACCTGCCCAGGGCCTCACCACCAACTTCATCCACGTTCACCTTGCCCCACAGGGCAGTAACGGCAGACTTCAGTCCCTTAAGCGGAGCCCTAGATCGGAAGAGCGTCGTGTAT")

zcat SCD10-Mock-2_R1.fastq.gz  | head -n 2  | tail -n 1 | wc -c

zcat SCD10-Mock-2_R2.fastq.gz  | head -n 2  | tail -n 1 | wc -c

Quality control checks

Before we do any pre-processing it is always a good idea to take a quick look the quality score description. We will run FastQC to run quality checks for our sequencing data

fastqc SCD10-Mock-2_R1.fastq.gz SCD10-Mock-2_R2.fastq.gz

multiqc .

Download “multiqc_report.html” from the left document panel. You will find multiple metrics reported for both the read files. You will find the following adapter plot:

Removing adapters

Often illumina sequencing data needs to be trimmed to remove adapter sequences. But why do we waste sequencing efforts at sequencing the adapters?

The diagram below shows sites of primer annealing at each stage of sequencing run: Read 1, Index 1, Index 2 and Read 2. You can ignore Index 1 and Index 2, but very briefly these are reads that allow multiple samples to be pooled and sequenced together. The example below is of dual-indexed sequencing (i5 and i7). You can read about indexing details here

Consider the reads R1 and R2 in a typicall illumina flowcell as shown below. Our goal is to sequence the “DNA insert” in gray. In both Reads 1 and 2, the sequencing primer anneals to the DNA insert upstream of it. Sequencing begins at base 1 of the DNA insert and hence 5’ adapters are usually not sequenced. Ideally, if the DNA insert sequence is long enough, the sequencing adapters at the 3’end will not get sequenced. But if the insert size is smaller than the read length, the adapters will also get sequenced. These adapter sequences are “fixed” sequences , and ideally we do not want them to be part of our reads because they will not match to the sequences in our reference genome thereby making the alignment step less accurate. So we would like to remove these.

Adapter and quality trimming using fastp

fastp is a tool designed to provide fast all-in-one preprocessing for FastQ files.

Let’s run fastp

fastp --in1 SCD10-Mock-2_R1.fastq.gz --in2 SCD10-Mock-2_R2.fastq.gz --out1 trimmed_SCD10-Mock-2_R1.fastq.gz --out2 trimmed_SCD10-Mock-2_R2.fastq.gz --detect_adapter_for_pe

Run quality control using FastQC on the trimmed data

fastqc trimmed_SCD10-Mock-2_R1.fastq.gz trimmed_SCD10-Mock-2_R2.fastq.gz

multiqc .

Download “multiqc_report.html” from the left document panel. You will see that we have two samples now: 1. SCD10-Mock-2 (original ) 2. trimmed_SCD10-Mock-2 (processed)

If you compare the two samples, you’ll clearly see that the quality scores have improved

Mapping

Now that we have done initial preprocessing, we are ready to map data to human genome reference.

In an ideal world we will map the data to the entire genome, but since we are working with memory restrictions, we will just map all the data to chromosome 11 as HBB gene is in chromosome 11. While this is not incorrect, it is also not the most appropriate mapping strategy - you should in all scenarios map to the entire human genome.

We can download the human genome from UCSC:

genome: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
chromosome wise: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes

Reference genome

We will use only chromosome 11 to map the data. First we download the fasta file.

#| eval: false
wget -c https://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr11.fa.gz && gunzip chr11.fa.gz

Alignment using bwa

Index the reference genome

#| eval: false
# this will tske some time
bwa index chr11.fa

#| eval: false
# Run BWA mem on the  trimmed reads 
bwa mem -t 32  chr11.fa trimmed_SCD10-Mock-2_R1.fastq.gz trimmed_SCD10-Mock-2_R2.fastq.gz > SCD10-Mock-2.sam

Let’s check the size of the SAM file

#| eval: false
ls -lrtha SCD10-Mock-2.sam

#| eval: false
head SCD10-Mock-2.sam

Convert the SAM to BAM

samtools view -Sb SCD10-Mock-2.sam > SCD10-Mock-2.bam

ls -lrtha SCD10-Mock-2.bam

The BAM file is smaller than the SAM file due to compression. Let’s get some alignment metrics from samtools

samtools flagstat SCD10-Mock-2.bam

Now let’s sort the BAM file

samtools sort SCD10-Mock-2.bam -o SCD10-Mock-2-sorted.bam

samtools index SCD10-Mock-2-sorted.bam

ls -la *sorted* *bai*

Visualization

Recall why we started this excercise. We were interested in whether the “mock” samples are really sicle cell disease samples. To detect this, we need to look into the mutation data for HBB gene and see if one of the bases appears mutated.

To visualize reads mapped to genome, we will use igv. You can download IGV as a desktop application (recommended) and use it to visualize the alignment and mutations.

We have now aligned our reads to the reference genome and have a sorted BAM file. Since we are interested in sickle cell disease which is caused by the HBB gene let’s make sure that the reads in the HBB gene regions looks alright.

Where is the HBB gene? Let’s go to the Genome Browser and search the HBB gene.

We can see the HBB gene in chromosome 11. The genomic co-ordinates are listed as: chr11:5,225,464-5,227,071

Let’s use the samtools view to see reads overlapping the HBB gene chr11:5,225,464-5,227,071

samtools view SCD10-Mock-2-sorted.bam chr11:5225464-522707111 | wc -l

To make it easy to visualize we will downsample the bam so that it retains only 5% of original reads:

samtools view -s 0.05 -b SCD10-Mock-2.bam | samtools sort -o SCD10-Mock-2_sample05.bam && samtools index SCD10-Mock-2_sample05.bam

import igv_notebook
igv_notebook.init()

b = igv_notebook.Browser(
    {
        "genome": "hg38",
        "locus": "chr11:5,225,464-5,227,071",
    }
)

b.load_track(
    {
        "name": "SCD10-Mock-2-sorted.bam",
        "url": "SCD10-Mock-2-sorted.bam",
        "indexURL": "SCD10-Mock-2-sorted.bam.bai",
        "format": "bam",
        "type": "alignment",
        "colorBy": "strand",
        "height": 500,
    }
)

You’ll see now that the IGV is showing the alignment of the SCD10-Mock-2 sample against the HBB gene region.

Let’s navigate to the position where the sickle mutation is expected to be found. based on the literature, the sickle cell allele is located at hg38 coordinates: chr11:5227002 A>T

b.search("chr11:5227002")

Exercise

We chose one mock sample. Can you go back and try this with a different sample?
How would you verify that the therapy has indeded worked? (you might not be able to do this analysis completely on colab)