Overview

Welcome to DH 607 - Introduction to computational multi-omics! The course dives deep into technologies developed over the last couple of decades to understand what goes on inside a cell.

About the course

We will take a first-principles approach to understanding biology and the technological advances that make it possible to ‘measure’ biological processes quantitatively.

The course will build the foundational knowledge and skills you need to explore and analyze complex (and real!) biological data. We will work through multiple vignettes of biological questions and see how probability, statistics, and computer science provide enlightening answers.

The course is neither a course in biology nor a course in statistics but somewhere in between.

What will you learn?

One of the key objectives of (modern) biology is to understand how different molecules shape the functionality of a cell/tissue/organ. The human body has about 36 trillion (\(36 \times 10^{12}\)) cells. All the cells in our body typically have the same DNA, but they have different functions.

If we had to draw a functional map of the human body, we would need to measure what goes on inside each cell. Over the years, significant technological advancements have made it possible to ‘profile’ the various molecules in each cell (DNA/RNA/protein). This data is high-dimensional. Scales can vary, but a typical modern-day dataset can have thousands of features (DNA sequence/gene expression/proteins) measured over millions of rows (observations). To draw any biological insights, this high-dimensional data requires ‘processing’. The course will focus on the mathematical and statistical methods required to analyze a modern-day genomics/transcriptomics dataset.

A detailed set of topics will be available here. But overall, you will learn about:

What is sequencing
Algorithms for aligning DNA sequences and searching DNA databases
Statistical methods to discover genome variation, and application to discovering etiology of disease
Probability and statistics for sequence analysis
How is gene expression quantified computationally
Statistical models for analyzing gene expression data
Linear and non-linear dimensionality reduction methods and their applications in multi-omics
Statistical models for identifying transcription factor binding
Hidden Markov models and applications in multi-omics
How are disease-causing mutations identified (genome-wide association studies)
Statistical modelling of single-cell multi-omics data
Statistical models for modeling CRISPR screens
Recent advancements in statistical methods and deep learning applications in multi-omics for human diseases

Learning objectives

Perform exploratory data analysis and visualize genomic data
Apply tailored statistical methods to answer questions using high-dimensional biological data
“Get your hands dirty” by analyzing genomics data to draw actionable insights for improving human health
Write production-level, reproducible, reusable code and software packages

Evaluation

The course will be evaluated based on the following components:

Assignments: 24% (Best 8 out of 9)
- Due every week on Friday 5pm via Gradescope
- Weightage: 3% each (Best 8 will be considered for grading)
- Late submission policy: 10% penalty per day, up to a maximum of 6 days
Surprise Quizzes: 6%
Mid-sem: 25%
- Closed book and offline
Course Project: 20%
End-sem: 25%
- Closed book and offline
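
As a rough illustration of the assignment arithmetic above, here is a minimal Python sketch. It is not an official grade calculator. The function names are hypothetical, and two points the policy does not spell out are treated as assumptions: the late penalty is taken as 10% of the earned score per late day, and a submission more than 6 days late is taken to earn zero.

```python
from typing import Sequence

ASSIGNMENT_WEIGHT = 3.0      # percent of the course grade per counted assignment
BEST_N = 8                   # best 8 of 9 assignments are counted
LATE_PENALTY_PER_DAY = 0.10  # 10% penalty per late day
MAX_LATE_DAYS = 6            # penalty window stated in the policy


def late_adjusted(score: float, days_late: int) -> float:
    """Return a score (fraction in [0, 1]) after the late penalty.

    Assumptions (not stated in the policy): the penalty is 10% of the
    earned score per day, and anything later than 6 days earns zero.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    if days_late > MAX_LATE_DAYS:
        return 0.0
    return score * (1.0 - LATE_PENALTY_PER_DAY * days_late)


def assignment_component(scores: Sequence[float], days_late: Sequence[int]) -> float:
    """Return the assignment contribution to the course grade, in percent (max 24)."""
    if len(scores) != len(days_late):
        raise ValueError("scores and days_late must have the same length")
    adjusted = sorted(
        (late_adjusted(s, d) for s, d in zip(scores, days_late)),
        reverse=True,
    )
    # Only the best BEST_N adjusted scores count, each worth ASSIGNMENT_WEIGHT percent.
    return ASSIGNMENT_WEIGHT * sum(adjusted[:BEST_N])
```

Under these assumptions, a student with full marks on all nine on-time assignments gets the full 24%, and skipping one assignment entirely does not change that, since only the best eight are counted.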

Collaboration policy

For assignment problems, you should work on your own. If you get stuck, you are welcome to discuss the problem with other students (in person, or online on Piazza). However, the solutions must be your own work. If you discussed a problem with someone, please mention their name and what you received help with in your submission.

Mid-semester and end-semester exams will be closed-book. No collaboration is allowed.

LLM Policy

You are allowed to use Large Language Models (LLMs) like ChatGPT, Claude, etc. as learning aids, but you must:

Clearly document when and how you used an LLM in your submission
Ensure you understand the solutions provided by the LLM
Be prepared to explain your work during office hours or exams
Not rely solely on LLM-generated code without understanding

For exams, LLMs will not be permitted.

About the instructor

Saket is an Assistant Professor at the Koita Centre for Digital Health at IIT Bombay. His lab focuses on developing statistical models for analyzing multi-omics data. Saket obtained his B.Tech+M.Tech in Chemical Engineering at IIT Bombay in 2014. He pursued his Ph.D. in Computational Biology and Bioinformatics at the University of Southern California, developing computational methods for understanding how proteins are synthesized in the body. Saket Lab develops novel statistical and computational methods to answer fundamental questions in disease biology and public health.

Course information

Units: 6

Lecture: Mondays and Thursdays, 3:30pm – 4:55pm.

Location: ESE113, 1st Floor, Energy Science Building

Instructor: Saket Choudhary | Homepage | Blog

Office: B-22, KCDH, KReSIT Basement

Office Hours: Wednesdays, 4:00 - 5:00pm or by appointment

For appointments outside office hours: https://cal.com/saketkc/

Contact: saketc@iitb.ac.in | Ext: 3785 (+91 22 2159 3785)

Teaching Assistants

Head Teaching Assistant:

Shubham Thakur
- Contact: shubham.thakur@iitb.ac.in
- Office Hours: Mondays 2:00 PM - 3:30 PM, B-20 ASL Lab, KReSIT Basement

Graders:

Souparna Bhowmik
- Contact: 25d1623@iitb.ac.in
Gaurav Devendra Jain
- Contact: 210040050@iitb.ac.in