Saket Choudhary
04/20/2017
Two roads diverged in a wood, and I took the one less traveled by, And that has made all the difference
– Robert Frost
When you come to a fork in the road, take it.
– Yogi Berra
Technical sources of variation that often confound the effects arising from biological differences.
They arise from, but are not limited to:
Expression is tissue-specific (mostly) and not species-specific
PCA/MDS are often not sufficient
RLE plots: Relative Log Expression, i.e. median-centered \( \log \) expression values
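A minimal sketch of how RLE values could be computed, assuming a genes x samples count matrix and a pseudocount of 1 before taking logs. The function name `relative_log_expression` and the simulated data are hypothetical and for illustration only. Each gene's log expression is centered on its median across samples. Without unwanted variation, the per-sample distributions of these values should sit close to zero.

```python
import numpy as np

def relative_log_expression(counts, pseudocount=1.0):
    """Median-centre log expression per gene.

    counts: genes x samples array of non-negative expression values.
    The pseudocount is an assumption made here to avoid log(0).
    """
    log_expr = np.log(np.asarray(counts, dtype=float) + pseudocount)
    gene_medians = np.median(log_expr, axis=1, keepdims=True)
    return log_expr - gene_medians

# Simulated example: 200 genes, 6 samples, one sample with an inflated scale.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=50, size=(200, 6)).astype(float)
counts[:, 3] *= 4  # hypothetical technical artefact in the fourth sample

rle = relative_log_expression(counts)
sample_medians = np.median(rle, axis=0)  # one summary value per sample
print(np.round(sample_medians, 2))
```

In an RLE plot these values are usually drawn as one boxplot per sample. A box shifted away from zero, or much wider than the others, flags a sample worth inspecting.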
Assume some genes can act as negative controls: any difference between their expression values arises from unmodelled factors
For \( J \) genes, \( n \) samples, \( k \) unmodelled factors, and \( p \) known covariates (independent of the unmodelled factors): \[ \begin{align*} \log E[Y|W,X,O] &= \underbrace{W_{n \times k}}_{\text{Hidden factors design}} \times \alpha_{k \times J} + \overbrace{X_{n \times p}}^{\text{Known covariates design matrix}} \times \beta_{p \times J} \\ & + \underbrace{O}_{\text{offset}} \end{align*} \]
General Idea, given a pool of \( J_c \) negative control genes:
Can be modified to account for replicates.
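A simplified sketch of the negative-control idea, assuming log-scale data and ordinary least squares in place of the count GLM in the model above. The function names, the simulated data, and the choice of \( k = 1 \) are hypothetical. Since the controls carry no effect of the known covariates, the leading singular vectors of their centered log expression are taken as an estimate of \( W \). The estimate is then regressed out of all genes together with \( X \).

```python
import numpy as np

def estimate_unwanted_factors(log_expr, control_idx, k):
    """Estimate W (n x k) from the negative control genes.

    log_expr: samples x genes (n x J) matrix of log expression.
    """
    centered = log_expr - log_expr.mean(axis=0, keepdims=True)
    controls = centered[:, control_idx]                  # n x J_c
    u, s, _ = np.linalg.svd(controls, full_matrices=False)
    return u[:, :k] * s[:k]

def remove_unwanted_variation(log_expr, X, W):
    """Fit X and W jointly by least squares, then subtract the W part."""
    design = np.hstack([X, W])
    coef, *_ = np.linalg.lstsq(design, log_expr, rcond=None)
    alpha_hat = coef[X.shape[1]:]                        # k x J
    return log_expr - W @ alpha_hat

# Simulated data: 8 samples, 300 genes, the first 100 are controls.
rng = np.random.default_rng(1)
n, J, k = 8, 300, 1
condition = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
hidden = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)  # unmodelled factor
X = np.column_stack([np.ones(n), condition])
control_idx = np.arange(100)

beta = np.zeros(J)
beta[200:] = 2.0                       # only the last 100 genes respond to condition
alpha_true = rng.normal(size=J)        # every gene responds to the hidden factor
log_expr = (5.0 + np.outer(condition, beta) + np.outer(hidden, alpha_true)
            + rng.normal(scale=0.1, size=(n, J)))

W = estimate_unwanted_factors(log_expr, control_idx, k)
cleaned = remove_unwanted_variation(log_expr, X, W)
```

The sketch depends entirely on the controls being free of the biology of interest. If a control gene responds to the condition, part of that signal ends up in \( W \) and is removed with it.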
Genes 201-500: affected by an independent factor (unmodelled factor, say age), possibly correlated with class
For the \( g^{th} \) gene, the \( j^{th} \) sample, and \( L \) 'unmodelled' factors:
\[ \begin{align*} \overbrace{Y_{gj}}^{\text{Expression}} &= \underbrace{\mu_g}_{\text{basal expression}} + \overbrace{f_g(c_j)}^{\text{Dependence on primary variable (say condition)}} \\ &+ \sum_{l=1}^L \underbrace{\gamma_{lg}}_{\text{Gene specific coeff.}} \times \overbrace{p_{lj}}^{l^{th}\text{ unmodelled factor (say batch)}} + \underbrace{\epsilon_{gj}}_{\text{Noise}} \end{align*} \]
General Idea:
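A reduced sketch of the surrogate-variable idea behind the model above, assuming a linear \( f_g \) and a known number of surrogate variables. It omits the significance testing and gene weighting that a full SVA implementation performs. The function name `surrogate_variables` and the simulated data are hypothetical. The primary variable is regressed out gene by gene, and the leading singular vectors of the residual matrix are taken as estimates of the unmodelled factors \( p_{lj} \).

```python
import numpy as np

def surrogate_variables(expr, X, n_sv):
    """Return samples x n_sv surrogate variable estimates.

    expr: genes x samples expression matrix.
    X:    samples x p design for the primary variable (with intercept).
    """
    coef, *_ = np.linalg.lstsq(X, expr.T, rcond=None)   # p x genes
    resid = expr.T - X @ coef                           # samples x genes
    u, _, _ = np.linalg.svd(resid, full_matrices=False)
    return u[:, :n_sv]

# Simulated data: 500 genes, 8 samples.
rng = np.random.default_rng(2)
condition = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
hidden = np.array([-1.5, -0.5, 0.5, 1.5, -1.5, -0.5, 0.5, 1.5])  # say, age
X = np.column_stack([np.ones(8), condition])

f = np.zeros(500)
f[:200] = 1.0                          # genes 1-200 depend on condition
gamma = np.zeros(500)
gamma[200:] = rng.normal(size=300)     # genes 201-500 depend on the hidden factor
expr = (np.outer(f, condition) + np.outer(gamma, hidden)
        + rng.normal(scale=0.5, size=(500, 8)))

sv = surrogate_variables(expr, X, n_sv=1)
print(abs(np.corrcoef(sv[:, 0], hidden)[0, 1]))
```

Here the hidden factor is orthogonal to the condition by construction. When the two are correlated, the residuals contain only the part of the hidden factor not explained by \( X \), which is the motivation for the more careful estimation in the full method.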
For the \( i^{th} \) batch, the \( j^{th} \) sample, and the \( g^{th} \) gene:
\[ \begin{align*} \overbrace{Y_{ijg}}^{\text{Normalised expression of gene } g} &= \underbrace{\alpha_g}_{\text{Overall gene exp.}} + \overbrace{X}^{\text{Design matrix}}\underbrace{\beta_g}_{\text{Regression coeff.}}\\ &+ \underbrace{\gamma_{ig}}_{\text{Additive effect}} + \overbrace{\delta_{ig}}^{\text{Multiplicative effect}}\epsilon_{ijg}\\ \gamma_{ig} &\sim N(\gamma_i, \tau_i^2) \\ \delta^2_{ig} &\sim \text{Inverse Gamma} (\lambda_i, \theta_i)\\ \underbrace{Y_{ijg}^*}_{\text{Batch adjusted values}} &= \frac{ Y_{ijg}-\hat{\alpha}_g - X\hat{\beta}_g - \hat{\gamma}_{ig} }{ \hat{\delta}_{ig} } + \hat{\alpha}_g + X\hat{\beta}_g \end{align*} \]
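A minimal sketch of the location and scale adjustment in the last line of the model, assuming plain per-gene least-squares estimates with no empirical Bayes shrinkage, so the priors on \( \gamma_{ig} \) and \( \delta^2_{ig} \) are not used. It also assumes \( \hat{\delta}_{ig} \) is measured relative to the pooled residual standard deviation of each gene. The function name `location_scale_adjust` and the simulated data are hypothetical.

```python
import numpy as np

def location_scale_adjust(expr, batch, X):
    """Remove additive and multiplicative batch effects gene by gene.

    expr:  genes x samples expression matrix.
    batch: length-n array of batch labels.
    X:     samples x p covariates of interest, WITHOUT an intercept
           (the batch indicators already span it).
    """
    expr = np.asarray(expr, dtype=float)
    batch = np.asarray(batch)
    levels = np.unique(batch)
    B = (batch[:, None] == levels[None, :]).astype(float)    # n x n_batches
    design = np.hstack([B, X])
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    batch_means = coef[:len(levels)]                         # n_batches x genes
    beta = coef[len(levels):]                                # p x genes
    alpha = (B.sum(axis=0) / B.shape[0]) @ batch_means       # overall gene level
    gamma = batch_means - alpha                              # additive batch effect
    resid = expr.T - alpha - X @ beta - B @ gamma            # n x genes
    pooled_sd = resid.std(axis=0)
    scaled = np.empty_like(resid)
    for level in levels:
        rows = batch == level
        delta = resid[rows].std(axis=0) / pooled_sd          # multiplicative effect
        scaled[rows] = resid[rows] / delta
    return (scaled + alpha + X @ beta).T

# Simulated data: 200 genes, 2 batches of 6 samples, condition balanced in each.
rng = np.random.default_rng(3)
batch = np.repeat([0, 1], 6)
condition = np.tile([0.0, 0.0, 0.0, 1.0, 1.0, 1.0], 2)
X = condition[:, None]
noise = rng.normal(size=(200, 12))
noise[:, batch == 1] *= 3.0            # multiplicative batch effect
expr = 5.0 + condition[None, :] + noise
expr[:, batch == 1] += 2.0             # additive batch effect

adjusted = location_scale_adjust(expr, batch, X)
```

With only a few samples per batch, the per-gene estimates of \( \gamma_{ig} \) and \( \delta_{ig} \) are noisy. Pooling information across genes through the priors in the model is what addresses this, and that step is not shown here.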
Assume you have 8 treated/control samples coming from 4 cell lines. Unmodelled factor - cell line. Can SVA catch it?