On GitHub: mikelove/techbias
Michael Love
Research Fellow
Irizarry group, DFCI/HSPH
thought experiment: measure potholes in Boston vs Cambridge
everyone's rulers are off by +1 cm
Boston rulers left in the sun, stretched by +1 cm
...sounds simple, but still we see datasets with 100% confounding of condition with experimental batch
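A minimal sketch of the pothole thought experiment, assuming the ruler error acts as an additive +1 cm offset on every reading and that true pothole depths in both cities come from the same made-up distribution (the numbers and the function names `mean` and `measure` are illustrative, not from the talk):

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def measure(true_depths, ruler_offset_cm):
    # Assumption: the ruler error is a constant additive offset on each reading.
    return [d + ruler_offset_cm for d in true_depths]

random.seed(0)
# Hypothetical data: both cities share the same true depth distribution (cm).
boston_true = [random.gauss(10.0, 2.0) for _ in range(1000)]
cambridge_true = [random.gauss(10.0, 2.0) for _ in range(1000)]

true_diff = mean(boston_true) - mean(cambridge_true)

# Case 1: everyone's rulers are off by +1 cm.
# The shared offset cancels in the Boston vs Cambridge comparison.
diff_shared = mean(measure(boston_true, 1.0)) - mean(measure(cambridge_true, 1.0))

# Case 2: only Boston rulers are off by +1 cm.
# The offset is 100% confounded with city and cannot be separated from it.
diff_confounded = mean(measure(boston_true, 1.0)) - mean(measure(cambridge_true, 0.0))

print("true difference:        ", round(true_diff, 3))
print("shared bias:            ", round(diff_shared, 3))
print("bias confounded w/ city:", round(diff_confounded, 3))
```

In the second case the estimated city difference is shifted by the full 1 cm, and nothing in the data alone can tell that shift apart from a real difference.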
relative to a reference genome
the local number of reads: "read depth"
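As a rough illustration of read depth, here is a sketch that counts how many reads cover each position. It assumes reads are given as 0-based, half-open `(start, end)` intervals on a single reference sequence; the function name `read_depth` and the toy reads are hypothetical.

```python
def read_depth(reads, genome_length):
    """Return the number of reads covering each position of the reference."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    # A running sum of the difference array gives per-position depth.
    depth = []
    running = 0
    for i in range(genome_length):
        running += diff[i]
        depth.append(running)
    return depth

# Toy example: three overlapping reads on a reference of length 10.
reads = [(0, 4), (2, 6), (3, 8)]
depth = read_depth(reads, 10)
print(depth)
```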
changes in read depth relative to a reference:
deviation of coverage from that expected from proportions of molecules in the "pool"
involves polymerase copying DNA many times over
deviation of coverage from expected given the proportion of molecules in the pool
(other steps are certainly also important)
useful plot for identifying non-uniform coverage
linear model of the Poisson rate including sequence bias
model for estimating isoform abundances including fragmentation, size selection, sequence bias
Probability of a vector of read counts \(\vec{n}\), indexed by read type j:
\[ f_{\vec{\theta}}(\vec{n}) = \prod_j f_{\mathrm{Pois}}(n_j, \vec{\theta} \cdot \vec{a}_j ) \]
likelihood of isoform abundances given fragment length distribution and sequence bias: used in Cufflinks
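The product of Poisson terms above can be evaluated on the log scale. The sketch below assumes \(\vec{\theta}\) holds isoform abundances and each \(\vec{a}_j\) holds the per-isoform rate contributions for read type j; the toy values and the function names are illustrative and are not taken from Cufflinks or any other package.

```python
import math

def poisson_logpmf(n, rate):
    # log of rate^n * exp(-rate) / n!, valid for rate > 0
    return n * math.log(rate) - rate - math.lgamma(n + 1)

def log_likelihood(counts, theta, a):
    """Sum over read types j of log f_Pois(n_j, theta . a_j)."""
    total = 0.0
    for n_j, a_j in zip(counts, a):
        rate = sum(t * x for t, x in zip(theta, a_j))  # theta . a_j
        total += poisson_logpmf(n_j, rate)
    return total

# Toy example: two isoforms, three read types.
# Read type 0 comes only from isoform 1, type 2 only from isoform 2,
# and type 1 from both (hypothetical coefficients).
a = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
counts = [2, 7, 5]

ll_a = log_likelihood(counts, [2.0, 5.0], a)
ll_b = log_likelihood(counts, [3.0, 5.0], a)
print(ll_a, ll_b)
```

With these toy counts, abundances of (2, 5) reproduce the counts exactly as Poisson rates, so they give a higher log-likelihood than (3, 5).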
SVA: Leek and Storey 2007, svaseq: Leek 2014
Per gene, model the mean for sample j, \(\mu_j\), as:
\[ \log(\mu_j) = \beta_0 + \beta_{b} 1_{j \in B} + \beta_{t} 1_{j \in T} \]
where B is the second batch, T is the treated samples.
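A sketch of why this model needs batch and treatment to be crossed: the design matrix has columns for the intercept, \(1_{j \in B}\), and \(1_{j \in T}\), and with 100% confounding the last two columns are identical, so \(\beta_b\) and \(\beta_t\) cannot be separated. The four-sample layouts and the helper names `design_matrix` and `matrix_rank` are hypothetical.

```python
def design_matrix(in_batch_b, treated):
    # Columns: intercept, batch indicator 1_{j in B}, treatment indicator 1_{j in T}.
    return [[1.0, float(b), float(t)] for b, t in zip(in_batch_b, treated)]

def matrix_rank(rows, tol=1e-9):
    """Rank via Gaussian elimination (pure Python, fine for small matrices)."""
    m = [row[:] for row in rows]
    n_cols = len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = None
        for r in range(rank, len(m)):
            if abs(m[r][col]) > tol:
                pivot = r
                break
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from the rows below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
        if rank == len(m):
            break
    return rank

# Balanced design: each batch contains one control and one treated sample.
balanced = design_matrix(in_batch_b=[0, 0, 1, 1], treated=[0, 1, 0, 1])
# Confounded design: all treated samples are in the second batch.
confounded = design_matrix(in_batch_b=[0, 0, 1, 1], treated=[0, 0, 1, 1])

rank_balanced = matrix_rank(balanced)
rank_confounded = matrix_rank(confounded)
print(rank_balanced, rank_confounded)
```

A rank equal to the number of coefficients (3) means all of them are estimable; a lower rank means at least one combination of batch and treatment effects is not identifiable from the data.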
\( \text{count} \sim \mathcal{L}(\text{bias} \cdot \text{biology}) \)