Introduction to DESeq and edgeR packages Peter A.C. ’t Hoen.
Poisson distribution
• discrete probability distribution that expresses the probability
of a number of events occurring in a fixed period of time if
these events occur with a known average rate and
independently of the time since the last event
$P(X = k) = \lambda^k e^{-\lambda} / k!$
$\lambda$ = expected number of occurrences, $k$ = number of occurrences
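As a minimal sketch of the definition above, the Poisson probabilities can be computed directly from the expected number of occurrences λ and the observed number k. The function name `poisson_pmf` is chosen here for illustration and does not come from the slides.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k events when lam events are expected."""
    if k < 0:
        return 0.0
    return lam ** k * math.exp(-lam) / math.factorial(k)


if __name__ == "__main__":
    lam = 3.0  # illustrative rate, not taken from the slides
    for k in range(7):
        print(k, round(poisson_pmf(k, lam), 4))
```

The probabilities sum to 1 over all k, and both the mean and the variance of the distribution equal λ, which is why the count variance of a pure Poisson process is tied to the expression level.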
Count process
• Poisson distribution
$Y_t \sim \mathrm{Poisson}(\lambda_t)$ with $\lambda_t = p_t n$
t: tag
λ: true expression
Y: observed expression
p: probability
n: total number of RNA molecules
• Truncated Poisson distribution: zero can mean not expressed or not counted
• Count variance ~ $\lambda_t$
• Murray F Freeman and John W Tukey. Ann Math Statist, 21:607-611, (1950)
Negative binomial distribution
• discrete probability distribution of the number of successes in
a sequence of Bernoulli trials before a specified (non-random)
number r of failures occurs
• also arises as a continuous mixture of Poisson distributions
where the mixing distribution of the Poisson rate is a gamma
distribution. That is, we can view the negative binomial as a
Poisson(λ) distribution, where λ is itself a random variable,
distributed according to Gamma(r, p/(1 − p)).
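The mixture view can be checked by simulation. The sketch below draws a rate from a gamma distribution with shape r and scale p/(1 − p), then draws a Poisson count with that rate, and returns the sample mean and variance for comparison with the negative binomial values r·p/(1 − p) and r·p/(1 − p)². All function names and parameter values are illustrative assumptions, and the Poisson sampler (Knuth's multiplication method) is only suitable for small rates.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's method (small lam only)."""
    limit = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_gamma_poisson(r, p, rng):
    """Draw lambda ~ Gamma(shape=r, scale=p/(1-p)), then a Poisson(lambda) count."""
    lam = rng.gammavariate(r, p / (1.0 - p))
    return sample_poisson(lam, rng)


def simulate(r, p, n, seed=0):
    """Return the sample mean and sample variance of n gamma-Poisson draws."""
    rng = random.Random(seed)
    draws = [sample_gamma_poisson(r, p, rng) for _ in range(n)]
    mean = sum(draws) / n
    var = sum((x - mean) ** 2 for x in draws) / (n - 1)
    return mean, var


if __name__ == "__main__":
    r, p = 5, 0.5  # illustrative values
    mean, var = simulate(r, p, 20000, seed=1)
    print("simulated mean, variance:", mean, var)
    print("NB theory:", r * p / (1 - p), r * p / (1 - p) ** 2)
```

The simulated variance exceeds the simulated mean, which is the overdispersion that a single Poisson distribution cannot represent.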
edgeR (1)
• Robinson, Smyth (Biostatistics, 2008; Bioinformatics 2007)
• Package available from Bioconductor with very informative
vignette
$Y_{ij} \sim \mathrm{NB}(\mu_{ij}, \phi)$
$\mathrm{Var}(Y_{ij}) = \mu_{ij} (1 + \mu_{ij} \phi)$
• Negative binomial (gamma-Poisson) with mean $\mu$
• $\phi$ is the overdispersion parameter (biological variation)
• $\phi = 0$ gives the Poisson distribution
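To make the mean-variance relation concrete, the sketch below evaluates Var(Y) = μ(1 + μφ) and inverts it to get a crude method-of-moments estimate of φ from replicate counts. This moment estimator is a stand-in chosen for illustration; it is not the estimator edgeR uses, and the function names are hypothetical.

```python
from statistics import mean, variance


def nb_variance(mu, phi):
    """Negative binomial variance for mean mu and dispersion phi."""
    return mu * (1.0 + mu * phi)


def moment_dispersion(counts):
    """Method-of-moments dispersion: solve s^2 = m * (1 + m * phi) for phi.

    Returns 0.0 when the counts are no more variable than Poisson counts.
    """
    m = mean(counts)
    if m == 0:
        return 0.0
    s2 = variance(counts)  # sample variance with n - 1 in the denominator
    return max(0.0, (s2 - m) / (m * m))


if __name__ == "__main__":
    replicates = [5, 30, 2, 18]  # made-up counts for one tag
    phi_hat = moment_dispersion(replicates)
    print("phi estimate:", phi_hat)
    print("implied variance:", nb_variance(mean(replicates), phi_hat))
    print("Poisson case (phi = 0):", nb_variance(100, 0.0))
```

With φ = 0 the variance equals the mean, recovering the Poisson case. With φ > 0 the extra term μ²φ grows quadratically with the mean.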
Overdispersion in our data
edgeR (2)
• Test per gene
$Y_{gij} \sim \mathrm{NB}(\mu_{gij}, \phi_g)$ where $\mu_{gij} = M_j \, p_{gi}$
$\mathrm{Var}(Y_{gij}) = \mu_{gij} (1 + \mu_{gij} \phi_g)$
$p_{gi}$ is the proportion of tags for tag g in sample i
$M_j$ is the library size for library j of sample i
$\phi_g$ is the dispersion parameter for tag g
edgeR (3)
• Estimation of a common dispersion parameter by conditioning
the likelihood for each tag g on the sum of counts and maximizing
the common likelihood
$l_C(\phi) = \sum_g l_g(\phi)$
• Common dispersion parameter OR weighted linear
combination of common and individual likelihoods
$WL(\phi_g) = l_g(\phi_g) + \alpha \, l_C(\phi_g)$ ($\alpha$: weight given to the common likelihood)
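The idea of combining individual and common likelihoods can be sketched as follows. This is a simplified illustration under stated assumptions: it uses the plain negative binomial log-likelihood with the mean fixed at the per-tag sample mean (so equal library sizes are assumed) rather than the conditional likelihood used by edgeR, it maximizes over a fixed grid of candidate dispersions, and the weight `alpha` is an arbitrary illustrative value. All function names are hypothetical.

```python
import math


def nb_loglik(counts, phi):
    """NB log-likelihood l_g(phi) of one tag's counts, mean fixed at the sample mean."""
    mu = sum(counts) / len(counts)
    if mu == 0:
        return 0.0  # an all-zero tag carries no information about phi
    r = 1.0 / phi  # NB "size" parameter
    return sum(
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu))
        for y in counts
    )


def estimate_dispersions(tags, alpha, grid):
    """Return (common dispersion, list of tagwise dispersions) on the grid.

    common:  argmax over phi of l_C(phi) = sum_g l_g(phi)
    tagwise: argmax over phi of l_g(phi) + alpha * l_C(phi)
    """
    common_ll = {phi: sum(nb_loglik(c, phi) for c in tags) for phi in grid}
    common_phi = max(grid, key=lambda phi: common_ll[phi])
    tagwise_phi = [
        max(grid, key=lambda phi: nb_loglik(c, phi) + alpha * common_ll[phi])
        for c in tags
    ]
    return common_phi, tagwise_phi


if __name__ == "__main__":
    # Made-up counts: four tags, four replicate libraries each.
    tags = [[10, 12, 9, 11], [5, 30, 2, 18], [100, 90, 120, 80], [3, 0, 7, 1]]
    grid = [0.01 * 1.25 ** i for i in range(30)]
    print(estimate_dispersions(tags, alpha=1.0, grid=grid))
```

With `alpha = 0` each tag gets its own maximum-likelihood dispersion. Larger values of `alpha` pull every tagwise estimate toward the common value, which stabilizes estimates when there are few replicates.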
edgeR (4)
• Exact test replacing hypergeometric probabilities with NB-
derived probabilities (qCML) for single factor experiment
• Generalized linear models and Cox-Reid profile-adjusted
likelihood (CR) method for multifactorial experiments
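The exact test for a single-factor, two-group comparison can be sketched as below. The sketch assumes equal library sizes and a known dispersion, and it omits the quantile adjustment of counts. Under these assumptions the sum of n replicate counts that are each NB(μ, φ) is NB(nμ, φ/n), so the two group sums can be compared conditional on their total, as in a hypergeometric-style test. The p-value sums the probabilities of all splits of the total that are no more likely than the observed split. Function names are hypothetical.

```python
import math


def nb_logpmf(y, mu, phi):
    """Log-probability of count y under NB with mean mu (> 0) and dispersion phi (> 0)."""
    r = 1.0 / phi
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))


def nb_exact_test(sum1, n1, sum2, n2, phi):
    """Two-sided exact p-value for group sums sum1 (n1 libraries) and sum2 (n2 libraries)."""
    total = sum1 + sum2
    if total == 0:
        return 1.0
    mu = total / (n1 + n2)  # pooled per-library mean under the null hypothesis
    # Log-probability of every possible split (a, total - a) of the total count.
    logp = [
        nb_logpmf(a, n1 * mu, phi / n1) + nb_logpmf(total - a, n2 * mu, phi / n2)
        for a in range(total + 1)
    ]
    top = max(logp)
    weights = [math.exp(lp - top) for lp in logp]  # rescaled to avoid underflow
    norm = sum(weights)
    observed = weights[sum1]
    # Sum all splits that are at most as probable as the observed one.
    tail = sum(w for w in weights if w <= observed * (1 + 1e-9))
    return min(1.0, tail / norm)


if __name__ == "__main__":
    # Made-up group sums for one tag, three libraries per group.
    print(nb_exact_test(60, 3, 140, 3, phi=0.1))
    print(nb_exact_test(100, 3, 100, 3, phi=0.1))
```

A larger dispersion spreads the conditional distribution out, so the same difference in counts gives a larger p-value than it would under a Poisson model.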
edgeR: what is new?
• Exact test cannot handle confounders;
replaced by a generalized linear model with a log-likelihood
ratio test
• Abundance trending in dispersion estimates
Dispersion trend
[Figure: dispersion vs. abundance]
Dispersion trending (after filtering for low abundance)
[Figure: dispersion vs. abundance]
DESeq (1)
• Anders and Huber: Genome Biology (2010) 11:R106
• Roughly same principles as edgeR
• No multifactorial analysis implemented yet
DESeq (2)
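(1) $Y_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma^2_{ij})$
(2) $\mu_{ij} = s_j \, q_{i,\rho(j)}$
$s_j$: scaling factor for sample j
$q_{i,\rho(j)}$: proportional concentration of tag i in condition $\rho(j)$
(3) $\sigma^2_{ij} = \mu_{ij} + s_j^2 \, \nu_{i,\rho(j)}$
$\mu_{ij}$: count noise; $s_j^2 \, \nu_{i,\rho(j)}$: extra variance
$\nu_{i,\rho(j)}$ is a smooth function depending on $q_{i,\rho(j)}$ (concentration)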
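The variance decomposition in (3) can be sketched numerically. In DESeq the smooth function ν is estimated from the data. The quadratic `example_nu` below is only a hypothetical placeholder so that the two variance components can be printed side by side, and the function names and numbers are illustrative.

```python
def example_nu(q):
    """Hypothetical smooth function nu(q); a placeholder, not DESeq's fitted function."""
    return 0.05 * q * q


def deseq_mean_and_variance(s_j, q, nu=example_nu):
    """Return (mean, count noise, extra variance, total variance) for one tag and sample.

    s_j: scaling factor of sample j
    q:   proportional concentration of the tag in the sample's condition
    """
    mu = s_j * q                     # (2) mean
    count_noise = mu                 # Poisson-like term
    extra_variance = s_j ** 2 * nu(q)
    return mu, count_noise, extra_variance, count_noise + extra_variance


if __name__ == "__main__":
    for q in (1.0, 10.0, 50.0, 500.0):
        print(q, deseq_mean_and_variance(2.0, q))
```

For low concentrations the count noise dominates, while for high concentrations the extra variance takes over. With the quadratic placeholder, ν = φq² gives σ² = μ + φμ², the same form as the edgeR variance.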
DESeq (3): variance trend with expression
Purple: Poisson; dashed orange: edgeR (before trending); orange: DESeq
You can derive: squared CV is $1/\mu + \phi$
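The squared coefficient of variation follows in one step from the negative binomial variance $\mathrm{Var}(Y) = \mu(1 + \mu\phi)$, with $\mu$ the mean and $\phi$ the dispersion:

```latex
\begin{align*}
\mathrm{CV}^2
  = \frac{\mathrm{Var}(Y)}{\mathrm{E}(Y)^2}
  = \frac{\mu\,(1 + \mu\phi)}{\mu^2}
  = \frac{1}{\mu} + \phi
\end{align*}
```

The term $1/\mu$ is the Poisson contribution and vanishes for highly expressed tags, so at high abundance the squared CV approaches $\phi$.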
DESeq (3)
• Differences from edgeR:
  • Complete shrinkage to the trended dispersion; limited tagwise
  dispersion estimates
  • Different variance estimates allowed for different sample groups
  • Deals better with samples with large differences in read depth?
DESeq (4): statistical testing
• By analogy with the initial edgeR implementation: exact test on the
NB probabilities in the two conditions
Conclusions
• edgeR and DESeq are comparable implementations of
statistical tests using the NB distribution
• edgeR and DESeq produce largely similar results
• Implementation of generalized linear models in edgeR allows
for testing with confounders
• Results are comparable to limma for medium- to highly expressed
genes: modeling of stochastic effects is particularly important
for lowly expressed genes
Comparison to limma (on sqrt scaled data)