![Page 1: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/1.jpg)
RNASeq: Experimental Design & Statistics for Differential Expression
(and a tiny bit of ChipSeq)
Blythe Durbin-Johnson, Ph.D.
![Page 2: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/2.jpg)
Outline
• Hypothesis testing, p-values, and power
• Multiple testing
• Experimental design and replication
• Statistical models for RNAseq data
• Visualization techniques
• Statistics behind IDR
![Page 3: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/3.jpg)
HYPOTHESIS TESTING, P-VALUES, AND POWER
![Page 4: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/4.jpg)
Hypothesis Testing
• Test “null hypothesis” of no effect against “alternative hypothesis”
• Calculate test statistic, reject null if test statistic large relative to what one would expect under null distribution
![Page 5: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/5.jpg)
P-Values
• P-value = probability of seeing a test statistic at least as extreme as yours when the null hypothesis is true
• Typically reject null if P < 0.05
–This is purely a historical convention
–Nothing magic happens at the P = 0.05 threshold
![Page 6: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/6.jpg)
A P-Value is NOT
• …the probability that the null hypothesis is true
• …the probability that an experiment will not be replicated
• …a direct measure of the size or importance of an effect
• …a measure of biological/clinical significance
![Page 7: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/7.jpg)
Power
• Power = probability of rejecting null hypothesis for a given effect size
• Depends on:
–Effect size (difference between groups)
–Sample size
–Amount of variability in data
–Hypothesis test being used
–How “significance” is defined
![Page 8: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/8.jpg)
Power and P-Values
• Under the null hypothesis, p-values uniformly distributed between 0 and 1
–Expect 5% to be less than 0.05, on average
• Under alternatives, higher probability of smaller p-values (higher power), but still can theoretically get any p-value between 0 and 1
![Page 9: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/9.jpg)
Power Example
• Simulate two groups of normally distributed data with means 0, 0.5, 1, and 2 standard deviations apart
• Conduct two-sample t-test
• Repeat 5000 times, look at distribution of p-values
• Repeat for various sample sizes
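The simulation described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the code behind the original slides: the function names, the default of 2000 repetitions (the slide uses 5000), the seed, and the Simpson's-rule approximation to the t distribution (used only to avoid external libraries) are all illustrative choices.

```python
import math
import random


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_two_sided_p(t, df, steps=200):
    """Two-sided p-value, integrating the density on [0, |t|] by Simpson's rule."""
    a = abs(t)
    if a == 0.0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)


def two_sample_t_test(x, y):
    """Pooled-variance two-sample t-test; returns the two-sided p-value."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    df = nx + ny - 2
    t = (mx - my) / math.sqrt(ss / df * (1 / nx + 1 / ny))
    return t_two_sided_p(t, df)


def simulate_power(effect, n, reps=2000, alpha=0.05, seed=1):
    """Fraction of simulated experiments with p < alpha.

    `effect` is the difference in group means in standard-deviation units,
    `n` is the number of observations per group.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(effect, 1.0) for _ in range(n)]
        hits += two_sample_t_test(x, y) < alpha
    return hits / reps
```

Calling `simulate_power(0.0, n)` should return a value near 0.05 for any `n`, since p-values are uniform under the null; calling it with effects of 0.5, 1, and 2 across several sample sizes shows how power grows with both effect size and replication.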
![Page 10: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/10.jpg)
![Page 11: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/11.jpg)
![Page 12: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/12.jpg)
MULTIPLE TESTING
![Page 13: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/13.jpg)
Multiple Testing Example
• Patient samples treated with different radiation doses and observed over time
• Illumina microarray experiment, 16,801 genes used in analysis
• Four replicates per patient/time/dose
• All samples used in this example were replicates from same patient, untreated
• T-tests gene by gene comparing replicates 1 and 3 to replicates 2 and 4
• 196 genes with P < 0.05
![Page 14: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/14.jpg)
Multiple Testing Example
• Entered list of genes with P < 0.05 into DAVID’s functional annotation tool
– http://david.abcc.ncifcrf.gov
• Overrepresented terms (P < 0.05) included disease mutation, mutagenesis site, and 79 others
• If you were doing radiation research, would you be excited about this?
![Page 15: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/15.jpg)
Multiple Testing Example
• We know there is no difference between the “groups”
• What is going on?
![Page 16: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/16.jpg)
Multiple Testing Example
• Expect P < 0.05 about 5% of the time under null hypothesis
• (We see 196/16801 = 1.2% of genes with P < 0.05, but our data aren’t perfectly normal and our p-values are correlated)
• When conducting multiple tests, need to make adjustments to avoid spurious results
![Page 17: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/17.jpg)
Multiple Testing
Familywise Error Rate

| | Number declared non-significant | Number declared significant | Total |
|---|---|---|---|
| True null hypotheses | U | V | m0 |
| False null hypotheses | T | S | m - m0 |
| Total | m - R | R | m |

FWER = P(V ≥ 1)
FWER = Probability of ANY false positives
![Page 18: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/18.jpg)
Multiple Testing
• One way of controlling FWER: set α’ = α/m, where m is the number of tests (Bonferroni Correction)
• Problems:
1. Very conservative, even for FWER control
2. Is the FWER really what we want to control?
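To see why a correction is needed, and why Bonferroni is so strict, the following sketch computes the familywise error rate under the simplifying assumption that all m tests are independent and all nulls are true. That independence is an assumption made here for illustration; real gene-level p-values are correlated. The function names are hypothetical.

```python
def fwer_independent(alpha, m):
    """P(at least one false positive) for m independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha, m):
    """Per-test significance level that keeps the FWER at or below alpha."""
    return alpha / m


# FWER without correction, and the Bonferroni per-test threshold,
# for a few numbers of tests (16801 is the gene count from the example).
for m in (1, 10, 100, 16801):
    print(m, round(fwer_independent(0.05, m), 4), bonferroni_threshold(0.05, m))
```

With thousands of tests the uncorrected FWER is essentially 1, while the Bonferroni threshold becomes so small that only very strong effects survive.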
![Page 19: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/19.jpg)
Multiple Testing
False Discovery Rate (FDR)
FDR = E[V/R] (with V/R taken as 0 when R = 0)

| | Number declared non-significant | Number declared significant | Total |
|---|---|---|---|
| True null hypotheses | U | V | m0 |
| False null hypotheses | T | S | m - m0 |
| Total | m - R | R | m |

(Benjamini and Hochberg, 1995)
![Page 20: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/20.jpg)
Multiple Testing
False Discovery Rate (FDR)
FDR = E[V/R] ← control this
FWER = P(V ≥ 1) ← not this

| | Number declared non-significant | Number declared significant | Total |
|---|---|---|---|
| True null hypotheses | U | V | m0 |
| False null hypotheses | T | S | m - m0 |
| Total | m - R | R | m |

(Benjamini and Hochberg, 1995)
![Page 21: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/21.jpg)
Multiple Testing
• False Discovery Rate-controlling procedure: (Benjamini and Hochberg, 1995)
1. Sort p-values from smallest to largest (1 to m), let k be the rank
2. Select a desired FDR α
3. Find the largest rank k’ where P(k) ≤ (k/m)*α
4. Null hypotheses 1 through k’ are rejected
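The four steps above can be sketched as a short function. This is an illustrative implementation; the function name and the example p-values are made up for the sketch.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(pvalues)
    # Step 1: rank the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Step 3: find the largest rank k with P(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k_max = rank
    # Step 4: reject every hypothesis ranked 1 through k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, deliberately unsorted.
example = [0.9, 0.029, 0.001, 0.035, 0.021]
print(benjamini_hochberg(example, alpha=0.05))
```

Note that the procedure is "step-up": in the example, 0.021 exceeds its own threshold of (2/5) × 0.05 = 0.02 but is still rejected, because a larger p-value further down the list (0.035 at rank 4, threshold 0.04) passes. In R, `p.adjust(p, method = "BH")` returns adjusted p-values based on the same procedure.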
![Page 22: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/22.jpg)
Multiple Testing
• Note that the gene with the smallest p-value is still tested using α/m (like Bonferroni)
• The number of genes/transcripts included in the analysis still matters
• Filtering can help (but don’t filter based on treatment/group membership)
![Page 23: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/23.jpg)
Multiple Testing Example (Revisited)
• Recall example of testing differential expression between 2 pairs of replicates in a microarray experiment
• No genes are differentially expressed at FDR-level 0.1
![Page 24: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/24.jpg)
EXPERIMENTAL DESIGN AND REPLICATION
![Page 25: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/25.jpg)
Replicated and Unreplicated Designs
Why Replicate?
[Figure: mut and wt groups, illustrating biological heterogeneity]
![Page 26: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/26.jpg)
Replicated and Unreplicated Designs
Unreplicated Design
[Figure: mut and wt groups]
Here, groups differ, but single replicates from each group very similar
![Page 27: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/27.jpg)
Replicated and Unreplicated Designs
Unreplicated Design
[Figure: mut and wt groups]
Here, groups are similar, but outlying observation from group on right makes it look like there’s a big difference in unreplicated experiment
![Page 28: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/28.jpg)
Why Replicate?
• Single biological replicate may not be representative of a whole group
• Power to test for differential expression is limited
• Cannot estimate within-group variability directly
– Have to assume that most genes aren’t differentially expressed, use between-group variability as surrogate for within-group variability
• Unreplicated experiments not recommended
![Page 29: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/29.jpg)
How Many Replicates?
• Depends on how big of a fold change you want to detect
• Theoretically, for any sample size there is some difference detectable with 80% power
• (This difference might be unrealistically huge)
• BUT very small sample sizes cause other problems besides lack of power
![Page 30: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/30.jpg)
How Many Replicates?
• Too few replicates → lack of generalizability
• Rely heavily on other genes to estimate variability
• With only 2 replicates, false discovery rate inflated (Soneson and Delorenzi, 2013)
• Increasing number of replicates even from 2 to 3 helps with FDR inflation
![Page 31: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/31.jpg)
How Many Replicates?
• Resources of course limit numbers of replicates
• An undersized experiment that misleads may be worse than no experiment
– This is particularly true of n = 1
![Page 32: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/32.jpg)
Technical Replicates vs. Biological Replicates
• Biological variability >> technical variability
• Technical replicates are not a substitute for biological replicates
![Page 33: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/33.jpg)
Technical Replicates vs. Biological Replicates
• Treating technical replicates as biological replicates underestimates variability, inflates Type I error rate
–Do not treat technical replicates like biological replicates
–Do not treat repeated measures on the same experimental unit like independent observations
![Page 34: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/34.jpg)
STATISTICAL MODELS FOR RNASEQ DATA
![Page 35: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/35.jpg)
Models for Count Data
• RNAseq data typically consist of counts for each gene/transcript in each sample
• Generally use special models for count data (or transform data in ways that address variance structure)
![Page 36: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/36.jpg)
Poisson Models
• Count data are often modeled with a Poisson model
• For comparing two groups A and B:
μ = mean count
log(μ) = intercept + β*I(Group = B)
β is log fold-change B/A
• Poisson model assumes variance = mean
Variance(count) = μ
• This is a strong assumption!
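For the two-group model above, the maximum-likelihood fit has a closed form: the intercept is the log of the group A mean and β is the log of the ratio of group means. The sketch below illustrates this with made-up counts for one gene. It assumes equal library sizes (no normalization offset) and uses natural logs; the sample variance-to-mean ratios give a rough check on the variance = mean assumption.

```python
import math

# Hypothetical counts for one gene in two groups (illustration only).
counts_a = [12, 15, 9, 14]
counts_b = [30, 22, 27, 33]


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def fit_two_group_poisson(a, b):
    """Closed-form ML fit of log(mu) = intercept + beta * I(Group = B)."""
    intercept = math.log(mean(a))
    beta = math.log(mean(b)) - intercept
    return intercept, beta


intercept, beta = fit_two_group_poisson(counts_a, counts_b)
print("fold change B/A:", math.exp(beta))
# Under a Poisson model these ratios should be close to 1.
print("variance/mean, group A:", sample_variance(counts_a) / mean(counts_a))
print("variance/mean, group B:", sample_variance(counts_b) / mean(counts_b))
```

Ratios well above 1 across many genes are the usual sign that the Poisson assumption is too strong for the data.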
![Page 37: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/37.jpg)
Negative Binomial Model
• Negative binomial distribution does not assume variance is equal to mean
Mean(count) = μ
Var(count) = μ + φμ²
• φ is “dispersion parameter”
• If φ = 0 we have a Poisson model
![Page 38: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/38.jpg)
Negative Binomial Model
• Negative binomial model can be derived as a mixture of different Poisson distributions
• RNAseq Data:
–Each Poisson distribution in mixture represents “shot noise” or within-sample variability
–Using mixture of Poisson distributions allows model for biological variability
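The mixture idea can be checked by simulation. In the sketch below each sample gets its own Poisson rate drawn from a gamma distribution with mean μ and variance φμ² (the gamma mixing distribution is the standard construction of the negative binomial); the count is then Poisson given that rate. By the law of total variance, Var(count) = E[rate] + Var(rate) = μ + φμ². The function names, the values μ = 20 and φ = 0.25, and the seed are assumptions chosen for illustration.

```python
import math
import random


def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for the moderate means used here."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def gamma_poisson_draw(rng, mu, phi):
    """Draw a sample-specific rate from a gamma, then a count from that Poisson."""
    rate = rng.gammavariate(1.0 / phi, phi * mu)  # mean mu, variance phi * mu**2
    return poisson_draw(rng, rate)


def simulate_counts(mu=20.0, phi=0.25, n=20000, seed=7):
    rng = random.Random(seed)
    return [gamma_poisson_draw(rng, mu, phi) for _ in range(n)]


def mean_and_variance(values):
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / (len(values) - 1)


counts = simulate_counts()
m, v = mean_and_variance(counts)
print("sample mean:", m)
print("sample variance:", v, "vs. mu + phi * mu^2 =", 20.0 + 0.25 * 20.0 ** 2)
```

The Poisson step plays the role of shot noise within a sample, and the gamma step plays the role of biological variability between samples.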
![Page 39: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/39.jpg)
Negative Binomial Model
• Is this the true data-generating model?
– Unlikely
– Requires variance >= mean, easy to imagine situation where this doesn’t hold
• “All models are wrong, but some models are useful”
--George Box
• Usefulness of NB model is open question
![Page 40: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/40.jpg)
Estimating the Dispersion Parameter
• Negative binomial modeling requires the variability to be estimated separately from the mean
• RNAseq data rarely have enough replicates to do this gene by gene
• Borrow information from other genes to estimate dispersion
![Page 41: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/41.jpg)
Estimating the Dispersion Parameter
• Methods of estimating the dispersion parameter
– Use empirical Bayes methods to “squeeze” local estimate of dispersion towards overall dispersion
– Model dispersion parameter as a function of the mean
– Some combination of these
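As a toy illustration of "borrowing information", the sketch below computes a crude method-of-moments dispersion for each gene and pulls it toward the median across genes. This is not the algorithm used by edgeR, DESeq, or any other package; the shrinkage rule, the `weight` parameter, the function names, and the counts are all hypothetical and serve only to show the idea of squeezing noisy gene-wise estimates toward a common value.

```python
import statistics


def moment_dispersion(counts):
    """Method-of-moments estimate of phi from Var = mu + phi * mu**2, floored at 0."""
    n = len(counts)
    mu = sum(counts) / n
    if mu == 0:
        return 0.0
    var = sum((c - mu) ** 2 for c in counts) / (n - 1)
    return max(0.0, (var - mu) / mu ** 2)


def shrink_dispersions(gene_counts, weight=0.5):
    """Pull each gene-wise estimate toward the median across genes.

    `weight` is a hypothetical tuning constant: 0 keeps the gene-wise
    estimates, 1 replaces them all with the common value.
    """
    raw = {gene: moment_dispersion(c) for gene, c in gene_counts.items()}
    common = statistics.median(raw.values())
    return {gene: (1 - weight) * phi + weight * common for gene, phi in raw.items()}


# Made-up counts: three genes, three replicates each.
genes = {"g1": [10, 12, 9], "g2": [5, 40, 22], "g3": [100, 130, 85]}
for gene, phi in shrink_dispersions(genes).items():
    print(gene, round(moment_dispersion(genes[gene]), 4), "->", round(phi, 4))
```

With only three replicates the raw estimates swing widely (including an estimate of exactly 0 for a gene whose sample variance falls below its mean), which is the practical motivation for shrinkage.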
![Page 42: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/42.jpg)
Normal Models for Count Data
• edgeR, DESeq, Cuffdiff2 model data directly using negative binomial distribution
• Data may not be negative binomial
• Even for data that are truly negative binomial, testing is based on asymptotic (n → ∞) theory
• Asymptotic theory may not work for small sample sizes
![Page 43: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/43.jpg)
Normal Models for Count Data
• limma-voom calculates variance weights for log2(CPM) so they can be modelled like continuous, normally-distributed data
• Performs well against negative binomial models in comparison papers
• Normal approximation works best for larger counts
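The log2(CPM) transformation itself is simple to sketch. The version below uses the small offsets described in the voom paper (0.5 added to the count, 1 added to the library size) so that zero counts remain finite. It does not compute the precision weights, which come from a fitted mean-variance trend and are the part that limma-voom actually contributes. The function name, counts, and library sizes are illustrative.

```python
import math


def log2_cpm(count, library_size):
    """log2 counts per million, with offsets that keep zero counts finite."""
    return math.log2((count + 0.5) / (library_size + 1.0) * 1e6)


# Hypothetical counts for one gene in three samples with different depths.
counts = [0, 25, 310]
library_sizes = [8_000_000, 12_500_000, 20_000_000]
for count, size in zip(counts, library_sizes):
    print(count, size, round(log2_cpm(count, size), 3))
```

Because the variance of log2(CPM) is much larger at low counts than at high counts, an unweighted normal model on these values would be poorly calibrated, which is why the weights matter.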
![Page 44: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/44.jpg)
Caveats
• Take results with a grain of salt
– Follow up with PCR on different samples
• Different methods can produce very different lists of DE genes
• Most methods produce large numbers of false discoveries*
• No existing method is perfect
*D.M. Rocke, personal communication
![Page 45: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/45.jpg)
More Complicated Models
• Negative binomial modeling is not limited to comparison of two groups
• Can fit models that are analogous to regression or ANOVA for ordinary linear models
![Page 46: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/46.jpg)
More Complicated Models
• Allows modeling of RNAseq data from multifactorial experimental designs
–Can be more powerful than looking at one experimental condition at a time
–Can look at interaction between multiple experimental conditions
–Can look at continuous changes in expression as function of e.g. age
![Page 47: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/47.jpg)
Multifactorial Model Example
• RNAseq data
• Two genotypes of plant (CM and SP)
• Two experimental conditions (N and G)
• 3 replicates
![Page 48: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/48.jpg)
Multifactorial Model Example
• Model fitted that includes “main effects” for genotype and condition plus “interaction effects”
![Page 49: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/49.jpg)
Multifactorial Model Example: Interpretation of Parameters
• Genotype CM, condition G
• Genotype CM, condition N
• Genotype SP, condition G
• Genotype SP, condition N
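One common way to parameterize this model is log(μ) = β0 + β1·I(genotype = SP) + β2·I(condition = N) + β3·I(genotype = SP and condition = N). The sketch below builds the corresponding design-matrix row for each of the four cells. The choice of CM and G as reference levels and the coefficient values are assumptions made for illustration; the slides do not state which parameterization was used.

```python
GENOTYPES = ("CM", "SP")   # assumed reference level: CM
CONDITIONS = ("G", "N")    # assumed reference level: G


def design_row(genotype, condition):
    """Row of the design matrix: intercept, genotype, condition, interaction."""
    g = 1 if genotype == "SP" else 0
    c = 1 if condition == "N" else 0
    return [1, g, c, g * c]


def log_mean(beta, genotype, condition):
    """Linear predictor log(mu) for one genotype/condition cell."""
    return sum(b * x for b, x in zip(beta, design_row(genotype, condition)))


beta = [3.0, 0.5, -1.0, 0.8]  # hypothetical coefficients on the log scale
for genotype in GENOTYPES:
    for condition in CONDITIONS:
        print(genotype, condition, design_row(genotype, condition),
              log_mean(beta, genotype, condition))
```

Under this coding the interaction coefficient β3 is the difference between the condition effect in SP and the condition effect in CM, so β3 = 0 means the two genotypes respond to the condition in the same way on the log scale.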
![Page 50: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/50.jpg)
VISUALIZATION TECHNIQUES
![Page 51: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/51.jpg)
Distances
• Many unsupervised clustering/visualization techniques are based on distance matrices
• Distance between two points can be defined many ways
– Euclidean distance
– 1 – abs(correlation)
– Mahalanobis (covariance-scaled) distance
– Maximum distance
• Euclidean distance most common
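Three of the distances listed above are easy to write out directly. The sketch below treats each sample as a vector of expression values; the function names and example vectors are illustrative. Mahalanobis distance is omitted because it needs the inverse of a covariance matrix, which is awkward without a linear-algebra library.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def maximum_distance(x, y):
    """Largest absolute difference in any single dimension."""
    return max(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance(x, y):
    """1 - |correlation|: strongly correlated or anti-correlated profiles are close."""
    return 1.0 - abs(pearson(x, y))


sample_1 = [1.0, 2.0, 3.0, 4.0]   # hypothetical expression values for four genes
sample_2 = [2.0, 4.0, 6.0, 8.5]
print(euclidean(sample_1, sample_2))
print(maximum_distance(sample_1, sample_2))
print(correlation_distance(sample_1, sample_2))
```

The example pair is far apart in Euclidean terms but close in correlation distance, since one profile is nearly a multiple of the other; the choice of distance determines which kind of similarity a clustering picks up.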
![Page 52: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/52.jpg)
Euclidean Distance
![Page 53: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/53.jpg)
Euclidean Distance
• For Euclidean distance to be meaningful, data must be scaled so that each dimension (gene) has the same variance
[Figure: two panels showing the same points on different scales. In one panel diamond and X are closest; in the other, circle and diamond are closest.]
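The effect of scaling on nearest neighbours can be reproduced with three made-up samples measured on two genes whose spreads differ greatly. The numbers below are hypothetical and are not the data behind the slide's figure; scaling here means dividing each gene by its sample standard deviation.

```python
import math

# Hypothetical samples measured on two genes with very different spreads.
samples = {
    "circle": (100.0, 1.0),
    "diamond": (150.0, 1.1),
    "X": (160.0, 3.0),
}


def scale_genes(points):
    """Divide each gene (dimension) by its sample standard deviation."""
    names = list(points)
    columns = list(zip(*(points[name] for name in names)))
    sds = []
    for column in columns:
        m = sum(column) / len(column)
        sds.append(math.sqrt(sum((v - m) ** 2 for v in column) / (len(column) - 1)))
    return {name: tuple(v / sd for v, sd in zip(points[name], sds)) for name in names}


def closest_pair(points):
    """Pair of sample names with the smallest Euclidean distance."""
    names = list(points)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs, key=lambda p: math.dist(points[p[0]], points[p[1]]))


print("unscaled:", closest_pair(samples))
print("scaled:  ", closest_pair(scale_genes(samples)))
```

On the raw values the first gene dominates the distance, so diamond and X come out closest; after scaling, both genes contribute comparably and circle and diamond become the closest pair.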
![Page 54: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/54.jpg)
Multidimensional Scaling Plots
• MDS takes distance matrix, recreates data in two dimensions while preserving distances as closely as possible
http://statlab.bio5.org/foswiki/pub/Main/RBioconductorWorkshop2012/Day6_demo.pdf
![Page 55: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/55.jpg)
Multidimensional Scaling Plots
![Page 56: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/56.jpg)
Hierarchical Clustering
• Hierarchical clustering starts by treating each sample as its own cluster
• The “closest” clusters are merged successively until only one cluster remains
• Produces tree with series of nested clusterings rather than one set of clusters
• Plots of these trees are called “dendrograms”
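The merging process can be sketched in a few lines. The slide does not say how the distance between two clusters is defined, so this sketch assumes average linkage (mean pairwise distance between members), and it uses one-dimensional points to keep the distance calculation trivial. The function names and data are illustrative.

```python
def average_linkage(cluster_a, cluster_b, points):
    """Mean pairwise distance between members of two clusters (1-D points here)."""
    total = sum(abs(points[i] - points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def hierarchical_clustering(points):
    """Agglomerative clustering; returns the merge history as (a, b, distance)."""
    clusters = [(i,) for i in range(len(points))]  # every sample starts alone
    merges = []
    while len(clusters) > 1:
        candidates = [
            (average_linkage(a, b, points), a, b)
            for i, a in enumerate(clusters)
            for b in clusters[i + 1:]
        ]
        distance, a, b = min(candidates)  # the closest pair of clusters
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
        merges.append((a, b, distance))
    return merges


values = [0.0, 1.0, 10.0, 11.5, 50.0]  # hypothetical one-gene "samples"
for a, b, distance in hierarchical_clustering(values):
    print(a, "+", b, "at distance", distance)
```

The merge history is exactly what a dendrogram draws: each merge is a join in the tree, placed at a height equal to the distance at which the two clusters were combined.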
![Page 57: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/57.jpg)
Hierarchical Clustering
![Page 58: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/58.jpg)
Heat Maps
• Data are plotted with color corresponding to numeric value
• Dendrograms of rows (genes) and columns (samples) displayed on sides
• Rows/columns are reordered by their means; this tends to create blocks of color
![Page 59: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/59.jpg)
Statistics Behind IDR
• IDR = Irreproducible Discovery Rate
– Li, Brown, Huang, and Bickel 2011
• Way of assessing reproducibility of top ranked signals between replicate experiments
• Used to assess reproducibility of ChIP-seq
• Applicable to any high-throughput method that outputs a ranked list
![Page 60: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/60.jpg)
Statistics Behind IDR
• IDR = P(signal not reproducible)
• Model assumes data are mixture of real and spurious signals
• Ranks of real signals will be correlated between replicate experiments
• Ranks of spurious signals will be uncorrelated
![Page 61: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/61.jpg)
The irreproducible discovery rate (IDR) framework for assessing reproducibility of ChIP-seq data sets.
Landt S G et al. Genome Res. 2012;22:1813-1831
© 2012, Published by Cold Spring Harbor Laboratory Press
![Page 62: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/62.jpg)
Conclusions
• While data from unreplicated RNAseq experiments can be analyzed, not recommended
– 3 or more biological replicates recommended
– 10 for non-inbred organisms
– For complex experimental designs, consider degrees of freedom
• Multiple testing increases risk of significant p-values when no difference exists, adjust by FDR or other method
![Page 63: RNASeq: Experimental Design & Statistics for Differential ... · Normal Models for Count Data •limma-voom calculates variance weights for log2(CPM) so they can be modelled like](https://reader033.fdocuments.net/reader033/viewer/2022052019/60335d3fd0813c74112534f5/html5/thumbnails/63.jpg)
Conclusions (Continued)
• A list of DE genes is a first step, not absolute truth
– Follow up experiments
• Thoughtful experimental design and use of statistics is as important for genomics data as for any other kind of data