Statistical tests for differential expression in cDNA microarray experiments (2): ANOVA
Xiangqin Cui and Gary A. Churchill
Genome Biology 2003, 4:210
Presented by M. Carme Ruíz de Villa and Alex Sánchez
Departament d’Estadística, U.B.
Introduction
Remember …
We want to measure how gene expression changes under different conditions.
• Only two conditions and an adequate number of replicates: t-tests & extensions
• More than two conditions / more than one factor: several approaches
  • Analysis of Variance (ANOVA) (Churchill et al.)
  • Linear Models (Smyth, Speed, …)
Sources of variation (1)
We want to determine when the variation due to gene expression is significant, but…
• There are multiple sources of variation in measurements besides just gene expression.
• We want to know when the variation in measurements is caused by varying levels of gene expression versus other factors.
Sources of variation (2)
Some sources of variation in the measurements in microarray experiments are:
• Array effects
• Dye effects
• Variety effects
• Gene effects
• Combinations
Relative expression values
• If there are more than two conditions, we cannot simply compute ratios.
• ANOVA modelling yields estimates of the relative expression for each gene in each sample.
• The ANOVA model is not based on log ratios. Rather, it is applied directly to intensity data. However, the difference between two relative expression values can be interpreted as the mean log ratio for comparing two samples.
Technical & biological replicates
If inference is being made on the basis of biological replicates and there is also technical replication, technical replicates should be averaged to yield a single value for each independent biological unit.
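The averaging step can be sketched in base R. The data frame, its column names and the intensity values below are hypothetical and serve only to illustrate collapsing technical replicates to one value per biological unit.

```r
# Hypothetical log intensities for one gene: 3 biological units,
# each measured with 2 technical replicates.
measurements <- data.frame(
  bio_unit      = c("m1", "m1", "m2", "m2", "m3", "m3"),
  log_intensity = c(8.1, 8.3, 7.6, 7.8, 9.0, 9.4)
)

# Average the technical replicates so that each biological unit
# contributes a single value to the downstream analysis.
per_unit <- aggregate(log_intensity ~ bio_unit, data = measurements, FUN = mean)
print(per_unit)
```

After this step the analysis sees three observations, not six, so the degrees of freedom reflect the independent biological units only.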
Review of ANOVA models
One way ANOVA
Suppose you have a model for each measurement in your experiment:
$$ y_{ij} = \mu + \alpha_i + \varepsilon_{ij} $$
• $y_{ij}$ is the jth measurement for the ith group.
• $\mu$: overall mean effect (constant)
• $\alpha_i$: ith group effect (constant)
• $\varepsilon_{ij}$: experimental error term ~ $N(0, \sigma^2)$
Therefore, observations from group i are distributed with mean $\mu + \alpha_i$ and variance $\sigma^2$.
Hypothesis Testing
$$ (y_{ij} - \bar{y}_{..}) = (y_{ij} - \bar{y}_{i.}) + (\bar{y}_{i.} - \bar{y}_{..}) $$
Overall variability = Within group variability + Between group variability
Intuition: if between group variability is large compared to within group variability, then the differences between the group means are significant.
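Sum of Squares
Total sum of squares:
$$ \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2 $$
Within Sum of Squares:
$$ \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2 $$
Between Sum of Squares:
$$ \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\bar{y}_{i.} - \bar{y}_{..})^2 $$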
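A short derivation shows why the total sum of squares splits exactly into the within and between parts. It only uses the decomposition of each deviation given under Hypothesis Testing; nothing beyond the slide's notation is assumed.

```latex
\begin{aligned}
\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij}-\bar{y}_{..})^2
 &= \sum_{i=1}^{k}\sum_{j=1}^{n_i}
    \bigl[(y_{ij}-\bar{y}_{i.}) + (\bar{y}_{i.}-\bar{y}_{..})\bigr]^2 \\
 &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij}-\bar{y}_{i.})^2
  + \sum_{i=1}^{k}\sum_{j=1}^{n_i} (\bar{y}_{i.}-\bar{y}_{..})^2
  + 2\sum_{i=1}^{k} (\bar{y}_{i.}-\bar{y}_{..})
     \sum_{j=1}^{n_i} (y_{ij}-\bar{y}_{i.}) \\
 &= \text{Within SS} + \text{Between SS},
\end{aligned}
```

The cross term vanishes because the deviations from a group mean sum to zero within each group, that is, $\sum_{j=1}^{n_i} (y_{ij}-\bar{y}_{i.}) = 0$ for every i.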
Mean Sum of Squares
• Between MS = Between SS/(k-1)
• Within MS = Within SS/(n-k)
• F = Between MS / Within MS
Here k is the number of groups and n the total number of observations.
It is summarized in the ANOVA table (Example 1).
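As a check on these formulas, the quantities can be computed by hand in base R for the normalized data of Example 1 (5 arrays, stages I, II, III). The variable names are illustrative; the numbers are those of the Example 1 data table.

```r
# Normalized data from Example 1: one value per array and cancer stage.
expr <- c(100, 90, 105, 83, 78,   # stage I
          95, 93, 79, 85, 90,     # stage II
          70, 72, 81, 74, 75)     # stage III
stage <- factor(rep(c("I", "II", "III"), each = 5))

n <- length(expr)      # total number of observations
k <- nlevels(stage)    # number of groups
grand_mean  <- mean(expr)
group_means <- tapply(expr, stage, mean)
group_sizes <- tapply(expr, stage, length)

# Sums of squares as defined on the slide.
between_ss <- sum(group_sizes * (group_means - grand_mean)^2)
within_ss  <- sum((expr - group_means[as.character(stage)])^2)
total_ss   <- sum((expr - grand_mean)^2)

# Mean squares, F statistic and its upper-tail p-value.
between_ms <- between_ss / (k - 1)
within_ms  <- within_ss / (n - k)
f_stat     <- between_ms / within_ms
p_value    <- pf(f_stat, k - 1, n - k, lower.tail = FALSE)

print(round(c(between_ss = between_ss, within_ss = within_ss,
              total_ss = total_ss, F = f_stat, p = p_value), 3))
```

The results agree with the ANOVA table of Example 1: sums of squares 810.133, 747.200 and 1557.333, F = 6.505, significance .012.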
Multiple Factor ANOVA
$$ y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} $$
The model can be extended by adding
• more factors ($\alpha$, $\beta$, …)
• interactions between them ($\alpha\beta$, …)
• other …
This is used to model the different sources of variation appearing in microarray experiments.
Experiment 1: Latin Square
[Figure: design diagram labelled liver, muscle / muscle, liver]
Random effects models
If the k factor levels can be considered a random sample from a population of factor levels, we have a random effect.
ANOVA model: $Y_{ij} = \mu + A_i + e_{ij}$, where $\mu$ is the overall mean, $A_i$ is a random variable instead of a constant, and $e_{ij}$ is the experimental error.
• $E(A_i) = 0$, $E(e_{ij}) = 0$
• $\mathrm{var}(A_i) = \sigma_A^2$, $\mathrm{var}(e_{ij}) = \sigma^2$
• $A_i$ and $e_{ij}$ independent
• $\mathrm{var}(Y_{ij}) = \sigma_A^2 + \sigma^2$
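The variance formula follows in one step from the independence of $A_i$ and $e_{ij}$. The covariance line below additionally assumes that the errors $e_{ij}$ are mutually independent, which the slide does not state explicitly.

```latex
\begin{aligned}
\mathrm{var}(Y_{ij})
  &= \mathrm{var}(A_i) + \mathrm{var}(e_{ij}) + 2\,\mathrm{cov}(A_i, e_{ij})
   = \sigma_A^2 + \sigma^2, \\
\mathrm{cov}(Y_{ij}, Y_{ij'})
  &= \mathrm{var}(A_i) = \sigma_A^2 \qquad (j \neq j').
\end{aligned}
```

Two measurements from the same level share the random effect $A_i$, so they are correlated with correlation $\sigma_A^2 / (\sigma_A^2 + \sigma^2)$, whereas in the fixed effects model they would be independent.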
Where to find more…
• Draghici, S. (2003). ANOVA chapter (7), Data analysis tools for microarrays. Wiley.
• Pavlidis, P. (2003). Using ANOVA for gene selection from microarray studies of the nervous system. http://microarray.cpmc.columbia.edu/pavlidis/doc/reprints/anova-methods.pdf
ANOVA Models for Microarray Data
Kerr & Churchill’s model
• $y_{ijkg}$: expression measurement from the ith array, jth dye, kth variety, and gth gene.
• $\mu$: average expression over all spots.
• $A_i$: effect of the ith array.
• $D_j$: effect of the jth dye.
• $V_k$: effect of the kth variety (= treatment, sample, …).
• $G_g$: effect of the gth gene.
• $(AG)_{ig}$: effect of the ith array and gth gene.
• $(VG)_{kg}$: effect of the kth variety and gth gene.
• $\epsilon_{ijkg}$: independent and identically distributed error terms.
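Written out additively from the terms just defined, the model would read as follows. Combining the terms as a plain sum, and taking $y_{ijkg}$ to be on the log-intensity scale, are assumptions made here for illustration.

```latex
y_{ijkg} = \mu + A_i + D_j + V_k + G_g + (AG)_{ig} + (VG)_{kg} + \epsilon_{ijkg}
```

The four main effects and the array-by-gene term absorb technical variation, which leaves $(VG)_{kg}$ to carry the variety-specific expression of gene g.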
Interpreting main effects
• A: differences in fluorescent signal from array to array (e.g. if arrays are probed under inconsistent conditions that increase or reduce hybridization of labeled cDNA).
• D: differences between the two dye fluorescent labels (one dye may consistently be brighter than the other).
• G: differences in fluorescence for equally expressed genes.
• V: differences of expression level between different varieties (samples, tumour types, …).
Interpreting interactions
• DV: arises if, for a particular variety, the labelling is produced in separate runs of the process. Differences between the runs can produce pools of cDNA of varying concentration or quality.
• AG (spot effect): spots for a given gene on the different arrays vary in the amount of cDNA available for hybridization.
• DG: arises if there are differences between the dyes that are gene-specific.
• VG: reflects differences in expression for particular variety and gene combinations that are not explained by the average effects of these varieties and genes. THIS IS THE QUANTITY OF INTEREST!
Normalization
The A, D, V terms effectively normalize the data, thus the normalization process is integrated with the data analysis.
This approach has several benefits (?)
• The normalization is based on a clearly stated set of assumptions.
• It systematically estimates normalization parameters based on all the data.
• The model can be generalized to the situation where genes are spotted multiple times on each array.
Statistically Significant Effects
Array, Dye, Variety & Gene effects
• Goal: to estimate their value.
• Need not assess their significance.
• Sometimes they don’t appear (gene-level model).
Array x Gene, Variety x Gene effects
• May or may not be present.
• Goal: to assess their significance.
  • Mean effect = 0 if fixed
  • Effect variance = 0 if random
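Test statistics: The 3 F’s
Hypothesis testing involves the comparison of two models.
In this setting we consider a null model of no differential expression (all VG = 0) and an alternative model with differential expression among the conditions (some VG are not equal to zero).
F statistics are computed on a gene-by-gene basis based on the residual sums of squares from fitting each of these models.
$$ F_1 = \frac{(rss_0 - rss_1)/(df_0 - df_1)}{rss_1/df_1} $$
$$ F_2 = \frac{(rss_0 - rss_1)/(df_0 - df_1)}{(rss_1/df_1 + \sigma^2_{pool})/2} $$
$$ F_3 = \frac{(rss_0 - rss_1)/(df_0 - df_1)}{\sigma^2_{pool}} $$
Here $rss_0$, $df_0$ and $rss_1$, $df_1$ are the residual sum of squares and residual degrees of freedom of the null and alternative models, and $\sigma^2_{pool}$ is the error variance pooled across genes.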
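The three statistics share one numerator and differ only in the variance estimate used as denominator. A minimal sketch in base R, where the function name `f_statistics`, its argument names and the input values are hypothetical, and the pooled variance is simply supplied by the caller:

```r
# rss0, df0: residual sum of squares and residual df of the null model (all VG = 0)
# rss1, df1: the same quantities for the alternative model
# sigma2_pool: error variance pooled across genes (supplied by the caller here)
f_statistics <- function(rss0, df0, rss1, df1, sigma2_pool) {
  numerator <- (rss0 - rss1) / (df0 - df1)
  gene_var  <- rss1 / df1   # gene-specific error variance estimate
  c(F1 = numerator / gene_var,
    F2 = numerator / ((gene_var + sigma2_pool) / 2),
    F3 = numerator / sigma2_pool)
}

# Hypothetical values for a single gene.
res <- f_statistics(rss0 = 12.0, df0 = 10, rss1 = 4.0, df1 = 8, sigma2_pool = 1.0)
print(res)
```

F1 relies on the gene's own variance, F3 on the pooled variance only, and F2 on the average of the two, so for a gene with an unusually small variance estimate F2 is less extreme than F1.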
Example 1
• A gene which is believed to be related to ovarian cancer is investigated.
• The cancer is sub-classified into 3 categories (stages): I, II, III-IV.
• 15 samples, 5 per stage, are available.
• They are labelled with 3 colors and hybridized on a 4 channel cDNA array (1 channel empty).
(A seemingly more reasonable procedure: double dye-swap reference design)
Example 1. Normalized Data
Array     I    II   III
1       100    95    70
2        90    93    72
3       105    79    81
4        83    85    74
5        78    90    75
Example 1: ANOVA table (1)
ANOVA
EXPRLEVE
Source            Sum of squares   df   Mean square       F   Sig.
Between groups           810.133    2       405.067   6.505   .012
Within groups            747.200   12        62.267
Total                   1557.333   14
[Figure: box plots of EXPRLEVE (axis 60 to 110) by CANCTYP (1, 2, 3); N = 5 per group]
If arrays are homogeneous, the appropriate model is 1 factor ANOVA.
Example (1): Blocking
Source of variation   Sum of squares   df   Mean square            F   Probability   Critical value for F
Rows                            18.4    4           4.6   0.08440367    0.98276859             6.38823394
Columns                        490      1         490     8.99082569    0.04000226             7.70864972
Error                          218      4          54.5
Total                          726.4    9
If arrays are not homogeneous the appropriate model is 2 factor ANOVA (1 new block factor for arrays)
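The sums of squares in the blocking table (18.4, 490 and 218) are reproduced by a two factor ANOVA without replication on the stage II and stage III columns of the Example 1 data, with array as the block. That these two columns are the input is inferred here from the numbers; the slide does not say so. A sketch in base R:

```r
# Stage II and III columns of the Example 1 data, one row per array.
d <- data.frame(
  array = factor(rep(1:5, times = 2)),
  stage = factor(rep(c("II", "III"), each = 5)),
  expr  = c(95, 93, 79, 85, 90,    # stage II
            70, 72, 81, 74, 75)    # stage III
)

# Additive model: array is the blocking factor, stage the factor of interest.
fit <- aov(expr ~ array + stage, data = d)
tab <- summary(fit)[[1]]
print(tab)
```

In the output the `array` row corresponds to "Rows" and the `stage` row to "Columns" in the table. Removing the array-to-array variation from the error term is what distinguishes this analysis from a 1 factor ANOVA on the same values.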
Example 2: CAMDA kidney data
ftp://ftp.camda.duke.edu/CAMDA02_DATASETS/papers/README_normal.html
• 6 mouse kidney samples (suppose 6 different treatments)
• Compared to a common reference in a double reference design
  • Dye swap
  • Replicate arrays
2.1. The ANOVA model
Work only at the gene level: no main effects (A, D, V, G) as defined.
$$ Y_{ijk} = DG_i + AG_j + VG_k + \varepsilon_{ijk} $$
i = 1, 2 (dyes); j = 1, 2 (arrays); k = 1, …, 6 (samples)
Example 3: A 2 factor design Diet X Strain
3.2. Design
3.3. The ANOVA model
$$ Y_{ijklm} = DG_i + AG_j + \mathrm{Strain}_l + \mathrm{Diet}_m + (\mathrm{Strain{:}Diet})_{lm} + VG_k + \varepsilon_{ijklm} $$
i = 1, 2 (dyes); j = 1, 2 (arrays); k = 1, …, 12 (samples); l = 1, …, 3 (strains); m = 1, 2 (diets)
3.4. Sample R code (1)
library(maanova)
data(paigen)
paigen <- createData(rawdata, 2)
# Full fixed-effects model, including the Strain:Diet interaction
model.full.fix <- makeModel(data = paigen,
                            formula = ~DG + AG + SG + Strain + Diet + Strain:Diet)
anova.full.fix <- fitmaanova(paigen, model.full.fix)
# Reduced model without the interaction
model.noint.fix <- makeModel(data = paigen,
                             formula = ~DG + AG + SG + Strain + Diet)
anova.noint.fix <- fitmaanova(paigen, model.noint.fix)
3.4. Sample R code (2)
# permutation tests
# test for interaction effect
test.int.fix <- ftest(paigen, model.full.fix, model.noint.fix, n.perm = 500)
idx.int.fix <- volcano(anova.full.fix, test.int.fix, title = "Int. test")
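The idea behind a permutation test can be illustrated for a single gene in base R. This is only a sketch of the general principle, using the Example 1 data and a one-way layout; it is not the procedure that `ftest` implements, and the helper `f_value` is a hypothetical name.

```r
set.seed(1)  # reproducible shuffles

# Example 1 data reused as a single-gene illustration.
expr  <- c(100, 90, 105, 83, 78, 95, 93, 79, 85, 90, 70, 72, 81, 74, 75)
stage <- factor(rep(c("I", "II", "III"), each = 5))

# F statistic of the one-way ANOVA for a given labelling of the samples.
f_value <- function(y, g) summary(aov(y ~ g))[[1]][["F value"]][1]

f_obs  <- f_value(expr, stage)
n_perm <- 500
# Shuffle the stage labels and recompute F each time.
f_perm <- replicate(n_perm, f_value(expr, sample(stage)))

# Permutation p-value: share of shuffled labellings with F at least as large
# as the observed one (the +1 counts the observed labelling itself).
p_perm <- (sum(f_perm >= f_obs) + 1) / (n_perm + 1)
print(c(F = f_obs, p_perm = p_perm))
```

Because the reference distribution comes from the shuffled data, the p-value does not rely on the normality assumption behind the tabulated F distribution.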