Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.
![Page 1: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/1.jpg)
Differential Expression and Tree-based Modeling
Class web site: http://statwww.epfl.ch/davison/teaching/Microarrays/
Statistics for Microarrays
![Page 2: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/2.jpg)
cDNA gene expression data
Data on G genes for n samples (rows: genes; columns: mRNA samples)
Gene expression level of gene i in mRNA sample j
= (normalized) log(Red intensity / Green intensity)

| Gene | sample1 | sample2 | sample3 | sample4 | sample5 | … |
|---|---|---|---|---|---|---|
| 1 | 0.46 | 0.30 | 0.80 | 1.51 | 0.90 | ... |
| 2 | -0.10 | 0.49 | 0.24 | 0.06 | 0.46 | ... |
| 3 | 0.15 | 0.74 | 0.04 | 0.10 | 0.20 | ... |
| 4 | -0.45 | -1.03 | -0.79 | -0.56 | -0.32 | ... |
| 5 | -0.06 | 1.06 | 1.35 | 1.09 | -1.09 | ... |
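As a minimal sketch of how one entry of such a table arises, the snippet below computes the log ratio for a single spot. The intensity values are hypothetical, base 2 is assumed (as in the later M = log2(R/G) slide), and the normalization step mentioned above is not shown.

```python
import math

def log_ratio(red, green):
    """Return M = log2(red / green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical background-corrected intensities for one gene on three slides.
red = [1200.0, 800.0, 400.0]
green = [600.0, 800.0, 1600.0]

# One row of the genes-by-samples matrix (before normalization).
m_values = [log_ratio(r, g) for r, g in zip(red, green)]
print(m_values)
```

A positive value means the red channel is brighter, a negative value means the green channel is brighter, and 0 means equal intensity.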
![Page 3: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/3.jpg)
Identifying Differentially Expressed Genes
• Goal: Identify genes associated with covariate or response of interest
• Examples:
– Qualitative covariates or factors: treatment, cell type, tumor class
– Quantitative covariate: dose, time
– Responses: survival, cholesterol level
– Any combination of these!
![Page 4: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/4.jpg)
Differentially Expressed Genes
• Simultaneously test m null hypotheses, one for each gene j :
Hj: no association between expression level of gene j and covariate/response
• Combine expression data from different slides and estimate effects of interest
• Compute test statistic Tj for each gene j
• Adjust for multiple hypothesis testing
![Page 5: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/5.jpg)
Test statistics
• Qualitative covariates: e.g. two-sample t-statistic, Mann-Whitney statistic, F-statistic
• Quantitative covariates: e.g. standardized regression coefficient
• Survival response: e.g. score statistic for Cox model
![Page 6: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/6.jpg)
QQ-Plot
Used to assess whether a sample follows a particular (e.g. normal) distribution (or to compare two samples)
A method for looking for outliers when data are mostly normal
Recall that for the normal distribution, approximately:
• 68% within 1 SD of the mean
• 95% within 2 SDs
• 99.7% within 3 SDs
[Figure: Q-Q plot with axes "Theoretical" and "Sample". Annotations: sample quantile is 0.125; value from the normal distribution which yields a quantile of 0.125 (= -1.15).]
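The pairing of sample and theoretical quantiles can be sketched as follows. This is an illustration under one assumption: the plotting positions (i + 0.5)/n are a common convention, and the slides do not say which convention was used. With n = 4, the smallest observation gets position 0.125, which reproduces the normal quantile of about -1.15 quoted above.

```python
from statistics import NormalDist

def normal_qq_points(sample):
    """Return (theoretical, observed) pairs for a normal Q-Q plot.

    Plotting positions (i + 0.5) / n are an assumed convention.
    """
    n = len(sample)
    std_normal = NormalDist()  # mean 0, SD 1
    ordered = sorted(sample)
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Standard normal value below which 12.5% of the distribution lies.
q = NormalDist().inv_cdf(0.125)

# Hypothetical sample of four values.
points = normal_qq_points([0.46, 0.30, 0.80, 1.51])
for theo, obs in points:
    print(f"{theo:6.2f}  {obs:5.2f}")
```

If the sample is roughly normal, the printed pairs fall close to a straight line when plotted.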
![Page 7: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/7.jpg)
Typical Deviations from Straight Line Patterns
• Outliers
• Curvature at both ends (long or short tails)
• Convex/concave curvature (asymmetry)
• Horizontal segments, plateaus, gaps
![Page 8: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/8.jpg)
Outliers
![Page 9: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/9.jpg)
Long Tails
![Page 10: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/10.jpg)
Short Tails
![Page 11: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/11.jpg)
Asymmetry
![Page 12: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/12.jpg)
Plateaus/Gaps
![Page 13: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/13.jpg)
Example: Apo AI experiment (Callow et al., Genome Research, 2000)
GOAL: Identify genes with altered expression in the livers of one line of mice with very low HDL cholesterol levels compared to inbred control mice
Experiment:
• Apo AI knock-out mouse model
• 8 knockout (ko) mice and 8 control (ctl) mice (C57Bl/6)
• 16 hybridisations: mRNA from each of the 16 mice is labelled with Cy5, pooled mRNA from control mice is labelled with Cy3
Probes: ~6,000 cDNAs, including 200 related to lipid metabolism
![Page 14: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/14.jpg)
Which genes have changed?
This method can be used with replicated data:
1. For each gene and each hybridisation (8 ko + 8 ctl) use M = log2(R/G)
2. For each gene form the t-statistic:
$$t = \frac{\bar{M}_{ko} - \bar{M}_{ctl}}{\sqrt{\tfrac{1}{8}\,SD_{ko}^2 + \tfrac{1}{8}\,SD_{ctl}^2}}$$
where $\bar{M}_{ko}$, $SD_{ko}$ are the average and SD of the 8 ko Ms, and $\bar{M}_{ctl}$, $SD_{ctl}$ those of the 8 ctl Ms
3. Form a histogram of 6,000 t values
4. Make a normal Q-Q plot; look for values “off the line”
5. Adjust for multiple testing
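Step 2 can be sketched for a single gene as below. The M values are hypothetical, and the function name is illustrative rather than taken from any package; it is the unpooled-variance form of the two-sample t-statistic given in the slide.

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(ko, ctl):
    """Two-sample t-statistic with separate (unpooled) sample variances."""
    num = mean(ko) - mean(ctl)
    den = sqrt(variance(ko) / len(ko) + variance(ctl) / len(ctl))
    return num / den

# Hypothetical M = log2(R/G) values for one gene: 8 knockout and 8 control mice.
ko = [-3.1, -3.4, -2.9, -3.3, -3.0, -3.5, -3.2, -3.2]
ctl = [0.1, -0.2, 0.0, 0.2, -0.1, 0.1, -0.1, 0.0]

t = two_sample_t(ko, ctl)
print(round(t, 1))
```

Applying the function to every gene gives the roughly 6,000 t values used in steps 3 and 4.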
![Page 15: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/15.jpg)
Histogram & Q-Q plot
ApoA1
![Page 16: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/16.jpg)
Plots of t-statistics
![Page 17: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/17.jpg)
Assigning p-values to measures of change
• Estimate p-values for each comparison (gene) by using the permutation distribution of the t-statistics.
• For each of the $\binom{16}{8} = 12{,}870$ possible permutations of the trt / ctl labels, compute the two-sample t-statistic t* for each gene.
• The unadjusted p-value for a particular gene is estimated by the proportion of t*’s greater than the observed t in absolute value.
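A minimal sketch of the permutation p-value for one gene follows. It enumerates every way of relabelling the samples; for the 8 + 8 design that is C(16, 8) = 12,870 relabellings. To keep the example small it uses hypothetical data with 4 + 4 samples (70 relabellings). One deliberate choice that differs from the wording above: ties are counted (|t*| ≥ |t|), so the observed labelling itself contributes to the proportion.

```python
from itertools import combinations
from math import comb, sqrt
from statistics import mean, variance

def t_stat(a, b):
    """Two-sample t-statistic with unpooled variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def permutation_p_value(ko, ctl):
    """Proportion of relabellings with |t*| >= |t observed| (ties counted)."""
    pooled = list(ko) + list(ctl)
    observed = abs(t_stat(ko, ctl))
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, len(ko)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        # Small tolerance guards against floating-point noise in ties.
        if abs(t_stat(a, b)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total

print(comb(16, 8))  # number of relabellings in the 8 + 8 design

# Hypothetical M values for one gene, 4 ko and 4 ctl, fully separated groups.
ko = [-3.1, -2.8, -3.4, -3.0]
ctl = [0.1, -0.2, 0.3, 0.0]
p = permutation_p_value(ko, ctl)
print(p)
```

With fully separated groups only the observed labelling and its mirror image reach the observed |t|, so the smallest attainable p-value here is 2/70.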
![Page 18: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/18.jpg)
Multiple Testing
| | # not rejected | # rejected | totals |
|---|---|---|---|
| # true H | U | V (False +) | m0 |
| # false H | T (False -) | S | m1 |
| totals | m - R | R | m |

Error rates:
• Per-comparison = E(V)/m
• Family-wise = P(V ≥ 1)
• Per-family = E(V)
• False discovery rate = E(V/R)
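The cells of this table can be tallied in a simulation where the truth is known. The sketch below is illustrative only: the truth and rejection vectors are hypothetical, and it reports the realized proportion V/R for one data set, whereas the false discovery rate is the expectation of that quantity. Setting V/R to 0 when R = 0 is the usual convention and is assumed here.

```python
def error_counts(is_true_null, rejected):
    """Tally U, V, T, S from known truth and test decisions."""
    U = V = T = S = 0
    for null, rej in zip(is_true_null, rejected):
        if null and not rej:
            U += 1  # true null, not rejected
        elif null and rej:
            V += 1  # true null, rejected: false positive
        elif not null and not rej:
            T += 1  # false null, not rejected: false negative
        else:
            S += 1  # false null, rejected
    R = V + S
    fdp = V / R if R > 0 else 0.0  # realized false discovery proportion
    return {"U": U, "V": V, "T": T, "S": S, "R": R,
            "m0": U + V, "m1": T + S, "fdp": fdp}

# Hypothetical truth and decisions for m = 6 genes.
is_true_null = [True, True, True, False, False, True]
rejected = [False, True, False, True, True, False]
counts = error_counts(is_true_null, rejected)
print(counts)
```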
![Page 19: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/19.jpg)
Apo AI: Adjusted and unadjusted p-values for the 50 genes with the largest absolute t-statistics
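The slides do not state which adjustment procedure produced the adjusted p-values. Purely as an illustration of what "adjusted" means, the sketch below implements two standard procedures that control the family-wise error rate, Bonferroni and Holm; the input p-values are hypothetical and this should not be read as the method used for the Apo AI data.

```python
def bonferroni(pvals):
    """Bonferroni adjusted p-values: min(1, m * p)."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]

def holm(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        value = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, value)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted

# Hypothetical unadjusted p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(pvals)
holm_adj = holm(pvals)
print(bonf)
print(holm_adj)
```

A gene is declared differentially expressed at level 0.05 when its adjusted p-value is at most 0.05; Holm never gives a larger adjusted value than Bonferroni.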
![Page 20: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/20.jpg)
Genes with adjusted p-value ≤ 0.01

| Gene | Adjusted p-value | t | Num | Den |
|---|---|---|---|---|
| Apo A1 | 0 | -22.9 | -3.2 | 0.14 |
| Sterol C5 desaturase | 0 | -13.1 | -1.1 | 0.08 |
| Apo A1 | 0 | -12.2 | -1.9 | 0.16 |
| Apo CIII | 0 | -11.9 | -1.0 | 0.09 |
| Apo A1 | 0 | -11.4 | -3.1 | 0.2 |
| EST | 0 | -9.1 | -1.0 | 0.11 |
| Apo CIII | 0 | -8.4 | -1.0 | 0.12 |
| Sterol C5 desaturase | 0 | -7.7 | -1.0 | 0.13 |
![Page 21: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/21.jpg)
Single-slide methods
• Model-dependent rules for deciding whether (R,G) corresponds to a differentially expressed gene
• Amounts to drawing two curves in the (R,G)-plane; call a gene differentially expressed if it falls outside the region between the two curves
• At this time, not enough known about the systematic and random variation within a microarray experiment to justify these strong modeling assumptions
• n = 1 slide may not be enough (!)
![Page 22: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/22.jpg)
Single-slide methods
• Chen et al: Each (R,G) is assumed to be normally and independently distributed with constant CV; decision based on R/G only (purple)
• Newton et al: Gamma-Gamma-Bernoulli hierarchical model for each (R,G) (yellow)
• Roberts et al: Each (R,G) is assumed to be normally and independently distributed with variance depending linearly on the mean
• Sapir & Churchill: Each log R/G assumed to be distributed according to a mixture of normal and uniform distributions; decision based on R/G only (turquoise)
![Page 23: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/23.jpg)
Matt Callow’s Srb1 dataset (#8). Newton’s, Sapir & Churchill’s and Chen’s single slide method
Difficulty in assigning valid p-values based on a single slide
![Page 24: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/24.jpg)
Another example: Survival analysis with expression data
• Bittner et al. looked at differences in survival between the two groups (the ‘cluster’ and the ‘unclustered’ samples)
• ‘Cluster’ seemed to have longer survival
![Page 25: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/25.jpg)
Kaplan-Meier Survival Curves, Bittner et al.
![Page 26: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/26.jpg)
unclustered
cluster
Average Linkage Hierarchical Clustering, survival only
![Page 27: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/27.jpg)
Kaplan-Meier Survival Curves, reduced grouping
![Page 28: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/28.jpg)
Identification of genes associated with survival
For each gene j, j = 1, …, 3613, model the instantaneous failure rate, or hazard function, h(t) with the Cox proportional hazards model:
$$h(t) = h_0(t)\,\exp(\beta_j x_{ij})$$
and look for genes with both:
• large effect size $\hat{\beta}_j$
• large standardized effect size $\hat{\beta}_j / SE(\hat{\beta}_j)$
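A single-gene Cox fit can be sketched with Newton-Raphson on the partial likelihood. This is a bare-bones illustration, not the software used for the analysis: it assumes one covariate, no tied event times, and hypothetical data; in practice a survival package would be used.

```python
from math import exp, sqrt

def score_and_information(times, events, x, beta):
    """Score and observed information of the Cox partial likelihood (no ties)."""
    score = info = 0.0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects contribute only through risk sets
        risk = [x[k] for k in range(n) if times[k] >= times[i]]
        w = [exp(beta * xk) for xk in risk]
        s0 = sum(w)
        s1 = sum(wk * xk for wk, xk in zip(w, risk))
        s2 = sum(wk * xk * xk for wk, xk in zip(w, risk))
        mean_x = s1 / s0
        score += x[i] - mean_x
        info += s2 / s0 - mean_x ** 2
    return score, info

def cox_fit_single(times, events, x, n_iter=25):
    """Return (beta_hat, SE) for a one-covariate Cox model."""
    beta = 0.0
    for _ in range(n_iter):
        score, info = score_and_information(times, events, x, beta)
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    _, info = score_and_information(times, events, x, beta)
    return beta, 1.0 / sqrt(info)

# Hypothetical data: survival times, event indicators, expression of one gene.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
x = [2.0, 0.5, 1.5, 1.0, -0.5, 0.0]

beta_hat, se = cox_fit_single(times, events, x)
u0, i0 = score_and_information(times, events, x, 0.0)
print(beta_hat, beta_hat / se)  # effect size and standardized effect size
print(u0 ** 2 / i0)             # score statistic for beta = 0
```

Repeating the fit for each gene gives the pairs of effect size and standardized effect size used for ranking.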
![Page 29: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/29.jpg)
![Page 30: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/30.jpg)
Findings
• Top 5 genes by this method not in Bittner et al. ‘weighted gene list’ - Why?
• weighted gene list based on entire sample; our method only used half
• weighting relies on Bittner et al. cluster assignment
• other possibilities?
![Page 31: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/31.jpg)
Statistical Significance of Cox Model Coefficients
![Page 32: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/32.jpg)
Limitations of Single Gene Tests
• May be too noisy in general to show much
• Do not reveal coordinated effects of positively correlated genes
• Hard to relate to pathways
![Page 33: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/33.jpg)
Some ideas for further work
• Expand models to include more genes and possibly two-way interactions
• Nonparametric tree-based subset selection – would require much larger sample sizes
![Page 34: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/34.jpg)
(BREAK)
![Page 35: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/35.jpg)
Trees
• Provide means to express knowledge
• Can aid in decision making
• Can be portrayed graphically or by means of a chart or ‘key’, e.g. (MASS space shuttle):

| stability | error | sign | wind | magnitude | visibility | DECISION |
|---|---|---|---|---|---|---|
| any | any | any | any | any | no | auto |
| xstab | any | any | any | any | yes | noauto |
| stab | LX | any | any | any | yes | noauto |
| stab | XL | any | any | any | yes | noauto |
| stab | MM | nn | tail | any | yes | noauto |
| any | any | any | any | Out of range | yes | noauto |

Etc…
![Page 36: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/36.jpg)
Tree-based Methods – References
• Hastie, Tibshirani, Friedman, 2001 – The Elements of Statistical Learning
• Venables and Ripley, 1999 – Modern Applied Statistics with S-Plus (MASS)
• Ripley, 1996 – Pattern Recognition and Neural Networks
• Breiman, Olshen, Friedman, Stone, 1984 – Classification and Regression Trees
![Page 37: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/37.jpg)
Tree-based Methods
• Automatic construction of decision trees dates from social science work in the early 1960’s (AID)
• Breiman et al. (1984) proposed new algorithms for tree construction (CART)
• Tree construction can be seen as a type of variable selection
![Page 38: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/38.jpg)
Response types
• Categorical outcome – Classification tree
• Continuous outcome – Regression tree
• Survival outcome – Survival tree
• Software – Available R packages include tree, rpart (tssa available in S)
![Page 39: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/39.jpg)
Trees Partition the Feature Space
• End point of tree is a (labeled) partition of the (feature) space of possible observations X
• Tree-based methods partition X into rectangular regions; try to make the (average) responses in each box as different as possible
• In logical problems it is assumed that there does exist a partition of X that will correctly classify all observations; task is to find a tree to succinctly describe this partition
![Page 40: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/40.jpg)
Partitions and CART
[Figure: partitions of the (X1, X2) feature space at split points t1–t4 into regions R1–R5; panels labelled "Yes" and "No".]
![Page 41: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/41.jpg)
Partitions and CART
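[Figure: partition of the (X1, X2) feature space at split points t1–t4 into regions R1–R5, shown with the corresponding binary tree: splits on X1 at t1, X2 at t2, X1 at t3 and X2 at t4, with leaves R1–R5.]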
![Page 42: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/42.jpg)
Tree Comparison
• Measure how well the partition created by a tree corresponds to the correct decision rule (classification)
• For a logical problem, count the number of errors
• For a statistical problem, the class distributions usually overlap, so no partition unambiguously describes the classes; estimate the misclassification probability instead
![Page 43: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/43.jpg)
Three Aspects of Tree Construction
• Split Selection Rule
• Split-stopping Rule
• Assignment of predicted values
![Page 44: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/44.jpg)
Split Selection
• Binary splits
• Look only one step ahead – avoids massive computational time by not attempting to optimize whole tree performance
• Choose an impurity measure to optimize at each split: for a classification tree, the Gini index or entropy (rather than the misclassification rate); for a regression tree, the deviance (squared error)
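The one-step-ahead search can be sketched for a single numeric predictor and a categorical response. This is a minimal illustration using the Gini index; candidate thresholds at midpoints between consecutive distinct values are a common convention assumed here, and the data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(x, y):
    """Return (threshold, weighted impurity) of the best split x <= threshold."""
    n = len(x)
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2  # midpoint between consecutive distinct values
        left = [yi for xi, yi in zip(x, y) if xi <= thr]
        right = [yi for xi, yi in zip(x, y) if xi > thr]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or impurity < best[1]:
            best = (thr, impurity)
    return best

# Hypothetical expression values for one gene and the class of each sample.
x = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5]
y = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(x, y)
print(threshold, impurity)
```

A full tree builder would run this search over every predictor at every node and recurse on the two resulting subsets.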
![Page 45: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/45.jpg)
Split-stopping
• Issue: A very large tree will tend to overfit the data (and therefore lack generalizability), while too small a tree might not capture important structure
• Usual solution: grow large/maximal tree (stop splitting only when some minimum node size, 5 or 10 say, is reached), followed by (cost-complexity) pruning
![Page 46: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/46.jpg)
Pruning
• Sequence of rooted subtrees
• Measure $R_i$ (e.g. deviance) at leaves, $R = \sum_i R_i$
• Minimize the cost-complexity measure
$$R_\alpha = R + \alpha \cdot \text{size}$$
where $\alpha$ governs the tradeoff between tree size and goodness of fit
• Choose $\alpha$ to minimize cross-validated error (misclassification or deviance)
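The effect of the penalty can be sketched by scoring a sequence of nested subtrees with R + α · size. The (size, R) pairs below are hypothetical, and size is taken to be the number of leaves; choosing α itself by cross-validation is not shown.

```python
def select_subtree(subtrees, alpha):
    """Return the (size, R) pair minimizing the cost-complexity R + alpha * size."""
    return min(subtrees, key=lambda s: s[1] + alpha * s[0])

# Hypothetical pruning sequence: (number of leaves, total deviance R).
subtrees = [(1, 40.0), (2, 25.0), (4, 15.0), (8, 12.0)]

for alpha in (0.0, 1.0, 10.0, 20.0):
    print(alpha, select_subtree(subtrees, alpha))
```

With α = 0 the largest tree wins; as α grows, progressively smaller subtrees are selected.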
![Page 47: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/47.jpg)
Assignment of Predicted Values
• Assign value to each leaf (terminal node)
• In Classification: (weighted) voting among observations in the node
• In Regression: mean of observations in the node
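Both assignment rules are short enough to sketch directly. The function names and the example node contents are illustrative; the optional weights stand in for whatever observation or class weights an analysis might use.

```python
from collections import defaultdict
from statistics import mean

def classification_leaf_value(labels, weights=None):
    """Label with the largest total weight among observations in the node."""
    if weights is None:
        weights = [1.0] * len(labels)
    totals = defaultdict(float)
    for label, w in zip(labels, weights):
        totals[label] += w
    return max(totals, key=totals.get)

def regression_leaf_value(responses):
    """Mean response of the observations in the node."""
    return mean(responses)

# Hypothetical contents of one terminal node.
print(classification_leaf_value(["ko", "ctl", "ko"]))
print(classification_leaf_value(["ko", "ctl", "ko"], weights=[1.0, 3.0, 1.0]))
print(regression_leaf_value([1.0, 2.0, 3.0]))
```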
![Page 48: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/48.jpg)
Other Issues (I)
• Loss matrix
– Procedures can be modified for asymmetric losses
• Missing predictor values
– Can create ‘missing’ category
– Surrogate splits exploit correlation between predictors
• Linear combination splits
![Page 49: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/49.jpg)
Other Issues (II)
• Tree Instability
– Small changes in the data can result in very different series of splits – difficulties in interpretation
– Aggregate trees to reduce instability (e.g. bagging)
• Lack of smoothness
– More of an issue in regression trees
– Multivariate Adaptive Regression Splines (MARS)
• Difficulty in capturing additive structure with binary trees
![Page 50: Differential Expression and Tree-based Modeling Class web site: Statistics for Microarrays.](https://reader030.fdocuments.net/reader030/viewer/2022032704/56649d4b5503460f94a27f4b/html5/thumbnails/50.jpg)
Acknowledgements
• Sandrine Dudoit
• Jane Fridlyand
• Yee Hwa (Jean) Yang
• Debashis Ghosh
• Erin Conlon
• Ingrid Lonnstedt
• Terry Speed