Micro Array Literature
1
High Throughput Target Identification
Stan Young, NISS
Doug Hawkins, U Minnesota
Christophe Lambert, Golden Helix
Machine Learning, Statistics, and Discovery
25 June 03
2
Micro Array Literature
Publication Year   All Journals   PNAS
1992               0              0
1993               0              0
1994               0              0
1995               4              0
1996               3              1
1997               8              2
1998               37             1
1999               134            8
2000               409            34
2001               773            46
3
Guilt by Association :
You are known
by the company you keep.
4
Data Matrix
Goal: Associations over the genes.
Guilty Gene
Genes
Tissues
5
Goals
1. Associations.
2. Deep associations – beyond 1st level correlations.
3. Uncover multiple mechanisms.
6
Problems
1. n << p
2. Strong correlations.
3. Missing values.
4. Non-normal distributions.
5. Outliers.
6. Multiple testing.
7
Technical Approach
1. Recursive partitioning.
2. Resampling-based, adjusted p-values.
3. Multiple trees.
8
Recursive Partitioning
Tasks
1. Create classes.
2. How to split.
3. How to stop.
9
Differences:
Recursive Partitioning
• Top-down analysis.
• Can use any type of descriptor.
• Uses biological activities to determine which features matter.
• Produces a classification tree for interpretation and prediction.
• Big N is not a problem!
• Missing values are ok.
• Multiple trees, big p is ok.
Clustering
• Often bottom-up.
• Uses “gestalt” matching.
• Requires an external method for determining the right feature set.
• Difficult to interpret or use for prediction.
• Big N is a severe problem!!
10
Forming Classes, Categories, Groups
Profession          Av. Income
Baseball Players    1.5M
Football Players    1.2M
Doctors             .8M
Dentists            .5M
Lawyers             .23M
Professors          .09M
. . . . .
11
Forming Classes from “Continuous” Descriptor
[Number line with ticks at -3, -2, -1, 0, 1, 2, 3, 4, 5, 6]
How many “cuts” and where to make them?
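One way to read the question above is as a search over candidate cut points. The sketch below is an illustration under stated assumptions, not the authors' implementation: it tries a single cut at every midpoint between adjacent sorted descriptor values and keeps the one that maximises the absolute pooled two-sample t statistic on the response. The names `pooled_t` and `best_cut`, the minimum group size, and the toy data are all hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t statistic using a pooled variance estimate."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    diff = mean(a) - mean(b)
    if pooled_var == 0:
        # Both groups are constant: infinite separation unless the means match.
        return math.copysign(math.inf, diff) if diff != 0 else 0.0
    return diff / math.sqrt(pooled_var * (1 / na + 1 / nb))

def best_cut(x, y, min_size=2):
    """Return (cut, |t|) for the single cut on x that best separates y."""
    pairs = sorted(zip(x, y))
    best = (None, 0.0)
    for i in range(min_size, len(pairs) - min_size + 1):
        lo_x, hi_x = pairs[i - 1][0], pairs[i][0]
        if lo_x == hi_x:
            continue  # cannot cut between tied descriptor values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        t = abs(pooled_t(left, right))
        if t > best[1]:
            best = ((lo_x + hi_x) / 2, t)
    return best

# Toy data (made up): the response jumps once the descriptor exceeds 2.
x = [-3, -2, -1, 0, 1, 2, 3, 4, 5, 6]
y = [0.1, 0.3, 0.2, 0.4, 0.2, 0.3, 2.5, 2.7, 2.4, 2.6]
cut, t = best_cut(x, y)
print(cut, round(t, 1))
```

Allowing more than one cut multiplies the number of candidate segmentations, which is one reason the p-value for a split needs adjusting for class formation.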
12
Splitting : t-test
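Parent node: n = 1650, ave = 0.34, sd = 0.81
Child node 1: n = 1614, ave = 0.29, sd = 0.73
Child node 2: n = 36, ave = 2.60, sd = 0.9
$$t = \frac{\text{Signal}}{\text{Noise}} = \frac{2.60 - 0.29}{0.734\sqrt{\frac{1}{36} + \frac{1}{1614}}} = 18.68$$
TT: NN-CC
rP = 2.03E-70
aP = 1.30E-66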
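The arithmetic on this slide can be checked from the node summaries alone. The sketch below assumes the noise term 0.734 is the pooled standard deviation of the two child nodes, which the slide does not state. The helper names are hypothetical.

```python
import math

def pooled_sd(n1, sd1, n2, sd2):
    """Pooled standard deviation of two groups from their summary statistics."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def t_from_summary(n1, m1, sd1, n2, m2, sd2):
    """Pooled two-sample t statistic: signal (mean difference) over noise."""
    s = pooled_sd(n1, sd1, n2, sd2)
    return (m1 - m2) / (s * math.sqrt(1 / n1 + 1 / n2))

# Child-node summaries as given on the slide.
s = pooled_sd(36, 0.9, 1614, 0.73)
t = t_from_summary(36, 2.60, 0.9, 1614, 0.29, 0.73)
print(round(s, 3), round(t, 2))
```

Under that assumption the pooled standard deviation rounds to 0.734 and the statistic agrees with the slide's 18.68 to two decimals.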
13
Splitting : F-test
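Parent node: n = 1650, ave = 0.34, sd = 0.81
Child node 1: n = 1553, ave = 0.21, sd = 0.73
Child node 2: n = 36, ave = 2.60, sd = 0.9
Child node 3: n = 61, ave = 1.29, sd = 0.83
$$F = \frac{\text{Signal}}{\text{Noise}} = \frac{\text{Among Var}}{\text{Within Var}} = \frac{\sum_{i,j} (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2 / df_1}{\sum_{i,j} (X_{ij} - \bar{X}_{i\cdot})^2 / df_2}$$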
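For a split into more than two classes, the F statistic can be computed from per-node counts, means, and standard deviations. The sketch below assumes a standard one-way ANOVA with df1 = k - 1 and df2 = N - k for k child nodes and N observations. The function name is hypothetical, and because the slide's summaries are rounded, the result is only approximate.

```python
def f_from_summary(groups):
    """One-way ANOVA F from a list of (n, mean, sd) tuples.

    Returns (F, df1, df2), where df1 = k - 1 and df2 = N - k.
    """
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    # Among-group sum of squares: each group mean counted once per observation.
    among_ss = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    # Within-group sum of squares recovered from the sample standard deviations.
    within_ss = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df1 = len(groups) - 1
    df2 = total_n - len(groups)
    return (among_ss / df1) / (within_ss / df2), df1, df2

# Child-node summaries as given on the slide.
children = [(1553, 0.21, 0.73), (36, 2.60, 0.9), (61, 1.29, 0.83)]
F, df1, df2 = f_from_summary(children)
print(df1, df2, F)
```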
14
How to Stop
Examine each current terminal node.
Stop if no variable/class has a
significant split, multiplicity adjusted.
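The stopping rule can be sketched as a check on one terminal node. In this illustration a plain Bonferroni correction over the candidate splitters stands in for the resampling-based adjustment the talk uses. The function name, the 0.05 threshold, the p-values, and the gene label G12 are assumptions.

```python
def should_stop(raw_p_values, alpha=0.05):
    """Return True if no candidate split is significant after adjustment.

    raw_p_values maps a candidate splitter name to its raw p-value.
    Bonferroni (p * number of candidates, capped at 1) is a stand-in here.
    """
    m = len(raw_p_values)
    if m == 0:
        return True  # nothing left to split on
    adjusted = [min(1.0, p * m) for p in raw_p_values.values()]
    return min(adjusted) >= alpha

# Made-up p-values for two candidate splitters at two different nodes.
print(should_stop({"G518": 0.03, "G790": 0.20}))   # 0.03 * 2 = 0.06, not significant
print(should_stop({"G518": 1e-6, "G12": 0.50}))    # 2e-6 is significant
```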
15
Levels of Multiple Testing
1. Raw p-value.
2. Adjust for class formation, segmentation.
3. Adjust for multiple predictors.
4. Adjust for multiple splits in the tree.
5. Adjust for multiple trees.
16
Understanding observations
NB: Splitting variables govern the process, linked to response variable.
Multiple Mechanisms
Conditionally important descriptors.
17
Multiple Mechanisms
18
Reality: Example Data
60 Tissues
1453 Genes
Gene 510 is the “guilty” gene, the Y.
19
1st Split of Gene 510 (Guilty Gene)
20
Split Selection
14 splitters with adjusted p-value < 0.05
21
Histogram
Non-normal, hence
resampling p-values
make sense.
22
Resampling-based Adjusted p-value
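The slide does not spell out the procedure. One common resampling scheme is the single-step max-T permutation adjustment, sketched below as an assumption about what such an adjustment could look like. The response is shuffled many times, the largest |t| over all candidate splits is recorded for each shuffle, and each split's adjusted p-value is the share of shuffles whose maximum reaches its observed |t|. The function names, toy data, and split labels A and B are hypothetical.

```python
import math
import random
from statistics import mean, variance

def abs_t(y, in_left):
    """Absolute pooled t statistic for the two groups defined by a boolean mask."""
    left = [v for v, g in zip(y, in_left) if g]
    right = [v for v, g in zip(y, in_left) if not g]
    nl, nr = len(left), len(right)
    pooled = ((nl - 1) * variance(left) + (nr - 1) * variance(right)) / (nl + nr - 2)
    diff = abs(mean(left) - mean(right))
    if pooled == 0:
        return math.inf if diff > 0 else 0.0
    return diff / math.sqrt(pooled * (1 / nl + 1 / nr))

def maxt_adjusted_p(y, splits, n_perm=1000, seed=0):
    """Single-step max-T adjusted p-values for a dict of candidate splits."""
    rng = random.Random(seed)
    observed = {name: abs_t(y, mask) for name, mask in splits.items()}
    exceed = {name: 0 for name in splits}
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any link between response and splits
        max_t = max(abs_t(y_perm, mask) for mask in splits.values())
        for name, t_obs in observed.items():
            if max_t >= t_obs:
                exceed[name] += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return {name: (exceed[name] + 1) / (n_perm + 1) for name in splits}

# Toy data (made up): split A separates low from high responses, split B does not.
y = [0.1, 0.2, 0.0, 0.3, 0.1, 0.2, 2.1, 2.3, 2.0, 2.2, 2.4, 2.1]
splits = {"A": [True] * 6 + [False] * 6, "B": [True, False] * 6}
adjusted = maxt_adjusted_p(y, splits)
print(adjusted)
```

Because the reference distribution comes from the data themselves, no normality assumption is needed, and correlation among candidate splits is reflected in the maximum.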
23
Single Tree RP Drawbacks
• Data greedy.
• Only one view of the data. May miss other mechanisms.
• Highly correlated variables may be obscured.
• Higher order interactions may be masked.
• No formal mechanisms for follow-up experimental design.
• Disposition of outliers is difficult.
24
Etc.
Multiple Trees, how and why?
25
How do you get multiple trees?
1. Bootstrap the sample, one tree per sample.
2. Randomize over valid splitters.
Etc.
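The two sources of variation listed above can be sketched in a few lines. This is a minimal illustration, not the software shown in the talk: `bootstrap_indices` resamples tissues with replacement, and `pick_random_valid_splitter` chooses uniformly among splitters whose adjusted p-value is below a threshold. The function names, the adjusted p-values, and the gene label G12 are made up.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

def pick_random_valid_splitter(adjusted_p, rng, alpha=0.05):
    """Choose uniformly among splitters that pass the significance threshold."""
    valid = sorted(name for name, p in adjusted_p.items() if p < alpha)
    return rng.choice(valid) if valid else None

rng = random.Random(42)

# 1. Bootstrap: resample the 60 tissues, then grow one tree on the resample.
sample = bootstrap_indices(60, rng)

# 2. Randomize: different trees may start from different valid splitters.
adjusted_p = {"G518": 0.001, "G790": 0.020, "G12": 0.300}  # made-up values
first_splits = [pick_random_valid_splitter(adjusted_p, rng) for _ in range(1000)]
print(sorted(set(first_splits)))
```

Only splitters that pass the threshold are ever chosen, so each random tree is still built from statistically supported splits.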
26
Random Tree Browsing, 1000 Trees.
27
Example Tree
28
1st Split
29
Example Tree, 2nd Split
30
Conclusion for Gene G510
If G518 < -0.56
and
G790 < -1.46
then
G510 = 1.10 +/- 0.30
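A path through the tree reads directly as a rule. The sketch below encodes the rule above as a function. The function name and the choice to return `None` outside this terminal node are assumptions, since the slide describes only this one node.

```python
def predict_g510(g518, g790):
    """Return (mean, spread) for G510 when the slide's rule applies, else None."""
    if g518 < -0.56 and g790 < -1.46:
        return (1.10, 0.30)
    return None  # this tissue falls in some other terminal node

print(predict_g510(-1.0, -2.0))
print(predict_g510(0.0, -2.0))
```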
31
Using Multiple Trees to Understand variables
• Which variables matter?
• How to rank variables in importance.
• Correlations.
• Synergistic variables.
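One simple way to approach these questions is to tally what the trees use. The sketch below is an assumption about how such summaries could be built, not the method behind the talk's figures. It ranks variables by the number of trees in which they appear as splitters, and counts how often two variables appear in the same tree as a rough proxy for variables that work together. Each tree is represented only by its list of splitting variables, and the gene label G33 and the four toy trees are made up.

```python
from collections import Counter
from itertools import combinations

def rank_splitters(trees):
    """Count, for each variable, the number of trees that split on it."""
    counts = Counter()
    for tree in trees:
        counts.update(sorted(set(tree)))  # count each variable once per tree
    return counts.most_common()

def co_occurrence(trees):
    """Count how often each pair of variables is used in the same tree."""
    pairs = Counter()
    for tree in trees:
        pairs.update(combinations(sorted(set(tree)), 2))
    return pairs

# Toy forest (made up): each inner list holds one tree's splitting variables.
trees = [["G518", "G790"], ["G518", "G33"], ["G518", "G790"], ["G790"]]
ranking = rank_splitters(trees)
pairs = co_occurrence(trees)
print(ranking)
print(pairs[("G518", "G790")])
```

Variables that rarely share a tree yet each rank highly may be correlated substitutes for one another, which is the other pattern a pairwise matrix can expose.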
32
Correlation Interaction Matrix
Red = Syn.
33
Summary
• Reviewed recursive partitioning.
• Demonstrated multiple tree RP’s capabilities:
– Find associated genes
– Group correlated predictors (genes)
– Synergistic predictors (genes that predict together)
• Used to understand a complex data set.
34
Needed research
• Real data sets with known answers.
• Benchmarking.
• Linking to gene annotations.
• Scale (1,000 × 10,000).
• Multiple testing in complex data sets.
• Good visualization methods.
• Outlier detection for large data sets.
• Missing values. (see NISS paper 123)
35
Teams
NC State University: Jacqueline Hughes-Oliver, Katja Rimlinger
U Waterloo: Will Welch, Hugh Chipman, Marcia Wang, Yan Yuan
U. Minnesota: Douglas Hawkins
NISS: Alan Karr (Consider post docs)
GSK: Lei Zhu, Ray Lam
36
References/Contact
1. www.goldenhelix.com.
2. www.recursive-partitioning.com.
3. www.niss.org, papers 122 and 123.
5. GSK patent.
37
Questions