Post on 06-Jan-2016
Statistics O. R. 892
Object Oriented Data Analysis
J. S. Marron
Dept. of Statistics and Operations Research
University of North Carolina

Administrative Info
Details on Course Web Page: http://stor892fall2014.web.unc.edu/
Or: Google: Marron Courses, then Choose This Course

Object Oriented Data Analysis
What is it?
A Sound-Bite Explanation: What is the atom of the statistical analysis?
  1st Course: Numbers
  Multivariate Analysis Course: Vectors
  Functional Data Analysis: Curves
  More generally: Data Objects

Object Oriented Data Analysis
Current Motivation: In Complicated Data Analyses, the Fundamental (Non-Obvious) Question Is:
What Should We Take as Data Objects?
Key to Focussing Needed Analyses

Mortality Time Series
Improved Coloring: Rainbow Representing Year:
  Magenta = 1908
  Red = 2002

Time Series of Curves
Just a Set of Curves
But Time Order is Important!
Useful Approach (as above): Use color to code for time
Color scale: Start to End

T. S. Toy E.g., PCA View
PCA gives Modes of Variation
But there are Many
Intuitively Useful???
Like harmonics?
Isn't there only 1 mode of variation?
Answer comes in scores scatterplots

T. S. Toy E.g., PCA Scatterplot
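
The point above (one underlying mode of variation, yet many PCA components) can be illustrated with a minimal sketch. The toy data here are an assumption, not the example from the slides: a Gaussian bump whose center drifts over time, so a single parameter drives all variation. The names `curves`, `scores`, and `explained` are hypothetical.

```python
import numpy as np

# Hypothetical toy data (not the slides' example): n curves, each a Gaussian
# bump whose center drifts with time. One parameter drives all variation.
n, d = 50, 200
grid = np.linspace(0.0, 1.0, d)
centers = np.linspace(0.2, 0.8, n)            # time-ordered bump locations
curves = np.exp(-0.5 * ((grid[None, :] - centers[:, None]) / 0.05) ** 2)

# PCA via SVD of the mean-centered data matrix (rows = data objects).
centered = curves - curves.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
scores = U * s                                # row i holds the PC scores of curve i
explained = s**2 / np.sum(s**2)               # fraction of variance per component

print("variance explained by first 5 PCs:", np.round(explained[:5], 3))
```

Plotting `scores[:, 0]` against `scores[:, 1]`, colored by time order, gives the scores scatterplot view: the points trace out a curved one-dimensional path, which is why a single nonlinear mode shows up as many linear components.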

Chemo-metric Time Series, Control

Functional Data Analysis
Suggestion Of Clusters
Which Are These?

Functional Data Analysis
Manually Brush Clusters

Functional Data Analysis
Manually Brush Clusters
Clear Alternate Splicing

Limitation of PCA, Toy E.g.

NCI 60: Can we find classes Using PCA view?

PCA Visualization of NCI 60 Data
Maybe need to look at more PCs?
Study array of such PCA projections:

NCI 60: Can we find classes Using PCA 9-12?

PCA Visualization of NCI 60 Data
Can we find classes using PC directions??
Found some, but not others
Nothing after 1st five PCs
Rest seem to be noise driven
Are There Better Directions?
  PCA only feels maximal variation
  Ignores Class Labels
How Can We Use Class Labels?

Visualization of NCI 60 Data
How Can We Use Class Labels?
Approach:
  Find Directions to Best Separate Classes
  In Disjoint Pairs (thus 4 Directions)
  Use DWD: Distance Weighted Discrimination
  Defined (& Motivated) Later
  Project All Data on These 4 Directions

NCI 60: Views using DWD Dirns (focus on biology)

DWD Visualization of NCI 60 Data
Most cancer types clearly distinct (Renal, CNS, Ovar, Leuk, Colon, Melan)
Using these carefully chosen directions
Others less clear cut
  NSCLC (at least 3 subtypes)
  Breast (4 published subtypes)

DWD Visualization
Recall PCA limitations
DWD uses class info
Hence can better separate known classes
Do this for pairs of classes (DWD just on those, ignore others)
Carefully choose pairs in NCI 60 data
Note DWD Directions Not Orthogonal (PCA orthogonality may be too strong a constraint)

NCI 60: Views using DWD Dirns (focus on biology)

PCA Visualization of NCI 60 Data
Can we find classes using PC directions??
Found some, but not others
Not so distinct as in DWD view
Nothing after 1st five PCs
Rest seem to be noise driven
Orthogonality too strong a constraint???
Interesting dirns are nearly orthogonal

Limitation of PCA
Main Point:
May be Important Data Structure Not Visible in 1st Few PCs

Yeast Cell Cycle Data
Another Example Showing Interesting Directions Beyond PCA

Yeast Cell Cycle Data
Gene Expression Microarray data
Data (after major preprocessing): Expression level of:
  thousands of genes (d ~ 1,000s)
  but only dozens of cases (n ~ 10s)
Interesting statistical issue:
High Dimension Low Sample Size data (HDLSS)

Yeast Cell Cycle Data
Data from:
Spellman, et al (1998)
Analysis here is from:
Zhao, Marron & Wells (2004)

Yeast Cell Cycle Data
Lab experiment:
  Chemically synchronize cell cycles of yeast cells
  Do cDNA micro-arrays over time
  Used 18 time points, over about 2 cell cycles
  Studied 4,489 genes (whole genome)
Time series view of data: have 4,489 time series of length 18
Functional Data View: have 18 curves, of dimension 4,489
What are the data objects?

Yeast Cell Cycle Data, FDA View
Central question:
Which genes are periodic over 2 cell cycles?

Yeast Cell Cycle Data, FDA View
Periodic genes?
Naive approach: Simple PCA

Yeast Cell Cycle Data, FDA View
Central question: which genes are periodic over 2 cell cycles?
Naive approach: Simple PCA
No apparent (2 cycle) periodic structure?
Eigenvalues suggest large amount of variation
PCA finds directions of maximal variation
Often, but not always, same as interesting directions
Here need better approach to study periodicities

Yeast Cell Cycle Data, FDA View
Approach
Project on Period 2 Components Only
Calculate via Fourier Representation
To understand, study Fourier Basis

Sin-Cos Phase Shifts are Linear
Powerful Fact: linear combos of sin and cos capture phase, since:
$\sin(x + \phi) = \cos(\phi)\,\sin(x) + \sin(\phi)\,\cos(x)$
Consequence:
Random Phase Shifts Captured in Just 2 PCs
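
A numerical check of the consequence stated above, as a minimal sketch. The choice of n = 30 curves matches the slides; the grid size, seed, and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 100                                   # 30 curves on a grid of 100 points
t = np.linspace(0.0, 2.0 * np.pi, d, endpoint=False)
phases = rng.uniform(0.0, 2.0 * np.pi, size=n)   # one random phase shift per curve
curves = np.sin(t[None, :] + phases[:, None])    # shape (n, d)

# PCA via SVD of the mean-centered curves.
centered = curves - curves.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
explained = s**2 / np.sum(s**2)

# Every curve lies in the span of sin(t) and cos(t), so only two
# components carry variance; the rest are zero up to rounding error.
print(np.round(explained[:4], 6))
```

Because a phase shift is a nonlinear change of the curve but a linear change of its sin and cos coefficients, PCA (a linear method) recovers it exactly in two directions.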

Sin-Cos Phase Shifts are Linear
n = 30 curves
Random Phase Shifts Captured in Just 2 PCs

Fourier Basis
Fourier Basis Facts:
Complete Basis (spans whole space)
  Exactly True for both versions
Basis Elements are Directions
  Will think about as above
Good References:
  Brillinger (2001)
  Bloomfield (2004)

Yeast Cell Cycle Data, FDA View
Approach
Project on Period 2 Components Only
Calculate via Fourier Representation
Project onto Subspace of Even Frequencies
Keeps only 2-period part of data (i.e. same over both cycles)
Then do PCA on projected data

Fourier Basis

Yeast Cell Cycles, Freq. 2 Proj.
PCA on Freq. 2 Periodic Component Of Data

Yeast Cell Cycles, Freq. 2 Proj.
PCA on periodic component of data
Hard to see periodicities in raw data
But very clear in PC1 (~sin) and PC2 (~cos)
PC1 and PC2 explain 65% of variation (see residuals)
Recall linear combos of sin and cos capture phase, since:
$\sin(x + \phi) = \cos(\phi)\,\sin(x) + \sin(\phi)\,\cos(x)$
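
The approach above (project onto the subspace of even frequencies, then do PCA) can be sketched as follows. This is an illustration on simulated data, not the analysis from Zhao, Marron & Wells: the helper `project_even_frequencies`, the noise level, and the simulated genes are all assumptions. With 18 time points spanning 2 cycles, a series that is the same over both cycles has energy only at even discrete Fourier frequencies.

```python
import numpy as np

def project_even_frequencies(X):
    """Keep only even Fourier frequencies along the last (time) axis."""
    n_time = X.shape[-1]
    coef = np.fft.rfft(X, axis=-1)
    coef[..., 1::2] = 0.0                 # zero out odd frequencies 1, 3, 5, ...
    return np.fft.irfft(coef, n=n_time, axis=-1)

# Simulated stand-in for the expression data: rows = genes, columns = 18 times.
rng = np.random.default_rng(1)
n_genes, n_time = 200, 18
t = np.arange(n_time)
phase = rng.uniform(0.0, 2.0 * np.pi, size=(n_genes, 1))
signal = np.sin(2.0 * np.pi * 2 * t / n_time + phase)   # frequency 2: one period per cycle
X = signal + rng.normal(scale=1.0, size=(n_genes, n_time))

P = project_even_frequencies(X)           # the part that repeats over both cycles

# PCA on the projected data.
centered = P - P.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)
print("variance explained by PC1 + PC2:", round(float(explained[:2].sum()), 3))
```

The projected rows repeat exactly with period 9, so the first and second halves of each projected series agree; the leading rows of `Vt` can then be compared with sin and cos at frequency 2.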

Frequency 2 Analysis
Important features of data appear only at frequency 2,
Hence project data onto 2-dim space of sin and cos (freq. 2)
Useful view: scatterplot
Similar to PCA projns, except directions are now chosen, not var maxing

Frequency 2 Analysis
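
A minimal sketch of the frequency 2 projection described above, on made-up noiseless series. The 18-point grid matches the yeast data; the test curves, seed, and the names `sin2`, `cos2`, and `est_phase` are assumptions for illustration. Each series is projected onto the chosen sin and cos directions, and the polar angle of the resulting point recovers its phase.

```python
import numpy as np

n_time = 18
t = np.arange(n_time)
omega = 2.0 * np.pi * 2 / n_time           # frequency 2 over the 18 time points

# Chosen (not variance-maximizing) directions, scaled to unit length.
sin2 = np.sin(omega * t)
cos2 = np.cos(omega * t)
sin2 /= np.linalg.norm(sin2)
cos2 /= np.linalg.norm(cos2)

# Hypothetical noiseless series with known phases.
rng = np.random.default_rng(2)
true_phase = rng.uniform(-np.pi, np.pi, size=5)
X = np.sin(omega * t[None, :] + true_phase[:, None])

a = X @ sin2                                # coordinate on sin, proportional to cos(phase)
b = X @ cos2                                # coordinate on cos, proportional to sin(phase)
est_phase = np.arctan2(b, a)                # polar angle in the (a, b) scatterplot

print(np.round(true_phase, 3))
print(np.round(est_phase, 3))
```

A scatterplot of `a` against `b` is the 2-dimensional view; with real data the points would be noisy, and the angle would only approximate the phase.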

Frequency 2 Analysis
Project data onto 2-dim space of sin and cos (freq. 2)
Useful view: scatterplot
Angle (in polar coordinates) shows phase
Colors: Spellman's cell cycle phase classification
Black was labeled not periodic
Within class phases approximately same, but notable differences
Later will try to improve phase classification

Batch and Source Adjustment
For Stanford Breast Cancer Data (C. Perou)
Analysis in Benito, et al (2004) Bioinformatics, 20, 105-114.
https://genome.unc.edu/pubsup/dwd/
Adjust for Source Effects
  Different sources of mRNA
Adjust for Batch Effects
  Arrays fabricated at different times

Idea Behind Adjustment
Find direction from one to other
Shift data along that direction
Details of DWD Direction developed later
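
The two steps above (find a direction, shift along it) can be sketched as follows. Note the loud assumption: the direction used here is the difference of group means, a simple stand-in, and NOT the DWD direction, which the slides develop later. The helper `shift_adjust` and the simulated batches are hypothetical.

```python
import numpy as np

def shift_adjust(X, labels, direction):
    """Shift each group along `direction` so its projected mean becomes 0."""
    w = direction / np.linalg.norm(direction)
    X_adj = X.astype(float).copy()
    for g in np.unique(labels):
        mask = labels == g
        proj_mean = (X[mask] @ w).mean()   # group mean of projections onto w
        X_adj[mask] -= proj_mean * w       # rigid shift: within-group variation is kept
    return X_adj

# Simulated two-batch data with a batch offset in the first coordinate.
rng = np.random.default_rng(3)
d = 50
offset = np.zeros(d)
offset[0] = 4.0
A = rng.normal(size=(20, d))
B = rng.normal(size=(25, d)) + offset
X = np.vstack([A, B])
labels = np.array([0] * 20 + [1] * 25)

# Stand-in direction: difference of group means (an assumption, not DWD).
w = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
X_adj = shift_adjust(X, labels, w)

gap_before = np.linalg.norm(X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0))
gap_after = np.linalg.norm(X_adj[labels == 1].mean(axis=0) - X_adj[labels == 0].mean(axis=0))
print("mean gap before:", round(float(gap_before), 3), "after:", round(float(gap_after), 3))
```

Each group moves as a rigid block, so only the offset along the chosen direction is removed; any direction-finding method could be swapped in for the mean difference.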

Source Batch Adj: Raw Breast Cancer data
Source Batch Adj: Source Colors
Source Batch Adj: Batch Colors
Source Batch Adj: Biological Class Colors
Source Batch Adj: Biological Class Col. & Symbols
Source Batch Adj: Biological Class Symbols
Source Batch Adj: Source Colors
Source Batch Adj: PC 1-3 & DWD direction
Source Batch Adj: DWD Source Adjustment
Source Batch Adj: Source Adj'd, PCA view
Source Batch Adj: Source Adj'd, Class Colored
Source Batch Adj: Source Adj'd, Batch Colored
Source Batch Adj: Source Adj'd, 5 PCs
Source Batch Adj: S. Adj'd, Batch 1,2 vs. 3 DWD
Source Batch Adj: S. & B1,2 vs. 3 Adjusted
Source Batch Adj: S. & B1,2 vs. 3 Adj'd, 5 PCs
Source Batch Adj: S. & B Adj'd, B1 vs. 2 DWD
Source Batch Adj: S. & B Adj'd, B1 vs. 2 Adj'd
Source Batch Adj: S. & B Adj'd, 5 PC view
Source Batch Adj: S. & B Adj'd, 4 PC view
Source Batch Adj: S. & B Adj'd, Class Colors
Source Batch Adj: S. & B Adj'd, Adj'd PCA
Source Batch Adj: Raw Data, Tree View

Caution on Colors
~10% of Males are Red-Green Color Blind
Can't distinguish Red vs. Green
Should use better scheme

Caution About Tree View
Can Miss Important Features

Caution About Tree View
Important Clusters, not in Coord Axis Dirn

Source Batch Adj: Raw Data, Tree View
Source Batch Adj: Raw Data, Array Tree
Source Batch Adj: Raw Array Tree, Source Colored
Source Batch Adj: Raw Array Tree, Batch Colored
Source Batch Adj: Raw Array Tree, Class Colored
Source Batch Adj: DWD Adjusted Data, Tree View
Source Batch Adj: DWD Adjusted Data, Array Tree
Source Batch Adj: DWD Adjusted Data, Source Colored
Source Batch Adj: DWD Adjusted Data, Batch Colored
Source Batch Adj: DWD Adjusted Data, Class Colored

DWD: A look under the hood
Distance Weighted Discrimination (DWD)
Modification of Support Vector Machine
For HDLSS data
Uses 2nd Order Cone programming
Will study later
Main Goal:
Find direction that separates data classes
In best possible way

DWD: Why not PC1?
PC 1 Direction feels variation, not classes
Also eliminates (important?) within class variation

DWD: Why not PC1?
Direction driven by classes
Sliding maintains (important?) within class variation

E. g. even worse for PCA
PC1 direction is worst possible

But easy for DWD
Since DWD uses class label information

DWD does not solve all problems
Only handles means, not differing variation
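
The contrast above between PC1 and a class-driven direction can be sketched on a toy example. Assumptions are stated plainly: the data are simulated, and the class-aware direction is the normalized difference of class means, used as a simple stand-in because actual DWD requires a second order cone programming solver. The function `separation` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
# Two classes: large shared variation in coordinate 0,
# class difference only in coordinate 1.
cls0 = rng.normal(size=(n, 2)) * [3.0, 0.3] + [0.0, -1.5]
cls1 = rng.normal(size=(n, 2)) * [3.0, 0.3] + [0.0, 1.5]
X = np.vstack([cls0, cls1])

# PC1: direction of maximal variation, computed without class labels.
centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
pc1 = Vt[0]

# Class-aware stand-in (NOT DWD): normalized difference of class means.
md = cls1.mean(axis=0) - cls0.mean(axis=0)
md /= np.linalg.norm(md)

def separation(direction):
    """Gap between projected class means, in pooled within-class SD units."""
    p0, p1 = cls0 @ direction, cls1 @ direction
    pooled_sd = np.sqrt(0.5 * (p0.var() + p1.var()))
    return abs(p1.mean() - p0.mean()) / pooled_sd

print("separation along PC1:            ", round(float(separation(pc1)), 2))
print("separation along mean difference:", round(float(separation(md)), 2))
```

PC1 follows the long axis shared by both classes, which here is nearly orthogonal to the class difference, while any direction that uses the labels picks up the gap. As the last slide above notes, a mean-based direction like this one only handles differences in means, not differing variation.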

Interesting Benchmark Data Set
NCI 60 Cell Lines
Interesting benchmark, since same cells
Data Web available: http://discover.nci.nih.gov/datasetsNature2000.jsp
Both cDNA and Affymetrix Platforms
Different from Breast Cancer Data
Which had no common samples

NCI 60: Raw Data, Platform Colored
NCI 60: Raw Data, Tree View
NCI 60: Raw Data
NCI 60: Raw Data, Before DWD Adjustment
NCI 60: Before & After DWD adjustment
NCI 60: Before & After, new scales
NCI 60: After DWD
NCI 60: DWD adjusted data
NCI 60: Before Column Mean Adjustment
NCI 60: Before & After Column Mean Adjustment
NCI 60: Before & After Col. Mean Adj., Rescaled
NCI 60: After DWD & Column Mean Adj.
NCI 60: DWD & Column Mean Adjusted
NCI 60: Before Column Stand. Dev. Adjustment
NCI 60: Before and After Column S.D. Adjustment
NCI 60: Before and After Col. S.D. Adj., Rescaled
NCI 60: After Column Stand. Dev. adjustment
NCI 60: Fully Adjusted Data
NCI 60: Fully Adjusted Data, Platform Colored
NCI 60: Fully Adjusted Data, Melanoma Cluster
BREAST.MDAMB435 BREAST.MDN MELAN.MALME3M MELAN.SKMEL2 MELAN.SKMEL5 MELAN.SKMEL28 MELAN.M14 MELAN.UACC62 MELAN.UACC257

NCI 60: Fully Adjusted Data, Leukemia Cluster
LEUK.CCRFCEM LEUK.K562 LEUK.MOLT4 LEUK.HL60 LEUK.RPMI8266 LEUK.SR

Another DWD Appln: Visualization
Recall PCA limitations
DWD uses class info
Hence can better separate known classes
Do this for pairs of classes (DWD just on those, ignore others)
Carefully choose pairs in NCI 60 data
Shows Effectiveness of Adjustment

NCI 60: Views using DWD Dirns (focus on biology)

DWD Visualization of NCI 60 Data
Most cancer types clearly distinct (Renal, CNS, Ovar, Leuk, Colon, Melan)
Using these carefully chosen directions
Others less clear cut
  NSCLC (at least 3 subtypes)
  Breast (4 published subtypes)
DWD adjustment was very effective (very few black connectors visible)