Object Orie’d Data Analysis, Last Time


Page 1: Object Orie’d Data Analysis, Last Time

Object Orie’d Data Analysis, Last Time

• Organizational Matters: http://www.unc.edu/~marron/UNCstat322-2005/HomePage.html

• Matlab Software

• Time Series of Curves

• Chemometrics Data

• Mortality Data

Page 2: Object Orie’d Data Analysis, Last Time

Data Object Conceptualization

Object Space: Curves, Images, Shapes, Trees

Feature Space: ℝ^d, Manifolds, Tree Space

Page 3: Object Orie’d Data Analysis, Last Time

Functional Data Analysis, Toy EG I

Page 4: Object Orie’d Data Analysis, Last Time

Limitation of PCA

PCA can provide useful projection directions

But can’t “see everything”…

Reason:

• PCA finds dir’ns of maximal variation

• Which may obscure interesting structure
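A minimal sketch of this limitation, using invented 2-d toy data (not the data from the slides): two classes differ along y, but the spread along x is much larger, so PC1 follows x and the class gap is nearly invisible in the PC1 projection. The closed-form angle used here is valid only for a 2x2 covariance matrix.

```python
import math

# Toy data (made up for illustration): two classes separated along y,
# but with much larger spread along x.
class_a = [(-4.0, 1.0), (-2.0, 1.2), (0.0, 0.8), (2.0, 1.0), (4.0, 1.0)]
class_b = [(-4.0, -1.0), (-2.0, -0.8), (0.0, -1.2), (2.0, -1.0), (4.0, -1.0)]
points = class_a + class_b

n = len(points)
mean_x = sum(p[0] for p in points) / n
mean_y = sum(p[1] for p in points) / n

# Entries of the 2x2 sample covariance matrix.
sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)

# For a symmetric 2x2 matrix the leading eigenvector has this closed-form angle.
theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
pc1 = (math.cos(theta), math.sin(theta))    # direction of maximal variation
pc2 = (-math.sin(theta), math.cos(theta))   # orthogonal direction


def project(p, direction):
    """Score of a centered point along a unit direction."""
    return (p[0] - mean_x) * direction[0] + (p[1] - mean_y) * direction[1]


def gap(direction):
    """Distance between the two class means after projection."""
    mean_a = sum(project(p, direction) for p in class_a) / len(class_a)
    mean_b = sum(project(p, direction) for p in class_b) / len(class_b)
    return abs(mean_a - mean_b)


print("class gap along PC1:", gap(pc1))
print("class gap along PC2:", gap(pc2))
```

In this toy case the interesting structure sits almost entirely in PC2, the low-variance direction.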

Page 5: Object Orie’d Data Analysis, Last Time

Limitation of PCA, Toy E.g.

Page 6: Object Orie’d Data Analysis, Last Time

Yeast Cell Cycle Data

• “Gene Expression” – Micro-array data

• Data (after major preprocessing): Expression “level” of:

• thousands of genes (d ~ 1,000s)

• but only dozens of “cases” (n ~ 10s)

• Interesting statistical issue: High Dimension Low Sample Size (HDLSS) data

Page 7: Object Orie’d Data Analysis, Last Time

Yeast Cell Cycle Data

Data from:

Spellman, P. T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D. and Futcher, B. (1998), “Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization”, Molecular Biology of the Cell, 9, 3273-3297.

Page 8: Object Orie’d Data Analysis, Last Time

Yeast Cell Cycle Data

Analysis here is from:

Zhao, X., Marron, J.S. and Wells, M.T. (2004) The Functional Data View of Longitudinal Data, Statistica Sinica, 14, 789-808

Page 9: Object Orie’d Data Analysis, Last Time

Yeast Cell Cycle Data

• Lab experiment:

• Chemically “synchronize cell cycles” of yeast cells

• Do cDNA micro-arrays over time

• Used 18 time points, over “about 2 cell cycles”

• Studied 4,489 genes (whole genome)

• Time series view of data: 4,489 time series of length 18

• Functional Data View: 4,489 “curves”

Page 10: Object Orie’d Data Analysis, Last Time

Yeast Cell Cycle Data, FDA View

Central question: Which genes are “periodic” over 2 cell cycles?

Page 11: Object Orie’d Data Analysis, Last Time

Yeast Cell Cycle Data, FDA View

Periodic genes?

Naïve approach: Simple PCA

Page 12: Object Orie’d Data Analysis, Last Time

Yeast Cell Cycle Data, FDA View

• Central question: which genes are “periodic” over 2 cell cycles?

• Naïve approach: Simple PCA

• No apparent (2 cycle) periodic structure?

• Eigenvalues suggest large amount of “variation”

• PCA finds “directions of maximal variation”

• Often, but not always, same as “interesting directions”

• Here need better approach to study periodicities

Page 13: Object Orie’d Data Analysis, Last Time

Yeast Cell Cycles, Freq. 2 Proj.

PCA on Freq. 2 Periodic Component of Data

Page 14: Object Orie’d Data Analysis, Last Time

Yeast Cell Cycles, Freq. 2 Proj.

PCA on periodic component of data

• Hard to see periodicities in raw data

• But very clear in PC1 (~sin) and PC2 (~cos)

• PC1 and PC2 explain 65% of variation (see residuals)

• Recall linear combos of sin and cos capture “phase”, since:

$$\cos(x - \phi) = \cos\phi \, \cos x + \sin\phi \, \sin x = c_1 \cos x + c_2 \sin x$$
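A quick numerical check of this identity, read here as cos(x − φ) = c1 cos x + c2 sin x with c1 = cos φ and c2 = sin φ. The sign convention for the phase φ and the value 0.7 are assumptions made for illustration. The sketch also shows that the phase can be read back from the pair (c1, c2) as a polar angle.

```python
import math


def phase_coefficients(phi):
    """Return (c1, c2) with cos(x - phi) = c1*cos(x) + c2*sin(x)."""
    return math.cos(phi), math.sin(phi)


phi = 0.7  # arbitrary phase shift chosen for illustration
c1, c2 = phase_coefficients(phi)

# Check the identity on a grid of x values covering one full period.
max_error = max(
    abs(math.cos(x - phi) - (c1 * math.cos(x) + c2 * math.sin(x)))
    for x in [k * 0.1 for k in range(63)]
)

# The phase is the polar angle of the coefficient pair.
recovered_phi = math.atan2(c2, c1)

print("largest identity error:", max_error)
print("recovered phase:", recovered_phi)
```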

Page 15: Object Orie’d Data Analysis, Last Time

Frequency 2 Analysis

• Important features of data appear only at frequency 2

• Hence project data onto 2-dim space of sin and cos (freq. 2)

• Useful view: scatterplot

Page 16: Object Orie’d Data Analysis, Last Time

Frequency 2 Analysis

Page 17: Object Orie’d Data Analysis, Last Time

Frequency 2 Analysis

• Project data onto 2-dim space of sin and cos (freq. 2)

• Useful view: scatterplot

• Angle (in polar coordinates) shows phase

• Colors: Spellman’s cell cycle phase classification

• Black was labeled “not periodic”

• Within class phases approx’ly same, but notable differences

• Later will try to improve “phase classification”
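The projection and the polar-angle reading of phase can be sketched as follows. This is an illustration on one synthetic curve, not the analysis from the slides. It assumes 18 equally spaced time points spanning exactly 2 cycles, so that “frequency 2” means two full periods over the window. The function name `freq2_coordinates` and all numeric values are hypothetical.

```python
import math

N = 18  # number of time points, as on the slides


def freq2_coordinates(curve):
    """Least-squares coefficients of curve on cos and sin at frequency 2."""
    n = len(curve)
    angles = [2.0 * math.pi * 2.0 * j / n for j in range(n)]
    a = (2.0 / n) * sum(y * math.cos(t) for y, t in zip(curve, angles))
    b = (2.0 / n) * sum(y * math.sin(t) for y, t in zip(curve, angles))
    return a, b


# Synthetic "gene": offset + frequency-1 nuisance + frequency-2 signal
# with amplitude 2 and phase 1 radian (all values invented).
true_amp, true_phase = 2.0, 1.0
curve = [
    0.5
    + 0.3 * math.sin(2.0 * math.pi * j / N)
    + true_amp * math.cos(2.0 * math.pi * 2.0 * j / N - true_phase)
    for j in range(N)
]

a, b = freq2_coordinates(curve)
amplitude = math.hypot(a, b)   # radius of the point in the scatterplot
phase = math.atan2(b, a)       # angle of the point in the scatterplot

print("amplitude:", amplitude, "phase:", phase)
```

Each gene becomes one point (a, b) in the scatterplot. The offset and the frequency-1 term drop out because they are orthogonal to the frequency-2 basis on this grid.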

Page 18: Object Orie’d Data Analysis, Last Time

Batch and Source Adjustment

• For Stanford Breast Cancer Data

• Analysis in Benito, et al (2004) Bioinformatics: https://genome.unc.edu/pubsup/dwd/

• Adjust for Source Effects – Different sources of mRNA

• Adjust for Batch Effects – Arrays fabricated at different times

Page 19: Object Orie’d Data Analysis, Last Time

Idea Behind Adjustment

• Find “direction” from one to other

• Shift data along that direction

• Details of DWD Direction developed later
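The two steps above can be sketched on invented toy data. Since the DWD direction is developed later, this sketch swaps in the unit vector between the two group means as a stand-in; it is not DWD. Each group is then shifted along that direction by its own mean projection. The helper names are hypothetical.

```python
def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]


def unit_direction(group_a, group_b):
    """Unit vector from the mean of group_a to the mean of group_b.

    Stand-in for the DWD direction, which the slides develop later.
    """
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    diff = [b - a for a, b in zip(ma, mb)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def shift_along(group, direction):
    """Subtract the group's mean projection, moving it along direction only."""
    mean_proj = sum(dot(r, direction) for r in group) / len(group)
    return [[x - mean_proj * d for x, d in zip(r, direction)] for r in group]


# Invented toy data: two "sources" offset from each other.
source_1 = [[1.0, 2.0], [2.0, 3.5], [3.0, 2.5]]
source_2 = [[6.0, 2.2], [7.0, 3.0], [8.0, 2.8]]

w = unit_direction(source_1, source_2)
adj_1 = shift_along(source_1, w)
adj_2 = shift_along(source_2, w)

print("adjusted source 1:", adj_1)
print("adjusted source 2:", adj_2)
```

Because every point in a group moves by the same vector, differences between points within a group are unchanged.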

Page 20: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Raw Breast Cancer data

Page 21: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Source Colors

Page 22: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Batch Colors

Page 23: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Biological Class Colors

Page 24: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Biological Class Col. & Symbols

Page 25: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Biological Class Symbols

Page 26: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Source Colors

Page 27: Object Orie’d Data Analysis, Last Time

Source Batch Adj: PC 1-2 & DWD direction

Page 28: Object Orie’d Data Analysis, Last Time

Source Batch Adj: DWD Source Adjustment

Page 29: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Source Adj’d, PCA view

Page 30: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Source Adj’d, Class Colored

Page 31: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Source Adj’d, Batch Colored

Page 32: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Source Adj’d, 5 PCs

Page 33: Object Orie’d Data Analysis, Last Time

Source Batch Adj: S. Adj’d, Batch 1,2 vs. 3 DWD

Page 34: Object Orie’d Data Analysis, Last Time

Source Batch Adj: S. & B1,2 vs. 3 Adjusted

Page 35: Object Orie’d Data Analysis, Last Time

Source Batch Adj: S. & B1,2 vs. 3 Adj’d, 5 PCs

Page 36: Object Orie’d Data Analysis, Last Time

Source Batch Adj: S. & B Adj’d, B1 vs. 2 DWD

Page 37: Object Orie’d Data Analysis, Last Time

Source Batch Adj: S. & B Adj’d, B1 vs. 2 Adj’d

Page 38: Object Orie’d Data Analysis, Last Time

Source Batch Adj: S. & B Adj’d, 5 PC view

Page 39: Object Orie’d Data Analysis, Last Time

Source Batch Adj: S. & B Adj’d, 4 PC view

Page 40: Object Orie’d Data Analysis, Last Time

Source Batch Adj: S. & B Adj’d, Class Colors

Page 41: Object Orie’d Data Analysis, Last Time

Source Batch Adj: S. & B Adj’d, Adj’d PCA

Page 42: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Raw Data, Tree View

Page 43: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Raw Data, Array Tree

Page 44: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Raw Array Tree, Source Colored

Page 45: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Raw Array Tree, Batch Colored

Page 46: Object Orie’d Data Analysis, Last Time

Source Batch Adj: Raw Array Tree, Class Colored

Page 47: Object Orie’d Data Analysis, Last Time

Source Batch Adj: DWD Adjusted Data, Tree View

Page 48: Object Orie’d Data Analysis, Last Time

Source Batch Adj: DWD Adjusted Data, Array Tree

Page 49: Object Orie’d Data Analysis, Last Time

Source Batch Adj: DWD Adjusted Data, Source Colored

Page 50: Object Orie’d Data Analysis, Last Time

Source Batch Adj: DWD Adjusted Data, Batch Colored

Page 51: Object Orie’d Data Analysis, Last Time

Source Batch Adj: DWD Adjusted Data, Class Colored

Page 52: Object Orie’d Data Analysis, Last Time

DWD: A look under the hood

• Distance Weighted Discrimination (DWD)

– Modification of Support Vector Machine

– For HDLSS data

– Uses 2nd Order Cone programming

– Will study later

• Main Goal:

– Find direction that separates data classes

– In “best” possible way

Page 53: Object Orie’d Data Analysis, Last Time

DWD: Why not PC1?

- Direction feels variation, not classes

- Also eliminates (important?) within class variation

Page 54: Object Orie’d Data Analysis, Last Time

DWD: Under the hood (cont.)

- Direction driven by classes

- “Sliding” maintains (important?) within class variation

Page 55: Object Orie’d Data Analysis, Last Time

E.g. even worse for PCA

- PC1 direction is worst possible

Page 56: Object Orie’d Data Analysis, Last Time

But easy for DWD

Since DWD uses class label information

Page 57: Object Orie’d Data Analysis, Last Time

DWD does not solve all problems

Only handles means, not differing variation
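A small illustration of this limitation on invented one-dimensional projections: a pure shift lines up the two batch means, but a difference in spread survives untouched. The numbers are made up, and the centering step is a generic translation, not DWD itself.

```python
import statistics

# Invented 1-d projections of two batches onto some separating direction.
batch_a = [-0.2, -0.1, 0.0, 0.1, 0.2]   # tight spread
batch_b = [3.0, 4.0, 5.0, 6.0, 7.0]     # wide spread, shifted mean


def center(values):
    """Shift a batch so that its mean is zero (a pure translation)."""
    m = statistics.fmean(values)
    return [v - m for v in values]


adj_a, adj_b = center(batch_a), center(batch_b)

# After the shift the means agree, but the standard deviations still differ.
mean_gap = abs(statistics.fmean(adj_a) - statistics.fmean(adj_b))
sd_a, sd_b = statistics.stdev(adj_a), statistics.stdev(adj_b)

print("gap between means:", mean_gap)
print("standard deviations:", sd_a, sd_b)
```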

Page 58: Object Orie’d Data Analysis, Last Time

Interesting Benchmark Data Set

• NCI 60 Cell Lines

– Interesting benchmark, since same cells

– Data available on the Web: http://discover.nci.nih.gov/datasetsNature2000.jsp

– Both cDNA and Affymetrix Platforms

• Different from Breast Cancer Data

– No common RNA

Page 59: Object Orie’d Data Analysis, Last Time

NCI 60: Raw Data, Platform Colored

Page 60: Object Orie’d Data Analysis, Last Time

NCI 60: Raw Data, Tree View

Page 61: Object Orie’d Data Analysis, Last Time

NCI 60: Raw Data

Page 62: Object Orie’d Data Analysis, Last Time

NCI 60: Raw Data, Before DWD Adjustment

Page 63: Object Orie’d Data Analysis, Last Time

NCI 60: Before & After DWD adjustment

Page 64: Object Orie’d Data Analysis, Last Time

NCI 60: Before & After, new scales

Page 65: Object Orie’d Data Analysis, Last Time

NCI 60: After DWD

Page 66: Object Orie’d Data Analysis, Last Time

NCI 60: DWD adjusted data

Page 67: Object Orie’d Data Analysis, Last Time

NCI 60: Before Column Mean Adjustment

Page 68: Object Orie’d Data Analysis, Last Time

NCI 60: Before & After Column Mean Adjustment

Page 69: Object Orie’d Data Analysis, Last Time

NCI 60: Before & After Col. Mean Adj., Rescaled

Page 70: Object Orie’d Data Analysis, Last Time

NCI 60: After DWD & Column Mean Adj.

Page 71: Object Orie’d Data Analysis, Last Time

NCI 60: DWD & Column Mean Adjusted

Page 72: Object Orie’d Data Analysis, Last Time

NCI 60: Before Column Stand. Dev. Adjustment

Page 73: Object Orie’d Data Analysis, Last Time

NCI 60: Before and After Column S.D. Adjustment

Page 74: Object Orie’d Data Analysis, Last Time

NCI 60: Before and After Col. S.D. Adj., Rescaled

Page 75: Object Orie’d Data Analysis, Last Time

NCI 60: After Column Stand. Dev. adjustment
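The slide titles name column mean and column standard deviation adjustments without giving details. A minimal sketch of what such a standardization might look like is shown here, assuming rows are genes and columns are arrays (an assumption about the data layout) and combining both steps in one hypothetical helper, `standardize_columns`.

```python
import statistics


def standardize_columns(matrix):
    """Give every column mean 0 and sample standard deviation 1."""
    columns = list(zip(*matrix))                  # transpose: rows -> columns
    scaled = []
    for col in columns:
        m = statistics.fmean(col)                 # column mean adjustment
        s = statistics.stdev(col)                 # column S.D. adjustment
        scaled.append([(v - m) / s for v in col])
    return [list(row) for row in zip(*scaled)]    # transpose back


# Invented toy matrix: rows are genes, columns are arrays
# (the layout is an assumption; the slides do not spell it out).
data = [
    [1.0, 10.0],
    [2.0, 30.0],
    [3.0, 20.0],
    [6.0, 60.0],
]
adjusted = standardize_columns(data)

for row in adjusted:
    print(row)
```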

Page 76: Object Orie’d Data Analysis, Last Time

NCI 60: Fully Adjusted Data

Page 77: Object Orie’d Data Analysis, Last Time

NCI 60: Fully Adjusted Data, Platform Colored

Page 78: Object Orie’d Data Analysis, Last Time

NCI 60: Fully Adjusted Data, Tree View

Page 79: Object Orie’d Data Analysis, Last Time

NCI 60: Fully Adjusted Data, Melanoma Cluster

BREAST.MDAMB435 BREAST.MDN MELAN.MALME3M MELAN.SKMEL2 MELAN.SKMEL5 MELAN.SKMEL28 MELAN.M14 MELAN.UACC62 MELAN.UACC257

Page 80: Object Orie’d Data Analysis, Last Time

NCI 60: Adjusted, Melanoma Cluster, Tree View

All Melanoma Lines + 2 Breast Lines in Same Cluster

Page 81: Object Orie’d Data Analysis, Last Time

NCI 60: Fully Adjusted Data, Leukemia Cluster

LEUK.CCRFCEM LEUK.K562 LEUK.MOLT4 LEUK.HL60 LEUK.RPMI8266 LEUK.SR

Page 82: Object Orie’d Data Analysis, Last Time

NCI 60: Adjusted, Leukemia Cluster, Tree View

Page 83: Object Orie’d Data Analysis, Last Time

NCI 60: Views using DWD Dir’ns (focus on biology)

Page 84: Object Orie’d Data Analysis, Last Time

NCI 60 Controversy

• Can NCI 60 Data be normalized?

• Negative Indication:

• Kuo, et al (2002) Bioinformatics, 18, 405-412. – Based on Gene by Gene Correlations

• Resolution: Gene by Gene Data View vs. Multivariate Data View

Page 85: Object Orie’d Data Analysis, Last Time

Resolution of Paradox: Toy Data, Gene View

Page 86: Object Orie’d Data Analysis, Last Time

Resolution: Correlations suggest “no chance”

Page 87: Object Orie’d Data Analysis, Last Time

Resolution: Toy Data, PCA View

Page 88: Object Orie’d Data Analysis, Last Time

Resolution: PCA & DWD direct’ns

Page 89: Object Orie’d Data Analysis, Last Time

Resolution: DWD Adjusted

Page 90: Object Orie’d Data Analysis, Last Time

Resolution: DWD Adjusted, PCA view

Page 91: Object Orie’d Data Analysis, Last Time

Resolution: DWD Adjusted, Gene view

Page 92: Object Orie’d Data Analysis, Last Time

Resolution: Correlations & PC1 Projection Correl’n

Page 93: Object Orie’d Data Analysis, Last Time

Needed final verification of Cross-platform Normal’n

• Is statistical power actually improved?

• Will study later