Integrative Analysis of High Dimensional Gene Expression, Metabolite and Blood Chemistry Data

NISS Metabolomics Workshop, 2005. Integrative Analysis of High Dimensional Gene Expression, Metabolite and Blood Chemistry Data. Kwan R. Lee, Ph.D. and Lei A. Zhu, Amit Bhattacharyya, J. Alan Menius, Biomedical Data Sciences, GlaxoSmithKline. [email protected]


Page 1

NISS Metabolomics Workshop, 2005

Integrative Analysis of High Dimensional Gene Expression, Metabolite and Blood Chemistry Data

Kwan R. Lee, Ph.D. and Lei A. Zhu, Amit Bhattacharyya, J. Alan Menius
Biomedical Data Sciences, GlaxoSmithKline
[email protected]

Page 2

Overview

• Systems Biology
• Challenges for Statisticians
• Possible solutions
• Example of integrative data analysis
• Summary and discussion

Page 3

Of mice and men

[Figure: slide image with two question marks; no further text recoverable.]

Page 4

Integrate knowledge and technologies

Reduce attrition by running coordinated studies in animal and man

Page 5

Focusing on one platform may miss an obvious signal!!!

Page 6

How can efficacy failures be attacked?

• Classic Phenotypic Approach: Animal Phenotype compared with Human Phenotype. Few data to support analogy.
• Integrative Biology: Animal Phenotype and Animal Biomarker Fingerprint compared with Human Phenotype and Human Biomarker Fingerprint. Many data to support analogy.

Page 7

‘Systems Biology’ approach to drug discovery

Page 8

Experimental Platforms: Non-omics and Omics, what are they?

[Figure: example readouts from each platform. Panels: 1H NMR metabolites (ppm axis, 9 to 1), Affy Transcriptome, LC-MS Lipid, LC-MS metabolites, and “Non-omic” markers. The “Non-omic” panel shows adiponectin (ug/ml, change from baseline) at day 7 and day 14 for db/+ and db/db groups treated with Veh, Met30, Gly1, Gly3 and Met75. A further schematic compares Veh and treatments A to D in Normal and Disease groups.]

Page 9

Experimental Platforms: Non-omics and Omics, what are they? (cont.)

• Traditional Blood Chemistry (non-omics)
• Gene Expression (transcriptomics)
• Metabolite (metabonomics)
• Lipid (lipomics)
• Protein (proteomics)

Page 10

Five Challenges

1. Data Pre-processing
2. High Dimensionality
3. Multiple Testing for Marker Selection
4. Data Integration
5. Validation of the Prediction Model

Page 11

Challenge #1: Data Pre-processing

• Peak Alignment (NMR, LC/MS)
• Normalization (Gene Chip, NMR, LC/MS data)
  – Why? Remove systematic bias in the data
  – Normalization within the platform makes data comparable across samples
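The slides do not say which normalization was applied. As a minimal sketch of the idea, assuming simple median scaling (one of several common choices), each sample's intensities can be divided by that sample's own median so that a per-sample multiplicative bias is removed. The function name `median_scale` and the toy data are hypothetical.

```python
from statistics import median

def median_scale(samples):
    """Divide every intensity in a sample by that sample's median.

    `samples` is a list of samples, each a list of positive intensities.
    This removes a per-sample multiplicative bias (assumed bias model).
    """
    scaled = []
    for intensities in samples:
        m = median(intensities)
        if m <= 0:
            raise ValueError("median must be positive for scaling")
        scaled.append([x / m for x in intensities])
    return scaled

# Toy data: the second sample is the first one measured twice as bright.
raw = [[10.0, 20.0, 30.0, 40.0, 50.0],
       [20.0, 40.0, 60.0, 80.0, 100.0]]
normalized = median_scale(raw)
print(normalized)
```

After scaling, the two toy samples coincide, which is the sense in which normalization makes data comparable across samples.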

Page 12

Challenge #2: High Dimensionality
# of subjects << # of variables

• Blood Chemistry: 9 markers
• Gene Expression: 22,000 probe sets
• Lipid LC/MS: 2,000 peaks
• Metabolite LC/MS: 3,000 peaks
• NMR: 500 buckets

[Data layout: rows are Animal 1, Animal 2, ..., Animal 100; columns are blood chemistry (Choles, Trig, ...), probe sets 1 to 22,000, lipids 1 to 2,000, metabolites 1 to 3,000, and NMR buckets 1 to 500.]

Page 13

Challenge #3: Multiple Testing in Variable Selection

[Figure: Noise + Signal = Signal+Noise, shown under three treatments: No Adjustment for Multiple Testing, FWER Adjustment, and FDR.]
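The three treatments named on this slide can be compared on a small list of p-values. The sketch below assumes Bonferroni as the FWER adjustment and the Benjamini-Hochberg step-up procedure as the FDR method; the slides do not name specific procedures, and the p-values are made up for illustration.

```python
def bonferroni(pvalues, alpha=0.05):
    """Indices rejected under FWER control: p <= alpha / m."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]

def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices rejected under FDR control (Benjamini-Hochberg step-up)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k_max = rank
    return sorted(order[:k_max])

# Hypothetical p-values for ten markers.
pvals = [0.001, 0.008, 0.012, 0.041, 0.20, 0.35, 0.50, 0.62, 0.81, 0.95]
unadjusted = [i for i, p in enumerate(pvals) if p <= 0.05]
fwer = bonferroni(pvals)
fdr = benjamini_hochberg(pvals)
print("no adjustment:", unadjusted)
print("FWER (Bonferroni):", fwer)
print("FDR (Benjamini-Hochberg):", fdr)
```

On these toy values the unadjusted test keeps the most markers, Bonferroni the fewest, and the FDR procedure falls in between, which is the usual trade-off between false positives and missed signal.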

Page 14

Challenge #4: Data integration

[Figure: the same platform panels as on Page 8: 1H NMR metabolites, Affy Transcriptome, LC-MS Lipid, LC-MS metabolites, and “Non-omic” markers (adiponectin in ug/ml, change from baseline, day 7 and day 14, db/+ and db/db groups, treatments Veh, Met30, Gly1, Gly3, Met75).]

Page 15

Challenge #4: Data integration (cont.)

Integration Approach 1:
• Platform A (20000s var.) and Platform B (1000s var.) are merged directly into Combined Data.

Integration Approach 2:
• Platform A (20000s var.) and Platform B (1000s var.) first go through Dimension Reduction (e.g. variable selection).
• The reduced Platform A (1000s var.) and Platform B (100s var.) are then merged into Combined Data.

Page 16

Challenge #4: Data integration, Example 1

Integration approach 1: Simple data integration

– When the platform data are simply combined, the platform with the largest amount of data and variability will dominate the other platforms

Page 17

PCA on Non-omics, Transcriptomics, and Combined.

[Figure: three PCA score plots, t[1] vs. t[2], Class 1 and Class 2, points labelled 11 and 58.
Panel 1: Non-omics (20). Plot title: integrated_orignal_intensity_liver_Epac397.M1 (PCA-X), Non-omics (40 markers) only. Axes run from about -8 to 8.
Panel 2: Transcriptomics (12,488). Plot title: integrated_orignal_intensity_liver_Epac397 Transcriptomics only.M1 (PCA-X). Axes run from about -90 to 110 (t[1]) and -80 to 80 (t[2]).
Panel 3: Combined (12,508). Plot title: integrated_orignal_intensity_liver_Epac397 non-omics 20 n transcriptomics.M1 (PCA-X). Same axis ranges as Panel 2.]

Mirror image!!!

Transcriptomics data dominate Non-omics data!!!
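The dominance noted on this slide can be seen from simple counting. As a rough sketch, assume every variable is scaled to unit variance before PCA (the slides do not state the scaling used): each variable then contributes one unit of total variance, so a platform's share of the variance equals its share of the variable count. The function name `variance_share` is hypothetical; the counts are the ones quoted on the slide.

```python
def variance_share(block_sizes):
    """Share of total variance per block when every variable has unit variance."""
    total = sum(block_sizes.values())
    return {name: size / total for name, size in block_sizes.items()}

# Variable counts quoted on the slide: Non-omics (20), Transcriptomics (12,488).
blocks = {"non_omics": 20, "transcriptomics": 12488}
shares = variance_share(blocks)
for name, share in shares.items():
    print(f"{name}: {share:.2%} of total variance")
```

Under this assumption the non-omics block carries well under 1% of the total variance, so the leading components of the combined data are essentially those of the transcriptomics block alone.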

Page 18

PCA on Non-omics, Transcriptomics, and Combined.

[Figure: three PCA score plots, t[1] vs. t[2], Class 1 and Class 2, points labelled 11 and 58.
Panel 1: Non-omics (20). Plot title: integrated_orignal_intensity_liver_Epac397.M1 (PCA-X), Non-omics (40 markers) only. Axes run from about -8 to 8.
Panel 2: Transcriptomics (20 PCs). Plot title: Transcriptome first 20 PCs.M2 (PCA-X). Axes run from about -2 to 3.
Panel 3: Combined (40). Plot title: Nonomics20 n Tran20.M1 (PCA-X). Axes run from about -7 to 9.]

Like a mirror image!!!

Page 19

Challenge #4: Data integration, Example 2

Integration approach 2: Integrate on selected markers
– 9 blood chemistry + 2000 probe sets + 150 metabolites
– There are still platforms with more selected markers
– How to weight different platforms appropriately? E.g. 9 blood chemistry markers are known to relate to disease or drug
– Identify relationships among the probe sets, metabolites, and blood chemistry markers in terms of biological pathways
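One common answer to the weighting question is block scaling: autoscale every variable, then multiply each platform by 1/sqrt(p_k), where p_k is the number of variables in that platform, so that every platform contributes the same total variance. This is offered only as a sketch of that idea, not as the weighting used in the talk; the function names and the toy blocks are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev

def autoscale(column):
    """Center a variable and scale it to unit (population) variance."""
    mu, sd = mean(column), pstdev(column)
    if sd == 0:
        raise ValueError("constant variable cannot be autoscaled")
    return [(x - mu) / sd for x in column]

def block_scale(block):
    """Autoscale each column, then weight the block by 1/sqrt(p_k).

    `block` is a list of columns (one list of values per variable).
    After scaling, the block's total variance is 1 regardless of p_k.
    """
    weight = 1.0 / sqrt(len(block))
    return [[weight * x for x in autoscale(col)] for col in block]

def total_variance(block):
    """Sum of population variances over all columns of a block."""
    return sum(pstdev(col) ** 2 for col in block)

# Toy blocks for 4 animals: 2 blood chemistry markers and 5 probe sets.
chem = [[1.0, 2.0, 3.0, 4.0], [10.0, 30.0, 20.0, 40.0]]
probes = [[5.0, 6.0, 8.0, 7.0], [1.0, 0.0, 1.0, 3.0], [2.0, 4.0, 6.0, 9.0],
          [9.0, 7.0, 8.0, 5.0], [3.0, 3.0, 4.0, 6.0]]
combined = block_scale(chem) + block_scale(probes)
print(total_variance(block_scale(chem)), total_variance(block_scale(probes)))
```

Prior knowledge, such as the 9 blood chemistry markers being known to relate to disease or drug, could be expressed by raising that block's weight above 1/sqrt(p_k); how much is a judgment call that the slides leave open.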

Page 20

Principal Component Analysis (PCA)

Projection of 67 animals: 28 normal (black), 39 disease (red)

All markers used for projection (9 NO, 1991 TA, 115 MT)

[Figure: PCA score plot, t[1] vs. t[2]. Plot title: fat integrated.M11 (PCA-X), All, colored according to values in variable fat integrated. Ellipse: Hotelling T2 (0.95). Legend: Normal, Disease. Points are labelled by treatment: Veh, D10, D30, D60. Software stamp: SIMCA-P+ 10.5 - 04/23/2004 14:57.]

Page 21

Loading Plot

[Figure: PCA loading plot, p[1] vs. p[2]. Plot title: fat integrated.M11 (PCA-X), All, colored according to Var ID (Platform). Legend: ME, NO, TA. Visible point labels: G, C, T, N, HH, I, A, L. Software stamp: SIMCA-P+ 10.5 - 04/23/2004 15:00.]

Page 22

Partial Least Squares Discriminant Analysis (PLS-DA)

Disease group only: Vehicle vs. Drug

[Figure: PLS-DA score plot, t[1] vs. t[2]. Plot title: fat integrated.M6 (PLS-DA), All platforms, HFD, colored according to classes in M6. Ellipse: Hotelling T2 (0.95). Class 1 and Class 2; points are labelled by treatment: Veh, D10, D30, D60. Software stamp: SIMCA-P+ 10.5 - 04/23/2004 15:13.]

Page 23

PLS-DA: Corresponding projection of all markers (9 NO, 1991 TA, 115 MT). Which are important drug markers?

[Figure: PLS-DA weight plot, w*c[1] vs. w*c[2]. Plot title: fat integrated.M6 (PLS-DA), All platforms, HFD, colored according to Var ID (Platform). Legend: $M6.DA1, $M6.DA2, ME, NO, TA. Visible point labels: G, C, T, N, HH, I, A, L. The Drug and Veh directions are marked on the plot. Software stamp: SIMCA-P+ 10.5 - 04/23/2004 15:19.]

Page 24

Ranked drug markers by importance or by coefficients.

• Marker importance by variable importance on projection (VIP)
• Up or down regulation by coefficients

[Figure: two bar plots over the roughly 2,100 markers.
Panel 1: VIP[4], ranging from 0.00 to 1.60. Plot title: fat integrated.M6 (PLS-DA), All platforms, HFD, VIP[Comp. 4].
Panel 2: CoeffCS[4]($M6.DA1), ranging from -0.0100 to 0.0100. Plot title: fat integrated.M6 (PLS-DA), All platforms, HFD, CoeffCS[Comp. 4](YVar $M6.DA1).]
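For reference, variable importance on projection is commonly defined as below. This is the usual textbook definition, given here as an assumption about what the software reports; the slides do not spell it out. Here p is the number of variables, A the number of components, w_{ja} the PLS weight of variable j on component a, and SSY_a the sum of squares of Y explained by component a.

```latex
\mathrm{VIP}_j \;=\; \sqrt{\; p \,
  \frac{\sum_{a=1}^{A} \mathrm{SSY}_a \left( w_{ja} / \lVert \mathbf{w}_a \rVert \right)^{2}}
       {\sum_{a=1}^{A} \mathrm{SSY}_a} \;}
```

Under this definition the squared VIP values average to 1 over all variables, which is why markers with VIP above 1 are often treated as the important ones.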

Page 25

Validation of the model: R2, Q2 and permutation tests 100 times (P < 0.01)

[Figure: permutation validation plot of R2 and Q2, both axes from 0.00 to 1.00. Plot title: fat integrated.M6 (PLS-DA): Validate Model, $M6.DA1. Intercepts: R2=(0.0, 0.918), Q2=(0.0, 0.35). 100 permutations, 4 components.]

Page 26

Variation explained by each platform
PLS-DA for prediction of 2 experimental groups

Two Groups:
• HFD, vehicle
• HFD, Drug treated

     Platform   Q2(cum)
  1  NO         49%
  2  TA         86%
  3  MT         45%
  4  All        86%

Q2(Y) = amount of variation among the 2 groups explained by the model (cross-validated)

The above table is based on a 2-component model. If the 4th model (All) uses more components, 91% of the variation in the data can be explained by 4 components.

Page 27

Challenge #5: Validation of the Prediction Model

• Correct way of doing cross-validation
  – Especially when the variables are selected
• Is your prediction accuracy significant?

Page 28

Random Noise Data

• Simulate 20,000 marker columns of random noise for 100 patients, plus one additional column containing arbitrary class labels.

• Select the 5 marker columns most correlated with the class label.

• Build a prediction model for the class label based on these 5 selected markers.
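The first two steps of this experiment can be sketched in a few lines. The version below is scaled down (40 patients and 2,000 markers instead of 100 and 20,000) to keep it fast in pure Python, and uses the absolute Pearson correlation with a 0/1 label as the ranking statistic, which is an assumption; the variable names are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)
n_patients, n_markers, n_selected = 40, 2000, 5  # scaled down from 100 x 20,000

# Arbitrary class labels and pure-noise markers: there is no true signal.
labels = [0] * (n_patients // 2) + [1] * (n_patients // 2)
markers = [[rng.gauss(0.0, 1.0) for _ in range(n_patients)]
           for _ in range(n_markers)]

# Rank markers by absolute correlation with the label and keep the top 5.
abs_corr = [abs(pearson(col, labels)) for col in markers]
top = sorted(range(n_markers), key=lambda j: abs_corr[j], reverse=True)[:n_selected]

print("mean |r| over all markers:", round(sum(abs_corr) / n_markers, 3))
print("|r| of the selected markers:", [round(abs_corr[j], 3) for j in top])
```

Even though every column is noise, the selected markers show correlations far above the average, purely because they were picked from thousands of candidates. Any model fitted to them on the same data inherits that optimism.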

Page 29

PCA of Full Markers

[Figure: PCA score plot, t[1] vs. t[2], axes from about -40 to 50. Plot title: random_noise.M12 (PCA-X), Untitled, colored according to classes in M12. Ellipse: Hotelling T2 (0.95). Class 1 and Class 2. Software stamp: SIMCA-P+ 10.5 - 2/5/2005 8:22:09 AM.]

Page 30

PLS-DA on Random Noise Data

• Running a full model in SIMCA does not yield a model: no significant Q2.
  – Multivariate approach is conservative.
  – Q2 measures prediction performance.
• But we forced the software to fit a 6-component model by PLS-DA (R2 = 1.0, Q2 = 0.225).

Page 31

Full marker model, PLS-DA

[Figure: PLS-DA score plot, t[1] vs. t[2], axes from -30 to 30. Plot title: random_noise.M1 (PLS-DA), All data, colored according to classes in M1. Ellipse: Hotelling T2 (0.95). Class 1 and Class 2. Software stamp: SIMCA-P+ 10.5 - 2/4/2005 1:20:50 PM.]

Page 32

Was it real or by chance?

[Figure: permutation validation plot of R2 and Q2. Plot title: random_noise.M1 (PLS-DA): Validate Model, $M1.DA1. Intercepts: R2=(0.0, 1), Q2=(0.0, 0.245). 20 permutations, 6 components. Software stamp: SIMCA-P+ 10.5 - 2/4/2005 1:19:58 PM.]

Page 33

Select 5 Markers

• Selected top 5 markers using VIP from the over-fitted model and fit PLS-DA again on the same data.

• Now we have (R2 = 0.459, Q2 = 0.348)

Page 34

Good prediction from PLS-DA? Q2 = 0.35

[Figure: PLS-DA score plot, t[1] vs. t[2], axes from -3 to 3. Plot title: random_noise.M11 (PLS-DA), Untitled, R2 = 0.447, Q2 = 0.348. Ellipse: Hotelling T2 (0.95). Class 1 and Class 2. Software stamp: SIMCA-P+ 10.5 - 1/19/2005 10:17:08 PM.]

Page 35

Validated by permutation test? Significance of Q2

[Figure: permutation validation plot of R2 and Q2. Plot title: random_noise.M11 (PLS-DA): Validate Model, $M11.DA1. Intercepts: R2=(0.0, 0.0155), Q2=(0.0, -0.17). 20 permutations, 2 components. Software stamp: SIMCA-P+ 10.5 - 1/19/2005 10:21:30 PM.]

Page 36

Selection Bias

• When a prediction model is tested on the same data that were used in the first instance to select the markers, selection bias makes the test error overly optimistic.
  – Many publications claimed a small set of selected “genes” is highly predictive.
  – IBI practice is to use a data set to select markers and use the same data set to fit a prediction model based on selected markers.

Page 37

How to correct for selection bias?

• External validation should be undertaken subsequent to the feature selection process.
  1. Independent test data set (hold-out data set) that was never used for feature selection.
  2. External cross-validation (ECV).
     • Cross-validation of the prediction model is external to the selection process.
     • In other words, make a new selection for each cross-validation round.

Page 38

Externally Validated PLS: Model and variable selection

• Divide the data set randomly into d parts.
• For ecv = 1 to d (hold out part ecv and use the other d-1 parts for modeling):
  – For a = 1 to 10 (a is the number of components):
    – Set k = total number of variables.
    – Repeat until k = 2:
      – Fit the PLS model with the given a and k, PLS(a, k).
      – Predict the hold-out set, compute PRESS(ecv, a, k) and save it.
      – Choose the top half of the variables by an appropriate statistic (coeff, VIP, t-ratio, etc.).
      – Set k = k/2.
• Compute PRESS(a, k) = sum over ecv of PRESS(ecv, a, k).
• Compute Q2(a, k) = 1 - PRESS(a, k)/TSS.
• Plot Q2(a, k) vs. log2(k).

Page 39

Simulation of 2000 Random Data (R. Simon, 2003)

• 20 x 6000 and 10/10 for class labels
• Repeat 2000 times
• Compute 3 different error rates
  – Re-substitution (wrong)
  – Cross validation after selection (wrong)
  – Cross validation before selection (correct)
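The three error rates can be reproduced on a small scale. The sketch below makes several assumptions that are not in the slides: a nearest-centroid classifier, gene ranking by absolute difference in class means, 10 selected genes, leave-one-out cross-validation, and a reduced size (1,000 genes and 5 repeats instead of 6,000 and 2,000) so that it runs quickly in pure Python. All function names are hypothetical.

```python
import random

def select_genes(X, y, rows, k):
    """Indices of the k genes with the largest absolute class mean
    difference, scored on the given rows only."""
    g0 = [i for i in rows if y[i] == 0]
    g1 = [i for i in rows if y[i] == 1]
    def score(j):
        m0 = sum(X[i][j] for i in g0) / len(g0)
        m1 = sum(X[i][j] for i in g1) / len(g1)
        return abs(m1 - m0)
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def predict(X, y, train, genes, i):
    """Nearest-centroid prediction for sample i using the selected genes."""
    dist = {}
    for c in (0, 1):
        members = [r for r in train if y[r] == c]
        d = 0.0
        for j in genes:
            centroid = sum(X[r][j] for r in members) / len(members)
            d += (X[i][j] - centroid) ** 2
        dist[c] = d
    return 0 if dist[0] <= dist[1] else 1

def error_rates(X, y, k=10):
    n = len(y)
    everyone = list(range(n))
    genes_all = select_genes(X, y, everyone, k)
    # 1. Re-substitution: select, fit and test on the same samples.
    resub = sum(predict(X, y, everyone, genes_all, i) != y[i] for i in everyone) / n
    cv_after = cv_before = 0
    for i in everyone:
        train = [r for r in everyone if r != i]
        # 2. Selection done once on all samples; only the classifier is cross-validated.
        cv_after += predict(X, y, train, genes_all, i) != y[i]
        # 3. Selection repeated inside each fold, without the held-out sample.
        genes_fold = select_genes(X, y, train, k)
        cv_before += predict(X, y, train, genes_fold, i) != y[i]
    return resub, cv_after / n, cv_before / n

rng = random.Random(1)
n_samples, n_genes, n_repeats = 20, 1000, 5  # slide: 20 x 6000, 2000 repeats
y = [0] * 10 + [1] * 10
totals = [0.0, 0.0, 0.0]
for _ in range(n_repeats):
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
    for t, rate in enumerate(error_rates(X, y)):
        totals[t] += rate
resub, cv_after, cv_before = (t / n_repeats for t in totals)
print(f"re-substitution:           {resub:.2f}")
print(f"CV after selection:        {cv_after:.2f}")
print(f"CV before selection (ECV): {cv_before:.2f}")
```

Because the data are pure noise, an honest error estimate should sit near 50%. Only the third rate, where selection is redone inside every fold, is expected to behave that way; the first two look far better than they should.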

Page 40

Results of 2000 Random Data

Page 41

Permutation testing

• Because of the high dimensionality of gene expression data, it may be possible to achieve relatively small error rates even for random data.

• To assess the significance of the classification results, a permutation test can be used.
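A permutation test of this kind can be sketched as follows: compute the score with the real labels, recompute it many times with shuffled labels, and report how often the shuffled score is at least as good. The example assumes leave-one-out accuracy of a nearest-centroid classifier as the score and uses made-up data with one truly informative marker; the function names are hypothetical. If markers were selected as part of the modeling, that selection step would also need to be re-run inside `score_fn` for every permutation.

```python
import random

def loocv_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    n, p = len(y), len(X[0])
    correct = 0
    for i in range(n):
        dist = {}
        for c in set(y):
            members = [r for r in range(n) if r != i and y[r] == c]
            centroid = [sum(X[r][j] for r in members) / len(members)
                        for j in range(p)]
            dist[c] = sum((X[i][j] - centroid[j]) ** 2 for j in range(p))
        correct += min(dist, key=dist.get) == y[i]
    return correct / n

def permutation_p_value(score_fn, X, y, n_perm=200, seed=0):
    """Observed score and the fraction of label permutations that score
    at least as well as the real labels."""
    rng = random.Random(seed)
    observed = score_fn(X, y)
    hits = 0
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)
        hits += score_fn(X, shuffled) >= observed
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: 20 samples, 5 markers, only marker 0 differs between classes.
data_rng = random.Random(42)
y = [0] * 10 + [1] * 10
X = [[data_rng.gauss(3.0 * label if j == 0 else 0.0, 1.0) for j in range(5)]
     for label in y]
observed, p_value = permutation_p_value(loocv_accuracy, X, y)
print("observed accuracy:", observed, "permutation p-value:", round(p_value, 4))
```

With 200 permutations the smallest attainable p-value is 1/201, so more permutations are needed when a stricter threshold is required.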

Page 42

Challenge #5: Validation of the Prediction Model - summary

• Correct way of doing cross-validation
  – All the steps of the prediction modeling should be cross-validated.
  – Each cross-validation step should start from scratch.
• Is your prediction accuracy significant?
  – Random data can give you low prediction error.
  – Permutation tests, bootstrap aggregation

Page 43

Summary and Discussion

• Recent technological advances present challenging and interesting biological data at the molecular level.

• Statistics and multivariate analysis play an important role in understanding and extracting knowledge from these types of data.

• Integrative analysis is even more challenging, and we presented some solutions to these challenges. There is plenty of room for improvement.

Page 44

Acknowledgement

GlaxoSmithKline
– High Throughput Biology
– Biomedical Data Sciences
– Genomics and Proteomics Science
– Pathology, Cellular & Biochemical Toxicology
– Discovery IT

Page 45

Data exploration: Present Challenges

Data is an extremely valuable asset, but like a cash crop, unless harvested, it is wasted.

- Sid Adelman