Tree Based Methods for Analyzing Tissue Microarray Data


Page 1: Tree Based Methods for Analyzing  Tissue Microarray Data

Tree Based Methods for Analyzing Tissue Microarray Data

Steve Horvath

Human Genetics and Biostatistics

University of California, Los Angeles

Page 2: Tree Based Methods for Analyzing  Tissue Microarray Data

Acknowledgements

• Horvath Lab
– Yunda Huang
– Xueli Liu Ph.D.
– Zeke Fang Ph.D.
– Tuyen Hoang

• UCLA Tissue Microarray Core
– David Seligson
– Aarno Palotie

• Clinicians
– Hyung Kim
– Arie Belldegrun

Page 3: Tree Based Methods for Analyzing  Tissue Microarray Data

Contents

• Statistical issues with tissue microarray (TMA) data

• Random forest (RF) predictors

• RF clustering

• Application of RF clustering to TMA data

• Supervised Learning Methods

Page 4: Tree Based Methods for Analyzing  Tissue Microarray Data

Background TMA data

Page 5: Tree Based Methods for Analyzing  Tissue Microarray Data

Description of TMA data

• Tissue microarrays (TMAs) are a high-throughput tool for validating newly identified biomarkers from genome-wide discovery studies

• Basic technique was summarized in Kononen et al. 1998

Page 6: Tree Based Methods for Analyzing  Tissue Microarray Data

Tissue Microarray (TMA) Technology

Kononen et al. Nature Medicine 1998

[Figure: donor block, array block, slide.]

• Hundreds of tiny (typically 0.6 mm diameter) cylindrical tissue cores are densely and precisely arrayed into a single histologic paraffin block.

• From this new array block, up to 300 serial 4-8 µm thick sections may be produced.

• These sections serve as targets for fluorescence in situ hybridization (FISH) and for protein expression studies by immunohistochemistry.

Page 7: Tree Based Methods for Analyzing  Tissue Microarray Data

Pathologists score each spot by looking through a microscope.

Slide by David Seligson

Non-normal and highly correlated

Page 8: Tree Based Methods for Analyzing  Tissue Microarray Data

Several Spots per Pathology Case

Several “Scores” per Spot

• Maximum intensity = Max (1 – 4)

• Percent of cells staining = Pos (0 – 100)

• Percent of cells staining with the maximum intensity = PosMax (0 – 100)

• Spots have a spot grade: NL, 1, 2, ...

• Indicator of informativeness

• Each case is usually represented by 4 or more spots

– >3 malignant lesions, 1 matched normal

Page 9: Tree Based Methods for Analyzing  Tissue Microarray Data

[Figure: Histograms of tumor marker expression scores for P53, CA9 and EpCam. One row of panels shows percent of cells staining (POS, x-axis 0 to 100); the other shows maximum intensity (MAX, x-axis 0 to 3).]

Page 10: Tree Based Methods for Analyzing  Tissue Microarray Data

P53 and Ki67: Max versus Pos

[Figure: scatter plot matrices of Max versus Pos staining scores. Variables shown: KiNuclMax and KiPos for Ki67; P5NuclMax and P5Pos for P53.]

Page 11: Tree Based Methods for Analyzing  Tissue Microarray Data

Characteristics of TMA data

• Non-normal, discrete, strongly correlated

• Mixed variable types

• Pooling (combining) spot measurements for every patient
– between 1 and 10 spots of different grade
– current strategy pools tumor spots and forms the median, mean, minimum or maximum

• Message: tumor marker intensity is measured by up to 12 highly correlated staining scores (multicollinearity)
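The pooling strategy described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `pool_spots`, the patient identifiers and the scores are all hypothetical, and skipping patients without any scored spot is an assumption of this sketch.

```python
from statistics import mean, median

def pool_spots(spot_scores, how="median"):
    """Pool spot-level staining scores into one value per patient.

    spot_scores: dict mapping a patient id to a list of scores,
                 one score per tumor spot.
    how: one of "median", "mean", "min", "max".
    """
    poolers = {"median": median, "mean": mean, "min": min, "max": max}
    if how not in poolers:
        raise ValueError(f"unknown pooling method: {how}")
    pool = poolers[how]
    # Patients without any scored spot are skipped (assumption of this sketch).
    return {pid: pool(scores) for pid, scores in spot_scores.items() if scores}

# Hypothetical Pos scores (percent of cells staining) for three patients.
pos_by_patient = {"case1": [10, 40, 80], "case2": [0, 5], "case3": []}
pooled_median = pool_spots(pos_by_patient, "median")
pooled_max = pool_spots(pos_by_patient, "max")
```

Applying the same function to Max, Pos and PosMax with each of the four pooling rules gives the up to 12 correlated scores per marker mentioned in the message above.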

Page 12: Tree Based Methods for Analyzing  Tissue Microarray Data

Our main tools are random forest predictors

• Unsupervised analysis of TMA data
– RF clustering

• Supervised analysis
– RF based pre-validation method

Page 13: Tree Based Methods for Analyzing  Tissue Microarray Data

Background random forest predictors

L. Breiman 1999

Page 14: Tree Based Methods for Analyzing  Tissue Microarray Data

Random Forests (RFs)

• RFs are a collection of tree predictors such that each tree depends on the values of an independently sampled random vector

Page 15: Tree Based Methods for Analyzing  Tissue Microarray Data

Classification and Regression Trees (CART)

by
– Leo Breiman, UC Berkeley
– Jerry Friedman, Stanford University
– Charles J. Stone, UC Berkeley
– Richard Olshen, Stanford University

Page 16: Tree Based Methods for Analyzing  Tissue Microarray Data

An example of CART

• Goal: for patients admitted to the ER, predict who is at higher risk of a heart attack

• Training data set:
– # of subjects = 215
– Outcome variable = High/Low Risk determined
– 19 noninvasive clinical and lab variables were used as the predictors

Page 17: Tree Based Methods for Analyzing  Tissue Microarray Data

CART construction

[Figure: classification tree for the heart attack example. Splits: "Is BP <= 91?", "Is age <= 62.5?", "Is ST present?", each with Yes/No branches. Node class proportions shown (High/Low): 12%/88%, 17%/83%, 70%/30%, 11%/89%, 50%/50%, 2%/98%, 23%/77%. Two terminal nodes are classified as high risk and two as low risk.]

Page 18: Tree Based Methods for Analyzing  Tissue Microarray Data

CART Construction

BINARY RECURSIVE PARTITIONING

• Binary: split parent node into two child nodes

• Recursive: each child node can be treated as parent node

• Partitioning: data set is partitioned into mutually exclusive subsets in each split
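Binary recursive partitioning can be sketched as follows. This is a simplified illustration rather than the CART software itself: it uses Gini impurity as the splitting criterion (one common choice; the slides do not say which criterion is used), omits pruning, and runs on made-up (blood pressure, age) data. All function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (feature index, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for j in range(len(rows[0])):
        # Skip the largest value so that both child nodes are non-empty.
        for t in sorted(set(r[j] for r in rows))[:-1]:
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (j, t), score
    return best

def grow_tree(rows, labels, min_size=2):
    """Binary: two children. Recursive: each child is grown like a parent."""
    split = best_split(rows, labels) if len(rows) >= min_size else None
    if split is None:
        # Terminal node: predict the most frequent class.
        return max(set(labels), key=labels.count)
    j, t = split
    # Partitioning: the two index sets are mutually exclusive.
    yes = [i for i, r in enumerate(rows) if r[j] <= t]
    no = [i for i, r in enumerate(rows) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "yes": grow_tree([rows[i] for i in yes], [labels[i] for i in yes], min_size),
        "no": grow_tree([rows[i] for i in no], [labels[i] for i in no], min_size),
    }

def predict(tree, row):
    """Follow the Yes/No branches until a terminal node is reached."""
    while isinstance(tree, dict):
        tree = tree["yes"] if row[tree["feature"]] <= tree["threshold"] else tree["no"]
    return tree

# Made-up training data: (blood pressure, age) with a high/low risk label.
rows = [(85, 70), (90, 50), (120, 40), (130, 55), (110, 75), (125, 80)]
labels = ["high", "high", "low", "low", "high", "high"]
tree = grow_tree(rows, labels)
```

On this toy data the root splits on blood pressure and the "no" branch is split again on age, mirroring the nested question structure of a CART tree.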

Page 19: Tree Based Methods for Analyzing  Tissue Microarray Data

RF Construction

Page 20: Tree Based Methods for Analyzing  Tissue Microarray Data

Prediction by plurality voting

• The forest consists of N trees.

• Class prediction:
– Each tree votes for a class; the predicted class C for an observation is the plurality, $\arg\max_C \sum_{k=1}^{N} [f_k(x, T) = C]$

• Regression random forest:
– predicted value is the average prediction
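The two aggregation rules can be illustrated with a short sketch. The per-tree predictions below are made up, and the function names are hypothetical; growing the trees themselves is not shown.

```python
from collections import Counter

def plurality_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    # most_common(1) gives the (class, count) pair with the highest count.
    return counts.most_common(1)[0][0]

def average_prediction(tree_predictions):
    """Regression forest: average the per-tree predicted values."""
    return sum(tree_predictions) / len(tree_predictions)

# Made-up predictions from N = 5 trees for one observation.
votes = ["high", "low", "low", "high", "low"]
predicted_class = plurality_vote(votes)

# Made-up predictions from 3 regression trees.
predicted_value = average_prediction([2.0, 3.0, 4.0])
```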

Page 21: Tree Based Methods for Analyzing  Tissue Microarray Data

Clustering with random forest predictors

Page 22: Tree Based Methods for Analyzing  Tissue Microarray Data

Intrinsic Proximity Measure

• Terminal tree nodes contain few observations

• If case i and case j both land in the same terminal node, increase the proximity between i and j by 1.

• At the end of the run divide by 2* no. of trees.

• Dissimilarity=sqrt(1-Proximity)
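A minimal sketch of the proximity computation, assuming the terminal node of every case in every tree is already known (the `leaf_ids` values below are made up). One labeled assumption: this sketch counts each pair once per tree and therefore normalizes by the number of trees, so that a case has proximity 1 with itself; the factor of 2 on the slide may reflect a different counting convention in the original software.

```python
from math import sqrt

def rf_proximity(leaf_ids):
    """Proximity matrix from terminal node assignments.

    leaf_ids[t][i] is the terminal node that case i lands in for tree t.
    """
    n_trees = len(leaf_ids)
    n_cases = len(leaf_ids[0])
    prox = [[0.0] * n_cases for _ in range(n_cases)]
    for leaves in leaf_ids:
        for i in range(n_cases):
            for j in range(n_cases):
                # Same terminal node: increase the proximity by 1.
                if leaves[i] == leaves[j]:
                    prox[i][j] += 1.0
    # Assumption of this sketch: divide by the number of trees.
    return [[v / n_trees for v in row] for row in prox]

def rf_dissimilarity(prox):
    """Dissimilarity = sqrt(1 - proximity), guarded against rounding below 0."""
    return [[sqrt(max(0.0, 1.0 - p)) for p in row] for row in prox]

# Made-up terminal node ids for 4 cases in 4 trees.
leaf_ids = [
    [0, 0, 1, 1],
    [0, 1, 1, 2],
    [3, 3, 3, 4],
    [5, 5, 6, 6],
]
prox = rf_proximity(leaf_ids)
diss = rf_dissimilarity(prox)
```

Cases 0 and 1 share a terminal node in 3 of the 4 trees, so their proximity is 0.75 and their dissimilarity is 0.5; cases 0 and 3 never share a node and get the maximal dissimilarity of 1.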

Page 23: Tree Based Methods for Analyzing  Tissue Microarray Data

Casting an unsupervised problem into a supervised RF problem

• Key Idea (Breiman 1999)
– Label observed data as class 1
– Generate synthetic observations and label them as class 2
– Construct a RF predictor to distinguish class 1 from class 2
– Use the resulting dissimilarity measure in unsupervised analysis

Page 24: Tree Based Methods for Analyzing  Tissue Microarray Data

How to generate synthetic observations

• Synthetic observations are simulated to contain no clusters
– e.g. by randomly sampling from the product of empirical marginal distributions of the input.
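Sampling from the product of empirical marginals amounts to resampling each variable independently of the others, which keeps each variable's distribution but destroys the dependence between variables. The sketch below is one way to do this, not the authors' implementation; the function names and the (Pos, Max) values are hypothetical, and generating as many synthetic as observed cases is an assumption.

```python
import random

def synthetic_from_marginals(data, rng):
    """Sample each variable independently from its empirical marginal."""
    n = len(data)
    columns = list(zip(*data))
    # Draw n values with replacement from every column, independently.
    sampled = [[rng.choice(col) for _ in range(n)] for col in columns]
    return [tuple(row) for row in zip(*sampled)]

def label_for_rf(observed, rng):
    """Stack observed (class 1) and synthetic (class 2) data for a RF predictor."""
    synthetic = synthetic_from_marginals(observed, rng)
    features = list(observed) + synthetic
    classes = [1] * len(observed) + [2] * len(synthetic)
    return features, classes

# Hypothetical (Pos, Max) scores for four patients.
observed = [(0, 1.0), (10, 1.0), (80, 3.0), (100, 3.0)]
features, classes = label_for_rf(observed, random.Random(0))
```

The resulting `features` and `classes` would then be passed to any random forest classifier; the proximities among the class 1 cases are the quantity of interest.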

Page 25: Tree Based Methods for Analyzing  Tissue Microarray Data

RF clustering

• Compute distance matrix from RF
– distance matrix = sqrt(1-proximity matrix)

• Compute the first 2-3 classical multi-dimensional scaling coordinates based on the distance matrix

• Conduct partitioning around medoids (PAM) clustering analysis
– input parameter = no. of clusters k
– use the Euclidean distance between the resulting scaling points
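The final PAM step can be sketched as below. This is a simplified variant, assuming the scaling coordinates are already available (the `coords` values are made up and the classical MDS step is not shown): it starts from the first k points instead of PAM's BUILD phase and then repeatedly applies the best medoid swap until the total distance to the nearest medoid stops decreasing. The function name `pam` is hypothetical and requires Python 3.8+ for `math.dist`.

```python
from math import dist  # Euclidean distance, Python 3.8+

def pam(points, k, max_iter=100):
    """Simplified partitioning around medoids on Euclidean distances."""
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]

    def total_cost(medoids):
        # Sum over all points of the distance to the nearest medoid.
        return sum(min(d[i][m] for m in medoids) for i in range(n))

    # Naive initialization (assumption of this sketch, not PAM's BUILD phase).
    medoids = list(range(k))
    cost = total_cost(medoids)
    for _ in range(max_iter):
        best_swap, best_cost = None, cost
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                c = total_cost(trial)
                if c < best_cost:
                    best_swap, best_cost = trial, c
        if best_swap is None:
            break  # no swap improves the cost
        medoids, cost = best_swap, best_cost
    # Assign every point to the cluster of its nearest medoid.
    labels = [min(range(k), key=lambda c: d[i][medoids[c]]) for i in range(n)]
    return medoids, labels

# Made-up 2-D scaling coordinates forming two well separated groups.
coords = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
medoids, labels = pam(coords, k=2)
```

Because medoids are actual data points, each cluster is represented by one patient, which can help when interpreting clusters clinically.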

Page 26: Tree Based Methods for Analyzing  Tissue Microarray Data

Theoretical Study of RF Clustering

Ref: Using random forest proximity for unsupervised learning, BIOKDD-CBGI'03, 7th Joint Conference on Information Sciences, Cary, North Carolina.

Page 27: Tree Based Methods for Analyzing  Tissue Microarray Data

Applying Random Forest Clustering to Tissue Microarray Data--Application to Kidney Cancer

Tao Shi and Steve Horvath

Page 28: Tree Based Methods for Analyzing  Tissue Microarray Data

Scientific Question: Can one discover cancer subtypes based on the protein expression patterns of tumor markers?

Page 29: Tree Based Methods for Analyzing  Tissue Microarray Data

Why use RF clustering for TMA data?

• no need to transform the often highly skewed features
– based on ranks of features

• natural way of weighing tumor marker contributions to the dissimilarity

• elegant way to deal with missing covariates

• intrinsic proximity matrix handles mixed variable types well

Page 30: Tree Based Methods for Analyzing  Tissue Microarray Data

Kidney Multi-marker Data

• 366 patients with Renal Cell Carcinoma (RCC) admitted to UCLA between 1989 and 2000.

• Immunohistological measures of a total of 8 tumor markers were obtained from tissue microarrays constructed from the tumor samples of these patients.

Page 31: Tree Based Methods for Analyzing  Tissue Microarray Data

MDS plot of clear cell patients

• Labeled and colored by their RF cluster

[Figure: classical MDS ("cmd") plot of coordinate 1 versus coordinate 2; each patient is plotted as the number of their RF cluster (1, 2 or 3).]

Page 32: Tree Based Methods for Analyzing  Tissue Microarray Data

Interpreting the clusters in terms of survival

[Figure: Kaplan-Meier (K-M) curves by RF cluster. x-axis: time to death (labeled "Months" on the slide, range 0 to 12); y-axis: survival (0.0 to 1.0). Log rank p value = 0.00037423]

Clustering label   Non-clear cell patients   Clear cell patients
1                  0                         92
2                  20                        215
3                  30                        9

Page 33: Tree Based Methods for Analyzing  Tissue Microarray Data

Hierarchical clustering with Euclidean distance leads to less satisfactory results

[Figure: cluster dendrogram from average-linkage hierarchical clustering, hclust (*, "average") applied to dist(KidneyRF); y-axis: height; leaves labeled 0 or 1.]

Clustering label   Non-clear cell patients   Clear cell patients
1                  9 (20)                    286 (307)
2                  41 (30)                   30 (9)

* RF clustering grouping in parentheses (shown in red on the slide)

Page 34: Tree Based Methods for Analyzing  Tissue Microarray Data

Euclidean vs. RF Distance

[Figure: scatter plot of RF distance (y-axis) versus Euclidean distance (x-axis).]

Page 35: Tree Based Methods for Analyzing  Tissue Microarray Data

Molecular grouping vs. Pathological grouping

Message: molecular grouping is superior to pathological grouping

[Figure: two Kaplan-Meier panels. x-axis: time to death (years), 0 to 12; y-axis: survival, 0.0 to 1.0.
Molecular Grouping: 327 patients in clusters 1 and 2; 39 patients in cluster 3.
Pathological Grouping: 316 clear cell patients; 50 non-clear cell patients.
p-values shown on the slide: p = 0.0229 and p = 9.03e-05.]

Page 36: Tree Based Methods for Analyzing  Tissue Microarray Data

Identify “irregular” patients

Clustering label   Non-clear cell patients   Clear cell patients
1                  0                         92
2                  20                        215
3                  30                        9

Message: molecular grouping can be used to refine the clear cell definition.

[Figure: Kaplan-Meier curves for 9 irregular clear cell patients, 307 regular clear cell patients and 50 non-clear cell patients. x-axis: time to death (years); y-axis: survival; p = 0.00522.]

Page 37: Tree Based Methods for Analyzing  Tissue Microarray Data

Detect novel cancer subtypes

• Group clear cell grade 2 patients into two clusters with significantly different survival.

[Figure: Kaplan-Meier (K-M) curves for the two clusters. x-axis: time to death (years); y-axis: survival; p value = 0.0125.]

Page 38: Tree Based Methods for Analyzing  Tissue Microarray Data

Results TMA clustering

• Clusters reproduce well-known clinical subgroups
– Ex: global expression differences between clear cell and non-clear cell patients
– RF clustering works better than clustering based on the Euclidean distance for TMA data

• RF clustering allows one to identify “outlying” tumor samples.

• Can detect previously unknown sub-groups

Page 39: Tree Based Methods for Analyzing  Tissue Microarray Data

Boxplots of tumor marker expression vs. cluster

[Figure: boxplots of tumor marker expression (y-axis) by cluster 1, 2, 3 (x-axis), one panel per marker, with p-values:
CA9MemPosMn: p = 9.95e-28
CA12MemPosMn: p = 4.61e-15
Ki67PosMn: p = 3.51e-13
GePosHarriMn: p = 3.33e-21
p53PosMn: p = 1.7e-10
EpDctPosMn: p = 1.64e-14
pTENPosMn: p = 1.43e-27
VimPos: p = 7.97e-14]

Message: clusters can be explained in terms of tumor marker expression values, i.e., in terms of biological pathways.

Page 40: Tree Based Methods for Analyzing  Tissue Microarray Data

Conclusions

• There is a need to develop tailor-made data mining methods for TMA data
– Major differences:
• highly non-normal data
• Euclidean distance metric seems to be suboptimal for TMA data

• tree or forest based methods work well for kidney and prostate TMA data