Thanh Le, Katheleen J. Gardiner University of Colorado Denver

A validation method for fuzzy clustering
A biological problem of gene expression data

Thanh Le, Katheleen J. Gardiner
University of Colorado Denver

July 18th, 2011

Overview

• Introduction: data clustering approaches and current challenges
• fzBLE: a novel method for validation of clustering results
• Datasets: artificial and real datasets for testing fzBLE
• Experimental results
• Discussion: advantages and limitations of fzBLE

Clustering problem

• Genes are clustered based on similarity or dissimilarity
• Clusters are described by:
  • Boundaries and overlaps
  • Number of clusters
  • Compactness within clusters
  • Separation between clusters

Clustering approaches

• Hierarchical approach
• Partitioning approach
  • Hard clustering approach: crisp cluster boundaries, crisp cluster membership
  • Soft/fuzzy clustering approach: overlapping cluster boundaries, soft/fuzzy membership; appropriate for many real-world problems

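Fuzzy C-Means algorithm

The model:

$$J(X \mid U, V) = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^{m} \, \lVert x_i - v_k \rVert^{2} \;\to\; \min, \qquad m > 1$$

subject to

$$\sum_{k=1}^{c} u_{ki} = 1, \qquad i = 1..n$$

Features:
• Fuzzy membership, soft cluster boundaries
• One gene can belong to multiple clusters and be assigned to multiple biological processes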
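For reference, minimising the Fuzzy C-Means objective J(X|U,V) under the sum-to-one constraint leads to the standard alternating updates below. This is the textbook result rather than something spelled out on the slide; here u_ki is the membership of gene x_i in cluster k, v_k is the centre of cluster k, and m > 1 is the fuzzifier.

```latex
\[
v_k = \frac{\sum_{i=1}^{n} u_{ki}^{m}\, x_i}{\sum_{i=1}^{n} u_{ki}^{m}},
\qquad
u_{ki} = \left[ \sum_{j=1}^{c}
  \left( \frac{\lVert x_i - v_k \rVert}{\lVert x_i - v_j \rVert} \right)^{2/(m-1)}
\right]^{-1}
\]
```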

Fuzzy C-Means (contd.)

• Possibility-based model
• Model parameters estimated using an iteration process
• Rapid convergence
• Most appropriate for gene expression data
• Challenges:
  • Determining the number of clusters
  • Avoiding local optima
  • Assessing the goodness-of-fit to validate clustering results
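The iteration process mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration of standard Fuzzy C-Means, not the authors' implementation; the function name `fuzzy_c_means`, its parameters (`m`, `max_iter`, `tol`, `seed`) and the toy data are all assumptions made for the example.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Return (memberships, centers); memberships[k][i] is u_ki."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, normalised so each point's memberships sum to 1.
    u = [[rng.random() for _ in range(n)] for _ in range(c)]
    for i in range(n):
        total = sum(u[k][i] for k in range(c))
        for k in range(c):
            u[k][i] /= total

    centers = []
    for _ in range(max_iter):
        # Center update: membership-weighted mean with weights u_ki^m.
        centers = []
        for k in range(c):
            w = [u[k][i] ** m for i in range(n)]
            s = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / s
                            for d in range(dim)])
        # Membership update from the relative distances to all centers.
        shift = 0.0
        for i in range(n):
            dist = [math.dist(points[i], centers[k]) for k in range(c)]
            zeros = [k for k in range(c) if dist[k] == 0.0]
            for k in range(c):
                if zeros:
                    # Point sits exactly on one or more centers.
                    new = 1.0 / len(zeros) if k in zeros else 0.0
                else:
                    new = 1.0 / sum((dist[k] / dist[j]) ** (2.0 / (m - 1.0))
                                    for j in range(c))
                shift = max(shift, abs(new - u[k][i]))
                u[k][i] = new
        if shift < tol:  # stop once memberships no longer change
            break
    return u, centers


# Toy data (hypothetical): two well-separated groups of three points.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
u, v = fuzzy_c_means(data, c=2)
```

Because the result depends on the random start, a practical run would repeat the procedure with several seeds, which is one way to address the local-optima challenge listed above.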

Methods for fuzzy clustering validation

• Methods based on compactness and separation
  • Problem: over-fitting - the larger the number of clusters, the better the cluster index appears.
  • No rationale for how to scale the two factors (compactness and separation) in the model.
• Methods based on goodness of fit
  • Statistical approach: Expectation-Maximization (EM) method
  • Problem: slow convergence, particularly at cluster boundaries, because of the exponential function.
  • Inappropriate for real datasets because the model assumes a data distribution (Gaussian, chi-squared, …).
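To make the compactness-and-separation family concrete, here is a small sketch of three indices that appear in the result tables: the partition coefficient (PC), the partition entropy (PE) and the Xie-Beni index (XB). The formulas are the commonly cited definitions; the function names and the toy inputs are assumptions for illustration, and `u[k][i]` is taken to be the membership of point i in cluster k.

```python
import math


def partition_coefficient(u):
    """PC = (1/n) * sum of u_ki^2; values near 1 indicate crisp clusters."""
    n = len(u[0])
    return sum(x * x for row in u for x in row) / n


def partition_entropy(u):
    """PE = -(1/n) * sum of u_ki * ln(u_ki); values near 0 indicate crisp clusters."""
    n = len(u[0])
    return -sum(x * math.log(x) for row in u for x in row if x > 0.0) / n


def xie_beni(points, u, centers, m=2.0):
    """XB = compactness / (n * smallest squared distance between two centers).

    Needs at least two centers; smaller values are better.
    """
    n = len(points)
    compact = sum((u[k][i] ** m) * math.dist(points[i], centers[k]) ** 2
                  for k in range(len(centers)) for i in range(n))
    separation = min(math.dist(a, b) ** 2
                     for idx, a in enumerate(centers)
                     for b in centers[idx + 1:])
    return compact / (n * separation)


# Toy example (hypothetical): four points in two perfectly crisp clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
crisp = [[1, 1, 0, 0], [0, 0, 1, 1]]
centers = [(0.0, 1.0), (10.0, 1.0)]
pc = partition_coefficient(crisp)
pe = partition_entropy(crisp)
xb = xie_beni(points, crisp, centers)
```

XB divides a compactness term by a separation term, which is the kind of two-factor combination whose scaling the slide questions.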

The fzBLE method for cluster validation

1. Cluster using the Fuzzy C-Means clustering algorithm
2. Validate using the goodness-of-fit (the log-likelihood estimator) and a Bayesian approach

Cluster validation: goodness-of-fit & fuzzy clustering

1. Convert the possibility model into a probability model
2. Use a Bayesian approach to compute the statistics
3. Apply the Central Limit Theorem to effectively represent the data distribution
4. Select the model based on goodness-of-fit
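The slides do not give the formulas behind these four steps, so the following is only a rough sketch of the general idea of scoring a fuzzy partition by a log-likelihood. It makes several assumptions that are not taken from the presentation: cluster priors are the mean memberships, each cluster is summarised by a membership-weighted Gaussian with diagonal covariance, and the helper name `weighted_log_likelihood` is hypothetical. It should not be read as the fzBLE estimator itself.

```python
import math


def weighted_log_likelihood(points, u, var_floor=1e-6):
    """Log-likelihood of the data under Gaussians fitted from memberships u[k][i].

    Assumes each point's memberships sum to 1 and no cluster is empty.
    """
    n, dim, c = len(points), len(points[0]), len(u)
    priors, means, variances = [], [], []
    for k in range(c):
        mass = sum(u[k])
        priors.append(mass / n)  # assumed prior: mean membership of cluster k
        mu = [sum(u[k][i] * points[i][d] for i in range(n)) / mass
              for d in range(dim)]
        var = [max(var_floor,
                   sum(u[k][i] * (points[i][d] - mu[d]) ** 2
                       for i in range(n)) / mass)
               for d in range(dim)]
        means.append(mu)
        variances.append(var)

    total = 0.0
    for x in points:
        mix = 0.0
        for k in range(c):
            log_pdf = sum(-0.5 * math.log(2.0 * math.pi * variances[k][d])
                          - (x[d] - means[k][d]) ** 2 / (2.0 * variances[k][d])
                          for d in range(dim))
            mix += priors[k] * math.exp(log_pdf)
        total += math.log(max(mix, 1e-300))  # guard against underflow
    return total


# Toy 1-D data (hypothetical): two tight groups.
points = [(0.0,), (0.2,), (0.4,), (5.0,), (5.2,), (5.4,)]
one_cluster = [[1.0] * 6]
two_clusters = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]]
ll_one = weighted_log_likelihood(points, one_cluster)
ll_two = weighted_log_likelihood(points, two_clusters)
```

On this toy data the two-cluster partition scores higher than the single-cluster one, which is the behaviour a goodness-of-fit criterion is expected to show when the partition matches the data.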

Datasets

• Artificial datasets
  • Finite mixture model based datasets
• Real datasets
  • Iris, Wine and Glass datasets at the UC Irvine Machine Learning Repository
  • Gene datasets, which are more complex:
    • Yeast cell cycle gene expression (Yeast)
    • Yeast gene functional annotations (Yeast-MIPS)
    • Rat Central Nervous System (RCNS) gene expression

Experimental results on artificial datasets

# clusters  fzBLE  PC    PE    FS    XB    CWB   PBMF  BR    CF
3           1.00   0.42  0.42  0.42  0.42  1.00  1.00  0.83  0.00
4           1.00   0.92  0.92  0.92  0.83  1.00  1.00  1.00  0.00
5           1.00   0.75  0.75  0.83  0.75  0.83  1.00  1.00  0.00
6           1.00   0.92  0.83  0.92  0.58  0.58  1.00  0.92  0.00
7           1.00   0.83  0.83  0.83  0.67  0.58  1.00  0.67  0.00
8           1.00   1.00  0.92  1.00  0.92  0.67  1.00  0.83  0.00
9           1.00   0.92  0.67  0.92  0.67  0.33  1.00  0.83  0.00

PC: partition coefficient; PE: partition entropy; FS: Fukuyama-Sugeno; XB: Xie and Beni; CWB: Compose Within and Between scattering; PBMF: Pakhira, Bandyopadhyay and Maulik Fuzzy; BR: Rezaee B.; CF: Compactness factor. loop=5, #cluster range=[2,12]

Correctness ratios in determining the number of clusters
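As a reading aid for the table, a correctness ratio can be computed as the fraction of test datasets on which an index picks the true number of clusters. The slides do not state how many datasets were generated per cluster count; the tabulated values are consistent with twelfths (for example 0.42 is about 5/12), so the sketch below assumes 12 datasets. The decisions listed are invented for illustration.

```python
def correctness_ratio(predicted, true_c):
    """Fraction of datasets for which an index selected the true cluster count."""
    if not predicted:
        raise ValueError("predicted must not be empty")
    hits = sum(1 for c in predicted if c == true_c)
    return hits / len(predicted)


# Hypothetical decisions of one index on 12 datasets that each contain 3 clusters.
decisions = [3, 3, 2, 3, 4, 3, 2, 2, 3, 2, 2, 2]
ratio = round(correctness_ratio(decisions, true_c=3), 2)
```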

Experimental results on Glass dataset

# clusters  fzBLE       PC      PE      FS       XB      CWB         PBMF    BR      CF
2           -1135.6886  0.8884  0.1776  0.3700   0.7222  6538.9311   0.3732  1.9817  0.5782
3           -1127.6854  0.8386  0.2747  0.1081   0.7817  4410.3006   0.4821  1.5004  0.4150
4           -1119.2457  0.8625  0.2515  -0.0630  0.6917  3266.5876   0.4463  1.0455  0.3354
5           -1123.2826  0.8577  0.2698  -0.1978  0.6450  2878.8912   0.4610  0.8380  0.2818
6           -1113.8339  0.8004  0.3865  -0.2050  1.4944  5001.1752   0.3400  0.8371  0.2430
7           -1116.5724  0.8183  0.3650  -0.2834  1.3802  5109.6082   0.3891  0.6914  0.2214
8           -1127.2626  0.8190  0.3637  -0.3948  1.4904  7172.2250   0.6065  0.5916  0.2108
9           -1117.7484  0.8119  0.3925  -0.3583  1.7503  8148.7667   0.3225  0.5634  0.1887
10          -1122.1585  0.8161  0.3852  -0.4214  1.7821  9439.3785   0.3909  0.4926  0.1758
11          -1121.9848  0.8259  0.3689  -0.4305  1.6260  9826.4211   0.3265  0.4470  0.1704
12          -1135.0453  0.8325  0.3555  -0.5183  1.4213  11318.4879  0.5317  0.3949  0.1591
13          -1138.9462  0.8317  0.3556  -0.5816  1.4918  14316.7592  0.6243  0.3544  0.1472

Cluster validity scores and decisions (highlighted in yellow) for each algorithm
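Since the yellow highlighting does not survive in a text transcript, the decisions can be recomputed from the scores. The sketch below does this for three of the columns, reading the wrapped fzBLE cells as four-decimal values. It assumes that fzBLE (a log-likelihood) and PC are maximised and that XB is minimised, which is the usual convention for PC and XB; the helper `select` is a hypothetical name.

```python
clusters = list(range(2, 14))

# Scores copied from the Glass table, one value per number of clusters (2..13).
fzble = [-1135.6886, -1127.6854, -1119.2457, -1123.2826, -1113.8339, -1116.5724,
         -1127.2626, -1117.7484, -1122.1585, -1121.9848, -1135.0453, -1138.9462]
pc = [0.8884, 0.8386, 0.8625, 0.8577, 0.8004, 0.8183,
      0.8190, 0.8119, 0.8161, 0.8259, 0.8325, 0.8317]
xb = [0.7222, 0.7817, 0.6917, 0.6450, 1.4944, 1.3802,
      1.4904, 1.7503, 1.7821, 1.6260, 1.4213, 1.4918]


def select(values, best=max):
    """Return the cluster count whose score is best (max or min)."""
    return clusters[values.index(best(values))]


decisions = {
    "fzBLE": select(fzble, max),  # assumed: larger log-likelihood is better
    "PC": select(pc, max),
    "XB": select(xb, min),
}
```

Under these assumptions fzBLE selects 6 clusters, PC selects 2 and XB selects 5.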

Experimental results on RCNS - more complex dataset; two-factor scaling issue

# clusters  fzBLE      PC      PE      FS         XB      CWB      PBMF    BR      CF
2           -580.0728  0.9942  0.0121  -568.7972  0.0594  5.5107   4.2087  1.1107  177.8094
3           -564.1986  0.9430  0.0942  -487.6104  0.4877  4.1309   4.2839  1.6634  117.9632
4           -561.0169  0.9142  0.1470  -430.4863  0.9245  6.1224   3.3723  1.3184  99.1409
5           -561.7420  0.8900  0.1941  -397.0935  1.3006  9.4770   2.6071  1.1669  88.5963
6           -552.9153  0.8695  0.2387  -300.6564  2.5231  20.6496  1.9499  1.1026  84.0905
7           -556.2905  0.8707  0.2386  -468.3121  2.1422  21.0187  2.8692  0.7875  57.5159
8           -555.3507  0.8925  0.2078  -462.0673  1.7245  20.0113  2.5323  0.5894  52.0348
9           -558.8686  0.8863  0.2192  -512.4278  1.6208  22.4772  2.6041  0.5019  45.9214
10          -565.8360  0.8847  0.2241  -644.1451  1.1897  21.9932  3.4949  0.3918  33.1378

Cluster validity scores and decisions (highlighted in yellow) for each algorithm

• 112 genes during RCNS development at 9 time points
• 6 clusters, 4 of which are functionally annotated (Somogyi et al. 1995, Wen et al. 1998)

Discussion: The advantages of fzBLE

• Performs better than other approaches on 3 levels of data.
• Compared with compactness-separation approaches:
  • Solves the over-fit problem using goodness-of-fit
  • Eliminates the need for two scaling factors
• Compared with the mixture model with EM approach:
  • Rapid convergence
  • No assumption on data distribution

Discussion: The limitations of fzBLE

• Depends on internal validity
• External validities are needed
  • Biological validity: GO terms, pathways, PPI
• Future work on gene expression:
  • Distance definition based on biological context
  • Combine fzBLE with biological homology and stability indices

Thank you!

Questions?

We acknowledge the support from:
• National Institutes of Health
• Linda Crnic Institute
• Vietnamese Ministry of Education and Training
