Survival-Time Classification of Breast Cancer Patients
DIMACS Workshop on Data Mining and Scalable Algorithms, August 22-24, 2001, Rutgers University
Y.-J. Lee & O. L. Mangasarian
Second Annual Review, June 1, 2001
Data Mining Institute University of Wisconsin - Madison
American Cancer Society: 2001 Breast Cancer Estimates
Breast cancer, the most common cancer among women, is the second leading cause of cancer deaths in women (after lung cancer)
192,200 new cases of breast cancer in women will be diagnosed in the United States
40,600 deaths will occur from breast cancer (40,200 among women, 400 among men) in the United States
According to the World Health Organization, more than 1.2 million people will be diagnosed with breast cancer this year worldwide
Key Objective
Identify breast cancer patients for whom adjuvant chemotherapy prolongs survival time
Main Difficulty: Cannot carry out comparative tests on human subjects
  Similar patients must be treated similarly
Our Approach: Classify patients into good, intermediate & poor groups
  Characterize classes by: Tumor size & lymph node status
  Classification based on: 5 cytological features plus tumor size
Principal Results for 253 Breast Cancer Patients
All 69 patients in the good group:
  Had no chemotherapy
  Had the best survival rate
All 73 patients in the poor group:
  Had chemotherapy
  Had the worst survival rate
For the 111 patients in the intermediate group:
  The 67 patients who had chemotherapy had a better survival rate than the 44 patients who did not have chemotherapy
This last result reverses the role of chemotherapy seen in both the overall population and the good & poor groups
Outline
Tools used:
  Support vector machines (SVMs): feature selection, classification
  Clustering: k-Median (k-Mean fails!)
Cluster chemo patients into chemo-good & chemo-poor
Cluster no-chemo patients into no-chemo-good & no-chemo-poor
Three final classes:
  Good = No-chemo good
  Poor = Chemo poor
  Intermediate = Remaining patients
Generate survival curves for three classes
Use SVM to classify new patients into one of above three classes
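Simplest Support Vector Machine: Linear Surface Maximizing the Margin
[Figure: point sets A+ and A- separated by the bounding planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$, with normal $w$]
Margin $= \frac{2}{\|w\|_2}$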
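The margin between the two bounding planes $x'w = \gamma \pm 1$ can be checked numerically. The sketch below is only an illustration: the function names `margin_width` and `classify`, and the example values of `w` and `gamma`, are hypothetical and do not come from the slides.

```python
import math

def margin_width(w):
    """Distance between the planes x'w = gamma + 1 and x'w = gamma - 1."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0:
        raise ValueError("w must be nonzero")
    return 2.0 / norm

def classify(x, w, gamma):
    """Return +1 (side of A+) or -1 (side of A-) from the sign of x'w - gamma."""
    score = sum(xi * wi for xi, wi in zip(x, w)) - gamma
    return 1 if score >= 0 else -1

# Hypothetical numbers chosen only for illustration.
w_example = [3.0, 4.0]
gamma_example = 1.0
width = margin_width(w_example)  # 2 / ||w||_2 with ||w||_2 = 5
```

A larger $\|w\|_2$ gives a narrower margin, which is why maximizing the margin amounts to keeping $\|w\|_2$ small.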
Clustering in Data Mining
General Objective
Given: A dataset of m points in n-dimensional real space
Problem: Extract hidden distinct properties by clustering the dataset
Concave Minimization Formulation of Clustering Problem
Given: Set $A$ of $m$ points in $R^n$ represented by the matrix $A \in R^{m \times n}$, and a number $k$ of desired clusters
Problem: Determine centers $C_\ell \in R^n$, $\ell = 1, \ldots, k$, such that the sum over the points $A_i$, $i = 1, \ldots, m$, of the minimum over $\ell \in \{1, \ldots, k\}$ of the 1-norm distance between $A_i$ and the cluster centers $C_\ell$ is minimized
Objective: Sum of $m$ minima of $k$ linear functions, hence it is piecewise-linear concave
Difficulty: Minimizing a general piecewise-linear concave function over a polyhedral set is NP-hard
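To make the objective concrete, the following minimal sketch evaluates it for a fixed set of centers: for each point, take the smallest 1-norm distance to any center, then sum over points. The helper names `one_norm` and `k_median_objective` and the toy data are assumptions made for illustration; the code only evaluates the objective and does not minimize it.

```python
def one_norm(p, q):
    """1-norm distance between two points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))

def k_median_objective(points, centers):
    """Sum over points of the minimum 1-norm distance to any center."""
    return sum(min(one_norm(p, c) for c in centers) for p in points)

# Toy data (hypothetical): three points in the plane, two candidate centers.
toy_points = [[0, 0], [1, 1], [10, 10]]
toy_centers = [[0, 0], [10, 10]]
value = k_median_objective(toy_points, toy_centers)  # 0 + 2 + 0
```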
Clustering via Concave Minimization
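Minimize the sum of 1-norm distances between each data point $A_i$ and the closest cluster center $C_\ell$:

$$\min_{C_\ell,\, D_{i\ell}} \; \sum_{i=1}^{m} \; \min_{\ell = 1, \ldots, k} \{ e' D_{i\ell} \}$$
$$\text{s.t.} \quad -D_{i\ell} \le A_i' - C_\ell \le D_{i\ell}, \quad i = 1, \ldots, m; \; \ell = 1, \ldots, k$$

Reformulation:

$$\min_{C_\ell,\, D_{i\ell} \in R^n;\; T_{i\ell} \in R} \; \sum_{i=1}^{m} \sum_{\ell=1}^{k} T_{i\ell}\, e' D_{i\ell}$$
$$\text{s.t.} \quad -D_{i\ell} \le A_i' - C_\ell \le D_{i\ell}, \quad \sum_{\ell=1}^{k} T_{i\ell} = 1, \quad T_{i\ell} \ge 0, \quad i = 1, \ldots, m; \; \ell = 1, \ldots, k$$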
Finite K-Median Clustering Algorithm (Minimizing Piecewise-linear Concave Function)
Step 0 (Initialization): Given k initial cluster centers. Different initial centers will lead to different clusters
Step 1 (Cluster Assignment): Assign points to the cluster with the nearest cluster center in 1-norm
Step 2 (Center Update): Recompute location of center for each cluster as the cluster median (closest point to all cluster points in 1-norm)
Step 3 (Stopping Criterion): Stop if the cluster centers are unchanged, else go to Step 1
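The four steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `k_median`, the `max_iter` safeguard, and the choice to leave an empty cluster's center where it was are all assumptions added here. The center update uses the coordinate-wise median, which minimizes the sum of 1-norm distances to the cluster's points.

```python
from statistics import median

def one_norm(p, q):
    """1-norm (Manhattan) distance between two equal-length points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def k_median(points, initial_centers, max_iter=100):
    """Alternate cluster assignment and coordinate-wise median updates."""
    centers = [list(c) for c in initial_centers]  # Step 0: given initial centers
    labels = []
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest center in the 1-norm.
        labels = [
            min(range(len(centers)), key=lambda j: one_norm(p, centers[j]))
            for p in points
        ]
        # Step 2: move each center to the coordinate-wise median of its cluster.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                new_centers.append(old)  # assumption: keep an empty cluster's center
            else:
                new_centers.append([median(col) for col in zip(*members)])
        # Step 3: stop when no center moved, else repeat from Step 1.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Toy data (hypothetical): two well-separated groups in the plane.
toy = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
toy_centers, toy_labels = k_median(toy, [[0, 0], [10, 10]])
```

Because the result depends on the starting centers, the slides seed the algorithm with medians of clinically defined groups rather than random points.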
Clustering Process: Feature Selection & Initial Cluster Centers
6 out of 31 features selected by a linear SVM
  SVM separating lymph node positive (Lymph>0) from lymph node negative (Lymph=0)
Clustering performed in 6-dimensional feature space
Initial cluster centers used:
  Good: Median in 6-dimensional space of patients with Lymph=0 AND Tumor<2
  Poor: Median in 6-dimensional space of patients with Lymph>4 OR Tumor>=4
Typical indicator for chemotherapy
Clustering Process: 253 Patients
Good1: Lymph=0 AND Tumor<2 (compute median using 6 features)
Poor1: Lymph>=5 OR Tumor>=4 (compute median using 6 features)
Intermediate1: (0<Lymph<5 & Tumor<4) OR (Lymph<5 & 2<=Tumor<4)
Cluster 113 NoChemo patients: use k-Median algorithm with initial centers: medians of Good1 & Poor1
  Result: 69 NoChemo Good, 44 NoChemo Poor
Cluster 140 Chemo patients: use k-Median algorithm with initial centers: medians of Good1 & Poor1
  Result: 67 Chemo Good, 73 Chemo Poor
Final classes: Good (69 NoChemo Good), Intermediate (44 NoChemo Poor + 67 Chemo Good), Poor (73 Chemo Poor)
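The seeding rules and the way the four clusters collapse into three classes can be written as two small functions. This is a sketch only: the function names, argument names, and string labels are hypothetical, and the thresholds are copied from the Good1, Poor1 and Intermediate1 definitions on this slide.

```python
def initial_group(lymph, tumor):
    """Seed group used to compute the initial cluster medians."""
    if lymph == 0 and tumor < 2:
        return "Good1"
    if lymph >= 5 or tumor >= 4:
        return "Poor1"
    # Everything else: (0<Lymph<5 & Tumor<4) OR (Lymph<5 & 2<=Tumor<4).
    return "Intermediate1"

def final_class(had_chemo, cluster):
    """Combine the chemo flag with the k-median cluster ('good' or 'poor')."""
    if not had_chemo and cluster == "good":
        return "Good"          # NoChemo Good
    if had_chemo and cluster == "poor":
        return "Poor"          # Chemo Poor
    return "Intermediate"      # NoChemo Poor or Chemo Good

# Group sizes reported on the slide, used here as a consistency check.
sizes = {"Good": 69, "Intermediate": 44 + 67, "Poor": 73}
total_patients = sum(sizes.values())
```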
Survival Curves for Good, Intermediate & Poor Groups
Survival Curves for Intermediate Group: Split by Chemo & NoChemo
Survival Curves for All Patients: Split by Chemo & NoChemo
Survival Curves for Intermediate Group: Split by Lymph Node & Chemotherapy
Survival Curves for All Patients: Split by Lymph Node Positive & Negative
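The slides do not say how the survival curves were estimated. Assuming the standard Kaplan-Meier product-limit estimator, one curve per group could be computed roughly as follows; the function name `kaplan_meier`, the input layout, and the toy data are hypothetical and are not patient data from the study.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with an observed death.

    times:  follow-up time for each patient
    events: 1 if death was observed at that time, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed  # deaths and censored patients both leave the risk set
    return curve

# Toy data (hypothetical): four patients, the second one censored.
toy_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Running the function separately on each group (for example Good, Intermediate and Poor) gives curves that can be plotted together for comparison.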
Nonlinear SVM Classifier: 82.7% Tenfold Test Correctness
[Flowchart; box labels as recovered from the slide:]
Four clusters: Good, ChemoGood, NoChemoPoor, Poor
Good2: Good & ChemoGood
Poor2: NoChemoPoor & Poor
SVM: Not Poor vs. Not Good
Compute LI(x) & CI(x) (on each branch)
SVM: Good vs. Intermediate
SVM: Poor vs. Intermediate
Conclusion
By using five features from a fine needle aspirate & tumor size, breast cancer patients can be classified into 3 classes
  Good: Requiring no chemotherapy
  Intermediate: Chemotherapy recommended for longer survival
  Poor: Chemotherapy may or may not enhance survival
3 classes have very distinct survival curves
First categorization of a breast cancer group for which chemotherapy enhances longevity