Instance-based Classification
Instance-based Classification
• Examine the training samples each time a new query instance is given.
• The relationship between the new query instance and training examples will be checked to assign a class label to the query instance.
KNN: k-Nearest Neighbor
• A test sample x can be best predicted by determining the most common class label among k training samples to which x is most similar.
• $x_j$: the jth training sample; $y_j$: the class label of $x_j$; $N_x$: the set of k nearest neighbors of x in the training set. Estimate the probability that x belongs to the ith class:
KNN: k-Nearest Neighbor, cont'd
• Proportion of the k nearest neighbors that belong to the ith class:
$$\hat{p}(i \mid x) = \frac{\left|\{\, x_j \in N_x : y_j = i \,\}\right|}{k}$$
• The class i that maximizes the proportion above is assigned as the label of x.
• Variants of KNN: filtering out irrelevant genes before applying KNN.
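The KNN rule above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular paper: Euclidean distance is an assumption (the slides do not name a similarity measure), and the function names and toy data are made up for the example.

```python
from collections import Counter
import math

def knn_class_proportions(x, train_X, train_y, k):
    """Return {class: proportion} among the k training samples nearest to x."""
    # Euclidean distance is an assumption; the slides do not fix a metric.
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(x, pair[0]),
    )
    neighbor_labels = [label for _, label in by_distance[:k]]
    counts = Counter(neighbor_labels)
    return {label: n / k for label, n in counts.items()}

def knn_predict(x, train_X, train_y, k):
    """Assign the class with the largest proportion among the k neighbors."""
    proportions = knn_class_proportions(x, train_X, train_y, k)
    return max(proportions, key=proportions.get)

# Toy data (made up): two expression values per sample.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_y = ["ALL", "ALL", "ALL", "AML", "AML"]
query = (1.1, 1.0)

proportions = knn_class_proportions(query, train_X, train_y, k=3)
prediction = knn_predict(query, train_X, train_y, k=3)
```

Every query re-scans the whole training set, which is what "instance-based" means here: no model is built ahead of time.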
Molecular Classification of Cancer
Class Discovery and Class Prediction by Gene Expression Monitoring
Publication Info
"Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"
Golub, Slonim, Tamayo, Huard, Gaasenbeek, Mesirov, Coller, Loh, Downing, Caligiuri, Bloomfield, Lander
Appears in Science, Volume 286, October 15, 1999
Whitehead Institute/MIT Center for Genome Research
http://www-genome.wi.mit.edu/cancer
...and Dana-Farber (Boston), St. Jude (Memphis), Ohio State
...additional publications by the same group show similar techniques applied to different disease modalities.
Cancer Classification
Class Discovery: defining previously unrecognized tumor subtypes
Class Prediction: assignment of particular tumor samples to already-defined classes
Given bone marrow samples:
Which cancer classes are present among the samples?
How many cancer classes? 2, 4?
Given that the samples are from leukemia patients, what type of leukemia is each sample (AML vs ALL)?
Leukemia: Definitions & Symptoms
Cancer of bone marrow
Myelogenous or lymphocytic, acute or chronic
Acute Myelogenous Leukemia (AML) vs Acute Lymphocytic Leukemia (ALL)
Marrow cannot produce appropriate amounts of red and white blood cells
Anemia -> weakness, minor infections; platelet deficiency -> easy bruising
AML: 10,000 new adult cases per year
ALL: 3,500/2,400 new adult/child cases per year
AML vs. ALL in adults & children
Leukemia: Treatment & expected outcome
Diagnosis via highly specialized laboratory
ALL: 58% survival rate
AML: 14% survival rate
Treatment: chemotherapy, bone marrow transplant
ALL: corticosteroids, vincristine, methotrexate, L-asparaginase
AML: daunorubicin, cytarabine
Correct diagnosis very important for treatment options and expected outcome!!!
Microarray could provide systematic diagnosis option
BUT ONLY ONE TYPE OF DIAGNOSTIC TOOL!!!
Leukemia: Data set
38 bone marrow samples (27 ALL, 11 AML)
6817 human gene probes
Cancer Class Prediction
• Learning Task
– Given: Expression profiles of leukemia patients
– Compute: A model distinguishing disease classes (e.g., AML vs. ALL patients) from expression data.
• Classification Task
– Given: Expression profile of a new patient + a learned model (e.g., one computed in a learning task)
– Determine: The disease class of the patient (e.g., whether the patient has AML or ALL)
Cancer Class Prediction
• n genes measured in m patients
g1,1 ... g1,n -> class1
g2,1 ... g2,n -> class2
...
gm,1 ... gm,n -> classm
Each row: vector for a patient
Cancer Class Prediction Approach
• Rank genes by their correlation with class variable (AML/ALL)
• Select subset of “informative” genes
• Have these genes do a weighted vote to classify a previously unclassified patient.
• Test validity of predictors.
Ranking Genes
• Rank genes by how predictive they are (individually) of the class…
g1,1 ... g1,n -> class1
g2,1 ... g2,n -> class2
...
gm,1 ... gm,n -> classm
Ranking Genes
• Split the expression values for a given gene g into two pools, one for each class (AML vs. ALL)
• Determine the mean $\mu$ and standard deviation $\sigma$ of each pool
• Rank genes by correlation metric (separation)
$$P(g, \text{class}) = \frac{\mu_{ALL}(g) - \mu_{AML}(g)}{\sigma_{ALL}(g) + \sigma_{AML}(g)}$$
The mean difference between the classes relative to the SD within the classes.
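The ranking steps above can be sketched as follows. This is a hedged illustration: it reads the metric as the difference of the two class means divided by the sum of the two class standard deviations, which is what "mean difference relative to the SD within the classes" suggests. The sample standard deviation (n - 1 denominator) is an assumption, and the gene names and values are made up.

```python
from statistics import mean, stdev

def separation_score(values_all, values_aml):
    """P(g, class) for one gene: difference of class means over sum of class SDs."""
    # stdev() is the sample SD; the slides do not say sample vs. population.
    return (mean(values_all) - mean(values_aml)) / (stdev(values_all) + stdev(values_aml))

def rank_genes(expression, labels):
    """expression: {gene: [value per patient]}; labels: 'ALL' or 'AML' per patient.
    Returns (gene, score) pairs, most ALL-like first, most AML-like last."""
    scores = {}
    for gene, values in expression.items():
        pool_all = [v for v, c in zip(values, labels) if c == "ALL"]
        pool_aml = [v for v, c in zip(values, labels) if c == "AML"]
        scores[gene] = separation_score(pool_all, pool_aml)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data (made up): three genes measured in six patients.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "gene_a": [9.0, 10.0, 11.0, 1.0, 2.0, 3.0],   # high in ALL
    "gene_b": [1.0, 2.0, 3.0, 9.0, 10.0, 11.0],   # high in AML
    "gene_c": [4.0, 5.0, 6.0, 4.0, 5.0, 6.0],     # uninformative
}
ranking = rank_genes(expression, labels)
```

Genes near the top of the ranking are high in ALL, genes near the bottom are high in AML, and genes near zero carry little class information.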
Neighborhood Analysis
Each gene g: V(g) = (e1, e2, …, en), where ei is the expression level of gene g in the ith sample.
Idealized pattern: c = (c1, c2, …, cn), where ci is 1 or 0 (sample i belongs to class 1 or class 2).
c*: idealized random pattern.
Count the number of genes having various levels of correlation with c, and compare with the corresponding distribution obtained for the random pattern c*.
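One way to picture the comparison against random patterns is a small permutation sketch. Several things here are assumptions for illustration only: the random pattern c* is generated by shuffling the class labels, "correlation" is measured with the mean-difference-over-SD score, and the threshold, gene names, and data are made up.

```python
import random
from statistics import mean, stdev

def score(values, pattern):
    """Separation of one gene between samples marked 1 and samples marked 0."""
    pool1 = [v for v, c in zip(values, pattern) if c == 1]
    pool0 = [v for v, c in zip(values, pattern) if c == 0]
    spread = stdev(pool1) + stdev(pool0)
    return (mean(pool1) - mean(pool0)) / spread if spread > 0 else 0.0

def neighborhood_size(expression, pattern, threshold):
    """Number of genes whose score with this pattern exceeds the threshold."""
    return sum(score(values, pattern) > threshold for values in expression.values())

def random_neighborhood_sizes(expression, pattern, threshold, n_permutations, seed=0):
    """Neighborhood sizes obtained for randomly shuffled patterns c*."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_permutations):
        shuffled = list(pattern)
        rng.shuffle(shuffled)  # keeps the class sizes, breaks the real labeling
        sizes.append(neighborhood_size(expression, shuffled, threshold))
    return sizes

# Toy data (made up): 4 genes, 8 samples, first four samples in class 1.
c = [1, 1, 1, 1, 0, 0, 0, 0]
expression = {
    "g1": [8, 9, 10, 11, 1, 2, 3, 4],
    "g2": [7, 9, 8, 10, 2, 1, 3, 2],
    "g3": [5, 4, 6, 5, 5, 6, 4, 5],
    "g4": [3, 7, 4, 6, 6, 3, 7, 4],
}
observed = neighborhood_size(expression, c, threshold=1.0)
null_sizes = random_neighborhood_sizes(expression, c, threshold=1.0, n_permutations=200)
fraction_as_large = sum(s >= observed for s in null_sizes) / len(null_sizes)
```

If only a small fraction of shuffled patterns gives a neighborhood as large as the observed one, the real class distinction has more correlated genes than chance would explain.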
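Selecting Informative Genes
• Select the kALL top-ranked genes (highly expressed in ALL) and the kAML bottom-ranked genes (highly expressed in AML)
$$P(g, \text{class}) = \frac{\mu_{ALL}(g) - \mu_{AML}(g)}{\sigma_{ALL}(g) + \sigma_{AML}(g)}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of gene g within each class.
In Golub’s paper, the 25 most positively correlated and the 25 most negatively correlated genes are selected.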
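Picking the two ends of the ranking is a short operation once each gene has a score. The sketch below assumes the scores are already computed and held in a dictionary; the function name, gene names, and score values are hypothetical.

```python
def select_informative_genes(scores, k_all, k_aml):
    """scores: {gene: P(g, class)}.
    Returns (genes highest in ALL, genes highest in AML)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_all = ranked[:k_all]          # most positive scores
    top_aml = ranked[::-1][:k_aml]    # most negative scores, most negative first
    return top_all, top_aml

# Toy scores (made up).
scores = {"g1": 2.7, "g2": 0.1, "g3": -1.9, "g4": 1.2, "g5": -0.4}
all_genes, aml_genes = select_informative_genes(scores, k_all=2, k_aml=2)
```

With k_all = k_aml = 25 this mirrors the 25 + 25 selection mentioned above. If k_all + k_aml exceeds the number of genes, the two lists overlap, so a caller may want to check for that.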
Determine significant genes
A 1% significance level means that 1% of random neighborhoods contain as many points as the observed neighborhood.
P(g,c) > 0.30: 709 genes (intersects the 1% significance level)
Median: ~150 genes (if totally random)
Weighted Voting
• Given a new patient to classify, each of the selected genes casts a weighted vote for only one class.
• The class that gets the most votes is the prediction.
Weighted Voting
• Suppose that x is the expression level measured for gene g in the patient
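$$V = P(g, \text{class}) \times \left| x - \frac{\mu_{ALL} + \mu_{AML}}{2} \right|$$
• P(g, class): weight for gene g, a weighting factor reflecting how well the gene is correlated with the class distinction
• $|x - (\mu_{ALL} + \mu_{AML})/2|$: distance from the measurement to the class boundary, reflecting the deviation of the expression level in the sample from the average of the AML and ALL means ($\mu_{ALL}$, $\mu_{AML}$: mean expression of gene g in each class)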
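Prediction
Weighted vote: $V_{AML} = \sum_i v_i w_i$, summed over the genes i whose vote is for AML, where $v_i = |x_i - (\mu_{AML} + \mu_{ALL})/2|$, $\mu_{AML}$ and $\mu_{ALL}$ are the class means of gene i, and $w_i$ is the weight of gene i.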
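A minimal sketch of the vote tally, under stated assumptions: each gene's signed vote is taken as P × (x − boundary), with the boundary halfway between the two class means. A positive value counts toward ALL and a negative value toward AML. This sign convention is an assumption that matches a score that is positive for genes high in ALL. The data layout and names are hypothetical.

```python
def weighted_votes(sample, gene_params):
    """sample: {gene: expression level x in the new patient}
    gene_params: {gene: (P, mean_all, mean_aml)} for the selected genes.
    Returns (V_ALL, V_AML): summed vote magnitudes for each class."""
    v_all = 0.0
    v_aml = 0.0
    for gene, (p, mean_all, mean_aml) in gene_params.items():
        boundary = (mean_all + mean_aml) / 2      # halfway between class means
        signed_vote = p * (sample[gene] - boundary)
        if signed_vote > 0:
            v_all += signed_vote                  # x lies on the ALL side
        else:
            v_aml += -signed_vote                 # x lies on the AML side
    return v_all, v_aml

# Toy values (made up): (P, mean in ALL, mean in AML) per gene.
gene_params = {
    "g1": (2.0, 10.0, 2.0),    # high in ALL
    "g2": (-1.5, 1.0, 7.0),    # high in AML
    "g3": (0.5, 6.0, 4.0),     # weakly high in ALL
}
sample = {"g1": 9.0, "g2": 2.0, "g3": 4.0}
v_all, v_aml = weighted_votes(sample, gene_params)
```

In this toy case g1 and g2 both place the patient on the ALL side (votes 6.0 and 3.0), while g3 casts a small vote of 0.5 for AML.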
Prediction Strength
• Can assess the “strength” of a prediction as follows:
PS = (Vwinner – Vloser)/(Vwinner+ Vloser)
where Vwinner is the summed vote (absolute value) from the winning class, and Vloser is the summed vote (absolute value) for the losing class
Prediction Strength
• When classifying new cases, the algorithm ignores those cases where the strength of the prediction is below a threshold.
• Prediction =
– ALL, if $V_{ALL} > V_{AML}$ and PS > threshold
– AML, if $V_{AML} > V_{ALL}$ and PS > threshold
– No-call, otherwise.
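The prediction strength and the no-call rule can be sketched together. The threshold value used in the example (0.3) is an illustrative choice and is not taken from the slides. The function names are hypothetical.

```python
def prediction_strength(v_all, v_aml):
    """PS = (V_winner - V_loser) / (V_winner + V_loser); 0.0 if no votes were cast."""
    winner, loser = max(v_all, v_aml), min(v_all, v_aml)
    total = winner + loser
    return (winner - loser) / total if total > 0 else 0.0

def call_class(v_all, v_aml, threshold):
    """Return 'ALL', 'AML', or 'No-call' when the margin is too small."""
    ps = prediction_strength(v_all, v_aml)
    if ps > threshold and v_all > v_aml:
        return "ALL"
    if ps > threshold and v_aml > v_all:
        return "AML"
    return "No-call"

# Illustrative threshold only; pick it to suit the application.
strong_all = call_class(9.0, 0.5, threshold=0.3)   # PS = 8.5 / 9.5
weak = call_class(2.0, 3.0, threshold=0.3)         # PS = 1.0 / 5.0
strong_aml = call_class(1.0, 4.0, threshold=0.3)   # PS = 3.0 / 5.0
```

PS ranges from 0 (votes evenly split) to 1 (all votes for one class), so raising the threshold trades more no-calls for fewer uncertain predictions.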
Experiments
• Cross-validation with the original set of patients
– For i = 1 to 38
• Hold the ith sample aside
• Use the other 37 samples to determine weights
• With this set of weights, make a prediction on the ith sample
• Testing with another set of 34 patients…
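The hold-one-out loop above is independent of the classifier, so it can be sketched generically. In the sketch below the `train` and `predict` callables are placeholders. The nearest-class-mean classifier on a single made-up gene is only a stand-in to make the example runnable, and is not the weighted-voting predictor.

```python
def leave_one_out(samples, labels, train, predict):
    """Hold each sample aside in turn, train on the rest, predict the held-out one."""
    predictions = []
    for i in range(len(samples)):
        train_X = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        # Any gene ranking/selection belongs inside train(), so the
        # held-out sample never influences the weights.
        model = train(train_X, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions

# Stand-in classifier (assumption for illustration): nearest class mean.
def train_nearest_mean(train_X, train_y):
    means = {}
    for c in set(train_y):
        values = [x for x, y in zip(train_X, train_y) if y == c]
        means[c] = sum(values) / len(values)
    return means

def predict_nearest_mean(model, x):
    return min(model, key=lambda c: abs(x - model[c]))

# Toy data (made up): one expression value per patient.
samples = [9.0, 10.0, 11.0, 1.0, 2.0, 3.0]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
predictions = leave_one_out(samples, labels, train_nearest_mean, predict_nearest_mean)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

With 38 patients the same loop would train 38 models, each on 37 samples.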
"Training set" results were 36/38 with 100% accuracy, 2 unknown via cross-validation (37 train, 1 test)
Independent "test set" consisted of 34 samples
24 bone marrow samples, 10 peripheral blood samples
NOTE: "training set" was ONLY bone marrow samples
"test set" contained childhood AML samples, different laboratories
Strong predictions (PS=0.77) for 29/34 samples with 100% accuracy
Low prediction strength from questionable laboratory
Prediction: Results
Selection of 8-200 genes gives roughly the same prediction quality.
Cancer Class Discovery
• Given
– Expression profiles of leukemia patients
• Do
– Cluster the profiles, leading to discovery of the subclasses of leukemia represented by the set of patients
Cancer Class Discovery Experiment
• Cluster the expression profiles of 38 patients in the training set
– Using self-organizing maps with a predefined number of clusters (say, k)
• Run with k = 2
– Cluster 1 contained 1 AML, 24 ALL
– Cluster 2 contained 10 AML, 3 ALL
Cancer Class Discovery Experiment
• Run with k = 4
– Cluster 1 contained mostly AML
– Cluster 2 contained mostly T-cell ALL
– Cluster 3 contained mostly B-cell ALL
– Cluster 4 contained mostly B-cell ALL
• These clusters suggest that the clustering algorithm was able to discover the distinction between T-cell and B-cell ALL cases