Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
Classification Scheme
Labeled training examples → classification algorithm → classification rule
Test example → classification rule → predicted classification
Building a Good Classifier
Need enough training examples
Good performance on training set
Classifier is not too complex
Measures of complexity: number of bits needed to write down the classifier, number of parameters, VC dimension
Example
Example
Classification Algorithms
Nearest neighbors, Naïve Bayes, decision trees, boosting, neural networks, SVMs, bagging, …
Nearest Neighbor Classification
Popular nonlinear classifier: find the k nearest neighbors of the unknown (test) vector among the training vectors, and assign the test vector to the most frequent class among those k neighbors.
Question: how to select a similarity measure between vectors?
Problem: in high-dimensional data, even the nearest neighbors are still not "near".
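The procedure above can be sketched in a few lines of Python (the toy 2-D data and the choice of Euclidean distance are illustrative; the slide deliberately leaves the similarity measure open):

```python
# k-nearest-neighbor classification sketch: find the k closest training
# vectors to x and take a majority vote over their labels.
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=3):
    """Assign x the most frequent class among its k nearest training vectors."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up data: three points near the origin, three near (5, 5).
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["neg", "neg", "neg", "pos", "pos", "pos"]
print(knn_predict(train_X, train_y, (5.5, 5.5), k=3))  # -> pos
```

Any distance can be substituted for `math.dist`, which is exactly the open question the slide raises.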
Naïve Bayes Classifier
Input: labeled examples m1 = {c, x1, ..., xn}, m2 = {c, x1, ..., xn}, …
Parameter estimation / learning phase
Prediction: compute the assignment to classes and choose the most likely class
(Figure: naïve Bayes network structure, a class node C with children X1, X2, …, Xn)

P(C[m] = c | x[m], θ) ∝ P(C[m] = c | θ) P(x[m] | C[m] = c, θ)
                      = P(C[m] = c | θ) ∏_{i=1}^{n} P(x_i[m] | C[m] = c, θ)

Maximum-likelihood estimates (counts over the M training examples):

θ[c] = (1/M) ∑_m δ[c](m)        θ[x_i, c] = (1/M) ∑_m δ[x_i, c](m)
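A minimal sketch of these count-based estimates for discrete features, on made-up toy data (the function names and data layout are illustrative, not from the slides):

```python
# Naive Bayes: estimate class and per-feature counts from labeled examples,
# then predict by argmax_c P(c) * prod_i P(x_i | c).
from collections import Counter

def train_nb(examples):
    """examples: list of (c, (x1, ..., xn)). Returns count-based estimates."""
    M = len(examples)
    theta_c = Counter(c for c, _ in examples)          # counts of each class
    theta_xc = Counter()                               # counts of (i, x_i, c)
    for c, xs in examples:
        for i, xi in enumerate(xs):
            theta_xc[(i, xi, c)] += 1
    return M, theta_c, theta_xc

def predict_nb(M, theta_c, theta_xc, xs):
    """Choose the most likely class for feature vector xs."""
    def score(c):
        p = theta_c[c] / M                             # P(C = c)
        for i, xi in enumerate(xs):
            p *= theta_xc[(i, xi, c)] / theta_c[c]     # P(x_i | c)
        return p
    return max(theta_c, key=score)

data = [("pos", ("a", "x")), ("pos", ("a", "y")), ("neg", ("b", "y"))]
model = train_nb(data)
print(predict_nb(*model, ("a", "x")))  # -> pos
```

In practice the counts would be smoothed to avoid zero probabilities; that detail is omitted here.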
Decision Trees
Decision Trees Example
Building Decision Trees
Choose a rule to split on
Divide the data into disjoint subsets based on the splitting rule
Repeat recursively for each subset
Stop when leaves are (almost) "pure"
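The four steps above can be sketched recursively. This illustrative version assumes numeric features with threshold splits and uses Gini impurity to score splits; the slides leave both choices open:

```python
# Recursive tree building: pick the split with lowest weighted child
# impurity, partition the data, recurse, and stop at (almost) pure leaves.
from collections import Counter

def split_impurity(X, y, f, v):
    """Weighted Gini impurity of the two children induced by x[f] <= v."""
    def gini(ys):
        n = len(ys)
        return 1 - sum((c / n) ** 2 for c in Counter(ys).values()) if n else 0
    left = [y[i] for i in range(len(X)) if X[i][f] <= v]
    right = [y[i] for i in range(len(X)) if X[i][f] > v]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def build_tree(X, y, max_impurity=0.0):
    counts = Counter(y)
    majority = counts.most_common(1)[0][0]
    if 1 - counts[majority] / len(y) <= max_impurity:   # (almost) pure leaf
        return majority
    f, v = min(((f, v) for f in range(len(X[0])) for v in {x[f] for x in X}),
               key=lambda fv: split_impurity(X, y, *fv))
    left = [i for i in range(len(X)) if X[i][f] <= v]
    right = [i for i in range(len(X)) if X[i][f] > v]
    if not left or not right:                           # no useful split left
        return majority
    return (f, v,
            build_tree([X[i] for i in left], [y[i] for i in left]),
            build_tree([X[i] for i in right], [y[i] for i in right]))

def classify(tree, x):
    while isinstance(tree, tuple):
        f, v, lo, hi = tree
        tree = lo if x[f] <= v else hi
    return tree

tree = build_tree([(0,), (1,), (2,), (3,)], ["a", "a", "b", "b"])
print(tree)  # -> (0, 1, 'a', 'b'): split on feature 0 at value 1
```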
Choosing the Splitting Rule
Choose the rule that leads to the greatest increase in "purity"
Purity measures:
Entropy: −(p+ ln(p+) + p− ln(p−))
Gini index: p+ · p−
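Both purity measures can be written directly as functions of the positive-class fraction p+ (with p− = 1 − p+), using the natural log as on the slide:

```python
# Entropy and Gini index as functions of the positive-class fraction.
import math

def entropy(p_pos):
    p_neg = 1.0 - p_pos
    # Convention: 0 * ln(0) = 0 at the boundary.
    return -sum(p * math.log(p) for p in (p_pos, p_neg) if p > 0)

def gini(p_pos):
    return p_pos * (1.0 - p_pos)

print(entropy(0.5))  # ln 2 ~ 0.693: a 50/50 node is maximally impure
print(gini(1.0))     # 0.0: a pure leaf
```

Both measures are zero for pure nodes and maximal at p+ = 0.5, which is why either works as a splitting criterion.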
Tree Size vs. Prediction Accuracy
Trees must be big enough to fit the training data, but not so big that they overfit (capture noise or spurious patterns)
Decision Tree Summary
Best-known packages: C4.5 (Quinlan), CART (Breiman, Friedman, Olshen & Stone)
Very fast to train and evaluate
Relatively easy to interpret
But: accuracy is often not state-of-the-art
Work well within boosting approaches
Boosting
Main observation: it is easy to find simple "rules of thumb" that are "often" correct.
General approach: concentrate on the "hard examples"; derive a rule of thumb for these examples; combine the rule with previous rules by taking a weighted majority of all current rules; repeat T times.
Boosting guarantee: given sufficient data and an algorithm that can consistently find classifiers ("rules of thumb") slightly better than random, a high-accuracy classifier can be built.
AdaBoost
Setting α
Error ε_t of the weak classifier h_t, i.e., its (weighted) probability of misclassifying a training example:

ε_t = ∑_{i=1}^{N} P(h_t(x_i) ≠ y_i)

Setting α_t:

α_t = (1/2) ln((1 − ε_t) / ε_t)

The classifier weight is inversely related to the classifier's error, i.e., the weight α_t increases with classification accuracy.
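One round of the reweighting that uses ε_t and α_t can be sketched as follows (the toy predictions are made up; this is the standard AdaBoost update, which the slides present graphically):

```python
# One AdaBoost round: compute the weighted error eps_t, set alpha_t, then
# up-weight misclassified examples and down-weight correct ones.
import math

def adaboost_round(D, preds, labels):
    """D: current example weights (sum to 1); returns (alpha_t, new weights)."""
    eps = sum(d for d, p, y in zip(D, preds, labels) if p != y)
    alpha = 0.5 * math.log((1 - eps) / eps)
    new_D = [d * math.exp(-alpha if p == y else alpha)
             for d, p, y in zip(D, preds, labels)]
    Z = sum(new_D)                      # renormalize to a distribution
    return alpha, [d / Z for d in new_D]

# Four equally weighted examples; the weak classifier gets one wrong.
D = [0.25] * 4
alpha, D2 = adaboost_round(D, preds=[1, 1, 1, -1], labels=[1, 1, 1, 1])
print(alpha)  # eps = 0.25, so alpha = 0.5 * ln 3
print(D2[3])  # the misclassified example now carries half the total weight
```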
Toy Example
Weak classifiers: vertical or horizontal half-planes
Round 1
Round 2
Round 3
Final Boosting Classifier
Test Error Behavior
(Figure: expected vs. typical test-error curves)
The Margins Explanation
Training error measures classification accuracy, but the confidence of the classifications is also important.
Recall: H_final is a weighted majority vote of weak rules.
Measure confidence by margin = vote strength.
There is empirical evidence and mathematical proof that large margins give better generalization error, and that boosting tends to increase the margins of training examples.
Boosting Summary
Relatively fast (though not as fast as some other algorithms)
Simple and easy to program
Flexible: can combine with any learning algorithm, e.g., C4.5 or very simple rules of thumb
Provable guarantees
State-of-the-art accuracy
Tends not to overfit (but does sometimes)
Many applications
Neural Networks
Basic unit: perceptron (linear threshold function)
Neural network: perceptrons connected in a network, with a weight on each edge; each unit is a perceptron
Perceptron Units
Problem: the network's computation is discontinuous because of the threshold function g(x)
Solution: approximate g(x) with a smoothed threshold function, e.g., g(x) = 1/(1 + exp(−x))
H_w(x) is then continuous and differentiable in both x and w
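A single smoothed unit as described above can be sketched directly (the weights and inputs are made up for illustration):

```python
# A smoothed perceptron unit: a weighted sum passed through the sigmoid
# g(x) = 1 / (1 + exp(-x)) instead of a hard threshold.
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def unit(w, x, b=0.0):
    """Smoothed perceptron output g(w . x + b), continuous in both w and x."""
    return g(sum(wi * xi for wi, xi in zip(w, x)) + b)

print(g(0.0))                          # 0.5: the sigmoid is centered at zero
print(unit([1.0, -1.0], [2.0, 1.0]))   # g(2 - 1) = g(1)
```

Because g is differentiable, the weights of a network of such units can be found by gradient descent, which is the point of the smoothing.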
Finding Weights
Neural Network Summary
Slow to converge
Difficult to get the network architecture and parameters right
Not state-of-the-art accuracy as a general method
Can be tuned to specific applications and then achieve good performance
Support Vector Machines (SVMs)
Given linearly separable data, choose the hyperplane that maximizes the minimum margin (= distance to the separating hyperplane)
Intuition: separate the +'s from the −'s as much as possible
Finding Max-Margin Hyperplane
Non-Linearly Separable Data
Penalize each point by its distance from the margin 1, i.e., minimize the resulting penalty term (objective shown on the slide)
Alternatively, map the data into a high-dimensional space in which it becomes linearly separable
SVM Summary
Fast algorithms available
Not simple to program
State-of-the-art accuracy
Theoretical justification
Many applications
Assignment
Classify tissue samples into various clinical states (e.g., tumor/normal) based on microarray profiles
Classifiers to compare:
Naïve Bayes
Naïve Bayes + feature selection
Boosting with decision tree stumps (weak classifiers are decision trees with a single split)
Assignment
Data: breast cancer dataset
295 samples
data.tab: genes on rows; the first column is the gene identifier (ID); columns 2-296 are the expression levels of the gene in each array
27 clinical attributes
experiment_attributes.tab: attributes on columns, samples on rows; typically 0/1 indicates association of a sample with an attribute
Evaluation: use a 10-fold cross-validation scheme and compute prediction accuracy for the clinical attributes met_in_5_yr_, eventmet_, Alive_8yr_, eventdea_, GradeIII_
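The 10-fold scheme can be sketched as follows (the round-robin fold split and the trivial majority-class classifier are illustrative stand-ins for the real data and classifiers):

```python
# 10-fold cross-validation: split the samples into 10 folds, train on 9,
# test on the held-out fold, and average the per-fold accuracies.

def cross_validate(samples, labels, train, predict, k=10):
    n = len(samples)
    folds = [list(range(i, n, k)) for i in range(k)]   # round-robin split
    accs = []
    for fold in folds:
        held = set(fold)
        tr_X = [samples[i] for i in range(n) if i not in held]
        tr_y = [labels[i] for i in range(n) if i not in held]
        model = train(tr_X, tr_y)
        correct = sum(predict(model, samples[i]) == labels[i] for i in fold)
        accs.append(correct / len(fold))
    return sum(accs) / len(accs)

# Toy check with a trivial majority-class "classifier":
train = lambda X, y: max(set(y), key=y.count)
predict = lambda model, x: model
acc = cross_validate(list(range(20)), ["a"] * 15 + ["b"] * 5, train, predict)
print(acc)  # -> 0.75: the majority class covers 15 of 20 samples
```

Any of the three assignment classifiers can be plugged in as the `train`/`predict` pair.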