Introduction to Machine Learning
What is Machine Learning?
[Diagram: TRAINING DATA feeds a learning algorithm, which produces a trained machine; the trained machine maps a query to an answer.]
What for?
• Classification
• Time series prediction
• Regression
• Clustering
Some Learning Machines
• Linear models
• Kernel methods
• Neural networks
• Decision trees
Applications
[Figure: application domains plotted by number of inputs (10 to 10^5) vs. number of training examples (10 to 10^5): Bioinformatics, Ecology, OCR/HWR, Market Analysis, Text Categorization, Machine Vision, System Diagnosis.]
Banking / Telecom / Retail
• Identify:
– Prospective customers
– Dissatisfied customers
– Good customers
– Bad payers
• Obtain:
– More effective advertising
– Less credit risk
– Less fraud
– Decreased churn rate
Biomedical / Biometrics
• Medicine:
– Screening
– Diagnosis and prognosis
– Drug discovery
• Security:
– Face recognition
– Signature / fingerprint / iris verification
– DNA fingerprinting
Computer / Internet
• Computer interfaces:
– Troubleshooting wizards
– Handwriting and speech
– Brain waves
• Internet:
– Hit ranking
– Spam filtering
– Text categorization
– Text translation
– Recommendation
Challenges
[Figure: the NIPS 2003 & WCCI 2006 challenge datasets plotted by number of inputs (10 to 10^5) vs. number of training examples (10 to 10^5): Arcene, Dorothea, Hiva; Sylva; Gisette, Gina; Ada; Dexter, Nova; Madelon.]
Ten Classification Tasks
[Figure: histograms of the test BER (%) achieved by challenge entrants on each of the ten datasets: ADA, GINA, HIVA, NOVA, SYLVA, ARCENE, DEXTER, DOROTHEA, GISETTE, MADELON.]
Challenge Winning Methods
[Figure: BER normalized by the average BER (BER/<BER>) for the winning method families (Linear/Kernel, Neural Nets, Trees/RF, Naïve Bayes) on each dataset: Gisette (HWR), Gina (HWR), Dexter (Text), Nova (Text), Madelon (Artificial), Arcene (Spectral), Dorothea (Pharma), Hiva (Pharma), Ada (Marketing), Sylva (Ecology).]
Conventions
• X = {xij}: the m × n data matrix
• xi: one pattern (a row of X)
• y = {yj}: the target vector
• w: weight vector (primal parameters)
• α: dual parameters (one per training pattern)
Learning problem
Colon cancer data, Alon et al., 1999
• Unsupervised learning: is there structure in the data?
• Supervised learning: predict an outcome y.
Data matrix X:
– m rows = patterns (data points, examples): samples, patients, documents, images, …
– n columns = features (attributes, input variables): genes, proteins, words, pixels, …
Linear Models
• f(x) = w • x + b = Σj=1:n wj xj + b
Linearity in the parameters, NOT in the input components.
• f(x) = w • Φ(x) + b = Σj wj φj(x) + b (Perceptron)
• f(x) = Σi=1:m αi k(xi, x) + b (Kernel method)
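As a concrete sketch, the three forms can be written in a few lines of NumPy (the feature map and kernel below are illustrative choices, not prescribed by the lecture):

```python
import numpy as np

def f_linear(x, w, b):
    # f(x) = w . x + b: linear in the parameters and in the inputs
    return float(np.dot(w, x) + b)

def f_perceptron(x, w, b, phi):
    # f(x) = w . Phi(x) + b: still linear in w, nonlinear in x via basis functions
    return float(np.dot(w, phi(x)) + b)

def f_kernel(x, alpha, b, k, X_train):
    # f(x) = sum_i alpha_i k(x_i, x) + b: expansion over the training patterns
    return float(sum(a * k(xi, x) for a, xi in zip(alpha, X_train)) + b)

# Illustrative choices (assumptions for this sketch):
phi = lambda x: np.concatenate([x, x ** 2])   # quadratic basis functions
k = lambda s, t: float(np.dot(s, t))          # linear kernel
```

With the linear kernel, f_kernel with w = Σi αi xi coincides with f_linear; this is the primal/dual correspondence discussed a few slides later.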
Artificial Neurons
[Diagram: inputs x1 … xn and a constant 1 are multiplied by weights w1 … wn and bias b, summed (Σ), and passed through an activation function to give f(x) = w • x + b. Biological analogy: synapses (weights), dendrites (inputs), cell potential (weighted sum), axon (output activating other neurons).]
McCulloch and Pitts, 1943
Linear Decision Boundary
[Figure: a separating hyperplane in a three-dimensional feature space (x1, x2, x3): the linear decision boundary.]
Perceptron
Rosenblatt, 1957
f(x) = w • Φ(x) + b
[Diagram: inputs x1 … xn are mapped through basis functions φ1(x) … φN(x); these and a constant 1 are weighted by w1 … wN and bias b, then summed to give f(x).]
Nonlinear Decision Boundary
[Figure: a curved decision surface in a three-dimensional feature space spanned by gene-expression features Hs.128749, Hs.234680, Hs.7780.]
Kernel Method
Potential functions, Aizerman et al 1964
f(x) = Σi αi k(xi, x) + b
[Diagram: inputs x1 … xn feed similarity units k(x1, x) … k(xm, x); these and a constant 1 are weighted by α1 … αm and bias b, then summed to give f(x).]
k(. ,. ) is a similarity measure or “kernel”.
Hebb’s Rule
wj ← wj + yi xij
[Diagram: input xj reaches the neuron through a synapse of weight wj via a dendrite; the neuron's output y leaves along the axon to activate another neuron.]
Link to “Naïve Bayes”
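A minimal sketch of this update applied over a whole training set, assuming labels yi ∈ {−1, +1}:

```python
import numpy as np

def hebb_train(X, y):
    # Hebb's rule: w_j <- w_j + y_i * x_ij, applied once per training pattern
    m, n = X.shape
    w = np.zeros(n)
    for i in range(m):
        w += y[i] * X[i]
    return w
```

Equivalently w = X.T @ y, i.e. the vector form w = Σi yi xi used on the next slide.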
Kernel “Trick” (for Hebb’s rule)
• Hebb’s rule for the Perceptron:
w = Σi yi Φ(xi)
f(x) = w • Φ(x) = Σi yi Φ(xi) • Φ(x)
• Define a dot product: k(xi,x) = Φ(xi) • Φ(x)
f(x) = Σi yi k(xi,x)
Kernel “Trick” (general)
• f(x) = Σi αi k(xi, x)
• k(xi, x) = Φ(xi) • Φ(x)
• f(x) = w • Φ(x)
• w = Σi αi Φ(xi)
Dual forms
What is a Kernel?
A kernel is:
• a similarity measure
• a dot product in some feature space: k(s, t) = Φ(s) • Φ(t)
But we do not need to know the Φ representation.
Examples:
• k(s, t) = exp(−||s − t||²/σ²) Gaussian kernel
• k(s, t) = (s • t)^q Polynomial kernel
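The two example kernels, written out directly (the parameter defaults are illustrative):

```python
import numpy as np

def gaussian_kernel(s, t, sigma=1.0):
    # k(s, t) = exp(-||s - t||^2 / sigma^2)
    return float(np.exp(-np.sum((s - t) ** 2) / sigma ** 2))

def polynomial_kernel(s, t, q=2):
    # k(s, t) = (s . t)^q
    return float(np.dot(s, t) ** q)
```

Both are symmetric, as a dot product must be, and the Gaussian kernel equals 1 exactly when s = t, which makes it a sensible similarity measure.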
Multi-Layer Perceptron
Back-propagation, Rumelhart et al, 1986
[Diagram: inputs xj feed a layer of summation units, the "hidden units" (internal "latent" variables), whose outputs feed a final summation unit.]
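A minimal forward pass for one hidden layer can be sketched as follows (the tanh activation is an illustrative choice; the slide shows only summation units):

```python
import numpy as np

def mlp_forward(x, W1, b1, w2, b2):
    # Hidden units: internal "latent" variables h = tanh(W1 x + b1)
    h = np.tanh(W1 @ x + b1)
    # Output unit: linear combination of the hidden activations
    return float(w2 @ h + b2)
```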
Chessboard Problem
Tree Classifiers
CART (Breiman, 1984) or C4.5 (Quinlan, 1993)
At each step, choose the feature that "reduces entropy" most. Work towards "node purity".
[Figure: starting from all the data, the tree first chooses f2, then f1, partitioning the (f1, f2) plane step by step.]
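The "reduces entropy" criterion can be sketched as an information-gain computation (a hypothetical helper, not the actual CART or C4.5 code):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # Shannon entropy of a set of class labels, in bits
    m = len(labels)
    return -sum((c / m) * np.log2(c / m) for c in Counter(labels).values())

def information_gain(parent, left, right):
    # Entropy reduction achieved by splitting `parent` into `left` and `right`
    m = len(parent)
    return entropy(parent) - (len(left) / m) * entropy(left) \
                           - (len(right) / m) * entropy(right)
```

A split that yields two pure children ("node purity") attains the maximum possible gain.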
Iris Data (Fisher, 1936)
[Figure: decision regions of four classifiers on the Iris data (classes setosa, versicolor, virginica): linear discriminant, tree classifier, Gaussian mixture, kernel method (SVM).]
Figure from Norbert Jankowski and Krzysztof Grabczewski
Fit / Robustness Tradeoff
[Figure: two decision boundaries in the (x1, x2) plane illustrating the tradeoff between fitting the training data and robustness.]
Performance evaluation
[Figure: two decision boundaries in the (x1, x2) plane drawn at three successive thresholds: the level set f(x) = 0 separating f(x) > 0 from f(x) < 0, then shifted to f(x) = −1 and to f(x) = +1.]
ROC Curve
[Figure: ROC curves with hit rate (sensitivity) on the y-axis and false alarm rate (1 − specificity) on the x-axis, both running from 0 to 100%. Shown: the actual ROC, the random ROC (diagonal, AUC = 0.5), and the ideal ROC curve (AUC = 1). In general, 0 ≤ AUC ≤ 1.]
For a given threshold on f(x), you get a point on the ROC curve.
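Sweeping the threshold and accumulating the trapezoidal area can be sketched as (an illustrative implementation, not challenge code):

```python
def roc_points(scores, y):
    # Each threshold on f(x) yields one (false alarm rate, hit rate) point.
    pos = sum(1 for t in y if t == 1)
    neg = len(y) - pos
    pts = [(0.0, 0.0)]
    for th in sorted(set(scores), reverse=True):
        tp = sum(1 for s, t in zip(scores, y) if s >= th and t == 1)
        fp = sum(1 for s, t in zip(scores, y) if s >= th and t != 1)
        pts.append((fp / neg, tp / pos))
    return pts

def auc(points):
    # Trapezoidal area under the ROC curve; Gini = 2 * AUC - 1
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A perfect ranking gives AUC = 1; a random ranking hovers around AUC = 0.5.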
Lift Curve
[Figure: lift curves with hit rate (fraction of good customers selected) on the y-axis and fraction of customers selected on the x-axis, both running from 0 to 100%. Customers are ranked according to f(x) and the top-ranking customers are selected. Shown: the actual lift, the random lift (diagonal), and the ideal lift.]
Gini = 2 AUC − 1, with 0 ≤ Gini ≤ 1.
Performance Assessment

Compare F(x) = sign(f(x)) to the target y. Confusion matrix (a cost matrix of the same shape can weight each cell):

| Truth y \ Predictions F(x) | Class −1 | Class +1 | Total |
| --- | --- | --- | --- |
| Class −1 | tn | fp | neg = tn + fp |
| Class +1 | fn | tp | pos = fn + tp |
| Total | rej = tn + fn | sel = fp + tp | m = tn + fp + fn + tp |

• False alarm rate = fp/neg = type I error rate = 1 − specificity
• Hit rate = tp/pos = 1 − type II error rate = sensitivity = recall = test power
• Precision = tp/sel
• Fraction selected = sel/m

Report:
• Error rate = (fn + fp)/m
• {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Frac. selected}
• Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 − (sensitivity + specificity)/2
• F measure = 2·precision·recall / (precision + recall)

Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
• ROC curve: Hit rate vs. False alarm rate
• Lift curve: Hit rate vs. Fraction selected
• Precision/recall curve: Hit rate vs. Precision
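The quantities above can be computed directly from the four confusion-matrix counts (a straight transcription of the formulas):

```python
def classification_metrics(tn, fp, fn, tp):
    # Derived rates from the confusion matrix and its margins
    m = tn + fp + fn + tp
    pos, neg, sel = fn + tp, tn + fp, fp + tp
    hit_rate = tp / pos                      # sensitivity, recall
    false_alarm = fp / neg                   # 1 - specificity
    precision = tp / sel
    return {
        "error_rate": (fn + fp) / m,
        "BER": (fn / pos + fp / neg) / 2,    # balanced error rate
        "F": 2 * precision * hit_rate / (precision + hit_rate),
        "hit_rate": hit_rate,
        "false_alarm": false_alarm,
        "precision": precision,
        "frac_selected": sel / m,
    }
```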
What is a Risk Functional?
A function of the parameters of the learning machine that assesses how much the machine is expected to fail on a given task.
Examples:
• Classification:
– Error rate: (1/m) Σi=1:m 1(F(xi) ≠ yi)
– 1 − AUC (Gini index = 2 AUC − 1)
• Regression:
– Mean square error: (1/m) Σi=1:m (f(xi) − yi)²
How to Train?
• Define a risk functional R[f(x, w)]
• Find a method to optimize it, typically "gradient descent":
wj ← wj − η ∂R/∂wj
or any other optimization method (mathematical programming, simulated annealing, genetic algorithms, etc.)
[Figure: the risk R[f(x, w)] plotted over the parameter space w, with its minimum at w*.]
(… to be continued in the next lecture)
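A toy sketch of gradient descent on the mean-square-error risk of a linear model f(x, w) = w • x (the learning rate and epoch count are illustrative):

```python
import numpy as np

def train_linear_mse(X, y, eta=0.1, epochs=200):
    # Risk: R[w] = (1/m) * sum_i (w . x_i - y_i)^2
    # Update: w_j <- w_j - eta * dR/dw_j
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        grad = (2.0 / m) * X.T @ (X @ w - y)
        w -= eta * grad
    return w
```

On noiseless data generated by a linear model, the iteration recovers the generating weights.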
Summary
• With linear threshold units ("neurons") we can build:
– Linear discriminants (including Naïve Bayes)
– Kernel methods
– Neural networks
– Decision trees
• The architectural hyper-parameters may include:
– The choice of basis functions φ (features)
– The kernel
– The number of units
• Learning means fitting:
– Parameters (weights)
– Hyper-parameters
– Be aware of the fit vs. robustness tradeoff
Want to Learn More?
• Pattern Classification. R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook, limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction. T. Hastie, R. Tibshirani, J. Friedman. Standard statistics textbook; includes all the standard machine learning methods for classification, regression, clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
• Linear Discriminants and Support Vector Machines. I. Guyon and D. Stork. In Smola et al., Eds., Advances in Large Margin Classifiers, pages 147–169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz
• Feature Extraction: Foundations and Applications. I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, teaching material. http://clopinet.com/fextract-book