Transcript of Machine Learning & Data Mining - Yisong Yue (CS/CNS/EE 155), Lecture 3
Machine Learning & Data Mining CS/CNS/EE 155
Lecture 3: SVM, Logistic Regression, Neural Nets, Evaluation Metrics
Announcements
• HW1 Due Tomorrow – Will be graded in about a week
• HW2 Released Tonight/Tomorrow – Due Jan 20th at 9pm – Via Moodle
• Recitation Tonight – Linear Algebra & Vector Calculus
Recap: Basic Recipe
• Training Data: S = {(x_i, y_i)}_{i=1}^N, with x ∈ R^D, y ∈ {−1, +1}
• Model Class (Linear Models): f(x | w, b) = w^T x − b
• Loss Function (Squared Loss): L(a, b) = (a − b)^2
• Learning Objective (Optimization Problem): argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b))
Recap: Bias-Variance Trade-off
[Figure: three model fits, each shown with its Variance and Bias plots]
Recap: Complete Pipeline
Training Data: S = {(x_i, y_i)}_{i=1}^N
→ Model Class(es): f(x | w, b) = w^T x − b
→ Loss Function: L(a, b) = (a − b)^2
→ Optimization Problem (solved via SGD!): argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b))
→ Cross Validation & Model Selection
→ Profit!
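The pipeline above can be sketched end to end. Below is a minimal stochastic gradient descent loop for the squared-loss linear model; the toy one-feature data, learning rate, and epoch count are illustrative choices, not from the slides:

```python
import random

def sgd_linear(data, dim, lr=0.01, epochs=100, seed=0):
    """Minimize sum_i (y_i - f(x_i|w,b))^2 / 2 by stochastic gradient descent."""
    rng = random.Random(seed)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            pred = sum(wj * xj for wj, xj in zip(w, x)) - b   # f(x|w,b) = w.x - b
            err = pred - y                                    # d(loss)/d(pred)
            w = [wj - lr * err * xj for wj, xj in zip(w, x)]  # d(pred)/dw_j = x_j
            b = b + lr * err                                  # d(pred)/db = -1
    return w, b

# Fit y = 2*x - 1 on a few noiseless points; SGD should recover w ≈ 2, b ≈ 1
data = [([x], 2 * x - 1) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]
w, b = sgd_linear(data, dim=1)
```

Because the toy data are exactly linear, the per-sample gradients all vanish at the optimum, so constant-step SGD converges cleanly here.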
Today
• Beyond Basic Linear Models – Support Vector Machines – Logistic Regression – Feed-forward Neural Networks – Different ways to interpret models
• Different Evaluation Metrics
[Plot: 0/1 Loss and Squared Loss as functions of f(x), for target y]
argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b))

∂_w Σ_{i=1}^N L(y_i, f(x_i | w, b))
How to compute gradient for 0/1 Loss?
Recap: 0/1 Loss is Intractable
• 0/1 Loss is flat or discontinuous everywhere
• VERY difficult to optimize using gradient descent
• Solution: Optimize surrogate loss – Today: Hinge Loss (…eventually)
Support Vector Machines aka Max-Margin Classifiers
Source: http://en.wikipedia.org/wiki/Support_vector_machine
Which Line is the Best Classifier?
Source: http://en.wikipedia.org/wiki/Support_vector_machine
Which Line is the Best Classifier?
“Margin”
Recall: Hyperplane Distance

• A line is 1D, a plane is 2D
• A hyperplane is many-D – includes lines and planes
• Defined by (w, b)
• Distance: |w^T x − b| / ‖w‖
• Signed Distance: (w^T x − b) / ‖w‖
• Linear Model = un-normalized signed distance!
(distance from hyperplane to origin: b / ‖w‖)
Recall: Margin
γ = max_w min_{(x,y)} [ y (w^T x) / ‖w‖ ]
How to Maximize Margin? (Assume Linearly Separable)
Choose w that maximizes:
argmax_{w,b} min_{(x,y)} [ y (w^T x − b) / ‖w‖ ]   ← Margin
How to Maximize Margin? (Assume Linearly Separable)
argmax_{w,b} min_{(x,y)} [ y (w^T x − b) / ‖w‖ ] ≡ argmax_{w,b: ‖w‖=1} min_{(x,y)} [ y (w^T x − b) ]

Suppose we enforce:

min_{(x,y)} y (w^T x − b) = 1

Then:

argmax_{w,b} min_{(x,y)} [ y (w^T x − b) / ‖w‖ ] = argmin_{w,b} ‖w‖ ≡ argmin_{w,b} ‖w‖^2
Image Source: http://en.wikipedia.org/wiki/Support_vector_machine
Max Margin Classifier (Support Vector Machine)
“Linearly Separable”
Better generalization to unseen test examples
(beyond scope of course*)
(only training data on the margin matter)
“Margin”
argmin_{w,b} (1/2) w^T w ≡ (1/2) ‖w‖^2
s.t. ∀i: y_i (w^T x_i − b) ≥ 1
*http://olivier.chapelle.cc/pub/span_lmc.pdf
Soft-Margin Support Vector Machine

argmin_{w,b,ξ} (1/2) w^T w + (C/N) Σ_i ξ_i
s.t. ∀i: y_i (w^T x_i − b) ≥ 1 − ξ_i
     ∀i: ξ_i ≥ 0

ξ_i: "Slack" (size of a margin violation)
Size of Margin vs Size of Margin Violations (C controls the trade-off)

Image Source: http://en.wikipedia.org/wiki/Support_vector_machine
[Plot: 0/1 Loss and Hinge Loss as functions of f(x), for target y]

argmin_{w,b,ξ} (1/2) w^T w + (C/N) Σ_i ξ_i
s.t. ∀i: y_i (w^T x_i − b) ≥ 1 − ξ_i
     ∀i: ξ_i ≥ 0

Hinge Loss: L(y_i, f(x_i)) = max(0, 1 − y_i f(x_i)) = ξ_i
Regularization: the (1/2) w^T w term
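At the optimum the slack ξ_i equals the hinge loss, so the constrained and penalized views of the objective coincide. A small sketch (the data and C in the check below are illustrative):

```python
def hinge_loss(y, score):
    """Hinge loss max(0, 1 - y * f(x)); equals the optimal slack xi_i."""
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, data, C):
    """(1/2) w.w + (C/N) * sum_i hinge(y_i, w.x_i - b)."""
    reg = 0.5 * sum(wj * wj for wj in w)
    penalty = sum(hinge_loss(y, sum(wj * xj for wj, xj in zip(w, x)) - b)
                  for x, y in data)
    return reg + (C / len(data)) * penalty
```

A point outside the margin contributes no slack; a margin violation contributes exactly 1 − y f(x).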
Hinge Loss vs Squared Loss

[Plot: 0/1 Loss, Squared Loss, and Hinge Loss as functions of f(x), for target y]
Comparison with Perceptron

[Plot: Perceptron loss and SVM/Hinge loss as functions of y·f(x)]

Perceptron: max{ 0, −y_i f(x_i | w, b) }
SVM/Hinge: max{ 0, 1 − y_i f(x_i | w, b) }
Support Vector Machine
• 2 Interpretations
• Geometric – Margin vs Margin Violations
• Loss Minimization – Model complexity vs Hinge Loss – (Will discuss in depth next lecture)
• Equivalent!
argmin_{w,b,ξ} (1/2) w^T w + (C/N) Σ_i ξ_i
s.t. ∀i: y_i (w^T x_i − b) ≥ 1 − ξ_i,  ∀i: ξ_i ≥ 0
Comment on Optimization

• Hinge Loss is not smooth – Not differentiable
• How to optimize?
• Stochastic (Sub-)Gradient Descent still works! – Sub-gradients discussed next lecture
argmin_{w,b,ξ} (1/2) w^T w + (C/N) Σ_i ξ_i
s.t. ∀i: y_i (w^T x_i − b) ≥ 1 − ξ_i,  ∀i: ξ_i ≥ 0
https://en.wikipedia.org/wiki/Subgradient_method
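A single stochastic subgradient step for the soft-margin objective can be sketched as follows. C and the learning rate are illustrative; at the hinge's kink any subgradient is valid, and this sketch uses the margin-violation branch there:

```python
def svm_subgrad_step(w, b, x, y, C, lr):
    """One stochastic subgradient step on (1/2) w.w + C * max(0, 1 - y (w.x - b))."""
    score = sum(wj * xj for wj, xj in zip(w, x)) - b
    violated = y * score < 1               # hinge term is active for this example
    # Subgradient: regularizer contributes w; an active hinge adds -C*y*x (w) and +C*y (b)
    gw = [wj - C * y * xj if violated else wj for wj, xj in zip(w, x)]
    gb = C * y if violated else 0.0
    w = [wj - lr * gwj for wj, gwj in zip(w, gw)]
    return w, b - lr * gb
```

Repeating this step over shuffled training examples (with a decaying step size) is essentially the Pegasos-style way to train a linear SVM.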
Logistic Regression aka "Log-Linear" Models
Logistic Regression
P(y | x, w, b) = e^{(1/2) y (w^T x − b)} / ( e^{(1/2) y (w^T x − b)} + e^{−(1/2) y (w^T x − b)} )
             = 1 / ( 1 + e^{−y (w^T x − b)} )

"Log-Linear" Model
[Plot: P(y|x) as a sigmoid function of y·f(x)]
Also known as the sigmoid function: σ(a) = e^a / (1 + e^a)
y ∈ {−1, +1}
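The two expressions for P(y | x, w, b) above are algebraically identical (divide numerator and denominator by e^{(1/2) y (w^T x − b)}). A quick numeric check of the equivalence:

```python
import math

def p_symmetric(y, score):
    """P(y|x) in the symmetric exponential form from the slide (score = w.x - b)."""
    a = math.exp(0.5 * y * score)
    return a / (a + 1.0 / a)

def p_sigmoid(y, score):
    """Equivalent sigmoid form: 1 / (1 + e^{-y f(x)})."""
    return 1.0 / (1.0 + math.exp(-y * score))

# The two forms agree at arbitrary scores
for s in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    assert abs(p_symmetric(+1, s) - p_sigmoid(+1, s)) < 1e-12
```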
Maximum Likelihood Training

• Training set: S = {(x_i, y_i)}_{i=1}^N, with x ∈ R^D, y ∈ {−1, +1}
• Maximum Likelihood: argmax_{w,b} Π_i P(y_i | x_i, w, b) – (Why?)
• Each (x, y) in S sampled independently! – Discussed further in Probability Recitation
• SVMs often better at classification – Assuming margin exists…
• Calibrated Probabilities?
• Increase in SVM score… – …similar increase in P(y=+1|x)? – Not well calibrated!
• Logistic Regression!
Why Use Logistic Regression?

[Figure: histograms of predicted values and reliability diagrams ("Fraction of Positives" vs "Mean Predicted Value") for COV_TYPE, ADULT, LETTER.P1, LETTER.P2, MEDIS, SLAC, and HS]
Figure 1. Histograms of predicted values and reliability diagrams for boosted decision trees.
Table 4. Squared error and cross-entropy performance of learning algorithms

            SQUARED ERROR                CROSS-ENTROPY
ALGORITHM   RAW     PLATT   ISOTONIC     RAW     PLATT   ISOTONIC
BST-DT      0.3050  0.2650  0.2652       0.4810  0.3727  0.3745
SVM         0.3303  0.2727  0.2719       0.5767  0.3988  0.3984
BAG-DT      0.2818  0.2815  0.2799       0.4050  0.4082  0.3996
ANN         0.2805  0.2821  0.2806       0.4143  0.4229  0.4120
KNN         0.2861  0.2871  0.2839       0.4367  0.4300  0.4186
BST-STMP    0.3659  0.3098  0.3096       0.6241  0.4713  0.4734
DT          0.3211  0.3212  0.3145       0.5019  0.5091  0.4865
to have probability near 0.

The reliability plots in Figure 1 display roughly sigmoid-shaped reliability diagrams, motivating the use of a sigmoid to transform predictions into calibrated probabilities. The reliability plots in the middle row of the figure also show sigmoids fitted using Platt's method. The reliability plots in the bottom of the figure show the function fitted with Isotonic Regression.

To show how calibration transforms the predictions, we plot histograms and reliability diagrams for the seven problems for boosted trees after 1024 steps of boosting, after Platt Calibration (Figure 2) and after Isotonic Regression (Figure 3). The reliability diagrams for Isotonic Regression are very similar to the ones for Platt Scaling, so we omit them in the interest of space. The figures show that calibration undoes the shift in probability mass caused by boosting: after calibration many more cases have predicted probabilities near 0 and 1. The reliability diagrams are closer to the diagonal, and the S shape characteristic of boosting's predictions is gone. On each problem, transforming the predictions using either Platt Scaling or Isotonic Regression yields a significant improvement in the quality of the predicted probabilities, leading to much lower squared error and cross-entropy. The main difference between Isotonic Regression and Platt Scaling for boosting can be seen when comparing the histograms in the two figures. Because Isotonic Regression generates a piecewise constant function, the histograms are quite coarse, while the histograms generated by Platt Scaling are smooth and easier to interpret.

Table 4 compares the RMS and MXE performance of the learning methods before and after calibration. Figure 4 shows the squared error results from Table 4 graphically. After calibration with Platt Scaling or Isotonic Regression, boosted decision trees have better squared error and cross-entropy than the other learning methods. The next best methods are SVMs, bagged decision trees and neural nets. While Platt Scaling and Isotonic Regression significantly improve the performance of the SVM models, they have little or no effect on the performance of bagged
Image Source: http://machinelearning.org/proceedings/icml2005/papers/079_GoodProbabilities_NiculescuMizilCaruana.pdf
*Figure above is for Boosted Decision Trees (SVMs have similar effect)
[Plot axes: f(x) vs P(y=+1)]
Log Loss
P(y | x, w, b) = e^{(1/2) y (w^T x − b)} / ( e^{(1/2) y (w^T x − b)} + e^{−(1/2) y (w^T x − b)} )
             = e^{(1/2) y f(x|w,b)} / ( e^{(1/2) y f(x|w,b)} + e^{−(1/2) y f(x|w,b)} )

argmax_{w,b} Π_i P(y_i | x_i, w, b) = argmin_{w,b} Σ_i − ln P(y_i | x_i, w, b)
Log Loss
Solve using (Stoch.) Gradient Descent
L(y, f(x)) = − ln [ e^{(1/2) y f(x)} / ( e^{(1/2) y f(x)} + e^{−(1/2) y f(x)} ) ]
Log Loss vs Hinge Loss

[Plot: 0/1 Loss, Hinge Loss, and Log Loss as functions of f(x), for target y]

Log Loss: L(y, f(x)) = − ln [ e^{(1/2) y f(x)} / ( e^{(1/2) y f(x)} + e^{−(1/2) y f(x)} ) ]
Hinge Loss: L(y, f(x)) = max(0, 1 − y f(x))
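In code the two surrogates compare as follows; the log loss simplifies to ln(1 + e^{−y f(x)}) and, unlike the hinge, never reaches exactly zero. The log1p branching below is a standard numerically stable evaluation, not something from the slides:

```python
import math

def log_loss(y, score):
    """-ln P(y|x) = ln(1 + e^{-y f(x)}), evaluated stably for large |score|."""
    m = -y * score
    if m > 0:
        return m + math.log1p(math.exp(-m))  # ln(1+e^m) = m + ln(1+e^{-m})
    return math.log1p(math.exp(m))

def hinge_loss(y, score):
    """max(0, 1 - y f(x))."""
    return max(0.0, 1.0 - y * score)
```

At y·f(x) = 0 the losses are ln 2 and 1 respectively; far past the margin the hinge is exactly 0 while the log loss only decays toward 0.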
Log-Loss Gradient (For One Example)
∂_w [ − ln P(y_i | x_i) ] = −∂_w [ (1/2) y_i f(x_i|w,b) − ln( e^{(1/2) y_i f(x_i|w,b)} + e^{−(1/2) y_i f(x_i|w,b)} ) ]

= −(1/2) y_i x_i + ∂_w ln( e^{(1/2) y_i f(x_i|w,b)} + e^{−(1/2) y_i f(x_i|w,b)} )

= −(1/2) y_i x_i + [ 1 / ( e^{(1/2) y_i f(x_i|w,b)} + e^{−(1/2) y_i f(x_i|w,b)} ) ] ∂_w ( e^{(1/2) y_i f(x_i|w,b)} + e^{−(1/2) y_i f(x_i|w,b)} )

= [ −1 + ( e^{(1/2) y_i f(x_i|w,b)} − e^{−(1/2) y_i f(x_i|w,b)} ) / ( e^{(1/2) y_i f(x_i|w,b)} + e^{−(1/2) y_i f(x_i|w,b)} ) ] (1/2) y_i x_i

= ( −1 + P(y_i | x_i) − P(−y_i | x_i) ) (1/2) y_i x_i

= −P(−y_i | x_i) y_i x_i = −( 1 − P(y_i | x_i) ) y_i x_i

where P(y | x, w, b) = e^{(1/2) y f(x|w,b)} / ( e^{(1/2) y f(x|w,b)} + e^{−(1/2) y f(x|w,b)} )
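The final expression, ∂_w(−ln P(y_i|x_i)) = −(1 − P(y_i|x_i)) y_i x_i, can be verified against a finite-difference gradient; the test point below is arbitrary:

```python
import math

def p(y, w, b, x):
    """P(y | x, w, b) = 1 / (1 + e^{-y (w.x - b)}), equivalent to the symmetric form."""
    score = sum(wj * xj for wj, xj in zip(w, x)) - b
    return 1.0 / (1.0 + math.exp(-y * score))

def grad_w(y, w, b, x):
    """Analytic gradient of -ln P from the derivation: -(1 - P(y|x)) * y * x."""
    coef = -(1.0 - p(y, w, b, x)) * y
    return [coef * xj for xj in x]

# Central finite-difference check of each coordinate
w, b, x, y = [0.3, -0.7], 0.2, [1.5, 2.0], +1
g = grad_w(y, w, b, x)
eps = 1e-6
for j in range(len(w)):
    wp, wm = list(w), list(w)
    wp[j] += eps
    wm[j] -= eps
    num = (-math.log(p(y, wp, b, x)) + math.log(p(y, wm, b, x))) / (2 * eps)
    assert abs(num - g[j]) < 1e-5
```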
Logistic Regression
• Two Interpretations
• Maximizing Likelihood
• Minimizing Log Loss
• Equivalent!
Logistic Regression

P(y | x, w, b) = e^{y (w^T x − b)} / ( e^{y (w^T x − b)} + e^{−y (w^T x − b)} )

P(y | x, w, b) ∝ e^{y (w^T x − b)} ≡ e^{y · f(x|w,b)}

"Log-Linear" Model

[Plots: P(y|x) as a function of y·f(x); the loss curves as functions of f(x)]
Feed-Forward Neural Networks aka Not Quite Deep Learning
1 Layer Neural Network
• 1 Neuron – Takes input x – Outputs y
• ~Logistic Regression! – Solve via Gradient Descent

[Diagram: x → Σ → y ("Neuron")]
f(x|w,b) = w^T x − b = w_1·x_1 + w_2·x_2 + w_3·x_3 − b
y = σ( f(x) )
[Plot: activation functions: sigmoid, tanh, rectilinear]
2 Layer Neural Network
• 2 Layers of Neurons – 1st Layer takes input x – 2nd Layer takes output of 1st layer
• Can approximate arbitrary functions – Provided hidden layer is large enough – "fat" 2-Layer Network
[Diagram: x feeds a hidden layer of Σ units (Non-Linear!), whose outputs feed a final Σ unit producing y]
Aside: Deep Neural Networks
• Why prefer Deep over a "Fat" 2-Layer? – Compact model (an equivalent "fat" model may be exponentially large) – Easier to train? – Discussed further in deep learning lectures
Image Source: http://blog.peltarion.com/2014/06/22/deep-learning-and-deep-neural-networks-in-synapse/
Training Neural Networks
• Gradient Descent! – Even for Deep Networks* (*more complicated)
• Parameters: (w11, b11, w12, b12, w2, b2)

[Diagram: x feeds two hidden Σ units, whose outputs feed a final Σ unit producing y]

f(x|w,b) = w^T x − b,  y = σ( f(x) )

∂_{w2} Σ_{i=1}^N L(y_i, σ_2) = Σ_{i=1}^N ∂_{w2} L(y_i, σ_2) = Σ_{i=1}^N ∂_{σ_2} L(y_i, σ_2) ∂_{w2} σ_2 = Σ_{i=1}^N ∂_{σ_2} L(y_i, σ_2) ∂_{f_2} σ_2 ∂_{w2} f_2

∂_{w1m} Σ_{i=1}^N L(y_i, σ_2) = Σ_{i=1}^N ∂_{σ_2} L(y_i, σ_2) ∂_{f_2} σ_2 ∂_{w1m} f_2 = Σ_{i=1}^N ∂_{σ_2} L(y_i, σ_2) ∂_{f_2} σ_2 ∂_{σ_1m} f_2 ∂_{f_1m} σ_{1m} ∂_{w1m} f_{1m}

Backpropagation = Gradient Descent (lots of chain rules)
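The chain rule above can be sketched for a tiny 2-layer network (one scalar input, sigmoid units, squared loss) and checked against a finite difference; all parameter values below are arbitrary illustrative choices:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, params):
    """Two-layer net: hidden sigmoid units feeding one sigmoid output unit."""
    w1, b1, w2, b2 = params                      # w1, b1 per hidden unit; w2, b2 output
    h = [sigmoid(w * x - b) for w, b in zip(w1, b1)]
    f2 = sum(wj * hj for wj, hj in zip(w2, h)) - b2
    return h, sigmoid(f2)

def loss(y, out):
    return 0.5 * (y - out) ** 2

def backprop_w2(x, y, params):
    """dL/dw2 via the chain rule: dL/dsigma2 * dsigma2/df2 * df2/dw2_m."""
    h, out = forward(x, params)
    d_out = -(y - out)                           # dL/d(sigma_2)
    d_f2 = d_out * out * (1.0 - out)             # sigma'(f2) = sigma2 * (1 - sigma2)
    return [d_f2 * hj for hj in h]               # df2/dw2_m = h_m

# Finite-difference check on one output weight
params = ([0.5, -0.3], [0.1, 0.2], [0.7, -0.4], 0.05)
x, y = 1.3, 1.0
g = backprop_w2(x, y, params)
eps = 1e-6
w2p = [params[2][0] + eps, params[2][1]]
w2m = [params[2][0] - eps, params[2][1]]
lp = loss(y, forward(x, (params[0], params[1], w2p, params[3]))[1])
lm = loss(y, forward(x, (params[0], params[1], w2m, params[3]))[1])
assert abs((lp - lm) / (2 * eps) - g[0]) < 1e-5
```

The hidden-layer gradient follows the same pattern with two more chain-rule factors, exactly as in the slide's second equation.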
Today
• Beyond Basic Linear Models – Support Vector Machines – Logistic Regression – Feed-forward Neural Networks – Different ways to interpret models
• Different Evaluation Metrics
Evaluation

• 0/1 Loss (Classification)
• Squared Loss (Regression)
• Anything Else?
Example: Cancer Prediction
Loss Function (Model × Patient)   Has Cancer    Doesn't Have Cancer
Predicts Cancer                   Low           Medium
Predicts No Cancer                OMG Panic!    Low
• Value Positives & Negatives Differently – Care much more about positives
• "Cost Matrix" – 0/1 Loss is Special Case
Optimizing for Cost-Sensitive Loss
• There is no universally accepted way.
Simplest Approach (Cost Balancing):
Loss Function                     Has Cancer    Doesn't Have Cancer
Predicts Cancer                   0             1
Predicts No Cancer                1000          0
argmin_{w,b} [ 1000 Σ_{i: y_i = 1} L(y_i, f(x_i|w,b)) + Σ_{i: y_i = −1} L(y_i, f(x_i|w,b)) ]
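The cost-balancing objective is direct to compute; here with the slide's 1000:1 weighting and 0/1 loss (the toy data and always-negative classifier below are illustrative):

```python
def cost_balanced_loss(data, predict, loss, pos_weight=1000.0):
    """Weighted empirical loss: positive examples count pos_weight times as much."""
    total = 0.0
    for x, y in data:
        weight = pos_weight if y == 1 else 1.0
        total += weight * loss(y, predict(x))
    return total

def zero_one(y, y_hat):
    """0/1 loss."""
    return 0.0 if y == y_hat else 1.0
```

A classifier that always predicts −1 pays 1000 per missed positive, so minimizing this objective pushes the model toward catching positives.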
Precision & Recall
• Precision = TP/(TP + FP) • Recall = TP/(TP + FN)
• TP = True Positive, TN = True Negative • FP = False Positive, FN = False Negative
Counts (Model × Patient)          Has Cancer    Doesn't Have Cancer
Predicts Cancer                   20 (TP)       30 (FP)
Predicts No Cancer                5 (FN)        70 (TN)

Care More About Positives!
F1 = 2/(1/P+ 1/R)
Image Source: http://pmtk3.googlecode.com/svn-history/r785/trunk/docs/demos/Decision_theory/PRhand.html
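The definitions above, applied to the confusion-matrix counts on this slide:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 (harmonic mean of P and R) from confusion counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2.0 / (1.0 / p + 1.0 / r)
    return p, r, f1

# Counts from the slide: TP=20, FP=30, FN=5 (TN=70 is not needed for P/R/F1)
p, r, f1 = precision_recall_f1(20, 30, 5)
```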
Example: Search Query
• Rank webpages by relevance
• Predict a Ranking (of webpages) – Users only look at top 4 – Sort by f(x|w,b)
• Precision @4 = 1/2 – Fraction of top 4 relevant
• Recall @4 = 2/3 – Fraction of relevant in top 4
• Top of Ranking Only!
Ranking Measures
Image Source: http://pmtk3.googlecode.com/svn-history/r785/trunk/docs/demos/Decision_theory/PRhand.html
Top 4
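Precision@K and Recall@K are simple to compute from a 0/1 relevance list sorted by f(x|w,b); the example ranking below is illustrative, chosen to reproduce the slide's P@4 = 1/2 and R@4 = 2/3:

```python
def precision_at_k(relevance, k):
    """Fraction of the top k results that are relevant (relevance labels 0/1)."""
    return sum(relevance[:k]) / k

def recall_at_k(relevance, k):
    """Fraction of all relevant results that appear in the top k."""
    return sum(relevance[:k]) / sum(relevance)

# Ranking sorted by score; 1 = relevant. 2 of the top 4 are relevant, 3 overall.
ranking = [1, 0, 1, 0, 0, 1]
```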
Pairwise Preferences
2 Pairwise Disagreements 4 Pairwise Agreements
ROC-Area

• ROC-Area – Area under ROC Curve – Fraction of pairwise agreements
• Example:
Image Source: http://www.medcalc.org/manual/roc-curves.php
ROC-Area: 0.5 (#Pairwise Preferences = 6, #Agreements = 3)
Average Precision
• Average Precision – Area under P-R Curve – P@K for each positive
• Example:
Image Source: http://pmtk3.googlecode.com/svn-history/r785/trunk/docs/demos/Decision_theory/PRhand.html
AP: (1/3) · (1/1 + 2/3 + 3/5) ≈ 0.76

Precision at Rank Location of Each Positive Example
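Average precision averages P@k over the rank positions of the positives; the list in the test reproduces the slide's example with positives at ranks 1, 3, and 5:

```python
def average_precision(relevance):
    """Mean of precision@k evaluated at each relevant rank position."""
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k   # precision@k at this positive's rank
    return total / hits
```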
ROC-Area versus Average Precision

• ROC-Area cares about every pairwise preference equally
• Average Precision cares more about the top of the ranking

ROC-Area: 0.5, Average Precision: 0.76
ROC-Area: 0.5, Average Precision: 0.64
Summary: Evaluation Measures

• Different Evaluation Measures – Different Scenarios
• Large focus on getting positives – Large cost of mis-predicting cancer – Relevant webpages are rare
• Aka "Class Imbalance"
Next Lecture
• Regularization
• Lasso
• Tonight: – Recitation on Matrix Linear Algebra & Calculus