CSE 446 Logistic Regression
Winter 2012
Dan Weld
Some slides from Carlos Guestrin, Luke Zettlemoyer
Gaussian Naïve Bayes
P(Xi = x | Y = yk) = N(μik, σik)

P(Y | X) ∝ P(X | Y) P(Y)

N(μik, σik) = 1 / (σik √(2π)) · exp( −(x − μik)² / (2 σik²) )

Sometimes assume the variance
– is independent of Y (i.e., σi),
– or independent of Xi (i.e., σk),
– or both (i.e., σ)
3
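To make the estimation step concrete, here is a minimal Python sketch (not from the slides) of Gaussian Naïve Bayes; the array layout and function names are illustrative assumptions.

    import numpy as np

    def fit_gnb(X, y):
        """Estimate class priors P(Y=yk) and the Gaussian parameters (mu_ik, sigma_ik)."""
        params = {}
        for k in np.unique(y):
            Xk = X[y == k]
            params[k] = (len(Xk) / len(X),          # P(Y = yk)
                         Xk.mean(axis=0),           # mu_ik for every feature i
                         Xk.std(axis=0) + 1e-9)     # sigma_ik (small floor avoids division by zero)
        return params

    def predict_gnb(params, x):
        """Return argmax_k of log P(yk) + sum_i log N(x_i; mu_ik, sigma_ik)  (Bayes rule, unnormalized)."""
        scores = {}
        for k, (prior, mu, sigma) in params.items():
            log_lik = -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + ((x - mu) / sigma)**2)
            scores[k] = np.log(prior) + log_lik
        return max(scores, key=scores.get)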
Generative vs. Discriminative Classifiers
• Want to learn h: X → Y
  – X – features
  – Y – target classes
• Bayes optimal classifier – P(Y|X)
• Generative classifier, e.g., Naïve Bayes:
  – Assume some functional form for P(X|Y), P(Y)
  – Estimate parameters of P(X|Y), P(Y) directly from training data
  – Use Bayes rule to calculate P(Y | X = x)
  – This is a ‘generative’ model
    • Indirect computation of P(Y|X) through Bayes rule: P(Y | X) ∝ P(X | Y) P(Y)
    • As a result, can also generate a sample of the data, since P(X) = Σy P(y) P(X|y)
• Discriminative classifiers, e.g., Logistic Regression:
  – Assume some functional form for P(Y|X)
  – Estimate parameters of P(Y|X) directly from training data
  – This is the ‘discriminative’ model
    • Directly learn P(Y|X)
    • But cannot obtain a sample of the data, because P(X) is not available
4
Univariate Linear Regression
hw(x) = w1x + w0
Loss(hw) = Σj=1..n L2(yj, hw(xj)) = Σj=1..n (yj − hw(xj))² = Σj=1..n (yj − (w1 xj + w0))²
5
Understanding Weight Space
Loss(hw) = Σj=1..n (yj − (w1 xj + w0))²
hw(x) = w1x + w0
6
Understanding Weight Space
hw(x) = w1x + w0
Loss(hw) = Σj=1..n (yj − (w1 xj + w0))²
7
Finding Minimum Loss
hw(x) = w1x + w0
Loss(hw) = Σj=1..n (yj − (w1 xj + w0))²

argminw Loss(hw):
∂Loss(hw)/∂w0 = 0
∂Loss(hw)/∂w1 = 0
Unique Solution!
hw(x) = w1x + w0
argminw Loss(hw):
w1 = [ N Σj (xj yj) − (Σj xj)(Σj yj) ] / [ N Σj xj² − (Σj xj)² ]
w0 = [ Σj yj − w1 Σj xj ] / N
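As a quick illustration (not from the slides), the closed-form solution can be computed directly; the toy data below is an assumption made for demonstration.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])    # toy inputs (assumed)
    y = np.array([2.1, 3.9, 6.2, 8.1])    # toy targets (assumed)
    N = len(x)

    # Closed-form minimizer of sum_j (yj - (w1*xj + w0))^2
    w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
    w0 = (np.sum(y) - w1 * np.sum(x)) / N
    print(w1, w0)    # slope near 2, intercept near 0 for this toy data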
Could also Solve Iteratively
argminw Loss(hw), computed iteratively:

w := any point in weight space
Loop until convergence:
  For each wi in w do
    wi := wi − η ∂Loss(w)/∂wi
10
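Here is a minimal Python sketch of that loop for the univariate case (not from the slides); the learning rate, stopping tolerance, and iteration cap are illustrative assumptions.

    import numpy as np

    def fit_gradient_descent(x, y, eta=0.01, tol=1e-10, max_iters=100000):
        """Gradient descent on Loss(w) = sum_j (yj - (w1*xj + w0))^2."""
        w0, w1 = 0.0, 0.0                     # any point in weight space
        for _ in range(max_iters):
            err = y - (w1 * x + w0)
            g0 = -2.0 * np.sum(err)           # dLoss/dw0
            g1 = -2.0 * np.sum(err * x)       # dLoss/dw1
            w0, w1 = w0 - eta * g0, w1 - eta * g1
            if g0**2 + g1**2 < tol:           # converged: gradient is (nearly) zero
                break
        return w0, w1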
Multivariate Linear Regression
hw(xj) = w0 + Σi wi xj,i = Σi wi xj,i = wᵀ xj   (taking a dummy feature xj,0 = 1)

argminw Loss(hw): unique solution w* = (XᵀX)⁻¹ Xᵀ y, where X stacks the xj as rows and y stacks the yj
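A minimal numpy sketch of that normal-equation solution (not from the slides); the toy data matrix is an assumption, and np.linalg.lstsq is used instead of an explicit inverse for numerical stability.

    import numpy as np

    X = np.array([[2.0, 1.0],
                  [1.0, 3.0],
                  [4.0, 2.0],
                  [3.0, 5.0]])                  # toy inputs (assumed), one example per row
    y = np.array([5.0, 7.0, 8.0, 11.0])          # toy targets (assumed)

    Xb = np.hstack([np.ones((len(X), 1)), X])    # prepend the dummy feature xj,0 = 1 for w0
    # Least-squares solution; equivalent to w = (X^T X)^{-1} X^T y when X^T X is invertible
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    print(w)                                     # [w0, w1, w2]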
Problem….
11
Overfitting
Regularize!! Penalize high weights:

Loss(hw) = Σj=1..n (yj − (w1 xj + w0))² + Σi=1..k wi²      (L2 penalty)

Alternatively….

Loss(hw) = Σj=1..n (yj − (w1 xj + w0))² + Σi=1..k |wi|      (L1 penalty)
12
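A small sketch (not from the slides) of the two penalized losses; the regularization strength lam and the weight layout (w[0] holds the unpenalized intercept w0) are illustrative assumptions, since the slide writes the penalty without an explicit constant.

    import numpy as np

    def ridge_loss(w, X, y, lam=1.0):
        """Squared error plus an L2 penalty on w1..wk (w[0] is the intercept w0, left unpenalized)."""
        residuals = y - (X @ w[1:] + w[0])
        return np.sum(residuals**2) + lam * np.sum(w[1:]**2)

    def lasso_loss(w, X, y, lam=1.0):
        """Squared error plus an L1 penalty on w1..wk."""
        residuals = y - (X @ w[1:] + w[0])
        return np.sum(residuals**2) + lam * np.sum(np.abs(w[1:]))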
Regularization
[Figure: comparison of the L1 and L2 penalties]
13
Back to Classification
[Figure: P(edible | X) as a function of X; it equals 1 on one side of the decision boundary and 0 on the other]
14
Logistic Regression
Learn P(Y|X) directly! Assume a particular functional form.
A hard threshold (P(Y=1|X) jumping from 1 to 0 at the decision boundary) is not differentiable…
15
Logistic Regression
Learn P(Y|X) directly! Assume a particular functional form: the logistic function, 1 / (1 + e^(−z))
Aka Sigmoid
16
Logistic Function in n Dimensions
Sigmoid applied to a linear function of the data:
P(Y=1 | X, w) = exp(w0 + Σi wi Xi) / (1 + exp(w0 + Σi wi Xi)) = 1 / (1 + exp(−(w0 + Σi wi Xi)))

Features can be discrete or continuous!
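A minimal sketch of that functional form (not from the slides); the feature vector and weights in the example are assumptions.

    import numpy as np

    def p_y1_given_x(x, w0, w):
        """P(Y=1 | X=x, w): the sigmoid of the linear function w0 + sum_i wi*xi."""
        z = w0 + np.dot(w, x)
        return 1.0 / (1.0 + np.exp(-z))

    # Example with assumed weights and a two-feature input
    print(p_y1_given_x(np.array([1.5, -0.5]), w0=0.0, w=np.array([2.0, 1.0])))   # about 0.92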
Understanding Sigmoids
[Plots of the sigmoid 1/(1 + exp(−(w0 + w1 x))) for x from −6 to 6, for three weight settings: w0=0, w1=−1; w0=−2, w1=−1; w0=0, w1=−0.5]
18
Very convenient!

P(Y=0 | X, w) = 1 / (1 + exp(w0 + Σi wi Xi))   implies
P(Y=1 | X, w) = exp(w0 + Σi wi Xi) / (1 + exp(w0 + Σi wi Xi))   implies
ln [ P(Y=1 | X, w) / P(Y=0 | X, w) ] = w0 + Σi wi Xi   implies
linear classification rule: predict Y=1 exactly when w0 + Σi wi Xi > 0
19
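To see the equivalence concretely (not from the slides): thresholding the sigmoid at 0.5 gives exactly the same prediction as checking the sign of the linear function. The weights and input below are assumptions.

    import numpy as np

    def predict(x, w0, w):
        """Linear classification rule: predict Y=1 exactly when w0 + w.x > 0."""
        return int(w0 + np.dot(w, x) > 0)

    # Same answer as asking whether P(Y=1|x,w) = sigmoid(w0 + w.x) exceeds 0.5
    x, w0, w = np.array([0.3, -1.2]), 0.5, np.array([1.0, 1.0])
    assert predict(x, w0, w) == int(1.0 / (1.0 + np.exp(-(w0 + np.dot(w, x)))) > 0.5)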
Logistic regression more generally
Logistic regression in the more general case, where Y ∈ {y1, …, yR}:

for k < R:
P(Y=yk | X, w) = exp(wk0 + Σi wki Xi) / (1 + Σj=1..R−1 exp(wj0 + Σi wji Xi))

for k = R (normalization, so no weights for this class):
P(Y=yR | X, w) = 1 / (1 + Σj=1..R−1 exp(wj0 + Σi wji Xi))
Features can be discrete or continuous!
20
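A minimal sketch of those class probabilities (not from the slides); the weight-matrix layout (one row of [wk0, wk1, …, wkn] per non-reference class) is an assumption made for illustration.

    import numpy as np

    def class_probs(x, W):
        """P(Y=yk | x, w) for k = 1..R, with class R as the unweighted reference class.

        W has shape (R-1, n+1); row k holds [wk0, wk1, ..., wkn].
        """
        scores = np.exp(W[:, 0] + W[:, 1:] @ x)         # exp(wk0 + sum_i wki*xi) for k < R
        denom = 1.0 + scores.sum()                       # shared normalizer
        return np.append(scores / denom, 1.0 / denom)    # last entry is class R

    # Example: R = 3 classes, n = 2 features, assumed weights
    W = np.array([[0.1, 1.0, -0.5],
                  [-0.2, 0.3, 0.8]])
    print(class_probs(np.array([1.0, 2.0]), W))          # entries sum to 1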
Loss Functions: Likelihood vs. Conditional Likelihood

Generative (Naïve Bayes) loss function: the data likelihood, Σj ln P(xj, yj | w)
Discriminative (Logistic Regression) loss function: the conditional data likelihood, Σj ln P(yj | xj, w)

Discriminative models can’t compute P(xj | w)! Or, put another way, “they don’t waste effort learning P(X)”: they focus only on P(Y|X), which is all that matters for classification.
21
Expressing Conditional Log Likelihood
l(w) = Σj [ yj ln P(Y=1 | xj, w) + (1 − yj) ln P(Y=0 | xj, w) ]
(the factor yj is 1 when the correct answer is 1 and multiplies the probability of predicting 1; the factor (1 − yj) is 1 when the correct answer is 0 and multiplies the probability of predicting 0)

ln P(Y=0 | X, w) = −ln(1 + exp(w0 + Σi wi Xi))
ln P(Y=1 | X, w) = w0 + Σi wi Xi − ln(1 + exp(w0 + Σi wi Xi))
Expressing Conditional Log Likelihood (continued)

ln P(Y=0 | X, w) = −ln(1 + exp(w0 + Σi wi Xi))
ln P(Y=1 | X, w) = w0 + Σi wi Xi − ln(1 + exp(w0 + Σi wi Xi))

Substituting into l(w) gives
l(w) = Σj [ yj (w0 + Σi wi xij) − ln(1 + exp(w0 + Σi wi xij)) ]
23
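A small sketch (not from the slides) that evaluates l(w) in the substituted form above; the toy data and weights are assumptions, and np.logaddexp is used to compute ln(1 + exp(·)) without overflow.

    import numpy as np

    def cond_log_likelihood(w0, w, X, y):
        """l(w) = sum_j [ yj*(w0 + w.xj) - ln(1 + exp(w0 + w.xj)) ], with yj in {0, 1}."""
        theta = w0 + X @ w
        return np.sum(y * theta - np.logaddexp(0.0, theta))

    X = np.array([[0.5, 1.0], [2.0, -1.0], [-1.0, 0.5]])   # toy data (assumed)
    y = np.array([1, 0, 1])
    print(cond_log_likelihood(0.0, np.array([1.0, -0.5]), X, y))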
Maximizing Conditional Log Likelihood
Bad news: no closed-form solution to maximize l(w)
Good news: l(w) is a concave function of w!
No suboptimal local optima: any local maximum is the global maximum
Concave functions are easy to optimize
24
Optimizing Concave Functions: Gradient Ascent

Conditional likelihood for logistic regression is concave! Find the optimum with gradient ascent.
Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading).

Gradient: ∇w l(w) = [ ∂l(w)/∂w0, …, ∂l(w)/∂wn ]
Learning rate, η > 0
Update rule: w ← w + η ∇w l(w), i.e., for each i, wi ← wi + η ∂l(w)/∂wi
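A minimal sketch of that update for logistic regression (not from the slides); it uses the standard gradient identity ∂l(w)/∂wi = Σj xij (yj − P(Y=1|xj, w)), and the learning rate, iteration count, and data layout are assumptions.

    import numpy as np

    def fit_logistic_gradient_ascent(X, y, eta=0.1, iters=1000):
        """Gradient ascent on the conditional log likelihood l(w)."""
        n, d = X.shape
        w0, w = 0.0, np.zeros(d)
        for _ in range(iters):
            p1 = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))   # P(Y=1 | xj, w) for every example
            err = y - p1                                # yj - P(Y=1 | xj, w)
            w0 += eta * np.sum(err)                     # dl/dw0
            w += eta * (X.T @ err)                      # dl/dwi = sum_j xij * err_j
        # l(w) is concave, so for a small enough eta this climbs toward the global optimum
        return w0, w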