04 Logistic Regression (Colorado State University, CS545 Fall 2018)
Linear models: Logistic regression
Chapter 3.3
Predicting probabilities
Objective: learn to predict a probability P(y | x) for a binary classification problem using a linear classifier. The target function: for positive examples P(y = +1 | x) = 1, whereas for negative examples P(y = +1 | x) = 0.
The Target Function is Inherently Noisy
$$f(x) = P[y = +1 \mid x].$$

The data is generated from a noisy target function:

$$P(y \mid x) = \begin{cases} f(x) & \text{for } y = +1; \\ 1 - f(x) & \text{for } y = -1. \end{cases}$$
Can we assume that P(y = +1 | x) is linear in x? Not directly: a linear signal $w^t x$ is unbounded, while a probability must lie in [0, 1], so we will need to squash the signal.
Logistic regression
The signal

$$s = w^t x = \sum_{i=0}^{d} w_i x_i$$

is the basis for several linear models:

- Linear classification: $h(x) = \mathrm{sign}(s)$
- Linear regression: $h(x) = s$
- Logistic regression: $h(x) = \theta(s)$

[Figure: three network diagrams, one per model, each with inputs $x_0, x_1, \ldots, x_d$ feeding the signal $s$; plot of $\theta(s)$ rising from 0 to 1.]

The logistic function (aka squashing function):

$$\theta(s) = \frac{e^s}{1 + e^s}$$
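To make the three models concrete, here is a minimal sketch (helper names are mine, not from the slides) of how the same linear signal feeds all three:

```python
import numpy as np

def signal(w, x):
    """Linear signal s = w^t x (x includes the bias coordinate x0 = 1)."""
    return np.dot(w, x)

def theta(s):
    """Logistic (squashing) function: theta(s) = e^s / (1 + e^s)."""
    return 1.0 / (1.0 + np.exp(-s))

w = np.array([0.5, -1.0, 2.0])   # example weights (w0 is the bias weight)
x = np.array([1.0, 0.3, 0.8])    # example input with x0 = 1

s = signal(w, x)
print(np.sign(s))   # linear classification: h(x) in {-1, +1}
print(s)            # linear regression:     h(x) in R
print(theta(s))     # logistic regression:   h(x) in (0, 1)
```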
Properties of the logistic function
Predicting a Probability

Will someone have a heart attack over the next year?

age 62 years
gender male
blood sugar 120 mg/dL
40,000
HDL 50
LDL 120
Mass 190 lbs
Height 5′ 10″
. . .

Classification: Yes/No. Logistic regression: the likelihood of a heart attack, $y \in [0, 1]$.

$$h(x) = \theta\!\left( \sum_{i=0}^{d} w_i x_i \right) = \theta(w^t x)$$

$$\theta(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}, \qquad \theta(-s) = \frac{e^{-s}}{1 + e^{-s}} = \frac{1}{1 + e^s} = 1 - \theta(s).$$
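A quick numeric sanity check of these identities (a minimal sketch; the two-branch evaluation is a standard stability trick, not from the slides):

```python
import numpy as np

def theta(s):
    """Numerically stable logistic function theta(s) = 1 / (1 + e^{-s})."""
    s = np.asarray(s, dtype=float)
    out = np.empty_like(s)
    pos = s >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-s[pos]))                 # avoids overflow for large s
    out[~pos] = np.exp(s[~pos]) / (1.0 + np.exp(s[~pos]))    # avoids overflow for very negative s
    return out

s = np.linspace(-10, 10, 1001)
assert np.allclose(theta(-s), 1.0 - theta(s))    # theta(-s) = 1 - theta(s)
assert np.all((theta(s) > 0) & (theta(s) < 1))   # outputs are valid probabilities
```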
Predicting probabilities
Fitting the data means finding a good hypothesis h
What Makes an h Good?

h is good if: $h(x_n) \approx 1$ whenever $y_n = +1$, and $h(x_n) \approx 0$ whenever $y_n = -1$.

A simple error measure that captures this:

$$E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} \left( h(x_n) - \tfrac{1}{2}(1 + y_n) \right)^2.$$

This is not very convenient (hard to minimize).
The Probabilistic Interpretation

Suppose that $h(x) = \theta(w^t x)$ closely captures $P[+1 \mid x]$:

$$P(y \mid x) = \begin{cases} \theta(w^t x) & \text{for } y = +1; \\ 1 - \theta(w^t x) & \text{for } y = -1. \end{cases}$$
Since $1 - \theta(s) = \theta(-s)$, we can rewrite this as

$$P(y \mid x) = \begin{cases} \theta(w^t x) & \text{for } y = +1; \\ \theta(-w^t x) & \text{for } y = -1; \end{cases}$$

or, more compactly:

$$P(y \mid x) = \theta(y \cdot w^t x)$$
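A two-line check (a sketch, reusing a hypothetical `theta` helper) that the compact form reproduces both cases:

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

s = np.linspace(-5, 5, 101)                        # s = w^t x
assert np.allclose(theta(+1 * s), theta(s))        # y = +1: theta(w^t x)
assert np.allclose(theta(-1 * s), 1.0 - theta(s))  # y = -1: 1 - theta(w^t x) = theta(-w^t x)
```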
Is logistic regression really linear?
To see what the decision boundary looks like, write

$$P(y = +1 \mid x) = \frac{e^{w^t x}}{e^{w^t x} + 1}, \qquad P(y = -1 \mid x) = 1 - P(y = +1 \mid x) = \frac{1}{e^{w^t x} + 1}.$$

The boundary is where the two classes are equally likely. Setting

$$P(y = +1 \mid x) = P(y = -1 \mid x)$$

and solving for x, we get $e^{w^t x} = 1$, i.e. $w^t x = 0$. The decision boundary is a hyperplane, so logistic regression is indeed a linear classifier.
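A small numerical illustration (a sketch with made-up weights): points with $w^t x = 0$ get probability exactly 0.5.

```python
import numpy as np

def p_plus(w, x):
    """P(y = +1 | x) = e^{w^t x} / (e^{w^t x} + 1)."""
    s = np.dot(w, x)
    return np.exp(s) / (np.exp(s) + 1.0)

w = np.array([1.0, 2.0, -1.0])                # hypothetical weights (w0 = bias)
x_on_boundary = np.array([1.0, 0.0, 1.0])     # w^t x = 1 + 0 - 1 = 0
print(p_plus(w, x_on_boundary))               # 0.5: exactly on the decision boundary

x_positive_side = np.array([1.0, 1.0, 0.0])   # w^t x = 3 > 0
print(p_plus(w, x_positive_side))             # > 0.5: classified as +1
```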
Maximum likelihood
We will find w using the principle of maximum likelihood.
The Likelihood

$$P(y \mid x) = \theta(y \cdot w^t x)$$

Recall: $(x_1, y_1), \ldots, (x_N, y_N)$ are independently generated.

Likelihood: the probability of getting the $y_1, \ldots, y_N$ in $\mathcal{D}$ from the corresponding $x_1, \ldots, x_N$:

$$P(y_1, \ldots, y_N \mid x_1, \ldots, x_N) = \prod_{n=1}^{N} P(y_n \mid x_n),$$

which is valid since the examples are independent.

The likelihood measures the probability that the data were generated if f were h.
Maximizing the likelihood
Maximizing The Likelihood (why?)

$$\begin{aligned}
\max \prod_{n=1}^{N} P(y_n \mid x_n)
&\iff \max \ln\Big( \prod_{n=1}^{N} P(y_n \mid x_n) \Big) \\
&\equiv \max \sum_{n=1}^{N} \ln P(y_n \mid x_n) \\
&\iff \min -\frac{1}{N} \sum_{n=1}^{N} \ln P(y_n \mid x_n) \\
&\equiv \min \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{P(y_n \mid x_n)} \\
&\equiv \min \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{\theta(y_n \cdot w^t x_n)} \quad \leftarrow \text{we specialize to our ``model'' here} \\
&\equiv \min \frac{1}{N} \sum_{n=1}^{N} \ln\big(1 + e^{-y_n \cdot w^t x_n}\big)
\end{aligned}$$

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\big(1 + e^{-y_n \cdot w^t x_n}\big)$$
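A numeric check of this chain (a sketch on random data; `theta` is a hypothetical helper): the negative mean log-likelihood equals the $\ln(1 + e^{-y\,w^t x})$ form.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # x0 = 1 bias column
y = rng.choice([-1.0, 1.0], size=N)
w = rng.normal(size=d + 1)

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

s = X @ w
neg_mean_loglik = -np.mean(np.log(theta(y * s)))     # -(1/N) sum ln P(y_n | x_n)
cross_entropy   = np.mean(np.log1p(np.exp(-y * s)))  # (1/N) sum ln(1 + e^{-y_n w^t x_n})
assert np.isclose(neg_mean_loglik, cross_entropy)
```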
Maximizing the likelihood
Summary: maximizing the likelihood is equivalent to minimizing the cross-entropy error:

$$-\frac{1}{N} \ln\Big( \prod_{n=1}^{N} \theta(y_n\, w^t x_n) \Big) = \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{\theta(y_n\, w^t x_n)} \qquad \Big[\ \theta(s) = \frac{1}{1 + e^{-s}}\ \Big]$$

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \underbrace{\ln\big(1 + e^{-y_n w^t x_n}\big)}_{e(h(x_n),\, y_n)}$$
Exercise: check that this is equivalent to

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \Big[ \mathbb{I}(y_n = +1) \ln \frac{1}{h(x_n)} + \mathbb{I}(y_n = -1) \ln \frac{1}{1 - h(x_n)} \Big]$$
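A quick numerical verification of the exercise (a sketch with random data; helper names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.choice([-1.0, 1.0], size=N)
w = rng.normal(size=d + 1)

h = 1.0 / (1.0 + np.exp(-(X @ w)))                # h(x_n) = theta(w^t x_n)

form_a = np.mean(np.log1p(np.exp(-y * (X @ w))))  # (1/N) sum ln(1 + e^{-y_n w^t x_n})
form_b = np.mean(np.where(y == 1, np.log(1.0 / h), np.log(1.0 / (1.0 - h))))
assert np.isclose(form_a, form_b)                 # the two error forms agree
```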
Digression: information theory
I am thinking of an integer between 0 and 1,023. You want to guess it using as few questions as possible. Most of us would ask "is it between 0 and 511?", which splits the range exactly in half. This is a good strategy because it provides the most information about the unknown number: it reveals the first binary digit. Initially you need to obtain log2(1024) = 10 bits of information; after the first question you only need log2(512) = 9 bits.
Information and Entropy
By halving the search space we obtained one bit. In general, the information associated with a probabilistic outcome of probability p is

$$I(p) = -\log p$$

Why the logarithm? Assume we have two independent events x and y. We would like the information they carry to be additive. Let's check:

$$I(x, y) = -\log P(x, y) = -\log P(x)P(y) = -\log P(x) - \log P(y) = I(x) + I(y)$$

Entropy is the expected information:

$$H(P) = -\sum_{x} P(x) \log P(x)$$
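A minimal sketch of these definitions in code (log base 2, so information is measured in bits; helper names are mine):

```python
import numpy as np

def information(p):
    """Information (in bits) of an outcome with probability p: I(p) = -log2 p."""
    return -np.log2(p)

def entropy(P):
    """Entropy H(P) = -sum_x P(x) log2 P(x) of a discrete distribution."""
    P = np.asarray(P, dtype=float)
    P = P[P > 0]                          # 0 * log 0 is taken to be 0
    return -np.sum(P * np.log2(P))

print(information(1 / 1024))              # 10.0 bits: the guessing game's starting uncertainty
print(entropy(np.ones(1024) / 1024))      # 10.0 bits: uniform distribution over 1024 values
# Additivity for independent events: I(P(x)P(y)) = I(P(x)) + I(P(y))
assert np.isclose(information(0.5 * 0.25), information(0.5) + information(0.25))
```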
Entropy
For a Bernoulli random variable:

$$H(p) = -p \log p - (1 - p) \log(1 - p)$$

This is maximal when p = ½.
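A quick check (a sketch) that the Bernoulli entropy peaks at p = ½:

```python
import numpy as np

def bernoulli_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), in bits."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p = np.linspace(0.001, 0.999, 999)
H = bernoulli_entropy(p)
print(p[np.argmax(H)])          # ~0.5: entropy is maximal for a fair coin
print(bernoulli_entropy(0.5))   # 1.0 bit
```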
KL divergence
The KL divergence between distributions P and Q:

$$D_{KL}(P \,\|\, Q) = -\sum_{x} P(x) \log \frac{Q(x)}{P(x)}$$

Properties:
- Non-negative, and equal to 0 iff P = Q
- It is not symmetric
KL divergence
Expanding the definition:

$$D_{KL}(P \,\|\, Q) = -\sum_{x} P(x) \log Q(x) + \sum_{x} P(x) \log P(x)$$

that is, cross entropy minus entropy.
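A sketch verifying the decomposition numerically on two made-up distributions:

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

kl = -np.sum(P * np.log(Q / P))             # D_KL(P || Q)
cross_entropy = -np.sum(P * np.log(Q))      # H(P, Q)
entropy = -np.sum(P * np.log(P))            # H(P)

assert np.isclose(kl, cross_entropy - entropy)   # KL = cross entropy - entropy
assert kl >= 0                                   # non-negativity
assert np.isclose(-np.sum(P * np.log(P / P)), 0.0)  # D_KL(P || P) = 0
```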
Cross entropy and logistic regression
The logistic regression cost function is the average cross entropy between the learned distribution P(y | x) and the observed labels.

Cross entropy:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$

And for binary variables:

$$H(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$$

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \Big[ \mathbb{I}(y_n = +1) \ln \frac{1}{h(x_n)} + \mathbb{I}(y_n = -1) \ln \frac{1}{1 - h(x_n)} \Big]$$
In-sample error
The in-sample error for logistic regression:

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\big(1 + e^{-y_n w^t x_n}\big)$$

Its gradient:

$$\nabla E_{\text{in}}(w) = -\frac{1}{N} \sum_{n=1}^{N} \frac{y_n x_n}{1 + e^{\,y_n w^t x_n}}$$

Gradient descent minimizes this cross-entropy error: initialize w(0), then for t = 0, 1, 2, ...:

$$w(t+1) = w(t) - \eta\, \nabla E_{\text{in}}(w(t))$$
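A sketch of the error and its gradient in code (random data; a finite-difference check confirms the gradient formula; names are mine):

```python
import numpy as np

def ein(w, X, y):
    """E_in(w) = (1/N) sum ln(1 + e^{-y_n w^t x_n})."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def grad_ein(w, X, y):
    """Gradient: -(1/N) sum y_n x_n / (1 + e^{y_n w^t x_n})."""
    coef = y / (1.0 + np.exp(y * (X @ w)))      # shape (N,)
    return -(X * coef[:, None]).mean(axis=0)    # shape (d+1,)

rng = np.random.default_rng(2)
N, d = 50, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.choice([-1.0, 1.0], size=N)
w = rng.normal(size=d + 1)

# Finite-difference check of the first coordinate of the gradient
eps = 1e-6
e0 = np.zeros(d + 1); e0[0] = 1.0
fd = (ein(w + eps * e0, X, y) - ein(w - eps * e0, X, y)) / (2 * eps)
assert np.isclose(fd, grad_ein(w, X, y)[0], atol=1e-6)
```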
Digression: gradient ascent/descent
Topographical maps can give us intuition on how to optimize a cost function
Image credits: http://www.csus.edu/indiv/s/slaymaker/archives/geol10l/shield1.jpg and http://www.sir-ray.com/touro/IMG_0001_NEW.jpg
Digression: gradient descent
Images from http://en.wikipedia.org/wiki/Gradient_descent

Given a function E(w), the gradient is the direction of steepest ascent. Therefore, to minimize E(w), take a step in the direction of the negative of the gradient.

Notice that the gradient is perpendicular to contours of equal E(w).
Gradient descent
Gradient descent is an iterative process. How do we pick the direction of each step, and its size?

Our Ein Has Only One Valley

[Figure: in-sample error $E_{\text{in}}$ as a function of the weights w: a single convex valley.]

. . . because $E_{\text{in}}(w)$ is a convex function of w. (So who cares if it looks ugly!)

How to "Roll Down"?

Assume you are at weights w(t) and you take a step of size η in the direction v:

$$w(t+1) = w(t) + \eta v$$

We get to pick v. What's the best direction to take the step? Pick v to make $E_{\text{in}}(w(t+1))$ as small as possible.
Gradient descent
The gradient is the best direction to take to optimize Ein(w):
The Gradient is the Fastest Way to Roll Down

Approximating the change in $E_{\text{in}}$:

$$\begin{aligned}
\Delta E_{\text{in}} &= E_{\text{in}}(w(t+1)) - E_{\text{in}}(w(t)) \\
&= E_{\text{in}}(w(t) + \eta v) - E_{\text{in}}(w(t)) \\
&= \eta\, \nabla E_{\text{in}}(w(t))^t v + O(\eta^2) \quad \text{(Taylor approximation)} \\
&\gtrsim -\eta\, \| \nabla E_{\text{in}}(w(t)) \|
\end{aligned}$$

For a unit vector v, the first-order term is minimized (and the bound attained) at

$$v = -\frac{\nabla E_{\text{in}}(w(t))}{\| \nabla E_{\text{in}}(w(t)) \|}$$

The best (steepest) direction to move is the negative gradient.
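A quick numerical illustration (a sketch on random data; helper definitions are mine): the normalized negative gradient should decrease $E_{\text{in}}$ at least as much as any random unit direction, up to the $O(\eta^2)$ term.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 50, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.choice([-1.0, 1.0], size=N)
w = rng.normal(size=d + 1)

def ein(w):
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

g = -(X * (y / (1.0 + np.exp(y * (X @ w))))[:, None]).mean(axis=0)  # grad E_in(w)

eta = 0.01
v_star = -g / np.linalg.norm(g)                  # steepest-descent direction
delta_star = ein(w + eta * v_star) - ein(w)      # change along v*

deltas = []
for _ in range(100):                             # random unit directions for comparison
    v = rng.normal(size=d + 1)
    deltas.append(ein(w + eta * v / np.linalg.norm(v)) - ein(w))

print(delta_star, min(deltas))  # delta_star should be the smallest (most negative) change
```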
Choosing the step size
The choice of the step size affects the rate of convergence:
The 'Goldilocks' Step Size

[Figure: $E_{\text{in}}$ versus the weights w for three runs. η too small (η = 0.1, 75 steps) creeps down slowly; η too large (η = 2, 10 steps) jumps back and forth across the valley; a variable $\eta_t$ (10 steps) is just right.]

Let's use a variable learning rate:

$$w(t+1) = w(t) + \eta_t v, \qquad \eta_t = \eta \cdot \| \nabla E_{\text{in}}(w(t)) \|$$

When approaching the minimum, $\| \nabla E_{\text{in}}(w(t)) \| \to 0$, so the steps shrink automatically.
With this variable learning rate, the step itself simplifies to a fixed-rate update:

$$\eta_t v = -\eta \cdot \| \nabla E_{\text{in}}(w(t)) \| \cdot \frac{\nabla E_{\text{in}}(w(t))}{\| \nabla E_{\text{in}}(w(t)) \|} = -\eta\, \nabla E_{\text{in}}(w(t))$$
The final form of gradient descent
Gradient descent with a fixed learning rate:

$$w(t+1) = w(t) - \eta\, \nabla E_{\text{in}}(w(t))$$
Logistic regression using gradient descent
We will use gradient descent to minimize our error function. Fortunately, the logistic regression error function has a single global minimum:
Finding The Best Weights - Hill Descent

A ball on a complicated hilly terrain rolls down to a local valley; this is called a local minimum.

Questions: How do we get to the bottom of the deepest valley? How do we do this when we don't have gravity?

For logistic regression we are lucky: $E_{\text{in}}(w)$ is convex, so it has only one valley, and we don't need to worry about getting stuck in local minima.
Logistic regression using gradient descent

Putting it all together, this is called batch gradient descent:
Fixed Learning Rate Gradient Descent

1: Initialize at step t = 0 to w(0).
2: for t = 0, 1, 2, . . . do
3: Compute the gradient $g_t = \nabla E_{\text{in}}(w(t))$. (Ex. 3.7 in LFD)
4: Move in the direction $v_t = -g_t$.
5: Update the weights: $w(t+1) = w(t) + \eta v_t$.
6: Iterate 'until it is time to stop'.
7: end for
8: Return the final weights.

Gradient descent can minimize any smooth function; for logistic regression,

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\big(1 + e^{-y_n \cdot w^t x_n}\big)$$
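A minimal, self-contained sketch of batch gradient descent for logistic regression along these lines (synthetic data; the learning rate, iteration cap, and stopping rule are my own choices, not from the slides):

```python
import numpy as np

def grad_ein(w, X, y):
    """Gradient of E_in: -(1/N) sum y_n x_n / (1 + e^{y_n w^t x_n})."""
    coef = y / (1.0 + np.exp(y * (X @ w)))
    return -(X * coef[:, None]).mean(axis=0)

def logistic_regression_gd(X, y, eta=0.1, max_iters=10000, tol=1e-6):
    """Batch gradient descent: w(t+1) = w(t) - eta * grad E_in(w(t))."""
    w = np.zeros(X.shape[1])               # initialize w(0)
    for t in range(max_iters):
        g = grad_ein(w, X, y)              # gradient over the whole batch
        if np.linalg.norm(g) < tol:        # stop when the gradient is tiny
            break
        w = w - eta * g
    return w

# Synthetic data generated from a logistic model
rng = np.random.default_rng(4)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
true_w = np.array([0.5, 2.0, -1.0])
y = np.where(rng.random(N) < 1 / (1 + np.exp(-(X @ true_w))), 1.0, -1.0)

w = logistic_regression_gd(X, y)
print(np.mean(np.sign(X @ w) == y))        # training accuracy of the learned model
```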
Logistic regression
Comments:
• In practice, logistic regression is solved by faster methods than gradient descent (e.g., Newton's method).
• There is an extension to multi-class classification (softmax, aka multinomial logistic regression).
Stochastic gradient descent
A variation on gradient descent that considers the error for a single training example at a time. It tends to converge faster than the batch version.
Stochastic Gradient Descent (SGD)

A variation of GD that considers only the error on one data point:

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\big(1 + e^{-y_n \cdot w^t x_n}\big) = \frac{1}{N} \sum_{n=1}^{N} e(w, x_n, y_n)$$

• Pick a random data point $(x^*, y^*)$
• Run an iteration of GD on $e(w, x^*, y^*)$:

$$w(t+1) \leftarrow w(t) - \eta \nabla_w\, e(w, x^*, y^*)$$

1. The 'average' move is the same as GD;
2. Computation: a fraction 1/N cheaper per step;
3. Stochastic: helps escape local minima;
4. Simple;
5. Similar to PLA.

For logistic regression:

$$w(t+1) \leftarrow w(t) + y^* x^* \left( \frac{\eta}{1 + e^{\,y^* w^t x^*}} \right)$$

(Recall PLA: $w(t+1) \leftarrow w(t) + y^* x^*$)
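A minimal sketch of this SGD update for logistic regression (synthetic data; the step count and learning rate are arbitrary choices of mine):

```python
import numpy as np

def sgd_logistic(X, y, eta=0.1, steps=20000, seed=0):
    """SGD: each step updates w using one random example (x*, y*)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        n = rng.integers(len(y))           # pick a random data point
        xs, ys = X[n], y[n]
        # Gradient step on e(w, x*, y*) = ln(1 + e^{-y* w^t x*}):
        w = w + ys * xs * (eta / (1.0 + np.exp(ys * np.dot(w, xs))))
    return w

# Synthetic data generated from a logistic model
rng = np.random.default_rng(5)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
true_w = np.array([0.5, 2.0, -1.0])
y = np.where(rng.random(N) < 1 / (1 + np.exp(-(X @ true_w))), 1.0, -1.0)

w = sgd_logistic(X, y)
print(np.mean(np.sign(X @ w) == y))        # training accuracy
```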
Summary of linear models
Linear methods for classification and regression:
The Linear Signal

$$s = w^t x \;\longrightarrow\; \begin{cases} \mathrm{sign}(w^t x) \in \{-1, +1\} & \text{linear classification} \\ w^t x \in \mathbb{R} & \text{linear regression} \\ \theta(w^t x) \in [0, 1] & \text{logistic regression} \end{cases}$$
More to come!