Introduction to Machine Learning

Machine Learning: Jordan Boyd-Graber, University of Maryland
LOGISTIC REGRESSION FROM TEXT

Slides adapted from Emily Fox

Machine Learning: Jordan Boyd-Graber | UMD Introduction to Machine Learning | 1 / 18
Reminder: Logistic Regression

$$P(Y = 0 \mid X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \quad (1)$$

$$P(Y = 1 \mid X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i X_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \quad (2)$$

• Discriminative prediction: p(y | x)
• Classification uses: ad placement, spam detection
• What we didn't talk about is how to learn β from data
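Equations (1) and (2) can be sketched directly in code. This is a minimal illustration, not from the slides; the function names `posterior_y1` and `posterior_y0` are invented here.

```python
import math

def posterior_y1(beta0, beta, x):
    """Equation (2): P(Y = 1 | X) = exp(z) / (1 + exp(z)), z = beta0 + sum_i beta_i x_i."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return math.exp(z) / (1.0 + math.exp(z))

def posterior_y0(beta0, beta, x):
    """Equation (1): P(Y = 0 | X) = 1 / (1 + exp(z)); complements posterior_y1."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(z))

# With all-zero parameters, z = 0 and the two classes are equally likely.
print(posterior_y1(0.0, [0.0, 0.0], [1.0, 2.0]))  # -> 0.5
```

Note that the two probabilities always sum to one, as a proper conditional distribution over the two labels must.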
Logistic Regression: Objective Function

$$\mathcal{L} \equiv \ln p(Y \mid X, \beta) = \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \quad (3)$$

$$= \sum_j \left[ y^{(j)} \left(\beta_0 + \sum_i \beta_i x_i^{(j)}\right) - \ln\left(1 + \exp\left(\beta_0 + \sum_i \beta_i x_i^{(j)}\right)\right) \right] \quad (4)$$

Training data (y, x) are fixed. The objective function is a function of β . . . what values of β give a good value?
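Equation (4) can be evaluated on a toy dataset with a short sketch (illustrative only; `log_likelihood` and the data layout are assumptions of this example, not from the slides):

```python
import math

def log_likelihood(beta0, beta, data):
    """Equation (4): sum over examples of y_j * z_j - log(1 + exp(z_j)),
    where z_j = beta0 + beta . x^(j) and data is a list of (x, y) pairs."""
    total = 0.0
    for x, y in data:
        z = beta0 + sum(b * xi for b, xi in zip(beta, x))
        total += y * z - math.log(1.0 + math.exp(z))
    return total

# With all parameters zero, each example contributes -log 2 (probability 1/2).
data = [([1.0], 1), ([2.0], 0)]
print(log_likelihood(0.0, [0.0], data))  # -> -2 log 2 ~ -1.386
```

Since each term is a log probability, the objective is never positive; maximizing it pushes each example's predicted probability toward its observed label.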
Convexity

• Convex function
• Doesn't matter where you start, if you "walk up" the objective
• Gradient!
Gradient Descent (non-convex)

Goal
Optimize log likelihood with respect to variables β

[Figure: the objective as a function of a parameter; successive iterates 0, 1, 2, 3 climb toward the optimum (the "Undiscovered Country"), each step following the update β^(l+1) = β^(l) + α ∂J/∂β]

Luckily, (vanilla) logistic regression is convex
Gradient for Logistic Regression

To ease notation, let's define

$$\pi_i = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \quad (5)$$

Our objective function is

$$\mathcal{L} = \sum_i \log p(y_i \mid x_i) = \sum_i \mathcal{L}_i = \sum_i \begin{cases} \log \pi_i & \text{if } y_i = 1 \\ \log(1 - \pi_i) & \text{if } y_i = 0 \end{cases} \quad (6)$$
Taking the Derivative

Apply the chain rule:

$$\frac{\partial \mathcal{L}}{\partial \beta_j} = \sum_i \frac{\partial \mathcal{L}_i(\vec{\beta})}{\partial \beta_j} = \sum_i \begin{cases} \frac{1}{\pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 1 \\ \frac{1}{1 - \pi_i} \left( -\frac{\partial \pi_i}{\partial \beta_j} \right) & \text{if } y_i = 0 \end{cases} \quad (7)$$

If we plug in the derivative,

$$\frac{\partial \pi_i}{\partial \beta_j} = \pi_i (1 - \pi_i) x_{ij}, \quad (8)$$

we can merge these two cases:

$$\frac{\partial \mathcal{L}_i}{\partial \beta_j} = (y_i - \pi_i) x_{ij}. \quad (9)$$
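A standard way to convince yourself of a derivation like Equation (9) is a finite-difference check: compare the analytic gradient to a numeric approximation. The sketch below (names `analytic_grad_j`, `numeric_grad_j` are invented for this example) does exactly that for a single example:

```python
import math

def pi(beta, x):
    """Equation (5): sigmoid of the dot product."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return math.exp(z) / (1.0 + math.exp(z))

def log_lik_i(beta, x, y):
    """Per-example log likelihood, Equation (6)."""
    p = pi(beta, x)
    return math.log(p) if y == 1 else math.log(1.0 - p)

def analytic_grad_j(beta, x, y, j):
    """Equation (9): dL_i / dbeta_j = (y_i - pi_i) * x_ij."""
    return (y - pi(beta, x)) * x[j]

def numeric_grad_j(beta, x, y, j, eps=1e-6):
    """Central finite difference for comparison."""
    bp = list(beta); bp[j] += eps
    bm = list(beta); bm[j] -= eps
    return (log_lik_i(bp, x, y) - log_lik_i(bm, x, y)) / (2 * eps)
```

For any β, x, y, and j, the two should agree to several decimal places, which confirms that the two cases of Equation (7) really do merge into the single expression (9).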
Gradient for Logistic Regression

Gradient

$$\nabla_\beta \mathcal{L}(\vec{\beta}) = \left( \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_n} \right) \quad (10)$$

Update

$$\Delta \beta \equiv \eta \nabla_\beta \mathcal{L}(\vec{\beta}) \quad (11)$$

$$\beta_i' \leftarrow \beta_i + \eta \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_i} \quad (12)$$

Why are we adding? What would we do if we wanted to do descent?

η: step size, must be greater than zero

NB: Conjugate gradient is usually better, but harder to implement
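The update (12), applied repeatedly with the full-data gradient from Equation (9), gives batch gradient ascent. A minimal sketch (the function `gradient_ascent` and its defaults are assumptions of this example; the bias term is omitted for brevity):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_ascent(data, n_features, eta=0.1, n_iters=200):
    """Repeat update (12): beta_j <- beta_j + eta * dL/dbeta_j,
    where dL/dbeta_j = sum_i (y_i - pi_i) x_ij (Equation (9) summed over i).
    We ADD eta * gradient because we are maximizing the log likelihood."""
    beta = [0.0] * n_features
    for _ in range(n_iters):
        grad = [0.0] * n_features
        for x, y in data:
            p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
            for j in range(n_features):
                grad[j] += (y - p) * x[j]
        beta = [b + eta * g for b, g in zip(beta, grad)]
    return beta
```

On a tiny separable set such as `[([1.0], 1), ([-1.0], 0)]`, the learned weight is positive and the positive example's predicted probability moves well above 0.5. For descent on a loss, you would subtract the gradient instead.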
Choosing Step Size

[Figure: the objective as a function of a parameter, illustrating how different step sizes affect progress toward the optimum]
Remaining issues

• When to stop?
• What if β keeps getting bigger?
Regularized Conditional Log Likelihood

Unregularized

$$\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \quad (13)$$

Regularized

$$\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) - \mu \sum_i \beta_i^2 \quad (14)$$

µ is a "regularization" parameter that trades off between likelihood and having small parameters
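The regularized objective (14) is the plain log likelihood minus a quadratic penalty. A short sketch (the name `regularized_objective` is invented for this illustration):

```python
import math

def regularized_objective(beta0, beta, data, mu):
    """Equation (14): conditional log likelihood minus mu * sum_i beta_i^2.
    With mu = 0 this reduces to the unregularized objective (13)."""
    ll = 0.0
    for x, y in data:
        z = beta0 + sum(b * xi for b, xi in zip(beta, x))
        ll += y * z - math.log(1.0 + math.exp(z))
    return ll - mu * sum(b * b for b in beta)
```

For any nonzero β, increasing µ lowers the objective, so the maximizer is pulled toward smaller parameters; this is what keeps β from growing without bound on separable data.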
Approximating the Gradient

• Our datasets are big (too big to fit into memory)
• . . . or data are changing / streaming
• Hard to compute the true gradient

$$\nabla \mathcal{L}(\beta) \equiv \mathbb{E}_x[\nabla \mathcal{L}(\beta, x)] \quad (15)$$

• Average over all observations
• What if we compute an update just from one observation?
Getting to Union Station

Pretend it's a pre-smartphone world and you want to get to Union Station
Stochastic Gradient for Logistic Regression

Given a single observation x_i chosen at random from the dataset,

$$\beta_j \leftarrow \beta_j' + \eta \left( -\mu \beta_j' + x_{ij} \left[ y_i - \pi_i \right] \right) \quad (16)$$

Examples in class.
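Equation (16) is one stochastic update from a single example. A minimal sketch (the function name `sgd_step` and the argument layout are assumptions of this example):

```python
import math

def sgd_step(beta, x, y, eta, mu):
    """Equation (16): for each feature j,
    beta_j <- beta_j + eta * (-mu * beta_j + x_ij * (y_i - pi_i)),
    using only the single observation (x, y)."""
    z = sum(b * xi for b, xi in zip(beta, x))
    p = math.exp(z) / (1.0 + math.exp(z))  # pi_i, Equation (5)
    return [b + eta * (-mu * b + xj * (y - p)) for b, xj in zip(beta, x)]

# One step from zero weights: pi = 0.5, so each weight moves by eta * x_ij * (y - 0.5).
print(sgd_step([0.0, 0.0], [1.0, 2.0], 1, 0.5, 0.0))  # -> [0.25, 0.5]
```

The −µβ term shrinks every weight a little on each step, which is how the regularizer of Equation (14) shows up inside the stochastic update.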
Stochastic Gradient for Regularized Regression

$$\mathcal{L} = \log p(y \mid x; \beta) - \mu \sum_j \beta_j^2 \quad (17)$$

Taking the derivative (with respect to example x_i):

$$\frac{\partial \mathcal{L}}{\partial \beta_j} = (y_i - \pi_i) x_{ij} - 2\mu \beta_j \quad (18)$$
Algorithm

1. Initialize a vector β to be all zeros
2. For t = 1, . . . , T
   • For each example (x_i, y_i) and feature j:
     • Compute π_i ≡ Pr(y_i = 1 | x_i)
     • Set β[j] = β[j]′ + λ(y_i − π_i) x_ij
3. Output the parameters β_1, . . . , β_d.

Machine Learning: Jordan Boyd-Graber | UMD Introduction to Machine Learning | 16 / 18
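The three steps above can be sketched end to end. This is a minimal reading of the algorithm (the name `train_sgd` and the defaults are assumptions; regularization is left out, matching the slide):

```python
import math

def train_sgd(data, n_features, lam=0.1, n_epochs=50):
    """1. Zero-initialize beta.
    2. For t = 1..T, for each example (x_i, y_i):
         compute pi_i = Pr(y_i = 1 | x_i), then for each feature j
         update beta[j] += lam * (y_i - pi_i) * x_ij.
    3. Return the parameters."""
    beta = [0.0] * n_features
    for _ in range(n_epochs):
        for x, y in data:
            z = sum(b * xi for b, xi in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-z))  # pi_i
            for j in range(n_features):
                beta[j] += lam * (y - p) * x[j]
    return beta

# Tiny separable example: feature 0 fires on the positive class,
# feature 1 on the negative class.
beta = train_sgd([([1.0, 0.0], 1), ([0.0, 1.0], 0)], 2)
```

After training, the weight on the positive-class feature is positive and the weight on the negative-class feature is negative, so both toy examples are classified correctly.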
Proofs about Stochastic Gradient

• Depend on the convexity of the objective and on how close (ε) you want to get to the actual answer
• The best bounds depend on changing η over time and per dimension (not all features are created equal)
In class

• Your questions!
• Working through a simple example
• Preparing for the logistic regression homework