Lecture 04: Logistic Regression - University of Waterloo
Outline
• Announcements
• Baseline
• Learning "Machine Learning" Pyramid
• Regression or Classification (that's it!)
• History of Classification
• History of Solvers (Analytical to Convex to "Non-Convex but Smooth")
• Convexity
• SGD
• Perceptron Review
• Bernoulli model / Logistic Regression
• Tensorflow Playground / Demo code
• Multiclass
9/21/17 Yao-Liang Yu 2
ML Pyramid
9/21/17 Francois Chaubard and Agastya Kalra 7
[Figure: ML pyramid. Layers include Deep Learning and Systems at the top; Machine Learning (anything fancier than simple Sklearn fit / predict calls); and foundations: Software Engineering, Linear Algebra, Convex Optimization, Probability and Statistics, Information Theory.]
Classification
Higher "complexity" datasets need higher "capacity" models (formally, higher VC dimension)
[Figure: datasets of increasing complexity, each matched to the simplest model with enough capacity: Perceptron → Logistic Regression → MLP / feature engineering → … → ?]
History of Solvers
• Closed form (<1950s): least squares, etc.
• Iterative methods + convexity (1950–2012): interior point methods, log barrier, etc.
• Iterative methods + smoothness (2012+): deep learning, etc.
Bernoulli model
• Let P(Y=1 | X=x) = p(x; w), parameterized by w
• Conditional likelihood on {(x1, y1), … (xn, yn)}:
• simplifies if independence holds
• Assuming yi is {0,1}-valued
P(Y_1 = y_1, \dots, Y_n = y_n \mid X_1 = x_1, \dots, X_n = x_n) = \prod_{i=1}^n P(Y_i = y_i \mid X_i = x_i) = \prod_{i=1}^n p(x_i; w)^{y_i} (1 - p(x_i; w))^{1 - y_i}
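As a sketch (the function name is mine, not from the slides), the log of this product is what one actually computes, since the product of many probabilities underflows quickly:

```python
import numpy as np

def bernoulli_log_likelihood(p, y):
    """Log of prod_i p_i^{y_i} (1 - p_i)^{1 - y_i} for labels y_i in {0, 1}."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy example: predicted probabilities and observed labels for three points
p = np.array([0.9, 0.2, 0.7])
y = np.array([1, 0, 1])
print(bernoulli_log_likelihood(p, y))
```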
Naïve solution
• Find w to maximize conditional likelihood
• What is the solution if p(x; w) does not depend on x?
• What is the solution if p(x; w) does not depend on w?
\prod_{i=1}^n p(x_i; w)^{y_i} (1 - p(x_i; w))^{1 - y_i}
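As a quick numerical check on the first question (a sketch; names are mine): if p(x; w) = p is constant across x, the likelihood is maximized at the empirical frequency of positives, p̂ = (1/n) Σ y_i.

```python
import numpy as np

# With constant p, the likelihood is p^{sum y} (1-p)^{n - sum y},
# maximized at p_hat = mean(y). A grid search confirms this numerically.
y = np.array([1, 1, 0, 1, 0, 1, 1, 0])

def log_lik(p, y):
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

grid = np.linspace(0.01, 0.99, 99)
best = grid[np.argmax([log_lik(p, y) for p in grid])]
print(best, y.mean())  # both close to 5/8
```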
Generalized linear models (GLM)
• y ~ Bernoulli(p); p = p(x; w) (natural parameter) → logistic regression
• y ~ Normal(µ, σ²); µ = µ(x; w) → (weighted) least-squares regression
• GLM: y ~ exp(θ φ(y) − A(θ)), where θ is the natural parameter, φ(y) the sufficient statistic, and A(θ) the log-partition function
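To connect the Bernoulli case to the GLM form, one can rewrite its pmf in exponential-family notation (a standard derivation, not spelled out on the slides):

```latex
p(y) = p^{y}(1-p)^{1-y}
     = \exp\!\big( y \log\tfrac{p}{1-p} + \log(1-p) \big)
     = \exp\!\big( \theta\,\phi(y) - A(\theta) \big)
```

with θ = log(p/(1−p)) the natural parameter, φ(y) = y the sufficient statistic, and A(θ) = log(1 + e^θ) = −log(1−p) the log-partition function. Substituting θ = wᵀx recovers logistic regression.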
Logit transform
• p(x; w) = w^T x? p ≥ 0 is not guaranteed…
• log p(x; w) = w^T x? Better! But the LHS is negative while the RHS can be any real number…
• Logit transform (log odds ratio):

  \log \frac{p(x; w)}{1 - p(x; w)} = w^\top x

• Or equivalently:

  p(x; w) = \frac{1}{1 + \exp(-w^\top x)}
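A minimal numerical check of the two equivalent forms above (function names are mine): the sigmoid maps scores to (0, 1), and the logit maps probabilities back to the real line.

```python
import numpy as np

def sigmoid(z):
    """p = 1 / (1 + exp(-z)); inverse of the logit transform."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log odds log(p / (1 - p)); maps (0, 1) onto the real line."""
    return np.log(p / (1.0 - p))

z = np.array([-2.0, 0.0, 3.0])
p = sigmoid(z)
print(p)         # probabilities in (0, 1)
print(logit(p))  # recovers z (up to rounding)
```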
Prediction with confidence
• ŷ = 1 if p = P(Y=1 | X=x) > ½, which holds iff w^T x > 0
• Decision boundary: w^T x = 0
• ŷ = sign(w^T x) as before, but now with confidence p(x; w)
p(x; w) = \frac{1}{1 + \exp(-w^\top x)}
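A small sketch of prediction with confidence (names are mine): the label is the sign of w^T x, and the confidence is the sigmoid probability of the predicted class.

```python
import numpy as np

def predict_with_confidence(w, X):
    """Label sign(w^T x) together with confidence p(x; w)."""
    z = X @ w
    p = 1.0 / (1.0 + np.exp(-z))       # P(Y=1 | X=x)
    yhat = np.where(z > 0, 1, -1)      # decision boundary: w^T x = 0
    conf = np.where(yhat == 1, p, 1 - p)  # probability of the predicted class
    return yhat, conf

X = np.array([[2.0, 1.0], [-1.0, 0.5]])
w = np.array([1.0, -0.5])
print(predict_with_confidence(w, X))
```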
Not just a classification algorithm
• Logistic regression does more than classification: it estimates conditional probabilities, under the logit-transform assumption
• Having confidence in predictions is nice; the price is an assumption that may or may not hold
• If classification is the sole goal, logistic regression does extra work; as we shall see, SVM only estimates the decision boundary
More than logistic regression
• F(p) transforms p from [0,1] to ℝ
• Then equate F(p) to a linear function w^T x
• But there are many other choices for F: precisely the inverse of any distribution function!
Logistic distribution
• Cumulative distribution function:

  F(x; \mu, s) = \frac{1}{1 + \exp\!\left(-\frac{x - \mu}{s}\right)}

• Mean µ, variance s²π²/3
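The stated moments can be sanity-checked by Monte Carlo (a sketch; the parameter values are illustrative):

```python
import numpy as np

# Draw from a logistic distribution and compare sample moments with
# the stated mean mu and variance s^2 * pi^2 / 3.
rng = np.random.default_rng(0)
mu, s = 1.0, 2.0
samples = rng.logistic(loc=mu, scale=s, size=200_000)
print(samples.mean())                       # close to mu
print(samples.var(), s**2 * np.pi**2 / 3)   # close to each other
```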
More than 2 classes
• Softmax
• Again, nonnegative and sum to 1
• Negative log-likelihood (y is one-hot)
P(Y = c \mid x, W) = \frac{\exp(w_c^\top x)}{\sum_{q=1}^k \exp(w_q^\top x)}

-\log \prod_{i=1}^n \prod_{c=1}^k p_{ic}^{y_{ic}} = -\sum_{i=1}^n \sum_{c=1}^k y_{ic} \log p_{ic}
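A sketch of the two formulas above (names are mine; `scores` plays the role of the w_c^T x values for each example):

```python
import numpy as np

def softmax(scores):
    """P(Y=c | x, W) = exp(w_c^T x) / sum_q exp(w_q^T x), row-wise."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # stability shift
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def nll(probs, y_onehot):
    """- sum_i sum_c y_ic log p_ic for one-hot labels."""
    return -np.sum(y_onehot * np.log(probs))

scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.0]])  # n=2 examples, k=3 classes
y = np.array([[1, 0, 0],
              [0, 1, 0]])
p = softmax(scores)
print(p.sum(axis=1))  # each row sums to 1
print(nll(p, y))
```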
Classification revisited
• ŷ = sign(x^T w + b)
• How confident are we about ŷ?
• |x^T w + b| seems a good indicator: real-valued, hence hard to interpret, though there are ways to transform it into [0,1]
• Better(?) idea: learn the confidence directly
Conditional probability
• P(Y=1 | X=x): conditional on seeing x, what is the chance of this instance being positive, i.e., Y=1? • obviously, value in [0,1]
• P(Y=0 | X=x) = 1 – P(Y=1 | X=x), if two classes • more generally, sum to 1
Notation (Simplex). Δ^{k−1} := { p ∈ ℝ^k : p ≥ 0, Σ_i p_i = 1 }
Reduction to a harder problem
• P(Y=1 | X=x) = E(1_{Y=1} | X=x)
• Let Z = 1_{Y=1}; then P(Y=1 | X=x) is the regression function for (X, Z). Use linear regression for binary Z?
• Exploit structure! Conditional probabilities live in a simplex
• Never reduce to an unnecessarily harder problem
1_A = \begin{cases} 1, & A \text{ is true} \\ 0, & A \text{ is false} \end{cases}
Maximum likelihood
• Minimize negative log-likelihood
\prod_{i=1}^n p(x_i; w)^{y_i} (1 - p(x_i; w))^{1 - y_i}, \qquad p(x; w) = \frac{1}{1 + \exp(-w^\top x)}

Negative log-likelihood:

\sum_i \log\!\left(e^{(1 - y_i) w^\top x_i} + e^{-y_i w^\top x_i}\right) \equiv \sum_i \log\!\left(1 + e^{-y_i w^\top x_i}\right)

(the left form takes y_i ∈ {0, 1}; the right form relabels y_i ∈ {−1, +1})
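The right-hand form above can be computed in a numerically safe way (a sketch; names are mine) using `np.logaddexp`, which evaluates log(e^a + e^b) without overflow:

```python
import numpy as np

def logistic_loss(w, X, y):
    """sum_i log(1 + exp(-y_i w^T x_i)) for labels y_i in {-1, +1}.

    np.logaddexp(0, t) computes log(1 + e^t) stably."""
    margins = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -margins))

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([0.3, -0.2])
print(logistic_loss(w, X, y))
```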
Newton's algorithm
• η = 1: iteratively reweighted least-squares (IRLS)
w_{t+1} \leftarrow w_t - \eta_t [\nabla^2 f(w_t)]^{-1} \nabla f(w_t)

With p_i = \frac{1}{1 + \exp(-w_t^\top x_i)}:

\nabla f(w_t) = X^\top (p - y), \qquad \nabla^2 f(w_t) = \sum_i p_i (1 - p_i) x_i x_i^\top \quad \text{(PSD)}

Uncertain predictions (p_i near ½) get bigger weight
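The update above can be sketched directly (names and toy data are mine; the small ridge term is my addition to keep the Hessian invertible, not part of the slides):

```python
import numpy as np

def newton_logistic(X, y, steps=15, eta=1.0, ridge=1e-8):
    """Newton's method for logistic regression with labels y in {0, 1}.

    Uses the slide's gradient and Hessian:
        grad f(w) = X^T (p - y),  hess f(w) = sum_i p_i (1 - p_i) x_i x_i^T."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y)
        hess = (X * (p * (1 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        w = w - eta * np.linalg.solve(hess, grad)
    return w

# toy data drawn from the model itself (w_true is illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([1.5, -2.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(X @ w_true)))).astype(float)
print(newton_logistic(X, y))  # should land near w_true
```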
A word about implementation
• Numerically computing exponentials can be tricky: they easily underflow or overflow
• The usual trick: estimate the range of the exponents, then shift the mean of the exponents to 0
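The shifting trick can be sketched as follows (the slide suggests centering the exponents; shifting by the max is the same idea and is the common choice):

```python
import numpy as np

def log_sum_exp(z):
    """log(sum_i exp(z_i)) via the shift trick: subtracting the max makes
    every exponent <= 0, so nothing overflows."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([1000.0, 1001.0, 1002.0])
print(log_sum_exp(z))  # finite, unlike the naive np.log(np.sum(np.exp(z)))
```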