Transcript of: Generalized linear models and logistic regression
Generalized linear models and logistic regression
Comments (Oct. 15, 2019)
• Assignment 1 almost marked
  • Any questions about the problems on this assignment?
• Assignment 2 due soon
  • Any questions?
2
Summary so far
• From chapters 1 and 2, obtained tools needed to talk about uncertainty/noise underlying machine learning
  • capture uncertainty about data/observations using probabilities
  • formalize estimation problem for distributions
• Identify variables x_1, …, x_d
  • e.g. observed features, observed targets
• Pick the desired distribution
  • e.g. p(x_1, …, x_d) or p(x_1 | x_2, …, x_d) (conditional distribution)
  • e.g. p(x_i) is Poisson or p(y | x_1, …, x_d) is Gaussian
• Perform parameter estimation for chosen distribution
  • e.g., estimate lambda for Poisson
  • e.g. estimate mu and sigma for Gaussian

3
Summary so far (2)
• For prediction problems, which is much of machine learning, first discuss
  • the types of data we get (i.e., features and types of targets)
  • goal to minimize expected cost of incorrect predictions
• Concluded optimal prediction functions use p(y | x) or E[Y | x]
• From there, our goal becomes to estimate p(y | x) or E[Y | x]
• Starting from this general problem specification, it is useful to use our parameter estimation techniques to solve this problem
  • e.g., specify Y = Xw + noise, estimate mu = xw
4
Summary so far (3)
• For the linear regression setting, we modeled p(y|x) as a Gaussian with mu = <x, w> and a constant sigma
• Performed maximum likelihood to get weights w
• Possible question: why all this machinery to get to linear regression?
  • one answer: makes our assumptions about uncertainty more clear
  • another answer: it will make it easier to generalize p(y | x) to other distributions (which we will do with GLMs)
5
Estimation approaches for Linear regression
• Recall we estimated w for p(y | x) as a Gaussian
• We discussed the closed form solution
• and using batch or stochastic gradient descent
• Exercise: Now imagine you have 10 new data points. How do we get a new w that incorporates these data points?
6
w = (X^T X)^{−1} X^T y

w_{t+1} = w_t − η X^T (X w_t − y)

w_{t+1} = w_t − η_t x_t (x_t^T w_t − y_t)
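The three estimators above can be sketched in NumPy; the synthetic data, step size, and iteration count below are illustrative assumptions, and the stochastic variant is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): y = X w* + Gaussian noise
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

# Closed form: w = (X^T X)^{-1} X^T y (solve is preferred over an explicit inverse)
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: w_{t+1} = w_t - eta * X^T (X w_t - y)
w_gd = np.zeros(3)
eta = 0.001
for _ in range(5000):
    w_gd -= eta * X.T @ (X @ w_gd - y)
```

With new data points, the closed form must be recomputed from scratch (or updated with rank-one updates), while gradient descent can simply continue from the current w.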
Exercise: MAP for Poisson
• Recall we estimated lambda for Poisson p(x)
  • Had a dataset of scalars {x1, …, xn}
• For MLE, found the closed form solution lambda = average of xi
• Can we use gradient descent for this optimization? And if so, should we?
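We can, though the closed form is clearly preferable here; a minimal sketch confirming that gradient descent on the negative log likelihood reaches the sample mean (the data and step size are made up for illustration):

```python
import numpy as np

# Hypothetical Poisson samples
x = np.array([2.0, 0.0, 3.0, 1.0, 4.0, 2.0])

# Closed-form MLE: lambda = sample mean
lam_mle = x.mean()

# Gradient descent on the negative log likelihood
# NLL(lam) = n*lam - sum(x)*ln(lam), so the gradient is n - sum(x)/lam
lam = 1.0
eta = 0.01
for _ in range(2000):
    grad = len(x) - x.sum() / lam
    lam -= eta * grad
```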
7
Exercise: Predicting the number of accidents
• In Assignment 1, we learned p(y) as a Poisson, where Y is the number of accidents in a factory
• How would the question from Assignment 1 change if we also wanted to condition on features?
  • For example, want to model the number of accidents in the factory, given x1 = size of the factory and x2 = number of employees
• What is p(y | x)? What are the parameters?
8
Poisson regression
9
p(y|x) = Poisson(y | λ = exp(x^T w))

1. E[Y|x] = exp(w^T x)
2. p(y|x) = Poisson(λ)
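A minimal sketch of fitting this model by gradient descent on the negative log likelihood, using the GLM gradient X^T(exp(Xw) − y); the synthetic data, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical count data drawn from the model itself
X = rng.normal(scale=0.5, size=(500, 2))
w_true = np.array([0.8, -0.4])
y = rng.poisson(np.exp(X @ w_true)).astype(float)

def nll(w):
    # Negative log likelihood up to the ln(y!) constant
    return np.sum(np.exp(X @ w) - y * (X @ w))

# Gradient descent with the generic GLM gradient X^T (f(Xw) - y), f = exp
w = np.zeros(2)
eta = 0.001
for _ in range(3000):
    w -= eta * X.T @ (np.exp(X @ w) - y)
```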
Exponential Family Distributions
10
p(y|θ) = exp(θy − a(θ) + b(y))

The transfer f corresponds to the derivative of the log-normalizer function a

θ = x^T w

We will always linearly predict the natural parameter

Useful property:  da(θ)/dθ = E[Y]
Examples
• Gaussian distribution
• Poisson distribution
• Bernoulli distribution
11
p(y|θ) = exp(θy − a(θ) + b(y)),   θ = x^T w

• Gaussian:   a(θ) = (1/2) θ^2          f(θ) = θ
• Poisson:    a(θ) = exp(θ)             f(θ) = exp(θ)
• Bernoulli:  a(θ) = ln(1 + exp(θ))     f(θ) = 1 / (1 + exp(−θ))   (sigmoid)
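Since the transfer is claimed to be the derivative of the log-normalizer for each of these examples, a quick finite-difference check (the evaluation point θ = 0.7 is an arbitrary choice):

```python
import numpy as np

def num_deriv(a, theta, h=1e-6):
    # Central-difference approximation to da/dtheta
    return (a(theta + h) - a(theta - h)) / (2.0 * h)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

theta = 0.7
f_gauss = num_deriv(lambda t: t**2 / 2.0, theta)               # should be theta
f_poisson = num_deriv(np.exp, theta)                           # should be exp(theta)
f_bernoulli = num_deriv(lambda t: np.log1p(np.exp(t)), theta)  # should be sigmoid(theta)
```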
Exercise: How do we extract the form for the Poisson distribution?
12
where θ ∈ R is the parameter to the distribution, a : R → R is a log-normalizer function and b : R → R is a function of only x that will typically be ignored in our optimization because it is not a function of θ. Many of the often encountered (families of) distributions are members of the exponential family; e.g. exponential, Gaussian, Gamma, Poisson, or the binomial distributions. Therefore, it is useful to generically study the exponential family to better understand commonalities and differences between individual member functions.
Example 17: The Poisson distribution can be expressed as

p(x|λ) = exp(x log λ − λ − log x!),

where λ ∈ R+ and X = N0. Thus, θ = log λ, a(θ) = e^θ, and b(x) = −log x!. □

Now let us get some further insight into the properties of the exponential family parameters and why this class is convenient for estimation. The function a(θ) is typically called the log-partitioning function or simply a log-normalizer. It is called this because

a(θ) = log ∫_X exp(θx + b(x)) dx

and so plays the role of ensuring that we have a valid density: ∫_X p(x) dx = 1. Importantly, for many common GLMs, the derivative of a corresponds to the inverse of the link function. For example, for Poisson regression, the link function is g(θ) = log(θ), and the derivative of a is e^θ, which is the inverse of g. Therefore, as we discuss below, the log-normalizer for an exponential family informs what link g should be used (or correspondingly the transfer f = g^{−1}).

The properties of this log-normalizer are also key for estimation of generalized linear models. It can be derived that

∂a(θ)/∂θ = E[X]

∂²a(θ)/∂θ² = V[X]
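These identities can be checked numerically for the Poisson member (θ = log λ, a(θ) = e^θ), where both derivatives should equal λ; the specific λ and sample size below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Poisson member: theta = log(lam), a(theta) = exp(theta)
lam = 3.5
theta = np.log(lam)
a = np.exp
h = 1e-5
a1 = (a(theta + h) - a(theta - h)) / (2 * h)              # da/dtheta, should be E[X] = lam
a2 = (a(theta + h) - 2 * a(theta) + a(theta - h)) / h**2  # d^2a/dtheta^2, should be V[X] = lam

# Compare against empirical mean and variance of Poisson samples
x = rng.poisson(lam, size=200_000)
```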
7.3 Formalizing generalized linear models

We shall now formalize the generalized linear models. The two key components of GLMs can be expressed as

1. E[y|x] = f(ω^T x), or g(E[y|x]) = ω^T x where g = f^{−1}
2. p(y|x) is an exponential family distribution

The function f is called the transfer function and g is called the link function. For Poisson regression, f is the exponential function, and as we shall see for logistic regression, f is the sigmoid function. The transfer function adjusts the range of ω^T x to the domain of Y; because of this relationship, link functions are usually not selected independently of the distribution for Y. The generalization to the exponential family from the Gaussian distribution used in ordinary least-squares regression allows us to model a much wider range of target functions. GLMs include three widely used models: linear regression, Poisson regression and logistic regression, which we will talk about in the next chapter about classification.
p(y|θ) = exp(θy − a(θ) + b(y))

• What is the transfer f?

f(θ) = da(θ)/dθ = exp(θ)
Exercise: How do we extract the form for the exponential distribution?
• Recall exponential family distribution
• How do we write the exponential distribution this way?
• What is the transfer f?
13
p(y|λ) = λ exp(−λy),   λ > 0

p(y|θ) = exp(θy − a(θ) + b(y)) = exp(θy) exp(−a(θ)) exp(b(y))

θ = x^T w,   a(θ) = −ln(−θ),   b(y) = 0

f(θ) = d/dθ a(θ) = −1/θ
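A quick numeric sanity check that this parameterization (θ = −λ, a(θ) = −ln(−θ), b(y) = 0) recovers the exponential density; λ = 2 and the evaluation points are arbitrary:

```python
import numpy as np

# Exponential distribution written in exponential-family form:
# exp(theta*y - a(theta)) with theta = -lam, a(theta) = -ln(-theta), b(y) = 0
lam = 2.0
theta = -lam
a = -np.log(-theta)

y = np.array([0.0, 0.5, 1.3])
family_form = np.exp(theta * y - a)
direct = lam * np.exp(-lam * y)   # lam * exp(-lam * y)
```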
Logistic regression
14
the full version of the Newton-Raphson algorithm with the Hessian matrix. Instead, Gauss-Newton and other types of solutions are considered and are generally called iteratively reweighted least-squares (IRLS) algorithms in the statistical literature.
5.4 Logistic regression
At the end, we mention that GLMs extend to classification. One of the most popular uses of GLMs is a combination of a Bernoulli distribution with a logit link function. This framework is frequently encountered and is called logistic regression. We summarize the logistic regression model as follows
1. logit(E[y|x]) = ω^T x
2. p(y|x) = Bernoulli(α)

where logit(x) = ln(x / (1 − x)), y ∈ {0, 1}, and α ∈ (0, 1) is the parameter (mean) of the Bernoulli distribution. It follows that

E[y|x] = 1 / (1 + e^{−ω^T x})

and

p(y|x) = (1 / (1 + e^{−ω^T x}))^y (1 − 1 / (1 + e^{−ω^T x}))^{1−y}.
Given a data set D = {(xi, yi)}_{i=1}^n, where xi ∈ R^k and yi ∈ {0, 1}, the parameters of the model w can be found by maximizing the likelihood function.
[Figure 6.2: Sigmoid function f(t) = 1/(1 + e^{−t}) in the [−5, 5] interval.]
as negative (y = 0). Thus, the predictor maps a (d + 1)-dimensional vector x = (x0 = 1, x1, . . . , xd) into a zero or one. Note that P(Y = 1|x, w*) ≥ 0.5 only when w^T x ≥ 0. The expression w^T x = 0 represents the equation of a hyperplane that separates positive and negative examples. Thus, the logistic regression model is a linear classifier.
6.1.2 Maximum conditional likelihood estimation
To frame the learning problem as parameter estimation, we will assume that the data set D = {(xi, yi)}_{i=1}^n is an i.i.d. sample from a fixed but unknown probability distribution p(x, y). Even more specifically, we will assume that the data generating process randomly draws a data point x, a realization of the random vector (X0 = 1, X1, . . . , Xd), according to p(x) and then sets its class label Y according to the Bernoulli distribution
p(y|x) = { 1/(1 + e^{−ω^T x})        for y = 1
         { 1 − 1/(1 + e^{−ω^T x})    for y = 0        (6.2)

       = σ(x^T w)^y (1 − σ(x^T w))^{1−y}

where ω = (ω0, ω1, . . . , ωd) is a set of unknown coefficients we want to recover (or learn) from the observed data D. Based on the principles of parameter estimation, we can estimate ω by maximizing the conditional
α = p(y = 1|x),   g = logit

f(x^T w) = g^{−1}(x^T w) = sigmoid(x^T w) = E[y|x]
linear classifiers, our flexibility is restricted because our method must learn the posterior probabilities p(y|x) and at the same time have a linear decision surface in R^d. This, however, can be achieved if p(y|x) is modeled as a monotonic function of w^T x; e.g., tanh(w^T x) or (1 + e^{−w^T x})^{−1}. Of course, a model trained to learn posterior probabilities p(y|x) can be seen as a "soft" predictor or a scoring function s : X → [0, 1]. Then, the conversion from s to f is a straightforward application of the maximum a posteriori principle: the predicted output is positive if s(x) ≥ 0.5 and negative if s(x) < 0.5. More generally, the scoring function can be any mapping s : X → R, with thresholding applied based on any particular value τ.
8.1 Logistic regression
Let us consider binary classification in R^d, where X = R^{d+1} and Y = {0, 1}. Logistic regression is a Generalized Linear Model, where the distribution over Y given x is a Bernoulli distribution, and the transfer function is the sigmoid function, also called the logistic function,

σ(t) = (1 + e^{−t})^{−1}

plotted in Figure 8.2. In the same terminology as for GLMs, the transfer function is the sigmoid and the link function (the inverse of the transfer function) is the logit function logit(x) = ln(x / (1 − x)), with

1. E[y|x] = σ(ω^T x)
2. p(y|x) = Bernoulli(α) with α = E[y|x].
The Bernoulli distribution, with α a function of x, is

p(y|x) = { 1/(1 + e^{−ω^T x})        for y = 1
         { 1 − 1/(1 + e^{−ω^T x})    for y = 0        (8.1)

       = σ(x^T w)^y (1 − σ(x^T w))^{1−y}
where ω = (ω0, ω1, . . . , ωd) is a set of unknown coefficients we want to recover (or learn). Notice that our prediction is σ(ω^T x), and that it satisfies

p(y = 1|x) = σ(ω^T x).

Therefore, as with many binary classification approaches, our goal is to predict the probability that the class is 1; given this probability, we can infer p(y = 0|x) = 1 − p(y = 1|x).
8.1.1 Predicting class labels

The function learned by logistic regression returns a probability, rather than an explicit prediction of 0 or 1. Therefore, we have to take this probability estimate and convert it to a suitable prediction of the class. For a previously unseen data point x and a set of learned coefficients w, we simply calculate the posterior probability as

P(Y = 1|x, w) = 1 / (1 + e^{−w^T x}).
What is c(w) for GLMs?
• Still formulating an optimization problem to predict targets y given features x
• The variable we learn is the weight vector w
• What is c(w)?
15
MLE:  c(w) ∝ −ln p(D|w) ∝ −Σ_{i=1}^n ln p(y_i | x_i, w)

argmin_w c(w) = argmax_w p(D|w)
Cross-entropy loss for Logistic Regression
16
c_i(w) = −y_i ln σ(w^T x_i) − (1 − y_i) ln(1 − σ(w^T x_i))
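A minimal sketch of minimizing this cross-entropy by gradient descent, using the gradient X^T(σ(Xw) − y); the synthetic data, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Hypothetical labels sampled from the logistic model itself
X = rng.normal(size=(400, 2))
w_true = np.array([2.0, -1.0])
y = (rng.uniform(size=400) < sigmoid(X @ w_true)).astype(float)

def cross_entropy(w):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Gradient descent with the cross-entropy gradient X^T (sigmoid(Xw) - y)
w = np.zeros(2)
eta = 0.005
for _ in range(4000):
    w -= eta * X.T @ (sigmoid(X @ w) - y)
```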
Extra exercises
• Go through the derivation of c(w) for logistic regression
• Derive Maximum Likelihood objective in Section 8.1.2
17
Benefits of GLMs
• Gave a generic update rule, where you only needed to know the transfer for your chosen distribution
  • e.g., linear regression with transfer f = identity
  • e.g., Poisson regression with transfer f = exp
  • e.g., logistic regression with transfer f = sigmoid
• We know the objective is convex in w!
18
Convexity
• The negative log likelihood of many exponential families is convex, which is an important advantage of the maximum likelihood approach
• Why is convexity important?
  • e.g., (sigmoid(xw) − y)^2 is nonconvex, but who cares?
19
Cross-entropy loss versus Euclidean loss for classification
• Why not just use
• The notes explain that this is a non-convex objective
  • from personal experience, it seems to do more poorly in practice
• If no obvious reason to prefer one or the other, we may as well pick the objective that is convex (no local minima)
20
min_{w ∈ R^d} Σ_{i=1}^n (σ(x_i^T w) − y_i)^2

versus the cross-entropy loss

c_i(w) = −y_i ln σ(w^T x_i) − (1 − y_i) ln(1 − σ(w^T x_i))
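The nonconvexity of the squared loss through a sigmoid can already be seen in one dimension; a sketch with a single example x = 1, y = 1 (a toy assumption), showing the curvature changes sign:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Single-example squared loss with x = 1, y = 1: g(w) = (sigmoid(w) - 1)^2
g = lambda w: (sigmoid(w) - 1.0) ** 2

def second_deriv(w, h=1e-4):
    # Numerical second derivative of g
    return (g(w + h) - 2.0 * g(w) + g(w - h)) / h**2

curv_neg = second_deriv(-3.0)  # negative curvature here => g is not convex
curv_pos = second_deriv(1.0)   # positive curvature here
```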
How can we check convexity?
• Can check the definition of convexity
• Can check the second derivative for scalar parameters (e.g., λ) and the Hessian for multidimensional parameters (e.g., w)
  • e.g., for linear regression (least-squares), the Hessian is H = X^T X, and so positive semi-definite
  • e.g., for Poisson regression, the Hessian of the negative log-likelihood is H = X^T C X (with C diagonal and positive), and so positive semi-definite
21
Convex function on an interval.
A function (in black) is convex if and only if the region above its graph (in green) is a convex set.
Convex function (from Wikipedia, the free encyclopedia)
In mathematics, a real-valued function f(x) defined on an interval is called convex (or convex downward or concave upward) if the line segment between any two points on the graph of the function lies above or on the graph, in a Euclidean space (or more generally a vector space) of at least two dimensions. Equivalently, a function is convex if its epigraph (the set of points on or above the graph of the function) is a convex set. Well-known examples of convex functions are the quadratic function x^2 and the exponential function e^x for any real number x.

Convex functions play an important role in many areas of mathematics. They are especially important in the study of optimization problems where they are distinguished by a number of convenient properties. For instance, a (strictly) convex function on an open set has no more than one minimum. Even in infinite-dimensional spaces, under suitable additional hypotheses, convex functions continue to satisfy such properties and, as a result, they are the most well-understood functionals in the calculus of variations. In probability theory, a convex function applied to the expected value of a random variable is always less than or equal to the expected value of the convex function of the random variable. This result, known as Jensen's inequality, underlies many important inequalities (including, for instance, the arithmetic–geometric mean inequality and Hölder's inequality).

Exponential growth is a special case of convexity. Exponential growth narrowly means "increasing at a rate proportional to the current value", while convex growth generally means "increasing at an increasing rate (but not necessarily proportionally to current value)".
Definition
Let X be a convex set in a real vector space and let f : X → R be a function.
f is called convex if: f(tx₁ + (1 − t)x₂) ≤ t f(x₁) + (1 − t) f(x₂) for all x₁, x₂ ∈ X and all t ∈ [0, 1]
f is called strictly convex if: f(tx₁ + (1 − t)x₂) < t f(x₁) + (1 − t) f(x₂) for all x₁ ≠ x₂ ∈ X and all t ∈ (0, 1)
H = XᵀCX (Poisson regression)
H = XᵀX (least squares)
Prediction with logistic regression
• So far, we have used the prediction f(xw)
  • e.g., xw for linear regression, exp(xw) for Poisson regression
• For binary classification, want to output 0 or 1, rather than the probability value p(y = 1 | x) = sigmoid(xw)
• Sigmoid maps only a narrow range of xw to values near 0.5; values of xw even somewhat below 0 map close to 0, and values somewhat above 0 map close to 1
• Decision threshold:
  • sigmoid(xw) < 0.5 is class 0
  • sigmoid(xw) ≥ 0.5 is class 1
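The decision rule above can be sketched as follows (the weights and inputs are made up for illustration; the first feature is the bias unit):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict_class(x, w, threshold=0.5):
    """Return 1 if sigmoid(w^T x) reaches the threshold, else 0."""
    return int(sigmoid(x @ w) >= threshold)

# Hypothetical weights and inputs
w = np.array([0.5, 1.0, -2.0])
x_pos = np.array([1.0, 3.0, 0.5])   # w^T x = 2.5 > 0  -> sigmoid > 0.5 -> class 1
x_neg = np.array([1.0, -1.0, 1.0])  # w^T x = -2.5 < 0 -> sigmoid < 0.5 -> class 0
print(predict_class(x_pos, w), predict_class(x_neg, w))  # 1 0
```

Note that thresholding sigmoid(wᵀx) at 0.5 is equivalent to thresholding wᵀx at 0, which is why the classifier is linear.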
22

Figure 6.2: Sigmoid function f(t) = 1/(1 + e^(−t)) in the [−5, 5] interval.
as negative (y = 0). Thus, the predictor maps a (d + 1)-dimensional vector x = (x₀ = 1, x₁, ..., x_d) into a zero or one. Note that P(Y = 1|x, w*) ≥ 0.5 only when wᵀx ≥ 0. The expression wᵀx = 0 represents the equation of a hyperplane that separates positive and negative examples. Thus, the logistic regression model is a linear classifier.
6.1.2 Maximum conditional likelihood estimation
To frame the learning problem as parameter estimation, we will assume that the data set D = {(xᵢ, yᵢ)}ᵢ₌₁ⁿ is an i.i.d. sample from a fixed but unknown probability distribution p(x, y). Even more specifically, we will assume that the data generating process randomly draws a data point x, a realization of the random vector (X₀ = 1, X₁, ..., X_d), according to p(x) and then sets its class label Y according to the Bernoulli distribution
p(y|x) = { 1/(1 + e^(−ωᵀx))        for y = 1
         { 1 − 1/(1 + e^(−ωᵀx))    for y = 0          (6.2)
       = σ(xᵀw)^y (1 − σ(xᵀw))^(1−y)
where ω = (ω₀, ω₁, ..., ω_d) is a set of unknown coefficients we want to recover (or learn) from the observed data D. Based on the principles of parameter estimation, we can estimate ω by maximizing the conditional
89
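The Bernoulli model of Eq. (6.2) can be evaluated directly. A small sketch (the point x and weights w are made up; they are chosen so that xᵀw = 0 and the two outcomes are equally likely):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bernoulli_likelihood(y, x, w):
    """p(y|x) = sigmoid(x^T w)^y * (1 - sigmoid(x^T w))^(1-y), as in Eq. (6.2)."""
    p = sigmoid(x @ w)
    return p ** y * (1 - p) ** (1 - y)

# Hypothetical point with x^T w = 2 - 2 = 0, so p(y = 1 | x) = 0.5 exactly
x = np.array([1.0, 2.0])
w = np.array([2.0, -1.0])
print(bernoulli_likelihood(1, x, w))  # 0.5
print(bernoulli_likelihood(0, x, w))  # 0.5
```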
Logistic regression is a linear classifier
• Hyperplane separates the two classes
• P(y=1 | x, w) > 0.5 only when wᵀx > 0
• P(y=0 | x, w) > 0.5 only when P(y=1 | x, w) < 0.5, which happens when wᵀx < 0
23
[Figure: positive (+) and negative examples (x₁, y₁), (x₂, y₂), ..., (xᵢ, yᵢ) in the (x₁, x₂) plane, separated by the line w₀ + w₁x₁ + w₂x₂ = 0]
Figure 6.1: A data set in R² consisting of nine positive and nine negative examples. The gray line represents a linear decision surface in R². The decision surface does not perfectly separate positives from negatives.
to f is a straightforward application of the maximum a posteriori principle: the predicted output is positive if g(x) ≥ 0.5 and negative if g(x) < 0.5.
6.1 Logistic regression
Let us consider binary classification in Rᵈ, where X = Rᵈ⁺¹ and Y = {0, 1}. The basic idea for many classification approaches is to hypothesize a closed-form representation for the posterior probability that the class label is positive and learn parameters w from data. In logistic regression, this relationship can be expressed as
P(Y = 1|x, w) = 1/(1 + e^(−wᵀx)),    (6.1)

which is a monotonic function of wᵀx. The function σ(t) = (1 + e^(−t))⁻¹ is called the sigmoid function or the logistic function and is plotted in Figure 6.2.
6.1.1 Predicting class labels
For a previously unseen data point x and a set of coefficients w* found from Eq. (6.4) or Eq. (6.11), we simply calculate the posterior probability as

P(Y = 1|x, w*) = 1/(1 + e^(−w*ᵀx)).

If P(Y = 1|x, w*) ≥ 0.5 we conclude that data point x should be labeled as positive (y = 1). Otherwise, if P(Y = 1|x, w*) < 0.5, we label the data point
88
wᵀx < 0
e.g., w = [2.75, −1/3, −1]
x = [1, 4, 3] gives xᵀw = −1.8
x = [1, 2, 1] gives xᵀw = 0.27
Logistic regression versus Linear regression
• Why might one be better than the other? They both use a linear approach
• Linear regression could still learn <x, w> to predict E[Y | x]
• Demo: logistic regression performs better under outliers, when the outlier is still on the correct side of the line
• Conclusion:
  • logistic regression better reflects the goal of predicting p(y=1 | x), i.e., finding a separating hyperplane
  • linear regression assumes E[Y | x] is a linear function of x!
24
Adding regularizers to GLMs
• How do we add regularization to logistic regression?
• We had an optimization for logistic regression to get w: minimize negative log-likelihood, i.e. minimize cross-entropy
• Now want to balance negative log-likelihood and regularizer (i.e., the prior for MAP)
• Simply add regularizer to the objective function
25
Adding a regularizer to logistic regression
• Original objective function for logistic regression
26
everything more suitable for further steps, we will slightly rearrange Eq. (6.6). Notice first that

1 − 1/(1 + e^(−wᵀxᵢ)) = e^(−wᵀxᵢ)/(1 + e^(−wᵀxᵢ))

giving

ll(w) = Σᵢ₌₁ⁿ ( −yᵢ · log(1 + e^(−wᵀxᵢ)) + (1 − yᵢ) · log(e^(−wᵀxᵢ)) − (1 − yᵢ) · log(1 + e^(−wᵀxᵢ)) )
      = Σᵢ₌₁ⁿ ( (yᵢ − 1)wᵀxᵢ + log(1/(1 + e^(−wᵀxᵢ))) ).    (6.7)

It does not take much effort to realize that there is no closed-form solution to ∇ll(w) = 0 (we did have this luxury in linear regression, but not here). Thus, we have to proceed with iterative optimization methods. We will calculate ∇ll(w) and H_ll(w) to enable either method (first-order with only the gradient or second-order with the gradient and Hessian), both written as a function of inputs X, class labels y, and the current parameter vector. We can calculate the first and second partial derivatives of ll(w) as follows

∂ll(w)/∂wⱼ = Σᵢ₌₁ⁿ ( (yᵢ − 1) · xᵢⱼ − 1/(1 + e^(−wᵀxᵢ)) · e^(−wᵀxᵢ) · (−xᵢⱼ) )
           = Σᵢ₌₁ⁿ xᵢⱼ · ( yᵢ − 1 + e^(−wᵀxᵢ)/(1 + e^(−wᵀxᵢ)) )
           = Σᵢ₌₁ⁿ xᵢⱼ · ( yᵢ − 1/(1 + e^(−wᵀxᵢ)) )
           = fⱼᵀ(y − p),

where fⱼ is the j-th column (feature) of data matrix X, y is an n-dimensional column vector of class labels and p is an n-dimensional column vector of (estimated) posterior probabilities pᵢ = P(Yᵢ = 1|xᵢ, w), for i = 1, ..., n. Considering partial derivatives for every component of w, we have

∇ll(w) = Xᵀ(y − p).    (6.8)

The second partial derivative of the log-likelihood function can be found as

∂²ll(w)/∂wⱼ∂wₖ = Σᵢ₌₁ⁿ xᵢⱼ · e^(−wᵀxᵢ)/(1 + e^(−wᵀxᵢ))² · (−xᵢₖ)
               = −Σᵢ₌₁ⁿ xᵢⱼ · 1/(1 + e^(−wᵀxᵢ)) · (1 − 1/(1 + e^(−wᵀxᵢ))) · xᵢₖ
               = −fⱼᵀ P(I − P) fₖ,

70
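The gradient in Eq. (6.8), ∇ll(w) = Xᵀ(y − p), is all a first-order method needs. A minimal gradient-ascent sketch on a tiny synthetic data set (the data, step size, and iteration count below are made up for illustration):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Tiny synthetic data set; the first column is the bias unit x0 = 1
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(2)
step = 0.5
for _ in range(2000):
    p = sigmoid(X @ w)      # estimated posteriors p_i = P(Y_i = 1 | x_i, w)
    grad = X.T @ (y - p)    # Eq. (6.8): gradient of the log-likelihood
    w = w + step * grad     # ascent on ll(w), i.e., descent on -ll(w)

p = sigmoid(X @ w)
print(np.round(p, 2))  # probabilities near 0 for the first two points, near 1 for the last two
```

A second-order (Newton-Raphson) method would instead update w by solving with the Hessian −XᵀP(I − P)X derived above; the first-order version is shown because it is shorter.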
argmax_w ll(w) = argmin_w −ll(w)
• Adding regularizer

argmin_w −ll(w)
+ λ‖w‖₂²
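Adding λ‖w‖₂² to the objective simply adds 2λw to the gradient of the minimization objective. A sketch comparing regularized and unregularized fits (synthetic data and an untuned step size, made up for illustration):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, lam, step=0.1, iters=5000):
    """Gradient descent on -ll(w) + lam * ||w||_2^2 (a sketch, not tuned)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) + 2.0 * lam * w  # the regularizer contributes 2*lam*w
        w = w - step * grad
    return w

# Synthetic data; first column is the bias unit
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w_plain = fit_logistic(X, y, lam=0.0)
w_reg = fit_logistic(X, y, lam=1.0)
print(np.linalg.norm(w_reg) < np.linalg.norm(w_plain))  # True: the penalty shrinks the weights
```

On this separable data the unregularized weights keep growing, while the penalty keeps them bounded; that is exactly the behavior the regularizer is meant to provide.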
Other regularizers
• Have discussed l2 and l1 regularizers
• Other examples: • elastic net regularization is a combination of l1 and l2 (i.e., l1 + l2):
ensures a unique solution
• capped regularizers: do not prevent large weights
27* Figure from “Robust Dictionary Learning with Capped l1-Norm”, Jiang et al., IJCAI 2015
Does this regularizer still protect against overfitting?
Practical considerations: outliers
• What happens if one sample is bad?
• Regularization helps a little bit
• Can also change losses
• Robust losses
  • use l1 instead of l2
• even better: use capped l1
• What are the disadvantages to these losses?
28
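To see why capped l1 is more robust, compare how each loss treats a single large residual (the cap value of 1.0 below is an arbitrary choice for illustration):

```python
import numpy as np

def l2_loss(r):
    return r ** 2

def l1_loss(r):
    return np.abs(r)

def capped_l1_loss(r, cap=1.0):
    # Capped l1: a residual beyond the cap contributes no additional loss
    return np.minimum(np.abs(r), cap)

residuals = np.array([0.1, 0.5, 10.0])  # the last residual is an outlier
print(l2_loss(residuals))        # outlier contributes 100, dominating the total
print(l1_loss(residuals))        # outlier contributes 10, much milder
print(capped_l1_loss(residuals)) # outlier's contribution is capped at 1
```

The disadvantage hinted at in the slide is visible here too: l1 is non-differentiable at 0, and capped l1 is additionally non-convex, which complicates optimization.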
Exercise: intercept unit
• In linear regression, we added an intercept unit (bias unit) to the features
  • i.e., added a feature that is always 1 to the feature vector
• Does it make sense to do this for GLMs?
  • e.g., sigmoid(<x,w> + w_0)
29
Adding a column of ones to GLMs
• This provides the same outcome as for linear regression
• g(E[y | x]) = xw → a bias unit in x with coefficient w0 shifts the function left or right
30*Figure from http://stackoverflow.com/questions/2480650/role-of-bias-in-neural-networks
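A quick check of the shift: with the prediction sigmoid(w₁x + w₀), the curve crosses 0.5 exactly where w₁x + w₀ = 0, i.e., at x = −w₀/w₁, so changing w₀ slides the curve left or right without changing its shape (the weights below are made up):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical slope and bias: sigmoid(w1*x + w0) crosses 0.5 at x = -w0/w1
w1, w0 = 2.0, -4.0
crossing = -w0 / w1  # = 2.0: the bias has shifted the crossing from 0 to 2
print(sigmoid(w1 * crossing + w0))  # 0.5, exactly at the shifted crossing point
```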