Generalized linear models and logistic regression

Comments (Oct. 15, 2019)

• Assignment 1 almost marked
  • Any questions about the problems on this assignment?

• Assignment 2 due soon
  • Any questions?

2


Summary so far

• From chapters 1 and 2, obtained tools needed to talk about uncertainty/noise underlying machine learning
  • capture uncertainty about data/observations using probabilities
  • formalize the estimation problem for distributions
• Identify variables x_1, …, x_d
  • e.g., observed features, observed targets
• Pick the desired distribution
  • e.g., p(x_1, …, x_d) or the conditional distribution p(x_1 | x_2, …, x_d)
  • e.g., p(x_i) is Poisson or p(y | x_1, …, x_d) is Gaussian
• Perform parameter estimation for the chosen distribution
  • e.g., estimate lambda for the Poisson
  • e.g., estimate mu and sigma for the Gaussian

3


Summary so far (2)

• For prediction problems, which is much of machine learning, first discuss
  • the types of data we get (i.e., features and types of targets)
  • the goal to minimize expected cost of incorrect predictions

• Concluded optimal prediction functions use p(y | x) or E[Y | x]

• From there, our goal becomes to estimate p(y|x) or E[Y | x]

• Starting from this general problem specification, it is useful to use our parameter estimation techniques to solve this problem
  • e.g., specify Y = Xw + noise, estimate mu = x^T w

4


Summary so far (3)

• For the linear regression setting, we modeled p(y|x) as a Gaussian with mu = <x, w> and a constant sigma

• Performed maximum likelihood to get weights w

• Possible question: why all this machinery to get to linear regression?
  • one answer: it makes our assumptions about uncertainty more clear

• another answer: it will make it easier to generalize p(y | x) to other distributions (which we will do with GLMs)

5


Estimation approaches for Linear regression

• Recall we estimated w for p(y | x) as a Gaussian

• We discussed the closed form solution

• and using batch or stochastic gradient descent

• Exercise: Now imagine you have 10 new data points. How do we get a new w that incorporates these data points?

6

w = (X^T X)^{-1} X^T y

w_{t+1} = w_t − η X^T (X w_t − y)

w_{t+1} = w_t − η_t x_t^T (x_t w_t − y_t)
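As a rough companion to the three updates above, here is a minimal NumPy sketch; the synthetic data, step sizes, and iteration counts are arbitrary choices for the example, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed-form solution (normal equations): w = (X^T X)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: w_{t+1} = w_t - eta * X^T (X w_t - y)
eta = 1.0 / np.linalg.norm(X.T @ X, 2)   # safe fixed step size (arbitrary choice)
w_gd = np.zeros(d)
for _ in range(500):
    w_gd = w_gd - eta * X.T @ (X @ w_gd - y)

# Stochastic gradient descent: one sample (x_t, y_t) per update
w_sgd = np.zeros(d)
for t in range(5 * n):
    i = t % n
    x_t, y_t = X[i], y[i]
    eta_t = 0.1 / (1.0 + t / n)          # decaying step size (arbitrary schedule)
    w_sgd = w_sgd - eta_t * x_t * (x_t @ w_sgd - y_t)

print(w_closed, w_gd, w_sgd, sep="\n")   # all three should be close to w_true
```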


Exercise: MAP for Poisson

• Recall we estimated lambda for a Poisson p(x)
  • Had a dataset of scalars {x_1, …, x_n}
  • For MLE, found the closed-form solution lambda = average of the x_i

• Can we use gradient descent for this optimization? And if so, should we?

7
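One possible answer to the exercise, sketched in NumPy: gradient descent on the Poisson negative log-likelihood does reach the same answer as the closed form lambda = mean(x), but it needs a step size, an iteration budget, and care to keep lambda positive, so the closed form is clearly preferable here. The data and hyperparameters below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(lam=4.0, size=200)        # synthetic counts; "true" lambda = 4

# Negative log-likelihood, dropping the constant sum of ln(x_i!):
#   NLL(lambda) = n * lambda - (sum_i x_i) * ln(lambda)
# with derivative d NLL / d lambda = n - (sum_i x_i) / lambda.
n, s = len(x), x.sum()

lam = 1.0                                 # arbitrary positive starting point
eta = 1e-3                                # arbitrary small step size
for _ in range(2000):
    grad = n - s / lam
    lam = lam - eta * grad                # in general we must also keep lambda > 0

print(lam, x.mean())                      # gradient descent approaches the closed-form MLE
```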


Exercise: Predicting the number of accidents

• In Assignment 1, learned p(y) as Poisson, where Y is the number of accidents in a factory

• How would the question from assignment 1 change if we also wanted to condition on features?
  • For example, want to model the number of accidents in the factory, given x_1 = size of the factory and x_2 = number of employees

• What is p(y | x)? What are the parameters?

8


Poisson regression

9

p(y|x) = Poisson(y | λ = exp(x^T w))

1. E[Y|x] = exp(w^T x)

2. p(y|x) = Poisson(λ)
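A minimal sketch of fitting this model by gradient descent on its negative log-likelihood (the synthetic data, step size, and iteration count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])   # bias column + one feature
w_true = np.array([0.5, 0.8])
y = rng.poisson(lam=np.exp(X @ w_true))   # targets drawn from Poisson(exp(x^T w))

# Negative log-likelihood (dropping the ln(y_i!) terms):
#   NLL(w) = sum_i exp(x_i^T w) - y_i x_i^T w
# with gradient X^T (exp(Xw) - y); here averaged over n for a stable step size.
w = np.zeros(d)
eta = 0.05                                # arbitrary step size
for _ in range(2000):
    w = w - (eta / n) * X.T @ (np.exp(X @ w) - y)

print(w)                                  # should be close to w_true
print(np.exp(X[:3] @ w))                  # predicted E[Y | x] for the first three points
```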


Exponential Family Distributions

10

p(y|θ) = exp(θy − a(θ) + b(y))

The transfer f corresponds to the derivative of the log-normalizer function a.

θ = x^T w: we will always linearly predict the natural parameter.

Useful property:  da(θ)/dθ = E[Y]
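A quick numerical check of this property for the Poisson case (the value of theta, the truncation point, and the finite-difference step are arbitrary choices for the example):

```python
import math
import numpy as np

# Poisson in exponential-family form: theta = ln(lambda), a(theta) = exp(theta),
# b(y) = -ln(y!).  Check numerically that da/dtheta equals E[Y].
theta = 0.7                                   # arbitrary natural parameter
ys = np.arange(0, 200)                        # truncated support (tail mass is negligible here)
log_p = theta * ys - np.exp(theta) - np.array([math.lgamma(k + 1) for k in ys])
p = np.exp(log_p)

mean_y = np.sum(ys * p)                       # E[Y] under p(y|theta)

eps = 1e-5                                    # finite-difference derivative of a(theta) = exp(theta)
da_dtheta = (np.exp(theta + eps) - np.exp(theta - eps)) / (2 * eps)

print(mean_y, da_dtheta, np.exp(theta))       # all three should agree (E[Y] = lambda)
```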


Examples

• Gaussian distribution:  a(θ) = θ²/2,            f(θ) = θ
• Poisson distribution:   a(θ) = exp(θ),          f(θ) = exp(θ)
• Bernoulli distribution: a(θ) = ln(1 + exp(θ)),  f(θ) = 1/(1 + exp(−θ))  (the sigmoid)

with p(y|θ) = exp(θy − a(θ) + b(y)) and θ = x^T w.

11
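A compact way to summarize these examples is to pair each distribution with its log-normalizer a and transfer f; the short sketch below (the names are just illustrative) also checks f = da/dθ with a finite difference:

```python
import numpy as np

# Log-normalizers a(theta) and transfers f(theta) = a'(theta) for the three examples above.
glm_families = {
    "gaussian":  {"a": lambda t: 0.5 * t ** 2,        "f": lambda t: t},
    "poisson":   {"a": np.exp,                        "f": np.exp},
    "bernoulli": {"a": lambda t: np.log1p(np.exp(t)), "f": lambda t: 1.0 / (1.0 + np.exp(-t))},
}

theta = 0.3                                    # arbitrary test point
for name, fam in glm_families.items():
    eps = 1e-6                                 # f should match a central finite difference of a
    approx = (fam["a"](theta + eps) - fam["a"](theta - eps)) / (2 * eps)
    print(name, fam["f"](theta), approx)
```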


Exercise: How do we extract the form for the Poisson distribution?

12

where θ ∈ R is the parameter to the distribution, a : R → R is a log-normalizer function and b : R → R is a function of only x that will typically be ignored in our optimization because it is not a function of θ. Many of the often encountered (families of) distributions are members of the exponential family; e.g., the exponential, Gaussian, Gamma, Poisson, or binomial distributions. Therefore, it is useful to generically study the exponential family to better understand commonalities and differences between individual member functions.

Example 17: The Poisson distribution can be expressed as

p(x|λ) = exp(x log λ − λ − log x!),

where λ ∈ R+ and X = N_0. Thus, θ = log λ, a(θ) = e^θ, and b(x) = −log x!.

Now let us get some further insight into the properties of the exponential family parameters and why this class is convenient for estimation. The function a(θ) is typically called the log-partitioning function or simply a log-normalizer. It is called this because

a(θ) = log ∫_X exp(θx + b(x)) dx

and so plays the role of ensuring that we have a valid density: ∫_X p(x) dx = 1. Importantly, for many common GLMs, the derivative of a corresponds to the inverse of the link function. For example, for Poisson regression, the link function is g(θ) = log(θ), and the derivative of a is e^θ, which is the inverse of g. Therefore, as we discuss below, the log-normalizer for an exponential family informs what link g should be used (or correspondingly the transfer f = g^{-1}).

The properties of this log-normalizer are also key for estimation of generalized linear models. It can be derived that

∂a(θ)/∂θ = E[X]    and    ∂²a(θ)/∂θ² = V[X]

7.3 Formalizing generalized linear models

We shall now formalize the generalized linear models. The two key components of GLMs can be expressed as

1. E[y|x] = f(ω^T x), or g(E[y|x]) = ω^T x where g = f^{-1}

2. p(y|x) is an exponential family distribution

The function f is called the transfer function and g is called the link function. For Poisson regression, f is the exponential function, and as we shall see for logistic regression, f is the sigmoid function. The transfer function adjusts the range of ω^T x to the domain of Y; because of this relationship, link functions are usually not selected independently of the distribution for Y. The generalization to the exponential family from the Gaussian distribution used in ordinary least-squares regression allows us to model a much wider range of target functions. GLMs include three widely used models: linear regression, Poisson regression and logistic regression, which we will talk about in the next chapter about classification.

p(y|θ) = exp(θy − a(θ) + b(y))

• What is the transfer f?

f(θ) = da(θ)/dθ = exp(θ)


Exercise: How do we extract the form for the exponential distribution?

• Recall exponential family distribution

• How do we write the exponential distribution this way?

• What is the transfer f?

13

λ exp(−λy),  λ > 0

p(y|θ) = exp(θy − a(θ) + b(y))

i.e., θ = x^T w,  a(θ) = −ln(−θ),  b(y) = 0

f(θ) = da(θ)/dθ = −1/θ


Logistic regression

14

the full version of the Newton-Raphson algorithm with the Hessian matrix. Instead, Gauss-Newton and other types of solutions are considered and are generally called iteratively reweighted least-squares (IRLS) algorithms in the statistical literature.

5.4 Logistic regression

At the end, we mention that GLMs extend to classification. One of the most popular uses of GLMs is a combination of a Bernoulli distribution with a logit link function. This framework is frequently encountered and is called logistic regression. We summarize the logistic regression model as follows

1. logit(E[y|x]) = ω^T x

2. p(y|x) = Bernoulli(α)

where logit(x) = ln(x / (1 − x)), y ∈ {0, 1}, and α ∈ (0, 1) is the parameter (mean) of the Bernoulli distribution. It follows that

E[y|x] = 1 / (1 + e^{−ω^T x})

and

p(y|x) = (1 / (1 + e^{−ω^T x}))^y (1 − 1 / (1 + e^{−ω^T x}))^{1−y}.

Given a data set D = {(x_i, y_i)}_{i=1}^n, where x_i ∈ R^k and y_i ∈ {0, 1}, the parameters of the model w can be found by maximizing the likelihood function.

[Figure 6.2: The sigmoid function f(t) = 1 / (1 + e^{−t}) on the interval [−5, 5].]

as negative (y = 0). Thus, the predictor maps a (d + 1)-dimensional vector x = (x_0 = 1, x_1, ..., x_d) into a zero or one. Note that P(Y = 1|x, w*) ≥ 0.5 only when w^T x ≥ 0. The expression w^T x = 0 represents the equation of a hyperplane that separates positive and negative examples. Thus, the logistic regression model is a linear classifier.

6.1.2 Maximum conditional likelihood estimation

To frame the learning problem as parameter estimation, we will assume that the data set D = {(x_i, y_i)}_{i=1}^n is an i.i.d. sample from a fixed but unknown probability distribution p(x, y). Even more specifically, we will assume that the data generating process randomly draws a data point x, a realization of the random vector (X_0 = 1, X_1, ..., X_d), according to p(x) and then sets its class label Y according to the Bernoulli distribution

p(y|x) = { 1 / (1 + e^{−ω^T x})        for y = 1
           1 − 1 / (1 + e^{−ω^T x})    for y = 0 }       (6.2)
       = σ(x^T w)^y (1 − σ(x^T w))^{1−y}

where ω = (ω_0, ω_1, ..., ω_d) is a set of unknown coefficients we want to recover (or learn) from the observed data D. Based on the principles of parameter estimation, we can estimate ω by maximizing the conditional

α = p(y = 1|x)

g = logit (the link function)

f(x^T w) = g^{−1}(x^T w) = sigmoid(x^T w) = E[y|x]

linear classifiers, our flexibility is restricted because our method must learn the posterior probabilities p(y|x) and at the same time have a linear decision surface in R^d. This, however, can be achieved if p(y|x) is modeled as a monotonic function of w^T x; e.g., tanh(w^T x) or (1 + e^{−w^T x})^{−1}. Of course, a model trained to learn posterior probabilities p(y|x) can be seen as a "soft" predictor or a scoring function s : X → [0, 1]. Then, the conversion from s to f is a straightforward application of the maximum a posteriori principle: the predicted output is positive if s(x) ≥ 0.5 and negative if s(x) < 0.5. More generally, the scoring function can be any mapping s : X → R, with thresholding applied based on any particular value τ.

8.1 Logistic regression

Let us consider binary classification in R^d, where X = R^{d+1} and Y = {0, 1}. Logistic regression is a Generalized Linear Model, where the distribution over Y given x is a Bernoulli distribution, and the transfer function is the sigmoid function, also called the logistic function,

σ(t) = (1 + e^{−t})^{−1}

plotted in Figure 8.2. In the same terminology as for GLMs, the transfer function is the sigmoid and the link function (the inverse of the transfer function) is the logit function logit(x) = ln(x / (1 − x)), with

1. E[y|x] = σ(ω^T x)

2. p(y|x) = Bernoulli(α) with α = E[y|x].

The Bernoulli distribution, with α a function of x, is

p(y|x) = { 1 / (1 + e^{−ω^T x})        for y = 1
           1 − 1 / (1 + e^{−ω^T x})    for y = 0 }       (8.1)
       = σ(x^T w)^y (1 − σ(x^T w))^{1−y}

where ω = (ω_0, ω_1, ..., ω_d) is a set of unknown coefficients we want to recover (or learn). Notice that our prediction is σ(ω^T x), and that it satisfies

p(y = 1|x) = σ(ω^T x).

Therefore, as with many binary classification approaches, our goal is to predict the probability that the class is 1; given this probability, we can infer p(y = 0|x) = 1 − p(y = 1|x).

8.1.1 Predicting class labels

The function learned by logistic regression returns a probability, rather than an explicit prediction of 0 or 1. Therefore, we have to take this probability estimate and convert it to a suitable prediction of the class. For a previously unseen data point x and a set of learned coefficients w, we simply calculate the posterior probability as

P(Y = 1|x, w) = 1 / (1 + e^{−w^T x}).


What is c(w) for GLMs?

• Still formulating an optimization problem to predict targets y given features x

• The variable we learn is the weight vector w

• What is c(w)?

15

MLE:  c(w) ∝ −ln p(D|w) ∝ −Σ_{i=1}^n ln p(y_i | x_i, w)

argmin_w c(w) = argmax_w p(D|w)


Cross-entropy loss for Logistic Regression

16

c_i(w) = −( y_i ln σ(w^T x_i) + (1 − y_i) ln(1 − σ(w^T x_i)) )
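A minimal implementation of this loss (the small eps guard against log(0) and the toy data are my own additions for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, y):
    """Sum over i of c_i(w) = -[y_i ln sigma(w^T x_i) + (1 - y_i) ln(1 - sigma(w^T x_i))]."""
    p = sigmoid(X @ w)
    eps = 1e-12                       # guard against log(0); an implementation detail, not from the slide
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Tiny illustration with made-up numbers; the first column of X is a bias feature.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)
print(cross_entropy(w, X, y))         # equals 3 * ln(2) when every prediction is 0.5
```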


Extra exercises

• Go through the derivation of c(w) for logistic regression

• Derive Maximum Likelihood objective in Section 8.1.2

17


Benefits of GLMs

• Gave a generic update rule, where you only needed to know the transfer for your chosen distribution
  • e.g., linear regression with transfer f = identity

• e.g., Poisson regression with transfer f = exp

• e.g., logistic regression with transfer f = sigmoid

• We know the objective is convex in w!

18
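A sketch of that generic update: for a GLM with θ_i = x_i^T w, the per-sample gradient of the negative log-likelihood is (f(x_i^T w) − y_i) x_i, so swapping the transfer f switches between linear, Poisson, and logistic regression. The function name, step size, epoch count, and synthetic data below are arbitrary choices:

```python
import numpy as np

def glm_sgd(X, y, transfer, eta=0.01, epochs=50, seed=0):
    """Generic stochastic-gradient update for a GLM.

    The per-sample gradient of the negative log-likelihood is
    (f(x_i^T w) - y_i) * x_i, so only the transfer f changes between
    linear, Poisson, and logistic regression.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            w -= eta * (transfer(X[i] @ w) - y[i]) * X[i]
    return w

# The three transfers from the slide.
identity = lambda t: t                          # linear regression
exp_transfer = np.exp                           # Poisson regression
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))    # logistic regression

# Tiny logistic-regression example with synthetic data.
rng = np.random.default_rng(1)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 1))])
y = (sigmoid(X @ np.array([-0.5, 2.0])) > rng.uniform(size=200)).astype(float)
print(glm_sgd(X, y, sigmoid))                   # roughly recovers [-0.5, 2.0]
```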


Convexity

• Convexity of the negative log-likelihood of (many) exponential families
  • The negative log-likelihood of many exponential families is convex, which is an important advantage of the maximum likelihood approach

• Why is convexity important?
  • e.g., (sigmoid(x^T w) − y)² is nonconvex, but who cares?

19


Cross-entropy loss versus Euclidean loss for classification

• Why not just use the Euclidean (squared) loss below?

• The notes explain that this is a non-convex objective
  • from personal experience, it seems to do more poorly in practice

• If no obvious reason to prefer one or the other, we may as well pick the objective that is convex (no local minima)

20

min_{w ∈ R^d}  Σ_{i=1}^n (σ(x_i^T w) − y_i)²

versus the cross-entropy loss

c_i(w) = −( y_i ln σ(w^T x_i) + (1 − y_i) ln(1 − σ(w^T x_i)) )


How can we check convexity?

• Can check the definition of convexity

• Can check the second derivative for scalar parameters (e.g., λ) and the Hessian for multidimensional parameters (e.g., w)
  • e.g., for linear regression (least-squares), the Hessian is H = X^T X and so positive semi-definite
  • e.g., for Poisson regression, the Hessian of the negative log-likelihood is H = X^T C X (with C diagonal and positive) and so positive semi-definite

21

[Slide figure: excerpt from the Wikipedia article "Convex function": a function is convex if and only if the region above its graph (its epigraph) is a convex set.]
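A small numerical sanity check of these two Hessians (the random data, weights, and eigenvalue tolerance are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 4
X = rng.normal(size=(n, d))
w = rng.normal(size=d)

# Least squares: Hessian of the objective is H = X^T X (up to a constant factor).
H_ls = X.T @ X

# Poisson regression: Hessian of the negative log-likelihood is H = X^T C X,
# with C diagonal and C_ii = exp(x_i^T w) > 0.
C = np.diag(np.exp(X @ w))
H_pois = X.T @ C @ X

# Positive semi-definite <=> all eigenvalues >= 0 (up to numerical error).
print(np.min(np.linalg.eigvalsh(H_ls)) >= -1e-8)
print(np.min(np.linalg.eigvalsh(H_pois)) >= -1e-8)
```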


Prediction with logistic regression

• So far, we have used the prediction f(x^T w)
  • e.g., x^T w for linear regression, exp(x^T w) for Poisson regression
• For binary classification, want to output 0 or 1, rather than the probability value p(y = 1 | x) = sigmoid(x^T w)
• The sigmoid maps few values of x^T w close to 0.5; values of x^T w somewhat larger than 0 are mapped close to 1 (and values somewhat below 0 are mapped close to 0)

• Decision threshold:
  • sigmoid(x^T w) < 0.5 is class 0
  • sigmoid(x^T w) > 0.5 is class 1

22

[Figure 6.2: The sigmoid function f(t) = 1 / (1 + e^{−t}) on the interval [−5, 5].]
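A minimal sketch of the thresholding step (the weights and points are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(x, w, threshold=0.5):
    """Return 1 if p(y=1|x) = sigmoid(x^T w) reaches the decision threshold, else 0."""
    return int(sigmoid(x @ w) >= threshold)

# Made-up weights and points, just to show the thresholding.
w = np.array([0.2, 1.5, -1.0])
print(predict_class(np.array([1.0, 2.0, 0.5]), w))   # sigmoid(2.7) ~ 0.94 -> class 1
print(predict_class(np.array([1.0, -1.0, 1.0]), w))  # sigmoid(-2.3) ~ 0.09 -> class 0
```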


Logistic regression is a linear classifier

• Hyperplane separates the two classes
  • P(y=1 | x, w) > 0.5 only when w^T x > 0
  • P(y=0 | x, w) > 0.5 only when P(y=1 | x, w) < 0.5, which happens when w^T x < 0

23

[Figure 6.1: A data set in R² consisting of nine positive and nine negative examples. The gray line w_0 + w_1 x_1 + w_2 x_2 = 0 represents a linear decision surface in R². The decision surface does not perfectly separate positives from negatives.]

to f is a straightforward application of the maximum a posteriori principle: the predicted output is positive if g(x) ≥ 0.5 and negative if g(x) < 0.5.

6.1 Logistic regression

Let us consider binary classification in R^d, where X = R^{d+1} and Y = {0, 1}. The basic idea for many classification approaches is to hypothesize a closed-form representation for the posterior probability that the class label is positive and learn parameters w from data. In logistic regression, this relationship can be expressed as

P(Y = 1|x, w) = 1 / (1 + e^{−w^T x}),        (6.1)

which is a monotonic function of w^T x. The function σ(t) = (1 + e^{−t})^{−1} is called the sigmoid function or the logistic function and is plotted in Figure 6.2.

6.1.1 Predicting class labels

For a previously unseen data point x and a set of coefficients w* found from Eq. (6.4) or Eq. (6.11), we simply calculate the posterior probability as

P(Y = 1|x, w*) = 1 / (1 + e^{−w*^T x}).

If P(Y = 1|x, w*) ≥ 0.5 we conclude that data point x should be labeled as positive (y = 1). Otherwise, if P(Y = 1|x, w*) < 0.5, we label the data point


• e.g., w = [2.75, −1/3, −1]
  • x = [1, 4, 3]: x^T w = −1.8 < 0
  • x = [1, 2, 1]: x^T w = 0.27 > 0


Logistic regression versus Linear regression

• Why might one be better than the other? They both use a linear approach

• Linear regression could still learn <x, w> to predict E[Y | x]

• Demo: logistic regression performs better under outliers, when the outlier is still on the correct side of the line

• Conclusion:
  • logistic regression better reflects the goals of predicting p(y=1 | x) and finding a separating hyperplane
  • linear regression assumes E[Y | x] is a linear function of x!

24


Adding regularizers to GLMs

• How do we add regularization to logistic regression?

• We had an optimization for logistic regression to get w: minimize negative log-likelihood, i.e. minimize cross-entropy

• Now want to balance negative log-likelihood and regularizer (i.e., the prior for MAP)

• Simply add regularizer to the objective function

25


Adding a regularizer to logistic regression

• Original objective function for logistic regression

26

everything more suitable for further steps, we will slightly rearrange Eq. (6.6). Notice first that

1 − 1 / (1 + e^{−w^T x_i}) = e^{−w^T x_i} / (1 + e^{−w^T x_i})

giving

ll(w) = Σ_{i=1}^n ( −y_i log(1 + e^{−w^T x_i}) + (1 − y_i) log(e^{−w^T x_i}) − (1 − y_i) log(1 + e^{−w^T x_i}) )
      = Σ_{i=1}^n ( (y_i − 1) w^T x_i + log(1 / (1 + e^{−w^T x_i})) ).        (6.7)

It does not take much effort to realize that there is no closed-form solution to ∇ll(w) = 0 (we did have this luxury in linear regression, but not here). Thus, we have to proceed with iterative optimization methods. We will calculate ∇ll(w) and H_ll(w) to enable either method (first-order with only the gradient or second-order with the gradient and Hessian), both written as a function of inputs X, class labels y, and the current parameter vector. We can calculate the first and second partial derivatives of ll(w) as follows

∂ll(w)/∂w_j = Σ_{i=1}^n ( (y_i − 1) x_ij − 1 / (1 + e^{−w^T x_i}) · e^{−w^T x_i} · (−x_ij) )
            = Σ_{i=1}^n x_ij ( y_i − 1 + e^{−w^T x_i} / (1 + e^{−w^T x_i}) )
            = Σ_{i=1}^n x_ij ( y_i − 1 / (1 + e^{−w^T x_i}) )
            = f_j^T (y − p),

where f_j is the j-th column (feature) of data matrix X, y is an n-dimensional column vector of class labels and p is an n-dimensional column vector of (estimated) posterior probabilities p_i = P(Y_i = 1|x_i, w), for i = 1, ..., n. Considering partial derivatives for every component of w, we have

∇ll(w) = X^T (y − p).        (6.8)

The second partial derivative of the log-likelihood function can be found as

∂²ll(w)/(∂w_j ∂w_k) = Σ_{i=1}^n x_ij · e^{−w^T x_i} / (1 + e^{−w^T x_i})² · (−x_ik)
                    = −Σ_{i=1}^n x_ij · 1 / (1 + e^{−w^T x_i}) · (1 − 1 / (1 + e^{−w^T x_i})) · x_ik
                    = −f_j^T P(I − P) f_k,

argmax_w ll(w), i.e., argmin_w −ll(w)


• Adding regularizer

argmin_w  −ll(w) + λ‖w‖₂²
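A sketch of gradient descent on this regularized objective, reusing the gradient ∇ll(w) = X^T(y − p) from Eq. (6.8) above; the 1/n averaging, step size, λ, iteration count, and synthetic data are my own choices for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_l2(X, y, lam=0.1, eta=0.2, iters=2000):
    """Minimize -ll(w)/n + lam * ||w||_2^2 by gradient descent, using
    grad ll(w) = X^T (y - p) from Eq. (6.8).  The 1/n averaging, step size,
    lam, and iteration count are arbitrary choices for this sketch."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = -(X.T @ (y - p)) / n + 2.0 * lam * w
        w = w - eta * grad
    return w

# Synthetic data: bias column plus two features.
rng = np.random.default_rng(4)
X = np.hstack([np.ones((300, 1)), rng.normal(size=(300, 2))])
w_true = np.array([0.5, 2.0, -1.0])
y = (sigmoid(X @ w_true) > rng.uniform(size=300)).astype(float)
print(fit_logreg_l2(X, y))            # weights are shrunk toward zero by the regularizer
```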


Other regularizers

• Have discussed l2 and l1 regularizers
• Other examples:
  • elastic net regularization is a combination of l1 and l2 (i.e., l1 + l2): ensures a unique solution
  • capped regularizers: do not prevent large weights

Does this regularizer still protect against overfitting?

27

* Figure from "Robust Dictionary Learning with Capped l1-Norm", Jiang et al., IJCAI 2015


Practical considerations: outliers

• What happens if one sample is bad?
• Regularization helps a little bit
• Can also change losses
• Robust losses
  • use l1 instead of l2
  • even better: use capped l1

• What are the disadvantages to these losses?

28


Exercise: intercept unit

• In linear regression, we added an intercept unit (bias unit) to the features
  • i.e., added a feature that is always 1 to the feature vector
• Does it make sense to do this for GLMs?
  • e.g., sigmoid(<x, w> + w_0)

29


Adding a column of ones to GLMs

• This provides the same outcome as for linear regression

• g(E[y | x]) = x^T w: a bias unit in x with coefficient w_0 shifts the function left or right

30

* Figure from http://stackoverflow.com/questions/2480650/role-of-bias-in-neural-networks
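A tiny sketch of this effect: with a column of ones added, changing w_0 moves the point where sigmoid(<x, w> + w_0) crosses 0.5, without changing the shape of the curve. The numbers are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Adding a constant-1 feature gives sigmoid(<x, w> + w_0); changing w_0 shifts the
# curve left or right without changing its shape.  Numbers are illustrative only.
x = np.linspace(-5, 5, 11)
X = np.column_stack([np.ones_like(x), x])        # bias column of ones + original feature
for w0 in (-2.0, 0.0, 2.0):
    w = np.array([w0, 1.0])
    p = sigmoid(X @ w)
    print(f"w0 = {w0:+.1f}: p(y=1 | x=0) = {p[5]:.2f}")   # the 0.5 crossing moves with w0
```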