Transcript of Linear Regression - Carnegie Mellon School of Computer Science (10-701 Introduction to Machine Learning, Fall 2016)

Page 1:

Linear Regression

1

Matt Gormley
Lecture 4
September 19, 2016

School of Computer Science

Readings: Bishop 3.1; Murphy 7

10-701 Introduction to Machine Learning

Page 2:

Reminders

• Homework 1: due 9/26/16

• Project Proposal: due 10/3/16 (start early!)

 

2  

Page 3:

Outline

• Linear Regression
  - Simple example
  - Model
• Learning
  - Gradient Descent
  - SGD
  - Closed Form
• Advanced Topics
  - Geometric and Probabilistic Interpretation of LMS
  - L2 Regularization
  - L1 Regularization
  - Features
  - Locally-Weighted Linear Regression
  - Robust Regression

3  

Page 4:

Outline

• Linear Regression
  - Simple example
  - Model
• Learning (aka. Least Squares)
  - Gradient Descent
  - SGD (aka. Least Mean Squares (LMS))
  - Closed Form (aka. Normal Equations)
• Advanced Topics
  - Geometric and Probabilistic Interpretation of LMS
  - L2 Regularization (aka. Ridge Regression)
  - L1 Regularization (aka. LASSO)
  - Features (aka. non-linear basis functions)
  - Locally-Weighted Linear Regression
  - Robust Regression

  4  

Page 5:

Linear regression in 1D

• Our goal is to estimate w from a training set of (xi, yi) pairs

• Optimization goal: minimize the squared error (least squares):

\arg\min_w \sum_i (y_i - w x_i)^2

• Why least squares?
  - minimizes squared distance between measurements and predicted line
  - has a nice probabilistic interpretation (see HW)
  - the math is pretty

Model: y = wx + \varepsilon

[Figure: X-Y scatter of the data with the fitted line]

Slide courtesy of William Cohen

Page 6:

Solving Linear Regression in 1D

• To optimize, in closed form:

• We just take the derivative w.r.t. w and set it to 0:

\frac{\partial}{\partial w} \sum_i (y_i - w x_i)^2 = \sum_i -2 x_i (y_i - w x_i) \;\Rightarrow\; 2 \sum_i x_i (y_i - w x_i) = 0

\Rightarrow\; 2 \sum_i x_i y_i - 2 w \sum_i x_i^2 = 0 \;\Rightarrow\; \sum_i x_i y_i = w \sum_i x_i^2 \;\Rightarrow\; w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}

Slide  courtesy  of  William  Cohen  
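As a concrete illustration (not part of the original slides), a minimal NumPy sketch of this 1D closed-form estimate, tried on synthetic data like the examples on the following pages, might look like:

import numpy as np

def fit_1d_least_squares(x, y):
    """Closed-form 1D least squares through the origin: w = sum(x*y) / sum(x^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(x * y) / np.sum(x ** 2)

# Illustrative data generated with w = 2 plus Gaussian noise (std = 1)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 2.0 * x + rng.normal(scale=1.0, size=100)
print(fit_1d_least_squares(x, y))  # recovers a value close to 2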

Page 7:

Linear regression in 1D

• Given an input x we would like to compute an output y

• In linear regression we assume that y and x are related by the equation y = wx + ε, where w is a parameter and ε represents measurement error or other noise

[Figure: X-Y scatter of observed values with the fitted line; Y is what we are trying to predict]

Slide  courtesy  of  William  Cohen  

Page 8:

Regression example

•  Generated: w=2 •  Recovered: w=2.03 •  Noise: std=1

Slide  courtesy  of  William  Cohen  

Page 9:

Regression example

•  Generated: w=2 •  Recovered: w=2.05 •  Noise: std=2

Slide  courtesy  of  William  Cohen  

Page 10:

Regression example

•  Generated: w=2 •  Recovered: w=2.08 •  Noise: std=4

Slide  courtesy  of  William  Cohen  

Page 11:

Bias term

• So far we assumed that the line passes through the origin
• What if the line does not? No problem, simply change the model to y = w0 + w1 x + ε
• Can use least squares to determine w0 and w1:

w_0 = \frac{\sum_i (y_i - w_1 x_i)}{n} \qquad\qquad w_1 = \frac{\sum_i x_i (y_i - w_0)}{\sum_i x_i^2}

[Figure: fitted line with intercept w0 shown on the Y axis]

Slide  courtesy  of  William  Cohen  
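A minimal sketch (not from the slides; the helper name is illustrative) that handles the bias term by adding a constant feature and solving the two-parameter least squares problem directly:

import numpy as np

def fit_line_with_bias(x, y):
    """Fit y = w0 + w1*x by least squares using the design matrix [1, x]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])   # the column of ones handles w0
    w0, w1 = np.linalg.lstsq(X, y, rcond=None)[0]
    return w0, w1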

Page 12:

Linear Regression

12

Data: Inputs are continuous vectors of length K. Outputs are continuous scalars.

\mathcal{D} = \{ \mathbf{x}^{(i)}, y^{(i)} \}_{i=1}^{N} \text{ where } \mathbf{x} \in \mathbb{R}^K \text{ and } y \in \mathbb{R}

What are some example problems of this form?

Page 13:

Linear Regression

13

Data: Inputs are continuous vectors of length K. Outputs are continuous scalars.

\mathcal{D} = \{ \mathbf{x}^{(i)}, y^{(i)} \}_{i=1}^{N} \text{ where } \mathbf{x} \in \mathbb{R}^K \text{ and } y \in \mathbb{R}

Prediction: Output is a linear function of the inputs.

y = h_\theta(\mathbf{x}) = \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_K x_K = \theta^T \mathbf{x} \quad \text{(we assume } x_1 \text{ is } 1\text{)}

Learning: finds the parameters that minimize some objective function.

\theta^* = \arg\min_\theta J(\theta)

Page 14:

Least Squares

14

Learning: finds the parameters that minimize some objective function.

\theta^* = \arg\min_\theta J(\theta)

We minimize the sum of the squares:

J(\theta) = \frac{1}{2} \sum_{i=1}^{N} (\theta^T \mathbf{x}^{(i)} - y^{(i)})^2

Why?
1. Reduces distance between true measurements and predicted hyperplane (line in 1D)
2. Has a nice probabilistic interpretation
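As an illustration (not in the original slides), a minimal NumPy sketch of this prediction and objective; X here is an N-by-K design matrix whose first column is all ones, matching the x1 = 1 convention:

import numpy as np

def predict(theta, X):
    """h_theta(x) = theta^T x for each row of the design matrix X."""
    return X @ theta

def least_squares_objective(theta, X, y):
    """J(theta) = 1/2 * sum_i (theta^T x^(i) - y^(i))^2."""
    residuals = predict(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)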

Page 15:

Least Squares

15

Learning: finds the parameters that minimize some objective function.

\theta^* = \arg\min_\theta J(\theta)

We minimize the sum of the squares:

J(\theta) = \frac{1}{2} \sum_{i=1}^{N} (\theta^T \mathbf{x}^{(i)} - y^{(i)})^2

Why?
1. Reduces distance between true measurements and predicted hyperplane (line in 1D)
2. Has a nice probabilistic interpretation

This is a very general optimization setup. We could solve it in lots of ways. Today, we'll consider three ways.

Page 16:

Least Squares

16

Learning: Three approaches to solving \theta^* = \arg\min_\theta J(\theta)

Approach 1: Gradient Descent (take larger, more certain, steps opposite the gradient)

Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)

Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)

Page 17:

Learning  as  optimization  

•  The  main  idea  today  – Make  two  assumptions:    

•  the  classifier  is  linear,  like  naïve  Bayes  –  ie  roughly  of  the  form  fw(x)  =  sign(x  .  w)  

•  we  want  to  find  the  classifier  w  that  “fits  the  data  best”  

– Formalize  as  an  optimization  problem  •  Pick  a  loss  function  J(w,D),  and  find    

argminw  J(w)    •  OR:  Pick  a  “goodness”  function,  often  Pr(D|w),  and  find  the  argmaxw  fD(w)    

17  Slide  courtesy  of  William  Cohen  

Page 18:

Learning as optimization: warmup

Find: optimal (MLE) parameter θ of a binomial
Dataset: D = {x1, …, xn}, where each xi is 0 or 1 and k of them are 1

18

P(D \mid \theta) = \text{const} \cdot \prod_i \theta^{x_i} (1-\theta)^{1-x_i} = \theta^k (1-\theta)^{n-k}

\frac{d}{d\theta} P(D \mid \theta) = \frac{d}{d\theta} \left( \theta^k (1-\theta)^{n-k} \right)
  = \left( \frac{d}{d\theta}\theta^k \right)(1-\theta)^{n-k} + \theta^k \frac{d}{d\theta}(1-\theta)^{n-k}
  = k\theta^{k-1}(1-\theta)^{n-k} + \theta^k (n-k)(1-\theta)^{n-k-1}(-1)
  = \theta^{k-1}(1-\theta)^{n-k-1}\left( k(1-\theta) - \theta(n-k) \right)

Slide courtesy of William Cohen

Page 19:

Learning as optimization: warmup

Goal: Find the best parameter θ of a binomial
Dataset: D = {x1, …, xn}, where each xi is 0 or 1 and k of them are 1

19

Setting the derivative to 0 gives θ = 0, θ = 1, or:

k - kθ - nθ + kθ = 0  ⟹  nθ = k  ⟹  θ = k/n

Slide  courtesy  of  William  Cohen  
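For example (an illustrative number, not from the slides): if n = 10 coin flips contain k = 7 heads, the MLE is θ = 7/10 = 0.7.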

Page 20:

Learning  as  optimization  with  gradient  ascent  

•  Goal:  Learn  the  parameter  w  of  …  

•  Dataset:  D={(x1,y1)…,(xn  ,  yn)}    •  Use  your  model  to  define  

–  Pr(D|w)  =  ….  •  Set  w  to  maximize  Likelihood  

– Usually  we  use  numeric  methods  to  find  the  optimum  

–  i.e.,  gradient  ascent:  repeatedly  take  a  small  step  in  the  direction  of  the  gradient  

   

20  

Difference  between  ascent  and  descent  is  only  the  sign:    I’m  going  to  be  sloppy  here  about  that.  

Slide  courtesy  of  William  Cohen  

Page 21:

Gradient ascent

21

To find argmin_x f(x):
• Start with x_0
• For t = 1, …:
  x_{t+1} = x_t + λ f'(x_t), where λ is small

Slide  courtesy  of  William  Cohen  

Page 22:

Gradient descent

22

Likelihood:  ascent    Loss:  descent  

Slide  courtesy  of  William  Cohen  

Page 23:

Pros and cons of gradient descent

• Simple and often quite effective on ML tasks
• Often very scalable
• Only applies to smooth (differentiable) functions
• Might find a local minimum, rather than a global one

23 Slide  courtesy  of  William  Cohen  

Page 24:

Pros and cons of gradient descent

24

There is only one local optimum if the function is convex

Slide  courtesy  of  William  Cohen  

Page 25:

Gradient Descent

25

Algorithm 1 Gradient Descent
1: procedure GD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     θ ← θ + λ ∇_θ J(θ)
5:   return θ

In order to apply GD to Linear Regression all we need is the gradient of the objective function (i.e. the vector of partial derivatives):

\nabla_\theta J(\theta) = \begin{bmatrix} \frac{d}{d\theta_1} J(\theta) \\ \frac{d}{d\theta_2} J(\theta) \\ \vdots \\ \frac{d}{d\theta_K} J(\theta) \end{bmatrix}

Page 26:

Gradient Descent

26

Algorithm 1 Gradient Descent
1: procedure GD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     θ ← θ + λ ∇_θ J(θ)
5:   return θ

There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient is below some small tolerance:

||\nabla_\theta J(\theta)||_2 \le \epsilon

Alternatively we could check that the reduction in the objective function from one iteration to the next is small.
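A runnable sketch of this loop for the least squares objective (an illustration, not the course's reference code); it uses the gradient-norm stopping rule above, an explicit minus sign for the descent step, and illustrative values for the step size lr and tolerance tol:

import numpy as np

def gradient(theta, X, y):
    """Gradient of J(theta) = 1/2 * sum_i (theta^T x^(i) - y^(i))^2, i.e. X^T (X theta - y)."""
    return X.T @ (X @ theta - y)

def gradient_descent(X, y, lr=0.01, tol=1e-6, max_iters=10000):
    theta = np.zeros(X.shape[1])          # theta^(0)
    for _ in range(max_iters):
        g = gradient(theta, X, y)
        if np.linalg.norm(g) <= tol:      # converged: ||grad J||_2 <= epsilon
            break
        theta = theta - lr * g            # step opposite the gradient
    return theta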

Page 27:

Stochastic Gradient Descent (SGD)

27

Algorithm 2 Stochastic Gradient Descent (SGD)
1: procedure SGD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, …, N}) do
5:       θ ← θ + λ ∇_θ J^(i)(θ)
6:   return θ

We need a per-example objective:

Let J(\theta) = \sum_{i=1}^{N} J^{(i)}(\theta) where J^{(i)}(\theta) = \frac{1}{2}(\theta^T \mathbf{x}^{(i)} - y^{(i)})^2.

Applied  to  Linear  Regression,  SGD  is  called  the  Least  Mean  Squares  (LMS)  algorithm  

Page 28:

Stochastic Gradient Descent (SGD)

28

We need a per-example objective:

Let J(\theta) = \sum_{i=1}^{N} J^{(i)}(\theta) where J^{(i)}(\theta) = \frac{1}{2}(\theta^T \mathbf{x}^{(i)} - y^{(i)})^2.

Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm.

Algorithm 2 Stochastic Gradient Descent (SGD)
1: procedure SGD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, …, N}) do
5:       for k ∈ {1, 2, …, K} do
6:         θ_k ← θ_k + λ (d/dθ_k) J^(i)(θ)
7:   return θ

Page 29:

Stochastic Gradient Descent (SGD)

29

We need a per-example objective:

Let J(\theta) = \sum_{i=1}^{N} J^{(i)}(\theta) where J^{(i)}(\theta) = \frac{1}{2}(\theta^T \mathbf{x}^{(i)} - y^{(i)})^2.

Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm.

Algorithm 2 Stochastic Gradient Descent (SGD)
1: procedure SGD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, …, N}) do
5:       for k ∈ {1, 2, …, K} do
6:         θ_k ← θ_k + λ (d/dθ_k) J^(i)(θ)
7:   return θ

Let’s  start  by  calculating  this  partial  derivative  for  the  Linear  Regression  objective  function.  

Page 30:

Partial Derivatives for Linear Reg.

30

Let J(\theta) = \sum_{i=1}^{N} J^{(i)}(\theta) where J^{(i)}(\theta) = \frac{1}{2}(\theta^T \mathbf{x}^{(i)} - y^{(i)})^2.

\frac{d}{d\theta_k} J^{(i)}(\theta) = \frac{d}{d\theta_k} \frac{1}{2}(\theta^T \mathbf{x}^{(i)} - y^{(i)})^2
  = \frac{1}{2} \frac{d}{d\theta_k} (\theta^T \mathbf{x}^{(i)} - y^{(i)})^2
  = (\theta^T \mathbf{x}^{(i)} - y^{(i)}) \frac{d}{d\theta_k} (\theta^T \mathbf{x}^{(i)} - y^{(i)})
  = (\theta^T \mathbf{x}^{(i)} - y^{(i)}) \frac{d}{d\theta_k} \Big( \sum_{k=1}^{K} \theta_k x_k^{(i)} - y^{(i)} \Big)
  = (\theta^T \mathbf{x}^{(i)} - y^{(i)}) \, x_k^{(i)}

Page 31:

Partial Derivatives for Linear Reg.

31

Let J(\theta) = \sum_{i=1}^{N} J^{(i)}(\theta) where J^{(i)}(\theta) = \frac{1}{2}(\theta^T \mathbf{x}^{(i)} - y^{(i)})^2.

\frac{d}{d\theta_k} J^{(i)}(\theta) = \frac{d}{d\theta_k} \frac{1}{2}(\theta^T \mathbf{x}^{(i)} - y^{(i)})^2
  = \frac{1}{2} \frac{d}{d\theta_k} (\theta^T \mathbf{x}^{(i)} - y^{(i)})^2
  = (\theta^T \mathbf{x}^{(i)} - y^{(i)}) \frac{d}{d\theta_k} (\theta^T \mathbf{x}^{(i)} - y^{(i)})
  = (\theta^T \mathbf{x}^{(i)} - y^{(i)}) \frac{d}{d\theta_k} \Big( \sum_{k=1}^{K} \theta_k x_k^{(i)} - y^{(i)} \Big)
  = (\theta^T \mathbf{x}^{(i)} - y^{(i)}) \, x_k^{(i)}

Page 32:

Partial Derivatives for Linear Reg.

32

Let J(\theta) = \sum_{i=1}^{N} J^{(i)}(\theta) where J^{(i)}(\theta) = \frac{1}{2}(\theta^T \mathbf{x}^{(i)} - y^{(i)})^2.

Used by SGD (aka. LMS):

\frac{d}{d\theta_k} J^{(i)}(\theta) = (\theta^T \mathbf{x}^{(i)} - y^{(i)}) \, x_k^{(i)}

Used by Gradient Descent:

\frac{d}{d\theta_k} J(\theta) = \frac{d}{d\theta_k} \sum_{i=1}^{N} J^{(i)}(\theta) = \sum_{i=1}^{N} (\theta^T \mathbf{x}^{(i)} - y^{(i)}) \, x_k^{(i)}
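A small sketch (not from the slides) making the distinction concrete: per_example_grad is what SGD/LMS uses for one example i, while batch_grad sums it over all examples for gradient descent; both are written in vectorized form over all K coordinates at once:

import numpy as np

def per_example_grad(theta, x_i, y_i):
    """(theta^T x^(i) - y^(i)) * x^(i): the per-example gradient used by SGD/LMS."""
    return (theta @ x_i - y_i) * x_i

def batch_grad(theta, X, y):
    """sum_i (theta^T x^(i) - y^(i)) * x^(i): the full gradient used by gradient descent."""
    return X.T @ (X @ theta - y)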

Page 33:

Least Mean Squares (LMS)

33

Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm.

Algorithm 3 Least Mean Squares (LMS)
1: procedure LMS(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, …, N}) do
5:       for k ∈ {1, 2, …, K} do
6:         θ_k ← θ_k + λ (θ^T x^(i) - y^(i)) x_k^(i)
7:   return θ
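A runnable sketch of LMS (an illustration, not the course's reference code); it takes each per-example step in the descent direction, i.e. with a minus sign, uses a fixed small step size, and simplifies "while not converged" to a fixed number of passes:

import numpy as np

def lms(X, y, lr=0.01, epochs=50, seed=0):
    """SGD applied to linear regression (Least Mean Squares)."""
    rng = np.random.default_rng(seed)
    N, K = X.shape
    theta = np.zeros(K)                       # theta^(0)
    for _ in range(epochs):
        for i in rng.permutation(N):          # shuffle({1, ..., N})
            residual = X[i] @ theta - y[i]    # theta^T x^(i) - y^(i)
            theta -= lr * residual * X[i]     # per-example gradient step
    return theta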

Page 34:

Least Squares

34

Learning: Three approaches to solving \theta^* = \arg\min_\theta J(\theta)

Approach 1: Gradient Descent (take larger, more certain, steps opposite the gradient)

Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)

Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)

Page 35:

Background: Matrix Derivatives

• For f : \mathbb{R}^{m \times n} \to \mathbb{R}, define:

\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}

• Trace:

\operatorname{tr} A = \sum_{i=1}^{n} A_{ii}, \qquad \operatorname{tr} a = a, \qquad \operatorname{tr} ABC = \operatorname{tr} CAB = \operatorname{tr} BCA

• Some facts of matrix derivatives (without proof):

\nabla_A \operatorname{tr} AB = B^T, \qquad \nabla_A \operatorname{tr} ABA^T C = CAB + C^T A B^T, \qquad \nabla_A |A| = |A| (A^{-1})^T

© Eric Xing @ CMU, 2006-2011   35

Page 36:

The normal equations

• Write the cost function in matrix form:

J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (\mathbf{x}_i^T \theta - y_i)^2 = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y}) = \frac{1}{2} \left( \theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y} \right)

where the rows of X are the training inputs and \vec{y} stacks the training outputs:

X = \begin{bmatrix} - \; \mathbf{x}_1^T \; - \\ - \; \mathbf{x}_2^T \; - \\ \vdots \\ - \; \mathbf{x}_n^T \; - \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}

• To minimize J(θ), take the derivative and set it to zero:

\nabla_\theta J = \frac{1}{2} \nabla_\theta \operatorname{tr} \left( \theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y} \right) = \frac{1}{2} \left( X^T X \theta + X^T X \theta - 2 X^T \vec{y} \right) = X^T X \theta - X^T \vec{y} = 0

\Rightarrow\; X^T X \theta = X^T \vec{y} \quad \text{(the normal equations)} \;\Rightarrow\; \theta^* = (X^T X)^{-1} X^T \vec{y}

© Eric Xing @ CMU, 2006-2011   36
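A minimal NumPy sketch of solving the normal equations (illustrative, not from the slides); solving the linear system is preferred over forming the explicit inverse:

import numpy as np

def normal_equations(X, y):
    """Solve X^T X theta = X^T y for theta (assumes X has full column rank)."""
    return np.linalg.solve(X.T @ X, X.T @ y)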

Page 37:

Comments on the normal equation

• In most situations of practical interest, the number of data points N is larger than the dimensionality k of the input space and the matrix X is of full column rank. If this condition holds, then it is easy to verify that XᵀX is necessarily invertible.

• The assumption that XᵀX is invertible implies that it is positive definite, thus the critical point we have found is a minimum.

• What if X has less than full column rank? → regularization (later).

© Eric Xing @ CMU, 2006-2011 37

Page 38:

Direct and Iterative methods

• Direct methods: we can achieve the solution in a single step by solving the normal equations
  - Using Gaussian elimination or QR decomposition, we converge in a finite number of steps
  - It can be infeasible when data are streaming in in real time, or of very large amount

• Iterative methods: stochastic or steepest gradient
  - Converging in a limiting sense
  - But more attractive in large practical problems
  - Caution is needed for deciding the learning rate α

© Eric Xing @ CMU, 2006-2011 38

Page 39:

Convergence rate

• Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations, provided the step size is sufficiently small.

• A formal analysis of LMS needs more math; in practice, one can use a small α, or gradually decrease α.

© Eric Xing @ CMU, 2006-2011 39

Page 40:

Convergence Curves

• For the batch method, the training MSE is initially large due to uninformed initialization

• In the online update, N updates for every epoch reduce MSE to a much smaller value

40 © Eric Xing @ CMU, 2006-2011

Page 41:

Least Squares

41

Learning: Three approaches to solving \theta^* = \arg\min_\theta J(\theta)

Approach 1: Gradient Descent (take larger, more certain, steps opposite the gradient)
• pros: conceptually simple, guaranteed convergence
• cons: batch, often slow to converge

Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
• pros: memory efficient, fast convergence, less prone to local optima
• cons: convergence in practice requires tuning and fancier variants

Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)
• pros: one-shot algorithm!
• cons: does not scale to large datasets (the matrix inverse is the bottleneck)

Page 42:

Geometric Interpretation of LMS

• The predictions on the training data are:

\hat{\vec{y}} = X\theta^* = X(X^TX)^{-1}X^T\vec{y}

• Note that

\hat{\vec{y}} - \vec{y} = \left( X(X^TX)^{-1}X^T - I \right)\vec{y}

and

X^T(\hat{\vec{y}} - \vec{y}) = X^T\left( X(X^TX)^{-1}X^T - I \right)\vec{y} = \left( X^TX(X^TX)^{-1}X^T - X^T \right)\vec{y} = 0

so \hat{\vec{y}} is the orthogonal projection of \vec{y} into the space spanned by the columns of X.

© Eric Xing @ CMU, 2006-2011   42

Page 43:

Probabilistic Interpretation of LMS

• Let us assume that the target variable and the inputs are related by the equation:

y_i = \theta^T \mathbf{x}_i + \varepsilon_i

where ε is an error term of unmodeled effects or random noise

• Now assume that ε follows a Gaussian N(0, σ); then we have:

p(y_i \mid \mathbf{x}_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta^T \mathbf{x}_i)^2}{2\sigma^2} \right)

• By the independence assumption:

L(\theta) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i; \theta) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^n \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \theta^T \mathbf{x}_i)^2}{2\sigma^2} \right)

© Eric Xing @ CMU, 2006-2011   43

Page 44:

Probabilistic Interpretation of LMS

• Hence the log-likelihood is:

\ell(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta^T \mathbf{x}_i)^2

• Do you recognize the last term? Yes, it is:

J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (\mathbf{x}_i^T \theta - y_i)^2

• Thus under the independence assumption, LMS is equivalent to MLE of θ!

© Eric Xing @ CMU, 2006-2011   44

Page 45:

Non-Linear basis function

• So far we only used the observed values x1, x2, …
• However, linear regression can be applied in the same way to functions of these values
  - E.g.: to add a term w x1x2, add a new variable z = x1x2 so each example becomes: x1, x2, …, z
• As long as these functions can be directly computed from the observed values, the parameters are still linear in the data and the problem remains a multi-variate linear regression problem

y = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_k x_k + \varepsilon

Slide  courtesy  of  William  Cohen  

Page 46:

Non-linear basis functions

• What type of functions can we use? A few common examples:
  - Polynomial: \phi_j(x) = x^j for j = 0 \ldots n
  - Gaussian: \phi_j(x) = \frac{(x - \mu_j)^2}{2\sigma_j^2}
  - Sigmoid: \phi_j(x) = \frac{1}{1 + \exp(-s_j x)}
  - Logs: \phi_j(x) = \log(x + 1)

Any function of the input values can be used. The solution for the parameters of the regression remains the same.

Slide  courtesy  of  William  Cohen  
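A small sketch (illustrative, not from the slides) of fitting with polynomial basis functions: expand each scalar x into features φ_j(x) = x^j, then solve the same least squares problem in the expanded space:

import numpy as np

def polynomial_features(x, degree):
    """Map scalar inputs to [x^0, x^1, ..., x^degree]; x^0 = 1 is the intercept feature."""
    x = np.asarray(x, dtype=float)
    return np.vstack([x ** j for j in range(degree + 1)]).T

def fit_basis_regression(x, y, degree):
    """Least squares on the expanded design matrix Phi."""
    Phi = polynomial_features(x, degree)
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)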

Page 47:

General linear regression problem

• Using our new notation, the basis function linear regression can be written as

y = \sum_{j=0}^{n} w_j \phi_j(x)

• Where φj(x) can be either xj for multivariate regression or one of the non-linear basis functions we defined

• … and φ0(x) = 1 for the intercept term

Slide  courtesy  of  William  Cohen  

Page 48:

An example: polynomial basis vectors on a small dataset

– From Bishop Ch 1

Slide  courtesy  of  William  Cohen  

Page 49:

0th Order Polynomial

n=10 Slide  courtesy  of  William  Cohen  

Page 50:

1st Order Polynomial

Slide  courtesy  of  William  Cohen  

Page 51:

3rd Order Polynomial

Slide  courtesy  of  William  Cohen  

Page 52:

9th Order Polynomial

Slide  courtesy  of  William  Cohen  

Page 53:

Over-fitting

Root-Mean-Square (RMS) Error:

Slide  courtesy  of  William  Cohen  

Page 54:

Polynomial Coefficients

Slide  courtesy  of  William  Cohen  

Page 55:

Regularization

Penalize large coefficient values:

J_{X,y}(\mathbf{w}) = \frac{1}{2} \sum_i \left( y_i - \sum_j w_j \phi_j(x_i) \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2

Slide  courtesy  of  William  Cohen  

Page 56:

Regularization: +

Slide  courtesy  of  William  Cohen  

Page 57:

Polynomial Coefficients

none exp(18) huge

Slide  courtesy  of  William  Cohen  

Page 58:

Over Regularization:

Slide  courtesy  of  William  Cohen  

Page 59:

Example: Stock Prices

• Suppose we wish to predict Google's stock price at time t+1

• What features should we use? (putting all computational concerns aside)
  - Stock prices of all other stocks at times t, t-1, t-2, …, t-k
  - Mentions of Google with positive / negative sentiment words in all newspapers and social media outlets

• Do we believe that all of these features are going to be useful?

59  

Page 60:

Ridge Regression

60

• Adds an L2 regularizer to Linear Regression:

J_{RR}(\theta) = J(\theta) + \lambda \|\theta\|_2^2 = \frac{1}{2} \sum_{i=1}^{N} (\theta^T \mathbf{x}^{(i)} - y^{(i)})^2 + \lambda \sum_{k=1}^{K} \theta_k^2

• Bayesian interpretation: MAP estimation with a Gaussian prior on the parameters:

\theta_{MAP} = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta(y^{(i)} \mid \mathbf{x}^{(i)}) + \log p(\theta) = \arg\min_\theta J_{RR}(\theta) \quad \text{where } p(\theta) \sim \mathcal{N}(0, \tfrac{1}{\lambda})
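A minimal closed-form sketch of ridge regression (illustrative, not from the slides); the factor of 2 on lam matches the J_RR above, which pairs a 1/2 on the squared error with a plain λ on the penalty, and the added diagonal is what keeps the system invertible even when X lacks full column rank:

import numpy as np

def ridge_regression(X, y, lam):
    """Minimize 1/2 * sum_i (theta^T x^(i) - y^(i))^2 + lam * ||theta||_2^2 in closed form."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + 2 * lam * np.eye(K), X.T @ y)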

Page 61:

LASSO

61

• Adds an L1 regularizer to Linear Regression:

J_{LASSO}(\theta) = J(\theta) + \lambda \|\theta\|_1 = \frac{1}{2} \sum_{i=1}^{N} (\theta^T \mathbf{x}^{(i)} - y^{(i)})^2 + \lambda \sum_{k=1}^{K} |\theta_k|

• Bayesian interpretation: MAP estimation with a Laplace prior on the parameters:

\theta_{MAP} = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta(y^{(i)} \mid \mathbf{x}^{(i)}) + \log p(\theta) = \arg\min_\theta J_{LASSO}(\theta) \quad \text{where } p(\theta) \sim \text{Laplace}(0, f(\lambda))

Page 62:

Ridge Regression vs Lasso

Ridge Regression:          Lasso:

[Figure: level sets of J(β) in the (β1, β2) plane meeting the constraint regions: βs with constant L2 norm (ridge) vs. βs with constant L1 norm (lasso)]

Lasso (L1 penalty) results in sparse solutions - vectors with more zero coordinates. Good for high-dimensional problems - don't have to store all coordinates!

© Eric Xing @ CMU, 2006-2011   62

Page 63:

Optimization for LASSO

63

Can we apply SGD to the LASSO learning problem?

J_{LASSO}(\theta) = J(\theta) + \lambda \|\theta\|_1 = \frac{1}{2} \sum_{i=1}^{N} (\theta^T \mathbf{x}^{(i)} - y^{(i)})^2 + \lambda \sum_{k=1}^{K} |\theta_k|

\theta_{MAP} = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta(y^{(i)} \mid \mathbf{x}^{(i)}) + \log p(\theta) = \arg\min_\theta J_{LASSO}(\theta)

Page 64:

Optimization for LASSO

64

• Consider the absolute value function:

r(\theta) = \lambda \sum_{k=1}^{K} |\theta_k|

• What does a small step opposite the gradient accomplish…
  - … far from zero?
  - … close to zero?

• The L1 penalty is subdifferentiable (i.e. not differentiable at 0)

Page 65:

Optimization for LASSO

• The L1 penalty is subdifferentiable (i.e. not differentiable at 0)

• An array of optimization algorithms exist to handle this issue:
  - Coordinate Descent
  - Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) (Andrew & Gao, 2007) and provably convergent variants
  - Block Coordinate Descent (Tseng & Yun, 2009)
  - Sparse Reconstruction by Separable Approximation (SpaRSA) (Wright et al., 2009)
  - Fast Iterative Shrinkage Thresholding Algorithm (FISTA) (Beck & Teboulle, 2009); a minimal proximal-step sketch follows below

65  
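One common way to handle the non-differentiable L1 term is a proximal (shrinkage) step: take a gradient step on the smooth squared-error part J(θ), then soft-threshold each coordinate toward zero. The sketch below (an illustration, not the course's reference implementation) is the basic ISTA iteration that FISTA accelerates; the step size lr is an assumed constant:

import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1: shrink each coordinate toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_lasso(X, y, lam, lr=1e-3, iters=1000):
    """Minimize 1/2 * ||X theta - y||^2 + lam * ||theta||_1 by proximal gradient steps."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)                    # gradient of the smooth part
        theta = soft_threshold(theta - lr * grad, lr * lam)
    return theta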

Page 66:

Data Set Size: 9th Order Polynomial

Slide  courtesy  of  William  Cohen  

Page 67:

Locally weighted linear regression

• The algorithm: Instead of minimizing

J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (\mathbf{x}_i^T \theta - y_i)^2

we now fit θ to minimize

J(\theta) = \frac{1}{2} \sum_{i=1}^{n} w_i (\mathbf{x}_i^T \theta - y_i)^2

Where do the wi's come from?

w_i = \exp\left( -\frac{(\mathbf{x}_i - \mathbf{x})^2}{2\tau^2} \right)

• where x is the query point for which we'd like to know its corresponding y

→ Essentially we put higher weights on (errors on) training examples that are close to the query point (than on those that are further away from the query)

© Eric Xing @ CMU, 2006-2011   67
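A compact sketch of locally weighted linear regression (illustrative, not from the slides): for each query point, compute the Gaussian weights, solve the weighted normal equations, and predict:

import numpy as np

def locally_weighted_predict(X, y, x_query, tau):
    """Predict y at x_query by weighted least squares with Gaussian weights."""
    # weights w_i = exp(-||x_i - x_query||^2 / (2 tau^2))
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted normal equations
    return x_query @ theta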

Page 68:

Parametric vs. non-parametric

• Locally weighted linear regression is the second example we are running into of a non-parametric algorithm. (What is the first?)

• The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm
  - because it has a fixed, finite number of parameters (the θ), which are fit to the data;
  - once we've fit the θ and stored them away, we no longer need to keep the training data around to make future predictions.
  - In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around.

• The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis grows linearly with the size of the training set.

© Eric Xing @ CMU, 2006-2011   68

Page 69:

Robust Regression

• The best fit from a quadratic regression

• But this is probably better …

How can we do this?

© Eric Xing @ CMU, 2006-2011   69

Page 70:

LOESS-based Robust Regression

• Remember what we do in "locally weighted linear regression"? → we "score" each point for its importance

• Now we score each point according to its "fitness"

(Courtesy of Andrew Moore)

© Eric Xing @ CMU, 2006-2011   70

Page 71:

Robust regression

• For k = 1 to R:
  - Let (xk, yk) be the kth datapoint
  - Let yk^est be the predicted value of yk
  - Let wk be a weight for data point k that is large if the data point fits well and small if it fits badly:

w_k = \phi\left( (y_k - y_k^{est})^2 \right)

• Then redo the regression using weighted data points.

• Repeat the whole thing until converged!

© Eric Xing @ CMU, 2006-2011   71
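A sketch of this iterative reweighting loop (an illustration only; the slides do not fix a particular weighting function φ, so the Gaussian-style weight with scale c below is an assumed choice):

import numpy as np

def robust_regression(X, y, c=1.0, rounds=10):
    """Iteratively reweighted least squares: downweight points with large residuals."""
    w = np.ones(len(y))
    theta = None
    for _ in range(rounds):
        W = np.diag(w)
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted least squares fit
        residuals = y - X @ theta
        w = np.exp(-(residuals ** 2) / (2 * c ** 2))        # assumed phi: small weight for bad fits
    return theta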

Page 72:

Robust regression - probabilistic interpretation

• What regular regression does:

Assume yk was originally generated using the following recipe:

y_k = \theta^T \mathbf{x}_k + \mathcal{N}(0, \sigma^2)

Computational task is to find the Maximum Likelihood estimation of θ.

© Eric Xing @ CMU, 2006-2011   72

Page 73:

Robust regression - probabilistic interpretation

• What LOESS robust regression does:

Assume yk was originally generated using the following recipe:

with probability p:  y_k = \theta^T \mathbf{x}_k + \mathcal{N}(0, \sigma^2)

but otherwise:  y_k \sim \mathcal{N}(\mu, \sigma_{huge}^2)

Computational task is to find the Maximum Likelihood estimates of θ, p, µ, and σhuge.

• The algorithm you saw with iterative reweighting/refitting does this computation for us. Later you will find that it is an instance of the famous E.M. algorithm.

© Eric Xing @ CMU, 2006-2011   73

Page 74:

Summary  

•  Linear  regression  predicts  its  output  as  a  linear  function  of  its  inputs  

•  Learning  optimizes  a  function  (equivalently  likelihood  or  mean  squared  error)  using  standard  techniques  (gradient  descent,  SGD,  closed  form)  

•  Regularization  enables  shrinkage  and  model  selection  

© Eric Xing @ CMU, 2005-2015   74