
Linear  Regression  

1  

Matt  Gormley  Lecture  4  

September  19,  2016    

School of Computer Science

Readings: Bishop 3.1; Murphy 7

10-701 Introduction to Machine Learning

Reminders  

•  Homework  1:    – due  9/26/16  

•  Project  Proposal:  – due  10/3/16  – start  early!  

 

2  


Outline

•  Linear Regression
   –  Simple example
   –  Model
•  Learning (aka. Least Squares)
   –  Gradient Descent
   –  SGD (aka. Least Mean Squares (LMS))
   –  Closed Form (aka. Normal Equations)
•  Advanced Topics
   –  Geometric and Probabilistic Interpretation of LMS
   –  L2 Regularization (aka. Ridge Regression)
   –  L1 Regularization (aka. LASSO)
   –  Features (aka. non-linear basis functions)
   –  Locally-Weighted Linear Regression
   –  Robust Regression

  4  

Linear regression in 1D

•  Our goal is to estimate w from training data of (x_i, y_i) pairs
•  Optimization goal: minimize the squared error (least squares):

   $w^* = \arg\min_w \sum_i (y_i - w x_i)^2$

•  Why least squares?
   –  minimizes the squared distance between the measurements and the predicted line
   –  has a nice probabilistic interpretation
   –  the math is pretty (see HW)

Model: y = wx + ε

Slide  courtesy  of  William  Cohen  

Solving Linear Regression in 1D

•  To optimize – closed form:
•  We just take the derivative w.r.t. w and set it to 0:

   $\frac{\partial}{\partial w}\sum_i (y_i - w x_i)^2 = \sum_i 2(-x_i)(y_i - w x_i)$

   $\Rightarrow\; 2\sum_i x_i (y_i - w x_i) = 0$

   $\Rightarrow\; 2\sum_i x_i y_i - 2\sum_i w x_i x_i = 0$

   $\Rightarrow\; \sum_i x_i y_i = w\sum_i x_i^2$

   $\Rightarrow\; w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$

Slide  courtesy  of  William  Cohen  
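A minimal NumPy sketch of this closed-form 1D estimator (not from the slides; the synthetic data simply mirrors the regression examples that follow, with w = 2 and Gaussian noise):

import numpy as np

def fit_1d_no_bias(x, y):
    # Closed-form least squares for y = w*x (no intercept): w = sum(x*y) / sum(x^2)
    return np.dot(x, y) / np.dot(x, x)

# Tiny synthetic check: data generated with w = 2 plus Gaussian noise (std = 1).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 2.0 * x + rng.normal(0, 1, size=100)
print(fit_1d_no_bias(x, y))  # recovered w should be close to 2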

Linear regression in 1D

•  Given an input x we would like to compute an output y

•  In linear regression we assume that y and x are related with the following equation:

   y = wx + ε

   where w is a parameter and ε represents measurement error or other noise

[Figure: scatter plot of observed values (X) against what we are trying to predict (Y), with the fitted line]

Slide  courtesy  of  William  Cohen  

Regression example

•  Generated: w=2 •  Recovered: w=2.03 •  Noise: std=1

Slide  courtesy  of  William  Cohen  

Regression example

•  Generated: w=2 •  Recovered: w=2.05 •  Noise: std=2

Slide  courtesy  of  William  Cohen  

Regression example

•  Generated: w=2 •  Recovered: w=2.08 •  Noise: std=4

Slide  courtesy  of  William  Cohen  

Bias term

•  So far we assumed that the line passes through the origin
•  What if the line does not?
•  No problem, simply change the model to y = w0 + w1x + ε
•  Can use least squares to determine w0, w1:

   $w_0 = \frac{\sum_i (y_i - w_1 x_i)}{n}$        $w_1 = \frac{\sum_i x_i (y_i - w_0)}{\sum_i x_i^2}$

Slide  courtesy  of  William  Cohen  
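A hedged sketch of fitting the bias-term model: rather than iterating the two coupled expressions above, this solves the equivalent least-squares problem directly by adding a constant column to the design matrix (function and variable names are illustrative):

import numpy as np

def fit_1d_with_bias(x, y):
    # Least squares for y = w0 + w1*x, via the design matrix [1, x].
    X = np.column_stack([np.ones_like(x), x])       # constant column handles the bias w0
    (w0, w1), *_ = np.linalg.lstsq(X, y, rcond=None)
    return w0, w1

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=200)
y = 1.5 + 2.0 * x + rng.normal(0, 1, size=200)      # true w0 = 1.5, w1 = 2.0
print(fit_1d_with_bias(x, y))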

Linear  Regression  

12  

Data:  Inputs  are  continuous  vectors  of  length  K.  Outputs  are  continuous  scalars.  

$\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^N$ where $x^{(i)} \in \mathbb{R}^K$ and $y^{(i)} \in \mathbb{R}$

What are some example problems of this form?

Linear  Regression  

13  

Data:  Inputs  are  continuous  vectors  of  length  K.  Outputs  are  continuous  scalars.  

$\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^N$ where $x^{(i)} \in \mathbb{R}^K$ and $y^{(i)} \in \mathbb{R}$

Prediction: Output is a linear function of the inputs:
$y = h_\theta(x) = \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_K x_K$
$y = h_\theta(x) = \theta^T x$   (we assume $x_1$ is 1, so $\theta_1$ plays the role of the bias term)

Learning: finds the parameters that minimize some objective function:
$\theta^* = \arg\min_\theta J(\theta)$


Least  Squares  

15  

Learning: finds the parameters that minimize some objective function:
$\theta^* = \arg\min_\theta J(\theta)$

We minimize the sum of the squares:
$J(\theta) = \frac{1}{2}\sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})^2$

Why?
1.  Reduces the distance between the true measurements and the predicted hyperplane (a line in 1D)
2.  Has a nice probabilistic interpretation

This is a very general optimization setup. We could solve it in lots of ways. Today, we'll consider three ways.

Least  Squares  

16  

Learning: Three approaches to solving $\theta^* = \arg\min_\theta J(\theta)$

Approach 1: Gradient Descent
(take larger – more certain – steps opposite the gradient)

Approach 2: Stochastic Gradient Descent (SGD)
(take many small steps opposite the gradient)

Approach 3: Closed Form
(set derivatives equal to zero and solve for parameters)

Learning  as  optimization  

•  The main idea today
   –  Make two assumptions:
      •  the classifier is linear, like naïve Bayes – i.e. roughly of the form fw(x) = sign(x · w)
      •  we want to find the classifier w that “fits the data best”
   –  Formalize as an optimization problem
      •  Pick a loss function J(w,D), and find argminw J(w)
      •  OR: Pick a “goodness” function, often Pr(D|w), and find argmaxw fD(w)

17  Slide  courtesy  of  William  Cohen  

Learning as optimization: warmup

Find: the optimal (MLE) parameter θ of a binomial
Dataset: D = {x1, …, xn}, where each xi is 0 or 1 and k of them are 1

$P(D\mid\theta) = \text{const}\cdot\prod_i \theta^{x_i}(1-\theta)^{1-x_i} = \text{const}\cdot\theta^k(1-\theta)^{n-k}$

$\frac{d}{d\theta}P(D\mid\theta) \propto \frac{d}{d\theta}\Big(\theta^k(1-\theta)^{n-k}\Big)$
$= \Big(\frac{d}{d\theta}\theta^k\Big)(1-\theta)^{n-k} + \theta^k\,\frac{d}{d\theta}(1-\theta)^{n-k}$
$= k\theta^{k-1}(1-\theta)^{n-k} + \theta^k(n-k)(1-\theta)^{n-k-1}(-1)$
$= \theta^{k-1}(1-\theta)^{n-k-1}\big(k(1-\theta) - \theta(n-k)\big)$

Learning as optimization: warmup

Goal: Find the best parameter θ of a binomial
Dataset: D = {x1, …, xn}, where each xi is 0 or 1 and k of them are 1

Setting the derivative to zero: the roots θ = 0 and θ = 1 are ruled out, leaving
$k(1-\theta) - \theta(n-k) = 0 \;\Rightarrow\; k - k\theta - n\theta + k\theta = 0 \;\Rightarrow\; n\theta = k \;\Rightarrow\; \theta = k/n$

Slide  courtesy  of  William  Cohen  
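A tiny numerical sanity check of this result (not part of the lecture; the grid search below just confirms that the log-likelihood peaks at k/n):

import numpy as np

# The binomial likelihood theta^k (1-theta)^(n-k) should be maximized at theta = k/n.
n, k = 20, 7
thetas = np.linspace(1e-6, 1 - 1e-6, 100001)
log_lik = k * np.log(thetas) + (n - k) * np.log(1 - thetas)
print(thetas[np.argmax(log_lik)], k / n)   # both approximately 0.35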

Learning  as  optimization  with  gradient  ascent  

•  Goal: Learn the parameter w of …
•  Dataset: D = {(x1,y1), …, (xn,yn)}
•  Use your model to define
   –  Pr(D|w) = ….
•  Set w to maximize the likelihood
   –  Usually we use numeric methods to find the optimum
   –  i.e., gradient ascent: repeatedly take a small step in the direction of the gradient

   

20  

Difference  between  ascent  and  descent  is  only  the  sign:    I’m  going  to  be  sloppy  here  about  that.  

Slide  courtesy  of  William  Cohen  

Gradient ascent

21

To find argmaxx f(x) (gradient ascent):
•  Start with x0
•  For t = 1, …
   •  xt+1 = xt + λ f′(xt), where λ is small

Slide  courtesy  of  William  Cohen  

Gradient descent

22

Likelihood:  ascent    Loss:  descent  

Slide  courtesy  of  William  Cohen  

Pros and cons of gradient descent

•  Simple and often quite effective on ML tasks •  Often very scalable •  Only applies to smooth functions (differentiable) •  Might find a local minimum, rather than a global

one

23 Slide  courtesy  of  William  Cohen  

Pros and cons of gradient descent

24

There is only one local optimum if the function is convex

Slide  courtesy  of  William  Cohen  

Gradient  Descent  

25  

Algorithm 1 Gradient Descent

1: procedure GD(D, θ(0))
2:   θ ← θ(0)
3:   while not converged do
4:     θ ← θ − λ ∇θ J(θ)
5:   return θ

In order to apply GD to Linear Regression all we need is the gradient of the objective function (i.e. the vector of partial derivatives):

$\nabla_\theta J(\theta) = \begin{bmatrix} \frac{d}{d\theta_1}J(\theta) \\ \frac{d}{d\theta_2}J(\theta) \\ \vdots \\ \frac{d}{d\theta_K}J(\theta) \end{bmatrix}$

Gradient  Descent  

26  

There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient is below some small tolerance:

$||\nabla_\theta J(\theta)||_2 \le \epsilon$

Alternatively we could check that the reduction in the objective function from one iteration to the next is small.
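A minimal NumPy sketch of Algorithm 1 applied to the linear regression objective, using the gradient-norm convergence test just described (the automatic step-size choice 1/||X||² is an assumption for stability, not something specified in the slides):

import numpy as np

def gd_linreg(X, y, lr=None, tol=1e-8, max_iter=10000):
    # Batch gradient descent for J(theta) = 0.5 * sum_i (theta^T x_i - y_i)^2.
    if lr is None:
        lr = 1.0 / np.linalg.norm(X, ord=2) ** 2      # conservative step size: 1 / lambda_max(X^T X)
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = X.T @ (X @ theta - y)                  # full-batch gradient of the objective
        if np.linalg.norm(grad) <= tol:               # convergence check: ||grad||_2 <= tol
            break
        theta -= lr * grad                            # step opposite the gradient
    return theta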

Stochastic  Gradient  Descent  (SGD)  

27  

Algorithm 2 Stochastic Gradient Descent (SGD)

1: procedure SGD(D, θ(0))
2:   θ ← θ(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, …, N}) do
5:       θ ← θ − λ ∇θ J(i)(θ)
6:   return θ

We need a per-example objective:
Let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$, where $J^{(i)}(\theta) = \frac{1}{2}(\theta^T x^{(i)} - y^{(i)})^2$.

Applied  to  Linear  Regression,  SGD  is  called  the  Least  Mean  Squares  (LMS)  algorithm  

Stochastic  Gradient  Descent  (SGD)  

We need a per-example objective:
Let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$, where $J^{(i)}(\theta) = \frac{1}{2}(\theta^T x^{(i)} - y^{(i)})^2$.

Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm.

Algorithm 2 Stochastic Gradient Descent (SGD), written coordinate-wise

1: procedure SGD(D, θ(0))
2:   θ ← θ(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, …, N}) do
5:       for k ∈ {1, 2, …, K} do
6:         θk ← θk − λ (d/dθk) J(i)(θ)
7:   return θ


Let’s  start  by  calculating  this  partial  derivative  for  the  Linear  Regression  objective  function.  

Partial  Derivatives  for  Linear  Reg.  

30  

Let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$, where $J^{(i)}(\theta) = \frac{1}{2}(\theta^T x^{(i)} - y^{(i)})^2$.

$\frac{d}{d\theta_k} J^{(i)}(\theta) = \frac{d}{d\theta_k}\,\frac{1}{2}(\theta^T x^{(i)} - y^{(i)})^2$
$\qquad = \frac{1}{2}\,\frac{d}{d\theta_k}(\theta^T x^{(i)} - y^{(i)})^2$
$\qquad = (\theta^T x^{(i)} - y^{(i)})\,\frac{d}{d\theta_k}(\theta^T x^{(i)} - y^{(i)})$
$\qquad = (\theta^T x^{(i)} - y^{(i)})\,\frac{d}{d\theta_k}\Big(\sum_{j=1}^K \theta_j x^{(i)}_j - y^{(i)}\Big)$
$\qquad = (\theta^T x^{(i)} - y^{(i)})\,x^{(i)}_k$


Partial  Derivatives  for  Linear  Reg.  

32  

Let $J(\theta) = \sum_{i=1}^N J^{(i)}(\theta)$, where $J^{(i)}(\theta) = \frac{1}{2}(\theta^T x^{(i)} - y^{(i)})^2$.

Used by SGD (aka. LMS):
$\frac{d}{d\theta_k} J^{(i)}(\theta) = (\theta^T x^{(i)} - y^{(i)})\,x^{(i)}_k$

Used by Gradient Descent:
$\frac{d}{d\theta_k} J(\theta) = \frac{d}{d\theta_k}\sum_{i=1}^N J^{(i)}(\theta) = \sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})\,x^{(i)}_k$

Least  Mean  Squares  (LMS)  

33  

Applied  to  Linear  Regression,  SGD  is  called  the  Least  Mean  Squares  (LMS)  algorithm  

Algorithm 3 Least Mean Squares (LMS)

1: procedure LMS(D, θ(0))
2:   θ ← θ(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, …, N}) do
5:       for k ∈ {1, 2, …, K} do
6:         θk ← θk − λ (θ^T x(i) − y(i)) x(i)k
7:   return θ
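A short NumPy sketch of the LMS/SGD update above, vectorized over the coordinates k (a fixed epoch budget stands in for the "while not converged" test; names and defaults are illustrative):

import numpy as np

def lms(X, y, lr=0.01, n_epochs=50, seed=0):
    # SGD for linear regression (LMS): one parameter update per training example.
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):           # i in shuffle({1, ..., N})
            residual = X[i] @ theta - y[i]          # theta^T x_i - y_i
            theta -= lr * residual * X[i]           # theta_k <- theta_k - lr * residual * x_ik, for all k
    return theta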

Least  Squares  

34  

Learning: Three approaches to solving $\theta^* = \arg\min_\theta J(\theta)$

Approach 1: Gradient Descent (take larger – more certain – steps opposite the gradient)
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)

Background: Matrix Derivatives

•  For $f : \mathbb{R}^{m\times n} \to \mathbb{R}$, define:
   $\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$

•  Trace:
   $\mathrm{tr}\,A = \sum_{i=1}^n A_{ii}$,  $\mathrm{tr}\,a = a$,  $\mathrm{tr}\,ABC = \mathrm{tr}\,CAB = \mathrm{tr}\,BCA$

•  Some facts of matrix derivatives (without proof):
   $\nabla_A\,\mathrm{tr}\,AB = B^T$,  $\nabla_A\,\mathrm{tr}\,ABA^TC = CAB + C^TAB^T$,  $\nabla_A\,|A| = |A|(A^{-1})^T$

© Eric Xing @ CMU, 2006-2011

The normal equations

•  Write the cost function in matrix form:
   $J(\theta) = \frac{1}{2}\sum_{i=1}^n (x_i^T\theta - y_i)^2 = \frac{1}{2}(X\theta - \vec{y})^T(X\theta - \vec{y}) = \frac{1}{2}\big(\theta^T X^T X\theta - \theta^T X^T\vec{y} - \vec{y}^T X\theta + \vec{y}^T\vec{y}\big)$

   where the training inputs are stacked as the rows of X and the outputs form the vector $\vec{y}$:
   $X = \begin{bmatrix} -\,x_1^T\,- \\ -\,x_2^T\,- \\ \vdots \\ -\,x_n^T\,- \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$

•  To minimize J(θ), take the derivative and set it to zero (using the trace identities above):
   $\nabla_\theta J = \frac{1}{2}\nabla_\theta\,\mathrm{tr}\big(\theta^T X^T X\theta - \theta^T X^T\vec{y} - \vec{y}^T X\theta + \vec{y}^T\vec{y}\big) = \frac{1}{2}\big(X^T X\theta + X^T X\theta - 2X^T\vec{y}\big) = X^T X\theta - X^T\vec{y} = 0$

   $\Rightarrow\; X^T X\theta = X^T\vec{y}$   (the normal equations)

   $\Rightarrow\; \theta^* = (X^T X)^{-1}X^T\vec{y}$

© Eric Xing @ CMU, 2006-2011

Comments on the normal equation

•  In most situations of practical interest, the number of data points N is larger than the dimensionality k of the input space and the matrix X is of full column rank. If this condition holds, then it is easy to verify that XTX is necessarily invertible.

•  The assumption that XTX is invertible implies that it is positive definite, thus the critical point we have found is a minimum.

•  What if X has less than full column rank? → regularization (later).

© Eric Xing @ CMU, 2006-2011 37
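A minimal NumPy sketch of the closed-form solution (solving the linear system rather than forming the inverse explicitly; the synthetic data is only there to show the call):

import numpy as np

def normal_equations(X, y):
    # Closed-form least squares: solve X^T X theta = X^T y.
    # If X is rank-deficient, use np.linalg.lstsq or add regularization instead.
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(0, 0.1, size=100)
print(normal_equations(X, y))   # close to theta_true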

Direct and Iterative methods

•  Direct methods: we can achieve the solution in a single step by solving the normal equations
   –  Using Gaussian elimination or QR decomposition, we converge in a finite number of steps
   –  It can be infeasible when data are streaming in real time, or are of a very large amount

•  Iterative methods: stochastic or steepest gradient descent
   –  Converge in a limiting sense
   –  But more attractive for large practical problems
   –  Caution is needed when deciding the learning rate α

© Eric Xing @ CMU, 2006-2011 38

Convergence rate

•  Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations, if the learning rate is sufficiently small.

•  A formal analysis of LMS needs more math; in practice, one can use a small α, or gradually decrease α.

© Eric Xing @ CMU, 2006-2011 39

Convergence Curves

•  For the batch method, the training MSE is initially large due to the uninformed initialization

•  In the online update, the N updates per epoch reduce the MSE to a much smaller value

40 © Eric Xing @ CMU, 2006-2011

Least  Squares  

41  

Learning: Three approaches to solving $\theta^* = \arg\min_\theta J(\theta)$

Approach 1: Gradient Descent (take larger – more certain – steps opposite the gradient)
•  pros: conceptually simple, guaranteed convergence
•  cons: batch, often slow to converge

Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
•  pros: memory efficient, fast convergence, less prone to local optima
•  cons: convergence in practice requires tuning and fancier variants

Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)
•  pros: one shot algorithm!
•  cons: does not scale to large datasets (matrix inverse is the bottleneck)

Geometric Interpretation of LMS

•  The predictions on the training data are (with X and $\vec{y}$ stacked as before):
   $\hat{y} = X\theta^* = X(X^TX)^{-1}X^T\vec{y}$

•  Note that
   $\hat{y} - \vec{y} = \big(X(X^TX)^{-1}X^T - I\big)\vec{y}$
   and
   $X^T(\hat{y} - \vec{y}) = X^T\big(X(X^TX)^{-1}X^T - I\big)\vec{y} = \big(X^TX(X^TX)^{-1}X^T - X^T\big)\vec{y} = 0$

   so $\hat{y}$ is the orthogonal projection of $\vec{y}$ into the space spanned by the columns of X.

© Eric Xing @ CMU, 2006-2011
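A quick numerical check of this orthogonality property (random data, purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ theta_star                 # projection of y onto the column space of X

# The residual is orthogonal to every column of X: all entries ~0 up to floating-point error.
print(X.T @ (y_hat - y))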

Probabilistic Interpretation of LMS

•  Let us assume that the target variable and the inputs are related by the equation:
   $y_i = \theta^T x_i + \varepsilon_i$
   where ε is an error term of unmodeled effects or random noise

•  Now assume that ε follows a Gaussian N(0, σ). Then we have:
   $p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}\right)$

•  By the independence assumption:
   $L(\theta) = \prod_{i=1}^n p(y_i \mid x_i; \theta) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{\!n} \exp\!\left(-\frac{\sum_{i=1}^n (y_i - \theta^T x_i)^2}{2\sigma^2}\right)$

© Eric Xing @ CMU, 2006-2011

Probabilistic Interpretation of LMS

•  Hence the log-likelihood is:
   $\ell(\theta) = n\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^n (y_i - \theta^T x_i)^2$

•  Do you recognize the last term? Yes, it is:
   $J(\theta) = \frac{1}{2}\sum_{i=1}^n (y_i - \theta^T x_i)^2$

•  Thus, under the independence assumption, LMS is equivalent to MLE of θ!

© Eric Xing @ CMU, 2006-2011

Non-Linear basis function

•  So far we only used the observed values x1, x2, …
•  However, linear regression can be applied in the same way to functions of these values
   –  E.g.: to add a term w·x1x2, add a new variable z = x1x2 so each example becomes: x1, x2, …, z
•  As long as these functions can be directly computed from the observed values, the parameters are still linear in the data and the problem remains a multi-variate linear regression problem

   $y = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_k x_k + \varepsilon$

Slide  courtesy  of  William  Cohen  

Non-linear basis functions

•  What type of functions can we use?
•  A few common examples:
   –  Polynomial: $\phi_j(x) = x^j$ for j = 0 … n
   –  Gaussian: $\phi_j(x) = \exp\!\left(-\frac{(x-\mu_j)^2}{2\sigma_j^2}\right)$
   –  Sigmoid: $\phi_j(x) = \frac{1}{1+\exp(-s_j x)}$
   –  Logs: $\phi_j(x) = \log(x+1)$

Any function of the input values can be used. The solution for the parameters of the regression remains the same.

Slide  courtesy  of  William  Cohen  

General linear regression problem

•  Using our new notation for the basis functions, linear regression can be written as
   $y = \sum_{j=0}^n w_j\,\phi_j(x)$
•  where φj(x) can be either xj for multivariate regression or one of the non-linear basis functions we defined
•  … and φ0(x) = 1 for the intercept term

Slide  courtesy  of  William  Cohen  
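A small sketch of basis-expansion regression: the features are non-linear in x, but the fit is still ordinary least squares in the weights w (the polynomial basis and the sin-shaped toy data are illustrative choices):

import numpy as np

def poly_design_matrix(x, degree):
    # Map a 1-D input to polynomial features phi_j(x) = x**j, j = 0..degree (column 0 is the intercept).
    return np.vander(x, degree + 1, increasing=True)

def fit_basis_regression(x, y, degree):
    # Least squares on the expanded features; the model stays linear in w.
    Phi = poly_design_matrix(x, degree)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=30)
print(fit_basis_regression(x, y, degree=3))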

An example: polynomial basis vectors on a small dataset

– From Bishop Ch 1

Slide  courtesy  of  William  Cohen  

0th Order Polynomial

n=10 Slide  courtesy  of  William  Cohen  

1st Order Polynomial

Slide  courtesy  of  William  Cohen  

3rd Order Polynomial

Slide  courtesy  of  William  Cohen  

9th Order Polynomial

Slide  courtesy  of  William  Cohen  

Over-fitting

Root-Mean-Square (RMS) Error:

Slide  courtesy  of  William  Cohen  

Polynomial Coefficients

Slide  courtesy  of  William  Cohen  

Regularization

Penalize large coefficient values:

$J_{X,y}(w) = \frac{1}{2}\sum_i \Big(y_i - \sum_j w_j\phi_j(x_i)\Big)^2 + \frac{\lambda}{2}\|w\|^2$

Slide  courtesy  of  William  Cohen  

Regularization:

Slide  courtesy  of  William  Cohen  

Polynomial Coefficients

(coefficient magnitudes under regularization strength: none, exp(18), huge)

Slide  courtesy  of  William  Cohen  

Over Regularization:

Slide  courtesy  of  William  Cohen  

Example: Stock Prices

•  Suppose we wish to predict Google’s stock price at time t+1
•  What features should we use? (putting all computational concerns aside)
   –  Stock prices of all other stocks at times t, t-1, t-2, …, t-k
   –  Mentions of Google with positive / negative sentiment words in all newspapers and social media outlets
•  Do we believe that all of these features are going to be useful?

59  

Ridge  Regression  

•  Adds an L2 regularizer to Linear Regression:
   $J_{RR}(\theta) = J(\theta) + \lambda\|\theta\|_2^2 = \frac{1}{2}\sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})^2 + \lambda\sum_{k=1}^K \theta_k^2$

•  Bayesian interpretation: MAP estimation with a Gaussian prior on the parameters:
   $\theta_{MAP} = \arg\max_\theta \sum_{i=1}^N \log p_\theta(y^{(i)}\mid x^{(i)}) + \log p(\theta) = \arg\min_\theta J_{RR}(\theta)$,
   where $p(\theta) \sim \mathcal{N}(0, \tfrac{1}{\lambda})$
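A hedged closed-form sketch consistent with J_RR above (the factor of 2 on λ simply reflects how λ multiplies the un-halved penalty in this objective; other texts fold it into λ):

import numpy as np

def ridge_closed_form(X, y, lam):
    # Minimize 0.5*||X theta - y||^2 + lam*||theta||^2.
    # Setting the gradient to zero gives (X^T X + 2*lam*I) theta = X^T y.
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + 2.0 * lam * np.eye(K), X.T @ y)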

LASSO

•  Adds an L1 regularizer to Linear Regression:
   $J_{LASSO}(\theta) = J(\theta) + \lambda\|\theta\|_1 = \frac{1}{2}\sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})^2 + \lambda\sum_{k=1}^K |\theta_k|$

•  Bayesian interpretation: MAP estimation with a Laplace prior on the parameters:
   $\theta_{MAP} = \arg\max_\theta \sum_{i=1}^N \log p_\theta(y^{(i)}\mid x^{(i)}) + \log p(\theta) = \arg\min_\theta J_{LASSO}(\theta)$,
   where $p(\theta) \sim \mathrm{Laplace}(0, f(\lambda))$

Ridge Regression vs Lasso

[Figure: level sets of J(β) (ellipses) meeting the constraint regions in the (β1, β2) plane – βs with constant ℓ2 norm (circle, Ridge) vs. βs with constant ℓ1 norm (diamond, Lasso)]

Lasso (ℓ1 penalty) results in sparse solutions – a vector with more zero coordinates.
Good for high-dimensional problems – we don’t have to store all coordinates!

© Eric Xing @ CMU, 2006-2011

Optimization  for  LASSO  

Can  we  apply  SGD  to  the  LASSO  learning  problem?  

63  

$J_{LASSO}(\theta) = J(\theta) + \lambda\|\theta\|_1 = \frac{1}{2}\sum_{i=1}^N (\theta^T x^{(i)} - y^{(i)})^2 + \lambda\sum_{k=1}^K |\theta_k|$

Optimization  for  LASSO  

•  Consider the absolute value function in the regularizer:
   $r(\theta) = \lambda\sum_{k=1}^K |\theta_k|$

•  What does a small step opposite the gradient accomplish…
   –  … far from zero?
   –  … close to zero?

•  The L1 penalty is subdifferentiable (i.e. not differentiable at 0)

Optimization for LASSO

•  The L1 penalty is subdifferentiable (i.e. not differentiable at 0)
•  An array of optimization algorithms exists to handle this issue (a minimal soft-thresholding sketch follows below):
   –  Coordinate Descent
   –  Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) (Andrew & Gao, 2007) and provably convergent variants
   –  Block Coordinate Descent (Tseng & Yun, 2009)
   –  Sparse Reconstruction by Separable Approximation (SpaRSA) (Wright et al., 2009)
   –  Fast Iterative Shrinkage Thresholding Algorithm (FISTA) (Beck & Teboulle, 2009)

65  
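The common building block behind several of these methods is the soft-thresholding (proximal) step. Below is a sketch of plain ISTA, the unaccelerated precursor of the FISTA method listed above (the step size and iteration count are illustrative defaults):

import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t*||.||_1: shrink each coordinate toward zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_lasso(X, y, lam, step=None, n_iter=500):
    # Proximal gradient (ISTA) for 0.5*||X theta - y||^2 + lam*||theta||_1.
    if step is None:
        step = 1.0 / np.linalg.norm(X, ord=2) ** 2    # 1/L, with L the Lipschitz constant of the smooth part
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)                  # gradient of the smooth squared-error term only
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta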

Data Set Size: 9th Order Polynomial

Slide  courtesy  of  William  Cohen  

Locally weighted linear regression

•  The algorithm: Instead of minimizing
   $J(\theta) = \frac{1}{2}\sum_{i=1}^n (y_i - \theta^T x_i)^2$
   we now fit θ to minimize
   $J(\theta) = \frac{1}{2}\sum_{i=1}^n w_i\,(y_i - \theta^T x_i)^2$

•  Where do the wi's come from?
   $w_i = \exp\!\left(-\frac{(x_i - x)^2}{2\tau^2}\right)$
   where x is the query point for which we'd like to know its corresponding y

→ Essentially we put higher weights on (errors on) training examples that are close to the query point (than those that are further away from the query)

© Eric Xing @ CMU, 2006-2011
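A minimal sketch of this per-query weighted fit (the bandwidth τ and the use of a dense diagonal weight matrix are choices made here only for clarity):

import numpy as np

def locally_weighted_predict(X, y, x_query, tau=0.5):
    # Weights w_i = exp(-||x_i - x_query||^2 / (2 tau^2)) emphasize training points near the query.
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # minimizes 0.5 * sum_i w_i (y_i - theta^T x_i)^2
    return x_query @ theta                              # prediction at the query point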

Parametric vs. non-parametric

•  Locally weighted linear regression is the second example we are running into of a non-parametric algorithm. (What is the first?)

•  The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm
   –  because it has a fixed, finite number of parameters (the θ), which are fit to the data;
   –  once we've fit the θ and stored them away, we no longer need to keep the training data around to make future predictions.
   –  In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around.

•  The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis grows linearly with the size of the training set.

© Eric Xing @ CMU, 2006-2011 68

Robust Regression

•  The best fit from a quadratic regression
•  But this is probably better …

© Eric Xing @ CMU, 2006-2011 69

How can we do this?

LOESS-based Robust Regression

•  Remember what we do in "locally weighted linear regression"?
   → we "score" each point for its importance
•  Now we score each point according to its "fitness"

© Eric Xing @ CMU, 2006-2011 70

(Courtesy to Andrew Moore)

Robust regression

•  For k = 1 to R:
   –  Let (xk, yk) be the kth datapoint
   –  Let yest_k be the predicted value of yk
   –  Let wk be a weight for data point k that is large if the data point fits well and small if it fits badly:
      $w_k = \phi\big((y_k - y_k^{est})^2\big)$
•  Then redo the regression using weighted data points (a minimal sketch follows below).
•  Repeat the whole thing until converged!

© Eric Xing @ CMU, 2006-2011
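A hedged sketch of this iterative reweighting loop; the lecture leaves the weighting function φ unspecified, so the Gaussian-shaped choice below (with scale c) is only one plausible instantiation:

import numpy as np

def robust_linreg(X, y, n_rounds=10, c=1.0):
    # Iteratively reweighted least squares: downweight points with large residuals.
    w = np.ones(len(y))
    theta = np.zeros(X.shape[1])
    for _ in range(n_rounds):                               # "repeat until converged" approximated by a fixed budget
        W = np.diag(w)
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted least-squares refit
        residuals = y - X @ theta
        w = np.exp(-residuals ** 2 / (2.0 * c ** 2))        # w_k = phi((y_k - y_k_est)^2): large when the fit is good
    return theta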

Robust regression—probabilistic interpretation

•  What regular regression does:

   Assume yk was originally generated using the following recipe:
   $y_k = \theta^T x_k + \mathcal{N}(0, \sigma^2)$

   Computational task is to find the Maximum Likelihood estimate of θ

© Eric Xing @ CMU, 2006-2011

Robust regression—probabilistic interpretation

•  What LOESS robust regression does:

   Assume yk was originally generated using the following recipe:
   with probability p:   $y_k = \theta^T x_k + \mathcal{N}(0, \sigma^2)$
   but otherwise:   $y_k \sim \mathcal{N}(\mu, \sigma_{huge}^2)$

   Computational task is to find the Maximum Likelihood estimates of θ, p, µ and σhuge.

•  The algorithm you saw with iterative reweighting/refitting does this computation for us. Later you will find that it is an instance of the famous E.M. algorithm.

© Eric Xing @ CMU, 2006-2011

Summary  

•  Linear  regression  predicts  its  output  as  a  linear  function  of  its  inputs  

•  Learning  optimizes  a  function  (equivalently  likelihood  or  mean  squared  error)  using  standard  techniques  (gradient  descent,  SGD,  closed  form)  

•  Regularization  enables  shrinkage  and  model  selection  

© Eric Xing @ CMU, 2005-2015