Linear Regression
1
Matt Gormley Lecture 4
September 19, 2016
School of Computer Science
Readings: Bishop, 3.1 Murphy, 7
10-701 Introduction to Machine Learning
Reminders
• Homework 1: – due 9/26/16
• Project Proposal: – due 10/3/16 – start early!
2
Outline • Linear Regression
– Simple example – Model
• Learning – Gradient Descent – SGD – Closed Form
• Advanced Topics – Geometric and Probabilistic Interpretation of LMS – L2 Regularization – L1 Regularization – Features – Locally-Weighted Linear Regression – Robust Regression
3
Outline • Linear Regression
– Simple example – Model
• Learning (aka. Least Squares) – Gradient Descent – SGD (aka. Least Mean Squares (LMS)) – Closed Form (aka. Normal Equations)
• Advanced Topics – Geometric and Probabilistic Interpretation of LMS – L2 Regularization (aka. Ridge Regression) – L1 Regularization (aka. LASSO) – Features (aka. non-linear basis functions) – Locally-Weighted Linear Regression – Robust Regression
4
• Our goal is to estimate w from training data of ⟨x_i, y_i⟩ pairs
• Optimization goal: minimize squared error (least squares):
• Why least squares?
- minimizes squared distance between measurements and predicted line
- has a nice probabilistic interpretation
- the math is pretty
Linear regression in 1D
$\arg\min_w \sum_i (y_i - w x_i)^2$
$y = wx + \varepsilon$
[Figure: data in the X–Y plane with the fitted line]
see HW
Slide courtesy of William Cohen
Solving Linear Regression in 1D
• To optimize – closed form:
• We just take the derivative w.r.t. w and set it to 0:
$\frac{\partial}{\partial w} \sum_i (y_i - w x_i)^2 = \sum_i 2(-x_i)(y_i - w x_i)$
$\Rightarrow \; 2 \sum_i x_i (y_i - w x_i) = 0$
$\Rightarrow \; 2 \sum_i x_i y_i - 2 \sum_i w x_i x_i = 0$
$\Rightarrow \; \sum_i x_i y_i = w \sum_i x_i^2$
$\Rightarrow \; w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$
Slide courtesy of William Cohen
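To make the closed-form result concrete, here is a small NumPy sketch (not from the original slides; the synthetic data and all names are illustrative):

```python
import numpy as np

# Synthetic 1D data generated as y = 2x + noise, with the line through the origin
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 2.0 * x + rng.normal(0.0, 1.0, size=100)

# Closed-form 1D least squares (no bias): w = sum(x_i * y_i) / sum(x_i^2)
w = np.sum(x * y) / np.sum(x ** 2)
print(w)  # recovered slope, close to 2
```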
Linear regression in 1D
• Given an input x we would like to compute an output y
• In linear regression we assume that y and x are related with the following equation:
y = wx+ε where w is a parameter and ε
represents measurement error or other noise
[Figure: scatter of the data in the X–Y plane; Y is what we are trying to predict, X are the observed values]
Slide courtesy of William Cohen
Regression example
• Generated: w=2 • Recovered: w=2.03 • Noise: std=1
Slide courtesy of William Cohen
Regression example
• Generated: w=2 • Recovered: w=2.05 • Noise: std=2
Slide courtesy of William Cohen
Regression example
• Generated: w=2 • Recovered: w=2.08 • Noise: std=4
Slide courtesy of William Cohen
Bias term • So far we assumed that the
line passes through the origin • What if the line does not? • No problem, simply change the
model to y = w0 + w1x+ε • Can use least squares to
determine w0 , w1
$w_0 = \frac{\sum_i (y_i - w_1 x_i)}{n}$
$w_1 = \frac{\sum_i x_i (y_i - w_0)}{\sum_i x_i^2}$
[Figure: X–Y data with a fitted line whose intercept on the Y axis is w0]
Slide courtesy of William Cohen
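The two equations above are coupled; a simple way to solve them in practice is to append a constant feature and solve the joint least-squares problem. A minimal NumPy sketch with made-up data (illustrative only):

```python
import numpy as np

# Fit y = w0 + w1*x by appending a constant feature and solving least squares
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 1.5 + 2.0 * x + rng.normal(0.0, 1.0, size=100)

X = np.column_stack([np.ones_like(x), x])  # first column of ones carries the bias w0
(w0, w1), *_ = np.linalg.lstsq(X, y, rcond=None)
print(w0, w1)  # roughly 1.5 and 2.0
```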
Linear Regression
12
Data: Inputs are continuous vectors of length K. Outputs are continuous scalars.
$\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$ where $\mathbf{x} \in \mathbb{R}^K$ and $y \in \mathbb{R}$
What are some example problems of
this form?
Linear Regression
13
Data: Inputs are continuous vectors of length K. Outputs are continuous scalars.
$\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$ where $\mathbf{x} \in \mathbb{R}^K$ and $y \in \mathbb{R}$
Prediction: Output is a linear function of the inputs.
$y = h_{\boldsymbol{\theta}}(\mathbf{x}) = \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_K x_K$
$y = h_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^T \mathbf{x}$ (We assume $x_1$ is 1)
Learning: finds the parameters that minimize some objective function.
$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
Least Squares
14
Learning: finds the parameters that minimize some objective function. We minimize the sum of the squares: Why?
1. Reduces distance between true measurements and predicted hyperplane (line in 1D)
2. Has a nice probabilistic interpretation
$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
$J(\boldsymbol{\theta}) = \frac{1}{2} \sum_{i=1}^{N} (\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})^2$
Least Squares
15
Learning: finds the parameters that minimize some objective function. We minimize the sum of the squares: Why?
1. Reduces distance between true measurements and predicted hyperplane (line in 1D)
2. Has a nice probabilistic interpretation
$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
$J(\boldsymbol{\theta}) = \frac{1}{2} \sum_{i=1}^{N} (\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})^2$
This is a very general optimization setup. We could solve it in lots of ways. Today, we'll consider three ways.
Least Squares
16
Learning: Three approaches to solving
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
Approach 1: Gradient Descent (take larger – more certain – steps opposite the gradient)
Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)
$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
Learning as optimization
• The main idea today – Make two assumptions:
• the classifier is linear, like naïve Bayes – i.e., roughly of the form $f_w(x) = \text{sign}(x \cdot w)$
• we want to find the classifier w that “fits the data best”
– Formalize as an optimization problem • Pick a loss function J(w,D), and find
$\arg\min_w J(w, D)$ • OR: Pick a “goodness” function, often $\Pr(D \mid w)$, and find the $\arg\max_w f_D(w)$
17 Slide courtesy of William Cohen
Learning as optimization: warmup
Find: optimal (MLE) parameter θ of a binomial
Dataset: D = {x1, …, xn}, xi is 0 or 1, k of them are 1
18
$P(D \mid \theta) = \text{const} \cdot \prod_i \theta^{x_i} (1-\theta)^{1-x_i} = \theta^k (1-\theta)^{n-k}$
$\frac{d}{d\theta} P(D \mid \theta) = \frac{d}{d\theta}\left( \theta^k (1-\theta)^{n-k} \right)$
$= \left[\frac{d}{d\theta}\theta^k\right](1-\theta)^{n-k} + \theta^k \frac{d}{d\theta}(1-\theta)^{n-k}$
$= k\theta^{k-1}(1-\theta)^{n-k} + \theta^k (n-k)(1-\theta)^{n-k-1}(-1)$
$= \theta^{k-1}(1-\theta)^{n-k-1}\left( k(1-\theta) - \theta(n-k) \right)$
Slide courtesy of William Cohen
Learning as optimization: warmup
Goal: Find the best parameter θ of a binomial
Dataset: D = {x1, …, xn}, xi is 0 or 1, k of them are 1
19
$\theta^{k-1}(1-\theta)^{n-k-1}\left( k(1-\theta) - \theta(n-k) \right) = 0$
$\Rightarrow \theta = 0$, $\theta = 1$, or $k - k\theta - n\theta + k\theta = 0$
$\Rightarrow n\theta = k \Rightarrow \theta = k/n$
Slide courtesy of William Cohen
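A quick numerical sanity check of the θ = k/n result (illustrative, not part of the slides):

```python
import numpy as np

# Verify numerically that theta = k/n maximizes the binomial log-likelihood
k, n = 30, 100
thetas = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(thetas) + (n - k) * np.log(1.0 - thetas)
print(thetas[np.argmax(log_lik)])  # close to k/n = 0.3
```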
Learning as optimization with gradient ascent
• Goal: Learn the parameter w of …
• Dataset: D={(x1,y1)…,(xn , yn)} • Use your model to define
– Pr(D|w) = …. • Set w to maximize Likelihood
– Usually we use numeric methods to find the optimum
– i.e., gradient ascent: repeatedly take a small step in the direction of the gradient
20
Difference between ascent and descent is only the sign: I’m going to be sloppy here about that.
Slide courtesy of William Cohen
Gradient ascent
21
To find $\arg\min_x f(x)$: • Start with $x_0$ • For $t = 1, \ldots$:
• $x_{t+1} = x_t - \lambda f'(x_t)$ where $\lambda$ is small (for gradient ascent / $\arg\max$, flip the sign to $+$)
Slide courtesy of William Cohen
Gradient descent
22
Likelihood: ascent Loss: descent
Slide courtesy of William Cohen
Pros and cons of gradient descent
• Simple and often quite effective on ML tasks • Often very scalable • Only applies to smooth functions (differentiable) • Might find a local minimum, rather than a global
one
23 Slide courtesy of William Cohen
Pros and cons of gradient descent
24
There is only one local optimum if the function is convex
Slide courtesy of William Cohen
Gradient Descent
25
Algorithm 1 Gradient Descent
1: procedure GD($\mathcal{D}$, $\boldsymbol{\theta}^{(0)}$)
2:   $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}^{(0)}$
3:   while not converged do
4:     $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \lambda \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
5:   return $\boldsymbol{\theta}$
In order to apply GD to Linear Regression all we need is the gradient of the objective function (i.e. vector of partial derivatives).
$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \begin{bmatrix} \frac{d}{d\theta_1} J(\boldsymbol{\theta}) \\ \frac{d}{d\theta_2} J(\boldsymbol{\theta}) \\ \vdots \\ \frac{d}{d\theta_K} J(\boldsymbol{\theta}) \end{bmatrix}$
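As an illustration of Approach 1, here is a minimal NumPy sketch of batch gradient descent on $J(\boldsymbol{\theta}) = \frac{1}{2}\sum_i (\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})^2$; the function name, step size, and synthetic data are made up for the example:

```python
import numpy as np

def gradient_descent(X, y, lr=0.001, n_iter=5000):
    """Batch gradient descent for J(theta) = 0.5 * sum_i (theta^T x_i - y_i)^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)  # vector of partial derivatives d/d theta_k
        theta -= lr * grad            # small step opposite the gradient
    return theta

# Usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(gradient_descent(X, y))  # roughly [1, -2, 0.5]
```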
Gradient Descent
26
Algorithm 1 Gradient Descent
1: procedure GD($\mathcal{D}$, $\boldsymbol{\theta}^{(0)}$)
2:   $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}^{(0)}$
3:   while not converged do
4:     $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \lambda \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
5:   return $\boldsymbol{\theta}$
There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient is below some small tolerance.
$||\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})||_2 \leq \epsilon$. Alternatively, we could check that the reduction in the objective function from one iteration to the next is small.
Stochastic Gradient Descent (SGD)
27
Algorithm 2 Stochastic Gradient Descent (SGD)
1: procedure SGD($\mathcal{D}$, $\boldsymbol{\theta}^{(0)}$)
2:   $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}^{(0)}$
3:   while not converged do
4:     for $i \in \text{shuffle}(\{1, 2, \ldots, N\})$ do
5:       $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \lambda \nabla_{\boldsymbol{\theta}} J^{(i)}(\boldsymbol{\theta})$
6:   return $\boldsymbol{\theta}$
We need a per-example objective:
Let $J(\boldsymbol{\theta}) = \sum_{i=1}^{N} J^{(i)}(\boldsymbol{\theta})$, where $J^{(i)}(\boldsymbol{\theta}) = \frac{1}{2}(\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})^2$.
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm
Stochastic Gradient Descent (SGD)
We need a per-example objective:
28
Let $J(\boldsymbol{\theta}) = \sum_{i=1}^{N} J^{(i)}(\boldsymbol{\theta})$, where $J^{(i)}(\boldsymbol{\theta}) = \frac{1}{2}(\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})^2$.
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm
Algorithm 2 Stochastic Gradient Descent (SGD)
1: procedure SGD($\mathcal{D}$, $\boldsymbol{\theta}^{(0)}$)
2:   $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}^{(0)}$
3:   while not converged do
4:     for $i \in \text{shuffle}(\{1, 2, \ldots, N\})$ do
5:       for $k \in \{1, 2, \ldots, K\}$ do
6:         $\theta_k \leftarrow \theta_k - \lambda \frac{d}{d\theta_k} J^{(i)}(\boldsymbol{\theta})$
7:   return $\boldsymbol{\theta}$
Stochastic Gradient Descent (SGD)
Let’s start by calculating this partial derivative for the Linear Regression objective function.
Partial Derivatives for Linear Reg.
30
$\frac{d}{d\theta_k} J^{(i)}(\boldsymbol{\theta}) = \frac{d}{d\theta_k} \frac{1}{2}(\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})^2$
$= \frac{1}{2} \frac{d}{d\theta_k} (\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})^2$
$= (\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)}) \frac{d}{d\theta_k} (\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})$
$= (\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)}) \frac{d}{d\theta_k} \left( \sum_{k'=1}^{K} \theta_{k'} x^{(i)}_{k'} - y^{(i)} \right)$
$= (\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})\, x^{(i)}_k$
where $J(\boldsymbol{\theta}) = \sum_{i=1}^{N} J^{(i)}(\boldsymbol{\theta})$ and $J^{(i)}(\boldsymbol{\theta}) = \frac{1}{2}(\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})^2$.
Partial Derivatives for Linear Reg.
32
Let $J(\boldsymbol{\theta}) = \sum_{i=1}^{N} J^{(i)}(\boldsymbol{\theta})$, where $J^{(i)}(\boldsymbol{\theta}) = \frac{1}{2}(\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})^2$.
$\frac{d}{d\theta_k} J^{(i)}(\boldsymbol{\theta}) = (\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})\, x^{(i)}_k$  (used by SGD, aka. LMS)
$\frac{d}{d\theta_k} J(\boldsymbol{\theta}) = \frac{d}{d\theta_k} \sum_{i=1}^{N} J^{(i)}(\boldsymbol{\theta}) = \sum_{i=1}^{N} (\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})\, x^{(i)}_k$  (used by Gradient Descent)
Least Mean Squares (LMS)
33
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm
Algorithm 3 Least Mean Squares (LMS)
1: procedure LMS($\mathcal{D}$, $\boldsymbol{\theta}^{(0)}$)
2:   $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}^{(0)}$
3:   while not converged do
4:     for $i \in \text{shuffle}(\{1, 2, \ldots, N\})$ do
5:       for $k \in \{1, 2, \ldots, K\}$ do
6:         $\theta_k \leftarrow \theta_k - \lambda (\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})\, x^{(i)}_k$
7:   return $\boldsymbol{\theta}$
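A minimal NumPy sketch of the LMS / SGD update above, one training example at a time (illustrative names, step size, and data; not from the slides):

```python
import numpy as np

def lms(X, y, lr=0.01, n_epochs=100, seed=0):
    """SGD for least squares (LMS update): theta_k <- theta_k - lr*(theta^T x_i - y_i)*x_ik."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):   # shuffle({1, ..., N})
            err = X[i] @ theta - y[i]       # (theta^T x_i - y_i)
            theta -= lr * err * X[i]        # updates every coordinate k at once
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(lms(X, y))  # roughly [1, -2, 0.5]
```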
Least Squares
34
Learning: Three approaches to solving
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
Approach 1: Gradient Descent (take larger – more certain – steps opposite the gradient)
Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)
$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
Background: Matrix Derivatives
• For $f : \mathbb{R}^{m \times n} \to \mathbb{R}$, define:
$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$
• Trace: $\operatorname{tr} A = \sum_{i=1}^{n} A_{ii}$, $\quad \operatorname{tr} a = a$, $\quad \operatorname{tr} ABC = \operatorname{tr} CAB = \operatorname{tr} BCA$
• Some facts of matrix derivatives (without proof):
$\nabla_A \operatorname{tr} AB = B^T$, $\quad \nabla_A \operatorname{tr} ABA^T C = CAB + C^T A B^T$, $\quad \nabla_A |A| = |A| (A^{-1})^T$
© Eric Xing @ CMU, 2006-2011 35
The normal equations
• Write the cost function in matrix form:
$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (\mathbf{x}_i^T \theta - y_i)^2 = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y}) = \frac{1}{2} \left( \theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y} \right)$
where $X = \begin{bmatrix} -\, \mathbf{x}_1^T \,- \\ -\, \mathbf{x}_2^T \,- \\ \vdots \\ -\, \mathbf{x}_n^T \,- \end{bmatrix}$ and $\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$
• To minimize J(θ), take the derivative and set it to zero:
$\nabla_\theta J = \frac{1}{2} \nabla_\theta \operatorname{tr}\left( \theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y} \right) = \frac{1}{2} \left( X^T X \theta + X^T X \theta - 2 X^T \vec{y} \right) = X^T X \theta - X^T \vec{y} = 0$
$\Rightarrow \; X^T X \theta = X^T \vec{y}$  (the normal equations)
$\Downarrow$
$\theta^* = (X^T X)^{-1} X^T \vec{y}$
© Eric Xing @ CMU, 2006-2011 36
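A minimal NumPy sketch of Approach 3 on made-up data (illustrative only; np.linalg.solve is used instead of forming the explicit inverse, which is the usual numerical practice):

```python
import numpy as np

# Solve the normal equations  X^T X theta = X^T y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_star)  # roughly [1, -2, 0.5]
```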
Comments on the normal equation
• In most situations of practical interest, the number of data points N is larger than the dimensionality k of the input space and the matrix X is of full column rank. If this condition holds, then it is easy to verify that $X^T X$ is necessarily invertible.
• The assumption that $X^T X$ is invertible implies that it is positive definite, thus the critical point we have found is a minimum.
• What if X has less than full column rank? → regularization (later).
© Eric Xing @ CMU, 2006-2011 37
Direct and Iterative methods
• Direct methods: we can achieve the solution in a single step by solving the normal equations
  • Using Gaussian elimination or QR decomposition, we converge in a finite number of steps
  • It can be infeasible when data are streaming in in real time, or when the dataset is very large
• Iterative methods: stochastic or steepest gradient
  • Converge in a limiting sense
  • But more attractive in large practical problems
  • Caution is needed for deciding the learning rate α
© Eric Xing @ CMU, 2006-2011 38
Convergence rate
• Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations if [step-size condition given as an equation on the slide, not reproduced here]
• A formal analysis of LMS needs more math; in practice, one can use a small α, or gradually decrease α.
© Eric Xing @ CMU, 2006-2011 39
Convergence Curves
• For the batch method, the training MSE is initially large due to the uninformed initialization
• In the online update, the N updates per epoch reduce the MSE to a much smaller value.
40 © Eric Xing @ CMU, 2006-2011
Least Squares
41
Learning: Three approaches to solving
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
• pros: memory efficient, fast convergence, less prone to local optima
• cons: convergence in practice requires tuning and fancier variants
Approach 1: Gradient Descent (take larger – more certain – steps opposite the gradient)
• pros: conceptually simple, guaranteed convergence • cons: batch, often slow to converge
Approach 3: Closed Form (set derivatives equal to zero and solve for parameters)
• pros: one shot algorithm! • cons: does not scale to large datasets (matrix inverse is
bottleneck)
$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
Geometric Interpretation of LMS
• The predictions on the training data are: $\hat{\vec{y}} = X\theta^* = X(X^T X)^{-1} X^T \vec{y}$
• Note that $\hat{\vec{y}} - \vec{y} = \left( X(X^T X)^{-1} X^T - I \right) \vec{y}$ and
$X^T (\hat{\vec{y}} - \vec{y}) = X^T \left( X(X^T X)^{-1} X^T - I \right) \vec{y} = \left( X^T X (X^T X)^{-1} X^T - X^T \right) \vec{y} = 0$
so $\hat{\vec{y}}$ is the orthogonal projection of $\vec{y}$ into the space spanned by the columns of $X$ (with $X$ and $\vec{y}$ the design matrix and target vector defined earlier).
42 © Eric Xing @ CMU, 2006-‐2011
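A quick numerical check of the orthogonality claim above (illustrative sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

theta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ theta              # orthogonal projection of y onto the column space of X
print(X.T @ (y_hat - y))       # approximately the zero vector
```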
Probabilistic Interpretation of LMS • Let us assume that the target variable and the inputs
are related by the equation: $y_i = \theta^T \mathbf{x}_i + \varepsilon_i$
where ε is an error term of unmodeled effects or random noise
• Now assume that ε follows a Gaussian N(0, σ²); then we have:
$p(y_i \mid \mathbf{x}_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta^T \mathbf{x}_i)^2}{2\sigma^2} \right)$
• By the independence assumption:
$L(\theta) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i; \theta) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^n \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \theta^T \mathbf{x}_i)^2}{2\sigma^2} \right)$
43 © Eric Xing @ CMU, 2006-‐2011
Probabilistic Interpretation of LMS
• Hence the log-likelihood is:
$\ell(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta^T \mathbf{x}_i)^2$
• Do you recognize the last term? Yes, it is:
$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (\mathbf{x}_i^T \theta - y_i)^2$
• Thus, under the independence assumption, LMS is equivalent to MLE of θ!
44 © Eric Xing @ CMU, 2006-‐2011
Non-Linear basis function • So far we only used the observed values x1,x2,… • However, linear regression can be applied in the same
way to functions of these values – E.g.: to add a term $w\, x_1 x_2$, add a new variable $z = x_1 x_2$ so each example becomes: $x_1, x_2, \ldots, z$
• As long as these functions can be directly computed from the observed values the parameters are still linear in the data and the problem remains a multi-variate linear regression problem
$y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_k x_k + \varepsilon$
Slide courtesy of William Cohen
Non-linear basis functions
• What type of functions can we use? • A few common examples:
- Polynomial: $\phi_j(x) = x^j$ for $j = 0 \ldots n$
- Gaussian: $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right)$
- Sigmoid: $\phi_j(x) = \frac{1}{1 + \exp(-s_j x)}$
- Logs: $\phi_j(x) = \log(x + 1)$
Any function of the input values can be used. The solution for the parameters of the regression remains the same.
Slide courtesy of William Cohen
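A small NumPy sketch of this idea using polynomial basis functions $\phi_j(x) = x^j$; the learning problem stays linear in the parameters even though the fit is non-linear in x (data and degree are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.normal(size=100)

degree = 3
Phi = np.column_stack([x**j for j in range(degree + 1)])  # phi_0(x) = 1 gives the intercept
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)               # still ordinary linear least squares
print(w)  # roughly [1, 2, -3, 0]
```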
General linear regression problem • Using our new notations for the basis function linear
regression can be written as: $y = \sum_{j=0}^{n} w_j \phi_j(x)$
• Where $\phi_j(x)$ can be either $x_j$ for multivariate regression or one of the non-linear basis functions we defined
• … and $\phi_0(x) = 1$ for the intercept term
Slide courtesy of William Cohen
An example: polynomial basis vectors on a small dataset
– From Bishop Ch 1
Slide courtesy of William Cohen
0th Order Polynomial
n=10 Slide courtesy of William Cohen
1st Order Polynomial
Slide courtesy of William Cohen
3rd Order Polynomial
Slide courtesy of William Cohen
9th Order Polynomial
Slide courtesy of William Cohen
Over-fitting
Root-Mean-Square (RMS) Error:
Slide courtesy of William Cohen
Polynomial Coefficients
Slide courtesy of William Cohen
Regularization
Penalize large coefficient values
$J_{X,y}(\mathbf{w}) = \frac{1}{2} \sum_i \left( y_i - \sum_j w_j \phi_j(x_i) \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$
Slide courtesy of William Cohen
Regularization:
Slide courtesy of William Cohen
Polynomial Coefficients
Slide courtesy of William Cohen
Over Regularization:
Slide courtesy of William Cohen
Example: Stock Prices • Suppose we wish to predict Google’s stock price at time t+1
• What features should we use? (putting all computational concerns aside) – Stock prices of all other stocks at times t, t−1, t−2, …, t−k
– Mentions of Google with positive / negative sentiment words in all newspapers and social media outlets
• Do we believe that all of these features are going to be useful?
59
Ridge Regression
• Adds an L2 regularizer to Linear Regression
• Bayesian interpretation: MAP estimation with a Gaussian prior on the parameters
60
$J_{RR}(\boldsymbol{\theta}) = J(\boldsymbol{\theta}) + \lambda \|\boldsymbol{\theta}\|_2^2 = \frac{1}{2} \sum_{i=1}^{N} (\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})^2 + \lambda \sum_{k=1}^{K} \theta_k^2$
$\boldsymbol{\theta}_{MAP} = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{N} \log p_{\boldsymbol{\theta}}(y^{(i)} \mid \mathbf{x}^{(i)}) + \log p(\boldsymbol{\theta}) = \arg\min_{\boldsymbol{\theta}} J_{RR}(\boldsymbol{\theta})$, where $p(\boldsymbol{\theta}) \sim \mathcal{N}(0, \tfrac{1}{\lambda})$
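A minimal closed-form ridge sketch, assuming the common form $\theta = (X^T X + \lambda I)^{-1} X^T y$; depending on how the objective is scaled (e.g. the factor of ½), the λ here may differ from the slide's λ by a constant factor:

```python
import numpy as np

def ridge(X, y, lam=1.0):
    """Closed-form ridge regression: theta = (X^T X + lam * I)^(-1) X^T y."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)
print(ridge(X, y, lam=1.0))  # coefficients are shrunk slightly toward zero
```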
LASSO
• Adds an L1 regularizer to Linear Regression
• Bayesian interpretation: MAP estimation with a Laplace prior on the parameters
61
$J_{LASSO}(\boldsymbol{\theta}) = J(\boldsymbol{\theta}) + \lambda \|\boldsymbol{\theta}\|_1 = \frac{1}{2} \sum_{i=1}^{N} (\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})^2 + \lambda \sum_{k=1}^{K} |\theta_k|$
$\boldsymbol{\theta}_{MAP} = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{N} \log p_{\boldsymbol{\theta}}(y^{(i)} \mid \mathbf{x}^{(i)}) + \log p(\boldsymbol{\theta}) = \arg\min_{\boldsymbol{\theta}} J_{LASSO}(\boldsymbol{\theta})$, where $p(\boldsymbol{\theta}) \sim \text{Laplace}(0, f(\lambda))$
Ridge Regression vs Lasso
Ridge Regression: Lasso:
Lasso (ℓ1 penalty) results in sparse solutions – vectors with more zero coordinates. Good for high-dimensional problems – don't have to store all coordinates!
[Figure: level sets of J(β) in the (β1, β2) plane, together with the constraint regions: βs with constant ℓ1 norm (diamond) vs. βs with constant ℓ2 norm (circle)]
62 © Eric Xing @ CMU, 2006-‐2011
Optimization for LASSO
Can we apply SGD to the LASSO learning problem?
63
$J_{LASSO}(\boldsymbol{\theta}) = J(\boldsymbol{\theta}) + \lambda \|\boldsymbol{\theta}\|_1 = \frac{1}{2} \sum_{i=1}^{N} (\boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)})^2 + \lambda \sum_{k=1}^{K} |\theta_k|$
$\boldsymbol{\theta}_{MAP} = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{N} \log p_{\boldsymbol{\theta}}(y^{(i)} \mid \mathbf{x}^{(i)}) + \log p(\boldsymbol{\theta}) = \arg\min_{\boldsymbol{\theta}} J_{LASSO}(\boldsymbol{\theta})$
Optimization for LASSO
• Consider the absolute value function:
64
$r(\boldsymbol{\theta}) = \lambda \sum_{k=1}^{K} |\theta_k|$
• What does a small step opposite the gradient accomplish… – …far from zero? – …close to zero?
• The L1 penalty is subdifferentiable (i.e. not differentiable at 0)
Optimization for LASSO • The L1 penalty is subdifferentiable (i.e. not differentiable at 0)
• An array of optimization algorithms exists to handle this issue: – Coordinate Descent – Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) (Andrew & Gao, 2007) and provably convergent variants
– Block Coordinate Descent (Tseng & Yun, 2009) – Sparse Reconstruction by Separable Approximation (SpaRSA) (Wright et al., 2009)
– Fast Iterative Shrinkage Thresholding Algorithm (FISTA) (Beck & Teboulle, 2009)
65
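As one concrete instance of these methods, here is a sketch of ISTA (the non-accelerated relative of FISTA), which handles the non-differentiable L1 term with a soft-thresholding proximal step; the step size, data, and λ are illustrative:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam=5.0, n_iter=500):
    """ISTA sketch for 0.5 * ||X theta - y||^2 + lam * ||theta||_1."""
    lr = 1.0 / np.linalg.norm(X, ord=2) ** 2   # 1 / Lipschitz constant of the smooth part
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)           # gradient of the smooth squared-error term
        theta = soft_threshold(theta - lr * grad, lr * lam)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ np.concatenate([[2.0, -3.0], np.zeros(8)]) + 0.1 * rng.normal(size=100)
print(ista(X, y))  # most coordinates end up exactly zero (sparse solution)
```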
Data Set Size: 9th Order Polynomial
Slide courtesy of William Cohen
Locally weighted linear regression
• The algorithm: Instead of minimizing $J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta^T \mathbf{x}_i)^2$, we now fit θ to minimize $J(\theta) = \frac{1}{2} \sum_{i=1}^{n} w_i (y_i - \theta^T \mathbf{x}_i)^2$
• Where do the $w_i$'s come from? $w_i = \exp\left( -\frac{(\mathbf{x}_i - \mathbf{x})^2}{2\tau^2} \right)$, where $\mathbf{x}$ is the query point for which we'd like to know its corresponding $y$
→ Essentially we put higher weights on (errors on) training examples that are close to the query point (than those that are further away from the query)
© Eric Xing @ CMU, 2006-2011 67
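A minimal NumPy sketch of locally weighted linear regression for a single query point, using the Gaussian weights above and a weighted version of the normal equations (names and data are illustrative):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted linear regression prediction at one query point."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))  # weights w_i
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted normal equations
    return x_query @ theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.uniform(-3, 3, size=200)])  # bias feature + x
y = np.sin(X[:, 1]) + 0.1 * rng.normal(size=200)
print(lwr_predict(X, y, np.array([1.0, 0.5])))  # roughly sin(0.5)
```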
Parametric vs. non-parametric
• Locally weighted linear regression is the second example of a non-parametric algorithm we have run into. (What was the first?)
• The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm
  • because it has a fixed, finite number of parameters (the θ), which are fit to the data;
  • once we've fit the θ and stored them away, we no longer need to keep the training data around to make future predictions.
  • In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around.
• The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis grows linearly with the size of the training set.
© Eric Xing @ CMU, 2006-2011 68
Robust Regression
• The best fit from a quadratic regression
• But this is probably better …
© Eric Xing @ CMU, 2006-2011 69
How can we do this?
LOESS-based Robust Regression
• Remember what we do in "locally weighted linear regression"?
→ we "score" each point for its importance
• Now we score each point according to its "fitness"
© Eric Xing @ CMU, 2006-2011 70
(Courtesy to Andrew Moore)
Robust regression
• For k = 1 to R…
  • Let $(\mathbf{x}_k, y_k)$ be the kth datapoint
  • Let $y_k^{est}$ be the predicted value of $y_k$
  • Let $w_k$ be a weight for data point k that is large if the data point fits well and small if it fits badly:
    $w_k = \phi\left( (y_k - y_k^{est})^2 \right)$
• Then redo the regression using weighted data points.
• Repeat the whole thing until converged!
© Eric Xing @ CMU, 2006-2011 71
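A sketch of the iterative reweighting loop above; the slide does not spell out φ, so a bounded Gaussian-style weight is assumed here purely for illustration:

```python
import numpy as np

def robust_regression(X, y, n_rounds=10, scale=1.0):
    """Iteratively reweighted least squares: downweight points that fit badly."""
    w = np.ones(len(y))
    theta = np.zeros(X.shape[1])
    for _ in range(n_rounds):
        W = np.diag(w)
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted least squares
        resid = y - X @ theta
        w = np.exp(-resid ** 2 / (2.0 * scale ** 2))       # assumed phi: large weight if residual is small
    return theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-3, 3, size=100)])
y = 1.0 + 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)
y[:5] += 20.0                      # a few gross outliers
print(robust_regression(X, y))     # close to [1, 2] despite the outliers
```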
Robust regression—probabilistic interpretation
• What regular regression does:
Assume $y_k$ was originally generated using the following recipe: $y_k = \theta^T \mathbf{x}_k + \mathcal{N}(0, \sigma^2)$
The computational task is to find the Maximum Likelihood estimate of θ
© Eric Xing @ CMU, 2006-2011 72
Robust regression—probabilistic interpretation
• What LOESS robust regression does:
Assume $y_k$ was originally generated using the following recipe:
with probability p: $y_k = \theta^T \mathbf{x}_k + \mathcal{N}(0, \sigma^2)$
but otherwise: $y_k \sim \mathcal{N}(\mu, \sigma_{huge}^2)$
The computational task is to find the Maximum Likelihood estimates of θ, p, µ, and $\sigma_{huge}$.
• The algorithm you saw with iterative reweighting/refitting does this computation for us. Later you will find that it is an instance of the famous E.M. algorithm
© Eric Xing @ CMU, 2006-2011 73
Summary
• Linear regression predicts its output as a linear function of its inputs
• Learning optimizes a function (equivalently likelihood or mean squared error) using standard techniques (gradient descent, SGD, closed form)
• Regularization enables shrinkage and model selection
© Eric Xing @ CMU, 2005-‐2015 74