Linear Regression

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Regression
• Given:
  – Data X = {x^{(1)}, ..., x^{(n)}} where x^{(i)} ∈ R^d
  – Corresponding labels y = {y^{(1)}, ..., y^{(n)}} where y^{(i)} ∈ R

[Figure: September Arctic sea ice extent (1,000,000 sq km) versus year, 1970–2020, with linear and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013)]
Prostate Cancer Dataset
• 97 samples, partitioned into 67 train / 30 test
• Eight predictors (features):
  – 6 continuous (4 log transforms), 1 binary, 1 ordinal
• Continuous outcome variable:
  – lpsa: log(prostate specific antigen level)

Based on slide by Jeff Howbert
Linear Regression
• Hypothesis:
  y = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_d x_d = Σ_{j=0}^{d} θ_j x_j
  (assume x_0 = 1)
• Fit the model by minimizing the sum of squared errors

[Figures: data points with fitted regression lines, courtesy of Greg Shakhnarovich]
Least Squares Linear Regression
• Cost function:
  J(θ) = (1/2n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )^2
• Fit by solving min_θ J(θ)
Intuition Behind Cost Function
• For insight on J(θ), let's assume x ∈ R, so θ = [θ_0, θ_1]
  J(θ) = (1/2n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )^2

Based on example by Andrew Ng
Intuition Behind Cost Function
• For insight on J(θ), let's assume x ∈ R, so θ = [θ_0, θ_1]

[Figures: left, h_θ(x) plotted against the data (for fixed θ, this is a function of x); right, J plotted as a function of the parameter]

Based on example by Andrew Ng
Intuition Behind Cost Function
• For insight on J(θ), let's assume x ∈ R, so θ = [θ_0, θ_1]

[Figures: left, h_θ(x) for θ = [0, 0.5] plotted against the data (for fixed θ, this is a function of x); right, J plotted as a function of the parameter]

  J([0, 0.5]) = (1 / (2 × 3)) [ (0.5 − 1)^2 + (1 − 2)^2 + (1.5 − 3)^2 ] ≈ 0.58

Based on example by Andrew Ng
Intuition Behind Cost Function
• For insight on J(θ), let's assume x ∈ R, so θ = [θ_0, θ_1]

[Figures: left, h_θ(x) for θ = [0, 0] plotted against the data (for fixed θ, this is a function of x); right, J plotted as a function of the parameter]

  J([0, 0]) ≈ 2.333

• J(θ) is convex

Based on example by Andrew Ng
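A quick numerical check of these values: the three training points used in this example appear to be (1, 1), (2, 2), (3, 3). The minimal NumPy sketch below (the dataset and the function name cost are assumptions for illustration) reproduces both cost values.

import numpy as np

# Toy data implied by the worked example above (assumed): x = [1, 2, 3], y = [1, 2, 3]
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def cost(theta0, theta1):
    # J(theta) = 1/(2n) * sum_i (h_theta(x_i) - y_i)^2, with h_theta(x) = theta0 + theta1 * x
    predictions = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * len(x))

print(cost(0.0, 0.5))  # ~0.583, matching J([0, 0.5]) ≈ 0.58
print(cost(0.0, 0.0))  # ~2.333, matching J([0, 0]) ≈ 2.333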
Intuition Behind Cost Function

[Figures (slides by Andrew Ng): for several choices of θ, the left panel shows h_θ(x) against the data (for fixed θ, this is a function of x), and the right panel shows the contour plot of J(θ_0, θ_1) (a function of the parameters), with the chosen θ marked]
Basic Search Procedure
• Choose an initial value for θ
• Until we reach a minimum:
  – Choose a new value for θ to reduce J(θ)

[Figures by Andrew Ng: surface plot of J(θ_0, θ_1), showing the search moving downhill from the initial θ toward a minimum]
Since the least squares objective function is convex, we don't need to worry about local minima.
Gradient Descent
• Initialize θ
• Repeat until convergence:
  θ_j ← θ_j − α ∂/∂θ_j J(θ)    (simultaneous update for j = 0 ... d)
• α is the learning rate (small), e.g., α = 0.05

[Figure: J(θ) as a function of θ, with gradient descent steps moving toward the minimum]
Gradient Descent
• Initialize θ
• Repeat until convergence:
  θ_j ← θ_j − α ∂/∂θ_j J(θ)    (simultaneous update for j = 0 ... d)

• For linear regression:
  ∂/∂θ_j J(θ) = ∂/∂θ_j [ (1/2n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )^2 ]
              = ∂/∂θ_j [ (1/2n) Σ_{i=1}^{n} ( Σ_{k=0}^{d} θ_k x_k^{(i)} − y^{(i)} )^2 ]
              = (1/n) Σ_{i=1}^{n} ( Σ_{k=0}^{d} θ_k x_k^{(i)} − y^{(i)} ) × ∂/∂θ_j ( Σ_{k=0}^{d} θ_k x_k^{(i)} − y^{(i)} )
              = (1/n) Σ_{i=1}^{n} ( Σ_{k=0}^{d} θ_k x_k^{(i)} − y^{(i)} ) x_j^{(i)}
              = (1/n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} ) x_j^{(i)}
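The last line of this derivation is easy to check numerically. The sketch below (hypothetical helper functions, not part of the slides) compares the analytic gradient (1/n) Σ_i (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)} against a finite-difference approximation of ∂J/∂θ_j on a small random problem.

import numpy as np

def analytic_gradient(theta, X, y):
    # Gradient of J(theta) = 1/(2n) ||X theta - y||^2, i.e. (1/n) X^T (X theta - y)
    n = len(y)
    return X.T @ (X @ theta - y) / n

def numeric_gradient(theta, X, y, eps=1e-6):
    # Central finite-difference approximation of the same gradient, for checking
    n = len(y)
    cost = lambda t: np.sum((X @ t - y) ** 2) / (2 * n)
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step) - cost(theta - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])  # x_0 = 1 column plus 3 features
y = rng.normal(size=20)
theta = rng.normal(size=4)
print(np.allclose(analytic_gradient(theta, X, y),
                  numeric_gradient(theta, X, y), atol=1e-6))  # True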
Gradient Descent for Linear Regression
• Initialize θ
• Repeat until convergence:
  θ_j ← θ_j − α (1/n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} ) x_j^{(i)}    (simultaneous update for j = 0 ... d)

• To achieve the simultaneous update:
  – At the start of each GD iteration, compute h_θ(x^{(i)})
  – Use this stored value in the update step loop

• Assume convergence when ||θ_new − θ_old||_2 < ε
  L2 norm: ||v||_2 = sqrt( Σ_i v_i^2 ) = sqrt( v_1^2 + v_2^2 + ... + v_{|v|}^2 )
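A minimal NumPy sketch of this procedure, assuming the design matrix already contains the x_0 = 1 column; the function name and the default values of α, ε, and the iteration cap are illustrative choices, not prescribed by the slides.

import numpy as np

def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10_000):
    # Batch gradient descent for least-squares linear regression.
    # X: n x (d+1) design matrix with a leading column of ones; y: length-n targets.
    n, d_plus_1 = X.shape
    theta = np.zeros(d_plus_1)
    for _ in range(max_iters):
        predictions = X @ theta                   # h_theta(x^(i)) computed once per iteration
        gradient = X.T @ (predictions - y) / n    # (1/n) sum_i (h(x_i) - y_i) * x_ij for all j
        theta_new = theta - alpha * gradient      # simultaneous update for j = 0 ... d
        if np.linalg.norm(theta_new - theta) < eps:  # ||theta_new - theta_old||_2 < eps
            return theta_new
        theta = theta_new
    return theta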
Gradient Descent

[Figures (slides by Andrew Ng): a sequence of gradient descent iterations. The left panel shows the current hypothesis against the data (for fixed θ, this is a function of x), starting from h(x) = −900 − 0.1x; the right panel shows the contour plot of J(θ_0, θ_1) (a function of the parameters), with each iteration's θ moving toward the minimum]
Choosing α
• α too small → slow convergence
• α too large → increasing value for J(θ)
  – May overshoot the minimum
  – May fail to converge
  – May even diverge

To see if gradient descent is working, print out J(θ) each iteration:
• The value should decrease at each iteration
• If it doesn't, adjust α
Extending Linear Regression to More Complex Models
• The inputs X for linear regression can be:
  – Original quantitative inputs
  – Transformations of quantitative inputs
    • e.g., log, exp, square root, square, etc.
  – Polynomial transformations
    • example: y = b_0 + b_1 x + b_2 x^2 + b_3 x^3
  – Basis expansions
  – Dummy coding of categorical inputs
  – Interactions between variables
    • example: x_3 = x_1 × x_2

This allows the use of linear regression techniques to fit non-linear datasets.
Linear Basis Function Models
• Generally,
  h_θ(x) = Σ_{j=0}^{d} θ_j φ_j(x)
  where φ_j(x) is a basis function
• Typically, φ_0(x) = 1 so that θ_0 acts as a bias
• In the simplest case, we use linear basis functions: φ_j(x) = x_j

Based on slide by Christopher Bishop (PRML)
Linear Basis Function Models
• Polynomial basis functions: φ_j(x) = x^j
  – These are global; a small change in x affects all basis functions
• Gaussian basis functions: φ_j(x) = exp( −(x − μ_j)^2 / (2 s^2) )
  – These are local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (width).

Based on slide by Christopher Bishop (PRML)
Linear Basis Function Models
• Sigmoidal basis functions: φ_j(x) = σ( (x − μ_j) / s ), where σ(a) = 1 / (1 + e^{−a})
  – These are also local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (slope).

Based on slide by Christopher Bishop (PRML)
Example of Fitting a Polynomial Curve with a Linear Model
  y = θ_0 + θ_1 x + θ_2 x^2 + ... + θ_p x^p = Σ_{j=0}^{p} θ_j x^j
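As a concrete sketch of this idea, one can build the polynomial design matrix and reuse an ordinary least-squares solver; the helper name fit_polynomial and the synthetic data are assumptions for illustration.

import numpy as np

def fit_polynomial(x, y, p):
    # Columns of X are x^0, x^1, ..., x^p, so fitting y = X theta is ordinary linear regression
    X = np.vander(x, p + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)  # hypothetical noisy data
theta = fit_polynomial(x, y, p=3)   # degree-3 polynomial fit via a linear model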
Linear Basis Function Models
• Basic linear model:
  h_θ(x) = Σ_{j=0}^{d} θ_j x_j
• Generalized linear model:
  h_θ(x) = Σ_{j=0}^{d} θ_j φ_j(x)
• Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model
  – Unless we use the kernel trick – more on that when we cover support vector machines
  – Therefore, there is no point in cluttering the math with basis functions

Based on slide by Geoff Hinton
Linear Algebra Concepts
• A vector in R^d is an ordered set of d real numbers
  – e.g., v = [1, 6, 3, 4] is in R^4
  – "[1, 6, 3, 4]" is a column vector: [1; 6; 3; 4]
  – as opposed to a row vector: [1  6  3  4]
• An m-by-n matrix is an object with m rows and n columns, where each entry is a real number

Based on slides by Joseph Bradley
Linear Algebra Concepts
• Transpose: reflect the vector/matrix across its diagonal:
  [a; b]^T = [a  b],    [a b; c d]^T = [a c; b d]
  – Note: (Ax)^T = x^T A^T   (we'll define multiplication soon...)
• Vector norms:
  – The Lp norm of v = (v_1, ..., v_k) is ( Σ_i |v_i|^p )^{1/p}
  – Common norms: L1, L2
  – L-infinity norm = max_i |v_i|
• The length of a vector v is L2(v)

Based on slides by Joseph Bradley
Linear Algebra Concepts
• Vector dot product:
  u • v = (u_1  u_2) • (v_1  v_2) = u_1 v_1 + u_2 v_2
  – Note: the dot product of u with itself = length(u)^2 = ||u||_2^2
• Matrix product: for
  A = [a_11  a_12; a_21  a_22],    B = [b_11  b_12; b_21  b_22]
  AB = [a_11 b_11 + a_12 b_21    a_11 b_12 + a_12 b_22;
        a_21 b_11 + a_22 b_21    a_21 b_12 + a_22 b_22]

Based on slides by Joseph Bradley
Linear Algebra Concepts
• Vector products:
  – Dot product: u • v = u^T v = (u_1  u_2) [v_1; v_2] = u_1 v_1 + u_2 v_2
  – Outer product: u v^T = [u_1; u_2] (v_1  v_2) = [u_1 v_1  u_1 v_2; u_2 v_1  u_2 v_2]

Based on slides by Joseph Bradley
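For reference, these operations map directly onto NumPy; the vectors and matrices below (other than the example vector [1, 6, 3, 4]) are made up for illustration.

import numpy as np

u = np.array([1.0, 6.0, 3.0, 4.0])   # the example vector [1, 6, 3, 4]
v = np.array([2.0, 0.0, 1.0, 5.0])   # a second vector, chosen arbitrarily

print(u @ v)                               # dot product: sum_i u_i * v_i
print(np.outer(u, v))                      # outer product u v^T (a 4x4 matrix)
print(u @ u, np.linalg.norm(u) ** 2)       # dot of u with itself equals ||u||_2^2
print(np.linalg.norm(u, 1), np.linalg.norm(u, np.inf))  # L1 and L-infinity norms

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
print(A @ B)                               # 2x2 matrix product
print(np.allclose((A @ B).T, B.T @ A.T))   # transpose-of-a-product rule, as in (Ax)^T = x^T A^T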
Vectorization
• Benefits of vectorization:
  – More compact equations
  – Faster code (using optimized matrix libraries)
• Consider our model:
  h(x) = Σ_{j=0}^{d} θ_j x_j
• Let
  θ = [θ_0; θ_1; ...; θ_d],    x^T = [1  x_1  ...  x_d]
• We can write the model in vectorized form as h(x) = θ^T x
Vectorization
• Consider our model for n instances:
  h( x^{(i)} ) = Σ_{j=0}^{d} θ_j x_j^{(i)}
• Let
  θ = [θ_0; θ_1; ...; θ_d] ∈ R^{(d+1)×1}
  X = [ 1  x_1^{(1)}  ...  x_d^{(1)};  ...;  1  x_1^{(i)}  ...  x_d^{(i)};  ...;  1  x_1^{(n)}  ...  x_d^{(n)} ] ∈ R^{n×(d+1)}
• We can write the model in vectorized form as h_θ(x) = Xθ
Vectorization
• For the linear regression cost function:
  J(θ) = (1/2n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )^2
       = (1/2n) Σ_{i=1}^{n} ( θ^T x^{(i)} − y^{(i)} )^2
       = (1/2n) (Xθ − y)^T (Xθ − y)
  where X ∈ R^{n×(d+1)}, θ ∈ R^{(d+1)×1}, and
  y = [y^{(1)}; y^{(2)}; ...; y^{(n)}] ∈ R^{n×1}
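The vectorized form reduces to essentially one line of NumPy; the sketch below assumes X already includes the column of ones, and the function name is illustrative.

import numpy as np

def cost(theta, X, y):
    # Vectorized J(theta) = (1/2n) (X theta - y)^T (X theta - y)
    # X: n x (d+1) design matrix (first column all ones), theta: (d+1,), y: (n,)
    residual = X @ theta - y
    return residual @ residual / (2 * len(y))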
Closed Form Solution
• Instead of using GD, solve for the optimal θ analytically
  – Notice that the solution is where ∂/∂θ J(θ) = 0
• Derivation:
  J(θ) = (1/2n) (Xθ − y)^T (Xθ − y)
       ∝ θ^T X^T X θ − y^T X θ − θ^T X^T y + y^T y
       ∝ θ^T X^T X θ − 2 θ^T X^T y + y^T y       (y^T X θ is 1×1, so it equals its transpose θ^T X^T y)
  Take the derivative, set it equal to 0, then solve for θ:
  ∂/∂θ ( θ^T X^T X θ − 2 θ^T X^T y + y^T y ) = 0
  (X^T X) θ − X^T y = 0
  (X^T X) θ = X^T y
  θ = (X^T X)^{−1} X^T y
Closed Form Solution
• Can obtain θ by simply plugging X and y into
  θ = (X^T X)^{−1} X^T y
  where
  X = [ 1  x_1^{(1)}  ...  x_d^{(1)};  ...;  1  x_1^{(n)}  ...  x_d^{(n)} ],    y = [y^{(1)}; y^{(2)}; ...; y^{(n)}]
• If X^T X is not invertible (i.e., singular), may need to:
  – Use the pseudo-inverse instead of the inverse
    • In Python, numpy.linalg.pinv(a)
  – Remove redundant (not linearly independent) features
  – Remove extra features to ensure that d ≤ n
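A minimal sketch of the closed-form fit using the pseudo-inverse mentioned above; the function name is illustrative.

import numpy as np

def fit_closed_form(X, y):
    # theta = (X^T X)^{-1} X^T y; the pseudo-inverse also handles a singular X^T X
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)

# Equivalently, np.linalg.pinv(X) @ y gives the same least-squares solution.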
Gradient Descent vs. Closed Form

Gradient Descent:
• Requires multiple iterations
• Need to choose α
• Works well when n is large
• Can support incremental learning

Closed Form Solution:
• Non-iterative
• No need for α
• Slow if the number of features d is large
  – Computing (X^T X)^{−1} is roughly O(d^3)
Improving Learning: Feature Scaling
• Idea: ensure that features have similar scales
• Makes gradient descent converge much faster

[Figures: contour plots of J over (θ_1, θ_2), before feature scaling and after feature scaling]
Feature Standardization
• Rescales features to have zero mean and unit variance
  – Let μ_j be the mean of feature j:  μ_j = (1/n) Σ_{i=1}^{n} x_j^{(i)}
  – Replace each value with:  ( x_j^{(i)} − μ_j ) / s_j    for j = 1 ... d (not x_0!)
    • s_j is the standard deviation of feature j
    • Could also use the range of feature j (max_j − min_j) for s_j
• Must apply the same transformation to instances for both training and prediction
• Outliers can cause problems
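A sketch of this recipe, assuming X holds only the raw features (the x_0 = 1 bias column is added afterwards and is not standardized); the function names are illustrative.

import numpy as np

def standardize_fit(X_train):
    # Compute per-feature mean and standard deviation from the training set only
    mu = X_train.mean(axis=0)
    s = X_train.std(axis=0)
    return mu, s

def standardize_apply(X, mu, s):
    # Apply the *same* transformation to both training and test instances
    return (X - mu) / s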
Quality of Fit
• Overfitting:
  – The learned hypothesis may fit the training set very well ( J(θ) ≈ 0 )
  – ...but fails to generalize to new examples

[Figures: productivity vs. time spent, fit three ways – underfitting (high bias), correct fit, and overfitting (high variance)]

Based on example by Andrew Ng
Regularization
• A method for automatically controlling the complexity of the learned hypothesis
• Idea: penalize large values of θ_j
  – Can incorporate the penalty into the cost function
  – Works well when we have a lot of features, each of which contributes a bit to predicting the label
• Can also address overfitting by eliminating features (either manually or via model selection)
Regularization
• Linear regression objective function:
  J(θ) = (1/2n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )^2  +  (λ/2) Σ_{j=1}^{d} θ_j^2
         [model fit to data]                                  [regularization]
  – λ is the regularization parameter (λ ≥ 0)
  – No regularization on θ_0!
Understanding Regularization
  J(θ) = (1/2n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )^2  +  (λ/2) Σ_{j=1}^{d} θ_j^2
• Note that  Σ_{j=1}^{d} θ_j^2 = ||θ_{1:d}||_2^2
  – This is the squared magnitude of the feature coefficient vector!
• We can also think of this as  Σ_{j=1}^{d} (θ_j − 0)^2 = ||θ_{1:d} − 0⃗||_2^2
• L2 regularization pulls the coefficients toward 0
Understanding Regularization
  J(θ) = (1/2n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )^2  +  (λ/2) Σ_{j=1}^{d} θ_j^2
• What happens as λ → ∞?

[Figure: productivity vs. time spent on work, with a degree-4 polynomial fit θ_0 + θ_1 x + θ_2 x^2 + θ_3 x^3 + θ_4 x^4]
Understanding Regularization
  J(θ) = (1/2n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )^2  +  (λ/2) Σ_{j=1}^{d} θ_j^2
• What happens as λ → ∞?
  – The penalty dominates, driving θ_1, θ_2, θ_3, θ_4 → 0 in the fit θ_0 + θ_1 x + θ_2 x^2 + θ_3 x^3 + θ_4 x^4, leaving (approximately) the constant hypothesis θ_0

[Figure: productivity vs. time spent on work, with the fit flattening toward a horizontal line]
Regularized Linear Regression
• Cost function:
  J(θ) = (1/2n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )^2  +  (λ/2) Σ_{j=1}^{d} θ_j^2
• Fit by solving min_θ J(θ)
• Gradient update:
  θ_0 ← θ_0 − α (1/n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )                        [∂/∂θ_0 J(θ)]
  θ_j ← θ_j − α (1/n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} ) x_j^{(i)}  −  α λ θ_j    [∂/∂θ_j J(θ); the −αλθ_j term is the regularization]
Regularized Linear Regression
  J(θ) = (1/2n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )^2  +  (λ/2) Σ_{j=1}^{d} θ_j^2
  θ_0 ← θ_0 − α (1/n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )
  θ_j ← θ_j − α (1/n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} ) x_j^{(i)}  −  α λ θ_j
• We can rewrite the gradient step as:
  θ_j ← θ_j (1 − α λ) − α (1/n) Σ_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} ) x_j^{(i)}
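A sketch of one such gradient step in NumPy, assuming X includes the bias column and θ_0 is excluded from the penalty; the function name is illustrative.

import numpy as np

def regularized_gd_step(theta, X, y, alpha, lam):
    # One gradient step for the regularized objective above.
    # Equivalent to theta_j <- theta_j (1 - alpha*lam) - alpha*(1/n) sum_i (h(x_i) - y_i) x_ij,
    # with theta_0 updated without the regularization term.
    n = len(y)
    gradient = X.T @ (X @ theta - y) / n
    penalty = lam * theta
    penalty[0] = 0.0                  # no regularization on theta_0
    return theta - alpha * (gradient + penalty)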
Regularized Linear Regression
• To incorporate regularization into the closed form solution:
  θ = ( X^T X + λ · diag(0, 1, 1, ..., 1) )^{−1} X^T y
  where diag(0, 1, 1, ..., 1) is the (d+1)×(d+1) diagonal matrix with a 0 in the first entry (so θ_0 is not regularized) and 1s elsewhere
• Can derive this the same way, by solving ∂/∂θ J(θ) = 0
• Can prove that for λ > 0, the inverse in the equation above exists
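A sketch of this regularized closed-form solution in NumPy; the function name is illustrative, and lam plays the role of λ in the formula above.

import numpy as np

def fit_ridge_closed_form(X, y, lam):
    # theta = (X^T X + lam * D)^{-1} X^T y, with D = diag(0, 1, ..., 1)
    # so that theta_0 is left unregularized; the slides note the inverse exists for lam > 0
    d_plus_1 = X.shape[1]
    D = np.eye(d_plus_1)
    D[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)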