Logistic Regression Demystified (Hopefully)
Gabriele Tolomei, Yahoo Labs, London, UK
19th November 2015
Introduction
• 3 components need to be defined:
– Model: describes the set of hypotheses (hypothesis space) that can be represented
– Error Measure (Cost Function): measures the price that must be paid when a misclassification error occurs
– Learning Algorithm: is responsible for picking the best hypothesis (according to the error measure) by searching through the hypothesis space
The Model
Linear Signal
• Logistic Regression is an example of a linear model
• Given a (d+1)-dimensional input x, with xᵀ = (x₀, x₁, …, x_d) and x₀ = 1
• We define the family F of real-valued functions having d+1 parameters θ, with θᵀ = (θ₀, θ₁, …, θ_d)
• Each function fθ in F outputs a real scalar obtained as a linear combination of the input x with the parameters θ: fθ(x) = θᵀx
• fθ(x) means "the application of f parametrized by θ to x", and it is referred to as the signal
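As a minimal sketch of the signal computation (the function name is illustrative, not from the slides):

```python
import numpy as np

def signal(theta, x):
    """Linear signal f_theta(x) = theta^T x, with the constant coordinate x0 = 1 prepended."""
    x = np.concatenate(([1.0], x))  # x^T = (x0, x1, ..., xd), x0 = 1
    return float(theta @ x)
```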
Hypothesis Space
• The signal alone is not enough to define the hypothesis space H
• Usually the signal is passed through a "filter", i.e. another real-valued function g
• hθ(x) = g(fθ(x)) defines the hypothesis space:
The set of possible hypotheses H changes depending on the parametric model (fθ) and on the thresholding function (g)
[Figure: the signal fθ(x) passed through three thresholding functions g: g = sign (hard threshold, output in {−1, 1}), g = identity (output is the raw signal), g = logistic (soft threshold, output in [0, 1]).]
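A sketch of the three filters from the figure; here `g_logistic` is the standard sigmoid, consistent with the next slide (function names are illustrative):

```python
import numpy as np

def g_sign(z):      # hard threshold: output in {-1, +1}
    return np.sign(z)

def g_identity(z):  # no thresholding: the raw linear signal
    return z

def g_logistic(z):  # soft threshold: output in [0, 1]
    return 1.0 / (1.0 + np.exp(-z))
```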
The Logistic Function
• Domain is ℝ, codomain is [0, 1]
• Also known as the sigmoid function due to its "S" shape, or as a soft threshold (compared to the hard threshold imposed by sign)
• When z = θᵀx we are applying a non-linear transformation to our linear signal
• The output can be genuinely interpreted as a probability value
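For reference, the standard definition of the logistic function, together with the symmetry property used later (the slide's own formula did not survive extraction):

$$\ell(z) = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{1 + e^{z}}, \qquad \ell(-z) = 1 - \ell(z)$$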
Probabilistic Interpretation
• Describing the set of hypotheses using the logistic function is not enough to state that the output can be interpreted as a probability
– All we know is that the logistic function always produces a real value between 0 and 1
– Other functions may be defined having the same property
• e.g., (1/π)·arctan(x) + 1/2
• The key points here are:
– the output of the logistic function can be interpreted as a probability even during learning
– the logistic function is mathematically convenient!
Probabilistic Interpretation: Odds Ratio
• Let p (resp., q = 1 − p) be the probability of success (resp., failure) of an event
• odds(success) = p/q = p/(1 − p)
• odds(failure) = q/p = 1/(p/q) = 1/odds(success)
• logit(p) = ln(odds(success)) = ln(p/q) = ln(p/(1 − p))
• Logistic Regression is in fact an ordinary linear regression where the logit is the response variable!
• The coefficients of logistic regression are expressed in terms of the natural logarithm of odds
Probabilistic Interpretation: Odds Ratio
Taking the logit as the response of a linear model, logit(p) = ln(p/(1 − p)) = θᵀx, and solving for p recovers the logistic function: p = 1/(1 + e^(−θᵀx))
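A quick worked example of the odds/logit mapping (the numbers are chosen purely for illustration):

$$p = 0.8 \;\Rightarrow\; \text{odds}(\text{success}) = \frac{0.8}{0.2} = 4, \qquad \text{logit}(0.8) = \ln 4 \approx 1.386$$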
Probabilistically-generated Data
As for any other supervised learning problem, we can only deal with a finite set D of m labelled examples which we can try to learn from:
D = {(x₁, y₁), …, (x_m, y_m)}
where each yᵢ is a binary variable taking on two values {−1, +1}
That means we do not have access to the individual probability associated with each training sample!
Still, we can assume that the data we observe from D, i.e. positive (+1) and negative (−1) samples, are actually generated by an underlying and unknown probability function (noisy target) which we want to estimate
Estimating the Noisy Target
More formally, given the generic training example (x, y), we claim there exists a conditional probability P(y|x), which is defined as:

$$P(y \mid x) = \begin{cases} \varphi(x) & \text{if } y = +1 \\ 1 - \varphi(x) & \text{if } y = -1 \end{cases}$$

where φ is the noisy target function
• Deterministic function: given x as input it always outputs either y = +1 or y = −1 (mutually exclusive)
• Noisy target function: given x as input it outputs both y = +1 and y = −1, each with an associated "degree of certainty"
Goal: If we assume φ: ℝ^(d+1) → [0, 1] is the underlying and unknown noisy target which generates our examples, our aim is to find an estimate φ* which best approximates φ
Hypothesized Noisy Target
We claim that the best estimate φ* of φ is h*θ(x), which in turn is picked from the set of hypotheses defined by the logistic function
But how do we select h*θ(x)? 2 elements are needed:
- Training set D
- Error Measure (Cost Function) to minimize
The Error Measure
The Best Hypothesis
If the hypothesis space H is made of a family of parametric models, h*θ(x) can be picked as:

$$h^{*}_{\theta} = \arg\max_{\theta} P(h_\theta \mid \mathcal{D})$$

That is, we want to maximise the probability of the chosen hypothesis given the data D we observed
Flipping the Coin: (Data) Likelihood
We measure the error we are making by assuming that h*θ(x) approximates the true noisy target φ
How likely is it that the observed data D have been generated by our selected hypothesis h*θ(x)?
Find the hypothesis which maximises the probability of the observed data D given a particular hypothesis:

$$h^{*}_{\theta} = \arg\max_{\theta} P(\mathcal{D} \mid h_\theta)$$
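The flip from maximising P(hθ|D) to maximising P(D|hθ) is justified by Bayes' rule, under the (implicit) assumption of a prior that does not favour any particular hypothesis:

$$P(h_\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid h_\theta)\,P(h_\theta)}{P(\mathcal{D})} \;\propto\; P(\mathcal{D} \mid h_\theta)$$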
The Likelihood Function
Given a generic training example (x, y), and assuming it has been generated by a hypothesis hθ(x), the likelihood function is:

$$P(y \mid x) = \begin{cases} h_\theta(x) & \text{if } y = +1 \\ 1 - h_\theta(x) & \text{if } y = -1 \end{cases}$$

where φ has been replaced with our hypothesis
If we assume the hypothesis is the logistic function, i.e. hθ(x) = ℓ(θᵀx), and by noticing that the logistic function is symmetric, i.e. ℓ(−z) = 1 − ℓ(z), the likelihood for a single example is:

$$P(y \mid x) = \ell(y\,\theta^{\top} x)$$
The Likelihood Function
Having access to a full set of m i.i.d. training examples D, the overall likelihood function is computed as:

$$P(\mathcal{D} \mid h_\theta) = \prod_{i=1}^{m} P(y_i \mid x_i) = \prod_{i=1}^{m} \ell(y_i\,\theta^{\top} x_i)$$
Why Does the Likelihood Make Sense?
How does the likelihood ℓ(yᵢθᵀxᵢ) change w.r.t. the signs of yᵢ and θᵀxᵢ?
If the label is concordant with the signal (either positively or negatively) then ℓ(yᵢθᵀxᵢ) approaches 1: our prediction agrees with the true label

           θᵀxᵢ > 0   θᵀxᵢ < 0
  yᵢ > 0     ≈ 1        ≈ 0
  yᵢ < 0     ≈ 0        ≈ 1

Conversely, if the label is discordant with the signal then ℓ(yᵢθᵀxᵢ) approaches 0: our prediction disagrees with the true label
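A tiny numeric check of the table above (the magnitude 5.0 is an arbitrary illustrative signal value):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Concordant: y_i and theta^T x_i share the same sign -> likelihood near 1
print(logistic(+1 * +5.0))  # ~0.993
print(logistic(-1 * -5.0))  # ~0.993
# Discordant: opposite signs -> likelihood near 0
print(logistic(+1 * -5.0))  # ~0.007
print(logistic(-1 * +5.0))  # ~0.007
```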
Maximum Likelihood Estimate
Find the vector of parameters θ such that the likelihood function is maximum:

$$\theta^{*} = \arg\max_{\theta} \prod_{i=1}^{m} \ell(y_i\,\theta^{\top} x_i)$$
From MLE to In-Sample Error
Generally speaking, given a hypothesis hθ and a training set D of m labelled samples, we are interested in measuring the "in-sample" (i.e. training) error:

$$E_{in}(\theta) = \frac{1}{m} \sum_{i=1}^{m} e\big(h_\theta(x_i), y_i\big)$$

where e(·) measures how "far" the chosen hypothesis is from the true observed value
How can we "transform" the MLE into an expression similar to the "in-sample" error above?
From MLE to In-Sample Error
Maximising the likelihood is the same as minimising its negative logarithm, since the logarithm is monotonic:

$$\arg\max_{\theta} \prod_{i=1}^{m} \ell(y_i\,\theta^{\top} x_i) = \arg\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \ln \frac{1}{\ell(y_i\,\theta^{\top} x_i)}$$

By noticing that the logistic function can be rewritten as $\frac{1}{\ell(z)} = 1 + e^{-z}$, we can finally write the "in-sample" error to be minimised:

$$E_{in}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \ln\left(1 + e^{-y_i\,\theta^{\top} x_i}\right)$$
Cross-Entropy Error
The in-sample error above is known as the cross-entropy error
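A minimal sketch of the cross-entropy error for labels in {−1, +1} (names are illustrative; np.logaddexp(0, -z) computes ln(1 + e^(−z)) in a numerically stable way):

```python
import numpy as np

def cross_entropy_error(theta, X, y):
    """E_in(theta) = (1/m) * sum_i ln(1 + exp(-y_i * theta^T x_i)).

    X: (m, d+1) design matrix with x0 = 1 in the first column
    y: (m,) vector of labels in {-1, +1}
    """
    z = y * (X @ theta)  # y_i * theta^T x_i for every example
    return float(np.mean(np.logaddexp(0.0, -z)))
```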
The Learning Algorithm
Picking the Best Hypothesis
So far we have defined:
- The model
- The error measure (cross-entropy)
To actually select the best hypothesis, we have to pick the vector of parameters so that the error measure is minimised
The usual way of achieving this is to compute the gradient with respect to θ (i.e. the vector of partial derivatives), set it to 0, and solve for θ
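For completeness, differentiating the cross-entropy error above gives (this formula is derived here, not taken from the transcript):

$$\nabla E_{in}(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \frac{y_i\,x_i}{1 + e^{\,y_i\,\theta^{\top} x_i}}$$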
Mean Squared Error vs. Cross-Entropy
In the case of linear regression we have a similar expression for the error measure, i.e. the Mean Squared Error (MSE)
Minimising the MSE through Ordinary Least Squares (OLS) leads to a closed-form solution, often referred to as the OLS estimator for θ
The problem is that using cross-entropy as the error measure we cannot find a closed-form solution to the minimization problem
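For contrast, the OLS estimator mentioned above has the well-known closed form (X is the m×(d+1) design matrix and y the vector of targets):

$$\hat{\theta}_{OLS} = (X^{\top} X)^{-1} X^{\top} y$$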
Iterative Solution
(Batch) Gradient Descent
A general iterative method for any nonlinear optimization
Under specific assumptions on the function to be minimised and on the learning rate parameter at each iteration, the method guarantees convergence to a local minimum
If the function is convex, like the cross-entropy error for logistic regression, then the local minimum is also the global minimum
Gradient Descent: The Idea
1. At t = 0, initialize the (guessed) vector of parameters θ to θ(0)
2. Repeat until convergence:
a. Update the current vector of parameters θ(t) by taking a "step" along the "steepest" slope: θ(t+1) = θ(t) + ηv, where η is the step size and v is a unit vector representing the direction of the steepest slope
b. Return to 2.
Question: How do we compute the direction v? Depending on how we solve this we may get different solutions (Gradient Descent, Conjugate Gradient, etc.)
Gradient Descent: The Direction v
We already said, intuitively, that the direction v should be that of the "steepest" slope
Concretely, this means moving along the direction which most reduces the in-sample error function:

$$\Delta E_{in} = E_{in}(\theta^{(t)}) - E_{in}(\theta^{(t-1)})$$

We want ΔEin to be as negative as possible, which means that we are actually reducing the error w.r.t. the previous iteration t−1
Gradient Descent: The Direction v
Let's first assume we are in the univariate case, i.e. θ = θ in ℝ. Since θ(t) = θ(t−1) + ηv, a first-order Taylor approximation (with a second-order error term) gives:

$$\Delta E_{in} = E_{in}(\theta^{(t-1)} + \eta v) - E_{in}(\theta^{(t-1)}) = \eta\,E'_{in}(\theta^{(t-1)})\,v + O(\eta^{2})$$

To summarize, and to generalize to the multivariate case of θ:

$$\Delta E_{in} = \eta\,\nabla E_{in}(\theta^{(t-1)})^{\top} v + O(\eta^{2})$$

The nabla symbol ∇ indicates the gradient, i.e. the vector of partial derivatives
Gradient Descent: The Direction v
The unit vector v only contributes to the direction, and not to the magnitude, of the iterative step
Therefore:
- the maximum (i.e. most positive) step happens when both the error gradient vector and the direction vector have the same direction
- the minimum (i.e. most negative) step happens when the two vectors have opposite directions
Gradient Descent: The Direction v
At each iteration t, we want the unit vector which makes exactly the most negative step
Therefore:

$$v = -\frac{\nabla E_{in}(\theta^{(t)})}{\lVert \nabla E_{in}(\theta^{(t)}) \rVert}$$
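The one-line justification, spelled out here since only the result appears on the slide: for any unit vector v, by the Cauchy–Schwarz inequality,

$$\nabla E_{in}^{\top} v \;\geq\; -\lVert \nabla E_{in} \rVert$$

with equality exactly when v points opposite to the gradient, which yields the expression above.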
Gradient Descent: The Step η
How does the step magnitude η affect the convergence?
[Figure: convergence behaviour in three cases: η too small, η too large, η variable.]
Rule of thumb: dynamically change η proportionally to the gradient!
Gradient Descent: The Step η
Remember that at each iteration the update strategy is θ(t+1) = θ(t) + ηv, where:

$$v = -\frac{\nabla E_{in}(\theta^{(t)})}{\lVert \nabla E_{in}(\theta^{(t)}) \rVert}$$

At each iteration t, the step η is fixed
Gradient Descent: The Step ηt
Instead of having a fixed η at each iteration, use a variable ηt proportional to the gradient magnitude
If we take ηt = η·‖∇Ein(θ(t))‖, the normalisation of v cancels out and the update simplifies to:

$$\theta^{(t+1)} = \theta^{(t)} - \eta\,\nabla E_{in}(\theta^{(t)})$$
Gradient Descent: The Algorithm
1. At t = 0, initialize the (guessed) vector of parameters θ to θ(0)
2. For t = 0, 1, 2, … until stop:
a. Compute the gradient of the cross-entropy error (i.e. the vector of partial derivatives): ∇Ein(θ(t))
b. Update the vector of parameters: θ(t+1) = θ(t) − η∇Ein(θ(t))
c. Return to 2.
3. Return the final vector of parameters θ
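Putting the pieces together, a minimal sketch of the algorithm for logistic regression (the function names, default values, and stopping rule are illustrative choices, not from the slides):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, eta=0.1, max_iters=10_000, tol=1e-6):
    """Batch gradient descent on the cross-entropy error.

    X: (m, d+1) design matrix with x0 = 1 in the first column
    y: (m,) vector of labels in {-1, +1}
    """
    m, d1 = X.shape
    theta = np.zeros(d1)  # theta(0): a common (if arbitrary) initial guess
    for t in range(max_iters):
        # grad E_in = -(1/m) * sum_i y_i * x_i * logistic(-y_i * theta^T x_i)
        z = y * (X @ theta)
        grad = -(X.T @ (y * logistic(-z))) / m
        theta_next = theta - eta * grad  # theta(t+1) = theta(t) - eta * grad
        if np.linalg.norm(theta_next - theta) < tol:  # illustrative stopping rule
            return theta_next
        theta = theta_next
    return theta
```

Prediction then amounts to thresholding hθ(x) = logistic(θᵀx) at 1/2, which is equivalent to taking the sign of the signal θᵀx.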
Discussion: Initialization
• How do we choose the initial value of the parameters θ(0)?
• If the function is convex we are guaranteed to reach the global minimum no matter what the initial value of θ(0) is
• In general we may get to the local minimum nearest to θ(0)
– Problem: we may miss "better" local minima (or even the global one, if it exists)
– Solution (heuristic): repeating GD 100 to 1,000 times, each time with a different θ(0), may give a sense of what the global minimum eventually is (no guarantees)
Discussion: Termination
• When does the algorithm stop?
• Intuitively, when θ(t+1) = θ(t), i.e. −η∇Ein(θ(t)) = 0, i.e. ∇Ein(θ(t)) = 0
• If the function is convex we are guaranteed to reach the global minimum when ∇Ein(θ(t)) = 0
– i.e. there exists a unique local minimum which also happens to be the global minimum
• In general we don't know if eventually ∇Ein(θ(t)) = 0, therefore we can use several termination criteria (combined in the sketch below), e.g.:
– stop whenever the difference between two iterations is "small enough" → may converge "prematurely"
– stop when the error equals ε → may not converge if the target error is not achievable
– stop after T iterations
– combinations of the above work in practice…
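A minimal sketch of such a combined stopping rule (all thresholds are illustrative):

```python
import numpy as np

def should_stop(theta_next, theta, error, t,
                tol_step=1e-6, tol_error=1e-3, max_iters=10_000):
    """Combine the termination criteria listed above."""
    return (np.linalg.norm(theta_next - theta) < tol_step  # parameters barely moved
            or error <= tol_error                          # target error reached
            or t >= max_iters)                             # iteration budget spent
```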
Advanced Topics
• Gradient Descent using a second-order approximation
– better local approximation than first-order, but each step requires computing the second derivatives (Hessian matrix)
– Conjugate Gradient makes the second-order approximation "faster" as it doesn't require computing the full Hessian matrix explicitly
• Stochastic Gradient Descent (SGD)
– At each step only one sample is considered for computing the gradient of the error, instead of the full training set
• L1 and L2 regularization to penalize extreme parameter values and deal with overfitting
– include the L1 or L2 norm of the vector of parameters θ in the cross-entropy error function to be minimised during learning (an SGD update with an L2 penalty is sketched below)
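A minimal sketch combining the last two ideas: one SGD update on a single example, with an optional L2 penalty (the names and the λ parameter are illustrative assumptions):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(theta, x_i, y_i, eta=0.1, lam=0.0):
    """One SGD update on a single example (x_i, y_i), with y_i in {-1, +1}.

    With lam > 0, an L2 penalty (lam/2)*||theta||^2 is added to the
    per-example cross-entropy error, contributing lam * theta to the gradient.
    """
    grad = -y_i * x_i * logistic(-y_i * (theta @ x_i))  # per-example gradient
    grad += lam * theta                                  # L2 regularization term
    return theta - eta * grad
```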