Machine Learning
Introduction to Bayesian Learning
What we have seen so far
• What does it mean to learn?
– Mistake-driven learning
• Learning by counting (and bounding) the number of mistakes
– PAC learnability
• Sample complexity and bounds on errors on unseen examples
• Various learning algorithms
– Analyzed algorithms under these models of learnability
– In all cases, the algorithm outputs a function that produces a label y for a given input x
Coming up
Another way of thinking about "What does it mean to learn?"
– Bayesian learning
Different learning algorithms in this regime
– Naïve Bayes
– Logistic Regression
Today's lecture
• Bayesian Learning
• Maximum a posteriori and maximum likelihood estimation
• Two examples of maximum likelihood estimation
– Binomial distribution
– Normal distribution
Probabilistic Learning
Two different notions of probabilistic learning
• Learning probabilistic concepts
– The learned concept is a function c: X → [0, 1]
– c(x) may be interpreted as the probability that the label 1 is assigned to x
– The learning theory that we have studied before is applicable (with some extensions)
• Bayesian Learning: Use of a probabilistic criterion in selecting a hypothesis
– The hypothesis can be deterministic, a Boolean function
– The criterion for selecting the hypothesis is probabilistic
Bayesian Learning: The basics
• Goal: To find the best hypothesis from some space H of hypotheses, using the observed data D
• Define best = most probable hypothesis in H
• In order to do that, we need to assume a probability distribution over the class H
• We also need to know something about the relation between the observed data and the hypotheses
– As we will see, we can be Bayesian about other things, e.g., the parameters of the model
Bayesian methods have multiple roles
• Provide practical learning algorithms
• Combine prior knowledge with observed data
– Guide the model towards something we know
• Provide a conceptual framework
– For evaluating other learners
• Provide tools for analyzing learning
Bayes Theorem

P(Y | X) = P(X | Y) P(Y) / P(X)
Bayes Theorem

P(Y | X) = P(X | Y) P(Y) / P(X)

Short for:
∀x, y: P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x)
Bayes Theorem

P(Y | X) = P(X | Y) P(Y) / P(X)

Posterior probability: What is the probability of Y given that X is observed?
Likelihood: What is the likelihood of observing X given a specific Y?
Prior probability: What is our belief in Y before we see X?
Posterior ∝ Likelihood × Prior
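As a quick numeric sketch of the rule (the probabilities below are made-up illustrative numbers, not from the lecture):

```python
# Bayes rule: P(Y | X) = P(X | Y) P(Y) / P(X), with P(X) computed by
# total probability. Hypothetical setup: Y = "machine is faulty",
# X = "a sampled bulb fails".
p_faulty = 0.1                 # prior P(Y = faulty)
p_fail_given_faulty = 0.5      # likelihood P(X = fail | Y = faulty)
p_fail_given_ok = 0.05         # likelihood P(X = fail | Y = ok)

# Evidence P(X = fail) via the theorem of total probability
p_fail = p_fail_given_faulty * p_faulty + p_fail_given_ok * (1 - p_faulty)

# Posterior P(Y = faulty | X = fail)
posterior = p_fail_given_faulty * p_faulty / p_fail
print(round(posterior, 3))  # 0.05 / 0.095 ≈ 0.526
```

Observing one failure moves the belief that the machine is faulty from 0.1 up to about 0.53: the posterior combines the prior with how much more likely the evidence is under each hypothesis.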
Probability Refresher
• Product rule: P(A, B) = P(A | B) P(B) = P(B | A) P(A)
• Sum rule: P(A ∪ B) = P(A) + P(B) − P(A, B)
• Events A, B are independent if:
– P(A, B) = P(A) P(B)
– Equivalently, P(A | B) = P(A), P(B | A) = P(B)
• Theorem of Total probability: For mutually exclusive events A1, A2, …, An (i.e., Ai ∩ Aj = ∅) with P(A1) + P(A2) + … + P(An) = 1:
P(B) = Σi P(B | Ai) P(Ai)
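These identities can be checked mechanically on a small joint distribution (the table below is made up for illustration):

```python
# A tiny joint distribution over binary events A and B, stored as a dict
# mapping (a, b) -> P(A = a, B = b). The entries sum to 1.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def marginal_A(a):
    return sum(p for (ai, b), p in joint.items() if ai == a)

def marginal_B(b):
    return sum(p for (a, bi), p in joint.items() if bi == b)

def cond_A_given_B(a, b):
    # P(A = a | B = b) = P(A = a, B = b) / P(B = b)
    return joint[(a, b)] / marginal_B(b)

# Product rule: P(A, B) = P(A | B) P(B)
assert abs(joint[(1, 1)] - cond_A_given_B(1, 1) * marginal_B(1)) < 1e-12

# Total probability: P(B = 1) = sum_a P(B = 1 | A = a) P(A = a)
total = sum((joint[(a, 1)] / marginal_A(a)) * marginal_A(a) for a in (0, 1))
assert abs(total - marginal_B(1)) < 1e-12
```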
Bayesian Learning
Given a dataset D, we want to find the best hypothesis h
What does best mean?
Bayesian learning uses P(h | D), the conditional probability of a hypothesis given the data, to define best.
Bayesian Learning
Given a dataset D, we want to find the best hypothesis h. What does best mean?
Posterior probability: What is the probability that h is the hypothesis, given that the data D is observed?
Key insight: Both h and D are events.
• D: The event that we observed this particular dataset
• h: The event that the hypothesis h is the true hypothesis
So we can apply the Bayes rule here:
P(h | D) = P(D | h) P(h) / P(D)
Bayesian Learning
Given a dataset D, we want to find the best hypothesis h. What does best mean?

P(h | D) = P(D | h) P(h) / P(D)

Posterior probability P(h | D): What is the probability that h is the hypothesis, given that the data D is observed?
Likelihood P(D | h): What is the probability that this data point (an example or an entire dataset) is observed, given that the hypothesis is h?
Prior probability P(h): Background knowledge. What do we expect the hypothesis to be even before we see any data? For example, in the absence of any information, maybe the uniform distribution.
Evidence P(D): What is the probability that the data D is observed (independent of any knowledge about the hypothesis)?
Today's lecture
• Bayesian Learning
• Maximum a posteriori and maximum likelihood estimation
• Two examples of maximum likelihood estimation
– Binomial distribution
– Normal distribution
Choosing a hypothesis
Given some data, find the most probable hypothesis
– The Maximum a Posteriori hypothesis h_MAP:
h_MAP = argmax_h P(h | D) = argmax_h P(D | h) P(h) / P(D) = argmax_h P(D | h) P(h)
If we assume that the prior is uniform, i.e. P(hi) = P(hj) for all hi, hj
– Simplify this to get the Maximum Likelihood hypothesis:
h_ML = argmax_h P(D | h)
Often computationally easier to maximize the log likelihood:
h_ML = argmax_h log P(D | h)
Brute force MAP learner
Input: Data D and a hypothesis set H
1. Calculate the posterior probability P(h | D) for each h ∈ H
2. Output the hypothesis with the highest posterior probability
Difficult to compute, except for the simplest hypothesis spaces.
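For a small finite hypothesis space the brute-force learner is easy to write down. A minimal sketch (the hypothesis space, prior, and data here are illustrative choices, not from the lecture):

```python
# Brute-force MAP learner over a finite hypothesis set.
# Since P(D) is the same for every h, argmax over P(D | h) P(h) suffices.
from math import prod

def map_learner(hypotheses, prior, likelihood, data):
    """Return the h in `hypotheses` maximizing P(D | h) P(h)."""
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))

# Hypothetical finite hypothesis space: candidate failure probabilities
H = [0.1, 0.2, 0.3, 0.5]
uniform_prior = lambda h: 1.0 / len(H)

# i.i.d. Bernoulli likelihood: x = 1 is a failure (prob h), x = 0 a success
bernoulli = lambda data, h: prod(h if x == 1 else 1 - h for x in data)

data = [1] * 20 + [0] * 80   # 20 failures, 80 successes
print(map_learner(H, uniform_prior, bernoulli, data))  # 0.2
```

With a uniform prior this reduces to maximum likelihood, so the learner picks the candidate closest to the empirical failure rate 20/100.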
Today's lecture
• Bayesian Learning
• Maximum a posteriori and maximum likelihood estimation
• Two examples of maximum likelihood estimation
– Bernoulli trials
– Normal distribution
Maximum Likelihood estimation (MLE)
What we need in order to define learning:
1. A hypothesis space H
2. A model that says how data D is generated given h
Example 1: Bernoulli trials
The CEO of a startup hires you for your first consulting job
• CEO: My company makes light bulbs. I need to know what is the probability they are faulty.
• You: Sure. I can help you out. Are they all identical?
• CEO: Yes!
• You: Excellent. I know how to help. We need to experiment…
Faulty lightbulbs
The experiment: Try out 100 lightbulbs; 80 work, 20 don't
You: The probability is P(failure) = 0.2
CEO: But how do you know?
You: Because…
Bernoulli trials
• P(failure) = p, P(success) = 1 − p
• Each trial is i.i.d.
– Independent and identically distributed
• You have seen D = {80 work, 20 don't}
• The most likely value of p for this observation is?

P(D | p) = C(100, 80) p^20 (1 − p)^80

argmax_p P(D | p) = argmax_p C(100, 80) p^20 (1 − p)^80
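This likelihood can be evaluated directly for a few candidate values of p; among them it peaks at the empirical failure rate:

```python
# Binomial likelihood P(D | p) = C(100, 80) p^20 (1 - p)^80
# for D = 20 failures out of 100 trials, p = P(failure).
from math import comb

def likelihood(p, failures=20, successes=80):
    n = failures + successes
    return comb(n, successes) * p**failures * (1 - p)**successes

for p in (0.1, 0.2, 0.3):
    print(p, likelihood(p))

best = max((0.1, 0.2, 0.3), key=likelihood)  # 0.2
```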
The "learning" algorithm
Say you have a Work and b Not-Work

p_best = argmax_p P(D | h)
       = argmax_p log P(D | h)     (log likelihood)
       = argmax_p log [ C(a + b, b) p^b (1 − p)^a ]
       = argmax_p [ b log p + a log(1 − p) ]

Calculus 101: Set the derivative to zero:
b/p − a/(1 − p) = 0  ⇒  p_best = b/(a + b)

The model we assumed is Bernoulli. You could assume a different model! Next we will consider other models and see how to learn their parameters.
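The closed form can be sanity-checked against a direct numeric maximization of the log likelihood:

```python
# MLE for a Bernoulli model: closed form p = b / (a + b) versus a grid
# search over the log likelihood b*log(p) + a*log(1 - p).
from math import log

def log_likelihood(p, a, b):
    # a successes ("work"), b failures ("not work"); p = P(failure)
    return b * log(p) + a * log(1 - p)

a, b = 80, 20
closed_form = b / (a + b)                      # 0.2

# Grid search over p in (0, 1); the grid contains 0.2 exactly
grid = [i / 1000 for i in range(1, 1000)]
numeric = max(grid, key=lambda p: log_likelihood(p, a, b))
print(closed_form, numeric)  # 0.2 0.2
```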
Today's lecture
• Bayesian Learning
• Maximum a posteriori and maximum likelihood estimation
• Two examples of maximum likelihood estimation
– Bernoulli trials
– Normal distribution
Example 2: Maximum Likelihood and least squares
Suppose H consists of real-valued functions
Inputs are vectors x ∈ ℝ^d and the output is a real number y ∈ ℝ
Suppose the training data is generated as follows:
• An input x_i is drawn randomly (say, uniformly at random)
• The true function f is applied to get f(x_i)
• This value is then perturbed by noise e_i
– Drawn independently according to an unknown Gaussian with zero mean
Say we have m training examples (x_i, y_i) generated by this process
Example 2: Maximum Likelihood and least squares
Suppose we have a hypothesis h. We want to know: what is the probability that a particular label y_i was generated by this hypothesis as h(x_i)?
The error for this example is y_i − h(x_i)
We believe that this error is from a Gaussian distribution with mean = 0 and standard deviation = σ
We can compute the probability of observing one data point (x_i, y_i), if it were generated using the function h:

P(y_i | x_i, h) = (1 / (σ√(2π))) exp(−(y_i − h(x_i))² / (2σ²))
Example 2: Maximum Likelihood and least squares
Probability of observing one data point (x_i, y_i), if it were generated using the function h:

P(y_i | x_i, h) = (1 / (σ√(2π))) exp(−(y_i − h(x_i))² / (2σ²))

Each example in our dataset D = {(x_i, y_i)} is generated independently by this process:

P(D | h) = ∏_{i=1}^m (1 / (σ√(2π))) exp(−(y_i − h(x_i))² / (2σ²))
Example 2: Maximum Likelihood and least squares
Our goal is to find the most likely hypothesis:

h_ML = argmax_h P(D | h) = argmax_h ∏_{i=1}^m (1 / (σ√(2π))) exp(−(y_i − h(x_i))² / (2σ²))

How do we maximize this expression? Any ideas?
Answer: Take the logarithm to simplify.
Example 2: Maximum Likelihood and least squares
Our goal is to find the most likely hypothesis:

h_ML = argmax_h Σ_{i=1}^m [ −log(σ√(2π)) − (y_i − h(x_i))² / (2σ²) ]
     = argmax_h Σ_{i=1}^m [ −(y_i − h(x_i))² ]
     = argmin_h Σ_{i=1}^m (y_i − h(x_i))²

Why can we drop σ and the constant term? Because we assumed that the standard deviation is a constant.
Example 2: Maximum Likelihood and least squares
The most likely hypothesis is:

h_ML = argmin_h Σ_{i=1}^m (y_i − h(x_i))²

If we consider the set of linear functions as our hypothesis space, h(x_i) = w^T x_i:

w_ML = argmin_w Σ_{i=1}^m (y_i − w^T x_i)²

This is the probabilistic explanation for least squares regression.
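The equivalence can be demonstrated end to end on synthetic data (generated inside the script, not from the lecture); one common way to solve the squared-loss minimization for linear h is the normal equations:

```python
# Maximum likelihood under zero-mean Gaussian noise = least squares.
# Synthetic data: y_i = w^T x_i + e_i with e_i ~ N(0, 0.1^2).
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(100, 2))                      # m = 100 inputs in R^2
y = X @ w_true + rng.normal(scale=0.1, size=100)   # noisy labels

# Minimize sum_i (y_i - w^T x_i)^2 via the normal equations:
# (X^T X) w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to [2.0, -1.0]
```

Because the noise is small and m = 100, the recovered weights land close to the true generating weights.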
Linear regression: Two perspectives
Bayesian perspective: We believe that the errors are Normally distributed with zero mean and a fixed variance. Find the linear regressor using the maximum likelihood principle.
Loss minimization perspective: We want to minimize the squared difference between our predictions and the true labels. Minimize the total squared loss.
Today's lecture: Summary
• Bayesian Learning
– Another way to ask: What is the best hypothesis for a dataset?
– Two answers to the question: Maximum a posteriori (MAP) and maximum likelihood estimation (MLE)
• We saw two examples of maximum likelihood estimation
– Binomial distribution, normal distribution
– You should be able to apply both MAP and MLE to simple problems