Parameter Estimation - UC Santa Barbara (lecture transcript)
Parameter Estimation
Notational Conventions
- Probabilities: mass (discrete) functions use capital letters; density (continuous) functions use small letters.
- Vectors vs. scalars: scalars are plain; vectors are bold (small letters in 2-D, capital letters in higher dimensions).
- These notes stay in a continuous state of fluctuation until a topic is finished (many updates).
Parameter Estimation
The optimal classifier maximizes the posterior

$$ p(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})} $$

which requires the a priori probability $P(\omega_i)$ and the class-conditional density $p(\mathbf{x} \mid \omega_i)$.
Assumptions: no correlation between samples; time-independent statistics.
Popular Approaches
- Parametric: assume a certain parametric form for $p(\mathbf{x} \mid \omega_i)$ and estimate the parameters.
- Nonparametric: do not assume a parametric form for $p(\mathbf{x} \mid \omega_i)$; estimate the density profile directly.
- Boundary: estimate the separating hyperplane (hypersurface) between $p(\mathbf{x} \mid \omega_i)$ and $p(\mathbf{x} \mid \omega_j)$.
A Priori Probability
Given the numbers of occurrences for each class, $(n_1, \omega_1), (n_2, \omega_2), \ldots, (n_k, \omega_k)$, out of $M = \sum_{i=1}^{k} n_i$ total samples:

$$ P(\omega_i) = \frac{n_i}{M}, \qquad i = 1, \ldots, k $$

This works if the number of samples is large enough and the selection process is not biased. Caveat: sampling may be biased.
Class-Conditional Density
More complicated (not a single number, but a distribution): assume a certain form and estimate its parameters.
What form should we assume? Many are possible, but in this course we use almost exclusively the Gaussian.
Gaussian (or Normal) Distribution
Scalar case:

$$ p(x \mid \omega_i) = N(\mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{1}{2}\left(\frac{x-\mu_i}{\sigma_i}\right)^2} $$

Vector case:

$$ p(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\, |\boldsymbol{\Sigma}_i|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i)} $$

Unknowns: the class mean and (co)variance, $\mu_i, \sigma_i^2$ in the scalar case or $\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i$ in the vector case.
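As a minimal sketch (my own illustration, not from the slides), both densities can be evaluated in a few lines of Python; the sample values are made up for demonstration.

```python
import numpy as np

def gaussian_scalar(x, mu, sigma):
    """Scalar Gaussian density N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def gaussian_vector(x, mu, Sigma):
    """d-dimensional Gaussian density N(mu, Sigma) evaluated at x."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / (((2 * np.pi) ** (d / 2)) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(gaussian_scalar(1.0, mu=0.0, sigma=1.0))          # ~0.2420
print(gaussian_vector(np.array([1.0, 0.0]),
                      mu=np.array([0.0, 0.0]),
                      Sigma=np.eye(2)))                  # ~0.0965
```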
[Figure: the scalar Gaussian $\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$ plotted against the feature value, with the population on the vertical axis and a width of $2\sigma$ around the mean $\mu$.]
Why Gaussian (Normal)?
- The central limit theorem predicts a normal distribution from IID experiments, which often holds approximately in reality.
- There are only two numbers to estimate in the scalar case (mean and variance), or $d + d(d+1)/2$ in $d$ dimensions.
- Nice mathematical properties: the Fourier transform of a Gaussian is a Gaussian; products and sums of Gaussians remain Gaussian; any linear transform of a Gaussian is a Gaussian. A small simulation sketch of the central limit theorem follows.
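This sketch (my own, not from the slides) sums IID uniform draws; by the central limit theorem the sum approaches a Gaussian with mean $k/2$ and variance $k/12$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sum k IID uniform(0,1) variables; the CLT says the sum approaches a
# Gaussian with mean k/2 and variance k/12 as k grows.
k, trials = 30, 100_000
sums = rng.uniform(0.0, 1.0, size=(trials, k)).sum(axis=1)

print("empirical mean:", sums.mean(), " theory:", k / 2)
print("empirical var: ", sums.var(),  " theory:", k / 12)
# Empirical skewness near 0 is another sign of near-normality.
print("skewness:", ((sums - sums.mean()) ** 3).mean() / sums.std() ** 3)
```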
Projection and Transformation
In particular, a whitening transform can diagonalize the covariance matrix.
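A small numeric sketch (my own illustration, under the usual eigendecomposition construction of whitening): an eigendecomposition of the covariance diagonalizes it, and scaling by the inverse square roots of the eigenvalues turns it into (approximately) the identity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated 2-D Gaussian data with a non-diagonal covariance.
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=10_000)

# Eigendecomposition: Sigma = Phi Lambda Phi^T.
lam, Phi = np.linalg.eigh(np.cov(X, rowvar=False))

# Whitening transform A_w = Phi Lambda^{-1/2}; y = A_w^T x.
A_w = Phi / np.sqrt(lam)          # divides each eigenvector column by sqrt(lambda)
Y = X @ A_w

print(np.round(np.cov(Y, rowvar=False), 2))   # ~ identity matrix
```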
Parameter Estimation
- Maximum likelihood estimator: parameters have fixed but unknown values.
- Bayesian estimator: parameters are random variables with known a priori distributions. The Bayesian estimator allows us to change the a priori distribution by incorporating measurements, thereby sharpening the profile.
Graphically
[Figure: the likelihood plotted against the parameters; MLE keeps the single peak value, while the Bayesian estimator keeps the whole distribution over the parameters.]
Maximum Likelihood Estimator
Given:
- $n$ labeled samples (observations) $X = \{x_1, x_2, \ldots, x_n\}$
- an assumed distribution with $e$ parameters $\boldsymbol{\theta} = \{\theta_1, \theta_2, \ldots, \theta_e\}$
- samples drawn independently from $p(X \mid \omega_j)$, where $p(X \mid \omega_j) = p(X \mid \omega_j, \boldsymbol{\theta})$

Find: the parameters $\boldsymbol{\theta}$ that best explain the observations.
MLE Formulation
Maximize the likelihood

$$ p(X \mid \boldsymbol{\theta}) = \prod_{j=1}^{n} p(x_j \mid \boldsymbol{\theta}) $$

or, equivalently, the log likelihood

$$ l(\boldsymbol{\theta}) = \log p(X \mid \boldsymbol{\theta}) = \sum_{j=1}^{n} \log p(x_j \mid \boldsymbol{\theta}) $$

by setting the gradient to zero:

$$ \nabla_{\boldsymbol{\theta}}\, p(X \mid \boldsymbol{\theta}) = 0 \qquad \text{or} \qquad \nabla_{\boldsymbol{\theta}}\, l(\boldsymbol{\theta}) = \sum_{j=1}^{n} \nabla_{\boldsymbol{\theta}} \log p(x_j \mid \boldsymbol{\theta}) = 0 $$
An Example
A scalar Gaussian with $\boldsymbol{\theta} = (\theta_1, \theta_2) = (\mu, \sigma^2)$:

$$ p(x_j \mid \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\theta_2}}\, e^{-\frac{(x_j-\theta_1)^2}{2\theta_2}} $$

$$ \log p(x_j \mid \boldsymbol{\theta}) = -\frac{1}{2}\log(2\pi\theta_2) - \frac{(x_j-\theta_1)^2}{2\theta_2} $$

$$ \nabla_{\boldsymbol{\theta}} \log p(x_j \mid \boldsymbol{\theta}) = \begin{pmatrix} \dfrac{x_j-\theta_1}{\theta_2} \\[2mm] -\dfrac{1}{2\theta_2} + \dfrac{(x_j-\theta_1)^2}{2\theta_2^2} \end{pmatrix} $$
An Example (cont.)
Setting both components of the gradient of the log likelihood to zero,

$$ \sum_{j=1}^{n} \frac{x_j - \hat\theta_1}{\hat\theta_2} = 0, \qquad -\sum_{j=1}^{n} \frac{1}{2\hat\theta_2} + \sum_{j=1}^{n} \frac{(x_j - \hat\theta_1)^2}{2\hat\theta_2^2} = 0 $$

gives

$$ \hat\mu = \frac{1}{n} \sum_{j=1}^{n} x_j, \qquad \hat\sigma^2 = \frac{1}{n} \sum_{j=1}^{n} (x_j - \hat\mu)^2 $$

The class mean is estimated as the sample mean and the class variance as the sample variance. The discriminant is then

$$ g_i(x) = p(x \mid \omega_i)\, P(\omega_i) = N(\hat\mu_i, \hat\sigma_i^2)\, P(\omega_i) = \frac{1}{\sqrt{2\pi}\,\hat\sigma_i}\, e^{-\frac{(x-\hat\mu_i)^2}{2\hat\sigma_i^2}}\, P(\omega_i) $$

Note that $\hat\sigma^2$ divides by $n$ rather than $n-1$: the MLE of the variance is biased!
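A quick numeric check of this bias (my own sketch, with illustrative values): the MLE variance, which divides by $n$, underestimates the true variance on average, while the $n-1$ version does not.

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true, var_true, n, trials = 0.0, 4.0, 5, 20_000

mle_vars, unbiased_vars = [], []
for _ in range(trials):
    x = rng.normal(mu_true, np.sqrt(var_true), size=n)
    mu_hat = x.mean()                                   # MLE of the mean
    mle_vars.append(((x - mu_hat) ** 2).mean())         # divides by n
    unbiased_vars.append(((x - mu_hat) ** 2).sum() / (n - 1))

print("E[MLE variance]      ~", np.mean(mle_vars))       # ~ 3.2 = (n-1)/n * 4
print("E[unbiased variance] ~", np.mean(unbiased_vars))  # ~ 4.0
```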
[Figure: the likelihood $\prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_j-\mu)^2}{2\sigma^2}}$ evaluated for two candidate means $\mu_1$ and $\mu_2$; horizontal axis: coin weight (the feature), vertical axis: population.]
[Figure: the likelihood $\prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\hat\sigma}\, e^{-\frac{(x_j-\hat\mu)^2}{2\hat\sigma^2}}$ around $x_0$ for two candidate widths $\hat\sigma_1 < \hat\sigma_2$.]
If $\sigma$ is too narrow, many sample points fall outside the $2\sigma$ width, each with a low likelihood of occurrence.
If $\sigma$ is too wide, $1/\sigma$ becomes too small and reduces the likelihood of occurrence.
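A sketch of this trade-off (my own, with made-up samples): the log likelihood as a function of $\sigma$ peaks at the sample standard deviation, balancing the two effects above.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 2.0, size=50)          # samples with true sigma = 2
mu_hat = x.mean()

def log_likelihood(sigma):
    # log prod_j N(x_j; mu_hat, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu_hat) ** 2 / (2 * sigma**2))

sigmas = np.linspace(0.5, 5.0, 200)
best = sigmas[np.argmax([log_likelihood(s) for s in sigmas])]
print("peak near:", best, " sample std:", x.std())   # the two agree
```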
A Quick Word on MAP
The MAP (maximum a posteriori) estimator is similar to MLE with one additional twist: maximize the (log) likelihood $l(\boldsymbol{\theta})$ plus $\log p(\boldsymbol{\theta})$, the prior probability of the parameter values, if you know it. For example, the mean may be a priori most likely to be $\mu_o$, with a normal distribution around it.
MLE amounts to a uniform prior; MAP does not necessarily.
The added term is a case of "regularization".
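A hedged sketch of this idea (my own, with illustrative numbers) for the Gaussian mean with a normal prior $N(\mu_o, \sigma_o^2)$; for this conjugate pair the MAP estimate has a closed form, and the prior term acts as the regularizer mentioned above.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 1.0                                  # known data std
mu_o, sigma_o = 5.0, 0.5                     # prior: mean likely near mu_o
x = rng.normal(6.0, sigma, size=10)

# MLE maximizes sum_j log p(x_j | mu); MAP adds log p(mu).
# For Gaussian likelihood + Gaussian prior the maximizer is a weighted average:
n = len(x)
mu_mle = x.mean()
mu_map = (x.sum() / sigma**2 + mu_o / sigma_o**2) / (n / sigma**2 + 1 / sigma_o**2)

print("MLE:", mu_mle)   # follows the data only
print("MAP:", mu_map)   # pulled toward the prior mean mu_o
```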
Bayesian Estimator
Note that MLE is a batch estimator:
- all data have to be kept
- difficult to update the estimate
- difficult to incorporate other evidence
- insists on a single estimate

The Bayesian estimator:
- allows the freedom that the parameters themselves can be random variables
- allows multiple pieces of evidence
- allows iterative updates
Bayesian Estimator
Based on Bayes rule:

$$ P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)} = \frac{p(x \mid \omega_i)\, P(\omega_i)}{\sum_j p(x \mid \omega_j)\, P(\omega_j)} $$

With the training samples $X$ at our disposal:

$$ P(\omega_i \mid \mathbf{x}, X) = \frac{p(\mathbf{x} \mid \omega_i, X)\, P(\omega_i \mid X)}{\sum_j p(\mathbf{x} \mid \omega_j, X)\, P(\omega_j \mid X)} $$
Bayes Rule Formulation
Assume the samples in $X$ come from only one class at a time (each class is estimated from its own samples), and that $P(\omega_i)$ is independent of $X$. Then

$$ P(\omega_i \mid \mathbf{x}, X) = \frac{p(\mathbf{x} \mid \omega_i, X)\, P(\omega_i)}{\sum_j p(\mathbf{x} \mid \omega_j, X)\, P(\omega_j)} $$
How can X be used?
The form of the distribution is known (e.g., normal); the parameters are unknown. Use $X$ to estimate the class parameters via $p(\boldsymbol{\theta} \mid X)$, let the parameters constrain $\mathbf{x}$ via $p(\mathbf{x} \mid \boldsymbol{\theta})$, and put it all together:

$$ p(\mathbf{x} \mid X) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid X)\, d\boldsymbol{\theta} $$
Bayes Rule Formulation (cont.)
Ideally,

$$ p(\boldsymbol{\theta} \mid X) = \begin{cases} 1 & \text{for some } \boldsymbol{\theta} = \hat{\boldsymbol{\theta}} \\ 0 & \text{otherwise} \end{cases} $$

in which case

$$ p(\mathbf{x} \mid X) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid X)\, d\boldsymbol{\theta} = p(\mathbf{x} \mid \hat{\boldsymbol{\theta}}) $$

which is exactly MLE! Otherwise, all possible $\boldsymbol{\theta}$'s are used.
Graphic Interpretation
[Figure: the samples $X = \{x_1, x_2, \ldots, x_n\}$ explained by candidate parameter vectors $\boldsymbol{\theta} = \{\theta_1, \ldots, \theta_e\}$, $\boldsymbol{\theta}' = \{\theta_1', \ldots, \theta_e'\}$, $\boldsymbol{\theta}'' = \{\theta_1'', \ldots, \theta_e''\}$, each weighted by $p(\mathbf{x} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta})$.]
An Example
Estimating the mean of a normal distribution with known variance, using $n$ samples.
First step:

$$ p(\mu \mid X) = \frac{p(X \mid \mu)\, p(\mu)}{p(X)} \propto \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu) $$

$$ p(x_k \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_k-\mu)^2}{2\sigma^2}}, \qquad p(\mu) = N(\mu_o, \sigma_o^2) = \frac{1}{\sqrt{2\pi}\,\sigma_o}\, e^{-\frac{(\mu-\mu_o)^2}{2\sigma_o^2}} $$

$$ p(x \mid X) = \int p(x \mid \mu)\, p(\mu \mid X)\, d\mu $$

Here $p(X \mid \mu)$ is the current evidence and the prior $p(\mu)$ carries previous and other evidence. Key to Bayesian estimation: both current and prior evidence can be used.
Then

$$ p(\mu \mid X) = \alpha \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_k-\mu)^2}{2\sigma^2}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_o}\, e^{-\frac{(\mu-\mu_o)^2}{2\sigma_o^2}} = \alpha'\, e^{-\frac{1}{2}\left[\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_o^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_o}{\sigma_o^2}\right)\mu\right]} = \frac{1}{\sqrt{2\pi}\,\sigma_n}\, e^{-\frac{(\mu-\mu_n)^2}{2\sigma_n^2}} $$

where, with the sample mean $m_n = \frac{1}{n}\sum_{k=1}^{n} x_k$,

$$ \mu_n = \frac{n\sigma_o^2}{n\sigma_o^2 + \sigma^2}\, m_n + \frac{\sigma^2}{n\sigma_o^2 + \sigma^2}\, \mu_o \;\longrightarrow\; m_n \text{ as } n \to \infty, \qquad \sigma_n^2 = \frac{\sigma_o^2\, \sigma^2}{n\sigma_o^2 + \sigma^2} $$
X helps in:
- defining the mean ($\mu_n$ moves toward the sample mean $m_n$)
- reducing the uncertainty in the mean ($\sigma_n^2$ shrinks)

Trust the new data when:
- the class variance is small ($\sigma^2 \ll \sigma_o^2$)
- the number of samples $n$ is large
- the prior is uncertain ($\sigma_o^2$ large)

A numeric sketch follows.
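This sketch (my own, with illustrative values) evaluates the update equations above: the posterior mean blends the sample mean with the prior mean, and the posterior variance shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = 1.0                       # known class std
mu_o, sigma_o = 0.0, 2.0          # prior N(mu_o, sigma_o^2)

for n in (1, 10, 100):
    x = rng.normal(3.0, sigma, size=n)
    m_n = x.mean()                                         # sample mean
    w = n * sigma_o**2 / (n * sigma_o**2 + sigma**2)       # weight on the data
    mu_n = w * m_n + (1 - w) * mu_o
    var_n = (sigma_o**2 * sigma**2) / (n * sigma_o**2 + sigma**2)
    print(f"n={n:3d}  mu_n={mu_n:6.3f}  sigma_n^2={var_n:.4f}")
# mu_n approaches the sample mean and sigma_n^2 -> 0 as n grows.
```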
An Example (cont.)
Second step:

$$ p(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} $$

Third step:

$$ g(x) = p(x \mid X) = \int p(x \mid \mu)\, p(\mu \mid X)\, d\mu = \int \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_n}\, e^{-\frac{(\mu-\mu_n)^2}{2\sigma_n^2}}\, d\mu \;\sim\; N(\mu_n,\, \sigma^2 + \sigma_n^2) $$

where the leftover factor $f(\sigma, \sigma_n) = \int \exp\left\{-\frac{1}{2}\frac{\sigma^2+\sigma_n^2}{\sigma^2 \sigma_n^2}\left(\mu - \frac{\sigma_n^2 x + \sigma^2 \mu_n}{\sigma^2+\sigma_n^2}\right)^2\right\} d\mu$ does not depend on $x$, so the predictive density is normal with mean $\mu_n$ and variance $\sigma^2 + \sigma_n^2$.
Graphical Interpretation: MLE
[Figure: a flat prior $p(\mu)$ together with the likelihood $p(X \mid \mu) = \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_k-\mu)^2}{2\sigma^2}}$; the MLE is the peak of the likelihood.]
Graphical Interpretation: Bayesian
[Figure: a non-flat prior $p(\mu)$ multiplying the same likelihood.]

$$ p(\mu \mid X) \propto p(X \mid \mu)\, p(\mu) = \left[\prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_k-\mu)^2}{2\sigma^2}}\right] p(\mu) $$
Results of the Iterative Process
- Start with a prior distribution.
- Incorporate the current batch of data.
- Generate a new prior.
- Goodness of the new prior = goodness of the old prior × goodness of the interpretation.

Usually the prior distribution sharpens (Bayesian learning) and the uncertainty drops.
MLE vs. Bayes

| MLE | Bayes |
| --- | --- |
| Faster (differentiation) | Slower (integration) |
| Single model | Multiple weighted models |
| Known model $p(x \mid \boldsymbol{\theta})$ required | Unknown model is fine |
| Less information | More information (nonuniform prior) |
Does it really make a difference?
Yes: the Bayesian classifier and MAP will in general give different results when used to classify new samples, because MAP (MLE) keeps only one hypothesis while the Bayesian classifier keeps multiple, weighted hypotheses.
Example
MLE/MAP keeps a single hypothesis:

$$ p(\mathbf{x}' \mid X) \approx p(\mathbf{x}' \mid \boldsymbol{\theta}'), \qquad \text{where } \boldsymbol{\theta}' = \arg\max_{\boldsymbol{\theta}} P(\boldsymbol{\theta} \mid X) $$

The Bayesian classifier weights all of them:

$$ p(\mathbf{x} \mid X) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid X)\, d\boldsymbol{\theta} $$

Suppose:

$$ P(\theta_1 \mid X) = .4, \quad P(+ \mid \theta_1) = 1, \quad P(- \mid \theta_1) = 0 $$
$$ P(\theta_2 \mid X) = .3, \quad P(+ \mid \theta_2) = 0, \quad P(- \mid \theta_2) = 1 $$
$$ P(\theta_3 \mid X) = .3, \quad P(+ \mid \theta_3) = 0, \quad P(- \mid \theta_3) = 1 $$

Then

$$ p(+ \mid X) = .4 \times 1 + .3 \times 0 + .3 \times 0 = .4 $$
$$ p(- \mid X) = .4 \times 0 + .3 \times 1 + .3 \times 1 = .6 $$

With MLE/MAP only one hypothesis ($\theta_1$) is kept, so the prediction is $+$; the Bayesian classifier predicts $-$.
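The same arithmetic in a few lines of Python (just a restatement of the slide's numbers):

```python
import numpy as np

posterior = np.array([0.4, 0.3, 0.3])       # P(theta_i | X)
p_plus    = np.array([1.0, 0.0, 0.0])       # P(+ | theta_i)

# MAP: keep only the single most probable hypothesis (theta_1).
map_pred = "+" if p_plus[posterior.argmax()] > 0.5 else "-"

# Bayesian: weight every hypothesis by its posterior.
p_plus_bayes = posterior @ p_plus           # 0.4
bayes_pred = "+" if p_plus_bayes > 0.5 else "-"

print("MAP predicts:  ", map_pred)                                  # +
print("Bayes predicts:", bayes_pred, " (P(+|X) =", p_plus_bayes, ")")  # -
```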
Gibbs Sampler
The Bayesian classifier is optimal, but can be very expensive, especially when a large number of hypotheses are kept and evaluated.
Gibbs: randomly pick one hypothesis according to the current posterior distribution $p(\boldsymbol{\theta} \mid X)$ and classify with it.
It can be shown (later) to be related to the kNN classifier, and its expected error is at most twice that of the Bayesian classifier.
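A minimal sketch of the Gibbs strategy (my own, reusing the toy numbers from the previous example): draw one hypothesis from the posterior and classify with it alone.

```python
import numpy as np

rng = np.random.default_rng(6)
posterior = np.array([0.4, 0.3, 0.3])   # P(theta_i | X)
p_plus    = np.array([1.0, 0.0, 0.0])   # P(+ | theta_i)

# Gibbs: sample one hypothesis according to the posterior, classify with it.
def gibbs_classify():
    i = rng.choice(len(posterior), p=posterior)
    return "+" if p_plus[i] > 0.5 else "-"

preds = [gibbs_classify() for _ in range(10_000)]
print("fraction '+':", preds.count("+") / len(preds))   # ~ 0.4
```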
An Example: Naïve Bayes
Features are a conjunction of attributes. Bayes theorem states that the posterior probability should be maximized:

$$ c_{NB} = \arg\max_{c_j} P(c_j \mid a_1, a_2, \ldots, a_n) = \arg\max_{c_j} \frac{P(a_1, a_2, \ldots, a_n \mid c_j)\, P(c_j)}{P(a_1, a_2, \ldots, a_n)} $$

The naïve Bayes classifier assumes independence of the attributes:

$$ c_{NB} = \arg\max_{c_j} P(c_j) \prod_i P(a_i \mid c_j) $$
Example

| Day | Outlook | Temperature | Humidity | Wind | Play tennis |
| --- | --- | --- | --- | --- | --- |
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Strong | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Weak | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |
Example (cont.)
Classify <Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>: PlayTennis = yes or no?

$$ c_{NB} = \arg\max_{c_j \in \{yes,\, no\}} P(c_j)\, P(Outlook{=}sunny \mid c_j)\, P(Temperature{=}cool \mid c_j)\, P(Humidity{=}high \mid c_j)\, P(Wind{=}strong \mid c_j) $$

From the table:

$$ P(PlayTennis{=}yes) = 9/14 = .64, \qquad P(PlayTennis{=}no) = 5/14 = .36 $$
$$ P(Wind{=}strong \mid yes) = 3/9 = .33, \qquad P(Wind{=}strong \mid no) = 3/5 = .6 $$

and similarly for the other attributes, giving

$$ P(yes)\, P(sunny \mid yes)\, P(cool \mid yes)\, P(high \mid yes)\, P(strong \mid yes) = .0053 $$
$$ P(no)\, P(sunny \mid no)\, P(cool \mid no)\, P(high \mid no)\, P(strong \mid no) = .0206 $$

so the prediction is PlayTennis = no.
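A compact Python sketch reproducing this computation, with the counts taken directly from the 14 rows of the table above:

```python
# (outlook, temperature, humidity, wind, play) rows from the table
data = [
    ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

def score(label, query):
    """P(c_j) * prod_i P(a_i | c_j), estimated by raw counts."""
    rows = [r for r in data if r[-1] == label]
    p = len(rows) / len(data)                      # P(c_j)
    for i, value in enumerate(query):              # prod_i P(a_i | c_j)
        p *= sum(r[i] == value for r in rows) / len(rows)
    return p

query = ("Sunny", "Cool", "High", "Strong")
print("yes:", score("Yes", query))   # ~ 0.0053
print("no: ", score("No",  query))   # ~ 0.0206 -> classify as No
```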
Caveat
Guard against zero probabilities $P(a_i \mid c_j)$, especially for small sample sizes and large sets of attribute values. Use the m-estimate instead:

$$ P(a_i \mid c_j) = \frac{n_{a_i} + m\,p}{n_{c_j} + m} $$

where
- $n_{c_j}$: number of samples in class $c_j$
- $n_{a_i}$: number of samples in class $c_j$ with attribute value $a_i$
- $m$: equivalent sample size (add $m$ more samples)
- $p$: prior estimate; if attribute $a_i$ can take $k$ values, then $p = 1/k$

A one-function sketch follows.
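This sketch uses hypothetical counts purely for illustration:

```python
def m_estimate(n_a, n_c, k, m=1.0):
    """Smoothed P(a_i | c_j) = (n_a + m*p) / (n_c + m), with prior p = 1/k.

    n_a: # samples in class c_j having attribute value a_i
    n_c: # samples in class c_j
    k:   # possible values of the attribute (so the prior estimate p = 1/k)
    m:   equivalent sample size (acts like m extra, prior-distributed samples)
    """
    p = 1.0 / k
    return (n_a + m * p) / (n_c + m)

# A raw count of zero no longer yields a zero probability:
print(m_estimate(n_a=0, n_c=5, k=3, m=3.0))   # 0.125 instead of 0.0
```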
More Examples
Web page / newsgroup classification:
- like/dislike for web pages
- science/sports/entertainment categories for web pages/newsgroups
More Examples (cont.)
- Select commonly occurring words as features (appearing at least $k$ times in the documents).
- Eliminate stop words (the, it, etc.) and punctuation.
- Apply word stemming (like, liked, etc.).
- Treat $P(word_k \mid class_j)$ as independent of word position in the document.
- This achieves 89% accuracy for classifying documents into 20 newsgroups. A toy sketch of the pipeline follows.
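A toy sketch of this pipeline (my own, with a tiny made-up corpus and no stemming library) to show the position-independent word counting; real experiments like the 20-newsgroups result use far larger vocabularies and corpora.

```python
import math
import re

# Tiny made-up corpus: (text, label)
docs = [
    ("the team won the game", "sports"),
    ("the election results were close", "politics"),
    ("a great game and a great win", "sports"),
    ("voters decide the election", "politics"),
]
STOP = {"the", "a", "and", "were"}   # illustrative stop-word list

def tokens(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]

vocab = {w for text, _ in docs for w in tokens(text)}

def log_score(label, text):
    class_docs = [t for t, l in docs if l == label]
    words = [w for t in class_docs for w in tokens(t)]
    logp = math.log(len(class_docs) / len(docs))      # log P(class_j)
    for w in tokens(text):
        # Laplace smoothing (an m-estimate with m = |vocab|, p = 1/|vocab|);
        # word position is ignored, only counts matter.
        logp += math.log((words.count(w) + 1) / (len(words) + len(vocab)))
    return logp

query = "a close game"
print(max(["sports", "politics"], key=lambda c: log_score(c, query)))
```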