Maximum Likelihood-Maximum Entropy Duality: Session 1
Pushpak Bhattacharyya
Scribed by Aditya Joshi
Presented in the NLP-AI talk on 14th January, 2014
• A phenomenon/event could be a linguistic process such as POS tagging or sentiment prediction.
• A model uses data in order to “predict” future observations w.r.t. a phenomenon.
(Diagram: Phenomenon/Event → Data/Observation → Model)
Notations
X : x1, x2, x3.... xm (m observations)
A: Random variable with n possible outcomes a1, a2, a3, ..., an
e.g. One coin throw: a1 = 0, a2 = 1
One dice throw: a1 = 1, a2 = 2, a3 = 3, a4 = 4, a5 = 5, a6 = 6
Goal
• Goal: Estimate P(ai) = Pi
Two paths
• Maximum Likelihood (ML) and Maximum Entropy (ME)
Are they equivalent?
Calculating probability from data
Suppose in X: x1, x2, x3.... xm (m observations),
ai occurs f(ai) = fi times
e.g. Dice: If the outcomes are 1 1 2 3 1 5 3 4 2 1, then
f(1) = 4, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1, f(6) = 0,
and m = 10. Hence, P1 = 4/10, P2 = 2/10, P3 = 2/10, P4 = 1/10, P5 = 1/10, P6 = 0.
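The frequency-based estimate above is a few lines of Python. This is a minimal sketch; the function name `empirical_probs` is ours, not from the talk.

```python
from collections import Counter

def empirical_probs(observations, outcomes):
    """Estimate Pi = f(ai) / m from the observed data."""
    counts = Counter(observations)          # f(ai) for each outcome ai
    m = len(observations)                   # m observations
    return {a: counts[a] / m for a in outcomes}

# The dice example from the text: 10 observations of a six-outcome variable.
data = [1, 1, 2, 3, 1, 5, 3, 4, 2, 1]
probs = empirical_probs(data, outcomes=range(1, 7))
# probs[1] = 0.4, probs[2] = 0.2, probs[3] = 0.2, probs[6] = 0.0
```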
In general, the task is...
• Task: Get θ = <P1, P2, ... Pn>, the probability vector, from X
MLE
• MLE: θ* = argmax_θ Pr(X; θ)
With the i.i.d. (independent and identically distributed) assumption,
θ* = argmax_θ Π_{i=1}^{m} Pr(xi; θ)
where θ = <P1, P2, ... Pn>, and each factor Pr(xi; θ) is P(aj) for the outcome aj that xi takes.
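The i.i.d. product above is usually evaluated as a sum of logs. A minimal sketch, checking that the empirical frequencies from the dice example score higher than a uniform guess (the function name and the epsilon for the zero-probability outcome are our choices):

```python
import math

def log_likelihood(theta, observations):
    """log Pr(X; theta) = sum_i log Pr(xi; theta), under the i.i.d. assumption."""
    return sum(math.log(theta[x]) for x in observations)

data = [1, 1, 2, 3, 1, 5, 3, 4, 2, 1]
# Empirical frequencies (P6 = 0 replaced by a tiny epsilon so log stays defined):
empirical = {1: 0.4, 2: 0.2, 3: 0.2, 4: 0.1, 5: 0.1, 6: 1e-12}
uniform = {a: 1 / 6 for a in range(1, 7)}
# The empirical distribution assigns the observed data a higher likelihood.
```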
What is known about: θ : <P1, P2, ... Pn>
Σ_{i=1}^{n} Pi = 1
Pi ≥ 0 for all i
Introducing Entropy:
H(θ) = - Σ_{i=1}^{n} Pi ln Pi
This is the entropy of the distribution <P1, P2, ... Pn>.
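The entropy of a distribution can be computed directly from its definition. A minimal sketch (the convention 0 ln 0 = 0 is standard, handled here by skipping zero entries):

```python
import math

def entropy(theta):
    """H(theta) = -sum Pi ln Pi, with the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in theta if p > 0)

# A fair six-outcome dice has entropy ln 6; a certain outcome has entropy 0.
h_fair = entropy([1 / 6] * 6)
h_certain = entropy([1, 0, 0, 0, 0, 0])
```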
Some intuition
Example with dice. Outcomes = 1, 2, 3, 4, 5, 6; P(1) + P(2) + ... + P(6) = 1.
Entropy(Dice) = H(θ) = - Σ_{i=1}^{6} P(i) ln P(i)
This is where Laplace’s Principle of Insufficient Reason (the principle of indifference) comes in.
The best estimate for the dice
P(1) = P(2) = ... = P(6) = 1/6
We will now prove it assuming NO KNOWLEDGE about the dice except that it has six outcomes, each with probability ≥ 0, and Σ Pi = 1.
What does “best” mean?
• “Best” means most consistent with the situation.
• “Best” means that these Pi values should be such that they maximize the entropy.
Optimization formulation
• Max. - Σ_{i=1}^{6} P(i) log P(i)
Subject to:
Σ_{i=1}^{6} P(i) = 1
P(i) ≥ 0 for i = 1 to 6
Solving the optimization (1/2)
Using Lagrangian multipliers, the optimization can be written as:
Q = - Σ_{i=1}^{6} P(i) log P(i) – λ (Σ_{i=1}^{6} P(i) – 1) - Σ_{i=1}^{6} βi P(i)
For now, let us ignore the last term. We will come to it later.
Solving the optimization (2/2)
Differentiating Q w.r.t. P(i) (ignoring the βi term), we get
∂Q/∂P(i) = - log P(i) – 1 – λ
Equating to zero:
log P(i) + 1 + λ = 0
log P(i) = -(1 + λ)
P(i) = e^{-(1+λ)}
Since e^{-(1+λ)} does not depend on i, maximizing entropy forces every P(i) to be equal.
This shows that P(1) = P(2) = ... P(6)
But,P(1) + P(2) + .. + P(6) = 1
Therefore P(1) = P(2) = ... P(6) = 1/6
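The conclusion can be checked numerically: among randomly sampled distributions over six outcomes, none exceeds the entropy of the uniform distribution. A minimal sketch (the sampling scheme is our choice, not from the talk):

```python
import math
import random

def entropy(p):
    # H(p) = -sum pi ln pi, skipping zero entries (0 ln 0 = 0)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

random.seed(0)
h_uniform = entropy([1 / 6] * 6)  # ln 6, the proved maximum

for _ in range(1000):
    weights = [random.random() for _ in range(6)]
    total = sum(weights)
    p = [w / total for w in weights]  # a random valid probability vector
    assert entropy(p) <= h_uniform + 1e-9
```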
Introducing data in the notion of entropy
Now, we introduce data:
X: x1, x2, x3, ..., xm (m observations)
A: a1, a2, a3, ..., an (n outcomes)
e.g. For a coin, in the absence of data: P(H) = P(T) = 1/2 (as shown in the previous proof).
However, if data X is observed as follows:
Obs-1: H H T H T T H H H T (m = 10, n = 2), P(H) = 6/10, P(T) = 4/10
Obs-2: T T H T H T H T T T (m = 10, n = 2), P(H) = 3/10, P(T) = 7/10
Which of these is a valid estimate?
Change in entropy
Entropy reduces as data is observed!
(Diagram: entropy levels, with Emax for the uniform distribution at the top, E1 for P(H) = 6/10 below it, and E2 for P(H) = 3/10 below that.)
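The entropy drop is easy to verify for the two coin observations. A minimal sketch, computing Emax, E1, and E2 from the estimates above:

```python
import math

def entropy(p):
    # H(p) = -sum pi ln pi, skipping zero entries
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

e_max = entropy([0.5, 0.5])  # uniform coin: ln 2 ≈ 0.693
e1 = entropy([0.6, 0.4])     # Obs-1 estimate, P(H) = 6/10
e2 = entropy([0.3, 0.7])     # Obs-2 estimate, P(H) = 3/10
# Both observed estimates have strictly lower entropy than the uniform prior.
```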
Start of Duality
• Maximizing entropy in this situation is the same as minimizing the “entropy reduction” distance, i.e., minimizing the relative entropy.
(Diagram: Emax, the maximum entropy, attained by the uniform distribution, versus Edata, the entropy once observations are made; the gap between them is the entropy reduction.)
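For the uniform reference distribution, the entropy reduction Emax - Edata is exactly the relative entropy (KL divergence) D(p || uniform), since D(p || u) = ln n - H(p) and H(u) = ln n. A minimal sketch checking this identity on the Obs-1 coin estimate:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def relative_entropy(p, q):
    """D(p || q) = sum pi ln(pi / qi), the KL divergence."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.4]   # coin distribution estimated from Obs-1
u = [0.5, 0.5]   # uniform (maximum-entropy) distribution
# Entropy reduction Emax - Edata equals the relative entropy from uniform:
gap = entropy(u) - entropy(p)
```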
Concluding remarks
Thus, in this discussion of ML-ME duality, we will show that:
MLE minimizes the relative entropy distance from the uniform distribution.
Question: The entropy corresponds to probability vectors, and the distance between such vectors could instead be measured by squared (Euclidean) distance. WHY relative entropy?