Maximum Likelihood-Maximum Entropy Duality: Session 1
Pushpak Bhattacharyya
Scribed by Aditya Joshi
Presented in the NLP-AI talk on 14th January, 2014
• A phenomenon/event could be a linguistic process such as POS tagging or sentiment prediction.
• A model uses data in order to “predict” future observations w.r.t. a phenomenon.
(Diagram: Phenomenon/Event → Data/Observation → Model)
Notations
X : x1, x2, x3.... xm (m observations)
A: Random variable with n possible outcomes a1, a2, a3, ..., an
e.g. One coin throw: a1 = 0, a2 = 1
One dice throw: a1 = 1, a2 = 2, a3 = 3, a4 = 4, a5 = 5, a6 = 6
Goal
• Goal: Estimate P(ai) = Pi
Two paths
• Maximum Likelihood (ML) and Maximum Entropy (ME)
Are they equivalent?
Calculating probability from data
Suppose in X: x1, x2, x3.... xm (m observations),
ai occurs f(ai) = fi times
e.g. Dice: If the outcomes are 1 1 2 3 1 5 3 4 2 1, then
f(1) = 4, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1, f(6) = 0,
and m = 10. Hence, P1 = 4/10, P2 = 2/10, P3 = 2/10, P4 = 1/10, P5 = 1/10, P6 = 0.
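The frequency-based estimate above is a few lines of Python. This is a minimal sketch; the function name `empirical_probs` is ours, not from the talk.

```python
from collections import Counter

def empirical_probs(observations, outcomes):
    """Estimate Pi = f(ai) / m from the observed data."""
    counts = Counter(observations)          # f(ai) for each outcome ai
    m = len(observations)                   # m observations
    return {a: counts[a] / m for a in outcomes}

# The dice example from the text: 10 observations of a six-outcome variable.
data = [1, 1, 2, 3, 1, 5, 3, 4, 2, 1]
probs = empirical_probs(data, outcomes=range(1, 7))
# probs[1] = 0.4, probs[2] = 0.2, probs[3] = 0.2, probs[6] = 0.0
```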
In general, the task is...
• Task: Get θ = <P1, P2, ... Pn>, the probability vector, from X
MLE
• MLE: θ* = argmax_θ Pr(X; θ)
With the i.i.d. (independent and identically distributed) assumption,
θ* = argmax_θ Π_{i=1}^{m} Pr(xi; θ)
where θ = <P1, P2, ... Pn>, and each factor Pr(xi; θ) is P(aj) for the outcome aj that xi takes.
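The i.i.d. product above is usually evaluated as a sum of logs. A minimal sketch, checking that the empirical frequencies from the dice example score higher than a uniform guess (the function name and the epsilon for the zero-probability outcome are our choices):

```python
import math

def log_likelihood(theta, observations):
    """log Pr(X; theta) = sum_i log Pr(xi; theta), under the i.i.d. assumption."""
    return sum(math.log(theta[x]) for x in observations)

data = [1, 1, 2, 3, 1, 5, 3, 4, 2, 1]
# Empirical frequencies (P6 = 0 replaced by a tiny epsilon so log stays defined):
empirical = {1: 0.4, 2: 0.2, 3: 0.2, 4: 0.1, 5: 0.1, 6: 1e-12}
uniform = {a: 1 / 6 for a in range(1, 7)}
# The empirical distribution assigns the observed data a higher likelihood.
```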
What is known about: θ : <P1, P2, ... Pn>
Σ_{i=1}^{n} Pi = 1
Pi ≥ 0 for all i
Introducing Entropy:
H(θ) = - Σ_{i=1}^{n} Pi ln Pi
This is the entropy of the distribution <P1, P2, ... Pn>.
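The entropy of a distribution can be computed directly from its definition. A minimal sketch (the convention 0 ln 0 = 0 is standard, handled here by skipping zero entries):

```python
import math

def entropy(theta):
    """H(theta) = -sum Pi ln Pi, with the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in theta if p > 0)

# A fair six-outcome dice has entropy ln 6; a certain outcome has entropy 0.
h_fair = entropy([1 / 6] * 6)
h_certain = entropy([1, 0, 0, 0, 0, 0])
```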
Some intuition
Example with dice. Outcomes = 1, 2, 3, 4, 5, 6; P(1) + P(2) + ... + P(6) = 1.
Entropy(Dice) = H(θ) = - Σ_{i=1}^{6} P(i) ln P(i)
This is where Laplace’s Principle of Insufficient Reason (the principle of indifference) comes in.
The best estimate for the dice
P(1) = P(2) = ... = P(6) = 1/6
We will now prove it assuming NO KNOWLEDGE about the dice except that it has six outcomes, each with probability ≥ 0, and Σ Pi = 1.
What does “best” mean?
• “Best” means most consistent with the situation.
• “Best” means that these Pi values should be such that they maximize the entropy.
Optimization formulation
• Max. - Σ_{i=1}^{6} P(i) log P(i)
Subject to:
Σ_{i=1}^{6} P(i) = 1
P(i) ≥ 0 for i = 1 to 6
Solving the optimization (1/2)
Using Lagrangian multipliers, the optimization can be written as:
Q = - Σ_{i=1}^{6} P(i) log P(i) – λ (Σ_{i=1}^{6} P(i) – 1) - Σ_{i=1}^{6} βi P(i)
For now, let us ignore the last term. We will come to it later.
Solving the optimization (2/2)
Differentiating Q w.r.t. P(i) (ignoring the βi term), we get
∂Q/∂P(i) = - log P(i) – 1 – λ
Equating to zero:
log P(i) + 1 + λ = 0
log P(i) = -(1 + λ)
P(i) = e^{-(1+λ)}
Since e^{-(1+λ)} does not depend on i, maximizing entropy forces every P(i) to be equal.
This shows that P(1) = P(2) = ... P(6)
But,P(1) + P(2) + .. + P(6) = 1
Therefore P(1) = P(2) = ... P(6) = 1/6
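The conclusion can be checked numerically: among randomly sampled distributions over six outcomes, none exceeds the entropy of the uniform distribution. A minimal sketch (the sampling scheme is our choice, not from the talk):

```python
import math
import random

def entropy(p):
    # H(p) = -sum pi ln pi, skipping zero entries (0 ln 0 = 0)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

random.seed(0)
h_uniform = entropy([1 / 6] * 6)  # ln 6, the proved maximum

for _ in range(1000):
    weights = [random.random() for _ in range(6)]
    total = sum(weights)
    p = [w / total for w in weights]  # a random valid probability vector
    assert entropy(p) <= h_uniform + 1e-9
```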
Introducing data in the notion of entropy
Now, we introduce data:
X: x1, x2, x3, ..., xm (m observations)
A: a1, a2, a3, ..., an (n outcomes)
e.g. For a coin, in the absence of data: P(H) = P(T) = 1/2 (as shown in the previous proof).
However, if data X is observed as follows:
Obs-1: H H T H T T H H H T (m = 10, n = 2), P(H) = 6/10, P(T) = 4/10
Obs-2: T T H T H T H T T T (m = 10, n = 2), P(H) = 3/10, P(T) = 7/10
Which of these is a valid estimate?
Change in entropy
Entropy reduces as data is observed!
(Diagram: entropy levels, with Emax for the uniform distribution at the top, E1 for P(H) = 6/10 below it, and E2 for P(H) = 3/10 below that.)
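The entropy drop is easy to verify for the two coin observations. A minimal sketch, computing Emax, E1, and E2 from the estimates above:

```python
import math

def entropy(p):
    # H(p) = -sum pi ln pi, skipping zero entries
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

e_max = entropy([0.5, 0.5])  # uniform coin: ln 2 ≈ 0.693
e1 = entropy([0.6, 0.4])     # Obs-1 estimate, P(H) = 6/10
e2 = entropy([0.3, 0.7])     # Obs-2 estimate, P(H) = 3/10
# Both observed estimates have strictly lower entropy than the uniform prior.
```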
Start of Duality
• Maximizing entropy in this situation is the same as minimizing the “entropy reduction” distance, i.e., minimizing the relative entropy.
(Diagram: Emax, the maximum entropy, attained by the uniform distribution, versus Edata, the entropy once observations are made; the gap between them is the entropy reduction.)
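For the uniform reference distribution, the entropy reduction Emax - Edata is exactly the relative entropy (KL divergence) D(p || uniform), since D(p || u) = ln n - H(p) and H(u) = ln n. A minimal sketch checking this identity on the Obs-1 coin estimate:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def relative_entropy(p, q):
    """D(p || q) = sum pi ln(pi / qi), the KL divergence."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.4]   # coin distribution estimated from Obs-1
u = [0.5, 0.5]   # uniform (maximum-entropy) distribution
# Entropy reduction Emax - Edata equals the relative entropy from uniform:
gap = entropy(u) - entropy(p)
```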
Concluding remarks
Thus, in this discussion of ML-ME duality, we will show that:
MLE minimizes the relative entropy distance from the uniform distribution.
Question: The entropy corresponds to probability vectors, and the distance between such vectors could instead be measured by squared (Euclidean) distance. WHY relative entropy?