Principle of Maximum Entropy

Jiawang Liu  [email protected]  2012.6


Page 2: Principle of Maximum Entropy

Outline

What is Entropy
Principle of Maximum Entropy
Relation to Maximum Likelihood
MaxEnt methods and Bayesian
Applications
  NLP (POS tagging)
  Logistic regression
Q&A

Page 3: Principle of Maximum Entropy

What is Entropy

In information theory, entropy is the measure of the amount of information that is missing before reception, and is sometimes referred to as Shannon entropy. Intuitively, it quantifies uncertainty.
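For reference, the Shannon entropy of a discrete distribution p over outcomes x1, ..., xn (the formula itself is not written out on the slide) is:

```latex
H(p) = -\sum_{i=1}^{n} p(x_i)\,\log p(x_i)
```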

Page 4: Principle of Maximum Entropy

Why maximum entropy?

Minimize commitment

Model all that is known and assume nothing about what is unknown

Principle of Maximum Entropy

Subject to precisely stated prior data, which must be a proposition that expresses testable information, the probability distribution which best represents the current state of knowledge is the one with the largest information-theoretic entropy.

Page 5: Principle of Maximum Entropy

Overview

Should guarantee the uniqueness and consistency of probability assignments obtained by different methods
Makes explicit our freedom in using different forms of prior data
Admits the most ignorance beyond the stated prior data


Page 6: Principle of Maximum Entropy

Testable information

The principle of maximum entropy is useful explicitly only when applied to testable information. A piece of information is testable if it can be determined whether a given distribution is consistent with it.

An example: the expectation of the variable x is 2.87, and p_2 + p_3 > 0.6.
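As an illustration (not part of the original slides), testability means a concrete check like the following Python sketch is possible; the support {1, 2, 3, 4}, the candidate distribution, and the helper name are assumptions made for the example.

```python
def is_consistent(p, xs=(1, 2, 3, 4), tol=1e-9):
    """Check whether a distribution p over xs satisfies the two testable constraints.

    p is assumed to already be a valid probability distribution (non-negative, sums to 1).
    """
    expectation = sum(x * px for x, px in zip(xs, p))
    mean_ok = abs(expectation - 2.87) < tol   # constraint 1: E[x] = 2.87
    mass_ok = p[1] + p[2] > 0.6               # constraint 2: p_2 + p_3 > 0.6
    return mean_ok and mass_ok

print(is_consistent([0.05, 0.33, 0.32, 0.30]))  # E[x] = 2.87, p_2 + p_3 = 0.65 -> True
```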

Page 7: Principle of Maximum Entropy

General solution

Entropy maximization with no testable information: with only the normalization constraint, maximizing the entropy yields the uniform distribution.

Given testable information: seek the probability distribution which maximizes information entropy, subject to the constraints of the information. This is a constrained optimization problem, and it can typically be solved using the method of Lagrange multipliers.


Page 8: Principle of Maximum Entropy

General solution

Question
Seek the probability distribution which maximizes information entropy, subject to some linear constraints.

Mathematical problem
An optimization problem: non-linear programming with linear constraints.

Idea (a worked version follows below)
non-linear programming with linear constraints
• apply Lagrange multipliers
non-linear programming with no constraints
• take partial derivatives
get the result
• set the derivatives equal to 0
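A worked version of that idea (reconstructed here; the slide itself only sketches the steps): form the Lagrangian of the entropy objective with the constraints, differentiate with respect to each p(x_i), and set the derivative to zero.

```latex
L(p,\lambda) = -\sum_i p(x_i)\log p(x_i)
  + \lambda_0\Big(\sum_i p(x_i) - 1\Big)
  + \sum_{k=1}^{m}\lambda_k\Big(\sum_i p(x_i)\, f_k(x_i) - F_k\Big)

\frac{\partial L}{\partial p(x_i)} = -\log p(x_i) - 1 + \lambda_0 + \sum_k \lambda_k f_k(x_i) = 0
\;\Rightarrow\; p(x_i) \propto \exp\Big(\sum_k \lambda_k f_k(x_i)\Big)
```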

Page 9: Principle of Maximum Entropy

General solution

Constraints
Some testable information I about a quantity x taking values in {x1, x2, ..., xn}. Express this information as m constraints on the expectations of the functions fk; that is, we require our probability distribution to satisfy the expectation constraints given below. Furthermore, the probabilities must sum to one, giving the normalization constraint.

Objective function: the information entropy (see below).
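The constraint and objective formulas appear as images on the original slide; in standard form they are:

```latex
\sum_{i=1}^{n} p(x_i)\, f_k(x_i) = F_k, \qquad k = 1,\dots,m \qquad \text{(expectation constraints)}

\sum_{i=1}^{n} p(x_i) = 1 \qquad \text{(normalization)}

H(p) = -\sum_{i=1}^{n} p(x_i)\log p(x_i) \qquad \text{(objective)}
```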


Page 10: Principle of Maximum Entropy

General solution

The probability distribution with maximum information entropy subject to these constraints has the exponential form given below. The normalization constant is determined by requiring the probabilities to sum to one, and the λk parameters are Lagrange multipliers whose particular values are determined by the constraints. These m simultaneous equations do not generally possess a closed-form solution and are usually solved by numerical methods.
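The solution formulas are images on the original slide; the standard forms are:

```latex
p(x_i) = \frac{1}{Z(\lambda_1,\dots,\lambda_m)} \exp\Big(\sum_{k=1}^{m} \lambda_k f_k(x_i)\Big)

Z(\lambda_1,\dots,\lambda_m) = \sum_{i=1}^{n} \exp\Big(\sum_{k=1}^{m} \lambda_k f_k(x_i)\Big)

F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1,\dots,\lambda_m), \qquad k = 1,\dots,m
```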


Page 11: Principle of Maximum Entropy

Training Model

Generalized Iterative Scaling (GIS) (Darroch and Ratcliff, 1972)
Improved Iterative Scaling (IIS) (Della Pietra et al., 1995)


Page 12: Principle of Maximum Entropy

Training Model

Generalized Iterative Scaling (GIS) (Darroch and Ratcliff, 1972):
Compute the empirical feature expectations dj, j = 1, …, k+1
Initialize the λj (any values, e.g., 0)
Repeat until convergence:
• For each j
  – Compute the expected count of feature j under the current model
  – Compute the update step
  – Update λj
A Python sketch of these steps follows.
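A minimal sketch of GIS for a conditional MaxEnt model p(y | x), not taken from the slides: it assumes binary feature functions f_j(x, y), uses the maximum total feature count as the GIS constant C (a correction feature would make it exactly constant), and the function and argument names are invented for the illustration.

```python
import math

def gis(samples, labels, features, num_iters=100):
    """Generalized Iterative Scaling for a conditional MaxEnt model p(y | x).

    samples  : list of (x, y) training pairs
    labels   : list of possible classes y
    features : list of functions f_j(x, y) returning 0 or 1
    """
    n = len(samples)
    # C: total feature count per event; GIS assumes this is constant, which in
    # practice is ensured with a "correction" feature. Here we take the maximum.
    C = max(sum(f(x, y) for f in features) for x, _ in samples for y in labels)

    # d_j: empirical expectation of each feature over the training data
    d = [sum(f(x, y) for x, y in samples) / n for f in features]

    lam = [0.0] * len(features)  # initialize the multipliers (e.g., to 0)

    def p_y_given_x(x):
        scores = [math.exp(sum(l * f(x, y) for l, f in zip(lam, features)))
                  for y in labels]
        z = sum(scores)  # normalization constant Z(x)
        return [s / z for s in scores]

    for _ in range(num_iters):
        # expected feature counts under the current model
        expected = [0.0] * len(features)
        for x, _ in samples:
            probs = p_y_given_x(x)
            for j, f in enumerate(features):
                expected[j] += sum(p * f(x, y) for p, y in zip(probs, labels)) / n
        # GIS update: lambda_j += (1 / C) * log(d_j / E_model[f_j])
        for j in range(len(features)):
            if d[j] > 0 and expected[j] > 0:
                lam[j] += (1.0 / C) * math.log(d[j] / expected[j])
    return lam
```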


Page 13: Principle of Maximum Entropy

Training Model

Generalized Iterative Scaling (GIS) (Darroch and Ratcliff, 1972):
The running time of each iteration is O(NPA), where
• N: the training set size
• P: the number of classes
• A: the average number of features that are active for a given event (a, b)


Page 14: Principle of Maximum Entropy

Relation to Maximum Likelihood

Likelihood function (see below): p(x) is the estimated distribution, and p̃(x) is the empirical distribution.

Log-likelihood function (see below).
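The likelihood formulas appear as images on the original slide; with p̃(x) the empirical distribution, the usual forms are:

```latex
L(p) = \prod_{x} p(x)^{\tilde{p}(x)}

\log L(p) = \sum_{x} \tilde{p}(x)\, \log p(x)
```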


Page 15: Principle of Maximum Entropy

Relation to Maximum Likelihood

Theorem
The model p* ∈ C with maximum entropy is the model in the parametric family p(y|x) that maximizes the likelihood of the training sample.

Coincidence?
Entropy – the measure of uncertainty
Likelihood – the degree of agreement with the knowledge (the observed data)
Maximum entropy – assume nothing about what is unknown
Maximum likelihood – represent the knowledge as faithfully as possible
Knowledge = the complement of uncertainty


Page 16: Principle of Maximum Entropy

MaxEnt methods and Bayesian

Bayesian methods
p(H|DI) = p(H|I) p(D|HI) / p(D|I)
H stands for some hypothesis whose truth we want to judge
D for a set of data
I for prior information

Difference
A single application of Bayes' theorem gives us only a probability, not a probability distribution.
MaxEnt gives us necessarily a probability distribution, not just a probability.


Page 17: Principle of Maximum Entropy

MaxEnt methods and Bayesian

Difference (continued)
Bayes' theorem cannot determine the numerical value of any probability directly from our information. To apply it one must first use some other principle to translate information into numerical values for p(H|I), p(D|HI), p(D|I).
MaxEnt does not require for input the numerical values of any probabilities on the hypothesis space.

In common
Both perform the updating of a state of knowledge.
Bayes' rule and MaxEnt are completely compatible and can be seen as special cases of the method of MaxEnt (Giffin et al., 2007).


Page 18: Principle of Maximum Entropy

Applications

Maximum Entropy Model in NLP: POS Tagging, Parsing, PP attachment, Text Classification, LM, …

POS Tagging
Features
Model

Page 19: Principle of Maximum Entropy

Maximum Entropy Model: POS Tagging

Tagging with the MaxEnt Model
The conditional probability of a tag sequence t1, …, tn given a sentence w1, …, wn and contexts C1, …, Cn is given by the model below.
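The model formula is an image on the original slide; in the usual MaxEnt tagging formulation it factorizes over positions, with each local distribution in exponential form:

```latex
p(t_1,\dots,t_n \mid w_1,\dots,w_n) = \prod_{i=1}^{n} p(t_i \mid C_i),
\qquad
p(t \mid C) = \frac{1}{Z(C)} \exp\Big(\sum_{k} \lambda_k f_k(C, t)\Big)
```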

Model Estimation

• The model should reflect the data

– use the data to constrain the model

• What form should the constraints take?

– constrain the expected value of each feature


Page 20: Principle of Maximum Entropy

Maximum Entropy Model: POS Tagging

The Constraints
• The expected value of each feature must satisfy some constraint Ki
• A natural choice for Ki is the average empirical count, derived from the training data (C1, t1), (C2, t2), …, (Cn, tn) (see below)
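The constraint formula is an image on the original slide; with the average empirical count as Ki, it reads:

```latex
K_i = \frac{1}{n} \sum_{j=1}^{n} f_i(C_j, t_j),
\qquad
\sum_{C,\,t} \tilde{p}(C)\, p(t \mid C)\, f_i(C, t) = K_i
```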


Page 21: Principle of Maximum Entropy

Maximum Entropy Model: POS Tagging

MaxEnt Model
• The constraints do not uniquely identify a model
• The maximum entropy model is the most uniform model
  – makes no assumptions in addition to what we know from the data
• Set the weights to give the MaxEnt model satisfying the constraints
  – use Generalised Iterative Scaling (GIS)

Smoothing
• Empirical counts for low-frequency features can be unreliable
• A common smoothing technique is to ignore low-frequency features
• Alternatively, use a prior distribution on the parameters


Page 22: Principle of Maximum Entropy

Maximum Entropy Model: Logistic regression

Classification
• Linear regression for classification
• The problems of linear regression for classification


Page 23: Principle of Maximum Entropy

Maximum Entropy Model: Logistic regression

Hypothesis representation
• What function is used to represent our hypothesis in classification?
• We want our classifier to output values between 0 and 1
• When using linear regression we had hθ(x) = θ^T x
• For the classification hypothesis representation we use
  hθ(x) = g(θ^T x)
  where, for a real number z, we define
  g(z) = 1 / (1 + e^(-z))
  This is the sigmoid function, or the logistic function.


Page 24: Principle of Maximum Entropy

Maximum Entropy Model: Logistic regression

Cost function for logistic regression
• Hypothesis representation: hθ(x) = g(θ^T x)
• Linear regression uses the following per-example cost to determine θ:
  cost(hθ(x_i), y_i) = (1/2)(hθ(x_i) - y_i)^2
• Redefine J(θ) as the average of these costs over the training set
• This J(θ) does not work for logistic regression, since with the sigmoid hypothesis it is a non-convex function of θ


Page 25: Principle of Maximum Entropy

Maximum Entropy Model: Logistic regression

Cost function for logistic regression
• A convex logistic regression cost function (see below)
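The convex cost is an image on the original slide; the standard per-example form is:

```latex
\mathrm{cost}(h_\theta(x), y) =
\begin{cases}
-\log\big(h_\theta(x)\big)     & \text{if } y = 1 \\
-\log\big(1 - h_\theta(x)\big) & \text{if } y = 0
\end{cases}
```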


Page 26: Principle of Maximum Entropy

Maximum Entropy Model: Logistic regression

Simplified cost function
• For binary classification problems y is always 0 or 1
• So we can write the cost function as
  cost(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x))
• In summary, our cost function for the θ parameters can be defined as shown below
• Find the parameters θ which minimize J(θ)
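The summary formula is an image on the original slide; the standard form is:

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
\Big[ y^{(i)} \log\big(h_\theta(x^{(i)})\big)
 + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big]
```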


Page 27: Principle of Maximum Entropy

Maximum Entropy Model: Logistic regression

How to minimize the logistic regression cost function
Use gradient descent to minimize J(θ); a small sketch follows.
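A minimal sketch of logistic regression trained by batch gradient descent (not taken from the slides; the array shapes, learning rate, iteration count, and toy data are assumptions made for the illustration):

```python
import numpy as np

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, num_iters=1000):
    """Minimize J(theta) by batch gradient descent.

    X : (m, d) feature matrix (include a column of ones for the intercept)
    y : (m,)   labels in {0, 1}
    """
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)        # predictions h_theta(x)
        grad = X.T @ (h - y) / m      # gradient of J(theta)
        theta -= lr * grad            # gradient descent step
    return theta

# Tiny usage example with made-up data
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])  # intercept + 1 feature
y = np.array([0, 0, 1, 1])
theta = fit_logistic(X, y)
print(sigmoid(X @ theta))  # probabilities close to y after training
```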


Page 28: Principle of Maximum Entropy

Maximum Entropy Model: Logistic regression

Advanced optimization
• Good for large machine learning problems (e.g. a huge feature set)
• What is gradient descent actually doing?
  – compute J(θ) and its derivatives
  – plug these values into the gradient descent update
• Alternatively, instead of gradient descent we could minimize the cost function with
  – Conjugate gradient
  – BFGS (Broyden-Fletcher-Goldfarb-Shanno)
  – L-BFGS (Limited-memory BFGS)


Page 29: Principle of Maximum Entropy

Maximum Entropy Model: Logistic regression

Why do we choose this cost function when other cost functions exist?
• This cost function can be derived from statistics using the principle of maximum likelihood estimation
  – it is the negative log-likelihood under a Bernoulli model of y given x
• It also has the nice property that it is convex


Page 30: Principle of Maximum Entropy

Thanks!

Q&A

Page 31: Principle of Maximum Entropy

References

Jaynes, E. T., 1988, 'The Relation of Bayesian and Maximum Entropy Methods', in Maximum-Entropy and Bayesian Methods in Science and Engineering (Vol. 1), Kluwer Academic Publishers, pp. 25-26.
https://www.coursera.org/course/ml
Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning, Section 4.4.
Kitamura, Y., 2006, 'Empirical Likelihood Methods in Econometrics: Theory and Practice', Cowles Foundation Discussion Papers 1569, Cowles Foundation, Yale University.
http://en.wikipedia.org/wiki/Principle_of_maximum_entropy
Lazar, N., 2003, 'Bayesian Empirical Likelihood', Biometrika, 90, 319-326.
Giffin, A. and Caticha, A., 2007, 'Updating Probabilities with Data and Moments'.
Guiasu, S. and Shenitzer, A., 1985, 'The Principle of Maximum Entropy', The Mathematical Intelligencer, 7(1), 42-48.
Harremoës, P. and Topsøe, F., 2001, 'Maximum Entropy Fundamentals', Entropy, 3(3), 191-226.
http://en.wikipedia.org/wiki/Logistic_regression