Principle of Maximum Entropy

Jiawang Liu

2012.6





Outline

What is Entropy

Relation to Maximum Likelihood

MaxEnt methods and Bayesian

Applications

• NLP (POS tagging)

• Logistic regression

Q&A

What is Entropy

In information theory, entropy is a measure of the amount of information that is missing before reception; it is sometimes referred to as Shannon entropy.

Uncertainty
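As a minimal sketch of this definition, the Shannon entropy of a discrete distribution p can be computed as H(p) = -Σ p_i log p_i. The function name `shannon_entropy` and the choice of base-2 logarithms (bits) are assumptions made for illustration; they do not come from the slides.

```python
import math

def shannon_entropy(probs, base=2.0):
    """Return -sum(p * log(p)) over the outcomes with p > 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])    # most uncertain two-outcome case
biased_coin = shannon_entropy([0.9, 0.1])  # less uncertain, so lower entropy
certain = shannon_entropy([1.0, 0.0])      # no uncertainty at all
```

The fair coin has the largest entropy of the three, which matches the reading of entropy as uncertainty.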

Why maximum entropy?

Minimize commitment

Model all that is known and assume nothing about what is unknown


Subject to precisely stated prior data, which must be a proposition that expresses testable information, the probability distribution which best represents the current state of knowledge is the one with the largest information-theoretic entropy.

Overview

Should guarantee the uniqueness and consistency of probability assignments obtained by different methods

Makes explicit our freedom in using different forms of prior data

Admits the most ignorance beyond the stated prior data


Testable information

The principle of maximum entropy is useful explicitly only when applied to testable information.

A piece of information is testable if it can be determined whether a given distribution is consistent with it.

An example:


The expectation of the variable x is 2.87

and

p2 + p3 > 0.6
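Both statements in the example are testable because a candidate distribution can be checked against them mechanically. The sketch below assumes, purely for illustration, that x takes the values 1, 2, 3, 4 with probabilities p1..p4; the function name and the tolerance are likewise hypothetical.

```python
def is_consistent(probs, values, target_mean=2.87, tol=1e-9):
    """Check a candidate distribution against the two example statements."""
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > tol:
        return False  # not a probability distribution at all
    mean = sum(p * v for p, v in zip(probs, values))
    # p2 + p3 are the second and third entries (0-based indices 1 and 2)
    return abs(mean - target_mean) <= tol and probs[1] + probs[2] > 0.6

values = [1, 2, 3, 4]                      # assumed support of x
candidate = [0.03, 0.27, 0.50, 0.20]       # mean 2.87, p2 + p3 = 0.77
uniform = [0.25, 0.25, 0.25, 0.25]         # mean 2.5, fails the first statement
```

Here `is_consistent(candidate, values)` holds and `is_consistent(uniform, values)` does not.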

General solution

Entropy maximization with no testable information

Given testable information

Seek the probability distribution which maximizes information entropy, subject to the constraints of the information.

This is a constrained optimization problem. It can typically be solved using the method of Lagrange multipliers.


Question

Seek the probability distribution which maximizes information entropy, subject to some linear constraints.

Mathematical problem

Optimization problem

non-linear programming with linear constraints

Idea

non-linear programming with linear constraints

• Lagrange multipliers

non-linear programming with no constraints

• partial derivatives

get result

• Set the derivatives equal to 0
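The three steps above can be written out as a short derivation. This is a sketch under assumed notation: p_i is the probability of value x_i, the f_k are constraint functions with required expectations F_k, and λ_0 and λ_k are the Lagrange multipliers.

```latex
\begin{aligned}
\mathcal{L}(p, \lambda) &= -\sum_{i=1}^{n} p_i \log p_i
  + \lambda_0 \Bigl( \sum_{i=1}^{n} p_i - 1 \Bigr)
  + \sum_{k=1}^{m} \lambda_k \Bigl( \sum_{i=1}^{n} p_i f_k(x_i) - F_k \Bigr) \\
\frac{\partial \mathcal{L}}{\partial p_i} &= -\log p_i - 1 + \lambda_0
  + \sum_{k=1}^{m} \lambda_k f_k(x_i) = 0 \\
p_i &= \exp\Bigl( \lambda_0 - 1 + \sum_{k=1}^{m} \lambda_k f_k(x_i) \Bigr)
  \;\propto\; \exp\Bigl( \sum_{k=1}^{m} \lambda_k f_k(x_i) \Bigr)
\end{aligned}
```

The factor exp(λ_0 - 1) is fixed by the requirement that the probabilities sum to one, so it plays the role of a normalization constant.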

Constraints

We have some testable information I about a quantity x taking values in {x1, x2,..., xn}. Express this information as m constraints on the expectations of the functions fk, that is, we require our probability distribution to satisfy

$\sum_{i=1}^{n} p(x_i) f_k(x_i) = F_k, \quad k = 1, \ldots, m$

Furthermore, the probabilities must sum to one, giving the constraint

$\sum_{i=1}^{n} p(x_i) = 1$

Objective function

$H(p) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$


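The probability distribution with maximum information entropy subject to these constraints is

$p(x_i) = \frac{1}{Z(\lambda_1, \ldots, \lambda_m)} \exp\left[ \lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i) \right]$

The normalization constant is determined by

$Z(\lambda_1, \ldots, \lambda_m) = \sum_{i=1}^{n} \exp\left[ \lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i) \right]$

The λk parameters are Lagrange multipliers whose particular values are determined by the constraints according to

$F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1, \ldots, \lambda_m)$

where F_k is the required expectation of f_k.

These m simultaneous equations do not generally possess a closed-form solution, and are usually solved by numerical methods.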
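To illustrate the numerical step, here is a minimal sketch for the simplest case of a single constraint: a six-sided die whose mean is required to be 4.5. The die, the target mean, and the use of bisection are assumptions chosen for illustration. With one feature f(x) = x the solution has the form p(x) ∝ exp(λx), and the model mean increases with λ, so bisection finds the multiplier.

```python
import math

def maxent_distribution(values, lam):
    """p(x) = exp(lam * x) / Z for the single feature f(x) = x."""
    weights = [math.exp(lam * v) for v in values]
    z = sum(weights)  # normalization constant Z(lam)
    return [w / z for w in weights]

def mean_of(probs, values):
    return sum(p * v for p, v in zip(probs, values))

def solve_lambda(values, target_mean, lo=-20.0, hi=20.0, iterations=200):
    """Bisection on lam; valid because the model mean is increasing in lam."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if mean_of(maxent_distribution(values, mid), values) < target_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

die = [1, 2, 3, 4, 5, 6]
lam = solve_lambda(die, 4.5)          # assumed target mean
p = maxent_distribution(die, lam)     # tilted toward the larger faces
```

With several constraints the same idea applies, but a multidimensional solver or an iterative scaling method is needed instead of bisection.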


Training Model

Generalized Iterative Scaling (GIS) (Darroch and Ratcliff, 1972)

Improved Iterative Scaling (IIS) (Della Pietra et al., 1995)


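Generalized Iterative Scaling (GIS) (Darroch and Ratcliff, 1972)

Compute the empirical feature expectations dj, j=1, …, k+1

Initialize the weights λj (any values, e.g., 0)

Repeat until convergence

• For each j

– Compute the model distribution p under the current weights

– Compute the model expectation Ep[fj]

– Update λj ← λj + (1/C) log(dj / Ep[fj]), where C is the constant total of the feature values for every event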
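The loop above can be sketched in a few lines of Python. This is an illustrative implementation for an unconditional distribution over a small finite set, not the presenter's code. It assumes non-negative features, appends a correction feature so that every outcome's feature values sum to the same constant C, and skips features whose expectation is zero. All names and the toy data are hypothetical.

```python
import math

def model_distribution(rows, lam):
    """p(x) proportional to exp(sum_j lam[j] * f_j(x))."""
    weights = [math.exp(sum(l * f for l, f in zip(lam, row))) for row in rows]
    z = sum(weights)
    return [w / z for w in weights]

def expectations(probs, rows):
    """E_p[f_j] for every feature j."""
    return [sum(p * row[j] for p, row in zip(probs, rows))
            for j in range(len(rows[0]))]

def gis(outcomes, features, empirical, iterations=500):
    rows = [[f(x) for f in features] for x in outcomes]
    c = max(sum(row) for row in rows)
    rows = [row + [c - sum(row)] for row in rows]  # correction feature
    d = expectations(empirical, rows)              # empirical expectations d_j
    lam = [0.0] * len(rows[0])
    for _ in range(iterations):
        e = expectations(model_distribution(rows, lam), rows)
        for j in range(len(lam)):
            if d[j] > 0 and e[j] > 0:
                lam[j] += math.log(d[j] / e[j]) / c
    return lam, model_distribution(rows, lam)

outcomes = [1, 2, 3, 4, 5, 6]
empirical = [0.05, 0.10, 0.10, 0.15, 0.30, 0.30]   # toy empirical distribution
lam, p = gis(outcomes, [lambda x: float(x)], empirical)
model_mean = sum(pi * x for pi, x in zip(p, outcomes))
empirical_mean = sum(pi * x for pi, x in zip(empirical, outcomes))
```

After training, the model expectation of the feature matches its empirical expectation, which is exactly the constraint the maximum entropy model must satisfy.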


The running time of each GIS iteration is O(NPA):

• N: the training set size

• P: the number of classes

• A: the average number of features that are active for a given event (a, b).


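Relation to Maximum Likelihood

Likelihood function

$L(p) = \prod_{x} p(x)^{\tilde{p}(x)}$

p(x) is the estimated (model) distribution

$\tilde{p}(x)$ is the empirical distribution

Log-likelihood function

$\log L(p) = \sum_{x} \tilde{p}(x) \log p(x)$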
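As a small sketch, the log-likelihood can be computed as the empirical-distribution-weighted sum of log model probabilities, Σ p̃(x) log p(x). The helper names and the toy sample below are assumptions for illustration.

```python
import math
from collections import Counter

def empirical_distribution(sample):
    """Relative frequency of each observed outcome."""
    counts = Counter(sample)
    n = len(sample)
    return {x: c / n for x, c in counts.items()}

def log_likelihood(model, empirical):
    """sum_x p~(x) * log p(x), the per-observation log-likelihood."""
    return sum(pt * math.log(model[x]) for x, pt in empirical.items())

sample = ["a", "a", "b"]                       # toy training sample
p_tilde = empirical_distribution(sample)       # {"a": 2/3, "b": 1/3}
uniform = {"a": 0.5, "b": 0.5}
ll_uniform = log_likelihood(uniform, p_tilde)
ll_matched = log_likelihood(p_tilde, p_tilde)  # model equal to the empirical one
```

With no constraints on the model family, the empirical distribution itself attains the largest log-likelihood, so `ll_matched` exceeds `ll_uniform`.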


Theorem

The model p* ∈ C with maximum entropy, where C is the set of models that satisfy the constraints, is the model in the parametric family p(y|x) that maximizes the likelihood of the training sample.

Coincidence?

Entropy – the measure of uncertainty

Likelihood – the degree of agreement with what is known

Maximum entropy – assume nothing about what is unknown

Maximum likelihood – fit what is known impartially

Knowledge = the complement of uncertainty


MaxEnt methods and Bayesian

Bayesian methods

p(H|DI) = p(H|I)p(D|HI) / p(D|I)

H stands for some hypothesis whose truth we want to judge

D for a set of data

I for prior information
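A single application of the formula can be sketched as follows. The numbers are invented for illustration, and the evidence term p(D|I) is expanded over H and not-H, which assumes the hypothesis is binary.

```python
def posterior(prior_h, like_d_given_h, like_d_given_not_h):
    """p(H|DI) = p(H|I) p(D|HI) / p(D|I) for a binary hypothesis H."""
    evidence = prior_h * like_d_given_h + (1.0 - prior_h) * like_d_given_not_h
    return prior_h * like_d_given_h / evidence

# Hypothetical inputs: p(H|I) = 0.01, p(D|HI) = 0.9, p(D|not-H, I) = 0.05
p_h_given_d = posterior(0.01, 0.9, 0.05)   # about 0.154
```

All three inputs must already be numerical probabilities; the formula does not say where they come from.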

Difference

A single application of Bayes’ theorem gives us only a probability, not a probability distribution

MaxEnt necessarily gives us a probability distribution, not just a probability.


Difference (continued)

Bayes’ theorem cannot determine the numerical value of any probability directly from our information. To apply it one must first use some other principle to translate information into numerical values for p(H|I), p(D|HI), p(D|I)

MaxEnt does not require as input the numerical values of any probabilities on the hypothesis space.

In common

The updating of a state of knowledge

Bayes’ rule and MaxEnt are completely compatible and can be seen as special cases of the method of maximum relative entropy. (Giffin et al. 2007)


Applications

Maximum Entropy Model

NLP: POS Tagging, Parsing, PP attachment, Text Classification, LM, …

POS Tagging

Features

Model

POS Tagging

Tagging with MaxEnt Model

The conditional probability of a tag sequence t1,…, tn, given a sentence w1,…, wn and contexts C1,…, Cn, is

$p(t_1, \ldots, t_n \mid w_1, \ldots, w_n) \approx \prod_{i=1}^{n} p(t_i \mid C_i)$

Model Estimation

• The model should reflect the data

– use the data to constrain the model

• What form should the constraints take?

– constrain the expected value of each feature



The Constraints

• Expected value of each feature must satisfy some constraint Ki

$\sum_{C,t} p(C, t) f_i(C, t) = K_i$

• A natural choice for Ki is the average empirical count, derived from the training data (C1, t1), (C2, t2)…, (Cn, tn)

$K_i = \frac{1}{n} \sum_{j=1}^{n} f_i(C_j, t_j)$
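The average empirical count of a feature can be sketched as below. The feature `f_ing_vbg`, the dictionary-shaped contexts, and the four training pairs are all hypothetical and serve only to show how Ki would be computed from (context, tag) pairs.

```python
def empirical_expectation(feature, data):
    """Average value of a feature over the (context, tag) training pairs."""
    return sum(feature(context, tag) for context, tag in data) / len(data)

def f_ing_vbg(context, tag):
    """Hypothetical binary feature: word ends in 'ing' and the tag is VBG."""
    return 1 if context["word"].endswith("ing") and tag == "VBG" else 0

training_data = [
    ({"word": "running"}, "VBG"),
    ({"word": "king"}, "NN"),
    ({"word": "the"}, "DT"),
    ({"word": "eating"}, "VBG"),
]
k_i = empirical_expectation(f_ing_vbg, training_data)  # fires on 2 of 4 pairs
```

The trained model is then required to give this feature the same expected value.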



MaxEnt Model

• The constraints do not uniquely identify a model

• The maximum entropy model is the most uniform model

– makes no assumptions in addition to what we know from the data

• Set the weights to give the MaxEnt model satisfying the constraints

– use Generalised Iterative Scaling (GIS)

Smoothing

• Empirical counts for low-frequency features can be unreliable

• A common smoothing technique is to ignore low-frequency features

• Another option is to use a prior distribution on the parameters


Logistic regression

Classification

• Linear regression for classification

• The problems of linear regression for classification



Hypothesis representation

• What function is used to represent our hypothesis in classification?

• We want our classifier to output values between 0 and 1

• When using linear regression we used $h_\theta(x) = \theta^T x$

• For classification, the hypothesis representation is

$h_\theta(x) = g(\theta^T x)$

where g(z) is defined for a real number z as

$g(z) = \frac{1}{1 + e^{-z}}$

This is the sigmoid function, or the logistic function
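A minimal sketch of the hypothesis in plain Python follows. The two-branch form of `sigmoid` is an implementation choice made here to avoid overflow for large negative z; the function names are illustrative.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)), computed without overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x) for equal-length lists theta and x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

midpoint = sigmoid(0.0)                        # exactly 0.5
h = hypothesis([-1.0, 2.0], [1.0, 1.5])        # g(-1 + 3) = g(2)
```

Every output lies between 0 and 1, so it can be read as an estimated probability that y = 1.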



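Cost function for logistic regression

• Hypothesis representation

• Linear regression uses the following function to determine θ, where m is the number of training examples

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

• Define $\mathrm{cost}(h_\theta(x^{(i)}), y^{(i)}) = \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

• Redefine J(θ)

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{cost}(h_\theta(x^{(i)}), y^{(i)})$

• This squared-error J(θ) does not work for logistic regression: with the sigmoid hypothesis it is a non-convex function of θ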



• A convex logistic regression cost function

$\mathrm{cost}(h_\theta(x), y) = -\log(h_\theta(x))$ if y = 1

$\mathrm{cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$ if y = 0



Simplified cost function

• For binary classification problems y is always 0 or 1

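• So the cost function can be written as

cost(hθ(x), y) = -y log( hθ(x) ) - (1-y) log( 1 - hθ(x) )

• So, in summary, our cost function for the θ parameters can be defined as

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left( 1 - h_\theta(x^{(i)}) \right) \right]$

• Find parameters θ which minimize J(θ)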
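The averaged cost can be sketched directly from the per-example formula above. The clamping constant `eps`, which guards against log(0), and the toy data are assumptions added for illustration.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost(theta, xs, ys, eps=1e-12):
    """J(theta): mean of -y*log(h) - (1-y)*log(1-h) over the training set."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        h = min(max(h, eps), 1.0 - eps)   # keep the logarithms finite
        total += -y * math.log(h) - (1 - y) * math.log(1.0 - h)
    return total / len(xs)

xs = [[1.0, -2.0], [1.0, 2.0]]   # toy inputs; first component is the bias term
ys = [0, 1]
j_zero = cost([0.0, 0.0], xs, ys)   # every prediction is 0.5, so J = log 2
j_good = cost([0.0, 1.0], xs, ys)   # predictions lean the right way
```

A parameter vector that separates the two examples gives a lower cost than the all-zero vector.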



How to minimize the logistic regression cost function

Use gradient descent to minimize J(θ)
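A batch gradient descent loop for this cost can be sketched as follows, using the standard update θj := θj - α (1/m) Σ (hθ(x_i) - y_i) x_ij. The learning rate, iteration count, and toy data set are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def gradient_descent(xs, ys, alpha=0.1, iterations=5000):
    """Minimize the logistic regression cost J(theta) by batch updates."""
    m, n = len(xs), len(xs[0])
    theta = [0.0] * n
    for _ in range(iterations):
        grad = [0.0] * n
        for x, y in zip(xs, ys):
            error = predict(theta, x) - y          # h_theta(x) - y
            for j in range(n):
                grad[j] += error * x[j]
        theta = [t - alpha * g / m for t, g in zip(theta, grad)]
    return theta

# Toy data: first component is the bias term, second is the single feature
xs = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
ys = [0, 0, 1, 0, 1, 1]
theta = gradient_descent(xs, ys)
low = predict(theta, [1.0, 0.5])    # small feature value
high = predict(theta, [1.0, 4.5])   # large feature value
```

On this data the fitted model assigns a probability below 0.5 to the small feature value and above 0.5 to the large one.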



Advanced optimization

• Advanced optimization methods are good for large machine learning problems (e.g. huge feature set)

• What is gradient descent actually doing?

– compute J(θ) and the derivatives

– plug these values into gradient descent

• Alternatively, instead of gradient descent, we could minimize the cost function with

– Conjugate gradient

– BFGS (Broyden-Fletcher-Goldfarb-Shanno)

– L-BFGS (Limited memory - BFGS)



Why do we choose this function when other cost functions exist?

• This cost function can be derived from statistics using the principle of maximum likelihood estimation

– Note that the underlying assumption is that each label y follows a Bernoulli distribution with parameter hθ(x); no Gaussian assumption about the distribution of features is needed

• It also has the nice property that it's convex


Thanks!

Q&A

Reference

Jaynes, E. T., 1988, 'The Relation of Bayesian and Maximum Entropy Methods', in Maximum-Entropy and Bayesian Methods in Science and Engineering (Vol. 1), Kluwer Academic Publishers, p. 25-26. http://bayes.wustl.edu/etj/articles/relationship.pdf

https://www.coursera.org/course/ml

The Elements of Statistical Learning, 4.4.

Kitamura, Y., 2006, Empirical Likelihood Methods in Econometrics: Theory and Practice, Cowles Foundation Discussion Papers 1569, Cowles Foundation, Yale University. http://cowles.econ.yale.edu/P/cd/d15b/d1569.pdf

http://en.wikipedia.org/wiki/Principle_of_maximum_entropy

Lazar, N., 2003, "Bayesian Empirical Likelihood", Biometrika, 90, 319-326.

Giffin, A. and Caticha, A., 2007, Updating Probabilities with Data and Moments. http://arxiv.org/abs/0708.1593

Guiasu, S. and Shenitzer, A., 1985, 'The principle of maximum entropy', The Mathematical Intelligencer, 7(1), 42-48.

Harremoës P. and Topsøe F., 2001, Maximum Entropy Fundamentals, Entropy, 3(3), 191-226.

http://en.wikipedia.org/wiki/Logistic_regression
