
Conditional Random Fields: Probabilistic Models

Pusan National University

AILAB.

Kim, Minho

Labeling Sequence Data Problem

• X is a random variable over data sequences

• Y is a random variable over label sequences

• Yi is assumed to range over a finite label alphabet A

• The problem:
– Learn how to assign labels from a closed set Y to a data sequence X

Example:

X:  Birds  like  flowers
    x1     x2    x3

Y:  noun   verb  noun
    y1     y2    y3

Generative Probabilistic Models

• Learning problem:

Choose Θ to maximize joint likelihood:

L(Θ)= Σ log pΘ (yi,xi)

• The goal: maximization of the joint likelihood of training examples

y = argmax p*(y|x) = argmax p*(y,x)/p(x)

• Needs to enumerate all possible observation sequences

Hidden Markov Model

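• In a Hidden Markov Model (HMM) we do not observe the sequence of states that the model passed through (X), only some probabilistic function of it (Y). It is thus a Markov model with the addition of emission probabilities:

$$B_{ik} = P(Y_t = k \mid X_t = i)$$

• Note that on this slide X denotes the hidden states and Y the observations, the reverse of the labeling notation used on the other slides.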

POS Tagging in HMM

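• Learning (Maximum Likelihood Estimation), where W_i is the i-th word and P_i its part-of-speech tag:

$$\Pr(W_i \mid P_i) = \frac{\mathrm{freq}(W_i, P_i)}{\mathrm{freq}(P_i)}$$

$$\Pr(P_i \mid P_{i-2}, P_{i-1}) = \frac{\mathrm{freq}(P_{i-2}, P_{i-1}, P_i)}{\mathrm{freq}(P_{i-2}, P_{i-1})}$$

$$\Pr(P_i \mid P_{i-1}) = \frac{\mathrm{freq}(P_{i-1}, P_i)}{\mathrm{freq}(P_{i-1})}$$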
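The relative-frequency estimates for the bigram and emission probabilities can be sketched in a few lines of Python. This is only an illustration under assumptions: the toy corpus, the `<s>` start symbol, and the function name `estimate_hmm` are hypothetical and not part of the original slides.

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Relative-frequency (MLE) estimates for a bigram HMM tagger."""
    tag_count = Counter()     # freq(P_i)
    bigram_count = Counter()  # freq(P_{i-1}, P_i)
    emit_count = Counter()    # freq(W_i, P_i)
    for sentence in tagged_sentences:
        prev = "<s>"          # hypothetical sentence-start tag
        tag_count[prev] += 1
        for word, tag in sentence:
            tag_count[tag] += 1
            bigram_count[(prev, tag)] += 1
            emit_count[(word, tag)] += 1
            prev = tag
    # Pr(P_i | P_{i-1}) = freq(P_{i-1}, P_i) / freq(P_{i-1})
    transition = {(a, b): c / tag_count[a] for (a, b), c in bigram_count.items()}
    # Pr(W_i | P_i) = freq(W_i, P_i) / freq(P_i)
    emission = {(w, t): c / tag_count[t] for (w, t), c in emit_count.items()}
    return transition, emission

# Toy corpus, invented for illustration only.
corpus = [
    [("birds", "noun"), ("like", "verb"), ("flowers", "noun")],
    [("birds", "noun"), ("sing", "verb")],
]
transition, emission = estimate_hmm(corpus)
```

Because the denominator counts every occurrence of a tag, including sentence-final ones, the transition estimates out of a tag need not sum to one unless an explicit end-of-sentence tag is added.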

HMM – why not?

• Advantages:
– Estimation is very easy
– Closed-form solution
– The parameters can be estimated with relatively high confidence from small samples

• But:
– The model represents all possible (x, y) sequences and defines a joint probability over all possible observation and label sequences, which is needless effort

Discriminative Probabilistic Models

“Solve the problem you need to solve”: The traditional approach inappropriately uses a generative joint model in order to solve a conditional problem in which the observations are given. To classify we need p(y|x) – there’s no need to implicitly approximate p(x).

[Figure: generative vs. discriminative model]

Discriminative Models - Estimation

• Choose Θy to maximize conditional likelihood:

L(Θy)= Σ log pΘy(yi|xi)

• Estimation usually doesn’t have closed form

• Example – MinMI discriminative approach (2nd week lecture)

Maximum Entropy Markov Model

• MEMM:
– A conditional model that represents the probability of reaching a state given an observation and the previous state
– These conditional probabilities are specified by exponential models based on arbitrary observation features

POS Tagging in MEMM

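• Optimal sequence

$$P' = \arg\max_{P_{1,n}} \prod_{i} \Pr(P_i \mid H_i), \qquad H_i = \{W_i, W_{i+1}, W_{i-1}, P_{i-1}\}$$

• Joint probability

$$\Pr(p \mid h) = \frac{\Pr(h, p)}{\sum_{p' \in P} \Pr(h, p')}$$

$$\Pr(h, p) \propto \prod_{j=1}^{k} \alpha_j^{f_j(h, p)}, \qquad f_j(H_i, P_i) \in \{0, 1\}$$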
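As a rough sketch of how such a locally normalized exponential model produces Pr(p | h), the following Python computes the product of weights raised to binary features and normalizes over the candidate tags. The two features, their weights in `ALPHAS`, and the dictionary layout of the history are invented for illustration and are not taken from the slides.

```python
def capitalized_noun(history, tag):
    # Binary feature: the current word is capitalized and the tag is "noun".
    return 1 if history["word"][:1].isupper() and tag == "noun" else 0

def verb_after_noun(history, tag):
    # Binary feature: the previous tag is "noun" and the current tag is "verb".
    return 1 if history["prev_tag"] == "noun" and tag == "verb" else 0

FEATURES = [capitalized_noun, verb_after_noun]
ALPHAS = [3.0, 2.0]        # hypothetical weights alpha_j
TAGS = ["noun", "verb"]

def tag_distribution(history, tags=TAGS, features=FEATURES, alphas=ALPHAS):
    """Pr(p | h) = score(h, p) / sum over p' of score(h, p')."""
    scores = {}
    for tag in tags:
        score = 1.0
        for feature, alpha in zip(features, alphas):
            score *= alpha ** feature(history, tag)
        scores[tag] = score
    total = sum(scores.values())
    return {tag: score / total for tag, score in scores.items()}

dist = tag_distribution({"word": "like", "prev_tag": "noun"})
```

The normalization happens separately for every history, which is exactly the per-state normalization that gives rise to the label bias problem.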

MEMM: the Label bias problem

The Label Bias Problem: Solutions

• Determinization of the finite state machine
– Not always possible
– May lead to combinatorial explosion

• Start with a fully connected model and let the training procedure find a good structure
– But prior structural knowledge, which this approach gives up, has proven to be valuable in information extraction tasks

Random Field Model: Definition

• Let G = (V, E) be a finite graph, and let A be a finite alphabet.

• The configuration space Ω is the set of all labelings of the vertices in V by letters in A. If C is a subset of V and ω ∈ Ω is a configuration, then ω_C denotes the configuration restricted to C.

• A random field on G is a probability distribution on Ω.

Random Field Model: The Problem

• Assume that a finite number of features can define a class

• The features f_i(ω) are given and fixed.

• The goal: estimating λ to maximize likelihood for training examples

Conditional Random Field: Definition

• X – random variable over data sequences

• Y - random variable over label sequences

• Yi is assumed to range over a finite label alphabet A

• Discriminative approach: we construct a conditional model p(y|x) and do not explicitly model marginal p(x)

CRF - Definition

• Let G = (V, E) be a finite graph, and let A be a finite alphabet

• Y is indexed by the vertices of G

• Then (X, Y) is a conditional random field if the random variables Y_v, conditioned on X, obey the Markov property with respect to the graph:

p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ~ v),

where w ~ v means that w and v are neighbors in G

CRF on Simple Chain Graph

• We will handle the case when G is a simple chain: G = (V = {1, …, m}, E = {(i, i+1)})

[Figure: graphical structures of an HMM (generative), an MEMM (discriminative), and a CRF]

Fundamental Theorem of Random Fields (Hammersley & Clifford)

• Assumption:
– The structure of G is a tree, of which the simple chain is a special case
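For reference, the form that the Hammersley-Clifford theorem yields for a tree-structured graph, as given in the original CRF paper by Lafferty, McCallum and Pereira, can be written as below. The notation (edge features f_k with weights λ_k, vertex features g_k with weights µ_k) is assumed to match the symbols used in the learning problem of this deck.

```latex
p_\theta(\mathbf{y} \mid \mathbf{x}) \;\propto\;
\exp\!\Bigg(
  \sum_{e \in E,\, k} \lambda_k \, f_k\big(e, \mathbf{y}|_e, \mathbf{x}\big)
  \;+\;
  \sum_{v \in V,\, k} \mu_k \, g_k\big(v, \mathbf{y}|_v, \mathbf{x}\big)
\Bigg)
```

Here y|_e and y|_v denote the components of y associated with edge e and vertex v respectively.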

CRF – the Learning Problem

• Assumption: the features fk and gk are given and fixed.
– For example, a boolean feature gk is TRUE if the word Xi is upper case and the label Yi is a “noun”.

• The learning problem
– We need to determine the parameters Θ = (λ1, λ2, . . . ; µ1, µ2, . . .) from training data D = {(x(i), y(i))} with empirical distribution p~(x, y).
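To make the role of the edge and vertex features concrete, here is a minimal sketch of a chain CRF that scores a label sequence and computes the normalizer with the forward algorithm, checked against brute-force enumeration. The two scoring functions stand in for the weighted sums of fk and gk; their values and the label set are hypothetical, chosen only for illustration.

```python
import itertools
import math

LABELS = ["noun", "verb"]

def edge_score(prev_label, label):
    # Stands in for sum_k lambda_k * f_k on one edge (hypothetical weight).
    return 1.0 if (prev_label, label) == ("noun", "verb") else 0.0

def vertex_score(word, label):
    # Stands in for sum_k mu_k * g_k on one vertex (hypothetical weight):
    # fires when the word is capitalized and the label is "noun".
    return 0.5 if word[:1].isupper() and label == "noun" else 0.0

def sequence_score(words, labels):
    total = sum(vertex_score(w, y) for w, y in zip(words, labels))
    total += sum(edge_score(a, b) for a, b in zip(labels, labels[1:]))
    return total

def log_partition(words):
    """log Z(x) by the forward algorithm, computed in log space."""
    alpha = {y: vertex_score(words[0], y) for y in LABELS}
    for word in words[1:]:
        alpha = {
            y: vertex_score(word, y)
            + math.log(sum(math.exp(alpha[p] + edge_score(p, y)) for p in LABELS))
            for y in LABELS
        }
    return math.log(sum(math.exp(a) for a in alpha.values()))

def conditional_prob(words, labels):
    # p(y | x) = exp(score(x, y)) / Z(x)
    return math.exp(sequence_score(words, labels) - log_partition(words))

words = ["Birds", "like", "flowers"]
all_labelings = list(itertools.product(LABELS, repeat=len(words)))
brute_force = math.log(sum(math.exp(sequence_score(words, ys)) for ys in all_labelings))
```

The forward recursion visits each position once, so its cost grows linearly with sequence length, whereas the brute-force sum grows exponentially. Learning Θ would additionally require gradients of the conditional log-likelihood, which this sketch does not cover.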

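Maximum Entropy Model

• Among the probability distributions that satisfy all the constraints we have identified, choose the one whose entropy is maximal

• Reflect the information that is known, but where nothing is certain, maximize the uncertainty, which yields a uniform probability distribution

Constraint:

$$p(\text{dans}) + p(\text{en}) + p(\text{à}) + p(\text{au cours de}) + p(\text{pendant}) = 1$$

Probability distribution that maximizes entropy:

$$p(\text{dans}) = 1/5, \quad p(\text{en}) = 1/5, \quad p(\text{à}) = 1/5, \quad p(\text{au cours de}) = 1/5, \quad p(\text{pendant}) = 1/5$$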

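Maximum Entropy Principle

• Build the model so that its entropy is maximal among the probability distributions that satisfy the constraints

• Information that is known, or that we intend to use, is strictly respected; cases that are not considered or not known are weighted equally, which gives a distribution that is not biased toward any particular part

$$H(p) = -\sum_{x \in X,\, y \in Y} p(x, y) \log p(x, y)$$

Ref. [1]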
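A small numerical check of the principle: with only the normalization constraint, the uniform distribution over the five French phrases has the largest entropy. The sketch below computes the entropy of a single distribution rather than of a joint p(x, y), and the skewed distribution is an arbitrary alternative made up for comparison.

```python
import math

def entropy(dist):
    """H(p) = -sum p log p, skipping zero-probability events."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

phrases = ["dans", "en", "à", "au cours de", "pendant"]
uniform = {w: 1 / 5 for w in phrases}
# Arbitrary non-uniform alternative that also sums to 1.
skewed = dict(zip(phrases, [0.5, 0.2, 0.1, 0.1, 0.1]))

h_uniform = entropy(uniform)
h_skewed = entropy(skewed)
```

For a uniform distribution over five events the entropy equals log 5, which is the upper bound for any distribution on five events.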

Maximum Entropy Example

• Event space: {NN, NNS, NNP, NNPS, VBZ, VBD}

• Empirical data:

NN    NNS   NNP   NNPS  VBZ   VBD
3     5     11    13    3     1

• Probability distribution that maximizes entropy
– Constraint: E[NN, NNS, NNP, NNPS, VBZ, VBD] = 1

Ref. [3]

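Maximum Entropy Example (continued)

– N* occurs more frequently than V*; add this as a feature function:

$$f_N = \{NN, NNS, NNP, NNPS\}, \qquad E[f_N] = 32/36$$

NN    NNS   NNP   NNPS  VBZ   VBD
8/36  8/36  8/36  8/36  2/36  2/36

– Proper nouns occur more frequently than common nouns:

$$f_P = \{NNP, NNPS\}, \qquad E[f_P] = 24/36$$

NN    NNS   NNP    NNPS   VBZ   VBD
4/36  4/36  12/36  12/36  2/36  2/36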
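The two tables can be reproduced from the empirical counts (3, 5, 11, 13, 3, 1). The sketch below relies on one fact: when the constraints only fix the total probability of disjoint groups of events, the maximum entropy solution spreads each group's mass uniformly inside the group. The helper names are hypothetical.

```python
from fractions import Fraction

TAGS = ["NN", "NNS", "NNP", "NNPS", "VBZ", "VBD"]
COUNTS = dict(zip(TAGS, [3, 5, 11, 13, 3, 1]))
TOTAL = sum(COUNTS.values())  # 36 observed events

def empirical_expectation(feature_tags):
    """E[f] for an indicator feature that fires on the given tags."""
    return Fraction(sum(COUNTS[t] for t in feature_tags), TOTAL)

def maxent_over_blocks(blocks):
    """blocks: list of (tags, mass); the mass is spread uniformly in each block."""
    dist = {}
    for tags, mass in blocks:
        for tag in tags:
            dist[tag] = mass / len(tags)
    return dist

nouns = ["NN", "NNS", "NNP", "NNPS"]
common = ["NN", "NNS"]
proper = ["NNP", "NNPS"]
verbs = ["VBZ", "VBD"]

e_noun = empirical_expectation(nouns)     # 32/36
e_proper = empirical_expectation(proper)  # 24/36

# Only the noun constraint is imposed.
step1 = maxent_over_blocks([(nouns, e_noun), (verbs, 1 - e_noun)])
# Noun and proper-noun constraints are both imposed.
step2 = maxent_over_blocks([
    (common, e_noun - e_proper),
    (proper, e_proper),
    (verbs, 1 - e_noun),
])
```

Exact fractions are used so the results can be compared directly with the table entries.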

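Components of a Maximum Entropy Model

• Feature functions
– Check whether predefined conditions are satisfied
– Usually defined as binary functions

• Constraints
– The information used when computing expectations is limited to the training documents

• Parameter estimation algorithms
– Methods for finding the weights of the feature functions
– GIS, IIS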

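Computing Probabilities with a Maximum Entropy Model

• Define the feature functions

• Define the constraints

• Compute the weights of the feature functions with the chosen algorithm

• Compute each probability using the weights

• Select the largest of the resulting probability values as the final probability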

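Feature Functions

• A trigger-style function that indicates whether a predefined condition has been met

• Decides whether the information we want to use is applicable to the context under consideration

$$f(h) = \begin{cases} 1 & \text{if } h \text{ meets some condition} \\ 0 & \text{otherwise} \end{cases}$$

Ref. [1]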
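A trigger-style binary feature of this kind is easy to express as a closure over a condition. The context layout (a dictionary with `word` and `prev_tag` keys) and the two example conditions are assumptions made for this sketch only.

```python
def make_feature(condition):
    """Wrap a predicate on the context h as a binary feature function f(h)."""
    def feature(h):
        return 1 if condition(h) else 0
    return feature

# Hypothetical conditions on a context h represented as a dict.
is_capitalized = make_feature(lambda h: h["word"][:1].isupper())
follows_determiner = make_feature(lambda h: h.get("prev_tag") == "DT")

context = {"word": "Birds", "prev_tag": None}
fired = [is_capitalized(context), follows_determiner(context)]
```

In practice a feature usually also looks at the candidate output y, giving f(h, y), as in the constraint equations.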

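Constraints

$$E_p[f_j] = E_{\tilde p}[f_j], \qquad 1 \le j \le k$$

Model expectation:

$$E_p[f_j] = \sum_{h \in H,\, y \in Y} p(h, y)\, f_j(h, y) \approx \sum_{i=1}^{n} \tilde p(h_i)\, p(y_i \mid h_i)\, f_j(h_i, y_i)$$

– \tilde p(h): value extracted from the training documents
– H: hard to know exactly, and even when it is known it can be too large to compute the average directly
– The expectation is therefore computed with an approximation that considers only the cases found in the training documents

Empirical expectation:

$$E_{\tilde p}[f_j] = \sum_{i=1}^{n} \tilde p(h_i, y_i)\, f_j(h_i, y_i)$$

– \tilde p(h_i, y_i): obtained from the training documents
– H, Y: the sets of all possible contexts and of all desired outputs, respectively
– n: the total number of cases obtained from the Cartesian product of the contexts h found in the training documents and y, that is, the number of cases the model considers

Ref. [1]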
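The two sides of the constraint can be computed as follows. This is a sketch under assumptions: contexts are plain strings, the empirical distribution is the relative frequency in a toy sample, and `cond_prob(y, h)` is a hypothetical stand-in for the model's p(y | h).

```python
from collections import Counter

def empirical_expectation(feature, samples):
    """Sum over observed pairs of p~(h, y) f(h, y), with p~ as relative frequency."""
    return sum(feature(h, y) for h, y in samples) / len(samples)

def model_expectation(feature, samples, labels, cond_prob):
    """Approximate E_p[f] using only the contexts h seen in training."""
    n = len(samples)
    context_counts = Counter(h for h, _ in samples)
    return sum(
        (count / n) * cond_prob(y, h) * feature(h, y)
        for h, count in context_counts.items()
        for y in labels
    )

# Toy data, invented for illustration.
samples = [("the", "DT"), ("dog", "NN"), ("the", "DT"), ("runs", "VB")]
labels = ["DT", "NN", "VB"]

def the_is_determiner(h, y):
    return 1 if h == "the" and y == "DT" else 0

def uniform_model(y, h):
    return 1 / len(labels)

observed = empirical_expectation(the_is_determiner, samples)
expected = model_expectation(the_is_determiner, samples, labels, uniform_model)
```

The gap between `observed` and `expected` is what parameter estimation has to close: training adjusts the weights until the two expectations agree for every feature.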

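Parameter Estimation

• Use Maximum Likelihood Estimation to find the p* that best reflects the probability information obtained by applying the chosen feature functions to the training documents

$$P = \{\, p \mid E_p[f_j] = E_{\tilde p}[f_j],\ j \in \{1, \ldots, k\} \,\}$$

$$Q = \Big\{\, p \;\Big|\; p(y \mid x) = \frac{1}{Z(x)} \prod_{j=1}^{k} \alpha_j^{f_j(x, y)},\ \ Z(x) = \sum_{y} \prod_{j=1}^{k} \alpha_j^{f_j(x, y)} \,\Big\}$$

$$L(p) = \sum_{x, y} \tilde p(x, y) \log p(y \mid x)$$

$$p^* = \arg\max_{q \in Q} L(q) = \arg\max_{p \in P} H(p)$$

where
– \tilde p(x, y): probability obtained from the training documents
– k: number of feature functions
– \alpha_j: weight corresponding to feature function f_j

Ref. [1]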
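The objective L(p) is a weighted sum of log conditional probabilities. The sketch below evaluates it for two hand-written conditional models on a toy sample; the data and both models are hypothetical, and no optimization is performed.

```python
import math
from collections import Counter

def conditional_log_likelihood(samples, cond_prob):
    """L(p) = sum over observed (x, y) of p~(x, y) * log p(y | x)."""
    n = len(samples)
    pair_counts = Counter(samples)
    return sum(
        (count / n) * math.log(cond_prob(y, x))
        for (x, y), count in pair_counts.items()
    )

# Toy training pairs (x, y), invented for illustration.
samples = [("the", "DT"), ("the", "DT"), ("dog", "NN"), ("dog", "VB")]

def coin_flip(y, x):
    # Ignores the input entirely.
    return 0.5

def informed(y, x):
    # Knows "the" is always DT, stays undecided for "dog".
    return 1.0 if x == "the" else 0.5

l_coin = conditional_log_likelihood(samples, coin_flip)
l_informed = conditional_log_likelihood(samples, informed)
```

The model that matches the training data more closely obtains the higher value, which is what the argmax over Q selects.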

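IIS (Improved Iterative Scaling)

Input: feature functions f_1, f_2, …, f_n; empirical distribution \tilde p(x, y)

Output: optimal parameter values \lambda_i^*; optimal probability distribution p_{\lambda^*}

Algorithm:

1. Start with \lambda_i = 0 for all i ∈ {1, 2, …, n}

2. For each i ∈ {1, 2, …, n}:

a. Find the \Delta\lambda_i that satisfies

$$\sum_{x, y} \tilde p(x)\, p(y \mid x)\, f_i(x, y) \exp\big(\Delta\lambda_i\, f^{\#}(x, y)\big) = \tilde p(f_i), \qquad \text{where } f^{\#}(x, y) = \sum_{i=1}^{n} f_i(x, y)$$

b. Update: \lambda_i ← \lambda_i + \Delta\lambda_i

3. If the \lambda_i have converged, stop; otherwise go to step 2

Ref. [1]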

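GIS (Generalized Iterative Scaling)

$$C \stackrel{\mathrm{def}}{=} \max_{x, y} \sum_{i=1}^{K} f_i(x, y)$$

$$f_{K+1}(x, y) = C - \sum_{i=1}^{K} f_i(x, y)$$

$$E_p f_i = \sum_{x, y} p(x, y)\, f_i(x, y)$$

– A sum over the event space of all possible x, y

$$E_{\tilde p} f_i = \sum_{x, y} \tilde p(x, y)\, f_i(x, y) = \frac{1}{N} \sum_{j=1}^{N} f_i(x_j, y_j)$$

– The empirical expectation, where N is the number of elements in the training documents

• Computing E_p f_i as a sum over all possible combinations of x and y is difficult because that set is large or infinite

• Therefore x is approximated by the values of x that appear in the training documents:

$$E_p f_i \approx \sum_{x, y} \tilde p(x)\, p(y \mid x)\, f_i(x, y) = \frac{1}{N} \sum_{j=1}^{N} \sum_{y} p(y \mid x_j)\, f_i(x_j, y)$$

Ref. [2]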

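GIS (Generalized Iterative Scaling)

1. Set the initial values of \alpha_i^{(1)}, usually \alpha_i^{(1)} = 1 for all 1 ≤ i ≤ K+1. Compute E_{\tilde p} f_i and set n = 1.

2. With the given {\alpha_i^{(n)}}, compute p^{(n)}(x, y) for each element (x, y) in the training documents:

$$p^{(n)}(x, y) = \frac{1}{Z} \prod_{i=1}^{K+1} \big(\alpha_i^{(n)}\big)^{f_i(x, y)}, \qquad \text{where } Z = \sum_{x, y} \prod_{i=1}^{K+1} \big(\alpha_i^{(n)}\big)^{f_i(x, y)}$$

3. Compute E_{p^{(n)}} f_i for all 1 ≤ i ≤ K+1.

4. Update the parameters \alpha_i:

$$\alpha_i^{(n+1)} = \alpha_i^{(n)} \left( \frac{E_{\tilde p} f_i}{E_{p^{(n)}} f_i} \right)^{1/C}$$

5. If the parameter values have converged, stop; otherwise increment n by one and go to step 2.

Ref. [2]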
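The five steps can be sketched for a small joint model over the part-of-speech events used in the maximum entropy example (counts 3, 5, 11, 13, 3, 1, with a noun feature and a proper-noun feature). This is an illustrative implementation under assumptions, not the deck's reference code: it works with log weights for numerical convenience, and the function names, tolerance, and iteration cap are arbitrary choices.

```python
import math

EVENTS = ["NN", "NNS", "NNP", "NNPS", "VBZ", "VBD"]
COUNTS = [3, 5, 11, 13, 3, 1]
NOUNS = {"NN", "NNS", "NNP", "NNPS"}
PROPER = {"NNP", "NNPS"}

def base_features(event):
    # f_N and f_P as binary indicators.
    return [1.0 if event in NOUNS else 0.0, 1.0 if event in PROPER else 0.0]

def gis(events, counts, max_iter=10000, tol=1e-10):
    raw = [base_features(e) for e in events]
    C = max(sum(f) for f in raw)
    feats = [f + [C - sum(f)] for f in raw]  # append correction feature f_{K+1}
    k = len(feats[0])
    total = sum(counts)
    # Empirical expectations from relative frequencies.
    empirical = [sum(c * f[i] for c, f in zip(counts, feats)) / total for i in range(k)]
    log_alpha = [0.0] * k                     # step 1: alpha_i = 1
    p = []
    for _ in range(max_iter):
        # Step 2: p(x) = (1/Z) prod_i alpha_i^{f_i(x)}
        weights = [math.exp(sum(l * fi for l, fi in zip(log_alpha, f))) for f in feats]
        Z = sum(weights)
        p = [w / Z for w in weights]
        # Step 3: model expectations.
        expected = [sum(pi * f[i] for pi, f in zip(p, feats)) for i in range(k)]
        # Step 4: alpha_i <- alpha_i * (empirical / expected)^(1/C), in log form.
        step = [math.log(empirical[i] / expected[i]) / C for i in range(k)]
        log_alpha = [l + s for l, s in zip(log_alpha, step)]
        # Step 5: stop once the update is negligible.
        if max(abs(s) for s in step) < tol:
            break
    return dict(zip(events, p))

dist = gis(EVENTS, COUNTS)
```

With these two features the fitted distribution should approach 4/36, 4/36, 12/36, 12/36, 2/36, 2/36, the same values as in the final table of the example.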

Conclusions

• Conditional random fields offer a unique combination of properties:
– Discriminatively trained models for sequence segmentation and labeling
– Combination of arbitrary and overlapping observation features from both the past and future
– Efficient training and decoding based on dynamic programming for a simple chain graph
– Parameter estimation guaranteed to find the global optimum

• The main current limitation of CRFs is the slow convergence of the training algorithm relative to MEMMs, let alone to HMMs, for which training on fully observed data is very efficient.