Dirichlet Processes in Dialogue Modelling Nigel Crook March 2009.

Page 1

Dirichlet Processes in Dialogue Modelling

Nigel Crook, March 2009

Page 2

Overview

• The COMPANIONS project
• Dialogue Acts
• Document Clustering
• Multinomial Distribution
• Dirichlet Distribution
• Graphical Models
• Bayesian Finite Mixture Models
• Dirichlet Processes
• Chinese Restaurant Process
• Concluding Thoughts

With thanks to Percy Liang and Dan Klein (UC Berkeley) [1]

[1] Structured Bayesian Nonparametric Models with Variational Inference, ACL Tutorial in Prague, Czech Republic, June 24, 2007.

Page 3

The COMPANIONS project

COMPANIONS: Intelligent, Persistent, Personalised Multimodal Interfaces to the Internet

One Companion on many platforms:

“Okay, but please play some relaxing music then”

“Your pulse is a bit high, please slow down a bit.”

Page 4

The COMPANIONS project

Proposed Dialogue System Architecture:

USER → (Signal) → Speech Recognition → (Words) → Language Understanding → (Concepts) → Dialogue Model → User Intentions (DAs) → Dialogue Manager (with DB) → System Intentions (DAs) → Language Generation → (Words) → Speech Synthesizer → (Signal) → USER

Page 5

Dialogue Acts

A Dialogue Act is a linguistic abstraction that attempts to capture the intention/purpose of an utterance.

DAs are based on the concept of a speech act – “When we say something, we do something” (Austin, 1962)

Examples of DA labels using the DAMSL scheme on the Switchboard corpus:

Example                                     Dialogue Act
Me, I’m in the legal department.            Statement-non-opinion
Uh-huh.                                     Acknowledge (Backchannel)
I think it’s great                          Statement-opinion
That’s exactly it.                          Agree/Accept
So, -                                       Abandoned or Turn-Exit
I can imagine.                              Appreciation
Do you have to have any special training?   Yes-No-Question

Page 6

Dialogue Act Classification

Research question: Can major DA categories be identified automatically through the clustering of utterances?

Each utterance can be treated as a ‘bag of (content) words’ …

What time is the next train to Oxford ?

Can then apply methods from document clustering
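A minimal sketch of the bag-of-(content)-words step; the tokeniser and stoplist here are illustrative assumptions, not part of the slides:

```python
from collections import Counter

# Illustrative stoplist; a real system would use a proper stopword list.
STOP_WORDS = {"what", "is", "the", "to", "a", "do", "you"}

def bag_of_words(utterance):
    """Lower-case, tokenise, drop stop words, and count the remaining content words."""
    tokens = utterance.lower().replace("?", " ").split()
    return Counter(t for t in tokens if t not in STOP_WORDS)

bow = bag_of_words("What time is the next train to Oxford ?")
# Content words only: time, next, train, oxford (each counted once)
```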

Page 7

Document Clustering

Working example: Document clustering

Page 8

Document Clustering

Each document is a ‘bag of (content) words’

How many clusters?

In parametric methods the number of clusters is specified at the outset.

Bayesian nonparametric methods (such as Gaussian Processes and Dirichlet Processes) automatically infer how many clusters there are from the data.

Page 9

Multinomial Distribution

A multinomial probability distribution is a distribution over all the possible outcomes of a multinomial experiment.

A fair die: π_i = 1/6 for each outcome i ∈ {1, 2, 3, 4, 5, 6}.
A weighted die: unequal π_i, with Σ_i π_i = 1.

z ~ Multinomial_6(π)

Each draw from a multinomial distribution yields an integer, e.g. 5, 2, 3, 2, 6 …
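For example, with NumPy (the weights of the weighted die below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

fair = np.full(6, 1 / 6)                                # pi_i = 1/6 for a fair die
weighted = np.array([0.05, 0.05, 0.1, 0.1, 0.2, 0.5])   # an example weighted die

# Each draw from the multinomial yields an integer outcome in {1, ..., 6}.
draws = rng.choice(np.arange(1, 7), size=5, p=weighted)
```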

Page 10

Dirichlet Distribution

Each point on a k-dimensional simplex is a multinomial probability distribution, with π_i ≥ 0 and Σ_i π_i = 1.

[Figure: the simplex over outcomes 1, 2, 3, with an example point π = (0.3, 0.5, 0.2).]

Page 11

Dirichlet Distribution

A Dirichlet Distribution is a distribution over multinomial distributions in the simplex.

[Figure: example density surfaces over the 3-simplex (π_1, π_2, π_3).]

Page 12

Dirichlet Distribution

The Dirichlet Distribution is parameterised by a set of concentration parameters defined over the k-simplex:

α = (α_1, …, α_k), with α_i > 0 for i ∈ {1 … k}

A draw from a Dirichlet Distribution is written as:

π ~ Dirichlet_k(α)

where π is a multinomial distribution over k outcomes.
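For example, with NumPy (the concentration values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([5.0, 5.0, 5.0])   # concentration parameters over the 3-simplex
pi = rng.dirichlet(alpha)           # a single draw: a point on the simplex

# pi is itself a valid multinomial distribution: non-negative, summing to 1.
```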

Page 13

Dirichlet Distribution

Example draws from a Dirichlet Distribution over the 3-simplex:

[Figure: samples on the simplex (π_1, π_2, π_3) for Dirichlet(5, 5, 5), Dirichlet(0.2, 5, 0.2) and Dirichlet(0.5, 0.5, 0.5).]

Page 14

Graphical Models

A → B:  p(A, B) = p(B|A) p(A)

Plate notation: a node B_i inside a plate indexed i = 1 … n is shorthand for the n nodes B_1, B_2, …, B_n, each with A as parent.

Page 15

Bayesian Finite Mixture Model

π ~ Dirichlet_k(α, …, α)

Component parameters θ_z (z ∈ (1 … k)) are drawn from a base measure G_0:
θ_z ~ G_0 (e.g. Dirichlet_v(β, …, β))

For each data point (document) a component z_i is drawn:
z_i ~ Multinomial(π)

and the data point is drawn from some distribution F(θ):
x_i ~ F(θ_{z_i}) (e.g. Multinomial(θ_{z_i}))

Parameters: (π, θ) = (π_1 … π_k, θ_1 … θ_k)
Hidden variables: z = (z_1 … z_n)
Observed data: x = (x_1 … x_n)

p(z, x, π, θ) = p(π) ∏_{z=1}^{k} p(θ_z) ∏_{i=1}^{n} p(z_i | π) p(x_i | θ_{z_i})
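The generative process can be sketched end-to-end; k, v, n, α, β and the document lengths below are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

k, v, n = 2, 3, 5        # clusters, word types, documents (illustrative)
alpha, beta = 1.0, 0.5   # symmetric concentration parameters (illustrative)

pi = rng.dirichlet(np.full(k, alpha))             # pi ~ Dirichlet_k(alpha, ..., alpha)
theta = rng.dirichlet(np.full(v, beta), size=k)   # theta_z ~ Dirichlet_v(beta, ..., beta)

docs = []
for i in range(n):
    z_i = rng.choice(k, p=pi)                         # z_i ~ Multinomial(pi)
    length = rng.integers(3, 7)                       # document length (illustrative)
    words = rng.choice(v, size=length, p=theta[z_i])  # x_i ~ Multinomial_v(theta_{z_i})
    docs.append((z_i, words))
```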

Page 16

Bayesian Finite Mixture Model

Document clustering example:

k = 2 clusters, v = 3 word types

π ~ Dirichlet_k(α_1, α_2)
θ_z ~ Dirichlet_v(β_1, β_2, β_3)

Choose a source for each data point (document) i ∈ {1, … n}:
z_i ~ Multinomial_k(π)    e.g. z_1 = 1, z_2 = 2, z_3 = 2, z_4 = 1, z_5 = 2

Generate the data point (the words in the document) using that source:
x_i ~ Multinomial_v(θ_{z_i})    e.g. x_1 = ACAAB, x_2 = ACCBCC, x_3 = CCC, x_4 = CABAAC, x_5 = ACC

Page 17

Data Generation Demo

Component = 1 words = Id: 0 [1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1]
Component = 0 words = Id: 1 [0, 1, 2, 2, 0, 0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 2]
Component = 1 words = Id: 2 [1, 1, 2, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
Component = 1 words = Id: 3 [1, 1, 1, 1, 1]
Component = 1 words = Id: 4 [1, 1, 1, 1, 1, 1, 1, 1]
Component = 0 words = Id: 5 [0, 2, 0, 0, 0, 2, 2, 0, 0, 1, 0, 2, 0, 2, 1, 2, 0, 0, 2]
Component = 1 words = Id: 6 [1, 1, 1, 1, 1, 1]
Component = 0 words = Id: 7 [0, 2, 2, 0, 0, 2, 2, 0, 2, 0]
Component = 0 words = Id: 8 [0, 0, 2, 1, 2, 2]
Component = 0 words = Id: 9 [2, 0, 1, 0, 2, 0, 2, 1, 0, 2, 2, 1, 1, 2, 0]
Component = 1 words = Id: 10 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
Component = 2 words = Id: 11 [0, 0, 2, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
Component = 0 words = Id: 12 [1, 0, 1, 0, 0, 0, 2, 2, 0, 0, 2, 0, 2, 1, 0, 0]
Component = 1 words = Id: 13 [1, 1, 1, 2, 1, 1, 1]
Component = 0 words = Id: 14 [0, 2, 2, 0, 2, 0, 2, 0, 0, 0, 2, 1, 2]
Component = 0 words = Id: 15 [2, 0, 0, 0, 1, 2, 0, 2, 0, 2, 0, 2, 0]
Component = 1 words = Id: 16 [1, 1, 1, 1, 1]
Component = 0 words = Id: 17 [1, 1, 0, 0, 2, 1, 2, 0, 0, 0, 1, 2, 1]
Component = 1 words = Id: 18 [1, 1, 1, 1, 1, 1, 0, 2, 1]
Component = 1 words = Id: 19 [1, 1, 0, 2, 1, 1, 1, 1, 0]
Component = 2 words = Id: 20 [0, 1, 0, 2, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 2]

Page 18

Dirichlet Processes

Dirichlet Processes can be thought of as a generalisation of Dirichlet distributions to infinitely many dimensions … but not quite!

As the dimension k of a Dirichlet distribution increases:

π ~ Dirichlet_k(α/k, …, α/k)

[Figure: example draws for k = 2, 4, 6, 8, 10, 12, 18.]

The Dirichlet distribution is symmetric.

For a Dirichlet Process we need the larger components to appear near the beginning of the distribution on average.

Page 19

Dirichlet Processes

Stick breaking construction (GEM):

β_k ~ Beta(1, α)

π_k = β_k ∏_{i=1}^{k−1} (1 − β_i)
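A minimal sketch of the stick-breaking construction; the truncation level is an assumption for illustration, since the true GEM sequence is infinite:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, truncation):
    """Sample pi_1, ..., pi_T from GEM(alpha) via stick breaking (truncated at T)."""
    betas = rng.beta(1.0, alpha, size=truncation)    # beta_k ~ Beta(1, alpha)
    # Length of stick remaining before each break: prod_{i<k} (1 - beta_i).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining                         # pi_k = beta_k * prod_{i<k} (1 - beta_i)

pi = stick_breaking(alpha=1.0, truncation=50)
# The weights sum to just under 1, and larger weights tend to come first.
```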

Page 20

Dirichlet Process Mixture Model

Definition: as the finite mixture model, but the finite π ~ Dirichlet_k(α, …, α) is replaced by

π ~ GEM(α)   with π = (π_1, π_2, …)

Component parameters θ_z (z ∈ (1, 2, …)) are drawn from a base measure G_0:
θ_z ~ G_0

For each data point (document) a component z_i is drawn:
z_i ~ Multinomial(π)

and the data point is drawn from some distribution F(θ):
x_i ~ F(θ_{z_i}) (e.g. Multinomial(θ_{z_i}))

Page 21

Chinese Restaurant Process

The Chinese Restaurant Process is one view of DPs:

Tables = clusters
Customers = data points (documents)
Dishes = component parameters

[Figure: customers x_1 … x_7 seated at tables 1, 2, 3, 4, 5, …; each arriving customer joins an existing table with probability proportional to the number of customers already seated there, or starts a new table with probability proportional to α.]

Page 22

Chinese Restaurant Process

Shut your eyes if you don’t want to see any more maths …

θ_i | θ_1, …, θ_{i−1} ~ (1 / (i − 1 + α)) Σ_{j=1}^{i−1} δ_{θ_j} + (α / (i − 1 + α)) G_0

The “rich get richer” principle: tables with more customers get more customers on average.
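The seating rule can be simulated directly; α and the number of customers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def chinese_restaurant_process(n_customers, alpha):
    """Seat customers one at a time; return each customer's table index and table sizes."""
    tables = []   # tables[j] = number of customers at table j
    seating = []
    for i in range(n_customers):
        # Join existing table j with prob n_j / (i + alpha); new table with prob alpha / (i + alpha).
        probs = np.array(tables + [alpha]) / (i + alpha)
        choice = rng.choice(len(tables) + 1, p=probs)
        if choice == len(tables):
            tables.append(1)   # start a new table
        else:
            tables[choice] += 1
        seating.append(choice)
    return seating, tables

seating, tables = chinese_restaurant_process(100, alpha=1.0)
```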

Page 23

CRP Initial Clustering Demo

Initial table allocations

• 100 documents
• 3 sources
• 5 to 20 words per document

Page 24

CRP Table parameters

Each cluster (table) is given a parameter (dish) θ_i which all the data points (customers) in that cluster share.

These are drawn from the base measure G_0 (a Dirichlet distribution in this case).

[Figure: tables 1 – 5, each with its own distribution over the word types 1, 2, 3.]

Page 25

CRP Inference

The goal of Bayesian inference is to calculate the posterior:

p(π, θ, z | x)

[Graphical model: a plate over i = 1 … n containing z_i → x_i, and a plate over z = 1 … k containing θ_z.]

The posterior cannot usually be sampled directly, so Gibbs sampling can be used:

θ_k | θ_1, …, θ_{k−1}, x ~ (1/C) [ Σ_{j=1}^{k−1} p(x_k | θ_j) δ_{θ_j} + α G_0(θ_k) p(x_k | θ_k) ]

where C = Σ_{j=1}^{k−1} p(x_k | θ_j) + α ∫ p(x_k | θ) G_0(θ) dθ

Page 26

CRP Inference - reclustering

[Figure: customers x_1 … x_7 at tables 1 – 5; a customer (e.g. x_2 or x_4) is removed from its table and reseated by the Gibbs sampling step, possibly at a new table.]

Page 27

CRP Inference – table updates

[Figure: after reclustering, each table’s parameter (its distribution over the word types 1, 2, 3) is updated by combining the words of the customers now seated at it.]

Page 28

CRP Inference Demo

Page 29

Concluding Thoughts

The CRP works well on the toy document clustering example:
• Document size: 100+ words
• Up to 6 word types
• 100 – 500 documents

Will it work when clustering utterances?
• Utterance size: 1 – 20 words
• Up to 6 word types
• 100 – 500 documents

This is a much harder classification problem.