Dirichlet Processes in Dialogue Modelling Nigel Crook March 2009.
Dirichlet Processes in Dialogue Modelling
Nigel Crook, March 2009
Overview
• The COMPANIONS project
• Dialogue Acts
• Document Clustering
• Multinomial Distribution
• Dirichlet Distribution
• Graphical Models
• Bayesian Finite Mixture Models
• Dirichlet Processes
• Chinese Restaurant Process
• Concluding Thoughts
With thanks to Percy Liang and Dan Klein (UC Berkeley)¹
¹ Structured Bayesian Nonparametric Models with Variational Inference, ACL Tutorial in Prague, Czech Republic, June 24, 2007.
The COMPANIONS project
COMPANIONS: Intelligent, Persistent, Personalised Multimodal Interfaces to the Internet
One Companion on many platforms
“Your pulse is a bit high, please slow down a bit.”
“Okay, but please play some relaxing music then”
The COMPANIONS project: Proposed Dialogue System Architecture
[Architecture diagram: the USER's speech signal passes through Speech Recognition to words, then Language Understanding to concepts and user intentions (DAs); these feed the Dialogue Model and the Dialogue Manager (backed by a DB); system intentions (DAs) flow back through Language Generation to words and a Speech Synthesizer, returning a signal to the user.]
Dialogue Acts
A Dialogue Act (DA) is a linguistic abstraction that attempts to capture the intention/purpose of an utterance.
DAs are based on the concept of a speech act – “When we say something, we do something” (Austin, 1962).
Examples of DA labels using the DAMSL scheme on the Switchboard corpus:

Example                                    | Dialogue Act
Me, I’m in the legal department.           | Statement-non-opinion
Uh-huh.                                    | Acknowledge (Backchannel)
I think it’s great.                        | Statement-opinion
That’s exactly it.                         | Agree/Accept
So, -                                      | Abandoned or Turn-Exit
I can imagine.                             | Appreciation
Do you have to have any special training?  | Yes-No-Question
Dialogue Act Classification
Research question: Can major DA categories be identified automatically through the clustering of utterances?
Each utterance can be treated as a ‘bag of (content) words’ …
What time is the next train to Oxford?
Can then apply methods from document clustering
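As a minimal sketch of the bag-of-words idea, an utterance can be reduced to content-word counts before clustering. The stop-word list here is an illustrative assumption, not one taken from the source:

```python
from collections import Counter

def bag_of_words(utterance, stop_words=frozenset({"is", "the", "to", "what"})):
    """Reduce an utterance to a bag (multiset) of its content words."""
    tokens = utterance.lower().replace("?", "").split()
    return Counter(t for t in tokens if t not in stop_words)

bag = bag_of_words("What time is the next train to Oxford ?")
print(bag)  # word order is discarded; only content-word counts remain
```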
Document Clustering
Working example: document clustering
Each document is a ‘bag of (content) words’
How many clusters?
In parametric methods the number of clusters is specified at the outset.
Bayesian nonparametric methods (Gaussian Processes and Dirichlet Processes) automatically detect how many clusters there are.
Multinomial Distribution
A multinomial probability distribution is a distribution over all the possible outcomes of a multinomial experiment: θ = (θ_1 … θ_k) with θ_i ≥ 0 and Σ_i θ_i = 1.
[Bar charts: a fair die, θ_i = 1/6 for each outcome 1 … 6, and a weighted die with unequal θ_i over the same outcomes.]
Each draw from a multinomial distribution, written z ~ Multinomial_6(θ), yields an integer, e.g. 5, 2, 3, 2, 6 …
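A draw z ~ Multinomial_6(θ) can be simulated with NumPy; the weighted-die probabilities below are illustrative values, not ones from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

fair = np.full(6, 1 / 6)                                # theta_i = 1/6 for a fair die
weighted = np.array([0.05, 0.05, 0.1, 0.1, 0.2, 0.5])   # a weighted die

# Each draw yields an integer outcome in {1, ..., 6}
draws = rng.choice(np.arange(1, 7), size=10, p=weighted)
print(draws)
```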
Dirichlet Distribution
Each point on a k-dimensional simplex is a multinomial probability distribution: θ = (θ_1 … θ_k) with θ_i ≥ 0 and Σ_i θ_i = 1.
[Diagram: the 3-simplex with vertices θ_1, θ_2, θ_3, showing an example point θ = (0.3, 0.5, 0.2).]
Dirichlet Distribution
A Dirichlet Distribution is a distribution over multinomial distributions in the simplex.
[Diagram: a density over the 3-simplex; each point of the simplex is itself a multinomial distribution θ = (θ_1, θ_2, θ_3).]
Dirichlet Distribution
The Dirichlet Distribution is parameterised by a set of concentration parameters α = (α_1 … α_k), α_i > 0, defined over the k-simplex.
A draw from a Dirichlet Distribution is written as:
θ ~ Dirichlet_k(α)
where θ is a multinomial distribution over k outcomes.
Dirichlet Distribution
Example draws from a Dirichlet Distribution over the 3-simplex:
[Diagram: sampled points on the 3-simplex for Dirichlet(5, 5, 5) (clustered around the centre), Dirichlet(0.2, 5, 0.2) (pulled towards the θ_2 vertex), and Dirichlet(0.5, 0.5, 0.5) (pushed towards the corners).]
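The three parameter settings Dirichlet(5, 5, 5), Dirichlet(0.2, 5, 0.2) and Dirichlet(0.5, 0.5, 0.5) can be sampled directly with NumPy's `dirichlet`; each draw is itself a point on the 3-simplex:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each Dirichlet draw is itself a multinomial distribution on the 3-simplex
for alpha in [(5, 5, 5), (0.2, 5, 0.2), (0.5, 0.5, 0.5)]:
    theta = rng.dirichlet(alpha, size=4)
    print(alpha, theta.round(3))
```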
Graphical Models
[Diagram: node A with an arrow to node B.]
p(A, B) = p(B | A) p(A)
[Diagram: plate notation – a box (plate) around B_i labelled i = 1 … n is shorthand for n child nodes B_1, B_2, …, B_n, each with an arrow from A.]
Bayesian Finite Mixture Model
[Plate diagram: π → z_i → x_i ← θ_z, with plates over i = 1 … n and z = 1 … k.]
π ~ Dirichlet_k(α, …, α)
Component parameters θ_z (z ∈ {1 … k}) are drawn from a base measure G0:
θ_z ~ G0 (e.g. Dirichlet_v(β, …, β))
For each data point (document) a component z_i is drawn:
z_i ~ Multinomial(π)
and the data point is drawn from some distribution F(θ):
x_i ~ F(θ_{z_i}) (e.g. Multinomial(θ_{z_i}))
Parameters: (π, θ) = (π_1 … π_k, θ_1 … θ_k)
Hidden variables: z = (z_1 … z_n)
Observed data: x = (x_1 … x_n)
Joint distribution:
p(z, x, θ, π) = p(π) ∏_{z=1}^{k} p(θ_z) ∏_{i=1}^{n} p(z_i | π) p(x_i | θ_{z_i})
Bayesian Finite Mixture Model: document clustering example
k = 2 clusters, v = 3 word types (A, B, C)
π ~ Dirichlet_2(α, α)
θ_z ~ Dirichlet_3(β, β, β) for z ∈ {1, 2}
Choose a source for each data point (document) i ∈ {1, … n}:
z_i ~ Multinomial_2(π), e.g. z_1 = 1, z_2 = 2, z_3 = 2, z_4 = 1, z_5 = 2
Generate the data point (the words in the document) using that source:
x_i ~ Multinomial_3(θ_{z_i}), e.g. x_1 = ACAAB, x_2 = ACCBCC, x_3 = CCC, x_4 = CABAAC, x_5 = ACC
Data Generation Demo
Component = 1 words = Id: 0 [1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1]
Component = 0 words = Id: 1 [0, 1, 2, 2, 0, 0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 2]
Component = 1 words = Id: 2 [1, 1, 2, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
Component = 1 words = Id: 3 [1, 1, 1, 1, 1]
Component = 1 words = Id: 4 [1, 1, 1, 1, 1, 1, 1, 1]
Component = 0 words = Id: 5 [0, 2, 0, 0, 0, 2, 2, 0, 0, 1, 0, 2, 0, 2, 1, 2, 0, 0, 2]
Component = 1 words = Id: 6 [1, 1, 1, 1, 1, 1]
Component = 0 words = Id: 7 [0, 2, 2, 0, 0, 2, 2, 0, 2, 0]
Component = 0 words = Id: 8 [0, 0, 2, 1, 2, 2]
Component = 0 words = Id: 9 [2, 0, 1, 0, 2, 0, 2, 1, 0, 2, 2, 1, 1, 2, 0]
Component = 1 words = Id: 10 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
Component = 2 words = Id: 11 [0, 0, 2, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
Component = 0 words = Id: 12 [1, 0, 1, 0, 0, 0, 2, 2, 0, 0, 2, 0, 2, 1, 0, 0]
Component = 1 words = Id: 13 [1, 1, 1, 2, 1, 1, 1]
Component = 0 words = Id: 14 [0, 2, 2, 0, 2, 0, 2, 0, 0, 0, 2, 1, 2]
Component = 0 words = Id: 15 [2, 0, 0, 0, 1, 2, 0, 2, 0, 2, 0, 2, 0]
Component = 1 words = Id: 16 [1, 1, 1, 1, 1]
Component = 0 words = Id: 17 [1, 1, 0, 0, 2, 1, 2, 0, 0, 0, 1, 2, 1]
Component = 1 words = Id: 18 [1, 1, 1, 1, 1, 1, 0, 2, 1]
Component = 1 words = Id: 19 [1, 1, 0, 2, 1, 1, 1, 1, 0]
Component = 2 words = Id: 20 [0, 1, 0, 2, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 2]
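The demo output above can be reproduced in outline by sampling from the finite mixture model's generative story. This is a sketch, not the author's demo code; the concentration values α and β and the random seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

k, v, n = 3, 3, 21           # components, word types, documents
alpha, beta = 1.0, 0.5       # concentration parameters (illustrative values)

pi = rng.dirichlet([alpha] * k)               # mixture weights
theta = rng.dirichlet([beta] * v, size=k)     # one word distribution per component

for doc_id in range(n):
    z = rng.choice(k, p=pi)                   # choose a source (component)
    length = rng.integers(5, 21)              # 5 to 20 words per document
    words = rng.choice(v, size=length, p=theta[z])
    print(f"Component = {z} words = Id: {doc_id} {words.tolist()}")
```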
Dirichlet Processes
Dirichlet Processes can be thought of as a generalisation of the Dirichlet distribution to infinite dimensions … but not quite!
As the dimension k of a Dirichlet distribution increases …
θ ~ Dirichlet_k(α/k, …, α/k)
[Diagram: sampled component weights for k = 2, 4, 6, 8, 10, 12, 18.]
The Dirichlet distribution is symmetric.
For a Dirichlet Process we need the larger components to appear near the beginning of the distribution on average.
Dirichlet Processes
Stick-breaking construction (GEM) …
β_k ~ Beta(1, α)
π_k = β_k ∏_{i<k} (1 − β_i)
[Diagram: a stick of length 1 broken recursively; each π_k is a fraction β_k of the stick remaining after the first k − 1 breaks.]
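The stick-breaking recursion β_k ~ Beta(1, α), π_k = β_k ∏_{i<k} (1 − β_i) can be sketched in a few lines of NumPy (truncated to a finite number of sticks for illustration):

```python
import numpy as np

def stick_breaking(alpha, n_sticks, rng):
    """Draw the first n_sticks weights pi_k of a GEM(alpha) sample."""
    betas = rng.beta(1.0, alpha, size=n_sticks)            # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining                               # pi_k = beta_k * prod_{i<k} (1 - beta_i)

rng = np.random.default_rng(0)
pi = stick_breaking(alpha=1.0, n_sticks=20, rng=rng)
print(pi.round(3), "leftover stick:", 1.0 - pi.sum())
```

Larger components tend to appear early in the sequence, which is exactly the asymmetry the slide asks for.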
Dirichlet Process Mixture Model
Definition:
π ~ GEM(α), π = (π_1, π_2, …)
Component parameters θ_z (z ∈ {1, 2, …}) are drawn from a base measure G0:
θ_z ~ G0
For each data point (document) a component z_i is drawn:
z_i ~ Multinomial(π)
and the data point is drawn from some distribution F(θ):
x_i ~ F(θ_{z_i}) (e.g. Multinomial(θ_{z_i}))
[Diagram: data points x_1 … x_7 assigned to components.]
The Chinese Restaurant Process is one view of DPs
Tables = clusters
Customers = data points (documents)
Dishes = component parameters
Chinese Restaurant Process
[Diagram: tables 1, 2, 3, 4, 5, …; each new customer sits at an occupied table with probability proportional to the number of customers already seated there, or at a new table with probability proportional to α.]
Chinese Restaurant Process
Shut your eyes if you don’t want to see any more maths …
θ_i | θ_1, …, θ_{i−1} ~ (α / (i − 1 + α)) G0 + (1 / (i − 1 + α)) Σ_{j<i} δ(θ_j)
The “rich get richer” principle: tables with more customers get more customers on average
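The seating rule, and the rich-get-richer effect it produces, can be simulated directly; the concentration value α = 1.0 below is an illustrative choice:

```python
import numpy as np

def crp(n_customers, alpha, rng):
    """Seat customers one at a time; return each customer's table and the table sizes."""
    tables = []    # tables[j] = number of customers at table j
    seating = []
    for i in range(n_customers):
        # P(existing table j) ∝ tables[j];  P(new table) ∝ alpha
        probs = np.array(tables + [alpha], dtype=float)
        probs /= i + alpha                 # i customers already seated
        j = rng.choice(len(probs), p=probs)
        if j == len(tables):
            tables.append(1)               # open a new table
        else:
            tables[j] += 1
        seating.append(j)
    return seating, tables

rng = np.random.default_rng(0)
seating, tables = crp(100, alpha=1.0, rng=rng)
print("tables:", tables)
```

The early tables typically end up with most of the customers, illustrating the principle above.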
CRP Initial Clustering Demo
Initial table allocations
100 documents, 3 sources, 5 to 20 words per document
CRP Table Parameters
Each cluster (table) is given a parameter (dish) θ_j which all the data points (customers) in that cluster share.
These are drawn from the base measure G0 (a Dirichlet distribution in this case).
[Diagram: five tables, each with its own multinomial distribution θ over the three word types.]
CRP Inference
The goal of Bayesian inference is to calculate the posterior:
p(θ, π, z | x)
[Plate diagram: π → z_i → x_i ← θ_z, with plates over i = 1 … n and z = 1 … k.]
The posterior cannot usually be sampled directly, but we can use Gibbs sampling …
θ_k | θ_1, …, θ_{k−1}, x ∝ α G0(θ_k) p(x_k | θ_k) + Σ_{j<k} C_j δ(θ_j) p(x_k | θ_j)
where C_j is the number of customers seated at table j.
CRP Inference – Reclustering
[Diagram: each document x_1 … x_7 is in turn removed from its table and reseated according to the CRP probabilities; here x_2 and x_4 change tables.]
CRP Inference – Table Updates
[Diagram: after reseating, each table's parameter (dish) is updated from the word counts of the documents now seated at it.]
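The reclustering and table-update steps can be sketched as a toy uncollapsed Gibbs sampler: each document is removed from its table and reseated with probability proportional to C_j · p(x | θ_j) for existing tables and α · p(x | θ_new) for a fresh table, then every dish is resampled from its Dirichlet posterior. This is a hypothetical illustration, not the author's demo code; the data, α, and β are made up:

```python
import numpy as np
from collections import Counter

def log_lik(doc, theta):
    """log p(doc | theta) for a bag-of-words document under multinomial theta."""
    return sum(c * np.log(theta[w]) for w, c in doc.items())

def gibbs_sweep(docs, seats, thetas, alpha, beta, v, rng):
    """One sweep: reseat every document, then resample every table's dish."""
    for i, doc in enumerate(docs):
        old = seats[i]
        seats[i] = None
        occupancy = Counter(s for s in seats if s is not None)
        if old not in occupancy:                   # table emptied: drop its dish
            thetas.pop(old, None)
        tables = sorted(thetas)
        new_theta = rng.dirichlet([beta] * v)      # candidate dish for a new table
        logs = [np.log(occupancy[j]) + log_lik(doc, thetas[j]) for j in tables]
        logs.append(np.log(alpha) + log_lik(doc, new_theta))
        probs = np.exp(np.array(logs) - max(logs))  # normalise in log space
        probs /= probs.sum()
        choice = rng.choice(len(probs), p=probs)
        if choice == len(tables):                  # sit at a new table
            new_id = max(thetas, default=0) + 1
            thetas[new_id] = new_theta
            seats[i] = new_id
        else:
            seats[i] = tables[choice]
    for j in thetas:                               # resample each dish from its
        post = np.full(v, beta)                    # Dirichlet posterior
        for i, doc in enumerate(docs):
            if seats[i] == j:
                for w, c in doc.items():
                    post[w] += c
        thetas[j] = rng.dirichlet(post)

# Toy run: 30 random documents over 3 word types, all seated at one table initially
rng = np.random.default_rng(0)
v, alpha, beta = 3, 1.0, 0.5
docs = [Counter(rng.choice(v, size=rng.integers(5, 21))) for _ in range(30)]
seats = [0] * len(docs)
thetas = {0: rng.dirichlet([beta] * v)}
for _ in range(5):
    gibbs_sweep(docs, seats, thetas, alpha, beta, v, rng)
print("tables:", sorted(Counter(seats).items()))
```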
CRP Inference Demo
Concluding Thoughts
CRP works well on the toy document clustering example:
• Document size 100+ words
• Up to 6 word types
• 100 – 500 documents
Will it work when clustering utterances?
• Utterance size 1 – 20 words
• Up to 6 word types
• 100 – 500 documents
This is a much harder classification problem.