Preliminary Exam
![Page 1: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/1.jpg)
Preliminary Exam
Department of Electrical and Computer Engineering
Submitted to:
Dr. Joseph Picone, Examining Committee Chair
Dr. Iyad Obeid, Committee Member, Dept. of Electrical and Computer Engineering
Dr. Marc Sobel, Committee Member, Department of Statistics
Dr. Chang-Hee Won, Committee Member, Dept. of Electrical and Computer Engineering
Dr. Slobodan Vucetic, Committee Member, Dept. of Computer and Information Sciences
March 6, 2012
Prepared by:
Amir Harati, PhD Candidate
PhD Advisor: Dr. Joseph Picone, Professor and Chair
Department of Electrical and Computer Engineering
Temple University, College of Engineering
1947 North 12th Street
Philadelphia, Pennsylvania 19122
Tel: 215-204-7597
Email: [email protected]
Hierarchical Dirichlet Processes and Infinite HMMs
![Page 2: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/2.jpg)
Motivation
• Parametric models can capture only a bounded amount of information from the data.
• Real data is complex, and therefore parametric assumptions are often wrong.
• Nonparametric models can lead to model selection/averaging solutions without paying the cost of these methods.
• In addition, Bayesian methods often provide a mathematically well-defined framework with better extensibility.
[Figure: all possible data sets of size n. From [1]]
![Page 3: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/3.jpg)
Motivation
• Speech recognizer architecture.
• Performance of the system depends on the quality of the acoustic models.
• HMMs and mixture models are frequently used for acoustic modeling.
• The number of models and the degree of parameter sharing are among the most important model selection problems in a speech recognizer.
• Can hierarchical nonparametric Bayesian modeling help us?
[Figure: speech recognizer block diagram. Input Speech -> Acoustic Front-end -> Search -> Recognized Utterance; the Search block uses the Acoustic Models P(A|W) and the Language Model P(W).]
![Page 4: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/4.jpg)
Outline
• Background
• Hierarchical Dirichlet Process
• Posterior Sampling in CRF
• Augmented Posterior Representation Sampler
• HDP-HMM
• Direct Assignment Sampler
• Block Sampler
• Sequential Sampler
• Demonstrations
• Future Work and Discussion
![Page 5: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/5.jpg)
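Background
• $(\Theta, \mathcal{B})$ is a measurable space, where $\mathcal{B}$ is the sigma-algebra.
• A measure $\mu$ over $(\Theta, \mathcal{B})$ is a function $\mu: \mathcal{B} \to [0, \infty]$ such that:
$$\mu(\emptyset) = 0, \qquad \mu\Big(\bigcup_i A_i\Big) = \sum_i \mu(A_i) \quad \text{for disjoint } A_i \in \mathcal{B}$$
• For a probability measure, $\mu(\Theta) = 1$.
• A Dirichlet distribution is a distribution over the K-dimensional probability simplex.
• Examples of Dirichlet distributions: [Figure from [2]]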
![Page 6: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/6.jpg)
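Background
• A Dirichlet Process (DP) is a random probability measure $G$ over $(\Theta, \mathcal{B})$ such that for any measurable partition $(A_1, \dots, A_K)$ of $\Theta$ we have:
$$(G(A_1), \dots, G(A_K)) \sim \mathrm{Dir}(\alpha G_0(A_1), \dots, \alpha G_0(A_K))$$
• And we write: $G \sim DP(\alpha, G_0)$.
• A DP is discrete with probability one:
$$G = \sum_{k=1}^{\infty} \beta_k \delta_{\theta_k}$$
• $G_0$ is the base distribution and acts like the mean of the DP. $\alpha$ is the concentration parameter and is proportional to the inverse of the variance.
• Stick-breaking construction:
$$v_k \sim \mathrm{Beta}(1, \alpha), \qquad \beta_k = v_k \prod_{l=1}^{k-1} (1 - v_l), \qquad \theta_k \sim G_0 \qquad (\beta \sim GEM(\alpha))$$
• Polya urn scheme:
$$\theta_i \mid \theta_1, \dots, \theta_{i-1}, \alpha, G_0 \sim \sum_{k=1}^{i-1} \frac{1}{i - 1 + \alpha} \delta_{\theta_k} + \frac{\alpha}{i - 1 + \alpha} G_0$$
• Chinese restaurant process (CRP):
$$\theta_i \mid \theta_1, \dots, \theta_{i-1}, \alpha, G_0 \sim \sum_{k=1}^{K} \frac{m_k}{i - 1 + \alpha} \delta_{\theta_k^{*}} + \frac{\alpha}{i - 1 + \alpha} G_0$$
where $\theta_1^{*}, \dots, \theta_K^{*}$ are the distinct values among $\theta_1, \dots, \theta_{i-1}$ and $m_k$ is the number of previous draws equal to $\theta_k^{*}$.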
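The stick-breaking construction and the CRP can be illustrated with a minimal sketch. The function names, the truncation level, and the choice to fold the leftover stick mass into the last weight are assumptions made for this illustration, not part of the slides.

```python
import random

def stick_breaking(alpha, truncation, rng):
    """Truncated GEM(alpha) weights: beta_k = v_k * prod_{l<k} (1 - v_l)."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)   # v_k ~ Beta(1, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)             # assumption: leftover mass goes to the last atom
    return weights

def chinese_restaurant_process(n_customers, alpha, rng):
    """Seat customers one by one; return each customer's table and the table counts."""
    counts, seats = [], []
    for i in range(n_customers):          # i customers are already seated
        # Existing table k with prob m_k / (i + alpha), new table with prob alpha / (i + alpha).
        r = rng.uniform(0.0, i + alpha)
        acc, table = 0.0, len(counts)
        for k, m in enumerate(counts):
            acc += m
            if r < acc:
                table = k
                break
        if table == len(counts):
            counts.append(0)
        counts[table] += 1
        seats.append(table)
    return seats, counts

rng = random.Random(0)
weights = stick_breaking(alpha=2.0, truncation=20, rng=rng)
seats, counts = chinese_restaurant_process(100, alpha=2.0, rng=rng)
```

The number of occupied tables in `counts` typically grows slowly (roughly logarithmically) with the number of customers, which is the clustering behavior the CRP metaphor describes.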
![Page 7: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/7.jpg)
Hierarchical Dirichlet Process (HDP)
• Grouped data clustering problem: consider topic modeling. Each document is a group, and we are interested in modeling each document with a mixture while sharing mixture components across the groups.
• For each group we need a DP. We use a hierarchical architecture to share clusters across the groups.
• Sharing of atoms is obtained by using a common DP as the base distribution for each group:
$$G_0 \mid \gamma, H \sim DP(\gamma, H)$$
$$G_j \mid \alpha, G_0 \sim DP(\alpha, G_0)$$
$$\theta_{ji} \mid G_j \sim G_j$$
$$x_{ji} \mid \theta_{ji} \sim F(\theta_{ji}) \quad \text{for } j \in J$$
• Equivalent stick-breaking representation:
$$\beta \mid \gamma \sim GEM(\gamma)$$
$$\pi_j \mid \alpha, \beta \sim DP(\alpha, \beta)$$
$$\theta_k \mid H, \lambda \sim H(\lambda)$$
$$z_{ji} \mid \pi_j \sim \pi_j$$
$$x_{ji} \mid \{\theta_k\}_{k=1}^{\infty}, z_{ji} \sim F(\theta_{z_{ji}})$$
From [3,4]
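A minimal generative sketch of the stick-breaking form of the HDP, under stated assumptions: the infinite weight vector is truncated to a fixed number of atoms, the base measure H is taken to be N(0, 5^2), and F is a unit-variance Gaussian. These choices and all function names are hypothetical and serve only to make the sketch runnable.

```python
import random

def gem_weights(gamma, truncation, rng):
    """Truncated draw of beta ~ GEM(gamma)."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, gamma)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights

def dirichlet(params, rng):
    """Dirichlet draw via normalized Gamma variates (parameters floored for stability)."""
    draws = [rng.gammavariate(max(a, 1e-12), 1.0) for a in params]
    total = sum(draws)
    if total == 0.0:                       # numerical guard only
        return [1.0 / len(params)] * len(params)
    return [d / total for d in draws]

def sample_truncated_hdp(n_groups, n_per_group, gamma, alpha, truncation, rng):
    beta = gem_weights(gamma, truncation, rng)                  # shared global weights
    theta = [rng.gauss(0.0, 5.0) for _ in range(truncation)]    # theta_k ~ H (assumed Gaussian)
    groups = []
    for _ in range(n_groups):
        # On a finite set of atoms, pi_j ~ DP(alpha, beta) reduces to Dir(alpha * beta).
        pi_j = dirichlet([alpha * b for b in beta], rng)
        z = rng.choices(range(truncation), weights=pi_j, k=n_per_group)   # z_ji ~ pi_j
        x = [rng.gauss(theta[k], 1.0) for k in z]                          # x_ji ~ F(theta_z)
        groups.append((z, x))
    return beta, theta, groups

rng = random.Random(1)
beta, theta, groups = sample_truncated_hdp(3, 50, gamma=2.0, alpha=5.0, truncation=10, rng=rng)
```

Because every group draws its weights around the same `beta`, the groups reuse the same atoms `theta` with different proportions, which is the sharing effect the hierarchy is designed to produce.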
![Page 8: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/8.jpg)
HDP
• Stick-breaking construction
From [5]
![Page 9: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/9.jpg)
HDP
Chinese Restaurant Franchise (CRF)
• Each group corresponds to a restaurant.
• There is a franchise-wide menu with an unbounded number of entries.
• The number of dishes grows logarithmically with the number of tables and double-logarithmically with the number of data points.
• Reinforcement effect: new customers tend to sit at tables with many other customers and to choose dishes that have been chosen by many other tables.
From [6]
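The franchise metaphor can be simulated directly. The sketch below is illustrative only: the helper `pick`, the dictionary layout, and the parameter values are assumptions, with `alpha` controlling new tables inside a restaurant and `gamma` controlling new dishes on the shared menu.

```python
import random

def pick(counts, new_weight, rng):
    """Return index i with prob proportional to counts[i], or len(counts) with prob proportional to new_weight."""
    r = rng.uniform(0.0, sum(counts) + new_weight)
    acc = 0.0
    for idx, c in enumerate(counts):
        acc += c
        if r < acc:
            return idx
    return len(counts)

def simulate_crf(group_sizes, alpha, gamma, rng):
    dish_tables = []                       # m_k: tables across the franchise serving dish k
    restaurants = []
    for n_j in group_sizes:
        table_counts, table_dish, customer_dish = [], [], []
        for _ in range(n_j):
            t = pick(table_counts, alpha, rng)          # sit at an old table or open a new one
            if t == len(table_counts):
                k = pick(dish_tables, gamma, rng)       # a new table orders from the shared menu
                if k == len(dish_tables):
                    dish_tables.append(0)
                dish_tables[k] += 1
                table_counts.append(0)
                table_dish.append(k)
            table_counts[t] += 1
            customer_dish.append(table_dish[t])
        restaurants.append({"table_counts": table_counts,
                            "table_dish": table_dish,
                            "customer_dish": customer_dish})
    return restaurants, dish_tables

rng = random.Random(2)
restaurants, dish_tables = simulate_crf([200, 200, 200], alpha=3.0, gamma=2.0, rng=rng)
```

Comparing `len(dish_tables)` with the total number of tables and customers gives a quick empirical feel for how slowly the menu grows.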
![Page 10: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/10.jpg)
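HDP
• Posterior Distribution
$$G_0 \mid \gamma, H, \boldsymbol{\theta}^{**} \sim DP\left(\gamma + m_{\cdot\cdot},\ \frac{\gamma H + \sum_{k=1}^{K} m_{\cdot k}\, \delta_{\theta_k^{**}}}{\gamma + m_{\cdot\cdot}}\right)$$
$$G_j \mid \alpha, G_0, \boldsymbol{\theta}_j \sim DP\left(\alpha + n_{j\cdot\cdot},\ \frac{\alpha G_0 + \sum_{k=1}^{K} n_{j\cdot k}\, \delta_{\theta_k^{**}}}{\alpha + n_{j\cdot\cdot}}\right)$$
• Augmented representation of $G_0$:
$$(\beta_0, \beta_1, \dots, \beta_K) \mid \gamma, \boldsymbol{\theta}^{*} \sim \mathrm{Dir}(\gamma, m_{\cdot 1}, \dots, m_{\cdot K})$$
$$G_0' \mid \gamma, H \sim DP(\gamma, H)$$
$$G_0 = \beta_0 G_0' + \sum_{k=1}^{K} \beta_k \delta_{\theta_k^{**}}$$
• Augmented representation of $G_j$:
$$(\pi_{j0}, \pi_{j1}, \dots, \pi_{jK}) \mid \alpha, \boldsymbol{\theta}_j \sim \mathrm{Dir}(\alpha\beta_0, \alpha\beta_1 + n_{j\cdot 1}, \dots, \alpha\beta_K + n_{j\cdot K})$$
$$G_j' \mid \alpha, G_0' \sim DP(\alpha\beta_0, G_0')$$
$$G_j = \pi_{j0} G_j' + \sum_{k=1}^{K} \pi_{jk} \delta_{\theta_k^{**}}$$
Interpretation: At the beginning $\beta_0$ is large, and therefore $\alpha\beta_0$ is large and $G_j'$ is concentrated around $G_0'$. After many tables become occupied, $\beta_0$ gets smaller, and as a result $\alpha\beta_0$ becomes smaller and $G_j'$ is no longer concentrated around $G_0'$. => NEW DRAWS ARE NOT LIKELY, BUT IF THEY HAPPEN THEY WILL BE DIFFERENT FROM THE AVERAGE.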
![Page 11: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/11.jpg)
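Posterior sampling in CRF
• Sampling table assignment (t): given foods and table labels, sample
$$p(t_{ji} = t \mid \mathbf{t}^{-ji}, \mathbf{k}) \propto \begin{cases} n_{jt\cdot}^{-ji}\, f_{k_{jt}}^{-x_{ji}}(x_{ji}) & \text{if } t \text{ previously used} \\ \alpha\, p(x_{ji} \mid \mathbf{t}^{-ji}, t_{ji} = t^{new}, \mathbf{k}) & \text{if } t = t^{new} \end{cases}$$
$$p(x_{ji} \mid \mathbf{t}^{-ji}, t_{ji} = t^{new}, \mathbf{k}) = \sum_{k=1}^{K} \frac{m_{\cdot k}}{m_{\cdot\cdot} + \gamma} f_k^{-x_{ji}}(x_{ji}) + \frac{\gamma}{m_{\cdot\cdot} + \gamma} f_{k^{new}}^{-x_{ji}}(x_{ji})$$
• If a new table is selected, sample its food:
$$p(k_{jt^{new}} = k \mid \mathbf{t}, \mathbf{k}^{-jt^{new}}) \propto \begin{cases} m_{\cdot k}\, f_k^{-x_{ji}}(x_{ji}) & \text{if } k \text{ previously used} \\ \gamma\, f_{k^{new}}^{-x_{ji}}(x_{ji}) & \text{if } k = k^{new} \end{cases}$$
where
$$f_k^{-x_{ji}}(x_{ji}) = \frac{\int f(x_{ji} \mid \theta_k) \prod_{j'i' \neq ji,\ z_{j'i'} = k} f(x_{j'i'} \mid \theta_k)\, h(\theta_k)\, d\theta_k}{\int \prod_{j'i' \neq ji,\ z_{j'i'} = k} f(x_{j'i'} \mid \theta_k)\, h(\theta_k)\, d\theta_k}, \quad k = 1, \dots, K$$
$$f_{k^{new}}^{-x_{ji}}(x_{ji}) = \int f(x_{ji} \mid \theta)\, h(\theta)\, d\theta, \quad k = k^{new}$$
• Sample foods (k):
$$p(k_{jt} = k \mid \mathbf{t}, \mathbf{k}^{-jt}) \propto \begin{cases} m_{\cdot k}^{-jt}\, f_k^{-\mathbf{x}_{jt}}(\mathbf{x}_{jt}) & \text{if } k \text{ previously used} \\ \gamma\, f_{k^{new}}^{-\mathbf{x}_{jt}}(\mathbf{x}_{jt}) & \text{if } k = k^{new} \end{cases}$$
• For exponential family distributions we can just update the cached statistics. For Gaussian emissions with a normal-inverse-Wishart prior with hyperparameters $(\kappa_0, \vartheta_0, \nu_0, \Delta_0)$, we can calculate the likelihoods using:
$$f_k^{-x_{ji}}(x_{ji}) = t_{\nu_k - d - 1}\left(x_{ji};\ \vartheta_k,\ \frac{(\kappa_k + 1)\,\nu_k}{\kappa_k (\nu_k - d - 1)} \Delta_k\right), \quad k = 1, \dots, K$$
$$f_{k^{new}}^{-x_{ji}}(x_{ji}) = t_{\nu_0 - d - 1}\left(x_{ji};\ \vartheta_0,\ \frac{(\kappa_0 + 1)\,\nu_0}{\kappa_0 (\nu_0 - d - 1)} \Delta_0\right), \quad k = k^{new}$$
with cached statistics computed from the $L$ observations $x^{(1)}, \dots, x^{(L)}$ currently assigned to dish $k$:
$$\kappa_k = \kappa_0 + L, \qquad \nu_k = \nu_0 + L$$
$$\kappa_k \vartheta_k = \kappa_0 \vartheta_0 + \sum_{l=1}^{L} x^{(l)}$$
$$\nu_k \Delta_k = \nu_0 \Delta_0 + \sum_{l=1}^{L} x^{(l)} x^{(l)T} + \kappa_0 \vartheta_0 \vartheta_0^{T} - \kappa_k \vartheta_k \vartheta_k^{T}$$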
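As a small worked example of the table-assignment step, the sketch below turns counts and precomputed predictive likelihoods into normalized sampling probabilities. The function name, argument layout, and the numbers in the usage lines are hypothetical; the predictive likelihoods are assumed to have been computed elsewhere (for example from cached statistics).

```python
def table_assignment_probs(table_counts, table_likelihoods,
                           dish_table_counts, dish_likelihoods,
                           new_dish_likelihood, alpha, gamma):
    """Probabilities over [existing tables..., new table] for one customer.

    table_counts[t]       : n_jt, customers at table t (current customer removed)
    table_likelihoods[t]  : predictive likelihood of x under the dish served at table t
    dish_table_counts[k]  : m_k, tables in the franchise serving dish k
    dish_likelihoods[k]   : predictive likelihood of x under dish k
    new_dish_likelihood   : prior predictive likelihood of x under a brand-new dish
    """
    m_total = sum(dish_table_counts)
    # Likelihood of x at a new table: mixture over existing dishes and a new dish.
    new_table_lik = sum(m * f for m, f in zip(dish_table_counts, dish_likelihoods))
    new_table_lik = (new_table_lik + gamma * new_dish_likelihood) / (m_total + gamma)
    weights = [n * f for n, f in zip(table_counts, table_likelihoods)]
    weights.append(alpha * new_table_lik)
    total = sum(weights)
    return [w / total for w in weights]

probs = table_assignment_probs(
    table_counts=[2, 1], table_likelihoods=[0.5, 0.2],
    dish_table_counts=[1, 1], dish_likelihoods=[0.5, 0.2],
    new_dish_likelihood=0.1, alpha=1.0, gamma=1.0)
```

The last entry of `probs` is the probability of opening a new table; if it is chosen, the dish for that table is sampled in a second step using the dish counts and `gamma`.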
![Page 12: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/12.jpg)
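Posterior Representation Sampler
• Sample z:
$$p(z_{ji} = k \mid \mathbf{z}^{-ji}, \boldsymbol{\pi}, \boldsymbol{\beta}) \propto \begin{cases} \pi_{jk}\, f_k^{-x_{ji}}(x_{ji}) & \text{if } k \text{ previously used} \\ \pi_{j0}\, f_{k^{new}}^{-x_{ji}}(x_{ji}) & \text{if } k = k^{new} \end{cases}$$
• If a new component is chosen:
$$v_0 \mid \gamma \sim \mathrm{Beta}(\gamma, 1), \qquad (\beta_0^{new}, \beta_{K+1}^{new}) = (\beta_0 v_0,\ \beta_0 (1 - v_0))$$
$$v_j \mid \alpha, \beta_0, v_0 \sim \mathrm{Beta}(\alpha\beta_0 v_0,\ \alpha\beta_0 (1 - v_0)), \qquad (\pi_{j0}^{new}, \pi_{jK+1}^{new}) = (\pi_{j0} v_j,\ \pi_{j0} (1 - v_j))$$
• Sample m: Antoniak showed that if $\beta \sim GEM(\alpha)$ and $z_i \sim \beta$, then the distribution of the number of unique values $K$ among $N$ draws has this form:
$$p(K \mid \alpha, N) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)}\, s(N, K)\, \alpha^{K}$$
where $s(N, K)$ is the Stirling number of the first kind. Applied to the number of tables:
$$p(m_{jk} = m \mid \mathbf{z}, \mathbf{m}^{-jk}, \boldsymbol{\beta}) = \frac{\Gamma(\alpha\beta_k)}{\Gamma(\alpha\beta_k + n_{j\cdot k})}\, s(n_{j\cdot k}, m)\, (\alpha\beta_k)^{m}$$
• Alternatively we can simulate a CRF: for each $j$ and $k = 1, \dots, K$, set $m_{jk} = 0$ and $n = 0$. For each customer in restaurant $j$ eating dish $k$, sample:
$$x \sim \mathrm{Bern}\left(\frac{\alpha\beta_k}{n + \alpha\beta_k}\right)$$
Increment $n$, and if $x = 1$ increment $m_{jk}$.
• Sample $\boldsymbol{\beta}$ and $\boldsymbol{\pi}$:
$$(\beta_0, \beta_1, \dots, \beta_K) \mid \gamma, \boldsymbol{\theta}^{*} \sim \mathrm{Dir}(\gamma, m_{\cdot 1}, \dots, m_{\cdot K})$$
$$(\pi_{j0}, \pi_{j1}, \dots, \pi_{jK}) \mid \alpha, \boldsymbol{\theta}_j \sim \mathrm{Dir}(\alpha\beta_0, \alpha\beta_1 + n_{j\cdot 1}, \dots, \alpha\beta_K + n_{j\cdot K})$$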
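The CRF-simulation route for sampling the table counts is short enough to sketch. The function name and the example values are assumptions; the loop follows the Bernoulli scheme described for restaurant j and dish k.

```python
import random

def sample_num_tables(n_jk, alpha, beta_k, rng):
    """Sample m_jk, the number of tables in restaurant j serving dish k,
    by seating the n_jk customers one at a time."""
    m = 0
    for n in range(n_jk):                  # n customers already seated
        # The next customer opens a new table with prob alpha*beta_k / (n + alpha*beta_k).
        if rng.random() < alpha * beta_k / (n + alpha * beta_k):
            m += 1
    return m

rng = random.Random(3)
m_jk = sample_num_tables(n_jk=50, alpha=2.0, beta_k=0.3, rng=rng)
```

The first customer always opens a table (the probability is 1 when n = 0), so the sample lies between 1 and n_jk whenever n_jk is positive. This avoids evaluating Stirling numbers, at the cost of a loop over customers.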
![Page 13: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/13.jpg)
Topic Modeling [3,4]
![Page 14: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/14.jpg)
Hidden Markov models (HMMs)
• HMMs are a dynamic variant of mixture models.
• An HMM can be characterized by its transition and emission matrices.
• The number of states and the number of mixtures must be specified a priori. The topology is also fixed.
• Infinite HMM: an HMM with an unbounded number of states and mixtures per state.
• For each state, the corresponding row of the transition matrix is replaced with a DP.
• The DPs should be linked to make state sharing possible.
• An HDP is used to tie the state transition distributions.
• Each state can independently use another DP to model an unbounded emission mixture.
• The original HDP-HMM suffers from a lack of state persistence. This problem is solved by adding a sticky parameter.
![Page 15: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/15.jpg)
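HDP-HMM
• Definition:
$$\beta \mid \gamma \sim GEM(\gamma)$$
$$\pi_j \mid \alpha, \kappa, \beta \sim DP\left(\alpha + \kappa,\ \frac{\alpha\beta + \kappa\delta_j}{\alpha + \kappa}\right)$$
$$\psi_j \mid \sigma \sim GEM(\sigma)$$
$$\theta_{kj}^{**} \mid H, \lambda \sim H(\lambda)$$
$$z_t \mid z_{t-1}, \{\pi_j\}_{j=1}^{\infty} \sim \pi_{z_{t-1}}$$
$$s_t \mid \{\psi_j\}_{j=1}^{\infty}, z_t \sim \psi_{z_t}$$
$$x_t \mid \{\theta_{kj}^{**}\}_{k,j=1}^{\infty}, z_t, s_t \sim F(\theta_{z_t s_t}^{**})$$
• CRF with loyal customers: each restaurant has a specialty dish which is also served in other restaurants. If a customer eats the specialty dish (likely), then his children go to the same restaurant and are likely to eat the same dish. However, if the customer eats another dish, then his children go to the restaurant indexed by that dish and are more likely to eat its specialty dish.
From [7]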
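The effect of the sticky parameter can be seen in a minimal generative sketch. It assumes a finite (weak-limit) approximation with L states, so that each transition row is drawn from a Dirichlet with extra mass kappa on the self-transition; emissions are omitted. The function names and parameter values are hypothetical.

```python
import random

def dirichlet(params, rng):
    """Dirichlet draw via normalized Gamma variates (parameters floored for stability)."""
    draws = [rng.gammavariate(max(a, 1e-12), 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def sample_sticky_transitions(L, gamma, alpha, kappa, rng):
    beta = dirichlet([gamma / L] * L, rng)            # finite approximation to GEM(gamma)
    pi = []
    for j in range(L):
        # Row j: Dir(alpha*beta_1, ..., alpha*beta_j + kappa, ..., alpha*beta_L)
        params = [alpha * beta[k] + (kappa if k == j else 0.0) for k in range(L)]
        pi.append(dirichlet(params, rng))
    return beta, pi

def sample_state_sequence(beta, pi, T, rng):
    L = len(beta)
    states = [rng.choices(range(L), weights=beta)[0]]  # assumption: initial state drawn from beta
    for _ in range(T - 1):
        states.append(rng.choices(range(L), weights=pi[states[-1]])[0])
    return states

rng = random.Random(4)
beta, pi = sample_sticky_transitions(L=5, gamma=2.0, alpha=4.0, kappa=20.0, rng=rng)
states = sample_state_sequence(beta, pi, T=200, rng=rng)
self_fraction = sum(a == b for a, b in zip(states, states[1:])) / (len(states) - 1)
```

Rerunning with `kappa=0.0` and comparing `self_fraction` gives a quick check of how much state persistence the sticky term adds.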
![Page 16: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/16.jpg)
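Direct Assignment Sampler
• Sample augmented state $(z_t, s_t)$:
$$z_t \sim \sum_{k=1}^{K} f_k(x_t)\, \delta(z_t, k) + f_{K+1}(x_t)\, \delta(z_t, K+1)$$
$$s_t \sim \sum_{j=1}^{K_k'} f_{k,j}(x_t)\, \delta(s_t, j) + f_{k,K_k'+1}(x_t)\, \delta(s_t, K_k'+1)$$
where
$$f_{k,j}(x_t) = \frac{n_{kj}'^{-t}}{\sigma + n_{k\cdot}'^{-t}}\, p(x_t \mid \{x_\tau : z_\tau = k, s_\tau = j, \tau \neq t\})$$
$$f_{k,K_k'+1}(x_t) = \frac{\sigma}{\sigma + n_{k\cdot}'^{-t}}\, p(x_t \mid \{x_\tau : z_\tau = k, s_\tau = j^{new}, \tau \neq t\})$$
$$f_{K+1,0}(x_t) = p(x_t \mid \{x_\tau : z_\tau = k^{new}, \tau \neq t\})$$
$$f_k(x_t) = \left(\alpha\beta_k + n_{z_{t-1}k}^{-t} + \kappa\delta(z_{t-1}, k)\right) \frac{\alpha\beta_{z_{t+1}} + n_{kz_{t+1}}^{-t} + \kappa\delta(k, z_{t+1}) + \delta(z_{t-1}, k)\delta(k, z_{t+1})}{\alpha + n_{k\cdot}^{-t} + \kappa + \delta(z_{t-1}, k)} \left(\sum_{j=1}^{K_k'} f_{k,j}(x_t) + f_{k,K_k'+1}(x_t)\right), \quad k = 1, \dots, K$$
$$f_{K+1}(x_t) = \frac{\alpha^2 \beta_{K+1} \beta_{z_{t+1}}}{\alpha + \kappa}\, f_{K+1,0}(x_t), \quad k = K+1$$
For Gaussian emissions with a normal-inverse-Wishart prior with hyperparameters $(\kappa_0, \vartheta_0, \nu_0, \Delta_0)$:
$$p(x_t \mid \{x_\tau : z_\tau = k, s_\tau = j, \tau \neq t\}) = t_{\nu_{kj} - d - 1}\left(x_t;\ \vartheta_{kj},\ \frac{(\kappa_{kj} + 1)\,\nu_{kj}}{\kappa_{kj}(\nu_{kj} - d - 1)} \Delta_{kj}\right), \quad k = 1, \dots, K,\ j = 1, \dots, K_k'$$
$$p(x_t \mid \{x_\tau : z_\tau = k, s_\tau = j^{new}, \tau \neq t\}) = t_{\nu_0 - d - 1}\left(x_t;\ \vartheta_0,\ \frac{(\kappa_0 + 1)\,\nu_0}{\kappa_0(\nu_0 - d - 1)} \Delta_0\right), \quad k = 1, \dots, K,\ j = j^{new}$$
$$p(x_t \mid \{x_\tau : z_\tau = k^{new}, \tau \neq t\}) = t_{\nu_0 - d - 1}\left(x_t;\ \vartheta_0,\ \frac{(\kappa_0 + 1)\,\nu_0}{\kappa_0(\nu_0 - d - 1)} \Delta_0\right), \quad k = k^{new}$$
• Sample $\beta$ by first sampling auxiliary variables:
– Sample m using
$$p(m_{jk} = m \mid \mathbf{z}, \mathbf{m}^{-jk}, \boldsymbol{\beta}) = \frac{\Gamma(\alpha\beta_k + \kappa\delta(k, j))}{\Gamma(\alpha\beta_k + \kappa\delta(k, j) + n_{jk})}\, s(n_{jk}, m)\, (\alpha\beta_k + \kappa\delta(k, j))^{m}$$
– Alternatively, we can simulate a CRF.
– Sample override variable:
$$w_{j\cdot} \sim \mathrm{Binomial}\left(m_{jj},\ \frac{\rho}{\rho + \beta_j (1 - \rho)}\right), \qquad \rho = \frac{\kappa}{\alpha + \kappa}$$
– Adjust the number of informative tables:
$$\bar{m}_{jk} = \begin{cases} m_{jk} & j \neq k \\ m_{jj} - w_{j\cdot} & j = k \end{cases}$$
• Sample $\beta$:
$$\boldsymbol{\beta} \sim \mathrm{Dir}(\gamma, \bar{m}_{\cdot 1}, \dots, \bar{m}_{\cdot K})$$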
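The override step can be sketched in a few lines. This is an illustrative sketch only: the function name, the nested-list layout of the table counts, and the example numbers are assumptions, and the Binomial draw is implemented as a sum of Bernoulli draws.

```python
import random

def adjust_informative_tables(m, beta, alpha, kappa, rng):
    """Remove tables whose dish was forced by the sticky parameter.

    m[j][k] : number of tables in restaurant j serving dish k
    beta[j] : global weight of dish j (assumed positive)
    Returns m_bar, the counts of informative tables.
    """
    rho = kappa / (alpha + kappa)
    m_bar = [row[:] for row in m]
    for j in range(len(beta)):
        p = rho / (rho + beta[j] * (1.0 - rho))
        # w_j ~ Binomial(m_jj, p): number of overridden tables in restaurant j
        w = sum(1 for _ in range(m[j][j]) if rng.random() < p)
        m_bar[j][j] = m[j][j] - w
    return m_bar

rng = random.Random(5)
m = [[6, 1, 0], [2, 5, 1], [0, 1, 4]]
beta = [0.5, 0.3, 0.2]
m_bar = adjust_informative_tables(m, beta, alpha=2.0, kappa=5.0, rng=rng)
```

Only the diagonal entries change, since only self-transitions can be overridden; with `kappa=0.0` the counts are returned unchanged.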
![Page 17: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/17.jpg)
Some Notes
• Sampling the override variable is performed to cancel the bias introduced by the sticky parameter. The sticky parameter in effect changes (overrides) the dish that is going to be assigned to a table. To obtain an unbiased estimate, we have to take this into account.
• The direct assignment sampler suffers from slow convergence.
• Parameters are integrated out; in other words, this sampler can only be used for inference, not learning.
• If we want to perform learning, we have to sample the parameters by simulation (more computation).
• We need to sample all states at once.
• We are interested in doing both learning and inference at once.
![Page 18: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/18.jpg)
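Forward-Backward Probabilities
• The joint probability of state and mixture component can be written as:
$$p(z_t, s_t \mid z_{t-1}, x_{1:T}, \boldsymbol{\pi}, \boldsymbol{\psi}, \boldsymbol{\theta}) \propto p(z_t \mid \pi_{z_{t-1}})\, p(s_t \mid \psi_{z_t})\, f(x_t \mid \theta_{z_t, s_t})\, p(x_{1:t-1} \mid z_{t-1}, \boldsymbol{\pi}, \boldsymbol{\theta}, \boldsymbol{\psi})\, p(x_{t+1:T} \mid z_t, \boldsymbol{\pi}, \boldsymbol{\theta}, \boldsymbol{\psi})$$
• Forward probabilities include:
$$p(z_t \mid \pi_{z_{t-1}})\, p(s_t \mid \psi_{z_t})\, f(x_t \mid \theta_{z_t, s_t})\, p(x_{1:t-1} \mid z_{t-1}, \boldsymbol{\pi}, \boldsymbol{\theta}, \boldsymbol{\psi})$$
• The backward probability is: $p(x_{t+1:T} \mid z_t, \boldsymbol{\pi}, \boldsymbol{\theta}, \boldsymbol{\psi})$
• In this work, forward probabilities are approximated by $p(z_t \mid \pi_{z_{t-1}})\, p(s_t \mid \psi_{z_t})\, f(x_t \mid \theta_{z_t, s_t})$.
• For backward probabilities we can write:
$$p(x_{t:T} \mid z_{t-1}, \boldsymbol{\pi}, \boldsymbol{\theta}, \boldsymbol{\psi}) \propto m_{t,t-1}(z_{t-1})$$
$$m_{t,t-1}(z_{t-1}) = \begin{cases} \sum_{z_t} \sum_{s_t} p(z_t \mid \pi_{z_{t-1}})\, p(s_t \mid \psi_{z_t})\, f(x_t \mid \theta_{z_t, s_t})\, m_{t+1,t}(z_t) & t \le T \\ 1 & t = T + 1 \end{cases}$$
$$m_{t,t-1}(k) = \begin{cases} \sum_{i=1}^{L} \sum_{l=1}^{L'} \pi_{ki}\, \psi_{il}\, f(x_t \mid \theta_{il})\, m_{t+1,t}(i) & t \le T \\ 1 & t = T + 1 \end{cases} \qquad k = 1, \dots, L$$
• And finally we have:
$$p(z_t = k, s_t = j \mid z_{t-1}, x_{1:T}, \boldsymbol{\pi}, \boldsymbol{\psi}, \boldsymbol{\theta}) \propto \pi_{z_{t-1}k}\, \psi_{kj}\, f(x_t \mid \theta_{kj})\, m_{t+1,t}(k), \qquad f(x_t \mid \theta_{z_t, s_t}) = N(x_t; \mu_{kj}, \Sigma_{kj})$$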
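The backward recursion and the resulting state sampling can be sketched for a truncated model. In this sketch the mixture component is assumed to be marginalized out beforehand, so `lik[t][k]` stands for the summed emission likelihood of observation t under state k; the initial distribution `pi0`, the function names, and the numbers are hypothetical, and time is 0-indexed.

```python
import random

def backward_messages(pi, lik):
    """msgs[t][k] is proportional to p(x_{t+1:T} | z_t = k); the last message is all ones.

    pi[k][i]  : transition probability from state k to state i
    lik[t][i] : emission likelihood of observation t under state i
    """
    T, L = len(lik), len(pi)
    msgs = [[1.0] * L for _ in range(T)]
    for t in range(T - 2, -1, -1):
        raw = [sum(pi[k][i] * lik[t + 1][i] * msgs[t + 1][i] for i in range(L))
               for k in range(L)]
        total = sum(raw)
        msgs[t] = [r / total for r in raw]     # normalize to avoid numerical underflow
    return msgs

def sample_states(pi0, pi, lik, msgs, rng):
    """Sample z_1..z_T forward in time, weighting each step by the backward message."""
    L, states = len(pi), []
    for t in range(len(lik)):
        prior = pi0 if t == 0 else pi[states[-1]]
        weights = [prior[k] * lik[t][k] * msgs[t][k] for k in range(L)]
        states.append(rng.choices(range(L), weights=weights)[0])
    return states

pi = [[0.9, 0.1], [0.2, 0.8]]
lik = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
msgs = backward_messages(pi, lik)
states = sample_states([0.5, 0.5], pi, lik, msgs, random.Random(6))
```

Normalizing each message changes it only by a constant factor, which cancels when the sampling weights are normalized, so the sampled sequence has the same distribution as with unnormalized messages.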
![Page 19: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/19.jpg)
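Block Sampler
• Compute the backward probabilities:
$$m_{t,t-1}(k) = \sum_{i=1}^{L} \sum_{l=1}^{L'} \pi_{ki}\, \psi_{il}\, N(x_t; \mu_{il}, \Sigma_{il})\, m_{t+1,t}(i), \qquad m_{T+1,T}(k) = 1$$
• Sample augmented state:
$$(z_t, s_t) \sim \sum_{k=1}^{L} \sum_{j=1}^{L'} f_{k,j}(x_t)\, \delta(z_t, k)\, \delta(s_t, j)$$
$$f_{k,j}(x_t) = \pi_{z_{t-1}k}\, \psi_{kj}\, N(x_t; \mu_{kj}, \Sigma_{kj})\, m_{t+1,t}(k)$$
• Sample the override variable and adjust the number of tables as in the previous algorithm.
• Update the cache and then sample $\beta$:
$$\boldsymbol{\beta} \sim \mathrm{Dir}(\gamma / L + \bar{m}_{\cdot 1}, \dots, \gamma / L + \bar{m}_{\cdot L})$$
• Sample $\pi_k$ and $\psi_k$:
$$\pi_k \sim \mathrm{Dir}(\alpha\beta_1 + n_{k1}, \dots, \alpha\beta_k + \kappa + n_{kk}, \dots, \alpha\beta_L + n_{kL})$$
$$\psi_k \sim \mathrm{Dir}(\sigma / L' + n_{k1}', \dots, \sigma / L' + n_{kL'}')$$
• Sample $\theta_{k,j}$:
$$\theta_{k,j} \sim p(\theta \mid \lambda, \{x_t : z_t = k, s_t = j\})$$
• Optionally sample hyper-parameters.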
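One step of the block sampler, resampling the transition rows given a sampled state sequence, can be sketched as follows. The function names and the toy inputs are assumptions; the Dirichlet draw is implemented with normalized Gamma variates, and the sticky parameter `kappa` is added to the self-transition entry of each row.

```python
import random

def count_transitions(states, L):
    """n[k][i]: number of transitions from state k to state i in the sequence."""
    counts = [[0] * L for _ in range(L)]
    for prev, cur in zip(states, states[1:]):
        counts[prev][cur] += 1
    return counts

def resample_transitions(counts, beta, alpha, kappa, rng):
    """Draw each row pi_k ~ Dir(alpha*beta_i + n_ki + kappa*[i == k])."""
    L = len(beta)
    pi = []
    for k in range(L):
        params = [alpha * beta[i] + counts[k][i] + (kappa if i == k else 0.0)
                  for i in range(L)]
        draws = [rng.gammavariate(p, 1.0) for p in params]   # requires all params > 0
        total = sum(draws)
        pi.append([d / total for d in draws])
    return pi

states = [0, 0, 0, 1, 1, 2, 2, 2, 0, 0]
counts = count_transitions(states, L=3)
pi = resample_transitions(counts, beta=[0.5, 0.3, 0.2], alpha=2.0, kappa=3.0,
                          rng=random.Random(7))
```

The emission-side update has the same shape: counts of mixture-component usage per state are added to a symmetric Dirichlet prior before drawing each mixture-weight vector.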
![Page 20: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/20.jpg)
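Particle Filter
• Dynamic system:
$$x_{t+1} = f(z_{t+1}, \epsilon_{t+1})$$
$$z_{t+1} = g(z_t, \eta_{t+1})$$
• Update:
$$p(z_{t+1} \mid x^{t+1}) = \frac{p(x_{t+1} \mid z_{t+1})\, p(z_{t+1} \mid x^{t})}{p(x_{t+1} \mid x^{t})}$$
$$p^{N}(z_{t+1} \mid x^{t+1}) = \sum_{i=1}^{N} \frac{p(x_{t+1} \mid z_{t+1}^{(i)})}{\sum_{j=1}^{N} p(x_{t+1} \mid z_{t+1}^{(j)})}\, \delta_{z_{t+1}^{(i)}}(z_{t+1})$$
• Propagate:
$$p(z_{t+1} \mid x^{t}) = \int p(z_{t+1} \mid z_t, x^{t})\, p(z_t \mid x^{t})\, dz_t$$
where $x^{t} = (x_1, \dots, x_t)$ denotes the observations up to time $t$.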
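The propagate and update steps can be illustrated with a bootstrap particle filter on a toy model. The scalar linear-Gaussian system used here (state coefficient `a`, noise scales `q` and `r`), the multinomial resampling step, and all names are assumptions chosen only to make the sketch runnable; they do not come from the slides.

```python
import math
import random

def bootstrap_particle_filter(observations, n_particles, rng, a=0.9, q=1.0, r=0.5):
    """Filter z_t = a*z_{t-1} + N(0, q^2), x_t = z_t + N(0, r^2); return posterior means."""
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    means = []
    for x in observations:
        # Propagate: push every particle through the state dynamics.
        particles = [a * z + rng.gauss(0.0, q) for z in particles]
        # Update: weight each particle by the observation likelihood p(x | z).
        weights = [math.exp(-0.5 * ((x - z) / r) ** 2) for z in particles]
        total = sum(weights)
        if total == 0.0:                       # numerical guard: all likelihoods underflowed
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [w / total for w in weights]
        means.append(sum(w * z for w, z in zip(weights, particles)))
        # Resample: draw particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return means

rng = random.Random(8)
true_z, observations = 0.0, []
for _ in range(50):
    true_z = 0.9 * true_z + rng.gauss(0.0, 1.0)
    observations.append(true_z + rng.gauss(0.0, 0.5))
means = bootstrap_particle_filter(observations, n_particles=200, rng=rng)
```

In the sequential HDP-HMM setting the particle would carry the full set of assignments and counts rather than a single scalar state, but the weight, resample, and propagate cycle has the same structure.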
![Page 21: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/21.jpg)
Sequential Learning and Inference
• Calculate the weights:
$$\omega_{t+1}^{(i)} = \frac{v_{t+1}^{(i)}}{\sum_{j=1}^{N} v_{t+1}^{(j)}}, \qquad v_{t+1}^{(i)} = \sum_{l=1}^{L_t^{(i)} + 1} q_l(Z_t^{(i)}, x_{t+1})$$
$$q_l(Z_t^{(i)}, x_{t+1}) = \begin{cases} \dfrac{n_{z_t l, t}^{(i)} + \alpha\beta_{l,t}^{(i)}}{n_{z_t \cdot, t}^{(i)} + \alpha}\, f_l^{(i)}(x_{t+1}) & l = 1, \dots, L_t^{(i)} \\ \dfrac{\alpha\beta_{0,t}^{(i)}}{n_{z_t \cdot, t}^{(i)} + \alpha}\, f_{L_t^{(i)}+1}(x_{t+1}) & l = L_t^{(i)} + 1 \end{cases}$$
• Resample the particles:
$$p^{N}(Z_t \mid x^{t+1}) = \sum_{i=1}^{N} \omega_{t+1}^{(i)}\, \delta_{Z_t^{(i)}}(Z_t)$$
• Propagate the particles:
$$z_{t+1}^{(i)} \sim p(z_{t+1} \mid Z_t^{(i)}, x_{t+1}) = \frac{q_{z_{t+1}}(Z_t^{(i)}, x_{t+1})}{\sum_{l=1}^{L_t^{(i)}+1} q_l(Z_t^{(i)}, x_{t+1})}$$
$$L_{t+1}^{(i)} = \begin{cases} L_t^{(i)} + 1 & z_{t+1}^{(i)} = L_t^{(i)} + 1 \\ L_t^{(i)} & \text{otherwise} \end{cases}$$
$$n_{z_t z_{t+1}, t+1}^{(i)} = n_{z_t z_{t+1}, t}^{(i)} + 1, \qquad S_{z_{t+1}, t+1}^{(i)} = S_{z_{t+1}, t}^{(i)} + s(x_{t+1})$$
• If a new state is initiated:
$$v \mid \gamma \sim \mathrm{Beta}(\gamma, 1), \qquad (\beta_{0,t+1}^{(i)},\ \beta_{L_{t+1},t+1}^{(i)}) = (\beta_{0,t}^{(i)} v,\ \beta_{0,t}^{(i)} (1 - v))$$
• Update the hyper-parameter.
• Resample $\beta$:
$$\boldsymbol{\beta}_{t+1}^{(i)} \sim \mathrm{Dir}(m_{1,t+1}^{(i)}, \dots, m_{L_{t+1},t+1}^{(i)}, \gamma)$$
![Page 22: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/22.jpg)
State persistence demo [7]
![Page 23: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/23.jpg)
Fast switching Demo [7]
![Page 24: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/24.jpg)
Comparison to Sparse Dirichlet Prior [7]
![Page 25: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/25.jpg)
Speaker Diarization [7]
![Page 26: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/26.jpg)
Speaker Diarization [7]
![Page 27: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/27.jpg)
Alice in Wonderland [1,3]
• Training over 1000 characters.
• Testing over another 1000 characters.
• Output: characters (including space and punctuation).
![Page 28: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/28.jpg)
Future Work
• Can we use an approach similar to speaker diarization to discover a new set of acoustic units (instead of phonemes)? This problem seems to fit particularly well in a nonparametric setting, since the number of units is not known a priori and should be estimated from the data. The main difficulty is forming a dictionary for the new units.
• How can we define a structured HDP-HMM (e.g. left-to-right) without violating the Bayesian framework (no heuristics)?
• From experiments, we know that speaker-dependent models work significantly better than speaker-independent models. For example, a speech recognizer with gender-based models performs better than a speech recognizer with universal models for all speakers. The nonparametric Bayesian framework provides two important features that can facilitate speaker-dependent systems: (1) the number of speaker clusters is not known a priori and could grow as new data is obtained; (2) parameter sharing and model (and state) tying can be accomplished elegantly using proper hierarchies. Depending on the available training data, the system would have a different number of models for different acoustic units. All acoustic units are tied. Moreover, each model has a different number of states and a different number of mixtures for each state.
![Page 29: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/29.jpg)
References
1. Ghahramani, Z. (2010). Bayesian Hidden Markov Models and Extensions. Invited talk at CoNLL, Uppsala, Sweden.
2. Ghahramani, Z. (2005). Tutorial on Nonparametric Bayesian Methods. Talk at UAI.
3. Teh, Y., Jordan, M., Beal, M., & Blei, D. (2004). Hierarchical Dirichlet Processes. Technical Report 653, UC Berkeley.
4. Teh, Y., & Jordan, M. (2010). Hierarchical Bayesian Nonparametric Models with Applications. In N. Hjort, C. Holmes, P. Mueller, & S. Walker (Eds.), Bayesian Nonparametrics: Principles and Practice. Cambridge, UK: Cambridge University Press.
5. Teh, Y. W. (2009). Bayesian Nonparametrics. Talk at MLSS, Cambridge.
6. Jordan, M. I. (2005). Dirichlet Processes, Chinese Restaurant Processes and All That. Tutorial presentation at the NIPS Conference.
7. Fox, E., Sudderth, E., Jordan, M., & Willsky, A. (2011). A Sticky HDP-HMM with Application to Speaker Diarization. The Annals of Applied Statistics, 5, 1020-1056.
8. Rodriguez, A. (2011, July). On-Line Learning for the Infinite Hidden Markov Model. Communications in Statistics: Simulation and Computation, 40(6), 879-893.
![Page 30: Preliminary Exam](https://reader036.fdocuments.net/reader036/viewer/2022081421/56815ab1550346895dc85d9e/html5/thumbnails/30.jpg)
Thank You!