8/12/2003 Text-Constrained Speaker Recognition Using Hidden Markov Models Kofi A. Boakye...
8/12/2003
Text-Constrained Speaker Recognition
Using Hidden Markov Models
Kofi A. Boakye
International Computer Science Institute
Outline
Introduction
Design and System Description
Initial Results
System Enhancements
•More words
•Higher order cepstra
•Cepstral Mean Subtraction
Conclusions
Future Work
Introduction
Speaker Recognition Problem: Determine whether a spoken segment comes from the putative target speaker
• Also referred to as Speaker Verification/Authentication
Introduction
Method of Solution Requires Two Phases:
• Training Phase
• Testing Phase
[Figure: verification example with claimed identity "Sally"]
Similar to speech recognition, though “noise” (inter-speaker variability) is now signal.
Introduction
Also like speech recognition, different domains exist
Two major divisions:
1) Text-dependent/Text-constrained
• Highly constrained text spoken by person
• Examples: fixed phrase, prompted phrase
2) Text-independent
• Unconstrained text spoken by person
• Example: conversational speech
Introduction
Text-dependent systems can have high performance because of input constraints
• With the text constrained, more of the acoustic variation arises from speaker differences (vs. phonetic content)
Text-independent systems have greater flexibility
Introduction
Question: Is it possible to capitalize on the advantages of text-dependent systems in text-independent domains?
Answer: Yes!
Introduction
Idea: Limit words of interest to a select group
-Words should have high frequency in domain
-Words should have high speaker-discriminative quality
What kind of words match these criteria for conversational speech?
1) Discourse markers (like, well, now…)
2) Filled pauses (um, uh)
3) Backchannels (yeah, right, uhhuh, …)
These words are fairly spontaneous and represent an “involuntary speaking style” (Heck, WS2002)
Design
Task is a detection problem, so use likelihood ratio detector
-In implementation, log-likelihood is used
Likelihood Ratio Detector:
Λ = p(X|S) / p(X|UBM)
[Figure: signal → Feature Extraction → Speaker Model / Background Model (speaker model adapted from background model) → Λ; Λ > Θ: Accept, Λ < Θ: Reject]
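The accept/reject rule on this slide can be sketched in a few lines. This is a minimal illustration in the log domain, not the system's actual code: the function names, the default threshold of 0.0, and the log-probability values are all made up for the example.

```python
def log_likelihood_ratio(logp_speaker, logp_ubm):
    """Return log Λ = log p(X|S) - log p(X|UBM)."""
    return logp_speaker - logp_ubm


def decide(logp_speaker, logp_ubm, log_threshold=0.0):
    """Accept the claimed identity when log Λ exceeds log Θ (hypothetical default 0.0)."""
    llr = log_likelihood_ratio(logp_speaker, logp_ubm)
    return "accept" if llr > log_threshold else "reject"


# Illustrative (made-up) accumulated log-probabilities for two test segments.
print(decide(-1040.0, -1055.0))  # speaker model explains the data better
print(decide(-1040.0, -1032.5))  # background model explains the data better
```

Working with log-probabilities turns the ratio into a difference, which avoids numerical underflow when many frame probabilities are multiplied together.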
Design
State-of-the-art speaker recognition systems use Gaussian Mixture Models (GMMs)
•Speaker’s acoustic space is represented by a many-component mixture of Gaussians
[Figure: Gaussian mixtures covering the acoustic spaces of speaker 1 and speaker 2]
Design
Speaker models are obtained via adaptation of a Universal Background Model (UBM)
•Probabilistically align target training data into UBM mixture states
•Update mixture weights, means and variances based on the number of occurrences in mixtures
•Gives very good performance, but…
[Figure: target training data]
Design
Concern: GMMs utilize a “bag-of-frames” approach
•Frames assumed to be independent
•Sequential information is not really utilized
Alternative: Use HMMs
•Do likelihood test on output from recognizer, which is an accumulated log-probability score
•Text-independent system has been analyzed (Weber et al. from Dragon Systems)
•Let’s try a text-dependent one!
System
Word-level HMM-UBM detectors
[Figure: signal → Word Extractor → HMM-UBM 1, HMM-UBM 2, …, HMM-UBM N → Combination → Λ]
Topology:
Left-to-right HMM with self-loops and no skips
4 Gaussian components per state
Number of states related to number of phones and median number of frames for word
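To make the topology concrete, here is a small sketch of the transition structure of a left-to-right HMM with self-loops and no skips. The function name, the extra "exit" column, and the self-loop probability are assumptions for illustration; the slides do not give the actual transition values, which would be learned in training.

```python
def left_to_right_transitions(num_states, self_loop_prob=0.6):
    """Transition matrix for a left-to-right HMM with self-loops and no skips.

    Row i holds the probabilities of leaving emitting state i. Column
    num_states is a hypothetical exit (end-of-word) column, so the matrix
    has num_states rows and num_states + 1 columns.
    """
    if num_states < 1:
        raise ValueError("num_states must be at least 1")
    if not 0.0 < self_loop_prob < 1.0:
        raise ValueError("self_loop_prob must lie strictly between 0 and 1")
    matrix = []
    for i in range(num_states):
        row = [0.0] * (num_states + 1)
        row[i] = self_loop_prob            # stay in the same state
        row[i + 1] = 1.0 - self_loop_prob  # advance exactly one state (no skips)
        matrix.append(row)
    return matrix


for row in left_to_right_transitions(3, self_loop_prob=0.5):
    print(row)
```

Every entry other than the diagonal and the one immediately to its right is zero, which is what enforces the strict left-to-right ordering of states within a word.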
System
HMMs implemented using the HMM Toolkit (HTK)
-Used for speech recognition
Input features were 12 mel-cepstra, first differences, and zeroth order cepstrum (energy parameter)
Adaptation:
Means were adapted using Maximum A Posteriori adaptation
$$\hat{\mu}_{jm} = \frac{N_{jm}}{N_{jm} + \tau}\,\bar{\mu}_{jm} + \frac{\tau}{N_{jm} + \tau}\,\mu_{jm}$$
In cases of no adaptation data, UBM was used
-LLR score cancels
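MAP mean adaptation as documented for HTK blends the UBM mean with the mean of the adaptation data, weighted by how much data a Gaussian has seen: with occupation count N, prior weight τ, data mean m̄ and UBM mean m, the adapted mean is (N / (N + τ)) m̄ + (τ / (N + τ)) m. The sketch below applies that update to a single Gaussian. The function name, the argument layout, and the value of `tau` are assumptions for illustration, not the settings used in this system.

```python
def map_adapt_mean(ubm_mean, frames, occupancies, tau=10.0):
    """MAP-adapt the mean of one Gaussian.

    ubm_mean:    prior (UBM) mean vector, a list of floats
    frames:      adaptation feature vectors, one list of floats per frame
    occupancies: per-frame posterior probability of this Gaussian
    tau:         prior weight (illustrative value, an assumption)
    """
    n = sum(occupancies)
    if n == 0.0:
        # No adaptation data: keep the UBM mean, so target and UBM
        # scores are identical and the log-likelihood ratio cancels.
        return list(ubm_mean)
    dim = len(ubm_mean)
    data_mean = [
        sum(g * x[d] for g, x in zip(occupancies, frames)) / n
        for d in range(dim)
    ]
    weight = n / (n + tau)
    return [weight * data_mean[d] + (1.0 - weight) * ubm_mean[d]
            for d in range(dim)]


# Ten frames fully assigned to this Gaussian, with tau = 10: the adapted
# mean lands halfway between the UBM mean and the data mean.
print(map_adapt_mean([0.0, 0.0], [[2.0, 4.0]] * 10, [1.0] * 10, tau=10.0))
```

Gaussians that see little target data stay close to the UBM, which keeps sparsely observed words from producing unreliable speaker models.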
Word Selection
13 Words:
Discourse markers: {actually, anyway, like, see, well, now}
Filled pauses: {um, uh}
Backchannels: {yeah, yep, okay, uhhuh, right}
Words account for approx. 8% of total tokens
Recognition Task
NIST Extended Data Evaluation:
Training on 1, 2, 4, 8, and 16 complete conversation sides and testing on one side (side duration ~2.5 mins)
Uses Switchboard I corpus
-Conversational telephone speech
Cross-validation method where data is partitioned
Test on one partition; use others for background models and normalization
For this project, used splits 4-6 for background and split 1 for testing, with 8-conversation training
Scoring
LLR(X) = log(p(X|S)) – log(p(X|UBM))
Target score: output of adapted HMM scoring forced alignment recognition of word from true transcripts (aligned via SRI recognizer)
UBM score: output of non-adapted HMM scoring same forced alignment
Frame normalization:
$$\mathrm{score} = \frac{\sum_i \left(\mathrm{target}_i - \mathrm{UBM}_i\right)}{\sum_j \#\mathrm{frames}_j}$$
Word normalization: Average of word-level frame normalizations
N-best normalization: Frame normalization on n best matching (i.e. high log-prob) words
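The three normalizations can be sketched as follows. Each word token is represented as a hypothetical tuple `(target_logprob, ubm_logprob, num_frames)`; the numbers are made up. Ranking the n best words by per-frame target log-probability is an assumption, since the slide only says "high log-prob".

```python
def frame_norm(words):
    """Total log-likelihood ratio divided by the total number of frames."""
    total_llr = sum(target - ubm for target, ubm, _ in words)
    total_frames = sum(frames for _, _, frames in words)
    return total_llr / total_frames


def word_norm(words):
    """Average of the per-word frame-normalized scores."""
    per_word = [(target - ubm) / frames for target, ubm, frames in words]
    return sum(per_word) / len(per_word)


def n_best_norm(words, n):
    """Frame normalization restricted to the n best-matching words.

    Assumption: "best matching" means highest target log-prob per frame.
    """
    ranked = sorted(words, key=lambda w: w[0] / w[2], reverse=True)
    return frame_norm(ranked[:n])


# Made-up tokens: (target log-prob, UBM log-prob, number of frames).
tokens = [(-500.0, -520.0, 10), (-330.0, -333.0, 6), (-900.0, -890.0, 20)]
print(frame_norm(tokens))
print(word_norm(tokens))
print(n_best_norm(tokens, 2))
```

Frame normalization weights long words more heavily, while word normalization gives every token equal weight regardless of its duration.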
Initial Results
Normalization method   EER
frame                  2.87%
word                   2.87%
2-best                 17.23%
4-best                 10.36%
8-best                 6.96%
Observations:
1) Frame norm result = word norm result
2) EER of n-best decreases with increasing n
-Suggests benefit from an increase in data
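For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the miss rate (targets rejected) equals the false alarm rate (impostors accepted). The sketch below approximates it by sweeping the threshold over the observed scores. The function name and the scores are hypothetical; evaluation tools typically interpolate along the error curve rather than picking the nearest threshold.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by trying every observed score as a threshold.

    A trial is accepted when its score is >= the threshold. The returned
    value is the average of miss rate and false alarm rate at the
    threshold where the two rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap = None
    best_eer = None
    for threshold in thresholds:
        miss = sum(s < threshold for s in target_scores) / len(target_scores)
        false_alarm = (sum(s >= threshold for s in impostor_scores)
                       / len(impostor_scores))
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (miss + false_alarm) / 2.0
    return best_eer


# Made-up scores: one target trial scores below one impostor trial.
targets = [2.0, 1.5, 1.0, 0.2]
impostors = [0.5, 0.1, -0.3, -1.0]
print(equal_error_rate(targets, impostors))
```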
Initial Results
Comparable results: Sturim et al. text-dependent GMM
Yielded EER of 1.3%
-Larger word pool (50 words)
-Channel normalization
Initial Results
[Figure: "EER by Word", EER (%) for each word, in order: yeah, uh, like, um, uhhuh, okay, right, now, well, actually, see, anyway, yep]
[Figure: "Word Frequencies", token count for each word]
Observations:
EERs for most words lie in a small range around 7%
-Suggests that words, as a group, share some qualities
-Last two may differ greatly partly because of data scarcity
Best word (“yeah”) yielded EER of 4.63% compared with 2.87% for all words
System Enhancements
System Enhancements: New Words
Some discourse markers and backchannels are bigrams
6 Additional Words (Bigrams):
Discourse markers: {you_know, you_see, i_think, i_mean}
Backchannels: {i_see, i_know}
Total coverage of ~10% with these additional words
System Enhancements: New Words
Results
EER reduced from 2.87% to 2.53%
•Significant reduction (~12% relative), especially given the small increase in coverage (~8% to ~10% of tokens)
System Enhancements: New Words
Results
[Figure: "EER by Word", EER (%) for each word]
[Figure: "Word Frequencies", token count for each word]
Observations:
Well-performing bigrams have comparable EERs
Poorly-performing bigrams suffer from a paucity of data
•Suggests possibility of frequency threshold for performance
System Enhancements: More Cepstra
Idea: Higher order cepstra may possess more variability that can be used for speaker discrimination
Input features modified to 19 mel-cepstra from 12
System Enhancements: More Cepstra
Results
EER reduced from 2.87% to 1.88%
System Enhancements: CMS
Idea: Channel response may introduce undesirable variability (e.g., the same speaker on different handsets), so try to remove it
Common approach is to perform Cepstral Mean Subtraction (CMS)
•Convolutional effects in the time domain become additive effects in the log power domain:
$X(\omega,t) = S(\omega,t)\,C(\omega,t)$
$\log|X(\omega,t)|^2 = \log|S(\omega,t)|^2 + \log|C(\omega,t)|^2$
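Because cepstra are a linear transform of the log spectrum, a channel that stays fixed over an utterance adds the same constant vector to every cepstral frame, and subtracting the per-utterance mean removes it. The sketch below shows this on made-up two-dimensional "cepstra". The function name and the data are illustrative assumptions; in practice the same thing is usually done by the feature extraction front end.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    if not frames:
        return []
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / len(frames)
             for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


# Made-up cepstral frames, and the same frames seen through a fixed channel
# that adds a constant offset to each coefficient.
clean = [[1.0, 2.0], [3.0, 6.0], [5.0, 1.0]]
channel_offset = [0.5, -2.0]
observed = [[c + o for c, o in zip(frame, channel_offset)] for frame in clean]

print(cepstral_mean_subtraction(clean))
print(cepstral_mean_subtraction(observed))  # same values, up to rounding
```

The mean also contains the speaker's own long-term average spectrum, so CMS discards some speaker information along with the channel.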
System Enhancements: CMS
Results
EER reduced from 2.87% to 1.35%
Poor performance in low false alarm region
•possibly due to small number of data points
•also may have removed ‘good’ channel info
System Enhancements: Combined System
Results
“Grab bag” system yields EER of 1.01%
Suffers from same problem of poor performance for low false alarms
Conclusions
Well-performing text-dependent speaker recognition in an unconstrained speech domain is very feasible
Benefit of sequential information appears to have been established
Benefits of higher order cepstra and CMS for input features have been demonstrated
Future Work
-Analyze performance with ASR output
-Closer analysis of the relationship between word frequency and performance
-More words!
-Normalizations (Hnorm, Tnorm)
-Examine influence of word context
(e.g., “well” as discourse marker and as adverb)
Fin