8/12/2003 Text-Constrained Speaker Recognition Using Hidden Markov Models Kofi A. Boakye...

8/12/2003

Text-Constrained Speaker Recognition

Using Hidden Markov Models

Kofi A. Boakye

International Computer Science Institute


Outline

Introduction

Design and System Description

Initial Results

System Enhancements

• More words

• Higher order cepstra

• Cepstral Mean Subtraction

Conclusions

Future Work


Introduction

Speaker Recognition Problem: Determine whether a spoken segment was produced by the putative target speaker

• Also referred to as Speaker Verification/Authentication


Introduction

Method of Solution Requires Two Phases:

• Training Phase

• Testing Phase

[Figure: claimed identity: Sally]

Similar to speech recognition, though “noise” (inter-speaker variability) is now signal.


Introduction

Also like speech recognition, different domains exist

Two major divisions:

1) Text-dependent/Text-constrained

• Highly constrained text spoken by person

• Examples: fixed phrase, prompted phrase

2) Text-independent

• Unconstrained text spoken by person

• Example: conversational speech


Introduction

Text-dependent systems can have high performance because of input constraints

• More of the acoustic variation arises from speaker distinction (vs. phones)

Text-independent systems have greater flexibility


Introduction

Question: Is it possible to capitalize on the advantages of text-dependent systems in text-independent domains?

Answer: Yes!


Introduction

Idea: Limit words of interest to a select group

- Words should have high frequency in domain

- Words should have high speaker-discriminative quality

What kind of words match these criteria for conversational speech?

1) Discourse markers (like, well, now…)

2) Filled pauses (um, uh)

3) Backchannels (yeah, right, uhhuh, …)

These words are fairly spontaneous and represent an “involuntary speaking style” (Heck, WS2002)


Design

Task is a detection problem, so use likelihood ratio detector

Likelihood Ratio Detector:

Λ = p(X|S) / p(X|UBM)

- In implementation, log-likelihood is used

[Figure: signal → Feature Extraction → Speaker Model (adapted from Background Model) and Background Model → ratio Λ; Λ > Θ: Accept, Λ < Θ: Reject]
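The accept/reject rule can be illustrated with a minimal sketch in the log domain. The function names and the numeric log-likelihoods are hypothetical, chosen only for illustration; a real system would obtain them from the speaker and background models.

```python
import math


def log_likelihood_ratio(logp_speaker: float, logp_ubm: float) -> float:
    """Return log Λ = log p(X|S) - log p(X|UBM)."""
    return logp_speaker - logp_ubm


def decide(llr: float, log_threshold: float) -> str:
    """Accept when log Λ exceeds log Θ, otherwise reject."""
    return "accept" if llr > log_threshold else "reject"


# Hypothetical log-likelihoods for one test segment X (not real data).
llr = log_likelihood_ratio(logp_speaker=-1200.0, logp_ubm=-1215.0)
# Θ = 1 (log Θ = 0) is an arbitrary illustrative threshold.
decision = decide(llr, log_threshold=math.log(1.0))
```

Working with log-likelihoods avoids numerical underflow when per-frame probabilities are multiplied over long segments.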


Design

State-of-the-art speaker recognition systems use Gaussian Mixture Models

• Speaker’s acoustic space is represented by many-component mixture of Gaussians

[Figure: acoustic spaces of speaker 1 and speaker 2]


Design

Speaker models are obtained via adaptation of a Universal Background Model (UBM)

• Probabilistically align target training data into UBM mixture states

• Update mixture weights, means and variances based on the number of occurrences in mixtures

• Gives very good performance, but…

[Figure: Target training data]


Design

Concern: GMMs utilize a “bag-of-frames” approach

• Frames assumed to be independent

• Sequential information is not really utilized

Alternative: Use HMMs

• Do likelihood test on output from recognizer, which is an accumulated log-probability score

• Text-independent system has been analyzed (Weber et al. from Dragon Systems)

• Let’s try a text-dependent one!
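The “bag-of-frames” concern can be made concrete with a small sketch: because a GMM scores each frame independently, shuffling the frames in time leaves the total log-likelihood unchanged. The model parameters and data below are random placeholders, not values from the system described here.

```python
import numpy as np


def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of frames (T, D) under a diagonal-covariance GMM
    with M components, treating every frame as independent."""
    diff = frames[:, None, :] - means[None, :, :]                     # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)   # (M,)
    log_comp = log_norm - 0.5 * np.sum(diff**2 / variances, axis=2)   # (T, M)
    log_mix = np.log(weights) + log_comp                              # (T, M)
    # Log-sum-exp over components, then sum over frames.
    mx = log_mix.max(axis=1, keepdims=True)
    per_frame = mx[:, 0] + np.log(np.exp(log_mix - mx).sum(axis=1))
    return per_frame.sum()


rng = np.random.default_rng(0)
T, M, D = 50, 4, 13                      # placeholder sizes
frames = rng.normal(size=(T, D))
weights = np.full(M, 1.0 / M)
means = rng.normal(size=(M, D))
variances = np.ones((M, D))

ll_original = gmm_log_likelihood(frames, weights, means, variances)
ll_shuffled = gmm_log_likelihood(frames[rng.permutation(T)], weights, means, variances)
```

The two scores agree up to floating-point rounding. An HMM score, by contrast, depends on the order of frames through its state transitions.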


System

Word-level HMM-UBM detectors

[Figure: signal → Word Extractor → HMM-UBM 1, HMM-UBM 2, …, HMM-UBM N → Combination → Λ]

Topology:

• Left-to-right HMM with self-loops and no skips

• 4 Gaussian components per state

• Number of states related to number of phones and median number of frames for word
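As a rough illustration of the topology, the sketch below builds a transition matrix for a left-to-right HMM with self-loops and no skips. The self-loop probability of 0.6 and the treatment of the last state as absorbing are simplifying assumptions; HTK itself uses non-emitting entry and exit states, which are omitted here.

```python
import numpy as np


def left_to_right_transitions(num_states: int, self_loop: float = 0.6) -> np.ndarray:
    """Transition matrix where each state may only stay put or advance by one."""
    A = np.zeros((num_states, num_states))
    for i in range(num_states):
        if i + 1 < num_states:
            A[i, i] = self_loop            # stay in the same state
            A[i, i + 1] = 1.0 - self_loop  # move to the next state (no skips)
        else:
            A[i, i] = 1.0                  # last state: absorbing in this sketch
    return A


A = left_to_right_transitions(4)
```

Every entry above the first superdiagonal is zero (no skips) and every entry below the diagonal is zero (no backward moves).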


System

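HMMs implemented using HMM toolkit (HTK)

- Used for speech recognition

Input features were 12 mel-cepstra, first differences, and zeroth order cepstrum (energy parameter)

Adaptation: Means were adapted using Maximum A Posteriori adaptation

$\hat{\mu}_{jm} = \frac{N_{jm}}{N_{jm}+\tau}\,\bar{\mu}_{jm} + \frac{\tau}{N_{jm}+\tau}\,\mu_{jm}$

(for state $j$ and mixture component $m$: $\mu_{jm}$ is the UBM mean, $\bar{\mu}_{jm}$ the mean of the adaptation data, $N_{jm}$ the occupation count of the adaptation data, and $\tau$ the weight given to the prior)

In cases of no adaptation data, UBM was used

- LLR score cancels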
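MAP mean adaptation, as commonly formulated, interpolates between the mean of the adaptation data and the UBM mean, weighted by how much adaptation data a mixture component has seen. The sketch below handles a single component; the function name, the prior weight `tau=10.0`, and the toy data are assumptions for illustration only.

```python
import numpy as np


def map_adapt_mean(ubm_mean, frames, occupancies, tau=10.0):
    """MAP update of one Gaussian mean.

    frames:      (T, D) adaptation frames
    occupancies: (T,) posterior probability that each frame belongs to this
                 component (from a probabilistic alignment)
    tau:         weight given to the prior (UBM) mean; hypothetical value
    """
    ubm_mean = np.asarray(ubm_mean, dtype=float)
    occupancies = np.asarray(occupancies, dtype=float)
    n = occupancies.sum()                  # occupation count
    if n == 0.0:
        return ubm_mean.copy()             # no adaptation data: keep the UBM mean
    data_mean = (occupancies[:, None] * frames).sum(axis=0) / n
    return (n / (n + tau)) * data_mean + (tau / (n + tau)) * ubm_mean


ubm_mean = np.zeros(2)
frames = np.tile([2.0, 4.0], (10, 1))      # ten identical toy frames
adapted = map_adapt_mean(ubm_mean, frames, np.ones(10), tau=10.0)
unadapted = map_adapt_mean(ubm_mean, np.empty((0, 2)), np.array([]))
```

With an occupation count of 10 and `tau` of 10, the adapted mean lands halfway between the UBM mean and the data mean. With no data the model stays equal to the UBM, so target and UBM scores are identical and the log-likelihood ratio is zero.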


Word Selection

13 Words:

Discourse markers: {actually, anyway, like, see, well, now}

Filled pauses: {um, uh}

Backchannels: {yeah, yep, okay, uhhuh, right}

Words account for approx. 8% of total tokens
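Token coverage of a word list can be measured with a simple count. The sketch below is illustrative only: the helper name and the toy token sequence are made up, and matching is by surface form, so a word such as “like” is counted whatever its function in the sentence.

```python
from collections import Counter

WORDS = {
    "actually", "anyway", "like", "see", "well", "now",   # discourse markers
    "um", "uh",                                           # filled pauses
    "yeah", "yep", "okay", "uhhuh", "right",              # backchannels
}


def coverage(tokens, vocabulary=WORDS):
    """Fraction of tokens that belong to the selected word list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for w, c in counts.items() if w in vocabulary) / total


# Toy transcript, not taken from Switchboard.
tokens = "um i think that is right yeah well i do not know".split()
toy_coverage = coverage(tokens)
```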


Recognition Task

NIST Extended Data Evaluation:

Training on 1, 2, 4, 8, and 16 complete conversation sides and testing on one side (side duration ~2.5 min)

Uses Switchboard I corpus

-Conversational telephone speech

Cross-validation method where data is partitioned

Test on one partition; use others for background models and normalization

For project, used splits 4-6 for background and 1 for testing with 8-conversation training


Scoring

LLR(X) = log(p(X|S)) – log(p(X|UBM))

Target score: output of adapted HMM scoring forced alignment recognition of word from true transcripts (aligned via SRI recognizer)

UBM score: output of non-adapted HMM scoring same forced alignment

Frame normalization:

$\text{score} = \frac{\sum_i \left(\text{target}_i - \text{UBM}_i\right)}{\sum_j \#\text{frames}_j}$

Word normalization: Average of word-level frame normalizations

N-best normalization: Frame normalization on n best matching (i.e. high log-prob) words
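The three normalizations can be sketched from per-word scores. Everything below is an illustrative assumption rather than the system's code: the `WordScore` container, the toy numbers, and in particular the ranking used for n-best (target log-probability per frame), which the slide does not specify.

```python
from dataclasses import dataclass


@dataclass
class WordScore:
    target: float   # log-prob from the adapted (speaker) HMM
    ubm: float      # log-prob from the non-adapted HMM-UBM
    frames: int     # number of frames in the word token


def frame_norm(words):
    """Sum of (target - UBM) over words divided by the total frame count."""
    return sum(w.target - w.ubm for w in words) / sum(w.frames for w in words)


def word_norm(words):
    """Average of the per-word frame-normalized scores."""
    return sum((w.target - w.ubm) / w.frames for w in words) / len(words)


def n_best_norm(words, n):
    """Frame normalization over the n best-matching words.

    Assumption: "best matching" means highest target log-prob per frame.
    """
    best = sorted(words, key=lambda w: w.target / w.frames, reverse=True)[:n]
    return frame_norm(best)


# Toy scores, not real data.
words = [
    WordScore(target=-100.0, ubm=-110.0, frames=20),
    WordScore(target=-240.0, ubm=-246.0, frames=60),
    WordScore(target=-60.0, ubm=-62.0, frames=10),
]
frame_score = frame_norm(words)
word_score = word_norm(words)
two_best_score = n_best_norm(words, 2)
```

Frame normalization weights long words more heavily, while word normalization gives every word token equal weight, so the two can differ on other data.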


Initial Results

Normalization method    EER
frame                   2.87%
word                    2.87%
2-best                  17.23%
4-best                  10.36%
8-best                  6.96%

Observations:

1) Frame norm result = word norm result

2) EER of n-best decreases with increasing n

- Suggests benefit from an increase in data
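For reference, the equal error rate (EER) is the operating point at which the miss rate equals the false alarm rate. The sketch below estimates it by sweeping the threshold over the observed scores; the function name and the toy scores are assumptions, and evaluation tools typically interpolate on the DET curve rather than use this simple sweep.

```python
import numpy as np


def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: threshold where miss and false alarm rates are closest."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, eer = np.inf, 1.0
    for theta in thresholds:
        miss = np.mean(target_scores < theta)            # targets rejected
        false_alarm = np.mean(impostor_scores >= theta)  # impostors accepted
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return float(eer)


# Toy log-likelihood-ratio scores, not real data.
eer = equal_error_rate(
    target_scores=[2.0, 1.5, 0.2, 1.8],
    impostor_scores=[-1.0, 0.5, -0.3, 0.1],
)
```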


Initial Results

Comparable results: Sturim et al. text-dependent GMM

Yielded EER of 1.3%

- Larger word pool (50 words)

- Channel normalization


Initial Results

[Figure: EER by Word. EER (%) per word, in chart order: yeah, uh, like, um, uhhuh, okay, right, now, well, actually, see, anyway, yep]

[Figure: Word Frequencies]

Observations:

EERs for most words lie in a small range around 7%

- Suggests that words, as a group, share some qualities

- Last two words may differ greatly partly because of data scarcity

Best word (“yeah”) yielded EER of 4.63% compared with 2.87% for all words


System Enhancements


System Enhancements: New Words

Some discourse markers and backchannels are bigrams

6 Additional Bigrams:

Discourse markers: {you_know, you_see, i_think, i_mean}

Backchannels: {i_see, i_know}

Total coverage of ~10% with these additional words


System Enhancements: New Words Results

EER reduced from 2.87% to 2.53%

•Significant reduction, especially given the size of coverage increase


System Enhancements: New Words Results

[Figure: EER by Word, EER (%)]

[Figure: Word Frequencies]

Observations:

Well-performing bigrams have comparable EERs

Poorly-performing bigrams suffer from a paucity of data

•Suggests possibility of frequency threshold for performance


System Enhancements: More Cepstra

Idea: Higher order cepstra may possess more variability that can be used for speaker discrimination

Input features modified to 19 mel-cepstra from 12


System Enhancements: More Cepstra Results

EER reduced from 2.87% to 1.88%


System Enhancements: CMS

Idea: Channel response may introduce undesirable variability (e.g., the same speaker on different handsets), so try to remove it

Common approach is to perform Cepstral Mean Subtraction (CMS)

• Convolutional effects in the time domain become additive effects in the log power domain:

$X(\omega,t) = S(\omega,t)\,C(\omega,t)$

$\log|X(\omega,t)|^2 = \log|S(\omega,t)|^2 + \log|C(\omega,t)|^2$
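Because cepstra are a linear transform of the log spectrum, a channel that does not change over the utterance adds a constant offset to every cepstral frame, and subtracting the per-utterance mean removes it. The sketch below demonstrates this on random placeholder data; the time-invariant channel and the 13-dimensional features are assumptions for illustration.

```python
import numpy as np


def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean of each cepstral coefficient.

    cepstra: array of shape (T, D), one row per frame.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)


rng = np.random.default_rng(0)
speech = rng.normal(size=(200, 13))    # placeholder "clean" cepstra
channel = rng.normal(size=(1, 13))     # constant offset from a fixed channel
observed = speech + channel            # additive in the cepstral domain

cms_observed = cepstral_mean_subtraction(observed)
cms_speech = cepstral_mean_subtraction(speech)
```

After CMS the observed and clean features coincide. The speaker's own long-term average is removed along with the channel, so some speaker information is lost as well.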


System Enhancements: CMS Results

EER reduced from 2.87% to 1.35%

Poor performance in low false alarm region

•possibly due to small number of data points

•also may have removed ‘good’ channel info


System Enhancements: Combined System Results

“Grab bag” system yields EER of 1.01%

Suffers from same problem of poor performance for low false alarms


Conclusions

Well-performing text-dependent speaker recognition in an unconstrained speech domain is feasible

Benefit of sequential information appears to have been established

Benefits of higher order cepstra and CMS for input features have been demonstrated


Future Work

-Analyze performance with ASR output

-Closer analysis of word frequency to performance

-More words!

-Normalizations (Hnorm, Tnorm)

-Examine influence of word context

(e.g., “well” as discourse marker and as adverb)


Fin
