Speaker Recognition

Sharat S. Chikkerur, Center for Unified Biometrics and Sensors, http://www.cubs.buffalo.edu


Transcript of Speaker Recognition

Page 1: Speaker Recognition

Speaker Recognition

Sharat S. Chikkerur
Center for Unified Biometrics and Sensors

http://www.cubs.buffalo.edu

Page 2: Speaker Recognition

Speech Fundamentals

Characterizing speech:
• Content (Speech recognition)
• Signal representation (Vocoding)
• Waveform
• Parametric (Excitation, Vocal Tract)
• Signal analysis (Gender determination, Speaker recognition)

Terminologies

Phonemes:
• Basic discrete units of speech
• English has around 42 phonemes
• Language specific

Types of speech:
• Voiced speech
• Unvoiced speech (Fricatives)
• Plosives

Formants

Page 3: Speaker Recognition

Speech production

Speech production mechanism and speech production model

[Figure: source-filter model of speech production. For voiced speech, an impulse train generator at the pitch period (gain Av) drives the chain; for unvoiced speech, a noise source (gain AN) does. The excitation passes through the glottal pulse model G(z), the vocal tract model V(z) (vocal tract length roughly 17 cm), and the radiation model R(z).]
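The source-filter chain above can be sketched in Python. The sampling rate, pitch, and resonance values below are illustrative assumptions, not values from the slides; the vocal tract V(z) is modeled as a cascade of second-order all-pole resonators (using SciPy):

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                      # assumed sampling rate (Hz)
pitch_hz = 100                 # assumed pitch of the impulse-train excitation
n = 2048

# Impulse-train generator: one pulse per pitch period (voiced excitation)
excitation = np.zeros(n)
excitation[::fs // pitch_hz] = 1.0

# Vocal tract model V(z): all-pole filter with resonances (formants)
# at illustrative frequencies of 700 Hz and 1200 Hz
formants = [700, 1200]
bw = 100                       # assumed formant bandwidth (Hz)
a = np.array([1.0])
for freq in formants:
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    # second-order resonator: 1 - 2 r cos(theta) z^-1 + r^2 z^-2
    a = np.convolve(a, [1.0, -2 * r * np.cos(theta), r ** 2])

speech = lfilter([1.0], a, excitation)   # excitation filtered by V(z)
```

G(z) and R(z) are omitted here for brevity; in a fuller model they would be additional filter stages in the same cascade.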

Page 4: Speaker Recognition

Nature of speech

[Figure: time-domain waveform (amplitude -1 to 1 vs. time) and spectrogram (frequency 0 to 4000 Hz vs. time, roughly 0 to 1.2 s) of a speech signal.]

Spectrogram
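A spectrogram like the one on this slide can be computed with SciPy's short-time Fourier analysis. The sampling rate and frame parameters below are assumed values, and a pure test tone stands in for a real speech waveform:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 8000                                # assumed sampling rate (Hz)
t = np.arange(fs) / fs                   # one second of signal
x = np.sin(2 * np.pi * 440 * t)          # test tone standing in for speech

# Short-time Fourier magnitude: rows are frequency bins, columns are time
f, times, Sxx = spectrogram(x, fs=fs, nperseg=256, noverlap=128)
peak_bin = Sxx.mean(axis=1).argmax()     # strongest frequency band
```

For an 8 kHz signal the frequency axis runs from 0 to 4000 Hz, matching the axis range in the figure.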

Page 5: Speaker Recognition

Vocal Tract modeling

Signal spectrum and smoothed signal spectrum

• The smoothed spectrum indicates the locations of the formants of each user
• The smoothed spectrum is obtained from the cepstral coefficients

Page 6: Speaker Recognition

Parametric Representations: Formants

Formant Frequencies
• Characterize the frequency response of the vocal tract
• Used in the characterization of vowels
• Can be used to determine the gender of a speaker

[Figure: two spectra (0 to 4000 Hz) illustrating formant locations.]

Page 7: Speaker Recognition


Parametric Representations: LPC

s[n] = Σ_k a_k s[n-k] + G u[n]

Linear predictive coefficients
• Used in vocoding
• Used in spectral estimation

[Figure: speech spectra (0 to 4000 Hz) overlaid with LPC spectral envelopes of increasing predictor order.]
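The coefficients a_k in the prediction equation can be estimated in several ways; the slides do not specify one, so the sketch below uses the common autocorrelation (Yule-Walker) method, solving the Toeplitz normal equations directly with NumPy:

```python
import numpy as np

def lpc(x, order):
    """Estimate linear predictive coefficients a_k (autocorrelation
    method) so that x[n] ~ sum_k a_k x[n-k], the prediction part of
    s[n] = sum_k a_k s[n-k] + G u[n]."""
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    return np.linalg.solve(R, r[1:])

# Usage: recover the predictor of a known second-order AR process
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
x = np.zeros(4000)
for i in range(2, 4000):
    x[i] = 1.3 * x[i - 1] - 0.4 * x[i - 2] + e[i]
a_est = lpc(x, 2)   # approximately [1.3, -0.4]
```

In practice the Levinson-Durbin recursion solves the same equations more efficiently; the direct solve is used here for clarity.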

Page 8: Speaker Recognition

[Figure: three signal/spectrum panels (0 to 6000 samples) illustrating the cepstral analysis of a speech segment.]

Parametric Representations: Cepstrum

[Figure: source-filter model, with a pitch-driven pulse train p[n] (gain Av) or noise (gain AN) passed through G(z), V(z), and R(z) to produce u[n].]

The cepstrum is computed by a homomorphic system D[] → L[] → D⁻¹[], with D[] = DFT, L[] = LOG, and D⁻¹[] = IDFT. A convolution x1[n] * x2[n] is mapped by the DFT to the product X1(z)X2(z), by the logarithm to the sum log(X1(z)) + log(X2(z)), and by the inverse DFT to the additive cepstral representation x1'[n] + x2'[n]. Convolution in the time domain thus becomes addition in the cepstral domain, which lets the excitation and vocal tract contributions be separated.
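The DFT → log → IDFT chain is directly implementable with NumPy's FFT routines. The example below computes a real cepstrum and checks the convolution-to-addition property, which holds exactly for circular convolution:

```python
import numpy as np

def real_cepstrum(x):
    # c[n] = IDFT( log |DFT(x[n])| ): the D[], L[], D^-1[] chain
    return np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(x)))))

# Convolution-to-addition property: for circular convolution,
# DFT(x1 (*) x2) = X1 X2, so the real cepstra simply add.
rng = np.random.default_rng(1)
x1, x2 = rng.standard_normal(64), rng.standard_normal(64)
x12 = np.real(np.fft.ifft(np.fft.fft(x1) * np.fft.fft(x2)))
lhs = real_cepstrum(x12)
rhs = real_cepstrum(x1) + real_cepstrum(x2)
```

The random vectors stand in for the excitation and vocal tract impulse responses whose convolution forms the speech signal.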

Page 9: Speaker Recognition

Speaker Recognition

Definition:
• Speaker recognition is the method of recognizing a person based on his or her voice
• It is one of the forms of biometric identification
• It depends on speaker-dependent characteristics

Speaker recognition divides into:
• Speaker Identification (text-dependent or text-independent)
• Speaker Verification (text-dependent or text-independent)
• Speaker Detection

Speech applications: transmission, speech synthesis, speech enhancement, aids to the handicapped, speech recognition, speaker verification.

Page 10: Speaker Recognition

Generic Speaker Recognition System

Enrollment: Preprocessing → Feature Extraction → Speaker Model
Verification: Preprocessing → Feature Extraction → Pattern Matching → Score

Preprocessing (speech signal → analysis frames):
• A/D conversion
• End point detection
• Pre-emphasis filter
• Segmentation

Feature extraction (analysis frames → feature vector):
• LAR
• Cepstrum
• LPCC
• MFCC

Pattern matching:
• Stochastic models (GMM, HMM)
• Template models (DTW), distance measures

Choice of features

• Differentiating factors between speakers include vocal tract shape and behavioral traits
• Features should have high inter-speaker variation and low intra-speaker variation

Page 11: Speaker Recognition

Our Approach

• Preprocessing: silence removal
• Feature extraction: cepstrum coefficients, cepstral normalization (long-time average)
• Speaker model: polynomial function expansion, reference template
• Matching: dynamic time warping, distance computation

Page 12: Speaker Recognition

Silence Removal

Frames are classified using the short-time energy

E[n] = Σ_k x²[k] w[n-k]

where w is an N-sample analysis window, compared against the average energy

E_avg = (1/N) Σ_n E[n]

Frames whose energy falls well below the average are discarded as silence.
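The energy-based silence removal step can be sketched in Python. This is a minimal variant; the frame length and threshold ratio below are assumed values, not parameters from the slides:

```python
import numpy as np

def remove_silence(x, frame_len=256, threshold_ratio=0.1):
    """Energy-based silence removal: drop frames whose short-time
    energy E = sum_k x^2[k] falls below a fraction of the average
    frame energy. Frame length and threshold ratio are assumed."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)
    keep = energy >= threshold_ratio * energy.mean()
    return frames[keep].ravel()

# Usage: a burst of signal followed by silence
x = np.concatenate([np.ones(1024), np.zeros(1024)])
voiced = remove_silence(x)   # the silent second half is dropped
```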

Page 13: Speaker Recognition

Pre-emphasis

H(z) = 1 - a z⁻¹, with a = 0.95
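The pre-emphasis filter H(z) = 1 - 0.95 z⁻¹ applied in the time domain is a one-line difference equation:

```python
import numpy as np

def preemphasis(x, a=0.95):
    # H(z) = 1 - a z^-1 in the time domain: y[n] = x[n] - a * x[n-1]
    return np.append(x[0], x[1:] - a * x[:-1])
```

The filter boosts high frequencies, compensating for the spectral tilt of voiced speech before further analysis.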

Page 14: Speaker Recognition

Segmentation

Short-time analysis
• The speech signal is segmented into overlapping 'Analysis Frames'
• The speech signal is assumed to be stationary within this frame

With N the length of the analysis frame and n indexing the n-th analysis frame, each frame is extracted with a Hamming window:

w[n] = 0.54 - 0.46 cos(2πn/N)
Q[n] = Σ_k x[k] w[n-k]

[Figure: overlapping analysis frames Q31, Q32, Q33, Q34.]
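Framing with a Hamming window can be sketched as follows. The frame length and hop size are assumed values; note that NumPy's `np.hamming` uses N-1 in the cosine denominator, a minor convention difference from the formula above:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split x into overlapping analysis frames and apply a Hamming
    window (np.hamming computes 0.54 - 0.46*cos(2*pi*n/(N-1))).
    Frame length and hop size are assumed values."""
    window = np.hamming(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([x[s:s + frame_len] * window for s in starts])

frames = frame_signal(np.ones(1024))   # overlapping windowed frames
```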

Page 15: Speaker Recognition

Feature Representation

[Figure: speech signal and spectrum of two users uttering 'ONE'.]

Page 16: Speaker Recognition

Speaker Model

F1 = [a1...a10, b1...b10]
F2 = [a1...a10, b1...b10]
.......
FN = [a1...a10, b1...b10]

Each frame's feature vector combines ten cepstral coefficients (a1...a10) with ten polynomial function expansion coefficients (b1...b10) that summarize their variation over time.
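One plausible reading of the polynomial function expansion step is to fit a low-order polynomial to each cepstral coefficient's trajectory across frames and keep the fit coefficients as the b terms. This sketch is an assumption, not a reconstruction of the slide's exact formula, and the random array merely stands in for real cepstra:

```python
import numpy as np

# Hypothetical sketch: fit a first-order polynomial to each cepstral
# coefficient's trajectory across frames; the fit coefficients stand in
# for the b1..b10 terms of F = [a1..a10, b1..b10].
rng = np.random.default_rng(2)
cepstra = rng.standard_normal((50, 10))      # 50 frames x 10 coefficients
t = np.linspace(0.0, 1.0, cepstra.shape[0])
poly_feats = np.array([np.polyfit(t, cepstra[:, j], deg=1)
                       for j in range(cepstra.shape[1])])
```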

Page 17: Speaker Recognition

Dynamic Time Warping

Given a test template X(n) = {x1(n), x2(n), ..., xK(n)}, n = 1...N, and a reference template Y(m) = {y1(m), ..., yK(m)}, m = 1...M, with K features per frame, the local distance between frames is

D(X(n), Y(m)) = Σ_{i=1..K} (x_i(n) - y_i(m))²

and DTW selects the warping function w(n) that minimizes the total distance

D_T = min_{w(n)} Σ_{n=1..N} D(X(n), Y(w(n)))

• The DTW warping path in the n-by-m matrix is the path with minimum average cumulative cost. The unmarked area is the constraint region within which the path is allowed to move.
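The DTW recurrence can be implemented directly with dynamic programming. This is a minimal sketch using the squared-Euclidean local distance and the standard step pattern, without the path constraint region shown in the figure:

```python
import numpy as np

def dtw_distance(X, Y):
    """Dynamic time warping between feature sequences X (N x K) and
    Y (M x K): minimum cumulative local distance over all warping
    paths, via the standard (match / insert / delete) recurrence."""
    n, m = len(X), len(Y)
    local = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = local[i - 1, j - 1] + min(D[i - 1, j],
                                                D[i, j - 1],
                                                D[i - 1, j - 1])
    return D[n, m]

# Usage: the warp absorbs a repeated frame at zero cost
X = np.array([[0.0], [1.0], [2.0]])
Y = np.array([[0.0], [0.0], [1.0], [2.0]])
d = dtw_distance(X, Y)
```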

Page 18: Speaker Recognition

Results

Distance matrix between utterances (a0, a1, r0, r1, s0, s1):

      a0      a1      r0      r1      s0      s1
a0    0       0.1226  0.3664  0.3297  0.4009  0.4685
a1    0.1226  0       0.5887  0.3258  0.4086  0.4894
r0    0.3664  0.5887  0       0.0989  0.3299  0.4243
r1    0.3297  0.3258  0.0989  0       0.3670  0.4287
s0    0.4009  0.4086  0.3299  0.3670  0       0.1401
s1    0.4685  0.4894  0.4243  0.4287  0.1401  0

• Distances are normalized w.r.t. the length of the speech signal
• Intra-speaker distances are less than inter-speaker distances
• The distance matrix is symmetric

Page 19: Speaker Recognition

Matlab Implementation

Page 20: Speaker Recognition

THANK YOU