
Evaluation of Speaker Recognition Algorithms


Page 1:

Evaluation of Speaker Recognition Algorithms

Page 2:

Speaker Recognition

• Speech Recognition and Speaker Recognition

• Speaker recognition performance is dependent on the channel and noise quality.

• Two sets of data are used: one to enroll and the other to verify.

Page 3:

Data Collection and Processing

• MFCC extraction

• Test algorithms include:

AHS (Arithmetic Harmonic Sphericity)

Gaussian Divergence

Radial Basis Function

Linear Discriminant Analysis, etc.

Page 4:

Cepstrum

• The cepstrum is a common transform used to extract information from a speech signal; its x-axis is quefrency.

• It is used to separate the transfer function from the excitation signal.

X(ω) = G(ω)·H(ω)

log|X(ω)| = log|G(ω)| + log|H(ω)|

F⁻¹{log|X(ω)|} = F⁻¹{log|G(ω)|} + F⁻¹{log|H(ω)|}
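A minimal NumPy sketch of this idea: the real cepstrum is the inverse Fourier transform of the log magnitude spectrum, so low-quefrency coefficients carry the smooth transfer function H while higher quefrencies carry the excitation G. The frame length and the 20-coefficient cutoff are illustrative assumptions, not values from the slides.

```python
import numpy as np

def real_cepstrum(frame):
    """Inverse FFT of the log magnitude spectrum: F^-1{ log|X(w)| }."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # small offset avoids log(0)
    return np.real(np.fft.ifft(log_mag))

frame = np.random.randn(512)      # stand-in for one windowed speech frame
c = real_cepstrum(frame)
envelope_part = c[:20]            # low quefrencies ~ transfer function H
excitation_part = c[20:]          # higher quefrencies ~ excitation G
```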

Page 5:

Cepstrum

Page 6:

Cepstrum

Page 7:

MFCC Extraction

Page 8:

MFCC Extraction

• Short-time FFT

• Frame blocking and windowing. E.g., the first frame contains N samples; the second frame begins M samples later (M < N), giving an overlap of N − M samples, and so on.

• Window function: y(n) = x(n)·w(n). E.g., Hamming window: w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1
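A small sketch of the frame blocking and Hamming windowing described above, assuming 8 kHz audio (as for YOHO later in the deck) and 30 ms frames every 10 ms (the analysis setup mentioned on page 39); the exact sizes here are assumptions.

```python
import numpy as np

def frame_and_window(x, N=240, M=80):
    """Frames of N samples, a new frame every M samples (overlap N - M),
    each multiplied by a Hamming window."""
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))   # Hamming window
    starts = range(0, len(x) - N + 1, M)
    return np.array([x[s:s + N] * w for s in starts])

# 30 ms frames every 10 ms at 8 kHz -> N = 240, M = 80
frames = frame_and_window(np.random.randn(8000))        # 1 s of toy signal
print(frames.shape)                                     # (num_frames, 240)
```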

Page 9:

• Mel-Frequency Wrapping

The mel frequency scale is linear up to 1000 Hz and logarithmic above 1000 Hz.

mel(f) = 2595 · log₁₀(1 + f/700)
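The warping can be written as a pair of helper functions; the inverse mapping below is implied by the formula (a base-10 logarithm is assumed, which is what makes 1000 Hz land near 1000 mel).

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))   # ~1000 mel: end of the (roughly) linear region
print(hz_to_mel(4000.0))   # ~2146 mel: compressed, logarithmic region
```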

Page 10:

Mel-Spaced Filter bank

Page 11:

MFCC

• Taking the cepstrum of the log mel spectrum (i.e., transforming it back to the time domain) gives the MFCCs.

The MFCCs Cn are given by

Cn = Σ_{k=1}^{K} (log Sk) · cos[ n·(k − 1/2)·π/K ],  n = 1, 2, …, K

where Sk are the mel power spectrum coefficients and K is the number of mel filters.
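A direct sketch of the formula above: the MFCCs are a cosine transform of the log mel power spectrum. The 20-filter output is a random placeholder, and keeping 12 coefficients is an assumed choice (it matches the 12 MFCCs used later in the deck).

```python
import numpy as np

def mfcc_from_mel_spectrum(S, num_ceps=12):
    """C_n = sum_k log(S_k) * cos(n * (k - 1/2) * pi / K), n = 1..num_ceps."""
    K = len(S)
    k = np.arange(1, K + 1)
    log_S = np.log(S + 1e-12)
    return np.array([np.sum(log_S * np.cos(n * (k - 0.5) * np.pi / K))
                     for n in range(1, num_ceps + 1)])

S = np.abs(np.random.randn(20)) + 1.0      # placeholder mel power spectrum
print(mfcc_from_mel_spectrum(S).shape)     # (12,)
```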

Page 12:

Arithmetic Harmonic Sphericity

• A function of the eigenvalues of a test covariance matrix relative to a reference covariance matrix for speakers x and y, defined by

AHS(x, y) = log[ tr(Cx·Cy⁻¹) · tr(Cy·Cx⁻¹) ] − 2·log D

where Cx and Cy are the covariance matrices of speakers x and y, and D is the dimensionality of the covariance matrix.
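A NumPy sketch of the measure in the trace form written above (a common formulation; an equivalent eigenvalue form, the log ratio of the arithmetic to harmonic mean of the eigenvalues of Cx·Cy⁻¹, gives the same value). The random features are placeholders for MFCC frames.

```python
import numpy as np

def ahs_distance(Cx, Cy):
    """log[ tr(Cx Cy^-1) * tr(Cy Cx^-1) ] - 2 log D; zero when Cx == Cy."""
    D = Cx.shape[0]
    a = np.trace(Cx @ np.linalg.inv(Cy))   # arithmetic-mean term
    h = np.trace(Cy @ np.linalg.inv(Cx))   # harmonic-mean term
    return np.log(a * h) - 2.0 * np.log(D)

X = np.random.randn(500, 12)               # MFCC frames of speaker x (toy)
Y = np.random.randn(500, 12)               # MFCC frames of speaker y (toy)
print(ahs_distance(np.cov(X.T), np.cov(Y.T)))
```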

Page 13:

Gaussian Divergence

• A mixture of Gaussian densities is used to model the distribution of each speaker's features.
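The slide does not give the divergence formula; as an illustrative stand-in, the sketch below computes the symmetric Kullback-Leibler divergence between single Gaussians fitted to each speaker's features (the single-Gaussian case; extending the measure to full mixtures is more involved).

```python
import numpy as np

def kl_gauss(mu_p, cov_p, mu_q, cov_q):
    """KL divergence KL(N_p || N_q) between two multivariate Gaussians."""
    D = len(mu_p)
    inv_q = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (np.trace(inv_q @ cov_p) + diff @ inv_q @ diff
                  - D + logdet_q - logdet_p)

def gaussian_divergence(Xa, Xb):
    """Symmetrised KL between Gaussians fitted to two speakers' features."""
    mu_a, cov_a = Xa.mean(axis=0), np.cov(Xa.T)
    mu_b, cov_b = Xb.mean(axis=0), np.cov(Xb.T)
    return kl_gauss(mu_a, cov_a, mu_b, cov_b) + kl_gauss(mu_b, cov_b, mu_a, cov_a)
```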

Page 14:

YOHO Dataset

Sampling frequency: 8 kHz

Page 15:

Performance – AHS with 138 subjects and 24 MFCCs

Page 16:

Performance – Gaussian Div with 138 subjects and 24 MFCCs

Page 17:

Performance – AHS with 138 subjects and 12 MFCCs

Page 18:

Performance – Gaussian Div with 138 subjects and 12 MFCCs

Page 19:

Review of Probability and Statistics

• Probability Density Functions

Example 2: f(x) = (3/2)(1 − x²) for 0 ≤ x ≤ 1, and f(x) = 0 otherwise.

(Figure: plot of f(x) over x with a = 0.25 and b = 0.75 marked.)

The probability that x is between 0.25 and 0.75 is

P(0.25 ≤ x ≤ 0.75) = ∫_{0.25}^{0.75} (3/2)(1 − x²) dx = (3/2)·[x − x³/3]_{0.25}^{0.75} ≈ 0.547
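The 0.547 value can be checked numerically; a one-line SciPy verification (not part of the original slides):

```python
from scipy.integrate import quad

f = lambda x: 1.5 * (1.0 - x ** 2)        # f(x) on [0, 1], zero elsewhere
prob, _ = quad(f, 0.25, 0.75)
print(round(prob, 3))                      # 0.547
```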

Page 20:

Review of Probability and Statistics

• Cumulative Distribution Functions

The cumulative distribution function (c.d.f.) F(x) for a c.r.v. X is

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(y) dy

Example: f(x) = (3/2)(1 − x²) for 0 ≤ x ≤ 1, and f(x) = 0 otherwise.

(Figure: plot of f(x) over x with b = 0.75 marked.)

The c.d.f. of f(x) is

F(x) = ∫_{0}^{x} (3/2)(1 − y²) dy = (3/2)·[y − y³/3]_{0}^{x} = (3/2)·(x − x³/3),  0 ≤ x ≤ 1

Page 21:

Review of Probability and Statistics

• Expected Values and Variance

The expected (mean) value of a c.r.v. X with p.d.f. f(x) is

E(X) = ∫_{−∞}^{∞} x·f(x) dx

Example 1 (discrete): E(X) = 2·0.05 + 3·0.10 + … + 9·0.05 = 5.35

(Figure: discrete probability histogram over the values 1.0 through 9.0.)

Example 2 (continuous): with f(x) = (3/2)(1 − x²) for 0 ≤ x ≤ 1 and f(x) = 0 otherwise,

E(X) = ∫_{0}^{1} x·(3/2)(1 − x²) dx = (3/2)·[x²/2 − x⁴/4]_{0}^{1} = 3/8
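The 3/8 result can be checked the same way as the earlier probability (a small SciPy verification, not part of the original slides):

```python
from scipy.integrate import quad

f = lambda x: 1.5 * (1.0 - x ** 2)
mean, _ = quad(lambda x: x * f(x), 0.0, 1.0)
print(mean)                                # 0.375 = 3/8
```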

Page 22:

Review of Probability and Statistics

• The Normal (Gaussian) Distribution

the p.d.f. of a normal distribution is

f(x; μ, σ) = (1 / (σ·√(2π))) · e^{−(x−μ)²/(2σ²)},  −∞ < x < ∞

where μ is the mean and σ is the standard deviation

(Figure: bell-shaped curve centered at μ with spread σ.)

Page 23:

Review of Probability and Statistics

• The Normal Distribution

any arbitrary p.d.f. can be approximated by summing N weighted Gaussians (a mixture of Gaussians)

(Figure: a p.d.f. built from six weighted Gaussian components with weights w1 … w6.)
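A small sketch of the idea: a weighted sum of Gaussian components is itself a valid p.d.f. (the component means, widths, and weights below are arbitrary illustrative values).

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """Weighted sum of Gaussians; the weights should sum to 1."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

x = np.linspace(-6.0, 13.0, 4000)
p = mixture_pdf(x, [0.3, 0.5, 0.2], [0.0, 3.0, 7.0], [1.0, 0.8, 1.5])
print(p.sum() * (x[1] - x[0]))             # ~1.0: still a valid p.d.f.
```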

Page 24:

A Markov Model (Markov Chain) is:

• similar to a finite-state automaton, with probabilities of transitioning from one state to another:

Review of Markov Model?

(Figure: a five-state Markov chain S1–S5 with transition probabilities labeled on the arcs.)

• transition from state to state at discrete time intervals

• can only be in 1 state at any given time

Page 25:

Transition Probabilities: • no assumptions (full probabilistic description of system):

P[qt = j | qt-1= i, qt-2= k, … , q1=m]

• usually use first-order Markov Model: P[qt = j | qt-1= i] = aij

• first-order assumption: transition probabilities depend only on previous state

• aij obeys usual rules:

• sum of probabilities leaving a state = 1 (must leave a state)

Review of Markov Model?

aij ≥ 0 for all i, j

Σ_{j=1}^{N} aij = 1 for all i

Page 26:

(Figure: a three-state Markov chain S1, S2, S3 with the transition probabilities listed below.)

Transition Probabilities: • example:

Review of Markov Model?

a11 = 0.0  a12 = 0.5  a13 = 0.5  a1,Exit = 0.0  (sum = 1.0)
a21 = 0.0  a22 = 0.7  a23 = 0.3  a2,Exit = 0.0  (sum = 1.0)
a31 = 0.0  a32 = 0.0  a33 = 0.0  a3,Exit = 1.0  (sum = 1.0)
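The same example as a NumPy array, with a quick check of the "must leave a state" rule (the explicit Exit column is just a convenient way to represent the exit arc):

```python
import numpy as np

# Rows: S1, S2, S3.  Columns: S1, S2, S3, Exit.
A = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.0, 0.7, 0.3, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
print(A.sum(axis=1))                       # [1. 1. 1.] -> each row sums to 1
```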

Page 27:

Transition Probabilities: • probability distribution function:

Review of Markov Model?

(Figure: a chain S1 → S2 → S3 with a self-loop probability of 0.4 on S2 and an exit probability of 0.6.)

p(remain in state S2 exactly 1 time) = 0.4 · 0.6 = 0.240
p(remain in state S2 exactly 2 times) = 0.4 · 0.4 · 0.6 = 0.096
p(remain in state S2 exactly 3 times) = 0.4 · 0.4 · 0.4 · 0.6 = 0.038

= exponential decay (characteristic of Markov Models)
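The decay can be reproduced directly from the self-loop probability (same numbers as above):

```python
a_loop, a_leave = 0.4, 0.6                 # self-loop and exit probabilities for S2
for d in (1, 2, 3, 4, 5):
    print(d, round(a_loop ** d * a_leave, 3))   # 0.24, 0.096, 0.038, ... decays exponentially
```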

Page 28:

• Example 1: Single Fair Coin

Review of Markov Model?

(Figure: two states S1 and S2 with all four transition probabilities equal to 0.5.)

S1 corresponds to e1 = Heads: a11 = 0.5, a12 = 0.5
S2 corresponds to e2 = Tails: a21 = 0.5, a22 = 0.5

• Generate events: H T H H T H T T T H H

corresponding to the state sequence S1 S2 S1 S1 S2 S1 S2 S2 S2 S1 S1

Page 29:

• Example 2: Weather

Review of Markov Model?

(Figure: a three-state weather Markov chain S1, S2, S3; the transition probabilities on the arcs correspond to the matrix A on the next slide.)

Page 30:

• Example 2: Weather (con’t)

• S1 = event1 = rain, S2 = event2 = clouds, S3 = event3 = sun; A = {aij} is given below

• What is the probability of {rain, rain, rain, clouds, sun, clouds, rain}?
Obs. = {r, r, r, c, s, c, r}
S = {S1, S1, S1, S2, S3, S2, S1}
time = {1, 2, 3, 4, 5, 6, 7} (days)

P = P[S1] · P[S1|S1] · P[S1|S1] · P[S2|S1] · P[S3|S2] · P[S2|S3] · P[S1|S2]

= 0.5 · 0.7 · 0.7 · 0.25 · 0.1 · 0.7 · 0.4

= 0.001715

Review of Markov Model?

A = {aij} =
[ 0.70  0.25  0.05 ]
[ 0.40  0.50  0.10 ]
[ 0.20  0.70  0.10 ]

π1 = 0.5, π2 = 0.4, π3 = 0.1
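The 0.001715 value follows directly from π and A; a short check (the state names and indexing are the only things added here):

```python
import numpy as np

state = {"rain": 0, "clouds": 1, "sun": 2}
A = np.array([[0.70, 0.25, 0.05],
              [0.40, 0.50, 0.10],
              [0.20, 0.70, 0.10]])
pi = np.array([0.5, 0.4, 0.1])

def sequence_probability(obs):
    s = [state[o] for o in obs]
    p = pi[s[0]]                           # initial state probability
    for prev, cur in zip(s, s[1:]):
        p *= A[prev, cur]                  # P[S_cur | S_prev] = a(prev, cur)
    return p

print(sequence_probability(["rain", "rain", "rain", "clouds", "sun", "clouds", "rain"]))
# 0.001715
```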

Page 31:

• Example 2: Weather (con’t)

• S1 = event1 = rain, S2 = event2 = clouds, S3 = event3 = sun; A = {aij} is given below

• What is the probability of {sun, sun, sun, rain, clouds, sun, sun}?
Obs. = {s, s, s, r, c, s, s}
S = {S3, S3, S3, S1, S2, S3, S3}
time = {1, 2, 3, 4, 5, 6, 7} (days)

P = P[S3] · P[S3|S3] · P[S3|S3] · P[S1|S3] · P[S2|S1] · P[S3|S2] · P[S3|S3]

= 0.1 · 0.1 · 0.1 · 0.2 · 0.25 · 0.1 · 0.1

= 5.0 × 10⁻⁷

Review of Markov Model?

A = {aij} =
[ 0.70  0.25  0.05 ]
[ 0.40  0.50  0.10 ]
[ 0.20  0.70  0.10 ]

π1 = 0.5, π2 = 0.4, π3 = 0.1

Page 32:

Simultaneous speech and speaker recognition using hybrid architecture

– Dominique Genoud, Dan Ellis, Nelson Morgan

• The automatic recognition of the human voice is often divided into two parts:
– speech recognition
– speaker recognition

Page 33:

Traditional System

• A traditional, state-of-the-art speaker recognition task can be divided into two parts:
– Feature Extraction
– Model Creation

Page 34:

Feature Extraction

(Figure: the signal is blocked into overlapping frames (Frame 1 … Frame N) with a given frame length and frame overlap, a window function is applied, and each frame goes through signal processing to produce a d-dimensional feature vector X1, X2, …, Xn.)

Page 35:

Model Creation

• Once the features are extracted, a model can be created using various techniques, e.g., a Gaussian Mixture Model (GMM).

• Once the models are created, we can compute the distance from one model to another.

• Based on this distance, a decision can be inferred.
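A minimal sketch of this enrollment-and-scoring flow with per-speaker GMMs, using scikit-learn's GaussianMixture as a stand-in for whatever toolkit is actually used; the random arrays are placeholders for real MFCC frames, and the number of mixture components is an arbitrary choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Enrollment: fit one GMM per speaker on that speaker's MFCC frames.
enroll_a = np.random.randn(2000, 12)                  # toy features, speaker A
enroll_b = np.random.randn(2000, 12) + 1.0            # toy features, speaker B
gmm_a = GaussianMixture(n_components=8, covariance_type="diag").fit(enroll_a)
gmm_b = GaussianMixture(n_components=8, covariance_type="diag").fit(enroll_b)

# Verification: the average log-likelihood of a test utterance under each
# model plays the role of the model "distance" used for the decision.
test = np.random.randn(500, 12) + 1.0
print("closer to B" if gmm_b.score(test) > gmm_a.score(test) else "closer to A")
```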

Page 36:

A simultaneous speaker and speech recognition

• A system that models the speaker's phones as well as the speaker's features, and combines them into a single model, could perform very well.

Page 37:

A simultaneous speaker and speech recognition

• Maximum a posteriori (MAP) estimation is used to generate speaker-specific models from a set of speaker independent (SI) seed models.

• Assuming no prior knowledge about the speaker distribution, the a posteriori probability is approximated by a score defined in terms of the speaker-specific models and a world (background) model.

Page 38:

A simultaneous speaker and speech recognition

• In the previous equation, the constant was determined empirically to be 0.02.

• Using the Viterbi algorithm, the N most probable speakers can be found.

• Results: the authors reported 0.7% EER, compared to 5.6% EER for a GMM-based system on the same 100-speaker dataset.

Page 39:

Speech and Speaker Combination

• Posteriori probabilities and Likelihoods Combination for Speech and Speaker Recognition

• Mohamed Faouzi BenZeghiba, Eurospeech 2003.

• The authors used a hybrid HMM/ANN (MLP) system for this work.

• For the speech features, they used 12 MFCC coefficients plus energy and their first derivatives, computed every 10 ms over a 30 ms window.

Page 40:

System Description

Ŵ = argmax_{w ∈ {W}} [ log P(W | X, Θ) ]

Ŝ = argmax_{s ∈ {S}} [ log P(X | Θs) ]

where

W is the word from a finite set of words {W}

S is the speaker from a finite set of registered speakers {S}

Θ denotes the ANN parameters.

Page 41:

System Description

A speaker is accepted when the log-likelihood ratio exceeds a threshold:

LLR(X) = log(P(X | λs)) − log(P(X | Ω)) ≥ threshold

where LLR(X) is the log-likelihood ratio, λs is the speaker's GMM, and Ω is the background (world) model, with parameters derived using MAP adaptation and the world data set.
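A sketch of that decision rule; scikit-learn GMMs stand in here for the MAP-adapted speaker model and the world model, and the zero threshold is a placeholder rather than a tuned operating point.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def accept_speaker(gmm_speaker, gmm_world, X, threshold=0.0):
    """LLR(X) = mean log P(X | speaker) - mean log P(X | world) >= threshold."""
    llr = gmm_speaker.score(X) - gmm_world.score(X)
    return llr >= threshold, llr

world = GaussianMixture(n_components=16).fit(np.random.randn(5000, 12))
speaker = GaussianMixture(n_components=16).fit(np.random.randn(1500, 12) + 0.5)
print(accept_speaker(speaker, world, np.random.randn(300, 12) + 0.5))
```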

Page 42:

Combination

• Use of MLP adaptation:
– shifting the boundaries between the phone classes without strongly affecting the posterior probabilities of the speech sounds of other speakers

• The authors proposed the following formula to combine the two systems:

(Ŵ, Ŝ) = argmax_{(w, s)} [ log P(W | X, Θs) + log P(X | Θs) ]

Page 43:

Combination

• Using a posteriori probabilities on the test set, it can be shown that

(Ŵ, Ŝ) = argmax_{(w, s)} [ α1 · log P(W | X, Θs) + log P(X | Θs) ]

• A speaker is accepted if

α2 · log P(W | X, Θs) + LLR(X) ≥ threshold

where α1 and α2 are determined from a posteriori analysis of the test set.

Page 44:

HMM-Parameter Estimation

• Given an observation sequence O, determine the model parameters λ = (A, B, π) that maximize P(O | λ).

• γt(i) is the probability of being in state i at time t, and pt(i, j) is the probability of being in state i at time t and state j at time t + 1; with αt(i) and βt(j) the forward and backward variables, then

pt(i, j) = αt(i) · aij · bj(o_{t+1}) · β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} αt(i) · aij · bj(o_{t+1}) · β_{t+1}(j)

γt(i) = Σ_{j=1}^{N} pt(i, j)

Page 45:

HMM-Parameter Estimation

• πi = expected frequency in state i at time t = 1

• aij = expected number of transitions from state i to state j / expected number of transitions from state i

• bj(k) = expected number of times in state j and observing symbol vk / expected number of times in state j
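A compact sketch of this re-estimation step, assuming γ (as gamma[t, i]) and pt(i, j) (as xi[t, i, j], defined for t = 1..T−1) have already been produced by a forward-backward pass, and that `obs` holds integer observation-symbol indices; these array names and shapes are assumptions for illustration.

```python
import numpy as np

def reestimate(gamma, xi, obs, num_symbols):
    """One Baum-Welch update from the expected counts described above."""
    T, N = gamma.shape
    pi_new = gamma[0]                                    # frequency in state i at t = 1
    a_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    b_new = np.zeros((N, num_symbols))
    for k in range(num_symbols):
        b_new[:, k] = gamma[obs == k].sum(axis=0)        # times in j while observing v_k
    b_new /= gamma.sum(axis=0)[:, None]
    return pi_new, a_new, b_new
```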

Page 46:

• Thank You