Evaluation of Speaker Recognition Algorithms
Speaker Recognition
• Speech recognition vs. speaker recognition
• Speaker recognition performance depends on the channel and on noise quality.
• Two sets of data are used: one to enroll the speaker and one to verify.
Data Collection and Processing
• MFCC extraction
• Test algorithms include:
  – AHS (Arithmetic Harmonic Sphericity)
  – Gaussian Divergence
  – Radial Basis Function
  – Linear Discriminant Analysis, etc.
Cepstrum
• The cepstrum is a common transform used to extract information from a speech signal; its x-axis is quefrency.
• It is used to separate the vocal-tract transfer function from the excitation signal:
X(ω) = G(ω)H(ω)
log|X(ω)| = log|G(ω)| + log|H(ω)|
F⁻¹{log|X(ω)|} = F⁻¹{log|G(ω)|} + F⁻¹{log|H(ω)|}
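As a minimal illustration of the relations above, the real cepstrum can be computed with NumPy (a sketch; the function name and the small epsilon guard are our own additions):

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse FFT of the log magnitude spectrum,
    i.e. F^-1{ log|X(w)| }."""
    spectrum = np.fft.fft(x)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # epsilon guards log(0)
    return np.fft.ifft(log_mag).real
```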
MFCC Extraction
• Short-time FFT
• Frame blocking and windowing
  Eg: first frame size = N samples; the second frame begins M samples later (M < N), giving an overlap of N − M samples, and so on…
• Window function: y(n) = x(n)w(n)
  Eg: Hamming window: w(n) = 0.54 − 0.46 cos(2πn/(N−1)), 0 ≤ n ≤ N−1
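The frame blocking and windowing steps can be sketched as follows (a helper of our own, assuming frame length N and hop M as described above):

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a signal into overlapping frames and apply a Hamming window.
    frame_len = N samples per frame; hop = M samples between frame starts,
    so consecutive frames overlap by N - M samples."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)  # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([x[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * window
```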
• Mel-Frequency Warping
  The mel frequency scale is approximately linear up to 1000 Hz and logarithmic above 1000 Hz:
  mel(f) = 2595 · log10(1 + f/700)
• Mel-spaced filter bank
MFCC
• Taking the cepstrum of the log mel spectrum (transforming back to time) gives the MFCCs.
• The MFCCs Cn are given by
  Cn = Σ from k = 1 to K of (log Sk) · cos[n(k − 1/2)π/K],  n = 1, 2, …
where Sk are the mel power spectrum coefficients and K is the number of mel filters.
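Putting the steps together, a minimal MFCC sketch (the filter count, coefficient count, and helper names are illustrative choices, not taken from the slides):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                 # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc_frame(frame, fs, n_filters=26, n_ceps=12):
    """MFCCs of one windowed frame: FFT -> mel filterbank -> log -> DCT."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2
    S = mel_filterbank(n_filters, n_fft, fs) @ power  # mel power spectrum S_k
    logS = np.log(S + 1e-12)
    n = np.arange(1, n_ceps + 1)[:, None]
    k = np.arange(1, n_filters + 1)[None, :]
    dct = np.cos(np.pi * n * (k - 0.5) / n_filters)   # C_n = sum_k logS_k cos(...)
    return dct @ logS
```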
Arithmetic Harmonic Sphericity
• A function of the eigenvalues of a test covariance matrix relative to a reference covariance matrix for speakers x and y, defined (following the usual AHS measure) by
  μ(x, y) = log[ tr(Cx · Cy⁻¹) · tr(Cy · Cx⁻¹) ] − 2 log D
where Cx and Cy are the covariance matrices and D is the dimensionality of the covariance matrix.
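A sketch of the sphericity measure as written above (assuming both covariance estimates are symmetric positive definite; the function name is ours):

```python
import numpy as np

def ahs(cov_x, cov_y):
    """Arithmetic-harmonic sphericity between two covariance matrices.
    Equals 0 when the matrices are identical and grows as they diverge."""
    d = cov_x.shape[0]
    t1 = np.trace(cov_x @ np.linalg.inv(cov_y))
    t2 = np.trace(cov_y @ np.linalg.inv(cov_x))
    return np.log(t1 * t2) - 2.0 * np.log(d)
```

By the Cauchy–Schwarz inequality the measure is non-negative, which is why it can serve as a distance-like score between speakers.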
Gaussian Divergence
• A mixture of Gaussian densities is used to model the distribution of each speaker's features.
YOHO Dataset
Sampling frequency: 8 kHz
Performance – AHS with 138 subjects and 24 MFCCs
Performance – Gaussian Div with 138 subjects and 24 MFCCs
Performance – AHS with 138 subjects and 12 MFCCs
Performance – Gaussian Div with 138 subjects and 12 MFCCs
Review of Probability and Statistics
• Probability Density Functions
Example 2:
[Figure: plot of f(x), with the region from a = 0.25 to b = 0.75 shaded]
f(x) = (3/2)(1 − x²) for 0 ≤ x ≤ 1, and f(x) = 0 otherwise
The probability that x is between 0.25 and 0.75 is
P(0.25 ≤ x ≤ 0.75) = ∫ from 0.25 to 0.75 of (3/2)(1 − x²) dx = (3/2)[x − x³/3] from 0.25 to 0.75 ≈ 0.547
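The arithmetic in this example can be double-checked directly from the antiderivative:

```python
# Check of the worked example: f(x) = (3/2)(1 - x^2) on [0, 1].
# The antiderivative of f is F(x) = (3/2)(x - x^3/3).
F = lambda x: 1.5 * (x - x**3 / 3.0)
p = F(0.75) - F(0.25)
print(round(p, 3))  # 0.547
```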
• Cumulative Distribution Functions
The cumulative distribution function (c.d.f.) F(x) for a continuous random variable X is:
F(x) = P(X ≤ x) = ∫ from −∞ to x of f(y) dy
Example: for f(x) = (3/2)(1 − x²) on 0 ≤ x ≤ 1 (0 otherwise), the c.d.f. of f(x) is
F(x) = ∫ from 0 to x of (3/2)(1 − y²) dy = (3/2)(x − x³/3) for 0 ≤ x ≤ 1
[Figure: plot of f(x), with the region up to b = 0.75 shaded]
• Expected Values and Variance
The expected (mean) value of a continuous random variable X with p.d.f. f(x) is:
E(X) = ∫ x f(x) dx
Example 1 (discrete):
E(X) = 2·0.05 + 3·0.10 + … + 9·0.05 = 5.35
[Figure: bar chart of the discrete probabilities 0.05, 0.10, 0.15, 0.20, 0.25, 0.15, 0.05, 0.05 over the values 1.0–9.0]
Example 2 (continuous): for f(x) = (3/2)(1 − x²) on 0 ≤ x ≤ 1 (0 otherwise),
E(X) = ∫ from 0 to 1 of x · (3/2)(1 − x²) dx = (3/2)[x²/2 − x⁴/4] from 0 to 1 = 3/8
• The Normal (Gaussian) Distribution
The p.d.f. of a normal distribution is
f(x; μ, σ) = (1 / (σ√(2π))) · e^(−(x−μ)² / (2σ²)),  −∞ < x < ∞
where μ is the mean and σ is the standard deviation.
[Figure: bell curve centered at μ with spread σ]
• The Normal Distribution
Virtually any p.d.f. can be approximated by summing N weighted Gaussians (a mixture of Gaussians).
[Figure: a p.d.f. approximated by six weighted Gaussians w1 … w6]
Review of Markov Models
A Markov Model (Markov Chain) is:
• similar to a finite-state automaton, with probabilities of transitioning from one state to another
[Figure: five-state chain S1–S5 with transition probabilities on the arcs]
• transitions from state to state occur at discrete time intervals
• the model can be in only one state at any given time
Transition Probabilities:
• no assumptions (full probabilistic description of the system):
  P[qt = j | qt−1 = i, qt−2 = k, …, q1 = m]
• usually use a first-order Markov Model: P[qt = j | qt−1 = i] = aij
• first-order assumption: transition probabilities depend only on the previous state
• aij obeys the usual rules:
  aij ≥ 0 for all i, j
  Σ from j = 1 to N of aij = 1 for all i
• sum of probabilities leaving a state = 1 (must leave a state)
[Figure: three-state chain S1–S3 with transition probabilities 0.5, 0.5, 0.3, 0.7 on the arcs]
Transition Probabilities:
• example:
  a11 = 0.0  a12 = 0.5  a13 = 0.5  a1,Exit = 0.0  (sum = 1.0)
  a21 = 0.0  a22 = 0.7  a23 = 0.3  a2,Exit = 0.0  (sum = 1.0)
  a31 = 0.0  a32 = 0.0  a33 = 0.0  a3,Exit = 1.0  (sum = 1.0)
Transition Probabilities:
• probability distribution function:
[Figure: chain S1–S3, with self-loop probability 0.4 on S2 and exit probability 0.6]
p(remain in state S2 exactly 1 time) = 0.4 · 0.6 = 0.240
p(remain in state S2 exactly 2 times) = 0.4 · 0.4 · 0.6 = 0.096
p(remain in state S2 exactly 3 times) = 0.4 · 0.4 · 0.4 · 0.6 = 0.038
= exponential decay (characteristic of Markov Models)
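The geometric decay above is easy to reproduce:

```python
# State-duration probabilities: with self-loop probability a = 0.4 and
# exit probability 1 - a = 0.6, the chance of remaining in S2 exactly
# d times decays exponentially in d.
a = 0.4
probs = [a**d * (1 - a) for d in range(1, 4)]
print([round(p, 3) for p in probs])  # [0.24, 0.096, 0.038]
```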
• Example 1: Single Fair Coin
[Figure: two states S1, S2 with all transition probabilities 0.5]
S1 corresponds to e1 = Heads: a11 = 0.5, a12 = 0.5
S2 corresponds to e2 = Tails: a21 = 0.5, a22 = 0.5
• Generate events: H T H H T H T T T H H
  corresponds to state sequence: S1 S2 S1 S1 S2 S1 S2 S2 S2 S1 S1
• Example 2: Weather
[Figure: three-state chain S1–S3 with transition probabilities on the arcs]
• S1 = event1 = rain
  S2 = event2 = clouds
  S3 = event3 = sun
  A = {aij} =
    0.70 0.25 0.05
    0.40 0.50 0.10
    0.20 0.70 0.10
  π1 = 0.5, π2 = 0.4, π3 = 0.1
• what is the probability of {rain, rain, rain, clouds, sun, clouds, rain}?
  Obs. = {r, r, r, c, s, c, r}
  S = {S1, S1, S1, S2, S3, S2, S1}
  time = {1, 2, 3, 4, 5, 6, 7} (days)
  P = P[S1] P[S1|S1] P[S1|S1] P[S2|S1] P[S3|S2] P[S2|S3] P[S1|S2]
    = 0.5 · 0.7 · 0.7 · 0.25 · 0.1 · 0.7 · 0.4
    = 0.001715
• Example 2: Weather (con't)
• what is the probability of {sun, sun, sun, rain, clouds, sun, sun}?
  Obs. = {s, s, s, r, c, s, s}
  S = {S3, S3, S3, S1, S2, S3, S3}
  time = {1, 2, 3, 4, 5, 6, 7} (days)
  P = P[S3] P[S3|S3] P[S3|S3] P[S1|S3] P[S2|S1] P[S3|S2] P[S3|S3]
    = 0.1 · 0.1 · 0.1 · 0.2 · 0.25 · 0.1 · 0.1
    = 5.0 × 10⁻⁷
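Both worked examples can be reproduced with a few lines, assuming the transition matrix A and initial probabilities π recovered from the example (rows ordered rain, clouds, sun):

```python
import numpy as np

# Weather example: A[i, j] = P[next state j | current state i],
# pi[i] = initial probability of state i.
A = np.array([[0.70, 0.25, 0.05],
              [0.40, 0.50, 0.10],
              [0.20, 0.70, 0.10]])
pi = np.array([0.5, 0.4, 0.1])

def sequence_prob(states):
    """Probability of a state sequence under a first-order Markov chain.
    states: 0-based indices, e.g. [0,0,0,1,2,1,0] for r,r,r,c,s,c,r."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]
    return p

print(round(sequence_prob([0, 0, 0, 1, 2, 1, 0]), 6))  # 0.001715
```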
Simultaneous speech and speaker recognition using hybrid architecture
– Dominique Genoud, Dan Ellis, Nelson Morgan
• The automatic recognition of the human voice is often divided into two parts:
  – speech recognition
  – speaker recognition
Traditional System
• A traditional state-of-the-art speaker recognition system can be divided into two parts:
  – Feature Extraction
  – Model Creation
Feature Extraction
[Figure: the signal is divided into frames 1 … N of a fixed frame length with a fixed frame overlap; a window function is applied and each frame passes through signal processing, yielding d-dimensional feature vectors X1 = (X1,1 … X1,d), …, Xn = (Xn,1 … Xn,d)]
Model Creation
• Once the features are extracted, a model can be created using various techniques, e.g. a Gaussian Mixture Model.
• Once the models are created, we can compute the distance from one model to another.
• Based on this distance, a decision can be inferred.
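A minimal sketch of GMM-based scoring (diagonal covariances; the function and any parameters you would pass it are illustrative, not from the slides):

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of feature vectors X (n x d) under
    a diagonal-covariance Gaussian mixture; higher means a closer match."""
    comps = []
    for w, mu, var in zip(weights, means, variances):
        diff = X - mu
        ll = -0.5 * (np.sum(diff**2 / var, axis=1)
                     + np.sum(np.log(2 * np.pi * var)))
        comps.append(np.log(w) + ll)
    L = np.vstack(comps)            # (components, frames)
    M = L.max(axis=0)               # log-sum-exp for numerical stability
    return float(np.mean(M + np.log(np.exp(L - M).sum(axis=0))))
```

In a verification setting one would compare this score under the claimed speaker's model against a background model and threshold the difference.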
A simultaneous speaker and speech recognition
• A system that models the speaker's phones as well as the speaker's features, and combines the two into one model, could perform very well.
• Maximum a posteriori (MAP) estimation is used to generate speaker-specific models from a set of speaker independent (SI) seed models.
• Assuming no prior knowledge about the speaker distribution, the a posteriori probability P(λ | x) is approximated by a score defined over the speaker-specific models and a world model.
• The constant in the previous equation was determined empirically to be 0.02.
• Using the Viterbi algorithm, the N most probable speakers P(x | λ) can be found.
• Results: the authors reported 0.7% EER, compared to 5.6% EER for a GMM-based system on the same 100-person dataset.
Speech and Speaker Combination
• Posteriori probabilities and Likelihoods Combination for Speech and Speaker Recognition
• Mohamed Faouzi BenZeghiba, Eurospeech 2003.
• The authors used a combined HMM/ANN (MLP) system for this work.
• For the speech features, 12 MFCC coefficients with energy, plus their first derivatives, were calculated every 10 ms over a 30 ms window.
System Description
Ŵ = argmax over W ∈ {W} of log P(W | X, θ)
Ŝ = argmax over s ∈ {S} of log P(X | θs)
W is a word from the finite word set {W}
S is a speaker from the finite set of registered speakers {S}
θ are the ANN parameters.
The probability that a speaker is accepted is
LLR(X) = log P(X | λs) − log P(X | λbg) ≥ Threshold
where LLR(X) is the log-likelihood ratio, λs is the speaker's GMM model, and λbg is the background model, whose parameters are derived using MAP adaptation and the world data set.
Combination
• Use of MLP adaptation: shifting the boundaries between the phone classes without strongly affecting the posterior probabilities of the speech sounds of other speakers.
• The authors proposed the following formula to combine the two systems:
(Ŵ, Ŝ) = argmax over (w, s) of [log P(W | X, θs) + log P(X | λs)]
• Using posteriors on the test set, it can be shown that
(Ŵ, Ŝ) = argmax over (w, s) of [α1 · log P(W | X, θs) + log P(X | λs)]
• The probability that a speaker is accepted is
α2 · log P(W | X, θs) + LLR(X) ≥ threshold
where α1 and α2 are determined from a posteriori probabilities on the test set.
HMM-Parameter Estimation
• Given an observation sequence O, determine the model parameters (A, B, π) that maximize P(O | λ), where λ = (A, B, π).
• Let pt(i, j) be the probability of being in state i at time t and state j at time t + 1:
  pt(i, j) = αt(i) · aij · bj(ot+1) · βt+1(j) / [Σ over i, Σ over j of αt(i) · aij · bj(ot+1) · βt+1(j)]
• γt(i) is the probability of being in state i at time t, then
  γt(i) = Σ from j = 1 to N of pt(i, j)
The re-estimation formulas are:
π̂i = expected frequency in state i at time t = 1 = γ1(i)
âij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
b̂j(k) = (expected number of times in state j observing symbol vk) / (expected number of times in state j)
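The quantities pt(i, j) and γt(i) above can be sketched via the forward-backward recursions (discrete observations; variable names are ours):

```python
import numpy as np

def xi_gamma(A, B, pi, obs):
    """Compute p_t(i,j) (often written xi) and gamma_t(i) for a
    discrete-observation HMM via the forward (alpha) and backward
    (beta) recursions."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        num = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] = num / num.sum()     # normalize by P(O | lambda)
    gamma = xi.sum(axis=2)          # gamma_t(i) = sum_j xi_t(i,j)
    return xi, gamma
```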
• Thank You