Evaluation of Speaker Recognition Algorithms
Speaker Recognition
• Speech recognition vs. speaker recognition
• Speaker recognition performance depends on the channel and on noise quality.
• Two sets of data are used: one to enroll the speaker and one to verify.
Data Collection and Processing
• MFCC extraction
• Test algorithms include:
  – AHS (Arithmetic Harmonic Sphericity)
  – Gaussian Divergence
  – Radial Basis Function
  – Linear Discriminant Analysis, etc.
Cepstrum
• The cepstrum is a common transform used to extract information from a speech signal; its x-axis is quefrency.
• It is used to separate the vocal-tract transfer function from the excitation signal:
X(ω) = G(ω)H(ω)
log|X(ω)| = log|G(ω)| + log|H(ω)|
F⁻¹{log|X(ω)|} = F⁻¹{log|G(ω)|} + F⁻¹{log|H(ω)|}
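As a minimal illustration of the relations above, the real cepstrum can be computed with NumPy (a sketch; the function name and the small epsilon guard are our own additions):

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse FFT of the log magnitude spectrum,
    i.e. F^-1{ log|X(w)| }."""
    spectrum = np.fft.fft(x)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # epsilon guards log(0)
    return np.fft.ifft(log_mag).real
```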
MFCC Extraction
• Short-time FFT
• Frame blocking and windowing
  Eg: first frame size = N samples; the second frame begins M samples later (M < N), giving an overlap of N − M samples, and so on…
• Window function: y(n) = x(n)w(n)
  Eg: Hamming window: w(n) = 0.54 − 0.46 cos(2πn/(N−1)), 0 ≤ n ≤ N−1
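The frame blocking and windowing steps can be sketched as follows (a helper of our own, assuming frame length N and hop M as described above):

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a signal into overlapping frames and apply a Hamming window.
    frame_len = N samples per frame; hop = M samples between frame starts,
    so consecutive frames overlap by N - M samples."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)  # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([x[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * window
```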
• Mel-Frequency Warping
  The mel frequency scale is approximately linear up to 1000 Hz and logarithmic above 1000 Hz:
  mel(f) = 2595 · log10(1 + f/700)
• Mel-spaced filter bank
MFCC
• Taking the cepstrum of the log mel spectrum (transforming back to time) gives the MFCCs.
• The MFCCs Cn are given by
  Cn = Σ from k = 1 to K of (log Sk) · cos[n(k − 1/2)π/K],  n = 1, 2, …
where Sk are the mel power spectrum coefficients and K is the number of mel filters.
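Putting the steps together, a minimal MFCC sketch (the filter count, coefficient count, and helper names are illustrative choices, not taken from the slides):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                 # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc_frame(frame, fs, n_filters=26, n_ceps=12):
    """MFCCs of one windowed frame: FFT -> mel filterbank -> log -> DCT."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2
    S = mel_filterbank(n_filters, n_fft, fs) @ power  # mel power spectrum S_k
    logS = np.log(S + 1e-12)
    n = np.arange(1, n_ceps + 1)[:, None]
    k = np.arange(1, n_filters + 1)[None, :]
    dct = np.cos(np.pi * n * (k - 0.5) / n_filters)   # C_n = sum_k logS_k cos(...)
    return dct @ logS
```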
Arithmetic Harmonic Sphericity
• A function of the eigenvalues of a test covariance matrix relative to a reference covariance matrix for speakers x and y, defined (following the usual AHS measure) by
  μ(x, y) = log[ tr(Cx · Cy⁻¹) · tr(Cy · Cx⁻¹) ] − 2 log D
where Cx and Cy are the covariance matrices and D is the dimensionality of the covariance matrix.
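A sketch of the sphericity measure as written above (assuming both covariance estimates are symmetric positive definite; the function name is ours):

```python
import numpy as np

def ahs(cov_x, cov_y):
    """Arithmetic-harmonic sphericity between two covariance matrices.
    Equals 0 when the matrices are identical and grows as they diverge."""
    d = cov_x.shape[0]
    t1 = np.trace(cov_x @ np.linalg.inv(cov_y))
    t2 = np.trace(cov_y @ np.linalg.inv(cov_x))
    return np.log(t1 * t2) - 2.0 * np.log(d)
```

By the Cauchy–Schwarz inequality the measure is non-negative, which is why it can serve as a distance-like score between speakers.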
Gaussian Divergence
• A mixture of Gaussian densities is used to model the distribution of each speaker's features.
YOHO Dataset
Sampling frequency: 8 kHz
Performance – AHS with 138 subjects and 24 MFCCs
Performance – Gaussian Div with 138 subjects and 24 MFCCs
Performance – AHS with 138 subjects and 12 MFCCs
Performance – Gaussian Div with 138 subjects and 12 MFCCs
Review of Probability and Statistics
• Probability Density Functions
Example 2:
[Figure: plot of f(x), with the region from a = 0.25 to b = 0.75 shaded]
f(x) = (3/2)(1 − x²) for 0 ≤ x ≤ 1, and f(x) = 0 otherwise
The probability that x is between 0.25 and 0.75 is
P(0.25 ≤ x ≤ 0.75) = ∫ from 0.25 to 0.75 of (3/2)(1 − x²) dx = (3/2)[x − x³/3] from 0.25 to 0.75 ≈ 0.547
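The arithmetic in this example can be double-checked directly from the antiderivative:

```python
# Check of the worked example: f(x) = (3/2)(1 - x^2) on [0, 1].
# The antiderivative of f is F(x) = (3/2)(x - x^3/3).
F = lambda x: 1.5 * (x - x**3 / 3.0)
p = F(0.75) - F(0.25)
print(round(p, 3))  # 0.547
```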
• Cumulative Distribution Functions
The cumulative distribution function (c.d.f.) F(x) for a continuous random variable X is:
F(x) = P(X ≤ x) = ∫ from −∞ to x of f(y) dy
Example: for f(x) = (3/2)(1 − x²) on 0 ≤ x ≤ 1 (0 otherwise), the c.d.f. of f(x) is
F(x) = ∫ from 0 to x of (3/2)(1 − y²) dy = (3/2)(x − x³/3) for 0 ≤ x ≤ 1
[Figure: plot of f(x), with the region up to b = 0.75 shaded]
• Expected Values and Variance
The expected (mean) value of a continuous random variable X with p.d.f. f(x) is:
E(X) = ∫ x f(x) dx
Example 1 (discrete):
E(X) = 2·0.05 + 3·0.10 + … + 9·0.05 = 5.35
[Figure: bar chart of the discrete probabilities 0.05, 0.10, 0.15, 0.20, 0.25, 0.15, 0.05, 0.05 over the values 1.0–9.0]
Example 2 (continuous): for f(x) = (3/2)(1 − x²) on 0 ≤ x ≤ 1 (0 otherwise),
E(X) = ∫ from 0 to 1 of x · (3/2)(1 − x²) dx = (3/2)[x²/2 − x⁴/4] from 0 to 1 = 3/8
• The Normal (Gaussian) Distribution
The p.d.f. of a normal distribution is
f(x; μ, σ) = (1 / (σ√(2π))) · e^(−(x−μ)² / (2σ²)),  −∞ < x < ∞
where μ is the mean and σ is the standard deviation.
[Figure: bell curve centered at μ with spread σ]
• The Normal Distribution
Virtually any p.d.f. can be approximated by summing N weighted Gaussians (a mixture of Gaussians).
[Figure: a p.d.f. approximated by six weighted Gaussians w1 … w6]
Review of Markov Models
A Markov Model (Markov Chain) is:
• similar to a finite-state automaton, with probabilities of transitioning from one state to another
[Figure: five-state chain S1–S5 with transition probabilities on the arcs]
• transitions from state to state occur at discrete time intervals
• the model can be in only one state at any given time
Transition Probabilities:
• no assumptions (full probabilistic description of the system):
  P[qt = j | qt−1 = i, qt−2 = k, …, q1 = m]
• usually use a first-order Markov Model: P[qt = j | qt−1 = i] = aij
• first-order assumption: transition probabilities depend only on the previous state
• aij obeys the usual rules:
  aij ≥ 0 for all i, j
  Σ from j = 1 to N of aij = 1 for all i
• sum of probabilities leaving a state = 1 (must leave a state)
[Figure: three-state chain S1–S3 with transition probabilities 0.5, 0.5, 0.3, 0.7 on the arcs]
Transition Probabilities:
• example:
  a11 = 0.0  a12 = 0.5  a13 = 0.5  a1,Exit = 0.0  (sum = 1.0)
  a21 = 0.0  a22 = 0.7  a23 = 0.3  a2,Exit = 0.0  (sum = 1.0)
  a31 = 0.0  a32 = 0.0  a33 = 0.0  a3,Exit = 1.0  (sum = 1.0)
Transition Probabilities:
• probability distribution function:
[Figure: chain S1–S3, with self-loop probability 0.4 on S2 and exit probability 0.6]
p(remain in state S2 exactly 1 time) = 0.4 · 0.6 = 0.240
p(remain in state S2 exactly 2 times) = 0.4 · 0.4 · 0.6 = 0.096
p(remain in state S2 exactly 3 times) = 0.4 · 0.4 · 0.4 · 0.6 = 0.038
= exponential decay (characteristic of Markov Models)
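The geometric decay above is easy to reproduce:

```python
# State-duration probabilities: with self-loop probability a = 0.4 and
# exit probability 1 - a = 0.6, the chance of remaining in S2 exactly
# d times decays exponentially in d.
a = 0.4
probs = [a**d * (1 - a) for d in range(1, 4)]
print([round(p, 3) for p in probs])  # [0.24, 0.096, 0.038]
```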
• Example 1: Single Fair Coin
[Figure: two states S1, S2 with all transition probabilities 0.5]
S1 corresponds to e1 = Heads: a11 = 0.5, a12 = 0.5
S2 corresponds to e2 = Tails: a21 = 0.5, a22 = 0.5
• Generate events: H T H H T H T T T H H
  corresponds to state sequence: S1 S2 S1 S1 S2 S1 S2 S2 S2 S1 S1
• Example 2: Weather
[Figure: three-state chain S1–S3 with transition probabilities on the arcs]
• S1 = event1 = rain
  S2 = event2 = clouds
  S3 = event3 = sun
  A = {aij} =
    0.70 0.25 0.05
    0.40 0.50 0.10
    0.20 0.70 0.10
  π1 = 0.5, π2 = 0.4, π3 = 0.1
• what is the probability of {rain, rain, rain, clouds, sun, clouds, rain}?
  Obs. = {r, r, r, c, s, c, r}
  S = {S1, S1, S1, S2, S3, S2, S1}
  time = {1, 2, 3, 4, 5, 6, 7} (days)
  P = P[S1] P[S1|S1] P[S1|S1] P[S2|S1] P[S3|S2] P[S2|S3] P[S1|S2]
    = 0.5 · 0.7 · 0.7 · 0.25 · 0.1 · 0.7 · 0.4
    = 0.001715
• Example 2: Weather (con't)
• what is the probability of {sun, sun, sun, rain, clouds, sun, sun}?
  Obs. = {s, s, s, r, c, s, s}
  S = {S3, S3, S3, S1, S2, S3, S3}
  time = {1, 2, 3, 4, 5, 6, 7} (days)
  P = P[S3] P[S3|S3] P[S3|S3] P[S1|S3] P[S2|S1] P[S3|S2] P[S3|S3]
    = 0.1 · 0.1 · 0.1 · 0.2 · 0.25 · 0.1 · 0.1
    = 5.0 × 10⁻⁷
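Both worked examples can be reproduced with a few lines, assuming the transition matrix A and initial probabilities π recovered from the example (rows ordered rain, clouds, sun):

```python
import numpy as np

# Weather example: A[i, j] = P[next state j | current state i],
# pi[i] = initial probability of state i.
A = np.array([[0.70, 0.25, 0.05],
              [0.40, 0.50, 0.10],
              [0.20, 0.70, 0.10]])
pi = np.array([0.5, 0.4, 0.1])

def sequence_prob(states):
    """Probability of a state sequence under a first-order Markov chain.
    states: 0-based indices, e.g. [0,0,0,1,2,1,0] for r,r,r,c,s,c,r."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]
    return p

print(round(sequence_prob([0, 0, 0, 1, 2, 1, 0]), 6))  # 0.001715
```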
Simultaneous speech and speaker recognition using hybrid architecture
– Dominique Genoud, Dan Ellis, Nelson Morgan
• The automatic recognition of the human voice is often divided into two parts:
  – speech recognition
  – speaker recognition
Traditional System
• A traditional state-of-the-art speaker recognition system can be divided into two parts:
  – Feature Extraction
  – Model Creation
Feature Extraction
[Figure: the signal is divided into frames 1 … N of a fixed frame length with a fixed frame overlap; a window function is applied and each frame passes through signal processing, yielding d-dimensional feature vectors X1 = (X1,1 … X1,d), …, Xn = (Xn,1 … Xn,d)]
Model Creation
• Once the features are extracted, a model can be created using various techniques, e.g. a Gaussian Mixture Model.
• Once the models are created, we can compute the distance from one model to another.
• Based on this distance, a decision can be inferred.
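A minimal sketch of GMM-based scoring (diagonal covariances; the function and any parameters you would pass it are illustrative, not from the slides):

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of feature vectors X (n x d) under
    a diagonal-covariance Gaussian mixture; higher means a closer match."""
    comps = []
    for w, mu, var in zip(weights, means, variances):
        diff = X - mu
        ll = -0.5 * (np.sum(diff**2 / var, axis=1)
                     + np.sum(np.log(2 * np.pi * var)))
        comps.append(np.log(w) + ll)
    L = np.vstack(comps)            # (components, frames)
    M = L.max(axis=0)               # log-sum-exp for numerical stability
    return float(np.mean(M + np.log(np.exp(L - M).sum(axis=0))))
```

In a verification setting one would compare this score under the claimed speaker's model against a background model and threshold the difference.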
A simultaneous speaker and speech recognition
• A system that models the speaker's phones as well as the speaker's features, and combines the two into one model, could perform very well.
• Maximum a posteriori (MAP) estimation is used to generate speaker-specific models from a set of speaker independent (SI) seed models.
• Assuming no prior knowledge about the speaker distribution, the a posteriori probability P(λ | x) is approximated by a score defined over the speaker-specific models and a world model.
• The constant in the previous equation was determined empirically to be 0.02.
• Using the Viterbi algorithm, the N most probable speakers P(x | λ) can be found.
• Results: the authors reported 0.7% EER, compared to 5.6% EER for a GMM-based system on the same 100-person dataset.
Speech and Speaker Combination
• Posteriori probabilities and Likelihoods Combination for Speech and Speaker Recognition
• Mohamed Faouzi BenZeghiba, Eurospeech 2003.
• The authors used a combined HMM/ANN (MLP) system for this work.
• For the speech features, 12 MFCC coefficients with energy, plus their first derivatives, were calculated every 10 ms over a 30 ms window.
System Description
Ŵ = argmax over W ∈ {W} of log P(W | X, θ)
Ŝ = argmax over s ∈ {S} of log P(X | θs)
W is a word from the finite word set {W}
S is a speaker from the finite set of registered speakers {S}
θ are the ANN parameters.
The probability that a speaker is accepted is
LLR(X) = log P(X | λs) − log P(X | λbg) ≥ Threshold
where LLR(X) is the log-likelihood ratio, λs is the speaker's GMM model, and λbg is the background model, whose parameters are derived using MAP adaptation and the world data set.
Combination
• Use of MLP adaptation: shifting the boundaries between the phone classes without strongly affecting the posterior probabilities of the speech sounds of other speakers.
• The authors proposed the following formula to combine the two systems:
(Ŵ, Ŝ) = argmax over (w, s) of [log P(W | X, θs) + log P(X | λs)]
• Using posteriors on the test set, it can be shown that
(Ŵ, Ŝ) = argmax over (w, s) of [α1 · log P(W | X, θs) + log P(X | λs)]
• The probability that a speaker is accepted is
α2 · log P(W | X, θs) + LLR(X) ≥ threshold
where α1 and α2 are determined from a posteriori probabilities on the test set.
HMM-Parameter Estimation
• Given an observation sequence O, determine the model parameters (A, B, π) that maximize P(O | λ), where λ = (A, B, π).
• Let pt(i, j) be the probability of being in state i at time t and state j at time t + 1:
  pt(i, j) = αt(i) · aij · bj(ot+1) · βt+1(j) / [Σ over i, Σ over j of αt(i) · aij · bj(ot+1) · βt+1(j)]
• γt(i) is the probability of being in state i at time t, then
  γt(i) = Σ from j = 1 to N of pt(i, j)
The re-estimation formulas are:
π̂i = expected frequency in state i at time t = 1 = γ1(i)
âij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
b̂j(k) = (expected number of times in state j observing symbol vk) / (expected number of times in state j)
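The quantities pt(i, j) and γt(i) above can be sketched via the forward-backward recursions (discrete observations; variable names are ours):

```python
import numpy as np

def xi_gamma(A, B, pi, obs):
    """Compute p_t(i,j) (often written xi) and gamma_t(i) for a
    discrete-observation HMM via the forward (alpha) and backward
    (beta) recursions."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        num = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] = num / num.sum()     # normalize by P(O | lambda)
    gamma = xi.sum(axis=2)          # gamma_t(i) = sum_j xi_t(i,j)
    return xi, gamma
```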
• Thank You