
SNR-Aware PLDA Modeling for Robust Speaker Verification

Department of Electronic and Information Engineering, The Hong Kong Polytechnic University

Sun Yat-sen University - Carnegie Mellon University Joint Research Institute (SYSU-CMU Joint Research Institute), Shunde, Guangdong

28 Dec. 2015

Man-Wai MAK [email protected]

http://www.eie.polyu.edu.hk/~mwmak

http://www.eie.polyu.edu.hk/~mwmak/papers/SYSU-CMU-2015.pdf


Contents

1.  I-Vector/PLDA for Speaker Verification
2.  SNR-Aware PLDA Modeling
    –  SNR-Invariant PLDA
    –  Mixture of PLDA

3.  Experiments on SRE12

4.  Conclusions


I-Vectors for Speaker Verification
•  State-of-the-art method for speaker verification
•  Factor analysis model:

$\vec{\mu}_s = \vec{\mu} + \mathbf{T}\mathbf{x}_s$

where $\vec{\mu}$ is the UBM supervector, $\mathbf{T}$ is the low-rank total variability matrix (61440×500), and $\mathbf{x}_s$ is the speaker-dependent i-vector.

•  Instead of using the high-dimensional supervector $\vec{\mu}_s$ to represent speaker s, we use the low-dimensional (typically 500) i-vector $\mathbf{x}_s$.

•  T is estimated by an EM algorithm using the utterances of many speakers. T represents the subspace in which the i-vectors vary.

•  Given T, estimate $\mathbf{x}_s$ for each target speaker and $\mathbf{x}_t$ for each test utterance.

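I-Vectors for Speaker Verification
•  Given an utterance, we align its acoustic vectors against a UBM to obtain the sufficient statistics.

•  The i-vector of the utterance is the posterior mean of the latent factor of the factor analysis model:

i-vector of utterance i:

$\langle \mathbf{x}_i \mid \mathcal{O}_i \rangle = \mathbf{L}_i^{-1}\mathbf{T}^{\mathsf T}\left(\boldsymbol{\Sigma}^{(b)}\right)^{-1}\tilde{\mathbf{f}}_i$

$\mathbf{L}_i^{-1} = \operatorname{cov}(\mathbf{x}_i, \mathbf{x}_i \mid \mathcal{O}_i) = \left(\mathbf{I} + \mathbf{T}^{\mathsf T}\left(\boldsymbol{\Sigma}^{(b)}\right)^{-1}\mathbf{N}_i\mathbf{T}\right)^{-1}$

[Figure: alignment of the acoustic vectors against the UBM]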

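I-Vectors for Speaker Verification

Aligning the acoustic vectors $\mathbf{o}_t$ with the UBM gives the sufficient statistics of utterance i:

$\mathbf{N}_i = \begin{bmatrix} n_{i,1}\mathbf{I} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & n_{i,2}\mathbf{I} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & n_{i,M}\mathbf{I} \end{bmatrix}, \qquad \tilde{\mathbf{f}}_i = \begin{bmatrix} \tilde{\mathbf{f}}_{i,1} \\ \vdots \\ \tilde{\mathbf{f}}_{i,M} \end{bmatrix}$

These statistics enter the i-vector equations:

$\langle \mathbf{x}_i \mid \mathcal{O}_i \rangle = \mathbf{L}_i^{-1}\mathbf{T}^{\mathsf T}\left(\boldsymbol{\Sigma}^{(b)}\right)^{-1}\tilde{\mathbf{f}}_i$

$\mathbf{L}_i^{-1} = \operatorname{cov}(\mathbf{x}_i, \mathbf{x}_i \mid \mathcal{O}_i) = \left(\mathbf{I} + \mathbf{T}^{\mathsf T}\left(\boldsymbol{\Sigma}^{(b)}\right)^{-1}\mathbf{N}_i\mathbf{T}\right)^{-1}$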
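The posterior-mean formula for the i-vector can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the UBM covariances are taken to be diagonal (so $\boldsymbol{\Sigma}^{(b)}$ is stored as a vector), the function name `extract_ivector` is hypothetical, and the toy dimensions are far smaller than the 61440×500 matrix quoted in the slides.

```python
import numpy as np

def extract_ivector(T, ubm_cov_diag, n, f_tilde):
    """Posterior mean and covariance of the latent factor for one utterance.

    T            : (M*F, D) total variability matrix
    ubm_cov_diag : (M*F,) diagonal of Sigma^(b) (assumption: diagonal UBM covariances)
    n            : (M,) zeroth-order statistics n_{i,c}, one per UBM mixture
    f_tilde      : (M*F,) stacked first-order statistics f~_{i,c}
    """
    M = n.shape[0]
    F = T.shape[0] // M          # acoustic feature dimension
    D = T.shape[1]               # i-vector dimension
    # N_i is block diagonal with blocks n_{i,c} * I, so only its diagonal is stored.
    N_diag = np.repeat(n, F)
    # T^T (Sigma^(b))^{-1}: scale each column of T^T by the inverse variance.
    Tt_Sinv = T.T / ubm_cov_diag
    # L_i = I + T^T (Sigma^(b))^{-1} N_i T
    L = np.eye(D) + (Tt_Sinv * N_diag) @ T
    post_cov = np.linalg.inv(L)
    # <x_i | O_i> = L_i^{-1} T^T (Sigma^(b))^{-1} f~_i
    post_mean = post_cov @ (Tt_Sinv @ f_tilde)
    return post_mean, post_cov
```

Storing $\mathbf{N}_i$ and $\boldsymbol{\Sigma}^{(b)}$ as vectors avoids building the full supervector-sized matrices, which matters at realistic dimensions.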


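[Figure: i-vector system. Training data are used to train the UBM and the total variability matrix T. The i-vector extractor maps the utterance from target speaker s to $\mathbf{x}_s$ and the test utterance t to $\mathbf{x}_t$. LDA+WCCN projects them to $\mathbf{W}^{\mathsf T}\mathbf{x}_s$ and $\mathbf{W}^{\mathsf T}\mathbf{x}_t$, which go to the scoring method. The decision maker accepts if the score ≥ θ and rejects if the score < θ.]

•  Given an utterance from speaker s and a total variability matrix T, we estimate his/her i-vector $\mathbf{x}_s$.

•  Because T defines the combined space describing both speaker variability and channel variability, we use LDA+WCCN to remove channel variability.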


[Figure: i-vectors before LDA ($\mathbf{x}$) and after LDA ($\mathbf{W}^{\mathsf T}\mathbf{x}$). Each point represents an utterance. Each marker type represents a speaker.]

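I-Vector Scoring

•  Given the i-vector of the target speaker and the i-vector of a test utterance, we compute the cosine-distance score:

$S_{\mathrm{CD}}(\mathbf{x}_s, \mathbf{x}_t) = \dfrac{\langle \mathbf{W}^{\mathsf T}\mathbf{x}_s, \mathbf{W}^{\mathsf T}\mathbf{x}_t \rangle}{\lVert \mathbf{W}^{\mathsf T}\mathbf{x}_s \rVert \, \lVert \mathbf{W}^{\mathsf T}\mathbf{x}_t \rVert}$

•  If the score is larger than a threshold θ, then we accept the speaker; otherwise we reject the speaker.

Because the score is the cosine of the angle between the two projected vectors, $S_{\mathrm{CD}}(\mathbf{x}_s, \mathbf{x}_t) \in [-1, 1]$.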
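The cosine-distance scoring rule can be illustrated with a small pure-Python sketch. The helper names `project`, `cosine_score` and `verify` are hypothetical, and W is assumed to be given as a list of rows; in a real system W would come from LDA+WCCN training.

```python
import math

def project(W, x):
    """Return W^T x, with W given as a list of rows (input_dim x output_dim)."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(len(W[0]))]

def cosine_score(W, x_s, x_t):
    """Cosine of the angle between the projected target and test i-vectors."""
    a, b = project(W, x_s), project(W, x_t)
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def verify(W, x_s, x_t, theta):
    """Accept (True) when the score reaches the threshold theta, else reject."""
    return cosine_score(W, x_s, x_t) >= theta
```

With an identity projection, identical vectors score 1, orthogonal vectors score 0, and opposite vectors score -1.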

Probabilistic LDA for SV
•  PLDA is based on a generative model that uses pre-processed i-vectors as input.
•  It aims to model the speaker and channel variability in the i-vector space.
•  The method assumes that there is a speaker subspace V within the i-vector space.
•  The i-vector $\mathbf{x}_s$ is written as:

$\mathbf{x}_s = \mathbf{m} + \mathbf{V}\mathbf{z}_s + \boldsymbol{\epsilon}_s$

where $\mathbf{x}_s$ is the i-vector extracted from the utterance of speaker s, $\mathbf{m}$ is the global mean of all i-vectors, $\mathbf{V}$ defines the speaker subspace, $\mathbf{z}_s$ is the speaker factor, and $\boldsymbol{\epsilon}_s$ is the residual noise with covariance $\boldsymbol{\Sigma}$.

Probabilistic LDA for SV
•  Similarly, the i-vector $\mathbf{x}_t$ from a test utterance is written as:

$\mathbf{x}_t = \mathbf{m} + \mathbf{V}\mathbf{z}_t + \boldsymbol{\epsilon}_t$

•  Intuitively, you may think of $\mathbf{z}_s$ and $\mathbf{z}_t$ as the projections onto the speaker subspace defined by the eigenvectors in V.

•  But unlike PCA, given an i-vector $\mathbf{x}_t$, there are infinitely many possible $\mathbf{z}_t$. So we need to consider the joint density of $\mathbf{x}_t$ and $\mathbf{z}_t$ when computing the likelihood of $\mathbf{x}_t$.

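PLDA Scoring

H0 (same speaker): both i-vectors share one speaker factor $\mathbf{z}$

$\mathbf{x}_s = \mathbf{m} + \mathbf{V}\mathbf{z} + \boldsymbol{\epsilon}_s, \qquad \mathbf{x}_t = \mathbf{m} + \mathbf{V}\mathbf{z} + \boldsymbol{\epsilon}_t$

against

H1 (different speakers): each i-vector has its own speaker factor

$\mathbf{x}_s = \mathbf{m} + \mathbf{V}\mathbf{z}_s + \boldsymbol{\epsilon}_s, \qquad \mathbf{x}_t = \mathbf{m} + \mathbf{V}\mathbf{z}_t + \boldsymbol{\epsilon}_t$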
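The two hypotheses can be turned into a log-likelihood-ratio score by integrating out the speaker factors. The sketch below assumes standard normal priors on the speaker factors, which the slides do not state explicitly; under that assumption each i-vector has covariance $\mathbf{V}\mathbf{V}^{\mathsf T} + \boldsymbol{\Sigma}$, and the two i-vectors are correlated through $\mathbf{V}\mathbf{V}^{\mathsf T}$ only under H0. The function names are hypothetical.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian evaluated at x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def plda_llr(x_s, x_t, m, V, Sigma):
    """log p(x_s, x_t | H0: same speaker) - log p(x_s, x_t | H1: different speakers)."""
    tot = V @ V.T + Sigma        # covariance of a single i-vector
    ac = V @ V.T                 # cross-covariance when the speaker factor is shared
    zeros = np.zeros_like(ac)
    x = np.concatenate([x_s, x_t])
    mean = np.concatenate([m, m])
    cov_same = np.block([[tot, ac], [ac, tot]])        # H0: shared z
    cov_diff = np.block([[tot, zeros], [zeros, tot]])  # H1: independent z_s, z_t
    return gaussian_logpdf(x, mean, cov_same) - gaussian_logpdf(x, mean, cov_diff)
```

Stacking the two i-vectors into one Gaussian is the simplest formulation; practical systems usually precompute the matrix inverses so that each trial reduces to a few quadratic forms.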

Conventional Noise Robust PLDA

•  In conventional multi-condition training, we pool i-vectors from various background noise levels to train m, V and Σ.

[Figure: i-vectors with 2 SNR ranges → EM algorithm → {m, V, Σ}]

Conventional Noise Robust PLDA
•  Conventional i-vector/PLDA systems use a channel space (with covariance Σ) to handle all SNR conditions.

[Figure: enrollment utterances → i-vector/PLDA scoring with {m, V, Σ} → PLDA scores]


SNR-Invariant PLDA

•  We argue that the variation caused by SNR variability can be modeled by an SNR subspace, and that utterances falling within a narrow SNR range should share the same SNR factor (Li & Mak, Interspeech15; Li & Mak, T-ASLP 15).

[Figure: SNR subspace showing three groups of utterances (Group 1, Group 2, Group 3) and three SNR factors (SNR Factor 1, SNR Factor 2, SNR Factor 3)]

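SNR-Invariant PLDA

•  Method of modeling SNR information

[Figure: i-vectors from clean, 15 dB and 6 dB utterances in the i-vector space, with the corresponding SNR factors $\mathbf{w}_{\mathrm{cln}}$, $\mathbf{w}_{15\mathrm{dB}}$ and $\mathbf{w}_{6\mathrm{dB}}$ in the SNR subspace]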

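SNR-invariant PLDA
•  PLDA:

$\mathbf{x}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \boldsymbol{\epsilon}_{ij}$

•  By adding an SNR factor to the conventional PLDA, we have SNR-invariant PLDA:

$\mathbf{x}_{ij}^{k} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{U}\mathbf{w}_k + \boldsymbol{\epsilon}_{ij}^{k}$

where U denotes the SNR subspace, $\mathbf{w}_k$ is an SNR factor, and $\mathbf{h}_i$ is the speaker (identity) factor for speaker i. Here i is the speaker index, j is the session index, and k is the SNR index.

•  Note that it is not the same as PLDA with channel subspace R:

$\mathbf{x}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{R}\mathbf{r}_{ij} + \boldsymbol{\epsilon}_{ij}$

The channel factor $\mathbf{r}_{ij}$ is specific to each session, whereas the SNR factor $\mathbf{w}_k$ is shared by all utterances in SNR group k.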

SNR-invariant PLDA
•  We separate i-vectors into different groups according to the SNR of their utterances.

[Figure: i-vectors grouped by SNR → EM algorithm → {m, V, U, Σ}]

Compared with Conventional PLDA

[Figure: conventional PLDA, $\mathbf{x}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \boldsymbol{\epsilon}_{ij}$, compared with SNR-invariant PLDA]

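PLDA vs SNR-invariant PLDA

Generative model
    PLDA:                $\mathbf{x}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \boldsymbol{\epsilon}_{ij}$
    SNR-invariant PLDA:  $\mathbf{x}_{ij}^{k} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{U}\mathbf{w}_k + \boldsymbol{\epsilon}_{ij}^{k}$

Marginal density
    PLDA:                $p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \mathbf{m}, \mathbf{V}\mathbf{V}^{\mathsf T} + \boldsymbol{\Sigma})$
    SNR-invariant PLDA:  $p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \mathbf{m}, \mathbf{V}\mathbf{V}^{\mathsf T} + \mathbf{U}\mathbf{U}^{\mathsf T} + \boldsymbol{\Sigma})$

Parameters
    PLDA:                $\boldsymbol{\theta} = \{\mathbf{m}, \mathbf{V}, \boldsymbol{\Sigma}\}$
    SNR-invariant PLDA:  $\boldsymbol{\theta} = \{\mathbf{m}, \mathbf{V}, \mathbf{U}, \boldsymbol{\Sigma}\}$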
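The two marginal covariances in the comparison follow directly from the generative models once the latent factors are integrated out. The sketch below assumes standard normal priors on $\mathbf{h}_i$ and $\mathbf{w}_k$ (an assumption, not stated in the slides) and uses hypothetical function names; `sample_snr_invariant` draws every i-vector with its own factors, which is enough to check the marginal covariance empirically.

```python
import numpy as np

def plda_cov(V, Sigma):
    """Marginal covariance of an i-vector under conventional PLDA."""
    return V @ V.T + Sigma

def snr_invariant_plda_cov(V, U, Sigma):
    """Marginal covariance under SNR-invariant PLDA: the SNR subspace adds U U^T."""
    return V @ V.T + U @ U.T + Sigma

def sample_snr_invariant(m, V, U, Sigma, n, rng):
    """Draw n i-vectors x = m + V h + U w + eps, each with independent factors."""
    h = rng.standard_normal((n, V.shape[1]))    # speaker factors
    w = rng.standard_normal((n, U.shape[1]))    # SNR factors
    eps = rng.multivariate_normal(np.zeros(len(m)), Sigma, size=n)
    return m + h @ V.T + w @ U.T + eps
```

The only difference between the two covariances is the term $\mathbf{U}\mathbf{U}^{\mathsf T}$, which is the share of the total variability attributed to SNR.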

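PLDA vs SNR-invariant PLDA

E-Step (PLDA), where $H_i$ is the number of sessions of speaker i:

$\langle \mathbf{h}_i \mid \mathcal{X} \rangle = \mathbf{L}_i^{-1}\mathbf{V}^{\mathsf T}\boldsymbol{\Sigma}^{-1}\sum_{j=1}^{H_i}(\mathbf{x}_{ij} - \mathbf{m})$

$\langle \mathbf{h}_i\mathbf{h}_i^{\mathsf T} \mid \mathcal{X} \rangle = \mathbf{L}_i^{-1} + \langle \mathbf{h}_i \mid \mathcal{X} \rangle\langle \mathbf{h}_i \mid \mathcal{X} \rangle^{\mathsf T}$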
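The PLDA E-step can be sketched as follows. The slide text does not preserve the definition of $\mathbf{L}_i$; the code assumes the usual Gaussian PLDA form $\mathbf{L}_i = \mathbf{I} + H_i\mathbf{V}^{\mathsf T}\boldsymbol{\Sigma}^{-1}\mathbf{V}$, which follows from a standard normal prior on $\mathbf{h}_i$. The function name `plda_e_step` is hypothetical.

```python
import numpy as np

def plda_e_step(X_i, m, V, Sigma):
    """Posterior moments of the speaker factor h_i given one speaker's i-vectors.

    X_i : (H_i, D) array holding the H_i i-vectors of speaker i
    """
    H_i = X_i.shape[0]
    Sinv_V = np.linalg.solve(Sigma, V)                 # Sigma^{-1} V
    # Assumed posterior precision: L_i = I + H_i V^T Sigma^{-1} V
    L_i = np.eye(V.shape[1]) + H_i * (V.T @ Sinv_V)
    L_inv = np.linalg.inv(L_i)
    # <h_i | X> = L_i^{-1} V^T Sigma^{-1} sum_j (x_ij - m)
    h_mean = L_inv @ Sinv_V.T @ (X_i - m).sum(axis=0)
    # <h_i h_i^T | X> = L_i^{-1} + <h_i | X><h_i | X>^T
    h_second = L_inv + np.outer(h_mean, h_mean)
    return h_mean, h_second
```

The posterior covariance $\mathbf{L}_i^{-1}$ shrinks as $H_i$ grows, so speakers with more sessions get sharper estimates of their speaker factor.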

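PLDA versus SNR-invariant PLDA

M-Step (PLDA):

$\mathbf{V} = \left[\sum_{ij}(\mathbf{x}_{ij} - \mathbf{m})\langle \mathbf{h}_i \mid \mathcal{X} \rangle^{\mathsf T}\right]\left[\sum_{ij}\langle \mathbf{h}_i\mathbf{h}_i^{\mathsf T} \mid \mathcal{X} \rangle\right]^{-1}$

$\boldsymbol{\Sigma} = \dfrac{\sum_{ij}\left[(\mathbf{x}_{ij} - \mathbf{m})(\mathbf{x}_{ij} - \mathbf{m})^{\mathsf T} - \mathbf{V}\langle \mathbf{h}_i \mid \mathcal{X} \rangle(\mathbf{x}_{ij} - \mathbf{m})^{\mathsf T}\right]}{\sum_i H_i}$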
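One EM update of V and Σ for conventional PLDA might be sketched as below. The block recomputes the posterior moments internally so that it is self-contained, again assuming $\mathbf{L}_i = \mathbf{I} + H_i\mathbf{V}^{\mathsf T}\boldsymbol{\Sigma}^{-1}\mathbf{V}$; the mean m is treated as given, and the newly updated V is used in the Σ update, which is a common choice rather than something the slides specify. The function name is hypothetical.

```python
import numpy as np

def plda_m_step(speakers, m, V, Sigma):
    """One EM update of V and Sigma for conventional PLDA.

    speakers : list of (H_i, D) arrays, one array of i-vectors per speaker
    """
    D, Q = V.shape
    Sinv_V = np.linalg.solve(Sigma, V)          # Sigma^{-1} V
    num = np.zeros((D, Q))                      # sum_ij (x_ij - m) <h_i|X>^T
    den = np.zeros((Q, Q))                      # sum_ij <h_i h_i^T|X>
    scatter = np.zeros((D, D))                  # sum_ij (x_ij - m)(x_ij - m)^T
    total = 0                                   # sum_i H_i
    for X_i in speakers:
        H_i = X_i.shape[0]
        C = X_i - m
        # E-step for this speaker (assumed form of L_i).
        L_inv = np.linalg.inv(np.eye(Q) + H_i * (V.T @ Sinv_V))
        h = L_inv @ Sinv_V.T @ C.sum(axis=0)
        hh = L_inv + np.outer(h, h)
        num += np.outer(C.sum(axis=0), h)
        den += H_i * hh
        scatter += C.T @ C
        total += H_i
    V_new = num @ np.linalg.inv(den)
    # sum_ij V <h_i|X> (x_ij - m)^T equals V_new @ num.T
    Sigma_new = (scatter - V_new @ num.T) / total
    return V_new, Sigma_new
```

In practice the update is iterated until the likelihood stops improving, starting from a random V and a Σ such as the identity.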

SNR-invariant PLDA Score


Mixture of PLDA (mPLDA)
•  Conventional i-vector/PLDA systems use a single PLDA model to handle all SNR conditions.

[Figure: enrollment i-vectors → PLDA model {m, V, Σ} → PLDA scores]

Mixture of PLDA (mPLDA)

•  We argue that a PLDA model should focus on a small range of SNR.

[Figure: PLDA Model 1, PLDA Model 2 and PLDA Model 3, each producing its own PLDA score]

Mixture of PLDA (mPLDA)

•  The full spectrum of SNRs is handled by a mixture of PLDA in which the posteriors of the indicator variables depend on the utterance's SNR (Mak, Interspeech14; Mak et al., T-ASLP 16).

[Figure: an SNR estimator feeds an SNR posterior estimator, whose outputs combine PLDA Model 1, PLDA Model 2 and PLDA Model 3 into a single PLDA score]

M.W. Mak, X.M. Pang and J.T. Chien, "Mixture of PLDA for Noise Robust I-Vector Speaker Verification", IEEE/ACM Trans. on Audio Speech and Language Processing, vol. 24, No. 1, pp. 130-142, Jan. 2016.

Motivation of mPLDA
•  The idea of mPLDA is based on two hypotheses:

1.  Different levels of background noise cause the i-vectors to fall in different regions of the i-vector space.

2.  SNR variability negatively affects PLDA speaker recognition accuracy, but its effect can be mitigated by explicitly modelling the SNR-dependent speaker subspaces through a mixture of PLDA.

Motivation of mPLDA
•  To verify these two hypotheses, we corrupted 7,156 clean telephone utterances from 763 speakers with babble noise at 6dB and 15dB using the FaNT tool.

•  This results in 3 sets of i-vectors: clean, 15dB, and 6dB.
•  Then, a GMM is constructed as shown below.

[Figure: clean speech and its FaNT-corrupted 15dB and 6dB versions each go through i-vector extraction, and the mean and covariance of each set are computed. The three mean/covariance pairs, with equal weights of 1/3, form a 3-component GMM.]

Motivation of mPLDA
•  We used partition coefficients (PC) and partition entropy coefficients (PE) to quantify the cluster separability of the three groups of i-vectors.

PC → 1 and PE → 0 mean that the clusters are well separated.
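The slide text does not preserve the formulas for PC and PE. The sketch below assumes Bezdek's standard definitions, computed from the posterior probabilities that the 3-component GMM assigns to each i-vector: PC is the mean of the squared posteriors summed over components, and PE is the mean entropy of the posteriors. The function names are hypothetical.

```python
import math

def partition_coefficient(posteriors):
    """PC = (1/N) * sum_n sum_k gamma_nk^2, for posteriors given as N rows of K values."""
    n = len(posteriors)
    return sum(g * g for row in posteriors for g in row) / n

def partition_entropy(posteriors):
    """PE = -(1/N) * sum_n sum_k gamma_nk * log(gamma_nk), skipping zero posteriors."""
    n = len(posteriors)
    return -sum(g * math.log(g) for row in posteriors for g in row if g > 0) / n
```

When every i-vector is assigned to one component with probability 1, PC equals 1 and PE equals 0; when the posteriors are uniform over K components, PC drops to 1/K and PE rises to log K.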

Motivation of mPLDA
•  To verify the 2nd hypothesis, we perform speaker identification (SID) experiments under SNR-match and SNR-mismatch conditions.

•  There are 9 combinations of PLDA models and SNR groups, of which three are matched in training and test conditions and six are mismatched.

•  The SID accuracy gradually decreases as the SNR of the training data progressively deviates from that of the test data.

mPLDA: Model Parameters

•  Model parameters: one set for modeling the SNR of utterances and one set for modeling the SNR-dependent i-vectors.

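Graphical Model of mPLDA

[Figure: graphical model of mPLDA, with one part for modeling the SNR of utterances and one part for modeling the SNR-dependent i-vectors]

$\ell_{ij}$: SNR of the j-th utterance from the i-th speaker

$\mathbf{x}_{ij}$: i-vector of the j-th utterance from the i-th speaker

$\mathbf{V} = \{\mathbf{V}_k\}_{k=1}^{K}$

$\boldsymbol{\pi} = \{\pi_k\}_{k=1}^{K}$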

Graphical Model: PLDA vs. mPLDA

[Figure: graphical models of PLDA and mPLDA side by side]

$\ell_{ij}$: SNR of the j-th utterance from the i-th speaker

Generative Model for mPLDA

where the posterior prob of SNR is

[Figure: posterior of SNR plotted against the SNR in dB]

PLDA vs. mPLDA

PLDA Mixture of PLDA

Generative Model

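EM: PLDA vs. mPLDA Auxiliary Function

[Equations: auxiliary functions of PLDA and of mixture of PLDA. The annotated quantities are the latent indicator variables, the latent speaker factors, the SNR of the training utterances, the speaker indexes, the session indexes, and the number of mixtures.]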

EM: PLDA vs. mPLDA E-Step

EM: PLDA vs. mPLDA M-Step

Likelihood-Ratio Scores of mPLDA
•  Same-speaker likelihood: computed from the i-vectors of the target and test speakers and the SNR of the target and test utterances.

•  Different-speaker likelihood: computed from the same quantities.

•  Verification score = same-speaker likelihood / different-speaker likelihood

For full derivation, see http://bioinfo.eie.polyu.edu.hk/mPLDA/SuppMaterials.pdf

Complexity Analysis

[Table: complexity analysis, expressed in terms of quantities that include the dimension of i-vectors]

Types of mPLDA
•  The mixture of PLDA models can be of two types:

1.  SNR-independent mPLDA (SI-mPLDA)
2.  SNR-dependent mPLDA (SD-mPLDA)

Types of mPLDA
•  SNR-independent mPLDA is the supervised version of Hinton's mixture of factor analyzers, where the supervision comes from the speaker labels.

•  Equivalent to clustering in i-vector space with the subspaces $\mathbf{V}_k$ of clusters determined by PLDA.

•  No guidance from SNR information.

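SI-mPLDA vs. SD-mPLDA

•  SNR-independent mPLDA:

$p(\mathbf{x}) = \sum_{k=1}^{K} \rho_k \, \mathcal{N}(\mathbf{x} \mid \mathbf{m}_k, \mathbf{V}_k\mathbf{V}_k^{\mathsf T} + \boldsymbol{\Sigma}_k)$

The mixture weights $\rho_k$ are independent of the SNR of utterances.

•  SNR-dependent mPLDA: the mixture weights are replaced by the posterior probabilities of the SNR, obtained from a 1-D GMM.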
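The SNR-dependent weighting can be illustrated with a pure-Python sketch that evaluates the posterior of each mixture component given an utterance's SNR under a 1-D GMM. The function name and the GMM parameters used in the usage note are hypothetical; they are not taken from the slides.

```python
import math

def snr_posteriors(snr_db, weights, means, stds):
    """Posterior probability of each 1-D GMM component given an SNR value in dB."""
    likes = [
        w * math.exp(-0.5 * ((snr_db - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, mu, s in zip(weights, means, stds)
    ]
    total = sum(likes)
    # Normalise so that the posteriors sum to one (Bayes' rule).
    return [l / total for l in likes]
```

For example, with three equally weighted components centred at 6, 15 and 30 dB (illustrative values only), an utterance at 6 dB is assigned mostly to the first component, so the first PLDA model dominates its score. In the SNR-independent variant the weights would stay fixed regardless of the SNR.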

Cluster Alignment in mPLDA

[Figure: cluster alignment for SNR-independent mPLDA and SNR-dependent mPLDA]

In SD-mPLDA, i-vectors that are aligned to the same mixture component have similar SNR.

SNR-dependent vs. SNR-independent

[Figure: performance on CC4 of NIST12 (male) for PLDA, SNR-independent mPLDA and SNR-dependent mPLDA]


Data and Features
•  Evaluation dataset: Common evaluation conditions 1 and 4 of the NIST SRE 2012 core set.
•  Parameterization: 19 MFCCs together with energy plus their 1st and 2nd derivatives → 60-dim
•  UBM: gender-dependent, 1024 mixtures
•  Total variability matrix: gender-dependent, 500 total factors
•  I-vector preprocessing:
    –  Whitening by WCCN, then length normalization
    –  For SI-PLDA, followed by NFA (500-dim → 200-dim) + WCCN
    –  For mPLDA, followed by LDA (500-dim → 200-dim) + WCCN
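The first preprocessing step, WCCN whitening followed by length normalization, can be sketched as below. This is a minimal illustration assuming speaker labels are available for the training i-vectors and that the within-class covariance is invertible; the function names are hypothetical and the LDA/NFA dimension reduction is not shown.

```python
import numpy as np

def wccn_matrix(X, labels):
    """Return B with B B^T = W^{-1}, where W is the average within-speaker covariance.

    X      : (N, D) training i-vectors
    labels : (N,) speaker label of each i-vector
    """
    classes = np.unique(labels)
    D = X.shape[1]
    W = np.zeros((D, D))
    for c in classes:
        Xc = X[labels == c]
        Xc = Xc - Xc.mean(axis=0)          # centre on the speaker mean
        W += Xc.T @ Xc / len(Xc)           # covariance of this speaker
    W /= len(classes)
    return np.linalg.cholesky(np.linalg.inv(W))

def length_normalize(X):
    """Scale every i-vector (row) to unit Euclidean norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def preprocess(X, B):
    """Apply WCCN (rows become B^T x) and then length normalization."""
    return length_normalize(X @ B)
```

After the WCCN projection the average within-speaker covariance becomes the identity, and length normalization then places all i-vectors on the unit sphere.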

Distribution of SNR in SRE12

Each SNR region is handled by a specific set of SNR factors.

Finding SNR Groups

[Figure: SNR distribution of the training utterances]

SNR Distributions
•  SNR distribution of training and test utterances in CC4

[Figure: SNR histograms of the test utterances and the training utterances]

Performance on SRE12

Method                K    Q    Male EER(%)   Male minDCF   Female EER(%)   Female minDCF
PLDA                  -    -    5.42          0.371         7.53            0.531
SD-mPLDA              -    -    5.28          0.415         7.70            0.539
SNR-Invariant PLDA    3    40   5.42          0.382         6.93            0.528
SNR-Invariant PLDA    5    40   5.28          0.381         6.89            0.522
SNR-Invariant PLDA    6    40   5.29          0.388         6.90            0.536
SNR-Invariant PLDA    8    30   5.56          0.384         7.05            0.545

K: no. of SNR groups. Q: no. of SNR factors (dimension of $\mathbf{w}_k$).

CC1


Method                K    Q    Male EER(%)   Male minDCF   Female EER(%)   Female minDCF
PLDA                  -    -    2.40          0.332         2.19            0.335
SD-mPLDA              -    -    2.47          0.283         2.07            0.328
SNR-Invariant PLDA    3    40   1.96          0.277         1.74            0.290
SNR-Invariant PLDA    6    40   1.99          0.278         1.72            0.290

K: no. of SNR groups. Q: no. of SNR factors (dimension of $\mathbf{w}_k$).

CC2


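Method                K    Q    Male EER(%)   Male minDCF   Female EER(%)   Female minDCF
PLDA                  -    -    3.13          0.312         2.82            0.341
SD-mPLDA              -    -    2.88          0.329         2.71            0.332
SNR-Invariant PLDA    3    40   2.72          0.289         2.36            0.314
SNR-Invariant PLDA    5    40   2.67          0.291         2.38            0.322
SNR-Invariant PLDA    6    40   2.63          0.287         2.43            0.319
SNR-Invariant PLDA    8    30   2.70          0.292         2.29            0.313

K: no. of SNR groups. Q: no. of SNR factors (dimension of $\mathbf{w}_k$).

CC4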


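Method                K    Q    Male EER(%)   Male minDCF   Female EER(%)   Female minDCF
PLDA                  -    -    2.86          0.286         2.47            0.343
SD-mPLDA              -    -    2.86          0.295         2.59            0.332
SNR-Invariant PLDA    3    40   2.47          0.273         2.07            0.294
SNR-Invariant PLDA    6    40   2.48          0.275         2.04            0.294

K: no. of SNR groups. Q: no. of SNR factors (dimension of $\mathbf{w}_k$).

CC5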


[Figure: CC4, female. Conventional PLDA compared with SNR-invariant PLDA.]

Conclusions

•  We show that while i-vectors of different SNR fall in different regions of the i-vector space, they vary within a single cluster in an SNR subspace.

•  Therefore, it is possible to model the SNR variability by adding an SNR loading matrix and SNR factors to the conventional PLDA model.

•  We also show that i-vectors derived from utterances of different SNR live in different speaker subspaces.

•  Therefore, it is possible to model SNR variability by a mixture of SNR-dependent PLDA models.

Bibliography
1.  M.W. Mak, X.M. Pang and J.T. Chien, "Mixture of PLDA for Noise Robust I-Vector Speaker Verification", IEEE/ACM Trans. on Audio, Speech and Language Processing, vol. 24, no. 1, pp. 130-142, Jan. 2016.

2.  N. Li and M.W. Mak, "SNR-Invariant PLDA Modeling in Nonparametric Subspace for Robust Speaker Verification", IEEE/ACM Trans. on Audio, Speech and Language Processing, vol. 23, no. 10, pp. 1648-1659, Oct. 2015.

3.  W. Rao and M.W. Mak, "Boosting the Performance of I-Vector Based Speaker Verification via Utterance Partitioning", IEEE Trans. on Audio, Speech and Language Processing, vol. 21, no. 5, pp. 1012-1022, May 2013.

4.  N. Li and M.W. Mak, "SNR-Invariant PLDA with Multiple Speaker Subspaces", ICASSP'16, March 2016.

5.  X.M. Pang and M.W. Mak, "Noise Robust Speaker Verification via the Fusion of SNR-Independent and SNR-Dependent PLDA", International Journal of Speech Technology, Oct. 2015.

6.  M.W. Mak, "Fast Scoring for Mixture of PLDA in I-Vector/PLDA Speaker Verification", Proc. APSIPA'15, pp. 587-593, Dec. 2015, Hong Kong.

7.  M.W. Mak and H.B. Yu, "A Study of Voice Activity Detection Techniques for NIST Speaker Recognition Evaluations", Computer Speech & Language, vol. 28, no. 1, pp. 295-313, Jan. 2014.

8.  N. Li and M.W. Mak, "SNR-Invariant PLDA Modeling for Robust Speaker Verification", Interspeech'15, Sept. 2015, Dresden, Germany, pp. 2317-2321.

9.  P. Kenny, "Bayesian speaker verification with heavy-tailed priors", in Proc. of Odyssey: Speaker and Language Recognition Workshop, Brno, Czech Republic, June 2010.

10.  N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification", IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788-798, May 2011.

Acknowledgment

Xiaomin Pang, Zhili Tan, Shibiao Wan, Wei Rao, Na Li
