NIST SRE 2008 Workshop
Loquendo - Politecnico di Torino (LPT) Site Presentation
Daniele Colibro, Claudio Vair (Loquendo)
Fabio Castaldo, Emanuele Dalmasso, Pietro Laface (Dipartimento di Automatica e Informatica, Politecnico di Torino)
June 17, 2008
NIST SRE 2008 Workshop: 17-18 June
Outline
Main goals
System description, key features
Training approach
Development history
Summed condition tests
Unsupervised adaptation tests
Our main goals
Improve performance on short durations
Improve performance on mismatched conditions
Deal with the new interview data
Perform all SRE 2008 tests
System description
Two GMM systems were used for this evaluation:
Phonetic GMM (PGMM) with 1408 (128x11) Gaussians
GMM with 512/1024/2048 Gaussians
GMM and PGMM main features
Feature Domain Intersession Compensation (FDIC) in training and testing
Speaker factors + Relevance MAP in training
Standard log-likelihood computation in test
No discriminative models (e.g. SVM)
FoCal toolkit used for fusion, calibration, and log-likelihood computation, with a prior-weighted logistic regression objective
Key features: all conditions
Extended training set for gender-conditioned UBMs, intersession and eigen-speaker modeling
Extended data set for ZNorm and TNorm
Intersession compensation using condition-dependent subspaces
Key features: short duration
Speaker factors
Mismatched conditions:
Intersession subspace estimated on more data
Speaker factors
Parameter tuning (number of Gaussians and number of MFCC parameters)
Interviews:
Development data for channel compensation
Acoustic features
Standard MFCC parameters:
GMM-25: 12 cepstra (c1-c12) + 13 delta (Δc0-Δc12)
GMM-43: 18 cepstra (c1-c18) + 19 delta (Δc0-Δc18) + 6 double delta (ΔΔc0-ΔΔc5)
PGMM-36: 18 cepstra (c1-c18) + 18 delta (Δc1-Δc18)
All systems perform feature warping to a Gaussian distribution on a 3 s sliding window, excluding silence frames
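As an illustration, the sliding-window warping can be sketched in a few lines of Python (a hypothetical helper, not the evaluation code; 300 frames approximates 3 s at a 10 ms frame shift, and `statistics.NormalDist` supplies the inverse Gaussian CDF):

```python
import numpy as np
from statistics import NormalDist

def feature_warp(feats, win=300):
    """Warp each feature stream to a standard normal distribution over a
    sliding window. feats: (n_frames, n_dims) array of speech-only features."""
    inv_cdf = NormalDist().inv_cdf
    n, d = feats.shape
    half = win // 2
    warped = np.empty_like(feats, dtype=float)
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        window = feats[lo:hi]
        m = window.shape[0]
        for k in range(d):
            # rank of the centre frame within the window, mapped to the
            # corresponding quantile of a standard Gaussian
            rank = int(np.sum(window[:, k] < feats[t, k])) + 1
            warped[t, k] = inv_cdf((rank - 0.5) / m)
    return warped
```

Because the mapping depends only on ranks inside the window, it removes slowly varying channel effects regardless of their scale or offset.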
Training data
No use of SRE06 for training
Gender-dependent UBMs: SRE04 + SRE05
Intersession compensation eigen-matrix U: SRE04 + SRE05 + interview development data
Speaker factors eigen-matrix V: SRE99 + SRE00 + SRE03 + SRE05 + 1029 females and 828 males randomly selected from the Fisher English Training Speech Part 1/2 among the speakers contributing at least 3 utterances
Overall: 2079 female and 1634 male speakers
Intersession compensation
Intersession compensation can be done:
In the model domain
In the feature domain
Directly during the test, adjusting the likelihood computation
No relevant performance differences in our tests
Matrix U training
Speaker model training by Relevance MAP
Intersession compensation matrix U computed using the differences between models of the same speaker.
Interview: differences with respect to the interviewee near-field microphone (channel 2)
Matrix U computed using EM-PCA
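A minimal sketch of this procedure, with plain SVD-based PCA standing in for the EM-PCA the authors use, and all names illustrative:

```python
import numpy as np

def train_U(supervectors, speaker_ids, n_eigen=40):
    """Estimate an intersession subspace U from supervectors of the same
    speaker recorded in different sessions.
    supervectors: (n_models, dim); speaker_ids: per-model speaker labels."""
    diffs = []
    ids = np.asarray(speaker_ids)
    for spk in np.unique(ids):
        models = supervectors[ids == spk]
        # differences between each session model and the speaker's mean
        # model: speaker identity cancels, session variability remains
        diffs.append(models - models.mean(axis=0))
    D = np.vstack(diffs)
    # principal directions of within-speaker (session) variability
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    return Vt[:n_eigen].T  # columns = intersession eigenvectors
```

EM-PCA is preferred in practice because the supervector dimension is far too large for an explicit covariance matrix; the subspace it converges to is the same.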
Speaker modeling
The speaker models are trained on Feature Domain Intersession Compensation (FDIC) features by a sequence of:
1. Eigen-speaker MAP modeling: s′ = m_UBM + V y
2. Relevance MAP adaptation, s′′ = s′ + D z, using the Gaussian occupation statistics computed on s′ (relevance factor = 16)
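The relevance MAP step (step 2, relevance factor 16) can be sketched for the mean of a single Gaussian; this is the generic relevance-MAP update, not the authors' code, and the names are illustrative:

```python
import numpy as np

def relevance_map(prior_mean, frames, posteriors, relevance=16.0):
    """Relevance-MAP adaptation of one Gaussian mean.
    prior_mean: (d,) prior mean (here a component of s');
    frames: (n, d) features; posteriors: (n,) occupation probabilities."""
    n_g = posteriors.sum()            # soft frame count for this Gaussian
    if n_g == 0:
        return prior_mean.copy()      # no data: keep the prior mean
    ml_mean = (posteriors[:, None] * frames).sum(axis=0) / n_g
    # interpolation weight grows with the amount of adaptation data
    alpha = n_g / (n_g + relevance)
    return alpha * ml_mean + (1.0 - alpha) * prior_mean
```

With relevance 16, a Gaussian needs roughly 16 frames of soft count before the data and the prior contribute equally, which keeps sparsely observed Gaussians close to s′.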
Matrix V training
Relevance MAP training of the speaker models, using at least 3 utterances per speaker
The eigen-speaker matrix V computed using EM-PCA, or Maximum Likelihood + Minimum Divergence training
Interaction between U and V matrices
The matrices U and V can be trained using different approaches:
1. U′ first, then V′ using FDIC features depending on U′
2. U′′ and V′′ independently
3. V′′′ first, then U′′′ using models depending on V′′′
In our results, approaches 1 and 2 are equivalent, whereas 3 gives worse performance.
U and V are “orthogonal”
Different U can be used with the same V without accuracy loss
Score Normalization
We performed ZT-normalization as we did in SRE06.
For the trials involving phone-call tests, our ZNorm set includes 1252 female and 1103 male files taken from the SRE04/05 phone calls.
For trials involving microphone and interview speech, the ZNorm set includes 204 female and 180 male additional files taken from the SRE05 microphone subset.
The same sets, selected according to the training conditions, were used for training the TNorm models.
For the 3conv/8conv training conditions, the ZNorm set is the same used in SRE06, which includes 80 speakers with 3/8 sides from SRE04.
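A sketch of ZT-norm scoring under these cohort definitions (hypothetical helper names; the cohorts in the example are tiny, for illustration only):

```python
import numpy as np

def zt_norm(score, z_cohort, t_cohort, zt_cohort):
    """ZT-norm a single trial score.
    score:     raw score of the trial (model vs. test segment)
    z_cohort:  scores of the model against the Z-norm impostor segments
    t_cohort:  scores of the T-norm impostor models against the test segment
    zt_cohort: scores of each T-norm model against the Z-norm segments
               (one row per T-norm model), used to z-normalise t_cohort."""
    z = lambda s, ref: (s - np.mean(ref)) / np.std(ref)
    score_z = z(score, z_cohort)                       # Z-norm the trial
    t_cohort_z = np.array([z(t, row)                   # Z-norm the T cohort
                           for t, row in zip(t_cohort, zt_cohort)])
    return z(score_z, t_cohort_z)                      # then T-norm
```

Z-norm removes model-dependent score offsets, T-norm removes segment-dependent ones; applying both makes a single decision threshold usable across conditions.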
Development history: 1conv4w-1conv4w
SRE2006 - 1conv4w-1conv4w - All Trials
[Bar chart: EER (%) and Min DCF for successive system configurations, from GMM SRE06 through GMM-25-512 (MAP, U40TelMic, V300+D16 NormExt) and GMM-43-1024 (MAP, U60Tel, U60TelMic, V300+D16 NormExt).]
EER (%): 5.88, 5.57, 5.23, 5.01, 5.66, 4.62, 4.59, 4.28
Min DCF: 0.278, 0.278, 0.264, 0.257, 0.272, 0.243, 0.239, 0.221
Overall improvement: EER -28%, DCF -20%
Development history: 1conv4w-1convmic
SRE2006 - 1conv4w-1convmic - All Trials
[Bar chart: EER (%) and Min DCF for successive system configurations, from GMM SRE06 through GMM-25-512 (MAP, U40TelMic, V300+D16 NormExt) and GMM-43-1024 (MAP, U60TelMic, V300+D16 NormExt).]
EER (%): 6.42, 4.72, 3.73, 3.43, 4.68, 4.48, 3.67
Min DCF: 0.270, 0.198, 0.179, 0.149, 0.217, 0.216, 0.191
Overall improvement: EER -46%, DCF -45%
Development history: 10sec-10sec
SRE2006 - 10sec4w-10sec4w - All Trials
System                        EER (%)   Min DCF
GMM SRE06                     24.24     0.884
GMM-25-512  V300 U40TelMic    21.32     0.801
GMM-25-1024 V300 U40TelMic    19.63     0.772
GMM-43-1024 V300 U60TelMic    18.02     0.757
GMM-43-2048 V300 U60TelMic    17.39     0.748
Overall improvement: EER -28%, DCF -15%
Interview development test set
Development test set defined on SRE08_MX5_DEV: 3 males + 3 females, 6 sessions per speaker, 9 channels per session (324 audio files)
Each audio file split into 3-minute chunks (3466 elements)
Gender-dependent tests, no same-session tests, uniform cross-channel test distribution
Male: 7200 target tests, 17280 impostor tests
Female: 7290 target tests, 17496 impostor tests
Interview intersession normalization
Supervector differences computed with respect to the interviewee near-field microphone (channel 2), using parallel chunks of the same session
The speaker and the phonetic content of parallel chunks are the same, so the compensation focuses on microphone differences
20 interview channel eigenvectors appended to 30 Tel+Mic eigenvectors
Interview VAD
Development: VAD based on the energy distribution of the interviewer and interviewee near-field microphones + Loquendo ASR
Development & Test: VAD based on NIST's interviewee VAD/ASR:
if ((VAD_NIST and ASR_NIST) speech % > 40%): VAD = VAD_NIST and ASR_NIST
else if (VAD_NIST speech % > 40%): VAD = VAD_NIST
else if (ASR_NIST speech % > 40%): VAD = ASR_NIST
else: VAD = 1 for each frame
This VAD information is further filtered by the Loquendo ASR
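The decision cascade above translates almost directly into code (frame labels as 0/1 lists; a sketch, not the evaluation scripts):

```python
def combine_vad(vad_nist, asr_nist, threshold=0.40):
    """Combine NIST VAD and ASR frame labels following the cascade:
    prefer the intersection, fall back to either source alone, and
    declare everything speech if no source marks enough frames."""
    n = len(vad_nist)
    both = [v and a for v, a in zip(vad_nist, asr_nist)]
    frac = lambda mask: sum(mask) / n          # fraction of speech frames
    if frac(both) > threshold:
        return both
    if frac(vad_nist) > threshold:
        return list(vad_nist)
    if frac(asr_nist) > threshold:
        return list(asr_nist)
    return [1] * n                             # fall back to "all speech"
```

The 40% floor guards against a source that marks almost nothing as speech, which would starve the recognizer of frames.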
Interview development history
MIX5-Develop - Interview-Interview (3 male speakers)
[Bar chart: EER (%) and Min DCF for GMM-25-512 V300 with U40TelMic, U50TelMic+U40Int, U30TelMic+U20Int, U20Int M+F All+LoqASR, NistVAD+LoqASR, and NistVAD+NistASR+LoqASR.]
EER (%): 7.25, 6.91, 6.48, 6.31, 8.57, 8.14, 7.15
Min DCF: 0.363, 0.343, 0.332, 0.318, 0.400, 0.378, 0.338
Calibration
Combination of our 3 systems by linear fusion with logistic regression:
parameters estimated on SRE06 data using the FoCal tool
depending on each condition that appears in both SRE08 and SRE06
For the interview conditions, the weights are borrowed from the most similar conditions, substituting the microphone condition for the interview condition
For the long training condition, we used the weights computed for the corresponding short2 interview condition
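A minimal numpy sketch of prior-weighted logistic-regression fusion in the spirit of the FoCal objective (plain gradient descent, all names illustrative; not the FoCal implementation):

```python
import numpy as np

def train_fusion(scores, labels, target_prior=0.5, iters=500, lr=0.1):
    """Learn linear fusion weights for several systems' scores.
    scores: (n, k) per-system scores; labels: (n,) 1=target, 0=impostor."""
    X = np.hstack([scores, np.ones((len(scores), 1))])  # append bias column
    w = np.zeros(X.shape[1])
    y = np.asarray(labels, dtype=float)
    # per-trial weights so each class contributes with the target prior,
    # regardless of the class balance of the development trials
    cw = np.where(y == 1, target_prior / y.mean(),
                  (1 - target_prior) / (1 - y.mean()))
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # sigmoid of fused score
        grad = X.T @ (cw * (p - y)) / len(y)    # weighted logistic gradient
        w -= lr * grad
    return w

def fuse(w, scores):
    """Fused (calibrated) score: weighted sum of system scores + offset."""
    return np.hstack([scores, np.ones((len(scores), 1))]) @ w
```

Because the objective is proper-scoring-rule based, the fused output behaves like a calibrated log-likelihood ratio, so a single application-dependent threshold can be used at test time.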
Performed tests
The SRE08 primary system has been tested on all the evaluation conditions
Unsupervised adaptation scores have been submitted for the 10sec-10sec condition
The SRE06 mothballed system has been tested on the short2-short3 condition
Sub-systems comparison (i)
[DET plots for Short2Int-Short3Int, Short2Int-Short3Tel, and Short2Tel-Short3Int; curves: LPT06, PGMM, GMM-1024-25V, GMM-512-25, LPT08 Primary.]
Short2Int-Short3Int: LPT 08 vs 06: EER -62%, DCF -57%
Short2Int-Short3Tel: LPT 08 vs 06: EER -29%, DCF -25%
Short2Tel-Short3Int: LPT 08 vs 06: EER -46%, DCF -45%
Sub-systems comparison (ii)
[DET plots. Short2Tel-Short3Tel, all trials (curves: LPT06, PGMM, GMM-512-25V, GMM-1024-43, LPT08 Primary); Short2Tel-Short3Mic, all trials (curves: LPT06, PGMM, GMM-512-25V, GMM-512-25, LPT08 Primary).]
Short2Tel-Short3Tel: LPT 08 vs 06: EER -26%, DCF -32%
Short2Tel-Short3Mic: LPT 08 vs 06: EER -17%, DCF -9%
2 Wires (summed) conditions
Segmentation based on speaker factors
Reduced accuracy loss due to segmentation, for both the summed test and summed train conditions
Accuracy loss due to an intrinsic increase of the false alarm rate
[DET plots: LPT Primary, Tel-Tel trials. Left: short2-summed vs. short2-short3**2 (simulation) vs. short2-short3. Right: 3summed-short3 vs. 3conv-short3.]
Unsupervised adaptation: 10sec-10sec
The selection of the trials to be used for adaptation was carried out using the primary system scores, obtained with the un-adapted models
Original model for trial selection
Updated model for scoring
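The protocol can be sketched generically (all helper names are hypothetical; the exact interaction between a trial's own score and its adaptation is a protocol detail the slide does not spell out):

```python
def unsupervised_adaptation(model, trials, score_fn, adapt_fn, threshold):
    """Score a sequence of trials against one target model with
    unsupervised adaptation. Trial selection always uses the ORIGINAL
    (un-adapted) model; the reported score comes from the progressively
    UPDATED model."""
    original = model
    updated = model
    scores = []
    for seg in trials:
        # selection decision with the original model only, so that
        # adaptation errors cannot snowball into the selection
        if score_fn(original, seg) > threshold:
            updated = adapt_fn(updated, seg)
        scores.append(score_fn(updated, seg))
    return scores
```

Keeping selection tied to the original model is what makes the gender-dependent thresholds tunable on development data: the selection statistics do not drift as the model adapts.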
Unsupervised adaptation 10s-10s
Gender-dependent adaptation thresholds tuned on SRE06 10sec-10sec
[Pie charts: 636 true speaker trials and 75 impostor trials were selected for adaptation; 1484 true speaker trials were not selected.]
NIST SRE 2008 / SRE06 results:
Condition       EER Std   EER Adapted   Min DCF Std   Min DCF Adapted
SRE06 Male      14.73     14.44         0.635         0.612
SRE06 Female    16.09     15.79         0.683         0.660
SRE08 Male      14.57     14.56         0.652         0.645
SRE08 Female    16.34     15.83         0.761         0.754
Conclusions
Significant improvements were obtained using the speaker factors on the 10sec-10sec condition, compared with SRE06
Conclusions
Contribution of channel normalization for the interview conditions
[DET plot: Short2Int-Short3Int, all trials; GMM-512-25 with U Tel+Mic vs. GMM-512-25 with U Tel+Mic+Int.]
References
N. Brummer and J. du Preez, "Application-Independent Evaluation of Speaker Detection", Computer Speech and Language, Vol. 20, No. 2-3, pp. 230-275, 2006.
F. Castaldo, D. Colibro, E. Dalmasso, P. Laface, and C. Vair, "Compensation of Nuisance Factors for Speaker and Language Recognition", IEEE Trans. on Audio, Speech, and Language Processing, Vol. 15, No. 7, pp. 1969-1978, 2007.
F. Castaldo, D. Colibro, E. Dalmasso, P. Laface, and C. Vair, "Stream-Based Speaker Segmentation Using Speaker Factors and Eigenvoices", Proc. ICASSP 2008, pp. 4133-4136.
P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, "A Study of Inter-Speaker Variability in Speaker Verification", IEEE Trans. on Audio, Speech, and Language Processing, July 2008.
R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski, "Rapid Speaker Adaptation in Eigenvoice Space", IEEE Trans. on Speech and Audio Processing, Vol. 8, No. 6, pp. 695-707, Nov. 2000.
Thank you!