NIST SRE 2008 Workshop
Loquendo - Politecnico di Torino (LPT) Site Presentation
Daniele Colibro, Claudio Vair (Loquendo)
Fabio Castaldo, Emanuele Dalmasso, Pietro Laface (Dipartimento di Automatica e Informatica, Politecnico di Torino)
June 17, 2008
NIST SRE 2008 Workshop: 17-18 June
Outline
Main goals
System description, key features
Training approach
Development history
Summed condition tests
Unsupervised adaptation tests
Our main goals
Improve performance on short durations
Improve performance on mismatched conditions
Deal with the new interview data
Perform all SRE 2008 tests
System description
Two GMM systems were used for this evaluation:
Phonetic GMM (PGMM) with 1408 (128x11) Gaussians
GMM with 512/1024/2048 Gaussians
GMM and PGMM main features
Feature Domain Intersession Compensation (FDIC) in training and testing
Speaker factors + Relevance MAP in training
Standard log-likelihood computation in test
No discriminative models (e.g. SVM)
FoCal toolkit used for fusion, calibration, and log-likelihood computation, with a prior-weighted logistic regression objective
Key features: all conditions
Extended training set for gender-conditioned UBMs, intersession and eigen-speaker modeling
Extended data set for ZNorm and TNorm
Intersession compensation using condition-dependent subspaces
Key features: short duration
Speaker factors
Mismatched conditions:
Intersession subspace estimated on more data
Speaker factors
Parameter tuning (number of Gaussians and number of MFCC parameters)
Interviews:
Development data for channel compensation
Acoustic features
Standard MFCC parameters:
GMM-25: 12 cepstra (c1-c12) + 13 delta (Δc0-Δc12)
GMM-43: 18 cepstra (c1-c18) + 19 delta (Δc0-Δc18) + 6 double delta (ΔΔc0-ΔΔc5)
PGMM-36: 18 cepstra (c1-c18) + 18 delta (Δc1-Δc18)
All systems perform feature warping to a Gaussian distribution on a 3 s sliding window, excluding silence frames
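As an illustration, the sliding-window warping can be sketched in a few lines of Python (a hypothetical helper, not the evaluation code; 300 frames approximates 3 s at a 10 ms frame shift, and `statistics.NormalDist` supplies the inverse Gaussian CDF):

```python
import numpy as np
from statistics import NormalDist

def feature_warp(feats, win=300):
    """Warp each feature stream to a standard normal distribution over a
    sliding window. feats: (n_frames, n_dims) array of speech-only features."""
    inv_cdf = NormalDist().inv_cdf
    n, d = feats.shape
    half = win // 2
    warped = np.empty_like(feats, dtype=float)
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        window = feats[lo:hi]
        m = window.shape[0]
        for k in range(d):
            # rank of the centre frame within the window, mapped to the
            # corresponding quantile of a standard Gaussian
            rank = int(np.sum(window[:, k] < feats[t, k])) + 1
            warped[t, k] = inv_cdf((rank - 0.5) / m)
    return warped
```

Because the mapping depends only on ranks inside the window, it removes slowly varying channel effects regardless of their scale or offset.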
Training data
No use of SRE06 for training
Gender-dependent UBMs: SRE04 + SRE05
Intersession compensation eigen-matrix U: SRE04 + SRE05 + interview development data
Speaker factors eigen-matrix V: SRE99 + SRE00 + SRE03 + SRE05 + 1029 females and 828 males randomly selected from the Fisher English Training Speech Part 1/2 among the speakers contributing at least 3 utterances
Overall: 2079 female and 1634 male speakers
Intersession compensation
Intersession compensation can be done:
In the model domain
In the feature domain
Directly during the test, adjusting the likelihood computation
No relevant performance differences in our tests
Matrix U training
Speaker model training by Relevance MAP
Intersession compensation matrix U computed using the differences between models of the same speaker.
Interview: differences with respect to the interviewee near-field microphone (channel 2)
Matrix U computed using EM-PCA
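A minimal sketch of this procedure, with plain SVD-based PCA standing in for the EM-PCA the authors use, and all names illustrative:

```python
import numpy as np

def train_U(supervectors, speaker_ids, n_eigen=40):
    """Estimate an intersession subspace U from supervectors of the same
    speaker recorded in different sessions.
    supervectors: (n_models, dim); speaker_ids: per-model speaker labels."""
    diffs = []
    ids = np.asarray(speaker_ids)
    for spk in np.unique(ids):
        models = supervectors[ids == spk]
        # differences between each session model and the speaker's mean
        # model: speaker identity cancels, session variability remains
        diffs.append(models - models.mean(axis=0))
    D = np.vstack(diffs)
    # principal directions of within-speaker (session) variability
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    return Vt[:n_eigen].T  # columns = intersession eigenvectors
```

EM-PCA is preferred in practice because the supervector dimension is far too large for an explicit covariance matrix; the subspace it converges to is the same.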
Speaker modeling
The speaker models are trained on Feature Domain Intersession Compensation (FDIC) features by a sequence of:
1. Eigen-speaker MAP modeling: s′ = m_UBM + V y
2. Relevance MAP adaptation, s′′ = s′ + D z, using the Gaussian occupation statistics computed on s′ (relevance factor = 16)
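The relevance MAP step (step 2, relevance factor 16) can be sketched for the mean of a single Gaussian; this is the generic relevance-MAP update, not the authors' code, and the names are illustrative:

```python
import numpy as np

def relevance_map(prior_mean, frames, posteriors, relevance=16.0):
    """Relevance-MAP adaptation of one Gaussian mean.
    prior_mean: (d,) prior mean (here a component of s');
    frames: (n, d) features; posteriors: (n,) occupation probabilities."""
    n_g = posteriors.sum()            # soft frame count for this Gaussian
    if n_g == 0:
        return prior_mean.copy()      # no data: keep the prior mean
    ml_mean = (posteriors[:, None] * frames).sum(axis=0) / n_g
    # interpolation weight grows with the amount of adaptation data
    alpha = n_g / (n_g + relevance)
    return alpha * ml_mean + (1.0 - alpha) * prior_mean
```

With relevance 16, a Gaussian needs roughly 16 frames of soft count before the data and the prior contribute equally, which keeps sparsely observed Gaussians close to s′.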
Matrix V training
Relevance MAP training of the speaker models, using at least 3 utterances per speaker
The eigen-speaker matrix V computed using EM-PCA, or Maximum Likelihood + Minimum Divergence training
Interaction between U and V matrices
The matrices U and V can be trained using different approaches:
1. U′ first, then V′ using FDIC features depending on U′
2. U′′ and V′′ independently
3. V′′′ first, then U′′′ using models depending on V′′′
In our results, approaches 1 and 2 are equivalent, whereas 3 gives worse performance.
U and V are “orthogonal”
Different U can be used with the same V without accuracy loss
Score Normalization
We performed ZT-normalization as we did in SRE06.
For the trials involving phone-call tests, our ZNorm set includes 1252 female and 1103 male files taken from the SRE04/05 phone calls.
For trials involving microphone and interview speech, the ZNorm set includes 204 female and 180 male additional files taken from the SRE05 microphone subset.
The same sets, selected according to the training conditions, were used for training the TNorm models.
For the 3conv/8conv training conditions, the ZNorm set is the same used in SRE06, which includes 80 speakers with 3/8 sides from SRE04.
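A sketch of ZT-norm scoring under these cohort definitions (hypothetical helper names; the cohorts in the example are tiny, for illustration only):

```python
import numpy as np

def zt_norm(score, z_cohort, t_cohort, zt_cohort):
    """ZT-norm a single trial score.
    score:     raw score of the trial (model vs. test segment)
    z_cohort:  scores of the model against the Z-norm impostor segments
    t_cohort:  scores of the T-norm impostor models against the test segment
    zt_cohort: scores of each T-norm model against the Z-norm segments
               (one row per T-norm model), used to z-normalise t_cohort."""
    z = lambda s, ref: (s - np.mean(ref)) / np.std(ref)
    score_z = z(score, z_cohort)                       # Z-norm the trial
    t_cohort_z = np.array([z(t, row)                   # Z-norm the T cohort
                           for t, row in zip(t_cohort, zt_cohort)])
    return z(score_z, t_cohort_z)                      # then T-norm
```

Z-norm removes model-dependent score offsets, T-norm removes segment-dependent ones; applying both makes a single decision threshold usable across conditions.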
Development history: 1conv4w-1conv4w
SRE2006 - 1conv4w-1conv4w - All Trials
[Bar chart: EER (%) and Min DCF for successive system configurations, from GMM SRE06 through GMM-25-512 (MAP, U40TelMic, V300+D16 NormExt) and GMM-43-1024 (MAP, U60Tel, U60TelMic, V300+D16 NormExt).]
EER (%): 5.88, 5.57, 5.23, 5.01, 5.66, 4.62, 4.59, 4.28
Min DCF: 0.278, 0.278, 0.264, 0.257, 0.272, 0.243, 0.239, 0.221
Overall improvement: EER -28%, DCF -20%
Development history: 1conv4w-1convmic
SRE2006 - 1conv4w-1convmic - All Trials
[Bar chart: EER (%) and Min DCF for successive system configurations, from GMM SRE06 through GMM-25-512 (MAP, U40TelMic, V300+D16 NormExt) and GMM-43-1024 (MAP, U60TelMic, V300+D16 NormExt).]
EER (%): 6.42, 4.72, 3.73, 3.43, 4.68, 4.48, 3.67
Min DCF: 0.270, 0.198, 0.179, 0.149, 0.217, 0.216, 0.191
Overall improvement: EER -46%, DCF -45%
Development history: 10sec-10sec
SRE2006 - 10sec4w-10sec4w - All Trials
System                        EER (%)   Min DCF
GMM SRE06                     24.24     0.884
GMM-25-512  V300 U40TelMic    21.32     0.801
GMM-25-1024 V300 U40TelMic    19.63     0.772
GMM-43-1024 V300 U60TelMic    18.02     0.757
GMM-43-2048 V300 U60TelMic    17.39     0.748
Overall improvement: EER -28%, DCF -15%
Interview development test set
Development test set defined on SRE08_MX5_DEV: 3 males + 3 females, 6 sessions per speaker, 9 channels per session (324 audio files)
Each audio file split into 3-minute chunks (3466 elements)
Gender-dependent tests, no same-session tests, uniform cross-channel test distribution
Male: 7200 target tests, 17280 impostor tests
Female: 7290 target tests, 17496 impostor tests
Interview intersession normalization
Supervector differences computed with respect to the interviewee near-field microphone (channel 2), using parallel chunks of the same session
The speaker and the phonetic content of parallel chunks are the same, so the compensation focuses on microphone differences
20 interview channel eigenvectors appended to 30 Tel+Mic eigenvectors
Interview VAD
Development: VAD based on the energy distribution of the interviewer and interviewee near-field microphones + Loquendo ASR
Development & Test: VAD based on NIST's interviewee VAD/ASR:
if ((VAD_NIST and ASR_NIST) speech % > 40%): VAD = VAD_NIST and ASR_NIST
else if (VAD_NIST speech % > 40%): VAD = VAD_NIST
else if (ASR_NIST speech % > 40%): VAD = ASR_NIST
else: VAD = 1 for each frame
This VAD information is further filtered by the Loquendo ASR
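The decision cascade above translates almost directly into code (frame labels as 0/1 lists; a sketch, not the evaluation scripts):

```python
def combine_vad(vad_nist, asr_nist, threshold=0.40):
    """Combine NIST VAD and ASR frame labels following the cascade:
    prefer the intersection, fall back to either source alone, and
    declare everything speech if no source marks enough frames."""
    n = len(vad_nist)
    both = [v and a for v, a in zip(vad_nist, asr_nist)]
    frac = lambda mask: sum(mask) / n          # fraction of speech frames
    if frac(both) > threshold:
        return both
    if frac(vad_nist) > threshold:
        return list(vad_nist)
    if frac(asr_nist) > threshold:
        return list(asr_nist)
    return [1] * n                             # fall back to "all speech"
```

The 40% floor guards against a source that marks almost nothing as speech, which would starve the recognizer of frames.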
Interview development history
MIX5-Develop - Interview-Interview (3 male speakers)
[Bar chart: EER (%) and Min DCF for GMM-25-512 V300 with U40TelMic, U50TelMic+U40Int, U30TelMic+U20Int, U20Int M+F All+LoqASR, NistVAD+LoqASR, and NistVAD+NistASR+LoqASR.]
EER (%): 7.25, 6.91, 6.48, 6.31, 8.57, 8.14, 7.15
Min DCF: 0.363, 0.343, 0.332, 0.318, 0.400, 0.378, 0.338
Calibration
Combination of our 3 systems by linear fusion with logistic regression:
parameters estimated on SRE06 data using the FoCal tool
depending on each condition that appears in both SRE08 and SRE06
For the interview conditions, the weights are borrowed from the most similar conditions, substituting the microphone condition for the interview condition
For the long training condition, we used the weights computed for the corresponding short2 interview condition
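A minimal numpy sketch of prior-weighted logistic-regression fusion in the spirit of the FoCal objective (plain gradient descent, all names illustrative; not the FoCal implementation):

```python
import numpy as np

def train_fusion(scores, labels, target_prior=0.5, iters=500, lr=0.1):
    """Learn linear fusion weights for several systems' scores.
    scores: (n, k) per-system scores; labels: (n,) 1=target, 0=impostor."""
    X = np.hstack([scores, np.ones((len(scores), 1))])  # append bias column
    w = np.zeros(X.shape[1])
    y = np.asarray(labels, dtype=float)
    # per-trial weights so each class contributes with the target prior,
    # regardless of the class balance of the development trials
    cw = np.where(y == 1, target_prior / y.mean(),
                  (1 - target_prior) / (1 - y.mean()))
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # sigmoid of fused score
        grad = X.T @ (cw * (p - y)) / len(y)    # weighted logistic gradient
        w -= lr * grad
    return w

def fuse(w, scores):
    """Fused (calibrated) score: weighted sum of system scores + offset."""
    return np.hstack([scores, np.ones((len(scores), 1))]) @ w
```

Because the objective is proper-scoring-rule based, the fused output behaves like a calibrated log-likelihood ratio, so a single application-dependent threshold can be used at test time.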
Performed tests
The SRE08 primary system has been tested on all the evaluation conditions
Unsupervised adaptation scores have been submitted for the 10sec-10sec condition
The SRE06 mothballed system has been tested on the short2-short3 condition
Sub-systems comparison (i)
[DET plots for Short2Int-Short3Int, Short2Int-Short3Tel, and Short2Tel-Short3Int; curves: LPT06, PGMM, GMM-1024-25V, GMM-512-25, LPT08 Primary.]
Short2Int-Short3Int: LPT 08 vs 06: EER -62%, DCF -57%
Short2Int-Short3Tel: LPT 08 vs 06: EER -29%, DCF -25%
Short2Tel-Short3Int: LPT 08 vs 06: EER -46%, DCF -45%
Sub-systems comparison (ii)
[DET plots. Short2Tel-Short3Tel, all trials (curves: LPT06, PGMM, GMM-512-25V, GMM-1024-43, LPT08 Primary); Short2Tel-Short3Mic, all trials (curves: LPT06, PGMM, GMM-512-25V, GMM-512-25, LPT08 Primary).]
Short2Tel-Short3Tel: LPT 08 vs 06: EER -26%, DCF -32%
Short2Tel-Short3Mic: LPT 08 vs 06: EER -17%, DCF -9%
2 Wires (summed) conditions
Segmentation based on speaker factors
Reduced accuracy loss due to segmentation, for both the summed test and summed train conditions
Accuracy loss due to an intrinsic increase of the false alarm rate
[DET plots: LPT Primary, Tel-Tel trials. Left: short2-summed vs. short2-short3**2 (simulation) vs. short2-short3. Right: 3summed-short3 vs. 3conv-short3.]
Unsupervised adaptation: 10sec-10sec
The selection of the trials to be used for adaptation was carried out using the primary system scores, obtained with the un-adapted models
Original model for trial selection
Updated model for scoring
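The protocol can be sketched generically (all helper names are hypothetical; the exact interaction between a trial's own score and its adaptation is a protocol detail the slide does not spell out):

```python
def unsupervised_adaptation(model, trials, score_fn, adapt_fn, threshold):
    """Score a sequence of trials against one target model with
    unsupervised adaptation. Trial selection always uses the ORIGINAL
    (un-adapted) model; the reported score comes from the progressively
    UPDATED model."""
    original = model
    updated = model
    scores = []
    for seg in trials:
        # selection decision with the original model only, so that
        # adaptation errors cannot snowball into the selection
        if score_fn(original, seg) > threshold:
            updated = adapt_fn(updated, seg)
        scores.append(score_fn(updated, seg))
    return scores
```

Keeping selection tied to the original model is what makes the gender-dependent thresholds tunable on development data: the selection statistics do not drift as the model adapts.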
Unsupervised adaptation 10s-10s
Gender-dependent adaptation thresholds tuned on SRE06 10sec-10sec
[Pie charts: 636 true speaker trials and 75 impostor trials were selected for adaptation; 1484 true speaker trials were not selected.]
NIST SRE 2008 / SRE06 results:
Condition       EER Std   EER Adapted   Min DCF Std   Min DCF Adapted
SRE06 Male      14.73     14.44         0.635         0.612
SRE06 Female    16.09     15.79         0.683         0.660
SRE08 Male      14.57     14.56         0.652         0.645
SRE08 Female    16.34     15.83         0.761         0.754
Conclusions
Significant improvements were obtained using the speaker factors on the 10sec-10sec condition, compared with SRE06
Conclusions
Contribution of channel normalization for the interview conditions
[DET plot: Short2Int-Short3Int, all trials; GMM-512-25 with U Tel+Mic vs. GMM-512-25 with U Tel+Mic+Int.]
References
N. Brummer and J. du Preez, "Application-Independent Evaluation of Speaker Detection", Computer Speech and Language, Vol. 20, No. 2-3, pp. 230-275, 2006.
F. Castaldo, D. Colibro, E. Dalmasso, P. Laface, and C. Vair, "Compensation of Nuisance Factors for Speaker and Language Recognition", IEEE Trans. on Audio, Speech, and Language Processing, Vol. 15, No. 7, pp. 1969-1978, 2007.
F. Castaldo, D. Colibro, E. Dalmasso, P. Laface, and C. Vair, "Stream-Based Speaker Segmentation Using Speaker Factors and Eigenvoices", Proc. ICASSP 2008, pp. 4133-4136.
P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, "A Study of Inter-Speaker Variability in Speaker Verification", IEEE Trans. on Audio, Speech, and Language Processing, July 2008.
R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski, "Rapid Speaker Adaptation in Eigenvoice Space", IEEE Trans. on Speech and Audio Processing, Vol. 8, No. 6, pp. 695-707, Nov. 2000.
Thank you!