Speaker Authentication - Interspeech 2011_v3



Page 1: Speaker Authentication - Interspeech 2011_v3

H. Aronowitz (IBM Research) Interspeech 2011 1/27

Hagai Aronowitz, Ron Hoory (IBM Research – Haifa)
Jason Pelecanos, David Nahamoo (IBM T.J. Watson Research Center)

New Developments in Voice Biometrics for User Authentication

Page 2: Speaker Authentication - Interspeech 2011_v3

Speaker Verification for Mobile Banking Transactions

Mobile banking services
- “I want to transfer 10K Dollars from my account to account #53463985”

Current solution is based on RSA SecurID

Proposed solution: multi-factor authentication
- Speaker verification
- Face recognition
- …

Page 3: Speaker Authentication - Interspeech 2011_v3

The User Authentication Evaluation

The evaluation focuses on speaker verification

Wells-Fargo (WF) bank collected data from 750 employees

IBM Research participated in the evaluation

Evaluation rules are similar to NIST-SRE rules (however, gender is assumed to be unknown)

Page 4: Speaker Authentication - Interspeech 2011_v3

Outline

1. Evaluation description

2. Technology
   - Text-independent
   - Text-dependent

3. Improvements

4. Results

5. Post-evaluation work and conclusions

Page 5: Speaker Authentication - Interspeech 2011_v3

Evaluation Description

Page 6: Speaker Authentication - Interspeech 2011_v3

Authentication Conditions

1. A global digit string such as 0123456789
   - Attackers may use a recording
   - Easiest to classify
   - Denoted by the global condition

2. A speaker-dependent password such as “4131024773”
   - May be eavesdropped / recorded
   - We assume the worst-case scenario: the impostor knows the password
   - Denoted by the speaker condition

3. A prompted random digit string
   - Hardest to accurately authenticate
   - Denoted by the prompted condition

4. Free speech
   - More natural, especially for the call-center scenario
   - Denoted by the TI (text-independent) condition

Page 7: Speaker Authentication - Interspeech 2011_v3

WF POT Data

- 750 speakers (200 for Dev, 550 for Eval)
- Data recorded over 4 weeks
- 4 sessions recorded per speaker: 2 landline + 2 cellular
- Each session consists of all authentication conditions
- Some digit strings are repeated 3 times in order to allow enrollment/verification with more than a single repetition

Dev data

Condition            Dev data
Global               Same digit strings as evaluated
Speaker, Prompted    Different digit strings than evaluated
TI                   Different text than evaluated

Page 8: Speaker Authentication - Interspeech 2011_v3

Technology

Page 9: Speaker Authentication - Interspeech 2011_v3

Speaker Verification Systems

Text-independent systems

GMM-based Joint Factor Analysis (JFA)

GMM-based Nuisance Attribute Projection (NAP)

We use both systems for all authentication conditions

Text-dependent system

HMM-based NAP

We use this system for the global condition only

Page 10: Speaker Authentication - Interspeech 2011_v3

GMM-Based JFA

A standard JFA system: $M = m + Vy + Dz + Ux$

Hyperparameters (m, V, D, U) estimated from standard telephony data
- Switchboard-II, NIST 2004 & 2006
- 12,711 sessions in total

Front end: VAD + 12 MFCC + 12 Δ + 12 ΔΔ + feature warping

Linear scoring

Symmetric scoring: forward + reverse scoring

ZT-score normalization using WF-POT Dev data
- 800 sessions (200 speakers × 4 sessions)
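
The symmetric scoring and ZT-score normalization steps on this slide can be illustrated with a minimal NumPy sketch. It is not the evaluated system's code. The function and argument names are hypothetical, summing the forward and reverse scores is an assumption (the slide only says "forward + reverse"), and applying Z-norm first and T-norm second is the usual reading of "ZT-norm", assumed here.

```python
import numpy as np

def symmetric_score(forward, reverse):
    # forward: enrollment model scored on the test utterance
    # reverse: test-utterance model scored on the enrollment utterance
    # Assumption: the two directions are simply summed.
    return forward + reverse

def zt_norm(raw, model_vs_z, tmodels_vs_test, tmodels_vs_z):
    """ZT-normalize one trial score.

    raw             -- score of the target model on the test utterance
    model_vs_z      -- (Nz,) scores of the target model on Z-norm (impostor) utterances
    tmodels_vs_test -- (Nt,) scores of the T-norm cohort models on the test utterance
    tmodels_vs_z    -- (Nt, Nz) scores of the cohort models on the Z-norm utterances
    """
    model_vs_z = np.asarray(model_vs_z, dtype=float)
    tmodels_vs_test = np.asarray(tmodels_vs_test, dtype=float)
    tmodels_vs_z = np.asarray(tmodels_vs_z, dtype=float)

    # Z-norm: remove the model-dependent offset and scale.
    z = (raw - model_vs_z.mean()) / model_vs_z.std()

    # Z-normalize each cohort model's score on the test utterance as well.
    t_z = (tmodels_vs_test - tmodels_vs_z.mean(axis=1)) / tmodels_vs_z.std(axis=1)

    # T-norm: remove the test-utterance-dependent offset and scale.
    return (z - t_z.mean()) / t_z.std()
```

In this sketch the cohort statistics would come from held-out sessions, in the same role the slide gives to the 800 WF-POT Dev sessions.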

Page 11: Speaker Authentication - Interspeech 2011_v3

GMM-Based NAP Baseline

UBM & NAP are trained from NIST 2004

Supervectors created using normalized GMM-means

Front end: 13 MFCC + 13 Δ + VAD + feature warping

Dot product scoring

ZT-score normalization using same data as the JFA system
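
A minimal sketch of NAP followed by dot-product scoring, assuming supervectors have already been built from the normalized GMM means. The nuisance basis here is estimated from within-speaker (session) variation via SVD, which is a common way to train NAP but is an assumption about this system. All function names and the `rank` parameter are hypothetical.

```python
import numpy as np

def train_nap(supervectors, speaker_ids, rank):
    """Return an orthonormal nuisance basis U of shape (dim, rank)."""
    X = np.asarray(supervectors, dtype=float)
    ids = np.asarray(speaker_ids)
    # Remove each speaker's mean so that only session variability remains.
    centered = np.vstack(
        [X[ids == s] - X[ids == s].mean(axis=0) for s in np.unique(ids)]
    )
    # Right singular vectors are the principal directions of that variability.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank].T

def nap_project(x, U):
    # Apply P = I - U U^T without forming the full matrix.
    x = np.asarray(x, dtype=float)
    return x - U @ (U.T @ x)

def dot_score(enroll, test, U):
    # Dot product of the two NAP-compensated supervectors.
    return float(nap_project(enroll, U) @ nap_project(test, U))
```

The raw scores produced this way would still go through ZT-score normalization, as the slide notes.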

Page 12: Speaker Authentication - Interspeech 2011_v3

HMM-Based NAP

Speaker-independent (SI) HMM training
- Using text-matched Dev data (200 speakers × 4 sessions)

MAP adaptation for estimating session-dependent HMMs
- 3 repetitions used for enrollment
- 1 or 2 repetitions used for verification

Supervectors created using normalized GMM-means of the HMMs

Front end, NAP, scoring, score normalization
- Same as for the GMM-NAP system

Page 13: Speaker Authentication - Interspeech 2011_v3

Linear Fusion

Global condition
- HMM-NAP score: 50%
- GMM-JFA score: 25%
- GMM-NAP score: 25%

Speaker, prompted & TI conditions
- GMM-JFA score: 50%
- GMM-NAP score: 50%
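
The fusion rule on this slide is a fixed weighted sum per condition, which can be sketched as follows. The dictionary keys and the `fuse` function are hypothetical names. The sketch also assumes that the per-system scores have already been normalized to comparable ranges before fusion.

```python
# Fusion weights as listed on the slide, keyed by authentication condition.
FUSION_WEIGHTS = {
    "global":   {"hmm_nap": 0.50, "gmm_jfa": 0.25, "gmm_nap": 0.25},
    "speaker":  {"gmm_jfa": 0.50, "gmm_nap": 0.50},
    "prompted": {"gmm_jfa": 0.50, "gmm_nap": 0.50},
    "ti":       {"gmm_jfa": 0.50, "gmm_nap": 0.50},
}

def fuse(condition, scores):
    """Linearly fuse per-system scores for one trial.

    condition -- one of "global", "speaker", "prompted", "ti"
    scores    -- dict mapping system name to its (normalized) score
    """
    weights = FUSION_WEIGHTS[condition]
    return sum(w * scores[name] for name, w in weights.items())

# Example trial in the global condition with made-up scores.
fused = fuse("global", {"hmm_nap": 2.0, "gmm_jfa": 4.0, "gmm_nap": 0.0})
```

HMM-NAP appears only in the global condition because that is the only condition the text-dependent system is used for.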

Page 14: Speaker Authentication - Interspeech 2011_v3

Improvements

Page 15: Speaker Authentication - Interspeech 2011_v3

Improvements

Main method: tuning to the WF POT Dev data
- JFA: hard to tune because it needs large amounts of data
- HMM-NAP: already tuned to Dev data
- GMM-NAP: we can tune the UBM and the NAP projection

Methodology
- Research focused on the global condition
- Conclusions have been applied to other conditions

Extended dataset for the global condition
- 6 different 10-digit strings + 2 textual passwords (“At WF my voice is my password”, “There is no place like home”)
- Channel conditions:
  - 75% mismatched trials
  - 25% matched trials

Page 16: Speaker Authentication - Interspeech 2011_v3

GMM-Based NAP Improvements: Improved NAP

2-wire NAP
- In [1] we have shown that removal of the speaker subspace improves accuracy compared to no subspace removal
- In [2] we have shown that removal of dominant components of the speaker subspace on top of the channel subspace outperforms standard NAP
- Theoretical motivation was given for speaker-ID in 2-wire data [2], but improvements were also observed for 4-wire speaker-ID
- On the WF data we observe a 6% relative error reduction

[1] H. Aronowitz, “Speaker Recognition using Kernel-PCA and Intersession Variability Modeling”, in Proc. Interspeech, 2007.
[2] Y. A. Solewicz, H. Aronowitz, “Two-Wire Nuisance Attribute Projection”, in Proc. Interspeech, 2009.

Page 17: Speaker Authentication - Interspeech 2011_v3

GMM-Based NAP Improvements: NAP training

NAP training data
- Baseline: NIST-2004
- IBM 2003 digits dataset: 21% error reduction
- The whole WF Dev-set: 26% error reduction
- Text-matched utterances from WF Dev-set: 29% error reduction

Setup
- UBM is trained from NIST04 data
- Same trend is observed when UBM is trained from IBM 2003 digits dataset / WF-POT data

Page 18: Speaker Authentication - Interspeech 2011_v3

GMM-Based NAP Improvements: UBM training data

UBM training data
- Baseline: NIST-2004
- IBM 2003 digits dataset: 4% error reduction
- Text-matched utterances from WF Dev-set: 15% error reduction

Setup
- NAP is trained from text-matched WF-POT data
- Same trend is observed when NAP is trained from NIST04 or IBM 2003 digits dataset

Page 19: Speaker Authentication - Interspeech 2011_v3

GMM-Based NAP Improvements: Summary

Methods
1. 2-wire NAP
2. Text-matched data for UBM training
3. Text-matched data for NAP training

Results
- 40% error reduction compared to using NIST dev data for UBM and NAP training
- 25% error reduction compared to using IBM 2003 digits dataset for UBM and NAP training
- These techniques have been successfully used for the speaker, prompted and TI conditions
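
As a rough consistency check only, the per-step relative error reductions reported on the NAP-training, UBM-training and 2-wire NAP slides can be compounded. The sketch assumes that relative reductions multiply across steps, which the slides do not state, and the step figures were measured under different setups. The `compound` function is a hypothetical helper.

```python
def compound(reductions):
    """Combine relative error reductions, assuming they multiply."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining

nap_text_matched = 0.29  # NAP trained on text-matched WF Dev data vs. NIST-2004
ubm_text_matched = 0.15  # UBM trained on text-matched WF Dev data vs. NIST-2004
two_wire_nap = 0.06      # 2-wire NAP on the WF data

data_only = compound([nap_text_matched, ubm_text_matched])
with_two_wire = compound([nap_text_matched, ubm_text_matched, two_wire_nap])
print(f"data only: {100 * data_only:.1f}%  with 2-wire NAP: {100 * with_two_wire:.1f}%")
```

Under that assumption both combinations land near the 40% figure quoted on this slide, but the slides do not say how the 40% was actually obtained.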

Page 20: Speaker Authentication - Interspeech 2011_v3

Results

Page 21: Speaker Authentication - Interspeech 2011_v3

Results on NIST-2008

GMM-JFA   GMM-NAP
1.4       3.6

Condition: short2-short3, tel-tel, males
Results are in EER (%)
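
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the false-acceptance and false-rejection rates coincide. The following is a minimal threshold-sweep sketch, not the evaluation's scoring tool. The function name is hypothetical, and on finite data it returns the average of the two rates at the closest crossing.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Return the EER in percent for two sets of trial scores."""
    tgt = np.asarray(target_scores, dtype=float)
    imp = np.asarray(impostor_scores, dtype=float)
    # Every observed score is a candidate decision threshold.
    thresholds = np.sort(np.concatenate([tgt, imp]))
    # A trial is accepted when its score is >= threshold.
    frr = np.array([(tgt < t).mean() for t in thresholds])   # rejected targets
    far = np.array([(imp >= t).mean() for t in thresholds])  # accepted impostors
    i = int(np.argmin(np.abs(far - frr)))
    return 100.0 * (far[i] + frr[i]) / 2.0

# Toy example with made-up scores.
eer = equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 1.5, 2.5])
```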

Page 22: Speaker Authentication - Interspeech 2011_v3

Results for Single Verification Utterance

Matched channel

Condition   GMM-JFA   GMM-NAP   HMM-NAP   Fused
Global      1.70      1.01      0.90      0.70
Speaker     2.21      1.82      -         1.26
Prompted    6.49      5.63      -         3.40
TI          1.24      1.35      -         0.65

Mismatched channel

Condition   GMM-JFA   GMM-NAP   HMM-NAP   Fused
Global      5.07      2.99      2.35      1.95
Speaker     5.68      5.05      -         3.64
Prompted    12.33     11.85     -         8.33
TI          4.24      4.85      -         2.50
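
As a quick arithmetic check on the two tables for this slide, the ratio of the mismatched-channel to the matched-channel value of the fused system can be computed per condition. The numbers are copied from the tables. Reading them as EER in percent is an assumption based on the NIST-2008 slide.

```python
# Fused-system values from the single-verification-utterance tables.
matched = {"Global": 0.70, "Speaker": 1.26, "Prompted": 3.40, "TI": 0.65}
mismatched = {"Global": 1.95, "Speaker": 3.64, "Prompted": 8.33, "TI": 2.50}

# Degradation factor when enrollment and verification channels differ.
ratios = {cond: mismatched[cond] / matched[cond] for cond in matched}

for cond, ratio in ratios.items():
    print(f"{cond:9s} {ratio:.2f}x")
```

The ratios fall between roughly 2.4 and 3.9, so channel mismatch raises the fused system's error by a factor of about three.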

Page 23: Speaker Authentication - Interspeech 2011_v3

Results for Two Verification Utterances

Matched channel

Condition   GMM-JFA   GMM-NAP   HMM-NAP   Fused
Global      1.05      0.86      0.66      0.55
Speaker     1.50      1.37      -         0.85

Mismatched channel

Condition   GMM-JFA   GMM-NAP   HMM-NAP   Fused
Global      3.34      1.99      1.66      1.41
Speaker     4.11      3.97      -         2.74

Page 24: Speaker Authentication - Interspeech 2011_v3

TI Accuracy as Function of Session Length

Enrollment
- Two sessions (1 landline + 1 cellular)
- Enrollment session length: ~25 sec each

Page 25: Speaker Authentication - Interspeech 2011_v3

Post-Evaluation Work & Conclusions

Page 26: Speaker Authentication - Interspeech 2011_v3

Post-Evaluation Work

Error reduction (~20%)
- i-vector based system
- Weighted symmetric scoring*
- Robust scoring*
  - Handling estimation uncertainty by weighting the contribution of each Gaussian using a geometric mean of the Gaussian occupancy counts. Motivated by [Campbell, 2010].

Goat detection
- Talk given earlier today by Orith Toledo-Ronen

Fast JFA scoring
- Using efficient approximated factors estimation*

* H. Aronowitz, O. Barkan, “New Developments in Joint Factor Analysis for Speaker Verification”, in Proc. Interspeech, 2011. Talk will be given today at 4:20 PM.

Page 27: Speaker Authentication - Interspeech 2011_v3

Conclusions

1. We evaluated JFA, GMM-NAP, HMM-NAP and a fused system on 4 authentication conditions

2. HMM-NAP was the best standalone system for the global condition

3. GMM-NAP outperformed JFA on the TD conditions due to its full usage of the WF POT Dev data. Baseline GMM-NAP was improved by 40% using better Dev data for UBM and NAP-projection estimation and using 2-wire NAP

4. EERs lower than 1% have been obtained for the matched channel condition

5. EER triples for the mismatched channel condition

6. Multi-condition authentication leads to even smaller EERs