AUDIO-VISUAL SPEAKER IDENTIFICATION USING THE CUAVE DATABASE


David Dean, Patrick Lucey and Sridha Sridharan

Speech, Audio, Image and Video Research Laboratory, QUT

ABSTRACT

The freely available CUAVE database provides a valuable platform for establishing benchmarks and comparing research. This paper shows that the CUAVE database can successfully be used to test speaker identification systems, with performance comparable to existing systems implemented on other databases. Additionally, this research shows that the optimal configuration for decision fusion of an audio-visual speaker identification system relies heavily on the video modality in all but clean speech conditions.

[System block diagram with components: Lip Location and Tracking; Acoustic (MFCC) Feature Extraction; Visual (PCA) Feature Extraction; Acoustic Speaker Models (A1-A4); Visual Speaker Models (V1-V4); Fusion; Speaker Decision.]

LIP LOCATION AND TRACKING

RESULTS

Figure A shows the response of the fused system to speech-babble audio corruption over a range of signal-to-noise ratios (SNRs). The dashed line indicates the best possible performance at each noise level. The performance of this system is comparable to that reported in existing research on XM2VTS and other audio-visual databases.

From Figure A, it appears that the best non-adaptive fusion performance is obtained for relatively low audio weights. This is confirmed by Figure B, which shows that the optimal audio weight at each noise level favors the visual modality (i.e. is below 0.5) for all but clean acoustic speech.

MODELLING AND FUSION

For these experiments, text-dependent speaker modelling was performed, meaning that the speakers say the same utterance for both training and testing. Ten experiments were run, one for each digit in the CUAVE database.

Using phoneme transcriptions obtained from earlier research on the CUAVE database, speaker-independent HMMs were trained for each phoneme separately in both the audio and visual modalities. The structure of these HMMs was identical for both modalities, with 3 hidden states for the phone models and 1 hidden state for the short-pause model. These speaker-independent models were then adapted into speaker-dependent models using MLLR adaptation.
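As a rough illustration of this modelling step (not the authors' actual toolchain, which is not specified in the transcript), the sketch below trains a 3-state, diagonal-covariance Gaussian HMM per phoneme with the hmmlearn library and then applies a simple global mean-shift adaptation as a crude stand-in for MLLR. The feature dimensionality, phone set and training segments are placeholders.

```python
# Sketch only: speaker-independent per-phoneme HMMs plus a crude global
# mean-shift adaptation standing in for MLLR. Dimensions, phone labels
# and training segments are illustrative placeholders.
import copy
import numpy as np
from hmmlearn import hmm

N_DIM = 48                      # assumed acoustic feature dimensionality
PHONES = ["w", "ah", "n"]       # placeholder phone set (e.g. the digit "one")

def train_si_hmm(segments, n_states=3):
    """Train a speaker-independent HMM from a list of (frames x dim) segments.
    The short-pause model would use n_states=1."""
    X = np.vstack(segments)
    lengths = [len(seg) for seg in segments]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    model.fit(X, lengths)
    return model

def adapt_to_speaker(si_model, speaker_frames):
    """Shift every Gaussian mean by the offset between the speaker's global
    feature mean and the SI model's global mean. MLLR instead estimates a
    shared linear transform of the means; this is only a simplified stand-in."""
    sd_model = copy.deepcopy(si_model)
    offset = speaker_frames.mean(axis=0) - si_model.means_.mean(axis=0)
    sd_model.means_ = sd_model.means_ + offset
    return sd_model

# Usage sketch with random data in place of real phoneme-aligned features.
rng = np.random.default_rng(0)
si_models = {p: train_si_hmm([rng.standard_normal((40, N_DIM)) for _ in range(5)])
             for p in PHONES}
speaker_frames = rng.standard_normal((200, N_DIM))
sd_models = {p: adapt_to_speaker(m, speaker_frames) for p, m in si_models.items()}
```

In the system described here, adaptation was performed with MLLR, which replaces the simple offset above with a regression-based linear transform estimated from the adaptation data.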

For each word under test in the CUAVE database, the normalized scores of the ten most likely speakers in each modality are combined using weighted-sum late fusion.

The fusion experiments were conducted over a range of α values (the audio weight in the weighted sum) to discover the best fusion parameters at each noise level.
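A minimal sketch of this kind of weighted-sum late fusion follows, assuming per-speaker scores from each modality are already available; the score values, speaker IDs and the min-max normalization used below are illustrative assumptions, not the paper's exact choices.

```python
# Sketch of weighted-sum late fusion of normalized audio and visual scores.
# alpha is the audio weight; (1 - alpha) weights the visual scores.
import numpy as np

def normalize(scores):
    """Min-max normalize a dict of speaker -> score to the [0, 1] range."""
    vals = np.array(list(scores.values()), dtype=float)
    lo, hi = vals.min(), vals.max()
    return {spk: (s - lo) / (hi - lo + 1e-12) for spk, s in scores.items()}

def fuse(audio_scores, visual_scores, alpha):
    """Fuse the two modalities for the speakers ranked in both lists."""
    a = normalize(audio_scores)
    v = normalize(visual_scores)
    common = set(a) & set(v)
    fused = {spk: alpha * a[spk] + (1.0 - alpha) * v[spk] for spk in common}
    return max(fused, key=fused.get), fused

# Illustrative scores for the top-ranked speakers in each modality.
audio = {"s01": -310.2, "s07": -305.8, "s12": -320.5}
video = {"s01": -128.4, "s07": -131.0, "s12": -127.9}
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):      # sweep the audio weight
    best, _ = fuse(audio, video, alpha)
    print(f"alpha={alpha:.2f} -> identified speaker {best}")
```

Sweeping α in this way mirrors the experiments above: at lower SNRs, the best identification performance is obtained with α well below 0.5, i.e. with most of the weight on the visual scores.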

FEATURE EXTRACTION

Acoustic features were represented by the first 15 Mel frequency cepstral coefficients (MFCCs), a normalised energy coefficient, and the first and second derivatives of these 16 coefficients. Acoustic features were captured at 100 samples/second.
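The poster does not name a feature extraction toolkit; as an assumed illustration, the sketch below builds a comparable 48-dimensional acoustic feature vector (15 MFCCs plus a normalised energy term, with first and second derivatives) at 100 frames/second using librosa. The sample rate, window length and energy normalisation are assumptions.

```python
# Sketch of 48-dimensional acoustic feature extraction (16 static
# coefficients plus deltas and accelerations) at 100 frames/second.
import librosa
import numpy as np

def acoustic_features(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    hop = sr // 100                      # 10 ms hop -> 100 frames/second
    win = sr // 40                       # 25 ms analysis window (assumption)
    # 15 cepstral coefficients (dropping c0, which is replaced by energy).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=16, hop_length=hop, n_fft=win)[1:]
    # Log frame energy, normalised to zero mean and unit variance.
    energy = np.log(librosa.feature.rms(y=y, frame_length=win, hop_length=hop) + 1e-10)
    energy = (energy - energy.mean()) / (energy.std() + 1e-10)
    static = np.vstack([mfcc, energy])   # 16 static coefficients per frame
    delta = librosa.feature.delta(static, order=1)
    accel = librosa.feature.delta(static, order=2)
    return np.vstack([static, delta, accel]).T   # shape: (frames, 48)

# Usage (path is a placeholder): acoustic_features("utterance.wav")
```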

Visual features were represented by a 50-dimensional vector obtained by projecting each tracked-lip image into a PCA space trained on 1,000 randomly selected tracked-lip images. Visual features were captured at 29.97 samples/second.
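As a sketch of this visual feature pipeline (the lip-ROI size and the use of scikit-learn PCA are assumptions; the original eigen-decomposition details are not given in the transcript):

```python
# Sketch of the visual feature extraction: project grayscale tracked-lip
# images into a 50-dimensional PCA space. Image size and the use of
# scikit-learn are assumptions for illustration.
import numpy as np
from sklearn.decomposition import PCA

IMG_SHAPE = (32, 48)                    # assumed lip-ROI size (rows, cols)

# Placeholder for the 1,000 randomly selected tracked-lip training images.
rng = np.random.default_rng(0)
train_images = rng.random((1000, *IMG_SHAPE))

pca = PCA(n_components=50)
pca.fit(train_images.reshape(len(train_images), -1))    # flatten to vectors

def visual_features(lip_image):
    """Return the 50-dimensional PCA projection of one tracked-lip image."""
    return pca.transform(lip_image.reshape(1, -1))[0]

# Usage: one feature vector per video frame (29.97 frames/second).
frame_features = visual_features(rng.random(IMG_SHAPE))
print(frame_features.shape)             # (50,)
```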

Weighted-sum late fusion: sF = α·sA + (1 − α)·sV, where sA and sV are the normalized acoustic and visual scores, sF is the fused score, and α is the audio weight.