Audio-visual processing of speech with DNN

Ido Ariav

Electrical Engineering Department

Technion - Israel Institute of Technology

Supervised by Prof. Israel Cohen

Outline

▪ Background - Voice Activity Detection

▪ Deep Multimodal Architectures for Voice Activity Detection

▪ Results

Voice Activity Detection (VAD)

Some background

Voice Activity Detection (VAD)

▪ Many applications - speech and speaker recognition, speech enhancement, dominant speaker identification, hearing-improvement devices, etc.

Voice Activity Detection (VAD)

▪ A preliminary block for other speech-related applications

Traditional Methods

▪ Simple acoustic features (e.g., zero-crossings) and model-based methods (e.g., GMM)
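To make the "simple acoustic features" idea concrete, here is a minimal sketch of a frame-level detector built on short-time energy and zero-crossing rate. The frame length, hop size, and both thresholds are illustrative assumptions (25 ms frames and a 10 ms hop at 16 kHz), not values from the talk.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames; leftover tail samples are dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    signs = np.sign(frames)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def simple_vad(x, frame_len=400, hop=160, energy_thresh_db=-30.0, zcr_max=0.25):
    """Mark a frame as speech when it is loud enough and not noise-like.

    All thresholds are hypothetical and would need tuning on real data.
    """
    frames = frame_signal(x, frame_len, hop)
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    zcr = zero_crossing_rate(frames)
    return (energy_db > energy_thresh_db) & (zcr < zcr_max)
```

A fixed energy threshold of this kind is the sort of rule that tends to break down once noise or transients are added, which is the limitation the following slides discuss.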

Traditional Methods

Performance deteriorates in the presence of noise

Cannot model highly non-stationary noise (transients)

[Figure: panels labeled "-3 dB threshold" and "-4 dB threshold"]

Deep NN

▪ Deep learning to the rescue!

Deep NN

▪ But wait… speech is a time series, so why should we treat it as a discrete classification problem?

Multimodal

▪ Are there any other sensors we could use?

▪ Video is especially useful in challenging acoustic environments

Deep Multimodal Architectures for Voice Activity Detection

Problem Setting

▪ A multimodal setting: audio and video signals are both available.

Problem Setting

▪ Stationary background noise and transients (metronome, keyboard typing, hammering) are added to the clean signal

▪ 11 speakers, each recording 120 seconds long
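As a rough illustration of how such contaminated recordings could be built, the sketch below scales a background-noise clip to a target SNR and overlays a transient at chosen onsets. The function names, the looping of short noise clips, and the transient gain are assumptions for illustration; the talk does not specify how the mixing was done.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` so the clean-to-noise power ratio equals snr_db."""
    noise = np.resize(noise, clean.shape)  # repeat or trim the noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

def add_transient(signal, transient, onsets, gain=1.0):
    """Overlay a short transient (e.g. one hammer hit) at each onset sample index."""
    out = signal.copy()
    for start in onsets:
        end = min(start + len(transient), len(out))
        out[start:end] += gain * transient[: end - start]
    return out
```

For example, `add_transient(mix_at_snr(speech, colored_noise, 5.0), hammer_hit, onsets)` would produce a condition similar to the "5 dB SNR plus hammering" case reported in the results, with `speech`, `colored_noise`, `hammer_hit`, and `onsets` supplied by the user.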

[Figure: speech and transient signals]

Deep architecture for VAD

Feature Extraction

▪ Audio Features - MFCC (Mel-frequency cepstral coefficients)

▪ Video Features - motion vectors (MV)

▪ MVs capture both spatial and temporal information

Transient Reducing AE

▪ A dedicated autoencoder (AE) is designed both to fuse the audio and video signals and to reduce the effect of noise and transients

Recurrent Neural Network

▪ The transient reducing AE is followed by a multilayered RNN

▪ The length of the temporal window is learned instead of being arbitrarily predetermined.

▪ A sigmoid on the RNN output produces a probability of speech presence in each frame 𝑛
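The per-frame decision described above can be sketched as a forward pass through a stacked recurrent network followed by a sigmoid. This is a minimal sketch only: it assumes plain tanh recurrent cells, random untrained weights, and hypothetical layer sizes, none of which are stated in the talk.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(feat_dim, hidden, n_layers, seed=0):
    """Random weights for a stack of tanh recurrent layers plus a scalar output."""
    rng = np.random.default_rng(seed)
    layers, in_dim = [], feat_dim
    for _ in range(n_layers):
        layers.append((rng.normal(0, 0.1, (hidden, in_dim)),   # input-to-hidden
                       rng.normal(0, 0.1, (hidden, hidden)),   # hidden-to-hidden
                       np.zeros(hidden)))                      # bias
        in_dim = hidden
    return {"layers": layers, "w_out": rng.normal(0, 0.1, hidden), "b_out": 0.0}

def rnn_vad_forward(features, params):
    """features: (n_frames, feat_dim). Returns one speech probability per frame."""
    h = [np.zeros(W_hh.shape[0]) for (_, W_hh, _) in params["layers"]]
    probs = np.empty(len(features))
    for n, x in enumerate(features):
        inp = x
        for l, (W_xh, W_hh, b) in enumerate(params["layers"]):
            # Each layer mixes the current input with its own previous state.
            h[l] = np.tanh(W_xh @ inp + W_hh @ h[l] + b)
            inp = h[l]
        probs[n] = sigmoid(params["w_out"] @ inp + params["b_out"])
    return probs
```

Because the hidden state carries information forward from all earlier frames, no fixed context window has to be chosen in advance; how far back the network effectively looks depends on the learned recurrent weights.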

Experimental Results

Comparison of our method to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al.

Our method produces fewer false alarms

Experimental Results

Comparison to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al. and “Robust Audio-Visual Speech Recognition Using Audio-Visual Voice Activity Detection” by Tamura et al.

Colored noise with 5 dB SNR and hammering transient

Experimental Results

Comparison to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al. and “Robust Audio-Visual Speech Recognition Using Audio-Visual Voice Activity Detection” by Tamura et al.

Babble noise with 10 dB SNR and keyboard transient

That’s nice, but still not end-to-end…

End-to-End VAD

Video Feature Extraction

▪ Residual networks

Audio Feature Extraction

▪ A WaveNet encoder

▪ Stacked residual blocks of dilated convolutions

▪ Captures long-range temporal dependencies
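The sketch below shows how stacking dilated causal convolutions with residual connections widens the receptive field. It is a simplified illustration, not the WaveNet encoder itself: it assumes a single channel, a plain tanh nonlinearity instead of WaveNet's gated activation, and hypothetical filter values.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """y[t] = sum_k w[k] * x[t - k*dilation], treating samples before t=0 as zero."""
    y = np.zeros(len(x))
    for k, wk in enumerate(w):
        shift = k * dilation
        if shift == 0:
            y += wk * x
        else:
            y[shift:] += wk * x[:-shift]
    return y

def residual_stack(x, filters, dilations):
    """Apply one residual block per (filter, dilation) pair."""
    h = x
    for w, d in zip(filters, dilations):
        h = h + np.tanh(causal_dilated_conv(h, w, d))  # residual connection
    return h

def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence a single output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

With kernel size 2 and dilations 1, 2, 4, 8, four layers already cover 16 input samples, whereas four undilated layers of the same size would cover only 5. Doubling the dilation at each layer is what lets a modest number of layers capture long-range temporal dependencies in raw audio.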

Audio Feature Extraction

▪ Dilated convolutions

Audio Feature Extraction

Feature Fusion - MCB

▪ Feature vector fusion

[Figure: feature vectors labeled with dimension 2048]

Feature Fusion - MCB

▪ The best of all worlds – MCB (multimodal compact bilinear pooling)

▪ Approximated by projecting the joint outer product to a lower-dimensional space using a count sketch function

Whatever we choose..

Feature Fusion - MCB

Feature Fusion - MCB

▪ Can easily be extended to more than two modalities

▪ The size of the joint vector can be chosen freely

▪ MCB output size is set to 1024
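The count-sketch trick behind MCB can be sketched in a few lines: each feature vector is hashed into a shorter signed sketch, and the sketches are combined by circular convolution, computed as an element-wise product in the FFT domain. This is an illustrative sketch with random, fixed hash functions; the helper names are hypothetical, and treating both inputs as 2048-dimensional is an assumption based on the figure labels.

```python
import numpy as np

def make_sketch(in_dim, out_dim, rng):
    """Draw a fixed random bucket and sign for every input index."""
    h = rng.integers(0, out_dim, size=in_dim)
    s = rng.choice([-1.0, 1.0], size=in_dim)
    return h, s

def count_sketch(v, h, s, out_dim):
    """Project v to out_dim by summing signed entries that share a bucket."""
    out = np.zeros(out_dim)
    np.add.at(out, h, s * v)
    return out

def mcb(vectors, sketches, out_dim):
    """Compact bilinear pooling of any number of vectors.

    Multiplying the spectra of the sketches equals circularly convolving them,
    which in turn equals a count sketch of the vectors' outer product.
    """
    spectrum = np.ones(out_dim // 2 + 1, dtype=complex)
    for v, (h, s) in zip(vectors, sketches):
        spectrum *= np.fft.rfft(count_sketch(v, h, s, out_dim))
    return np.fft.irfft(spectrum, n=out_dim)

# Example: fuse two 2048-dimensional vectors into one 1024-dimensional vector.
rng = np.random.default_rng(0)
audio_feat, video_feat = rng.standard_normal(2048), rng.standard_normal(2048)
sketches = [make_sketch(2048, 1024, rng), make_sketch(2048, 1024, rng)]
joint = mcb([audio_feat, video_feat], sketches, out_dim=1024)
```

The explicit outer product of two 2048-dimensional vectors would have about 4.2 million entries, while the sketch keeps the joint vector at the chosen 1024. Adding a third modality only requires one more entry in `vectors` and `sketches`.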

Dataset

▪ A more challenging dataset than in our previous work: each sample of the evaluation set contains a different mixture of background noise, transient, and SNR

▪ Training set: noise is added anew at every iteration

▪ Evaluation set: noise is added once, at initialization
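One way to realize this split between training and evaluation is sketched below: the training set draws a fresh noise clip and SNR on every access, while the evaluation set is corrupted once in the constructor and then stays fixed. The class name, the candidate SNR list, and the omission of transients are simplifying assumptions, not details from the talk.

```python
import numpy as np

class NoisyDataset:
    """Wraps clean signals and corrupts them with randomly chosen noise and SNR."""

    def __init__(self, clean_signals, noises, snr_choices_db, train, seed=0):
        self.clean = clean_signals
        self.noises = noises
        self.snrs = snr_choices_db
        self.train = train
        self.rng = np.random.default_rng(seed)
        # Evaluation: corrupt once so every epoch is scored on identical data.
        self.fixed = None if train else [self._corrupt(c) for c in clean_signals]

    def _corrupt(self, clean):
        noise = self.noises[self.rng.integers(len(self.noises))]
        snr_db = self.snrs[self.rng.integers(len(self.snrs))]
        noise = np.resize(noise, clean.shape)  # repeat or trim to match length
        gain = np.sqrt(np.mean(clean ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
        return clean + gain * noise

    def __len__(self):
        return len(self.clean)

    def __getitem__(self, i):
        # Training: a new random corruption on every access.
        return self._corrupt(self.clean[i]) if self.train else self.fixed[i]
```

Re-noising the training data on every iteration acts as data augmentation, since the network rarely sees the same noisy example twice, while the fixed evaluation set keeps scores comparable across epochs and across methods.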

Experimental Results

Comparison of our method to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al. and our previous work

Experimental Results

A comparison of our four architectures: with MCB or concatenation, and with shared or joint LSTM

Shared LSTM + MCB is best

Discussion

▪ Features are learned from raw data

▪ Fusing the modalities via an MCB module explores higher-order relations between the two modalities

▪ Can be applied to other domains (e.g., ECG)

Questions?

Thank you.