Computational models of human visual attention driven by auditory cues
Copyright©2014 NTT corp. All Rights Reserved.
Computational models of human visual attention driven by auditory cues
Akisato Kimura, Ph.D
NTT Communication Science Laboratories
(Most of the content presented in this talk is based on collaborative research with the National Institute of Informatics, Japan.)
Visual attention
Visual attention is a built-in mechanism of the human visual system for scene understanding.
http://www.tobii.com/eye-tracking-research/global/library/white-papers/tobii-eye-tracking-white-paper/
Simulating visual attention is essential
Such a pre-selection mechanism is essential for enabling computers to undertake tasks such as:
• HCI [http://www.icub.org]
• Visual assistance [https://www.google.com/glass]
• Object detection [Donoser et al. 09]
Saliency as a measure of attention
Saliency = how strongly a region attracts visual attention
• Simple, easy to implement, reasonable outputs
[Figure: input image and its saliency map [Itti et al. 98], estimating the human visual focus of attention (low → high)]
Related work
Visual saliency
• Saliency map model [Itti 1998]
• Shannon self-information [Bruce 2005]
• Incorporating temporal dynamics [Itti 2009]
[Figure: saliency maps computed from an input image, [Itti et al. 98] and [Bruce et al. 05]]
Visual attention modulated by audio
Sounds are strongly related to events that draw human visual attention.
[Figure: gaze distributions without vs. with audio while a person is speaking [Song et al. 11]]
Related work
Visual saliency
• Saliency map model [Itti 1998]
• Shannon self-information [Bruce 2005]
• Incorporating temporal dynamics [Itti 2009]
Auditory saliency
• Center-surround mechanism [Kayser 2005]
• Bayesian surprise [Schauerte 2013]
Audio-visual saliency
• Multi-modal saliency for robotics []
• Sound source localization [Nakajima 2013]
[Figures: input image [Itti et al. 98] [Bruce et al. 05]; audio spectrogram [Kayser et al. 05]; input video [Itti et al. 03] [Nakajima et al. 13]]
Work on human visual attention models aided by auditory information is still underway.
Main content of this talk
Our recent challenges to simulate human visual attention driven by auditory cues
• Auditory information plays a supportive role, in contrast to standard multi-modal fusion approaches
• Our strategy is built on two psychophysical findings
1. Audio-visual temporal alignment yields benefits when changes are both synchronized and transient [Van der Burg et al., PLoS ONE 2010]
2. Auditory attention modulates visual attention in a feature-specific manner [Ahveninen et al., PLoS ONE 2012]
Our strategy
1. Audio-visual temporal alignment yields benefits when changes are both synchronized and transient [Van der Burg et al., PLoS ONE 2010]
2. Auditory attention modulates visual attention in a feature-specific manner [Ahveninen et al., PLoS ONE 2012]
Following those findings…
1. Detect transient events in the visual and auditory domains separately
2. Look for visual features synchronized with the detected auditory events
3. Modulate saliency maps via feature selection
Previous method – Bayesian surprise
[Diagram: input video → image signal → Bayesian surprise → visual saliency (from visual features only) → conventional saliency map]
Our strategy
[Diagram: input video → image signal and audio signal → Bayesian surprise (visual) and auditory surprise → selecting visual features synchronized with the auditory events → modulating saliency maps with the selected features → proposed saliency map (visual saliency from selected visual features)]
Bayesian surprise
[Diagram: input video → image signal and audio signal → Bayesian surprise / auditory surprise]
Concept of Bayesian surprise
• Continuously similar features → low saliency values
• Unexpected features → high saliency values
Visual Bayesian surprise
72 visual feature maps: Intensity ×6, Color ×12, Orientation ×24, Flicker ×6, Motion ×24, computed over Gaussian pyramid scales.
The feature maps serve as observations: a Bayes update turns the prior into a posterior, and the Kullback-Leibler divergence between posterior and prior gives the surprise for each of the 72 feature maps.
[Figure: input video and the resulting visual surprise map (low → high)]
[Itti, Vision Research 2009]
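The prior-to-posterior update and KL step can be sketched for a single feature value. This is a minimal illustration using a conjugate Gaussian model with an assumed observation variance; the actual model in [Itti, Vision Research 2009] uses Poisson/Gamma statistics, and the names `gaussian_surprise` and `obs_var` are hypothetical:

```python
import numpy as np

def gaussian_surprise(prior_mu, prior_var, x, obs_var=1.0):
    """Bayesian surprise = KL(posterior || prior) after observing x,
    with a Gaussian prior over the mean and known observation variance."""
    # Conjugate Gaussian update of the belief about the mean
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mu = post_var * (prior_mu / prior_var + x / obs_var)
    # Closed-form KL divergence between two univariate Gaussians
    kl = (0.5 * np.log(prior_var / post_var)
          + (post_var + (post_mu - prior_mu) ** 2) / (2.0 * prior_var)
          - 0.5)
    return kl, post_mu, post_var
```

An observation close to the prior mean barely moves the posterior and yields low surprise, while an unexpected observation shifts it strongly and yields high surprise, matching the concept on the previous slide.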
Auditory Bayesian surprise
Spectrograms serve as observations: for each frequency 𝜔, the surprise is the KL divergence between the prior and the posterior after observing the spectrogram value at 𝜔, and the per-frequency surprises are averaged over frequencies.
[Figure: audio signal → spectrogram → auditory surprise (low → high)]
[Schauerte, ICASSP 2013]
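The average-over-frequencies idea can be sketched as a single pass over spectrogram frames. The per-bin Gaussian model here is an illustrative assumption rather than the exact formulation of [Schauerte, ICASSP 2013], and `auditory_surprise` is a hypothetical name:

```python
import numpy as np

def auditory_surprise(spectrogram, obs_var=1.0, prior_var=1.0):
    """Per-frame auditory surprise: Bayesian surprise per frequency bin,
    averaged over bins. spectrogram has shape (n_freq, n_frames)."""
    n_freq, n_frames = spectrogram.shape
    mu = spectrogram[:, 0].astype(float)          # initial prior means
    var = np.full(n_freq, prior_var, dtype=float)
    surprise = np.zeros(n_frames)
    for t in range(1, n_frames):
        x = spectrogram[:, t]
        # Conjugate Gaussian update per frequency bin
        post_var = 1.0 / (1.0 / var + 1.0 / obs_var)
        post_mu = post_var * (mu / var + x / obs_var)
        # KL(posterior || prior) per bin, then averaged over frequencies
        kl = (0.5 * np.log(var / post_var)
              + (post_var + (post_mu - mu) ** 2) / (2.0 * var) - 0.5)
        surprise[t] = kl.mean()
        mu, var = post_mu, post_var               # posterior becomes next prior
    return surprise
```

A sudden burst in the spectrogram produces a sharp peak in the surprise signal, which is what the method treats as an auditory event.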
Audio-visual synchronization
[Diagram: Bayesian surprise and auditory surprise → selecting visual features synchronized with the audio → visual saliency (from selected visual features)]
Correlation-based detection
For each of the 360 features, the visual surprise is averaged over pixels and correlated with the auditory surprise within a window around each detected auditory event; the window width depends on the length of the auditory event, and the correlation is thresholded at 𝜃𝑠.
[Figure: visual surprise per feature 𝑓 and auditory surprise over time]
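The correlation-and-threshold step might look like the following sketch. The symmetric window around the event and the handling of constant signals are simplifying assumptions, and all names (`synchronized_features`, `theta_s`) are hypothetical:

```python
import numpy as np

def synchronized_features(visual_surprise, auditory_surprise,
                          event_t, win, theta_s=0.5):
    """Flag visual feature channels whose (pixel-averaged) surprise
    correlates with the auditory surprise around an auditory event.
    visual_surprise: (n_features, T); auditory_surprise: (T,)."""
    lo = max(0, event_t - win)
    hi = min(visual_surprise.shape[1], event_t + win + 1)
    a = auditory_surprise[lo:hi]
    flags = np.zeros(visual_surprise.shape[0], dtype=bool)
    for f in range(visual_surprise.shape[0]):
        v = visual_surprise[f, lo:hi]
        if v.std() > 0 and a.std() > 0:           # correlation undefined otherwise
            r = np.corrcoef(v, a)[0, 1]
            flags[f] = r >= theta_s               # binarize with threshold theta_s
    return flags
```

A feature whose surprise trace mirrors the auditory burst is flagged as synchronized; one that moves independently of (or against) the audio is not.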
Visual feature selection
[Diagram: Bayesian surprise / auditory surprise → selecting visual features synchronized with the audio → modulating saliency maps with the selected features → proposed saliency map (visual saliency from selected visual features)]
Selecting visual features
The per-event correlations are binarized over time and feature type, and each feature's frequency of "synchronization" is tallied by voting with a threshold 𝜃𝑐, reducing the 360 feature types to 𝑁 < 360 selected types. The final saliency map emphasizes the selected features by summing up only those features.
[Figure: binarized synchronization over time and feature type, and the final saliency map]
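The voting and summation step can be sketched as follows; the fallback to all features when nothing passes the vote is my own assumption, not stated in the slides, and the names are hypothetical:

```python
import numpy as np

def modulated_saliency(feature_maps, sync_flags_per_event, theta_c=2):
    """Select features flagged as synchronized in at least theta_c auditory
    events, then form the final saliency map by summing only those maps.
    feature_maps: (n_features, H, W); sync_flags_per_event: (n_events, n_features)."""
    votes = np.asarray(sync_flags_per_event, dtype=int).sum(axis=0)
    selected = votes >= theta_c
    if not selected.any():                        # assumption: fall back to all
        selected = np.ones(feature_maps.shape[0], dtype=bool)
    return feature_maps[selected].sum(axis=0), selected
```

Summing only the selected maps is what "emphasizing selected features" amounts to here: channels never synchronized with the audio simply drop out of the final map.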
Experimental setup
Ground truth: scan-paths recorded with a Tobii TX300 eye tracker
• 15 subjects
• 6 videos (from the DIEM project)
Evaluation criterion
• Normalized Scanpath Saliency (NSS) [Peters 2009]
Baselines
• Saliency map model [Itti 2003], Bayesian surprise [Itti 2009], sound source localization [Nakajima 2013]
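NSS itself is simple to compute: z-score the saliency map, then average the normalized values at the recorded human fixation points. A minimal sketch, with fixations given as (row, col) pairs:

```python
import numpy as np

def nss(saliency_map, fixations):
    """Normalized Scanpath Saliency: mean of the z-scored saliency map
    sampled at human fixation locations. Higher = better agreement."""
    s = np.asarray(saliency_map, dtype=float)
    z = (s - s.mean()) / (s.std() + 1e-12)        # zero mean, unit variance
    rows, cols = zip(*fixations)
    return float(z[list(rows), list(cols)].mean())
```

An NSS above zero means fixations land on above-average saliency; chance performance is zero by construction.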
Experimental results – summary
The proposed model produced the best NSS scores for all the videos.
Qualitative evaluation – Video 2
[Figure: input frame, baseline saliency, auditory surprise, and proposed saliency for Video 2]
Detailed evaluation – Video 1
[Figure: NSS per frame for the proposed model vs. the baseline, with the auditory surprise and detected auditory events marked]
Selected visual features:

| Feature  | Intensity | Color | Orientation | Flicker | Motion | Total |
|----------|-----------|-------|-------------|---------|--------|-------|
| Baseline | 30        | 60    | 120         | 30      | 120    | 360   |
| Proposed | 8         | 17    | 46          | 0       | 0      | 71    |
The proposed model outperformed the baseline in many frames
Some extensions
Drawbacks of the proposed method
• 2-pass algorithm: the whole video must be scanned first to detect synchronization.
Recent updates
• Sequential estimation of visual & auditory surprise via exponential smoothing
| Method         | Video 1 | Video 2 | Video 3 | Video 4 | Video 5 | Video 6 |
|----------------|---------|---------|---------|---------|---------|---------|
| Itti 2009      | 2.896   | 1.816   | 0.790   | 1.209   | 0.318   | 0.513   |
| Nakajima 2013  | 1.857   | 0.992   | 0.540   | 1.073   | 0.368   | 0.216   |
| Proposed (new) | 3.077   | 1.820   | 0.791   | 1.273   | 0.318   | 0.513   |
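A sequential, single-pass surprise estimate via exponential smoothing might look like this sketch. The smoothing factor and the squared standardized deviation score are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def online_surprise(stream, alpha=0.1):
    """One-pass surprise via exponential smoothing: maintain a running
    mean and variance per feature, score each frame by its mean squared
    standardized deviation, then update the statistics in place."""
    mu = stream[0].astype(float)
    var = np.ones_like(mu)
    out = np.zeros(len(stream))
    for t, x in enumerate(stream[1:], start=1):
        out[t] = float(np.mean((x - mu) ** 2 / (var + 1e-12)))
        mu = (1 - alpha) * mu + alpha * x                 # exponential smoothing
        var = (1 - alpha) * var + alpha * (x - mu) ** 2
    return out
```

Because the statistics are updated frame by frame, no second pass over the video is needed, which is the point of this extension.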
Conclusion
Our recent challenges to simulate human visual attention driven by auditory cues
• Auditory information plays a supportive role
• Our model is built on recent psychophysical findings
Work on human visual attention models aided by auditory information is still underway.
• Auditory attention models
• Auditory cues other than synchronization
Reference
• Kimura, Yonetani, Hirayama “Computational models of human visual attention and their implementations: A survey,” IEICE Transactions on Information and Systems, Vol.E96-D, No.3, 2013.
• Nakajima, Sugimoto, Kawamoto “Incorporating audio signals into constructing a visual saliency map,” Proc. Pacific-Rim Symposium on Image and Video Technology (PSIVT2013).
• Nakajima, Kimura, Sugimoto, Kashino “Visual attention driven by auditory cues: Selecting visual features in synchronization with attracting auditory events,” Proc. International Conference on Multimedia Modeling (MMM2015).
• Nakajima, Kimura, Sugimoto, Kashino “An online computational model of human visual attention considering spatio-temporal synchronization with auditory events,” IPSJ Technical Report, CVIM195-57, 2015 (in Japanese).