Computational models of human visual attention driven by auditory cues
Copyright©2014 NTT corp. All Rights Reserved.
Computational models of human visual attention driven by auditory cues
Akisato Kimura, Ph.D
NTT Communication Science Laboratories
(Most of the content presented in this talk is based on collaborative research with the National Institute of Informatics, Japan.)
Visual attention
Visual attention is a built-in mechanism of the human visual system for scene understanding.
http://www.tobii.com/eye-tracking-research/global/library/white-papers/tobii-eye-tracking-white-paper/
Simulating visual attention is essential
Such a pre-selection mechanism is essential for enabling computers to undertake tasks such as:
• HCI [http://www.icub.org]
• Visual assistance [https://www.google.com/glass]
• Object detection [Donoser et al. 09]
Saliency as a measure of attention
Saliency = how strongly a region attracts visual attention
• Simple, easy to implement, reasonable outputs
[Figure: input image and its saliency map [Itti et al. 98], estimating the human visual focus of attention (low → high)]
Related work
Visual saliency
• Saliency map model [Itti 1998]
• Shannon self-information [Bruce 2005]
• Incorporating temporal dynamics [Itti 2009]
[Figure: saliency maps computed from an input image, [Itti et al. 98] and [Bruce et al. 05]]
Visual attention modulated by audio
Sounds are strongly related to events that draw human visual attention.
[Figure: gaze distributions without vs. with audio while a person is speaking [Song et al. 11]]
Related work
Visual saliency
• Saliency map model [Itti 1998]
• Shannon self-information [Bruce 2005]
• Incorporating temporal dynamics [Itti 2009]
Auditory saliency
• Center-surround mechanism [Kayser 2005]
• Bayesian surprise [Schauerte 2013]
Audio-visual saliency
• Multi-modal saliency for robotics []
• Sound source localization [Nakajima 2013]
[Figures: input image [Itti et al. 98] [Bruce et al. 05]; audio spectrogram [Kayser et al. 05]; input video [Itti et al. 03] [Nakajima et al. 13]]
Work on human visual attention models aided by auditory information is still underway.
Main content of this talk
Our recent challenges to simulate human visual attention driven by auditory cues
• Auditory information plays a supportive role, in contrast to standard multi-modal fusion approaches
• Our strategy is built on two psychophysical findings
1. Audio-visual temporal alignment yields benefits when changes are both synchronized and transient [Van der Burg et al., PLoS ONE 2010]
2. Auditory attention modulates visual attention in a feature-specific manner [Ahveninen et al., PLoS ONE 2012]
Our strategy
1. Audio-visual temporal alignment yields benefits when changes are both synchronized and transient [Van der Burg et al., PLoS ONE 2010]
2. Auditory attention modulates visual attention in a feature-specific manner [Ahveninen et al., PLoS ONE 2012]
Following those findings…
1. Detect transient events in the visual and auditory domains separately
2. Look for visual features synchronized with the detected auditory events
3. Modulate saliency maps via feature selection
Previous method – Bayesian surprise
[Diagram: input video → image signal → Bayesian surprise → visual saliency (from visual features only) → conventional saliency map]
Our strategy
[Diagram: input video → image signal and audio signal → Bayesian surprise (visual) and auditory surprise → selecting visual features synchronized with the auditory events → modulating saliency maps with the selected features → proposed saliency map (visual saliency from selected visual features)]
Bayesian surprise
[Diagram: input video → image signal and audio signal → Bayesian surprise / auditory surprise]
Concept of Bayesian surprise
• Continuously similar features → low saliency values
• Unexpected features → high saliency values
Visual Bayesian surprise
72 visual feature maps: Intensity ×6, Color ×12, Orientation ×24, Flicker ×6, Motion ×24, computed over Gaussian pyramid scales.
The feature maps serve as observations: a Bayes update turns the prior into a posterior, and the Kullback-Leibler divergence between posterior and prior gives the surprise for each of the 72 feature maps.
[Figure: input video and the resulting visual surprise map (low → high)]
[Itti, Vision Research 2009]
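The prior-to-posterior update and KL step can be sketched for a single feature value. This is a minimal illustration using a conjugate Gaussian model with an assumed observation variance; the actual model in [Itti, Vision Research 2009] uses Poisson/Gamma statistics, and the names `gaussian_surprise` and `obs_var` are hypothetical:

```python
import numpy as np

def gaussian_surprise(prior_mu, prior_var, x, obs_var=1.0):
    """Bayesian surprise = KL(posterior || prior) after observing x,
    with a Gaussian prior over the mean and known observation variance."""
    # Conjugate Gaussian update of the belief about the mean
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mu = post_var * (prior_mu / prior_var + x / obs_var)
    # Closed-form KL divergence between two univariate Gaussians
    kl = (0.5 * np.log(prior_var / post_var)
          + (post_var + (post_mu - prior_mu) ** 2) / (2.0 * prior_var)
          - 0.5)
    return kl, post_mu, post_var
```

An observation close to the prior mean barely moves the posterior and yields low surprise, while an unexpected observation shifts it strongly and yields high surprise, matching the concept on the previous slide.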
Auditory Bayesian surprise
Spectrograms serve as observations: for each frequency 𝜔, the surprise is the KL divergence between the prior and the posterior after observing the spectrogram value at 𝜔, and the per-frequency surprises are averaged over frequencies.
[Figure: audio signal → spectrogram → auditory surprise (low → high)]
[Schauerte, ICASSP 2013]
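The average-over-frequencies idea can be sketched as a single pass over spectrogram frames. The per-bin Gaussian model here is an illustrative assumption rather than the exact formulation of [Schauerte, ICASSP 2013], and `auditory_surprise` is a hypothetical name:

```python
import numpy as np

def auditory_surprise(spectrogram, obs_var=1.0, prior_var=1.0):
    """Per-frame auditory surprise: Bayesian surprise per frequency bin,
    averaged over bins. spectrogram has shape (n_freq, n_frames)."""
    n_freq, n_frames = spectrogram.shape
    mu = spectrogram[:, 0].astype(float)          # initial prior means
    var = np.full(n_freq, prior_var, dtype=float)
    surprise = np.zeros(n_frames)
    for t in range(1, n_frames):
        x = spectrogram[:, t]
        # Conjugate Gaussian update per frequency bin
        post_var = 1.0 / (1.0 / var + 1.0 / obs_var)
        post_mu = post_var * (mu / var + x / obs_var)
        # KL(posterior || prior) per bin, then averaged over frequencies
        kl = (0.5 * np.log(var / post_var)
              + (post_var + (post_mu - mu) ** 2) / (2.0 * var) - 0.5)
        surprise[t] = kl.mean()
        mu, var = post_mu, post_var               # posterior becomes next prior
    return surprise
```

A sudden burst in the spectrogram produces a sharp peak in the surprise signal, which is what the method treats as an auditory event.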
Audio-visual synchronization
[Diagram: Bayesian surprise and auditory surprise → selecting visual features synchronized with the audio → visual saliency (from selected visual features)]
Correlation-based detection
For each of the 360 features, the visual surprise is averaged over pixels and correlated with the auditory surprise within a window around each detected auditory event; the window width depends on the length of the auditory event, and the correlation is thresholded at 𝜃𝑠.
[Figure: visual surprise per feature 𝑓 and auditory surprise over time]
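The correlation-and-threshold step might look like the following sketch. The symmetric window around the event and the handling of constant signals are simplifying assumptions, and all names (`synchronized_features`, `theta_s`) are hypothetical:

```python
import numpy as np

def synchronized_features(visual_surprise, auditory_surprise,
                          event_t, win, theta_s=0.5):
    """Flag visual feature channels whose (pixel-averaged) surprise
    correlates with the auditory surprise around an auditory event.
    visual_surprise: (n_features, T); auditory_surprise: (T,)."""
    lo = max(0, event_t - win)
    hi = min(visual_surprise.shape[1], event_t + win + 1)
    a = auditory_surprise[lo:hi]
    flags = np.zeros(visual_surprise.shape[0], dtype=bool)
    for f in range(visual_surprise.shape[0]):
        v = visual_surprise[f, lo:hi]
        if v.std() > 0 and a.std() > 0:           # correlation undefined otherwise
            r = np.corrcoef(v, a)[0, 1]
            flags[f] = r >= theta_s               # binarize with threshold theta_s
    return flags
```

A feature whose surprise trace mirrors the auditory burst is flagged as synchronized; one that moves independently of (or against) the audio is not.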
Visual feature selection
[Diagram: Bayesian surprise / auditory surprise → selecting visual features synchronized with the audio → modulating saliency maps with the selected features → proposed saliency map (visual saliency from selected visual features)]
Selecting visual features
The per-event correlations are binarized over time and feature type, and each feature's frequency of "synchronization" is tallied by voting with a threshold 𝜃𝑐, reducing the 360 feature types to 𝑁 < 360 selected types. The final saliency map emphasizes the selected features by summing up only those features.
[Figure: binarized synchronization over time and feature type, and the final saliency map]
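The voting and summation step can be sketched as follows; the fallback to all features when nothing passes the vote is my own assumption, not stated in the slides, and the names are hypothetical:

```python
import numpy as np

def modulated_saliency(feature_maps, sync_flags_per_event, theta_c=2):
    """Select features flagged as synchronized in at least theta_c auditory
    events, then form the final saliency map by summing only those maps.
    feature_maps: (n_features, H, W); sync_flags_per_event: (n_events, n_features)."""
    votes = np.asarray(sync_flags_per_event, dtype=int).sum(axis=0)
    selected = votes >= theta_c
    if not selected.any():                        # assumption: fall back to all
        selected = np.ones(feature_maps.shape[0], dtype=bool)
    return feature_maps[selected].sum(axis=0), selected
```

Summing only the selected maps is what "emphasizing selected features" amounts to here: channels never synchronized with the audio simply drop out of the final map.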
Experimental setup
Ground truth: scan-paths recorded with a Tobii TX300 eye tracker
• 15 subjects
• 6 videos (from the DIEM project)
Evaluation criterion
• Normalized Scanpath Saliency (NSS) [Peters 2009]
Baselines
• Saliency map model [Itti 2003], Bayesian surprise [Itti 2009], sound source localization [Nakajima 2013]
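NSS itself is simple to compute: z-score the saliency map, then average the normalized values at the recorded human fixation points. A minimal sketch, with fixations given as (row, col) pairs:

```python
import numpy as np

def nss(saliency_map, fixations):
    """Normalized Scanpath Saliency: mean of the z-scored saliency map
    sampled at human fixation locations. Higher = better agreement."""
    s = np.asarray(saliency_map, dtype=float)
    z = (s - s.mean()) / (s.std() + 1e-12)        # zero mean, unit variance
    rows, cols = zip(*fixations)
    return float(z[list(rows), list(cols)].mean())
```

An NSS above zero means fixations land on above-average saliency; chance performance is zero by construction.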
Experimental results – summary
The proposed model produced the best NSS scores for all the videos.
Qualitative evaluation – Video 2
[Figure: input frame, baseline saliency, auditory surprise, and proposed saliency for Video 2]
Detailed evaluation – Video 1
[Figure: NSS per frame for the proposed model vs. the baseline, with the auditory surprise and detected auditory events marked]
Selected visual features:

| Feature  | Intensity | Color | Orientation | Flicker | Motion | Total |
|----------|-----------|-------|-------------|---------|--------|-------|
| Baseline | 30        | 60    | 120         | 30      | 120    | 360   |
| Proposed | 8         | 17    | 46          | 0       | 0      | 71    |
The proposed model outperformed the baseline in many frames
Some extensions
Drawbacks of the proposed method
• 2-pass algorithm: the whole video must be scanned first to detect synchronization.
Recent updates
• Sequential estimation of visual & auditory surprise via exponential smoothing
| Method         | Video 1 | Video 2 | Video 3 | Video 4 | Video 5 | Video 6 |
|----------------|---------|---------|---------|---------|---------|---------|
| Itti 2009      | 2.896   | 1.816   | 0.790   | 1.209   | 0.318   | 0.513   |
| Nakajima 2013  | 1.857   | 0.992   | 0.540   | 1.073   | 0.368   | 0.216   |
| Proposed (new) | 3.077   | 1.820   | 0.791   | 1.273   | 0.318   | 0.513   |
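A sequential, single-pass surprise estimate via exponential smoothing might look like this sketch. The smoothing factor and the squared standardized deviation score are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def online_surprise(stream, alpha=0.1):
    """One-pass surprise via exponential smoothing: maintain a running
    mean and variance per feature, score each frame by its mean squared
    standardized deviation, then update the statistics in place."""
    mu = stream[0].astype(float)
    var = np.ones_like(mu)
    out = np.zeros(len(stream))
    for t, x in enumerate(stream[1:], start=1):
        out[t] = float(np.mean((x - mu) ** 2 / (var + 1e-12)))
        mu = (1 - alpha) * mu + alpha * x                 # exponential smoothing
        var = (1 - alpha) * var + alpha * (x - mu) ** 2
    return out
```

Because the statistics are updated frame by frame, no second pass over the video is needed, which is the point of this extension.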
Conclusion
Our recent challenges to simulate human visual attention driven by auditory cues
• Auditory information plays a supportive role
• Our model is built on recent psychophysical findings
Work on human visual attention models aided by auditory information is still underway.
• Auditory attention models
• Auditory cues other than synchronization
Reference
• Kimura, Yonetani, Hirayama “Computational models of human visual attention and their implementations: A survey,” IEICE Transactions on Information and Systems, Vol.E96-D, No.3, 2013.
• Nakajima, Sugimoto, Kawamoto “Incorporating audio signals into constructing a visual saliency map,” Proc. Pacific-Rim Symposium on Image and Video Technology (PSIVT2013).
• Nakajima, Kimura, Sugimoto, Kashino “Visual attention driven by auditory cues: Selecting visual features in synchronization with attracting auditory events,” Proc. International Conference on Multimedia Modeling (MMM2015).
• Nakajima, Kimura, Sugimoto, Kashino “An online computational model of human visual attention considering spatio-temporal synchronization with auditory events,” IPSJ Technical Report, CVIM195-57, 2015 (in Japanese).