Deep Learning for Computer Vision (vllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w15.pdf)
Deep Learning for Computer Vision, Spring 2019
http://vllab.ee.ntu.edu.tw/dlcv.html (primary)
https://ceiba.ntu.edu.tw/1072CommE5052 (grade, etc.)
FB: DLCV Spring 2019
Yu-Chiang Frank Wang 王鈺強, Associate Professor
Dept. Electrical Engineering, National Taiwan University
2019/06/05
What’s to Be Covered Today…
• Guest Lecture
  • Title: NTUEE Alumni Q&A Series: Sharing Experiences of Studying, Research, and Internships in the CV Field
  • Speaker: Dr. Wei-Sheng Lai 賴威昇 (B97), Univ. California, Merced
  • Time/Location: 10am @ BL113 (i.e., the 2nd class)
What’s to Be Covered Today…
• VIP Talk
  • Title: Face Recognition & Anti-Spoofing for Identity Authentication
  • Speaker: Dr. Shang-Hong Lai, Principal Researcher, Microsoft AI R&D Center
  • Time/Location: 11am @ BL113 (i.e., the 3rd class)
What’s to be Covered …
• Learning Beyond Images (Part II)
  • Audio-Visual Event Localization
  • Spatial Audio Generation
  • Decomposing Sounds of Visual Objects
• About Final Presentation
  • Date/time: 6/25 Tue 1:30pm-5pm
  • Remarks
Visual vs. Audio-Visual Event Localization
• Recognizing video event categories
• Visual vs. audio-visual features

Visual event localization:
Frame:    1 | 2 | 3 | 4 | 5
Visual:   walk | run | run | jump | jump

Audio-visual event localization:
Frame:    1 | 2 | 3 | 4 | 5
Visual:   dog | dog bark | talking | cat | background
Audio:    talking | dog bark | dog bark | background | background
AV Event: background | dog bark | background | background | background
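The frame labels in the example above suggest a simple reading: a frame carries an audio-visual event only when the visual and audio streams report the same non-background event. The sketch below encodes that reading as an assumption; the function name `av_events` is hypothetical and this is not the labeling procedure of any particular paper.

```python
BACKGROUND = "background"

def av_events(visual, audio):
    """Keep an event label only where both modalities agree on it."""
    if len(visual) != len(audio):
        raise ValueError("visual and audio sequences must have equal length")
    return [
        v if v == a and v != BACKGROUND else BACKGROUND
        for v, a in zip(visual, audio)
    ]

# Per-frame labels taken from the slide example (frames 1 to 5).
visual = ["dog", "dog bark", "talking", "cat", "background"]
audio = ["talking", "dog bark", "dog bark", "background", "background"]

result = av_events(visual, audio)
print(result)
# ['background', 'dog bark', 'background', 'background', 'background']
```

Only frame 2 is both visible and audible as "dog bark", which reproduces the AV Event row of the example.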
Audio-Visual Event Localization
• Goal
  • Identify event/activity labels across video frames by jointly observing visual and audio features in the input video.
• References
  • Audio-visual event localization in unconstrained videos, ECCV’18
  • Dual-modality seq2seq network for audio-visual event localization, ICASSP’19
Audio-Visual Event Localization
• Demo video
Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu. Audio-visual event localization in unconstrained videos. ECCV 2018
Audio-Visual Event Localization
• Audio-visual event localization in unconstrained videos, ECCV’18
• Network Architecture
[Figure: network architecture. Inputs: video frames and audio segments. Component highlighted: audio-based visual attention (audio localization).]
Audio-Visual Event Localization
• Dual-modality seq2seq network for audio-visual event localization, ICASSP’19
- Encoder
Input: image and audio segments ($t$ segments)
Output: audio features $a_1, \dots, a_t$ and visual features $v_1, \dots, v_t$
- Fusion
Input: the last hidden and cell states from the audio and visual modalities, respectively
Output: fused states $h_f, c_f$
- Decoder
Input: fused states $h_f, c_f$, audio features $a$, and visual features $v$
Output: event categories $y_1, \dots, y_t$
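The encoder, fusion, and decoder roles listed on this slide can be traced with a small pure-Python sketch. Everything specific here is an assumption made for illustration: the features are random numbers rather than CNN outputs, the LSTM weights are random and bias-free, the fusion is a plain element-wise mean, and all function names are hypothetical. It shows the data flow only, not the architecture or fusion rule of the ICASSP’19 paper.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm(input_dim, hidden_dim, rng):
    """Random weights for the four LSTM gates (i, f, g, o), no biases."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
                for _ in range(hidden_dim)]
    return {gate: mat() for gate in "ifgo"}

def lstm_step(params, x, h, c):
    z = x + h  # concatenate the input with the previous hidden state
    def gate(name, act):
        return [act(sum(w * v for w, v in zip(row, z))) for row in params[name]]
    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g = gate("g", math.tanh)
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new

def encode(params, seq, hidden_dim):
    """Run one modality through its LSTM; return the last hidden and cell states."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    for x in seq:
        h, c = lstm_step(params, x, h, c)
    return h, c

def fuse(state_a, state_v):
    """Assumed fusion rule: element-wise mean of the two modality states."""
    return [(a + v) / 2 for a, v in zip(state_a, state_v)]

def decode(params, w_out, audio, visual, h_f, c_f):
    """Start from the fused states and emit one event label per segment."""
    h, c = h_f, c_f
    labels = []
    for a, v in zip(audio, visual):
        h, c = lstm_step(params, a + v, h, c)
        scores = [sum(w * hj for w, hj in zip(row, h)) for row in w_out]
        labels.append(scores.index(max(scores)))
    return labels

rng = random.Random(0)
T, A, V, H, K = 5, 4, 6, 8, 3  # segments, audio dim, visual dim, hidden dim, classes
audio = [[rng.random() for _ in range(A)] for _ in range(T)]
visual = [[rng.random() for _ in range(V)] for _ in range(T)]
enc_a, enc_v = make_lstm(A, H, rng), make_lstm(V, H, rng)
dec = make_lstm(A + V, H, rng)
w_out = [[rng.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(K)]

h_a, c_a = encode(enc_a, audio, H)
h_v, c_v = encode(enc_v, visual, H)
h_f, c_f = fuse(h_a, h_v), fuse(c_a, c_v)
labels = decode(dec, w_out, audio, visual, h_f, c_f)
print(labels)  # one class index per segment
```

With untrained weights the predicted labels are meaningless; the point is that the decoder sees both the fused summary of the whole clip and the per-segment features.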
Audio-Visual Event Localization
• Dual-modality seq2seq network for audio-visual event localization, ICASSP’19
• Evaluation
  • Audio-Visual Event (AVE) Dataset (ECCV’18):
    The AVE dataset contains 4143 videos in 28 categories, and each video is labeled with audio-visual events at one-second intervals. The events span a wide range of domains (e.g., church bell, dog barking, truck, bus, clock, violin, etc.).
Audio-Visual Event Localization
• Dual-modality seq2seq network for audio-visual event localization, ICASSP’19
• Evaluation
  • Metric: frame-wise accuracy
    • Percentage of correctly predicted labels over all test frames.
    • Can be reported in the fully supervised setting (every frame label is used during training) or the weakly supervised setting (only the averaged, video-level labels are used during training).
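As a minimal sketch of the metric described above, frame-wise accuracy can be computed by pooling all frames of all test videos and counting matches. The function name and the toy labels are illustrative assumptions.

```python
def framewise_accuracy(predicted, ground_truth):
    """Percentage of frames whose predicted label equals the ground truth.

    Both arguments are lists of per-video label sequences.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("number of videos differs")
    correct = 0
    total = 0
    for pred_video, true_video in zip(predicted, ground_truth):
        if len(pred_video) != len(true_video):
            raise ValueError("a video has mismatched sequence lengths")
        correct += sum(p == t for p, t in zip(pred_video, true_video))
        total += len(true_video)
    if total == 0:
        raise ValueError("no frames to evaluate")
    return 100.0 * correct / total

# Two toy videos with three one-second segments each (4 of 6 frames correct).
predicted = [["background", "dog bark", "dog bark"],
             ["violin", "violin", "background"]]
ground_truth = [["background", "dog bark", "background"],
                ["violin", "violin", "violin"]]

accuracy = framewise_accuracy(predicted, ground_truth)
print(f"{accuracy:.2f}%")
```

The supervision setting changes only how the model is trained; the metric is computed per frame in both cases.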
Spatial Audio Generation
• Goal
  • The spatial information in audio is closely related to the visual scene.
  • Given single-channel audio, we would like to generate spatial audio by observing the visual data.
• References
  • 2.5D Visual Sound, CVPR’19
  • Self-Supervised Audio Spatialization with Correspondence Classifier, ICIP’19
  • Self-Supervised Generation of Spatial Audio for 360 Video, NeurIPS’18
Spatial Audio Generation
• 2.5D Visual Sound, CVPR 2019
• Network architecture
[Figure: network architecture. Annotation: predict difference masks only.]
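Predicting only the difference is sufficient because the two output channels follow from the mono input and the left-right difference. The sketch below assumes the mono signal is the sum of the two channels and works on raw samples for brevity; the network itself is described as predicting masks, which is not modeled here, and the function names are hypothetical.

```python
def to_mono_and_difference(left, right):
    """Mono mix (L + R) and channel difference (L - R), sample by sample."""
    mono = [l + r for l, r in zip(left, right)]
    diff = [l - r for l, r in zip(left, right)]
    return mono, diff

def recover_stereo(mono, predicted_diff):
    """Invert the sum/difference pair: L = (M + D) / 2, R = (M - D) / 2."""
    left = [(m + d) / 2 for m, d in zip(mono, predicted_diff)]
    right = [(m - d) / 2 for m, d in zip(mono, predicted_diff)]
    return left, right

left = [0.5, -0.25, 1.0]
right = [0.25, 0.75, -1.0]

mono, diff = to_mono_and_difference(left, right)
# Stand-in for a perfect prediction: reuse the true difference signal.
rec_left, rec_right = recover_stereo(mono, diff)
print(rec_left, rec_right)
```

With a perfect difference prediction the original channels are recovered exactly, so any error in the output comes from the predicted difference alone.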
Spatial Audio Generation
• 2.5D Visual Sound, CVPR 2019
• Evaluation
Note: only the FAIR-Play dataset was collected as 2D video. The remaining datasets consist of 360 videos, whose spatial audio is converted to 2D audio by pretrained audio decoders.
Spatial Audio Generation
• 2.5D Visual Sound, CVPR 2019
Spatial Audio Generation
• Self-Supervised Audio Spatialization with Correspondence Classifier, ICIP 2019
• Network architecture
Spatial Audio Generation
• Self-Supervised Generation of Spatial Audio for 360 Video, NeurIPS’18
• Network architecture
[Figure: network architecture, with an illustration of the format of 360 audio.]
Spatial Audio Generation
• Self-Supervised Generation of Spatial Audio for 360 Video, NeurIPS’18
• Evaluation
  • Metric
    • STFT distance: L2 norm of the difference between the ground-truth and predicted complex spectrograms.
    • Envelope distance (ENV): similar to the STFT distance, but computed on signal envelopes obtained with the Hilbert transform.
    • Earth Mover’s Distance (EMD): compares the energy of the ground-truth and predicted sound fields, measured over a small window.
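The first two metrics can be sketched for a single channel with the standard library alone. This is an illustration under stated assumptions: a naive DFT replaces an FFT library, the frame length, hop size, and Hann window are arbitrary choices, and the function names are hypothetical. Exact normalization in published evaluations may differ.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def stft(x, frame_len=8, hop=4):
    """Complex spectrogram: one-sided DFT of each Hann-windowed frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = [x[start + i] * window[i] for i in range(frame_len)]
        frames.append(dft(frame)[: frame_len // 2 + 1])
    return frames

def stft_distance(truth, pred):
    """Euclidean (L2) distance between two complex spectrograms."""
    s_t, s_p = stft(truth), stft(pred)
    return math.sqrt(sum(abs(a - b) ** 2
                         for ft, fp in zip(s_t, s_p) for a, b in zip(ft, fp)))

def envelope(x):
    """Magnitude of the analytic signal (Hilbert transform via the DFT)."""
    n = len(x)
    gain = [0.0] * n
    gain[0] = 1.0                        # keep DC
    if n % 2 == 0:
        gain[n // 2] = 1.0               # keep Nyquist
        for k in range(1, n // 2):
            gain[k] = 2.0                # double positive frequencies
    else:
        for k in range(1, (n + 1) // 2):
            gain[k] = 2.0
    analytic = idft([g * v for g, v in zip(gain, dft(x))])
    return [abs(v) for v in analytic]

def envelope_distance(truth, pred):
    """Euclidean distance between the envelopes of two signals."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(envelope(truth), envelope(pred))))

n = 32
truth = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]
pred = [0.5 * v for v in truth]  # same tone at half the amplitude

print(stft_distance(truth, truth), stft_distance(truth, pred))
print(envelope_distance(truth, pred))
```

For the unit-amplitude cosine the envelope is flat at 1, so halving the amplitude gives an envelope distance of sqrt(32 × 0.25), about 2.83.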
Spatial Audio Generation
• Self-Supervised Generation of Spatial Audio for 360 Video, NeurIPS’18
• Results
Spatial Audio Generation
• Self-Supervised Generation of Spatial Audio for 360 Video, NeurIPS’18
Decomposing Sounds of Visual Objects
• Goal
  • Separate a mixed sound into individual sounds corresponding to the associated objects
  • Can be done in a supervised or (preferably) unsupervised way
• References
  • The Sound of Pixels, ECCV 2018
  • The Sound of Motions, arXiv
  • Co-Separating Sounds of Visual Objects, arXiv
Decomposing Sounds of Visual Objects
• The Sound of Pixels, ECCV 2018
Training pipeline: videos are concatenated as the visual input, and their audio sources are mixed into a single mixed-audio input.
No ground-truth (GT) audio data is needed.
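The training pipeline above is self-supervised because the separation targets come from the mixing step itself: the individual tracks are known before they are mixed. The sketch below illustrates this with tiny magnitude spectrograms and binary masks. Summing magnitudes directly is a simplification of mixing waveforms, and the function names and numbers are illustrative assumptions.

```python
def mix(spectrograms):
    """Element-wise sum of magnitude spectrograms (simplified mixing)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*spectrograms)]

def binary_masks(spectrograms):
    """One mask per source: 1 where that source is the dominant one."""
    masks = []
    for spec in spectrograms:
        masks.append([
            [1 if all(spec[u][v] >= other[u][v] for other in spectrograms) else 0
             for v in range(len(spec[u]))]
            for u in range(len(spec))
        ])
    return masks

def apply_mask(mask, mixture):
    """Estimate one source by masking the mixture spectrogram."""
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, mixture)]

# Two toy 2x2 magnitude spectrograms (frequency x time), one per video.
guitar = [[0.75, 0.25], [1.0, 0.0]]
violin = [[0.25, 0.5], [0.5, 0.75]]

mixture = mix([guitar, violin])
guitar_mask, violin_mask = binary_masks([guitar, violin])
print(mixture)
print(guitar_mask, violin_mask)
print(apply_mask(guitar_mask, mixture))
```

A network would be trained to predict these masks from the mixture and the visual input; here the masks are computed directly to show where the supervision signal comes from.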
Decomposing Sounds of Visual Objects
• The Sound of Pixels, ECCV 2018
[Figure: evaluation pipeline, with mixed audio as input.]
Decomposing Sounds of Visual Objects
• The Sound of Pixels, ECCV 2018
• Evaluation
Decomposing Sounds of Visual Objects
• The Sound of Motions (Zhao, Gan, Ma, & Torralba), arXiv
Decomposing Sounds of Visual Objects
• The Sound of Motions (Zhao, Gan, Ma, & Torralba), arXiv
• Network architecture
Decomposing Sounds of Visual Objects
• The Sound of Motions (Zhao, Gan, Ma, & Torralba), arXiv
• Evaluation
Decomposing Sounds of Visual Objects
• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Disentangles sounds in realistic videos, even in cases where an object was not observed individually during training.
Decomposing Sounds of Visual Objects
• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
• Network architecture
Decomposing Sounds of Visual Objects
• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
• Network architecture
  • Audio-Visual Separator
Decomposing Sounds of Visual Objects
• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
• Network architecture
Decomposing Sounds of Visual Objects
• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
• Evaluation
Decomposing Sounds of Visual Objects
• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
• Demo Video
About Final Challenges & Presentation
• Presentation Date/Time
  • 6/25 Tue 1:30pm-5pm
  • If you cannot participate, you need to let me/TAs/your team members know in advance.
• Final 35% (Bonus up to 1%+3%+3%)
  • Code / Kaggle 10%: Kaggle for references, final accuracy evaluated by TAs
    • Early baseline: bonus 1% (due 6/15 Sat 1am)
    • TA baseline: Public: Weak 5% / Strong 5%; Private: bonus up to 3% (due 6/24 Mon 2:00 am)
  • Approach & Presentation 25%
    • Novelty and technical contribution (10%)
    • Completeness of experiments (10%) (e.g., comparisons to baseline and recent models, ablation studies, visualization, etc.)
    • Presentation (Oral + Poster) 5% + bonus up to 3% (top 3 teams voted by class)
• For both challenges, you need to upload your code to GitHub and provide readme files, so that TAs will be able to reproduce your results!
  • If TAs cannot reproduce your results, 0/20 points will be given unless the errors are minor (i.e., no credits for the approach part).
About Final Challenges & Presentation
• Intra-Group Evaluation
  • Every student needs to rate each member in his/her team (e.g., 1~5).
  • Every student needs to specify the contributions of each member in the team.
  • (Optional) Students can provide additional remarks on the team members if necessary. The comments will not be released to other team members and will be accessible to the instructor and TAs only.