Deep Learning for Computer Vision · vllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w15.pdf ·...

Deep Learning for Computer Vision Spring 2019 http://vllab.ee.ntu.edu.tw/dlcv.html (primary) https://ceiba.ntu.edu.tw/1072CommE5052 (grade, etc.) FB: DLCV Spring 2019 Yu-Chiang Frank Wang 王鈺強, Associate Professor Dept. Electrical Engineering, National Taiwan University 2019/06/05


Page 1: Deep Learning for Computer Visionvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w15.pdf · 2019-06-04 · What’s to Be Covered Today… • VIP Talk • Title: Face Recognition

Deep Learning for Computer Vision, Spring 2019

http://vllab.ee.ntu.edu.tw/dlcv.html (primary)

https://ceiba.ntu.edu.tw/1072CommE5052 (grade, etc.)

FB: DLCV Spring 2019

Yu-Chiang Frank Wang 王鈺強, Associate Professor

Dept. Electrical Engineering, National Taiwan University

2019/06/05


What’s to Be Covered Today…

• Guest Lecture
  • Title: NTUEE Alumni Q&A Series: Sharing Experiences of Study, Research, and Internships in the CV Field
  • Speaker: Dr. Wei-Sheng Lai 賴威昇 (B97), Univ. California, Merced
  • Time/Location: 10am @ BL113 (i.e., the 2nd class)


What’s to Be Covered Today…

• VIP Talk
  • Title: Face Recognition & Anti-Spoofing for Identity Authentication
  • Speaker: Dr. Shang-Hong Lai, Principal Researcher, Microsoft AI R&D Center
  • Time/Location: 11am @ BL113 (i.e., the 3rd class)


What’s to be Covered …

• Learning Beyond Images (Part II)
  • Audio-Visual Event Localization
  • Spatial Audio Generation
  • Decomposing Sounds of Visual Objects
• About Final Presentation
  • Date/time: 6/25 Tue 1:30pm-5pm
  • Remarks


Visual vs. Audio-Visual Event Localization

• Recognizing video event categories
• Visual vs. audio-visual features

Visual only:

Frame     1      2      3      4      5
Visual    walk   run    run    jump   jump

Audio-visual:

Frame     1            2          3            4            5
Visual    dog          dog bark   talking      cat          background
Audio     talking      dog bark   dog bark     background   background
AV Event  background   dog bark   background   background   background
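The audio-visual example can be read as a simple agreement rule. The sketch below assumes that a segment counts as an audio-visual event only when its visual and audio labels name the same non-background event. This rule is inferred from the example and is not stated on the slide. The per-frame split of the Visual row is also an assumed reading, and the function name `av_event_labels` is hypothetical.

```python
BACKGROUND = "background"


def av_event_labels(visual, audio):
    """Keep an event label only where both modalities agree on it.

    ASSUMPTION: agreement between the visual and audio labels is what
    defines an audio-visual event; everything else becomes background.
    """
    if len(visual) != len(audio):
        raise ValueError("visual and audio must cover the same segments")
    return [
        v if v == a and v != BACKGROUND else BACKGROUND
        for v, a in zip(visual, audio)
    ]


# Per-frame labels as read from the example (assumed split for the Visual row).
visual = ["dog", "dog bark", "talking", "cat", "background"]
audio = ["talking", "dog bark", "dog bark", "background", "background"]
print(av_event_labels(visual, audio))
```

Under this reading, only frame 2 is labeled with an event, because it is the only frame where the dog is both seen and heard barking.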


Audio-Visual Event Localization

• Goal
  • Identify event/activity labels across video frames by jointly observing visual and audio features in the input video.
• References
  • Audio-visual event localization in unconstrained videos, ECCV’18
  • Dual-modality seq2seq network for audio-visual event localization, ICASSP’19


Audio-Visual Event Localization

• Demo video

Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu. Audio-visual event localization in unconstrained videos. ECCV 2018.


Audio-Visual Event Localization

• Audio-visual event localization in unconstrained videos, ECCV’18
  • Network Architecture

[Figure: network architecture. Inputs: video frames and audio segments. Labeled component: audio-based visual attention (audio localization).]


Audio-Visual Event Localization

• Dual-modality seq2seq network for audio-visual event localization, ICASSP’19

- Encoder
  Input: image and audio segments ($t$ segments)
  Output: audio features $a_1 \dots a_t$, visual features $v_1 \dots v_t$

- Fusion
  Input: the last hidden and cell states from the audio and visual modalities, respectively
  Output: fused states $h_f$, $c_f$

- Decoder
  Input: fused states $h_f$, $c_f$; audio feature $a$; visual feature $v$
  Output: event categories $y_1 \dots y_t$
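The data flow through the encoder, fusion, and decoder stages can be sketched in NumPy as follows. This is only an illustration with untrained random weights, not the network from the paper. The fusion operator (a plain average of the two modalities' states), the feature sizes, the hidden size, and the names `LSTMCell`, `encode`, and `localize` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class LSTMCell:
    """Minimal LSTM cell with randomly initialised (untrained) weights."""

    def __init__(self, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        self.W = rng.normal(0.0, 0.1, (4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c


def encode(cell, features):
    """Run one modality through its encoder and return the last (h, c)."""
    h = np.zeros(cell.hidden_dim)
    c = np.zeros(cell.hidden_dim)
    for x in features:
        h, c = cell.step(x, h, c)
    return h, c


def localize(audio, visual, num_classes=28, hidden_dim=16):
    """Return one predicted class index per segment."""
    a_dim, v_dim = audio.shape[1], visual.shape[1]
    # Encoder: one recurrent encoder per modality.
    h_a, c_a = encode(LSTMCell(a_dim, hidden_dim), audio)
    h_v, c_v = encode(LSTMCell(v_dim, hidden_dim), visual)
    # Fusion: ASSUMED here to be a plain average of the last states.
    h_f, c_f = (h_a + h_v) / 2.0, (c_a + c_v) / 2.0
    # Decoder: starts from the fused states and reads a_t and v_t at each step.
    decoder = LSTMCell(a_dim + v_dim, hidden_dim)
    W_out = rng.normal(0.0, 0.1, (num_classes, hidden_dim))
    h, c = h_f, c_f
    labels = []
    for a_t, v_t in zip(audio, visual):
        h, c = decoder.step(np.concatenate([a_t, v_t]), h, c)
        labels.append(int(np.argmax(W_out @ h)))
    return labels


# Ten one-second segments with hypothetical feature sizes.
audio = rng.normal(size=(10, 128))
visual = rng.normal(size=(10, 512))
print(localize(audio, visual))
```

Because the weights are random, the printed labels are meaningless. The point is only that the decoder emits one category per segment while being initialised from states that summarise both modalities.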


Audio-Visual Event Localization

• Dual-modality seq2seq network for audio-visual event localization, ICASSP’19
  • Evaluation
    • Audio-Visual Event (AVE) Dataset (ECCV’18):

The AVE dataset includes 4143 videos in 28 categories, and each video is labeled with audio-visual events every second. It covers events from a wide range of domains (e.g., church bell, dog barking, truck, bus, clock, violin, etc.).


Audio-Visual Event Localization

• Dual-modality seq2seq network for audio-visual event localization, ICASSP’19
  • Evaluation
    • Metric: frame-wise accuracy
      • % of correct matchings over all test frames.

The metric can be calculated in fully supervised (every frame label is used in the training phase) or weakly supervised (only the average labels are used) settings.
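Frame-wise accuracy as described above can be computed in a few lines. The following is a minimal sketch that assumes predictions and ground truth are given as one label sequence per test video; the function name and the toy labels are hypothetical.

```python
def frame_wise_accuracy(predictions, ground_truth):
    """Percentage of frames whose predicted label matches the ground truth.

    Both arguments are lists of per-video label sequences; frames from all
    test videos are pooled before computing the percentage.
    """
    correct = 0
    total = 0
    for pred_seq, true_seq in zip(predictions, ground_truth):
        if len(pred_seq) != len(true_seq):
            raise ValueError("each prediction must cover every frame")
        correct += sum(p == t for p, t in zip(pred_seq, true_seq))
        total += len(true_seq)
    if total == 0:
        raise ValueError("no frames to evaluate")
    return 100.0 * correct / total


# Two toy videos: 4 of the 5 frames are labeled correctly.
predictions = [["background", "dog bark", "dog bark"], ["church bell", "church bell"]]
ground_truth = [["background", "dog bark", "background"], ["church bell", "church bell"]]
print(frame_wise_accuracy(predictions, ground_truth))  # 80.0
```

The metric is the same in the fully and weakly supervised settings; only the labels available during training differ.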


Spatial Audio Generation

• Goal
  • Spatial audio information is highly related to the visual scene.
  • Given single-channel audio, we would like to generate spatial audio by observing the visual data.
• References
  • 2.5D Visual Sound, CVPR’19
  • Self-Supervised Audio Spatialization with Correspondence Classifier, ICIP’19
  • Self-Supervised Generation of Spatial Audio for 360 Video, NeurIPS’18


Spatial Audio Generation

• 2.5D Visual Sound, CVPR 2019
  • Network architecture

[Figure: network architecture. Annotation: predict difference masks only.]
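Why predicting only the difference is sufficient can be seen from a small identity. The sketch below assumes the mono input is the sum of the left and right channels and the predicted quantity is their difference. The mask prediction itself is omitted, and the function names are hypothetical.

```python
import numpy as np


def to_mono_and_difference(left, right):
    """Mono mix (sum) and difference signal of a two-channel recording."""
    return left + right, left - right


def to_stereo(mono, difference):
    """Recover both channels from the mono mix and a (predicted) difference."""
    left = (mono + difference) / 2.0
    right = (mono - difference) / 2.0
    return left, right


# Toy waveforms standing in for the two recorded channels.
rng = np.random.default_rng(0)
left = rng.normal(size=1000)
right = rng.normal(size=1000)

mono, difference = to_mono_and_difference(left, right)
rec_left, rec_right = to_stereo(mono, difference)
print(np.allclose(rec_left, left), np.allclose(rec_right, right))
```

Since the mono signal is already given as input, a model that estimates the difference signal determines both output channels.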


Spatial Audio Generation

• 2.5D Visual Sound, CVPR 2019
  • Evaluation

Note: only the FAIR-Play dataset is collected as 2D video. The remaining datasets are 360 videos, whose 360 spatial audio is transformed to 2D audio by pretrained audio decoders.


Spatial Audio Generation

• 2.5D Visual Sound, CVPR 2019


Spatial Audio Generation

• Self-Supervised Audio Spatialization with Correspondence Classifier, ICIP 2019
  • Network architecture


Spatial Audio Generation

• Self-Supervised Generation of Spatial Audio for 360 Video, NeurIPS’18
  • Network architecture

[Figure: network architecture, with an inset on the format of 360 audio.]


Spatial Audio Generation

• Self-Supervised Generation of Spatial Audio for 360 Video, NeurIPS’18
  • Evaluation
    • Metric
      • STFT distance: complex L2 norm between the ground-truth and predicted spectrograms
      • Envelope distance (ENV): similar to the STFT distance, but computed on signal envelopes obtained with the Hilbert transform
      • Earth Mover’s Distance (EMD): compares the energy of the sound field, measured over a small window, between ground truth and prediction
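The first two metrics can be sketched for a single audio channel with NumPy alone. This is a minimal illustration, not the evaluation code of the paper: the frame length, hop size, Hann window, lack of normalization, and all function names are assumptions. EMD is left out because it needs the directional energy of the sound field.

```python
import numpy as np


def stft(signal, frame_len=256, hop=128):
    """Complex spectrogram (frames x bins) from Hann-windowed frames."""
    window = np.hanning(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] * window for s in starts])
    return np.fft.rfft(frames, axis=1)


def stft_distance(reference, estimate):
    """L2 norm of the complex difference between the two spectrograms."""
    return float(np.linalg.norm(stft(reference) - stft(estimate)))


def envelope(signal):
    """Magnitude of the analytic signal, built with an FFT-based Hilbert transform."""
    n = len(signal)
    spectrum = np.fft.fft(signal)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * gain))


def envelope_distance(reference, estimate):
    """L2 norm between the envelopes of the two waveforms."""
    return float(np.linalg.norm(envelope(reference) - envelope(estimate)))


# A sine with a whole number of cycles, and a half-amplitude "prediction".
n = np.arange(2048)
reference = np.sin(2 * np.pi * 64 * n / 2048)
estimate = 0.5 * reference
print(stft_distance(reference, estimate), envelope_distance(reference, estimate))
```

Both distances are zero for a perfect prediction and grow with the error. The STFT distance is sensitive to phase, while the envelope distance only compares how the amplitude evolves over time.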


Spatial Audio Generation

• Self-Supervised Generation of Spatial Audio for 360 Video, NeurIPS’18
  • Results


Spatial Audio Generation

• Self-Supervised Generation of Spatial Audio for 360 Video, NeurIPS’18


Decomposing Sounds of Visual Objects

• Goal
  • Separating mixed sounds into separate ones corresponding to the associated objects
  • Can be done in a supervised or (preferably) unsupervised way
• References
  • The Sound of Pixels, ECCV 2018
  • The Sound of Motions, arXiv
  • Co-Separating Sounds of Visual Objects, arXiv


Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018

Training pipeline: concatenated videos as visual inputs + mixed audio sources.

[Figure annotations: mixed audio; no GT audio data]
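The training idea of mixing audio from several videos can be illustrated with a small mix-and-separate sketch. It assumes that "no GT audio data" means no separately annotated source tracks are needed, because the clips being mixed supply the separation targets themselves. Binary masks, the STFT settings, and the function names are assumptions made for the example, and the network that would predict the masks from video is omitted.

```python
import numpy as np


def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram (frames x bins) from Hann-windowed frames."""
    window = np.hanning(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))


def mix_and_separate_targets(audio_a, audio_b):
    """Build one training example from the audio tracks of two videos.

    Returns the mixture spectrogram (the model input) and one binary mask
    per source (the training targets), computed from the unmixed clips.
    """
    mixture = magnitude_spectrogram(audio_a + audio_b)
    mag_a = magnitude_spectrogram(audio_a)
    mag_b = magnitude_spectrogram(audio_b)
    mask_a = (mag_a >= mag_b).astype(float)  # bins where source A dominates
    mask_b = 1.0 - mask_a
    return mixture, mask_a, mask_b


# Two toy "instruments": pure tones centred on bins 16 and 48.
n = np.arange(2048)
audio_a = np.sin(2 * np.pi * 16 * n / 256)
audio_b = np.sin(2 * np.pi * 48 * n / 256)

mixture, mask_a, mask_b = mix_and_separate_targets(audio_a, audio_b)
separated_a = mixture * mask_a
separated_b = mixture * mask_b
print(separated_a.sum(axis=0).argmax(), separated_b.sum(axis=0).argmax())
```

Applying each target mask to the mixture recovers the energy of the corresponding tone, which is the behavior a trained model would be asked to reproduce from the mixture and the visual input.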


Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018

[Figure annotations: mixed audio; evaluation]


Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018
  • Evaluation


Decomposing Sounds of Visual Objects

• The Sound of Motions (Zhao, Gan, Ma, & Torralba), arXiv


Decomposing Sounds of Visual Objects

• The Sound of Motions (Zhao, Gan, Ma, & Torralba), arXiv
  • Network architecture


Decomposing Sounds of Visual Objects

• The Sound of Motions (Zhao, Gan, Ma, & Torralba), arXiv
  • Evaluation


Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Disentangle sounds in realistic videos, even in cases where an object was not observed individually during training.


Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Network architecture


Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Network architecture
    • Audio-Visual Separator


Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Network architecture


Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Evaluation


Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Demo Video


About Final Challenges & Presentation

• Presentation Date/Time
  • 6/25 Tue 1:30pm-5pm
  • If you cannot participate, you need to let me/TAs/your team members know in advance.
• Final 35% (Bonus up to 1%+3%+3%)
  • Code / Kaggle 10%: Kaggle for references, final accuracy evaluated by TAs
    • Early baseline: bonus 1% (due 6/15 Sat 1am)
    • TA baseline: Public: Weak 5% / Strong 5%; Private: bonus up to 3% (due 6/24 Mon 2:00 am)
  • Approach & Presentation 25%
    • Novelty and technical contribution (10%)
    • Completeness of experiments (10%) (e.g., comparisons to baseline and recent models, ablation studies, visualization, etc.)
    • Presentation (Oral + Poster) 5% + bonus up to 3% (top 3 teams voted by class)
• For both challenges, you need to upload your code to GitHub and provide README files, so that TAs will be able to reproduce your results!
• If TAs cannot reproduce your results, 0/20 points will be given unless the errors are minor (i.e., no credits for the approach part).


About Final Challenges & Presentation

• Intra-Group Evaluation
  • Every student needs to rate each member in his/her team (e.g., 1~5).
  • Every student needs to specify the contributions of each member in the team.
  • (Optional) Students can provide additional remarks on the team members if necessary. The comments will not be released to other team members and will be accessible to the instructor and TAs only.
