Deep Learning for Computer Vision · vllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w15.pdf ·...

Deep Learning for Computer Vision
Spring 2019

http://vllab.ee.ntu.edu.tw/dlcv.html (primary)

https://ceiba.ntu.edu.tw/1072CommE5052 (grade, etc.)

FB: DLCV Spring 2019

Yu-Chiang Frank Wang 王鈺強, Associate Professor

Dept. Electrical Engineering, National Taiwan University

2019/06/05

What’s to Be Covered Today…

• Guest Lecture
  • Title: NTUEE Alumni Ask-Me-Anything Series: Sharing Experiences in Studying, Research, and Internships in the CV Field
  • Speaker: Dr. Wei-Sheng Lai 賴威昇 (B97), Univ. California, Merced
  • Time/Location: 10am @ BL113 (i.e., the 2nd class)

What’s to Be Covered Today…

• VIP Talk
  • Title: Face Recognition & Anti-Spoofing for Identity Authentication
  • Speaker: Dr. Shang-Hong Lai, Principal Researcher, Microsoft AI R&D Center
  • Time/Location: 11am @ BL113 (i.e., the 3rd class)

What’s to be Covered …

• Learning Beyond Images (Part II)
  • Audio-Visual Event Localization
  • Spatial Audio Generation
  • Decomposing Sounds of Visual Objects
• About Final Presentation
  • Date/time: 6/25 Tue 1:30pm-5pm
  • Remarks

Visual vs. Audio-Visual Event Localization

• Recognizing video event categories
• Visual vs. audio-visual features

Visual event localization:

Frame      1            2            3            4            5
Visual     walk         run          run          jump         jump

Audio-visual event localization:

Frame      1            2            3            4            5
Visual     dog          dog bark     talking      cat          background
Audio      talking      dog bark     dog bark     background   background
AV Event   background   dog bark     background   background   background
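The audio-visual rows in the example above are consistent with a simple reading: a frame carries an event label only when the visual and audio labels agree, and is background otherwise. The sketch below encodes that reading as an assumption; it is not the annotation protocol of any dataset. The function name `av_event_labels` is hypothetical, and the per-frame split of the visual row is itself a reading of the slide.

```python
BACKGROUND = "background"


def av_event_labels(visual, audio):
    """Per-frame AV label: keep a label only where both modalities agree."""
    if len(visual) != len(audio):
        raise ValueError("visual and audio must have one label per frame")
    return [v if v == a else BACKGROUND for v, a in zip(visual, audio)]


# Frames 1-5 from the slide example (the split of the visual row is assumed).
visual = ["dog", "dog bark", "talking", "cat", "background"]
audio = ["talking", "dog bark", "dog bark", "background", "background"]
av_event = av_event_labels(visual, audio)
print(av_event)
```

Under this reading only frame 2, where both modalities report "dog bark", is labeled as an audio-visual event.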

Audio-Visual Event Localization

• Goal
  • Identify event/activity labels across video frames by jointly observing visual and audio features in the input video.
• References
  • Audio-visual event localization in unconstrained videos, ECCV’18
  • Dual-modality seq2seq network for audio-visual event localization, ICASSP’19

Audio-Visual Event Localization

• Demo video

Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu. Audio-visual event localization in unconstrained videos. ECCV 2018

Audio-Visual Event Localization

• Audio-visual event localization in unconstrained videos, ECCV’18
  • Network architecture

[Figure: network architecture, taking video frames and audio segments as input, with audio-based visual attention (audio localization)]

Audio-Visual Event Localization

• Dual-modality seq2seq network for audio-visual event localization, ICASSP’19

- Encoder
  Input: image and audio segments ($t$ segments)
  Output: audio features $a_1 \dots a_t$, visual features $v_1 \dots v_t$

- Fusion
  Input: the last hidden and cell states from the audio and visual modalities, respectively
  Output: fused states $h_f$, $c_f$

- Decoder
  Input: fused states $h_f$, $c_f$; audio feature $a$; visual feature $v$
  Output: event categories $y_1 \dots y_t$


• Evaluation
  • Audio-Visual Event (AVE) Dataset (ECCV’18): The AVE dataset contains 4143 videos in 28 categories, and each video is labeled with audio-visual events at every second. The dataset covers events from a wide range of domains (e.g., church bell, dog barking, truck, bus, clock, violin).


• Evaluation
  • Metric: frame-wise accuracy
    • Percentage of correctly predicted frames over all test frames.
    • Can be computed in the fully supervised setting (every frame label is used in the training phase) or the weakly supervised setting (only the averaged labels are used).
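As a minimal sketch of the metric, frame-wise accuracy can be computed by pooling all test frames across videos and counting exact label matches. The function name `framewise_accuracy` and the toy labels are illustrative assumptions, not taken from the papers' evaluation code.

```python
def framewise_accuracy(predicted, ground_truth):
    """Fraction of test frames whose predicted label matches the ground truth.

    Both arguments are lists of videos; each video is a list of per-frame labels.
    """
    correct = 0
    total = 0
    for pred_video, true_video in zip(predicted, ground_truth):
        if len(pred_video) != len(true_video):
            raise ValueError("each video needs one prediction per labeled frame")
        correct += sum(p == t for p, t in zip(pred_video, true_video))
        total += len(true_video)
    if total == 0:
        raise ValueError("no frames to evaluate")
    return correct / total


# Toy example: two videos with three frames each (labels are made up).
predicted = [["background", "dog bark", "background"],
             ["violin", "violin", "background"]]
ground_truth = [["background", "dog bark", "dog bark"],
                ["violin", "violin", "violin"]]
accuracy = framewise_accuracy(predicted, ground_truth)
print(f"{accuracy:.1%}")  # 4 of 6 frames match
```

The same function applies to both supervision settings, since they differ in which labels are used for training rather than in how test frames are scored.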

Spatial Audio Generation

• Goal
  • The spatial information in audio is closely related to the visual scene.
  • Given single-channel audio, we would like to generate spatial audio by observing the visual data.
• References
  • 2.5D Visual Sound, CVPR’19
  • Self-Supervised Audio Spatialization with Correspondence Classifier, ICIP’19
  • Self-Supervised Generation of Spatial Audio for 360 Video, NeurIPS’18

Spatial Audio Generation

• 2.5D Visual Sound, CVPR 2019
  • Network architecture

Predict difference masks only
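In the 2.5D Visual Sound formulation, the mono input is the sum of the left and right channels, and the network predicts the difference between them, from which both channels are recovered. The sketch below shows only that sum/difference arithmetic on raw samples. It is a simplification: the paper predicts a mask over the mono spectrogram rather than time-domain samples, and the function names here are hypothetical.

```python
def to_mono_and_difference(left, right):
    """Mono is the channel sum; the prediction target is the channel difference."""
    mono = [l + r for l, r in zip(left, right)]
    diff = [l - r for l, r in zip(left, right)]
    return mono, diff


def recover_stereo(mono, diff):
    """Invert the sum/difference encoding to get both channels back."""
    left = [(m + d) / 2 for m, d in zip(mono, diff)]
    right = [(m - d) / 2 for m, d in zip(mono, diff)]
    return left, right


# Toy two-channel signal (values chosen to be exact in floating point).
left = [0.5, 0.25, -0.5]
right = [0.25, 0.25, 0.75]
mono, diff = to_mono_and_difference(left, right)
rec_left, rec_right = recover_stereo(mono, diff)
print(rec_left, rec_right)
```

Because the mono signal is already given, predicting only the difference is enough to reconstruct both channels.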

Spatial Audio Generation

• 2.5D Visual Sound, CVPR 2019
  • Evaluation

Note: only the FAIR-Play dataset was collected as 2D video. The remaining datasets consist of 360 videos, whose 360 spatial audio is converted to 2D audio by pretrained audio decoders.

Spatial Audio Generation

• 2.5D Visual Sound, CVPR 2019

• Self-Supervised Audio Spatialization with Correspondence Classifier, ICIP 2019
  • Network architecture

Spatial Audio Generation

• Self-Supervised Generation of Spatial Audio for 360 Video, NeurIPS’18
  • Network architecture

The format of 360 audio


• Self-Supervised Generation of Spatial Audio for 360 Video, NeurIPS’18
  • Evaluation
    • Metric
      • STFT distance: complex L2 norm between the ground-truth and predicted spectrograms
      • Envelope distance (ENV): similar to the STFT distance, but computed on signal envelopes obtained with the Hilbert transform
      • Earth Mover’s Distance (EMD): computed on the energy of the sound field measured over a small window
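The first two metrics can be sketched with NumPy alone. This is a minimal illustration for a single audio channel, assuming a Hann window, a frame length of 256, and a hop of 128; those parameters, the function names, and the lack of any normalization are assumptions made here, so absolute values will not match the paper's reported numbers. EMD is omitted because it depends on how the sound-field energy map is built.

```python
import numpy as np


def stft(x, n_fft=256, hop=128):
    """Complex STFT from Hann-windowed frames (no padding)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def stft_distance(ref, est):
    """L2 norm of the difference between the two complex spectrograms."""
    return float(np.linalg.norm(stft(ref) - stft(est)))


def envelope(x):
    """Amplitude envelope: magnitude of the analytic signal (Hilbert transform via FFT)."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * gain))


def envelope_distance(ref, est):
    """L2 norm of the difference between the two envelopes."""
    return float(np.linalg.norm(envelope(ref) - envelope(est)))


# Toy signals: a cosine with 8 full cycles and a half-amplitude copy.
reference = np.cos(2 * np.pi * 8 * np.arange(1024) / 1024)
estimate = 0.5 * reference
print(stft_distance(reference, estimate), envelope_distance(reference, estimate))
```

Both distances are zero for identical signals and grow as the estimate departs from the reference; the envelope distance ignores fine phase structure that the complex STFT distance still penalizes.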

• Self-Supervised Generation of Spatial Audio for 360 Video, NeurIPS’18
  • Results

Decomposing Sounds of Visual Objects

• Goal
  • Separate mixed sounds into individual ones corresponding to their associated objects
  • Can be done in a supervised or (preferably) unsupervised way
• References
  • The Sound of Pixels, ECCV 2018
  • The Sound of Motions, arXiv
  • Co-Separating Sounds of Visual Objects, arXiv


• The Sound of Pixels, ECCV 2018

Training pipeline: concatenated videos as visual inputs + mixed audio sources.

mixed audio

No GT audio data
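The mix-and-separate idea behind this pipeline can be sketched as follows: audio from two different videos is mixed artificially, so the original tracks supply the separation targets and no manually annotated source audio is needed. The code builds binary and ratio mask targets for one source from toy magnitude spectrograms. Treating the mixture magnitude as the sum of the source magnitudes is a simplifying assumption, and the function name and nested-list representation are hypothetical.

```python
def mix_and_masks(spec_a, spec_b, eps=1e-8):
    """Build a synthetic mixture and the mask targets for source A.

    spec_a, spec_b: magnitude spectrograms as nested lists (frequency x time).
    """
    mix = [[a + b for a, b in zip(row_a, row_b)]
           for row_a, row_b in zip(spec_a, spec_b)]
    # Binary mask: 1 where source A is at least as strong as source B.
    binary_a = [[1.0 if a >= b else 0.0 for a, b in zip(row_a, row_b)]
                for row_a, row_b in zip(spec_a, spec_b)]
    # Ratio mask: share of the mixture magnitude that belongs to source A.
    ratio_a = [[a / (m + eps) for a, m in zip(row_a, row_m)]
               for row_a, row_m in zip(spec_a, mix)]
    return mix, binary_a, ratio_a


# Toy 2x2 magnitude spectrograms for two solo clips (values are made up).
spec_a = [[1.0, 0.0], [2.0, 1.0]]
spec_b = [[0.0, 1.0], [2.0, 3.0]]
mix, binary_a, ratio_a = mix_and_masks(spec_a, spec_b)
print(binary_a, ratio_a)
```

A network trained this way sees the mixture plus the visual input of one video and is asked to predict that video's mask; multiplying the predicted mask with the mixture spectrogram gives the separated sound.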


• The Sound of Pixels, ECCV 2018

mixed audio

evaluation


• The Sound of Pixels, ECCV 2018
  • Evaluation


• The Sound of Motions (Zhao, Gan, Ma, & Torralba), arXiv
  • Network architecture
  • Evaluation


• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Disentangle sounds in realistic videos, even in cases where an object was not observed individually during training.
  • Network architecture
    • Audio-Visual Separator
  • Evaluation
  • Demo video

About Final Challenges & Presentation

• Presentation Date/Time
  • 6/25 Tue 1:30pm-5pm
  • If you cannot participate, you need to let me/TAs/your team members know in advance.
• Final 35% (Bonus up to 1%+3%+3%)
  • Code / Kaggle 10%: Kaggle for reference, final accuracy evaluated by TAs
    • Early baseline: bonus 1% (due 6/15 Sat 1am)
    • TA baseline: Public: Weak 5% / Strong 5%; Private: bonus up to 3% (due 6/24 Mon 2:00 am)
  • Approach & Presentation 25%
    • Novelty and technical contribution (10%)
    • Completeness of experiments (10%) (e.g., comparisons to baseline and recent models, ablation studies, visualization, etc.)
    • Presentation (Oral + Poster) 5% + bonus up to 3% (top 3 teams voted by class)
• For both challenges, you need to upload your code to GitHub and provide README files, so that TAs will be able to reproduce your results!
• If TAs cannot reproduce your results, 0/20 points will be given unless the errors are minor (i.e., no credit for the approach part).

About Final Challenges & Presentation

• Intra-Group Evaluation
  • Every student needs to rate each member of his/her team (e.g., 1~5).
  • Every student needs to specify the contributions of each member of the team.
  • (Optional) Students can provide additional remarks on team members if necessary. The comments will not be released to other team members and are accessible to the instructor and TAs only.
