An (very brief) introduction - ESO€¦ · Prof. Dr. Laura Leal-Taixé An (very brief)...

Post on 08-Aug-2020

2 views 0 download

Transcript of An (very brief) introduction - ESO€¦ · Prof. Dr. Laura Leal-Taixé An (very brief)...

Prof. Dr. Laura Leal-Taixé

www.dvl.in.tum.de

An (very brief) introduction to Deep Learning

for Computer

Vision

Computer Vision

Give eyes to a computer

Computer Vision

Understand every pixel of an image

Computer Vision

road

tree

Semantic segmentation

personcar

Understand every pixel of an image

Computer Vision

person 1

tree

Semantic segmentation

person 2

person 3

Instance-based

segmentation

road

car

Understand every pixel of an image

Computer Vision

Understand every pixel of a video

Semantic segmentation

Instance-based

segmentation

Multiple object

tracking

Dynamic Scene Understanding

Understand every pixel of a video

Semantic segmentation

Instance-based

segmentation

Multiple object

tracking

Artificial Intelligence

Data, examples

Computational models

Perform a given task on new unseen data

LEARNING, TRAINING

Artificial Intelligence

Data, examples

Computational models

Perform a given task on new unseen data

Expert eye

Self-driving cars

AI nowadays

What is Deep Learning, really?

CAT

What is Deep Learning, really?

Each node is a small classifier

What is Deep Learning, really?

Each node is a small classifier

Each classifier makes tiny decision

Furry

Has two eyes

Has a tail

Has paws

Has two ears These are learned

from data!

Convolutions on Images

-5 3 2 -5 34 3 2 1 -31 0 3 3 5-2 0 1 4 45 6 7 9 -1

0 -1 0-1 5 -10 -1 0

Ima

ge

5x5

Ke

rne

l 3

x3

6

Ou

tpu

t 3

x3

5 ⋅ 3 + −1 ⋅ 3 + −1 ⋅ 2 + −1 ⋅ 0 + −1 ⋅ 4 =15 − 9 = 6

-5 3 2 -5 34 3 2 1 -31 0 3 3 5-2 0 1 4 45 6 7 9 -1

Convolutions on Images

0 -1 0-1 5 -10 -1 0

Ima

ge

5x5

Ke

rne

l 3

x3

6 1

Ou

tpu

t 3

x3

5 ⋅ 2 + −1 ⋅ 2 + −1 ⋅ 1 + −1 ⋅ 3 + −1 ⋅ 3 =10 − 9 = 1

-5 3 2 -5 34 3 2 1 -31 0 3 3 5-2 0 1 4 45 6 7 9 -1

Convolutions on Images

0 -1 0-1 5 -10 -1 0

Ima

ge

5x5

Ke

rne

l 3

x3

6 1 8

Ou

tpu

t 3

x3

5 ⋅ 1 + −1 ⋅ (−5) + −1 ⋅ (−3) + −1 ⋅ 3 + −1 ⋅ 2 =5 + 3 = 1

-5 3 2 -5 34 3 2 1 -31 0 3 3 5-2 0 1 4 45 6 7 9 -1

Convolutions on Images

0 -1 0-1 5 -10 -1 0

Ima

ge

5x5

Ke

rne

l 3

x3

6 1 8-7

Ou

tpu

t 3

x3

5 ⋅ 0 + −1 ⋅ 3 + −1 ⋅ 0 + −1 ⋅ 1 + −1 ⋅ 3 =0 − 7 = −7

-5 3 2 -5 34 3 2 1 -31 0 3 3 5-2 0 1 4 45 6 7 9 -1

Convolutions on Images

0 -1 0-1 5 -10 -1 0

Ima

ge

5x5

Ke

rne

l 3

x3

6 1 8-7 9

Ou

tpu

t 3

x3

5 ⋅ 3 + −1 ⋅ 2 + −1 ⋅ 3 + −1 ⋅ 1 + −1 ⋅ 0 =15 − 6 = 9

-5 3 2 -5 34 3 2 1 -31 0 3 3 5-2 0 1 4 45 6 7 9 -1

Convolutions on Images

0 -1 0-1 5 -10 -1 0

Ima

ge

5x5

Ke

rne

l 3

x3

6 1 8-7 9 2

Ou

tpu

t 3

x3

5 ⋅ 3 + −1 ⋅ 1 + −1 ⋅ 5 + −1 ⋅ 4 + −1 ⋅ 3 =15 − 13 = 2

-5 3 2 -5 34 3 2 1 -31 0 3 3 5-2 0 1 4 45 6 7 9 -1

Convolutions on Images

0 -1 0-1 5 -10 -1 0

Ima

ge

5x5

Ke

rne

l 3

x3

6 1 8-7 9 2-5

Ou

tpu

t 3

x3

5 ⋅ 0 + −1 ⋅ 0 + −1 ⋅ 1 + −1 ⋅ 6+ −1 ⋅ (−2) = −5

-5 3 2 -5 34 3 2 1 -31 0 3 3 5-2 0 1 4 45 6 7 9 -1

Convolutions on Images

0 -1 0-1 5 -10 -1 0

Ima

ge

5x5

Ke

rne

l 3

x3

6 1 8-7 9 2-5 -9

Ou

tpu

t 3

x3

5 ⋅ 1 + −1 ⋅ 3 + −1 ⋅ 4 + −1 ⋅ 7 + −1 ⋅ 0 =5 − 14 = −9

-5 3 2 -5 34 3 2 1 -31 0 3 3 5-2 0 1 4 45 6 7 9 -1

Convolutions on Images

0 -1 0-1 5 -10 -1 0

Ima

ge

5x5

Ke

rne

l 3

x3

6 1 8-7 9 2-5 -9 3

Ou

tpu

t 3

x3

5 ⋅ 4 + −1 ⋅ 3 + −1 ⋅ 4 + −1 ⋅ 9 + −1 ⋅ 1 =20 − 17 = 3

Image filters

• Each kernel gives us a different image filter

Prof. Leal-Taixé and Prof. Niessner24

Input

Edge detection

−1 −1 −1−1 8 −1−1 −1 −1

Sharpen

0 −1 0−1 5 −10 −1 0

Box mean

191 1 11 1 11 1 1

Gaussian blur

116

1 2 12 4 21 2 1

LET’S L

EARN T

HESE FIL

TERS!

Convolutions on RGB Images

32

323

3 5

5

32×32×3 image (pixels %)

5×5×3 filter (weights &)

128

28

activation map(also feature map)

Convolve

Convolution Layer

32

323

3 5

5

32×32×3 image

5×5×3 filter

128

28

activation maps

Convolve

1

Let’s apply a different filterwith different weights!

Convolution Layer

32

323

32×32×3 image

528

28

activation maps

Convolve

Let’s apply **five** filters,each with different weights!

Convolution “Layer”

Visualizing a CNN

Low-level

features

Mid-level

features

Object

parts

Big Data

Models know where

to learn from

Hardware

Models are

trainable

Deep

Models are

complex

What made Deep Learning possible?

Big data

ImageNet: Goal 10.000 images per 100.000 words

Deep Learning: what is it good at?

Input A Response B

English sentence

Picture

Audio clip

French sentence

Photo tagging

Transcript

Machine translation

Face recognition

Speech recognition

Supervised learning

How to obtain the model?

Data points

Model parameters

Labels (ground truth)

Estimation

Loss function

Optimization

Deep Learning for Image Classification

Classification

Classification

+

Localization

Object

detection

Instance

segmentation

Instance segmentation with Mask-RCNN

K. He, G. Gkioxari, P. Dollar, R. Girshick. Mask R-CNN. ICCV 2017.

Instance segmentation with Mask-RCNN

K. He, G. Gkioxari, P. Dollar, R. Girshick. Mask R-CNN. ICCV 2017.

Manipulating images

Jun-Yan Zhu*, Taesung Park*, Phillip Isola, and Alexei A. Efros. "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks", ICCV, 2017.

Manipulating images

Jun-Yan Zhu*, Taesung Park*, Phillip Isola, and Alexei A. Efros. "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks", ICCV, 2017.

The world is dynamic!

• Deep Learning pushed single-image analysis to a point where results are usable in real-world scenarios

• The world is not static

What we still need in ML: good memory models

Multiple object tracking

Goal: detect and track all objects in a

scene

Video object segmentation

TecoGANLowRes

M. Chu, Y. Xie, L. Leal-Taixé and N. Thuerey. „Temporally Coherent GANs for Video Super-Resolution (TecoGAN)“

Video super resolution

TecoGANLowRes

M. Chu, Y. Xie, L. Leal-Taixé and N. Thuerey. „Temporally Coherent GANs for Video Super-Resolution (TecoGAN)“

Video super resolution

M. Chu, Y. Xie, L. Leal-Taixé and N. Thuerey. „Temporally Coherent GANs for Video Super-Resolution (TecoGAN)“

TecoGAN results

Photo

MapObtain the camera

pose with respect to

the map

Visual localization

Visual Localization: applications

• Robot navigation

• Augmented reality

[Middelberg, Sattler, Untzelmann, Kobbelt, Scalable 6-DOF Localization on Mobile Devices. ECCV 2014]

The challenge: Big data

• We need data, lots of data!

• We might not have the luxury of data for some applications such as medical diagnosis

• Data is biased

Big data

Data bias• Increase diversity in

the data

• Increase diversity in the AI community which is building the algorithms

Data bias

The generalization problem

• Neural networks are GREAT at finding patterns in data they have seen, but not so great at generalizing to new scenarios

True intelligence is still far away

Prof. Dr. Laura Leal-Taixé

www.dvl.in.tum.de

Thank you

https://dvl.in.tum.deIf you have images, contact us!