
6.819/6.869 Advances in Computer Vision

Image by kirkh.deviantart.com

Aditya Khosla

Today’s class

• Part 1: From state-of-the-art to state-of-the-artest
  • Fine-tuning
  • Data augmentation

• Part 2: Applications
  • Detection, segmentation, …

• Part 3: Learning sequences
  • RNNs/LSTMs

ImageNet Challenge

[Chart: object recognition top-5 error by ImageNet Challenge year, falling from ~15% (AlexNet) through 11% to 6.8% (VGGNet, GoogLeNet) and toward ~5% by 2015]

Credit: Szegedy et al

GoogLeNet vs AlexNet

[Figure: GoogLeNet and AlexNet architectures side by side]

Credit: Szegedy et al

GoogLeNet

• Power and memory use are important considerations for practical deployment.

• Image data is mostly sparse and clustered.

• Hebbian Principle:

“Neurons that fire together, wire together”

In images, correlations tend to be local

Credit: Szegedy et al

Cover very local clusters by 1x1 convolutions

[Figure: number of filters by patch size: 1x1]

Credit: Szegedy et al

Less spread out correlations

[Figure: number of filters by patch size: 1x1]

Credit: Szegedy et al

Cover more spread out clusters by 3x3 convolutions

[Figure: number of filters by patch size: 1x1, 3x3]

Credit: Szegedy et al

Cover more spread out clusters by 5x5 convolutions

[Figure: number of filters by patch size: 1x1, 3x3, 5x5]

Credit: Szegedy et al

A heterogeneous set of convolutions

[Figure: number of filters by patch size: 1x1, 3x3, 5x5]

Credit: Szegedy et al

Schematic view (naive version)

[Diagram: Previous layer → 1x1 convolutions / 3x3 convolutions / 5x5 convolutions → Filter concatenation]

Credit: Szegedy et al


Naive idea (does not work!)

[Diagram: Previous layer → 1x1 convolutions / 3x3 convolutions / 5x5 convolutions / 3x3 max pooling → Filter concatenation]

The naive module is too expensive: 3x3 and 5x5 convolutions over all input channels blow up the computation, and concatenating every branch makes the channel count grow layer after layer.

Credit: Szegedy et al

Inception module

[Diagram: Previous layer → 1x1 convolutions / 1x1 convolutions → 3x3 convolutions / 1x1 convolutions → 5x5 convolutions / 3x3 max pooling → 1x1 convolutions → Filter concatenation]

Credit: Szegedy et al

Dimensionality reduction! The added 1x1 convolutions shrink the channel depth before the expensive 3x3 and 5x5 convolutions (and after the pooling branch), keeping compute and memory in check.

Credit: Szegedy et al
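
To make the module concrete, here is a minimal PyTorch sketch of an Inception module with 1x1 reductions; the branch widths below follow the inception(3a) configuration but should be read as illustrative, not as the paper's exact code.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches concatenated along the channel dimension."""
    def __init__(self, in_ch, ch1x1, ch3x3_red, ch3x3, ch5x5_red, ch5x5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolution
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, ch1x1, 1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 reduction, then 3x3 convolution
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, ch3x3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch3x3_red, ch3x3, 3, padding=1), nn.ReLU(inplace=True))
        # Branch 3: 1x1 reduction, then 5x5 convolution
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, ch5x5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch5x5_red, ch5x5, 5, padding=2), nn.ReLU(inplace=True))
        # Branch 4: 3x3 max pooling, then 1x1 projection
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Every branch preserves the spatial size, so concatenation is valid.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Example: 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels
module = InceptionModule(192, 64, 96, 128, 16, 32, 32)
out = module(torch.randn(1, 192, 28, 28))  # -> (1, 256, 28, 28)
```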

Inception

9 Inception modules

[Diagram: the full GoogLeNet network; legend: Convolution, Pooling, Softmax, Other]

Credit: Szegedy et al

[Module output widths: 256, 480, 480, 512, 512, 512, 832, 832, 1024]

Inception

Width of inception modules ranges from 256 filters (in early modules) to 1024 in top inception modules.

• Can remove the fully connected layers on top completely

• Number of parameters is reduced to 5 million

• Computational cost is increased by less than 2x compared to AlexNet (<1.5Bn operations/evaluation)

Credit: Szegedy et al


VGGnet

Credit: Fei-Fei

Data augmentation

• For both training and testing

Classification results on ImageNet 2012

Number of Models | Number of Crops    | Computational Cost | Top-5 Error | Compared to Base
1                | 1 (center crop)    | 1x                 | 10.07%      | -
1                | 10*                | 10x                | 9.15%       | -0.92%
1                | 144 (our approach) | 144x               | 7.89%       | -2.18%
7                | 1 (center crop)    | 7x                 | 8.09%       | -1.98%
7                | 10*                | 70x                | 7.62%       | -2.45%
7                | 144 (our approach) | 1008x              | 6.67%       | -3.41%

*Cropping by [Krizhevsky et al 2012]
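
For the starred rows, here is a minimal sketch of the 10-crop scheme (four corners plus the center, each with its horizontal mirror), assuming PIL is available; at test time the network's predictions are averaged over the crops.

```python
from PIL import Image, ImageOps

def ten_crop(img, size=224):
    """10-crop test-time augmentation (Krizhevsky et al.): the four
    corners and the center, each with its horizontal mirror."""
    w, h = img.size
    corners = [(0, 0), (w - size, 0), (0, h - size), (w - size, h - size),
               ((w - size) // 2, (h - size) // 2)]
    crops = [img.crop((x, y, x + size, y + size)) for x, y in corners]
    crops += [ImageOps.mirror(c) for c in crops]  # horizontal flips
    return crops  # 10 crops of size x size
```

torchvision's transforms.TenCrop implements the same idea out of the box.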

Fine-tuning

[Diagram: a network trained end to end, input → output, to classify Objects]

[Diagram: the same network with its output layer replaced, retrained by backpropagation to classify Scenes]
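
A minimal PyTorch sketch of this recipe, assuming an AlexNet-style model from torchvision; the scene-class count and learning rate are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a network pretrained on ImageNet objects.
model = models.alexnet(pretrained=True)

# Swap the object classifier head for a scene classifier
# (365 is an illustrative, Places-style label count).
num_scene_classes = 365
model.classifier[6] = nn.Linear(4096, num_scene_classes)

# Fine-tune with a small learning rate so the pretrained features
# are adjusted gently by backpropagation.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
# ...then loop over scene images: loss = criterion(model(x), y); loss.backward()
```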

Training linear SVMs on an off-the-shelf CNN representation significantly outperforms the majority of the baselines.
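
A hedged sketch of that recipe with scikit-learn; the feature files below are hypothetical placeholders for precomputed fc7 activations.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Use a pretrained CNN's activations (e.g., 4096-dim fc7 features) as
# fixed descriptors and train a linear SVM on top.
features = np.load("fc7_features.npy")  # hypothetical file, shape (N, 4096)
labels = np.load("scene_labels.npy")    # hypothetical file, shape (N,)

clf = LinearSVC(C=1.0)
clf.fit(features, labels)
print("train accuracy:", clf.score(features, labels))
```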

Results of MIT 67 Scene Classification

Visual Classification

Credit: Fei-Fei

R-CNN

[Slides: the R-CNN pipeline: extract region proposals, compute CNN features for each region, classify the regions]

Credit: Fei-Fei

R-CNN Results

Credit: Fei-Fei

Pixels in, pixels out

• monocular depth estimation (Liu et al. 2015)

• boundary prediction (Xie & Tu 2015)

• semantic segmentation

Credit: Long et al

ConvNets perform Classification

[Diagram: image → ConvNet → 1000-dim vector → “tabby cat”; end-to-end learning; < 1 millisecond]

Credit: Long et al

[Diagram: image → ??? → pixelwise output; end-to-end learning; < 1/5 second]

Credit: Long et al

A Classification Network

[Diagram: convolution and pooling layers followed by fully connected layers, producing “tabby cat” scores]

Credit: Long et al

Becoming fully convolutional

[Diagram: the fully connected layers reinterpreted as convolutions, so the same network outputs a spatial map of class scores]

Credit: Long et al

Upsampling output

Credit: Long et al

End-to-end, pixels-to-pixels network

[Diagram: conv, pool, nonlinearity → upsampling → pixelwise output + loss]

Credit: Long et al
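
A minimal PyTorch sketch of the conversion: copy a fully connected classifier's weights into a 1x1 convolution so it slides over larger inputs, then upsample the coarse score map. Here the upsampling is fixed bilinear interpolation; the FCN of Long et al. learns it with transposed convolutions initialized to bilinear. Sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 21  # e.g., a PASCAL VOC-style label set; illustrative

# A fully connected classifier is equivalent to a convolution, so its
# weights can be copied into a 1x1 conv that slides over larger inputs.
fc = nn.Linear(4096, num_classes)
conv = nn.Conv2d(4096, num_classes, kernel_size=1)
conv.weight.data = fc.weight.data.view(num_classes, 4096, 1, 1)
conv.bias.data = fc.bias.data

features = torch.randn(1, 4096, 16, 16)  # stand-in for conv-net features
scores = conv(features)                  # (1, 21, 16, 16) coarse score map
pixelwise = F.interpolate(scores, scale_factor=32, mode="bilinear",
                          align_corners=False)  # (1, 21, 512, 512)
```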

Deconvolutional network

Credit: Noh et al

Upsampling

Credit: Noh et al

Deconvolution

Credit: Noh et al

Unpooling and Deconv Effects

Credit: Noh et al
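
A minimal PyTorch sketch of the unpooling + deconvolution pair; layer sizes are illustrative.

```python
import torch
import torch.nn as nn

# Max pooling records which location "won" (the switches); unpooling puts
# activations back at exactly those locations, and a transposed
# convolution (deconvolution) densifies the resulting sparse map.
pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)
deconv = nn.ConvTranspose2d(64, 64, kernel_size=3, padding=1)

x = torch.randn(1, 64, 32, 32)
pooled, switches = pool(x)         # (1, 64, 16, 16) plus switch locations
sparse = unpool(pooled, switches)  # (1, 64, 32, 32), mostly zeros
dense = deconv(sparse)             # (1, 64, 32, 32), densified
```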

Results: segmentation

Credit: Noh et al

Predicting Human Visual Memory

Memorability = The likelihood of remembering a particular image.

Welcome to the

Visual Memory Game

A stream of images will be presented on the screen for 1 second each.

Your task:

Clap anytime you see an image you saw before in this experiment.

Ready?

(Seriously, get ready to clap. The images go by fast…)

<clap!>

<clap!>

Measuring Memorability

Memorability = Probability of correctly detecting a repeat after a single view of an image in a long sequence.

Understanding Image Memorability, Khosla et al, ICCV 2015

Memorability is an intrinsic property of an image!

LaMem

High memorability:
• Focused
• Enclosed setting
• Dynamics
• Unusual

Low memorability:
• No single focus
• Distant view
• Static
• Common

Training MemNet

[Diagram: CNN trained on LaMem by backpropagation, input image → output memorability score]

MemNet Performance

[Bar chart: rank correlation from 0.4 to 0.7 for HOG2x2, MemNet-small, HybridNet, MemNet, and Human]

Human rank correlation: 0.68

Prediction rank correlation: 0.64!

[Figure: example images ranked from strong positive to strong negative]

Visualizing Neurons

http://memorability.csail.mit.edu

‘selfie selection’

Predicting popularity

Popularity dataset

Dataset: 2.3 million Flickr images, with popularity measured by view counts

Predicting popularity

What makes an image popular? Khosla et al, WWW 2014

[Pipeline: Input Image → image features (HOG, GIST, SIFT) → Support Vector Regression → 2.73 (predicted image popularity)]
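
A hedged sketch of this pipeline with scikit-learn; the feature and label files are hypothetical placeholders for precomputed features and log-scaled view counts.

```python
import numpy as np
from sklearn.svm import SVR

# Image features in, a real-valued popularity score out, via support
# vector regression.
X = np.load("image_features.npy")   # hypothetical file, shape (N, D)
y = np.load("log_view_counts.npy")  # hypothetical file, shape (N,)

reg = SVR(kernel="linear", C=1.0)
reg.fit(X, y)
score = reg.predict(X[:1])  # e.g., array([2.73]) for one image
```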

Predicting popularity

[Bar chart: rank correlation from 0 to 0.4 for GIST, Texture, Color, BoW, Gradient, Deep Learning, and Combined features]

What makes an image popular? Khosla et al, WWW 2014

What makes an image popular?

[Figure: importance of individual colors for predicting popularity]

What makes an image popular? Khosla et al, WWW 2014

What makes an image popular?

[Pipeline: Input Image → object likelihoods → Support Vector Regression → 1.9 (predicted image popularity)]

What makes an image popular?

Medium positive impact: giant panda, ladybug, basketball, plow, cheetah, llama

Strong positive impact: brassiere, revolver, miniskirt, maillot, bikini, cup

Negative impact: spatula, plunger, laptop

http://popularity.csail.mit.edu

Predicting the future

Goal: predict Lucas-Kanade optical flow given just one image!

Credit: Walker et al

Predicting the future

Credit: Walker et al

Learning sequences

Sequences are everywhere…

Credit: Alex Graves, Kevin Gimpel, Dhruv Batra

Even where you might not expect a sequence…

Credit: Dhruv Batra, Vinyals et al.

How do we model sequences?

• It’s a spectrum…

Input: No sequence

Output: No sequence

Example: “standard” classification / regression problems

Input: No sequence

Output: Sequence

Example: Im2Caption

Input: Sequence

Output: No sequence

Example: sentence classification, multiple-choice question answering

Input: Sequence

Output: Sequence

Example: machine translation, video captioning, open-ended question answering, video question answering

Credit: Dhruv Batra, Andrej Karpathy

Recurrent Neural Networks (RNNs)

[Diagrams: a recurrent cell, and the same cell unrolled through time]

Credit: Christopher Olah
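
A minimal numpy sketch of a vanilla RNN step and its unrolling; dimensions are illustrative.

```python
import numpy as np

# The same weights are reused at every time step, and the hidden state h
# carries context forward through the sequence.
def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

D, H, T = 8, 16, 5  # input dim, hidden dim, sequence length
rng = np.random.default_rng(0)
W_xh = 0.1 * rng.standard_normal((D, H))
W_hh = 0.1 * rng.standard_normal((H, H))
b_h = np.zeros(H)

h = np.zeros(H)
for x_t in rng.standard_normal((T, D)):  # unroll over the sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
```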

Long-term dependencies: hard to model!

Credit: Christopher Olah

From plain RNNs to LSTMs

(LSTM: Long Short-Term Memory networks)

Credit: Christopher Olah

LSTMs Intuition: Memory

• Cell State / Memory

Credit: Christopher Olah

LSTMs Intuition: Forget Gate

• Should we continue to remember this “bit” of information or not?

Credit: Christopher Olah

LSTMs Intuition: Input Gate

• Should we update this “bit” of information or not?

• If so, with what?

Credit: Christopher Olah

LSTMs Intuition: Memory Update

• Forget that + memorize this

Credit: Christopher Olah

LSTMs Intuition: Output Gate

• Should we output this “bit” of information to “deeper” layers?

Credit: Christopher Olah

LSTM: A pretty sophisticated cell

Credit: Christopher Olah
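
Putting the gates together, here is a minimal numpy sketch of one LSTM step; stacking all four gate weight matrices into a single W is an implementation convenience, not the only layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step: the forget gate f erases parts of the cell state, the
# input gate i writes the candidate g, and the output gate o decides
# what the cell exposes. W has shape (input_dim + hidden_dim, 4 * hidden_dim).
def lstm_step(x_t, h_prev, c_prev, W, b):
    H = h_prev.shape[0]
    z = np.concatenate([x_t, h_prev]) @ W + b
    i = sigmoid(z[0:H])        # input gate: update this "bit"?
    f = sigmoid(z[H:2*H])      # forget gate: keep remembering?
    o = sigmoid(z[2*H:3*H])    # output gate: expose this "bit"?
    g = np.tanh(z[3*H:4*H])    # candidate memory content
    c = f * c_prev + i * g     # "forget that + memorize this"
    h = o * np.tanh(c)         # filtered view of the memory
    return h, c
```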

Generating image captions

Credit: Vinyals et al
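
A rough PyTorch sketch of the CNN-to-LSTM captioning idea; the sizes, the <start> token id, and using the image feature to initialize the hidden state are all simplifications of the actual model.

```python
import torch
import torch.nn as nn

# A CNN encodes the image into a feature vector that conditions an LSTM,
# which then emits the caption one word at a time (greedy decoding here).
vocab_size, embed_dim, hidden_dim = 10000, 256, 512

cnn_feature = torch.randn(1, hidden_dim)  # stand-in for a CNN's output
embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTMCell(embed_dim, hidden_dim)
to_vocab = nn.Linear(hidden_dim, vocab_size)

h, c = cnn_feature, torch.zeros(1, hidden_dim)  # image conditions the LSTM
word = torch.tensor([0])                        # hypothetical <start> token id
caption = []
for _ in range(20):                             # emit at most 20 words
    h, c = lstm(embed(word), (h, c))
    word = to_vocab(h).argmax(dim=1)
    caption.append(word.item())
```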