
A Grand Unifying Architecture for Scene Understanding
Marc Eder
March 23, 2016

Eigen, David, and Rob Fergus. “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture.” Proceedings of the IEEE International Conference on Computer Vision. 2015.

"Any fool can know. The point is to understand."
- Attributed (unverifiably) to Albert Einstein

The 4 C’s of Scene Understanding
• Content – What is in the scene?
• Composition – How is the content laid out?
• Configuration – What is the scene’s spatial layout?
• Context – What’s the past/present/future of the scene?

One Ring to Rule Them All
• Eigen and Fergus propose a single network for multiple understanding tasks [1]
• Architecture has state-of-the-art performance on 3 out of 4 C’s
▫ Doesn’t address scene context
• An earlier version scored top results for depth map estimation from RGB and surface normal estimation from RGB in the "Reconstruction Meets Recognition Challenge" at ECCV 2014 [2]

A Brief History of Vision (Before neural networks)

Detecting Content
• Ex. object detection, identification, recognition
• Templates and other appearance-based models
▫ (1991) Turk, Matthew, and Alex Pentland. "Eigenfaces for recognition."
• Low-level feature-based approaches
▫ (1999) Lowe, David G. "Object recognition from local scale-invariant features."
▫ (2001) Viola, Paul, and Michael Jones. "Rapid object detection using a boosted cascade of simple features."
• Intermediate-feature-based methods
▫ (2008) Felzenszwalb, Pedro, et al. "A discriminatively trained, multiscale, deformable part model."
▫ (2009) Kumar, Neeraj, et al. "Attribute and simile classifiers for face verification."

Determining Scene Composition
• Ex. object localization, semantic segmentation
• Bottom-up, low-level feature approaches
▫ (1985) Haralick, Robert M., and Linda G. Shapiro. "Image segmentation techniques."
▫ (1999) Comaniciu, Dorin, and Peter Meer. "Mean shift analysis and applications."
• Top-down, Gestalt-inspired segmentation
▫ (1997/2000) Shi, Jianbo, and Jitendra Malik. "Normalized cuts and image segmentation."
▫ (2003) Ren, Xiaofeng, and Jitendra Malik. "Learning a classification model for segmentation."
• Joint inference (things and stuff)
▫ (2004) Torralba, Antonio, Kevin P. Murphy, and William T. Freeman. "Contextual models for object detection using boosted random fields."
▫ (2013) Tighe, Joseph, and Svetlana Lazebnik. "Finding things: Image parsing with regions and per-exemplar detectors."

Estimating Scene Configuration
• Ex. 3D structure, depth estimation
• Robust estimation
▫ (1981) Fischler, Martin A., and Robert C. Bolles. "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography."
• Leveraging multiple views
▫ (1987) Longuet-Higgins, H. Christopher. "A computer algorithm for reconstructing a scene from two projections."
▫ (2004) Nistér, David. "An efficient solution to the five-point relative pose problem."
• Structure from Motion (SfM) and dozens of other applications
▫ (2003) Hartley, Richard, and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press.

Welcome to the New Age

(Where neural networks solve all of your problems)

LeNet from http://deeplearning.net/tutorial/lenet.html

ImageNet (SuperVision) from http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf

R-CNN from Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.

Convolutional Neural Networks
• Detecting Content
▫ (2014) Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition."
• Determining Scene Composition
▫ (2014) Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation."
• Estimating Scene Configuration
▫ (2015) Flynn, John, et al. "DeepStereo: Learning to Predict New Views from the World's Imagery."

A New Way to Neural Net

"If you want something new, you have to stop doing something old."
- Peter Drucker

Single Multi-Purpose Architecture
• Pixel-map regression is a common task for many applications
• Shifts development focus to defining the proper training set and cost function
• Shares information across modalities
▫ e.g. (Segmentation OR Depth) vs. (Segmentation AND Depth)

Stack Trace
"Study the past if you wish to divine the future"
- Confucius

Single-Image Depth Prediction
• Rooted in stereo depth estimation
▫ Deterministic
▫ Requires multiple views with static scenes, a proper baseline, the right amount of overlap, …
• Monocular depth is hard
▫ Non-deterministic
▫ No verification
▫ Must handle both local and global scale ambiguity

A Long Long Time Ago…
• (2005) Saxena, Ashutosh, Sung H. Chung, and Andrew Y. Ng. "Learning depth from single monocular images." [3]
• Used textural and haze features in an MRF to estimate depth maps
• Premise: "The depth of a particular patch depends on the features of the patch, but is also related to the depths of other parts of the image"
• Features: 9 Laws' texture energy masks and 6 oriented gradient filters
Filter images from [3]

A Long Long Time Ago… RESULTS
• Column 1: Original image
• Column 2: Ground truth
• Column 3: Gaussian model prediction
• Column 4: Laplacian model prediction (more computationally efficient due to linear programming)

Image from [3]

Recent History
• (2014) Eigen, David, Christian Puhrsch, and Rob Fergus. "Depth map prediction from a single image using a multi-scale deep network." [2]
• Direct precursor to [1]
• Won the "Reconstruction Meets Recognition Challenge" in depth map and surface normal estimation at ECCV 2014
• First introduced the multi-scale approach used in [1]

Recent History
Coarse Network
• Convolutions and max pooling reduce the spatial dimension of global image information
• Useful for the network to learn vanishing points, alignment, and object locations
• Final layers are fully connected so the output sees the full image, but at ¼ scale
Fine Network
• Refines the coarse output
• Each unit operates over only a patch of the scene
• Returns a depth map at ¼ scale
Image from [2]

Recent History
• Investigated the importance of a proper loss function
• Used a scale-invariant MSE in log space (implemented in the sketch below):
D(y, y*) = (1/n) Σᵢ dᵢ² − (λ/n²) (Σᵢ dᵢ)², where dᵢ = log yᵢ − log y*ᵢ
y := predicted pixel depth
y* := ground truth pixel depth
• For λ = 1, the error vanishes exactly when every pair of predicted pixel depths differs by the same amount as the corresponding ground-truth pair, so a global scale shift costs nothing
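A minimal sketch of that loss, written in PyTorch (which postdates the paper); the function name is mine, with λ = 0.5 as used in [2]:

```python
import torch

def scale_invariant_log_mse(y_pred, y_true, lam=0.5):
    """Scale-invariant MSE in log space from [2]; lam = 0.5 as in the paper."""
    d = torch.log(y_pred) - torch.log(y_true)   # per-pixel log-depth error
    n = d.numel()
    # First term: plain MSE of log depths. Second term: forgives the error
    # component shared by all pixels, i.e. a global scale shift.
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / n ** 2
```

Setting lam=0 recovers plain log-space MSE; lam=1 makes the loss fully scale-invariant.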

Recent History RESULTS
• Top – NYUDepth v2 [4]
• Bottom – KITTI [5]
• A: Input image
• B: Coarse network output
• C: Fine network output
• D: Ground truth
Image from [2]

Back to the Present (Or at least December 2015)

Predicting Depth, Normals, and Semantic Segmentation
• Generalization of the architecture from the 2014 paper [2]
• Adds an extra scale to the pipeline
• More convolutional layers
• Only one output layer
▫ Passes feature maps from scale to scale instead of coarse predictions
▫ Simplifies training: the net can now be trained (mostly) jointly

Coarse-to-Fine Approach

Images from [1]

Model Comparison

ECCV 2014 [2] vs. ICCV 2015 [1]

The Architecture
• Coarse block is nearly identical to [2], except deeper
▫ Trained in two sizes: AlexNet [6] and VGG [7]
• Mid-level resolution block builds on the global output from the coarse block
▫ Concatenates coarse features with a single layer of finer-stride convolution/pooling
▫ Continues processing features at mid-level resolution
• Highest-resolution block does the same as the mid-level block, but at a yet finer stride, aligning the output to a higher resolution (see the sketch below)
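A simplified PyTorch sketch of this coarse-to-fine, feature-passing design. Layer counts, channel widths, and kernel sizes are placeholders rather than the paper's values; the point is how each scale concatenates the previous scale's upsampled feature maps with a finer-stride view of the raw input:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleNet(nn.Module):
    """Three-scale coarse-to-fine network; widths/depths are placeholders."""

    def __init__(self, out_channels):
        super().__init__()
        # Scale 1: coarse, AlexNet/VGG-style view of the whole image.
        self.coarse = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        # Scales 2 and 3 each mix a finer-stride convolution of the raw
        # input with the upsampled feature maps of the previous scale.
        self.mid_in = nn.Conv2d(3, 32, 5, stride=2, padding=2)
        self.mid = nn.Conv2d(64 + 32, 64, 3, padding=1)
        self.fine_in = nn.Conv2d(3, 32, 5, stride=1, padding=2)
        self.fine = nn.Conv2d(64 + 32, out_channels, 3, padding=1)

    def forward(self, x):
        # x: (N, 3, H, W) with H, W divisible by 4
        f1 = self.coarse(x)                          # 1/4 resolution
        f1 = F.interpolate(f1, scale_factor=2)       # up to 1/2
        f2 = F.relu(self.mid(torch.cat([f1, F.relu(self.mid_in(x))], dim=1)))
        f2 = F.interpolate(f2, scale_factor=2)       # up to full resolution
        return self.fine(torch.cat([f2, F.relu(self.fine_in(x))], dim=1))
```

Passing feature maps rather than predictions between scales is what lets the later scales refine instead of merely upsampling; out_channels would be 1 for depth, 3 for normals, or the number of classes for segmentation.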

The Training Procedure
• Scales 1 and 2 are trained jointly by SGD
• Scale 3 is subsequently trained with scales 1 and 2 held fixed (see the sketch below)
• At scale 3, random 74x55 crops are used
▫ Taken from the outputs of scales 1 and 2 and from the original input
• All 3 tasks use roughly the same initialization and learning rates at each layer
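A hedged PyTorch sketch of that two-stage schedule. scale12 and scale3 are hypothetical module names for the scale-1/2 and scale-3 blocks; loader, loss_fn, and the hyperparameters are assumptions, and the random 74x55 cropping is omitted for brevity:

```python
import torch

def train_two_stage(scale12, scale3, loader, loss_fn, epochs=1):
    # Stage 1: train scales 1 and 2 jointly by SGD.
    opt = torch.optim.SGD(scale12.parameters(), lr=1e-3, momentum=0.9)
    for _ in range(epochs):
        for x, target in loader:
            opt.zero_grad()
            loss_fn(scale12(x), target).backward()
            opt.step()

    # Stage 2: freeze scales 1 and 2, then train scale 3 on their outputs.
    for p in scale12.parameters():
        p.requires_grad_(False)
    opt = torch.optim.SGD(scale3.parameters(), lr=1e-3, momentum=0.9)
    for _ in range(epochs):
        for x, target in loader:
            opt.zero_grad()
            with torch.no_grad():
                feats = scale12(x)            # held fixed, no gradients
            loss_fn(scale3(x, feats), target).backward()
            opt.step()
```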

Now to the tasks at hand…

"It is a marvelous pain to find out but a short way by long wandering."
- Roger Ascham, "The Schoolmaster"

But first!
• Remember: this paper aims to create a multi-purpose architecture
• Each task is thus defined only by its training set and loss function

Task 1: Depth Estimation
• Train network on NYUDepth v2
• Similar loss to [2], but with an added gradient-matching term (sketched below)
• Better results using VGG than AlexNet
▫ Attributed to larger model size
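A minimal sketch of that loss, assuming PyTorch and the formulation in [1]: the scale-invariant term from [2] with λ = 1/2, plus first-order gradient matching on the log-depth difference map. Forward differences stand in for the paper's exact gradient stencil:

```python
import torch

def depth_loss(log_pred, log_true):
    """Scale-invariant log MSE plus gradient matching, following [1]."""
    d = log_pred - log_true                 # (H, W) log-depth difference map
    n = d.numel()
    scale_inv = (d ** 2).sum() / n - 0.5 * d.sum() ** 2 / n ** 2
    # Gradient term: predicted depth edges should line up with true edges.
    dx = d[:, 1:] - d[:, :-1]               # horizontal forward differences
    dy = d[1:, :] - d[:-1, :]               # vertical forward differences
    grad = ((dx ** 2).sum() + (dy ** 2).sum()) / n
    return scale_inv + grad
```

The gradient term is what drives the sharper depth boundaries shown in the results below.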

Task 1: Depth Estimation RESULTS
• A: RGB input
• B: Result of [2]
• C: Output of multipurpose net
• D: Ground truth
• Sharpness improvement over [2]
• Substantial numerical improvement against peer papers as well
Image from [1]

Task 2: Surface Normals
• Again, train network on NYUDepth v2
▫ Actually trained jointly with depth estimation
▫ Common scale 1, but separate scales 2 and 3
• Loss function: elementwise inner product with the GT normals (sketched below)
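A minimal sketch, assuming PyTorch: the loss in [1] is the negative mean per-pixel dot product between predicted and ground-truth normals, with both normalized to unit length:

```python
import torch
import torch.nn.functional as F

def normals_loss(pred, gt):
    """Negative mean per-pixel inner product of unit normals, as in [1]."""
    pred = F.normalize(pred, dim=0)   # (3, H, W): unit 3-vector per pixel
    gt = F.normalize(gt, dim=0)
    # The per-pixel dot product lies in [-1, 1]; minimizing its negative
    # mean drives predicted normals to align with the ground truth.
    return -(pred * gt).sum(dim=0).mean()
```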

Task 2: Surface Normals

Comparison of surface normal results

Image from [1]

Task 3: Semantic Segmentation
• Once again, train network on NYUDepth v2
▫ This time, data is RGB-D, so GT depth and normals are used as extra input channels
• Separate filters applied to each input type
• Network initialized on ImageNet
• Loss function: pixelwise cross-entropy (i.e. multiclass logistic loss), sketched below:
L(C, C*) = −(1/n) Σᵢ C*ᵢ log(Cᵢ)
where Cᵢ := softmax class prediction at pixel i and C*ᵢ := one-hot ground-truth label
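A minimal sketch, assuming PyTorch: a softmax over per-pixel class logits, then the negative log-likelihood of the true class at each pixel:

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits, labels):
    """Pixelwise cross-entropy: logits (C, H, W), labels (H, W) class ids."""
    log_probs = F.log_softmax(logits, dim=0)      # softmax in log space
    # nll_loss expects a batch dimension and averages over all pixels.
    return F.nll_loss(log_probs.unsqueeze(0), labels.unsqueeze(0))
```

F.cross_entropy fuses both steps; they are separated here to mirror the softmax definition on the slide.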

Task 3: Semantic Segmentation RESULTS
• Tested on NYUDepth v2 (top) as well as Pascal VOC 2011 (bottom)
• For Pascal VOC 2011:
▫ A: Original RGB input
▫ B: Predicted labeling
▫ C: Ground truth
Image from [1]

Wrap Up and Discussion (All good things must end)

Empirical Multi-Scale Effects
• Improvement increases as more scales are added
• Coarse scale is most important for depth and normal estimation
• Mid-level scale is most important for segmentation
▫ But when only using RGB, the coarse scale contributes more

Discussion
• Does neural net size matter?
▫ Larger model → better results
▫ But network size is fixed across tasks…
• How important is depth/normal information for segmentation?
▫ RGB-D with GT depth and normals > RGB alone
▫ RGB-D with predicted depth and normals ≈ RGB alone
▫ Using predicted depth and normals only improves results when the scale 1 block is left out
• Is this true scene understanding?
▫ "Information is not knowledge" (also unverifiably attributed to Einstein)

Any comments or questions?

Selected References

[1] Eigen, David, and Rob Fergus. "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture." Proceedings of the IEEE International Conference on Computer Vision. 2015.

[2] Eigen, David, Christian Puhrsch, and Rob Fergus. "Depth map prediction from a single image using a multi-scale deep network." Advances in Neural Information Processing Systems. 2014.

[3] Saxena, Ashutosh, Sung H. Chung, and Andrew Y. Ng. "Learning depth from single monocular images." Advances in Neural Information Processing Systems. 2005.

[4] Silberman, Nathan, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. "Indoor segmentation and support inference from RGBD images." European Conference on Computer Vision. 2012.

[5] Geiger, Andreas, Philip Lenz, and Raquel Urtasun. "Are we ready for autonomous driving? The KITTI vision benchmark suite." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.

[6] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.

[7] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).