Indoor Scene Segmentation using a Structured Light Sensor

Transcript
Page 1: Indoor Scene Segmentation using a Structured Light Sensor

Indoor Scene Segmentation using a Structured Light Sensor

Nathan Silberman and Rob Fergus

ICCV 2011 Workshop on 3D Representation and Recognition

Courant Institute

Page 2: Indoor Scene Segmentation using a Structured Light Sensor

Overview

Indoor Scene Recognition using the Kinect
• Introduce new Indoor Scene Depth Dataset
• Describe CRF-based model
  – Explore the use of RGB/depth cues

Page 3: Indoor Scene Segmentation using a Structured Light Sensor

Motivation
• Indoor scene recognition is hard
  – Far less texture than outdoor scenes
  – More geometric structure

Page 4: Indoor Scene Segmentation using a Structured Light Sensor

Motivation
• Indoor scene recognition is hard
  – Far less texture than outdoor scenes
  – More geometric structure

• Kinect gives us a depth map (and RGB)
  – Direct access to shape and geometry information

Page 5: Indoor Scene Segmentation using a Structured Light Sensor

Overview

Indoor Scene Recognition using the Kinect
• Introduce new Indoor Scene Depth Dataset
• Describe CRF-based model
  – Explore the use of RGB/depth cues

Page 6: Indoor Scene Segmentation using a Structured Light Sensor

Capturing our Dataset

Page 7: Indoor Scene Segmentation using a Structured Light Sensor

Statistics of the Dataset

Scene Type     Number of Scenes   Frames    Labeled Frames *
Bathroom               6            5,588          76
Bedroom               17           22,764         480
Bookstore              3           27,173         784
Cafe                   1            1,933          48
Kitchen               10           12,643         285
Living Room           13           19,262         355
Office                14           19,254         319
Total                 64          108,617       2,347

* Labels obtained via LabelMe

Page 8: Indoor Scene Segmentation using a Structured Light Sensor

Dataset Examples

Living Room

(Panels: RGB | Raw Depth | Labels)

Page 9: Indoor Scene Segmentation using a Structured Light Sensor

Dataset Examples

Living Room

(Panels: RGB | Depth* | Labels)

* Bilateral Filtering used to clean up raw depth image

Page 10: Indoor Scene Segmentation using a Structured Light Sensor

Dataset Examples

Bathroom

(Panels: RGB | Depth | Labels)

Page 11: Indoor Scene Segmentation using a Structured Light Sensor

Dataset Examples

Bedroom

(Panels: RGB | Depth | Labels)

Page 12: Indoor Scene Segmentation using a Structured Light Sensor

Existing Depth Datasets

[1] K. Lai, L. Bo, X. Ren, and D. Fox. A Large-Scale Hierarchical Multi-View RGB-D Object Dataset. ICRA 2011.
[2] B. Liu, S. Gould, and D. Koller. Single Image Depth Estimation from Predicted Semantic Labels. CVPR 2010.

RGB-D Dataset [1]

Stanford Make3d [2]

Page 13: Indoor Scene Segmentation using a Structured Light Sensor

Existing Depth Datasets

[1] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena. Semantic Labeling of 3D Point Clouds for Indoor Scenes. NIPS 2011.
[2] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell. A Category-Level 3-D Object Dataset: Putting the Kinect to Work. ICCV Workshop on Consumer Depth Cameras for Computer Vision, 2011.

(Panels: Point Cloud Data [1] | B3DO [2])

Page 14: Indoor Scene Segmentation using a Structured Light Sensor

Dataset Freely Available
http://cs.nyu.edu/~silberman/nyu_indoor_scenes.html

Page 15: Indoor Scene Segmentation using a Structured Light Sensor

Overview

Indoor Scene Recognition using the Kinect
• Introduce new Indoor Scene Depth Dataset
• Describe CRF-based model
  – Explore the use of RGB/depth cues

Page 16: Indoor Scene Segmentation using a Structured Light Sensor

Segmentation using CRF Model

$$\mathrm{Cost}(\mathrm{labels}) = \sum_{i \,\in\, \mathrm{pixels}} \mathrm{LocalTerm}(\mathrm{label}_i) + \sum_{i,j \,\in\, \mathrm{pairs\ of\ pixels}} \mathrm{SpatialSmoothness}(\mathrm{label}_i, \mathrm{label}_j)$$

• Standard CRF formulation
• Optimized via graph cuts
• Discrete label set (~12 classes)
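The following is a minimal sketch of the objective above, not the authors' implementation: it only evaluates the cost of a candidate labeling on a 4-connected pixel grid, using unary costs (e.g. negative log of the local term) and a plain Potts smoothness penalty. In the paper this cost is minimized over the discrete label set with graph cuts rather than evaluated directly.

```python
import numpy as np

def crf_cost(labels, unary, potts_weight=1.0):
    """Evaluate the CRF cost of a candidate labeling on a 4-connected grid.

    labels : (H, W) integer array of class indices
    unary  : (H, W, C) array of local costs, e.g. -log(Appearance x Location)
    potts_weight : penalty added whenever two adjacent pixels disagree
    """
    H, W = labels.shape
    # Local terms: sum the cost of the chosen label at every pixel.
    local = unary[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()

    # Spatial smoothness (Potts): count disagreements between horizontally
    # and vertically adjacent pixel pairs.
    disagree = (labels[:, 1:] != labels[:, :-1]).sum() + \
               (labels[1:, :] != labels[:-1, :]).sum()
    return local + potts_weight * disagree
```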

Page 17: Indoor Scene Segmentation using a Structured Light Sensor

Model

$$\mathrm{Cost}(\mathrm{labels}) = \sum_{i \,\in\, \mathrm{pixels}} \mathrm{LocalTerm}(\mathrm{label}_i) + \sum_{i,j \,\in\, \mathrm{pairs\ of\ pixels}} \mathrm{SpatialSmoothness}(\mathrm{label}_i, \mathrm{label}_j)$$

$$\mathrm{LocalTerm}(\mathrm{label}_i) = \mathrm{Appearance}(\mathrm{label}_i \mid \mathrm{descriptor}_i) \cdot \mathrm{Location}(i)$$

Page 18: Indoor Scene Segmentation using a Structured Light Sensor

Model

$$\mathrm{Cost}(\mathrm{labels}) = \sum_{i \,\in\, \mathrm{pixels}} \mathrm{LocalTerm}(\mathrm{label}_i) + \sum_{i,j \,\in\, \mathrm{pairs\ of\ pixels}} \mathrm{SpatialSmoothness}(\mathrm{label}_i, \mathrm{label}_j)$$

$$\mathrm{LocalTerm}(\mathrm{label}_i) = \mathrm{Appearance}(\mathrm{label}_i \mid \mathrm{descriptor}_i) \cdot \mathrm{Location}(i)$$

Page 19: Indoor Scene Segmentation using a Structured Light Sensor

Appearance Term

Appearance(label i | descriptor i)

Several Descriptor Types to choose from:
• RGB-SIFT
• Depth-SIFT
• Depth-SPIN
• RGBD-SIFT
• RGB-SIFT/D-SPIN

Page 20: Indoor Scene Segmentation using a Structured Light Sensor

Descriptor Type: RGB-SIFT
• 128-D SIFT descriptor
• Computed on the RGB image from the Kinect
• Extracted over a discrete grid

Page 21: Indoor Scene Segmentation using a Structured Light Sensor

Descriptor Type: Depth-SIFT
• 128-D SIFT descriptor
• Computed on the depth image from the Kinect, with linear scaling
• Extracted over a discrete grid

Page 22: Indoor Scene Segmentation using a Structured Light Sensor

Descriptor Type: Depth-SPIN
• 50-D spin image descriptor (radius and depth bins)
• Computed on the depth image from the Kinect, with linear scaling
• Extracted over a discrete grid

A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE PAMI, 21(5):433–449, 1999.

Page 23: Indoor Scene Segmentation using a Structured Light Sensor

Descriptor Type: RGBD-SIFT
• Concatenation of SIFT on the RGB image from the Kinect and SIFT on the depth image (with linear scaling)
• 256-D

Page 24: Indoor Scene Segmentation using a Structured Light Sensor

Descriptor Type: RGB-SIFT/D-SPIN
• Concatenation of SIFT on the RGB image from the Kinect and SPIN on the depth image (with linear scaling)
• 178-D
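As a small illustration (the SIFT/SPIN extraction step is omitted and the arrays below are random placeholders, not the authors' code), the concatenated descriptors simply stack the per-location vectors computed over the same grid, which is where the 256-D and 178-D sizes come from:

```python
import numpy as np

def concat_descriptors(rgb_sift, depth_feat):
    """Stack per-location descriptors computed over the same discrete grid.

    rgb_sift   : (N, 128) RGB-SIFT descriptors
    depth_feat : (N, 128) Depth-SIFT or (N, 50) Depth-SPIN descriptors
    Returns (N, 256) RGBD-SIFT or (N, 178) RGB-SIFT/D-SPIN descriptors.
    """
    assert rgb_sift.shape[0] == depth_feat.shape[0], "grids must match"
    return np.hstack([rgb_sift, depth_feat])

# Placeholder arrays standing in for descriptors at 1000 grid locations.
rgb = np.random.rand(1000, 128)
spin = np.random.rand(1000, 50)
print(concat_descriptors(rgb, spin).shape)  # (1000, 178)
```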

Page 25: Indoor Scene Segmentation using a Structured Light Sensor

Appearance Model

Descriptor at each location

Appearance(label i | descriptor i)
• Modeled by a neural network with a single hidden layer

Page 26: Indoor Scene Segmentation using a Structured Light Sensor

Appearance Model

Descriptor at each location

Appearance(label i | descriptor i)

Network architecture: 128/178/256-D input → 1000-D hidden layer → softmax output layer over 13 classes

Page 27: Indoor Scene Segmentation using a Structured Light Sensor

Appearance Model

Network architecture: 128/178/256-D input → 1000-D hidden layer → 13 classes

Descriptor at each location

Probability Distribution over classes

Appearance(label i | descriptor i)

Interpreted as p(label | descriptor)

Page 28: Indoor Scene Segmentation using a Structured Light Sensor

Appearance Model

Network architecture: 128/178/256-D input → 1000-D hidden layer → 13 classes

Descriptor at each location

Probability Distribution over classes

Appearance(label i | descriptor i)

Trained with backpropagation
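Below is a minimal numpy sketch of a network with the shape described on these slides (the input size depends on the descriptor; the tanh nonlinearity, the random initialization, and the missing backpropagation training loop are assumptions on my part). It maps a batch of descriptors to a softmax distribution over the 13 classes, which is interpreted as p(label | descriptor):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the slides; the input size depends on the descriptor used.
D_IN, D_HIDDEN, N_CLASSES = 256, 1000, 13   # e.g. RGBD-SIFT input

# Randomly initialized parameters; in the paper these are learned with backprop.
W1 = rng.normal(0, 0.01, (D_IN, D_HIDDEN))
b1 = np.zeros(D_HIDDEN)
W2 = rng.normal(0, 0.01, (D_HIDDEN, N_CLASSES))
b2 = np.zeros(N_CLASSES)

def appearance(descriptors):
    """Map (N, D_IN) descriptors to (N, 13) class probabilities."""
    h = np.tanh(descriptors @ W1 + b1)           # single hidden layer
    logits = h @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)      # softmax: p(label | descriptor)

probs = appearance(rng.normal(size=(4, D_IN)))
print(probs.shape, probs.sum(axis=1))            # (4, 13), each row sums to 1
```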

Page 29: Indoor Scene Segmentation using a Structured Light Sensor

Model

$$\mathrm{Cost}(\mathrm{labels}) = \sum_{i \,\in\, \mathrm{pixels}} \mathrm{LocalTerm}(\mathrm{label}_i) + \sum_{i,j \,\in\, \mathrm{pairs\ of\ pixels}} \mathrm{SpatialSmoothness}(\mathrm{label}_i, \mathrm{label}_j)$$

$$\mathrm{LocalTerm}(\mathrm{label}_i) = \mathrm{Appearance}(\mathrm{label}_i \mid \mathrm{descriptor}_i) \cdot \mathrm{Location}(i)$$

Page 30: Indoor Scene Segmentation using a Structured Light Sensor

Model

$$\mathrm{Cost}(\mathrm{labels}) = \sum_{i \,\in\, \mathrm{pixels}} \mathrm{LocalTerm}(\mathrm{label}_i) + \sum_{i,j \,\in\, \mathrm{pairs\ of\ pixels}} \mathrm{SpatialSmoothness}(\mathrm{label}_i, \mathrm{label}_j)$$

$$\mathrm{LocalTerm}(\mathrm{label}_i) = \mathrm{Appearance}(\mathrm{label}_i \mid \mathrm{descriptor}_i) \cdot \mathrm{Location}(i)$$

Page 31: Indoor Scene Segmentation using a Structured Light Sensor

Model

$$\mathrm{Cost}(\mathrm{labels}) = \sum_{i \,\in\, \mathrm{pixels}} \mathrm{LocalTerm}(\mathrm{label}_i) + \sum_{i,j \,\in\, \mathrm{pairs\ of\ pixels}} \mathrm{SpatialSmoothness}(\mathrm{label}_i, \mathrm{label}_j)$$

$$\mathrm{LocalTerm}(\mathrm{label}_i) = \mathrm{Appearance}(\mathrm{label}_i \mid \mathrm{descriptor}_i) \cdot \mathrm{Location}(i)$$

Location(i): 2D Priors / 3D Priors

Page 32: Indoor Scene Segmentation using a Structured Light Sensor

Location Priors: 2D

• 2D Priors are histograms of P(class, location)
• Smoothed to avoid image-specific artifacts
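A small sketch of how such priors could be built from labeled training frames (the smoothing bandwidth and the use of scipy's Gaussian filter are assumptions; the slide only states that the histograms are smoothed):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def location_priors_2d(label_maps, n_classes, sigma=10.0):
    """Build smoothed 2D location priors from labeled training frames.

    label_maps : list of (H, W) integer arrays of ground-truth class indices
    Returns a (C, H, W) array where priors[c] approximates P(class=c, location),
    smoothed spatially to avoid image-specific artifacts.
    """
    H, W = label_maps[0].shape
    counts = np.zeros((n_classes, H, W))
    for lm in label_maps:
        for c in range(n_classes):
            counts[c] += (lm == c)
    counts = gaussian_filter(counts, sigma=(0, sigma, sigma))  # smooth in x, y only
    return counts / counts.sum()                               # joint distribution
```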

Page 33: Indoor Scene Segmentation using a Structured Light Sensor

Motivation: 3D Location Priors

• 2D Priors don't capture 3D geometry
• 3D Priors can be built from depth data

• Rooms are of different shapes and sizes; how do we align them?

Page 34: Indoor Scene Segmentation using a Structured Light Sensor

Motivation: 3D Location Priors

• To align rooms, we’ll use a normalized cylindrical coordinate system:

Band of maximum depths along each vertical scanline
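A rough sketch of the normalization idea (the paper's full cylindrical construction is more involved; this only shows the per-scanline depth normalization suggested above, with the band of per-column maximum depths as the reference):

```python
import numpy as np

def relative_depth(depth):
    """Normalize a (H, W) depth map column by column.

    Each vertical scanline (image column) is divided by its maximum depth,
    so 0 is at the camera and 1 lies on the band of maximum depths
    (roughly the far walls), independent of the room's absolute size.
    """
    col_max = depth.max(axis=0, keepdims=True)   # (1, W) band of maxima
    return depth / np.maximum(col_max, 1e-6)     # avoid divide-by-zero
```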

Page 35: Indoor Scene Segmentation using a Structured Light Sensor

Relative Depth Distributions

(Plots of density versus relative depth, from 0 to 1, for four classes: Table, Television, Bed, Wall.)

Page 36: Indoor Scene Segmentation using a Structured Light Sensor

Location Priors: 3D

Page 37: Indoor Scene Segmentation using a Structured Light Sensor

Model

$$\mathrm{Cost}(\mathrm{labels}) = \sum_{i \,\in\, \mathrm{pixels}} \mathrm{LocalTerm}(\mathrm{label}_i) + \sum_{i,j \,\in\, \mathrm{pairs\ of\ pixels}} \mathrm{SpatialSmoothness}(\mathrm{label}_i, \mathrm{label}_j)$$

$$\mathrm{LocalTerm}(\mathrm{label}_i) = \mathrm{Appearance}(\mathrm{label}_i \mid \mathrm{descriptor}_i) \cdot \mathrm{Location}(i)$$

Location(i): 2D Priors / 3D Priors

Page 38: Indoor Scene Segmentation using a Structured Light Sensor

Model

$$\mathrm{Cost}(\mathrm{labels}) = \sum_{i \,\in\, \mathrm{pixels}} \mathrm{LocalTerm}(\mathrm{label}_i) + \sum_{i,j \,\in\, \mathrm{pairs\ of\ pixels}} \mathrm{SpatialSmoothness}(\mathrm{label}_i, \mathrm{label}_j)$$

Spatial Smoothness: penalty for adjacent labels disagreeing (standard Potts model)

Page 39: Indoor Scene Segmentation using a Structured Light Sensor

Model

$$\mathrm{Cost}(\mathrm{labels}) = \sum_{i \,\in\, \mathrm{pixels}} \mathrm{LocalTerm}(\mathrm{label}_i) + \sum_{i,j \,\in\, \mathrm{pairs\ of\ pixels}} \mathrm{SpatialSmoothness}(\mathrm{label}_i, \mathrm{label}_j)$$

Spatial modulation of smoothness (see the sketch after this list):
• None
• RGB Edges
• Depth Edges
• RGB + Depth Edges
• Superpixel Edges
• Superpixel + RGB Edges
• Superpixel + Depth Edges
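For illustration, an edge-based modulation of the Potts penalty might look like the sketch below (the exponential contrast-sensitive form and the parameter values are assumptions; the slide only lists which cues can modulate the smoothness). The resulting weights would multiply the disagreement penalty between neighboring pixels, so label changes become cheaper across strong RGB or depth edges:

```python
import numpy as np

def pairwise_weights(rgb, depth, beta_rgb=10.0, beta_depth=2.0):
    """Contrast-sensitive smoothness weights for horizontal neighbor pairs.

    rgb   : (H, W, 3) float image in [0, 1]
    depth : (H, W) float depth map (meters)
    Returns (H, W-1) weights: close to 1 in flat regions, near 0 across
    strong RGB or depth edges, relaxing the Potts penalty at boundaries.
    """
    d_rgb = np.sum((rgb[:, 1:] - rgb[:, :-1]) ** 2, axis=2)
    d_depth = (depth[:, 1:] - depth[:, :-1]) ** 2
    return np.exp(-beta_rgb * d_rgb - beta_depth * d_depth)
```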

Page 40: Indoor Scene Segmentation using a Structured Light Sensor

Experimental Setup

• 60% train (~1,408 images), 40% test (~939 images)
• 10-fold cross-validation
• Images from the same scene never appear in both the train and test splits
• Performance criterion is pixel-level classification accuracy, the mean diagonal of the confusion matrix (see the sketch after this list)
• 12 most common classes, plus 1 background class (from the rest)
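A short sketch of the stated criterion, the mean of the diagonal of the row-normalized confusion matrix (i.e. average per-class pixel accuracy):

```python
import numpy as np

def mean_diagonal_accuracy(pred, gt, n_classes):
    """Pixel-level accuracy as the mean diagonal of the row-normalized
    confusion matrix (average per-class recall).

    pred, gt : flat integer arrays of predicted / ground-truth pixel labels
    """
    conf = np.zeros((n_classes, n_classes))
    np.add.at(conf, (gt, pred), 1)             # count (gt, pred) pairs
    row_sums = conf.sum(axis=1, keepdims=True)
    conf = conf / np.maximum(row_sums, 1)      # normalize each ground-truth row
    return np.diag(conf).mean()
```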

Page 41: Indoor Scene Segmentation using a Structured Light Sensor

Evaluating Descriptors

(Bar chart: pixel-level accuracy in percent, axis from 30 to 50, for the unary model and the full CRF across descriptors RGB-SIFT, Depth-SIFT, Depth-SPIN, RGBD-SIFT, and RGB-SIFT/D-SPIN, grouped into 2D and 3D descriptors.)

Page 42: Indoor Scene Segmentation using a Structured Light Sensor

Evaluating Location Priors

(Bar chart: pixel-level accuracy in percent, axis from 30 to 55, for the unary model and the full CRF across RGB-SIFT, RGB-SIFT + 2D Priors, RGBD-SIFT, RGBD-SIFT + 2D Priors, RGBD-SIFT + 3D Priors, and RGBD-SIFT + 3D Priors (abs), grouped into 2D and 3D descriptors.)

Page 43: Indoor Scene Segmentation using a Structured Light Sensor
Page 44: Indoor Scene Segmentation using a Structured Light Sensor
Page 45: Indoor Scene Segmentation using a Structured Light Sensor

Conclusion

• Kinect depth signal helps scene parsing
• Still a long way from great performance
• Shown standard approaches on RGB-D data
• Lots of potential for more sophisticated methods
• No complicated geometric reasoning
• http://cs.nyu.edu/~silberman/nyu_indoor_scenes.html

Page 46: Indoor Scene Segmentation using a Structured Light Sensor

Preprocessing the Data

[1] N. Burrus. Kinect RGB Demo v0.4.0. http://nicolas.burrus.name/index.php/Research/KinectRgbDemoV4?from=Research.KinectRgbDemoV2, Feb. 2011

We use open-source calibration software [1] to infer:
• Parameters of the RGB and depth cameras
• Homography between the cameras
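As a small illustration (the variable names and the calibration tool's output format are assumptions), once the homography between the two cameras is known, depth-image pixel coordinates can be mapped into the RGB frame with a standard homogeneous transform:

```python
import numpy as np

def warp_points(H, points):
    """Map (N, 2) pixel coordinates through a 3x3 homography H.

    Used to align depth-image pixels with their RGB-image locations
    once the inter-camera homography has been estimated by calibration.
    """
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]                   # back to 2D
```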

Page 47: Indoor Scene Segmentation using a Structured Light Sensor

Preprocessing the Data

• Bilateral filter used to diffuse depth across regions of similar RGB intensity

• Naïve GPU implementation runs in ~100 ms
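A naive CPU sketch of the idea (the window size and kernel parameters are arbitrary assumptions; the authors use a GPU implementation): each depth value is replaced by a weighted average of nearby valid depths, with weights combining spatial proximity and similarity of the guiding intensity image, so depth diffuses within regions of similar color but not across color edges:

```python
import numpy as np

def cross_bilateral_depth(depth, gray, radius=5, sigma_s=3.0, sigma_r=0.1):
    """Naive cross-bilateral filter: smooth/diffuse depth guided by intensity.

    depth : (H, W) depth map (0 where the sensor returned no reading)
    gray  : (H, W) intensity image in [0, 1] used as the guidance signal
    """
    H, W = depth.shape
    out = np.zeros_like(depth, dtype=float)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - radius), min(H, y + radius + 1)
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            d = depth[y0:y1, x0:x1]
            g = gray[y0:y1, x0:x1]
            sp = spatial[y0 - y + radius:y1 - y + radius,
                         x0 - x + radius:x1 - x + radius]
            # Weight by spatial distance, intensity similarity, and validity.
            w = sp * np.exp(-(g - gray[y, x])**2 / (2 * sigma_r**2)) * (d > 0)
            out[y, x] = (w * d).sum() / w.sum() if w.sum() > 0 else depth[y, x]
    return out
```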

Page 48: Indoor Scene Segmentation using a Structured Light Sensor

Motivation

Results from spatial pyramid-based classification [1] using 5 indoor scene types. Contrast this with the 81% achieved by [1] on a 13-class (mostly outdoor) scene dataset; they note similar confusion within indoor scenes.

[1] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.