Indoor Scene Segmentation using a Structured Light Sensor
Nathan Silberman and Rob Fergus
ICCV 2011 Workshop on 3D Representation and Recognition
Courant Institute
Overview
Indoor Scene Recognition using the Kinect
• Introduce new Indoor Scene Depth Dataset
• Describe CRF-based model
  – Explore the use of RGB/depth cues
Motivation
• Indoor scene recognition is hard
  – Far less texture than outdoor scenes
  – More geometric structure
• Kinect gives us a depth map (and RGB)
  – Direct access to shape and geometry information
Capturing our Dataset
Statistics of the Dataset

Scene Type    Scenes   Frames    Labeled Frames*
Bathroom           6    5,588        76
Bedroom           17   22,764       480
Bookstore          3   27,173       784
Cafe               1    1,933        48
Kitchen           10   12,643       285
Living Room       13   19,262       355
Office            14   19,254       319
Total             64  108,617     2,347

* Labels obtained via LabelMe
Dataset Examples: Living Room
[Figure: RGB, raw depth, and label images]
Dataset Examples: Living Room
[Figure: RGB, depth*, and label images]
* Bilateral filtering used to clean up the raw depth image
Dataset Examples: Bathroom
[Figure: RGB, depth, and label images]
Dataset Examples: Bedroom
[Figure: RGB, depth, and label images]
Existing Depth Datasets
• RGB-D Dataset [1]
• Stanford Make3D [2]

[1] K. Lai, L. Bo, X. Ren, and D. Fox. A Large-Scale Hierarchical Multi-View RGB-D Object Dataset. ICRA 2011.
[2] B. Liu, S. Gould, and D. Koller. Single Image Depth Estimation from Predicted Semantic Labels. CVPR 2010.
Existing Depth Datasets
• Point Cloud Data [1]
• B3DO [2]

[1] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena. Semantic Labeling of 3D Point Clouds for Indoor Scenes. NIPS 2011.
[2] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell. A Category-Level 3-D Object Dataset: Putting the Kinect to Work. ICCV Workshop on Consumer Depth Cameras for Computer Vision, 2011.
Dataset Freely Available: http://cs.nyu.edu/~silberman/nyu_indoor_scenes.html
Segmentation using a CRF

Cost(labels) = Σ_{i ∈ pixels} LocalTerm(label_i)
             + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label_i, label_j)

• Standard CRF formulation
• Optimized via graph cuts
• Discrete label set (~12 classes)
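To make the energy concrete, here is a minimal sketch (not the authors' code) that evaluates this cost for one candidate labeling, using a constant Potts penalty for the smoothness term; the array layout and the `potts_weight` parameter are illustrative assumptions:

```python
import numpy as np

def crf_cost(labels, unary_cost, potts_weight=1.0):
    """Evaluate the CRF energy for one candidate labeling.

    labels:     (H, W) int array, one class index per pixel
    unary_cost: (H, W, K) array, local cost of each of K classes per pixel
    """
    H, W = labels.shape
    # Local terms: cost of the chosen label at every pixel.
    local = unary_cost[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
    # Potts smoothness: a constant penalty wherever 4-connected neighbors disagree.
    disagreements = ((labels[:, 1:] != labels[:, :-1]).sum()
                     + (labels[1:, :] != labels[:-1, :]).sum())
    return local + potts_weight * disagreements
```

In the talk the minimizing labeling is found with graph cuts (e.g. alpha-expansion) rather than by scoring candidates, but the energy being minimized has this form.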
Model

Cost(labels) = Σ_{i ∈ pixels} LocalTerm(label_i)
             + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label_i, label_j)

LocalTerm(label_i) = Appearance(label_i | descriptor_i) · Location(i)
Appearance Term

Appearance(label_i | descriptor_i)

Several descriptor types to choose from:
• RGB-SIFT
• Depth-SIFT
• Depth-SPIN
• RGBD-SIFT
• RGB-SIFT/D-SPIN
Descriptor Type: RGB-SIFT
• 128-D SIFT descriptors extracted over a discrete grid on the RGB image from the Kinect
Descriptor Type: Depth-SIFT
• 128-D SIFT descriptors extracted over a discrete grid on the Kinect depth image (with linear scaling)
Descriptor Type: Depth-SPIN
• 50-D spin-image descriptors (radius × depth) extracted over a discrete grid on the linearly scaled Kinect depth image

A. E. Johnson and M. Hebert. Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes. IEEE PAMI, 21(5):433–449, 1999.
Descriptor Type: RGBD-SIFT
• Concatenate RGB-SIFT (Kinect RGB image) and Depth-SIFT (linearly scaled depth image): 128 + 128 = 256-D
Descriptor Type: RGB-SIFT/D-SPIN
• Concatenate RGB-SIFT (Kinect RGB image) and Depth-SPIN (linearly scaled depth image): 128 + 50 = 178-D
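As a rough illustration of how such grid descriptors can be computed, here is a sketch using OpenCV's SIFT on a discrete grid, concatenating RGB and depth descriptors into RGBD-SIFT; the grid step, patch size, and depth rescaling are assumptions, not values from the talk:

```python
import cv2
import numpy as np

def grid_sift(gray, step=10, size=16.0):
    """128-D SIFT descriptors on a discrete grid (dense SIFT)."""
    sift = cv2.SIFT_create()
    h, w = gray.shape
    kps = [cv2.KeyPoint(float(x), float(y), size)
           for y in range(step, h - step, step)
           for x in range(step, w - step, step)]
    # Note: compute() can in principle drop keypoints; a real pipeline
    # should verify the RGB and depth grids stay aligned.
    _, desc = sift.compute(gray, kps)
    return desc  # (num_grid_points, 128)

def rgbd_sift(bgr, depth):
    """RGBD-SIFT: concatenate SIFT on the gray image with SIFT on the
    depth map rescaled linearly to 8 bits."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    d8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return np.concatenate([grid_sift(gray), grid_sift(d8)], axis=1)  # (N, 256)
```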
Appearance Model

Appearance(label_i | descriptor_i) is modeled by a neural network with a single hidden layer:
• Input: the 128/178/256-D descriptor at each grid location
• Hidden layer: 1000-D
• Output: softmax layer over the 13 classes
• The output is a probability distribution over classes, interpreted as p(label | descriptor)
• Trained with backpropagation
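A minimal PyTorch sketch of this classifier (layer sizes from the slides; the hidden nonlinearity, loss pairing, and optimizer settings are assumptions):

```python
import torch
import torch.nn as nn

# Single-hidden-layer classifier: descriptor -> 1000-D hidden -> 13 classes.
model = nn.Sequential(
    nn.Linear(256, 1000),   # input dim is 128/178/256 depending on descriptor
    nn.Sigmoid(),           # hidden nonlinearity (assumed; not on the slide)
    nn.Linear(1000, 13),    # logits; softmax is folded into the loss below
)

loss_fn = nn.CrossEntropyLoss()              # softmax + negative log-likelihood
optim = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(descriptors, labels):
    """descriptors: (B, 256) float tensor; labels: (B,) int tensor."""
    optim.zero_grad()
    loss = loss_fn(model(descriptors), labels)
    loss.backward()                          # backpropagation
    optim.step()
    return loss.item()
```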
Model

Cost(labels) = Σ_{i ∈ pixels} Appearance(label_i | descriptor_i) · Location(i)
             + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label_i, label_j)

Location(i) is modeled with either 2D or 3D priors.
Location Priors: 2D
• 2D priors are histograms of P(class, location)
• Smoothed to avoid image-specific artifacts
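A sketch of how such priors might be built from training label maps (the smoothing bandwidth `sigma` and the per-pixel normalization into a conditional are assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def location_priors_2d(label_maps, num_classes, sigma=5.0):
    """Build smoothed 2D location priors from training label maps.

    label_maps: list of (H, W) int arrays of per-pixel class labels.
    Returns a (K, H, W) array, normalized per pixel so each slice can be
    read as P(class | location).
    """
    H, W = label_maps[0].shape
    counts = np.zeros((num_classes, H, W))
    for lm in label_maps:
        for k in range(num_classes):
            counts[k] += (lm == k)              # histogram of (class, location)
    for k in range(num_classes):
        counts[k] = gaussian_filter(counts[k], sigma)  # smooth away artifacts
    return counts / counts.sum(axis=0, keepdims=True)
```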
Motivation: 3D Location Priors
• 2D priors don't capture 3D geometry
• 3D priors can be built from depth data
• Rooms have different shapes and sizes; how do we align them?
Motivation: 3D Location Priors
• To align rooms, we use a normalized cylindrical coordinate system
• Depth is normalized against the band of maximum depths along each vertical scanline
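One plausible reading of that normalization, sketched in Python (the paper's exact cylindrical parameterization may differ):

```python
import numpy as np

def relative_depth(depth):
    """Normalize each pixel's depth by the maximum depth in its image
    column (vertical scanline), mapping room interiors to roughly [0, 1]
    regardless of room size. A sketch of one reading of the slide, not
    the authors' exact formulation.
    """
    max_per_column = depth.max(axis=0, keepdims=True)  # (1, W)
    return depth / np.maximum(max_per_column, 1e-6)
```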
Relative Depth Distributions
[Figure: density vs. relative depth (0 to 1) for Table, Television, Bed, and Wall]
Location Priors: 3D
Model

Cost(labels) = Σ_{i ∈ pixels} LocalTerm(label_i)
             + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label_i, label_j)

SpatialSmoothness(label_i, label_j): penalty for adjacent labels disagreeing (standard Potts model)
Spatial Modulation of Smoothness
• None
• RGB edges
• Depth edges
• RGB + depth edges
• Superpixel edges
• Superpixel + RGB edges
• Superpixel + depth edges
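A sketch of one common way to realize such modulation: contrast-sensitive Potts weights that shrink the disagreement penalty across RGB or depth edges. The exponential form and the weights are assumptions; the talk only lists which cues are used:

```python
import numpy as np

def pairwise_weights(gray, depth, w_rgb=1.0, w_depth=1.0):
    """Smoothness weights for horizontally adjacent pixel pairs: the
    Potts disagreement penalty is scaled down where an RGB or depth
    edge suggests a true object boundary.
    """
    d_rgb = np.abs(np.diff(gray.astype(float), axis=1))    # RGB contrast
    d_depth = np.abs(np.diff(depth.astype(float), axis=1)) # depth contrast
    return np.exp(-(w_rgb * d_rgb + w_depth * d_depth))    # (H, W-1)
```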
Experimental Setup
• 60% train (~1,408 images), 40% test (~939 images)
• 10-fold cross validation
• Images of the same scene never appear in both train and test
• Performance criterion: pixel-level classification accuracy (mean diagonal of the confusion matrix)
• 12 most common classes, plus 1 background class (formed from the rest)
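The criterion, sketched with a standard scikit-learn call (the helper name is hypothetical):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def mean_diagonal(y_true, y_pred, num_classes=13):
    """Mean of the diagonal of the row-normalized confusion matrix,
    i.e. per-class pixel accuracy averaged over the classes."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(num_classes)))
    cm = cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)  # normalize rows
    return float(cm.diagonal().mean())
```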
Evaluating Descriptors
[Figure: pixel-level accuracy (%, axis 30 to 50) for RGB-SIFT, Depth-SIFT, Depth-SPIN, RGBD-SIFT, and RGB-SIFT/D-SPIN, grouped into 2D and 3D descriptors, with unary-only and full-CRF bars]
Evaluating Location Priors
[Figure: pixel-level accuracy (%, axis 30 to 55) for RGB-SIFT, RGB-SIFT + 2D priors, RGBD-SIFT, RGBD-SIFT + 2D priors, RGBD-SIFT + 3D priors, and RGBD-SIFT + 3D priors (abs), grouped into 2D and 3D descriptors, with unary-only and full-CRF bars]
Conclusion
• Kinect depth signal helps scene parsing
• Still a long way from great performance
• Showed standard approaches on RGB-D data; lots of potential for more sophisticated methods
• No complicated geometric reasoning
• http://cs.nyu.edu/~silberman/nyu_indoor_scenes.html
Preprocessing the Data

We use open-source calibration software [1] to infer:
• Parameters of the RGB and depth cameras
• Homography between the cameras

[1] N. Burrus. Kinect RGB Demo v0.4.0. http://nicolas.burrus.name/index.php/Research/KinectRgbDemoV4?from=Research.KinectRgbDemoV2, Feb. 2011.
Preprocessing the Data
• Bilateral filter used to diffuse depth across regions of similar RGB intensity
• Naïve GPU implementation runs in ~100 ms
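A naive CPU sketch of that idea as a joint (cross) bilateral filter guided by RGB intensity; the kernel widths, window radius, and hole handling are assumptions:

```python
import numpy as np

def joint_bilateral_depth(depth, gray, radius=5, sigma_s=3.0, sigma_i=10.0):
    """Diffuse depth across regions of similar RGB intensity.

    Weights combine spatial closeness with intensity similarity, and
    missing depth (zeros) gets zero weight, so holes are filled from
    similar-looking neighbors. Naive O(H*W*radius^2) loop; the talk
    reports a GPU implementation at ~100 ms.
    """
    H, W = depth.shape
    out = np.zeros((H, W))
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma_s ** 2))
    dpad = np.pad(depth.astype(float), radius, mode='edge')
    gpad = np.pad(gray.astype(float), radius, mode='edge')
    for y in range(H):
        for x in range(W):
            dwin = dpad[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            gwin = gpad[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            w = spatial * np.exp(-(gwin - float(gray[y, x])) ** 2
                                 / (2 * sigma_i ** 2))
            w = w * (dwin > 0)  # ignore pixels with missing depth
            out[y, x] = (w * dwin).sum() / max(w.sum(), 1e-6)
    return out
```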
Motivation

[Figure: confusion matrix from spatial pyramid-based classification [1] on 5 indoor scene types]

Contrast this with the 81% accuracy that [1] reports on a 13-class (mostly outdoor) scene dataset; they note similar confusion within indoor scenes.

[1] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.