Indoor Scene Segmentation using a Structured Light Sensor
Nathan Silberman and Rob Fergus
ICCV 2011 Workshop on 3D Representation and Recognition
Courant Institute
Overview
Indoor Scene Recognition using the Kinect
• Introduce new Indoor Scene Depth Dataset
• Describe CRF-based model
  – Explore the use of RGB/depth cues
Motivation
• Indoor scene recognition is hard
  – Far less texture than outdoor scenes
  – More geometric structure
• Kinect gives us a depth map (and RGB)
  – Direct access to shape and geometry information
Capturing our Dataset
Statistics of the Dataset

Scene Type    Scenes   Frames    Labeled Frames*
Bathroom           6    5,588        76
Bedroom           17   22,764       480
Bookstore          3   27,173       784
Cafe               1    1,933        48
Kitchen           10   12,643       285
Living Room       13   19,262       355
Office            14   19,254       319
Total             64  108,617     2,347

* Labels obtained via LabelMe
Dataset Examples: Living Room
[Figure: RGB, raw depth, and label images]
Dataset Examples: Living Room
[Figure: RGB, depth*, and label images]
* Bilateral filtering used to clean up the raw depth image
Dataset Examples: Bathroom
[Figure: RGB, depth, and label images]
Dataset Examples: Bedroom
[Figure: RGB, depth, and label images]
Existing Depth Datasets
• RGB-D Dataset [1]
• Stanford Make3D [2]

[1] K. Lai, L. Bo, X. Ren, and D. Fox. A Large-Scale Hierarchical Multi-View RGB-D Object Dataset. ICRA 2011.
[2] B. Liu, S. Gould, and D. Koller. Single Image Depth Estimation from Predicted Semantic Labels. CVPR 2010.
Existing Depth Datasets
• Point Cloud Data [1]
• B3DO [2]

[1] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena. Semantic Labeling of 3D Point Clouds for Indoor Scenes. NIPS 2011.
[2] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell. A Category-Level 3-D Object Dataset: Putting the Kinect to Work. ICCV Workshop on Consumer Depth Cameras for Computer Vision, 2011.
Dataset Freely Available: http://cs.nyu.edu/~silberman/nyu_indoor_scenes.html
Segmentation using a CRF

Cost(labels) = Σ_{i ∈ pixels} LocalTerm(label_i)
             + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label_i, label_j)

• Standard CRF formulation
• Optimized via graph cuts
• Discrete label set (~12 classes)
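To make the energy concrete, here is a minimal sketch (not the authors' code) that evaluates this cost for one candidate labeling, using a constant Potts penalty for the smoothness term; the array layout and the `potts_weight` parameter are illustrative assumptions:

```python
import numpy as np

def crf_cost(labels, unary_cost, potts_weight=1.0):
    """Evaluate the CRF energy for one candidate labeling.

    labels:     (H, W) int array, one class index per pixel
    unary_cost: (H, W, K) array, local cost of each of K classes per pixel
    """
    H, W = labels.shape
    # Local terms: cost of the chosen label at every pixel.
    local = unary_cost[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
    # Potts smoothness: a constant penalty wherever 4-connected neighbors disagree.
    disagreements = ((labels[:, 1:] != labels[:, :-1]).sum()
                     + (labels[1:, :] != labels[:-1, :]).sum())
    return local + potts_weight * disagreements
```

In the talk the minimizing labeling is found with graph cuts (e.g. alpha-expansion) rather than by scoring candidates, but the energy being minimized has this form.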
Model

Cost(labels) = Σ_{i ∈ pixels} LocalTerm(label_i)
             + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label_i, label_j)

LocalTerm(label_i) = Appearance(label_i | descriptor_i) · Location(i)
Appearance Term

Appearance(label_i | descriptor_i)

Several descriptor types to choose from:
• RGB-SIFT
• Depth-SIFT
• Depth-SPIN
• RGBD-SIFT
• RGB-SIFT/D-SPIN
Descriptor Type: RGB-SIFT
• 128-D SIFT descriptors extracted over a discrete grid on the RGB image from the Kinect
Descriptor Type: Depth-SIFT
• 128-D SIFT descriptors extracted over a discrete grid on the Kinect depth image (with linear scaling)
Descriptor Type: Depth-SPIN
• 50-D spin-image descriptors (radius × depth) extracted over a discrete grid on the linearly scaled Kinect depth image

A. E. Johnson and M. Hebert. Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes. IEEE PAMI, 21(5):433–449, 1999.
Descriptor Type: RGBD-SIFT
• Concatenate RGB-SIFT (Kinect RGB image) and Depth-SIFT (linearly scaled depth image): 128 + 128 = 256-D
Descriptor Type: RGB-SIFT/D-SPIN
• Concatenate RGB-SIFT (Kinect RGB image) and Depth-SPIN (linearly scaled depth image): 128 + 50 = 178-D
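As a rough illustration of how such grid descriptors can be computed, here is a sketch using OpenCV's SIFT on a discrete grid, concatenating RGB and depth descriptors into RGBD-SIFT; the grid step, patch size, and depth rescaling are assumptions, not values from the talk:

```python
import cv2
import numpy as np

def grid_sift(gray, step=10, size=16.0):
    """128-D SIFT descriptors on a discrete grid (dense SIFT)."""
    sift = cv2.SIFT_create()
    h, w = gray.shape
    kps = [cv2.KeyPoint(float(x), float(y), size)
           for y in range(step, h - step, step)
           for x in range(step, w - step, step)]
    # Note: compute() can in principle drop keypoints; a real pipeline
    # should verify the RGB and depth grids stay aligned.
    _, desc = sift.compute(gray, kps)
    return desc  # (num_grid_points, 128)

def rgbd_sift(bgr, depth):
    """RGBD-SIFT: concatenate SIFT on the gray image with SIFT on the
    depth map rescaled linearly to 8 bits."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    d8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return np.concatenate([grid_sift(gray), grid_sift(d8)], axis=1)  # (N, 256)
```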
Appearance Model

Appearance(label_i | descriptor_i) is modeled by a neural network with a single hidden layer:
• Input: the 128/178/256-D descriptor at each grid location
• Hidden layer: 1000-D
• Output: softmax layer over the 13 classes
• The output is a probability distribution over classes, interpreted as p(label | descriptor)
• Trained with backpropagation
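A minimal PyTorch sketch of this classifier (layer sizes from the slides; the hidden nonlinearity, loss pairing, and optimizer settings are assumptions):

```python
import torch
import torch.nn as nn

# Single-hidden-layer classifier: descriptor -> 1000-D hidden -> 13 classes.
model = nn.Sequential(
    nn.Linear(256, 1000),   # input dim is 128/178/256 depending on descriptor
    nn.Sigmoid(),           # hidden nonlinearity (assumed; not on the slide)
    nn.Linear(1000, 13),    # logits; softmax is folded into the loss below
)

loss_fn = nn.CrossEntropyLoss()              # softmax + negative log-likelihood
optim = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(descriptors, labels):
    """descriptors: (B, 256) float tensor; labels: (B,) int tensor."""
    optim.zero_grad()
    loss = loss_fn(model(descriptors), labels)
    loss.backward()                          # backpropagation
    optim.step()
    return loss.item()
```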
Model

Cost(labels) = Σ_{i ∈ pixels} Appearance(label_i | descriptor_i) · Location(i)
             + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label_i, label_j)

Location(i) is modeled with either 2D or 3D priors.
Location Priors: 2D
• 2D priors are histograms of P(class, location)
• Smoothed to avoid image-specific artifacts
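A sketch of how such priors might be built from training label maps (the smoothing bandwidth `sigma` and the per-pixel normalization into a conditional are assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def location_priors_2d(label_maps, num_classes, sigma=5.0):
    """Build smoothed 2D location priors from training label maps.

    label_maps: list of (H, W) int arrays of per-pixel class labels.
    Returns a (K, H, W) array, normalized per pixel so each slice can be
    read as P(class | location).
    """
    H, W = label_maps[0].shape
    counts = np.zeros((num_classes, H, W))
    for lm in label_maps:
        for k in range(num_classes):
            counts[k] += (lm == k)              # histogram of (class, location)
    for k in range(num_classes):
        counts[k] = gaussian_filter(counts[k], sigma)  # smooth away artifacts
    return counts / counts.sum(axis=0, keepdims=True)
```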
Motivation: 3D Location Priors
• 2D priors don't capture 3D geometry
• 3D priors can be built from depth data
• Rooms have different shapes and sizes; how do we align them?
Motivation: 3D Location Priors
• To align rooms, we use a normalized cylindrical coordinate system
• Depth is normalized against the band of maximum depths along each vertical scanline
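One plausible reading of that normalization, sketched in Python (the paper's exact cylindrical parameterization may differ):

```python
import numpy as np

def relative_depth(depth):
    """Normalize each pixel's depth by the maximum depth in its image
    column (vertical scanline), mapping room interiors to roughly [0, 1]
    regardless of room size. A sketch of one reading of the slide, not
    the authors' exact formulation.
    """
    max_per_column = depth.max(axis=0, keepdims=True)  # (1, W)
    return depth / np.maximum(max_per_column, 1e-6)
```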
Relative Depth Distributions
[Figure: density vs. relative depth (0 to 1) for Table, Television, Bed, and Wall]
Location Priors: 3D
Model

Cost(labels) = Σ_{i ∈ pixels} LocalTerm(label_i)
             + Σ_{(i,j) ∈ pairs of pixels} SpatialSmoothness(label_i, label_j)

SpatialSmoothness(label_i, label_j): penalty for adjacent labels disagreeing (standard Potts model)
Spatial Modulation of Smoothness
• None
• RGB edges
• Depth edges
• RGB + depth edges
• Superpixel edges
• Superpixel + RGB edges
• Superpixel + depth edges
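A sketch of one common way to realize such modulation: contrast-sensitive Potts weights that shrink the disagreement penalty across RGB or depth edges. The exponential form and the weights are assumptions; the talk only lists which cues are used:

```python
import numpy as np

def pairwise_weights(gray, depth, w_rgb=1.0, w_depth=1.0):
    """Smoothness weights for horizontally adjacent pixel pairs: the
    Potts disagreement penalty is scaled down where an RGB or depth
    edge suggests a true object boundary.
    """
    d_rgb = np.abs(np.diff(gray.astype(float), axis=1))    # RGB contrast
    d_depth = np.abs(np.diff(depth.astype(float), axis=1)) # depth contrast
    return np.exp(-(w_rgb * d_rgb + w_depth * d_depth))    # (H, W-1)
```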
Experimental Setup
• 60% train (~1,408 images), 40% test (~939 images)
• 10-fold cross validation
• Images of the same scene never appear in both train and test
• Performance criterion: pixel-level classification accuracy (mean diagonal of the confusion matrix)
• 12 most common classes, plus 1 background class (formed from the rest)
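The criterion, sketched with a standard scikit-learn call (the helper name is hypothetical):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def mean_diagonal(y_true, y_pred, num_classes=13):
    """Mean of the diagonal of the row-normalized confusion matrix,
    i.e. per-class pixel accuracy averaged over the classes."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(num_classes)))
    cm = cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)  # normalize rows
    return float(cm.diagonal().mean())
```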
Evaluating Descriptors
[Figure: pixel-level accuracy (%, axis 30 to 50) for RGB-SIFT, Depth-SIFT, Depth-SPIN, RGBD-SIFT, and RGB-SIFT/D-SPIN, grouped into 2D and 3D descriptors, with unary-only and full-CRF bars]
Evaluating Location Priors
[Figure: pixel-level accuracy (%, axis 30 to 55) for RGB-SIFT, RGB-SIFT + 2D priors, RGBD-SIFT, RGBD-SIFT + 2D priors, RGBD-SIFT + 3D priors, and RGBD-SIFT + 3D priors (abs), grouped into 2D and 3D descriptors, with unary-only and full-CRF bars]
Conclusion
• Kinect depth signal helps scene parsing
• Still a long way from great performance
• Showed standard approaches on RGB-D data; lots of potential for more sophisticated methods
• No complicated geometric reasoning
• http://cs.nyu.edu/~silberman/nyu_indoor_scenes.html
Preprocessing the Data

We use open-source calibration software [1] to infer:
• Parameters of the RGB and depth cameras
• Homography between the cameras

[1] N. Burrus. Kinect RGB Demo v0.4.0. http://nicolas.burrus.name/index.php/Research/KinectRgbDemoV4?from=Research.KinectRgbDemoV2, Feb. 2011.
Preprocessing the Data
• Bilateral filter used to diffuse depth across regions of similar RGB intensity
• Naïve GPU implementation runs in ~100 ms
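A naive CPU sketch of that idea as a joint (cross) bilateral filter guided by RGB intensity; the kernel widths, window radius, and hole handling are assumptions:

```python
import numpy as np

def joint_bilateral_depth(depth, gray, radius=5, sigma_s=3.0, sigma_i=10.0):
    """Diffuse depth across regions of similar RGB intensity.

    Weights combine spatial closeness with intensity similarity, and
    missing depth (zeros) gets zero weight, so holes are filled from
    similar-looking neighbors. Naive O(H*W*radius^2) loop; the talk
    reports a GPU implementation at ~100 ms.
    """
    H, W = depth.shape
    out = np.zeros((H, W))
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma_s ** 2))
    dpad = np.pad(depth.astype(float), radius, mode='edge')
    gpad = np.pad(gray.astype(float), radius, mode='edge')
    for y in range(H):
        for x in range(W):
            dwin = dpad[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            gwin = gpad[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            w = spatial * np.exp(-(gwin - float(gray[y, x])) ** 2
                                 / (2 * sigma_i ** 2))
            w = w * (dwin > 0)  # ignore pixels with missing depth
            out[y, x] = (w * dwin).sum() / max(w.sum(), 1e-6)
    return out
```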
Motivation

[Figure: confusion matrix from spatial pyramid-based classification [1] on 5 indoor scene types]

Contrast this with the 81% accuracy that [1] reports on a 13-class (mostly outdoor) scene dataset; they note similar confusion within indoor scenes.

[1] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.