Learning Deep Generative Spatial Models for Mobile Robots

Andrzej Pronobis, Rajesh P. N. Rao

Abstract— We propose a new probabilistic framework that allows mobile robots to autonomously learn deep, generative models of their environments that span multiple levels of abstraction. Unlike traditional approaches that combine engineered models for low-level features, geometry, and semantics, our approach leverages recent advances in Sum-Product Networks (SPNs) and deep learning to learn a single, universal model of the robot's spatial environment. Our model is fully probabilistic and generative, and represents a joint distribution over spatial information ranging from low-level geometry to semantic interpretations. Once learned, it is capable of solving a wide range of tasks: from semantic classification of places, uncertainty estimation, and novelty detection, to generation of place appearances based on semantic information and prediction of missing data in partial observations. Experiments on laser-range data from a mobile robot show that the proposed universal model obtains performance superior to state-of-the-art models fine-tuned to one specific task, such as Generative Adversarial Networks (GANs) or SVMs.

I. INTRODUCTION

The ability to acquire and represent spatial knowledge is fundamental for mobile robots operating in large, unstructured environments. Such knowledge exists at multiple levels of abstraction, from the robot's sensory data, through geometry and appearance, up to high-level semantic descriptions. Experiments have demonstrated that robots can leverage knowledge at all levels to better perform in the real world [1].

Traditionally, robotic systems utilize an assembly of independent spatial models [2], which exchange information in a limited fashion. This includes engineered feature extractors and combinations of machine learning techniques, making integration with planning and decision making difficult. At the same time, the recent success of deep learning proves that replacing multiple representations with a single integrated model can lead to a drastic increase in performance [3][4]. As a result, deep models have also been applied to spatial modeling tasks, such as place classification and semantic mapping [5][6]. Yet, the problem was always framed as one of classification, where sensory data is fed to a convolutional neural network (CNN) to obtain semantic labels.

In contrast, in this work our goal is not only to unify multiple representations into a single model, but also to demonstrate that the role of a spatial model can go beyond classification. To this end, we propose the Deep Generative Spatial Model (DGSM), a probabilistic model which learns a joint distribution between a low-level representation of the geometry of local environments (places) and their semantic interpretation. Our model leverages Sum-Product Networks (SPNs), a novel probabilistic deep architecture [7][8].

The authors are with the Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA. A. Pronobis is also with the Robotics, Perception and Learning Lab, KTH, Stockholm, Sweden. {pronobis,rao}@cs.washington.edu. This work was supported by Office of Naval Research (ONR) grant no. N00014-13-1-0817 and Swedish Research Council (VR) project 2012-4907 SKAEENet. The help of Kaiyu Zheng and Kousuke Ariga is gratefully acknowledged.

SPNs have been shown to provide state-of-the-art results in several domains [9][10][11]. However, their potential has not previously been exploited in robotics. DGSM consists of an SPN with a unique structure designed to hierarchically represent the geometry and semantics of a place from the perspective of a mobile robot acting in its environment. To this end, the network represents place geometry using a robot-centric polar grid, where nearby objects are captured in more detail than more distant context. On top of the place geometry, we propose a unique network structure which combines domain knowledge with random network generation (which can be seen as a form of structure learning) for the parts of the network modeling complex dependencies.

DGSM is generative, probabilistic, and therefore universal. Once learned, it enables a wide range of inferences. First, it can be used to infer the semantic category of a place from sensory input, together with a probability representing uncertainty. The probabilistic output provides rich information to a potential planning or decision-making subsystem. However, as shown in this work, it can also be used to detect novel place categories. Furthermore, the model reasons jointly about the geometry of the world and its semantics. We exploit that property for two tasks: to generate prototypical appearances of places based on semantic information and to infer missing geometry information in partial observations. We use laser-range data to capture the geometry of places; however, the performance of SPNs for vision-based tasks [7][9] indicates that the model should also accommodate 3D and visual information without changing the general architecture.

Our goal is to demonstrate the potential of DGSM, and deep generative models in general, for spatial modeling in robotics. Therefore, we present results of four different experiments addressing each of the inference tasks. In each experiment, we compare our universal model to an alternative approach that is designed for and fine-tuned to a specific task. First, for semantic categorization, we compare to a well-established Support Vector Machine (SVM) model learned on widely used geometrical laser-range features [12][13]. Second, we benchmark novelty detection against a one-class SVM trained on the same features. In both cases, DGSM offers superior accuracy. Finally, we compare the generative properties of our model to Generative Adversarial Networks (GANs) [14][15] on the two remaining inference tasks, reaching state-of-the-art accuracy and superior, beyond-real-time efficiency. This serves as a benchmark but also demonstrates the use of GANs for spatial modeling in robotics. Importantly, to open doors for the use of SPNs in a broader range of applications, we introduce LibSPN [16], a new general library for SPN learning and inference.


II. RELATED WORK

Representing semantic spatial knowledge is a broadly researched topic, with many solutions employing vision [17][18][2][5]. Images clearly carry rich information about semantics; however, they are also affected by changing environmental conditions. At the same time, robotics researchers have seen advantages of using range data, which are much more robust in real-world settings and easier to process in real time. In this work, we focus on laser-range data, as a way of introducing and evaluating a new spatial model as well as a recently proposed deep architecture.

Laser-range data have been extensively used for place classification and semantic mapping, and many traditional, handcrafted representations have been proposed. Buschka et al. [19] contributed a simple method that incrementally divided grid maps of indoor environments into two classes of open spaces (rooms and corridors). Mozos et al. [12] applied AdaBoost to create a classifier based on a set of manually designed geometrical features to classify places into rooms, corridors, and doorways. In [20], omnidirectional vision was combined with laser data to build descriptors called fingerprints of places. Finally, in [13], SVMs have been applied to the geometrical features of Mozos et al. [12], leading to a significant improvement in performance over the original AdaBoost. That approach was further integrated with visual and object cues for semantic mapping in [2].

Deep learning and unsupervised feature learning, after many successes in speech recognition and computer vision [3], entered the field of robotics with superior performance in object recognition [21][22] and robot grasping [23][4]. The latest work in place classification also employs deep approaches. In [5], a deep convolutional network (CNN) complemented with a series of one-vs-all classifiers is used for visual semantic mapping. In [6], CNNs are used to classify grid maps built from laser data into three classes: room, corridor, and doorway. In these works, deep models are used exclusively for classification, and the use of generative models has not been explored. In contrast, we propose a universal probabilistic generative model, and demonstrate its usefulness for multiple robotics tasks, including classification.

Several generative, deep architectures have recently been proposed, notably Variational Autoencoders [24], Generative Adversarial Networks [14], and Sum-Product Networks [8][7][9]. GANs have been shown to produce high-quality generative representations of visual data [15], and have been successfully applied to image completion [25]. SPNs, a probabilistic model, achieved promising results for varied applications such as speech [10] and language modeling [26], human activity recognition [11], and image classification [9] and completion [7], but have not been used in robotics. In this work, we exploit the universality and efficiency of SPNs to propose a single spatial model able to solve a wide range of inference problems relevant to a mobile robot. Furthermore, inspired by their results in other domains, we also evaluate GANs (when applicable). This serves as a comparison and a demonstration of GANs on a new application.

Fig. 1: An SPN for a naive Bayes mixture model P(X1, X2), with three components over two binary variables. The bottom layer consists of indicators for X1 and X2. Weighted sum nodes, with weights attached to inputs, are marked with +, while product nodes are marked with ×. Y1 represents a latent variable marginalized out by the root sum.

III. SUM-PRODUCT NETWORKS

Sum-product networks are a recently proposed probabilistic deep architecture with several appealing properties and solid theoretical foundations [8][7][9]. One of the primary limitations of probabilistic graphical models is the complexity of their partition function, often requiring complex approximate inference in the presence of non-convex likelihood functions. In contrast, SPNs represent joint or conditional probability distributions with partition functions that are guaranteed to be tractable and involve a polynomial number of sum and product operations, permitting exact inference. SPNs are a deep, hierarchical representation, capable of representing context-specific independence and performing fast, tractable inference on high-treewidth models. While not all probability distributions can be encoded by polynomial-sized SPNs, recent experiments in several domains show that the class of distributions modeled by SPNs is sufficient for many real-world problems, offering real-time efficiency.

As shown in Fig. 1 for a simple example of a naive Bayes mixture model, an SPN is a generalized directed acyclic graph composed of weighted sum and product nodes. The sums can be seen as mixture models over subsets of variables, with weights representing mixture priors. Products can be viewed as features or mixture components. The latent variables of the mixtures can be made explicit and their values inferred. This is often done for classification models, where the root sum is a mixture of sub-SPNs representing classes. The bottom layers effectively define features reacting to certain values of indicators¹ for the input variables.

Formally, following Poon & Domingos [7], we can define an SPN as follows:

Definition 1: An SPN over variables $X_1, \ldots, X_V$ is a rooted directed acyclic graph whose leaves are the indicators $(X_1^1, \ldots, X_1^{I_1}), \ldots, (X_V^1, \ldots, X_V^{I_V})$ and whose internal nodes are sums and products. Each edge $(i, j)$ emanating from a sum node $i$ has a non-negative weight $w_{ij}$. The value of a product node is the product of the values of its children. The value of a sum node is $\sum_{j \in Ch(i)} w_{ij} v_j$, where $Ch(i)$ are the children of $i$ and $v_j$ is the value of node $j$. The value of an SPN $S[X_1, \ldots, X_V]$ is the value of its root.

¹An indicator is a binary variable set to 1 when the corresponding categorical variable takes a specific value. For using continuous input variables, see [7].

Not all architectures consisting of sums and products result in a valid probability distribution. However, following simple constraints on the structure of an SPN will guarantee validity (see [7], [8]). When the weights of each sum node are normalized to sum to 1, the value of a valid SPN $S[X_1^1, \ldots, X_V^{I_V}]$ is equal to the normalized probability $P(X_1, \ldots, X_V)$ of the distribution modeled by the network [8].
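To make the definition concrete, the following minimal sketch (Python with NumPy; not part of the paper's implementation, and all weights are illustrative) evaluates the naive Bayes mixture SPN of Fig. 1 for both full and partial evidence:

```python
# Upward pass for the naive Bayes mixture SPN in Fig. 1: three mixture
# components over two binary variables X1, X2. Weights below are made up.
import numpy as np

def spn_value(lam1, lam2, w_root, w1, w2):
    """Value of the root for given indicator settings.

    lam1, lam2: indicator vectors for X1 and X2 (setting all indicators of a
                variable to 1 marginalizes it out)
    w_root:     weights of the root sum (mixture priors), shape (3,)
    w1, w2:     per-component weights of the sums over X1 and X2, shape (3, 2)
    """
    s1 = w1 @ lam1          # bottom sums over X1, one value per component
    s2 = w2 @ lam2          # bottom sums over X2
    prods = s1 * s2         # product layer: one product per component
    return float(w_root @ prods)   # root sum: latent Y1 marginalized out

w_root = np.array([0.5, 0.3, 0.2])
w1 = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
w2 = np.array([[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]])

# Full evidence X1 = 0, X2 = 1:
print(spn_value(np.array([1, 0]), np.array([0, 1]), w_root, w1, w2))
# Partial evidence (X2 missing): all indicators of X2 set to 1.
print(spn_value(np.array([1, 0]), np.array([1, 1]), w_root, w1, w2))
```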

A. Generating SPN structure

The structure of the SPN determines the group of distributions that can be learned. Therefore, most previous works [9][11][26] relied on domain knowledge to design the appropriate structure. Furthermore, several structure learning algorithms were proposed [27][28] to discover independencies between the random variables in the dataset, and structure the SPN accordingly. In this work, we experiment with a different approach, originally hinted at in [7], which generates a random structure, as in random forests. Such an approach has not been previously evaluated. Our experiments demonstrate that it can lead to very good performance and can accommodate a wide range of distributions. Additionally, after parameter learning, the generated structure can be pruned by removing edges associated with weights close to zero. This can be seen as a form of structure learning.

To obtain the random structure, we recursively generate nodes based on multiple random decompositions of a set of random variables into multiple subsets until each subset is a singleton. As illustrated in Fig. 2 (middle, Level 1), at each level the current set of variables to be decomposed is modeled by multiple mixtures (green nodes), and each subset of the decomposition is also modeled by multiple mixtures (green nodes one level below). Product nodes (blue) are used as an intermediate layer and act as features detecting particular combinations of mixtures representing each subset. The top mixtures of each level mix outputs of all product nodes at that level. The same set of variables can be decomposed into subsets in multiple random ways (e.g., there are two decompositions at the top of Fig. 2).
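A minimal structural sketch of this recursive generation procedure is shown below (placeholder node classes rather than LibSPN; weights are omitted since they are learned afterwards, and the parameter values are illustrative, not the exact DGSM settings):

```python
# Recursively decompose a variable set into random subsets, model each subset
# with several mixtures, and combine one mixture per subset with products.
import random
from itertools import product as cartesian

class Sum:      # mixture over its children
    def __init__(self, children): self.children = children

class Product:  # feature combining one mixture per subset
    def __init__(self, children): self.children = children

class Leaf:     # indicators of a single variable
    def __init__(self, var): self.var = var

def generate(variables, num_decomps=2, num_subsets=2, num_mixtures=4):
    """Return a list of mixtures modeling `variables`; recurse to singletons."""
    if len(variables) == 1:
        return [Leaf(variables[0])]
    products = []
    for _ in range(num_decomps):              # several random decompositions
        shuffled = random.sample(variables, len(variables))
        subsets = [s for s in (shuffled[i::num_subsets]
                               for i in range(num_subsets)) if s]
        # Each subset is modeled by several mixtures over its own sub-SPNs.
        mixtures = [[Sum(generate(s, num_decomps, num_subsets, num_mixtures))
                     for _ in range(num_mixtures)] for s in subsets]
        # Products act as features combining one mixture per subset.
        products += [Product(list(combo)) for combo in cartesian(*mixtures)]
    # The top mixtures of this level mix the outputs of all its products.
    return [Sum(products) for _ in range(num_mixtures)]

root = Sum(generate(list(range(8))))   # e.g., 8 variables, root sum on top
```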

B. Inference and Learning

Fig. 2: The structure of the SPN implementing our spatial model. The bottom images illustrate a robot in an environment and a robot-centric polar grid formed around the robot. The SPN is built on top of the variables representing the occupancy in the polar grid.

Inference in SPNs is accomplished by a single pass through the network. Once the indicators are set to represent the evidence, the upward pass will yield the probability of the evidence as the value of the root node. Partial evidence (or missing data) can easily be expressed by setting all indicators for a variable to 1. Moreover, since SPNs compute a network polynomial [29], derivatives computed over the network can be used to perform inference for modified evidence without recomputing the whole SPN. Finally, it can be shown [8] that MPE inference in a certain class of SPNs (selective) can be performed by replacing all sum nodes with max nodes while retaining the weights. Then, the indicators of the variables for which the MPE state is inferred are all set to 1 and a standard upward pass is performed. A downward pass then follows, which recursively selects the highest-valued child of each sum (max) node and all children of a product node. The indicators selected by this process indicate the MPE state of the variables. In general SPNs, this algorithm yields an approximation of the MPE state.
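The following sketch (assumed, simplified node classes; not LibSPN) illustrates this approximate MPE procedure, with an upward pass that replaces sums by maxes and a downward pass that follows the best child of every sum and all children of every product:

```python
# Approximate MPE inference sketch. Indicators of unobserved variables behave
# as if set to 1, so missing evidence is handled naturally.

class Leaf:
    def __init__(self, var, value): self.var, self.value = var, value
    def up(self, evidence):
        # Indicator: 1 if consistent with the evidence or variable unobserved.
        self.val = 1.0 if evidence.get(self.var, self.value) == self.value else 0.0
        return self.val
    def down(self, state):
        state[self.var] = self.value      # selected indicator -> MPE value

class Product:
    def __init__(self, children): self.children = children
    def up(self, evidence):
        self.val = 1.0
        for c in self.children:
            self.val *= c.up(evidence)
        return self.val
    def down(self, state):
        for c in self.children:           # all children of a product are followed
            c.down(state)

class MaxSum:                             # sum node with sums replaced by maxes
    def __init__(self, weighted_children): self.wc = weighted_children
    def up(self, evidence):
        self.best = max(self.wc, key=lambda wc: wc[0] * wc[1].up(evidence))
        self.val = self.best[0] * self.best[1].val
        return self.val
    def down(self, state):
        self.best[1].down(state)          # only the highest-valued child is followed

# Tiny demo: a 2-component mixture over one binary variable, no evidence given.
root = MaxSum([(0.8, Product([Leaf("x", 0)])), (0.2, Product([Leaf("x", 1)]))])
root.up({})
state = {}
root.down(state)
print(state)   # -> {'x': 0}
```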

Fig. 3: Local environment observations used in our experiments, expressed as Cartesian and polar occupancy grids, for examples of places of different semantic categories: (a) Corridor, (b) Doorway, (c) Small Office, (d) Large Office.

SPNs lend themselves to be learned generatively [7] or discriminatively [9] using Expectation Maximization (EM) or gradient descent. In this work, we employ hard EM to learn the weights, which was shown to work well for generative learning [7]. As is often the case for deep models, the gradient quickly diminishes as more layers are added. Hard EM overcomes this problem, permitting learning of SPNs with hundreds of layers. Each iteration of the EM learning consists of an MPE inference of the implicit latent variables of each sum with training samples set as evidence (E step), and an update of weights based on the inference results (M step; for details, see [7]). We achieved best results by modifying the MPE inference to use sums instead of maxes during the upward pass, while selecting the max-valued child during the downward pass. Furthermore, we performed additive smoothing when updating the weights, corresponding to a Dirichlet prior, and terminated learning after 300 iterations. No additional learning parameters are required.
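As a small illustration of the M step, each child of a sum node accumulates a count of how often the downward pass selected it, and the weights are re-estimated from those counts with additive smoothing. A sketch (illustrative counts and smoothing; not LibSPN code):

```python
# Re-estimate the weights of one sum node from hard-EM selection counts,
# with additive (Dirichlet) smoothing.
import numpy as np

def update_sum_weights(counts, alpha=1.0):
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

print(update_sum_weights([40, 9, 1]))   # -> approx. [0.774 0.189 0.038]
```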

IV. DEEP GENERATIVE SPATIAL MODEL (DGSM)

A. Representing Local Environments

DGSM is designed to support real-time, online spatial reasoning on a mobile robot. A real robot almost always has access to a stream of observations of the environment. Thus, as the first step, we perform spatio-temporal integration of the sensory input. We rely on laser-range data, and use particle-filter grid mapping [30] to maintain a robot-centric map of 5m radius around the robot. During acquisition of the dataset used in our experiments, the robot was navigating a fixed path through a new environment, while continuously integrating data gathered using a single laser scanner with 180° FOV. This still results in partial observations of the surroundings (especially when the robot enters a new room), but helps to assemble a more complete representation over time.

Our goal is to model the geometry and semantics of a local environment only. We assume that a larger-scale spatial model will be built by integrating multiple models of local places. Thus, we constrain the observation of a place to the information visible from the robot (structures that can be raytraced from the robot's location). As a result, walls occlude the view and the local map mostly contains information from a single room. In practice, additional noise is almost always present, but is averaged out during learning of the model. Examples of such local environment observations can be seen in Fig. 3.

Next, each local observation is transformed into a robot-centric polar occupancy grid (compare the polar and Cartesian grids in Fig. 3). The resulting observation contains higher-resolution details closer to the robot and lower-resolution context further away. This focuses the attention of the model on the nearby objects. Higher resolution of information closer to the robot is important for understanding the semantics of the robot's exact location (for instance, when the robot is at a doorway). However, it also relates to how spatial information is used by a mobile robot when planning and executing actions. It is in the vicinity of the robot that higher accuracy of spatial information is required. A similar principle is exploited by many navigation components, which use different resolutions of information for local and global path planning. Additionally, such a representation corresponds to the way the robot perceives the world because of the limited resolution of its sensors. Our goal is to use a similar strategy when representing 3D and visual information, by extending the polar representation to three dimensions. Finally, a high-resolution map of the complete environment can be largely recovered by integrating stored or inferred polar observations over the path of the robot. We built polar grids with a radius of 5m, an angle step of 6.4 degrees, and a grid resolution decreasing with the distance from the robot.
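For illustration, a sketch of this Cartesian-to-polar conversion could look as follows (the 5 m radius and 6.4-degree step come from the text; the number of rings, their exact radial spacing, and the aggregation rule are assumptions):

```python
# Convert a robot-centric Cartesian occupancy grid into a polar grid whose
# cells become coarser with distance from the robot.
import numpy as np

UNKNOWN, EMPTY, OCCUPIED = -1, 0, 1

def cartesian_to_polar(grid, resolution, radius=5.0, angle_step_deg=6.4, n_rings=16):
    """grid: square occupancy array centered on the robot; resolution: m/cell."""
    n_angles = int(round(360.0 / angle_step_deg))
    # Ring edges spaced so cells grow further from the robot (assumed spacing).
    edges = radius * np.linspace(0.0, 1.0, n_rings + 1) ** 1.5
    polar = np.full((n_angles, n_rings), UNKNOWN, dtype=int)
    c = (np.array(grid.shape) - 1) / 2.0
    ys, xs = np.indices(grid.shape)
    dx, dy = (xs - c[1]) * resolution, (ys - c[0]) * resolution
    r, a = np.hypot(dx, dy), np.degrees(np.arctan2(dy, dx)) % 360.0
    for i in range(n_angles):
        for j in range(n_rings):
            sel = ((a >= i * angle_step_deg) & (a < (i + 1) * angle_step_deg)
                   & (r >= edges[j]) & (r < edges[j + 1]))
            known = grid[sel][grid[sel] != UNKNOWN]
            if known.size:                       # majority vote over known cells
                polar[i, j] = OCCUPIED if known.mean() > 0.5 else EMPTY
    return polar

# Example: a 500x500 grid at 2 cm/cell covering the 5 m radius around the robot.
polar = cartesian_to_polar(np.zeros((500, 500), dtype=int), resolution=0.02)
```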

B. Architecture of DGSM

The architecture of DGSM is based on a generative SPN illustrated in Fig. 2. The model learns a probability distribution P(Y, X1, ..., XC), where Y represents the semantic category of a place, and X1, ..., XC are input variables representing the occupancy in each cell of the polar grid. Each occupancy cell is represented by three indicators in the SPN (for empty, occupied, and unknown space). These indicators constitute the bottom of the network (orange nodes).
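As a minimal illustration (not LibSPN code; the encoding is an assumption), the three indicators of a single cell, and the marginalization of a cell with no evidence, can be expressed as:

```python
# Encode one polar cell's occupancy as three indicators; a missing cell sets
# all indicators to 1 so the variable is marginalized out during inference.
import numpy as np

STATES = ("empty", "occupied", "unknown")

def cell_indicators(value):
    """value: one of STATES, or None when no evidence is available."""
    if value is None:
        return np.ones(len(STATES))
    return np.array([1.0 if s == value else 0.0 for s in STATES])

print(cell_indicators("occupied"))  # [0. 1. 0.]
print(cell_indicators(None))        # [1. 1. 1.]
```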

The structure of the model is partially designed based on domain knowledge and partially generated according to the algorithm described in Sec. III-A. The resulting model is a single SPN assembled from three levels of sub-SPNs. We begin by splitting the polar grid equally into eight 45-degree views. For each view, we generate a random sub-SPN by recursively building a hierarchy of decompositions of subsets of polar cells in the view. Then, on top of all the sub-SPNs representing the views, we generate an SPN representing complete place geometries for each place class. Finally, the sub-SPNs for place classes are combined by a sum node forming the root of the network. The latent variable associated with that sum node is made explicit as Y and represents the semantic class of a place.

Sub-dividing the representation into views allows us to use networks of different complexity for representing lower-level view features and the high-level structure of a place. In our experiments, when representing views, we recursively decomposed the set of polar cells using a single random decomposition into 2 cell subsets, and generated 4 mixtures to model each such subset. This procedure was repeated until each subset contained a single variable representing a single cell. To increase the discriminative power of each view representation, we used 14 sums at the top level of the view sub-SPN. These sums are considered input to a randomly generated SPN structure representing a place class. To ensure that each class can be associated with a rich assortment of place geometries, we increased the complexity of the generated structure and performed 4 random decompositions of the sets of mixtures representing views into 5 subsets. The performance of the model does not vary greatly with the structure parameters as long as the generated structure is sufficiently expressive to support learning of dependencies in the data.

Several straightforward modifications to the architecture can be considered. First, the latent variables in the mixtures modeling each view can be made explicit and considered a view or scene descriptor discovered by the learning algorithm. Second, the weights of the network could be shared across views, potentially simplifying the learning process.

C. Types of Inference

As a generative model of a joint distribution between low-level observations and high-level semantic phenomena, DGSM is capable of various types of inferences.

First, the model can simply be used to classify observations into semantic categories, which corresponds to MPE inference of y: $y^* = \arg\max_y P(y \mid x_1, \ldots, x_C)$. Second, the likelihood of an observation can be used as a measure of novelty and thresholded: $\sum_y P(y, x_1, \ldots, x_C) > t$. We use this approach to separate test observations of classes known during training from observations of unknown classes.

If instead we condition on the semantic information, we can perform MPE inference over the variables representing the occupancy of polar grid cells: $x_1^*, \ldots, x_C^* = \arg\max_{x_1, \ldots, x_C} P(x_1, \ldots, x_C \mid y)$. This leads to the generation of prototypical examples for each class. Finally, we can use partial evidence about the occupancy and infer the most likely state of a subset of polar grid cells for which evidence is missing: $x_J^*, \ldots, x_C^* = \arg\max_{x_J, \ldots, x_C} \sum_y P(y, x_1, \ldots, x_{J-1}, x_J, \ldots, x_C)$. We use this technique to infer missing observations in our experiments.
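The following toy sketch makes the four queries concrete (a hand-made joint distribution stands in for the SPN's output; all numbers are purely illustrative):

```python
# Classification, novelty thresholding, prototype generation, and completion
# expressed as operations on a toy joint P(y, x1, x2) over 2 classes and
# 2 binary cells.
import itertools

classes, cells = [0, 1], [0, 1]
table = {(y, x1, x2): p for (y, x1, x2), p in zip(
    itertools.product(classes, cells, cells),
    [0.20, 0.05, 0.05, 0.20, 0.05, 0.25, 0.15, 0.05])}
joint = lambda y, x: table[(y,) + tuple(x)]

x_obs = (0, 1)
# 1) Semantic classification: y* = argmax_y P(y | x) = argmax_y P(y, x).
y_star = max(classes, key=lambda y: joint(y, x_obs))
# 2) Novelty detection: threshold the likelihood sum_y P(y, x).
is_known = sum(joint(y, x_obs) for y in classes) > 0.15
# 3) Prototype generation: x* = argmax_x P(x | y) for a given class y.
proto = max(itertools.product(cells, cells), key=lambda x: joint(1, x))
# 4) Completion: x1 observed, x2 missing -> argmax_{x2} sum_y P(y, x1, x2).
x2_star = max(cells, key=lambda x2: sum(joint(y, (0, x2)) for y in classes))
print(y_star, is_known, proto, x2_star)
```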

V. GANS FOR SPATIAL MODELING

Recently, Generative Adversarial Networks [14] have received significant attention for their ability to learn complex visual phenomena in an unsupervised way [15], [25]. The idea behind GANs is to simultaneously train two deep networks: a generative model G(z; θg) that captures the data distribution and a discriminative model D(x; θd) that discriminates between samples from the training data and samples generated by G. The training alternates between updating θd to correctly discriminate between the true and generated data samples and updating θg so that D is fooled. The generator is defined as a function of noise variables z, typically distributed uniformly (values from -1 to 1 in our experiments). For every value of z, a trained G should produce a sample from the data distribution.

Although GANs have been known to be unstable to train, several architectures have been proposed that result in stable models over a wide range of datasets. Here, we employ one such architecture called DC-GAN [15], which provides excellent results on datasets such as MNIST, LSUN, ImageNet [15] or CelebA [25]. Specifically, we used 3 convolutional layers (of dimensions 18×18×64, 9×9×128, 5×5×256) with stride 2 and one fully-connected layer for D². We used an analogous architecture based on fractionally strided convolutions for G. We assumed z to be of size 100. DC-GANs do not use pooling layers, perform batch normalization for both D and G, and use ReLU and LeakyReLU activations for G and D, respectively. We used ADAM to learn the parameters.

Since DC-GAN is a convolutional model, we could not directly use the polar representation as input. Instead, we used the Cartesian local grid maps directly. The resolution of the Cartesian maps was set to 36×36, which is larger than the average resolution of the polar grid, resulting in 1296 occupancy values being fed to the DC-GAN, as compared to 1176 for DGSM. We encoded input occupancy values into a single channel² (0, 0.5, 1 for unknown, empty, and occupied).
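A sketch of such an architecture in tf.keras is shown below (the framework, kernel sizes, LeakyReLU slope, and the exact generator layout are assumptions; the text only specifies the discriminator feature-map sizes, the stride, the 36×36 single-channel input, and the 100-dimensional z):

```python
# Hedged sketch of a DC-GAN-style D and G for 36x36 single-channel grid maps.
import tensorflow as tf

def make_discriminator():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(36, 36, 1)),
        tf.keras.layers.Conv2D(64, 4, strides=2, padding="same"),   # -> 18x18x64
        tf.keras.layers.LeakyReLU(0.2),
        tf.keras.layers.Conv2D(128, 4, strides=2, padding="same"),  # -> 9x9x128
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.LeakyReLU(0.2),
        tf.keras.layers.Conv2D(256, 4, strides=2, padding="same"),  # -> 5x5x256
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.LeakyReLU(0.2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1),                                   # real/fake logit
    ])

def make_generator(z_dim=100):
    # Analogous architecture based on fractionally strided convolutions.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(z_dim,)),
        tf.keras.layers.Dense(9 * 9 * 128),
        tf.keras.layers.Reshape((9, 9, 128)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2DTranspose(64, 4, strides=2, padding="same"),  # 18x18
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2DTranspose(1, 4, strides=2, padding="same",
                                        activation="sigmoid"),              # 36x36
    ])
```

The sigmoid output keeps generated values in [0, 1], matching the assumed 0/0.5/1 occupancy encoding.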

A. Predicting Missing Observations

In [25], a method was proposed for applying GANs to the problem of image completion. The approach first trains a GAN on the training set and then relies on stochastic gradient descent to adjust the value of z according to a loss function L(z) = Lc(z) + λLp(z), where Lc is a contextual loss measuring the similarity between the generated and true known input values, while Lp is a perceptual loss which ensures that the recovered missing values look real to the discriminator. We use this approach to infer missing observations in our experiments. While effective, it requires iterative optimization to infer the missing values. This is in contrast to DGSM, which performs inference using a single up/down pass through the network. We selected the parameter λ to obtain the highest ratio of correctly reconstructed pixels.
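A sketch of this procedure, assuming a trained generator G and discriminator D such as those above (λ, the optimizer settings, and the exact loss terms are illustrative):

```python
# Optimize the latent code z so that G(z) matches the observed part of the
# grid (contextual loss) while looking real to D (perceptual loss).
import tensorflow as tf

def complete(G, D, grid, mask, steps=1000, lambda_p=0.1):
    """grid: (1, 36, 36, 1) observed map; mask: 1 where values are known."""
    z = tf.Variable(tf.random.uniform([1, 100], -1.0, 1.0))
    opt = tf.keras.optimizers.Adam(0.01)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            g = G(z, training=False)
            contextual = tf.reduce_sum(tf.abs((g - grid) * mask))
            perceptual = tf.reduce_mean(
                tf.nn.sigmoid_cross_entropy_with_logits(
                    labels=tf.ones_like(D(g)), logits=D(g)))
            loss = contextual + lambda_p * perceptual
        opt.apply_gradients([(tape.gradient(loss, z), z)])
    # Keep observed values, fill the masked-out region from the generator.
    return grid * mask + G(z, training=False) * (1.0 - mask)
```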

VI. EXPERIMENTS

We conducted four experiments corresponding to the inference types described in Sec. IV-C. Importantly, DGSM was trained only once and the same instance of the model was used for all inferences. For each experiment, we compared to a baseline model fine-tuned to the specific sub-problem.

²We evaluated architectures consisting of 4 conv. layers and layers of different dimensions (depth of the 1st layer ranging from 32 to 256). We also investigated using two and three channels to encode occupancy information. The final architecture results in significantly better completion accuracy.

Fig. 4: Normalized confusion matrices for the task of semantic place categorization: (a) DGSM, (b) SVM + geometric features.

A. Experimental Setup

Our experiments were performed on laser-range data from the COLD-Stockholm dataset [2]. The dataset contains multiple data sequences captured using a mobile robot navigating with constant speed through four different floors of an office building. On each floor, the robot navigates through rooms of different semantic categories. There are 9 different large offices, 8 different small offices (distributed across the floors), 4 long corridors (1 per floor, with varying appearance in different parts), and multiple examples of places in doorways. The dataset features several other room categories: an elevator, a living room, a meeting room, a large meeting room, and a kitchen. However, each of these categories has only one or two room instances. Therefore, we decided to designate those categories as novel when testing novelty detection and used the remaining four categories for the majority of the experiments. To ensure variability between the training and test sets, we split the data samples four times, each time training the DGSM model on samples from three floors and leaving one floor out for testing. The presented results are averaged over the four splits.

The experiments were conducted using LibSPN [16]. SPNs are still a new architecture, and only a few, limited, domain-specific implementations exist at the time of writing. In contrast, our library offers a general toolbox for structure generation, learning, and inference, and enables quick application of SPNs to new domains. It integrates with TensorFlow, which leads to an efficient solution capable of utilizing multiple GPUs, and enables combining SPNs with other deep architectures. The presented experiments are as much an evaluation of DGSM as they are of LibSPN.

B. Semantic Place Categorization

First, we evaluated DGSM for semantic place categorization and compared it to a well-established model based on an SVM and geometric features [12], [13]. The features were extracted from 360° virtual laser scans raytraced in the original, high-resolution (2cm/pixel) Cartesian grid maps used to form the polar grids for DGSM. To ensure the best SVM result, we used an RBF kernel and selected the kernel and learning parameters directly on the test sets. While early attempts to solve a similar classification problem with deep conv nets exist [6], it is not clear whether they offer performance improvements compared to the SVM-based approach. Additionally, using SVMs allows us to evaluate not only classification, but also novelty detection.

Fig. 5: ROC curves for novelty detection. Inliers are considered positive, while novel samples are negative.

The models were trained on the four room categories and evaluated on observations collected in places belonging to the same category, but on different floors. The normalized confusion matrices are shown in Fig. 4. We can see that DGSM obtains superior results for all classes. The classification rate averaged over all classes (giving equal importance to each class) and data splits is 85.9% ± 5.4 for SVM and 92.7% ± 6.2 for DGSM, with DGSM outperforming SVM for every split. Most of the confusion exists between the small and large office classes. Offices in the dataset often have complex geometry that varies greatly between the rooms.

C. Novelty Detection

The second experiment evaluated the quality of the uncertainty measure produced by DGSM and its applicability to detecting outliers from room categories not known during training. We used the same DGSM model and relied on the likelihood produced by DGSM to decide if a test sample is from a known or novel category. The cumulative ROC curve over all data splits is shown in Fig. 5.

We compared to a one-class SVM with an RBF kernel trained on the geometric features. The ν parameter was adjusted to obtain the largest area under the curve (AUC) on the test sets. We can observe that DGSM offers a significantly more reliable novelty signal, with AUC of 0.81 compared to 0.76 for SVM. This result is significant, since to the best of our knowledge, it demonstrates for the first time the usefulness of SPN-based models for novelty detection.
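For reference, the two SVM baselines can be sketched with scikit-learn as follows (placeholder data; the actual inputs are the geometric laser-range features of [12][13], and the kernel parameters and ν were tuned as described above):

```python
# Place categorization with an RBF-kernel SVM and novelty detection with a
# one-class SVM, evaluated with ROC AUC. All data and parameters are placeholders.
import numpy as np
from sklearn.svm import SVC, OneClassSVM
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.integers(0, 4, size=200)   # placeholder features/labels

clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X, y)          # place categorization
print("accuracy:", clf.score(X, y))

ocsvm = OneClassSVM(kernel="rbf", nu=0.1).fit(X)                  # novelty detection
X_novel = rng.normal(loc=3.0, size=(50, 10))                      # placeholder outliers
scores = np.concatenate([ocsvm.score_samples(X), ocsvm.score_samples(X_novel)])
labels = np.concatenate([np.ones(len(X)), np.zeros(len(X_novel))])  # inliers positive
print("AUC:", roc_auc_score(labels, scores))
```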

D. Generating Observations of Places

In this qualitative experiment, our aim was to assess and compare the generative properties of DGSM and GANs by inferring complete, prototypical appearances of places knowing only semantic categories. For DGSM, we conditioned on the semantic class variable and inferred the MPE state of the observation variables. The generated polar occupancy grids are illustrated in Fig. 6a-d. For GANs, we plot samples generated for random values of the noise variables z in Fig. 6e.

Fig. 6: Results of MPE inference over place appearances conditioned on each semantic category for DGSM ((a) Corridor, (b) Doorway, (c) Small Office, (d) Large Office); and place appearance samples generated using a GAN (e).

Fig. 7: Examples of successful and unsuccessful completions of place observations with missing data, grouped by true semantic category for DGSM ((a) Corridor, (b) Doorway, (c) Small Office, (d) Large Office) and for GAN (e). For each example, a pair of grids is shown, with the true complete grid on the left and the inferred missing data on the right. The part of the grid that was masked and inferred is highlighted.

We can compare the plots to the true examples depicted in Fig. 3. We can see that each polar grid is very characteristic of the class from which it was generated. The corridor is an elongated structure with walls on either side, and the doorway is depicted as a narrow structure with empty space on both sides. Despite the fact that, as shown in Fig. 3, large variability exists between the instances of offices within the same category, the generated observations of small and large offices clearly have a distinctive size and shape.

While the GAN architecture used for predicting missing observations in unlabeled samples cannot generate samples conditioned on a semantic category, we still clearly see examples of different room classes and intra-class variations.

E. Predicting Missing Observations

Our final experiment provides a quantitative evaluation of the generative models on the problem of reconstructing missing values in partial observations of places. To this end, we masked a random chunk of 25% of the grid cells for each test sample in the dataset. In the case of DGSM, we masked a random 90-degree view, which corresponds to a rectangular mask in polar coordinates. For GANs, since we use a Cartesian grid, we used a square mask in a random position around the edges of the grid map³. For DGSM, all indicators for the masked polar cells were set to 1 to indicate missing evidence and MPE inference followed. For GANs, we used the approach in Sec. V-A.

Fig. 7 shows examples of grids filled with predicted occupancy information to replace the missing values for both models. While the predictions are often consistent with the true values, both models do make mistakes. Analyzing the DGSM results more closely, we see that this typically occurs when the mask removes information distinctive of the place category. Interestingly, in some cases, the unmasked input grid might itself be partial due to missing observations during laser-range data acquisition. When the missing observations coincide with a mask, the model will attempt to reconstruct them. Such an example can be seen for a polar grid captured in a corridor, shown in the bottom left corner of Fig. 7.

Overall, when averaged over all the test samples and data splits, DGSM correctly reconstructs 77.14% ± 1.04 of masked cells, while GANs recover 75.84% ± 1.51. This demonstrates that the models have comparable generative potential, confirming state-of-the-art performance of DGSM.

³We considered polar masks on top of Cartesian grid maps. However, this provided a significant advantage to GANs, since most of the masked pixels lay far from the robot, often outside a room, where they are easy to predict.

F. Discussion and Model Comparison

The experiments clearly demonstrate the potential of DGSM. Its generative abilities match (and potentially surpass) those of GANs on our problem, while being significantly more computationally efficient. DGSM naturally represents missing evidence and requires only a single upward and downward pass through the network to infer missing observations, while GANs required hundreds of iterations, each propagating gradients through the network. Additionally, DGSM in our experiments used a smaller network than the GANs, requiring roughly a quarter of the sum and product operations to compute a single pass, without the need for any nonlinearities. This property makes DGSM specifically well suited for real-time robotics applications.

DC-GANs, being convolutional models, lend themselves to very efficient implementations on GPUs. DGSM uses a more complicated network structure. However, in our current implementation in LibSPN, DGSM is real-time during inference and very efficient during learning, obtaining much faster inference than GANs. As a result, extending the model to include additional modalities and capture visual appearance or 3D structure is computationally feasible with DGSM.

The experiments with different inference types were all performed on the same model after a single training phase (separately for each dataset split). This demonstrates that our model spans not only multiple levels of abstraction, but also multiple tasks. In contrast, the SVMs and GANs were optimized to solve a specific task. In particular, the model retains a high capability to discriminate, outperforming a discriminative SVM. Yet, it is trained generatively to represent a joint distribution over low-level observations. Additionally, as demonstrated in the novelty detection experiments, it produces a useful, probabilistic uncertainty signal. Neither GANs nor SVMs explicitly represent the likelihood of the data.

VII. CONCLUSIONS AND FUTURE WORK

This paper presents DGSM, a unique generative spatial model which, to our knowledge, is the first application of Sum-Product Networks to the domain of robotics. Our results demonstrate that DGSM provides an efficient framework for learning deep probabilistic representations of robotic environments, spanning low-level features, geometry, and semantic representations. We have shown that DGSM has great generative and discriminative potential, and can be used to predict latent spatial concepts and missing observations. It can solve a variety of important robotic tasks, from semantic classification of places and uncertainty estimation, to novelty detection and generation of place appearances based on semantic information. DGSM has appealing properties and offers state-of-the-art performance. Our future efforts will focus on extending DGSM to include more complex perception provided by visual and depth sensors, as well as exploiting the resulting deep representations for probabilistic reasoning and planning at multiple levels of abstraction.

REFERENCES

[1] A. Aydemir, A. Pronobis, M. Gobelbecker, and P. Jensfelt, “Active visual object search in unknown environments using uncertain semantics,” T-RO, vol. 29, no. 4, 2013.
[2] A. Pronobis and P. Jensfelt, “Large-scale semantic mapping and reasoning with heterogeneous modalities,” in ICRA, 2012.
[3] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” TPAMI, vol. 35, no. 8, 2013.
[4] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” in ISER, 2016.
[5] N. Sunderhauf, F. Dayoub, S. McMahon, B. Talbot, R. Schulz, P. Corke, G. Wyeth, B. Upcroft, and M. Milford, “Place categorization and semantic mapping on a mobile robot,” in ICRA, 2016.
[6] R. Goeddel and E. Olson, “Learning semantic place labels from occupancy grids using CNNs,” in IROS, 2016.
[7] H. Poon and P. Domingos, “Sum-product networks: A new deep architecture,” in UAI, 2011.
[8] R. Peharz, R. Gens, F. Pernkopf, and P. Domingos, “On the latent variable interpretation in Sum-Product Networks,” TPAMI, 2017.
[9] R. Gens and P. Domingos, “Discriminative learning of sum-product networks,” in NIPS, 2012.

[10] R. Peharz, G. Kapeller, P. Mowlaee, and F. Pernkopf, “Modeling speech with Sum-Product Networks: Application to bandwidth extension,” in ICASSP, 2014.

[11] M. Amer and S. Todorovic, “Sum-Product Networks for activity recognition,” TPAMI, vol. 38, no. 4, 2015.
[12] O. M. Mozos, C. Stachniss, and W. Burgard, “Supervised learning of places from range data using AdaBoost,” in ICRA, 2005.
[13] A. Pronobis, O. M. Mozos, B. Caputo, and P. Jensfelt, “Multi-modal semantic place classification,” IJRR, vol. 29, no. 2-3, Feb. 2010.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014.
[15] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv:1511.06434 [cs.LG], 2015.
[16] A. Pronobis, A. Ranganath, and R. P. N. Rao, “LibSPN: A library for learning and inference with Sum-Product Networks and TensorFlow,” in ICML Workshop on Principled Approaches to Deep Learning, 2017. [Online]. Available: http://www.libspn.org
[17] A. Oliva and A. Torralba, “Building the gist of a scene: The role of global image features in recognition,” Brain Research, vol. 155, 2006.
[18] A. Ranganathan, “PLISS: Detecting and labeling places using online change-point detection,” in RSS, 2010.
[19] P. Buschka and A. Saffiotti, “A virtual sensor for room detection,” in IROS, 2002.
[20] A. Tapus and R. Siegwart, “Incremental robot mapping with fingerprints of places,” in ICRA, 2005.
[21] Y. Sun, L. Bo, and D. Fox, “Attribute based object identification,” in ICRA, 2013.
[22] A. Eitel, J. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, “Multimodal deep learning for robust RGB-D object recognition,” in IROS, 2015.
[23] M. Madry, L. Bo, D. Kragic, and D. Fox, “ST-HMP: Unsupervised spatio-temporal feature learning for tactile data,” in ICRA, 2014.
[24] D. Kingma and M. Welling, “Auto-encoding variational Bayes,” in ICLR, 2014.
[25] R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N. Do, “Semantic image inpainting with perceptual and contextual losses,” arXiv:1607.07539 [cs.CV], 2016.
[26] W.-C. Cheng, S. Kok, H. Pham, H. Chieu, and K. Chai, “Language modeling with Sum-Product Networks,” in Interspeech, 2014.
[27] R. Gens and P. Domingos, “Learning the structure of Sum-Product Networks,” in ICML, 2013.
[28] A. Rooshenas and D. Lowd, “Learning Sum-Product Networks with direct and indirect variable interactions,” in ICML, 2014.
[29] A. Darwiche, “A differential approach to inference in Bayesian networks,” Journal of the ACM, vol. 50, no. 3, May 2003.
[30] G. Grisetti, C. Stachniss, and W. Burgard, “Improved techniques for grid mapping with Rao-Blackwellized particle filters,” T-RO, vol. 23, no. 1, 2007.