Development of a Semantic Segmentation System for Dynamic Occlusion...

Development of a Semantic Segmentation Systemfor Dynamic Occlusion Handling in Mixed Reality forLandscape Simulation

Daiki Kido1, Tomohiro Fukuda2, Nobuyoshi Yabuki31,2,3Division and Sustainable Energy and Environmental Engineering, GraduateSchool of Engineering, Osaka [email protected] 2,3{fukuda|yabuki}@see.eng.osaka-u.ac.jp

The use of mixed reality (MR) for landscape simulation has attracted attentionrecently. MR can produce a realistic landscape simulation by merging athree-dimensional computer graphic (3DCG) model of a new building on a realspace. One challenge with MR that remains to be tackled is occlusion. Properlyhandling occlusion is important for the understanding of the spatial relationshipbetween physical and virtual objects. When the occlusion targets move or thetarget's shape changes, depth-based methods using a special camera have beenapplied for dynamic occlusion handling. However, these methods have alimitation of the distance to obtain depth information and are unsuitable foroutdoor landscape simulation. This study focuses on a dynamic occlusionhandling method for MR-based landscape simulation. We developed a real-timesemantic segmentation system to perform dynamic occlusion handling. Wedesigned this system for use in mobile devices with client-server communicationfor real-time semantic segmentation processing in mobile devices. Additionally,we used a normal monocular camera for practice use.

Keywords:Mixed Reality, Dynamic occlusion handling, Semantic segmentation,Deep learning, Landscape simulation

INTRODUCTIONPreserving good landscapes is important in enhanc-ing our quality of life. To preserve good land-scapes, it is necessary to predict and assess a plannedlandscape in the project. A landscape assessmentmeans that a project executor hears various opin-ions from stakeholders and evaluates them (Fukuda,et al. 2014). Then it is difficult for stakeholders, es-pecially non-experts like residents, to imagine the

planned landscape in case three-dimensional (3D)objects have not existed yet. According to this con-text, the visualization of the planned landscape canhelp stakeholders to grasp the project and will pre-serve good landscapes.

The use of mixed reality (MR) for landscape sim-ulation has attracted attention recently. MR mergesreal and virtual worlds (Milgram and Kishino, 1994),and can produce a realistic landscape simulation by

Interaction - HUMAN-COMPUTER - Volume 1 - eCAADe 37 / SIGraDi 23 | 641

overlaying a 3D computer graphic (CG) model of abuilding that does not yet exist on a camera view ofa given area. Augmented reality (AR) is a similar con-cept that merges real and virtual worlds by overlay-ing computer-generated information on a real space;however, MR merges real and virtual worlds withhigh standardswhereas AR only overlays informationon a real space.

One challenge with MR that remains to be tack-led is occlusion (i.e., the relationship between thephysical and virtual worlds) (Shah et al., 2012). Whenphysical objects occlude virtual objects, the aug-mented image may cause user confusion related todepth perception (Figure 1). Thus, it is crucial to han-dle occlusion appropriately in MR.

Previous studies of occlusion handling methodshave been divided into three categories: model-based method, depth-based method, and contour-basedmethod. Model-basedmethods create a 3DCGmodel of the real scene in pre-processing and com-pare the depth of the virtual objects with the 3DCGmodel to handle occlusion. Inoue et al. (2018) devel-oped landscape design simulation system using MR.This system handled occlusion using a 3DCG modelof the surrounding environment created with struc-ture frommotion (SfM) technology inpre-processing.In this system, the occlusion problem is solved bynot rendering 3D model pixels that are obscured bythe occlusion model. However, if a physical objectsuch as vegetation changes its shape over time, itcan be difficult to appropriately handle occlusion bychanging the occlusion model’s shape. Moreover,if the targets are moving objects such as car andperson, it can be also difficult to handle occlusion.

Depth basedmethods acquire the depth informationof the real scene from a special camera in real-timeand compare the depth of the virtual objects withthe acquired depth information of the real scene tohandle occlusion. Zhu et al. (2010) improved oc-clusion handling using depth information estimatedby matching pixels from a stereo camera. Tian etal. (2015) converted the depth information acquiredfrom RGB-D camera into a 3Dmodel in real-time andhandled occlusion. Some methods realized realis-tic occlusion handling by depth enhancement (Duet al., 2016, Holynski et al., 2018). However, depth-based methods have a limitation of the distance toobtain depth information and are unsuitable for out-door landscape simulation. Contour-based methodsdetect and track the silhouette of the real objects andhandle occlusion. Tian et al. (2010) obtained thecontour of the specified occluding object by interac-tive object segmentation method, and tracked theobject and handled occlusion. In this method, theoccluding object needs to be displayed in the firstframe, and segmented and tracked with high accu-racy. Roxas et al. (2018) used semantic segmenta-tion and a given depth map for occlusion handling.Semantic segmentation image pixels with a corre-sponding class of what is being represented. Thismethod cannot handle dynamic occlusion because agiven depth map is necessary, and also needs to usea high-end laptop computer in outdoor MR simula-tion because real-time semantic segmentation pro-cessing involves heavy processing.

This study focuses on a real-time dynamic oc-clusion handling method for MR-based landscapesimulation. To this end, we developed a real-time

Figure 1Occlusion problemin MR (Inoue, et al.2018)

642 | eCAADe 37 / SIGraDi 23 - Interaction - HUMAN-COMPUTER - Volume 1

semantic segmentation system to perform dynamicocclusion handling. Our proposed method can ex-tract each object’s region and handle occlusion dy-namically in outdoor MR simulation. Thus our pro-posed system can implement more realistic land-scape simulation with the expression of the relation-ship between a 3DCG model and physical objectsincluding moving objects such as car and personand help stakeholders to understand a planned land-scape more correctly. We designed this system foruse in mobile devices such as standard laptop com-puters and tablets, thus, it enables client-server com-munication for real-time semantic segmentationpro-cessing in mobile devices. Additionally, we used amonocular camera for this system.

DEVELOPMENT OF A SEMANTIC SEGMEN-TATION SYSTEM FOR DYNAMIC OCCLU-SION HANDLINGOverview of our proposed system is shown in Figure2. As explained in Chapter 1, we adopted a seman-tic segmentation technique to handle occlusion dy-namically in outdoor MR simulation. Semantic seg-mentation can extract each object’s region frame byframe, thus, it is considered that dynamic occlusionhandling can be realized using this technique.

Real-time semantic segmentation processing in-volvesheavyprocessingand thus requires ahigh-enddesktop computer. However, our proposed systemis to be used for landscape simulation outdoors, soit needs to be used on a mobile device. We there-

Figure 2Overview of ourproposed system


fore developed a system in which a client devicetransfers image frames to a server, which performssemantic segmentation processing on those framesand sends processed frames back to the client. More-over, many semantic segmentation modules havebeen proposed due to the advancement of com-puter vision recently. Thus, we designed this systemwhose semantic segmentation module can be re-placed when more sophisticated semantic segmen-tation module is developed. We used the Unitygame engine to develop our proposed system be-cause Unity game engine can implement a GraphicUser Interface (GUI) based MR system.

Client-server communicationFor client-server communication, we used a Pythonweb application framework known as Flask, whichsupports the development of web-based services.When a client device connects to the web applica-tionon the server it sends anHTTP request, which theserver receives alongwith an image fromacameraonthe client device. The server then performs semanticsegmentation processing and transfers the segmen-tation image back to the client device as an HTTP re-sponse (Figure 3).

Next, we displayed the frames acquired from theclient device camera and the segmentation imagesacquired from the server in the Unity engine on theclient side. We used Unity’s www class to receivesegmentation images from the web application onthe server. If the frames acquired from the cameraare displayed as is, there would be a gap betweenthose frames and the corresponding segmentationimages because of the latency of the client-servercommunication. Therefore, we delayed and adjustedthe display of client-acquired frames to the displayof segmentation images, solving this mismatch be-causewas assumed itwould lead to inappropriateoc-clusion handling.

Figure 3Flow chart of theclient-servercommunication

Figure 4Mask imagecreation (targets:vegetation, fence)

Figure 5Occlusion handlingby not rendering ARimage pixels thatmerged into a maskimage


Figure 6Arrangement of thenewly constructedbuilding

Occlusion handlingTo implement dynamic occlusion handling, we mustcreate a mask image of the occlusion handing targetobject. We created a mask image with RGB valuesin the segmentation image using OpenCV for Unity(ver. 2.2.8), which is a Unity image processing plugin(Figure 4). We then defined an occlusion handlingtarget object and created the object’s mask image.We then merged the mask image into an image ofa 3DCG model, referred to as an AR image, and han-dled occlusion by not rendering AR image pixels thatmerged into the mask image (Figure 5).

VALIDATION EXPERIMENTWe validated our semantic segmentation system fordynamic occlusion handling. We adopted ICNet(Zhao et al., 2018), which is a semantic segmenta-tion method based on deep learning, because ofits processing speed and accuracy. We applied theCityscapes dataset (Cordts et al., 2016), which wascreated for understanding urban scenes as ICNet’ssemantic segmentation dataset. We used a laptopcomputer with Intel Core i5-8250U of CPU, 8GB ofRAM, and Intel UHD Graphics 620 of GPU as a clientdevice and a desktop computer with Intel Core i7-8700Kof CPU, 32GBof RAM, andNVIDIAGeForceGTX1080Ti 11GB of GPU as a server device. A LogicoolHD Pro Stream Webcam C920 was also used. We ac-

quired frames with 1024 × 576 pixel resolution froma webcam connected to the laptop and conductedclient-server communication over a local area net-work (LAN) as this was the only possible environ-ment.

Figure 7Measurement ofthe newlyconstructedbuilding model

Figure 8Experimentalphotograph

Outdoor MR simulation was conducted to vali-date the occlusion handling of the proposed system.Figure 6 shows the arrangement of the new buildingand the camera’s position and orientation. Figure 7shows the measurement of the new building model.Then, the viewpoint was on the fourth floor of thebuilding because the server device was equipped onthe fourth floor of thebuilding and client-server com-munication was conducted over LAN (Figure 8). Veg-


Figure 9Validation results ofdynamic occlusionhandling


Figure 10Labels which can bedetected in thisexperiment

etation and fence equipped on the fourth floor of thebuilding were defined as occlusion targets in this ex-periment. The webcam was panned in a horizontaldirection in the range of the red dotted line of Figure6 during 20 seconds. After 20 seconds, the webcamwas tilted up during 15 seconds. Camera’s positionand orientation were computed using the methodwhich solves the perspective n-points (PnP) problemand eliminates the outliers using the random sampleconsensus (RANSAC) method proposed by Inoue etal. (2018). The validation results are shown in Figure9. Labels and objects which can be detected in thisexperiment are shown in Figure 10.

From this validation, we confirmed that our sys-tem can create semantic segmentation images andmask images using frames acquired from a webcam,thus handling appropriate dynamic occlusion in MR.This system also achieved dynamic occlusion han-dling using a general mobile device, meaning thatthis system can be used widely and easily for land-scape simulation. However, the processing speedwas about 5 frames per second (fps). Processingspeed that displays only frames acquired from thewebcam was about 30fps, and the processing speedof semantic segmentation with ICNet was about 65fps. Therefore, it was confirmed that the communi-cation speedbetween the client-server should be im-proved. And also it was confirmed that the result ofsemantic segmentation processing contained falseextraction, especially in the result after 35 seconds.The accuracy of semantic segmentation processingshould be validated.

CONCLUSIONThis research accomplished the following:

• We developed a system which enables dy-namic occlusion handling in MR using imageprocessing techniques that create a mask im-age from a segmentation image and mergethat mask image into an AR image.

• We developed a system in which a client de-vice acquires frames and transfers them to aserver that implements semantic segmenta-tion processing on the frames and transfersthem back to the client device.

Future work should adapt this system to a wide areanetwork (WAN) and Internet environment. Addition-ally, we should validate our proposed system at anoutdoor venue like an architectural project, and alsovalidate in the case where occlusion targets aremov-ing objects such as car and person.

ACKNOWLEDGEMENTSThis research has beenpartly supportedby JSPS KAK-ENHI Grant Number JP19K12681.

REFERENCESCordts, M., Omran, S., Rehfeld, T., Enzweiler, M., Benen-

son, R., Franke, U., Roth, S. and Schiele, B. 2016’Thecityscapesdataset for semanticurban sceneun-derstanding’, Proceedings of IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR 2016), pp.3213-3223

Du, C., Chen, Y., Ye, M. and Ren, L. 2016 ’Edge snapping-based depth enhancement for dynamic occlusionhandling in augmented reality’, The 15th IEEE Inter-national SymposiumonMixedandAugmentedReality(ISMAR 2016), pp. 54-62

Fukuda, T., Zhang, T. and Yabuki, N. 2014, ’Improvementof registration accuracy of a handheld augmented


reality system for urban landscape simulation’, Fron-tiers of Architectural Research, 3, pp. 386-397

Holynski, A. and Kopt, J. 2018, ’Fast depth densificationfor occlusion-aware augmented reality’, ACM Trans-actions on Graphics (TOG), 37 (6), pp. 1-11, 194

Inoue, K., Fukuda, T., Cao, R. and Yabuki, N. 2018’Tracking Robustness and Green View Index Esti-mation of Augmented and Diminished Reality forEnvironmental Design: PhotoAR+DR2017 project’,Proceedings of the 23rd International Conference onComputer-AidedArchitecturalDesignResearch inAsia(CAADRIA 2018), pp. 339-348

Milgram, P. and Kishino, F. 1994, ’A Taxonomy of MixedReality VisualDisplays’, IEIXETransactionson Informa-tion and Systems, E77-D, 2, pp. 1321-1329

Roxas, M., Hori, T., Fukiage, T., Okamoto, Y. and Oishi, T.2018 ’Occlusion Handling using Semantic Segmen-tation and Visibility-Based Rendering for Mixed Re-ality’, Proceedings of the 24th ACM Symposium on Vir-tual Reality Software and Technology (VRST 2018), pp.1-8, 20

Shah, MM., Arshad, H. and Sulaiman, R. 2012 ’Occlusionin augmented reality’, Proceedings of the 8th Interna-tional Conference on Information Science and DigitalContent Technology (ICIDT 2012), pp. 372-378

Tian, Y., Guan, T. andWang, C. 2010, ’Real-TimeOcclusionHandling in Augmented Reality Based on an ObjectTracking Approach’, Sensors, 10 (4), pp. 2885-2900

Tian, Y., Long, Y., Xia, D., Yao, H. and Zhang, J. 2015, ’Han-dling occlusions in augmented reality based on 3Dreconstruction method’, Neurocomputing, 156, pp.96-104

Zhao, H., Qi, X., Shen, A. and Jia, J. 2018 ’ICNet for Real-Time Semantic Segmentation on High-ResolutionImages’, Proceedingsof EuropeanConferenceonCom-puter Vision (ECCV 2018), pp. 418-434

Zhu, J., Pan, Z., Sun, C. and Chen, W. 2010, ’Handlingocclusions in video-based augmented reality usingdepth information’, Computer Animation and VirtualWorlds, 21, pp. 509-521


Development of a Semantic Segmentation System for Dynamic Occlusion...

Documents

Transcript of Development of a Semantic Segmentation System for Dynamic Occlusion...