
Transcript of arXiv:1908.11569v1 [cs.CV] 30 Aug 2019


Revisiting CycleGAN for semi-supervised segmentation

Arnab Kumar Mondal
IIT Kharagpur
[email protected]

Aniket Agarwal
IIT Roorkee
[email protected]

Jose Dolz
ETS Montreal
[email protected]

Christian Desrosiers
ETS Montreal
[email protected]

Abstract

In this work, we study the problem of training deep networks for semantic image segmentation using only a fraction of annotated images, which may significantly reduce human annotation efforts. Particularly, we propose a strategy that exploits the unpaired image style transfer capabilities of CycleGAN in semi-supervised segmentation. Unlike recent works using adversarial learning for semi-supervised segmentation, we enforce cycle consistency to learn a bidirectional mapping between unpaired images and segmentation masks. This adds an unsupervised regularization effect that boosts the segmentation performance when annotated data is limited. Experiments on three different public segmentation benchmarks (PASCAL VOC 2012, Cityscapes and ACDC) demonstrate the effectiveness of the proposed method. The proposed model achieves a 2-4% improvement over the baseline and outperforms recent approaches for this task, particularly in the low labeled-data regime.

1. Introduction

Deep learning methods have recently emerged as an efficient solution for semantic image segmentation, achieving outstanding performance in a wide range of applications such as natural scene analysis, autonomous driving and medical imaging. Despite their success, a main limitation of these methods is the need for large training datasets of pixel-level annotated images. Acquiring such labeled images is a time-consuming process that may require user expertise in various scenarios. This impedes the applicability of deep models to applications where labeled images are scarce.

Semi-supervised learning (SSL) has been proposed to overcome the shortage of labeled data. In this scenario, we assume that a large set of unlabeled images is available during training, in addition to a small set of images with strong annotations. Consider an SSL segmentation setting with two distinct subsets: L = \{(x_i, y_i)\}_{i=1}^{n} contains labeled images x_i and their corresponding ground-truth masks y_i, and U = \{x'_i\}_{i=1}^{m} contains unlabeled images x'_i (typically m \gg n). In this setting, the objective is often formulated as maximizing a log-likelihood with respect to the learning parameters θ of a deep network, through the supervision provided by the labeled set L. On the other hand, unsupervised images in U can be leveraged in different ways, typically introducing a regularization effect in deep models and therefore improving their generalization capabilities.
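For a concrete (and purely illustrative) picture of this setting, the labeled and unlabeled subsets can be kept in two separate datasets that are sampled jointly during training; the toy tensors and class names below are placeholders, not part of the paper.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LabeledSet(Dataset):
    """Pairs (x_i, y_i): images with their pixel-level ground-truth masks."""
    def __init__(self, images, masks):
        self.images, self.masks = images, masks
    def __len__(self):
        return len(self.images)
    def __getitem__(self, i):
        return self.images[i], self.masks[i]

class UnlabeledSet(Dataset):
    """Images x'_i without annotations (typically m >> n)."""
    def __init__(self, images):
        self.images = images
    def __len__(self):
        return len(self.images)
    def __getitem__(self, i):
        return self.images[i]

# Toy tensors standing in for real data: n = 8 labeled, m = 64 unlabeled images.
labeled = LabeledSet(torch.rand(8, 3, 128, 256), torch.randint(0, 20, (8, 128, 256)))
unlabeled = UnlabeledSet(torch.rand(64, 3, 128, 256))
labeled_loader = DataLoader(labeled, batch_size=5, shuffle=True)
unlabeled_loader = DataLoader(unlabeled, batch_size=5, shuffle=True)
```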

Generative adversarial networks (GANs) [8] have been shown to be an efficient solution for unsupervised domain adaptation [9, 10, 28, 29], a problem related to semi-supervised learning. GAN-based methods for domain adaptation use adversarial learning to match the distributions of source and target data, commonly at the input or in feature space. Recently, the CycleGAN model [34] has become a popular choice to transfer image style between domains, as it eliminates the restriction of corresponding image pairs during training [12]. This model finds a mapping between source and target images which preserves key attributes between the input and the transformed image using a cycle consistency loss. While CycleGAN has been widely employed to learn a mapping between different domains, it has not yet been investigated in more traditional semi-supervised scenarios where there is no domain shift between labeled and unlabeled data.

In this work, we leverage the unpaired domain adaptation ability of CycleGAN to learn a bidirectional mapping from unlabeled real images to available ground truth masks. This mapping, learned in conjunction with the standard supervised mapping from labeled images to their corresponding labels, acts as an unsupervised regularization loss which helps train the network when labeled data is limited. The proposed method contrasts with recent work on domain adaptation for segmentation [9, 13], where the CycleGAN is employed to map images across two domains.



It also differs significantly from recent work using GAN-generated images for semi-supervised segmentation [27], in which cycle consistency is not enforced. The main contributions of this paper can be summarized as follows:

1. To our knowledge, this is the first semi-supervised segmentation method using CycleGAN to learn a cycle-consistent mapping between unlabeled real images and ground truth masks. The proposed technique acts as an unsupervised regularization prior which improves segmentation performance when labeled data is limited.

2. We validate our approach on three challenging segmentation tasks from different applications (i.e., natural scenes, autonomous driving and medical imaging), and show that our method is dataset-independent and effective for a wide range of scenarios.

3. Additionally, we present an ablation study which analyzes the effect of various components of the proposed unsupervised loss and demonstrates the usefulness of these components for improving performance. We believe this analysis is important for future investigations of CycleGANs applied to semi-supervised segmentation.

The rest of the paper is organized as follows. In Section 2, we give a brief overview of relevant work on semantic segmentation, with a focus on semi-supervised learning and adversarial learning. Section 3 then presents our model, which is evaluated on three challenging datasets in Section 4. Finally, we conclude with a summary of our main contributions and results.

2. Related work

Supervised methods based on convolutional neural networks (CNNs) are driving progress in semantic segmentation [4, 18, 26]. Despite their success, training these networks requires a large number of densely-annotated images which are expensive to obtain. A solution to address this limitation is weakly-supervised learning, where easier-to-obtain annotations like image-level tags [15, 21, 32], bounding boxes [6, 25] or scribbles [17] are instead used to train segmentation models. However, weakly-supervised methods still require some human interaction, which may be difficult to obtain in certain scenarios.

Semi-supervision is a special type of weakly-supervised learning where many unlabeled images are also available for training [1, 2, 20, 23, 24, 33]. Instead of relying on weak annotations, semi-supervised learning (SSL) typically uses domain- or task-agnostic properties of the data to regularize learning. Recently, several SSL methods have been proposed for semantic segmentation, based for instance on self-training [1], distillation [33], attention learning [20], manifold embedding [2], co-training [23], and temporal ensembling [24]. Like these methods, the proposed approach can leverage unlabeled images directly, without the need for weak annotations or task-specific priors.

Adversarial learning has also shown great promise for training deep segmentation models with few strongly-annotated images [11, 27, 31]. An interesting approach to include unlabeled images during training is to add a discriminator network to the model, which must determine whether the output of the segmentation network corresponds to a labeled or unlabeled image [31]. This encourages the segmentation network to have a similar distribution of outputs for images with and without annotations, thereby helping generalization. A potential issue with this approach is that the adversarial network can have a reverse effect, where the output for annotated images becomes increasingly similar to the incorrect predictions obtained for unlabeled images. A related strategy uses the discriminator to predict a confidence map for the segmentation, enforcing this output to be maximal for annotated images [11]. For unlabeled images, areas of high confidence are used to update the segmentation network in a self-teaching manner. The main limitation of this approach is that a confidence threshold must be provided, the value of which can affect performance.

Until now, only a single work has applied Generative Adversarial Networks (GANs) to semi-supervised segmentation [27]. In this previous work, generated images are used for training in addition to both labeled and unlabeled data. The trained segmentation network must predict the correct labels for real images, or a special fake label for generated images. For this method to work, fake images should be generated from outside the distribution of real images, so that the segmentation network learns a better representation of the manifold (i.e., fake images constitute negative examples). In contrast, our method uses cycle-consistent GANs to better estimate the distribution of real images and their corresponding segmentation masks.

3. Methodology

3.1. CycleGAN for semi-supervised segmentation

The proposed architecture for semi-supervised segmentation, illustrated in Figure 1, is based on the cycle-consistent GAN (CycleGAN) model [34], which has shown outstanding performance for unpaired image-to-image translation. This architecture is composed of four inter-connected networks, two conditional generators and two discriminators, which are trained simultaneously. In the original CycleGAN model, the generators are employed to learn a bidirectional mapping from one image domain to the other. On the other hand, the discriminators try to determine whether an image from the corresponding domain is real or generated.


[Figure 1: diagram of the four networks and their connections. The image-to-labels generator G_IS maps labeled images X_L and unlabeled images X_U to generated labels G_IS(X_L) and G_IS(X_U); the labels-to-image generator G_SI maps ground-truth masks Y_L to generated images G_SI(Y_L); the labels discriminator D_S scores real vs. generated labels and the image discriminator D_I scores real vs. generated images. Cycle-consistency losses compare the reconstructed image G_SI(G_IS(X_U)) and the reconstructed labels G_IS(G_SI(Y_L)) to their originals, alongside the segmentation and discriminator losses.]

Figure 1: Schematic explaining the working of our model. The model contains four networks which are trained simultaneously.

By fooling the discriminators through adversarial learning, the model thus learns to generate images from the true distribution without requiring paired images. A cycle-consistency loss is also added to ensure that the generators are consistent, i.e., that we recover the same image when going through both generators sequentially.

In our semi-supervised segmentation model, the CycleGAN is instead used to map images to their corresponding segmentation masks and vice-versa. The first generator (G_IS), corresponding to the segmentation network that we want to obtain, learns a mapping from an image to its segmentation labels. The first discriminator (D_S) tries to differentiate these generated labels from real segmentation masks. Note that the combination of G_IS and D_S is similar to the semi-supervised segmentation approach presented in [31]. Conversely, the second generator (G_SI) learns to map a segmentation mask to its image. In our semi-supervised segmentation setting, this generator is only used to improve training. Likewise, the second discriminator (D_I) receives an image as input and predicts whether this image is real or generated. To enforce cycle consistency, the generators are trained so that feeding the labels generated by G_IS for an image into G_SI gives back that same image, and passing the image generated by G_SI for a segmentation mask back through G_IS gives back that same mask. Figure 2 shows examples of images, ground truth labels, generated images and generated labels obtained for the three datasets used in our experiments.
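The data flow described above can be sketched as follows. The networks are reduced to 1x1 convolutions so that only the shapes and connections matter; the actual architectures are given in Section 3.3, and the output activations (softmax for labels, tanh for images) follow the choices reported there.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in networks (1x1 convolutions) so that only shapes and connections matter;
# the actual generator / discriminator architectures are described in Section 3.3.
K = 21                                    # number of classes (e.g., PASCAL VOC)
G_IS = nn.Conv2d(3, K, kernel_size=1)     # image  -> label scores (segmentation network)
G_SI = nn.Conv2d(K, 3, kernel_size=1)     # labels -> image
D_S  = nn.Conv2d(K, 1, kernel_size=1)     # labels discriminator (pixel-wise real/fake score)
D_I  = nn.Conv2d(3, 1, kernel_size=1)     # image discriminator  (pixel-wise real/fake score)

x_u = torch.rand(2, 3, 64, 64)            # unlabeled images X_U
y_l = F.one_hot(torch.randint(0, K, (2, 64, 64)), K).permute(0, 3, 1, 2).float()  # masks Y_L

# Image -> labels -> image cycle: the reconstruction should match x_u.
fake_labels = torch.softmax(G_IS(x_u), dim=1)
rec_image   = torch.tanh(G_SI(fake_labels))

# Labels -> image -> labels cycle: the reconstruction should match y_l.
fake_image = torch.tanh(G_SI(y_l))
rec_labels = torch.softmax(G_IS(fake_image), dim=1)

# Discriminators score generated samples against real ones.
score_fake_labels = D_S(fake_labels)      # vs. D_S(y_l)
score_fake_image  = D_I(fake_image)       # vs. D_I(x_u)
```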

3.2. Loss functions

In this section, we formally define the loss functions employed to train our model in a semi-supervised setting where the data comes from three distributions: labeled images (X_L), ground truth masks of labeled images (Y_L), and unlabeled images (X_U). The first loss function is a standard supervised segmentation loss that constrains the segmentation network (G_IS) to reproduce the ground-truth masks of labeled images:

\mathcal{L}^{S}_{gen}(G_{IS}) = \mathbb{E}_{x,y \sim X_L, Y_L}\big[ H\big(y, G_{IS}(x)\big) \big]    (1)

where H is the pixel-wise cross-entropy defined as

H(y, \hat{y}) = -\sum_{j=1}^{N} \sum_{k=1}^{K} y_{j,k} \log \hat{y}_{j,k}.    (2)

In this expression, y_{j,k} and \hat{y}_{j,k} are the ground truth and predicted probabilities that pixel j ∈ {1, ..., N} has label k ∈ {1, ..., K}. Likewise, we employ a pixel-wise L2 norm between a labeled image and the image generated from its corresponding ground truth as the supervised loss to train the image generator G_SI:

\mathcal{L}^{I}_{gen}(G_{SI}) = \mathbb{E}_{x,y \sim X_L, Y_L}\big[ \| G_{SI}(y) - x \|_2^2 \big].    (3)
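For reference, the two supervised terms of Eqs. (1)-(3) translate directly into standard tensor operations; the sketch below assumes that G_IS outputs per-class softmax probabilities and that the ground-truth masks are one-hot encoded, with the expectations approximated by batch means.

```python
import torch

def supervised_seg_loss(pred_probs, y_onehot, eps=1e-8):
    """Eqs. (1)-(2): pixel-wise cross-entropy H(y, G_IS(x)), averaged over the batch.
    pred_probs and y_onehot have shape (B, K, H, W)."""
    ce = -(y_onehot * torch.log(pred_probs + eps)).sum(dim=1)   # sum over classes k
    return ce.sum(dim=(1, 2)).mean()                            # sum over pixels j, batch mean

def supervised_image_loss(gen_image, real_image):
    """Eq. (3): pixel-wise squared L2 norm between G_SI(y) and the labeled image x."""
    return ((gen_image - real_image) ** 2).sum(dim=(1, 2, 3)).mean()
```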



Figure 2: Examples of images, ground truth labels, generated images and generated labels obtained for the three benchmark datasets: PASCAL VOC 2012 (top row), Cityscapes (middle row), and ACDC (bottom row).

To exploit unlabeled images, we incorporate two additional types of losses: adversarial losses and cycle-consistency losses. The adversarial losses are used to train the generators and discriminators in a competing fashion, and help the generators produce realistic images and segmentation masks. To improve the training of the discriminators, we follow the approach presented in [19] and use a least-squares loss instead of the traditional cross-entropy. It was shown in this previous work that this loss function amounts to minimizing the Pearson χ2 divergence. Suppose that D_S(y) is the predicted probability that segmentation labels y correspond to a ground truth mask. We define the adversarial loss for D_S as

\mathcal{L}^{S}_{disc}(G_{IS}, D_S) = \mathbb{E}_{y \sim Y_L}\big[ (D_S(y) - 1)^2 \big] + \mathbb{E}_{x' \sim X_U}\big[ D_S\big(G_{IS}(x')\big)^2 \big].    (4)

Similarly, letting D_I(x) be the predicted probability that an image x is real, the adversarial loss for the image discriminator is defined as

\mathcal{L}^{I}_{disc}(G_{SI}, D_I) = \mathbb{E}_{x' \sim X_U}\big[ (D_I(x') - 1)^2 \big] + \mathbb{E}_{y \sim Y_L}\big[ D_I\big(G_{SI}(y)\big)^2 \big].    (5)
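A minimal reading of these least-squares adversarial terms is sketched below; the helper names are ours, and the generator-side counterpart (pushing generated samples towards the "real" target) is one common choice following the LSGAN formulation of [19].

```python
import torch

def ls_disc_loss(disc, real, fake):
    """Discriminator side of Eqs. (4)-(5): real samples are pushed towards 1,
    generated samples towards 0 (least-squares targets)."""
    loss_real = ((disc(real) - 1.0) ** 2).mean()
    loss_fake = (disc(fake.detach()) ** 2).mean()   # detach: no gradient to the generator here
    return loss_real + loss_fake

def ls_gen_loss(disc, fake):
    """Generator side of the same game: fool the discriminator into predicting 1."""
    return ((disc(fake) - 1.0) ** 2).mean()
```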

The first cycle consistency loss measures the difference between an unlabeled image and the regenerated image after passing through generators G_IS and G_SI sequentially. Here, we use the L1 norm since it leads to sharper images than the L2 norm:

\mathcal{L}^{I}_{cycle}(G_{IS}, G_{SI}) = \mathbb{E}_{x' \sim X_U}\big[ \| G_{SI}(G_{IS}(x')) - x' \|_1 \big].    (6)

On the other hand, since the segmentation labels are categorical variables, we use cross-entropy to evaluate the difference between a ground-truth segmentation mask and the regenerated labels after passing through generators G_SI and G_IS in sequence:

\mathcal{L}^{S}_{cycle}(G_{IS}, G_{SI}) = \mathbb{E}_{y \sim Y_L}\big[ H\big(y, G_{IS}(G_{SI}(y))\big) \big].    (7)
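The two cycle-consistency terms of Eqs. (6)-(7) can likewise be written in a few lines; in this sketch G_IS is assumed to return softmax probabilities and y_onehot to be a one-hot ground-truth mask, matching the cross-entropy of Eq. (2).

```python
import torch

def image_cycle_loss(G_IS, G_SI, x_u):
    """Eq. (6): L1 distance between an unlabeled image and its reconstruction."""
    rec = G_SI(G_IS(x_u))
    return (rec - x_u).abs().sum(dim=(1, 2, 3)).mean()

def label_cycle_loss(G_IS, G_SI, y_onehot, eps=1e-8):
    """Eq. (7): cross-entropy between a ground-truth mask and its reconstruction;
    G_IS is assumed to return softmax probabilities here."""
    rec_probs = G_IS(G_SI(y_onehot))
    ce = -(y_onehot * torch.log(rec_probs + eps)).sum(dim=1)
    return ce.sum(dim=(1, 2)).mean()
```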

Finally, the total loss is obtained by combining all six loss terms:

\mathcal{L}_{total}(G_{IS}, G_{SI}, D_S, D_I) = \mathcal{L}^{S}_{gen}(G_{IS}) + \lambda_1 \mathcal{L}^{I}_{gen}(G_{SI}) + \lambda_2 \mathcal{L}^{S}_{cycle}(G_{IS}, G_{SI}) + \lambda_3 \mathcal{L}^{I}_{cycle}(G_{IS}, G_{SI}) - \lambda_4 \mathcal{L}^{S}_{disc}(G_{IS}, D_S) - \lambda_5 \mathcal{L}^{I}_{disc}(G_{SI}, D_I),    (8)

and the model is trained by solving the min-max problem

\arg\min_{G_{IS}, G_{SI}} \; \arg\max_{D_S, D_I} \; \mathcal{L}_{total}(G_{IS}, G_{SI}, D_S, D_I).    (9)

In practice, learning is performed in an alternating fashion, where the parameters of the generators are optimized while considering those of the discriminators as fixed, and vice-versa.
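A minimal sketch of this alternating scheme is given below. It assumes the networks, data loaders, loss helpers and weights λ1-λ5 (written lam1..lam5) from the previous sketches are in scope, that G_IS and G_SI apply their softmax/tanh output activations internally, and that y_l is a one-hot mask tensor; it illustrates the update order, not the authors' training script.

```python
import itertools
import torch

# Alternating optimization of Eq. (9); all names refer to the earlier sketches.
opt_G = torch.optim.Adam(itertools.chain(G_IS.parameters(), G_SI.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(itertools.chain(D_S.parameters(), D_I.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))

for (x_l, y_l), x_u in zip(labeled_loader, unlabeled_loader):
    # Generator step: discriminator parameters are treated as fixed.
    opt_G.zero_grad()
    loss_G = (supervised_seg_loss(G_IS(x_l), y_l)
              + lam1 * supervised_image_loss(G_SI(y_l), x_l)
              + lam2 * label_cycle_loss(G_IS, G_SI, y_l)
              + lam3 * image_cycle_loss(G_IS, G_SI, x_u)
              + lam4 * ls_gen_loss(D_S, G_IS(x_u))
              + lam5 * ls_gen_loss(D_I, G_SI(y_l)))
    loss_G.backward()
    opt_G.step()

    # Discriminator step: generator parameters are treated as fixed.
    opt_D.zero_grad()
    loss_D = (ls_disc_loss(D_S, y_l, G_IS(x_u))
              + ls_disc_loss(D_I, x_u, G_SI(y_l)))
    loss_D.backward()
    opt_D.step()
```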

3.3. Implementation details

Following the original implementation of CycleGAN, we adopt the architecture proposed in [14] for our generators, since it has shown impressive results for image-style transfer.


This network is composed of two stride-2 convolutions, followed by 9 residual blocks and two fractionally-strided convolutions with stride 1/2. Similarly, instance normalization [30] was employed and no dropout was used. Furthermore, we used softmax as the output function when generating segmentation labels from images, whereas tanh was the selected function when translating from segmentation labels to images, in order to have continuous values. In pre-processing, each channel of an image is normalized to the [−1, 1] range by subtracting its mean value and dividing by the difference between the maximum and minimum value.
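The pre-processing step, as described, could look like the following (a sketch of the stated per-channel normalization, not the authors' code):

```python
import torch

def normalize_channels(img):
    """Pre-processing described above: for each channel of a (C, H, W) image,
    subtract its mean and divide by (max - min), which keeps values in [-1, 1]."""
    flat = img.view(img.size(0), -1)
    mean = flat.mean(dim=1).view(-1, 1, 1)
    rng = (flat.max(dim=1).values - flat.min(dim=1).values).view(-1, 1, 1)
    return (img - mean) / rng.clamp(min=1e-8)
```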

Unlike the original CycleGAN model, we make use of pixel-wise discriminators [11], where the size of the output is the same as that of the input and the adversarial label (i.e., real / generated) is replicated at each output pixel. We found this model to perform better than having a single discriminator output. Each discriminator contains three convolutional blocks, followed by Leaky ReLU activations with a negative slope of α = 0.2. In addition, batch normalization is used in the discriminators after the second convolutional block.
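One possible reading of such a pixel-wise discriminator is sketched below; the channel widths and kernel sizes are assumptions, only the structure (three convolutional blocks, LeakyReLU with slope 0.2, batch normalization after the second block, per-pixel output map) follows the text.

```python
import torch
import torch.nn as nn

class PixelDiscriminator(nn.Module):
    """Pixel-wise discriminator: the output map has the same spatial size as the
    input, so the real/generated decision is made at every pixel."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),                          # batch norm after the second block
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, kernel_size=3, padding=1),  # per-pixel real/fake score
        )

    def forward(self, x):
        return self.net(x)

scores = PixelDiscriminator(3)(torch.rand(1, 3, 64, 64))  # -> shape (1, 1, 64, 64)
```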

Both generators and discriminators were trained using the Adam optimizer [16] with β1 and β2 parameters equal to 0.5 and 0.999. The learning rate was initially set to 2×10−4, with a linear decay after every 100 epochs, over 400 epochs of training. Furthermore, the batch size was set to 5 in all experiments. The values of the weighting terms in Eq. (8) were set to λ1 = 1, λ2 = 0.05, λ3 = 1, λ4 = 1 and λ5 = 1. The code was implemented in PyTorch 3.3 [22] and experiments were run on a server equipped with an NVIDIA Titan V GPU (12 GB). The code is made publicly available at https://github.com/arnab39/Semi-supervised-segmentation-cycleGAN.
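For illustration, these hyperparameters can be wired up as follows; the exact shape of the linear learning-rate decay is our reading of the text ("after every 100 epochs, during 400 epochs") and may differ from the released code.

```python
import torch

# Loss weights of Eq. (8), as listed above.
lam1, lam2, lam3, lam4, lam5 = 1.0, 0.05, 1.0, 1.0, 1.0

def linear_decay(epoch, start=100, total=400):
    """Keep the base rate for the first `start` epochs, then decay linearly to 0."""
    return 1.0 if epoch < start else max(0.0, (total - epoch) / (total - start))

# Placeholder parameter; in practice the generator and discriminator optimizers
# from the training sketch above would be scheduled this way.
opt = torch.optim.Adam([torch.zeros(1, requires_grad=True)], lr=2e-4, betas=(0.5, 0.999))
scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=linear_decay)
# scheduler.step() is then called once per epoch.
```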

4. Experiments

4.1. Datasets

We conduct experiments on three different public semantic segmentation benchmarks: PASCAL VOC 2012 [7], Cityscapes [5] and the MICCAI 2017 Automated Cardiac Diagnosis Challenge (ACDC) [3].

PASCAL VOC 2012: This dataset contains 21 common object classes, including one background class. In our experiments, we employed the augmented set composed of 10,582 images, which we split into training (8994 images) and validation (1588 images) subsets. In addition, due to memory limitations, we resized all images to 200×200 pixels before feeding them into the network.

Cityscapes: This second dataset contains 50 videos from driving scenes, in which a total of 20 classes (including background) are manually annotated. In our experiments, we split the 3475 provided images into training (2953 images) and validation (522 images) subsets. As in the previous case, all images were resized to a 128×256 pixel resolution.

ACDC: This medical image set focuses on the segmentation of cardiac structures (the left ventricular endocardium and epicardium, and the right ventricular endocardium) and consists of 100 cine magnetic resonance (MR) exams covering normal cases and subjects with well-defined pathologies: dilated cardiomyopathy, hypertrophic cardiomyopathy, myocardial infarction with altered left ventricular ejection fraction, and abnormal right ventricle. Each exam contains acquisitions at the diastolic and systolic phases. For our experiments, we employed 75 exams for training and the remaining 25 for validation.

It is important to note that, since we aim at isolating the performance of each method rather than achieving state-of-the-art results, no data augmentation was performed on any of the datasets during training.

4.2. Evaluation protocol

We use the mean intersection over union (mIoU) metric to evaluate the segmentation results of all models. This metric is computed per class as TP / (TP + FP + FN), where TP, FP and FN are the true positive, false positive and false negative pixels, respectively, determined over the whole validation set, and is then averaged over classes.
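A straightforward implementation of this metric, computed on prediction and ground-truth label maps concatenated over the validation set, is sketched below (the handling of absent classes is our choice):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU: per class, TP / (TP + FP + FN) over the given integer label maps,
    then averaged over classes."""
    ious = []
    for k in range(num_classes):
        tp = np.logical_and(pred == k, target == k).sum()
        fp = np.logical_and(pred == k, target != k).sum()
        fn = np.logical_and(pred != k, target == k).sum()
        denom = tp + fp + fn
        if denom > 0:                      # skip classes absent from both maps
            ious.append(tp / denom)
    return float(np.mean(ious))
```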

To have an upper bound on performance, we train a network in a fully-supervised manner, employing all available training images. We also trained the same model from scratch using only 10%, 20%, 30% or 50% of labeled images, and refer to this baseline as Partial. Our semi-supervised method is trained with the same subsets as the Partial baseline; however, it also makes use of unlabeled training images. Last, we compare our method to the approach presented in [11], which has shown state-of-the-art performance for semi-supervised segmentation.

4.3. Results

In the following section, we report the experimental results of the proposed approach on the three datasets described in Section 4.1.

4.3.1 Comparison on benchmarks

Table 1 reports the results obtained by the tested approaches on the three benchmark datasets. We first observe that, in all cases, the proposed model outperforms the partial-supervision baseline when training with a reduced set of labeled images. This difference is particularly significant when pixel-level annotations are scarce (i.e., 10% and 20% of the whole training set), where the proposed model achieves a 2-4% improvement.


Method             Labeled %   VOC      Cityscapes   ACDC
Full               100         0.5543   0.5551       0.8982
Hung et al. [11]   20          0.2032   0.3490       0.8063
Partial            50          0.4108   0.4856       0.8863
Partial            30          0.3237   0.4502       0.8785
Partial            20          0.2688   0.4112       0.8642
Partial            10          0.2158   0.3636       0.8418
Ours               50          0.4197   0.4997       0.8890
Ours               30          0.3514   0.4654       0.8804
Ours               20          0.2981   0.4321       0.8688
Ours               10          0.2543   0.3923       0.8463

Table 1: Semantic segmentation performance (mIoU) of the tested methods on the three benchmark datasets, for different levels of supervision. Full corresponds to the segmentation network trained with all training samples, and Partial to the same network trained with a subset of labeled images (ranging from 10% to 50%) without considering unlabeled images.

As the number of labeled images increases, the gap between the baseline and the proposed model decreases, with a gain close to 1% when training with half of the whole training set. Furthermore, we found that the semi-supervised segmentation approach of Hung et al. [11] obtained poor results on all three datasets, with lower accuracy than the partial-supervision baseline (Partial). In the original work [11], the authors used a generator pre-trained on ImageNet. In our experiments, to have an unbiased comparison, we tested all methods without such pre-training (i.e., all generators and discriminators were trained from scratch). This could potentially explain the lower results obtained here for Hung et al.'s method.

A visual comparison of results is given in Figures 3, 4 and 5. It can be seen that the proposed method predicts segmentations closer to those of the network trained with all images (Full) than the partial-supervision baseline (Partial) and Hung et al.'s model. While predicted region boundaries are sometimes inaccurate, the global semantic information of the image (i.e., the actual class labels) appears to be better learned by our model compared to the partial-supervision baseline. In addition, our model seems to better capture details of thin objects (e.g., the legs of persons) compared to both the baseline and the method in [11].

4.3.2 Ablation study

To further analyze the effect of the different components of the proposed model, we conduct an ablation study where the model is trained while removing a single loss term from Eq. (8). Specifically, we train the model without the labels cycle-consistency loss (L^S_cycle), the image cycle-consistency loss (L^I_cycle), the labels discriminator loss (L^S_disc), or the image discriminator loss (L^I_disc). Note that these modifications correspond to setting λ2, λ3, λ4 or λ5 to 0, respectively. For this experiment, we investigate the performance of the model trained with 20% of labeled data on PASCAL VOC 2012.
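In code, each ablation simply zeroes the corresponding weight of Eq. (8); the dictionary below is an illustrative configuration with hypothetical variable names.

```python
# Each ablation zeroes one weight of Eq. (8); variable names are illustrative.
base = {"lam1": 1.0, "lam2": 0.05, "lam3": 1.0, "lam4": 1.0, "lam5": 1.0}
ablations = {
    "w/o labels cycle loss (L^S_cycle)": {**base, "lam2": 0.0},
    "w/o image cycle loss (L^I_cycle)":  {**base, "lam3": 0.0},
    "w/o labels discr. loss (L^S_disc)": {**base, "lam4": 0.0},
    "w/o image discr. loss (L^I_disc)":  {**base, "lam5": 0.0},
}
```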

Method                               VOC (mIoU)
Proposed                             0.2981
w/o labels cycle loss (L^S_cycle)    0.2627
w/o image cycle loss (L^I_cycle)     0.2733
w/o labels discr. loss (L^S_disc)    0.2614
w/o image discr. loss (L^I_disc)     0.2543

Table 2: Ablation study on the PASCAL VOC 2012 dataset with 20% labeled data.


The results of our ablation study are summarized in Table 2. The proposed model containing all loss terms reaches an mIoU value of 0.2981. If we remove the cycle-consistency loss on the generation of segmentation labels, this value drops to 0.2627, whereas removing the cycle-consistency loss on the image generation leads to a smaller drop, to 0.2733; this suggests that the cycle-consistency loss on segmentation masks has a stronger impact on the model. Regarding the significance of the discriminator losses, we observe the reverse effect: the lowest performance is obtained when the loss on the discriminator D_I, which is responsible for differentiating between unlabeled and generated images, is ignored.

5. Discussion and conclusion

We presented a semi-supervised method for semantic image segmentation, whose key idea is to leverage CycleGAN to learn a cycle-consistent mapping between unlabeled real images and available ground truth masks. Unlike recent work using adversarial learning for semi-supervised segmentation [11, 27, 31], the proposed strategy enforces consistency between unpaired images and segmentation masks, which acts as an unsupervised regularizer. The reported results show that this strategy improves segmentation performance, particularly when annotated data is scarce.

Due to the high computational and memory requirements of generating large images, our experiments employed images of reduced size, in particular for the Cityscapes dataset, where the resolution was reduced from 1024×2048 pixels to 128×256. This is in large part responsible for the lower accuracy values obtained in our experiments compared to those reported in the literature. In a future investigation, we will evaluate the performance of our model on full-sized images.



Figure 3: Visual comparisons on the PASCAL VOC 2012 dataset employing 20% of labeled images for training.

Moreover, in this work, we used the same network architecture for both generators (G_IS and G_SI). This architectural choice was made to achieve a better learning equilibrium during training (i.e., to avoid one generator learning much faster than the other). Employing different networks in future experiments could however improve performance.

References

[1] W. Bai, O. Oktay, M. Sinclair, H. Suzuki, M. Rajchl, G. Tarroni, B. Glocker, A. King, P. M. Matthews, and D. Rueckert. Semi-supervised learning for network-based cardiac MR image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 253–260. Springer, 2017.

[2] C. Baur, S. Albarqouni, and N. Navab. Semi-supervised deep learning for fully convolutional networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 311–319. Springer, 2017.

[3] O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P.-A. Heng, I. Cetin, K. Lekadir, O. Camara, M. A. G. Ballester, et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Transactions on Medical Imaging, 2018.

[4] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.

[5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.

[6] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1635–1643, 2015.

[7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.

[9] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.

[10] J. Hoffman, D. Wang, F. Yu, and T. Darrell. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.

[11] W.-C. Hung, Y.-H. Tsai, Y.-T. Liou, Y.-Y. Lin, and M.-H. Yang. Adversarial learning for semi-supervised semantic segmentation. In Proceedings of the British Machine Vision Conference (BMVC), 2018.

[12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.



Figure 4: Visual comparisons on the Cityscapes dataset employing 20% of labeled images for training.


Figure 5: Visual comparisons on the ACDC dataset employing 20% of labeled images for training.

[13] J. Jiang, Y.-C. Hu, N. Tyagi, P. Zhang, A. Rimner, G. S. Mageras, J. O. Deasy, and H. Veeraraghavan. Tumor-aware, adversarial domain adaptation from CT to MRI for lung cancer segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 777–785. Springer, 2018.


[14] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.

[15] H. Kervadec, J. Dolz, M. Tang, E. Granger, Y. Boykov, and I. B. Ayed. Constrained-CNN losses for weakly supervised segmentation. Medical Image Analysis, 2019.

[16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[17] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3159–3167, 2016.

[18] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[19] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.

[20] S. Min and X. Chen. A robust deep attention network to noisy labels in semi-supervised biomedical segmentation. arXiv preprint arXiv:1807.11719, 2018.

[21] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In ICCV, 2015.

[22] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

[23] J. Peng, G. Estrada, M. Pedersoli, and C. Desrosiers. Deep co-training for semi-supervised image segmentation. arXiv preprint arXiv:1903.11233, 2019.

[24] C. S. Perone, P. Ballester, R. C. Barros, and J. Cohen-Adad. Unsupervised domain adaptation for medical imaging segmentation with self-ensembling. arXiv preprint arXiv:1811.06042, 2018.

[25] M. Rajchl, M. C. Lee, O. Oktay, K. Kamnitsas, J. Passerat-Palmbach, W. Bai, M. Damodaram, M. A. Rutherford, J. V. Hajnal, B. Kainz, et al. DeepCut: Object segmentation from bounding box annotations using convolutional neural networks. IEEE Transactions on Medical Imaging, 36(2):674–683, 2017.

[26] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[27] N. Souly, C. Spampinato, and M. Shah. Semi supervised semantic segmentation using generative adversarial network. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5689–5697. IEEE, 2017.

[28] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2018.

[29] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2017.

[30] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

[31] Y. Zhang, L. Yang, J. Chen, M. Fredericksen, D. P. Hughes, and D. Z. Chen. Deep adversarial networks for biomedical image segmentation utilizing unannotated images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 408–416, 2017.

[32] Y. Zhou, X. He, L. Huang, L. Liu, F. Zhu, S. Cui, and L. Shao. Collaborative learning of semi-supervised segmentation and classification for medical images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2079–2088, 2019.

[33] Y. Zhou, Y. Wang, P. Tang, W. Shen, E. K. Fishman, and A. L. Yuille. Semi-supervised multi-organ segmentation via multi-planar co-training. arXiv preprint arXiv:1804.02586, 2018.

[34] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.


Class   Partial 10%   Partial 20%   Partial 30%   Ours 10%   Ours 20%   Ours 30%
1       0.4236        0.5302        0.5798        0.4369     0.5808     0.5819
2       0.2046        0.2624        0.3328        0.1423     0.3574     0.4197
3       0.1060        0.1542        0.1544        0.0238     0.1511     0.2575
4       0.1633        0.2479        0.2349        0.1504     0.2048     0.3146
5       0.0119        0.0535        0.0744        0.0227     0.0188     0.0623
6       0.5192        0.5310        0.6315        0.6962     0.6867     0.6426
7       0.3263        0.3368        0.4746        0.4451     0.4877     0.5042
8       0.2523        0.4050        0.3841        0.3455     0.3123     0.4075
9       0.1029        0.1226        0.1379        0.2158     0.1429     0.1680
10      0.0123        0.1888        0.1211        0.1412     0.2021     0.1579
11      0.0725        0.1868        0.3487        0.0643     0.2097     0.2359
12      0.2048        0.2496        0.2509        0.2331     0.2302     0.2989
13      0.0781        0.1033        0.1769        0.2474     0.1056     0.2205
14      0.3811        0.4126        0.5343        0.1496     0.4601     0.5107
15      0.5886        0.6528        0.6721        0.5690     0.6992     0.6645
16      0.0889        0.1230        0.0775        0.1996     0.2179     0.1707
17      0.1294        0.1061        0.2078        0.1851     0.0488     0.2258
18      0.1089        0.0654        0.1820        0.2091     0.0350     0.2232
19      0.2990        0.3760        0.4768        0.1879     0.5115     0.4681
20      0.2407        0.2668        0.4210        0.4280     0.2988     0.5331

Table 3: Class-wise mean IoU for the 20 valid classes of the PASCAL VOC 2012 dataset, obtained with 10%, 20% or 30% of labeled examples. Partial corresponds to training only the segmentation network of our semi-supervised CycleGAN method with the subset of labeled examples.


Class   Partial 10%   Partial 20%   Partial 30%   Ours 10%   Ours 20%   Ours 30%
1       0.9346        0.9453        0.9523        0.9369     0.9457     0.9534
2       0.5932        0.6518        0.6850        0.6323     0.6775     0.6998
3       0.7807        0.8149        0.8333        0.7977     0.8199     0.8489
4       0.1419        0.1463        0.1820        0.1401     0.1657     0.2008
5       0.0715        0.1201        0.1684        0.1128     0.1469     0.1726
6       0.2441        0.2975        0.3288        0.2732     0.3074     0.3426
7       0.1255        0.2067        0.2481        0.1869     0.2275     0.2644
8       0.1993        0.2799        0.3315        0.2528     0.3091     0.3512
9       0.7824        0.8177        0.8355        0.8012     0.8278     0.8565
10      0.4458        0.4661        0.4923        0.4433     0.4718     0.5137
11      0.8777        0.8953        0.9019        0.8872     0.8953     0.9072
12      0.3766        0.4534        0.5021        0.4387     0.4822     0.5229
13      0.0687        0.1173        0.1786        0.0867     0.1625     0.1961
14      0.7626        0.8094        0.8367        0.7822     0.8259     0.8528
15      0.0877        0.0857        0.1258        0.0825     0.1046     0.1473
16      0.0448        0.0715        0.1735        0.0576     0.1425     0.1965
17      0.0814        0.2454        0.2892        0.2067     0.2781     0.2985
18      0.0739        0.0935        0.1471        0.0701     0.1525     0.1529
19      0.2160        0.2948        0.3420        0.2611     0.2601     0.3618

Table 4: Class-wise mean IoU for the 19 valid classes of the Cityscapes dataset, obtained with 10%, 20% or 30% of labeled examples. Partial corresponds to training only the segmentation network of our semi-supervised CycleGAN method with the subset of labeled examples.



Figure 6: Visual comparisons on the PASCAL VOC 2012 dataset employing 30% of labeled images for training.



Figure 7: Visual comparisons on the Cityscapes dataset employing 30% of labeled images for training.