
JointGAN: Multi-Domain Joint Distribution Learning with Generative Adversarial Nets

Yunchen Pu 1  Shuyang Dai 2  Zhe Gan 3  Weiyao Wang 2  Guoyin Wang 2  Yizhe Zhang 3  Ricardo Henao 2  Lawrence Carin 2

Abstract

A new generative adversarial network is developed for joint distribution matching. Distinct from most existing approaches, which only learn conditional distributions, the proposed model aims to learn a joint distribution of multiple random variables (domains). This is achieved by learning to sample from conditional distributions between the domains, while simultaneously learning to sample from the marginals of each individual domain. The proposed framework consists of multiple generators and a single softmax-based critic, all jointly trained via adversarial learning. From a simple noise source, the proposed framework allows synthesis of draws from the marginals, conditional draws given observations from a subset of random variables, or complete draws from the full joint distribution. Most examples considered are for joint analysis of two domains, with examples for three domains also presented.

1. Introduction

Generative adversarial networks (GANs) (Goodfellow et al., 2014) have emerged as a powerful framework for modeling the draw of samples from complex data distributions. When trained on datasets of natural images, significant progress has been made on synthesizing realistic and sharp-looking images (Radford et al., 2016). Recent work has also extended the GAN framework to the challenging task of text generation (Yu et al., 2017; Zhang et al., 2017b). However, in its standard form, a GAN models the distribution of one domain, i.e., of a single random variable.

This work was done when Yunchen Pu, Zhe Gan and Yizhe Zhang were Ph.D. students at Duke University. 1Facebook, Menlo Park, CA, USA 2Duke University, Durham, NC, USA 3Microsoft Research, Redmond, WA, USA. Correspondence to: Yunchen Pu <[email protected]>, Shuyang Dai <[email protected]>, Zhe Gan <[email protected]>, Lawrence Carin <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

There has been recent interest in employing GAN ideas to learn conditional distributions for two random variables. This setting is of interest when one desires to synthesize (or infer) one random variable given an instance of another random variable. Example applications include generative models with (stochastic) latent variables (Mescheder et al., 2017; Tao et al., 2018), and conditional data synthesis (Isola et al., 2017; Reed et al., 2016), when both domains consist of observed pairs of random variables.

In this paper we focus on learning the joint distribution of multiple random variables using adversarial training. For the case of two random variables, conditional GAN (Mirza & Osindero, 2014) and Triangle GAN (Gan et al., 2017a) have been utilized for this task in the case that paired data are available. Further, adversarially learned inference (ALI) (Dumoulin et al., 2017; Donahue et al., 2017) and CycleGAN (Zhu et al., 2017; Kim et al., 2017; Yi et al., 2017) were developed for unsupervised learning, where the two-way mappings between two domains are learned without any paired data. These models are unified as the joint distribution matching problem by Li et al. (2017a). However, in all previous approaches the joint distributions are not fully learned, i.e., the model only learns to sample from the conditional distributions, assuming access to the marginal distributions, which are typically instantiated as empirical samples from each individual domain (see Figure 1(b) for illustration). Therefore, only conditional data synthesis can be achieved, due to the lack of a learned sampling mechanism for the marginals.

It is desirable to build a generative-model learning framework from which one may sample from a fully-learned joint distribution. We design a new GAN framework that learns the joint distribution by decomposing it into the product of marginal and conditional distributions, each learned via adversarial training (see Figure 1(c) for illustration). The resulting model may then be employed in several distinct applications: (i) synthesis of draws from any of the marginals; (ii) synthesis of draws from the conditionals when other random variables are observed, i.e., imputation; or (iii) simultaneous draws of all random variables from the joint distribution.


For the special case of two random variables, the proposed model consists of four generators and a softmax critic function. The design includes two generators for the marginals, two for the conditionals, and a single 5-way critic (discriminator) trained to distinguish pairs of real data from four different kinds of synthetic data. These five modules are implemented as neural networks, which share parameters for efficiency and are optimized jointly via adversarial learning. We also consider an example with three random variables.

The contributions of this work are summarized as follows. (i) We present the first GAN-enabled framework that allows for full joint-distribution learning of multiple random variables. Unlike existing models, the proposed framework learns marginals and conditionals simultaneously. (ii) We share parameters among the generator models, so the resulting model does not have a significant increase in the number of parameters relative to prior work that only considered conditionals (ALI, Triangle GAN, CycleGAN, etc.) or marginals (traditional GAN). (iii) Unlike existing approaches, we consolidate all real vs. artificial sample comparisons into a single softmax-based critic function. (iv) While the main focus is on the case of two random variables, we extend the proposed model to learning the joint distribution of three or more random variables. (v) We apply the proposed model in both unsupervised and supervised learning paradigms.

2. Background

To simplify the presentation, we first consider joint modeling of two random variables, with the setup generalized in Sec. 3.2 to the case of more than two domains. For the two-random-variable case, consider marginal distributions q(x) and q(y) defined over two random variables x ∈ X and y ∈ Y, respectively. Typically, we have samples but not an explicit density form for q(x) and q(y), i.e., ensembles {x_i}_{i=1}^N and {y_j}_{j=1}^M are available for learning. In general, their joint distribution can be written as the product of a marginal and a conditional in two ways: q(x, y) = q(x)q(y|x) = q(y)q(x|y). One random variable can be synthesized (or inferred) given the other using the conditional distributions q(x|y) and q(y|x).

2.1. Generative Adversarial Networks

Nonparametric sampling from the marginal distributions q(x) and q(y) can be accomplished via adversarial learning (Goodfellow et al., 2014), which provides a sampling mechanism that only requires gradient backpropagation, avoiding the need to explicitly adopt a form for the marginals. Specifically, instead of sampling directly from an (assumed) parametric distribution for the desired marginal, the target random variable is generated as a deterministic transformation of an easy-to-sample, independent noise source, e.g., a Gaussian distribution.

Figure 1. Comparing the generative process of GAN (Goodfellow et al., 2014), ALI (Dumoulin et al., 2017), Triangle GAN (Gan et al., 2017a), CycleGAN (Zhu et al., 2017) and the proposed JointGAN model: (a) learning marginals with GAN; (b) learning conditionals with ALI, Triangle GAN, CycleGAN; (c) learning joint distributions with the proposed JointGAN. GAN in (a) only allows synthesizing samples from marginals, the models in (b) only allow conditional data synthesis, while the proposed model in (c) allows data synthesis from both marginals and conditionals.

The sampling procedure for the marginals, implicitly defined as x ∼ pα(x) and y ∼ pβ(y), is carried out through the following two generative processes:

x = fα(ε1),   ε1 ∼ p(ε1) ,   (1)
y = fβ(ε′1),   ε′1 ∼ p(ε′1) ,   (2)

where fα(·) and fβ(·) are two marginal generators, specified as neural networks with parameters α and β, respectively. p(ε1) and p(ε′1) are assumed to be simple distributions, e.g., isotropic Gaussian. The generative processes manifested by (1) and (2) are illustrated in Figure 1(a). Within the models, stochasticity in x and y is manifested via draws from p(ε1) and p(ε′1), and the respective neural networks fα(·) and fβ(·) transform draws from these simple distributions such that they are approximately consistent with draws from q(x) and q(y).
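To make the generative processes (1) and (2) concrete, the following is a minimal sketch, assuming a PyTorch-style implementation with simple MLP generators; the architecture, dimensions and class names are illustrative assumptions, not the networks used in the paper.

```python
import torch
import torch.nn as nn

class MarginalGenerator(nn.Module):
    """Deterministic map from simple noise to a data sample, x = f(eps)."""
    def __init__(self, noise_dim=100, data_dim=784, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, data_dim), nn.Tanh(),
        )

    def forward(self, eps):
        return self.net(eps)

f_alpha, f_beta = MarginalGenerator(), MarginalGenerator()
eps1 = torch.randn(16, 100)        # eps1 ~ p(eps1), isotropic Gaussian
x = f_alpha(eps1)                  # x ~ p_alpha(x), Eq. (1)
y = f_beta(torch.randn(16, 100))   # y ~ p_beta(y),  Eq. (2)
```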

For this purpose, GAN trains an ω-parameterized critic function gω(x) to distinguish samples generated from pα(x) and q(x). Formally, the minimax objective of GAN is

min_α max_ω L_GAN(α, ω) = E_{x∼q(x)}[log σ(gω(x))] + E_{ε1∼p(ε1)}[log(1 − σ(gω(fα(ε1))))] ,   (3)


with the expectations E_{x∼q(x)}[·] and E_{ε1∼p(ε1)}[·] approximated via sampling, and σ(·) the sigmoid function. As shown in Goodfellow et al. (2014), the equilibrium of the objective in (3) is achieved if and only if pα(x) = q(x).

Similarly, an analogous minimax objective can be designed to match the marginal pβ(y) to q(y).
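As a hedged illustration of how the objective in (3) might be evaluated in practice, the sketch below assumes a PyTorch-style critic g_omega that outputs pre-sigmoid logits; splitting (3) into separate critic and generator losses is a common implementation choice, not a detail specified in the paper.

```python
import torch
import torch.nn.functional as F

def gan_losses(g_omega, f_alpha, x_real, noise_dim=100):
    """One evaluation of the objective in Eq. (3), split into the critic's
    (maximization) and generator's (minimization) losses."""
    eps1 = torch.randn(x_real.size(0), noise_dim)
    x_fake = f_alpha(eps1)
    logits_real = g_omega(x_real)           # g_omega outputs pre-sigmoid logits
    logits_fake = g_omega(x_fake.detach())  # detach: critic step does not update f_alpha
    # Critic ascends  E_q[log sigma(g(x))] + E_p[log(1 - sigma(g(f_alpha(eps1))))]
    critic_loss = -(F.logsigmoid(logits_real).mean()
                    + torch.log1p(-torch.sigmoid(logits_fake)).mean())
    # Generator descends the same objective (non-saturating variant omitted)
    gen_loss = torch.log1p(-torch.sigmoid(g_omega(x_fake))).mean()
    return critic_loss, gen_loss
```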

2.2. Adversarially Learned Inference

In the same spirit, sampling from the conditional distributions q(x|y) and q(y|x) can also be achieved as a deterministic transformation of two inputs: the variable in the source domain as a covariate, plus an independent source of noise. Specifically, the sampling procedure for the conditionals y ∼ pθ(y|x) and x ∼ pφ(x|y) is modeled as

y = fθ(x, ε2),   x ∼ q(x),   ε2 ∼ p(ε2) ,   (4)
x = fφ(y, ε′2),   y ∼ q(y),   ε′2 ∼ p(ε′2) ,   (5)

where fθ(·) and fφ(·) are two conditional generators, specified as neural networks with parameters θ and φ, respectively. In practice, the inputs of fθ(·) and fφ(·) are concatenated. As in GAN, p(ε2) and p(ε′2) are two simple distributions that provide the stochasticity when generating y given x, and vice versa. The conditional generative processes manifested in (4) and (5) are illustrated in Figure 1(b).

ALI (Dumoulin et al., 2017) learns the two desired conditionals by matching the joint distributions pθ(x, y) = q(x)pθ(y|x) and pφ(x, y) = q(y)pφ(x|y), using a critic function gω(x, y) similar to (3). The minimax objective of ALI can be written as

min_{θ,φ} max_ω E_{(x,y)∼pθ(x,y)}[log σ(gω(x, y))] + E_{(x,y)∼pφ(x,y)}[log(1 − σ(gω(x, y)))] .   (6)

The equilibrium of the objective in (6) is achieved if and only if pθ(x, y) = pφ(x, y).

While ALI is able to match joint distributions using (6), only the conditional distributions pθ(y|x) and pφ(x|y) are learned, thus assuming access to (samples from) the true marginal distributions q(x) and q(y), respectively.

3. Adversarial Joint Distribution Learning

Below we discuss fully learning the joint distribution of two random variables in both supervised and unsupervised settings. By “supervised” it is meant that we have access to joint empirical data (x, y), and by “unsupervised” it is meant that we have access to empirical draws of x and y, but not paired observations (x, y) from the joint distribution.

3.1. JointGAN

Since q(x, y) = q(x)q(y|x) = q(y)q(x|y), a simple way to achieve joint-distribution learning is to first learn models for the two marginals separately, using a pair of traditional GANs, followed by training an independent ALI model for the two conditionals. However, such a two-step training procedure is suboptimal, as there is no information flow between marginals and conditionals during training. This suboptimality is demonstrated in the experiments. Additionally, a two-step learning process becomes cumbersome when considering more than two random variables.

Alternatively, we consider learning to sample from the conditionals via pθ(y|x) and pφ(x|y), while also learning to sample from the marginals via pα(x) and pβ(y). All model training is performed simultaneously. We term our model JointGAN, for full GAN-based analysis of joint random variables.

Access to Paired Empirical Draws  In this setting, we assume access to samples from the true (empirical) joint distribution q(x, y). The models we seek to learn constitute two means of approximating draws from the true distribution q(x, y), i.e., pα(x)pθ(y|x) and pβ(y)pφ(x|y), as shown in Figure 1(c):

x = fα(ε1),   y = fθ(x, ε2) ,   (7)
y = fβ(ε′1),   x = fφ(y, ε′2) ,   (8)

where fα(·), fβ(·), fθ(·) and fφ(·) are neural networks as defined previously, and ε1, ε2, ε′1 and ε′2 are independent noise. Note that the only difference between fα(·) and fφ(·) is that fφ(·) has an additional conditional input y. Therefore, in the implementation, we couple the parameters α and φ together; similarly, β and θ are coupled. Specifically, fα(·) and fβ(·) are implemented as

fα(·) = fφ(0, ·), fβ(·) = fθ(0, ·) , (9)

where 0 is an all-zero tensor with the same size as x or y. As a result, (7) and (8) have almost the same number of parameters as ALI-like approaches.
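The parameter coupling in (9) can be realized by a single conditional-generator network whose conditioning input is zeroed out whenever a marginal draw is needed. The sketch below is one possible PyTorch-style realization under that assumption; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """f_phi(y, eps): the conditional input and noise are concatenated (Sec. 2.2)."""
    def __init__(self, cond_dim=784, noise_dim=100, out_dim=784, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim + noise_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim), nn.Tanh(),
        )

    def forward(self, cond, eps):
        return self.net(torch.cat([cond, eps], dim=1))

f_phi = ConditionalGenerator()   # x = f_phi(y, eps'_2)

def f_alpha(eps, cond_shape=(784,)):
    """Marginal generator obtained from the coupled parameters, Eq. (9):
    feed an all-zero tensor in place of the conditioning input."""
    zeros = eps.new_zeros((eps.size(0),) + cond_shape)
    return f_phi(zeros, eps)
```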

The following notation is introduced for simplicity of illustration:

p1(x, y) = q(x)pθ(y|x),   p2(x, y) = q(y)pφ(x|y),
p3(x, y) = pα(x)pθ(y|x),   p4(x, y) = pβ(y)pφ(x|y),
p5(x, y) = q(x, y) ,   (10)

where p1(x, y), p2(x, y), p3(x, y) and p4(x, y) are implicitly defined in (4), (5), (7) and (8). Note that p5(x, y) is simply the empirical joint distribution.
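For concreteness, a sketch of how one minibatch of (x, y) pairs might be drawn from each of the five distributions in (10) is shown below; the generator interfaces follow the notation above and are assumptions of this sketch rather than the paper's implementation.

```python
import torch

def sample_five_ways(x_real, y_real, f_alpha, f_beta, f_theta, f_phi, noise_dim=100):
    """Draw (x, y) pairs from p_1, ..., p_5 in Eq. (10) for one minibatch."""
    n = x_real.size(0)
    e = lambda: torch.randn(n, noise_dim)
    pairs = {
        "p1": (x_real, f_theta(x_real, e())),   # q(x) p_theta(y|x), Eq. (4)
        "p2": (f_phi(y_real, e()), y_real),     # q(y) p_phi(x|y),   Eq. (5)
        "p5": (x_real, y_real),                 # empirical joint q(x, y)
    }
    x_syn = f_alpha(e())
    pairs["p3"] = (x_syn, f_theta(x_syn, e()))  # p_alpha(x) p_theta(y|x), Eq. (7)
    y_syn = f_beta(e())
    pairs["p4"] = (f_phi(y_syn, e()), y_syn)    # p_beta(y) p_phi(x|y),    Eq. (8)
    return pairs
```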

When learning, we wish to impose that the five distributions in (10) should be identical. Toward this end, an adversarial objective is specified. Joint pairs (x, y) are drawn from the five distributions in (10), and a critic function is learned to discriminate among them, while the four generators are trained to mislead the critic. Naively, for JointGAN, one could use four binary critics to mimic a 5-class classifier. Departing from previous work such as Gan et al. (2017a), here the discriminator is implemented directly as a 5-way softmax classifier. Compared with using multiple binary classifiers, this design is more principled in that we avoid multiple critics producing possibly conflicting (real vs. synthetic) assessments.

Let the critic gω(x, y) ∈ ∆4 (the 4-simplex) be an ω-parameterized neural network with a softmax on the top layer, i.e., Σ_{k=1}^{5} gω(x, y)[k] = 1 and gω(x, y)[k] ∈ (0, 1), where gω(x, y)[k] is an entry of gω(x, y). The minimax objective for JointGAN, L_JointGAN(θ, φ, ω), is given by

min_{θ,φ} max_ω L_JointGAN(θ, φ, ω) = Σ_{k=1}^{5} E_{p_k(x,y)}[log gω(x, y)[k]] .   (11)

The objective (11) takes into account the model design in which α and φ are coupled, and likewise β and θ; thus α and β do not appear explicitly in (11). Note that the expectation E_{p5(x,y)}[·] is approximated using empirical joint samples, the expectations E_{p3(x,y)}[·] and E_{p4(x,y)}[·] are both approximated with purely synthesized joint samples, while E_{p1(x,y)}[·] and E_{p2(x,y)}[·] are approximated using conditionally synthesized samples, given samples from the empirical marginals. The following proposition characterizes the solution of (11) in terms of the joint distributions.

Proposition 1  The equilibrium of the minimax objective in (11) is achieved if and only if p1(x, y) = p2(x, y) = p3(x, y) = p4(x, y) = p5(x, y).

The proof is provided in Appendix A.
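A minimal sketch of the 5-way softmax critic objective (11) is given below, assuming the critic returns five logits per (x, y) pair; the actual training loop, alternation schedule and network architecture are not specified here.

```python
import torch
import torch.nn.functional as F

def jointgan_objective(critic, pairs):
    """Evaluate Eq. (11): sum_k E_{p_k}[log g_omega(x, y)[k]] with a 5-way
    softmax critic. `pairs` is a list of (x, y) minibatches, one per k = 1..5."""
    value = 0.0
    for k, (x, y) in enumerate(pairs):
        logits = critic(x, y)                 # shape [batch, 5]
        log_probs = F.log_softmax(logits, dim=1)
        value = value + log_probs[:, k].mean()
    return value   # the critic ascends this value; the generators descend it
```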

No Access to Paired Empirical Draws  When paired data samples are not available, we do not have access to draws from p5(x, y) = q(x, y), so this term is not considered in (11). Instead, we wish to impose “cycle consistency” (Zhu et al., 2017), i.e., the chain q(x) → x → y ∼ pθ(y|x) → x̂ ∼ pφ(x|y) should yield small ‖x − x̂‖, for an appropriate norm. Similarly, we impose that q(y) → y → x ∼ pφ(x|y) → ŷ ∼ pθ(y|x) results in small ‖y − ŷ‖.

In this case, the discriminator becomes a 4-class classifier. Let g′ω(x, y) ∈ ∆3 denote a new critic, with a softmax on the top layer, i.e., Σ_{k=1}^{4} g′ω(x, y)[k] = 1 and g′ω(x, y)[k] ∈ (0, 1). To encourage cycle consistency, we modify the objective in (11) as

min_{θ,φ} max_ω L_JointGAN(θ, φ, ω) = Σ_{k=1}^{4} E_{p_k(x,y)}[log g′ω(x, y)[k]] + R_{θ,φ}(x, y) ,   (12)

where R_{θ,φ}(x, y) in (12) is a cycle-consistency regularization term, specified as

R_{θ,φ}(x, y) = E_{x∼q(x), y∼pθ(y|x), x̂∼pφ(x|y)} ‖x − x̂‖ + E_{y∼q(y), x∼pφ(x|y), ŷ∼pθ(y|x)} ‖y − ŷ‖ .

Figure 2. Generative model for the tuple (x, y, z) manifested via pα(x)pν(y|x)pγ(z|x, y). Note the skip connection that naturally arises due to the form of the generative model.
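As an illustration, the regularizer R_{θ,φ}(x, y) could be estimated on a minibatch as sketched below; the L1 norm here is an assumption for illustration, as the text only states that an appropriate norm is used.

```python
import torch

def cycle_consistency(x_real, y_real, f_theta, f_phi, noise_dim=100):
    """R_{theta,phi}(x, y): reconstruct each domain through the other and
    penalize the discrepancy (L1 used here as an illustrative norm)."""
    n = x_real.size(0)
    e = lambda: torch.randn(n, noise_dim)
    x_hat = f_phi(f_theta(x_real, e()), e())   # x -> y_tilde -> x_hat
    y_hat = f_theta(f_phi(y_real, e()), e())   # y -> x_tilde -> y_hat
    return (x_real - x_hat).abs().mean() + (y_real - y_hat).abs().mean()
```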

3.2. Extension to multiple domains

The above formulation may be extended to the case of three or more joint random variables. However, for m random variables, there are m! different ways in which the joint distribution can be factorized. For example, for joint random variables (x, y, z), there are six possible forms of the model. One must have access to all six instantiations of these models if the goal is to be able to generate (impute) samples from all conditionals. However, not all modeled forms of p(x, y, z) need to be considered if there is no interest in the corresponding form of the conditional. Below, we consider two specific forms of the model:

p(x, y, z) = pα(x)pν(y|x)pγ(z|x, y)   (13)
           = pβ(z)pψ(y|z)pη(x|y, z) .   (14)

Typically, joint draws from q(x, y, z) may not be easy to access; therefore, we assume that only empirical draws from q(x, y) and q(y, z) are available. For the purpose of adversarial learning, we let the critic be a 6-class softmax classifier that aims to distinguish samples from the following six distributions:

pα(x)pν(y|x)pγ(z|x, y) ,   pβ(z)pψ(y|z)pη(x|y, z) ,
q(x)pν(y|x)pγ(z|x, y) ,   q(z)pψ(y|z)pη(x|y, z) ,
q(x, y)pγ(z|x, y) ,   q(y, z)pη(x|y, z) .

After training, one may synthesize (x, y, z), impute (y, z) from an observed x, or impute z from (x, y), etc. Examples of this learning paradigm are demonstrated in the experiments. Interestingly, when implementing a sampling-based method for the above models, skip connections manifest naturally as a result of the partitioning of the joint distribution, e.g., via pγ(z|x, y) and pη(x|y, z). This is illustrated in Figure 2, for pα(x)pν(y|x)pγ(z|x, y).
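A sketch of a generator for pγ(z|x, y), in which the extra conditioning input x produces the skip connection of Figure 2, is given below; the MLP form and dimensions are illustrative assumptions (the experiments use image-to-image architectures instead).

```python
import torch
import torch.nn as nn

class SkipConditionalGenerator(nn.Module):
    """f_gamma(x, y, eps): generator for p_gamma(z | x, y). Because the
    factorization conditions z on both x and y, x is fed directly to this
    module alongside y, which manifests as a skip connection (Figure 2)."""
    def __init__(self, x_dim=784, y_dim=784, noise_dim=100, z_dim=784, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim), nn.Tanh(),
        )

    def forward(self, x, y, eps):
        return self.net(torch.cat([x, y, eps], dim=1))
```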

4. Related work

Adversarial methods for joint distribution learning can be roughly divided into two categories, depending on the application: (i) generation and inference, if one of the domains consists of (stochastic) latent variables, and (ii) conditional data synthesis, if both domains consist of observed pairs of random variables. Below, we review related work from these two perspectives.

Generation and inference  The joint distribution of data and latent variables or codes can be considered in two (symmetric) forms: (i) from observed data samples fed through the encoder to yield codes, i.e., inference, and (ii) from codes drawn from a simple prior and propagated through the decoder to manifest data samples, i.e., generation. ALI (Dumoulin et al., 2017) and BiGAN (Donahue et al., 2017) proposed fully adversarial methods for this purpose. There are also many recent works concerned with integrating variational autoencoder (VAE) (Kingma & Welling, 2013; Pu et al., 2016) and GAN concepts for improved data generation and latent code inference (Hu et al., 2017). Representative work includes AAE (Makhzani et al., 2015), VAE-GAN (Larsen et al., 2015), AVB (Mescheder et al., 2017), AS-VAE (Pu et al., 2017), SVAE (Chen et al., 2018), etc.

Conditional data synthesis  Conditional GAN can be readily used for conditional data synthesis if paired data are available. Multiple conditional GANs have been proposed to generate images based on class labels (Mirza & Osindero, 2014), attributes (Perarnau et al., 2016), text (Reed et al., 2016; Xu et al., 2017) and other images (Isola et al., 2017). Often, only the mapping in one direction (a single conditional) is learned. Triangle GAN (Gan et al., 2017a) and Triple GAN (Li et al., 2017b) can be used to learn bi-directional mappings (both conditionals) in a semi-supervised learning setup. Unsupervised learning methods were also developed for this task. CycleGAN (Zhu et al., 2017) uses two generators to model the conditionals and two critics, one per domain, to decide whether a sample in that domain is synthesized. Further, additional reconstruction losses were introduced to impose cycle consistency. Similar work includes DiscoGAN (Kim et al., 2017), DualGAN (Yi et al., 2017) and UNIT (Liu et al., 2017).

CoGAN (Liu & Tuzel, 2016) can be used to achieve joint distribution learning. However, the joint distribution is only roughly approximated by the marginals, via sharing the low-layer weights of the generators, hence not learning the true (empirical) joint distribution in a principled way.

All other previously proposed models focus on learning to sample from the conditionals given samples from one of the true (empirical) marginals, while the proposed model, to the best of the authors' knowledge, is the first attempt to learn a full joint distribution of two or more observed random variables. Moreover, this paper presents the first consolidation of multiple binary critics into a single unified softmax-based critic.

We observe that the proposed model, JointGAN, may follow naturally in concept from GAN (Goodfellow et al., 2014) and ALI (Donahue et al., 2017; Dumoulin et al., 2017). However, several design choices are key to obtaining good performance. Specifically, (i) the conditional-distribution setup naturally yields skip connections in the architecture. (ii) Compared with using multiple binary critics, the softmax-based critic can be viewed as sharing parameters among all the binary critics except the top layer. This also encourages the critic to embed samples generated in different ways into a common latent space, and reduces the number of parameters. (iii) The weight-sharing constraint among the generators enforces that images synthesized from the marginal and conditional generators share a common latent space, and further reduces the number of parameters in the network.

5. Experiments

Adam (Kingma & Ba, 2014) with learning rate 0.0002 is utilized for optimization of the JointGAN objectives. All noise vectors ε1, ε2, ε′1 and ε′2 are drawn from an N(0, I) distribution, with the dimension of each set to 100. Besides the results presented in this section, more results can be found in Appendix C.2. The code can be found at https://github.com/sdai654416/Joint-GAN.
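A minimal sketch of this optimization setup is shown below; the placeholder modules and the Adam beta values are assumptions (only the learning rate and the noise distribution are stated above).

```python
import torch
import torch.nn as nn

noise_dim = 100
eps1 = torch.randn(64, noise_dim)   # all noise vectors drawn from N(0, I), dimension 100

# Placeholder modules standing in for the generators and the softmax critic.
generators = nn.ModuleList([nn.Linear(noise_dim, 784) for _ in range(2)])
critic = nn.Linear(784 * 2, 5)

# Adam with learning rate 0.0002, as stated above; the beta values below are
# common GAN defaults and an assumption of this sketch.
opt_gen = torch.optim.Adam(generators.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_critic = torch.optim.Adam(critic.parameters(), lr=2e-4, betas=(0.5, 0.999))
```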

5.1. Joint modeling multi-domain images

Datasets  We present results on five datasets: edges↔shoes (Yu & Grauman, 2014), edges↔handbags (Zhu et al., 2016), Google maps↔aerial photos (Isola et al., 2017), labels↔facades (Tylecek & Šára, 2013) and labels↔cityscapes (Cordts et al., 2016). All of these datasets consist of two-domain image pairs.

For three-domain images, we create a new dataset by combining labels↔facades pairs and labels↔cityscapes pairs into facades↔labels↔cityscapes tuples. In this dataset, only empirical draws from q(x, y) and q(y, z) are available. Another new dataset is created based on MNIST, where the three image domains are the original MNIST images, clockwise-transposed ones, and anticlockwise-transposed ones.

Baseline  As described in Sec. 3.1, a two-step model is implemented as the baseline. Specifically, WGAN-GP (Gulrajani et al., 2017) is employed to model the two marginals; Pix2pix (Isola et al., 2017) and CycleGAN (Zhu et al., 2017) are utilized to model the conditionals for the cases with and without access to paired empirical draws, respectively.

Figure 3. Generated paired samples from models trained on paired data.

Figure 4. Generated paired samples from the JointGAN model trained on the paired edges↔shoes dataset.

Network Architectures  For the generators, we employ the U-net (Ronneberger et al., 2015), which has been demonstrated to achieve impressive results for image-to-image translation. Following Isola et al. (2017), PatchGAN is employed for the discriminator, which provides real vs. synthesized predictions on 70 × 70 overlapping image patches.
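A rough sketch of a PatchGAN-style critic trunk in the spirit of Isola et al. (2017) is given below; the exact layer configuration, and in particular the 5-logit per-patch head used here for the JointGAN softmax critic, are assumptions of this sketch rather than details given in the text.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=2, norm=True):
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1)]
    if norm:
        layers.append(nn.InstanceNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2))
    return layers

class PatchCritic(nn.Module):
    """PatchGAN-style critic: the paired images are concatenated along the
    channel axis and classified per 70x70 receptive-field patch. For JointGAN,
    the last convolution emits 5 logits per patch (one per distribution)."""
    def __init__(self, in_channels=6, n_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            *conv_block(in_channels, 64, norm=False),
            *conv_block(64, 128),
            *conv_block(128, 256),
            *conv_block(256, 512, stride=1),
            nn.Conv2d(512, n_classes, kernel_size=4, stride=1, padding=1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))   # [batch, 5, H', W'] patch logits
```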

5.1.1. QUALITATIVE RESULTS

Figures 3 and 4 show results from models trained on paired data. All the image pairs are generated from random noise. For Figure 4, we first draw (ε1, ε2) and (ε′1, ε′2) to generate the top-left and bottom-right image pairs according to (7). All remaining image pairs are generated from noise pairs made by linear interpolation between ε1 and ε′1, and between ε2 and ε′2, respectively, also via (7). For Figure 3, in each row of the left block, the leftmost image is first generated from pα(x), and then the images to its right are generated based on the leftmost image and an additional noise vector linearly interpolated between two random points ε2 and ε′2. The images in the right block are produced in a similar way.

These results demonstrate that our model is able to generate both realistic and highly coherent image pairs. In addition, the interpolation experiments illustrate that our model maintains smooth transitions in the latent space, with each point in the latent space corresponding to a plausible image. For example, in the edges↔handbags dataset, the edges smoothly transform from complicated structures into simple ones, and the color of the handbags transforms from black to red. The quality of images generated from the baseline is much worse than ours; those results are provided in Appendix C.1.

Figure 5 shows generated samples from models trained on unpaired data. Our model is able to produce image pairs whose quality is close to that of the samples trained on paired data.

Figures 6 and 7 show generated samples from models trained on three-domain images. The generated images in each tuple are highly correlated. Interestingly, in Figure 7, the synthesized labels strive to be consistent with both the generated street-scene and facade photos.

5.1.2. QUANTITATIVE RESULTS

We perform a detailed quantitative analysis on the two-domain image-pair task.

Figure 5. Generated paired samples from models trained on unpaired data.

Figure 6. Generated three-domain samples from models trained on MNIST. The images in each tuple of the left three columns are sequentially generated from left to right, while those in the right three columns are generated from right to left.

Human Evaluation  We perform human evaluation using Amazon Mechanical Turk (AMT), and present human evaluation results on the relevance and realism of generated pairs, both with and without access to paired empirical samples. In each survey, we compare JointGAN and the two-step baseline by taking a random sample of 100 generated image pairs (5 datasets, 20 samples from each dataset), and ask the human evaluator to select which sample is more realistic and which pair's content is more relevant. We obtained roughly 44 responses per data sample (4378 samples in total), and the results are shown in Table 1. Human analysis clearly suggests that JointGAN produces higher-quality samples than the two-step baseline, verifying the effectiveness of learning the marginal and conditional simultaneously.

Table 1. Human evaluation results on the quality of generated pairs.

Method                        Realism    Relevance
Trained with paired data
  WGAN-GP + Pix2pix wins        2.32%       3.1%
  JointGAN wins                17.93%      36.32%
  Not distinguishable          79.75%      60.58%
Trained with unpaired data
  WGAN-GP + CycleGAN wins       0.13%       1.31%
  JointGAN wins                81.55%      40.87%
  Not distinguishable          18.32%      57.82%

Relevance Score  We use a relevance score to evaluate the quality and relevance of two generated images. The relevance score is calculated as the cosine similarity between the two images after embedding them into a shared latent space, which is learned by training a ranking model (Huang et al., 2013); details are provided in Appendix B. The final relevance score is the average of the individual relevance scores over all pairs. Results are summarized in Table 2. JointGAN provides significantly better results than the two-step baselines, especially when we do not have access to paired empirical samples.

Besides the results of our model and the baselines, we also present results on three types of real image pairs: (i) true pairs: real image pairs from the same dataset but not used for training the ranking model; (ii) random pairs: images from the same dataset whose content is not correlated; (iii) other pairs: images that are correlated but sampled from a dataset different from the training set. As seen in Table 2, the first obtains a high relevance score while the latter two have very low scores, which shows that the relevance-score metric assigns a low value when either the content of the generated image pairs is not correlated or the images are not plausibly like the training set. This demonstrates that the metric correlates well with the quality of generated image pairs.
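For reference, the relevance score described above amounts to an average cosine similarity in the shared embedding space; a sketch is given below, with the embedding networks of the ranking model stood in by generic callables (assumptions of this sketch).

```python
import torch
import torch.nn.functional as F

def relevance_score(embed_a, embed_b, images_a, images_b):
    """Average cosine similarity between paired embeddings in the shared
    latent space learned by the ranking model (stand-ins assumed here)."""
    za = embed_a(images_a)                     # [n_pairs, d] embeddings
    zb = embed_b(images_b)
    return F.cosine_similarity(za, zb, dim=1).mean()
```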

5.2. Joint modeling caption features and images

Setup  Our model is next evaluated on the Caltech-UCSD Birds dataset (Welinder et al., 2010), in which each bird image is paired with 10 different captions. Since generating realistic text with a GAN is itself a challenging task, in this work we train our model on pairs of caption features and images. The caption features are obtained from a pretrained word-level CNN-LSTM autoencoder (Gan et al., 2017b), which aims to achieve a one-to-one mapping between captions and features. We then train JointGAN on the caption features and their corresponding images (the paired data for training JointGAN use CNN-generated text features, which avoids the issues of training a GAN for text generation).


Table 2. Relevance scores of the generated pairs on the five two-domain image datasets.

                      edges↔shoes  edges↔handbags  labels↔cityscapes  labels↔facades  maps↔satellites
True pairs                0.684        0.672            0.591             0.529           0.514
Random pairs              0.008        0.005            0.012             0.011           0.054
Other pairs               0.113        0.139            0.092             0.076           0.081
WGAN-GP + Pix2pix         0.352        0.343            0.301             0.288           0.125
JointGAN (paired)         0.488        0.489            0.377             0.364           0.328
WGAN-GP + CycleGAN        0.203        0.195            0.201             0.139           0.091
JointGAN (unpaired)       0.452        0.461            0.339             0.341           0.299

Figure 7. Generated paired samples from models trained on facades↔labels↔cityscapes. The images in the first row are sequentially generated from left to right, while those in the second row are generated from right to left. Both generations start from a noise vector.

Figure 8. Generated paired samples of caption features and images (all data synthesized). Top block: from generated images to caption features. Bottom block: from generated caption features to images.

Finally, to visualize the results, we use the pretrained LSTM decoder to decode the generated features back into captions. We employ StackGAN-stage-I (Zhang et al., 2017a) to generate images from caption features, while a CNN is utilized to generate caption features from images. Details are provided in Appendix D.

Qualitative Results  Figure 8 shows the qualitative results of JointGAN: (i) generating images from noise and then conditionally generating caption features, and (ii) generating caption features from noise and then conditionally generating images. The results show high-quality and diverse image generation, and a strongly coherent relationship between each pair of caption feature and image. This demonstrates the robustness of our model: it not only generates realistic multi-domain images but also handles well quite different data types, such as caption-feature and image pairs.

6. Conclusion

We propose JointGAN, a new framework for multi-domain joint distribution learning. The joint distribution is learned by decomposing it into the product of marginal and conditional distributions, each learned via adversarial training. JointGAN allows interesting applications since it provides the freedom to draw samples from various marginalized or conditional distributions. We consider joint analysis of two and three domains, and demonstrate that JointGAN achieves significantly better results than a two-step baseline model, both qualitatively and quantitatively.

Acknowledgements

This research was supported in part by DARPA, DOE, NIH, ONR and NSF.


References

Chen, L., Dai, S., Pu, Y., Zhou, E., Li, C., Su, Q., Chen, C., and Carin, L. Symmetric variational autoencoder and connections to adversarial learning. In AISTATS, 2018.

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.

Donahue, J., Krähenbühl, P., and Darrell, T. Adversarial feature learning. In ICLR, 2017.

Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., and Courville, A. Adversarially learned inference. In ICLR, 2017.

Gan, Z., Chen, L., Wang, W., Pu, Y., Zhang, Y., Liu, H., Li, C., and Carin, L. Triangle generative adversarial networks. In NIPS, 2017a.

Gan, Z., Pu, Y., Henao, R., Li, C., He, X., and Carin, L. Learning generic sentence representations using convolutional neural networks. In EMNLP, 2017b.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NIPS, 2014.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In NIPS, 2017.

Hu, Z., Yang, Z., Salakhutdinov, R., and Xing, E. P. On unifying deep generative models. arXiv preprint arXiv:1706.00550, 2017.

Huang, P.-S., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.

Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.

Kim, T., Cha, M., Kim, H., Lee, J., and Kim, J. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Larsen, A. B. L., Sønderby, S. K., Larochelle, H., and Winther, O. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.

Li, C., Liu, H., Chen, C., Pu, Y., Chen, L., Henao, R., and Carin, L. ALICE: Towards understanding adversarial learning for joint distribution matching. In NIPS, 2017a.

Li, C., Xu, K., Zhu, J., and Zhang, B. Triple generative adversarial nets. In NIPS, 2017b.

Liu, M.-Y. and Tuzel, O. Coupled generative adversarial networks. In NIPS, 2016.

Liu, M.-Y., Breuel, T., and Kautz, J. Unsupervised image-to-image translation networks. In NIPS, 2017.

Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

Mescheder, L., Nowozin, S., and Geiger, A. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In ICML, 2017.

Mirza, M. and Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

Perarnau, G., van de Weijer, J., Raducanu, B., and Álvarez, J. M. Invertible conditional GANs for image editing. arXiv preprint arXiv:1611.06355, 2016.

Pu, Y., Gan, Z., Henao, R., Yuan, X., Li, C., Stevens, A., and Carin, L. Variational autoencoder for deep learning of images, labels and captions. In NIPS, 2016.

Pu, Y., Wang, W., Henao, R., Chen, L., Gan, Z., Li, C., and Carin, L. Adversarial symmetric variational autoencoder. In NIPS, 2017.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. Generative adversarial text to image synthesis. In ICML, 2016.

Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.

Tao, C., Chen, L., Henao, R., Feng, J., and Carin, L. Chi-square generative adversarial network. In ICML, 2018.

Tylecek, R. and Šára, R. Spatial pattern templates for recognition of objects with regular structure. In GCPR, 2013.

Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., and Perona, P. Caltech-UCSD Birds 200. Technical report, California Institute of Technology, 2010.


Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., and He, X. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. arXiv preprint arXiv:1711.10485, 2017.

Yi, Z., Zhang, H., Tan, P., and Gong, M. DualGAN: Unsupervised dual learning for image-to-image translation. In CVPR, 2017.

Yu, A. and Grauman, K. Fine-grained visual comparisons with local learning. In CVPR, 2014.

Yu, L., Zhang, W., Wang, J., and Yu, Y. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.

Zhang, H., Xu, T., Li, H., Zhang, S., and Metaxas, D. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017a.

Zhang, Y., Gan, Z., Fan, K., Chen, Z., Henao, R., Shen, D., and Carin, L. Adversarial feature matching for text generation. In ICML, 2017b.

Zhu, J.-Y., Krähenbühl, P., Shechtman, E., and Efros, A. A. Generative visual manipulation on the natural image manifold. In ECCV, 2016.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In CVPR, 2017.