
Unsupervised Image Decomposition in Vector Layers

Othman Sbai¹,²   Camille Couprie¹   Mathieu Aubry²

¹ Facebook AI Research   ² LIGM (UMR 8049), École des Ponts, UPE

Abstract

Deep image generation is becoming a tool to enhance artists' and designers' creative potential. In this paper, we aim at making the generation process more structured and easier to interact with. Inspired by vector graphics systems, we propose a new deep image reconstruction paradigm where the outputs are composed from simple layers, defined by their color and a vector transparency mask. This presents a number of advantages compared to the commonly used convolutional network architectures. In particular, our layered decomposition allows simple user interaction, for example to update a given mask or change the color of a selected layer. From a compact code, our architecture also generates vector images with a virtually infinite resolution, the color at each point in an image being a parametric function of its coordinates. We validate the efficiency of our approach by comparing reconstructions with state-of-the-art baselines given similar memory resources on the CelebA and ImageNet datasets. Most importantly, we demonstrate several applications of our new image representation, obtained in an unsupervised manner, including editing, vectorization and image search.

1. Introduction

Deep image generation models demonstrate breathtaking and inspiring results, e.g. [52, 51, 26, 5], but usually offer limited control and little interpretability. It is indeed particularly challenging to learn end-to-end editable image decomposition without relying on either expensive user input or handcrafted image processing tools. In contrast, we introduce and explore a new deep image generation paradigm, which follows an approach similar to the one used in interactive design tools. We formulate image generation as the composition of successive layers, each associated with a single color. Rather than learning high resolution image generation, we produce a decomposition of the image into vector layers, which can easily be used to edit images at any resolution. We aim at enabling designers to easily build on the results of deep image generation methods, by editing layers

Figure 1: Our system learns in an unsupervised manner a decomposition of images as superimposed α-channel masks (top) that can be used for quick image editing (bottom).

individually, changing their characteristics, or intervening in the middle of the generation process.

Our approach is in line with the long-standing Computer Vision trend of looking for simplified and compact representations of the visual world. For example, in 1971 Binford [4] proposed to represent the 3D world using generalized cylinders, and in 1987 the seminal work of Biederman [3] aimed at explaining and interpreting the 3D world and images using geons, a family of simple parametric shapes. These ideas have recently been revisited using neural networks to represent a 3D shape using a set of blocks [44] or, more related to our approach, a set of parametric patches [18]. The idea of identifying elementary shapes and organizing them in layers has been successfully applied to model images [1, 22] and videos [45].


A classical texture generation method, the dead leaves model [29], which creates realistic textures by iteratively adding simple patterns, is particularly related to our work.

We build on this idea of composing layers of simple primitives to design a deep image generation method relying on two core ingredients. First, the learning of vector transparency masks as parametric continuous functions defined on the unit square. In practice, each function is computed by a network applied to 2D coordinates on a square grid, outputting a mask value at each pixel coordinate. Second, a mask blending module, which we use to iteratively build the image by superimposing a mask with a given color onto the previous generation. At each step of our generation process, a network predicts both the parameters and the color of one mask. Our final generated image is the result of blending a fixed number of colored masks. One of the advantages of this approach is that, unlike most existing deep generation setups where the output is of fixed size, our generations are vector images defined continuously, and thus have virtually infinite resolution. Another key aspect is that the generation process is easily interpretable, allowing simple user interaction.

To summarize, our main contribution is a new deep image generation paradigm which:

• builds images iteratively from masks corresponding to meaningful image regions, learned without any semantic supervision;

• is one of the first to generate vector images from a compact code;

• is useful for several applications, including image editing using generated masks, image vectorization, and image search in mask space.

Our code will be made available¹.

2. Related work

We begin this section by presenting relevant works on image vectorization, then focus on the most related unsupervised image generation strategies, and finally discuss applications of deep learning to image manipulation.

Vectorization. Many vector-based primitives have been proposed to allow shape editing, color editing and vector image processing, ranging from paths and polygons filled with uniform colors or linear and radial gradients [40, 12], to region-based partitioning using triangulation [8, 30, 10], parametric patches (Bézier patches) [47] or diffusion curves [36]. We note that, traditionally, image vectorization techniques were handcrafted using image smoothing and edge detectors. In contrast, our approach parametrizes the image using a function defined by a neural network.

¹ http://imagine.enpc.fr/~sbaio/pix2vec/

Differentiable image parametrizations with neural networks were first proposed by Stanley [43], who introduced Compositional Pattern Producing Networks (CPPNs): neural networks that map pixel coordinates to image colors. The architecture of the network determines the space of images that can be generated. Since CPPNs learn images as functions of pixel coordinates, they provide the ability to sample images at high resolution. The weights of the network can be optimized to reconstruct a single image [25], or sampled randomly, in which case each network results in abstract patterns [19]. In contrast with these approaches, we propose to learn the weights of this mapping network and condition it on an image feature so that it can generate any image without image-specific weight optimization. Similarly, recent works have modeled 2D and 3D shapes using parametric and implicit functions [18, 34, 37, 6]. While previous attempts to apply this idea to images have focused on directly generating images on simple datasets such as MNIST [20, 6], we obtain a layer decomposition allowing various applications such as image editing and retrieval on complex images.

Deep, unsupervised, sequential image generation. We now present deep unsupervised sequential approaches to image generation, the most related to our work. [41] uses a recurrent auto-encoder to reconstruct images iteratively, and employs a sparsity criterion to make sure that the image parts added at each iteration are simple. A second line of approaches [17, 11, 16] is designed in a VAE framework. Deep Recurrent Attentive Writer (DRAW) [17] frames a recurrent approach using reinforcement learning and a spatial attention mechanism to mimic human gestures. A potential application of DRAW arises in its extension to conceptual image compression [16], where a recurrent, convolutional and hierarchical architecture allows obtaining various levels of lossy compressed images. Attend, Infer, Repeat [11] models scenes by latent variables of object presence, content, and position. The parameters of presence and position are inferred by an RNN, and a VAE decodes the objects one at a time to reconstruct images. A third strategy for learning sequential generative models is to employ adversarial networks. Ganin et al. [13] employ adversarial training in a reinforcement learning context. Specifically, their method, dubbed SPIRAL, trains an agent to synthesize programs executed by a graphics engine to reconstruct images. The Layered Recursive GANs of [48] learn to generate foreground and background images that are stitched together using spatial transformer networks (STNs) to form a consistent image. Although presented in a generic way that generalizes to multiple steps, the experiments are limited to foreground and background separation, made possible by the definition of a prior on the size of the object contained in the image. In contrast, our method (i) does not rely on STNs; (ii) extends to tens of steps, as demonstrated in our experiments; (iii) relies on simple architectures and


losses, without the need for LSTMs or reinforcement learning.

Image manipulation. Some successful applications of deep learning to image manipulation have been demonstrated, but they are usually specialized and offer limited user interaction. Image colorization [49] and style transfer [14] are two popular examples. Most approaches that allow user interaction are supervised. Zhu et al. [50] integrate user constraints in the form of brush strokes in GAN generations. More recently, Park et al. [38] use semantic segmentation layouts and brush strokes to allow users to create new landscapes. In a similar vein, [2] locates sets of neurons related to different visual components of images, such as trees or artifacts, and allows their removal interactively. Approaches specialized in face editing, such as [42] and [39], demonstrate the large set of photo-realistic image manipulations that can be performed to enhance quality, for instance background removal or swapping, diverse stylization effects, changes of the depth of field of the background, etc. These approaches typically require precise label inputs from users, or training on heavily annotated datasets. Our approach provides an unsupervised alternative, with similar editing capacities.

3. Layered Vector Image Generation

We frame image generation as an alpha-blending composition of a sequence of layers, starting from a canvas I_0 of random uniform color. Given a fixed budget of T iterations, we iteratively blend T generated colored masks onto the canvas. In this section, we first present our new architecture for vector image generation, then the training loss, and finally discuss the advantages of our new architecture compared to existing approaches.

3.1. Architecture

The core idea of our approach is visualized in Fig. 2. At each iteration t ∈ {1, ..., T}, our model takes as input the concatenation of the target image I ∈ R^{3×W×H} and the current canvas I_{t−1}, and blends a colored mask onto the canvas, resulting in I_t:

I_t = g(I_{t−1}, I),   (1)

where g consists of:

(i) a Residual Network (ResNet) that predicts mask parameters p_t ∈ R^P, together with the corresponding color triplet c_t ∈ R^3,

(ii) a mask generator module f, which generates an alpha-blending mask M_t from the parameters p_t, and

(iii) our mask blending module, which blends the mask M_t with its color c_t onto the previous canvas I_{t−1}.

We represent the function f generating the mask M_t from p_t as a standard Multi-Layer Perceptron (MLP), which takes as input the concatenation of the mask parameters p_t and the two spatial coordinates (x, y) of a point in image space. This MLP f defines the continuous 2D function of the mask M_t by:

M_t(x, y) = f(x, y, p_t).   (2)

In practice, we evaluate the mask at discrete spatial locations corresponding to the desired resolution to produce a discrete image. We then update I_t at each spatial location (x, y) using the following blending:

I_t(x, y) = I_{t−1}(x, y) · (1 − M_t(x, y)) + c_t · M_t(x, y),   (3)

where I_t(x, y) ∈ R^3 is the RGB value of the resulting image I_t at position (x, y). We note that, at test time, we may perform a different number of iterations N than the number T used during training. Choosing N > T may help to accurately model images that contain complex patterns, as we show in our experiments.
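To make the generation step concrete, the following PyTorch sketch implements Eqs. (1)-(3) for one iteration under stated assumptions: param_net stands for the ResNet predicting (p_t, c_t) and mask_mlp for any network mapping (x, y, p_t) to a value in [0, 1]; the names and tensor layout are illustrative, not the authors' released code.

```python
# Minimal sketch of one generation step (Eqs. 1-3); names and shapes are illustrative.
import torch

def coordinate_grid(height, width, device=None):
    """Return a (H*W, 2) tensor of (x, y) coordinates on the unit square."""
    ys, xs = torch.meshgrid(
        torch.linspace(0.0, 1.0, height, device=device),
        torch.linspace(0.0, 1.0, width, device=device),
        indexing="ij",
    )
    return torch.stack([xs.flatten(), ys.flatten()], dim=1)

def generation_step(param_net, mask_mlp, canvas, target):
    """One iteration: predict (p_t, c_t), rasterize the mask M_t on a grid, blend it onto the canvas."""
    b, _, h, w = canvas.shape
    # param_net is assumed to take the 6-channel concatenation and return (p_t, c_t).
    p_t, c_t = param_net(torch.cat([canvas, target], dim=1))     # (B, P), (B, 3)

    grid = coordinate_grid(h, w, device=canvas.device)           # (H*W, 2)
    grid = grid.unsqueeze(0).expand(b, -1, -1)                   # (B, H*W, 2)
    p_rep = p_t.unsqueeze(1).expand(-1, h * w, -1)               # (B, H*W, P)
    mask = mask_mlp(torch.cat([grid, p_rep], dim=2))             # (B, H*W, 1), values in [0, 1]
    mask = mask.view(b, 1, h, w)

    color = c_t.view(b, 3, 1, 1)
    return canvas * (1.0 - mask) + color * mask                  # Eq. (3)
```

Because the mask is sampled from a coordinate grid, the same step can be rasterized at any resolution simply by changing h and w.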

All the design choices of our approach are justified in detail in Section 3.3 and supported empirically by experiments and ablations in Section 4.3.

3.2. Training losses

We learn the weights of our network end-to-end by minimizing a reconstruction loss between the target I and our result R = I_T. We perform experiments either using an ℓ1 loss, which enables simple quantitative comparisons, or a perceptual loss [24], leading to visually improved results. Our perceptual loss L_perc is based on the Euclidean norm ‖·‖_2 between feature maps φ(·) extracted from a pre-trained VGG16 network and the Frobenius norm between the Gram matrices G(φ(·)) obtained from these feature maps:

L_perc = L_content + λ L_style,

where

L_content(I, R) = ‖φ(I) − φ(R)‖_2,
L_style(I, R) = ‖G(φ(I)) − G(φ(R))‖_F,

and λ is a non-negative scalar that controls the relative influence of the style loss. To obtain even sharper results, we may optionally add an adversarial loss. In this case, a discriminator D is trained to distinguish real images from generated ones, and we optimize our generator G to fool this discriminator. We train D to minimize the non-saturating GAN loss from [15] with the R1 gradient penalty [33]. The architecture of D is the patch discriminator defined in [23].
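A hedged PyTorch sketch of such a perceptual loss is given below. The choice of VGG16 layer (features up to index 16, roughly relu3_3) and the expectation of ImageNet-normalized inputs are assumptions; the paper does not specify them here.

```python
# Sketch of L_perc = L_content + lambda * L_style with VGG16 features and Gram matrices.
import torch
import torchvision

class PerceptualLoss(torch.nn.Module):
    def __init__(self, lam=1.0, layer=16):
        super().__init__()
        # Frozen, pre-trained VGG16 feature extractor (layer index is an assumption).
        self.vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:layer].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.lam = lam

    @staticmethod
    def gram(feat):
        # Gram matrix of a (B, C, H, W) feature map, normalized by its size.
        b, c, h, w = feat.shape
        f = feat.view(b, c, h * w)
        return f @ f.transpose(1, 2) / (c * h * w)

    def forward(self, target, recon):
        # Inputs are expected to be ImageNet-normalized 3-channel images.
        phi_t, phi_r = self.vgg(target), self.vgg(recon)
        content = torch.norm(phi_t - phi_r, p=2)                          # Euclidean norm
        style = torch.norm(self.gram(phi_t) - self.gram(phi_r), p="fro")  # Frobenius norm
        return content + self.lam * style
```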



Figure 2: Our iterative generation pipeline for image reconstruction of a target I. The previous canvas I_{t−1} (I_0 can be initialized to a random uniform color) is concatenated with I and forwarded through a ResNet feature extractor to obtain a color c_t and mask parameters p_t. A Multi-Layer Perceptron f generates a parametric mask M_t from the pixelwise coordinates of a 2D grid and the mask parameters p_t. Our mask blending module (in green) finally blends this mask with its corresponding color onto the previous output I_{t−1}.

3.3. Discussion

Architecture choices. Our architecture choices are related to desirable properties of the final generation model:

Layered decomposition: This choice allows us to obtain a mask decomposition, which is a key component of image editing pipelines. Defining one color per layer, similar to image simplification and quantization approaches, is important to obtain visually coherent regions. We further show that a single-layer baseline does not perform as well.

Vectorized layers: By using a lattice input for the mask generator, it is possible to perform local image editing and generation at any resolution without introducing upsampling artifacts or changing our model architecture. This vector mask representation is especially convenient for HD image editing.

Recursive vs. one-shot: We generate the mask parameters recursively to allow the model to better take into account the interaction between the different masks. We show that a one-shot baseline, where all the mask parameters are predicted in a single pass, leads to worse results. Moreover, as mentioned above and demonstrated in the experiments, our recursive procedure can be applied a larger number of times to model more complex images.

Number of layers vs. size of the mask parameter. Our mask blending module iteratively adds colored masks to the canvas to compose the final image. The size of the mask parameter p controls the complexity of the possible mask shapes, while the number of masks controls the number of different shapes that can be used to compose the image. Since we aim at producing a set of layers that can easily be used and interpreted by a human, we use a limited number of strokes and masks.

Complexity of the mask generator network. Interestingly, if the network generating the masks from the parameters were very large, it could generate very complex patterns. In fact, one could show using the universal approximation theorem [7, 21] that, with a large number of hidden units in the MLP f, an image could be approximated with only three layers (N = 3) of our generation process, using one mask for each color channel. Thus it is important to control the complexity of f to obtain meaningful primitive shapes. For example, we found that replacing our MLP with a ResNet leads to less interpretable masks (see Section 4.3 and Fig. 10).

4. Experiments

In this section, we first introduce the datasets and the training and network architecture details, then demonstrate the practical interest of our approach in several applications, and justify the architecture choices in extensive ablation studies.

4.1. Datasets and implementation details

Datasets. Our models are trained on two datasets, CelebA [31] (202k images of celebrity faces) and ImageNet [9] (1.28M natural images of 1000 classes), using images downsampled to 128 × 128.

Training details. The parameters of our generator g are optimized using Adam [27] with a learning rate of 2 × 10^{−4}, β_1 = 0.9 and no weight decay. The batch size is set to 32 and the training image size is fixed to 128 × 128 pixels.
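As an illustration of this setup, the sketch below wires the stated hyper-parameters to the generation_step function from the earlier sketch; the ℓ1 reconstruction loss and the T = 10 unrolled iterations match the experiments reported later, while the variable names remain assumptions.

```python
# Illustrative training step: unroll T blending iterations, then optimize the reconstruction loss.
import torch

def train_step(param_net, mask_mlp, optimizer, target, T=10):
    recon_loss = torch.nn.functional.l1_loss
    b, _, h, w = target.shape
    # I_0: a canvas filled with one random uniform color per image.
    canvas = torch.rand(b, 3, 1, 1, device=target.device).expand(b, 3, h, w)
    for _ in range(T):
        canvas = generation_step(param_net, mask_mlp, canvas, target)  # Eq. (1)
    loss = recon_loss(canvas, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Optimizer configured as stated in the text (Adam, lr 2e-4, beta1 = 0.9, no weight decay):
# optimizer = torch.optim.Adam(list(param_net.parameters()) + list(mask_mlp.parameters()),
#                              lr=2e-4, betas=(0.9, 0.999), weight_decay=0.0)
```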


Figure 3: Our editing interface using automatically extracted masks to bootstrap the editing process.

Figure 4: Example edits on CelebA and ImageNet, obtained with little supervision (mask selection in one click and new style/color selection). Note that the CelebA edits are performed on 1024 × 1024 images. Left: original; center: mask; right: edit.

Network architectures. The mask generator f consists of an MLP with three hidden layers of 128 units with group normalization [46], tanh non-linearities, and an additional sigmoid after the last layer. f takes as input a parameter vector p and pixel coordinates (x, y), and outputs a value between 0 and 1. The parameter p and the color c are predicted by a ResNet-18 network. Further details about the network architectures are in the supplementary material.
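A possible PyTorch instantiation of this mask generator is sketched below; the number of normalization groups (8) is an assumption not specified in the text, and the exact layer ordering may differ from the authors' implementation.

```python
# Sketch of the mask generator f: 3 hidden layers of 128 units, GroupNorm, tanh, final sigmoid.
import torch
import torch.nn as nn

class MaskMLP(nn.Module):
    def __init__(self, param_dim, hidden=128, groups=8):
        super().__init__()
        dims = [param_dim + 2] + [hidden] * 3           # input: (x, y) concatenated with p_t
        blocks = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.GroupNorm(groups, d_out), nn.Tanh()]
        blocks += [nn.Linear(hidden, 1), nn.Sigmoid()]  # alpha value in [0, 1]
        self.net = nn.Sequential(*blocks)

    def forward(self, x):
        # x: (..., 2 + param_dim); flatten leading dims so GroupNorm sees (N, C) inputs.
        lead = x.shape[:-1]
        out = self.net(x.reshape(-1, x.shape[-1]))
        return out.reshape(*lead, 1)
```

An instance of this module can play the role of mask_mlp in the earlier sketches.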

4.2. Applications

We now demonstrate how our image decomposition may serve different purposes such as image editing, retrieval and vectorization.

Image editing. Image editing from raw pixels can be time-consuming. Using our generated masks, it is possible to alter the original image by applying edits such as luminosity or color modifications on the region specified by a mask. Fig. 3 shows an interface we designed for such editing, showing the masks corresponding to the image. It avoids going through the tedious process of defining a blending mask manually. The learned masks capture the main components of the image, such as the background, face, hair, and lips. Fig. 4 demonstrates a variety of edits we performed and the associated masks. Our approach works well on the CelebA dataset, and allows making simple image modifications on the more challenging ImageNet images. To optimize our results on ImageNet, the edits of Fig. 4 are obtained by finetuning our model on images of each object class.


Figure 5: t-SNE visualization of masks obtained from 5000 reconstructions on CelebA.

Attribute-based image retrieval. A t-SNE [32] visualization of the mask parameters obtained on CelebA is shown in Fig. 5. Different clusters of masks are clearly visible, for backgrounds, hair, face shadows, etc. This experiment highlights the fact that our approach naturally extracts semantic components of face images.

Our approach may also be used in an image search context: given a query image, a user can select a mask that displays a particular attribute of interest and search for images whose decomposition includes similar masks. Suppose we would like to retrieve pictures of people wearing a hat as displayed in a query image; we can easily extract the mask that corresponds to the hat in our decomposition, together with its parameters. Nearest neighbors for different masks, computed using cosine similarity between mask parameters p, are provided in Fig. 6. Note how different masks extracted from the same query image lead to very different retrieved images. Such a strategy could potentially be used for efficient image annotation or few-shot learning. We evaluated one-shot nearest-neighbor classification for the "Wearing Hat" and "Eyeglasses" categories in CelebA using the hat and glasses examples shown in Fig. 6, and obtained respectively 34% and 49% average precision. Results for the eyeglasses attribute were especially impressive, with 33% recall at 98% precision, compared to a low recall (less than 10% at 98% precision) for a baseline using cosine distance between features of a ResNet-18 trained on ImageNet.
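The retrieval itself reduces to a cosine-similarity ranking in mask-parameter space; the sketch below illustrates it under the assumption that the T mask parameter vectors of every database image have been precomputed (tensor names and shapes are ours).

```python
# Sketch of attribute-based retrieval by cosine similarity in mask-parameter space.
import torch
import torch.nn.functional as F

def retrieve(query_p, database_p, k=5):
    """query_p: (P,) parameters of the mask selected on the query image.
    database_p: (N_images, T, P) mask parameters of each database image.
    Returns the indices of the k images whose best-matching mask is most similar."""
    sims = F.cosine_similarity(database_p, query_p.view(1, 1, -1), dim=-1)  # (N_images, T)
    best_per_image, _ = sims.max(dim=1)   # keep the best-matching mask of each image
    return best_per_image.topk(k).indices
```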

Vector image generation. Producing vectorized images is often essential for design applications. We demonstrate in Fig. 7(a) the potential of our approach for producing a continuous vector image from a low-resolution bitmap. Here, we train our network on the MNIST dataset (28 × 28), but generate the output at resolution 1024 × 1024. Compared to bilinear interpolation, the image we generate presents fewer artifacts.
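Since each mask is a continuous function of (x, y), rendering at a higher resolution only requires evaluating the same MLP on a denser coordinate grid; the sketch below reuses coordinate_grid and the mask generator from the earlier sketches, with the 1024 × 1024 output size as an example.

```python
# Sketch: render a decomposition (mask parameters + colors) at an arbitrary resolution.
import torch

@torch.no_grad()
def render(mask_mlp, params, colors, height=1024, width=1024):
    """params: (T, P) mask parameters; colors: (T, 3). Returns a (3, H, W) image."""
    grid = coordinate_grid(height, width)                            # (H*W, 2)
    # Start from a canvas of one random uniform color, as during training.
    canvas = torch.rand(3, 1, 1).expand(3, height, width).clone()
    for p_t, c_t in zip(params, colors):
        inp = torch.cat([grid, p_t.expand(grid.shape[0], -1)], dim=1)
        mask = mask_mlp(inp).view(1, height, width)                  # alpha mask in [0, 1]
        canvas = canvas * (1.0 - mask) + c_t.view(3, 1, 1) * mask    # Eq. (3)
    return canvas
```

For very large outputs, the coordinate grid can be evaluated in chunks to bound memory usage.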

Figure 6: Given a target image and a mask of an area of interest extracted from it, a nearest neighbor search in the learned mask parameter space allows the retrieval of images sharing the desired attribute with the target. Columns: Hat, Shirt collar, Background text, Glasses, Face direction, Lipstick.

We finally compare our model with SPIRAL [13] on a few images from the CelebA dataset published in [13]. SPIRAL is the approach most closely related to ours, in the sense that it is an iterative deep approach for reconstructing an image and extracting its structure using only a few color strokes, and that it can produce vector results. We report SPIRAL results using 20-step episodes. In each episode, a tuple of 8 discrete decisions is estimated, resulting in a total of 160 parameters for reconstruction. Our results shown in Fig. 7(b) are obtained with a model using 10 iterations and 10 mask parameters. Although we do not reproduce the stroke gesture for drawing each mask, as is the case in SPIRAL, our results reconstruct the original images much better.

4.3. Architecture and training choices

ℓ1, perceptual and adversarial loss. In Fig. 8, we show how the perceptual loss allows obtaining qualitatively better reconstructions than those obtained with an ℓ1 loss.


Figure 7: Our model learns a vectorized mask representation that can be generated at any resolution without interpolation artifacts. (a) Vectorization: reconstructions of MNIST images (columns: Original, Bilinear, Ours). (b) Comparison with SPIRAL [13] on CelebA (columns: Target, L1, Perceptual, SPIRAL [13]).

Training our model with an additional adversarial loss further enhances the sharpness of the reconstructions.

Figure 8: Training with perceptual and adversarial losses allows our model to reach more convincing details in the reconstructions. From top to bottom: original images, ℓ1 reconstruction, with perceptual loss, with added adversarial loss.

In the remainder of this section, we train our models with an ℓ1 loss, which results in easier quality assessment using standard image similarity metrics.

Comparison to baselines. As discussed in Section 3.3, every component of our model is important to obtain reconstructions similar to the target. To show this, we provide comparisons between different versions of our model and baselines, using PSNR and MS-SSIM metrics.

                  Layered   Recursive   Vectorized
A. Resnet AE         -          -           -
B. MLP AE            -          -           -
C. Vect. AE          -          -           X
D. Ours Oneshot      X          -           X
E. Ours Resnet       X          X           -
F. Ours              X          X           X

Figure 9: Considered baselines and their properties. The first three baselines are non-iterative. The next two are ablations of each property we consider in our final model F. Legend: R: reconstructed image; BM: mask blending module of Fig. 2.

Each baseline consists of an auto-encoder where the encoder is a residual network (ResNet-18, the same as in our model) producing a latent code z, combined with different types of decoders. The different baselines, depicted in Fig. 9 with a summary of their properties, are designed to validate each component of our architecture:

A. ResNet AE: using as a decoder a ResNet with convolutions, residual connections, and upsampling, similar to the architecture used in [35, 28].

B. MLP AE: using as decoder an MLP with a 3 × W × H output.

C. Vect. AE: the decoder computes the resulting image R as a function f of the coordinates (x, y) of a pixel in image space and the latent code z, as R(x, y) = f(x, y, z). Here f is an MLP similar to the one used in our mask generation network, but with a 3-channel output instead of the 1-channel output used for the mask.


D. Ours One-shot: generates all the mask parameters p_t and colors c_t in one pass, instead of recursively. The MLP then processes each p_t separately, leading to different masks that are assembled in the blending module as in our approach.

E. Ours ResNet: using a ResNet decoder to generate the masks M_t, and otherwise similar to our method, iteratively blending each mask with one color onto the canvas; in our experiments we started with a black canvas.

Table 1 shows a quantitative comparison of the results obtained by our model and the baselines trained with an ℓ1 loss and the same bottleneck size |z| = 320. This corresponds to a mask parameter size of P = |z|/N − 3, where 3 is the number of parameters used for color prediction (for instance, with N = 10 masks, P = 320/10 − 3 = 29). On both datasets, our approach (F) clearly outperforms the baselines that produce vector outputs, either in one layer (C) or with one-shot parameter prediction (D). Interestingly, a parametric generation (C) is itself better than directly using an MLP to predict pixel values (B). Finally, our approach (F) has quantitative reconstruction results similar to the ResNet baselines (A and E).

However, our method has two strong advantages over ResNet generations. First, it produces vector outputs. Second, it produces more interpretable masks. This can be seen in Fig. 10, where we compare the masks resulting from (E) and (F). Our method (F) captures the different components of face images much better, notably the hair, while the masks of (E) include several different components of the image, with a first mask covering both hair and face. In the supplementary material, we show that a qualitatively similar difference can be observed for our method when reducing the number of masks while increasing their number of parameters to keep a constant total code size.

                      ImageNet             CelebA
                   PSNR   MS-SSIM      PSNR   MS-SSIM
A. MLP AE          16.45    0.46       19.69    0.78
C. Vect. AE        17.95    0.62       20.99    0.82
D. Ours One-shot   20.00    0.77       23.13    0.89
E. Ours Resnet     21.05    0.82       24.67    0.92
F. Ours            21.03    0.82       24.02    0.90

Table 1: Comparing the quality of reconstruction on ImageNet and CelebA, using a bottleneck z of size 320 (10 masks for iterative approaches).

Recursive setup and computational cost. There is of course a computational cost to our recursive approach. In Table 2, we compare the PSNR and computation time for the same total number of parameters (320) but using different numbers of masks T, both for our approach and the one-shot baseline.

Figure 10: Comparison of masks obtained with our approach (F) (bottom) with those obtained by our iterative ResNet baseline (E) (top).

                         T = 5               T = 10              T = 20
                    One-shot   Ours      One-shot   Ours      One-shot   Ours
PSNR                  21.97    23.07       22.25    24.2        22.37    24
Time (h) to
95% PSNR               7.6      9.8        12.1     16.9        19.8     36.5
Testing time (ms)       12       32          18       65          31     129

Table 2: Comparison of our recursive strategy with the one-shot approach, in terms of reconstruction quality (PSNR) and training time required to reach 95% of the best achievable PSNR at full convergence on CelebA. The inference time does not exceed 0.2 seconds.

Interestingly, the quality of the reconstruction improves with the number of masks for both approaches, ours being consistently more than one PSNR point better than the one-shot baseline. However, as expected, our approach is slower than the baseline both for training and testing, and the cost increases with the number of masks.

Table 3 evaluates the reconstruction quality when recomposing images at test time with a larger number of masks N than the T = 10 masks used at training time. On both datasets, the PSNR increases by almost a point with additional masks. This is another advantage of our recursive approach.

                   ImageNet                          CelebA
N         10      20      40      80        10      20      40      80
PSNR    20.97   21.72   21.83   21.84     24.02   24.82   24.86   24.86

Table 3: Forwarding more masks at test time improves reconstruction. These results with N forwarded masks are obtained with a model trained using T = 10 masks (|z| = 320).

5. Conclusion

We have presented a new paradigm for image reconstruction using a succession of single-color parametric layers. This decomposition, learned without supervision, enables image editing from the masks generated for each layer.

We also show how the learned mask parameters may be exploited in a retrieval context. Moreover, our experiments show that our image reconstruction results are competitive


with convolution-based auto-encoders. Our work is the first to showcase the potential of a deep vector image architecture for real-world applications. Furthermore, while our model is introduced in an image reconstruction setting, it may be extended to adversarial image generation, where generating high-resolution images is challenging. We are aware of the risks surrounding manipulated media, but we believe that publishing this work openly may have benefits, for example in AR filters or more realistic VR.

We think that, because of its differences and its advantages for user interaction, our method will inspire new approaches.

References

[1] E. Adelson. Layered representations for image coding. Vision and Modeling Group, Media Laboratory, Massachusetts Institute of Technology, 1991. 1

[2] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv:1811.10597, 2018. 3

[3] I. Biederman. Recognition-by-components: a theory of human image understanding. Psychological Review, 1987. 1

[4] I. Binford. Visual perception by computer. In IEEE Conference of Systems and Control, 1971. 1

[5] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019. 1

[6] Z. Chen and H. Zhang. Learning implicit fields for generative shape modeling. In CVPR, 2019. 2

[7] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 1989. 4

[8] L. Demaret, N. Dyn, and A. Iske. Image compression by linear splines over adaptive triangulations. Signal Processing, 2006. 2

[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 4

[10] L. Duan and F. Lafarge. Image partitioning into convex polygons. In CVPR, pages 3119–3127, 2015. 2

[11] S. M. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, K. Kavukcuoglu, and G. Hinton. Attend, infer, repeat: Fast scene understanding with generative models. In NIPS, 2016. 2

[12] J.-D. Favreau, F. Lafarge, and A. Bousseau. Photo2clipart: Image abstraction and vectorization using layered linear gradients. ACM Transactions on Graphics (TOG), 2017. 2

[13] Y. Ganin, T. Kulkarni, I. Babuschkin, S. Eslami, and O. Vinyals. Synthesizing programs for images using reinforced adversarial learning. arXiv:1804.01118, 2018. 2, 6, 7

[14] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016. 3

[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014. 3

[16] K. Gregor, F. Besse, D. J. Rezende, I. Danihelka, and D. Wierstra. Towards conceptual compression. In NIPS, 2016. 2

[17] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv:1502.04623, 2015. 2

[18] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry. A papier-mâché approach to learning 3D surface generation. In CVPR, 2018. 1, 2

[19] D. Ha. Generating abstract patterns with TensorFlow. blog.otoro.net, 2016. 2

[20] D. Ha. Generating large images from latent vectors. blog.otoro.net, 2016. 2

[21] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 1991. 4

[22] P. Isola and C. Liu. Scene collaging: Analysis and synthesis of natural images with semantic layers. In ICCV, 2013. 1

[23] P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017. 3

[24] J. Johnson, A. Alahi, and F. Li. Perceptual losses for real-time style transfer and super-resolution. arXiv:1603.08155, 2016. 3

[25] A. Karpathy. Image 'painting'. cs.stanford.edu. 2

[26] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019. 1

[27] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014. 4

[28] K. Kurach, M. Lucic, X. Zhai, M. Michalski, and S. Gelly. The GAN landscape: Losses, architectures, regularization, and normalization, 2018. 7

[29] A. B. Lee, D. Mumford, and J. Huang. Occlusion models for natural images: A statistical study of a scale-invariant dead leaves model. International Journal of Computer Vision, 2001. 2

[30] Z. Liao, H. Hoppe, D. Forsyth, and Y. Yu. A subdivision-based representation for vector image editing. IEEE Transactions on Visualization and Computer Graphics, 2012. 2

[31] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015. 4

[32] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008. 6

[33] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? arXiv:1801.04406, 2018. 3

[34] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy networks: Learning 3D reconstruction in function space. In CVPR, 2019. 2

[35] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. arXiv:1802.05957, 2018. 7


[36] A. Orzan, A. Bousseau, H. Winnemöller, P. Barla, J. Thollot, and D. Salesin. Diffusion curves: A vector representation for smooth-shaded images. In ACM Transactions on Graphics (TOG), 2008. 2

[37] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. arXiv:1901.05103, 2019. 2

[38] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019. 3

[39] T. Portenier, Q. Hu, A. Szabo, S. A. Bigdeli, P. Favaro, and M. Zwicker. FaceShop: Deep sketch-based face image editing. ACM Transactions on Graphics (TOG), 2018. 3

[40] C. Richardt, J. Lopez-Moreno, A. Bousseau, M. Agrawala, and G. Drettakis. Vectorising bitmaps into semi-transparent gradient layers. In Computer Graphics Forum, 2014. 2

[41] J. T. Rolfe and Y. LeCun. Discriminative recurrent sparse auto-encoders. ICLR, 2013. 2

[42] X. Shen, A. Hertzmann, J. Jia, S. Paris, B. Price, E. Shechtman, and I. Sachs. Automatic portrait segmentation for image stylization. In Computer Graphics Forum, 2016. 3

[43] K. O. Stanley. Compositional pattern producing networks: A novel abstraction of development. Genetic Programming and Evolvable Machines, 2007. 2

[44] S. Tulsiani, H. Su, L. J. Guibas, A. A. Efros, and J. Malik. Learning shape abstractions by assembling volumetric primitives. In CVPR, 2017. 1

[45] J. Wang and E. Adelson. Representing moving images with layers. IEEE Transactions on Image Processing, 1994. 1

[46] Y. Wu and K. He. Group normalization. arXiv:1803.08494, 2018. 5

[47] T. Xia, B. Liao, and Y. Yu. Patch-based image vectorization with automatic curvilinear feature alignment. ACM Transactions on Graphics (TOG), 2009. 2

[48] J. Yang, A. Kannan, D. Batra, and D. Parikh. LR-GAN: Layered recursive generative adversarial networks for image generation. ICLR, 2017. 2

[49] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016. 3

[50] J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. ECCV, 2016. 3

[51] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. NIPS, 2017. 1

[52] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv:1703.10593, 2017. 1