# Swapping Autoencoder for Deep Image Manipulation

Taesung Park¹² Jun-Yan Zhu² Oliver Wang² Jingwan Lu² Eli Shechtman² Alexei A. Efros¹² Richard Zhang²
¹UC Berkeley ²Adobe Research

Deep generative models have become increasingly effective at producing realistic images from randomly sampled seeds, but using such models for controllable manipulation of existing images remains challenging. We propose the Swapping Autoencoder, a deep model designed specifically for image manipulation, rather than random sampling. The key idea is to encode an image into two independent components and enforce that any swapped combination maps to a realistic image. In particular, we encourage the components to represent structure and texture, by enforcing one component to encode co-occurrent patch statistics across different parts of the image. As our method is trained with an encoder, finding the latent codes for a new input image becomes trivial, rather than cumbersome. As a result, our method enables us to manipulate real input images in various ways, including texture swapping, local and global editing, and latent code vector arithmetic. Experiments on multiple datasets show that our model produces better results and is substantially more efficient compared to recent generative models.

Figure 1: Our Swapping Autoencoder learns to disentangle texture from structure for image editing tasks. One such task is texture swapping, shown here. Please see our project webpage for a demo video of our editing method.

1 Introduction

Traditional photo-editing tools, such as Photoshop, operate solely within the confines of the input image, i.e., they can only recycle the pixels that are already there. The promise of using machine learning for image manipulation has been to incorporate generic visual knowledge drawn from external visual datasets into the editing process. The aim is to enable a new class of editing operations, such as inpainting large image regions [60, 81, 55], synthesizing photorealistic images from layouts [33, 73, 59], replacing objects [88, 28], or changing the time a photo was taken [41, 2]. However, learning-driven image manipulation brings its own challenges. For image editing, there is a fundamental conflict: what information should be gleaned from the dataset, and what must be retained from the input image? If the output image relies too much on the dataset, it will retain no resemblance to the input and can hardly be called editing, whereas relying too much on the input lessens the value of the dataset. This conflict can be viewed as a disentanglement problem. Starting from image pixels, one needs to factor out the visual information that is specific to a given image from information that is applicable across different images of the dataset. Indeed, many existing works on learning-based image manipulation, though not always explicitly framed as learning disentanglement, end up doing so, using paired supervision [70, 33, 73, 59], domain supervision [88, 30, 56, 2], or the inductive bias of the model architecture [1, 21]. In our work, we aim to discover a disentanglement suitable for image editing in an unsupervised setting. We argue that it is natural to explicitly factor out the visual patterns within the image that must change consistently with respect to each other.
We operationalize this by learning an autoencoder with two modular latent codes, one to capture the within-image visual patterns, and another to capture the rest of the information. We enforce that any arbitrary combination of these codes maps to a realistic image. To disentangle these two factors, we ensure that all image patches with the same within-image code appear coherent with each other. Interestingly, this coincides with the classic definition of visual texture in a line of work started by Julesz [38, 40, 39, 64, 24, 17, 54]. The second code captures the remaining information, coinciding with structure. As such, we refer to the two codes as texture and structure codes.

A natural question to ask is: why not simply use unconditional GANs [19], which have been shown to disentangle style and content in unsupervised settings [43, 44, 21]? The short answer is that these methods do not work well for editing existing images. Unconditional GANs learn a mapping from an easy-to-sample (typically Gaussian) distribution. Some methods [4, 1, 44] have been suggested to retrofit pre-trained unconditional GAN models to find the latent vector that reproduces the input image, but we show that these methods are inaccurate and orders of magnitude slower than our method. Conditional GAN models [33, 88, 30, 59] address this problem by starting with input images, but they require the task to be defined a priori. In contrast, our model learns an embedding space that is useful for image manipulation in several downstream tasks, including synthesizing new image hybrids (see Figure 1), smooth manipulation of attributes or domain transfer by traversing latent directions (Figure 7), and local manipulation of the scene structure (Figure 8).

To show the effectiveness of our method, we evaluate it on multiple datasets, such as LSUN Churches and Bedrooms [80], Flickr-Faces-HQ [43], and newly collected datasets of mountains and waterfalls, using both automatic metrics and human perceptual judgments. We also present an interactive UI (please see the video on our project webpage) that showcases the advantages of our method.

2 Related Work

Conditional generative models, such as image-to-image translation [33, 88], learn to directly synthesize an output image given a user input. Many applications have been successfully built with this framework, including image inpainting [60, 32, 77, 55], photo colorization [83, 50, 85, 23], texture and geometry synthesis [86, 20, 75], sketch2photo [66], and semantic image synthesis and editing [73, 63, 10, 59]. Recent methods extend this framework to multi-domain and multi-modal settings [30, 89, 56, 82, 12]. However, it is challenging to apply such methods to on-the-fly image manipulation, because for each new application and type of user input, a new model needs to be trained. We present a framework for both image synthesis and manipulation, in which the task can be defined by one or a small number of examples at run-time. While recent works [67, 68] propose to learn single-image GANs for image editing, our model can be quickly applied to a test image without the extensive computation of single-image training.

Deep image editing via latent space exploration modifies the latent vector of a pre-trained, unconditional generative model (e.g., a GAN [19]) according to the desired user edits. For example, iGAN [87] obtains the latent code using an encoder-based initialization followed by quasi-Newton optimization, and updates the code according to new user constraints.
Similar ideas have been explored in other tasks such as image inpainting, face editing, and deblurring [8, 61, 78, 3]. More recently, instead of using the input latent space, GANPaint [4] adapts layers of a pre-trained GAN for each input image and updates the layers according to a user's semantic control [5]. Image2StyleGAN [1] and StyleGAN2 [44] reconstruct the image using an extended embedding space and noise vectors. Our work differs in that we allow the code space to be learned rather than sampled from a fixed distribution, making it much more flexible. In addition, we train an encoder together with the generator, which allows for significantly faster reconstruction.

Disentanglement of content and style in generative models. Deep generative models learn to model the data distribution of natural images [65, 19, 47, 13, 11, 76], and many of them aim to represent content and style as independently controllable factors [43, 44, 45, 15, 74]. Of special relevance to our work are models that use code swapping during training [58, 29, 36, 69, 15, 43]. Our work differs from them in several aspects. First, while most require human supervision, such as class labels [58], pairwise image similarity [36], image pairs with the same appearance [15], or object locations [69], our method is fully unsupervised. Second, our decomposable structure and texture codes allow each factor to be extracted from input images to control different aspects of the image, and produce higher-quality results when mixed. Note that for our application, image quality and flexible control are critically important, as we focus on image manipulation rather than unsupervised feature learning. Recent image-to-image translation methods also use code swapping but require ground-truth domain labels [49, 51, 53]. In concurrent work, Anokhin et al. [2] and ALAE [62] propose models very close to our code-swapping scheme for image editing purposes.

Style transfer. Modeling style and content is a classic computer vision and graphics problem [70, 25]. Several recent works revisit the topic using modern neural networks [18, 37, 71, 9], measuring content with a perceptual distance [18, 14] and style with global texture statistics, e.g., a Gram matrix. These methods can transfer low-level styles such as brush strokes, but often fail to capture larger-scale semantic structures. Photorealistic style transfer methods further constrain the result to be represented by local affine color transforms of the input image [57, 52, 79], but such methods only allow local color changes. In contrast, our learned decomposition can transfer semantically meaningful structure, such as the architectural details of a church, as well as perform other image editing operations.

3 Swapping Autoencoder

What is the desired representation for image editing? We argue that such a representation should be able to reconstruct the input image easily and precisely. Each code in the representation should be independently modifiable, such that the resulting image both looks realistic and reflects the unmodified codes. The representation should also support both global and local image editing. To achieve these goals, we train a swapping autoencoder (shown in Figure 2) consisting of an encoder E and a generator G, with the core objectives of 1) accurately reconstructing an image, 2) learning independent components that can be mixed to create a new hybrid image, and 3) disentangling texture from structure by using a patch discriminator that learns co-occurrence statistics of image patches.
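To make the shapes of the two codes concrete, below is a minimal PyTorch-style sketch of an encoder that produces a spatial structure code and a globally pooled texture vector, mirroring the design described later in Sections 3.2 and 3.4. It is an illustrative approximation, not the authors' implementation: the channel widths, the number of structure channels, and the use of plain strided convolutions in place of the downsampling ResNet blocks are assumptions made for the example.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the two-branch encoder: a spatial structure code and a
    globally pooled texture vector. Channel widths and the plain strided
    convolutions (instead of the paper's residual blocks) are simplifications."""
    def __init__(self, structure_channels=8, texture_dim=2048):
        super().__init__()
        # Four downsampling stages, as in Sec. 3.4 (ResNet blocks in the paper).
        layers, in_ch, out_ch = [], 3, 64
        for _ in range(4):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            in_ch, out_ch = out_ch, out_ch * 2
        self.backbone = nn.Sequential(*layers)
        # Structure branch: keeps spatial dimensions (H/16 x W/16), few channels.
        self.to_structure = nn.Conv2d(in_ch, structure_channels, kernel_size=1)
        # Texture branch: global average pooling followed by a dense layer.
        self.to_texture = nn.Linear(in_ch, texture_dim)

    def forward(self, x):
        h = self.backbone(x)                       # (N, C, H/16, W/16)
        z_s = self.to_structure(h)                 # spatial structure code
        z_t = self.to_texture(h.mean(dim=(2, 3)))  # 2048-d texture code
        return z_s, z_t

# Example: a 256x256 image yields a 16x16 structure code and a 2048-d texture vector.
z_s, z_t = Encoder()(torch.randn(1, 3, 256, 256))
print(z_s.shape, z_t.shape)  # torch.Size([1, 8, 16, 16]) torch.Size([1, 2048])
```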
3.1 Accurate and realistic reconstruction

In a classic autoencoder [27], the encoder E and generator G form a mapping between an image $\mathbf{x} \in \mathbf{X} \subset \mathbb{R}^{H \times W \times 3}$ and a latent code $\mathbf{z} \in \mathcal{Z}$. As seen in the top branch of Figure 2, our autoencoder also follows this framework, using an image reconstruction loss:

$$\mathcal{L}_{\text{rec}}(E, G) = \mathbb{E}_{\mathbf{x} \sim \mathbf{X}}\big[\lVert \mathbf{x} - G(E(\mathbf{x})) \rVert_1\big]. \quad (1)$$

In addition, we wish for the image to be realistic, enforced by a discriminator D. The non-saturating adversarial loss [19] for the generator G and encoder E is calculated as:

$$\mathcal{L}_{\text{GAN,rec}}(E, G, D) = \mathbb{E}_{\mathbf{x} \sim \mathbf{X}}\big[-\log\big(D(G(E(\mathbf{x})))\big)\big]. \quad (2)$$

3.2 Decomposable latent codes

We divide the latent space $\mathcal{Z}$ into two components, $\mathbf{z} = (\mathbf{z}_s, \mathbf{z}_t)$, and enforce that swapping components with those from other images still produces realistic images, using the GAN loss [19]:

$$\mathcal{L}_{\text{GAN,swap}}(E, G, D) = \mathbb{E}_{\mathbf{x}^1, \mathbf{x}^2 \sim \mathbf{X},\, \mathbf{x}^1 \neq \mathbf{x}^2}\big[-\log\big(D(G(\mathbf{z}_s^1, \mathbf{z}_t^2))\big)\big], \quad (3)$$

where $\mathbf{z}_s^1$ and $\mathbf{z}_t^2$ are the first and second components of $E(\mathbf{x}^1)$ and $E(\mathbf{x}^2)$, respectively. Furthermore, as shown in Figure 2, we design the shapes of $\mathbf{z}_s$ and $\mathbf{z}_t$ asymmetrically, such that $\mathbf{z}_s$ is a tensor with spatial dimensions, while $\mathbf{z}_t$ is a vector. In our model, $\mathbf{z}_s$ and $\mathbf{z}_t$ are intended to encode structure and texture information, and are hence named the structure and texture code, respectively, for convenience. At each training iteration, we randomly sample two images $\mathbf{x}^1$ and $\mathbf{x}^2$, and enforce $\mathcal{L}_{\text{rec}}$ and $\mathcal{L}_{\text{GAN,rec}}$ on $\mathbf{x}^1$, and $\mathcal{L}_{\text{GAN,swap}}$ on the hybrid image of $\mathbf{x}^1$ and $\mathbf{x}^2$.

A majority of recent deep generative models [6, 26, 13, 11, 46, 43, 44], such as GANs [19] and VAEs [47], attempt to make the latent space Gaussian to enable random sampling. In contrast, we do not enforce such a constraint on the latent space of our model. Our swapping constraint instead focuses on making the distribution around a specific input image and its plausible variations well-modeled.

Figure 2: Swapping Autoencoder consists of an autoencoding (top) and a swapping (bottom) operation. (Top) An encoder E embeds an input (Notre-Dame) into two codes. The structure code is a tensor with spatial dimensions; the texture code is a 2048-dimensional vector. Decoding with generator G should produce a realistic image (enforced by discriminator D) matching the input (reconstruction loss). (Bottom) Decoding with the texture code from a second image (Saint Basil's Cathedral) should look realistic (via D) and match the texture of that image, by training with a patch co-occurrence discriminator Dpatch that enforces that the output and reference patches look indistinguishable.

Under ideal convergence, the training of the Swapping Autoencoder encourages several desirable properties of the learned embedding space $\mathcal{Z}$. First, the encoding function E is optimized to be injective, due to the reconstruction loss, in that different images are mapped to different latent codes. Also, our design choices encourage different codes to produce different outputs via G: the texture code must capture the texture distribution, while the structure code must capture location-specific information of the input images (see the appendix for more details). Lastly, the joint distribution of the two codes of the swap-generated images is factored by construction, since the structure codes are combined with random texture codes.
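For concreteness, the following is a simplified sketch of how the objectives in Eqs. (1)-(3) could be computed for a sampled image pair. It is a hedged illustration under the assumption that E, G, and D follow the interfaces sketched earlier and that D returns raw logits; it omits the discriminator updates and any regularization that the full training procedure would include.

```python
import torch
import torch.nn.functional as F

def generator_step(E, G, D, x1, x2):
    """One sketch of the encoder/generator objectives of Eqs. (1)-(3).
    D is assumed to return raw logits, so -log(sigmoid(logit)) = softplus(-logit)
    implements the non-saturating GAN loss."""
    z_s1, z_t1 = E(x1)
    _,    z_t2 = E(x2)

    # Eq. (1): L1 reconstruction of x1 from its own codes.
    rec = G(z_s1, z_t1)
    loss_rec = F.l1_loss(rec, x1)

    # Eq. (2): the reconstruction should also fool the image discriminator.
    loss_gan_rec = F.softplus(-D(rec)).mean()

    # Eq. (3): swapping in the texture code of x2 must still look realistic.
    swapped = G(z_s1, z_t2)
    loss_gan_swap = F.softplus(-D(swapped)).mean()

    # Weighted as in Sec. 3.4 (the co-occurrence term is added separately).
    return loss_rec + 0.5 * loss_gan_rec + 0.5 * loss_gan_swap, swapped
```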
3.3 Co-occurrent patch statistics

While the constraints above are sufficient for our swapping autoencoder to learn a factored representation, the resulting representation will not necessarily be intuitive for image editing, with no guarantee that $\mathbf{z}_s$ and $\mathbf{z}_t$ actually represent structure and texture. To address this, we encourage the texture code $\mathbf{z}_t$ to maintain the same texture in any swap-generated image. We introduce a patch co-occurrence discriminator $D_{\text{patch}}$, as shown in the bottom of Figure 2. The generator aims to generate a hybrid image $G(\mathbf{z}_s^1, \mathbf{z}_t^2)$ such that any patch from the hybrid cannot be distinguished from a group of patches from input $\mathbf{x}^2$:

$$\mathcal{L}_{\text{CooccurGAN}}(E, G, D_{\text{patch}}) = \mathbb{E}_{\mathbf{x}^1, \mathbf{x}^2 \sim \mathbf{X}}\big[-\log\big(D_{\text{patch}}\big(\text{crop}(G(\mathbf{z}_s^1, \mathbf{z}_t^2)),\ \text{crops}(\mathbf{x}^2)\big)\big)\big], \quad (4)$$

where crop selects a random patch of size 1/8 to 1/4 of the full image dimension on each side (and crops is a collection of multiple patches). Our formulation is inspired by Julesz's theory of texture perception [38, 40] (long used in texture synthesis [64, 17]), which hypothesizes that images with similar marginal and joint feature statistics appear perceptually similar. Our co-occurrence discriminator serves to enforce that the joint statistics of a learned representation are consistently transferred. Similar ideas for modeling co-occurrences have been used for propagating a single texture in a supervised setting [75], self-supervised representation learning [34], and identifying image composites [31].

3.4 Overall training and architecture

Our final objective function for the encoder and generator is

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{rec}} + 0.5\,\mathcal{L}_{\text{GAN,rec}} + 0.5\,\mathcal{L}_{\text{GAN,swap}} + \mathcal{L}_{\text{CooccurGAN}}.$$

The discriminator objective and design follow StyleGAN2 [44]. The co-occurrence patch discriminator first extracts features for each patch, and then concatenates them to pass to the final classification layer. The encoder consists of 4 downsampling ResNet [22] blocks to produce the tensor $\mathbf{z}_s$, and a dense layer after average pooling to produce the vector $\mathbf{z}_t$. As a consequence, the structure code $\mathbf{z}_s$ is limited by its receptive field at each location, providing an inductive bias for capturing local information. On the other hand, the texture code $\mathbf{z}_t$, deprived of spatial information by the average pooling, can only process aggregated feature distributions, forming a bias for controlling global style. The generator is based on StyleGAN2, with weights modulated by the texture code. Please see the appendix for a detailed specification of the architecture, as well as details of the discriminator loss function.
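A sketch of the co-occurrence term of Eq. (4) follows, again as an illustrative approximation rather than the reference implementation: the `random_crops` helper, the number of reference patches, the resizing of patches to a common resolution, and the two-argument interface of `D_patch` are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def random_crops(img, n_crops=8):
    """Sample square patches whose side is 1/8 to 1/4 of the image side (Eq. 4)."""
    N, _, H, W = img.shape
    patches = []
    for _ in range(n_crops):
        size = int(torch.empty(1).uniform_(H / 8, H / 4).item())
        top = torch.randint(0, H - size + 1, (1,)).item()
        left = torch.randint(0, W - size + 1, (1,)).item()
        patch = img[:, :, top:top + size, left:left + size]
        # Resize so patches of different sizes can be batched together (assumed detail).
        patches.append(F.interpolate(patch, size=64, mode='bilinear', align_corners=False))
    return torch.stack(patches, dim=1)  # (N, n_crops, 3, 64, 64)

def cooccurrence_loss(D_patch, swapped, x2):
    """Generator-side co-occurrence loss: a patch from the hybrid should be
    indistinguishable from a set of reference patches from x2 (Eq. 4)."""
    fake_patch = random_crops(swapped, n_crops=1)[:, 0]   # single query patch
    ref_patches = random_crops(x2, n_crops=8)             # reference patch set
    return F.softplus(-D_patch(fake_patch, ref_patches)).mean()
```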
4 Experiments

The proposed method can be used to efficiently embed a given image into a factored latent space, and to generate hybrid images by swapping latent codes. We show that the disentanglement of the latent codes into the classic concepts of style and content is competitive even with style transfer methods that address this specific task [48, 79], while producing more photorealistic results. Furthermore, we observe that, even without an explicit objective to encourage it, vector arithmetic in the learned embedding space $\mathcal{Z}$ leads to consistent and plausible image manipulations [7, 43, 35]. This opens up a powerful set of operations, such as attribute editing, image translation, and interactive image editing, which we explore. We first describe our experimental setup. We then evaluate our method on: (1) quickly and accurately embedding a test image, (2) producing realistic hybrid images with a factored latent code that corresponds to the concepts of texture and structure, and (3) the editability and usefulness of the latent space. We evaluate each aspect separately, with appropriate comparisons to existing methods.

4.1 Experimental setup

Datasets. For existing datasets, our model is trained on LSUN Churches and Bedrooms [80], Animal Faces HQ (AFHQ) [12], and Flickr-Faces-HQ (FFHQ) [43], all at a resolution of 256px except FFHQ at 1024px. In addition, we introduce new datasets: Portrait2FFHQ, a combined dataset of 17k portrait paintings from wikiart.org and FFHQ at 256px; Flickr Mountain, 0.5M mountain images from flickr.com; and Waterfall, 90k waterfall images at 256px. Flickr Mountain is trained at 512px resolution, but the model can handle larger image sizes (e.g., 1920×1080) due to the fully convolutional architecture.

Baselines. To use a GAN model for downstream image editing, one must embed the image into its latent space [87]. We compare our approach to two recent solutions. Im2StyleGAN [1] presents a method for embedding into StyleGAN [43], using iterative optimization into the W+ space of the model. The StyleGAN2 model [44] also includes an optimization-based method to embed into its latent space and noise vectors. One application of this embedding is producing hybrids: StyleGAN and StyleGAN2 exhibit an emergent hierarchical parameter space that allows hybrids to be produced by mixing the parameters of two images. We additionally compare to image stylization methods, which aim to mix the style of one image with the content of another. STROTSS [48] is an optimization-based framework in the spirit of the classic method of Gatys et al. [18]. We also compare to WCT2 [79], a recent state-of-the-art photorealistic style transfer method based on a feedforward network.

| Method | Runtime (sec) ↓ | Church ↓ | FFHQ ↓ | Waterfall ↓ | Average ↓ |
|---|---|---|---|---|---|
| Ours | 0.101 | 0.227 | 0.074 | 0.238 | 0.180 |
| Im2StyleGAN | 495 | 0.186 | 0.174 | 0.281 | 0.214 |
| StyleGAN2 | 96 | 0.377 | 0.215 | 0.384 | 0.325 |

Figure 3: Embedding examples and reconstruction quality. The table reports embedding runtime and per-dataset LPIPS reconstruction distance (lower is better). We project images into the embedding spaces of our method and the baseline GAN models, Im2StyleGAN [1, 43] and StyleGAN2 [44]. Our reconstructions better preserve detailed outlines (e.g., doorway, eye gaze) than StyleGAN2, and appear crisper than Im2StyleGAN. This is verified on average with the LPIPS metric [84]. Our method also reconstructs images much faster than recent generative models that use iterative optimization. See the appendix for more visual examples.

4.2 Image embedding

The first step in manipulating an image with a generative model is projecting it into the model's latent space. If the input image cannot be projected with high fidelity, the embedded vector cannot be used for editing, as the user would be editing a different image. Figure 3 illustrates both example reconstructions and a quantitative measurement of reconstruction quality, using LPIPS [84] between the original and embedded images. Note that our method accurately preserves the doorway pattern (top) and facial features (bottom) without blurriness. Averaged across datasets, and on 5 of the 6 comparisons to the baselines, our method achieves better reconstruction quality. An exception is the Church dataset, where Im2StyleGAN obtains a better reconstruction score. Importantly, as our method is designed with test-time embedding in mind, it only requires a single feedforward pass, which is at least 1000× faster than the baselines that require hundreds to thousands of optimization steps.
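As a reference point for how reconstruction fidelity of this kind can be measured, the sketch below computes the perceptual distance between an input and its single-feedforward-pass reconstruction using the publicly available `lpips` package [84]; the `E` and `G` handles are the hypothetical encoder and generator interfaces used in the earlier sketches, not the authors' released code.

```python
import lpips
import torch

# LPIPS perceptual distance [84]; inputs are expected in [-1, 1], shape (N, 3, H, W).
lpips_fn = lpips.LPIPS(net='alex')

@torch.no_grad()
def reconstruction_distance(E, G, x):
    """Embed an image with a single feedforward pass and measure how
    perceptually close the reconstruction is to the input (lower is better)."""
    z_s, z_t = E(x)              # one forward pass, no iterative optimization
    recon = G(z_s, z_t)
    return lpips_fn(recon, x).mean().item()
```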
Next, we investigate how useful the embedding is by exploring manipulations with the resulting code.

4.3 Swapping to produce image hybrids

In Figure 4, we show example hybrid images produced with our method by combining structure and texture codes from different images. Note that the textures of the top row of images are consistently transferred; the sky, facade, and window patterns are mapped to the appropriate regions of the structure images for the churches, and similarly for the bedsheets.

Figure 4: Image swapping. Each row shows the result of combining the structure code of the leftmost image with the texture code of the top image (trained on LSUN Church and Bedroom). Our model generates realistic images that preserve texture (e.g., the material of the building, or the bedsheet pattern) and structure (the outline of objects).

| Method | Runtime (sec) ↓ | Church ↑ | FFHQ ↑ | Waterfall ↑ | Average ↑ |
|---|---|---|---|---|---|
| Swap Autoencoder (Ours) | 0.113 | 31.3 ± 2.4 | 19.4 ± 2.0 | 41.8 ± 2.2 | 31.0 ± 1.4 |
| Im2StyleGAN [1, 43] | 990 | 8.5 ± 2.1 | 3.9 ± 1.1 | 12.8 ± 2.4 | 8.4 ± 1.2 |
| StyleGAN2 [44] | 192 | 24.3 ± 2.2 | 13.8 ± 1.8 | 35.3 ± 2.4 | 24.4 ± 1.4 |
| STROTSS [48] | 166 | 13.7 ± 2.2 | 3.5 ± 1.1 | 23.0 ± 2.1 | 13.5 ± 1.2 |
| WCT2 [79] | 1.35 | 27.9 ± 2.3 | 22.3 ± 2.0 | 35.8 ± 2.4 | 28.6 ± 1.3 |

Table 1: Realism of swap-generated images. The last four columns report the human perceptual study (AMT fooling rate, %, higher is better). We study how realistic our swap-generated images appear, compared to state-of-the-art generative modeling approaches (Im2StyleGAN and StyleGAN2) and stylization methods (STROTSS and WCT2). We run a perceptual study, where each method/dataset is evaluated with 1,000 human judgments. Our method achieves the highest average score. Note that WCT2 is a method tailored especially for photorealistic style transfer and is within the statistical significance of our method in the perceptual study. Runtime is reported for 1024×1024 resolution.

Figure 5: Comparison of image hybrids. Our approach generates realistic results that combine scene structure with elements of global texture, such as the shape of the towers (church), the hair color (portrait), and the long exposure (waterfall). Please see the appendix for more comparisons.

Realism of image hybrids. In Table 1, we compare with existing approaches, including the generative modeling methods [1, 44, 43]. For image hybrids, we additionally compare with state-of-the-art style transfer methods [48, 79], although they are not directly applicable to controllable editing by embedding images (Section 4.5). We run a human perceptual study, following the test setup used in [83, 33, 67]. A real and a generated image are shown sequentially for one second each to Amazon Mechanical Turk (AMT) workers, who choose which image they believe to be fake. We measure how often they fail to identify the fake; an algorithm generating perfectly plausible images would achieve a fooling rate of 50%.
We gather 15,000 judgments, 1,000 for each algorithm and dataset. Our method achieves the most realistic results on average. The nearest competitor is WCT2 [79], which is designed for photorealistic style transfer. Averaged across the three datasets, our method achieves the highest fooling rate (31.0 ± 1.4%), with WCT2 following closely, within statistical significance (28.6 ± 1.3%). We show qualitative examples in Figure 5.

Style and content. Next, we study how well the concepts of content and style are reflected in the structure and texture codes, respectively. We employ a two-alternative forced choice (2AFC) user study to quantify the quality of image hybrids in content and style space. We show participants our result and a baseline result, with the style or content reference in between, and then ask which image is more similar in style, or in content, respectively. Such 2AFC tests were used to train the LPIPS perceptual metric [84], as well as to evaluate style transfer methods in [48]. As no true automatic perceptual function exists, human perceptual judgments remain the gold standard for evaluating image synthesis results [83, 33, 10, 67]. Figure 6 visualizes the result of 3,750 user judgments over four baselines and three datasets, which reveals that our method outperforms all baseline methods with statistical significance in style preservation. For content preservation, our method is behind only WCT2, a photorealistic stylization method that makes only minor color modifications to the input. Most importantly, among models that can embed images, which is required for other forms of image editing, our method achieves the best performance in both style and content with statistical significance.

Figure 6: Style and content. (Left) Results of our perceptual study, in which we asked users on AMT to choose which image better reflects the style or content of a provided reference image, given two results (ours and a baseline). Our model is rated best for capturing style, and second-best for preserving content, behind WCT2 [79], a photorealistic style transfer method. Most importantly, our method was rated strictly better in both style and content matching than both image synthesis models, Im2StyleGAN [1, 43] and StyleGAN2 [44]. (Right) A plot of Single-Image FID (style) versus Self-Similarity Distance (content/structure): using the self-similarity distance [48] and SIFID [67], we study variations of the co-occurrence discriminator's patch size in training with respect to the image size (1/8, 1/4 (default), 1/2, and 3/4). As the patch size increases, our model tends to make more changes in swapping (closer to the target style and further from the input structure). In addition, we gradually interpolate the texture code, with interpolation ratio α, away from a full swap (α = 1.0), and observe that the transition is smooth.

Figure 7: Continuous interpolation. (Top) A manipulation vector for snow is discovered by taking the mean difference between the texture codes of 10 user-collected photos of snowy mountains and of summer mountains. The vector is simply added to the texture code of the input image (red) with some gain. (Bottom) Multi-domain, continuous transformation (e.g., painting to photo, dog to wildlife) is achieved by applying the average vector difference between the texture codes of two domains, based on annotations from the training sets. We train on the Portrait2FFHQ and AFHQ [12] datasets. See the appendix for more results.
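The manipulation vectors of Figure 7 and the texture-code interpolation analyzed in Figure 6 both reduce to simple arithmetic on the texture code. A hedged sketch, reusing the hypothetical `E`/`G` interfaces from the earlier examples, is shown below.

```python
import torch

def manipulation_vector(E, group_a, group_b):
    """Mean difference between the texture codes of two small groups of images
    (e.g., snowy vs. summer mountains, as in Figure 7)."""
    z_t_a = torch.stack([E(x)[1] for x in group_a]).mean(dim=0)
    z_t_b = torch.stack([E(x)[1] for x in group_b]).mean(dim=0)
    return z_t_a - z_t_b

def edit_texture(E, G, x, direction, gain=1.0):
    """Add a manipulation vector to the texture code with a user-chosen gain."""
    z_s, z_t = E(x)
    return G(z_s, z_t + gain * direction)

def interpolate_swap(E, G, x1, x2, alpha=1.0):
    """Blend the texture codes of x1 and x2; alpha = 1.0 is a full swap (Figure 6)."""
    z_s1, z_t1 = E(x1)
    _,    z_t2 = E(x2)
    return G(z_s1, (1 - alpha) * z_t1 + alpha * z_t2)
```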
4.4 Analysis of our method

Next, we analyze the behavior of our model using automated metrics. Self-similarity Distance [48] measures structural similarity in deep feature space, based on the self-similarity map of ImageNet-pretrained network features. Single-Image FID (SIFID) [67] measures style similarity by computing the Fréchet Inception Distance (FID) between two feature distributions, each generated from a single image. SIFID is similar to the Gram distance, a popular metric in stylization methods [18, 17], but differs by comparing the mean of the feature distribution as well as the covariance. Specifically, we vary the size of the cropped patches for the co-occurrence patch discriminator during training. In Figure 6 (right), the maximum size of the random crops is varied from 1/8 to 3/4 of the image side length, including the default setting of 1/4. We observe that as the co-occurrence discriminator sees larger patches, it enforces a stronger constraint, thereby introducing more visual change in both style and content. Moreover, instead of a full swap, we gradually interpolate one texture code toward the other. We observe that SIFID and the self-similarity distance both change gradually, in all patch settings. Such gradual visual change can be clearly observed in Figure 7, and the metrics confirm this.

4.5 Image editing via latent space operations

Even though no explicit constraint was enforced on the latent space, we find that modifications to the latent vectors cause smooth and predictable transformations in the resulting images. This makes the space amenable to downstream editing in multiple ways. First, we find that our representation allows for controllable image manipulation by vector arithmetic in the latent space. Figure 7 shows that adding the same vector smoothly transforms different images into a similar style, such as gradually adding more snow (top). Such vectors can be conveniently derived by taking the mean difference between the embeddings of two groups of images. By the same mechanism, the learned embedding space can also be used for image-to-image translation tasks (Figure 7), such as transforming paintings to photos. Image translation is achieved by applying the domain translation vector, computed as the mean difference between the two domains. Compared to most existing image translation methods, our method does not require that all images be labeled, and it allows for multi-domain, fine-grained control simply by modifying the vector magnitude and the members of the domain at test time. Finally, the design of the structure code $\mathbf{z}_s$ is directly amenable to local editing operations, due to its spatial nature; we show additional results in the appendix.

4.6 Interactive user interface for image editing

Based on the proposed latent space exploration methods, we built a sample user interface for creative user control over photographs. Figure 8 shows three editing modes that our model supports. Please see the demo video on our webpage. We demonstrate three operations: (1) global style editing: the texture code can be transformed by adding predefined manipulation vectors, computed with PCA on the training set. As in GANSpace [21], the user is provided with knobs to adjust the gain for each manipulation vector. (2) region editing: the structure code can be manipulated in the same way using PCA components, by treating each spatial location as an individual, controllable vector.
In addition, masks can be automatically provided to the user based on the self-similarity map at the location of interest, to control the extent of the structural manipulation. (3) cloning: the structure code can be directly edited using a brush that replaces the code with that from another part of the image, like the Clone Stamp tool of Photoshop.

Figure 8: Example interactive UI. (Top, cloning) Using an interactive UI, part of the image is redrawn by the user with a brush tool that extracts the structure code from a user-specified location (the brush-stroke visualization shows, e.g., removing a road and drawing a mountain). (Left, region editing) The bottom region is transformed to lake, snow, or different vegetation by adding a manipulation vector to the structure codes of the masked region, which is auto-generated from the self-similarity map at the specified location. (Right, global style editing) The overall texture and style can be changed using vector arithmetic with the principal directions from PCA, controlled by the sliders on the right pane of the UI. (Best viewed zoomed in.)

5 Discussion

The main question we would like to address is whether unconditional random image generation is required for high-quality image editing tasks. For such approaches, projection becomes a challenging operation, and intuitive disentanglement remains an open question. We show that our method, based on an autoencoder model, has a number of advantages over prior work, in that it can accurately embed high-resolution images in real time into an embedding space that disentangles texture from structure, and it generates realistic output images with both swapping and vector arithmetic. We performed extensive qualitative and quantitative evaluations of our method on multiple datasets. Still, structured texture transfer remains challenging, such as the striped bedsheet of Figure 4. Furthermore, a more extensive analysis of the nature of the disentanglement, ideally using reliable automatic metrics, will be beneficial future work.

Acknowledgments. We thank Nicholas Kolkin for the helpful discussion on the automated content and style evaluation, Jeongo Seo and Yoseob Kim for advice on the user interface, and William T. Peebles, Tongzhou Wang, and Yu Sun for the discussion on disentanglement.

Broader Impact

From the sculptor's chisel to the painter's brush, tools for creative expression are an important part of human culture. The advent of digital photography and professional editing tools, such as Adobe Photoshop, has allowed artists to push creative boundaries. However, the existing tools are typically too complicated to be used by the general public. Our work is one of a new generation of visual content creation methods that aim to democratize the creative process. The goal is to provide intuitive controls (see Section 4.6) for making a wider range of realistic visual effects available to non-experts. While the goal of this work is to support artistic and creative applications, the potential misuse of such technology for purposes of deception (posing generated images as real photographs) is quite concerning. To partially mitigate this concern, we can use advances in the field of image forensics [16] as a way of verifying the authenticity of a given image. In particular, Wang et al.
[72] recently showed that a classifier trained to distinguish real photographs from synthetic images generated by ProGAN [42] was able to detect fakes produced by other generators, among them StyleGAN [43] and StyleGAN2 [44]. We take a pretrained model of [72] and report the detection rates on several datasets in the appendix. Our swap-generated images can be detected with an average rate greater than 90%, which indicates that our method shares enough architectural components with previous methods to be detectable. However, these detection methods are not 100% accurate, and their performance can degrade as images are degraded in the wild (e.g., compressed, rescanned) or via adversarial attacks. Therefore, the problem of verifying image provenance remains a significant challenge to society that requires multiple layers of solutions, from technical (such as learning-based detection systems or authenticity certification chains), to social, such as efforts to increase public awareness of the problem, to regulatory and legislative.

Funding Disclosure

Taesung Park is supported by a Samsung Scholarship and an Adobe Research Fellowship, and much of this work was done as an Adobe Research intern. This work was supported in part by an Adobe gift.

References

[1] Abdal, R., Qin, Y., Wonka, P.: Image2stylegan: How to embed images into the stylegan latent space? In: IEEE International Conference on Computer Vision (ICCV) (2019) 2, 5, 6, 7 [2] Anokhin, I., Solovev, P., Korzhenkov, D., Kharlamov, A., Khakhulin, T., Silvestrov, A., Nikolenko, S., Lempitsky, V., Sterkin, G.: High-resolution daytime translation without domain labels. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020) 1, 2, 3 [3] Asim, M., Shamshad, F., Ahmed, A.: Blind image deconvolution using deep generative priors. arXiv preprint arXiv:1802.04073 (2018) 2 [4] Bau, D., Strobelt, H., Peebles, W., Wulff, J., Zhou, B., Zhu, J.Y., Torralba, A.: Semantic photo manipulation with a generative image prior. ACM Transactions on Graphics (TOG) 38(4), 1–11 (2019) 2 [5] Bau, D., Zhu, J.Y., Strobelt, H., Bolei, Z., Tenenbaum, J.B., Freeman, W.T., Torralba, A.: Gan dissection: Visualizing and understanding generative adversarial networks. In: International Conference on Learning Representations (ICLR) (2019) 2 [6] Bengio, Y.: Deep learning of representations: Looking forward. In: International Conference on Statistical Language and Speech Processing (2013) 3 [7] Brock, A., Donahue, J., Simonyan, K.: Large scale gan training for high fidelity natural image synthesis. In: International Conference on Learning Representations (ICLR) (2019) 5 [8] Brock, A., Lim, T., Ritchie, J.M., Weston, N.: Neural photo editing with introspective adversarial networks. In: International Conference on Learning Representations (ICLR) (2017) 2 [9] Chen, D., Yuan, L., Liao, J., Yu, N., Hua, G.: Stylebank: An explicit representation for neural image style transfer. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1897–1906 (2017) 3 [10] Chen, Q., Koltun, V.: Photographic image synthesis with cascaded refinement networks. In: IEEE International Conference on Computer Vision (ICCV) (2017) 2, 7 [11] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: Infogan: Interpretable representation learning by information maximizing generative adversarial nets.
In: Advances in Neural Information Processing Systems (2016) 2, 3 [12] Choi, Y., Uh, Y., Yoo, J., Ha, J.W.: Stargan v2: Diverse image synthesis for multiple domains. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020) 2, 5, 8 [13] Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using real nvp. In: International Conference on Learning Representations (ICLR) (2017) 2, 3 [14] Dosovitskiy, A., Brox, T.: Generating images with perceptual similarity metrics based on deep networks. In: Advances in Neural Information Processing Systems (2016) 3 [15] Esser, P., Haux, J., Ommer, B.: Unsupervised robust disentangling of latent characteristics for image synthesis. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2699 2709 (2019) 2, 3 [16] Farid, H.: Photo forensics. MIT press (2016) 10 [17] Gatys, L., Ecker, A.S., Bethge, M.: Texture synthesis using convolutional neural networks. In: Advances in Neural Information Processing Systems (2015) 2, 4, 8 [18] Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 3, 5, 8 [19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (2014) 2, 3 [20] Guérin, E., Digne, J., Galin, E., Peytavie, A., Wolf, C., Benes, B., Martinez, B.: Interactive example-based terrain authoring with conditional generative adversarial networks. ACM Transactions on Graphics (TOG) 36(6) (2017) 2 [21] Härkönen, E., Hertzmann, A., Lehtinen, J., Paris, S.: Ganspace: Discovering interpretable gan controls. In: Advances in Neural Information Processing Systems (2020) 2, 9 [22] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 5 [23] He, M., Chen, D., Liao, J., Sander, P.V., Yuan, L.: Deep exemplar-based colorization. ACM Transactions on Graphics (TOG) 37(4), 1 16 (2018) 2 [24] Heeger, D.J., Bergen, J.R.: Pyramid-based texture analysis/synthesis. In: Proceedings of the 22nd annual conference on Computer graphics and interactive techniques. pp. 229 238 (1995) 2 [25] Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analogies. In: ACM Transactions on Graphics (TOG) (2001) 3 [26] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A.: beta-vae: Learning basic visual concepts with a constrained variational framework. In: International Conference on Learning Representations (ICLR) (2017) 3 [27] Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504 507 (2006) 3 [28] Hong, S., Yan, X., Huang, T.S., Lee, H.: Learning hierarchical semantic image manipulation through structured representations. In: Advances in Neural Information Processing Systems (Neur IPS) (2018) 1 [29] Hu, Q., Szabó, A., Portenier, T., Favaro, P., Zwicker, M.: Disentangling factors of variation by mixing them. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 2 [30] Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. European Conference on Computer Vision (ECCV) (2018) 2 [31] Huh, M., Liu, A., Owens, A., Efros, A.A.: Fighting fake news: Image splice detection via learned self-consistency. 
In: European Conference on Computer Vision (ECCV) (2018) 4 [32] Iizuka, S., Simo-Serra, E., Ishikawa, H.: Globally and locally consistent image completion. ACM Transactions on Graphics (TOG) 36(4), 107 (2017) 2 [33] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 1, 2, 6, 7 [34] Isola, P., Zoran, D., Krishnan, D., Adelson, E.H.: Learning visual groups from co-occurrences in space and time. arXiv preprint arXiv:1511.06811 (2015) 4 [35] Jahanian, A., Chai, L., Isola, P.: On the "steerability" of generative adversarial networks. In: International Conference on Learning Representations (ICLR) (2020) 5 [36] Jha, A.H., Anand, S., Singh, M., Veeravasarapu, V.: Disentangling factors of variation with cycle-consistent variational auto-encoders. In: European Conference on Computer Vision (ECCV) (2018) 2, 3 [37] Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision (ECCV) (2016) 3 [38] Julesz, B.: Visual pattern discrimination. IRE Transactions on Information Theory 8(2), 84–92 (1962) 2, 4 [39] Julesz, B.: Textons, the elements of texture perception, and their interactions. Nature 290(5802), 91–97 (1981) 2 [40] Julesz, B., Gilbert, E.N., Shepp, L.A., Frisch, H.L.: Inability of humans to discriminate between visual textures that agree in second-order statistics revisited. Perception 2(4), 391–405 (1973) 2, 4 [41] Karacan, L., Akata, Z., Erdem, A., Erdem, E.: Manipulating attributes of natural scenes via hallucination. arXiv preprint arXiv:1808.07413 (2018) 1 [42] Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for improved quality, stability, and variation. In: International Conference on Learning Representations (ICLR) (2018) 10 [43] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 2, 3, 5, 6, 7, 10 [44] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of stylegan. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 2, 3, 4, 5, 6, 7, 10 [45] Kazemi, H., Iranmanesh, S.M., Nasrabadi, N.: Style and content disentanglement in generative adversarial networks. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 848–856. IEEE (2019) 2 [46] Kim, H., Mnih, A.: Disentangling by factorising. In: International Conference on Machine Learning (ICML) (2018) 3 [47] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: International Conference on Learning Representations (ICLR) (2014) 2, 3 [48] Kolkin, N., Salavon, J., Shakhnarovich, G.: Style transfer by relaxed optimal transport and self-similarity. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 5, 6, 7, 8 [49] Kotovenko, D., Sanakoyeu, A., Lang, S., Ommer, B.: Content and style disentanglement for artistic style transfer. In: IEEE International Conference on Computer Vision (ICCV) (2019) 3 [50] Larsson, G., Maire, M., Shakhnarovich, G.: Learning representations for automatic colorization. In: European Conference on Computer Vision (ECCV) (2016) 2 [51] Lee, H.Y., Tseng, H.Y., Huang, J.B., Singh, M.K., Yang, M.H.: Diverse image-to-image translation via disentangled representation.
In: European Conference on Computer Vision (ECCV) (2018) 3 [52] Li, Y., Liu, M.Y., Li, X., Yang, M.H., Kautz, J.: A closed-form solution to photorealistic image stylization. In: European Conference on Computer Vision (ECCV) (2018) 3 [53] Lin, J., Chen, Z., Xia, Y., Liu, S., Qin, T., Luo, J.: Exploring explicit domain supervision for latent space disentanglement in unpaired image-to-image translation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2019) 3 [54] Lin, T.Y., Maji, S.: Visualizing and understanding deep texture representations. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2791 2799 (2016) 2 [55] Liu, G., Reda, F.A., Shih, K.J., Wang, T.C., Tao, A., Catanzaro, B.: Image inpainting for irregular holes using partial convolutions. In: European Conference on Computer Vision (ECCV) (2018) 1, 2 [56] Liu, M.Y., Huang, X., Mallya, A., Karras, T., Aila, T., Lehtinen, J., Kautz, J.: Few-shot unsupervised image-to-image translation. In: IEEE International Conference on Computer Vision (ICCV) (2019) 2 [57] Luan, F., Paris, S., Shechtman, E., Bala, K.: Deep photo style transfer. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 3 [58] Mathieu, M.F., Zhao, J.J., Zhao, J., Ramesh, A., Sprechmann, P., Le Cun, Y.: Disentangling factors of variation in deep representation using adversarial training. In: Advances in Neural Information Processing Systems (2016) 2, 3 [59] Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with spatially-adaptive normalization. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 1, 2 [60] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 1, 2 [61] Perarnau, G., van de Weijer, J., Raducanu, B., Álvarez, J.M.: Invertible conditional gans for image editing. In: NIPS Workshop on Adversarial Training (2016) 2 [62] Pidhorskyi, S., Adjeroh, D.A., Doretto, G.: Adversarial latent autoencoders. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 3 [63] Portenier, T., Hu, Q., Szabó, A., Bigdeli, S.A., Favaro, P., Zwicker, M.: Faceshop: Deep sketch-based face image editing. ACM Transactions on Graphics (TOG) 37(4) (2018) 2 [64] Portilla, J., Simoncelli, E.P.: A parametric texture model based on joint statistics of complex wavelet coefficients. International journal of computer vision 40(1), 49 70 (2000) 2, 4 [65] Salakhutdinov, R., Hinton, G.: Deep boltzmann machines. In: Artificial intelligence and statistics. pp. 448 455 (2009) 2 [66] Sangkloy, P., Lu, J., Fang, C., Yu, F., Hays, J.: Scribbler: Controlling deep image synthesis with sketch and color. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 2 [67] Shaham, T.R., Dekel, T., Michaeli, T.: Singan: Learning a generative model from a single natural image. In: IEEE International Conference on Computer Vision (ICCV) (2019) 2, 6, 7, 8 [68] Shocher, A., Bagon, S., Isola, P., Irani, M.: Ingan: Capturing and remapping the" dna" of a natural image. In: IEEE International Conference on Computer Vision (ICCV) (2019) 2 [69] Singh, K.K., Ojha, U., Lee, Y.J.: Finegan: Unsupervised hierarchical disentanglement for fine-grained object generation and discovery. 
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 2, 3 [70] Tenenbaum, J.B., Freeman, W.T.: Separating style and content with bilinear models. Neural computation 12(6), 1247 1283 (2000) 2, 3 [71] Ulyanov, D., Lebedev, V., Vedaldi, A., Lempitsky, V.: Texture networks: Feed-forward synthesis of textures and stylized images. In: International Conference on Machine Learning (ICML) (2016) 3 [72] Wang, S.Y., Wang, O., Zhang, R., Owens, A., Efros, A.A.: Cnn-generated images are surprisingly easy to spot... for now. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 10 [73] Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional gans. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 1, 2 [74] Wu, W., Cao, K., Li, C., Qian, C., Loy, C.C.: Disentangling content and style via unsupervised geometry distillation. ar Xiv preprint ar Xiv:1905.04538 (2019) 2 [75] Xian, W., Sangkloy, P., Agrawal, V., Raj, A., Lu, J., Fang, C., Yu, F., Hays, J.: Texturegan: Controlling deep image synthesis with texture patches. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 2, 4 [76] Xing, X., Han, T., Gao, R., Zhu, S.C., Wu, Y.N.: Unsupervised disentangling of appearance and geometry by deformable generator network. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 2 [77] Yang, C., Lu, X., Lin, Z., Shechtman, E., Wang, O., Li, H.: High-resolution image inpainting using multi-scale neural patch synthesis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 2 [78] Yeh, R.A., Chen, C., Yian Lim, T., Schwing, A.G., Hasegawa-Johnson, M., Do, M.N.: Semantic image inpainting with deep generative models. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 2 [79] Yoo, J., Uh, Y., Chun, S., Kang, B., Ha, J.W.: Photorealistic style transfer via wavelet transforms. In: IEEE International Conference on Computer Vision (ICCV) (2019) 3, 5, 6, 7 [80] Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., Xiao, J.: Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. ar Xiv preprint ar Xiv:1506.03365 (2015) 2, 5 [81] Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Generative image inpainting with contextual attention. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018) 1 [82] Yu, X., Chen, Y., Liu, S., Li, T., Li, G.: Multi-mapping image-to-image translation via learning disentanglement. In: Advances in Neural Information Processing Systems (Neur IPS) (2019) 2 [83] Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: European Conference on Computer Vision (ECCV) (2016) 2, 6, 7 [84] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 5, 6, 7 [85] Zhang, R., Zhu, J.Y., Isola, P., Geng, X., Lin, A.S., Yu, T., Efros, A.A.: Real-time user-guided image colorization with learned deep priors. ACM Transactions on Graphics (TOG) 9(4) (2017) 2 [86] Zhou, Y., Zhu, Z., Bai, X., Lischinski, D., Cohen-Or, D., Huang, H.: Non-stationary texture synthesis by adversarial expansion. 
ACM Transactions on Graphics (TOG) 37(4) (2018) 2 [87] Zhu, J.Y., Krähenbühl, P., Shechtman, E., Efros, A.A.: Generative visual manipulation on the natural image manifold. In: European Conference on Computer Vision (ECCV) (2016) 2, 5 [88] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE International Conference on Computer Vision (ICCV) (2017) 1, 2 [89] Zhu, J.Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., Shechtman, E.: Toward multimodal image-to-image translation. In: Advances in Neural Information Processing Systems (2017) 2