# infinitygan_towards_infinitepixel_image_synthesis__86507fe4.pdf Published as a conference paper at ICLR 2022 INFINITYGAN: TOWARDS INFINITE-PIXEL IMAGE SYNTHESIS Chieh Hubert Lin1 Hsin-Ying Lee2 Yen-Chi Cheng3 Sergey Tulyakov2 Ming-Hsuan Yang1,4,5 1UC Merced 2Snap Inc. 3Carnegie Mellon University 4Yonsei University 5Google Research ABSTRACT We present a novel framework, Infinity GAN, for arbitrary-sized image generation. The task is associated with several key challenges. First, scaling existing models to an arbitrarily large image size is resource-constrained, in terms of both computation and availability of large-field-of-view training data. Infinity GAN trains and infers in a seamless patch-by-patch manner with low computational resources. Second, large images should be locally and globally consistent, avoid repetitive patterns, and look realistic. To address these, Infinity GAN disentangles global appearances, local structures, and textures. With this formulation, we can generate images with spatial size and level of details not attainable before. Experimental evaluation validates that Infinity GAN generates images with superior realism compared to baselines and features parallelizable inference. Finally, we show several applications unlocked by our approach, such as spatial style fusion, multimodal outpainting, and image inbetweening. All applications can be operated with arbitrary input and output sizes. Style A Style B Training Patch Size Figure 1: Synthesizing infinite-pixel images from finite-sized training data. A 1024 2048 image composed of 242 patches, independently synthesized by Infinity GAN with spatial fusion of two styles. The generator is trained on 101 101 patches (e.g., marked in top-left) sampled from 197 197 real images. Note that training and inference (of any size) are performed on a single GTX TITAN X GPU. Zoom-in for better experience. 1 INTRODUCTION To infinity and beyond! Buzz Lightyear Generative models witness substantial improvements in resolution and level of details. Most improvements come at a price of increased training time (Gulrajani et al., 2017; Mescheder et al., 2018), larger model size (Balaji et al., 2021), and stricter data requirements (Karras et al., 2018). The most recent works synthesize images at 1024 1024 resolution featuring a high level of details All codes, datasets, and trained models are publicly available. Project page: https://hubert0527.github.io/infinity GAN/ Published as a conference paper at ICLR 2022 and fidelity. However, models generating high resolution images usually still synthesize images of limited field-of-view bounded by the training data. It is not straightforward to scale these models to generate images of arbitrarily large field-of-view. Synthesizing infinite-pixel images is constrained by the finite nature of resources. Finite computational resources (e.g., memory and training time) set bounds for input receptive field and output size. A further limitation is that there exists no infinitepixel image dataset. Thus, to generate infinite-pixel images, a model should learn the implicit global structure without direct supervision and under limited computational resources. Repetitive texture synthesis methods (Efros & Leung, 1999; Xian et al., 2018) generalize to large spatial sizes. Yet, such methods are not able to synthesize real-world images. Recent works, such as Sin GAN (Shaham et al., 2019) and In GAN (Shocher et al., 2019), learn an internal patch distribution for image synthesis. 
Although these models can generate images with arbitrary shapes, in Section 4.1, we show that they do not infer structural relationships well, and fail to construct plausible holistic views with spatially extended latent space. A different approach, COCO-GAN (Lin et al., 2019), learns a coordinate-conditioned patch distribution for image synthesis. As shown in Figure 4, despite the ability to slightly extend images beyond the learned boundary, it fails to maintain the global coherence of the generated images when scaling to a 2 larger generation size. How to generate infinite-pixel images? Humans are able to guess the whole scene given a partial observation of it. In a similar fashion, we aim to build a generator that trains with image patches, and inference images of unbounded arbitrary-large size. An example of a synthesized scene containing globally-plausible structure and heterogeneous textures is shown in Figure 1. We propose Infinity GAN, a method that trains on a finite-pixel dataset, while generating infinitepixel images at inference time. Infinity GAN consists of a neural implicit function, termed structure synthesizer, and a padding-free Style GAN2 generator, dubbed texture synthesizer. Given a global appearance of an infinite-pixel image, the structure synthesizer samples a sub-region using coordinates and synthesizes an intermediate local structural representations. The texture synthesizer then seamlessly synthesizes the final image by parts after filling the fine local textures to the local structural representations. Infinity GAN can infer a compelling global composition of a scene with realistic local details. Trained on small patches, Infinity GAN achieves high-quality, seamless and arbitrarily-sized outputs with low computational resources a single TITAN X to train and test. We conduct extensive experiments to validate the proposed method. Qualitatively, we present the everlastingly long landscape images. Quantitatively, we evaluate Infinity GAN and related methods using user study and a proposed Scale Inv FID metric. Furthermore, we demonstrate the efficiency and efficacy of the proposed methods with several applications. First, we demonstrate the flexibility and controllability of the proposed method by spatially fusing structures and textures from different distributions within an image. Second, we show that our model is an effective deep image prior for the image outpainting task with the image inversion technique and achieves multi-modal outpainting of arbitrary length from arbitrarily-shaped inputs. Third, with the proposed model we can divide-and-conquer the full image generation into independent patch generation and achieve 7.2 of inference speed-up with parallel computing, which is critical for high-resolution image synthesis. 2 RELATED WORK Latent generative models. Existing generative models are mostly designed to synthesize images of fixed sizes. A few methods (Karras et al., 2018; 2020) have been recently developed to train latent generative models on high-resolution images, up to 1024 1024 pixels. However, latent generative models generate images from dense latent vectors that require synthesizing all structural contents at once. Bounded by computational resources and limited by the learning framework and architecture, these approaches synthesize images of certain sizes and are non-trivial to generalize to different output size. 
In contrast, patch-based GANs trained on image patches (Lin et al., 2019; Shaham et al., 2019; Shocher et al., 2019) are less constrained by the resource bottleneck with the synthesis-by-part approach. However, (Shaham et al., 2019; Shocher et al., 2019) can only model and repeat internal statistics of a single image, and (Lin et al., 2019) can barely extrapolate few patches beyond the training size. ALIS (Skorokhodov et al., 2021) is a concurrent work that also explores synthesizing infinite-pixel images. It recursively inbetweens latent variable pairs in the horizontal direction. We further discuss the method in Appendix A.1. Finally, autoregressive models (Oord et al., 2016; Razavi et al., 2019; Esser et al., 2021) can theoretically synthesize at arbitrary image sizes. Despite (Razavi et al., 2019) and (Esser et al., 2021) showing unconditional images synthesis at 1024 1024 resolution, their application in infinite-pixel image synthesis has not yet been well-explored. Published as a conference paper at ICLR 2022 Section 3.2 Section 3.3 Coordinates (c) Local Latent (zl) Structure Synthesizer (𝐺S) (Implicit Function) Texture Synthesizer (𝐺T) (Fully-Convolutional) Implicit Image (Infinitely Large) Discriminator Real Full Image 𝐿𝑎𝑟(Equation 5) Section 3.5 Global Latent (zg) Randomized Noises (zn) Synthesized Figure 2: Overview. The generator of Infinity GAN consists of two modules, a structure synthesizer based on a neural implicit function, and a fully-convolutional texture synthesizer with all positional information removed (see Figure 3). The two networks take four sets of inputs, a global latent variable that defines the holistic appearance of the image, a local latent variable that represents the local and structural variation, a continuous coordinate for learning the neural implicit structure synthesizer, and a set of randomized noises to model fine-grained texture. Infinity GAN synthesizes images of arbitrary size by learning spatially extensible representations. Conditional generative models. Numerous tasks such as image super-resolution, semantic image synthesis, and image extrapolation often showcase results over 1024 1024 pixels. These tasks are less related to our setting, as most structural information is already provided in the conditional inputs. We illustrate and compare the characteristics of these tasks against ours in Appendix B. Image outpainting. Image outpainting (Abdal et al., 2020; Liu et al., 2021; Sabini & Rusak, 2018; Yang et al., 2019) is related to image inpainting (Liu et al., 2018a; Yu et al., 2019) and shares similar issues that the generator tends to copy-and-paraphrase the conditional input or create mottled textural samples, leading to repetitive results especially when the outpainted region is large. In Out (Cheng et al., 2021) proposes to outpaint image with GANs inversion and yield results with higher diversity. We show that with Infinity GAN as the deep image prior along with In Out (Cheng et al., 2021), we obtain the state-of-the-art outpainting results and avoids the need of sequential outpainting. Then, we demonstrate applications in arbitrary-distant image inbetweening, which is at the intersection of image inpainting (Liu et al., 2018a; Nazeri et al., 2019; Yu et al., 2019) and outpainting research. Neural implicit representation. Neural implicit functions (Park et al., 2019; Mescheder et al., 2019; Mildenhall et al., 2020) have been applied to model the structural information of 3D and continuous representations. 
Adopting neural implicit modeling, our query-by-coordinate synthesizer is able to model structural information effectively. Some recent works (De Vries et al., 2021; Niemeyer & Geiger, 2021; Chan et al., 2021) also attempt to integrate neural implicit function into generative models, but aiming at 3D-structure modeling instead of extending the synthesis field-of-view. 3 PROPOSED METHOD 3.1 OVERVIEW An arbitrarily large image can be described globally and locally. Globally, images should be coherent and hence global characteristics should be expressible by a compact holistic appearance (e.g., a medieval landscape, ocean view panorama). Therefore, we adopt a fixed holistic appearance for each infinite-pixel image to represent the high-level composition and content of the scene. Locally, a close-up view of an image is defined by its local structure and texture. The structure represents objects, shapes and their arrangement within a local region. Once the structure is defined, there exist multiple feasible appearances or textures to render realistic scenes. At the same time, structure and texture should conform to the global holistic appearance to maintain the visual consistency among the neighboring patches. Given these assumptions, we can generate an infinite-pixel image by first sampling a global holistic appearance, then spatially extending local structures and textures following the holistic appearance. Accordingly, the Infinity GAN generator G consists of a structure synthesizer GS and a texture synthesizer GT. GS is an implicit function that samples a sub-region with coordinates and creates local structural features. GT is a fully convolutional Style GAN2 (Karras et al., 2020) modeling textural properties for local patches and rendering final image. Both modules follow a consistent holistic appearance throughout the process. Figure 2 presents the overview of our framework. Published as a conference paper at ICLR 2022 3.2 STRUCTURE SYNTHESIZER (GS) Structure synthesizer is a neural implicit function driven by three sets of latent variables: A global latent vector zg representing the holistic appearance of the infinite-pixel image (also called implicit image since the whole image is never explicitly sampled), a local latent tensor zl expressing the local structural variation of the image content, and a coordinate grid c specifying the location of the patches to sample from the implicit image. The synthesis process is formulated as: z S = GS(zg, zl, c), (1) where z S denotes the structural latent variable that is later used as an input to the texture synthesizer. We sample zg RDzg from a unit Gaussian distribution once and inject zg into every layer and pixel in GS via feature modulation (Huang & Belongie, 2017; Karras et al., 2020). As local variations are independent across the spatial dimension, we independently sample them from a unit Gaussian prior for each spatial position of zl RH W Dzl, where H and W can be arbitrarily extended. We then use coordinate grid c to specify the location of the target patches to be sampled. To be able to condition GS with coordinates infinitely far from the origin, we introduce a prior by exploiting the nature of landscape images: (a) self-similarity for the horizontal direction, and (b) rapid saturation (e.g., land, sky or ocean) for the vertical direction. To implement this, we use the positional encoding for the horizontal axis similar to (Vaswani et al., 2017; Tancik et al., 2020; Sitzmann et al., 2020). 
We use both sine and cosine functions to encode each coordinate for numerical stability. For the vertical axis, to represent saturation, we apply the tanh function. Formally, given horizontal and vertical indexes (ix, iy) of zl tensor, we encode them as c = (tanh(iy), cos(ix/T), sin(ix/T)) , where T is the period of the sine function and c controls the location of the patch to generate. To prevent the model from ignoring the variation of zl and generating repetitive content by following the periodically repeating coordinates, we adopt a mode-seeking diversity loss (Mao et al., 2019; Lee et al., 2020) between a pair of local latent variables zl1 and zl2 while sharing the same zg and c: Ldiv = zl1 zl2 1 / GS(zg, zl1, c) GS(zg, zl2, c) 1 . (2) Conventional neural implicit functions produce outputs for each input query independently, which is a pixel in zl for Infinity GAN. Such a design causes training instabilities and slows convergence, as we show in Figure 37. We therefore adopt the feature unfolding technique (Chen et al., 2021) to enable GS to account for the information in a broader neighboring region of zl and c, introducing a larger receptive field. For each layer in GS, before feeding forward to the next layer, we apply a k k feature unfolding transformation at each location (i, j) of the origin input f to obtain the unfolded input f : f (i,j) = Concat({f(i + n, j + m)}n,m { k/2,k/2}) , (3) where Concat( ) concatenates the unfolded vectors in the channel dimension. In practice, as the gridshaped zl and c are sampled with equal spacing between consecutive pixels, the feature unfolding can be efficiently implemented with Coord Conv (Liu et al., 2018b). 3.3 TEXTURE SYNTHESIZER (GT) Texture synthesizer aims to model various realizations of local texture given the local structure z S generated by the structure synthesizer. In addition to the holistic appearance zg and the local structural latent z S, texture synthesizer uses noise vectors zn to model the finest-grained textural variations that are difficult to capture by other variables. The generation process can be written as: pc = GT(z S, zg, zn) , (4) where pc is a generated patch at location c (i.e., the c used in Eq 1 for generating z S). We implement upon Style GAN2 (Karras et al., 2020). First, we replace the fixed constant input with the generated structure z S. Similar to Style GAN2, randomized noises zn are added to all layers of GT, representing the local variations of fine-grained textures. Then, a mapping layer projects zg to a style vector, and the style is injected into all pixels in each layer via feature modulation. Finally, we remove all zero-paddings from the generator, as shown in Figure 3(b). Both zero-padding and GS can provide positional information to the generator, and we later show that positional information is important for generator learning in Section 4.2. However, it is necessary to remove all zero-paddings from GT for three major reasons. First, zero-padding has a consistent pattern during training, due to the fixed training image size. Such a behavior misleads Published as a conference paper at ICLR 2022 Example 1 3x3 Conv2d Example 2 3x3 Conv2d Transpose Local Latent & Coordinate Image Space (b) Padding-Free Generator Ignored Ignored Patch A Patch B Features Positional Info Conv2d-Transpose (a) Conventional Generator Inconsistent Patch A Patch B Positional Info Consistent Feature Pathway Figure 3: Padding-free generator. 
(Left) Conventional generators synthesize inconsistent pixels due to the zero-paddings. Note that the inconsistency region grows exponentially as the network deepened. (Right) In contrast, our padding-free generator can synthesize consistent pixel value regardless of the position in the model receptive field. Such a property facilitates spatiallyindependently generating patches and forming into a seamless image with consistent feature values. the generator to memorize the padding pattern, and becomes vulnerable to unseen padding patterns while attempting to synthesize at a different image size. The third column of Figure 4 shows when we extend the input latent variable of the Style GAN2 generator multiple times, the center part of the features does not receive expected coordinate information from the paddings, resulting in extensively repetitive textures in the center area of the output image. Second, zero-paddings can only provide positional information within a limited distance from the image border. However, while generating infinite-pixel images, the image border is considered infinitely far from the generated patch. Finally, as shown in Figure 3, the existence of paddings hampers GT from generating separate patches that can be composed together. Therefore, we remove all paddings from GT, facilitating the synthesisby-parts of arbitrary-sized images. We refer to the proposed GT as a padding-free generator (PFG). 3.4 SPATIALLY INDEPENDENT GENERATION Infinity GAN enables spatially independent generation thanks to two characteristics of the proposed modules. First, GS, as a neural implicit function, naturally supports independent inference at each spatial location. Second, GT, as a fully convolutional generator with all paddings removed, can synthesize consistent pixel values at the same spatial location in the implicit image, regardless of different querying coordinates, as shown in Figure 3(b). With these properties, we can independently query and synthesize a patch from the implicit image, seamlessly combine multiple patches into an arbitrarily large image, and maintain constant memory usage while synthesizing images of any size. In practice, having a single center pixel in a z S slice that aligns to the center pixel of the corresponding output image patch can facilitate zl and c indexing. We achieve the goal by shrinking the Style GAN2 blur kernel size from 4 to 3, causing the model to generate odd-sized features in all layers, due to the convolutional transpose layers. 3.5 MODEL TRAINING The discriminator D of Infinity GAN is similar to the one in the Style GAN2 method. The detailed architectures of G and D are presented in Appendix D. The two networks are trained with the nonsaturating logistic loss Ladv (Goodfellow et al., 2014), R1 regularization LR1 (Mescheder et al., 2018) and path length regularization Lpath (Karras et al., 2020). Furthermore, to encourage the generator to follow the conditional distribution in the vertical direction, we train G and D with an auxiliary task (Odena et al., 2017) predicting the vertical position of the patch: Lar = ˆcy cy 1 , (5) where ˆcy is the vertical coordinate predicted by D, and cy is either (for generated images) cy = tanh(iy) or (for real images) the vertical position of the patch in the full image. We formulate Lar as a regression task. The overall loss function for the Infinity GAN is: min D Ladv + λar Lar + λR1LR1 , min G Ladv + λar Lar + λdiv Ldiv + λpath Lpath , (6) where λ s are the weights. 
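To make the objectives above concrete, the following is a minimal sketch of how the individual loss terms could be computed. The interfaces G_S(zg, zl, c) and a discriminator returning an adversarial logit and a vertical-coordinate prediction, as well as the helper names, are assumptions for illustration rather than the released implementation; the R1 and path-length regularizers are omitted and follow the standard StyleGAN2 recipe.

```python
import torch.nn.functional as F

# Hedged sketch of the InfinityGAN loss terms; G_S(zg, zl, c) and
# D(x) -> (adv_logit, cy_hat) are assumed interfaces, for illustration only.

def diversity_loss(G_S, zg, zl1, zl2, c, eps=1e-8):
    """Eq. (2): mode-seeking diversity loss between two local latents sharing zg and c."""
    latent_dist = (zl1 - zl2).abs().mean()
    output_dist = (G_S(zg, zl1, c) - G_S(zg, zl2, c)).abs().mean()
    return latent_dist / (output_dist + eps)

def aux_position_loss(cy_pred, cy_target):
    """Eq. (5): L1 regression of the patch's vertical coordinate predicted by D."""
    return (cy_pred - cy_target).abs().mean()

def g_nonsat_loss(fake_logits):
    """Non-saturating logistic loss for the generator (Goodfellow et al., 2014)."""
    return F.softplus(-fake_logits).mean()

def d_nonsat_loss(real_logits, fake_logits):
    """Non-saturating logistic loss for the discriminator."""
    return F.softplus(fake_logits).mean() + F.softplus(-real_logits).mean()

# Eq. (6), using the paper's weights lambda_ar = lambda_div = 1, lambda_R1 = 10,
# lambda_path = 2 (R1 and path-length terms not shown here):
#   L_D = d_nonsat_loss(...) + 1 * aux_position_loss(...) + 10 * R1(...)
#   L_G = g_nonsat_loss(...) + 1 * aux_position_loss(...) + 1 * diversity_loss(...) + 2 * path_length(...)
```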
Table 1: Quantitative evaluation on Flickr-Landscape. Despite using a setting that is disadvantageous for InfinityGAN (discussed in Section 4.1), it still outperforms all baselines once the generation size is extended to 4x or larger. Furthermore, the user study shows that over 90% of preferences favor the InfinityGAN results. A preference of x% means that x% of selections prefer the results of the corresponding method over InfinityGAN. For our setting, images are first resized to 128x128 before being resized to 197x197.

Method | Image Size (Full / Train / Test, 8x) | FID (Train) | ScaleInvFID (1x / 2x / 4x / 8x) | Preference vs. Ours (4x / 8x) | Inference Memory
SinGAN | 128 / 128 / 1024 | 4.21 | 4.21 / 57.10 / 145.12 / 210.22 | 0.80% / 1.60% | O(size^2)
COCO-GAN | 128 / 32 / 1024 | 17.52 | 41.32 / 258.51 / 376.69 / 387.15 | 0% / 0% | O(1)
StyleGAN2+NCI | 128 / 128 / 1024 | 4.19 | 4.19 / 18.31 / 79.83 / 189.65 | 9.20% / 7.20% | O(size^2)
StyleGAN2+NCI (Patched) | 128 / 64 / 1024 | 5.35 | 21.06 / 58.84 / 165.65 / 234.19 | - / - | O(size^2)
StyleGAN2+NCI+PFG | 197 / 101 / 1576 | 86.76 | 90.79 / 126.88 / 211.22 / 272.80 | 0.40% / 1.20% | O(1)
InfinityGAN (Ours: StyleGAN2+NCI+PFG+GS) | 197 / 101 / 1576 | 11.03 | 21.84 / 28.83 / 61.41 / 121.18 | - / - | O(1)

Figure 4: Qualitative comparison (panels: COCO-GAN, SinGAN, StyleGAN2+NCI, StyleGAN2+NCI+PFG, and InfinityGAN (ours); each shows the generated full image at training size and the test-time extension to 1024x1024). InfinityGAN produces more favorable holistic appearances than related methods when tested at the extended size of 1024x1024. (NCI: non-constant input; PFG: padding-free generator.) More results are shown in Appendix E.

4 EXPERIMENTAL RESULTS

Datasets. We evaluate synthesis at extended image sizes on the Flickr-Landscape dataset, which consists of 450,000 high-quality landscape images crawled from the Landscape group on Flickr. For the image outpainting experiments, we compare against baseline methods on scenery-related subsets of Places365 (Zhou et al., 2017) (62,500 images) and Flickr-Scenery (Cheng et al., 2021) (54,710 images). Note that Flickr-Scenery is different from our Flickr-Landscape. For the outpainting task, we split the data into 80%, 10%, and 10% for training, validation, and testing, and conduct all quantitative and qualitative evaluations on the test set.

Hyperparameters. We use λar = 1, λdiv = 1, λR1 = 10, and λpath = 2 for all datasets. All models are trained on 101x101 patches cropped from 197x197 real images. Since InfinityGAN synthesizes odd-sized images, we choose 101, which retains enough resolution for humans to recognize the content, while 197 is the next output resolution obtained by stacking one more upsampling layer onto InfinityGAN and gives the 101x101 patches a sufficient field-of-view. We adopt the Adam (Kingma & Ba, 2015) optimizer with β1 = 0, β2 = 0.99, and a batch size of 16 for 800,000 iterations. More details are presented in Appendix C.

Metrics. We first evaluate the Fréchet Inception Distance (FID) (Heusel et al., 2017) at the training resolution of G. Then, without access to real images at larger sizes, we assume that a real landscape with a larger FoV shares a certain level of self-similarity with its smaller-FoV parts. We accordingly propose ScaleInvFID, which resizes larger images to the training data size with bilinear interpolation and then computes FID. We write Nx ScaleInvFID when the metric is evaluated on images Nx larger than the training samples.

Evaluated methods. We perform the evaluation on Flickr-Landscape with the following algorithms:

SinGAN.
We train an individual Sin GAN model for each image. The images at larger sizes are generated by setting spatially enlarged input latent variables. Note that we do not compare with the super-resolution setting from Sin GAN since we focus on extending the learned structure rather than super-resolve the high-frequency details. Published as a conference paper at ICLR 2022 COCO-GAN. Follow the Beyond-Boundary Generation protocol of COCO-GAN, we transfer a trained COCO-GAN model to extended coordinates with a post-training procedure. Style GAN2 (+NCI). We replace the constant input of the original Style GAN2 with a zl of the same shape, we call such a replacement as non-constant input (NCI) . This modification enables Style GAN2 to generate images at different output sizes with different zl sizes. 4.1 GENERATION AT EXTENDED SIZE. Additional (unfair) protocols for fairness. We adopt two additional preand post-processing to ensure that Infinity GAN does not take advantage of its different training resolution. To ensure Infinity GAN is trained with the same amount of information as other methods, images are first bilinear interpolated into 128 128 before resized into 197 197. Next, for all testing sizes in Table 4, Infinity GAN generates at 1.54 (=197/128) larger size to ensure final output images share the same Fo V with others. In fact, these corrections make the setting disadvantageous for Infinity GAN, as it is trained with patches of 50% Fo V, generates 54% larger images for all settings, and requires to composite multiple patches for its 1 Scale Inv FID. Quantitative analysis. For all the FID metrics in Table 1, unfortunately, the numbers are not directly comparable, since Infinity GAN is trained with patches with smaller Fo V and at a different resolution. Nevertheless, the trend in Scale Inv FID is informative. It reflects the fact that the global structures generated from the baselines drift far away from the real landscape as the testing Fo V enlarges. Meanwhile, Infinity GAN maintains a more steady slope, and surpasses the strongest baseline after 4 Scale Inv FID. Showing that Infinity GAN indeed performs favorably better than all baselines as the testing size increases. Figure 5: LSUN bridge and tower. Infinity GAN synthesize at 512 512 pixels. We provide more details and samples in Appendix H. Different local latent variables Different styles Figure 6: Diversity. Infinity GAN synthesizes diverse samples at the same coordinate with different local latent and styles. More samples are shown in Appendix I Qualitative results. In Figure 4, we show that all baselines fall short of creating reasonable global structures with spatially expanded input latent variables. COCO-GAN is unable to transfer to new coordinates when the extrapolated coordinates are too far away from the training distribution. Both Sin GAN and Style GAN2 implicitly establish image features based on position encoded by zero padding, assuming the training and testing position encoding should be the same. However, when synthesizing at extended image sizes, the inevitable change in the spatial size of the input and the features leads to drastically different position encoding in all model layers. Despite the models can still synthesize reasonable contents near the image border, where the position encoding is still partially correct, they fail to synthesize structurally sound content in the image center. Such a result causes Scale Inv FID to rapidly surge as the extended generation size increases to 1024 1024. 
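As a concrete reference for the metric used throughout this section, below is a minimal sketch of ScaleInvFID as defined in the Metrics paragraph above: samples generated at an extended size are bilinearly resized back to the training size before a standard FID computation. The `compute_fid` routine is assumed to be any existing FID implementation (its name here is hypothetical); only the resizing step is specific to the metric.

```python
import torch.nn.functional as F

def scale_inv_fid(fake_images, real_images, train_size, compute_fid):
    """ScaleInvFID sketch.

    fake_images: (N, 3, H, W) samples generated at an extended size,
                 e.g., H = W = 4 * train_size for the 4x setting.
    real_images: (M, 3, train_size, train_size) real training-size images.
    compute_fid: any standard FID routine taking (fake, real); assumed external.
    """
    fake_down = F.interpolate(
        fake_images, size=(train_size, train_size),
        mode="bilinear", align_corners=False)   # resize back to the training size
    return compute_fid(fake_down, real_images)
```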
Note that at the 16x setting, StyleGAN2 runs out of memory even with a batch size of 1 and cannot generate any result. In comparison, InfinityGAN achieves reasonable global structures with fine details. The 1024x1024 image from InfinityGAN is created by compositing 121 independently synthesized patches; with the ability to generate consistent pixel values (Section 3.4), the composition is guaranteed to be seamless. We provide more comparisons in Appendix E, a larger set of generated samples in Appendix F, results from models trained at a higher resolution in Appendix G, and a very-long synthesis result in Appendix J.

In Figure 5, we further conduct experiments on the LSUN bridge and tower datasets, demonstrating that InfinityGAN is applicable to other datasets. However, since the two datasets are object-centric with low view-angle variation in the vertical direction, InfinityGAN frequently fills the top and bottom areas with blank padding textures.

In Figure 6, we switch different zl and GT styles (i.e., zg projected with the mapping layer) while sharing the same c. More samples can be found in Appendix I. The results show that the structure and texture are disentangled and modeled separately by GS and GT. The figure also shows that GS can generate a diverse set of structures realized by different zl.

Figure 7: Spatial style fusion. We present a mechanism for fusing multiple styles together to make the generation results more diverse and interactive. The 512x4096 image fuses four styles across 258 independently generated patches.

Table 2: Outpainting performance. The combination of In&Out (Cheng et al., 2021) and InfinityGAN achieves state-of-the-art IS (higher is better) and FID (lower is better) on the image outpainting task.

Method | Places365 FID | Places365 IS | Flickr-Scenery FID | Flickr-Scenery IS
Boundless | 35.02 | 6.15 | 61.98 | 6.98
NS-outpaint | 50.68 | 4.70 | 61.16 | 4.76
In&Out | 23.57 | 7.18 | 30.34 | 7.16
In&Out + InfinityGAN | 9.11 | 6.78 | 15.31 | 7.19

Table 3: Inference speed-up with parallel batching. Benefiting from its spatially independent generation, InfinityGAN achieves up to a 7.20x inference speed-up with parallel batching at 8192x8192 pixels. The complete table can be found in Appendix P.

Method | Parallel Batch Size | # GPUs | Inference Time (seconds / image) | Speed-Up
StyleGAN2 | N/A | 1 | OOM | -
Ours | 1 | 1 | 137.44 | 1.00x
Ours | 128 | 8 | 19.09 | 7.20x

User study. We use two-alternative forced choice (2AFC) between InfinityGAN and each baseline on the Flickr-Landscape dataset. A total of 50 participants with basic knowledge of computer vision take part in the study, and we conduct 30 queries per participant. For each query, we show two separate grids of 16 random samples, one from each of the compared methods, and ask the participant to "select the one you think is more realistic and overall structurally plausible." As presented in Table 1, the user study shows over 90% preference for InfinityGAN against all baselines.

4.2 ABLATION STUDY: THE POSITIONAL INFORMATION IN GENERATOR

As discussed in Section 3.3, we hypothesize that StyleGAN2 relies heavily on the positional information from zero-paddings. In Table 1 and Figure 4, we perform an ablation by removing all paddings from StyleGAN2+NCI, yielding StyleGAN2+NCI+PFG, which has no positional information in the generator. The results show that StyleGAN2+NCI+PFG fails to generate reasonable image structures and degrades significantly in all FID settings.
Then, with the proposed GS, the positional information is properly provided from z S, and resumes the generator performance back to a reasonable state. 4.3 APPLICATIONS Spatial style fusion. Given a single global latent variable zg, the corresponding infinite-pixel image is tied to a single modal of global structures and styles. To achieve greater image diversity and allow the user to interactively generate images, we propose a spatial fusion mechanism that can spatially combine two global latent variables with a smooth transition between them. First, we manually define multiple style centers in the pixel space and then construct an initial fusion map by assigning pixels to the nearest style center. The fusion map consists of one-hot vectors for each pixel, forming a style assignment map. According to the style assignment map, we then propagate the styles in all intermediate layers. Please refer to Appendix L for implementation details. Finally, with the fusion maps annotated for every layer, we can apply the appropriate zg from each style center to each pixel using feature modulation.Note that the whole procedure has a similar inference speed as the normal synthesis. Figure 7 shows synthesized fusion samples. Outpainting via GAN Inversion. We leverage the pipeline proposed in In&Out (Cheng et al., 2021) to perform image outpainting with latent variable inversion. All loss functions follow the ones proposed in In&Out. We first obtain inverted latent variables that generates an image similar to the given real image via GAN inversion techniques, then outpaint the image by expanding zl and zn with their unit Gaussian prior. See Appendix K for implementation details. In Table 2, our model performs favorably against all baselines in image outpainting (Boundless (Teterwak et al., 2019), NS-outpaint (Yang et al., 2019), and In&Out (Cheng et al., 2021)). Published as a conference paper at ICLR 2022 Boundless NS-outpaint In&Out In&Out + Infinity GAN Figure 8: Outpainting long-range area. Infinity GAN synthesizes continuous and more plausible outpainting results for arbitrarily large outpainting areas. The real image annotated with red box is 256 128 pixels. Figure 9: Multi-modal outpainting. Infinity GAN can natively achieve multi-modal outpainting by sampling different local latents in the outpainted region. The real image annotated with red box is 256 128 pixels. We present more outpainting samples in Appendix M. Arbitrary-Length Cyclic Panorama Arbitrary-Length Inbetweening Arbitrary-Length Inbetweening Figure 10: Image inbetweening with inverted latents. The Infinity GAN can synthesize arbitrarylength cyclic panorama and inbetweened images by inverting a real image at different position. The top-row image size is 256 2080 pixels. We present more samples in Appendix N and Appendix O. As shown in Figure 8, while dealing with a large outpainting area (e.g., panorama), all previous outpainting methods adopt a sequential process that generates a fixed region at each step. This introduces obvious concatenation seams, and tends to produce repetitive contents and black regions after the multiple steps. In contrast, with Infinity GAN as the image prior in the pipeline of (Cheng et al., 2021), we can directly outpaint arbitrary-size target region from inputs of arbitrary shape. Moreover, in Figure 9, we show that our outpainting pipeline natively supports multi-modal outpainting by sampling different local latent codes in the outpainting area. Image inbetweening with inverted latent variables. 
We show another adaptation of outpainting with model inversion by setting two sets of inverted latent variables at two different spatial locations, then perform spatial style fusion between the variables. Please refer to Appendix K for implementation details. As shown in Figure 10, we can naturally inbetween (Lu et al., 2021) the area between two images with arbitrary distance. A cyclic panorama of arbitrary width can also be naturally generated by setting the same image on two sides. Parallel batching. The nature of spatial-independent generation enables parallel inference on a single image. As shown in Table 3, by stacking a batch of patches together, Infinity GAN can significantly speed up inference at testing up to 7.20 times. Note that this speed-up is critical for high-resolution image synthesis with a large number of FLOPs. 5 CONCLUSIONS In this work, we propose and tackle the problem of synthesizing infinite-pixel images, and demonstrate several applications of Infinity GAN, including image outpainting and inbetweening. Our future work will focus on improving Infinity GAN in several aspects. First, our Flickr-Landscape dataset consists of images taken at different Fo Vs and distances to the scenes. When Infinity GAN composes landscapes of different scales together, synthesized images may contain artifacts. Second, similar to the Fo V problem, some images intentionally include tree leaves on top of the image as a part of the photography composition. These greenish textures cause Infinity GAN sometimes synthesizing trees or related elements in the sky region. Third, there is still a slight decrease in FID score in comparison to Style GAN2. This may be related to the convergence problem in video synthesis (Tian et al., 2021), in which the generator achieves inferior performance if a preceding network (e.g., the motion module in video synthesis) is jointly trained with the image module. Published as a conference paper at ICLR 2022 6 ACKNOWLEDGEMENTS This work is supported in part by the NSF CAREER Grant #1149783 and a gift from Snap Inc. 7 ETHICS STATEMENT Our work follows the General Ethical Principles listed at ICLR Code of Ethics (https://iclr. cc/public/Code Of Ethics). The research in generative modeling is frequently accompanied by concerns about the misuse of manipulating or hallucinating information for improper use. Despite none of the proposed techniques aiming at improving manipulation of fine-grained image detail or hallucinating human activities, we cannot rule out the potential of misusing the framework to recreate fake scenery images for any inappropriate application. However, as we do not drastically alter the plausibility of synthesis results in the high-frequency domain, our research is still covered by continuing research in ethical generative modeling and image forensics. Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan++: How to edit the embedded images? In IEEE Conference on Computer Vision and Pattern Recognition, 2020. 3, 33 Yogesh Balaji, Mohammadmahdi Sajedi, Neha Mukund Kalibhat, Mucong Ding, Dominik St oger, Mahdi Soltanolkotabi, and Soheil Feizi. Understanding over-parameterization in generative adversarial networks. In International Conference on Learning Representations, 2021. 1 Urs Bergmann, Nikolay Jetchev, and Roland Vollgraf. Learning texture manifolds with the periodic spatial gan. 2017. 41 Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. 
pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, 2021. 3 Yinbo Chen, Sifei Liu, and Xiaolong Wang. Learning continuous image representation with local implicit image function. In IEEE Conference on Computer Vision and Pattern Recognition, 2021. 4 Yen-Chi Cheng, Chieh Hubert Lin, Hsin-Ying Lee, Sergey Tulyakov, Jian Ren, and Ming-Hsuan Yang. In&Out: Diverse image outpainting via gan inversion. ar Xiv preprint ar Xiv:2104.00675, 2021. 3, 6, 8, 9, 32 Terrance De Vries, Miguel Angel Bautista, Nitish Srivastava, Graham W Taylor, and Joshua M Susskind. Unconstrained scene generation with locally conditioned radiance fields. In IEEE International Conference on Computer Vision, 2021. 3 Alexei A Efros and Thomas K Leung. Texture synthesis by non-parametric sampling. In IEEE International Conference on Computer Vision, 1999. 2 Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, 2021. 2 Anna Fr uhst uck, Ibraheem Alhashim, and Peter Wonka. Tilegan: synthesis of large-scale nonhomogeneous textures. ACM Transactions on Graphics, 2019. 41 Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Neural Information Processing Systems, 2014. 5 Ishaan Gulrajani, Faruk Ahmed, Mart ın Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of wasserstein gans. In Neural Information Processing Systems, 2017. 1 Published as a conference paper at ICLR 2022 Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Neural Information Processing Systems, 2017. 6 Tobias Hinz, Matthew Fisher, Oliver Wang, and Stefan Wermter. Improved techniques for training single-image gans. In IEEE Winter Conference on Applications of Computer Vision, 2021. 41 Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In IEEE International Conference on Computer Vision, 2017. 4 Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017. 18 Nikolay Jetchev, Urs Bergmann, and Gokhan Yildirim. Copy the old or paint anew? an adversarial framework for (non-) parametric image stylization. Neural Information Processing Systems Workshops, 2018. 41 Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. 1, 2 Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In IEEE Conference on Computer Vision and Pattern Recognition, 2020. 2, 3, 4, 5, 33 Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. 6, 33 Hsin-Ying Lee, Hung-Yu Tseng, Qi Mao, Jia-Bin Huang, Yu-Ding Lu, Maneesh Kumar Singh, and Ming-Hsuan Yang. Drit++: Diverse image-to-image translation viadisentangled representations. International Journal of Computer Vision, pp. 1 16, 2020. 
4 Chieh Hubert Lin, Chia-Che Chang, Yu-Sheng Chen, Da-Cheng Juan, Wei Wei, and Hwann-Tzong Chen. COCO-GAN: Generation by parts via conditional coordinating. In IEEE International Conference on Computer Vision, 2019. 2 Andrew Liu, Richard Tucker, Varun Jampani, Ameesh Makadia, Noah Snavely, and Angjoo Kanazawa. Infinite nature: Perpetual view generation of natural scenes from a single image. In IEEE International Conference on Computer Vision, 2021. 3 Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. In European Conference on Computer Vision, 2018a. 3 Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. In Neural Information Processing Systems, 2018b. 4 Chia-Ni Lu, Ya-Chu Chang, and Wei-Chen Chiu. Bridging the visual gap: Wide-range image blending. In IEEE Conference on Computer Vision and Pattern Recognition, 2021. 9 Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. Mode seeking generative adversarial networks for diverse image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, 2019. 4 Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? In International Conference on Machine Learning, 2018. 1, 5 Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In IEEE Conference on Computer Vision and Pattern Recognition, 2019. 3 Published as a conference paper at ICLR 2022 Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, 2020. 3 Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Qureshi, and Mehran Ebrahimi. Edgeconnect: Structure guided image inpainting using edge prediction. In IEEE International Conference on Computer Vision Workshops, 2019. 3 Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In IEEE Conference on Computer Vision and Pattern Recognition, 2021. 3 Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning, 2017. 5 Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, 2016. 2 Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019. 3 Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In Neural Information Processing Systems, 2019. 2 Mark Sabini and Gili Rusak. Painting outside the box: Image outpainting with gans. ar Xiv preprint ar Xiv:1808.08483, 2018. 3 Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. Singan: Learning a generative model from a single natural image. In IEEE International Conference on Computer Vision, 2019. 2 Assaf Shocher, Shai Bagon, Phillip Isola, and Michal Irani. Ingan: Capturing and retargeting the dna of a natural image. 
In IEEE International Conference on Computer Vision, 2019. 2 Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. In Neural Information Processing Systems, 2020. 4 Ivan Skorokhodov, Grigorii Sotnikov, and Mohamed Elhoseiny. Aligning latent and image spaces to connect the unconnectable. In IEEE International Conference on Computer Vision, 2021. 2, 15 Matthew Tancik, Pratul P Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Neural Information Processing Systems, 2020. 4 Piotr Teterwak, Aaron Sarna, Dilip Krishnan, Aaron Maschinot, David Belanger, Ce Liu, and William T Freeman. Boundless: Generative adversarial networks for image extension. In IEEE International Conference on Computer Vision, 2019. 8 Yu Tian, Jian Ren, Menglei Chai, Kyle Olszewski, Xi Peng, Dimitris N. Metaxas, and Sergey Tulyakov. A good image generator is what you need for high-resolution video synthesis. In International Conference on Learning Representations, 2021. 9 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017. 4 Jonas Wulff and Antonio Torralba. Improving inversion and generation diversity in stylegan using a gaussianized latent space. ar Xiv preprint ar Xiv:2009.06529, 2020. 33 Wenqi Xian, Patsorn Sangkloy, Varun Agrawal, Amit Raj, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. Texturegan: Controlling deep image synthesis with texture patches. In IEEE Conference on Computer Vision and Pattern Recognition, 2018. 2 Published as a conference paper at ICLR 2022 Rui Xu, Xintao Wang, Kai Chen, Bolei Zhou, and Chen Change Loy. Positional encoding as spatial inductive bias in gans. In IEEE Conference on Computer Vision and Pattern Recognition, 2021. 16 Zongxin Yang, Jian Dong, Ping Liu, Yi Yang, and Shuicheng Yan. Very long natural scenery image prediction by outpainting. In IEEE International Conference on Computer Vision, 2019. 3, 8 Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Free-form image inpainting with gated convolution. In IEEE International Conference on Computer Vision, 2019. 3 Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conference on Computer Vision and Pattern Recognition, 2018. 33 Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. 6 Appendix Table of Contents A Comparisons with Concurrent Work ................................................................................. 15 B Conceptual Comparisons Among Different Tasks............................................................. 17 C Implementation Details of Coordinates.............................................................................. 17 D Infinity GAN Architecture .................................................................................................... 20 E More Comparison with baselines........................................................................................ 
21 F More Infinity GAN Qualitative Results............................................................................... 22 G Infinity GAN Results At A Higher Resolution.................................................................... 25 H More Results on Other Datasets.......................................................................................... 29 I More Diversity Visualization ............................................................................................... 31 J Our Best Attempt in Including An Everlasting Image...................................................... 32 K Implementation Details of Image Outpainting and Inbetweening via Inversion............ 32 L Implementation Details of Spatial Style Fusion................................................................. 34 M More Qualitative Results of Outpainting via Inversion .................................................... 36 N More Qualitative Results of Inbetweening via Inversion.................................................. 37 O More Qualitative Results of Cyclic Panoramic Inbetweening via Inversion................... 38 P Experimental Details of The Speed Benchmark with Parallel Batching......................... 39 Q Ablation: Feature Unfolding................................................................................................ 39 R Limitations............................................................................................................................. 40 S Implementation Illustration of Scale Inv FID..................................................................... 40 T More Comparisons with Texture Synthesis Method ......................................................... 41 U More Comparisons with Sin GAN-based Models............................................................... 41 V Ablation: Mode-Seeking Diversity Loss (Ldiv) .................................................................. 43 W Ablation: Auxiliary Loss (Lar)............................................................................................. 43 Published as a conference paper at ICLR 2022 A COMPARISONS WITH CONCURRENT WORK A.1 ALIS (SKOROKHODOV ET AL., 2021) ALIS is a concurrent work that achieves a similar application in infinite-pixel generation by iteratively inbetween pairs of anchor patches. Here, we discuss some of the critical differences. ALIS has limited ability in extending vertically. Infinity GAN achieves both vertical and horizontal extension while maintaining a plausible holistic appearance. ALIS only presents horizontal extension in the paper. Vertically connecting anchors on any dataset used in their paper will produce invalid structures (e.g., layered landscapes periodically stacking in the sky). Therefore, vertical anchor connection is limited to certain datasets, such as pattern-like textures or satellite images. Differences in the problem formulation. Infinity GAN directly models each infinite-pixel image with a shared global latent variable using an implicit function. In contrast, ALIS learns to inbetween two independent global latent variables. Furthermore, Infinity GAN can still achieve the inbetweening setup similar to ALIS with spatial style fusion. However, applying ALIS to synthesize images with a shared global context will lead to periodically repeating patches, as shown in Figure 13. Infinity GAN allows free-form anchor placements without training. ALIS has to designate a constant relative position between the anchors before training starts. 
It is also non-trivial for ALIS to inbetween multiple anchors. Meanwhile, as shown in Figure 12, our training-free spatial style fusion allows placing any number of anchors at any place. However, the flexibility also comes with a trade-off, our spatial style fusion is not trained with adversarial learning, causing the synthesis performance not compatible to the regularly synthesized images. Following the ALIS evaluation protocol, we train our Infinity GAN on the ALIS LHQ dataset at 256 256 resolution, yielding an FID 11.82 with regular synthesis and an -FID 19.22 with spatial style fusion. In contrast, ALIS shows negligible performance gap between FID (10.48) and -FID (10.64). Such a result suggest a flexibility-accuracy trade-off between Infinity GAN and ALIS. ALIS generates blocky and high-frequency lattice artifacts. ALIS adopts the generation-by-parts design from COCO-GAN. As shown in Figure 11, a critical consequence is creating the blocky and lattice artifacts between patches, since the inter-patch continuity is unstably maintained with adversarial learning. In contrast, Infinity GAN is an improved version of COCO-GAN that the interpatch continuity is guaranteed with implicit-function and padding-free generator design. ALIS still suffers from content repetition. The content repetition problem is discussed in the ALIS paper as a limitation. Infinity GAN addresses the issue with a local latent space and enforces the contribution of the local variables with a diversity loss (2) between zl and z S. We demonstrate that Infinity GAN does not have the content repetition problem in Figure 31. Blocky Artifacts Discontinuity Lattice Artifacts Figure 11: ALIS can suffer from blocky artifacts, inter-patch discontinuity and lattice artifacts. We train ALIS with the official implementation at 1024 1024 resolution. We focus on the failure cases caused by COCO-GAN-based generation-by-parts framework, which artificially enforces the learned inter-patch continuity with adversarial learning. Published as a conference paper at ICLR 2022 Figure 12: Infinity GAN with free-form anchor placements. The spatial style fusion of Infinity GAN can place any number of style centers (called anchors in ALIS) at any location. We show case (top) a five-anchor example and (bottom) the corresponding regions covered by each style center. Figure 13: ALIS cannot synthesis with a single holistic appearance. ALIS synthesizes repetitive content if all anchors share the same latent vector. A.2 MS-PIE (XU ET AL., 2021). MS-PIE found the positional information is crucial for GANs training. The paper explores different positional encoding schema, including sinusoidal coordinate encoding and padding removal. We further discuss the relation and distinction between MS-PIE and Infinity GAN in these two modules. Extensible coordinate encoding. The expand configuration of MS-PIE allows coordinate value extrapolation in the spatial dimension. However, its training sticks to a fixed coordinate matrix for each synthesis scale, and the framework aligns the real images to the fixed coordinate matrix with bilinear interpolation 1 during training. Such a design causes the synthesized content to attach to the coordinate matrix, thus inevitably creates repetitive content when the coordinate value periodically repeats as the matrix expanding. In Figure 14, we train a no-padding Style GAN2 model with MSPIE-expand setting (i.e., config (k) in Table 5 of MS-PIE paper) on our Flickr-Landscape dataset. 
The model is trained at scales 256, 384 and 512, then test at scale 1408. 1The corresponding official implementation: https://github.com/open-mmlab/mmgeneration/blob/ 95f962e54815b8f72c015c134cd597e9eff3de36/mmgen/models/gans/mspie_stylegan2.py#L115. Published as a conference paper at ICLR 2022 Figure 14: MS-PIE creates repetitive content in the expand setting. We train the no-padding Style GAN2 with MS-PIE at 256, 384 and 512 scales, then synthesize at 1408 scale. Distinctions between no-padding and padding-free generator. The no-padding generator (NPG) in MS-PIE is conceptually related to our padding-free generator (PFG), but fundamentally different while considering how the information of border pixels in the feature space is processed. NPG gradually loses the information from border pixels after each convolution, since the border pixels are less visited than the other pixels (consider a 3 3 convolutional kernel, most pixels will be scanned for 9 times, while edge pixels will be only scanned for 4 times, and corner pixels only 1 time). Further note that the information loss worsens exponentially as the network stacking more convolutional layers. In contrast, our PFG pads feature values from the neighbor context before the no-padding convolution is applied. The information loss caused by the no-padding convolution is a natural way to discard the information that is spatially too far away from and less related to the current context. B CONCEPTUAL COMPARISONS AMONG DIFFERENT TASKS Training-Data Distribution Further Extended Resolution Input Condition (a) Super Resolution (b) Texture Synthesis (includes Sin GAN) (c) Image Extrapolation (d) COCO-GAN (e) Infinity GAN (Only few steps) Figure 15: (a) Super Resolution: The final outputs inherit the coarse structure from and share the same field-of-view with the original input condition. (b) Texture Synthesis: Due to coordinate encoding, objects are generated near image the border, and the center of the image is filled with repetitive textures. (c) Image Extrapolation: Current extrapolation models tend to copy-and-paraphrase the conditional input or create mottled textural samples, leading to repetitive results especially when the outpainted region is large. (d) COCO-GAN: COCO-GAN can only synthesize samples slightly larger than its training distribution. (e) Infinity GAN: Ours Infinity GAN can synthesize a more favorable global structure at arbitrary resolutions without an input condition. C IMPLEMENTATION DETAILS OF COORDINATES We first derive the receptive field size R of an L-layer GS after adding the 7 7 feature unfolding to all layers. Assume that the size of z S (i.e., output of GS, and input of GT) is M. The value of R can be derived as M +(2 3) L, where 3 is the half-size of the feature unfolding area. In practice, with Published as a conference paper at ICLR 2022 an architecture shown in Figure 18 with training patch size 101 101, we have M = 11, L = 4, and R = 35. Horizontal direction. In order to avoid the generator discovering and exploiting any property of the periodic coordinate, we use a period T much larger than R. Meanwhile, the period should be short enough to avoid the differences between consecutive coordinates vanish to almost zero. In practice, we use T = 4 R for both of the cosine and sine coordinates. Vertical direction. We utilize the property of the tanh function that its slope rapidly saturates to nearly zero and its value range is bounded by [ 1, 1]. 
Vertical direction. We utilize the property of the tanh function that its slope rapidly saturates to nearly zero and its value range is bounded by [−1, 1]. Such a property aligns with real-world scenery: the structural interactions among landscape objects are mostly concentrated around the horizon and rapidly saturate to a single mode (i.e., sky, ground, or water) in the vertical direction.

There are two hyperparameters when constructing the tanh coordinates: (i) a pair of cutoff points, and (ii) the sampling period of zl in the spatial dimension.

Cutoff points (ccut) are a pair of values such that, during training, we only sample coordinates between them. This hyperparameter is required since we cannot sample all coordinates from an infinitely large coordinate system within a finite number of training steps. In practice, we set the cutoff points at 0.995 of the tanh-projected coordinate. Note that the underlying effect of using different cutoff values is not well investigated; we do not observe obvious changes with slightly different values.

Sampling period (dl) defines the distance at which two spatially consecutive values of zl are sampled within a training sample. This value relates to the granularity of the representation that the generator models overall. It is equivalent to the occupation ratio of zl between the cutoff points. Thus, we can alternatively define a hyperparameter V such that the occupation ratio of zl is R/(R+V). The sampling period can then be derived as dl = 2·ccut / (R+V) (with ccut measured in the raw coordinate before the tanh projection). In practice, we use V = 10, resulting in dl ≈ 0.13.

C.1 MORE DISCUSSIONS ON THE CHOICE OF COORDINATE SYSTEM

In this paper, we mainly tackle scenery image datasets, such as landscape, LSUN bridge, and LSUN tower. The coordinate prior we introduced in Section 3.2 is specifically designed for such types of data, which exhibit self-similarity in the horizontal direction and rapid mode saturation in the vertical direction. However, we emphasize that the choice of coordinate system is a hyperparameter that depends on the dataset. For instance, in Figure 16, we show that InfinityGAN can also work on a satellite image dataset (Isola et al., 2017), which is more frequently used in the texture synthesis task. A more natural choice of coordinate system in such a setting is to use periodic coordinates in both the horizontal and vertical directions. Nevertheless, in Figure 17 we observe that InfinityGAN can still produce visually plausible and globally sound appearances using saturating (i.e., tanh) coordinates in the vertical direction and periodic coordinates in the horizontal direction. In Table 4, we identify three categories of coordinate systems, but these options may not cover all possibilities. One needs to be aware of the characteristics of the data and the intended outcome before selecting an appropriate coordinate system.

  Dataset type                       Object-centric     | Texture                    | Scenery
  Example                            ImageNet, CelebA   | Texture, satellite images  | Landscape
  Spatial distribution of content    N/A                | Spatially agnostic         | Spatially varying (vertical), spatially agnostic (horizontal)
  Horizontal coordinate              Constant           | Periodic (e.g., sin/cos)   | Periodic (e.g., sin/cos)
  Vertical coordinate                Constant           | Periodic (e.g., sin/cos)   | Saturating (e.g., tanh)

Table 4: Possible coordinate designs for different types of datasets. The choice of coordinate system for InfinityGAN is a dataset-dependent hyperparameter.
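As a concrete illustration of the scenery setting in Table 4 (periodic horizontal, saturating vertical), a 3-channel coordinate grid for a patch could be assembled as follows. The function below is our own sketch rather than the released implementation; the period T, the sampling period dl, and the 3-channel layout are taken from the values and architecture described above.

import torch

def scenery_coord_grid(h_idx, w_idx, height, width, d_l=0.13, T=140):
    # Build a (3, height, width) coordinate grid for a patch whose top-left
    # local-latent index is (h_idx, w_idx): a tanh (saturating) vertical channel
    # and sin/cos (periodic) horizontal channels. Illustrative sketch only.
    ys = (h_idx + torch.arange(height, dtype=torch.float32)) * d_l  # raw vertical coordinate
    xs = w_idx + torch.arange(width, dtype=torch.float32)           # horizontal index

    vert = torch.tanh(ys)                    # saturates away from the horizon
    phase = 2 * torch.pi * xs / T
    horiz_sin, horiz_cos = torch.sin(phase), torch.cos(phase)  # periodic, never saturates

    c = torch.stack([
        vert.unsqueeze(1).expand(height, width),
        horiz_sin.unsqueeze(0).expand(height, width),
        horiz_cos.unsqueeze(0).expand(height, width),
    ], dim=0)
    return c  # matches the 3-channel coordinate input c in Figure 18

c = scenery_coord_grid(h_idx=-17, w_idx=0, height=35, width=35)
print(c.shape)  # torch.Size([3, 35, 35])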
Figure 16: Qualitative results on the satellite image dataset (panels: real images, and generated results at 1×, 2×, and 4×). InfinityGAN trained on the satellite image dataset with cyclic coordinates in both the vertical and horizontal directions.

Figure 17: Qualitative results on the satellite image dataset (panels: real images, and generated results at 1×, 2×, and 4×). InfinityGAN trained on the satellite image dataset with saturating coordinates (i.e., tanh) in the vertical direction and cyclic coordinates in the horizontal direction.

D INFINITYGAN ARCHITECTURE

Figure 18: A high-level overview of the InfinityGAN model architecture. The structure synthesizer GS consists of four Modulated CoordConv2d layers (kernel 7×7, stride 1, no padding) that map the local latent zl (B×256×35×35), the coordinate grid c (B×3×35×35), and the global latent zg (B×512) to zS (B×256×11×11). The texture synthesizer GT transforms zg through a mapping network of 8 linear layers and applies a stack of styled residual blocks (kernel 3×3, no padding) to zS to produce the 101×101 RGB patch pc. The discriminator stacks DResidual blocks (kernel 3×3, with padding; the final block appends a standard-deviation channel), followed by convolutional and linear heads that output the adversarial and auxiliary predictions.
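The padding-free stack summarized above is what allows independently generated patches to tile seamlessly. The property can be verified with an ordinary unpadded convolution: outputs computed from overlapping input slices tile exactly into the output computed from the full input. The snippet below is a minimal, self-contained demonstration of this property, not the InfinityGAN code itself.

import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=0)  # valid (no-padding) convolution
x = torch.randn(1, 3, 16, 32)                     # a "full" feature map

full = conv(x)                                    # (1, 8, 14, 30)

# Generate the same output in two horizontal patches. Each input slice carries
# a 2-pixel margin (kernel_size - 1), so the patch outputs line up exactly.
left = conv(x[..., :, :17])                       # (1, 8, 14, 15)
right = conv(x[..., :, 15:])                      # (1, 8, 14, 15)
tiled = torch.cat([left, right], dim=-1)

assert torch.allclose(tiled, full, atol=1e-6)     # seamless, no post-hoc blending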
Figure 19: The low-level design of each module within InfinityGAN. The diagram details: the Modulated CoordConv2d (kernel 7×7, no padding), which concatenates zl (or the incoming feature) with the coordinate grid c along the channel dimension and applies a convolution whose weights are modulated and demodulated by the style, shrinking an H×W input to (H−6)×(W−6); the Styled Residual Block (kernel 3×3, no padding), composed of a Styled Upsample Conv (a stride-2 transposed convolution with weight modulation/demodulation and a 3×3 Gaussian blur), a Styled Conv (with noise injection and LeakyReLU) conditioned on the style vector zT, and a Styled Skip branch (center crop, linear upsampling, and element-wise sum); and the DResidual Block used in the discriminator (kernel 3×3, with padding), which combines stride-2 convolutions with an element-wise residual addition to halve the spatial resolution.

E MORE COMPARISON WITH BASELINES

Figure 20: More qualitative comparisons (SinGAN, StyleGAN2 + NCI, StyleGAN2 + NCI + FCG, and InfinityGAN (ours)). We show more samples on Flickr-Landscape at 1024×1024 pixels.

F MORE INFINITYGAN QUALITATIVE RESULTS

Figure 21: More qualitative results. We provide more images synthesized at 1024×1024 pixels with our InfinityGAN trained on Flickr-Landscape. All images are synthesized with the same model presented in the paper, which is trained with 101×101 patches cropped from 197×197-resolution real images. All images share the same coordinates and present a high structural diversity. Note that the images are down-sampled 2× to reduce file size.

Figure 22: More qualitative results. We provide more images synthesized at 1024×1024 resolution with our InfinityGAN trained on Flickr-Landscape. All images are synthesized with the same model presented in the paper, which is trained with 101×101-resolution patches cropped from 197×197-resolution real images. All images share the same coordinates and present a high structural diversity. Note that the images are down-sampled 2× to reduce file size.

Figure 23: More qualitative results. We provide more images synthesized at 1024×1024 resolution with our InfinityGAN trained on Flickr-Landscape. All images are synthesized with the same model presented in the paper, which is trained with 101×101-resolution patches cropped from 197×197-resolution real images. All images share the same coordinates and present a high structural diversity.
Note that the images are down-sampled 2× to reduce file size.

G INFINITYGAN RESULTS AT A HIGHER RESOLUTION

Figure 24: InfinityGAN samples when training at a higher resolution. We synthesize 4096×4096-pixel images using InfinityGAN trained on Flickr-Landscape with 397×397-pixel patches cropped from 773×773 full images. The top and bottom rows are zoom-in views of the image. Note that the figure is down-sampled 2× to reduce file size.

Figure 25: InfinityGAN samples when training at a higher resolution. We synthesize 4096×4096-pixel images using InfinityGAN trained on Flickr-Landscape with 397×397-pixel patches cropped from 773×773 full images. The top and bottom rows are zoom-in views of the image. Note that the figure is down-sampled 4× to reduce file size.

Figure 26: InfinityGAN samples when training at a higher resolution. We synthesize 4096×4096-pixel images using InfinityGAN trained on Flickr-Landscape with 397×397-pixel patches cropped from 773×773 full images. The top and bottom rows are zoom-in views of the image. Note that the figure is down-sampled 4× to reduce file size.

Figure 27: InfinityGAN samples when training at a higher resolution. We synthesize 4096×4096-pixel images using InfinityGAN trained on Flickr-Landscape with 397×397-pixel patches cropped from 773×773 full images. The top and bottom rows are zoom-in views of the image. Note that the figure is down-sampled 4× to reduce file size.

H MORE RESULTS ON OTHER DATASETS

Figure 28: LSUN bridge category. InfinityGAN synthesis results at 512×512 pixels on the LSUN bridge category. The model is trained with 101×101-pixel patches cropped from 197×197-resolution real images.

Figure 29: LSUN tower category. InfinityGAN synthesis results at 512×512 pixels on the LSUN tower category. The model is trained with 101×101-pixel patches cropped from 197×197-resolution real images.

I MORE DIVERSITY VISUALIZATION

Figure 30: Generation diversity (different styles; different local latent variables). We show that the structure synthesizer and the texture synthesizer separately model structure and texture by changing either the local latents or the style while all other variables are fixed. The results also show that InfinityGAN can synthesize a diverse set of landscape structures at the same coordinates. All samples are synthesized at 389×389 pixels with InfinityGAN trained at 101×101.

J OUR BEST ATTEMPT AT INCLUDING AN EVERLASTING IMAGE

Figure 31: We provide a 256×9984-pixel sample synthesized with InfinityGAN. The sample shows that (a) our InfinityGAN can generalize to arbitrarily large sizes, and (b) the synthesized contents do not self-repeat while using the same global latent variable zg.

K IMPLEMENTATION DETAILS OF IMAGE OUTPAINTING AND INBETWEENING VIA INVERSION

Our pipeline is similar to that of In&Out (Cheng et al., 2021). Given an image x, the objective of GAN-model inversion is to recover a set of generator-parameter-dependent input latent variables z* that can synthesize a resulting image x* similar to x.
There exist multiple implementations to recover z*; we adopt the gradient-descent-based method, which optimizes z* as a set of learnable parameters with carefully designed objective functions. In the context of InfinityGAN, we optimize four groups of variables: zg, zl, the style (zT), and zn. In particular, we use the W+ space formulation (Abdal et al., 2020; Wulff & Torralba, 2020) for zT, denoted z+T, where each layer in GT has its own set of zTi and each zTi is optimized separately.

Objective functions. We first introduce two image-distance losses to the inversion objectives:

    L_pix = ‖x* − x‖_2 ,    L_percept = LPIPS(x*, x) ,    (7)

where LPIPS is the learned perceptual distance proposed by Zhang et al. (2018). We utilize the Gaussianized latent space technique proposed by Wulff & Torralba (2020), which uses a LeakyReLU with slope 5 to undo the last activation function (a LeakyReLU with slope 0.2) in the StyleGAN2 mapping network, recovering a Gaussian-like marginal distribution of z+T. With this Gaussian prior, we can use the empirical mean µ+T and covariance matrix Σ+T (computed by sampling 10,000 z+T via zg) to estimate the z+T distribution. With these empirical statistics, Wulff & Torralba (2020) propose to compute the Mahalanobis distance:

    d_M(z, µ, Σ) = (z − µ)^T Σ^{−1} (z − µ) .    (8)

We then construct a prior loss L_prior (Wulff & Torralba, 2020) that regularizes zg, zl, and z+T with the Mahalanobis distance:

    L_prior = λ_α d_M(zg, 0, I) + λ_β d_M(zl, 0, I) + λ_γ d_M(z+T, µ+T, Σ+T) ,    (9)

where λ_α, λ_β, and λ_γ are weighting factors. For zg and zl, which have zero mean and unit variance, the prior loss degenerates to an ℓ2 loss. Following StyleGAN2, we adopt a noise regularization loss L_nreg and noise renormalization. The full objective function of the inversion is:

    L_inv = λ_pix L_pix + λ_percept L_percept + L_prior + λ_nreg L_nreg ,    (10)

where the λ's are the weighting factors of the loss terms.

Hyperparameters. For all tasks and datasets, we set λ_pix = 10, λ_percept = 10, λ_nreg = 1,000, λ_α = 10, λ_β = 10, and λ_γ = 0.01. We use the Adam (Kingma & Ba, 2015) optimizer with a learning rate annealed (Karras et al., 2020) from 0.1 to 0 over 1,000 iterations. We use a batch size of 1 to avoid batch samples interfering with each other. Although we observe that batched inversion can sometimes yield superior results, batched inputs are rarely available in real-world applications, and batching also significantly increases the stochasticity when reproducing the results.
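Putting Eqs. (7)–(10) and the hyperparameters above together, a single gradient-based inversion step can be sketched as follows. This is an illustrative outline only: G, lpips, and noise_reg are stand-ins for the generator, the LPIPS metric, and the StyleGAN2 noise regularizer, and mu_T / sigma_T_inv are the empirical statistics of z+T; none of this is the released implementation.

import torch

def mahalanobis(z, mu, sigma_inv):
    # Squared Mahalanobis distance of Eq. (8) for a flattened latent vector.
    d = (z - mu).reshape(1, -1)
    return (d @ sigma_inv @ d.t()).squeeze()

def inversion_step(G, lpips, noise_reg, latents, x, mu_T, sigma_T_inv, optimizer,
                   l_pix=10.0, l_percept=10.0, l_nreg=1000.0,
                   l_alpha=10.0, l_beta=10.0, l_gamma=0.01):
    z_g, z_l, z_T_plus, z_n = latents          # the four optimized groups of variables
    x_hat = G(z_g, z_l, z_T_plus, z_n)         # reconstruction from the current latents

    loss_pix = (x_hat - x).pow(2).mean()       # pixel-space distance, Eq. (7)
    loss_percept = lpips(x_hat, x)             # LPIPS perceptual distance, Eq. (7)

    # Prior loss, Eq. (9). With zero mean and unit variance, the Mahalanobis terms
    # for z_g and z_l degenerate to l2 penalties, as noted in the text.
    loss_prior = (l_alpha * z_g.pow(2).sum()
                  + l_beta * z_l.pow(2).sum()
                  + l_gamma * mahalanobis(z_T_plus, mu_T, sigma_T_inv))

    loss_nreg = noise_reg(z_n)                 # noise regularization

    # Full objective, Eq. (10).
    loss = l_pix * loss_pix + l_percept * loss_percept + loss_prior + l_nreg * loss_nreg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)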
Outpainting and inbetweening with inverted latent variables. With the inverted latent variables, we perform image outpainting by spatially extending z*l with samples from its unit Gaussian prior, while using z*g and z+T everywhere. For image inbetweening, z*l is also extended with its unit Gaussian prior, while z*g and z+T are fused with spatial style fusion. Note that we do not optimize c in our pipeline, since inverting c is non-trivial and requires additional regularization losses. This introduces a limitation: the spatial position of the image is fixed after inversion, and users have to run the inversion optimization again if they want to assign z*l to a different location. Another limitation is that, despite the prior loss, some dimensions of the inverted latents tend to drift far away from the normal distribution. In combination with the use of the W+ space for zT, the inverted latents are highly unstable, sometimes introducing checkerboard-like artifacts and frequently mixing multiple irrelevant contexts together. We develop an interactive tool (released along with the code) that allows users to interactively and regionally resample undesired local latent variables zl in the outpainting area. Since this limitation is closely related to the generalization of the inverted latents, we consider it an important direction for future work.

L IMPLEMENTATION DETAILS OF SPATIAL STYLE FUSION

Figure 32: Illustration of the fusion map creation procedure and spatial fusion generation. With a toy architecture example shown in (a) and a style fusion map in the pixel space (bottom of (b)), we can reversely create spatially aligned fusion maps in all intermediate layers by padding or interpolating the fusion map from the previous layer. Spatial style fusion in (c) uses the fusion maps to synthesize images with a natural style transition in the pixel space.

To achieve spatial style fusion in the existing InfinityGAN pipeline, we introduce two additional procedures: style fusion map creation and fused modulation-demodulation. The former creates per-layer style fusion maps that specify the geometry of the style fusion area. The latter is a modified version of feature modulation-demodulation that processes the volumetric styles created from the style fusion maps.

Style fusion map creation. Given N style centers designated by the user in the pixel space, the goal of style fusion map creation is to construct a set of style fusion maps, one for each layer of both GS and GT. A fusion map is a spatially shaped tensor (i.e., batch×N×H×W) with N channels that specifies the weight of each style at each spatial location, where the weights sum to one across the N dimension at every spatial position. We first construct an initial fusion map in the pixel space by finding the spatially nearest style center and assigning a one-hot label to each spatial position. Since spatial style fusion happens in all layers of the generator, we then reversely propagate the fusion map from the output of the generator to its input; we call this procedure fusion map calibration. We illustrate fusion map calibration in Figure 32(b). The calibration starts from the image space and sequentially backward-constructs the fusion maps for all generator layers. For each pair of fusion maps in a network layer (output-side and input-side), we match the spatial dimensions by padding or interpolating the output fusion map into a spatially aligned input fusion map. For different types of intermediate layers, the underlying implementation of the calibration can differ slightly, but the shared principle is to maintain a consistent geometrical position of the style fusion centers throughout the generator. In practice, such a binary map creates a sharp style transition that produces visible straight lines dividing the style regions. Accordingly, we apply a mean filter that smooths the style transition border (see the sketch below). While different kernel sizes for the mean filter only alter the range of the style transition and the visual smoothness, we use a kernel size of 127 in our experiments as it empirically produces good visual results.
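The pixel-space fusion map construction described above can be sketched as follows. This is our simplified illustration of the procedure (nearest style center, one-hot assignment, mean-filter smoothing), not the released code; make_fusion_map and its arguments are names we introduce for the example.

import torch
import torch.nn.functional as F

def make_fusion_map(centers, height, width, kernel_size=127):
    # centers: list of (y, x) style-center positions in pixel space.
    # Returns a (1, N, height, width) map whose weights sum to one per pixel.
    ys = torch.arange(height, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, -1)
    # Squared distance from every pixel to every style center: (N, H, W)
    dists = torch.stack([(ys - cy) ** 2 + (xs - cx) ** 2 for cy, cx in centers])

    # One-hot assignment to the nearest center, shaped (1, N, H, W).
    onehot = F.one_hot(dists.argmin(dim=0), num_classes=len(centers))
    fmap = onehot.permute(2, 0, 1).unsqueeze(0).float()

    # Mean filter to smooth the hard style boundary (kernel size 127 in practice).
    pad = kernel_size // 2
    kernel = torch.ones(len(centers), 1, kernel_size, kernel_size) / kernel_size ** 2
    fmap = F.conv2d(F.pad(fmap, [pad] * 4, mode="replicate"),
                    kernel, groups=len(centers))
    return fmap / fmap.sum(dim=1, keepdim=True)  # re-normalize to sum to one

fmap = make_fusion_map([(64, 64), (64, 448)], height=128, width=512)
print(fmap.shape)  # torch.Size([1, 2, 128, 512])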
Fused modulation-demodulation. After constructing the per-layer style fusion maps, we can use them to create volumetric styles (i.e., batch×D×H×W) by taking, at each spatial position, a weighted sum of the N styles with the importance weights along the D-channel dimension. The volumetric styles are applied to each layer of the generator. As the feature modulation-demodulation strategy used in both StyleGAN2 and InfinityGAN is a pixel-wise operator, we can easily adapt it to volumetric styles. We demonstrate a possible implementation2 of the fused modulation-demodulation in Figure 33.

2The forward function is based on the implementation from https://github.com/rosinality/stylegan2-pytorch.

import torch
import torch.nn.functional as F


def forward(self, feature, style):
    """
    feature: Feature with shape (B, C1, H, W)
    style  : Single style with shape (B, C2)
    """
    batch, in_c, in_h, in_w = feature.shape

    # Hyperparameters
    k = self.kernel_size      # Conv kernel size
    out_c = self.out_c        # Expected output channel
    rmpad = k // 2            # Zero-padding removal

    # Weight scaling (StyleGAN2)
    # (1,) * (out_c, in_c, k, k) => (out_c, in_c, k, k)
    weight = self.scale * self.weight

    # Weight modulation (StyleGAN2)
    style = self.modulation(style)
    style = style.view(batch, 1, in_c, 1, 1)
    # (1, out_c, in_c, k, k) * (batch, 1, in_c, 1, 1) => (batch, out_c, in_c, k, k)
    weight = weight.unsqueeze(0) * style

    # Weight demodulation (StyleGAN2)
    demod = torch.rsqrt(weight.pow(2).sum([2, 3, 4]) + 1e-8)
    weight = weight * demod.view(batch, out_c, 1, 1, 1)

    # Convolution (grouped so each sample uses its own modulated weights)
    feature = feature.view(1, batch * in_c, in_h, in_w)
    if self.upsample:
        weight = weight.view(batch, out_c, in_c, k, k)
        weight = weight.transpose(1, 2).reshape(batch * in_c, out_c, k, k)
        out = F.conv_transpose2d(
            feature, weight, padding=0, stride=2, groups=batch)

        # Clipping zero padding (ConvTranspose special case)
        out = out[:, :, rmpad:-rmpad, rmpad:-rmpad]
        out = self.blur(out)  # StyleGAN2 Gaussian blur
    else:
        weight = weight.view(batch * out_c, in_c, k, k)
        out = F.conv2d(feature, weight, padding=0, groups=batch)

    # Recover batch-channel shape due to grouping
    _, _, out_h, out_w = out.shape
    out = out.view(batch, out_c, out_h, out_w)
    return out


def fused_forward(self, feature, style):
    """
    feature: Feature with shape (B, C1, H, W)
    style  : Fusion style with shape (B, C2, H, W)
    """
    batch, in_c, in_h, in_w = feature.shape
    st_c = style.shape[1]

    # Hyperparameters
    k = self.kernel_size      # Conv kernel size
    out_c = self.out_c        # Expected output channel
    rmpad = k // 2            # Zero-padding removal

    # Weight scaling (StyleGAN2)
    weight = self.scale * self.weight

    # Weight modulation (casted to the feature side)
    # The following two forms are equivalent:
    #   - conv(in=feature, w=weight*style*demod)
    #   - conv(in=feature*style, w=weight) * demod
    # StyleGAN2 uses the former for speed.
    style = style.permute(0, 2, 3, 1).reshape(-1, st_c)
    style = self.modulation(style)
    style = style.view(batch, in_h, in_w, in_c)
    style = style.permute(0, 3, 1, 2)
    feature = style * feature  # (B, C, H, W)

    # Weight demodulation (approximated)
    # Feature demodulation uses patch statistics. The approximation here is
    # similar to a mean of the statistics from all styles.
    demod = torch.zeros(batch, out_c, in_h, in_w)
    for i in range(in_h):
        for j in range(in_w):
            style_v = style[:, :, i, j].view(batch, 1, in_c, 1, 1)
            style_v = weight.unsqueeze(0) * style_v
            # style_v shape: (B, out_c, in_c, k, k)
            demod[:, :, i, j] = torch.rsqrt(
                style_v.pow(2).sum([2, 3, 4]) + 1e-8)

    # Convolution (all features share the same weight, so no grouping is needed)
    if self.upsample:
        weight = weight.view(out_c, in_c, k, k)
        weight = weight.transpose(0, 1).contiguous()
        out = F.conv_transpose2d(
            feature, weight, padding=0, stride=2, groups=1)

        # Clipping zero padding (ConvTranspose special case)
        out = out[:, :, rmpad:-rmpad, rmpad:-rmpad]

        # Late demodulation (match output shape)
        demod = F.interpolate(
            demod, size=(out.shape[-2], out.shape[-1]),
            mode="bilinear", align_corners=True)
        out = out * demod

        out = self.blur(out)  # StyleGAN2 Gaussian blur
    else:
        out = F.conv2d(feature, weight, padding=0, groups=1)
        demod = demod[:, :, rmpad:-rmpad, rmpad:-rmpad]
        out = out * demod

    out = out.contiguous()
    return out

Figure 33: Implementation of spatial style fusion. We present (left) the original StyleGAN2 forward function and (right) a corresponding implementation for the spatial style fusion; related code blocks on the two sides are aligned.

M MORE QUALITATIVE RESULTS OF OUTPAINTING VIA INVERSION

Figure 34: More outpainting via model inversion. We present more outpainting results from InfinityGAN on Flickr-Scenery. We invert the latent variables from 256×128-pixel real images (marked with a red box), then outpaint a 256×640 area (5× the real image size).

N MORE QUALITATIVE RESULTS OF INBETWEENING VIA INVERSION

Figure 35: More image inbetweening with model inversion. By inverting the latent variables that reconstruct the two real images on the two sides (marked with red boxes), InfinityGAN can naturally inbetween two images that are arbitrarily far apart. We synthesize the 256×1280 images using InfinityGAN trained on Flickr-Scenery at 101×101 pixels.

O MORE QUALITATIVE RESULTS OF CYCLIC PANORAMIC INBETWEENING VIA INVERSION

Figure 36: More cyclic panoramas synthesized with model inversion. By setting the same real image on both sides (marked with red boxes) and inverting the latent variables that reconstruct it, InfinityGAN can naturally synthesize horizontally cyclic panoramic images via image inbetweening. We synthesize the 256×1280 images using InfinityGAN trained on Flickr-Scenery at 101×101 pixels.

P EXPERIMENTAL DETAILS OF THE SPEED BENCHMARK WITH PARALLEL BATCHING

We perform all experiments on a workstation with an Intel Xeon CPU (E5-2650, 2.20 GHz) and 8 GTX 2080 Ti GPUs. We implement our framework with PyTorch 1.6, and execute in an environment with Nvidia driver version 440.44, cuDNN version 4.6.5, and CUDA version 10.2.89.
We report the sum of the pure GPU execution time and the data scatter-gather time introduced by data parallelism. The model synthesizes a single image in each trial. We first warm up the GPUs with 10 preceding trials without recording their statistics, then compute the mean and variance over 100 trials. The numbers are reported in Table 5.

Table 5: Inference speed-up with parallel batching. Benefiting from its spatially independent generation, InfinityGAN achieves up to a 7.20× inference speed-up with parallel batching. All experiments synthesize a single image per trial, and OOM indicates out-of-memory. GPU time (sec/image, mean ± std) accounts for pure GPU execution time and (if applicable) data-parallel scatter-gather time; the speed-up is measured at 8192×8192 and MFLOPs at 1024×1024.

  Method / generation paradigm                 Parallel batch  #GPUs  1024×1024     2048×2048    4096×4096     8192×8192      Speed-up  MFLOPs
  StyleGAN2, one-shot                          -               1      0.60 ± 0.01   OOM          OOM           OOM            -         6,642
  InfinityGAN (ours), one-shot                 -               1      0.67 ± 0.01   OOM          OOM           OOM            -         6,815
  InfinityGAN (ours), spatially independent    1               1      1.24 ± 0.15   7.96 ± 0.17  34.35 ± 1.69  137.44 ± 1.85  1.00
                                               2               1      1.58 ± 0.09   5.31 ± 0.13  24.13 ± 0.42  95.77 ± 1.63   1.44
                                               4               1      1.35 ± 0.01   5.20 ± 0.02  20.93 ± 0.04  82.52 ± 0.08   1.67
                                               8               1      1.28 ± 0.01   5.14 ± 0.02  19.63 ± 0.02  78.41 ± 0.17   1.75
                                               16              1      1.23 ± 0.01   5.01 ± 0.01  19.11 ± 0.02  76.41 ± 0.02   1.80
                                               32              2      0.96 ± 0.01   3.90 ± 0.02  14.84 ± 0.06  59.33 ± 0.15   2.32
                                               64              4      0.56 ± 0.01   2.25 ± 0.05  8.64 ± 0.11   35.20 ± 0.39   3.90
                                               128             8      0.32 ± 0.05   1.30 ± 0.05  4.82 ± 0.06   19.09 ± 0.16   7.20

Q ABLATION: FEATURE UNFOLDING

Figure 37: We plot the FID curves for a training episode of our complete InfinityGAN (red curve) and InfinityGAN without feature unfolding (blue curve). We observe that the FID saturates at an early stage without feature unfolding.

R LIMITATIONS

Here we discuss some empirical limitations of InfinityGAN.

Patch training leads to performance degradation. In order to construct a better conditional distribution in the vertical direction with the auxiliary loss Lar, InfinityGAN trains the generator with patches instead of full images. However, training with patches reduces the training field-of-view, leading to inferior performance at the same field-of-view compared to a model trained with full images. Designing unsupervised mechanisms for learning Lar without patch cropping could further improve InfinityGAN.

Long-range coherence. InfinityGAN assumes that local coherence following a shared holistic appearance can achieve visually plausible synthesis. However, certain physical relationships still require long-range dependency: for instance, two suns can be observed in the top-left image of Figure 23, and twilight should only appear near the horizon in Figure 25. Although InfinityGAN can independently sample pixels arbitrarily far apart, it remains unclear how to construct unsupervised losses for such constraints, as we only have finite-pixel images in the training data.

Strip-shaped artifacts. We observe that InfinityGAN creates a unique type of artifact that forms a strip-shaped structure sweeping through the sky or ground over a long distance, such as the clouds in the bottom two images of Figure 23. We hypothesize that the root cause of this artifact is the model attending too strongly to the structural characteristics of the horizon line and accidentally sharing that representation with other, less related contexts. We anticipate that improving the modulation or coordinate encoding mechanisms may help suppress this behavior.
S IMPLEMENTATION ILLUSTRATION OF SCALEINV FID

import torch.nn.functional as F

def eval_scaleinv_fid(real_images, fake_images, scale):
    # real_images: tensor of real images, shape (B, C, H, W).
    # fake_images: tensor of fake images, shape (B, C, H, W).
    # scale: the scale of the current ScaleInv FID.
    fake_images = F.interpolate(
        fake_images,
        scale_factor=1 / scale,
        mode="bilinear",
        align_corners=True)

    # The regular FID evaluation
    return eval_fid(real_images, fake_images)

T MORE COMPARISONS WITH TEXTURE SYNTHESIS METHOD

As discussed in Section 1, texture synthesis models are not directly applicable to real-world image synthesis. In Figure 38, we demonstrate this problem by running TileGAN (Frühstück et al., 2019) on our Flickr-Landscape dataset. The results show that random texture synthesis cannot produce plausible global structures. It is important to note that the images shown in TileGAN and other texture synthesis papers (Bergmann et al., 2017; Jetchev et al., 2018) with plausible global structure, such as the satellite map of Jurassic Park and the paintings in TileGAN, are all conditioned on an image that provides the blueprint of the global structure.

Implementation details of the TileGAN experiment. We use the officially released TileGAN pipeline to synthesize the results with randomly sampled latent variables. We follow the instructions and train a PGGAN model at 256×256 resolution, then test the model by synthesizing at 512×512, 1024×1024, and 2048×2048 pixels. Note that we discovered that TileGAN alters the PGGAN architecture from a residual-based ToRGB branch3 to a single ToRGB projection4. Such a modification is not described in the TileGAN paper, but can be found by diagnosing the model checkpoints released by the authors. The modification leads to significant visual-quality degradation. However, even without this degradation, the lack of structural clues makes it impossible for TileGAN to infer a coherent global structure when synthesizing at larger image sizes.

Figure 38: Qualitative results of TileGAN on the Flickr-Landscape dataset (generated at 1×, 2×, 4×, and 8×). The results show that random texture synthesis models are not directly applicable to real-world image synthesis.

U MORE COMPARISONS WITH SINGAN-BASED MODELS

We conduct additional experiments on ConSinGAN (Hinz et al., 2021), a concurrent work that proposes several improvements upon SinGAN. We use the officially released code and hyperparameters to train the ConSinGAN models. ConSinGAN has two training modes, generation and retargeting; the authors mention in their GitHub release that the retargeting mode is more suitable for extending the synthesis size. As shown in Figure 39, similar to SinGAN, ConSinGAN has no specialized mechanism to handle different positional information when tested at a different synthesis size. Therefore, neither training mode can produce a plausible global view when tested at extended synthesis sizes.

3Codes: https://github.com/afruehstueck/tileGAN/blob/0460e228b1109528a0fefc6569b970c2934a649d/networks.py#L274-L294.
4Codes: https://github.com/afruehstueck/tileGAN/blob/0460e228b1109528a0fefc6569b970c2934a649d/networks.py#L370-L375.
Figure 39: Qualitative results of ConSinGAN on the Flickr-Landscape dataset (generated at 1×, 2×, 4×, and 8×). We run ConSinGAN under the two configurations released by the authors, (top) generation and (bottom) retargeting. The results show that ConSinGAN inherits similar behaviors from SinGAN and fails to produce images with a plausible global structure when synthesizing at larger image sizes.

V ABLATION: MODE-SEEKING DIVERSITY LOSS (Ldiv)

Figure 40: Ablation on the mode-seeking diversity loss. We show that the mode-seeking diversity loss discourages the model from synthesizing similar appearances at the same coordinates. Both InfinityGAN models, (a) with and (b) without the diversity loss, are trained on Flickr-Landscape data with a 197×197 full-image size and a 101×101 patch size, then synthesize at 101×101 pixels at test time. In this figure, all images share the same coordinate grid, and each row shares the same global latent variable; therefore, only the local latent variables vary within each row. In (a), the regular InfinityGAN shows high diversity in each row, and no obvious structure-coordinate relation is present. In contrast, in (b), a consistent high-level layout is shared within each row, while the differences between samples are mostly local variations. In particular, the third, sixth, and seventh rows share a similar layout, a sign that the model learns a correspondence between the image structure and the coordinates. However, it is difficult to quantify this problem, since the repetition is not an exact repetition of content but a structural/semantic-level similarity.

W ABLATION: AUXILIARY LOSS (Lar)

Figure 41: Ablation on the auxiliary loss. We show that InfinityGAN with the auxiliary loss (orange curve) provides a slight improvement over the variant without it (pink curve). However, the performance difference is not very significant. We believe the additional supervision on the vertical position should provide important clues for learning the spatially varying distribution in the vertical direction, but our approach of modeling this information with two MLP layers (see Figure 18) may be too simple. Future studies on better loss functions or architectures for this component may further improve the overall performance of InfinityGAN.
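For completeness, one plausible instantiation of the auxiliary vertical supervision discussed above is sketched below: the discriminator's auxiliary head (Figure 18) regresses the vertical position of the patch it sees. This is our illustrative guess at the general form, not the exact Lar used in the paper, and the tensors referenced in the usage comments are hypothetical.

import torch.nn.functional as F

def auxiliary_vertical_loss(aux_pred, patch_vertical_coord):
    # aux_pred: (B, 1) auxiliary prediction from the discriminator head.
    # patch_vertical_coord: (B, 1) vertical position of each training patch,
    # e.g., its (tanh-projected) center coordinate. A simple regression penalty
    # ties the two together; the exact formulation of Lar may differ.
    return F.mse_loss(aux_pred, patch_vertical_coord)

# Illustrative usage inside a training step (d_aux, g_aux, v_real, v_fake,
# lambda_ar, and the adversarial losses are hypothetical placeholders):
# loss_d = adversarial_loss_d + lambda_ar * auxiliary_vertical_loss(d_aux, v_real)
# loss_g = adversarial_loss_g + lambda_ar * auxiliary_vertical_loss(g_aux, v_fake)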