# The Image Local Autoregressive Transformer

Chenjie Cao, Yuxin Hong, Xiang Li, Chengrong Wang, Chengming Xu, Yanwei Fu*, Xiangyang Xue
School of Data Science, Fudan University
{20110980001,yanweifu}@fudan.edu.cn

*Corresponding author. Dr. Fu is also with the Fudan ISTBI-ZJNU Algorithm Centre for Brain-inspired Intelligence, Zhejiang Normal University, Jinhua, China.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

## Abstract

Recently, Autoregressive (AR) models for whole-image generation empowered by transformers have achieved performance comparable to or even better than Generative Adversarial Networks (GANs). Unfortunately, directly applying such AR models to edit/change local image regions may suffer from missing global information, slow inference speed, and information leakage of local guidance. To address these limitations, we propose a novel model, the image Local Autoregressive Transformer (iLAT), to better facilitate locally guided image synthesis. Our iLAT learns novel local discrete representations via the newly proposed local autoregressive (LA) transformer with its attention mask and convolution mechanism. Thus iLAT can efficiently synthesize local image regions from key guidance information. Our iLAT is evaluated on various locally guided image synthesis tasks, such as pose-guided person image synthesis and face editing. Both quantitative and qualitative results show the efficacy of our model.

## 1 Introduction

Generating realistic images has attracted ubiquitous research attention from the community for a long time. In particular, image synthesis tasks involving persons or portraits [6, 28, 29] can be applied in a wide variety of scenarios, such as advertising, games, and motion capture. Most real-world image synthesis tasks only involve local generation, i.e., generating pixels in certain regions while maintaining semantic consistency, e.g., face editing [19, 1, 40], pose guiding [36, 55, 47], and image inpainting [51, 30, 49, 53]. Unfortunately, most works can only handle well-aligned images with iconic-view foregrounds, rather than the synthesis of non-iconic-view foregrounds [47, 24], i.e., person instances with arbitrary poses in cluttered scenes, which is the focus of this paper. Even worse, in previous methods the global semantics tend to be distorted during generation, even if only subtle modifications are applied to a local image region. Critically, given local editing/guidance such as face sketches or body skeletons, as in the first column of Fig. 1(A), it is imperative to design a new algorithm for locally guided image synthesis.

Generally, several inherent problems exist in previous works for such a task. For example, despite generating images of impressive quality, GAN/Autoencoder (AE)-based methods [51, 47, 19, 30, 18] are inclined to synthesize blurry local regions, as in Fig. 1(A)-row(c). Furthermore, some inspiring autoregressive (AR) methods, such as PixelCNN [32, 41, 23] and recent transformers [8, 14], can efficiently model the joint image distribution (even with very complex backgrounds [32]) for whole-image generation, as in Fig. 1(B)-row(b). These AR models, however, are still not ready for locally guided image synthesis, for several reasons. (1) Missing global information. As in Fig. 1(B)-row(b), vanilla AR models perform top-to-bottom and left-to-right sequential generation with limited receptive fields for the initial generation (top-left corner), which are incapable of directly modeling global information.
Figure 1: The illustration of (A) the influence of missing semantic consistency, information leakage, and blur in AE-based methods for local generation, and (B) the comparison of AE, AR, and our iLAT for different conditional image generation modes; rows: (a) Autoencoder (AE), (b) Autoregressive (AR), (c) Local Autoregressive Transformer (iLAT). Our method is more efficient for locally guided image synthesis, keeping both global semantics and local guidance.

Additionally, sequential AR models suffer from exposure bias [2]: they may predict future pixels conditioned on past ones that contain mistakes, due to the discrepancy between training and testing in AR. Small local guidance can thus cause unpredictable changes to the whole image, resulting in inconsistent semantics as in Fig. 1(A)-row(a). (2) Slow inference speed. AR models have to predict pixels sequentially at test time, with notoriously slow inference speed, especially for high-resolution image generation. Although parallel training techniques are used in PixelCNN [32] and transformers [14], the conditional probability sampling fails to work in parallel during the inference phase. (3) Information leakage of local guidance. As shown in Fig. 1(B)-row(c), local guidance should be implemented with specific masks to ensure the validity of local AR learning. During the sequential training process, pixels from masked regions may be exposed to AR models through convolutions with large kernel sizes or inappropriate attention masks in the transformer. We call this information leakage [44, 16] of local guidance; it makes models overfit the masked regions and miss detailed local guidance, as in Fig. 1(A)-row(b).

To this end, we propose a novel image Local Autoregressive Transformer (iLAT) for the task of locally guided image synthesis. Our key idea lies in learning local discrete representations effectively. Particularly, we tailor the receptive fields of AR models to the local guidance, achieving semantically consistent and visually realistic generation results. Furthermore, a local autoregressive (LA) transformer with a novel LA attention mask and convolution mechanism is proposed to enable successful local generation of images with efficient inference time and without information leakage.

Formally, we propose the iLAT model with several novel components. (1) We complementarily incorporate the receptive fields of both AR and AE to fit LA generation with a novel attention mask, as shown in Fig. 1(B)-row(c). In detail, a local discrete representation is proposed to represent the masked regions, while the unmasked areas are encoded with continuous image features. Thus, we achieve favorable results with both consistent global semantics and realistic local generations. (2) Our iLAT dramatically reduces the inference time for local generation, since only masked regions are generated autoregressively. (3) A simple but effective two-stream convolution and a local causal attention mask mechanism are proposed for the discrete image encoder and the transformer respectively, with which information leakage is prevented without detriment to the performance.
We make several contributions in this work. (1) A novel local discrete representation learning scheme is proposed to efficiently help learn our iLAT for local generation. (2) We propose an image local autoregressive transformer for local image synthesis, which enjoys both semantically consistent and realistic generative results. (3) Our iLAT only generates the necessary regions autoregressively, which is much faster than vanilla AR methods during inference. (4) We propose a two-stream convolution and an LA attention mask to prevent both the convolutions and the transformer from information leakage, thus improving the quality of generated images. Empirically, we introduce several local guidance tasks, including pose-guided image generation and face editing; extensive experiments are conducted on the corresponding datasets to validate the efficacy of our model.

## 2 Related Work

**Conditional Image Synthesis.** Some conditional generation models are designed to globally generate images with pre-defined styles based on user-provided references, such as poses and face sketches. These previous synthesis efforts build on Variational Auto-Encoders (VAEs) [10, 13], AR models [48, 39], and AEs with adversarial training [51, 49, 53]. Critically, it is highly non-trivial for all these methods to perform locally guided image synthesis with non-iconic foregrounds. Some tentative attempts have been made in pose-guided synthesis [42, 47] with persons appearing in non-iconic views. On the other hand, face editing methods are mostly based on adversarial AE-based inpainting [30, 51, 19] and GAN-inversion-based methods [53, 40, 1]. Rather than synthesizing the whole image, our iLAT generates the local regions of images autoregressively, which not only improves stability through a well-optimized joint conditional distribution for large masked regions but also maintains the global semantic information.

**Autoregressive Generation.** Deep AR models have achieved great success recently in the community [35, 15, 32, 9, 38]. Some well-known works include PixelRNN [43], Conditional PixelCNN [32], Gated PixelCNN [32], and WaveNet [31]. Recently, transformer-based AR models [38, 5, 12] have achieved excellent results in many machine learning tasks. Unfortunately, the common troubles of these AR models are the expensive inference time and potential exposure bias, as AR models sequentially predict future values from the given past values. The inconsistent receptive fields between training and testing lead to accumulated errors and unreasonable generated results [2]. Our iLAT is thus designed to address these limitations.

**Visual Transformer.** The transformer takes advantage of the self-attention module [44] and shows impressive expressive power in many Natural Language Processing (NLP) [38, 11, 5] and vision tasks [12, 7, 25]. Owing to the costly $O(n^2)$ time and space complexity of transformers, Parmar et al. [34] utilize local self-attention to achieve AR generation. Chen et al. [8] and Kumar et al. [22] autoregressively generate pixels with simplified discrete color palettes to save computation, but, limited by computing power, they still generate only low-resolution images. To address this, some works exploit discrete representation learning, e.g., dVAE [39] and VQGAN [14]. Such discrete representations not only reduce the sequence length of image tokens but also provide perceptually rich latent features for image synthesis, analogous to word embeddings in NLP.
However, recovering images from the discrete codebook still causes blur and artifacts in complex scenes. Besides, the vanilla convolutions of the discrete encoder may leak information among different discrete codebook tokens. Moreover, local conditional generation based on VQGAN [14] suffers from poor semantic consistency with the unchanged image regions. To address these issues, iLAT introduces a novel local discrete representation that improves the model's capability for local image synthesis.

## 3 Approach

Given the conditional image $I_c$ and the target image $I_t$, our image Local Autoregressive Transformer (iLAT) aims at producing an output image $I_o$ that is semantically consistent and visually realistic. The key foreground objects (e.g., the skeleton of a body) or the regions of interest (e.g., sketches of facial regions) extracted from $I_c$ are applied to guide the synthesis of the output image $I_o$. Essentially, the background and the other non-key foreground regions of $I_o$ should be visually similar to $I_t$. As shown in Fig. 2, our iLAT includes a Two-Stream-convolution-based Vector Quantized GAN (TS-VQGAN) branch for discrete representation learning and a transformer branch for AR generation with a Local Autoregressive (LA) attention mask. Particularly, our iLAT first encodes $I_c$ and $I_t$ into codebook vectors $z_{q,c}$ and $z_{q,t}$ with TS-VQGAN (Sec. 3.1) without local information leakage. Then the indices of the masked vectors $\hat{z}_{q,t}$ are predicted by the transformer autoregressively with the LA attention mask (Sec. 3.2). During the test phase, the decoder of TS-VQGAN takes the combination of $\hat{z}_{q,t}$ in masked regions and $\hat{z}$ in unmasked regions to produce the final result.

### 3.1 Local Discrete Representation Learning

We propose a novel local discrete representation learning scheme in this section. Since it is inefficient to learn the generative transformer model directly on pixels, inspired by VQGAN [14] we incorporate the VQVAE mechanism [33] into the proposed iLAT for discrete representation learning. The VQVAE consists of an encoder $E$, a decoder $D$, and a learnable discrete codebook $Z = \{z_k\}_{k=1}^{K}$, where $K$ is the total number of codebook vectors.

Figure 2: The overview of iLAT with $3\times3$ latent quantization. The target image $I_t$ is encoded by TS-VQGAN with a binary mask $M$, while the conditional image $I_c$ is directly encoded without a mask. As explained in Sec. 3.2, the guidance information of poses and sketches is utilized in our two tasks. For pose guidance, target pose coordinates are encoded by a Point MLP into sequential features, which are concatenated with the conditional tokens $\{c_{[s]}, c_0, \ldots, c_8, c_{pose}\}$. The TS-VQGAN decoder takes the combination of $\hat{z}_{q,t}$, converted from the generated target tokens $\{\hat{t}_1, \hat{t}_2, \ldots, \hat{t}_7\}$ in masked regions, and the directly encoded features $\hat{z}$ in unmasked regions to get the final result.
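To make the overall data flow concrete, the following is a minimal sketch of the inference pipeline described above. The objects `ts_vqgan` and `la_transformer` and all method names are placeholders chosen for illustration, not the authors' released interfaces:

```python
import torch

def ilat_inference(ts_vqgan, la_transformer, I_c, I_t, M, M_q):
    """High-level sketch of the iLAT test-time pipeline (hypothetical API).

      I_c, I_t : conditional / target images, shape (B, 3, H, W)
      M        : binary image mask, shape (B, 1, H, W), 1 = region to edit
      M_q      : M resized to the h x w token grid, shape (B, 1, h, w)
    """
    with torch.no_grad():
        # 1) Encode both images; the target is encoded with two-stream
        #    convolutions so masked content cannot leak into unmasked features.
        z_hat_c = ts_vqgan.encode(I_c)            # (B, C, h, w)
        z_hat_t = ts_vqgan.encode(I_t, mask=M)    # (B, C, h, w)

        # 2) Convert features to codebook indices (cf. Eq. (1)).
        c_tokens = ts_vqgan.quantize_to_indices(z_hat_c)   # (B, h*w)
        t_tokens = ts_vqgan.quantize_to_indices(z_hat_t)   # (B, h*w)

        # 3) Autoregressively predict only the masked target tokens, attending
        #    to condition tokens and unmasked target tokens via the LA mask.
        t_tokens = la_transformer.generate_masked(c_tokens, t_tokens, M_q)

        # 4) Decode a mixture: quantized vectors in masked regions, continuous
        #    encoder features elsewhere (cf. Eq. (4)).
        z_q_t = ts_vqgan.indices_to_vectors(t_tokens)      # (B, C, h, w)
        mixed = z_q_t * M_q + z_hat_t * (1.0 - M_q)
        return ts_vqgan.decode(mixed)                      # output image I_o
```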
Given an image $I \in \mathbb{R}^{H\times W\times 3}$, $E$ encodes the image into latent features $\hat{z} = E(I) \in \mathbb{R}^{h\times w\times c_e}$, where $c_e$ indicates the channel dimension of the encoder outputs. Then, the spatially unfolded $\hat{z}_{h'w'} \in \mathbb{R}^{c_e}$ ($h' \leq h$, $w' \leq w$) are replaced with the closest codebook vectors as

$$z^{(q)}_{h'w'} = \arg\min_{z_k \in Z} \|\hat{z}_{h'w'} - z_k\| \in \mathbb{R}^{c_q}, \qquad z_q = \mathrm{fold}\big(z^{(q)}_{h'w'},\ h' \leq h,\ w' \leq w\big) \in \mathbb{R}^{h\times w\times c_q}, \tag{1}$$

where $c_q$ indicates the codebook channel dimension.

Figure 3: Illustration of (a) the two-stream convolution in the TS-VQGAN encoder and (b) the LA attention mask $\hat{M}_{LA}$ (C2C: condition to condition, C2T: condition to target, T2C: target to condition, T2T: target to target; $M_{LA}$ denotes the T2T part of $\hat{M}_{LA}$). In (a), after the feature $F$ is convolved, the $3\times3$ convolution kernel spreads features from the masked regions to the unmasked regions; patches with leaked information are circled in red. For (b), a $3\times3$ quantized mask $M_q$ is assumed as the input mask. $M_{LA}$ can be divided into the global sub-mask $M_{gs}$ and the causal sub-mask $M_{cs}$. Then $t_{[s]}$, $t_3$, $t_8$ can be categorized as global pixel tokens, while the others are causal tokens: $t_1, t_2, t_5, t_7$ are masked tokens $t_m$ ($M_{q,m} = 1$), and $t_0, t_1, t_4, t_6$ are the tokens $t_{m-1}$ that need to predict the masked ones. Global tokens can be attended by all tokens, while causal tokens attend to the targets through a lower-triangular matrix. All colored entries are valued 1 and white entries are 0 in $M_{LA}$.

However, VQVAE suffers from obscure information leakage in local image synthesis if the receptive field (kernel size) of a vanilla convolution is larger than $1\times1$, as shown in Fig. 3(a). Intuitively, each $3\times3$ convolution layer spreads the masked features to the region just outside the mask border. Furthermore, the multi-layer convolutional encoder $E$ accumulates this information leakage, which makes the model learn local generation with unreasonable confidence and leads to overfitting (see Sec. 4.3). To this end, we present two-stream convolutions in TS-VQGAN, as shown in Fig. 3(a). Since the masked information is only leaked to a narrow band around the mask by each $3\times3$ convolution layer, we can simply replace the corrupted features in every layer with masked ones. Thus, the influence of information leakage is eliminated without hurting the integrity of either the masked or the unmasked features. Specifically, the given image mask $M \in \mathbb{R}^{H\times W}$, where 1 means masked regions and 0 means unmasked regions, is resized into $M'$ with max-pooling to fit the feature size. The two-stream convolution converts the input feature $F$ into the masked feature $F_m = F \odot (1 - M')$, where $\odot$ is element-wise multiplication. Then, both $F$ and $F_m$ are convolved with shared weights and combined according to the leaked regions $M_l$, which can be obtained from the difference of the convolved mask as

$$M_l = \mathrm{clip}(\mathrm{conv}_1(M'), 0, 1) - M', \qquad M_l[M_l > 0] = 1, \tag{2}$$

where $\mathrm{conv}_1$ is implemented with an all-ones $3\times3$ kernel. Therefore, the output of the two-stream convolution can be written as

$$F_c = \mathrm{conv}(F) \odot (1 - M_l) + \mathrm{conv}(F_m) \odot M_l. \tag{3}$$

So the leaked regions are replaced with features that depend only on unmasked regions. Besides, the masked features can be further leveraged for AR learning without any limitations.
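A minimal PyTorch sketch of such a two-stream convolution layer, following Eqs. (2) and (3); this is a re-implementation guess under the stated assumptions (a single shared-weight convolution and a mask already resized to the feature resolution), not the authors' released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamConv2d(nn.Module):
    """Two-stream convolution following Eqs. (2)-(3): wherever a vanilla 3x3
    convolution would leak masked content into unmasked features, the output
    is taken from a second stream whose masked regions were zeroed out."""

    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # Both streams share the same weights: one conv applied to two inputs.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.kernel_size = kernel_size
        self.padding = padding

    def forward(self, x, mask):
        # x:    input feature F, shape (B, C, H, W)
        # mask: resized binary mask M', shape (B, 1, H, W), 1 = masked region
        x_masked = x * (1.0 - mask)  # F_m = F (1 - M')

        # Eq. (2): leaked region M_l = pixels reached by spreading the mask
        # with an all-ones kernel, minus the mask itself, then binarized.
        ones = torch.ones(1, 1, self.kernel_size, self.kernel_size, device=x.device)
        spread = F.conv2d(mask, ones, padding=self.padding).clamp(0, 1)
        leak = ((spread - mask) > 0).float()  # M_l

        # Eq. (3): F_c = conv(F)(1 - M_l) + conv(F_m)M_l
        return self.conv(x) * (1.0 - leak) + self.conv(x_masked) * leak
```

In the TS-VQGAN encoder, each vanilla convolution would be swapped for such a layer, with the image mask max-pooled to the corresponding feature resolution.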
Compared with VQVAE, we replace all vanilla convolutions with two-stream convolutions in the encoder of TS-VQGAN. Note that the decoder $D$ does not need to prevent information leakage at all, since the decoding process takes place after the AR generation of the transformer, as shown in Fig. 2. For the VQVAE, the decoder $D$ decodes the codebook vectors $z_q$ obtained from Eq. (1) and reconstructs the output image as $I_o = D(z_q)$. Although VQGAN [14] can generate more reliable textures with adversarial training, handling complex real-world backgrounds and precise face details is still difficult for existing discrete learning methods. In TS-VQGAN, we further finetune the model with local quantized learning, which can be written as

$$I_o = D\big(z_q \odot M_q + \hat{z} \odot (1 - M_q)\big), \tag{4}$$

where $M_q \in \mathbb{R}^{h\times w}$ is the resized mask for the quantized vectors, and $\hat{z}$ is the output of the encoder. In Eq. (4), unmasked features are taken directly from the encoder, while masked features are replaced with codebook vectors, which places the model between an AE and a VQVAE. This simple trick effectively maintains the fidelity of the unmasked regions and reduces the number of quantized vectors that have to be generated autoregressively, which also leads to more efficient local AR inference. Note that the back-propagation of Eq. (4) is implemented with the straight-through gradient estimator [4].

### 3.2 Local Autoregressive Transformer Learning

From the discrete representation learning in Sec. 3.1, we obtain the discrete codebook vectors $z_{q,c}, z_{q,t} \in \mathbb{R}^{h\times w\times c_q}$ for the conditional and target images respectively. Then the conditional and target image tokens $\{c_i, t_j\}_{i,j=1}^{hw} \in \{0, 1, \ldots, K-1\}$ are obtained from the index-based representation of $z_{q,c}, z_{q,t}$ in the codebook, as sequences of length $hw$, where $K$ indicates the total number of codebook vectors. Given the resized target mask $M_q \in \mathbb{R}^{h\times w}$, the second stage needs to learn the AR likelihood of the masked target tokens $\{t_m\}$ (where $M_{q,m} = 1$) conditioned on the conditional tokens $\{c_i\}_{i=1}^{hw}$ and the other unmasked target tokens $\{t_u\}$ (where $M_{q,u} = 0$) as

$$p(t_m \mid c, t_u) = \prod_{j} p(t_{m,j} \mid c, t_u, t_{m,<j}).$$
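As a sketch of how this objective could be computed in practice, assuming the transformer, run under the LA attention mask, already outputs per-position logits over the $K$ codebook entries (the function and argument names below are ours, not the authors' code):

```python
import torch
import torch.nn.functional as F

def masked_token_nll(logits, target_tokens, mask_q):
    """Negative log-likelihood over the masked target tokens only, matching
    p(t_m | c, t_u) = prod_j p(t_{m,j} | c, t_u, t_{m,<j}).

      logits        : (B, h*w, K) per-position predictions for target tokens
      target_tokens : (B, h*w)    ground-truth codebook indices t
      mask_q        : (B, h*w)    resized binary mask M_q, 1 = token to generate
    """
    mask_q = mask_q.float()
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*h*w, K)
        target_tokens.reshape(-1),            # (B*h*w,)
        reduction="none",
    ).view_as(mask_q)
    # Unmasked target tokens are given as context, so they contribute no loss.
    return (nll * mask_q).sum() / mask_q.sum().clamp(min=1.0)
```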