# The Image Local Autoregressive Transformer

Chenjie Cao, Yuxin Hong, Xiang Li, Chengrong Wang, Chengming Xu, Yanwei Fu*, Xiangyang Xue
School of Data Science, Fudan University
{20110980001,yanweifu}@fudan.edu.cn

*Corresponding author. Dr. Fu is also with the Fudan ISTBI-ZJNU Algorithm Centre for Brain-inspired Intelligence, Zhejiang Normal University, Jinhua, China.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

## Abstract

Recently, Autoregressive (AR) models for whole-image generation empowered by transformers have achieved performance comparable to or even better than Generative Adversarial Networks (GANs). Unfortunately, directly applying such AR models to edit/change local image regions may suffer from missing global information, slow inference speed, and information leakage of local guidance. To address these limitations, we propose a novel model, the image Local Autoregressive Transformer (iLAT), to better facilitate locally guided image synthesis. Our iLAT learns novel local discrete representations via the newly proposed local autoregressive (LA) transformer with its attention mask and convolution mechanism. Thus iLAT can efficiently synthesize local image regions from key guidance information. Our iLAT is evaluated on various locally guided image synthesis tasks, such as pose-guided person image synthesis and face editing. Both quantitative and qualitative results show the efficacy of our model.

## 1 Introduction

Generating realistic images has attracted ubiquitous research attention from the community for a long time. In particular, image synthesis tasks involving persons or portraits [6, 28, 29] can be applied in a wide variety of scenarios, such as advertising, games, and motion capture. Most real-world image synthesis tasks only involve local generation, i.e., generating pixels in certain regions while maintaining semantic consistency, e.g., face editing [19, 1, 40], pose guiding [36, 55, 47], and image inpainting [51, 30, 49, 53]. Unfortunately, most works can only handle well-aligned images with iconic-view foregrounds, rather than the synthesis of non-iconic-view foregrounds [47, 24], i.e., person instances with arbitrary poses in cluttered scenes, which is the focus of this paper. Even worse, in previous methods the global semantics tend to be distorted during generation, even if only subtle modifications are applied to a local image region. Critically, given local editing/guidance such as face sketches or body skeletons, as in the first column of Fig. 1(A), it is imperative to design a new algorithm for locally guided image synthesis.

Generally, several inherent problems exist in previous works for such a task. For example, despite generating images of impressive quality, GAN/Autoencoder (AE)-based methods [51, 47, 19, 30, 18] are inclined to synthesize blurry local regions, as in Fig. 1(A)-row(c). Furthermore, some inspiring autoregressive (AR) methods, such as PixelCNN [32, 41, 23] and recent transformers [8, 14], can efficiently model the joint image distribution (even with very complex backgrounds [32]) for whole-image generation, as in Fig. 1(B)-row(b). These AR models, however, are still not ready for locally guided image synthesis, for several reasons. (1) Missing global information. As in Fig. 1(B)-row(b), vanilla AR models perform top-to-bottom and left-to-right sequential generation with limited receptive fields for the initial generation (top-left corner), which are incapable of directly modeling global information.
Figure 1: The illustration of (A) the influence of missing semantic consistency, information leakage, and blur in AE-based methods for local generation, and (B) the comparison of AE, AR, and our iLAT for different conditional image generation modes; rows: (a) Autoencoder (AE), (b) Autoregressive (AR), (c) Local Autoregressive Transformer (iLAT). Our method is more efficient for locally guided image synthesis, keeping both global semantics and local guidance.

Additionally, sequential AR models suffer from exposure bias [2]: they may predict future pixels conditioned on past ones that contain mistakes, due to the discrepancy between training and testing in AR. Small local guidance can thus cause unpredictable changes to the whole image, resulting in inconsistent semantics as in Fig. 1(A)-row(a). (2) Slow inference speed. AR models have to predict pixels sequentially at test time, with notoriously slow inference speed, especially for high-resolution image generation. Although parallel training techniques are used in PixelCNN [32] and transformers [14], the conditional probability sampling fails to work in parallel during the inference phase. (3) Information leakage of local guidance. As shown in Fig. 1(B)-row(c), local guidance should be implemented with specific masks to ensure the validity of local AR learning. During the sequential training process, pixels from masked regions may be exposed to AR models through convolutions with large kernel sizes or inappropriate attention masks in the transformer. We call this information leakage [44, 16] of local guidance; it makes models overfit the masked regions and miss detailed local guidance, as in Fig. 1(A)-row(b).

To this end, we propose a novel image Local Autoregressive Transformer (iLAT) for the task of locally guided image synthesis. Our key idea lies in learning local discrete representations effectively. Particularly, we tailor the receptive fields of AR models to the local guidance, achieving semantically consistent and visually realistic generation results. Furthermore, a local autoregressive (LA) transformer with a novel LA attention mask and convolution mechanism is proposed to enable successful local generation of images with efficient inference time and without information leakage.

Formally, we propose the iLAT model with several novel components. (1) We complementarily incorporate the receptive fields of both AR and AE to fit LA generation with a novel attention mask, as shown in Fig. 1(B)-row(c). In detail, a local discrete representation is proposed to represent the masked regions, while the unmasked areas are encoded with continuous image features. Thus, we achieve favorable results with both consistent global semantics and realistic local generations. (2) Our iLAT dramatically reduces the inference time for local generation, since only masked regions are generated autoregressively. (3) A simple but effective two-stream convolution and a local causal attention mask mechanism are proposed for the discrete image encoder and the transformer respectively, with which information leakage is prevented without detriment to the performance.
We make several contributions in this work. (1) A novel local discrete representation learning scheme is proposed to efficiently help learn our iLAT for local generation. (2) We propose an image local autoregressive transformer for local image synthesis, which enjoys both semantically consistent and realistic generative results. (3) Our iLAT only generates the necessary regions autoregressively, which is much faster than vanilla AR methods during inference. (4) We propose a two-stream convolution and an LA attention mask to prevent both the convolutions and the transformer from information leakage, thus improving the quality of generated images. Empirically, we introduce several local guidance tasks, including pose-guided image generation and face editing; extensive experiments are conducted on the corresponding datasets to validate the efficacy of our model.

## 2 Related Work

**Conditional Image Synthesis.** Some conditional generation models are designed to globally generate images with pre-defined styles based on user-provided references, such as poses and face sketches. These previous synthesis efforts build on Variational Auto-Encoders (VAEs) [10, 13], AR models [48, 39], and AEs with adversarial training [51, 49, 53]. Critically, it is highly non-trivial for all these methods to perform locally guided image synthesis with non-iconic foregrounds. Some tentative attempts have been made in pose-guided synthesis [42, 47] with persons appearing in non-iconic views. On the other hand, face editing methods are mostly based on adversarial AE-based inpainting [30, 51, 19] and GAN-inversion-based methods [53, 40, 1]. Rather than synthesizing the whole image, our iLAT generates the local regions of images autoregressively, which not only improves stability through a well-optimized joint conditional distribution for large masked regions but also maintains the global semantic information.

**Autoregressive Generation.** Deep AR models have achieved great success recently in the community [35, 15, 32, 9, 38]. Some well-known works include PixelRNN [43], Conditional PixelCNN [32], Gated PixelCNN [32], and WaveNet [31]. Recently, transformer-based AR models [38, 5, 12] have achieved excellent results in many machine learning tasks. Unfortunately, the common troubles of these AR models are the expensive inference time and potential exposure bias, as AR models sequentially predict future values from the given past values. The inconsistent receptive fields between training and testing lead to accumulated errors and unreasonable generated results [2]. Our iLAT is thus designed to address these limitations.

**Visual Transformer.** The transformer takes advantage of the self-attention module [44] and shows impressive expressive power in many Natural Language Processing (NLP) [38, 11, 5] and vision tasks [12, 7, 25]. Owing to the costly $O(n^2)$ time and space complexity of transformers, Parmar et al. [34] utilize local self-attention to achieve AR generation. Chen et al. [8] and Kumar et al. [22] autoregressively generate pixels with simplified discrete color palettes to save computation, but, limited by computing power, they still generate only low-resolution images. To address this, some works exploit discrete representation learning, e.g., dVAE [39] and VQGAN [14]. Such discrete representations not only reduce the sequence length of image tokens but also provide perceptually rich latent features for image synthesis, analogous to word embeddings in NLP.
However, recovering images from the discrete codebook still causes blur and artifacts in complex scenes. Besides, the vanilla convolutions of the discrete encoder may leak information among different discrete codebook tokens. Moreover, local conditional generation based on VQGAN [14] suffers from poor semantic consistency with the unchanged image regions. To address these issues, iLAT introduces a novel local discrete representation that improves the model's capability for local image synthesis.

## 3 Approach

Given the conditional image $I_c$ and the target image $I_t$, our image Local Autoregressive Transformer (iLAT) aims at producing an output image $I_o$ that is semantically consistent and visually realistic. The key foreground objects (e.g., the skeleton of a body) or the regions of interest (e.g., sketches of facial regions) extracted from $I_c$ are applied to guide the synthesis of the output image $I_o$. Essentially, the background and the other non-key foreground regions of $I_o$ should be visually similar to $I_t$. As shown in Fig. 2, our iLAT includes a Two-Stream-convolution-based Vector Quantized GAN (TS-VQGAN) branch for discrete representation learning and a transformer branch for AR generation with a Local Autoregressive (LA) attention mask. Particularly, our iLAT first encodes $I_c$ and $I_t$ into codebook vectors $z_{q,c}$ and $z_{q,t}$ with TS-VQGAN (Sec. 3.1) without local information leakage. Then the indices of the masked vectors $\hat{z}_{q,t}$ are predicted by the transformer autoregressively with the LA attention mask (Sec. 3.2). During the test phase, the decoder of TS-VQGAN takes the combination of $\hat{z}_{q,t}$ in masked regions and $\hat{z}$ in unmasked regions to produce the final result.

### 3.1 Local Discrete Representation Learning

We propose a novel local discrete representation learning scheme in this section. Since it is inefficient to learn the generative transformer model directly on pixels, inspired by VQGAN [14] we incorporate the VQVAE mechanism [33] into the proposed iLAT for discrete representation learning. The VQVAE consists of an encoder $E$, a decoder $D$, and a learnable discrete codebook $Z = \{z_k\}_{k=1}^{K}$, where $K$ is the total number of codebook vectors.

Figure 2: The overview of iLAT with $3\times3$ latent quantization. The target image $I_t$ is encoded by TS-VQGAN with a binary mask $M$, while the conditional image $I_c$ is directly encoded without a mask. As explained in Sec. 3.2, the guidance information of poses and sketches is utilized in our two tasks. For pose guidance, target pose coordinates are encoded by a Point MLP into sequential features, which are concatenated with the conditional tokens $\{c_{[s]}, c_0, \ldots, c_8, c_{pose}\}$. The TS-VQGAN decoder takes the combination of $\hat{z}_{q,t}$, converted from the generated target tokens $\{\hat{t}_1, \hat{t}_2, \ldots, \hat{t}_7\}$ in masked regions, and the directly encoded features $\hat{z}$ in unmasked regions to get the final result.
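To make the overall data flow concrete, the following is a minimal sketch of the inference pipeline described above. The objects `ts_vqgan` and `la_transformer` and all method names are placeholders chosen for illustration, not the authors' released interfaces:

```python
import torch

def ilat_inference(ts_vqgan, la_transformer, I_c, I_t, M, M_q):
    """High-level sketch of the iLAT test-time pipeline (hypothetical API).

      I_c, I_t : conditional / target images, shape (B, 3, H, W)
      M        : binary image mask, shape (B, 1, H, W), 1 = region to edit
      M_q      : M resized to the h x w token grid, shape (B, 1, h, w)
    """
    with torch.no_grad():
        # 1) Encode both images; the target is encoded with two-stream
        #    convolutions so masked content cannot leak into unmasked features.
        z_hat_c = ts_vqgan.encode(I_c)            # (B, C, h, w)
        z_hat_t = ts_vqgan.encode(I_t, mask=M)    # (B, C, h, w)

        # 2) Convert features to codebook indices (cf. Eq. (1)).
        c_tokens = ts_vqgan.quantize_to_indices(z_hat_c)   # (B, h*w)
        t_tokens = ts_vqgan.quantize_to_indices(z_hat_t)   # (B, h*w)

        # 3) Autoregressively predict only the masked target tokens, attending
        #    to condition tokens and unmasked target tokens via the LA mask.
        t_tokens = la_transformer.generate_masked(c_tokens, t_tokens, M_q)

        # 4) Decode a mixture: quantized vectors in masked regions, continuous
        #    encoder features elsewhere (cf. Eq. (4)).
        z_q_t = ts_vqgan.indices_to_vectors(t_tokens)      # (B, C, h, w)
        mixed = z_q_t * M_q + z_hat_t * (1.0 - M_q)
        return ts_vqgan.decode(mixed)                      # output image I_o
```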
Given an image $I \in \mathbb{R}^{H\times W\times 3}$, $E$ encodes the image into latent features $\hat{z} = E(I) \in \mathbb{R}^{h\times w\times c_e}$, where $c_e$ indicates the channel dimension of the encoder outputs. Then, the spatially unfolded $\hat{z}_{h'w'} \in \mathbb{R}^{c_e}$ ($h' \leq h$, $w' \leq w$) are replaced with the closest codebook vectors as

$$z^{(q)}_{h'w'} = \arg\min_{z_k \in Z} \|\hat{z}_{h'w'} - z_k\| \in \mathbb{R}^{c_q}, \qquad z_q = \mathrm{fold}\big(z^{(q)}_{h'w'},\ h' \leq h,\ w' \leq w\big) \in \mathbb{R}^{h\times w\times c_q}, \tag{1}$$

where $c_q$ indicates the codebook channel dimension.

Figure 3: Illustration of (a) the two-stream convolution in the TS-VQGAN encoder and (b) the LA attention mask $\hat{M}_{LA}$ (C2C: condition to condition, C2T: condition to target, T2C: target to condition, T2T: target to target; $M_{LA}$ denotes the T2T part of $\hat{M}_{LA}$). In (a), after the feature $F$ is convolved, the $3\times3$ convolution kernel spreads features from the masked regions to the unmasked regions; patches with leaked information are circled in red. For (b), a $3\times3$ quantized mask $M_q$ is assumed as the input mask. $M_{LA}$ can be divided into the global sub-mask $M_{gs}$ and the causal sub-mask $M_{cs}$. Then $t_{[s]}$, $t_3$, $t_8$ can be categorized as global pixel tokens, while the others are causal tokens: $t_1, t_2, t_5, t_7$ are masked tokens $t_m$ ($M_{q,m} = 1$), and $t_0, t_1, t_4, t_6$ are the tokens $t_{m-1}$ that need to predict the masked ones. Global tokens can be attended by all tokens, while causal tokens attend to the targets through a lower-triangular matrix. All colored entries are valued 1 and white entries are 0 in $M_{LA}$.

However, VQVAE suffers from obscure information leakage in local image synthesis if the receptive field (kernel size) of a vanilla convolution is larger than $1\times1$, as shown in Fig. 3(a). Intuitively, each $3\times3$ convolution layer spreads the masked features to the region just outside the mask border. Furthermore, the multi-layer convolutional encoder $E$ accumulates this information leakage, which makes the model learn local generation with unreasonable confidence and leads to overfitting (see Sec. 4.3). To this end, we present two-stream convolutions in TS-VQGAN, as shown in Fig. 3(a). Since the masked information is only leaked to a narrow band around the mask by each $3\times3$ convolution layer, we can simply replace the corrupted features in every layer with masked ones. Thus, the influence of information leakage is eliminated without hurting the integrity of either the masked or the unmasked features. Specifically, the given image mask $M \in \mathbb{R}^{H\times W}$, where 1 means masked regions and 0 means unmasked regions, is resized into $M'$ with max-pooling to fit the feature size. The two-stream convolution converts the input feature $F$ into the masked feature $F_m = F \odot (1 - M')$, where $\odot$ is element-wise multiplication. Then, both $F$ and $F_m$ are convolved with shared weights and combined according to the leaked regions $M_l$, which can be obtained from the difference of the convolved mask as

$$M_l = \mathrm{clip}(\mathrm{conv}_1(M'), 0, 1) - M', \qquad M_l[M_l > 0] = 1, \tag{2}$$

where $\mathrm{conv}_1$ is implemented with an all-ones $3\times3$ kernel. Therefore, the output of the two-stream convolution can be written as

$$F_c = \mathrm{conv}(F) \odot (1 - M_l) + \mathrm{conv}(F_m) \odot M_l. \tag{3}$$

So the leaked regions are replaced with features that depend only on unmasked regions. Besides, the masked features can be further leveraged for AR learning without any limitations.
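A minimal PyTorch sketch of such a two-stream convolution layer, following Eqs. (2) and (3); this is a re-implementation guess under the stated assumptions (a single shared-weight convolution and a mask already resized to the feature resolution), not the authors' released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamConv2d(nn.Module):
    """Two-stream convolution following Eqs. (2)-(3): wherever a vanilla 3x3
    convolution would leak masked content into unmasked features, the output
    is taken from a second stream whose masked regions were zeroed out."""

    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # Both streams share the same weights: one conv applied to two inputs.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.kernel_size = kernel_size
        self.padding = padding

    def forward(self, x, mask):
        # x:    input feature F, shape (B, C, H, W)
        # mask: resized binary mask M', shape (B, 1, H, W), 1 = masked region
        x_masked = x * (1.0 - mask)  # F_m = F (1 - M')

        # Eq. (2): leaked region M_l = pixels reached by spreading the mask
        # with an all-ones kernel, minus the mask itself, then binarized.
        ones = torch.ones(1, 1, self.kernel_size, self.kernel_size, device=x.device)
        spread = F.conv2d(mask, ones, padding=self.padding).clamp(0, 1)
        leak = ((spread - mask) > 0).float()  # M_l

        # Eq. (3): F_c = conv(F)(1 - M_l) + conv(F_m)M_l
        return self.conv(x) * (1.0 - leak) + self.conv(x_masked) * leak
```

In the TS-VQGAN encoder, each vanilla convolution would be swapped for such a layer, with the image mask max-pooled to the corresponding feature resolution.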
Compared with VQVAE, we replace all vanilla convolutions with two-stream convolutions in the encoder of TS-VQGAN. Note that the decoder $D$ does not need to prevent information leakage at all, since the decoding process takes place after the AR generation of the transformer, as shown in Fig. 2. For the VQVAE, the decoder $D$ decodes the codebook vectors $z_q$ obtained from Eq. (1) and reconstructs the output image as $I_o = D(z_q)$. Although VQGAN [14] can generate more reliable textures with adversarial training, handling complex real-world backgrounds and precise face details is still difficult for existing discrete learning methods. In TS-VQGAN, we further finetune the model with local quantized learning, which can be written as

$$I_o = D\big(z_q \odot M_q + \hat{z} \odot (1 - M_q)\big), \tag{4}$$

where $M_q \in \mathbb{R}^{h\times w}$ is the resized mask for the quantized vectors, and $\hat{z}$ is the output of the encoder. In Eq. (4), unmasked features are taken directly from the encoder, while masked features are replaced with codebook vectors, which places the model between an AE and a VQVAE. This simple trick effectively maintains the fidelity of the unmasked regions and reduces the number of quantized vectors that have to be generated autoregressively, which also leads to more efficient local AR inference. Note that the back-propagation of Eq. (4) is implemented with the straight-through gradient estimator [4].

### 3.2 Local Autoregressive Transformer Learning

From the discrete representation learning in Sec. 3.1, we obtain the discrete codebook vectors $z_{q,c}, z_{q,t} \in \mathbb{R}^{h\times w\times c_q}$ for the conditional and target images respectively. Then the conditional and target image tokens $\{c_i, t_j\}_{i,j=1}^{hw} \in \{0, 1, \ldots, K-1\}$ are obtained from the index-based representation of $z_{q,c}, z_{q,t}$ in the codebook, as sequences of length $hw$, where $K$ indicates the total number of codebook vectors. Given the resized target mask $M_q \in \mathbb{R}^{h\times w}$, the second stage needs to learn the AR likelihood of the masked target tokens $\{t_m\}$ (where $M_{q,m} = 1$) conditioned on the conditional tokens $\{c_i\}_{i=1}^{hw}$ and the other unmasked target tokens $\{t_u\}$ (where $M_{q,u} = 0$) as

$$p(t_m \mid c, t_u) = \prod_{j} p(t_{m,j} \mid c, t_u, t_{m,<j}).$$
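As a sketch of how this objective could be computed in practice, assuming the transformer, run under the LA attention mask, already outputs per-position logits over the $K$ codebook entries (the function and argument names below are ours, not the authors' code):

```python
import torch
import torch.nn.functional as F

def masked_token_nll(logits, target_tokens, mask_q):
    """Negative log-likelihood over the masked target tokens only, matching
    p(t_m | c, t_u) = prod_j p(t_{m,j} | c, t_u, t_{m,<j}).

      logits        : (B, h*w, K) per-position predictions for target tokens
      target_tokens : (B, h*w)    ground-truth codebook indices t
      mask_q        : (B, h*w)    resized binary mask M_q, 1 = token to generate
    """
    mask_q = mask_q.float()
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*h*w, K)
        target_tokens.reshape(-1),            # (B*h*w,)
        reduction="none",
    ).view_as(mask_q)
    # Unmasked target tokens are given as context, so they contribute no loss.
    return (nll * mask_q).sum() / mask_q.sum().clamp(min=1.0)
```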