# CogView: Mastering Text-to-Image Generation via Transformers

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang
Tsinghua University; DAMO Academy, Alibaba Group; BAAI
{dm18@mails, jietang@mail}.tsinghua.edu.cn

Text-to-image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer to advance this problem. We also demonstrate finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work, DALL-E.[^1]

(Panel labels of Figure 1, first row: "A tiger is playing football.", "A coffee cup printed with a cat. Sky background.", "A beautiful young blond woman talking on a phone.", "A Big Ben clock towering over the city of London.", "A man is flying to the moon on his bicycle.", "A couple wearing leather biker garb rides a motorcycle."; second row: "Super-resolution", "Mid-lake pavilion. Chinese traditional drawing.", "Statue of Liberty. Oil painting.", "Lion. Cartoon.", "A tiger is playing football. Sketch.", "Houses.")

Figure 1: Samples generated by CogView. The text in the first row is either from MS COCO (outside our training set) or user queries on our demo website. The images in the second row are finetuned results for different styles or super-resolution. The actual input text is in Chinese, which is translated into English here for better understanding. More samples for captions from MS COCO are included in Appendix F.

## 1 Introduction

> There are two things for a painter, the eye and the mind... eyes, through which we view the nature; brain, in which we organize sensations by logic for meaningful expression. (Paul Cézanne [17])

[^1]: Codes and models are at https://github.com/THUDM/CogView. We also have a demo website of our latest model at https://wudao.aminer.cn/CogView/index.html (without post-selection).

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

As contrastive self-supervised pretraining has revolutionized computer vision (CV) [24, 21, 8, 32], visual-language pretraining, which brings high-level semantics to images, is becoming the next frontier of visual understanding [38, 30, 39]. Among various pretext tasks, text-to-image generation expects the model to (1) disentangle shape, color, gesture and other features from pixels, (2) understand the input text, (3) align objects and features with the corresponding words and their synonyms, and (4) learn complex distributions to generate overlapping and composite objects and features. This, like painting, is beyond basic visual functions (related to the eyes and V1-V4 in the brain [22]) and requires a higher-level cognitive ability (more related to the angular gyrus in the brain [3]).

The attempts to teach machines text-to-image generation can be traced to the early days of deep generative models, when Mansimov et al. [35] added text information to DRAW [20]. Then Generative Adversarial Nets [19] (GANs) began to dominate this task. Reed et al. [42] fed the text embeddings to both the generator and the discriminator as extra inputs. StackGAN [54] decomposed the generation into a sketch-refinement process.
AttnGAN [51] used attention on words to focus on the corresponding subregions. ObjGAN [29] generated images following a text → boxes → layouts → image process. DM-GAN [55] and DF-GAN [45] introduced new architectures, e.g. dynamic memory or deep fusion blocks, for better image refinement. Although these GAN-based models can perform reasonable synthesis on simple and domain-specific datasets, e.g. Caltech-UCSD Birds 200 (CUB), the results on complex and domain-general scenes, e.g. MS COCO [31], are far from satisfactory.

Recent years have seen a rise of auto-regressive generative models. Generative Pre-Training (GPT) models [37, 4] leveraged Transformers [48] to learn language models on large-scale corpora, greatly promoting the performance of natural language generation and few-shot language understanding [33]. Auto-regressive models are not nascent in CV. PixelCNN, PixelRNN [47] and Image Transformer [36] factorized the probability density function of an image over its sub-pixels (color channels in a pixel) with different network backbones, showing promising results. However, a real image usually comprises millions of sub-pixels, indicating an unaffordable amount of computation for large models. Even the biggest pixel-level auto-regressive model, ImageGPT [7], was pretrained on ImageNet at a maximum resolution of only 96×96.

The framework of Vector Quantized Variational AutoEncoders (VQ-VAE) [46] alleviates this problem. In stage 1, VQ-VAE trains an encoder to compress the image into a low-dimensional discrete latent space and a decoder to recover the image from the hidden variables. Then in stage 2, an auto-regressive model (such as PixelCNN [47]) learns to fit the prior of the hidden variables. This discrete compression loses less fidelity than direct downsampling, while maintaining the spatial relevance of pixels. Therefore, VQ-VAE revitalized auto-regressive models in CV [41]. Following this framework, Esser et al. [15] used a Transformer to fit the prior and further switched from an L2 loss to a GAN loss for the decoder training, greatly improving the performance of domain-specific unconditional generation.

The idea of CogView comes naturally: large-scale generative joint pretraining on both text and image (VQ-VAE) tokens. We collect 30 million high-quality (Chinese) text-image pairs and pretrain a Transformer with 4 billion parameters. However, large-scale text-to-image generative pretraining can be very unstable due to the heterogeneity of the data. We systematically analyze the reasons and solve this problem with the proposed Precision Bottleneck Relaxation and Sandwich LayerNorm. As a result, CogView greatly advances the quality of text-to-image generation.

A recent work, DALL-E [39], independently proposed the same idea and was released earlier than CogView. Compared with DALL-E, CogView steps forward on the following four aspects:

- CogView outperforms DALL-E and previous GAN-based methods by a large margin according to the Fréchet Inception Distance (FID) [25] on blurred MS COCO, and is the first open-source large text-to-image transformer.
- Beyond zero-shot generation, we further investigate the potential of finetuning the pretrained CogView. CogView can be adapted to diverse downstream tasks, such as style learning (domain-specific text-to-image), super-resolution (image-to-image), image captioning (image-to-text), and even text-image reranking.
- The finetuned CogView enables self-reranking for post-selection and gets rid of the additional CLIP model [38] used in DALL-E. It also provides a new metric, Caption Loss, to measure the quality and accuracy of text-to-image generation at a finer granularity than FID and Inception Score (IS) [43].
- We propose PB-relaxation and Sandwich-LN to stabilize the training of large Transformers on complex datasets. These techniques are very simple, eliminate overflow in forwarding (characterized as NaN losses), and make CogView trainable with almost pure FP16 (O2).[^2] They can also be generalized to the training of other Transformers.

## 2 Method

### 2.1 Theory

In this section, we derive the theory of CogView from VAE[^3] [26]: CogView optimizes the Evidence Lower BOund (ELBO) of the joint likelihood of image and text. Without the text t, the following derivation reduces to a clear re-interpretation of VQ-VAE.

Suppose the dataset (X, T) = {x_i, t_i}, i = 1, ..., N, consists of N i.i.d. samples of an image variable x and its description text variable t. We assume the image x can be generated by a random process involving a latent variable z: (1) t_i is first generated from a prior p(t; θ). (2) z_i is then generated from the conditional distribution p(z|t = t_i; θ). (3) x_i is finally generated from p(x|z = z_i; ψ). We will use a shorthand form like p(x_i) to refer to p(x = x_i) in the following part.

Let q(z|x_i; φ) be the variational distribution, which is the output of the encoder φ of the VAE. The log-likelihood and the evidence lower bound (ELBO) can be written as:

$$
\log p(X, T; \theta, \psi) = \sum_{i=1}^{N} \log p(t_i; \theta) + \sum_{i=1}^{N} \log p(x_i \mid t_i; \theta, \psi) \tag{1}
$$

$$
\geq \sum_{i=1}^{N} \Big( \underbrace{\log p(t_i; \theta)}_{\text{NLL loss for text}} \;-\; \underbrace{\mathbb{E}_{z_i \sim q(z|x_i; \phi)}\big[-\log p(x_i \mid z_i; \psi)\big]}_{\text{reconstruction loss}} \;-\; \underbrace{\mathrm{KL}\big(q(z|x_i; \phi)\,\big\|\,p(z|t_i; \theta)\big)}_{\text{KL between } q \text{ and the (text-conditional) prior}} \Big). \tag{2}
$$

The framework of VQ-VAE differs from traditional VAE mainly in the KL term. Traditional VAE fixes the prior p(z|t_i; θ), usually as N(0, I), and learns the encoder φ. However, this leads to posterior collapse [23], meaning that q(z|x_i; φ) sometimes collapses towards the prior. VQ-VAE instead fixes φ and fits the prior p(z|t_i; θ) with another model parameterized by θ. This technique eliminates posterior collapse, because the encoder φ is now only updated to optimize the reconstruction loss. In exchange, the approximated posterior q(z|x_i; φ) could be very different for different x_i, so we need a very powerful model for p(z|t_i; θ) to minimize the KL term.

Currently, the most powerful generative model, the Transformer (GPT), copes with sequences of tokens over a discrete codebook. To use it, we let z ∈ {0, ..., |V|−1}^{h×w}, where |V| is the size of the codebook and h×w is the number of dimensions of z. The sequence z_i can either be sampled from q(z|x_i; φ), or taken directly as z_i = argmax_z q(z|x_i; φ). We choose the latter for simplicity, so that q(z|x_i; φ) becomes a one-point distribution at z_i. Equation (2) can then be rewritten as:

$$
\sum_{i=1}^{N} \Big( -\underbrace{\mathbb{E}_{z_i \sim q(z|x_i; \phi)}\big[-\log p(x_i \mid z_i; \psi)\big]}_{\text{reconstruction loss}} \;+\; \underbrace{\log p(t_i; \theta)}_{\text{NLL loss for text}} \;+\; \underbrace{\log p(z_i \mid t_i; \theta)}_{\text{NLL loss for } z} \Big). \tag{3}
$$

The learning process is then divided into two stages: (1) the encoder φ and decoder ψ learn to minimize the reconstruction loss; (2) a single GPT optimizes the two negative log-likelihood (NLL) losses by concatenating the text t_i and z_i as one input sequence. As a result, the first stage degenerates into a pure discrete auto-encoder, serving as an image tokenizer that transforms an image into a sequence of tokens; the GPT in the second stage undertakes most of the modeling task. Figure 3 illustrates the framework of CogView.
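As a concrete (and heavily simplified) illustration of stage 2, the sketch below trains a toy causal Transformer on the concatenation of text and image tokens with a single next-token cross-entropy loss, which covers both NLL terms of Equation (3). The `ToyGPT` class, the vocabulary sizes, and the shared-vocabulary offset are our own illustrative assumptions, not the actual CogView code.

```python
# Minimal sketch of CogView's stage-2 objective (illustrative; not the official implementation).
# A causal Transformer is trained with next-token prediction over the concatenation
# [text tokens][image tokens]; both parts contribute equally to the loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB, D_MODEL, MAX_LEN = 50000, 8192, 256, 64  # toy sizes

class ToyGPT(nn.Module):
    """A tiny causal Transformer standing in for the 4B-parameter CogView backbone."""
    def __init__(self, vocab_size):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, vocab_size)

    def forward(self, tokens):
        n = tokens.size(1)
        h = self.emb(tokens) + self.pos(torch.arange(n, device=tokens.device))
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)  # left-to-right mask
        return self.head(self.blocks(h, mask=causal))

# Text and image tokens share one vocabulary here (image ids offset by TEXT_VOCAB); an assumption.
text_tokens = torch.randint(0, TEXT_VOCAB, (2, 16))                  # e.g. from SentencePiece
image_tokens = torch.randint(0, IMAGE_VOCAB, (2, 48)) + TEXT_VOCAB   # e.g. from the image tokenizer
seq = torch.cat([text_tokens, image_tokens], dim=1)                  # [text][image]

model = ToyGPT(TEXT_VOCAB + IMAGE_VOCAB)
logits = model(seq[:, :-1])                                          # predict token i+1 from tokens <= i
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))
loss.backward()  # one loss covers both the "NLL loss for text" and the "NLL loss for z" terms
```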
[^2]: Meaning that all computation, including forwarding and backwarding, is in FP16 without any conversion, but the optimizer states and the master weights are FP32.
[^3]: In this paper, bold font denotes a random variable, and regular font denotes a concrete value. See this comprehensive tutorial [12] for the basics of VAE.

### 2.2 Tokenization

In this section, we introduce the details of the tokenizers in CogView and compare different training strategies for the image tokenizer (VQ-VAE stage 1).

Tokenization for text is already well studied, e.g. BPE [16] and SentencePiece [28]. In CogView, we ran SentencePiece on a large Chinese corpus to extract 50,000 text tokens.

The image tokenizer is a discrete auto-encoder, similar to stage 1 of VQ-VAE [46] or dVAE [39]. More specifically, the encoder φ maps an image x of shape H×W×3 into Enc_φ(x) of shape h×w×d, and then each d-dimensional vector is quantized to a nearby embedding in a learnable codebook {v_0, ..., v_{|V|−1}}, v_k ∈ R^d. The quantized result can be represented by h×w indices of embeddings, which give the latent variable z ∈ {0, ..., |V|−1}^{h×w}. The decoder ψ maps the quantized vectors back to a (blurred) image to reconstruct the input. In our 4B-parameter CogView, |V| = 8192, d = 256, H = W = 256, and h = w = 32.

The training of the image tokenizer is non-trivial due to the discrete selection. Here we introduce four methods to train an image tokenizer:

- Nearest-neighbor mapping, straight-through estimator [2], which is proposed by the original VQ-VAE. A common concern about this method [39] is that, when the codebook is large and not initialized carefully, only a few embeddings will be used due to the curse of dimensionality. We did not observe this phenomenon in our experiments.
- Gumbel sampling, straight-through estimator. If we follow the original VAE to reparameterize a categorical distribution over the latent variable z based on the distances between vectors, i.e.
$$
p(z_{i\cdot w+j} = v_k \mid x) = \frac{e^{-\|v_k - \mathrm{Enc}_\phi(x)_{ij}\|^2 / \tau}}{\sum_{k'=0}^{|V|-1} e^{-\|v_{k'} - \mathrm{Enc}_\phi(x)_{ij}\|^2 / \tau}},
$$
an unbiased sampling strategy is $z_{i\cdot w+j} = \mathrm{argmax}_k \big(g_k - \|v_k - \mathrm{Enc}_\phi(x)_{ij}\|^2 / \tau\big)$, $g_k \sim \mathrm{Gumbel}(0, 1)$, where the temperature τ is gradually decreased to 0. We can further use a differentiable softmax to approximate the one-hot distribution from the argmax. DALL-E adopts this method with many other tricks to stabilize the training.
- Nearest-neighbor mapping, moving average, where each embedding in the codebook is periodically updated during training as the mean of the vectors recently mapped to it [46].
- Nearest-neighbor mapping, fixed codebook, where the codebook is fixed after initialization.

Figure 2: L2 loss curves during training of the image tokenizers. All the above methods finally converge to a similar loss level.

**Comparison.** To compare the methods, we train four image tokenizers with the same architecture, dataset and random seed, and show the loss curves in Figure 2. We find that all the methods are basically evenly matched, meaning that the learning of the embeddings in the codebook is not very important if they are initialized properly. In pretraining, we use the tokenizer trained with the moving-average method. The introduction of the data and more details about tokenization are in Appendix A.
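The first method above (nearest-neighbor mapping with a straight-through estimator) can be sketched as follows. This is a generic VQ-VAE-style quantizer with toy dimensions, not the actual CogView tokenizer; the codebook and commitment losses are omitted for brevity.

```python
# Illustrative nearest-neighbor quantization with a straight-through estimator
# (generic VQ-VAE-style sketch; not the actual CogView tokenizer code).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, codebook_size=8192, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)   # {v_0, ..., v_{|V|-1}}

    def forward(self, enc):                        # enc: [batch, h, w, dim] = Enc_phi(x)
        flat = enc.reshape(-1, enc.size(-1))       # [batch*h*w, dim]
        # Squared L2 distance to every codebook vector, then nearest-neighbor indices z.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        z = dist.argmin(dim=1).reshape(enc.shape[:-1])     # [batch, h, w], values in {0, ..., |V|-1}
        quantized = self.codebook(z)                       # [batch, h, w, dim]
        # Straight-through estimator: the forward pass uses the quantized vectors,
        # while the backward pass copies gradients to the encoder output unchanged.
        quantized_st = enc + (quantized - enc).detach()
        return z, quantized_st                             # codebook/commitment losses omitted here

# Usage: quantize a toy 32x32 feature map; gradients still reach the encoder output.
vq = VectorQuantizer()
enc = torch.randn(1, 32, 32, 256, requires_grad=True)
z, q = vq(enc)
q.sum().backward()            # flows back to `enc` despite the discrete argmin
print(z.shape, enc.grad.shape)
```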
### 2.3 Auto-regressive Transformer

The backbone of CogView is a unidirectional Transformer (GPT). The Transformer has 48 layers, a hidden size of 2560, 40 attention heads, and 4 billion parameters in total. As shown in Figure 3, four separator tokens, [ROI1] (reference text of image), [BASE], [BOI1] (beginning of image), and [EOI1] (end of image), are added to each sequence to indicate the boundaries of text and image. All sequences are clipped or padded to a length of 1088.

The pretext task of pretraining is left-to-right token prediction, a.k.a. language modeling. Image and text tokens are treated equally. DALL-E [39] suggests lowering the loss weight of text tokens; on the contrary, in small-scale experiments we surprisingly find that text modeling is the key to the success of text-to-image pretraining. If the loss weight of text tokens is set to zero, the model fails to find the connections between text and image and generates images totally unrelated to the input text. We hypothesize that text modeling abstracts knowledge in the hidden layers, which can be efficiently exploited during the later image modeling.

Figure 3: The framework of CogView. [ROI1], [BASE], etc. are separator tokens. The figure shows an example input, "The head of a lovely cat" (in Chinese), split into text tokens by the text tokenizer (SentencePiece), and the input image discretized into image tokens by the image tokenizer (discrete auto-encoder); the concatenated sequence is modeled by the Transformer (GPT), and the image tokens can be recovered into an image by the decoder.

We train the model with a batch size of 6,144 sequences (6.7 million tokens per batch) for 144,000 steps on 512 V100 GPUs (32 GB). The parameters are updated by Adam with max lr = 3×10⁻⁴, β₁ = 0.9, β₂ = 0.95, and weight decay = 4×10⁻². The learning rate warms up during the first 2% of the steps and decays with cosine annealing [34]. With hyperparameters in an appropriate range, we find that the training loss mainly depends on the total number of trained tokens (tokens per batch × steps), which means that doubling the batch size (and learning rate) results in a very similar loss if the same number of tokens are trained. Thus, we use a relatively large batch size to improve parallelism and reduce the fraction of time spent on communication. We also design a three-region sparse attention to speed up training and save memory without hurting performance, which is introduced in Appendix B.

### 2.4 Stabilization of training

Currently, pretraining large models (>2B parameters) usually relies on 16-bit precision to save GPU memory and speed up the computation. Many frameworks, e.g. DeepSpeed ZeRO [40], even support only FP16 parameters. However, text-to-image pretraining is very unstable under 16-bit precision: training a 4B ordinary pre-LN Transformer quickly results in a NaN loss within 1,000 iterations. Stabilizing the training was the most challenging part of CogView, which is well aligned with DALL-E's experience.

We summarize the solution of DALL-E as tolerating the numerical problems of training. Since the values and gradients vary dramatically in scale across layers, they propose a new mixed-precision framework, per-resblock loss scaling, and store all gains, biases, embeddings, and unembeddings in 32-bit precision, with 32-bit gradients. This solution is complex, consumes extra time and memory, and is not supported by most current training frameworks.

CogView instead regularizes the values. We find that there are two kinds of instability: overflow (characterized by NaN losses) and underflow (characterized by a diverging loss). The following techniques are proposed to solve them.
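To make the overflow failure mode concrete, the following toy snippet (our own illustration, not from the paper's code) shows how FP16 breaks once hidden values reach the scale reported for the deep layers: the squares taken inside LayerNorm's variance or the products inside attention scores exceed the FP16 range, and any subsequent inf minus inf becomes the NaN seen in the loss.

```python
# Toy demonstration of the FP16 precision bottleneck (illustrative only).
import torch

print(torch.finfo(torch.float16).max)                    # 65504.0: the largest FP16 value
x = torch.tensor([3.0e4, -3.0e4], dtype=torch.float16)   # the scale observed in deep layers
squares = x * x                                          # 9e8 cannot be represented -> inf
print(squares)                                           # tensor([inf, inf], dtype=torch.float16)
print(squares - squares)                                 # tensor([nan, nan]): inf - inf = nan

# Rescaling *before* the dangerous operation keeps everything finite, which is the
# intuition behind PB-relax (e.g. LayerNorm(x) == LayerNorm(x / max|x|)).
scaled = x / x.abs().max()
print(scaled * scaled)                                   # tensor([1., 1.]): safe in FP16
```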
**Precision Bottleneck Relaxation (PB-Relax).** After analyzing the dynamics of training, we find that overflow always happens at two bottleneck operations: the final LayerNorm and the attention.

In the deep layers, the output values can explode to as large as $10^4$ to $10^5$, making the variance computation in LayerNorm overflow. Luckily, since LayerNorm(x) = LayerNorm(x / max(x)), we can relax this bottleneck by dividing by the maximum first.[^4]

The attention scores $Q^{\top}K/\sqrt{d}$ can be significantly larger than the input elements and result in overflow. Changing the computational order into $Q^{\top}(K/\sqrt{d})$ alleviates the problem. To eliminate the overflow, we notice that $\mathrm{softmax}(Q^{\top}K/\sqrt{d}) = \mathrm{softmax}(Q^{\top}K/\sqrt{d} - \text{constant})$, meaning that we can change the computation of attention into

$$
\mathrm{softmax}\left(\frac{Q^{\top}K}{\sqrt{d}}\right) = \mathrm{softmax}\left(\left(\frac{Q^{\top}}{\alpha\sqrt{d}}\,K - \max\left(\frac{Q^{\top}}{\alpha\sqrt{d}}\,K\right)\right)\times\alpha\right), \tag{4}
$$

where α is a big number, e.g. α = 32.[^5] In this way, the maximum (absolute) value of the attention scores is also divided by α, preventing it from overflowing. A detailed analysis of the attention in CogView is in Appendix C.

[^4]: We cannot directly divide x by a large constant, which would lead to underflow in the early stage of training.
[^5]: The max must be taken at least head-wise, because the values vary greatly across different heads.

Figure 4: (a) Illustration of different LayerNorm structures in Transformers. Post-LN is from the original paper; Pre-LN is currently the most popular structure; Sandwich-LN is our proposed structure to stabilize training. (b) The numerical scales in our toy experiments with 64 layers and a large learning rate. Trainings without Sandwich-LN overflow in the main branch; trainings without PB-relax overflow in attention; only the training with both can continue.

**Sandwich LayerNorm (Sandwich-LN).** The LayerNorms [1] in Transformers are essential for stable training. Pre-LN [50] is proven to converge faster and more stably than the original Post-LN, and has become the default structure of Transformer layers in recent works. However, it is not enough for text-to-image pretraining. The output of LayerNorm,

$$
\frac{x - \bar{x}}{\sqrt{\sum_i (x_i - \bar{x})^2}}\,\sqrt{d}\,\gamma + \beta,
$$

is basically proportional to the square root of the hidden size d of x, which is $\sqrt{2560} \approx 50$ in CogView. If the input values in some dimensions are obviously larger than the others (which is true for Transformers), the output values in these dimensions will also be large ($10^1$ to $10^2$). In the residual branch, these large values are magnified and added back to the main branch, which aggravates the phenomenon in the next layer and finally causes value explosion in the deep layers.

This reason behind the value explosion inspires us to restrict the layer-by-layer aggravation. We propose Sandwich LayerNorm, which also adds a LayerNorm at the end of each residual branch. Sandwich-LN keeps the scale of the input values of each layer within a reasonable range, and experiments on training a 500M-parameter model show that its influence on convergence is negligible. Figure 4(a) illustrates the different LayerNorm structures in Transformers.

**Toy Experiments.** Figure 4(b) shows the effectiveness of PB-relax and Sandwich-LN in a toy experimental setting, since training many large models for verification is not realistic. We find that deep Transformers (64 layers, hidden size 1024), large learning rates (0.1 or 0.01), and a small batch size (4) can simulate the value explosion that occurs in training with reasonable hyperparameters. PB-relax + Sandwich-LN can even stabilize these toy experiments.
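Below is a minimal PyTorch sketch of the two techniques, assuming single-head attention, toy shapes, and a simplified FFN-only residual block; it illustrates the math above rather than reproducing the released CogView implementation.

```python
# Minimal sketches of PB-relax attention and Sandwich-LN (illustrative; not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def pb_relax_attention_scores(q, k, alpha=32.0):
    """softmax(Q K^T / sqrt(d)) computed as softmax((Q/(alpha*sqrt(d)) K^T - max) * alpha),
    so the intermediate scores stay small enough for FP16. Here the max is taken per query
    row, which is even finer than the head-wise max required by the paper."""
    d = q.size(-1)
    scores = (q / (alpha * d ** 0.5)) @ k.transpose(-1, -2)   # scaled-down scores
    scores = scores - scores.amax(dim=-1, keepdim=True)       # subtract the max first
    return F.softmax(scores * alpha, dim=-1)                  # rescale only after the subtraction

class SandwichBlock(nn.Module):
    """An FFN residual branch with Sandwich-LN: a LayerNorm before the branch (as in Pre-LN)
    and another at the end of the branch, bounding what is added back to the main branch."""
    def __init__(self, hidden=2560):
        super().__init__()
        self.ln_in, self.ln_out = nn.LayerNorm(hidden), nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

    def forward(self, x):
        return x + self.ln_out(self.ffn(self.ln_in(x)))        # second LN tames the residual branch

# Quick check on random inputs.
q, k = torch.randn(1, 16, 64), torch.randn(1, 16, 64)
print(pb_relax_attention_scores(q, k).sum(-1))                 # rows sum to 1, as ordinary softmax
print(SandwichBlock()(torch.randn(2, 4, 2560)).shape)
```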
**Shrink embedding gradient.** Although we did not observe any sign of underflow after using Sandwich-LN, we find that the gradient of the token embeddings is much larger than that of the other parameters, so we simply shrink its scale by α = 0.1, which allows a larger dynamic loss scale and further prevents underflow. This can be implemented as `emb = emb*alpha + emb.detach()*(1-alpha)` in PyTorch. It seems to slow down the updating of the token embeddings, but actually does not hurt performance in our experiments, which also corresponds to a recent work, MoCo v3 [9].

**Discussion.** PB-relax and Sandwich-LN successfully stabilize the training of CogView and of an 8.3B-parameter CogView-large. They are also general for all Transformer pretraining and will enable the training of very deep Transformers in the future. As evidence, we used PB-relax to successfully eliminate the overflow in training a 10B-parameter GLM [14]. However, in general, the precision problems in language pretraining are not as significant as in text-to-image pretraining. We hypothesize that the root cause is the heterogeneity of the data, because we observed that text and image tokens are distinguished by scale in some hidden states. Another possible reason is hard-to-find underflow, as guessed by DALL-E. A thorough investigation is left for future work.

## 3 Finetuning

CogView goes further than DALL-E on finetuning. In particular, we can improve text-to-image generation by finetuning CogView for super-resolution and self-reranking. All the finetuning tasks can be completed within one day on a single DGX-2.

### 3.1 Super-resolution

Since the image tokenizer compresses 256×256-pixel images into 32×32-token sequences before training, the generated images are blurrier than real images due to the lossy compression. However, enlarging the sequence length would consume much more computation and memory due to the O(n²) complexity of the attention operations. Previous works [13] on super-resolution, or image restoration, usually deal with images already in high resolution, mapping blurred local textures to clear ones. They cannot be applied to our case, where we need to add meaningful details to the generated low-resolution images. Figure 5(b) is an example of our finetuning method and illustrates the desired behavior of super-resolution.

The motivation behind our finetuning solution for super-resolution is the belief that CogView is trained on the most complex distribution in the general domain, and objects at different resolutions have already been covered.[^6] Therefore, finetuning CogView for super-resolution should not be hard. Specifically, we first finetune CogView into a conditional super-resolution model from 16×16 image tokens to 32×32 tokens. Then we magnify an image of 32×32 tokens to 64×64 tokens (512×512 pixels) patch-by-patch via the center-continuous sliding-window strategy in Figure 5(a). This order performs better than the raster-scan order in preserving the completeness of the central area.

To prepare the data, we crop about 2 million images to 256×256 regions and downsample them to 128×128. After tokenization, we get 32×32 and 16×16 sequence pairs for the two resolutions. The pattern of the finetuning sequence is [ROI1] text tokens [BASE] [BOI1] 16×16 image tokens [EOI1] [ROI2] [BASE] [BOI2] 32×32 image tokens [EOI2], which is longer than the maximum position embedding index 1087. As a solution, we recount the position index from 0 at [ROI2].[^7]
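To make the sequence layout and the position-index restart concrete, here is a small illustrative sketch; the separator ids, token values, and caption length are placeholders rather than CogView's real vocabulary.

```python
# Illustrative layout of a super-resolution finetuning sequence (placeholder token ids;
# the real vocabulary indices and lengths come from CogView's tokenizers).
ROI1, BASE, BOI1, EOI1, ROI2, BOI2, EOI2 = range(7)          # placeholder separator ids

text_tokens     = [100 + i for i in range(30)]               # toy caption tokens
low_res_tokens  = [1000 + i for i in range(16 * 16)]         # 16x16 image tokens
high_res_tokens = [2000 + i for i in range(32 * 32)]         # 32x32 image tokens

sequence = ([ROI1] + text_tokens + [BASE, BOI1] + low_res_tokens + [EOI1]
            + [ROI2, BASE, BOI2] + high_res_tokens + [EOI2])

# Position indices restart from 0 at [ROI2], so every index stays within the pretrained
# maximum position embedding index (1087) even though the sequence exceeds 1088 tokens.
restart = sequence.index(ROI2)
position_ids = list(range(restart)) + list(range(len(sequence) - restart))
assert len(position_ids) == len(sequence) and max(position_ids) <= 1087
print(len(sequence), max(position_ids))
```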
Figure 5: (a) Center-continuous sliding window: a 64×64-token image is generated patch-by-patch in the numbered order, and overlapping positions are not overwritten. The key idea is that the tokens in the 2nd and 4th regions, which are usually regions of faces or other important parts, are generated while attending to the whole region. (b) Different super-resolution results for "a tiger is playing football": the finetuned super-resolution model does not merely transform the textures, but generates new local structures, e.g. the open mouth or the tail in the example.

[^6]: Evidence supporting this belief is that if we append "close-up view" to the end of the text, the model will generate details of a part of the object.
[^7]: One might worry that the reuse of position indices could cause confusion, but in practice the model can distinguish the two images well, probably based on whether they can attend to a [ROI2] in front of them.

### 3.2 Image Captioning and Self-reranking

Finetuning CogView for image captioning is straightforward: exchange the order of text and image tokens in the input sequences. Since the model has already learnt the correspondence between text and images, reversing the generation is not hard. We did not evaluate its performance because (1) there is no authoritative Chinese image-captioning benchmark, and (2) image captioning is not the focus of this work.

The main purpose of finetuning such a model is self-reranking. We propose the Caption Loss (CapLoss) to evaluate the correspondence between images and text. More specifically,

$$
\mathrm{CapLoss}(x, t) = \frac{1}{|t|} \sum_{i=0}^{|t|} -\log p(t_i \mid x, t_{0:i-1}),
$$

where t is a sequence of text tokens and x is the image. CapLoss(x, t) is the cross-entropy loss of the text tokens, and this method can be seen as an adaptation of inverse prompting [56] to text-to-image generation. Finally, the images with the lowest CapLosses are chosen.

Compared to additionally training another contrastive self-supervised model, e.g. CLIP [38], for reranking, our method consumes fewer computational resources because it only requires finetuning. The results in Figure 9 show that the images selected by our method perform better in FID than those selected by CLIP. Figure 6 shows an example of reranking.

Figure 6: 60 generated images for "A man in red shirt is playing video games" (selected at random from COCO), displayed in the order of CapLoss. Most bad cases are ranked in the last places. The diversity also eases the concern that CogView might be overfitting to a similar image in the training set.
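A minimal sketch of self-reranking with CapLoss is shown below; the captioning model, tokenization, and candidate images are placeholders standing in for the finetuned CogView captioner, so only the ranking logic is illustrated.

```python
# Illustrative self-reranking with CapLoss (placeholder model; not the released code).
# A finetuned image-captioning model gives log p(t_i | x, t_{0:i-1}); the candidate image
# with the lowest average negative log-likelihood of the caption is kept.
import torch
import torch.nn.functional as F

def caption_loss(caption_logits, caption_tokens):
    """Mean cross-entropy of the caption tokens under the captioning model.
    caption_logits: [len(t), vocab] logits per caption position, conditioned on the image.
    caption_tokens: [len(t)] caption token ids."""
    return F.cross_entropy(caption_logits, caption_tokens).item()

def self_rerank(candidate_images, caption_tokens, captioning_model):
    """Return the candidate images sorted by CapLoss (lowest, i.e. best, first)."""
    scored = [(caption_loss(captioning_model(img, caption_tokens), caption_tokens), i)
              for i, img in enumerate(candidate_images)]
    return [candidate_images[i] for _, i in sorted(scored)]

# Toy usage with a random stand-in "captioning model" and 60 fake candidates.
vocab, caption = 50000, torch.randint(0, 50000, (12,))
fake_model = lambda img, t: torch.randn(len(t), vocab)   # stands in for the finetuned CogView
candidates = [torch.rand(3, 256, 256) for _ in range(60)]
best = self_rerank(candidates, caption, fake_model)[0]
print(best.shape)
```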
### 3.3 Style Learning

Although CogView is pretrained to cover images as diverse as possible, the desire to generate images of a specific style or topic cannot be satisfied well. We therefore finetune models on four styles: Chinese traditional drawing, oil painting, sketch, and cartoon. Images of these styles are automatically extracted from search-engine results (Google, Baidu, Bing, etc.) with the keyword "An image of {style} style", where {style} is the name of the style. We finetune the model for each style separately, with 1,000 images per style. During finetuning, the corresponding text for the images is also "An image of {style} style". When generating, the text is "A {object} of {style} style", where {object} is the object to generate. In this way, CogView can transfer the knowledge of object shapes learned during pretraining to the style learned during finetuning. Figure 7 shows examples for the styles.

Figure 7: Generated images for "The Oriental Pearl" (a landmark of Shanghai) in different styles.

### 3.4 Industrial Fashion Design

Figure 8: Generated images for fashion design.

When the generation targets a single domain, the complexity of the textures is largely reduced. In these scenarios, we can (1) train a VQGAN [15] instead of a VQ-VAE for the latent variable, for more realistic textures, and (2) decrease the number of parameters and increase the sequence length for a higher resolution. Our three-region sparse attention (Appendix B) can speed up the generation of high-resolution images in this case. We train a 3B-parameter model on about 10 million fashion-caption pairs, using 50×50 VQGAN image tokens and decoding them into 800×800 pixels. Figure 8 shows samples of CogView for fashion design, which has been successfully deployed to the Alibaba Rhino fashion production.

## 4 Experimental Results

### 4.1 Machine Evaluation

At present, the most authoritative machine evaluation metric for general-domain text-to-image generation is the FID on MS COCO, which is not included in our training set. To compare with DALL-E, we follow the same setting, evaluating CogView on a subset of 30,000 captions sampled from the dataset, after applying a Gaussian filter with varying radius to both the ground-truth and generated images.[^8] The captions are translated into Chinese for CogView by machine translation. To compare fairly with DALL-E, we do not use super-resolution. Besides, DALL-E generates 512 images for each caption and selects the best one by CLIP, which requires generating about 15 billion tokens. To save computational resources, we select the best one from 60 generated images according to their CapLosses. The evaluation of CapLoss is on a subset of 5,000 images. We finally enhance the contrast of the generated images by 1.5. Table 1 shows the metrics for CogView and other methods.

Table 1: Metrics for machine evaluation. Statistics for DALL-E and the GANs are extracted from their figures. FID-k means that all images are blurred by a Gaussian filter with radius k.

| Model | FID-0 | FID-1 | FID-2 | FID-4 | FID-8 | IS | CapLoss |
|---|---|---|---|---|---|---|---|
| AttnGAN | 35.2 | 44.0 | 72.0 | 108.0 | 100.0 | 23.3 | 3.01 |
| DM-GAN | 26.5 | 39.0 | 73.0 | 119.0 | 112.3 | 32.2 | 2.87 |
| DF-GAN | 26.5 | 33.8 | 55.9 | 91.0 | 97.0 | 18.7 | 3.09 |
| DALL-E | 27.5 | 28.0 | 45.5 | 83.5 | 85.0 | 17.9 | - |
| CogView | 27.1 | 19.4 | 13.9 | 19.4 | 23.6 | 18.2 | 2.43 |

**Caption Loss as a Metric.** FID and IS are designed to measure the quality of unconditional generation from relatively simple distributions, usually of single objects. However, text-to-image generation should be evaluated pair-by-pair. Table 1 shows that DM-GAN achieves the best unblurred FID and IS, but is ranked last in human preference (Figure 10(a)). Caption Loss is an absolute (instead of relative, like CLIP) score, so it can be averaged across samples. It should be a better metric for this task and is more consistent with the overall scores of our human evaluation in Section 4.2.

Figure 9: IS and FID-0 for CLIP and self-reranking.

**Comparing self-reranking with CLIP.** We evaluate the FID-0 and IS of CogView-generated images selected by CLIP and by self-reranking on MS COCO. Figure 9 shows the curves with different numbers of candidates. Self-reranking gets a better FID and steadily refines FID as the number of candidates increases. CLIP performs better at increasing IS, but as discussed above, IS is not a suitable metric for this task.

[^8]: We use the same evaluation code as DM-GAN and DALL-E, which is available at https://github.com/MinfengZhu/DM-GAN.
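For clarity, the blurring step behind the FID-k numbers can be sketched as follows. This is our own illustration with an assumed PIL-based Gaussian blur and a hypothetical file path; the reported numbers use the DM-GAN evaluation code linked in the footnote, and the FID computation itself is omitted.

```python
# Illustrative preprocessing for FID-k (assumed sketch, not the official evaluation script).
# Both ground-truth and generated images are blurred with the same Gaussian radius k
# before the standard FID statistics are computed.
from PIL import Image, ImageFilter

def blur_for_fid_k(image_path, k):
    """Apply a Gaussian blur of radius k; k = 0 returns the image unchanged (FID-0)."""
    img = Image.open(image_path).convert("RGB")
    return img if k == 0 else img.filter(ImageFilter.GaussianBlur(radius=k))

# Example: prepare one image for the FID-0, FID-1, FID-2, FID-4 and FID-8 settings.
for k in (0, 1, 2, 4, 8):
    blurred = blur_for_fid_k("example.jpg", k)   # hypothetical path
    blurred.save(f"blurred_r{k}.png")
```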
**Discussion about the differences in performance between CogView and DALL-E.** Since DALL-E is pretrained with more data and parameters than CogView, why does CogView get a better FID even without super-resolution? It is hard to know the exact reason, because DALL-E is not open-source, but we guess that the reasons include: (1) CogView uses PB-relax and Sandwich-LN for a more stable optimization. (2) DALL-E uses much cartoon and rendered data, making the texture of its generated images quite different from that of the photos in MS COCO. (3) Self-reranking selects images better in terms of FID than CLIP does. (4) CogView is trained for longer (96B trained tokens for CogView vs. 56B trained tokens for DALL-E).

### 4.2 Human Evaluation

Human evaluation is much more persuasive than machine evaluation for text-to-image generation. Our human evaluation consists of 2,950 groups of comparisons between images generated by AttnGAN, DM-GAN, DF-GAN, CogView, and the recovered ground truth, i.e., the ground truth blurred by our image tokenizer. Details and an example-based comparison between the models are in Appendix E. The results in Figure 10 show that CogView outperforms the GAN-based baselines by a large margin. CogView is chosen as the best with a probability of 37.02%, competitive with the performance of the recovered ground truth (59.53%). Figure 10(b)(c) also indicates that our super-resolution model consistently improves the quality of the images, especially the clarity, which even outperforms the recovered ground truth.

Figure 10: Human evaluation results. The recovered ground truth is obtained by first encoding the ground-truth image and then decoding it, which is theoretically the upper bound of CogView.

## 5 Conclusion and Discussion

**Limitations.** A disadvantage of CogView is the slow generation, which is common for auto-regressive models, because each image is generated token-by-token. The blurriness brought by VQ-VAE is also an important limitation. These problems will be addressed in future work.

**Ethics Concerns.** Similar to Deepfake, CogView is vulnerable to malicious use [49] because of its controllable and strong capacity to generate images. Possible methods to mitigate this issue are discussed in a survey [5]. Moreover, there are usually fairness problems in generative models about humans.[^9] In Appendix D, we analyze the fairness situation in CogView and introduce a simple word-replacing method to address this problem.

We systematically investigate the framework of combining VQ-VAE and Transformers for text-to-image generation. CogView demonstrates promising results for scalable cross-modal generative pretraining, and also reveals and solves the precision problems that probably originate from data heterogeneity. We also introduce methods to finetune CogView for diverse downstream tasks. We hope that CogView can advance both the research and application of controllable image generation and cross-modal knowledge understanding, but we need to prevent it from being used to create images for misinformation.

[^9]: https://thegradient.pub/pulse-lessons

## Acknowledgments and Disclosure of Funding

We would like to thank Zhao Xue, Zhengxiao Du, Hanxiao Qu, Hanyu Zhao, Sha Yuan, Yukuo Cen, Xiao Liu, An Yang, and Yiming Ju for their help with data, machine maintenance, or discussions. We would also like to thank Zhilin Yang for presenting this work at the conference of BAAI. Funding in direct support of this work: a fund for GPUs donated by BAAI, a research fund from Alibaba Group, NSFC for Distinguished Young Scholar (61825602), and NSFC (61836013).

## References

[1] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[3] H. M. Bonnici, F. R. Richter, Y. Yazar, and J. S. Simons. Multimodal feature integration in the angular gyrus during episodic and semantic retrieval. Journal of Neuroscience, 36(20):5462–5471, 2016.
[4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[5] M. Brundage, S. Avin, J. Clark, H. Toner, P. Eckersley, B. Garfinkel, A. Dafoe, P. Scharre, T. Zeitzoff, B. Filar, et al. The malicious use of artificial intelligence: Forecasting, prevention, and mitigation. arXiv preprint arXiv:1802.07228, 2018.
[6] A. Caliskan, J. J. Bryson, and A. Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.
[7] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020.
[8] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.
[9] X. Chen, S. Xie, and K. He. An empirical study of training self-supervised visual transformers. arXiv preprint arXiv:2104.02057, 2021.
[10] K. Clark, U. Khandelwal, O. Levy, and C. D. Manning. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341, 2019.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[12] M. Ding. The road from MLE to EM to VAE: A brief tutorial. URL https://www.researchgate.net/profile/Ming-Ding-2/publication/342347643_The_Road_from_MLE_to_EM_to_VAE_A_Brief_Tutorial/links/5f1e986792851cd5fa4b2290/The-Road-from-MLE-to-EM-to-VAE-A-Brief-Tutorial.pdf.
[13] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pages 184–199. Springer, 2014.
[14] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. All NLP tasks are generation tasks: A general pretraining framework. arXiv preprint arXiv:2103.10360, 2021.
[15] P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. arXiv preprint arXiv:2012.09841, 2020.
[16] P. Gage. A new algorithm for data compression. C Users Journal, 12(2):23–38, 1994.
[17] J. Gasquet. Cézanne. pages 159–186, 1926.
[18] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
[19] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
[20] K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In International Conference on Machine Learning, pages 1462–1471. PMLR, 2015.
[21] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
[22] K. Grill-Spector and R. Malach. The human visual cortex. Annu. Rev. Neurosci., 27:649–677, 2004.
[23] J. He, D. Spokoyny, G. Neubig, and T. Berg-Kirkpatrick. Lagging inference networks and posterior collapse in variational autoencoders. In International Conference on Learning Representations, 2018.
[24] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[25] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6629–6640, 2017.
[26] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[27] J. Y. Koh, J. Baldridge, H. Lee, and Y. Yang. Text-to-image generation grounded by fine-grained user attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 237–246, 2021.
[28] T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium, Nov. 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-2012. URL https://www.aclweb.org/anthology/D18-2012.
[29] W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu, and J. Gao. Object-driven text-to-image synthesis via adversarial training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12174–12182, 2019.
[30] J. Lin, R. Men, A. Yang, C. Zhou, M. Ding, Y. Zhang, P. Wang, A. Wang, L. Jiang, X. Jia, et al. M6: A Chinese multimodal pretrainer. arXiv preprint arXiv:2103.00823, 2021.
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[32] X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and J. Tang. Self-supervised learning: Generative or contrastive. arXiv preprint arXiv:2006.08218, 1(2), 2020.
[33] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang. GPT understands, too. arXiv preprint arXiv:2103.10385, 2021.
[34] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[35] E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov. Generating images from captions with attention. ICLR, 2016.
[36] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran. Image Transformer. In International Conference on Machine Learning, pages 4055–4064. PMLR, 2018.
[37] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[38] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[39] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.
[40] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.
[41] A. Razavi, A. v. d. Oord, and O. Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. arXiv preprint arXiv:1906.00446, 2019.
[42] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning, pages 1060–1069. PMLR, 2016.
[43] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 2234–2242, 2016.
[44] P. Sharma, N. Ding, S. Goodman, and R. Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
[45] M. Tao, H. Tang, S. Wu, N. Sebe, F. Wu, and X.-Y. Jing. DF-GAN: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865, 2020.
[46] A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6309–6318, 2017.
[47] A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
[48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
[49] M. Westerlund. The emergence of deepfake technology: A review. Technology Innovation Management Review, 9(11), 2019.
[50] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu. On layer normalization in the Transformer architecture. In International Conference on Machine Learning, pages 10524–10533. PMLR, 2020.
[51] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1316–1324, 2018.
[52] S. Yuan, H. Zhao, Z. Du, M. Ding, X. Liu, Y. Cen, X. Zou, and Z. Yang. WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models. Preprint, 2021.
[53] M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al. Big Bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062, 2020.
[54] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.
[55] M. Zhu, P. Pan, W. Chen, and Y. Yang. DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5802–5810, 2019.
[56] X. Zou, D. Yin, Q. Zhong, H. Yang, Z. Yang, and J. Tang. Controllable generation from pre-trained language models via inverse prompting. arXiv preprint arXiv:2103.10685, 2021.