# Generating images with sparse representations

Charlie Nash¹, Jacob Menick¹, Sander Dieleman¹, Peter Battaglia¹

*Equal contribution. ¹DeepMind, London, United Kingdom. Correspondence to: Charlie Nash. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

## Abstract

The high dimensionality of images presents architecture and sampling-efficiency challenges for likelihood-based generative models. Previous approaches such as VQ-VAE use deep autoencoders to obtain compact representations, which are more practical as inputs for likelihood-based models. We present an alternative approach, inspired by common image compression methods like JPEG, and convert images to quantized discrete cosine transform (DCT) blocks, which are represented sparsely as a sequence of DCT channel, spatial location, and DCT coefficient triples. We propose a Transformer-based autoregressive architecture, which is trained to sequentially predict the conditional distribution of the next element in such sequences, and which scales effectively to high-resolution images. On a range of image datasets, we demonstrate that our approach can generate high-quality, diverse images, with sample metric scores competitive with state-of-the-art methods. We additionally show that simple modifications to our method yield effective image colorization and super-resolution models.

## 1. Introduction

Deep generative models of images are neural networks trained to output synthetic imagery. Current models generate sample images that are difficult for humans to distinguish from real images, and have found applications in image super-resolution (NVIDIA), colorization (Antic), and text-guided generation (Ramesh et al.). Such models fall broadly into three categories: generative adversarial networks (GANs, Goodfellow et al. 2014), likelihood-based models, and energy-based models. GANs use discriminator networks that are trained to distinguish generator samples from real examples. Likelihood-based models, including variational autoencoders (VAEs, Kingma & Welling 2014; Rezende et al. 2014), normalizing flows (Rezende & Mohamed, 2015), and autoregressive models (Van Oord et al., 2016), directly optimize the model log-likelihood or the evidence lower bound. Energy-based models estimate a scalar energy for each example that corresponds to an unnormalized log-probability, and can be trained with a variety of objectives (Du & Mordatch, 2019).

Figure 1. Images often have sparse structure, with salient content distributed unevenly across locations. Our model, DCTransformer, predicts where to place content, as well as what content to add. (Left) Images generated by DCTransformer, and (right) associated heatmaps of image locations selected in the generation process.

Likelihood-based models offer several important advantages. The training objective incentivizes learning the full data distribution, rather than only a subset of the modes (termed mode-dropping), which is a common downside of GANs. Likelihood-based training tends to be more stable than adversarial alternatives, and the objective can be used to detect overfitting using a held-out set.
The dramatic recent advances in modelling natural language made by GPT-3 (Brown et al., 2020) demonstrate that optimizing a simple log-likelihood objective, using expressive models and large datasets, is an effective approach for modelling complex data. Optimizing the likelihood of pixel-based images can be problematic due to their complexity and high dimensionality, however. To our knowledge, no likelihood-based model operating on raw pixels has demonstrated competitive sample quality on the ImageNet dataset (Russakovsky et al., 2015) at a resolution of 256x256 or higher. For autoregressive models, conditioning on, and sequentially sampling, the hundreds of thousands of pixels in a typical image can be prohibitive. Some likelihood-based approaches address this by reducing the dimensionality using, e.g., low-precision or quantized color spaces and images (Kingma & Dhariwal, 2018; Chen et al., 2020; Van Oord et al., 2016). VQ-VAE (Oord et al., 2017; Razavi et al., 2019) uses a vector-quantized autoencoder network to first perform neural lossy compression before downstream generative modelling, which reduces representation size while maintaining quality.

Beyond generative models, the dimensionality of images poses a challenge whenever data storage, transmission, and processing budgets are at a premium. Fortunately, natural images have tremendous redundancy¹, which all modern image compression methods exploit: even the images in this paper's PDF file are stored in a compressed format. JPEG (Wallace, 1992) and other popular lossy image compression methods use the discrete cosine transform (DCT) (Ahmed et al., 1974; Rao & Yip, 1990) to separate spatial frequencies of an image, and encode them with controllable resource budgets (e.g., the quality parameter). This takes advantage of the statistical structure of natural images (e.g., smooth signals with high-frequency noise) and human vision (e.g., low frequencies tend to be more perceptually salient) by dropping high-frequency information, to strike favorable efficiency/quality trade-offs (Wang et al., 2004).

¹Kersten (1987) estimated that 4-bit grayscale pixel images have (an upper bound of) 1.42 median bits of information per pixel, which means they can generally be compressed to 1.42/4 ≈ 35.5% of their original size. They used human vision as a model, analogous to how Shannon (1951) measured the redundancy of English by having people guess the next character in a text sequence.

To capitalize on the decades of engineering behind modern compression tools, we propose a generative model over DCT representations of images, rather than pixels. We convert an image to a 3D tensor of quantized DCT coefficients, and represent them sparsely as a sequence of 3-tuples that encode the DCT channel, spatial location, and DCT coefficient for each non-zero element in the tensor. We present a novel Transformer-based autoregressive architecture (Vaswani et al., 2017), called DCTransformer, which is trained to sequentially predict the conditional distribution over the next element in the sequence, resulting in a model that predicts both where to add content in an image, and what content to add (Figure 1). We find that the DCTransformer's sample quality is competitive with state-of-the-art GANs, but with better sample diversity.
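To illustrate the sparse 3-tuple representation just described, the following is a minimal sketch in NumPy, not code from the paper; the function name `sparsify` and the row-major flattening of spatial positions are our own assumptions, and Section 2 gives the full details, including the channel ordering and stop token.

```python
import numpy as np

def sparsify(dct_image: np.ndarray) -> list[tuple[int, int, int]]:
    """Convert a dense (H/B, W/B, B*B) tensor of quantized DCT coefficients
    into (DCT channel, spatial position, value) triples for non-zero entries.
    Spatial position is assumed to be a flat row-major index over the block grid."""
    n_rows, n_cols, n_channels = dct_image.shape
    triples = []
    for channel in range(n_channels):                  # low to high frequency bands
        rows, cols = np.nonzero(dct_image[:, :, channel])
        for r, c in zip(rows, cols):
            triples.append((channel, r * n_cols + c, int(dct_image[r, c, channel])))
    return triples
```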
We find sparse DCT-based representations help mitigate the inference time, memory, and compute costs of traditional pixel-based autoregressive models, as well as the time-consuming training of neural lossy compression embedding functions used in models such as VQ-VAE (Oord et al., 2017).

## 2. DCT-based sparse image representations

### 2.1. Block DCT

The DCT projects an image into a collection of cosine components at differing 2D frequencies. The two-dimensional DCT is typically applied to zero-centered B × B pixel blocks P to obtain a B × B DCT block D:

$$D_{uv} = \frac{\alpha(u)\,\alpha(v)}{4} \sum_{x=0}^{B-1}\sum_{y=0}^{B-1} P_{xy} \cos\left[\frac{(2x+1)u\pi}{2B}\right] \cos\left[\frac{(2y+1)v\pi}{2B}\right], \tag{1}$$

$$\alpha(u) = \begin{cases} \tfrac{1}{\sqrt{2}} & \text{if } u = 0, \\ 1 & \text{otherwise,} \end{cases} \tag{2}$$

where x and y represent horizontal and vertical pixel coordinates, u and v index the horizontal and vertical spatial frequencies, and α is a normalizing scale factor to enforce orthonormality.

We follow the JPEG codec and split images into the YCbCr color space, containing a brightness component Y (luma) and two color components Cb and Cr (chroma). We perform 2x downsampling in both the horizontal and vertical dimensions for the two chroma channels, and then apply the DCT independently to blocks in each channel. Chroma downsampling in this way substantially reduces color information, at a minimal perceptual cost.

### 2.2. Quantization

In the JPEG codec, DCT blocks are quantized by dividing elementwise by a quantization matrix, and then rounding to the nearest integer. The quantization matrix is typically structured so that higher-frequency components are squashed to a larger extent, as high-frequency variation is typically harder to detect than low-frequency variation. We follow the same approach, and use quality-parameterized quantization matrices for 8x8 blocks as defined by the Independent JPEG Group in all our experiments. For block sizes other than 8, we interpolate the size-8 quantization matrix. Appendix A has more details.

Figure 2. a) The input image is split into 8x8 blocks of pixels. b-c) The symmetric 2D DCT is used to transform each block into frequency coefficients. d) The block is flattened using the indicated zigzag ordering, and then quantized using a quality-parameterized quantization vector. e) The collection of quantized DCT vectors can be reformed into a DCT image, with 8x8=64 channels. f) The DCT image is converted to a coordinate list of non-zero channel-position-value tuples, as a sparsified representation of the image. g) Images decoded at intermediate steps indicated on the coordinate list in f).

### 2.3. Sparsity

Sparsity is induced in DCT blocks through the quantization process, which squashes high-frequency components and rounds low values to zero. The JPEG codec takes advantage of this sparsity by flattening the DCT blocks using a zigzag ordering from low to high frequency components, and applying a run-length encoding to compactly represent the strings of zeros that typically appear at the end of the DCT vectors.

We use an alternative sparse representation, where zigzag-flattened DCT vectors are reassembled in their corresponding spatial positions into a DCT image of size H/B × W/B × B², where H and W are the height and width of the image, respectively, and B is the block size (Figure 2e). The DCT image's non-zero elements are then serialized to a sparse list consisting of (DCT channel, spatial position, value) tuples. This is repeated for the Y, Cb, and Cr image channels, with the Y channel occupying the first B² DCT bands, and the Cb and Cr channels occupying the second and third sets of B² DCT bands, respectively.
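To make the block-DCT, quantization, and zigzag-flattening steps (Figure 2a-d) concrete, here is a minimal sketch, not taken from the paper: it assumes an 8x8 block size, uses SciPy's orthonormal 2D DCT-II (which coincides with Eqs. (1)-(2) for B = 8), and takes a caller-supplied quantization matrix; the flat matrix in the example stands in for the quality-parameterized IJG matrices.

```python
import numpy as np
from scipy.fft import dctn  # 2D DCT-II; norm="ortho" matches Eqs. (1)-(2) for B = 8

def zigzag_indices(B: int) -> list[tuple[int, int]]:
    """(u, v) index pairs in zigzag order, from low to high frequency."""
    return sorted(((u, v) for u in range(B) for v in range(B)),
                  key=lambda t: (t[0] + t[1],                        # anti-diagonal
                                 t[0] if (t[0] + t[1]) % 2 else t[1]))

def encode_block(pixel_block: np.ndarray, quant_matrix: np.ndarray) -> np.ndarray:
    """Zero-center a BxB pixel block, apply the 2D DCT, quantize, and
    zigzag-flatten to a length-B*B integer vector (cf. Figure 2a-d)."""
    B = pixel_block.shape[0]
    centered = pixel_block.astype(np.float64) - 128.0     # zero-center 8-bit pixels
    dct_block = dctn(centered, norm="ortho")               # Eqs. (1)-(2)
    quantized = np.round(dct_block / quant_matrix).astype(np.int32)
    return np.array([quantized[u, v] for u, v in zigzag_indices(B)])

# Example: one 8x8 block with a flat quantization matrix (illustration only;
# the paper uses quality-parameterized IJG quantization matrices).
block = np.random.randint(0, 256, size=(8, 8))
coeffs = encode_block(block, quant_matrix=np.full((8, 8), 16.0))
print(coeffs)  # many trailing zeros, i.e. the sparsity exploited below
```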
The spatial positions of the 2x-downsampled chroma channels are multiplied by 2 to bring them into correspondence with the Y channel positions. The resulting lists are concatenated, and a stopping token is added to indicate the end of the variable-length sequences. For unconditional image generation we sort from low to high frequencies, with the color channels interleaved at intervals with the luminance channel. This results in a natural upsampling structure, where low-frequency content is represented first, and high-frequency content is progressively added, as shown in Figure 2g. Note that this ordering is not the only option: an alternative that we discuss in Section 4.3 places the luma data before the chroma data, resulting in an image colorization scheme.

Figure 3. Bits per image-subpixel for dense and sparse block-DCT representations, quantized at various quality settings, reported for 1000 images from CLEVR (Johnson et al., 2017), FFHQ (Karras et al., 2019) and ImageNet (Russakovsky et al., 2015). Dense representations use the same number of bits in each image location, whereas sparse representations use more bits on regions of greater detail, resulting in variable-length codes. For spatially sparse image datasets like CLEVR, sparse DCT representations are substantially more compact, even at higher DCT quality settings.

Figure 3 compares the size of dense and sparse representations as a function of the DCT quality setting, and shows that for all but the highest quality setting, sparse representations are substantially smaller.

Figure 4. Chunk-based training and stacked Transformer architecture. A target chunk is selected during training, and previous elements are collected in an input slice. The input slice is gathered into a 3D DCT image tensor. The DCT image is flattened and embedded using a Transformer encoder, and the target slice is passed through a series of right-masked Transformer decoders, each of which performs cross-attention into the encoded DCT image, in order to predict channels, positions and values.

## 3. DCTransformer

We model the sparse DCT sequence autoregressively, by predicting the distribution of the next element conditioned on the previous elements. Samples are generated by iteratively sampling from the predictive distributions, and decoding the resulting sparse DCT sequence to images. For a sequence consisting of channel, position, value tuples $[t_l]_{l=1}^{L} = [(c_l, p_l, v_l)]_{l=1}^{L}$, the joint distribution factors as:

$$p(t_{1:L}) = \prod_{l=1}^{L} p(c_l \mid t_{<l}) \, p(p_l \mid c_l, t_{<l}) \, p(v_l \mid c_l, p_l, t_{<l})$$
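As an illustration of this factorization, the following sketch (our own example, not the paper's implementation) shows how sampling proceeds by drawing the channel, then the position, then the value of each element, each conditioned on everything sampled so far. The `predict_channel`, `predict_position`, and `predict_value` functions are hypothetical stand-ins for the Transformer decoder heads described in Figure 4, and the assumption that the stop token is emitted by the channel head is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_categorical(logits: np.ndarray) -> int:
    """Draw one index from the categorical distribution defined by logits."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def sample_sequence(predict_channel, predict_position, predict_value,
                    stop_token: int, max_len: int = 10_000):
    """Autoregressive sampling of (channel, position, value) triples.

    predict_* map the sequence sampled so far (plus the partially sampled
    triple) to logits over the next quantity, mirroring the factorization
    p(c_l | t_<l) p(p_l | c_l, t_<l) p(v_l | c_l, p_l, t_<l).
    """
    sequence = []
    for _ in range(max_len):
        c = sample_categorical(predict_channel(sequence))
        if c == stop_token:       # stop token ends the variable-length sequence
            break
        p = sample_categorical(predict_position(sequence, c))
        v = sample_categorical(predict_value(sequence, c, p))
        sequence.append((c, p, v))
    return sequence
```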