# Denoising Autoregressive Representation Learning

Yazhe Li (Google DeepMind), Jorg Bornschein (Google DeepMind), Ting Chen (Google DeepMind and xAI; work done while at Google DeepMind)

## Abstract

In this paper, we explore a new generative approach for learning visual representations. Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively. We find that training with a Mean Squared Error (MSE) loss alone leads to strong representations. To enhance the image generation ability, we replace the MSE loss with the diffusion objective by using a denoising patch decoder. We show that the learned representation can be improved by using tailored noise schedules and longer training in larger models. Notably, the optimal schedule differs significantly from the typical ones used in standard image diffusion models. Overall, despite its simple architecture, DARL delivers performance remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol. This marks an important step towards a unified model capable of both visual perception and generation, effectively combining the strengths of autoregressive and denoising diffusion models.

## 1. Introduction

With the rise of Large Language Models (LLMs), generative pre-training has become increasingly popular. Representations learned via next-token prediction improve performance when transferred to a diverse set of downstream tasks (Radford et al., 2018; 2021; Brown et al., 2020; Raffel et al., 2023). Beyond learning representations, the model directly generates language, acting as a user interface and allowing for interactive adjustments (Liu et al., 2021). The likelihood-based pre-training objective enables us to investigate scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022), which predict how a model's pre-training loss relates to its capacity and the amount of training data. Generally, we anticipate that achieving a lower pre-training loss leads to superior performance on downstream tasks.

In vision, however, representation learning and image generation often use separate techniques. For learning representations, methods such as contrastive learning (van den Oord et al., 2019; Chen et al., 2020b; He et al., 2019), distillation-based self-supervised learning (Grill et al., 2020; Caron et al., 2021) and masked image modelling (MIM) (He et al., 2021; Bao et al., 2022) are widely used. Despite their strengths in learning robust visual and cross-modal (Radford et al., 2021) representations, as well as their efficient use of model capacity, these methods lack generation capabilities. Furthermore, the pre-training loss, influenced by the difficulty of the pre-training task, does not serve as a reliable indicator of performance on downstream tasks.

In this paper, we investigate the potential of a unified model capable of both visual perception and generation by combining autoregressive and denoising diffusion models. We use a straightforward architecture, a decoder-only Transformer, which predicts the next image patch based on a sequence of previously observed patches. Instead of absolute or learnable positional encodings, we implement relative positional encodings using a decomposed rotary position embedding (2D RoPE). We show that 2D RoPE improves performance, in particular for causal Transformers.
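For intuition, the snippet below sketches one common way to decompose RoPE over a 2D patch grid: apply ordinary 1D RoPE to half of the feature dimensions using the patch row index and to the other half using the column index. This particular split, and all names below, are assumptions of the sketch rather than the paper's implementation (the paper's exact parameterization is given in Section 3.4).

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard 1D RoPE: rotate feature pairs (x[..., i], x[..., i + d/2])
    by pos * theta_i, with geometrically spaced frequencies theta_i.
    x: (..., seq, dim) with even dim; pos: (seq,) integer positions."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)   # (half,)
    angles = pos.to(x.dtype)[:, None] * freqs[None, :]            # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_2d(x, rows, cols):
    """Decomposed 2D RoPE (sketch): one half of the features is rotated by
    the patch row index, the other half by the column index."""
    d = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :d], rows), rope_1d(x[..., d:], cols)], dim=-1)

# Example: queries of shape (batch, num_patches, head_dim) on a 14x14 patch grid.
grid = 14
rows = torch.arange(grid).repeat_interleave(grid)   # row index of each patch, raster order
cols = torch.arange(grid).repeat(grid)              # column index of each patch
q = rope_2d(torch.randn(2, grid * grid, 64), rows, cols)  # applied to queries and keys
```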
When trained with the MSE loss, the fine-tuning performance of the model is close to that of state-of-the-art representation learning methods. To enhance the image generation ability, we introduce a denoising patch decoder and substitute the MSE loss with the diffusion objective. Our results demonstrate that model performance depends on the noise schedule employed during training. When the noise schedule is more focused on high noise levels, training with the diffusion objective leads to an improvement that becomes more pronounced with extended pre-training. Because of this concentration on high noise levels, the optimal noise schedule differs significantly from those suitable for generation purposes (Chen et al., 2020a; Nichol & Dhariwal, 2021; Rombach et al., 2022; Chen, 2023; Hoogeboom et al., 2023). This deviation from image generation models can be interpreted as a competition for model capacity between higher-level abstraction and lower-level detail.

Overall, our method significantly advances representation learning with generative pre-training. Under fair comparison conditions, our best model achieves performance remarkably close to state-of-the-art masked prediction models like Masked Autoencoders (MAE), with a minor performance gap of 1%. This demonstrates the potential of generative pre-training for vision data.

Our contributions and findings:

- **Denoising autoregressive representation learning:** We propose DARL, a generative approach for learning visual representations that demonstrates performance comparable to leading masked prediction models.
- **Decomposed RoPE:** We show that causal Transformers significantly benefit from employing 2D RoPE, an implementation of relative positional encodings.
- **MSE and diffusion objectives:** We observe that training on the MSE loss alone yields strong performance. Incorporating a denoising patch decoder further enhances representation quality and generative ability, especially in larger models with extended training and optimized noise schedules. Denoising is also beneficial when using large patch sizes.
- **Patch ordering:** Extensive analysis reveals that raster scan order is near-optimal among fixed patch orderings. Random ordering does not offer any performance advantages.

## 2. Related Works

**Self-supervised representation learning** learns by solving pretext tasks constructed without external labels. Meticulously designed pretext tasks, such as relative location prediction (Doersch et al., 2015), colorization (Zhang et al., 2016), jigsaw puzzle solving (Noroozi & Favaro, 2017) and rotation prediction (Gidaris et al., 2018), have been empirically demonstrated to learn meaningful visual representations. Contrastive learning (Bachman et al., 2019; Chen et al., 2020b) involves constructing distinct views of the same image through data augmentation. Given one view, the model is trained to distinguish data originating from the same source image from other data. The InfoNCE loss (Belghazi et al., 2018; van den Oord et al., 2019) is often used and can be seen as minimizing the distance between positive pairs while maximizing the distance between negative pairs. Alternative metrics for measuring distances between positive and negative pairs, such as L2 (Grill et al., 2020) or kernel distance (Li et al., 2021), can also be employed. The performance of contrastive learning can be improved by using a momentum encoder (He et al., 2019), which can also be leveraged to remove the need for negative examples (Grill et al., 2020; Caron et al., 2021); this can be seen as an instance of self-distillation.
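For reference, here is a minimal, generic sketch of the InfoNCE objective for two augmented views. This is standard contrastive-learning code, not part of DARL; the function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Generic InfoNCE / NT-Xent loss for two augmented views.
    z1, z2: (N, D) embeddings; z1[i] and z2[i] form a positive pair, and every
    other embedding in the batch serves as a negative."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    z = torch.cat([z1, z2], dim=0)                   # (2N, D)
    logits = z @ z.t() / temperature                 # cosine similarities as logits
    logits.fill_diagonal_(float("-inf"))             # a sample is never its own positive
    n = z1.shape[0]
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # index of each positive
    return F.cross_entropy(logits, targets)

# Usage sketch: z1, z2 = encoder(augment(x)), encoder(augment(x)); loss = info_nce(z1, z2)
```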
**Masked prediction** tasks predict missing content from a partial input. They have demonstrated strong performance and gained popularity through methods like BERT (Devlin et al., 2019) in Natural Language Processing (NLP), and Masked Autoencoders (MAE) (He et al., 2021) and BEiT (Bao et al., 2022) in vision.

**Generative pre-training.** In the vision domain, earlier attempts at using generative models for representation learning include Variational Autoencoders (VAEs) (Kingma & Welling, 2022; Rezende et al., 2014; Higgins et al., 2017; van den Oord et al., 2018) and GANs (Donahue & Simonyan, 2019). With the success of GPT (Radford et al., 2018; 2021; Brown et al., 2020) in NLP, generative pre-training has attracted renewed attention. Image-GPT (Chen et al., 2020a) adapts the GPT model for pre-training on images.

**Diffusion models** are a class of latent variable models inspired by statistical physics and non-equilibrium thermodynamics (Sohl-Dickstein et al., 2015). It has been demonstrated that diffusion models excel at generating high-quality images (Ho et al., 2020; Dhariwal & Nichol, 2021; Rombach et al., 2022). In addition, they offer the flexibility to generate images guided by labels (Dhariwal & Nichol, 2021; Ho & Salimans, 2022) or textual descriptions (Nichol et al., 2022; Saharia et al., 2022). There is also growing interest in utilizing diffusion models for representation learning (Hudson et al., 2023; Wei et al., 2023).

**Autoregressive (AR) models** have a rich history in language and speech. Innovative architectures, such as recurrent neural networks (Rumelhart et al., 1986), long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) and Transformers (Vaswani et al., 2023), have kept improving their capabilities. In the image domain, AR models are adopted in NADE (Uria et al., 2016), MADE (Germain et al., 2015), PixelRNN (van den Oord et al., 2016a), PixelCNN (van den Oord et al., 2016b), Image Transformer (Parmar et al., 2018) and Image-GPT (Chen et al., 2020a). AR models are tightly related to generative models, as they are often trained with a likelihood-based objective. Concurrent work (El-Nouby et al., 2024) shows that a patch-based image Transformer trained with an L2 loss exhibits scaling properties similar to its NLP counterparts; however, their model cannot be regarded as a full-fledged generative model. It is perhaps worth noting that diffusion models can also be seen as AR models, but in frequency space (Dieleman, 2023).

## 3. Denoising Autoregressive Representation Learning (DARL)

### 3.1. Architecture

The architecture used in our study is straightforward (see Figure 1): a Vision Transformer (ViT) (Dosovitskiy et al., 2021) backbone with causal attention masking. Adopting this backbone allows us to make a direct comparison with prior representation learning methods.

Figure 1. DARL architecture. Images are segmented into non-overlapping patches to form an input sequence. Causal attention masking is applied to the Vision Transformer. Random noise, parameterized by a noise schedule, is independently sampled to corrupt each patch. The output of the Transformer, together with the corrupted patch, is passed to the patch decoder to reconstruct the clean patch.
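As an illustration of the pipeline in Figure 1, the snippet below sketches the MSE variant: patchify in raster order, prepend a start-of-sequence token, run a causal Transformer, and predict the next patch with a linear decoder. It is a minimal sketch under assumptions, not the paper's implementation: `backbone` stands in for a stack of causal ViT blocks (with 2D RoPE applied inside attention), and all module names are placeholders.

```python
import torch
import torch.nn as nn

class DARLSketch(nn.Module):
    """Schematic DARL forward pass (MSE variant). Names are illustrative."""
    def __init__(self, backbone, patch=16, dim=768):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(3 * patch * patch, dim)   # linear patch embedding
        self.sos = nn.Parameter(torch.zeros(1, 1, dim))  # start-of-sequence token
        self.backbone = backbone                         # causal Transformer blocks (placeholder)
        self.decode = nn.Linear(dim, 3 * patch * patch)  # linear patch decoder

    def forward(self, images):                           # images: (B, 3, H, W)
        p = self.patch
        # Patchify in raster-scan order: (B, T, 3*p*p) with T = (H/p) * (W/p).
        patches = images.unfold(2, p, p).unfold(3, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(3).flatten(1, 2)
        # Shift right by one position so the prediction at t only sees patches < t.
        tokens = torch.cat([self.sos.expand(images.shape[0], -1, -1),
                            self.embed(patches[:, :-1])], dim=1)
        hidden = self.backbone(tokens)                   # causal attention masking inside
        pred = self.decode(hidden)                       # prediction for each next patch
        return ((pred - patches) ** 2).mean()            # MSE objective
```

For the diffusion variant described below, the linear `decode` would be replaced by a small Transformer block that takes the backbone output and the corrupted-patch embedding as input tokens.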
Following the ViT approach, images are segmented into non-overlapping patches and linearly projected into an embedding space. The resulting embeddings are arranged in raster scan order, and a start-of-sequence token is prepended; this forms the input to the Transformer. The combination of causal attention masking and the one-position offset introduced by the start-of-sequence token ensures that the patch generated at the current position only receives information from previous patches. We use relative positional encodings in the form of decomposed RoPE (detailed in Section 3.4). We find that relative positional encodings outperform absolute and learnable ones, in particular for AR models, and that extending RoPE from 1D to 2D allows better generalization for image data (see Section 4.1).

A patch decoder maps the Transformer output into pixel space. When training with the MSE loss, we simply use a linear layer. With the diffusion objective, we use a denoising patch decoder consisting of a single Transformer block that processes the output of the backbone and the embedding of the corrupted patch (treating each as an input token).

### 3.2. Training Objective

Training uses the standard AR objective function:

$$\mathcal{L}(\theta; \mathcal{D}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \qquad (1)$$

For the diffusion variant, the noise levels used during training are sampled from a Beta(a, b) distribution. When a, b > 1, the mode is concentrated at (a − 1)/(a + b − 2), and the larger the total count a + b, the smaller the variance; when a < 1 or b < 1, the distribution is bimodal and concentrates on 0 and 1. This formulation offers dual benefits: it reduces the number of hyperparameters and offers more interpretability of the model's preference over noise schedules.

**Reverse process.** The reverse process relies on a denoising model $p_\theta(x_{s-1} \mid x_s)$ to remove the noise added in the forward process. $p_\theta(x_{s-1} \mid x_s)$ is parameterized as a Gaussian distribution centered at $\mu_\theta$ with a fixed variance. With the simplified objective proposed by Ho et al. (2020), the variance only affects the likelihood computation and can be ignored during training. The mean $\mu_\theta$ can be formulated using either the noise $\epsilon$ or the target $x_0$ (with $\alpha_s$ the per-step signal coefficient and $\gamma_s = \prod_{i \le s} \alpha_i$):

$$\mu_\theta = \frac{1}{\sqrt{\alpha_s}}\, x_s - \frac{1 - \alpha_s}{\sqrt{\alpha_s (1 - \gamma_s)}}\, \epsilon \qquad (2)$$

$$\mu_\theta = \frac{\sqrt{\alpha_s}\,(1 - \gamma_{s-1})}{1 - \gamma_s}\, x_s + \frac{\sqrt{\gamma_{s-1}}\,(1 - \alpha_s)}{1 - \gamma_s}\, x_0 \qquad (3)$$

In Equation (2), the model learns to predict the noise $\hat{\epsilon}$, while in Equation (3) it learns to predict the original image patch $\hat{x}_0$. We use the latter formulation and empirically show that predicting the target works better. Denote by $g(x^s_t, z_t)$ the denoising patch decoder, where the conditioning $z_t = f(x_{<t})$ is the output of the causal Transformer backbone at position $t$.
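To make the notation concrete, the snippet below sketches the per-patch corruption and Equation (3). It follows Figure 1 in sampling the noise level independently per patch; whether the Beta distribution is placed on the signal level $\gamma$ itself (as done here), on the noise level $1 - \gamma$, or on the diffusion time, as well as all function and argument names, are assumptions of this sketch rather than the paper's implementation.

```python
import torch

def corrupt_patches(x0, a, b):
    """Corrupt each patch with an independently sampled noise level.
    x0: (B, T, D) clean patches. The signal level gamma is drawn per patch
    from a Beta(a, b) distribution (an assumption of this sketch), and the
    standard variance-preserving corruption is applied:
        x_s = sqrt(gamma) * x0 + sqrt(1 - gamma) * eps."""
    gamma = torch.distributions.Beta(a, b).sample(x0.shape[:2]).to(x0)  # (B, T)
    gamma = gamma[..., None]                                            # broadcast over pixels
    eps = torch.randn_like(x0)
    x_s = gamma.sqrt() * x0 + (1.0 - gamma).sqrt() * eps
    return x_s, eps, gamma

def posterior_mean_from_x0(x_s, x0_hat, alpha_s, gamma_s, gamma_prev):
    """Equation (3): mean of p(x_{s-1} | x_s) written in terms of the
    predicted clean patch x0_hat instead of the predicted noise."""
    coef_xs = (alpha_s ** 0.5) * (1.0 - gamma_prev) / (1.0 - gamma_s)
    coef_x0 = (gamma_prev ** 0.5) * (1.0 - alpha_s) / (1.0 - gamma_s)
    return coef_xs * x_s + coef_x0 * x0_hat
```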