# Distribution Augmentation for Generative Modeling

Heewoo Jun\*, Rewon Child\*, Mark Chen, John Schulman, Aditya Ramesh, Alec Radford, Ilya Sutskever (OpenAI, San Francisco, California, USA)

\*Equal contribution. Correspondence to: Heewoo Jun, Rewon Child. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

We present distribution augmentation (DistAug), a simple and powerful method of regularizing generative models. Our approach applies augmentation functions to data and, importantly, conditions the generative model on the specific function used. Unlike typical data augmentation, DistAug allows the use of functions which modify the target density, enabling aggressive augmentations more commonly seen in supervised and self-supervised learning. We demonstrate this is a more effective regularizer than standard methods, and use it to train a 152M parameter autoregressive model on CIFAR-10 to 2.56 bits per dim (relative to the state-of-the-art 2.80). Samples from this model attain FID 12.75 and IS 8.40, outperforming the majority of GANs. We further demonstrate the technique is broadly applicable across model architectures and problem domains.

## 1. Introduction

Data augmentation is an indispensable tool in the training of deep neural networks, especially for discriminative (Cubuk et al., 2019), semi-supervised (Xie et al., 2019a;b), and self-supervised (Hénaff et al., 2019; He et al., 2019) vision tasks. However, data augmentation has not played a role in many of the recent advances in pixel-level generative modeling (Salimans et al., 2017; Chen et al., 2018; Menick & Kalchbrenner, 2018; Kingma & Dhariwal, 2018; Ho et al., 2019; Parmar et al., 2018; Child et al., 2019). This is partly caused by the difficulty of applying data augmentation to generative modeling: aggressive augmentation functions can change the data distribution the model learns, distorting its samples and incurring a penalty to its log-likelihood.

Figure 1. Unconditional samples from a large autoregressive model trained on CIFAR-10 with DistAug. In addition to attaining state-of-the-art likelihoods, at temperature 0.94 this model generates samples competitive with or superior to many GANs (FID 12.75, IS 8.40).

To illustrate, consider a supervised learning setting where the distribution of interest is over a small number of class variables y given a high-dimensional image x. In this case, many transformations t(x) can be applied to x such that the conditional distribution p(y|t(x)) remains unchanged. Generative models, in contrast, seek to learn the distribution p(x) itself, and as such any t(x) we use must result in images which are also likely under the original distribution. Even mild augmentation functions, however (like adding Gaussian noise, shifting/rotating and padding with null pixels, or cutting out portions of the image), result in images that are very unlikely under the original distribution. Models trained with these augmentations may fit the original distribution poorly and generate distorted samples.

This work introduces distribution augmentation (DistAug), a method of regularizing generative models even under augmentations which distort the data distribution.

Figure 2. Illustration of the main differences between (a) data augmentation and (b) distribution augmentation (DistAug). DistAug can use more aggressive transformations than data augmentation because it conditions the generative model on the transformation type.
The idea is to simply learn a density pθ(t(x) | t) of the data under some transformation t, conditioned on the transformation itself. The function t can be drawn from a large family of transformations, including state-of-the-art aggressive data augmentations from supervised learning (Cubuk et al., 2019) and self-supervised transformation functions (Gidaris et al., 2018; Kolesnikov et al., 2019). To estimate the un-transformed density and draw samples from it, the transformation t is simply set to the identity function.

This can be interpreted as a unique data-dependent and model-dependent regularizer similar to multi-task learning. We find it substantially outperforms standard methods, allowing us to train an autoregressive CIFAR-10 model to 2.56 bits per dim (versus the previous 2.80 state-of-the-art) and attain sample quality on par with most GANs. We show evidence that this increased performance is due both to the scalability of the method relative to standard regularization and to the beneficial inductive biases received from well-chosen augmentation functions.

We also examine the applicability of DistAug to other scenarios by testing different domains (language modeling and neural machine translation), different models (autoregressive models and invertible flows), and different architectures (self-attention based and convolution-based). We find our technique, with proper dropout tuning, yields improvements in most cases we test. We release our model weights and code at https://github.com/openai/distribution_augmentation.

## 2. Background

### 2.1. Density estimation

We consider the task of estimating a parametric density pθ(x) under the negative log-likelihood objective:

$$\mathcal{L}(p_\theta, D_n) := \mathbb{E}_{x \sim D_n}\left[-\log p_\theta(x)\right] \tag{1}$$

The model pθ can be any likelihood-based generative model, including autoregressive models, invertible flows, variational autoencoders, and more.

### 2.2. Data augmentation

Standard data augmentation introduces a family of transformation functions T, where each t ∈ T transforms the data in a particular way, except for the identity transformation I. The objective is to maximize the likelihood over transformed samples:

$$\mathcal{L}_{\text{data}}(p_\theta, D_n, T) := \mathbb{E}_{x \sim D_n}\,\mathbb{E}_{t \sim T}\left[-\log p_\theta(t(x))\right] \tag{2}$$

The specifics of the distribution over T are determined on a case-by-case basis, typically with regard to the performance of the model when evaluating non-transformed data on a held-out validation set. In the case of images, the types of transformations are usually significantly more conservative than those used in supervised learning, and are commonly limited to just horizontal flipping. In the case of natural language, transformations might include mild augmentations like replacing words with their synonyms. This is because using more aggressive transforms will distort the distribution the model learns, often resulting in higher loss on the validation set. Additionally, model samples will include the augmentations, which is usually not desired.

## 3. DistAug

Distribution augmentation (DistAug), similar to data augmentation, introduces a family of transformations T to the training process. Unlike data augmentation, however, the model learns the density of a transformed data sample, conditioned on the specific transformation:

$$\mathcal{L}_{\text{dist}}(p_\theta, D_n, T) := \mathbb{E}_{x \sim D_n}\,\mathbb{E}_{t \sim T}\left[-\log p_\theta(t(x) \mid t)\right] \tag{3}$$

A graphical depiction of this process appears in Figure 2.
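As a concrete illustration of the objective in Eq. 3, below is a minimal PyTorch sketch of a DistAug training step. The toy model, the small transform family, and all names here (e.g. `ToyConditionalModel`, `distaug_loss`) are our own illustrative assumptions, not the paper's Sparse Transformer or its released code.

```python
# Minimal sketch of the DistAug objective in Eq. 3, under the assumptions
# stated above (toy conditional model, tiny transform family).
import torch
import torch.nn as nn
import torch.nn.functional as F

TRANSFORMS = [
    lambda x: x,                               # identity I
    lambda x: torch.rot90(x, 1, dims=(2, 3)),  # rotate 90 degrees
    lambda x: torch.rot90(x, 2, dims=(2, 3)),  # rotate 180 degrees
    lambda x: torch.rot90(x, 3, dims=(2, 3)),  # rotate 270 degrees
    lambda x: x.transpose(2, 3),               # spatial transposition
]

class ToyConditionalModel(nn.Module):
    """Predicts 256-way logits per subpixel, conditioned on a transform id."""
    def __init__(self, n_transforms=len(TRANSFORMS), emb=16):
        super().__init__()
        self.t_embed = nn.Embedding(n_transforms, emb)
        self.net = nn.Conv2d(3 + emb, 3 * 256, kernel_size=1)

    def forward(self, x, t_id):
        b, _, h, w = x.shape
        cond = self.t_embed(t_id)[:, :, None, None].expand(-1, -1, h, w)
        out = self.net(torch.cat([x / 255.0, cond], dim=1))
        # interpret output channels as 256 classes x 3 subpixels
        return out.view(b, 256, 3, h, w)

def distaug_loss(model, x):
    """-log p_theta(t(x) | t), with one t drawn uniformly per batch."""
    t_id = torch.randint(len(TRANSFORMS), (1,)).item()
    tx = TRANSFORMS[t_id](x)
    t_ids = torch.full((x.shape[0],), t_id, dtype=torch.long)
    logits = model(tx.float(), t_ids)           # (B, 256, 3, H, W)
    return F.cross_entropy(logits, tx.long())   # mean NLL in nats

model = ToyConditionalModel()
x = torch.randint(0, 256, (8, 3, 32, 32))       # stand-in CIFAR-10 batch
distaug_loss(model, x).backward()
# Conditioning on the identity (t_id = 0) recovers the estimate of the
# original, untransformed density, and is what one samples from.
```

Sampling a single transform per batch (rather than per example) is a simplification made for brevity; either choice gives an unbiased estimate of the expectation over t in Eq. 3.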
DistAug allows the model to adapt its density estimate of the sample based on which transformation is being applied. In practice, this means that regardless of how radically a transformation distorts a sample, adding it to the training process need not affect the model's estimate of the original samples, as long as the model has enough capacity.

In principle, any number and type of transformations can be applied, regardless of whether they preserve the information in the data and regardless of how similar the transformed samples are to the original. However, if |det(∂t/∂x)| > 0, it is possible to invert the transformed density. For permutation-motivated transforms like rotation, jigsaw, and reordering of words, it is possible to estimate and sample from the density in non-canonical order.

As a varied and wide family of transformations can be used with DistAug, we hypothesize that it can be used to train more powerful models without overfitting than is possible with standard regularization methods. In section 4.1, we provide experimental evidence supporting this claim.

### 3.1. DistAug as a data-dependent regularizer

A helpful way of rewriting the objective of a generative model trained with DistAug is as follows:

$$\mathcal{L}_{\text{dist}}(p_\theta, D_n, T) = \mathbb{E}_{x \sim D_n}\left[-\log p_\theta(I(x) \mid I)\right] + \mathbb{E}_{x \sim D_n}\,\mathbb{E}_{t \sim T}\left[\log p_\theta(I(x) \mid I) - \log p_\theta(t(x) \mid t)\right] \tag{4}$$

$$= \mathcal{L}(p_\theta, D_n) + \Omega(p_\theta, D_n, T). \tag{5}$$

In other words, the objective under DistAug is equivalent to the original objective, plus some data- and model-dependent regularizer Ω whose difficulty depends on T. The model incurs a penalty if it models the original distribution much better than any of the transformed distributions.

### 3.2. DistAug as multi-task learning

The objective in Eq. 3 can also be interpreted in the framework of multi-task learning, where the separate tasks in this case are to learn the transformed distributions. We speculate that, in a similar nature to multi-task learning, when the family of transformations T is chosen with knowledge of the original distribution and model class, DistAug may lead to better generalization than techniques which penalize complexity uniformly. In section 4.2, we provide evidence that even small models trained with DistAug achieve gains in validation performance relative to standard training methods, suggesting that gains are not only due to the scale of augmentations applied, but also due to the helpful inductive bias of certain transformations.

## 4. Experiments

In this section, we provide evidence supporting our main claims that 1) DistAug scales to larger model sizes than standard regularization techniques, 2) specific choices of distributions can lead to improved performance even for small models, and 3) the conditioning signal leads to lower error and improved samples. We primarily study an autoregressive model (the Sparse Transformer (Child et al., 2019)) and its performance on the natural image benchmark datasets CIFAR-10 and ImageNet-64. In section 5, we test whether the technique applies to other data types, model architectures, and generative objectives.

In all image experiments, models using DistAug were tuned to find the best subset of augmentation functions from rotation (Gidaris et al., 2018), spatial transposition, color swapping, jigsaw shuffling (Noroozi & Favaro, 2016), and an ensemble of data augmentations from supervised learning called RandAugment (Cubuk et al., 2019).

To condition the model on these transformations, we replace the start-of-sequence embedding vector with the sum of augmentation embeddings. For example, if we ask the model to generate images with rotation and transposition, we choose one embedding each from four rotation embeddings and two transposition embeddings, with each option selected uniformly at random. For rotation, spatial transposition, color swapping, and jigsaw reordering, we also permute the positional embeddings supplied to the Transformer, which results in smoother learning and a gain of around 0.01 BPD at convergence. A sketch of this conditioning scheme follows.
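The sketch below shows one way such conditioning might be wired up, under our own naming and implementation assumptions (a sum of augmentation embeddings standing in for the start-of-sequence embedding, and a reordering of positional-embedding indices for permutation-style transforms); it is illustrative only, not the released implementation.

```python
# Sketch (assumed names and shapes): conditioning vector that replaces the
# start-of-sequence embedding, plus permuted positional-embedding indices for
# permutation-style transforms such as rotation and transposition.
import torch
import torch.nn as nn

d_model, H, W, C = 256, 32, 32, 3
seq_len = H * W * C

rot_embed = nn.Embedding(4, d_model)  # 0 / 90 / 180 / 270 degrees
tr_embed = nn.Embedding(2, d_model)   # transposed or not

def conditioning_vector(rot_k, transposed):
    """Sum of augmentation embeddings, used in place of the <sos> embedding."""
    return rot_embed(torch.tensor([rot_k])) + tr_embed(torch.tensor([int(transposed)]))

def permuted_position_ids(rot_k, transposed):
    """Pixel position ids reordered the same way the transform reorders pixels."""
    idx = torch.arange(H * W).view(H, W)
    idx = torch.rot90(idx, rot_k, dims=(0, 1))
    if transposed:
        idx = idx.t()
    # expand from pixel positions to (pixel, channel) positions
    return (idx.reshape(-1, 1) * C + torch.arange(C)).reshape(-1)  # (seq_len,)

pos_embed = nn.Embedding(seq_len, d_model)
sos = conditioning_vector(rot_k=1, transposed=True)        # (1, d_model)
positions = pos_embed(permuted_position_ids(1, True))      # (seq_len, d_model)
```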
Baseline regularization methods include model regularization alone (dropout and weight decay), as well as model regularization combined with limited data augmentation (horizontal flipping). All DistAug runs were also tuned for the best value of dropout, which was typically much lower than in the corresponding baseline run. Detailed hyperparameter settings for experiments are available in the Supplementary Material.

### 4.1. Scalability of DistAug

Here, we assessed the performance of DistAug relative to standard regularization methods when increasing both model size and dataset size.

#### 4.1.1. CIFAR-10

We took CIFAR-10 models at three different capacities: 15M parameters (smaller than typically used), 58M parameters (the previous state of the art), and 152M parameters (more than is typically used), and assessed their performance when trained with DistAug and with standard regularization methods. The results are presented in Figure 3.

Figure 3. Using large models with increasing DistAug strength can improve generalization to an extent that is not possible with standard regularization methods; dropout and traditional data augmentation plateau at lower parameter counts. (a) CIFAR-10 validation bits per dim of different augmentation strategies (baseline; horizontal flipping; randaugment; rot; rot, tr; rot, tr, col, js) across model sizes in millions of parameters; the baseline and horizontal flipping do not use DistAug. (b) CIFAR-10 test bits per dim using a 152M Sparse Transformer; increasing levels of DistAug together with less dropout yield increased performance:

| DistAug type | Dropout | BPD |
|---|---|---|
| - | 60% | 2.88 |
| horizontal flipping | 40% | 2.79 |
| randaugment | 25% | 2.66 |
| rot | 40% | 2.59 |
| rot, tr | 40% | 2.56 |
| rot, tr, col | 10% | 2.58 |
| rot, tr, col, js | 5% | 2.53 |

Without DistAug, the larger 152M parameter models perform worse than their 58M parameter counterparts, suggesting that a good regularization technique would be to simply reduce the number of parameters in the model. With DistAug, on the other hand, performance continues to improve as we increase model capacity. The model with the most augmentations and the lowest dropout rate performs the best, attaining 2.53 bits per dim on the test set, compared to 2.79 for a similarly sized model with tuned dropout and standard data augmentation.

It should be noted that this improved performance is not limited to models with greater numbers of parameters: in fact, the 15M model trained with rotation and transposition, at 2.67 bits per dim, also improves upon the previous 58M state-of-the-art model by 0.13 bits per dim. We attribute this gain to a useful inductive bias from DistAug, and further elaborate on this phenomenon in section 4.2.

#### 4.1.2. ImageNet-64

We also evaluated models trained with DistAug on the larger and more complex ImageNet-64 dataset. We trained 152M and 303M parameter models for comparison with previous works.
DistAug with rotation showed modest gains in performance, as seen in Table 1. We suspect that, since the baseline models do not require training with dropout, their capacity is still insufficient to benefit fully from DistAug. Larger models will be needed for more substantial gains.

Table 1. Validation bits per dim on the ImageNet-64 dataset. We found gains to be modest relative to CIFAR-10, suggesting that model capacity may still be too limited to fully take advantage of DistAug. DistAug still shows some improvement, however, which we discuss in section 4.2.

| Parameters | DistAug type | Validation BPD |
|---|---|---|
| 152M | - | 3.432 |
| 152M | rot | 3.427 |
| 303M | - | 3.425 |
| 303M | rot | 3.419 |

### 4.2. DistAug on small models

We observe that even relatively small models can benefit from DistAug, which is difficult to explain if it acts purely as a regularizer on the model's capacity. Drawing on DistAug's connections to multi-task learning explored in section 3.2, we hypothesize that the specific inductive bias of the transformations may aid in learning a generalizable solution, as opposed to merely regularizing the model.

To test this, we apply single augmentation types to a 15M parameter CIFAR-10 model, reporting results in Table 2. Although each variation provides exactly one auxiliary task for the model to complete, the validation performance varies widely, from an improvement of 0.13 bits per dim over the baseline (rotation) to a degradation of 0.06 bits per dim (with a set of fixed random orderings). Thus, we conclude that the task type is important and can lead to wide variations in performance.

Table 2. CIFAR-10 performance for relatively small 15M parameter models, trained with varying choices of a single augmentation function. Rotation and RandAugment (Cubuk et al., 2019) perform remarkably well, even outperforming larger state-of-the-art models, whereas random orderings or excess data decrease performance. The specific inductive biases of transformations can thus aid model learning, independent of their regularizing effect.

| Augmentation | Validation BPD |
|---|---|
| Baseline | 2.82 |
| RandAugment | 2.73 |
| Jigsaw | 2.78 |
| Color | 2.75 |
| Transposition | 2.73 |
| Rotation | 2.69 |
| 75% Fixed random order | 2.88 |
| 75% ImageNet data | 2.80 |

It is unclear why certain augmentation types are so much stronger than others at improving density estimation. Take, for example, jigsaw, which effectively amplifies the amount of training data by a factor of (2×2)! = 24, larger than the factor of 4 for rotation. In line with results from the self-supervision literature (Kolesnikov et al., 2019), however, jigsaw turns out to be weaker than rotation. We speculate that rotation may just happen to provide a useful inductive bias for the specific model and datasets we use, and posit this as the reason it helps even for models which are not prone to overfitting.

Another interpretation is that the model is encouraged to learn the original data distribution together with a neural circuit to generate augmented distributions from it. It is possible that certain augmentation types such as rotation are a simple enough function to implement, and facilitate learning the original data distribution in the process of jointly compressing the data and the corruption process. We found tiny models with 2M parameters plateaued at a worse likelihood than the baseline when DistAug was used. Because such a model does not have enough capacity to learn this joint representation, let alone all distributions, it does not learn any one distribution well.
This contrasts with sufficiently large models, which can eventually learn a more compact representation. Such mechanics may lead the model to avoid rote memorization and improve generalization. We leave detailed examination of this mechanism to future work.

### 4.3. Importance of the conditioning signal

For a 32×32 image, 1 bit of conditioning signal provides only about 0.0003 BPD of information (1 bit spread over 32×32×3 = 3072 subpixels is roughly 0.0003 bits per dim). But the presence of this bit allows the model to estimate and sample from the original distribution much more effectively, as shown in Figure 4. While models trained without conditioning can still put significant probability mass on unaugmented data, during sampling augmented samples are drawn frequently and appear visually distorted, a phenomenon introduced in (Theis et al., 2015). We provide quantitative evaluations of this behavior in Table 3, where a heavy RandAugment setting normally used for larger images is applied 99% of the time. In the case of a RandAugment-conditioned model, the conditioning signal provides a likelihood gain of only 0.007 bits per dim yet improves the FID evaluation by over 20 points. We find that the model can still sample plausible objects and obtains an FID comparable to GAN models, whereas the model trained with traditional data augmentation rarely produces unaugmented samples, as shown in Figure 8b.

Table 3. Ablation on the conditioning signal.

| Augmentation | Conditioning? | BPD | FID | IS |
|---|---|---|---|---|
| RandAugment | no | 2.715 | 42.0 | 6.36 |
| RandAugment | yes | 2.708 | 21.1 | 7.52 |

Figure 4. Even though the log-likelihoods of unconditional and conditional augmented models can be similar, samples from DistAug (a, with conditioning) are much improved visually and have a lower incidence of artifacts compared to samples without conditioning (b).

Moreover, the corruption process internalized by an unconditional augmentation model often cannot be fixed by fine-tuning on the original data. For the above example with heavy RandAugment, we further fine-tune for 10 to 100 epochs with varying amounts of model regularization and find that the sample quality and the rate of unaugmented samples do not change much.

It is interesting, however, that when the unconditional model does generate a sample from the original distribution, the sample quality is often quite high. There are clearly object-like images, demonstrating that augmentation can still help with sample quality of the original distribution, even without conditioning. But to ensure that the model can reliably sample from the original distribution instead of following its adjusted training distribution, conditioning must be used.

## 5. Comparison across architectures, domains, and other existing work

We tested whether the same technique applies to different model architectures, problem domains, and generative modeling objectives. We also attempted to contextualize our work in the body of existing generative modeling literature. Our results are summarized in Tables 4, 5, and 6.

### 5.1. Comparison with existing autoregressive models

We compare the best Sparse Transformer (Child et al., 2019) we could train with DistAug against the previous state-of-the-art self-attention based generative models. With an equal number of parameters, simply adding rotation leads to a gain of 0.18 bits per dim (BPD) over the state-of-the-art, to 2.62. Larger models benefit even more: a 152M parameter Sparse Transformer attains 2.78 BPD using traditional data augmentation, but with aggressive DistAug reaches 2.53 BPD, a relative gain of 0.25 BPD.
Table 4. Performance on the CIFAR-10 natural image density estimation benchmark. Well-tuned distribution augmentation leads to significant gains across model architectures, generative objectives, and datasets when compared with standard augmentation or regularization techniques. "rot, tr" corresponds to augmenting by spatially rotating images and transposing them, which creates 8 left-right-aware reorientations of the image.

| Model | Parameters | Regularization | DistAug | BPD |
|---|---|---|---|---|
| **Autoregressive, self-attention** | | | | |
| PixelSNAIL (Chen et al., 2018) | - | - | - | 2.85 |
| Sparse Transformer (Child et al., 2019) | 58M | dropout 25% | - | 2.80 |
| Sparse Transformer | 58M | dropout 1% | rot | 2.62 |
| Sparse Transformer | 152M | dropout 40%, hflip | - | 2.78 |
| Sparse Transformer | 152M | dropout 40% | rot, tr | 2.56 |
| Sparse Transformer | 152M | dropout 5% | rot, tr, js, col | 2.53 |
| **Autoregressive, convolutional** | | | | |
| PixelCNN (van den Oord et al., 2016b) | - | - | - | 3.14 |
| Gated PixelCNN (van den Oord et al., 2016a) | - | - | - | 3.03 |
| PixelCNN++ (Salimans et al., 2017) | 53M | dropout 50% | - | 2.93 |
| PixelCNN++ | 53M | dropout 50% | rot, tr | 2.88 |
| PixelCNN++ | 53M | dropout 5% | rot, tr, col | 2.84 |
| **Invertible flow** | | | | |
| Glow (Kingma & Dhariwal, 2018) | - | - | - | 3.35 |
| Flow++ (Ho et al., 2019) | 32M | dropout 20% | - | 3.08 |
| Flow++ | 32M | dropout 1% | rot, tr | 2.98 |

Table 5. Fréchet Inception Distance (FID) and Inception Score (IS) for models trained with DistAug on CIFAR-10, relative to the existing state-of-the-art. We find that better sample quality tends to correlate with fewer augmentations, and only loosely follows likelihoods. When the sampling temperature is varied, moreover, networks trained with DistAug can generate samples with similar or better quality than GAN-based networks, despite not incorporating any optimization objective for sample quality.

| Model | Parameters | Regularization | DistAug | BPD | Temperature | FID | IS |
|---|---|---|---|---|---|---|---|
| Sparse Transformer | 58M | dropout 1% | rot | 2.62 | 1.0 | 37.5 | 6.41 |
| Sparse Transformer | 152M | dropout 5% | rot, tr, js, col | 2.53 | 1.0 | 42.9 | 6.85 |
| Sparse Transformer | 152M | dropout 40%, hflip | - | 2.78 | 1.0 | 27.8 | 7.16 |
| Sparse Transformer | 152M | dropout 25% | randaugment | 2.66 | 1.0 | 14.7 | 8.18 |
| Sparse Transformer | 152M | dropout 40% | rot, tr | 2.56 | 1.0 | 21.8 | 7.81 |
| Sparse Transformer | 152M | dropout 40% | rot, tr | - | 0.98 | 17.0 | 8.00 |
| Sparse Transformer | 152M | dropout 40% | rot, tr | - | 0.94 | 12.7 | 8.40 |
| Sparse Transformer | 152M | dropout 40% | rot, tr | - | 0.90 | 15.2 | 8.36 |
| Sparse Transformer | 152M | dropout 25% | randaugment | - | 0.94 | 10.57 | 8.93 |
| PGGAN (Karras et al., 2017) | - | - | - | - | - | - | 8.80 |
| AutoGAN (Gong et al., 2019) | - | - | - | - | - | 12.42 | 8.55 |
| CR-GAN (Tian et al., 2018) | - | - | - | - | - | 14.56 | 8.40 |
| SN-GAN (Miyato et al., 2018) | - | - | - | - | - | 21.7 | 8.22 |
| WGAN-GP (Gulrajani et al., 2017) | - | - | - | - | - | 29.3 | 7.86 |

Table 6. Performance on natural language generation benchmarks. Enwik8 is unconditional generation, whereas neural machine translation (NMT) is conditional generation. DistAug increases performance on the tasks we studied.

| Model | Parameters | Dropout | DistAug | Enwik8 bits per byte |
|---|---|---|---|---|
| Sparse Transformer (Child et al., 2019) | 95M | 20% | - | 0.99 |
| Sparse Transformer | 95M | 10% | document, sentence, word reversal | 0.968 ± 0.001 |

| Model | Parameters | Dropout | DistAug | En-De NMT BLEU |
|---|---|---|---|---|
| Transformer (Vaswani et al., 2017) | 210M | 30% | - | 26.23 ± 0.77 |
| Transformer | 210M | 15% | sentence reversal | 26.82 ± 0.24 |

### 5.2. Comparison across architectures

#### 5.2.1. PixelCNN++

We also compare with a model architecture variant which incorporates no self-attention, the PixelCNN++ (Salimans et al., 2017). The relative gain from introducing DistAug appears to be more limited than in the self-attention case, gaining 0.09 bits per dim for a total of 2.84. The reason for this difference is unclear, although we speculate that self-attention may be more capable of learning flexible routing patterns than a purely convolutional approach.

#### 5.2.2. Flow++

Additionally, to test whether it may be just the quirks of the autoregressive objective which lead to gains from DistAug, we also test on a state-of-the-art invertible flow with variational dequantization, Flow++ (Ho et al., 2019). We added a conditioning embedding to every coupling network in order to allow the model to incorporate transformation information, as sketched below.
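As a rough illustration of what adding a conditioning embedding to a coupling network can look like, here is a hedged sketch of an affine coupling layer conditioned on a transform id. The class and every name in it are assumptions made for exposition; this is not the Flow++ architecture or the paper's code.

```python
# Sketch: an affine coupling layer whose coupling network also receives a
# transform embedding (assumed structure, for illustration only).
import torch
import torch.nn as nn

class ConditionedCoupling(nn.Module):
    def __init__(self, dim, n_transforms, hidden=128):
        super().__init__()
        self.t_embed = nn.Embedding(n_transforms, hidden)
        self.net = nn.Sequential(
            nn.Linear(dim // 2 + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),  # (shift, log_scale) for the second half
        )

    def forward(self, x, t_id):
        x1, x2 = x.chunk(2, dim=-1)
        h = self.net(torch.cat([x1, self.t_embed(t_id)], dim=-1))
        shift, log_scale = h.chunk(2, dim=-1)
        y2 = x2 * torch.exp(log_scale) + shift
        log_det = log_scale.sum(dim=-1)  # contribution to log |det J|
        return torch.cat([x1, y2], dim=-1), log_det

layer = ConditionedCoupling(dim=6, n_transforms=8)
y, log_det = layer(torch.randn(4, 6), torch.randint(8, (4,)))
```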
We find that the model trained with rotation and transposition gains 0.1 bits per dim over the baseline, for a total of 2.98. When we inspected the learned conditioning embeddings, however, we found that many of them had similar values, and samples generated with different orientations looked visually similar. Thus, we conclude that the network did not learn to use our conditioning, and that similar gains would be seen with an unconditional model.

### 5.3. Comparison of sample quality relative to other generative models

We also compare sample quality relative to existing work by calculating Fréchet Inception Distance (FID) and Inception Score (IS). We draw 10,000 samples and report scores in Table 5. Inception scores are calculated in 10 batches, and we report the mean. We notice only a loose correlation between bits per dim and sample quality metrics, a phenomenon observed to be possible in (Theis et al., 2015). Instead, we notice that applying fewer augmentations tends to result in better sample quality for the model size considered in this work, with both conditional RandAugment and rot, tr performing similarly well. The model trained with rot, tr, js, col, on the other hand, achieves high FID and low IS despite achieving the lowest bits per dim of any model. Samples from this model are shown in Figure 8a. The reason for this disparity is unknown, but it may be due to the limited capacity of the model to condition on a wide number of transformations.

Figure 5. The top row shows cherry-picked samples from a DistAug model where the augmentation is RandAugment (Cubuk et al., 2019); more representative samples are shown in Figure 7b. Subsequent rows are the nearest training data in the Inception embedding space. We found none of the generated samples were clearly copied pixel-by-pixel, although numerous near-identical copies of the white sedan are found in the training set.

When samples are generated by tempering the output distribution, as is common practice with autoregressive models, we find that sample quality matches or exceeds the best GANs in the literature. This is despite the fact that ours is a likelihood-based model that must cover all modes of the data, and incorporates no objective specifically tailored to promote sample quality.

In Figure 5, we cherry-pick some of the best samples and show the nearest neighbors in the Inception embedding space. We found little evidence to suggest the model is merely memorizing data. One case of a white sedan did appear to be very similar to training examples, but the same car appears numerous times in the training data. Additionally, the sampled image has many differences at the pixel-by-pixel level.

### 5.4. Comparison on text domain

In principle, the benefits of DistAug need not be limited to images. Here, we explore augmentation for generative modeling of text on two tasks: neural language modeling and neural machine translation. For language modeling, we factorize the prose probability in various ways. Namely, we consider independently reversing the order of sentences, words, and letters within a word. For example, for the last two options, "the cat sat on the mat" becomes "mat the on sat cat the" and "eht tac tas no eht tam" respectively; a sketch of these reversals follows.
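A minimal sketch of these reversal-style transforms, using deliberately naive whitespace and period splitting purely for illustration:

```python
# Reversal transforms for text DistAug as described above; the tokenization
# here is intentionally simplistic and only meant to illustrate the idea.
def reverse_sentences(text, sep=". "):
    return sep.join(reversed(text.split(sep)))

def reverse_words(text):
    return " ".join(reversed(text.split()))

def reverse_characters_within_words(text):
    return " ".join(word[::-1] for word in text.split())

s = "the cat sat on the mat"
print(reverse_words(s))                    # mat the on sat cat the
print(reverse_characters_within_words(s))  # eht tac tas no eht tam
```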
In general, this idea is equivalent to progressively factorizing a string x into (s)entences, (w)ords, and (c)haracters: pθ(x) = ∏i pθ(sti | st