# Fast AutoAugment

Sungbin Lim (UNIST, sungbin@unist.ac.kr), Ildoo Kim (Kakao Brain, ildoo.kim@kakaobrain.com), Taesup Kim (MILA, Université de Montréal, Canada, taesup.kim@umontreal.ca), Chiheon Kim (Kakao Brain, chiheon.kim@kakaobrain.com), Sungwoong Kim (Kakao Brain, swkim@kakaobrain.com)

Equal contribution. This work was done at Kakao Brain. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

## Abstract

Data augmentation is an essential technique for improving the generalization ability of deep learning models. Recently, AutoAugment [5] has been proposed as an algorithm that automatically searches for augmentation policies from a dataset, and it has significantly enhanced performance on many image recognition tasks. However, its search method requires thousands of GPU-hours even for a relatively small dataset. In this paper, we propose an algorithm called Fast AutoAugment that finds effective augmentation policies via a more efficient search strategy based on density matching. Compared to AutoAugment, the proposed algorithm speeds up the search time by orders of magnitude while achieving comparable performance on image recognition tasks with various models and datasets, including CIFAR-10, CIFAR-100, SVHN, and ImageNet. Our code is publicly available in the official Kakao Brain GitHub repository (https://github.com/kakaobrain/fast-autoaugment).

## 1 Introduction

Deep learning has become a state-of-the-art technique for computer vision tasks, including object recognition [16, 28, 37], detection [23, 29], and segmentation [4, 11]. However, deep learning models with large capacity often suffer from overfitting unless significantly large amounts of labeled data are available. Data augmentation (DA) has been shown to be a useful regularization technique that increases both the quantity and the diversity of training data. Notably, applying a carefully designed set of augmentations rather than naive random transformations during training improves the generalization ability of a network significantly [21, 26]. However, in most cases, designing such augmentations has relied on human experts with prior knowledge of the dataset.

With the recent advancement of automated machine learning (AutoML), there have been efforts to design an automated process for searching for augmentation strategies directly from a dataset. AutoAugment [5] uses reinforcement learning (RL) to automatically find a data augmentation policy for a given target dataset and model. It samples an augmentation policy using a controller RNN, trains the model using the policy, and uses the validation accuracy as a reward to update the controller. AutoAugment achieves a dramatic improvement in performance on several image recognition benchmarks. However, it requires thousands of GPU-hours even in a reduced setting in which the sizes of the target dataset and the network are small. The recently proposed Population Based Augmentation (PBA) [15] addresses this problem using the population-based training method for hyperparameter optimization. In contrast to previous methods, we propose a new search strategy that does not require any repeated training of child models. Instead, the proposed algorithm directly searches for augmentation policies that maximize the match between the distribution of an augmented split and the distribution of another, unaugmented split via a single model.
| Dataset | AutoAugment [5] | Fast AutoAugment |
|---|---|---|
| CIFAR-10 | 5000 | 3.5 |
| SVHN | 1000 | 1.5 |
| ImageNet | 15000 | 450 |

Table 1: GPU-hours comparison of the proposed method with [5]. We estimate the computation cost with an NVIDIA Tesla V100, whereas AutoAugment measured its computation cost on a Tesla P100.

In this paper, we propose an efficient search method for augmentation policies, called Fast AutoAugment, motivated by Bayesian DA [36]. Our strategy is to improve the generalization performance of a given network by learning augmentation policies that treat augmented data as missing data points of the training data. However, different from Bayesian DA, the proposed method recovers those missing data points by exploration-and-exploitation of a family of inference-time augmentations [33, 34] via Bayesian optimization in the policy search phase. We realize this with an efficient density matching algorithm that does not require any back-propagation for network training for each policy evaluation. The proposed algorithm can be easily implemented on top of distributed learning frameworks such as Ray [24]. Our experiments show that the proposed method can search for augmentation policies significantly faster than AutoAugment (see Table 1), while retaining comparable performance to AutoAugment on diverse image datasets and networks, especially in two use cases: (a) direct augmentation search on the dataset of interest, and (b) transferring learned augmentation policies to new datasets. On ImageNet, we achieve an error rate of 19.4% for ResNet-200 trained with our searched policy, which is 0.6% better than the 20.0% obtained with AutoAugment.

This paper is organized as follows. First, we introduce related work on automatic data augmentation in Section 2. Then, we present our problem setting and propose the Fast AutoAugment algorithm to solve the objective efficiently in Section 3. Finally, we demonstrate the efficiency of our method through comparisons with baseline augmentation methods and AutoAugment in Section 4.

## 2 Related Work

There are many studies on data augmentation, especially for image recognition. On benchmark image datasets such as CIFAR and ImageNet, random crop, flip, rotation, scaling, and color transformations have been used as baseline augmentation methods [10, 21, 30]. Mixup [41], Cutout [7], and CutMix [39] have recently been proposed to either replace or mask out image patches randomly and have obtained further improved performance on image recognition tasks. However, these methods are designed manually based on domain knowledge. Naturally, automatically finding data augmentation methods from data has emerged as a way to overcome the performance limitations originating from cumbersome manual exploration. Smart Augmentation [22] introduced a network that learns to generate augmented data by merging two or more samples from the same class. [32] employed a generative adversarial network (GAN) [9] to generate images that augment datasets. Bayesian DA [36] combined a Monte Carlo expectation-maximization algorithm with a GAN to generate data by treating augmented data as missing data points on the distribution of the training set. Due to the remarkable successes of NAS algorithms on various computer vision tasks [19, 28, 42], several recent studies also deal with automated search algorithms for obtaining augmentation policies for given datasets and models.
The main difference between the aforementioned generative methods and these automated augmentation search methods is that the former exploit generative models to create augmented data directly, whereas the latter find optimal combinations of predefined transformation functions. AutoAugment [5] introduced an RL-based search strategy that alternately trains a child model and an RNN controller, and showed state-of-the-art performance on various datasets with different models. Recently, PBA [15] proposed a new algorithm that generates augmentation policy schedules based on population-based training [17]. Similar to PBA, our method also employs hyperparameter optimization to search for optimal policies, but it uses the Tree-structured Parzen Estimator (TPE) algorithm [2] for the practical implementation.

Figure 1: An example of augmented images via a sub-policy in the search space $\mathcal{S}$. Each sub-policy $\tau$ consists of 2 operations; for instance, $\tau = [\text{cutout}, \text{autocontrast}]$ is used in this figure. Each operation $\bar{O}^{(\tau)}_i$ has two parameters: the probability $p_i$ of calling the operation and the magnitude $\lambda_i$ of the operation. These operations are applied with the corresponding probabilities. As a result, a sub-policy randomly maps an input image to one of 4 images. Note that the identity map (no augmentation) is also possible, with probability $(1-p_1)(1-p_2)$.

## 3 Fast AutoAugment

In this section, we first introduce the search space of the symbolic augmentation operations and formulate a new search strategy, efficient density matching, to find optimal augmentation policies efficiently. We then describe our implementation based on Bayesian hyperparameter optimization incorporated into a distributed learning framework.

### 3.1 Search Space

Let $\mathbb{O}$ be a set of augmentation (image transformation) operations $O : \mathcal{X} \to \mathcal{X}$ defined on the input image space $\mathcal{X}$. Each operation $O$ has two parameters: the calling probability $p$ and the magnitude $\lambda$, which determines the variability of the operation. Some operations (e.g., invert, flip) do not use the magnitude. Let $\mathcal{S}$ be the set of sub-policies, where a sub-policy $\tau \in \mathcal{S}$ consists of $N_\tau$ consecutive operations $\{\bar{O}^{(\tau)}_n(x; p^{(\tau)}_n, \lambda^{(\tau)}_n) : n = 1, \dots, N_\tau\}$, each of which is applied to an input image sequentially with probability $p$ as follows:

$$\bar{O}(x; p, \lambda) := \begin{cases} O(x; \lambda) & \text{with probability } p \\ x & \text{with probability } 1-p. \end{cases} \tag{1}$$

Hence, the output of a sub-policy $\tau(x)$ can be described by a composition of operations $x^{(n)} = \bar{O}^{(\tau)}_n(x^{(n-1)})$, $n = 1, \dots, N_\tau$, where $x^{(0)} = x$ and $x^{(N_\tau)} = \tau(x)$. Figure 1 shows a specific example of images augmented by $\tau$. Note that each sub-policy $\tau$ is a random sequence of image transformations which depends on $p$ and $\lambda$, and this enables covering a wide range of data augmentations. Our final policy $\mathcal{T}$ is a collection of $N_\mathcal{T}$ sub-policies, and $\mathcal{T}(D)$ denotes the set of augmented images of dataset $D$ transformed by every sub-policy $\tau \in \mathcal{T}$:

$$\mathcal{T}(D) = \bigcup_{\tau \in \mathcal{T}} \{(\tau(x), y) : (x, y) \in D\} \tag{2}$$

Figure 2: Overall procedure of augmentation search by the Fast AutoAugment algorithm. For exploration, the proposed method splits the train dataset $D_{\text{train}}$ into $K$ folds, each consisting of two datasets $D^{(k)}_M$ and $D^{(k)}_A$. The model parameter $\theta$ is then trained in parallel on each $D^{(k)}_M$. After training $\theta$, the algorithm evaluates $B$ bundles of augmentation policies on $D_A$. During the exploration process, the proposed algorithm does not train the model parameter $\theta$ from scratch again. The top-$N$ policies obtained from each of the $K$ folds are appended to the augmentation list $\mathcal{T}_*$.
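As a concrete illustration of the sub-policy semantics in Eqs. (1)-(2), the following is a minimal Python sketch, not the authors' released implementation; the operation table, the magnitude scaling, and the convention of sampling one sub-policy per image during training are illustrative assumptions.

```python
import random
from PIL import Image, ImageOps

# Hypothetical operation table: each entry maps (image, magnitude in [0, 1]) -> image.
OPS = {
    "autocontrast": lambda img, mag: ImageOps.autocontrast(img),            # ignores magnitude
    "rotate":       lambda img, mag: img.rotate(30.0 * (2.0 * mag - 1.0)),  # assumed scaling to [-30, 30] degrees
}

def apply_subpolicy(img: Image.Image, subpolicy) -> Image.Image:
    """Apply a sub-policy: a list of (op_name, probability p, magnitude lambda).

    Following Eq. (1), each operation fires independently with probability p;
    if every operation is skipped, the identity map mentioned in Figure 1 is recovered.
    """
    for name, p, mag in subpolicy:
        if random.random() < p:
            img = OPS[name](img, mag)
    return img

def apply_policy(img: Image.Image, policy) -> Image.Image:
    """A policy is a collection of sub-policies; here one sub-policy is sampled
    uniformly per image (an assumed convention), while Eq. (2) formally takes
    the union of all sub-policy outputs over the dataset."""
    return apply_subpolicy(img, random.choice(policy))

# Usage sketch on a dummy image:
example_policy = [
    [("autocontrast", 0.8, 0.0), ("rotate", 0.5, 0.7)],
    [("rotate", 0.3, 0.2), ("autocontrast", 0.6, 0.0)],
]
out = apply_policy(Image.new("RGB", (32, 32)), example_policy)
```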
Our search space is similar to those of previous methods, except that we use continuous values in $[0, 1]$ for both the probability $p$ and the magnitude $\lambda$, which admits more possibilities than a discretized search space.

### 3.2 Search Strategy

In Fast AutoAugment, we view the search for an augmentation policy as density matching between a pair of training datasets. Let $\mathcal{D}$ be a probability distribution on $\mathcal{X} \times \mathcal{Y}$ and assume dataset $D$ is sampled from this distribution. For a given classification model $\mathcal{M}(\cdot|\theta) : \mathcal{X} \to \mathcal{Y}$ parameterized by $\theta$, the expected accuracy and the expected loss of model $\mathcal{M}(\cdot|\theta)$ on dataset $D$ are denoted by $\mathcal{R}(\theta|D)$ and $\mathcal{L}(\theta|D)$, respectively. For a given augmentation policy $\mathcal{T}$, $\mathcal{L}(\theta|\mathcal{T}(D))$ denotes the expected loss of the model on the images augmented by (2). Note that the value of the loss for a fixed policy $\mathcal{T}$ can vary according to the randomness in the sub-policies due to (1).

#### 3.2.1 Efficient Density Matching for Augmentation Policy Search

For any given pair $D_{\text{train}}$ and $D_{\text{valid}}$, our goal is to improve the generalization ability by searching for augmentation policies that match the density of $D_{\text{train}}$ with the density of the augmented $D_{\text{valid}}$. However, it is impractical to compare these two distributions directly for every candidate policy. Therefore, we perform this evaluation by measuring how much one dataset follows the pattern of the other through the model's predictions on both datasets. In detail, let us split $D_{\text{train}} = D_M \cup D_A$ into $D_M$ and $D_A$, which are used for learning the model parameter $\theta$ and exploring the augmentation policy $\mathcal{T}$, respectively. We employ the following objective to find a set of learned augmentation policies $\mathcal{T}_*$:

$$\mathcal{T}_* = \operatorname*{argmax}_{\mathcal{T}} \; \mathcal{R}(\theta^*|\mathcal{T}(D_A)) \tag{3}$$

where the model parameter $\theta^*$ is trained on $D_M$. In this objective, $\mathcal{T}_*$ approximately minimizes the distance between the density of $D_M$ and the density of $\mathcal{T}_*(D_A)$ from the perspective of maximizing the performance of the model's predictions on both with the same parameter $\theta^*$.

The proposed search objective seeks label-preserving transformations that generate unseen but plausible missing data samples. That is, it does not distort the data space but augments it with samples that still have to be correctly predicted by the classification network for better generalization. This perspective is also in line with the motivation of Bayesian DA [36]. In practice, we minimize the categorical cross-entropy loss $\mathcal{L}(\theta|\mathcal{T}(D_A))$ instead of maximizing the accuracy in (3).

To achieve (3), we propose an efficient strategy for augmentation policy search (see Figure 2). First, we conduct $K$-fold stratified shuffling [31] to split the train dataset into $D^{(1)}_{\text{train}}, \dots, D^{(K)}_{\text{train}}$, where each $D^{(k)}_{\text{train}}$ consists of two datasets $D^{(k)}_M$ and $D^{(k)}_A$. For convenience, we omit $k$ in the dataset notation in the remainder. Next, we train the model parameter $\theta$ on $D_M$ from scratch without data augmentation. Contrary to previous methods [5, 15], our method does not necessarily reduce the given network to child models or proxy tasks. After training the model parameter, for each step $1 \leq t \leq T$, we explore $B$ candidate policies $\mathcal{B} = \{\mathcal{T}_1, \dots, \mathcal{T}_B\}$ via a Bayesian optimization method which repeatedly samples a sequence of sub-policies from the search space $\mathcal{S}$ to construct a policy $\mathcal{T} = \{\tau_1, \dots, \tau_{N_\mathcal{T}}\}$ and tunes the corresponding calling probabilities $\{p_1, \dots, p_{N_\mathcal{T}}\}$ and magnitudes $\{\lambda_1, \dots, \lambda_{N_\mathcal{T}}\}$
to minimize the expected loss $\mathcal{L}(\theta|\cdot)$ on the augmented dataset $\mathcal{T}(D_A)$ (see line 6 in Algorithm 1). Note that, during the policy exploration-and-exploitation procedure, the proposed algorithm does not train the model parameter from scratch again; hence, the proposed method finds augmentation policies significantly faster than AutoAugment. The concrete Bayesian optimization method is explained in Section 3.2.2. Once the exploration step is complete, we select the top-$N$ policies in $\mathcal{B}$ and denote them collectively by $\mathcal{T}_t$. Finally, we merge every $\mathcal{T}_t$ into $\mathcal{T}_*$. See Algorithm 1 for the overall procedure. At the end of the process, we augment the whole dataset $D_{\text{train}}$ with $\mathcal{T}_*$ and retrain the model parameter $\theta$. Through the proposed method, we can expect the performance $\mathcal{R}(\theta|\cdot)$ on the augmented dataset $\mathcal{T}_*(D_A)$ to be statistically higher than that on $D_A$, i.e., $\mathcal{R}(\theta|\mathcal{T}_*(D_A)) \geq \mathcal{R}(\theta|D_A)$, since the augmentation policy $\mathcal{T}_*$ works as an optimized inference-time augmentation [33, 34] that makes the model predict correct answers robustly. Consequently, the learned augmentation policies approach (3) and improve the generalization performance, as desired.

#### 3.2.2 Policy Exploration via Bayesian Optimization

Policy exploration is an essential ingredient of automated augmentation search. Since evaluating the model performance for every candidate policy is computationally expensive, we apply Bayesian optimization to the exploration of augmentation strategies. Precisely, at line 6 in Algorithm 1, we employ the following Expected Improvement (EI) criterion [18] as the acquisition function to explore the candidate policies $\mathcal{B}$ efficiently:

$$\mathrm{EI}(\mathcal{T}) = \mathbb{E}\left[\min\left(\mathcal{L}(\theta|\mathcal{T}(D_A)) - L^*,\, 0\right)\right] = \int \min(L - L^*, 0)\, P_{\theta, D_A}(L\,|\,\mathcal{T})\, dL \tag{4}$$

Here, the expectation in (4) is taken over the density $P_{\theta, D_A}$ on the codomain of the loss $\mathcal{L}(\theta|\mathcal{T}(D_A))$, which measures the statistical potential of the unexplored augmented data $(\tau(x), y) \in \mathcal{T}(D_A)$ to approximate (3) for the given pre-trained model $\mathcal{M}(\cdot|\theta)$. Recall that $\mathcal{T}$ consists of sub-policies $\tau_1, \dots, \tau_{N_\mathcal{T}}$ and corresponding parameters $\{p_1, \dots, p_{N_\mathcal{T}}\}$ and $\{\lambda_1, \dots, \lambda_{N_\mathcal{T}}\}$; hence the density $P_{\theta, D_A}(L|\mathcal{T})$ is actually determined by these parameters. $L^*$ in (4) denotes a constant loss threshold determined by a quantile of the observations among previously explored policies. We employ variable kernel density estimation [35] on the graph-structured search space $\mathcal{S}$ to estimate the density $P_{\theta, D_A}(L|\mathcal{T})$ and eventually approximate the criterion (4). In practice, since this optimization method is already realized in the tree-structured Parzen estimator (TPE) algorithm [2], we use the HyperOpt library for a parallelized implementation.

### 3.3 Implementation

Fast AutoAugment searches for the desired augmentation policies by applying the aforementioned Bayesian optimization to distributed training splits. In other words, the overall search process consists of two steps: (1) training the model parameters on the K-fold training data with default augmentation rules, and (2) exploration-and-exploitation using HyperOpt to search for the optimal augmentation policies. Below, we describe the practical implementation of these steps in Algorithm 1. The procedures are mostly parallelizable, which makes the proposed method efficient in actual usage. We utilize Ray [24] to implement Fast AutoAugment, which enables us to train models and search for policies in a distributed manner.
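To make the exploration step concrete, here is a minimal sketch of how a TPE search over sub-policy parameters could be set up with the HyperOpt library. This is not the authors' released code: the helpers `decode_policy` and `evaluate_augmented_loss`, the variables `pretrained_model` and `d_a_split`, the operation subset, and the flat space layout are illustrative assumptions (the actual implementation also parallelizes evaluations via Ray).

```python
from hyperopt import fmin, hp, tpe, Trials

NUM_SUBPOLICIES = 5   # sub-policies per candidate policy (N_T in the paper)
NUM_OPS = 2           # operations per sub-policy (N_tau in the paper)
OP_NAMES = ["autocontrast", "rotate", "cutout"]  # placeholder subset of the 16 operations

# Continuous search space: every operation slot gets a name, a calling
# probability p, and a magnitude lambda, both drawn from [0, 1] (no discretization).
space = {
    f"sp{i}_op{j}": {
        "name": hp.choice(f"sp{i}_op{j}_name", OP_NAMES),
        "prob": hp.uniform(f"sp{i}_op{j}_prob", 0.0, 1.0),
        "mag": hp.uniform(f"sp{i}_op{j}_mag", 0.0, 1.0),
    }
    for i in range(NUM_SUBPOLICIES)
    for j in range(NUM_OPS)
}

def objective(params):
    # Assumed helpers and variables: decode_policy turns the sampled dict into a
    # list of sub-policies; evaluate_augmented_loss returns the cross-entropy
    # L(theta | T(D_A)) of the *pre-trained* model on D_A augmented by the
    # candidate policy (no further training); pretrained_model and d_a_split
    # stand for theta and D_A and are assumed to exist in scope.
    policy = decode_policy(params)
    return evaluate_augmented_loss(pretrained_model, d_a_split, policy)

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=200, trials=trials)  # B = 200 evaluations per search step
```

Repeating this search for each of the T steps and each of the K folds, and keeping the N lowest-loss candidates per step, yields the merged policy list of Algorithm 1.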
Algorithm 1: Fast AutoAugment
Input: $(\theta, D_{\text{train}}, K, T, B, N)$
1. Split $D_{\text{train}}$ into $K$-fold data $D^{(k)}_{\text{train}} = \{(D^{(k)}_M, D^{(k)}_A)\}$  // stratified shuffling
2. for $k \in \{1, \dots, K\}$ do
3.     $\mathcal{T}^{(k)}_* \leftarrow \emptyset$, $(D_M, D_A) \leftarrow (D^{(k)}_M, D^{(k)}_A)$  // initialize
4.     Train $\theta$ on $D_M$
5.     for $t \in \{0, \dots, T-1\}$ do
6.         $\mathcal{B} \leftarrow \text{BayesOptim}(\mathcal{T}, \mathcal{L}(\theta|\mathcal{T}(D_A)), B)$  // explore-and-exploit
7.         $\mathcal{T}_t \leftarrow$ select top-$N$ policies in $\mathcal{B}$
8.         $\mathcal{T}^{(k)}_* \leftarrow \mathcal{T}^{(k)}_* \cup \mathcal{T}_t$  // merge augmentation policies
9. return $\mathcal{T}_* = \bigcup_k \mathcal{T}^{(k)}_*$

Shuffle (Line 1): We split the training set while preserving the percentage of samples for each class (stratified shuffling), using the StratifiedShuffleSplit method in scikit-learn [27] (a minimal sketch of this split is given below, after the experimental setup).

Train (Line 4): We train a model on each training split. We implement this to run in parallel across multiple machines to reduce the total running time when sufficient computational resources are available.

Explore-and-Exploit (Line 6): We use the HyperOpt library from Ray with $B$ search iterations and a maximum of 20 concurrent evaluations. Different from AutoAugment, we do not discretize the search space, since our search algorithm can handle continuous values. We explore one of the possible operations with probability $p$ and magnitude $\lambda$. The values of the probability and the magnitude are uniformly sampled from $[0, 1]$ at the beginning; HyperOpt then modulates these values to optimize the objective $\mathcal{L}$.

Merge (Lines 7-9): We select the top $N$ best policies for each split and then combine the obtained policies from all splits. This set of final policies is used for re-training.

## 4 Experiments and Results

In this section, we examine the performance of Fast AutoAugment (FAA) on the CIFAR-10, CIFAR-100 [20], and ImageNet [6] datasets, and compare the results with baseline preprocessing, Cutout [7], AutoAugment (AA) [5], and PBA [15]. For ImageNet, we only compare the baseline, AA, and FAA, since PBA does not conduct experiments on ImageNet. We follow the experimental setting of AA for a fair comparison, except that an evaluation of the proposed method on the AmoebaNet-B model [28] is omitted. As in AA, each sub-policy consists of two operations ($N_\tau = 2$), each policy consists of five sub-policies ($N_\mathcal{T} = 5$), and the search space consists of the same 16 operations (ShearX, ShearY, TranslateX, TranslateY, Rotate, AutoContrast, Invert, Equalize, Solarize, Posterize, Contrast, Color, Brightness, Sharpness, Cutout, SamplePairing). Interestingly, FAA is able to select Cutout in the searched policies. We conjecture that Cutout can eliminate irrelevant backgrounds and improve the classification accuracy when inference is performed with a well-trained network.

We utilize 5-fold stratified shuffling ($K = 5$), a search width of 2 ($T = 2$), a search depth of 200 ($B = 200$), and 10 selected policies ($N = 10$) for policy evaluation. Owing to the efficiency of the proposed search process, FAA can find a large number of optimized augmentation policies, almost regardless of how many are requested. Therefore, we can treat the number of sub-policies as a hyperparameter to tune. When we use multi-threaded data augmentation, we observe no actual increase in training time from augmentation in comparison to the baseline without augmentation. Moreover, even when we perform both the data augmentation and the SGD weight updates sequentially in a single thread, the increase in training time that we observe is only 10-20% over 200 epochs; in total, this is less than 5 hours on CIFAR-10/100 with Wide-ResNet-28-10 and a single V100 GPU. Hence, the training-time overhead from an increased number of sub-policies is also limited.
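Returning to the Shuffle step (Line 1 of Algorithm 1) above, a minimal sketch of the stratified K-fold split with scikit-learn's StratifiedShuffleSplit might look as follows; the fraction of each fold assigned to $D_A$ is an assumption on our part, since the text only names the method.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def kfold_stratified_split(labels, k=5, d_a_fraction=0.5, seed=0):
    """Return K pairs of index arrays (D_M indices, D_A indices).

    Class proportions are preserved in every split (stratified shuffling);
    d_a_fraction is an assumed value, not taken from the paper.
    """
    splitter = StratifiedShuffleSplit(n_splits=k, test_size=d_a_fraction,
                                      random_state=seed)
    dummy_features = np.zeros((len(labels), 1))  # split() only needs labels for stratification
    return [(m_idx, a_idx) for m_idx, a_idx in splitter.split(dummy_features, labels)]

# Usage sketch on toy labels (two balanced classes):
labels = np.array([0, 1] * 50)
for fold, (m_idx, a_idx) in enumerate(kfold_stratified_split(labels)):
    print(f"fold {fold}: |D_M| = {len(m_idx)}, |D_A| = {len(a_idx)}")
```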
With this in mind, we performed FAA with different numbers of sub-policies and determined the number of sub-policies that produces the best average performance across different datasets and networks. However, as shown in Figure 3, the performance obtained with 25 sub-policies is already comparable to that obtained with larger numbers of sub-policies. We increase the batch size and adapt the learning rate accordingly to speed up training [38]. Otherwise, we set the remaining hyperparameters equal to those of AA where possible. For unknown hyperparameters, we follow the values from the original references or tune them to match baseline performances.

| Model | Baseline | Cutout [7] | AA [5] | PBA [15] | FAA (transfer / direct) |
|---|---|---|---|---|---|
| Wide-ResNet-40-2 | 5.3 | 4.1 | 3.7 | – | 3.6 / 3.7 |
| Wide-ResNet-28-10 | 3.9 | 3.1 | 2.6 | 2.6 | 2.7 / 2.7 |
| Shake-Shake (26 2x32d) | 3.6 | 3.0 | 2.5 | 2.5 | 2.7 / 2.5 |
| Shake-Shake (26 2x96d) | 2.9 | 2.6 | 2.0 | 2.0 | 2.0 / 2.0 |
| Shake-Shake (26 2x112d) | 2.8 | 2.6 | 1.9 | 2.0 | 2.0 / 1.9 |
| PyramidNet+ShakeDrop | 2.7 | 2.3 | 1.5 | 1.5 | 1.8 / 1.7 |

Table 2: Test set error rate (%) on CIFAR-10.

| Model | Baseline | Cutout [7] | AA [5] | PBA [15] | FAA (transfer / direct) |
|---|---|---|---|---|---|
| Wide-ResNet-40-2 | 26.0 | 25.2 | 20.7 | – | 20.7 / 20.6 |
| Wide-ResNet-28-10 | 18.8 | 18.4 | 17.1 | 16.7 | 17.2 / 17.2 |
| Shake-Shake (26 2x96d) | 17.1 | 16.0 | 14.3 | 15.3 | 14.9 / 14.6 |
| PyramidNet+ShakeDrop | 14.0 | 12.2 | 10.7 | 10.9 | 11.9 / 11.7 |

Table 3: Test set error rate (%) on CIFAR-100.

| Model | Baseline | Cutout [7] | AA [5] | PBA [15] | FAA |
|---|---|---|---|---|---|
| Wide-ResNet-28-10 | 1.5 | 1.3 | 1.1 | 1.2 | 1.1 |

Table 4: Test set error rate (%) on SVHN.

| Model | Baseline | AA [5] | FAA |
|---|---|---|---|
| ResNet-50 | 23.7 / 6.9 | 22.4 / 6.2 | 22.4 / 6.3 |
| ResNet-200 | 21.5 / 5.8 | 20.0 / 5.0 | 19.4 / 4.7 |

Table 5: Validation set Top-1 / Top-5 error rate (%) on ImageNet.

### 4.1 CIFAR-10 and CIFAR-100

For both CIFAR-10 and CIFAR-100, we conduct two experiments with FAA: (1) direct search on the full dataset with the target network; (2) transfer of the policies found with Wide-ResNet-40-2 on reduced CIFAR-10, which consists of 4,000 randomly chosen examples. As shown in Tables 2 and 3, overall, FAA significantly improves on the performance of the baseline and Cutout for every network while achieving performance comparable to that of AA.

CIFAR-10 Results. In Table 2, we present the test set error rates for different models. We examine Wide-ResNet-40-2, Wide-ResNet-28-10 [40], Shake-Shake [8], and ShakeDrop [37] models to evaluate the test set performance of FAA. FAA achieves results comparable to AA and PBA in both experiments. We emphasize that the policy search on reduced CIFAR-10 takes only 3.5 GPU-hours. We also estimate the search time for the full direct search: in the worst case, PyramidNet+ShakeDrop requires 780 GPU-hours, which is still far less than the computation time of AA (5000 GPU-hours).

CIFAR-100 Results. The results are shown in Table 3. Again, FAA achieves significantly better results than the baseline and Cutout. However, except for Wide-ResNet-40-2, FAA shows slightly worse results than AA and PBA. Nevertheless, the search costs of the proposed method on CIFAR-100 are the same as those on CIFAR-10. We conjecture that the performance gaps between the other methods and FAA are probably caused by insufficient policy search in the exploration procedure or by over-training of the model parameters in the proposed algorithm.
Figure 3: Validation error (%) of Wide-ResNet-40-2 and Wide-ResNet-28-10 trained on CIFAR-10 and CIFAR-100 as a function of the number of sub-policies (25-250) used in training.

### 4.2 SVHN

We conducted an experiment on the SVHN dataset [25] with the same settings as in AA. We chose 1,000 examples randomly and applied FAA to find augmentation policies. The obtained policies are applied to an initial model, and we obtain performance comparable to AA. The results are shown in Table 4: Wide-ResNet-28-10 with the searched policies performs better than the baseline and Cutout and is comparable with the other methods. We emphasize that we use the same settings as for CIFAR, while AA tuned several hyperparameters on the validation dataset.

### 4.3 ImageNet

Following the experimental setting of AA, we use a reduced subset of the ImageNet training data, composed of 6,000 samples from 120 randomly selected classes. ResNet-50 [12] was trained on each fold for 90 epochs during the policy search phase, and we then trained ResNet-50 [12] and ResNet-200 [13] with the searched augmentation policies. In Table 5, we compare the validation error rates of FAA with those of the baseline and AA on ResNet-50 and ResNet-200. In this test, we exclude AmoebaNet [28] since its exact implementation is not publicly available. As one can see from the table, the proposed method outperforms the benchmarks. Furthermore, our search method is 33 times faster than AA under the same experimental settings (see Table 1). Since extensive data augmentation protects the network from overfitting [14], we believe the performance could be improved further by reducing the weight decay, which was tuned for the model with the default augmentation rules.

## 5 Discussion

Effect of the Number of Augmentation Policies. Similar to AA, we hypothesize that as we increase the number of sub-policies searched by FAA, the given neural network should show improved generalization performance. We investigate this hypothesis by testing trained Wide-ResNet-40-2 and Wide-ResNet-28-10 models on CIFAR-10 and CIFAR-100. We select sub-policy sets from a pool of 400 searched sub-policies and train the models again with each of these sub-policy sets. Figure 3 shows the relation between the average validation error and the number of sub-policies used in training. This result verifies that the performance improves with more sub-policies, up to 100-125 sub-policies.

As one can observe in Tables 2-3, there are small gaps between the performance of policies from direct search and that of policies transferred from reduced CIFAR-10 with Wide-ResNet-40-2. These gaps increase as the model capacity increases, since augmentation policies searched with a small model are limited in how much they can improve the generalization of a large model (e.g., Shake-Shake). Nevertheless, the transferred policies are better than the default augmentations; hence, one can apply these policies to different image recognition tasks.
Comparison with Random Search Strategies. We performed additional experiments with two random search strategies: (1) randomly pre-selected augmentations (RPSA), which first selects a certain number (25/50) of augmentation policies randomly from the search space and then trains Wide-ResNet-28-10 using the selected augmentations for 200 epochs; and (2) random augmentations (RA), which independently samples an augmentation policy for each training input from the whole search space during training for 400 epochs, i.e., twice as many epochs as AA and FAA, to compensate for the search time of both algorithms.

Figure 4: Comparison of test error (%) of Wide-ResNet-28-10 trained on CIFAR-100 between the random search strategies (RA, RPSA-25, RPSA-50), AA, and FAA. Both RPSA and RA are performed on CIFAR-100 and repeated 20 times.

As shown in Figure 4, RPSA performs better than the baseline, but its performance does not improve as the number of selected policies increases, and the best performance obtained by RPSA is still worse than that of FAA. In addition, RA achieves slightly worse results than RPSA, and the improvement from RA is also smaller than that from FAA. Note that even when we take into account the search time of the proposed method on CIFAR-10/100 (see Table 1), the training time for FAA with 200 epochs, including the search time, is shorter than the training time for RA with 400 epochs.

Recently, the proposed FAA contributed to winning first place in the AutoCV competition of the NeurIPS 2019 AutoDL challenge [1]. In particular, since this competition required an AutoML approach under very limited computational resources and time, only a light version of FAA [3] could be applied for augmentation search in this setting, and it eventually led to a performance improvement. The details of this result will be published in the near future.

Search of Augmentation Policies per Class. Taking advantage of the efficiency of the algorithm, we experimented with searching for augmentation policies per class on CIFAR-100 with the Wide-ResNet-40-2 model. We changed the search depth B to 100 and kept the other parameters the same. With the 70 best-performing policies per class, we obtained a slightly improved error rate. Although it is difficult to see a definite improvement compared to AA and FAA, we believe that further optimization in this direction may improve performance further. In particular, we expect the effect to be greater for datasets in which the differences between classes, such as object scale, are large. One can also try tuning the other meta-parameters of Bayesian optimization, such as the search depth or the kernel type in the TPE algorithm, in the augmentation search phase; however, empirically this does not significantly improve model performance.

## 6 Conclusion

We propose an automatic process for learning augmentation policies for a given task and convolutional neural network. Our search method is significantly faster than AutoAugment, and its performance surpasses that of human-crafted augmentation methods. One can apply Fast AutoAugment to advanced architectures such as AmoebaNet and consider various augmentation operations in the proposed search algorithm without increasing the search cost. Moreover, the joint optimization of NAS and Fast AutoAugment is an interesting area in AutoML. We leave these for future work.
We also plan to apply Fast AutoAugment to various computer vision tasks beyond image classification in the near future.

## Acknowledgement

We appreciate every reviewer for their valuable comments. We are also grateful to the Brain Cloud team at Kakao Brain for GPU support.

## References

[1] NeurIPS 2019 AutoDL challenges. https://autodl.chalearn.org/.
[2] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546-2554, 2011.
[3] Kakao Brain. AutoCLINT: Automatic computationally light network transfer. https://github.com/kakaobrain/autoclint, 2019.
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834-848, 2018.
[5] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.
[7] T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
[8] X. Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[10] D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961-2969, 2017.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630-645. Springer, 2016.
[14] A. Hernández-García and P. König. Data augmentation instead of explicit regularization. arXiv preprint arXiv:1806.03852, 2018.
[15] D. Ho, E. Liang, I. Stoica, P. Abbeel, and X. Chen. Population based augmentation: Efficient learning of augmentation policy schedules. In ICML, 2019.
[16] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132-7141, 2018.
[17] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
[18] D. R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345-383, 2001.
[19] S. Kim, I. Kim, S. Lim, C. Kim, W. Baek, H. Cho, B. Yoon, and T. Kim. Scalable neural architecture search for 3D medical image segmentation. 2018.
[20] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[22] J. Lemley, S. Bazrafkan, and P. Corcoran. Smart augmentation learning an optimal data augmentation strategy. IEEE Access, 5:5858-5869, 2017.
[23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21-37. Springer, 2016.
[24] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, et al. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561-577, 2018.
[25] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[26] M. Paschali, W. Simson, A. G. Roy, M. F. Naeem, R. Göbl, C. Wachinger, and N. Navab. Data augmentation with manifold exploring geometric transformations for increased performance and robustness. arXiv preprint arXiv:1901.04420, 2019.
[27] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
[28] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
[29] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[30] I. Sato, H. Nishimura, and K. Yokoi. APAC: Augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229, 2015.
[31] M. Shahrokh Esfahani and E. R. Dougherty. Effect of separate sampling on classification accuracy. Bioinformatics, 30(2):242-250, 2013.
[32] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2107-2116, 2017.
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818-2826, 2016.
[35] G. R. Terrell, D. W. Scott, et al. Variable kernel density estimation. The Annals of Statistics, 20(3):1236-1265, 1992.
[36] T. Tran, T. Pham, G. Carneiro, L. Palmer, and I. Reid. A Bayesian data augmentation approach for learning deep models. In Advances in Neural Information Processing Systems, pages 2797-2806, 2017.
[37] Y. Yamada, M. Iwamura, T. Akiba, and K. Kise. ShakeDrop regularization for deep residual learning. arXiv preprint arXiv:1802.02375, 2018.
[38] Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.
[39] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. arXiv preprint arXiv:1905.04899, 2019.
[40] S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.
[41] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
[42] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697-8710, 2018.