# GenLabel: Mixup Relabeling using Generative Models

Jy-yong Sohn 1, Liang Shang 1, Hongxu Chen 1, Jaekyun Moon 2, Dimitris Papailiopoulos 1, Kangwook Lee 1

Abstract

Mixup is a data augmentation method that generates new data points by mixing a pair of input data. While mixup generally improves the prediction performance, it sometimes degrades the performance. In this paper, we first identify the main causes of this phenomenon by theoretically and empirically analyzing the mixup algorithm. To resolve the issue, we propose GenLabel, a simple yet effective relabeling algorithm designed for mixup. In particular, GenLabel helps the mixup algorithm correctly label mixup samples by learning the class-conditional data distribution using generative models. Via theoretical and empirical analysis, we show that mixup, when used together with GenLabel, can effectively resolve the aforementioned failure modes, improving the accuracy of mixup-trained models.

1. Introduction

Mixup (Zhang et al., 2018) is a widely adopted data augmentation algorithm used when training a classifier, which generates synthetic samples by linearly interpolating two randomly chosen samples. Each mixed sample is soft-labeled, i.e., it is labeled as a mixture of the two (possibly identical) classes of the chosen samples. The rationale behind the mixup algorithm is that such mixed samples can fill up the void space between different class manifolds, effectively regularizing the model behavior. Mixup has been shown to improve generalization on multiple benchmark image datasets. Various variants of mixup have been proposed in the past few years, such as Manifold-mixup (Verma et al., 2019), which applies mixup in the latent feature space, and several computer vision-specific variants (Yun et al., 2019; Kim et al., 2020; Uddin et al., 2021; Kim et al., 2021). Recent studies also provide theoretical explanations for the success of mixup (Zhang et al., 2021; Carratino et al., 2020).

1 Department of Electrical and Computer Engineering, University of Wisconsin-Madison, USA. 2 School of Electrical Engineering, KAIST, Daejeon. Correspondence to: Kangwook Lee.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Mixup, however, does not always improve accuracy, and sometimes it even hurts. Guo et al. (2019) observed that the accuracy of mixup is up to 1.8% worse than that of vanilla training on some image classification tasks. Greenewald et al. (2021) reported a similar finding: the original mixup degrades the accuracy of vanilla training by up to 2.5% on a UCI classification task (Dua & Graff, 2017). Despite such empirical findings, (1) why mixup fails and (2) how to prevent these failures are not yet fully understood.

1.1. Main contributions

In this work, we first take a closer look at the failure scenarios of mixup, particularly focusing on the low-dimensional input setting. As a result, we identify two main reasons behind mixup's failure scenarios. The first one is manifold intrusion, which was defined in (Guo et al., 2019). Samples generated by mixing two classes may intrude the manifold of another class, causing label conflicts with the samples from the intruded class. We analyze the effect of manifold intrusion. The second reason we identify is linear labeling. Mixup assigns a mixed sample a linear combination of the two one-hot encoded labels.
We prove that, focusing on a specific softmax regression setting, such linear labeling results in a strictly suboptimal margin and accuracy.

After identifying the key reasons behind mixup's failure, we propose a simple yet effective fix for mixup. Our proposed algorithm, GenLabel, is a relabeling algorithm designed for mixup. The idea is strikingly simple: GenLabel relabels a mixed sample using the likelihoods that are estimated with learned generative models. Consider the three-way classification problem in Fig. 1. GenLabel learns generative models to estimate the likelihood of a mixed sample. Let p_c(x) be the likelihood of sample x drawn from class c, and let p̂_c(x) be the estimated likelihood. Given a mixed sample x_mix, GenLabel first computes all three likelihoods p̂_c(x_mix) for classes c ∈ {1, 2, 3}, and then assigns the mixed sample the following label:

y_gen = softmax(log p̂_1(x_mix), log p̂_2(x_mix), log p̂_3(x_mix)).

Note that the c-th entry of y_gen is p̂_c(x_mix) / Σ_{c'=1}^{3} p̂_{c'}(x_mix), which equals the posterior probability estimate P̂(y = c | x_mix) when we have a balanced dataset. Thus, GenLabel can be viewed as a labeling method that assigns the posterior probability of the label y given a mixed sample x_mix. This property of GenLabel allows us to fix the issue of the conventional labeling method in mixup, as shown in Fig. 1.

Figure 1: Visualization of GenLabel applied to mixup. Consider two labeled samples (x, y) and (x', y'). Mixup generates x_mix = λx + (1−λ)x' for some λ ∈ [0, 1] and labels it as y_mix = λy + (1−λ)y' = λe_1 + (1−λ)e_3, where e_c is the c-th standard basis vector. This induces label noise since x_mix lies on the manifold of class 2. On the other hand, GenLabel properly relabels it as y_gen ≈ e_2, computed using the learned distribution p̂_c(x) for each class c and the relabeling formula y_gen,c = p̂_c(x_mix) / Σ_{c'} p̂_{c'}(x_mix).

The suggested GenLabel is analyzed from diverse perspectives. First, we empirically show that GenLabel fixes the manifold intrusion issue on toy datasets. Second, we provide a problem setup where GenLabel combined with mixup maximizes the margin of a classifier, while mixup alone leads to a much smaller margin. Third, we show that GenLabel improves the adversarial robustness of mixup for logistic regression models and fully-connected (FC) networks with ReLU activations. Finally, we tested GenLabel on 133 low-dimensional real datasets in OpenML (Vanschoren et al., 2013). Our experimental results show that GenLabel helps mixup improve the accuracy on various low-dimensional datasets. We also found that adversarial robustness can be improved by applying GenLabel.

1.2. Preliminaries

Notations. We focus on k-way classification tasks. A dataset with n data points is denoted by S = {z_i}_{i=1}^{n}, where the i-th data point z_i = (x_i, y_i) is composed of the input feature x_i ∈ R^d and the label y_i ∈ [0, 1]^k. We use one-hot encoding for the label, i.e., the label of class-c data points is represented as y = e_c, where e_c is the standard basis vector with a 1 in the c-th coordinate and 0s elsewhere. Mixed samples can have a soft label, e.g., 0.5e_1 + 0.5e_2 denotes that the mixed sample is equally likely to be from class 1 and class 2. The set of input features is denoted by X = {x_i}_{i=1}^{n}.
We assume that each data point x in class c is generated from an (unknown) probability distribution p_c(x) := p_{x|y}(x | e_c). For a positive integer k, we define [k] := {1, 2, ..., k}. Given a distance metric, d(x_1, x_2) denotes the distance between x_1 and x_2, and d(x, A) = min_{a ∈ A} d(x, a) denotes the minimum distance between x and the points in a closed set A.

Mixup. Mixup (Zhang et al., 2018) generates synthetic data points by taking a linear combination of two samples. Given samples x_i and x_j, it generates x_mix = λx_i + (1−λ)x_j with the mixed label y_mix = λy_i + (1−λ)y_j, for a randomly sampled λ ~ Beta(α, α) with a given α > 0.

Gaussian mixture model. The Gaussian mixture (GM) model is a generative model which assumes that the samples x in each class (say class c) follow a multivariate Gaussian distribution N(μ_c, Σ_c) for some mean μ_c and covariance matrix Σ_c. One can estimate the model parameters by computing the within-class sample mean μ̂_c and the within-class sample covariance matrix Σ̂_c for each class. A GM model of two classes, say classes 1 and 2, can be written as π_1 N(μ̂_1, Σ̂_1) + (1−π_1) N(μ̂_2, Σ̂_2), where π_1 = P(y = e_1).

Kernel density estimator. The kernel density estimator (KDE) is a non-parametric density estimator that makes use of a kernel function. For example, KDE with a Gaussian kernel estimates the distribution of class c as (1/n_c) Σ_{i=1}^{n_c} N(x_i, h² Σ̂_c) for a given bandwidth h, where {x_i}_{i=1}^{n_c} is the set of samples in class c and Σ̂_c is the sample covariance matrix of class c. One can use KDE as a generative model, creating new samples from the estimated density.

Nearest neighbor (NN) based density estimator. Suppose we have n_c data points in class c. Then, the k_c-NN density estimator provides the estimated class-conditional density p̂_c(x) = k_c / (n_c V) for a given sample x, where V is the minimum volume of a sphere centered at x that contains k_c data points belonging to class c (Bishop, 2006).

2. Related Works

Mixup and variants. Mixup and its variants have been considered promising data augmentation schemes, improving the generalization and robustness performance in various image classification tasks (Zhang et al., 2018; Verma et al., 2019; Tokozume et al., 2018; Inoue, 2018; Shimada et al., 2019; Hendrycks et al., 2020; Yun et al., 2019; Kim et al., 2020; Uddin et al., 2021; Kim et al., 2021; Zhang et al., 2021; Park et al., 2022; Liu et al., 2021). However, the performance of mixup on low-dimensional datasets has rarely been examined in previous works. This paper focuses on the failures of mixup on low-dimensional datasets, and provides a simple label correction method that resolves this issue and improves generalization performance on various real datasets.

Manifold intrusion. Guo et al. (2019) observed that mixup samples of two classes may intrude the manifold of a third class. The authors dubbed this label conflict phenomenon manifold intrusion. Hwang & Whang (2021) found that a similar label conflict problem becomes even more salient in the regression setting. To resolve the manifold intrusion issue, previous works have suggested mixing strategies which avoid generating mixup samples that cause the label conflict.

Table 1: Clean accuracy (%) on various synthetic and real datasets. Mixup performs worse than vanilla training on these datasets.

| Dataset | Moon | Four-circle | 3D cube | OpenML-48 | OpenML-307 | OpenML-818 |
|---|---|---|---|---|---|---|
| Vanilla | 98.7 | 91.3 | 93.0 | 39.1 | 66.8 | 100.0 |
| Mixup | 97.0 (↓1.7) | 57.7 (↓33.6) | 89.4 (↓3.6) | 30.9 (↓8.2) | 54.4 (↓12.4) | 92.3 (↓7.7) |
Guo et al. (2019) suggested learning a mixing policy network that prohibits generating in-manifold mixup samples. Greenewald et al. (2021) suggested mixing adjacent data samples by using the concept of optimal transport; this lets both the mixed sample and the corresponding pair of data samples lie on the same manifold, which avoids the label-conflicting scenarios. Focusing on the regression setting, Hwang & Whang (2021) suggested learning a mixing policy by measuring how helpful mixing each pair is. Although all these regularization techniques prohibit generating mixup points that incur label conflicts, they also inherently give up the potential benefit of such label-conflicting mixup samples, which could be exploited by properly relabeling them. In this paper, for the first time, we solve the manifold intrusion issue by re-labeling the mixup samples with the aid of generative models.

Generative models for classification. Generative models have been widely used for classification tasks for several decades. A generative classifier (Ng & Jordan, 2002) predicts the label y based on the class-conditional density p(x|y) estimated by generative models, and such classifiers have continued to be developed in recent years (Schott et al., 2018; Ju & Wagner, 2020). The present paper also makes use of generative models for the classification task, but we use them for re-labeling augmented data, while existing works use them for the prediction itself. Some previous works proposed generative model-based data augmentation schemes (Antoniou et al., 2017; Perez & Wang, 2017; Tanaka & Aranha, 2019). While these schemes use the learned distribution to create on-manifold synthetic data, we use the learned distribution to re-label both on-manifold and out-of-manifold mixup samples. There has been extensive work on using generative models to improve robustness against adversarial attacks and out-of-distribution samples (Ilyas et al., 2017; Xiao et al., 2018; Samangouei et al., 2018; Song et al., 2018; Schott et al., 2018; Li et al., 2019; Ghosh et al., 2019; Serrà et al., 2020; Choi et al., 2018; Lee et al., 2018). Though they look similar, these are not data augmentation algorithms and hence are only tangentially related to our algorithm. Our method can be used together with any of these algorithms.

3. Failure of Mixup on Low-Dimensional Data

In this section, we observe the failure scenarios of mixup, i.e., when mixup performs even worse than vanilla training, especially focusing on the low-dimensional data setting. Table 1 shows scenarios where mixup has a lower accuracy than vanilla training, for synthetic datasets¹ and OpenML datasets (Vanschoren et al., 2013). For example, on the Four-circle dataset, the performance gap between mixup and vanilla training is larger than 30%. Why does mixup have such failure scenarios? Here we identify and analyze two main reasons. First, we show that the manifold intrusion defined in (Guo et al., 2019) may degrade the margin/accuracy of mixup. Second, we show that even when there is no manifold intrusion issue, the labeling method used in mixup may harm the margin/accuracy of a classifier.

Figure 2: Example 1 visualization. Top: the dataset S = {(x_i, y_i)}_{i=1}^{3} with x_1 = −1, x_2 = 0, x_3 = +1 and y_1 = e_1, y_2 = e_2, y_3 = e_1. Bottom: a mixup point generated from x_1 and x_3 with λ = 0.5. This incurs manifold intrusion (a label conflict) because x_2 sits at the same location with a different label. Our analysis shows that this phenomenon is what reduces the margin from 1/2 to 7/16.
3.1. Manifold intrusion reduces margin and accuracy

In (Guo et al., 2019), manifold intrusion (MI) is defined as the case when a mixup sample x_mix, generated by mixing data from classes c_1 and c_2, collides with a real data sample having label c_3 ∉ {c_1, c_2}. Below we provide a concrete problem instance where the margin reduction caused by manifold intrusion is rigorously proved.

Example 1. Consider binary classification on the dataset S = {(x_i, y_i)}_{i=1}^{3} = {(−1, e_1), (0, e_2), (+1, e_1)} in Fig. 2, where each data point in class 1 is drawn as a brown circle and each data point in class 2 as a blue triangle. We consider the classifier f_θ defined as f_θ(x) = 1 if |x| ≥ 2θ, and f_θ(x) = |x|/(2θ) otherwise. This classifier estimates the label of a given feature x as ŷ = [f_θ(x), 1 − f_θ(x)], and the margin of this classifier is margin(f_θ) = min{θ, 1 − θ}. As in Fig. 2, applying mixup to this dataset suffers from manifold intrusion (MI): mixing x_1 and x_3 with coefficient λ = 0.5 generates x_mix = 0 with label y_mix = e_1, overlapping with x_2 = 0, which has the different label y_2 = e_2. Here we compare (1) mixup and (2) mixup-without-MI. To avoid MI, scheme (2) mixes only samples from different classes. Here, we sample λ ~ Beta(1, 1). For each scheme s ∈ {mixup, mixup-without-MI}, let θ_s be the parameter θ that minimizes the MSE loss ℓ(ŷ, y) = ||ŷ − y||_2². We have margin(θ_mixup) = 7/16 and margin(θ_mixup-without-MI) = 1/2, as shown in Section C.1 of the Appendix. This shows that we can achieve the maximum margin 1/2 if we remove mixup points having manifold intrusion, while vanilla mixup degrades the margin to 7/16, implying the margin reduction effect of manifold intrusion.

¹ See Section E in the Appendix for the details of the synthetic datasets.

Now we empirically show how manifold intrusion affects the classification accuracy of mixup on real datasets in OpenML. Similar to the previous example, we consider two schemes: (1) mixup and (2) mixup-without-MI, where the second scheme is defined as usual mixup with the exclusion of mixup points that incur manifold intrusion. Here we decide whether a mixup point suffers from manifold intrusion using a relaxed version of the definition suggested in (Guo et al., 2019):

Definition 1. For a real dataset D = {(x_i, y_i)}_{i=1}^{n}, we say a mixed point x_mix = λx_1 + (1−λ)x_2 has manifold intrusion if the label of the nearest neighbor x_nn = arg min_{x ∈ X} d(x, x_mix) is different from both y_1 and y_2.

Table 2: The effect of manifold intrusion (MI) on logistic regression accuracy, for 7 balanced low-dimensional OpenML datasets having more than two classes. For six out of seven, one can increase the accuracy by discarding the mixup points which incur manifold intrusion, demonstrating MI's effect on accuracy.

| OpenML dataset ID | 18 | 36 | 307 | 468 | 1413 | 1499 | 40984 |
|---|---|---|---|---|---|---|---|
| Mixup-without-MI (%) | 73.70 | 89.21 | 56.77 | 82.73 | 93.33 | 96.19 | 84.88 |
| Mixup (%) | 72.93 | 88.98 | 54.41 | 83.64 | 88.00 | 95.56 | 83.90 |

Table 2 compares the classification accuracy of (1) mixup and (2) mixup-without-MI, for 7 balanced low-dimensional OpenML datasets having more than two classes and fewer than 20 features. We trained a logistic regression model for classification, and the results are averaged over five trials. It turns out that for 6 out of 7 datasets, mixup-without-MI has a higher accuracy than mixup.
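Definition 1 amounts to a single nearest-neighbor query per mixed point. The following is a minimal sketch of this check (our illustration, not the paper's released code); the helper name and the toy data below are hypothetical.

```python
# Nearest-neighbor manifold-intrusion check (Definition 1), assuming numeric features
# in a NumPy array X with integer labels y.
import numpy as np

def has_manifold_intrusion(x_mix, y_i, y_j, X, y):
    """Return True if the nearest training point to x_mix has a label
    different from both endpoints' labels (Definition 1)."""
    dists = np.linalg.norm(X - x_mix, axis=1)   # Euclidean distance to every training point
    y_nn = y[np.argmin(dists)]                  # label of the nearest neighbor
    return y_nn != y_i and y_nn != y_j

# Usage: drop label-conflicting mixup points ("mixup-without-MI") on toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)); y = rng.integers(0, 3, size=100)
i, j = rng.integers(0, 100, size=2)
lam = rng.beta(1.0, 1.0)
x_mix = lam * X[i] + (1 - lam) * X[j]
keep = not has_manifold_intrusion(x_mix, y[i], y[j], X, y)
```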
This shows that excluding the manifold-intruding mixup points is beneficial for improving the classification accuracy on a large portion of the tested low-dimensional real datasets.

3.2. Labeling method in mixup is sub-optimal

Here we show that for datasets without manifold intrusion, the labeling method used in mixup is sub-optimal in terms of margin and accuracy. Note that for a given mixed point x_mix = λx_i + (1−λ)x_j, the conventional method uses a linear interpolation of the labels of the original samples, y_lin = λy_i + (1−λ)y_j. We call this method linear labeling. Here we compare it with an alternative labeling dubbed logistic labeling, y_log = ρy_i + (1−ρ)y_j, where ρ = 1 / (1 + exp{−2(λ − 1/2)/σ²}) for some σ > 0. We show that mixup with linear labeling performs worse than mixup with logistic labeling on some datasets, implying the sub-optimality of the conventional label. Below we start by analyzing the margin of mixup with linear/logistic labeling on a synthetic dataset.

Example 2. Consider a dataset with three data points as in Fig. 3a, where the feature-label pairs are defined as (x_1, y_1) = ([−d, +d], e_1), (x_2, y_2) = ([+d, +d], e_2), and (x_3, y_3) = ([−d, −d], e_3) with d = 5. We train a softmax regression model W = [w_1ᵀ; w_2ᵀ; w_3ᵀ] ∈ R^{3×2} for vanilla training and for mixup with linear/logistic labeling. Given x = [x(1), x(2)], the prediction score is p = [p_1, p_2, p_3] = (1 / Σ_{i=1}^{3} exp(w_iᵀx)) [e^{w_1ᵀx}, e^{w_2ᵀx}, e^{w_3ᵀx}]. Figs. 3b, 3c, and 3d compare the decision boundaries: mixup with the linear label harms the margin, while mixup with the logistic label achieves the maximum margin.

Figure 3: Comparison of linear/logistic labeling for the dataset in (a). Panels: (a) dataset; (b) vanilla training; (c) mixup + linear labeling; (d) mixup + logistic labeling; (e) label for mixup + linear labeling; (f) label for mixup + logistic labeling; (g) score for SVM. From (b)-(d), one can see that logistic labeling results in the max-margin classifier while linear labeling does worse than vanilla training. This implies the need for nonlinear labeling for mixup. See Section 3.2 for more detailed explanations.

The effect of the linear/logistic labeling methods on the margin can be explained as follows. Recall that softmax regression finds the model that minimizes the cross-entropy loss between the prediction p and the label y, and the minimum is achieved when p = y. In Fig. 3a, consider mixing x_2 and x_3 with coefficient λ ∈ [0, 1] along the line x(2) = x(1), generating x_mix = λx_2 + (1−λ)x_3 = [(2λ−1)d, (2λ−1)d]. As in Fig. 3e, mixup with linear labeling assigns the label y_lin = [y_1, y_2, y_3] = [0, λ, 1−λ] to the mixup points on the line x(2) = x(1). The model W is trained so that p resembles y_lin, i.e., p_1 = 0 and both p_2 and p_3 are linear functions of λ along the line x(2) = x(1). This is true when w_2 and w_3 are close enough and symmetric about the line x(2) = x(1), as in Fig. 3c; if we set w_2 = [−1 + 1/(2d), 1 + 1/(2d)] and w_3 = [−1 − 1/(2d), 1 − 1/(2d)], then using exp(w_2ᵀ x_mix) ≈ 1 + w_2ᵀ x_mix, we have p_2 ≈ (1/2)(1 + w_2ᵀ x_mix) = λ = y_2 and similarly p_3 ≈ 1 − λ = y_3. This implies that a model W trained to set p = y_lin will look like the solution in Fig. 3c, especially when d is large. Thus, fitting the softmax regression model to mixup with linear labeling reduces the margin on this toy dataset.
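A small numeric sketch (ours, not the paper's code) contrasting the two labeling rules of Section 3.2 for a mixed pair with labels (e_2, e_3); the σ value follows Example 2.

```python
# Linear vs. logistic soft labels as a function of the mixing coefficient λ.
import numpy as np

def linear_label(lam, y_i, y_j):
    return lam * y_i + (1 - lam) * y_j                            # y_lin = λ y_i + (1-λ) y_j

def logistic_label(lam, y_i, y_j, sigma):
    rho = 1.0 / (1.0 + np.exp(-2.0 * (lam - 0.5) / sigma**2))     # ρ from Section 3.2
    return rho * y_i + (1 - rho) * y_j

e2, e3 = np.eye(3)[1], np.eye(3)[2]
sigma = 1 / (2 * np.sqrt(5))      # σ = 1/(2·sqrt(d)) with d = 5, the choice used in Example 2
for lam in (0.1, 0.4, 0.5, 0.6, 0.9):
    print(lam, linear_label(lam, e2, e3), logistic_label(lam, e2, e3, sigma))
# Linear labels vary proportionally with λ, while logistic labels saturate quickly toward
# e_2 or e_3 away from λ = 1/2, mimicking the SVM score in Fig. 3g.
```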
We now explain how logistic labeling enjoys a large margin, as in Fig. 3d. Again, consider the mixup points x_mix = λx_2 + (1−λ)x_3 = [(2λ−1)d, (2λ−1)d]. As in Fig. 3f, logistic labeling with σ = 1/(2√d) assigns the label y_log = [y_1, y_2, y_3] = [0, ρ, 1−ρ] to these mixup points, where ρ = 1 / (1 + exp(−8d(λ − 1/2))). Then, one can confirm that p = y_log holds for the support vector machine (SVM) solution with w_1 = [−1, 1], w_2 = [1, 1], and w_3 = [−1, −1]. Note that the logistic label y in Fig. 3f resembles the score p of the SVM solution in Fig. 3g, which corroborates the fact that logistic labeling guides us toward the maximum margin.

The above analysis shows that the linear labeling method harms the margin of a mixup-trained classifier, while the logistic labeling method allows mixup to enjoy the maximum margin. This clearly shows that the conventional labeling method of mixup is sub-optimal, and an appropriate re-labeling method improves the margin significantly.

We also tested the classification accuracy of mixup with linear/logistic labeling on 133 real datasets in OpenML (Vanschoren et al., 2013) having fewer than 20 features, using the logistic regression model. In order to decouple the effect of labeling from the effect of manifold intrusion, we removed the mixed points incurring manifold intrusion, following the criterion in Definition 1. It turns out that the accuracy gain from using logistic labeling is positive for 47.4%, zero for 16.5%, and negative for 36.1% of the tested datasets. In other words, using logistic labeling instead of linear labeling improves the accuracy on around half of the tested real datasets in OpenML. This shows that the conventional method of labeling mixup points is not optimal in terms of accuracy on many low-dimensional real datasets.

4. GenLabel

In the previous section, we observed two main issues of mixup. The first issue is manifold intrusion, which occurs since mixup blindly interpolates two randomly chosen samples without knowing the underlying data distribution. Second, the conventional method of labeling mixup samples is sub-optimal in terms of margin and accuracy. Motivated by these observations, we propose GenLabel, a method of re-labeling mixup samples based on the data distribution estimated by generative models. GenLabel consists of three steps. First, we estimate the class-conditional data distribution p̂_c(x) for each class c. Second, we apply the mixup-based data augmentation, generating a mixup sample x_mix originally labeled as y_mix. Finally, we relabel x_mix based on p̂_c(x), i.e., we define the new label as y_gen = softmax(log p̂_1(x_mix), ..., log p̂_k(x_mix)). This new label is called GenLabel since it makes use of generative models for labeling. Note that the suggested GenLabel is a re-labeling method, and we follow the mixing strategy of mixup by default. Throughout the paper, the scheme called GenLabel refers to "mixup + GenLabel".

Algorithm 1: GenLabel
Input: dataset S = {(x_i, y_i)}_{i=1}^{n}, learning rate η, loss ratio γ
Output: discriminative model f_θ(·)
1: θ ← random initial model parameters
2: p̂_c(·) ← estimated density for each class c ∈ [k]
3: for (x_i, y_i), (x_j, y_j) ∈ S do
4:   (x_mix, y_mix) ← (λx_i + (1−λ)x_j, λy_i + (1−λ)y_j)
5:   y_gen ← Σ_{c=1}^{k} ( p̂_c(x_mix) / Σ_{c'=1}^{k} p̂_{c'}(x_mix) ) e_c
6:   θ ← θ − η ∇_θ { γ ℓ_CE(y_gen, f_θ(x_mix)) + (1−γ) ℓ_CE(y_mix, f_θ(x_mix)) }
7: end for

Now we give a formal description of GenLabel. Given a dataset S, we first train a class-conditional generative model, thereby obtaining estimates p̂_c(x) of the underlying data distribution.
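As a concrete instance of this density-estimation step, the sketch below (ours, not the released code) fits the two explicit-density models used in the experiments: a class-conditional Gaussian (GM) and a per-class Gaussian-kernel KDE. The scikit-learn KDE uses an isotropic bandwidth, which is a simplification of the h²Σ̂_c kernel from Section 1.2; class and function names are illustrative.

```python
# Class-conditional density estimation for GenLabel, assuming scikit-learn and SciPy.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.neighbors import KernelDensity

class ClassConditionalGaussian:
    """Gaussian mixture (GM) generative model: one Gaussian N(mu_c, Sigma_c) per class."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.dists_ = {
            c: multivariate_normal(
                X[y == c].mean(axis=0),
                np.cov(X[y == c], rowvar=False) + 1e-6 * np.eye(X.shape[1]))  # small jitter for stability
            for c in self.classes_}
        return self

    def log_likelihood(self, x):
        # returns log p̂_c(x) for every class c, in class order
        return np.array([self.dists_[c].logpdf(x) for c in self.classes_])

def fit_class_kdes(X, y, bandwidth=0.5):
    """KDE alternative: one Gaussian-kernel KDE per class (bandwidth h is a hyperparameter)."""
    return {c: KernelDensity(bandwidth=bandwidth).fit(X[y == c]) for c in np.unique(y)}
```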
Then, for a randomly chosen data pair (x_i, y_i), (x_j, y_j) ∈ S, we apply the mixup scheme, generating the mixed feature x_mix = λx_i + (1−λ)x_j and the mixed label y_mix = λy_i + (1−λ)y_j. Here, the mixing coefficient follows the beta distribution, i.e., λ ~ Beta(α, α) for some α > 0. Finally, we re-label this augmented data x_mix as

y_gen = Σ_{c=1}^{k} ( p̂_c(x_mix) / Σ_{c'=1}^{k} p̂_{c'}(x_mix) ) e_c,    (1)

whose c-th entry is the posterior probability estimate P̂(y = c | x_mix) when we have a balanced dataset. Since our generative model is imperfect, y_gen may be incorrect for some samples. Thus, we can use a combination of the mixup labeling and the suggested labeling, e.g., define the label of the mixed point as γ y_gen + (1−γ) y_mix for some γ ∈ [0, 1]. Note that our scheme reduces to mixup when γ = 0. Using the relabeled augmented data, the algorithm trains the classifier f_θ : R^d → [0, 1]^k that predicts the label y = [y_1, ..., y_k] of the input data. Here, the cross-entropy loss ℓ_CE(·) is used while training the classifier. The pseudocode of GenLabel is provided in Algorithm 1.

In summary, the proposed scheme is a novel label correction method for mixup, which first learns the data distribution of each class using class-conditional generative models, and then re-labels the mixup data based on the conditional likelihood of the mixup data being sampled from each class. More precisely, GenLabel sets the label of a mixup point to the softmax of the class-conditional log-likelihoods, which matches the posterior probability for balanced datasets.

Remark 1. GenLabel as described in Algorithm 1 assumes that the generative model learns the data distribution in the input feature space. However, we can easily extend our algorithm to the case where we train generative models in the latent feature space. Due to space limitations, the detailed algorithm description is in Section A.1 of the Appendix.
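Putting the pieces together, here is a minimal PyTorch-style sketch of one pass of Algorithm 1, assuming a `density` object exposing per-class log-likelihoods (e.g., the ClassConditionalGaussian sketched above); it is an illustration under those assumptions, not the authors' released implementation. Since cross-entropy is linear in the target, weighting the two losses by γ is equivalent to taking cross-entropy against the γ-combined soft label, which is what the sketch does.

```python
# One epoch of mixup + GenLabel (Algorithm 1 sketch).
import numpy as np
import torch
import torch.nn.functional as F

def genlabel_epoch(model, optimizer, X, Y_onehot, density, alpha=1.0, gamma=0.5):
    n, k = Y_onehot.shape
    perm = np.random.permutation(n)                           # random pairing of samples
    for i, j in zip(range(n), perm):
        lam = np.random.beta(alpha, alpha)                    # λ ~ Beta(α, α)
        x_mix = lam * X[i] + (1 - lam) * X[j]                 # mixed feature
        y_mix = lam * Y_onehot[i] + (1 - lam) * Y_onehot[j]   # mixup label

        log_p = density.log_likelihood(x_mix)                 # Eq. (1): softmax of log-likelihoods
        y_gen = np.exp(log_p - log_p.max()); y_gen /= y_gen.sum()

        x_t = torch.tensor(x_mix, dtype=torch.float32).unsqueeze(0)
        log_probs = F.log_softmax(model(x_t), dim=1)
        # soft-label cross entropy against the γ-combined target
        y_t = gamma * torch.tensor(y_gen) + (1 - gamma) * torch.tensor(y_mix)
        loss = -(y_t.float().unsqueeze(0) * log_probs).sum(dim=1).mean()
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```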
5. Analysis of GenLabel

We analyze the effect of GenLabel from various perspectives. In Section 5.1, we visualize GenLabel on toy datasets, empirically showing that GenLabel fixes the manifold intrusion issue of mixup. Section 5.2 provides a concrete problem instance where GenLabel solves the margin reduction issue of linear labeling in mixup. Finally, Section 5.3 shows that GenLabel improves the robustness of mixup for logistic regression models and fully-connected ReLU networks.

5.1. GenLabel solves the manifold intrusion issue

Consider the datasets given in Fig. 4a: nine classes of 2-dimensional Gaussian data on the top row, and two classes with two circles per class on the bottom row; we call the top one the 9-class Gaussian dataset and the bottom one the Four-circle dataset. Note that both datasets contain numerous mixed points suffering from manifold intrusion, e.g., the mixed points of blue and orange samples lie on the black class in the 9-class Gaussian dataset. Given a soft label y = [y_1, ..., y_k], define the top-1 label as y_top-1 = arg max_{c ∈ [k]} y_c. Fig. 4b and Fig. 4c illustrate the top-1 label of a mixed point, for the conventional labeling in mixup and the suggested GenLabel scheme, respectively. For the 9-class Gaussian data, we set the mixing coefficient to λ = 0.6 for the purpose of illustration. As shown in Fig. 4b, the conventional labeling method causes the label conflict issue for a large number of mixup samples. This issue is resolved by GenLabel, as shown in Fig. 4c: the label of the mixed points assigned by GenLabel matches the label of the maximum margin classifier for each dataset. This label correction method also increases the margin of a classifier. Figs. 4d, 4e, 4f and 4g show the decision boundaries of vanilla training, mixup (with the original labeling), GenLabel, and kernel SVM using the radial basis function kernel exp(−2||x − x'||²), respectively. While vanilla training and mixup have a small margin for some samples, GenLabel achieves the ideal margin of the kernel SVM.

5.2. GenLabel improves the margin

Here we provide a concrete problem instance where switching from the linear label to the GenLabel of a mixup sample increases the margin to its maximum value, i.e., margin(SVM) = margin(GenLabel) > margin(vanilla) > margin(mixup), where SVM represents the support vector machine (Cortes & Vapnik, 1995) achieving the maximum L2 margin.

Example 3. Consider a dataset S = {(x_i, y_i)}_{i=1}^{n+2}, where the feature x_i ∈ R² and the label y_i ∈ {+1, −1} of each point are specified in Fig. 5a. Let θ = (r cos φ, r sin φ) be the model parameter for a fixed r > 0. Fig. 5b shows φ* = arg min_φ ℓ(r, φ) for various r, where ℓ is the logistic loss applied to the (augmented) dataset. It turns out that the optimal φ* of GenLabel approaches the SVM solution φ_svm = 0.25π as r increases. The detailed derivation of φ* for each scheme is given in Section C.2 of the Appendix.

Remark 2. For large r = ||θ||, the original mixup does not converge to the max-margin solution, while mixup relabeled by GenLabel approaches the max-margin solution.

In summary, for Examples 2 and 3, (1) the original mixup method has a smaller margin than vanilla training, and (2) simply changing the label of the mixed points (using GenLabel) fixes this issue and achieves the maximum margin.

5.3. GenLabel improves the robustness of mixup

We provide a mathematical analysis of the effect of GenLabel on adversarial robustness. The theorem below shows that GenLabel improves the robustness of mixup for the logistic regression model and FC ReLU networks, for Gaussian data. Due to space limitations, we put the formal statement and proofs in Sections B.1 and C of the Appendix.

Theorem (informal). Consider a binary classification problem where each class follows a Gaussian distribution. For a logistic regression model and fully-connected ReLU networks parameterized by θ, we have ℓ_mixup(θ) ≥ ℓ_GenLabel(θ) ≥ ℓ_adv(θ).

This shows that the adversarial loss of a model is upper bounded by the GenLabel loss of the model, i.e., if we find a model with GenLabel loss smaller than or equal to a threshold l_th, then the adversarial loss of this model is at most l_th. Compared with the mixup loss, the GenLabel loss is a tighter upper bound on the adversarial loss. This implies that GenLabel improves the robustness of mixup.
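The margin claim of Example 3 (Section 5.2) can also be checked numerically. Below is a small sketch (our own illustration; the exact pairing scheme, λ grid, and n used for Fig. 5b are specified in Section C.2, so treat these as assumptions) that grid-searches the loss-minimizing angle φ at a fixed norm r for vanilla training and for mixup with linear labels, to be compared against φ_svm = 0.25π.

```python
import numpy as np

# Features and ±1 labels from Example 3 (n copies of [-1, 0] labeled -1).
n = 20
X = np.vstack([[1, 0], [0, 1]] + [[-1, 0]] * n).astype(float)
y = np.array([1, 1] + [-1] * n, dtype=float)

def vanilla_loss(theta):
    return np.mean(np.log1p(np.exp(-y * (X @ theta))))

def mixup_loss(theta, lambdas=np.linspace(0.01, 0.99, 25)):
    # average the linearly-labeled logistic loss over all ordered pairs and a λ grid
    total = 0.0
    for lam in lambdas:
        Xm = lam * X[:, None, :] + (1 - lam) * X[None, :, :]   # all pairs (i, j)
        z = Xm @ theta
        li = np.log1p(np.exp(-y[:, None] * z))                 # loss w.r.t. label y_i
        lj = np.log1p(np.exp(-y[None, :] * z))                 # loss w.r.t. label y_j
        total += np.mean(lam * li + (1 - lam) * lj)
    return total / len(lambdas)

r = 20.0
phis = np.linspace(0, np.pi / 2, 721)
thetas = r * np.stack([np.cos(phis), np.sin(phis)], axis=1)
phi_vanilla = phis[np.argmin([vanilla_loss(t) for t in thetas])]
phi_mixup = phis[np.argmin([mixup_loss(t) for t in thetas])]
print(phi_vanilla / np.pi, phi_mixup / np.pi, 0.25)   # compare against φ_svm = 0.25π
```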
6. Experimental Results

Now we investigate the effect of GenLabel on real datasets. We consider two models: logistic regression and fully-connected (FC) ReLU networks with 2 hidden layers. We focus on OpenML (Vanschoren et al., 2013), which hosts various real tabular datasets. Specifically, we consider 160 low-dimensional OpenML datasets having fewer than 20 features and fewer than 5000 data points. Among these 160 datasets, we filter out 27 redundant datasets that are different versions of the datasets named iris, seeds, and chscasecensus, leaving 133 datasets for testing. For the logistic regression model, we report results for the 82 datasets that fit well on either the KDE or GM model; we used a dataset if either generative model attains more than 95% train accuracy.² Recall that we allow the combination of the mixup labeling y_mix and the suggested labeling y_gen, i.e., we re-label the mixed point by γ y_gen + (1−γ) y_mix for γ ∈ [0, 1]. Here we choose the optimal mixing ratio γ using cross-validation. All algorithms are implemented in PyTorch (Paszke et al., 2017), and the experimental details are summarized in Section E of the Appendix. All results are averaged over 5 trials. The GitHub repo with reproducible code is available at https://github.com/UW-Madison-Lee-Lab/GenLabel_official.

² It is clear that our algorithm will not work as expected if the chosen generative model has a poor fit to the data. By using a larger class of generative models, such as flow (Kingma & Dhariwal, 2018) and diffusion models (Kingma et al., 2021), one should be able to handle the omitted datasets, but we leave it as future work.

Figure 4: Comparison of vanilla training, mixup, and the suggested GenLabel on the Four-circle dataset. (a): dataset. (b), (c): mixup points and their top-1 labels under y_mix and y_gen; one can see that y_mix causes label conflicts while y_gen does not. (d), (e), (f), (g): decision boundaries of vanilla training, mixup, GenLabel, and kernel SVM. The margin of the GenLabel classifier is larger than those of vanilla training and mixup and matches that of the kernel SVM.

Figure 5: (a) Logistic regression on (x_1, y_1) = ([1, 0], +1), (x_2, y_2) = ([0, 1], +1), and (x_i, y_i) = ([−1, 0], −1) for i = 3, 4, ..., n+2 (the n negative points overlap at [−1, 0]), with model parameter θ = (r cos φ, r sin φ). (b) The angle φ* with minimum loss versus the weight norm ||θ||: GenLabel converges to the SVM solution as ||θ|| increases much faster than the vanilla scheme, while the original mixup converges to a suboptimal value.

Suggested schemes. For OpenML datasets, we considered three types of generative models: Gaussian mixture (GM), kernel density estimator (KDE) with Gaussian kernel, and 1-nearest-neighbor (NN) density estimator (Bishop, 2006). We denote these schemes by GenLabel (GM), GenLabel (KDE), and GenLabel (NN), respectively. We also considered using cross-validation (CV) to choose either GM or KDE; this scheme is denoted by GenLabel (CV). For logistic regression, we used GenLabel in the input feature space, and for FC ReLU networks, we used GenLabel in the latent feature space at the penultimate layer. For image datasets (MNIST, CIFAR-10, CIFAR-100 and Tiny ImageNet), we tested GenLabel on mixup and manifold-mixup; the results are given in the Appendix.

Compared schemes. For all datasets, we compared our scheme with vanilla training, mixup, and AdaMixup (Guo et al., 2019). For OpenML datasets, we added comparisons with (1) a generative classifier using the Gaussian mixture (GM) as the generative model and (2) mixup + excluding MI, which excludes the mixup points with manifold intrusion (MI). Here, we detected mixup points with MI using Definition 1. For image datasets, we tested mixup and manifold-mixup and compared them with mixup + GenLabel and manifold-mixup + GenLabel.
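For concreteness, here is a minimal sketch (ours) of how the mixing ratio γ and the generative model (GM vs. KDE) can be chosen by cross-validation, as in GenLabel (CV); `train_and_eval` is a hypothetical helper that trains mixup + GenLabel with the given configuration and returns validation accuracy, and the grids are illustrative.

```python
# Cross-validated selection of (generative model, γ) for GenLabel (CV).
import numpy as np
from sklearn.model_selection import KFold

def select_genlabel_config(X, y, train_and_eval, gammas=(0.0, 0.25, 0.5, 0.75, 1.0),
                           models=("gm", "kde"), n_splits=5):
    """Return the (model, gamma) pair with the best mean validation accuracy."""
    best, best_acc = None, -np.inf
    for model in models:
        for gamma in gammas:
            accs = []
            for tr, va in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
                accs.append(train_and_eval(X[tr], y[tr], X[va], y[va], model, gamma))
            if np.mean(accs) > best_acc:
                best, best_acc = (model, gamma), np.mean(accs)
    return best
```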
6.1. Results on clean accuracy

We compare the accuracy of GenLabel with various baselines on the logistic regression model.

Figure 6: The accuracy increase (%) of GenLabel (CV) compared with mixup on 82 OpenML datasets, where λ ~ Beta(α, α). For both α values, a large portion of the tested datasets enjoy an accuracy improvement by combining GenLabel with mixup.

Statistics of GenLabel vs. baselines. Fig. 6 compares the accuracies of GenLabel (CV) and mixup on 82 OpenML datasets. The x-axis is (A^cln_GenLabel − A^cln_mixup), the clean-accuracy increase with the aid of GenLabel, and the y-axis is the number of datasets having that accuracy gain. Recall that λ ~ Beta(α, α) for a given α = 1 or α = 2. For both settings, the accuracy of GenLabel is greater than or equal to that of mixup for more than 69% of the tested datasets.

Table 3 compares the accuracies of GenLabel and baselines on OpenML datasets. Each cell shows the percentage of datasets satisfying the condition for (1) the 82 datasets and (2) 10 balanced datasets among them, where the result for (2) is in parentheses. For example, for 54.88% of the 82 datasets, GenLabel has higher accuracy than mixup. The number of datasets where GenLabel (CV) outperforms each baseline is larger than or equal to the number of datasets where the baseline outperforms GenLabel (CV). The results for the 10 balanced datasets (in parentheses in Table 3) show similar behavior.

Table 3: Accuracies of GenLabel and baselines when training a logistic regression model, on OpenML datasets. We tested on (a) 82 datasets where a generative model fits the data well, and (b) 10 balanced datasets among them. Each cell has two values: the result for case (a), and the result for case (b) in parentheses. For example, among the 82 datasets, GenLabel has higher accuracy than mixup for 54.88% of the tested datasets. The number of datasets where GenLabel outperforms each baseline is larger than or equal to the number of datasets where GenLabel is worse than the baseline.

| Mixup + GenLabel (CV) versus | Higher | On-par | Lower |
|---|---|---|---|
| Vanilla | 47.56% (50%) | 7.32% (30%) | 45.12% (20%) |
| AdaMixup | 45.12% (50%) | 9.76% (20%) | 45.12% (30%) |
| Mixup | 54.88% (40%) | 7.32% (30%) | 35.37% (30%) |
| Mixup + Excluding MI | 53.66% (50%) | 6.10% (20%) | 39.02% (30%) |
| Generative Classifier (GM) | 52.42% (70%) | 3.66% (10%) | 41.46% (20%) |

Table 4: Accuracy (%) comparison on selected OpenML datasets, for the logistic regression model. For dataset IDs 830 and 1413, GenLabel has over 8% accuracy gain over mixup. Note that the comparisons for all tested datasets are in Table 3.

| Methods \ Dataset ID | 721 | 777 | 792 | 830 | 855 | 913 | 1413 |
|---|---|---|---|---|---|---|---|
| Generative Classifier (GM) | 78.33 | 53.33 | 74.67 | 78.67 | 65.33 | 71.33 | 95.56 |
| Vanilla | 79.67 | 58.67 | 73.20 | 77.60 | 63.33 | 70.80 | 95.56 |
| AdaMixup | 80.33 | 64.00 | 73.87 | 78.40 | 66.67 | 70.53 | 92.44 |
| Mixup | 79.33 | 62.67 | 73.47 | 76.27 | 66.00 | 69.87 | 88.00 |
| Mixup + Excluding MI | 79.67 | 62.67 | 74.53 | 78.13 | 66.40 | 71.47 | 93.33 |
| Mixup + GenLabel (GM) | 81.00 | 58.67 | 75.47 | 86.13 | 66.40 | 71.47 | 96.00 |
| Mixup + GenLabel (KDE) | 79.67 | 58.67 | 75.87 | 77.33 | 67.60 | 72.67 | 96.00 |
| Mixup + GenLabel (CV) | 80.33 | 64.00 | 75.60 | 84.53 | 67.33 | 73.20 | 96.44 |

Table 5: Robustness against a black-box attack (Brendel et al., 2018) with radius ε = 0.1, on 13 balanced OpenML datasets having fewer than 500 data samples and fewer than 20 features, for FC ReLU networks with 2 hidden layers. (a): GenLabel (best among NN and GM) versus baselines; the portion of datasets where GenLabel outperforms is over 53%. (b): Robust accuracy (%) on 6 datasets; GenLabel achieves a 2-10% improvement over mixup. See Table 7 in the Appendix for the full result.
(a) Statistics for 13 balanced datasets.

| Mixup + GenLabel (Best) versus | Higher | On-par | Lower |
|---|---|---|---|
| Vanilla | 53.85% | 23.08% | 23.08% |
| AdaMixup | 61.54% | 7.69% | 30.77% |
| Mixup | 69.23% | 0.00% | 30.77% |
| Mixup + Excluding MI | 69.23% | 0.00% | 30.77% |

(b) Robust accuracy (%) on 6 selected datasets.

| Methods \ OpenML ID | 446 | 468 | 683 | 755 | 763 | 1413 |
|---|---|---|---|---|---|---|
| Vanilla | 29.67 | 34.55 | 51.11 | 41.05 | 64.27 | 68.00 |
| AdaMixup | 30.33 | 37.27 | 51.11 | 37.89 | 63.20 | 67.11 |
| Mixup | 30.67 | 37.27 | 50.00 | 36.84 | 65.07 | 67.56 |
| Mixup + Excluding MI | 31.67 | 31.82 | 52.22 | 38.95 | 63.20 | 70.67 |
| Mixup + GenLabel (GM) | 37.00 | 42.73 | 52.22 | 43.16 | 61.87 | 71.11 |
| Mixup + GenLabel (NN) | 38.00 | 32.73 | 46.67 | 43.16 | 66.93 | 77.33 |

Performance of GenLabel on selected datasets. Table 4 shows selected datasets on which GenLabel performs better than mixup, e.g., over 8% accuracy gain for dataset IDs 830 and 1413. The best generative model (GM or KDE) for GenLabel varies depending on the dataset. Moreover, GenLabel (CV) successfully chooses the appropriate generative model and outperforms the other baselines on all datasets in Table 4. Interestingly, for dataset IDs 721, 830, 913 and 1413, mixup performs worse than vanilla training, but mixup combined with GenLabel overcomes this limitation and outperforms vanilla training. For the other datasets (IDs 777, 792 and 855), where mixup outperforms vanilla training, GenLabel with an appropriate choice of generative model further improves mixup. Table 4 also shows that GenLabel (CV) performs better than or equal to the other baselines (AdaMixup, generative classifier, mixup with excluding MI) on the selected datasets. Note that the comparison over all datasets is given in Table 3.

6.2. Results on adversarial robustness

GenLabel yields some improvement in robustness compared with the baselines. We tested a black-box attack (Brendel et al., 2018) on 13 balanced OpenML datasets; a black-box attack is used to avoid the gradient obfuscation issue (Athalye et al., 2018) of gradient-based attacks (e.g., FGSM/PGD). Table 5 shows the robust accuracy of GenLabel and the baselines. GenLabel improves the robust accuracy of mixup on more than 69% of the tested datasets. In Table 5(b), GenLabel improves the robustness by 2-10%. Still, we remark that mixup approaches should not be considered a standalone defense, as they are less effective at obtaining robust models than adversarial training algorithms.

6.3. Extension to high-dimensional image datasets

We also tested GenLabel on MNIST, CIFAR-10, CIFAR-100 and Tiny ImageNet-200. GenLabel has a marginal gain. See Section A.2 in the Appendix for the results.

7. Discussions

Regarding the suggested GenLabel, one can consider the following discussion topics: (1) using generative models with implicit/approximate density for GenLabel, and (2) using generative models not only for labeling, but also for mixing.

7.1. Extension to other generative models

Our method can also be applied to generative models that do not have an explicit density. For generative models with an approximate density p̃_c(x), as in VAEs (Kingma et al., 2019), we can replace p̂_c(x) by p̃_c(x) in Algorithm 1. For GANs (Xia et al., 2021), which have an implicit density, we use a proxy to the density p_c(x) by inverting the generator (Creswell & Bharath, 2018). Let M_c = {G(w, c) : w ∈ R^d} be the data manifold generated by the generator G for class c. Assuming the spherical Gaussian noise model used for manifold learning (Hastie & Stuetzle, 1989; Chang & Ghosh, 2001), we can estimate p_c(x) = ∫ p(x | G(w, c)) p(G(w, c)) dw ≈ (1/n) Σ_{i=1}^{n} exp(−d(x, G(w_i, c))) by choosing n random samples of w.
Then, we approximate this summation with the dominant term, expressed as maxi exp( d(x, G(wi, c)) = exp( d(x, Mc)). Thus, we replace pc(x) by exp( d(x, Mc)) in Algorithm 1. 7.2. Using generative models for mixing and labeling In Gen Label, mixed points xmix are obtained by existing mixing strategies, and generative models are used only for re-labeling the mixed points. Can we also use generative models for making better mixed points? We suggest the following. For a target class pair c1 and c2, we first choose a mixing coefficient λ [0, 1], e.g., using a Beta distribution. Then, we find xmix satisfying pc1 pc1+pc2 = λ and label it as ymix = pc1 pc1+pc2 ec1 + pc2 pc1+pc2 ec2, where pc1 = pc1(xmix) and pc2 = pc2(xmix). In Section F in Appendix, we provide this algorithm for finding mixed point xmix using generative models (Gaussian mixture models and GANs) and the experimental results of this algorithm on synthetic/real datasets. Gen Label: Mixup Relabeling using Generative Models 8. Conclusion In this paper, we closely examined the failure scenarios of mixup for low dimensional data, and specify two main issues of mixup: (1) the manifold intrusion reduces both margin and accuracy, and (2) even when there is no manifold intrusion, the linear labeling method harms the margin and accuracy. Motivated by these observations, we proposed Gen Label, a novel way of labeling mixup points by using generative models. We provided concrete examples where Gen Label solves the main issues of mixup and achieves maximum margin. We also provided empirical results showing that Gen Label helps improving the accuracy of mixup on a large number of low-dimensional Open ML datasets. Acknowledgements This work was supported by an American Family Insurance grant via American Family Insurance Data Science Institute at University of Wisconsin-Madison. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. Tensor Flow: Largescale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org. Antoniou, A., Storkey, A., and Edwards, H. Data augmentation generative adversarial networks. ar Xiv preprint ar Xiv:1711.04340, 2017. Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pp. 274 283. PMLR, 2018. Bishop, C. M. Pattern recognition. Machine learning, 128 (9), 2006. Brendel, W., Rauber, J., and Bethge, M. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In International Conference on Learning Representations, 2018. Carratino, L., Cissé, M., Jenatton, R., and Vert, J.-P. On mixup regularization. ar Xiv preprint ar Xiv:2006.06049, 2020. Chang, K.-Y. and Ghosh, J. A unified model for probabilistic principal surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(1):22 41, 2001. Choi, H., Jang, E., and Alemi, A. A. Waic, but why? generative ensembles for robust anomaly detection. ar Xiv preprint ar Xiv:1810.01392, 2018. Cortes, C. 
and Vapnik, V. Support-vector networks. Machine learning, 20(3):273 297, 1995. Creswell, A. and Bharath, A. A. Inverting the generator of a generative adversarial network. IEEE transactions on neural networks and learning systems, 2018. Croce, F. and Hein, M. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning, pp. 2206 2216. PMLR, 2020. Donahue, J., Krähenbühl, P., and Darrell, T. Adversarial feature learning. In International Conference on Learning Representations, 2017. Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml. Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., and Courville, A. Adversarially learned inference. In International Conference on Learning Representations, 2017. Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Tran, B., and Madry, A. Adversarial robustness as a prior for learned representations. ar Xiv preprint ar Xiv:1906.00945, 2019. Feurer, M., van Rijn, J. N., Kadra, A., Gijsbers, P., Mallik, N., Ravi, S., Müller, A., Vanschoren, J., and Hutter, F. Openml-python: an extensible python api for openml. ar Xiv:1911.02490, 2019. Ghojogh, B. and Crowley, M. Linear and quadratic discriminant analysis: Tutorial. ar Xiv preprint ar Xiv:1906.02590, 2019. Ghosh, P., Losalka, A., and Black, M. J. Resisting adversarial attacks using gaussian mixture variational autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 541 548, 2019. Greenewald, K., Gu, A., Yurochkin, M., Solomon, J., and Chien, E. k-mixup regularization for deep learning via optimal transport. ar Xiv preprint ar Xiv:2106.02933, 2021. Guo, H., Mao, Y., and Zhang, R. Mixup as locally linear out-of-manifold regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3714 3722, 2019. Gen Label: Mixup Relabeling using Generative Models Hastie, T. and Stuetzle, W. Principal curves. Journal of the American Statistical Association, 84(406):502 516, 1989. Hendrycks, D., Mu, N., Cubuk, E. D., Zoph, B., Gilmer, J., and Lakshminarayanan, B. Augmix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations, 2020. Hwang, S.-H. and Whang, S. E. Mix RL: Data mixing augmentation for regression using reinforcement learning. ar Xiv preprint ar Xiv:2106.03374, 2021. Ilyas, A., Jalal, A., Asteri, E., Daskalakis, C., and Dimakis, A. G. The robust manifold defense: Adversarial training using generative models. ar Xiv preprint ar Xiv:1712.09196, 2017. Inoue, H. Data augmentation by pairing samples for images classification. ar Xiv preprint ar Xiv:1801.02929, 2018. Ju, A. and Wagner, D. E-abs: Extending the analysis-bysynthesis robust classification model to more complex image domains. In Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security, pp. 25 36, 2020. Kim, J.-H., Choo, W., and Song, H. O. Puzzle mix: Exploiting saliency and local statistics for optimal mixup. In International Conference on Machine Learning, pp. 5275 5285. PMLR, 2020. Kim, J.-H., Choo, W., Jeong, H., and Song, H. O. Co-mixup: Saliency guided joint mixup with supermodular diversity. In International Conference on Learning Representations, 2021. Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, 2018. Kingma, D. P., Welling, M., et al. 
An introduction to variational autoencoders. Foundations and Trends in Machine Learning, 12(4):307 392, 2019. Kingma, D. P., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. In Advances in Neural Information Processing Systems, 2021. Lee, K., Lee, K., Lee, H., and Shin, J. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, 2018. Li, Y., Bradshaw, J., and Sharma, Y. Are generative classifiers more robust to adversarial attacks? In International Conference on Machine Learning, pp. 3804 3814. PMLR, 2019. Liu, Z., Li, S., Wu, D., Chen, Z., Wu, L., Guo, J., and Li, S. Z. Unveiling the power of mixup for stronger classifiers. ar Xiv preprint ar Xiv:2103.13027, 2021. Ng, A. Y. and Jordan, M. I. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in Neural Information Processing Systems, pp. 841 848, 2002. Park, J., Yang, J. Y., Shin, J., Hwang, S. J., and Yang, E. Saliency grafting: Innocuous attribution-guided mixup with calibrated label mixing. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., De Vito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Perez, L. and Wang, J. The effectiveness of data augmentation in image classification using deep learning. ar Xiv preprint ar Xiv:1712.04621, 2017. Samangouei, P., Kabkab, M., and Chellappa, R. Defensegan: Protecting classifiers against adversarial attacks using generative models. In International Conference on Learning Representations, 2018. Schott, L., Rauber, J., Bethge, M., and Brendel, W. Towards the first adversarially robust neural network model on mnist. ar Xiv preprint ar Xiv:1805.09190, 2018. Serrà, J., Álvarez, D., Gómez, V., Slizovskaia, O., Núñez, J. F., and Luque, J. Input complexity and out-ofdistribution detection with likelihood-based generative models. In International Conference on Learning Representations, 2020. Shimada, T., Yamaguchi, S., Hayashi, K., and Kobayashi, S. Data interpolating prediction: Alternative interpretation of mixup. ar Xiv preprint ar Xiv:1906.08412, 2019. Song, Y., Kim, T., Nowozin, S., Ermon, S., and Kushman, N. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. In International Conference on Learning Representations, 2018. Soudry, D., Hoffer, E., Nacson, M. S., and Srebro, N. The implicit bias of gradient descent on separable data. In International Conference on Learning Representations, 2018. Gen Label: Mixup Relabeling using Generative Models Tanaka, F. H. K. d. S. and Aranha, C. Data augmentation using gans. ar Xiv preprint ar Xiv:1904.09135, 2019. Tokozume, Y., Ushiku, Y., and Harada, T. Learning from between-class examples for deep sound recognition. In International Conference on Learning Representations, 2018. Uddin, A., Monira, M., Shin, W., Chung, T., Bae, S.-H., et al. Saliencymix: A saliency guided data augmentation strategy for better regularization. In International Conference on Learning Representations, 2021. 
Vanschoren, J., van Rijn, J. N., Bischl, B., and Torgo, L. Openml: Networked science in machine learning. SIGKDD Explorations, 15(2):49 60, 2013. doi:10.1145/2641190.2641198. URL http://doi. acm.org/10.1145/2641190.2641198. Verma, V., Lamb, A., Beckham, C., Najafi, A., Mitliagkas, I., Lopez-Paz, D., and Bengio, Y. Manifold mixup: Better representations by interpolating hidden states. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 6438 6447, 2019. Xia, W., Zhang, Y., Yang, Y., Xue, J.-H., Zhou, B., and Yang, M.-H. GAN inversion: A survey. ar Xiv preprint ar Xiv:2101.05278, 2021. Xiao, C., Li, B., Zhu, J. Y., He, W., Liu, M., and Song, D. Generating adversarial examples with adversarial networks. In 27th International Joint Conference on Artificial Intelligence, IJCAI 2018, pp. 3905 3911. International Joint Conferences on Artificial Intelligence, 2018. Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6023 6032, 2019. Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018. Zhang, L., Deng, Z., Kawaguchi, K., Ghorbani, A., and Zou, J. How does mixup help with robustness and generalization? In International Conference on Learning Representations, 2021. Gen Label: Mixup Relabeling using Generative Models Algorithm 2 Gen Label (using generative models for the latent feature) Input Dataset S = {(xi, yi)}n i=1, input feature set X = {xi}n i=1, learning rate η, loss ratio γ Output Trained discriminative model fθ,φ 1: θ, φ Vanilla-trained parameter for model fθ,φ = f cls θ f feature φ 2: pc(z) Density estimated by generative model for latent feature z f feature φ (X), conditioned on class c [k] 3: for (xi, yi), (xj, yj) S do 4: (xmix, ymix) (λxi + (1 λ)xj, λyi + (1 λ)yj) 5: zmix f feature φ (xmix) 6: ygen Pk c=1 pc(zmix) Pk c =1 pc (zmix)ec 7: θ θ η θ{γ ℓCE(ygen, f cls θ (zmix)) + (1 γ) ℓCE(ymix, f cls θ (zmix))} 8: end for A. Additional algorithms and experimental results A.1. Gen Label in the latent feature space Here we illustrate how Gen Label provided in Algorithm 1 can be extended to the case of learning generative models in the latent feature space. The procedure is in three steps. First, we use existing training algorithm (vanilla training or mixup training) to learn a classifier fθ,φ = f cls θ f feature φ where f feature φ is the feature extractor part and f cls θ is the classification part. Second, given the trained feature extractor f feature φ , we train a class-conditional generative model for the latent feature z = f feature φ (x), and denote the learned density for class c by pc(z). Third, we freeze f feature φ and fine-tune f cls θ using the following manner. We first follow the mixup process: for (xi, yi), (xj, yj) S, we generate the augmented data xmix = λxi + (1 λ)xj having label ymix = λyi + (1 λ)yj. Then, we re-label this augmented data by ygen = Pk c=1 pc(zmix) Pk c =1 pc (zmix)ec, where zmix = f feature φ (xmix). Finally, we train f cls θ using (zmix, ygen), the hidden representation of mixed sample and the suggested label for the mixed feature. The pseudocode of this Gen Label variant (for the latent feature) is given in Algorithm 2. A.2. 
Gen Label on high dimensional image datasets In the main manuscript, we focused on the result for low-dimensional datasets. Here, we provide our experimental results on high dimensional image datasets, including MNIST, CIFAR-10, CIFAR-100, and Tiny Image Net-200. We test Gen Label combined with existing data augmentation schemes of mixup (Zhang et al., 2018) and manifoldmixup (Verma et al., 2019). We compare our schemes with mixup and manifold-mixup. We also compare with Ada Mixup, which avoids the manifold intrusion of mixup, similar to our work. For measuring the adversarial robustness, we test under Auto Attack (Croce & Hein, 2020), which is developed to overcome gradient obfuscation (Athalye et al., 2018), containing four white/black-box attack schemes (including auto-PGD) that does not need any specification of free parameters. The attack radius ε for each dataset is specified in Section E. Gen Label variant used for image datasets For image datasets, we learn generative models in the latent space. To be specific, we use a variant of Gen Label, which learns the generative model (Gaussian mixture model) and the discriminative model at the same time. The pseudocode of this variant is given in Algorithm 3, and below we explain the details of this algorithm. Consider a neural network fθ = f cls θ f feature θ parameterized by θ, which is composed of the feature extractor part f feature θ and the classification part f cls θ . We train a Gaussian mixture (GM) model on f feature θ (x), the hidden representation of input x. In this algorithm, we consider updating the estimated GM model parameters (mean and covariance) at each batch training. At each iteration t, we randomly choose B batch samples {(xi, yi)}B i=1 from the dataset D. Then, we estimate the class-conditional mean and covariance of GM model in the hidden feature space. In other words, we compute the mean µ(t) c = 1 |Sc| P i Sc f feature θ (xi) and the covariance Σ(t) c = 1 |Sc| P i Sc(f feature θ (xi) µ(t) c )(f feature θ (xi) µ(t) c )T for each class c [k], where Sc = {i : yi = ec} is the set of samples with label c within the batch. For simplicity, we approximate the covariance matrix as a multiple of identity matrix, by setting Σ(t) c 1 dtrace(Σ(t) c )Id. Here, we consider making use of the parameters (mean and covariance) estimated in the previous batches as well, by introducing a memory ratio factor β [0, 1]. 
Formally, the rule for updating mean µ(t) c and covariance Σ(t) c are represented as µ(t) c (1 β)µ(t) c + βµ(t 1) c Gen Label: Mixup Relabeling using Generative Models Algorithm 3 Gen Label (learning generative/discriminative models at the same time) Input Data D, mix function mix( ), learning rate η, loss ratio γ, memory ratio β, batch size B, max iteration T Output Trained model fθ = f cls θ f feature θ θ Random initial model parameter, π Permutation of [B] (µ(0) c , Σ(0) c ) (0, Id) for c [k] for iteration t = 1, 2, , T do {(xi, yi)}B i=1 Randomly chosen batch samples in D for class c [k] do Sc {i : yi = ec} µ(t) c 1 |Sc| P i Sc f feature θ (xi), µ(t) c (1 β)µ(t) c + βµ(t 1) c Σ(t) c 1 |Sc| P i Sc(f feature θ (xi) µ(t) c )(f feature θ (xi) µ(t) c )T dtrace(Σ(t) c )Id, Σ(t) c (1 β)Σ(t) c + βΣ(t 1) c end for for sample index i [B] do (xmix i , ymix i ) mix((xi, yi), (xπ(i), yπ(i))) pc det(Σ(t) c ) 1/2 exp{ (f feature θ (xmix i ) µ(t) c )T (Σ(t) c ) 1(f feature θ (xmix i ) µ(t) c )} for c [k] c1 arg minc [k] pc, c2 arg minc [k]\{c1} pc ygen i pc1 pc1 +pc2 ec1 + pc2 pc1 +pc2 ec2 end for θ θ η P i [B] θ{γ ℓCE(ygen i , fθ(xmix i )) + (1 γ) ℓCE(ymix i , fθ(xmix i ))} end for and Σ(t) c (1 β)Σ(t) c + βΣ(t 1) c . When β = 0, it reduces to the memoryless estimation. To avoid cluttered notation, we discard t unless necessary. In the second stage, we apply conventional mixup-based data augmentation. We first permute the batch data {(xi, yi)}B i=1 and obtain {(xπ(i), yπ(i))}B i=1. Afterwards, for each i [B], we select the data pair, (xi, yi) and (xπ(i), yπ(i)), and apply a mixup-based data augmentation scheme denoted by mix( ), to generate mixed point xmix i labeled by ymix i . One can use any data augmentation as mix( ), e.g., mixup, manifold-mixup. For example, for vanilla mixup, we have xmix i = λxi + (1 λ)xπ(i), ymix i = λyi + (1 λ)yπ(i) where λ Beta(α, α) for some α > 0. In the third stage, we re-label this augmented data based on the estimated GM model parameters. To be specific, we compute the likelihood of the mixed data sampled from class c, denoted by pc = det(Σc) 1/2 exp{ (f feature θ (xmix i ) µc)T Σ 1 c (f feature θ (xmix i ) µc)}. Then, we sort k classes in a descending order of pc, and select the top-2 classes c1 and c2 satisfying pc1 pc2 pc for c [k]\{c1, c2}. Then, we label the mixed point xmix i as ygen i = pc1 pc1 + pc2 ec1 + pc2 pc1 + pc2 ec2. Since our generative model is an imperfect estimate on the data distribution, ygen i may be incorrect for some samples. Thus, we can use a combination of vanilla labeling and the suggested labeling, i.e., define the label of mixed point as γygen i + (1 γ)ymix i for some γ [0, 1]. Note that our scheme reduces to the vanilla labeling scheme when γ = 0. Using the augmented data with updated label, the algorithm trains the classification model fθ : Rn [0, 1]k that predicts the label y = [y1, , yk] of the input data, using the cross-entropy loss ℓCE( ). Results Table 6 shows the summary of results. Here, we tried two different validation schemes: one is to choose the best robust model against Auto Attack, and the other is to select the model with the highest clean accuracy. For each scheme X {mixup, manifold-mixup}, it is shown that X + Gen Label has a minor improvement on both clean and robust accuracies, for all image datasets. It is also shown that the suggested Gen Label achieves a higher clean accuracy than Ada Mixup, which requires 3x higher computational complexity than our method. 
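To make the per-batch procedure above concrete, the following is a minimal PyTorch-style sketch of one training step of Algorithm 3. It is an illustration only, not the implementation used in our experiments: the names (feature_net, cls_head, mu, var, beta, gamma) are placeholders, the covariance is kept isotropic as described above, and the class means/variances are updated with the memory ratio β. Since the cross-entropy is linear in the target, mixing the two labels with ratio γ is equivalent to mixing the two loss terms.

```python
import torch
import torch.nn.functional as F

def genlabel_step(feature_net, cls_head, x, y_onehot, lam, mu, var,
                  beta=0.9, gamma=0.1):
    """One mixup + Gen Label batch with a running Gaussian-mixture estimate.

    mu:  (k, d) running class means in the latent feature space
    var: (k,)   running isotropic class variances (trace(Sigma_c)/d)
    """
    k = y_onehot.shape[1]
    with torch.no_grad():
        feats = feature_net(x)                        # (B, d) latent features
        # Update the class-conditional Gaussian estimates (memory ratio beta).
        for c in range(k):
            idx = y_onehot[:, c].bool()
            if idx.any():
                f_c = feats[idx]
                mu_new = f_c.mean(dim=0)
                var_new = ((f_c - mu_new) ** 2).mean()    # trace(Sigma_c)/d
                mu[c] = (1 - beta) * mu_new + beta * mu[c]
                var[c] = (1 - beta) * var_new + beta * var[c]
    # Vanilla mixup in the input space.
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    z_mix = feature_net(x_mix)
    # Gen Label: relabel with the two most likely classes under the Gaussians.
    with torch.no_grad():
        d = z_mix.shape[1]
        log_p = (-0.5 * d * torch.log(var)[None, :]
                 - 0.5 * ((z_mix[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
                 / var[None, :])
        top2 = log_p.topk(2, dim=1)
        w = F.softmax(top2.values, dim=1)             # = p_c1/(p_c1+p_c2), etc.
        y_gen = torch.zeros_like(y_mix).scatter_(1, top2.indices, w)
    # Combined label: gamma * Gen Label + (1 - gamma) * vanilla mixup label.
    logp = F.log_softmax(cls_head(z_mix), dim=1)
    loss = -((gamma * y_gen + (1 - gamma) * y_mix) * logp).sum(dim=1).mean()
    return loss, mu, var
```

As the sketch suggests, the relabeling only adds a per-batch mean/variance update and a k-way Gaussian log-density evaluation on top of standard mixup training, which keeps its overhead small.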
Table 6: Clean and robust accuracies on real image datasets. We consider two types of validation: selecting the best robust model based on Auto Attack, or the best model based on the clean accuracy. Here, Mixup+Gen Label indicates that we apply the suggested labeling method to the augmented points generated by mixup. Gen Label helps mixup-based data augmentations in terms of both robust accuracy and clean accuracy.

Methods | MNIST (Robust / Clean) | CIFAR-10 (Robust / Clean) | CIFAR-100 (Robust / Clean) | Tiny Image Net-200 (Robust / Clean)
Vanilla | 48.17 ± 13.1 / 99.34 ± 0.03 | 16.89 ± 0.98 / 94.57 ± 0.25 | 17.19 ± 0.20 / 74.48 ± 0.28 | 13.19 ± 0.19 / 58.13 ± 0.09
Ada Mixup | - / 99.32 ± 0.05 | - / 95.45 ± 0.13 | - / - | - / -
Mixup | 55.44 ± 1.80 / 99.27 ± 0.03 | 11.65 ± 1.96 / 95.68 ± 0.06 | 18.44 ± 0.45 / 77.65 ± 0.30 | 14.91 ± 0.48 / 59.46 ± 0.30
Mixup+Gen Label | 56.54 ± 1.03 / 99.36 ± 0.06 | 14.32 ± 1.23 / 96.09 ± 0.01 | 19.58 ± 0.71 / 78.04 ± 0.21 | 15.34 ± 0.30 / 59.78 ± 0.09
Manifold mixup | 55.56 ± 1.53 / 99.32 ± 0.04 | 18.14 ± 1.88 / 94.78 ± 0.49 | 19.25 ± 0.61 / 78.61 ± 0.17 | 14.78 ± 0.28 / 59.87 ± 0.63
Manifold mixup+Gen Label | 56.62 ± 1.31 / 99.37 ± 0.07 | 18.91 ± 1.26 / 95.10 ± 0.10 | 19.28 ± 1.04 / 78.99 ± 0.54 | 15.19 ± 0.22 / 60.02 ± 0.25

Table 7: Robust accuracy (%) under a decision-based black-box attack (Brendel et al., 2018) and clean accuracy (%), on 6 selected datasets. The robust accuracy is the same as that reported in Table 5, and we add the clean accuracy in parentheses for comparison. One can confirm the large gap between robust and clean accuracy, showing the effectiveness of the black-box attack used in the paper.

Methods \ Open ML ID | 446 | 468 | 683 | 755 | 763 | 1413
Vanilla | 29.67 (99.33) | 34.55 (84.55) | 51.11 (81.11) | 41.05 (67.37) | 64.27 (85.33) | 68.00 (92.00)
Ada Mixup | 30.33 (100.00) | 37.27 (89.09) | 51.11 (81.11) | 37.89 (71.58) | 63.20 (85.87) | 67.11 (92.00)
Mixup | 30.67 (100.00) | 37.27 (84.55) | 50.00 (81.11) | 36.84 (70.53) | 65.07 (86.13) | 67.56 (88.89)
Mixup + Excluding MI | 31.67 (100.00) | 31.82 (89.09) | 52.22 (83.33) | 38.95 (71.58) | 63.20 (85.33) | 70.67 (92.89)
Mixup + Gen Label (GM) | 37.00 (100.00) | 42.73 (82.73) | 52.22 (80.00) | 43.16 (73.68) | 61.87 (85.33) | 71.11 (93.78)
Mixup + Gen Label (NN) | 38.00 (100.00) | 32.73 (81.82) | 46.67 (82.22) | 43.16 (69.47) | 66.93 (85.33) | 77.33 (96.44)

A.3. Supplementary for robustness experiments

In Table 5, we showed the robust accuracies on selected datasets. Table 7 shows the full version, including the robust accuracy as well as the clean accuracy. This demonstrates the effectiveness of the attack scheme used in this paper.

A.4. Visualization of Gen Label

For the Four-circle dataset and the 9-class Gaussian data, we visualize the labels of mixup samples and the decision boundaries for Gen Label, mixup, and vanilla training in Fig. 4. This is a full version of Fig. 7.

A.5. Gen Label on Cut Mix (Yun et al., 2019)

In the main manuscript, we reported the effect of Gen Label on mixup only, but Gen Label can also be applied to other mixup variants. A natural question is whether Gen Label is effective for these variants as well. Note that the latest mixup variants (Puzzle Mix, Co-Mixup, Saliency Mix) are mostly tailored to image datasets, while we focus on tabular datasets. Thus, we consider Cut Mix, the only method (among the variants listed above) that is applicable to tabular datasets. Table 8 compares the performance of mixup and mixup+Gen Label, as well as that of Cut Mix and Cut Mix+Gen Label. Our results show that Cut Mix+Gen Label achieves a gain over Cut Mix in terms of accuracy.

Table 8: Test accuracies (%) of baselines and Gen Label.
For both mixup and Cut Mix, using Gen Label improves the accuracy.

Dataset / Scheme | Vanilla | Mixup | Mixup+Gen Label | Cut Mix | Cut Mix+Gen Label
Iris | 95.56 | 88.00 | 96.44 (+8.44) | 94.67 | 95.11 (+0.44)
Wine | 92.96 | 94.81 | 96.67 (+1.86) | 96.30 | 97.40 (+1.10)
Kidney | 62.61 | 65.22 | 71.30 (+6.08) | 63.48 | 69.57 (+6.09)

A.6. Effect of generative models on the performance of Gen Label

Here we empirically show how the quality of the generative model affects the performance of Gen Label. Table 9 shows the test accuracy of vanilla/mixup/Gen Label, as well as the train accuracy of Gen Label using a Gaussian mixture (GM) model. One can see that Gen Label fails when the train accuracy of the GM classifier is low, i.e., when the GM model cannot fit the data.

Figure 7: Comparison of vanilla training, mixup, and the suggested Gen Label. Top: Four-circle dataset, Bottom: 9-class Gaussian data. For the purpose of illustration, we set the mixing coefficient as λ = 0.6 for the 9-class Gaussian data. (a): training data points. (b): mixup points and their top-1 labels. (c): mixup points and their top-1 labels under the suggested Gen Label ygen. Panels (b) and (c) show that the conventional labeling ymix causes label conflicts for a large number of mixup points, while the suggested label ygen does not suffer from this issue. (d), (e), (f), (g): decision boundaries of vanilla training, mixup, Gen Label, and kernel SVM. In the decision-boundary plots, several classes/samples have small margins under vanilla training and mixup, while the suggested Gen Label does not have this issue and approaches the ideal margin obtained by the kernel SVM.

Table 9: Test accuracies (%) of three schemes (vanilla, mixup, mixup+Gen Label) and train accuracy of the Gaussian mixture (GM) classifier, for four datasets. For the Iris and Wine datasets, where GM fits the data well (i.e., the train accuracy of the GM classifier is high), Gen Label attains high test accuracy. On the other hand, for the Abalone and Glass datasets, which have low GM-classifier train accuracy, Gen Label does not improve the performance of mixup.

Dataset | Vanilla | Mixup | Mixup+Gen Label (GM) | Train acc of GM classifier
Iris | 95.56 | 88.00 | 96.44 (+8.44) | 98.00
Wine | 92.96 | 94.81 | 96.67 (+1.86) | 100.00
Abalone | 63.01 | 62.42 | 62.39 (-0.03) | 62.00
Glass | 60.00 | 64.92 | 60.31 (-4.61) | 64.00

B. Additional mathematical results

B.1. Formal statement on the adversarial robustness of Gen Label

Here we analyze the adversarial robustness of a model trained by mixup+Gen Label, and show that Gen Label is beneficial for improving the robustness of mixup under logistic regression models and fully-connected (FC) Re LU networks. We first describe the basic setting considered in our analysis, and then provide the results. All proofs are given in Section C of the Appendix.

Basic setting and notations. Consider a d-dimensional Gaussian dataset defined as (x|y = 0) ~ N(-e1, Σ/σ1²) and (x|y = 1) ~ N(e1, Σ/σ2²), where Σij = 1 for i = j and Σij = τ for i ≠ j. Here we assume that -1 < τ < 1 with τ ∉ {-1/(d-1), -1/(d-2)}, and σ2 = cσ1 with 2 - √3 < c < 2 + √3. We consider the loss function ℓ(θ, (x, y)) = h(fθ(x)) - y·fθ(x), where fθ(x) is the prediction of a model parameterized by θ for a given input x, and h(w) = log(1 + exp(w)).
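As a quick sanity check on this loss (an illustrative snippet, not part of our experiments), note that ℓ(θ, (x, y)) = h(fθ(x)) - y·fθ(x) is exactly the logistic (binary cross-entropy) loss applied to the logit f = fθ(x):

```python
import numpy as np

def paper_loss(f, y):
    # l = h(f) - y*f with h(w) = log(1 + exp(w))
    return np.log1p(np.exp(f)) - y * f

def bce_with_logits(f, y):
    p = 1.0 / (1.0 + np.exp(-f))          # sigmoid(f)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

f = np.linspace(-4.0, 4.0, 9)
for y in (0.0, 1.0):
    assert np.allclose(paper_loss(f, y), bce_with_logits(f, y))
```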
We assume the following labeling rule: when we mix (xi, yi) and (xj, yj) to generate the mixed point xmix_ij = λxi + (1 - λ)xj, we label it as ymix_ij = y if yi = yj = y, and we label it as ygen_ij in (1) otherwise. We assume the mixing coefficient follows the uniform distribution λ ~ Unif[0, 1], i.e., λ ~ Beta(α, α) with α = 1.

For a given model parameter θ and dataset S, we define the following losses. The standard loss is Lstd_n(θ, S) = (1/n) Σ_{i=1}^n ℓ(θ, zi). The mixup loss and the Gen Label loss are Lmix_n(θ, S) = (1/n²) Σ_{i,j=1}^n E_λ[ℓ(θ, zmix_ij)] and Lgen_n(θ, S) = (1/n²) Σ_{i,j=1}^n E_λ[ℓ(θ, zgen_ij)], respectively, where zmix_ij = (xmix_ij, ymix_ij) and zgen_ij = (xmix_ij, ygen_ij). The adversarial loss with an L2 attack of radius ε√d is Ladv_n(θ, S) = (1/n) Σ_{i=1}^n max_{||δi||2 ≤ ε√d} ℓ(θ, (xi + δi, yi)).

Mathematical results. Before stating our result, we denote the second-order Taylor approximation of the mixup loss by L̃mix_n(θ, S), whose expression is given in Lemma 8 in the Appendix. Similarly, the Taylor approximation of each term in the adversarial loss is denoted by ℓ̃adv(ε√d, (xi, yi)), which is given in Lemma 9 in the Appendix. Finally, the approximation of the Gen Label loss, denoted by L̃gen_n(θ, S), is given in Lemma 1 in the Appendix.

In the theorem below, we state the relationship between the Taylor approximations of the mixup loss, the Gen Label loss, and the adversarial loss for logistic regression models. In this theorem, we consider the set of model parameters Θ := {θ ∈ R^d : (2yi - 1)fθ(xi) ≥ 0 for all i = 1, 2, ..., n}, which contains the set of all θ with zero training error.

Theorem 1. Consider the logistic regression setting with fθ(x) = θᵀx. Suppose there exists a constant cx > 0 such that ||xi||2 ≤ cx for all i ∈ {1, 2, ..., n}. Then, in the asymptotic regime of large σ1, for any θ ∈ Θ, we have L̃mix_n(θ, S) > L̃gen_n(θ, S) ≥ (1/n) Σ_{i=1}^n ℓ̃adv(δgen, (xi, yi)). Here, δgen = R·cx·A^i_{σ1,c,τ,d} with R = min_{i∈{1,...,n}} |cos(θ, xi)|, where A^i_{σ1,c,τ,d} is defined in (24) in the Appendix.

Figure 8: Comparison between the mixup loss, Gen Label loss, and adversarial loss for the logistic regression model θ = (10 cos φ, 10 sin φ) on a two-dimensional Gaussian dataset, plotted against the angle φ of the model (in units of π radians). This plot coincides with the result in Theorem 1.

Fig. 8 compares the second-order Taylor approximation of the mixup loss L̃mix_n(θ, S), the Gen Label loss L̃gen_n(θ, S), and the adversarial loss (1/n) Σ_{i=1}^n ℓ̃adv(δgen, (xi, yi)), for the logistic regression model θ = (10 cos φ, 10 sin φ) parameterized by the angle φ. Here, we use the dataset S = {(x+_i, +1), (x-_i, -1)}_{i=1}^{20}, where the samples of class +1 and class -1 follow the distributions x+_i ~ N([+1, 0], (1/100)I2) and x-_i ~ N([-1, 0], (1/100)I2), respectively. One can confirm that the model θ = (10, 0), which corresponds to φ = 0, has the smallest mixup/Gen Label/adversarial loss. At every angle φ in the plotted range, the Gen Label loss is strictly smaller than the mixup loss, which coincides with the result of Theorem 1.

We can also extend the result of Theorem 1 to fully-connected Re LU networks, as below.

Theorem 2. Consider a fully-connected Re LU network fθ(x) = βᵀσ(W_{N-1} ··· W_2 σ(W_1 x)), where σ is the activation function and the parameters θ consist of the matrices Wi and the vector β. Suppose there exists a constant cx > 0 such that ||xi||2 ≤ cx for all i ∈ {1, 2, ..., n}. Then, in the asymptotic regime of large σ1, for any θ ∈ Θ, we have L̃mix_n(θ, S) > L̃gen_n(θ, S) ≥ (1/n) Σ_{i=1}^n ℓ̃adv(δgen, (xi, yi)).
Here, ℓadv(δ, (x, y)) = ℓ(θ, (x, y)) + δ |g (fθ(x)) y| fθ(x) 2 + δ2d 2 |h (fθ(x))| fθ(x) 2 2 is the Taylor approximation of adversarial loss for Re LU network, and we have δgen = Rcx Ai σ1,c,τ,d and R = mini {1, ,n} | cos( fθ(xi), xi)|, where Ai σ1,c,τ,d is in (24) and g(x) = ex/(1 + ex). B.2. Approximation of Gen Label loss Below we state the approximation of Gen Label loss Lgen n (θ, S). The proof of this lemma is in Section C.6. Lemma 1. The second order Taylor approximation of the Gen Label loss is given by Lgen n (θ, S) = Lstd n (θ, S) + Rgen 1 (θ, S) + Rgen 2 (θ, S) + Rgen 3 (θ, S), Rgen 1 (θ, S) = 1 i=1 Aσ1,c,τ,d(h (fθ(xi)) yi) fθ(xi)T Erx DX[rx xi], Rgen 2 (θ, S) = 1 i=1 Bσ1,c,τ,dh (fθ(xi)) fθ(xi)T Erx DX[(rx xi)(rx xi)T ] fθ(xi), Rgen 3 (θ, S) = 1 i=1 Bσ1,c,τ,d(h (fθ(xi)) yi)Erx DX[(rx xi)T 2fθ(xi)(rx xi)], where Aσ1,c,τ,d and Bσ1,c,τ,d are two constants defined in (24). When σ1 , we have limσ1 Aσ1,c,τ,d = c2+1 2(c+1)2 < 1 3, limσ1 Bσ1,c,τ,d = c2 c+1 3(c+1)2 < 1 Gen Label: Mixup Relabeling using Generative Models θ = (r cos ϕ, r sin ϕ) (1,0) ( 1,0) (0,1) n points overlapped ϕ x1 = (1,0) x3 = ( 1,0) Figure 9: Left: the dataset S used in Example 3. The feature-label pairs are defined as (x1, y1) = ([1, 0], +1), (x2, y2) = ([0, 1], +1), and (xi, yi) = ([ 1, 0], 1) for i = 3, 4, , n + 2. Right: the line segments connecting training data points. C. Proof of mathematical results C.1. Proof for Example 1 We start with showing θ mixup = 7 16. First, we the prediction of the classifier can be represented as ( 1 2θ|x|, if 0 |x| 2θ, 1, if 2θ |x| 1. Since we have three data points, we have 3 2 = 3 different way of mixing the data points: (1) mixing x1 and x2, (2) mixing x2 and x3, (3) mixing x3 and x1. We denote the loss value of i-th mix pair as Li. We first compute L2, the loss of mixing x2 and x3. The mixed point is xmix = λx3 + (1 λ)x2 = λ, which has label ymix = λe1 + (1 λ)e2 for λ [0, 1]. Then, ymix fθ(xmix) 1 fθ(xmix) 2 dλ = 2 Z 1 0 (λ fθ(xmix))2dλ = 2 Z 2θ 2θ)2dλ + 2 Z 1 2θ (λ 1)2dλ Since fθ(x) is symmetric, we have L1 = L2. Now, we compute L3. The mixed point is represented as xmix = λx3 + (1 λ)x1 = 2λ 1, which is labeled as ymix = λe1 + (1 λ)e1 = e1, for λ [0, 1]. Then, ymix fθ(xmix) 1 fθ(xmix) 0 (1 fθ(xmix))2dλ 2θ|2λ 1|)2dλ = 2 Z 1 2 +θ 2θ(2λ 1))2dλ = 2 Thus, d dθ( 3 2(L1 + L2 + L3)) = d dθ(8θ2 7θ + 2) = 0 when θ = 7 16. This completes the proof of θ mixup = 7 16. Finally, θ mixup-without-MI = 1 2 is trivial from the fact that d dθ( 3 4(L1 + L2)) = d dθ(2θ 1)2 = 0 when θ = 1 C.2. Proof for Example 3 Consider the problem of classifying n + 2 data points S = {(xi, yi)}n+2 i=1 , where the feature xi R2 and the label yi {+1, 1} of each point is specified in Fig. 9. We use the one-hot label yi = [1, 0] for class +1 and yi = [0, 1] for class 1. Consider applying logistic regression to this problem, where the solution is represented as θ = [r cos φ, r sin φ]. Here we compare three different schemes: (1) vanilla training, (2) mixup, and (3) mixup with Gen Label (dubbed as new-mixup). The first scheme is nothing but training only using the given training data S. Both mixup and new-mixup generate mixed points using linear combination of data points, i.e., xij = λxi + (1 λ)xj for some λ Beta(α, α), while the labeling method is different. The original mixup uses yij = λyi + (1 λ)yj, whereas the new-mixup uses yij = ρyi + (1 ρ)yj where ρ = 1 1+exp{ (λ 1/2)/σ2} for some small σ , assuming the class +1 is modeled as Gaussian mixture. 
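For intuition, the two labeling rules in this example can be compared numerically. The snippet below is purely illustrative (the value of σ is a placeholder): the linear mixup weight on yi is λ itself, whereas the new-mixup weight ρ sharpens toward a hard 0/1 assignment at λ = 1/2 as σ → 0.

```python
import numpy as np

def rho(lam, sigma=0.2):
    # Weight on y_i; approaches a step function at lam = 1/2 as sigma -> 0.
    return 1.0 / (1.0 + np.exp(-(lam - 0.5) / sigma ** 2))

lam = np.linspace(0.0, 1.0, 11)
print(np.round(lam, 3))        # linear mixup weight on y_i
print(np.round(rho(lam), 3))   # new-mixup (Gen Label) weight on y_i
```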
We analyze the solutions of these schemes, denoted by θvanilla, θmixup and θnew-mixup, and compare it with the L2 max-margin classifier obtained from support vector machine (SVM), represented as θsvm = (cos π 4 ). Here, we denote the angle of SVM solution by φsvm = π/4. Below we first analyze the loss of vanilla training, and then provide analysis on the loss of the mixup scheme (using either original linear labeling or the suggested Gen Label). Gen Label: Mixup Relabeling using Generative Models Vanilla training Consider the vanilla training which learns θ (or the corresponding φ) by only using the given data. In this case, the sum of logistic loss over all samples can be represented as i=1 log(1 + e yiθT xi), where the exponential term for each data is y1θT x1 = (r cos φ, r sin φ)T (1, 0) = r cos φ, y2θT x2 = (r cos φ, r sin φ)T (0, 1) = r sin φ, yiθT xi = (r cos φ, r sin φ)T ( 1, 0) = r cos φ, i = 3, 4, , n + 2 Then, the loss of vanilla training is ℓvanilla = (n + 1) log(1 + e r cos φ) + log(1 + e r sin φ). (2) The derivative of the loss with respect to φ is given as dφℓvanilla = (n + 1)r sin φ exp{ r cos φ} 1 + exp{ r cos φ} + r cos φ exp{ r sin φ} 1 + exp{ r sin φ} . By plugging in φsvm = π/4 in this expression, we have R(φ = φsvm) = nr 2} 1 + exp{ r/ meaning that vanilla training cannot achieve the max-margin classifier θsvm for a fixed r > 0. Note that R(φ = φsvm) 0 holds when θ = r , i.e., the vanilla gradient descent training achieves the SVM solution. This coincides with the result of (Soudry et al., 2018) which showed that for linearly separable data, the model parameter w(t) updated by gradient descent satisfies both limt θ(t) = and limt θ(t)/ θ(t) = θsvm. Mixup Now we analyze the case of mixup + Gen Label (or new-mixup). Here we briefly recap how the suggested data augmentation works. Basically, following the vanilla mixup scheme, we randomly sample data points xi and xj, and generate augmented data xij = λxi + (1 λ)xj where λ Beta(α, α) for some α > 0. Then, we label this augmented data as yij = ρyi + (1 ρ)yj where ρ = λ for vanilla mixup with linear labeling, and ρ = 1 1+exp{ (λ 1/2)/σ2} for new labeling, where σ is a small positive number. Since there are total n + 2 points in the training set, we have (n + 2)2 pairs of xi, xj X. The sum of loss values of all pairs can be represented as ℓmixup = 2n Z L1 L2 ℓ(yij, ˆyij) + n2ℓ(y3, ˆy3) + 2 Z L3 ℓ(yij, ˆyij) + ℓ(y1, ˆy1) + ℓ(y2, ˆy2) (3) where the line segments L1, L2, L3 are illustrated in Fig. 9. Note that each line segment can be represented as the set of following (xij, yij) pairs for λ [0, 1]: L1 : xij = ( λ, 1 λ), yij = [1 ρ, ρ] L2 : xij = (2λ 1, 0), yij = [ρ, 1 ρ] L3 : xij = (1 λ, λ), yij = [1, 0] Recall that for a given random data x, the label estimated by logistic regression model θ is represented as ˆy = [ˆy(0), ˆy(1)] = [ 1 1+exp( θT x), 1 1+exp(+θT x)]. 
If this sample has true one-hot encoded label y = [y(0), y(1)], then the logistic loss of this model (regarding the specific sample (x, y)) is given as ℓ(y, ˆy) = y(0) log ˆy(0) y(1) log ˆy(1) Gen Label: Mixup Relabeling using Generative Models Thus, each loss term in (3) can be represented as L1 ℓ(yij, ˆyij) = Z 1 0 {(1 ρ) log(1 + eλr cos φ (1 λ)r sin φ) + ρ log(1 + e λr cos φ+(1 λ)r sin φ)}p(λ)dλ, L2 ℓ(yij, ˆyij) = Z 1 0 {(1 ρ) log(1 + e(2λ 1)r cos φ) + ρ log(1 + e (2λ 1)r cos φ)}p(λ)dλ, L3 ℓ(yij, ˆyij) = Z 1 0 log(1 + e (1 λ)r cos φ λr sin φ) p(λ)dλ, ℓ(y1, ˆy1) = ℓ(y3, ˆy3) = log(1 + e r cos φ), ℓ(y2, ˆy2) = log(1 + e r sin φ), where p(λ) is the probability density function for sampling λ. Based on the expression of the loss ℓ(r, φ) for each scheme given in (2) and (3), we numerically plotted φ = arg minφ ℓ(r, φ) for various r in Fig. 5. It turns out that the optimal φ new-mixup of new-mixup approaches to the SVM solution φsvm = π/4 as r increases. Using the standard definition of margin denoted by margin(θ) = min (xi,yi) D θ = min{cos φ, sin φ}, we margin(θsvm) = margin(θnew-mixup) > margin(θvanilla) > margin(θmixup) according to Fig. 5. C.3. Proof of Theorem 1 Following the proof of Theorem 3.1 of (Zhang et al., 2021), when θ Θ, we have (h (fθ(xi)) yi) fθ(xi)T Erx DX[rx xi] 0, h (fθ(xi)) fθ(xi)T Erx DX[(rx xi)(rx xi)T ] fθ(xi) 0. The first inequality in Theorem 1 is directly obtained by combining Lemma 8 and the fact that Ai σ1,c,τ,d < 1/3 and Bi σ1,c,τ,d < 1 6 holds, which is proven in Lemma 1. The second inequality in Theorem 1 is obtained by applying Theorem 3.1 of (Zhang et al., 2021) into the Taylor approximation of Gen Label loss Lgen n (θ, S) in Lemma 1. C.4. Proof of Theorem 2 Similar to the proof of Theorem 1, the first inequality is directly from Lemma 1. The second inequality is obtained by applying Theorem 3.3 of (Zhang et al., 2021) into the Taylor approximation of Gen Label loss Lgen n (θ, S) in Lemma 1. C.5. Lemmas used for proving Lemma 1 We here provide lemmas that are used in the proof of Lemma 1, which is given in Section C.6. Before stating our first lemma, recall that the covariance matrix of each class-conditional data distribution is a scalar factor of Σ, which is defined as 1 τ τ τ τ 1 τ τ ... τ ... τ ... τ τ 1 τ τ τ τ 1 Below we provide the inverse matrix of Σ. Gen Label: Mixup Relabeling using Generative Models Lemma 2. When τ / { 1 d 2} and 1 < τ < 1, the matrix in (4) is invertible. The inverse is given by 1 τd τd τd τd 1 τd τd ... τd ... τd ... τd τd 1 τd τd τd τd 1 cd = 1 1 τ (d 2)τ + 1 (d 1)τ + 1 (6) and τd = τ (d 2)τ + 1. (7) Proof. We prove the lemma by verifying Σ (5) = Id. Clearly the diagonal element in Σ (5) reads cd[1 (d 1)ττd] = 1 1 τ (d 2)τ + 1 (d 1)τ + 1[1 (d 1) τ 2 (d 2)τ + 1] = 1 1 τ (d 2)τ + 1 (d 1)τ + 1 (d 2)τ + 1 τ 2(d 1) = 1 1 τ (d 2)τ + 1 τ 2(d 1) (d 1)τ + 1 = 1 1 τ (1 τ)(τ(d 1) + 1) (d 1)τ + 1 = 1. The off-diagonal element in Σ (5) reads cd(τ τd (d 2)ττd) = cd[(d 2)τ 2 + τ τ (d 2)τ + 1 (d 2)τ 2 (d 2)τ + 1] = 0. Then we conclude the proof. Lemma 3. For Z N(0, Σ), we have the following formula for ZT Σ 1Z: ZT Σ 1Z = cd h A1[Z1 B1( i=2 Zi)]2 + A2[Z2 B2( + + Ad 1[Zd 1 Bd 1( i=d Zi)]2 + Ad Z2 d i , where An, Bn are constants that satisfy the following recurrence relation for n d An = An 1 B2 n 1An 1, Bn = An 1Bn 1 + B2 n 1An 1 An , A1 = 1, B1 = τd, (8) and Z = [Z1, , Zd]. Proof. Using (5) we have ZT Σ 1Z = cd[ i=1 Z2 i 2τd X Gen Label: Mixup Relabeling using Generative Models We focus on (*). 
We claim the following induction formula: Claim: for any n < d, and An, Bn satisfying (8), we can decompose ( ) into ( ) = A1[Z1 B1( i=2 Zi)]2 + + An[Zn Bn( i=n+1 Zi)]2 i=n+1 Z2 i 2An+1Bn+1 i,j=n+1,i =j Zi Zj. (9) The lemma immediately follows by setting n = d 1 in the claim. Now we use induction to prove the claim. Base case: when n = 1, we complete the square for Z1 and obtain ( ) = [Z1 τd(Z2 + + Zd)]2 τ 2 d(Z2 + + Zd)2 + i=2 Z2 i 2τd X i,j=2,i =j Zi Zj = [Z1 τd(Z2 + + Zd)]2 + (1 τ 2 d) i=2 Z2 i 2(τd + τ 2 d) X i,j=2,i =j Zi Zj. We conclude the base case with A1, B1, A2, B2 satisfying (8) as: A1 = 1, B1 = τd, A2 = 1 τ 2 d = A1 B2 1A1, A2B2 = τd + τ 2 d = A2 A1B1 + B2 1A1 A2 = A1B1 + B2 1A1. Induction hypothesis: we assume the claim holds true for n. We want to show the claim also holds true for n + 1. We focus on the second line of the claim: (9). We further complete the square and have (9) := An+1 i=n+1 Z2 i 2An+1Bn+1 i,j=n+1,i =j Zi Zj = An+1[Zn+1 Bn+1 i=n+2 Zi]2 An+1B2 n+1( i=n+2 Z2 i 2An+1Bn+1 i,j=n+2,i =j Zi Zj = An+1[Zn+1 Bn+1 + (An+1 An+1B2 n+1) i=n+2 Z2 i 2(An+1Bn+1 + An+1B2 n+1) X i,j=n+2,i =j Zi Zj = An+1[Zn+1 Bn+1 i=n+2 Zi]2 + An+2 i=n+2 Z2 i 2An+2Bn+2 X i,j=n+2,i =j Zi Zj. Thus the claim holds true for n + 1 with An+2, Bn+2, An+1, Bn+1 satisfying (8) as An+2 = An+1 An+1B2 n+1, Bn+2 = An+1Bn+1 + An+1B2 n+1 An+2 . Then we conclude the claim and the lemma. Gen Label: Mixup Relabeling using Generative Models Lemma 4. Denote Y = (Y1, , Yj), and Yj = Zj Bj(Pd i=j+1 Zi) with Bj, Zj defined in Lemma 3 for 1 j d, then Yj follows a 1-D Gaussian distribution: Yj N(0, 1 cd Aj ). Proof. Applying Lemma 3, we compute the cumulative density function of Yj as P(Yj < x) = P(Zj Bj( i=j+1 Zi) < x) = Z (Zj+1,Zj+2, ,Zd) Rd j Z x+Bj(Zj+1+ +Zd) (Z1,Z2, ,Zj 1) Rj 1(2π) d 2ZT Σ 1Z}d Z (Zj+1,Zj+2, ,Zd) Rd j(2π) d j 2 [Aj+1Y 2 j+1 + + Ad Y 2 d ]} Z x+Bj(Zj+1+ +Zd) 2 Aj Y 2 j } (Z1,Z2, ,Zj 1) Rj 1(2π) j 1 2 [A1Y 2 1 + + Aj 1Y 2 j 1]}d Z, (10) where we used Yd = Zd. Note that Yj = Zj Bj(Pm i=j+1 Zi), we apply change of variable i=2 Zi), , Zd) (Y1, , Yd). (11) The corresponding Jacobian matrix | (Y1, ,Yd) (Z1, ,Zd)| is an upper triangular matrix with diagonal element 1. Thus the Jacobian is 1, and we conclude (Yj+1,Yj+2, ,Yd) Rd j(2π) d j 2 [Aj+1Y 2 j+1 + + Ad Y 2 d ]} 2 Aj Y 2 j } (Y1,Y2, ,Yj 1) Rj 1(2π) j 1 2 [A1Y 2 1 + + Aj 1Y 2 j 1]}d Y = C 1 (cd Aj)1/2 1 2[1 + erf(x p where C a constant that corresponds to the integration in the first and third line. Here C does not depend on x. Note that 1 2[1 + erf( x 2 )] is the cdf of N(0, 1 cd Aj ), let x , we conclude C 1 (cd Aj)1/2 = 1, thus Yj N(0, 1 cd Aj ). Lemma 5. Yj and Yk are independent for j = k, where Yj is defined in Lemma 4. Proof. We prove the lemma by showing the joint cdf of Yj, Yk can be written as the product of cdf of Yj and cdf of Yk. Without loss of generality, we assume j < k. We focus on computing the joint cdf P(Yj < x, Yk < y). Following the same Gen Label: Mixup Relabeling using Generative Models procedure of (10), we apply the change of variable (11), then the integration becomes: P(Yj < x, Yk < y) = Z (Yk+1,Yk+2, ,Yd) Rd k(2π) d k 2 [Ak+1Y 2 k+1 + + Ad Y 2 d ]} 2 Ak Y 2 k } (Yj+1,Yj+2, ,Yk 1) Rk j 1(2π) k j 1 2 [Aj+1Y 2 j+1 + + Ak 1Y 2 k 1]} 2 Aj Y 2 j } (Y1,Y2, ,Yj 1) Rj 1(2π) j 1 2 [A1Y 2 1 + + Aj 1Y 2 j 1]}d Y = C 1 (cd Aj)1/2 1 2[1 + erf(x p 2 )] 1 (cd Ak)1/2 1 2[1 + erf(y cd Ak Similar to (12), C is a constant that corresponds to the first, third and fifth line, and C does not depend on x, y. 
Note that 1 2[1 + erf( x cd Aj 2 )] and 1 2[1 + erf( y cd Ak 2 )] are the cdf of N(0, 1 cd Aj ) and N(0, 1 cd Ak ). Let x, y , we conclude that the constant terms combine to be C 1 cd Aj Ak = 1. Thus the joint cdf is P(Yj < x, Yk < y) = 1 2[1 + erf(x p cd Aj 2 )] 1 2[1 + erf(y cd Ak This equals to P(Yj < x) P(Yk < y) by directly applying Lemma 4. Then we conclude the lemma. Lemma 6. Suppose Z N(0, Σ) with Σ given by (4), then e T 1 Σ 1Z N(0, cd), (13) where cd is defined in (7). Here (13) corresponds to a 1-D Gaussian distribution. Proof. By Lemma 2 we have e T 1 Σ 1 = cd(1, τd, τd, , τd)T . Thus e T 1 Σ 1Z = cd[Z1 τd Z2 τd Z3 τd Zd]. Then the lemma follows by directly applying Lemma 4 with A1 = 1, B1 = τd in (8). Lemma 7. Suppose Z N(0, Σ) with Σ given by (4), then ZT Σ 1Z has the same distribution as χ2(d), where χ2(d) is the Chi-square distribution with freedom d. Proof. Applying Lemma 3, Lemma 4 and Lemma 5 we have i=1 cd Ai Y 2 i , where Ai is defined in (8), Yi is defined in Lemma 4. Here Yi, Yj are independent for i = j and cd Ai Y 2 i = ( cd Ai Yi)2. Then we apply Lemma 4 to get ( cd Ai Yi) N(0, 1) is a standard normal distribution. Then by the definition of the Chi-square distribution we conclude the lemma. Gen Label: Mixup Relabeling using Generative Models C.6. Proof of Lemma 1 Proof. Denote the mixed point by xij(λ) = λxi + (1 λ)xj. In order to estimate the second order Taylor expansion of Lgen n (θ, S), we first compute the Gen Label ygen ij . Next we use expression of ygen ij to estimate Lgen n (θ, S). Then we derive the second order Taylor expansion and the correspond coefficients Ai σ1,c,τ,d, Bi σ1,c,τ,d. Last we consider the asymptotic limit σ1 . Step 1: compute ygen ij . Recall that when yi = yj, we set the label of mixed point as ymix ij = yi. For such case, we have ymix ij = λ1yi +(1 λ1)yi = λ1yi + (1 λ1)yj for any λ1 R. When yi = yj, we use the suggested Gen Label ygen ij in (1). Without loss of generality, we assume xi N( e1, Σ σ2 1 ) and xj N(e1, Σ σ2 2 ). Thus the correspond labels are yi = 0, yj = 1. We compute the mixed point xij as follows: xij(λ) = λxi + (1 λ)xj = λ( e1 + Zi) + (1 λ)(e1 + Zj) = (1 2λ)e1 + Zij Zij = λZi + (1 λ)Zj N(0, λ2Σ σ2 1 + (1 λ)2Σ σ2 2 ). (14) where Zi = xi + e1 and Zj = xj e1. Now we compute the Gen Label ygen ij and express it as a convex combination of yi and yj. To compute the ygen ij , we denote the density function of N( e1, Σ p(x) = (2π) d 2 σd 1e σ2 1 2 (x+e1)T Σ 1(x+e1), and we denote the density function of N(e1, Σ q(x) = (2π) d 2 σd 2e σ2 2 2 (x e1)T Σ 1(x e1). Then the Gen Label ygen ij in (1) is given by the ratio: ygen ij = q( xij(λ)) p( xij(λ)) + q( xij(λ)) = 1 1 + p( xij(λ)) 1 + σd 1 σd 2 exp{ σ2 1 2 [( xij(λ) + e1)T Σ 1( xij(λ) + e1) σ2 2 σ2 1 ( xij(λ) e1)T Σ 1( xij(λ) e1)]} . We use xij(λ) = (1 2λ)e1 + Zij in (14) to express the exponential term in the denominator as exp{ σ2 1 2 [(2 2λ)2e T 1 Σ 1e1 + 4(1 λ)e T 1 Σ 1Zij + ZT ijΣ 1Zij]} exp{σ2 1 2 [4λ2e T 1 Σ 1e1 4λe T 1 Σ 1Zij + ZT ijΣ 1Zij]} exp{σ2 2 σ2 1 2 [4λ2e T 1 Σ 1e1 4λe T 1 Σ 1Zij + ZT ijΣ 1Zij]} = exp{ σ2 1[(2 4λ)e T 1 Σ 1e1 + 2e T 1 Σ 1Zij]} exp{(σ2 2 σ2 1)[2λ2e T 1 Σ 1e1 2λe T 1 Σ 1Zij + ZT ijΣ 1Zij Now we apply previous lemmas to estimate all terms in the exponent. For e T 1 Σ 1e1, we apply Lemma 2 to Σ 1 and conclude (2 4λ)e T 1 Σ 1e1 = (2 4λ)cd, where cd is defined in (7). For the other two terms, we define Z ij := e T 1 Σ 1Zij and zij := ZT ijΣ 1Zij. From (14), Zij = q λ2 σ2 1 + (1 λ)2 σ2 1 + (1 λ)2 σ2 2 ]Σ), with Z N(0, Σ). 
Then we apply Lemma 6 to Z ij and have Z ij = e T 1 Σ 1Zij N(0, [λ2 σ2 1 + (1 λ)2 σ2 2 ]cd). (15) Gen Label: Mixup Relabeling using Generative Models For zij, we apply Lemma 7 and have zij = ZT ijΣ 1Zij = [λ2 σ2 1 + (1 λ)2 σ2 2 ]ZT Σ 1Z [λ2 σ2 1 + (1 λ)2 σ2 2 ]χ2(d) (16) where χ2(d) is the Chi-square distribution with freedom d. Thus we conclude that the Gen Label reads ygen ij = 1 1 + σd 1 σd 2 exp{ σ2 1[(2 4λ)cd + 2Z ij]} exp{(σ2 2 σ2 1)[2λ2cd 2λZ ij + zij 2 ]} . (17) In other words, ygen ij can be written as a convex combination of yi and yj as follows: ygen ij = λ1yi + (1 λ1)yj = 1 λ1 = (17), λ1 = 1 (17) = 1 1 + σd 2 σd 1 exp{σ2 1[(2 4λ)cd + 2Z ij]} exp{(σ2 1 σ2 2)[2λ2cd 2λZ ij + zij 2 ]} . (18) Step 2: estimate Lgen n (θ, S). Now we plug the expression of ygen ij (18) into the Gen Label loss, we have Lgen n (θ, S) = 1 n2 Eλ Unif([0,1]) i,j=1 [h(fθ( xij(λ))) (λ1yi + (1 λ1)yj)]fθ( xij(λ)) n2 Eλ Unif([0,1]) EB Bern(λ1) B[h(fθ( xij)) yifθ( xij)] + (1 B)[h(fθ( xij)) yjfθ( xij)] ) For λ Unif([0, 1]), B|λ Bern(λ1), we can exchange them in order and have B Bern(aij), λ|B F1 ij, B = 1; F2 ij, B = 0. 0 λ1dλ = Z 1 1 + σd 2 σd 1 exp{σ2 1[(2 4λ)cd + 2Z ij]} exp{(σ2 1 σ2 2)[2λ2cd 2λZ ij + zij F1 ij has density function 1 + σd 2 σd 1 exp{σ2 1[(2 4λ)cd + 2Z ij]} exp{(σ2 1 σ2 2)[2λ2cd 2λZ ij + zij F2 ij has density function 1 aij = 1 1 aij 1 + σd 1 σd 2 exp{ σ2 1[(2 4λ)cd + 2Z ij]} exp{(σ2 2 σ2 1)[2λ2cd 2λZ ij + zij After changing the order of λ and B in (19), we get i,j=1 aij Eλ F1 ij[h(fθ( xij(λ))) yifθ( xij(λ))] (20) i,j=1 (1 aij)Eλ F2 ij[h(fθ( xij(λ))) yjfθ( xij(λ))] . (21) Gen Label: Mixup Relabeling using Generative Models Since xij(λ) = xji(1 λ), we can rewrite (21) as i,j=1 (1 aij)Eλ F3 ij[h(fθ( xij(λ))) yifθ( xij(λ))]. (22) Here F3 ij has density function F2 ij(1 λ): F3 ij 1 1 aij 1 + σd 1 σd 2 exp{ σ2 1[(4λ 2)cd + 2Z ij]} exp{(σ2 2 σ2 1)[2(1 λ)2cd 2(1 λ)Z ij + zij From (20) and (22), we denote Fij as a mixture distribution: Fij = aij F1 ij + (1 aij)F3 ij. Then (19) reads i,j=1 Eλ Fij[h(fθ( xij(λ))) yifθ( xij(λ))] i=1 Eλ Fi Erx DXℓˇxi,yi(θ). (23) Here we defined Fi as a mixture distribution: DX is the empirical distribution induced by training samples and ˇxi = λxi + (1 λ)rx. Step 3: derive the second order Taylor expansion. Given the expression of Lgen n (θ, S) in (23), we follow the proof of Lemma 8 and conclude that the second order Taylor expansion is given by Lemma 1, with the coefficients Ai σ1,c,τ,d, Bi σ1,c,τ,d given by Ai σ1,c,τ,d = Eλ Fi[1 λ], Bi σ1,c,τ,d = Eλ Fi[(1 λ)2]. (24) Here Fi has density function 1 + σd 2 σd 1 exp{σ2 1[(2 4λ)cd + 2Z ij]} exp{(σ2 1 σ2 2)[2λ2cd 2λZ ij + zij 1 + σd 1 σd 2 exp{ σ2 1[(4λ 2)cd + 2Z ij]} exp{(σ2 2 σ2 1)[2(1 λ)2cd 2(1 λ)Z ij + zij where Z ij, zij and cd are defined in (15), (16) and (6) respectively. It remains to prove that when σ1 , these coefficients satisfy the properties mentioned in Lemma 1. Step 4: asymptotic analysis for σ1 Now we prove limσ1 Ai σ1,c,τ,d = c2+1 2(c+1)2 and limσ1 Bi σ1,c,τ,d = c2 c+1 3(1+c)2 . Recall that Gen Label ygen ij is given in (18). When σ1 , we have Z ij, zij 0, then λ1 in (18) becomes λ1 = 1 1 + cd exp{σ2 1[(2 4λ)]cd + 2(σ2 1 c2σ2 1)λ2cd} = 1 1 + cd exp{2σ2 1cd[(1 c2)λ2 2λ + 1]} ( 1 1+cd exp{2σ2 1cd(1 c2)(λ 1 1 c )(λ 1 1+c )}, c = 1; 1 1+cd exp{2σ2 1cd(1 2λ)}, c = 1. Gen Label: Mixup Relabeling using Generative Models We have three cases regarding c. If c > 1, then 1 1 c < 0, 1 c2 < 0, which implies (1 c2)(λ 1 1 c)(λ 1 1 + c) > 0, 1 1 c < 0 λ < 1 1+c; < 0, 1 1+c < λ 1. 
If 0 < c < 1, then 1 1 c > 1, 1 c2 > 0, which implies (1 c2)(λ 1 1 c)(λ 1 1 + c) > 0, 0 λ < 1 1+c; < 0, 1 1+c < λ 1 < 1 1 c. If c = 1, we have 1 2λ > 0 for 0 λ < 1 2 and 1 2λ < 0 for 1 When σ1 , we combine all three cases above and conclude λ1 = 0, 0 λ < 1 1+c; 1, 1 1+c < λ 1. ygen ij = yj, 0 λ < 1 1+c; yi, 1 1+c < λ 1. With the Gen Label given by the above equation, we compute the Gen Label loss as Lgen n (θ, S) = 1 n2 Eλ Unif([0,1]) i,j=1 [h(fθ( xij(λ))) ygen ij fθ( xij(λ))] = 1 (c + 1)n2 Eλ Unif([0,1/(1+c)]) h(fθ( xij(λ))) yjfθ( xij(λ)) (25) + c (c + 1)n2 Eλ Unif([1/(1+c),1]) h(fθ( xij(λ))) yifθ( xij(λ)) . Since 1 Unif([0, 1/(1 + c)]) and Unif([1 1/(1 + c), 1]) are of the same distribution and xij(1 λ) = xji(λ), we have (25) = 1 (c + 1)n2 Eλ Unif([c/(c+1),1]) h(fθ( xij(λ))) yifθ( xij(λ)) . Using the above equation, the Gen Label loss reads Lgen n (θ, S) = 1 n2 Eλ c c+1 Unif([1/(c+1),1])+ 1 c+1 Unif([c/(c+1),1]) h(fθ( xij(λ))) yifθ( xij(λ)) i=1 Eλ c c+1 Unif([1/(c+1),1])+ 1 c+1 Unif([c/(c+1),1])Erx DXℓˇxi,yi(θ). Following the proof of Lemma 8, we conclude that when σ1 , the coefficients Ai σ1,c,τ,d, Bi σ1,c,τ,d are given by lim σ1 Ai σ1,c,τ,d =Eλ c c+1 Unif([1/(c+1),1])+ 1 c+1 Unif([c/(c+1),1])[1 λ] c (1 λ)dλ + 1 c + 1 1 c+1 (1 λ)dλ + Z 1 c c+1 (1 λ)dλ = c2 2(c + 1)2 + 1 2(c + 1)2 = c2 + 1 2(c + 1)2 . lim σ1 Bi σ1,c,τ,d =Eλ c c+1 Unif([1/(c+1),1])+ 1 c+1 Unif([c/(c+1),1])[(1 λ)2] 1 c+1 (1 λ)2dλ + Z 1 c c+1 (1 λ)2dλ = c3 3(c + 1)3 + 1 3(c + 1)3 = c2 c + 1 3(c + 1)2 . Gen Label: Mixup Relabeling using Generative Models From direct computation, we conclude that when 2 3 < c < 2 + c2 + 1 2(c + 1)2 < 1 3, c2 c + 1 3(c + 1)2 < 1 6 c2 4c + 1 < 0. We conclude the lemma. D. Mathematical results in (Zhang et al., 2021) Lemma 8 (Lemma 3 of (Zhang et al., 2021)). The second order Taylor approximation of the mixup loss Lmix n (θ, S) is given by Lmix n (θ, S) = Lstd n (θ, S) + Rmix 1 (θ, S) + Rmix 2 (θ, S) + Rmix 3 (θ, S), Rmix 1 (θ, S) = 1 1 3(h (fθ(xi)) yi) fθ(xi)T Erx DX[rx xi], Rmix 2 (θ, S) = 1 1 6h (fθ(xi)) fθ(xi)T Erx DX[(rx xi)(rx xi)T ] fθ(xi), Rmix 3 (θ, S) = 1 1 6(h (fθ(xi)) yi)Erx DX[(rx xi)T 2fθ(xi)(rx xi)]. Lemma 9 (Lemma 3.2 of (Zhang et al., 2021)). Consider the logistic regression model having fθ(x) = θT x. The second order Taylor approximation of Ladv n (θ, S) is 1 n Pn i=1 ℓadv(ε d, (xi, yi)), where for any η > 0, x Rd and y {0, 1}, ℓadv(η, (x, y)) = ℓ(θ, (x, y)) + η g θT x y θ 2 + η2 2 g θT x 1 g θT x θ 2 2 and g(x) = ex/(1 + ex) is the logistic function. E. Detailed experiments setup Here we provide a detailed description on our experimental settings. E.1. Synthetic datasets Datasets The 2D cube dataset with 2 classes (class 0 and 1) is defined as follows. Consider two adjacent squares centered at µ0 = ( 1, 0) and µ1 = (1, 0), respectively, where the length of each side of each square is 2. We define the support of class i as the area of each square. In other words, the support of class 0 is X0 = {x R2 : µ0 x 1} where is the L norm operator. Similarly, the support of class 1 is X1 = {x R2 : µ1 x 1}. The data point x for class i {0, 1} is uniform-randomly sampled from the square Xi. The 3D cube dataset with 8 classes is defined as below. Consider 8 adjacent cubes, each of which is located at each octant, where the center of each cube is µ = (µ(1), µ(2), µ(3)) for µ(1), µ(2), µ(3) { 1, 1} and the length of each side of each cube is 2. We define the support of class i as the volume of each cube. 
For example, the class 0 corresponds to the cube centered at µ0 = ( 1, 1, 1), and the support of class 0 is X0 = {x R3 : µ0 x 1}. Similarly, we define the support of class i {0, 1, , 7}. The data point x for class i is uniform-randomly sampled from the cube Xi. The 9-class Gaussian dataset used in Fig. 4 is defined as follows. We generate 9 Gaussian clusters having the covariance matrix of Σ = 1 10I2 and centered at µ = (µ(1), µ(2)) for µ(1), µ(2) { 10, 0, 10}. For example, cluster 0 (or class 0) is centered at µ0 = ( 10, 10) and cluster 8 (or class 8) is centered at µ0 = (10, 10). The Circle and Moon datasets used in Table 1 are from scikit-learn (Pedregosa et al., 2011) combined with Laplacian noise, where the exponential decay λ of Laplacian noise is set to 0.1 for Moon and 0.02 for Circle. The Four-circle dataset used in Table 1 is generated as follows. We first generate a Circle dataset from scikit-learn (Pedregosa et al., 2011) combined with Laplacian noise, where the exponential decay λ of Laplacian noise is set to 0.01. Then, we generate another (second) Circle dataset under the same setting (but having different realization), shift it to the right, and flip the label of the second Circle dataset. In this way, we get two adjacent Circle datasets with flipped label. Gen Label: Mixup Relabeling using Generative Models Training setting For synthetic datasets, the hyperparameters used in our experiments are summarized in Table 10. For both 2D and 3D cube datasets, we randomly generate 20 data samples from uniform distribution for each class as training data, and evaluate the decision boundary by another 10000 randomly generated data samples for each class. For 9-class Gaussian dataset, each cluster has 5000 randomly generated samples as the training data. For Moon and Circle datasets, we randomly generate 1000 data samples for both training and testing. For Four-circle dataset, we randomly generate 1000 data samples for each Circle dataset for both training and testing. For 2D and 3D cube datasets, we use a 3-layer fully connected network, which has 64 neurons in the first hidden layer and 128 neurons in the second hidden layer. For Moon, Circle and Four-circle datasets, we use a 4-layer fully connected network, which has 64 neurons in the first hidden layer and 128 neurons in the remaining hidden layers. For all the datasets, we use the SGD optimizer and the multi-step learning rate decay. We measure the clean validation accuracy at each epoch and choose the best model having the highest clean accuracy. Algorithms For mixup (Zhang et al., 2018), we followed the code from the official github repository: https:// github.com/facebookresearch/mixup-cifar10. For our Gen Label scheme on 9-class Gaussian datasets, we use the ground-truth mean and identity covariance to estimate the Gaussian mixture (GM) models at the input layer. E.2. Real datasets Datasets We use Open ML datasets from (Vanschoren et al., 2013) , MNIST, CIFAR-10 and CIFAR-100 datasets from Py Torch (Paszke et al., 2017), and Tiny-Imagenet-200 dataset from http://cs231n.stanford.edu/ tiny-imagenet-200.zip. For experiments on Open ML datasets, we first accessed all datasets from Python Open ML API (Feurer et al., 2019). Afterwards, we filtered out the datasets having more than 20 features, datasets with more than 5000 data samples. We tested our Gen Label on the remaining datasets. Training setting The hyperparameters used in our experiments are summarized in Table 11, 12, 13 and 14. 
When we train mixup+Gen Label on Open ML datasets, we used a 6-fold cross-validation for choosing the best loss ratio γ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}. For the clean validation runs, we measured the clean validation accuracy at each epoch and choose the best model having the highest clean accuracy. For the robust validation runs, we measured the robust validation accuracy at every 5 epochs and choose the best model having the highest robust accuracy. For Open ML datasets, we tested training methods on both the logistic regression model and the neural network with 2 hidden layers. For the latter, we followed the same architecture used in mixup (Zhang et al., 2018) which has 128 neurons in each hidden layer. For MNIST and CIFAR-10 datasets, we used Le Net-5 and Res Net-18, respectively. For both CIFAR-100 and Tiny-Imagenet-200 datasets, we used Pre Act Res Net-18. We tested on NVIDIA Tesla V100 GPUs in Amazon Web Service (AWS) and local NVIDIA RTX2080 GPU machines. Algorithms For mixup (Zhang et al., 2018) and manifold-mixup (Verma et al., 2019), we followed the code from the official github repository: https://github.com/facebookresearch/mixup-cifar10 and https://github. com/vikasverma1077/manifold_mixup. Note that the mixup github repository contains license: see https: //github.com/facebookresearch/mixup-cifar10/blob/master/LICENSE. For adamix Up (Guo et al., 2019), we used two methods: (1) for Open ML, we implemented by ourselves in Py Torch, (2) for MNIST and CIFAR-10, we cloned the source code in https://github.com/SITE5039/Ada Mix Up implemented in Tensor Flow (Abadi et al., 2015), and made slight modifications to make their experimental settings and models consistent with ours. For our Gen Label schemes, we estimated and updated the Gaussian mixture (GM) models at the penultimate layer. F. Generative model-based mixup algorithm (Gen Mix) In Section 7.2 of the main manuscript, we suggested a new way of mixing data points using generative models. Here, we formally define the algorithm for such generative model-based mixup , which is dubbed as Gen Mix. Our algorithm first trains a class-conditional generative model. One can use any generative models off-the-shelf, e.g., Gaussian mixture models, GANs. Based on the learned class-conditional distribution pc(x) s, our algorithm augments the training dataset with data points xmix that satisfy pc1(xmix) : pc2(xmix) = (1 λ) : λ for arbitrary pre-defined λ [0, 1]. It then trains a model via a standard (non-adversarial) training algorithm with the augmented dataset. The key idea behind Gen Mix is that such augmented data points can act as an implicit regularizer, promoting larger margins for the classification boundary of the trained model, which in turn guarantees robustness with good generalization. Gen Label: Mixup Relabeling using Generative Models Table 10: Hyperparameters and models used for clean validation in synthetic dataset experiments. 
General settings Optimizer Momentum Weight decay Batch size SGD 0.9 0.0001 128 Datasets Methods Model Training epochs Learning rate Loss ratio γ Vanilla 3-layer FC net 40 0.1 - Mixup 3-layer FC net 40 0.1 - Mixup+Gen Label 3-layer FC net 40 0.1 1 Vanilla 3-layer FC net 40 0.1 - Mixup 3-layer FC net 40 0.1 - Mixup+Gen Label 3-layer FC net 40 0.1 0.8 Vanilla 4-layer FC net 100 0.1 - Mixup 4-layer FC net 100 0.1 - Mixup+Gen Label 4-layer FC net 100 0.1 1 Vanilla 4-layer FC net 100 0.1 - Mixup 4-layer FC net 100 0.1 - Mixup+Gen Label 4-layer FC net 100 0.1 0.8 Four-circle Vanilla 4-layer FC net 100 0.1 - Mixup 4-layer FC net 100 0.1 - Mixup+Gen Label 4-layer FC net 100 0.1 1 Table 11: Hyperparameters and models used for clean validation in Open ML datasets experiments. General settings Training epochs Optimizer Weight decay Batch size 100 Adam 0.0001 128 Datasets Methods Model Learning rate Loss ratio γ Baselines Logistic Regression Chosen by cross-validation (among 0.1, 0.01, 0.001, and 0.0001) - Mixup+Gen Label Logistic Regression Chosen by cross-validation (among 0.1, 0.01, 0.001, and 0.0001) Chosen by cross-validation The rest of this section is organized as follows. We first provide a formal description of the Gen Mix framework. Then, we propose two specific instances of our framework, namely, Gen Mix+GM and Gen Mix+GAN, which use Gaussian mixture (GM) and GANs for generative modeling, respectively. F.1. General framework Let Dc = {(x(m) c , ec)}nc m=1 be the training data for class y [k], where x(m) c is the feature vector for m-th data point, ec is the one-hot encoded label vector for any data points in class c, and nc is the number of data points with class y. The training data is denoted by D = c [k]Dc. In the first stage, it trains class-conditional generative model using the given training data Xc = {x(m) c }nc m=1, thereby learning the underlying data distribution pc(x). In the second stage, we randomly sample mixing coefficient λ [0, 1]. For each class pair i, j [k], we generate augmented points Xmixup = {x1, , xnaug}, each of which satisfies pi(xmix) : pj(xmix) = (1 λ) : λ. In other words, the goal is to find virtual data x s which satisfy pj(x) pi(x) + pj(x) λ ε (26) for a pre-defined small margin ε > 0. Depending on the generative model used in the algorithm, we use different methods to find these mixup points Xmixup. The detailed description of these methods are given in the following subsections. In both schemes, we check whether the generated mixup points Xmixup incur manifold intrusion (Guo et al., 2019), and discard the mixup points having such issues. To be specific, for the case of mixing class i and j, we decide that the manifold intrusion does not occur for a mixup point x Xmixup if classes i and j are the two most probable classes of x, i.e., min{pi(x), pj(x)} pℓ(x) holds for all other classes ℓ [k]\{i, j}. For augmented data x without such manifold intrusion issue, we soft-label it as y = pi pi+pj ei + pj pi+pj ej where pi = pi(x) is the probability that x is sampled from class i. We denote the set of data-label pair as Dmixup = {(x, y)} for x Xmixup. Given naug data points obtained in the second stage, the algorithm finally trains the classification model f : Rn [0, 1]k that predicts the label y = [y1, , yk] of the input data. Here, the cross-entropy loss is used while optimizing the model. In our Gen Mix scheme, the model is trained by using not only the given training data D = c [k]{Dc}, but also the augmented dataset Dmixup. 
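To make the search for augmented points concrete, the following sketch assumes Gaussian class-conditional models with known means and covariances, and finds a point satisfying the likelihood-ratio condition (26) by gradient descent on the squared gap between log pj(x) - log pi(x) and the target log-odds of λ. This is a simple iterative alternative to the closed-form QDA solution used by Gen Mix+GM; all names are illustrative and λ is assumed to lie strictly inside (0, 1).

```python
import numpy as np
from scipy.stats import multivariate_normal

def find_mix_point(mu_i, cov_i, mu_j, cov_j, lam, steps=500, lr=0.05, eps=1e-3):
    """Search for x with p_j(x) / (p_i(x) + p_j(x)) close to lam."""
    target = np.log(lam) - np.log(1.0 - lam)       # logit(lambda)
    x = 0.5 * (mu_i + mu_j)                        # start between the two means
    for _ in range(steps):
        log_pi = multivariate_normal.logpdf(x, mu_i, cov_i)
        log_pj = multivariate_normal.logpdf(x, mu_j, cov_j)
        gap = log_pj - log_pi - target
        if abs(gap) < eps:
            break
        # grad of log p_c(x) is -Sigma_c^{-1} (x - mu_c); step on 0.5 * gap^2
        grad = -np.linalg.solve(cov_j, x - mu_j) + np.linalg.solve(cov_i, x - mu_i)
        x = x - lr * gap * grad
    return x
```

Candidate points returned this way would then be kept only if they pass the manifold-intrusion check described above, and soft-labeled by the normalized likelihoods of the two target classes.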
The pseudocode of the Gen Mix algorithm is given in Algorithm 4. Gen Label: Mixup Relabeling using Generative Models Table 12: Hyperparameters and models used for robust validation in Open ML dataset experiments. General settings Training epochs Optimizer Momentum Weight decay Batch size Black-box attack radius 100 SGD 0.9 0.0001 128 0.1 * (max value - min value) Datasets Methods Model Learning rate Loss ratio γ Baselines FC Re LU networks with 2 hidden layers Chosen by cross-validation (among 0.1, 0.01, 0.001, and 0.0001) - Mixup+Gen Label FC Re LU networks with 2 hidden layers Chosen by cross-validation (among 0.1, 0.01, 0.001, and 0.0001) Chosen by cross-validation Table 13: Hyperparameters and models used for clean validation in image dataset experiments. General settings Training epochs Learning rate scheduler Optimizer Momentum Weight decay Batch size 200 multi-step decay SGD 0.9 0.0001 128 Datasets Methods Model Learning rate Attack radius Loss ratio γ Memory ratio β Vanilla Le Net-5 0.1 0.05 - - Ada Mixup Le Net-5 0.1 0.05 - - Mixup Le Net-5 0.1 0.05 - - Mixup+Gen Label Le Net-5 0.1 0.05 0.15 0.95 Manifold mixup Le Net-5 0.1 0.05 - - Manifold mixup+Gen Label Le Net-5 0.1 0.05 0.15 0.99 Vanilla Res Net-18 0.1 2/255 - - Ada Mixup Res Net-18 0.1 2/255 - - Mixup Res Net-18 0.1 2/255 - - Mixup+Gen Label Res Net-18 0.1 2/255 0.1 0.95 Manifold mixup Res Net-18 0.1 2/255 - - Manifold mixup+Gen Label Res Net-18 0.1 2/255 0.1 0.95 Vanilla Pre Act Res Net-18 0.1 1/255 - - Mixup Pre Act Res Net-18 0.1 1/255 - - Mixup+Gen Label Pre Act Res Net-18 0.1 1/255 0.1 0.97 Manifold mixup Pre Act Res Net-18 0.1 1/255 - - Manifold mixup+Gen Label Pre Act Res Net-18 0.1 1/255 0.1 0.97 Tiny Image Net Vanilla Pre Act Res Net-18 0.1 1/255 - - Mixup Pre Act Res Net-18 0.1 1/255 - - Mixup+Gen Label Pre Act Res Net-18 0.1 1/255 0.05 0.995 Manifold mixup Pre Act Res Net-18 0.1 1/255 - - Manifold mixup+Gen Label Pre Act Res Net-18 0.1 1/255 0.05 0.995 Algorithm 4 Gen Mix Input Training data D, Number of augmented data naug, likelihood-ratio margin ε > 0, mixing coefficient λ [0, 1] Output Trained model f( ), Augmented data Dmixup pc Data distribution of class c learned by generative model Dmixup {} for classes i [k] and j [k]\{i} do n 0 while n < naug do Find point x satisfying pj(x) pi(x)+pj(x) λ ε pℓ pℓ(x) for ℓ [k] if min{pi, pj} pℓ ℓ [k]\{i, j} then Dmixup Dmixup {(x, pi pi+pj ei + pj pi+pj ej)} n n + 1 end if end while end for f model training with D Dmixup In summary, the proposed scheme is a novel data augmentation technique that first learns the data distributions for each class using class-conditional generative models, and then augments the train data with soft-labeled data points Xmixup, each of which has the likelihood ratio of λ [0, 1] with respect to a target class pair. Gen Label: Mixup Relabeling using Generative Models Table 14: Hyperparameters and models used for Auto Attack validation in image dataset experiments. 
General settings Training epochs Learning rate scheduler Optimizer Momentum Weight decay Batch size 50 multi-step decay SGD 0.9 0.0001 128 Datasets Methods Model Learning rate Attack radius Loss ratio γ Memory ratio β Vanilla Le Net-5 0.001 0.1 - - Ada Mixup Le Net-5 0.001 0.1 - - Mixup Le Net-5 0.001 0.1 - - Mixup+Gen Label Le Net-5 0.001 0.1 0.15 0.97 Manifold mixup Le Net-5 0.001 0.1 - - Manifold mixup+Gen Label Le Net-5 0.001 0.1 0.15 0.97 Vanilla Res Net-18 0.001 2/255 - - Ada Mixup Res Net-18 0.001 2/255 - - Mixup Res Net-18 0.001 2/255 - - Mixup+Gen Label Res Net-18 0.001 2/255 0.15 0.9 Manifold mixup Res Net-18 0.001 2/255 - - Manifold mixup+Gen Label Res Net-18 0.001 2/255 0.15 0.9 Vanilla Pre Act Res Net-18 0.001 1/255 - - Mixup Pre Act Res Net-18 0.001 1/255 - - Mixup+Gen Label Pre Act Res Net-18 0.001 1/255 0.15 0.97 Manifold mixup Pre Act Res Net-18 0.001 1/255 - - Manifold mixup+Gen Label Pre Act Res Net-18 0.001 1/255 0.15 0.97 Tiny Image Net Vanilla Pre Act Res Net-18 0.002 1/255 - - Mixup Pre Act Res Net-18 0.002 1/255 - - Mixup+Gen Label Pre Act Res Net-18 0.002 1/255 0.15 0.995 Manifold mixup Pre Act Res Net-18 0.002 1/255 - - Manifold mixup+Gen Label Pre Act Res Net-18 0.002 1/255 0.15 0.995 Generator G Class i Mi = {G(z, i)}z ℝd Class j Mj = {G(z, j)}z ℝd Figure 10: Finding the augmented point x Xmixup satisfying d(x, Mj) d(x, Mi) log( 1 λ 1) in Gen Mix+GAN, for arbitrary classes i = j and a pre-defined mixing coefficient λ [0, 1]. Given the manifold Mc = {G(z, c)}z Rd for class c [k] estimated by class-conditional GAN, the distance d(x, Mc) is measured by inverting the generator of GAN (Creswell & Bharath, 2018). F.2. Gen Mix+GM We first suggest Gen Mix+GM, a data augmentation scheme which uses the Gaussian mixture (GM) model for generative modeling. Here, we provide a formal description on how Gen Mix+GM finds the augmented points x satisfying the likelihood ratio condition (26). Given training samples, Gen Mix+GM algorithm first estimates the parameters of Gaussian distribution for each class. To be specific, it computes the sample mean and the sample covariance of class c, represented as c µc = 1 nc Pnc m=1 x(m) c and c Σc = 1 nc Pnc m=1(x(m) c c µc)(x(m) c c µc)T , respectively. Then, the (estimated) probability of point x sampled from class c is pc(x) = 1 (2π)kdet( c Σc)e (x c µc)T c Σc 1(x c µc)/2. Now, the question is how to find the virtual data points x satisfying (26). This can be solved by applying quadratic discriminant analysis (QDA) (Ghojogh & Crowley, 2019), which gives us the closed-form solution for x satisfying |log pj(x) pi(x)+pj(x)| λ, for given target classes i, j. F.3. Gen Mix+GAN The Gaussian mixture (GM) model is a simple generative model that works well when the data distribution is similar to Gaussian, but it cannot learn other distributions. In such cases, GANs are useful for learning the underlying distribution. Thus, here we suggest Gen Mix+GAN which uses GANs for generative modeling. As discussed in Section 7.1, we can replace pc(x) by exp( d(x, Mc)) in Algorithm 4 and apply Gen Mix scheme. Note that the condition in (26) reduces to d(x, Mj) d(x, Mi) log( 1 λ 1). Thus, the goal is to solve minx(d(x, Mj) d(x, Mi) log( 1 We use an iterative method to find points x that satisfy this condition. One key observation that helps us to design an efficient optimization algorithm is that if m = arg minm M d(x, m), then d(x + δ, M) d(x + δ, m ) if δ is small. 
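The projection d(x, Mc) = min_z ||x - G(z, c)|| used here can be approximated by optimizing over the latent code, as in standard GAN-inversion setups. The sketch below is illustrative only and assumes a conditional generator G(z, c) that is differentiable in z; the interface names and hyperparameters are placeholders.

```python
import torch

def project_to_manifold(G, x, c, z_dim, steps=200, lr=0.05):
    """Approximate m* = argmin_{m in M_c} ||x - m||_2 by optimizing the latent z."""
    z = torch.randn(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.norm(G(z, c) - x)     # distance to the current candidate
        loss.backward()
        opt.step()
    with torch.no_grad():
        m_star = G(z, c)
    return m_star, torch.norm(m_star - x).item()   # projection m* and d(x, M_c)
```

Re-using the resulting projection m* while taking small steps in x is exactly the approximation justified by the triangle-inequality argument below.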
Gen Label: Mixup Relabeling using Generative Models x data robust feature extractor frobust classifier gcls y D = {xl, yl}n Drobust = D {xmid, ymid} Zi : manifold for class i pi(z) : distribution of class i feature domain Zi Zj zmid : mid features for class i, j robust(zmid) ymid = αei + (1 α)ej α = pi(zmid)/(pi(zmid) + pj(zmid)) Figure 11: How to generate augmented dataset Drobust by applying Gen Mix in the hidden feature space. For target classes i, j, we first apply Gen Mix+GM in the feature space to find the mid hidden feature zmid, and then invert it back to get mid input feature xmid = f 1 robust(zmid). Finally, we add mid input features in the original dataset D to obtain Drobust. Here, we make use of the invertibility of frobust suggested in (Engstrom et al., 2019). That is, once we have a projection of x onto a manifold M, say m , the distance between x + δ and the same manifold can be safely approximated by the distance between x + δ and m , without recomputing the projection. To formally prove this, from triangle inequality, d(x + δ, M) = min m M d(x + δ, m) min m M[d(x + δ, x)+d(x, m)] = min m Md(x, m)+d(x, x + δ) = d(x, m ) + d(x, x + δ) d(x + δ, m ) + 2d(x, x + δ) holds. Similarly, we have d(x+δ, M) d(x+δ, m ) 2d(x, x+δ). This implies that when d(x+δ, m ) d(x, x+δ), we have d(x + δ, M) d(x + δ, m ). Using this approximation, we propose the following sequential optimization algorithm, as illustrated in Fig. 10. Starting from a random initial point x Rn, we first compute its projection on k class-conditional manifolds, finding m c = arg minm Mc d(x, m) for each c [k]. Each of these projections can be approximately computed by solving a respective optimization problem minz Rd d(x, G(z, c)). Now, we select two target classes i, j which are closest to the initial point, i.e., d(x, m i ) d(x, m j) d(x, m l ) for all l [k] \ {i, j}, and consider the following optimization problem: min δ |d(x + δ, Mj) d(x + δ, Mi) log( 1 such that d(x + δ, m c) d(x, x + δ), c {i, j} That is, we find the best direction δ that minimizes the objective function, within a small set around x. By the aforementioned approximation, the target function can be rewritten as |d(x + δ, m j) d(x + δ, m i ) log( 1 λ 1)|2. Since m i and m j are given, we can compute the gradient of this objective function with respect to δ and run a gradient descent algorithm. The solution to this sub-optimization problem is now defined as x, and we repeat the whole procedure until | 1 1+exp(d(x,m j ) d(x,m i )) λ| ε, and obtain the augmented data point x. We label this augmented data as y = exp( di) exp( di)+exp( dj)ei + exp( dj) exp( di)+exp( dj)ej where di = d(x, m i ). F.4. Gen Mix in the hidden feature space As illustrated in Fig. 11, the suggested Gen Mix can be also defined in the hidden feature space. Below we describe the details of using Gen Mix in the hidden space. Let frobust be the robust feature extractor suggested in (Engstrom et al., 2019). Note that this feature extractor is approximately invertible, i.e., the input data x can be well estimated by the representation z = frobust(x) in the feature space. We first apply Gen Mix in the feature space to find the middle features zmid satisfying pi(zmid) pj(zmid) for target classes i, j. Then, using the invertibility of frobust, we compute xmid = f 1 robust(zmid) = arg min x zmid frobust(x) . Afterwards, we define the augmented dataset as Drobust = D {(xmid, ymid)}, where ymid = αei + (1 α)ej for α = pi(zmid) pi(zmid)+pj(zmid). 
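The two operations needed for this hidden-feature variant, inverting the robust feature extractor and soft-labeling the recovered input, can be sketched as follows. This is an illustrative sketch under the assumption that frobust is differentiable and approximately invertible (as in Engstrom et al., 2019); the function names and optimization hyperparameters are placeholders.

```python
import torch

def invert_feature(f_robust, z_mid, x_shape, steps=300, lr=0.1):
    """x_mid = argmin_x || z_mid - f_robust(x) ||_2, solved by gradient descent."""
    x = torch.randn(x_shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.norm(z_mid - f_robust(x))
        loss.backward()
        opt.step()
    return x.detach()

def soft_label(p_i, p_j, i, j, num_classes):
    """y_mid = alpha * e_i + (1 - alpha) * e_j with alpha = p_i / (p_i + p_j)."""
    alpha = p_i / (p_i + p_j)
    y = torch.zeros(num_classes)
    y[i], y[j] = alpha, 1.0 - alpha
    return y
```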
Figure 12: Result of Gen Mix+GAN on the V, Ket, and Y datasets. (a): Training data (black: X1, blue: X2, magenta: X3); (b): Augmented data Xmixup including the middle points (red, yellow, cyan); (c): Decision boundaries (DBs) of Gen Mix+GAN. Each region with the same color represents the set of points classified as the same class.

F.5. Experimental results on Gen Mix

We evaluate the generalization and robustness performance of Gen Mix+GAN, Gen Mix+GM, and existing algorithms. We test on synthetic datasets (Circle and Moon from scikit-learn (Pedregosa et al., 2011), and the V, Ket, and Y datasets designed by us) and a real dataset (MNIST with digits 7 and 9). The V, Ket, and Y datasets are illustrated in Fig. 12a. We compare our schemes with mixup (Zhang et al., 2018) and manifold-mixup (Verma et al., 2019).

F.5.1. GENMIX ENJOYS LARGE MARGINS

Fig. 12 shows the result of Gen Mix+GAN on the three synthetic datasets. Here, we set the mixing coefficient to λ = 0.5, so that Gen Mix generates mixup data that are equiprobable with respect to the two target classes. One can confirm that these equiprobable points help the trained model enjoy large margins on all datasets.

In Fig. 13, we visualize the suggested mixup points and the model trained with the suggested data augmentation on various synthetic datasets, and compare them with those found by vanilla mixup. Here, we set the mixing coefficient λ = 0.5, meaning that the suggested mixup points are equally likely to be sampled from the two target classes. First, we show the result for a 2D Gaussian dataset with 4 classes, where each data point in class c is sampled from a Gaussian distribution N(µc, Σc). The Gaussian mixture (GM) model trivially fits this data well, so we use GM to estimate pc(x) for this dataset. The middle points xmix generated by the suggested mixup are illustrated in (a). Note that the mid points lie in the equiprobable region for each class pair. Here, the suggested mixup learns not to mix class-1 data (red) and class-2 data (blue), since mixing these classes incurs manifold intrusion. In (b) and (c), we show the decision boundaries found by the suggested mixup and vanilla mixup, respectively. One can see that the suggested mixup, which makes use of the underlying distribution to generate proper middle points, achieves large margins for all classes. On the other hand, standard mixup interpolates samples without considering the overall data distribution, resulting in smaller margins around the class-0 data.

Second, we show the results for the Circle and Moon datasets defined in (Pedregosa et al., 2011). Since the Gaussian mixture model is not suitable for these datasets, we use GANs to estimate the underlying distribution pc(x). As described in the discussion section on applying Gen Label to implicit density models, we invert the GAN and use the projected distance as a proxy for the negative log-likelihood. In panel (a) of the Circle and Moon results, the mixed points satisfying p0(xmix) = p1(xmix) are colored red; they indeed lie in the middle of the two manifolds (black and blue). Using these mixed points, the decision boundary has a larger margin than with vanilla mixup, as shown in (b) and (c). Note that in Fig. 13 we used the L2 norm for generating middle points on the Moon dataset, but we can also generate middle points under the L1 or L∞ norms. Fig. 14 illustrates the mixup points generated for the Moon dataset when the L1, L2, and L∞ distance metrics are used.
Here, we set the mixing coefficient to λ = 0.5, i.e., the goal is to find points equidistant to the two target manifolds. From the figures, we can conclude that Gen Mix+GAN successfully finds points that are equidistant to both manifolds under the various Lp distance settings.

Figure 13: Comparison between generative-model-based mixup (the suggested mixup) and vanilla mixup on 2D datasets. For the 2D Gaussian data, classes 0, 1, 2, 3 are colored violet, red, blue, and yellow, respectively. For the Circle and Moon datasets, classes 0 and 1 are colored black and blue, respectively, and the middle points obtained by the suggested mixup are colored red. For each dataset, we show three results: (a) the mid points generated by the suggested mixup, and the decision boundaries of (b) the suggested mixup and (c) vanilla mixup. In (a), one can confirm that the mid points of the suggested mixup lie in the equiprobable region for the target class pair. As shown in (b) and (c), the suggested mixup enjoys larger margins for all classes than vanilla mixup.

Figure 14: Illustration of data points generated by Gen Mix+GAN under the (a) L1, (b) L2, and (c) L∞ norms. Black and blue points correspond to the two classes. The red points represent the mixup points generated by Algorithm 4.

F.5.2. GENMIX HELPS GENERALIZATION

Here we compare Gen Mix with mixup and manifold-mixup in terms of generalization performance. Table 15 compares the performance on the Circle and MNIST datasets. For MNIST, we use binary classification of digits 7 and 9 with only ntrain = 500 samples per class, to highlight a scenario with a large gap between Gen Mix and the existing schemes. One can confirm that Gen Mix+GAN strictly outperforms the other data augmentation schemes in terms of generalization. This shows that generalization performance varies significantly depending on how the middle points are generated (i.e., how the data are mixed): Gen Mix outperforms conventional ways of mixing data by making use of the underlying data distribution learned by generative models.

F.5.3. GENMIX IN THE HIDDEN FEATURE SPACE

Recall that in Section F.4 we suggested Gen Mix in the hidden feature space. Fig. 15 shows the result of Gen Mix+GM applied in the hidden feature space, tested on the CIFAR-10 dataset. Note that each generated image contains features of both classes c1, c2 given in its caption, showing that the mid features obtained by the suggested mixup indeed lie in between the target class manifolds.

F.6. Reducing the computational complexity of Gen Mix+GAN

Here we discuss methods for reducing the complexity of Gen Mix+GAN, which relies on inverting the GAN generator. We can avoid this iterative inversion by using alternative GAN architectures that simultaneously learn the inverse mapping during training, e.g., bidirectional GAN (Donahue et al., 2017) and ALI (Dumoulin et al., 2017). One can also consider using flow-based generative models, e.g., (Kingma & Dhariwal, 2018).
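As a rough illustration of this shortcut, the sketch below replaces the inner inversion loop $\min_{z} d(x, G(z, c))$ with a single forward pass through a jointly trained encoder, as one would have with a bidirectional GAN or ALI. The interface G(z, c) and E(x) is assumed for illustration and is not the paper's implementation.

```python
import torch

def distance_to_manifold_via_encoder(x, c, G, E, p=2):
    """Approximate d(x, M_c) in a single forward pass: instead of solving
    min_z || x - G(z, c) || by iterative generator inversion, use the encoder E
    learned jointly with G (as in BiGAN / ALI) as an approximate inverse map."""
    z_hat = E(x)          # approximate latent code for x
    m_hat = G(z_hat, c)   # approximate projection of x onto the class-c manifold M_c
    return torch.norm((x - m_hat).flatten(), p=p)
```

The same one-pass distance can then be plugged into the relabeling formula $y = \mathrm{softmax}(-d_1, \ldots, -d_k)$ wherever the iterative projection was used above.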
Table 15: Classification errors (%, mean ± std). Gen Mix+GAN has better generalization performance than the other schemes.

  Schemes / Datasets   Circle (2D)    Circle (3D)    MNIST 7/9 (ntrain = 500)
  Vanilla Training     8.60 ± 4.84    1.40 ± 0.54    2.72 ± 0.20
  Mixup                7.98 ± 2.94    5.22 ± 1.99    2.32 ± 0.40
  Manifold-mixup       7.34 ± 1.43    0.94 ± 0.75    3.88 ± 0.53
  Gen Mix+GAN          4.90 ± 0.12    0.22 ± 0.06    2.13 ± 0.12

Figure 15: Generative model-based mixup in the hidden feature space for CIFAR-10. The class pairs (c1, c2) of the generated images are horse/ship, horse/cat, frog/car, deer/car, cat/frog, and bird/car. Each image xmix contains features of both classes c1 and c2, supporting that it lies in between the manifolds of the target classes.