# On Mixup Regularization

Journal of Machine Learning Research 23 (2022) 1-31. Submitted 12/20; Revised 5/22; Published 10/22.

Luigi Carratino, luigi.carratino@dibris.unige.it, MaLGa - University of Genova, Italy
Moustapha Cissé, moustaphacisse@google.com, Google Research - Brain team, Accra
Rodolphe Jenatton, rjenatton@google.com, Google Research - Brain team, Berlin
Jean-Philippe Vert, jean-philippe.vert@m4x.org, Google Research - Brain team, Paris

Author notes: This work was done in part at Google Brain, Paris. Now at Owkin, Paris, France.

Editor: Simon Lacoste-Julien

©2022 Luigi Carratino, Moustapha Cissé, Rodolphe Jenatton and Jean-Philippe Vert. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/20-1385.html.

Abstract

Mixup is a data augmentation technique that creates new examples as convex combinations of training points and labels. This simple technique has been shown empirically to improve the accuracy of many state-of-the-art models in different settings and applications, but the reasons behind this empirical success remain poorly understood. In this paper we take a substantial step towards explaining the theoretical foundations of Mixup, by clarifying its regularization effects. We show that Mixup can be interpreted as a standard empirical risk minimization estimator subject to a combination of data transformation and random perturbation of the transformed data. We gain two core insights from this new interpretation. First, the data transformation suggests that, at test time, a model trained with Mixup should also be applied to transformed data, a one-line change in code that we show empirically to improve both the accuracy and the calibration of the predictions. Second, we show how the random perturbation of the new interpretation of Mixup induces multiple known regularization schemes, including label smoothing and reduction of the Lipschitz constant of the estimator. These schemes interact synergistically with each other, resulting in a self-calibrated and effective regularization effect that prevents overfitting and overconfident predictions. We corroborate our theoretical analysis with experiments that support our conclusions.

Keywords: regularization, generalization, approximation, deep learning, mixup

1. Introduction

Regularization is an essential component of machine learning models and plays an even more important role in deep learning (Goodfellow et al., 2016). Regularization mechanisms can take various forms. They can be explicitly enforced by: (i) applying various penalties to the parameters of the models (Hinton, 1987; Krogh and Hertz, 1991; Bartlett et al., 2017; Neyshabur et al., 2015; Sedghi et al., 2019; Arjovsky et al., 2017), (ii) injecting noise into the internal representations of the network (Srivastava et al., 2014; Gal and Ghahramani, 2016) and/or into its outputs (Szegedy et al., 2016; Müller et al., 2019), or (iii) normalizing the activations (He et al., 2016; Salimans and Kingma, 2016). Or they can be implicit, thanks to: (j) parameter sharing in architectures such as convolutional networks (LeCun et al., 1998), (jj) the choice of the optimization algorithm (Neyshabur, 2017), e.g., stochastic gradient descent converging to small-norm solutions (Arora et al., 2019), or (jjj) data augmentation and transformation (Goodfellow et al., 2016).
There is a large body of work explaining the effects of the numerous explicit and implicit regularization procedures existing in the literature. For instance, explicit regularization schemes usually proceed from analyses aiming to control specific characteristics of a model such as robustness (Hein and Andriushchenko, 2017; Cissé et al., 2017) or calibration (Guo et al., 2017; Müller et al., 2019), while the forms of implicit regularization are often understood through the angle of generalization (Neyshabur, 2017; Arora et al., 2019). However, the regularization effects of modern data augmentation procedures are less theoretically understood. Data augmentation is a core ingredient of successful deep learning pipelines: it helps to alleviate sample size issues and to prevent overfitting. In simple cases, there are known equivalences between data augmentation and other existing explicit regularization procedures, e.g., training with additional noisy points in least-squares regression is equivalent to Tikhonov regularization (Bishop, 1995). Similar analyses have recently been performed to explain the regularization effect of dropout (Srivastava et al., 2014; Wager et al., 2013; Wei et al., 2020).

In this work, we focus on Mixup (Zhang et al., 2018; Tokozume et al., 2018), a recently introduced data augmentation technique that consists in generating examples as random convex combinations of data points and labels from the training set (as illustrated in Figure 1). Despite its simplicity, Mixup has been shown to substantially improve generalization on a broad range of tasks, ranging from computer vision (Zhang et al., 2018; Tokozume et al., 2018) to natural language processing (Guo, 2020) and semi-supervised learning (Berthelot et al., 2019). The success of Mixup has triggered several variations such as adaptive Mixup (Guo et al., 2019), manifold Mixup (Verma et al., 2019) and CutMix (Yun et al., 2019), but the reasons why Mixup and its variants work so well in practice remain poorly understood.

Mixup's primary motivation was to alleviate overfitting when training deep neural networks (Zhang et al., 2018). However, previous studies have also empirically noticed other desirable regularization effects it induces. These include improved calibration (Thulasidasan et al., 2019), robustness to input adversarial noise (Zhang et al., 2018), and robustness to label corruption (Zhang et al., 2018). Zhang et al. (2018) also showed that it helps stabilize notoriously difficult learning problems such as generative adversarial networks. Traditionally, separate regularization methods are applied to induce these effects. For example, label smoothing (Szegedy et al., 2016; Müller et al., 2019) leads to better calibration, while dropout improves generalization (Srivastava et al., 2014; Wager et al., 2013) and robustness to label corruption (Arpit et al., 2017). Lipschitz regularization helps stabilize the training of generative adversarial networks (Arjovsky et al., 2017; Gulrajani et al., 2017); it also leads to increased robustness to adversarial perturbations (Hein and Andriushchenko, 2017; Cissé et al., 2017). Table 1 shows a comparison of various regularization procedures proposed in the literature, and the effects they are known to induce on the model. Although all these desirable regularization effects of Mixup have been observed empirically, no theoretical explanation has been given yet.
Fig. 1 (panels, left to right: Original, Modified, Mixup, Zoom): Illustration of how training a model with Mixup (second plot) differs from training a model on the original data (first plot); the fourth plot highlights the discrepancy between the Bayes classifiers in both situations (black vs red). To explain this difference, we show in this paper that the model trained with Mixup can be interpreted as a regularized version of a model trained on modified data (third plot, blue curve on the zoom plot), and we characterize both the data modification (from black to blue) and the regularization effect (from blue to red). Both effects interact synergistically to confer on Mixup strong regularization properties, which may explain its good empirical behavior in a variety of tasks.

In this work, we propose the first theoretical analysis of Mixup¹ to better understand the reasons for its empirical success. We show that Mixup can be analyzed through the lens of empirical risk minimization with random perturbations, and exploit ingredients from previous analyses of dropout (Wager et al., 2013; Khalfaoui et al., 2019; Wei et al., 2020) to derive a regularized objective function that sharply captures the regularization effects of Mixup. In particular, our analysis sheds some light on the multiple effects that Mixup borrows from the popular regularization mechanisms listed above, such as label smoothing (Pereyra et al., 2017) (output noise) or dropout (Srivastava et al., 2014) (input noise), and on how it uniquely combines them to improve calibration and smooth the Jacobian of the model. We further show that this analysis points out a missing step in learning with Mixup, and we present how applying a simple transformation when evaluating at test time the function learned with Mixup can improve accuracy and calibration.

¹ After we published a first version of this work (Carratino et al., 2020), Zhang et al. (2021) independently derived a similar and complementary analysis of Mixup; we summarize in Section 5 the main differences between the two works.

More precisely, we make the following contributions (illustrated in Figure 1):

- We show that Mixup can be reinterpreted as a standard empirical risk minimization procedure, applied to a transformation of the original data perturbed by random perturbations, and we give explicit formulas for the data transformation and the perturbations. In particular, we show that the Mixup transformation shrinks both the inputs and the outputs towards their mean, the latter creating a form of regularization by label smoothing. We notably give a formal description of the effect of label smoothing in the case of the cross-entropy loss, where it translates into an increase in the entropy of the predictions.
- We show that Mixup learns functions from a modified version of the input space of the training points to a modified version of the output space of the training points. We thus present how to properly evaluate the learned functions so as to further improve accuracy and calibration.
- We characterize the random perturbations induced by Mixup on both the inputs and the outputs, as well as their dependency and their correlation structure.
- We deduce an approximation of the regularization induced by Mixup, and highlight in particular how it regularizes both the model and its derivatives.
- We discuss in detail the specific cases of classification with the cross-entropy loss, and of least squares regression.
- We provide empirical support for our interpretation of Mixup regularization.
| Method | Calibration | Jacobian Reg. | Robustness Label Noise | Input Normaliz. |
|---|---|---|---|---|
| Label smooth. (Szegedy et al., 2016) | ✓ | | | |
| Spectral Reg. (Cissé et al., 2017) | | ✓ | | |
| Dropout (Wager et al., 2013) | | ✓ | ✓ | |
| Temperat. scaling (Guo et al., 2017) | ✓ | | | |
| Mixup (Zhang et al., 2018) | ✓ | ✓ | ✓ | ✓ |

Table 1: Summary of the effects induced by various regularizers. The absence of a checkmark means the corresponding effect is not known for this regularizer.

The rest of the paper is organized as follows. In Section 2, we introduce the notations used throughout the paper and describe the setting of empirical risk minimization and learning with Mixup. In Section 3, we show how Mixup can be interpreted as an empirical risk minimization on modified data with random perturbations. In Section 4, we analyze the regularization effect of Mixup through a quadratic Taylor approximation of the formulation derived in Section 3. In Section 5, we discuss in detail several aspects of Mixup that the theoretical analyses in Sections 3 and 4 suggest, and confront them with experimental validation. The proofs of all results are detailed in the Appendix, together with additional experimental results.

2. Notations and setting

Notations. For any $n \in \mathbb{N}$, $[n] = \{1, \ldots, n\}$ is the set of nonzero integers up to $n$, $1_n \in \mathbb{R}^n$ is the $n$-dimensional vector of ones, and $0_n$ and $I_n \in \mathbb{R}^{n\times n}$ are the $n$-dimensional null and identity matrices, respectively. For any two matrices $Z, Z'$ of equal size we note $\langle Z, Z'\rangle = \mathrm{Trace}(Z^\top Z')$ their Frobenius inner product, and $\|Z\|_F = \sqrt{\langle Z, Z\rangle}$ the Frobenius norm. For any vector $x \in \mathbb{R}^n$, matrix $M \in \mathbb{R}^{n\times m}$ and positive semi-definite matrix $Z \in \mathbb{R}^{n\times n}$, we denote by $\|x\|_Z^2 = x^\top Z x$ the squared semi-norm of $x$ with metric $Z$, and by $\|M\|_Z^2 = \langle M, ZM\rangle = \mathrm{Trace}(M^\top Z M)$ the squared Frobenius norm with metric $Z$. For any function $f : \mathbb{R}^a \to \mathbb{R}^b$ and vector $x \in \mathbb{R}^a$, we denote respectively by $\nabla f(x) \in \mathbb{R}^{b\times a}$ and $\nabla^2 f(x) \in \mathbb{R}^{b\times a\times a}$ the Jacobian and Hessian of $f$ at $x$, i.e., if $f(x) = (f_1(x_1,\ldots,x_a), \ldots, f_b(x_1,\ldots,x_a))$, then $[\nabla f(x)]_{i,j} = \partial f_i/\partial x_j(x)$ and $[\nabla^2 f(x)]_{i,j,k} = \partial^2 f_i/\partial x_j\partial x_k(x)$, for $(i,j,k) \in [b]\times[a]\times[a]$. Note in particular that if $f : \mathbb{R}^a \to \mathbb{R}$, then the gradient of $f$ is a row vector $\nabla f(x) \in \mathbb{R}^{1\times a}$. When $f$ has several arguments and we wish to take partial derivatives with respect to some of the arguments, we explicitly name the different arguments as $f(u,v)$ and then indicate as a subscript to the $\nabla$ sign the argument(s) according to which we take derivatives, e.g., if $u \in \mathbb{R}^{a_u}$ and $v \in \mathbb{R}^{a_v}$, then $\nabla_u f(u,v) \in \mathbb{R}^{b\times a_u}$ is the Jacobian of $f$ with respect to $u$, and $\nabla^2_{uv} f \in \mathbb{R}^{b\times a_u\times a_v}$ is the tensor of second derivatives of $f$ of the form $[\nabla^2_{uv} f(u,v)]_{i,j,k} = \partial^2 f_i/\partial u_j\partial v_k(u,v)$ for $(i,j,k) \in [b]\times[a_u]\times[a_v]$. We recall that if $f : \mathbb{R}^{a_u+a_v} \to \mathbb{R}$ is twice continuously differentiable, then $\nabla^2_{uv} f = (\nabla^2_{vu} f)^\top$ by Schwarz's theorem. For any random variable $X$ and measurable function $f$, we denote by $\mathbb{E}_X f(X)$ the expectation of $f(X)$, or simply $\mathbb{E} f(X)$ when no confusion is possible. For any shape parameters $\alpha, \beta > 0$ and any interval $[a,b] \subset [0,1]$, $\mathrm{Beta}_{[a,b]}(\alpha,\beta)$ denotes the truncated Beta distribution on $[a,b]$, i.e., the distribution of a random variable with values in $[a,b]$ and density proportional to $x^{\alpha-1}(1-x)^{\beta-1}$ on $[a,b]$. We simply write $\mathrm{Beta}(\alpha,\beta) = \mathrm{Beta}_{[0,1]}(\alpha,\beta)$ for the usual Beta distribution. For any $p \in [0,1]$, $\mathrm{Ber}(p)$ denotes the Bernoulli distribution with parameter $p$. For any $c \in \mathbb{N}$, we denote by $\triangle_c = \{u \in \mathbb{R}^c : u^\top 1_c = 1 \text{ and } u_j \ge 0 \text{ for } j \in [c]\}$ the simplex in $\mathbb{R}^c$, and for any $p \in \triangle_c$, we denote by $Z(p) = -\sum_{j=1}^c p_j \log(p_j)$ the entropy of a categorical distribution with parameter $p$.
Learning problem. We consider a training set $S_n = \{(x_1,y_1), \ldots, (x_n,y_n)\}$ made of $n$ input/output pairs, where for each pair $i \in [n]$, $x_i \in \mathcal{X} \subset \mathbb{R}^d$ and $y_i \in \mathcal{Y} \subset \mathbb{R}^c$. This covers in particular the regression or binary classification settings, where $c = 1$, or the multivariate regression and multiclass classification settings, where $y_i$ is an embedding of the class of $x_i$ in $\mathbb{R}^c$, e.g., the one-hot encoding obtained by taking $c$ equal to the total number of classes and letting $y_i \in \{0,1\}^c$ be the binary vector with all entries equal to zero except for the one corresponding to the class of $x_i$. We further denote the mean input and output as
$$\bar x = \frac{1}{n}\sum_{i=1}^n x_i\,,\qquad \bar y = \frac{1}{n}\sum_{i=1}^n y_i\,,$$
and the empirical variance and covariance matrices of inputs and outputs as
$$\Sigma_{xx} = \frac{1}{n}\sum_{i=1}^n (x_i-\bar x)(x_i-\bar x)^\top\,,\quad \Sigma_{xy} = \frac{1}{n}\sum_{i=1}^n (x_i-\bar x)(y_i-\bar y)^\top\,,\quad \Sigma_{yy} = \frac{1}{n}\sum_{i=1}^n (y_i-\bar y)(y_i-\bar y)^\top\,.$$
Our goal is to learn from $S_n$ a function $f : \mathcal{X} \to \mathbb{R}^c$ to predict the output corresponding to any new input $x \in \mathcal{X}$ via $\rho(f(x))$, where $\rho : \mathbb{R}^c \to \mathcal{Y}$ maps an $\mathbb{R}^c$-valued prediction to an element of $\mathcal{Y}$; standard mappings include the identity $\rho(y) = y$ for regression problems, and the softmax operator $\rho(y)_i = e^{y_i}/\sum_{j=1}^c e^{y_j}$ for multiclass classification problems. For that purpose, we formulate the inference problem as an optimization problem:
$$\min_{f\in\mathcal{H}}\ \mathcal{E}(f)\,, \tag{1}$$
where $\mathcal{H}$ is a class of candidate functions, such as linear functions or deep neural networks, and $\mathcal{E}(f)$ is a risk functional that depends on $S_n$. The most standard risk used in machine learning is the empirical risk, defined for any loss function $\ell : \mathcal{Y}\times\mathbb{R}^c \to \mathbb{R}$ by:
$$\mathcal{E}^{\text{Empirical}}(f) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i))\,. \tag{2}$$
Solving (1) with the empirical risk (2) is often called empirical risk minimization (ERM), and is typically performed in practice by first-order numerical optimization such as stochastic gradient descent (Bottou and Bousquet, 2008). Standard losses $\ell$ include the squared error (in regression) and the cross-entropy loss applied to the softmax mapping (in classification, assuming that $\forall y \in \mathcal{Y}$, $y^\top 1_c = 1$, which is true for one-hot encoded classes and their convex combinations):
$$\forall (y,u) \in \mathcal{Y}\times\mathbb{R}^c\,,\qquad \ell^{SE}(y,u) = \frac{1}{2}\,\|y-u\|^2\,,\qquad \ell^{CE}(y,u) = -\sum_{j=1}^c y_j \log\frac{e^{u_j}}{\sum_{k=1}^c e^{u_k}}\,. \tag{3}$$

Mixup. Instead of minimizing the empirical risk (2), Mixup (Zhang et al., 2018) creates new random input/output samples by taking convex combinations of pairs of training samples, and minimizes the corresponding empirical risk. With our notations, Mixup therefore minimizes the following Mixup risk over $f \in \mathcal{H}$:
$$\mathcal{E}^{\text{Mixup}}(f) = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \mathbb{E}_\lambda\, \ell\big(\lambda y_i + (1-\lambda)y_j,\ f(\lambda x_i + (1-\lambda)x_j)\big)\,, \tag{4}$$
where $\lambda \sim \mathrm{Beta}(\alpha,\alpha)$, and $\alpha$ is a parameter of Mixup. The minimization of (4) is typically performed by stochastic gradient descent, where $\lambda$ is sampled at each iteration to obtain a stochastic gradient. In practice, Zhang et al. (2018) suggest to sample minibatches of training pairs and to generate Mixup random pairs within the minibatch, which also produces a stochastic gradient of (4).
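To make the minibatch procedure concrete, here is a minimal sketch of one Mixup training step. It is our illustration rather than code from the original paper, and it assumes a generic PyTorch classifier `model`, an `optimizer`, and one-hot labels; the soft-target cross-entropy mirrors $\ell^{CE}$ in (3).

```python
import numpy as np
import torch

def mixup_step(model, optimizer, x, y, alpha):
    # x: (b, d) inputs; y: (b, c) one-hot labels; alpha: Mixup parameter
    lam = float(np.random.beta(alpha, alpha))   # lambda ~ Beta(alpha, alpha)
    perm = torch.randperm(x.size(0))            # random pairing within the batch
    x_mix = lam * x + (1. - lam) * x[perm]      # lambda * x_i + (1 - lambda) * x_j
    y_mix = lam * y + (1. - lam) * y[perm]      # lambda * y_i + (1 - lambda) * y_j
    # cross-entropy between the mixed labels and the softmax of the logits
    loss = -(y_mix * torch.log_softmax(model(x_mix), dim=1)).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Sampling a single λ per minibatch and pairing examples within the batch, as above, follows the practical recipe of Zhang et al. (2018) and yields a stochastic gradient of (4).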
3. Mixup as a perturbed ERM

The Mixup risk (4) is defined as a sum over pairs of samples, which makes a direct comparison with the standard ERM approach (2) difficult. The following result shows that the Mixup risk can be equivalently rewritten as a standard empirical risk over modified input/output pairs (as in the third plot of Figure 1), subject to random perturbations.

Theorem 1. Let $\theta \sim \mathrm{Beta}_{[\frac{1}{2},1]}(\alpha,\alpha)$ and $j \sim \mathrm{Unif}([n])$ be two independent random variables, with $\alpha > 0$ and $n > 0$, and let $\bar\theta = \mathbb{E}_\theta\,\theta$. For any training set $S_n$, let $(\tilde x_i, \tilde y_i)$ for any $i \in [n]$ be the modified input/output pair given by
$$\tilde x_i = \bar x + \bar\theta\,(x_i - \bar x)\,,\qquad \tilde y_i = \bar y + \bar\theta\,(y_i - \bar y)\,, \tag{5}$$
and $(\delta_i, \varepsilon_i)$ be the random perturbations given by:
$$\delta_i = (\theta - \bar\theta)\,x_i + (1-\theta)\,x_j - (1-\bar\theta)\,\bar x\,,\qquad \varepsilon_i = (\theta - \bar\theta)\,y_i + (1-\theta)\,y_j - (1-\bar\theta)\,\bar y\,. \tag{6}$$
Then for any $i \in [n]$, $\mathbb{E}_{\theta,j}\,\delta_i = \mathbb{E}_{\theta,j}\,\varepsilon_i = 0$, and for any function $f \in \mathcal{H}$,
$$\mathcal{E}^{\text{Mixup}}(f) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\theta,j}\,\ell\big(\tilde y_i + \varepsilon_i,\ f(\tilde x_i + \delta_i)\big)\,. \tag{7}$$

Both $\delta_i$ and $\varepsilon_i$ are random vectors because they are functions of $\theta$ and $j$ in (6), which are themselves random variables; we hence use the notation $\mathbb{E}_{\theta,j}$ in (7). Note also that $\bar\theta \in [1/2, 1]$, meaning that the transformation from $(x_i, y_i)$ to $(\tilde x_i, \tilde y_i)$ in (5) shrinks the inputs and the outputs towards their mean.

Theorem 1 and the expression (7) of the Mixup risk allow us to reinterpret Mixup as a combination of two standard techniques: (i) transforming each input/output pair $(x_i, y_i)$ into $(\tilde x_i, \tilde y_i)$, and (ii) adding zero-mean random perturbations $(\delta_i, \varepsilon_i)$ to each transformed pair, before minimizing the empirical risk. This helps us understand the effects of training a model with Mixup by studying each technique and their interaction. In particular, perturbing input data is a classical approach to regularize ERM estimators (Bishop, 1995; Srivastava et al., 2014; Wager et al., 2013; Wei et al., 2020), and we study in detail in the next section the particular regularization induced by the Mixup perturbations on both inputs and outputs, before interpreting the resulting regularization aspects of Mixup due to both data transformation and perturbation in Section 5.

4. The regularization effects of Mixup

We now study the effect of the random perturbations $(\delta_i, \varepsilon_i)$ for $i \in [n]$ in the Mixup risk (7). While perturbing inputs with additive or multiplicative noise (e.g., dropout), and independently perturbing outputs (resulting, e.g., in label smoothing) have been widely studied, the Mixup perturbation (7) is unique in the sense that it is applied to both inputs and outputs simultaneously, and that the input and output perturbations are not independent from each other by (6). In order to study the regularization effect of these perturbations, we first characterize the covariance structure among the input and output perturbations.

Lemma 2. Let $\bar\theta$ and $\sigma^2$ be respectively the mean and variance of a $\mathrm{Beta}_{[\frac{1}{2},1]}(\alpha,\alpha)$-distributed random variable, and let $\gamma^2 = \sigma^2 + (1-\bar\theta)^2$. For any $i \in [n]$, let
$$\Sigma^{(i)}_{\tilde x\tilde x} = \frac{\sigma^2}{\bar\theta^2}(\tilde x_i - \bar x)(\tilde x_i - \bar x)^\top + \frac{\gamma^2}{\bar\theta^2}\,\Sigma_{\tilde x\tilde x}\,,\quad
\Sigma^{(i)}_{\tilde y\tilde y} = \frac{\sigma^2}{\bar\theta^2}(\tilde y_i - \bar y)(\tilde y_i - \bar y)^\top + \frac{\gamma^2}{\bar\theta^2}\,\Sigma_{\tilde y\tilde y}\,,\quad
\Sigma^{(i)}_{\tilde x\tilde y} = \frac{\sigma^2}{\bar\theta^2}(\tilde x_i - \bar x)(\tilde y_i - \bar y)^\top + \frac{\gamma^2}{\bar\theta^2}\,\Sigma_{\tilde x\tilde y}\,. \tag{8}$$
Then, for any $i \in [n]$, the random perturbations defined in (6) satisfy
$$\mathbb{E}_{\theta,j}\,\delta_i\delta_i^\top = \Sigma^{(i)}_{\tilde x\tilde x}\,,\qquad \mathbb{E}_{\theta,j}\,\varepsilon_i\varepsilon_i^\top = \Sigma^{(i)}_{\tilde y\tilde y}\,,\qquad \mathbb{E}_{\theta,j}\,\delta_i\varepsilon_i^\top = \Sigma^{(i)}_{\tilde x\tilde y}\,. \tag{9}$$

Following recent lines of work that interpret various random perturbations such as dropout as regularization (Wager et al., 2013; Wei et al., 2020), we can now introduce and study an approximate Mixup risk:
$$\mathcal{E}^{\text{Mixup}}_Q(f) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\theta,j}\,\ell^{(i)}_Q\big(\tilde y_i + \varepsilon_i,\ f(\tilde x_i + \delta_i)\big)\,, \tag{10}$$
obtained by replacing the loss function $\ell(\tilde y, f(\tilde x))$ by a second-order quadratic Taylor approximation near each modified input/output training pair $(\tilde x_i, \tilde y_i)$, namely, for any $i \in [n]$ and $(\delta, \varepsilon) \in \mathcal{X}\times\mathcal{Y}$:
$$\begin{aligned} \ell^{(i)}_Q\big(\tilde y_i + \varepsilon,\ f(\tilde x_i + \delta)\big) = \ \ell(\tilde y_i, f(\tilde x_i)) &+ \nabla_y\ell(\tilde y_i, f(\tilde x_i))\,\varepsilon + \nabla_u\ell(\tilde y_i, f(\tilde x_i))\,\nabla f(\tilde x_i)\,\delta \\ &+ \frac{1}{2}\Big\langle \delta\delta^\top,\ \nabla f(\tilde x_i)^\top\nabla^2_{uu}\ell(\tilde y_i, f(\tilde x_i))\,\nabla f(\tilde x_i) + \nabla_u\ell(\tilde y_i, f(\tilde x_i))\,\nabla^2 f(\tilde x_i)\Big\rangle \\ &+ \frac{1}{2}\Big\langle \varepsilon\varepsilon^\top,\ \nabla^2_{yy}\ell(\tilde y_i, f(\tilde x_i))\Big\rangle + \Big\langle \varepsilon\delta^\top,\ \nabla^2_{yu}\ell(\tilde y_i, f(\tilde x_i))\,\nabla f(\tilde x_i)\Big\rangle\,, \end{aligned} \tag{11}$$
assuming both $\ell$ and $f$ are twice continuously differentiable. Due to its quadratic form as a function of the input and output perturbations, the approximate Mixup risk (10) can be re-expressed as a regularized ERM risk, as shown in the next result. We note that the expression we derive is in fact valid for any joint perturbation of the inputs and outputs with the covariance structure given in (9).
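Before stating the result, Lemma 2 lends itself to a direct numerical sanity check. The sketch below is ours (all variable names are illustrative): it folds draws of $\lambda \sim \mathrm{Beta}(\alpha,\alpha)$ onto $[1/2,1]$ to sample θ, builds $\delta_i$ as in (6), and compares its Monte Carlo second moment with $\sigma^2(x_i-\bar x)(x_i-\bar x)^\top + \gamma^2\Sigma_{xx}$, which is the expression of $\Sigma^{(i)}_{\tilde x\tilde x}$ in the original coordinates (cf. the proof in Appendix A.2).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha, m = 40, 3, 0.5, 500_000
X = rng.normal(size=(n, d))
x_bar = X.mean(axis=0)
Sigma_xx = (X - x_bar).T @ (X - x_bar) / n      # empirical covariance of inputs

lam = rng.beta(alpha, alpha, size=m)
theta = np.maximum(lam, 1. - lam)               # theta ~ Beta_[1/2,1](alpha, alpha)
tb, s2 = theta.mean(), theta.var()              # theta_bar and sigma^2
g2 = s2 + (1. - tb) ** 2                        # gamma^2 = sigma^2 + (1 - theta_bar)^2

i = 0
j = rng.integers(n, size=m)                     # j ~ Unif([n])
delta = ((theta - tb)[:, None] * X[i]           # delta_i from (6)
         + (1. - theta)[:, None] * X[j]
         - (1. - tb) * x_bar)
print(np.abs(delta.mean(axis=0)).max())         # ~ 0: E[delta_i] = 0 (Theorem 1)
emp = delta.T @ delta / m                       # Monte Carlo E[delta_i delta_i^T]
thr = s2 * np.outer(X[i] - x_bar, X[i] - x_bar) + g2 * Sigma_xx
print(np.abs(emp - thr).max())                  # small (Monte Carlo error only)
```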
Theorem 3. For any twice continuously differentiable loss $\ell(y,u)$, the approximate Mixup risk at any twice differentiable $f \in \mathcal{H}$ satisfies
$$\mathcal{E}^{\text{Mixup}}_Q(f) = \frac{1}{n}\sum_{i=1}^n \ell(\tilde y_i, f(\tilde x_i)) + R_1(f) + R_2(f) + R_3(f) + R_4(f)\,, \tag{12}$$
where
$$R_1(f) = \frac{1}{2n}\sum_{i=1}^n \Big\langle \Sigma^{(i)}_{\tilde x\tilde x},\ \big(\nabla f(\tilde x_i)-J^{(i)}\big)^\top\nabla^2_{uu}\ell(\tilde y_i, f(\tilde x_i))\,\big(\nabla f(\tilde x_i)-J^{(i)}\big)\Big\rangle\,,$$
$$R_2(f) = \frac{1}{2n}\sum_{i=1}^n \Big\langle \Sigma^{(i)}_{\tilde x\tilde x},\ \nabla_u\ell(\tilde y_i, f(\tilde x_i))\,\nabla^2 f(\tilde x_i)\Big\rangle\,,$$
$$R_3(f) = -\frac{1}{2n}\sum_{i=1}^n \Big\langle \big(\Sigma^{(i)}_{\tilde x\tilde x}\big)^{-1},\ \Sigma^{(i)}_{\tilde x\tilde y}\,\nabla^2_{yu}\ell(\tilde y_i, f(\tilde x_i))\,\nabla^2_{uu}\ell(\tilde y_i, f(\tilde x_i))^{-1}\,\nabla^2_{uy}\ell(\tilde y_i, f(\tilde x_i))\,\Sigma^{(i)}_{\tilde y\tilde x}\Big\rangle\,,$$
$$R_4(f) = \frac{1}{2n}\sum_{i=1}^n \Big\langle \Sigma^{(i)}_{\tilde y\tilde y},\ \nabla^2_{yy}\ell(\tilde y_i, f(\tilde x_i))\Big\rangle\,,$$
and, for $i \in [n]$,
$$J^{(i)} = -\nabla^2_{uu}\ell(\tilde y_i, f(\tilde x_i))^{-1}\,\nabla^2_{uy}\ell(\tilde y_i, f(\tilde x_i))\,\Sigma^{(i)}_{\tilde y\tilde x}\,\big(\Sigma^{(i)}_{\tilde x\tilde x}\big)^{-1}\,. \tag{13}$$

Theorem 3 captures the effect of the random perturbations in Mixup as a sum of four penalty terms $R_i(f)$ for $i \in [4]$. They regularize the simple ERM risk applied to the modified inputs $\tilde x_i$ and smoothed outputs $\tilde y_i$. Before discussing the accuracy and practical consequences of this reformulation of Mixup as regularized empirical risk minimization on modified data in the next section, we now derive the details of this approximation for the cross-entropy, logistic and squared error losses. We begin by presenting the results for the cross-entropy loss:

Corollary 4. Let $S : \mathbb{R}^c \to \mathbb{R}^c$ be the softmax operator, i.e., for any $i \in [c]$ and $u \in \mathbb{R}^c$, $S(u)_i = e^{u_i}/\sum_{j=1}^c e^{u_j}$, and let $H(u) = \mathrm{diag}(S(u)) - S(u)S(u)^\top \in \mathbb{R}^{c\times c}$. The approximate Mixup risk for the cross-entropy loss satisfies
$$\mathcal{E}^{\text{Mixup}}_Q(f) = \frac{1}{n}\sum_{i=1}^n \ell^{CE}(\tilde y_i, f(\tilde x_i)) + R^{CE}_1(f) + R^{CE}_2(f) + R^{CE}_3(f)\,,$$
where
$$R^{CE}_1(f) = \frac{1}{2n}\sum_{i=1}^n \Big\langle \Sigma^{(i)}_{\tilde x\tilde x},\ \big(\nabla f(\tilde x_i)-J^{(i)}\big)^\top H(f(\tilde x_i))\,\big(\nabla f(\tilde x_i)-J^{(i)}\big)\Big\rangle\,,$$
$$R^{CE}_2(f) = \frac{1}{2n}\sum_{i=1}^n \Big\langle \Sigma^{(i)}_{\tilde x\tilde x},\ \big(S(f(\tilde x_i)) - \tilde y_i\big)^\top\nabla^2 f(\tilde x_i)\Big\rangle\,,$$
$$R^{CE}_3(f) = -\frac{1}{2n}\sum_{i=1}^n \Big\langle \big(\Sigma^{(i)}_{\tilde x\tilde x}\big)^{-1},\ \Sigma^{(i)}_{\tilde x\tilde y}\,H(f(\tilde x_i))^{-1}\,\Sigma^{(i)}_{\tilde y\tilde x}\Big\rangle\,,$$
with
$$J^{(i)} = H(f(\tilde x_i))^{-1}\,\Sigma^{(i)}_{\tilde y\tilde x}\,\big(\Sigma^{(i)}_{\tilde x\tilde x}\big)^{-1}\,,\quad i\in[n]\,. \tag{14}$$

In the binary classification setting, minimizing the empirical cross-entropy risk over $f : \mathcal{X} \to \mathbb{R}^2$ after one-hot encoding of the two possible classes in $\mathbb{R}^2$ as $(0,1)$ and $(1,0)$ is equivalent to minimizing the following well-known logistic loss over $f : \mathcal{X} \to \mathbb{R}$ after encoding the two classes in $\mathbb{R}$ as $0$ and $1$:
$$\ell^{LR}(y,u) = \log(1+e^u) - yu\,. \tag{15}$$
The regularization effect of Mixup in that case is detailed in the following result:

Corollary 5. Let $s : \mathbb{R} \to \mathbb{R}$ be the sigmoid operator, i.e., for any $u \in \mathbb{R}$, $s(u) = (1+e^{-u})^{-1}$, and let $v(u) = s(u)(1-s(u)) \in \mathbb{R}$. The approximate Mixup risk for the logistic regression loss satisfies
$$\mathcal{E}^{\text{Mixup}}_Q(f) = \frac{1}{n}\sum_{i=1}^n \ell^{LR}(\tilde y_i, f(\tilde x_i)) + R^{LR}_1(f) + R^{LR}_2(f) + R^{LR}_3(f)\,,$$
where
$$R^{LR}_1(f) = \frac{1}{2n}\sum_{i=1}^n v(f(\tilde x_i))\,\big\|\nabla f(\tilde x_i) - J^{(i)}\big\|^2_{\Sigma^{(i)}_{\tilde x\tilde x}}\,,$$
$$R^{LR}_2(f) = \frac{1}{2n}\sum_{i=1}^n \big(s(f(\tilde x_i)) - \tilde y_i\big)\,\big\langle \Sigma^{(i)}_{\tilde x\tilde x},\ \nabla^2 f(\tilde x_i)\big\rangle\,,$$
$$R^{LR}_3(f) = -\frac{1}{2n}\sum_{i=1}^n v(f(\tilde x_i))^{-1}\,\Sigma^{(i)}_{\tilde y\tilde x}\,\big(\Sigma^{(i)}_{\tilde x\tilde x}\big)^{-1}\,\Sigma^{(i)}_{\tilde x\tilde y}\,,$$
with
$$J^{(i)} = \frac{\Sigma^{(i)}_{\tilde y\tilde x}\,\big(\Sigma^{(i)}_{\tilde x\tilde x}\big)^{-1}}{v(f(\tilde x_i))}\,,\quad i\in[n]\,. \tag{16}$$

The next result summarizes the form of the approximate Mixup risk in the case of the squared error loss, and shows in particular that Mixup has no effect for linear least-squares regression models.

Corollary 6. The approximate Mixup risk for the squared error loss satisfies
$$\mathcal{E}^{\text{Mixup}}_Q(f) = \frac{1}{n}\sum_{i=1}^n \ell^{SE}(\tilde y_i, f(\tilde x_i)) + R^{SE}_1(f) + R^{SE}_2(f) + C\,, \tag{17}$$
where $C$ is a constant independent of $f$ and
$$R^{SE}_1(f) = \frac{1}{2n}\sum_{i=1}^n \big\|\nabla f(\tilde x_i) - J^{(i)}\big\|^2_{\Sigma^{(i)}_{\tilde x\tilde x}}\qquad\text{and}\qquad R^{SE}_2(f) = \frac{1}{2n}\sum_{i=1}^n \big\langle \Sigma^{(i)}_{\tilde x\tilde x},\ \big(f(\tilde x_i)-\tilde y_i\big)^\top\nabla^2 f(\tilde x_i)\big\rangle\,,$$
with
$$J^{(i)} = \Sigma^{(i)}_{\tilde y\tilde x}\,\big(\Sigma^{(i)}_{\tilde x\tilde x}\big)^{-1}\,,\quad i\in[n]\,. \tag{18}$$
In particular, when we consider linear models with intercept of the form $f_{W,b}(x) = Wx + b$ for $(W,b) \in \mathbb{R}^{c\times d}\times\mathbb{R}^c$, then the exact Mixup risk satisfies
$$\mathcal{E}^{\text{Mixup}}(f_{W,b}) = \big(2\sigma^2 + \bar\theta^2 + (1-\bar\theta)^2\big)\,\frac{1}{n}\sum_{i=1}^n \ell^{SE}\big(y_i, f_{W,\hat b}(x_i)\big) + \frac{1}{2}\,\big\|b - \hat b\big\|^2 + C\,, \tag{19}$$
where $C$ is a constant that does not depend on $(W,b)$ and $\hat b = \bar y - W\bar x$. Consequently, the linear model that minimizes $\mathcal{E}^{\text{Mixup}}$ is the standard multivariate ordinary least squares (MOLS) predictor that minimizes $\mathcal{E}^{\text{Empirical}}$ on the original data, i.e., Mixup has no effect on linear least-squares regression.
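This conclusion is easy to illustrate numerically. The sketch below is ours, on synthetic data: ordinary least squares fitted on a large sample of Mixup pairs essentially coincides, up to Monte Carlo error, with the fit on the original data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha, m = 200, 5, 1.0, 500_000
X = rng.normal(size=(n, d)) + 1.                   # non-centered inputs
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def ols(X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])      # append an intercept column
    return np.linalg.lstsq(Xb, y, rcond=None)[0]

i, j = rng.integers(n, size=m), rng.integers(n, size=m)
lam = rng.beta(alpha, alpha, size=m)
X_mix = lam[:, None] * X[i] + (1. - lam[:, None]) * X[j]
y_mix = lam * y[i] + (1. - lam) * y[j]

print(np.abs(ols(X, y) - ols(X_mix, y_mix)).max())  # ~ 0 up to Monte Carlo error
```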
5. Discussion and experiments

Let us now discuss how our analysis relates to a recent similar study by Zhang et al. (2021), and empirically assess the validity of our analysis and the regularization properties of Mixup it suggests. To support our discussion, we provide empirical results on CIFAR-10/100 and ImageNet for different networks (LeNet, ResNet-34/50). For each experimental result we report the mean and 95% confidence interval over 10 repetitions (unless stated otherwise). All details about the experiments are provided in Appendix B, together with other experiments in the simpler setting of learning on the two-moon dataset with random features.

Comparison with related work. Zhang et al. (2021) independently published a similar analysis of the regularization effect of Mixup. Both works provide complementary and coherent views of the effect of Mixup on generalization and robustness, and differ in a few technical aspects that we clarify here. First, we provide an analysis valid for any loss $\ell(y,u)$, where $y$ and $u$ are respectively the true and predicted outputs, while Zhang et al. (2021) restrict their analysis to losses of the form $\ell(y,u) = h(u) - yu$ for some function $h$. Second, while both works use a second-order Taylor expansion to approximate the loss at a Mixup example, the expansions are different, since we perform a Taylor expansion near the expected value of the Mixup example, while Zhang et al. (2021, Lemma 3.1) perform a Taylor expansion near a non-Mixup example. One consequence is that our Taylor approximation converges to the exact Mixup risk when the Mixup perturbations vanish (i.e., when $\alpha \to +\infty$ in the $\mathrm{Beta}(\alpha,\alpha)$ distribution of the mixing parameter), while the one used by Zhang et al. (2021) does not.² Third, the generalization analysis of Zhang et al. (2021) for generalized linear models (GLM) is performed for a variant of Mixup where the mixed input is $x_i + (1/\lambda - 1)\,x_j$, while we focus on the standard Mixup $\lambda x_i + (1-\lambda)x_j$. A consequence of this difference is that we identify the importance of rescaling data at test time for standard Mixup (see below), while no such rescaling is needed in the variant considered by Zhang et al. (2021). Finally, Zhang et al. (2021) explore the link between Mixup-induced regularization and adversarial robustness and generalization through a Rademacher complexity analysis, which we do not explore in this work but which could be done in a similar manner.

² More precisely, the Taylor approximation remainder in Zhang et al. (2021, Lemma 3.1) does not vanish, since it comes from a Taylor expansion near 0 of a random variable that follows a Beta(α + 1, α + 1) distribution, i.e., that takes values close to 1/2 with high probability when α goes to +∞.

Fig. 2: From left to right: train loss, test loss and test accuracy during optimization of LeNet on CIFAR-10 with Mixup and approximate Mixup.

Validity of the Taylor approximation. To analyze the regularization effect of Mixup, we used a quadratic approximation (11) of the loss function. We note that, compared to similar approximations that have been proposed to study the regularization effect of input perturbations only, such as dropout (Wager et al., 2013; Wei et al., 2020), we must include in the Taylor expansion all second-order terms involving the input perturbation only (term in $\delta\delta^\top$), the output perturbation only (term in $\varepsilon\varepsilon^\top$), and their interaction (term in $\varepsilon\delta^\top$).
In the absence of output perturbation (e.g., in the case of dropout), only the term in $\delta\delta^\top$ matters, and in the absence of correlation between the input and output perturbations (e.g., dropout combined with independent label smoothing), the term in $\varepsilon\delta^\top$ does not matter either. Mixup is unique in the correlation it creates between input and output perturbations, which is captured by the interaction term in $\varepsilon\delta^\top$ in (11). Regarding the validity of the Taylor approximation, we note that, as for similar work on input perturbation, the approximate Mixup risk (10) is only a good approximation to the Mixup risk for small perturbations; as noted by Wei et al. (2020, Annex A.2), though, this often remains valid even for a large input perturbation followed by a linear transformation layer.

To support empirically the validity of the approximation, Figure 2 shows the training and test performance of a LeNet trained on CIFAR-10 using Mixup (minimizing (4)) and using the approximate Mixup formulation (minimizing (12)), where we dropped the term $R_2(f)$ from the regularization since it empirically induces numerical instability due to its non-convexity (see also Wei et al., 2020, for a discussion about discarding the Hessian regularization). We can see that, when training with the approximate Mixup formulation, we learn functions which mimic, both in training and in test, the performance of functions learned with the standard Mixup procedure.

Fig. 3: Evaluation of the regularization terms of the Mixup approximation (left) and of the loss on modified data (right), for functions learned with Mixup, ERM and ERM on modified data (LeNet on CIFAR-10).

To further assess the validity of the approximation and decouple the contributions of the data transformation (5) and of the perturbations (6), we evaluate, for functions learned with Mixup (4), with ERM (2), and with ERM on modified data:
$$\min_{f\in\mathcal{H}}\ \frac{1}{n}\sum_{i=1}^n \ell(\tilde y_i, f(\tilde x_i))\,, \tag{20}$$
the regularization terms of the approximation (12) and the loss on modified data (20). From Figure 3 we see that the functions learned with Mixup are the ones with the smallest values of the regularization terms, but not the smallest loss on modified data, which confirms that the model trained with Mixup finds a trade-off between the empirical risk and the regularization we study.

Data modification at test time. By Theorem 1, we see that Mixup implicitly shrinks the inputs towards their mean, since the Mixup risk involves the empirical risk over modified inputs $\tilde x_i$ and outputs $\tilde y_i$. In particular, this means that the standard Mixup procedure does not learn functions from the space $\mathcal{X}$ of input points to the space $\mathcal{Y}$ of output points, but rather functions from $\tilde{\mathcal{X}}$ to $\tilde{\mathcal{Y}}$, which are spaces defined by the training points and the hyperparameter $\alpha$ of Mixup as
$$\tilde{\mathcal{X}} = \big\{\bar x + \bar\theta\,(x - \bar x)\ \big|\ x \in \mathcal{X}\big\}\,,\qquad \tilde{\mathcal{Y}} = \big\{\bar y + \bar\theta\,(y - \bar y)\ \big|\ y \in \mathcal{Y}\big\}\,, \tag{21}$$
where $\bar x$ and $\bar y$ are respectively the average training input and output points, and $\bar\theta = \mathbb{E}_\theta\,\theta$ as defined in Theorem 1.
An important consequence is that the function f : e X e Y estimated On Mixup Regularization by Mixup, should ideally be applied at test time to transformed data, to map the test point xtest X to extest e X, and the evaluation of the function f(extest) e Y should be mapped back to Y. In details, the prediction for point xtest should be predf(xtest) = y 1 1/θ + 1/θ f θxtest + (1 θ)x . (22) Algorithm 1 python code to evaluate according to (22) functions learned with mixup Input: X_test Tensor of test points to evaluate, trained_mode Model trained using mixup, alpha float hyperparameter used by mixup during training, X_train, Y_train Tensor of points used for training with one-hot encoded labels. import scipy.special as sc import torch X_bar = torch.mean(X_train, dim=0, keepdim=True) Y_bar = torch.mean(Y_train, dim=0, keepdim=True) # expectation of a truncated beta distribution in [0.5, 1] theta_bar = 1. - sc.betainc(alpha + 1., alpha, .5) def predict_mixup(X_test, trained_model): f_X = trained_model.forward((1. - theta_bar) * X_bar + theta_bar * X_test) return Y_bar * (1. - 1. / theta_bar) + f_X / theta_bar For centered training data (x = y = 0) and homogeneous functions (f(ux) = uf(x) for any (u, x) R+ X, e.g., linear models or neural networks with Re LU activation and linear transformations), this has no impact as predf(x) = f(x) in that case. For more general models, however, (22) may be a better predictor than f. For example, we clearly see in Figure 1 that the asymptotically Bayes optimal classifier under the Mixup distribution matches the one under the empirical distribution of the modified data (up to regularization effects), and not of the original data. Interestingly, when the classes are balanced, i.e., y = 1 c1c, the transformation in (22) adds the same constant to each of the c entries of f. In particular, in the multi-class setting, since the softmax is invariant to a constant in the logits, (22) becomes equivalent to a scaling of the logits, commonly referred to as temperature scaling (Guo et al., 2017). While temperature scaling is traditionally tuned with a validation set (Guo et al., 2017), mixup automatically sets this value, according to the distribution of θ. To point out the advantages of using (22), we compare in Figure 4 the performance of ERM, standard Mixup (for different α values), the same Mixup but with the proper data transformation (22) at test time and the same data transformation applied to the ERM estimator. The trained networks are Res Net-34 for CIFAR-10 and CIFAR-100, and Res Net-50 for Image Net. For CIFAR-10 and 100, we observe overall benefits of using the data transformation for evaluating the functions learned with Mixup: higher test accuracy, lower test loss, and lower expected calibration error (ECE). For Image Net we have the same benefits, with the only exception of the test loss and the ECE for very low values of α. Notice indeed that for small values of α the data transformation (22) has a smaller impact than for bigger values of α, as limα 0 θ = 1 while limα + θ = 1 2. Thus, as observed empirically, Carratino, Cissé, Jenatton and Vert Fig. 4: Test accuracy, test cross-entropy loss and test expected calibration error (first, second and third column respectively) on the CIFAR-10, CIFAR-100, Image Net datasets (first, second and third row respectively). We report the mean and the standard error over 10 repetitions for CIFAR-10 and CIFAR-100 and 5 repetitions for Image Net. 
Finally, we can observe that when the data transformation (22) is applied to ERM, performance always deteriorates. This supports the claim that (22) is a Mixup-specific improvement. Algorithm 1 shows the few lines of code that implement the new prediction procedure (22).

Data modification for out-of-distribution data. Even though our theoretical analysis only holds for in-distribution data, we now empirically investigate whether the benefits of rescaling data at test time that we observe for models trained with Mixup also hold when the test data come from a different distribution, a setting called out-of-distribution (OOD) data (Hendrycks and Gimpel, 2017). Indeed, standard Mixup is known to provide some benefits in out-of-distribution settings compared to models simply trained by ERM, so it is interesting to assess whether the data modification scheme we propose can further boost the accuracy and calibration of models trained with Mixup in that setting. In particular, we consider CIFAR-10-C, CIFAR-100-C, ImageNet-C (Hendrycks and Dietterich, 2019), ImageNet-A (Hendrycks et al., 2021), ImageNetV2 (Recht et al., 2019), ImageNet-Vid-Robust, YTBB-Robust (Shankar et al., 2021) and ObjectNet (Barbu et al., 2019). These benchmark datasets are designed by systematically perturbing in a controlled way the in-distribution data (CIFAR-10, CIFAR-100 and ImageNet), e.g., by adding noise or applying transformations to the images. Table 2 details the performance of models trained with Mixup for $\alpha = 0.25$ and $\alpha = 0.5$, with or without data modification at test time, on out-of-distribution datasets derived from ImageNet, while Figure 5 shows similar results for benchmarks derived from CIFAR-10 and CIFAR-100.

Fig. 5: Test accuracy, test cross-entropy loss and test expected calibration error (first, second and third column, respectively) on CIFAR-10-C and CIFAR-100-C (first and second row, respectively). We report the mean and standard error over 10 repetitions. The training was done with ResNet-34 on the standard CIFAR-10 and CIFAR-100.

We see that for most datasets the rescaling improves accuracy and some of the other metrics, but for others, and in particular CIFAR-10-C, CIFAR-100-C and ImageNet-C, the rescaling worsens almost all metrics with respect to standard Mixup. We now investigate how the performance of Mixup with or without rescaling differs with respect to the intensity of the noise.

Fig. 6: Difference in test performance between Mixup and rescaled Mixup on CIFAR-10-C and CIFAR-100-C for the different corruption intensities. The training was done with ResNet-34 on the standard CIFAR-10 and CIFAR-100 datasets, and the performance metrics are accuracy, cross-entropy loss and expected calibration error.
| ImageNet | Metric | Mixup (rescaled), α = 0.25 | Mixup, α = 0.25 | Mixup (rescaled), α = 0.5 | Mixup, α = 0.5 |
|---|---|---|---|---|---|
| standard | Acc | 0.7764 ± 0.0022 | 0.7754 ± 0.0021 | 0.7780 ± 0.0006 | 0.7766 ± 0.0009 |
| | CE-loss | 0.8820 ± 0.0092 | 0.8964 ± 0.0113 | 0.8852 ± 0.0046 | 0.9482 ± 0.0188 |
| | ECE | 0.0291 ± 0.0031 | 0.0236 ± 0.0058 | 0.0280 ± 0.0040 | 0.0707 ± 0.0127 |
| A | Acc | 0.0165 ± 0.0011 | 0.0165 ± 0.0011 | 0.0229 ± 0.0004 | 0.0225 ± 0.0005 |
| | CE-loss | 7.3395 ± 0.0884 | 6.8075 ± 0.0687 | 7.1010 ± 0.0879 | 6.3738 ± 0.0506 |
| | ECE | 0.4322 ± 0.0072 | 0.3629 ± 0.0075 | 0.4318 ± 0.0110 | 0.3161 ± 0.0117 |
| C | Acc | 0.4638 ± 0.0021 | 0.4662 ± 0.0019 | 0.4719 ± 0.0015 | 0.4761 ± 0.0011 |
| | CE-loss | 2.8050 ± 0.0140 | 2.7374 ± 0.0094 | 2.7842 ± 0.0164 | 2.7206 ± 0.0136 |
| | MCE | 0.6824 ± 0.0025 | 0.6795 ± 0.0022 | 0.6735 ± 0.0018 | 0.6685 ± 0.0014 |
| | ECE | 0.1040 ± 0.0046 | 0.0652 ± 0.0019 | 0.1096 ± 0.0054 | 0.0793 ± 0.0061 |
| V2 | Acc | 0.6560 ± 0.0028 | 0.6543 ± 0.0029 | 0.6593 ± 0.0007 | 0.6575 ± 0.0011 |
| | CE-loss | 1.5208 ± 0.0118 | 1.5048 ± 0.0122 | 1.5130 ± 0.0057 | 1.5295 ± 0.0170 |
| | ECE | 0.0746 ± 0.0053 | 0.0344 ± 0.0030 | 0.0715 ± 0.0079 | 0.0417 ± 0.0096 |
| Vid-Robust | Acc (pm-k) | 0.5254 ± 0.0043 | 0.5244 ± 0.0036 | 0.5224 ± 0.0023 | 0.5216 ± 0.0035 |
| YTBB-Robust | Acc (pm-k) | 0.4738 ± 0.0037 | 0.4717 ± 0.0035 | 0.4705 ± 0.0021 | 0.4692 ± 0.0017 |
| ObjectNet | Acc | 0.2756 ± 0.0035 | 0.2751 ± 0.0034 | 0.2837 ± 0.0010 | 0.2825 ± 0.0011 |

Table 2: Test performance of Mixup and rescaled Mixup on different OOD versions of ImageNet. The training was done with ResNet-50 on the standard ImageNet dataset, and the performance metrics are accuracy, cross-entropy loss, expected calibration error and mean corruption error (Hendrycks and Dietterich, 2019). For each metric, means and standard errors over 10 repetitions are reported.

In detail, the CIFAR-10-C and CIFAR-100-C data consist of 95 different corrupted versions of each of the original CIFAR-10 and CIFAR-100 test sets: 19 different corruption types, each at 5 growing corruption intensities. The results reported in Figure 5 are the averages of the performance over these 95 test sets. In Figure 6 instead, for Mixup and rescaled Mixup, we compute the average of the different metrics (test accuracy, test cross-entropy loss, ECE) over the 19 corruption types for each corruption intensity, and we report the difference in performance between Mixup and rescaled Mixup. We can see that as the noise intensity grows, the gap between the two methods grows in favor of standard Mixup, and that only for a few metrics and settings with low noise is the rescaled version better than the standard one. These observations encourage future investigations of the effect of Mixup on OOD data.³

³ Code for reproducing the results on ImageNet and its various OOD variants is available at https://github.com/google/uncertainty-baselines/tree/master/baselines/imagenet

Label smoothing. The transformation that maps the original labels $y_i$ to $\tilde y_i$ acts as a form of label smoothing, a technique known to often improve accuracy and calibration (Szegedy et al., 2016; Müller et al., 2019). The transformed labels $\tilde y_i$ are indeed pulled towards the average label $\bar y$. Recall from Szegedy et al. (2016) that label smoothing consists in training a model on a perturbed version of the training labels defined as $y^{LS}_i = (1-\varepsilon)\,y_i + \varepsilon\, u^{(i)}$, where $\varepsilon$ is a fixed scalar in $[0,1]$ and $u^{(i)}$ is a fixed distribution over the labels. It is easy to see that for $\varepsilon = 1-\bar\theta$ and $u^{(i)} = \bar y$ the two formulations coincide. This implies that Mixup implicitly performs label smoothing, and can benefit from this technique in terms of accuracy or calibration.
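The equivalence is immediate to verify; the following short check (ours, with an arbitrary value of $\bar\theta$ for illustration) confirms that the Mixup label transformation in (5) and label smoothing towards $\bar y$ produce identical targets.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = np.eye(10)[rng.integers(10, size=256)]   # one-hot labels, c = 10 classes
y_bar = Y.mean(axis=0)                       # average label
theta_bar = 0.75                             # illustrative value of theta_bar

y_tilde = y_bar + theta_bar * (Y - y_bar)    # Mixup-modified labels, cf. (5)
eps = 1. - theta_bar
y_ls = (1. - eps) * Y + eps * y_bar          # label smoothing towards y_bar
print(np.allclose(y_tilde, y_ls))            # True
```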
In the following Proposition 7, we formally prove that, in the case of the cross-entropy loss and linear models, label smoothing translates into an increase in the average entropy of the predictions, or, in other words, that predictions become less certain, as observed in practice. As in Corollary 4, we use the softmax operator $S : \mathbb{R}^c \to \triangle_c$ defined for $j \in [c]$ by $S(u)_j = e^{u_j}/\sum_{k=1}^c e^{u_k}$:

Proposition 7. Let us consider the following two classification problems with a cross-entropy loss and a linear model $f(x) = Wx$ parameterized by $W \in \mathbb{R}^{c\times d}$:
$$\min_{W\in\mathbb{R}^{c\times d}}\ \frac{1}{n}\sum_{i=1}^n \ell^{CE}(y_i, Wx_i) \tag{23}$$
$$\min_{W\in\mathbb{R}^{c\times d}}\ \frac{1}{n}\sum_{i=1}^n \ell^{CE}(\tilde y_i, Wx_i) \tag{24}$$
defined without and with label smoothing, respectively, i.e., with $\tilde y_i = \bar y + \bar\theta\,(y_i - \bar y) \in \triangle_c$ for $i \in [n]$. Let us denote by $W^\star$ and $W^{\mathrm{ls}}$ a solution of (23) and (24), respectively, together with
$$p_i = S(W^\star x_i) \qquad\text{and}\qquad \tilde p_i = S(W^{\mathrm{ls}} x_i)\,. \tag{25}$$
It holds that the average entropy of the predictions of $W^{\mathrm{ls}}$ is lower bounded as follows:
$$\bar\theta\,\frac{1}{n}\sum_{i=1}^n Z(p_i) + (1-\bar\theta)\,Z(\bar y)\ \le\ \frac{1}{n}\sum_{i=1}^n Z(\tilde p_i)\,.$$
If predicting with $W^\star$ also reduces the entropy of the average predictor, i.e., $\frac{1}{n}\sum_{i=1}^n Z(p_i) \le Z(\bar y)$, then label smoothing increases the average entropy of the predictions:
$$\frac{1}{n}\sum_{i=1}^n Z(p_i)\ \le\ \frac{1}{n}\sum_{i=1}^n Z(\tilde p_i)\,.$$

To illustrate how both Mixup and label smoothing increase the entropy of predictions compared to ERM, we show in Figure 7 the histograms of the confidence of the estimators' predictions on test points, for a LeNet neural network trained on CIFAR-10. From the first plot, we notice that standard ERM produces very confident predictions, that label smoothing helps decrease ERM's confidence at test time, and that Mixup naturally produces even less confident predictions. From the second plot, we see that approximate Mixup, like Mixup, produces less confident predictions, which confirms that the Mixup approximation we study captures this behavior of Mixup well.

Fig. 7: Histograms of the confidence of predictions $\max_i \rho(f(x))_i$ on test points, for models trained with different techniques (left: Mixup, ERM, ERM + label smoothing; right: Mixup, approximate Mixup, ERM + label smoothing), for a LeNet trained on CIFAR-10.

Jacobian regularization. The first implicit regularization term $R_1(f)$ in Theorem 3 penalizes the discrepancy between $\nabla f(\tilde x_i)$ and $J^{(i)}$ given by (13). We recognize in $J^{(i)}$ the Jacobian of the standard MOLS model trained in the input space on the modified training set, with an increased weight for the sample $(\tilde x_i, \tilde y_i)$ in $J^{(i)}$. Compared to, e.g., dropout regularization, which penalizes the norm of $\nabla f$ at the training points, we therefore see that Mixup also regularizes the Jacobian of $f$, but with a different and more informative implicit bias, namely, to mimic a good linear model in the input space. Furthermore, we note from the proof of Theorem 3 that this implicit bias results from the correlation between input and output noise, which may explain why independent Mixup in the input and output performs more poorly than standard Mixup (Zhang et al., 2018). While this regularization is similar across all points in the squared loss setting (Corollary 6), it is weighted by the Hessian $H(f(\tilde x_i))$ in the cross-entropy loss (Corollary 4). As with dropout, this implies that this regularization vanishes when the prediction $S(f(\tilde x_i))$ is confidently near 0 or 1.
In the Mixup case, though, the label smoothing effect discussed in the previous paragraph tends to prevent over-confident predictions on the training points (see Proposition 7 for a formal description of that property), therefore ensuring that the Jacobian regularization in $R_1(f)$ remains active even for easy points. This interaction between label smoothing (due to output Mixup) and Jacobian regularization (due to input Mixup) may explain why Mixup on inputs only performs poorly compared to Mixup on both inputs and outputs (Thulasidasan et al., 2019).

6. Conclusions

In this paper we have proposed the first theoretical analysis that explains the multiple regularization effects of Mixup. We have proved that training with Mixup is equivalent to learning on modified data with the injection of structured noise. Through a Taylor approximation, we have further shown that Mixup amounts to empirical risk minimization on modified points plus multiple regularization terms. Fascinatingly, our derivation shows that Mixup induces varied and complex effects, e.g., calibration, Jacobian regularization, robustness to label noise and input normalization, while being a simple and cheap data augmentation technique. Further, we have shown how this analysis points out a missing rescaling procedure required to evaluate functions learned with Mixup, and we have brought empirical evidence that implementing it improves accuracy and calibration. More broadly, we have studied how a specific combination of data modification and noise injection leads to certain regularizers. An interesting research question is whether we can reverse-engineer this process, namely, starting from possibly expensive regularizers, design the corresponding data augmentation technique emulating their effects at a lower computational cost.

Acknowledgments

The authors thank Jeremiah Zhe Liu for his feedback on an early version of this work.

Appendix A. Proofs

A.1 Proof of Theorem 1

Proof. To simplify notations, let us denote, for any $i, j \in [n]$ and $u \in [0,1]$,
$$m_{ij}(u) = \ell\big(u\,y_i + (1-u)\,y_j,\ f(u\,x_i + (1-u)\,x_j)\big)\,.$$
The Mixup risk (4) can then be written as
$$\mathcal{E}^{\text{Mixup}}(f) = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \mathbb{E}_\lambda\, m_{ij}(\lambda)\,,\qquad \lambda \sim \mathrm{Beta}(\alpha,\alpha)\,. \tag{26}$$
We now separate the values of $\lambda$ depending on whether or not they are above $1/2$ by expressing it as
$$\lambda = \pi\lambda_0 + (1-\pi)\lambda_1\,,\qquad \lambda_0 \sim \mathrm{Beta}_{[0,\frac{1}{2}]}(\alpha,\alpha)\,,\quad \lambda_1 \sim \mathrm{Beta}_{[\frac{1}{2},1]}(\alpha,\alpha)\,,\quad \pi \sim \mathrm{Ber}\Big(\frac{1}{2}\Big)\,. \tag{27}$$
By symmetry of the $\mathrm{Beta}(\alpha,\alpha)$ distribution around $1/2$, it is clear that $\lambda$ defined in (27) follows a $\mathrm{Beta}(\alpha,\alpha)$ distribution, and furthermore that $\lambda_0' = 1-\lambda_0$ follows, like $\lambda_1$, a $\mathrm{Beta}_{[\frac{1}{2},1]}(\alpha,\alpha)$ distribution. For any $i,j \in [n]$, we therefore get
$$\mathbb{E}_\lambda\, m_{ij}(\lambda) = \mathbb{E}_{\lambda_0,\lambda_1,\pi}\, m_{ij}\big(\pi\lambda_0 + (1-\pi)\lambda_1\big) = \frac{1}{2}\big[\mathbb{E}_{\lambda_0}\, m_{ij}(\lambda_0) + \mathbb{E}_{\lambda_1}\, m_{ij}(\lambda_1)\big] = \frac{1}{2}\big[\mathbb{E}_{\lambda_0'}\, m_{ji}(\lambda_0') + \mathbb{E}_{\lambda_1}\, m_{ij}(\lambda_1)\big]\,,$$
where we used the fact that $m_{ij}(1-u) = m_{ji}(u)$ to get the third equality. Plugging this equality back into (26), we finally get
$$\mathcal{E}^{\text{Mixup}}(f) = \frac{1}{2n^2}\sum_{i=1}^n\sum_{j=1}^n \big[\mathbb{E}_{\lambda_0'}\, m_{ji}(\lambda_0') + \mathbb{E}_{\lambda_1}\, m_{ij}(\lambda_1)\big] = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \mathbb{E}_{\lambda_1}\, m_{ij}(\lambda_1) = \frac{1}{n}\sum_{i=1}^n \bar\ell_i\,, \tag{28}$$
with
$$\bar\ell_i = \mathbb{E}_{\theta,j}\,\ell\big(\theta y_i + (1-\theta)y_j,\ f(\theta x_i + (1-\theta)x_j)\big)\,,\qquad \theta \sim \mathrm{Beta}_{[\frac{1}{2},1]}(\alpha,\alpha)\,,\quad j \sim \mathrm{Unif}([n])\,.$$
We now easily see that $\tilde x_i$ and $\tilde y_i$ defined in (5) satisfy
$$\tilde x_i = \mathbb{E}_{\theta,j}\big[\theta x_i + (1-\theta)x_j\big]\,,\qquad \tilde y_i = \mathbb{E}_{\theta,j}\big[\theta y_i + (1-\theta)y_j\big]\,,$$
and furthermore that $\delta_i$ and $\varepsilon_i$ defined in (6) satisfy
$$\delta_i = \theta x_i + (1-\theta)x_j - \mathbb{E}_{\theta,j}\big[\theta x_i + (1-\theta)x_j\big]\,,\qquad \varepsilon_i = \theta y_i + (1-\theta)y_j - \mathbb{E}_{\theta,j}\big[\theta y_i + (1-\theta)y_j\big]\,,$$
from which we deduce that $\mathbb{E}_{\theta,j}\,\delta_i = \mathbb{E}_{\theta,j}\,\varepsilon_i = 0$ and
$$\bar\ell_i = \mathbb{E}_{\theta,j}\,\ell\big(\tilde y_i + \varepsilon_i,\ f(\tilde x_i + \delta_i)\big)\,.$$
Plugging this equality back into (28) concludes the proof.
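The folding step used in this proof can also be checked numerically. The sketch below (ours) verifies that $\max(\lambda, 1-\lambda)$ with $\lambda \sim \mathrm{Beta}(\alpha,\alpha)$ has the mean of a $\mathrm{Beta}_{[\frac{1}{2},1]}(\alpha,\alpha)$ variable, matching the closed form $\bar\theta = 1 - I_{1/2}(\alpha+1,\alpha)$ used in Algorithm 1, where $I$ denotes the regularized incomplete Beta function.

```python
import numpy as np
import scipy.special as sc

rng = np.random.default_rng(0)
for alpha in [0.25, 1.0, 4.0]:
    lam = rng.beta(alpha, alpha, size=1_000_000)
    theta = np.maximum(lam, 1. - lam)                    # folded draw, lies in [1/2, 1]
    theta_bar = 1. - sc.betainc(alpha + 1., alpha, .5)   # closed form used in Algorithm 1
    print(alpha, theta.mean(), theta_bar)                # the two means agree
```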
A.2 Proof of Lemma 2

Proof. From the definition of $\delta_i$ in (6), we easily get:
$$\begin{aligned} \mathbb{E}_{\theta,j}\,\delta_i\delta_i^\top &= \mathbb{E}_\theta(\theta-\bar\theta)^2\, x_ix_i^\top + \mathbb{E}_\theta[(1-\theta)^2]\,\mathbb{E}_j[x_jx_j^\top] + (1-\bar\theta)^2\,\bar x\bar x^\top \\ &\quad + \mathbb{E}_\theta[(\theta-\bar\theta)(1-\theta)]\,\mathbb{E}_j[x_ix_j^\top + x_jx_i^\top] - (1-\bar\theta)\,\mathbb{E}_\theta[1-\theta]\,\mathbb{E}_j[x_j\bar x^\top + \bar x x_j^\top] \\ &= \sigma^2\, x_ix_i^\top + \gamma^2\,\mathbb{E}_j[x_jx_j^\top] + (1-\bar\theta)^2\,\bar x\bar x^\top - \sigma^2\big(x_i\bar x^\top + \bar x x_i^\top\big) - 2(1-\bar\theta)^2\,\bar x\bar x^\top \\ &= \sigma^2\big(x_ix_i^\top - x_i\bar x^\top - \bar x x_i^\top\big) + \gamma^2\big(\Sigma_{xx} + \bar x\bar x^\top\big) - (1-\bar\theta)^2\,\bar x\bar x^\top \\ &= \sigma^2\,(x_i-\bar x)(x_i-\bar x)^\top + \gamma^2\,\Sigma_{xx}\,, \end{aligned} \tag{29}$$
where we used the independence between $\theta$ and $j$ for the first equality; for the second, the facts that $\mathbb{E}_\theta(\theta-\bar\theta)^2 = \sigma^2$, $\mathbb{E}_\theta[(1-\theta)^2] = \sigma^2 + (1-\bar\theta)^2 = \gamma^2$, $\mathbb{E}_\theta[(\theta-\bar\theta)(1-\theta)] = \bar\theta^2 - \mathbb{E}_\theta[\theta^2] = -\sigma^2$, and $\mathbb{E}_\theta[1-\theta] = 1-\bar\theta$; for the third, we reorganized the terms and used the equality $\mathbb{E}_j\, x_jx_j^\top = \Sigma_{xx} + \bar x\bar x^\top$, which holds by definition of the empirical covariance matrix $\Sigma_{xx}$; the last equality is obtained by reorganizing the terms and using the definition of $\gamma^2$. In order to write this covariance matrix in terms of the modified inputs, we notice that by definition (5) we have $x_i - \bar x = (\tilde x_i - \bar x)/\bar\theta$ and $\mathbb{E}_j\,\tilde x_j = \mathbb{E}_j\, x_j = \bar x$, which implies that the empirical covariance matrix of the modified inputs is $\Sigma_{\tilde x\tilde x} = \bar\theta^2\,\Sigma_{xx}$. Combining these equalities with (29) gives the first equality in (9). The two other equalities can be proved in exactly the same way.

A.3 Proof of Theorem 3

Proof. Given a modified input/output pair $(\tilde x, \tilde y) \in \mathcal{X}\times\mathcal{Y}$ and a function $f \in \mathcal{H}$, the second-order Taylor approximation of the loss $G(\tilde x, \tilde y) = \ell(\tilde y, f(\tilde x))$ is, for any $(\delta, \varepsilon) \in \mathcal{X}\times\mathcal{Y}$:
$$G_Q(\tilde x+\delta, \tilde y+\varepsilon) = G(\tilde x,\tilde y) + \nabla_{\tilde x}G(\tilde x,\tilde y)\,\delta + \nabla_{\tilde y}G(\tilde x,\tilde y)\,\varepsilon + \frac{1}{2}\,\delta^\top\nabla^2_{\tilde x\tilde x}G(\tilde x,\tilde y)\,\delta + \frac{1}{2}\,\varepsilon^\top\nabla^2_{\tilde y\tilde y}G(\tilde x,\tilde y)\,\varepsilon + \varepsilon^\top\nabla^2_{\tilde y\tilde x}G(\tilde x,\tilde y)\,\delta\,. \tag{30}$$
Using this quadratic approximation at each training point $i \in [n]$ in (7), and using the fact that $\mathbb{E}_{\theta,j}\,\delta_i = \mathbb{E}_{\theta,j}\,\varepsilon_i = 0$, we get
$$\mathbb{E}_{\theta,j}\,G_Q(\tilde x_i+\delta_i, \tilde y_i+\varepsilon_i) = G(\tilde x_i,\tilde y_i) + \frac{1}{2}\big\langle \mathbb{E}_{\theta,j}\delta_i\delta_i^\top,\ \nabla^2_{\tilde x\tilde x}G(\tilde x_i,\tilde y_i)\big\rangle + \frac{1}{2}\big\langle \mathbb{E}_{\theta,j}\varepsilon_i\varepsilon_i^\top,\ \nabla^2_{\tilde y\tilde y}G(\tilde x_i,\tilde y_i)\big\rangle + \big\langle \mathbb{E}_{\theta,j}\varepsilon_i\delta_i^\top,\ \nabla^2_{\tilde y\tilde x}G(\tilde x_i,\tilde y_i)\big\rangle\,, \tag{31}$$
which we can rewrite as follows by expressing the derivatives of $G(x,y) = \ell(y, f(x))$ in terms of derivatives of $\ell(y,u)$ and $f(x)$:
$$\begin{aligned} \mathbb{E}_{\theta,j}\,\ell_Q(\tilde y_i+\varepsilon_i, f(\tilde x_i+\delta_i)) = \ \ell(\tilde y_i, f(\tilde x_i)) &+ \frac{1}{2}\big\langle \mathbb{E}_{\theta,j}\delta_i\delta_i^\top,\ \nabla f(\tilde x_i)^\top\nabla^2_{uu}\ell(\tilde y_i,f(\tilde x_i))\,\nabla f(\tilde x_i) + \nabla_u\ell(\tilde y_i,f(\tilde x_i))\,\nabla^2 f(\tilde x_i)\big\rangle \\ &+ \frac{1}{2}\big\langle \mathbb{E}_{\theta,j}\varepsilon_i\varepsilon_i^\top,\ \nabla^2_{yy}\ell(\tilde y_i,f(\tilde x_i))\big\rangle + \big\langle \mathbb{E}_{\theta,j}\varepsilon_i\delta_i^\top,\ \nabla^2_{yu}\ell(\tilde y_i,f(\tilde x_i))\,\nabla f(\tilde x_i)\big\rangle\,. \end{aligned} \tag{32}$$
Replacing the expectations in this equation by their values given by Lemma 2 gives:
$$\begin{aligned} \mathbb{E}_{\theta,j}\,\ell_Q(\tilde y_i+\varepsilon_i, f(\tilde x_i+\delta_i)) = \ \ell(\tilde y_i, f(\tilde x_i)) &+ \frac{1}{2}\big\langle \Sigma^{(i)}_{\tilde x\tilde x},\ \nabla f(\tilde x_i)^\top\nabla^2_{uu}\ell(\tilde y_i,f(\tilde x_i))\,\nabla f(\tilde x_i)\big\rangle + \frac{1}{2}\big\langle \Sigma^{(i)}_{\tilde x\tilde x},\ \nabla_u\ell(\tilde y_i,f(\tilde x_i))\,\nabla^2 f(\tilde x_i)\big\rangle \\ &+ \frac{1}{2}\big\langle \Sigma^{(i)}_{\tilde y\tilde y},\ \nabla^2_{yy}\ell(\tilde y_i,f(\tilde x_i))\big\rangle + \big\langle \Sigma^{(i)}_{\tilde y\tilde x},\ \nabla^2_{yu}\ell(\tilde y_i,f(\tilde x_i))\,\nabla f(\tilde x_i)\big\rangle\,. \end{aligned} \tag{33}$$
We now use the following fact, true for any square symmetric and invertible matrices $A$ and $C$ and rectangular matrices $B$ and $Y$ (such that the matrix multiplications below make sense):
$$\big\langle A,\ Y^\top C\,Y\big\rangle - 2\,\big\langle B,\ Y\big\rangle = \big\langle A,\ (Y-Z)^\top C\,(Y-Z)\big\rangle - \big\langle A^{-1},\ B^\top C^{-1} B\big\rangle\,, \tag{34}$$
where $Z = C^{-1} B A^{-1}$, to combine the second and fifth terms together. Indeed, the fifth term in (33) can be rewritten as
$$\big\langle \Sigma^{(i)}_{\tilde y\tilde x},\ \nabla^2_{yu}\ell(\tilde y_i,f(\tilde x_i))\,\nabla f(\tilde x_i)\big\rangle = \big\langle \nabla^2_{uy}\ell(\tilde y_i,f(\tilde x_i))\,\Sigma^{(i)}_{\tilde y\tilde x},\ \nabla f(\tilde x_i)\big\rangle\,,$$
so plugging into (34) the following matrices:
$$A = \Sigma^{(i)}_{\tilde x\tilde x}\,,\qquad B = -\nabla^2_{uy}\ell(\tilde y_i,f(\tilde x_i))\,\Sigma^{(i)}_{\tilde y\tilde x}\,,\qquad C = \nabla^2_{uu}\ell(\tilde y_i,f(\tilde x_i))\,,\qquad Y = \nabla f(\tilde x_i)\,,$$
yields
$$\begin{aligned} \frac{1}{2}\big\langle \Sigma^{(i)}_{\tilde x\tilde x},\ \nabla f(\tilde x_i)^\top\nabla^2_{uu}\ell\,\nabla f(\tilde x_i)\big\rangle + \big\langle \Sigma^{(i)}_{\tilde y\tilde x},\ \nabla^2_{yu}\ell\,\nabla f(\tilde x_i)\big\rangle &= \frac{1}{2}\Big\langle \Sigma^{(i)}_{\tilde x\tilde x},\ \big(\nabla f(\tilde x_i)-J^{(i)}\big)^\top\nabla^2_{uu}\ell\,\big(\nabla f(\tilde x_i)-J^{(i)}\big)\Big\rangle \\ &\quad - \frac{1}{2}\Big\langle \big(\Sigma^{(i)}_{\tilde x\tilde x}\big)^{-1},\ \Sigma^{(i)}_{\tilde x\tilde y}\,\nabla^2_{yu}\ell\,\big(\nabla^2_{uu}\ell\big)^{-1}\,\nabla^2_{uy}\ell\,\Sigma^{(i)}_{\tilde y\tilde x}\Big\rangle\,, \end{aligned} \tag{35}$$
where $J^{(i)}$ is defined in (13). Theorem 3 then follows by merging the second and fifth terms in (33) using (35), and summing over $i$.
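The algebraic identity (34), which drives the appearance of $J^{(i)}$, can also be verified numerically; the sketch below (ours) instantiates random matrices of compatible sizes, with $A$ and $C$ symmetric positive definite.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 4, 3                       # Y and B are (p, q); A is (q, q); C is (p, p)
A = (lambda M: M @ M.T + np.eye(q))(rng.normal(size=(q, q)))
C = (lambda M: M @ M.T + np.eye(p))(rng.normal(size=(p, p)))
Y, B = rng.normal(size=(p, q)), rng.normal(size=(p, q))

Z = np.linalg.solve(C, B) @ np.linalg.inv(A)             # Z = C^{-1} B A^{-1}
lhs = np.trace(A @ Y.T @ C @ Y) - 2. * np.trace(B.T @ Y)
rhs = (np.trace(A @ (Y - Z).T @ C @ (Y - Z))
       - np.trace(np.linalg.inv(A) @ B.T @ np.linalg.solve(C, B)))
print(np.isclose(lhs, rhs))                              # True
```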
A.4 Proof of Corollary 4

Proof. For the definition (3) of the cross-entropy loss $\ell^{CE}(y,u)$, we easily get (up to terms proportional to $1_c$ or $1_c^\top$, which vanish against the Mixup perturbations since $1_c^\top\varepsilon_i = 0$):
$$\nabla_y\ell^{CE}(y,u) = -u^\top\,,\quad \nabla_u\ell^{CE}(y,u) = (S(u)-y)^\top\,,\quad \nabla^2_{yy}\ell^{CE}(y,u) = 0_c\,,\quad \nabla^2_{yu}\ell^{CE}(y,u) = -I_c\,,\quad \nabla^2_{uu}\ell^{CE}(y,u) = H(u)\,.$$
Plugging these expressions into the four regularization terms of Theorem 3 concludes the proof.

A.5 Proof of Corollary 5

Proof. For the definition (15) of the logistic regression loss $\ell^{LR}(y,u)$, we easily get:
$$\nabla_y\ell^{LR}(y,u) = -u\,,\quad \nabla_u\ell^{LR}(y,u) = s(u)-y\,,\quad \nabla^2_{yy}\ell^{LR}(y,u) = 0\,,\quad \nabla^2_{yu}\ell^{LR}(y,u) = -1\,,\quad \nabla^2_{uu}\ell^{LR}(y,u) = v(u)\,.$$
Plugging these expressions into the four regularization terms of Theorem 3 concludes the proof.

A.6 Proof of Corollary 6

Proof. For the definition (3) of the squared error loss $\ell^{SE}(y,u)$, we easily get:
$$\nabla_y\ell^{SE}(y,u) = (y-u)^\top\,,\quad \nabla_u\ell^{SE}(y,u) = (u-y)^\top\,,\quad \nabla^2_{yy}\ell^{SE}(y,u) = \nabla^2_{uu}\ell^{SE}(y,u) = I_c\,,\quad \nabla^2_{yu}\ell^{SE}(y,u) = -I_c\,.$$
Plugging these expressions into the four regularization terms of Theorem 3 proves (17). When $f$ is a linear function with intercept of the form $f_{W,b}(x) = Wx + b$, we first note that $\ell^{SE}(y, f_{W,b}(x))$ is a quadratic function of $(x,y)$, so the second-order Taylor approximation (11) is exact in that case: $\ell^{SE(i)}_Q(y, f_{W,b}(x)) = \ell^{SE}(y, f_{W,b}(x))$ for any $i \in [n]$ and $(x,y) \in \mathcal{X}\times\mathcal{Y}$, and consequently:
$$\forall (W,b) \in \mathbb{R}^{c\times d}\times\mathbb{R}^c\,,\qquad \mathcal{E}^{\text{Mixup}}_Q(f_{W,b}) = \mathcal{E}^{\text{Mixup}}(f_{W,b})\,.$$
Applying (17) to the case of a linear function $f_{W,b}$ immediately gives $R^{SE}_2(f_{W,b}) = 0$, because $\nabla^2 f_{W,b} = 0$. For the first regularization term, we compute
$$\begin{aligned} R^{SE}_1(f_{W,b}) &= \frac{1}{2n}\sum_{i=1}^n \big\|\nabla f(\tilde x_i) - J^{(i)}\big\|^2_{\Sigma^{(i)}_{\tilde x\tilde x}} = \frac{1}{2n}\sum_{i=1}^n \big\|W - J^{(i)}\big\|^2_{\Sigma^{(i)}_{\tilde x\tilde x}} = \frac{1}{2n}\sum_{i=1}^n \Big[\big\langle W,\ W\Sigma^{(i)}_{\tilde x\tilde x}\big\rangle - 2\big\langle W,\ \Sigma^{(i)}_{\tilde y\tilde x}\big\rangle\Big] + C \\ &= \frac{1}{2n}\sum_{i=1}^n \Big[\frac{\gamma^2}{\bar\theta^2}\,\frac{1}{n}\sum_{j=1}^n \big\|W(\tilde x_j-\bar x) - (\tilde y_j-\bar y)\big\|^2 + \frac{\sigma^2}{\bar\theta^2}\,\big\|W(\tilde x_i-\bar x) - (\tilde y_i-\bar y)\big\|^2\Big] + C \\ &= \frac{\sigma^2+\gamma^2}{2n\,\bar\theta^2}\sum_{i=1}^n \big\|W(\tilde x_i-\bar x) - (\tilde y_i-\bar y)\big\|^2 + C = \frac{2\sigma^2 + (1-\bar\theta)^2}{2n}\sum_{i=1}^n \big\|W(x_i-\bar x) - (y_i-\bar y)\big\|^2 + C\,. \end{aligned} \tag{36}$$
As for the empirical risk term, we can also rewrite it as
$$\begin{aligned} \frac{1}{n}\sum_{i=1}^n \ell^{SE}(\tilde y_i, f_{W,b}(\tilde x_i)) &= \frac{1}{2n}\sum_{i=1}^n \big\|W\tilde x_i + b - \tilde y_i\big\|^2 = \frac{1}{2n}\sum_{i=1}^n \big\|W(\tilde x_i-\bar x) - (\tilde y_i-\bar y) + (b-\hat b)\big\|^2 \\ &= \frac{1}{2n}\sum_{i=1}^n \big\|W(\tilde x_i-\bar x) - (\tilde y_i-\bar y)\big\|^2 + \frac{1}{2}\big\|b-\hat b\big\|^2 = \frac{\bar\theta^2}{2n}\sum_{i=1}^n \big\|W(x_i-\bar x) - (y_i-\bar y)\big\|^2 + \frac{1}{2}\big\|b-\hat b\big\|^2\,. \end{aligned} \tag{37}$$
Plugging (36) and (37) into (17) finally gives (19). To see that the minimizer of (19) is the standard MOLS solution, we notice that the obvious solution for $b$ is $b = \hat b$, which is the intercept of MOLS, while the solution for $W$ should minimize the sum of squared errors over centered points, which is exactly what MOLS does.

A.7 Proof of Proposition 7

Proof. We first start by deriving a dual formulation for (24). The derivation for (23) follows along the same lines, up to the replacement of $\tilde y_i$ by $y_i$. Introducing primal variables $u_i \in \mathbb{R}^c$ for $i \in [n]$ with equality constraints $Wx_i = u_i$, and dual variables $\xi_i \in \mathbb{R}^c$, we obtain the following Lagrangian (see Boyd and Vandenberghe, 2004):
$$L(\{u_i,\xi_i\}_{i=1}^n, W) = \frac{1}{n}\sum_{i=1}^n \Big[\ell^{CE}(\tilde y_i, u_i) + \xi_i^\top\big(Wx_i - u_i\big)\Big] = \frac{1}{n}\sum_{i=1}^n \Big\{\log\sum_{j=1}^c e^{(u_i)_j} - (\xi_i + \tilde y_i)^\top u_i\Big\} + \frac{1}{n}\sum_{i=1}^n \xi_i^\top W x_i\,.$$
We recall that the entropy is concave and is, up to a sign flip, the Fenchel conjugate of the log-sum-exp function: for $p \in \triangle_c$,
$$Z(p) = \min_{t\in\mathbb{R}^c}\Big\{\log\sum_{j=1}^c e^{t_j} - p^\top t\Big\}\,, \tag{38}$$
as for instance detailed in Example 5.5 of Boyd and Vandenberghe (2004). We therefore derive the dual function using (38):
$$\min_{\{u_i\}_{i=1}^n,\, W} L(\{u_i,\xi_i\}_{i=1}^n, W) = \begin{cases} \dfrac{1}{n}\displaystyle\sum_{i=1}^n Z(\xi_i + \tilde y_i) & \text{if } \displaystyle\sum_{i=1}^n \xi_i x_i^\top = 0 \text{ and } \xi_i + \tilde y_i \in \triangle_c \text{ for all } i\,, \\ -\infty & \text{otherwise,} \end{cases}$$
and the dual problem
$$\max_{\{\nu_i\}_{i=1}^n}\ \frac{1}{n}\sum_{i=1}^n Z(\nu_i) \quad\text{subject to}\quad \nu_i \in \triangle_c\,,\ i\in[n]\,,\ \text{and}\ \sum_{i=1}^n (\nu_i - \tilde y_i)\,x_i^\top = 0\,, \tag{39}$$
where we have made the change of variables $\nu_i = \xi_i + \tilde y_i$. The dual problem for (23) is identical to (39), up to the replacement of $\tilde y_i$ by $y_i$. Recalling the definitions in (25) and exploiting the first-order optimality conditions of (23) and (24), we have
$$\sum_{i=1}^n (p_i - y_i)\,x_i^\top = 0 \qquad\text{and}\qquad \sum_{i=1}^n (\tilde p_i - \tilde y_i)\,x_i^\top = 0\,,$$
so that $\{\tilde p_i\}_{i=1}^n$ is feasible for the dual problem (39).
Since strong duality applies (Boyd and Vandenberghe, 2004), it also holds that $\{\tilde p_i\}_{i=1}^n$ maximizes (39). Let us consider $\{q_i\}_{i=1}^n$ defined for $i \in [n]$ by $q_i = \bar\theta\, p_i + (1-\bar\theta)\,\bar y$. We can easily observe that $q_i \in \triangle_c$ as a convex combination of $p_i, \bar y \in \triangle_c$, and that
$$\sum_{i=1}^n (q_i - \tilde y_i)\,x_i^\top = \bar\theta\sum_{i=1}^n (p_i - y_i)\,x_i^\top + (1-\bar\theta)\sum_{i=1}^n (\bar y - \bar y)\,x_i^\top = 0\,.$$
This implies that $\{q_i\}_{i=1}^n$ is feasible for (39), and we have
$$\frac{1}{n}\sum_{i=1}^n Z(q_i)\ \le\ \frac{1}{n}\sum_{i=1}^n Z(\tilde p_i)\,.$$
We get the advertised result by using the concavity of $Z$, so that $\bar\theta\, Z(p_i) + (1-\bar\theta)\,Z(\bar y) \le Z(q_i)$.

Appendix B. Experiments

B.1 CIFAR-10 and CIFAR-100

To learn on CIFAR-10 and CIFAR-100 we use ResNet-34 and LeNet. For both architectures, the optimizer used for training is SGD with momentum 0.9 for 200 epochs, with mini-batch size 128 and weight decay $5\times10^{-4}$. For ResNet-34 the learning rate is 0.1, reduced by a factor 10 at epochs 60, 120 and 160. For LeNet the learning rate is 0.01, reduced by a factor 10 at epoch 100.

B.2 ImageNet

To learn on ImageNet we use ResNet-50. The optimizer used for training is SGD with Nesterov momentum 0.9 for 200 epochs, with mini-batch size 4096 (32 cores with a per-core batch size of 128) and weight decay $5\times10^{-5}$. The learning rate is 1.6, reduced by a factor 10 at epochs 66, 133 and 177.

B.3 Two Moons with Random Features

We report in Figure 8 and Figure 9 some more results on a synthetic binary classification problem (the noisy two-moon problem), where we train a logistic regression model with random Fourier features (Rahimi and Recht, 2008). The convexity of this problem allowed us to avoid convergence issues while still working with nonlinear models of the input points. For each experimental result we report the mean and 95% confidence interval over 30 repetitions on 30 different instances of the data. In these results we notice behaviors similar to those reported for CIFAR-10, CIFAR-100 and ImageNet in the main paper.

Fig. 8: From left to right: train loss, train accuracy and test accuracy during optimization of a logistic regression model trained on the noisy two-moon problem with the Mixup, Mixup (rescaled), approximate Mixup and ERM risks.

Fig. 9: Histograms of the confidence of predictions $1/(1+e^{-f(x)})$ on test points, for models trained with different techniques (left: Mixup, ERM, ERM + label smoothing; right: Mixup (rescaled), Mixup, approximate Mixup).

Data generation. To generate the data we use the sklearn.datasets.make_moons function from the scikit-learn library. We create $n = 300$ points with noise = 0.01, and split them into 50% for training and 50% for testing. We then randomly flip 20% of the training labels to make the learning task more difficult. We repeat this pipeline 30 times with 30 different random seeds.

Function space. Let $M = 1000$, $w \in \mathbb{R}^M$, and $\varphi : \mathbb{R}^d \to \mathbb{R}^M$ be the feature map defined as $\varphi(x) = \frac{1}{\sqrt{M}}\cos(Sx + B)$, where $\cos : \mathbb{R}^M \to \mathbb{R}^M$ is the element-wise cosine function, $S \in \mathbb{R}^{M\times d}$ is a random matrix such that $S_{i,j} \sim \mathcal{N}(0,\sigma^2)$ for $i \in [M]$, $j \in [d]$, with $\sigma = 10$, and $B \in \mathbb{R}^M$ is such that $B_i \sim \mathrm{Unif}(0, 2\pi)$ for $i \in [M]$. The space of candidate solutions $\mathcal{H}$ we consider is the class of functions of the form $f(x) = w^\top\varphi(x)$ (see the sketch below).

Optimization. To minimize any functional we use stochastic gradient descent with mini-batching, with mini-batch size $b = 50$ and step size $\gamma = 5$.
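The following sketch (ours; names are illustrative) implements the random feature map described in the Function space paragraph above, following the stated distributions for $S$ and $B$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, sigma = 1000, 2, 10.
S = sigma * rng.normal(size=(M, d))        # S_ij ~ N(0, sigma^2)
B = rng.uniform(0., 2. * np.pi, size=M)    # B_i ~ Unif(0, 2*pi)

def phi(x):
    # x: (n, d) -> features: (n, M); candidate models are f(x) = phi(x) @ w
    return np.cos(x @ S.T + B) / np.sqrt(M)
```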
References

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 7411–7422. Curran Associates, Inc., 2019.

Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 233–242. JMLR.org, 2017.

Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 9453–9463. Curran Associates, Inc., 2019.

Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6241–6250. Curran Associates Inc., 2017.

David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. MixMatch: A holistic approach to semi-supervised learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., 2019.

Chris M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.

Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems, volume 20, pages 161–168. Curran Associates, Inc., 2008.

Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Luigi Carratino, Moustapha Cissé, Rodolphe Jenatton, and Jean-Philippe Vert. On mixup regularization. Technical Report 2006.06049, arXiv, 2020.

Moustapha Cissé, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 854–863. JMLR.org, 2017.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1050–1059. PMLR, 2016.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 5767–5777. Curran Associates, Inc., 2017.
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1321–1330. JMLR.org, 2017.

Hongyu Guo. Nonlinear Mixup: Out-of-manifold data augmentation for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4044–4051, 2020.

Hongyu Guo, Yongyi Mao, and Richong Zhang. Mixup as locally linear out-of-manifold regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3714–3722, 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 2263–2273. Curran Associates Inc., 2017.

Dan Hendrycks and Thomas G. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.

Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 15262–15271. Computer Vision Foundation / IEEE, 2021.

Geoffrey E. Hinton. Learning translation invariant recognition in massively parallel networks. In International Conference on Parallel Architectures and Languages Europe, pages 1–13. Springer, 1987.

Beyrem Khalfaoui, Joseph Boyd, and Jean-Philippe Vert. ASNI: Adaptive structured noise injection for shallow and deep neural networks. arXiv preprint arXiv:1909.09819, 2019.

Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In J. Moody, S. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 950–957. Morgan-Kaufmann, 1991.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Rafael Müller, Simon Kornblith, and Geoffrey E. Hinton. When does label smoothing help? In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 4694–4703. Curran Associates, Inc., 2019.

Behnam Neyshabur. Implicit regularization in deep learning. arXiv preprint arXiv:1709.01953, 2017.

Behnam Neyshabur, Russ R. Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, pages 2422–2430, 2015.

Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions.
In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings. OpenReview.net, 2017.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20, pages 1177–1184. Curran Associates, Inc., 2008.

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 5389–5400. PMLR, 2019.

Tim Salimans and Durk P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29, pages 901–909. Curran Associates, Inc., 2016.

Hanie Sedghi, Vineet Gupta, and Philip M. Long. The singular values of convolutional layers. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.

Vaishaal Shankar, Achal Dave, Rebecca Roelofs, Deva Ramanan, Benjamin Recht, and Ludwig Schmidt. Do image classifiers generalize across time? In ICCV, pages 9641–9649. IEEE, 2021.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

Sunil Thulasidasan, Gopinath Chennupati, Jeff A. Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Yuji Tokozume, Yoshitaka Ushiku, and Tatsuya Harada. Between-class learning for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5486–5494, 2018.

Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6438–6447, Long Beach, California, USA, 09-15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/verma19a.html.

Stefan Wager, Sida Wang, and Percy S. Liang. Dropout training as adaptive regularization. In C. J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26, pages 351–359. Curran Associates, Inc., 2013.

Colin Wei, Sham Kakade, and Tengyu Ma. The implicit and explicit regularization effects of dropout. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 10181–10192.
PMLR, 13-18 Jul 2020.

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pages 6023–6032, 2019.

Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.

Linjun Zhang, Zhun Deng, Kenji Kawaguchi, Amirata Ghorbani, and James Zou. How does mixup help with robustness and generalization? In International Conference on Learning Representations. OpenReview.net, 2021.