# NOISY FEATURE MIXUP

Published as a conference paper at ICLR 2022

Soon Hoe Lim (Nordita, KTH and Stockholm University; soon.hoe.lim@su.edu), N. Benjamin Erichson* (University of Pittsburgh; erichson@pitt.edu), Francisco Utrera (University of Pittsburgh and ICSI; utrerf@berkeley.edu), Winnie Xu (University of Toronto; winniexu@cs.toronto.edu), Michael W. Mahoney (ICSI and UC Berkeley; mmahoney@stat.berkeley.edu)

ABSTRACT

We introduce Noisy Feature Mixup (NFM), an inexpensive yet effective method for data augmentation that combines the best of interpolation-based training and noise injection schemes. Rather than training with convex combinations of pairs of examples and their labels, we use noise-perturbed convex combinations of pairs of data points in both input and feature space. This method includes mixup and manifold mixup as special cases, but it has additional advantages, including better smoothing of decision boundaries and enabling improved model robustness. We provide theory to understand this, as well as the implicit regularization effects of NFM. Our theory is supported by empirical results demonstrating the advantage of NFM as compared to mixup and manifold mixup. We show that residual networks and vision transformers trained with NFM have favorable trade-offs between predictive accuracy on clean data and robustness with respect to various types of data perturbation across a range of computer vision benchmark datasets.

1 INTRODUCTION

Mitigating over-fitting and improving generalization on test data are central goals in machine learning. One approach to accomplish this is regularization, which can be either data-agnostic or data-dependent (e.g., explicitly requiring the use of domain knowledge or data). Noise injection is a typical example of data-agnostic regularization (Bishop, 1995), where noise can be injected into the input data (An, 1996), the activation functions (Gulcehre et al., 2016), or the hidden layers of deep neural networks (Camuto et al., 2020; Lim et al., 2021). Data augmentation constitutes a different class of regularization methods (Baird, 1992; Chapelle et al., 2001; DeCoste & Schölkopf, 2002), which can also be either data-agnostic or data-dependent. Data augmentation involves training a model not just on the original data, but also on additional data that is properly transformed, and it has led to state-of-the-art results in image recognition (Cireşan et al., 2010; Krizhevsky et al., 2012). The recently proposed data-agnostic method mixup (Zhang et al., 2017) trains a model on linear interpolations of random pairs of examples and their corresponding labels, thereby encouraging the model to behave linearly in between training examples. Both noise injection and mixup have been shown to impose smoothness and increase model robustness to data perturbations (Zhang et al., 2020; Carratino et al., 2020; Lim et al., 2021), which is critical for many safety-sensitive applications (Goodfellow et al., 2018; Madry et al., 2017).

In this paper, we propose and study a simple yet effective data augmentation method, which we call Noisy Feature Mixup (NFM). This method combines mixup and noise injection, thereby inheriting the benefits of both, and it can be seen as a generalization of input mixup (Zhang et al., 2017) and manifold mixup (Verma et al., 2019). Compared to noise injection and mixup, NFM imposes regularization on the largest natural region surrounding the dataset (see Fig. 1), which may help improve robustness and generalization when predicting on out-of-distribution data. Conveniently, NFM can be implemented on top of manifold mixup, introducing minimal computational overhead.
*Equal contribution.

Figure 1: An illustration of how two data points, x1 and x2, are transformed in mixup (top), which produces the interpolation λx1 + (1 − λ)x2, and in noisy feature mixup (NFM) with S := {0} (bottom), which perturbs this interpolation with noise.

Contributions. Our main contributions are as follows.
- We study NFM via the lens of implicit regularization, showing that NFM amplifies the regularizing effects of manifold mixup and noise injection, implicitly reducing the feature-output Jacobians and Hessians according to the mixing and noise levels (see Theorem 1).
- We provide mathematical analysis to show that NFM can improve model robustness when compared to manifold mixup and noise injection. In particular, we show that, under appropriate assumptions, NFM training approximately minimizes an upper bound on the sum of an adversarial loss and feature-dependent regularizers (see Theorem 2).
- We provide empirical results in support of our theoretical findings, showing that NFM improves robustness with respect to various forms of data perturbation across a wide range of state-of-the-art architectures on computer vision benchmark tasks.

In the Supplementary Materials (SM), we provide proofs for our theorems, along with additional theoretical and empirical results that give further insight into NFM. In particular, we show that NFM can implicitly increase the classification margin (see Proposition 1 in SM C) and that the noise injection procedure in NFM can robustify manifold mixup in a probabilistic sense (see Theorem 5 in SM D). We also provide and discuss generalization bounds for NFM (see Theorems 6 and 7 in SM E).

Notation. I denotes the identity matrix, [K] := {1, . . . , K}, the superscript T denotes transposition, ∘ denotes composition, ⊙ denotes the Hadamard product, and 1 denotes the vector with all components equal to one. For a vector v, v_k denotes its kth component and ‖v‖_p denotes its l_p norm for p > 0. conv(X) denotes the convex hull of X. M_λ(a, b) := λa + (1 − λ)b, for random variables a, b, λ. δ_z denotes the Dirac delta function, defined as δ_z(x) = 1 if x = z and δ_z(x) = 0 otherwise. 1_A denotes the indicator function of the set A. For α, β > 0, D_λ := (α/(α + β)) Beta(α + 1, β) + (β/(α + β)) Beta(β + 1, α) denotes a mixture of two Beta distributions. For two vectors a, b, cos(a, b) := ⟨a, b⟩ / (‖a‖_2 ‖b‖_2) denotes their cosine similarity. N(a, b) is the Gaussian distribution with mean a and covariance b.

2 RELATED WORK

Regularization. Regularization refers to any technique that reduces overfitting in machine learning; see (Mahoney & Orecchia, 2011; Mahoney, 2012) and references therein, in particular for a discussion of implicit regularization, a topic that has recently received attention in the context of stochastic gradient optimization applied to neural network models. Traditional regularization techniques such as ridge regression, weight decay, and dropout do not make use of the training data to reduce model capacity. A powerful class of techniques is data augmentation, which constructs additional examples from the training set, e.g., by applying geometric transformations to the original data (Shorten & Khoshgoftaar, 2019). A recently proposed technique is mixup (Zhang et al., 2017), where the examples are created by taking convex combinations of pairs of inputs and their labels. Verma et al.
(2019) extends mixup to hidden representations in deep neural networks. Subsequent works by Greenewald et al. (2021); Yin et al. (2021); Engstrom et al. (2019); Kim et al. (2020a); Yun et al. (2019); Hendrycks et al. (2019) introduce different variants and extensions of mixup. Regularization is also intimately connected to robustness (Hoffman et al., 2019; Sokoli c et al., 2017; Novak et al., 2018; Elsayed et al., 2018; Moosavi-Dezfooli et al., 2019). Adding to the list is NFM, a powerful regularization method that we propose to improve model robustness. Robustness. Model robustness is an increasingly important issue in modern machine learning. Robustness with respect to adversarial examples (Kurakin et al., 2016) can be achieved by adversarial training (Goodfellow et al., 2014; Madry et al., 2017; Utrera et al., 2020). Several works present theoretical justifications to observed robustness and how data augmentation can improve it (Hein & Andriushchenko, 2017; Yang et al., 2020b; Couellan, 2021; Pinot et al., 2019a; 2021; Zhang et al., 2020; 2021; Carratino et al., 2020; Kimura, 2020; Dao et al., 2019; Wu et al., 2020; Gong et al., 2020; Chen et al., 2020). Relatedly, Fawzi et al. (2016); Franceschi et al. (2018); Lim et al. (2021) Published as a conference paper at ICLR 2022 investigate how noise injection can be used to improve robustness. Parallel to this line of work, we provide theory to understand how NFM can improve robustness. Also related is the study of the trade-offs between robustness and accuracy (Min et al., 2020; Zhang et al., 2019; Tsipras et al., 2018; Schmidt et al., 2018; Su et al., 2018; Raghunathan et al., 2020; Yang et al., 2020a). 3 NOISY FEATURE MIXUP Noisy Feature Mixup is a generalization of input mixup (Zhang et al., 2017) and manifold mixup (Verma et al., 2019). The main novelty of NFM against manifold mixup lies in the injection of noise when taking convex combinations of pairs of input and hidden layer features. Fig. 1 illustrates, at a high level, how this modification alters the region in which the resulting augmented data resides. Fig. 2 shows that NFM is most effective at smoothing the decision boundary of the trained classifiers; compared to noise injection and mixup alone, it imposes the strongest smoothness on this dataset. Formally, we consider multi-class classification with K labels. Denote the input space by X Rd and the output space by Y = RK. The classifier, g, is constructed from a learnable map f : X RK, mapping an input x to its label, g(x) = arg maxk f k(x) [K]. We are given a training set, Zn := {(xi, yi)}n i=1, consisting of n pairs of input and one-hot label, with each training pair zi := (xi, yi) X Y drawn i.i.d. from a ground-truth distribution D. We consider training a deep neural network f := fk gk, where gk : X gk(X) maps an input to a hidden representation at layer k, and fk : gk(X) g L(X) := Y maps the hidden representation to a one-hot label at layer L. Here, gk(X) Rdk for k [L], d L := K, g0(x) = x and f0(x) = f(x). Training f using NFM consists of the following steps: 1. Select a random layer k from a set, S {0} [L], of eligible layers in the neural network. 2. Process two random data minibatches (x, y) and (x , y ) as usual, until reaching layer k. This gives us two immediate minibatches (gk(x), y) and (gk(x ), y ). 3. Perform mixup on these intermediate minibatches, producing the mixed minibatch: ( gk, y) := (Mλ(gk(x), gk(x )), Mλ(y, y )), (1) where the mixing level λ Beta(α, β), with the hyper-parameters α, β > 0. 4. 
Produce the noisy mixed minibatch by injecting additive and multiplicative noise:

$$(\tilde{g}_k, \tilde{y}) := \big((1 + \sigma_{\mathrm{mult}}\,\xi^{\mathrm{mult}}_k) \odot M_\lambda(g_k(x), g_k(x')) + \sigma_{\mathrm{add}}\,\xi^{\mathrm{add}}_k,\ M_\lambda(y, y')\big), \qquad (2)$$

where ξ^add_k and ξ^mult_k are R^{d_k}-valued independent random variables modeling the additive and multiplicative noise, respectively, and σ_add, σ_mult ≥ 0 are pre-specified noise levels.

5. Continue the forward pass from layer k until the output, using the noisy mixed minibatch (g̃_k, ỹ).

6. Compute the loss and the gradients that update all the parameters of the network (a minimal code sketch of these steps is given below).

Figure 2: The decision boundaries and test accuracies (in parentheses) for different training schemes on a toy binary classification dataset (see Subsection F.2 for details): baseline (85.5%), dropout (87.0%), weight decay (88.0%), noise injection (87.0%), mixup (84.5%), manifold mixup (88.5%), noisy mixup (89.0%), and NFM (90.0%).

At the level of implementation, following (Verma et al., 2019), we backpropagate gradients through the entire computational graph, including the layers before the mixup layer k. In the case where σ_add = σ_mult = 0, NFM reduces to manifold mixup (Verma et al., 2019). If, in addition, S = {0}, it reduces to the original mixup method (Zhang et al., 2017). The main difference between NFM and manifold mixup lies in the noise injection of the fourth step above. Note that NFM is equivalent to injecting noise into g_k(x), g_k(x') first and then performing mixup on the resulting pair, i.e., the order in which the third and fourth steps occur does not change the resulting noisy mixed minibatch. For simplicity, we have used the same mixing level, noise distribution, and noise levels for all layers in S in our formulation.

Within the above setting, we consider the expected NFM loss:

$$L_{NFM}(f) = \mathbb{E}_{(x,y),(x',y')\sim D}\,\mathbb{E}_{k\sim S}\,\mathbb{E}_{\lambda\sim \mathrm{Beta}(\alpha,\beta)}\,\mathbb{E}_{\xi_k\sim Q}\, l\big(f_k(M_{\lambda,\xi_k}(g_k(x), g_k(x'))),\ M_\lambda(y, y')\big),$$

where l : R^K × R^K → [0, ∞) is a loss function (note that here we have suppressed the dependence of both l and f on the learnable parameter θ in the notation), ξ_k := (ξ^add_k, ξ^mult_k) are drawn from some probability distribution Q with finite first two moments, and M_{λ,ξ_k}(g_k(x), g_k(x')) := (1 + σ_mult ξ^mult_k) ⊙ M_λ(g_k(x), g_k(x')) + σ_add ξ^add_k. NFM seeks to minimize a stochastic approximation of L_NFM(f) by sampling a finite number of k, λ, ξ_k values and using minibatch gradient descent on this loss approximation.

In this section, we provide mathematical analysis to understand NFM. We begin by formulating NFM in the framework of vicinal risk minimization and interpreting NFM as a stochastic learning strategy in Subsection 4.1. Next, we study NFM via the lens of implicit regularization in Subsection 4.2. Our key contribution is Theorem 1, which shows that minimizing the NFM loss function is approximately equivalent to minimizing a sum of the original loss and feature-dependent regularizers, amplifying the regularizing effects of manifold mixup and noise injection according to the mixing and noise levels. In Subsection 4.3, we demonstrate how NFM can enhance model robustness via the lens of distributionally robust optimization. The key result, Theorem 2, shows that the NFM loss is approximately an upper bound on a regularized version of an adversarial loss, and thus training with NFM not only improves robustness but can also mitigate robust over-fitting, a dominant phenomenon where the robust test accuracy starts to decrease during training (Rice et al., 2020).
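To make Steps 1-6 concrete, the following is a minimal sketch of one NFM loss evaluation in PyTorch, assuming a toy network that exposes g_k and f_k through helper methods. The module and method names, the Gaussian noise, and the Beta(α, α) mixing are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of one NFM training step (Steps 1-6 above). The network,
# helper names, and Gaussian noise are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SplitMLP(nn.Module):
    """Toy network exposing g_k (features up to layer k) and f_k (the rest)."""

    def __init__(self, d_in=32, d_hidden=64, n_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Identity(),                                           # k = 0: input space
            nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU()),     # k = 1
            nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU()), # k = 2
        ])
        self.head = nn.Linear(d_hidden, n_classes)

    def features(self, x, k):
        # g_k: forward pass up to (and including) layer k.
        for block in self.blocks[: k + 1]:
            x = block(x)
        return x

    def from_layer(self, h, k):
        # f_k: forward pass from layer k + 1 to the output.
        for block in self.blocks[k + 1:]:
            h = block(h)
        return self.head(h)


def nfm_loss(model, x, y, eligible_layers=(0, 1, 2), alpha=1.0,
             sigma_add=0.4, sigma_mult=0.2):
    """One stochastic evaluation of the NFM loss for a minibatch (x, one-hot y)."""
    k = eligible_layers[int(torch.randint(len(eligible_layers), (1,)))]  # Step 1
    lam = torch.distributions.Beta(alpha, alpha).sample()                # mixing level
    perm = torch.randperm(x.size(0))                                     # second minibatch

    h = model.features(x, k)                                             # Step 2: g_k(x)
    h_mix = lam * h + (1.0 - lam) * h[perm]                              # Step 3: mixup
    y_mix = lam * y + (1.0 - lam) * y[perm]

    # Step 4: additive and multiplicative noise injection (Gaussian here).
    xi_add, xi_mult = torch.randn_like(h_mix), torch.randn_like(h_mix)
    h_noisy = (1.0 + sigma_mult * xi_mult) * h_mix + sigma_add * xi_add

    logits = model.from_layer(h_noisy, k)                                # Step 5
    # Step 6: cross-entropy against the mixed (soft) labels.
    return torch.sum(-y_mix * F.log_softmax(logits, dim=1), dim=1).mean()


model = SplitMLP()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 32)
y = F.one_hot(torch.randint(10, (8,)), num_classes=10).float()
loss = nfm_loss(model, x, y)
loss.backward()   # gradients flow through layers before and after the mixup layer k
opt.step()
```

Note that the mixing and noise injection are applied only during training; at inference the model is evaluated with the plain forward pass, in line with the interpretation of NFM as a stochastic learning strategy in Subsection 4.1.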
4.1 NFM: BEYOND EMPIRICAL RISK MINIMIZATION The standard approach in statistical learning theory (Bousquet et al., 2003) is to select a hypothesis function f : X Y from a pre-defined hypothesis class F to minimize the expected risk with respect to D and to solve the risk minimization problem: inff F R(f) := E(x,y) D[l(f(x), y)], for a suitable choice of loss function l. In practice, we do not have access to the ground-truth distribution. Instead, we find an approximate solution by solving the empirical risk minimization (ERM) problem, in which case D is approximated by the empirical distribution Pn = 1 n Pn i=1 δzi. In other words, in ERM we solve the problem: inff F Rn(f) := 1 n Pn i=1 l(f(xi), yi). However, when the training set is small or the model capacity is large (as is the case for deep neural networks), ERM may suffer from overfitting. Vicinal risk minimization (VRM) is a data augmentation principle introduced in (Vapnik, 2013) that goes beyond ERM, aiming to better estimate expected risk and reduce overfitting. In VRM, a model is trained not simply on the training set, but on samples drawn from a vicinal distribution, that smears the training data to their vicinity. With appropriate choices for this distribution, the VRM approach has resulted in several effective regularization schemes (Chapelle et al., 2001). Input mixup (Zhang et al., 2017) can be viewed as an example of VRM, and it turns out that NFM can be constructed within a VRM framework at the feature level (see Section A in SM). On a high level, NFM can be interpreted as a random procedure that introduces feature-dependent noise into the layers of the deep neural network. Since the noise injections are applied only during training and not inference, NFM is an instance of a stochastic learning strategy. Note that the injection strategy of NFM differs from those of An (1996); Camuto et al. (2020); Lim Published as a conference paper at ICLR 2022 et al. (2021). Here, the structure of the injected noise differs from iteration to iteration (based on the layer chosen) and depends on the training data in a different way. We expect NFM to amplify the benefits of training using either noise injection or mixup alone, as will be shown next. 4.2 IMPLICIT REGULARIZATION OF NFM We consider loss functions of the form l(f(x), y) := h(f(x)) yf(x), which includes standard choices such as the logistic loss and the cross-entropy loss, and recall that f := fk gk. Denote Lstd n := 1 n Pn i=1 l(f(xi), yi) and let Dx be the empirical distribution of training samples {xi}i [n]. We shall show that NFM exhibits a natural form of implicit regularization, i.e., regularization imposed implicitly by the stochastic learning strategy, without explicitly modifying the loss. Let ϵ > 0 be a small parameter. In the sequel, we rescale 1 λ 7 ϵ(1 λ), σadd 7 ϵσadd, σmult 7 ϵσmult, and denote kf and 2 kf as the first and second directional derivative of fk with respect to gk respectively, for k S. By working in the small parameter regime, we can relate the NFM empirical loss LNF M n to the original loss Lstd n and identify the regularizing effects of NFM. Theorem 1. Let ϵ > 0 be a small parameter, and assume that h and f are twice differentiable. 
Then, L^NFM_n = E_{k∼S} L^{NFM(k)}_n, where

$$L^{NFM(k)}_n = L^{std}_n + \epsilon R^{(k)}_1 + \epsilon^2 \tilde{R}^{(k)}_2 + \epsilon^2 \tilde{R}^{(k)}_3 + \epsilon^2 \phi(\epsilon), \qquad (3)$$

with $\tilde{R}^{(k)}_2 = R^{(k)}_2 + \sigma_{\mathrm{add}}^2 R^{\mathrm{add}(k)}_2 + \sigma_{\mathrm{mult}}^2 R^{\mathrm{mult}(k)}_2$ and $\tilde{R}^{(k)}_3 = R^{(k)}_3 + \sigma_{\mathrm{add}}^2 R^{\mathrm{add}(k)}_3 + \sigma_{\mathrm{mult}}^2 R^{\mathrm{mult}(k)}_3$, where

$$R^{\mathrm{add}(k)}_2 = \frac{1}{2n} \sum_{i=1}^n h''(f(x_i))\, \nabla_k f(g_k(x_i))^T\, \mathbb{E}_{\xi_k}\!\big[\xi^{\mathrm{add}}_k (\xi^{\mathrm{add}}_k)^T\big]\, \nabla_k f(g_k(x_i)), \qquad (4)$$

$$R^{\mathrm{mult}(k)}_2 = \frac{1}{2n} \sum_{i=1}^n h''(f(x_i))\, \nabla_k f(g_k(x_i))^T \big(\mathbb{E}_{\xi_k}\!\big[\xi^{\mathrm{mult}}_k (\xi^{\mathrm{mult}}_k)^T\big] \odot g_k(x_i) g_k(x_i)^T\big)\, \nabla_k f(g_k(x_i)), \qquad (5)$$

$$R^{\mathrm{add}(k)}_3 = \frac{1}{2n} \sum_{i=1}^n \big(h'(f(x_i)) - y_i\big)\, \mathbb{E}_{\xi_k}\!\big[(\xi^{\mathrm{add}}_k)^T \nabla^2_k f(g_k(x_i))\, \xi^{\mathrm{add}}_k\big], \qquad (6)$$

$$R^{\mathrm{mult}(k)}_3 = \frac{1}{2n} \sum_{i=1}^n \big(h'(f(x_i)) - y_i\big)\, \mathbb{E}_{\xi_k}\!\big[(\xi^{\mathrm{mult}}_k \odot g_k(x_i))^T \nabla^2_k f(g_k(x_i)) (\xi^{\mathrm{mult}}_k \odot g_k(x_i))\big]. \qquad (7)$$

Here, R^{(k)}_1, R^{(k)}_2, and R^{(k)}_3 are the regularizers associated with the manifold mixup loss (see Theorem 3 in the SM for their explicit expressions), and ϕ is some function such that lim_{ϵ→0} ϕ(ϵ) = 0.

Theorem 1 implies that, when compared to manifold mixup, NFM introduces additional smoothness, regularizing the directional derivatives ∇_k f(g_k(x_i)) and ∇²_k f(g_k(x_i)) with respect to g_k(x_i) according to the noise levels σ_add and σ_mult, and thereby amplifying the regularizing effects of manifold mixup and noise injection. In particular, making ∇²f(x_i) small can lead to smooth decision boundaries (at the input level), while reducing the confidence of model predictions. On the other hand, making the ∇_k f(g_k(x_i)) small can lead to improvements in model robustness, which we discuss next.

4.3 ROBUSTNESS OF NFM

We show that NFM improves model robustness. We do this by considering the following three lenses: (1) implicit regularization and classification margin; (2) distributionally robust optimization; and (3) a probabilistic notion of robustness. We focus on (2) in the main paper. See Sections C-D in the SM and the last paragraph of this subsection for details on (1) and (3).

We now demonstrate how NFM helps adversarial robustness. By extending the analysis of Zhang et al. (2017) and Lamb et al. (2019), we can relate the NFM loss function to the one used for adversarial training, which can be viewed as an instance of distributionally robust optimization (DRO) (Kwon et al., 2020; Kuhn et al., 2019; Rahimian & Mehrotra, 2019) (see also Proposition 3.1 in (Staib & Jegelka, 2017)). DRO provides a framework for local worst-case risk minimization, minimizing the supremum of the risk over an ambiguity set, such as a vicinity of the empirical data distribution.

Following (Lamb et al., 2019), we consider the binary cross-entropy loss, setting h(z) = log(1 + e^z), with the labels y taking values in {0, 1} and the classifier model f : R^d → R. In the following, we assume that the model parameter θ ∈ Θ := {θ : y_i f(x_i) + (y_i − 1) f(x_i) ≥ 0 for all i ∈ [n]}. Note that this set contains the set of all parameters with correct classifications of the training samples (before applying NFM), since {θ : 1_{f(x_i) ≥ 0} = y_i for all i ∈ [n]} ⊆ Θ. Therefore, the condition θ ∈ Θ is satisfied when the model classifies all training labels correctly before applying NFM. Since, in practice, the training error often becomes zero in finite time, we study the effect of NFM on model robustness in the regime θ ∈ Θ. Working in the data-dependent parameter space Θ, we have the following result.

Theorem 2. Let θ ∈ Θ := {θ : y_i f(x_i) + (y_i − 1) f(x_i) ≥ 0 for all i ∈ [n]} be such that ∇_k f(g_k(x_i)) and ∇²_k f(g_k(x_i)) exist for all i ∈ [n], k ∈ S. Assume that f_k(g_k(x_i)) = ∇_k f(g_k(x_i))^T g_k(x_i) and ∇²_k f(g_k(x_i)) = 0 for all i ∈ [n], k ∈ S. In addition, suppose that ‖∇f(x_i)‖_2 > 0 for all i ∈ [n], E_{r∼D_x}[g_k(r)] = 0, and ‖g_k(x_i)‖_2 ≤ c^{(k)}_x √d_k for all i ∈ [n], k ∈ S.
Then,

$$L^{NFM}_n \ \geq\ \frac{1}{n}\sum_{i=1}^n \max_{\|\delta_i\|_2 \leq \epsilon^{mix}_i} l(f(x_i + \delta_i), y_i) + L^{reg}_n + \epsilon^2 \varphi(\epsilon), \qquad (8)$$

where

$$\epsilon^{mix}_i := \epsilon\, \mathbb{E}_{\lambda\sim D_\lambda}[1-\lambda]\ \mathbb{E}_{k\sim S}\Big[ r^{(k)}_i\, c^{(k)}_x\, \frac{\|\nabla_k f(g_k(x_i))\|_2}{\|\nabla f(x_i)\|_2}\, \sqrt{d_k} \Big]$$

and $L^{reg}_n := \frac{1}{2n}\sum_{i=1}^n |h''(f(x_i))|\, (\epsilon^{reg}_i)^2$, with $r^{(k)}_i := |\cos(\nabla_k f(g_k(x_i)), g_k(x_i))|$ and

$$(\epsilon^{reg}_i)^2 := \epsilon^2 \|\nabla_k f(g_k(x_i))\|_2^2 \Big( \mathbb{E}_\lambda[(1-\lambda)]^2\, \mathbb{E}_{x_r}\!\big[\|g_k(x_r)\|_2^2 \cos(\nabla_k f(g_k(x_i)), g_k(x_r))^2\big] + \sigma^2_{\mathrm{add}}\, \mathbb{E}_{\xi_k}\!\big[\|\xi^{\mathrm{add}}_k\|_2^2 \cos(\nabla_k f(g_k(x_i)), \xi^{\mathrm{add}}_k)^2\big] + \sigma^2_{\mathrm{mult}}\, \mathbb{E}_{\xi_k}\!\big[\|\xi^{\mathrm{mult}}_k \odot g_k(x_i)\|_2^2 \cos(\nabla_k f(g_k(x_i)), \xi^{\mathrm{mult}}_k \odot g_k(x_i))^2\big] \Big), \qquad (9)$$

and φ is some function such that lim_{ϵ→0} φ(ϵ) = 0.

The second assumption stated in Theorem 2 is similar to the one made in Lamb et al. (2019) and Zhang et al. (2020), and is satisfied by linear models and by deep neural networks with the ReLU activation function and max-pooling. Theorem 2 shows that the NFM loss is approximately an upper bound on the adversarial loss with l_2 attacks of size ϵ^mix = min_{i∈[n]} ϵ^mix_i, plus a feature-dependent regularization term L^reg_n (see the SM for further discussion). Therefore, minimizing the NFM loss not only results in a small adversarial loss while retaining the robustness benefits of manifold mixup, but also imposes additional smoothness, due to noise injection, on the adversarial loss. The latter can help mitigate robust overfitting and improve test performance (Rice et al., 2020; Rebuffi et al., 2021). NFM can also implicitly increase the classification margin (see Section C of the SM). Moreover, since the main novelty of NFM lies in the introduction of noise injection, it is insightful to isolate the robustness-boosting benefits of injecting noise on top of manifold mixup. We demonstrate these advantages via the lens of probabilistic robustness in Section D of the SM.

5 EMPIRICAL RESULTS

In this section, we study the test performance of models trained with NFM and examine to what extent NFM can improve robustness to input perturbations. We demonstrate the trade-off between predictive accuracy on clean and perturbed test sets. We consider input perturbations that are common in the literature: (a) white noise; (b) salt and pepper; and (c) adversarial perturbations (see Section F). We evaluate the average performance of NFM with different model architectures on CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), ImageNet (Deng et al., 2009), and CIFAR-10-C (Hendrycks & Dietterich, 2019). We use a pre-activated residual network (ResNet) with depth 18 (He et al., 2016) on small-scale tasks. For more challenging tasks, we consider the performance of Wide ResNet-18 (Zagoruyko & Komodakis, 2016) and ResNet-50 architectures, respectively.

Baselines. We evaluate against related data augmentation schemes that have shown performance improvements in recent years: mixup (Zhang et al., 2017); manifold mixup (Verma et al., 2019); CutMix (Yun et al., 2019); Puzzle Mix (Kim et al., 2020b); and noisy mixup (Yang et al., 2020b). Further, we compare to vanilla models trained without data augmentation (baseline), models trained with label smoothing, and models trained on white-noise-perturbed inputs.

Experimental details. All hyperparameters are consistent with those of the baseline model across the ablation experiments. Across the models trained with the different data augmentation schemes, we keep α fixed, i.e., the parameter defining the distribution Beta(α, α) from which the mixing parameter λ, controlling the convex combination between pairs of data points, is sampled.
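For reference, the white noise and salt-and-pepper corruptions used in the robustness evaluations of this section can be sketched as follows; the exact parameterization (σ as the standard deviation of additive Gaussian noise on inputs scaled to [0, 1], and γ as the per-pixel flip probability) is an illustrative assumption rather than the exact evaluation code.

```python
# A minimal sketch of the two non-adversarial test-time corruptions considered
# in this section; sigma and gamma are parameterized as described in the
# lead-in, which is an assumption for illustration.
import torch


def white_noise(x, sigma):
    """Additive Gaussian noise of standard deviation sigma, clipped to [0, 1]."""
    return torch.clamp(x + sigma * torch.randn_like(x), 0.0, 1.0)


def salt_and_pepper(x, gamma):
    """Independently set a fraction ~gamma of pixels to 0 (pepper) or 1 (salt)."""
    mask = torch.rand_like(x)
    x = torch.where(mask < gamma / 2, torch.zeros_like(x), x)      # pepper
    x = torch.where(mask > 1 - gamma / 2, torch.ones_like(x), x)   # salt
    return x


def corrupted_accuracy(model, images, labels, sigma=0.1, gamma=0.02):
    """Evaluate a trained classifier on clean and corrupted copies of a test batch."""
    model.eval()
    with torch.no_grad():
        acc = {}
        for name, x in [("clean", images),
                        ("white_noise", white_noise(images, sigma)),
                        ("salt_pepper", salt_and_pepper(images, gamma))]:
            acc[name] = (model(x).argmax(dim=1) == labels).float().mean().item()
    return acc
```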
Across all models trained with NFM, we control the level of noise injections by fixing the additive noise level to σadd = 0.4 and multiplicative noise to σmult = 0.2. To demonstrate the significant improvements on robustness upon the introduction of these small input perturbations, we show a second model ( * ) that was injected with higher noise levels (i.e., σadd = 1.0, σmult = 0.5). See SM (Section F.5) for further details and comparisons against NFM models trained on various other levels of noise injections. 5.1 CIFAR10 Pre-activated Res Net-18. Table 1 summarizes the performance improvements and indicates a consistent robustness across different α values. The model trained with NFM outperforms the baseline model on the clean test set, while being more robust to input perturbations (Fig. 3; left). This advantage is also displayed in the models trained with mixup and manifold mixup, though in a less pronounced way. Notably, the NFM model is also robust to salt and pepper perturbations and could be significantly more so by further increasing the noise levels (Fig. 3; right). 5.2 CIFAR-100 Wide Res Net-18. Previous work indicates that data augmentation has a positive effect on performance for this dataset (Zhang et al., 2017). Fig. 4 (left) confirms that mixup and manifold mixup improve the generalization performance on clean data and highlights the advantage of data augmentation. The NFM training scheme is also capable of further improving the generalization performance. In 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 75 Baseline Mixup Cut Mix Puzzle Mix Noisy Mixup Manifold Mixup NFM (*) NFM Test Accuracy White Noise (σ) 0.00 0.02 0.04 0.06 0.08 65 Baseline Mixup Cut Mix Puzzle Mix Noisy Mixup Manifold Mixup NFM (*) NFM Salt and Pepper Noise (γ) Figure 3: Pre-actived Res Net-18 evaluated on CIFAR-10 with different training schemes. Shaded regions indicate one standard deviation about the mean. Averaged across 5 random seeds. Table 1: Robustness of Res Net-18 w.r.t. white noise (σ) and salt and pepper (γ) perturbations evaluated on CIFAR-10. The results are averaged over 5 models trained with different seed values. Scheme Clean (%) σ (%) γ (%) 0.1 0.2 0.3 0.02 0.04 0.1 Baseline 94.6 90.4 76.7 56.3 86.3 76.1 55.2 Baseline + Noise 94.4 94.0 87.5 71.2 89.3 82.5 64.9 Baseline + Label Smoothing 95.0 91.3 77.5 56.9 87.7 79.2 60.0 Mixup (α = 1.0) Zhang et al. (2017) 95.6 93.2 85.4 71.8 87.1 76.1 55.2 Cut Mix Yun et al. (2019) 96.3 86.7 60.8 32.4 90.9 81.7 54.7 Puzzle Mix Kim et al. (2020b) 96.3 91.7 78.1 59.9 91.4 81.8 54.4 Manifold Mixup (α = 1.0) Verma et al. (2019) 95.7 92.7 82.7 67.6 88.9 80.2 57.6 Noisy Mixup (α = 1.0) Yang et al. (2020b) 78.9 78.6 66.6 46.7 66.6 53.4 25.9 Noisy Feature Mixup (α = 1.0) 95.4 95.0 91.6 83.0 91.9 87.4 73.3 Published as a conference paper at ICLR 2022 0.00 0.05 0.10 0.15 0.20 0.25 0.30 40 Baseline Mixup Cut Mix Puzzle Mix Noisy Mixup Manifold Mixup NFM (*) NFM Test Accuracy White Noise (σ) 0.00 0.02 0.04 0.06 0.08 40 Salt and Pepper Noise (γ) Figure 4: Wide Res Nets evaluated on CIFAR-100. Averaged across 5 random seeds. Table 2: Robustness of Wide-Res Net-18 w.r.t. white noise (σ) and salt and pepper (γ) perturbations evaluated on CIFAR-100. The results are averaged over 5 models trained with different seed values. Scheme Clean (%) σ (%) γ (%) 0.1 0.2 0.3 0.02 0.04 0.1 Baseline 76.9 64.6 42.0 23.5 58.1 39.8 15.1 Baseline + Noise 76.1 75.2 60.5 37.6 64.9 51.3 23.0 Mixup (α = 1.0) Zhang et al. (2017) 80.3 72.5 54.0 33.4 62.5 43.8 16.2 Cut Mix Yun et al. 
(2019) 77.8 58.3 28.1 13.8 70.3 58. 24.8 Puzzle Mix (200 epochs) Kim et al. (2020b) 78.6 66.2 41.1 22.6 69.4 56.3 23.3 Puzzle Mix (1200 epochs) Kim et al. (2020b) 80.3 53.0 19.1 6.2 69.3 51.9 15.7 Manifold Mixup (α = 1.0) Verma et al. (2019) 79.7 70.5 45.0 23.8 62.1 42.8 14.8 Noisy Mixup (α = 1.0) Yang et al. (2020b) 78.9 78.6 66.6 46.7 66.6 53.4 25.9 Noisy Feature Mixup (α = 1.0) 80.9 80.1 72.1 55.3 72.8 62.1 34.4 Table 3: Robustness of Res Net-50 w.r.t. white noise (σ) and salt and pepper (γ) perturbations evaluated on Image Net. Here, the NFM training scheme improves both the predictive accuracy on clean data and robustness with respect to data perturbations. Scheme Clean (%) σ (%) γ (%) 0.1 0.25 0.5 0.06 0.1 0.15 Baseline 76.0 73.5 67.0 50.1 53.2 50.4 45.0 Manifold Mixup (α = 0.2) Verma et al. (2019) 76.7 74.9 70.3 57.5 58.1 54.6 49.5 Noisy Feature Mixup (α = 0.2) 77.0 76.5 72.0 60.1 58.3 56.0 52.3 Noisy Feature Mixup (α = 1.0) 76.8 76.2 71.7 60.0 60.9 58.8 54.4 addition, we see that the model trained with NFM is less sensitive to both white noise and salt and pepper perturbations. These results are surprising, as robustness is often thought to be at odds with accuracy (Tsipras et al., 2018). However, we demonstrate NFM has the ability to improve both accuracy and robustness. Table 2 indicates that for the same α, NFM can achieve an average test accuracy of 80.9% compared to only 80.3% in the mixup setting. 5.3 IMAGENET Res Net-50. Table 3 similarly shows that NFM improves both the generalization and robustness capacities with respect to data perturbations. Although less pronounced in comparison to previous datasets, NFM shows a favorable trade-off without requiring additional computational resources. Note that due to computational costs, we do not average across multiple seeds and only compare NFM to the baseline and manifold mixup models. 5.4 CIFAR-10C In Figure 6 we use the CIFAR-10C dataset (Hendrycks & Dietterich, 2019) to demonstrate that models trained with NFM are more robust to a range of perturbations on natural images. Figure 6 (left) shows Published as a conference paper at ICLR 2022 0.00 0.02 0.04 0.06 0.08 0.10 0.12 50 Baseline Mixup Cut Mix Puzzle Mix Noisy Mixup Manifold Mixup NFM (*) NFM Test Accuracy Adverserial Noise (ϵ) 0.000 0.025 0.050 0.075 0.100 0.125 0.150 20 Adverserial Noise (ϵ) Figure 5: Pre-actived Res Net-18 evaluated on CIFAR-10 (left) and Wide Res Net-18 evaluated on CIFAR-100 (right) with respect to adversarially perturbed inputs. 100 Baseline Cut Mix M. Mixup Mixup Puzz Mix NFM NFM (*) Test Accuracy gaussian jpeg impulse shot snow speckle 20 Figure 6: Pre-actived Res Net-18 evaluated on CIFAR-10c. the average test accuracy across six selected perturbations and demonstrates the advantage of NFM being particularly pronounced with the progression of severity levels. The right figure shows the performance on the same set of six perturbations for the median severity level 3. NFM excels on Gaussian, impulse, speckle and shot noise, and is competitive with the rest on the snow perturbation. 5.5 ROBUSTNESS TO ADVERSARIAL EXAMPLES So far we have only considered white noise and salt and pepper perturbations. We further consider adversarial perturbations. Here, we use projected gradient decent (Madry et al., 2017) with 7 iterations and various ϵ levels to construct the adversarial perturbations. Fig. 
5 highlights the improved resilience of Res Nets trained with NFM to adversarial input perturbations and shows this consistently on both CIFAR-10 (left) and CIFAR-100 (right). Models trained with both mixup and manifold mixup do not show a substantially increased resilience to adversarial perturbations. In Section F.6, we compare NFM to models that are adversarially trained. There, we see that adversarially trained models are indeed more robust to adversarial attacks, while at the same time being less accurate on clean data. However, models trained with NFM show an advantage compared to adversarially trained models when faced with salt and pepper perturbations. 6 CONCLUSION We introduce Noisy Feature Mixup, an effective data augmentation method that combines mixup and noise injection. We identify the implicit regularization effects of NFM, showing that the effects are amplifications of those of manifold mixup and noise injection. Moreover, we demonstrate the benefits of NFM in terms of superior model robustness, both theoretically and experimentally. Our work inspires a range of interesting future directions, including theoretical investigations of the trade-offs between accuracy and robustness for NFM and applications of NFM beyond computer vision tasks. Further, it will be interesting to study whether NFM may also lead to better model calibration by extending the analysis of Thulasidasan et al. (2019); Zhang et al. (2021). Published as a conference paper at ICLR 2022 CODE OF ETHICS We acknowledge that we have read and commit to adhering to the ICLR Code of Ethics. REPRODUCIBILITY The codes that can be used to reproduce the empirical results, as well as description of the data processing steps, presented in this paper are available as a zip file in Supplementary Material at Open Review.net. The codes are also available at https://github.com/erichson/NFM. For the theoretical results, all assumptions, proofs and the related discussions are provided in SM. ACKNOWLEDGMENTS S. H. Lim would like to acknowledge the WINQ Fellowship and the Knut and Alice Wallenberg Foundation for providing support of this work. N. B. Erichson and M. W. Mahoney would like to acknowledge IARPA (contract W911NF20C0035), NSF, and ONR for providing partial support of this work. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred. We are also grateful for the generous support from Amazon AWS. Guozhong An. The effects of adding noise during backpropagation training on a generalization performance. Neural Computation, 8(3):643 674, 1996. Henry S Baird. Document image defect models. In Structured Document Image Analysis, pp. 546 556. Springer, 1992. Chris M Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108 116, 1995. Olivier Bousquet, St ephane Boucheron, and G abor Lugosi. Introduction to statistical learning theory. In Summer school on machine learning, pp. 169 207. Springer, 2003. Alexander Camuto, Matthew Willetts, Umut S ims ekli, Stephen Roberts, and Chris Holmes. Explicit regularisation in Gaussian noise injections. ar Xiv preprint ar Xiv:2007.07368, 2020. Luigi Carratino, Moustapha Ciss e, Rodolphe Jenatton, and Jean-Philippe Vert. On mixup regularization. ar Xiv preprint ar Xiv:2006.06049, 2020. Olivier Chapelle, Jason Weston, L eon Bottou, and Vladimir Vapnik. Vicinal risk minimization. Advances in Neural Information Processing Systems, pp. 416 422, 2001. 
Shuxiao Chen, Edgar Dobriban, and Jane H Lee. A group-theoretic framework for data augmentation. Journal of Machine Learning Research, 21(245):1 71, 2020. Dan Claudiu Cires an, Ueli Meier, Luca Maria Gambardella, and J urgen Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207 3220, 2010. Nicolas Couellan. Probabilistic robustness estimates for feed-forward neural networks. Neural Networks, 142:138 147, 2021. Tri Dao, Albert Gu, Alexander Ratner, Virginia Smith, Chris De Sa, and Christopher R e. A kernel theory of modern data augmentation. In International Conference on Machine Learning, pp. 1528 1537. PMLR, 2019. Dennis De Coste and Bernhard Sch olkopf. Training invariant support vector machines. Machine Learning, 46(1):161 190, 2002. Published as a conference paper at ICLR 2022 Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248 255. Ieee, 2009. Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between highdimensional Gaussians. ar Xiv preprint ar Xiv:1810.08693, 2018. Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211 407, 2014. Gamaleldin F Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, and Samy Bengio. Large margin deep networks for classification. ar Xiv preprint ar Xiv:1803.05598, 2018. Logan Engstrom, Justin Gilmer, Gabriel Goh, Dan Hendrycks, Andrew Ilyas, Aleksander Madry, Reiichiro Nakano, Preetum Nakkiran, Shibani Santurkar, Brandon Tran, Dimitris Tsipras, and Eric Wallace. A discussion of adversarial examples are not bugs, they are features . Distill, 2019. Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Brandon Tran, and Aleksander Madry. Adversarial robustness as a prior for learned representations. Ar Xiv preprint ar Xiv:1906.00945, 2020. Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Robustness of classifiers: from adversarial to random noise. ar Xiv preprint ar Xiv:1608.08967, 2016. Jean-Yves Franceschi, Alhussein Fawzi, and Omar Fawzi. Robustness of classifiers to uniform lp and Gaussian noise. In International Conference on Artificial Intelligence and Statistics, pp. 1280 1288. PMLR, 2018. Alison L Gibbs and Francis Edward Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419 435, 2002. Chengyue Gong, Tongzheng Ren, Mao Ye, and Qiang Liu. Maxup: A simple way to improve generalization of neural network training. ar Xiv preprint ar Xiv:2002.09024, 2020. Ian Goodfellow, Patrick Mc Daniel, and Nicolas Papernot. Making machine learning robust against adversarial inputs. Communications of the ACM, 61(7):56 66, 2018. Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. ar Xiv preprint ar Xiv:1412.6572, 2014. Kristjan Greenewald, Anming Gu, Mikhail Yurochkin, Justin Solomon, and Edward Chien. k-mixup regularization for deep learning via optimal transport. ar Xiv preprint ar Xiv:2106.02933, 2021. Caglar Gulcehre, Marcin Moczulski, Misha Denil, and Yoshua Bengio. Noisy activation functions. In International Conference on Machine Learning, pp. 3059 3068. PMLR, 2016. Ali Hassani, Steven Walton, Nikhil Shah, Abulikemu Abuduweili, Jiachen Li, and Humphrey Shi. Escaping the big data paradigm with compact transformers. 
ar Xiv preprint ar Xiv:2104.05704, 2021. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630 645. Springer, 2016. Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. ar Xiv preprint ar Xiv:1705.08475, 2017. Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019. Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. ar Xiv preprint ar Xiv:1912.02781, 2019. Published as a conference paper at ICLR 2022 Judy Hoffman, Daniel A Roberts, and Sho Yaida. Robust learning with Jacobian regularization. ar Xiv preprint ar Xiv:1908.02729, 2019. Jang-Hyun Kim, Wonho Choo, and Hyun Oh Song. Puzzle mix: Exploiting saliency and local statistics for optimal mixup. In International Conference on Machine Learning, pp. 5275 5285. PMLR, 2020a. Jang-Hyun Kim, Wonho Choo, and Hyun Oh Song. Puzzle mix: Exploiting saliency and local statistics for optimal mixup. In International Conference on Machine Learning, 2020b. Masanari Kimura. Mixup training as the complexity reduction. ar Xiv preprint ar Xiv:2006.06231, 2020. Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical Report, 2009. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097 1105, 2012. Daniel Kuhn, Peyman Mohajerin Esfahani, Viet Anh Nguyen, and Soroosh Shafieezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations Research & Management Science in the Age of Analytics, pp. 130 166. INFORMS, 2019. Alexey Kurakin, Ian Goodfellow, Samy Bengio, et al. Adversarial examples in the physical world, 2016. Yongchan Kwon, Wonyoung Kim, Joong-Ho Won, and Myunghee Cho Paik. Principled learning method for Wasserstein distributionally robust optimization with local perturbations. In International Conference on Machine Learning, pp. 5567 5576. PMLR, 2020. Alex Lamb, Vikas Verma, Juho Kannala, and Yoshua Bengio. Interpolated adversarial training: Achieving robust neural networks without sacrificing too much accuracy. In Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, pp. 95 103, 2019. Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 656 672. IEEE, 2019. Soon Hoe Lim, N Benjamin Erichson, Liam Hodgkinson, and Michael W Mahoney. Noisy recurrent neural networks. ar Xiv preprint ar Xiv:2102.04877, 2021. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. ar Xiv preprint ar Xiv:1706.06083, 2017. M. W. Mahoney. Approximate computation and implicit regularization for very large-scale data analysis. In Proceedings of the 31st ACM Symposium on Principles of Database Systems, pp. 143 154, 2012. M. W. Mahoney and L. Orecchia. Implementing regularization implicitly via approximate eigenvector computation. 
In International Conference on Machine Learning, pp. 121 128, 2011. Yifei Min, Lin Chen, and Amin Karbasi. The curious case of adversarially robust models: More data can help, double descend, or hurt generalization. ar Xiv preprint ar Xiv:2002.11080, 2020. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9078 9086, 2019. Roman Novak, Yasaman Bahri, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl Dickstein. Sensitivity and generalization in neural networks: an empirical study. ar Xiv preprint ar Xiv:1802.08760, 2018. Published as a conference paper at ICLR 2022 Sayak Paul and Pin-Yu Chen. Vision transformers are robust learners. ar Xiv preprint ar Xiv:2105.07581, 2021. Rafael Pinot, Laurent Meunier, Alexandre Araujo, Hisashi Kashima, Florian Yger, C edric Gouy Pailler, and Jamal Atif. Theoretical evidence for adversarial robustness through randomization. ar Xiv preprint ar Xiv:1902.01148, 2019a. Rafael Pinot, Florian Yger, C edric Gouy-Pailler, and Jamal Atif. A unified view on differential privacy and robustness to adversarial examples. ar Xiv preprint ar Xiv:1906.07982, 2019b. Rafael Pinot, Laurent Meunier, Florian Yger, C edric Gouy-Pailler, Yann Chevaleyre, and Jamal Atif. On the robustness of randomized classifiers to adversarial examples. ar Xiv preprint ar Xiv:2102.10875, 2021. Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John Duchi, and Percy Liang. Understanding and mitigating the tradeoff between robustness and accuracy. ar Xiv preprint ar Xiv:2002.10716, 2020. Hamed Rahimian and Sanjay Mehrotra. Distributionally robust optimization: A review. ar Xiv preprint ar Xiv:1908.05659, 2019. Sylvestre-Alvise Rebuffi, Sven Gowal, Dan A Calian, Florian Stimberg, Olivia Wiles, and Timothy Mann. Fixing data augmentation to improve adversarial robustness. ar Xiv preprint ar Xiv:2103.01946, 2021. Leslie Rice, Eric Wong, and Zico Kolter. Overfitting in adversarially robust deep learning. In International Conference on Machine Learning, pp. 8093 8104. PMLR, 2020. Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. ar Xiv preprint ar Xiv:1804.11285, 2018. Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, and Cho-Jui Hsieh. On the adversarial robustness of visual transformers. ar Xiv preprint ar Xiv:2103.15670, 2021. Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1 48, 2019. Jure Sokoli c, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265 4280, 2017. Matthew Staib and Stefanie Jegelka. Distributionally robust deep learning as a generalization of adversarial training. In NIPS workshop on Machine Learning and Computer Security, volume 3, pp. 4, 2017. Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, and Yupeng Gao. Is robustness the cost of accuracy? a comprehensive study on the robustness of 18 deep image classification models. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 631 648, 2018. Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. 
ar Xiv preprint ar Xiv:1905.11001, 2019. Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. ar Xiv preprint ar Xiv:1805.12152, 2018. Francisco Utrera, Evan Kravitz, N Benjamin Erichson, Rajiv Khanna, and Michael W Mahoney. Adversarially-trained deep nets transfer better. ar Xiv preprint ar Xiv:2007.05869, 2020. Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013. Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pp. 6438 6447. PMLR, 2019. Published as a conference paper at ICLR 2022 Colin Wei and Tengyu Ma. Data-dependent sample complexity of deep neural networks via Lipschitz augmentation. ar Xiv preprint ar Xiv:1905.03684, 2019a. Colin Wei and Tengyu Ma. Improved sample complexities for deep networks and robust classification via an all-layer margin. ar Xiv preprint ar Xiv:1910.04284, 2019b. Sen Wu, Hongyang Zhang, Gregory Valiant, and Christopher R e. On the generalization effects of linear transformations in data augmentation. In International Conference on Machine Learning, pp. 10410 10420. PMLR, 2020. Yao-Yuan Yang, Cyrus Rashtchian, Hongyang Zhang, Ruslan Salakhutdinov, and Kamalika Chaudhuri. A closer look at accuracy vs. robustness. ar Xiv preprint ar Xiv:2003.02460, 2020a. Yaoqing Yang, Rajiv Khanna, Yaodong Yu, Amir Gholami, Kurt Keutzer, Joseph E Gonzalez, Kannan Ramchandran, and Michael W Mahoney. Boundary thickness and robustness in learning models. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6223 6234, 2020b. Wenpeng Yin, Huan Wang, Jin Qu, and Caiming Xiong. Batch Mixup: Improving training by interpolating hidden states of the entire mini-batch. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4908 4912, 2021. Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision, pp. 6023 6032, 2019. Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. ar Xiv preprint ar Xiv:1605.07146, 2016. Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp. 7472 7482. PMLR, 2019. Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. Mixup: Beyond empirical risk minimization. ar Xiv preprint ar Xiv:1710.09412, 2017. Linjun Zhang, Zhun Deng, Kenji Kawaguchi, Amirata Ghorbani, and James Zou. How does mixup help with robustness and generalization? ar Xiv preprint ar Xiv:2010.04819, 2020. Linjun Zhang, Zhun Deng, Kenji Kawaguchi, and James Zou. When and how mixup improves calibration. ar Xiv preprint ar Xiv:2102.06289, 2021. Published as a conference paper at ICLR 2022 Supplementary Material (SM) for Noisy Feature Mixup Organizational Details. This SM is organized as follows. In Section A, we study the regularizing effects of NFM within the vicinal risk minimization framework, relating the effects to those of mixup and noise injection. In Section B, we restate the results presented in the main paper and provide their proof. 
In Section C, we study robutsness of NFM through the lens of implicit regularization, showing that NFM can implicitly increase the classification margin. In Section D, we study robustness of NFM via the lens of probabilistic robustness, showing that noise injection can improve robustness on top of manifold mixup while keeping track of maximal loss in accuracy incurred under attack by tuning the noise levels. In Section E, we provide results on generalization bounds for NFM and their proofs, identifying the mechanisms by which NFM can lead to improved generalization bound. In Section F, we provide additional experimental results and their details. We recall the notation that we use in the main paper as well as this SM. Notation. I denotes identity matrix, [K] := {1, . . . , K}, the superscript T denotes transposition, denotes composition, denotes Hadamard product, 1 denotes the vector with all components equal one. For a vector v, vk denotes its kth component and v p denotes its lp norm for p > 0. conv(X) denote the convex hull of X. Mλ(a, b) := λa + (1 λ)b, for random variables a, b, λ. δz denotes the Dirac delta function, defined as δz(x) = 1 if x = z and δz(x) = 0 otherwise. 1A denotes indicator function of the set A. For α, β > 0, Dλ := α α+β Beta(α + 1, β) + β α+β Beta(β + 1, α), a uniform mixture of two Beta distributions. For two vectors a, b, cos(a, b) := a, b / a 2 b 2 denotes their cosine similarity. N(a, b) denotes the Gaussian distribution with mean a and covariance b. A NFM THROUGH THE LENS OF VICINAL RISK MINIMIZATION In this section, we shall show that NFM can be constructed within a vicinal risk minimization (VRM) framework at the level of both input and hidden layer representations. To begin with, we define a class of vicinal distributions and then relate NFM to such distributions. Definition 1 (Randomly perturbed feature distribution). Let Zn = {z1, . . . , zn} be a feature set. We say that P n is an ei-randomly perturbed feature distribution if there exists a set {z 1, . . . , z n} such that P n = 1 n Pn i=1 δz i, with z i = zi + ei, for some random variable ei (possibly dependent on Zn) drawn from a probability distribution. Note that the support of an ei-randomly perturbed feature distribution may be larger than that of Z. If Zn is an input dataset and the ei are bounded variables such that ei β for some β 0, then P n is a β-locally perturbed data distribution according to Definition 2 in (Kwon et al., 2020). Examples of β-locally perturbed data distribution include that associated with denoising autoencoder, input mixup, and adversarial training (see Example 1-3 in (Kwon et al., 2020)). Definition 1 can be viewed as an extension of the definition in (Kwon et al., 2020), relaxing the boundedness condition on the ei to cover a wide families of perturbed feature distribution. One simple example is the Gaussian distribution, i.e., when ei N(0, σ2 i ), which models Gaussian noise injection into the features. Another example is the distribution associated with NFM, which we now discuss. To keep the randomly perturbed distribution close to the original distribution, the amplitude of the perturbation should be small. In the sequel, we let ϵ > 0 be a small parameter and rescale 1 λ 7 ϵ(1 λ), σadd 7 ϵσadd and σmult 7 ϵσmult. 
Let Fk be the family of mappings from gk(X) to Y and consider the VRM: inf fk Fk Rn(fk) := E(g k(x),y ) P(k) n [l(fk(g k(x))), y )], (10) where P(k) n = 1 n Pn i=1 δ(g k(xi),y i), with g k(xi) = gk(xi) + ϵe NF M(k) i and y i = yi + ϵey i , for some random variables e NF M(k) i and ey i . Published as a conference paper at ICLR 2022 In NFM, we approximate the ground-truth distribution D using the family of distributions {P(k) n }k S, with a particular choice of (e NF M(k) i , ey i ). In the sequel, we denote NFM at the level of kth layer as NFM(k) (i.e., the particular case when S := {k}). The following lemma identifies the (e NF M(k) i , ey i ) associated with NFM(k) and relates the effects of NFM(k) to those of mixup and noise injection, for any perturbation level ϵ > 0. Lemma 1. Let ϵ > 0 and denote zi(k) := gk(xi). Learning the neural network map f using NFM(k) is a VRM with the (ϵe NF M(k) i , ϵey i )-randomly perturbed feature distribution, P(k) n = 1 n Pn i=1 δ(z i(k),y i), with z i(k) := zi(k)+ϵe NF M(k) i , y i := yi +ϵey i , as the vicinal distribution. Here, ey i = (1 λ)( yi yi), e NF M(k) i = (1 + ϵσmultξmult) emixup(k) i + enoise(k) i , (11) where emixup(k) i = (1 λ)( zi(k) zi(k)), and enoise(k) i = σmultξmult zi(k) + σaddξadd, with zi(k), zi(k) gk(X), λ Beta(α, β) and yi, yi Y. Here, ( zi(k), yi) are drawn randomly from the training set. Therefore, the random perturbation associated to NFM is data-dependent, and it consists of a randomly weighted sum of that from injecting noise into the feature and that from mixing pairs of feature samples. As a simple example, one can take ξadd, ξmult to be independent standard Gaussian random variables, in which case we have enoise(k) i N(0, σ2 add I + σ2 multdiag(zi(k))2), and ei N(0, σ2 add + σ2 mult Mλ(zi(k), zi(k))2) in Lemma 1. We now prove Lemma 1. Proof of Lemma 1. Let k be given and set ϵ = 1 without loss of generality. For every i [n], NFM(k) injects noise on top of a mixed sample z i(k) and outputs: z i (k) = (1 + σmultξmult) z i(k) + σaddξadd (12) = (1 + σmultξmult) (λzi(k) + (1 λ) zi(k)) + σaddξadd (13) = zi(k) + e NF M(k) i , (14) where e NF M(k) i = (1 λ)( zi(k) zi(k)) + σmultξmult (λzi(k) + (1 λ) zi(k)) + σaddξadd. Now, note that applying mixup to the pair (zi(k), zi(k)) results in z i(k) = zi(k) + emixup(k) i , with emixup(k) i = (1 λ)( zi(k) zi(k)), where zi(k), zi(k) gk(X) and λ Beta(α, β), whereas applying noise injection to zi(k) results in (1+σmultξmult) zi(k)+σaddξadd = zi(k)+enoise(k) i , with enoise(k) i = σmultξmult zi(k) + σaddξadd. Rewriting e NF M(k) i in terms of emixup(k) i and enoise(k) i gives e NF M(k) i = (1 + σmultξmult) emixup(k) i + enoise(k) i . (15) Similarly, we can derive the expression for ey i using the same argument. The results in the lemma follow upon applying the rescaling 1 λ 7 ϵ(1 λ), σadd 7 ϵσadd and σmult 7 ϵσmult, for ϵ > 0. B STATEMENTS AND PROOF OF THE RESULTS IN THE MAIN PAPER B.1 COMPLETE STATEMENT OF THEOREM 1 IN THE MAIN PAPER AND THE PROOF We first state the complete statement of Theorem 1 in the main paper. Theorem 3 (Theorem 1 in the main paper). Let ϵ > 0 be a small parameter, and assume that h and f are twice differentiable. 
Then, LNF M n = Ek SLNF M(k) n , where LNF M(k) n = Lstd n + ϵR(k) 1 + ϵ2 R(k) 2 + ϵ2 R(k) 3 + ϵ2ϕ(ϵ), (16) Published as a conference paper at ICLR 2022 R(k) 2 = R(k) 2 + σ2 add Radd(k) 2 + σ2 mult Rmult(k) 2 , (17) R(k) 3 = R(k) 3 + σ2 add Radd(k) 3 + σ2 mult Rmult(k) 3 , (18) R(k) 1 = Eλ Dλ[1 λ] i=1 (h (f(xi) yi) kf(gk(xi))T Exr Dx[gk(xr) gk(xi)], (19) R(k) 2 = Eλ Dλ[(1 λ)2] i=1 h (f(xi)) kf(gk(xi))T Exr Dx[(gk(xr) gk(xi))(gk(xr) gk(xi))T ] kf(gk(xi)), (20) R(k) 3 = Eλ Dλ[(1 λ)2] i=1 (h (f(xi)) yi) Exr Dx[(gk(xr) gk(xi))T 2 kf(gk(xi))(gk(xr) gk(xi))], (21) Radd(k) 2 = 1 i=1 h (f(xi)) kf(gk(xi))T Eξk[ξadd k (ξadd k )T ] kf(gk(xi)), (22) Rmult(k) 2 = 1 i=1 h (f(xi)) kf(gk(xi))T (Eξk[ξmult k (ξmult k )T ] gk(xi)gk(xi)T ) kf(gk(xi)), Radd(k) 3 = 1 i=1 (h (f(xi)) yi)Eξk[(ξadd k )T 2 kf(gk(xi))ξadd k ], (24) Rmult(k) 3 = 1 i=1 (h (f(xi)) yi)Eξk[(ξmult k gk(xi))T 2 kf(gk(xi))(ξmult k gk(xi))], (25) and ϕ(ϵ) = Eλ DλExr Dx Eξk Q[ϕ(ϵ)], with ϕ some function such that limϵ 0 ϕ(ϵ) = 0. Following the setup of Zhang et al. (2020), we provide empirical results to show that the second order Taylor approximation for the NFM loss function is generally accurate (see Figure 7). Recall from the main paper that the NFM loss function to be minimized is LNF M n = Ek SLNF M(k) n , where LNF M(k) n = 1 j=1 Eλ Beta(α,β)Eξk Ql(fk(Mλ,ξk(gk(xi), gk(xj))), Mλ(yi, yj)), (26) Figure 7: Comparison of the original NFM loss with the approximate loss function during training and testing for a two layer Re LU neural network trained on the toy dataset of Subsection F.2. Published as a conference paper at ICLR 2022 where l : RK RK [0, ) is a loss function of the form l(f(x), y) = h(f(x)) yf(x), ξk := (ξadd k , ξmult k ) are drawn from some probability distribution Q with finite first two moments (with zero mean), and Mλ,ξk(gk(x), gk(x )) := (1 + σmultξmult k ) Mλ(gk(x), gk(x )) + σaddξadd k . (27) Before proving Theorem 3, we note that, following the argument of the proof of Lemma 3.1 in Zhang et al. (2020), the loss function minimized by NFM can be written as follows. For completeness, we provide all details of the proof. Lemma 2. The NFM loss (26) can be equivalently written as LNF M n = Ek SLNF M(k) n , where LNF M(k) n = 1 i=1 Eλ DλExr Dx Eξk Q[h(fk(gk(xi)+ϵe NF M(k) i )) yifk(gk(xi)+ϵe NF M(k) i )], (28) with e NF M(k) i = (1 + ϵσmultξmult k ) emixup(k) i + enoise(k) i . (29) Here emixup(k) i = (1 λ)(gk(xr) gk(xi)) and enoise(k) i = σmultξmult k gk(xi) + σaddξadd k , with gk(xi), gk(xr) gk(X) and λ Beta(α, β). Proof of Lemma 2. From (26), we have: LNF M(k) n = 1 j=1 Eλ Beta(α,β)Eξk Ql(fk(Mλ,ξk(gk(xi), gk(xj))), Mλ(yi, yj)). (30) We can rewrite: Eλ Beta(α,β)l(fk(Mλ,ξk(gk(xi), gk(xj))), Mλ(yi, yj)) = Eλ Beta(α,β)[h(fk(Mλ,ξk(gk(xi), gk(xj)))) Mλ(yi, yj)fk(Mλ,ξk(gk(xi), gk(xj)))] (31) = Eλ Beta(α,β)[λ(h(fk(Mλ,ξk(gk(xi), gk(xj)))) yifk(Mλ,ξk(gk(xi), gk(xj)))) + (1 λ)(h(fk(Mλ,ξk(gk(xi), gk(xj)))) yjfk(Mλ,ξk(gk(xi), gk(xj))))] (32) = Eλ Beta(α,β)EB Bern(λ)[B(h(fk(Mλ,ξk(gk(xi), gk(xj)))) yifk(Mλ,ξk(gk(xi), gk(xj)))) + (1 B)(h(fk(Mλ,ξk(gk(xi), gk(xj)))) yjfk(Mλ,ξk(gk(xi), gk(xj))))], (33) where Bern(λ) denotes the Bernoulli distribution with parameter λ (i.e., P[B = 1] = λ and P[B = 0] = 1 λ). Note that λ Beta(α, β) and B|λ Bern(λ). 
By conjugacy, we can switch their order:
$$B \sim Bern\Big(\frac{\alpha}{\alpha+\beta}\Big), \qquad \lambda\,|\,B \sim Beta(\alpha + B,\ \beta + 1 - B), \qquad (34)$$
and arrive at:
$$\mathbb{E}_{\lambda \sim Beta(\alpha,\beta)}\, l(F_{ij}, M_\lambda(y_i, y_j)) = \mathbb{E}_{B \sim Bern(\frac{\alpha}{\alpha+\beta})}\mathbb{E}_{\lambda \sim Beta(\alpha+B,\,\beta+1-B)}\big[B\big(h(F_{ij}) - y_i F_{ij}\big) + (1-B)\big(h(F_{ij}) - y_j F_{ij}\big)\big] \qquad (35)$$
$$= \frac{\alpha}{\alpha+\beta}\,\mathbb{E}_{\lambda \sim Beta(\alpha+1,\beta)}\big[h(F_{ij}) - y_i F_{ij}\big] + \frac{\beta}{\alpha+\beta}\,\mathbb{E}_{\lambda \sim Beta(\alpha,\beta+1)}\big[h(F_{ij}) - y_j F_{ij}\big]. \qquad (36)$$
Using the facts that $Beta(\beta+1, \alpha)$ and $1 - Beta(\alpha, \beta+1)$ have the same distribution and that $M_{1-\lambda}(x_i, x_j) = M_\lambda(x_j, x_i)$, we have:
$$\sum_{i,j} \mathbb{E}_{\lambda \sim Beta(\alpha,\beta+1)}\big[h(F_{ij}) - y_j F_{ij}\big] = \sum_{i,j} \mathbb{E}_{\lambda \sim Beta(\beta+1,\alpha)}\big[h(F_{ij}) - y_i F_{ij}\big]. \qquad (37)$$
Therefore, denoting by $\mathcal{D}_\lambda := \frac{\alpha}{\alpha+\beta} Beta(\alpha+1, \beta) + \frac{\beta}{\alpha+\beta} Beta(\beta+1, \alpha)$ and by $\mathcal{D}_x := \frac{1}{n}\sum_{j=1}^n \delta_{x_j}$ the empirical distribution induced by the training samples $\{x_j\}_{j \in [n]}$, we have:
$$L_n^{NFM(k)} = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}\mathbb{E}_{x_r \sim \mathcal{D}_x}\mathbb{E}_{\xi_k \sim \mathcal{Q}}\big[h\big(f_k(M_{\lambda,\xi_k}(g_k(x_i), g_k(x_r)))\big) - y_i f_k\big(M_{\lambda,\xi_k}(g_k(x_i), g_k(x_r))\big)\big]. \qquad (38)$$
The statement of the lemma follows upon substituting the identity $M_{\lambda,\xi_k}(g_k(x_i), g_k(x_r)) = g_k(x_i) + \epsilon e_i^{NFM(k)}$ into the above equation.
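The Beta-Bernoulli swap in (34) is easy to verify numerically. The following sketch (NumPy, illustrative only) draws the pair $(\lambda, B)$ in both orders and compares the marginal of $B$ and the conditional means of $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, n = 2.0, 5.0, 1_000_000

# Forward direction: lambda ~ Beta(alpha, beta), then B | lambda ~ Bern(lambda)
lam = rng.beta(alpha, beta, size=n)
B = rng.binomial(1, lam)

# Swapped direction (Eq. 34): B ~ Bern(alpha / (alpha + beta)),
# then lambda | B ~ Beta(alpha + B, beta + 1 - B)
B2 = rng.binomial(1, alpha / (alpha + beta), size=n)
lam2 = rng.beta(alpha + B2, beta + 1 - B2)

# The joint laws of (lambda, B) and (lam2, B2) should agree:
print(B.mean(), B2.mean())                        # both ~ alpha / (alpha + beta)
print(lam[B == 1].mean(), lam2[B2 == 1].mean())   # both ~ mean of Beta(alpha + 1, beta)
print(lam[B == 0].mean(), lam2[B2 == 0].mean())   # both ~ mean of Beta(alpha, beta + 1)
```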
With this lemma in hand, we now prove Theorem 3.

Proof of Theorem 3. Denote $\psi_i(\epsilon) := h\big(f_k(g_k(x_i) + \epsilon e_i^{NFM(k)})\big) - y_i f_k\big(g_k(x_i) + \epsilon e_i^{NFM(k)}\big)$, where $e_i^{NFM(k)}$ is given in (29). Since $h$ and $f_k$ are twice differentiable by assumption, $\psi_i$ is twice differentiable in $\epsilon$, and
$$\psi_i(\epsilon) = \psi_i(0) + \epsilon\psi_i'(0) + \tfrac{\epsilon^2}{2}\psi_i''(0) + \epsilon^2\phi_i(\epsilon), \qquad (39)$$
where $\phi_i$ is some function such that $\lim_{\epsilon \to 0}\phi_i(\epsilon) = 0$. Therefore, by Lemma 2, $L_n^{NFM} = \mathbb{E}_{k \sim \mathcal{S}}\, L_n^{NFM(k)}$, where
$$L_n^{NFM(k)} = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}\mathbb{E}_{x_r \sim \mathcal{D}_x}\mathbb{E}_{\xi_k \sim \mathcal{Q}}[\psi_i(\epsilon)] \qquad (40)$$
$$= \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}\mathbb{E}_{x_r \sim \mathcal{D}_x}\mathbb{E}_{\xi_k \sim \mathcal{Q}}\Big[\psi_i(0) + \epsilon\psi_i'(0) + \tfrac{\epsilon^2}{2}\psi_i''(0) + \epsilon^2\phi_i(\epsilon)\Big] \qquad (41)$$
$$= \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}\mathbb{E}_{x_r \sim \mathcal{D}_x}\mathbb{E}_{\xi_k \sim \mathcal{Q}}\Big[\psi_i(0) + \epsilon\psi_i'(0) + \tfrac{\epsilon^2}{2}\psi_i''(0)\Big] + \epsilon^2\phi(\epsilon) \qquad (42)$$
$$=: L_n^{std} + \epsilon R_1^{(k)} + \epsilon^2\big(\tilde{R}_2^{(k)} + \tilde{R}_3^{(k)}\big) + \epsilon^2\phi(\epsilon), \qquad (43)$$
where $\phi(\epsilon) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}\mathbb{E}_{x_r \sim \mathcal{D}_x}\mathbb{E}_{\xi_k \sim \mathcal{Q}}[\phi_i(\epsilon)]$. It remains to compute $\psi_i'(0)$ and $\psi_i''(0)$ in order to arrive at the expressions for $R_1^{(k)}$, $\tilde{R}_2^{(k)}$ and $\tilde{R}_3^{(k)}$ presented in Theorem 3.

Denote $\tilde{g}_k(x_i) := g_k(x_i) + \epsilon e_i^{NFM(k)}$ and, to lighten notation, write
$$v_i(\epsilon) := (1-\lambda)(g_k(x_r) - g_k(x_i)) + \sigma_{add}\xi_k^{add} + \sigma_{mult}\xi_k^{mult} \odot g_k(x_i) + \epsilon(1-\lambda)\sigma_{mult}\xi_k^{mult} \odot (g_k(x_r) - g_k(x_i)),$$
so that $e_i^{NFM(k)} = v_i(\epsilon)$. Applying the chain rule, we compute:
$$\psi_i'(\epsilon) = h'\big(f_k(\tilde{g}_k(x_i))\big)\,\nabla f_k(\tilde{g}_k(x_i))^T \tfrac{\partial \tilde{g}_k(x_i)}{\partial\epsilon} - y_i\,\nabla f_k(\tilde{g}_k(x_i))^T \tfrac{\partial \tilde{g}_k(x_i)}{\partial\epsilon} \qquad (44)$$
$$= \big(h'(f_k(\tilde{g}_k(x_i))) - y_i\big)\,\nabla f_k(\tilde{g}_k(x_i))^T \tfrac{\partial \tilde{g}_k(x_i)}{\partial\epsilon} \qquad (45)$$
$$= \big(h'(f_k(\tilde{g}_k(x_i))) - y_i\big)\,\nabla f_k(\tilde{g}_k(x_i))^T e_i^{NFM(k)} \qquad (46)$$
$$= \big(h'(f_k(\tilde{g}_k(x_i))) - y_i\big)\,\nabla f_k(\tilde{g}_k(x_i))^T v_i(\epsilon), \qquad (47)$$
where we have used $\partial\tilde{g}_k(x_i)/\partial\epsilon = e_i^{NFM(k)}$ in the second-to-last line and substituted the expression (29) for $e_i^{NFM(k)}$ in the last line. Hence
$$\psi_i'(0) = \big(h'(f_k(g_k(x_i))) - y_i\big)\,\nabla f_k(g_k(x_i))^T\big[(1-\lambda)(g_k(x_r) - g_k(x_i)) + \sigma_{add}\xi_k^{add} + \sigma_{mult}\xi_k^{mult} \odot g_k(x_i)\big], \qquad (48)$$
$$\mathbb{E}_{\xi_k \sim \mathcal{Q}}\,\psi_i'(0) = \big(h'(f_k(g_k(x_i))) - y_i\big)\,\nabla f_k(g_k(x_i))^T\big[(1-\lambda)(g_k(x_r) - g_k(x_i))\big], \qquad (49)$$
where we have used the assumptions that $\mathbb{E}_{\xi_k \sim \mathcal{Q}}\,\xi_k^{add} = 0$ and $\mathbb{E}_{\xi_k \sim \mathcal{Q}}\,\xi_k^{mult} = 0$. The expression for $R_1^{(k)}$ in the theorem then follows from substituting (49) into (42).

Next, using the chain rule, we have:
$$\psi_i''(\epsilon) = \tfrac{\partial}{\partial\epsilon}\Big[\big(h'(f_k(\tilde{g}_k(x_i))) - y_i\big)\,\nabla f_k(\tilde{g}_k(x_i))^T \tfrac{\partial \tilde{g}_k(x_i)}{\partial\epsilon}\Big] \qquad (50)$$
$$= \Big[\tfrac{\partial}{\partial\epsilon}\big(h'(f_k(\tilde{g}_k(x_i))) - y_i\big)\Big]\,\nabla f_k(\tilde{g}_k(x_i))^T \tfrac{\partial \tilde{g}_k(x_i)}{\partial\epsilon} + \big(h'(f_k(\tilde{g}_k(x_i))) - y_i\big)\,\tfrac{\partial}{\partial\epsilon}\Big[\nabla f_k(\tilde{g}_k(x_i))^T \tfrac{\partial \tilde{g}_k(x_i)}{\partial\epsilon}\Big]. \qquad (51)$$
Note that, applying the chain rule,
$$\tfrac{\partial}{\partial\epsilon}\Big[\nabla f_k(\tilde{g}_k(x_i))^T \tfrac{\partial \tilde{g}_k(x_i)}{\partial\epsilon}\Big] = \tfrac{\partial}{\partial\epsilon}\Big[\nabla f_k(\tilde{g}_k(x_i))^T e_i^{NFM(k)}\Big] \qquad (52)$$
$$= \tfrac{\partial}{\partial\epsilon}\Big[(e_i^{NFM(k)})^T \nabla f_k(\tilde{g}_k(x_i))\Big] \qquad (53)$$
$$= (e_i^{NFM(k)})^T\, \nabla^2 f_k(\tilde{g}_k(x_i))\, \tfrac{\partial \tilde{g}_k(x_i)}{\partial\epsilon} \qquad (54)$$
$$= (e_i^{NFM(k)})^T\, \nabla^2 f_k(\tilde{g}_k(x_i))\, e_i^{NFM(k)}. \qquad (55)$$
Also, using the chain rule again,
$$\tfrac{\partial}{\partial\epsilon}\big(h'(f_k(\tilde{g}_k(x_i))) - y_i\big) = h''\big(f_k(\tilde{g}_k(x_i))\big)\,\nabla f_k(\tilde{g}_k(x_i))^T \tfrac{\partial \tilde{g}_k(x_i)}{\partial\epsilon} \qquad (56)$$
$$= h''\big(f_k(\tilde{g}_k(x_i))\big)\,\nabla f_k(\tilde{g}_k(x_i))^T e_i^{NFM(k)}. \qquad (57)$$

Therefore, we have:
$$\psi_i''(\epsilon) = h''\big(f_k(\tilde{g}_k(x_i))\big)\,\nabla f_k(\tilde{g}_k(x_i))^T e_i^{NFM(k)}(e_i^{NFM(k)})^T\,\nabla f_k(\tilde{g}_k(x_i)) + \big(h'(f_k(\tilde{g}_k(x_i))) - y_i\big)\,(e_i^{NFM(k)})^T\,\nabla^2 f_k(\tilde{g}_k(x_i))\,e_i^{NFM(k)} \qquad (58)$$
$$= h''\big(f_k(\tilde{g}_k(x_i))\big)\,\nabla f_k(\tilde{g}_k(x_i))^T v_i(\epsilon)v_i(\epsilon)^T\,\nabla f_k(\tilde{g}_k(x_i)) + \big(h'(f_k(\tilde{g}_k(x_i))) - y_i\big)\,v_i(\epsilon)^T\,\nabla^2 f_k(\tilde{g}_k(x_i))\,v_i(\epsilon) \qquad (59)$$
$$=: h''\big(f_k(\tilde{g}_k(x_i))\big)\,\nabla f_k(\tilde{g}_k(x_i))^T P_1(\epsilon)\,\nabla f_k(\tilde{g}_k(x_i)) + \big(h'(f_k(\tilde{g}_k(x_i))) - y_i\big)\,P_2(\epsilon), \qquad (60)$$
where we have substituted $e_i^{NFM(k)} = v_i(\epsilon)$ to arrive at the second line. Using the zero-mean assumption on $\xi_k^{add}$ and $\xi_k^{mult}$,
$$\mathbb{E}_{\xi_k \sim \mathcal{Q}}[P_1(\epsilon)] = \mathbb{E}_{\xi_k \sim \mathcal{Q}}\big[v_i(\epsilon)v_i(\epsilon)^T\big] \qquad (61)$$
$$= (1-\lambda)^2(g_k(x_r) - g_k(x_i))(g_k(x_r) - g_k(x_i))^T + \sigma_{add}^2\,\mathbb{E}_{\xi_k}[\xi_k^{add}(\xi_k^{add})^T] + \sigma_{mult}^2\,\mathbb{E}_{\xi_k}\big[(\xi_k^{mult} \odot g_k(x_i))(\xi_k^{mult} \odot g_k(x_i))^T\big] + o(\epsilon) \qquad (62)$$
$$= (1-\lambda)^2(g_k(x_r) - g_k(x_i))(g_k(x_r) - g_k(x_i))^T + \sigma_{add}^2\,\mathbb{E}_{\xi_k}[\xi_k^{add}(\xi_k^{add})^T] + \sigma_{mult}^2\,\mathbb{E}_{\xi_k}\big[\xi_k^{mult}(\xi_k^{mult})^T\big] \odot g_k(x_i)g_k(x_i)^T + o(\epsilon), \qquad (63)$$
as $\epsilon \to 0$, and similarly
$$\mathbb{E}_{\xi_k \sim \mathcal{Q}}[P_2(\epsilon)] = \mathbb{E}_{\xi_k \sim \mathcal{Q}}\big[v_i(\epsilon)^T\,\nabla^2 f_k(\tilde{g}_k(x_i))\,v_i(\epsilon)\big] \qquad (64)$$
$$= (1-\lambda)^2(g_k(x_r) - g_k(x_i))^T\,\nabla^2 f_k(\tilde{g}_k(x_i))\,(g_k(x_r) - g_k(x_i)) + \sigma_{add}^2\,\mathbb{E}_{\xi_k}\big[(\xi_k^{add})^T\,\nabla^2 f_k(\tilde{g}_k(x_i))\,\xi_k^{add}\big] + \sigma_{mult}^2\,\mathbb{E}_{\xi_k}\big[(\xi_k^{mult} \odot g_k(x_i))^T\,\nabla^2 f_k(\tilde{g}_k(x_i))\,(\xi_k^{mult} \odot g_k(x_i))\big] + o(\epsilon). \qquad (65)$$
Now, recall from Eq. (42) that we have
$$L_n^{NFM(k)} = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}\mathbb{E}_{x_r \sim \mathcal{D}_x}\mathbb{E}_{\xi_k \sim \mathcal{Q}}\Big[\psi_i(0) + \epsilon\psi_i'(0) + \tfrac{\epsilon^2}{2}\psi_i''(0)\Big] + \epsilon^2\phi(\epsilon) \qquad (66)$$
$$=: L_n^{std} + \epsilon R_1^{(k)} + \epsilon^2\big(\tilde{R}_2^{(k)} + \tilde{R}_3^{(k)}\big) + \epsilon^2\phi(\epsilon), \qquad (67)$$
where $\psi_i(0) = h(f_k(g_k(x_i))) - y_i f_k(g_k(x_i))$. Also, we have:
$$\mathbb{E}_{\xi_k \sim \mathcal{Q}}[\psi_i''(\epsilon)] = h''\big(f_k(\tilde{g}_k(x_i))\big)\,\nabla f_k(\tilde{g}_k(x_i))^T\,\mathbb{E}_{\xi_k \sim \mathcal{Q}}[P_1(\epsilon)]\,\nabla f_k(\tilde{g}_k(x_i)) + \big(h'(f_k(\tilde{g}_k(x_i))) - y_i\big)\,\mathbb{E}_{\xi_k \sim \mathcal{Q}}[P_2(\epsilon)], \qquad (68)$$
and substituting (63) and (65) into (68) gives the corresponding expanded form (69). Setting $\epsilon = 0$ then yields
$$\mathbb{E}_{\xi_k \sim \mathcal{Q}}[\psi_i''(0)] = h''\big(f_k(g_k(x_i))\big)\,\nabla f_k(g_k(x_i))^T\Big[(1-\lambda)^2(g_k(x_r) - g_k(x_i))(g_k(x_r) - g_k(x_i))^T + \sigma_{add}^2\,\mathbb{E}_{\xi_k}[\xi_k^{add}(\xi_k^{add})^T] + \sigma_{mult}^2\,\mathbb{E}_{\xi_k}[\xi_k^{mult}(\xi_k^{mult})^T] \odot g_k(x_i)g_k(x_i)^T\Big]\nabla f_k(g_k(x_i)) + \big(h'(f_k(g_k(x_i))) - y_i\big)\Big[(1-\lambda)^2(g_k(x_r) - g_k(x_i))^T\,\nabla^2 f_k(g_k(x_i))\,(g_k(x_r) - g_k(x_i)) + \sigma_{add}^2\,\mathbb{E}_{\xi_k}\big[(\xi_k^{add})^T\,\nabla^2 f_k(g_k(x_i))\,\xi_k^{add}\big] + \sigma_{mult}^2\,\mathbb{E}_{\xi_k}\big[(\xi_k^{mult} \odot g_k(x_i))^T\,\nabla^2 f_k(g_k(x_i))\,(\xi_k^{mult} \odot g_k(x_i))\big]\Big]. \qquad (70)$$
The expressions for $\tilde{R}_2^{(k)}$ and $\tilde{R}_3^{(k)}$ in the theorem follow upon substituting (70) into (66).

B.2 THEOREM 2 IN THE MAIN PAPER AND THE PROOF

We first restate Theorem 2 in the main paper and then provide the proof. Recall that we consider the binary cross-entropy loss, setting $h(z) = \log(1 + e^z)$, with the labels $y$ taking values in $\{0, 1\}$ and the classifier model $f : \mathbb{R}^d \to \mathbb{R}$.

Theorem 4 (Theorem 2 in the main paper).
Let $\theta \in \Theta := \{\theta : y_i f(x_i) + (y_i - 1) f(x_i) \geq 0 \text{ for all } i \in [n]\}$ be a point such that $\nabla f_k(g_k(x_i))$ and $\nabla^2 f_k(g_k(x_i))$ exist for all $i \in [n]$, $k \in \mathcal{S}$. Assume that $f_k(g_k(x_i)) = \nabla f_k(g_k(x_i))^T g_k(x_i)$ and $\nabla^2 f_k(g_k(x_i)) = 0$ for all $i \in [n]$, $k \in \mathcal{S}$. In addition, suppose that $\|\nabla f(x_i)\|_2 > 0$ for all $i \in [n]$, $\mathbb{E}_{r \sim \mathcal{D}_x}[g_k(r)] = 0$ and $\|g_k(x_i)\|_2 \geq c_x^{(k)}\sqrt{d_k}$ for all $i \in [n]$, $k \in \mathcal{S}$. Then,
$$L_n^{NFM} \geq \frac{1}{n}\sum_{i=1}^n \max_{\|\delta_i\|_2 \leq \epsilon_i^{mix}} l\big(f(x_i + \delta_i), y_i\big) + L_n^{reg} + \epsilon^2\varphi(\epsilon), \qquad (71)$$
where
$$\epsilon_i^{mix} = \epsilon\,\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[1-\lambda]\,\mathbb{E}_{k \sim \mathcal{S}}\bigg[\frac{\|\nabla f_k(g_k(x_i))\|_2}{\|\nabla f(x_i)\|_2}\, r_i^{(k)} c_x^{(k)}\sqrt{d_k}\bigg], \qquad (72)$$
$$r_i^{(k)} = |\cos(\nabla f_k(g_k(x_i)), g_k(x_i))|, \qquad (73)$$
$$L_n^{reg} = \frac{1}{2n}\sum_{i=1}^n |h''(f(x_i))|\,(\epsilon_i^{reg})^2, \qquad (74)$$
$$(\epsilon_i^{reg})^2 = \epsilon^2\|\nabla f_k(g_k(x_i))\|_2^2\Big(\mathbb{E}_\lambda[1-\lambda]^2\,\mathbb{E}_{x_r}\big[\|g_k(x_r)\|_2^2\cos(\nabla f_k(g_k(x_i)), g_k(x_r))^2\big] + \sigma_{add}^2\,\mathbb{E}_\xi\big[\|\xi^{add}\|_2^2\cos(\nabla f_k(g_k(x_i)), \xi^{add})^2\big] + \sigma_{mult}^2\,\mathbb{E}_\xi\big[\|\xi^{mult} \odot g_k(x_i)\|_2^2\cos(\nabla f_k(g_k(x_i)), \xi^{mult} \odot g_k(x_i))^2\big]\Big), \qquad (75)$$
and $\varphi$ is some function such that $\lim_{\epsilon \to 0}\varphi(\epsilon) = 0$.

Theorem 4 says that $L_n^{NFM}$ is approximately an upper bound on the sum of an adversarial loss with an $l_2$-attack of size $\epsilon^{mix} = \min_i \epsilon_i^{mix}$ and a feature-dependent regularizer of strength $\min_i(\epsilon_i^{reg})^2$. Therefore, minimizing the NFM loss results in a small regularized adversarial loss. We note that both $\epsilon_i^{mix}$ and $\epsilon_i^{reg}$ depend on the cosine similarities between the directional derivatives and the features at which the derivatives are evaluated, whereas the $\epsilon_i^{reg}$ additionally depend on the cosine similarities between the directional derivatives and the injected noise.

Before proving Theorem 4, we remark that the assumption that $f_k(g_k(x_i)) = \nabla f_k(g_k(x_i))^T g_k(x_i)$ and $\nabla^2 f_k(g_k(x_i)) = 0$ for all $i \in [n]$, $k \in \mathcal{S}$ is satisfied by fully connected neural networks with the ReLU activation function or max-pooling. For a proof of this, we refer to Section B.2 in Zhang et al. (2020). The assumption that $\mathbb{E}_{r \sim \mathcal{D}_x}[g_k(r)] = 0$ could be relaxed at the cost of a more complicated formula for the $\epsilon_i^{reg}$ in the bound (see Remark 1 for the formula), which can be derived in a straightforward manner.

Proof of Theorem 4. For $h(z) = \log(1 + e^z)$, we have $h'(z) = \frac{e^z}{1+e^z} =: S(z) \geq 0$ and $h''(z) = \frac{e^z}{(1+e^z)^2} = S(z)(1 - S(z)) \geq 0$. Substituting these expressions into the equations of Theorem 3 and using the assumptions that $f_k(g_k(x_i)) = \nabla f_k(g_k(x_i))^T g_k(x_i)$ and $\mathbb{E}_{r \sim \mathcal{D}_x}[g_k(r)] = 0$, we have, for $k \in \mathcal{S}$,
$$R_1^{(k)} = \frac{\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[1-\lambda]}{n}\sum_{i=1}^n \big(y_i - S(f(x_i))\big)\, f_k(g_k(x_i)), \qquad (76)$$
and we compute:
$$R_2^{(k)} = \frac{\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[(1-\lambda)^2]}{2n}\sum_{i=1}^n S(f(x_i))(1 - S(f(x_i)))\,\nabla f_k(g_k(x_i))^T\,\mathbb{E}_{x_r \sim \mathcal{D}_x}\big[(g_k(x_r) - g_k(x_i))(g_k(x_r) - g_k(x_i))^T\big]\,\nabla f_k(g_k(x_i)) \qquad (77)$$
$$\geq \frac{\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[1-\lambda]^2}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\nabla f_k(g_k(x_i))^T\,\mathbb{E}_{x_r \sim \mathcal{D}_x}\big[(g_k(x_r) - g_k(x_i))(g_k(x_r) - g_k(x_i))^T\big]\,\nabla f_k(g_k(x_i)) \qquad (78)$$
$$= \frac{\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[1-\lambda]^2}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\nabla f_k(g_k(x_i))^T\big(\mathbb{E}_{x_r \sim \mathcal{D}_x}[g_k(x_r)g_k(x_r)^T] + g_k(x_i)g_k(x_i)^T\big)\,\nabla f_k(g_k(x_i)) \qquad (79)$$
$$= \frac{\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[1-\lambda]^2}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\big(\nabla f_k(g_k(x_i))^T g_k(x_i)\big)^2 + \frac{\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[1-\lambda]^2}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\mathbb{E}_{x_r \sim \mathcal{D}_x}\big[\big(\nabla f_k(g_k(x_i))^T g_k(x_r)\big)^2\big] \qquad (80)$$
$$= \frac{\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[1-\lambda]^2}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\|\nabla f_k(g_k(x_i))\|_2^2\,\|g_k(x_i)\|_2^2\cos(\nabla f_k(g_k(x_i)), g_k(x_i))^2 + \frac{1}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\|\nabla f_k(g_k(x_i))\|_2^2\,\mathbb{E}_\lambda[1-\lambda]^2\,\mathbb{E}_{x_r}\big[\|g_k(x_r)\|_2^2\cos(\nabla f_k(g_k(x_i)), g_k(x_r))^2\big] \qquad (81)$$
$$\geq \frac{1}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\|\nabla f_k(g_k(x_i))\|_2^2\,\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[1-\lambda]^2\,d_k\big(r_i^{(k)}c_x^{(k)}\big)^2 + \frac{1}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\|\nabla f_k(g_k(x_i))\|_2^2\,\mathbb{E}_\lambda[1-\lambda]^2\,\mathbb{E}_{x_r}\big[\|g_k(x_r)\|_2^2\cos(\nabla f_k(g_k(x_i)), g_k(x_r))^2\big] \qquad (82)$$
$$= \frac{1}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\|\nabla f(x_i)\|_2^2\,\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[1-\lambda]^2\,\frac{\|\nabla f_k(g_k(x_i))\|_2^2}{\|\nabla f(x_i)\|_2^2}\,d_k\big(r_i^{(k)}c_x^{(k)}\big)^2 + \frac{1}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\|\nabla f_k(g_k(x_i))\|_2^2\,\mathbb{E}_\lambda[1-\lambda]^2\,\mathbb{E}_{x_r}\big[\|g_k(x_r)\|_2^2\cos(\nabla f_k(g_k(x_i)), g_k(x_r))^2\big]. \qquad (83)$$
In the above, we have used the facts that $\mathbb{E}[Z^2] = \mathbb{E}[Z]^2 + \mathrm{Var}(Z) \geq \mathbb{E}[Z]^2$ and $S, S(1-S) \geq 0$ to obtain (78), the assumption that $\mathbb{E}_{r \sim \mathcal{D}_x}[g_k(r)] = 0$ to arrive at (79), the assumption that $\|g_k(x_i)\|_2 \geq c_x^{(k)}\sqrt{d_k}$ for all $i \in [n]$, $k \in \mathcal{S}$ to arrive at (82), and the assumption that $\|\nabla f(x_i)\|_2 > 0$ for all $i \in [n]$ to justify the last equation above.

Next, we bound $R_1^{(k)}$, using the assumption that $\theta \in \Theta$. Note that from our assumption on $\theta$, we have $y_i f(x_i) + (y_i - 1) f(x_i) \geq 0$, which implies that $f(x_i) \geq 0$ if $y_i = 1$ and $f(x_i) \leq 0$ if $y_i = 0$. Thus, if $y_i = 1$, then $(y_i - S(f(x_i))) f_k(g_k(x_i)) = (1 - S(f(x_i))) f_k(g_k(x_i)) \geq 0$, since $f(x_i) \geq 0$ and $1 - S(f(x_i)) \geq 0$ due to the fact that $S(f(x_i)) \in (0, 1)$. A similar argument leads to $(y_i - S(f(x_i))) f_k(g_k(x_i)) \geq 0$ if $y_i = 0$. So, we have $(y_i - S(f(x_i))) f_k(g_k(x_i)) \geq 0$ for all $i \in [n]$. Therefore, noting that $\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[1-\lambda] \geq 0$, we compute:
$$R_1^{(k)} = \frac{\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[1-\lambda]}{n}\sum_{i=1}^n |y_i - S(f(x_i))|\,|f_k(g_k(x_i))| \qquad (84)$$
$$= \frac{\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[1-\lambda]}{n}\sum_{i=1}^n |S(f(x_i)) - y_i|\,\|\nabla f_k(g_k(x_i))\|_2\,\|g_k(x_i)\|_2\,|\cos(\nabla f_k(g_k(x_i)), g_k(x_i))| \qquad (85)$$
$$\geq \frac{1}{n}\sum_{i=1}^n |S(f(x_i)) - y_i|\,\|\nabla f_k(g_k(x_i))\|_2\,\big(\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[1-\lambda]\, r_i^{(k)} c_x^{(k)}\sqrt{d_k}\big) \qquad (86)$$
$$= \frac{1}{n}\sum_{i=1}^n |S(f(x_i)) - y_i|\,\|\nabla f(x_i)\|_2\,\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[1-\lambda]\,\frac{\|\nabla f_k(g_k(x_i))\|_2}{\|\nabla f(x_i)\|_2}\, r_i^{(k)} c_x^{(k)}\sqrt{d_k}. \qquad (87)$$
Note that $R_3^{(k)} = 0$ (and likewise $R_3^{add(k)} = R_3^{mult(k)} = 0$) as a consequence of our assumption that $\nabla^2 f_k(g_k(x_i)) = 0$ for all $i \in [n]$, $k \in \mathcal{S}$, and a similar argument leads to:
$$R_2^{add(k)} = \frac{1}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\nabla f_k(g_k(x_i))^T\,\mathbb{E}_{\xi_k}[\xi_k^{add}(\xi_k^{add})^T]\,\nabla f_k(g_k(x_i)) \qquad (88)$$
$$= \frac{1}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\|\nabla f_k(g_k(x_i))\|_2^2\,\mathbb{E}_{\xi_k}\big[\|\xi_k^{add}\|_2^2\cos(\nabla f_k(g_k(x_i)), \xi_k^{add})^2\big], \qquad (89)$$
$$R_2^{mult(k)} = \frac{1}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\nabla f_k(g_k(x_i))^T\big(\mathbb{E}_{\xi_k}[\xi_k^{mult}(\xi_k^{mult})^T] \odot g_k(x_i)g_k(x_i)^T\big)\,\nabla f_k(g_k(x_i)) = \frac{1}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\|\nabla f_k(g_k(x_i))\|_2^2\,\mathbb{E}_{\xi_k}\big[\|\xi_k^{mult} \odot g_k(x_i)\|_2^2\cos(\nabla f_k(g_k(x_i)), \xi_k^{mult} \odot g_k(x_i))^2\big]. \qquad (90)$$
Using Theorem 3 and the above results, we obtain:
$$L_n^{NFM} = \frac{1}{n}\sum_{i=1}^n l(f(x_i), y_i) + \mathbb{E}_k\big[\epsilon R_1^{(k)} + \epsilon^2 R_2^{(k)} + \epsilon^2\sigma_{add}^2 R_2^{add(k)} + \epsilon^2\sigma_{mult}^2 R_2^{mult(k)}\big] + \epsilon^2\phi(\epsilon) \qquad (91)$$
$$\geq \frac{1}{n}\sum_{i=1}^n l(f(x_i), y_i) + \frac{1}{n}\sum_{i=1}^n |S(f(x_i)) - y_i|\,\|\nabla f(x_i)\|_2\,\epsilon_i^{mix} \qquad (92)$$
$$\quad + \frac{1}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\|\nabla f(x_i)\|_2^2\,(\epsilon_i^{mix})^2 + \frac{1}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,\|\nabla f_k(g_k(x_i))\|_2^2\,\mathbb{E}_\lambda[1-\lambda]^2\,\mathbb{E}_{x_r}\big[\|g_k(x_r)\|_2^2\cos(\nabla f_k(g_k(x_i)), g_k(x_r))^2\big] \qquad (93)$$
$$\quad + \frac{1}{2n}\sum_{i=1}^n |S(f(x_i))(1 - S(f(x_i)))|\,(\epsilon_i^{noise})^2 + \epsilon^2\phi(\epsilon), \qquad (94)$$
where $\epsilon_i^{mix} := \epsilon\,\mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}[1-\lambda]\,\mathbb{E}_k\Big[\frac{\|\nabla f_k(g_k(x_i))\|_2}{\|\nabla f(x_i)\|_2}\, r_i^{(k)} c_x^{(k)}\sqrt{d_k}\Big]$ and
$$(\epsilon_i^{noise})^2 = \epsilon^2\|\nabla f_k(g_k(x_i))\|_2^2\Big(\sigma_{add}^2\,\mathbb{E}_{\xi_k}\big[\|\xi_k^{add}\|_2^2\cos(\nabla f_k(g_k(x_i)), \xi_k^{add})^2\big] + \sigma_{mult}^2\,\mathbb{E}_{\xi_k}\big[\|\xi_k^{mult} \odot g_k(x_i)\|_2^2\cos(\nabla f_k(g_k(x_i)), \xi_k^{mult} \odot g_k(x_i))^2\big]\Big). \qquad (95)$$
On the other hand, for any small parameters $\epsilon_i > 0$ and any inputs $z_1, \ldots, z_n$, we can, using a second-order Taylor expansion and then applying our assumptions, compute:
$$\frac{1}{n}\sum_{i=1}^n \max_{\|\delta_i\|_2 \leq \epsilon_i} l(f(z_i + \delta_i), y_i) \leq \frac{1}{n}\sum_{i=1}^n l(f(z_i), y_i) + \frac{1}{n}\sum_{i=1}^n |S(f(z_i)) - y_i|\,\|\nabla f(z_i)\|_2\,\epsilon_i + \frac{1}{2n}\sum_{i=1}^n |S(f(z_i))(1 - S(f(z_i)))|\,\|\nabla f(z_i)\|_2^2\,\epsilon_i^2 + \frac{1}{n}\sum_{i=1}^n \max_{\|\delta_i\|_2 \leq \epsilon_i} \|\delta_i\|_2^2\,\tilde{\phi}_i(\delta_i) \qquad (96)$$
$$\leq \frac{1}{n}\sum_{i=1}^n l(f(z_i), y_i) + \frac{1}{n}\sum_{i=1}^n |S(f(z_i)) - y_i|\,\|\nabla f(z_i)\|_2\,\epsilon_i + \frac{1}{2n}\sum_{i=1}^n |S(f(z_i))(1 - S(f(z_i)))|\,\|\nabla f(z_i)\|_2^2\,\epsilon_i^2 + \frac{1}{n}\sum_{i=1}^n \epsilon_i^2\,\bar{\phi}_i(\epsilon_i), \qquad (97)$$
where the $\tilde{\phi}_i$ are functions such that $\lim_{z \to 0}\tilde{\phi}_i(z) = 0$, $\bar{\phi}_i(\epsilon_i) := \max_{\|\delta_i\|_2 \leq \epsilon_i}\tilde{\phi}_i(\delta_i)$ and $\lim_{z \to 0}\bar{\phi}_i(z) = 0$. Combining (94) and (97), we see that
$$L_n^{NFM} \geq \frac{1}{n}\sum_{i=1}^n \max_{\|\delta_i^{mix}\|_2 \leq \epsilon_i^{mix}} l\big(f(x_i + \delta_i^{mix}), y_i\big) + L_n^{reg} + \epsilon^2\phi(\epsilon) - \frac{1}{n}\sum_{i=1}^n (\epsilon_i^{mix})^2\,\bar{\phi}_i(\epsilon_i^{mix}) \qquad (98)$$
$$=: \frac{1}{n}\sum_{i=1}^n \max_{\|\delta_i^{mix}\|_2 \leq \epsilon_i^{mix}} l\big(f(x_i + \delta_i^{mix}), y_i\big) + L_n^{reg} + \epsilon^2\varphi(\epsilon), \qquad (99)$$
where $L_n^{reg}$ is defined in the theorem. Noting that $\lim_{\epsilon \to 0}\varphi(\epsilon) = 0$, the proof is done.
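The quantities entering $\epsilon_i^{mix}$ are directly computable for a trained model. The sketch below (illustrative, not taken from the paper) estimates the per-example radius for $k = 0$, i.e. the input layer, where $\|\nabla f_k\|_2 = \|\nabla f\|_2$; it assumes a scalar-output binary classifier as in Theorem 4 and uses $\|x_i\|_2$ as a stand-in for $c_x\sqrt{d}$.

```python
import torch
import torch.nn.functional as F

def mixup_radius(model, x, eps=0.1, mean_one_minus_lambda=0.5):
    """Per-example eps_i^mix of Theorem 4 for k = 0 (input layer):
    eps * E[1 - lambda] * r_i * c_x * sqrt(d), with ||x_i||_2 standing in
    for c_x * sqrt(d).  Assumes model(x) returns one logit per example."""
    x = x.clone().requires_grad_(True)
    grad, = torch.autograd.grad(model(x).sum(), x)     # rows are grad f(x_i)
    g, xf = grad.flatten(1), x.flatten(1)
    r = F.cosine_similarity(g, xf, dim=1).abs()        # r_i = |cos(grad f(x_i), x_i)|
    return eps * mean_one_minus_lambda * r * xf.norm(dim=1)
```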
Remark 1. Had we not assumed that $\mathbb{E}_{r \sim \mathcal{D}_x}[g_k(r)] = 0$, the statement of Theorem 4 would remain unchanged, but with $(\epsilon_i^{reg})^2$ replaced by
$$(\epsilon_i^{reg})^2 = \epsilon^2\|\nabla f_k(g_k(x_i))\|_2^2\Big(\mathbb{E}_\lambda[1-\lambda]^2\,\mathbb{E}_{x_r}\big[\|g_k(x_r)\|_2^2\cos(\nabla f_k(g_k(x_i)), g_k(x_r))^2\big] + \sigma_{add}^2\,\mathbb{E}_\xi\big[\|\xi^{add}\|_2^2\cos(\nabla f_k(g_k(x_i)), \xi^{add})^2\big] + \sigma_{mult}^2\,\mathbb{E}_\xi\big[\|\xi^{mult} \odot g_k(x_i)\|_2^2\cos(\nabla f_k(g_k(x_i)), \xi^{mult} \odot g_k(x_i))^2\big]\Big) - \epsilon^2\,\mathbb{E}_\lambda[1-\lambda]^2\,\nabla f_k(g_k(x_i))^T\big[\mathbb{E}_r[g_k(r)]\,g_k(x_i)^T + g_k(x_i)\,\mathbb{E}_r[g_k(r)]^T\big]\nabla f_k(g_k(x_i)). \qquad (100)$$

C NFM THROUGH THE LENS OF IMPLICIT REGULARIZATION AND CLASSIFICATION MARGIN

First, we define the classification margin at the input level. We shall show that minimizing the NFM loss can lead to an increase in the classification margin, and therefore improve model robustness in this sense.

Definition 2 (Classification margin). The classification margin of a training input-label sample $s_i := (x_i, c_i)$, measured by the Euclidean metric $d$, is the radius of the largest $d$-metric ball in $\mathcal{X}$ centered at $x_i$ that is contained in the decision region associated with the class label $c_i$, i.e.,
$$\gamma_d(s_i) = \sup\{a : d(x_i, x) \leq a \implies g(x) = c_i,\ \forall x\}.$$

Intuitively, a larger classification margin allows a classifier to associate a larger region centered on a point $x_i$ in the input space with the same class. This makes the classifier less sensitive to input perturbations: a perturbation of $x_i$ is still likely to fall within this region, leaving the classifier's prediction unchanged. In this sense, the classifier becomes more robust. Typically, the network is trained with a loss (cross-entropy) that promotes separation of the different classes in the network output. This, in turn, maximizes a certain notion of score of each training sample (Sokolić et al., 2017).

Definition 3 (Score). For an input-label training sample $s_i = (x_i, c_i)$, we define its score as $o(s_i) = \min_{j \neq c_i} \sqrt{2}\,(e_{c_i} - e_j)^T f(x_i) \geq 0$, where $e_i \in \mathbb{R}^K$ is the Kronecker delta vector (one-hot vector) with $(e_i)_i = 1$ and $(e_i)_j = 0$ for $j \neq i$.

A positive score implies that, at the network output, classes are separated by a margin that corresponds to the score. A large score may not imply a large classification margin, but the score can be related to the classification margin via the following bound.

Proposition 1. Assume that the score satisfies $o(s_i) > 0$ and let $k \in \mathcal{S}$. Then, the classification margin of the training sample $s_i$ can be lower bounded as:
$$\gamma_d(s_i) \geq \frac{C(s_i)}{\sup_{x \in conv(\mathcal{X})}\|\nabla f_k(g_k(x))\|_2}, \qquad (101)$$
where $C(s_i) = o(s_i)/\sup_{x \in conv(\mathcal{X})}\|\nabla g_k(x)\|_2$.

Since NFM implicitly reduces the feature-output Jacobians $\nabla f_k$ (including the input-output Jacobian) according to the mixing level and noise levels (see Proposition 3), this, together with Theorem 1, suggests that applying NFM implicitly increases the classification margin, thereby making the model more robust to input perturbations. We note that a similar, albeit more involved, bound can also be obtained for the all-layer margin, a more refined version of the classification margin introduced in (Wei & Ma, 2019b), and the conclusion that applying NFM implicitly increases the margin also holds. We now prove the proposition.

Proof of Proposition 1. Note that, for any $k \in \mathcal{S}$, $\nabla f(x) = \nabla f_k(g_k(x))\,\nabla g_k(x)$ by the chain rule, and so
$$\|\nabla f(x)\|_2 \leq \|\nabla f_k(g_k(x))\|_2\,\|\nabla g_k(x)\|_2 \qquad (102)$$
$$\leq \sup_{x \in conv(\mathcal{X})}\|\nabla f_k(g_k(x))\|_2\,\sup_{x \in conv(\mathcal{X})}\|\nabla g_k(x)\|_2. \qquad (103)$$
The statement in the proposition follows from a straightforward application of Theorem 4 in (Sokolić et al., 2017) together with the above bound.
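For intuition, the score of Definition 3 is a one-line computation on the logits; a minimal PyTorch sketch (illustrative only) is:

```python
import torch
import torch.nn.functional as F

def score(logits, labels):
    """o(s_i) = min_{j != c_i} sqrt(2) * (f(x_i)_{c_i} - f(x_i)_j), i.e. the
    (scaled) gap between the true-class logit and the runner-up logit."""
    _, K = logits.shape
    true_logit = logits.gather(1, labels.view(-1, 1))
    others = logits.masked_fill(F.one_hot(labels, K).bool(), float('-inf'))
    runner_up = others.max(dim=1, keepdim=True).values
    return (2 ** 0.5) * (true_logit - runner_up).squeeze(1)
```

Plugging this into the right-hand side of (101) would also require estimates of the two suprema, e.g. obtained by power iteration over a sample of inputs; the bound then makes explicit how shrinking $\|\nabla f_k\|$, as NFM implicitly does, enlarges the guaranteed margin.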
D NFM THROUGH THE LENS OF PROBABILISTIC ROBUSTNESS

Since the main novelty of NFM lies in the introduction of noise injection, it is insightful to isolate the robustness-boosting benefit of injecting noise on top of manifold mixup. We demonstrate this isolated benefit in this section. The key idea is based on the observation that manifold mixup produces minibatch outputs that lie in the convex hull of the feature space at each iteration. Therefore, for $k \in \mathcal{S}$, NFM(k) can be viewed as injecting noise into layer-$k$ features sampled from some distribution over $conv(g_k(\mathcal{X}))$, and so the NFM(k) neural network $F_k$ can be viewed as a probabilistic mapping from $conv(g_k(\mathcal{X}))$ to $\mathcal{P}(\mathcal{Y})$, the space of probability distributions on $\mathcal{Y}$.

To isolate the benefit of noise injection, we adapt the approach of (Pinot et al., 2019a; 2021) to our setting to show that the Gaussian noise injection procedure in NFM robustifies manifold mixup in a probabilistic sense. At its core, this probabilistic notion of robustness amounts to making the model locally Lipschitz with respect to some distance on the input and output spaces, ensuring that a small perturbation of the input does not lead to large changes (as measured by some probability metric) in the output. Interestingly, it is related to a notion of differential privacy (Lecuyer et al., 2019; Dwork et al., 2014), as formalized in (Pinot et al., 2019b).

We now formalize this probabilistic notion of robustness. Let $p > 0$. We say that a standard model $f : \mathcal{X} \to \mathcal{Y}$ is $\alpha_p$-robust if for any $(x, y) \sim \mathcal{D}$ such that $f(x) = y$, one has, for any data perturbation $\tau \in \mathcal{X}$,
$$\|\tau\|_p \leq \alpha_p \implies f(x) = f(x + \tau). \qquad (104)$$
An analogous definition can be formulated when the output of the model is distribution-valued.

Definition 4 (Probabilistic robustness). A probabilistic model $F : \mathcal{X} \to \mathcal{P}(\mathcal{Y})$ is called $(\alpha_p, \epsilon)$-robust with respect to $D$ if, for any $x, \tau \in \mathcal{X}$, one has
$$\|\tau\|_p \leq \alpha_p \implies D(F(x), F(x + \tau)) \leq \epsilon, \qquad (105)$$
where $D$ is a metric or divergence between two probability distributions.

We refer to the probabilistic model (built on top of a manifold mixup classifier) that injects Gaussian noise into the layer-$k$ features as a probabilistic FM model, and we denote it by $F^{noisy(k)} : conv(g_k(\mathcal{X})) \to \mathcal{P}(\mathcal{Y})$. We denote by $G$ the classifier constructed from $F^{noisy(k)}$, i.e., $G : x \mapsto \arg\max_{j \in [K]} [F^{noisy(k)}]_j(x)$.

In the sequel, we take $D$ to be the total variation distance $D_{TV}$, defined as:
$$D_{TV}(P, Q) := \sup_{S \subseteq \mathcal{X}} |P(S) - Q(S)|, \qquad (106)$$
for any two distributions $P$ and $Q$ over $\mathcal{X}$. Recall that if $P$ and $Q$ have densities $\rho_p$ and $\rho_q$ respectively, then the total variation distance is half of the $L^1$ distance, i.e., $D_{TV}(P, Q) = \frac{1}{2}\int_{\mathcal{X}} |\rho_p(x) - \rho_q(x)|\,dx$. The choice of distance depends on the problem at hand and gives rise to different notions of robustness. One could also consider other statistical distances, such as the Wasserstein distance and the Rényi divergence, which can be related to the total variation distance (see (Pinot et al., 2021; Gibbs & Su, 2002) for details).
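As a side note, in the additive-only case ($\sigma_{mult} = 0$) the injected distributions are Gaussians with equal covariance, and the total variation distance appearing in Definition 4 has a simple closed form, $D_{TV}(N(z, \sigma^2 I), N(z + \tau, \sigma^2 I)) = 2\Phi(\|\tau\|_2/(2\sigma)) - 1$. The sketch below evaluates it (Python, illustrative only); with multiplicative noise one would instead rely on the bounds of Lemma 3 below.

```python
import math

def tv_additive_gaussian(delta_norm: float, sigma_add: float) -> float:
    """TV distance between N(z, sigma_add^2 I) and N(z + tau, sigma_add^2 I),
    where delta_norm = ||tau||_2.  Equal covariances give the closed form
    TV = 2*Phi(||tau|| / (2*sigma)) - 1 = erf(||tau|| / (2*sqrt(2)*sigma))."""
    return math.erf(delta_norm / (2.0 * math.sqrt(2.0) * sigma_add))

# Example: a feature shift of norm 0.1 under sigma_add = 0.4 changes the
# injected distribution by TV ~ 0.10, i.e. the output distribution (and hence
# the prediction) moves only slightly, in the sense of Definition 4.
print(tv_additive_gaussian(0.1, 0.4))
```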
Before presenting our main result in this section, we need the following notation. Let $\Sigma(x) := \sigma_{add}^2 I + \sigma_{mult}^2 xx^T$. For $x, \tau \in \mathcal{X}$, let $\Pi_x$ be a $d_k \times (d_k - 1)$ matrix whose columns form a basis for the subspace orthogonal to $g_k(x + \tau) - g_k(x)$, and let $\{\rho_i(g_k(x), \tau)\}_{i \in [d_k-1]}$ be the eigenvalues of $(\Pi_x^T\Sigma(g_k(x))\Pi_x)^{-1}\Pi_x^T\Sigma(g_k(x + \tau))\Pi_x - I$. Also, let $[F]_{top\,k}(x)$ denote the $k$th highest value among the entries of the vector $F(x)$. Viewing an NFM(k) classifier as a probabilistic FM classifier, we have the following result.

Theorem 5 (Gaussian noise injection robustifies FM classifiers). Let $k \in \mathcal{S}$, $d_k > 1$, and assume that $g_k(x)g_k(x)^T \succeq \beta_k^2 I > 0$ for all $x \in conv(\mathcal{X})$, for some constant $\beta_k$. Then, $F^{noisy(k)}$ is $(\alpha_p, \epsilon_k(p, d, \alpha_p, \sigma_{add}, \sigma_{mult}))$-robust with respect to $D_{TV}$ against $l_p$ adversaries, with
$$\epsilon_k(p, d, \alpha_p, \sigma_{add}, \sigma_{mult}) = \tfrac{9}{2}\min\{1, \max\{A, B\}\}, \qquad (107)$$
where
$$A = A_p(\alpha_p)\,\frac{\sigma_{mult}^2}{\sigma_{add}^2 + \sigma_{mult}^2\beta_k^2}\bigg(\Big\|\int_0^1 \nabla g_k(x + t\tau)\,dt\Big\|_2^2 + 2\|g_k(x)\|_2\Big\|\int_0^1 \nabla g_k(x + t\tau)\,dt\Big\|_2\bigg), \qquad (108)$$
$$B = B_k(\tau)\,\frac{\alpha_p\big(1_{p \in (0,2]} + d^{1/2 - 1/p}\,1_{p \in (2,\infty)} + \sqrt{d}\,1_{p = \infty}\big)}{\sqrt{\sigma_{add}^2 + \sigma_{mult}^2\beta_k^2}}, \qquad (109)$$
$$A_p(\alpha_p) = \begin{cases}\alpha_p 1_{\alpha_p < 1} + \alpha_p^2 1_{\alpha_p \geq 1}, & \text{if } p \in (0, 2],\\ d^{1/2 - 1/p}\big(\alpha_p 1_{\alpha_p < 1} + \alpha_p^2 1_{\alpha_p \geq 1}\big), & \text{if } p \in (2, \infty),\\ d\big(\alpha_p 1_{\alpha_p < 1} + \alpha_p^2 1_{\alpha_p \geq 1}\big), & \text{if } p = \infty,\end{cases} \qquad (110)$$
$$B_k(\tau) = \sup_{x \in conv(\mathcal{X})}\Big\|\int_0^1 \nabla g_k(x + t\tau)\,dt\Big\|_2\sqrt{\sum_{i=1}^{d_k-1}\rho_i^2(g_k(x), \tau)}. \qquad (111)$$
Moreover, if $x \in \mathcal{X}$ is such that $[F^{noisy(k)}]_{top\,1}(x) \geq [F^{noisy(k)}]_{top\,2}(x) + 2\epsilon_k(p, d, \alpha_p, \sigma_{add}, \sigma_{mult})$, then for any $\tau \in \mathcal{X}$, we have
$$\|\tau\|_p \leq \alpha_p \implies G(x) = G(x + \tau), \qquad (112)$$
for any $p > 0$.

Theorem 5 implies that we can inject Gaussian noise into the mixed feature representation to improve the robustness of FM classifiers in the sense of Definition 4, while keeping track of the maximal loss in accuracy incurred under attack, by tuning the noise levels $\sigma_{add}$ and $\sigma_{mult}$. To illustrate this, suppose that $\sigma_{mult} = 0$ and consider the case $p = 2$, in which case $A = 0$ and $B$ scales like $\alpha_2/\sigma_{add}$, so injecting additive Gaussian noise can help control the change in the model output, keeping the classifier's prediction, when the data perturbation is of size $\alpha_2$. We now prove Theorem 5. Before this, we need the following lemma.

Lemma 3. Let $x_1 := z \in \mathbb{R}^{d_k}$ and $x_2 := z + \tau \in \mathbb{R}^{d_k}$, with $\tau \neq 0$ and $d_k > 1$, and let $\Sigma(x) := \sigma_{add}^2 I + \sigma_{mult}^2 xx^T \succeq (\sigma_{add}^2 + \sigma_{mult}^2\beta^2)I > 0$, for some constant $\beta$, for all $x$. Let $\Pi$ be a $d_k \times (d_k - 1)$ matrix whose columns form a basis for the subspace orthogonal to $\tau$, and let $\rho_1(z, \tau), \ldots, \rho_{d_k-1}(z, \tau)$ denote the eigenvalues of $(\Pi^T\Sigma(x_1)\Pi)^{-1}\Pi^T\Sigma(x_2)\Pi - I$. Define the function $C(x_1, x_2, \Sigma) := \max\{A, B\}$, where
$$A = \frac{\sigma_{mult}^2}{\sigma_{add}^2 + \sigma_{mult}^2\beta^2}\big(\|\tau\|_2^2 + 2\tau^T z\big), \qquad (113)$$
$$B = \frac{\|\tau\|_2}{\sqrt{\sigma_{add}^2 + \sigma_{mult}^2\beta^2}}\sqrt{\sum_{i=1}^{d_k-1}\rho_i^2(z, \tau)}. \qquad (114)$$
Then, the total variation distance between $N(x_1, \Sigma(x_1))$ and $N(x_2, \Sigma(x_2))$ admits the following bounds:
$$\tfrac{1}{200}\min\{1, C(x_1, x_2, \Sigma)\} \leq D_{TV}\big(N(x_1, \Sigma(x_1)), N(x_2, \Sigma(x_2))\big) \leq \tfrac{9}{2}\min\{1, C(x_1, x_2, \Sigma)\}. \qquad (115)$$

Proof of Lemma 3. The result follows from a straightforward application of Theorem 1.2 in (Devroye et al., 2018), which provides bounds on the total variation distance between Gaussians with different means and covariances.

With this lemma in hand, we now prove Theorem 5.

Proof of Theorem 5. We denote the noise injection procedure by the map $I : x \mapsto N(x, \Sigma(x))$, where $\Sigma(x) = \sigma_{add}^2 I + \sigma_{mult}^2 xx^T$. Let $x \in \mathcal{X}$ be a test data point and $\tau \in \mathcal{X}$ a data perturbation such that $\|\tau\|_p \leq \alpha_p$ for $p > 0$. Then:
$$D_{TV}\big(F_k(I(g_k(x))),\, F_k(I(g_k(x + \tau)))\big) \leq D_{TV}\big(I(g_k(x)),\, I(g_k(x + \tau))\big) \qquad (116)$$
$$= D_{TV}\big(I(g_k(x)),\, I(g_k(x) + g_k(x + \tau) - g_k(x))\big) \qquad (117)$$
$$= D_{TV}\big(I(g_k(x)),\, I(g_k(x) + \tau_k)\big) \qquad (118)$$
$$\leq \tfrac{9}{2}\min\{1, \Phi(g_k(x), \tau_k, \sigma_{add}, \sigma_{mult}, \beta_k)\}, \qquad (119)$$
where $\tau_k := g_k(x + \tau) - g_k(x) = \big(\int_0^1 \nabla g_k(x + t\tau)\,dt\big)\tau$ by the generalized fundamental theorem of calculus, and
$$\Phi(g_k(x), \tau_k, \sigma_{add}, \sigma_{mult}, \beta_k) := \max\bigg\{\frac{\sigma_{mult}^2}{\sigma_{add}^2 + \sigma_{mult}^2\beta_k^2}\big(\|\tau_k\|_2^2 + 2\langle\tau_k, g_k(x)\rangle\big),\ \frac{\|\tau_k\|_2}{\sqrt{\sigma_{add}^2 + \sigma_{mult}^2\beta_k^2}}\sqrt{\sum_{i=1}^{d_k-1}\rho_i^2(g_k(x), \tau)}\bigg\}, \qquad (120)$$
where the $\rho_i(g_k(x), \tau)$ are the eigenvalues given in the theorem. In the first line above, we have used the data processing inequality (Theorem 6 in (Pinot et al., 2021)), and the last line follows from applying Lemma 3 together with the assumption that $g_k(x)g_k(x)^T \succeq \beta_k^2 I > 0$ for all $x$.
Using the bounds
$$\|\tau_k\|_2 \leq \Big\|\int_0^1 \nabla g_k(x + t\tau)\,dt\Big\|_2\,\|\tau\|_2, \qquad (121)$$
$$|\langle\tau_k, g_k(x)\rangle| \leq \|g_k(x)\|_2\,\Big\|\int_0^1 \nabla g_k(x + t\tau)\,dt\Big\|_2\,\|\tau\|_2, \qquad (122)$$
we obtain
$$\Phi(g_k(x), \tau_k, \sigma_{add}, \sigma_{mult}, \beta_k) \leq \max\{A, B\}, \qquad (123)$$
where
$$A = \frac{\sigma_{mult}^2}{\sigma_{add}^2 + \sigma_{mult}^2\beta_k^2}\bigg(\Big\|\int_0^1 \nabla g_k(x + t\tau)\,dt\Big\|_2^2\,\|\tau\|_2^2 + 2\|g_k(x)\|_2\Big\|\int_0^1 \nabla g_k(x + t\tau)\,dt\Big\|_2\,\|\tau\|_2\bigg), \qquad (124)$$
$$B = \Big\|\int_0^1 \nabla g_k(x + t\tau)\,dt\Big\|_2\,\frac{\|\tau\|_2}{\sqrt{\sigma_{add}^2 + \sigma_{mult}^2\beta_k^2}}\sqrt{\sum_{i=1}^{d_k-1}\rho_i^2(g_k(x), \tau)} \qquad (125)$$
$$\leq \sup_{x \in conv(\mathcal{X})}\bigg\{\Big\|\int_0^1 \nabla g_k(x + t\tau)\,dt\Big\|_2\sqrt{\sum_{i=1}^{d_k-1}\rho_i^2(g_k(x), \tau)}\bigg\}\,\frac{\|\tau\|_2}{\sqrt{\sigma_{add}^2 + \sigma_{mult}^2\beta_k^2}} \qquad (126)$$
$$=: B_k(\tau)\,\frac{\|\tau\|_2}{\sqrt{\sigma_{add}^2 + \sigma_{mult}^2\beta_k^2}}. \qquad (127)$$
The first statement of the theorem then follows from the facts that $\|\tau\|_2 \leq \|\tau\|_p \leq \alpha_p$ for $p \in (0, 2]$, $\|\tau\|_2 \leq d^{1/2-1/q}\|\tau\|_q \leq d^{1/2-1/q}\alpha_q$ for $q > 2$, and $\|\tau\|_2 \leq \sqrt{d}\,\alpha_\infty$ for any $\tau \in \mathbb{R}^d$. In particular, these imply that $A \leq C A_p$, where
$$A_p = \begin{cases}\alpha_p 1_{\alpha_p < 1} + \alpha_p^2 1_{\alpha_p \geq 1}, & \text{if } p \in (0, 2],\\ d^{1/2-1/p}\big(\alpha_p 1_{\alpha_p < 1} + \alpha_p^2 1_{\alpha_p \geq 1}\big), & \text{if } p \in (2, \infty),\\ d\big(\alpha_p 1_{\alpha_p < 1} + \alpha_p^2 1_{\alpha_p \geq 1}\big), & \text{if } p = \infty,\end{cases} \qquad (128)$$
$$C := \frac{\sigma_{mult}^2}{\sigma_{add}^2 + \sigma_{mult}^2\beta_k^2}\bigg(\Big\|\int_0^1 \nabla g_k(x + t\tau)\,dt\Big\|_2^2 + 2\|g_k(x)\|_2\Big\|\int_0^1 \nabla g_k(x + t\tau)\,dt\Big\|_2\bigg). \qquad (129)$$
The last statement in the theorem essentially follows from Proposition 3 in (Pinot et al., 2021).

E ON GENERALIZATION BOUNDS FOR NFM

Let $\mathcal{F}$ be the family of mappings $x \mapsto f(x)$ and $Z_n := ((x_i, y_i))_{i \in [n]}$. Given a loss function $l$, the Rademacher complexity of the set $l \circ \mathcal{F} := \{(x, y) \mapsto l(f(x), y) : f \in \mathcal{F}\}$ is defined as:
$$\mathcal{R}_n(l \circ \mathcal{F}) := \mathbb{E}_{Z_n, \sigma}\bigg[\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \sigma_i\, l(f(x_i), y_i)\bigg], \qquad (130)$$
where $\sigma := (\sigma_1, \ldots, \sigma_n)$, with the $\sigma_i$ independent uniform random variables taking values in $\{-1, 1\}$.

Following (Lamb et al., 2019), we can derive the following generalization bound for the NFM loss function, i.e., an upper bound on the difference between the expected error on unseen data and the NFM loss. This bound shows that NFM can reduce overfitting and give rise to improved generalization.

Theorem 6 (Generalization bound for the NFM loss). Assume that the loss function $l$ satisfies $|l(x, y) - l(x', y)| \leq M$ for all $x, x'$ and $y$. Then, for every $\delta > 0$, with probability at least $1 - \delta$ over a draw of $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$, we have the following generalization bound: for all maps $f \in \mathcal{F}$,
$$\mathbb{E}_{x,y}[l(f(x), y)] - L_n^{NFM} \leq 2\mathcal{R}_n(l \circ \mathcal{F}) + 2M\sqrt{\frac{\log(1/\delta)}{2n}} - Q_\epsilon(f), \qquad (131)$$
where
$$Q_\epsilon(f) = \mathbb{E}_k\big[\epsilon R_1^{(k)} + \epsilon^2\tilde{R}_2^{(k)} + \epsilon^2\tilde{R}_3^{(k)}\big] + \epsilon^2\phi(\epsilon), \qquad (132)$$
for some function $\phi$ such that $\lim_{\epsilon \to 0}\phi(\epsilon) = 0$.

To compare the generalization behavior of NFM with that of training without NFM, we also need the following generalization bound for the standard loss function.

Theorem 7 (Generalization bound for the standard loss). Assume that the loss function $l$ satisfies $|l(x, y) - l(x', y)| \leq M$ for all $x, x'$ and $y$. Then, for every $\delta > 0$, with probability at least $1 - \delta$ over a draw of $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$, we have the following generalization bound: for all maps $f \in \mathcal{F}$,
$$\mathbb{E}_{x,y}[l(f(x), y)] - L_n^{std} \leq 2\mathcal{R}_n(l \circ \mathcal{F}) + 2M\sqrt{\frac{\log(1/\delta)}{2n}}. \qquad (133)$$

By comparing the above two theorems and following the argument of (Lamb et al., 2019), we see that the generalization benefit of NFM comes from two mechanisms. The first mechanism is based on the term $Q_\epsilon(f)$: assuming that the Rademacher complexity term is the same for both methods, NFM has a better generalization bound than the standard method whenever $Q_\epsilon(f) > 0$. The second mechanism is based on the Rademacher complexity term $\mathcal{R}_n(l \circ \mathcal{F})$. For certain families of neural networks, this term can be bounded by the norms of the hidden layers of the network and the norms of the Jacobians of each layer with respect to all previous layers (Wei & Ma, 2019a;b). Therefore, this term differs between training with NFM and standard training. Since NFM implicitly reduces the feature-output Jacobians (see Theorem 3), we can argue that NFM leads to a smaller Rademacher complexity term and hence a better generalization bound.
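As a rough illustration of (130), the empirical Rademacher complexity of a small, finite hypothesis class can be estimated by Monte Carlo. The sketch below (NumPy, illustrative only) assumes the per-sample losses of each hypothesis have been precomputed.

```python
import numpy as np

def empirical_rademacher(losses, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    finite hypothesis class.  `losses` has shape (num_hypotheses, n) with
    losses[f, i] = l(f(x_i), y_i).  For each draw of Rademacher signs we
    take the sup over hypotheses of the signed average, then average."""
    rng = np.random.default_rng(seed)
    H, n = losses.shape
    vals = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)
        vals.append(np.max(losses @ sigma) / n)   # sup_f (1/n) sum_i sigma_i * l_i
    return float(np.mean(vals))
```

A richer class (more rows in `losses`) yields a larger estimate, matching the intuition that NFM's implicit Jacobian regularization shrinks the effective hypothesis class and hence the complexity term in Theorem 6.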
We now prove Theorem 6. The proof of Theorem 7 follows the same argument as that of Theorem 6.

Proof of Theorem 6. Let $Z_n := \{(x_i, y_i)\}_{i \in [n]}$ and $Z_n' := \{(x_i', y_i')\}_{i \in [n]}$ be two datasets, where $Z_n'$ differs from $Z_n$ in exactly one point, of an arbitrary index $i_0$. Denote $GE(Z_n) := \sup_{f \in \mathcal{F}}\big(\mathbb{E}_{x,y}[l(f(x), y)] - L_n^{NFM}\big)$, where $L_n^{NFM}$ is computed using the dataset $Z_n$, and likewise for $GE(Z_n')$. Then,
$$GE(Z_n') - GE(Z_n) \leq \frac{M(2n-1)}{n^2} \leq \frac{2M}{n}, \qquad (134)$$
where we have used the fact that $L_n^{NFM}$ has $n^2$ terms, of which $2n - 1$ differ between $Z_n$ and $Z_n'$. Similarly, we have
$$GE(Z_n) - GE(Z_n') \leq \frac{2M}{n}. \qquad (135)$$
Therefore, by McDiarmid's inequality, for any $\delta > 0$, with probability at least $1 - \delta$,
$$GE(Z_n) \leq \mathbb{E}_{Z_n}[GE(Z_n)] + 2M\sqrt{\frac{\log(1/\delta)}{2n}}.$$
Applying Theorem 3, we have
$$\mathbb{E}_{Z_n}[GE(Z_n)] = \mathbb{E}_{Z_n}\bigg[\sup_{f \in \mathcal{F}}\bigg(\mathbb{E}_{Z_n'}\Big[\frac{1}{n}\sum_{i=1}^n l(f(x_i'), y_i')\Big] - L_n^{NFM}\bigg)\bigg] \qquad (136)$$
$$= \mathbb{E}_{Z_n}\bigg[\sup_{f \in \mathcal{F}}\bigg(\mathbb{E}_{Z_n'}\Big[\frac{1}{n}\sum_{i=1}^n l(f(x_i'), y_i') - \frac{1}{n}\sum_{i=1}^n l(f(x_i), y_i)\Big] - Q_\epsilon(f)\bigg)\bigg] \qquad (137)$$
$$\leq \mathbb{E}_{Z_n, Z_n'}\bigg[\sup_{f \in \mathcal{F}}\bigg(\frac{1}{n}\sum_{i=1}^n \big(l(f(x_i'), y_i') - l(f(x_i), y_i)\big) - Q_\epsilon(f)\bigg)\bigg] \qquad (138)$$
$$= \mathbb{E}_{Z_n, Z_n', \sigma}\bigg[\sup_{f \in \mathcal{F}}\bigg(\frac{1}{n}\sum_{i=1}^n \sigma_i\big(l(f(x_i'), y_i') - l(f(x_i), y_i)\big) - Q_\epsilon(f)\bigg)\bigg] \qquad (139)$$
$$\leq 2\,\mathbb{E}_{Z_n, \sigma}\bigg[\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \sigma_i\, l(f(x_i), y_i)\bigg] - Q_\epsilon(f) = 2\mathcal{R}_n(l \circ \mathcal{F}) - Q_\epsilon(f), \qquad (140)$$
where (136) uses the definition of $GE(Z_n)$, (137) uses Theorem 3 together with the fact that $\frac{1}{n}\sum_{i=1}^n l(f(x_i), y_i)$ can be brought inside the expectation by linearity, (138) follows from Jensen's inequality and the convexity of the supremum, (139) follows from the fact that $\sigma_i\big(l(f(x_i'), y_i') - l(f(x_i), y_i)\big)$ and $l(f(x_i'), y_i') - l(f(x_i), y_i)$ have the same distribution for each $\sigma_i \in \{-1, 1\}$ (since $Z_n$ and $Z_n'$ are drawn i.i.d. from the same distribution), and (140) follows from the subadditivity of the supremum. The bound in the theorem then follows by combining the above bound with McDiarmid's inequality.

F ADDITIONAL EXPERIMENTS AND DETAILS

F.1 INPUT PERTURBATIONS

We consider the following three types of data perturbation at inference time:

White noise perturbations are constructed as $\tilde{x} = x + \delta x$, where the additive noise is sampled from a Gaussian distribution, $\delta x \sim N(0, \sigma)$. This perturbation strategy emulates measurement errors that can result from data acquisition with poor sensors (where $\sigma$ corresponds to the severity of these errors).

Salt and pepper perturbations emulate defective pixels that result from converting analog signals to digital signals. The noise model takes the form $\mathbb{P}(\tilde{X} = X) = 1 - \gamma$ and $\mathbb{P}(\tilde{X} = \max) = \mathbb{P}(\tilde{X} = \min) = \gamma/2$, where $\tilde{X}(i, j)$ denotes the corrupted image and $\min$, $\max$ denote the minimum and maximum pixel values, respectively. Here $\gamma$ parameterizes the proportion of defective pixels.

Adversarial perturbations are worst-case, non-random perturbations that maximize the loss $\ell(g_\delta(X + \Delta X), y)$ subject to the constraint $\|\Delta X\| \leq r$ on the norm of the perturbation. We use projected gradient descent to construct these perturbations (Madry et al., 2017).

F.2 ILLUSTRATION OF THE EFFECTS OF NFM ON TOY DATASETS

We consider a binary classification task for a noise-corrupted 2D dataset whose data points form two concentric circles; points on the same circle correspond to the same class. We generate 500 samples, setting the scale factor between the inner and outer circle to 0.05 and adding Gaussian noise with zero mean and standard deviation 0.3 to the samples. Fig. 8 shows the training and test data points. We train a fully connected feedforward neural network with four layers and ReLU activation functions on these data, using 300 points for training and 200 for testing. All models are trained with Adam and learning rate 0.1, and the seed is fixed across all experiments (a sketch of this setup is given below).
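A minimal sketch of this setup; the hidden width of 64 and the number of epochs are illustrative choices of this sketch, not details taken from the paper.

```python
import torch
from sklearn.datasets import make_circles

# 500 points on two concentric circles (scale factor 0.05), corrupted by
# Gaussian noise with standard deviation 0.3; 300 for training, 200 for test.
X, y = make_circles(n_samples=500, factor=0.05, noise=0.3, random_state=0)
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32)
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

# Four-layer fully connected ReLU network, trained with Adam at lr = 0.1.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = torch.nn.BCEWithLogitsLoss()
for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X_train).squeeze(1), y_train)
    loss.backward()
    opt.step()
```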
Note that the learning rate can be considered a temperature parameter that introduces some amount of regularization by itself. Hence, we choose a learning rate that is large for this problem, to better illustrate the regularization effects imposed by the different schemes that we consider.

Fig. 2 illustrates how different regularization strategies affect the decision boundaries of the neural network classifier. The decision boundaries and the test accuracy indicate that white noise injection and dropout (we explored dropout rates in the range [0.0, 0.9] and found that 0.2 yields the best performance) introduce a favorable amount of regularization. Most notable is the effect of weight decay (we use 9e-3): the decision boundary is nicely smoothed and the test accuracy is improved. In contrast, the simple mixup data augmentation scheme shows no benefit here, whereas manifold mixup improves the predictive accuracy considerably. Combining mixup (or manifold mixup) with noise injection yields the best performance in terms of both smoothness of the decision boundary and predictive accuracy. Indeed, NFM outperforms all other methods here. The performance could be further improved by combining NFM with weight decay or dropout, which shows that there are interaction effects between different regularization schemes. In practice, when one trains deep neural networks, different regularization strategies are treated as knobs to be fine-tuned. From this perspective, NFM provides additional knobs to further improve a model.

Figure 8: The toy dataset in R^2 that we use for binary classification. (a) Data points for training. (b) Data points for testing.

F.3 ADDITIONAL RESULTS FOR VISION TRANSFORMERS

Here we consider a compact vision transformer (ViT-lite) with 7 attention layers and 4 heads (Hassani et al., 2021). Fig. 9 (left) compares vision transformers trained with different data augmentation strategies.

Figure 9: Vision transformers evaluated on CIFAR-10 with different training schemes (Baseline, Mixup, Noisy Mixup, Manifold Mixup, and two Noisy Feature Mixup variants, each with α = 0.1). Left: test accuracy as a function of the white noise level σ; right: test accuracy as a function of the salt and pepper noise level γ.

Table 4: Robustness of Wide-ResNet-18 w.r.t. white noise (σ) and salt and pepper (γ) perturbations evaluated on CIFAR-100. The results are averaged over 5 models trained with different seed values.

Scheme                                         Clean (%)  σ=0.1  σ=0.2  σ=0.3  γ=0.08  γ=0.12  γ=0.2
Baseline                                       91.3       89.4   77.0   56.7   83.2    74.6    48.6
Mixup (α = 0.1) Zhang et al. (2017)            91.2       89.5   77.6   57.7   82.9    74.6    48.6
Mixup (α = 0.2) Zhang et al. (2017)            91.2       89.2   77.8   58.9   82.6    74.5    47.9
Noisy Mixup (α = 0.1) Yang et al. (2020b)      90.9       90.4   87.5   80.2   84.0    79.4    63.8
Noisy Mixup (α = 0.2) Yang et al. (2020b)      90.9       90.4   87.4   79.8   83.8    79.3    63.4
Manifold Mixup (α = 0.1) Verma et al. (2019)   91.2       89.2   77.2   56.9   83.0    74.3    47.1
Manifold Mixup (α = 1.0) Verma et al. (2019)   90.2       88.4   76.0   55.1   81.3    71.4    42.7
Manifold Mixup (α = 2.0) Verma et al. (2019)   89.0       87.0   74.3   53.7   79.8    70.3    41.9
Noisy Feature Mixup (α = 0.1)                  91.4       90.2   88.2   84.8   84.4    81.2    74.4
Noisy Feature Mixup (α = 1.0)                  89.8       89.1   86.6   82.7   82.5    79.0    71.4
Noisy Feature Mixup (α = 2.0)                  88.4       87.6   84.6   80.1   80.4    76.5    68.6
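The σ and γ columns above (and the robustness curves in Figure 9) are produced with the perturbation models of F.1. A minimal sketch of these perturbations follows; the PGD step-size heuristic and the batch-level norm projection are simplifying assumptions of this sketch rather than details taken from the paper.

```python
import torch

def white_noise(x, sigma):
    """x_tilde = x + delta, delta ~ N(0, sigma); cf. Subsection F.1."""
    return x + sigma * torch.randn_like(x)

def salt_and_pepper(x, gamma, lo=0.0, hi=1.0):
    """Each pixel is kept with prob. 1 - gamma and set to the min/max pixel
    value (assumed here to be 0 and 1) with prob. gamma/2 each."""
    u = torch.rand_like(x)
    x = torch.where(u < gamma / 2, torch.full_like(x, lo), x)
    x = torch.where((u >= gamma / 2) & (u < gamma), torch.full_like(x, hi), x)
    return x

def pgd_l2(model, x, y, loss_fn, radius, steps=7, step_size=None):
    """l2-constrained projected gradient descent attack (batch-level norm
    projection for simplicity; per-example projection would be analogous)."""
    step_size = step_size or 2.5 * radius / steps   # common heuristic, assumed here
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step_size * grad / (grad.norm() + 1e-12)
            norm = delta.norm()
            if norm > radius:
                delta *= radius / norm
    return (x + delta).detach()
```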
Again, NFM improves the robustness of the models while achieving state-of-the-art accuracy when evaluated on clean data; in contrast, mixup and manifold mixup do not boost robustness. Further, Fig. 9 (right) shows that the vision transformer is less sensitive to salt and pepper perturbations than the ResNet model. These results are consistent with the high robustness of transformers recently reported in Shao et al. (2021); Paul & Chen (2021).

Table 4 provides additional results for vision transformers trained with different data augmentation schemes and different values of α. It can be seen that NFM with α = 0.1 helps to improve the predictive accuracy on clean data while also improving the robustness of the models. For example, the model trained with NFM shows about a 25% improvement over the baseline model when faced with salt and pepper perturbations (γ = 0.2). Further, our results indicate that larger values of α have a negative effect on the generalization performance of vision transformers.

F.4 ABLATION STUDY

In Table 5 we provide a detailed ablation study in which we vary several knobs. First, we can see that injecting noise alone helps to improve robustness, but the test accuracy improves only marginally. On the other hand, mixing inputs and hidden features alone improves the testing performance of the model, but does not significantly improve its robustness. In contrast, the NFM scheme combines the best of both worlds and shows that both accuracy and robustness can be increased. Varying the noise levels indicates that there is a trade-off between test accuracy on clean data and robustness to perturbations. We also vary the mixup parameter α to show that the good performance is consistent across a range of different values.

Table 5: Ablation study using Wide-ResNet-18 trained and evaluated on CIFAR-100.

α     σadd   σmult   Clean (%)  σ=0.1  σ=0.25  σ=0.5  γ=0.06  γ=0.1  γ=0.15
-     0      0       76.9       64.6   42.0    23.5   58.1    39.8   15.1
-     0.4    0.2     78.1       76.2   65.7    46.6   70.0    58.8   28.4
1     0      0       80.3       72.5   54.0    33.4   62.5    43.8   16.2
1     0.4    0.2     78.9       78.6   66.6    46.7   66.6    53.4   25.9
0.2   0      0       79.7       70.6   46.6    25.3   62.1    43.0   15.2
1     0      0       79.7       70.5   45.0    23.8   62.1    42.8   14.8
2     0      0       79.2       69.3   43.8    23.0   62.8    44.2   16.0
1     0.1    0.1     81.0       76.2   56.6    36.4   66.8    49.7   21.4
0.2   0.4    0.2     80.6       79.2   70.2    51.7   71.5    60.4   30.3
1     0.4    0.2     80.9       80.1   72.1    55.3   72.8    62.1   34.4
2     0.4    0.2     80.7       80.0   71.5    53.9   72.7    62.7   36.6
1     0.8    0.4     80.3       80.1   75.5    66.4   74.3    66.5   44.6

F.5 ADDITIONAL RESULTS FOR RESNETS WITH HIGHER LEVELS OF NOISE INJECTIONS

In the experiments in Section 5, we considered models trained with NFM that use noise injection levels σadd = 0.4 and σmult = 0.2, whereas the ablation model uses σadd = 1.0 and σmult = 0.5. Here, we want to better illustrate the trade-off between accuracy and robustness. We saw that there exists a potential sweet spot where we are able to improve both the predictive accuracy and the robustness of the model. However, if the primary aim is to push the robustness of the model, then we need to sacrifice some amount of accuracy. Fig. 10 illustrates this trade-off for pre-activated ResNet-18s trained on CIFAR-10. We can see that increased levels of noise injection considerably improve robustness, while the accuracy on clean data drops. In practice, the amount of noise injection that the user chooses depends on the situation.
If robustness is critical, then higher noise levels can be used. If adversarial examples are the main concern, then other training strategies such as adversarial training might be preferable. However, the advantage of NFM over adversarial training is that (a) it has a more favorable trade-off between robustness and accuracy in the small noise regime, and (b) it is computationally inexpensive compared to most adversarial training schemes. This is further illustrated in the next section.

Figure 10: Pre-activated ResNet-18 evaluated on CIFAR-10, trained with NFM and varying levels of additive (σadd) and multiplicative (σmult) noise injection, from (σadd, σmult) = (0.4, 0.2) up to (1.2, 0.6). Left: test accuracy vs. white noise level σ; right: test accuracy vs. salt and pepper noise level γ. Shaded regions indicate one standard deviation about the mean. Averaged across 5 random seeds.

F.6 COMPARISON WITH ADVERSARIALLY TRAINED MODELS

Here, we compare NFM to adversarial training in the small noise regime, i.e., the situation where models do not show a significant drop on the clean test set. Specifically, we consider the projected gradient descent (PGD) method (Madry et al., 2017), using 7 attack iterations and varying l2 perturbation levels ϵ, to train adversarially robust models. First, we compare how resilient the different models are with respect to adversarial input perturbations at inference time (Fig. 11, left). Again, the adversarial examples are constructed using the PGD method with 7 attack iterations. Not surprisingly, the adversarially trained model with ϵ = 0.01 is the most resilient, while sacrificing about 0.5% accuracy compared to the baseline model (not shown here). In contrast, the models trained with NFM are less robust, while being about 1-1.5% more accurate on clean data. Next, we compare in Fig. 11 (right) the robustness with respect to salt and pepper perturbations, i.e., perturbations that neither model has seen before. Interestingly, here we see an advantage of the NFM scheme with high noise injection levels as compared to the adversarially trained models.

Figure 11: Pre-activated ResNet-18 evaluated on CIFAR-10 (left) and Wide-ResNet-18 evaluated on CIFAR-100 (right) with respect to adversarially perturbed inputs (PGD-trained models with ϵ ∈ {0.001, 0.002, 0.005, 0.01} versus NFM with (σadd, σmult) = (0.4, 0.2) and (1.2, 0.6)). Shaded regions indicate one standard deviation about the mean. Averaged across 5 random seeds.

F.7 FEATURE VISUALIZATION COMPARISON

In this subsection, we compare the features learned by three ResNet-50 models trained on Restricted ImageNet (Tsipras et al., 2018): one trained without mixup, one with manifold mixup (Verma et al., 2019), and one with NFM. We compare features by maximizing randomly chosen pre-logit activations of each model with respect to the input, as described by Engstrom et al. (2020). We do so for all models with projected gradient ascent over 200 iterations, a step size of 16, and an ℓ2 norm constraint of 2,000 (a sketch of this procedure is given below). Both the manifold mixup and NFM models use α = 0.2, and the NFM model additionally uses σadd = 2.4 and σmult = 1.2.
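A minimal sketch of this visualization procedure, loosely following the settings stated above (200 iterations, step size 16, ℓ2 ball of radius 2,000); the `model.features` accessor returning the pre-logit activations, the single-image input, and the normalized ascent step are assumptions of this sketch rather than details from the paper.

```python
import torch

def visualize_feature(model, x0, unit, iters=200, step=16.0, eps=2000.0):
    """Maximize one randomly chosen pre-logit activation w.r.t. the input by
    projected gradient ascent.  x0 is a single image of shape (1, C, H, W);
    model.features(x) is assumed to return pre-logit activations (1, D)."""
    x = x0.clone().detach().requires_grad_(True)
    for _ in range(iters):
        act = model.features(x)[:, unit].sum()          # chosen unit's activation
        grad, = torch.autograd.grad(act, x)
        with torch.no_grad():
            x += step * grad / (grad.norm() + 1e-12)    # normalized ascent step
            delta = x - x0
            norm = delta.norm()
            if norm > eps:                              # project back onto l2 ball
                x.copy_(x0 + delta * (eps / norm))
    return x.detach()
```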
The result, as shown in Fig. 12, is that the features learned by the model trained with NFM are slightly stronger (i.e., more clearly distinct from random noise) than those of the clean model.

Figure 12: The features learned by the NFM classifier are slightly stronger (i.e., different from random noise) than those of the clean model (panels: clean model, Manifold Mixup, NFM). See Subsection F.7 for more details.

F.8 TRAIN AND TEST ERROR FOR CIFAR-100

Figure 13 shows models trained with different training schemes on CIFAR-100. Compared to the baseline model, the models trained with manifold mixup and NFM have a similar convergence behavior; however, they achieve a smaller test error. This shows that both manifold mixup and NFM have a favorable implicit regularization effect, and the effect is more pronounced for the NFM scheme.

Figure 13: Train (a) and test (b) error over 200 epochs for a pre-activated Wide-ResNet-18 trained on CIFAR-100 (Baseline, Manifold Mixup (α = 1.0), and NFM (α = 1.0)).