Regularization properties of adversarially-trained linear regression

Antônio H. Ribeiro (Uppsala University) antonio.horta.ribeiro@it.uu.se
Dave Zachariah (Uppsala University) dave.zachariah@it.uu.se
Francis Bach (PSL Research University, INRIA) francis.bach@inria.fr
Thomas B. Schön (Uppsala University) thomas.schon@it.uu.se

Abstract. State-of-the-art machine learning models can be vulnerable to very small input perturbations that are adversarially constructed. Adversarial training is an effective approach to defend against such attacks. Formulated as a min-max problem, it searches for the best solution when the training data were corrupted by the worst-case attacks. Linear models are among the simplest models where these vulnerabilities can be observed and are the focus of our study. In this case, adversarial training leads to a convex optimization problem which can be formulated as the minimization of a finite sum. We provide a comparative analysis between the solution of adversarial training in linear regression and other regularization methods. Our main findings are that: (A) Adversarial training yields the minimum-norm interpolating solution in the overparameterized regime (more parameters than data), as long as the maximum disturbance radius is smaller than a threshold; and, conversely, the minimum-norm interpolator is the solution to adversarial training with a given radius. (B) Adversarial training can be equivalent to parameter-shrinking methods (ridge regression and Lasso). This happens in the underparametrized region, for an appropriate choice of adversarial radius and zero-mean symmetrically distributed covariates. (C) For ℓ∞-adversarial training, as in square-root Lasso, the choice of adversarial radius for optimal bounds does not depend on the additive noise variance. We confirm our theoretical findings with numerical examples.

1 Introduction

Adversarial attacks generated striking examples of the brittleness of modern machine learning. The framework considers inputs contaminated with disturbances deliberately chosen to maximize the model error and, even for small disturbances, these attacks can cause a substantial performance drop in otherwise state-of-the-art models [1]–[6]. Adversarial training [7] is one of the most effective approaches for defending deep learning models against adversarial attacks [7]–[9]. It considers training models on samples that have been modified by an adversary, with the goal of obtaining a model that will be more robust when faced with new adversarially perturbed samples. The training procedure is formulated as a min-max problem, searching for the best solution to the worst-case attacks. Despite its success in producing state-of-the-art results in adversarial defense benchmarks [10], there are still major challenges in using adversarial training. The min-max problem is a hard optimization problem and it is still an open question how to solve it efficiently. Moreover, these methods still produce large errors on new adversarially disturbed test points, often much larger than the adversarial error obtained during training.

Figure 1: Regularization paths estimated on the Diabetes dataset [18]: (a) Lasso regression; (b) ℓ∞-adversarial training.
Regularization paths are plots showing the coefficient estimates for a varying regularization parameter λ or, in adversarial training, perturbation radius δ. This type of plot is commonly used in analysing Lasso and its variants, see, e.g., [18], [19]. On the horizontal axis, we give the ℓ1-norm of the estimated parameter; on the vertical axis, the coefficients obtained by each method. The dataset has p = 10 baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements), which were obtained for n = 442 diabetes patients. The model output is a quantitative measure of the disease progression. See Appendix B.5 for the relationship between $\|\hat\beta\|_1$ and δ and λ.

To get insight into adversarial training and try to tackle these challenges, a growing body of work [11]–[15] studies fundamental properties of adversarial attacks and adversarial training in linear models. Linear models allow for analytical treatment while still reproducing phenomena observed in state-of-the-art models. Consider a training dataset $\{(x_i, y_i)\}_{i=1}^n$ consisting of $n$ data points in $\mathbb{R}^p \times \mathbb{R}$; adversarial training in linear regression corresponds to minimizing (in $\beta \in \mathbb{R}^p$)
$$R_{\mathrm{adv}}(\beta; \delta, \|\cdot\|) = \frac{1}{n}\sum_{i=1}^{n}\ \max_{\|\Delta x_i\| \le \delta}\ |y_i - (x_i + \Delta x_i)^\top \beta|^2. \quad (1)$$
Restricting the analysis to linear models allows for a simplified analysis: the problem is convex and the next proposition allows us to express $R_{\mathrm{adv}}$ in terms of the dual norm, $\|\beta\|_* = \sup_{\|x\| \le 1} |\beta^\top x|$.

Proposition 1 (Dual formulation). Let $\|\cdot\|_*$ be the dual norm of $\|\cdot\|$; then
$$R_{\mathrm{adv}}(\beta; \delta, \|\cdot\|) = \frac{1}{n}\sum_{i=1}^{n}\big(|y_i - x_i^\top \beta| + \delta \|\beta\|_*\big)^2. \quad (2)$$

This simple reformulation removes one of the major challenges: it eliminates the inner optimization problem and yields a model that is more amenable to analysis. Similar reformulations can be obtained for classification. This simplification has made the analysis of linear models a fruitful way to get insights into the properties of adversarial training and adversarial examples: from giving a counterexample to the idea that the vulnerability of deep neural networks to adversarial attacks is caused by their nonlinearity [2], [5], [16], to helping explain how implicitly regularized overparametrized models can be robust, see [17].

In this paper, we further explore this reformulation and provide a thorough characterization of adversarial training in linear regression problems. We do this in a comparative fashion, establishing when the solution of adversarially-trained linear models coincides with the solution of traditional regularization methods. For instance, a simple inspection of (2) hints at the similarities with other parameter-shrinking methods. This similarity is confirmed in Figure 1,¹ which illustrates how the solutions of ℓ∞-adversarial training can be almost indistinguishable from Lasso. We explain the similarities and differences with other methods.

¹Code for reproducing the figures is available at: https://github.com/antonior92/advtrain-linreg

Our contributions are:

A. In the overparametrized region, we prove that the minimum-norm interpolator is the solution to adversarial training for δ smaller than a certain threshold (Section 4).
B. We establish conditions under which the solution coincides with Lasso and ridge regression, respectively (Section 5).
C. We show that adversarial training can be framed in the robust regression framework, and use it to establish connections with the square-root Lasso. We show that adversarial training (like the square-root Lasso) can obtain bounds on the prediction error that do not require the noise variance to be known (Section 6).
D. We prove a more general version of Proposition 1, valid for general lower-semicontinuous and convex loss functions (Section 8).
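The dual form (2) is also convenient computationally: it can be handed directly to an off-the-shelf convex solver, which is how the adversarial training solutions in Section 7 are obtained (CVXPY [42]). Below is a minimal sketch of this approach; the function name and interface are ours and not from the paper, and solver options may need tuning in practice.

```python
import cvxpy as cp
import numpy as np

def adv_train(X, y, delta, dual_norm=1):
    """Minimize the dual form (2) of the adversarial training objective.

    dual_norm=1 corresponds to l_inf-bounded attacks (whose dual norm is l1);
    dual_norm=2 corresponds to l2-bounded attacks (the l2 norm is self-dual).
    """
    n, p = X.shape
    beta = cp.Variable(p)
    # per-sample term: |y_i - x_i^T beta| + delta * ||beta||_*
    per_sample = cp.abs(y - X @ beta) + delta * cp.norm(beta, dual_norm)
    problem = cp.Problem(cp.Minimize(cp.sum(cp.square(per_sample)) / n))
    problem.solve()
    return beta.value
```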
2 Related work

Generalization of minimum-norm interpolators. The study of minimum-norm interpolators played a key role in explaining why overparametrized models generalize, an important open question for which traditional theory failed to explain empirical results [20] and for which significant progress has been made over the past few years [21]. These estimates provide a simple scenario where we can interpolate noisy data and still generalize well. Minimum-norm interpolators have indeed been quite a fruitful setting to study the phenomenon of benign overfitting, i.e., when the model interpolates the training data but still generalizes well to new samples. In [22] the authors use these estimates to prove consistency, a development that later had several extensions [23]. From another angle, in a series of insightful papers, Belkin et al. [24], [25] explore the phenomenon of double descent, where a double-descent curve subsumes the textbook U-shaped bias-variance trade-off curve, with a second decrease in the error occurring beyond the point where the model has reached the capacity of interpolating the training data. Interestingly, minimum-norm interpolators are also basic scenarios for observing this phenomenon [24]–[26]. Our research connects the study of robustness to adversarial attacks to the study of generalization of minimum-norm interpolators.

Adversarial training in linear models. The generalization of adversarial attacks in linear models is well-studied. Tsipras et al. [16] and Ilyas et al. [5] use linear models to explain the conflict between robustness and high-performance models observed in neural networks; Ribeiro et al. [17] use these models to show how overparameterization affects robustness to perturbations; Taheri et al. [11] derive asymptotics for adversarial training in binary classification. Dan et al. [27] and Dobriban et al. [28] study adversarial robustness in Gaussian classification problems. Javanmard et al. [12] provide asymptotics for adversarial training in linear regression, Javanmard et al. [29] in classification settings, and Hassani et al. [13] for random feature regressions. Min et al. [14] investigate how the dataset size affects adversarial performance. Yin et al. [15] provide an analysis of ℓ∞-attacks on linear classifiers based on Rademacher complexity. These works, however, focus on generalization properties. We provide a fresh perspective on the problem by clarifying the connection of adversarial training to other regularization methods.

Robust regression and square-root Lasso. The properties of Lasso [30] for recovering sparse parameters in noisy settings have been extensively studied, and bounds on the parameter estimation error are well-known [31]. However, these results rely on choices of regularization parameters that depend on the variance of the additive noise. Square-root Lasso [32] was proposed to circumvent this difficulty. We show that for ℓ∞-adversarial training, similarly to the square-root Lasso, bounds can be obtained with the adversarial radius set without knowledge of the variance of the additive noise. We also show that both methods fit into the robust regression framework [33].
The work on robust classification [34] is also connected to our work: there, it is shown that certain robust support vector machines are equivalent to the regularized SVM. Our results for classification in Section 8 can be viewed as a generalization of their Theorem 3.

Dual formulation. Variations of Proposition 1 are presented in [12], [17], [35]. Equivalent results in the context of classification are presented in [2], [15]. We use the reformulation extensively in our developments and also present a generalized statement for it.

3 Background

Several estimators are relevant to our developments and will be compared in this paper.

Adversarially-trained linear regression. Our main object of study is the minimization of
$$R_{\mathrm{adv}}(\beta;\delta,\|\cdot\|) = \frac{1}{n}\sum_{i=1}^n\ \max_{\|\Delta x_i\|\le\delta}\ |y_i - (x_i+\Delta x_i)^\top\beta|^2 \ \overset{(a)}{=}\ \frac{1}{n}\sum_{i=1}^n\big(|y_i - x_i^\top\beta| + \delta\|\beta\|_*\big)^2,$$
where equality (a) follows from Proposition 1. We use $\|\beta\|_* = \sup_{\|x\|\le 1}|\beta^\top x|$ to denote the dual norm of $\|\cdot\|$. We highlight that the ℓ2-norm, $\|\beta\|_2 = \big(\sum_i|\beta_i|^2\big)^{1/2}$, is dual to itself, and that the ℓ1-norm, $\|\beta\|_1 = \sum_i|\beta_i|$, is the dual norm of the ℓ∞-norm, $\|\beta\|_\infty = \max_i|\beta_i|$. When the adversarial disturbances are constrained to the ℓp ball $\{\Delta x : \|\Delta x\|_p \le \delta\}$, we call it ℓp-adversarial training. Our focus is on ℓ∞- and ℓ2-adversarial training.

Parameter-shrinking methods. Parameter-shrinking methods explicitly penalize large values of β. Regression methods that involve shrinkage include Lasso [30] and ridge regression, which minimize, respectively,
$$R_{\mathrm{lasso}}(\beta;\lambda) = \frac{1}{n}\sum_{i=1}^n|y_i - x_i^\top\beta|^2 + \lambda\|\beta\|_1 \quad\text{and}\quad R_{\mathrm{ridge}}(\beta;\lambda) = \frac{1}{n}\sum_{i=1}^n|y_i - x_i^\top\beta|^2 + \lambda\|\beta\|_2^2.$$

Square-root Lasso. Setting (with guarantees) the Lasso regularization parameter requires knowledge of the noise variance. Square-root Lasso [32] circumvents this difficulty by minimizing
$$R_{\sqrt{\mathrm{lasso}}}(\beta,\lambda) = \sqrt{\tfrac{1}{n}\textstyle\sum_{i=1}^n|y_i - x_i^\top\beta|^2} + \lambda\|\beta\|_1. \quad (3)$$

Minimum-norm interpolator. Let $X \in \mathbb{R}^{n\times p}$ denote the matrix of stacked vectors $x_i$ and $y \in \mathbb{R}^n$ the vector of stacked outputs. When the matrix X has full row rank, the linear system $X\beta = y$ has multiple solutions. In this case the minimum $\|\cdot\|$-norm interpolator is the solution of
$$\min_\beta \|\beta\| \quad\text{subject to}\quad X\beta = y. \quad (4)$$

4 Adversarial training in the overparametrized regime

Our first contribution is to provide conditions for when the minimum-norm interpolator is equivalent to the solution of adversarial training in the overparametrized case (we assume that rank(X) = n and n < p). On the one hand, our result gives further insight into adversarial training. On the other hand, it allows us to see minimum-norm interpolators as solutions of an adversarial training problem.

Theorem 1. Assume the matrix $X \in \mathbb{R}^{n\times p}$ to have full row rank, let $\hat\alpha$ denote the solution of the dual problem $\max_{\|X^\top\alpha\| \le 1} \alpha^\top y$, and let $\bar\delta$ denote the threshold
$$\bar\delta = \big(n\|\hat\alpha\|_\infty\big)^{-1}. \quad (5)$$
The minimum $\|\cdot\|_*$-norm interpolator minimizes the adversarial risk $R_{\mathrm{adv}}(\beta;\delta,\|\cdot\|)$ if and only if $\delta \in (0, \bar\delta]$.

Remark 1 (Bounds on $\bar\delta$). The theorem allows us to directly compute $\bar\delta$, but it does require solving the dual problem of the minimum-norm problem. In Appendix A.3, we provide bounds on $\bar\delta$ that avoid solving this optimization problem. For instance, for ℓ∞-adversarial attacks, we have the following bounds depending on the singular values of X:
$$\frac{1}{\sqrt p}\,\sigma_n(X) \ \le\ n\,\bar\delta \ \le\ \sqrt{p}\,\sigma_1(X),$$
where $\sigma_1$ and $\sigma_n$ denote the largest and smallest positive singular values of X, respectively.
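The threshold in (5) can be computed numerically by solving the dual problem with a convex solver, in the same spirit as the CVXPY-based experiments of Section 7. The sketch below is ours (not code from the paper) and assumes the dual problem exactly as stated in Theorem 1.

```python
import cvxpy as cp
import numpy as np

def interpolation_threshold(X, y, attack_norm=np.inf):
    """Compute delta_bar in (5): solve max_alpha  alpha^T y  s.t.  ||X^T alpha|| <= 1,
    where ||.|| is the attack norm (np.inf for l_inf-adversarial training, 2 for
    l2-adversarial training), and return 1 / (n * ||alpha_hat||_inf)."""
    n, _ = X.shape
    alpha = cp.Variable(n)
    problem = cp.Problem(cp.Maximize(y @ alpha),
                         [cp.norm(X.T @ alpha, attack_norm) <= 1])
    problem.solve()
    return 1.0 / (n * np.linalg.norm(alpha.value, np.inf))
```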
Proof of Theorem 1. Let $\hat\beta$ and $\hat\alpha$ be the minimum $\|\cdot\|_*$-norm solution and the solution of the associated dual problem. Throughout the proof we denote $R_{\mathrm{adv}}(\beta) = R_{\mathrm{adv}}(\beta;\delta,\|\cdot\|)$, dropping the last two arguments. The subdifferential of $R_{\mathrm{adv}}$ evaluated at $\hat\beta$ is given by²
$$\partial R_{\mathrm{adv}}(\hat\beta) = \frac{2\delta\|\hat\beta\|_*}{n}\sum_{i=1}^n\big(\rho_i x_i + \delta\,\partial\|\hat\beta\|_*\big), \quad (6)$$
for any $\rho = (\rho_1,\dots,\rho_n) \in \mathbb{R}^n$ that satisfies $\|\rho\|_\infty \le 1$. Any element of $\partial\|\hat\beta\|_*$ can be written as $X^\top\hat\alpha$ (we prove this in Lemma 1 in the Appendix). Hence, we can rewrite
$$\partial R_{\mathrm{adv}}(\hat\beta) = \frac{2\delta\|\hat\beta\|_*}{n}\sum_{i=1}^n\big(\rho_i x_i + \delta X^\top\hat\alpha\big) = 2\delta\|\hat\beta\|_*\,X^\top\Big(\frac{\rho}{n} + \delta\hat\alpha\Big).$$
If $\delta \le \bar\delta = 1/(n\|\hat\alpha\|_\infty)$, we can take $\rho = -n\delta\hat\alpha$ (which satisfies $\|\rho\|_\infty \le 1$) and the subdifferential contains zero. On the other hand, if $\delta > 1/(n\|\hat\alpha\|_\infty)$, then $X^\top(\rho/n + \delta\hat\alpha)$ is nonzero for every $\|\rho\|_\infty \le 1$.

The minimum ℓ1-norm interpolator is well studied in the context of basis pursuit and allows the recovery of low-dimensional representations of sparse signals [39]. The interest in the minimum ℓ2-norm interpolator is more recent: it played an important role in studying the interplay between interpolation and generalization, being used in many recent papers where the double-descent [24], [26] and benign overfitting [22] phenomena were observed.

²The subdifferential of a function $\omega:\mathbb{R}^p\to\mathbb{R}$ is the set $\partial\omega(\beta_0) = \{v\in\mathbb{R}^p : \omega(\beta) - \omega(\beta_0) \ge v^\top(\beta - \beta_0),\ \forall\beta\in\mathbb{R}^p\}$. See [36]–[38] for properties. We use $\partial\|\hat\beta\|_*$ to denote the subdifferential of $\omega(\beta) = \|\beta\|_*$ evaluated at $\hat\beta$.

Figure 2: Threshold $\bar\delta$ vs. number of features. We fix n = 60 and show the value of $\bar\delta$ (defined in (5)) as a function of the number of features p, for the minimum ℓ2-norm and minimum ℓ1-norm interpolators. The dotted lines give the reference $\delta_{\mathrm{test}} = 0.01\,\mathbb{E}[\|x\|]$ for comparison.

Theorem 1 gives a new perspective on the robustness of the minimum-norm interpolator, allowing us to see it as a solution to adversarial training with adversarial radius $\bar\delta$. Figure 2 shows that the radius $\bar\delta$ of the adversarial training problem corresponding to the minimum-norm interpolator increases with the ratio p/n. It is natural to expect that training with a larger radius would yield models that are more adversarially robust on new test points. The next result formalizes this intuition by establishing an upper bound on the robustness gap: the difference between the square root of the expected adversarial squared error
$$R^{\mathrm{adv}}(\beta;\delta_{\mathrm{test}},\|\cdot\|) = \mathbb{E}_{y_0,x_0}\Big[\max_{\|\Delta x_0\|\le\delta_{\mathrm{test}}}\big(y_0 - (x_0+\Delta x_0)^\top\beta\big)^2\Big]$$
and the square root of the expected squared error $R(\beta) = \mathbb{E}_{y_0,x_0}\big[(y_0 - x_0^\top\beta)^2\big]$. The upper bound depends on the in-training adversarial error and on the ratio $\delta_{\mathrm{test}}/\bar\delta$, which are quantities that can be computed directly.

Proposition 2. Assume X to have full row rank and let $\hat\beta$ be the minimum $\|\cdot\|_*$-norm interpolator; then
$$\sqrt{R^{\mathrm{adv}}(\hat\beta;\delta_{\mathrm{test}},\|\cdot\|)} - \sqrt{R(\hat\beta)} \ \le\ \frac{\delta_{\mathrm{test}}}{\bar\delta}\sqrt{R_{\mathrm{adv}}(\hat\beta;\bar\delta,\|\cdot\|)}. \quad (7)$$

Remark 2. The example in Figure 2 is one case where adding more features makes the minimum-norm interpolator more robust. Examples showing the opposite, illustrating that adding more features to linear models can make them less adversarially robust, abound in the literature [16], [17]. Indeed, examples that consider the minimum ℓ2-norm interpolator subject to ℓ∞-adversarial attacks at test time result in this scenario, as discussed in [17]. Proposition 7 in the Appendix is the equivalent of Proposition 2 for this case (mismatched norms during training and testing). There, the upper bound grows with p, which explains the vulnerability in this scenario.

Remark 3. A pitfall of analyzing the properties of minimum-norm solutions in linear models is that it requires the analysis of nested problems: different choices of p require different covariates x. The random projection model proposed by Bach [40] avoids this pitfall by considering the input x to be fixed, while only a projected version Sx of it is observed, with the number of estimated parameters changing with the number of projections observed. Adversarially training this model consists in finding a parameter $\hat\beta$ that minimizes $\frac{1}{n}\sum_{i=1}^n \max_{\|\Delta x_i\|\le\delta}|y_i - (x_i + \Delta x_i)^\top S^\top\beta|^2$. We generalize our results so that we can also study this case (see Appendix A.4).
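Proposition 2 involves population quantities; in the experiments of Section 7 the corresponding adversarial test errors are estimated on held-out data. Using the pointwise closed form of Proposition 1, such an estimate is a one-liner; the sketch below (name and interface ours) is one way to write it.

```python
import numpy as np

def adversarial_mse(beta, X_test, y_test, delta_test, dual_norm=1):
    """Empirical adversarial squared error: the mean over test points of
    (|y0 - x0^T beta| + delta_test * ||beta||_*)^2, cf. Proposition 1.
    dual_norm=1 corresponds to l_inf test-time attacks, dual_norm=2 to l2 attacks."""
    penalty = delta_test * np.linalg.norm(beta, ord=dual_norm)
    return np.mean((np.abs(y_test - X_test @ beta) + penalty) ** 2)
```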
5 Adversarial training and parameter-shrinking methods

Theorem 1 stated an equivalence with the minimum-norm solution when δ is small. The next result applies to the other extreme of the spectrum, characterizing the case when δ is large.

Proposition 3 (Zero solution of adversarial training). The zero solution $\hat\beta = 0$ minimizes the adversarial training objective if and only if $\delta \ge \|X^\top y\|/\|y\|_1$.

In the remainder of this section, we characterize the solution between these two extremes. We compare adversarial training to Lasso and ridge regression.

5.1 Relation to Lasso and ridge regression

Proposition 1 makes it clear by inspection that adversarial training is similar to other well-known parameter-shrinking regularization methods. Indeed, for ℓ∞-adversarial attacks, the cost function $R_{\mathrm{adv}}(\beta;\delta,\|\cdot\|_\infty)$ in its dual form is remarkably similar to Lasso [30], and $R_{\mathrm{adv}}(\beta;\delta,\|\cdot\|_2)$ to ridge regression (the cost functions were presented in Section 3). In Figures 1 and 3, we show the regularization paths of these methods (the dataset is described in [18]). We observe that ℓ∞-adversarial training produces sparse solutions and that Lasso and ℓ∞-adversarial training have extremely similar regularization paths. There are also striking similarities between ridge regression and ℓ2-adversarial training, with the notable difference that for large δ the solution of ℓ2-adversarial training is zero (which is explained by Proposition 3).

Figure 3: Regularization paths in the Diabetes dataset [18] (see the dataset description in Figure 1): (a) ridge regression; (b) ℓ2-adversarial training. On the horizontal axis, we give the inverse of the regularization parameter (1/λ or 1/δ, in log scale). On the vertical axis, the coefficients.

The next proposition justifies why we observe such similarities in part of the regularization path. It applies to data that has been normalized and to positive responses y (as is the case in our example).

Proposition 4. Assume the output is positive and the data is normalized: $y \ge 0$ and $X^\top\mathbf{1} = 0$. The solution $\hat\beta$ of adversarial training, for $\|\hat\beta\|_* \le \min_i |y_i|/\|x_i\|$, is also the solution of
$$\min_\beta\ \frac{1}{n}\sum_{i=1}^n|y_i - x_i^\top\beta|^2 + \Big(\delta\|\beta\|_* + \frac{1}{n}\|y\|_1\Big)^2.$$

The proposition is proved in Appendix B. It establishes the equivalence between ℓ∞-adversarial training and Lasso (even though there is no closed-form expression for the map between δ and λ). Indeed, under the assumptions of the proposition, $\hat\beta$ is the solution of the ℓ∞-adversarial problem only if it is the solution of
$$\frac{1}{n}\sum_{i=1}^n(y_i - x_i^\top\beta)^2 + \lambda\|\beta\|_1, \quad (8)$$
for some λ ≥ 0. We can prove this statement using the proposition: under the appropriate assumptions, the ℓ∞-adversarial problem solution is also a solution to the problem
$$\min_\beta\ \frac{1}{n}\sum_{i=1}^n(y_i - x_i^\top\beta)^2 + \Big(\delta\|\beta\|_1 + \frac{1}{n}\|y\|_1\Big)^2,$$
which in turn is the Lagrangian formulation of the constrained optimization problem
$$\min_\beta\ \frac{1}{n}\sum_{i=1}^n(y_i - x_i^\top\beta)^2 \quad\text{subject to}\quad \Big(\delta\|\beta\|_1 + \frac{1}{n}\|y\|_1\Big)^2 \le c,$$
for an appropriate constant $c \ge \frac{1}{n^2}\|y\|_1^2$. The constraint can be rewritten as $\delta\|\beta\|_1 \le \sqrt{c} - \frac{1}{n}\|y\|_1$, i.e., as a constraint on $\|\beta\|_1$ alone, and the result follows because (8) is the Lagrangian formulation of this modified problem. A similar reasoning can be used to connect ℓ2-adversarial training with ridge regression. More generally, even when the condition on $\hat\beta$ is not satisfied and when y has negative entries, we show in Appendix B that for zero-mean and symmetrically distributed covariates, i.e., $\mathbb{E}[x] = 0$ and $x \stackrel{d}{=} -x$, adversarial training has approximately the same solution as
$$\min_\beta\ \frac{1}{n}\sum_{i=1}^n|y_i - x_i^\top\beta|^2 + \Big(\delta\|\beta\|_* + \frac{1}{n}s^\top(y - X\beta)\Big)^2, \quad\text{for } s = \operatorname{sign}(y - X\hat\beta).$$
This explains the similarities in the regularization paths.
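A comparison in the spirit of Figure 1 can be reproduced with a few lines of code, pairing the adv_train sketch from Section 1 with scikit-learn's Lasso on the same Diabetes data (which satisfies the assumptions of Proposition 4: centered columns and positive responses). The parameter grids below are arbitrary illustrative choices, not the ones used in the paper.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)  # columns are centered, y > 0

# l_inf-adversarial training path (reusing the adv_train sketch from Section 1)
for delta in np.logspace(-3, 0, 6):
    beta_adv = adv_train(X, y, delta, dual_norm=1)
    print(f"delta={delta:.3g}   ||beta||_1={np.abs(beta_adv).sum():.1f}")

# Lasso path, for comparison (no intercept, matching the setting of Proposition 4)
for lam in np.logspace(-2, 1, 6):
    beta_lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    print(f"lambda={lam:.3g}  ||beta||_1={np.abs(beta_lasso).sum():.1f}")
```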
5.2 Transition into the interpolation regime

The discussion above hints at the similarities between adversarial training and parameter-shrinking methods. Interestingly, Lasso and ridge regression are also connected to minimum-norm interpolators. The ridge regression solution converges to the minimum ℓ2-norm solution as the regularization parameter vanishes, i.e., $\hat\beta_{\mathrm{ridge}}(\lambda) \to \hat\beta_{\min\,\ell_2}$ as $\lambda \to 0^+$. Similarly, there is a relation between the minimum ℓ1-norm solution and Lasso. The relation requires additional care because, in the overparameterized case, Lasso does not necessarily have a unique solution. Nonetheless, it is proved in [41, Lemma 7] that the Lasso solution returned by the LARS algorithm satisfies $\hat\beta_{\mathrm{lasso}}(\lambda) \to \hat\beta_{\min\,\ell_1}$ as $\lambda \to 0^+$.

Figure 4: Training mean squared error vs. inverse adversarial radius. Left: ridge regression and ℓ2-adversarial training. Right: Lasso and ℓ∞-adversarial training. The error bars give the median and the 0.25 and 0.75 quantiles over 5 realizations. The vertical black lines show $\bar\delta$ in (5). Figure S.9 (appendix) shows the test MSE. For ridge regression and Lasso (see Section 3), the x-axis should be read as 1/λ rather than 1/δ. We generate the data synthetically using an isotropic Gaussian feature model (see Section 7) with n = 60 training data points and p = 200 features.

For a sufficiently small δ, the solution of ℓ2-adversarial training equals the minimum ℓ2-norm solution and the solution of ℓ∞-adversarial training equals the minimum ℓ1-norm solution. There is a notable difference though: while this happens only in the limit for ridge regression and Lasso, for adversarial training it happens for all values of δ smaller than the threshold $\bar\delta$. We illustrate this phenomenon next. Unlike ridge regression and Lasso, which converge towards the interpolating solution, adversarial training goes through an abrupt transition and suddenly starts to interpolate the data. Figure 4 illustrates this phenomenon on synthetically generated data (with isotropic Gaussian features, see Section 7).

Intuitive explanation for the abrupt transitions. To get insight into why the abrupt transition occurs, consider the case n = 1, i.e., the dataset has a single data point (x, y), with y = 1 and $\|x\|_2 = 1$. We can write $R_{\mathrm{adv}}(\beta;\delta,\|\cdot\|_2) = (L(\beta))^2$, where $L(\beta) = |y - x^\top\beta| + \delta\|\beta\|_2$. The function L is necessarily minimized along the subspace spanned by the vector x. Along this line, L is piecewise linear with three segments, see Figure 5. One of the segments changes the sign of its slope as δ decreases, and the minimum of the function changes abruptly, which explains why the abrupt transitions occur.

Figure 5: The function L(β) for β = αx, shown for δ = 0.5, 1 and 2.
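This one-dimensional picture is easy to verify numerically. The snippet below (our illustration, matching the setting of Figure 5) evaluates $L(\alpha x) = |1-\alpha| + \delta|\alpha|$ on a grid and shows the minimizer jumping from the interpolating solution $\alpha = 1$ to the zero solution $\alpha = 0$ as δ crosses 1.

```python
import numpy as np

# Single data point (x, y) with ||x||_2 = 1 and y = 1, and beta = alpha * x,
# so that L(beta) = |1 - alpha| + delta * |alpha| (piecewise linear, cf. Figure 5).
alphas = np.linspace(-2, 2, 2001)
for delta in [0.5, 1.0, 2.0]:
    L = np.abs(1 - alphas) + delta * np.abs(alphas)
    # For delta < 1 the minimizer is alpha = 1 (interpolation); for delta > 1 it is
    # alpha = 0; at delta = 1 every alpha in [0, 1] is a minimizer (flat segment).
    print(f"delta={delta}: minimizer ~ {alphas[np.argmin(L)]:.2f}, min value = {L.min():.2f}")
```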
6 Relation to robust regression and square-root Lasso

In this section we establish that both adversarial training and square-root Lasso can be integrated into the robust regression framework. We further investigate the similarities between the methods by analyzing their prediction error upper bounds.

Robust regression framework. Robust linear regression considers the minimization of the cost function
$$R_{\mathrm{robust}}(\beta;\mathcal{S}) = \max_{\Delta\in\mathcal{S}} \|y - (X + \Delta)\beta\|_2, \quad (9)$$
where the disturbance matrix Δ is constrained to belong to the disturbance set $\mathcal{S}$. We connect robust regression, square-root Lasso and adversarial training. The next proposition gives the equivalence between robust regression with a row-bounded disturbance set $\mathcal{R}_{p,\delta}$ and ℓp-adversarial training; see the appendix for the proof.

Proposition 5. For a disturbance set with rows bounded by δ, $\mathcal{R}_{p,\delta} = \{\Delta : \|\Delta_i\|_p \le \delta,\ i = 1,\dots,n\}$ (with $\Delta_i$ the i-th row of Δ), we have that $\arg\min_\beta R_{\mathrm{robust}}(\beta;\mathcal{R}_{p,\delta}) = \arg\min_\beta R_{\mathrm{adv}}(\beta;\delta,\|\cdot\|_p)$.

On the other hand, there is an equivalence between square-root Lasso and robust regression with column-bounded disturbance sets $\mathcal{C}_{2,\delta}$. This was established by Xu et al. [33, Theorem 1] and we repeat it below.

Proposition 6 ([33]). For a disturbance set with columns bounded by δ, $\mathcal{C}_{2,\delta} = \{\Delta = [\zeta_1,\dots,\zeta_p] : \|\zeta_i\|_2 \le \delta,\ \forall i\}$, we have that $\arg\min_\beta R_{\mathrm{robust}}(\beta;\mathcal{C}_{2,\delta}) = \arg\min_\beta R_{\sqrt{\mathrm{lasso}}}(\beta,\delta)$.

The above discussion hints at the similarities between adversarial training and square-root Lasso, as instances of robust regression under different constraints. Square-root Lasso minimizes $R_{\sqrt{\mathrm{lasso}}}(\beta,\lambda) = n^{-1/2}\|y - X\beta\|_2 + \lambda\|\beta\|_1$ (see Section 3). The main motivation for the square-root Lasso is that it is a pivotal method for sparse recovery: that is, it attains near-oracle performance without requiring knowledge of the noise variance to set the regularization parameter [32]. As we show next, a similar property applies to ℓ∞-adversarial training.

Fixed-design analysis and similarities with square-root Lasso. In this section, we assume that the data was generated as $y_i = x_i^\top\beta^* + \varepsilon_i$, where $\beta^*$ is the parameter vector used to generate the data. Under these assumptions, we can derive an upper bound for the (in-sample) prediction error.

Theorem 2. Let $\delta > \delta_* = \frac{3\|X^\top\varepsilon\|_\infty}{\|\varepsilon\|_1}$; then the prediction error of ℓ∞-adversarial training satisfies the bound
$$\frac{1}{n}\|X(\hat\beta - \beta^*)\|_2^2 \ \le\ 8\delta\|\beta^*\|_1\Big(\frac{1}{n}\|\varepsilon\|_1 + 10\,\delta\|\beta^*\|_1\Big). \quad (10)$$

For comparison, we also provide the corresponding result for Lasso (adapted from [31, Thm. 7.13, p. 210]).

Theorem 3 ([31]). Let $\lambda > \lambda_* = \frac{3\|X^\top\varepsilon\|_\infty}{n}$; then the prediction error of Lasso satisfies the bound
$$\frac{1}{n}\|X(\hat\beta - \beta^*)\|_2^2 \ \le\ 8\lambda\|\beta^*\|_1. \quad (11)$$

Without additional constraints (see Remark 5), it can be shown that this bound for Lasso cannot be improved. To satisfy it, however, Lasso requires knowledge of the magnitude of the noise ε: in Theorem 3, $\lambda_* = \frac{3}{n}\|X^\top\varepsilon\|_\infty$; hence, if we rescale ε (i.e., $\varepsilon \to \eta\varepsilon$) then a corresponding change of magnitude follows in $\lambda_*$ (i.e., $\lambda_* \to \eta\lambda_*$). Square-root Lasso [32] avoids this problem and achieves a similar rate without knowing the variance: i.e., it is a pivotal method. ℓ∞-adversarial training has a similar property and allows us to set δ without estimating the variance. This can be seen in Theorem 2, where rescaling ε does not alter the value of $\delta_* = 3\|X^\top\varepsilon\|_\infty/\|\varepsilon\|_1$, because the rescaling affects the numerator and denominator simultaneously. For instance, assume ε has i.i.d. $N(0,\sigma^2)$ entries and that the matrix X is fixed with $\max_{j=1,\dots,p}\frac{1}{\sqrt n}\|\tilde x_j\|_2 \le M$, where $\tilde x_j$ denotes the j-th column of X. For $\lambda \asymp M\sigma\sqrt{(\log p)/n}$, we satisfy (with high probability) the condition in Theorem 3, obtaining
$$\frac{1}{n}\|X(\hat\beta - \beta^*)\|_2^2 \ \lesssim\ M\sigma\sqrt{(\log p)/n}.$$
For adversarial training, we can set $\delta \asymp M\sqrt{(\log p)/n}$ and (with high probability) satisfy the condition of Theorem 2, obtaining the same bound. Notice that this choice of δ does not depend on σ. We provide a full analysis in Appendix C.3.

Remark 4 (On the relation between $\bar\delta$ and $\delta_*$). For sufficiently large n, we have $\delta_* > \bar\delta$ in the scenario above (the noise ε has i.i.d. normal entries $N(0,\sigma^2)$ and the matrix X is fixed). We can prove it by contradiction: if $\delta_* \le \bar\delta$, then we could apply the bound from Theorem 2 to the minimum ℓ1-norm interpolator (due to Theorem 1). Hence,
$$\sigma^2 \approx \frac{1}{n}\|\varepsilon\|_2^2 = \frac{1}{n}\|X(\hat\beta - \beta^*)\|_2^2 \ \lesssim\ M\sigma\sqrt{(\log p)/n},$$
and we can always choose a sufficiently large value of n for which this inequality is false.

Remark 5. In the original square-root Lasso paper [32], a bound is obtained for the ℓ2 parameter distance $\|\hat\beta - \beta^*\|_2^2$ in the case where X satisfies the restricted eigenvalue condition and $\beta^*$ is sparse. Under these stricter assumptions (restricted eigenvalue condition), a faster convergence rate can also be obtained for the prediction error. Here we focus on the fixed-design prediction error without additional assumptions on X. The proofs of these sharper results follow similar steps and yield similar requirements on λ, see [31, Chapter 7]. We leave these and other analyses (such as variable-selection consistency) of adversarial training for future work.
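The pivotal property suggests a practical recipe for choosing δ that mirrors $\delta_*$ in Theorem 2 but replaces the unknown noise by a synthetic draw, which is essentially the choice studied in Section 7 (Figure S.12). A minimal sketch is below; the factor 3 (borrowed from Theorem 2) and the single-draw choice are our illustrative assumptions.

```python
import numpy as np

def pivotal_delta(X, scale=3.0, rng=None):
    """Noise-variance-free choice of the adversarial radius:
    delta = scale * ||X^T xi||_inf / ||xi||_1 with xi ~ N(0, I_n).
    The ratio is invariant to rescaling xi, so the variance of xi is irrelevant,
    mimicking the pivotal quantity delta_* = 3 ||X^T eps||_inf / ||eps||_1."""
    rng = np.random.default_rng(rng)
    xi = rng.standard_normal(X.shape[0])
    return scale * np.linalg.norm(X.T @ xi, np.inf) / np.linalg.norm(xi, 1)
```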
7 Numerical experiments

We study five different examples. The main goal is to experimentally confirm our theoretical findings. For each scenario, we compute and plot the train and test MSE for different choices of p, n and δ. For comparison purposes, we also compute and plot the train and test MSE for Lasso, ridge regression and the minimum-norm interpolators. Finally, in line with our discussion in Section 4, we compute and plot $\bar\delta$ as a function of p/n. In all the numerical examples the adversarial training solution is obtained by minimizing (2) using CVXPY [42]. The scenarios under consideration are described below (see Appendix D for additional details).

1. Isotropic Gaussian feature model. The output is a linear combination of the features plus additive noise: $y_i = x_i^\top\beta^* + \epsilon_i$, for Gaussian noise and covariates, $\epsilon_i \sim N(0,\sigma^2)$ and $x_i \sim N(0, I_p)$ (a minimal data-generation sketch is given after this list).
2. Latent-space feature model [26, Section 5.4]. The features x are noisy observations of a lower-dimensional subspace of dimension d. A vector in this latent space is represented by $z \in \mathbb{R}^d$ and $x = Wz + u$. The output is a linear combination of the latent space plus noise.
3. Random Fourier features model [43]. We apply a random Fourier feature map to inputs of the Diabetes dataset [18]. Random Fourier features are obtained by the transformation $x_i = \sqrt{2/m}\cos(Wz_i + b)$, where the entries of W and b are independently sampled. It can be seen as a one-layer untrained neural network and approximates the Gaussian kernel feature map [43].
4. Random projection model [40]. The data is generated as in the isotropic Gaussian feature model, but we only observe $Sx_i$, i.e., a projection of the inputs.
5. Phenotype prediction from genotype. We illustrate our method on the Diverse MAGIC wheat dataset [44] from the National Institute of Agricultural Botany. We use a subset of the genotype to predict one of the continuous phenotypes.
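For concreteness, a data-generation sketch for scenario 1 is given below. The normalization of $\beta^*$ and the option of making it sparse are our own illustrative choices; the exact configuration used in the paper is described in Appendix D.

```python
import numpy as np

def isotropic_gaussian_data(n, p, sigma, n_nonzero=None, rng=None):
    """Scenario 1: y_i = x_i^T beta* + eps_i, with x_i ~ N(0, I_p) and eps_i ~ N(0, sigma^2)."""
    rng = np.random.default_rng(rng)
    k = p if n_nonzero is None else n_nonzero
    beta_star = np.zeros(p)
    beta_star[:k] = rng.standard_normal(k)
    beta_star /= np.linalg.norm(beta_star)          # fix the signal scale (our choice)
    X = rng.standard_normal((n, p))                 # isotropic Gaussian covariates
    y = X @ beta_star + sigma * rng.standard_normal(n)
    return X, y, beta_star
```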
Results. We experimentally corroborate Theorem 1 in all scenarios. In Figures 4 and S.8 (Appendix) we show the train MSE as we change the adversarial radius δ, confirming the abrupt transition into the interpolation regime and also that $\bar\delta$ is the transition point. In Figures S.9 and S.10 we show the corresponding test MSE without and in the presence of an adversary, respectively. Interestingly, the adversarial radius that yields the best results is not always equal to the radius $\delta_{\mathrm{test}}$ the model will be evaluated on. Figures 2 and S.5 (Appendix) display $\bar\delta$ as a function of the ratio p/n. We observe that $\bar\delta/\mathbb{E}[\|x\|]$ grows in all examples, so we would still expect improved robustness in light of Proposition 2. The random Fourier features model is the only case where $\bar\delta$ seems to decrease (in absolute value); however, $\mathbb{E}[\|x\|_1]$ is also decreasing (and at a faster rate). Figures S.6 and S.7 give the test MSE without and in the presence of an adversary for the minimum-norm interpolator as a function of the ratio p/n, confirming improved adversarial robustness as p grows.

Figure 6: Test normalized MSE (NMSE) in the MAGIC dataset for the minimum ℓ2-norm and minimum ℓ1-norm interpolators and for ℓ2- and ℓ∞-adversarial training, for p ∈ {1000, 2000, 4000, 8000, 16000} features.

Figure 6 provides a comparison of the test error of the different methods under study. For Lasso, ridge regression and adversarial training, we use the best δ or λ available for each method (obtained via grid search). For p = 1000, optimally tuned Lasso and ℓ∞-adversarial training significantly outperform the corresponding minimum ℓ1-norm interpolator; as p increases, the performance of the three methods becomes quite similar. It is also an example where the minimum ℓ1-norm interpolator outperforms the minimum ℓ2-norm interpolator (for large p). In the same setup, Figure S.12 studies a choice of adversarial radius inspired by Theorem 2: we use $\delta \propto \|X^\top\xi\|_\infty/\|\xi\|_1$ for ξ a vector with zero-mean normal entries. We use a random ξ since we do not know the true additive noise. Even with this approximation, ℓ∞-adversarial training performs comparably to Lasso with the regularization parameter set using 5-fold cross-validation over a full search of the hyperparameter space. The figure also provides a comparison with square-root Lasso in a similar setting.

8 Results for general loss functions

The following theorem can be used to generalize Proposition 1.

Theorem 4. Let $L:\mathbb{R}\to\mathbb{R}$ be a convex and lower-semicontinuous function. For every $\delta \ge 0$,
$$\max_{\|\Delta x\|\le\delta} L\big((x + \Delta x)^\top\beta\big) = \max_{s\in\{-1,1\}} L\big(x^\top\beta + \delta s\|\beta\|_*\big). \quad (12)$$

We can find a closed-form expression for $s^\star = \arg\max_{s\in\{-1,1\}} L(x^\top\beta + \delta s\|\beta\|_*)$ in many cases of interest. For regression problems, take any non-decreasing function $\ell:\mathbb{R}_+\to\mathbb{R}_+$ and $L(x^\top\beta) = \ell(|x^\top\beta - y|)$. If ℓ is lower-semicontinuous and convex then so is L, and the result of the theorem holds (the squared loss $\ell(z) = z^2$ is a special case). Then $s^\star = \operatorname{sign}(x^\top\beta - y)$ and
$$\max_{\|\Delta x\|\le\delta} \ell\big(|(x + \Delta x)^\top\beta - y|\big) = \ell\big(|y - x^\top\beta| + \delta\|\beta\|_*\big).$$
For classification, let $y \in \{-1, 1\}$ and $L(x^\top\beta) = \ell\big(y\,(x^\top\beta)\big)$, where ℓ is a non-increasing function. If ℓ is lower-semicontinuous and convex then so is L, and the result of the theorem holds. Then $s^\star = -y$ and we obtain
$$\max_{\|\Delta x\|\le\delta} \ell\big(y\,((x + \Delta x)^\top\beta)\big) = \ell\big(y\,(x^\top\beta) - \delta\|\beta\|_*\big).$$
The above result can also be applied to unsupervised learning. In the Appendix, we illustrate how Theorem 4 can be used for dimensionality reduction: we consider the problem of finding P that minimizes the reconstruction error $\|x - P^\top P x\|_2^2$ (which yields the PCA algorithm) and derive an adversarial version of it.

Remark 6. Throughout this paper we consider $x, \beta \in \mathbb{R}^p$, but Theorem 4 holds more generally for $x \in \mathcal{X}$, a vector in a Banach space endowed with norm $\|\cdot\|$, and $\beta \in \mathcal{X}^*$, a continuous linear map $\beta:\mathcal{X}\to\mathbb{R}$. Just define $x^\top\beta := \beta(x)$ and take $\|\cdot\|_*$ to be the norm of the dual space $\mathcal{X}^*$.
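As a concrete instance of the classification formula above, the worst-case hinge loss under norm-bounded attacks has the closed form $\ell\big(y\,(x^\top\beta) - \delta\|\beta\|_*\big)$ with $\ell(z) = \max(0, 1-z)$, which is convex and non-increasing. The sketch below evaluates this adversarial risk; the hinge loss is our choice of example (the paper states the result for general ℓ).

```python
import numpy as np

def adversarial_hinge_risk(beta, X, y, delta, dual_norm=1):
    """Worst-case average hinge loss under ||dx|| <= delta attacks, using the
    closed form from Theorem 4 (classification case, s* = -y):
    l(y x^T beta - delta ||beta||_*),  with l(z) = max(0, 1 - z).
    Labels y must be in {-1, +1}; dual_norm=1 for l_inf attacks, 2 for l2 attacks."""
    margin = y * (X @ beta) - delta * np.linalg.norm(beta, ord=dual_norm)
    return np.mean(np.maximum(0.0, 1.0 - margin))
```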
9 Conclusion

We study adversarial training in linear regression. Adversarial training allows us to depart from the traditional regularization setting, in which the loss and the regularization terms can be decomposed and we minimize $\frac{1}{n}\sum_{i=1}^n \ell(x_i^\top\beta, y_i) + \Omega(\beta)$ for some penalization function Ω. We show how it provides new insights into the minimum-norm interpolator, by proving a new equivalence between the two methods (Section 4). While adversarially-trained linear regression arrives at solutions similar to those of parameter-shrinking methods in some scenarios (Section 5), it also has some advantages: for instance, the adversarial radius can be set without knowing the noise variance (Section 6). Unlike square-root Lasso, it achieves this while still minimizing a sum of squared terms.

We believe a natural next step is to provide a tailored solver that allows for efficient solutions for models with many features, rendering adversarially-trained linear regression useful, for instance, in modeling genetic data (see one minimal example of phenotype prediction from the genotype in Section 7). Another interesting direction is generalizing our results to classification and nonlinear models. Indeed, adversarial training is very often used in the context of neural networks and it is natural to be interested in the behavior of this class of models. Our work allows for the analysis of simplified theoretical models commonly used to study neural networks. Random Fourier feature models can be analyzed using our theory and are studied in the numerical examples; these models can be seen as a simplified neural network (a one-layer network where only the last-layer weights are adjusted). In Appendix A.4, we analyze adversarial training after linear projections, which could, for instance, be used to analyze deep linear networks, as the ones studied in [45], [46]. Finally, Section 8 provides a result for infinite-dimensional spaces, and one could attempt to use it to analyze kernel methods and infinitely-wide neural networks [47], [48]. Overall, we believe that our results provide an interesting set of tools also for the analysis of nonlinear models, and could provide insight into the empirically observed interplay (see [49]) between robustness and regularization in deep neural networks.

Acknowledgments

The authors would like to thank Carl Nettelblad for very fruitful discussions throughout the work with this research. We also thank Daniel Gedon and Dominik Baumann for their feedback on the first version of the manuscript. TBS and DZ are financially supported by the Swedish Research Council, with the projects Deep probabilistic regression: new models and learning algorithms (contract number: 2021-04301) and Counterfactual Prediction Methods for Heterogeneous Populations (contract number: 2018-05040), and by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. TBS, AHR and DZ are supported by the Kjell och Märta Beijer Foundation. FB is supported by the Agence Nationale de la Recherche as part of the Investissements d'avenir program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), and by the European Research Council (grant SEQUOIA 724063). The research was partially conducted during AHR's research visit to INRIA. The visit was financially supported by the Institut français de Suède through the SFVE-A mobility program and by the European Network of AI Excellence Centres through the ELISE mobility program.

References

[1] J. Bruna, C. Szegedy, I. Sutskever, I. Goodfellow, W. Zaremba, R. Fergus, and D. Erhan, Intriguing properties of neural networks, in International Conference on Learning Representations (ICLR), 2014.
[2] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and Harnessing Adversarial Examples, in International Conference on Learning Representations (ICLR), 2015.
[3] A. Kurakin, I. Goodfellow, S. Bengio, et al., Adversarial attacks and defences competition, in The NIPS '17 Competition: Building Intelligent Systems, 2018, pp. 195–231.
[4] A. Fawzi, O. Fawzi, and P. Frossard, Analysis of classifiers' robustness to adversarial perturbations, Machine Learning, vol. 107, no. 3, pp. 481–508, 2018.
[5] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry, Adversarial Examples Are Not Bugs, They Are Features, Advances in Neural Information Processing Systems, vol. 32, 2019.
[6] X. Yuan, P. He, Q. Zhu, and X. Li, Adversarial examples: Attacks and defenses for deep learning, IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 9, pp. 2805–2824, 2019.
[7] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, Towards Deep Learning Models Resistant to Adversarial Attacks, International Conference on Learning Representations (ICLR), 2018.
[8] R. Huang, B. Xu, D. Schuurmans, and C. Szepesvari, Learning with a Strong Adversary, arXiv:1511.03034, 2016.
[9] T. Bai, J. Luo, J. Zhao, B. Wen, and Q. Wang, Recent Advances in Adversarial Training for Adversarial Robustness, arXiv:2102.01356, 2021.
[10] F. Croce, M. Andriushchenko, V. Sehwag, E. Debenedetti, N. Flammarion, M. Chiang, P. Mittal, and M. Hein, RobustBench: A standardized adversarial robustness benchmark, in NeurIPS Datasets and Benchmarks Track, 2021.
[11] H. Taheri, R. Pedarsani, and C. Thrampoulidis, Asymptotic Behavior of Adversarial Training in Binary Classification, IEEE International Symposium on Information Theory (ISIT), pp. 127–132, 2022.
[12] A. Javanmard, M. Soltanolkotabi, and H. Hassani, Precise tradeoffs in adversarial training for linear regression, in Proceedings of the Conference on Learning Theory, vol. 125, 2020, pp. 2034–2078.
[13] H. Hassani and A. Javanmard, The curse of overparametrization in adversarial training: Precise analysis of robust generalization for random features regression, arXiv:2201.05149, 2022.
[14] Y. Min, L. Chen, and A. Karbasi, The Curious Case of Adversarially Robust Models: More Data Can Help, Double Descend, or Hurt Generalization, Proceedings of the Conference on Uncertainty in Artificial Intelligence, vol. 161, pp. 129–139, 2021.
[15] D. Yin, R. Kannan, and P. Bartlett, Rademacher Complexity for Adversarially Robust Generalization, in Proceedings of the International Conference on Machine Learning, 2019, pp. 7085–7094.
[16] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry, Robustness May Be At Odds with Accuracy, International Conference on Learning Representations (ICLR), 2019.
[17] A. H. Ribeiro and T. B. Schön, Overparameterized Linear Regression under Adversarial Attacks, IEEE Transactions on Signal Processing, 2023.
[18] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, Least angle regression, The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.
[19] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, second edition. Springer Science & Business Media, 2009.
[20] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding deep learning requires rethinking generalization, in International Conference on Learning Representations, 2017.
[21] P. L. Bartlett, A. Montanari, and A. Rakhlin, Deep learning: A statistical viewpoint, arXiv:2103.09177, 2021.
[22] P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler, Benign overfitting in linear regression, Proceedings of the National Academy of Sciences, vol. 117, no. 48, pp. 30063–30070, 2020.
[23] F. Koehler, L. Zhou, D. J. Sutherland, and N. Srebro, Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds and Benign Overfitting, in NeurIPS, 2021.
[24] M. Belkin, D. Hsu, S. Ma, and S. Mandal, Reconciling modern machine-learning practice and the classical bias-variance trade-off, Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15849–15854, 2019.
[25] M. Belkin, D. Hsu, and J. Xu, Two Models of Double Descent for Weak Features, SIAM Journal on Mathematics of Data Science, vol. 2, no. 4, pp. 1167–1180, 2020.
[26] T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani, Surprises in high-dimensional ridgeless least squares interpolation, The Annals of Statistics, vol. 50, no. 2, pp. 949–986, 2022.
[27] C. Dan, Y. Wei, and P. Ravikumar, Sharp statistical guarantees for adversarially robust Gaussian classification, in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 119, PMLR, 2020, pp. 2345–2355.
[28] E. Dobriban, H. Hassani, D. Hong, and A. Robey, Provable tradeoffs in adversarially robust classification, 2022.
[29] A. Javanmard and M. Soltanolkotabi, Precise statistical analysis of classification accuracies for adversarial training, The Annals of Statistics, vol. 50, no. 4, pp. 2127–2156, 2022.
[30] R. Tibshirani, Regression shrinkage and selection via the LASSO, Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
[31] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint, ser. Cambridge Series in Statistical and Probabilistic Mathematics 48. Cambridge University Press, 2019.
[32] A. Belloni, V. Chernozhukov, and L. Wang, Square-Root Lasso: Pivotal Recovery of Sparse Signals via Conic Programming, Biometrika, vol. 98, no. 4, pp. 791–806, 2011.
[33] H. Xu, C. Caramanis, and S. Mannor, Robust regression and Lasso, Advances in Neural Information Processing Systems, vol. 21, 2008.
[34] H. Xu, C. Caramanis, and S. Mannor, Robustness and regularization of support vector machines, Journal of Machine Learning Research, vol. 10, no. 51, pp. 1485–1510, 2009.
[35] Y. Xing, Q. Song, and G. Cheng, On the Generalization Properties of Adversarial Training, in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2021, pp. 505–513.
[36] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar, Convex Analysis and Optimization, 2003.
[37] F. H. Clarke, Optimization and Nonsmooth Analysis. Society for Industrial and Applied Mathematics, 1990.
[38] S. Boyd, J. Duchi, M. Pilanci, and L. Vandenberghe, Subgradients (Lecture Notes), Tech. Rep., Notes for EE364b, Stanford University, Spring 2021–22, 2022.
[39] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 33–61, 1998.
[40] F. Bach, High-dimensional analysis of double descent for linear regression with random projections, arXiv:2303.01372, 2023.
[41] R. J. Tibshirani, The Lasso Problem and Uniqueness, Electronic Journal of Statistics, vol. 7, pp. 1456–1490, 2013.
[42] S. Diamond and S. Boyd, CVXPY: A Python-embedded modeling language for convex optimization, Journal of Machine Learning Research, vol. 17, no. 83, pp. 1–5, 2016.
[43] A. Rahimi and B. Recht, Random Features for Large-Scale Kernel Machines, in Advances in Neural Information Processing Systems 20, 2008, pp. 1177–1184.
[44] M. F. Scott, N. Fradgley, A. R. Bentley, et al., Limited haplotype diversity underlies polygenic trait architecture across 70 years of wheat breeding, Genome Biology, vol. 22, no. 1, p. 137, 2021.
[45] A. M. Saxe, J. L. McClelland, and S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, International Conference on Learning Representations (ICLR), 2014.
[46] T. Laurent and J. Brecht, Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global, in Proceedings of the International Conference on Machine Learning, 2018, pp. 2902–2907.
[47] J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington, Wide neural networks of any depth evolve as linear models under gradient descent, Journal of Statistical Mechanics: Theory and Experiment, vol. 2020, no. 12, p. 124002, 2020.
[48] A. Jacot, F. Gabriel, and C. Hongler, Neural Tangent Kernel: Convergence and Generalization in Neural Networks, Advances in Neural Information Processing Systems 31, 2018.
[49] D. Jakubovitz and R. Giryes, Improving DNN robustness to adversarial attacks using Jacobian regularization, in European Conference on Computer Vision, 2018.