# highdimensional_group_adversarial_training_in_linear_regression__2b3ea6db.pdf

High-dimensional (Group) Adversarial Training in Linear Regression

Yiling Xie School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, Georgia, USA yxie350@gatech.edu

Xiaoming Huo School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, Georgia, USA huo@gatech.edu

Adversarial training can achieve robustness against adversarial perturbations and has been widely used in machine-learning models. This paper delivers a nonasymptotic consistency analysis of the adversarial training procedure under ℓ - perturbation in high-dimensional linear regression. It will be shown that, under the restricted eigenvalue condition, the associated convergence rate of prediction error can achieve the minimax rate up to a logarithmic factor in the high-dimensional linear regression on the class of sparse parameters. Additionally, the group adversarial training procedure is analyzed. Compared with classic adversarial training, it will be proved that the group adversarial training procedure enjoys a better prediction error upper bound under certain group-sparsity patterns.

1 Introduction

Adversarial training is proposed to hedge against adversarial perturbations and has attracted much research interest in recent years. Adversarial training has been widely used in Large Language Models [14, 24], computer vision [13], cybersecurity [33], etc. While the empirical risk minimization procedure optimizes the empirical loss, the adversarial training procedure seeks conservative solutions that optimize the worst-case loss under a given magnitude of perturbation. People have actively investigated model modifications and algorithmic frameworks to improve performance and training efficiency for adversarial training under different problem settings [1, 10, 12, 22, 23, 27, 29].

We are interested in understanding the fundamental properties of adversarial training from a statistical viewpoint. A standard approach for statisticians to evaluate statistical or machine-learning models is to investigate whether the estimator obtained from the model can achieve the minimax rate [25]. In this paper, we will give the non-asymptotic convergence rate of the prediction error in highdimensional adversarial training. The associated convergence rate achieves the minimax rate under certain settings, which will be clarified in Section 2.2.

In machine-learning models, adversarial training has the following mathematical formulation:

i=1 sup δ L (Xi + , Yi, β) ,

where (X1, Y1), ..., (Xn, Yn) are given samples, is the perturbation, is the perturbation norm, δ is the perturbation magnitude, β is the model parameter, and L(x, y, β) is the loss function with x being the input variable and y being the response variable.

Regarding the choice of the perturbation norm, we focus on ℓ -perturbation, i.e., = . Some literature has pointed out that ℓ -perturbation could help recover the model sparsity [21, 28].

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

For example, [28] has proved that the asymptotic distribution of adversarial training estimator under ℓ -perturbation has a positive mass at 0 when the underlying parameter is 0. Since the sparsity assumption could improve the model interpretation and reduce problem complexity [11], especially in high-dimensional regimes, ℓ -perturbation will be studied, and certain sparsity patterns will be assumed in this paper. In terms of the loss function, we focus on the loss in the linear regression, i.e., L(x, y, β) = (x β y)2. In particular, many of the existing theoretical explorations on adversarial training are based on linear models [15, 16, 21, 28, 30, 31], which admits advanced analytical analysis and sheds light on the characteristics of adversarial training in more general settings and applications. In this regard, the linear regression is considered in this paper. In short, we will focus on the adversarial-trained linear regression as shown in the following:

bβ arg min β

(Xi + ) β Yi 2 , (1)

where Xi, β Rp and Yi R. For the convenience of analysis, we write the given n samples (X1, Y1), ..., (Xn, Yn) in the matrix form: X = (X1, ..., Xn) Rn p and Y = (Y1, ..., Yn) Rn 1, where we call X the design matrix.

This paper delivers the convergence analysis under the high-dimension setting, where we suppose the dimension of the model parameter β is larger than the sample size, i.e., p > n. Further, the parameter sparsity is assumed. Specifically, we suppose that only a subset of the elements of the p-dimensional ground-truth parameter β is nonzero. If the size of the nonzero subset is s, it will be shown that the resulting prediction error of problem (1), i.e., X(bβ β ) 2 2/n, is of the order s log p/n under the restricted eigenvalue condition. The restricted eigenvalue condition is a standard assumption in the literature of sparse high-dimensional linear regression [2, 3, 4, 17]. Notably, the rate s log p/n is optimal in the minimax sense, up to a logarithmic factor, for all estimators over a class of s-sparse p-dimensional vectors if there are n training samples [20].

Our aforementioned results have the following implications. Firstly, in addition to robustness towards perturbation, our results show that adversarial training is beneficial regarding statistical optimality. This means the resulting estimator can achieve the minimax convergence rate for prediction error. To the best of our knowledge, we are the first to prove the minimax optimality of the adversarial training estimator. Secondly, our analysis illustrates that the ℓ -perturbation is recommended if the sparsity condition, i.e., the ground truth β is supported on a subset of {1, ..., p}, is known and a fast theoretical error convergence rate is required.

The convergence rate of the group adversarial training is also investigated. In the literature, the group effect has been studied in (finite-type) Wasserstein distributionally robust optimization problems [5, 7]. Since adversarial training is equivalent to -type Wasserstein distributionally robust optimization problem [9], the formulation of group Wasserstein distributionally robust optimization problem discussed in [5, 7] can be generalized to the adversarial training problem. We give a formal formulation of the group adversarial training problem based on frameworks in [5, 7]. Further, we derive the non-asymptotic convergence rate of the prediction error for the group adversarial training problem. It will be shown that group adversarial training can achieve a faster convergence upper bound if certain group sparsity structures are satisfied. The details can be found in Section 3.2.

1.1 Related Work

We review and compare some related work in this subsection.

The asymptotic behavior of ℓ -adversarial training estimator in the generalized linear model has been discussed in [28]. Notably, the paper [28] studies the behavior of adversarial training estimator from an asymptotic point of view, while our paper delivers a non-asymptotic analysis. More specifically, analysis in [28] is based on the asymptotic distribution of the adversarial training estimator, while our work is to give a non-asymptotic upper bound of the prediction error of the adversarial training estimator. More discussions can be found in Remark 2.7.

The prediction error of ℓ -adversarial training estimator has been briefly analyzed in [21], where the proven convergence is of the order 1/ n in terms of n. The results in our paper are different in the following two perspectives. Firstly, a faster convergence rate of the order 1/n in terms of n is given, and the associated rate is minimax optimal up to a logarithmic factor. Secondly, we have incorporated the sparsity setting in the model analysis, while no sparsity pattern is considered in

theoretical analysis for ℓ -adversarial training in [21]. More discussions can be found in Remark 2.8.

The paper [30] also investigates the convergence of adversarial training estimator in linear regression. The derivations in [30] are based on the assumption that the input variable X follows p-dimensional Gaussian distribution while our analysis imposes the restricted eigenvalue condition. In addition, notice that [30] argues the superiority of incorporating the sparsity information by deriving lower bounds for the estimator error while we directly prove the rate optimality of the adversarial training estimator under the sparsity assumption. Also, [30] applies ℓ2-perturbation while our work focuses on ℓ -perturbation.

In the literature, it has been proven that multiple estimators, including LASSO, Dantzig selector, and square-root LASSO, can achieve the minimax rate (up to a logarithmic factor) in high-dimensional sparse linear regression [3, 4, 11, 20]. However, to the best of our knowledge, no literature has investigated this property for the widely used adversarial training model. We are the first to study whether the adversarial training estimator can be minimax optimal, and our theoretical analysis implies that the answer is yes, i.e., the adversarial training estimator under ℓ -perturbation enjoys rate optimality. In addition, the group lasso has been intensely studied to explore the parameter group structure [17, 32] while group adversarial training imposes group structure on the perturbation. It will be shown that the group adversarial training estimator shares a similar convergence rate with the group LASSO estimator. Our proof technique is developed upon and extends the technical methods in the aforementioned papers [3, 4, 11, 17].

1.2 Notations and Preliminaries

We introduce some notations, which will be used in the rest of the paper. For vector z Rp, we use z q to denote the ℓq norm of the vector z, i.e., z q q = Pp j=1 |zj|q, 1 q < , z = max1 j p |zj|. We use ej Rp, 1 j p, to denote the basis vectors where the jth component is 1 and 0 otherwise. In Rn n denotes the identity matrix. For some set S, we use Sc to denote the complement set of S and |S| to denote the cardinality of S. If the set S is the subset of {1, ..., p}, we use z S R|S| to denote the subvector indexed by elements of S.

We clarify some preliminary settings that will be used in this paper. Throughout this paper, we suppose the high-dimension setting holds, the samples are generated from the Gaussian linear model, and the design matrix is normalized. We conclude these conditions with the following assumptions:

Assumption 1.1 (High-dimension). The parameter dimension p is larger than the sample size n, i.e., we have n < p.

Assumption 1.2 (Gaussian linear model). The design matrix X is fixed and the response vector Y is generated by the following: Y = Xβ + ϵ, where ϵ has i.i.d. entries N(0, σ2).

Assumption 1.3 (Normalization). The design matrix X is normalized such that Xej 2 n, for 1 j p.

1.3 Organization of this Paper

The remainder of this paper is organized as follows. In Section 2, we derive the convergence rate of the adversarial training estimator in high-dimensional linear regression. In Section 3, we derive the convergence rate of group adversarial training and compare it with the existing adversarial training. Numerical experiments are conducted in Section 4. Possible future work is discussed in Section 5. The proofs are relegated to the Appendix whenever possible.

2 ℓ -Adversarial-Trained Linear Regression

In this section, we will first introduce the problem formulation of the adversarial training in linear regression under ℓ -perturbation and then deliver the convergence analysis of the prediction error under ℓ -perturbation in the high-dimensional setting.

2.1 Problem Formulation

In this subsection, we give the problem formulation of ℓ -adversarial-trained linear regression and discuss its dual.

Recall that the ℓ -adversarial training problem in linear regression has the formulation shown in (1). The solution bβ to the optimization problem (1) is used to estimate the ground-truth data generating parameter β , seeing Assumption 1.2. In the inner optimization problem, we compute the worst-case square loss between the response variable and the linear prediction among the perturbations. The perturbations are added to the input variable and with the largest ℓ -norm δ. In the outer optimization problem, we optimize the empirical expectation of the worst-case loss of given samples.

The optimization problem (1) can be further simplified by considering its dual formulation, which is shown as follows [21, Proposition 1]. Proposition 2.1 (Dual Formulation of problem (1)). If we denote the optimal value of problem (1) by R(δ), then we have that

R(δ) = min β 1 n

|X i β Yi| + δ β 1 2 . (2)

We discuss the advantages and theoretical insights we could get by considering the dual problem (2). Note that the dual formulation (2) removes the inner maximization of problem (1), and the associated objective function is a convex function of β. Thus, it will be more convenient to solve the dual problem (2) than the primal problem (1). Also, the expansion of the objective function in (2) yields the following:

i=1 (X i β Yi)2 + δ β 1

i=1 |X i β Yi|

+ δ2 β 2 1, (3)

where the residual term δ2 β 2 1 will be of high order if we let δ, for example, be proportional to the inverse of a positive power of n. Regardless of the high order residual term, the objective function in problem (2) can be viewed as the sum of the loss function in linear regression and a regularization term depending on β 1. This implies that ℓ -adversarial-trained linear regression has a regularization effect. We refer to [21, 28] and references therein for more discussions about the regularization effect of adversarial training. Since the well-known LASSO is formulated by imposing the ℓ1-norm regularization term and enjoys the minimax convergence rate of the prediction error [20], the dual formulation (2) of ℓ -adversarial training in linear regression and its expansion (3) may indicate a fast convergence of its prediction error for the adversarial training estimator.

2.2 Convergence Analysis

In this subsection, we will first introduce the restricted eigenvalue condition and then derive the convergence rate of the prediction error for the adversarial training estimator in high-dimensional linear regression under the restricted eigenvalue condition and ℓ -perturbation. We will also discuss the high-probability arguments upon which we prove the optimality of the associated adversarial training estimator.

Before we deliver the convergence analysis, we make the following assumption. Assumption 2.2 (Restricted Eigenvalue Condition). The matrix X Rn p satisfies the restricted eigenvalue condition if there exists a positive number γ = γ(s, c1) > 0 such that

min Xv 2 n v 2 : |S| s, v Rp\{0}, v Sc 1 c1 v S 1

where S is some subset of {1, ..., p}.

In the sequel, we use the notation RE(s, c1) to denote the restricted eigenvalue condition w.r.t. the cardinality s of the index set S and the constant c1 in the constrained cone, i.e., v Sc 1 c1 v S 1. The restricted eigenvalue condition can be considered as a relaxation of the positive semidefiniteness of the gram matrix X X and is a useful technique in theoretical analysis in the sparse highdimensional analysis [11].

Equipped with Assumption 2.2, we have the following convergence result of prediction error.

Theorem 2.3 (Prediction Error Analysis for Adversarial Training). Suppose the adversarial training problem (1) satisfies 2 X ϵ

If β is supported on a subset S of {1, ..., p} where |S| s, and the design matrix X satisfies RE(s, 3) with parameter γ(s, 3), then we have that

1 2n X(bβ β ) 2 2 3δ2s max

( 9 γ2(s, 3)

2 , 164 β 2 2

Theorem 2.3 shows that the upper bound of the prediction error mainly depends on the sparsity cardinality s of the ground-truth parameter β and perturbation magnitude δ. The perturbation magnitude δ is assumed to be equal to or larger than 2 X ϵ / ϵ 1. We could apply the concentration inequalities to give a closed-form expression of the perturbation magnitude, based on which the convergence rate of the prediction error is derived. The convergence rate holds with a high probability and can be found in the following corollary.

Corollary 2.4. Consider the adversarial training problem (1) with perturbation magnitude

If β is supported on a subset S of {1, ..., p} where |S| s, and the design matrix X satisfies RE(s, 3) with parameter γ(s, 3), then we have that

1 2n X(bβ β ) 2 2 192s log p

( 9 γ2(s, 3)

2 , 164 β 2 2

holds with a probability greater than 1 2 exp ( C1n) 2/p, where C1 is a positive constant.

Remark 2.5. We discuss the choice of δ. Corollary 2.4 implies that the perturbation magnitude δ should be of the order 1/ n in order to derive the non-asymptotic convergence rate (6). The associated order is consistent with the asymptotic analysis in [28], where the sparsity-recovery ability could be proven in the asymptotic sense if the sample size is of the order 1/ n.

In Corollary 2.4, we choose δ as is shown in (5). Under this setting, it can be proven that the inequality (4) holds with a high probability. Then, we adopt Theorem 2.3 and could have the expression of the prediction error in terms of p and n as shown in (6).

The convergence rate could be further simplified in the following corollary if both the error variance σ2 and the ℓ2-norm of the ground-truth parameter β are bounded.

Corollary 2.6. Under the assumptions stated in Corollary 2.4, suppose there exists a finite positive constant R such that

41 β 2 R, σ < 1

then we have that 1 2n X(bβ β ) 2 2 192s log p

holds with a probability greater than 1 2/p 2 exp ( C1n) exp( n), where C1 is a positive constant.

Remark 2.7. Corollary 2.6 investigates the behavior of the adversarial training estimator under ℓ -perturbation by computing the resulting prediction error while [28] studies the behavior of the adversarial training estimator under ℓ -perturbation by deriving the associated limiting distribution. Both [28] and our results consider the sparsity condition. [28] proves that ℓ -adversarial training can help recover sparsity asymptotically if the parameter sparsity is known while our paper, i.e., Corollary 2.6, provides a fast non-asymptotic convergence rate for prediction error under the sparsity condition.

Remark 2.8. Corollary 2.6 illustrates that the convergence rate of prediction error for ℓ -adversarial training in linear regression is of the order s log p/n while the prediction error shown in [21] has a lower order 1/ n in terms of n. Our paper achieves a faster rate by incorporating the sparsity information and applying the restricted eigenvalue condition.

Corollary 2.6 implies that the prediction error of high-dimensional ℓ -adversarial-trained estimator is of the order s log p/n. This order is optimal up to a logarithmic factor in the minimax sense for any estimators over a class of s-sparse vectors in Rp when n samples are given [2, 20].

3 Group Adversarial Training

This section will elaborate on the formulation of group adversarial training and the associated convergence rate. Also, we compare group adversarial training under (2, )-perturbation with classic adversarial training under ℓ -perturbation.

Since the adversarial training forces the perturbation with uniform magnitude to each component of the input variable, it may not perform well if the input variable has a group effect. The group structure exists in many real-world problems. For example, groups of genes act together in pathways in gene-expression arrays [18], and financial data can be grouped by different sectors and industries to help market prediction [6]. Also, if an input variable is a multilevel factor and dummy variables are introduced, then these dummy variables act in a group [32]. Group adversarial training can tackle the group effect by adding group-structured perturbation. The detailed formulation can be seen in Section 3.1.

3.1 Problem Formulation

In this subsection, we describe the formulation of the group adversarial training.

Suppose the input variable x can be divided into L non-overlapped groups. Then, we have the definition of the group-structured weighted norm accordingly in the following proposition, where the associated dual norm is also stated. Proposition 3.1 (Proposition 5 in [5], Theorem 2.2 in [7]). Consider a vector x = (x1, ..., x L), where each xl Rpl, and PL l=1 pl = p. Define the weighted (r, s)-norm of x with the L-dimensional weight vector ω = (ω1, ..., ωL) to be:

l=1 ωlxl s r

xω r, = max 1 l L ωlxl r, s = ,

where ωl > 0, l and r 1. Then, the dual norm of (r, s)-norm with weight ω is the (q, t)-norm with weight ω 1 = (1/ω1, ..., 1/ωL), i.e. xω 1 q,t, where 1/r + 1/q = 1 and 1/s + 1/t = 1.

To handle the group structure in the input variable, the weighted (r, s)-norm is applied to add group structure in the perturbation accordingly, and the group adversarial training is formulated as follows,

i=1 sup ω r,s δ

(Xi + ) β Yi 2 ,

where ω = (ω1, ..., ωL).

Recall we focus on adversarial training problems under ℓ -perturbation, high-dimension setting, and sparsity condition. Under this consideration, we let s = and r = 2, and then the associated group adversarial training problem under (2, )-perturbation has the following expression:

i=1 sup ω 2, δ

(Xi + ) β Yi 2 . (7)

To facilitate convenience for the computation and analysis, similar to our study towards classic adversarial training in Section 2.1, we derive the dual formulation of problem (7) in the following proposition. One can check that the corresponding objective in the dual formulation (8) is also convex.

Proposition 3.2 (Dual Formulation of problem (7)). If we denote the optimal value of problem (7) by e R(δ), then we have that

e R(δ) = min β 1 n

|(Xi + ) β Yi| + δ βω 1 2,1 2 , (8)

3.2 Convergence Analysis

In this subsection, we deliver the convergence analysis of the prediction error of the estimator obtained from group adversarial training under (2, )-perturbation, i.e., problem (7).

First, we clarify some notations for subsequent analysis. In terms of the group structure of the input variable and the perturbation, we focus on non-overlapped cases. Assume that the index set {1, ..., p} has the prescribed (disjoint) partition {1, ..., p} = SL l=1 Gl. We use pl to denote the cardinality of each group, i.e., |Gl| = pl.

Consider the group sparsity structure in the ground-truth parameter β Rp, where sparsity is imposed at the group level instead of on individual components. Specially, the set J {1, ..., L} denotes a set of groups and β is supported at these J groups, i.e., β is supported on the GJ = S

We make the following assumption before we proceed to derive the convergence analysis. Assumption 3.3 (Group Restricted Eigenvalue Condition). The matrix X Rn p satisfies the group restricted eigenvalue condition if there exists a positive number κ = κ(g, c2) > 0 such that

( Xv 2 n v GJ 2 : |J| g, v Rp\{0}, X

1 ωl vl 2 c2 X

where J is some subset of {1, ..., L}.

In the sequel, we use the notation GRE(g, c2) to denote the restricted eigenvalue condition w.r.t. the cardinality g of the index set J and the constant c2 in the constrained cone, i.e., P

l Jc 1 ωl vl 2 c2 P

l J 1 ωl vl 2. Group restricted eigenvalue condition is an extension of the restricted eigenvalue condition and can be used in the theoretical analysis for the group LASSO, seeing [17]. Theorem 3.4 (Prediction Error Analysis for Group Adversarial Training). Consider the group adversarial training problem (7) satisfying

2 X ϵ l 2 ϵ 1 δ

If β is supported on a subset GJ of {1, ..., p} where |J| g, and the design matrix X satisfies GRE(g, 3) with parameter κ(g, 3), then we have that

1 2n X(eβ β ) 2 2 3δ2 X

( 9 κ2(g, 3)

2 , 164 β 2 2

where eβ is the estimator obtained from solving problem (7).

Theorem 3.4 shows that the upper bound of the prediction error mainly depends on the weight ω and perturbation magnitude δ. We apply the arguments in concentration inequalities and obtain the convergence rate in the following corollary. Corollary 3.5. Consider the group adversarial training problem (7) satisfying

3pl + 9 log L

and Ψl = X Gl XGl/n = Ipl pl, where XGl denotes the n pl sub-matrix of X formed by the columns indexed by Gl. If β is supported on a subset GJ of {1, ..., p} where |J| g, and the design matrix X satisfies GRE(g, 3) with parameter κ(g, 3), then we have that

1 2n X(eβ β ) 2 2 432|GJ| + g log L

( 9 κ2(s, 3)

2 , 164 β 2 2

holds with a probability greater than 1 2 exp ( C1n) 2/L, where C1 is a positive constant.

Remark 3.6. Note that we assume the gram matrix satisfies that X Gl XGl/n = Ipl pl. This is a standard assumption in the theoretical analysis in sparse high-dimensional linear regression, seeing [17, 19].

Similar to the analytic investigations in Section 2, the convergence rate of the prediction error could be further simplified in the following corollary if the ℓ2-norm of the ground-truth parameter β and error variance σ2 are bounded. Corollary 3.7. Under the assumptions stated in Corollary 3.5, suppose there exists a finite positive constant R such that 2

41 β 2 R, σ < 1

then we have that 1 2n X(eβ β ) 2 2 432|GJ| + g log L

holds with a probability greater than 1 2/L 2 exp ( C1n) exp( n), where C1 is a positive constant. Remark 3.8. If we make the number of groups equal to p, i.e., each group only has one component, then we will have that L = p, pl = 1, |GJ| = g. The resulting error bound is g log p/n, where g denotes the number of nonzero components of β . This order matches what is derived in Corollary 2.6.

Corollary 3.7 indicates that the upper bound of the associated prediction error in group adversarial training under (2, )-perturbation is of the order (|GJ| + g log L)/n. Recall that L is the number of prescribed groups for the p-dimensional variable, the ground-truth parameter β Rp is supported by a subset of the L groups, and the subset is denoted by J {1, ..., L}. The cardinality of the support subset J is g. We also use GJ {1, ..., p} to denote all the indexes included in J.

It follows from Corollary 2.6 that the convergence rate of the prediction error for the classic adversarial training under ℓ -perturbation is of the order s log p/n, where s denotes the cardinality of the support set of β . Then, we can conclude that if |GJ|/s log p and g s, the group adversarial training is superior to the classic adversarial training. In essence, if the sparsity pattern of β has a good group structure, i.e., most of the nonzero components can be captured in J, then the group adversarial training procedure can provide an improved upper bound for the prediction error.

4 Numerical Experiments

In this section, we will run numerical experiments to observe the empirical performances of (group) adversarial training in high-dimensional linear regression.

We consider the following models to generate synthetic data: The response variable Y is generated by the Gaussian linear model, as stated in Assumption 1.2. The standard deviation of the error ϵ is chosen as 0.1. In Model 1, the ground truth parameter β is a 500-dimensional vector. The first four components are [0.1, 0.2, 0.15, 0.25], the last four components are [0.9, 0.95, 1, 1.05], and the other components are zero. In Model 2, the ground truth parameter β is a 600-dimensional vector. The first three components of β are [0.4, 0.5, 0.6]. The last three components of β correspond to dummy variables generated from a four-level categorical factor. These dummy variable components are assigned values [0.2, 0.3, 0.7]. The other components are zero.

We run adversarial training under ℓ -perturbation and group adversarial training under (2, )- perturbation to give the estimations for the ground-truth parameter β , respectively. As suggested in Corollary 2.4 and Corollary 3.5, the perturbation magnitude is chosen in the order of 1/ n in the

adversarial training; the ratio of the perturbation magnitude and the perturbation weight is chosen in the order of 1/ n in the group adversarial training. For the constant, we selected 1 for simplicity and experimental convenience. For the group adversarial training, we divide the parameter equally into 125 groups of size 4 for Model 1 and 200 groups of size 3 for Model 2. The sample sizes are chosen {50, 100, 150, 200, 250, 300, 350, 400} for Model 1 and {50, 100, 150, 200, 250, 350, 450, 550} for Model 2. In terms of computation, we apply the dual formulations, i.e., problem (2) and problem (8), and solve these convex optimization problems using the CVXPY toolbox [8]. Five random samples are generated at each sample size, and we run (group) adversarial training for each sample. The mean and standard error of the coefficient estimations and prediction errors are computed and recorded.

We first plot the coefficient estimation paths of adversarial training with error bars in Figure 1 and Figure 2. Both adversarial training and group adversarial training can shrink the parameter estimation, while group adversarial training performs a better shrinkage effect, In addition, the final values that the coefficients converge to are annotated in the figures. Given the ground-truth nonzero values [0.1, 0.15, 0.2, 0.25, 0.9, 0.95, 1, 1.05] and [0.4, 0.5, 0.6, 0.2, 0.3, 0.7], the final values of group adversarial training are closer to the ground-truth, indicating that the group adversarial training output more accurate estimations.

We also plot the curve of log10(prediction error) versus log10(sample size) with error bars in Figure 3 and Figure 4. We can observe that the slopes of two curves are approximately equal to 1, which is consistent with our theoretical analysis, where we have proved that the prediction error for highdimensional (group) adversarial training is of the order 1/n in terms of the sample size n. Further, the curves and error bars of group adversarial training are below those of adversarial training, indicating the superiority of group adversarial training. This conclusion is also consistent with our theoretical analysis that if the model has a good group structure, group adversarial training has a lower order of prediction error.

50 100 150 200 250 300 350 400 Sample Size

Adversarial Training

0.86 0.91 0.96 1.02

50 100 150 200 250 300 350 400 Sample Size

Group Adversarial Training

0.88 0.93 0.97 1.03

Figure 1: Coefficient Estimation Path in Model 1

50 100 150 200 250 300 350 400 450 500 Sample Size

Adversarial Training

50 100 150 200 250 300 350 400 450 500 Sample Size

Group Adversarial Training

Figure 2: Coefficient Estimation Path in Model 2

1.8 2.0 2.2 2.4 2.6 log(Sample Size)

Prediction Error

Group Adversarial Training Adversarial Training

Figure 3: Prediction Error in Model 1

1.8 2.0 2.2 2.4 2.6 log(Sample Size)

Prediction Error

Group Adversarial Training Adversarial Training

Figure 4: Prediction Error in Model 2

5 Discussions

This paper reveals the statistical optimality of adversarial training under ℓ -perturbation in high dimensional linear regression and discusses potential improvements that can be achieved by group adversarial training. In the future, we may generalize the analysis in linear regression to broader statistical models, e.g., the generalized linear model and other parametric models. Also, since the prediction errors are investigated in this paper, we will consider analyzing estimation errors as our next step. More advanced analytical future work may use the primal-dual witness technique for the non-asymptotic variable-selection analysis.

Acknowledgments and Disclosure of Funding

The authors would like to thank the chairs and anonymous reviewers for their careful comments, which helped enhance the presentation of the manuscript. The authors are partially sponsored by a subcontract of NSF grant 2229876, the A. Russell Chandler III Professorship at Georgia Institute of Technology, and NIH-sponsored Georgia Clinical & Translational Science Alliance.

[1] M. Andriushchenko and N. Flammarion. Understanding and improving fast adversarial training. Advances in Neural Information Processing Systems, 33:16048 16059, 2020.

[2] P. C. Bellec, G. Lecué, and A. B. Tsybakov. Slope meets lasso: improved oracle bounds and optimality. The Annals of Statistics, 46(6B):3603 3642, 2018.

[3] A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791 806, 2011.

[4] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, pages 1705 1732, 2009.

[5] J. Blanchet and Y. Kang. Distributionally robust groupwise regularization estimator. In Asian Conference on Machine Learning, pages 97 112. PMLR, 2017.

[6] M. J. A. Chan-Lau. Lasso regressions and forecasting models in applied stress testing. International Monetary Fund, 2017.

[7] R. Chen and I. C. Paschalidis. Robust grouped variable selection using distributionally robust optimization. Journal of Optimization Theory and Applications, 194(3):1042 1071, 2022.

[8] S. Diamond and S. Boyd. CVXPY: A python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1 5, 2016.

[9] R. Gao and A. Kleywegt. Distributionally robust stochastic optimization with wasserstein distance. Mathematics of Operations Research, 48(2):603 655, 2023.

[10] Y. Gao, L. Qin, Z. Song, and Y. Wang. A sublinear adversarial training algorithm. In The Twelfth International Conference on Learning Representations, 2024.

[11] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical learning with sparsity. Monographs on statistics and applied probability, 143(143):8, 2015.

[12] D. Hendrycks, K. Lee, and M. Mazeika. Using pre-training can improve model robustness and uncertainty. In International conference on machine learning, pages 2712 2721. PMLR, 2019.

[13] C. Herrmann, K. Sargent, L. Jiang, R. Zabih, H. Chang, C. Liu, D. Krishnan, and D. Sun. Pyramid adversarial training improves vit performance. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13419 13429, 2022.

[14] X. Hu, P.-Y. Chen, and T.-Y. Ho. Radar: Robust AI-text detection via adversarial learning. Advances in Neural Information Processing Systems, 36:15077 15095, 2023.

[15] A. Javanmard and M. Soltanolkotabi. Precise statistical analysis of classification accuracies for adversarial training. The Annals of Statistics, 50(4):2127 2156, 2022.

[16] A. Javanmard, M. Soltanolkotabi, and H. Hassani. Precise tradeoffs in adversarial training for linear regression. In Conference on Learning Theory, pages 2034 2078. PMLR, 2020.

[17] K. Lounici, M. Pontil, S. van de Geer, and A. B. Tsybakov. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, pages 2164 2204, 2011.

[18] G. Malenová, D. Rowson, and V. Boeva. Exploring pathway-based group lasso for cancer survival analysis: a special case of multi-task learning. Frontiers in Genetics, 12:771301, 2021.

[19] Y. Nardi and A. Rinaldo. On the asymptotic properties of the group lasso estimator for linear models. Electronic Journal of Statistics, 2:605 633, 2008.

[20] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on information theory, 57(10):6976 6994, 2011.

[21] A. Ribeiro, D. Zachariah, F. Bach, and T. Schön. Regularization properties of adversariallytrained linear regression. Advances in Neural Information Processing Systems, 36, 2023.

[22] A. Robey, F. Latorre, G. J. Pappas, H. Hassani, and V. Cevher. Adversarial training should be cast as a non-zero-sum game. In The Twelfth International Conference on Learning Representations, 2024.

[23] A. Shafahi, M. Najibi, M. A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein. Adversarial training for free! Advances in neural information processing systems, 32, 2019.

[24] E. Shayegani, M. A. A. Mamun, Y. Fu, P. Zaree, Y. Dong, and N. Abu-Ghazaleh. Survey of vulnerabilities in large language models revealed by adversarial attacks. ar Xiv preprint ar Xiv:2310.10844, 2023.

[25] V. N. Vapnik, V. Vapnik, et al. Statistical learning theory. 1998.

[26] R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.

[27] Z. Wang, T. Pang, C. Du, M. Lin, W. Liu, and S. Yan. Better diffusion models further improve adversarial training. In International Conference on Machine Learning, pages 36246 36263. PMLR, 2023.

[28] Y. Xie and X. Huo. Asymptotic behavior of adversarial training estimator under ℓ -perturbation. ar Xiv preprint ar Xiv:2401.15262, 2024.

[29] Y. Xing, Q. Song, and G. Cheng. On the algorithmic stability of adversarial training. Advances in neural information processing systems, 34:26523 26535, 2021.

[30] Y. Xing, R. Zhang, and G. Cheng. Adversarially robust estimate and risk analysis in linear regression. In International Conference on Artificial Intelligence and Statistics, pages 514 522. PMLR, 2021.

[31] D. Yin, R. Kannan, and P. Bartlett. Rademacher complexity for adversarially robust generalization. In International conference on machine learning, pages 7085 7094. PMLR, 2019.

[32] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(1):49 67, 2006.

[33] S. Zhou, C. Liu, D. Ye, T. Zhu, W. Zhou, and P. S. Yu. Adversarial attacks and defenses in deep learning: From a perspective of cybersecurity. ACM Computing Surveys, 55(8):1 39, 2022.

A Proof of Theorem 2.3

We first give the upper bound of the bβ 1 in terms of β 1.

Lemma A.1 (Upper Bound of bβ 1). Under conditions stated in Theorem 2.3, we have that

bβ 1 9 β 1.

Proof. The proof of this lemma follows a similar approach to the proof of Theorem 2 in [21]. It follows from the first-order condition of dual formulation (2) of the adversarial training problem that

n X (X bβ Y ) + δ2 bβ 1w + δ

n X bβ Y 1w + δ

n bβ 1X z, (9)

where zi = |X i bβ Y |, wi = |bβ|i.

Then, we take the dot product of both sides of equation (9) with bβ β and could have the following:

1 n X(bβ β ) 2 2

n(X(bβ β )) ϵ δ2 bβ 1w (bβ β ) δ

n X bβ Y 1w (bβ β ) δ

n bβ 1(X(bβ β )) z,

n(X(bβ β )) ϵ δ2 bβ 2 1 + δ2 bβ 1w β δ

n X bβ Y 1 bβ 1 + δ

n X bβ Y 1w β

n bβ 1 X bβ Y 1 δ

2n ϵ 1 bβ β 1 δ2 bβ 2 1 + δ2 bβ 1 β 1 δ

n X bβ Y 1 bβ 1 + δ

n X bβ Y 1 β 1

n bβ 1 X bβ Y 1 + δ

n bβ 1( ϵ 1 2 X bβ Y 1) + δ

n X bβ Y 1 β 1 + δ

2n ϵ 1( bβ 1 + β 1)

δ2 bβ 2 1 + δ2 bβ 1 β 1

3 X(bβ β ) 1 2

n X bβ Y 1 β 1 + δ

2n ϵ 1( bβ 1 + β 1)

δ2 bβ 2 1 + δ2 bβ 1 β 1

(d) δ n X(bβ β ) 2(5

3 bβ 1 + β 1) + δ

2 β 1) δ2 bβ 2 1 + δ2 bβ 1 β 1,

(10) where (a) comes from the Hölder s inequality, i.e.,

(X(bβ β )) ϵ bβ β 1 X ϵ , w β w β 1 = β 1, ϵ z ϵ 1 z = ϵ 1,

and the condition 2 X ϵ / ϵ 1 δ, (b) comes from bβ β 1 bβ 1 + β 1, (c) comes from the relationship that

2 X bβ Y 1 5

3 X bβ Y 1 5

3 X(bβ β ) 1 + 5

and (d) comes from the inequality that X(bβ β ) 1 n X(bβ β ) 2 and X bβ Y 1 X(bβ β ) 1 + ϵ 1.

One may observe that (10) is a second-order inequality of variable X(bβ β ) 2/ n, then the associate discriminant should be equal to or larger than 0, resulting in

1 9δ2( 11 bβ 2 1 + 66 β 1 bβ 1 + 9 β 2 1) + δ

3n ϵ 1( 2 bβ 1 + 18 β 1) 0,

from which we could conclude that at least one of two terms should be equal or larger than 0, implying

bβ 1 9 β 1.

Then, we proceed to prove Theorem 2.3.

Proof. We write the objective function in (2) in the matrix norm and then have that

1 2n Y X bβ 2 2+1

2δ2 bβ 2 1+ δ

n Y X bβ 1 bβ 1 1

2n Y Xβ 2 2+1

2δ2 β 2 1+ δ

n Y Xβ 1 β 1. (11) It follows from Y = Xβ + ϵ, i.e., Assumption 1.2, that

Y Xβ 2 2 = ϵ 2 2,

Y X bβ 2 2 = X(bβ β ) ϵ 2 2 = X(bβ β ) 2 2 + ϵ 2 2 2ϵ X(bβ β ),

Y X bβ 1 = X(bβ β ) ϵ 1 ϵ 1 X(bβ β ) 1. In this way, the inequality (11) could be reformulated as

1 2n X(bβ β ) 2 2

nϵ X(bβ β ) + δ

n X(bβ β ) 1 bβ 1 + δ

n ϵ 1 β 1 bβ 1 + 1

2δ2 β 2 1 1

n X ϵ bβ β 1 + δ

n X(bβ β ) 1 bβ 1 + δ

n ϵ 1 β 1 bβ 1 + 1

2δ2 β 2 1 1

(b) δ 2n ϵ 1 bβ β 1 + δ

n X(bβ β ) 1 bβ 1 + δ

n ϵ 1 β 1 bβ 1 + 1

2δ2 β 2 1 1

(c) δ 2n ϵ 1 bβ β 1 + δ n X(bβ β ) 2 bβ 1 + δ

n ϵ 1 β 1 bβ 1 + 1

2δ2 β 2 1 1

2δ2 bβ 2 1,

(12) where (a) comes from the Hölder s inequality, (b) comes from the condition 2 X ϵ / ϵ 1 δ, (c) comes from X(bβ β ) 1 n X(bβ β ) 2.

Then, we begin to give the upper bound of the prediction error. Two cases should be discussed.

First Case: bβ β 1 + 2 β 1 2 bβ 1 0. (13)

In this case, we have that

β 1 bβ 1 = bβ 1 β 1 + 2 β 1 2 bβ 1 bβ β 1 + 2 β 1 2 bβ 1 0. (14)

Then, it follows from (12) that

1 2n X(bβ β ) 2 2

2n ϵ 1 bβ β 1 + δ n X(bβ β ) 2 bβ 1 + δ

n ϵ 1 β 1 bβ 1 + 1

2δ2 β 2 1 1

2n ϵ 1 bβ β 1 + 2 β 1 2 bβ 1 + δ n X(bβ β ) 2 bβ 1 + 1

2δ2 β 2 1 1

δ n X(bβ β ) 2 bβ 1,

where the last inequality comes from (13) and (14).

Lemma A.1 indicates that 1

2n X(bβ β ) 2

which is equivalent to

1 2n X(bβ β ) 2 2 162δ2 β 2 1 162δ2|S| β 2 2 162δ2s β 2 2. (15)

Second Case: bβ β 1 + 2 β 1 2 bβ 1 0. (16)

If we let bv = bβ β , then we have that

β bβ 1 = bv 1 = bv S 1 + bv Sc 1,

β 1 bβ 1 = β S 1 ( β S + bv S 1 + bv Sc 1) bv S 1 bv Sc 1.

The inequality (16) indicates that

3 bv S 1 bv Sc 1 bβ β 1 + 2 β 1 2 bβ 1 0, (17)

implying bv Sc 1 3 bv S 1.

In this way, the RE(s, 3) condition can be applied.

In addition, it follows the inequality (12) that

1 2n X(bβ β ) 2 2

2n ϵ 1 bβ β 1 + δ n X(bβ β ) 2 bβ 1 + δ

n ϵ 1 β 1 bβ 1 + 1

2δ2 β 2 1 1

(a) δ 2n ϵ 1 bβ β 1 + 1

4n X(bβ β ) 2 2 + δ2 bβ 2 1 + δ

n ϵ 1 β 1 bβ 1 + 1

2δ2 β 2 1 1

2n ϵ 1 bβ β 1 + 2 β 1 2 bβ 1 + 1

4n X(bβ β ) 2 2 + 1

2δ2 β 2 1 + 1

2n ϵ 1 bv S 1 + 1

4n X(bβ β ) 2 2 + 41δ2 β 2 1,

where (a) comes from the inequality

1 4n X(bβ β ) 2 2 + δ2 bβ 2 1 δ n X(bβ β ) 2 bβ 1,

(b) comes from (17) and Lemma A.1.

Further, we have that

1 4n X(bβ β ) 2 2 3δ

|S| ϵ 1 bv 2 + 41δ2|S| β 2 2

s γ(s, 3) ϵ 1

n X(bβ β ) 2 + 41δ2s β 2 2, (18)

where the first inequality comes from bv S 1 p

|S| bv 2 and β 1 p

|S| β 2, and the last inequality comes from the RE(s, 3) condition and |S| s.

Then, if we solve the inequality (18), we could have that

1 n X(bβ β ) 2 δ s γ(s, 3)

2 + 164γ2(s, 3) β 2 2

which is equivalent to

1 n X(bβ β ) 2 (1 +

γ(s, 3) max 3 ϵ 1

164γ(s, 3) β 2

indicating that

1 2n X(bβ β ) 2 2 3 δ2s γ2(s, 3) max

2 , 164γ2(s, 3) β 2 2

Combining (15) and (19), we have that

1 2n X(bβ β ) 2 2 3δ2s max

( 9 γ2(s, 3)

2 , 164 β 2 2

B Proof of Corollary 2.4

Proof. To give the high probability result, we should analyze the bound of the term X ϵ /n and ϵ 1/n. The associated arguments are discussed in the following two parts, respectively. (Note that we assume ϵ has i.i.d. Gaussian entries. However, our high-probability result can be extended to sub-Gaussian cases, as sub-Gaussian variables share similar tail decay behavior.)

Part I: We focus on the tail bound of X ϵ /n. Since the design matrix X is normalized, the random variable x j ϵ/n is stochastically dominated by N(0, σ2/n). As shown in Theorem 11.1 in [11], it follows from the Gaussian tail bound and the union bound that

n t 2 exp nt2

2σ2 + log p .

In this way, with a probability greater than 1 2/p, the following holds

Part II: We focus on the concentration inequality of ϵ 1/n. It follows from general Hoeffding s inequality [26] that

2 exp C2 t2

2 exp C2 nt2

where C2 is some positive constant. By choosing t = σ/10, we have that

1 2 exp ( C1n) ,

where C1 is some positive constant. This is to say, with a probability greater than 1 2 exp ( C1n), we have that

Suppose we let

with a probability greater than 1 2 exp ( C1n) 2/p, we have that 2 X ϵ

It follows from Theorem 2.3 that 1 2n2 X(bβ β ) 2 2 192s log p

( 9 γ2(s, 3)

2 , 164 β 2 2

holds with a probability greater 1 2 exp ( C1n) 2/p.

C Proof of Corollary 2.6

Proof. We focus on the tail bound of ϵ 1/n. We apply the Chernoff bound to give the tail bound of ϵ 1, and we have that P ( ϵ 1 t) inf s>0 M(s) exp( ts),

where M(s) is the moment-generating function of ϵ 1. We also could obtain that

Mi(s) = E[exp(s|ϵi|)] 2E[exp(sϵi)] = 2 exp σ2s2

M(s) 2n exp nσ2s2

Then, we have that

P ( ϵ 1 t) inf s>0 2n exp nσ2s2

n ϵ 1 t exp nt2

2σ2 + n log 2 .

In this way, with a probability greater 1 exp( n), the following holds:

2 log 2 + 2σ 2σ.

Since we have that 1

6γ(s, 3)R σ,

holds with a probability greater 1 exp( n). Due to Corollary 2.4, we have the following

1 2n2 X(bβ β ) 2 2 192s log p

holds with a probability greater 1 2/p 2 exp( C1n) exp( n).

D Proof of Proposition 3.2

It follows from Proposition 1 in [21] that

e R(δ) = min β 1 n

|(Xi + ) β Yi| + δ βω 1 2 ,

where denotes the dual norm of (2, )-norm of βω. Due to Proposition 3.1, we conclude that (8) holds.

E Proof of Theorem 3.4

We first give the upper bound of the eβω 1 2,1 in terms of β ω 1 2,1.

Lemma E.1 (Upper Bound of eβω 1 2,1). Under conditions stated in Theorem 3.4, we have that

eβω 1 2,1 9 β ω 1 2,1.

Proof. The proof of this lemma follows a similar approach to the proof of Lemma A.1. It follows from the first-order condition of the dual formulation (8) of the group adversarial training problem that 0 = 1

n X (X eβ Y ) + δ2 eβω 1 2,1t + δ

n X eβ Y 1t + δ

n eβω 1 2,1X z, (20)

where zi = |X i eβ Y |,

t = eβω 1 2,1, tl = 1

Notice we have that

1 ωl eβl 2 = eβω 1 2,1, (21)

and it follows from the Hölder s inequality that

(eβl) βl eβl 2

1 ωl βl 2 = β ω 1 2,1. (22)

Also, we have the following from the Hölder s inequality:

(X(eβ β )) ϵ

l=1 (X ϵ)l 2 eβl βl 2 1

1 ωl eβl βl 2 = 1

2δ ϵ 1 (eβl βl )ω 1 2,1,

(23) where the second inequality comes from the condition 2 X ϵ l 2/ ϵ 1 δ/ωl.

Then, we take the dot product of both sides of equation (20) with eβ β and could have the following: 1 n X(eβ β ) 2 2

n(X(eβ β )) ϵ δ2 eβω 1 2,1t (eβ β ) δ

n X eβ Y 1t (eβ β )

n eβω 1 2,1(X(eβ β )) z,

n(X(eβ β )) ϵ δ2 eβω 1 2 2,1 + δ2 eβω 1 2,1t β δ

n X eβ Y eβω 1 2,1

n X eβ Y 1t β δ

n eβω 1 2,1 X eβ Y 1 + δ

n eβω 1 2,1ϵ z

2n ϵ 1 (eβl βl )ω 1 2,1 δ2 eβω 1 2 2,1 + δ2 eβω 1 2,1 β ω 1 2,1 δ

n X eβ Y 1 eβω 1 2,1

n X eβ Y 1 β ω 1 2,1 δ

n eβω 1 2,1 X eβ Y 1 + δ

n eβω 1 2,1ϵ z

n eβω 1 2,1( ϵ 1 2 X eβ Y 1) + δ

n X eβ Y 1 β ω 1 2,1

2n ϵ 1( eβω 1 2,1 + β ω 1 2,1) δ2 eβω 1 2 2,1 + δ2 eβω 1 2,1 β ω 1 2,1

n eβω 1 2,1(5

3 X(eβ β ) 1 2

n X eβ Y 1 β ω 1 2,1

2n ϵ 1( eβω 1 2,1 + β ω 1 2,1) δ2 eβω 1 2 2,1 + δ2 eβω 1 2,1 β ω 1 2,1

(e) δ n X(eβ β ) 2(5

3 eβω 1 2,1 + β ω 1 2,1) + δ

6 eβω 1 2,1 + 3

2 β ω 1 2,1)

δ2 eβω 1 2 2,1 + δ2 eβω 1 2,1 β ω 1 2,1, (24) where (a) comes from (21) and (22), (b) comes from (23), (c) comes from (eβl βl )ω 1 2,1 eβω 1 2,1 + β ω 1 2,1, (d) comes from 2 X eβ Y 1 5

3 X eβ Y 1 5

3 X(eβ β ) 1 + 5 3 ϵ 1, (e) comes from the inequality that X(eβ β ) 1 n X(eβ β ) 2 and X eβ Y 1 X(eβ β ) 1 + ϵ 1.

Since the inequality (24) is a second-order inequality of variable X(eβ β ) 2/ n, then the associate discriminant should be equal to or larger than 0, resulting in 1 9δ2( 11 eβω 1 2 2,1 + 66 β ω 1 2,1 eβω 1 2,1 + 9 β ω 1 2 2,1)

3n ϵ 1( 2 eβω 1 2,1 + 18 β ω 1 2,1) 0,

from which we could conclude that eβω 1 2,1 9 β ω 1 2,1.

Then, we proceed to prove Theorem 3.4.

Proof. We write the objective function in (8) in the matrix norm and then have that 1 2n y X eβ 2 1 + 1

2δ2 eβω 1 2 2,1 + δ

n y X eβ eβω 1 2,1

2n y Xβ 2 1 + 1

2δ2 β ω 1 2 2,1 + δ

n y Xβ 1 β ω 1 2,1. (25)

In this way, we have the following reformulation: 1 2n X(eβ β ) 2 2

nϵ X(eβ β ) + δ

n X(eβ β ) 1 eβω 1 2,1 + δ

n ϵ 1 β ω 1 2,1 eβω 1 2,1

2δ2 β ω 1 2 2,1 1

2δ2 eβω 1 2 2,1

2n ϵ 1 (eβ β )ω 1 2,1 + δ n X(eβ β ) 2 eβω 1 2,1 + δ

n ϵ 1 β ω 1 2,1 eβω 1 2,1

2δ2 β ω 1 2 2,1 1

2δ2 eβω 1 2 2,1, (26) where the last inequality comes from (23) and X(eβ β ) 1 n X(eβ β ) 2.

Similar to the proof of Theorem 2.3, we begin to give the upper bound of the prediction error. Two cases should be discussed.

First Case: (eβ β )ω 1 2,1 + 2 β ω 1 2,1 2 eβω 1 2,1 0. (27)

In this case, we have that

β ω 1 2,1 eβω 1 2,1 (eβ β )ω 1 2,1 + 2 β ω 1 2,1 2 eβω 1 2,1 0. (28) It follow from (26) that 1 2n X(eβ β ) 2 2

2n ϵ 1 (eβ β )ω 1 2,1 + δ n X(eβ β ) 2 eβω 1 2,1 + δ

n ϵ 1 β ω 1 2,1 eβω 1 2,1

2δ2 β ω 1 2 2,1 1

2δ2 eβω 1 2 2,1,

= δ n X(eβ β ) 2 eβω 1 2,1 + δ

2n ϵ 1 (eβ β )ω 1 2,1 + 2 β ω 1 2,1 2 eβω 1 2,1

2δ2 β ω 1 2 2,1 1

2δ2 eβω 1 2 2,1,

δ n X(eβ β ) 2 eβω 1 2,1,

where the last inequality comes from (27) and (28).

Notice we have that

β ω 1 2,1 = X

1 ω2 l β 2. (30)

Lemma E.1, (29) and (30) indicate that

1 2n2 X(bβ β ) 2 2 162δ2 β ω 1 2 2,1 162δ2 X

1 ω2 l β 2 2. (31)

Second Case: (eβ β )ω 1 2,1 + 2 β ω 1 2,1 2 eβω 1 2,1 0.

Notice we have that

(eβ β )ω 1 2,1 =

1 ωl eβl βl 2 = X

1 ωl (eβ β )l 2 + X

1 ωl (eβ β )l 2,

β ω 1 2,1 eβω 1 2,1 = X

1 ωl eβl 2 + X

1 ωl (eβ β )l 2

1 ωl (eβ β )l 2 X

1 ωl (eβ β )l 2.

If we let ev = eβ β ,then we have that

1 ωl evl 2 X

1 ωl evl 2 (eβ β )ω 1 2,1 + 2 β ω 1 2,1 2 eβω 1 2,1 0, (32)

indicating X

1 ωl evl 2 3 X

1 ωl evl 2.

In this way, the GRE(g, 3) condition can be applied.

We also have the following from (26)

1 2n X(eβ β ) 2 2

n X(eβ β ) 1 eβω 1 2,1 + δ

2n ϵ 1 β ω 1 2,1 eβω 1 2,1 + 2 (eβ β )ω 1 2,1

2δ2 β ω 1 2 2,1 1

2δ2 eβω 1 2 2,1

(a) 1 4n X(eβ β ) 2 2 + 3δ

1 ωl evl 2 + 1

2δ2 β ω 1 2 2,1 + 1

2δ2 eβω 1 2 2,1,

4n X(eβ β ) 2 2 + 3δ

1 ω2 l ev GJ 2 + 1

2δ2 β ω 1 2 2,1 + 1

2δ2 eβω 1 2 2,1

where (a) comes from (32), and (b) comes from

1 ω2 l ev GJ 2.

It follows from GRE(g, 3) and Lemma E.1 that

1 4n X(eβ β ) 2 2 3δ 2 n

1 κ(g, 3) ϵ 1

n X(eβ β ) 2 + 41δ2 X

1 ω2 l β 2 2.

Then, we could have that

1 2n X(eβ β ) 2 2 3 δ2

2 , 164κ2(g, 3) β 2 2

Combining (31) and (33), we have that

1 2n X(eβ β ) 2 2 3δ2 X

( 9 κ2(g, 3)

2 , 164 β 2 2

F Proof of Corollary 3.5

It follows from Lemma 3.1 in [17] that 2 n (X ϵ)l 2 2σ n

tr(Ψl) + 2|||Ψl|||(4 log L + p

holds with a probability greater than 1 2/L, tr(Ψl) denotes the trace of Ψl, and |||Ψl||| denotes the maximum eigenvalue of Ψl. Since Ψl = Ipl pl, we have that |||Ψl||| = 1 and tr(Ψl) = pl. Consequently, we have that 2 n (X ϵ)l 2 2σ n

pl + 2(4 log L + p

2pl log L) 2σ n

3pl + 9 log L,

holds with a probability greater than 1 2/L.

Also, it follows from the proof of Corollary 2.4 that

holds with a probability greater 1 2 exp ( C1n). Suppose we let

3pl + 9 log L

we have that 2 (X ϵ)l 2

holds with a probability greater than 1 2 exp ( C1n) 2/L.

It follows from Theorem 3.4 that

1 2n X(eβ β ) 2 2 48δ2 X

( 9 κ2(g, 3)

2 , 164 β 2 2

3pl + 9 log L

( 9 κ2(s, 3)

2 , 164 β 2 2

= 432|GJ| + g log L

( 9 κ2(s, 3)

2 , 164 β 2 2

G Proof of Corollary 3.7

Corollary 3.7 is straightforward due to Corollary 3.5 and the upper bound arguments in the proof in Corollary 2.6.

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope?

Answer: [Yes]

Justification: The main claim is that the convergence rate of the prediction error of adversarial training estimator under ℓ -perturbation can achieve the minimax rate up to a logarithmic factor in the high-dimensional linear regression on the class of sparse parameters. We also discuss the group adversarial training. These contributions, including important assumptions, are clearly stated in the abstract and introduction.

Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: We have mentioned the limitations and future work in Section 5.

Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: All theoretical results in this paper are written mathematically rigorous with full set of assumptions, and all associated proofs can be seen in the Appendix. Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We have mainly made theoretical contributions. However, some numerical experiments are also provided. Full experimental settings have been stated in the paper. Our results can be reproduced from the submitted code in the Supplementary Materials. Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in

some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Our numerical experiments are based on synthetic data. The code has been submitted in the Supplementary Materials. Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: All settings are specified in Section 4. Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: The paper has reported the error bars. Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: Our numerical experiments do not require any workers, and our numerical experiments compare prediction errors achieved by different approaches instead of computation speed or storage. Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: The research conforms with the Neur IPS Code of Ethics. Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: [NA] Guidelines:

The answer NA means that there is no societal impact of the work performed.

If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: [NA]

Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification:[NA]

Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.

If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: [NA]

Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: [NA]

Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: [NA]

Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.