# Generalization Bounds for (Wasserstein) Robust Optimization

Yang An, Department of Mathematics, Columbia University, yangan@math.columbia.edu
Rui Gao, Department of IROM, University of Texas at Austin, rui.gao@mccombs.utexas.edu

*35th Conference on Neural Information Processing Systems (NeurIPS 2021).*

**Abstract.** (Distributionally) robust optimization has recently gained momentum in the machine learning community, owing to its promising applications in developing generalizable learning paradigms. In this paper, we derive generalization bounds for robust optimization and Wasserstein robust optimization with Lipschitz and piecewise Hölder smooth loss functions, under both stochastic and adversarial settings, assuming that the underlying data distribution satisfies transportation-information inequalities. The proofs are built on new generalization bounds for variation regularization (such as Lipschitz or gradient regularization) and its connection with robustness.

## 1 Introduction

Consider the stochastic optimization problem of seeking a function $f : \mathcal{X} \to \mathbb{R}$ from a hypothesis space $\mathcal{F}$ so as to minimize the expected loss
$$L^{\mathrm{true}}(f) := \mathbb{E}_{X \sim P^{\mathrm{true}}}[f(X)].$$
Here the data $X$ follows some underlying distribution $P^{\mathrm{true}}$ on $\mathcal{X}$. In practice, $P^{\mathrm{true}}$ is often replaced with its empirical estimate $P_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{\hat{x}_i}$, where $\delta$ denotes the Dirac measure:
$$\min_{f \in \mathcal{F}}\ \Big\{ L^{\mathrm{erm}}_{n}(f) := \mathbb{E}_{X \sim P_n}[f(X)] \Big\}.$$
The discrepancy between $P^{\mathrm{true}}$ and $P_n$ may result from (i) the sampling error of $P_n$ when data are drawn independently from $P^{\mathrm{true}}$; (ii) non-i.i.d. data, e.g., when the $\hat{x}_i$'s are sampled from a Markov chain with stationary distribution $P^{\mathrm{true}}$; (iii) a deployment environment that is (adversarially) different from the data-collecting environment. Such discrepancy largely contributes to the poor generalization behavior of empirical risk minimization.

To tackle this issue, robust optimization [4, 37, 38] studies
$$\min_{f\in\mathcal{F}}\ L^{\mathrm{ro}}_{n,p}(f;\rho) := \max_{\{x_i\}_{i=1}^n \in \mathcal{X}_p(\rho)}\ \frac{1}{n}\sum_{i=1}^{n} f(x_i), \quad \mathcal{X}_p(\rho) := \Big\{\{x_i\}_{i=1}^n \subset \mathcal{X} : \Big(\frac{1}{n}\sum_{i=1}^{n}\|x_i - \hat{x}_i\|^p\Big)^{1/p} \le \rho\Big\}. \tag{RO}$$
Here $p \in [1, \infty]$; the uncertainty set $\mathcal{X}_p(\rho)$ describes the data perturbations to hedge against; and the radius $\rho > 0$ specifies the level of robustness. In a similar spirit, Wasserstein robust optimization [15, 7, 19] studies
$$\min_{f\in\mathcal{F}}\ L^{\mathrm{wo}}_{n,p}(f;\rho) := \max_{P \in \mathcal{P}_p(\rho)}\ \mathbb{E}_{x\sim P}[f(x)], \quad \mathcal{P}_p(\rho) := \big\{P : W_p(P, P_n) \le \rho\big\}. \tag{WO}$$
Here the distributional uncertainty set $\mathcal{P}_p(\rho)$ consists of all probability distributions that are close to $P_n$ in $p$-Wasserstein distance. When $p = \infty$, (WO) and (RO) coincide. Both problems have received increasing attention in machine learning; see [25] for a survey.

When data are generated stochastically, such as i.i.d. or Markovian data (see (i)(ii) above), we would like to choose the radius $\rho = \rho_n$ adaptively to the sample size, so that the robust loss balances performance and robustness. More precisely, $\rho_n$ should be large enough that the robust loss dominates the true loss with high probability, but not so large as to produce an inferior solution. When the testing environment is adversarially different from the training environment (see (iii) above), consider the adversarial loss minimization
$$\min_{f\in\mathcal{F}}\ L^{\mathrm{adv}}_{p}(f;\rho) := \max_{P :\, W_p(P,\, P^{\mathrm{true}}) \le \rho}\ \mathbb{E}_{x\sim P}[f(x)], \tag{Adv}$$
where the radius $\rho$ captures the difference between the training and testing environments. When $P^{\mathrm{true}}$ is not available, a surrogate problem (WO) based on $P_n$ is often considered. In this case, $\rho$ is often fixed, and we hope that the robust solution obtained from (WO) generalizes well.
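To make the two paradigms concrete, the following minimal sketch (ours, not from the paper; the data and constants are illustrative assumptions) contrasts the empirical and robust losses for the absolute-deviation linear loss $f_\theta(x,y) = |\theta^\top x - y|$. For $p = \infty$, where (RO) and (WO) coincide, the per-sample inner maximum over the joint perturbation ball $\|(\delta_x, \delta_y)\|_2 \le \rho$ admits the closed form $|\theta^\top x - y| + \rho\|(\theta, -1)\|_2$, so the robust loss is exactly the empirical loss plus $\rho$ times the Lipschitz norm.

```python
import numpy as np

# Sketch: ERM loss vs. the (RO)/(WO) loss with p = infinity for the linear
# absolute-deviation loss. All data below are simulated for illustration.
rng = np.random.default_rng(0)
n, d, rho = 200, 5, 0.1
theta = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta + 0.1 * rng.normal(size=n)

residuals = np.abs(X @ theta - y)
L_erm = residuals.mean()                          # L^erm_n(f_theta)
lip = np.linalg.norm(np.append(theta, -1.0))      # Lipschitz norm of f_theta in (x, y)
L_ro = L_erm + rho * lip                          # closed-form L^ro_{n,inf}(f_theta; rho)
print(f"L_erm = {L_erm:.4f}, L_ro = {L_ro:.4f}")
```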
The goal of this paper is to provide a comprehensive analysis of the generalization capability of the above robust learning paradigms. Specifically, in the stochastic setting, under a proper choice of the radius $\rho_n$, we derive finite-sample guarantees for (RO) and (WO). In the adversarial setting, we derive generalization bounds for (Adv) with a small fixed radius $\rho$.

### Related Literature

**Stochastic setting.** The work most related to ours is [17], but we differ in several important ways. First, [17] studies only (WO), while we also consider (RO). Second, [17] studies i.i.d. data, while we also consider non-i.i.d. data by exploiting a different notion of transport inequality. Third, even in the i.i.d. setting, [17] studies only smooth losses and Wasserstein orders $p = 1, 2$. We consider non-smooth losses and $p \in [1, 2]$, which includes important cases such as $|\theta^\top x - y|^p$. Moreover, for $p = 1$, we tighten the order of the remainder; for $p = 2$, we obtain a cleaner expression using a different set of assumptions. All these extensions require an in-depth analysis of worst-case scenarios.

For (WO) with i.i.d. data, [15] originally proposed to choose the radius $\rho_n$ such that $P^{\mathrm{true}}$ is contained in the uncertainty set with high probability, but the resulting performance bound is overly conservative and suffers from the curse of dimensionality. This issue has been mitigated asymptotically for smooth loss functions [6, 8, 33], and non-asymptotically for linear models [32, 12] and for smooth loss functions [17]. As mentioned above, our results apply to non-i.i.d. data and non-smooth losses.

For (RO) in the stochastic setting, its generalization properties were first studied in the pioneering work [39], which established generalization bounds via algorithmic robustness for various learning tasks with i.i.d. and Markovian data. Nonetheless, many bounds developed therein suffer from the curse of dimensionality. This framework was later extended to metric learning [3] and ensemble models [41]. Our results circumvent the curse of dimensionality of the bounds developed in [39].

**Adversarial setting.** For $p \in [1, \infty)$, generalization properties of (WO) with a fixed radius were first considered in [27]. The main difference lies in the magnitude of data perturbation considered. Their motivation comes from domain adaptation, and their bound is mostly useful when the Wasserstein radius is not extremely small; indeed, it suggests that the generalization gap becomes smaller as the radius grows to infinity (see their Remark 4). We instead focus on adversarial robustness, where the radius is typically a tiny number, and our results in Section 5 show that the generalization gap is small when the radius is small. The bounds developed in [34, 30] do not depend on the Wasserstein radius, while our result indicates that when the radius is small, the remainder scales linearly in the radius and is therefore tighter. For $p = \infty$, [24, 40, 1] focus on deriving generalization bounds for specific loss function classes such as linear hypotheses and neural networks. Our results apply to generic function classes and, most importantly, show that the generalization gap is controlled not only by the complexity of the loss function class, but also by the complexity of the class of gradient-norm functions of the losses. The intuition is that the gradient norm controls the robustness of the model subject to adversarial perturbations. As far as we know, a bound of this form is new in the literature.
There are also earlier works investigating the fundamental limits of adversarial learning and its difference from empirical risk minimization, such as [31, 35, 10, 11]. They consider somewhat different settings than ours, usually involving a stylized data-generating model that facilitates theoretical findings.

**Regularization.** Part of our analysis is based on the connection between robustness and regularization. The connection between (RO)/(WO) and norm regularization was established for various machine learning tasks [38, 37, 5, 6, 12], and the connection between (WO) and Lipschitz/gradient regularization has been established for smooth loss functions [32, 34, 16, 13, 8, 2] and non-smooth loss functions [18]. Our results improve the sandwich bounds in [18] for $p = 1$, and generalize the results therein to piecewise Hölder smooth loss functions for $p \in (1, \infty]$. Finally, there are also works connecting regularization and distributional robustness under other choices of distance [26, 21, 14, 23].

## 2 Preliminaries

In this section, we prepare several technical tools that are useful in deriving our main results.

**Notations.** For $p \in [1, \infty]$, we denote by $q$ its Hölder conjugate, i.e., $\frac{1}{p} + \frac{1}{q} = 1$. Let $\mathcal{P}(\mathcal{X})$ be the set of Borel probability distributions on $\mathcal{X} = (\mathbb{R}^d, \|\cdot\|)$, and let $\mathcal{P}_p(\mathcal{X}) = \{P \in \mathcal{P}(\mathcal{X}) : \mathbb{E}_P[\|x\|^p] < \infty\}$. For a given probability measure $Q$, denote by $(L^p_Q(\mathcal{X}), \|\cdot\|_{p,Q})$ the set of $L^p$ functions on $\mathcal{X}$ with respect to $Q$. Unless stated otherwise, we do not assume the samples are i.i.d. Instead, we use $S_n = \{\hat{x}_i\}_{i=1}^n$ to represent a sample of size $n$, denote by $\mu_{S_n}$ the sampling distribution, use $\mathbb{E}_{S_n}$ for the expectation with respect to $\mu_{S_n}$, and set $L^{\mathrm{true}}(f) = \mathbb{E}_{S_n}\big[\frac{1}{n}\sum_{i=1}^{n} f(\hat{x}_i)\big]$. For a fixed set of $n$ points, we denote by $P_n$ the empirical uniform distribution on these points. For two real numbers $a$ and $b$, we write $a \wedge b = \min(a, b)$ and $a \vee b = \max(a, b)$.

**Transport inequalities.** In the stochastic setting, we work with distributions satisfying the following condition. Let $p \in [1, \infty)$ and $\lambda > 0$. A distribution $Q \in \mathcal{P}_p(\mathcal{X})$ satisfies a transport inequality, denoted $T_p(\lambda)$, if for all $P \in \mathcal{P}_p(\mathcal{X})$ it holds that
$$W_p(P, Q) \le \sqrt{\lambda\, H(P \mid Q)}, \qquad H(P \mid Q) := \int \log\frac{dP}{dQ}\, dP,$$
where $dP/dQ$ denotes the Radon-Nikodym derivative, so that $H(P \mid Q)$ is the relative entropy between $P$ and $Q$. The transportation-information inequality $T_p(\lambda)$ covers a variety of useful classes of distributions. For example, if $\mathcal{X}$ is bounded by $B$, then any distribution on $\mathcal{X}$ satisfies $T_1(8B^2)$. Gaussian distributions with covariance $\Sigma$ satisfy $T_2(2\lambda_{\max}(\Sigma))$. Interested readers are referred to [22] for a recent survey.

**Variation regularization.** Following [18], we define the local and global slope of a continuous function $f : \mathcal{X} \to \mathbb{R}$ at $x \in \mathcal{X}$, denoted $|\partial f|(x)$ and $G(f)(x)$ respectively, as
$$|\partial f|(x) = \limsup_{\tilde{x} \to x} \frac{(f(\tilde{x}) - f(x))_+}{\|\tilde{x} - x\|}, \qquad G(f)(x) = \sup_{\tilde{x} \neq x} \frac{(f(\tilde{x}) - f(x))_+}{\|\tilde{x} - x\|}.$$
When $f$ is differentiable at $x$, we have $|\partial f|(x) = \|\nabla f(x)\|_*$, the dual norm of the gradient. When $f$ is $L$-Lipschitz, we have $G(f)(x) \le L$ for all $x \in \mathcal{X}$ and $\|G(f)\|_{\infty, P^{\mathrm{true}}} = \|f\|_{\mathrm{Lip}}$. For a function $f$, denote by $\|f\|_{\mathrm{Lip}}$ its Lipschitz norm whenever it exists, and by $\||\partial f|\|_q$ the $L^q(P^{\mathrm{true}})$-norm of its slope function. We define variation regularization as
$$\min_{f\in\mathcal{F}}\ L^{\mathrm{vr}}_{n,q}(f;\rho) := \begin{cases} \dfrac{1}{n}\displaystyle\sum_{i=1}^{n} f(\hat{x}_i) + \rho\,\||\partial f|\|_{q,P_n}, & q < \infty,\\[2mm] \dfrac{1}{n}\displaystyle\sum_{i=1}^{n} f(\hat{x}_i) + \rho\,\|f\|_{\mathrm{Lip}}, & q = \infty. \end{cases} \tag{VR}$$
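As an illustration (our sketch, with hypothetical data; not code from the paper), the (VR) objective for $q < \infty$ can be evaluated directly when each $f$ is differentiable $P_n$-almost surely, since the slope then reduces to the dual norm of the gradient. We use the loss $f_\theta(x,y) = |\theta^\top x - y|^p$ from Example 1 below, whose slope at a point with nonzero residual is $p|\theta^\top x - y|^{p-1}\|(\theta,-1)\|_2$ under the Euclidean norm (which is self-dual).

```python
import numpy as np

def vr_objective(theta, X, y, rho, p):
    """Sketch of the (VR) objective for p > 1 (so q = p/(p-1) < infinity)."""
    q = p / (p - 1.0)                              # Hoelder conjugate of p
    res = X @ theta - y
    loss = np.mean(np.abs(res) ** p)               # (1/n) sum_i f(x_hat_i)
    # slope |∂f|(x_i, y_i) = p |theta^T x - y|^(p-1) * ||(theta, -1)||_2:
    slope = p * np.abs(res) ** (p - 1) * np.linalg.norm(np.append(theta, -1.0))
    penalty = np.mean(slope ** q) ** (1.0 / q)     # || |∂f| ||_{q, P_n}
    return loss + rho * penalty
```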
We introduce the following class of functions, consisting of the slope functions of the losses:
$$\mathcal{N}_q := \big\{ |\partial f|^q : f \in \mathcal{F} \big\}. \tag{1}$$
Throughout we use the following assumptions, which cover most loss functions that are amenable to efficient optimization by first-order methods.

**Assumption 1.** Every $f \in \mathcal{F}$ is $L$-Lipschitz.

**Assumption 2.** Every $f \in \mathcal{F}$ is piecewise differentiable. There exist $h, \gamma > 0$ such that for every $f \in \mathcal{F}$ and every $x, \tilde{x}$ in the same piece of $f$, it holds that $\|\nabla f(\tilde{x}) - \nabla f(x)\|_* \le h\|\tilde{x} - x\|^{\gamma}$.

## 3 Robustness and Variation Regularization

In this section, we establish the connection between robust optimization (RO), (WO) and variation regularization (VR), which serves as a building block for the results in Section 5. We make the following assumption, which states that the loss function family has sufficient variation.

**Assumption 3.** $\bar\delta := \inf_{f\in\mathcal{F}} \||\partial f|\|_{q,P^{\mathrm{true}}} > 0$. We also define $\bar\delta_n := \inf_{f\in\mathcal{F}} \||\partial f|\|_{q,P_n}$.

To simplify notation, we introduce a function $e_n : \mathbb{R}_+ \to \mathbb{R}_+$, defined as
$$e_n(\varrho) := \sup_{f\in\mathcal{F}}\ \mathbb{E}_{X\sim P_n}\big[(\varrho - d(X, D_f))_+\big], \tag{2}$$
where $D_f$ denotes the set of non-differentiable points of $f$; it describes the concentration of data within a margin of the non-differentiable points.

**Remark 1.** $\bar\delta_n$ can be ensured to be positive with high probability using Assumptions 1 and 3 (see Lemma 5 in the supplementary). $e_n(\varrho)$ vanishes for differentiable families, and is often $O_p(1/n)$ when $\varrho = \rho_n = O(1/\sqrt{n})$ (cf. [18]).

Our first result in this section bounds the difference between (WO) and (VR) for $p > 1$.

**Proposition 1.** Let $p \in (1, \infty]$. Assume Assumptions 1, 2 and 3 hold. Then it holds that
$$\big|L^{\mathrm{wo}}_{n,p}(f;\rho) - L^{\mathrm{vr}}_{n,q}(f;\rho)\big| \le C_0\,\rho^{(\gamma+1)\wedge p} + 2L\,e_n(2\rho), \qquad \forall f \in \mathcal{F},$$
where $C_0$ is an explicit constant depending only on $h$, $L$, $p$, $\gamma$, and $\bar\delta_n$; in particular, $C_0 = h$ when $p = \infty$.

Proposition 1 shows that (WO) and (VR) are close, with a gap controlled by two higher-order terms: the term $\rho^{(\gamma+1)\wedge p}$ depends on the Hölder smoothness parameter $\gamma$, while the term $e_n$ depends on the data distribution around the non-differentiable region. The smoother the loss function, the smaller the gap. Intuitively, the worst-case distribution in (WO) perturbs the data approximately along the gradient ascent direction, so for smoother losses the gradient norm in (VR) provides a tighter first-order approximation of the difference between the worst-case and empirical losses. Compared to [18, Theorem 1], here we focus on Lipschitz loss functions, which yields a cleaner result, but we have to modify the proof therein by bounding the worst-case loss differently.

Our second result in this section establishes a relationship between (WO) and (RO) for $p \in (1, \infty)$, recalling that they coincide for $p = \infty$.

**Proposition 2.** Let $p \in (1, \infty)$. Assume Assumptions 1, 2 and 3 hold. Then for all $f \in \mathcal{F}$,
$$0 \le L^{\mathrm{wo}}_{n,p}(f;\rho) - L^{\mathrm{ro}}_{n,p}(f;\rho) \le G_n(\rho),$$
where $G_n(\rho)$ is an explicit gap term of the form $2L(\,\cdot\,)^{\frac{1}{p-1}}$ involving the dual minimizer $\lambda_n$ of (WO), which in turn satisfies two-sided bounds in terms of $\rho$, $h$, $L$, $p$, $\gamma$, $\bar\delta_n$, and $e_n(2\rho)$.

Proposition 2 shows that the gap between (WO) and (RO) is uniformly upper bounded. Intuitively, both robust paradigms perturb data points along the gradient ascent direction. The difference is that (WO) is allowed to move a tiny fraction of probability mass arbitrarily far away, whereas (RO) prohibits probability splitting. The proof is based on a careful analysis of the probabilistic structure of the worst-case distribution. The first result of this kind was established in [19] for a single loss function, but the bound therein is not controlled uniformly over a family of losses. Under the conditions in Remark 1, the gap vanishes at the order $O_p(\rho/n)$.

Thus far, we have shown that when $p \in (1, \infty]$, $L^{\mathrm{vr}}_{n,q}(f;\rho_n)$, $L^{\mathrm{ro}}_{n,p}(f;\rho_n)$ and $L^{\mathrm{wo}}_{n,p}(f;\rho_n)$ differ only by a higher-order gap that depends on the smoothness of the loss function class.
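The following numerical sketch (ours; the smooth logistic-type loss and data are assumptions, not from the paper) illustrates the first-order intuition behind Proposition 1 for $p = q = 2$: moving each sample along its gradient, with step sizes allocated in proportion to the gradient norms, is feasible for (RO) and already matches the (VR) objective up to $O(\rho^2)$.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=3)

def f(x):                                   # smooth loss f(x) = log(1 + exp(w^T x))
    return np.log1p(np.exp(x @ w))

def grad_f(x):                              # gradient: sigmoid(w^T x) * w
    return (1.0 / (1.0 + np.exp(-(x @ w))))[:, None] * w

X = rng.normal(size=(500, 3))
for rho in (0.1, 0.01, 0.001):
    g = grad_f(X)
    rms = np.sqrt(np.mean(np.linalg.norm(g, axis=1) ** 2))  # = || |∂f| ||_{2,P_n}
    X_adv = X + (rho / rms) * g             # mean ||delta_i||^2 = rho^2: feasible for (RO), p = 2
    L_ro_lower = f(X_adv).mean()            # lower bound on L^ro_{n,2}(f; rho)
    L_vr = f(X).mean() + rho * rms          # (VR) objective with q = 2
    print(rho, abs(L_vr - L_ro_lower))      # gap shrinks like rho^2
```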
Next, we consider the case $p = 1$. In light of the example in [18], it is not always possible for the gap $L^{\mathrm{vr}}_{n,1}(f;\rho) - L^{\mathrm{wo}}_{n,1}(f;\rho)$ to be of higher order than $\rho$. Below we identify conditions under which this is achieved. For $f \in \mathcal{F}$, we define cumulative distribution functions $\bar{H}_f, H_f$ as
$$\bar{H}_f(a) := P^{\mathrm{true}}\big(\|f\|_{\mathrm{Lip}} - G(f)(X) \le a\big), \qquad H_f(a) := P^{\mathrm{true}}\big(\|f\|_{\mathrm{Lip}} - |\partial f|(X) \le a\big), \qquad a \ge 0,$$
where the random variable $X$ has distribution $P^{\mathrm{true}}$. They describe how the global and local slope functions distribute around their maximum, i.e., the Lipschitz norm.

**Assumption 4.** There exist $\beta, \bar{a}, c > 0$ such that for all $a \in [0, \bar{a})$ and all $f \in \mathcal{F}$, it holds that $\bar{H}_f(a^\beta) \ge ca$ for (WO), and $H_f(a^\beta) \ge ca$ for (RO).

This assumption ensures sufficient concentration on the maximal slope $\|f\|_{\mathrm{Lip}}$; a large $\beta$ indicates that $G(f)(X)$ or $|\partial f|(X)$ concentrates more on the maximal slope. Ideally, we would like a large $\beta$. For example, if $\bar{H}_f$ has a density near zero, then Assumption 4 imposes a positive lower bound on the density with $\beta = 1$. As another example, if $f$ is a linear function, then the set of global slopes is a singleton and $H_f(a) = \bar{H}_f(a) = 1$ for all $a \ge 0$, hence Assumption 4 is trivially satisfied with $\bar{a} = \beta = 1$. More generally, under the assumption that $\limsup_{\|x - \hat{x}\|\to\infty} \frac{(f(x) - f(\hat{x}))_+}{\|x - \hat{x}\|} = \|f\|_{\mathrm{Lip}}$, which is imposed by [17], the global slope satisfies $G(f)(x) = \|f\|_{\mathrm{Lip}}$ for all $x \in \mathcal{X}$, and thereby $\bar{H}_f(a) = 1$ for all $a \ge 0$.

**Proposition 3.** Let $p = 1$. Assume Assumptions 1, 2, 4 hold. Let $\delta \in [0, \frac{1}{2})$. Then for every $f \in \mathcal{F}$, it holds that
$$L^{\mathrm{ro}}_{n,1}(f;\rho) \le L^{\mathrm{wo}}_{n,1}(f;\rho) \le L^{\mathrm{vr}}_{n,1}(f;\rho),$$
$$L^{\mathrm{vr}}_{n,1}(f;\rho) - L^{\mathrm{wo}}_{n,1}(f;\rho) \le \bar\Delta_n, \qquad L^{\mathrm{vr}}_{n,1}(f;\rho) - L^{\mathrm{ro}}_{n,1}(f;\rho) \le \Delta_n,$$
where
$$\bar\Delta_n = \rho\,\bar\epsilon_n + 4hn^{-\delta}\rho^{\gamma+1} + 2L\,e_n(\rho n^{-\delta}), \qquad \Delta_n = \rho\,\epsilon_n + 4hn^{-\delta}\rho^{\gamma+1} + 2L\,e_n(\rho n^{-\delta}),$$
and $\bar\epsilon_n$ (resp. $\epsilon_n$) equals the $(\lfloor n^{1-\delta}\rfloor + 1)$-th order statistic of the sample $\{\|f\|_{\mathrm{Lip}} - G(f)(\hat{x}_i)\}_{i=1}^{n}$ (resp. $\{\|f\|_{\mathrm{Lip}} - |\partial f|(\hat{x}_i)\}_{i=1}^{n}$).

**Remark 2.** Consider data generated i.i.d., $\rho = \rho_n = O(1/\sqrt{n})$, and $f$ with piecewise Lipschitz gradient (i.e., $\gamma = 1$). As shown in the supplementary, the remainder would be $O(n^{-\frac{\beta+1}{2}+\beta\delta} + n^{-(1-2\delta)}) = O(n^{-\frac{\beta+1}{\beta+2}})$ for an appropriate choice of $\delta$. When $G(f)(X)$ has positive mass on the maximal global slope, by setting $\delta = 0$, with high probability we have $\bar\epsilon_n = 0$ and $\bar\Delta_n = O(\rho^2 + e_n(\rho))$, which is $O_p(1/n)$ under the condition in Remark 1.

Proposition 3 shows that for $p = 1$, (RO), (WO) and (VR) are equivalent up to a higher-order error term, which depends on the smoothness of the loss function, as indicated by $\gamma$ and $e_n$, and on the concentration of slopes, as indicated by $\bar\epsilon_n$ and $\epsilon_n$. The smoother the function and the more concentrated the slopes on the maximal global slope, the smaller the gap between (WO), (RO) and (VR). This improves the sandwich result of [18], in which the corresponding remainder is $O(1/\sqrt{n})$. Our finer analysis is based on constructing feasible solutions for (WO) and (RO) by perturbing the empirical data points, with careful control of the perturbation distances.
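To make the concentration term concrete, here is a small sketch (ours; the slope values are simulated, and the 1-based order-statistic indexing is our reading of Proposition 3): $\bar\epsilon_n$ is an order statistic of the gaps $\|f\|_{\mathrm{Lip}} - G(f)(\hat{x}_i)$, and it vanishes whenever sufficiently many samples attain slopes equal to the Lipschitz norm.

```python
import numpy as np

def eps_bar(slope_gaps, delta):
    """(floor(n^(1-delta)) + 1)-th smallest gap ||f||_Lip - G(f)(x_hat_i)."""
    gaps = np.sort(np.asarray(slope_gaps))                 # ascending gaps
    k = int(np.floor(len(gaps) ** (1.0 - delta)))          # floor(n^(1-delta))
    return gaps[min(k, len(gaps) - 1)]                     # 0-based: (k+1)-th smallest

rng = np.random.default_rng(3)
n = 2000
# Simulated gaps: 30% of samples attain the maximal global slope (gap 0).
gaps = np.where(rng.uniform(size=n) < 0.3, 0.0, rng.uniform(size=n))
print(eps_bar(gaps, delta=0.25))   # ~0, since n^0.75 + 1 < #(zero gaps) w.h.p.
```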
## 4 Generalization Bounds for Variation Regularization

In this section, we derive generalization bounds for variation regularization in the stochastic setting, which are the building blocks for the generalization bounds for (WO) and (RO). We make the following assumption on the sampling distribution.

**Assumption 5.** The sampling distribution $\mu_{S_n}$ satisfies $T_p(\lambda_n)$.

When $S_n$ contains i.i.d. samples from an underlying distribution $P^{\mathrm{true}}$ that satisfies $T_p(\lambda)$, then by tensorization of transport inequalities [22], $\mu_{S_n}$ satisfies $T_p(\lambda_n)$ with $\lambda_n = \lambda n^{2/p-1}$. In Example 2, we will consider Markovian data satisfying $T_1(\lambda_n)$ with $\lambda_n = O(n)$. Suppose we work with a parametric family $\mathcal{F} = \{f_\theta\}_{\theta\in\Theta}$, where the parameter space $\Theta$ is endowed with a norm $\|\cdot\|_\Theta$. We impose the following Lipschitz assumption on the parametric family.

**Assumption 6.** There exists $\Lambda > 0$ such that $|f_{\theta'}(x) - f_{\theta}(x)| \le \Lambda\|\theta' - \theta\|_\Theta$ for all $x \in \mathcal{X}$.

**Proposition 4.** Let $p = 1$. Assume Assumptions 1, 5 and 6 are in force. Let $t > 0$. Set
$$\rho_n = \sqrt{\frac{\lambda_n\big(t + \log N(\frac{1}{n}; \Theta, \|\cdot\|_\Theta)\big)}{n^2}}.$$
Then with probability at least $1 - e^{-t}$, for all $f \in \mathcal{F}$, it holds that
$$L^{\mathrm{true}}(f) \le L^{\mathrm{vr}}_{n,1}(f;\rho_n) + \frac{2\Lambda}{n}.$$

Proposition 4 shows that, with a properly chosen radius, Lipschitz regularization upper bounds the true risk $L^{\mathrm{true}}(f)$ uniformly with high probability, up to some error terms. The radius $\rho_n$ depends on the transport inequality parameter of the sampling distribution in Assumption 5 and on the complexity of the loss function class. When $\lambda_n = O(n)$ and $\log N(\frac{1}{n}; \Theta, \|\cdot\|_\Theta) = O(1)$, we have $\rho_n = O(1/\sqrt{n})$, corresponding to the canonical choice.

For $p > 1$, let us assume $P^{\mathrm{true}}$ is continuous, and define the population counterpart of $e_n(\varrho)$ in (2) as
$$e(\varrho) := \sup_{f\in\mathcal{F}}\ \mathbb{E}_{X\sim P^{\mathrm{true}}}\big[(\varrho - d(X, D_f))_+\big].$$

**Proposition 5.** Let $p \in (1, 2]$. Assume Assumptions 1, 2, 3, 5 and 6 are in force. Let $t > 0$. Set
$$\rho_n = \Big(1 - 2R_n(\mathcal{N}_q)(L/\bar\delta)^q\Big)^{-1/q}\sqrt{\frac{\lambda_n\big(t + \log(1 + N(\frac{1}{n}; \Theta, \|\cdot\|_\Theta))\big)}{n^{2/p}}}.$$
Then with probability at least $1 - e^{-t}$, for all $f \in \mathcal{F}$, it holds that
$$L^{\mathrm{true}}(f) \le L^{\mathrm{vr}}_{n,q}(f;\rho_n) + \frac{2\Lambda}{n} + C_0\,\rho_n^{(\gamma+1)\wedge p},$$
where $C_0$ is the constant from Proposition 1.

Recall that $\mathcal{N}_q$ is the class of slope functions defined in (1); the inflation of the radius compared with the case $p = 1$ compensates for the discrepancy between $\||\partial f|\|_q$ and $\||\partial f|\|_{q,P_n}$. The inflation factor is well-defined as long as the Rademacher complexity of $\mathcal{N}_q$ diminishes with the sample size. The additional remainder term is similar to the one in Proposition 1; it measures the difference between (WO) and (VR) when the nominal distribution is $P^{\mathrm{true}}$. The proof of Propositions 4 and 5 follows a similar flow to that of [17, Corollary 2], but our results extend it to non-i.i.d. and non-smooth settings.

**Example 1 (Linear Prediction with Polynomial Loss).** Consider a supervised learning problem with a feature vector $x \in \mathcal{X} \subset \mathbb{R}^d$ and a response variable $y \in \mathcal{Y} \subset \mathbb{R}$. Suppose $\|x\|_2 \le B_1$ for all $x \in \mathcal{X}$ and $|y| \le B_2$ for all $y \in \mathcal{Y}$. Let $\Theta$ be a bounded set in $\mathbb{R}^d$ such that $\|\theta\|_2 \le B_3$ with $B_3 \ge 1$ for all $\theta \in \Theta$. Consider a linear predictor with polynomial loss, $\mathcal{F} = \{(x,y) \mapsto f_\theta(x,y) = |\theta^\top x - y|^p : \theta \in \Theta\}$. We now discuss how to ensure the assumptions in Proposition 4, with details in the supplementary. It is easy to verify that every $f_\theta(x,y)$ is Lipschitz in both variables and parameters, with $\Lambda = p(B_1 + B_2)^p B_3^{p-1}$ in Assumption 6. Suppose the samples are i.i.d. from $P^{\mathrm{true}}$ satisfying $T_p(\lambda)$, which implies Assumption 5 with $\lambda_n = \lambda n^{2/p-1}$. Hence we can apply Proposition 4, and by [36, Example 5.8], we have $\log N(\frac{1}{n}; \Theta, \|\cdot\|_2) \le d\log(1 + 2B_3 n)$.

**Example 2 (Markovian Data).** In this example, we show that Assumption 5 on $\mu_{S_n}$ can be satisfied for Markovian data. Consider a homogeneous Markov chain on $\mathcal{X}$ with transition kernel $P(dy\,|\,x)$. Assume there exist $\lambda_0 > 0$, $\beta < \lambda_0$ and $C < \infty$ such that
$$\int e^{\lambda_0\|y\|^2} P(dy\,|\,x) \le Ce^{\beta\|x\|^2}, \qquad \forall x \in \mathcal{X}.$$
Let $\{\hat{x}_i\}_{i=1}^n$ be $n$ consecutive outputs of this Markov chain. Then by Theorem 4.1 in [9], $\mu_{S_n}$ satisfies $T_1(\lambda_n)$ with $\lambda_n = \frac{2n(1+\log C)}{\lambda_0 - \beta}$. Thus Assumption 5 is satisfied.
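As a concrete illustration (our sketch; the transport-inequality constants are assumed known, which in practice they rarely are), the radius prescribed by Proposition 4 can be computed numerically for both Example 1 (i.i.d. data) and Example 2 (Markovian data):

```python
import numpy as np

def radius_iid(n, d, lam, B3, t):
    """Example 1, p = 1: tensorization gives lambda_n = lam * n, and
    log N(1/n; Theta, ||.||_2) <= d log(1 + 2 B3 n) by [36, Example 5.8]."""
    lam_n = lam * n
    log_N = d * np.log(1.0 + 2.0 * B3 * n)
    return np.sqrt(lam_n * (t + log_N) / n**2)     # rho_n = O(sqrt(d log n / n))

def radius_markov(n, d, lam0, beta, C, B3, t):
    """Example 2: Theorem 4.1 of [9] gives T_1(lambda_n) with
    lambda_n = 2 n (1 + log C) / (lam0 - beta)."""
    lam_n = 2.0 * n * (1.0 + np.log(C)) / (lam0 - beta)
    log_N = d * np.log(1.0 + 2.0 * B3 * n)
    return np.sqrt(lam_n * (t + log_N) / n**2)

t = np.log(1.0 / 0.05)                             # failure probability e^{-t} = 5%
print(radius_iid(10_000, 5, 1.0, 1.0, t))
print(radius_markov(10_000, 5, 2.0, 0.5, np.e, 1.0, t))
```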
## 5 Generalization Bounds for (Wasserstein) Robust Optimization

In this section, we develop the generalization bounds for (WO) and (RO), based on the results of the previous sections.

### 5.1 Stochastic Setting

Combining the results in Sections 3 and 4, we immediately obtain the following theorems.

**Theorem 1.** Let $p \in (1, 2]$ and $t > 0$. Under the setup of Propositions 1 and 5, with probability at least $1 - e^{-t}$, for all $f \in \mathcal{F}$, it holds that
$$L^{\mathrm{true}}(f) \le L^{\mathrm{wo}}_{n,p}(f;\rho_n) + \Xi_n, \qquad \Xi_n = (C_0 + C_0')\,\rho_n^{(\gamma+1)\wedge p} + 2L\,e_n\big((pL/\bar\delta_n)^{q-1}\rho_n\big) + 2L\,e\big((pL/\bar\delta)^{q-1}\rho_n\big) + \frac{2\Lambda}{n},$$
where $C_0'$ is the analogue of $C_0$ with $\bar\delta_n$ replaced by $\bar\delta$. With the additional setup of Proposition 2, it holds that
$$L^{\mathrm{true}}(f) \le L^{\mathrm{ro}}_{n,p}(f;\rho_n) + \Xi_n + G_n(\rho_n),$$
where $G_n$ is the gap term between (WO) and (RO) from Proposition 2.

**Theorem 2.** Let $p = 1$ and $t > 0$. Under the setup of Propositions 3 and 4, with probability at least $1 - e^{-t}$, for all $f \in \mathcal{F}$, it holds that
$$L^{\mathrm{true}}(f) \le L^{\mathrm{wo}}_{n,1}(f;\rho_n) + \bar\Delta_n + \frac{2\Lambda}{n}, \qquad L^{\mathrm{true}}(f) \le L^{\mathrm{ro}}_{n,1}(f;\rho_n) + \Delta_n + \frac{2\Lambda}{n}.$$

Both Theorems 1 and 2 show that, by choosing a reasonably small radius that does not suffer from the curse of dimensionality, the robust solution obtained by solving (VR), (RO), or (WO) enjoys a nice generalization bound, expressed as the robust loss plus a higher-order term. The higher-order term depends mainly on the smoothness of the loss function, as indicated by $\gamma$; the (non-)concentration of probability mass on non-differentiable points, as indicated by $e_n$; and, when the Wasserstein order is $p = 1$, the concentration of the maximal slope, as indicated by $\bar\epsilon_n$ (resp. $\epsilon_n$) in the expression of $\bar\Delta_n$ (resp. $\Delta_n$).

**Example 3 (Unsupervised Learning).** Consider the principal component analysis
$$\max_{W\in\mathbb{R}^{d\times k}} \Big\{\frac{1}{n}\|W^\top X\|_F^2 : W^\top W = I_k\Big\},$$
which seeks $k$ principal directions along which the data has maximal variance. Here $X$ is a $d \times n$ sample matrix consisting of $n$ samples $\{X_i\}_{i=1}^n$ from $P^{\mathrm{true}}$ on a set $\mathcal{X} \subset \{x \in \mathbb{R}^d : \|x\|_2 \le B\}$; $\|\cdot\|_F$ denotes the Frobenius norm; $I_k$ denotes the $k$-dimensional identity matrix; and $W = [w_1, \ldots, w_k]$ collects the $k$ orthonormal principal directions in $\mathbb{R}^d$. Assume that $P^{\mathrm{true}}$ satisfies $T_2(\lambda)$, $\mathbb{E}_{x\sim P^{\mathrm{true}}}[x] = 0$, and the smallest eigenvalue of the covariance matrix $\mathbb{E}_{x\sim P^{\mathrm{true}}}[xx^\top]$ is positive. Let $p = 2$ and $\mathcal{F} = \{x \mapsto f_W(x) = -\|W^\top x\|_2^2 : W^\top W = I_k\}$. Below we verify Assumptions 1, 2, 3, 5, 6, as required by Theorem 1, and compute the constants; we refer to the supplementary for detailed calculations. Assumption 1 holds since $f_W$ is $2Bk$-Lipschitz; Assumption 2 holds with $\gamma = 1$; the zero mean and non-singular covariance of $P^{\mathrm{true}}$ imply Assumption 3; Assumption 5 holds with $\lambda_n = \lambda$; and Assumption 6 holds with $\Lambda = 2\sqrt{k}B^2$. Finally, to compute the covering number of the set $\mathcal{W} = \{W : W^\top W = I_k\}$, observe that each $W$ consists of $k$ orthogonal vectors, each belonging to the unit ball in $\mathbb{R}^d$. Hence $\log N(\frac{1}{n}; \mathcal{W}, \|\cdot\|_2) \le kd\log(1 + 2n)$. Thus, we improve the bound in [39, Example 8], which has an exponential dependence on $d$.
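As an illustration of Example 3 (our sketch with simulated data; not code from the paper), the loss $f_W(x) = -\|W^\top x\|_2^2$ has slope $|\partial f_W|(x) = 2\|W^\top x\|_2$ for orthonormal $W$, so the (VR) objective with $p = q = 2$ is directly computable:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, n, rho = 10, 2, 1000, 0.05
X = rng.normal(size=(n, d)) @ np.diag(np.linspace(1.0, 0.1, d))  # anisotropic data

W, _ = np.linalg.qr(rng.normal(size=(d, k)))      # a feasible W with W^T W = I_k
proj = X @ W                                       # rows: W^T x_i
loss = -np.mean(np.sum(proj ** 2, axis=1))        # empirical risk of f_W
slope = 2.0 * np.linalg.norm(proj, axis=1)        # |∂f_W|(x_i) = 2 ||W^T x_i||_2
L_vr = loss + rho * np.sqrt(np.mean(slope ** 2))  # (VR) objective, q = 2
print(L_vr)
```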
Before closing this subsection, we emphasize that this stochastic setting is drastically different from the adversarial setting in the next subsection, which is the focus of many recent studies on the generalization properties of robust optimization. In fact, choosing a good radius scaling scheme with nice finite-sample guarantees is at the core of Wasserstein DRO, and it remained open for quite some time until the recent work [17]. We generalize this result to non-i.i.d. and non-smooth settings, as well as to (RO).

### 5.2 Adversarial Setting

In the adversarial setting (Adv), we try to minimize the expected loss under the worst-case scenario in which the learning algorithm is purposely attacked by an adversary. By replacing $P^{\mathrm{true}}$ with $P_n$ in (Adv), we obtain the empirical adversarial robustness problem, which is the same as the Wasserstein robust optimization problem (WO). One of the most fundamental questions in adversarial robustness is to estimate the generalization gap between (Adv) and its empirical counterpart (WO). We derive generalization bounds for $p \in [1, \infty]$ by virtue of Propositions 1 and 3. Following the literature, in this subsection we assume $P_n$ consists of i.i.d. samples from a continuous distribution $P^{\mathrm{true}}$. We start with $p = 1$.

**Theorem 3.** Let $p = 1$. Under the setup of Proposition 3, assume that $\|f\|_\infty \le M$ for all $f \in \mathcal{F}$. Let $\delta \in [0, \frac{1}{2})$ and $t > 0$. Then with probability at least $1 - e^{-t}$, it holds that for every $f \in \mathcal{F}$,
$$\big|L^{\mathrm{wo}}_{n,1}(f;\rho) - L^{\mathrm{adv}}_1(f;\rho)\big| \le 2R_n(\mathcal{F}) + M\sqrt{\frac{t}{2n}} + 2\bar\Delta_n + C\rho^{1+\beta},$$
for an explicit constant $C$ depending on the constants in Assumption 4.

On the right-hand side of the inequality in Theorem 3, the first two terms are identical to the generalization bound of empirical risk minimization. The third term, $\bar\Delta_n$ from Proposition 3, is linear in the adversarial perturbation level $\rho$; it represents the inflation of the generalization bound in the adversarial setting relative to the stochastic setting, and its dominating factor is the concentration of the maximal slope. The remaining terms are of higher order for small adversarial perturbation levels $\rho$ and depend mainly on the smoothness of the loss function. Next, we consider $p \in (1, \infty]$.

**Theorem 4.** Let $p \in (1, \infty]$. Under the setup of Proposition 1, assume that $\|f\|_\infty \le M$ for all $f \in \mathcal{F}$. Let $\rho \le \bar{c}$ for an explicit threshold $\bar{c}$ depending on the constants above, and let $t > 0$. Then with probability at least $1 - 2e^{-t}$, it holds that for every $f \in \mathcal{F}$,
$$\big|L^{\mathrm{wo}}_{n,p}(f;\rho) - L^{\mathrm{adv}}_p(f;\rho)\big| \le 2R_n(\mathcal{F}) + M\sqrt{\frac{t}{2n}} + \rho\Big(2R_n(\partial\mathcal{F}_q) + L^q\sqrt{\frac{t}{2n}}\Big)^{1/q} + 2C_0\,\rho^{(\gamma+1)\wedge p} + 2L\,e_n(2\rho) + 2L\,e(2\rho),$$
where $\partial\mathcal{F}_q$ denotes the class of $q$-th powers of the gradient-norm (slope) functions of $\mathcal{F}$.

Similar to the case $p = 1$, the first two terms in the above bound are identical to the bound for empirical risk minimization, and the higher-order terms depend on the smoothness of the loss functions. The major difference from $p = 1$ is that the inflation term in the adversarial setting depends on $R_n(\partial\mathcal{F}_q)$, the Rademacher complexity of the class of gradient-norm functions. This term appears to be new in the literature. Intuitively, if the complexity of the variation is small, the model is more robust to adversarial perturbations and thus generalizes better.

Theorems 3 and 4 are proved with a different strategy than the one used in the stochastic setting. Set $L^{\mathrm{vr}}_{q}(f;\rho) := \mathbb{E}_{P^{\mathrm{true}}}[f] + \rho\,\||\partial f|\|_{q,P^{\mathrm{true}}}$. Using the triangle inequality, we decompose the difference $|L^{\mathrm{wo}}_{n,p}(f;\rho) - L^{\mathrm{adv}}_p(f;\rho)|$ into three terms,
$$\big|L^{\mathrm{wo}}_{n,p}(f;\rho) - L^{\mathrm{vr}}_{n,q}(f;\rho)\big|, \qquad \big|L^{\mathrm{vr}}_{n,q}(f;\rho) - L^{\mathrm{vr}}_{q}(f;\rho)\big|, \qquad \big|L^{\mathrm{vr}}_{q}(f;\rho) - L^{\mathrm{adv}}_p(f;\rho)\big|,$$
and then bound each term individually.
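When no closed form is available, the empirical adversarial loss in the decomposition above is typically estimated by an inner ascent routine. The following sketch (a standard projected-gradient heuristic, not an algorithm from the paper; the toy loss is hypothetical) estimates $L^{\mathrm{wo}}_{n,\infty}(f;\rho)$, for which the inner maximum decouples into one ball-constrained ascent per sample:

```python
import numpy as np

def empirical_adv_loss(f, grad_f, X_hat, rho, steps=20):
    """Estimate L^wo_{n,inf}(f; rho) by projected gradient ascent per sample."""
    lr = rho / steps
    X = X_hat.copy()
    for _ in range(steps):
        X = X + lr * grad_f(X)                    # ascend the loss
        delta = X - X_hat
        norms = np.linalg.norm(delta, axis=1, keepdims=True)
        X = X_hat + delta * np.minimum(1.0, rho / np.maximum(norms, 1e-12))  # project
    return f(X).mean()

# Usage with a smooth toy loss (hypothetical):
rng = np.random.default_rng(4)
w = rng.normal(size=3)
f = lambda X: np.log1p(np.exp(X @ w))
grad_f = lambda X: (1.0 / (1.0 + np.exp(-(X @ w))))[:, None] * w
print(empirical_adv_loss(f, grad_f, rng.normal(size=(200, 3)), rho=0.1))
```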
## 6 Concluding Remarks

In this paper, we thoroughly analyzed finite-sample performance bounds for robust optimization, drawing connections with transport inequalities and local Rademacher complexity. Our results generalize existing results in several directions, in particular to non-smooth loss functions and to non-i.i.d. data, both of which require non-trivial analysis. Our results help to better understand the regularization effect of robust learning and to explain its superior empirical performance. We hope our results can inspire regularization schemes for other applications as well. Our bounds involve the Rademacher complexities of the (normalized) gradient-norm function classes $\partial\mathcal{F}_q$, $\mathcal{N}_q$ and of the margin-to-discontinuity class $\mathcal{I}_\varrho$, which are non-standard; we have provided various examples and will investigate them in more depth in future work.

## References

[1] Pranjal Awasthi, Natalie Frank, and Mehryar Mohri. Adversarial learning guarantees for linear hypotheses and neural networks. arXiv preprint arXiv:2004.13617, 2020.

[2] Daniel Bartl, Samuel Drapeau, Jan Obloj, and Johannes Wiesel. Robust uncertainty sensitivity analysis. arXiv preprint arXiv:2006.12022, 2020.

[3] Aurélien Bellet and Amaury Habrard. Robustness and generalization for metric learning. Neurocomputing, 151:259–267, 2015.

[4] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization. Princeton University Press, 2009.

[5] Dimitris Bertsimas and Martin S Copenhaver. Characterization of the equivalence of robustification and regularization in linear and matrix regression. European Journal of Operational Research, 270(3):931–942, 2018.

[6] Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857, 2019.

[7] Jose Blanchet and Karthyek Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600, 2019.

[8] Jose Blanchet, Karthyek Murthy, and Nian Si. Confidence regions in Wasserstein distributionally robust estimation. arXiv preprint arXiv:1906.01614, 2019.

[9] François Bolley and Cédric Villani. Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. In Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 14, pages 331–352, 2005.

[10] Sébastien Bubeck, Yin Tat Lee, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. In International Conference on Machine Learning, pages 831–840. PMLR, 2019.

[11] Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, and John Duchi. Unlabeled data improves adversarial robustness. Advances in Neural Information Processing Systems, 2019.

[12] Ruidi Chen and Ioannis C Paschalidis. A robust learning approach for regression models based on distributionally robust optimization. Journal of Machine Learning Research, 19(13), 2018.

[13] Zac Cranko, Simon Kornblith, Zhan Shi, and Richard Nock. Lipschitz networks and distributional robustness. arXiv preprint arXiv:1809.01129, 2018.

[14] John C Duchi and Hongseok Namkoong. Variance-based regularization with convex objectives. Journal of Machine Learning Research, 20(68), 2019.

[15] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166, 2018.

[16] Chris Finlay, Jeff Calder, Bilal Abbasi, and Adam Oberman. Lipschitz regularized deep neural networks generalize and are adversarially robust. arXiv preprint arXiv:1808.09540, 2018.

[17] Rui Gao. Finite-sample guarantees for Wasserstein distributionally robust optimization: Breaking the curse of dimensionality. arXiv preprint arXiv:2009.04382, 2020.

[18] Rui Gao, Xi Chen, and Anton J Kleywegt. Wasserstein distributional robustness and regularization in statistical learning. arXiv preprint arXiv:1712.06050, 2017.

[19] Rui Gao and Anton J Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.

[20] Rui Gao and Anton J Kleywegt. Distributionally robust stochastic optimization with dependence structure. arXiv preprint arXiv:1701.04200, 2017.

[21] Jun-ya Gotoh, Michael Jong Kim, and Andrew EB Lim. Robust empirical optimization is almost the same as mean-variance optimization. Operations Research Letters, 46(4):448–452, 2018.

[22] Nathael Gozlan and Christian Léonard. Transport inequalities. A survey. Markov Processes and Related Fields, 16:635–736, 2010.

[23] Hisham Husain. Distributional robustness with IPMs and links to regularization and GANs. arXiv preprint arXiv:2006.04349, 2020.

[24] Justin Khim and Po-Ling Loh. Adversarial risk bounds via function transformation. arXiv preprint arXiv:1810.09519, 2018.
[25] Daniel Kuhn, Peyman Mohajerin Esfahani, Viet Anh Nguyen, and Soroosh Shafieezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations Research & Management Science in the Age of Analytics, pages 130–166. INFORMS, 2019.

[26] Henry Lam. Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Operations Research, 67(4):1090–1105, 2019.

[27] Jaeho Lee and Maxim Raginsky. Minimax statistical learning with Wasserstein distances. Advances in Neural Information Processing Systems, 31:2687–2696, 2018.

[28] Olivier Marchal, Julyan Arbel, et al. On the sub-Gaussianity of the beta and Dirichlet distributions. Electronic Communications in Probability, 22, 2017.

[29] Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In Proceedings of the 21st International Conference on Neural Information Processing Systems, pages 1097–1104, 2008.

[30] Amir Najafi, Shin-ichi Maeda, Masanori Koyama, and Takeru Miyato. Robustness to adversarial perturbations in learning from incomplete data. Advances in Neural Information Processing Systems, 32:5541–5551, 2019.

[31] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 5019–5031, 2018.

[32] Soroosh Shafieezadeh-Abadeh, Daniel Kuhn, and Peyman Mohajerin Esfahani. Regularization via mass transportation. Journal of Machine Learning Research, 20(103):1–68, 2019.

[33] Nian Si, Jose Blanchet, Soumyadip Ghosh, and Mark Squillante. Quantifying the empirical Wasserstein distance to a set of measures: Beating the curse of dimensionality. Advances in Neural Information Processing Systems, 33, 2020.

[34] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

[35] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2018.

[36] Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

[37] Huan Xu, Constantine Caramanis, and Shie Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(7), 2009.

[38] Huan Xu, Constantine Caramanis, and Shie Mannor. Robust regression and lasso. IEEE Transactions on Information Theory, 56(7):3561–3574, 2010.

[39] Huan Xu and Shie Mannor. Robustness and generalization. Machine Learning, 86(3):391–423, 2012.

[40] Dong Yin, Ramchandran Kannan, and Peter Bartlett. Rademacher complexity for adversarially robust generalization. In International Conference on Machine Learning, pages 7085–7094. PMLR, 2019.

[41] Tom Zahavy, Bingyi Kang, Alex Sivak, Jiashi Feng, Huan Xu, and Shie Mannor. Ensemble robustness and generalization of stochastic deep learning algorithms. arXiv preprint arXiv:1602.02389, 2016.