Outlier-Robust Wasserstein DRO

Sloan Nietert (Cornell University, nietert@cs.cornell.edu), Ziv Goldfeld (Cornell University, goldfeld@cornell.edu), Soroosh Shafiee (Cornell University, shafiee@cornell.edu)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract: Distributionally robust optimization (DRO) is an effective approach for data-driven decision-making in the presence of uncertainty. Geometric uncertainty due to sampling or localized perturbations of data points is captured by Wasserstein DRO (WDRO), which seeks to learn a model that performs uniformly well over a Wasserstein ball centered around the observed data distribution. However, WDRO fails to account for non-geometric perturbations such as adversarial outliers, which can greatly distort the Wasserstein distance measurement and impede the learned model. We address this gap by proposing a novel outlier-robust WDRO framework for decision-making under both geometric (Wasserstein) perturbations and non-geometric (total variation (TV)) contamination that allows an $\varepsilon$-fraction of data to be arbitrarily corrupted. We design an uncertainty set using a certain robust Wasserstein ball that accounts for both perturbation types and derive minimax optimal excess risk bounds for this procedure that explicitly capture the Wasserstein and TV risks. We prove a strong duality result that enables tractable convex reformulations and efficient computation of our outlier-robust WDRO problem. When the loss function depends only on low-dimensional features of the data, we eliminate certain dimension dependencies from the risk bounds that are unavoidable in the general setting. Finally, we present experiments validating our theory on standard regression and classification tasks.

1 Introduction

The safety and effectiveness of various operations rely on making informed, data-driven decisions in uncertain environments. Distributionally robust optimization (DRO) has emerged as a powerful framework for decision-making in the presence of uncertainties. In particular, Wasserstein DRO (WDRO) captures uncertainties of geometric nature, e.g., due to sampling or localized (adversarial) perturbations of the data points. The WDRO problem is a two-player zero-sum game between a learner (decision-maker), who chooses a decision $\theta \in \Theta$, and Nature (adversary), who chooses a distribution $\nu$ from an ambiguity set defined as the $p$-Wasserstein ball of a prescribed radius around the observed data distribution $\mu$. Namely, WDRO is given by¹
\[
\inf_{\theta \in \Theta}\ \sup_{\nu:\ \mathsf{W}_p(\nu, \mu) \le \rho} \mathbb{E}_{Z \sim \nu}[\ell(\theta, Z)], \tag{1}
\]
whose solution $\hat\theta \in \Theta$ is chosen to minimize risk over the Wasserstein ball with respect to (w.r.t.) the loss function $\ell$. WDRO has received considerable attention in many fields, including machine learning [6, 22, 45, 48, 59], estimation and filtering [36, 37, 46], and chance constraint programming [12, 55].

In many practical scenarios, the observed data may be contaminated by non-geometric perturbations, such as adversarial outliers. Unfortunately, the WDRO problem from (1) is not suited for handling this issue, as even a small fraction of outliers can greatly distort the $\mathsf{W}_p$ measurement and impede decision-making. In this work, we address this gap by proposing a novel outlier-robust WDRO framework that can learn well-performing decisions even in the presence of outliers.

¹Here, $\mathsf{W}_p(\mu, \nu) := \big(\inf_{\pi \in \Pi(\mu,\nu)} \int \|x - y\|^p \, d\pi(x, y)\big)^{1/p}$ is the $p$-Wasserstein metric between $\mu$ and $\nu$, where $\Pi(\mu, \nu)$ is the set of all their couplings.
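To illustrate the sensitivity of $\mathsf{W}_p$ to outliers that motivates this work, the following minimal sketch (ours, not from the paper; the helper name `wasserstein_1d` is hypothetical) computes the one-dimensional empirical $p$-Wasserstein distance via the standard sorted-sample (quantile) coupling and contrasts a small geometric shift with a small contaminated fraction.

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """Empirical p-Wasserstein distance between two equal-size 1-D samples.

    For uniform empirical measures on R, an optimal coupling matches sorted
    points, so W_p^p = mean_i |x_(i) - y_(i)|^p.
    """
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape, "this sketch assumes equal sample sizes"
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=1000)
shifted = clean + 0.1                     # small geometric (W_p) perturbation
contaminated = clean.copy()
contaminated[:50] += 100.0                # 5% adversarial outliers

print(wasserstein_1d(clean, shifted))         # stays around 0.1: W_1 tracks the shift
print(wasserstein_1d(clean, contaminated))    # blows up: a few outliers dominate W_1
```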
We couple this framework with a comprehensive theory of excess risk bounds, statistical guarantees, and computationally tractable reformulations, as well as supporting numerical results.

1.1 Contributions

Figure 1: A visual depiction of a clean measure $\mu \in \mathcal{P}(\mathbb{R})$ and a corrupted observation $\tilde\mu \in \mathcal{P}(\mathbb{R})$ satisfying $\mathsf{W}_p^\varepsilon(\tilde\mu, \mu) \le \rho$.

We consider a scenario where the observed data distribution $\tilde\mu$ is subject to both geometric (Wasserstein) perturbations and non-geometric (total variation (TV)) contamination, which allows an $\varepsilon$-fraction of data to be arbitrarily corrupted. Namely, if $\mu$ is the true (unknown) data distribution, then the Wasserstein perturbation maps it to some $\mu'$ with $\mathsf{W}_p(\mu', \mu) \le \rho$, and the TV contamination step further produces $\tilde\mu$ with $\|\tilde\mu - \mu'\|_{\mathsf{TV}} \le \varepsilon$ (e.g., in the special case of the Huber model [28], $\tilde\mu = (1-\varepsilon)\mu' + \varepsilon\alpha$, where $\alpha$ is an arbitrary noise distribution). To enable robust decision-making under this model, we replace the Wasserstein ambiguity set in (1) with a ball w.r.t. the recently proposed outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ [38, 39]. The $\mathsf{W}_p^\varepsilon$ distance (see (2) ahead) filters out the $\varepsilon$-fraction of mass from the contaminated distribution that contributes most to the transportation cost, and then measures the $\mathsf{W}_p$ distance post-filtering. To obtain well-performing solutions for our WDRO problem, the $\mathsf{W}_p^\varepsilon$ ball is intersected with a set that encodes standard moment assumptions on the uncorrupted data distribution, which are necessary for meaningful outlier-robust estimation guarantees.

We establish minimax optimal excess risk bounds for the decision $\hat\theta$ that solves the proposed outlier-robust WDRO problem. The bounds control the gap $\mathbb{E}[\ell(\hat\theta, Z)] - \mathbb{E}[\ell(\theta^\star, Z)]$, where $Z \sim \mu$ follows the true data distribution and $\theta^\star = \operatorname{argmin}_\theta \mathbb{E}[\ell(\theta, Z)]$ is the optimal decision, subject to regularity properties of $\ell^\star = \ell(\theta^\star, \cdot)$. In turn, our bounds imply that the learner can make effective decisions using outlier-robust WDRO based on the contaminated observation $\tilde\mu$, so long as $\ell^\star$ has low variational complexity. The bounds capture this complexity using the Lipschitz or Sobolev seminorms of $\ell^\star$ and clarify the distinct effect of each perturbation (Wasserstein versus TV) on the quality of the learned $\hat\theta$ solution. We further establish their minimax optimality when $p = 1$ by providing a matching lower bound in the setting where an adversary picks a class of Lipschitz functions over which the learner must perform uniformly well.

The excess risk bounds become looser as the data dimension $d$ grows. We show that this degradation is alleviated when the loss function depends on the data only through $k$-dimensional affine features, by providing risk bounds that adapt to $k$ instead of $d$.

We then move to study the computational side of the problem, which may initially appear intractable due to non-convexity of the constraint set. We resolve this via a cheap preprocessing step that computes a coarse robust estimate of the mean [34] and replaces the original constraint set (which involves the true mean) with a version centered around the estimate. We adapt our excess risk bounds to this formulation and then prove a strong duality theorem. The dual form is reminiscent of the one for classical WDRO, with adaptations reflecting the constraint to the clean distribution family and the partial transportation under $\mathsf{W}_p^\varepsilon$. Under additional convexity conditions on the loss, we further derive an efficiently computable, finite-dimensional, convex reformulation. The optimization results are also adapted to the setting with low-dimensional features.
Using the developed machinery, we present experiments that validate our theory on simple regression/classification tasks and demonstrate the superiority of the proposed approach over classical WDRO when the observed data is contaminated.

1.2 Related Work

Distributionally robust optimization. The Wasserstein distance has emerged as a powerful tool for modeling uncertainty in the data-generating distribution. It was first used to construct an ambiguity set around the empirical distribution in [40]. Recent advancements in convex reformulations and approximations of the WDRO problem, as discussed in [8, 20, 35], have brought notable computational advantages. Additionally, WDRO is linked to various forms of variation [2, 9, 19, 43] and Lipschitz [7, 11, 44] regularization, which contribute to its success in practice. Robust generalization guarantees can also be provided by WDRO via measure concentration arguments or transportation inequalities [18, 30, 31, 51, 53, 54].

Several works have raised concerns regarding the sensitivity of standard DRO to outliers [24, 27, 58]. An attempt to address this was proposed in [56] using a refined risk function based on a family of f-divergences. This formulation aims to prevent DRO from overfitting to potential outliers but is not robust to geometric perturbations. Further, their risk bounds require a moment condition to hold uniformly over $\Theta$, in contrast to our bounds that depend only on $\theta^\star$. We address these limitations by developing a WDRO framework based on partial transportation. While partial OT has been previously used in the context of DRO problems, it was introduced to address stochastic programs with side information in [17] rather than to account for outlier robustness. Another closely related line of work is presented in [4, 5], where the ambiguity set is constructed using an f-divergence to mitigate statistical errors and the Prokhorov distance to handle outlier data. The proposed model is both computationally efficient and statistically reliable. However, they have not investigated its minimax optimality or robustness against the Huber contamination model, which we aim to do in this paper. Additionally, a best-case favorable analysis approach has been proposed in [29] to address outlier data. This approach is an alternative to the worst-case distributionally robust method. However, it requires solving a non-convex optimization problem, significantly impacting its scalability, and is not accompanied by any proof of minimax optimality.

Robust statistics. The problem of learning from data under TV $\varepsilon$-corruptions dates back to [28]. Over the years, various robust and sample-efficient estimators, particularly for mean and scale parameters, have been developed in the robust statistics community; see [41] for a comprehensive survey. The theoretical computer science community, on the other hand, has focused on developing computationally efficient estimators that achieve optimal estimation rates in high dimensions [13, 16]. Relatedly, the probably approximately correct (PAC) learning framework has been well-studied in similar models [1, 10]. Recently, [58] developed a unified robust estimation framework based on minimum distance estimation that gives sharp population-limit and promising finite-sample guarantees for mean and covariance estimation, as well as linear regression. Their analysis centers on a generalized resilience quantity, which is essential to our work.
We are unaware of any results in the settings above which extend to combined TV and $\mathsf{W}_p$ corruptions. Finally, our analysis relies on the outlier-robust Wasserstein distance from [38, 39], which was shown to yield an optimal minimum distance estimate for robust distribution estimation under $\mathsf{W}_p$ loss.

2 Preliminaries

Notation. Consider a closed, non-empty set $\mathcal{Z} \subseteq \mathbb{R}^d$ equipped with the Euclidean norm $\|\cdot\|$. A continuously differentiable function $f : \mathcal{Z} \to \mathbb{R}$ is called $\alpha$-smooth if $\|\nabla f(z) - \nabla f(z')\| \le \alpha \|z - z'\|$ for all $z, z' \in \mathcal{Z}$. The perspective function of a lower semi-continuous (l.s.c.) and convex function $f$ is $P_f(x, \lambda) := \lambda f(x/\lambda)$ for $\lambda > 0$, with $P_f(x, 0) = \lim_{\lambda \downarrow 0} \lambda f(x/\lambda)$. The convex conjugate of $f$ is $f^*(y) := \sup_{x \in \mathbb{R}^d} \langle y, x \rangle - f(x)$. We denote by $\chi_{\mathcal{Z}}$ the indicator function of $\mathcal{Z}$, that is, $\chi_{\mathcal{Z}}(z) = 0$ if $z \in \mathcal{Z}$ and $\chi_{\mathcal{Z}}(z) = \infty$ otherwise. The convex conjugate of $\chi_{\mathcal{Z}}$, denoted by $\chi^*_{\mathcal{Z}}$, is termed the support function of $\mathcal{Z}$. We use $\mathcal{M}(\mathcal{Z})$ for the set of signed Radon measures on $\mathcal{Z}$ equipped with the TV norm $\|\mu\|_{\mathsf{TV}} := \frac{1}{2}|\mu|(\mathcal{Z})$, and write $\mu \le \nu$ for set-wise inequality. The class of Borel probability measures on $\mathcal{Z}$ is denoted by $\mathcal{P}(\mathcal{Z})$. We write $\mathbb{E}_\mu[f(Z)]$ for the expectation of $f(Z)$ with $Z \sim \mu$; when clear from the context, the random variable is dropped and we write $\mathbb{E}_\mu[f]$. Let $\Sigma_\mu$ denote the covariance matrix of $\mu \in \mathcal{P}_2(\mathcal{Z})$. Define $\mathcal{P}_p(\mathcal{Z}) := \{\mu \in \mathcal{P}(\mathcal{Z}) : \inf_{z_0 \in \mathcal{Z}} \mathbb{E}_\mu[\|Z - z_0\|^p] < \infty\}$. The push-forward of $\mu \in \mathcal{P}(\mathcal{Z})$ through $f$ is $f_\#\mu(\cdot) := \mu(f^{-1}(\cdot))$, and, for $\mathcal{A} \subseteq \mathcal{P}(\mathcal{Z})$, we write $f_\#\mathcal{A} := \{f_\#\mu : \mu \in \mathcal{A}\}$. The $p$th order homogeneous Sobolev (semi)norm of continuously differentiable $f : \mathcal{Z} \to \mathbb{R}$ w.r.t. $\mu$ is $\|f\|_{\dot H^{1,p}(\mu)} := \mathbb{E}_\mu[\|\nabla f\|^p]^{1/p}$. The set of integers up to $n \in \mathbb{N}$ is denoted by $[n]$; we also use the shorthand $[x]_+ = \max\{x, 0\}$. We write $\lesssim$, $\gtrsim$, $\asymp$ for inequalities/equality up to absolute constants.

Classical and outlier-robust Wasserstein distances. For $p \in [1, \infty)$, the $p$-Wasserstein distance between $\mu, \nu \in \mathcal{P}_p(\mathcal{Z})$ is $\mathsf{W}_p(\mu, \nu) := \inf_{\pi \in \Pi(\mu,\nu)} \mathbb{E}_\pi[\|X - Y\|^p]^{1/p}$, where $\Pi(\mu, \nu) := \{\pi \in \mathcal{P}(\mathcal{Z}^2) : \pi(\cdot \times \mathcal{Z}) = \mu,\ \pi(\mathcal{Z} \times \cdot) = \nu\}$ is the set of all their couplings. Some basic properties of $\mathsf{W}_p$ are (see, e.g., [42, 52]): (i) $\mathsf{W}_p$ is a metric on $\mathcal{P}_p(\mathcal{Z})$; (ii) the distance is monotone in the order, i.e., $\mathsf{W}_p \le \mathsf{W}_q$ for $p \le q$; and (iii) $\mathsf{W}_p$ metrizes weak convergence plus convergence of $p$th moments: $\mathsf{W}_p(\mu_n, \mu) \to 0$ if and only if $\mu_n \xrightarrow{w} \mu$ and $\int \|x\|^p \, d\mu_n(x) \to \int \|x\|^p \, d\mu(x)$.

To handle corrupted data, we employ the $\varepsilon$-outlier-robust $p$-Wasserstein distance², defined by
\[
\mathsf{W}_p^\varepsilon(\mu, \nu) := \inf_{\substack{\mu' \in \mathcal{P}(\mathbb{R}^d) \\ \|\mu' - \mu\|_{\mathsf{TV}} \le \varepsilon}} \mathsf{W}_p(\mu', \nu) = \inf_{\substack{\nu' \in \mathcal{P}(\mathbb{R}^d) \\ \|\nu' - \nu\|_{\mathsf{TV}} \le \varepsilon}} \mathsf{W}_p(\mu, \nu'). \tag{2}
\]
The second equality is a useful consequence of Lemma 4 in [39] (see Appendix A for details, along with an interpretation of $\mathsf{W}_p^\varepsilon$ as a partial OT distance).

Robust statistics. Resilience is a standard sufficient condition for population-limit robust statistics bounds [49, 58]. The $p$-Wasserstein resilience of a measure $\mu \in \mathcal{P}(\mathcal{Z})$ is defined by
\[
\tau_p(\mu, \varepsilon) := \sup_{\substack{\mu' \in \mathcal{P}(\mathcal{Z}) \\ \mu' \le \frac{1}{1-\varepsilon}\mu}} \mathsf{W}_p(\mu', \mu),
\]
and that of a family $\mathcal{G} \subseteq \mathcal{P}(\mathcal{Z})$ by $\tau_p(\mathcal{G}, \varepsilon) := \sup_{\mu \in \mathcal{G}} \tau_p(\mu, \varepsilon)$. The relation between $\mathsf{W}_p$ resilience and robust estimation is formalized in the following proposition.

Proposition 1 (Robust estimation under $\mathsf{W}_p$ resilience [39]). Fix $0 \le \varepsilon \le 0.49$. For any clean distribution $\mu \in \mathcal{G} \subseteq \mathcal{P}(\mathcal{Z})$ and corrupted measure $\tilde\mu \in \mathcal{P}(\mathcal{Z})$ such that $\mathsf{W}_p^\varepsilon(\tilde\mu, \mu) \le \rho$, the minimum distance estimate $\hat\mu = \operatorname{argmin}_{\nu \in \mathcal{G}} \mathsf{W}_p^\varepsilon(\nu, \tilde\mu)$ satisfies $\mathsf{W}_p(\hat\mu, \mu) \lesssim \rho + \tau_p(\mathcal{G}, 2\varepsilon)$.³

Throughout, we focus on the bounded covariance class $\mathcal{G}_{\mathrm{cov}} := \{\mu \in \mathcal{P}(\mathcal{Z}) : \Sigma_\mu \preceq I_d\}$.

Proposition 2 ($\mathsf{W}_p$ resilience bound for $\mathcal{G}_{\mathrm{cov}}$ [39]). Fixing $0 \le \varepsilon \le 0.99$ and $1 \le p \le 2$, we have $\tau_p(\mathcal{G}_{\mathrm{cov}}, \varepsilon) \lesssim \sqrt{d}\,\varepsilon^{1/p - 1/2}$.
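For concreteness, here is a minimal sketch (ours; the helper name `robust_wasserstein` and the use of SciPy's LP solver are our choices, not the paper's code) of how $\mathsf{W}_p^\varepsilon$ from (2) can be evaluated between two discrete measures. It relies on the partial-OT characterization from Lemma 1 in Appendix A: minimize transport cost over sub-couplings $\pi \ge 0$ whose marginals are dominated by the two measures and whose total mass is $1 - \varepsilon$, then take the $p$th root.

```python
import numpy as np
from scipy.optimize import linprog

def robust_wasserstein(x, y, a, b, eps, p=2):
    """Sketch: eps-outlier-robust p-Wasserstein distance between the discrete
    measures sum_i a_i delta_{x_i} and sum_j b_j delta_{y_j}, via its partial-OT
    form (Lemma 1): minimize sum_ij pi_ij ||x_i - y_j||^p over pi >= 0 with
    row sums <= a, column sums <= b, and total transported mass 1 - eps."""
    x, y = np.atleast_2d(x), np.atleast_2d(y)
    n, m = len(a), len(b)
    C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** p  # cost matrix
    A_ub = np.zeros((n + m, n * m))
    for i in range(n):                        # row-sum constraints: <= a_i
        A_ub[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                        # column-sum constraints: <= b_j
        A_ub[n + j, j::m] = 1.0
    res = linprog(
        C.ravel(), A_ub=A_ub, b_ub=np.concatenate([a, b]),
        A_eq=np.ones((1, n * m)), b_eq=[1.0 - eps], method="highs",
    )
    return res.fun ** (1.0 / p)

rng = np.random.default_rng(0)
n = 40
x = rng.normal(size=(n, 2))
x[:4] += 50.0                                 # 10% gross outliers in one sample
y = rng.normal(size=(n, 2))
a = b = np.full(n, 1.0 / n)
print(robust_wasserstein(x, y, a, b, eps=0.0, p=1))   # inflated by the outliers
print(robust_wasserstein(x, y, a, b, eps=0.1, p=1))   # outlier mass filtered out
```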
3 Outlier-robust WDRO

We perform stochastic optimization with respect to an unknown data distribution $\mu$, given access only to a corrupted version $\tilde\mu$. We allow both localized Wasserstein perturbations, which map $\mu$ to some $\mu'$ with $\mathsf{W}_p(\mu, \mu') \le \rho$, and TV $\varepsilon$-contamination that takes $\mu'$ to $\tilde\mu$ with $\|\tilde\mu - \mu'\|_{\mathsf{TV}} \le \varepsilon$. Equivalently, both perturbations are captured by $\mathsf{W}_p^\varepsilon(\tilde\mu, \mu) \le \rho$.⁴ To simplify notation, we henceforth suppress the dependence of the loss function $\ell$ on the model parameters $\theta \in \Theta$, writing $\ell$ for $\ell(\theta, \cdot)$ for a specific function and $\mathcal{L} = \{\ell(\theta, \cdot)\}_{\theta \in \Theta}$ for the whole class. Our full model is as follows.

Setting A: Fix a $p$-Wasserstein radius $\rho \ge 0$ and TV contamination level $\varepsilon \in [0, 0.49]$. Let $\mathcal{L} \subseteq \mathbb{R}^{\mathcal{Z}}$ be a family of real-valued loss functions on $\mathcal{Z}$, such that each $\ell \in \mathcal{L}$ is l.s.c. with $\sup_{z \in \mathcal{Z}} \frac{\ell(z)}{1 + \|z\|^p} < \infty$, and fix a class $\mathcal{G} \subseteq \mathcal{P}_p(\mathcal{Z})$ encoding distributional assumptions. We consider the following model: (i) Nature selects a distribution $\mu \in \mathcal{G}$, unknown to the learner; (ii) the learner observes a corrupted measure $\tilde\mu \in \mathcal{P}(\mathcal{Z})$ such that $\mathsf{W}_p^\varepsilon(\tilde\mu, \mu) \le \rho$; (iii) the learner selects a decision $\hat\ell \in \mathcal{L}$ and suffers excess risk $\mathbb{E}_\mu[\hat\ell] - \inf_{\ell \in \mathcal{L}} \mathbb{E}_\mu[\ell]$.

We seek a decision-making procedure for the learner which provides strong excess risk guarantees when $\ell^\star = \operatorname{argmin}_{\ell \in \mathcal{L}} \mathbb{E}_\mu[\ell]$ is appropriately simple. To achieve this, we introduce the $\varepsilon$-outlier-robust $p$-Wasserstein DRO problem:
\[
\inf_{\ell \in \mathcal{L}}\ \sup_{\nu \in \mathcal{G}:\ \mathsf{W}_p^\varepsilon(\tilde\mu, \nu) \le \rho} \mathbb{E}_\nu[\ell]. \tag{OR-WDRO}
\]

3.1 Excess Risk Bounds

We quantify the excess risk of decisions made using (OR-WDRO) for the two most popular choices of order, $p = 1, 2$. Proofs are provided in Supplement B.

²While not a metric, $\mathsf{W}_p^\varepsilon$ is symmetric and satisfies an approximate triangle inequality ([39], Proposition 3).
³If a minimizer does not exist for either problem, an infimizing sequence will achieve the same guarantee.
⁴We defer explicit modeling of sampling to Section 3.2 but note that the following results immediately transfer to the $n$-sample setting so long as $\rho$ is taken to be larger than $\mathsf{W}_p(\hat\mu_n, \mu)$ with high probability.

Theorem 1 (OR-WDRO risk bound). Under Setting A, let $\hat\ell$ minimize (OR-WDRO). Then, writing $c = 2(1-\varepsilon)^{-1/p}$, the excess risk is bounded by
\[
\mathbb{E}_\mu[\hat\ell] - \mathbb{E}_\mu[\ell^\star] \le
\begin{cases}
\|\ell^\star\|_{\mathrm{Lip}} \big(c\rho + 2\tau_1(\mathcal{G}, 2\varepsilon)\big), & p = 1,\ \ell^\star \text{ Lipschitz},\\[2pt]
\|\ell^\star\|_{\dot H^{1,2}(\mu)} \big(c\rho + 2\tau_2(\mathcal{G}, 2\varepsilon)\big) + \tfrac{\alpha}{2}\big(c\rho + 2\tau_2(\mathcal{G}, 2\varepsilon)\big)^2, & p = 2,\ \ell^\star\ \alpha\text{-smooth}.
\end{cases}
\]

Note that $c = O(1)$ since $\varepsilon \le 0.49$. These bounds imply that the learner can make effective decisions when $\ell^\star$ has low variational complexity⁵. In contrast, there are simple regression settings with TV corruption that drive the excess risk of standard WDRO to infinity. Our proof derives both results as a special case of a general bound in terms of the $\mathsf{W}_p$ regularizer, defined by $\mathcal{R}_{\mu,p}(\rho; \ell) := \sup_{\nu \in \mathcal{P}(\mathcal{Z}):\, \mathsf{W}_p(\nu, \mu) \le \rho} \mathbb{E}_\nu[\ell] - \mathbb{E}_\mu[\ell]$. Introduced in [18], this quantity appears implicitly throughout the WDRO literature. In particular, for each $\ell \in \mathcal{L}$, we derive the following bound:
\[
\mathbb{E}_\mu[\hat\ell] - \mathbb{E}_\mu[\ell] \le \mathcal{R}_{\mu,p}\big(\underbrace{c\rho}_{\mathsf{W}_p} + \underbrace{2\tau_p(\mathcal{G}, 2\varepsilon)}_{\mathsf{TV}};\ \ell\big), \tag{3}
\]
whose radius reveals the effect of each perturbation (viz. Wasserstein versus TV) on the quality of the decision. The first bound of the theorem follows by plugging in $p = 1$ and controlling $\mathcal{R}_{\mu,1}$ via Kantorovich duality. The second bound uses $p = 2$ and controls $\mathcal{R}_{\mu,2}$ by replacing $\ell$ with its Taylor expansion about $Z \sim \mu$.

We now instantiate Theorem 1 for the bounded covariance class $\mathcal{G}_{\mathrm{cov}}$.

Corollary 1 (Risk bounds for $\mathcal{G}_{\mathrm{cov}}$). Under the setting of Theorem 1 with $\mathcal{G} \subseteq \mathcal{G}_{\mathrm{cov}}$, we have
\[
\mathbb{E}_\mu[\hat\ell] - \mathbb{E}_\mu[\ell^\star] \lesssim
\begin{cases}
\|\ell^\star\|_{\mathrm{Lip}} \big(\rho + \sqrt{d\varepsilon}\big), & p = 1,\ \ell^\star \text{ Lipschitz},\\[2pt]
\|\ell^\star\|_{\dot H^{1,2}(\mu)} \big(\rho + \sqrt{d}\big) + \alpha\big(\rho^2 + d\big), & p = 2,\ \ell^\star\ \alpha\text{-smooth}.
\end{cases}
\]

Since $\mathcal{G}_{\mathrm{cov}}$ encodes second moment constraints, $\tau_2(\mathcal{G}_{\mathrm{cov}}, \varepsilon) \asymp \sqrt{d}$ is independent of $\varepsilon$. Therefore, the first bound is preferable as $\varepsilon \to 0$ if $\|\ell^\star\|_{\dot H^{1,2}(\mu)} \asymp \|\ell^\star\|_{\mathrm{Lip}}$, while the second is better when $\varepsilon = \Omega(1)$ and $\|\ell^\star\|_{\dot H^{1,2}(\mu)} \ll \|\ell^\star\|_{\mathrm{Lip}}$.
6 Distinct trade-offs are observed under stronger tail bounds like sub-Gaussianity, i.e., for Gsub G := {µ P(Z) : Eµ[e(θ (Z E[Z])2] 2, θ Sd 1}. Corollary 2 (Risk bounds for Gsub G). Under the setting of Theorem 1 with G Gsub G, the excess risk Eµ[ˆℓ] Eµ[ℓ ] is bounded up to constants by ℓ Lip ρ + q ε ε , p = 1, ℓ Lipschitz ℓ H1,2(µ) ρ+ q ε)ε +α ρ2+ d+log 1 ε ε , p = 2, ℓ α-smooth . Remark 1 (Comparison to MDE under Wε p). We note that the excess risk Rµ,p cρ + 2τp(G, 2ε); ℓ from (3) can alternatively be obtained by performing standard p-WDRO with an expanded radius cρ+ 2τ1(G, 2ε) around the minimum distance estimate ˆµ = argminν G Wε 1( µ, ν). However, obtaining ˆµ is an expensive preprocessing step, and we are unaware of any efficient algorithms for such MDE in the finite-sample setting. In Supplement D, we explore recentering WDRO around a tractable estimate obtained from iterative filtering [15], but find the resulting risk to be highly suboptimal. Furthermore, the improvements to our risk bounds under low-dimensional structure, which are derived in Section 4, do not extend to decisions obtained from these alternative procedures. We now show that Theorem 1 cannot be improved in general. In particular, the first bound is minimax optimal over Lipschitz loss families when µ Gcov. Proposition 3 (Lower bound). Fix Z = Rd and ε [0, 0.49]. For any L 0, there exists a family L Lip L(Rd), independent of ε, such that for any decision rule D : P(Z) L there exists a pair (µ, µ) Gcov P(Z) with Wε 1( µ, µ) ρ satisfying Eµ[D( µ)] infℓ L Eµ[ℓ] L ρ + Each family L encodes a multivariate regression problem. Our proof combines a one-dimensional lower bound of [49] for linear regression with lower bounds of [39] for robust estimation under W1. 5The same bounds hold up to ε additive slack if ℓ is only ε-approximately optimal for (OR-WDRO). 6Under Wε p perturbations, one may perform outlier-robust WDRO using any p [1, p], which may be advantageous in terms of the TV component of the excess risk. 3.2 Statistical Guarantees We next formalize a finite-sample model and adapt our excess risk bounds to it. Setting B: Fix ρ, ε, L, G as in Setting A, and let Z1, . . . , Zn be identically and independently distributed (i.i.d.) according to µ G, with empirical measure ˆµn = 1 n Pn i=1 δZi. Upon observing these clean samples, Nature applies a Wp perturbation of size ρ0, producing {Z i}n i=1 with empirical measure µ n such that Wp(ˆµn, µ n) ρ0. Finally, Nature corrupts up to εn samples to obtain { Zi}n i=1 with empirical measure µn such that µn µ TV = 1 n Pn i=1 1{ Zi = Zi} ε. Equivalently, the final dataset satisfies Wε p( µn, ˆµn) ρ0.7 The learner is now tasked with selecting ˆℓ L given µn. The results from Section 3 apply whenever ρ ρ0 + Wp(µ, ˆµn). In particular, we obtain the following corollary as an immediate consequence of Theorem 1 and Theorem 3.1 of [32]. Corollary 3 (Finite-sample risk bounds). Under Setting B, fix ˆℓ L minimizing (OR-WDRO) centered at µ = µn with ρ ρ0 + 100 E[Wp(µ, ˆµn)]. Then the excess risk bounds of Theorem 1 hold with probability at least 0.99. If G {Gcov, Gsub G}, p = 1, and d 3, or if G = Gsub G, p = 2, and d 5, then E[Wp(µ, ˆµn)] Remark 2 (Smaller radius). In the classic WDRO setting with ρ0 = ε = 0, the radius ρ can be taken significantly smaller than n 1/d if L and µ are sufficiently well-behaved. 
For example, [18] proves that ρ = e O(n 1/2) gives meaningful risk bounds when µ satisfies a T2 transportation inequality.8 While this high-level condition may be hard to verify in practice, Supplement E shows that this improvement can be lifted to an instance of our outlier-robust WDRO problem. 3.3 Tractable Reformulations and Computation For computation, we restrict to µ Gcov. Initially, (OR-WDRO) may appear intractable, since Gcov is non-convex when viewed as a subset of the cone M+(Z). Moreover, enforcing membership to this class is non-trivial. To remedy these issues, we use a cheap preprocessing step to obtain a robust estimate z0 Z of the mean Eµ[Z], and we optimize over the modified class G2(σ, z0) := ν P(Z) : Eν[ Z z0 2] σ2 , with σ z0 Eµ[Z] + d taken so that µ G2(σ, z0). Finally, for technical reasons, we switch to the one-sided robust distance Wε p(µ ν) := infµ P(Rd):µ 1 1 ε µ Wp(µ , ν). Altogether, we arrive at the modified DRO problem inf ℓ L sup ν G2(σ,z0):Wε p( µn ν) ρ Eν[ℓ], (4) which, as stated next, admits risk bounds matching Corollary 1 up to empirical approximation error. Proposition 4 (Risk bound for modified problem). Consider Setting B with G Gcov. Fix z0 Z such that z0 Eµ[Z] E = O(ρ0+ d), and suppose that Wp(ˆµn, µ) δ. Take ˆℓminimizing (4) with ρ = (ρ0 + δ)(1 ε) 1/p + τp(Gcov, ε) and σ = d + E. We then have Eµ[ˆℓ] Eµ[ℓ ] ( ℓ Lip ρ0 + dε + δ , p = 1, ℓ Lipschitz ℓ H1,2(µ) ρ0 + d + δ + α ρ0 + d + δ 2, p = 2, ℓ α-smooth. Parameters ρ, σ are taken so that µ G2(σ, z0) and Wε p( µn µ) ρ. Noting this, the proof mirrors that of Theorem 1, using a Wp resilience bound for G2(σ, z0). To ensure Wp(ˆµn, µ) δ with decent probability, one should take δ to be an upper bound on supν G E[Wp(ˆνn, ν)]. When p = 2, this quantity is only finite if Z is bounded or if G encodes stronger tail bounds than Gcov (see, e.g., [32]). For efficient computation, we must specify a robust mean estimation algorithm to obtain z0 and a procedure for solving (4). The former is achieved by taking a coordinate-wise trimmed mean. Proposition 5 (Coarse robust mean estimation). Consider Setting B with G Gcov and ε 1/3. For n = Ω(log(d)), there is a trimmed mean procedure, which applied coordinate-wise to { Zi}n i=1, returns z0 Rd with z0 Eµ[Z] d + ρ0 with probability at least 0.99, in time O(d). 7In general, { Zi}n i=1 may be any measurable function of {Zi}n i=1 and independent randomness such that Wε p( µn, ˆµn) ρ0. 8We say that µ T2(τ) if W2(ν, µ) p τH(ν µ), for all ν P2(Z), where H(ν µ) is relative entropy. More sophisticated methods, e.g., iterative filtering [15], achieve dimension-free estimation guarantees at the cost of additional sample and computational complexity. We will return to these techniques in Section 4, but overlook them for now since they do not impact worst-case excess risk bounds. We next show that that the inner maximization problem of (4) can be simplified to a minimization problem involving only two scalars provided the following assumption holds. Assumption 1 (Slater condition I). Given the distribution µn and the fixed point z0, there exists ν0 P(Z) such that Wε p( µn ν0) < ρ and Eν0[ Z z0 2] < σ2. Additionally, we require ρ > 0. Notice that Assumption 1 indeed holds for ν0 = µ as applied in Proposition 4. Proposition 6 (Strong duality). 
Under Assumption 1, for any ℓ L and z0 Rd, we have sup ν G2(σ,z0): Wε p( µn ν) ρ Eν[ℓ] = inf λ1,λ2 R+ α R λ1σ2 + λ2ρp + α + 1 1 ε E µn ℓ( ; λ1, λ2, α) , (5) where ℓ(z; λ1, λ2, α) := supξ Z ℓ(ξ) λ1 ξ z0 2 λ2 ξ z p α The minimization problem over (λ1, λ2, α) is an instance of stochastic convex optimization, where the expectation of the implicit function ℓis taken w.r.t. the contaminated empirical measure µn. In contrast, the dual reformulation for classical WDRO only involves λ2 and takes the expectation of the implicit function ℓ(z; λ2) := supξ Z ℓ(ξ) λ2 ξ z p w.r.t. µn. The additional λ1 variable above is introduced to account for the clean family G2(σ, z0), and the use of partial transportation under Wε p results in the introduction of the operator [ ]+ and the decision variable α. Remark 3 (Connection to conditional value at risk (CVa R)). The CVa R of a Borel measurable loss function ℓacting on a random vector Z µ P(Z) with risk level ε (0, 1) is defined as CVa R1 ε,µ[ℓ(Z)] = inf α R α + 1 1 ε EZ µ [ℓ(Z) α]+ . CVa R is also known as expected shortfall and is equivalent to the conditional expectation of ℓ(Z), given that it is above an ε threshold. This concept is often used in finance to evaluate the market risk of a portfolio. With this definition, the result of Proposition 6 can be written as sup ν G2(σ,z0): Wε p( µn ν) ρ Eν[ℓ] = inf λ1,λ2 R+ λ1σ2+λ2ρp+CVa R1 ε, µn sup ξ Z ℓ(ξ) λ1 ξ z0 2 λ2 ξ Z p # When ε 0 and σ , whence CVa R reduces to expected value and the constrained class G2(σ, z0) expands to P(Z), the dual formulation above reduces to that of classical WDRO [8, 21]. Evaluating ℓrequires solving a maximization problem, which could be in itself challenging. To overcome this, we impose additional convexity assumptions, which are standard for WDRO [35, 43]. Assumption 2 (Convexity condition). The loss ℓis a pointwise maximum of finitely many concave functions, i.e., ℓ(ξ) = maxj [J] ℓj(ξ), for some J N, where ℓj is real-valued, l.s.c., and concave9. The set Z is closed and convex. The atoms of µn are in the relative interior of Z. Theorem 2 (Convex reformulation). Under Assumption 1, for any ℓ L satisfying Assumption 2 and z0 Rd, we have sup ν Gq(σ,z0): Wε p( µn ν) ρ inf λ1σ2 + λ2ρp + α + 1 n(1 ε) P s.t. α R, λ1, λ2 R+, s, τij Rn +, ζℓ ij, ζG ij, ζW ij , ζZ ij Rd, i [n], j [J] si ( ℓj) (ζℓ ij) + z 0 ζG ij + τij + Z i ζW ij + Ph(ζW ij , λ2) + χ Z(ζZ ij) α, i [n], j [J] ζℓ ij + ζG ij + ζW ij + ζZ ij = 0, ζG ij 2 λ1τij, i [n], j [J], where Ph is the perspective function (i.e., Ph(ζ, λ) = λh(ζ/λ)) of 9Generally, any continuous function can be approximated arbitrarily well by a maximum of finitely many concave functions. However, the number of functions needed may be arbitrarily large in general. Fortunately, some losses like the ℓ -norm z =maxi [d],a { 1} σzi require only poly(d) pieces. χ{z Rd: z 1}(ζ), p = 1 p p 1 , p > 1. (6) The minimization problem in Theorem 2 is a finite-dimensional convex program. In Section 5, we use this result in conjunction with Proposition 5 to efficiently perform outlier-robust WDRO. We conclude this section by characterizing the worst-case distribution, i.e., the optimal adversarial strategy, for our outlier-robust WDRO problem. To that end, we need the primal formulation below. Theorem 3 (Worst-case distribution). Under Assumption 1, for any ℓ L satisfying Assumption 2 and z0 Rd, we have sup ν Gq(σ,z0): Wε p( µn ν) ρ (i,j) [n] [J] P ℓj(ξij, qij) s.t. 
qij R+, ξij qij Z i [n], j [J] P j [J] qij 1 n(1 ε) i [n] P (i,j) [n] [J] qij = 1 P (i,j) [n] [J] P p(ξij qij Zi, qij) ρ P (i,j) [n] [J] P 2(ξij qijz0, qij) σ2 The discrete distribution ν = P (i,j) Q q ijδξ ij/q ij achieves the worst-case expectation on the lefthand side, where (q ij, ξ ij)(i,j) [n] [J] are optimizers of the maximization problem on the right and Q := {(i, j) [n] [J] : q ij > 0}. The maximization problem from Theorem 3 is the conjugate dual of the minimization in Theorem 2. Subsequently, we propose a systematic approach for constructing a discrete distribution based on a solution derived from the maximization problem that achieves the worst-case expected loss. Remark 4 (Comparison to WDRO worst-case distribution). Recall that our robust WDRO approach reduces to the classic WDRO approach as ε = 0 and σ . Consequently, this implies that the constraints P j [J] qij 1/(n(1 ε)) and P (i,j) [n] [J] P 2(ξij qijz0, qij) σ2 can be dropped under this specific choice of ε and σ. As a result, our construction simplifies to the approach presented in [35, Theorem 4.4] for WDRO problems. Remark 5 (Parameter tuning). In practice, ε, ρ0, and the relevant tail bound may be unknown. Thus, in Appendix F, we consider learning under Setting B with G = Gcov(σ) for potentially unknown ε, σ, and ρ0. First, we observe that knowledge of upper bounds on these parameters is sufficient to attain risk bounds scaling in terms of said upper bounds. This approach avoids meticulous parameter tuning but may result in suboptimal risk. To efficiently match our risk bounds with known parameters, we show that it is necessary and sufficient to know ρ0 and at least one of ε or σ (up to constant factors). 4 Low-Dimensional Features While Proposition 3 shows that the excess risk bounds from Theorem 1 cannot be improved in general, finer guarantees can be derived when the optimal loss function depends only on k-dimensional affine features of the data. Defining G(k) as the union of the projections {U#µ : µ G} over U Rk d with UU = Ik10, we improve the excess risk bound of Theorem 1 for this setting. Theorem 4 (Excess risk bound). Under Setting A, let ˆℓminimize (OR-WDRO), and assume that ℓ = ℓ A for an affine map A : Rd Rk and some ℓ: Rk R. Writing c=2(1 ε) 1/p, we have Eµ[ˆℓ] Eµ[ℓ ] ( ℓ Lip cρ + 2τ1(G(k), 2ε) , p=1, ℓ Lipschitz ℓ H1,2(µ) cρ+2τ2(G(k), 2ε) + 1 2α cρ+2τ2(G(k), 2ε) 2, p=2, ℓ α-smooth. This dependence on G(k) rather than G = G(d) is a substantial improvement when k d. 10If G is closed under isometries, like Gcov, then G(k) = {U#µ : µ G} for any such U. Corollary 4 (Risk bounds for Gcov). Under the setting of Theorem 4 with G Gcov, we have Eµ[ˆℓ] Eµ[ℓ ] ( ℓ Lip ρ + kε , p = 1, ℓ Lipschitz ℓ H1,2(µ)(ρ + k ) + α(ρ2 + k), p = 2, ℓ α-smooth. We again have a matching lower bound for the Lipschitz setting, this time using k-variate regression. Proposition 7 (Lower bound). Fix Z = Rd and ε [0, 0.49]. For any L 0, there exists a family L Lip L(Rd), independent of ε, such that each ℓ L decomposes as ℓ= ℓ A for A Rk d and ℓ: Rk R, and such that for any decision rule D : P(Z) L there exists a pair (µ, µ) Gcov P(Z) with Wε 1( µ, µ) ρ satisfying Eµ[D( µ)] infℓ L Eµ[ℓ] L ρ + For computation, we turn to a slightly modified n-sample contamination model. Our analysis for the low-dimensional case only supports additive TV corruptions (sometimes called Huber contamination). Setting B : Fix ρ, ε, L, G as in Setting A, and fix m = (1 ε)n for some n N. Let Z1, . . . , Zm be drawn i.i.d. 
from µ G, with empirical measure ˆµm = 1 m Pm i=1 δZi. Upon observing these clean samples, Nature applies a Wp perturbation of size ρ0, producing {Z i}m i=1 with empirical measure µ m such that Wp(ˆµm, µ m) ρ0. Finally, Nature adds εn samples to obtain { Zi}n i=1 with empirical measure µn such that µ m 1 1 ε µn. Equivalently, the final dataset satisfies Wε p( µn ˆµm) ρ0. As before, we modify (OR-WDRO) using a centered alternative to Gcov. Defining Gcov(σ, z0) := µ P(Z) : Eµ[(Z z0)(Z z0) ] σ2Id , we consider the outlier-robust WDRO problem inf ℓ L sup ν Gcov(σ,z0):Wε p( µn ν) ρ Eν[ℓ]. (7) To start, we provide a corresponding risk bound which matches Corollary 4 when k = O(1). Proposition 8 (Risk bound for modified problem). Consider Setting B with G Gcov, and assume ℓ = ℓ A for affine A : Rd Rk and ℓ: Rk R. Fix z0 Z such that z0 Eµ[Z] E = O(ρ0 + 1), and assume Wp(ˆµm, µ) δ. If ˆℓminimizes (7) with ρ = ρ0 + δ and σ = 1 + E, then Eµ[ˆℓ] Eµ[ℓ ] kε + δ , p = 1, ℓ Lipschitz ℓ H1,2(µ) k + δ 2, p = 2, ℓ α-smooth. Here, the stronger requirement for the robust mean estimate, the restriction to additive contamination, and the need to optimize over the centered Gcov class rather than G2 all stem from the fact that the resilience term τp((Gcov)k, ε) scales with k rather than d. Fortunately, efficient computation is still possible. First, we employ iterative filtering [15] for dimension-free robust mean estimation. Proposition 9 (Refined robust mean estimation). Consider Setting B or B with G = Gcov and ε 1/12. For n = Ω(d), there exists an iterative filtering algorithm which takes µn as input, runs in time O(nd2), and outputs z0 Rd such that z0 Eµ[Z] ρ0 + 1 with probability at least 0.99. The analysis requires care when p = 1, since W1 perturbations can arbitrarily increase the initial covariance bound. Fortunately, this increase can be controlled by trimming out a few samples. Next, we show that computing the inner worst-case expectation in (7) can be simplified into a minimization problem involving only a scalar and a positive semidefinite matrix provided the following assumption holds (which is indeed the case in the setting of Proposition 8). Assumption 3 (Slater condition II). Given the distribution µn and fixed point z0, there exists ν0 P(Z) such that Wε p( µn ν0) < ρ and Eν0[(Z z0)(Z z0) ] σ2Id. Further, we require ρ>0. Proposition 10 (Strong duality). Under Assumption 3, for any ℓ L and z0 Rd, we have sup ν Gcov(σ,z0): Wε p( µn ν) ρ Eν[ℓ] = inf Λ1 Qd + λ2 R+,α R z 0 Λ1z0+σ2 Tr[Λ1]+λ2ρp+α+ 1 1 ε E µn ℓ( ; Λ1, λ2, α) , where ℓ(z; Λ1, λ2, α) := supξ Z [ℓ(ξ) ξ Λ1ξ + 2ξ Λ1z0 λ2 ξ z p α]+. The minimization problem over the variables (Λ1, λ2, α) belongs to the class of stochastic convex optimization problems. As before, we show that under the convexity condition from Assumption 2 we obtain a tractable reformulation that does not involve an extra optimization problem for evaluating ℓ. Theorem 5 (Convex reformulation). Under Assumption 3, for any ℓ L satisfying Assumption 2 and z0 Rd, we have sup ν Gcov(σ,z0): Wε p( µn ν) ρ inf z 0 Λ1z0 + σ2 Tr[Λ1] + λ2ρp + α + 1 n(1 ε) P i [n] si s.t. α R, Λ1 Qd +, λ2 R+, s Rn +, τij R+, ζℓ ij, ζG ij, ζW ij , ζZ ij Rd, i [n], j [J] si ( ℓj) (ζℓ ij) + τij + Z i ζW ij +Ph(ζW ij , λ2) + χ Z(ζZ ij) α, i [n], j [J] ζℓ ij+ζG ij+ζW ij +ζZ ij =2Λ1z0, (ζG ij) Λ 1 1 ζG ij 4τij, i [n], j [J], where Ph is the perspective function of h defined in (6). 
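The duals in Propositions 6 and 10 are stochastic convex programs in a handful of variables. As a concrete illustration (our own instantiation, not code from the paper), the sketch below evaluates the Proposition 6 dual objective for a linear loss $\ell(\xi) = \theta^\top \xi$ on $\mathcal{Z} = \mathbb{R}^d$ with $p = 2$, where the inner supremum defining $\bar\ell$ is an unconstrained concave quadratic with a closed-form maximizer, and then minimizes over $(\lambda_1, \lambda_2, \alpha)$ with an off-the-shelf solver. The paper's experiments instead use the exact convex reformulations of Theorems 2 and 5.

```python
import numpy as np
from scipy.optimize import minimize

def dual_objective(params, Z_tilde, theta, z0, sigma, rho, eps):
    """Proposition 6 dual objective, specialized to the linear loss
    l(xi) = theta^T xi on Z = R^d with p = 2:
        lambda1*sigma^2 + lambda2*rho^2 + alpha
        + (1-eps)^{-1} * mean_i [ sup_xi { l(xi) - lambda1*||xi - z0||^2
                                           - lambda2*||xi - Z_i||^2 } - alpha ]_+ .
    The inner supremum is a concave quadratic, maximized at
    xi* = (theta + 2*lambda1*z0 + 2*lambda2*Z_i) / (2*(lambda1 + lambda2))."""
    lam1, lam2, alpha = params
    xi = (theta + 2 * lam1 * z0 + 2 * lam2 * Z_tilde) / (2 * (lam1 + lam2))
    g = (xi @ theta
         - lam1 * np.sum((xi - z0) ** 2, axis=1)
         - lam2 * np.sum((xi - Z_tilde) ** 2, axis=1))
    return (lam1 * sigma**2 + lam2 * rho**2 + alpha
            + np.maximum(g - alpha, 0.0).mean() / (1.0 - eps))

rng = np.random.default_rng(1)
n, d = 200, 5
theta = rng.normal(size=d)
Z_tilde = rng.normal(size=(n, d))        # stand-in for the contaminated sample
res = minimize(
    dual_objective,
    x0=np.array([1.0, 1.0, 0.0]),
    args=(Z_tilde, theta, np.zeros(d), np.sqrt(d) + 1.0, 0.2, 0.1),
    bounds=[(1e-6, None), (1e-6, None), (None, None)],  # lambda1, lambda2 >= 0
    method="L-BFGS-B",
)
print(res.fun)   # approximate worst-case expected loss over the robust ambiguity set
```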
5 Experiments

Figure 2: Excess risk of standard WDRO and several forms of outlier-robust WDRO for linear regression under $\mathsf{W}_p$ and TV corruptions, with varied sample size and dimension. (Top panel: excess risk (mean absolute deviation) versus number of samples, comparing standard WDRO with outlier-robust variants; bottom panel: excess risk versus dimension, comparing OR-WDRO with $\mathcal{G} = \mathcal{G}_2$ and OR-WDRO with $\mathcal{G} = \mathcal{G}_{\mathrm{cov}}$.)

Lastly, we implement our tractable reformulations and validate their excess risk bounds. Fixing $\mathcal{Z} = \mathcal{X} \times \mathcal{Y} = \mathbb{R}^{d-1} \times \mathbb{R}$, we focus on linear regression with the mean absolute deviation loss, i.e., $\mathcal{L} = \{\ell_\theta(x, y) = |\theta^\top x - y| : \theta \in \mathbb{R}^{d-1}\}$. The experiments below were run in 80 minutes on an M1 MacBook Air with 16GB RAM. See Supplement G for full details and additional experiments treating classification and multivariate regression. Code is available at https://github.com/sbnietert/outlier-robust-WDRO.

Fix $\mathcal{Z} = \mathbb{R}^d$ for $d \ge 2$, $\rho = \varepsilon = 0.1$, and $\theta^\star \in \mathbb{S}^{d-2}$. Taking $X \sim \mathcal{N}(0, I_{d-1})$, we fix clean data $(X, \theta^{\star\top} X) \sim \mu$ and corrupted data $(\tilde X, \tilde Y) \sim \tilde\mu$ such that $(\tilde X, \tilde Y) = (X, \theta^{\star\top} X + \rho)$ with probability $1 - \varepsilon$ and $(\tilde X, \tilde Y) = (CX, C^2\theta^{\star\top} X)$ with probability $\varepsilon$, so that $\mathsf{W}_p^\varepsilon(\tilde\mu \,\|\, \mu) \le \rho$. In Figure 2 (top), we fix $d = 10$, $C = 8$ and compare the excess risk $\mathbb{E}_\mu[\ell_{\hat\theta}] - \mathbb{E}_\mu[\ell_{\theta^\star}]$ of standard WDRO and outlier-robust WDRO with $\mathcal{G} = \mathcal{G}_2$, as described by Proposition 4 and implemented via Theorem 2. The results are averaged over $T = 20$ runs for sample size $n \in \{10, 20, 50, 75, 100\}$. We run outlier-robust WDRO with corruption fraction $\hat\varepsilon \in \{0, \varepsilon, 2\varepsilon\}$, achieving low excess risk when $\hat\varepsilon \ge \varepsilon$, as predicted. In Figure 2 (bottom), to highlight the Section 4 improvements due to low-dimensional structure, we fix $n = 20$, $C = 100$ and compare the excess risk of outlier-robust WDRO with $\mathcal{G} = \mathcal{G}_2$ to that with $\mathcal{G} = \mathcal{G}_{\mathrm{cov}}$, as described by Proposition 8 and implemented via Theorem 5. We average over $T = 10$ runs and present results for dimension $d \in \{5, 10, 25, 40\}$. Confidence bands in both plots depict the top and bottom 10% quantiles among 100 bootstrapped means from the $T$ runs. Implementations were performed in MATLAB using the YALMIP toolbox [33] and the Gurobi and SeDuMi solvers [23, 50].

6 Concluding Remarks

In this work, we have introduced a novel framework for outlier-robust WDRO that allows for both geometric and non-geometric perturbations of the data distribution, as captured by $\mathsf{W}_p$ and TV, respectively. We provided minimax optimal excess risk bounds and strong duality results, with the latter enabling efficient computation via convex reformulations. There are numerous directions for future work, including refined statistical guarantees for $\rho \ll \rho_0 + n^{-1/d}$ and convex reformulations for distribution families beyond $\mathcal{G}_{\mathrm{cov}}$. Overall, our approach enables principled, data-driven decision-making in realistic scenarios where observations may be subject to adversarial contamination.

[1] D. Angluin and P. Laird. Learning from noisy examples. Machine Learning, 2:343–370, 1988.
[2] D. Bartl, S. Drapeau, J. Obloj, and J. Wiesel. Robust uncertainty sensitivity analysis. arXiv preprint arXiv:2006.12022, 2020.
[3] A. Ben-Tal and M. Teboulle. An old-new concept of convex risk measures: The optimized certainty equivalent. Mathematical Finance, 17(3):449–476, 2007.
[4] A. Bennouna, R. Lucas, and B. Van Parys. Certified robust neural networks: Generalization and corruption resistance. In International Conference on Machine Learning, 2023.
[5] A. Bennouna and B. Van Parys. Holistic robust data-driven decisions.
ar Xiv preprint ar Xiv:2207.09560, 2022. [6] J. Blanchet, P. W. Glynn, J. Yan, and Z. Zhou. Multivariate distributionally robust convex regression under absolute error loss. In Advances in Neural Information Processing Systems, 2019. [7] J. Blanchet, Y. Kang, and K. Murthy. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830 857, 2019. [8] J. Blanchet and K. Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565 600, 2019. [9] J. Blanchet, K. Murthy, and N. Si. Confidence regions in Wasserstein distributionally robust estimation. Biometrika, 109(2):295 315, 2022. [10] N. H. Bshouty, N. Eiron, and E. Kushilevitz. PAC learning with nasty noise. Theoretical Computer Science, 288(2):255 275, 2002. [11] R. Chen and I. C. Paschalidis. A robust learning approach for regression models based on distributionally robust optimization. Journal of Machine Learning Research, 19(1):517 564, 2018. [12] Z. Chen, D. Kuhn, and W. Wiesemann. Data-driven chance constrained programs over Wasserstein balls. Operations Research (Forthcoming), 2022. [13] Y. Cheng, I. Diakonikolas, and R. Ge. High-dimensional robust mean estimation in nearly-linear time. In ACM-SIAM Symposium on Discrete Algorithms, 2019. [14] L. Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141 142, 2012. [15] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being robust (in high dimensions) can be practical. In International Conference on Machine Learning, 2017. [16] I. Diakonikolas and D. Kane. Algorithm High-Dimensional Robust Statistics. Cambridge University Press, 2022. [17] A. Esteban-Pérez and J. M. Morales. Distributionally robust stochastic programs with side information based on trimmings. Mathematical Programming, 195(1-2):1069 1105, 2022. [18] R. Gao. Finite-sample guarantees for Wasserstein distributionally robust optimization: Breaking the curse of dimensionality. Operations Research (Forthcoming), 2022. [19] R. Gao, X. Chen, and A. J. Kleywegt. Wasserstein distributionally robust optimization and variation regularization. Operations Research (Forthcoming), 2022. [20] R. Gao and A. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. Mathematics of Operations Research, 48(2):603 655, 2023. [21] R. Gao and A. Kleywegt. Distributionally robust stochastic optimization with wasserstein distance. Mathematics of Operations Research, 48(2):603 655, 2023. [22] R. Gao, L. Xie, Y. Xie, and H. Xu. Robust hypothesis testing using Wasserstein uncertainty sets. In Advances in Neural Information Processing Systems, 2018. [23] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023. URL https://www. gurobi.com. [24] T. Hashimoto, M. Srivastava, H. Namkoong, and P. Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, 2018. [25] J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Springer, 2004. [26] S. Hopkins, J. Li, and F. Zhang. Robust and heavy-tailed mean estimation made simple, via regret minimization. In Advances in Neural Information Processing Systems, 2020. [27] W. Hu, G. Niu, I. Sato, and M. Sugiyama. Does distributionally robust supervised learning give robust classifiers? In International Conference on Machine Learning, 2018. [28] P. J. Huber. Robust Estimation of a Location Parameter. 
The Annals of Mathematical Statistics, 35(1):73 101, 1964. [29] N. Jiang and W. Xie. Distributionally favorable optimization: A framework for data-driven decision-making with endogenous outliers. Optimization Online, 2023. [30] Y. Kwon, W. Kim, J.-H. Won, and M. C. Paik. Principled learning method for Wasserstein distributionally robust optimization with local perturbations. In International Conference on Machine Learning, 2020. [31] J. Lee and M. Raginsky. Minimax statistical learning with Wasserstein distances. In Advances in Neural Information Processing Systems, 2018. [32] J. Lei. Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. Bernoulli, 26(1):767 798, 2020. [33] J. Lofberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In IEEE international conference on robotics and automation, pages 284 289. IEEE, 2004. [34] G. Lugosi and S. Mendelson. Robust multivariate mean estimation: the optimality of trimmed mean. The Annals of Statistics, 49(1):393 410, 2021. [35] P. Mohajerin Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115 166, 2018. [36] V. A. Nguyen, D. Kuhn, and P. Mohajerin Esfahani. Distributionally robust inverse covariance estimation: The Wasserstein shrinkage estimator. Operations Research, 70(1):490 515, 2022. [37] V. A. Nguyen, S. Shafieezadeh-Abadeh, D. Kuhn, and P. Mohajerin Esfahani. Bridging Bayesian and minimax mean square error estimation via Wasserstein distributionally robust optimization. Mathematics of Operations Research, 48(1):1 37, 2023. [38] S. Nietert, R. Cummings, and Z. Goldfeld. Outlier-robust optimal transport: duality, structure, and statistical analysis. In International Conference on Artificial Intelligence and Statistics, 2022. [39] S. Nietert, R. Cummings, and Z. Goldfeld. Robust estimation under the Wasserstein distance. ar Xiv preprint ar Xiv:2302.01237, 2023. [40] G. Pflug and D. Wozabal. Ambiguity in portfolio selection. Quantitative Finance, 7(4):435 442, 2007. [41] E. M. Ronchetti and P. J. Huber. Robust Statistics. John Wiley & Sons Hoboken, 2009. [42] F. Santambrogio. Optimal Transport for Applied Mathematicians. Springer, 2015. [43] S. Shafieezadeh-Abadeh, L. Aolaritei, F. Dörfler, and D. Kuhn. New perspectives on regularization and computation in optimal transport-based distributionally robust optimization. ar Xiv preprint ar Xiv:2303.03900, 2023. [44] S. Shafieezadeh-Abadeh, D. Kuhn, and P. Mohajerin Esfahani. Regularization via mass transportation. Journal of Machine Learning Research, 20(103):1 68, 2019. [45] S. Shafieezadeh-Abadeh, P. Mohajerin Esfahani, and D. Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, 2015. [46] S. Shafieezadeh-Abadeh, V. A. Nguyen, D. Kuhn, and P. Mohajerin Esfahani. Wasserstein distributionally robust Kalman filtering. In Advances in Neural Information Processing Systems, 2018. [47] A. Shapiro. On duality theory of conic linear problems. In M. Á. Goberna and M. A. López, editors, Semi-Infinite Programming, pages 135 165. Kluwer Academic Publishers, 2001. [48] A. Sinha, H. Namkoong, and J. Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018. [49] J. Steinhardt, M. Charikar, and G. Valiant. 
Resilience: A criterion for learning in the presence of arbitrary outliers. In Innovations in Theoretical Computer Science Conference, 2018. [50] J. F. Sturm. Using Se Du Mi 1.02, A MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11(1-4):625 653, 1999. [51] Z. Tu, J. Zhang, and D. Tao. Theoretical analysis of adversarial learning: A minimax approach. In Advances in Neural Information Processing Systems, 2019. [52] C. Villani. Topics in Optimal Transportation. American Mathematical Society, 2003. [53] R. Volpi, H. Namkoong, O. Sener, J. Duchi, V. Murino, and S. Savarese. Generalizing to unseen domains via adversarial data augmentation. In Advances in Neural Information Processing Systems, pages 5339 5349, 2018. [54] Y. Wang, X. Ma, J. Bailey, J. Yi, B. Zhou, and Q. Gu. On the convergence and robustness of adversarial training. In International Conference on Machine Learning, 2019. [55] W. Xie. On distributionally robust chance constrained programs with Wasserstein distance. Mathematical Programming, 186(1):115 155, 2021. [56] R. Zhai, C. Dan, Z. Kolter, and P. Ravikumar. DORO: Distributional and outlier robust optimization. In International Conference on Machine Learning, 2021. [57] J. Zhen, D. Kuhn, and W. Wiesemann. A unified theory of robust and distributionally robust optimization via the primal-worst-equals-dual-best principle. Operations Research (Forthcoming), 2023. [58] B. Zhu, J. Jiao, and J. Steinhardt. Generalized resilience and robust statistics. The Annals of Statistics, 50(4):2256 2283, 2022. [59] S. Zhu, L. Xie, M. Zhang, R. Gao, and Y. Xie. Distributionally robust weighted k-nearest neighbors. In Advances in Neural Information Processing Systems, 2022. A Preliminary Results on Robust OT and Wasserstein DRO We first recall properties of the robust Wasserstein distance Wε p which will be used throughout the supplement. To start, we show that our definition coincides with another based on partial OT, considered in [39]. In what follows, we fix p 1, write c P(Z) := {cµ : µ P(Z)}, and, for µ, ν c P(Z), we define Π(µ, ν) := cΠ(µ/c, ν/c) and Wp(µ, ν)p := c Wp(µ/c, ν/c). Lemma 1 (Wε p as partial OT). Fix ε [0, 1]. For any µ, ν P(Z), we have Wε p(µ, ν) := inf µ P(Z) µ µ TV ε Wp(µ , ν) = inf ν P(Z) ν ν TV ε Wp(µ, ν ) = inf µ ,ν (1 ε)P(Z) µ µ, ν ν Wp(µ , ν ). Proof. We write Wε p(µ, ν) for the rightmost expression; this is definition of Wε p considered in [39]. We first show that Wε p(µ, ν) Wε p(µ, ν). Fix any µ feasible for the Wε p problem. Then, by the approximate triangle inequality for Wε p (Proposition 3 of [39]), we have Wε p(µ, ν) Wε p(µ, µ ) + Wp(µ , ν) Wp(µ , ν). Indeed, writing c := (µ µ )(Z) 1 ε, the last inequality uses that Wε p(µ, µ ) Wp( 1 ε c µ µ , 1 ε c µ µ ) = 0. Infimizing over feasible µ gives that Wε p(µ, ν) Wε p(µ, ν). For the other direction, take any µ , ν feasible for the Wε p problem. Let π Π(µ , ν ) be any optimal coupling for the Wp(µ , ν ) problem, and write µ = µ + (ν ν ) P(Z). Defining the coupling π = π + (Id, Id)#(ν ν ) Π(µ , ν), we compute Wp(µ , ν)p Z Z Z x y p dπ (x, y) = Z Z Z x y p dπ(x, y) = Wp(µ , ν ). By construction, µ µ TV ε, and so Wε p(µ, ν) Wp(µ , ν ). Infimizing over feasible µ , ν gives that Wε p(µ, ν) Wε p(µ, ν). We thus inherit several results for Wp given in [39]. Lemma 2 (Approximate triangle inequality [39]). If µ, ν, κ P(Z) and ε1, ε2 [0, 1], then Wε1+ε2 p (µ, ν) Wε1 p (µ, κ) + Wε2 p (κ, ν). Lemma 3 (Wε p modulus of continuity, [39], Lemma 3). 
For any G P(Z), we have sup α,β G Wε p(α,β) ρ Wp(α, β) (1 ε) 1/pρ + 2τp(G, ε). Lemma 4 (One-sided vs. two-sided Wε p). For µ, ν P(Z), we have Wε p(µ ν) (1 ε) 1/p Wε p(µ, ν) + τp(ν, ε). Proof. Fix any µ , ν (1 ε)P(Z) with µ µ and ν ν. By design, we have Wε p(µ ν) Wp 1 1 εµ , ν Wp 1 1 εµ , 1 1 εν + Wp 1 1 εν , ν = (1 ε) 1/p Wp(µ , ν ) + τp(ν, ε) Infimizing over µ and ν and applying Lemma 1 gives the lemma. Next, we specify explicit constants for the Wp resilience of Gcov. Our analysis goes through the related notion of mean resilience [49], defined by τ(µ, ε) = supµ P(Z):µ 1 1 ε µ Eµ [Z] Eµ[Z] . We say that Z µ P(Z) is (τ0, ε)-resilient in mean or under Wp if µ τ(τ0, ε) or µ τp(τ0, ε). Lemma 5 (Wp resilience for G2 and Gcov). Fix ε [0, 1) and σ 0. For 1 p 2, we have τp(G2(σ), ε) 4σε1/p 1/2(1 ε) 1/p. Moreover, we have τp(Gcov(σ), ε) τp(G2( Proof. Fix any Z µ G2(σ, z0). By definition, we have E[( Z z0 p)2/p] = E[ Z z0 2] σ2 = (σp)2/p. Thus, standard bounds (e.g., Lemma E.2 of [58]) give that Z z0 p is (σpε1 p/2(1 ε) 1, ε)-resilient in mean. By Lemma 7 of [39], we thus have that Z is (2σε1/p 1/2(1 ε) 1/p + 2ε1/pσ, ε)-resilient under Wp. This gives the first result. For the second, we observe that for Z µ Gcov(σ), we have E[ Z E[Z] 2] = tr(Σµ) Lastly, we turn to the Wp regularizer. Lemma 6 (Controlling Wp regularizer, [18], Lemmas 1 and 2). For any ℓ L, we have Rν,1(ρ; ℓ) ρ ℓ Lip, with equality if ℓis convex. For α-smooth ℓ, we have |Rν,2(ρ; ℓ) ρ ℓ H1,2(ν)| 1 The factor of 1/2 under smoothness was not present in Lemma 2 of [18], since the proof only used that |ℓ(z) ℓ(z0) (z0) (z z0)| α z z0 2, instead of the tight upper bound of α This quantity naturally bounds the excess risk of standard WDRO. Lemma 7 (WDRO excess risk bound). Under Setting A with ε = 0, the standard WDRO estimate ˆℓ= argminℓ L supν P(Z):Wp( µ,ν) ρ Eν[ℓ] satisfies Eµ[ˆℓ] Eµ[ℓ ] Rµ,p(2ρ; ℓ ). Proof. We bound Eµ[ˆℓ] Eµ[ℓ ] sup ν P(Z) Wp(ν, µ) ρ Eν[ˆℓ] Eµ[ℓ ] (Wp(µ, µ) ρ) sup ν P(Z) Wp(ν, µ) ρ Eν[ℓ ] Eµ[ℓ ] (optimality of ˆℓ) sup ν P(Z) Wp(ν,µ) 2ρ Eν[ℓ ] Eµ[ℓ ] (Wp triangle inequality) = Rµ,p(2ρ; ℓ ), as desired. Note that this bound does not incorporate the distributional assumptions encoded by G. B Proofs for Section 3 B.1 Proof of Theorem 1 Eµ[ˆℓ] Eµ[ℓ] sup ν G Wε p( µ,ν) ρ Eν[ˆℓ] Eµ[ℓ] (µ G, Wε p( µ, µ) ρ) sup ν G Wε p( µ,ν) ρ Eν[ℓ] Eµ[ℓ] (ˆℓoptimal for (OR-WDRO)) sup ν G W2ε p (ν,µ) 2ρ Eν[ℓ] Eµ[ℓ] (Lemma 2) sup ν G Wp(ν,µ) cρ+2τp(G,2ε) Eν[ℓ] Eµ[ℓ] (Lemma 3) sup ν P(Z) Wp(ν,µ) cρ+2τp(G,2ε) Eν[ℓ] Eµ[ℓ] (G P(Z)) = Rµ,p cρ + 2τp(G, 2ε); ℓ . Combining this bound with Lemma 6 gives the theorem. B.2 Proof of Corollary 1 The corollary follows as an immediate consequence of Theorem 1 and the resilience bounds of Proposition 2. B.3 Proof of Corollary 2 The corollary follows as an immediate consequence of Theorem 1 and the resilience bound τp(Gsub G, ε) q d + p + log 1 ε ε1/p established in [39, Theorem 2]. B.4 Proof of Proposition 3 For ease of presentation, suppose d = 2m is even. Consider Rd as Rm Rm, fix w Rm with w = ρ, and let L consist of the following loss functions: ℓ+,0(x, y) := L x + y ℓ ,0(x, y) := L x y ℓ+,1(x, y) := L x + y w ℓ ,1(x, y) := L x y + w . Fixing corrupted measure µ = δ0, we consider the following candidates for the clean measure µ: µ+,0 := (1 ε)δ0 + ε(Id, Id)#N(0, Id/ε) µ ,0 := (1 ε)δ0 + ε(Id, + Id)#N(0, Id/ε) µ+,1 := (1 ε)δ(0,w) + ε(Id, Id + w)#N(0, Id/ε) µ ,1 := (1 ε)δ(0,w) + ε(Id, Id + w))#N(0, Id/ε), where Id : x 7 x is the identity map. 
By design, Wε 1( µ µ+,0), Wε 1( µ µ ,0), Wε 1( µ µ+,1), and Wε 1( µ µ ,1) are all at most ρ and µ+,0, µ ,0, µ+,1, µ ,1 Gcov. Moreover, Eµ+,0[ℓ+,0] = Eµ ,0[ℓ ,0] = Eµ+,1[(ℓ+,1) = Eµ ,1[ℓ ,1] = 0 Eµ+,0[ℓ ,0] = Eµ ,0[ℓ+,0] = Eµ+,1[ℓ ,1] = Eµ ,1[ℓ+,1] = 2Lε EZ N(0,Id/ε)[ Z ] L Eµ+,0[ℓ+,1] = Eµ+,1[ℓ+,0] = Eµ ,0[ℓ ,1] = Eµ ,1[ℓ ,0] = L w = Lρ. Thus, for any ˆℓ= D( µ) L, there exists µ {µ+,0, µ ,0, µ+,1, µ ,1} such that Eµ[ˆℓ] inf ℓ L Eµ[ℓ] = Eµ[ˆℓ] L max{ρ, B.5 Proof of Proposition 4 Since µ Gcov, we have Eµ Z z0 2 1 2 Eµ[ Z Eµ[Z] 2] 1 2 + Eµ[Z] z0 = tr(Σµ) 1 2 + Eµ[Z] z0 d + Eµ[Z] z0 σ. Consequently, we have µ G2(σ, z0). Next, we bound Wε p( µn µ) := inf ν P(Z) ν 1 1 ε µn inf ν,µ P(Z) ν 1 1 ε µn µ 1 1 ε µ Wp(ν, µ ) + Wp(µ , µ) inf ν,µ P(Z) ν 1 1 ε µn µ 1 1 ε µ Wp(ν, µ ) + τp(µ, ε) (definition of τp) p Wε p( µn, µ) + τp(µ, ε) (Lemma 1) p (ρ0 + δ) + τp(Gcov, ε) = ρ. Writing G = G2(σ, z0) and mirroring the proof of Theorem 1, we have for each ℓ L that Eµ[ˆℓ] Eµ[ℓ] sup ν G Wε p( µn ν) ρ Eν[ˆℓ] Eµ[ℓ] sup ν G Wε p( µn ν) ρ Eν[ℓ] Eµ[ℓ] (ˆℓoptimal for (4)) W2ε p (ν,µ) 2ρ Eν[ℓ] Eµ[ℓ] (Lemma 2) Wp(ν,µ) cρ+2τp(G ,2ε) Eν[ℓ] Eµ[ℓ] (Lemma 3) sup ν P(Z) Wp(ν,µ) cρ+2τp(G ,2ε) Eν[ℓ] Eµ[ℓ] = Rµ,p cρ + 2τp(G , 2ε); ℓ Rµ,p cρ + 8σ(2ε) 1 p 1 p ; ℓ . (Lemma 5) When p = 1, we bound the regularizer radius by 2ε(1 2ε) 1 ρ0 + δ + τ1(Gcov, ε) + ( d + ρ0) ε ρ0 + δ + using Lemma 5. Similarly, when p = 2, we bound the radius by cρ + 8σ(1 2ε) 1 2 ρ0 + δ + τ2(Gcov, ε) + ( d + ρ0) ρ0 + δ + Taking ℓ= ℓ and applying Lemma 7 gives the proposition. B.6 Proof of Proposition 5 To start, we fix d = 1. Given 0 γ < 1/2 and ν P(R) with cumulative distribution function (CDF) Fν, define the γ-trimming Tγ(ν) P(R) as the law of F 1 ν (U), where U Unif([γ, 1 γ]), and let mγ(ν) := ETγ(ν)[Z] denote the γ-trimmed mean. If ν = 1 |A| P a A δa is uniform over a finite set A = {a1 < a2 < < an} and γn is an integer, we have mγ(ν) = 1 (1 2γ)n P(1 γ)n i=γn+1 ai. Our robust mean estimate when d = 1 is z0 = mγ( µn) with γ = 1/3. The smaller choice of γ = ε gives tighter guarantees at the cost of increased sample complexity; we keep the larger choice since we only require a coarse estimate. Lemma 8. Consider Setting B with d = 1, G = Gcov, ρ0 = 0, ε 1/3. Fix sample size n = Ω(log(1/δ)), for 0 < δ < 1/2. Then, mγ( µn) Eµ[Z] 1 with probability at least 1 δ. Proof. This follows by Proposition 1.18 of [16] applied to the distribution µ with corruption fraction γ, ε = 4γ/3 < 1/2, and resilience bound τ(Gcov, 2ε ) γ 1. Now, since we are free to permute the order of the TV and Wp corruptions (see Lemma 1), there exist {Wi}n i=1 R with empirical measure νn P(R) such that νn ˆµn TV ε and Wp(νn, µn) ρ0. By Lemma 8, we have |mγ(νn) Eµ[Z]| 1. Of course, we do not observe νn, so this result is not immediately useful. To apply this fact, we use that Tγ is an approximate Wasserstein contraction. Lemma 9. If α, β P(R) and 0 γ < 1/2, then Wp(Tγ(α), Tγ(β)) (1 2γ) 1/p Wp(α, β). Proof. Writing Fα and Fβ for the CDFs of α and β, respectively, we compute Wp(Tγ(α), Tγ(β))p = 1 1 2γ γ |F 1 α (t) F 1 β (t)|p dt 0 |F 1 α (t) F 1 β (t)|p dt = 1 1 2γ Wp(α, β)p, as desired. Applying Lemma 9, we bound |mγ( µn) Eµ[Z]| |mγ(νn) Eµ[Z]| + |mγ(νn) mγ( µn)| |mγ(νn) Eµ[Z]| + W1(Tγ(νn), Tγ( µn)) 1 + W1(νn, µn) 1 + ρ0. This matches the proposition statement when d = 1. For general d > 1, we propose the coordinate-wise trimmed estimate z0 Rd given by (z0)i = mγ(e i # µn). 
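As an aside, the coordinate-wise estimate just defined is simple to implement; below is a minimal sketch (ours, with the trimming level $\gamma = 1/3$ used above), applied to data containing a fraction of gross outliers.

```python
import numpy as np

def trimmed_mean(Z, gamma=1.0 / 3.0):
    """Coordinate-wise gamma-trimmed mean: in every coordinate, discard the
    lowest and highest gamma-fractions of the samples and average the rest.
    This is the coarse robust mean estimate z0 from the proof of Proposition 5."""
    Z = np.sort(np.asarray(Z, dtype=float), axis=0)   # sort each coordinate
    n = Z.shape[0]
    lo, hi = int(np.floor(gamma * n)), int(np.ceil((1.0 - gamma) * n))
    return Z[lo:hi].mean(axis=0)

rng = np.random.default_rng(3)
X = rng.normal(loc=2.0, scale=1.0, size=(500, 10))
X[:50] = 1e3                                   # 10% gross outliers
true_mean = 2.0 * np.ones(10)
print(np.linalg.norm(X.mean(axis=0) - true_mean))      # badly biased by outliers
print(np.linalg.norm(trimmed_mean(X) - true_mean))     # close to the true mean
```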
Plugging in δ 1/(100d) into Lemma 8 and taking a union bound over coordinates, we condition on the 0.99 probability event that the one-dimensional bound holds for all coordinates. We can then bound z0 Eµ[Z] 2 = mγ(e i # µn) Eµ[e i Z] 2 i=1 W1(e i #νn, e i # µn)2 d + W1(νn, µn)2 as desired. The penultimate inequality is a consequence of the reverse Minkowski inequality. Lemma 10. Fix α, β P(Rd) Write αi = e i #α and βi = e i #β, i [d], for their coordinate-wise marginals. We then have d X i=1 W1(αi, βi)2 W1(α, β)2. (8) Proof. Take (X, Y ) to be an optimal coupling for the W1(α, β) problem. Writing i = Yi Xi 2 for i [d], the right hand side of (8) can be written as the L1/2 norm Pd i=1 i 1/2. We then bound i=1 W1(αi, βi)2 where the final inequality follows by the reverse Minkowski inequality for L1/2. B.7 Proof of Proposition 6 sup ν G2(σ,z0): Wε p( µn ν) ρ Eν[ℓ] = sup µ ,ν P(Z) π Π(µ ,ν) Eν[ Z z0 2] σ2, Eπ[ Z Z p] ρp, = sup m Rn ν1,...,νn P(Z) i [n] mi Eνi[ℓ] : P i [n] mi Eνi[ Zi z0 2] σ2, P i [n] mi Eνi[ Zi Zi p] ρp, 0 mi 1 n(1 ε), i [n] P i [n] mi = 1 where the first equality follows from the definitions of G2(σ, z0) and Wε p( µn ν). The second equality holds because µn = 1 i [n] δ Zi, which implies that the distributions µ , ν and π take the form µ = P i [n] miδ Zi, ν = P i [n] miνi, and π = P i [n] miδ Zi νi, respectively. Note that the distribution νi models the probability distribution of the random variable Z condition on the event that Z = zi. Using the definition of the expectation operator and introducing the positive measure ν i = miνi for every i [n], we arrive at sup ν G2(σ,z0): Wε p( µn ν) ρ Eν[ℓ] = sup m Rn ν 1,...,ν n 0 i [n] Eν i[ℓ] : Z zi z0 2dν i(zi) σ2, P Z zi Zi pdν i(zi) ρp, 0 mi 1 n(1 ε), i [n], P i [n] mi = 1 R Z dν i(zi) = mi, i [n] = inf λ1,λ2 R+ r,s Rn,α R λ1σq + λ2ρp + α + i [n] si n(1 ε) : si max{0, ri α}, i [n], ri ℓ(ξ) λ1 ξ z0 2 λ2 ξ Zi p, where the second equality follows from strong duality, which holds because the Slater condition outlined in [47, Proposition 3.4] is satisfied thanks to Assumption 1. The proof concludes by removing the decision variables r and s and using the definition of µn. B.8 Proof of Theorem 2 The proof requires the following preparatory lemma. We say that the function f is proper if f(x) > and dom(f) = . Lemma 11. The followings hold. (i) Let f(x) = λg(x x0), where λ 0 and g : Rd R is l.s.c. and convex. Then, f (y) = x 0 y + λg (y/λ). (ii) Let f(x) = x p for some p 1. Then, f (y) = h(y), where the function h is defined as in (6). (iii) Let f(x) = x Σx for some Σ 0. Then, f (y) = 1 Proof. The claims follows from [25, E, Proposition 1.3.1 ], [57, Lemma B.8 (ii)] and [25, E, Example 1.1.3], respectively. Proof of Theorem 2. By Proposition 6 and exploiting the definition of µn, we have sup ν G2(σ,z0): Wε p( µn ν) ρ inf λ1σ2 + λ2ρp + α + 1 n(1 ε) P i [n] si s.t. α R, λ1, λ2 R+, s Rn + si sup ξ Z ℓ(ξ) λ1 ξ z0 2 λ2 ξ Zi p α i [n] inf λ1σ2 + λ2ρp + α + 1 n(1 ε) P i [n] si s.t. α R, λ1, λ2 R+, s Rn + si sup ξ Z ℓj(ξ) λ1 ξ z0 2 λ2 ξ Zi p α i [n], j [J] (9) where the second equality follows form Assumption 2. For any fixed i [n] and j [J], we have sup ξ Z ℓj(ξ) λ1 ξ z0 2 λ2 ξ Zi p α ( inf ( ℓj) (ζℓ ij) + z 0 ζG ij + τij + Z i ζW ij + Ph(ζW ij , λ2) + χ Z(ζZ ij) α s.t. τij Rn +, ζℓ ij, ζG ij, ζW ij , ζZ ij Rd, ζℓ ij + ζG ij + ζW ij + ζZ ij = 0, ζG ij 2 λ1τij where the equality is a result of strong duality due to [57, Theorem 2] and Lemma 11. 
The claim follows by substituting all resulting dual minimization problems into (9) and eliminating the corresponding minimization operators. B.9 Proof of Theorem 3 Thanks to Remark 3, we have sup ν G2(σ,z0): Wε p( µn ν) ρ = inf λ1,λ2 R+ λ1σ2 + λ2ρp + CVa R1 ε, µn sup ξ Z ℓ(ξ) λ1 ξ z0 2 λ2 ξ Z p # = inf λ1,λ2 R+ sup m Mε λ1σ2 + λ2ρp + X sup ξ Z ℓ(ξ) λ1 ξ z0 2 λ2 ξ Zi p # where Mε := {m Rn + : mi 1/(n(1 ε)), i [n], P i [n] mi = 1}, and the equality follows from the primal representation of the CVa R as a coherent risk measure [3, Example 4.1] and the fact that µn is discrete. By Assumption 2, we have sup ξ Z ℓ(ξ) λ1 ξ z0 2 λ2 ξ Zi p # = sup m Mε sup ξij Z max j [J] ℓ(ξij) λ1 ξij z0 2 λ2 ξij Zi p = sup q Qε sup ξij Z (i,j) [n] [J] qij h ℓ(ξij) λ1 ξij z0 2 λ2 ξij Zi pi , where Qε := {q Rn J + : P j [J] qij 1 n(1 ε), i [n], P (i,j) [n] [J] qij = 1}, and the last equality easily follows by introducing the variables qij as a means to merge the variables mi and the maximum operator. Note that the final supremum problem is nonconvex as we have bilinearity between qij and ξij. Using the definition of the perspective function and the simple variable substitution ξij ξij/qij, however, one can convexify this problem and arrive at sup ν G2(σ,z0): Wε p( µn ν) ρ Eν[ℓ] = inf λ1,λ2 R+ sup q Qε ξij qij Z n λ1σ2 + λ2ρp X (i,j) [n] [J] P ℓj(ξij, qij) λ1P 2(ξij qijz0, qij) λ2P p(ξij qij Zi, qij) o . Note that strong duality holds similar to the proof of [57, Section 6]. This allows us to interchange the infimum and supremum without changing the optimal value of the problem. Then, infimizing over λ1 and λ2, and noticing that the resulting supremum problem is solvable, since the feasible set is compact, conclude the first part of the proof. Following the discussion in [57, 6], it is easy to show that the proposed discrete distribution ν solves the worst-case expectation problem. The details are omitted for brevity. This concludes the proof. C Proofs for Section 4 C.1 Proof of Theorem 4 We start with the following lemma. Lemma 12. Under the setting of Theorem 4, we may decompose ℓ = ℓ Q for Q Rk d with QQ = Ik and some ℓ: Rk Rd. For any such decomposition, we have sup ν G Wε p(ν, µ) ρ Eν[ℓ ] = sup ν Gk Wε p(ν,Q# µ) ρ Proof. To start, we decompose A(z) = RQz + z0, where Q Rk d with QQ = Ik, R Rk k, and z0 Rk. Note that the orthogonality condition ensures that Q isometrically embeds Rk into Rd. We then choose ℓ(w) = ℓ(Rw + z0). Next, given ν G, we have Q#ν G(k) with Wε p(Q#ν, Q# µ) Wε p(ν, µ), and Eν[ℓ] = EQ#ν[ ℓ]. Thus, the RHS supremum is always at least as large as the LHS. It remains to show the reverse. Fix ν Gk with Wε p(ν, Q# µ). Take any ν P(Rk) with Wp(ν, ν ) ρ and ν Q# µ TV ε. Write κ = Q #ν G and κ = Q #ν . Since Q is an isometric embedding, we have κ G, Wp(κ, κ ) = Wp(ν, ν ) ρ, and κ µ TV = ν Q# µ TV ε. Finally, we have Eν[ℓ] = Eκ[ ℓ]. Thus, the RHS supremum is no greater than the LHS, and we have the desired equality. Writing µk = Q#µ, we mirror the proof of Theorem 1 and bound Eµ[ˆℓ] Eµ[ℓ ] sup ν G Wε p( µ,ν) ρ Eν[ˆℓ] Eµ[ℓ ] (µ G, Wε p( µ, µ) ρ) sup ν G Wε p( µ,ν) ρ Eν[ℓ ] Eµ[ℓ ] (ˆℓoptimal for (OR-WDRO)) = sup ν G(k) Wε p(ν,Q# µ) ρ Eν[ ℓ] Eµk[ ℓ] (Lemma 12) W2ε p (ν,µk) 2ρ Eν[ ℓ] Eµk[ ℓ] (Lemma 2) Wp(ν,µk) cρ+2τp(G(k),2ε) Eν[ ℓ] Eµk[ ℓ] (Lemma 3) sup ν P(Z) Wp(ν,µk) cρ+2τp(G(k),2ε) Eν[ ℓ] Eµk[ ℓ] (G P(Z)) = Rµk,p cρ + 2τp(G(k), 2ε); ℓ . To obtain the theorem, we apply Lemma 6 and observe that ℓ Lip = ℓ Lip and ℓ H1,2(µk) = ℓ H1,2(µ) (since Q is an isometric embedding from Rk into Rd). 
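As a quick numerical sanity check of the decomposition ℓ = ℓ̃ ∘ Q used in Lemma 12 (our own illustration, with arbitrary choices of f, R, and z_0 standing in for a concrete loss), the identity E_ν[ℓ] = E_{Q#ν}[ℓ̃] holds sample-by-sample under any data distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 2, 1000

# Q with orthonormal rows (Q Q^T = I_k), plus R and z0, as in Lemma 12.
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
Q = Q.T                                          # shape (k, d)
R = rng.normal(size=(k, k))
z0 = rng.normal(size=k)

f = lambda w: np.log1p(np.sum(w**2, axis=-1))    # arbitrary test function on R^k
loss = lambda z: f(z @ (R @ Q).T + z0)           # ell(z) = f(R Q z + z0)
loss_low = lambda w: f(w @ R.T + z0)             # ell~(w) = f(R w + z0)

Z = rng.normal(size=(n, d))                      # samples from some nu
# ell(Z_i) = ell~(Q Z_i) pointwise, hence E_nu[ell] = E_{Q#nu}[ell~].
assert np.allclose(loss(Z), loss_low(Z @ Q.T))
```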
C.2 Proof of Corollary 4
This follows as an immediate consequence of Theorem 4 and Corollary 1.

C.3 Proof of Proposition 7
We simply instantiate the lower bound construction from Proposition 3 in R^k, viewed as a subspace of R^d. Extending each ℓ ∈ L to R^d by ℓ(z) = ℓ(z_{1:k}), the same lower bound applies with d replaced by k.

C.4 Proof of Proposition 8
For Z ∼ µ ∈ G_cov and z_0, w ∈ R^d, we bound
E[w⊤(Z − z_0)(Z − z_0)⊤w]^{1/2} = E[(w⊤(Z − z_0))²]^{1/2} ≤ E[(w⊤(Z − E[Z]))²]^{1/2} + |w⊤(E[Z] − z_0)| ≤ ‖w‖(1 + ‖z_0 − E[Z]‖).
Consequently, we have µ ∈ G_cov(1 + ‖z_0 − E[Z]‖, z_0) ⊆ G_cov(σ, z_0). Moreover, we have W^ε_p(µ̃_n, µ) ≤ W_p(µ′_m, µ) ≤ ρ_0 + δ = ρ. Decomposing ℓ⋆ = ℓ̃⋆ ∘ Q as in Lemma 12 and writing µ_k = Q#µ, the same approach applied in the proof of Theorem 4 gives
E_µ[ℓ̂] − E_µ[ℓ⋆] ≤ R_{µ_k,p}(cρ + 2τ_p(Q#G_cov(σ, z_0), 2ε); ℓ⋆) ≤ R_{µ_k,p}(cρ + O(σ√k ε^{1/p − 1/2}); ℓ⋆).
When p = 1, Lemma 5 lets us bound the regularizer radius by O(ρ_0 + δ + (1 + ρ_0)√(kε)). Similarly, when p = 2, we bound the radius by O(ρ_0 + δ + (1 + ρ_0)√k). We then conclude as in Theorem 4.

C.5 Proof of Proposition 9
Since iterative filtering works by identifying a subset of samples with bounded covariance, and W_p perturbations can arbitrarily increase second moments when p < 2, it is not immediately clear how to apply this method. Fortunately, W_2 perturbations have a bounded effect on second moments, and, by trimming out a small fraction of samples, we can ensure that a W_1 step is bounded under W_2.

Lemma 13. For any µ, ν ∈ P(R^d) and 0 < γ ≤ 1, there exists ν′ ∈ P(R^d) with ‖ν′ − ν‖_TV ≤ γ such that W_1(µ, ν′) ≤ W_1(µ, ν) and W_∞(µ, ν′) ≤ W_1(µ, ν)/γ.

Proof. Let (X, Y) be a coupling of µ and ν with E[‖X − Y‖] = W_1(µ, ν). Writing ∆ = ‖X − Y‖, the event E that ∆ ≤ W_1(µ, ν)/γ has probability at least 1 − γ by Markov's inequality. We shall take ν′ to be the law of Y′ = 1_E Y + (1 − 1_E)X. By design, ‖ν′ − ν‖_TV ≤ γ, and W_1(µ, ν′) ≤ E[‖X − Y′‖] = E[1_E ‖X − Y‖] ≤ W_1(µ, ν). Finally, we bound W_∞(µ, ν′) ≤ ‖1_E ∆‖_∞ ≤ W_1(µ, ν)/γ.

For consistency between Settings B and B′, we let m = n if we are in the former. Thus, in both cases, we have W_p(µ̂_m, µ′_m) ≤ ρ_0 and ‖µ′_m − µ̃_n‖_TV ≤ ε ≤ ε_0 := 1/12. It is well known that the empirical measure µ̂_m will inherit the bounded covariance of µ for m sufficiently large, so long as a small fraction of samples is trimmed out. In particular, by Lemma A.18 of [15] and our sample complexity requirement, there exists a uniform discrete measure α_k over a subset of k = (1 − ε_0/120)m points such that ‖E_{α_k}[Z] − E_µ[Z]‖ ≲ 1 and Σ_{α_k} ⪯ O(1)I_d with probability at least 0.99. Moreover, applying Lemma 13 with γ = ε_0/120, there exists β ∈ P(R^d) with ‖β − µ′_m‖_TV ≤ ε_0/120 and W_2(β, µ̂_n) ≤ 240ρ_0/ε_0. Combining, we have that W_2^{ε_0/120 + ε_0/120 + ε_0}(α_k, µ̃_n) = W_2^{61ε_0/60}(α_k, µ̃_n) ≤ 240ρ_0/ε_0, and so there exists κ ∈ P(R^d) such that W_2(α_k, κ) ≤ 240ρ_0/ε_0 and ‖κ − µ̃_n‖_TV ≤ 61ε_0/60. The W_2 bound implies that Σ_κ ⪯ O(1 + ρ_0²ε_0^{−2})I_d. Thus, by the proof of Theorem 4.1 in [26] and our sample complexity requirement, the iterative filtering algorithm (Algorithm 1 therein, based on that of [15]) applied with an outlier fraction of 61ε_0/60 ≤ 1/10 returns a reweighting of µ̃_n whose mean z_0 ∈ R^d is within O(√ε_0 + ρ_0/√ε_0) = O(1 + ρ_0) of that of κ. Thus, we obtain
‖z_0 − E_µ[Z]‖ ≤ ‖E_κ[Z] − E_µ[Z]‖ + O(1 + ρ_0) ≤ ‖E_{α_k}[Z] − E_µ[Z]‖ + O(1 + ρ_0) ≤ O(1 + ρ_0),
as desired.
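For intuition, the spectral reweighing/filtering idea invoked above can be sketched as follows. This is our own simplified illustration of one common variant, not Algorithm 1 of [26] or the exact procedure of [15]; the covariance threshold is an arbitrary placeholder:

```python
import numpy as np

def filter_mean(X, max_iter=50, cov_threshold=9.0):
    """Schematic iterative filtering for robust mean estimation: while the
    top eigenvalue of the weighted covariance is large, downweight points
    with large squared projection onto the top eigenvector, then return
    the reweighted mean."""
    n, _ = X.shape
    w = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        mu = w @ X / w.sum()
        Xc = X - mu
        Sigma = (Xc * (w / w.sum())[:, None]).T @ Xc
        evals, evecs = np.linalg.eigh(Sigma)
        if evals[-1] <= cov_threshold:       # covariance is already bounded
            break
        scores = (Xc @ evecs[:, -1]) ** 2    # outlyingness along top direction
        w = w * (1.0 - scores / scores.max())
    return w @ X / w.sum()
```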
C.6 Proof of Proposition 10 sup ν Gcov(σ,z0): Wε p( µn ν) ρ Eν[ℓ] = sup µ ,ν P(Z) π Π(µ ,ν) Eν[(Z z0)(Z z0) ] σId, Eπ[ Z Z p] ρp, µ 1 1 ε µn = sup m Rn ν1,...,νn P(Z) i [n] mi Eνi[ℓ] : i [n] mi Eνi[(Zi z0)(Zi z0) ] σId, P i [n] mi Eνi[ Zi Zi p] ρp, 0 mi 1 n(1 ε), i [n], P i [n] mi = 1 = sup m Rn ν 1,...,ν n 0 i [n] Eν i[ℓ] : Z(zi z0)(zi z0) dν i(zi) σId, P Z zi Zi pdν i(zi) ρp, 0 mi 1 n(1 ε), i [n], P i [n] mi = 1 R Z dν i(zi) = mi, i [n] where the first equality follows from the definitions of Gcov(σ, z0) and Wε p( µn ν). The second and the third equalities follow from the same variable substitution as in the proof of Proposition 6. The last optimization problem admits the dual form inf Λ1 Qd +,λ2 R+ r,s Rn,α R z 0 Λ1z0 + σ Tr[Λ1] + λ2ρp + α + i [n] si n(1 ε) : si max{0, ri α}, i [n], ri ℓ(ξ) ξ Λ1ξ + 2ξ Λ1z0 λ2 ξ Zi p, ξ Z, i [n] Strong duality holds thanks to Assumption 3 and [47, Proposition 3.4]. The proof of the second claim concludes by removing the decision variables r and s and using the definition of µn. C.7 Proof of Theorem 5 By Proposition 10 and exploiting the definition of µn, we have sup ν Gcov(σ,z0): Wε p( µn ν) ρ inf z 0 Λ1z0 + σ Tr[Λ1] + λ2ρp + α + 1 n(1 ε) s.t. Λ1 Qd +, λ2 R+, s Rn + si sup ξ Z ℓ(ξ) ξ Λ1ξ + 2ξ Λ1z0 λ2 ξ Zi p α i [n] inf z 0 Λ1z0 + σ Tr[Λ1] + λ2ρp + 1 n(1 ε) s.t. Λ1 Qd +, λ2 R+, s Rn + si sup ξ Z ℓj(ξ) ξ Λ1ξ + 2ξ Λ1z0 λ2 ξ Zi p α i [n], j [J] where the second equality follows form Assumption 2. For any fixed i [n] and j [J], we have sup ξ Z ℓj(ξ) ξ Λ1ξ + 2ξ Λ1z0 λ2 ξ Zi p α ( inf ( ℓj) (ζℓ ij) + 1 4(ζG ij) Λ 1 1 ζG ij + Z i ζW ij + λ2h(ζW ij /λ2, p) + χ Z(ζZ ij) α s.t. ζℓ ij, ζG ij, ζW ij , ζZ ij Rd, ζℓ ij + ζG ij + ζW ij + ζZ ij = 2Λ1z0 where the equality is a result of strong duality due to [57, Theorem 2] and Lemma 11. The claim follows by introducing the epigraph variable τij for the term 1 4(ζG ij) Λ 1 1 ζG ij, and substituting all resulting dual minimization problems into (10) and eliminating the corresponding minimization operators. Note that by the Schur complement argument, we have 1 4(ζG ij) Λ 1 1 ζG ij τij Λ1 ζG ij (ζG ij) 4τij which implies that the resulting reformulation is indeed convex. D Comparison to WDRO with Expanded Radius around Minimum Distance Estimate (Remark 1) First, we prove the claimed excess risk bound for WDRO with an expanded radius around the minimum distance estimate ˆµ = ˆµ( µ, G, ε) := argminν G Wε p(ν, µ). We write c = 2(1 ε) 1/p as in Theorem 1. Lemma 14. Under Setting A, let ˆℓ= argminℓ L supν P(Z):Wp(ν,ˆµ) ρ Eν[ℓ], for the expanded radius ρ := cρ + 2τp(G, 2ε). We then have Eµ[ˆℓ] Eµ[ℓ ] Rµ,p(cρ + 2τp(G, 2ε); ℓ ). Proof. Since Wε p(µ, µ) ρ and µ G, we have Wε p(ˆµ, µ) ρ. Thus, Lemma 2 gives that W2ε p (ˆµ, µ) 2ρ. By Lemma 3, we then have Wp(ˆµ, µ) cρ + 2τp(G, 2ε), and so Lemma 7 gives the desired result. In practice, we are unaware of efficient finite-sample algorithms to compute ˆµ. For the class Gcov, we instead propose the spectral reweighing estimate ˇµ = ˇµ( µ, ε) := argminν P2(Z), ν 1 1 2ε µ Σν op, where op is the matrix operator norm (see [26] for varied applications of spectral reweighing). In practice, when µ = µn is an n-sample empirical measure, one can efficiently obtain a feasible measure ν whose objective value is optimal up to constant factors for the problem with ε 3ε, using the iterative filtering algorithm [15]. We work with the exact minimizer ˇµ for convenience, but our results are robust to such approximate solutions. Lemma 15. 
Under Setting A with G = Gcov, p = 1, and 0 < ε 0.2, we have W1(ˇµ, µ) dε, and this bound is tight; that is, there exists an instance (µ, µ) Gcov P(Rd) with Wε 1(µ, µ) ρ such that W1(ˇµ, µ) dε. Consequently, the WDRO estimate ˇℓ= argminℓ L supν P(Z):W1(ν,ˇµ) ˇρ Eν[ℓ] satisfies Eµ[ˇℓ] Eµ[ℓ ] Rµ,1(O( Proof. Upper bound: For the upper bound on W1 estimation error, fix any µ P(Rd) with Wε 1( µ, µ) ρ. Take any µ P(Rd) such that W1(µ , µ) ρ and µ µ TV ε. By Lemma 13, there exists α P(Rd) with W1(α, µ) ρ, W2(α, µ) ρ p 2/ε, and α µ TV 2ε. Fixing an optimal coupling π Π(α, µ) for the W2(α, µ) problem and letting (Z, W) π, we bound 1 2op = sup θ Sd 1 E θ (Z E[Z])2 1 2 sup θ Sd 1 E θ (Z E[W])2 1 1 2op + W2(α, µ) Thus, α Gcov(1 + ρ p 2/ε). Write β := 1 (α µ)(Rd)α µ, and note that this midpoint measure is feasible for the problem defining ˇµ. Hence, we have Σˇµ op Σβ op sup θ Sd 1 Eβ (θ (Z Eα[Z]))2 1 1 2ε Σα op 1 1 2ε(1 + ρ p and so ˇµ Gcov (1 2ε) 1/2(1 + ρ p 2/ε) . Moreover, we have ˇµ α TV 4ε. Thus, using Lemma 5 and the fact that 4ε is bounded away from 1, we bound W1(ˇµ, α) W1 ˇµ, 1 (ˇµ α)(Rd) ˇµ α + W1 1 (ˇµ α)(Rd) ˇµ α By the triangle inequality, we have W1(ˇµ, µ) dε. The risk bound follows by Lemma 7. Taking the final error measurement using distance between means instead of W1, we observe that Eˇµ[Z] Eµ[Z] 2 ρ + ε. Lower bound: To see that this guarantee cannot be improved, fix clean measure µ = δ0 Gcov, and consider the corrupted measure µ = (1 3ε)δ0 + 2ε 1 2ε e1 + εN 0, ρ2 100ε2 Id , constructed so that Wε 1( µ, µ) ρ. Intuitively, iterative filtering seeks to drive down the operator norm of covariance matrix and will thus focus on removing mass from the second mixture component. To formalize this, we decompose the output of filtering as (1 2ε)ˇµ = (1 3ε τ)δ0 +α+β, where 0 τ 2ε and α, β M+(Rd) such that α 2ε 1 2ε e1 and β εN 0, ρ2 100ε2 Id . We have τ + (2ε α(Rd)) + (ε β(Rd)) = 2ε by the definition of ˇµ. Further note that Eˇµ[Z] ε + ρ ρ, by the bounds above. Now suppose for sake of contradiction that β(Rd) ε/2. Then α(Rd) ε/2, and so we bound Eˇµ[Z2 1] Eˇµ[Z] ε 2(1 2ε) ρ2 On the other hand, another feasible outcome for spectral reweighing is µ = 1 1 ε(1 3ε)δ0 + 100ε2 Id , for which we have Σµ 1/2 op = ρ 10 ε. Since 10 > 8, this contradicts optimality of ˇµ if ε cρ2 for a sufficiently small constant c. However, if ε > cρ2, then a lower bound of Ω( dε) suffices. This bound holds even without Wasserstein perturbations; see the Gcov lower risk bound of Ω( dε) in Theorem 2 of [38]. We now suppose that β(Rd) ε/2. Let Z N 0, ρ2 100ε2 and write F for the CDF of Z 2 (which has a scaled χ2 d distribution). We then have R z dβ(z) ε E Z | Z 2 F 1(1/2) dρ, using concentration of χ2 d about its mean. Thus, W1(ˇµ, µ) Eˇµ[ Z ] Eµ[ Z ] E Smaller Robustness Radius for Outlier-Robust WDRO (Remark 2) In the classical WDRO setting with ρ0 = ε = 0, the radius ρ can often be taken significantly smaller than n 1/d if L and µ are sufficiently well-behaved. In particular, when µ satisfies a T2 transportation inequality, [18] proves that ρ = e O(n 1/2) gives meaningful risk bounds. Recall that µ T2(τ) if τH(ν µ), ν P2(Z), where H(ν µ) := R Z log(dν/dµ)dν denotes relative entropy when ν µ (and is + otherwise). We note that T2 is implied by the log-Sobolev inequality, which holds for example when µ has strongly log-concave density. Under T2, [18] shows the following. Proposition 11 (Example 3 in [18]). Fix Z = Rd R, τ, B > 0, and an α-smooth and L-Lipschitz function f : R R. 
Consider the parameterized family of loss functions L = {(x, y) 7 ℓθ(x, y) = f(θ x y) : θ Θ}, where Θ {θ Rd : θ B}. Fix µ P(Z) whose first marginal µX = µ( R) satisfies µX T2(τ) and such that infθ Θ Eµ[f (θ X Y )2] > 0. Write σ = sup θ Θ Eµ[f (θ X, Y )4] 1 2 Eµ[f (θ X, Y )2] L2 infθ Θ Eµ[f (θ X, Y )2] < . For t > 0, define τt 1 + d log(2 + 2Bn) 2t(1 + d log(2 + 2Bn)) δn = 2L+2BαEµ[ X ]+B2α2Varµ( X )+ρn q Eµ (L+Bα X )2 +Varµ (L+Bα X )2 ηn = 2αB2τt 1 + d log(2 + 2Bn) Then, with probability at least 1 2/n 2e t, we have |Eµ[ℓθ] Eˆµn[ℓθ]| Rˆµn,2(ρn; ℓθ) + δn + ηn θ Θ. (11) We note that (11) is stated without the absolute value on the right hand side, but that this strengthened result holds due to the discussion after [18, Theorem 1]. This generalization bound immediately gives the following excess risk bound. Corollary 5. Assume n 800. Fix ρn, δn, ηn, and L as in Proposition 11 with t = 7, and take ℓˆθ L minimizing (1) with p = 2 and radius ρ = ρn. Then, with probability at least 0.99, we have Eµ[ℓˆθ] Eµ[ℓθ] ρn ℓθ H1,2(µ) + αρ2 n + δn + ηn θ Θ. Proof. By Proposition 11 and [18, Remark 1], we have Eµ[ℓˆθ] Eµ[ℓθ] 2Rˆµn,2(ρn; ℓθ) + 2δn + 2ηn θ Θ, with probability at least 1 2/n 2e t 0.995. Since ℓθ is α-smooth, Lemma 6 gives that Eµ[ℓˆθ] Eµ[ℓθ] 2ρn ℓθ H1,2(ˆµn) + 2αρ2 n + 2δn + 2ηn θ Θ, with probability at least 0.995. By Markov s inequality, we can substitute ˆµn with µ at the cost of a constant factor blow-up in excess risk and a decrease in the confidence probability to, say, 0.99. In the example above, excess risk is controlled by the 2-Wasserstein regularizer with radius ρn = O(n 1/2), up to O(n 1) correction terms, which is significantly smaller that the typical radius size of O(n 1/d). We shall now lift this improvement to the outlier-robust setting. Similar to Proposition 4, we perform outlier-robust DRO with a modified choice of A. This time, writing G2(σ) = z0 ZG2(σ, z0), we have the following. Proposition 12 (Outlier-robust WDRO under T2). Assume n 800 and µ Gcov. Fix ρn, δn, ηn, and L as in Proposition 11 with t = 8, and take ℓˆθ minimizing (OR-WDRO) with center µn, radius ρ = ρ0 + 15ρn + 200 d, and A = G2(15 d + ρn). Then, with probability at least 0.99, we have Eµ[ℓˆθ] Eµ[ℓθ] ℓθ H1,2(µ) ρ0 + ρn + d + α ρ0 + ρn + d 2 + δn + ηn θ Θ. Proof. Noting that Gcov G2( d), we have by Markov s inequality that ˆµn G2(15 d) with probability at least 0.995. In other words, there exists z0 Z such that W2(ˆµn, δz0) 15 d. Thus, for any ν P(Z) with W2(ˆµn, ν) ρn, we have W2(ν, δz0) 15 d + ρn, and so ν G2(15 d + ρn). By Lemmas 4 and 5, this implies that Wε 2( µn ν) Wε 2( µn, ν) + τ2(ν, ε) ρ0 + ρn + 8(15 d + ρn)(1 ε) 1/2 Next, by Proposition 11, with probability at least 1 1/400 + 2e 8 0.9985, we have for each θ Θ that Eµ[ℓˆθ] Eµ[ℓθ] Eˆµn[ℓˆθ] + Rˆµn,2(ρn; ℓˆθ) Eµ[ℓθ] + δn + ηn = Eˆµn[ℓˆθ] + sup ν P(Z) W2(ˆµn,ν) ρn Eν[ℓˆθ] Eˆµn[ℓˆθ] Eµ[ℓθ] + δn + ηn = Eˆµn[ℓˆθ] + sup ν G2(15 d+ρn) W2(ˆµn,ν) ρn Eν[ℓˆθ] Eˆµn[ℓˆθ] Eµ[ℓθ] + δn + ηn Eˆµn[ℓˆθ] + sup ν G2(15 d+ρn) Wε 2( µn ν) ρ Eν[ℓˆθ] Eˆµn[ℓˆθ] Eµ[ℓθ] + δn + ηn. Let c = 2 1 ε 1 2ε 1/p. 
Using optimality of ℓ̂ and Lemma 2, we further bound
E_µ[ℓ_θ̂] − E_µ[ℓ_θ]
≤ sup{ E_ν[ℓ_θ] : ν ∈ G_2(15√d + ρ_n), W_2^ε(µ̃_n‖ν) ≤ ρ } − E_{µ̂_n}[ℓ_θ] + E_{µ̂_n}[ℓ_θ] − E_µ[ℓ_θ] + δ_n + η_n
≤ sup{ E_ν[ℓ_θ] : ν ∈ G_2(15√d + ρ_n), W_2^ε(µ̃_n‖ν) ≤ ρ } − E_{µ̂_n}[ℓ_θ] + R_{µ̂_n,2}(ρ_n; ℓ_θ) + 2δ_n + 2η_n
≤ sup{ E_ν[ℓ_θ] : ν ∈ G_2(15√d + ρ_n), W_2^{2ε}(µ̂_n, ν) ≤ cρ } − E_{µ̂_n}[ℓ_θ] + R_{µ̂_n,2}(ρ_n; ℓ_θ) + 2δ_n + 2η_n
≤ sup{ E_ν[ℓ_θ] : ν ∈ G_2(15√d + ρ_n), W_2(ν, µ̂_n) ≤ cρ + τ_2(µ̂_n, 2ε) + τ_2(A, 2ε) } − E_{µ̂_n}[ℓ_θ] + R_{µ̂_n,2}(ρ_n; ℓ_θ) + 2δ_n + 2η_n
≤ R_{µ̂_n,2}(cρ + τ_2(µ̂_n, 2ε) + τ_2(A, 2ε); ℓ_θ) + R_{µ̂_n,2}(ρ_n; ℓ_θ) + 2δ_n + 2η_n
≲ ‖ℓ_θ‖_{H^{1,2}(µ̂_n)}(ρ_0 + ρ_n + √d) + α(ρ_0 + ρ_n + √d)² + δ_n + η_n.
As in Corollary 5, we can substitute µ̂_n with µ at the cost of a constant factor increase in excess risk along with a small decrease in the confidence probability (in this case, sufficiently small such that the total failure probability is at most 0.01).

The main goal of Proposition 12 was to demonstrate that one can expect improved excess risk bounds for outlier-robust WDRO in situations where such improvements hold for standard WDRO. We conjecture that similar guarantees hold in additional settings and under milder assumptions like the T_1 inequality, but leave such refinements for future work. In particular, for the class G_cov, it would be desirable to prove such bounds when p = 1, so that the TV contribution to the risk vanishes as ε → 0.

F Parameter Tuning (Remark 5)
To clarify the parameter selection process, we consider Setting B with the class G = G_cov(σ), p = 1, and ε ≤ 1/3. We aim to efficiently achieve excess risk
E_µ[ℓ̂] − E_µ[ℓ⋆] ≲ ‖ℓ⋆‖_Lip (ρ_0 + σ√d n^{−1/d}),  (12)
matching Proposition 4, when
ℓ̂ = argmin_{ℓ ∈ L} sup{ E_ν[ℓ] : ν ∈ G_2(σ̂, z_0), W_1^ε̂(µ̃_n‖ν) ≤ ρ̂ }  (13)
for some parameter guesses σ̂, ε̂, and ρ̂, and a robust mean estimate z_0. First, we observe that the coordinate-wise trimmed mean estimate from Proposition 5 is computed without knowledge of the parameters, so it is safe to assume that ‖z_0 − E_µ[Z]‖ ≲ √d + ρ_0. If the parameter guesses are conservative, i.e., σ̂ ≥ σ, ε̂ ≥ ε, and ρ̂ ≥ ρ_0 + W_1(µ, µ̂_n), then we may still employ Proposition 4. If they are also not too large, i.e., σ̂ ≲ σ, ε̂ ≲ ε, and ρ̂ ≲ ρ_0 + σ√d n^{−1/d}, this gives the desired excess risk. We now explore what prior knowledge of σ, ε, and ρ_0 is needed to obtain such guesses.

First, we show that effective learning is impossible without knowledge of ρ_0, even for standard WDRO (i.e., ε = 0, σ = ∞). For ease of presentation, we state the following lower bound without sampling.

Lemma 16. There exists a family of loss functions L over R such that, for any C > 0 and decision rule D : P(R) → L, there are µ, µ̃ ∈ P(R) such that E_µ[D(µ̃)] > C inf_{ℓ ∈ L}(E_µ[ℓ] + W_1(µ, µ̃)‖ℓ‖_Lip).

Proof. Let L = {ℓ_θ : θ > 0}, where ℓ_θ(z) := z/θ + θ. By design, we have ‖ℓ_θ‖_Lip = 1/θ. Let µ̃ = δ_0 and write D(µ̃) = ℓ_θ̂ for some θ̂ > 0. We then set µ = δ_ρ for ρ = 10θ̂²C². This gives
E_µ[D(µ̃)] = ℓ_θ̂(ρ) = ρ/θ̂ + θ̂ > ρ/θ̂ = 10C²θ̂,
inf_{ℓ ∈ L}(E_µ[ℓ] + W_1(µ, µ̃)‖ℓ‖_Lip) = inf_{θ>0}(ℓ_θ(ρ) + ρ/θ) = inf_{θ>0}(2ρ/θ + θ) = 2√(2ρ) = 2√20 Cθ̂.
Since 10C²θ̂ > 2√20 C²θ̂, we have E_µ[D(µ̃)] > C inf_{ℓ ∈ L}(E_µ[ℓ] + W_1(µ, µ̃)‖ℓ‖_Lip), as desired.

Thus, we assume in what follows that ρ̂ = ρ_0 is known. Moreover, we require knowledge of at least one of ε and σ. If both are unknown, then it is information-theoretically impossible to meaningfully distinguish inliers from outliers (see Exercise 1.7b of [16] for a discussion of this issue in the setting of robust mean estimation). If ε is known, then we can choose σ̂ = 2^i for the smallest i such that the supremum in Eq. (13) is feasible (or, equivalently, such that the associated dual is bounded for some fixed ℓ ∈ L). We can overshoot σ by at most a factor of two and thus achieve the desired risk bound. Searching over i by binary search adds a multiplicative overhead logarithmic in the ratio between the initial guess for σ and its true value. The same approach can be employed if σ is known but not ε; a schematic sketch of this selection procedure is given below.
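For illustration, the selection of σ̂ described above can be sketched as a simple doubling scan (our own sketch; `dual_is_bounded` is a hypothetical callable that stands in for solving the dual of (13) with a fixed loss and reporting whether it is bounded, and a binary search over the exponent i would reduce the number of feasibility checks):

```python
def tune_sigma(dual_is_bounded, sigma_init=1.0, max_doublings=60):
    """Return sigma_init * 2**i for the smallest i such that the inner
    supremum in (13) is feasible (equivalently, its dual is bounded).
    As discussed above, this overshoots the true sigma by at most a
    factor of two once feasibility is first attained."""
    sigma = sigma_init
    for _ in range(max_doublings):
        if dual_is_bounded(sigma):
            return sigma
        sigma *= 2.0
    raise RuntimeError("no feasible sigma found; increase max_doublings")
```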
G Additional Experiments
We now provide several experiments in addition to those in the main body. Code is again provided at https://github.com/sbnietert/outlier-robust-WDRO. First, in Fig. 3, we extend the experiment summarized in Fig. 2 (top) to include runs with ε̂ = ε and varied Wasserstein radius ρ̂ ∈ {ρ/2, ρ, 2ρ}. For this simple learning problem and perturbation model, we find that the choice of ρ̂ plays a minor role in the resulting excess risk. We emphasize that, in the worst case, the selection of ρ̂ can be critical, as demonstrated by Lemma 16.

Figure 3: Excess risk of standard WDRO and outlier-robust WDRO for linear regression under W_p and TV corruptions, with varied sample size and dimension.

Next, we consider linear classification with the hinge loss, i.e., L = {ℓ_θ(x, y) = max{0, 1 − y(θ⊤x)} : θ ∈ R^{d−1}}. This time, to ensure that the resulting optimization problem is convex, our approach supports Euclidean Wasserstein perturbations in the feature space but no Wasserstein perturbations in the label space; this corresponds to taking Z = R^{d−1} × R equipped with the extended norm ‖(x, y)‖ = ‖x‖_2 + ∞ · 1{y ≠ 0}. We consider clean data (X, Y = sign(θ⋆⊤X)) ∼ µ, where X ∼ N(0, I_{d−1}). The corrupted data (X̃, Ỹ) ∼ µ̃ satisfies (X̃, Ỹ) = (X + ρe_1, Y) with probability 1 − ε and (X̃, Ỹ) = (100X, Y) with probability ε, so that W^ε_p(µ̃‖µ) ≤ ρ. In Figure 4 (left), we fix d = 10 and compare the excess risk E_µ[ℓ_θ̂] − E_µ[ℓ_θ⋆] of standard WDRO and outlier-robust WDRO with G = G_2, as described by Proposition 4 and implemented via Theorem 2. The results are averaged over T = 20 runs for sample sizes n ∈ {10, 20, 50, 75, 100}. We note that this example cannot drive the excess risk of standard WDRO to infinity, so the separation between standard and outlier-robust WDRO is less striking than for regression, though still present.

We further present results for multivariate regression. This time, we consider Z = R^d × R^k equipped with the ℓ_2 norm and use the loss family L = {ℓ_M(x, y) = ‖Mx − y‖_1 : M ∈ R^{k×d}}. We consider clean data (X, Y = M⋆X) ∼ µ, where M⋆ ∈ R^{k×d} and X have standard normal entries. The corrupted data (X̃, Ỹ) ∼ µ̃ satisfies (X̃, Ỹ) = (X + ρe_1, Y) with probability 1 − ε and (X̃, Ỹ) = (10X, 100M⋆X) with probability ε, so that W^ε_p(µ̃‖µ) ≤ ρ. In Figure 4 (right), we fix d = 10 and k = 3, and compare the excess risk E_µ[ℓ_M̂] − E_µ[ℓ_{M⋆}] of standard WDRO and outlier-robust WDRO with G = G_2, as described by Proposition 4 and implemented via Theorem 2. The results are averaged over T = 10 runs for sample sizes n ∈ {10, 20, 50, 75, 100}. We are restricted to small k since the ℓ_1 norm in the losses is expressed as the maximum of 2^k concave functions (specifically, we use that ℓ_M(x, y) = max_{α ∈ {−1,1}^k} α⊤(Mx − y)).

Figure 4: Excess risk of standard WDRO and outlier-robust WDRO for classification and multivariate linear regression under W_p and TV corruptions, with varied sample size.
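For concreteness, the corruption model used in the multivariate regression experiment above can be generated as follows (a minimal NumPy sketch of the sampling procedure just described; the function name and the example parameter values are ours):

```python
import numpy as np

def corrupted_regression_data(n, d, k, rho, eps, rng):
    """Clean pairs (X, Y = M* X) with standard normal X and M*; each sample
    is either shifted by rho * e1 in feature space (prob. 1 - eps) or
    replaced by the gross outlier (10 X, 100 M* X) (prob. eps)."""
    M_star = rng.normal(size=(k, d))
    X = rng.normal(size=(n, d))
    Y = X @ M_star.T
    outlier = rng.random(n) < eps
    e1 = np.eye(d)[0]
    X_corr = np.where(outlier[:, None], 10 * X, X + rho * e1)
    Y_corr = np.where(outlier[:, None], 100 * Y, Y)
    return (X_corr, Y_corr), (X, Y), M_star

rng = np.random.default_rng(0)
(corrupted, clean, M_star) = corrupted_regression_data(n=100, d=10, k=3,
                                                       rho=2.0, eps=0.1, rng=rng)
```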
Finally, we turn to a classification task with image data. We train a robust linear classifier to distinguish between the MNIST [14] digits 3 and 8 when 10% of training labels are flipped uniformly at random, using outlier-robust WDRO with G = G_2 and the hinge loss, as above, applied with ε = 0.1 and ρ = 0.01. To ensure tractability, we first pass the images through principal component analysis to reduce the input to 50 dimensions, and we again use Theorem 2 for implementation. In Fig. 5, we plot the classification accuracy of our robust classifier compared to one learned via standard WDRO for training set sizes n ∈ {10, 20, 50, 100, 150}, averaged over T = 10 runs. In this case, we do not witness a noticeable improvement over standard WDRO. We suspect that the relevant G_2 class is too large to be of significant use in this high-dimensional setting.

Figure 5: Classification accuracy of standard WDRO and outlier-robust WDRO predictors for MNIST digit classification under random label flips, with varied number of training samples.

For all experiments, confidence bands are plotted representing the top and bottom 10% quantiles among 100 bootstrapped means from the T runs. The additional experiments were performed on an M1 MacBook Air with 16 GB of RAM in roughly 30 minutes each.
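The bootstrap confidence bands described above can be reproduced with a few lines (a minimal sketch; the function name is ours):

```python
import numpy as np

def bootstrap_band(run_values, n_boot=100, q_lo=0.10, q_hi=0.90, seed=0):
    """Top/bottom quantiles among n_boot bootstrapped means of the per-run
    results (e.g., excess risks from the T independent runs)."""
    rng = np.random.default_rng(seed)
    runs = np.asarray(run_values, dtype=float)
    boot_means = np.array([rng.choice(runs, size=runs.size, replace=True).mean()
                           for _ in range(n_boot)])
    return np.quantile(boot_means, [q_lo, q_hi])
```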