# regularization_via_mass_transportation__6d4c1d9e.pdf Journal of Machine Learning Research 20 (2019) 1-68 Submitted 10/17; Revised 5/19; Published 6/19 Regularization via Mass Transportation Soroosh Shafieezadeh-Abadeh soroosh.shafiee@epfl.ch Daniel Kuhn daniel.kuhn@epfl.ch Risk Analytics and Optimization Chair, EPFL, Switzerland Peyman Mohajerin Esfahani p.mohajerinesfahani@tudelft.nl Delft Center for Systems and Control TU Delft, The Netherlands Editor: Koji Tsuda The goal of regression and classification methods in supervised learning is to minimize the empirical risk, that is, the expectation of some loss function quantifying the prediction error under the empirical distribution. When facing scarce training data, overfitting is typically mitigated by adding regularization terms to the objective that penalize hypothesis complexity. In this paper we introduce new regularization techniques using ideas from distributionally robust optimization, and we give new probabilistic interpretations to existing techniques. Specifically, we propose to minimize the worst-case expected loss, where the worst case is taken over the ball of all (continuous or discrete) distributions that have a bounded transportation distance from the (discrete) empirical distribution. By choosing the radius of this ball judiciously, we can guarantee that the worst-case expected loss provides an upper confidence bound on the loss on test data, thus offering new generalization bounds. We prove that the resulting regularized learning problems are tractable and can be tractably kernelized for many popular loss functions. The proposed approach to regluarization is also extended to neural networks. We validate our theoretical out-of-sample guarantees through simulated and empirical experiments. Keywords: Distributionally robust optimization, optimal transport, Wasserstein distance, robust optimization, regularization 1. Introduction The fields of machine learning and optimization are closely intertwined. On the one hand, optimization algorithms are routinely used for the solution of classical machine learning problems. Conversely, recent advances in optimization under uncertainty have inspired many new machine learning models. From a conceptual point of view, many statistical learning tasks give naturally rise to stochastic optimization problems. Indeed, they aim to find an estimator from within a prescribed hypothesis space that minimizes the expected value of some loss function. The loss function quantifies the estimator s ability to correctly predict random outputs (i.e., dependent variables or labels) from random inputs (i.e., independent variables or features). Unfortunately, such stochastic optimization problems cannot be solved exactly because the input-output distribution, which is needed to evaluate the expected loss in the objective c 2019 Soroosh Shafieezadeh-Abadeh and Daniel Kuhn and Peyman Mohajerin Esfahani. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/17-633.html. Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani function, is not accessible and only indirectly observable through finitely many training samples. Approximating the expected loss with the empirical loss, that is, the average loss across all training samples, yields fragile estimators that are sensitive to perturbations in the data and suffer from overfitting. Regularization is the standard remedy to combat overfitting. Regularized learning models minimize the sum of the empirical loss and a penalty for hypothesis complexity, which is typically chosen as a norm of the hypothesis. There is ample empirical evidence that regularization reduces a model s generalization error. Statistical learning theory reasons that regularization implicitly restricts the hypothesis space, thereby controlling the gap between the training error and the testing error, see, e.g., Bartlett and Mendelson (2002). However, alternative explanations for the practical success of regularization are possible. In particular, ideas from modern robust optimization (Ben-Tal et al. (2009)) recently led to a fresh perspective on regularization. Robust regression and classification models seek estimators that are immunized against adversarial perturbations in the training data. They have received considerable attention since the seminal treatise on robust least-squares regression by El Ghaoui and Lebret (1997), who seem to be the first authors to discover an intimate connection between robustification and regularization. Specifically, they show that minimizing the worst-case residual error with respect to all perturbations in a Frobenius norm-uncertainty set is equivalent to a Tikhonov regularization procedure. Xu et al. (2010) disclose a similar equivalence between robust least-squares regression with a feature-wise independent uncertainty set and the celebrated Lasso (least absolute shrinkage and selection operator) algorithm. Leveraging this new robustness interpretation, they extend Lasso to a wider class of regularization schemes tailored to regression problems with disturbances that are coupled across features. In the context of classification, Xu et al. (2009) provide a linkage between robustification over non-box-typed uncertainty sets and the standard regularization scheme of support vector machines. A comprehensive characterization of the conditions under which robustification and regularization are equivalent has recently been compiled by Bertsimas and Copenhaver (2017). New learning models have also been inspired by recent advances in the emerging field of distributionally robust optimization, which bridges the gap between the conservatism of robust optimization and the specificity of stochastic programming. Distributionally robust optimization seeks to minimize a worst-case expected loss, where the worst case is taken with respect to all distributions in an ambiguity set, that is, a family of distributions consistent with the given prior information on the uncertainty, see, e.g., Calafiore and El Ghaoui (2006), Delage and Ye (2010), Goh and Sim (2010), Wiesemann et al. (2014) and the references therein. Ambiguity sets are often characterized through generalized moment conditions. For instance, Lanckriet et al. (2002a) propose a distributionally robust minimax probability machine for binary classification, where both classes are encoded by the first and second moments of their features, and the goal is to find a linear classifier that minimizes the worstcase misclassification error in view of all possible input distributions consistent with the given moment information. By construction, this approach forces the worst-case accuracies of both classes to be equal. Huang et al. (2004) propose a generalization of the minimax probability machine that allows for uneven worst-case classification accuracies. Lanckriet et al. (2002b) extend the minimax probability machine to account for estimation errors in the mean vectors and covariance matrices. Strohmann and Grudic (2003) and Bhattacharyya (2004) develop Regularization via Mass Transportation minimax probability machines for regression and feature selection, respectively. Shivaswamy et al. (2006) study linear classification problems trained with incomplete and noisy features, where each training sample is modeled by an ambiguous distribution with known first and second-order moments. The authors propose to address such classification problems with a distributionally robust soft margin support vector machine and then prove that it is equivalent to a classical robust support vector machine with a feature-wise uncertainty set. Farnia and Tse (2016) investigate distributionally robust learning models with moment ambiguity sets that restrict the marginal of the features to the empirical marginal. The authors highlight similarities and differences to classical regression models. Ambiguity sets containing all distributions that share certain low-order moments are computationally attractive but fail to converge to a singleton when the number N of training samples tends to infinity. Thus, they preclude any asymptotic consistency results. A possible remedy is to design spherical ambiguity sets with respect to some probability distance functions and to drive their radii to zero as N grows. Examples include the φ-divergence ambiguity sets proposed by Ben-Tal et al. (2013) or the Wasserstein ambiguity sets studied by Mohajerin Esfahani and Kuhn (2018) and Zhao and Guan (2018). Blanchet and Murthy (2019) and Gao and Kleywegt (2016) consider generalized Wasserstein ambiguity sets defined over Polish spaces. In this paper we investigate distributionally robust learning models with Wasserstein ambiguity sets. The Wasserstein distance between two distributions is defined as the minimum cost of transporting one distribution to the other, where the cost of moving a unit point mass is determined by the ground metric on the space of uncertainty realizations. In computer science the Wasserstein distance is therefore sometimes aptly termed the earth mover s distance (Rubner et al. (2000)). Following Mohajerin Esfahani and Kuhn (2018), we define Wasserstein ambiguity sets as balls with respect to the Wasserstein distance that are centered at the empirical distribution on the training samples. These ambiguity sets contain all (continuous or discrete) distributions that can be converted to the (discrete) empirical distribution at bounded transportation cost. Wasserstein distances are widely used in machine learning to compare histograms. For example, Rubner et al. (2000) use the Wasserstein distance as a metric for image retrieval with a focus on applications to color and texture. Cuturi (2013) and Benamou et al. (2015) propose fast iterative algorithms to compute a regularized Wasserstein distance between two high-dimensional discrete distributions for image classification tasks. Moreover, Cuturi and Doucet (2014) develop first-order algorithms to compute the Wasserstein barycenter between several empirical probability distributions, which has applications in clustering. Arjovsky et al. (2017) utilize the Wasserstein distance to measure the distance between the data distribution and the model distribution in generative adversarial networks. Furthermore, Frogner et al. (2015) propose a learning algorithm based on the Wasserstein distance to predict multi-label outputs. Distributionally robust optimization models with Wasserstein ambiguity sets were introduced to the realm of supervised learning by Shafieezadeh-Abadeh et al. (2015), who show that distributionally robust logistic regression problems admit a tractable reformulation and encapsulate the classical as well as the popular regularized logistic regression problems as special cases. When the Wasserstein ball is restricted to distributions on a compact set, the problem becomes intractable but can still be addressed with an efficient decomposition Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani algorithm due to Luo and Mehrotra (2019). Support vector machine models with distributionally robust chance constraints over Wasserstein ambiguity sets are studied by Lee and Mehrotra (2015). These models are equivalent to hard semi-infinite programs and can be solved approximately with a cutting plane algorithm. Wasserstein ambiguity sets are popular for their attractive statistical properties. For example, Fournier and Guillin (2015) prove that the empirical distribution on N training samples converges in Wasserstein distance to the true distribution at rate O(N 1/(n+1)), where n denotes the feature dimension. This implies that properly scaled Wasserstein balls constitute natural confidence regions for the data-generating distribution. The worstcase expected prediction loss over all distributions in a Wasserstein ball thus provides an upper confidence bound on the expected loss under the unknown true distribution; see Mohajerin Esfahani and Kuhn (2018). Blanchet et al. (2016) show, however, that radii of the order O(N 1/2) are asymptotically optimal even though the corresponding Wasserstein balls are too small to contain the true distribution with constant confidence. For Wasserstein distances of type two (where the transportation cost equals the squared ground metric) Blanchet et al. (2017) develop a systematic methodology for selecting the ground metric. Generalization bounds for the worst-case prediction loss with respect to a Wasserstein ball centered at the true distribution are derived by Lee and Raginsky (2018) in order to address emerging challenges in domain adaptation problems, where the distributions of the training and test samples can differ. This paper extends the results by Shafieezadeh-Abadeh et al. (2015) on distributionally robust logistic regression along several dimensions. Our main contributions can be summarized as follows: Tractability: We propose data-driven distributionally robust regression and classification models that hedge against all input-output distributions in a Wasserstein ball. We demonstrate that the emerging semi-infinite optimization problems admit equivalent reformulations as tractable convex programs for many commonly used loss functions and for spaces of linear hypotheses. We also show that lifted variants of these new learning models are kernelizable and thus offer an efficient procedure for optimizing over all nonlinear hypotheses in a reproducible kernel Hilbert space. Finally, we study distributionally robust learning models over families of feed-forward neural networks. We show that these models can be approximated by regularized empirical loss minimization problems with a convex regularization term and can be addressed with a stochastic proximal gradient descent algorithm. Probabilistic Interpretation of Existing Regularization Techniques: We show that the classical regularized learning models emerge as special cases of our framework when the cost of moving probability mass along the output space tends to infinity. In this case, the regularization function and its regularization weight are determined by the transportation cost on the input space and the radius of the Wasserstein ball underlying the distributionally robust optimization model, respectively. Generalization Bounds: We demonstrate that the proposed distributionally robust learning models enjoy new generalization bounds that can be obtained under minimal assumptions. In particular, they do not rely on any notions of hypothesis complexity Regularization via Mass Transportation and may therefore even extend to hypothesis spaces with infinite VC-dimensions. A na ıve generalization bound is obtained by leveraging modern measure concentration results, which imply that Wasserstein balls constitute confidence sets for the unknown data-generating distribution. Unfortunately, this generalization bound suffers from a curse of dimensionality and converges slowly for high input dimensions. By imposing bounds on the hypothesis space, however, we can derive an improved generalization bound, which essentially follows a dimension-independent square root law reminiscent of the central limit theorem. Relation to Robust Optimization: In classical robust regression and classification the training samples are viewed as uncertain variables that range over a joint uncertainty set, and the best hypothesis is found by minimizing the worst-case loss over this set. We prove that the classical robust and new distributionally robust learning models are equivalent if the data satisfies a dispersion condition (for regression) or a separability condition (for classification). While there is no efficient algorithm for solving the robust learning models in the absence of this condition, the distributionally robust models are efficiently solvable irrespective of the underlying training datasets. Confidence Intervals for Error and Risk: Using distributionally robust optimization techniques based on the Wasserstein ball, we develop two tractable linear programs whose optimal values provide a confidence interval for the absolute prediction error of any fixed regressor or the misclassification risk of any fixed classifier. Worst-Case Distributions: We formulate tractable convex programs that enable us to efficiently compute a worst-case distribution in the Wasserstein ball for any fixed hypothesis. This worst-case distribution can be useful for stress tests or contamination experiments. The rest of the paper develops as follows. Section 2 introduces our new distributionally robust learning models. Section 3 provides finite convex reformulations for learning problems over linear and nonlinear hypothesis spaces and describes efficient procedures for constructing worst-case distributions. Moreover, it compares the new distributionally robust method against existing robust optimization and regularization approaches. Section 4 develops new generalization bounds, while Section 5 addresses error and risk estimation. Numerical experiments are reported in Section 6. All proofs are relegated to the appendix. 1.1. Notation Throughout this paper, we adopt the conventions of extended arithmetics, whereby 0 = 0 = 0/0 = 0 and = + = 1/0 = . The inner product of two vectors x, x Rn is denoted by x, x , and for any norm on Rn, we use to denote its dual norm defined through x = sup { x, x : x 1}. The conjugate of an extended real-valued function f(x) on Rn is defined as f (x) = supx x, x f(x ). The indicator function of a set X Rn is defined as δX(x) = 0 if x X; = otherwise. Its conjugate SX(x) = sup{ x , x : x X} is termed the support function of X. The characteristic function of X is defined through 1X(x) = 1 if x X; = 0 otherwise. For a proper cone C Rn the relation x C x indicates that x x C. The cone dual to C is defined as Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani C = {x : x , x 0 x C}. The Lipschitz modulus of a function L : X R is denoted by lip(L) = supx,x X{|L(x) L(x )|/ x x : x = x }. If P is a distribution on a set Ξ, then PN denotes the N-fold product of P on the Cartesian product ΞN. For N N we define [N] = {1, . . . , N}. A list of commonly used notations is provided in the following table. X input space Y output space ℓ loss function L univariate loss function H hypothesis space K kernel matrix f conjugate of f lip(f) Lipschitz modulus of f C dual cone of C dual norm of SX support function of X δX indicator function of X 1X characteristic function of X [N] {1, . . . , N} 2. Problem Statement We first introduce the basic terminology and then describe our new perspective on regularization. 2.1. Classical Statistical Learning The goal of supervised learning is to infer an unknown target function f : X Y from limited data. The target function maps any input x X (e.g., information on the frequency of certain keywords in an email) to some output y Y (e.g., a label +1 ( 1) if the email is likely (unlikely) to be a spam message). If the true target function was accessible, it could be used as a means to reliably predict outputs from inputs (e.g., it could be used to recognize spam messages in an automated fashion). In a supervised learning framework, however, one has only access to finitely many input-output examples (bxi, byi) for i = 1, . . . , N (e.g., a database of emails which have been classified by a human as legitimate or as spam messages). We will henceforth refer to these examples as the training data or the in-sample data. It is assumed that the training samples are mutually independent and follow an unknown distribution P on X Y. The supervised learning problems are commonly subdivided into regression problems, where the output y is continuous and Y = R, and classification problems, where y is categorical and Y = {+1, 1}. As the space of all functions from X to Y is typically vast, it may be very difficult to learn the target function from finitely many training samples. Thus, it is convenient to restrict the search space to a structured family of candidate functions H RX such as the space of all linear functions, some reproducible kernel Hilbert space or the family of all feed-forward neural networks with a fixed number of layers. We henceforth refer to each candidate function h H as a hypothesis and to H as the hypothesis space. A learning algorithm is a method for finding a hypothesis h H that faithfully replicates the unknown target function f. Specifically, in regression we seek to approximate f with a hypothesis h, and in classification we seek to approximate f with a thresholded hypothesis sign(h). Many learning algorithms achieve this goal by minimizing the in-sample error, that is, the empirical average of a loss function ℓ: R Y R+ that estimates the mismatch Regularization via Mass Transportation between the output predicted by h(x) and the actual output y for a particular input-output pair (x, y). Any such algorithm solves a minimization problem of the form i=1 ℓ(h(bxi), byi) = E b PN [ℓ(h(x), y)] , (1) where b PN = 1 N PN i=1 δ(ˆxi,ˆyi) denotes the empirical distribution, that is, the uniform distribution on the training data. For different choices of the the loss function ℓ, the generic supervised learning problem (1) reduces to different popular regression and classification problems from the literature. Examples of Regression Models For ease of exposition, we focus here on learning models with X Rn and Y R, where H is set to the space of all linear hypotheses h(x) = w, x with w Rn. Thus, there is a one-to-one correspondence between hypotheses and weight vectors w. Moreover, we focus on loss functions of the form ℓ(h(x), y) = L(h(x) y) = L( w, x ) y) that are generated by a univariate loss function L. 1. A rich class of robust regression problems is obtained from (1) if ℓis generated by the Huber loss function with robustness parameter δ > 0, which is defined as L(z) = 1 2z2 if |z| δ; = δ(|z| 1 2δ) otherwise. Note that the Huber loss function is both convex and smooth and reduces to the squared loss L(z) = 1 2z2 for δ , which is routinely used in ordinary least squares regression. Problem (1) with squared loss seeks a hypothesis w under which w, x approximates the mean of y conditional on x. The Huber loss function for finite δ favors similar hypotheses but is less sensitive to outliers. 2. The support vector regression problem (Smola and Sch olkopf, 2004) emerges as a special case of (1) if ℓis generated by the ϵ-insensitive loss function L(z) = max{0, |z| ϵ} with ϵ 0. In this setting, a training sample (bxi, byi) is penalized in (1) only if the output w, bxi predicted by hypothesis w differs from the true output byi by more than ϵ. Support vector regression thus seeks hypotheses w under which all training samples reside within a slab of width 2ϵ centered around the hyperplane {(x, y) : w, x = y}. 3. The quantile regression problem (Koenker, 2005) is obtained from (1) if ℓis generated by the pinball loss function L(z) = max{ τz, (1 τ)z} defined for τ [0, 1]. Quantile regression seeks hypotheses that approximate the τ 100%-quantile of the output conditional on the input. More precisely, it seeks hypotheses w for which τ 100% of all training samples lie in the halfspace {(x, y) : w, x y}. Examples of Classification Models We focus here on linear learning models with X Rn and Y = {+1, 1}, where H is again identified with the space of all linear hypotheses h(x) = w, x with w Rn. Moreover, we focus on loss functions of the form ℓ(x, y) = L(yh(x)) = L(y w, x ) generated by a univariate loss function L. Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani 1. The support vector machine problem (Cortes and Vapnik, 1995) is obtained from (1) if ℓis generated by the hinge loss function L(z) = max{0, 1 z}, which is large if z is small. Thus, a training sample (bxi, byi) is penalized in (1) if the output sign( w, bxi ) predicted by hypothesis w and the true output byi have opposite signs. More precisely, support vector machines seek hypotheses w under which the inputs of all training samples with output +1 reside in the halfspace {x : w, x 1}, while the inputs of training samples with output 1 are confined to {x : w, x 1}. 2. An alternative support vector machine problem is obtained from (1) if ℓis generated by the smooth hinge loss function, which is defined as L(z) = 1 2 z if z 0; = 1 if 0 < z < 1; = 0 otherwise. The smooth hinge loss inherits many properties of the ordinary hinge loss but has a continuous derivative. Thus, it may be amenable to faster optimization algorithms. 3. The logistic regression problem (Hosmer et al., 2013) emerges as a special case of (1) if ℓis generated by the logloss function L(z) = log(1 + e z), which is large if z is small similar to the hinge loss function. In this case the objective function of (1) can be viewed as the log-likelihood function corresponding to the logistic model P(y = 1|x) = [1 + exp( w, x )] 1 for the conditional probability of y = 1 given x. Thus, logistic regression allows us to learn the conditional distribution of y given x. Remark 1 (Convex approximation) Note that the hinge loss and the logloss functions represent convex approximations for the (discontinuous) one-zero loss defined through L(z) = 1 if z 0; = 0 otherwise. In practice there may be many hypotheses that are compatible with the given training data and thus achieve a small empirical loss in (1). Any such hypothesis would accurately predict outputs from inputs within the training dataset (Defourny, 2010). However, due to overfitting, these hypotheses might constitute poor predictors beyond the training dataset, that is, on inputs that have not yet been recorded in the database. Mathematically, even if the in-sample error Eb PN[ℓ( w, x , y)] of a given hypothesis w is small, the out-of-sample error EP[ℓ( w, x , y)] with respect to the unknown true input-output distribution P may be large. Regularization is the standard remedy to combat overfitting. Instead of na ıvely minimizing the in-sample error as is done in (1), it may thus be advisable to solve the regularized learning problem inf w E b PN [ℓ( w, x , y)] + c Ω(w), (2) which minimizes the sum of the emiprial average loss and a penalty for hpothesis complexity, which consists of a regularization function Ω(w) and its associated regularization weight c. Tikhonov regularization (Tikhonov et al., 1977), for example, corresponds to the choice Ω(w) = Γw 2 2 for some Tikhonov matrix Γ Rn n. Setting Γ to the identity matrix gives rise to standard L2-regularization. Similarly, Lasso (least absolute shrinkage and selection operator) regularization or L1-regularization (Tibshirani, 1996) is obtained by setting Ω(w) = w 1. Lasso regularization has gained popularity because it favors parsimonious interpretable hypotheses. Regularization via Mass Transportation Most popular regualization methods admit probabilistic interpretations. However, these interpretations typically rely on prior distributional assumptions that remain to some extent arbitrary (e.g., L2and L1-regularization can be justified if w is governed by a Gaussian or Laplacian prior distribution, respectively (Tibshirani, 1996)). Thus, in spite of their many desirable theoretical properties, there is a consensus that most of the (regularization) methods used successfully in practice are heuristic methods (Abu-Mostafa et al., 2012). 2.2. A New Perspective on Regularization When linear hypotheses are used, problem (1) minimizes the in-sample error Eb PN [ℓ( w, x , y)]. However, a hypothesis w enjoying a low in-sample error may still suffer from a high out-ofsample error EP[ℓ( w, x , y)] due to overfitting. This is unfortunate as we seek hypotheses that offer high prediction accuracy on future data, meaning that the out-of-sample error is the actual quantity of interest. An ideal learning model would therefore minimize the out-of-sample error. This is impossible, however, for the following reasons: The true input-output distribution P is unknown and only indirectly observable through the N training samples. Thus, we lack essential information to compute the out-of-sample error. Even if the distribution P was known, computing the out-of-sample error would typically be hard due to the intractability of high-dimensional integration; see, e.g., (Hanasusanto et al., 2016, Corollary 1). The regularized loss Eb PN [ℓ( w, x , y)] + c Ω(w) used in (2), which consists of the in-sample error and an overfitting penalty, can be viewed as an in-sample estimate of the out-of-sample error. Yet, problem (2) remains difficult to justify rigorously. Therefore, we advocate here a more principled approach to regularization. Specifically, we propose to take into account the expected loss of hypothesis w under every distribution Q that is close to the empirical distribution b PN, that is, every Q that could have generated the training data with high confidence. To this end, we first introduce a distance measure for distributions. For ease of notation, we henceforth denote the input-output pair (x, y) by ξ, and we set Ξ = X Y. Definition 2 (Wasserstein metric) The Wasserstein distance between two distributions Q and Q supported on Ξ is defined as W(Q, Q ) := inf Π Ξ2 d(ξ, ξ ) Π(dξ, dξ ) : Π is a joint distribution of ξ and ξ with marginals Q and Q , respectively where d is a metric on Ξ. By definition, W(Q, Q ) represents the solution of an infinite-dimensional transportation problem, that is, it corresponds to the minimal cost for moving the distribution Q to Q , where the cost for moving a unit probability mass from ξ to ξ is given by the transportation distance d(ξ, ξ ). Due to this interpretation, the metric d is often referred to as the transportation cost (Villani, 2008) or ground metric (Cuturi and Avis, 2014), while the Wasserstein metric is sometimes termed the mass transportation distance or earth mover s distance (Rubner et al., 2000). Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani Consider now the Wasserstein ball of radius ρ 0 around the empirical distribution b PN, Bρ(b PN) = Q : Q(Ξ) = 1, W(Q, b PN) ρ , (3) which contains all input-output distributions Q supported on Ξ whose Wasserstein distance from b PN does not exceed ρ. This means that Q can be transported to b PN (or vice versa) at a cost of at most ρ. The hope is that a large enough Wasserstein ball will contain distributions that are representative of the unknown true input-output distribution P, such that the worst-case expectation sup Q Bρ(b PN) EQ[ℓ( w, x , y)] can serve as an upper confidence bound on the out-of-sample error EP[ℓ( w, x , y)]. This motivates us to introduce a new regularized learning model, which minimizes precisely this worst-case expectation. inf w sup Q Bρ(b PN) EQ [ℓ( w, x , y)] (4) Problem (4) represents a distributionally robust convex program of the type considered in (Mohajerin Esfahani and Kuhn, 2018). Note that if ℓ( w, x , y) is convex in w for every fixed (x, y), i.e., if ℓis convex in its first argument, then the objective function of (4) is convex because convexity is preserved under integration and maximization. Note also that if ρ is set to zero, then (4) collapses to the unregularized in-sample error minimization problem (1). Remark 3 (Support information) The uncertainty set Ξ captures prior information on the range of the inputs and outputs. In image processing, for example, pixel intensities range over a known interval. Similarly, in diagnostic medicine, physiological parameters such as blood glucose or cholesterol concentrations are restricted to be non-negative. Sometimes it is also useful to construct Ξ as a confidence set that covers the support of P with a prescribed probability. Such confidence sets are often constructed as ellipsoids, as intersections of different norm balls (Ben-Tal et al., 2009; Delage and Ye, 2010) or as sublevel sets of kernel expansions (Sch olkopf et al., 2001). In the remainder we establish that the distributionally robust learning problem (4) has several desirable properties. (i) Problem (4) is computationally tractable under standard assumptions about the loss function ℓ, the input-output space Ξ and the transportation metric d. For specific choices of d it even reduces to a regularized learning problem of the form (2). (ii) For all univariate loss functions reviewed in Section 2.1, a tight conservative approximation of (4) is kernelizable, that is, it can be solved implicitly over high-dimensional spaces of nonlinear hypotheses at the same computational cost required for linear hypothesis spaces. (iii) Leveraging modern measure concentration results, the optimal value of (4) can be shown to provide an upper confidence bound on the out-of-sample error. This obviates the need to mobilize the full machinery of VC theory and, in particular, to estimate the VC dimension of the hypothesis space in order to establish generalization bounds. (iv) If the number of training samples tends to infinity while the Wasserstein ball shrinks at an appropriate rate, then problem (4) asymptotically recovers the ex post optimal hypothesis that attains the minimal out-of-sample error. Regularization via Mass Transportation 3. Tractable Reformulations In this section we demonstrate that the distributionally robust learning problem (4) over linear hypotheses is amenable to efficient computational solution procedures. We also discuss generalizations to nonlinear hypothesis classes such as reproducing kernel Hilbert spaces and families of feed-forward neural networks. 3.1. Distributionally Robust Linear Regression Throughout this section we focus on linear regression problems, where ℓ( w, x , y) = L( w, x y) for some convex univariate loss function L. We also assume that X and Y are both convex and closed and that the transportation metric d is induced by a norm on the input-output space Rn+1. In this setting, the distributionally robust regression problem (4) admits an equivalent reformulation as a finite convex optimization problem if either (i) the univariate loss function L is piecewise affine or (ii) Ξ = Rn+1 and L is Lipschitz continuous (but not necessarily piecewise affine). Theorem 4 (Distributionally robust linear regression) The following statements hold. (i) If L(z) = maxj J{ajz + bj}, then (4) is equivalent to inf w,λ,si pij,uij λρ + 1 s.t. SΞ(ajw pij, aj uij) + bj + pij, bxi + uijbyi si i [N], j [J] (pij, uij) λ i [N], j [J], (5) where SΞ denotes the support function of Ξ. (ii) If Ξ = Rn+1 and L(z) is Lipschitz continuous, then (4) is equivalent to i=1 L( w, bxi byi) + ρ lip(L) (w, 1) . (6) In the following, we exemplify Theorem 4 for the Huber, ϵ-insensitive and pinball loss functions under the assumption that the uncertainty set Ξ admits the conic representation Ξ = {(x, y) Rn+1 : C1x + c2y C d} (7) for some matrix C1, vectors c2 and d and proper convex cone C of appropriate dimensions. We also assume that Ξ admits a Slater point (x S, y S) Rn+1 with C1x S + c2y S C d. Corollary 5 (Robust regression) If L represents the Huber loss function with threshold δ 0 and Ξ = Rn+1, then (4) is equivalent to inf w,zi 1 N 1 2z2 i + δ| w, bxi byi zi| + ρ δ (w, 1) . (8) Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani Corollary 6 (Support vector regression) If L represents the ϵ-insensitive loss function for some ϵ 0 and Ξ is of the form (7), then (4) is equivalent to inf w,λ,si p+ i ,p i s.t. byi w, bxi ϵ + p+ i , d C1bxi c2byi si i [N] w, bxi byi ϵ + p i , d C1bxi c2byi si i [N] C 1 p+ i + w, c 2 p+ i 1 λ i [N] C 1 p i w, c 2 p i + 1 λ i [N] p+ i , p i C i [N] si 0 i [N]. Corollary 7 (Quantile regression) If L represents the pinball loss function for some τ [0, 1] and Ξ is of the form (7), then (4) is equivalent to inf w,λ,si p+ i ,p i s.t. τ (byi w, bxi ) + p+ i , d C1bxi c2byi si i [N] (1 τ) ( w, bxi byi) + p i , d C1bxi c2byi si i [N] C 1 p+ i + τw, c 2 p+ i τ λ i [N] C 1 p i (1 τ)w, c 2 p i + 1 τ λ i [N] p+ i , p i C i [N] si 0 i [N]. Remark 8 (Relation to classical regularization) Assume now that the mass transportation costs are additively separable with respect to inputs and outputs, that is, d((x1, y1), (x2, y2)) = x1 x2 + κ|y1 y2| (11) for some κ > 0.1 Note that κ captures the costs of moving probability mass along the output space. For κ = all distributions in the Wasserstein ball Bρ(b PN) are thus obtained by reshaping b PN only along the input space. It is easy to verify that for κ = and Ξ = Rn+1 the learning models portrayed in Corollaries 5-7 all simplify to i=1 L( w, bxi byi) + c w , (12) where c = ρδ for robust regression with Huber loss, c = ρ for support vector regression with ϵ-insensitive loss and c = max{τ, 1 τ} ρ for quantile regression with pinball loss. Thus, (12) is easily identified as an instance of the classical regularized learning problem (2), where the dual norm term w plays the role of the regularization function, while c represents 1. By slight abuse of notation, the symbol now denotes a norm on Rn. Regularization via Mass Transportation the usual regularization weight. By definition of the dual norm, the penalty w assigned to a hypothesis w is maximal (minimal) if the cost of moving probability mass along w is minimal (maximal). We emphasize that if κ = , then the marginal distribution of y corresponding to every Q Bρ(b PN) coincides with the empirical distribution 1 N PN i=1 δbyi. Thus, classical regularization methods, which correspond to κ = , are explained by a counterintuitive probabilistic model, which pretends that any training sample must have an output that has already been recordeded in the training dataset. In other words, classical regularization implicitly assumes that there is no uncertainty in the outputs. More intuitively appealing regularization schemes are obtained for finite values of κ. To establish a connection between distributionally robust and classical robust regression as discussed in (El Ghaoui and Lebret, 1997; Xu et al., 2010), we further investigate the worst-case expected loss of a fixed linear hypothesis w. sup Q Bρ(b PN) EQ[L( w, x y)] (13) Theorem 9 (Extremal distributions in linear regression) The following statements hold. (i) If L(z) = maxj J{ajz + bj}, then the worst-case expectation (13) coincides with sup αij,qij,vij j=1 αij aj( w, bxi byi) + bj + aj( w, qij vij) j=1 (qij, vij) ρ j=1 αij = 1 i [N] (bxi + qij/αij, byi + vij/αij) Ξ i [N], j [J] αij 0 i [N], j [J] for any fixed hypothesis w. Moreover, if (α ij, q ij, v ij) maximizes (14), then the discrete distribution j=1 α ij δ(bxi+q ij/α ij, byi+v ij/α ij), represents a maximizer for (13). (ii) If Ξ = Rn+1 and L(z) is Lipschitz continuous, then the discrete distributions i=2 δ(bxi, byi) + 1 γ N δ(bx1, by1) + γ N δ(bx1+ ρN γ x ,by1+ ρN γ y ) for γ (0, 1], where (x , y ) solves max (x,y) 1 w, x y, are feasible and asymptotically optimal in (13) for γ 0. Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani Recall that 0/0 = 0 and 1/0 = by our conventions of extended arithmetic. Thus, any solution feasible in (14) with αij = 0 must satisfy qij = 0 and vij = 0 because otherwise (bxi + qij/αij, byi + vij/αij) / Ξ. Theorem 9 shows how one can use convex optimization to construct a sequence of distributions that are asymptotically optimal in (13). Next, we argue that the worstcase expected cost (13) is equivalent to a (robust) worst-case cost over a suitably defined uncertainty set if the following assumption holds. Assumption 10 (Minimal dispersion) For every w Rn there is a training sample (bxk, byk) for some k N such that the derivative L exists at w, bxk byk and satisfies |L ( w, bxk byk)| = lip(L). Remark 11 (Minimal dispersion) Assumption 10 is reminiscent of the non-separability condition in (Xu et al., 2009, Theorem 3), which is necessary to prove the equivalence of robust and regularized support vector machines. In the regression context studied here, Assumption 10 ensures that, for every w, there exists a training sample that activates the largest absolute slope of L. For instance, in support vector regression, it means that for every w there exists a data point outside of the slab of width 2ϵ/ (w, 1) 2 centered around the hyperplane Hw = {(x, y) Rn R : w, x y = 0} (i.e., the empirical ϵ-insensitive loss is not zero). Similarly, in robust regression with the Huber loss function, Assumption 10 stipulates that for every w there exists a data point outside of the slab of width 2δ/ (w, 1) 2 centered around Hw. However, quantile regression with τ = 0.5 fails to satisfy Assumption 10. Indeed, for any training dataset there always exists some w such that all data points reside on the side of Hw where the pinball loss function is less steep. Theorem 12 (Robust regression) If Ξ = Rn+1 and the loss function L(z) is Lipschitz continuous, then the worst-case expected loss (13) provides an upper bound on the (robust) worst-case loss i=1 [L( w, bxi + xi byi yi)] i=1 ( xi, yi) ρ. Moreover, if Assumption 10 holds, then (13) and (15) are equal. Remark 13 (Tractability of robust regression) Assume that Ξ = Rn+1, while L and both admit a tractable conic representation. By Theorem 4, the worst-case expected loss (13) can then be computed in polynomial time by solving a tractable convex program. Theorem 12 thus implies that the worst-case loss (28) can also be computed in polynomial time if Assumption 10 holds. To our best knowledge, there exists no generic efficient method for computing (28) if Assumption 10 fails to hold and L is not piecewise affine. This reinforces our belief that a distributionally robust approach to regression is more natural. Regularization via Mass Transportation 3.2. Distributionally Robust Linear Classification Throughout this section we focus on linear classification problems, where ℓ( w, x , y) = L(y w, x ) for some convex univariate loss function L. We also assume that X is both convex and closed and that Y = {+1, 1}. Moreover, we assume that the transportation metric d is defined via d((x, y), (x , y )) = x x + κ1{y =y }, (16) where represents a norm on the input space Rn, and κ > 0 quantifies the cost of switching a label. In this setting, the distributionally robust classification problem (4) admits an equivalent reformulation as a finite convex optimization problem if either (i) the univariate loss function L is piecewise affine or (ii) X = Rn and L is Lipschitz continuous (but not necessarily piecewise affine). Theorem 14 (Distributionally robust linear classification) The following statements hold. (i) If L(z) = maxj J{ajz + bj}, then (4) is equivalent to inf w,λ,si p+ ij,p ij s.t. SX(ajbyiw p+ ij) + bj + p+ ij, bxi si i [N], j [J] SX( ajbyiw p ij) + bj + p ij, bxi κλ si i [N], j [J] p+ ij λ, p ij λ i [N], j [J], where SX denotes the support function of X. (ii) If X = Rn and L is Lipschitz continuous, then (4) is equivalent to inf w,λ,si λρ + 1 s.t. L(byi w, bxi ) si i [N] L( byi w, bxi ) κλ si i [N] lip(L) w λ. In the following, we exemplify Theorem 14 for the hinge loss, logloss and smoothed hinge loss functions under the assumption that the input space X admits the conic representation X = {x Rn : Cx C d} (19) for some matrix C, vector d and proper convex cone C of appropriate dimensions. We also assume that X admits a Slater point x S Rn with Cx S C d. Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani Corollary 15 (Support vector machine) If L represents the hinge loss function and X is of the form (19), then (4) is equivalent to inf w,λ si,p+ i ,p i s.t. 1 byi ( w, bxi ) + p+ i , d C bxi si i [N] 1 + byi ( w, bxi ) + p i , d C bxi κλ si i [N] C p+ i + byiw λ, C p i byiw λ i [N] si 0, p+ i , p i C i [N]. Corollary 16 (Support vector machine with smooth hinge loss) If L represents the smooth hinge loss function and X = Rn, then (4) is equivalent to min w,λ,si z+ i ,z i ,t+ i ,t i s.t. 1 2(z+ i byi w, bxi )2 + t+ i si i [N] 1 2(z i + byi w, bxi )2 + t i κλ si i [N] 1 z+ i t+ i , 1 z i t i i [N] t+ i , t i 0 i [N] w λ. Corollary 17 (Logistic regression) If L represents the logloss function and X = Rn, then (4) is equivalent to min w,λ,si λρ + 1 s.t. log 1 + exp byi w, bxi si i [N] log 1 + exp byi w, bxi κλ si i [N] Remark 18 (Relation to classical regularization) If X = Rn and the weight parameter κ in the transportation metric (16) is set to infinity, then the learning problems portrayed in Corollaries 15 17 all simplify to i=1 L(byi w, bxi ) + ρ w . (23) Thus, in analogy to the case of regression, (23) reduces to an instance of the classical regularized learning problem (2), where the dual norm term w plays the role of the regularization function, while the Wasserstein radius ρ represents the usual regularization weight. Note that if κ = , then mass transportation along the output space is infinitely expensive, that is, any distribution Q Bρ(b PN) can smear out the training samples along Regularization via Mass Transportation X, but it cannot flip outputs from +1 to 1 or vice versa. Thus, classical regularization schemes, which are recovered for κ = , implicitly assume that output measurements are exact. As this belief is not tenable in most applications, an approach with κ < may be more satisfying. We remark that alternative approaches for learning with noisy labels have previously been studied by Lawrence and Sch olkopf (2001), Natarajan et al. (2013), and Yang et al. (2012). Remark 19 (Relation to Tikhonov regularization) The learning problem i=1 L(byi w, bxi ) + c w 2 2 (24) with Tikhonov regularizer enjoys wide popularity. If L represents the hinge loss, for example, then (24) reduces to the celebrated soft margin support vector machine problem. However, the Tikhonov regularizer appearing in (24) is not explained by a distributionally robust learning problem of the form (4). It is known, however, that (23) with = 2 and (24) are equivalent in the sense that for every ρ 0 there exists c 0 such that the solution of (23) also solves (24) and vice versa (Xu et al., 2009, Corollary 6). To establish a connection between distributionally robust and classical robust classification as discussed in (Xu et al., 2009), we further investigate the worst-case expected loss of a fixed linear hypothesis w. sup Q Bρ(b PN) EQ[L(y w, x )] (25) Theorem 20 (Extremal distributions in linear classification) The following statements hold. (i) If L(z) = maxj J{ajz + bj}, then the worst-case expectation (25) coincides with sup α+ ij,α ij q+ ij,q ij j=1 (α+ ij α ij)ajbyi w, bxi + ajbyi w, q+ ij q ij + j=1 q+ ij + q ij + κα ij Nρ j=1 α+ ij + α ij = 1 i [N] bxi + q+ ij/α+ ij X, bxi + q ij/α ij X i [N], j [J] α+ ij, α ij 0 i [N], j [J] for any fixed w. Also, if (α+ ij , α ij , q+ ij , q ij ) maximizes (26), then the discrete distribution j=1 α+ ij δ(bxi q+ ij /α+ ij , byi) + α ij δ(bxi q ij /α ij , byi) represents a maximizer for (25). Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani (ii) If X = Rn, then the worst-case expectation (25) coincides with the optimal value of sup αi,θ lip(L) (w) θ + 1 i=1 (1 αi)L(byi w, bxi ) + αi L( byi w, bxi ) i=1 αi = ρ γ 0 αi 1 i [N] θ 0 for γ = 0. Moreover, if (α i (γ), θ (γ)) maximizes (27) for γ > 0, η(γ) = γ/(θ (γ) + κ ρ + γ + 1) and x solves maxx{ w, x : x 1}, then the discrete distributions i=2 (1 α i (γ)) δ(bxi, byi) + α i (γ) δ(bxi, byi) + η(γ) N δ(bx1+ θ (γ)N η(γ) x , by1) h (1 α 1(γ)) δ(bx1, by1) + α 1(γ) δ(bx1, by1) i for γ [0, min{ρ, 1}] are feasible and asymptotically optimal in (25) for γ 0. Theorem 20 shows how one can use convex optimization to construct a sequence of distributions that are asymptotically optimal in (25). Next, we show that the worstcase expected cost (25) is equivalent to a (robust) worst-case cost over a suitably defined uncertainty set if the following assumption holds. Assumption 21 (Non-separability) For every w Rn there is a training sample (bxk, byk) for some k N such that the derivative L exists at byk w, bxk and satisfies |L (byk w, bxk )| = lip(L). Remark 22 (Non-separability) Assumption 21 generalizes the non-separability condition in (Xu et al., 2009, Theorem 3) for the classical and smooth hinge loss functions to more general Lipschitz continuous losses. Note that, in the case of the hinge loss, Assumption 21 effectively stipulates that for any w there exists a training sample (bxk, byk) with byk w, bxk < 1, implying that the dataset cannot be perfectly separated by any linear hypothesis w. An equivalent requirement is that the empirical hinge loss is nonzero for every w. Similarly, in the case of the smooth hinge loss, Assumption 21 ensures that for any w there is a training sample with byk w, bxk < 0, which implies again that the dataset admits no perfect linear separation. Note, however, that the logloss fails to satisfy Assumption 21 as its steepest slope is attained at infinity. Theorem 23 (Robust classification) Suppose that X = Rn, the loss function L is Lipschitz continuous and the cost of flipping a label in the transportation metric (16) is set to κ = . Then, the worst-case expected loss (25) provides an upper bound on the (robust) worst-case loss i=1 L (byi w, bxi + xi ) Regularization via Mass Transportation Moreover, if Assumption 21 holds, then (25) and (28) are equal. Remark 24 (Tractability of robust classification) Assume that X = Rn, while L and both admit a tractable conic representation. By Theorem 14, the worst-case expected loss (25) can then be computed in polynomial time by solving a tractable convex program. Theorem 23 thus implies that the worst-case loss (28) can also be computed in polynomial time if Assumption 21 holds. This confirms Proposition 4 in (Xu et al., 2009). No efficient method for computing (28) is know if Assumption 21 fails to hold. 3.3. Nonlinear Hypotheses: Reproducing Kernel Hilbert Spaces We now generalize the learning models from Sections 3.1 and 3.2 to nonlinear hypotheses that range over a reproducing kernel Hilbert space (RKHS) H RX with inner product , H. By definition, H thus constitutes a complete metric space with respect to the norm H induced by the inner product, and the point evaluation h 7 h(x) of the functions h H represents a continuous linear functional on H for any fixed x X. The Riesz representation theorem then implies that for every x X there exists a unique function Φ(x) H such that h(x) = h, Φ(x) H for all h H. We henceforth refer to Φ : X H as the feature map and to k : X X R+ with k(x, x ) = Φ(x), Φ(x ) H as the kernel function. By construction, the kernel function is symmetric and positive definite, that is, the kernel matrix K RN N defined through Kij = k(xi, xj) is positive definite for all N N and {xi}i N X. By the Moore-Aronszajn theorem, any symmetric and positive definite kernel function k on X induces a unique RKHS H RX, which can be represented as h RX : βi R, xi X i N with h(x) = i=1 βik(xi, x) and j=1 βik(xi, xj)βj < where the inner product of two arbitrary functions h1, h2 H with h1(x) = P i=1 βik(xi, x) and h2(x) = P j=1 β jk(x j, x) is defined as h1, h2 H = P i=1 P j=1 βik(xi, x j)β j. One may now use the kernel function to define the feature map Φ through [Φ(x )](x) = k(x , x) for all x, x X. This choice is admissible because it respects the consistency condition Φ(x), Φ(x ) H = k(x, x ) for all x, x H, and because it implies the desired reproducing property f, Φ(x ) H = P i=1 βik(xi, x ) = f(x ) for all f H and x X. In summary, given a symmetric and positive definite kernel function k, there exists an associated RKHS H and a feature map Φ with the reproducing property. As we will see below, however, to optimize over nonlinear hypotheses in H, knowledge of k is sufficient, and there is no need to construct H and Φ explicitly. Assume now that we are given any symmetric and positive definite kernel function k, and construct a distributionally robust learning problem over all nonlinear hypotheses in the corresponding RKHS H via b J(ρ) = inf h H sup Q Bρ(b PN) EQ [ℓ(h(x), y)] , (29) Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani where the transportation metric is given by the Euclidean norm on X Y (for regression problems) or the separable metric (16) with the Euclidean norm on X (for classification problems). While problem (29) is hard to solve in general due to the nonlinearity of the hypotheses h H, it is easy to solve a lifted learning problem where the inputs x X are replaced with features x H H, while each nonlinear hypothesis h H over the input space X is identified with a linear hypothesis h H H over the feature space H through the identity h H(x H) = h, x H H. Thus, the lifted learning problem can be represented as b JH(ρ) = inf h H sup Q Bρ(b PH N) EQ [ℓ( h, x H H, y)] , (30) where b PH N = 1/N PN i=1 δ(Φ(bxi),byi) on H Y denotes the pushforward measure of the emprical distribution b PN under the feature map Φ induced by k, while Bρ(b PH N) constitutes the Wasserstein ball of radius ρ around b PH N corresponding to the transportation metric d H (x H, y), (x H, y ) = x H x H 2 H + (y y )2 for regression problems, x H x H H + κ1{y =y } for classification problems. Even though b PH N constitutes the pushforward measure of b PN under Φ, not every distribution QH Bρ(b PH N) can be obtained as the pushforward measure of some Q Bρ(b PN). Thus, we should not expect (29) to be equivalent to (30). Instead, one can show that under a judicious transformation of the Wasserstein radius, (30) provides an upper bound on (29) whenever the kernel function satisfies a calmness condition. Assumption 25 (Calmness of the kernel) The kernel function k is calm from above, that is, there exist a concave smooth growth function g : R+ R+ with g(0) = 0 and g (z) 1 for all z R+ such that p k(x1, x1) 2k(x1, x2) + k(x2, x2) g( x1 x2 2) x1, x2 X. The calmness condition is non-restrictive. In fact, it is satisfied by most commonly used kernels. Example 1 (Growth Functions for Popular Kernels) For most commonly used kernels k on X Rn, we can construct an explicit growth function g that certifies the calmness of k in the sense of Assumption 25. This construction typically relies on elementary estimates. Derivations are omitted for brevity. 1. Linear kernel: For k(x1, x2) = x1, x2 , we may set g(z) = z. 2. Gaussian kernel: For k(x1, x2) = e γ x1 x2 2 2 with γ > 0, we may set g(z) = max{ 2γ, 1}z. 3. Laplacian kernel: For k(x1, x2) = e γ x1 x2 1 with γ > 0, we may set g(z) = p 2γz n if 0 z γ n/2 and g(z) = z + γ n/2 otherwise. Regularization via Mass Transportation 4. Polynomial kernel: The kernel k(x1, x2) = (γ x1, x2 + 1)d with γ > 0 and d N fails to satisfy the calmness condition if X is unbounded and d > 1, in which case p k(x1, x1) 2k(x1, x2) + k(x2, x2) grows superlinearly. If X {x Rn : x 2 R} for some R > 0, however, the polynomial kernel is calm with respect to the growth function 2(γR2 + 1)d, 1}z if d is even, 2(γR2 + 1)d 2(1 γR2)d, 1}z if d is odd. Theorem 26 (Lifted learning problems) If Assumption 25 holds for some growth function g, then the following statements hold for all Wasserstein radii ρ 0. (i) For regression problems we have b J(ρ) b JH( (ii) For classification problems we have b J(ρ) b JH(g(ρ)). We now argue that the lifted learning problem (30) can be solved efficiently by leveraging the following representer theorem, which generalizes (Sch olkopf and Smola, 2001, Theorem 4.2) to non-separable loss functions. Theorem 27 (Representer theorem) Assume that we are given a symmetric positive definite kernel k on X with corresponding RKHS H, a set of training samples (bxi, byi) X Y, i N, and an arbitrary loss function f : (X Y R)N R+ R that is non-decreasing in its last argument. Then, there exist βi R, i N, such that the learning problem min h H f((bx1, by1, h(bx1)), . . . , (bx N, by N, h(bx N)), h H) (31) is solved by a hypothesis h H representable as h (x) = PN i=1 βik(x, bxi). The subsequent results involve the Kernel matrix K = [Kij] defined through Kij = k(bxi, bxj), i, j N. The following theorems demonstrate that the lifted learning problem (30) admits a kernel representation. Theorem 28 (Kernelized distributionally robust regression) Suppose that X = Rn, Y = R and k is a symmetric positive definite kernel on X with associated RKHS H. If ℓis generated by a convex and Lipschitz continuous loss function L, that is, ℓ(h(x), y) = L(h(x) y), then (30) is equivalent to min β RN 1 N j=1 Kijβj byi + ρ lip(L) (K 1 2 β, 1) 2, and for any of its minimizers β the hypothesis h (x) = PN i=1 β i k(x, bxi) is optimal in (30). Theorem 29 (Kernelized distributionally robust classification) Suppose that X = Rn, Y = {+1, 1} and k is a symmetric positive definite kernel on X = Rn with associated Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani RKHS H. If ℓis generated by a convex and Lipschitz continuous loss function L, that is, ℓ(h(x), y) = L(yh(x)), then (30) is equivalent to min βi,λ,si λρ + 1 j=1 byi Kijβj si i [N] j=1 byi Kijβj κλ si i [N] lip(L) K 1 2 β 2 λ, and for any of its minimizers β the hypothesis h (x) = PN i=1 β i k(x, bxi) is optimal in (30). Theorems 28 and 29 show that the lifted learning problem (30) can be solved with similar computational effort as problem (4), that is, optimizing over a possibly infinite-dimensional RKHS of nonlinear hypotheses is not substantially harder than optimizing over the space of linear hypotheses. Remark 30 (Kernelization in robust regression and classification) Recall from Theorem 12 that distributionally robust and classical robust linear regression are equivalent if Ξ = Rn+1 and the training samples are sufficiently dispersed in the sense of Assumption 10. Similarly, Theorem 23 implies that distributionally robust and classical robust linear classification are equivalent if κ = and the training samples are non-separable in the sense of Assumption 21. One can show that Theorems 12 and 23 naturally extend to nonlinear regression and classification models over an RKHS induced by some symmetric and positive definite kernel. Specifically, one can show that some lifted robust learning problem is equivalent to the lifted distributionally robust learning problem (30) whenever the lifted training samples (Φ(bx1), by1), , (Φ(bx N), by N) satisfy Assumption 10 (for regression) or 21 (for classification). Theorems 28 and 29 thus imply that the lifted robust regression and classification problems can be solved efficiently under mild regularity conditions whenever Assumptions 10 and 21 hold, respectively. Unfortunately, these conditions are often violated for popular kernels. For example, the lifted samples are always linearly separable under the Gaussian kernel (Xu et al., 2009, p. 1496). In this case, the lifted robust classification problem can never be reduced to an efficiently solvable lifted distributionally robust classification problem of the form (30). In fact, no efficient method for solving the lifted robust classification problem seems to be known. In contrast, the lifted distributionally robust learning problems are always efficiently solvable under standard regularity conditions. 3.4. Nonlinear Hypotheses: Neural Networks2 Families of neural networks represent particularly expressive classes of nonlinear hypotheses. In the following, we characterize a family H of neural networks with M N layers through M continuous activation functions σm : Rnm+1 Rnm+1 and M weight matrices Wm 2. We are grateful to an anonymous referee for encouraging us to write this section. Regularization via Mass Transportation Rnm+1 nm, m [M]. The weight matrices can encode fully connected or convolutional layers, for example. If n1 = n and n M+1 = 1, then we may set H = n h RX : Wm Rnm+1 nm, m [M], h(x) = σM WM σ2 W2σ1(W1x) o . Each hypothesis h H constitutes a neural network and is uniquely determined by the collection of all weight matrices W[M] := (W1, . . . , WM). In order to emphasize the dependence on W[M], we will sometimes use h(x; W[M]) to denote the hypotheses in H. Setting x1 = x, the features of the neural network are defined recursively through xm+1 = σm(zm), where zm = Wmxm, m [M]. The features xm, m = 2, . . . , M, correspond to the hidden layers of the neural network, while x M+1 determines its output. Example 2 (Activation functions) The following activation functions are most widely used. 1. Hyperbolic tangent: [σm(zm)]i = (exp(2[zm]i) 1)/(exp(2[zm]i) + 1) 2. Sigmoid: [σm(zm)]i = 1/(1 + exp( [zm]i)) 3. Softmax: [σm(zm)]i = exp([zm]i)/Pnm+1 j=1 exp([zm]j) 4. Rectified linear unit (Re LU): [σm(zm)]i = max{0, [zm]i} 5. Exponential linear unit (ELU): [σm(zm)]i = max{0, [zm]i}+min{0, α(exp([zm]i) 1)} The distributionally robust learning model over the hypothesis class H can now be represented as inf h H sup Q Bρ(b PN) EQ ℓ h(x), y = inf W[M] sup Q Bρ(b PN) EQ ℓ h(x; W[M]), y , (33) where we use the transportation metrics (11) and (16) for regression and classification problems, respectively. Moreover, we adopt the standard convention that ℓ(h(x), y) = L(h(x) y) for regression problems and ℓ(h(x, y)) = L(yh(x)) for classification problems, where L is a convex and Lipschitz continuous univariate loss function. In the following we equip each feature space Rnm with a norm , m [M + 1]. By slight abuse of notation, we use the same symbol for all norms even though the norms on different feature spaces may differ. Using the norm on Rnm+1, we define the Lipschitz modulus of σm as lip(σm) := sup z,z Rnm+1 z z : z = z . We are now ready to state the main result of this section, which provides a conservative upper bound on the distributionally robust learning model (33). Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani Theorem 31 (Distributionally robust learning with neural networks) The distributionally robust learning model (33) is bounded above by the regularized empirical loss minimization problem i=1 ℓ(h(bxi; W[M]), byi) + ρ lip(L) max m=1 lip(σm) Wm , c where c = 1 for regression problems and c = max{1, 2 suph H,x X |h(x)|} for classification problems. Moreover, Wm = sup xm =1 Wmxm is the operator norm induced by the norms on Rnm and Rnm+1. Remark 32 (Uniform upper bound on all neural networks) For classification problems the constant c in (34) represents a uniform upper bound on all neural networks and may be difficult to evaluate in general. It is easy to estimate c, however, if the last activation function is itself bounded such as the softmax function, which yields a probability distribution over the output space. In this case one may simply set c = 2. The product term QM m=1 lip(σm) Wm in (34) represents an upper bound on the Lipschitz modulus of h(x; W[M]). We emphasize that computing the exact Lipschitz modulus of a neural network is NP-hard even if there are only two layers and all activation functions are of the Re LU type (Scaman and Virmaux, 2018, Theorem 2). In contrast, the upper bound at hand is easy to compute as all activation functions listed in Example 2 have Lipschitz modulus 1 with respect to the Euclidean norms on the domain and range spaces (Gouk et al., 2018; Wiatowski et al., 2016). For more details on how to estimate the Lipschitz moduli of neural networks we refer to (Gouk et al., 2018; Miyato et al., 2018; Neyshabur et al., 2018; Szegedy et al., 2013). Note that even though (34) represents a finite-dimensional optimization problem over the weight matrices of the neural network, both the empirical prediction loss as well as the regularization term are non-convex in W[M], which complicates numerical solution. If κ = , however, one can derive an alternative upper bound on the distributionally robust learning model (33) with a convex regularization term. Corollary 33 (Convex regularization term) If κ = , then there is ρ 0 such that the distributionally robust learning model (33) is bounded above by the regularized empirical loss minimization problem i=1 ℓ(h(bxi; W[M]), byi) + ρ m=1 Wm . (35) As the empirical prediction loss remains non-convex, it is expedient to address problem (35) with local optimization methods such as stochastic gradient descent algorithms. For a comprehensive review of firstand the second-order stochastic gradient algorithms we refer to (Agarwal et al., 2017) and the references therein. In the numerical experiments we will use a stochastic proximal gradient descent algorithm that exploits the convexity of the regularization term and generates iterates W k [M] for k N according to the update rule W k+1 m = proxηkρ Wm W k m ηk Wmℓ(h(bxik; W k [M]), byik) m [M], Regularization via Mass Transportation where ηk > 0 is a given step size and ik is drawn randomly from the index set [N], see, e.g., Nitanda (2014). Here, the proximal operator associated with a convex function ϕ : Rnm+1 nm R is defined through proxϕ(Wm) := arg min W m ϕ(W m) + 1 2 W m Wm 2 F , where F stands for the Frobenius norm. The algorithm is stopped as soon as the improvement of the objective value falls below a prescribed threshold. As the empirical prediction loss is non-convex and potentially non-smooth, the algorithm fails to offer any strong performance guarantees. For the scalability of the algorithm, however, it is essential that the proximal operator can be evaluated efficiently. Example 3 (Proximal operator) Suppose that all feature spaces Rnm are equipped with the p-norm for some p {1, 2, }, which implies that all parameter spaces Rnm+1 nm are equipped with the corresponding matrix p-norm. In this case the proximal operator of ϕ(Wm) = η Wm p for some fixed η > 0 can be evaluated highly efficiently. 1. MACS (p = 1): The matrix 1-norm returns the maximum absolute column sum (MACS). Evaluating the proximal operator of ϕ(Wm) = η Wm 1 amounts to solving the minimization problem proxϕ(Wm) = min W m,u ηu + i=1 [W m]:,i [Wm]:,i 2 2 s.t. [W m]:,i 1 u i [nm], where [Wm]:,i and [W m]:,i represent the i-th columns of Wm and W m, respectively. For any fixed u, the above problem decomposes into nm projections of the vectors [Wm]:,i, i [nm], to the ℓ1-ball of radius u centered at the origin. Each of these projections can be computed via an efficient sorting algorithm proposed in (Duchi et al., 2008). Next, we can use any line search method such as the golden-section search algorithm to optimize over u, thereby solving the full proximal problem. 2. Spectral (p = 2): The matrix 2-norm coincides with the spectral norm, which returns the maximum singular value. In this case, the proximal problem for ϕ(Wm) = η Wm 2 can be solved analytically via singular value thresholding (Cai et al., 2010, Theorem 2.1), that is, given the singular value decomposition Wm = USV with U Rnm+1 nm+1 orthogonal, S Rnm+1 nm + diagonal and V Rnm nm orthogonal, the proximal operator satisfies proxϕ(Wm) = proxϕ(USV ) = U e SV , where e Sij = max{Sij η, 0} . The singular value decomposition can be accelerated using a randomized algorithm proposed in (Halko et al., 2011). 3. MARS (p = ): The matrix -norm returns the maximum absolute row sum (MARS) and thus satisfies Wm = W m 1. Therefore, one can use the iterative scheme developed for MACS to compute the proximal operator of ϕ(Wm) = η Wm by simply transposing the weight matrix Wm. Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani The convergence behavior of the stochastic proximal gradient descent algorithm can be further improved by including a momentum term inside the proximal operator, see, e.g., Loizou and Richt arik (2017). 4. Generalization Bounds Generalization bounds constitute upper confidence bounds on the out-of-sample error. Traditionally, generalization bounds are derived by controlling the complexity of the hypothesis space, which is typically quantified in terms of its VC-dimension or via covering numbers or Rademacher averages (Shalev-Shwartz and Ben-David, 2014). Strengthened generalization bounds for large margin classifiers can be obtained by improving the estimates of the VC-dimension and the Rademacher average (Shivaswamy and Jebara, 2007, 2010). We will now demonstrate that distributionally robust learning models of the type (4) or (30) enjoy simple new generalization bounds that can be obtained under minimal assumptions. In particular, they do not rely on any notions of hypothesis complexity and may therefore even extend to hypothesis spaces with infinite VC-dimensions. Our approach is reminiscent of the generalization theory for robust support vector machines portrayed in (Xu et al., 2009), which also replaces measures of hypothesis complexity with robustness properties. However, we derive explicit finite sample guarantees, while (Xu et al., 2009) establishes asymptotic consistency results. Moreover, we relax some technical conditions used in (Xu et al., 2009) such as the compactness of the input space X. The key enabling mechanism of our analysis is a measure concentration property of the Wasserstein metric, which holds whenever the unknown data-generating distribution has exponentially decaying tails. Assumption 34 (Light-tailed distribution) There exist constants a > 1 and A > 0 and a reference point ξ Rn+1 such that EP[exp (d(ξ, ξ ))a)] A, where d denotes the usual mass transportation cost. Theorem 35 (Measure concentration (Fournier and Guillin, 2015, Theorem 2)) If Assumption 34 holds, then we have PNn W(P, b PN) ρ o c1 exp c2Nρmax{n+1,2} if ρ 1, c1 exp c2Nρa if ρ > 1, (36) for all N 1, n = 1, and ρ > 0, where the constants c1, c2 > 0 depend only on a, A, d and n.3 Theorem 35 asserts that the empirical distribution b PN converges exponentially fast to the unknown data-generating distribution P, in probability with respect to the Wasserstein metric, as the sample size N tends to infinity. We can now derive simple generalization bounds by increasing the Wasserstein radius ρ until the violation probability on the of the right hand side of (36) drops below a prescribed significance level η (0, 1]. Specifically, 3. A similar but slightly more complicated inequality also holds for the special case n = 1; see (Fournier and Guillin, 2015, Theorem 2) for details. Regularization via Mass Transportation Theorem 36 borrowed from (Mohajerin Esfahani and Kuhn, 2018, Theorem 3.5) implies that PN{P Bρ(b PN)} 1 η for any ρ ρN(η), where 1 max{n+1,2} if N log(c1/η) c2 , log(c1/η) a if N < log(c1/η) Theorem 36 (Basic generalization bound) If Assumption 34 holds, then EP [ℓ( w, x , y)] sup Q Bρ(b PN) EQ [ℓ( w, x , y)] w Rn ) for any N 1, n = 1, η (0, 1] and ρ ρN(η). Remark 37 (Discussion of basic generalization bound) The following comments are in order. I. Performance guarantees for optimal hypotheses: If b J(ρ) denotes the minimum and b w a minimizer of the distributionally robust learning problem (4), then Theorem 36 implies that PN n EP [ℓ( b w, x , y)] b J(ρ) o 1 η for any N 1, n > 1, η (0, 1] and ρ ρN(η). II. Light-tail assumption: Assumption 34 is restrictive but unavoidable for any measure concentration result of the type described in Theorem 35. It is automatically satisfied if the input-output pair has bounded support or is known to follow a Gaussian or exponential distribution, for instance. III. Asymptotic consistency: It is clear from (37) that for any fixed η (0, 1], the radius ρN(η) tends to 0 as N increases. Moreover, Theorem 3.6 in (Mohajerin Esfahani and Kuhn, 2018) implies that if ηN converges to 0 at a carefully chosen rate (e.g., ηN = exp( N)), then the solution of the distributionally robust learning problem (4) with Wasserstein radius ρ = ρN(ηN) converges almost surely to the solution of the ideal learning problem that minimizes the out-of-sample error under the unknown true distribution P. IV. Curse of dimensionality: The Wasserstein radius (37) has two decay regimes. For small N, ρN(η) decays as N 1 a , and for large N it is proportional to N 1 n+1 . We thus face a curse of dimensionality for large sample sizes. In order to half the Wasserstein radius, one has to increase N by a factor of 2n. This curse of dimensionality is fundamental, i.e., the dependence of the measure concentration result in Theorem 35 on the input dimension n cannot be improved for generic distributions P; see (Weed and Bach, 2019) or (Fournier and Guillin, 2015, Section 1.3). Improvements are only possible in special cases, e.g., if P is finitely supported. Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani V. Extension to nonlinear hypotheses: Theorem 36 directly extends to any distributionally robust learning problem over an RKHS H induced by some symmetric and positive definite kernel function k. Specifically, if k is calm in the sense of Assumption 25 with growth function g, then we have EP[ℓ(h(x), y)] sup Q Bρ(b PH N) EQ [ℓ( h, x H H, y)] h H for any N 1, n = 1, η (0, 1] and ρ cg(ρN(η)), where c = 2 for regression problems and c = 1 for classification problems. To see this, note that the inclusion P Bρ(b PN) implies EP[ℓ(h(x), y)] sup Q Bρ(b PN) EQ [ℓ(h(x), y)] sup Q Bρ(b PH N) EQ [ℓ( h, x H H, y)] h H, (40) where the second inequality follows from the proof of Theorem 26. The generalization bound (39) thus holds because PN{P Bρ(b PN)} 1 η for any ρ ρN(η). Note that the rightmost term in (40) can be computed for any finitely generated hypothesis h H representable as h(x) = PN i=1 βik(x, bxi), which follows from Theorems 28 and 29, while the middle term is hard to compute. We emphasize that the generalization bound (39) does not rely on any notion of hypothesis complexity and remains valid even if H has infinite VC-dimension (e.g., if H is generated by the Gaussian kernel). Theorem 35 provides a confidence set for the unknown probability distribution P, and Theorem 36 uses this confidence set to construct a uniform generalization bound on the prediction error under P. The radius of the confidence set for P decreases slowly due to a curse of dimensionality, but the decay rate is essentially optimal. This does not imply that the decay rate of the generalization bound (38) is optimal, too. In fact, the worst-case expected error over a Wasserstein ball of radius ρ can be a (1 η)-confidence bound on the expected error under P even if the Wasserstein ball fails to contain P with confidence 1 η. Thus, the measure concentration result of Theorem 35 is too powerful for our purposes and leads to an over-conservative generalization bound. Below we will show that the curse of dimensionality in the generalization bound (38) can be broken if we impose the following restriction on the hypothesis space. Assumption 38 (Hypothesis space) The space of admissible hypotheses in (4) is restricted to W Rn. There exists Ω> 0 with infw W (w, 1) Ωif (4) is a regression problem and infw W w Ωif (4) is a classification problem. Similarly, there exists Ω 0 with supw,w W w w Ω. Theorem 39 (Improved generalization bound) Suppose that Assumptions 34 and 38 hold, and the function L is Lipschitz continuous. Moreover, assume that Ξ = Rn+1 and Mn = maxi n en+1 i if (4) is a regression problem, while Ξ = Rn { 1, 1} and Mn = maxi n en i if (4) is a classification problem, where en i is the i-th standard basis vector in Rn. Then, there exist constants c3 1, c4 > 0 depending only on Regularization via Mass Transportation the light tail constants a and A such that the generalization bound (38) holds for any N max (16n/c4)2, 16 log(c3/η)/c4 , η (0, 1] and ρ ρ N(η), where ρ N(η) = 2Ω N) + log(c3/η) The improved generalization bound from Theorem 38 does not suffer from a curse of dimensionality. In fact, in order to half the Wasserstein radius ρ N(η), it suffices to increase the sample size N by a factor of 4, irrespective of the input dimension n. Remark 40 (Discussion of improved generalization bound) The following comments are in order. I. Bounds on hypothesis space: Assumption 38 imposes upper and lower bounds on W. The upper bound enables us to control the difference between the empirical and the true expected loss uniformly across all admissible hypotheses. This bound is less restrictive than the uniform bound on the loss function used to derive Rademacher generalization bounds (see, e.g., (Shalev-Shwartz and Ben-David, 2014, Theorem 26.4)), which essentially imposes upper bounds both on the hypotheses and the input-output pairs. The lower bound in Assumption 38 is restrictive for classification problems but trivially holds for regression problems because (w, 1) is uniformly bounded away from zero for any (dual) norm on Rn+1. II. Breaking the curse of dimensionality: By leveraging Assumption 38, Theorem 39 reduces the critical Wasserstein radius in the generalization bound (38) from ρN(η) O([log(η 1)/N]1/(n+1)), which suffers from a curse of dimensionality, to ρ N(η) O([log(η 1)+n log(N))/N]1/2), which essentially follows a square root law reminiscent of the central limit theorem. 5. Error and Risk Estimation Once a hypothesis h(x) has been chosen, it is instructive to derive pessimistic and optimistic estimates of its out-of-sample prediction error (in the case of regression) or its out-of-sample risk (in the case of classification). We will argue below that the distributionally robust optimization techniques developed in this paper also offer new perspectives on error and risk estimation. For ease of exposition, we ignore any support constraints, that is, we set X = Rn and Y = R (for regression) or X = Rn and Y = {+1, 1} (for classification). Moreover, we focus on linear hypotheses of the form h(x) = w, x . Note, however, that all results extend directly to conic representable support sets and to nonlinear hypotheses. In the context of regression, we aim to estimate the prediction error defined as E(w) = EP [|y w, x |] or, more precisely, the mean absolute prediction error under the unknown data-generating distribution P. As usual, we assume that the transportation metric d is induced by a norm on the input-output space Rn+1. Theorem 41 (Error bounds in linear regression) The prediction error admits the following estimates. Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani (i) The worst-case error Emax(w) = sup Q Bρ(b PN) EQ [|y w, x |] is given by Emax(w) = 1 i=1 |byi w, bxi | + ρ (w, 1) . (41a) [(ii) ] The best-case error Emin(w) = inf Q Bρ(b PN) EQ [|y w, x |] is given by Emin(w) = max i=1 |byi w, bxi | ρ (w, 1) , 0 In the context of classification, we aim to quantify the risk R(w) = P [y = sign( w, x )], that is, the misclassification probability under the unknown true distribution P. Note that the risk can equivalently be defined as the expectation of a characteristic function, that is, R(w) = EP[1{y = w,x }]. As usual, we assume that the transportation metric d is of the form (16), where κ 0 is the cost of flipping a label. Theorem 42 (Risk bounds in linear classification) The risk admits the following estimates. (i) The worst-case risk Rmax(w) = sup Q Bρ(b PN) EQ[1{y = w,x }] is given by min λ,si ri,ti λρ + 1 s.t. 1 ri byi w, bxi si i [N] 1 + ti byi w, bxi λκ si i [N] ri w λ, ti w λ i [N] ri, ti, si 0 i [N]. (ii) The best-case risk Rmin(w) = inf Q Bρ(b PN) EQ[1{y = w,x }] is given by Rmin(w) = 1 min λ,si ri,ti λρ + 1 s.t. 1 + ri byi w, bxi si i [N] 1 ti byi w, bxi λκ si i [N] ri w λ, ti w λ i [N] ri, ti, si 0 i [N]. We emphasize that, as the hypothesis w is fixed, the error and risk estimation problems (41) and (42) constitute tractable linear programs that can be solved highly efficiently. Regularization via Mass Transportation Remark 43 (Confidence intervals for error and risk) If the Wasserstein radius is set to ρN(η/2) defined in (37), where η (0, 1] is a prescribed significance level, then Theorem 36 implies that E( b w) [Emin( b w), Emax( b w)] and R( b w) [Rmin( b w), Rmax( b w)] with confidence 1 η for any b w Rn that may even depend on the training data. Theorem 41 implies that the confidence interval for the true error E( b w) can be calculated analytically from (41), while Theorem 42 implies that the confidence interval for the true risk R( b w) can be computed efficiently by solving the tractable linear programs (42). Remark 44 (Extension to nonlinear hypotheses) By using the tools of Section 3.3, Theorems 41 and 42 generalize immediately to nonlinear hypotheses that range over a RKHS. Specifically, we can formulate lifted error and risk estimation problems where the inputs x X are replaced with features x H H, while each nonlinear hypothesis h H over the input space X is identified with a linear hypothesis h H H over the feature space H through the identity h H(x H) = h, x H H. Tractability is again facilitated by Theorem 27, which allows us to focus on finitely parameterized hypotheses of the form h(x) = PN i=1 βik(x, bxi). 6. Numerical Results We showcase the power of regularization via mass transportation in various applications based on standard datasets from the literature. All optimization problems are implemented in Python and solved with Gurobi 7.5.1 All experiments are run on an Intel XEON CPU (3.40GHz), and the corresponding codes are made publicly available at https://github.com/sorooshafiee/Regularization-via-Transportation. 6.1. Regularization with Pre-selected Parameters We first assess how the out-of-sample performance of a distributionally robust support vector machine (DRSVM) is impacted by the choice of the Wasserstein radius ρ, the cost κ of flipping a label, and the kernel function k. To this end, we solve three binary classification problems from the MNIST database (Le Cun et al., 1998) targeted at distinguishing pairs of similar handwritten digits (1-vs-7, 3-vs-8, 4-vs-9). In the first experiment we optimize over linear hypotheses and use the separable transporation metric (16) involving the -norm on the input space. All results are averaged over 100 independent trials. In each trial, we randomly select 500 images to train the DRSVM model (20) and use the remaining images for testing. The correct classification rate (CCR) on the test data, averaged across all 100 trials, is visualized in Figure 1 as a function of the Wasserstein radius ρ for each κ {0.1, 0.25, 0.5, 0.75, }. The best out-of-sample CCR is obtained for κ = 0.25 uniformly across all Wasserstein radii, and performance deteriorates significantly when κ is reduced or increased. Recall from Remark 18 that, as κ tends to infinity, the DRSVM reduces to the classical regularized support vector machine (RSVM) with 1-norm regularizer. Thus, the results of Figure 1 indicate that regularization via mass transportation may be preferable to classical regularization in terms of the maximum achievable out-of-sample CCR. More specifically, we observe that the out-of-sample CCR of the best DRSVM (κ = 0.25) displays a slightly higher and significantly wider plateau around the optimal regularization parameter ρ than the classical RSVM (κ = ). This suggests that the regularization parameter in DRSVMs may be easier to calibrate from data than in RSVMs, a conjecture that will be put Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani Table 1: Average out-of-sample CCR scores of the DRSVM with learned parameters. Polynomial Laplacian Gaussian RSVM DRSVM RSVM DRSVM RSVM DRSVM 1-vs-7 98.9 0.2 99.1 0.2 98.3 0.5 98.5 0.4 99.1 0.2 99.2 0.2 3-vs-8 95.2 0.4 97.0 0.4 96.5 0.4 96.8 0.4 97.0 0.3 97.2 0.3 4-vs-9 95.0 0.4 96.5 0.4 95.8 0.6 96.0 0.6 96.8 0.4 96.9 0.4 to scrutiny in Section 6.2. Finally, Figure 1 reveals that the standard (unregularized) support vector machine (SVM), which can be viewed as a special case of the DRSVM with ρ = 0, is dominated by the RSVMs and DRSVMs across a wide range of regularization parameters.4 Note that the SVM problem (20) with ρ = 0 reduces to a linear program and may thus suffer from multiple optimal solutions. This explains why the limiting out-of-sample CCR for ρ 0 changes with κ. 6.2. Regularization with Learned Parameters It is easy to read offthe best regularization parameters ρ and κ from the charts in Figure 1. As these charts are constructed from more than 12,000 test samples, however, they are not accessible in the training phase. In practice, ρ and κ must be calibrated from the training data alone. This motivates us to revisit the three classification problems from Section 6.1 using a fully data-driven procedure, where all free model parameters are calibrated via 5-fold cross validation; see, e.g., (Abu-Mostafa et al., 2012, 4.3.3). Moreover, to evaluate the benefits of kernelization, we now solve a generalized DRSVM model of the form (32), which implicitly optimizes over all nonlinear hypotheses in some RKHS. As explained in Section 3.3, kernelization necessitates the use of the separable transportation metric (16) with the Euclidean norm on the input space. All free parameters of the resulting DRSVM model are restricted to finite search grids in order to ease the computational burden of cross validation. Specifically, we select the Wasserstein radius ρ from within {b 10e : b {1, 5}, e {1, 2, 3, 4}} and the label flipping cost κ from within {0.1, 0.25, 0.5, 0.75, }. Moreover, we select the degree d of the polynomial kernel from within {1, 2, 3, 4, 5} and the peakedness parameter γ of the Laplacian and Gaussian kernels from within { 1 25}. Otherwise, we use the same experimental setup as in Section 6.1. Table 1 reports the averages and standard deviations of the CCR scores on the test data based on 100 independent trials. We observe that the DRSVM (ρ, κ, d, and γ learned by cross validation) outperforms the RSVM (ρ, d and γ learned by cross validation, κ = ) consistently across all tested kernel functions (Polynomial, Laplacian, Gaussian). Note that the DRSVM with polynomial kernel subsumes the non-kernelized DRSVM (20) as a special case because the polynomial kernel with d = 1 coincides with the linear kernel. 4. By slight abuse of notation, we use the acronym SVM to refer to the unregularized empirical hinge loss minimization problem even though the traditional formulations of the support vector machine involve a Tikhonov regularization term. Regularization via Mass Transportation 10-4 10-3 10-2 10-1 ρ κ = 0. 1 κ = 0. 25 κ = 0. 5 κ = 0. 75 κ = 10-4 10-3 10-2 10-1 ρ κ = 0. 1 κ = 0. 25 κ = 0. 5 κ = 0. 75 κ = 10-4 10-3 10-2 10-1 ρ κ = 0. 1 κ = 0. 25 κ = 0. 5 κ = 0. 75 κ = Figure 1: Average out-of-sample CCR scores of the DRSVM with pre-selected parameters. In the third experiment, we assess the out-of-sample performance of the DRSVM (20) for different transportation metrics on 10 standard datasets from the UCI repository (Bache and Lichman, 2013). Specifically, we use different variants of the separable transportation metric (16), where distances in the input space are measured via a p-norm with p {1, 2, }. We focus exclusively on linear hypotheses because the kernelization techniques described in Section 3.3 are only available for p = 2. The DRSVM is compared against the standard (unregularized) SVM and the RSVM with q-norm regularizer ( 1 q = 1). All results are averaged across 100 independent trials. In each trial, we randomly select 75% of the data for training and the remaining 25% for testing. The training dataset is first standardized to zero mean and unit variance along each coordinate axis. The Wasserstein radius ρ and the label flipping cost κ in the DRSVM as well as the regularization weight ρ in the RSVM are estimated via stratified 5-fold cross validation. Classifier performance is now quantified in terms of the receiver operating characteristic (ROC) curve, which plots the true positive rate (percentage of correctly classified test samples with true label y = 1) against the false positive rate (percentage of incorrectly classified test samples with true label y = 1) by sweeping the discrimination threshold. Specifically, we use the area under the ROC curve (AUC) as a measure of classifier performance. AUC does not bias on the size of the test data and is a more appropriate performance measure than CCR in the presence of an unbalanced label distribution in the training data. We emphasize that most of the considered datasets are indeed imbalanced, and thus a high CCR score would not necessarily provide evidence of superior classifier performance. The averages and standard deviations of the AUC scores based on 100 trials are reported in Table 2. The results suggest that the DRSVM outperforms the RSVM in terms of AUC for all norms by about the same amount by which the RSVM outperforms the classical hinge loss minimization, consistently across all datasets. 6.3. Multi-Label Classification The aim of object recognition is to discover instances of particular object classes in digital images. We now describe an object recognition experiment based on the PASCAL VOC 2007 dataset Everingham et al. (2010) consisting of 9,963 images, which are pre-partitioned into 25% for training, 25% for validation and 50% for testing. Each image is annotated with 20 binary labels corresponding to 20 given object categories (the n-th label is set to Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani Table 2: Average out-of-sample AUC scores of the SVM, RSVM and DRSVM. p = / q = 1 p = 2 / q = 2 p = 1 / q = SVM RSVM DRSVM RSVM DRSVM RSVM DRSVM Australian 91.6 3.0 91.5 3.2 92.0 2.5 92.0 2.2 92.3 2.0 91.9 2.8 92.2 2.4 Blood transfusion 73.7 3.8 73.8 3.8 75.5 3.8 74.9 3.5 75.5 3.7 75.4 3.4 75.4 3.7 Climate model 93.8 3.9 94.4 4.0 94.3 4.0 94.3 3.8 94.0 4.0 93.6 3.9 93.9 4.0 Cylinder 72.0 3.7 71.2 4.0 72.1 4.1 71.3 4.0 71.8 3.8 71.5 3.8 72.2 3.7 Heart 90.4 2.7 90.1 2.7 90.3 2.7 90.6 2.6 90.9 2.5 90.5 2.6 90.7 2.6 Ionosphere 85.0 4.9 89.7 4.5 89.2 4.3 90.4 3.7 89.9 3.9 86.0 4.9 87.2 4.8 Liver disorders 60.5 0.0 61.1 0.7 61.7 0.7 61.2 0.3 61.7 0.5 61.1 0.4 61.8 0.5 QSAR 90.5 1.5 90.5 1.6 91.0 1.6 90.5 1.5 91.2 1.5 90.6 1.5 91.1 1.6 Splice 92.1 0.0 93.0 0.4 93.1 0.1 92.5 0.1 92.6 0.1 92.0 0.1 92.5 0.1 Thoracic surgery 61.7 7.1 61.5 6.5 64.6 6.6 64.4 6.4 64.3 7.0 64.0 6.3 64.6 6.3 +1 if the image contains the n-th object and to 1 otherwise). A multi-label classifier is a function that predicts all labels of an unlabelled input image. The ability of a classifier to detect objects belonging to any fixed category is measured by the average precision (AP), which is defined in Everingham et al. (2010) as (a proxy for) the area under the classifier s precision-recall curve. The overall performance of a classifier is quantified by the mean average precision (m AP), that is, the arithmetic mean of the AP scores across all object categories. In the first scenario, we train a separate binary RSVM and DRSVM classifier for each of the 20 object categories. This classifier predicts whether an object of the respective category appears in the input image. At the beginning we preprocess the entire dataset by resizing each image to 256 256 pixels and extracting the central patch of 244 244 pixels. As shown in (Chatfield et al., 2014; Donahue et al., 2014; Zeiler and Fergus, 2014), the features generated by the penultimate layer of a deep convolutional neural network trained on a large image dataset provide a powerful image descriptor. Using the ALEXNET neural network trained on the Image Net dataset (Krizhevsky et al., 2012), we can thus compress each (preprocessed) image of the PASCAL VOC 2007 dataset into 1,000 meaningful features. We normalize these feature vectors to lie on the unit sphere. When training the RSVM and DRSVM classifiers, we can thus work with these feature vectors instead of the corresponding images. Moreover, we restrict attention to linear hypotheses and assume that transportation distances in the inputoutput space are measured by the separable metric (16) with the Euclidean norm on the input space. We tune the Wasserstein radius ρ {b 10e : b {1, . . . , 9}, e { 2, 3, 4}} and the label flipping cost κ {0.1, 0.2, . . . , 1, } via the holdout method using the validation data. As usual, we fix κ = for RSVM. Table 3 reports the AP scores of the RSVM and DRSVM models for each object category. The ensemble of all 20 binary RSVM or DRSVM classifiers, respectively, can be viewed as a na ıve multi-label classifier that predicts all labels of an image. As DRSVM outperforms RSVM on an object-by-object basis, it also wins in terms of m AP. In the second scenario, we construct a proper multi-label classifier by fine-tuning the last layer of the pre-trained ALEXNET network. To this end, we replace the original M-th layer of the network with a new fully connected layer characterized by a parameter matrix WM R20 1000, and we set σM to the Sigmoid activation function. The resulting classifer outputs for each of the 20 object categories a probability that an object from the respective Regularization via Mass Transportation category appears in the input image. The quality of a classifier (which is encoded by WM) is measured by the cross-entropy loss function, which naturally generalizes the logloss to multiple labels. The resulting empirical loss minimization problem is enhanced with a regularization term proportional to WM 1,1 (Lasso), WM 2 F (Tikhonov), WM 1 (MACS), WM 2 (Spectral) or WM (MARS). By using similar arguments as in Section 3.4, one can show that the empirical cross-entropy with MACS, Spectral or MARS regularization term overestimates the worst-case expected cross-entropy over all distributions of (x M, x M+1) in a Wasserstein ball provided that the transportation cost is given by d((x M, x M+1), (x M, x M+1)) = x M x M p + κ1{x M+1 =x M+1} for κ = , whenever p = 1, p = 2 or p = , respectively. Thus, the MACS, Spectral and MARS regularization terms admit a distributionally robust interpretation. We use the stochastic proximal gradient descent algorithm of Section 3.4 to tune WM, including an additional momentum term with weight 0.9. As in (Krizhevsky et al., 2012), we split the training phase into 100 epochs, each corresponding to a complete pass through the training dataset in a random order. As the ALEXNET requires input images of size 244 244, in each iteration we extract a random patch of 244 244 pixels from the current image and flip it horizontally at random. This procedure effectively augments the training dataset. The initial step size is set to 10 3 and then reduced by a factor of 10 after every 7 epochs. The algorithm terminates after 100 epochs. We preprocess the images in the validation and test datasets as in Scenario 1 and tune the regularization weights via the holdout method using the validation data. Table 3 reports the AP and m AP scores of the different classifiers that were tested. These results suggest that fine-tuning the last layer of a pre-trained neural network may improve classifier performance. We observe that the spectral norm regularizer, which has a distributionally robust interpretation, consistently outperforms almost all other methods. For further details on the experimental setup (such as the exact search grids for all hyperparameters) we refer to the code publicized on Github. 6.4. Generalization Bounds The next experiment estimates the scaling behavior of the smallest Wasserstein radius that verifies the generalization bound (38) for the synthetic threenorm classification problem (Breiman, 1996). The experiment involves 1,000 simulation trials. In each trial we generate N training samples for some N {10, . . . , 90} {100, . . . , 1,000} as well as 105 test samples. Each sample (x, y) R20 { 1, 1} is constructed as follows. The label y is drawn uniformly from { 1, 1}. If y = 1, then x is drawn from a standard multivariate normal distribution shifted by (c, . . . , c) or ( c, . . . , c) with equal probabilities, where c = 2/ 20. If by = 1, on the other hand, then x is drawn from a standard multivariate normal distribution shifted by (c, c, +c, . . . , c). We now describe three different approaches to choose the Wasserstein radius ρ in the DRSVM (20) with transportation cost (16), where κ = and represents the -norm on the input space. Throughout the experiment we use P = {b 10 e : b {1, . . . , 10}, e {1, . . . , 5}} as the search space for ρ. Approach 1 ( cross validation ) calibrates the Wasserstein radius as before via 5-fold cross validation based solely on the N training samples. This approach reflects what would typically be done in practice. Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani Table 3: AP scores of different multi-label classifiers. Scenario 1 Scenario 2 RSVM DRSVM Lasso Tikhonov MACS Spectral MARS 84.80 84.85 85.50 84.26 84.35 83.89 85.53 78.54 78.49 76.18 76.55 76.37 75.67 76.22 82.19 82.19 83.37 83.11 83.52 84.19 83.08 79.55 79.56 77.65 78.04 77.70 78.92 77.53 37.52 37.98 38.10 39.73 39.53 38.75 38.10 72.05 72.03 70.05 69.43 69.17 70.28 69.77 83.08 83.12 83.10 83.67 83.50 82.92 83.17 80.60 80.56 79.87 79.86 80.07 79.85 79.82 54.64 54.64 54.40 54.76 53.94 54.72 54.55 47.82 53.13 52.06 52.10 51.62 55.54 51.19 54.26 58.88 63.41 65.23 65.15 66.79 62.95 75.81 75.81 76.92 77.39 77.26 76.54 76.97 82.74 82.72 82.17 81.89 81.6 80.81 81.9 72.69 72.88 74.70 75.41 74.76 76.48 74.31 90.36 90.36 90.07 90.30 90.22 90.35 90.09 50.22 51.90 50.20 50.18 50.27 51.39 50.20 60.75 63.57 71.78 71.3 70.39 71.64 71.40 56.85 56.98 52.15 54.36 54.65 55.12 51.94 85.09 85.03 84.89 84.55 84.41 85.43 84.96 69.02 69.08 64.73 65.85 65.63 64.26 64.48 m AP 69.92 70.69 70.56 70.90 70.71 71.20 70.40 Approaches 2 and 3 both solve (20) based on the empirical distribution induced by the N training samples and select the Wasserstein radius using the 105 test samples. Specifically, approach 2 ( optimal ) chooses the Wasserstain radius that leads to the lowest test error, while approach 3 ( generalization bound ) selects the smallest Wasserstein radius for which the optimal value of (20) exceeds the expected loss on the test samples in at least 95% of all trials, that is, it approximates the smallest Wasserstein radius that verifies the generalization bound (38) for η = 5%. As the test samples are not available in the training phase, the last two approaches are not implementable in practice, and we merely study them to gain insights. Figure 2(a) visualizes all resulting Wasserstein radii as a function of N. Note that the radii obtained with the first two approaches are uncertain as they depend on a particular choice of the training samples. Figure 2(a) thus only shows their averages across all simulation trials. In contrast, the radii obtained with the third approach depend on the training sample sets of all 1,000 trials and are thus essentially deterministic. We observe that the Wasserstein radii of all three approaches decay approximately as 1/ N, which is in line with the theoretical generalization bound of Theorem 39. We expect Regularization via Mass Transportation 101 102 103 N optimal ρ generalization bound ρ cross validation ρ (a) Dependence of the Wasserstein radius on the number of training samples 10 5 10 4 10 3 10 2 10 1 100 ρ True Risk Upper Bound Lower Bound Confidence (b) Confidence bounds on the risk Figure 2: Results of the threenorm classification problem. this decay rate to be optimal because any faster decay would be in conflict with the central limit theorem. Note also that our results empirically confirm Theorem 39 even though we did not impose any restrictions on W as dictated by Assumption 38. This suggests that Theorem 39 might remain valid under weaker conditions. In the experiment underlying Figure 2(b), we first fix b w to an optimal solution of (20) for ρ = 0.1 and N = 100. Figure 2(b) shows the true risk R( b w) and its confidence bounds given by Theorem 42. As expected, for ρ = 0 the upper and lower bounds coincide with the empirical risk on the training data, which is a lower bound for the true risk on the test data due to over-fitting effects. As ρ increases, the confidence interval between the bounds widens and eventually covers the true risk. For instance, at ρ 0.009 the confidence interval is given by [0.008, 0.162] and contains the true risk with probability 1 η = 95%. 6.5. Worst-Case Distributions Consider again the 3-vs-8 classification problems from the MNIST database (Le Cun et al., 1998) and fix w to an optimal solution of the empirical hinge loss minimization problem. The goal of the last experiment is to evaluate the worst-case hinge loss of w for different Wasserstein radii ρ {0, 0.01, 0.05, 0.1, 0.5, 1} and label flipping costs κ {0, } and to investigate the corresponding worst-case distributions, which are computable by virtue of Theorem 20(i). As each input constitutes a vector of pixels intensities between zero and one, we impose support constraints of the form Cx d with C = [I, I] and d = [1 , 0 ] . For illustrative purposes we only use the N = 10 first datapoints in the MNIST dataset as training samples. Each training sample bxi corresponds to four discretization points (bxi + q+ ij/α+ ij and bxi + q ij/α ij for j = 1, 2) in the worst-case distribution obtained from (26). We observe that for every i exactly one out of these four points has probability 1 N , while all others have probability 0. Figure 3 depicts only those 10 discretization points that have nonzero probability for a fixed ρ and κ. As expected, the perturbations of the training samples are more severe for larger Wasserstein radii. For κ = these scenarios must have the same labels as the corresponding training samples. For κ = 0, on the other hand, the labels can be flipped at no cost (flipped labels are indicated by red frames). Each scenario group shown in Figure 3 can thus be viewed as a worst-case training dataset for the corresponding Wasserstein radius and label flipping cost. Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani Figure 3: Discretization points (input images) of the worst-case distribution for different ρ and κ. Red frames indicate that the corresponding labels are flipped under the worst-case distribution. Acknowledgments We gratefully acknoweldge financial support from the Swiss National Science Foundation under grants BSCGI0 157733 and P2EZP2 165264. Appendix: Proofs A.1. Proofs of Section 3 The proof of Theorem 4 requires three preparatory Lemmas. The first lemma is adapted from (Mohajerin Esfahani and Kuhn, 2018) and asserts that the worst-case expectation over a Wasserstein ball can be re-expressed as a classical robust optimization problem. Lemma 45 (Robust reformulation) Set bξi = (bxi, byi) for all i N. For any measurable integrand I(ξ) that is bounded above by a Lipschitz continuous function we have sup Q Bρ(b PN) EQ [I(ξ)] = inf λ 0 λρ + 1 i=1 sup ξ Ξ I(ξ) λd(ξ, bξi). (A.1) Proof By the definition of the Wasserstein ball we have sup Q Bρ(b PN) EQ [I(ξ)] = Ξ2 I(ξ) Π(dξ, dξ ) s.t. Π is a joint distribution of ξ and ξ with marginals Q and b PN Z Ξ2 d ξ, ξ Π(dξ, dξ ) ρ Regularization via Mass Transportation Ξ I(ξ) Qi(dξ) Ξ Qi(dξ) = 1 i [N] Ξ d(ξ, bξi) Qi(dξ) ρ. Note that the integral of I(ξ) exists under every Q Bρ(b PN) because I(ξ) admits a Lipschitz continuous majorant. The last equality in the above expression holds because the marginal distribution of ξ is the uniform distribution on the training samples, which implies that Π is completely determined by the conditional distributions Qi of ξ given ξ = bξi, that is, Π(dξ, dξ ) = 1 N PN i=1 δbξi(dξ )Qi(dξ). The resulting generalized moment problem over the normalized measures Qi admits the semi-infinite dual sup Q Bρ(b PN) EQ [I(ξ)] = inf λ,si λρ + 1 s.t. sup ξ Ξ I(ξ) λd(ξ, bξi) si i [N] Strong duality holds for any ρ > 0 due to (Shapiro, 2001, Proposition 3.4). The claim then follows by eliminating si. Lemma 46 For any a R, β, bζ Rd, γ R+ and ζ Z, where Z Rd is a closed convex set, we have sup ζ Z a β, ζ γ ζ bζ = ( inf p SZ(aβ p) + p, bζ Proof We have sup ζ Z a β, ζ γ ζ bζ = sup ζ Z inf p γ a β, ζ p, ζ bζ = inf p γ sup ζ Z a β, ζ p, ζ bζ = inf p γ sup ζ Rd aβ p, ζ δZ(ζ) + p, bζ = inf p γ SZ(aβ p) + p, bζ , where the first equality follows from the definition of the dual norm, the second equality holds due to the minimax theorem (Bertsekas, 2009, Proposition 5.5.4), and the last equality holds because the support function SZ is the conjugate of the indicator function δZ. Thus, the claim follows. Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani Lemma 47 If L(z) is a convex and Lipschitz continuous loss function, β, bζ Rd and γ > 0, then sup ζ Rd L( β, ζ ) γ ζ bζ = L( β, bζ ) if lip(L) β γ + otherwise. Proof Note that L( β, ζ ) γ ζ bζ constitutes a difference of convex functions and may thus be neither convex nor concave in ζ. In order to maximize this function, we re-write I(ζ) = L( β, ζ ) as an upper envelope of infinitely many affine functions. To this end, we express the conjugate of I(ζ) as I (z) = sup ζ z, ζ L( β, ζ ) = sup t, ζ { z, ζ L(t) : t = β, ζ } = inf θ {L (θ) : θβ = z} , where the last equality follows from strong Lagrangian duality, which holds because Slater s constraint qualification is trivially satisfied in the absence of inequality constraints (Bertsekas, 2009, Proposition 5.3.1). Defining Θ = {θ R : L (θ) < } as the effective domain of L (θ), we may then replace θ R with θ Θ in the last expression. As I(ζ) is convex and continuous, it coincides with its bi-conjugate, that is, I(ζ) = I (ζ) = sup z z, ζ I (z) = sup θ Θ θβ, ζ L (θ). In other words, we have represented I(ζ) as the upper envelope of infinitely many linear functions. Using this representation, we obtain sup ζ I(ζ) γ ζ bζ = sup ζ I (ζ) γ ζ bζ = sup ζ sup θ Θ θβ, ζ L (θ) γ ζ bζ = sup θ Θ sup ζ inf p γ θ β, ζ L (θ) p, ζ bζ = sup θ Θ inf p γ sup ζ θβ p, ζ L (θ) + p, bζ , where the last equality holds due to (Bertsekas, 2009, Proposition 5.5.4). Evaluating the maximization over ζ yields sup ζ I(ζ) γ ζ bζ = sup θ Θ inf p γ p, bζ L (θ) if p = θβ + otherwise θβ, bζ L (θ) if θβ γ + otherwise ( sup θ Θ θ β, bζ L (θ) if sup θ Θ θβ γ + otherwise ( L( β, bζ ) if sup θ Θ |θ| β γ + otherwise. Regularization via Mass Transportation Thus, the claim follows by noting that supθ{|θ| : L (θ) < } represents the Lipschitz modulus of L. Proof of Theorem 4 To prove assertion (i), we apply Lemma 45 to the integrand f(x, y) = L( w, x y) with L(z) = maxj J{ajz + bj} to obtain sup Q Bρ(b PN) EQ [ℓ( w, x , y)] = inf λ 0 λρ + 1 i=1 sup (x,y) Ξ L( w, x y) λ (x, y) (bxi, byi) = inf λ 0 λρ + 1 i=1 max j [J] sup (x,y) Ξ aj( w, x y) + bj λ (x, y) (bxi, byi) inf λ,pij,uij λρ + i=1 max j J SΞ(ajw pij, aj uij) + pij, bxi + uijbyi + bj s.t. (pij, uij) λ i [N], j [J], where the last equality follows from Lemma 46. The claim now follows by introducing auxiliary epigraphical variables si for the max-terms in the objective function and by including w as a decision variable. To prove assertion (ii), we apply Lemma 45 to the integrand f(x, y) = L( w, x y), where L is a Lipschitz continuous convex loss function. Thus we find sup Q Bρ(b PN) EQ [ℓw(x, y)] = inf λ 0 λρ + i=1 sup x,y L( w, x y) λ (x, y) (bxi, byi) inf λ λρ + 1 i=1 L( w, bxi byi) s.t. lip(L)|θ| (w, 1) λ, where the last equality uses Lemma 47. Next, we eliminate λ and include w as a decision variable. Proof of Corollary 5 Note that the Huber loss function L(z) coincides with the infconvolution of 1 2z2 and δ|z| and can thus be expressed as L(z) = minz1 1 2z2 1 + δ|z z1|. Moreover, the Lipschitz modulus of the Huber loss function is δ. The rest of the proof follows from Theorem 4(ii). Proof of Corollary 6 Notice that the ϵ-insensitive loss function is a piecewise linear function with J = 3 pieces, see Section 2.1. By strong conic duality, the support function of Ξ = {(x, y) Rn+1 : C1x + c2y C d} can be re-expressed as SΞ(z1, z2) = sup x,y { z1, x + z2y : C1x + c2y C d} = inf q C n q, d : C 1 q = z1, c 2 q = z2 o . Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani Strong duality holds because Ξ admits a Slater point. The rest of proof follows from Theorem 4(i). Proof of Corollary 7 The pinball loss function is a piecewise linear function with J = 2 pieces, see Section 2.1. The rest of proof follows from the dual representation of the support function SΞ(z1, z2), which is known from the proof of Corollary 6, and from Theorem 4(i). The proof of Theorem 9 is based on the following preparatory lemma. Lemma 48 If Z Rd is a non-empty convex closed set, bζ Z, β Rd and α, γ 0, then we have inf p αSZ(β p) + α p, bζ + γ p = sup q γ α β, bζ + β, q s.t. bζ + q/α Z. Proof If α = 0, then the optimal values of both opimization problems vanish due to our conventions of extended arithmetic, and thus the claim trivially holds. If α > 0, however, we have inf p α SZ(β p) + α p, bζ + γ p = inf p sup q γ αSZ(β p) + α p, bζ + p, q = sup q γ inf p αSZ(β p) + p, αbζ + q = sup q γ inf z αSZ(z) + β z, αbζ + q = sup q γ α β, bζ + β, q α sup z z, bζ + q/α SZ(z) = sup q γ α β, bζ + β, q αδZ(bζ + q/α), where the first equality follows from the definition of the dual norm, the second equality exploits (Bertsekas, 2009, Proposition 5.5.4), and the last equality holds because, for any convex closed set, the indicator function is the conjugate of the support function. Proof of Theorem 9 We first prove assertion (i). By Theorem 4(i), the worst-case expectation problem (13) constitutes a restriction of (5) where w is fixed, and thus it coincides with the minimax problem inf λ,si pij,uij sup αij 0,γij 0 λρ + 1 j=1 γij (pij, uij) λ j=1 αij SΞ( ajw pij, aj uij) + bj + pij, bxi + uijbyi si . The minimization and the maximization may be interchanged by strong duality, which holds because the convex program (5) satisfies Slater s constraint qualification for every fixed Regularization via Mass Transportation w (Bertsekas, 2009, Proposition 5.3.1). Indeed, note that SΞ is proper, convex and lower semi-continuous and appears in constraints that are always satisfiable because they involve a free decision variable. Thus, the above minimax problem is equivalent to sup αij,γij inf pij,uij j=1 αij SΞ(ajw pij, aj uij) + bj + pij, bxi + uijbyi j=1 γij pij j=1 γij = ρ j=1 αij = 1 αij, γij 0 i [N], j [J]. By Lemma 48, which applies because (bxi, byi) Ξ for all i N, the above dual problem simplifies to sup αij,γij qij,vij j=1 αij aj( w, bxi byi) + bj + aj( w, qij vij) j=1 γij = ρ j=1 αij = 1 (qij, vij) γij i [N], j [J] (bxi qij/αij, byi vij/αij) Ξ i [N], j [J] αij, γij 0 i [N], j [J]. Problem (14) is now obtained by eliminating the variables γij and by substituting αij, qij, and vij with αij/N, qij/N, and vij/N, respectively. As for assertion (ii), we first show that the discrete distribution Qγ belongs to the Wasserstein ball Bρ(b PN) for all γ (0, 1]. Indeed, the Wasserstein distance between Qγ and b PN amounts to d(Qγ, b PN) γ γ x , by1 + ρN γ y (bx1, by1) = ρ (x , y ) ρ, where the first inequality holds because the Wasserstein distance coincides with the optimal mass transportation cost, and the last inequality holds because the norm of (x , y ) is at most 1 by construction. Thus, Qγ Bρ(b PN) for all γ (0, 1]. Denoting the optimal value Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani of (13) by J (w) and using ℓw(x, y) as a shorthand for L( w, x y), we find J (w) EQγ [ℓw(x, y)] i=1 ℓw(bxi, byi) γ N ℓw(bx1, by1) + γ N ℓw(bx1 + ρN γ x , by1 + ρN i=1 ℓw(bxi, byi) γ N ℓw(bx1, by1) + γ N (x, y), (bx1 + ρN γ x , by1 + ρN ℓ w(x, y) (x, y) Rn+1, where the last estimate follows from Fenchel s inequality. Setting (x, y) = lip(L)(w, 1) we thus have J (w) lim γ 0+ 1 N i=1 ℓw(bxi, byi) γ N ℓw(bx1, by1) + γ N lip(L)( w, bx1 by1) + ρ lip(L) (w, 1) γ N ℓ w(lip(L)(w, 1)) i=1 ℓw(bxi, byi) + ρ lip(L) (w, 1) = J (w), where the equality follows from Theorem 4(ii). The above reasoning implies that lim γ 0+ EQγ[ℓw(x, y)] = J (w), and thus the claim follows. Proof of Theorem 12 Assume first that the loss function L is convex piecewise linear, that is, L(z) = maxj J{ajz + bj}. As Ξ Rn+1, Theorem 9(i) implies that the worst-case expectation (13) is given by sup αij,qij,vij j=1 αij h aj( w, bxi byi) + bj i + aj( w, qij vij) j=1 (qij, vij) ρ j=1 αij = 1 i [N] αij 0 i [N], j [J] sup αij, xij, yij j=1 αij [aj( w, bxi + xij byi yij) + bj] j=1 αij ( xij, yij) ρ j=1 αij = 1 i [N] αij 0 i [N], j [J] Regularization via Mass Transportation sup αij, xi, yi j=1 αij [aj( w, bxi + xi byi yi) + bj] j=1 αij ( xi, yi) ρ j=1 αij = 1 i [N] αij 0 i [N], j [J]. The first inequality holds because for any feasible solution {αij, xij, yij} to the second optimization problem, the solution {αij, qij, vij} with qij = αij xij and vij = αij yij is feasible in the first problem and attains the same objective value (conversely, note that the first problem admits feasible solutions with αij = 0 and qij = 0 that have no counterpart in the second problem). The second inequality follows from the restriction that xij and yij must be independent of j. It is easy to verify that the last optimization problem in the above expression is equivalent to (15) because (αi1, . . . , αi J) ranges over a simplex for every i N, and thus (15) provides a lower bound on (13). Suppose now that Assumption 10 holds, and note that the worst-case loss (15) can be expressed as i=1 max j J [aj( w, bxi + xi byi yi) + bj] i=1 ( xi, yi) ρ i =k max j J [aj( w, bxi byi) + bj]+ 1 N max j J [aj( w, bxk + x byk y) + bj] s.t. 1 N ( x, y) ρ. The above inequality follows from setting xi = 0 and yi = 0 for all i = k, where (bxk, byk) is a training sample satisfying |L ( w, bxk byk)| = lip(L), which exists due to Assumption 10. The last expression equals i =k max j J [aj( w, bxi byi) + bj] + max j J 1 N [aj( w, bxk byk) + bj] N [aj( w, x y)] s.t. 1 N ( x, y) ρ i =k max j J [aj( w, bxi byi) + bj] + 1 h aj( w, bxk byk) + bj + ρN (w, 1) |aj| i i =k max j J [aj( w, bxi byi) + bj] + 1 N max j J [aj( w, bxk byk) + bj] + max j J ρ (w, 1) |aj| Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani i=1 L( w, bxi byi) + max j J ρ (w, 1) |aj|, where the penultimate equality holds because w, bxk byk resides on the steepest linear piece of the loss function L by virtue of Assumption 10. The claim then follows from Theorem 4(ii) because lip(L) = maxj J |aj|. Note that generic convex Lipschitz continuous loss functions can be uniformly approximated as closely as desired with convex piecewise linear functions. Thus, the above argument extends directly to generic convex Lipschitz continuous loss functions. Details are omitted for brevity. Proof of Theorem 14 To prove assertion (i), we apply Lemma 45 to problem (4) with the transportation distance (16), where ξ = (x, y) and Ξ = X { 1, +1}. Thus, we obtain sup Q Bρ(b PN) EQ [ℓ( w, x , y)] inf λ,si λρ + 1 s.t. sup (x,y) Ξ L(y w, x ) λ x bxi + κ 2|y byi| si i [N] inf λ,si λρ + 1 s.t. sup x X L(byi w, x ) λ x bxi si i [N] sup x X L( byi w, x ) λ x bxi κλ si i [N] inf λ,si λρ + 1 s.t. sup x X ajbyi w, x + bj λ x bxi si i [N], j [J] sup x X ajbyi w, x + bj λ x bxi κλ si i [N], j [J] where the second equality holds because, for every i, y can be either equal to byi or to byi. Reformulating the constraints using Lemma 46 and including w as a decision variable then yields (17). Regularization via Mass Transportation When X = Rn and L is Lipschitz continuous, we can use similar arguments as above to prove that sup Q Bρ(b PN) EQ [ℓ( w, x , y)] = inf λ,si λρ + 1 s.t. sup x Rn L(byi w, x ) λ x bxi si i [N] sup x Rn L( byi w, x ) λ x bxi κλ si i [N] Applying Lemma 47 to the subordinate maximization problems in the constraints yields sup Q Bρ(b PN) EQ [ℓ( w, x , y)] = inf λ,si λρ + 1 s.t. L(byi w, bxi ) si i [N], j [J] L( byi w, bxi ) κλ si i [N], j [J] sup θ Θ |θ| w λ. Thus, assertion (ii) follows by recalling that lip(L) = supθ{|θ| : L (θ) < } = supθ Θ |θ| and by including w as a decision variable. Proof of Corollary 15 Notice that the hinge loss function is piecewise linear with J = 2 pieces, see Section 2.1. Moreover, by strong conic duality the support function of X can be re-expressed as SX(z) = sup x { z, x : Cx C d} = inf q C n q, d : C q = z o . Strong duality holds because X admits a Slater point. The proof thus follows from Theorem 14(i). Proof of Corollary 16 The smooth hinge loss L(z) coincides with the inf-convolution of 1 2z2 and max{0, 1 z} and can thus be expressed as L(z) = minz1 1 2z2 1 + max{0, 1 z z1}. Moreover, the Lipschitz modulus of the smooth hinge loss function is 1. The proof thus follows from Theorem 14(ii). Proof of Corollary 17 Notice that the logloss function is convex and has Lipschitz modulus 1, see Section 2.1. The rest of proof follows from Theorem 14(ii). For more details see (Shafieezadeh-Abadeh et al., 2015). Proof of Theorem 20 We first prove assertion (i). By Theorem 14(i), the worst-case expectation problem (25) constitutes a restriction of (17) where w is fixed, and thus it is Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani equivalent to the minimax problem inf λ,si p+ ij,p ij sup α+ ij 0,γ+ ij 0 α ij 0,γ ij 0 j=1 α+ ij SX(ajbyiw p+ ij) + bj + p+ ij, bxi si j=1 α ij SX( ajbyiw p ij) + bj + p ij, bxi κλ si j=1 γ+ ij p+ ij λ + j=1 γ ij p ij λ . The minimization and the maximization may be interchanged by strong Lagrangian duality, which holds because the convex program (17) satisfies Slater s constraint qualification for any fixed w (Bertsekas, 2009, Proposition 5.3.1). Indeed, note that SX is proper, convex and lower semi-continuous and appears in constraints that are always satisfiable because they involve a free decision variable. Thus, problem (25) is equivalent to sup α+ ij,γ+ ij α ij,γ ij inf p+ ij,p ij j=1 α+ ij SX(ajbyiw p+ ij) + bj + p+ ij, bxi + γ+ ij p+ ij j=1 α ij SX( ajbyiw p ij) + bj + p ij, bxi + γ ij p ij j=1 γ+ ij + γ ij + κα ij = ρ j=1 α+ ij + α ij = 1 α+ ij, α ij, γ+ ij, γ ij 0 i [N], j [J]. By Lemma 48, the above dual problem simplifies to sup α+ ij,γ+ ij,q+ ij α ij,γ ij,q ij j=1 (α+ ij α ij)ajbyi w, bxi + ajbyi w, q+ ij q ij + j=1 γ+ ij + γ ij + κα ij = ρ j=1 α+ ij + α ij = 1 q+ ij γ+ ij, q ij γ ij i [N], j [J] bxi + q+ ij/α+ ij X, bxi + q ij/α ij X i [N], j [J] α+ ij, α ij, γ+ ij, γ ij 0 i [N], j [J]. Regularization via Mass Transportation Problem (26) is now obtained by eliminating γ+ ij and γ ij and by substituting α+ ij, α ij, q+ ij, and q ij with α+ ij/N, α ij/N, q+ ij/N, and q ij/N, respectively. As for assertion (ii), we use L+ i and L i to abbreviate L(byi w, bxi ) and L( byi w, bxi ), respectively. By Theorem 14(ii), we have sup Q Bρ(b PN) EQ[L(y w, x )] = inf w,λ,si λρ + 1 s.t. L+ i si i [N] L i κλ si i [N] lip(L) (w) λ sup α+ i ,α i ,θ lip(L) (w) θ + i=1 α+ i L+ i + α i L i s.t. α+ i + α i = 1 i=1 α i = ρ α+ i 0, α i 0 i [N] θ 0, where the second equality follows from strong linear programming duality, which holds because the primal problem is manifestly feasible. Eliminating the first constraint and replacing α i with αi/N and α+ i with (1 αi)/N allows us to reformulate the dual linear program as sup αi,θ lip(L)(w) θ + 1 i=1 (1 αi)L+ i + αi L i 0 αi 1 i [N] θ 0. Thus, the worst-case expectation (25) coincides with the optimal value of (27) for γ = 0. Next let (α i (γ), θ (γ)) be an optimal solution of (27) for γ 0, and define Qγ as in the theorem statement. Note that (27) is infeasible for γ > ρ. Moreover, note that the atoms of Qγ have non-negative probabilities if η(γ) [0, 1], which holds whenever γ [0, 1]. We thus focus on parameter values γ [0, min{ρ, 1}]. By construction, the Wasserstein distance between Qγ and the empirical distribution satisfies d(Qγ, b PN) κ i=1 α i (γ) η(γ)κ N α 1(γ) + η(γ) N d (bx1 + θ (γ)N η(γ) x , by1) (bx1, by1) ρ γ θ (γ) + θ (γ) x ρ, where the first inequality holds because the Wasserstein distance is defined as the minimum cost of moving Qγ to b PN, the second inequality follows from the feasibility of (α i (γ), θ (γ)) in (27) and the non-negativity of η(γ), and the last inequality holds because x 1, Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani θ (γ) 0 and γ 0. Thus, Qγ Bρ(b PN) for all γ [0, min{ρ, 1}]. Denoting the optimal value of (25) by J (w), we find J (w) EQγ [L(y w, x )] i=1 (1 α i (γ))L+ i + α i (γ)L i η(γ) N (1 α 1(γ))L+ 1 + α 1(γ)L 1 N L by1 w, bx1 + θ (γ)N i=1 (1 α i (γ))L+ i + α i (γ)L i η(γ) N (1 α 1(γ))L+ 1 + α 1(γ)L 1 N x, bx1 + θ (γ)N η(γ) x L (by1 w, x ) x Rn, where the last estimate follows from Fenchel s inequality. Setting x = lip(L)w and driving γ to zero yields J (w) lim γ 0+ 1 N NP i=1 (1 α i (γ))L+ i + α i (γ)L i η(γ) N (1 α 1(γ))L+ 1 + α 1(γ)L 1 N lip(L) w, bx1 L (by1 lip(L) w, w ) + lip(L) w θ (γ) = lim γ 0+ 1 N NP i=1 (1 α i (γ))L+ i + α i (γ)L i + lip(L) w θ (γ) = J (w), where the first equality follows from the observation that η(γ) [0, γ], which implies that η(γ) converges to zero as γ tends to zero. The second equality holds because the optimal value of (27) is concave and non-increasing and a fortiori continuous in γ [0, min{ρ, 1}] and because J (w) coincides with the optimal value of (27) when γ = 0. The above reasoning implies that limγ 0+ EQγ[L(y w, x )] = J (w), and thus the claim follows. Proof of Theorem 23 Assume first that the loss function L is convex piecewise linear, that is, L(z) = maxj J{ajz + bj}. As X = Rn and κ = , Theorem 20(i) implies that (25) can be expressed as sup α+ ij,q+ ij j=1 α+ ijajbyi w, bxi + ajbyi w, q+ ij + j=1 q+ ij Nρ j=1 α+ ij = 1 i [N] α+ ij 0 i [N], j [J] Regularization via Mass Transportation sup αij, xij j=1 αijajbyi w, bxi + xij + j=1 αij xij Nρ j=1 αij = 1 i [N] αij 0 i [N], j [J] sup αij, xi j=1 αijajbyi w, bxi + xi + j=1 αij xi Nρ j=1 αij = 1 i [N] αij 0 i [N], j [J]. The first optimization problem constitutes a special case of (26). Indeed, as κ = , the first constraint in (26) implies that α ij = 0, which in turn implies via the fourth constraint and our conventions of extended arithmetics that q ij = 0. The first inequality in the above expression holds because for any feasible solution {αij, xij} to the second problem, the solution {α+ ij, q+ ij} with q+ ij = α+ ij xij and αij = α+ ij for all i N and j J is feasible in the first problem and attains the same objective value. The second inequality in the above expression follows from the restriction that xij must be independent of j. It is easy to verify that the last optimization problem is equivalent to (28) because (αi1, . . . , αi J) ranges over a simplex for every i N, and thus (28) provides a lower bound on (25). Suppose now that Assumption 21 holds, and note that (28) can be expressed as i=1 max j J [ajbyi w, bxi + xi + bj] i =k max j J [ajbyi w, bxi + bj] + 1 N max j J [ajbyk w, bxk + x + bj] s.t. 1 N x ρ. The above inequality follows from setting xi = 0 and yi = 0 for all i = k, where (bxk, byk) is a training sample satisfying |L (byk w, bxk )| = lip(L), which exists due to Assumption 21. The last expression equals i =k max j J [ajbyi w, bxi + bj] + max j J 1 N [ajbyk w, bxk + bj] + 1 N [ajbyk w, x ] s.t. 1 N ( x, y) ρ Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani i =k max j J [ajbyi w, bxi + bj] + 1 h ajbyi w, bxk + bj + ρN w |aj| i i =k max j J [ajbyi w, bxi + bj] + 1 N max j J [ajbyk w, bxk + bj] + max j J ρ w |aj| i=1 L(byi w, bxi ) + max j J ρ w |aj|, where the penultimate equality holds because byk w, bxk resides on the steepest linear piece of the loss function L by virtue of Assumption 21. The claim then follows from Theorem 14(ii) because lip(L) = maxj J |aj|. Note that generic convex Lipschitz continuous loss functions can be uniformly approximated as closely as desired with convex piecewise linear functions. Thus, the above arguments extend directly to generic convex Lipschitz continuous loss functions. Details are omitted for brevity. Proof of Theorem 26 By the definition of the feature map Φ corresponding to the kernel k, we have Φ(x1) Φ(x2) H = p Φ(x1), Φ(x1) H 2 Φ(x1), Φ(x2) H + Φ(x2), Φ(x2) H k(x1, x1) 2k(x1, x2) + k(x2, x2) g( x1 x2 2) (A.3) for all x1, x2 X, where the inequality follows from Assumption 25. As for assertion (i), we may use similar argument as in the proof of Lemma 45 to reformulate the worst-case expectation in (29) as sup Q Bρ(b PN) EQ [ℓ(h(x), y)] (A.4) X Y ℓ(h(x), y)Qi(dx, dy) X Y d (x, y), (bxi, byi) Qi(dx, dy) ρ R X Y Qi(dx, dy) = 1 i [N] X Y ℓ(h(x), y)Qi(dx, dy) 2d (x, y), (bxi, byi) Qi(dx, dy) X Y Qi(dx, dy) = 1 i [N] X Y ℓ(h(x), y)Qi(dx, dy) 2d (x, y), (bxi, byi) Qi(dx, dy) g( X Y Qi(dx, dy) = 1 i [N], Regularization via Mass Transportation where the second equality holds because g is strictly monotonically increasing, and the inequality follows from Jensen inequality, which applies because g is concave. By the definition of the transportation metric on H Y for regression problems, we then find 2d (x, y), (bxi, byi) = g q 2 x bxi 2 2 + 2(y byi)2 g( x bxi 2 + |y byi|) g( x bxi 2) + |y byi| Φ(x) Φ(bxi) H + |y byi| Φ(x) Φ(bxi) 2 H + (y byi)2 = d H (Φ(x), y), (Φ(bxi), byi) , where the first inequality holds because 2a2 + 2b2 (a + b)2 for all a, b 0 and because g is strictly monotonically increasing, the second inequality exploits the assumption that g (z) 1 for all z 0, the third inequality follows form (A.3), and the last equality holds because a2 + b2 (a + b)2 for all a, b 0. Substituting the above estimate into (A.5) and using the reproducing property h(x) = h, Φ(x) H yields sup Q Bρ(b PN) EQ [ℓ(h(x), y)] X Y ℓ( h, Φ(x) H, y)Qi(dx, dy) X Y d H Φ(x), y), Φ(bxi), byi) Qi(dx, dy) g( X Y Qi(dx, dy) = 1 i [N] H Y ℓ(x H, y)Qi(dx H, dy) H Y d H x H, y), Φ(bxi), byi) Qi(dx H, dy) g( H Y Qi(dx H, dy) = 1 i [N] = sup Q Bg( 2ρ)(b PH N) EQ [ℓ(x H, y)] , where the second inequality follows from relaxing the implicit condition that the random variable x H must be supported on {Φ(x) : x X} H. This completes the proof of assertion (i) for regression problems. The proof of assertion (ii) parallels that of assertion (i) with obvious modifications. Due to the different transportation metric for classification problems, however, the estimate (A.6) changes to g d (x, y), (bxi, byi) = g ( x bxi 2 + κ|y byi|) g ( x bxi 2) + κ|y byi| Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani Φ(x) Φ(bxi) H + κ|y byi| = d H (Φ(x), y), (Φ(bxi), byi) , where the first inequality exploits the assumption that g (z) 1 for all z 0, and the second inequality follows form (A.3). Further details are omitted for brevity. Proof of Theorem 27 The proof follows immediately from (Sch olkopf and Smola, 2001, Theorem 4.2), which applies to loss functions representable as a sum of an empirical loss depending on (bxi, byi, h(bxi)), i N, and a regularization term that is strictly monotonically increasing in h H. However, the additive separability is not needed for the proof. We remark that the optimal solution of (31) is unique if f is striclty increasing in h H. If f is only non-decreasing in h H, on the other hand, uniqueness may be lost. Details are omitted for brevity. Proof of Theorem 28 Using similar arguments as in the proof of Theorem 4(ii) and observing that any Hilbert norm is self-dual, one can show that inf h H sup Q Bρ(b PH N) EQ [L( h, x H H y)] = min h H 1 N i=1 L(h(bxi) byi) + ρ lip(L) q By the representer theorem, which applies because the objective function of the above optimization problem is non-decreasing in h H, we may restrict the feasible set from H to the subset of all linearly parametrized hypotheses of the form h(x) = PN j=1 βjk(x, bxj) for some β RN without sacrificing optimality. The claim then follows by observing that h(bxi) = PN j=1 Kijβj and h 2 2 = β, Kβ . Proof of Theorem 29 Using similar arguments as in the proof of Theorem 14(ii) and observing that any Hilbert norm is self-dual, one can show that inf h H sup Q Bρ(b PH N) EQ [L(y h, x H H)] = min h,λ,si λρ + 1 s.t. L(byih(bxi)) si i [N] L( byih(bxi)) κλ si i [N] lip(L) h H λ, see (Gao and Kleywegt, 2016, Theorem 1) for a full proof. By the representer theorem, which applies because the loss function f((bx1, by1, h(bx1)), . . . , (bx N, by N, h(bx N)), h H) min λ λρ + 1 i=1 max{L(byih(bxi)), L( byih(bxi)) κλ} s.t. λ lip(L) h H is non-decreasing in h H, we may restrict attention to all linearly parametrized hypotheses of the form h(x) = PN j=1 βjk(x, bxj) for some β RN without sacrificing optimality. Thus, Regularization via Mass Transportation the claim follows. Proof of Theorem 31 For regression problems, the worst-case expected prediction loss satisfies sup Q Bρ(b PN) EQ [ℓ(h(x), y)] = inf λ 0 λρ + 1 i=1 sup x X y Y L(h(x) y) λ( x bxi + κ|y byi|) inf λ 0 λρ + 1 i=1 sup x X y Y L(h(bxi) byi) + lip(L)(|h(x) y h(bxi) + byi|) λ( x bxi + κ|y byi|) inf λ 0 λρ + 1 i=1 sup x X y Y L(h(bxi) byi) + lip(L) lip(h) x bxi + lip(L)|y byi| λ( x bxi + κ|y byi|) i=1 ℓ(h(bxi), byi) + ρ lip(L) max {lip(h), 1/κ} , where the equality holds due to Lemma 45, while the first and the second inequalities follow from the Lipschitz continuity of L and h, respectively. The last inequality holds by setting λ = lip(L) max {lip(h), 1/κ}. Note that lip(ϕ(ψ)) lip(ϕ) lip(ψ) for any functions ϕ and ψ defined on appropriate normed spaces; see for example (Rockafellar and Wets, 2009, Exercise 9.8). Thus, the Lipschitz modulus of h can be estimated in terms of the Lipschitz moduli of the activation functions σm and the operator norms of the weight matrices WM, which coincide with the Lipschitz moduli of the corresponding linear mappings. Formally, we have lip(h) QM m=1 lip(σm) Wm . Substituting this estimate into the last line of the above display equation yields the postulated upper bound on the worst-case expectation for regression problems. Similarly, for classification problems we have sup Q Bρ(b PN) EQ [ℓ(h(x), y)] = inf λ 0 λρ + 1 i=1 sup x X y Y L(yh(x)) λ( x bxi + κ1{y =byi}) inf λ 0 λρ + 1 i=1 sup x X y Y L(byih(bxi)) + lip(L)(|yh(x) byih(bxi)|) λ( x bxi + κ1{y =byi}) inf λ 0 λρ + 1 i=1 sup x X y Y L(byih(bxi)) + lip(L) lip(h) x bxi 1{y=byi} Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani + lip(L) |h(x) + h(bxi)| 1{y =byi} λ( x bxi + κ1{y =byi}) inf λ 0 λρ + 1 i=1 sup x X y Y L(byih(bxi)) λ( x bxi + κ1{y =byi}) + lip(L) max {2c/κ, lip(h)} ( x bxi + κ1{y =byi}), where c = suph H,x X |h(x)| is defined as in the claim. Setting λ = lip(L) max {1/κ, 2c/κ, lip(h)} and using the estimate lip(h) QM m=1 lip(σm) Wm yields the postulated upper bound on the worst-case expectation for classification problems. Proof of Corollary 33 Set σ = Q m [M] lip(σm). As κ = , Theorem 31 implies that the worst-case expected prediction loss, both for classification and regression problems, satisfies sup Q Bρ(b PN) EQ [ℓ(h(x), y)] 1 i=1 sup x X y Y ℓ(h(bxi), byi) + ρ σ lip(L) i=1 sup x X y Y ℓ(h(bxi), byi) + ρ σ lip(L) where the second inequality follows from the inequality of arithmetic and geometric means. By (Everett III, 1963, Theorem 1), if W [M] is a minimizer of the optimization problem i=1 sup x X y Y ℓ(h(bxi), byi) + ρ σ lip(ℓ) then the same W [M] also minimizes the constrained optimization problem i=1 ℓ(h(bxi), byi) for θ = PM m=1 W m . Notice that the constraint in the above optimization problem can be simplified to PM m=1 Wm θ. Hence, there exists a Lagrange multiplier ρ for the simplified constraint such that W [M] is a minimizer of the penalized problem i=1 ℓ(h(bxi), byi) + ρ This observation completes the proof. Regularization via Mass Transportation A.2. Proofs of Section 4 The proof of Theorem 39 relies on the following preparatory lemma, which basically asserts that the sample average of a linearly growing function of ξ has sub-Gaussian tails. Lemma 49 (Sub-Gaussian tails) If Assumption 34 holds, then there exist constants c3 1 and c4 > 0 that depend only on the light tail constants a and A of P such that PNn EP[f(ξ)] E b PN [f(ξ)] δ o c3 exp( c4Nδ2) for any N N, δ [0, 1] and function f : Ξ R with |f(ξ) f(ξ )| d(ξ, ξ ) for all ξ Ξ and some reference point ξ Ξ. Proof Assume that f : Ξ R is a linear growth function with |f(ξ) f(ξ )| d(ξ, ξ ) for all ξ Ξ and some reference point ξ Ξ. Set ξf = f(ξ) and ξ f = f(ξ ). Thus, the distribution of the scalar random variable ξf is given by the pushforward measure f (P) of P. By construction, we have Ef (P) exp |ξf ξ f|a = EP exp |f(ξ) f(ξ )|a EP exp d(ξ, ξ )a A, where the first inequality follows form the growth condition of f, while the second inequality holds because P satisfies Assumption 34. Hence, the distribution f (P) satisfies Assumption 34 for n = 0 when distances on R are measured by the absolute value, and it inherits the light-tail constants a and A from P. By using Theorem 35 for n = 0, we may thus conclude that there exist constants c3, c4 > 0 with PNn W(f (P), f (b PN)) δ o c3 exp c4Nδ2 δ [0, 1], where f (b PN) represents the empirical distribution of ξf, which coincides with the pushforward measure of b PN under f. By slight abuse of notation, W stands here for the Wasserstein distance between univariate distributions, where the absolute value is used as the ground metric. Note that the above univariate measure concentration result holds for any linear growth function f with asymptotic growth rate 1. We emphasize that c3 1 because otherwise the above estimate would fail to hold for δ = 0. By construction of the Wasserstein distance in Definition 2, we have W(f (P), f (b PN)) < δ if and only if the scalar random variables ξf and ξ f admit a joint distribution Π with EΠ[|ξf ξ f|] < δ under which ξf and ξ f have marginals f (P) and f (b PN), respectively. The inequality W(f (P), f (b PN)) < δ thus implies EP[f(ξ)] E b PN [f(ξ )] = Ef (P)[ξf] Ef (b PN)[ξ f] EΠ[ξf ξ f] = EΠ ξf ξ f < δ. By contraposition, we then obtain the implication EP[f(ξ)] E b PN [f(ξ )] δ = W(f (P), f (b PN)) δ, which leads to the desired inequality PNn EP[f(ξ)] E b PN [f(ξ)] δ o PNn W(f (P), f (b PN)) δ o c3 exp c4Nδ2 . Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani The last inequality holds for all δ [0, 1] and N N, irrespective of the linear growth function f. Remark 50 (Hoeffding s inequality) If it is known that P{f f(ξ) f} = 1, then Lemma 49 reduces to Hoeffding s inequality (Boucheron et al., 2013, Theorem 2.8), in which case we may set c3 = 2 and c4 = 2/(f f)2. Proof of Theorem 39 To avoid cumbersome case distinctions, we prove the theorem only in the case when (4) is a classification problem. Thus, we assume that Ξ = Rn { 1, 1} and that the transportation cost is of the form (16), where denotes a norm on the input space Rn. The proof for regression problems is similar and only requires minor modifications. It will be omitted for brevity. From Theorem 14(ii) we know that sup Q Bρ(b PN) EQ [ℓ( w, x , y)] = E b PN [ℓ( w, x , y)] + ρ lip(L)Ω(w) w W, where Ω(w) = w can be viewed as a regularization function. For every ρ ρ N(η) we thus have EP [ℓ( w, x , y)] sup Q Bρ(b PN) EQ [ℓ( w, x , y)] w W PN 0 min w W E b PN [ℓ( w, x , y)] + ρ N(η) lip(L)Ω(w) EP [ℓ( w, x , y)] = 1 PN min w W E b PN [ℓ( w, x , y)] + ρ N(η) lip(L)Ω(w) EP [ℓ( w, x , y)] < 0 . (A.7) Observe that ℓ( w, x , y) is Lipschitz continuous in w for every fixed x and y because the underlying univariate loss function L is Lipschitz continuous by assumption. Specifically, we have |ℓ( w, x , y) ℓ( w , x , y)| lip(L)| w w , x | lip(L) w w x 1 w, w W. For any > 0 there exists a finite set W W with = supw W infw W w w and whose cardinality satisfies |W | < (Ω/ 1)n < (Ω/ )n 1 where the second inequality holds because Ωby construction. In the following we set = Ω/ N. As ℓ( w, x , y) is Lipschitz continuous in w, for every w W there is w W with |ℓ( w, x , y)| ℓ( w , x , y) lip(L) x 1. Applying this estimate twice and recalling the assumption that Ω(w) Ωfor all w W, we may thus conclude that the probability in (A.7) is smaller or equal to PN min w W E b PN [ℓ( w, x , y)] EP [ℓ( w, x , y)] + ρ N(η) lip(L)Ω< 0 Regularization via Mass Transportation PN min w W E b PN [ℓ( w, x , y)] EP [ℓ( w, x , y)] lip(L) E b PN [ x 1] + EP [ x 1] < ρ N(η) lip(L)Ω PN min w W E b PN [ℓ( w, x , y)] EP [ℓ( w, x , y)] lip(L) E b PN [ x 1] EP [ x 1] < 2 lip(L) Mnn A ρ N(η) lip(L)Ω The second inequality in the above expression follows from the estimate Mn = max i n en i = max i n sup x 1 en i , x = max i n sup x 1 |xi| = sup x 1 x sup x Rn x 1 n x , which can be paraphrased as x 1 Mnn x for every x Rn and thus implies EP[ x 1] Mnn EP[ x ] Mnn EP[exp( x a)] Mnn EP[exp d(ξ, ξ ))a ] Mnn A. Next, we introduce an auxiliary parameter δ = ρ N(η)Ω Mn A N)/c4 + log(c3/η)/c4 where the second equality follows from the definition of ρ N(η) in the theorem statement and the convention that = Ω/ N. One can prove that δ [0, 1]. Indeed, the nonnegativity of δ is immediate because c3 1 and c4 > 0. Moreover, we find log(c3/η)/c4 where the first inequality follows from the observation that x1 + x2 x1 + x2 for all x1, x2 0, while the second inequality holds because log( N for all N N and N max (16n/c4)2, 16 log(c3/η)/c4 , which implies that both fractions in the middle of the above expression are smaller ore qual to 1 2. Multiplying the definition of δ with lip(L)( + Ω) yields the identity lip(L) δ lip(L)Ωδ = 2 lip(L) Mnn A lip(L)ρN(η)Ω, and thus the probability (A.8) can be bounded above by PN min w W E b PN [ℓ( w, x , y)] EP [ℓ( w, x , y)] lip(L)Ωδ or lip(L) E b PN [ x ] EP [ x ] lip(L) δ PN min w W E b PN ℓ( w, x , y) EP ℓ( w, x , y) Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani + PN E b PN [ x ] EP [ x ] δ , (A.9b) where the inequality follows from the subadditivity of probability measures. For any fixed w W one can show that the function f(ξ) = ℓ( w, x , y)/(lip(L)Ω) with ξ = (x, y) satisfies the linear growth condition |f(ξ) f(ξ )| d(ξ, ξ ) for all ξ Ξ if ξ = (0, 1). Details are omitted for brevity. By the subadditivity of probability measures, the probability (A.9a) is thus smaller or equal to w W PN E b PN hℓ( w, x , y) i EPhℓ( w, x , y) i > δ |W |c3 exp( c4Nδ2), where the inequality follows from Lemma 49, which applies because δ [0, 1]. Moreover, the function f(ξ) = x with ξ = (x, y) trivially satisfies the linear growth condition |f(ξ) f(ξ )| d(ξ, ξ ) for all ξ Ξ if ξ = (0, 1). By Lemma 49, the probability (A.9b) is thus smaller or equal to PN n EP [ x ] E b PN [ x ] > δ o c3 exp( c4Nδ2). By combining the above estimates, we may conclude that the probability in (A.7) does not exceed (|W | + 1) c3 exp c4Nδ2 Ω/ n c3 exp c4N ρ N(η)Ω 2 Mnn A = N n 2 c3 exp N n 2 c3 exp N)/c4 + log(c3/η)/c4 = N n 2 c3 exp n log( N) log(c3/η) = η, where the first inequality follows from the definition of δ and the assumption that |W | < (Ω/ )n 1, the first equality holds because Ω/ = N, and the second inequality holds due to the definition of ρ N(η). In summary, we have shown that the probability in (A.7) is at most η, and thus the claim follows. A.3. Proofs of Section 5 Proof of Theorem 41 As for assertion (i), note that the absolute value function coincides with the ϵ-insensitive loss for ϵ = 0. Thus, (41a) follows immediately from Corollary 6 by Regularization via Mass Transportation fixing w and by setting ϵ = 0 and Ξ = Rn+1. As for assertion (ii), similar arguments as in the proof of Lemma 45 show that Emin(w) = sup λ 0 λρ + 1 i=1 inf x,y |y w, x | + λ (x, y) (bxi, byi) . (A.10) The subordinate minimization problem in the first constraint of (A.10) is equivalent to inf x,y |y w, x | + λ (x, y) (bxi, byi) = inf x,y sup (qi,vi) λ |y w, x | + qi, x bxi + vi(yi byi) = sup (qi,vi) λ ( inf x,y,z z + qi, x bxi + vi(yi byi) s.t. z y w, x , z w, x y sup qi,vi,ri,ti qi, bxi byivi s.t. ti + ri = 1, ti ri = vi (ri ti)w = qi, (qi, vi) λ ri, ti 0 sup vi vi( w, bxi byi) s.t. 1 vi 1 vi (w, 1) λ, vi (w, 1) λ, where the second equality holds due to Proposition 5.5.4 in (Bertsekas, 2009), and the third equality holds due to strong linear programming duality. By substituting the last optimization problem into (A.10) and replacing vi with vi, we have sup vi,λ λρ + 1 i=1 vi(byi w, bxi ) s.t. 1 vi 1 i [N] vi (w, 1) λ i [N] vi (w, 1) λ i [N] sup vi,λ λρ + 1 i=1 vi|byi w, bxi | s.t. 0 vi 1 i [N] vi (w, 1) λ i [N] sup v,λ λρ + 1 i=1 v|byi w, bxi | s.t. 0 v 1 v (w, 1) λ i=1 |byi w, bxi | ρ (w, 1) Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani i=1 |byi w, bxi | ρ (w, 1) , 0 Thus, the claim follows. Proof of Theorem 42 As for assertion (i), similar arguments as in the proof of Lemma 45 show that inf λ,si λρ + 1 s.t. sup x 1{byi w,x 0} λ bxi x si i N sup x 1{ byi w,x 0} λ bxi x κλ si i N Next, observe that the indicator functions in (A.11a) can be represented as pointwise maxima of extended real-valued concave functions of the form 1{byi w,x 0} = max{I1(x), 0} and 1{byi w,x 0} = max{I2(x), 0}, respectively, where ( 1 byi w, x 0, otherwise, and I2(x) = ( 1 byi w, x 0, otherwise. This allows us to reformulate (A.11a) as inf λ,si λρ + 1 s.t. sup x Rn I1(x) λ bxi x si i N sup x 0 λ bxi x si i N sup x I2(x) λ bxi x κλ si i [N] sup x 0 λ bxi x κλ si i N Using the definition of the dual norm and applying the duality theorem (Bertsekas, 2009, Proposition 5.5.4), we find inf λ,si,pi,qi λρ + 1 s.t. sup x I1(x) + pi, x pi, bxi si i N sup x I2(x) + qi, x qi, bxi κλ si i N si 0, pi λ, qi λ i [N]. Moreover, by strong linear programming duality we have sup x I1(x) + pi, x = sup x {1 + pi, x : byi w, x 0} = inf ri 0 {1 : byiriw = pi} (A.11c) Regularization via Mass Transportation sup x I2(x) + qi, x = sup x {1 + qi, x : byi w, x 0} = inf ti 0 {1 : byitiw = qi} . (A.11d) Substituting (A.11c) and (A.11d) into (A.11b) yields (42a). The expression (42b) for the best-case risk can be proved in a similar fashion. Details are omitted for brevity. Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin. Learning from Data. AMLBook, 2012. N. Agarwal, B. Bullins, and E. Hazan. Second-order stochastic optimization for machine learning in linear time. Journal of Machine Learning Research, 18:4148 4187, 2017. M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214 223, 2017. K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive. ics.uci.edu/ml. Available from http://archive.ics.uci.edu/ml. P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463 482, 2002. A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009. A. Ben-Tal, D. Den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341 357, 2013. J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyr e. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2): A1111 A1138, 2015. D. Bertsekas. Convex Optimization Theory. Athena Scientific, 2009. D. Bertsimas and M. S. Copenhaver. Characterization of the equivalence of robustification and regularization in linear and matrix regression. European Journal of Operational Research, 2017. C. Bhattacharyya. Second order cone programming formulations for feature selection. Journal of Machine Learning Research, 5:1417 1433, 2004. J. Blanchet and K. Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 2019. J. Blanchet, Y. Kang, and K. Murthy. Robust Wasserstein profile inference and applications to machine learning. ar Xiv preprint ar Xiv:1610.05627, 2016. Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani J. Blanchet, Y. Kang, F. Zhang, and K. Murthy. Data-driven optimal transport cost selection for distributionally robust optimization. ar Xiv preprint ar Xiv:1705.07152, 2017. S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. L. Breiman. Bias, variance, and arcing classifiers. Technical report, University of California, Berkeley, 1996. J.-F. Cai, E. J. Cand es, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956 1982, 2010. G. C. Calafiore and L. El Ghaoui. On distributionally robust chance-constrained linear programs. Journal of Optimization Theory and Applications, 130(1):1 22, 2006. K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273 297, 1995. M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292 2300, 2013. M. Cuturi and D. Avis. Ground metric learning. Journal of Machine Learning Research, 15: 533 564, 2014. M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685 693, 2014. B. Defourny. Machine Learning Solution Methods for Multistage Stochastic Programming. Ph D thesis, Institut Montefiore, Universit e de Liege, 2010. E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595 612, 2010. J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. De CAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pages 647 655, 2014. J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In International Conference on Machine Learning, pages 272 279, 2008. L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18(4):1035 1064, 1997. H. Everett III. Generalized Lagrange multiplier method for solving problems of optimum allocation of resources. Operations Research, 11(3):399 417, 1963. M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2): 303 338, 2010. Regularization via Mass Transportation F. Farnia and D. Tse. A minimax approach to supervised learning. In Advances in Neural Information Processing Systems, pages 4240 4248, 2016. N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707 738, 2015. C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053 2061, 2015. R. Gao and A. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. ar Xiv preprint ar Xiv:1604.02199, 2016. J. Goh and M. Sim. Distributionally robust optimization and its tractable approximations. Operations Research, 58(4):902 917, 2010. H. Gouk, E. Frank, B. Pfahringer, and M. Cree. Regularisation of neural networks by enforcing Lipschitz continuity. ar Xiv preprint ar Xiv:1804.04368, 2018. N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53 (2):217 288, 2011. G. A. Hanasusanto, D. Kuhn, and W. Wiesemann. A comment on computational complexity of stochastic programming problems. Mathematical Programming A, 159(1-2):557 569, 2016. D. Hosmer, S. Lemeshow, and R. X. Sturdivant. Applied Logistic Regression. John Wiley & Sons, 2013. K. Huang, H. Yang, I. King, M. R. Lyu, and L. Chan. The minimum error minimax probability machine. Journal of Machine Learning Research, 5:1253 1286, 2004. R. Koenker. Quantile Regression. Cambridge University Press, 2005. A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097 1105, 2012. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan. Minimax probability machine. In Advances in Neural Information Processing Systems, pages 801 807, 2002a. G. R. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555 582, 2002b. N. Lawrence and B. Sch olkopf. Estimating a kernel Fisher discriminant in the presence of label noise. In International Conference on Machine Learning, pages 306 306, 2001. Y. Le Cun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 1998. Available from http://yann.lecun.com/exdb/mnist. Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani C. Lee and S. Mehrotra. A distributionally-robust approach for finding support vector machines. Available from Optimization Online, 2015. J. Lee and M. Raginsky. Minimax statistical learning with Wasserstein distances. In Advances in Neural Information Processing Systems, pages 2687 2696, 2018. N. Loizou and P. Richt arik. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. ar Xiv preprint ar Xiv:1712.09677, 2017. F. Luo and S. Mehrotra. Decomposition algorithm for distributionally robust optimization using Wasserstein metric with an application to a class of regression models. European Journal of Operational Research, 278(1):20 35, 2019. T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. P. Mohajerin Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming A, 171(1-2):115 166, 2018. N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems, pages 1196 1204, 2013. B. Neyshabur, S. Bhojanapalli, and N. Srebro. A PAC-Bayesian approach to spectrallynormalized margin bounds for neural networks. In International Conference on Learning Representations, 2018. A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, pages 1574 1582, 2014. R. T. Rockafellar and R. J.-B. Wets. Variational Analysis. Springer, 2009. Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover s distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99 121, 2000. K. Scaman and A. Virmaux. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Advances in Neural Information Processing Systems, pages 3835 3844, 2018. B. Sch olkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001. B. Sch olkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443 1471, 2001. S. Shafieezadeh-Abadeh, P. M. Esfahani, and D. Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, pages 1576 1584, 2015. Regularization via Mass Transportation S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. A. Shapiro. On duality theory of conic linear problems. In M. A. Goberna and M. A. L opez, editors, Semi-Infinite Programming, pages 135 165. Kluwer Academic Publishers, 2001. P. K. Shivaswamy and T. Jebara. Ellipsoidal kernel machines. In Artificial Intelligence and Statistics, pages 484 491, 2007. P. K. Shivaswamy and T. Jebara. Maximum relative margin and data-dependent regularization. Journal of Machine Learning Research, 11:747 788, 2010. P. K. Shivaswamy, C. Bhattacharyya, and A. J. Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283 1314, 2006. A. Smola and B. Sch olkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199 222, 2004. T. Strohmann and G. Z. Grudic. A formulation for minimax probability machine regression. In Advances in Neural Information Processing Systems, pages 785 792, 2003. C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. ar Xiv preprint ar Xiv:1312.6199, 2013. R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B, 58(1):267 288, 1996. A. N. Tikhonov, V. I. Arsenin, and F. John. Solutions of Ill-Posed Problems. Winston, 1977. C. Villani. Optimal Transport: Old and New. Springer, 2008. J. Weed and F. Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. To appear in Bernoulli, 2019. T. Wiatowski, M. Tschannen, A. Stanic, P. Grohs, and H. B olcskei. Discrete deep feature extraction: A theory and new architectures. In International Conference on Machine Learning, pages 2149 2158, 2016. W. Wiesemann, D. Kuhn, and M. Sim. Distributionally robust convex optimization. Operations Research, 62(6):1358 1376, 2014. H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10:1485 1510, 2009. H. Xu, C. Caramanis, and S. Mannor. Robust regression and Lasso. IEEE Transactions on Information Theory, 56(7):3561 3574, 2010. T. Yang, M. Mahdavi, R. Jin, L. Zhang, and Y. Zhou. Multiple kernel learning from noisy labels by stochastic programming. In International Conference on Machine Learning, pages 123 130, 2012. Shafieezadeh-Abadeh and Kuhn and Mohajerin Esfahani M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818 833, 2014. C. Zhao and Y. Guan. Data-driven risk-averse stochastic optimization with Wasserstein metric. Operations Research Letters, 46(2):262 267, 2018.