# Minimax Statistical Learning with Wasserstein Distances

Jaeho Lee and Maxim Raginsky
{jlee620, maxim}@illinois.edu
Department of Electrical and Computer Engineering and Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801, USA. This work was supported in part by NSF grant nos. CIF-1527388 and CIF-1302438, and in part by the NSF CAREER award 1254041.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Abstract

As opposed to standard empirical risk minimization (ERM), distributionally robust optimization aims to minimize the worst-case risk over a larger ambiguity set containing the original empirical distribution of the training data. In this work, we describe a minimax framework for statistical learning with ambiguity sets given by balls in Wasserstein space. In particular, we prove generalization bounds that involve the covering number properties of the original ERM problem. As an illustrative example, we provide generalization guarantees for transport-based domain adaptation problems where the Wasserstein distance between the source and target domain distributions can be reliably estimated from unlabeled samples.

1 Introduction

In the traditional paradigm of statistical learning [20], we have a class $\mathcal{P}$ of probability measures on a measurable instance space $\mathsf{Z}$ and a class $\mathcal{F}$ of measurable functions $f : \mathsf{Z} \to \mathbb{R}_+$. Each $f \in \mathcal{F}$ quantifies the loss of some decision rule or hypothesis applied to instances $z \in \mathsf{Z}$, so, with a slight abuse of terminology, we will refer to $\mathcal{F}$ as the hypothesis space. The (expected) risk of a hypothesis $f$ on instances generated according to $P$ is given by
$$R(P, f) := \mathbf{E}_P[f(Z)] = \int_{\mathsf{Z}} f \,\mathrm{d}P.$$
Given an $n$-tuple $Z_1, \ldots, Z_n$ of i.i.d. training examples drawn from an unknown $P \in \mathcal{P}$, the objective is to find a hypothesis $\hat{f} \in \mathcal{F}$ whose risk $R(P, \hat{f})$ is close to the minimum risk
$$R^*(P, \mathcal{F}) := \inf_{f \in \mathcal{F}} R(P, f) \quad (1)$$
with high probability. Under suitable regularity assumptions, this objective can be accomplished via Empirical Risk Minimization (ERM) [20, 13]:
$$R(P_n, f) = \frac{1}{n} \sum_{i=1}^n f(Z_i) \to \min, \quad f \in \mathcal{F}, \quad (2)$$
where $P_n := \frac{1}{n}\sum_{i=1}^n \delta_{Z_i}$ is the empirical distribution of the training examples.

Recently, however, an alternative viewpoint has emerged, inspired by ideas from robust statistics and robust stochastic optimization. In this distributionally robust framework, instead of solving the ERM problem (2), one aims to solve the minimax problem
$$\sup_{Q \in \mathcal{A}(P_n)} R(Q, f) \to \min, \quad f \in \mathcal{F}, \quad (3)$$
where $\mathcal{A}(P_n)$ is an ambiguity set containing the empirical distribution $P_n$ and, possibly, the unknown probability law $P$, either with high probability or almost surely. The ambiguity set serves as a mechanism for compensating for the uncertainty about $P$ that inherently arises from having only a finite number of samples to work with, and can be constructed in a variety of ways, e.g. via moment constraints [9], $f$-divergence balls [8], or Wasserstein balls [16, 11, 5].
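To make the contrast between (2) and (3) concrete, the following is a minimal numerical sketch of the worst-case objective in (3); the finite ambiguity set of reweighted empirical distributions and the two-element hypothesis class are illustrative assumptions, not constructions used in this paper.

```python
# A minimal sketch of the minimax objective (3) under assumed ingredients:
# the ambiguity set is (hypothetically) a finite collection of reweightings
# of the empirical distribution, and the hypothesis class is a toy finite set.
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(0.0, 1.0, size=50)               # training sample Z_1, ..., Z_n

# toy finite hypothesis class F: each f maps an instance to a nonnegative loss
hypotheses = {
    "f_a": lambda t: (t - 0.3) ** 2,
    "f_b": lambda t: np.abs(t - 0.7),
}

# hypothetical ambiguity set A(P_n): the empirical weights plus perturbed reweightings
n = len(z)
uniform = np.full(n, 1.0 / n)
ambiguity = [uniform]
for _ in range(20):
    w = np.clip(uniform + 0.02 * rng.standard_normal(n), 0.0, None)
    ambiguity.append(w / w.sum())                # renormalize to a probability vector

def worst_case_risk(f):
    """sup over Q in the (finite) ambiguity set of E_Q[f(Z)]."""
    losses = f(z)
    return max(float(w @ losses) for w in ambiguity)

# ordinary ERM risk vs. distributionally robust (minimax) risk for each hypothesis
for name, f in hypotheses.items():
    print(name, "ERM risk:", float(uniform @ f(z)), "worst-case risk:", worst_case_risk(f))
print("minimax choice:", min(hypotheses, key=lambda k: worst_case_risk(hypotheses[k])))
```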
However, with the exception of the recent work by Farnia and Tse [9], the minimizer of (3) is still evaluated under the standard statistical risk minimization paradigm. In this work, we instead study the scheme where the statistical risk minimization criterion (1) is replaced with the local minimax risk
$$\inf_{f \in \mathcal{F}} \sup_{Q \in \mathcal{A}(P)} R(Q, f)$$
at $P$, where the ambiguity set $\mathcal{A}(P)$ is taken to be a Wasserstein ball centered at $P$. As we will argue below, this change of perspective is natural when there is a possibility of domain drift, i.e., when the learned hypothesis is evaluated on a distribution $Q$ that may differ from the distribution $P$ used to generate the training data.

The rest of this paper is organized as follows. In Section 2, we formally introduce the notion of local minimax risk and discuss its relationship to the statistical risk, which allows us to assess the performance of the minimax-optimal hypothesis in specific domains. We also provide an example to illustrate the role of ambiguity sets in rejecting nonrobust hypotheses. In Section 3, we show that the hypothesis learned by the ERM procedure based on the local minimax risk closely achieves the optimal local minimax risk. In particular, we provide a data-dependent bound on the generalization error, which behaves like the bound for ordinary ERM in the no-ambiguity regime (Theorem 1), as well as excess risk bounds under uniform smoothness assumptions on $\mathcal{F}$ (Theorem 2) and under the less restrictive assumption that $\mathcal{F}$ contains at least one smooth hypothesis (Theorem 3). In Section 4, we provide an alternative perspective on domain adaptation based on minimax statistical learning under the framework of Courty et al. [6], where the domain drift is due to an unknown transformation of the feature space that preserves the conditional distribution of the labels given the features. Completely bypassing the estimation of the transport map, we provide a proper excess risk bound that compares the risk of the learned hypothesis to the minimal risk achievable within the given hypothesis class on the target domain (Theorem 4). To the best of our knowledge, all existing theoretical results on domain adaptation are stated in terms of the discrepancy between the best hypotheses on the source and on the target domains. All proofs are deferred to the appendix.

2 Local minimax risk with Wasserstein ambiguity sets

We assume that the instance space $\mathsf{Z}$ is a Polish space (i.e., a complete separable metric space) with metric $d_{\mathsf{Z}}$. We denote by $\mathcal{P}(\mathsf{Z})$ the space of all Borel probability measures on $\mathsf{Z}$, and by $\mathcal{P}_p(\mathsf{Z})$ with $p \ge 1$ the space of all $P \in \mathcal{P}(\mathsf{Z})$ with finite $p$th moments. The metric structure of $\mathsf{Z}$ can be used to define a family of metrics on the spaces $\mathcal{P}_p(\mathsf{Z})$ [21]:

Definition 1. For $p \ge 1$, the $p$-Wasserstein distance between $P, Q \in \mathcal{P}_p(\mathsf{Z})$ is
$$W_p(P, Q) := \inf_{\substack{M(\cdot \times \mathsf{Z}) = P \\ M(\mathsf{Z} \times \cdot) = Q}} \big( \mathbf{E}_M[d_{\mathsf{Z}}^p(Z, Z')] \big)^{1/p},$$
where the infimum is taken over all couplings of $P$ and $Q$, i.e., probability measures $M$ on the product space $\mathsf{Z} \times \mathsf{Z}$ with the given marginals $P$ and $Q$.

Remark 1. Wasserstein distances arise in the problem of optimal transport: for any coupling $M$ of $P$ and $Q$, the conditional distribution $M_{Z'|Z}$ can be viewed as a randomized policy for transporting a unit quantity of some material from a random location $Z \sim P$ to another location $Z'$, while satisfying the marginal constraint $Z' \sim Q$. If the cost of transporting a unit of material from $z \in \mathsf{Z}$ to $z' \in \mathsf{Z}$ is given by $d_{\mathsf{Z}}^p(z, z')$, then $W_p^p(P, Q)$ is the minimum expected transport cost.

We now consider a learning problem $(\mathcal{P}, \mathcal{F})$ with $\mathcal{P} = \mathcal{P}_p(\mathsf{Z})$ for some $p \ge 1$. Following [16, 17, 11], we let the ambiguity set $\mathcal{A}(P)$ be the $p$-Wasserstein ball of radius $\rho \ge 0$ centered at $P$:
$$\mathsf{B}^W_{\rho,p}(P) := \{ Q \in \mathcal{P}_p(\mathsf{Z}) : W_p(P, Q) \le \rho \},$$
where the radius $\rho$ is a tunable parameter. We then define the local worst-case risk of $f$ at $P$,
$$R_{\rho,p}(P, f) := \sup_{Q \in \mathsf{B}^W_{\rho,p}(P)} R(Q, f),$$
and the local minimax risk at $P$:
$$R^*_{\rho,p}(P, \mathcal{F}) := \inf_{f \in \mathcal{F}} R_{\rho,p}(P, f).$$
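For two discrete (e.g., empirical) distributions, the infimum in Definition 1 is a finite-dimensional linear program. The following is a minimal sketch that computes $W_p$ between two one-dimensional samples in this way; the metric (absolute difference on the real line), the sample sizes, and the solver choice are illustrative assumptions.

```python
# A minimal sketch of Definition 1 for two empirical distributions: solve the
# optimal-transport linear program over couplings M with the given marginals.
import numpy as np
from scipy.optimize import linprog

def wasserstein_p(x, y, p=2):
    """W_p between the empirical distributions of 1-d samples x and y."""
    n, m = len(x), len(y)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)      # uniform marginals
    cost = np.abs(x[:, None] - y[None, :]) ** p          # d_Z(z, z')^p
    # the coupling M is an n*m vector; equality constraints fix its marginals
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0                 # row sums equal a
    for j in range(m):
        A_eq[n + j, j::m] = 1.0                          # column sums equal b
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun ** (1.0 / p)

rng = np.random.default_rng(0)
print(wasserstein_p(rng.uniform(0, 1, 30), rng.uniform(0.2, 1.2, 40), p=1))
```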
2.1 Local worst-case risk vs. statistical risk

We give a couple of inequalities relating the local worst-case (or local minimax) risks and the usual statistical risks, which will be useful in Section 4. The first one is a simple consequence of the Kantorovich duality theorem from the theory of optimal transport [21]:

Proposition 1. Suppose that $f$ is $L$-Lipschitz, i.e., $|f(z) - f(z')| \le L\, d_{\mathsf{Z}}(z, z')$ for all $z, z' \in \mathsf{Z}$. Then, for any $Q \in \mathsf{B}^W_{\rho,p}(P)$,
$$R(Q, f) \le R_{\rho,p}(P, f) \le R(Q, f) + 2L\rho.$$

As an example, consider the problem of binary classification with the hinge loss: $\mathsf{Z} = \mathsf{X} \times \mathsf{Y}$, where $\mathsf{X}$ is an arbitrary feature space, $\mathsf{Y} = \{-1, +1\}$, and the hypothesis space $\mathcal{F}$ consists of all functions of the form $f(z) = f(x, y) = \max\{0, 1 - y f_0(x)\}$, where $f_0 : \mathsf{X} \to \mathbb{R}$ is a candidate predictor. Then, since the function $u \mapsto \max\{0, 1 - u\}$ is Lipschitz-continuous with constant 1, we can write
$$|f(x, y) - f(x', y')| \le |y f_0(x) - y' f_0(x')| \le 2\|f_0\|_{\mathsf{X}} \mathbf{1}\{y \ne y'\} + |f_0(x) - f_0(x')|,$$
where $\|f_0\|_{\mathsf{X}} := \sup_{x \in \mathsf{X}} |f_0(x)|$. If $\|f_0\|_{\mathsf{X}} < \infty$ and if $f_0$ is $L_0$-Lipschitz with respect to some metric $d_{\mathsf{X}}$ on $\mathsf{X}$, then it follows that $f$ is Lipschitz with constant $\max\{2\|f_0\|_{\mathsf{X}}, L_0\}$ with respect to the product metric $d_{\mathsf{Z}}(z, z') = d_{\mathsf{Z}}((x, y), (x', y')) := d_{\mathsf{X}}(x, x') + \mathbf{1}\{y \ne y'\}$.

Next we consider the case when the function $f$ is smooth but not Lipschitz-continuous. Since we are working with general metric spaces that may lack an obvious differentiable structure, we first need to introduce some concepts from metric geometry [1]. A metric space $(\mathsf{Z}, d_{\mathsf{Z}})$ is a geodesic space if for every two points $z, z' \in \mathsf{Z}$ there exists a path $\gamma : [0, 1] \to \mathsf{Z}$ such that $\gamma(0) = z$, $\gamma(1) = z'$, and $d_{\mathsf{Z}}(\gamma(s), \gamma(t)) = (t - s)\, d_{\mathsf{Z}}(\gamma(0), \gamma(1))$ for all $0 \le s \le t \le 1$ (such a path is called a constant-speed geodesic). A functional $F : \mathsf{Z} \to \mathbb{R}$ is geodesically convex if for any pair of points $z, z' \in \mathsf{Z}$ there is a constant-speed geodesic $\gamma$ such that
$$F(\gamma(t)) \le (1 - t) F(\gamma(0)) + t F(\gamma(1)) = (1 - t) F(z) + t F(z'), \qquad \forall t \in [0, 1].$$
An upper gradient of a Borel function $f : \mathsf{Z} \to \mathbb{R}$ is a functional $G_f : \mathsf{Z} \to \mathbb{R}_+$ such that for any pair of points $z, z' \in \mathsf{Z}$ there exists a constant-speed geodesic $\gamma$ obeying
$$|f(z') - f(z)| \le \Big( \int_0^1 G_f(\gamma(t))\, \mathrm{d}t \Big)\, d_{\mathsf{Z}}(z, z'). \quad (5)$$
With these definitions at hand, we have the following:

Proposition 2. Suppose that $f$ has a geodesically convex upper gradient $G_f$. Then
$$R(Q, f) \le R_{\rho,p}(P, f) \le R(Q, f) + 2\rho \sup_{Q' \in \mathsf{B}^W_{\rho,p}(P)} \| G_f(Z) \|_{L^q(Q')},$$
where $1/p + 1/q = 1$ and $\| \cdot \|_{L^q(Q)} := (\mathbf{E}_Q |\cdot|^q)^{1/q}$.

Consider the setting of regression with quadratic loss: let $\mathsf{X}$ be a convex subset of $\mathbb{R}^d$, let $\mathsf{Y} = [-B, B]$ for some $0 < B < \infty$, and equip $\mathsf{Z} = \mathsf{X} \times \mathsf{Y}$ with the Euclidean metric
$$d_{\mathsf{Z}}(z, z') = \sqrt{ \|x - x'\|_2^2 + |y - y'|^2 }, \qquad z = (x, y),\ z' = (x', y'). \quad (6)$$
Suppose that the functions $f \in \mathcal{F}$ are of the form $f(z) = f(x, y) = (y - h(x))^2$ with $h \in C^1(\mathbb{R}^d, \mathbb{R})$, such that $\|h\|_{\mathsf{X}} \le M < \infty$ and $\|\nabla h(x)\|_2 \le L \|x\|_2$ for some $0 < L < \infty$. Then Proposition 2 leads to the following:

Proposition 3.
$$R(Q, f) \le R_{\rho,2}(P, f) \le R(Q, f) + 4\rho (B + M) \Big( 1 + L \sup_{Q' \in \mathsf{B}^W_{\rho,2}(P)} \sigma_{Q',X} \Big),$$
where $\sigma_{Q,X} := \sqrt{\mathbf{E}_Q \|X\|_2^2}$ for $Z = (X, Y) \sim Q$.

2.2 An illustrative example: $\rho$ as an exploratory budget

Before providing formal theoretical guarantees for ERM based on the local minimax risk $R_{\rho,p}(P_n, f)$ in Section 3, we give a stylized yet insightful example to illustrate the key difference between ordinary ERM and local minimax ERM. In a nutshell, local minimax ERM uses the Wasserstein radius $\rho$ as an exploratory budget to reject hypotheses that are overly sensitive to domain drift.

Consider $Z \sim \mathrm{Unif}[0, 1] =: P$ on the data space $\mathsf{Z} = [0, 2]$, along with a hypothesis class $\mathcal{F}$ containing only two hypotheses:
$$f_0(z) \equiv 1, \qquad f_1(z) = \begin{cases} 0, & z \in [0, 1), \\ \Delta, & z \in [1, 2], \end{cases}$$
for some $\Delta \ge 1$.
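Before analyzing this example, here is a quick numerical sketch of the comparison between the two hypotheses; the parameter values are illustrative, and the worst-case risk of $f_1$ is computed by the direct mass-moving argument used below rather than by the general duality of Section 3.

```python
# A quick numerical sketch of the two-hypothesis example (assumed parameter values):
# the worst case over the Wasserstein ball moves as much mass as the budget rho^p
# allows from the samples nearest to z = 1, where f_1 jumps to the value delta.
import numpy as np

def worst_case_risk_f1(samples, rho, p, delta):
    budget, moved, n = rho ** p, 0.0, len(samples)
    for z_i in sorted(samples, reverse=True):          # cheapest-to-move samples first
        unit_cost = (1.0 - z_i) ** p                   # cost of moving one unit of mass to z = 1
        mass = 1.0 / n if unit_cost == 0 else min(1.0 / n, budget / unit_cost)
        moved += mass
        budget -= mass * unit_cost
        if budget <= 0:
            break
    return delta * min(moved, 1.0)                     # moved mass incurs loss delta, the rest loss 0

rng = np.random.default_rng(0)
rho, p, delta, n = 0.05, 1, 5.0, 200                   # illustrative values, not from the paper
samples = rng.uniform(0.0, 1.0, size=n)
risk_f0 = 1.0                                          # f_0 is constant, so its worst-case risk is 1
risk_f1 = worst_case_risk_f1(samples, rho, p, delta)
print("R_rho(Pn, f0) =", risk_f0, " R_rho(Pn, f1) =", risk_f1)
print("minimax ERM picks", "f0" if risk_f0 <= risk_f1 else "f1")
```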
Notice that, if the training data are drawn from $P = \mathrm{Unif}[0, 1]$, ordinary ERM will always return $f_1$, the hypothesis that is not robust against small domain drifts, whereas we would like a principled procedure that returns $f_0$, a hypothesis that works well for probability distributions close to the data-generating distribution $\mathrm{Unif}[0, 1]$. The success of minimax learning depends solely on the ability to transport some mass from a nearby training sample to the point $1$, the region where the nonrobust $f_1$ starts to perform poorly. Specifically, minimax learning is successful when $R_{\rho,p}(P_n, f_0) = 1$ is smaller than $R_{\rho,p}(P_n, f_1) \ge \Delta \rho^p / (1 - \max_i Z_i)^p$, which happens with probability $1 - (1 - \rho \Delta^{1/p})^n$. We make the following key observations:

- While a smaller $\rho$ leads to a smaller nontrivial excess risk $R_{\rho,p}(P, f_1) - R_{\rho,p}(P, f_0)$, it also leads to a slower decay of the error probability. As a result, for a given $\rho$, one can construct a hypothesis class maximizing the excess risk at the target $\rho$, with excess risk behaving roughly as $\rho^{p^2/(p+1)}$, without affecting the Rademacher average of the class (see supplementary Appendix B for details).
- It is possible to guarantee smooth behavior of the ERM hypothesis without imposing uniform smoothness assumptions on $\mathcal{F}$: if there exists a single smooth hypothesis $f_0$, it can be used as a baseline against which nonsmooth hypotheses are rejected. We build on this idea in Section 3.3.

3 Guarantees for empirical risk minimization

Let $Z_1, \ldots, Z_n$ be an $n$-tuple of i.i.d. training examples drawn from $P$. In this section, we analyze the performance of the local minimax ERM procedure
$$\hat{f} := \arg\min_{f \in \mathcal{F}} R_{\rho,p}(P_n, f). \quad (7)$$
The following strong duality result due to Gao and Kleywegt [11] will be instrumental:

Proposition 4. For any upper semicontinuous function $f : \mathsf{Z} \to \mathbb{R}$ and for any $Q \in \mathcal{P}_p(\mathsf{Z})$,
$$R_{\rho,p}(Q, f) = \min_{\lambda \ge 0} \big\{ \lambda \rho^p + \mathbf{E}_Q[\varphi_{\lambda,f}(Z)] \big\}, \quad (8)$$
where $\varphi_{\lambda,f}(z) := \sup_{z' \in \mathsf{Z}} \big( f(z') - \lambda\, d_{\mathsf{Z}}^p(z', z) \big)$.

3.1 Data-dependent bound on generalization error

We begin by imposing standard regularity assumptions (see, e.g., [7]) that allow us to invoke concentration-of-measure results for empirical processes.

Assumption 1. The instance space $\mathsf{Z}$ is bounded: $\mathrm{diam}(\mathsf{Z}) := \sup_{z, z' \in \mathsf{Z}} d_{\mathsf{Z}}(z, z') < \infty$.

Assumption 2. The functions in $\mathcal{F}$ are upper semicontinuous and uniformly bounded: $0 \le f(z) \le M < \infty$ for all $f \in \mathcal{F}$ and $z \in \mathsf{Z}$.

As a complexity measure of the hypothesis class $\mathcal{F}$, we use the entropy integral [19]
$$C(\mathcal{F}) := \int_0^\infty \sqrt{ \log N(\mathcal{F}, \|\cdot\|_\infty, u) }\, \mathrm{d}u,$$
where $N(\mathcal{F}, \|\cdot\|_\infty, \cdot)$ denotes the covering number of $\mathcal{F}$ in the uniform metric $\|f - f'\|_\infty = \sup_{z \in \mathsf{Z}} |f(z) - f'(z)|$. The benefits of using the entropy integral $C(\mathcal{F})$ instead of the usual complexity measures such as Rademacher or Gaussian complexity [3] are twofold: (1) $C(\mathcal{F})$ takes into account the behavior of hypotheses outside the support of the data-generating distribution $P$, and thus can be applied to the assessment of the local worst-case risk; (2) the Rademacher complexity of $\varphi_{\lambda,f}$ can be upper-bounded naturally via $C(\mathcal{F})$ and the covering number of a suitable bounded subset of $[0, \infty)$.

We are now ready to give our data-dependent bound on $R_{\rho,p}(P, f)$:

Theorem 1. For any $\mathcal{F}, P$ satisfying Assumptions 1 and 2, and for any $t > 0$,
$$\mathbb{P}\bigg( \exists f \in \mathcal{F} : R_{\rho,p}(P, f) > \min_{\lambda \ge 0} \Big\{ (\lambda + 1)\rho^p + \mathbf{E}_{P_n}[\varphi_{\lambda,f}(Z)] + \frac{M \log(\lambda + 1)}{\sqrt{n}} \Big\} + \frac{24 C(\mathcal{F})}{\sqrt{n}} + \frac{Mt}{\sqrt{n}} \bigg) \le 2 \exp(-2t^2)$$
and
$$\mathbb{P}\bigg( \exists f \in \mathcal{F} : R_{\rho,p}(P_n, f) > \min_{\lambda \ge 0} \Big\{ (\lambda + 1)\rho^p + \mathbf{E}_{P}[\varphi_{\lambda,f}(Z)] + \frac{M \log(\lambda + 1)}{\sqrt{n}} \Big\} + \frac{24 C(\mathcal{F})}{\sqrt{n}} + \frac{Mt}{\sqrt{n}} \bigg) \le 2 \exp(-2t^2).$$

Notice that Theorem 1 is in the style of data-dependent generalization bounds for margin cost function classes [14], often used in the analysis of voting methods or support vector machines [2].
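For a fixed hypothesis $f$, the dual form (8) gives a direct way to evaluate the local worst-case empirical risk appearing in Theorem 1. The following is a minimal grid-based sketch on a one-dimensional bounded instance space; the discretization of $\mathsf{Z}$, the grid of dual variables, and the particular loss are illustrative assumptions.

```python
# A minimal sketch of evaluating R_{rho,p}(P_n, f) via the duality of Proposition 4
# (grid search over both the dual variable lambda and the inner supremum over z').
import numpy as np

def local_worst_case_risk(f, samples, rho, p=1, z_grid=None, lam_grid=None):
    if z_grid is None:
        z_grid = np.linspace(0.0, 2.0, 2001)             # discretized instance space (assumed bounded)
    if lam_grid is None:
        lam_grid = np.linspace(0.0, 50.0, 501)           # candidate dual variables lambda >= 0
    f_grid = f(z_grid)
    best = np.inf
    for lam in lam_grid:
        # phi_{lam,f}(Z_i) = sup_{z'} [ f(z') - lam * d(z', Z_i)^p ]
        penalty = lam * np.abs(z_grid[None, :] - samples[:, None]) ** p
        phi = np.max(f_grid[None, :] - penalty, axis=1)
        best = min(best, lam * rho ** p + phi.mean())
    return best

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=100)
f = lambda z: np.minimum((z - 0.5) ** 2, 1.0)            # an arbitrary bounded loss on Z = [0, 2]
print("empirical risk:       ", f(samples).mean())
print("local worst-case risk:", local_worst_case_risk(f, samples, rho=0.1, p=1))
```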
Remark 2. When $\rho = 0$, we recover the behavior of the usual statistical risk $R(P, f)$. Specifically, it is not hard to show from the definition of $\varphi_{\lambda,f}$ that $\mathbf{E}_{P_n}[\varphi_{\lambda,f}] = \mathbf{E}_{P_n}[f]$ holds for all
$$\lambda \ge \hat{\lambda}_n := \max_{1 \le i \le n} \sup_{z' \in \mathsf{Z}} \frac{f(z') - f(Z_i)}{d_{\mathsf{Z}}^p(z', Z_i)}.$$
In that case, when $\rho = 0$, the generalization error converges to zero at the rate $1/\sqrt{n}$ with the usual coefficients from Dudley's entropy integral [19] and McDiarmid's inequality, plus an added term of order $M\sqrt{\log \lambda}/\sqrt{n}$ for some $\lambda \le \hat{\lambda}_n$.

3.2 Excess risk bounds with uniform smoothness

As evident from Remark 2, if we have a priori knowledge that the hypothesis selected by the minimax ERM procedure (7) is smooth with respect to the underlying metric, then we can restrict the feasible values of $\lambda$ to provide data-independent guarantees on the generalization error that vanish as $n \to \infty$. Let us start by imposing the following uniform smoothness condition on $\mathcal{F}$:

Assumption 3. The functions in $\mathcal{F}$ are $L$-Lipschitz:
$$\sup_{z \ne z'} \frac{|f(z') - f(z)|}{d_{\mathsf{Z}}(z', z)} \le L \qquad \text{for all } f \in \mathcal{F}.$$

One motivation for Assumption 3 is the following bound on the excess risk: whenever the solution $\bar{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n f(Z_i)$ of the original ERM is $L$-Lipschitz, Kantorovich duality gives
$$R_{\rho,p}(P, \bar{f}) - R^*_{\rho,p}(P, \mathcal{F}) \le R(P, \bar{f}) - R^*(P, \mathcal{F}) + L\rho, \quad (9)$$
where the right-hand side is the sum of the excess risk of ordinary ERM and the worst-case deviation of the risk due to the ambiguity. The bound (9) is particularly useful when both $\rho$ and $n$ are small, but it does not vanish as $n \to \infty$. The following lemma enables control of the infimum-achieving dual parameter $\lambda$ with respect to both the true and the empirical distribution:

Lemma 1. Fix some $Q \in \mathcal{P}_p(\mathsf{Z})$, and define $\bar{f} \in \mathcal{F}$ and $\bar{\lambda} \ge 0$ via
$$\bar{f} := \arg\min_{f \in \mathcal{F}} R_{\rho,p}(Q, f) \quad \text{and} \quad \bar{\lambda} := \arg\min_{\lambda \ge 0} \big\{ \lambda \rho^p + \mathbf{E}_Q[\varphi_{\lambda,\bar{f}}(Z)] \big\}.$$
Then, under Assumptions 1–3, we have $\bar{\lambda} \le L \rho^{-(p-1)}$.

We can then use Dudley entropy integral arguments [19] on the joint search space of $\lambda$ and $f$ to obtain the following theorem:

Theorem 2. Under Assumptions 1–3, the following holds with probability at least $1 - \delta$:
$$R_{\rho,p}(P, \hat{f}) - R^*_{\rho,p}(P, \mathcal{F}) \le \frac{48 C(\mathcal{F})}{\sqrt{n}} + \frac{48 L\, \mathrm{diam}(\mathsf{Z})^p}{\rho^{p-1} \sqrt{n}} + 3M \sqrt{\frac{\log(2/\delta)}{2n}}.$$

Remark 3. The adversarial training procedure appearing in the concurrent work of Sinha et al. [18] can be interpreted as a relaxed version of local minimax ERM, where $\lambda$ is held fixed (to improve implementability) rather than explicitly optimized. In that case, Lemma 1 may provide a guideline for the selection of the parameter $\lambda$; for example, one might run the fixed-$\lambda$ algorithm over a sufficiently fine grid of $\lambda$ on the interval $[0, L\rho^{-(p-1)}]$ to approximate local minimax ERM.

Note that when $p = 1$, we get a $\rho$-free bound of order $1/\sqrt{n}$, recovering the correct rate of ordinary ERM ($\rho = 0$). On the other hand, Theorem 2 cannot be used to recover the rate of ordinary ERM for $p > 1$. This phenomenon is due to the fact that we are using the Lipschitz assumption on $\mathcal{F}$, which is a data-independent constraint on the scale of the trade-off between $f(z') - f(z)$ and $d_{\mathsf{Z}}(z', z)$. For $p > 1$, one may also consider a similar data-independent (or worst-case) constraint
$$\sup_{z \ne z'} \frac{f(z') - f(z)}{d_{\mathsf{Z}}^p(z', z)} < +\infty.$$
However, this holds only if $f$ is constant, even in the simplest case $\mathsf{Z} \subseteq \mathbb{R}$.
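As a rough illustration of the fixed-$\lambda$ relaxation discussed in Remark 3, the following sketch grids $\lambda$ over $[0, L\rho^{-(p-1)}]$ and, for each value, minimizes the dual surrogate over a small parametric class; the class, grids, and Lipschitz bound are illustrative assumptions rather than constructions from the paper.

```python
# A minimal sketch of the fixed-lambda relaxation of local minimax ERM: for each
# lambda on a grid over [0, L / rho^(p-1)], minimize the dual surrogate over a toy
# parametric hypothesis class, then keep the best (lambda, theta) pair.
import numpy as np

rng = np.random.default_rng(1)
samples = rng.uniform(0.0, 1.0, size=100)                 # training sample on Z = [0, 2]
z_grid = np.linspace(0.0, 2.0, 1001)                      # discretized instance space
theta_grid = np.linspace(0.0, 2.0, 41)                    # parametric hypotheses f_theta
rho, p, M = 0.1, 1, 1.0
L = 2.0 * 2.0                                             # crude Lipschitz bound for f_theta below

def f_theta(theta, z):
    return np.minimum((z - theta) ** 2, M)                # bounded loss, Lipschitz on [0, 2]

best = (np.inf, None, None)
for lam in np.linspace(0.0, L / rho ** (p - 1), 30):      # range suggested by Lemma 1
    for theta in theta_grid:
        # dual surrogate: lam * rho^p + (1/n) sum_i sup_{z'} [ f_theta(z') - lam d(z', Z_i)^p ]
        gain = f_theta(theta, z_grid)[None, :] - lam * np.abs(z_grid[None, :] - samples[:, None]) ** p
        value = lam * rho ** p + np.max(gain, axis=1).mean()
        if value < best[0]:
            best = (value, lam, theta)

print("approximate local minimax ERM: objective %.4f at lambda=%.3f, theta=%.3f" % best)
```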
3.3 Excess risk bound with minimal assumptions

The illustrative example of Section 2.2 suggests that minimax learning may be possible even when the functions in $\mathcal{F}$ are not uniformly Lipschitz, as long as there exists at least one smooth hypothesis (at least outside the regime $\rho \to 0$). Based on this observation, we now consider a weaker alternative to Assumption 3:

Assumption 4. There exists a hypothesis $f_0 \in \mathcal{F}$ such that, for all $z \in \mathsf{Z}$, $f_0(z) \le C_0\, d_{\mathsf{Z}}^p(z, z_0)$ for some $C_0 \ge 0$ and $z_0 \in \mathsf{Z}$.

Assumption 4 guarantees the existence of a hypothesis with smooth behavior with respect to the underlying metric $d_{\mathsf{Z}}$; on the other hand, smoothness is not required of every $f \in \mathcal{F}$, so Assumption 4 is particularly useful when paired with a rich class $\mathcal{F}$. It is not difficult to see that Assumption 4 holds for most common hypothesis classes. As an example, consider again the setting of regression with quadratic loss as in Proposition 3; the functions $f \in \mathcal{F}$ are of the form $f(z) = f(x, y) = (y - h(x))^2$, where $h$ runs over some given class of candidate predictors that contains the constants. Then we can take $h_0(x) \equiv 0$, in which case $f_0(z) = (h_0(x) - y)^2 = |y|^2 \le d_{\mathsf{Z}}^2(z, z_0)$ for all $z_0$ of the form $(x, 0) \in \mathsf{X} \times \mathsf{Y}$.

Under Assumption 4, we can prove the following counterpart of Lemma 1:

Lemma 2. Fix some $Q \in \mathcal{P}_p(\mathsf{Z})$, and define $\bar{f} \in \mathcal{F}$ and $\bar{\lambda} \ge 0$ via
$$\bar{f} := \arg\min_{f \in \mathcal{F}} R_{\rho,p}(Q, f) \quad \text{and} \quad \bar{\lambda} := \arg\min_{\lambda \ge 0} \big\{ \lambda \rho^p + \mathbf{E}_Q[\varphi_{\lambda,\bar{f}}(Z)] \big\}.$$
Then, under Assumptions 1, 2, and 4,
$$\bar{\lambda} \le C_0\, 2^{p-1} \big( 1 + (\mathrm{diam}(\mathsf{Z})/\rho)^p \big).$$

The intuition behind Lemma 2 is to interpret the Wasserstein perturbation $\rho$ as a regularization parameter that thins out hypotheses with non-smooth behavior around $Q$ by comparing them to $f_0$. As $\rho$ grows, a smaller dual parameter $\lambda$ suffices to control the adversarial behavior. We can now give a performance guarantee for the ERM procedure (7):

Theorem 3. Under Assumptions 1, 2, and 4, the following holds with probability at least $1 - \delta$:
$$R_{\rho,p}(P, \hat{f}) - R^*_{\rho,p}(P, \mathcal{F}) \le \frac{48 C(\mathcal{F})}{\sqrt{n}} + \frac{24 C_0 (2\, \mathrm{diam}(\mathsf{Z}))^p}{\sqrt{n}} \big( 1 + (\mathrm{diam}(\mathsf{Z})/\rho)^p \big) + 3M \sqrt{\frac{\log(2/\delta)}{2n}}.$$

Remark 4. The second term decreases as $\rho$ grows, which is consistent with the phenomenon illustrated in Section 2.2. Also note that the excess risk bound of [9] exhibits the same behavior as Theorem 3, where in that case $\rho$ is the slack in the moment constraints defining the ambiguity set. While larger ambiguity can be helpful for learnability in this sense, the risk inequalities of Section 2.1 imply that $R_{\rho,p}(P, f) - R(P, f)$ can be larger for larger $\rho$. Using these two elements, one can derive domain-specific excess risk bounds that explicitly describe the interplay of both elements with the ambiguity (see Section 4).

3.4 Example bounds

In this subsection, we illustrate the use of Theorem 2 when (upper bounds on) the covering numbers of the hypothesis class $\mathcal{F}$ are available. Throughout this subsection, we continue to work in the setting of regression with quadratic loss as in Proposition 3; we let $\mathsf{X} = \{x \in \mathbb{R}^d : \|x\|_2 \le r_0\}$ be the ball of radius $r_0$ in $\mathbb{R}^d$ centered at the origin, let $\mathsf{Y} = [-B, B]$ for some $B > 0$, and equip $\mathsf{Z}$ with the Euclidean metric (6). Also, we take $p = 1$.

We first consider a simple neural network class $\mathcal{F}$ consisting of functions of the form $f(z) = f(x, y) = (y - s(f_0^{\mathsf{T}} x))^2$, where $s : \mathbb{R} \to \mathbb{R}$ is a bounded smooth nonlinearity with $s(0) = 0$ and bounded first derivative, and where $f_0$ takes values in the unit ball of $\mathbb{R}^d$.

Corollary 1. For any $P \in \mathcal{P}(\mathsf{Z})$, with probability at least $1 - \delta$,
$$R_{\rho,1}(P, \hat{f}) - R^*_{\rho,1}(P, \mathcal{F}) \le \frac{C_1}{\sqrt{n}} + 3 (\|s\|_\infty + B)^2 \sqrt{\frac{\log(2/\delta)}{2n}},$$
where $C_1$ is a constant that depends only on $d$, $r_0$, $s$, and $B$:
$$C_1 = (B + \|s\|_\infty) \big( d\, \|s'\|_\infty + 192\, (1 + \|s'\|_\infty) \big).$$
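Constants such as $C_1$ arise from entropy-integral calculations. As a rough numerical illustration (not the derivation used for Corollary 1), the following sketch evaluates $C(\mathcal{F})$ for a class smoothly parameterized over the unit ball of $\mathbb{R}^d$, using the standard covering bound $N(\mathcal{F}, \|\cdot\|_\infty, u) \le (1 + 2 R_{\mathcal{F}}/u)^d$; the parameterization constant $R_{\mathcal{F}}$ below is a crude assumed bound.

```python
# A rough numerical sketch of the entropy integral C(F) for a loss class that is
# R_F-Lipschitz in a parameter ranging over the unit ball of R^d, using the standard
# parametric covering bound N(F, sup-norm, u) <= (1 + 2 R_F / u)^d.
import numpy as np

d, r0, B = 5, 1.0, 1.0
s_sup, ds_sup = 1.0, 1.0                      # assumed bounds on |s| and |s'| (e.g., s = tanh)
R_F = 2.0 * (s_sup + B) * ds_sup * r0         # crude sup-norm Lipschitz constant in the parameter f_0
diam_F = (s_sup + B) ** 2                     # the losses take values in [0, (||s||_inf + B)^2]

u = np.linspace(1e-6, diam_F, 100000)
log_cov = d * np.log1p(2.0 * R_F / u)         # log N(F, ||.||_inf, u); no contribution beyond diam_F
C_F = np.sqrt(log_cov).sum() * (u[1] - u[0])  # Riemann-sum approximation of the entropy integral

n = 1000
print("C(F) approx:", C_F, "  leading term 48*C(F)/sqrt(n):", 48 * C_F / np.sqrt(n))
```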
We also consider the case of a massive nonparametric class. Let $(\mathcal{H}_K, \|\cdot\|_K)$ be the Gaussian reproducing kernel Hilbert space (RKHS) with kernel
$$K(x_1, x_2) = \exp\big( -\|x_1 - x_2\|_2^2 / \sigma^2 \big)$$
for some $\sigma > 0$, and let $B_r := \{h \in \mathcal{H}_K : \|h\|_K \le r\}$ be the radius-$r$ ball in $\mathcal{H}_K$. Let $\mathcal{F}$ be the class of all functions of the form $f(z) = f(x, y) = (y - f_0(x))^2$, where the predictors $f_0 : \mathsf{X} \to \mathbb{R}$ belong to $I_K(B_r)$, the embedding of $B_r$ into the space $C(\mathsf{X})$ of continuous real-valued functions on $\mathsf{X}$ equipped with the sup norm $\|f\|_{\mathsf{X}} := \sup_{x \in \mathsf{X}} |f(x)|$. Using the covering number estimates due to Cucker and Zhou [7], we can prove the following generalization bound for the Gaussian RKHS.

Corollary 2. With probability at least $1 - \delta$, for any $P \in \mathcal{P}(\mathsf{Z})$,
$$R_{\rho,1}(P, \hat{f}) - R^*_{\rho,1}(P, \mathcal{F}) \le \frac{C_1 (r^2 + Br)}{\sqrt{n}} + \frac{192\sqrt{2}\,(r + B)(1 + r/\sigma)\sqrt{r_0^2 + B^2}}{\sqrt{n}} + 6 (r^2 + B^2) \sqrt{\frac{\log(2/\delta)}{2n}},$$
where $C_1$ is a constant that depends only on $d$, $r_0$, and $\sigma$; its explicit form involves the incomplete gamma function $\Gamma(s, v) := \int_v^\infty u^{s-1} e^{-u}\, \mathrm{d}u$.

4 Application: domain adaptation with optimal transport

Ambiguity sets based on Wasserstein distances have two attractive features. First, the metric geometry of the instance space provides a natural mechanism for handling uncertainty due to transformations of the problem instances. For example, concurrent work by Sinha et al. [18] interprets the underlying metric as the perturbation cost of an adversary in the context of adversarial examples [12]. Second, Wasserstein distances can be approximated efficiently from samples; Fournier and Guillin [10] provide nonasymptotic convergence results, both in terms of moments and in probability, for general $p$. This allows us to approximate the Wasserstein distance $W_p(P, Q)$ between two distributions by the Wasserstein distance $W_p(P_n, Q_n)$ between their empirical distributions, which makes it possible to specify a suitable level of ambiguity $\rho$.

One interesting area of application where we benefit from both of these aspects is the problem of domain adaptation, which arises when we want to transfer data or knowledge from a source domain $P \in \mathcal{P}(\mathsf{Z})$ to a different but related target domain $Q \in \mathcal{P}(\mathsf{Z})$ [4]. While the domain adaptation problem is often stated in a broader context, we confine our discussion to adaptation in supervised learning, assuming $\mathsf{Z} = \mathsf{X} \times \mathsf{Y}$, where $\mathsf{X}$ is the feature space and $\mathsf{Y}$ is the label space. From now on, we disintegrate the source distribution as $P = \mu \otimes P_{Y|X}$ and the target distribution as $Q = \nu \otimes Q_{Y|X}$.

Existing theoretical results on domain adaptation are phrased in terms of the discrepancy metric [15]: given a loss function $l : \mathsf{Y} \times \mathsf{Y} \to \mathbb{R}$ and a family $\mathcal{H}$ of predictors $h : \mathsf{X} \to \mathsf{Y}$, the discrepancy metric is defined as
$$\mathrm{disc}_{\mathcal{H}}(\mu, \nu) := \max_{h, h' \in \mathcal{H}} \big| \mathbf{E}_\mu[l(h(X), h'(X))] - \mathbf{E}_\nu[l(h(X), h'(X))] \big|.$$
Typical theoretical guarantees involving the discrepancy metric take the form of generalization bounds: for any $h \in \mathcal{H}$,
$$R(Q, h) - R^*(Q, \mathcal{H}) \le R(P, h) + \mathrm{disc}_{\mathcal{H}}(\mu, \nu) + \mathbf{E}_\nu\big[ l(h^*_P(X), h^*_Q(X)) \big], \quad (12)$$
where $h^*_P$ and $h^*_Q$ are minimizers of $R(P, h) = \mathbf{E}_P[l(h(X), Y)]$ and $R(Q, h) = \mathbf{E}_Q[l(h(X), Y)]$, respectively. While these generalization bounds provide a uniform guarantee for all predictors in a class, they can be considered pessimistic in the sense that the excess risk is compared to $R(P, h)$, the performance of a selected predictor on the source domain. Our work, on the other hand, aims to provide an excess risk bound for a specific target hypothesis $\hat{f}$ given by the solution of a minimax ERM.
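For a finite predictor class, the empirical version of $\mathrm{disc}_{\mathcal{H}}(\mu, \nu)$ can be computed by direct enumeration; the following is a minimal sketch with an assumed toy class, loss, and pair of feature samples.

```python
# A minimal sketch of the discrepancy metric disc_H(mu, nu) for a small finite
# predictor class, estimated from feature samples drawn from the two domains.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
x_src = rng.normal(0.0, 1.0, size=500)                    # features from the source marginal mu
x_tgt = rng.normal(0.5, 1.2, size=500)                    # features from the target marginal nu

predictors = [lambda x, a=a: a * x for a in (-1.0, 0.0, 0.5, 1.0)]   # a toy class H
loss = lambda y1, y2: (y1 - y2) ** 2                       # l(y, y') on the label space

disc = max(
    abs(loss(h(x_src), g(x_src)).mean() - loss(h(x_tgt), g(x_tgt)).mean())
    for h, g in product(predictors, repeat=2)
)
print("empirical discrepancy disc_H(mu_n, nu_m):", disc)
```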
Suppose that it is possible to estimate the Wasserstein distance $W_p(P, Q)$ between the two domain distributions. Then, as we show below, we can obtain a generalization bound for the target domain by combining estimation guarantees for $W_p(P, Q)$ with the risk inequalities of Section 2. All proofs are given in supplementary Appendix E.

We work in the setting considered by Courty et al. [6]. Let $\mathsf{X}, \mathsf{Y}$ be metric spaces with metrics $d_{\mathsf{X}}$ and $d_{\mathsf{Y}}$. We endow $\mathsf{Z}$ with the $\ell_p$ product metric
$$d_{\mathsf{Z}}(z, z') = d_{\mathsf{Z}}((x, y), (x', y')) := \big( d_{\mathsf{X}}^p(x, x') + d_{\mathsf{Y}}^p(y, y') \big)^{1/p}.$$
We assume that the domain drift is due to an unknown (possibly nonlinear) transformation $T : \mathsf{X} \to \mathsf{X}$ of the feature space that preserves the conditional distribution of the labels given the features (arising, e.g., from changes in acquisition conditions, sensor drift, thermal noise, etc.). That is, $\nu = T_\# \mu$, the pushforward of $\mu$ by $T$, and for any $x \in \mathsf{X}$ and any measurable set $B \subseteq \mathsf{Y}$,
$$P_{Y|X}(B|x) = Q_{Y|X}(B|T(x)). \quad (13)$$
This assumption leads to the following lemma, which enables us to estimate $W_p(P, Q)$ from unlabeled source domain data and unlabeled target domain data only:

Lemma 3. Suppose there exists a deterministic and invertible optimal transport map $T : \mathsf{X} \to \mathsf{X}$ such that $\nu = T_\# \mu$, i.e., $W_p^p(\mu, \nu) = \mathbf{E}_\mu[d_{\mathsf{X}}^p(X, T(X))]$. Then
$$W_p(P, Q) = W_p(\mu, \nu). \quad (14)$$

Remark 5. If $\mathsf{X}$ is a convex subset of $\mathbb{R}^d$ endowed with the $\ell_p$ metric $d_{\mathsf{X}}(x, x') = \|x - x'\|_p$ for $p \ge 2$, then, under the assumption that $\mu$ and $\nu$ have positive densities with respect to the Lebesgue measure, the (unique) optimal transport map from $\mu$ to $\nu$ is deterministic and a.e. invertible; in fact, its inverse is equal to the optimal transport map from $\nu$ to $\mu$ [21].

Now suppose that we have $n$ labeled examples $(X_1, Y_1), \ldots, (X_n, Y_n)$ from $P$ and $m$ unlabeled examples $X'_1, \ldots, X'_m$ from $\nu$. Define the empirical distributions
$$\mu_n := \frac{1}{n} \sum_{i=1}^n \delta_{X_i} \quad \text{and} \quad \nu_m := \frac{1}{m} \sum_{j=1}^m \delta_{X'_j}.$$
Notice that, by the triangle inequality, we have
$$W_p(\mu, \nu) \le W_p(\mu, \mu_n) + W_p(\mu_n, \nu_m) + W_p(\nu, \nu_m). \quad (15)$$
Here, $W_p(\mu_n, \nu_m)$ can be computed from unlabeled data by solving a finite-dimensional linear program [21], and the following convergence result of Fournier and Guillin [10] implies that, with high probability, both $W_p(\mu, \mu_n)$ and $W_p(\nu, \nu_m)$ converge rapidly to zero as $n, m \to \infty$:

Proposition 5. Let $\mu$ be a probability distribution on a bounded set $\mathsf{X} \subset \mathbb{R}^d$, where $d > 2p$. Let $\mu_n$ denote the empirical distribution of $X_1, \ldots, X_n \sim \mu$ i.i.d. Then, for any $r \in (0, 1)$,
$$\mathbb{P}\big( W_p(\mu_n, \mu) \ge r \big) \le C_a \exp\big( -C_b n r^{d/p} \big), \quad (16)$$
where $C_a, C_b$ are constants depending only on $p$, $d$, and $\mathrm{diam}(\mathsf{X})$.

Remark 6. The constraint $d > 2p$ is not necessary, and the bound still holds in the case $d \le 2p$ with a different speed of convergence. In particular, Proposition 5 is a restricted version of [10, Thm. 2] under finiteness of the moment functional $\mathcal{E}_{\alpha,\gamma}(\mu)$ with $\alpha = d > p$.

Based on these considerations, we propose the following domain adaptation scheme:

1. Compute the $p$-Wasserstein distance $W_p(\mu_n, \nu_m)$ between the empirical distributions of the features in the labeled training set from the source domain $P$ and the unlabeled training set from the target domain $Q$.
2. Set the desired confidence parameter $\delta \in (0, 1)$ and the radius
$$\hat{\rho}(\delta) := W_p(\mu_n, \nu_m) + \Big( \frac{\log(2 C_a / \delta)}{C_b n} \Big)^{p/d} + \Big( \frac{\log(2 C_a / \delta)}{C_b m} \Big)^{p/d}. \quad (17)$$
3. Compute the empirical risk minimizer
$$\hat{f} = \arg\min_{f \in \mathcal{F}} R_{\hat{\rho}(\delta),p}(P_n, f), \quad (18)$$
where $P_n$ is the empirical distribution of the $n$ labeled samples from $P$.

We can give the following target-domain generalization bound for the hypothesis generated by (18):

Theorem 4. Suppose that the feature space $\mathsf{X}$ is a bounded subset of $\mathbb{R}^d$ with $d > 2p$, take $d_{\mathsf{X}}(x, x') = \|x - x'\|_p$, and let $\mathcal{F}$ be a family of hypotheses with Lipschitz constant at most $L$. Then the empirical risk minimizer $\hat{f}$ from (18) satisfies
$$R(Q, \hat{f}) - R^*(Q, \mathcal{F}) \le 2L\hat{\rho}(\delta) + \frac{48 C(\mathcal{F})}{\sqrt{n}} + \frac{48 L\, \mathrm{diam}^p(\mathsf{Z})}{\hat{\rho}^{p-1}(\delta)\, \sqrt{n}} + 3M \sqrt{\frac{\log(2/\delta)}{2n}}$$
with probability at least $1 - \delta$.

Remark 7. Comparing the bound of Theorem 4 with the discrepancy-based bound (12), we note that the former does not contain any terms related to $R(P, \hat{f})$ or to the closeness of the optimal predictors for $P$ and $Q$. The only contributions to the excess risk are the empirical Wasserstein distance $W_p(\mu_n, \nu_m)$ (which captures the discrepancy between the source and the target domains in a data-driven manner) and an empirical process fluctuation term. In this sense, the bound of Theorem 4 is closer in spirit to the usual excess risk bounds one obtains in the absence of domain drift.
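The following is a minimal sketch of steps 1 and 2 of this scheme for one-dimensional features with $p = 1$; the constants $C_a, C_b$ from Proposition 5 are unknown in practice, so the values below (and the use of the $d > 2p$ exponent despite $d = 1$; cf. Remark 6) are placeholders only.

```python
# A minimal sketch of steps 1-2 of the domain adaptation scheme, with 1-d features
# so that W_1 between empirical feature distributions has a simple closed form.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
n, m, p, delta = 400, 600, 1, 0.05
x_src = rng.normal(0.0, 1.0, size=n)       # labeled source features X_1, ..., X_n
x_tgt = rng.normal(0.7, 1.0, size=m)       # unlabeled target features X'_1, ..., X'_m

# Step 1: empirical 1-Wasserstein distance between the feature samples
w_hat = wasserstein_distance(x_src, x_tgt)

# Step 2: inflate by concentration terms in the spirit of (17) / Proposition 5.
# C_a, C_b are placeholders, and for d <= 2p the exponent p/d changes (Remark 6),
# so this line is purely illustrative.
C_a, C_b, d = 1.0, 1.0, 1
slack = (np.log(2 * C_a / delta) / (C_b * n)) ** (p / d) \
      + (np.log(2 * C_a / delta) / (C_b * m)) ** (p / d)
rho_hat = w_hat + slack
print("W_1(mu_n, nu_m) = %.3f   rho_hat(delta) = %.3f" % (w_hat, rho_hat))
# Step 3 would plug rho_hat into the local minimax ERM routine sketched in Section 3.
```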
Acknowledgements

We would like to thank Pierre Moulin, Yung Yi, and the anonymous reviewers for helpful discussions.

References

[1] L. Ambrosio, N. Gigli, and G. Savaré. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Birkhäuser, 2008.
[2] P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods: Support Vector Learning, pages 43–54, 1999.
[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[4] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79:151–175, 2010.
[5] J. Blanchet, Y. Kang, and K. Murthy. Robust Wasserstein profile inference and applications to machine learning. arXiv preprint arXiv:1610.05627v2, 2016.
[6] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, September 2017.
[7] F. Cucker and D. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007.
[8] J. C. Duchi, P. W. Glynn, and H. Namkoong. Statistics of robust optimization: a generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425, 2016.
[9] F. Farnia and D. Tse. A minimax approach to supervised learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4240–4248, 2016.
[10] N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3–4):707–738, 2015.
[11] R. Gao and A. J. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
[12] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572v3, 2014.
[13] V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, volume 2033 of Lecture Notes in Mathematics. Springer, 2011.
[14] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1):1–50, 2002.
[15] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: learning bounds and algorithms. In S. Dasgupta and A. Klivans, editors, Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
[16] P. Mohajerin Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166, September 2018.
[17] S. Shafieezadeh-Abadeh, P. Mohajerin Esfahani, and D. Kuhn. Distributionally robust logistic regression. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1576–1584, 2015.
[18] A. Sinha, H. Namkoong, and J. Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
[19] M. Talagrand. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems. Springer, 2014.
[20] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[21] C. Villani. Topics in Optimal Transportation. American Mathematical Society, Providence, RI, 2003.