# Outlier-Robust Optimal Transport

Debarghya Mukherjee¹ ², Aritra Guha³, Justin Solomon⁴ ², Yuekai Sun¹, Mikhail Yurochkin⁵ ²

¹Department of Statistics, University of Michigan  ²MIT-IBM Watson AI Lab  ³Department of Statistical Science, Duke University  ⁴MIT CSAIL  ⁵IBM Research. Correspondence to: Debarghya Mukherjee.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Optimal transport (OT) measures distances between distributions in a way that depends on the geometry of the sample space. In light of recent advances in computational OT, OT distances are widely used as loss functions in machine learning. Despite their prevalence and advantages, OT loss functions can be extremely sensitive to outliers. In fact, a single adversarially-picked outlier can increase the standard W2-distance arbitrarily. To address this issue, we propose an outlier-robust formulation of OT. Our formulation is convex but challenging to scale at first glance. Our main contribution is deriving an equivalent formulation based on cost truncation that is easy to incorporate into modern algorithms for computational OT. We demonstrate the benefits of our formulation in mean estimation problems under the Huber contamination model in simulations and in outlier detection tasks on real data.

1. Introduction

Optimal transport (OT) is a fundamental problem in applied mathematics. In its original form (Monge, 1781), the problem seeks the minimum-cost way to transport mass from a probability distribution µ on X to another distribution ν on X. In its original form, Monge's problem proved hard to study, and Kantorovich (1942) relaxed Monge's formulation of the optimal transport problem to

OT(µ, ν) = min_{Π ∈ F(µ,ν)} E_{(X1,X2)∼Π}[c(X1, X2)],   (1.1)

where F(µ, ν) is the set of couplings between µ and ν (probability distributions on X × X whose marginals are µ and ν) and c is a cost function. In this paper, we assume c(x, y) ≥ 0 and c(x, x) = 0. Compared to other common measures of distance between probability distributions (e.g., f-divergences), optimal transport uniquely depends on the geometry of the sample space (through the cost function). Recent advancements in optimization for optimal transport (Cuturi, 2013; Solomon et al., 2015; Genevay et al., 2016; Seguy et al., 2018) enable its broad adoption in machine learning applications where the geometry of the data is important; see (Peyré & Cuturi, 2018) for a survey. Optimal transport has found applications in natural language processing (Kusner et al., 2015; Huang et al., 2016; Alvarez-Melis & Jaakkola, 2018; Yurochkin et al., 2019), generative modeling (Arjovsky et al., 2017), clustering (Ho et al., 2017), domain adaptation (Courty et al., 2014; 2017), large-scale Bayesian modeling (Srivastava et al., 2018), anomaly detection (Tong et al., 2020), and many other domains.

Many applications use OT as a loss in an optimization problem of the form

θ̂ ∈ arg min_{θ ∈ Θ} OT(µn, νθ),   (1.2)

where {νθ}_{θ ∈ Θ} is a collection of parametric models and µn is the empirical distribution of the samples. Such estimators are called minimum Kantorovich estimators (MKE) (Bassetti et al., 2006). They are popular alternatives to likelihood-based estimators, especially in generative modeling. For example, when OT(·, ·) is the Wasserstein-1 distance and νθ is a generator parameterized by a neural network with weights θ, equation 1.2 corresponds to the Wasserstein GAN (Arjovsky et al., 2017).
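To make the sensitivity of equation 1.1 to outliers concrete, the following minimal sketch (an illustration with assumed sample sizes and outlier location, not code from our experiments) uses the POT library (Flamary & Courty, 2017) to compare the empirical OT cost between two clean samples with the cost after a single point is moved far away:

```python
# Minimal sketch: a single adversarially-placed outlier blows up the empirical
# OT cost computed with the POT library (ot.emd2).
import numpy as np
import ot  # Python Optimal Transport (Flamary & Courty, 2017)

rng = np.random.default_rng(0)
n = 100
clean = rng.normal(size=(n, 2))          # samples from mu
model = rng.normal(size=(n, 2))          # samples from nu_theta (same distribution)
corrupted = clean.copy()
corrupted[0] = [1e3, 1e3]                # one adversarially-placed outlier

a = np.full(n, 1.0 / n)                  # uniform empirical weights
b = np.full(m := n, 1.0 / n)

C_clean = ot.dist(clean, model)          # squared Euclidean cost matrix
C_bad = ot.dist(corrupted, model)

print("OT cost, clean data:  ", ot.emd2(a, b, C_clean))   # small
print("OT cost, one outlier: ", ot.emd2(a, b, C_bad))     # dominated by the outlier
```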
One drawback of optimal transport is its sensitivity to outliers. Because all the mass in µ must be transported to ν, a small fraction of outliers can have an outsized impact. For statistics and machine learning applications in which the data is corrupted or noisy, this is a major issue. For example, the poor performance of Wasserstein GANs in the presence of outliers was noted in recent works on outlier-robust generative learning with f-divergence GANs (Chao et al., 2018; Wu et al., 2020). The problem of outlier-robustness in MKE has not been studied except in two recent works proposing changes to the OT formulation that are challenging to handle computationally (Staerman et al., 2020; Balaji et al., 2020). Our goal is to derive an outlier-robust OT formulation compatible with existing efficient computational OT methods (Peyré & Cuturi, 2018).

In this paper, we propose a modification of OT to address its sensitivity to outliers. Our formulation can be used as a loss in equation 1.2, so that it is robust to a small fraction of outliers in the data. For simplicity, we consider the ϵ-contamination model (Huber & Ronchetti, 2009). Let νθ0 be a member of a parametric family {νθ : θ ∈ Θ} and let µ = (1 − ϵ)νθ0 + ϵν̃, where µ is the data-generating distribution, ϵ > 0 is the fraction of outliers, and ν̃ is the distribution of the outliers. Although the fraction of outliers is capped at ϵ, the values of the outliers can be arbitrary, so their effect on the optimal transport problem can be arbitrarily large. We modify the problem so that it is more robust to such outliers, targeting the downstream application of learning θ0 from (samples from) µ in the ϵ-contamination model.

Our main contributions are as follows:

- We propose a robust OT formulation suitable for statistical estimation in the ϵ-contamination model using MKE.
- We show that our formulation is equivalent to the original OT problem with a clipped transport cost. This connection enables us to leverage the voluminous literature on computational optimal transport to develop efficient algorithms for MKE robust to outliers.
- Our formulation enables a new application of optimal transport: outlier detection in data.

2. Problem Formulation

2.1. Robust OT for MKE

To promote outlier-robustness in MKE, we need to allow the corresponding OT problem to ignore outliers in the data distribution µ. The ϵ-contamination model imposes a cap on the fraction of outliers, so it is not hard to see that ‖µ − νθ0‖_TV ≤ ϵ (indeed, µ − νθ0 = ϵ(ν̃ − νθ0)), where ‖·‖_TV is the total-variation norm defined as ‖µ‖_TV = ∫ ½ |µ(dx)|. This suggests we solve a TV-constrained/regularized version of equation 1.2:

min_{θ ∈ Θ, µ̃} OT(µ̃, νθ) subject to ‖µ − µ̃‖_TV ≤ ϵ, µ̃ ∈ P,   (2.1)

where P is the set of all probability measures. The constrained version, however, suffers from identification issues: it cannot distinguish between clean distributions within TV distance ϵ of νθ0. To see this, fix θ ∈ Θ and note that the optimal value of equation 2.1 is zero whenever µ is within ϵ-TV distance of νθ. Thus equation 2.1 cannot distinguish between two parameter values θ1 and θ2 such that ‖νθ1 − νθ2‖_TV ≤ ϵ. This makes it unsuitable as a loss function for statistical estimation, because it cannot lead to a consistent estimator. As an alternative, its regularized counterpart does not suffer from this issue:

min_{θ ∈ Θ, s : µ+s ∈ P} OT(µ + s, νθ) + λ‖s‖_TV,   (2.2)

where λ > 0 is a regularization parameter.
Note that the constrained and Lagrangian formulations are equivalent, but the equivalence depends on the two distributions in the arguments of the Wasserstein distance. In other words, as we vary the distributions (keeping ϵ fixed), the equivalent λ will change. In our formulation, we fix λ and vary the distributions, so the solution paths are different, making the parameter identifiable. In the rest of this paper, we work with the TV-regularized formulation in equation 2.2.

The main idea of our formulation is to allow for modifications of µ, while penalizing their magnitude and ensuring that the modified µ is still a probability measure. Below we formulate this intuition as an optimization problem titled ROBOT (ROBust Optimal Transport):

Formulation 1:

min_{π, s} ∬ c(x, y) π(x, y) dx dy + λ‖s‖_TV
subject to ∫ π(x, y) dy = µ(x) + s(x) ≥ 0,
          ∫ π(x, y) dx = ν(y),
          ∫ s(dx) = 0,   (2.3)

where π is a density function on X × X. The first and the last constraints ensure that µ + s is a valid probability measure, while λ‖s‖_TV penalizes the amount of modification of µ. We can identify the exact locations of outliers in µ by inspecting µ + s: if µ(x) + s(x) = 0, then x was eliminated and is an outlier. We will use this property to propose an outlier detection method.

ROBOT, unlike classical OT, guarantees that adversarially-picked outliers cannot increase the (robust) transport distance arbitrarily. Let µ̃ = (1 − ϵ)µ + ϵµc, i.e., µ̃ is µ contaminated with outliers from µc, and let ν be an arbitrary measure; in MKE, µ̃ is the contaminated data and ν is the model we learn. The adversary can arbitrarily increase OT(µ̃, ν) by manipulating the outlier distribution µc. For ROBOT, we have the following bound:

Theorem 2.1. Let µ̃ = (1 − ϵ)µ + ϵµc for some ϵ ∈ [0, 1). Then,

ROBOT(µ̃, ν) ≤ min{OT(µ, ν) + λϵ‖µ − µc‖_TV, λ‖µ̃ − ν‖_TV, OT(µ̃, ν)}.   (2.4)

This bound has two key takeaways: since the TV norm between distributions is bounded by 1, the adversary cannot increase ROBOT(µ̃, ν) arbitrarily; and in the absence of outliers, ROBOT is bounded by classical OT. See Appendix B for the proof.

Related work. We note a connection between equation 2.3 and unbalanced OT (UOT) (Chizat, 2017; Chizat et al., 2018). UOT is typically formulated by replacing the TV norm with KL(µ + s | µ) and adding an analogous term for ν. Chizat et al. (2018) studied entropy-regularized UOT with various divergences penalizing marginal violations. Optimization problems similar to equation 2.3 have also been considered outside of ML (Piccoli & Rossi, 2014; Liero et al., 2018). Balaji et al. (2020) use UOT with a χ2-divergence penalty on marginal violations to achieve outlier-robustness in generative modeling. Another relevant variation of OT is partial OT (Figalli, 2010; Caffarelli & McCann, 2010). It may also be considered for outlier-robustness, but it has the drawback of forcing mass destruction rather than adjusting marginals to ignore outliers when they are present. Staerman et al. (2020) take a different path: they replace the expectation in the Wasserstein-1 dual with a median-of-means to promote robustness. It is unclear what the corresponding primal problem is, making their formulation hard to interpret as an optimal transport problem.

A major challenge with the aforementioned methods, including our Formulation 1, is the large-scale implementation of the optimization problem. Chizat et al. (2018) propose a Sinkhorn-like algorithm for entropy-regularized UOT, but it is not amenable to stochastic optimization. Balaji et al.
(2020) propose a stochastic optimization algorithm based on the UOT dual, but it requires two additional neural networks (four in total, including the dual potentials) to parameterize the modified marginal distributions (i.e., µ + s and an analogous one for ν). Optimizing with a median-of-means in the objective function (Staerman et al., 2020) is also challenging. The key contribution of our work is a formulation equivalent to equation 2.3 that is easily compatible with the large body of classical OT optimization techniques (Cuturi, 2013; Solomon et al., 2015; Genevay et al., 2016; Seguy et al., 2018).

More efficient equivalent formulation. At first glance, there are two issues with equation 2.3: it appears asymmetric, and it is unclear if it can be optimized efficiently. Below we present an equivalent formulation that is free of these issues:

Formulation 2:

min_{Π ∈ F+(Rd × Rd)} E_{(X,Y)∼Π}[Cλ(X, Y)] subject to X ∼ µ, Y ∼ ν,   (2.5)

where Cλ is the truncated cost function defined as Cλ(x, y) = min{c(x, y), 2λ}. Looking at equation 2.5, it is not apparent that it adds robustness to MKE, but it is symmetric, easy to combine with entropic regularization by simply truncating the cost, and benefits from stochastic optimization algorithms (Genevay et al., 2016; Seguy et al., 2018). This formulation also has a distant relation to the idea of loss truncation for achieving robustness (Shen & Sanghavi, 2019). Pele & Werman (2009) consider the Earth Mover's Distance (discrete OT) with truncated cost to achieve computational improvements; they also mention its potential to promote robustness against outlier noise but did not explore this direction.

In Section 3, we establish an equivalence between the two ROBOT formulations, equation 2.3 and equation 2.5. This equivalence allows us to obtain an efficient algorithm based on equation 2.5 for robust MKE. We also provide a simple procedure for computing the optimal s in equation 2.3 from the solution of equation 2.5, enabling a new OT application: outlier detection. We verify the effectiveness of robust MKE and outlier detection in our experiments in Section 4. Before presenting the equivalence proof, however, we formulate the discrete analogs of the two ROBOT formulations for their practical value.

2.2. Discrete ROBOT formulations

In practice, we typically encounter samples from distributions, rather than the distributions themselves. Sampling is also built into stochastic optimization. In this subsection, we present the discrete versions of the ROBOT formulations. The key detail is that, in equation 2.3, µ, ν and s are all supported on Rd, while in the discrete case the empirical measures µn ∈ Δ_{n−1} and νm ∈ Δ_{m−1} are supported on finite sets of points (Δ_{r−1} denotes the unit probability simplex in Rr). As a result, to formulate a discrete version of equation 2.3, we need to augment µn and νm with each other's supports. To be precise, let supp(µn) = {X1, . . . , Xn} and supp(νm) = {Y1, . . . , Ym}. Define C = {Z1, Z2, . . . , Zm+n} = {X1, . . . , Xn, Y1, . . . , Ym}. Then the discrete analog of equation 2.3 is

Formulation 1 (discrete):

min_{Π, s} ⟨Caug, Π⟩ + λ[‖s1‖_1 + ‖t1‖_1]
subject to Π 1_{m+n} = (µn + s1; t1),  Πᵀ 1_{m+n} = (0_n; νm),
          Π ≥ 0,  1ᵀ_{m+n} s = 0,   (2.6)

where Caug ∈ R^{(m+n)×(m+n)} is the augmented cost matrix with entries Caug,i,j = c(Zi, Zj) (c is the ground cost, e.g., squared Euclidean distance), s = (s1ᵀ, t1ᵀ)ᵀ, (a; b) denotes the vertical concatenation of vectors a and b, and 1r is the vector of all ones in Rr. The TV norm is replaced by its discrete analog, the ℓ1 norm.
Similarly to its continuous counterpart, this optimization problem is harder than typical OT due to the additional optimization variable s, the extra constraints, and the increased cost matrix size. The discrete analog of equation 2.5 is straightforward:

Formulation 2 (discrete):

min_{Π ∈ R^{n×m}} ⟨Cλ, Π⟩ subject to Π 1_m = µn, Πᵀ 1_n = νm, Π ≥ 0,   (2.7)

where Cλ,i,j = min{c(Xi, Yj), 2λ}. As in the continuous case, it is easy to adapt modern (regularized) OT solvers without any computational overhead, and the formulations in equation 2.6 and equation 2.7 are equivalent. It is also possible to recover the s of equation 2.6 from the solution of equation 2.7 to perform outlier detection.

Two-sided formulation. So far we have assumed that one of the input distributions does not have outliers, which is the setting of MKE, where the clean distribution corresponds to the model we learn. In some applications, both distributions may be corrupted. To address this case, we provide an equivalent two-sided formulation, analogous to UOT with the TV norm:

Formulation 3 (two-sided):

min_{Π, s1, t1, s2, t2} ⟨Caug, Π⟩ + λ[‖s1‖_1 + ‖t1‖_1 + ‖s2‖_1 + ‖t2‖_1]
subject to Π 1_{m+n} = (µn + s1; t1),  Πᵀ 1_{m+n} = (s2; νm + t2),
          Π ≥ 0,  1ᵀ_{m+n}(s1; t1) = 0,  1ᵀ_{m+n}(s2; t2) = 0.   (2.8)

3. Equivalence of the ROBOT formulations

In this section, we present our main theorem, which establishes the equivalence between the two formulations of robust optimal transport:

Theorem 3.1. For any two measures µ and ν, ROBOT(µ, ν) has the same value under both formulations, i.e., Formulation 1 is equivalent to Formulation 2 in the discrete case. Additionally, if the (non-truncated) cost function c(·, ·) is a metric, then the equivalence of the two formulations also holds in the continuous case. Moreover, we can recover the optimal coupling of one formulation from the other.

Below we sketch the proof of this theorem and highlight some important techniques used in the proof. We focus on the discrete case as it is more intuitive and has concrete practical implications in our experiments. A complete proof can be found in Appendix A. Please also see Appendix A.2 for the proof of equivalence between Formulations 1, 2 and 3 in the discrete case.

3.1. Proof sketch

In the remainder of this section, we consider the discrete case, i.e., equation 2.6 for Formulation 1 (F1) and equation 2.7 for Formulation 2 (F2). Suppose Π*_2 is an optimal solution of F2. We construct a feasible solution (Π*_1, s*_1, t*_1) of F1 based on Π*_2 with the same objective value as F2 and claim that it is an optimal solution. We prove the claim by contradiction: if (Π*_1, s*_1, t*_1) is not optimal, then there exists another solution (Π̃_1, s̃_1, t̃_1) that is optimal for F1 with a strictly smaller objective value. We then construct another feasible solution Π*_{2,new} of Formulation 2 that has the same objective value as (Π̃_1, s̃_1, t̃_1) in F1. This implies Π*_{2,new} has a strictly smaller objective value in F2 than Π*_2, which is a contradiction.

The two main steps of this proof are (1) constructing a feasible solution of F1 starting from a feasible solution of F2 and (2) showing that the constructed solution is indeed optimal for F1. Hence step (1) gives a recipe for constructing an optimal solution of F1 starting from an optimal solution of F2. We elaborate on the first point in the next subsection, which has practical implications for outlier detection. The other point is more technical; interested readers may go through the proof in Appendix A.1.
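Before turning to the construction in Algorithm 1 below, we note that the discrete Formulation 2 in equation 2.7 is simply classical OT with a clipped cost matrix, so any off-the-shelf discrete OT solver can be reused. The following minimal sketch (with illustrative sample data and an assumed λ; not the exact code used in our experiments) computes ROBOT with the POT library by truncating the cost at 2λ before calling an exact or entropy-regularized solver:

```python
# Minimal sketch: discrete ROBOT via cost truncation (equation 2.7).
import numpy as np
import ot

rng = np.random.default_rng(0)
n, m, lam = 200, 200, 0.5
X = rng.normal(size=(n, 2))              # contaminated sample (mu_n)
X[:20] += 10.0                           # 10% of the points pushed far away (outliers)
Y = rng.normal(size=(m, 2))              # clean sample (nu_m)

a = np.full(n, 1.0 / n)
b = np.full(m, 1.0 / m)

C = ot.dist(X, Y)                        # squared Euclidean ground cost
C_lam = np.minimum(C, 2 * lam)           # truncated cost C_lambda

robot_exact = ot.emd2(a, b, C_lam)                 # exact LP solver
robot_sink = ot.sinkhorn2(a, b, C_lam, reg=0.05)   # entropy-regularized (Sinkhorn)
vanilla = ot.emd2(a, b, C)                         # classical OT for comparison

print(f"vanilla OT: {vanilla:.3f}  ROBOT (exact): {robot_exact:.3f}  "
      f"ROBOT (Sinkhorn): {float(robot_sink):.3f}")
```

Because the truncated cost is bounded by 2λ, the ROBOT values stay bounded regardless of how far the outliers are moved, consistent with Theorem 2.1, while the vanilla OT cost grows with the outlier displacement.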
Algorithm 1: Generating an optimal solution of F1 from F2
1: Start with Π*_2 ∈ R^{n×m}, an optimal solution of Formulation 2.
2: Create an augmented matrix Π ∈ R^{(m+n)×(m+n)} of all zeros. Divide Π into four blocks: Π11 (n×n), Π12 (n×m), Π21 (m×n), Π22 (m×m).
3: Set Π12 ← Π*_2 and collect the indices I = {(i, j) : Ci,j > 2λ}.
4: Set Π12(i, j) ← 0 for (i, j) ∈ I.
5: Set Π22(j, j) ← Σ_{i=1}^{n} Π*_2(i, j) 1{(i, j) ∈ I} for all 1 ≤ j ≤ m, and set Π*_1 ← Π.
6: Set s*_1(i) ← −Σ_{j=1}^{m} Π*_2(i, j) 1{(i, j) ∈ I} for all 1 ≤ i ≤ n.
7: Set t*_1(j) ← Π22(j, j) for all 1 ≤ j ≤ m.
8: Return Π*_1, s*_1, t*_1.

3.2. Going from Formulation 2 to Formulation 1

Let Π*_2 (respectively Π*_1) be an optimal solution of F2 (respectively F1). Recall that Π*_1 has dimension (m + n) × (m + n). From the column sum constraint in F1, we need to take the first n columns of Π*_1 to be exactly 0, whereas the last m columns must sum up to νm. For any matrix A, we denote by A[(a : b) × (c : d)] the submatrix consisting of rows a to b and columns c to d. Our main idea is to put a modified version of Π*_2 in Π*_1[(1 : n) × (n + 1 : m + n)] and make Π*_1[(n + 1 : m + n) × (n + 1 : m + n)] diagonal.

First we describe how to modify Π*_2. Observe that if Ci,j > 2λ for some (i, j), we expect Xi ∈ supp(µn) to be an outlier that induces a high transportation cost, which is why we truncate the cost in F2. Therefore, to get an optimal solution of F1, we set the corresponding value of the optimal plan to 0 and dump the mass into the corresponding slack variable t*_1 on the diagonal of the bottom-right submatrix. This changes the row sum, which is taken care of by s*_1. But, since we are not moving this mass outside the corresponding column, the column sum of Π*_1[(1 : m + n) × (n + 1 : m + n)] remains the same as the column sum of Π*_2, which is νm. We summarize this procedure in Algorithm 1.

Figure 1: Constructing an optimal solution of Formulation 1 from an optimal solution of Formulation 2.

Example. In Figure 1, we provide an example to visualize the construction. On the left, we have Π*_2, an optimal solution of Formulation 2. The blue triangles denote the positions where the corresponding cost value is ≤ 2λ, and the light-green squares denote the positions where the corresponding value of the cost matrix is > 2λ. To construct an optimal solution Π*_1 of Formulation 1 from this Π*_2, we first create an augmented matrix of size 6 × 6. We keep all the entries of the left 6 × 3 sub-matrix as 0 (in this picture, blank elements indicate 0). In the right submatrix, we put Π*_2 into the top-right block, but remove the masses from the light-green squares, i.e., where the cost value is > 2λ, and put them in the diagonal entries of the bottom-right block as shown in Figure 1. This mass contributes to the slack variables s1 and t1, and this augmented matrix along with s1, t1 gives us an optimal solution of Formulation 1.

3.3. Outlier detection with ROBOT

Our construction algorithm has practical consequences for outlier detection. Suppose we have two datasets, a clean dataset νm (i.e., one with no outliers) and an outlier-contaminated dataset µn. We can detect the outliers in µn without directly solving the costly Formulation 1 by following Algorithm 2. In this algorithm, λ is a regularization parameter that can be chosen via cross-validation or heuristically (see Section 4.2 for an example). In Section 4.2, we use this algorithm to perform outlier detection on image data.
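As a concrete companion to Algorithm 1, the following Python sketch (a simplified illustration with a toy example and an assumed function name, not the exact implementation used in our experiments) recovers Π*_1, s*_1 and t*_1 from an n × m plan Π*_2 and the untruncated cost matrix:

```python
# Sketch of Algorithm 1: recover the Formulation 1 solution (Pi_1, s_1, t_1)
# from a Formulation 2 plan Pi_2 and the raw (untruncated) cost matrix C.
import numpy as np
import ot

def recover_f1_solution(pi2, C, lam):
    """pi2: (n, m) optimal plan for the truncated cost; C: (n, m) untruncated cost."""
    n, m = pi2.shape
    truncated = C > 2 * lam                   # index set I where the cost was clipped
    pi12 = np.where(truncated, 0.0, pi2)      # plan with clipped entries removed (steps 3-4)
    removed_cols = (pi2 * truncated).sum(0)   # mass dumped per column -> diag of Pi_22 (step 5)
    removed_rows = (pi2 * truncated).sum(1)   # mass removed per row -> -s_1 (step 6)

    pi1 = np.zeros((m + n, m + n))
    pi1[:n, n:] = pi12                        # top-right block
    pi1[n:, n:] = np.diag(removed_cols)       # bottom-right diagonal block
    return pi1, -removed_rows, removed_cols   # Pi_1, s_1, t_1

# Toy usage: points of mu_n whose entire mass was clipped satisfy mu_n + s_1 = 0
# and are flagged as outliers (Algorithm 2 below).
rng = np.random.default_rng(0)
X = np.r_[rng.normal(size=(8, 2)), rng.normal(size=(2, 2)) + 20.0]   # last 2 rows are outliers
Y = rng.normal(size=(10, 2))
a = np.full(10, 0.1)
b = np.full(10, 0.1)
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
pi2 = ot.emd(a, b, np.minimum(C, 2 * 2.0))        # solve Formulation 2 exactly, lambda = 2
_, s1, _ = recover_f1_solution(pi2, C, lam=2.0)
print("detected outliers:", np.flatnonzero(np.isclose(a + s1, 0.0)))
```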
Outlier detection with entropic regularization. Algorithm 1 allows us to recover the solution of Formulation 1, which is ultimately used for outlier detection in Algorithm 2, by solving the simpler truncated-cost Formulation 2 problem in equation 2.7. Similarly to regular OT, it can be solved exactly with a linear program solver (Pele & Werman (2009) propose a faster exact solution based on min-cost-flow solvers that benefits from the cost truncation) or approximately using entropic regularization techniques, e.g., the Sinkhorn algorithm (Cuturi, 2013). In the following lemma, we show that Algorithm 1 recovers a meaningful approximate solution of Formulation 1 from an approximate solution of Formulation 2 obtained with entropy-regularized OT solvers.

Lemma 3.2. Let Π*_{2,α} be a solution of the entropy-regularized version of equation 2.7:

arg min_{Π ∈ R^{n×m}} ⟨Cλ, Π⟩ + αH(Π) subject to Π 1_m = µn, Πᵀ 1_n = νm, Π ≥ 0,   (3.1)

and let (Π*_{1,α}, s*_{1,α}) be the corresponding approximate solution of Formulation 1 recovered from Π*_{2,α} by Algorithm 1. Then ‖Π*_{1,α} − Π*_1‖_F + ‖s*_{1,α} − s*_1‖_2 → 0 as α → 0, where (Π*_1, s*_1) is the exact solution of Formulation 1 in equation 2.6.

When the solution of equation 2.7 is non-unique, the solution of equation 3.1 converges to the solution of equation 2.7 with maximum entropy (see Proposition 4.1 of Peyré & Cuturi (2018)), and Algorithm 1 consequently recovers the corresponding maximum-entropy solution (Π*_1, s*_1) of equation 2.6 in the limit. Please find the proof in Appendix C.

4. Empirical studies

To evaluate the effectiveness of ROBOT, we consider the task of robust mean estimation under the Huber contamination model. The data is generated from (1 − ϵ)N(η0, I_d) + ϵN(η1, I_d) and the goal is to estimate η0. Prior work has advocated for using f-divergence GANs (Chao et al., 2018; Wu et al., 2020) for this problem and pointed out inefficiencies of the Wasserstein GAN in the presence of outliers. We show that our robust OT formulation allows us to estimate the uncontaminated mean η0 comparably to or better than a variety of f-divergence GANs. We also use this simulated setup to study sensitivity to the cost truncation hyperparameter λ.

Algorithm 2: Outlier detection in contaminated data
1: Start with µn (contaminated data) and νm (clean data).
2: Solve Formulation 2 and obtain Π*_2 using a suitable value of λ.
3: Use Algorithm 1 to obtain Π*_1, s*_1, t*_1 from Π*_2.
4: Find I, the set of all indices where µn + s*_1 = 0.
5: Return I as the indices of the outliers in µn.

In our second experiment, we present a new application of optimal transport enabled by ROBOT. Suppose we have collected a curated dataset νm (i.e., we know that it has no outliers); such data collection is expensive, and we want to benefit from it to automate subsequent data collection. Let µn be a second dataset collected in the wild, i.e., it may or may not have outliers. We demonstrate how ROBOT can be used to identify outliers in µn using the curated dataset νm.

4.1. Robust mean estimation

Following Wu et al. (2020), we consider a simple generator of the form gθ(x) = x + θ, x ∼ N(0, I_d), where d is the data dimension. The basic idea of robust mean estimation with GANs is to minimize various distributional divergences between samples from gθ and observed data simulated from (1 − ϵ)N(η0, I_d) + ϵN(η1, I_d). The goal is to estimate η0 with θ. To efficiently implement the ROBOT GAN, we use a standard min-max optimization approach: solve the inner maximization (the entropic dual of ROBOT) and use gradient descent for the outer minimization over θ.
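For concreteness, a few lines of numpy (a simplified sketch; the dimension, contamination proportion, and means mirror the simulation setup but are otherwise illustrative) generate data from this Huber contamination model and from the generator gθ:

```python
# Sketch: sampling the Huber-contaminated data and the location generator
# g_theta(x) = x + theta used in the robust mean estimation experiment.
import numpy as np

rng = np.random.default_rng(0)
d, n, eps = 5, 1000, 0.2
eta0 = np.zeros(d)                  # true (uncontaminated) mean
eta1 = 2.0 * np.ones(d)             # contamination mean

is_outlier = rng.random(n) < eps
data = np.where(is_outlier[:, None],
                rng.normal(eta1, 1.0, size=(n, d)),
                rng.normal(eta0, 1.0, size=(n, d)))   # (1-eps) N(eta0, I) + eps N(eta1, I)

def generator(theta, k, rng):
    """g_theta(x) = x + theta with x ~ N(0, I_d)."""
    return rng.normal(size=(k, d)) + theta
```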
To solve ROBOT, it is straightforward to adopt any of the prior stochastic regularized OT solvers: the only modification is the truncation of the cost entries as in equation 2.7. We use the stochastic algorithm for semi-discrete regularized OT (Genevay et al., 2016, Algorithm 2). We summarize ROBOT GAN in Algorithm 3. Lines 5-11 perform the inner optimization, where we solve the entropy-regularized OT dual with truncated cost, and Lines 12-14 perform the gradient update of θ.

For the f-divergence GANs (Nowozin et al., 2016), we use the code of Wu et al. (2020) for GANs with Jensen-Shannon (JS) loss, squared Hellinger (SH) loss, and Reverse Kullback-Leibler (RKL) loss. For the exact expressions of these divergences, see Table 1 of Wu et al. (2020). We report the estimation error, measured by the Euclidean distance between the true uncontaminated mean η0 and the estimated mean θ, for various contamination distributions in Table 1. ROBOT GAN performs well across all considered contamination distributions. As the difference between the true mean η0 and the contamination mean η1 increases, the estimation error of all methods tends to increase. However, when it becomes easier to distinguish outliers from clean samples, i.e., η1 = 2·1_5, the performance of ROBOT noticeably improves. We present an analogous study with data simulated from a mixture of Cauchy distributions in Appendix F.

We also compared to the Sinkhorn-based UOT algorithm (Chizat et al., 2018) available in the Python Optimal Transport (POT) library (Flamary & Courty, 2017); to obtain a UOT GAN, we modified steps 5-12 of Algorithm 3 for computing Π. Unsurprisingly, both ROBOT and UOT perform similarly: recall the equivalence to Formulation 3, which is similar to UOT with the TV norm. The key insight of our work is the equivalence to classical OT with truncated cost, which greatly simplifies optimization and allows us to use existing stochastic OT algorithms. In this experiment, the sample size n = 1000 is sufficiently small for the Sinkhorn-based UOT POT implementation to be effective, but it breaks in the experiment we present in Section 4.2.

Algorithm 3: ROBOT GAN
1: Input: robustness regularization λ, entropic regularization α, data distribution µn ∈ Δ_{n−1} with supp(µn) = X = [X1, . . . , Xn], step sizes τ and γ
2: Initialize: θ = θ_init, set the numbers of iterations M and L, and set v = v̄ = 0.
3: for j = 1, . . . , M do
4:   Generate z̃ ∼ N(0, I_d) and set z = z̃ + θ.
5:   Set the cost vector c ∈ Rn as c(k) = min{c(Xk, z), 2λ} for k = 1, . . . , n.
6:   for i = 1, . . . , L do
7:     Set h ← (v − c)/α and apply the normalized exponential transformation u ← e^h / ⟨1, e^h⟩.
8:     Calculate the gradient ∇v ← µn − u.
9:     Update v ← v + γ∇v and v̄ ← (1/(j + i)) v + ((j + i − 1)/(j + i)) v̄.
10:  end for
11:  Apply the same transformation to v̄ as in Step 7, i.e., set h ← (v̄ − c)/α and Π ← e^h / ⟨1, e^h⟩.
12:  Set Π(k) = 0 for every k such that c(Xk, z) > 2λ, k = 1, . . . , n.
13:  Calculate the gradient with respect to θ as ∇θ = 2 Σ_k Π(k)(z − Xk).
14:  Update θ ← θ − τ∇θ.
15: end for
16: Output: θ
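The inner loop of Algorithm 3 is ordinary stochastic semi-dual ascent for entropic OT, just with a clipped cost. The following numpy sketch is a simplified single-sample variant rather than our exact implementation: the step sizes, iteration counts, and the omission of the dual averaging in line 9 are simplifying assumptions made for readability.

```python
# Simplified sketch of Algorithm 3: ROBOT GAN for robust mean estimation with
# the location generator g_theta(x) = x + theta and squared Euclidean cost.
import numpy as np

def robot_gan_mean(X, lam=0.5, alpha=0.1, tau=0.05, gamma=0.1, M=2000, L=5, seed=0):
    """X: (n, d) contaminated data; returns an estimate of the uncontaminated mean."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = np.full(n, 1.0 / n)            # empirical weights of mu_n
    theta = X.mean(0)                   # initialize at the (non-robust) sample mean
    v = np.zeros(n)                     # semi-dual potential

    for _ in range(M):
        z = rng.normal(size=d) + theta                  # one generated sample (line 4)
        raw = ((X - z) ** 2).sum(1)                     # untruncated squared costs
        c = np.minimum(raw, 2 * lam)                    # truncated cost vector (line 5)

        for _ in range(L):                              # inner dual ascent (lines 6-10)
            h = (v - c) / alpha
            u = np.exp(h - h.max()); u /= u.sum()       # normalized exponential (line 7)
            v = v + gamma * (mu - u)                    # gradient step on the dual

        h = (v - c) / alpha
        pi = np.exp(h - h.max()); pi /= pi.sum()        # conditional plan for z (line 11)
        pi = np.where(raw > 2 * lam, 0.0, pi)           # drop truncated entries (line 12)
        grad_theta = 2 * (pi[:, None] * (z - X)).sum(0) # line 13
        theta = theta - tau * grad_theta                # line 14
    return theta
```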
Table 1: Robust mean estimation with GANs using different distribution divergences. True mean η0 = 0_5; sample size n = 1000; contamination proportion ϵ = 0.2. We report results over 30 experiment restarts.

| Contamination | JS Loss | SH Loss | RKL Loss | ROBOT | UOT |
|---|---|---|---|---|---|
| N(0.1·1_5, I_5) | 0.09 ± 0.03 | 0.11 ± 0.03 | 0.115 ± 0.03 | 0.1 ± 0.03 | 0.1 ± 0.04 |
| N(0.5·1_5, I_5) | 0.23 ± 0.04 | 0.24 ± 0.05 | 0.24 ± 0.05 | 0.117 ± 0.03 | 0.2 ± 0.04 |
| N(1·1_5, I_5) | 0.43 ± 0.05 | 0.43 ± 0.06 | 0.43 ± 0.06 | 0.261 ± 0.06 | 0.25 ± 0.05 |
| N(2·1_5, I_5) | 0.67 ± 0.07 | 0.67 ± 0.08 | 0.67 ± 0.08 | 0.106 ± 0.03 | 0.1 ± 0.03 |

Figure 2: Empirical study of the sensitivity to the cost truncation hyperparameter λ. (a) Varying the proportion of contamination (estimation error vs. log10 λ for ϵ = 0, 0.05, 0.1, 0.15, 0.2, 0.25). (b) Varying the outlier distribution mean (estimation error vs. log10 λ for η1 = 0, −1, −2, −3, −4).

We also tried the code of Balaji et al. (2020) based on CVXPY (Diamond & Boyd, 2016), but it is too slow even for the n = 1000 sample size. In Subsection 4.3 we present a comparison to Balaji et al. (2020) on a smaller sample size.

Hyperparameter sensitivity study. In the previous experiment, we set λ = 0.5. Now we demonstrate empirically that there is a broad range of λ values that perform well. In Figure 2(a), we study the sensitivity to λ under various contamination proportions ϵ, holding η0 = 1_5 and η1 = 5·1_5 fixed. Horizontal lines correspond to λ = ∞, i.e., vanilla OT. The key observations are: there is a wide range of λ effective at all contamination proportions (note the log10 scale of the x-axis), and ROBOT is always at least as good as vanilla OT (even when there is no contamination, i.e., ϵ = 0). In Figure 2(b), we present a similar study varying the mean of the contamination distribution and holding ϵ = 0.2 fixed. We see that as the contamination distribution gets closer to the true distribution, it becomes harder to pick a good λ, but the performance is always at least as good as vanilla OT (horizontal lines).

4.2. Outlier detection for data collection

Our robust OT formulation in equation 2.6 enables outlier identification. Let νm be a clean dataset and µn a dataset potentially contaminated with outliers. Recall that ROBOT allows modification of one of the input distributions to eliminate potential outliers. We can identify outliers in µn as follows: if µn(i) + s*_1(i) = 0, then Xi, the ith point in µn, is an outlier. Instead of directly solving equation 2.6, which may be inefficient, we use our equivalence results and solve the easier optimization problem in equation 2.7, followed by recovering s to find outliers via Algorithm 2.

When using entropy-regularized approximate solutions to detect outliers with Algorithm 2, µn + s*_1 in step 4 will not be exactly 0 for the outliers, so a small threshold should be used instead. We modify step 4 to "Find I, the set of all indices where µn + s*_1 < 1/n²" when using entropy regularization.

Table 2: Outlier detection on MNIST.

| Method | Accuracy |
|---|---|
| One-class SVM | 0.496 ± 0.003 |
| Local outlier factor | 0.791 ± 0.001 |
| Isolation forest | 0.636 ± 0.010 |
| Elliptical envelope | 0.739 ± 0.002 |
| Orthonormal Certificates | 0.819 ± 0.008 |
| ROBOT | 0.859 ± 0.002 |
| ROBOT-Sinkhorn | 0.897 ± 0.004 |

To test our outlier-identification methodology, we follow the experimental setup of Tagasovska & Lopez-Paz (2019). Specifically, let νm be a clean dataset consisting of 10k MNIST digits from 0 to 4, and let µn be a dataset collected in the wild consisting of (different) 8k MNIST digits from 0 to 4 and 2k outlier MNIST images of digits from 5 to 9. We compute ROBOT(µn, νm) to identify the outlier digit images in µn. For each point in µn we obtain a prediction, outlier or clean, which allows us to evaluate accuracy.
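The full ROBOT-Sinkhorn detection pipeline amounts to a truncated-cost Sinkhorn solve followed by the thresholded version of Algorithm 2. The sketch below is an illustration rather than the exact code behind Table 2: the synthetic data standing in for flattened images, the entropic regularization value, and the choice of λ are assumptions.

```python
# Sketch of ROBOT-Sinkhorn outlier detection (Algorithm 2 with entropic regularization).
import numpy as np
import ot

def robot_sinkhorn_outliers(X_wild, Y_clean, lam, reg=0.05):
    """Return indices of X_wild flagged as outliers with respect to the clean set Y_clean."""
    n, m = len(X_wild), len(Y_clean)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    C = ot.dist(X_wild, Y_clean)                           # ground cost (squared Euclidean)
    pi2 = ot.sinkhorn(a, b, np.minimum(C, 2 * lam), reg)   # approximate Formulation 2 plan
    s1 = -(pi2 * (C > 2 * lam)).sum(1)                     # recovered s_1 (Algorithm 1, step 6)
    return np.flatnonzero(a + s1 < 1.0 / n ** 2)           # thresholded step 4 of Algorithm 2

# Example with synthetic 2-d data standing in for flattened image features:
rng = np.random.default_rng(0)
Y_clean = rng.normal(size=(500, 2))
X_wild = np.r_[rng.normal(size=(400, 2)), rng.normal(size=(100, 2)) + 8.0]  # last 100 are outliers
print(robot_sinkhorn_outliers(X_wild, Y_clean, lam=1.0))
```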
Tagasovska & Lopez-Paz (2019) use last-layer features of a neural network pre-trained on the clean data νm; it is straightforward to combine ROBOT and the other baselines we consider with any feature extractor, but in this experiment we simply use the raw data. We compare to the Orthonormal Certificates (OC) method of Tagasovska & Lopez-Paz (2019) and to a variety of standard outlier detection algorithms available in Scikit-learn (Pedregosa et al., 2011): one-class SVM (Schölkopf et al., 1999), local outlier factor (Breunig et al., 2000), isolation forest (Liu et al., 2008) and elliptical envelope (Rousseeuw & Driessen, 1999). All baselines except one-class SVM and local outlier factor use clean data for training, as does our method.

Results of 30 experiment repetitions are summarized in Table 2. ROBOT, i.e., Algorithm 2 where equation 2.7 is solved exactly with a linear program solver, and ROBOT-Sinkhorn, i.e., Algorithm 2 where equation 2.7 is solved approximately using Sinkhorn (Cuturi, 2013), produce the best results. In this experiment ROBOT-Sinkhorn outperforms ROBOT, but this is not necessarily to be expected in general. We conclude that our method is effective in assisting data collection once an initial set of clean data has been acquired, and it is compatible with entropy-regularized OT solvers, ensuring scalability.

Elliptical envelope assumes that the clean data is Gaussian, while one-class SVM, isolation forest and local outlier factor correspondingly attempt to fit SVM, random forest and k-nearest neighbors classifiers to distinguish clean and outlier samples. All these baselines work best when the clean data is unimodal, which is not the case for the MNIST 0 to 4 digits considered in our experiment. The Orthonormal Certificates method is rooted in PCA and assumes that clean and outlier data live in different subspaces. This assumption is reasonable for the MNIST data, but it might not hold broadly in practice. We believe that the empirical success of our method is due to its optimal transport nature: OT is a geometry-sensitive metric on distributions that can distinguish multi-modal distributions and distributions supported on the same subspace. We provide additional insights into the ROBOT performance in Figure 3. A theoretical investigation of ROBOT outlier-detection guarantees is an interesting future work direction.

Figure 3: Insights into ROBOT-Sinkhorn performance: the top-left figure is a collection of random inlier digits misclassified as outliers; the top-right picture shows random outlier digits misclassified as inliers; and the table below gives the frequency distribution of misclassified images. The majority of the errors are on digit 9, possibly due to its similarity to 4.

| Inlier digit detected as outlier | count | Outlier digit detected as inlier | count |
|---|---|---|---|
| 0 | 10 | 5 | 191 |
| 1 | 1 | 6 | 94 |
| 2 | 53 | 7 | 175 |
| 3 | 32 | 8 | 181 |
| 4 | 22 | 9 | 309 |

Hyperparameter selection. To select the cost truncation hyperparameter λ, we propose the following heuristic: since we know that νm is clean, we can subsample two datasets from it, compute vanilla OT to obtain a transportation plan Π, and set λ to half the maximum distance between matched elements, i.e., 2λ = max_{i,j}{Cij : Πij > 0}, where C is the cost matrix for the two subsampled datasets. This procedure essentially estimates the maximum distance between matched clean samples. To avoid subsampling noise, we use the 99th percentile instead of the maximum. Our experiments also revealed that increasing λ increases the set of outliers progressively, i.e., if a sample is detected as an outlier for some value of λ, then it will be detected as an outlier for all higher values of λ. A rigorous theoretical analysis for this observation is a potential future work direction.
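The subsampling heuristic above takes only a few lines of code (a sketch with assumed subsample size and quantile arguments, not the exact script used in our experiments):

```python
# Sketch of the lambda-selection heuristic: match two clean subsamples with
# vanilla OT and take half of the 99th percentile of the matched costs.
import numpy as np
import ot

def select_lambda(Y_clean, subsample=500, q=99, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y_clean))[: 2 * subsample]   # assumes len(Y_clean) >= 2*subsample
    A, B = Y_clean[idx[:subsample]], Y_clean[idx[subsample:]]
    w = np.full(subsample, 1.0 / subsample)
    C = ot.dist(A, B)
    plan = ot.emd(w, w, C)
    matched_costs = C[plan > 0]
    return 0.5 * np.percentile(matched_costs, q)   # 2*lambda ~ q-th percentile of matched costs
```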
The Orthonormal Certificates method (Tagasovska & Lopez-Paz, 2019) also requires setting a threshold. It computes the null-space of the clean training data and uses the norm of the projection onto that space as a score to distinguish outliers. Similarly to ROBOT, and as the authors do in their code, we use the 99th percentile of those scores computed on the clean training data as the threshold. For the other baselines, we use the default hyperparameters.

4.3. Comparison with Balaji et al. (2020)

We conduct additional experiments comparing to the recent robust optimal transport method of Balaji et al. (2020). Their method relies on CVXPY and does not scale to the sample sizes considered in our previous experiments, so we compare on smaller data sizes.

Robust mean estimation. We set n = 200 samples, with the contamination distribution mean equal to 2 and the true mean equal to 0 (the same configuration as in the last row of Table 1). The runtime and the estimation error are reported in Table 4.

Table 4: Robust mean estimation for n = 200.

| Method | Estimation error | Run-time |
|---|---|---|
| ROBOT-Sinkhorn | 0.196 ± 0.1 | 35s ± 0.6s |
| Balaji et al. (2020) | 0.212 ± 0.074 | 7745s ± 1670s |

Outlier detection experiment. In this experiment, we consider the size of the entire dataset (inliers + outliers) to be n = 1000 (with 800 inliers and 200 outliers). The results (accuracy and run-time) are provided in Table 5.

Table 5: Outlier detection for n = 1000.

| Method | Accuracy | Run-time |
|---|---|---|
| ROBOT-Sinkhorn | 0.86 ± 0.015 | 6s ± 2s |
| Balaji et al. (2020) | 0.8 ± 0.0005 | 3343s ± 960s |

The method of Balaji et al. (2020) is significantly slower, as expected. Comparing the performance, we think that ROBOT performs better because it is based on the TV norm, while the method of Balaji et al. (2020) uses a chi-squared constraint on the marginal perturbations. A TV constraint on the marginal perturbations is more closely related to the ϵ-contamination model for outlier detection, suggesting that a TV-based constraint/regularizer could be a better choice for outlier detection applications. We also note that in the outlier detection experiment, using the chi-square divergence results in a non-sparse solution and requires tuning a threshold parameter to perform outlier detection, in addition to the chi-square distance hyperparameter. We tuned those parameters, but were not able to achieve significant performance improvements for the method of Balaji et al. (2020).

5. Summary and discussion

We propose and study ROBOT, a robust formulation of optimal transport. Although the problem is seemingly asymmetric and challenging to optimize, there is an equivalent formulation based on cost truncation that is symmetric and compatible with modern stochastic optimization methods for OT.

ROBOT closely resembles unbalanced optimal transport (UOT). In our formulation, we added a TV regularizer to the vanilla optimal transport problem, motivated by the ϵ-contamination model. In UOT, the TV regularizer is typically replaced with a KL divergence. Other choices of the regularizer may lead to new properties and applications. Studying equivalent, simpler formulations of UOT with different divergences may be a fruitful future work direction.

From the practical perspective, in our experiments we observed no degradation of ROBOT GAN in comparison to OT GAN, even when there were no outliers.
It is possible that replacing OT with ROBOT may be beneficial for various machine learning applications of OT. Data encountered in practice may not be explicitly contaminated with outliers, but it often has errors and other deficiencies, suggesting that a no-harm robustness is desirable.

Acknowledgements

This paper is based upon work supported by the National Science Foundation (NSF) under grants no. 1830247 and 1916271. J. Solomon acknowledges the generous support of Army Research Office grants W911NF1710068 and W911NF2010168, of Air Force Office of Scientific Research award FA9550-19-1-031, of National Science Foundation grant IIS-1838071, from the CSAIL Systems that Learn program, from the MIT-IBM Watson AI Laboratory, from the Toyota-CSAIL Joint Research Center, from a gift from Adobe Systems, and from the Skoltech-MIT Next Generation Program. We also thank the anonymous reviewers, whose comments were extremely helpful for further improving our paper.

References

David Alvarez-Melis and Tommi S. Jaakkola. Gromov-Wasserstein alignment of word embedding spaces. arXiv:1809.00013, 2018.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv:1701.07875 [cs, stat], January 2017.

Yogesh Balaji, Rama Chellappa, and Soheil Feizi. Robust optimal transport with applications in generative modeling and domain adaptation. Advances in Neural Information Processing Systems, 33, 2020.

Federico Bassetti, Antonella Bodini, and Eugenio Regazzini. On minimum Kantorovich distance estimators. Statistics & Probability Letters, 76(12):1298-1302, 2006.

Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93-104, 2000.

Luis A. Caffarelli and Robert J. McCann. Free boundaries in optimal transport and Monge-Ampère obstacle problems. Annals of Mathematics, pp. 673-730, 2010.

Gao Chao, Yao Yuan, and Zhu Weizhi. Robust estimation via generative adversarial networks. In International Conference on Learning Representations, 2018.

Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation, 87(314):2563-2609, 2018.

Lenaïc Chizat. Unbalanced optimal transport: Models, numerical methods, applications. PhD thesis, Université Paris Sciences et Lettres, 2017.

Nicolas Courty, Rémi Flamary, and Devis Tuia. Domain adaptation with regularized optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 274-289. Springer, 2014.

Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, pp. 3730-3739, 2017.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 26, pp. 2292-2300. Curran Associates, Inc., 2013.

Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1-5, 2016.

Alessio Figalli. The optimal partial transport problem. Archive for Rational Mechanics and Analysis, 195(2):533-560, 2010.

Rémi Flamary and Nicolas Courty. POT: Python Optimal Transport library, 2017. URL https://pythonot.github.io/.
Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pp. 3440-3448, 2016.

Nhat Ho, Xuan Long Nguyen, Mikhail Yurochkin, Hung Hai Bui, Viet Huynh, and Dinh Phung. Multilevel clustering via Wasserstein means. In International Conference on Machine Learning, pp. 1501-1509, 2017.

Gao Huang, Chuan Guo, Matt J. Kusner, Yu Sun, Fei Sha, and Kilian Q. Weinberger. Supervised word mover's distance. In Advances in Neural Information Processing Systems, pp. 4862-4870, 2016.

Peter J. Huber and Elvezio Ronchetti. Robust Statistics. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ, 2nd edition, 2009. ISBN 978-0-470-12990-6.

Leonid Vitalievich Kantorovich. On the translocation of masses. In Dokl. Akad. Nauk. USSR (NS), volume 37, pp. 199-201, 1942.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In International Conference on Machine Learning, pp. 957-966, June 2015.

Matthias Liero, Alexander Mielke, and Giuseppe Savaré. Optimal entropy-transport problems and a new Hellinger-Kantorovich distance between positive measures. Inventiones Mathematicae, 211(3):969-1117, 2018.

Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pp. 413-422. IEEE, 2008.

Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. De l'Imprimerie Royale, 1781.

Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271-279, 2016.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

Ofir Pele and Michael Werman. Fast and robust earth mover's distances. In 2009 IEEE 12th International Conference on Computer Vision, pp. 460-467. IEEE, 2009.

Gabriel Peyré and Marco Cuturi. Computational Optimal Transport. arXiv:1803.00567 [stat], March 2018.

Benedetto Piccoli and Francesco Rossi. Generalized Wasserstein distance and its application to transport equations with source. Archive for Rational Mechanics and Analysis, 211(1):335-358, 2014.

Peter J. Rousseeuw and Katrien Van Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3):212-223, 1999.

Bernhard Schölkopf, Robert C. Williamson, Alexander J. Smola, John Shawe-Taylor, John C. Platt, et al. Support vector method for novelty detection. In NIPS, volume 12, pp. 582-588. Citeseer, 1999.

Vivien Seguy, Bharath Bhushan Damodaran, Rémi Flamary, Nicolas Courty, Antoine Rolet, and Mathieu Blondel. Large-scale optimal transport and mapping estimation. arXiv:1711.02283 [stat], February 2018.

Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed loss minimization. In International Conference on Machine Learning, pp. 5739-5748. PMLR, 2019.

Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):1-11, 2015.
Sanvesh Srivastava, Cheng Li, and David B. Dunson. Scalable Bayes via barycenter in Wasserstein space. arXiv:1508.05880 [stat], June 2018.

Guillaume Staerman, Pierre Laforgue, Pavlo Mozharovskyi, and Florence d'Alché-Buc. When OT meets MOM: Robust estimation of Wasserstein distance. arXiv:2006.10325, 2020.

Natasa Tagasovska and David Lopez-Paz. Single-model uncertainties for deep learning. Advances in Neural Information Processing Systems, 32:6417-6428, 2019.

Alexander Tong, Guy Wolf, and Smita Krishnaswamy. Fixing bias in reconstruction-based anomaly detection with Lipschitz discriminators. In 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1-6. IEEE, 2020.

Kaiwen Wu, Gavin Weiguang Ding, Ruitong Huang, and Yaoliang Yu. On minimax optimality of GANs for robust mean estimation. In International Conference on Artificial Intelligence and Statistics, pp. 4541-4551, June 2020.

Mikhail Yurochkin, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, and Justin Solomon. Hierarchical optimal transport for document representation. arXiv:1906.10827 [cs, stat], June 2019.