Published as a conference paper at ICLR 2024

SPARSISTENCY FOR INVERSE OPTIMAL TRANSPORT

Francisco Andrade & Gabriel Peyré, ENS Paris, {francisco.andrade,gabriel.peyre}@ens.fr
Clarice Poon, University of Warwick, clarice.poon@warwick.ac.uk

ABSTRACT

Optimal Transport is a useful metric to compare probability distributions and to compute a pairing given a ground cost. Its entropic regularization variant (eOT) is crucial to obtain fast algorithms and to reflect fuzzy/noisy matchings. This work focuses on Inverse Optimal Transport (iOT), the problem of inferring the ground cost from samples drawn from a coupling that solves an eOT problem. It is a relevant problem that can be used to infer unobserved/missing links, and to obtain meaningful information about the structure of the ground cost yielding the pairing. On one side, iOT benefits from convexity, but on the other side, being ill-posed, it requires regularization to handle the sampling noise. This work presents an in-depth theoretical study of the ℓ1 regularization to model, for instance, Euclidean costs with sparse interactions between features. Specifically, we derive a sufficient condition for the robust recovery of the sparsity of the ground cost that can be seen as a far-reaching generalization of the Lasso's celebrated Irrepresentability Condition. To provide additional insight into this condition, we work out in detail the Gaussian case. We show that as the entropic penalty varies, the iOT problem interpolates between a graphical Lasso and a classical Lasso, thereby establishing a connection between iOT and graph estimation, an important problem in ML.

1 INTRODUCTION

Optimal transport has emerged as a key theoretical and numerical ingredient in machine learning for performing learning over probability distributions. It enables the comparison of probability distributions in a geometrically faithful manner by lifting a ground cost (or metric in a loose sense) between pairs of points to a distance between probability distributions, metrizing the convergence in law. However, the success of this OT approach to ML is inherently tied to the hypothesis that the ground cost is adapted to the problem under study. This necessitates the exploration of ground metric learning. However, it is exceptionally challenging due to its a priori highly non-convex nature when framed as an optimization problem, thereby inheriting complications in its mathematical analysis. As we illustrate in this theoretical article, these problems become tractable numerically and theoretically if one assumes access to samples from the OT coupling (i.e., having access to some partial matching driven by the ground cost). Admittedly, this is a restrictive setup, but it arises in practice (refer to subsequent sections for illustrative applications) and can also be construed as a step in a more sophisticated learning pipeline. The purpose of this paper is to propose some theoretical understanding of the possibility of stably learning a ground cost from partial matching observations.

1.1 PREVIOUS WORKS

Entropic Optimal Transport. OT has been instrumental in defining and studying various procedures at the core of many ML pipelines, such as bag-of-features matching Rubner et al. (2000), distances in NLP Kusner et al. (2015), generative modeling Arjovsky et al. (2017), flow evolution for sampling De Bortoli et al. (2021), and even single-cell trajectory inference Schiebinger et al. (2019).
We refer to the monographs Santambrogio (2015) for detailed accounts on the theory of OT, and Peyré et al. (2019) for its computational aspects. Of primary importance to our work, entropic regularization of OT is the workhorse of many ML applications. It enables a fast and highly parallelizable estimation of the OT coupling using the Sinkhorn algorithm Sinkhorn (1964). Published as a conference paper at ICLR 2024 More importantly, it defines a smooth distance that incorporates the understanding that matching procedures should be modeled as a noisy process (i.e., should not be assumed to be 1:1). These advantages were first introduced in ML by the seminal paper of Cuturi Cuturi (2013), and this approach finds its roots in Schrödinger s work in statistical physics Léonard (2012). The role of noise in matching (with applications in economics) and its relation to entropic OT were advanced in a series of papers by Galichon and collaborators Galichon & Salanié (2010); Dupuy & Galichon (2014); Galichon & Salanié (2022); see the book Galichon (2018). These works are key inspirations for the present paper, which aims at providing more theoretical understanding in the case of inverse OT (as detailed next). Metric Learning. The estimation of some metrics from pairwise interactions (either positive or negative) falls into the classical field of metric learning in ML, and we refer to the monograph Bellet et al. (2013) for more details. In contrast to the inverse OT (i OT) problem considered in this paper, classical metric learning is more straightforward, as no global matching between sets of points is involved. This allows the metric to be directly optimized, while the i OT problem necessitates some form of bilevel optimization. Similarly to our approach, since the state space is typically continuous, it is necessary to restrict the class of distances to render the problem tractable. The common option, which we also adopt in our paper to exemplify our findings, is to consider the class of Mahalanobis distances. These distances generalize the Euclidean distance and are equivalent to computing a vectorial embedding of the data points. See, for instance, Xing et al. (2002); Weinberger et al. (2006); Davis & Dhillon (2008). OT Ground Metric Learning. The problem of estimating the ground cost driving OT in a supervised manner was first addressed by Cuturi & Avis (2014). Unlike methods that have access to pairs of samples, the ground metric learning problem requires pairs of probability distributions and then evolves into a classical metric learning problem, but within the OT space. The class of ground metrics can be constrained, for example, by utilizing Mahalanobis Wang & Guibas (2012); Xu et al. (2018); Kerdoncuff et al. (2021) or geodesic distances Heitz et al. (2021), to devise more efficient learning schemes. The study Zen et al. (2014) conducts ground metric learning and matrix factorization simultaneously, finding applications in NLP Huang et al. (2016). It is noteworthy that ground metric learning can also be linked to generative models through adversarial training Genevay et al. (2018) and to robust learning Paty & Cuturi (2020) by maximizing the cost to render the OT distance as discriminative as possible. Inverse Optimal Transport. The inverse optimal transport problem (i OT) can be viewed as a specific instance of ground metric learning, where one aims to infer the ground cost from partial observations of the (typically entropically regularized) optimal transport coupling. 
This problem was first formulated and examined by Dupuy and Galichon Dupuy & Galichon (2014) over a discrete space (also see Galichon & Salanié (2022) for a more detailed analysis), making the fundamental remark that the maximum likelihood estimator amounts to solving a convex problem. The mathematical properties of the i OT problem for discrete space (i.e., direct computation of the cost between all pairs of points) are explored in depth in Chiu et al. (2022), studying uniqueness (up to trivial ambiguities) and stability to pointwise noise. Note that our theoretical study differs fundamentally as we focus on continuous state spaces. This continuous setup assumes access only to a set of couples, corresponding to matches (or links) presumed to be drawn from an OT coupling. In this scenario, the i OT is typically an ill-posed problem, and Dupuy et al. (2019); Carlier et al. (2023) propose regularizing the maximum likelihood estimator with either a lowrank (using a nuclear norm penalty) or a sparse prior (using an ℓ1 Lasso-type penalty). In our work, we concretely focus on the sparse case, but our theoretical treatment of the i OT could be extended to general structured convex regularization, along the lines of Vaiter et al. (2015). While not the focus of our paper, it is noteworthy that these works also propose efficient large-scale, non-smooth proximal solvers to optimize the penalized maximum likelihood functional, and we refer to Ma et al. (2020) for an efficient solver without inner loop calls to Sinkhorn s algorithm. This approach was further refined in Stuart & Wolfram (2020), deriving it from a Bayesian interpretation, enabling the use of MCMC methods to sample the posterior instead of optimizing a pointwise estimate (as we consider here). They also propose parameterizing cost functions as geodesic distances on graphs (while we consider only linear models to maintain convexity). An important application of i OT to ML, explored by Li et al. (2019), is to perform link prediction by solving new OT problems once the cost has been estimated from the observed couplings. Another category Published as a conference paper at ICLR 2024 of ML application of i OT is representation learning (learning embeddings of data into, e.g., an Euclidean space) from pairwise interactions, as demonstrated by Shi et al. (2023), which recasts contrastive learning as a specific instance of i OT. Inverse problems and model selection. The i OT problem is formally a bilevel problem, as the observation model necessitates solving an OT problem as an inner-level program Colson et al. (2005) we refer to Eisenberger et al. (2022) for a recent numerical treatment of bilevel programming with entropic OT. The i OT can thus be conceptualized as an inverse optimization problem Zhang & Liu (1996); Ahuja & Orlin (2001), but with a particularly favorable structure, allowing it to be recast as a convex optimization. This provides the foundation for a rigorous mathematical analysis of performance, as we propose in this paper. The essence of our contributions is a theoretical examination of the recoverability of the OT cost from noisy observations, and particularly, the robustness to noise of the sparse support of the cost (for instance, viewed as a symmetric matrix for Mahalanobis norms). There exists a rich tradition of similar studies in the fields of inverse problem regularization and model selection in statistics. 
The most prominent examples are the sparsistency theory of the Lasso Tibshirani (1996) (least squares regularized by ℓ1), which culminated in the theory of compressed sensing Candès et al. (2006). These theoretical results are predicated on a so-called irrepresentability condition Zhao & Yu (2006), which ensures the stability of the support. While our analysis is grounded in similar concepts (in particular, we identify the corresponding irrepresentability condition for the iOT), the iOT inverse problem is fundamentally distinct due to the differing observation model (it corresponds to a sampling process rather than the observation of a vector), and the estimation process necessitates solving a program of potentially infinite dimension (in the limit of a large number of samples). This mandates a novel proof strategy, which forms the core of our mathematical analysis.

1.2 CONTRIBUTIONS

This paper proposes the first mathematical analysis of the performance of regularized iOT estimation, focusing on the special case of sparse ℓ1 regularization (the ℓ1-iOT method). We begin by deriving the customary irrepresentability condition of the iOT problem, rigorously proving that it is well-defined. This condition interweaves the properties of the Hessian of the maximum likelihood functional with the sparse support of the sought-after cost. The main contribution of this paper is Theorem 5, which leverages this abstract irrepresentability condition to ensure sparsistency of the ℓ1-iOT method. This relates to the robust estimation of the cost and its support in some linear model, assuming the number n of samples is large enough. Specifically, we demonstrate a sample complexity of order n^{-1/2}. Our subsequent sets of contributions are centered on the case of matching between samples of Gaussian distributions. Herein, we illustrate in Lemma 7 how to compute the irrepresentability condition in closed form. This facilitates the examination of how the parameters of the problem, particularly the regularization strength and the covariance of the distributions, influence the success and stability of iOT. We further explore the limiting cases of small and large entropic regularization, revealing in Proposition 8 and Proposition 9 that iOT interpolates between the graphical lasso (to estimate the graph structure of the precision matrix) and a classical lasso. This sheds light on the connection between iOT and graph estimation procedures. Simple synthetic numerical explorations in Section 5.2 further provide intuition about how ε and the geometry of the graph associated with a sparse cost impact sparsistency. As a minor numerical contribution, we present in Appendix F a large-scale ℓ1-iOT solver, implemented in JAX and distributed as open-source software.

2 INVERSE OPTIMAL TRANSPORT

The forward problem. Given probability distributions α ∈ P(X), β ∈ P(Y) and a cost function c : X × Y → R, the entropic optimal transport problem seeks to compute a coupling density

Sink(c, ε) ≜ argmax_{π ∈ U(α,β)} ⟨c, π⟩ − (ε/2) KL(π | α ⊗ β), where ⟨c, π⟩ ≜ ∫ c(x, y) dπ(x, y),   (1)

where KL(π|ξ) ≜ ∫ log(dπ/dξ) dπ − ∫ dπ and U(α,β) is the space of all probability measures π ∈ P(X × Y) with marginals α and β.

The inverse problem. The inverse optimal transport problem seeks to recover the cost function c given an approximation π̂_n of the probability coupling π̂ = Sink(c, ε). A typical setting is an empirical probability coupling π̂_n = (1/n) Σ_{i=1}^n δ_{(x_i, y_i)}, where (x_i, y_i)_{i=1}^n ~ iid π̂. See Section 2.2.
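As an illustration of this observation model, the following minimal JAX sketch computes Sink(c, ε) on a discrete grid with Sinkhorn iterations (using the (ε/2)·KL convention of (1), hence the 2/ε factor in the Gibbs kernel) and then draws n i.i.d. couples from the resulting plan. The grid, the scalar quadratic cost and all names are illustrative assumptions; this is not the solver of Appendix F.

```python
# Minimal sketch of the observation model of Section 2: compute the entropic
# plan Sink(c_A, eps) on a discrete grid via Sinkhorn, then sample n couples.
import jax
import jax.numpy as jnp

def sinkhorn_plan(C, a, b, eps, n_iter=500):
    """Entropic plan for argmax <c,pi> - (eps/2) KL(pi | a x b) on a grid."""
    M = a[:, None] * b[None, :] * jnp.exp(2.0 * C / eps)  # Gibbs kernel w.r.t. a x b
    v = jnp.ones_like(b)
    for _ in range(n_iter):
        u = a / (M @ v)          # enforce the row marginals
        v = b / (M.T @ u)        # enforce the column marginals
    return u[:, None] * M * v[None, :]

def sample_coupling(key, P, xs, ys, n):
    """Draw n couples (x_i, y_i) ~ pi_hat from the discrete plan P."""
    flat = P.ravel() / P.sum()
    idx = jax.random.choice(key, flat.shape[0], shape=(n,), p=flat)
    i, j = jnp.unravel_index(idx, P.shape)
    return xs[i], ys[j]

# Toy usage: scalar quadratic cost c_A(x, y) = A * x * y on uniform grids.
key = jax.random.PRNGKey(0)
xs = jnp.linspace(-1.0, 1.0, 50); ys = jnp.linspace(-1.0, 1.0, 50)
a = jnp.ones(50) / 50; b = jnp.ones(50) / 50
A = jnp.array(0.8)
C = A * xs[:, None] * ys[None, :]
P = sinkhorn_plan(C, a, b, eps=0.5)
x_samp, y_samp = sample_coupling(key, P, xs, ys, n=1000)
```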
The loss function. The iOT problem has been proposed and studied in a series of papers, see Section 1.1. The approach is typically to consider some linear parameterization¹ of the cost c_A(x, y) by some parameter A ∈ R^s. The key observation of Dupuy et al. (2019) is that the negative log-likelihood of π̂ at parameter value A is given by

L(A, π̂) ≜ −⟨c_A, π̂⟩ + W_π̂(A), where W_π̂(A) ≜ sup_{π ∈ U(α̂, β̂)} ⟨c_A, π⟩ − (ε/2) KL(π | α̂ ⊗ β̂),

and α̂, β̂ are the marginals of π̂. For ease of notation, unless stated otherwise, we write W = W_π̂. So, computing the maximum likelihood estimator A for the cost corresponds to minimizing the convex loss function A ↦ L(A, π̂), which, by regarding W_π̂ as a convex conjugate, can be seen as an instance of a Fenchel-Young loss, a family of losses proposed in Blondel et al. (2020). We write the parameterization as

Φ : A ∈ R^s ↦ c_A = Σ_{j=1}^s A_j C_j, where C_j ∈ C(X × Y).

A relevant example is given by quadratic costs, so that for X ⊂ R^{d_1}, Y ⊂ R^{d_2} and A ∈ R^{d_1×d_2}, c_A(x, y) = x^⊤ A y. In this case, s = d_1 d_2 and, for k = (i, j) ∈ [d_1] × [d_2], C_k(x, y) = x_i y_j.

¹ In this work, we restrict to linear parameterizations, although the same loss function can also be applied to learn costs with nonlinear parameterizations, e.g. via neural networks Ma et al. (2020).

ℓ1-iOT. To handle the presence of noisy data (typically coming from the sampling process), various regularization approaches have been proposed. In this work, we focus on the use of ℓ1-regularization Carlier et al. (2023) to recover sparse parametrizations:

argmin_A F(A), where F(A) ≜ λ ‖A‖_1 + L(A, π̂).   (iOT-ℓ1(π̂))

Kantorovich formulation. Note that W(A) is defined via a concave optimization problem and, by Fenchel duality, one can show that (iOT-ℓ1(π̂)) has the following equivalent Kantorovich formulation Carlier et al. (2023):

argmin_{A, f, g} K(A, f, g), where K(A, f, g) ≜ J(A, f, g) + λ ‖A‖_1, and   (K)

J(A, f, g) ≜ −∫ ( f(x) + g(y) + ΦA(x, y) ) dπ̂(x, y) + (ε/2) ∫ exp( (2/ε)( f(x) + g(y) + ΦA(x, y) ) ) dα(x) dβ(y).

Based on this formulation, various algorithms have been proposed, including alternating minimization with proximal updates Carlier et al. (2023). Appendix F details a new large-scale solver that we use for the numerical simulations.

2.1 INVARIANCES AND ASSUMPTIONS

Assumption 1. We first assume that X and Y are compact.

Note that J has the translation invariance property that, for any constant function u, J(f + u, g − u, A) = J(f, g, A), so to remove this invariance, throughout, we restrict the optimization of (K) to the set

S ≜ { (A, f, g) ∈ R^s × L²(α) × L²(β) ; ∫ g(y) dβ(y) = 0 }.   (2)

Next, we make some assumptions on the cost to remove invariances in the iOT problem.

Assumption 2 (Assumption on the cost). (i) E_{(x,y)∼α⊗β}[ C(x, y) C(x, y)^⊤ ] is invertible. (ii) ‖C(x, y)‖ ≤ 1 for α-almost every x and β-almost every y. (iii) For all k, ∫ C_k(x, y) dα(x) = 0 for β-a.e. y and ∫ C_k(x, y) dβ(y) = 0 for α-a.e. x.

Under these assumptions, it can be shown that iOT has a unique solution (see the remark after Proposition 2). Assumption 2 (i) is to ensure that c_A = ΦA is uniquely determined by A. Assumption 2 (ii) is without loss of generality: since we assume that α, β are compactly supported, it holds up to a rescaling of the space. Assumption 2 (iii) is to handle the invariances pointed out in Carlier et al. (2023) and Ma et al. (2020):

J(A, f, g) = J(A′, f′, g′) ⟺ c_A + (f ⊕ g) = c_{A′} + (f′ ⊕ g′), where (f ⊕ g)(x, y) ≜ f(x) + g(y).   (3)

As observed in Carlier et al.
(2023), any cost can be adjusted to fit this assumption: one can define C̃_k(x, y) ≜ C_k(x, y) − u_k(x) − v_k(y), where u_k(x) ≜ ∫ C_k(x, y) dβ(y) and v_k(y) ≜ ∫ C_k(x, y) dα(x) − ∫∫ C_k(x, y) dα(x) dβ(y). Letting Φ̃A ≜ Σ_k A_k C̃_k, we have (f − Σ_k A_k u_k) ⊕ (g − Σ_k A_k v_k) + ΦA = (f ⊕ g) + Φ̃A. So optimization with the parametrization Φ̃ is equivalent to optimization with Φ.

NB: For the quadratic cost C_k(x, y) = x_i y_j with k = (i, j), condition (iii) corresponds to recentering the data points, taking x ↦ x − ∫ x dα(x) and y ↦ y − ∫ y dβ(y). Condition (ii) holds if ‖x‖ ‖y‖ ≤ 1 for α-a.e. x and β-a.e. y. Condition (i) corresponds to the invertibility of E_α[x x^⊤] ⊗ E_β[y y^⊤].

2.2 THE FINITE SAMPLE PROBLEM

In practice, we do not observe π̂ but n data points (x_i, y_i) ~ iid π̂ for i = 1, ..., n, where π̂ = Sink(c_Â, ε). To recover Â, we plug the empirical measure π̂_n = (1/n) Σ_{i=1}^n δ_{(x_i, y_i)} into (iOT-ℓ1) and consider the estimator

A_n ∈ argmin_A λ ‖A‖_1 + L(A, π̂_n).   (iOT-ℓ1(π̂_n))

As in Section 2.1, to account for the invariance (3), when solving (iOT-ℓ1(π̂_n)) we again centre the cost parameterization, so that for all j, Σ_{i=1}^n C_k(x_i, y_j) = 0 and for all i, Σ_{j=1}^n C_k(x_i, y_j) = 0. Note also that (iOT-ℓ1(π̂_n)) can be formulated entirely in finite dimensions. See Appendix B for details.

3 THE CERTIFICATE FOR SPARSISTENCY

In this section, we consider the problem (iOT-ℓ1(π̂)) with full data π̂ and present a sufficient condition for support recovery, which we term non-degeneracy of the certificate. For simplicity of notation, throughout this section we denote W ≜ W_π̂. Under non-degeneracy of the certificate, we obtain support recovery as stated in Theorem 3, a known result whose proof can be found e.g. in Lee et al. (2015). This condition can be seen as a generalization of the Lasso's celebrated Irrepresentability Condition (see e.g. Hastie et al. (2015)): the Lasso corresponds to having a quadratic loss instead of L(A, π̂), so that ∇²_A W(A) in the definition below reduces to a fixed matrix. In what follows, we denote by u_I ≜ (u_i)_{i∈I} and U_{I,J} ≜ (U_{i,j})_{i∈I, j∈J} the restriction operators.

Definition 1. The certificate with respect to A and support I = {i : A_i ≠ 0} is

z_A ≜ ∇²W(A)_{(:,I)} ( ∇²W(A)_{(I,I)} )^{-1} sign(A)_I.   (C)

We say that it is non-degenerate if ‖(z_A)_{I^c}‖_∞ < 1.

The next proposition, whose proof can be found in Appendix C.1, shows that the function W(A) is twice differentiable, thus ensuring that (C) is well defined.

Proposition 2. A ↦ W(A) is twice differentiable and strictly convex, with gradient and Hessian

∇_A W(A) = Φ^⊤ π_A,   ∇²_A W(A) = Φ^⊤ ∂_A π_A,

where π_A is the unique solution to (1) with cost c_A = ΦA.

We remark that the strict convexity of W implies that any solution of (iOT-ℓ1(π̂)) must be unique, and, by Γ-convergence, solutions A_λ to (iOT-ℓ1(π̂)) converge to Â as λ → 0. The next theorem, a well-known result (see (Lee et al., 2015, Theorem 3.4) for a proof), shows that support recovery can be characterized via z_Â.

Theorem 3. Let π̂ = Sink(c_Â, ε). If z_Â is non-degenerate, then for all λ sufficiently small, the solution A_λ to (iOT-ℓ1(π̂)) is sparsistent, with ‖A_λ − Â‖ = O(λ) and Supp(A_λ) = Supp(Â).

3.1 INTUITION BEHIND THE CERTIFICATE (C)

Implication of the non-degeneracy condition. The non-degeneracy condition is widely studied for the Lasso problem Hastie et al. (2015). Just as for the Lasso, the closer |(z_Â)_i| is to 1, the more unstable this coefficient is, and if |(z_Â)_i| > 1 then one cannot expect to recover this coefficient when there is noise.
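To make Definition 1 and Proposition 2 concrete, the following sketch evaluates the certificate numerically on a discretized problem: W is approximated by running Sinkhorn on a grid, so that its Hessian can be obtained by automatic differentiation through the unrolled iterations, and z_A is then assembled as in (C). The discretization and all function names are illustrative assumptions, not the paper's implementation.

```python
# Sketch: approximate Hessian of W on a grid by autodiff, then form z_A as in (C).
import jax
import jax.numpy as jnp

def W_discrete(A, Phi, a, b, eps, n_iter=200):
    """Discretized W(A) = max_P <Phi A, P> - (eps/2) KL(P | a x b), via unrolled Sinkhorn."""
    C = jnp.tensordot(Phi, A, axes=[[0], [0]])            # cost c_A on the grid, Phi has shape (s, nx, ny)
    M = a[:, None] * b[None, :] * jnp.exp(2.0 * C / eps)  # Gibbs kernel w.r.t. a x b
    v = jnp.ones_like(b)
    for _ in range(n_iter):
        u = a / (M @ v)
        v = b / (M.T @ u)
    P = u[:, None] * M * v[None, :]
    kl = jnp.sum(P * jnp.log(P / (a[:, None] * b[None, :]))) - jnp.sum(P)
    return jnp.sum(C * P) - 0.5 * eps * kl

def certificate(A, H):
    """z_A = H[:, I] (H[I, I])^{-1} sign(A)_I for the support I of A, cf. (C)."""
    I = jnp.flatnonzero(A != 0.0)
    zI = jnp.linalg.solve(H[I][:, I], jnp.sign(A[I]))
    return H[:, I] @ zI

# usage sketch (small grids only, since the Sinkhorn loop is unrolled by autodiff):
#   H = jax.hessian(lambda A: W_discrete(A, Phi, a, b, eps))(A_hat)
#   z = certificate(A_hat, H)   # non-degenerate if max |z_i| < 1 off the support
```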
Note also that in the Lasso, the pre-certificate formula in (C) roughly measures the correlation between the coefficients inside and outside the support I. In our case, the loss is more complex and this correlation is measured according to the Hessian of W, which integrates the curvature of the loss. This curvature formula is however involved, so to gain intuition, we perform a detailed analysis in the ε → 0 and ε → ∞ limits for the Gaussian setting, where it becomes much simpler (see Section 5.1).

Link to optimality conditions. The certificate z_Â can be seen as the limit optimality condition for the optimization problem (iOT-ℓ1(π̂)) as λ → 0: by the first-order optimality condition of (iOT-ℓ1(π̂)), A_λ is a solution if and only if

z_λ ≜ −(1/λ) ∇L(A_λ) ∈ ∂‖A_λ‖_1,

where the subdifferential of the ℓ1 norm has the explicit form ∂‖A‖_1 = { z ; ‖z‖_∞ ≤ 1 and ∀i ∈ Supp(A), z_i = sign(A_i) }. It follows that z_λ can be seen as a certificate for the support of A_λ, since Supp(A_λ) ⊆ { i ; |(z_λ)_i| = 1 }. To study the support behavior of A_λ for small λ, it is therefore interesting to consider the limit of z_λ as λ → 0. Its limit is precisely the subdifferential element with minimal (weighted) norm and coincides with (C) under the non-degeneracy condition:

Proposition 4. Let z_λ ≜ −(1/λ) ∇L(A_λ), where A_λ solves (iOT-ℓ1(π̂)). Then,

lim_{λ→0} z_λ = z^min_Â ≜ argmin_z { ⟨z, (∇²W(Â))^{-1} z⟩ ; z ∈ ∂‖Â‖_1 }.   (MNC)

Moreover, if z_Â is non-degenerate, then z^min_Â = z_Â.

4 SAMPLE COMPLEXITY BOUNDS

Our main contribution shows that (C) is a certificate for sparsistency under sampling noise:

Theorem 5. Let π̂ = Sink(c_Â, ε) and suppose that the certificate z_Â is non-degenerate. Let δ > 0. Then, for all sufficiently small regularization parameters λ and sufficiently many samples n, namely

λ ≲ 1 and max( exp(C ‖Â‖_1 / ε) √(log(1/δ)) λ^{-1}, √(log(2s)) ) ≲ √n

for some constant C > 0, with probability at least 1 − δ the minimizer A_n of (iOT-ℓ1(π̂_n)) is sparsistent with Â, with Supp(A_n) = Supp(Â) and

‖A_n − Â‖_2 ≲ λ + √( exp(C ‖Â‖_1 / ε) log(1/δ) n^{-1} ).

Main idea behind Theorem 5. We know from Theorem 3 that there is some λ_0 > 0 such that for all λ ≤ λ_0, the solution to (K) has the same support as Â. To show that the finite sample problem also recovers the support of Â when n is sufficiently large, we fix λ ∈ (0, λ_0] and consider the setting where the observations are iid samples from the coupling measure π̂. We will derive convergence bounds for the primal and dual solutions as the number of samples n increases. Let (A*, f*, g*) minimise (K). Denote F* ≜ (f*(x_i))_{i∈[n]}, G* ≜ (g*(y_j))_{j∈[n]} and P* ≜ (1/n²)(p*(x_i, y_j))_{i,j∈[n]}, where

p*(x, y) ≜ exp( (2/ε)( ΦA*(x, y) + f*(x) + g*(y) ) ).

Let A_n minimize (iOT-ℓ1(π̂_n)). Then, there exist a probability matrix P_n ∈ R^{n×n}_+ and vectors F_n, G_n ∈ R^n such that

(P_n)_{i,j} = (1/n²) exp( (2/ε)( (Φ_n A_n)_{i,j} + (F_n)_i + (G_n)_j ) ).

In fact, (A_n, F_n, G_n) minimize the finite-dimensional dual problem (K_n) given in the appendix. Now consider the certificates

z ≜ (1/λ) Φ^⊤( π̂ − p*·(α ⊗ β) ) and z_n ≜ (1/λ) Φ_n^⊤( P̂_n − P_n ).

Note that z and z_n both depend on λ; the superscript was dropped since λ is fixed. Moreover, z is precisely z_λ from Proposition 4. By exploiting strong convexity properties of J_n, one can show the following sample complexity bound on the convergence of z_n to z (the proof can be found in the appendix):

Proposition 6. Let n ≳ max( log(1/δ) λ^{-2}, log(2s) ) for some δ > 0. Then, for some constant C > 0, with probability at least 1 − δ,

‖z − z_n‖_∞ ≲ exp(C ‖Â‖_1 / ε) √(log(1/δ)) λ^{-1} n^{-1/2},

‖A_n − A*‖_2² + (1/n) Σ_i (F_n − F*)_i² + (1/n) Σ_j (G_n − G*)_j² ≲ ε² exp(C ‖Â‖_1 / ε) log(1/δ) n^{-1}.
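Before turning to the proof ingredients, here is a minimal sketch of the finite-sample estimation that Theorem 5 and Proposition 6 concern: a plain proximal-gradient (ISTA) loop on (iOT-ℓ1(π̂_n)), where each gradient step requires one entropic OT solve between the empirical marginals. This is only an illustration under simplifying assumptions (fixed step size, dense n×n plans, made-up names); it is not the large-scale solver of Appendix F nor the algorithm of Carlier et al. (2023).

```python
# ISTA sketch for iOT-l1(pi_hat_n): grad L(A) = Phi_n^T (P_A - P_hat_n), plus soft-thresholding.
import jax.numpy as jnp

def entropic_plan(C, eps, n_iter=300):
    """Entropic plan between the uniform empirical marginals for cost matrix C."""
    n = C.shape[0]
    a = jnp.ones(n) / n
    M = jnp.exp(2.0 * C / eps) / n**2        # Gibbs kernel w.r.t. the uniform product measure
    v = jnp.ones(n)
    for _ in range(n_iter):
        u = a / (M @ v)
        v = a / (M.T @ u)
    return u[:, None] * M * v[None, :]

def solve_iot_l1(Phi_n, eps, lam, step=0.5, n_iter=200):
    """Phi_n has shape (s, n, n): Phi_n[k, i, j] = C_k(x_i, y_j) (centred features)."""
    s, n, _ = Phi_n.shape
    P_hat = jnp.eye(n) / n                                    # empirical coupling pi_hat_n
    A = jnp.zeros(s)
    for _ in range(n_iter):
        C = jnp.tensordot(A, Phi_n, axes=1)                   # current cost Phi_n A
        P_A = entropic_plan(C, eps)
        grad = jnp.tensordot(Phi_n, P_A - P_hat, axes=[[1, 2], [0, 1]])
        A = A - step * grad                                   # gradient step on L(A, pi_hat_n)
        A = jnp.sign(A) * jnp.maximum(jnp.abs(A) - step * lam, 0.0)  # prox of lam*||.||_1
    return A

# Support recovery check: compare jnp.flatnonzero(A_n) with the support of the true A_hat.
```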
From Proposition 4, for all λ ≤ λ_0 with λ_0 sufficiently small, the magnitude of z outside the support of Â is less than one. Moreover, the convergence result in Proposition 6 above implies that, for n sufficiently large, the magnitude of z_n outside the support of Â is also less than one. Hence, since the set { i ; |(z_n)_i| = 1 } determines the support of A_n, we have sparsistency.

5 GAUSSIAN DISTRIBUTIONS

To get further insight into the sparsistency property of iOT, we consider the special case where the source and target distributions are Gaussians, and the cost parametrization is c_A(x, y) = x^⊤ A y. To this end, we first derive closed-form expressions for the Hessian ∇²_A L(A) = ∇²_A W(A). Given α = N(m_α, Σ_α) and β = N(m_β, Σ_β), it is known (see Bojilov & Galichon (2016)) that the coupling density is also a Gaussian, of the form

π = N( [m_α; m_β], [Σ_α, Σ; Σ^⊤, Σ_β] )

for some cross-covariance Σ ∈ R^{d_1×d_2}. In this case, W can be written as an optimization problem over the cross-covariance Σ Bojilov & Galichon (2016):

W(A) = sup_{Σ ∈ R^{d_1×d_2}} ⟨A, Σ⟩ + (ε/2) logdet( Σ_β − Σ^⊤ Σ_α^{-1} Σ ),   (4)

with the optimal solution being precisely the cross-covariance of the optimal coupling π. In Bojilov & Galichon (2016), the authors provide an explicit formula (5) for the minimizer Σ of (4) and, consequently, for W(A). By differentiating the first-order condition for W, namely A = ε Σ_α^{-1} Σ (Σ_β − Σ^⊤ Σ_α^{-1} Σ)^{-1}, Galichon derives an expression (6) for the Hessian ∇²W(A) = ∂_A Σ in terms of Σ, Σ_α, Σ_β, A^{-1} and the linear map T such that T vec(A) = vec(A^⊤). This formula does not hold when A is rectangular or rank-deficient, since A^{-1} is then not defined. In the following, we derive, via the implicit function theorem, a general formula for ∂_A Σ that agrees with that of Galichon in the square invertible case.

Lemma 7. Denoting by Σ the minimizer of (4) (as in (5)), one has

∇²W(A) = ε ( ε² (Σ_β − Σ^⊤ Σ_α^{-1} Σ)^{-1} ⊗ (Σ_α − Σ Σ_β^{-1} Σ^⊤)^{-1} + (A ⊗ A) T )^{-1}.   (7)

The formula given in Lemma 7 provides an explicit expression for the certificate (C).

5.1 LIMIT CASES FOR LARGE AND SMALL ε

This section explores the behaviour of the certificate in the large/small ε limits: Proposition 8 reveals that the large ε limit coincides with the classical Lasso, while Proposition 9 reveals that the small ε limit (for symmetric Â ≻ 0 and Σ_α = Σ_β = Id) coincides with the Graphical Lasso. In the following results, we denote the functional in (iOT-ℓ1(π̂)) with parameters λ and ε by F_{ε,λ}(A).

Proposition 8 (ε → ∞). Let Â be invertible and let π̂ = Sink(c_Â, ε) be the observed coupling between α = N(m_α, Σ_α) and β = N(m_β, Σ_β). Then,

lim_{ε→∞} z_ε = (Σ_β ⊗ Σ_α)_{(:,I)} ( (Σ_β ⊗ Σ_α)_{(I,I)} )^{-1} sign(Â)_I.   (8)

Moreover, for λ_0 > 0, given any sequence (ε_j)_j with lim_j ε_j = ∞ and A_j ∈ argmin_A F_{ε_j, λ_0/ε_j}(A), any cluster point of (A_j)_j is in

argmin_{A ∈ R^{d×d}} λ_0 ‖A‖_1 + (1/2) ‖ (Σ_β^{1/2} ⊗ Σ_α^{1/2})(A − Â) ‖_F².   (9)

Interpretation. As ε → ∞, the KL term forces π̂ to be close to the independent coupling α ⊗ β and, in the limit, iOT is simply a Lasso problem; the limit in (8) is precisely the Lasso certificate Hastie et al. (2015). Here, the cross-covariance of π̂ satisfies εΣ = Â + O(1/ε) ((53) in the appendix), so for large ε, sparsity in A indicates independence in the coupling between α and β.

Proposition 9 (ε → 0). Let Â be symmetric positive-definite and let π̂ = Sink(c_Â, ε) be the observed coupling between α = N(m_α, Id) and β = N(m_β, Id). Then,

lim_{ε→0} z_ε = (Â^{-1} ⊗ Â^{-1})_{(:,I)} ( (Â^{-1} ⊗ Â^{-1})_{(I,I)} )^{-1} sign(Â)_I.   (10)

Let λ_0 > 0.
Then, optimizing over symmetric positive semi-definite matrices, given any sequence (ε_j)_j with lim_j ε_j = 0 and A_j ∈ argmin_{A ⪰ 0} F_{ε_j, λ_0 ε_j}(A), any cluster point of (A_j)_j is in

argmin_{A ⪰ 0} λ_0 ‖A‖_1 − (1/2) logdet(A) + (1/2) ⟨A, Â^{-1}⟩.   (11)

Interpretation. In contrast, as ε → 0, the KL term disappears and the coupling π̂ becomes dependent. Naturally, the limit problem is the graphical lasso, typically used to infer conditional independence in graphs (but where covariates can be highly dependent). Note also that the limit (10) is precisely the graphical Lasso certificate Hastie et al. (2015). Here (Remark 25 in the appendix), one can show that for (x, y) ∼ π, the conditional covariance of x given y (and also vice versa) is εA^{-1} + O(ε²). Sparsity in A can therefore be viewed as information on conditional independence.

5.2 NUMERICAL ILLUSTRATIONS

In order to gain some insight into the impact of ε and of the covariance structure on the efficiency of iOT, we present numerical computations of certificates here. We fix the covariances of the input measures as Σ_α = Σ_β = Id_n; similar results are obtained with different covariances as long as they are not rank-deficient. The sought-after cost matrix

A = δ Id_n + diag(G 1_n) − G ∈ R^{n×n}

is a shifted Laplacian matrix of some graph adjacency matrix G, for a graph of size n = 80 (similar conclusions hold for larger graphs), so that the support of A is determined by the edges of the graph. We set the shift δ to be 10% of the largest eigenvalue of the Laplacian, ensuring that A is symmetric and positive definite. This setup corresponds to graphs defining positive interactions at vertices and negative interactions along edges. For small ε, adopting the graphical lasso interpretation (as exposed in Section 5.1) and interpreting A as a precision matrix, this setup corresponds (for instance, for a planar graph) to imposing a spatially smoothly varying covariance A^{-1}.

Figure 1 illustrates how the value of the certificate z_{i,j} evolves depending on the indices (i, j) for three types of graphs (circular, planar, and Erdős–Rényi with a probability of edges equal to 0.1), for several values of ε. By construction, z_{i,i} = 1 and z_{i,j} = −1 for (i, j) connected by the graph. For z to be non-degenerate, it is required that |z_{i,j}| < 1 as i moves away from j on the graph. For the circular and planar graphs, the horizontal axis represents the geodesic distance d_geod(i, j), demonstrating how the certificates become well-behaved as the distance increases. The planar graph displays envelope curves showing the range of values of z for a fixed value of d_geod(i, j), while this is a single-valued curve for the circular graph due to periodicity. For the Erdős–Rényi graph, to better account for the randomness of the certificates along the graph edges, we display the histogram of the distribution of |z_{i,j}| for d_geod(i, j) = 2 (which represents the most critical set of entries, as they are the most likely to be large). All these examples show the same behavior, namely, that increasing ε improves the behavior of the certificates (which is in line with the stability analysis of Section 5.1, since (8) implies that for large ε, the certificate is trivially non-degenerate whenever Σ_α, Σ_β are diagonal), and that pairs of vertices (i, j) connected by a small distance d_geod(i, j) are the most likely to be degenerate. This suggests that they will be inaccurately estimated by iOT for small ε.
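The small-ε regime of this experiment can be reproduced directly from the graphical-lasso limit (10): build the shifted Laplacian cost for a graph and check whether the limiting certificate is non-degenerate. In the sketch below, only the construction A = δ Id + diag(G 1) − G and the 10% shift follow the text; the circular graph of reduced size and all names are illustrative assumptions.

```python
# Small-eps certificate of Section 5.2: shifted Laplacian cost on a circular graph,
# certificate built from the graphical-lasso Hessian surrogate A^{-1} (x) A^{-1}, cf. (10).
import jax.numpy as jnp

n = 40                                           # the paper uses n = 80; reduced to keep the dense Kronecker product small
G = jnp.zeros((n, n))
idx = jnp.arange(n)
G = G.at[idx, (idx + 1) % n].set(1.0)
G = G + G.T                                      # circular graph adjacency matrix
L = jnp.diag(G.sum(1)) - G                       # graph Laplacian diag(G 1) - G
delta = 0.1 * jnp.linalg.eigvalsh(L)[-1]         # shift: 10% of the largest Laplacian eigenvalue
A = delta * jnp.eye(n) + L                       # sparse, symmetric positive-definite cost

H = jnp.kron(jnp.linalg.inv(A), jnp.linalg.inv(A))   # A^{-1} (x) A^{-1}, the small-eps limit Hessian in (10)
a_vec = jnp.ravel(A)
I = jnp.flatnonzero(jnp.abs(a_vec) > 1e-9)           # support (diagonal entries and graph edges)
Ic = jnp.flatnonzero(jnp.abs(a_vec) <= 1e-9)
z = H[:, I] @ jnp.linalg.solve(H[I][:, I], jnp.sign(a_vec[I]))
print("max |z| outside the support:", float(jnp.abs(z[Ic]).max()))   # < 1 means non-degenerate
```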
Figure 1: Display of the certificate values z_{i,j} for three types of graphs (circular, planar, Erdős–Rényi), for varying ε. Left, middle: plotted as a function of the geodesic distance d_geod(i, j) on the x-axis. Right: histogram of z_{i,j} for (i, j) at distance d_geod(i, j) = 2.

Figure 2 displays the recovery performance of ℓ1-iOT for the circular graph shown on the left of Figure 1 (similar results are obtained for the other types of graph topologies). These numerical simulations are obtained using the large-scale iOT solver, which we detail in Appendix F. The performance is represented using the number of inaccurately estimated coordinates in the estimated cost matrix, so a score of 0 means a perfect estimation of the support (sparsistency is achieved). For ε = 0.1, sparsistency cannot be achieved, aligning with the fact that the certificate z is degenerate, as depicted in the previous figure. In sharp contrast, for larger ε, sparsistency can be attained as soon as the number of samples N is sufficiently large, which also aligns with our theory of sparsistency of ℓ1-iOT, since the certificate is guaranteed to be non-degenerate in the large ε setting.

Figure 2: Recovery performance (number of wrongly estimated positions) of ℓ1-iOT as a function of λ for three different values of ε (ε = 0.1, 1, 10).

Figure 3 in the appendix displays the certificate values in the case of a non-symmetric planar graph. The graph is obtained by deleting all edges of a planar graph with i < j. We plot the certificate values as a function of the geodesic distance d_geod(i, j) of the symmetrized graph. The middle plot shows the certificate values on i ≥ j (where the actual edges are constrained to be, and non-degeneracy requires values smaller than 1 in absolute value for d_geod(i, j) ≥ 2). The right plot shows the certificate values on i ≤ j, where there are no edges and, for non-degeneracy, one expects values smaller than 1 in absolute value for d_geod(i, j) ≥ 1. Observe that here the certificate is degenerate for small values of d_geod(i, j) when ε = 0, meaning that the problem is unstable at the ghost symmetric edges. As ε → ∞, the certificate becomes non-degenerate.

In this paper, we have proposed the first theoretical analysis of the recovery performance of ℓ1-iOT. Much of this analysis can be extended to more general convex regularizers, such as the nuclear norm to promote low-rank Euclidean costs, for instance. Our analysis and numerical exploration support the conclusion that iOT becomes ill-posed and fails to maintain sparsity for overly small ε. When approached from the perspective of graph estimation, unstable indices occur at smaller geodesic distances, highlighting the geometric regularity of iOT along the graph geometry.

ACKNOWLEDGEMENTS

The work of G. Peyré was supported by the European Research Council (ERC project NORIA) and the French government under management of Agence Nationale de la Recherche as part of the Investissements d'avenir program, reference ANR19-P3IA-0001 (PRAIRIE 3IA Institute).

REFERENCES

Ravindra K Ahuja and James B Orlin. Inverse optimization. Operations Research, 49(5):771–783, 2001.

Jason Altschuler, Jonathan Niles-Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. Advances in Neural Information Processing Systems, 30, 2017.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks.
In International conference on machine learning, pp. 214 223. PMLR, 2017. Aurélien Bellet, Amaury Habrard, and Marc Sebban. A survey on metric learning for feature vectors and structured data. ar Xiv preprint ar Xiv:1306.6709, 2013. Mathieu Blondel, André FT Martins, and Vlad Niculae. Learning with fenchel-young losses. The Journal of Machine Learning Research, 21(1):1314 1382, 2020. Raicho Bojilov and Alfred Galichon. Matching in closed-form: equilibrium, identification, and comparative statics. Economic Theory, 61:587 609, 2016. Jonathan Borwein and Adrian Lewis. Convex Analysis. Springer, 2006. Emmanuel J Candès, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on information theory, 52(2):489 509, 2006. Guillaume Carlier, Arnaud Dupuy, Alfred Galichon, and Yifei Sun. Sista: learning optimal transport costs under sparsity constraints. Communications on Pure and Applied Mathematics, 76(9): 1659 1677, 2023. Wei-Ting Chiu, Pei Wang, and Patrick Shafto. Discrete probabilistic inverse optimal transport. In International Conference on Machine Learning, pp. 3925 3946. PMLR, 2022. Benoît Colson, Patrice Marcotte, and Gilles Savard. Bilevel programming: A survey. 4or, 3:87 107, 2005. Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Adv. in Neural Information Processing Systems, pp. 2292 2300, 2013. Marco Cuturi and David Avis. Ground metric learning. The Journal of Machine Learning Research, 15(1):533 564, 2014. Marco Cuturi and Gabriel Peyré. A smoothed dual approach for variational wasserstein problems. SIAM Journal on Imaging Sciences, 9(1):320 343, 2016. Published as a conference paper at ICLR 2024 Jason V Davis and Inderjit S Dhillon. Structured metric learning for high dimensional problems. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 195 203, 2008. Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695 17709, 2021. Klaus Deimling. Nonlinear functional analysis. Courier Corporation, 2010. Arnaud Dupuy and Alfred Galichon. Personality traits and the marriage market. Journal of Political Economy, 122(6):1271 1319, 2014. Arnaud Dupuy, Alfred Galichon, and Yifei Sun. Estimating matching affinity matrices under low-rank constraints. Information and Inference: A Journal of the IMA, 8(4):677 689, 2019. Marvin Eisenberger, Aysim Toker, Laura Leal-Taixé, Florian Bernard, and Daniel Cremers. A unified framework for implicit sinkhorn differentiation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 509 518, 2022. Alfred Galichon. Optimal transport methods in economics. Princeton University Press, 2018. Alfred Galichon and Bernard Salanié. Matching with trade-offs: Revealed preferences over competing characteristics. Preprint hal-00473173, 2010. Alfred Galichon and Bernard Salanié. Cupid s invisible hand: Social surplus and identification in matching models. The Review of Economic Studies, 89(5):2600 2629, 2022. Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pp. 1608 1617. PMLR, 2018. Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. 
Sample complexity of Sinkhorn divergences. In The 22nd international conference on artificial intelligence and statistics, pp. 1574 1583. PMLR, 2019. Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC press, 2015. Matthieu Heitz, Nicolas Bonneel, David Coeurjolly, Marco Cuturi, and Gabriel Peyré. Ground metric learning on graphs. Journal of Mathematical Imaging and Vision, 63:89 107, 2021. Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Conjugacy in convex analysis. Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods, pp. 35 90, 1993. Gao Huang, Chuan Guo, Matt J Kusner, Yu Sun, Fei Sha, and Kilian Q Weinberger. Supervised word mover s distance. Advances in neural information processing systems, 29, 2016. Tanguy Kerdoncuff, Rémi Emonet, and Marc Sebban. Metric Learning in Optimal Transport for Domain Adaptation. In International Joint Conference on Artificial Intelligence, Kyoto, Japan, January 2021. URL https://hal-ujm.archives-ouvertes.fr/ujm-02611800. Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Q Weinberger. From word embeddings to document distances. In Proc. of the 32nd Intern. Conf. on Machine Learning, pp. 957 966, 2015. Jason D Lee, Yuekai Sun, and Jonathan E Taylor. On model selection consistency of regularized m-estimators. Electronic Journal of Statistics, 9:608 642, 2015. Christian Léonard. From the Schrödinger problem to the Monge Kantorovich problem. Journal of Functional Analysis, 262(4):1879 1920, 2012. Ruilin Li, Xiaojing Ye, Haomin Zhou, and Hongyuan Zha. Learning to match via inverse optimal transport. Journal of machine learning research, 20, 2019. Published as a conference paper at ICLR 2024 Shaojun Ma, Haodong Sun, Xiaojing Ye, Hongyuan Zha, and Haomin Zhou. Learning cost functions for optimal transport. ar Xiv preprint ar Xiv:2002.09650, 2020. Gonzalo Mena and Jonathan Niles-Weed. Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. Advances in neural information processing systems, 32, 2019. Marcel Nutz. Introduction to entropic optimal transport. Lecture notes, Columbia University, 2021. François-Pierre Paty and Marco Cuturi. Regularized optimal transport is ground cost adversarial. In International Conference on Machine Learning, pp. 7532 7542. PMLR, 2020. Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. Clarice Poon and Gabriel Peyré. Smooth bilevel programming for sparse regularization. Advances in Neural Information Processing Systems, 34:1543 1555, 2021. Philippe Rigollet and Austin J Stromme. On the sample complexity of entropic optimal transport. ar Xiv preprint ar Xiv:2206.13472, 2022. Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. The earth mover s distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99 121, November 2000. Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-63):94, 2015. Geoffrey Schiebinger, Jian Shu, Marcin Tabaka, Brian Cleary, Vidya Subramanian, Aryeh Solomon, Joshua Gould, Siyan Liu, Stacie Lin, Peter Berube, et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928 943, 2019. Liangliang Shi, Gu Zhang, Haoyu Zhen, Jintao Fan, and Junchi Yan. 
Understanding and generalizing contrastive learning from the inverse optimal transport perspective. 2023. Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Statist., 35:876 879, 1964. Andrew M Stuart and Marie-Therese Wolfram. Inverse optimal transport. SIAM Journal on Applied Mathematics, 80(1):599 619, 2020. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267 288, 1996. Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1 230, 2015. Samuel Vaiter, Mohammad Golbabaee, Jalal Fadili, and Gabriel Peyré. Model selection with low complexity priors. Information and Inference: A Journal of the IMA, 4(3):230 287, 2015. Fan Wang and Leonidas J Guibas. Supervised earth mover s distance learning and its computer vision applications. In European Conference on Computer Vision, pp. 442 455. Springer, 2012. Kilian Q Weinberger, John Blitzer, and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in neural information processing systems, pp. 1473 1480, 2006. Eric Xing, Michael Jordan, Stuart J Russell, and Andrew Ng. Distance metric learning with application to clustering with side-information. Advances in neural information processing systems, 15, 2002. Jie Xu, Lei Luo, Cheng Deng, and Heng Huang. Multi-level metric learning via smoothed wasserstein distance. In IJCAI, pp. 2919 2925, 2018. Published as a conference paper at ICLR 2024 Gloria Zen, Elisa Ricci, and Nicu Sebe. Simultaneous ground metric learning and matrix factorization with earth mover s distance. In 2014 22nd International Conference on Pattern Recognition, pp. 3690 3695. IEEE, 2014. Jianzhong Zhang and Zhenhong Liu. Calculating some inverse linear programming problems. Journal of Computational and Applied Mathematics, 72(2):261 273, 1996. Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541 2563, 2006. Figure 3: Display of certificate values for a non-symmetric planar graph, for varying ε with edges only for i > j. Middle/Right: plots of the certificate values as a function of the geodesic distance dgeod(i, j) of the symmetrized graph. The middle plot show the values when restricted to i Ê j. The right plot shows the values restricted to i É j (where there are no edges). A INTERPRETATIONS OF THE LOSS FUNCTION As mentioned in Section 2, the loss L (A, ˆπ) can be recovered via maximum likelihood (ML) estimation (Dupuy et al., 2019, Proposition 1) and be regarded as an instance of a family of losses called Fenchel-Young losses. For the sake of completeness, this section explains this in more detail. A.1 MAXIMUM LIKELIHOOD INTERPRETATION The map A Sink(c A,ε), where Sink( , ) is defined in Section 2, can be seen as parameterizing a set of measures which are absolutely continuous with respect to α β. By the standard duality result (see Nutz (2021)): 1) The density of Sink(c A,ε) is given by d Sink(c A,ε)/d(α β) = exp(c A + f A + g A), where f A and g A solve the dual problem, i.e., supf ,g R f dα+ R g dβ R exp(c A + f +g)d(α β)+1; 2) The values of the primal and dual problems agree and are equal to R f A dα R g A dβ. Combining these two facts we obtain that log ³d Sink(c A,ε) = c A f A g A = c A + sup π U (α,β) c A,π ε 2KL(π|α β) = c A +W ˆπ(A), where W ˆπ is defined in as in Section 2. 
Finally, taking expectation with respect to ˆπ yields E(x,y) ˆπ h log ³d Sink(c A,ε) i = E(x,y) ˆπ c A +E(x,y) ˆπ W ˆπ(A) = c A, ˆπ +W ˆπ(A) = L (A, ˆπ), thus establishing the connection with ML estimation. Bilevel interpretation Note also that argmin A L (A, ˆπ) = argmin A c A, ˆπ πA ε 2KL(πA| ˆα ˆβ) = argmin A c A, ˆπ πA ε 2KL(πA| ˆα ˆβ)+ ε 2KL( ˆπ| ˆα ˆβ) Published as a conference paper at ICLR 2024 where πA = argminπ U ( ˆα, ˆβ) c A, π KL(π| ˆα ˆβ). Since ˆπ,πA have the same marginals, c A, ˆπ πA = c A + f A + g A, ˆπ πA = ε 2 log(πA/( ˆα ˆβ)), ˆπ ˆπA . We therefore have argmin A L (A, ˆπ) = argmin A KL( ˆπ|πA), where πA = argmin π U ( ˆα, ˆβ) c A, π KL(π| ˆα ˆβ). A.2 FENCHEL-YOUNG LOSS INTERPRETATION It is argued in Blondel et al. (2020) that, associated to a prediction rule of the form ˆy(θ) = argmax µ dom(Ω) θ,µ Ω(µ), there is a natural loss function that the authors term Fenchel-Young loss and that is given by LΩ(θ; y) Ω (θ)+Ω(y) θ, y . With this in mind, let Ω(π) = KL(π|α β)+ιU (α,β)(π) where ιU (α,β)( ) is the indicator function of U (α,β), and note that L (A, ˆπ) = LΩ(c A; ˆπ) Ω( ˆπ). Since, for the purposes of minimization with respect to A, the term Ω( ˆπ) is irrelevant, we see that finding a minimizer of (i OT ℓ1( ˆπ)) amounts to finding a minimizer of the l1 regularized Fenchel-Young loss associated with KL( |α β)+ιU (α,β)( ). B THE FINITE SAMPLE PROBLEM As mention in Section 2.2, we do not observe the full coupling ˆπ = Sink(c b A,ε), but only the the empirical measure ˆπn = 1 n Pn i=1 δxi ,yi with (xi, yi) iid ˆπ. We show here that the problem i OT ℓ1( ˆπn) can be formulated entirely in finite dimensions as follows. Let i,j Pi,j (log(Pi,j /Qi,j ) 1). Note that ˆπn has marginals ˆan = 1 n Pn i=1 δxi and ˆbn = 1 n Pn i=1 δyi . We can interpret ˆπn as the matrix ˆPn = 1 n Idn n and the noisy primal problem can be equivalently written as min A Rs sup P Rn n + λ A 1 + Φn A, P ˆPn ε 2 H(P) s.t. P1 = 1 n 1 and P 1 = 1 where we write H(P) H(P| 1 n2 1 1) and Φn A = P k Ak Ck, where Ck = (Ck(xi, y j ))i,j [n] Rn n. Note that the finite-dimensional problem has the same invariances as (i OT ℓ1( ˆπ)), so, we will take Ck to be centred so that for all i, P i(Ck)i,j = 0 and for all j, P j (Ck)i,j = 0. The finite sample Kantorovich formulation is inf A,F,G Sn Kn(A,F,G) where Kn(A,F,G) Jn(A,F,G)+λ A 1 and (Kn) Jn(A,F,G) X Fi G j +(Φn A)i,j ˆPn i,j + ε 2n2 X ε(Fi +G j +(Φn A)i,j and we restrict the optimization of (Kn) over Sn (A,F,G) Rs Rn Rn ; P j G j = 0 ª . C PROOFS FOR SECTION 3 C.1 PROOF OF PROPOSITION 2 (DIFFERENTIABILITY OF W ) For the strict convexity of W (A) see Lemma 3 in Dupuy & Galichon (2014). The gradient formula can also be found in Dupuy & Galichon (2014) and trivially follows from the envelope theorem Published as a conference paper at ICLR 2024 because the optimization problem W (A) has a unique solution. We will only give a proof of the Hessian formula. The formula for the Hessian follows from the formula for the gradient provided we show that the density πA is continuously differentiable with respect to A. The fact that we can swap the order of the operator Φ and partial differentiation follows from the conditions for differentiability under the integral sign which holds since the measures are compactly supported. Without loss of generality, let ε = 1. Then, the optimizer in W (A) is of the form Santambrogio (2015): πA(x, y) = exp ³ u A(x)+ v A(y)+c A(x, y) , where u A( ) and v A( ) satisfy u A(x) = log Z Y exp ³ v A(y)+c A(x, y) dβ(y), α-a.s. 
v A(y) = log Z X exp ³ u A(x)+c A(x, y) dα(x), β-a.s. It is known (see e.g. Nutz (2021)) that the functions u A( ) and v A( ) inherit the modulus of continuity of the cost (in this case the map (x, y) c A(x, y))) and, hence, since we are assuming the measures to be compactly supported, it follows that u A L2(α) and v A L2(β). Moreover, if (u A,v A) solve these two equations then, for any constant c, the pair (u A + c,v A c) is also a solution and, hence, to eliminate this ambiguity we consider solutions in H = L2(α) L2 0(β), where L2 0(β) = g L2(β) ; R g dβ(y) = 0 ª . To show that W is twice differentiable, since W (A) = Φ πA, it is sufficient to show that A 7 u A and A 7 v A are differentiable. To this end, we will apply the Implicit Function Theorem (in Banach spaces) to the map F : H Rs L2(α) L2(β) Y exp ³ v(y)+c A(x, y) dβ(y) X exp ³ u(x)+c A(x, y) dα(x) since we have F(u A,v A, A) = 0. The partial derivative of F at (u A,v A, A) , denoted by u,v F(u A,v A, A), is the linear map defined by ³ u,v F(u A,v A, A) (f ,g) = ³ f + Z Y p A( , y)g(y)dβ(y),g + Z X q A(x, )f (x)dα(x) , (12) p A(x, y) = exp ³ v A(y)+c A(x, y) Y exp ³ v A(y)+c A(x, y) dβ(y) and q A(x, y) = exp ³ u A(x)+c A(x, y) X exp ³ u A(y)+c A(x, y) dβ(x) . Note that, since F(u A,v A, A) = 0, p A(x, y) = q A(x, y) = πA(x, y). Moreover, since πA has marginals α and β, it follows that D (f ,g), ³ u,v F(u A,v A, A) (f ,g) E X f (x)2 dα(x)+ Z X Y 2f (x)g(y)πA(x, y)d(α β)(x, y)+ Z Y g(y)2 dβ(y) ³ f (x)+ g(y) 2 πA(x, y)d(α β)(x, y) This shows that u,v F(u A,v A, A) is invertible the last line of 13 is zero if and only if f g 0 and since g L2 0(β) it follows that g = 0 and f = 0. To conclude the proof we need to show that u,v F(u A,v A, A) 1 is a bounded operator (see e.g. Deimling (2010) for the statement of IFT in Banach spaces) and to show this it is enough to show that, for some constant C, ³ u,v F(u A,v A, A) (f ,g) Ê C (f ,g) . Published as a conference paper at ICLR 2024 This follows from 13 and the fact that there exists a constant C such that πA(x, y) Ê C for all x and y (see e.g. Nutz (2021) for the existence of C). In fact, from g L2 0(β), we obtain R (f (x) + g(y)))2 d(α β)(x, y) = (f ,g) 2 and, hence, 13 implies that D (f ,g), ³ u,v F(u A,v A, A) (f ,g) E Ê C (f ,g) 2 and from Cauchy-Schwarz applied to the left-hand-side we obtain ³ u,v F(u A,v A, A) (f ,g) Ê C (f ,g) , thus concluding the proof. C.2 CONNECTION OF THE CERTIFICATE (C) WITH OPTIMIZATION PROBLEM (i OT ℓ1( ˆπ)) As mentioned in Section 3.1, the vector zλ provides insight into support recovery since Supp(Aλ) i ; zλ i = 1 ª . In this section, we provide the proof to Proposition 4, which shows that as λ converges to 0, zλ converges to the solution of a quadratic optimization problem (MNC) that we term the minimal norm certificate. Note that the connection with (C) can now be established by noting that if the minimal norm certificate is non-degenerate then the inequality constraints (which correspond to the complement of the support of b A) in (MNC) can be dropped since they are inactive; in this case (MNC) reduces to a quadratic optimization problem with only equality constraints and whose solution can be seen to be (C) which will thus be non-degenerate as well. The converse is clear. So, under nondegeneracy, (C) can be seen as the limit optimality vector and determines the support of Aλ when λ is small. To prove Proposition 4, we first show that the vector zλ coincides with the solution to a dual problem of (i OT ℓ1( ˆπ)). Proposition 10. 
Let W be the convex conjugate of W . Problem (i OT ℓ1( ˆπ)) admits a dual given by argmin z W (ˆΣxy λz) subject to z É 1, (14) where ˆΣxy = Φ ˆπ . Moreover, a pair of primal-dual solutions (Aλ,zλ) is related by λ AL (Aλ, ˆπ) and zλ Aλ 1. (15) Proof. Observe that we can write L (A, ˆπ) as L (A, ˆπ) = W (A) Z X Y (ΦA)(x, y)d ˆπ = W (A) Φ ˆπ, A F = W (A) A, ˆΣxy . The Fenchel Duality Theorem (see e.g. Borwein & Lewis (2006)) yields a dual of (i OT ℓ1( ˆπ)) given by argmin w W (ˆΣxy w)+(λ 1) (w). To conclude the proof just note that the Fenchel conjugate of λ 1 is the indicator of the set {v : v É λ} and make a change of variable z 1/λw to obtain 14. The relationship between any primal-dual pair in Fenchel Duality can also be found in Borwein & Lewis (2006). Proof of Proposition 4. We begin by noting that W is of class C 2 in a neighborhood of Σxy and that 2W (ˆΣxy) = ³ 2W ( b A) 1 . (16) To see this, note that Proposition 2 together with the assumption on ˆπ implies that ˆΣxy = W ( b A). Published as a conference paper at ICLR 2024 Moreover, since W (A) is twice continuously differentiable and strictly convex (see Proposition 2), it follows (see e.g. Corollary 4.2.9 in Hiriart-Urruty & Lemaréchal (1993)) that W ( ) is C 2 and strictly convex in a neighborhood of ˆΣxy and that (16) holds. Now observe that, since W (ˆΣxy) = W W ( b A) = b A, we can rewrite b A 1 as b A 1 = argmin z z, W (ˆΣxy) subject to z É 1 (17) Observe that since zambda are uniformly bounded vectors due to the constraint set in 14, there is a convergent subsequence converging to some z . We later deduce that all limit points are the same and hence, the full sequence zλ converges to z . Let λn be such that limλn 0 zλn = z , and let z0 be any element in b A 1. We have that z0, W (ˆΣxy) É zλn, W (ˆΣxy) É 1 ³ W ˆΣxy λnzλn W (ˆΣxy) ³ W ˆΣxy λnz0 W (ˆΣxy) , where the first inequality is the optimality of z0, the second inequality is the gradient inequality of convex functions and the last inequality follows from the optimality of zλ. Taking the limit as λn 0 we obtain that z0, W (ˆΣxy) = z , W (ˆΣxy) , showing that z b A 1. We now finish the proof by showing that D z , 2W (ˆΣxy)z E É D z0, 2W (ˆΣxy)z0E . (18) Since W ( ) is C 2 in a neighborhood of ˆΣxy, Taylor s theorem ensures that there exists a remainder function R(x) with limx ˆΣxy R(x) = 0 such that zλn, W (ˆΣxy) + λn zλn, 2W (ˆΣxy)zλn +R(ˆΣxy +λnzλn)λ2 n ³ W ˆΣxy λnzλn W ˆΣxy É 1 ³ W ˆΣxy λnz0 W ˆΣxy = z0, W (ˆΣxy) + λn z0, 2W (ˆΣxy)z0 +R(ˆΣxy +λnz0)λ2 n É zλn, W (ˆΣxy) + λn z0, 2W (ˆΣxy)z0 +R(ˆΣxy +λnz0)λ2 n, where we used the optimality of z0 and of zλn. We conclude that zλn, 2W (ˆΣxy)zλn +R(ˆΣxy +λnzλn)λn É 1 z0, 2W (ˆΣxy)z0) +R(ˆΣxy +λnz0)λn. Taking the limit establishes 18. Since z0 was an arbitrary element in b A 1, we obtain that the limit of zλn is z = argmin z D z, ³ 2W ( b A) 1 z E subject to z b A 1 where we used 16. Finally, observe that z was an arbitrary limit point of zλ and we showed that all limit points are the same; this is enough to conclude the result. D PROOF OF PROPOSITION 6 The proof of this statement relies on strong convexity of Jn. Similar results have been proven in the context of entropic optimal transport (e.g. Genevay et al. (2019); Mena & Niles-Weed (2019)). Our proof is similar to the approach taken in Rigollet & Stromme (2022). Published as a conference paper at ICLR 2024 Let (A , f ,g ) minimize (K ). Note that p α β with p (x, y) = exp µ2 ΦA (x, y)+ f (x)+ g (y) minimizes (i OT ℓ1( ˆπ)). 
Let n2 (p (xi, y j ))i,j , F = (f (xi))i and G = (g (y j ))j . Note that by optimality of A , A 1 É b A 1 this can be seen by comparing the objective (i OT ℓ1( ˆπ)) at A and b A. Moreover, due to the uniform bounds on f ,g from Lemma 12, p is uniformly bounded away from 0 by exp( C/λ) for some constant C that depends on ˆπ. Let Pn minimise (Pn), we know it is of the form ε (Φn An +Fn Gn) . for vectors An,Fn,Gn. The certificates are Φ ϕ and zn Φ nϕn where p α β ˆπ and ϕn = 1 Note that zn = Φ ϕ = 1 λ AL (A , ˆπ). The goal is to bound Φ ϕ Φ nϕn so that nondegeneracy of z would imply nondegeneracy of Φ nϕn. Note that Φ n Pn = Φ ˆπn. So, by the triangle inequality, Φ ϕ Φ nϕn É 1 Φ p α β Φ n Pn + 1 Φ ( ˆπn ˆπ) Φ n P Φ n Pn + 1 Φ p α β Φ n P + 1 Φ ( ˆπn ˆπ) The last two terms on the RHS can be controlled using Proposition 20, and are bounded by O(tn 1/2) with probability at least 1 O(exp( t2)) for t > 0. For the first term on the RHS, letting Z = P Pn, i,j=1 C(xi, y j )Zi,j ε(Φn An +Fn Gn) exp µ2 ε(Φn A +F G ) 2 Let L 2( Φn An +Fn Gn) Φn A +F G ) ). According to the Lipschitz continuity of the exponential ε(Φn An +Fn Gn) exp µ2 ε(Φn A +F G ) 2 É 4exp(L/ε) i,j ((Φn An +Fn Gn) (Φn A +F G ))2 i,j (20) É 12exp(L/ε) i,j (Φn An Φn A )2 i,j + 1 i (Fn F )2 i + 1 j (Gn G )2 j By strong convexity properties of Jn and hence Kn, it can be shown (see Prop 14) that (21) is upper bounded up to a constant by ε 1 exp(L/ε)(Kn(F ,G , A ) Kn(Fn,Gn, An)) µ n 2(Φ nΦn) 1 ( AJn(A ,F ,G )+λξ ) 2 Published as a conference paper at ICLR 2024 +n F Jn(A ,F ,G ) 2 +n GJn(A ,F ,G ) 2 where ξ = 1 λ(Φ ˆπ Φ (p α β)) A 1. By Lemma 11 and Lemma 12, L = O( b A 1) with probability at least 1 O(exp( t2)) if λ tn 1 2 . Finally, ( AJn(A ,F ,G )+λξ ) 2 = Φ n ˆP +Φ n P +λξ 2 É 2 ³ Φ n ˆP Φ ˆπ 2 + Φ n P Φ (p α β) 2 n F Jn(A ,F ,G ) 2 2 = 1 j=1 p (xi, y j ) n GJn(A ,F ,G ) 2 2 i=1 p (xi, y j ) We show in Propositions 16, 19 and 20 that these are bounded by O(t2n 1) with probability at least 1 O(exp( t2)) and from Proposition 23, assuming n log(2s), (n 2Φ nΦn) 1 1 with probability at least 1 O(exp( n)). So, for some constant C > 0, λ Φ ϕ Φ nϕn exp(C b A 1 /ε) t pn with probability at least 1 exp( t2). The second statement follows by combining our bound for (21) with the fact that (n 2Φ nΦn) 1 1. In the following subsections, we complete the proof by establishing the required strong convexity properties of Jn and bound Jn at (A ,F ,G ) using concentration inequalities. The proofs to some of these results are verbatim to the results of Rigollet & Stromme (2022) for deriving sampling complexity bounds in e OT, although they are included due to the difference in our setup. D.1 STRONG CONVEXITY PROPERTIES OF Jn In this section, we present some of the key properties of Jn. Recall that since we assume that α,β are compactly supported, up to a rescaling of the space, we assume without loss of generality that for all k, Ck(x, y) É 1. Lemma 11. Let An minimize (i OT ℓ1( ˆπn)). Assume that for some constant C1 > 0, t exp(C1/ε)n 1 2 λmin 1, b A 1 , then with probability at least 1 O(exp( t2)), An 1 É 2 b A Proof. Let A be a minimizer to (i OT ℓ1( ˆπn)). By optimality of A, λ A + L (A, ˆπn) É λ b A 1 + L ( b A, ˆπn). 
Writing P̂_n = (1/n) Id and Φ_n A = ( Σ_{k=1}^s A_k C_k(x_i, y_j) )_{i,j=1}^n, this implies
λ ∥A∥_1 − ⟨P̂_n, Φ_n A⟩ + ⟨P, Φ_n A⟩ − (ε/2) H(P) ≤ λ ∥Â∥_1 + L(Â, π̂_n)  (22)
for all P satisfying the marginal constraints P 1 = (1/n) 1 and P^⊤ 1 = (1/n) 1. Note that since L(Â, π̂) + KL(π̂ | α⊗β) = 0,
L(Â, π̂_n) = L(Â, π̂_n) − L(Â, π̂) − KL(π̂ | α⊗β)  (23)
 = ⟨ΦÂ, π̂ − π̂_n⟩ + W_{π̂_n}(Â) − W_{π̂}(Â) − KL(π̂ | α⊗β).  (24)
The first two terms can be shown to be O(n^{-1/2}). Indeed, ⟨ΦÂ, π̂ − π̂_n⟩ = (1/n) Σ_{i=1}^n Z_i, where Z_i = E[⟨Â, C(x_i, y_i)⟩] − ⟨Â, C(x_i, y_i)⟩, is a sum of n mean-zero terms with |Z_i| ≤ 2 ∥Â∥_1. By Lemma 18,
|⟨ΦÂ, π̂ − π̂_n⟩| = O( ∥Â∥_1 t / √n )
with probability at least 1 − 2 exp(−t²). The bound W_{π̂_n}(Â) − W_{π̂}(Â) = O( exp(C/ε) t n^{-1/2} ), which holds with probability 1 − O(exp(−t²)), is due to the sample complexity of eOT, Rigollet & Stromme (2022). Plugging this back into (22), we obtain
λ ∥A∥_1 ≤ λ ∥Â∥_1 + O( exp(C/ε) t n^{-1/2} ) + T₁ + T₂,  (25)
where T₁ = (ε/2) H(P) − KL(π̂ | α⊗β) and T₂ = ⟨P̂_n − P, Φ_n A⟩. It remains to bound the terms T₁ and T₂.

Intuitively, if π̂ = p̂ α⊗β has density p̂, then we can show that T₁, T₂ = O(n^{-1/2}) by choosing P = Q_n := (1/n²)( p̂(x_i, y_j) )_{i,j=1}^n. However, Q_n only approximately satisfies the marginal constraints Q_n 1 ≈ (1/n) 1 and Q_n^⊤ 1 ≈ (1/n) 1 (this approximation can be made precise using Proposition 16). So, we instead insert into the above inequality P = Q̃_n, where Q̃_n is the projection of Q_n onto the constraint set U( (1/n) 1, (1/n) 1 ). By (Altschuler et al., 2017, Lemma 7), the projection Q̃_n satisfies
∥Q̃_n − Q_n∥_1 ≤ 2 ∥Q_n 1 − (1/n) 1∥_1 + 2 ∥Q_n^⊤ 1 − (1/n) 1∥_1.  (26)
By Proposition 16, with probability at least 1 − O(exp(−t²)),
∥Q_n^⊤ 1 − (1/n) 1∥_1 = (1/n) Σ_{j=1}^n | 1 − (1/n) Σ_{i=1}^n p̂(x_i, y_j) | = O( t / √n ).  (27)
Similarly, ∥Q_n 1 − (1/n) 1∥_1 = O(n^{-1/2}) with high probability. For T₂,
⟨Q̃_n − Q_n, Φ_n A⟩ = Σ_{i,j} Σ_k A_k (C_k)_{i,j} ( Q̃_n − Q_n )_{i,j} ≤ max_k ∥C_k∥_∞ ∥A∥_1 ∥Q̃_n − Q_n∥_1,  (28)
⟨P̂_n − Q_n, Φ_n A⟩ ≤ ∥A∥_1 ∥Φ_n^⊤( P̂_n − Q_n )∥_∞.  (29)
By Propositions 19 and 20, with probability at least 1 − O(exp(−t²)), ∥Φ_n^⊤ P̂_n − Φ*π̂∥ = O(t n^{-1/2}) and ∥Φ_n^⊤ Q_n − Φ*π̂∥ = O(t n^{-1/2}). So we have T₂ ≤ C ∥A∥_1 t n^{-1/2} with high probability.

We now consider the term T₁ in (25). Since p̂ is uniformly bounded from above by exp(C/ε) and from below by exp(−C/ε) for some C > 0, one can check from the projection procedure of Altschuler et al. (2017) that Q̃_n is also uniformly bounded from above by (1/n²) exp(C/ε) and away from zero by (1/n²) exp(−C/ε) for some C > 0. So,
|T₁| ≤ | H(Q̃_n) − H(Q_n) | + | H(Q_n) − KL(π̂ | α⊗β) |  (30)
  ≤ e^{C/ε} ∥Q̃_n − Q_n∥_1 + | H(Q_n) − KL(π̂ | α⊗β) |.  (31)
We can use the bound (26) to see that ∥Q̃_n − Q_n∥_1 = O(t/√n) with probability at least 1 − exp(−t²). To bound T₃ := H(Q_n) − KL(π̂ | α⊗β), note that H(Q_n) involves the average (1/n²) Σ_{i,j} log( p̂(x_i, y_j) ) p̂(x_i, y_j), whose expectation differs from
∫ log( p̂(x, y) ) p̂(x, y) dα(x) dβ(y) = ∫ log( p̂ ) dπ̂ = KL(π̂ | α⊗β)
only through the n diagonal terms (i = j), for which (x_i, y_i) ∼ π̂ rather than α⊗β; their contribution is controlled by (1/n) ∫ log( p̂(x, y) ) p̂(x, y)² dα(x) dβ(y) and is O(∥p̂∥_∞ / n). Therefore, we apply the bounded differences inequality (Lemma 15) to
f(z_1, …, z_n) = (1/n²) Σ_{i,j} log( p̂(x_i, y_j) ) p̂(x_i, y_j) − ∫ log( p̂(x, y) ) p̂(x, y) dα(x) dβ(y),
where z_i = (x_i, y_i). Letting z′_j = z_j for all j ≠ i, the bounded differences property is satisfied with
| f(z_1, …, z_n) − f(z′_1, …, z′_n) | ≤ 4 ( ∥p̂∥_∞ + 1 )² / n.
So, with probability at least 1 − 2 exp(−t²),
|T₃| ≤ O(1/n) + 2 t ( ∥p̂∥_∞ + 1 ) / √n.
In summary, with probability 1 − O(exp(−t²)),
( λ − C₂ t exp(C₁/ε) n^{-1/2} ) ∥A∥_1 ≤ λ ∥Â∥_1 + C₂ exp(C₁/ε) t / √n
for some constants C₁, C₂ > 0. So, choosing C₂ t exp(C₁/ε) n^{-1/2} ≤ min( λ/4, λ ∥Â∥_1 / 2 ), we have ∥A∥_1 ≤ 2 ∥Â∥_1 with probability at least 1 − exp(−t²).
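The projection step used above, which maps Q_n to a matrix Q̃_n with exact uniform marginals while controlling ∥Q̃_n − Q_n∥_1, can be realized by the rounding procedure of Altschuler et al. (2017). The Python sketch below implements our reading of that procedure (row scaling, column scaling, then a rank-one correction); the function name round_to_marginals and the toy data are ours, and the printed check corresponds to the ℓ1 bound (26).

```python
import numpy as np

def round_to_marginals(Q, r, c):
    """Round a nonnegative matrix Q so that its row sums equal r and its column
    sums equal c: scale rows down, scale columns down, then add a rank-one
    correction carrying the remaining mass (cf. Altschuler et al., 2017)."""
    Q = Q.copy()
    row = Q.sum(axis=1)
    Q *= np.minimum(1.0, r / np.maximum(row, 1e-300))[:, None]   # row scaling
    col = Q.sum(axis=0)
    Q *= np.minimum(1.0, c / np.maximum(col, 1e-300))[None, :]   # column scaling
    err_r = r - Q.sum(axis=1)
    err_c = c - Q.sum(axis=0)
    mass = err_r.sum()
    if mass > 1e-15:                                              # rank-one correction
        Q += np.outer(err_r, err_c) / mass
    return Q

# Toy check of the l1 bound (26) on a perturbed product-like matrix.
rng = np.random.default_rng(1)
n = 200
Qn = np.maximum(np.full((n, n), 1.0 / n**2) * (1.0 + 0.3 * rng.standard_normal((n, n))), 0)
r = c = np.full(n, 1.0 / n)
Qt = round_to_marginals(Qn, r, c)
lhs = np.abs(Qt - Qn).sum()
rhs = 2 * np.abs(Qn.sum(1) - r).sum() + 2 * np.abs(Qn.sum(0) - c).sum()
print(f"marginal error after rounding: {np.abs(Qt.sum(1) - r).sum():.2e}")
print(f"l1 distance {lhs:.4f} <= bound {rhs:.4f}: {lhs <= rhs + 1e-12}")
```

The rank-one term restores the exact marginals because the leftover row and column deficits have equal total mass after the two scaling passes.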
Lemma 12. Let (A, F, G) minimize (K_n) and let C_A = ∥Φ_n A∥_∞. Then F_i ∈ [−3 C_A, C_A] and G_j ∈ [−2 C_A, 2 C_A]. Moreover,
exp( 4 C_A ε^{-1} ) ≥ exp( (2/ε)( F_i + G_j + (Φ_n A)_{i,j} ) ) ≥ exp( −6 C_A ε^{-1} ).
Note that C_A ≤ ∥A∥_1. Similarly, if (A, f, g) minimizes (K) and c_A = ∥ΦA∥_∞, then for all x, y, f(x) ∈ [−3 c_A, c_A] and g(y) ∈ [−2 c_A, 2 c_A], and
exp( 4 c_A ε^{-1} ) ≥ exp( (2/ε)( f(x) + g(y) + (ΦA)(x, y) ) ) ≥ exp( −6 c_A ε^{-1} ).

Proof. This proof is nearly identical to (Rigollet & Stromme, 2022, Prop. 10). Let (A, F, G) minimize (K_n). By the marginal constraints for P_n = ( (1/n²) exp( (2/ε)( F_i + G_j + (Φ_n A)_{i,j} ) ) )_{i,j} given in (P_n),
1/n = (1/n²) Σ_j exp( (2/ε)( F_i + G_j + (Φ_n A)_{i,j} ) ) ≥ (1/n²) exp( (2/ε)( F_i − C_A ) ) Σ_j exp( 2 G_j / ε ) ≥ (1/n) exp( (2/ε)( F_i − C_A ) ),  (32)
where the last step uses Jensen's inequality and the assumption that Σ_j G_j = 0. So F_i ≤ C_A for all i ∈ [n]. Using the other marginal constraint for P_n along with this bound on F_i implies that
1/n = (1/n²) Σ_i exp( (2/ε)( F_i + G_j + (Φ_n A)_{i,j} ) ) ≤ (1/n) exp( (2/ε)( 2 C_A + G_j ) ),
so G_j ≥ −2 C_A. To prove the reverse bounds, we now show that Σ_i F_i is bounded below: since H(P) ≥ −1, duality between (P_n) and (K_n), together with the assumptions Σ_j G_j = 0 and Σ_{i,j} exp( (2/ε)( F_i + G_j + (Φ_n A)_{i,j} ) ) = n², yields (1/n) Σ_i F_i ≥ −C_A. By repeating the argument in (32), we see that
1 ≥ exp( (2/ε)( −C_A + G_j ) ) exp( (2/(nε)) Σ_i F_i ) ≥ exp( (2/ε)( −2 C_A + G_j ) ).
So G_j ≤ 2 C_A and, in turn, F_i ≥ −3 C_A. The proof for (K) is nearly identical and hence omitted.

Similarly to (Rigollet & Stromme, 2022, Lemma 11), we derive the following strong convexity bound for J_n.

Lemma 13. The functional J_n is strongly convex, in the sense that
J_n(A′, F′, G′) ≥ J_n(A, F, G) + ⟨∇J_n(A, F, G), (A′, F′, G′) − (A, F, G)⟩ + exp(−L/ε) ( (1/n²) ∥Φ_n(A − A′)∥²_2 + (1/n) ∥F − F′∥²_2 + (1/n) ∥G − G′∥²_2 )
for some L = O( ∥A∥_1 ∨ ∥A′∥_1 ).

Proof. To establish the strong convexity inequality, let h(t) = J_n( (1−t)A + tA′, (1−t)F + tF′, (1−t)G + tG′ ). It suffices to find δ > 0 such that for all t ∈ [0, 1],
h″(t) ≥ δ ( (1/n²) ∥Φ_n(A − A′)∥²_2 + (1/n) ∥F − F′∥²_2 + (1/n) ∥G − G′∥²_2 ).  (34)
Let Z_t = ( (1−t)F + tF′ ) ⊕ ( (1−t)G + tG′ ) + ( (1−t)Φ_n A + tΦ_n A′ ). Note that
h″(t) = (2/(εn²)) Σ_{i,j} exp( (2/ε)(Z_t)_{i,j} ) ( F′_i − F_i + G′_j − G_j + (Φ_n A′)_{i,j} − (Φ_n A)_{i,j} )².  (35)
Since ∥Φ_n A∥_∞ ∨ ∥Φ_n A′∥_∞ ≤ L, Lemma 12 gives a uniform bound on ∥Z_t∥_∞ in terms of L, so
h″(t) ≥ (2/(εn²)) exp(−L/ε) Σ_{i,j} ( F′_i − F_i + G′_j − G_j + (Φ_n A′)_{i,j} − (Φ_n A)_{i,j} )².
By expanding the square and using Σ_i G_i = 0 and the fact that the C_k are centred (Σ_i (C_k)_{i,j} = 0 and Σ_j (C_k)_{i,j} = 0), the cross terms vanish and (34) holds with δ = (2/ε) exp(−L/ε).

Based on this strong convexity result, we have the following bound.

Proposition 14. Let (A_n, F_n, G_n) minimize (K_n). Then, for all (A, F, G) ∈ S such that (1/n²) exp( (2/ε)( F ⊕ G + Φ_n A ) ) satisfies the marginal constraints of (P_n), and for any ξ ∈ ∂∥A∥_1,
K_n(A, F, G) − K_n(A_n, F_n, G_n) ≥ exp(−L/ε) ( (1/n²) ∥Φ_n(A − A_n)∥²_2 + (1/n) ∥F − F_n∥²_2 + (1/n) ∥G − G_n∥²_2 ),
K_n(A, F, G) − K_n(A_n, F_n, G_n) ≤ ε exp(L/ε) ( ∥(n^{-2} Φ_n^⊤ Φ_n)^{-1}∥ ∥∇_A J_n(A, F, G) + λξ∥²_2 + n ∥∇_F J_n(A, F, G)∥²_2 + n ∥∇_G J_n(A, F, G)∥²_2 ),
where L = O( ∥A∥_1 ∨ ∥A_n∥_1 ).

Proof. By the strong convexity of J_n, we can show that for any (A, F, G), (A′, F′, G′) ∈ S and any ξ ∈ ∂∥A∥_1,
K_n(A′, F′, G′) ≥ K_n(A, F, G) + ⟨∇J_n(A, F, G), (A′, F′, G′) − (A, F, G)⟩ + λ ⟨ξ, A′ − A⟩ + δ ( (1/n²) ∥Φ_n(A − A′)∥²_2 + (1/n) ∥F − F′∥²_2 + (1/n) ∥G − G′∥²_2 ),
where δ = (2/ε) exp(−L/ε) and L = O( ∥A∥_1 ∨ ∥A′∥_1 ). The first statement follows by letting (A, F, G) = (A_n, F_n, G_n) in the above inequality. To prove the second statement, let
M(A′, F′, G′) = ⟨∇J_n(A, F, G), (A′, F′, G′) − (A, F, G)⟩ + λ ⟨ξ, A′ − A⟩ + δ ( (1/n²) ∥Φ_n(A − A′)∥²_2 + (1/n) ∥F − F′∥²_2 + (1/n) ∥G − G′∥²_2 ).
Minimizing the right-hand side over (A′, F′, G′) gives
min M ≥ −(1/(4δ)) ( n² ∥∇_A J_n(A, F, G) + λξ∥²_{(Φ_n^⊤Φ_n)^{-1}} + n ∥∇_F J_n(A, F, G)∥²_2 + n ∥∇_G J_n(A, F, G)∥²_2 )
  ≥ −(1/(4δ)) ( ∥(n^{-2} Φ_n^⊤ Φ_n)^{-1}∥ ∥∇_A J_n(A, F, G) + λξ∥²_2 + n ∥∇_F J_n(A, F, G)∥²_2 + n ∥∇_G J_n(A, F, G)∥²_2 ).
Finally, taking (A′, F′, G′) = (A_n, F_n, G_n) and using the optimality of (A_n, F_n, G_n) yields the second statement.

D.2 CONCENTRATION BOUNDS

Lemma 15 (McDiarmid's inequality). Let f : X^n → R satisfy the bounded differences property: for all i and all x_1, …, x_n, x′_i,
| f(x_1, …, x_{i−1}, x_i, x_{i+1}, …, x_n) − f(x_1, …, x_{i−1}, x′_i, x_{i+1}, …, x_n) | ≤ c.
Then, given independent random variables X_1, …, X_n with X_i ∈ X, for any t > 0,
P( | f(X_1, …, X_n) − E[f(X_1, …, X_n)] | ≥ t ) ≤ 2 exp( −2t² / (n c²) ).

Given random vectors X and Y, denote Cov(X, Y) = E⟨Y − E[Y], X − E[X]⟩ and Var(X) = Cov(X, X).

Proposition 16. Let π have marginals α and β and let p = dπ/d(α⊗β). Let (x_i, y_i) ∼ π̂ be i.i.d., where π̂ has marginals α, β. Assume that ∥p∥_∞ ≤ b. Then
Σ_{j=1}^n ( 1 − (1/n) Σ_{i=1}^n p(x_i, y_j) )² ≤ ( t + b + 1 )²
with probability at least 1 − exp( −t²/(4b²) ).

Remark 17. The bound also translates into an ℓ1 bound on the marginal errors, since by Cauchy–Schwarz,
Σ_{j=1}^n | 1 − (1/n) Σ_{i=1}^n p(x_i, y_j) | ≤ √n ( Σ_{j=1}^n ( 1 − (1/n) Σ_{i=1}^n p(x_i, y_j) )² )^{1/2}.
Moreover, by Jensen's inequality,
E[ (1/n) Σ_{j=1}^n | 1 − (1/n) Σ_{i=1}^n p(x_i, y_j) | ] ≤ ( (1/n) E[ Σ_{j=1}^n ( 1 − (1/n) Σ_{i=1}^n p(x_i, y_j) )² ] )^{1/2}.

Proof. We have
E Σ_{j=1}^n ( (1/n) Σ_{i=1}^n ( 1 − p(x_i, y_j) ) )² = (1/n²) Σ_{j=1}^n Σ_{i,k=1}^n E[ ( 1 − p(x_i, y_j) )( 1 − p(x_k, y_j) ) ].
For each j ∈ [n], the terms u_j = E[ (1 − p(x_i, y_j))(1 − p(x_k, y_j)) ] fall into the following cases:
1. i = k = j: u_j = E_{(x,y)∼π̂}( p(x, y) − 1 )². There is 1 such term.
2. i = j and k ≠ j: u_j = E_{(x,y)∼π̂, z∼α}[ (1 − p(x, y))(1 − p(z, y)) ] = 0. There are n − 1 such terms.
3. k = j and i ≠ j: u_j = E_{(z,y)∼π̂, x∼α}[ (1 − p(x, y))(1 − p(z, y)) ] = 0. There are n − 1 such terms.
4. i = k and i ≠ j: u_j = E_{x∼α, y∼β}( 1 − p(x, y) )², and there are n − 1 such terms.
5. i, j, k all distinct: u_j = 0, and there are (n − 1)(n − 2) such terms.
Hence, for each fixed j,
(1/n²) Σ_{i,k=1}^n E[ (1 − p(x_i, y_j))(1 − p(x_k, y_j)) ] = (1/n²) E_{(x,y)∼π̂}( p(x, y) − 1 )² + ((n−1)/n²) E_{x∼α, y∼β}( 1 − p(x, y) )².
Using |1 − p(x, y)| ≤ b + 1 and summing over j gives the first inequality, namely E∥V∥² ≤ (b + 1)², where V = ( (1/n) Σ_{i=1}^n ( 1 − p(x_i, y_j) ) )_j and ∥V∥² = Σ_{j=1}^n ( (1/n) Σ_{i=1}^n (1 − p(x_i, y_j)) )². We apply Lemma 15 to f(z_1, …, z_n) = ∥V∥, with z_i = (x_i, y_i). Let V′ = ( (1/n) Σ_{i=1}^n ( 1 − p(x′_i, y′_j) ) )_j, where (x′_i, y′_i) = (x_i, y_i) for i ≥ 2. Then, for all unit vectors u,
⟨u, V − V′⟩ = Σ_{j=1}^n u_j (1/n) Σ_{i=1}^n ( p(x′_i, y′_j) − p(x_i, y_j) )
 = u_1 (1/n) Σ_{i=1}^n ( p(x′_i, y′_1) − p(x_i, y_1) ) + Σ_{j=2}^n u_j (1/n) ( p(x′_1, y_j) − p(x_1, y_j) ) ≤ 2b/√n.
So, by the reverse triangle inequality, | ∥V∥ − ∥V′∥ | ≤ 2b/√n. It follows that
n^{-1/2} ∥V∥ ≤ n^{-1/2} t + n^{-1/2} E∥V∥ ≤ n^{-1/2} t + ( n^{-1} E∥V∥² )^{1/2} ≤ n^{-1/2} ( t + b + 1 )
with probability at least 1 − exp( −t²/(4b²) ), as required.

D.2.1 BOUNDS FOR THE COST PARAMETRIZATION

In the following two propositions (Propositions 19 and 20), we assume that Φ_n is defined with C_k = ( C_k(x_i, y_j) )_{i,j}, and we discuss in Remark 21 how to account for the fact that the costs C_k in (P_n) are centered. Note that the following classical result is a direct consequence of Lemma 15.

Lemma 18. Suppose Z_1, …, Z_m are independent mean-zero random variables taking values in a Hilbert space H, and that there is some C > 0 such that ∥Z_k∥ ≤ C for all k. Then, for all t > 0, with probability at least 1 − 2 exp(−t),
∥ (1/m) Σ_{k=1}^m Z_k ∥ = O( C (1 + √t) / √m ).

Proposition 19. Let ∥C∥_∞ = max_{x,y} ∥C(x, y)∥ and let (x_i, y_i)_{i=1}^n be i.i.d. draws from π̂. Then
∥ Φ_n^⊤ P̂_n − Φ*π̂ ∥ = O( ∥C∥_∞ (1 + t) / √n )
with probability at least 1 − 2 exp(−t²).

Proof. Direct consequence of Lemma 18.

Proposition 20. Let t > 0. Let (x_i, y_i)_{i=1}^n be i.i.d. draws from π̂, which has marginals α, β. Let P = (1/n²)( p(x_i, y_j) )_{i,j=1}^n, where π has marginals α, β and p = dπ/d(α⊗β). Then
E ∥ Φ_n^⊤ P − Φ*(p α⊗β) ∥² = O(n^{-1}) and ∥ Φ_n^⊤ P − Φ*(p α⊗β) ∥ = O( (1 + t) / √n )
with probability at least 1 − 2 exp( −2t² / (64 ∥p∥_∞²) ).

Proof. Let h(x, y) = p(x, y) C(x, y), so that Φ_n^⊤ P = (1/n²) Σ_{i,j} h(x_i, y_j). Since the diagonal terms (j = i) are drawn from π̂ rather than α⊗β,
E[ Φ_n^⊤ P ] − Φ*(p α⊗β) = (1/n) Φ*( p π̂ − p α⊗β ).
It follows that
E ∥ Φ_n^⊤ P − Φ*(p α⊗β) ∥² = (1/n²) ∥ Φ*( p π̂ − p α⊗β ) ∥² + Var( Φ_n^⊤ P ), with Var( Φ_n^⊤ P ) = (1/n⁴) Σ_{i,j,k,ℓ=1}^n Cov( h(x_i, y_j), h(x_k, y_ℓ) ).
Note that Cov( h(x_i, y_j), h(x_k, y_ℓ) ) = 0 whenever i, j, k, ℓ are all distinct, and there are n(n−1)(n−2)(n−3) = n⁴ − 6n³ + 11n² − 6n such quadruples, i.e. there are at most 6n³ − 11n² + 6n nonzero terms in the sum.
It follows that Var( Φ_n^⊤ P ) = O(n^{-1}) when C and p are uniformly bounded. To conclude, we apply Lemma 15 to f(z_1, …, z_n) = ∥ Φ_n^⊤ P − Φ*(p α⊗β) ∥ with z_i = (x_i, y_i). Let X(z_1, …, z_n) = Φ_n^⊤ P − Φ*(p α⊗β). For an arbitrary vector u,
⟨u, X(z_1, z_2, …, z_n) − X(z′_1, z_2, …, z_n)⟩ = (1/n²) Σ_{j=1}^n ⟨u, h(x_1, y_j) − h(x′_1, y_j)⟩ + (1/n²) Σ_{i=2}^n ⟨u, h(x_i, y_1) − h(x_i, y′_1)⟩ ≤ 8 n^{-1} ∥u∥ sup_{x,y} ∥h(x, y)∥.
So | f(z_1, z_2, …, z_n) − f(z′_1, z_2, …, z_n) | ≤ 8 n^{-1} ∥p∥_∞ and
∥ Φ_n^⊤ P − Φ*(p α⊗β) ∥ = O( (1 + t) / √n )
with probability at least 1 − 2 exp( −2t² / (64 ∥p∥_∞²) ).

Remark 21 (Adjusting for the centred cost parametrization). In the two propositions above, we compare Φ_n^⊤ P = Σ_{i,j} C_k(x_i, y_j) P_{i,j} with Φ*π = ∫ C_k(x, y) dπ(x, y), for P a discretized version of π. The delicate issue is that in (P_n), C_k is a centred version of Ĉ_k = ( C_k(x_i, y_j) )_{i,j}; see (37) below. Note that if P 1 = (1/n) 1 and P^⊤ 1 = (1/n) 1, then
Σ_{i,j} (C_k)_{i,j} P_{i,j} = Σ_{i,j} (Ĉ_k)_{i,j} P_{i,j} − (1/n²) Σ_{i,j} (Ĉ_k)_{i,j}.
This last term on the right-hand side is negligible because Φ*(α⊗β) = 0 by the assumption that C_k is centred: note that (1/n²) Σ_{i,j} (Ĉ_k)_{i,j} = Φ̂_n^⊤( (1/n²) 1 1^⊤ ), and applying the above proposition,
∥ (1/n²) Σ_{i,j} Ĉ(x_i, y_j) − Φ*(α⊗β) ∥ = O( ( m + log(n) ) / √n )
with probability at least 1 − exp(−m). So, up to constants, the two propositions above can also be applied with our centred costs C_k.

D.2.2 INVERTIBILITY BOUND

Recall our assumption that M := E_{(x,y)∼α⊗β}[ C(x, y) C(x, y)^⊤ ] is invertible. In Proposition 23, we bound the deviation of (1/n²) Φ_n^⊤ Φ_n from M in the spectral norm, and hence establish that it is invertible with high probability. We will make use of the following matrix Bernstein inequality.

Theorem 22 (Matrix Bernstein, Tropp et al. (2015)). Let Z_1, …, Z_n ∈ R^{d×d} be independent, symmetric, mean-zero random matrices such that ∥Z_i∥ ≤ L for all i ∈ [n]. Then
E ∥ Σ_{i=1}^n Z_i ∥ ≤ √( 2 σ² log(2d) ) + (1/3) L log(2d), where σ² = ∥ Σ_{i=1}^n E[ Z_i² ] ∥.

Proposition 23. Assume that log(2s) + 1 ≤ n and let t > 0. Then
∥ (1/n²) Φ_n^⊤ Φ_n − M ∥ = O( √( log(2s)/n ) + t n^{-1/2} )
with probability at least 1 − O(exp(−t²)).

Proof. Recall that Φ_n A = Σ_k A_k C_k, where C_k is the centred version of the matrix Ĉ_k = ( C_k(x_i, y_j) )_{i,j}, that is,
(C_k)_{i,j} = (Ĉ_k)_{i,j} − (1/n) Σ_ℓ (Ĉ_k)_{ℓ,j} − (1/n) Σ_m (Ĉ_k)_{i,m} + (1/n²) Σ_{ℓ,m} (Ĉ_k)_{ℓ,m}.  (37)
For simplicity, we first carry out the proof for (Φ_n A)_{i,j} = Σ_k A_k C_k(x_i, y_j), and explain at the end how to modify it to account for Φ_n built from the centred C_k. Write
(1/n²) Φ_n^⊤ Φ_n − M = [ (1/n²) Σ_{i,j} C(x_i, y_j) C(x_i, y_j)^⊤ − (1/n) Σ_i ∫ C(x_i, y) C(x_i, y)^⊤ dβ(y) ]  (38)
 + [ (1/n) Σ_i ∫ C(x_i, y) C(x_i, y)^⊤ dβ(y) − ∫∫ C(x, y) C(x, y)^⊤ dα(x) dβ(y) ].  (39)
To bound (39), let Z_i = ∫ C(x_i, y) C(x_i, y)^⊤ dβ(y) − ∫∫ C(x, y) C(x, y)^⊤ dα(x) dβ(y) and observe that these are i.i.d. mean-zero matrices. By matrix Bernstein with the bounds ∥Z_i∥ ≤ 2 and ∥Z_i²∥ ≤ 4,
E ∥ (1/n) Σ_i Z_i ∥ ≤ √( 8 log(2s)/n ) + ( 2 log(2s) )/(3n) = O( √( log(2s)/n ) ),
assuming that log(2s) ≤ n. For the term (38), the diagonal contribution (j = i) consists of n terms of norm at most 2 and is therefore O(n^{-1}) in spectral norm. For each i = 1, …, n, let Y_j = C(x_i, y_j) C(x_i, y_j)^⊤ − ∫ C(x_i, y) C(x_i, y)^⊤ dβ(y), and observe that, conditional on x_i, ( Y_j )_{j≠i} are i.i.d. mean-zero matrices. The matrix Bernstein inequality applied to (1/(n−1)) Σ_{j∈[n]∖{i}} Y_j implies that
E ∥ (1/(n−1)) Σ_{j∈[n]∖{i}} Y_j ∥ = O( √( log(2s)/n ) ),
assuming that log(2s) ≤ n − 1; averaging over i and combining with the bound on (39) controls E ∥ (1/n²) Φ_n^⊤ Φ_n − M ∥. Finally, we apply Lemma 15 to f(z_1, …, z_n) = ∥ (1/n²) Φ_n^⊤ Φ_n − M ∥. Let Φ′_n denote the feature map built from the modified sample, with (x′_i, y′_i) = (x_i, y_i) for i ≥ 2. For each unit vector u,
(1/n²) ⟨( Φ_n^⊤ Φ_n − (Φ′_n)^⊤ Φ′_n ) u, u⟩ = (1/n²) Σ_{i=1}^n ( ⟨C(x_i, y_1), u⟩² − ⟨C(x′_i, y′_1), u⟩² ) + (1/n²) Σ_{j=2}^n ( ⟨C(x_1, y_j), u⟩² − ⟨C(x′_1, y_j), u⟩² ) ≤ 4 n^{-1}.
So, by Lemma 15,
∥ (1/n²) Φ_n^⊤ Φ_n − M ∥ ≤ E ∥ (1/n²) Φ_n^⊤ Φ_n − M ∥ + 8 t n^{-1/2}
with probability at least 1 − exp( −2t²/16 ).

To conclude this proof, we now discuss how to modify the above argument for the centred costs, that is, Φ_n A = Σ_k A_k C_k with C_k as defined in (37). Note that in this case,
(1/n²) (Φ_n^⊤ Φ_n)_{k,ℓ} = (1/n²) Σ_{i,j} C_k(x_i, y_j) C_ℓ(x_i, y_j) + (1/n⁴) ( Σ_{i,j} C_k(x_i, y_j) ) ( Σ_{p,q} C_ℓ(x_p, y_q) )  (45)
 − (1/n³) Σ_j ( Σ_i C_k(x_i, y_j) ) ( Σ_p C_ℓ(x_p, y_j) ) − (1/n³) Σ_i ( Σ_j C_k(x_i, y_j) ) ( Σ_q C_ℓ(x_i, y_q) ).  (46)
We already know from the previous arguments that the first term in (45) deviates from M by O( √(log(2s)/n) + t n^{-1/2} ) with high probability. For the last term in (45), let
Λ = { (i, j, p, q) : i ∈ [n], j ∈ [n]∖{i}, p ∈ [n]∖{i, j}, q ∈ [n]∖{i, j, p} }.
Note that Λ has n(n−1)(n−2)(n−3) elements, and Λᶜ = ([n]×[n]×[n]×[n]) ∖ Λ has O(n³) elements. Therefore, we can write
(1/n⁴) E ∥ Σ_{p,q} C(x_p, y_q) Σ_{i,j} C(x_i, y_j)^⊤ ∥ ≤ (1/n⁴) E ∥ Σ_{(i,j,p,q)∈Λ} C(x_p, y_q) C(x_i, y_j)^⊤ ∥ + (1/n⁴) E ∥ Σ_{(i,j,p,q)∈Λᶜ} C(x_p, y_q) C(x_i, y_j)^⊤ ∥
 ≤ (1/n⁴) E ∥ Σ_{(i,j,p,q)∈Λ} C(x_p, y_q) C(x_i, y_j)^⊤ ∥ + O(n^{-1}),
where the second inequality holds because Λᶜ has O(n³) elements and ∥C(x, y)∥ ≤ 1. Moreover, for fixed distinct (i, j, p) and conditional on (x_i, y_j, x_p), the terms ( C(x_p, y_q) C(x_i, y_j)^⊤ )_{q∉{i,j,p}} are i.i.d. with conditional mean ∫ C(x_p, y) dβ(y) C(x_i, y_j)^⊤ = 0, by the assumption that ∫ C(x, y) dβ(y) = 0. Applying matrix Bernstein as before to these inner sums and summing over the O(n³) triples (i, j, p) shows that
(1/n⁴) E ∥ Σ_{(i,j,p,q)∈Λ} C(x_p, y_q) C(x_i, y_j)^⊤ ∥ = O( √( log(2s)/n ) ).
A similar argument handles the two terms in (46), and so E ∥ (1/n²) Φ_n^⊤ Φ_n − M ∥ = O( √( log(2s)/n ) ). The high-probability bound can then be derived using Lemma 15 as before.

E PROOFS FOR THE GAUSSIAN SETTING

Simplified problem. To ease the computations, we compute the Hessian of the following function (corresponding to the special case Σ_α = Id and Σ_β = Id):
W̃(A) := sup_Σ ⟨A, Σ⟩ + (ε/2) log det( Id − Σ^⊤ Σ ).  (47)
To retrieve the Hessian of the original function, note that since
log det( Σ_β − Σ^⊤ Σ_α^{-1} Σ ) = log det(Σ_β) + log det( Id − Σ_β^{-1/2} Σ^⊤ Σ_α^{-1} Σ Σ_β^{-1/2} ),
the change of variable Σ ↦ Σ_α^{-1/2} Σ Σ_β^{-1/2} shows that W(A) = W̃( Σ_α^{1/2} A Σ_β^{1/2} ) up to an additive constant and, hence,
∇²W(A) = ( Σ_β^{1/2} ⊗ Σ_α^{1/2} ) ∇²W̃( Σ_α^{1/2} A Σ_β^{1/2} ) ( Σ_β^{1/2} ⊗ Σ_α^{1/2} ).  (48)
By the envelope theorem, ∇W̃(A) = Σ̄, where Σ̄ is the maximizer of (47); this reduces the computation of ∇²W̃(A) to differentiating the optimality condition for Σ̄, i.e.
A = ε Σ̄ ( Id − Σ̄^⊤ Σ̄ )^{-1}.  (49)
Recall also that Σ̄ has the explicit formula given in (5). Lemma 24 computes this Hessian:
∇²W̃(A) = ε ( ε² ( Id − Σ̄^⊤ Σ̄ )^{-1} ⊗ ( Id − Σ̄ Σ̄^⊤ )^{-1} + (A ⊗ A) T )^{-1},  (50)
where Σ̄ is as in (49) and T is the linear map defined by T vec(X) = vec(X^⊤).

Proof of Lemma 24. Define G(Σ, A) := Σ ( Id − Σ^⊤ Σ )^{-1} − ε^{-1} A. Note that a maximizer Σ̄ of (47) satisfies G(Σ̄, A) = 0. Moreover, ∇_Σ G is invertible at such a (Σ̄, A) because it is the Hessian of Σ ↦ −(1/2) log det( Id − Σ^⊤ Σ ), a strictly convex function. By the implicit function theorem, there exists a function f : A ↦ Σ̄ such that G(f(A), A) = 0 and
f′ = −( ∇_Σ G )^{-1} ∇_A G = ε^{-1} ( ∇_Σ G )^{-1}.
It remains to compute ∇_Σ G at (f(A), A). At an arbitrary point (Σ, A), differentiating Σ ↦ Σ( Id − Σ^⊤Σ )^{-1} gives
∇_Σ G = ( Id − Σ^⊤Σ )^{-1} ⊗ Id + ( Id − Σ^⊤Σ )^{-1} ⊗ ( Σ ( Id − Σ^⊤Σ )^{-1} Σ^⊤ ) + ( ( Id − Σ^⊤Σ )^{-1} Σ^⊤ ⊗ Σ ( Id − Σ^⊤Σ )^{-1} ) T,
and at (f(A), A), i.e. with Σ ( Id − Σ^⊤Σ )^{-1} = ε^{-1} A, we can further simplify to
∇_Σ G = ( Id − Σ^⊤Σ )^{-1} ⊗ ( Id + Σ ( Id − Σ^⊤Σ )^{-1} Σ^⊤ ) + ε^{-2} (A ⊗ A) T.
By the Woodbury matrix formula, Id + Σ ( Id − Σ^⊤Σ )^{-1} Σ^⊤ = ( Id − Σ Σ^⊤ )^{-1} ( = Id + ε^{-1} A Σ^⊤ ). So,
∇_Σ G = ( Id − Σ^⊤Σ )^{-1} ⊗ ( Id − Σ Σ^⊤ )^{-1} + ε^{-2} (A ⊗ A) T,
thus concluding the proof.

We remark that from the connection (48) between W(A) and W̃(A), we obtain Lemma 7 as a corollary.
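As a quick sanity check of the simplified problem (47) and its optimality condition (49), one can maximize the concave objective numerically and verify that the maximizer satisfies A = ε Σ (Id − Σ^⊤Σ)^{-1}. The sketch below does this with a plain gradient ascent in NumPy; the step size, dimensions and stopping rule are arbitrary choices of ours and not taken from the paper.

```python
import numpy as np

def solve_W_tilde(A, eps, lr=0.05, iters=20000):
    """Maximize <A, Sigma> + (eps/2) * logdet(I - Sigma^T Sigma) by gradient
    ascent, starting from Sigma = 0 (the objective is concave on the open
    unit spectral ball). The gradient is A - eps * Sigma (I - Sigma^T Sigma)^{-1},
    so a stationary point satisfies the optimality condition (49)."""
    d = A.shape[0]
    Sigma = np.zeros_like(A)
    for _ in range(iters):
        K = np.linalg.inv(np.eye(d) - Sigma.T @ Sigma)
        grad = A - eps * Sigma @ K
        Sigma_new = Sigma + lr * grad
        snorm = np.linalg.norm(Sigma_new, 2)
        if snorm >= 0.999:                      # safeguard: stay inside the ball
            Sigma_new *= 0.999 / snorm
        if np.linalg.norm(Sigma_new - Sigma) < 1e-12:
            return Sigma_new
        Sigma = Sigma_new
    return Sigma

rng = np.random.default_rng(0)
d, eps = 4, 2.0
A = 0.5 * rng.standard_normal((d, d))
Sigma = solve_W_tilde(A, eps)
# Check the optimality condition (49): A = eps * Sigma (I - Sigma^T Sigma)^{-1}.
residual = A - eps * Sigma @ np.linalg.inv(np.eye(d) - Sigma.T @ Sigma)
print("optimality residual:", np.abs(residual).max())
```

A small printed residual confirms that the maximizer of (47) indeed obeys (49), which is the identity used throughout the Gaussian computations below.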
E.1 LIMIT CASES

SVD representation of the covariance. To derive the limiting expressions, we record an observation on the singular value decomposition of Σ̄. Let the singular value decomposition of A be A = U D V^⊤, where D is the diagonal matrix with positive entries d_i. Note that
( Id + (ε²/4) (A^⊤ A)^{-1} )^{1/2} = V ( Id + (ε²/4) D^{-2} )^{1/2} V^⊤.
Moreover, this matrix and A^⊤A commute, so (5) gives
Σ̄ = U D̄ V^⊤,  (51)
where D̄ is the diagonal matrix with diagonal entries
d̄_i = √( 1 + ε²/(4 d_i²) ) − ε/(2 d_i).  (52)

E.1.1 LINK WITH LASSO: LIMIT AS ε → ∞

From (52),
d̄_i = (ε/(2 d_i)) √( 1 + 4 d_i²/ε² ) − ε/(2 d_i) = d_i/ε − d_i³/ε³ + O(ε^{-5}) → 0 as ε → ∞.  (53)
It follows that lim_{ε→∞} Σ̄ = 0 and hence ε ∇²W̃(A) → Id. So the certificate converges to
( Σ_β ⊗ Σ_α )_{(:,I)} ( ( Σ_β ⊗ Σ_α )_{(I,I)} )^{-1} sign(Â)_I.  (54)

Proof of Proposition 8. The iOT problem approaches a Lasso problem as ε → ∞. Recall that in the Gaussian setting, the iOT problem has the form
argmin_A F(A) := λ ∥A∥_1 + ⟨A, Σ_{ε,A} − Σ_{ε,Â}⟩ + (ε/2) log det( Σ_β − Σ_{ε,A}^⊤ Σ_α^{-1} Σ_{ε,A} ),  (55)
where Σ_{ε,A} satisfies Σ_β − Σ_{ε,A}^⊤ Σ_α^{-1} Σ_{ε,A} = ε Σ_α^{-1} Σ_{ε,A} A^{-1}. So we can write
argmin_A F(A) = λ ∥A∥_1 + ⟨A, Σ_{ε,A} − Σ_{ε,Â}⟩ + (ε/2) log det( ε Σ_α^{-1} Σ_{ε,A} A^{-1} ),  (56)
where we write X = Σ_α^{1/2} A Σ_β^{1/2}, X_0 = Σ_α^{1/2} Â Σ_β^{1/2} and Σ̃_{ε,A} = Σ_α^{-1/2} Σ_{ε,A} Σ_β^{-1/2}. From (52), if X has singular value decomposition X = U diag(d_i) V^⊤, then Σ̃_{ε,A} = U D̄ V^⊤ with diagonal entries d̄_i = d_i/ε − d_i³/ε³ + O(ε^{-5}). Hence
log det( ε Σ_α^{-1} Σ_{ε,A} A^{-1} ) = log det( U diag( 1 − d_i²/ε² + O(ε^{-4}) ) U^⊤ ) = Σ_i log( 1 − d_i²/ε² + O(ε^{-4}) ) = −(1/ε²) ∥X∥²_F + O(ε^{-4}).
Also, ε ⟨Σ_{ε,A} − Σ_{ε,Â}, A⟩ = ε ⟨Σ̃_{ε,A} − Σ̃_{ε,Â}, X⟩ = ⟨X − X_0, X⟩ + O(ε^{-2}). So, assuming λ = λ_0/ε,
ε F(A) = λ_0 ∥A∥_1 + ⟨X − X_0, X⟩ − (1/2) ∥X∥²_F + O(ε^{-2})  (57)
 = λ_0 ∥A∥_1 + (1/2) ∥ Σ_α^{1/2} ( A − Â ) Σ_β^{1/2} ∥²_F + O(ε^{-2}),  (58)
up to an additive constant independent of A. The final statement on the convergence of minimizers follows by Γ-convergence.

E.1.2 LINK WITH GRAPHICAL LASSO

In the special case where the covariances are the identity (Σ_α = Id and Σ_β = Id) and A is symmetric positive definite, Σ̄ is also positive definite and Galichon's formula (5) holds (since A is invertible); hence the Hessian reduces to
( (1/ε) ∇²W̃(A) )^{-1} = (A ⊗ A) ( Σ̄^{-1} ⊗ Σ̄^{-1} + T ).  (59)
Moreover, if A admits an eigenvalue decomposition A = U D U^⊤, then Σ̄ admits the eigenvalue decomposition Σ̄ = U D̄ U^⊤ with the entries of D̄ given by (52). It follows that lim_{ε→0} Σ̄ = Id and, hence, lim_{ε→0} ( Σ̄^{-1} ⊗ Σ̄^{-1} + T ) = Id + T, a singular matrix whose kernel is the set of antisymmetric matrices. So the limit does not necessarily exist as ε → 0. However, in the special case where A is symmetric positive definite, one can show that the certificates remain well defined as ε → 0. Let S₊ be the set of symmetric matrices and S₋ the set of antisymmetric matrices. Then ( ∇²W̃(A) )^{-1}(S₊) ⊆ S₊ and ( ∇²W̃(A) )^{-1}(S₋) ⊆ S₋. Moreover,
( (1/ε) ∇²W̃(A) )^{-1} |_{S₊} = (A ⊗ A) ( Σ̄^{-1} ⊗ Σ̄^{-1} + Id ), ( (1/ε) ∇²W̃(A) )^{-1} |_{S₋} = (A ⊗ A) ( Σ̄^{-1} ⊗ Σ̄^{-1} − Id ).
Since the symmetry of A implies the symmetry of sign(A), we can replace the Hessian in (59) by (A ⊗ A)( Σ̄^{-1} ⊗ Σ̄^{-1} + Id ) and hence, since lim_{ε→0} Σ̄ = Id, the limit as ε → 0 is
lim_{ε→0} z_ε = ( A^{-1} ⊗ A^{-1} )_{(:,I)} ( ( A^{-1} ⊗ A^{-1} )_{(I,I)} )^{-1} sign(A)_I.  (60)
This coincides precisely with the certificate of the graphical Lasso:
argmin_{Θ ≻ 0} ⟨S, Θ⟩ − log det(Θ) + λ ∥Θ∥_1.

Proof of Proposition 9. The iOT problem with identity covariances, restricted to the set of positive semi-definite matrices, has the form argmin_{A ⪰ 0} F_{ε,λ}(A), where
F_{ε,λ}(A) := λ ∥A∥_1 + ⟨A, Σ_{ε,A} − Σ̂⟩ + (ε/2) log det( Id − Σ_{ε,A}^⊤ Σ_{ε,A} ),  (61)
and Id − Σ_{ε,A}^⊤ Σ_{ε,A} = ε Σ_{ε,A} A^{-1}. From the singular value decomposition of Σ_{ε,A}, i.e. (51), we see that if A is symmetric positive definite, then so is Σ_{ε,A}.
Plugging the optimality condition Id − Σ_{ε,A}^⊤ Σ_{ε,A} = ε Σ_{ε,A} A^{-1} into (61), we obtain
argmin_{A ⪰ 0} λ ∥A∥_1 + ⟨A, Σ_{ε,A} − Σ̂⟩ + (ε/2) log det( ε A^{-1} Σ_{ε,A} )
 = argmin_{A ⪰ 0} λ ∥A∥_1 − (ε/2) log det( A/ε ) + ε ⟨A, ( Σ_{ε,A} − Σ̂ )/ε⟩ + (ε/2) log det( Σ_{ε,A} ).
Let λ = ε λ_0 for some λ_0 > 0. Then, removing the constant (ε/2) log det( ε Id ) term and factoring out ε, the problem is equivalent to
argmin_{A ⪰ 0} λ_0 ∥A∥_1 − (1/2) log det(A) + ⟨A, ε^{-1}( Σ_{ε,A} − Σ̂ )⟩ + (1/2) log det( Σ_{ε,A} ).
Assume that Σ̂ = Σ_{ε,Â}. From the expression (52) for the singular values of Σ_{ε,A}, note that Σ_{ε,A} = Id − (ε/2) A^{-1} + O(ε²). So lim_{ε→0} ( Σ_{ε,A} − Σ_{ε,Â} )/ε = (1/2)( Â^{-1} − A^{-1} ). The objective therefore converges pointwise to
λ_0 ∥A∥_1 − (1/2) log det(A) + (1/2) ⟨A, Â^{-1} − A^{-1}⟩,
and the statement is then a direct consequence of Γ-convergence.

Remark 25. Note that from (52) we have Σ̄ = Id − (ε/2) A^{-1} + O(ε²). So, in this case, the covariance of π̂ is
( Id , Id − (ε/2) A^{-1} + O(ε²) ; Id − (ε/2) A^{-1} + O(ε²) , Id ).
For (X, Y) ∼ π̂, the Schur complement of this block matrix is the covariance of X conditional on Y, which is ε A^{-1} + O(ε²). So, up to the factor ε, one can view A^{-1} as the "precision" matrix of the covariance of X conditional on Y.

F LARGE SCALE ℓ1-IOT SOLVER

Recall that the iOT optimization problem, recast over the dual potentials for empirical measures, reads
inf_{A,F,G} λ ∥A∥_1 − (1/n) Σ_i ( F_i + G_i + (Φ_n A)_{i,i} ) + (ε/(2n²)) Σ_{i,j} exp( (2/ε)( F_i + G_j + (Φ_n A)_{i,j} ) ).
To obtain a better-conditioned optimization problem, in line with Cuturi & Peyré (2016), we instead consider the semi-dual problem, which (up to an additive constant) is obtained by plugging in the closed-form expression for the optimal G given F:
inf_{A,F} λ ∥A∥_1 − (1/n) Σ_i ( F_i + (Φ_n A)_{i,i} ) + (ε/(2n)) Σ_j log( (1/n) Σ_i exp( (2/ε)( F_i + (Φ_n A)_{i,j} ) ) ).
Following Poon & Peyré (2021), which proposes a state-of-the-art Lasso solver, the last step is to use the Hadamard-product over-parameterization of the ℓ1 norm,
∥A∥_1 = min_{U ⊙ V = A} ( ∥U∥²_2 / 2 + ∥V∥²_2 / 2 ),
where the Hadamard product is U ⊙ V = ( U_i V_i )_i, to obtain the final optimization problem
inf_{F,U,V} −(1/n) Σ_i ( F_i + ( Φ_n(U ⊙ V) )_{i,i} ) + (ε/(2n)) Σ_j log( (1/n) Σ_i exp( (2/ε)( F_i + ( Φ_n(U ⊙ V) )_{i,j} ) ) ) + (λ/2) ∥U∥²_2 + (λ/2) ∥V∥²_2.
This is a smooth optimization problem, for which we employ a quasi-Newton solver (L-BFGS). Although it is non-convex, as demonstrated in Poon & Peyré (2021), the non-convexity is benign, ensuring that the solver always converges to a global minimizer (F⋆, U⋆, V⋆) of the functional. From this, one reconstructs the cost parameter as A⋆ = U⋆ ⊙ V⋆.
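To make the pipeline concrete, here is a minimal NumPy/SciPy sketch of this semi-dual solver under our reading of the formulation above: it assumes the (centred) cost features are given as an array Ck of shape (s, n, n), uses the 2/ε exponential convention adopted throughout the appendix, and optimizes (F, U, V) jointly with L-BFGS using an analytic gradient. The names (iot_l1_semidual, Ck, lam) and the synthetic data are ours, and the sketch omits the practical refinements (initialization, scaling, stopping rules) a production solver would need.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def iot_l1_semidual(Ck, eps, lam, lbfgs_opts=None):
    """Minimize the semi-dual l1-iOT objective over (F, U, V), with A = U*V.

    Objective (up to additive constants):
      -(1/n) sum_i [F_i + (Phi_n A)_{ii}]
      + (eps/(2n)) sum_j log( (1/n) sum_i exp((2/eps)(F_i + (Phi_n A)_{ij})) )
      + (lam/2)(||U||^2 + ||V||^2).
    """
    s, n, _ = Ck.shape

    def unpack(theta):
        return theta[:n], theta[n:n + s], theta[n + s:]

    def fun(theta):
        F, U, V = unpack(theta)
        A = U * V
        M = np.einsum('kij,k->ij', Ck, A)              # (Phi_n A)_{ij}
        S = (2.0 / eps) * (F[:, None] + M)             # n x n
        lse = logsumexp(S, axis=0)                     # log sum_i exp(S_ij), per column j
        val = (-(F + np.diag(M)).sum() / n
               + eps / (2 * n) * (lse - np.log(n)).sum()
               + 0.5 * lam * (U @ U + V @ V))
        # analytic gradient
        Wgt = np.exp(S - lse[None, :])                 # column-wise softmax weights
        gF = -1.0 / n + Wgt.sum(axis=1) / n
        gM = Wgt / n
        gM[np.diag_indices(n)] -= 1.0 / n
        gA = np.einsum('ij,kij->k', gM, Ck)
        gU = gA * V + lam * U
        gV = gA * U + lam * V
        return val, np.concatenate([gF, gU, gV])

    theta0 = np.zeros(n + 2 * s)
    res = minimize(fun, theta0, jac=True, method='L-BFGS-B',
                   options=lbfgs_opts or {'maxiter': 2000})
    F, U, V = unpack(res.x)
    return U * V, F

# Synthetic demo: random doubly-centred cost features on n paired samples.
rng = np.random.default_rng(0)
n, s = 60, 5
Ck = rng.standard_normal((s, n, n))
Ck -= Ck.mean(axis=1, keepdims=True)
Ck -= Ck.mean(axis=2, keepdims=True)
A_est, F_est = iot_l1_semidual(Ck, eps=1.0, lam=0.05)
print("recovered A:", np.round(A_est, 3))
```

The quadratic penalties on U and V play the role of λ∥A∥_1 through the over-parameterization, so shrinking lam trades sparsity of the recovered A against data fit, mirroring the role of λ in (iOT-ℓ1(π̂)).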