# Order Constraints in Optimal Transport

Fabian Lim¹, Laura Wynter¹, Shiau Hong Lim¹

¹IBM Research, Singapore. Correspondence to: Fabian Lim <flim@sg.ibm.com>.

*Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).*

## Abstract

Optimal transport is a framework for comparing measures whereby a cost is incurred for transporting one measure to another. Recent works have aimed to improve optimal transport plans through the introduction of various forms of structure. We introduce novel order constraints into the optimal transport formulation to allow for the incorporation of structure. We define an efficient method for obtaining explainable solutions to the new formulation that scales far better than standard approaches. The theoretical properties of the method are provided. We demonstrate experimentally that order constraints improve explainability using the e-SNLI (Stanford Natural Language Inference) dataset, which includes human-annotated rationales, as well as on several image color transfer examples.

## 1. Introduction

Optimal transport (OT) is a framework for comparing measures whereby a cost is incurred for transporting one measure to another. Optimal transport enjoys both significant theoretical interest and widespread applicability (Villani, 2008; Peyré & Cuturi, 2019). Recent work (Swanson et al., 2020; Alvarez-Melis et al., 2018) aims to improve the interpretability of optimal transport plans through the introduction of structure. Structure can take many forms: text documents where context is given (Alvarez-Melis et al., 2020), images with certain desired features (Shi et al., 2020; Liu et al., 2021), fairness properties (Laclau et al., 2021), or even patterns in RNA (Forrow et al., 2019). Swanson et al. (2020) postulate that structure can be discovered by sparsifying optimal transport plans.

Motivated by advances in sparse model learning, order statistics and isotonic regression (Paty et al., 2020), we introduce novel order constraints (OC) into the optimal transport formulation so as to allow more complex structure to be readily added to optimal transport plans. We show theoretically that, for convex cost functions with efficiently computable gradients, the resulting order-constrained optimal transport problem can be solved using a form of ADMM (Boyd et al., 2011) and can be δ-approximated efficiently. We further derive computationally efficient lower bounds for our formulation. Specifically, when the structure is not provided in advance, we show how it can be estimated. The bounds allow the use of an explainable method for identifying the most important constraints, rendered tractable through branch-and-bound. Each of these order constraints leads to an optimal transport plan computed sequentially using the algorithm we introduce. The end result is a diverse set of the most important optimal transport plans from which a user can select the plan that is preferred.

The order constraints allow context to be taken into account: explicitly, when known in advance, or implicitly, when estimated through the proposed procedure. Consider OT for document retrieval (e.g., see Kusner et al. (2015); Swanson et al. (2020)) where a user enters a text query. The multiple transport plans provided by OT with order constraints offer a diverse set of the best responses, each accounting for different possible contexts.
The resulting method is weakly supervised, with only two tunable thresholds (or unsupervised if the parameters are set to default values), and is amenable to applications of OT where labels are not available to train a supervised model. An example is provided in Fig. 2, which aims to learn the similarity between two phrases. The top box depicts the transport plan learnt by OT; the darker the line, the stronger the transport measure. The match provided by OT does not reveal the contradiction in the two phrases. The second to fourth transport plans are more human-interpretable. The second plan matches *red clothing* to the *black pants*. The third matches *piece* to *pair*, a relation not included in the standard OT plan. The last plan matches *clothing* to *pants*.¹ The example makes use of OT with a single order constraint to better illustrate the idea of order constraints. OT with multiple order constraints can incorporate all of the pairs in a single plan.

¹For readability of Fig. 2, very low scores were suppressed, making it appear that they do not sum to one though they do.

*Figure 1: Color transfer using OT with Order Constraints.*

*Figure 2: Text matching using OT with Order Constraints.*

Fig. 1 illustrates using OT to transfer the color palette of a target image (Van Gogh's *Café Terrace at Night*) to the source image of an alley. The second image from the left, obtained by standard OT, transfers yellow to bright areas and blue to dark. Our proposed order constraints allow adding context, which in this case gives desired effects such as transferring the Van Gogh night sky (second image from right) or the awning's brilliant yellow (middle image) to the sky in the source image.

The contributions of this work are as follows: (1) we provide a novel formulation, Optimal Transport with Order Constraints, that allows explicitly incorporating structure into OT plans; (2) we provide an efficient method for solving the proposed OT with OC formulation, related to an ADMM, along with the convergence theory; (3) for cases where the order constraints are not known in advance, we propose an explainable method to estimate them using branch & bound, and define bounds for use in the method; and (4) we demonstrate the benefits of using the proposed method on NLP and image color transfer tasks.

Next, we review related work. In the following section, we present the standard optimal transport formulation, followed by the proposed OT with order constraints (OT with OC) formulation and a demonstration of the existence of a solution, which corresponds to contribution (1) above. Then, an algorithm leveraging the form of the polytope is defined to solve the OT with OC problem in a manner far more efficient than a standard convex optimization solver, as noted in contribution (2). To further improve explainability when the order constraints are not known a priori, we present a branch & bound procedure to search the space for a set of diverse OT with OC plans, along with the bounds that reduce the polynomial search space, as noted in contribution (3). Finally, experiments are provided on the e-SNLI dataset and several image color transfer examples to demonstrate how order constraints allow for incorporating structure and improving explainability, as noted in contribution (4).
## Related Work

We consider point-point linear costs as in, e.g., Cuturi (2013); Courty et al. (2017); Kusner et al. (2015); Wu et al. (2018); Yurochkin et al. (2019); Solomon (2018); Altschuler et al. (2019; 2017); Schmitzer (2019). Recently, a number of works have sought to add structure to optimal transport so as to obtain more intuitive and explainable transport plans. Alvarez-Melis et al. (2019) modeled invariances in OT, a form of regularization, to learn the most meaningful transport plan. Sparsity has been explored in OT to implicitly improve explainability. Swanson et al. (2020) used dummy variates to include fewer transport coefficients. Forrow et al. (2019) regularize using the OT rank to better deal with high-dimensional data. Blondel et al. (2018) had a similar goal of learning sparse and potentially more interpretable transport plans through regularization and a dual formulation. We, on the other hand, propose an explicit approach to interpretability, rather than the implicit method of sparsifying solutions through regularization. Su & Hua (2017) add constraints to require two sequences to lie close to each other in terms of indices; to respect sequence order, they penalize the constraints and then use the classic Sinkhorn algorithm to obtain an approximate solution. Another way to add structure to OT is through dependency modeling; multi-marginal optimal transport takes this approach, wherein multiple measures are transported to each other. The formulation was shown to be generally intractable (Altschuler & Boix-Adsera, 2020a), however, and strong assumptions have to be made to solve it in polynomial time (Altschuler & Boix-Adsera, 2020b; Pass, 2015; Di Marino et al., 2015).

Standard OT can be solved in cubic time by primal-dual methods (Peyré & Cuturi, 2019), and the proposed OT with order constraints can be solved using generalized solvers with similar time complexity. The difficulty comes from the fact that there are quadratically many order constraints. Recent work (Cuturi, 2013; Scetbon et al., 2021; Altschuler et al., 2017; Goldfeld & Greenewald, 2020; Zhang et al., 2021; Lin et al., 2019; Guo et al., 2020; Jambulapati et al., 2019; Guminov et al., 2021) has favored solving OT via iterative methods which offer lower complexity at the expense of obtaining an approximate solution. In particular, Scetbon et al. (2021) considered low-rank approximations of OT plans, and proposed a mirror descent with inner Dykstra iterations. Recent iterative methods for standard OT (Guminov et al., 2021; Jambulapati et al., 2019) ensure feasibility of the solution at each iteration through the use of a rounding algorithm proposed by Altschuler et al. (2017). Iterative methods are essential for solving submodular OT (Alvarez-Melis et al., 2018), which would otherwise incur an exponential number of explicit constraints.

## 2. Order Constraints for Optimal Transport

**Notation** Let $\mathbb{R}$ (resp. $\mathbb{R}_+$) denote the set of real (resp. non-negative) numbers. Let $m, n$ denote the number of rows and columns in the optimal transport problem. Vectors and matrices are in $\mathbb{R}^n$ and $\mathbb{R}^{m \times n}$. Let $[n] := \{1, 2, \ldots, n\}$ and $[mn] := [m] \times [n]$. Let $i, j, k, \ell, p, q$ denote vector/matrix indices, $2^{[n]}$ denote the power set of $[n]$, and $\mathbf{1}_n$ and $I_n$ denote the all-ones vector and identity matrix, resp., of size $n$. We denote the index sequence $i_1, i_2, \ldots, i_k$ as $i_{[k]}$. $\mathbb{I}_S(\cdot)$ is the indicator function over the set $S$.
The trace operator for square matrices is denoted by $\mathrm{tr}(\cdot)$, and $\mathrm{rank}(X_{ij})$ gives the descending positions over the $mn$ matrix coefficients $X_{ij}$, or a subset thereof.² For a closed set $C$, the Euclidean projector for a matrix $X \in \mathbb{R}^{m \times n}$ is $\mathrm{Proj}_C(X) = \arg\min_{Y \in C} \|Y - X\|_F^2$, where $\|\cdot\|_F$ is the matrix Frobenius norm. For $c \in \mathbb{R}$, let $(c)_+ = c$ if $c \ge 0$ and $(c)_+ = 0$ otherwise, and for $X \in \mathbb{R}^{m \times n}$ let $(X)_+$ equal $X$ after setting all negative coefficients to zero.

²This is not to be confused with the standard definition of a matrix's rank.

**Standard OT Formulation** In optimal transport, one optimizes a linear transport cost over a simplex-like polytope, $U(a, b)$. A (balanced) optimal transport problem is given by two probability vectors $a$ and $b$ that each sum to 1. Being of length $m$ and $n$, resp., they define the polytope

$$U(a, b) = \{\Pi \in \mathbb{R}^{m \times n}_+ : \Pi \mathbf{1}_n = a,\ \Pi^T \mathbf{1}_m = b\},$$

together with a linear cost $D \in \mathbb{R}^{m \times n}$, where $D \ge 0$ and bounded. Then the optimal transport cost $f : \mathbb{R}^{m \times n} \to \mathbb{R}_+$ is the optimal cost of the following optimization:

$$\min_{\Pi \in U(a,b)} f(\Pi) := \mathrm{tr}\left(D^T \Pi\right), \qquad (1)$$

and the optimal value of $\Pi$ is the transport plan. We propose to extend (1) by including order constraints to represent structure.

**Proposed Formulation: OT with Order Constraints** The use of order constraints is analogous to the constraints used in isotonic, or monotonic, regression (see De Leeuw et al., 2009). Specifically, if a particular variate $\Pi_{ij}$ is semantically or symbolically important, then its $\mathrm{rank}(\Pi_{ij})$ should reflect that dominance. We thus enforce $k$ order constraints by specifying a sequence of $k$ variates in $[mn]$, $ij_{[k]} := i_1j_1, i_2j_2, \ldots, i_kj_k$, of importance; with the set of remaining variates $V := [mn] \setminus \{i_\ell j_\ell : \ell \in [k]\}$, we require

$$\Pi_{i_kj_k} \ge \cdots \ge \Pi_{i_1j_1} \ge \Pi_{pq} \quad \text{for } pq \in V. \qquad (2)$$

Enforcing the order constraints (2) means each variate $i_\ell j_\ell$ is fixed to maintain the $(k-\ell+1)$-th top position $\Pi_{(k-\ell+1)}$ in the ordering $\Pi_{(1)} \ge \Pi_{(2)} \ge \cdots \ge \Pi_{(mn)}$ whilst learning the transport plan $\Pi$; in particular $i_kj_k$ is fixed at the topmost position. Therefore, for order constraints on $k$ variates of importance and point-to-point costs $D \in \mathbb{R}^{m \times n}$, we extend the OT problem as follows:

$$\inf_{\Pi \in U(a,b)} f(\Pi) := \mathrm{tr}\left(D^T \Pi\right), \qquad (3)$$
$$\text{s.t. } \Pi_{i_kj_k} \ge \cdots \ge \Pi_{i_1j_1}, \quad \Pi_{i_1j_1} \ge \Pi_{pq} \text{ for } pq \in V, \qquad (4)$$

for $V$ in (2), and we assume non-negative costs $D \ge 0$ throughout. Let

$$O_{ij_{[k]}} = \{X \in \mathbb{R}^{m \times n}_+ : \mathrm{rank}(X_{i_\ell j_\ell}) = k - \ell + 1,\ \ell \in [k]\}. \qquad (5)$$

Thus (3)–(4) can be compactly expressed as:

$$\inf_{\Pi \in U(a,b) \cap O_{ij_{[k]}}} f(\Pi). \qquad (6)$$

Problem (6) is convex with linear costs and constraints. However, $O_{ij_{[k]}}$ adds a significant number of constraints to the $m+n$ in the standard OT formulation (1). The inf in (3) and (6) is replaced by min when the problem is feasible; we address such sufficient conditions in the following.
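Since (3)–(4) is a finite linear program, a generic LP solver gives a direct, if slow, reference implementation. Below is a minimal sketch using scipy.optimize.linprog on a small instance; the 3×3 sizes, uniform marginals, and the variate list `top` are our own illustrative assumptions, not part of the paper's method (the paper's efficient solver is Algorithm 1 in Section 3).

```python
# A minimal sketch of problem (3)-(4) as a linear program, solved with
# scipy's generic LP solver as a (slow) baseline. The instance below
# (sizes, marginals, `top`) is an illustrative assumption.
import numpy as np
from scipy.optimize import linprog

m, n = 3, 3
rng = np.random.default_rng(0)
D = rng.random((m, n))            # non-negative costs
a = np.full(m, 1.0 / m)           # uniform marginals, sum to 1
b = np.full(n, 1.0 / n)
top = [(0, 0), (1, 1)]            # ij_[k] = (i1,j1), (i2,j2); here k = 2

def idx(i, j):                    # flatten (i, j) -> LP variable index
    return i * n + j

# Marginal constraints: Pi 1_n = a and Pi^T 1_m = b.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0
for j in range(n):
    A_eq[m + j, j::n] = 1.0
b_eq = np.concatenate([a, b])

# Order constraints (4): Pi[i_{l+1}j_{l+1}] >= Pi[i_l j_l], and
# Pi[i1 j1] >= Pi[pq] for every unconstrained variate pq in V.
A_ub, b_ub = [], []
for (i1, j1), (i2, j2) in zip(top, top[1:]):
    row = np.zeros(m * n)
    row[idx(i1, j1)], row[idx(i2, j2)] = 1.0, -1.0
    A_ub.append(row); b_ub.append(0.0)
for p in range(m):
    for q in range(n):
        if (p, q) not in top:
            row = np.zeros(m * n)
            row[idx(p, q)], row[idx(*top[0])] = 1.0, -1.0
            A_ub.append(row); b_ub.append(0.0)

res = linprog(D.ravel(), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
Pi = res.x.reshape(m, n)          # an order-constrained transport plan
```

Note that this instance satisfies the feasibility condition of Cor. 2.2 below, since $\min(m,n) = 3 \ge (1 - 2/3)^{-1} = 3$.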
**Feasibility of the Proposed OT with OC Problem** $U(a, b)$ is feasible since $ab^T \in U(a, b)$ (see Cuturi, 2013). In certain instances it may be considered ambiguous to match a single position to multiple positions, so consider variates $ij_{[k]}$ where $i_{[k]} = i_1, i_2, \ldots, i_k$ (and $j_{[k]}$) do not repeat row (or column) indices; let $\mathbb{I}_{i_{[k]}}(p) = 1$ if $p = i_\ell$ for some (necessarily unique) choice of $\ell$, and 0 otherwise, and similarly $\mathbb{I}_{j_{[k]}}(q)$. We next derive conditions for the feasibility of (6) under mild assumptions.

**Proposition 2.1.** *Suppose $i_{[k]} = i_1, i_2, \ldots, i_k$ (and $j_{[k]}$) do not repeat row (or column) indices. Further suppose $\Pi \in \mathbb{R}^{m \times n}$, where $\Pi_{i_\ell j_\ell} = c_\ell$ for $\ell \in [k]$ and $\Pi_{pq}$ for $pq \in V$, resp., satisfy:*

$$0 \le \Pi_{i_\ell j_\ell} = c_\ell \le \min(a_{i_\ell}, b_{j_\ell}), \qquad (7)$$
$$0 \le \Pi_{pq} = \frac{\left(a_p - \mathbb{I}_{i_{[k]}}(p)\,c_p\right)\left(b_q - \mathbb{I}_{j_{[k]}}(q)\,c_q\right)}{\alpha(c_1, \ldots, c_k)}, \qquad (8)$$

*where $\alpha = \alpha(c_1, \ldots, c_k) = 1 - \sum_{\ell \in [k]} c_\ell \ge 0$. Then $\Pi$ is feasible w.r.t. (6) if:*

$$\frac{a_p b_q}{\alpha(c_1, \ldots, c_k)} \le c_1 \le \cdots \le c_k \quad \text{for all } pq \in V. \qquad (9)$$

**Proof.** Since $i_{[k]}$ and $j_{[k]}$ do not repeat row/column indices, (7) and (8) imply that $\Pi$ satisfies $\Pi\mathbf{1}_n = a$ and $\Pi^T\mathbf{1}_m = b$; and since they also imply $\Pi \ge 0$, thus $\Pi \in U(a, b)$. Furthermore, we also have $\Pi \in O_{ij_{[k]}}$, because (8) and (9) imply, for all $pq \in V$, that $\Pi_{pq} \le a_pb_q/\alpha \le c_1$. ∎

**Corollary 2.2.** *Suppose $i_{[k]}$ (and $j_{[k]}$) do not repeat row (or column) indices. If $a_i = 1/m$ and $b_j = 1/n$, then $\Pi$ is feasible w.r.t. (6) if $\min(m, n) \ge \left(1 - k/\max(m, n)\right)^{-1}$.*

**Proof.** For all $\ell \in [k]$, pick $c_\ell = \min(1/m, 1/n) = c$ for some $c \ge 0$; this choice for $c$ satisfies (7) and (8). Then (9) holds if $a_pb_q/c \le 1 - kc = \alpha$ holds, or equivalently if $\min(m, n) \ge \left(1 - k/\max(m, n)\right)^{-1}$ holds. ∎

Cor. 2.2 states that for uniform marginals $a_i = 1/m$ and $b_j = 1/n$, feasibility typically holds when $k \ll \max(m, n)$, since an upper bound for $\left(1 - k/\max(m, n)\right)^{-1}$ is $1 + O(k/\max(m, n))$. Indeed, such is the case of interest, where a sparse number $k$ of OCs is enforced; see (2).

## 3. Solving OT with Order Constraints

In this section we provide a δ-approximate method for (6) that runs in $O\!\left(\frac{\|D\|}{\delta}\, mn \log mn\right)$ time using the alternating direction method of multipliers (ADMM) (Boyd et al., 2011; Beck, 2017). In what follows, assume feasibility of (6); the strategy is to consider the (non-empty) polytope $U(a, b) \cap O_{ij_{[k]}}$ as the intersection of two closed convex sets, $C_1(a, b) = \{X \in \mathbb{R}^{m \times n} : X\mathbf{1}_n = a,\ X^T\mathbf{1}_m = b\}$ and $C_2 = O_{ij_{[k]}}$, used to equivalently rewrite (6) as:

$$\min_{X \in C_1(a,b),\ Z \in \mathbb{R}^{m \times n}} f(X) + \mathbb{I}_{C_2}(Z), \qquad (10)$$
$$\text{s.t. } X - Z = 0, \qquad (11)$$

where the indicator function $\mathbb{I}_C(X) = 0$ if $X \in C$, and $\mathbb{I}_C(X) = \infty$ otherwise. For $X \in C_1(a, b)$ and $Z \in C_2$, the equality implies $X = Z = \Pi$ for feasible $\Pi \in U(a, b)$. The first-order, iterative ADMM procedure is summarized in Algorithm 1; $\rho > 0$ is a penalty term, and the iterates $X_{t+1}, Z_{t+1}$ solve, respectively:

$$X_{t+1} = \arg\min_{X \in C_1(a,b)} \mathrm{tr}\left(D^TX\right) + \frac{\rho}{2}\left\|X - Z_t + M_t\right\|_F^2, \qquad (12)$$
$$Z_{t+1} = \arg\min_{Z \in \mathbb{R}^{m \times n}} \mathbb{I}_{C_2}(Z) + \frac{\rho}{2}\left\|X_{t+1} - Z + M_t\right\|_F^2,$$

where $M_t$ is the (scaled) dual iterate; see Boyd et al. (2011). Simplifying the first expression in (12) shows that $X_{t+1}$ is nothing but the Euclidean projection onto $C_1(a, b)$ of the point $Z_t - M_t - \rho^{-1}D$; similarly, $Z_{t+1}$ is the Euclidean projection onto $C_2$ of $X_{t+1} + M_t$.

**Algorithm 1** Iterative procedure for OT under OC $O_{ij_{[k]}}$ with linear costs $f(X) = \mathrm{tr}(D^TX)$.

**Require:** Costs $D$, penalty $\rho$, initial $M_0, Z_0$.
1. **for** round $t \ge 1$ until stopping **do**
2. &nbsp;&nbsp;Update $X_{t+1} = \mathrm{Proj}_{C_1(a,b)}\left(Z_t - M_t - \rho^{-1}D\right)$.
3. &nbsp;&nbsp;Update $Z_{t+1} = \mathrm{Proj}_{C_2}\left(X_{t+1} + M_t\right)$.
4. &nbsp;&nbsp;Update the (scaled) dual variable $M_{t+1} = M_t + X_{t+1} - Z_{t+1}$.
5. **Return** $X_t$.

We next derive the projections, $\mathrm{Proj}_{C_1(a,b)}(\cdot)$ and $\mathrm{Proj}_{C_2}(\cdot)$, required for Alg. 1. Prop. 3.1 gives the Euclidean projector for $C_1(a, b)$.

**Proposition 3.1 (Projection onto $C_1(a,b)$).** *For $C_1(a, b)$, consider the Euclidean projector $\mathrm{Proj}_{C_1(a,b)}(X)$. Let $P_k$ be the projection of a vector onto its mean, i.e., $P_k = \frac{1}{k}\mathbf{1}_k\mathbf{1}_k^T$. Define matrices $M := I_m - \frac{m}{m+n}P_m$ and $N := I_n - \frac{n}{m+n}P_n$, and matrices*

$$Y_1 := \overbrace{[Ma, \cdots, Ma]}^{n \text{ copies (cols)}} + \overbrace{[Nb, \cdots, Nb]}^{m \text{ copies (rows)}}{}^{T}, \qquad Y_2 := M(XP_n) + (P_mX)N. \qquad (13)$$

*Then the projection $\hat{X} = \mathrm{Proj}_{C_1(a,b)}(X)$ satisfies $\hat{X} = Y_1 + (X - Y_2)$.*

The proof of Prop. 3.1 can be found in the Appendix. Wang et al. (2010) proved a similar property for the simplified setting where $a = b = \mathbf{1}$ and $X$ is symmetric, which does not hold in the optimal transport setting.
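The following sketch shows one way Algorithm 1 might be realized. Rather than transcribing (13), the projection onto $C_1(a,b)$ is computed here via the Lagrange multipliers of the two affine constraints, which yields the same Euclidean projection; `proj_C2` is left as a pluggable callable standing in for e-PAVA (Alg. 2 below). The names and the stopping rule are our assumptions, not the authors' code.

```python
# A sketch of Algorithm 1. proj_C1 is an equivalent closed form of the
# Euclidean projection onto the affine set {X : X 1_n = a, X^T 1_m = b},
# derived via Lagrange multipliers; proj_C2 is a callable projector onto
# O_{ij[k]} (e.g., an e-PAVA implementation).
import numpy as np

def proj_C1(X, a, b):
    """Euclidean projection onto {X : X 1_n = a, X^T 1_m = b}."""
    m, n = X.shape
    ra = a - X.sum(axis=1)                 # row-sum residual
    rb = b - X.sum(axis=0)                 # column-sum residual
    u = ra / n                             # row correction
    v = (rb - ra.sum() / n) / m            # column correction (gauge-fixed)
    return X + u[:, None] + v[None, :]

def admm_ot_oc(D, a, b, proj_C2, rho=1.0, iters=10_000, tol=1e-4):
    m, n = D.shape
    Z = np.zeros((m, n))                   # Z_0 = 0
    M = np.zeros((m, n))                   # scaled dual M_0 = 0
    X = Z
    for _ in range(iters):
        X = proj_C1(Z - M - D / rho, a, b)     # Line 2
        Z = proj_C2(X + M)                     # Line 3
        M = M + X - Z                          # Line 4, scaled dual ascent
        if np.linalg.norm(X - Z) < tol:        # small primal residue
            break
    return X
```

One can verify directly that `proj_C1(X, a, b)` returns a matrix with row sums `a` and column sums `b`, and that the correction `u[:, None] + v[None, :]` lies in the span of the constraint gradients, which is what makes it the orthogonal projection.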
**Algorithm 2** e-PAVA for $C_2 = O_{ij_{[k]}}$ for $k \in [mn]$.

**Require:** $X \in \mathbb{R}^{m \times n}$. Indices $ij_{[k]}$.
1. $\ell := 1$, $B := 1$, $\mathrm{le}[1] := \mathrm{ri}[1] := 1$, $\mathrm{val}[1] := T(0)$.
2. **for** $\ell \le k$ **do**
3. &nbsp;&nbsp;$B := B + 1$, $\ell := \ell + 1$, $\mathrm{le}[B] := \mathrm{ri}[B] := \ell$, $\mathrm{val}[B] := X_{i_\ell j_\ell}$.
4. &nbsp;&nbsp;**for** $B \ge 2$ and $\mathrm{val}[B] \ge \mathrm{val}[B-1]$ **do**
5. &nbsp;&nbsp;&nbsp;&nbsp;Let $q = \mathrm{ri}[B]$.
6. &nbsp;&nbsp;&nbsp;&nbsp;**if** $B = 2$ **then**
7. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Solve and store $\bar{\eta} \ge 0$ satisfying $T(\bar{\eta}) = \overline{2q} + \bar{\eta}/(q-1)$. Set $\mathrm{val}[B-1] := T(\bar{\eta})$.
8. &nbsp;&nbsp;&nbsp;&nbsp;**else**
9. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Set $\mathrm{val}[B-1] := \overline{pq}$ for $p = \mathrm{le}[B-1]$.
10. &nbsp;&nbsp;&nbsp;&nbsp;Set $\mathrm{ri}[B-1] := \mathrm{ri}[B]$. Decrement $B := B - 1$.
11. **Return** $B$, $\bar{\eta}$, le, ri, and val.

Next, we address computing $\mathrm{Proj}_{C_2}(X)$ for $C_2 = O_{ij_{[k]}}$ for the values of $k$ of interest. For the special case $k = mn$, Grotzinger & Witzgall (1984) defined the so-called Pool Adjacent Violators Algorithm (PAVA) that solves $\mathrm{Proj}_{C_2}(X)$ for $C_2 = O_{ij_{[mn]}}$. We cannot directly make use of PAVA for OT with Order Constraints because we require only a small number, $k \ll mn$, of order constraints, and not the full set, since the OT itself should learn the ordering of the other coefficients. However, the idea behind PAVA is of interest (a minimal sketch of classic PAVA follows Prop. 3.2 below). To this end, we extend PAVA and call this extension e-PAVA. e-PAVA projects onto $C_2 = O_{ij_{[k]}}$ for general $k$ and is presented in Alg. 2. e-PAVA makes use of two quantities defined below; the proof of correctness is given in the Appendix.

We define a block $B$ to be a partition of indices. Let $\mathrm{le}[B]$ denote the left and $\mathrm{ri}[B]$ the right boundary of $B$; $\mathrm{val}[B]$ is the value of its projection. For any $\eta \ge 0$, let $r(\eta)$ denote $\mathrm{rank}(X_{i_1j_1} - \eta)$ with respect to the decreasing order $X_{(1)} \ge X_{(2)} \ge \cdots \ge X_{(mn-k+1)}$ over all variates in $V \cup \{i_1j_1\}$, for $V$ in (2). We require the averages $\overline{pq} := \sum_{\ell=p}^{q} X_{i_\ell j_\ell}/(q - p + 1)$ and a threshold function $T(\cdot) := (\tau(t(\eta), \eta))_+$, where $\tau(s, \eta)$ for any $0 \le s < r(\eta)$ and the integer $t(\eta) \ge 0$ are defined³ as:

$$\tau(s, \eta) = \frac{1}{s+1}\left(\sum_{\ell=1}^{s} X_{(\ell)} + X_{i_1j_1} - \eta\right), \qquad (14)$$
$$t(\eta) + 1 = \arg\min\left\{s \in [r(\eta)] : \tau(s, \eta) > X_{(s)}\right\}, \qquad (15)$$

or $t(\eta) + 1 = r(\eta)$ whenever the set in (15) is empty.

³The sum in (14) equals 0 when $s = 0$.

**Proposition 3.2 (Projection onto $C_2$).** *For $C_2 = O_{ij_{[k]}}$ for any $i_\ell j_\ell \in [mn]$ where $\ell \in [k]$, consider the Euclidean projector $\mathrm{Proj}_{C_2}(X)$ for any $X \in \mathbb{R}^{m \times n}$. Let $T(\eta) := (\tau(t(\eta), \eta))_+$ for $\tau, t$ in (14) and (15) and $V$ in (2). Then for any $X \in \mathbb{R}^{m \times n}$, e-PAVA will successfully terminate with some $B$, $\bar{\eta}$, le, ri, and val. Furthermore, the projection $\hat{X} = \mathrm{Proj}_{C_2}(X)$ satisfies: i) for $pq \in V$ we have $\hat{X}_{pq} = T(\bar{\eta}) = \mathrm{val}[1]$ if $\mathrm{rank}(X_{pq}) \le t(\bar{\eta})$, or $\hat{X}_{pq} = (X_{pq})_+$ otherwise; and ii) for $\ell \in [k]$ we have $\hat{X}_{i_\ell j_\ell} = \mathrm{val}[B']$ iff $\mathrm{le}[B'] \le \ell \le \mathrm{ri}[B']$ for some $B' \in [B]$.*
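For intuition about the pooling step that e-PAVA inherits, here is a minimal sketch of classic PAVA for projecting a vector onto the descending-order cone $\{y : y_1 \ge \cdots \ge y_N\}$ (the $k = mn$ case of Grotzinger & Witzgall (1984)). The threshold block $T(\eta)$ that e-PAVA maintains for the $mn - k$ unconstrained coefficients is not reproduced here; the function name is ours.

```python
# Classic PAVA: Euclidean projection of x onto descending order, by
# repeatedly pooling ("averaging") adjacent blocks that violate it.
import numpy as np

def pava_descending(x):
    blocks = []                    # each block stored as (sum, count)
    for v in x:
        s, c = v, 1
        # pool while this block's average exceeds its left neighbour's
        while blocks and s / c >= blocks[-1][0] / blocks[-1][1]:
            ps, pc = blocks.pop()
            s, c = s + ps, c + pc
        blocks.append((s, c))
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

print(pava_descending(np.array([3.0, 1.0, 2.0, 0.5])))
# -> [3.  1.5 1.5 0.5], the projection onto {y1 >= y2 >= y3 >= y4}
```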
**Convergence Analysis** Standard ADMM convergence results hold as follows. As $t \to \infty$, the primal residue $\|X_t - Z_t\|_F \to 0$ and the dual residue $\rho\|Z_{t-1} - Z_t\|_F \to 0$, which can be used to show asymptotic convergence of $f(X_t)$ to the optimum (see Boyd, 2010, p. 17). The iteration complexity of Alg. 1 can be obtained from the following⁴ non-asymptotic result (see Beck, 2017, Thm 15.4). As (10)–(11) are both linear, if $C_1(a, b) \cap C_2 \neq \emptyset$, then an optimal $Y^* = \arg\max_{Y \in \mathbb{R}^{m \times n}} d(Y)$ exists (Boyd & Vandenberghe, 2004) from solving the dual

$$d(Y) := \min_{X \in C_1(a,b),\, Z \in C_2} \mathrm{tr}\left(X^T(D + Y)\right) - \mathrm{tr}\left(Y^TZ\right). \qquad (16)$$

Recall that Proposition 2.1 gives a sufficient condition for $C_1(a, b) \cap C_2 \neq \emptyset$.

**Theorem 3.3 (Convergence of Algorithm 1, Beck (2017)).** *Let $X_t, Z_t, M_t$ be the sequences of iterates from Algorithm 1 with penalty $\rho > 0$, and let $\bar{X}_t = (t+1)^{-1}\sum_{\ell=1}^{t+1} X_\ell$ and similarly $\bar{Z}_t$. Assume $C_1(a, b) \cap C_2 \neq \emptyset$, and let $f^*$ and $\Pi^*$ denote the optimal value and solution, respectively, of (6), and $Y^* = \arg\max_{Y \in \mathbb{R}^{m \times n}} d(Y)$ for $d(Y)$ in (16). Then we have the convergence bounds*

$$f(\bar{X}_t) - f^* \le \frac{\rho c_1}{2(t+1)}, \qquad \left\|\bar{X}_t - \bar{Z}_t\right\|_F \le \frac{\rho c_1}{c_2(t+1)},$$

*where $c_1 \ge \|Z_0 - \Pi^*\|_F^2 + \left(\|M_0\|_F + c_2/\rho\right)^2$ and $c_2 \ge 2\|Y^*\|_F$.*

**Proposition 3.4 (Iteration complexity).** *Algorithm 1, given linear costs $D$ and penalty parameter $\rho > 0$, initialized with $Z_0 = M_0 = 0$, achieves error $f(\bar{X}_t) - f^* \le \delta$ in $O(\|D\|/\delta)$ iterations.*

**Proof.** We seek to estimate $\rho c_1$, which bounds the error $f(\bar{X}_t) - f^*$ in Thm. 3.3. First, consider the constant $c_1$. Setting $Z_0 = M_0 = 0$ gives $\|Z_0 - \Pi^*\|_F^2 \le 1$ and $\|M_0\|_F = 0$; then Thm. 3.3 requires $c_1 \ge 1 + (c_2/\rho)^2$, hence if we set the penalty $\rho = c_2$ then $c_1 \le 2$. Next, it remains to estimate $\rho$, or equivalently, $c_2$. Thm. 3.3 requires $c_2$ to be at least the largest possible norm that an optimizer of the dual (16) can take. To this end, writing $Y^*(D')$ for the optimizer of (16) given costs $D'$, it suffices to consider a constant $c$ satisfying $c \ge \|Y^*(D')\|_F$ for any normalized⁵ cost $D'$ with $\|D'\| = 1$, and then to pick $c_2$ such that $c_2 \ge 2\|D\|c$. Hence we arrive at an estimate for $\rho c_1$ in $O(\|D\|)$. Thus we conclude that the error is at most δ under the claimed number of iterations. ∎

⁴Use Thm 15.4 of Beck (2017) with $G, Q$ set to zero, and check that Assumption 15.2 holds for $f(X)$ and $C_1(a, b)$, $C_2$.

⁵The dual $d$ in (16) is scale-invariant, in the sense that if $Y$ optimizes (16) for costs $D$, then for any $\alpha > 0$, $\alpha Y$ optimizes the same for costs $\alpha D$.

Prop. 3.4 holds for the choice⁶ $\rho = c_2$; however, $c_2$ is unknown in practice, and the common adage is to simply set $\rho = 1$ (see Boyd et al., 2011). Invoking both Propositions 3.4 and 3.5 below, we estimate the total operation complexity as $O\!\left(\frac{\|D\|}{\delta}\, mn \log mn\right)$.

⁶For the choice $\rho = c_2$, it follows that the error $\|\bar{X}_t - \bar{Z}_t\|_F$ in Thm. 3.3 also drops linearly with $t$.

**Proposition 3.5.** *One round of Algorithm 1 runs in $O(mn\log(mn))$ time.*

**Proof.** Line 2 costs $O(mn)$ for $C_1(a, b)$, Line 3 costs $O(k + (mn-k)\log(mn-k))$ for $C_2$, and the dual update in Line 4 requires only $O(mn)$. For $\mathrm{Proj}_{C_1(a,b)}(\cdot)$, the expressions (13) require only matrix-vector multiplications with $O(mn)$ complexity, as follows. For $Y_1$ this is clear from (13). For $Y_2$, consider the left term $M(XP_n)$; the other term then follows similarly. Putting $M = I_m - \frac{m}{m+n}P_m$ and $P_k = \frac{1}{k}\mathbf{1}_k\mathbf{1}_k^T$ for $k = m, n$, we get that $M(XP_n)$ can be expanded (ignoring scaling factors) into the two terms $X\mathbf{1}_n\mathbf{1}_n^T$ and $\mathbf{1}_m(\mathbf{1}_m^TX\mathbf{1}_n)\mathbf{1}_n^T$. Hence the claim holds, given that $\mathbf{1}_m, \mathbf{1}_n$ are vectors. For $C_2$, observe that computing $T(\eta) = \tau(t(\eta), \eta)$ multiple times in Alg. 2 (Line 7) requires a one-time sort of $mn - k + 1$ terms, and Lines 5–10 incur an additional $O(k)$ following arguments in Grotzinger & Witzgall (1984). ∎

Care should be taken in comparing our complexity, $O\!\left(\frac{\|D\|}{\delta}\, mn \log mn\right)$, to that of OT methods that use rounding approximations (Altschuler et al., 2017) to ensure feasibility of each iterate. Such methods, which use the KL divergence, are $O\!\left(n^2\|D\|/\delta\right)$ for $m = n$ (see Guminov et al., 2021; Jambulapati et al., 2019). For order constraints, as for other structured OT approaches (e.g., Scetbon et al., 2021), a rounding method for $U(a, b) \cap O_{ij_{[k]}}$ is not available. However, we have the bound on $\|\bar{X}_t - \bar{Z}_t\|_F$ in Thm. 3.3 to quantify the feasibility of the iterates $\bar{X}_t, \bar{Z}_t$ with respect to $C_1(a, b) \cap C_2$.

## 4. Explainability via Branch-and-Bound

We have thus far assumed that the structure captured by the optimal transport plan via order constraints is provided externally. In some settings, the structure is not provided and one must estimate the most important variates to define the corresponding constraints. With the goal of generating plans from which one can select, we propose an explainable and efficient approach using branch & bound to compute a diverse set of optimal transport plans. The approach successively computes a bound on the best score that can be obtained with the variates currently fixed; if it cannot improve the best known score, the branch is cut and the next branch, i.e., the next set of fixed variates, is explored.
Consider the $k$ order constraints (2) enforced by the variates $ij_{[k]}$, and further consider introducing an additional variate $ij \in [mn]$ such that $\Pi_{i_kj_k} \ge \cdots \ge \Pi_{i_1j_1} \ge \Pi_{ij} \ge \Pi_{pq}$ for $pq \in V$; see (2). This manner of introducing variates one at a time can be likened to variable selection, and is formalized by a tree structure. On the tree $\mathcal{T}$, each node $ij_{[k]}$ has children $i'j'_{[k+1]} = ij, i_1j_1, i_2j_2, \ldots, i_kj_k$ that share the same top-$k$ constraints; the root node corresponds to unconstrained OT and has children at level $k = 1$ corresponding to single order constraint variates $ij_{[1]} = ij$. We would like the procedure to always select all ancestors of a given node before selecting the node itself. Hence, the root node should be selected first.

The branch selection, aimed at increasing the diversity of the plans, is guided by two threshold parameters $\tau_1, \tau_2$ that lie between 0 and 1. For any given node $ij_{[k]}$, the parameters $\tau_1, \tau_2$ limit its children to those that deliver only reasonably likely transport plans, as follows. Suppose that $\Pi \in U(a, b)$ is the solution of (6) obtained from the variates $ij_{[k]}$. To determine the children of $ij_{[k]}$, first compute saturation levels, normalized between 0 and 1, given as $\phi_{ij} = \Pi_{ij}/\min(a_i, b_j)$, to derive

$$\left(\phi^s_{ij},\, \phi^r_{ij},\, \phi^c_{ij}\right) = \left(\phi_{ij},\ \max_{\ell \in [n]:\, \ell \neq j}\phi_{i\ell},\ \max_{\ell \in [m]:\, \ell \neq i}\phi_{\ell j}\right), \qquad \Phi_{ij} := \min\left(\phi^r_{ij}, \phi^c_{ij}\right). \qquad (17)$$

Low saturation values in (17) imply uncertainty in the assignments $\Pi$. The thresholds $\tau_1, \tau_2$ are used to determine the set $I(\Pi)$ that filters those variates $ij$ with low self saturation ($\tau_1$) and low neighborhood saturation ($\tau_2$):

$$I(\Pi) := \left\{ij \in [mn] : \phi^s_{ij} \le \tau_1,\ \Phi_{ij} \le \tau_2\right\}. \qquad (18)$$

The children of $ij_{[k]}$ are obtained as $i'j'_{[k+1]} = ij, ij_{[k]}$ for all $ij \in I(\Pi)$ (a code sketch of (17)–(18) follows Alg. 3 below). The proposed explainable approach adds diversity by learning the top plans. Let us define $k_1, k_2, k_3$ as follows: the number of nodes constructed while learning a subtree is at most $k_1$, the number of top plans to retain is $k_2$, and $k_3$ limits the tree depth. The branch-and-bound search proceeds on the reduced tree $\mathcal{T}(k_3, \tau_1, \tau_2)$, of depth $k_3$, containing the nodes (18) whose saturations lie beneath the thresholds $\tau_1, \tau_2$. We thus wish to learn the top-$k_2$ plans given by the subtree $\widehat{\mathcal{T}}(k_1, k_2, k_3, \tau_1, \tau_2)$, and to identify the nodes of $\mathcal{T}(k_3, \tau_1, \tau_2)$ that are in 1-to-1 correspondence with those top-$k_2$ plans.

**Algorithm 3** Learning the subtree $\widehat{\mathcal{T}}(k_1, k_2, k_3, \tau_1, \tau_2)$ of $\mathcal{T}(k_3, \tau_1, \tau_2)$ and the top-$k_2$ candidate plans for linear costs $f(\Pi) = \mathrm{tr}(D^T\Pi)$.

**Require:** Costs $D$, thresholds $0 \le \tau_1, \tau_2 \le 1$. Search upper limit $k_1$, number of top candidates $k_2$, and search depth $k_3 \le \min(m, n)$.
1. Compute $\hat{\Pi}_1$ using (1). Initialize $\widehat{\mathcal{T}}(k_1, k_2, k_3, \tau_1, \tau_2)$.
2. Use $\hat{\Pi}_1, \tau_1, \tau_2$ in (17) and (18) to obtain $I$ and $\Phi_{ij}$. Initialize $S = \{(ij, \Phi_{ij}) : ij \in I\}$, count $= 0$.
3. **for** count $< k_1$ **do**
4. &nbsp;&nbsp;Pop $ij_{[k]}$ having the smallest $\Phi$ in $S$, for some $k$ constraints. Compute $L$ from the right-hand side of (23).
5. &nbsp;&nbsp;**if** $\hat{\Pi}_{k_2}$ is not yet obtained or $L < f(\hat{\Pi}_{k_2})$ **then**
6. &nbsp;&nbsp;&nbsp;&nbsp;Solve Algorithm 1 with order constraint $O_{ij_{[k]}}$ for a new candidate $\hat{\Pi}$. Set count $\mathrel{+}= 1$.
7. &nbsp;&nbsp;&nbsp;&nbsp;Update the top-$k_2$ candidates $\hat{\Pi}_1, \hat{\Pi}_2, \ldots, \hat{\Pi}_{k_2}$ and $\widehat{\mathcal{T}}(k_1, k_2, k_3, \tau_1, \tau_2)$ using the new candidate $\hat{\Pi}$.
8. &nbsp;&nbsp;&nbsp;&nbsp;**if** $k$ equals $k_3$ **then**
9. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Go to line 4.
10. &nbsp;&nbsp;&nbsp;&nbsp;**if** $\hat{\Pi}_{k_2}$ not yet obtained or $f(\hat{\Pi}) < f(\hat{\Pi}_{k_2})$ **then**
11. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Use $\hat{\Pi}, \tau_1, \tau_2$ in (17) and (18) to obtain new variates $ij \in I(\hat{\Pi})$ and $\{\Phi_{ij}\}_{ij \in I(\hat{\Pi})}$.
12. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**for** each variate $ij$ in $I(\hat{\Pi})$ **do**
13. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**if** $i \notin i_{[k]}$ and $j \notin j_{[k]}$ **then**
14. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Push $(ij, i_1j_1, \ldots, i_kj_k, \Phi_{ij})$ onto the stack $S$.
15. **Return** the top-$k_2$ candidates $\hat{\Pi}_1, \hat{\Pi}_2, \ldots, \hat{\Pi}_{k_2}$ and $\widehat{\mathcal{T}}(k_1, k_2, k_3, \tau_1, \tau_2)$.
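As a concrete reading of (17)–(18), the sketch below computes the saturation levels of a plan and returns the candidate variates of $I(\Pi)$ ordered by neighborhood saturation, since Line 4 of Alg. 3 pops the smallest $\Phi$ first. The function name and the ordering convention are our assumptions.

```python
# A sketch of the saturation quantities (17) and candidate set (18);
# Pi is a transport plan in U(a, b). Names are ours.
import numpy as np

def candidate_variates(Pi, a, b, tau1=0.5, tau2=0.5):
    phi = Pi / np.minimum.outer(a, b)      # phi_ij = Pi_ij / min(a_i, b_j)
    m, n = Pi.shape
    I = []
    for i in range(m):
        for j in range(n):
            phi_r = max(phi[i, l] for l in range(n) if l != j)  # row nbrs
            phi_c = max(phi[l, j] for l in range(m) if l != i)  # col nbrs
            Phi = min(phi_r, phi_c)        # neighborhood saturation
            if phi[i, j] <= tau1 and Phi <= tau2:               # set (18)
                I.append(((i, j), Phi))
    return sorted(I, key=lambda t: t[1])   # smallest Phi popped first
```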
The tree $\mathcal{T}(k_3, \tau_1, \tau_2)$ is constructed dynamically, and avoids redundantly computing the plans $\Pi$ corresponding to variates $ij_{[k]}$ if the $k_2$-best candidate has cost lower than either: i) a known lower bound $L \le f(\Pi)$, or ii) the cost of the plan of the parent of $ij_{[k]}$. Finally, in the construction of $\mathcal{T}(k_3, \tau_1, \tau_2)$ we only consider variates where $i_{[k]}$ and $j_{[k]}$ do not repeat indices. Alg. 3 summarizes the branch-and-bound variable selection procedure. Upon termination, a diverse set of at most $k_1$ plans has been computed, and the $k_2$ plans identified by $\widehat{\mathcal{T}}(k_1, k_2, k_3, \tau_1, \tau_2)$ are the top-$k_2$ amongst these $k_1$ plans, as the most uncertain variates are handled first.

It remains to define the lower bound $L \le f(\Pi)$ (Line 4); Kusner et al. (2015) proposed a bound for (1), which we extend to the order-constrained case (6) as follows:

$$\min_{\Pi \in U(a,b) \cap O_{ij_{[k]}}} f(\Pi) \ \ge\ \min_{\alpha \le x \le \beta} \left( G_{ij_{[k]}}\, x + g_{ij_{[k]}}(x, D) \right) \qquad (19)$$

for some suitable $\alpha, \beta$, where $V$ is as in (2), $G_{ij_{[k]}} = \sum_{\ell \in [k]} D_{i_\ell j_\ell}$, and, for coefficients $\phi \in \mathbb{R}^{m \times n}$,

$$g_{ij_{[k]}}(x, \phi) := \min_{\substack{\Pi \in U(a,b) \cap O_{ij_{[k]}}:\\ \Pi_{i_\ell j_\ell} = x,\ \ell \in [k]}} \ \sum_{pq \in V} \Pi_{pq}\,\phi_{pq}.$$

We decouple the row and column constraints in the minimization defining $g_{ij_{[k]}}$ and solve the packing problem (a greedy code sketch is given below, after Table 1):

$$\mathrm{PACKING}\left(\{\phi_i\}_{i \in [n]}, u, \alpha\right) := \min \sum_{i \in [n]} \phi_i x_i \quad \text{s.t.} \quad \sum_{i \in [n]} x_i = \alpha,\ \ 0 \le x_i \le u. \qquad (20)$$

Here $u$ and $\alpha$ in (20) represent the per-item capacity of $x_i$ and the total budget, respectively. Define

$$\mu\left(u, \{\phi_i\}_{i \in [n]}, \alpha\right) = \mathrm{PACKING}\left(\{\phi_i\}_{i \in [n]}, u, \alpha\right), \qquad \nu\left(u, \{\phi_i\}_{i \in [n-1]}, \alpha\right) = \mathrm{PACKING}\left(\{\phi_i\}_{i \in [n-1]}, u, \alpha - u\right). \qquad (21)$$

**Proposition 4.1.** *Let $\alpha_1 = \frac{\max_{i=1}^m a_i}{n}$, $\beta_1 = \max_{i=1}^m a_i$, and $\alpha_2 = \frac{\max_{j=1}^n b_j}{m}$, $\beta_2 = \max_{j=1}^n b_j$. Then for any set $ij_{[k]}$ enforcing (6) where $i_{[k]} = i_1, i_2, \ldots, i_k$ (and $j_{[k]}$) do not repeat row (or column) indices, the quantity $g_{ij_{[k]}}(x, \phi)$ appearing in (19) is lower-bounded by*

$$g_{ij_{[k]}}(x, \phi) \ \ge\ \begin{cases} L^1_{ij_{[k]}}(x, \phi, a), & \alpha_1 \le x \le \beta_1, \\ L^2_{ij_{[k]}}(x, \phi, b), & \alpha_2 \le x \le \beta_2, \end{cases} \qquad (22)$$

*where $L^1_{ij_{[k]}}(x, \phi, a)$ and $L^2_{ij_{[k]}}(x, \phi, b)$ equal, resp.,*

$$\sum_{\ell \in [k]} \nu\!\left(x, \{\phi_{i_\ell q}\}_{q \in [n] \setminus \{j_\ell\}}, a_{i_\ell}\right) + \sum_{p \notin i_{[k]}} \mu\!\left(x, \{\phi_{pq}\}_{q \in [n]}, a_p\right)$$

*and*

$$\sum_{\ell \in [k]} \nu\!\left(x, \{\phi_{pj_\ell}\}_{p \in [m] \setminus \{i_\ell\}}, b_{j_\ell}\right) + \sum_{q \notin j_{[k]}} \mu\!\left(x, \{\phi_{pq}\}_{p \in [m]}, b_q\right).$$

The proof is in the Appendix. Setting $\phi = D$ in (22), from (19) and Prop. 4.1 we obtain:

$$\min_{\Pi \in U(a,b) \cap O_{ij_{[k]}}} f(\Pi) \ \ge\ \max\begin{cases} \min_{\alpha_1 \le x \le \beta_1} G_{ij_{[k]}}\, x + L^1_{ij_{[k]}}(x, D, a), \\ \min_{\alpha_2 \le x \le \beta_2} G_{ij_{[k]}}\, x + L^2_{ij_{[k]}}(x, D, b). \end{cases} \qquad (23)$$

Thus (23) provides the lower bound in Line 4 of Alg. 3.

*Figure 3: Top $k_2 = 5$ candidates for OT with multiple order constraints, for the example in Fig. 2.*

*Figure 4: Color transfer candidates obtained using OT with OC (Alg. 3).*

**Table 1:** Task and annotation scores on e-SNLI. Best F1@n is the best achievable annotation F1 score over the set of $k_2 = n$ best plans, measured by setting coefficients to 0 or 1 after determining values under which there is no impact on the task F1 (75.1 ± .3). Standard OT achieves 64.5 ± .3 annotation F1.

| Algorithm | Best F1@2 | Best F1@5 | Best F1@10 |
|---|---|---|---|
| Algorithm 3 (ours) | 68.1 ± .2 | 71.2 ± .3 | 73.7 ± .2 |
| Greedy version | 67.9 ± .3 | 68.2 ± .3 | 68.2 ± .3 |
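The packing problem (20) is a fractional min-cost fill and is solved greedily by exhausting the cheapest coefficients first; a minimal sketch follows, with $\mu$ and $\nu$ from (21) layered on top. Names are ours, and feasibility of the budget is simply asserted.

```python
# Greedy solution of PACKING (20): spend a total budget alpha across
# items with per-item capacity u, paying phi_i per unit of item i.
# Filling the cheapest items first attains the minimum.
def packing(phi, u, alpha):
    """min sum_i phi_i x_i  s.t.  sum_i x_i = alpha, 0 <= x_i <= u."""
    assert 0.0 <= alpha <= u * len(phi) + 1e-12, "infeasible budget"
    cost, remaining = 0.0, alpha
    for c in sorted(phi):              # cheapest coefficients first
        take = min(u, remaining)
        cost += c * take
        remaining -= take
        if remaining <= 0:
            break
    return cost

def mu(u, phi, alpha):                 # mu in (21)
    return packing(phi, u, alpha)

def nu(u, phi, alpha):                 # nu in (21): budget reduced by u
    return packing(phi, u, alpha - u)
```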
## 5. Experimental Results

**Sentence Relationship Classification** We use an annotated dataset from the enhanced Stanford Natural Language Inference corpus (e-SNLI) (Camburu et al., 2018; Swanson et al., 2020) that includes English sentence pairs classified as *entailment*, *contradiction* or *neutral*. Annotations denote which words were used by humans to determine the class. We use Alg. 3 to compute $k_2$ candidate plans, which are measured against the annotations after being binarized: values above a threshold are set to 1, and values below to 0. The threshold is determined as the one that maintains the reported task F1 score when the plan weights that lie below the threshold are set to zero. We determine the $(\tau_1, \tau_2)$ that constrain $\mathcal{T}(k_3, \tau_1, \tau_2)$ to be $(\tau_1, \tau_2) = (.5, .5)$. Annotations are provided separately on the source and target in each pair; each candidate plan is marginalized using max across columns/rows to arrive at a vector of coefficients, which is used to compute an annotation F1 score against the annotations (after application of the threshold; a scoring sketch is given at the end of this subsection). The top $k_2 = n$ plans are scored using the Best F1@n metric, which reports the score of the best plan in the learnt subtree $\widehat{\mathcal{T}}(k_1, k_2, k_3, \tau_1, \tau_2)$. More details can be found in the Appendix.

Table 1 shows the results in terms of the classification task and annotation accuracy, along with confidence intervals. Using even a single order constraint, i.e., $k_3 = 1$, the tree search in Alg. 3 achieves significantly improved explainability using a modest $k_1 = 20$ candidates, giving higher annotation scores over the set of plans. As a baseline against which to compare, we implement a greedy algorithm by restricting the set $I(\Pi)$ of variates in (18) to contain a single variate, i.e., modifying Lines 2 and 11 of Alg. 3. The greedy algorithm retains the variate $ij \in I(\Pi)$ with the lowest saturation (17), thus corresponding to a single tree path. For $n \ge 5$, the greedy algorithm results in a candidate set lacking diversity, as evidenced by its stagnating Best F1@n scores compared to Alg. 3.

Fig. 3 shows the search tree $\mathcal{T}(k_3, \tau_1, \tau_2)$ (blue), the learnt subtree $\widehat{\mathcal{T}}(k_1, k_2, k_3, \tau_1, \tau_2)$ (red) and the corresponding candidate plans for the example of Fig. 2, for the top $k_2 = 5$ candidates and depth increased to $k_3 = 3$. See the Supplementary Material for a walkthrough of Alg. 3 using Fig. 3, and for a study of the effectiveness of the lower bound using a swarmplot of the number of nodes skipped in the search tree $\mathcal{T}(k_3, \tau_1, \tau_2)$ as the tree depth $k_3$ increases. The numbers in Fig. 3 indicate the ranking in the top-$k_2$ order upon termination. The diversity provided by Alg. 3 allows the 3rd-best solution (in red) to be included in the set of plans, and it is that plan that best explains the contradiction in the two sentences.
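A sketch of how a candidate plan can be scored against the e-SNLI annotations as described above: marginalize with max, binarize at the threshold (scaled by $1/(mn)$, following the appendix), and take an F1 against the binary gold annotations. The helper names and the simple averaging of source and target F1 scores are our assumptions.

```python
# Annotation scoring sketch for Best F1@n; the threshold scaling T/(mn)
# follows Appendix A.1, other conventions are illustrative assumptions.
import numpy as np

def f1(pred, gold):
    tp = np.sum(pred & gold)
    if tp == 0:
        return 0.0
    prec, rec = tp / pred.sum(), tp / gold.sum()
    return 2 * prec * rec / (prec + rec)

def annotation_f1(Pi, src_gold, tgt_gold, thresh=2.0):
    m, n = Pi.shape
    src_act = Pi.max(axis=1) > thresh / (m * n)   # marginalize over rows
    tgt_act = Pi.max(axis=0) > thresh / (m * n)   # marginalize over cols
    return 0.5 * (f1(src_act, src_gold) + f1(tgt_act, tgt_gold))

# Best F1@n over the candidate plans returned by Alg. 3 (illustrative):
# best = max(annotation_f1(P, s, t) for P in candidates)
```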
**Color Transfer** We use source images from the SUN dataset (Xiao et al., 2010; Yu et al., 2016) and target images from WikiArt (Tan et al., 2016). The images are segmented, and the average RGB colors in each segment are used to obtain the distributions $a, b$ and costs $D$ in the OT formulations (1) and (3). The thresholds that constrain $\mathcal{T}(k_3, \tau_1, \tau_2)$ are set to $(\tau_1, \tau_2) = (.5, 1.)$ here. Further details on the problem setup can be found in the Appendix.

Results are shown in Fig. 4. The first OT with OC plan includes a prior (human-crafted) constraint, while the latter two provide the auto-generated constraints from Alg. 3. The OT with OC solutions offer a range of images, all of which improve upon the standard OT solution in diverse ways. In the first row, standard OT transfers the blue sky to the ground and the cylindrical roof, while the prior constraint maps the green grass from the painting to the ground and intensifies the sky with the blue. The auto-generated constraints from Alg. 3, especially in the $k_3 = 2$ case, do the same though slightly less. In the second row, standard OT awkwardly transfers the red to the grass path; the OT with OC solution with a prior constraint, as well as the auto-generated constraints of Alg. 3 with $k_3 = 2$, use the red to create an evening-sky effect. In the third row, standard OT results in a very dark cliff with minimal visible features; the OT with OC solutions lead to a well-defined cliff and more effectively use the blue from the painting's sky to deepen the sky in the source image. In addition to producing better resulting images, Alg. 3 allows human judgement to be used to make the final selection from a set of plans.

**Computational Efficiency** Fig. 5 shows how the computation time (in seconds) of Alg. 1 scales with problem size. We generate 100 random problems for $m, n \le 100$ with 1, 2, 4 and 10 order constraints. The iterations are set to terminate at $10^4$ rounds or a max projection error of $10^{-4}$, and these settings achieve an average functional approximation error of 0.51% (within ±.19). We use penalty $\rho = 1.0$ after testing a range of $\rho$'s and observing little difference, as is well known in the ADMM literature (Boyd, 2010). We compare our method against scipy.optimize. Fig. 5 shows 95% confidence intervals from 10 independent repetitions. Alg. 1 scales much better than scipy.optimize for large problems, by orders of magnitude as the problem size $mn$ increases. Note that Fig. 5 compares python-based algorithms; we also evaluated the C++-based cvxpy, and found that our algorithm performs almost indistinguishably from cvxpy for a single order constraint, which is remarkable given the overhead of python as compared to C++.

*Figure 5: Compute time of Alg. 1 compared with scipy.optimize, for various numbers of constraints.*

## 6. Conclusions

Our proposed optimal transport with order constraints allows complex structure to be incorporated into optimal transport plans and provides a set of explainable solutions from which a human can select. Future extensions of interest are as follows. Other OT solvers may be extended to incorporate order constraints; such solvers may offer different sets of plans satisfying varying objectives arising from other loss functions and various forms of regularization (see Flamary et al., 2021). Nonlinear costs, such as the submodular formulation of Alvarez-Melis et al. (2018), can be useful in some settings and may be extended to include order constraints. It would also be of interest to explore other applications that can benefit from OT, and specifically from OT with OC. Optimal Transport with Order Constraints can be found in the AI Explainability 360 toolbox, which is part of the IBM Research Trusted AI library (Arya et al., 2019) at https://github.com/Trusted-AI/AIX360.

## References

Altschuler, J., Weed, J., and Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 1961–1971, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.

Altschuler, J., Bach, F., Rudi, A., and Niles-Weed, J. Massively scalable Sinkhorn distances via the Nyström method. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Altschuler, J. M. and Boix-Adsera, E. Hardness results for multimarginal optimal transport problems. arXiv:2012.05398 [cs, math], December 2020a.
Altschuler, J. M. and Boix-Adsera, E. Polynomial-time algorithms for multimarginal optimal transport problems with structure. 2020b. URL http://arxiv.org/abs/2008.03006.

Alvarez-Melis, D., Jaakkola, T., and Jegelka, S. Structured optimal transport. In Storkey, A. and Perez-Cruz, F. (eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 1771–1780. PMLR, 09–11 Apr 2018. URL http://proceedings.mlr.press/v84/alvarez-melis18a.html.

Alvarez-Melis, D., Jegelka, S., and Jaakkola, T. S. Towards optimal transport with global invariances. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of Machine Learning Research, volume 89, pp. 1870–1879. PMLR, 16–18 Apr 2019. URL http://proceedings.mlr.press/v89/alvarez-melis19a.html.

Alvarez-Melis, D., Mroueh, Y., and Jaakkola, T. Unsupervised hierarchy matching with optimal transport over hyperbolic spaces. In International Conference on Artificial Intelligence and Statistics, pp. 1606–1617. PMLR, 2020.

Arya, V., Bellamy, R. K. E., Chen, P.-Y., Dhurandhar, A., Hind, M., Hoffman, S. C., Houde, S., Liao, Q. V., Luss, R., Mojsilović, A., Mourad, S., Pedemonte, P., Raghavendra, R., Richards, J., Sattigeri, P., Shanmugam, K., Singh, M., Varshney, K. R., Wei, D., and Zhang, Y. One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques, 2019. URL https://arxiv.org/abs/1909.03012.

Beck, A. First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, October 2017. ISBN 978-1-61197-498-0. doi: 10.1137/1.9781611974997. URL http://epubs.siam.org/doi/book/10.1137/1.9781611974997.

Blondel, M., Seguy, V., and Rolet, A. Smooth and sparse optimal transport. In Storkey, A. and Perez-Cruz, F. (eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 880–889. PMLR, 09–11 Apr 2018. URL http://proceedings.mlr.press/v84/blondel18a.html.

Boyd, S. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2010. doi: 10.1561/2200000016. URL http://www.nowpublishers.com/article/Details/MAL-016.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, January 2011. doi: 10.1561/2200000016. URL https://doi.org/10.1561/2200000016.

Boyd, S. P. and Vandenberghe, L. Convex Optimization. Cambridge University Press, Cambridge, UK; New York, 2004. ISBN 978-0-521-83378-3.

Camburu, O.-M., Rocktäschel, T., Lukasiewicz, T., and Blunsom, P. e-SNLI: Natural language inference with natural language explanations. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2017. doi: 10.1109/TPAMI.2016.2615921.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 26, pp. 2292–2300. Curran Associates, Inc., 2013.
De Leeuw, J., Hornik, K., and Mair, P. Isotone optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and active set methods. Journal of Statistical Software, 32, October 2009. doi: 10.18637/jss.v032.i05.

Di Marino, S., Gerolin, A., and Nenna, L. Optimal transportation theory with repulsive costs. 2015. URL http://arxiv.org/abs/1506.04565.

Felzenszwalb, P. F. and Huttenlocher, D. P. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, September 2004. doi: 10.1023/B:VISI.0000022288.19776.77.

Flamary, R., Courty, N., Gramfort, A., Alaya, M. Z., Boisbunon, A., Chambon, S., Chapel, L., Corenflos, A., Fatras, K., Fournier, N., Gautheron, L., Gayraud, N. T., Janati, H., Rakotomamonjy, A., Redko, I., Rolet, A., Schutz, A., Seguy, V., Sutherland, D. J., Tavenard, R., Tong, A., and Vayer, T. POT: Python Optimal Transport. Journal of Machine Learning Research, 22(78):1–8, 2021. URL http://jmlr.org/papers/v22/20-451.html.

Forrow, A., Hütter, J.-C., Nitzan, M., Rigollet, P., Schiebinger, G., and Weed, J. Statistical optimal transport via factored couplings. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2454–2465. PMLR, 2019.

Goldfeld, Z. and Greenewald, K. Gaussian-smoothed optimal transport: Metric structure and statistical efficiency. In International Conference on Artificial Intelligence and Statistics, pp. 3327–3337. PMLR, 2020.

Grotzinger, S. J. and Witzgall, C. Projections onto order simplexes. Applied Mathematics and Optimization, 12(1):247–270, October 1984. doi: 10.1007/BF01449044.

Guminov, S., Dvurechensky, P., Tupitsa, N., and Gasnikov, A. On a combination of alternating minimization and Nesterov's momentum. In Proceedings of the 38th International Conference on Machine Learning, pp. 3886–3898. PMLR, July 2021. URL https://proceedings.mlr.press/v139/guminov21a.html.

Guo, W., Ho, N., and Jordan, M. Fast algorithms for computational optimal transport and Wasserstein barycenter. In International Conference on Artificial Intelligence and Statistics, pp. 2088–2097. PMLR, 2020.

Jambulapati, A., Sidford, A., and Tian, K. A direct O(1/epsilon) iteration parallel algorithm for optimal transport. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. From word embeddings to document distances. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 957–966, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/kusnerb15.html.

Laclau, C., Redko, I., Choudhary, M., and Largeron, C. All of the fairness for edge prediction with optimal transport. In International Conference on Artificial Intelligence and Statistics, pp. 1774–1782. PMLR, 2021.

Lin, T., Ho, N., and Jordan, M. On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. In International Conference on Machine Learning, pp. 3982–3991. PMLR, 2019.
Liu, B., Rao, Y., Lu, J., Zhou, J., and Hsieh, C.-J. Multi-proxy Wasserstein classifier for image classification. Proceedings of the AAAI Conference on Artificial Intelligence, 35(10):8618–8626, May 2021. URL https://ojs.aaai.org/index.php/AAAI/article/view/17045.

Pass, B. Multi-marginal optimal transport: Theory and applications. ESAIM: M2AN, 49(6):1771–1790, 2015. doi: 10.1051/m2an/2015020. URL https://doi.org/10.1051/m2an/2015020.

Paty, F.-P., d'Aspremont, A., and Cuturi, M. Regularity as regularization: Smooth and strongly convex Brenier potentials in optimal transport. In International Conference on Artificial Intelligence and Statistics, pp. 1222–1232. PMLR, 2020.

Peyré, G. and Cuturi, M. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019. doi: 10.1561/2200000073. URL http://dx.doi.org/10.1561/2200000073.

Scetbon, M., Cuturi, M., and Peyré, G. Low-rank Sinkhorn factorization. In Proceedings of the 38th International Conference on Machine Learning, pp. 9344–9354. PMLR, July 2021. URL https://proceedings.mlr.press/v139/scetbon21a.html.

Schmitzer, B. Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM Journal on Scientific Computing, 41(3):A1443–A1481, 2019. doi: 10.1137/16M1106018.

Shi, Y., Yu, X., Liu, L., Zhang, T., and Li, H. Optimal feature transport for cross-view image geo-localization. Proceedings of the AAAI Conference on Artificial Intelligence, 34:11990–11997, April 2020. doi: 10.1609/aaai.v34i07.6875. URL https://ojs.aaai.org/index.php/AAAI/article/view/6875.

Solomon, J. Optimal transport on discrete domains. 2018. URL http://arxiv.org/abs/1801.07745.

Su, B. and Hua, G. Order-preserving Wasserstein distance for sequence matching. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2906–2914, 2017. doi: 10.1109/CVPR.2017.310.

Swanson, K., Yu, L., and Lei, T. Rationalizing text matching: Learning sparse alignments via optimal transport. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5609–5626. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.496. URL https://www.aclweb.org/anthology/2020.acl-main.496.

Tan, W. R., Chan, C. S., Aguirre, H. E., and Tanaka, K. Ceci n'est pas une pipe: A deep convolutional network for fine-art paintings classification. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 3703–3707, Phoenix, AZ, USA, September 2016. IEEE. doi: 10.1109/ICIP.2016.7533051. URL http://ieeexplore.ieee.org/document/7533051/.

Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Wang, F., Li, P., and König, A. C. Learning a bi-stochastic data similarity matrix. In 2010 IEEE International Conference on Data Mining, pp. 551–560, 2010. doi: 10.1109/ICDM.2010.141.

Wu, L., Yen, I. E.-H., Xu, K., Xu, F., Balakrishnan, A., Chen, P.-Y., Ravikumar, P., and Witbrock, M. J. Word mover's embedding: From Word2Vec to document embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4524–4534, Brussels, Belgium, October–November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1482. URL https://www.aclweb.org/anthology/D18-1482.

Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., and Torralba, A. SUN database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492, June 2010. doi: 10.1109/CVPR.2010.5539970.
Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and Xiao, J. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv:1506.03365 [cs], June 2016. URL http://arxiv.org/abs/1506.03365.

Yurochkin, M., Claici, S., Chien, E., Mirzazadeh, F., and Solomon, J. M. Hierarchical optimal transport for document representation. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 1601–1611. Curran Associates, Inc., 2019.

Zhang, Y., Cheng, X., and Reeves, G. Convergence of Gaussian-smoothed optimal transport distance with sub-gamma distributions and dependent samples. In International Conference on Artificial Intelligence and Statistics, pp. 2422–2430. PMLR, 2021.

## Appendix

The Appendix comprises two sections. Section A includes further details on the Sentence Relationship Classification and Color Transfer experiments and on the computational efficiency of our proposed OT with OC method. Section B includes the proofs of the results in the main paper.

## A. Details on Experimental Setup and Results

### A.1. Sentence Relationship Classification

For the enhanced Stanford Natural Language Inference (e-SNLI) dataset, we follow Swanson et al. (2020) to evaluate the explainability of an OT scheme, whereby annotation-based scores are measured after applying a threshold to determine the activations. The threshold is taken as the one that balances task and explainability performance; it should be sufficiently high that we do not get spurious activations, and sufficiently low that enough coefficients are preserved so as not to compromise the task scores. e-SNLI provides annotation labels to determine the explainability performance, quantified by the annotation F1. The highest annotation F1 score over the $n = k_2$ candidates of Algorithm 3 determines the Best F1@n scores. The Greedy version used as a baseline is scored in the same way. We used sizes of (100K, 10K, 5K) for train, validation, and test, respectively.

**Picking the threshold for the classification task:** This is a three-class (*entailment*, *neutral*, *contradiction*) classification task to predict logical relationships between sentence pairs. The task is performed by a shallow neural network that incorporates an OT attention module between the input and output layers.⁷ The attention layer uses OT to match source and target tokens, and matched tokens are concatenated and output via softmax. The shallow network is trained on the un-thresholded plan. Over the validation and test sets, all coefficients that lie below $T/(mn)$ are dropped when measuring the task F1. Using the validation set, we pick the threshold in the region between 0.01 and 2 where the task F1 starts to plateau due to the dropped coefficients; we use an activation threshold of 2.

⁷https://github.com/asappresearch/rationale-alignment

**Comparing activations with annotation labels:** In e-SNLI, annotations are provided separately on the source and target sentence. Swanson et al. (2020) proposed to marginalize the transport plan $\Pi \in U(a, b)$, forming two vectors for comparison, by applying max to each row and column; we found that this method overlooks the coupling in the transport plan $\Pi$, and we propose instead taking max only over the row/column positions that have annotation labels.
On the other hand, we cannot then utilize *neutral* examples for measuring annotation F1 (since these examples are missing row annotations), but we do not consider this a drawback of our proposal, because Camburu et al. (2018) indicated that the annotations for the neutral examples are curated in a somewhat inconsistent manner relative to the other example classes; we thus omit them from the annotation scores.

**Hyperparameters:** Alg. 3 takes a standard OT baseline $\Pi \in U(a, b)$ to generate the variate set $I(\Pi)$; see (18). We use a regularized OT (the hyper-parameter-less mirror descent version of standard OT (see Scetbon et al., 2021) with 20 iterations) to obtain $\Pi$. The standard OT solution lies on a polytope and has at most $m + n$ non-zeros out of the $mn$ locations (Peyré & Cuturi, 2019). The regularized version produces more non-zeros, which give more information for seeking unsaturated locations in (18). For the standard OT attention network we do the same with a lower 5 iterations to speed up training.

We obtain the $(\tau_1, \tau_2)$ that constrain $\mathcal{T}(k_3, \tau_1, \tau_2)$ as follows. First compute transport plans $\Pi \in U(a, b)$ by solving (1), using the regularized version described above, and compute $\phi^s_{ij}$ and $\Phi_{ij}$ using (17). The polytope $U(a, b)$ constrains the points $(\phi^s_{ij}, \Phi_{ij})$ to lie in a lower triangular region bounded by the horizontal and vertical axes and a line that runs through (1, 0) and (0, 1). The plan coefficients that are uncertain will lie in a box region bounded by the axes $\tau_1$ and $\tau_2$. Using SNLI annotations as labels that indicate importance, we chose $(\tau_1, \tau_2) = (.5, .5)$. We only consider variates $ij$ in (18) if both $i, j$ do not correspond to stop-words. The transport plan is marginalized using max across columns/rows to arrive at a vector of coefficients, and we compute an annotation F1 score against the annotations. The top $k_2 = n$ plans are scored using the Best F1@n metric, which reports the score of the best plan in the learnt subtree $\widehat{\mathcal{T}}(k_1, k_2, k_3, \tau_1, \tau_2)$.

**Walkthrough of Alg. 3 using Fig. 3:** At Line 1, the root node (labeled R) and its candidate plan $\hat{\Pi}_1$ are computed, and $\widehat{\mathcal{T}}(k_1, k_2, k_3, \tau_1, \tau_2)$ is initialized. At Line 2, the single order constraint nodes are constructed and pushed to the stack $S$ along with the saturations $\Phi_{ij}$ using (17). At Line 4, a variate $ij$ corresponding to a single order constraint (e.g., *piece*-*pair*) is popped off, and at Line 6 a new candidate $\hat{\Pi}_2$ is computed. At Line 7, this candidate is identified by adding $ij$ to $\widehat{\mathcal{T}}(k_1, k_2, k_3, \tau_1, \tau_2)$, which now has two nodes. Then at Lines 11–14, new depth-2 nodes $ij_{[2]} = i'j', ij$ are computed from $\hat{\Pi}_2$ and pushed to $S$ to be considered next in $\mathcal{T}(k_3, \tau_1, \tau_2)$ (e.g., *piece*-*pair* followed by *clothing*-*pants*). Lines 3–14 iterate to update $\widehat{\mathcal{T}}(k_1, k_2, k_3, \tau_1, \tau_2)$ and the candidate set. Once $\widehat{\mathcal{T}}(k_1, k_2, k_3, \tau_1, \tau_2)$ grows to $k_2$ nodes, the lower bound test at Lines 5 and 11 prunes redundant nodes of $\mathcal{T}(k_3, \tau_1, \tau_2)$. The iterations terminate when count reaches $k_1$.

**Hardware and compute times:** Classifier training was performed on a multi-core Ubuntu virtual machine and on nVidia Tesla P100-PCIE-16GB GPUs. For the shallow neural nets, GPU memory consumption was found to be under 1 GB. Training times for 10 epochs (around 5 million sample runs) were roughly 21 hours.
Algorithm 3 took about 16 hours to complete over the test set.

### A.2. Color Transfer

For the source image, we use Felzenszwalb & Huttenlocher (2004) to preserve pixel locality (see Alvarez-Melis et al., 2018), and for the target we perform RGB color segmentation via k-means, in the range of 5–10 color clusters. We illustrate the results of Alg. 3 on three examples from the learnt subtrees $\widehat{\mathcal{T}}(20, 5, 1, \tau_1, \tau_2)$ and $\widehat{\mathcal{T}}(40, 10, 2, \tau_1, \tau_2)$, where in this set of experiments we chose $(\tau_1, \tau_2) = (0.5, 1.0)$. Here we have no labels with which to tune $(\tau_1, \tau_2)$ as we did for e-SNLI. The neighbour saturation metric $\Phi_{ij}$ (see (17)) had less effect in this application; therefore, it sufficed here to use non-regularized OT to compute the base plan $\Pi \in U(a, b)$. We use the 5 largest source image segments and the 2 largest target image segments in terms of $D_{ij}$; that is, we only consider source image segments large enough to be visually significant, and target image segments with contrasting colors. The WikiArt dataset can be found at https://github.com/cs-chan/ArtGAN/tree/master/WikiArt%20Dataset.

### A.3. Computational Efficiency

**Computation issues:** Each computation run of Alg. 1 is measured on a single 64-bit Intel x86 Xeon core (2 GHz) with 12 GB memory per core. Also, since the python library scipy.optimize depends on numpy⁸ and can run numpy with multiple threads, for a fair comparison with our (single-threaded) implementation of Alg. 1, we disabled all parallelism.

⁸https://numpy.org/

**Effectiveness of the lower bound (23) in Alg. 3:** Figure 6 demonstrates the effectiveness of the lower bound (23) by showing the number of search tree nodes in $\mathcal{T}(k_3, \tau_1, \tau_2)$ that were skipped while performing the e-SNLI experiment, for various depths $k_3 = 1, 2, 3$ and a large candidate size $k_1 = 40$. The bounds prove effective in skipping more nodes as the size of $\mathcal{T}(k_3, \tau_1, \tau_2)$ increases with the depth $k_3$.

*Figure 6: Effectiveness of the lower bound (23) in terms of nodes skipped in $\mathcal{T}(k_3, \tau_1, \tau_2)$ for various depths $k_3 = 1, 2, 3$.*

## B. Proofs

First we provide the proofs of the propositions concerning the projections, Proposition 3.2 and Proposition 3.1. Secondly, we cover the bounds, giving the proof of correctness of Proposition 4.1.

### B.1. Projections

The projection $\mathrm{Proj}_{C_2}(X) = \arg\min_{Y \in C_2} \|Y - X\|_F^2$ involves $mn - k$ constraints $Y_{pq} \le Y_{i_1j_1}$ for $pq \in V$ (with $V$ in (2)), followed by $mn - k + 1$ constraints $0 \le Y_{pq}$ for $pq \in V \cup \{i_1j_1\}$, and $k - 1$ constraints $Y_{i_\ell j_\ell} \le Y_{i_{\ell+1}j_{\ell+1}}$ for $\ell \in [k-1]$. Write $\lambda_{pq}$, $\delta_{pq}$ and $\eta_\ell$ for their respective Lagrangian variables; the Karush-Kuhn-Tucker (KKT) conditions are:

$$Y_{pq} = X_{pq} - \lambda_{pq} + \delta_{pq} \ \text{ for all } pq \in V; \quad Y_{i_1j_1} = X_{i_1j_1} - \eta_1 + \sum_{pq \in V}\lambda_{pq} + \delta_{i_1j_1}; \quad Y_{i_\ell j_\ell} = X_{i_\ell j_\ell} + \eta_{\ell-1} - \eta_\ell \ \text{ for } 2 \le \ell \le k \ (\eta_k := 0), \qquad (24)$$

$$Y_{pq} \le Y_{i_1j_1} \ \text{for all } pq \in V, \quad \text{and} \quad Y_{pq} \ge 0 \ \text{for all } pq \in V \cup \{i_1j_1\}, \qquad (25)$$
$$\lambda_{pq} \ge 0 \ \text{for all } pq \in V, \quad \text{and} \quad \delta_{pq} \ge 0 \ \text{for all } pq \in V \cup \{i_1j_1\}, \qquad (26)$$
$$\lambda_{pq}\left(Y_{pq} - Y_{i_1j_1}\right) = 0 \ \text{for all } pq \in V, \quad \text{and} \quad \delta_{pq}\,Y_{pq} = 0 \ \text{for all } pq \in V \cup \{i_1j_1\}, \qquad (27)$$
$$Y_{i_\ell j_\ell} \le Y_{i_{\ell+1}j_{\ell+1}} \ \text{for all } \ell \in [k-1], \qquad (28)$$
$$\eta_\ell \ge 0 \ \text{for all } \ell \in [k-1], \qquad (29)$$
$$\eta_\ell\left(Y_{i_{\ell+1}j_{\ell+1}} - Y_{i_\ell j_\ell}\right) = 0 \ \text{for all } \ell \in [k-1]. \qquad (30)$$
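The KKT system (24)–(30) characterizes $\mathrm{Proj}_{C_2}(X)$, so a generic QP solver provides an independent reference against which an e-PAVA implementation can be unit-tested on small instances. Below is one such check using cvxpy; the instance sizes and names are ours.

```python
# Numeric check (ours): reproduce Proj_{C2}(X) for C2 = O_{ij[k]} with a
# generic QP solver, for comparison against an e-PAVA implementation.
import cvxpy as cp
import numpy as np

m, n = 3, 4
top = [(0, 0), (1, 2)]                 # i1j1, i2j2 (i2j2 is topmost)
rng = np.random.default_rng(1)
X = rng.normal(size=(m, n))

Y = cp.Variable((m, n), nonneg=True)
# chain constraints (28): Y[i_{l+1} j_{l+1}] >= Y[i_l j_l]
cons = [Y[i2, j2] >= Y[i1, j1]
        for (i1, j1), (i2, j2) in zip(top, top[1:])]
# dominance constraints (25): Y[i1 j1] >= Y[pq] for pq in V
i1, j1 = top[0]
cons += [Y[i1, j1] >= Y[p, q]
         for p in range(m) for q in range(n) if (p, q) not in top]

cp.Problem(cp.Minimize(cp.sum_squares(Y - X)), cons).solve()
Y_hat = Y.value                        # compare against e-PAVA's output
```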
Lemma B.1. Given the index $ij = i_1j_1$, any $\eta \ge 0$, and any $X_{pq} \in \mathbb{R}$ for all $pq \in V \cup \{ij\}$, let $r = r(\eta) = \mathrm{rank}(X_{ij} - \eta)$ over the $mn - k + 1$ variates. Then for $\tau(\cdot) = \tau(\cdot, \eta)$ and $0 \le t = t(\eta) < r$ (see (14) and (15) in the main text) we have

$$\tau(t, \eta) \ge X_{(t+1)}, \quad \text{and, if } t > 0, \text{ also } \tau(t, \eta) \le X_{(t)} \le X_{(t-1)} \le \cdots \le X_{(1)}. \tag{31}$$

Proof of Lemma B.1. Assume $\eta = 0$ and drop $\eta$ from $\tau$ and $t$, since the case $\eta > 0$ is equivalent to the case $X'_{pq} = X_{pq}$ and $X'_{ij} = X_{ij} - \eta$. The statement (31) holds if (i) $X_{(t)} \ge \tau(t)$ and (ii) $\tau(t) \ge X_{(t+1)}$. Fact (i) is relevant only when $t > 0$, and holds by definition (see (15) in the main text), because $t+1$ is minimal in $[r]$ for which $\tau(t+1) > X_{(t+1)}$ holds, so that $\tau(s) \le X_{(s)}$ for all $s \le t$.

Fact (ii) holds in either of the two possible cases. Case I: $t + 1 = r$. In this case, note that $\tau(t)$ is the unweighted average of $X_{(t+1)} = X_{ij}$ and the $t$ higher-ranked values $X_{pq}$ (those with $\mathrm{rank}(X_{pq}) < r$), each of which is at least $X_{(t+1)}$; therefore we must conclude $\tau(t) \ge X_{(t+1)}$. Case II: $t + 1 < r$. In this case, note that by the definition of $t$ we must have $\tau(t+1) > X_{(t+1)}$. Together with $(t+2)\,\tau(t+1) = (t+1)\,\tau(t) + X_{(t+1)}$, this gives $(t+2)\,\tau(t+1) < (t+1)\,\tau(t) + \tau(t+1)$, i.e., $\tau(t) > \tau(t+1)$. Therefore we conclude $\tau(t) > \tau(t+1) > X_{(t+1)}$. ∎

Lemma B.2. Given the index $ij = i_1j_1$, any $\eta \ge 0$, and $X_{pq} \in \mathbb{R}$ for all $pq \in V \cup \{ij\}$, where at least one $X_{pq}$ in $V \cup \{ij\}$ is negative, let $1 \le s \le t$ denote the least-ranked negative element in the ranking $X_{(1)} \ge \cdots \ge X_{(s-1)} \ge 0 > X_{(s)} \ge \cdots \ge X_{(t)}$, for $t = t(\eta)$ satisfying (15) in the main text. Then for $\tau(\cdot) = \tau(\cdot, \eta)$ in (14), if $\tau(t) < 0$ we necessarily have

$$X_{ij} < 0, \quad \text{and} \quad \sum_{\ell=1}^{s-1} X_{(\ell)} < -X_{ij}. \tag{32}$$

Proof of Lemma B.2. Assume $\eta = 0$ and drop $\eta$ from $\tau$ and $t$, as before. We get that $\tau(s) = (s+1)^{-1}\big(\sum_{\ell=1}^{s-1} X_{(\ell)} + X_{(s)} + X_{ij}\big) \le X_{(s)}$, where the inequality follows from the minimality of $t$ in (15). Rearranging, we get $\sum_{\ell=1}^{s-1} X_{(\ell)} + X_{ij} \le s\,X_{(s)}$. On the other hand, we assumed $X_{(s)} < 0$, so the right-hand side is negative, while $\sum_{\ell=1}^{s-1} X_{(\ell)} \ge 0$; we must therefore conclude (32). ∎

Lemma B.3. Given the index $ij = i_1j_1$ and $\eta \ge 0$, coefficients $X_{pq} \in \mathbb{R}$ for all $pq \in V \cup \{ij\}$, let $r = r(\eta) = \mathrm{rank}(X_{ij} - \eta)$ over the $mn - k + 1$ coefficients. Let $\tau(\cdot) = \tau(\cdot, \eta)$ and $0 \le t = t(\eta)$ respectively satisfy (14) and (15) in the main text. For all $pq \in V \cup \{ij\}$, set $Y_{pq}$ as follows: i) if $\mathrm{rank}(X_{pq}) \le t$ or $pq = ij$, set $Y_{pq} = (\tau(t, \eta))_+$; ii) otherwise set $Y_{pq} = (X_{pq})_+$. Then the $mn - k$ and $mn - k + 1$ coefficients $\lambda_{pq}$ and $\delta_{pq}$, determined by the first two lines of (24) given this choice of $Y_{pq}$ and $X_{pq}$, satisfy (25)-(27) together with the chosen $Y_{pq}$ and $X_{pq}$.

Proof of Lemma B.3. Fix $ij = i_1j_1$. Assume $\eta = 0$, because the general case $\eta \ge 0$ is equivalent to the case $X'_{pq} = X_{pq}$ and $X'_{ij} = X_{ij} - \eta$; hence we simply write $\tau(\cdot) = \tau(\cdot, \eta)$ and $t = t(\eta)$. We show that (33)-(34) below (for all $pq \in V$),

$$Y_{pq} = \begin{cases} (\tau(t))_+ & \text{if } \mathrm{rank}(X_{pq}) \le t \\ (X_{pq})_+ & \text{otherwise} \end{cases} \tag{33}$$

$$Y_{ij} = (\tau(t))_+\,, \tag{34}$$

together with (31) and (32) in Lemmata B.1 and B.2 respectively, establish $\lambda_{pq}$ and $\delta_{pq}$ that relate $Y_{pq}$ and $X_{pq}$ by (the first two equations of) (24), and satisfy the three KKT conditions (25), (26) and (27) with $i_1j_1 = ij$, proving the lemma.

KKT condition (25). We first show $Y_{pq} \le Y_{ij}$ for all $pq \in V$. Consider $\mathrm{rank}(X_{pq}) > t$. Wherever $X_{pq} \ge 0$ we have $0 \le Y_{pq} \overset{(33)}{=} X_{pq} \le X_{(t+1)} \overset{(31)}{\le} \tau(t)$, hence $Y_{pq} \le (\tau(t))_+ \overset{(34)}{=} Y_{ij}$; and wherever $X_{pq} < 0$ we have $Y_{pq} \overset{(33)}{=} (X_{pq})_+ = 0 \le (\tau(t))_+ = Y_{ij}$. For all other $pq \in V$, where $\mathrm{rank}(X_{pq}) \le t$, we see that $Y_{pq} \overset{(33)}{=} (\tau(t))_+ \overset{(34)}{=} Y_{ij}$. Thus the first set of inequalities in (25) follows. The second set, $Y_{pq} \ge 0$ for all $pq \in V \cup \{ij\}$, follows by definitions (33) and (34).

We next show (26) and (27) separately for $pq \in V$; for $pq \in V$ we choose $\lambda_{pq}, \delta_{pq}$ satisfying $X_{pq} + \delta_{pq} = \lambda_{pq} + Y_{pq}$ as

$$\text{(if } \mathrm{rank}(X_{pq}) > t\text{):} \quad \lambda_{pq} = 0, \quad \delta_{pq} = (X_{pq})_+ - X_{pq}, \tag{35}$$

$$\text{(if } \mathrm{rank}(X_{pq}) \le t\text{):} \quad \begin{cases} \lambda_{pq} = X_{pq} - (\tau(t))_+\,,\ \ \delta_{pq} = 0, & \text{if } \tau(t) \ge 0 \text{ or } X_{pq} \ge 0, \\ \lambda_{pq} = 0,\ \ \delta_{pq} = -X_{pq}, & \text{if } \tau(t) < 0 \text{ and } X_{pq} < 0. \end{cases} \tag{36}$$

One verifies that (35) satisfies (the first equation of) (24) by putting $Y_{pq} = (X_{pq})_+$ as in (33), and similarly that (36) satisfies the same by putting $Y_{pq} = (\tau(t))_+$ whilst considering both cases $\tau(t) \ge 0$ and $\tau(t) < 0$. Conditions (26) and (27) for the remaining index $ij$ will be shown afterwards.

KKT condition (26) for $pq \in V$. Consider $\mathrm{rank}(X_{pq}) > t$. From (35), this is trivially satisfied for $\lambda_{pq}$, and holds for $\delta_{pq}$ because $(X_{pq})_+ - X_{pq} \ge 0$.
For $\mathrm{rank}(X_{pq}) \le t$ we see from (36) the following. $\lambda_{pq} = X_{pq} - (\tau(t))_+ \ge 0$ holds in the event that either $\tau(t) \ge 0$ or $X_{pq} \ge 0$ holds: indeed $\tau(t) \overset{(31)}{\le} X_{pq}$ whenever $\mathrm{rank}(X_{pq}) \le t$, so $\lambda_{pq} \ge 0$ when $\tau(t) \ge 0$, and when $\tau(t) < 0$ and $X_{pq} \ge 0$ we have $\lambda_{pq} = X_{pq} \ge 0$. In the event that both $\tau(t) < 0$ and $X_{pq} < 0$ hold, we have $\delta_{pq} = -X_{pq} \ge 0$ because $X_{pq}$ is then negative.

KKT condition (27) for $pq \in V$. The first set of equalities in (27) is satisfied as follows. Whenever $\mathrm{rank}(X_{pq}) \le t$ we have $Y_{pq} = Y_{ij}$; see the discussion of KKT condition (25) above. For $\mathrm{rank}(X_{pq}) > t$ we chose $\lambda_{pq} = 0$; see (35). The second set of equalities in (27) is satisfied as follows: for the case $\mathrm{rank}(X_{pq}) > t$ this is seen from choice (33) putting $Y_{pq} = (X_{pq})_+$ and choice (35) for $\delta_{pq}$, and for $\mathrm{rank}(X_{pq}) \le t$ it is seen from (33) putting $Y_{pq} = (\tau(t))_+$ and choice (36) for $\delta_{pq}$.

It remains to show the KKT conditions $\delta_{ij} \ge 0$ in (26) and $\delta_{ij}\,Y_{ij} = 0$ in (27). Derive the following:

$$\sum_{pq \in V} \lambda_{pq} \overset{(35)-(36)}{=} \sum_{pq \in V:\ \mathrm{rank}(X_{pq}) \le t} \big( X_{pq} + \delta_{pq} - (\tau(t))_+ \big) = \sum_{pq \in V:\ \mathrm{rank}(X_{pq}) \le t} (X_{pq} + \delta_{pq}) \ -\ t\,(\tau(t))_+\,.$$

Then put $Y_{ij} = (\tau(t))_+$ from (34) into $\delta_{ij} = Y_{ij} - X_{ij} - \sum_{pq \in V} \lambda_{pq}$, and substitute the derivation above to get $\delta_{ij} = (t+1)\,(\tau(t))_+ - \big( X_{ij} + \sum_{pq \in V:\ \mathrm{rank}(X_{pq}) \le t} (X_{pq} + \delta_{pq}) \big)$, which simplifies to

$$\delta_{ij} = \begin{cases} (t+1)\,\tau(t) - \big( X_{ij} + \sum_{pq \in V:\ \mathrm{rank}(X_{pq}) \le t} X_{pq} \big) \overset{(14)}{=} (t+1)\,\tau(t) - (t+1)\,\tau(t) = 0, & \text{if } \tau(t) \ge 0, \\[2pt] -\,X_{ij} - \sum_{pq \in V:\ X_{pq} \ge 0,\ \mathrm{rank}(X_{pq}) \le t} X_{pq} \overset{(32)}{>} 0, & \text{if } \tau(t) < 0, \end{cases}$$

where in the first case all $\delta_{pq} = 0$ by (36), and in the second case $(\tau(t))_+ = 0$ and $X_{pq} + \delta_{pq} \overset{(36)}{=} 0$ for the negative $X_{pq}$. This shows that both (26) and (27) hold (in the case $\tau(t) < 0$ we have $Y_{ij} = (\tau(t))_+ = 0$, so $\delta_{ij}\,Y_{ij} = 0$ regardless), where we used (32) from Lemma B.2 for the final inequality in the $\tau(t) < 0$ case. ∎

Lemma B.4. For $\tau(\cdot, \eta)$ and $t(\eta)$ in (14) and (15) of the main text, consider the function $T : \mathbb{R}_+ \to \mathbb{R}$ defined as $T(\eta) := (\tau(t(\eta), \eta))_+$ for any $\eta \ge 0$. Then $T$ is piecewise-linear, monotone non-increasing and convex in $\eta \ge 0$.

Proof of Lemma B.4. For fixed $s$, $\tau(s, \eta)$ is a decreasing linear function of $\eta$; therefore $T(\eta) = (\tau(t(\eta), \eta))_+ = \max(\tau(t(\eta), \eta), 0)$ is a convex, piecewise-linear function whose breakpoints occur whenever $t(\eta)$ changes value, with the final breakpoint occurring where $\tau(t(\eta), \eta)$ meets the horizontal axis. ∎

Lemma B.5 ((Grotzinger & Witzgall, 1984)). For $p, q, r, s \in [k]$ satisfying $1 \le p \le q < r \le s \le k$, where $r = q + 1$, assume $\Delta_{pq} \ge \Delta_{rs}$ for $\Delta$ in (37). Let $\ell$ satisfy $p \le \ell \le s$. Then for all $\ell \le q$ we have $\psi_{p\ell} - (\ell - p + 1)\,\Delta_{ps} \ge 0$, and for all $\ell > q$ we have $\psi_{r\ell} - (\ell - r + 1)\,\Delta_{rs} \ge 0$.
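Lemma B.5 underlies the coalescing step analyzed next. For intuition only, here is a minimal sketch of the classical pool-adjacent-violators algorithm (PAVA) for a least-squares non-decreasing fit; the paper's e-PAVA (Alg. 2) extends this with the extra parameter $\eta$ coupling block 1 to the threshold $T(\eta)$, so the sketch below is the textbook routine, not Alg. 2.

```python
def pava_nondecreasing(x):
    """Classical PAVA: best non-decreasing least-squares fit to x.
    Blocks store (sum, count); adjacent blocks are coalesced whenever
    their averages violate the non-decreasing order."""
    blocks = []  # each block: [sum, count]
    for v in x:
        blocks.append([float(v), 1])
        # coalesce while an earlier block average exceeds a later one
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out

# Entries violating the non-decreasing order are averaged ("coalesced") into blocks.
print(pava_nondecreasing([0.5, 2.0, 1.0, 3.0]))  # -> [0.5, 1.5, 1.5, 3.0]
```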
Restatement of Proposition 3.2. For $C_2 = O_{ij[k]}$, for any $i_\ell j_\ell \in [mn]$ where $\ell \in [k]$, consider the Euclidean projector $\mathrm{Proj}_{C_2}(X)$ for any $X \in \mathbb{R}^{m \times n}$. Let $T(\eta) := (\tau(t(\eta), \eta))_+$ for $\tau, t$ in (14) and (15) in the main text, and let $V$ be as in (2). Then for any $X \in \mathbb{R}^{m \times n}$, e-PAVA will successfully terminate with some $B$, $\bar\eta$, $\mathrm{le}$, $\mathrm{ri}$, and $\mathrm{val}$. Furthermore, the projection $\hat{X} = \mathrm{Proj}_{C_2}(X)$ satisfies: i) for $pq \in V$ we have $\hat{X}_{pq} = T(\bar\eta) = \mathrm{val}[1]$ if $\mathrm{rank}(X_{pq}) \le t(\bar\eta)$, and $\hat{X}_{pq} = (X_{pq})_+$ otherwise; and ii) for $\ell \in [k]$ we have $\hat{X}_{i_\ell j_\ell} = \mathrm{val}[B']$ iff $\mathrm{le}[B'] \le \ell \le \mathrm{ri}[B']$ for some $B' \in [B]$.

Proof. Proving Proposition 3.2 involves exhibiting a $Y \in \mathbb{R}^{m \times n}$ satisfying (24)-(30). Given the variates $X_{i_\ell j_\ell}$, where $i_\ell j_\ell \notin V$ for all $\ell \in [k]$, define

$$\psi_{pq} = \sum_{\ell=p}^{q} X_{i_\ell j_\ell}\,, \qquad \Delta_{pq} = \frac{\psi_{pq}}{q - p + 1}\,, \qquad \text{for } 1 \le p \le q \le k. \tag{37}$$

The block of operations between Lines 4 and 10 of Alg. 2 is termed coalescing (Grotzinger & Witzgall, 1984). As coalescing occurs over the iterations, notice that whenever block $B' = 2$ is involved, the parameter denoted $\eta$ in Alg. 2 (Line 7) updates. We prove that if this parameter is taken to be $\eta_1$ in the KKT conditions (24) and (29)-(30), then Alg. 2 terminates, and when it does so with $\hat{X}$, then $\hat{X} = Y$ satisfies the KKT conditions. The proof is recursive in $\eta_1$, initialized as $\eta_1 = 0$. We use $\eta_\ell$ and $\eta'_\ell$ to denote the current and next iterate, respectively.

We first invoke (24) to fix the relationship between the solution and the Lagrangian multipliers. For a given $\eta_1$ value, put $\eta = \eta_1$ in Lemma B.3 to get the $mn - k + 1$ coefficients $Y_{pq}$ (and the Lagrangians $\lambda_{pq}, \delta_{pq}$) for all $pq \in V \cup \{i_1j_1\}$ satisfying the first two lines of (24) and conditions (25)-(27). Likewise for $\eta_1$, we use (24) to derive the $k - 1$ Lagrangians that accompany $Y_{i_\ell j_\ell}$, as follows. For any $B' \in [B]$, e-PAVA in Alg. 2 sets $Y_{i_\ell j_\ell} = Y_{i_{\ell+1} j_{\ell+1}}$ for values of $\ell$ that satisfy $\mathrm{le}[B'] \le \ell < \ell + 1 \le \mathrm{ri}[B']$. On the other hand, if $\ell$ is on the boundary, $\ell = \mathrm{ri}[B']$ and $\ell + 1 = \mathrm{le}[B'+1]$, then e-PAVA results in $Y_{i_\ell j_\ell} < Y_{i_{\ell+1} j_{\ell+1}}$. Therefore, using $\psi_{pq}$ and $\Delta_{pq}$ from (37) and $\nu = \sum_{pq \in V} \lambda_{pq} + \delta_{i_1j_1}$, we express $Y_{i_\ell j_\ell}$ and $\eta_\ell$ for all $\ell \in [k]$ as

$$\eta_\ell = \psi_{1\ell} - \ell\, T(\eta_1) + \nu, \qquad Y_{i_\ell j_\ell} = T(\eta_1), \qquad \text{if } 1 \le \ell \le \mathrm{ri}[1], \tag{38}$$

$$\eta_\ell = \psi_{p\ell} - (\ell - p + 1)\,\Delta_{pq}, \qquad Y_{i_\ell j_\ell} = \Delta_{pq}, \qquad \text{if } p = \mathrm{le}[B'] \le \ell \le q = \mathrm{ri}[B'], \tag{39}$$

where (39) is satisfied for all $1 < B' \le B$, and we simply let $\eta_k = 0$, which does not contradict (37).

Thus it remains to show that, as the iterates of $\eta_1$ update the values of $Y_{pq}, \lambda_{pq}, \delta_{pq}, \eta_\ell$ via Lemma B.3 and (38)-(39), the remaining KKT conditions (28)-(30) are satisfied upon termination of e-PAVA. To this end we must prove the existence and the required properties of the zero (i.e., $\bar\eta$, taken here to be $\eta_1$) in Line 7. Specifically, let $q = \mathrm{ri}[1]$ be the boundary of block 1. Suppose either a) $\eta_1 = 0$ and $q = 1$, or b) $\psi_{2q} - (q-1)\,T(\eta_1) + \eta_1 = 0$ is satisfied for some $\eta_1 > 0$ and $q > 1$. Let $r = q + 1$ and consider $\psi_{2s}$ from (37) for some $s \ge r \ge 2$. We show below that in either case a) or b), if $\Delta_{rs} \le T(\eta_1)$ holds, then i) there exists a zero $\eta'_1$ satisfying $\psi_{2s} - (s-1)\,T(\eta'_1) + \eta'_1 = 0$, and ii) $T(\eta'_1) \le T(\eta_1)$ and $\eta'_1 \ge \eta_1$.

The proof of these properties of the zero follows. In the case $\eta_1 = 0$ and $q = 1$, property i) follows by equivalently showing that $T(\eta'_1) - \eta'_1/(s-1) - \Delta_{2s}$ has a zero (divide by $s - 1$ and use (37)). Indeed, the zero exists at some $\eta'_1 \ge 0$, since the non-increasing function $T(\eta'_1)$ (see Lemma B.4) and the increasing linear function $\eta'_1/(s-1) + \Delta_{2s}$ meet, because the latter starts at a point no higher than the former, as given by the assumption $\Delta_{2s} \le T(0)$. Furthermore, ii) holds since we showed $\eta'_1 \ge 0 = \eta_1$ and by the monotonicity of $T(\cdot)$. In the other case, put $\Delta_{2s} = (1 - \beta)\,\Delta_{2q} + \beta\,\Delta_{rs}$ for $0 \le \beta = (s-q)/(s-1) \le 1$, and use condition b) above to derive $T(\eta_1) - \Delta_{2s} - \eta_1/(s-1) = \beta\,(T(\eta_1) - \Delta_{rs}) \ge 0$, where the inequality follows since we assumed $\Delta_{rs} \le T(\eta_1)$. Then, by the monotonicity of $T(\cdot)$ (see Lemma B.4), the zero exists and occurs at some point $\eta'_1 \ge \eta_1$, with arguments similar to before showing i) and ii).

KKT condition (29) for block 1, case (38): We now show that, for block 1, coalescing recursively maintains the nonnegativity (29). Suppose $1, q$ and $r, s$ are the boundaries of blocks 1 and 2, where $\mathrm{ri}[1] = q = r - 1$. Let $\eta'_\ell$ for $\ell \in [s]$ equal (38) after coalescing, rewritten as $\eta'_\ell = \psi_{2\ell} - (\ell - 1)\,T(\eta'_1) + \eta'_1$, and similarly let $\eta_\ell$ for $\ell \in [q]$ denote (38) before coalescing. By the recursion assumption, either $\eta_1 = 0$, or $\psi_{2q} - (q-1)\,T(\eta_1) + \eta_1 = 0$ is satisfied for some $\eta_1 > 0$. e-PAVA coalesces blocks 1 and 2 if $\mathrm{val}[1] \ge \mathrm{val}[2]$, or equivalently, if $\Delta_{rs} \le T(\eta_1)$ for the value $\eta_1 \ge 0$; therefore, by the recursion assumption, we conclude that a new zero $\eta'_1 \ge 0$ of Line 7 with properties i) and ii) above exists. Then for $\ell \le q$ we have $\eta'_\ell - \eta_\ell = (\ell - 1)\,(T(\eta_1) - T(\eta'_1)) + \eta'_1 - \eta_1$, and by property ii) and the inequality $\eta'_1 \ge \eta_1 \ge 0$, we conclude $\eta'_\ell \ge \eta_\ell \ge 0$.
Next, for $\ell > q$, we express the following for $\alpha = 1 - (\ell - 1)/(s - 1) \ge 0$:

$$\eta'_\ell - \eta_\ell \overset{(a)}{=} \eta'_1 + \psi_{2q} - (\ell - 1)\,T(\eta'_1) + (\ell - q)\,\Delta_{rs} \overset{(b)}{=} \alpha\,\big(\eta'_1 + \psi_{2q} - (q - 1)\,\Delta_{rs}\big) \overset{(c)}{\ge} \alpha\,(q - 1)\,\big(T(\eta_1) - \Delta_{rs}\big) \ge 0,$$

where (a) follows from $\eta'_\ell$ in (38) and $\eta_\ell$ in (39); (b) follows from expressing $T(\eta'_1) = \big[\psi_{2q} + (s - q)\,\Delta_{rs} + \eta'_1\big]/(s - 1)$ and collecting terms into $\alpha$; (c) follows because $\eta'_1 + \psi_{2q} \ge \eta_1 + \psi_{2q} = (q - 1)\,T(\eta_1)$; and the last inequality follows because we only coalesce when $\Delta_{rs} \le T(\eta_1)$.

KKT condition (29) for blocks $B' > 1$, case (39): We show the similar non-negativity property for the other blocks. Suppose $p, q$ and $r, s$ are the boundaries of two blocks $B'$ and $B' + 1$, where $r = q + 1$ and $B' > 1$. e-PAVA coalesces blocks $B'$ and $B' + 1$ only if $\mathrm{val}[B'] \ge \mathrm{val}[B'+1]$, or equivalently, $\Delta_{pq} \ge \Delta_{rs}$. Consider any $p \le \ell \le s$. Eqn. (39) implies that the Lagrangians after and before coalescing are $\eta'_\ell = \psi_{p\ell} - (\ell - p + 1)\,\Delta_{ps}$ and (for $\ell > q$) $\eta_\ell = \psi_{r\ell} - (\ell - r + 1)\,\Delta_{rs}$, respectively. Supposing $\eta_\ell \ge 0$, and by these expressions for $\eta'_\ell, \eta_\ell$, we invoke Lemma B.5 to show that $\eta'_\ell \ge 0$ holds for all $\ell \le q$ after coalescing; the other case $\ell > q$ also holds similarly by Lemma B.5.

KKT condition (30): We show that the coalescing update, Lines 4-10, maintains the property $\eta_s = 0$ at the boundary $s = \mathrm{ri}[B']$ of any block $B' \in [B]$; this proves complementary slackness (30), because (38)-(39) show that $Y_{i_\ell j_\ell}$ differs only across block boundaries. Let $\eta_q$ and $\eta'_s$ denote the boundary Lagrangians for the previous and the current coalescing update, respectively; the blocks with boundaries $p, q$ and $r, s$, for $r = q + 1$, are coalesced. For (39), i.e., $B' > 1$, we have $\eta'_s = \psi_{ps} - (s - p + 1)\,\Delta_{ps}$ at the boundary $\ell = s$, and $\eta'_s = 0$ by definition (37). For (38), i.e., $B' = 1$, rewrite (38) to get the form $\eta'_s = \psi_{2s} - (s - 1)\,T(\eta'_1) + \eta'_1$, resembling the equation with the zero shown above. If the coalescing update is executed for the first time, then $\eta_1 = 0$ and $q = 1$; otherwise $\eta_1 > 0$ and $q > 1$, and by the recursion assumption $\eta_q = \psi_{2q} - (q - 1)\,T(\eta_1) + \eta_1 = 0$. In either case, together with the condition $\Delta_{rs} \le T(\eta_1)$ that must hold for a coalescing step to occur, we conclude by zero property i) that $\eta'_s = 0$.

KKT condition (28): Each coalescing step of e-PAVA attempts to restore the non-decreasing ordering of the $Y_{i_\ell j_\ell}$ over $\ell \in [k]$ required by (28). This is only possible if the new values $Y_{i_\ell j_\ell}$ that result from coalescing do not overshoot the values of the neighbouring blocks, which is clear for the case $B' > 2$ by definition (37). e-PAVA eventually terminates satisfying (28), since there is a finite number of coalescing steps; in the worst case, the terminating state $B = 1$ and $\mathrm{ri}[1] = k$ is reached, which satisfies (28). ∎

We now turn to the Euclidean projection $\mathrm{Proj}_{C_1(a,b)}(X)$ of a matrix $X \in \mathbb{R}^{m \times n}$ onto the row- and column-sum constraint set $C_1(a,b)$ for measures $a, b$. Let $\otimes$ denote the matrix Kronecker (i.e., tensor) product. Observe that the row-sums of any $X \in \mathbb{R}^{m \times n}$ can be obtained via the Kronecker equivalence $X 1_n = (I_m \otimes 1_n^T)\,x$, where $x \in \mathbb{R}^{mn}$ is the length-$mn$ vector formed from $X$ by setting $x_{(k-1)n+\ell} = X_{k\ell}$. By the same equivalence we obtain the column-sums $X^T 1_m = (1_m^T \otimes I_n)\,x$ from the same $x \in \mathbb{R}^{mn}$; in numpy terminology9, one writes $x = \mathrm{ravel}(X)$. We thus conclude that the Euclidean projection $\hat{X} = \mathrm{Proj}_{C_1(a,b)}(X)$, for any $X \in \mathbb{R}^{m \times n}$, can be equivalently obtained by solving the following optimization over $\mathbb{R}^{mn}$:

$$\hat{x} = \arg\min_{y \in \mathbb{R}^{mn}} \tfrac{1}{2}\,\|y - x\|_2^2\,, \quad \text{s.t. } (I_m \otimes 1_n^T)\,y = a, \quad (1_m^T \otimes I_n)\,y = b, \tag{40}$$

and then setting $\hat{X} = \mathrm{unravel}(\hat{x})$, where $\mathrm{unravel}(\cdot)$ is the operational inverse of the numpy function $\mathrm{ravel}(\cdot)$.

9https://numpy.org/doc/stable/reference/generated/numpy.ravel.html
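These Kronecker equivalences are easy to sanity-check numerically; the following small numpy snippet (our own illustration) verifies that $(I_m \otimes 1_n^T)\,\mathrm{ravel}(X)$ and $(1_m^T \otimes I_n)\,\mathrm{ravel}(X)$ recover the row and column sums.

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(0)
X = rng.random((m, n))
x = np.ravel(X)  # row-major flattening: x[(k-1)n + l] = X[k, l]

A_rows = np.kron(np.eye(m), np.ones((1, n)))  # (m x mn): row-sum operator
A_cols = np.kron(np.ones((1, m)), np.eye(n))  # (n x mn): column-sum operator

assert np.allclose(A_rows @ x, X.sum(axis=1))  # X 1_n
assert np.allclose(A_cols @ x, X.sum(axis=0))  # X^T 1_m
```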
Converting the problem to the vector form (40) results in KKT conditions with $m + n$ Lagrangian variables $\alpha \in \mathbb{R}^m$ and $\beta \in \mathbb{R}^n$ in a system of linear equations:

$$y + A^T \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = x, \qquad A\,y = \begin{bmatrix} a \\ b \end{bmatrix}, \tag{41}$$

where $A$ is the $(m+n) \times mn$ matrix whose first $m$ rows equal $I_m \otimes 1_n^T$ and whose last $n$ rows equal $1_m^T \otimes I_n$. That is, to compute $\hat{X} = \mathrm{Proj}_{C_1(a,b)}(X)$, one may simply solve (41) for $y$, given $x = \mathrm{ravel}(X)$, $a$ and $b$. However, we must pay attention to the fact that $A$ in (41) is not full-rank. To compute the exact rank of $A$ we can make use of Remark 3.1 of Peyré & Cuturi (2019); specifically, it is $m + n - 1$. This implies that the left-inverse of $A$ is not equal to $(A^T A)^{-1} A^T$, and instead requires the pseudo-inverse10 $A^+$ of the matrix $A \in \mathbb{R}^{(m+n) \times mn}$ in (41):

$$A^+ = \Big[\ \tfrac{1}{n}\,(M \otimes 1_n) \ \Big|\ \tfrac{1}{m}\,(1_m \otimes N)\ \Big].$$

Recall that $P_k$ is the projection $P_k = \tfrac{1}{k}\,1_k 1_k^T$, and the matrices $M := I_m - \tfrac{m}{m+n}\,P_m$ and $N := I_n - \tfrac{n}{m+n}\,P_n$. Then (42) will prove Prop. 3.1, restated here in the Kronecker-equivalent form (40) corresponding to the numpy function $x = \mathrm{ravel}(X)$.

Lemma B.6. Assume $a \in \mathbb{R}^m$ and $b \in \mathbb{R}^n$ have the same total mass, i.e., $a^T 1_m = b^T 1_n$. Then for any $x \in \mathbb{R}^{mn}$, the solution of (41) is given by the following expression for $y \in \mathbb{R}^{mn}$, for some $\alpha \in \mathbb{R}^m$ and $\beta \in \mathbb{R}^n$:

$$y = A^+ \begin{bmatrix} a \\ b \end{bmatrix} + \big( I_{mn} - A^T (A^T)^+ \big)\,x, \tag{42}$$

where $A^+$ is the pseudo-inverse of $A \in \mathbb{R}^{(m+n) \times mn}$ in (41).

Proof of Lemma B.6. Suppose $[a; b]$ lies in $\mathrm{span}(A)$ (this is proved at the end); our strategy is then to show that $y \in \mathbb{R}^{mn}$ in (42) satisfies both the stationarity (first) and the feasibility (second) equations of (41), in two steps. From the Moore-Penrose property $A A^+ A = A$ of the pseudo-inverse, we conclude that $A^+$ has a left-inverse property over $\mathrm{span}(A)$. Thus we can construct some $y = A^+ [a; b] + \nu$ with an unrestricted choice of $\nu \in \mathrm{null}(A)$, and $y$ will satisfy the feasibility equation of (41). Next, we want to show that our construction for $y$ also satisfies the stationarity equation of (41). To do this, we show there exist some $\alpha \in \mathbb{R}^m$, $\beta \in \mathbb{R}^n$ that satisfy $A^T [\alpha; \beta] = x - \nu - A^+ [a; b]$ for some specific $\nu \in \mathrm{null}(A)$ (note that $A^+ [a; b] \in \mathrm{span}(A^T)$, so it suffices that $x - \nu \in \mathrm{span}(A^T)$). Indeed, for any $x \in \mathbb{R}^{mn}$ there exists some $\nu \in \mathrm{null}(A)$ such that $x - \nu$ equals the projection of $x$ onto $\mathrm{span}(A^T)$, i.e., there exists $\nu$ such that $x - \nu = A^T (A^T)^+ x$; to check, write $\nu = \big(I_{mn} - A^T (A^T)^+\big) x$ and observe that $\nu$ is then the projection of $x$ onto $(\mathrm{span}(A^T))^\perp$, and that $(\mathrm{span}(A^T))^\perp = \mathrm{null}(A)$, agreeing with our assumption $\nu \in \mathrm{null}(A)$. Thus we have shown that $y = A^+ [a; b] + \nu = A^+ [a; b] + \big(I_{mn} - A^T (A^T)^+\big) x$ satisfies the optimality conditions (41), and this form of $y$ is exactly that of (42).

To conclude, we need to prove the starting assumption $[a; b] \in \mathrm{span}(A)$. Because we assumed $1_m^T a = 1_n^T b$, the vector $[1_m; -1_n]$ is orthogonal to $[a; b]$. On the other hand, $[1_m; -1_n]$ in fact spans $(\mathrm{span}(A))^\perp$, which is one-dimensional since $\mathrm{rank}(A) = m + n - 1$. Therefore $[a; b] \in \mathrm{span}(A)$. ∎

Restatement of Proposition 3.1. For any $x \in \mathbb{R}^{mn}$, the Euclidean projection $\hat{x}$ solving (40) satisfies $\hat{x} = y_1 + (x - y_2)$, where

$$y_1 = \tfrac{1}{n}\,[M \otimes 1_n]\,a + \tfrac{1}{m}\,[1_m \otimes N]\,b \qquad\text{and}\qquad y_2 = [M \otimes P_n]\,x + [P_m \otimes N]\,x,$$

for $P_m, P_n, M$ and $N$ defined as before.

Proof of Proposition 3.1. Put the expression for $A^+$ into the expression for $y$ in (42), set $\hat{x} = y$, and apply basic manipulations. ∎

10This can be verified to satisfy the conditions on p. 257 of Golub, Gene H. and Van Loan, Charles F., Matrix Computations, Johns Hopkins University Press, Third Edition, 1996.
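To illustrate Proposition 3.1, the following numpy sketch (ours, not the paper's implementation) computes the projection via the pseudo-inverse form (42) and checks it against the closed form; note that Lemma B.6 requires $a$ and $b$ to have equal total mass.

```python
import numpy as np

def proj_C1(X, a, b):
    """Projection of X onto C1(a,b) = {Y : Y 1_n = a, Y^T 1_m = b},
    via the pseudo-inverse form (42) with a generic pinv."""
    m, n = X.shape
    A = np.vstack([np.kron(np.eye(m), np.ones((1, n))),   # row-sum operator
                   np.kron(np.ones((1, m)), np.eye(n))])  # column-sum operator
    Ap = np.linalg.pinv(A)                                # A^+
    x = np.ravel(X)
    y = Ap @ np.concatenate([a, b]) + (np.eye(m * n) - Ap @ A) @ x
    return y.reshape(m, n)

def proj_C1_closed_form(X, a, b):
    """Closed form of Proposition 3.1: x_hat = y1 + (x - y2)."""
    m, n = X.shape
    Pm, Pn = np.full((m, m), 1.0 / m), np.full((n, n), 1.0 / n)
    M = np.eye(m) - m / (m + n) * Pm
    N = np.eye(n) - n / (m + n) * Pn
    x = np.ravel(X)
    y1 = np.kron(M, np.ones((n, 1))) @ a / n + np.kron(np.ones((m, 1)), N) @ b / m
    y2 = np.kron(M, Pn) @ x + np.kron(Pm, N) @ x
    return (y1 + x - y2).reshape(m, n)

m, n = 3, 4
rng = np.random.default_rng(1)
X = rng.random((m, n))
a, b = np.full(m, 1.0 / m), np.full(n, 1.0 / n)   # equal total mass, as required
Y = proj_C1(X, a, b)
assert np.allclose(Y, proj_C1_closed_form(X, a, b))
assert np.allclose(Y.sum(axis=1), a) and np.allclose(Y.sum(axis=0), b)
```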
B.2. Bounds

The goal here is to extend the result in Kusner et al. (2015) and obtain a lower bound (see (19) in the main text) on the optimal cost (6) with order constraints. This requires a lower bound (see (23) in the main text) on the function $g_{ij[k]}(x, \phi)$ that appears in (19) in the main text. Prop. 4.1 derives this lower bound by decoupling the expression into $m$ (or $n$) independent minimizations that can be computed in parallel; Kusner et al. (2015) take the same approach for the simpler problem (1), but here a global constraint (due to the order constraints) still has to be dealt with (see the equation below (19) in the main text). The independent minimizations here are related to $\mathrm{PACKING}(\{\phi_k\}_{k \in [n]}, u, \alpha)$, see (20) in the main text, where $u$ and $\alpha$ represent the per-item capacity and the total budget, respectively.

We first present Lemmata B.7-B.8, which show that the quantities $L1_{ijk}(x, \phi, a)$ and $L2_{ijk}(x, \phi, b)$ in the lower bound (23) of the main text are convex and piecewise-linear. Hence the lower bound in Proposition 4.1 can be efficiently evaluated by bisection; it only requires an upfront sorting complexity of $O(n \log n)$ when parallelized.

Lemma B.7. The solution to $\mathrm{PACKING}(\{\phi_k\}_{k \in [n]}, u, \alpha)$ for $0 \le \alpha/n \le u \le \alpha \le 1$ is given by

$$\mathrm{PACKING}\big(\{\phi_k\}_{k \in [n]}, u, \alpha\big) = u \left[ \sum_{k=1}^{\ell} \big( \phi_{(k)} - \phi_{(\ell+1)} \big) \right] + \alpha\, \phi_{(\ell+1)}\,, \tag{43}$$

where $\ell = \lfloor \alpha/u \rfloor < n$, and $\phi_{(k)}$ is the $k$-th coefficient in increasing order $\phi_{(1)} \le \phi_{(2)} \le \cdots \le \phi_{(n)}$. For fixed $\alpha$, it is convex, piecewise-linear and monotone non-increasing in $u$, with breakpoints at $u = \alpha/i$ for $i = n-1, n-2, \ldots, 1$.

Proof. The solution (of (20), main text) is obtained by greedily prioritizing the lower costs in the ordering $\phi_{(k)}$, filling each item to its capacity $u$ until the budget $\alpha$ is exhausted; hence

$$\mathrm{PACKING}\big(\{\phi_k\}_{k \in [n]}, u, \alpha\big) = \sum_{k=1}^{\ell} \phi_{(k)}\, u + (\alpha - \ell u)\, \phi_{(\ell+1)}\,, \tag{44}$$

where $\ell = \lfloor \alpha/u \rfloor < n$; clearly $\alpha/n \le u \le \alpha$ is required for at least one feasible solution to exist. Gathering the coefficients of $u$, we arrive at (43) from (44); from (43), the function is clearly piecewise-linear, with breakpoints at the values of $u$ that cause $\ell = \lfloor \alpha/u \rfloor$ to change value. There are $n - 1$ of them, at the values $u = \alpha/i$ for $i = n-1, n-2, \ldots, 1$, making a total of $n$ intervals $(\alpha/(i+1), \alpha/i]$. Pick some $i < n$, fix some $0 \le \alpha \le 1$, and consider $u$ in the interval $(\alpha/(i+1), \alpha/i]$. For our choice of $i$ we conclude the sequence of inequalities

$$i + 1 = \left\lfloor \frac{\alpha}{\alpha/(i+1)} \right\rfloor > \left\lfloor \frac{\alpha}{u} \right\rfloor = \ell \ge \left\lfloor \frac{\alpha}{\alpha/i} \right\rfloor = i,$$

and so we conclude that $\ell$ in (44) will equal $\ell = i$ for any $u \in (\alpha/(i+1), \alpha/i]$. The function (43) is continuous piecewise-linear, because its limit as $u$ decreases to $\alpha/(i+1)$ within the interval $(\alpha/(i+1), \alpha/i]$ (where $\ell = i$) equals its value at the rightmost point of the neighbouring interval $(\alpha/(i+2), \alpha/(i+1)]$ (where $\ell = i+1$):

$$\lim_{u \downarrow \alpha/(i+1)} \mathrm{PACKING}\big(\{\phi_k\}_{k \in [n]}, u, \alpha\big) = \mathrm{PACKING}\Big(\{\phi_k\}_{k \in [n]}, \tfrac{\alpha}{i+1}, \alpha\Big) = \frac{\alpha}{i+1} \sum_{k=1}^{i+1} \phi_{(k)}\,.$$

From (44), the gradient with respect to $u$ is non-positive everywhere, due to the non-positivity of the summands $\phi_{(k)} - \phi_{(\ell+1)} \le 0$ for $k \le \ell$, implying that (44) is monotone non-increasing in $u$ everywhere. A monotone non-increasing piecewise-linear function is convex if its gradient is non-decreasing. Indeed, subtracting the gradient of the $i$-th interval (where $\ell = i$) from that of the $(i-1)$-th interval to its right (where $\ell = i - 1$) gives

$$\left[ \sum_{k=1}^{i-1} \big( \phi_{(k)} - \phi_{(i)} \big) \right] - \left[ \sum_{k=1}^{i} \big( \phi_{(k)} - \phi_{(i+1)} \big) \right] = i\,\big( \phi_{(i+1)} - \phi_{(i)} \big) \ge 0.$$

We have thus proved that (43) is piecewise-linear, monotone non-increasing and convex. ∎
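For concreteness, a minimal sketch of the greedy evaluation (44) (our own illustration of (20); the function name is hypothetical):

```python
import numpy as np

def packing(phi, u, alpha):
    """Greedy solution of PACKING({phi_k}, u, alpha), per (44): fill the
    cheapest items first, each up to capacity u, until budget alpha is spent."""
    phi = np.sort(np.asarray(phi, dtype=float))  # phi_(1) <= ... <= phi_(n)
    n = len(phi)
    assert u >= alpha / n, "infeasible: total capacity n*u is below the budget alpha"
    l = min(int(alpha // u), n)                  # items filled to capacity
    cost = u * phi[:l].sum()
    if l < n:
        cost += (alpha - l * u) * phi[l]         # partially filled (l+1)-th item
    return cost
```

Since the sort is performed once, repeated evaluations at different capacities can reuse it, in line with the complexity remarks after Lemma B.8.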
Lemma B.8. The solution to $\mathrm{PACKING}(\{\phi_k\}_{k \in [n-1]}, u, \alpha - u)$ for $0 \le \alpha/n \le u \le \alpha \le 1$ is

$$\mathrm{PACKING}\big(\{\phi_k\}_{k \in [n-1]}, u, \alpha - u\big) = u \left[ \sum_{k=1}^{\ell-1} \big( \phi_{(k)} - \phi_{(\ell)} \big) \right] + (\alpha - u)\, \phi_{(\ell)}\,, \tag{45}$$

where $\ell = \lfloor \alpha/u \rfloor$ and $\phi_{(k)}$ is the $k$-th coefficient in increasing order $\phi_{(1)} \le \phi_{(2)} \le \cdots \le \phi_{(n-1)}$. For a fixed $\alpha$, it is convex, piecewise-linear and non-increasing in $u$, with breakpoints $u = \alpha/i$ for $i = n-1, n-2, \ldots, 1$.

Proof. The proof is nearly identical to that of Lemma B.7; we outline the key differences. The expression (45) is derived, as (43) was, by the same greedy strategy:

$$\mathrm{PACKING}\big(\{\phi_k\}_{k \in [n-1]}, u, \alpha - u\big) = \sum_{k=1}^{\ell-1} \phi_{(k)}\, u + (\alpha - \ell u)\, \phi_{(\ell)}\,.$$

To show the monotone non-increasing behaviour and convexity, proceed in a similar manner. The $n - 1$ breakpoints at $u = \alpha/i$ for $i = n-1, n-2, \ldots, 1$ are the same, and thus so are the piecewise intervals; the limit as $u$ decreases to $\alpha/(i+1)$ within the interval $(\alpha/(i+1), \alpha/i]$ equals $\tfrac{\alpha}{i+1} \sum_{k=1}^{i} \phi_{(k)}$, as does the value at the rightmost point $u = \alpha/(i+1)$ of the interval $(\alpha/(i+2), \alpha/(i+1)]$. The gradient is indeed non-decreasing, again as seen by subtracting the gradient on the $i$-th interval $(\alpha/(i+1), \alpha/i]$ from that on the $(i-1)$-th interval to its right, which yields a nonnegative multiple of $\phi_{(i)} - \phi_{(i-1)} \ge 0$ for any choice of $i < n$. ∎

Complexity of computing the lower bounds in Proposition 4.1: The lower bound (see (23), main text) suggests evaluating $L1_{ijk}(x, \phi, a)$ and $L2_{ijk}(x, \phi, b)$ at multiple values $\alpha \le x \le \beta$, or equivalently, evaluating $\mu$ and $\nu$ (see (22), main text) over the same range. But $\mu$ and $\nu$ are defined using $\mathrm{PACKING}(\{\phi_i\}_{i \in [n]}, x, a_k)$ and $\mathrm{PACKING}(\{\phi_i\}_{i \in [n-1]}, x, a_i - x)$, and by Lemmata B.7-B.8 they are piecewise-linear in $x$, with gradients and coefficients determined upfront by an $O(n \log n)$ sort of the coefficients $\phi_k$. In other words, the $O(n \log n)$ cost is paid only once; once paid, the function (43) can be evaluated at multiple points $x$ in $O(1)$ time each. Finally, each $\mu, \nu$ summand in the expressions for $L1_{ijk}(x, \phi, a)$ and $L2_{ijk}(x, \phi, b)$ can be computed independently in parallel.

Restatement of Proposition 4.1. Let $\alpha_1 = \max_{i \in [m]} a_i / n$, $\beta_1 = \max_{i \in [m]} a_i$, and $\alpha_2 = \max_{j \in [n]} b_j / m$, $\beta_2 = \max_{j \in [n]} b_j$. Then for any $ij[k]$ set that defines the order-constrained OT (see (6) in the main text), where $i[k] = i_1, i_2, \ldots, i_k$ (and $j[k]$) do not repeat row (or column) indices, the minimum $g_{ij[k]}(x, \phi)$ (see (19) in the main text) is lower-bounded as

$$g_{ij[k]}(x, \phi) \ge \begin{cases} L1_{ijk}(x, \phi, a) & \text{for } \alpha_1 \le x \le \beta_1\,, \\ L2_{ijk}(x, \phi, b) & \text{for } \alpha_2 \le x \le \beta_2\,, \end{cases}$$

where $L1_{ijk}(x, \phi, a)$ and $L2_{ijk}(x, \phi, b)$ respectively equal

$$\sum_{\ell \in [k]} \nu\big(x, \{\phi_{i_\ell q}\}_{q \in [n] \setminus \{j_\ell\}}, a_{i_\ell}\big) + \sum_{p \notin i[k]} \mu\big(x, \{\phi_{pq}\}_{q \in [n]}, a_p\big)$$

and

$$\sum_{\ell \in [k]} \nu\big(x, \{\phi_{p j_\ell}\}_{p \in [m] \setminus \{i_\ell\}}, b_{j_\ell}\big) + \sum_{q \notin j[k]} \mu\big(x, \{\phi_{pq}\}_{p \in [m]}, b_q\big).$$

Proof of Proposition 4.1. For brevity we only prove the first bound, valid for $x$ in the range $\alpha_1 \le x \le \beta_1$ (notated $L1_{ijk}(x, \phi, a)$ in (22) of the main text); the other bound $L2_{ijk}(x, \phi, b)$ follows similarly and is omitted. The bound (see (22), main text) is obtained by relaxing the constraint set (a la (20), main text):

$$\big\{ \Pi \in \mathbb{R}^{m \times n} : \Pi \in U(a,b),\ \Pi \in O_{ij[k]},\ \Pi_{i_1j_1} = \cdots = \Pi_{i_kj_k} = x \big\} \subseteq \bigcap_{p \in [m]} \Big\{ \Pi \in \mathbb{R}^{m \times n}_+ : \sum_{q \in [n]} \Pi_{pq} = a_p,\ \Pi_{i_1j_1} = \cdots = \Pi_{i_kj_k} = x,\ \max_{q \in [n]} \Pi_{pq} \le x \Big\}.$$

The $p$-th set on the right-hand side only involves the coefficients $\Pi_{pq}$ found in the $p$-th row; after the relaxation, and taking into account the linear form of $g_{ij[k]}(x, \phi)$ and the fact that for each $i_\ell$-th row the $j_\ell$-th column does not repeat in $j[k]$, we obtain $m$ independent minimizations of the following form:

$$\min\ \sum_{q \in [n]} \phi_{pq} \Pi_{pq} \quad \text{s.t.}\ \sum_{q \in [n]} \Pi_{pq} = a_p,\ \ 0 \le \Pi_{pq} \le x, \qquad \text{for } p \notin i[k],$$

$$\min\ \sum_{q \in [n] \setminus \{j_\ell\}} \phi_{i_\ell q} \Pi_{i_\ell q} \quad \text{s.t.}\ \sum_{q \in [n] \setminus \{j_\ell\}} \Pi_{i_\ell q} = a_{i_\ell} - x,\ \ 0 \le \Pi_{i_\ell q} \le x, \qquad \text{for } \ell \in [k],$$

where the sum of the $m$ optimal costs of the $m$ minimizations lower-bounds the quantity $g_{ij[k]}(x, \phi)$. For any $p \notin i[k]$, the $p$-th minimization has optimal cost $\mu(x, \{\phi_{pq}\}_{q \in [n]}, a_p) = \mathrm{PACKING}(\{\phi_{pq}\}_{q \in [n]}, x, a_p)$, and similarly for $\ell \in [k]$, the $i_\ell$-th row has optimal cost $\nu(x, \{\phi_{i_\ell q}\}_{q \in [n] \setminus \{j_\ell\}}, a_{i_\ell}) = \mathrm{PACKING}(\{\phi_{i_\ell q}\}_{q \in [n] \setminus \{j_\ell\}}, x, a_{i_\ell} - x)$; see (20) of the main text. Lemmata B.7-B.8, applied to the PACKING problems above, show that the minimization for the $p$-th row is only feasible for $x \ge a_p/n$ and has the same optimal value for $a_p \le x \le 1$; therefore the global constraint across all minimizations is obtained with $\alpha_1 = \max_{p \in [m]} a_p / n$ and $\beta_1 = \max_{p \in [m]} a_p$. ∎
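Tying the pieces together, the row-decoupled bound $L1_{ijk}$ of Proposition 4.1 could be assembled as below, reusing the hypothetical packing helper sketched after Lemma B.7. This is an illustrative sketch under our assumed data layout, valid for $\alpha_1 \le x \le \beta_1$; the function name is ours.

```python
import numpy as np

def lower_bound_L1(phi, a, x, rows_k, cols_k):
    """Row-decoupled lower bound on g_{ij[k]}(x, phi), per Proposition 4.1.
    phi: (m, n) cost matrix; a: row marginals; x: common value forced on the
    order-constrained entries; rows_k/cols_k: the index sets i[k], j[k]."""
    m, n = phi.shape
    constrained = dict(zip(rows_k, cols_k))  # i[k] assumed to have no repeats
    total = 0.0
    for p in range(m):
        if p in constrained:
            costs = np.delete(phi[p], constrained[p])  # drop column j_l
            total += packing(costs, x, a[p] - x)       # nu(...): budget a_p - x
        else:
            total += packing(phi[p], x, a[p])          # mu(...): budget a_p
    return total
```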