# Unbalanced CO-optimal Transport

Quang Huy Tran¹ ², Hicham Janati³, Nicolas Courty¹, Rémi Flamary², Ievgen Redko⁴, Pinar Demetci⁵ ⁶, Ritambhara Singh⁵ ⁶

¹Université Bretagne Sud, IRISA; ²CMAP, École Polytechnique, IP Paris; ³LTCI, Télécom Paris, IP Paris; ⁴Univ. Lyon, UJM-Saint-Etienne, CNRS, UMR 5516; ⁵Center for Computational Molecular Biology, Brown University; ⁶Department of Computer Science, Brown University

{quang-huy.tran, nicolas.courty}@univ-ubs.fr, hicham.janati@telecom-paris.fr, remi.flamary@polytechnique.edu, ievgen.redko@univ-st-etienne.fr, pinardemetci@gmail.com, ritambhara@brown.edu

Optimal transport (OT) compares probability distributions by computing a meaningful alignment between their samples. CO-optimal transport (COOT) takes this comparison further by inferring an alignment between features as well. While this approach leads to better alignments and generalizes both the OT and Gromov-Wasserstein distances, we provide a theoretical result showing that it is sensitive to outliers, which are omnipresent in real-world data. This prompts us to propose unbalanced COOT, for which we provably show robustness to noise in the compared datasets. To the best of our knowledge, this is the first such result for OT methods in incomparable spaces. With this result in hand, we provide empirical evidence of this robustness on the challenging tasks of heterogeneous domain adaptation, with and without varying class proportions, and simultaneous alignment of samples and features across single-cell measurements.

## Introduction

The last decade has witnessed many successful applications of optimal transport (OT) (Monge 1781; Kantorovich 1942) in machine learning, namely in domain adaptation (Courty et al. 2016), generative adversarial networks (Arjovsky, Chintala, and Bottou 2017), classification (Frogner et al. 2015), dictionary learning (Rolet, Cuturi, and Peyré 2016) and semi-supervised learning (Solomon et al. 2014). When the supports of the probability measures lie in the same ground metric space, it is natural to use the distance defined by the metric to induce the cost, which leads to the famous Wasserstein distance (Villani 2003). When they do not, one can rely on the idea of the Gromov-Hausdorff distance (Gromov 1981) and its equivalent reformulations (Gromov 1999; Kalton and Ostrovskii 1999; Burago, Burago, and Ivanov 2001), and adapt them to the setting of metric measure spaces (Gromov 1999). This results in, for example, the Gromov-Wasserstein (GW) distance (Mémoli 2007, 2011; Sturm 2012), which has been widely used in many applications, namely in shape matching (Mémoli 2011), comparing kernel matrices (Peyré, Cuturi, and Solomon 2016), graphs (Vayer et al. 2019; Xu et al. 2019; Xu, Luo, and Carin 2019), computational biology (Demetci et al. 2022), heterogeneous domain adaptation (Yan et al. 2018), correspondence alignment (Solomon et al. 2016) and machine translation (Alvarez-Melis and Jaakkola 2018). By construction, the GW distance can only provide the sample alignment that best preserves the intrinsic geometry of the distributions and, as such, compares square pairwise relationship matrices. CO-Optimal Transport (COOT) (Redko et al. 2020; Chowdhury et al. 2021)
goes beyond these limits by simultaneously learning two independent (feature and sample) correspondences, and thus provides greater flexibility than the GW distance in terms of usage and interpretability. First, it allows us to measure similarity between matrices of arbitrary sizes. An interesting use case is, for instance, tabular data, which are usually expressed as a matrix whose rows represent samples and columns represent features. For the GW distance, the similarity or distance matrix (or any square matrix derived from the data) must be computed in advance, and the effect of individual variables is lost in this computation. COOT, on the other hand, can bypass this step, as it can take either the tabular data directly or the similarity matrices as inputs. Second, COOT provides both sample and feature correspondences. The feature correspondences are interpretable and allow one to recover relations between the features of two different datasets, even when they do not lie in the same space.

Similar to classical OT, COOT enforces hard constraints on the marginal distributions of both samples and features. These constraints lead to two main limitations: (1) imbalanced datasets, where samples or features are re-weighted, cannot be accurately compared; (2) mass transportation must be exhaustive: outliers, if any, must be matched regardless of the cost they induce. To circumvent these limitations, we propose to relax the mass preservation constraints in the COOT distance and study a broadly applicable and general OT framework that includes several well-studied cases presented in Table 1.

**Related work.** To relax the OT marginal constraints, a straightforward solution is to control the difference between the marginal distributions of the transport plan and the data by some discrepancy measure, e.g., the Kullback-Leibler divergence. In classical OT, this gives rise to unbalanced OT (UOT), first proposed in (Benamou 2003). The theoretical and numerical aspects of this extension have been studied extensively (Liero, Mielke, and Savaré 2018; Chizat et al. 2018b,a; Pham et al. 2020) and are gaining increasing attention in the machine learning community, with wide-ranging applications in domain adaptation (Fatras et al. 2021), generative adversarial networks (Balaji, Chellappa, and Feizi 2020; Yang and Uhler 2019), dynamic tracking (Lee, Bertrand, and Rozell 2020), crowd counting (Ma et al. 2021), neuroscience (Janati et al. 2019; Bazeille et al. 2019) and the modeling of cell developmental trajectories (Schiebinger et al. 2019). Unbalanced OT and its variants are usually sought for their known robustness to outliers (Mukherjee et al. 2021; Balaji, Chellappa, and Feizi 2020; Fatras et al. 2021). This appealing property extends beyond classical OT. For instance, to compare signed and non-negative measures in incomparable spaces, unbalanced OT (Liero, Mielke, and Savaré 2018) can be blended with the $L^p$-transportation distance (Sturm 2006), which leads to the Sturm-Entropic-Transport distance (Ponti and Mondino 2020), or with the GW distance, which gives rise to the unbalanced GW (UGW) distance (Séjourné, Vialard, and Peyré 2021). Also motivated by unbalanced OT, (Zhang et al. 2021) proposed a relaxation of the bidirectional Gromov-Monge distance called the unbalanced bidirectional Gromov-Monge divergence.
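To make this relaxation concrete in the discrete case, the following is a minimal NumPy sketch (ours, for illustration; all names are ours, not from any particular library) of the KL-penalized UOT objective that this line of work minimizes, given a coupling `P`, a cost matrix `C`, histograms `a`, `b` and relaxation weights `lam1`, `lam2`:

```python
import numpy as np

def kl(p, q):
    # Generalized KL divergence between nonnegative vectors:
    # KL(p|q) = sum_i p_i log(p_i/q_i) - sum_i p_i + sum_i q_i, with 0 log 0 = 0.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) - p.sum() + q.sum()

def uot_objective(P, C, a, b, lam1, lam2):
    # <C, P> + lam1 * KL(P @ 1 | a) + lam2 * KL(P.T @ 1 | b):
    # the marginals of P are pulled toward (a, b) instead of being enforced.
    return (C * P).sum() + lam1 * kl(P.sum(1), a) + lam2 * kl(P.sum(0), b)
```

Letting $\lambda_1, \lambda_2 \to +\infty$ recovers the hard marginal constraints of balanced OT.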
**Contributions.** In this work, we introduce an unbalanced extension of COOT called Unbalanced CO-Optimal Transport (UCOOT). UCOOT, defined for both discrete and continuous data, is a general framework that encompasses all the OT variants displayed in Table 1. Our main contribution is to show that UCOOT is provably robust to both sample and feature outliers, while its balanced counterpart can be made arbitrarily large with strong enough perturbations. To the best of our knowledge, this is the first time such a general robustness result is established for OT across different spaces. Our theoretical findings are showcased on unsupervised heterogeneous domain adaptation and single-cell multi-omic data alignment, demonstrating very competitive performance.

**Notations.** For any integer $n \geq 1$, we write $[\![1, n]\!] := \{1, \dots, n\}$. Given a Polish space $\mathcal{X}$, we denote by $\mathcal{M}^+(\mathcal{X})$ the set of nonnegative and finite Borel measures on $\mathcal{X}$. For any $\mu \in \mathcal{M}^+(\mathcal{X})$, we denote its mass by $m(\mu) := \mu(\mathcal{X})$. Unless specified otherwise, we always consider fully supported measures, i.e. $\operatorname{supp}(\mu) = \mathcal{X}$, for any measure $\mu \in \mathcal{M}^+(\mathcal{X})$. The product measure of two measures $\mu$ and $\nu$ is defined by $d(\mu \otimes \nu)(x, y) := d\mu(x)\,d\nu(y)$. Given $\pi \in \mathcal{M}^+(\mathcal{X} \times \mathcal{Y})$, we denote by $(\pi_{\#1}, \pi_{\#2})$ its marginal distributions, i.e. $d\pi_{\#1} = \int_{\mathcal{Y}} d\pi$ and $d\pi_{\#2} = \int_{\mathcal{X}} d\pi$. For $\mu, \nu \in \mathcal{M}^+(\mathcal{X})$, the Kullback-Leibler divergence is defined by $\operatorname{KL}(\mu|\nu) \overset{\text{def}}{=} \int \log\big(\tfrac{d\mu}{d\nu}\big)\, d\mu - \int d\mu + \int d\nu$ if $\mu \ll \nu$, and set to $+\infty$ otherwise. Finally, the indicator divergence $\iota_=(\mu|\nu)$ is equal to $0$ if $\mu = \nu$ and $+\infty$ otherwise.

## Unbalanced CO-optimal Transport (UCOOT)

The ultimate goal of the CO-Optimal Transport (COOT) framework is the simultaneous alignment of samples and features, allowing comparisons across spaces of different dimensions. In this section, we discuss OT formulations including OT, UOT, GW, UGW and COOT, then introduce the proposed UCOOT and show how the aforementioned distances fall into our framework.

**From sample alignment to sample-feature alignment.** Let $(\mathcal{X}^s_1, \mu^s_1)$ and $(\mathcal{X}^s_2, \mu^s_2)$ be a pair of compact measure spaces such that $\mathcal{X}^s_1$ and $\mathcal{X}^s_2$ belong to some common metric space $(E, d)$. Classical (unbalanced) optimal transport infers one alignment (or joint distribution) $\pi^s \in \mathcal{M}^+(\mathcal{X}^s_1 \times \mathcal{X}^s_2)$ with marginals $(\pi^s_{\#1}, \pi^s_{\#2})$ close to $(\mu^s_1, \mu^s_2)$ according to some appropriate divergence $D$, such that the cost
$$\int c(x_1, x_2)\, d\pi^s(x_1, x_2) + D(\pi^s_{\#1} \,|\, \mu^s_1) + D(\pi^s_{\#2} \,|\, \mu^s_2)$$
is minimal. For instance, in balanced (resp. unbalanced) OT, $D$ corresponds to the indicator divergence (resp. the KL divergence or TV). To define a generalized OT beyond one single alignment, we must first introduce a new pair of measure spaces $(\mathcal{X}^f_1, \mu^f_1)$ and $(\mathcal{X}^f_2, \mu^f_2)$. Intuitively, the two transport plans to be inferred, $\pi^s$ across samples and $\pi^f$ across features, must minimize a cost of the form
$$\iint c\big((x^s_1, x^f_1), (x^s_2, x^f_2)\big)\, d\pi^s(x^s_1, x^s_2)\, d\pi^f(x^f_1, x^f_2),$$
where $c((x^s_1, x^f_1), (x^s_2, x^f_2))$ is the joint cost of aligning the sample-feature pairs $(x^s_1, x^f_1)$ and $(x^s_2, x^f_2)$. However, unlike OT, there is no underlying ambient metric space in which comparisons between these pairs are straightforward. Thus, we consider a simplified cost of the form
$$c\big((x^s_1, x^f_1), (x^s_2, x^f_2)\big) = |\xi_1(x^s_1, x^f_1) - \xi_2(x^s_2, x^f_2)|^p,$$
for $p \geq 1$ and some scalar functions $\xi_1, \xi_2$ that define the sample-feature interactions. A similar definition was adopted by (Chowdhury et al. 2021) to extend COOT to the continuous setting in the context of hypergraphs.
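In the discrete setting, where $\xi_1(i, k) = A_{ik}$ and $\xi_2(j, l) = B_{jl}$ for two data matrices $A$ and $B$, the double integral above becomes a finite sum over all sample-feature pairs. A brute-force sketch (ours; suitable only for small matrices, since the cost tensor has $n_1 n_2 d_1 d_2$ entries):

```python
import numpy as np

def coot_transport_cost(A, B, pi_s, pi_f, p=2):
    # sum_{i,j,k,l} |A[i,k] - B[j,l]|^p * pi_s[i,j] * pi_f[k,l],
    # with A: (n1, d1), B: (n2, d2), pi_s: (n1, n2), pi_f: (d1, d2).
    L = np.abs(A[:, None, :, None] - B[None, :, None, :]) ** p  # (n1, n2, d1, d2)
    return np.einsum('ijkl,ij,kl->', L, pi_s, pi_f)
```

For $p = 2$, this sum factors into a few matrix products (the squared-loss decomposition of (Peyré, Cuturi, and Solomon 2016)), which is what practical solvers exploit; the sketch in the Numerical Aspects section below uses this factorization.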
Formally, our general formulation takes pairs of sample-feature spaces defined as follows.

**Definition 1** (Sample-feature space). Let $(\mathcal{X}^s, \mu^s)$ and $(\mathcal{X}^f, \mu^f)$ be compact measure spaces, where $\mu^s \in \mathcal{M}^+(\mathcal{X}^s)$ and $\mu^f \in \mathcal{M}^+(\mathcal{X}^f)$. Let $\xi$ be a scalar integrable function in $L^p(\mathcal{X}^s \times \mathcal{X}^f, \mu^s \otimes \mu^f)$. We call the triplet $X = ((\mathcal{X}^s, \mu^s), (\mathcal{X}^f, \mu^f), \xi)$ a sample-feature space, and $\xi$ is called an interaction.

**Definition 2** (Generalized COOT). Given two divergences $D_1$ and $D_2$, we define the generalized COOT of order $p$ between $X_1 = ((\mathcal{X}^s_1, \mu^s_1), (\mathcal{X}^f_1, \mu^f_1), \xi_1)$ and $X_2 = ((\mathcal{X}^s_2, \mu^s_2), (\mathcal{X}^f_2, \mu^f_2), \xi_2)$ by:
$$\inf_{\substack{\pi^s \in \mathcal{M}^+(\mathcal{X}^s_1 \times \mathcal{X}^s_2),\ \pi^f \in \mathcal{M}^+(\mathcal{X}^f_1 \times \mathcal{X}^f_2) \\ m(\pi^s) = m(\pi^f)}} \underbrace{\iint |\xi_1(x^s_1, x^f_1) - \xi_2(x^s_2, x^f_2)|^p\, d\pi^s\, d\pi^f}_{\text{transport cost of sample-feature pairs}} \;+\; \sum_{k=1}^{2} \lambda_k \underbrace{D_k\big(\pi^s_{\#k} \otimes \pi^f_{\#k} \,\big|\, \mu^s_k \otimes \mu^f_k\big)}_{\text{mass destruction / creation penalty}} \tag{1}$$
for $\lambda_1, \lambda_2 > 0$ and $p \geq 1$.

| | OT | UOT | GW | UGW | COOT | UCOOT |
|---|---|---|---|---|---|---|
| Across spaces | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
| Sample alignment | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Feature alignment | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ |
| Robustness to outliers | ✗ (Fatras et al. 2021) | ✓ (Fatras et al. 2021) | ✗ (Prop. 2) | ✓ (Thm. 1) | ✗ (Prop. 2) | ✓ (Thm. 1) |

Table 1: Properties of different OT formulations generalized by UCOOT. The proposed UCOOT is not only able to learn informative feature alignments, but is also robust to outliers.

As the multiplicative interplay between $\pi^s$ and $\pi^f$ leads to an invariance under the scaling map $\alpha \mapsto (\alpha \pi^s, \tfrac{1}{\alpha} \pi^f)$, for $\alpha > 0$, we further impose the equal-mass constraint $m(\pi^s) = m(\pi^f)$. It is worth mentioning that formulation (1) is not the only way to relax the marginal constraints. For example, instead of $D_k(\pi^s_{\#k} \otimes \pi^f_{\#k} \,|\, \mu^s_k \otimes \mu^f_k)$, one can consider $D_k(\pi^s_{\#k} \,|\, \mu^s_k) + D_k(\pi^f_{\#k} \,|\, \mu^f_k)$, or $D^s(\pi^s_{\#1} \otimes \pi^s_{\#2} \,|\, \mu^s_1 \otimes \mu^s_2)$ for some divergence $D^s$. However, amongst these choices, ours is the only one that can be recast as a variant of the unbalanced OT problem. This allows us to leverage known techniques from unbalanced OT to establish the theoretical and practical properties below, namely Proposition 1 and Theorem 1.

Note that the problem above is very general and, with some additional constraints, recovers exact OT, UOT, GW, UGW and COOT (see Table 2). In particular, if the measures $(\mu^s_1, \mu^s_2)$ and $(\mu^f_1, \mu^f_2)$ are probability measures, then setting $D_1 = D_2 = \iota_=$ leads to the COOT problem, first introduced in the discrete case in (Redko et al. 2020) and recently generalized to the continuous setting in (Chowdhury et al. 2021). In this work, we relax the hard constraints and consider a more flexible formulation based on the KL divergence:

**Definition 3** (UCOOT). We define Unbalanced COOT (UCOOT) as problem (1) with $D_1 = D_2 = \operatorname{KL}$. We write $\operatorname{UCOOT}_\lambda(X_1, X_2)$ to denote the UCOOT between two sample-feature spaces $X_1$ and $X_2$, for a given pair of hyperparameters $\lambda = (\lambda_1, \lambda_2)$.

While various properties of the divergences $D_k$ have been extensively studied in the context of unbalanced OT by several authors (Chizat 2017; Frogner et al. 2015), the concept of sample-feature interaction requires more clarification. Let us consider some simple examples. In the discrete case, we consider $n$ observations of $d$ features represented by a matrix $A \in \mathbb{R}^{n \times d}$. In this case, the space $\mathcal{X}^s$ (resp. $\mathcal{X}^f$) is not explicitly known but can be characterized by the finite set $[\![1, n]\!]$ (resp. $[\![1, d]\!]$), up to an isomorphism. Assuming that all samples (resp. features) are equally important, the discrete empirical measures can be given uniform weights $\mu^s = \mathbf{1}_n / n$ (resp. $\mu^f = \mathbf{1}_d / d$). The most natural sample-feature interaction $\xi$ is then simply the index function $\xi(i, j) = A_{ij}$. In the continuous case, we assume that the data stream from a continuous random variable $a \sim \mu^s \in \mathcal{P}(\mathbb{R}^d)$, for which an interaction function can be $\xi(a, j) = a_j$.

**Proposition 1.** For any $D_1, D_2 \in \{\iota_=, \operatorname{KL}\}$, the problem of Definition 2 (Equation (1)) admits a minimizer.
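To make Definition 3 concrete in the discrete uniform-weight setting, here is an illustrative sketch (ours) of the UCOOT objective value for given couplings, reusing `kl` and `coot_transport_cost` from the snippets above; note that each KL penalty is taken between product measures, as in Equation (1):

```python
import numpy as np

def ucoot_objective(A, B, pi_s, pi_f, lam1=1.0, lam2=1.0, p=2):
    # Discrete UCOOT objective of Definition 3 (without entropic term),
    # with uniform reference measures over samples and features.
    n1, d1 = A.shape
    n2, d2 = B.shape
    mu_s1, mu_f1 = np.full(n1, 1 / n1), np.full(d1, 1 / d1)
    mu_s2, mu_f2 = np.full(n2, 1 / n2), np.full(d2, 1 / d2)
    cost = coot_transport_cost(A, B, pi_s, pi_f, p=p)
    # KL between PRODUCT measures, e.g. KL(pi^s_{#1} x pi^f_{#1} | mu^s_1 x mu^f_1).
    pen1 = kl(np.outer(pi_s.sum(1), pi_f.sum(1)).ravel(),
              np.outer(mu_s1, mu_f1).ravel())
    pen2 = kl(np.outer(pi_s.sum(0), pi_f.sum(0)).ravel(),
              np.outer(mu_s2, mu_f2).ravel())
    return cost + lam1 * pen1 + lam2 * pen2
```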
**Remark 1.** The existence of a minimizer shown in Proposition 1 can be extended to the larger family of Csiszár divergences (Csiszár 1963). A general proof is given in the Appendix.

**UCOOT and perfect alignment.** Suppose that $X_1$ and $X_2$ are two finite sample-feature spaces such that $(\mathcal{X}^s_1, \mathcal{X}^s_2)$ and $(\mathcal{X}^f_1, \mathcal{X}^f_2)$ have the same cardinality and are equipped with the uniform measures $\mu^s_1 = \mu^s_2$, $\mu^f_1 = \mu^f_2$. Then $\operatorname{UCOOT}_\lambda(X_1, X_2) = 0$ if and only if there exist perfect alignments between the rows (samples) and between the columns (features) of the interaction matrices $\xi_1$ and $\xi_2$.

## Robustness of COOT and UCOOT

When discussing the concept of robustness, outliers are often considered as samples not following the underlying distribution of the data. In our general context of sample-feature alignments, we consider a pair $(x^s, x^f) \in \mathcal{X}^s \times \mathcal{X}^f$ to be an outlier if the magnitude of $|\xi(x^s, x^f)|$ is abnormally larger than the other interactions between $\mathcal{X}^s$ and $\mathcal{X}^f$. As a result, such outliers lead to abnormally large transportation costs $|\xi_1 - \xi_2|$. To study the robustness of COOT and UCOOT, we consider an outlier scenario in which the marginal data distributions are contaminated by some additive noise distribution.

**Assumption 1.** Consider two sample-feature spaces $X_k = ((\mathcal{X}^s_k, \mu^s_k), (\mathcal{X}^f_k, \mu^f_k), \xi_k)$, for $k = 1, 2$. Let $\varepsilon^s$ (resp. $\varepsilon^f$) be a probability measure with compact support $O^s$ (resp. $O^f$). For $a \in \{s, f\}$, define the noisy distribution $\tilde{\mu}^a = \alpha^a \mu^a + (1 - \alpha^a)\,\varepsilon^a$, where $\alpha^a \in [0, 1]$. We assume that $\xi_1$ is defined on $(\mathcal{X}^s_1 \cup O^s) \times (\mathcal{X}^f_1 \cup O^f)$ and that $\xi_1, \xi_2$ are continuous on their supports. We denote the contaminated sample-feature space by $\widetilde{X}_1 = ((\mathcal{X}^s_1 \cup O^s, \tilde{\mu}^s_1), (\mathcal{X}^f_1 \cup O^f, \tilde{\mu}^f_1), \xi_1)$. Finally, we define some useful minimal and maximal costs:
$$\Delta_0 \overset{\text{def}}{=} \min_{\substack{x^s_1 \in O^s,\ x^f_1 \in O^f \\ x^s_2 \in \mathcal{X}^s_2,\ x^f_2 \in \mathcal{X}^f_2}} |\xi_1(x^s_1, x^f_1) - \xi_2(x^s_2, x^f_2)|^p, \qquad \Delta_\infty \overset{\text{def}}{=} \max_{\substack{x^s_1 \in \mathcal{X}^s_1 \cup O^s,\ x^f_1 \in \mathcal{X}^f_1 \cup O^f \\ x^s_2 \in \mathcal{X}^s_2,\ x^f_2 \in \mathcal{X}^f_2}} |\xi_1(x^s_1, x^f_1) - \xi_2(x^s_2, x^f_2)|^p.$$
Here, $\Delta_0$ accounts for the minimal deviation of the cost between the outliers and the target support, while $\Delta_\infty$ is the maximal deviation between the contaminated source and the target.

| Requirements | OT | GW | COOT | semi-d. COOT | UCOOT |
|---|---|---|---|---|---|
| Shape of inputs | $d_1 = d_2$ | $n_1 = d_1$, $n_2 = d_2$ | – | – | – |
| Coupling constraint | $\pi^f = I_{d_1} = I_{d_2}$ | $\pi^f = \pi^s$ | – | – | – |
| Scalar function | $\xi(i, j) = A_{ij}$ | $\xi(i, j) = \operatorname{dist}(A_{i.}, A_{j.})$ | $\xi(i, j) = A_{ij}$ | $\xi(a, j) = a_j$ | $\xi(i, j) = A_{ij}$ |
| Divergence | $\iota_=$ | $\iota_=$ | $\iota_=$ | $\iota_=$ | KL |

Table 2: Conditions under which different OT formulations fall within the generalized framework of Definition 2. Here, semi-d. refers to the semi-discrete setting, where $\mu^s$ is a continuous probability measure and $\mu^f = \mathbf{1}_d / d$, and $I_d$ is the identity matrix in $\mathbb{R}^{d \times d}$.

The exact marginal constraints of COOT enforce conservation of mass: outliers must be transported no matter how large their transportation costs are. This intuition is captured by the following result.

**Proposition 2** (COOT is sensitive to outliers). Consider $\widetilde{X}_1, X_2$ as defined in Assumption 1. Then:
$$\operatorname{COOT}(\widetilde{X}_1, X_2) \geq (1 - \alpha^s)(1 - \alpha^f)\, \Delta_0. \tag{2}$$

Whenever the outlier proportion $(1 - \alpha^s)(1 - \alpha^f)$ is positive, COOT increases with the distance between the supports of the outliers and those of the clean data. Thus, the right-hand side of (2) can be made arbitrarily large by taking outliers far from the supports of the clean data.
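For intuition, in the discrete case the contamination of Assumption 1 simply mixes a clean histogram with a noise histogram supported on extra outlier atoms; a tiny sketch (ours, with illustrative names):

```python
import numpy as np

def contaminate(mu, n_outliers, alpha):
    # Discrete analogue of Assumption 1: the clean histogram mu keeps a
    # fraction alpha of the mass, and uniform noise eps carries the rest,
    # placed on n_outliers extra atoms appended to the support.
    eps = np.full(n_outliers, 1.0 / n_outliers)
    return np.concatenate([alpha * mu, (1.0 - alpha) * eps])
```

Applying `contaminate` to the sample weights (with $\alpha^s$) and the feature weights (with $\alpha^f$) yields the contaminated space $\widetilde{X}_1$.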
We can now state our main theoretical contribution: relaxing the marginal constraints leads to a loss that saturates as outliers get further from the data.

**Theorem 1** (UCOOT is robust to outliers). Consider two sample-feature spaces $\widetilde{X}_1, X_2$ as defined in Assumption 1. Let $\delta \overset{\text{def}}{=} 2(\lambda_1 + \lambda_2)(1 - \alpha^s \alpha^f)$ and $K = \frac{M + 1}{M}\operatorname{UCOOT}(X_1, X_2) + \delta$, where $M = m(\pi^s) = m(\pi^f)$ is the transported mass between the clean data. Then:
$$\operatorname{UCOOT}(\widetilde{X}_1, X_2) \;\leq\; \alpha^s \alpha^f \operatorname{UCOOT}(X_1, X_2) + \delta M \left[ 1 - \exp\!\left( -\frac{\Delta_\infty (1 + M) + K}{\delta M} \right) \right].$$

The proof of Theorem 1 is provided in the Appendix and is inspired by (Fatras et al. 2021), but in a much more general setting: (1) it covers both sample and feature outliers, and (2) it considers a noise distribution instead of a Dirac. Note that inequality (2) indicates that outliers can make COOT arbitrarily large, while UCOOT is upper-bounded and discards the mass of outliers with a high transportation cost. This is well illustrated in Figure 1, where we simulate outliers by adding a perturbation to a row of the interaction matrix. More precisely, we first generate a matrix $A \in \mathbb{R}^{20 \times 15}$ with $A_{ij} = \cos(\tfrac{i}{20}\pi) + \cos(\tfrac{j}{15}\pi)$. Then, we replace its last row by $\tau \mathbf{1}_{15}^\top$, for $\tau \geq 0$. Figure 1 depicts COOT and UCOOT between $A$ and its modified version as a function of $\tau$: the higher the value of $\tau$, the more likely the last row contains the interactions of outliers. Consequently, as $\tau$ increases, so does COOT, but at a much higher pace, whereas UCOOT remains stable. It should be noted that, with minimal adaptation, Theorem 1 also holds for the unbalanced GW (UGW) distance. This provides a theoretical explanation of the empirical observation in (Séjourné, Vialard, and Peyré 2021) that, unlike GW, the UGW distance is robust to outliers.

Figure 1: Sensitivity of COOT and UCOOT in the presence of outliers (loss value as a function of the row perturbation $\tau$).

## Numerical Aspects

Solving COOT-type problems is, in general, not trivial. As highlighted in (Redko et al. 2020), the balanced case corresponds to a convex relaxation of the bilinear assignment problem, which seeks the pair of permutations minimizing the transport cost. Here we argue that relaxing the marginal constraints makes the problem easier in two different respects: (1) the resulting problem can be solved through a sequence of GPU-friendly iterations; (2) the regularization leads to lower alignment costs and thus better local minima. In this section, we first describe how to compute UCOOT in practice.

**Optimization strategy.** We consider two tabular datasets $A \in \mathbb{R}^{n_1 \times d_1}$ and $B \in \mathbb{R}^{n_2 \times d_2}$. Let $u_k$ be the uniform histogram over sample-feature pairs: $u_k \overset{\text{def}}{=} \frac{1}{n_k d_k} \mathbf{1}_{n_k} \mathbf{1}_{d_k}^\top$, for $k = 1, 2$. For the sake of simplicity, we assume uniform weights over both samples and features. Computing UCOOT can be done using block-coordinate descent (BCD), both with and without entropic regularization. More precisely, given a hyperparameter $\varepsilon \geq 0$, discrete UCOOT can be written as
$$\min_{\substack{\pi^s, \pi^f \geq 0 \\ m(\pi^s) = m(\pi^f)}} \sum_{i,j,k,l} (A_{ik} - B_{jl})^2\, \pi^s_{ij}\, \pi^f_{kl} \;+\; \lambda_1 \operatorname{KL}\big(\pi^s \mathbf{1}_{n_2} \otimes \pi^f \mathbf{1}_{d_2} \,\big|\, u_1\big) + \lambda_2 \operatorname{KL}\big((\pi^s)^\top \mathbf{1}_{n_1} \otimes (\pi^f)^\top \mathbf{1}_{d_1} \,\big|\, u_2\big) + \varepsilon \operatorname{KL}\big(\pi^s \otimes \pi^f \,\big|\, \mu^s_1 \otimes \mu^s_2 \otimes \mu^f_1 \otimes \mu^f_2\big).$$
The only difference between $\varepsilon = 0$ and $\varepsilon > 0$ lies in the inner-loop algorithm used to update one of the transport plans $(\pi^s, \pi^f)$ while the other remains fixed. Note that both cases allow for implementations of a multiplicative scaling algorithm that can be parallelized on GPUs. For $\varepsilon > 0$, updating each transport plan boils down to an entropic UOT problem, which can be solved efficiently using the unbalanced variant of Sinkhorn's algorithm (Chizat et al. 2018a).

Algorithm 1: BCD algorithm to solve UCOOT
  Input: $A \in \mathbb{R}^{n_1 \times d_1}$, $B \in \mathbb{R}^{n_2 \times d_2}$, $\lambda_1$, $\lambda_2$, $\varepsilon$
  Initialize $\pi^s$ and $\pi^f$
  repeat
    Update $\pi^s$ using Sinkhorn or NNPR
    Rescale $\pi^s \leftarrow \sqrt{m(\pi^f)/m(\pi^s)}\;\pi^s$
    Update $\pi^f$ using Sinkhorn or NNPR
    Rescale $\pi^f \leftarrow \sqrt{m(\pi^s)/m(\pi^f)}\;\pi^f$
  until convergence
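As a rough, simplified sketch of Algorithm 1 (ours, not the authors' released implementation): each inner update solves an entropic UOT problem with the unbalanced Sinkhorn scaling of (Chizat et al. 2018a), using the squared-loss factorization of the cost. For readability we keep $\lambda$ and $\varepsilon$ fixed across iterations, whereas the exact updates involve mass-dependent rescalings of these constants (see the Appendix):

```python
import numpy as np

def uot_sinkhorn(C, a, b, lam, eps, n_iter=200):
    # Unbalanced Sinkhorn scaling (Chizat et al. 2018a) for
    # min_P <C,P> + eps*KL(P | a x b) + lam*KL(P 1 | a) + lam*KL(P^T 1 | b).
    # A log-domain implementation is advisable for small eps.
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    f = lam / (lam + eps)
    for _ in range(n_iter):
        u = (a / (K @ v + 1e-300)) ** f
        v = (b / (K.T @ u + 1e-300)) ** f
    return u[:, None] * K * v[None, :]

def ucoot_bcd(A, B, lam=1.0, eps=5e-2, n_bcd=50):
    n1, d1 = A.shape
    n2, d2 = B.shape
    mu_s1, mu_s2 = np.full(n1, 1 / n1), np.full(n2, 1 / n2)
    mu_f1, mu_f2 = np.full(d1, 1 / d1), np.full(d2, 1 / d2)
    pi_s, pi_f = np.outer(mu_s1, mu_s2), np.outer(mu_f1, mu_f2)
    for _ in range(n_bcd):
        # Squared-loss factorization of the cost for the sample plan:
        # M_s[i,j] = sum_{k,l} (A[i,k] - B[j,l])^2 * pi_f[k,l].
        M_s = ((A ** 2) @ pi_f.sum(1))[:, None] \
            + ((B ** 2) @ pi_f.sum(0))[None, :] - 2 * A @ pi_f @ B.T
        pi_s = uot_sinkhorn(M_s, mu_s1, mu_s2, lam, eps)
        pi_s *= np.sqrt(pi_f.sum() / pi_s.sum())  # enforce m(pi_s) = m(pi_f)
        # Same factorization for the feature plan, with roles swapped.
        M_f = ((A ** 2).T @ pi_s.sum(1))[:, None] \
            + ((B ** 2).T @ pi_s.sum(0))[None, :] - 2 * A.T @ pi_s @ B
        pi_f = uot_sinkhorn(M_f, mu_f1, mu_f2, lam, eps)
        pi_f *= np.sqrt(pi_s.sum() / pi_f.sum())
    return pi_s, pi_f
```

For instance, `ucoot_bcd(np.random.rand(20, 15), np.random.rand(30, 10))` returns a 20×30 sample plan and a 15×10 feature plan.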
The main benefit of entropic regularization is to reduce the number of variables from $(n_1 \times n_2) + (d_1 \times d_2)$ to $n_1 + n_2 + d_1 + d_2$. Moreover, by taking $\varepsilon$ sufficiently small, we can recover solutions close to those of the non-entropic case. For $\varepsilon = 0$, the non-regularized UOT problem can be recast as a non-negative penalized regression (NNPR) (Chapel et al. 2021). This problem can be solved using a majorization-minimization algorithm, which leads to a multiplicative update of the transport plan. For the sake of reproducibility, we provide the details of both optimization schemes in the Appendix.

## Experiments

### Illustration and Interpretation of UCOOT on MNIST Images

We illustrate the robustness of UCOOT and its ability to learn meaningful feature alignments in the presence of both sample and feature outliers on the MNIST dataset. We introduce the feature outliers by applying a handcrafted transformation $\varphi_\sigma$ that performs a zero-padding (shift), a 45° rotation, a resize to $(28, 34)$, and adds Gaussian noise $\mathcal{N}(0, \sigma^2)$ entries to the first 10 columns of the image. Figure 2 (a) shows some examples of original and transformed images. We randomly sample 100 images per class (1000 in total) from $X = \text{MNIST}$ and $Y = \varphi_\sigma(\text{MNIST})$. Regarding the sample outliers, we add 50 random images with uniform entries in $[0, 1]$ to the target data $Y$. We then compute the optimal COOT and UCOOT alignments shown in Figure 2 (b) and (c). The flexibility of UCOOT with respect to mass transportation allows it to completely disregard: (1) noisy and uninformative pixels (features), which are all given the same weight, as depicted in (b); (2) all the sample outliers, none of which are transported, as shown by the last blank column of the alignment in (c). Moreover, notice how the color-coded input image is transformed according to the transformation $\varphi_\sigma$, despite the fact that no spatial information is provided in the OT problem. On the other hand, a very small perturbation ($\sigma = 0.01$) is enough for the sample alignment given by COOT to lose its block-diagonal dominant structure (class information is lost), while the UCOOT alignment remains unscathed. One may wonder whether the performance of UCOOT still holds for other values of $\sigma$. Figure 3 answers this question positively: for each $\sigma > 0$, we compute the average accuracy (defined as the percentage of mass within the block-diagonal structure) over 20 different runs. The performance of COOT not only degrades with noisier outliers but is also unstable. By contrast, the accuracy of UCOOT remains almost constant regardless of the level of noise.
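The exact parameters of $\varphi_\sigma$ (padding amount, interpolation) are not spelled out above, so the following is only a rough re-implementation sketch (ours) of the described pipeline, using SciPy: pad-shift, 45° rotation, resize to $(28, 34)$, and Gaussian noise on the first 10 columns.

```python
import numpy as np
from scipy import ndimage

def phi_sigma(img, sigma, rng):
    # img: (28, 28) grayscale digit in [0, 1]. The pad amount and the
    # interpolation are our guesses; only the pipeline follows the text.
    shifted = np.pad(img, ((4, 0), (4, 0)))[:28, :28]      # zero-padding shift
    rotated = ndimage.rotate(shifted, 45, reshape=False)   # 45-degree rotation
    resized = ndimage.zoom(rotated, (1.0, 34 / 28))        # resize to (28, 34)
    resized[:, :10] += rng.normal(0.0, sigma, size=(28, 10))  # noisy columns
    return np.clip(resized, 0.0, 1.0)

# Example: out = phi_sigma(img, sigma=0.01, rng=np.random.default_rng(0))
```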
Figure 2: Example illustrating the feature alignment $\pi^f$ learned by UCOOT and its robustness to outliers, with $X = \text{MNIST} \subset [0, 1]^{28 \times 28}$ and $Y = \varphi_\sigma(\text{MNIST}) \subset [0, 1]^{28 \times 34}$. (a) Visualization of 4 random samples from both datasets. The added Gaussian noise only affects the first 10 columns of the images and is different across images. (b) The barycentric mapping (see Appendix for details) defined by UCOOT: a color-coded input image sent through the learned feature alignment recovers the transformation defined by $\varphi_\sigma$ while disregarding non-informative features. (c) Alignments across samples from $X$ and $Y$, where the target $Y$ is contaminated with 50 sample outliers (images with uniform entries in $[0, 1]$). A very small amount of noise is sufficient to derail COOT; unlike COOT, UCOOT does not transport any outlier sample. Accuracy is computed as the percentage of mass within the block-diagonal structure.

Figure 3: Robustness of UCOOT vs. COOT on the MNIST example (sample alignment accuracy (%) as a function of the noise level $\sigma$). The accuracy is averaged over 20 trials.

### Heterogeneous Domain Adaptation (HDA)

We now investigate the application of discrete UCOOT to unsupervised Heterogeneous Domain Adaptation (HDA), a particularly difficult problem in which one wants to predict classes on unlabeled data using labeled data lying in a different space. OT methods across spaces have recently shown good performance on such tasks, in particular the GW distance (Yan et al. 2018) and COOT (Redko et al. 2020).

**Datasets and experimental setup.** We consider the Caltech-Office dataset (Saenko et al. 2010) containing three domains: Amazon (A, 1123 images), Caltech-256 (C, 958 images) and Webcam (W, 295 images), with 10 overlapping classes amongst them. Each image is represented by the output of the second-to-last layer of the GoogLeNet (Szegedy et al. 2015) and CaffeNet (Jia et al. 2014) neural network architectures, which results in 1024- and 4096-dimensional vectors, respectively (thus $d_s = 4096$ and $d_t = 1024$, as adaptation is performed from CaffeNet to GoogLeNet features). We compare four OT-based methods: GW, COOT, UGW and UCOOT. The hyperparameters of each method are validated on a single pair of domains (W→W), then fixed for all other pairs in order to provide truly unsupervised HDA. We follow the same experimental setup as (Redko et al. 2020). For each pair of domains, we randomly choose 20 samples per class (thus $n_s = n_t = 200$) and perform adaptation from CaffeNet to GoogLeNet features, then calculate the accuracy of the resulting predictions on the target domain using OT label propagation (Redko et al. 2019). This technique uses the OT plan to estimate the amount of mass transported from each class (since the source samples are labeled) to a given target sample; the predicted class is the one contributing the most mass (a minimal sketch of this rule is given below). We repeat this process 10 times and report the average and standard deviation of the performance. In both the source and target domains, we assign uniform sample and feature distributions.

**HDA results and robustness to target shift.** The means and standard deviations of the accuracy on target data are reported in Table 3 for all methods and all pairs of domains. We observe that, thanks to its robustness, UCOOT outperforms COOT on 7 out of 9 dataset pairs, with a higher average accuracy but also a slightly larger variance. This is due to the difficulty of the unsupervised HDA problem and the instability present in all methods. In particular, GW-based approaches perform very poorly. This may be due to the fact that the pre-trained models produce meaningful but very high-dimensional vector representations of the images: using Euclidean distance matrices as inputs not only causes a loss of information but is also less relevant in high dimension (see, for example, (Aggarwal, Hinneburg, and Keim 2001), or Theorem 3.1.1 and Remark 3.1.2 in (Vershynin 2018)).

We also illustrate the robustness of UCOOT to a change in class proportions, also known as target shift. More precisely, we simulate a change in proportions only in the source domain by selecting $20\rho$ samples per class for 4 out of the 10 classes, with $\rho$ decreasing from $\rho = 1$ to $\rho = 0.2$. In this configuration, the classes in the source domain are imbalanced and the unlabeled HDA problem becomes more difficult.
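A minimal sketch (ours) of the OT label propagation rule of (Redko et al. 2019) described above: the class mass reaching each target sample is accumulated from the labeled source rows of the sample plan, and the argmax gives the prediction.

```python
import numpy as np

def ot_label_propagation(pi_s, y_src, n_classes):
    # pi_s: (n_src, n_tgt) sample transport plan; y_src: (n_src,) labels.
    # Accumulate, for each target sample, the mass received from each
    # source class, then predict the class contributing the most mass.
    class_mass = np.zeros((n_classes, pi_s.shape[1]))
    for c in range(n_classes):
        class_mass[c] = pi_s[y_src == c].sum(axis=0)
    return class_mass.argmax(axis=0)  # (n_tgt,) predicted labels
```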
We report the performance of all methods as a function of the Total Variation (TV) between the class marginal distributions on one pair of domains in Figure 4. We can see that UCOOT is quite robust to changes in class proportions, while COOT experiences a sharp decrease in accuracy when the class distributions become more imbalanced.

Figure 4: Robustness to class proportion change for increasing TV on the class marginals (accuracy (%) of UCOOT, COOT, GW and UGW).

| Domains | GW | UGW | COOT | UCOOT |
|---|---|---|---|---|
| C→C | 16.25 (±7.54) | 10.85 (±2.13) | 36.40 (±12.94) | 44.05 (±19.33) |
| C→A | 12.95 (±7.74) | 11.60 (±4.86) | 28.30 (±11.78) | 31.90 (±7.43) |
| C→W | 18.95 (±9.43) | 14.15 (±3.98) | 19.55 (±14.51) | 28.55 (±6.60) |
| A→C | 16.40 (±8.99) | 10.25 (±5.66) | 41.80 (±14.81) | 39.15 (±17.98) |
| A→A | 14.75 (±15.20) | 20.20 (±6.45) | 57.90 (±16.84) | 42.45 (±15.47) |
| A→W | 14.55 (±8.83) | 20.65 (±4.13) | 42.10 (±7.80) | 48.55 (±13.06) |
| W→C | 20.65 (±11.90) | 14.20 (±5.13) | 8.60 (±6.56) | 69.80 (±14.91) |
| W→A | 17.00 (±9.75) | 7.10 (±2.45) | 16.65 (±10.01) | 30.55 (±10.09) |
| W→W | 19.30 (±11.87) | 24.40 (±3.28) | 75.30 (±3.26) | 51.50 (±20.51) |
| Average | 16.76 (±10.14) | 14.82 (±4.23) | 36.29 (±10.95) | 42.94 (±13.93) |

Table 3: Unsupervised HDA from CaffeNet to GoogLeNet (accuracy in %, mean ± standard deviation over 10 runs).

### Single-Cell Multi-Omics Alignment

Finally, we present a real-world application of UCOOT: the alignment of single-cell measurements. Recent advances in single-cell sequencing technologies allow biologists to measure a variety of cellular features at the single-cell resolution, such as expression levels of genes and epigenomic changes in the genome (Buenrostro et al. 2015; Chen, Lake, and Zhang 2019), or the abundance of surface proteins (Stoeckius et al. 2017). These multiple measurements produce single-cell multi-omics datasets. Such datasets, measuring different biological phenomena at the single-cell resolution, allow scientists to study how cellular processes are regulated, revealing finer cell variations during development and disease. However, it is hard to obtain multiple types of measurements from the same individual cells due to experimental limitations. Therefore, many single-cell multi-omics datasets contain disparate measurements from different sets of cells. As a result, computational methods are required to align the cells and the features of the different measurements, in order to learn the relationships between them and to help with data analysis and integration. Multiple tools (Cao, Hong, and Wan 2021; Hao et al. 2021; Liu et al. 2019), including GW-based (Cao, Hong, and Wan 2021; Demetci et al. 2022) and UGW-based (Demetci et al. 2021) methods, have shown good performance for cell-to-cell alignment. However, aligning both samples and features is a more challenging and critical task that GW and UGW-based methods cannot address. Here we provide an application of UCOOT to simultaneously align the samples and features of a single-cell multi-omics dataset. For demonstration, we choose a dataset generated by the CITE-seq experiment (Stoeckius et al. 2017), which simultaneously measures gene expression and antibody (surface protein) abundance in single cells. From this dataset, we use 1000 human peripheral blood cells, for which ten antibodies and 17,014 genes are profiled. We selected this specific dataset because the ground-truth correspondences are known for both the samples (i.e., cells) and the features (i.e., genes and their encoded antibodies), which allows us to quantify and compare the alignment performance of UCOOT and COOT. As done previously (Cao, Hong, and Wan 2021; Demetci et al. 2022; Liu et al. 2019),
we quantify the cell alignment performance by calculating the fraction of cells closer than the true match (FOSCTTM) for each cell in the dataset and averaging it across all cells (a minimal sketch of this metric is given below). This metric quantifies alignment error, so lower values are more desirable. The feature alignments are measured by calculating the accuracy of correct matchings. The results are presented after tuning the hyperparameters of both methods with a similar grid size per hyperparameter (see Experimental Details in the Appendix).

**Balanced scenario.** First, we select and align the same number of samples and features across the two datasets. For this, we subset the gene expression domain to the ten genes that match the ten antibodies they encode. The original data contains the same number of cells across domains, since both domains are simultaneously measured in the same single cells. We observe that both UCOOT and COOT can correctly align the features (Figure 5 (a)) and the cells (Appendix Figure S1 (a)) across the two measurements. However, UCOOT gives better performance, as demonstrated by a lower FOSCTTM score for cells (0.0062 vs 0.0127). Both COOT and UCOOT recover the diagonal of matching features (100% accuracy), but UCOOT recovers the exact matching, likely due to its robustness to noise, whereas COOT assigns weights to other features as well.

**Unbalanced scenarios.** Next, we perform alignment with an unequal number of features. This setting is more likely to occur in real-world single-cell datasets, where different features are measured. In the first, simple scenario, we align the ten antibodies with only a subset (five) of their matching genes. As visualized in Figure 5 (b), COOT struggles to find the correct feature alignments (60% accuracy), which would lie on the diagonal of the highlighted square (dashed lines). In contrast, the relaxation of the mass conservation constraint in UCOOT allows it to shrink the mass of antibodies that lack matches in the gene expression domain, leading to higher accuracy (100%). Next, we align the ten antibodies with the 50 most variable genes in the dataset, including their matching genes. This alignment task is the most realistic scenario, as single-cell multi-omics data consist of high-dimensional datasets with different numbers of features across measurements; biologists therefore focus their analyses on a reduced set of the most variable features (e.g., genes). It is also the most computationally challenging case among all our experiments on this dataset. Hence, we provide sample-level supervision to both methods by supplying the sample alignment computation with a cost penalization matrix based on the correct sample alignments. We see in Figure 5 (c) that, in comparison to COOT (50% accuracy), UCOOT recovers more of the correct feature alignments (70% accuracy) and yields fewer redundant alignments (for more details, see Experimental Details in the Appendix). Note that UCOOT avoids incorrect mappings by locally shrinking the mass of the features or samples that lack correspondences. This prevents subsequent incorrect downstream analysis of the integration results.
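For completeness, a small sketch (ours) of the FOSCTTM metric, assuming the two measurements have been embedded in a shared space (e.g., via the barycentric mapping) so that row $i$ of `X` and row $i$ of `Y` correspond to the same cell:

```python
import numpy as np

def foscttm(X, Y):
    # X[i] and Y[i] are the embeddings of the SAME cell in the two domains.
    # For each cell, compute the fraction of candidate cells lying closer
    # than its true match; average over both directions. Lower is better.
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # (n, n)
    n = D.shape[0]
    frac_xy = (D < np.diag(D)[:, None]).sum(axis=1) / (n - 1)
    frac_yx = (D < np.diag(D)[None, :]).sum(axis=0) / (n - 1)
    return float(np.mean((frac_xy + frac_yx) / 2))
```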
Figure 5: Feature alignments of COOT and UCOOT on the single-cell multi-omics dataset, between antibodies (surface proteins) and their matching genes (which encode them). (a) Balanced scenario, aligning matching features: the features are sorted such that the correct alignment yields a diagonal matrix. (b) Unbalanced scenario, aligning a subset of the matching features: only five of the correct gene matches are kept (the last five genes from (a) are excluded). (c) Unbalanced scenario: aligning the ten antibodies with the top 50 most variable genes, including the matching genes. For (b) and (c), the diagonal within the dashed square highlights the correct matches. Overall, UCOOT gives better feature alignments.

This property can also help users discover rare cell types, by observing the extent of mass relaxation per cell, or prune uninformative features in single-cell datasets. Lastly, we also consider the case of an unequal number of samples across the two measurements. This case is common in real-world single-cell multi-omics datasets that are not simultaneously measured, and Demetci et al. (2021) have shown that single-cell alignment methods that do not account for this mismatch yield poor alignment results. Therefore, we downsample the number of cells in one of the domains by 25% and perform alignment with the full set of cells in the other domain (Appendix Figure S1 (b & c)). We compute the FOSCTTM score for all cells that have a true match in the dataset and report the average values. UCOOT continues to yield a low FOSCTTM score (0.0081, compared to 0.0062 in the balanced scenario), while COOT shows a larger drop in performance (0.1342, compared to 0.0127 in the balanced scenario).

## Discussion and Conclusion

In this work, we present an extension of COOT called unbalanced COOT, where the hard constraint on the marginal distributions is replaced by a soft control via the KL divergence. The resulting problem not only benefits from the flexibility of COOT but also enjoys provable robustness in the presence of outliers, which is not the case for COOT. The experimental results confirm our findings, yielding very competitive performance on the unsupervised HDA task, as well as meaningful feature couplings for single-cell multi-omics alignment. Also, while UCOOT introduces additional hyperparameters, domain knowledge can help narrow down the range of feasible values, thus reducing the time and computational cost of the tuning process. Further investigation should be carried out to fully understand and assess the observed efficiency of UCOOT in real-world applications, and to explore the possibilities of UCOOT in more diverse applicative settings, including its use as a loss in deep learning architectures.
Lastly, from a theoretical perspective, statistical properties such as sample complexity or stability analysis are needed to better understand the intricate relations between the sample and feature couplings.

## Acknowledgements

The authors thank Tuan Binh Nguyen for the fruitful discussions and invaluable suggestions. This work is funded by the projects OTTOPIA ANR-20-CHIA-0030 and 3IA Côte d'Azur Investments ANR-19-P3IA-0002 of the French National Research Agency (ANR), the 3rd Programme d'Investissements d'Avenir ANR-18-EUR-0006-02, the Chair "Challenging Technology for Responsible Energy" led by l'X (École Polytechnique) and the Fondation de l'École Polytechnique, sponsored by TOTAL, and the Chair "Business Analytics for Future Banking" sponsored by NATIXIS. This research is produced within the framework of the Energy4Climate Interdisciplinary Center (E4C) of IP Paris and École des Ponts ParisTech. Pinar Demetci's and Ritambhara Singh's contributions were funded by R35 HG011939.

## References

Aggarwal, C. C.; Hinneburg, A.; and Keim, D. A. 2001. On the Surprising Behavior of Distance Metrics in High Dimensional Space. In Database Theory - ICDT 2001, 420–434. Springer Berlin Heidelberg.

Alvarez-Melis, D.; and Jaakkola, T. S. 2018. Gromov-Wasserstein Alignment of Word Embedding Spaces. Empirical Methods in Natural Language Processing (EMNLP), 1881–1890.

Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein Generative Adversarial Networks. International Conference on Machine Learning, 70: 214–223.

Balaji, Y.; Chellappa, R.; and Feizi, S. 2020. Robust Optimal Transport with Applications in Generative Modeling and Domain Adaptation. Advances in Neural Information Processing Systems.

Bazeille, T.; Richard, H.; Janati, H.; and Thirion, B. 2019. Local optimal transport for functional brain template estimation. In International Conference on Information Processing in Medical Imaging, 237–248. Springer.

Benamou, J.-D. 2003. Numerical resolution of an unbalanced mass transport problem. ESAIM: Mathematical Modelling and Numerical Analysis, 37: 851–868.

Buenrostro, J. D.; Wu, B.; Chang, H. Y.; and Greenleaf, W. J. 2015. ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Current Protocols in Molecular Biology, 109(1): 21.29.1–21.29.9.

Burago, D.; Burago, Y.; and Ivanov, S. 2001. A Course in Metric Geometry, volume 33 of Graduate Studies in Mathematics. American Mathematical Society.

Cao, K.; Hong, Y.; and Wan, L. 2021. Manifold alignment for heterogeneous single-cell multi-omics data integration using Pamona. Bioinformatics. Btab594.

Chapel, L.; Flamary, R.; Wu, H.; Févotte, C.; and Gasso, G. 2021. Unbalanced Optimal Transport through Non-negative Penalized Linear Regression. In Neural Information Processing Systems (NeurIPS).

Chen, S.; Lake, B. B.; and Zhang, K. 2019. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nature Biotechnology, 37(12): 1452–1457.

Chizat, L. 2017. Unbalanced Optimal Transport: Models, Numerical Methods, Applications. Ph.D. thesis, PSL Research University.

Chizat, L.; Peyré, G.; Schmitzer, B.; and Vialard, F.-X. 2018a. Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation, 87: 2563–2609.

Chizat, L.; Peyré, G.; Schmitzer, B.; and Vialard, F.-X. 2018b. Unbalanced optimal transport: Dynamic and Kantorovich formulations. Journal of Functional Analysis, 274(11): 3090–3123.

Chowdhury, S.; Needham, T.; Semrad, E.; Wang, B.; and Zhou, Y. 2021.
Hypergraph Co-Optimal Transport: Metric and Categorical Properties. arXiv preprint arXiv:1810.09646.

Courty, N.; Flamary, R.; Tuia, D.; and Rakotomamonjy, A. 2016. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39: 1853–1865.

Csiszár, I. 1963. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl., 8: 85–108.

Demetci, P.; Santorella, R.; Sandstede, B.; Noble, W. S.; and Singh, R. 2022. SCOT: Single-Cell Multi-Omics Alignment with Optimal Transport. Journal of Computational Biology, 29(1): 3–18.

Demetci, P.; Santorella, R.; Sandstede, B.; and Singh, R. 2021. Unsupervised integration of single-cell multi-omics datasets with disparities in cell-type representation. bioRxiv.

Fatras, K.; Séjourné, T.; Courty, N.; and Flamary, R. 2021. Unbalanced minibatch Optimal Transport; applications to Domain Adaptation. International Conference on Machine Learning.

Frogner, C.; Zhang, C.; Mobahi, H.; Araya, M.; and Poggio, T. A. 2015. Learning with a Wasserstein loss. Advances in Neural Information Processing Systems, 2053–2061.

Gromov, M. 1981. Groups of polynomial growth and expanding maps (with an appendix by Jacques Tits). Publications Mathématiques de l'IHÉS, 53: 53–78.

Gromov, M. 1999. Metric Structures for Riemannian and Non-Riemannian Spaces, volume 152 of Progress in Mathematics. Birkhäuser, Boston, US.

Hao, Y.; Hao, S.; Andersen-Nissen, E.; III, W. M. M.; Zheng, S.; Butler, A.; Lee, M. J.; Wilk, A. J.; Darby, C.; Zagar, M.; Hoffman, P.; Stoeckius, M.; Papalexi, E.; Mimitou, E. P.; Jain, J.; Srivastava, A.; Stuart, T.; Fleming, L. B.; Yeung, B.; Rogers, A. J.; McElrath, J. M.; Blish, C. A.; Gottardo, R.; Smibert, P.; and Satija, R. 2021. Integrated analysis of multimodal single-cell data. Cell.

Janati, H.; Bazeille, T.; Thirion, B.; Cuturi, M.; and Gramfort, A. 2019. Group level MEG/EEG source imaging via optimal transport: minimum Wasserstein estimates. In International Conference on Information Processing in Medical Imaging, 743–754. Springer.

Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. Proceedings of the 22nd ACM International Conference on Multimedia, 675–678.

Kalton, N. J.; and Ostrovskii, M. I. 1999. Distances between Banach spaces. Forum Mathematicum, 11(1): 17–48.

Kantorovich, L. 1942. On the transfer of masses (in Russian). Doklady Akademii Nauk, 37: 227–229.

Lee, J.; Bertrand, N. P.; and Rozell, C. J. 2020. Parallel Unbalanced Optimal Transport Regularization for Large Scale Imaging Problems. arXiv preprint arXiv:1909.00149.

Liero, M.; Mielke, A.; and Savaré, G. 2018. Optimal entropy-transport problems and a new Hellinger-Kantorovich distance between positive measures. Inventiones Mathematicae, 211: 969–1117.

Liu, J.; Huang, Y.; Singh, R.; Vert, J.-P.; and Noble, W. S. 2019. Jointly embedding multiple single-cell omics measurements. bioRxiv, 644310.

Ma, Z.; Wei, X.; Hong, X.; Lin, H.; Qiu, Y.; and Gong, Y. 2021. Learning to Count via Unbalanced Optimal Transport. Proceedings of the AAAI Conference on Artificial Intelligence, 35(3): 2319–2327.

Monge, G. 1781. Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences, 666–704.

Mukherjee, D.; Guha, A.; Solomon, J. M.; Sun, Y.; and Yurochkin, M. 2021. Outlier-Robust Optimal Transport.
In Meila, M.; and Zhang, T., eds., Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, 7850–7860. PMLR.

Mémoli, F. 2007. On the use of Gromov-Hausdorff Distances for Shape Comparison. In Eurographics Symposium on Point Based Graphics. The Eurographics Association.

Mémoli, F. 2011. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 1–71.

Peyré, G.; Cuturi, M.; and Solomon, J. 2016. Gromov-Wasserstein Averaging of Kernel and Distance Matrices. International Conference on Machine Learning, 48.

Pham, K.; Le, K.; Ho, N.; Pham, T.; and Bui, H. 2020. On Unbalanced Optimal Transport: An Analysis of Sinkhorn Algorithm. Proceedings of the 37th International Conference on Machine Learning, 119.

Ponti, N. D.; and Mondino, A. 2020. Entropy-Transport distances between unbalanced metric measure spaces. arXiv preprint arXiv:2009.10636.

Redko, I.; Courty, N.; Flamary, R.; and Tuia, D. 2019. Optimal Transport for Multi-source Domain Adaptation under Target Shift. Proceedings of Machine Learning Research, 89.

Redko, I.; Vayer, T.; Flamary, R.; and Courty, N. 2020. CO-Optimal Transport. Advances in Neural Information Processing Systems, 33.

Rolet, A.; Cuturi, M.; and Peyré, G. 2016. Fast Dictionary Learning with a Smoothed Wasserstein Loss. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 51: 630–638.

Saenko, K.; Kulis, B.; Fritz, M.; and Darrell, T. 2010. Adapting visual category models to new domains. Proceedings of the 11th European Conference on Computer Vision, 213–226.

Schiebinger, G.; Shu, J.; Tabaka, M.; Cleary, B.; Subramanian, V.; Solomon, A.; Gould, J.; Liu, S.; Lin, S.; Berube, P.; Lee, L.; Chen, J.; Brumbaugh, J.; Rigollet, P.; Hochedlinger, K.; Jaenisch, R.; Regev, A.; and Lander, E. S. 2019. Optimal Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming. Cell, 176(4): 928–943.

Solomon, J.; Peyré, G.; Kim, V. G.; and Sra, S. 2016. Entropic Metric Alignment for Correspondence Problems. ACM Transactions on Graphics, 35(4).

Solomon, J.; Rustamov, R.; Guibas, L.; and Butscher, A. 2014. Wasserstein Propagation for Semi-Supervised Learning. Proceedings of the 31st International Conference on Machine Learning, 32: 306–314.

Stoeckius, M.; Hafemeister, C.; Stephenson, W.; Houck-Loomis, B.; Chattopadhyay, P. K.; Swerdlow, H.; Satija, R.; and Smibert, P. 2017. Simultaneous epitope and transcriptome measurement in single cells. Nature Methods, 14(9): 865–868.

Sturm, K.-T. 2006. On the geometry of metric measure spaces. Acta Mathematica, 196: 65–131.

Sturm, K.-T. 2012. The space of spaces: curvature bounds and gradient flows on the space of metric measure spaces. arXiv preprint arXiv:1208.0434.

Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going Deeper with Convolutions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9.

Séjourné, T.; Vialard, F.-X.; and Peyré, G. 2021. The Unbalanced Gromov Wasserstein Distance: Conic Formulation and Relaxation. Advances in Neural Information Processing Systems, 34.

Vayer, T.; Chapel, L.; Flamary, R.; Tavenard, R.; and Courty, N. 2019. Optimal Transport for structured data with application on graphs. International Conference on Machine Learning, 97.

Vershynin, R. 2018. High-Dimensional Probability: An Introduction with Applications in Data Science.
Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.

Villani, C. 2003. Topics in Optimal Transportation, volume 58 of Graduate Studies in Mathematics. American Mathematical Society.

Xu, H.; Luo, D.; and Carin, L. 2019. Scalable Gromov-Wasserstein Learning for Graph Partitioning and Matching. Advances in Neural Information Processing Systems, 32.

Xu, H.; Luo, D.; Zha, H.; and Duke, L. C. 2019. Gromov-Wasserstein Learning for Graph Matching and Node Embedding. Proceedings of the 36th International Conference on Machine Learning, 6932–6941.

Yan, Y.; Li, W.; Wu, H.; Min, H.; Tan, M.; and Wu, Q. 2018. Semi-Supervised Optimal Transport for Heterogeneous Domain Adaptation. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, 2969–2975.

Yang, K. D.; and Uhler, C. 2019. Scalable Unbalanced Optimal Transport using Generative Adversarial Networks. 7th International Conference on Learning Representations.

Zhang, Z.; Mroueh, Y.; Goldfeld, Z.; and Sriperumbudur, B. K. 2021. Cycle Consistent Probability Divergences Across Different Spaces. arXiv preprint arXiv:2111.11328.