Published as a conference paper at ICLR 2023

UNSUPERVISED MANIFOLD ALIGNMENT WITH JOINT MULTIDIMENSIONAL SCALING

Dexiong Chen, Bowen Fan, Carlos Oliver & Karsten Borgwardt
Department of Biosystems Science and Engineering, ETH Zürich, Switzerland
SIB, Swiss Institute of Bioinformatics, Switzerland
{firstname.lastname}@bsse.ethz.ch
(Dexiong Chen and Bowen Fan contributed equally.)

ABSTRACT

We introduce Joint Multidimensional Scaling, a novel approach for unsupervised manifold alignment, which maps datasets from two different domains, without any known correspondences between data instances across the datasets, to a common low-dimensional Euclidean space. Our approach integrates Multidimensional Scaling (MDS) and Wasserstein Procrustes analysis into a joint optimization problem to simultaneously generate isometric embeddings of data and learn correspondences between instances from two different datasets, while only requiring intra-dataset pairwise dissimilarities as input. This unique characteristic makes our approach applicable to datasets without access to the input features, such as solving the inexact graph matching problem. We propose an alternating optimization scheme to solve the problem that can fully benefit from the optimization techniques for MDS and Wasserstein Procrustes. We demonstrate the effectiveness of our approach in several applications, including joint visualization of two datasets, unsupervised heterogeneous domain adaptation, graph matching, and protein structure alignment. The implementation of our work is available at https://github.com/BorgwardtLab/JointMDS.

1 INTRODUCTION

Many problems in machine learning require joint visual exploration and manipulation of multiple datasets from different (heterogeneous) domains, which is generally a preferable first step prior to any further data analysis. These different data domains may consist of measurements for the same samples obtained with different methods or technologies, such as single-cell multi-omics data in bioinformatics (Demetci et al., 2022; Liu et al., 2019; Cao & Gao, 2022). Alternatively, the data could be comprised of different datasets of similar objects, such as word spaces of different languages in natural language modeling (Alvarez-Melis et al., 2019; Grave et al., 2019), or graphs representing related objects, such as disease-procedure recommendation in biomedicine (Xu et al., 2019b).

There are two main challenges in the joint exploration of multiple datasets. First, the data from the heterogeneous domains may be high-dimensional, or may not come with input features at all but only with pairwise dissimilarities. Second, the correspondences between data instances across different domains may not be known a priori. We propose in this work to tackle both issues simultaneously while making few assumptions on the data modality.

To address the first challenge, many dimensionality reduction methods have been proposed over the last decades to provide lower-dimensional embeddings of data. Among them, multidimensional scaling (MDS) (Borg & Groenen, 2005) and its extensions (Tenenbaum et al., 2000; Chen & Buja, 2009) are widely used. They generate low-dimensional data embeddings while preserving the local and global information about the manifold structure of the data.
One of the key characteristics of MDS is that it only requires pairwise (dis)similarities as input rather than explicit data features, which makes it applicable to problems where input features are not available, such as graph node embedding learning (Gansner et al., 2004). However, when dealing with multiple datasets from different domains at the same time, the subspaces learned by MDS for the different datasets are not naturally aligned, so MDS is not directly applicable to a joint exploration.

One well-known method for aligning data instances from different spaces is Procrustes analysis. When used together with dimensionality reduction, it results in a manifold alignment method (Wang & Mahadevan, 2008; Kohli et al., 2021; Lin et al., 2021). However, these approaches require prior knowledge about the correspondences between data instances across the domains, which limits their applicability in many real-world settings where this information is hard or expensive to obtain. Unsupervised manifold alignment approaches (Wang & Mahadevan, 2009; Cui et al., 2014) have been proposed to overcome this limitation by aligning the underlying manifold structures of two datasets with unknown correspondences while projecting the data onto a common low-dimensional space.

In this work, we propose to combine MDS with the idea of unsupervised manifold alignment to simultaneously embed data instances from two domains without known correspondences into a common low-dimensional space, while only requiring intra-dataset dissimilarities. We formulate this as a joint optimization problem, where we integrate the stress functions for each dataset that measure the distance deviations, and adopt the idea of Wasserstein Procrustes analysis (Alvarez-Melis et al., 2019) to align the embedded data instances from the two datasets in a fully unsupervised manner. We solve the resulting optimization problem through an alternating optimization strategy, yielding an algorithm that can benefit from the optimization techniques for solving each individual sub-problem. Our approach, named Joint MDS, recovers the correspondences between instances across domains while also producing aligned low-dimensional embeddings for data from both domains; this is its main advantage over Gromov-Wasserstein (GW) optimal transport (Mémoli, 2011; Yan et al., 2018), which only finds correspondences. We show the effectiveness of Joint MDS in several machine learning applications, including joint visualization of two datasets, unsupervised heterogeneous domain adaptation, graph matching, and protein structure alignment.

2 RELATED WORK

We present here the work most related to ours, namely MDS, unsupervised manifold alignment, and optimal transport (OT) for correspondence finding.

Multidimensional scaling and extensions  MDS is one of the most commonly used dimensionality reduction methods that only require pairwise (dis)similarities between data instances as input. Classical MDS (Torgerson, 1965) was introduced under the assumption that the dissimilarity is a Euclidean distance, in which case there is an analytic solution via SVD. As an extension of classical MDS, metric MDS learns low-dimensional embeddings that preserve an arbitrary dissimilarity by minimizing a stress function.
Several extensions of MDS have also been proposed for various practical reasons, such as non-metric MDS (Agarwal et al., 2007), Isomap (Tenenbaum et al., 2000), and local MDS (Chen & Buja, 2009). MDS has also been used for graph drawing (Gansner et al., 2004) by producing node embeddings from shortest path distances on the graph. Our approach can be seen as an important extension of MDS to multiple datasets.

Unsupervised manifold alignment  While (semi-)supervised manifold alignment methods (Ham et al., 2005; Wang & Mahadevan, 2008; Shon et al., 2005) require at least partial information about the correspondence across domains, unsupervised manifold alignment learns the correspondence directly from the underlying structures of the data. One of the earliest works on unsupervised manifold alignment was presented in (Wang & Mahadevan, 2009), where a similarity metric based on permutations of the local geometry was used to find cross-domain corresponding instances, followed by a non-linear dimensionality reduction. A similar approach was adopted in (Tuia et al., 2014) with a graph-based similarity metric. A more general framework, named GUMA (Cui et al., 2014), was proposed as an optimization problem with three complex terms to project data instances via a linear transformation and match them simultaneously. As an extension of (Cui et al., 2014), UnionCom (Cao et al., 2020) introduced geodesic distance matching instead of kernel matrices, to deal with multi-modal data in bioinformatics in particular. Additionally, generative adversarial networks and the maximum mean discrepancy have also been used to find correspondences jointly with dimensionality reduction for unsupervised manifold alignment (Amodio & Krishnaswamy, 2018; Liu et al., 2019). Our approach differs from these previous methods in that it makes few assumptions on the data modality and only requires intra-dataset dissimilarities as input.

Figure 1: Joint Multidimensional Scaling maps data from two domains X and X' to a common low-dimensional space R^d while preserving the pairwise intra-domain dissimilarities.

Optimal transport for correspondence finding  OT (Peyré et al., 2019) is a powerful and flexible approach to compare two distributions, with a strong theoretical foundation. It can find a soft correspondence mapping between two sets of samples without any supervision. With the recent advances in computational algorithms for OT (Cuturi, 2013), it has become increasingly popular in various machine learning fields. However, one major drawback of OT is that it was originally designed to compare two distributions in the same metric space: the cost function cannot be properly defined between samples in different spaces. This limitation makes it unsuitable for tasks involving data from different domains. As an important extension, the Gromov-Wasserstein (GW) distance (Mémoli, 2011) overcomes this limitation by comparing two pairwise intra-domain (dis)similarity matrices directly. This modification makes the GW distance invariant to many geometric transformations of the data, such as rotations or translations. GW has thus been successfully applied to several tasks involving datasets from heterogeneous domains, such as heterogeneous domain adaptation (HDA) (Yan et al., 2018; Titouan et al., 2020), graph matching and partitioning (Xu et al., 2019b; Vayer et al., 2019), and integration of single-cell multi-omics data (Demetci et al., 2022).
As an alternative to GW, OT methods with global invariances have also recently been proposed for unsupervised word translation tasks in NLP (Alvarez-Melis et al., 2019; Grave et al., 2019), and for more general purposes (Jin et al., 2021). In addition to discovering correspondences, our approach also produces low-dimensional embeddings of the data from both domains, which can be useful for visualization and further analysis.

3 BACKGROUND

In this section, we revisit the metric MDS problem and Wasserstein Procrustes, the two main building blocks of our method.

Notation  We denote by $D \in \mathbb{R}^{n \times n}$ the pairwise dissimilarities of a dataset of $n$ elements (in an unknown space $\mathcal{X}$) and by $d : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$ a distance function, typically the Euclidean distance. For two sets of data points $Z \in \mathbb{R}^{n \times d}$ and $Z' \in \mathbb{R}^{n' \times d}$ of respectively $n$ and $n'$ elements in $\mathbb{R}^d$, we denote by $d(Z, Z')$ the pairwise distance matrix in $\mathbb{R}^{n \times n'}$ whose $(i, j)$-th entry equals $d(z_i, z'_j)$. We denote by $\|\cdot\|_F$ the Frobenius norm of a matrix and by $\langle \cdot, \cdot \rangle_F$ the associated inner product. We denote by $\Delta_n$ the probability simplex with $n$ bins, $\Delta_n := \{a \in \mathbb{R}^n_+ \mid \sum_{j=1}^n a_j = 1\}$.

Metric multidimensional scaling  Given a dissimilarity matrix $D \in \mathbb{R}^{n \times n}$ with $(i, j)$-th entry denoted $d_{ij}$, representing a set of $n$ samples in an unknown space $\mathcal{X}$, MDS consists of finding $n$ coordinates $Z := (z_1, \ldots, z_n) \in \mathbb{R}^{n \times d}$ that best preserve the pairwise dissimilarities by solving the following weighted stress minimization problem:

$$\min_{Z \in \mathbb{R}^{n \times d}} \ \mathrm{stress}(Z, D, W) := \sum_{i,j=1}^{n} w_{ij} \, (d_{ij} - d(z_i, z_j))^2, \qquad (1)$$

for a given weight matrix $W := (w_{ij})_{i,j=1,\ldots,n} \in \mathbb{R}^{n \times n}_+$. Despite being a non-convex problem, it can be efficiently solved by an iterative majorization algorithm named SMACOF (scaling by majorizing a complicated function) (Borg & Groenen, 2005). This algorithm could be used individually to map two datasets to two low-dimensional Euclidean spaces. However, neither the alignment of the two subspaces nor the correspondences between data instances are known a priori, so it is meaningless to work jointly with the two embeddings directly.
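For concreteness, metric MDS on a precomputed dissimilarity matrix can be run with the SMACOF-based implementation in scikit-learn; the following is a minimal sketch (the random data and parameter values are purely illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.RandomState(0)
X = rng.randn(100, 50)                       # 100 samples living in some feature space
D = squareform(pdist(X))                     # intra-dataset pairwise dissimilarities

# Metric MDS: find 2-D coordinates whose Euclidean distances best match D.
# scikit-learn's MDS minimizes the stress of Eq. (1) with SMACOF.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Z = mds.fit_transform(D)
print(Z.shape, mds.stress_)                  # (100, 2) and the final stress value
```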
Naive application of OT to find correspondences between the embedded data obtained with two MDSs for two datasets can easily fail since the two spaces of Z and Z are not coherent, or in other words, there is no meaningful notion of distance between them. In order to jointly align the two spaces and find the correspondences, Alvarez-Melis et al. (2019) propose to jointly optimize a linear transformation f from a pre-defined invariance class F: min f F W 2 2 (f(Z), Z ) = min P Π(a,b) min f F P, d2(f(Z), Z ) F . (4) They provide solutions of this problem for invariances defined by linear operators with a bounded norm F := {O Rd d | O p kp}, where p is the Schatten ℓp-norm. In particular, when p = and kp = 1, we have F = Od, the orthogonal group of matrices in Rd d, and one recovers the Wasserstein Procrustes problem (Grave et al., 2019). 4 JOINT MULTIDIMENSIONAL SCALING PROBLEM We introduce our new joint MDS problem which can jointly embed two datasets in a common space. 4.1 CORRESPONDENCE DISCOVERY WITH JOINT MULTIDIMENSIONAL SCALING The Joint MDS problem consists of jointly mapping data points from two different domains into a common Euclidean space. Following the discussion in Section 3, the two spaces obtained with two individual MDSs for each dataset are not necessarily aligned and the correspondences between data instances are unknown. To address both issues, we leverage Wasserstein Procrustes to algin the embedded data from two domains. Specifically, we combine the stress minimization and the Wasserstein Procrustes to define the overall objective function for our Joint MDS problem. Given two matrices D Rn n and D Rn n representing the pairwise dissimilarities of two collections, we formulate the Joint MDS as the following optimization problem: min Z Rn d,Z Rn d P Π(a,b),O Od stress(Z, D, W) + stress(Z , D , W ) + 2λ P, d2(ZO, Z ) F , (5) where W and W are weights for the pairwise dissimilarities generally equal to 1/n2 and 1/n 2, and a and b refer to the sample weights as defined above. This objective function exhibits a simple interpretation: the two stress function terms measure the distance deviation for each domain while the last term measures the correspondences between instances from the two domains. It is worth noting that the stress functions are rotation-invariant, i.e. if Z and Z are solutions respectively minimizing the two stress functions, their rotations are also solutions. Thus, using the Wasserstein Procrustes with global geometric invariances (in Od) instead of the original OT could find better correspondences. An illustration of Joint MDS is shown in Figure 1. 4.2 CHOICE OF THE DISSIMILARITIES In general, any distance could be used to compute the dissimilarities if they are not provided directly with the dataset. For instance, the Euclidean distance is a natural choice. As an extension of the Published as a conference paper at ICLR 2023 Euclidean distance in Riemannian geometry, the geodesic distance (Tenenbaum et al., 2000) often captures better the local geometric structure of the data distribution. However, the geodesic distance is generally hard to compute exactly as it requires full knowledge about the data manifold. Fortunately, it can be efficiently estimated approximately by constructing a k-nearest neighbor graph and computing the shortest path distance on the graph. 
4.3 OPTIMIZATION

The above optimization problem is non-convex and hard to solve in general. We propose an alternating optimization scheme that alternates between two subproblems that have each been studied previously. Specifically, we show that when $P$ and $O$ are fixed, the Joint MDS problem (5) becomes a simple weighted MDS problem. Conversely, when $Z$ and $Z'$ are fixed, problem (5) reduces to a Wasserstein Procrustes problem. As a consequence, we can leverage the optimization techniques developed for each sub-problem.

Weighted MDS problem  We first observe that the distance is invariant to orthogonal transformations from $\mathcal{O}_d$. Thus, for fixed $P$ and $O$, the problem amounts to minimizing the following stress function (with the change of variable $Z \leftarrow ZO$):

$$\min_{Z \in \mathbb{R}^{n \times d}, \ Z' \in \mathbb{R}^{n' \times d}} \ \sum_{i,j=1}^{n} w_{ij}(d_{ij} - d(z_i, z_j))^2 + \sum_{i',j'=1}^{n'} w'_{i'j'}(d'_{i'j'} - d(z'_{i'}, z'_{j'}))^2 + 2\lambda \langle P, d^2(Z, Z') \rangle_F, \qquad (6)$$

where the last term can be rewritten as

$$2\lambda \langle P, d^2(Z, Z') \rangle_F = \sum_{i=1}^{n} \sum_{j'=1}^{n'} \lambda P_{ij'} (0 - d(z_i, z'_{j'}))^2 + \sum_{i'=1}^{n'} \sum_{j=1}^{n} \lambda P^{\top}_{i'j} (0 - d(z'_{i'}, z_j))^2.$$

This is equivalent to solving an MDS problem with stress function $\mathrm{stress}(\tilde{Z}, \tilde{D}, \tilde{W})$, where

$$\tilde{Z} := \begin{pmatrix} Z \\ Z' \end{pmatrix}, \quad \tilde{D} := \begin{pmatrix} D & 0 \\ 0 & D' \end{pmatrix}, \quad \tilde{W} := \begin{pmatrix} W & \lambda P \\ \lambda P^{\top} & W' \end{pmatrix}. \qquad (7)$$

This problem can be solved with the SMACOF algorithm, which relies only on majorization theory without making any assumption on whether $\tilde{D}$ has a distance structure. The majorization principle essentially consists in minimizing a more manageable surrogate function $g(X, Z)$ rather than directly minimizing the original complicated function $h(X) := \mathrm{stress}(X, \tilde{D}, \tilde{W})$. This surrogate function $g(X, Z)$ is required to (i) be a majorizing function of $h(X)$, i.e., $h(X) \leq g(X, Z)$ for any $X \in \mathbb{R}^{n \times d}$, and (ii) touch the surface at the so-called supporting point $Z$, i.e., $h(Z) = g(Z, Z)$. Now, let the minimum of $g(X, Z)$ over $X$ be attained at $X^{\star}$. The above assumptions imply the following chain of inequalities:

$$h(X^{\star}) \leq g(X^{\star}, Z) \leq g(Z, Z) = h(Z),$$

called the sandwich inequality. By substituting $Z$ with $X^{\star}$ and iterating this process, we obtain a sequence $(Z^0, \ldots, Z^t)$ that yields a non-increasing sequence of function values. Borg & Groenen (2005) showed that there is indeed such a majorizing function $g$ for the stress function, which is detailed in Section A of the Appendix. Through the lens of majorization, De Leeuw (1988) showed that this sequence converges to a local minimum of the stress minimization problem at a linear convergence rate. Local minima are more likely to occur in low-dimensional solutions and less likely in high dimensions (De Leeuw & Mair, 2009; Groenen & Heiser, 1996). The SMACOF algorithm based on stress majorization is summarized in Section A of the Appendix.

Wasserstein Procrustes problem  For fixed embeddings $Z$ and $Z'$, the first two terms are independent of $P$ and $O$, so this step consists in solving a Wasserstein Procrustes problem:

$$\min_{O \in \mathcal{O}_d, \ P \in \Pi(a,b)} \ \langle P, d^2(ZO, Z') \rangle_F - \varepsilon H(P), \qquad (8)$$

where we introduce an entropic regularization term that makes the objective strongly convex in $P$ and thus easier to solve. The resulting problem can be solved by alternating optimization over $P$ and $O$, as studied in (Alvarez-Melis et al., 2019). Specifically, when fixing $O$, the above problem is a classic discrete OT problem with entropic regularization that can be efficiently solved with, e.g., Sinkhorn's algorithm; more details about Sinkhorn's algorithm can be found in the Appendix. When fixing $P$, the problem is equivalent to a classic orthogonal Procrustes problem,

$$\max_{O \in \mathcal{O}_d} \ \langle O, Z^{\top} P Z' \rangle_F,$$

whose solution is given by the singular value decomposition of $Z^{\top} P Z'$. Solving the Wasserstein Procrustes problem provides correspondences between two sets of embeddings of the same dimension even when their spaces are not aligned, which is exactly what is needed here.
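A minimal sketch of one such alternation, using the Sinkhorn solver from the POT library (the helper name and parameter values are illustrative, not part of the released implementation):

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def wasserstein_procrustes_step(Z, Zp, eps=0.05):
    """One alternation of Eq. (8): Sinkhorn for P with O fixed, then SVD for O with P fixed."""
    n, m = Z.shape[0], Zp.shape[0]
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform sample weights
    # P-step: entropic OT between the current embeddings
    C = ot.dist(Z, Zp)                               # squared Euclidean cost matrix
    P = ot.sinkhorn(a, b, C, reg=eps)
    # O-step: orthogonal Procrustes max_O <O, Z^T P Z'>_F, solved by the SVD of Z^T P Z'
    U, _, Vt = np.linalg.svd(Z.T @ P @ Zp)
    return P, U @ Vt
```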
By alternating between these two steps, we obtain our final algorithm for solving Joint MDS.

Overall algorithm  Z and Z' can simply be initialized with two individual SMACOF runs on D and D'. However, for some hard matching problems, this may not provide a good initialization of the Wasserstein Procrustes problem. In this case, GW can be used as a convex relaxation of the Wasserstein Procrustes problem (Alvarez-Melis et al., 2019; Grave et al., 2019), which provides a better initialization for the coupling matrix P. In addition, similar to the observations in (Alvarez-Melis et al., 2019), we empirically find that annealing the entropic regularization parameter ε in Wasserstein Procrustes leads to better convergence. When using the solution of GW as initialization, we also find that annealing λ is useful. The full algorithm is summarized in Algorithm 1.

Algorithm 1  Joint Multidimensional Scaling
1: Input: distances D ∈ R^{n×n} and D' ∈ R^{n'×n'}, weights W ∈ R^{n×n} and W' ∈ R^{n'×n'}, matching penalty λ, entropic regularization ε, max iterations T.
2: Output: low-dimensional embeddings Z ∈ R^{n×d}, Z' ∈ R^{n'×d}, optimal coupling P ∈ R^{n×n'}.
3: Set Z = MDS(D, W) and Z' = MDS(D', W') with SMACOF using random initialization.
4: Set D̃ in Eq. (7) with D and D'.
5: for t = 1, ..., T do
6:   Update P and O by solving the Wasserstein Procrustes problem in Eq. (8) between Z and Z' using ε.
7:   Update Z = ZO and Z̃ in Eq. (7).
8:   Update W̃ in Eq. (7) using P, λ, W and W'.
9:   Update Z, Z' = MDS(D̃, W̃) using SMACOF with Z̃ as initialization.
10: end for

Complexity analysis  Let N = n + n' denote the total number of data points across both domains. The complexity of a SMACOF iteration is that of a Guttman transform, which is O(N²) (De Leeuw, 1988). On the other hand, the complexity of one iteration of the alternating optimization method for the Wasserstein Procrustes problem is O(d³ + N²) (Alvarez-Melis et al., 2019), where the first term corresponds to the SVD and the second to Sinkhorn's algorithm with a fixed number of iterations. While linear convergence of SMACOF was shown in (De Leeuw, 1988), we empirically observe that a fixed number of iterations is sufficient for the convergence of SMACOF, as well as for solving the Wasserstein Procrustes problem, at each iteration of Algorithm 1. Therefore, for a total of T iterations, the complexity of the full algorithm is O(T(d³ + 2N²)). Despite the quadratic complexity, our algorithm can benefit from GPU acceleration, as most operations in SMACOF and Sinkhorn's algorithm are matrix multiplications.
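Putting the pieces together, the outer loop of Algorithm 1 can be sketched as follows, reusing the wasserstein_procrustes_step helper above and assuming a weighted SMACOF solver mds_smacof (one possible implementation is sketched in Appendix A.1). This is only a schematic of the loop: the GW initialization and the annealing of ε and λ discussed above are omitted.

```python
import numpy as np

def joint_mds(D, Dp, d=2, lam=0.1, eps=0.05, T=20):
    """Schematic of Algorithm 1 (line numbers refer to the pseudo-code above)."""
    n, m = D.shape[0], Dp.shape[0]
    W = np.full((n, n), 1.0 / n**2)              # default weights W and W'
    Wp = np.full((m, m), 1.0 / m**2)
    Z = mds_smacof(D, W, d)                      # line 3: two independent SMACOF runs
    Zp = mds_smacof(Dp, Wp, d)
    D_tilde = np.block([[D, np.zeros((n, m))],   # line 4: block matrix of Eq. (7)
                        [np.zeros((m, n)), Dp]])
    for _ in range(T):                           # lines 5-10
        P, O = wasserstein_procrustes_step(Z, Zp, eps)   # line 6
        Z = Z @ O                                # line 7: rotate Z into the frame of Z'
        W_tilde = np.block([[W, lam * P],        # line 8: weight matrix of Eq. (7)
                            [lam * P.T, Wp]])
        Z_tilde = mds_smacof(D_tilde, W_tilde, d,
                             init=np.vstack([Z, Zp]))    # line 9: joint weighted MDS
        Z, Zp = Z_tilde[:n], Z_tilde[n:]
    return Z, Zp, P
```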
5 APPLICATIONS

5.1 JOINT VISUAL EXPLORATION OF TWO DATASETS

We show that Joint MDS can be used to jointly visualize two related datasets by simultaneously mapping them to a common 2D space while preserving the original manifold structures.

Experimental setup  We consider 3 (pairs of) synthetic datasets and 3 (pairs of) real-world datasets. The synthetic datasets respectively consist of a bifurcation, a Swiss roll, and a circular frustum from (Liu et al., 2019). Each synthetic dataset has 300 samples that have been linearly projected onto 1000- and 2000-dimensional feature spaces, respectively, with added white noise (Liu et al., 2019). The 3 real-world datasets consist of 2 sets of single-cell multi-omics data (SNAREseq and scGEM) (Demetci et al., 2022) and a subsampled digit dataset (MNIST-USPS (Deng, 2012; Hull, 1994)). We compute the pairwise geodesic distances of each dataset as described in Section 4.2. We fix the number of components d to 2 to visualize the datasets in R², and the matching penalization parameter λ to 0.1 for all datasets. More details about the datasets and parameters, as well as additional results, can be found in Section B.2 of the Appendix.

Figure 2: Joint visual exploration of two datasets (each panel plots Joint MDS Component 1 against Joint MDS Component 2). Top row: 3 synthetic datasets (bifurcation, Swiss roll, and circular frustum). Bottom row: real-world datasets (SNAREseq, scGEM and MNIST-USPS). Different colors represent datasets and different markers represent different classes, except for MNIST-USPS, in which different colors represent different classes of digits from 0 to 9.

Results  Figure 2 shows the results for all considered datasets. Joint MDS successfully embeds each pair of datasets into a common 2D space while preserving the original manifold structures, such as the proximity of similar classes. In contrast, some recently proposed methods such as (Demetci et al., 2022) only map one dataset onto the space of the other through a barycentric projection from GW, which can lose information about the manifold structure of the projected dataset. Since we also have access to the ground-truth 1-to-1 correspondences for the synthetic and single-cell datasets, we can quantitatively assess the alignment learned by Joint MDS using the fraction of samples closer than the true match (FOSCTTM) metric (Liu et al., 2019). For a given sample from one dataset, we compute the fraction of samples from the other dataset that are closer to it than its true match, and we average these fractions over all samples in both datasets. A lower average FOSCTTM generally indicates a better alignment; FOSCTTM is zero for a perfect alignment, since every sample is then closest to its true match. Table 1 reports FOSCTTM compared to several baseline methods specifically proposed for aligning single-cell data. Joint MDS can also be used to align and visualize two sets of point clouds, with a specific application to human pose alignment; we provide more details and results in Section B.2.5 of the Appendix due to limited space.
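For reference, the FOSCTTM metric described above can be computed in a few lines; a sketch assuming that samples with equal indices are true matches:

```python
import numpy as np
from scipy.spatial.distance import cdist

def foscttm(Z, Zp):
    """Average fraction of samples closer than the true match, over both domains.

    Assumes Z[i] and Zp[i] are true matches (1-to-1 correspondence)."""
    D = cdist(Z, Zp)                              # cross-domain distances in the common space
    d_true = np.diag(D)
    frac_x = (D < d_true[:, None]).mean(axis=1)   # for each sample of the first domain
    frac_y = (D < d_true[None, :]).mean(axis=0)   # for each sample of the second domain
    return 0.5 * (frac_x.mean() + frac_y.mean())
```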
Table 1: Average FOSCTTM comparison (lower is better).

| Method | Bifurcation | Swiss roll | Circular frustum | SNAREseq | scGEM |
|---|---|---|---|---|---|
| MMD-MA (Liu et al., 2019) | 12.44 | 3.27 | 1.25 | 15.00 | 20.14 |
| UnionCom (Cao et al., 2020) | 8.30 | 1.57 | 15.20 | 26.50 | 20.96 |
| SCOT (Demetci et al., 2022) | 8.66 | 2.16 | 0.88 | 15.00 | 19.80 |
| Joint MDS (d=2) | 11.15 | 0.96 | 0.90 | 17.18 | 20.42 |
| Joint MDS (d=16) | 7.56 | 0.58 | 0.87 | 15.59 | 18.54 |

5.2 UNSUPERVISED HETEROGENEOUS DOMAIN ADAPTATION

We use the same datasets as in Section 5.1. To solve the HDA problem, we first solve the Joint MDS problem between the two pairwise geodesic distance matrices, which provides embeddings for instances from both domains in the same space. Then, we train a k-NN classifier (k fixed to 5) on the source domain and predict the labels on the target domain without any further adaptation. For the more complex MNIST-USPS dataset, we instead use a linear SVM classifier with the regularization parameter set to 1, which results in better prediction accuracy. We follow the same parameter selection as in (Demetci et al., 2022) and compare our method to state-of-the-art unsupervised HDA methods, including SCOT (Demetci et al., 2022) and EGW (Yan et al., 2018), both of which are based on entropic regularized Gromov-Wasserstein. The classification accuracies are shown in Table 2. While all methods work well on the easier datasets, Joint MDS outperforms the GW-based baselines on some harder tasks such as MNIST-USPS.

Table 2: Classification accuracy for unsupervised domain adaptation.

| Method | Bifurcation | Swiss roll | Circular frustum | SNAREseq | scGEM | MNIST-USPS |
|---|---|---|---|---|---|---|
| SCOT (Demetci et al., 2022) | 93.7 | 97.7 | 95.7 | 98.2 | 57.6 | 26.7 |
| EGW (Yan et al., 2018) | 95.7 | 99.3 | 94.7 | 93.8 | 62.7 | 43.1 |
| Joint MDS (d=2) | 96.0 | 99.3 | 94.0 | 85.5 | 64.4 | 15.0 |
| Joint MDS (d=16) | 96.7 | 99.3 | 94.7 | 94.7 | 72.9 | 60.2 |

5.3 GRAPH MATCHING

Data  We apply the proposed method to two real-world datasets for the graph matching task, both used in (Xu et al., 2019b;a). The first dataset was collected from MIMIC-III (Johnson et al., 2016), one of the most commonly used critical care databases in healthcare-related studies. It contains more than 50,000 hospital admissions, each represented as a sequence of ICD (International Classification of Diseases) codes for different diseases and procedures. The interactions in the disease (resp. procedure) graph are constructed from the diseases (resp. procedures) appearing in the same admission. The final two constructed graphs involve 11,086 admissions, one with 56 disease nodes and the other with 25 procedure nodes. Procedure recommendation can then be formulated as the problem of matching these two graphs. The second dataset contains the protein-protein interaction (PPI) networks of yeast. The original PPI network of yeast has 1,004 proteins together with 4,920 high-confidence interactions. The graph to match is its noisy version, built by adding q% lower-confidence interactions with q ∈ {5, 15, 25}.

Experimental setup  To solve the graph matching problem, we use the shortest path distances on the graph as the input of our Joint MDS, similarly to graph drawing methods (Gansner et al., 2004; Kamada et al., 1989), and use the coupling matrix P as the matching prediction (see the sketch below).
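A sketch of this input construction (the random graphs below are placeholders and are assumed connected):

```python
import numpy as np
import networkx as nx

def graph_distance_matrix(G):
    """All-pairs shortest-path distance matrix of a (connected) graph."""
    nodes = list(G.nodes())
    lengths = dict(nx.all_pairs_shortest_path_length(G))
    D = np.zeros((len(nodes), len(nodes)))
    for i, u in enumerate(nodes):
        for j, v in enumerate(nodes):
            D[i, j] = lengths[u][v]
    return D

# D and D' then serve as the intra-dataset dissimilarities of Joint MDS,
# and the returned coupling P is the (soft) matching prediction.
G1 = nx.erdos_renyi_graph(50, 0.2, seed=0)
G2 = nx.erdos_renyi_graph(50, 0.2, seed=1)
D1, D2 = graph_distance_matrix(G1), graph_distance_matrix(G2)
```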
We compare our method with recent graph matching methods, including MAGNA++ (Vijayan et al., 2015), HubAlign (Hashemifar & Xu, 2014), KerGM (Zhang et al., 2019) and GWL (Xu et al., 2019b). In particular, GWL is a powerful graph matching algorithm based on the Gromov-Wasserstein distance that outperforms many classic methods (Xu et al., 2019a; Liu et al., 2021). Note that Joint MDS differs from GWL in three respects: i) Joint MDS relies on SMACOF, which has convergence guarantees (De Leeuw, 1988), while GWL relies on SGD, which does not necessarily converge for non-convex problems; ii) Joint MDS generates isometric embeddings, which can be more beneficial for tasks requiring distance preservation, such as protein structure alignment; iii) it uses OT instead of GW in the low-dimensional embedding space to learn the correspondences. For the baseline methods, we use the authors' implementations with the recommended hyperparameters. For our method, we follow the same hyperparameter selection procedure and report the average accuracy over 5 different runs. We do not run MAGNA++ and HubAlign on the MIMIC-III matching task, since they can only produce one-to-one node pairs. For the disease-procedure graph matching task, we use 75% of the admissions for training and 25% for testing. We use the training data to build the disease graph and the procedure graph respectively, and use the test data to build the correspondences between diseases and procedures as the ground-truth recommendations. For each node in the disease graph, we compute the top-3 and top-5 recommended procedure nodes based on the coupling matrix P, and count a correct match if the true closest procedure node appears in the predicted top-k recommendation list. For the PPI network matching task, we compute the node correctness (NC) as the evaluation metric, $\mathrm{NC} := \sum_{i=1}^{n} \sum_{j=1}^{n} P_{ij} \, \mathbb{1}[T_{ij} = 1]$, where $P$ and $T$ respectively denote the predicted matching probability matrix and the true matching matrix.

Results  Table 3 presents the results of the two graph matching tasks. Our method achieves performance comparable to GWL, and both methods outperform the other classic baselines. More results and experimental details are available in Section B.4 of the Appendix.

Table 3: Graph matching performances on PPI networks and MIMIC-III disease-procedure graphs.

| Method | PPI 5% | PPI 15% | PPI 25% | MIMIC top 3 | MIMIC top 5 |
|---|---|---|---|---|---|
| MAGNA++ | 50.00 | 35.16 | 12.85 | - | - |
| HubAlign | 46.06 | 32.47 | 27.39 | - | - |
| KerGM | 66.14 | 39.04 | 32.17 | 22.67 | 47.86 |
| GWL | 84.31 | 74.35 | 67.42 | 27.98 | 42.14 |
| Joint MDS | 86.44 ± 0.33 | 72.31 ± 0.62 | 55.3 ± 0.78 | 30.24 ± 1.66 | 46.28 ± 1.51 |
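Both evaluation metrics are straightforward to compute from the coupling matrix; a sketch (the helper names are illustrative):

```python
import numpy as np

def node_correctness(P, T):
    """NC: total coupling mass placed on ground-truth pairs (T is a binary matrix)."""
    return float((P * (T == 1)).sum())

def topk_hit_rate(P, true_idx, k=3):
    """Fraction of source nodes whose true match is among the top-k entries of P."""
    topk = np.argsort(-P, axis=1)[:, :k]
    return np.mean([true_idx[i] in topk[i] for i in range(P.shape[0])])
```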
5.4 PROTEIN STRUCTURE ALIGNMENT

Protein 3D structure alignment is an important sub-task in protein structure reconstruction (Wang & Mahadevan, 2009; Badaczewska-Dawid et al., 2020) and protein model quality assessment (Pereira et al., 2021). The task consists of structurally aligning two moderately accurate protein models estimated by different methods, either experimental or computational, and eventually integrating them to generate a more accurate, high-resolution 3D model of the protein fold. Here, we consider the more challenging setting in which the correspondences across protein models are not known. In contrast to previous work (Wang & Mahadevan, 2009; Cui et al., 2014), where only a few examples are illustrated, we use a larger and more realistic dataset and provide a quantitative way to evaluate unsupervised manifold alignment methods on this task. Specifically, we consider the first domain of all proteins from CASP14 (Pereira et al., 2021) from T1024 to T1113, and use the protein models predicted by the two top-placed methods on the leaderboard, namely AlphaFold2 and Baker. This results in a dataset of 58 pairs of protein models, with an average of 198 residues. More details about the dataset can be found in Section B.5 of the Appendix.

We compare our method to a state-of-the-art unsupervised manifold alignment method, namely GUMA (Cui et al., 2014). The parameters are fixed across the dataset, and the performance is measured as the average over 3 different runs of the root-mean-square deviation of residue positions and distances (RMSD-D), given by

$$\mathrm{RMSD\text{-}D} := \sqrt{\frac{1}{n^2} \|D - d(Z, Z)\|_F^2} + \sqrt{\frac{1}{n'^2} \|D' - d(Z', Z')\|_F^2} + \sqrt{\frac{1}{n} \sum_{i,j=1}^{n} P^{\star}_{ij} \|z_i - z'_j\|^2},$$

where $P^{\star}$ denotes the true permutation matrix with 0 and 1 entries. We compute embeddings in the original 3D space and in the 2D space respectively, and show the results in Figure 3. We also measure the significance of the difference with the p-value of a Wilcoxon signed-rank test. While GUMA hardly works on this challenging task, with only a small number of low RMSD-D values, our Joint MDS manages to produce good alignments for a large number of proteins.

Figure 3: Protein structure alignment results. Left: average RMSD-D over 58 pairs of protein models (GUMA-2D: 28.89, GUMA: 26.04, Joint MDS-2D: 18.04 ± 0.10, Joint MDS: 13.11 ± 0.02). Right: per-protein RMSD-D comparison of GUMA (Cui et al., 2014) and Joint MDS in the two settings (p-values 4.12e-08 and 5.36e-08), with the significance of the difference measured by a Wilcoxon signed-rank test.

6 CONCLUSION

We introduced Joint MDS, a flexible and powerful analysis tool for the joint exploration of two datasets from different domains without known correspondences between instances across the datasets. While it only requires pairwise intra-dataset dissimilarities as input, we showcased its applicability and effectiveness in several applications. We see the main limitation of our approach in its quadratic complexity in the number of samples, which makes it hard to scale to very large datasets. Investigating faster stochastic and online optimization methods for solving either MDS or Wasserstein Procrustes (Rajawat & Kumar, 2017; Pai et al., 2019; Shamai et al., 2015; Boyarski et al., 2017) would be an interesting research direction.

ACKNOWLEDGEMENT

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 813533 (K.B.). The authors would also like to thank Leslie O'Bray for her insightful feedback on the manuscript, which greatly improved it.

REFERENCES

Sameer Agarwal, Josh Wills, Lawrence Cayton, Gert Lanckriet, David Kriegman, and Serge Belongie. Generalized non-metric multidimensional scaling. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2007.

David Alvarez-Melis, Stefanie Jegelka, and Tommi S. Jaakkola. Towards optimal transport with global invariances. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

Matthew Amodio and Smita Krishnaswamy. MAGAN: Aligning biological manifolds. In International Conference on Machine Learning (ICML), 2018.

Aleksandra E. Badaczewska-Dawid, Andrzej Kolinski, and Sebastian Kmiecik. Computational reconstruction of atomistic protein structures from coarse-grained models. Computational and Structural Biotechnology Journal, 18:162-176, 2020.
Federica Bogo, Javier Romero, Matthew Loper, and Michael J. Black. FAUST: Dataset and evaluation for 3D mesh registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Piscataway, NJ, USA, 2014. IEEE.

Ingwer Borg and Patrick J. F. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer Science & Business Media, 2005.

Amit Boyarski, Alex M. Bronstein, and Michael M. Bronstein. Subspace least squares multidimensional scaling. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 681-693. Springer, 2017.

Kai Cao, Xiangqi Bai, Yiguang Hong, and Lin Wan. Unsupervised topological alignment for single-cell multi-omics integration. Bioinformatics, 36(Supplement 1):i48-i56, 2020.

Zhi-Jie Cao and Ge Gao. Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nature Biotechnology, pp. 1-9, 2022.

Lisha Chen and Andreas Buja. Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. Journal of the American Statistical Association, 104(485):209-219, 2009.

Zhen Cui, Hong Chang, Shiguang Shan, and Xilin Chen. Generalized unsupervised manifold alignment. In Advances in Neural Information Processing Systems (NeurIPS), 2014.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems (NeurIPS), 2013.

Jan De Leeuw. Convergence of the majorization method for multidimensional scaling. Journal of Classification, 5(2):163-180, 1988.

Jan De Leeuw and Patrick Mair. Multidimensional scaling using majorization: SMACOF in R. Journal of Statistical Software, 31:1-30, 2009.

Pinar Demetci, Rebecca Santorella, Björn Sandstede, William Stafford Noble, and Ritambhara Singh. SCOT: Single-cell multi-omics alignment with optimal transport. Journal of Computational Biology, 29(1):3-18, 2022.

Li Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141-142, 2012.

Emden R. Gansner, Yehuda Koren, and Stephen North. Graph drawing by stress majorization. In International Symposium on Graph Drawing, pp. 239-250. Springer, 2004.

Edouard Grave, Armand Joulin, and Quentin Berthet. Unsupervised alignment of embeddings with Wasserstein Procrustes. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

Patrick J. F. Groenen and Willem J. Heiser. The tunneling method for global optimization in multidimensional scaling. Psychometrika, 61(3):529-550, 1996.

Jihun Ham, Daniel Lee, and Lawrence Saul. Semisupervised alignment of manifolds. In International Workshop on Artificial Intelligence and Statistics, 2005.

Somaye Hashemifar and Jinbo Xu. HubAlign: An accurate and efficient method for global alignment of protein-protein interaction networks. Bioinformatics, 30(17):i438-i444, 2014.

Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550-554, 1994.

Kun Jin, Chaoyue Liu, and Cathy Xia. Two-sided Wasserstein Procrustes analysis. In International Joint Conference on Artificial Intelligence (IJCAI), 2021.

Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1-9, 2016.
Tomihisa Kamada, Satoru Kawai, et al. An algorithm for drawing general undirected graphs. Information Processing Letters, 31(1):7-15, 1989.

Dhruv Kohli, Alexander Cloninger, and Gal Mishne. LDLE: Low distortion local eigenmaps. Journal of Machine Learning Research, 22(282), 2021.

Ya-Wei Eileen Lin, Yuval Kluger, and Ronen Talmon. Hyperbolic Procrustes analysis using Riemannian geometry. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Jie Liu, Yuanhao Huang, Ritambhara Singh, Jean-Philippe Vert, and William Stafford Noble. Jointly embedding multiple single-cell omics measurements. In Workshop on Algorithms in Bioinformatics, 2019.

Linfeng Liu, Michael C. Hughes, Soha Hassoun, and Liping Liu. Stochastic iterative graph matching. In International Conference on Machine Learning (ICML), 2021.

Facundo Mémoli. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4):417-487, 2011.

Gautam Pai, Ronen Talmon, Alex Bronstein, and Ron Kimmel. DIMAL: Deep isometric manifold learning using sparse geodesic sampling. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2019.

Joana Pereira, Adam J. Simpkin, Marcus D. Hartmann, Daniel J. Rigden, Ronan M. Keegan, and Andrei N. Lupas. High-accuracy protein structure prediction in CASP14. Proteins: Structure, Function, and Bioinformatics, 89(12):1687-1699, 2021.

Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.

Ketan Rajawat and Sandeep Kumar. Stochastic multidimensional scaling. IEEE Transactions on Signal and Information Processing over Networks, 3(2):360-375, 2017.

Gil Shamai, Yonathan Aflalo, Michael Zibulevsky, and Ron Kimmel. Classical scaling revisited. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

Aaron Shon, Keith Grochow, Aaron Hertzmann, and Rajesh P. Rao. Learning shared latent structure for image synthesis and robotic imitation. In Advances in Neural Information Processing Systems (NeurIPS), 2005.

Justin Solomon, Gabriel Peyré, Vladimir G. Kim, and Suvrit Sra. Entropic metric alignment for correspondence problems. ACM Transactions on Graphics (ToG), 35(4):1-13, 2016.

Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.

Vayer Titouan, Ievgen Redko, Rémi Flamary, and Nicolas Courty. CO-Optimal Transport. Advances in Neural Information Processing Systems, 33:17559-17570, 2020.

Warren S. Torgerson. Multidimensional scaling of similarity. Psychometrika, 30(4):379-393, 1965.

Devis Tuia, Michele Volpi, and Gustau Camps-Valls. Unsupervised alignment of image manifolds with centrality measures. In 22nd International Conference on Pattern Recognition (ICPR), pp. 912-917. IEEE, 2014.

Titouan Vayer, Rémi Flamary, Romain Tavenard, Laetitia Chapel, and Nicolas Courty. Sliced Gromov-Wasserstein. arXiv preprint arXiv:1905.10124, 2019.

Vipin Vijayan, Vikram Saraph, and Tijana Milenković. MAGNA++: Maximizing accuracy in global network alignment via both node and edge conservation. Bioinformatics, 31(14):2409-2411, 2015.

Chang Wang and Sridhar Mahadevan. Manifold alignment using Procrustes analysis. In International Conference on Machine Learning (ICML), 2008.

Chang Wang and Sridhar Mahadevan. Manifold alignment without correspondence. In International Joint Conference on Artificial Intelligence (IJCAI), 2009.
Hongteng Xu, Dixin Luo, and Lawrence Carin. Scalable Gromov-Wasserstein learning for graph partitioning and matching. In Advances in Neural Information Processing Systems (NeurIPS), 2019a.

Hongteng Xu, Dixin Luo, Hongyuan Zha, and Lawrence Carin Duke. Gromov-Wasserstein learning for graph matching and node embedding. In International Conference on Machine Learning (ICML), 2019b.

Yuguang Yan, Wen Li, Hanrui Wu, Huaqing Min, Mingkui Tan, and Qingyao Wu. Semi-supervised optimal transport for heterogeneous domain adaptation. In International Joint Conference on Artificial Intelligence (IJCAI), 2018.

Zhen Zhang, Yijian Xiang, Lingfei Wu, Bing Xue, and Arye Nehorai. KerGM: Kernelized graph matching. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

APPENDIX

This appendix provides additional background and details about methods and experiments. It is organized as follows: Section A provides further background on optimization, and Section B presents experimental details and additional results.

A FURTHER BACKGROUND ON OPTIMIZATION

A.1 MULTIDIMENSIONAL SCALING AND SMACOF

This section provides further background on the metric MDS problem and details about the SMACOF algorithm. The metric MDS problem consists in solving the minimization problem of the stress function defined in Eq. (1), or equivalently

$$\min_{Z \in \mathbb{R}^{n \times d}} \ \mathrm{stress}(Z, D, W) := \sum_{i < j} w_{ij} (d_{ij} - d(z_i, z_j))^2.$$

SMACOF minimizes this function by iterating the Guttman transform $Z^{t+1} = V^{+} B(Z^t) Z^t$, where $V^{+}$ is the Moore-Penrose pseudo-inverse of the matrix $V$ with entries $v_{ij} := -w_{ij}$ for $i \neq j$ and $v_{ii} := \sum_{j \neq i} w_{ij}$, and $B(Z)$ has entries $b_{ij} := -w_{ij} d_{ij} / d(z_i, z_j)$ for $i \neq j$ if $d(z_i, z_j) > 0$, 0 otherwise, with $b_{ii} := -\sum_{j \neq i} b_{ij}$ (Borg & Groenen, 2005).
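A compact sketch of the resulting SMACOF iteration, which also serves as the mds_smacof helper assumed by the schematic in Section 4.3 (a minimal illustration rather than the released implementation):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mds_smacof(D, W, d, init=None, n_iter=100, seed=0):
    """Weighted metric MDS by SMACOF iterations (Guttman transform)."""
    n = D.shape[0]
    rng = np.random.RandomState(seed)
    Z = rng.randn(n, d) if init is None else init.copy()
    # V depends only on the weights: v_ij = -w_ij (i != j), v_ii = sum_{j != i} w_ij
    V = -np.asarray(W, dtype=float).copy()
    np.fill_diagonal(V, 0.0)
    np.fill_diagonal(V, -V.sum(axis=1))
    V_pinv = np.linalg.pinv(V)                 # V is singular, so use the pseudo-inverse
    for _ in range(n_iter):
        dist = squareform(pdist(Z))
        with np.errstate(divide="ignore", invalid="ignore"):
            B = np.where(dist > 0, -W * D / dist, 0.0)   # b_ij for i != j
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))              # b_ii = -sum_{j != i} b_ij
        Z = V_pinv @ (B @ Z)                   # Guttman transform: Z <- V^+ B(Z) Z
    return Z
```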