# Hyperbolic Procrustes Analysis Using Riemannian Geometry

Ya-Wei Eileen Lin, Yuval Kluger, Ronen Talmon
Viterbi Faculty of Electrical and Computer Engineering, Technion
Program in Applied Mathematics, Yale University
Interdepartmental Program in Computational Biology and Bioinformatics, Yale University
Department of Pathology, Yale University
{lin.ya-wei@campus, ronen@ee}.technion.ac.il, yuval.kluger@yale.edu

## Abstract

Label-free alignment between datasets collected at different times, locations, or by different instruments is a fundamental scientific task. Hyperbolic spaces have recently provided a fruitful foundation for the development of informative representations of hierarchical data. Here, we take a purely geometric approach for label-free alignment of hierarchical datasets and introduce hyperbolic Procrustes analysis (HPA). HPA consists of new implementations of the three prototypical Procrustes analysis components: translation, scaling, and rotation, based on the Riemannian geometry of the Lorentz model of hyperbolic space. We analyze the proposed components, highlighting their useful properties for alignment. The efficacy of HPA, its theoretical properties, stability, and computational efficiency are demonstrated in simulations. In addition, we showcase its performance on three batch correction tasks involving gene expression and mass cytometry data. Specifically, we demonstrate high-quality unsupervised batch effect removal from data acquired at different sites and with different technologies that outperforms recent methods for label-free alignment in hyperbolic spaces.

## 1 Introduction

A key scientific task in modern data analysis is the alignment of data. The need for alignment often arises since data are acquired in multiple domains, under different environmental conditions, using various acquisition equipment, and at different sites. This paper focuses on the problem of label-free alignment of data embedded in hyperbolic spaces. Recently, hyperbolic spaces have gained prominence in geometric representation learning. These non-Euclidean spaces have become popular since they provide a natural embedding of hierarchical data thanks to the exponential growth of the lengths of their geodesic paths [41, 42, 30, 15, 14, 6, 32]. The problem of alignment of data embedded in hyperbolic spaces has been extensively studied, e.g., in the context of natural language processing [49], ontology matching [10], matching two data modalities [40], and improving the embedding in hyperbolic spaces [2]. A few of these studies are based on optimal transport (OT) [2, 22], a classical problem in mathematics [38] that has recently reemerged in modern data analysis, e.g., for domain adaptation [7]. Despite its increasing usage, OT for unsupervised alignment is fundamentally limited [54], since OT (as any density matching approach) cannot recover volume-preserving maps [3, 4, 36].

In this paper, we resort to Procrustes analysis (PA) [17, 18], which is based on purely geometric considerations. PA has been widely used for aligning datasets by eliminating the shift, scaling, and rotational factors. Over the years, it has been successfully applied to various applications, e.g., image registration [34], manifold alignment [52], shape matching [35], domain adaptation [47], and manifold learning [27], to name but a few.
Here, we address the problem of label-free matching of hierarchical data embedded in hyperbolic spaces. We present hyperbolic Procrustes analysis (HPA), a new PA method in the Lorentz model of hyperbolic geometry. The main novelty lies in the introduction of new implementations of the three prototypical PA components based on Riemannian geometry. Specifically, translation is viewed as a Riemannian mean alignment, implemented using parallel transport (PT). Scaling is determined with respect to geodesic paths. Rotation is considered as moment alignment on a mapping of the tangent space of the manifold to a Euclidean vector space. Our analysis provides new derivations in the Riemannian geometry of the Lorentz model and specifies the commuting properties of the HPA components. We show that HPA, compared to existing baselines and OT-based methods, achieves improved alignment in a purely unsupervised setting. In addition, it has a natural and stable out-of-sample extension, it supports both small and big data, and it is computationally efficient. We show application to batch correction in bioinformatics tasks. We present results on both gene expression and mass cytometry (CyTOF) data, exemplifying the generality and broad scope of our method. In contrast to recent works [28, 50], our method does not require landmark correspondence, which is often unavailable or hard to obtain. Specifically, we show that batch effects caused by acquisition using different technologies, at different sites, and at different times can be accurately removed, while preserving the intrinsic structure of the data.

Our main contributions are as follows. (i) We present a new implementation of PA using the Riemannian geometry of the Lorentz model for unsupervised label-free hierarchical data alignment. (ii) We provide theoretical analysis and justification of our alignment method based on new derivations of Riemannian geometry operations in the Lorentz model. These derivations have their own merit, as they could be used in other contexts. (iii) We show experimental results of accurate batch effect removal from several hierarchical bioinformatics datasets without landmark correspondence.

## 2 Background on hyperbolic geometry

Hyperbolic space is a non-Euclidean space with a negative constant sectional curvature and an underlying geometry that describes tree-like graphs with small distortions [46]. There exist four commonly-used models of hyperbolic space: the Poincaré disk model, the Lorentz model (hyperboloid model), the Poincaré half-plane model, and the Beltrami-Klein model. These four models are equivalent, and there exist transformations between them. Here, we consider the Lorentz model, and specifically the upper sheet of the hyperboloid, because its basic Riemannian operations have simple closed-form expressions and the computation of geodesic distances is stable [42, 30].

Formally, the upper sheet of the hyperboloid model of a $d$-dimensional hyperbolic space is defined by $\mathbb{L}^d := \{x \in \mathbb{R}^{d+1} \mid \langle x, x\rangle_{\mathcal{L}} = -1,\ x(1) > 0\}$, where $\langle x, y\rangle_{\mathcal{L}} = x^\top H y$ is the Lorentzian inner product and $H \in \mathbb{R}^{(d+1)\times(d+1)}$ is defined by $H = [-1, \mathbf{0}^\top; \mathbf{0}, I_d]$. The Lorentzian norm of a vector $v$ is denoted by $\|v\|_{\mathcal{L}} = \sqrt{\langle v, v\rangle_{\mathcal{L}}}$, and the origin of $\mathbb{L}^d$ is $\mu_0 = [1, \mathbf{0}^\top]^\top \in \mathbb{L}^d$. Let $\mathcal{T}_x\mathbb{L}^d$ be the tangent space at $x \in \mathbb{L}^d$, defined by $\mathcal{T}_x\mathbb{L}^d := \{v \mid \langle x, v\rangle_{\mathcal{L}} = 0\}$. Consider $x \in \mathbb{L}^d$ and $v \in \mathcal{T}_x\mathbb{L}^d$. The geodesic path $\varphi: \mathbb{R}_{\geq 0} \to \mathbb{L}^d$ is defined by
$$\varphi(t) = \cosh(\|v\|_{\mathcal{L}}\, t)\, x + \sinh(\|v\|_{\mathcal{L}}\, t)\, \frac{v}{\|v\|_{\mathcal{L}}},$$
with $\varphi(0) = x$ and initial velocity $\varphi'(0) = v$, where $\varphi'(t) := \frac{\mathrm{d}}{\mathrm{d}t}\varphi(t)$.
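To make these definitions concrete, the following minimal NumPy sketch implements the Lorentzian inner product, the Lorentzian norm, and the geodesic path. It is an illustrative implementation of the formulas above, not the paper's released code, and the function names are ours.

```python
import numpy as np

def lorentz_inner(x, y):
    # Lorentzian inner product <x, y>_L = x^T H y with H = diag(-1, I_d)
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_norm(v):
    # Lorentzian norm of a tangent vector; <v, v>_L >= 0 for v in T_x L^d
    return np.sqrt(np.clip(lorentz_inner(v, v), 0.0, None))

def geodesic(x, v, t):
    # phi(t) = cosh(||v||_L t) x + sinh(||v||_L t) v / ||v||_L, with phi(0) = x
    nv = lorentz_norm(v)
    if nv < 1e-12:
        return x
    return np.cosh(nv * t) * x + np.sinh(nv * t) * (v / nv)

# The origin mu0 = [1, 0, ..., 0] satisfies <mu0, mu0>_L = -1, i.e., it lies on L^2
mu0 = np.array([1.0, 0.0, 0.0])
assert np.isclose(lorentz_inner(mu0, mu0), -1.0)
```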
In addition, the associated geodesic distance is $d_{\mathbb{L}^d}(x, \varphi(t)) = \cosh^{-1}\!\big(-\langle x, \varphi(t)\rangle_{\mathcal{L}}\big)$. The Exponential map, projecting a point $v \in \mathcal{T}_x\mathbb{L}^d$ to the manifold $\mathbb{L}^d$, is given by $\mathrm{Exp}_x(v) = \varphi(1) = \cosh(\|v\|_{\mathcal{L}})\, x + \sinh(\|v\|_{\mathcal{L}})\, \frac{v}{\|v\|_{\mathcal{L}}}$. The Logarithmic map, projecting a point $y \in \mathbb{L}^d$ to the tangent space $\mathcal{T}_x\mathbb{L}^d$ at $x$, is defined by $\mathrm{Log}_x(y) = \frac{\cosh^{-1}(\lambda)}{\sqrt{\lambda^2 - 1}}\,(y - \lambda x)$, where $\lambda = -\langle x, y\rangle_{\mathcal{L}}$. The PT of a vector $v \in \mathcal{T}_x\mathbb{L}^d$ along the geodesic path from $x \in \mathbb{L}^d$ to $y \in \mathbb{L}^d$ is defined by $\mathrm{PT}_{x\to y}(v) = v + \frac{\langle y - \lambda x,\, v\rangle_{\mathcal{L}}}{\lambda + 1}\,(x + y)$, where $\lambda = -\langle x, y\rangle_{\mathcal{L}}$; it keeps the metric tensor unchanged. The Riemannian mean $\bar{x}_X$ and the corresponding dispersion $d_X$ of a set $X = \{x_i \mid x_i \in \mathbb{L}^d\}_{i=1}^{n}$ are defined using the Fréchet mean [13, 33] by
$$\bar{x}_X := m(X) = \arg\min_{x\in\mathbb{L}^d}\ \sum_{i=1}^{n} d^2_{\mathbb{L}^d}(x, x_i) \quad\text{and}\quad d_X := r(X) = \min_{x\in\mathbb{L}^d}\ \frac{1}{n}\sum_{i=1}^{n} d^2_{\mathbb{L}^d}(x, x_i), \qquad (1)$$
where $m$ maps a set to a point on $\mathbb{L}^d$ and $r$ maps a set to $\mathbb{R}_+$. Note that the Fréchet mean of samples on complete, simply connected Riemannian manifolds of non-positive curvature, such as hyperbolic spaces, is guaranteed to exist, and it is unique [24, 44, 1]. The Fréchet mean is commonly computed by the Karcher flow [24, 20], which is computationally demanding. Importantly, in the considered hyperbolic space, the Fréchet mean can be efficiently obtained using the accurate gradient formulation [33].

[Figure 1: Illustration of our alignment method (HPA). Two sets of points in hyperbolic space are given, depicted in dark and light colors. Each point is associated with one of two labels: (i) blue/cyan and (ii) red/pink. Left: alignment results after each step of our HPA implemented in $\mathbb{L}^2$: Riemannian translation (Step 1), Riemannian scaling (Step 2), and Riemannian wrapped rotation (Step 3). Note that these Riemannian operations in hyperbolic space are different from their Euclidean counterparts. Right: alignment results of four methods: PAH [51] and HOT-F [54] applied in $\mathbb{L}^2$, and HOT-L [22] and HOT-ME [22] applied in the 2D Poincaré disk. For visualization, all points in $\mathbb{L}^2$ are transformed to the 2D Poincaré disk. The alignment after Step 3 of HPA (circled in black) is more accurate than the alignments obtained by the other methods.]

Given a vector $x \in \mathbb{L}^d$ and a symmetric positive-definite (SPD) matrix $\Sigma \in \mathbb{R}^{d\times d}$, the wrapped normal distribution $\mathcal{G}(x, \Sigma)$ provides a generative model of hyperbolic samples as follows [39, 11]. First, a vector $v'$ is sampled from $\mathcal{N}(0, \Sigma)$. Then, $0$ is concatenated to the vector $v'$ such that $v = [0, v'^\top]^\top \in \mathcal{T}_{\mu_0}\mathbb{L}^d$. Finally, PT from the origin $\mu_0 = [1, \mathbf{0}^\top]^\top$ to $x$ is applied to $v$, and the resulting vector is mapped to the manifold using the Exponential map at $x$. The probability density function of this model is given by $\log \mathcal{G}(y \mid x, \Sigma) = \log \mathcal{N}(v' \mid 0, \Sigma) - (d-1)\log\big(\sinh(\|v'\|_2)/\|v'\|_2\big)$.

## 3 Hyperbolic Procrustes analysis

Existing methods for data alignment typically seek a function that minimizes a certain cost. A large body of work attempts to match the empirical densities of two datasets, e.g., by minimizing the maximum mean discrepancy (MMD) [48, 29] or by solving OT problems [45, 2, 22]. Finding an effective cost function without labels or landmarks is challenging, and minimizing such costs directly often leads to poor alignment in practice (see the illustration in Fig. 1). A different well-established approach, which applies indirect alignment based on geometric considerations, is PA.
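Continuing the sketch above, the closed-form Exponential and Logarithmic maps, PT, and geodesic distance translate directly to code. For the Fréchet mean we show plain Karcher-flow iterations for simplicity, whereas the paper advocates the more efficient gradient formulation of [33]; all names and numerical safeguards are our own.

```python
def geodesic_dist(x, y):
    # d(x, y) = arccosh(-<x, y>_L)
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

def exp_map(x, v):
    # Exp_x(v) = phi(1): follow the geodesic from x with initial velocity v
    return geodesic(x, v, 1.0)

def log_map(x, y):
    # Log_x(y) = arccosh(lam) / sqrt(lam^2 - 1) * (y - lam x), lam = -<x, y>_L
    lam = np.clip(-lorentz_inner(x, y), 1.0, None)
    if np.isclose(lam, 1.0):
        return np.zeros_like(x)  # y coincides with x
    return (np.arccosh(lam) / np.sqrt(lam**2 - 1.0)) * (y - lam * x)

def parallel_transport(x, y, v):
    # PT_{x->y}(v) = v + <y - lam x, v>_L / (lam + 1) * (x + y)
    lam = -lorentz_inner(x, y)
    return v + lorentz_inner(y - lam * x, v) / (lam + 1.0) * (x + y)

def frechet_mean(X, n_iter=50):
    # Karcher flow: average the Log-mapped points and step back via Exp
    m = np.asarray(X)[0].copy()
    for _ in range(n_iter):
        g = np.mean([log_map(m, x) for x in X], axis=0)
        m = exp_map(m, g)
    return m

def dispersion(X, m):
    # Riemannian dispersion: mean squared geodesic distance to the mean (Eq. (1))
    return float(np.mean([geodesic_dist(m, x)**2 for x in X]))
```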
While preparing this manuscript, another method of PA in hyperbolic spaces (PAH) was presented for matching two sets, assuming that they consist of the same number of points and that there exists a point-wise isometric map between them [51]. We remark that the analysis we present here applies to broader settings and makes no such assumptions. See Appendix E for details on classical PA, as well as for comparisons to [51] and to the application of Euclidean PA in the tangent space.

We consider two sets of points $H^{(1)} = \{h^{(1)}_i\}_{i=1}^{N_1}$ and $H^{(2)} = \{h^{(2)}_i\}_{i=1}^{N_2}$ in $\mathbb{L}^d$. Here, we aim to find a map $T: \mathbb{L}^d \to \mathbb{L}^d$, consisting of three components (translation, scaling, and rotation), that aligns $H^{(2)}$ with $H^{(1)}$ in an unsupervised, label-free manner, as depicted in Fig. 1. Finding such a map can be viewed as an extension of classical PA from the Euclidean space $\mathbb{R}^{d+1}$ to the Lorentz model $\mathbb{L}^d$. A natural extension to multiple sets is described in Section 3.5. We remark that the statements below are phrased in the context of the problem at hand. In Appendix A, we restate them more generally and present their proofs.

### 3.1 Riemannian translation

Let $\bar h^{(1)}$ and $\bar h^{(2)}$ denote the Riemannian means of the sets $H^{(1)}$ and $H^{(2)}$, respectively. In this translation component, we find a map $\Gamma_{\bar h^{(2)}\to\bar h^{(1)}}: \mathbb{L}^d \to \mathbb{L}^d$ that aligns the Riemannian means of the sets. In the spirit of [5, 53], we propose to construct $\Gamma_{\bar h^{(2)}\to\bar h^{(1)}}(h^{(2)}_i)$ as the composition of three Riemannian operations in $\mathbb{L}^d$: the Logarithmic map applied to $h^{(2)}_i$ at $\bar h^{(2)}$, PT from $\bar h^{(2)}$ to $\bar h^{(1)}$ along the geodesic path, and the Exponential map applied to the transported point at $\bar h^{(1)}$:
$$\Gamma_{\bar h^{(2)}\to\bar h^{(1)}}(h^{(2)}_i) := \mathrm{Exp}_{\bar h^{(1)}}\Big(\mathrm{PT}_{\bar h^{(2)}\to\bar h^{(1)}}\big(\mathrm{Log}_{\bar h^{(2)}}(h^{(2)}_i)\big)\Big). \qquad (2)$$
See Fig. B.1 in Appendix B for an illustration. Since the geodesic path between any two points in $\mathbb{L}^d$ is unique [46], $\Gamma_{\bar h^{(2)}\to\bar h^{(1)}}$ is well-defined. The rationale behind the combination of these three Riemannian operations is twofold. First, PT is a map that aligns the means of the sets while preserving their internal structure. Second, the Logarithmic and Exponential maps compose a map whose domain and range are the Lorentz model $\mathbb{L}^d$ rather than the tangent space, as desired. We make these claims formal in the following results.

**Proposition 1.** The map $\Gamma_{\bar h^{(2)}\to\bar h^{(1)}}$ defined in Eq. (2) aligns the means of the sets, i.e., it satisfies
$$\bar h^{(1)} = m\big(\{\Gamma_{\bar h^{(2)}\to\bar h^{(1)}}(h^{(2)}_i)\}_{i=1}^{N_2}\big), \qquad (3)$$
where $m$ is the function defined in Eq. (1).

**Proposition 2.** The map $\Gamma_{\bar h^{(2)}\to\bar h^{(1)}}(h^{(2)}_i)$ for all $h^{(2)}_i \in H^{(2)}$ can be recast as
$$\Gamma_{\bar h^{(2)}\to\bar h^{(1)}}(h^{(2)}_i) = h^{(2)}_i - \beta(h^{(2)}_i)\,\bar h^{(2)} + \gamma(h^{(2)}_i)\,\bar h^{(1)}, \qquad (4)$$
where the functions $\beta$ and $\gamma$ are positive, defined by $0 < \beta(h^{(2)}_i) = -\frac{1}{\lambda+1}\big\langle \bar h^{(1)} + \bar h^{(2)},\, h^{(2)}_i\big\rangle_{\mathcal{L}}$ and $0 < \gamma(h^{(2)}_i) = \frac{1}{\lambda+1}\big\langle \bar h^{(1)} - (2\lambda+1)\,\bar h^{(2)},\, h^{(2)}_i\big\rangle_{\mathcal{L}}$, respectively, and $0 < \lambda = -\langle \bar h^{(1)}, \bar h^{(2)}\rangle_{\mathcal{L}}$.

In addition to providing a compact closed-form expression, Prop. 2 gives the proposed translation based on Riemannian geometry an interpretation of standard mean alignment in linear vector spaces. It implies that the alignment is nothing but subtracting the mean of the source set $\bar h^{(2)}$ from each vector in $H^{(2)}$ and adding the mean of the target set $\bar h^{(1)}$ (with the appropriate scales).

**Proposition 3.** The map $\Gamma_{\bar h^{(2)}\to\bar h^{(1)}}$ preserves distances (i.e., it is an isometry):
$$d_{\mathbb{L}^d}\big(h^{(2)}_i, h^{(2)}_j\big) = d_{\mathbb{L}^d}\big(\Gamma_{\bar h^{(2)}\to\bar h^{(1)}}(h^{(2)}_i),\, \Gamma_{\bar h^{(2)}\to\bar h^{(1)}}(h^{(2)}_j)\big) \qquad (5)$$
for any two points $h^{(2)}_i, h^{(2)}_j \in H^{(2)}$.

Let $\zeta(t)$ be the unique geodesic path from $\bar h^{(2)}$ to $\bar h^{(1)}$ such that $\zeta(0) = \bar h^{(2)}$ and $\zeta(1) = \bar h^{(1)}$, and let $\zeta'(0) \in \mathcal{T}_{\bar h^{(2)}}\mathbb{L}^d$ and $\zeta'(1) \in \mathcal{T}_{\bar h^{(1)}}\mathbb{L}^d$ be the corresponding velocities, respectively.

**Proposition 4.** The map $\Gamma_{\bar h^{(2)}\to\bar h^{(1)}}$ aligns geodesic velocities, i.e., given the mappings of the geodesic velocities to the manifold, $\mathbb{L}^d \ni v_0 = \mathrm{Exp}_{\bar h^{(2)}}(\zeta'(0)) = \bar h^{(1)}$ and $\mathbb{L}^d \ni v_1 = \mathrm{Exp}_{\bar h^{(1)}}(\zeta'(1))$, we have
$$\Gamma_{\bar h^{(2)}\to\bar h^{(1)}}(v_0) = v_1. \qquad (6)$$

An isometry is determined only up to rotation, a fact that can be problematic for alignment. For example, any H-unitary matrix [16] is an isometric map of $\mathbb{L}^d$.
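With these primitives, the Riemannian translation of Eq. (2) is a one-liner, and the closed form of Prop. 2 can be checked numerically. This is a hedged sketch building on the helpers above; the random test points are illustrative only.

```python
def riemannian_translation(h, m_src, m_tgt):
    # Gamma_{m_src -> m_tgt}(h): Log at the source mean, PT along the geodesic,
    # Exp at the target mean (Eq. (2))
    return exp_map(m_tgt, parallel_transport(m_src, m_tgt, log_map(m_src, h)))

# Numerical check of Prop. 2 on random points in L^2 (illustrative example)
rng = np.random.default_rng(0)
to_manifold = lambda w: exp_map(mu0, np.concatenate([[0.0], w]))  # lift R^2 -> L^2
m_src, m_tgt, h = (to_manifold(rng.normal(size=2)) for _ in range(3))
lam = -lorentz_inner(m_tgt, m_src)
beta = -lorentz_inner(m_tgt + m_src, h) / (lam + 1.0)
gam = lorentz_inner(m_tgt - (2.0 * lam + 1.0) * m_src, h) / (lam + 1.0)
assert np.allclose(riemannian_translation(h, m_src, m_tgt),
                   h - beta * m_src + gam * m_tgt)
```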
When landmarks are given, they can be used to alleviate this redundancy. However, in the purely unsupervised setting we consider, other data-driven cues are required. Prop. 4 implies that the proposed translation based on PT fixes some of these rotational degrees of freedom by aligning the geodesic velocities. In Section 3.3, we revisit this issue.

Now, with a slight abuse of notation, let $\widetilde H^{(2)} = \Gamma_{\bar h^{(2)}\to\bar h^{(1)}}(H^{(2)})$ denote the translated set.

**Proposition 5.** Consider two subsets $A, B \subset H^{(2)}$ and their translations $\widetilde A = \Gamma_{\bar h^{(2)}\to\bar h^{(1)}}(A)$, $\widetilde B = \Gamma_{\bar h^{(2)}\to\bar h^{(1)}}(B) \subset \widetilde H^{(2)}$. Let $\bar a = m(A)$, $\bar b = m(B)$, $\tilde a = m(\widetilde A)$, and $\tilde b = m(\widetilde B)$ be the Riemannian means of the subsets. Then,
$$\Gamma_{\bar h^{(2)}\to\bar h^{(1)}} \circ \Gamma_{\bar a\to\bar b} = \Gamma_{\tilde a\to\tilde b} \circ \Gamma_{\bar h^{(2)}\to\bar h^{(1)}}. \qquad (7)$$

In the context of the alignment problem, the importance of Prop. 5 is the following. Suppose the two sets correspond to data measured at two labs (denoted with and without a tilde), and suppose each set was acquired by two types of equipment (denoted by $A$ and $B$). Prop. 5 implies that aligning data from the different labs and then aligning data acquired using the different equipment is equivalent to first aligning the different equipment and then the different labs; i.e., any order of the two alignments generates the same result. Seemingly, this is a natural property in Euclidean spaces. However, on a Riemannian manifold it is not a trivial result, and it holds for the transport along the geodesic path. See Appendix A for counter-examples.

### 3.2 Riemannian scaling

Let $d^{(1)}$ and $d^{(2)}$ denote the Riemannian dispersions of $H^{(1)}$ and $H^{(2)}$, respectively. By Propositions 1 and 3, $\bar h^{(1)}$ and $d^{(2)}$ are the mean and dispersion of $\widetilde H^{(2)}$. Here, our goal is to align the Riemannian dispersions of $H^{(1)}$ and $\widetilde H^{(2)}$. For this purpose, we propose the scaling function $s_{\bar h^{(1)}}: \mathbb{L}^d \to \mathbb{L}^d$, given by
$$s_{\bar h^{(1)}}(\tilde h^{(2)}_i) = \varphi_i(s), \qquad (8)$$
where $s = \sqrt{d^{(1)}/d^{(2)}}$ is the scaling factor and $\varphi_i(t)$ is the geodesic path from $\bar h^{(1)}$ to $\tilde h^{(2)}_i$ such that $\varphi_i(0) = \bar h^{(1)}$ and $\varphi_i(1) = \tilde h^{(2)}_i$. See Fig. B.2 in Appendix B for an illustration.

**Proposition 6.** The dispersion of the rescaled set $\widehat H^{(2)} = s_{\bar h^{(1)}}(\widetilde H^{(2)})$ is $d^{(1)}$.

### 3.3 Riemannian wrapped rotation

The purpose of this component is to align the orientation of the distributions of the two sets after translation and scaling, namely, after aligning their first and second moments. The proposed rotation function $R_{\bar h^{(1)}}: \mathbb{L}^d \to \mathbb{L}^d$ consists of (i) mapping the points from the manifold $\mathbb{L}^d$ to the tangent space $\mathcal{T}_{\bar h^{(1)}}\mathbb{L}^d$, (ii) mapping to $\mathbb{R}^d$, (iii) rotating in $\mathbb{R}^d$, and (iv) mapping back to the tangent space and then to the manifold. We perform the rotation in $\mathbb{R}^d$, which we term wrapped rotation, rather than a direct rotation on the manifold $\mathbb{L}^d$ or on the tangent space $\mathcal{T}_{\bar h^{(1)}}\mathbb{L}^d$, for the following reasons. First, the frequently used rotation map in $\mathbb{L}^d$ [51] does not necessarily preserve the Riemannian mean, and in our context, it might reverse the mean alignment. Second, rotation applied directly to the tangent space $\mathcal{T}_{\bar h^{(1)}}\mathbb{L}^d$ does not guarantee that the rotated points remain on the same tangent space. Third, applying rotation in the Euclidean vector space that is isometric to the tangent space is less efficient and stable, and it obtains slightly worse empirical results (see Appendix D.4 for details). Last, applying the rotation in $\mathbb{R}^d$ allows us to use the standard Euclidean rotation computed by SVD. In Section 4, we empirically demonstrate the advantage of the proposed rotation compared to the alternatives.

**Definition 1.** Let $P_{\bar h^{(1)}}: \mathcal{T}_{\bar h^{(1)}}\mathbb{L}^d \to \mathbb{R}^d$ be the mapping function defined on the tangent space at $\bar h^{(1)}$, and let $P^{-1}_{\bar h^{(1)}}: \mathbb{R}^d \to \mathcal{T}_{\bar h^{(1)}}\mathbb{L}^d$ be its inverse. They are defined by
$$P_{\bar h^{(1)}}(v) = [v(2), \ldots, v(d+1)]^\top \in \mathbb{R}^d \quad\text{and}\quad P^{-1}_{\bar h^{(1)}}(s) = \Big[\frac{\big\langle \bar h^{(1)}(2{:}d{+}1),\, s\big\rangle}{\bar h^{(1)}(1)},\ s^\top\Big]^\top, \qquad (9)$$
where $s \in \mathbb{R}^d$ and $\langle\cdot,\cdot\rangle$ is the standard Euclidean inner product.
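A sketch of the Riemannian scaling of Eq. (8), using the equivalent parametrization $\varphi_i(t) = \mathrm{Exp}_{\bar h^{(1)}}\big(t\,\mathrm{Log}_{\bar h^{(1)}}(\tilde h^{(2)}_i)\big)$ of the geodesic in the text, and of the maps of Definition 1; as before, the names are ours and the snippet assumes the helpers above.

```python
def riemannian_scaling(h, m, s):
    # phi(t) = Exp_m(t * Log_m(h)) satisfies phi(0) = m and phi(1) = h;
    # returning phi(s) moves h toward (s < 1) or away from (s > 1) the mean m
    return exp_map(m, s * log_map(m, h))

def P(v):
    # Definition 1: drop the first coordinate of v in T_m L^d; it is redundant
    # because <m, v>_L = 0 determines v(1) from the remaining entries
    return v[1:]

def P_inv(s, m):
    # Inverse map: restore the first coordinate from the tangent constraint,
    # v(1) = <m(2:d+1), s> / m(1)
    return np.concatenate([[np.dot(m[1:], s) / m[0]], s])
```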
Note that removing the first element of $v$ is valid due to the constraint imposed on the vector elements in the tangent space by definition: $\langle \bar h^{(1)}, v\rangle_{\mathcal{L}} = 0$ determines $v(1)$ from the remaining elements. Indeed, no information is lost, and the mapping is invertible.

The first step in our rotation component is to map the points in $H^{(1)}$ and in $\widehat H^{(2)}$ to the tangent space at $\bar h^{(1)}$: $v^{(1)}_i = \mathrm{Log}_{\bar h^{(1)}}(h^{(1)}_i)$ for $i = 1, \ldots, N_1$, and $v^{(2)}_i = \mathrm{Log}_{\bar h^{(1)}}(\hat h^{(2)}_i)$ for $i = 1, \ldots, N_2$. In the second step, we map the points by the mapping function in Definition 1 and re-center them: $s^{(1)}_i = P_{\bar h^{(1)}}(v^{(1)}_i) - \bar s^{(1)}$, $i = 1, \ldots, N_1$, and $s^{(2)}_i = P_{\bar h^{(1)}}(v^{(2)}_i) - \bar s^{(2)}$, $i = 1, \ldots, N_2$, where $\bar s^{(k)} = \frac{1}{N_k}\sum_{i=1}^{N_k} P_{\bar h^{(1)}}(v^{(k)}_i)$ for $k = 1, 2$ is the mean vector of the projections. Then, the mapped and centered points (in $\mathbb{R}^d$) are collected into matrices:
$$S^{(k)} = \big[s^{(k)}_1, s^{(k)}_2, \ldots, s^{(k)}_{N_k}\big] \in \mathbb{R}^{d\times N_k}. \qquad (10)$$
In the third step, for each set $k = 1, 2$, we compute the rotation matrix $U^{(k)} \in \mathbb{R}^{d\times d}$ by applying SVD to the matrix $S^{(k)} = U^{(k)}\Sigma^{(k)}(E^{(k)})^\top$. Since the left-singular vectors are determined up to a sign, we propose to align their signs as follows: $u^{(2)}_i \leftarrow \mathrm{sign}\big(\langle u^{(2)}_i, u^{(1)}_i\rangle\big)\, u^{(2)}_i$, where $u^{(1)}_i$ and $u^{(2)}_i$ are the $i$-th left-singular vectors of the two sets, resulting in a modified rotation matrix $U^{(2)}$. Finally, we apply the rotation to $\widehat H^{(2)}$ by
$$R_{\bar h^{(1)}}(\hat h^{(2)}_i) = \mathrm{Exp}_{\bar h^{(1)}}\Big(P^{-1}_{\bar h^{(1)}}\big(U\, P_{\bar h^{(1)}}\big(\mathrm{Log}_{\bar h^{(1)}}(\hat h^{(2)}_i)\big)\big)\Big), \qquad (11)$$
where $U = U^{(1)}(U^{(2)})^\top$.

**Proposition 7.** The wrapped rotation is bijective, and its inverse $(R_{\bar h^{(1)}})^{-1}$ is the wrapped rotation with $U^\top$ in place of $U$.

### 3.4 Analysis

Putting all three components together, the proposed HPA that aligns $H^{(2)}$ with $H^{(1)}$ culminates in the composition of translation, scaling, and rotation:
$$T = R_{\bar h^{(1)}} \circ s_{\bar h^{(1)}} \circ \Gamma_{\bar h^{(2)}\to\bar h^{(1)}}. \qquad (12)$$
As in most PA schemes, the order of the three components is important. Yet, the proposed components allow us a certain degree of freedom, as indicated in the following results.

**Proposition 8.** The Riemannian translation and the Riemannian scaling commute w.r.t. the Riemannian means $\bar h^{(1)}$ and $\bar h^{(2)}$:
$$s_{\bar h^{(1)}} \circ \Gamma_{\bar h^{(2)}\to\bar h^{(1)}} = \Gamma_{\bar h^{(2)}\to\bar h^{(1)}} \circ s_{\bar h^{(2)}}. \qquad (13)$$
Note that $s_{\bar h^{(1)}}$ and $\Gamma_{\bar h^{(2)}\to\bar h^{(1)}}$ do not necessarily commute:
$$s_{\bar h^{(1)}} \circ \Gamma_{\bar h^{(2)}\to\bar h^{(1)}} \neq \Gamma_{\bar h^{(2)}\to\bar h^{(1)}} \circ s_{\bar h^{(1)}}. \qquad (14)$$

**Proposition 9.** The Riemannian scaling and the wrapped rotation commute:
$$R_{\bar h^{(1)}} \circ s_{\bar h^{(1)}} = s_{\bar h^{(1)}} \circ R_{\bar h^{(1)}}. \qquad (15)$$

We note that the rotation does not commute with the translation, because PT only preserves the local covariant derivative on the tangent space but might cause rotation and distortion along the transportation. Therefore, the rotation is required to be the last component of our HPA.

Thus far, we have neither presented a model for the discrepancy between the two sets nor presented the proposed HPA as optimal with respect to some criterion. In the following result, we show that if the discrepancy between the sets can be expressed as a composition of translation, scaling, and rotation, then the two sets can be perfectly aligned using HPA.

**Proposition 10.** Let $T': \mathbb{L}^d \to \mathbb{L}^d$ be a map given by a composition of a Riemannian translation, a Riemannian scaling, and a wrapped rotation with some $U' \in O(d)$. If $H^{(1)} = \{T'(h^{(2)}_i)\}_{i=1}^{N_2}$, then the proposed HPA perfectly aligns the sets:
$$h^{(1)}_i = T\big(h^{(2)}_i\big), \quad i = 1, \ldots, N_2. \qquad (16)$$

Note that HPA consists of the sequence of Riemannian translation, Riemannian scaling, and wrapped rotation. The domain and range of each component is the manifold $\mathbb{L}^d$. Yet, the first and last operations of each component are the Logarithmic and Exponential maps, which project a point from the manifold to the tangent space and vice versa, respectively. This allows us to propose an efficient implementation of the sequence without the back-and-forth projections, as described in Appendix C.
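The wrapped rotation then reads as follows. This sketch assumes $N_k \geq d$ so that the left-singular bases are square; the rotation matrix is estimated from the centered coordinate matrices of Eq. (10) and applied as in Eq. (11).

```python
def wrapped_rotation_matrix(S1, S2):
    # S1, S2: d x N_k matrices of mapped and centered tangent coordinates (Eq. (10))
    U1 = np.linalg.svd(S1, full_matrices=False)[0]
    U2 = np.linalg.svd(S2, full_matrices=False)[0]
    # Align the signs of the singular vectors: u2_i <- sign(<u2_i, u1_i>) u2_i
    U2 = U2 * np.sign(np.sum(U1 * U2, axis=0))
    return U1 @ U2.T  # U = U^(1) (U^(2))^T

def wrapped_rotation(h, m, U):
    # Eq. (11): manifold -> tangent space at m -> R^d -> rotate -> map back
    return exp_map(m, P_inv(U @ P(log_map(m, h)), m))
```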
### 3.5 Extension to multiple sets

We can naturally scale up the setting to support the alignment of $K > 2$ sets, denoted by $H^{(k)} = \{h^{(k)}_i\}_{i=1}^{N_k}$, where $k \in \{1, 2, \ldots, K\}$. Let $\bar h^{(k)}$ and $d^{(k)}$ be the Riemannian mean and dispersion of the $k$-th dataset, respectively. In addition, let $\bar h$ be the global Riemannian mean of $\{\bar h^{(k)}\}_{k=1}^{K}$. We propose to transport the points of the $k$-th set using $\Gamma_{\bar h^{(k)}\to\bar h}$. Next, the Riemannian dispersion of each set is set to 1 by applying $s_{\bar h}$ with $s = 1/\sqrt{d^{(k)}}$. Finally, the wrapped rotation is applied to all the datasets on the mapping of the tangent space $\mathcal{T}_{\bar h}\mathbb{L}^d$ and then mapped back to the manifold $\mathbb{L}^d$. The first set is designated as the reference set, and the rotation matrices of all other sets are updated according to $u^{(k)}_i \leftarrow \mathrm{sign}\big(\langle u^{(k)}_i, u^{(1)}_i\rangle\big)\, u^{(k)}_i$. The proposed HPA for multiple sets is summarized in Algorithm 1, and some implementation remarks appear in Appendix C.

**Algorithm 1: Hyperbolic Procrustes analysis**

Input: $K$ sets of hyperbolic points $H^{(1)} = \{h^{(1)}_i\}_{i=1}^{N_1}, \ldots, H^{(K)} = \{h^{(K)}_i\}_{i=1}^{N_K}$
Output: $K$ aligned sets of hyperbolic points $\check H^{(1)} = \{\check h^{(1)}_i\}_{i=1}^{N_1}, \ldots, \check H^{(K)} = \{\check h^{(K)}_i\}_{i=1}^{N_K}$

1: for each set $H^{(k)}$ do
2: &nbsp;&nbsp;&nbsp; compute the Riemannian mean $\bar h^{(k)}$ and dispersion $d^{(k)}$
3: end for
4: compute $\bar h$, the global Riemannian mean of $\{\bar h^{(k)}\}_{k=1}^{K}$
5: for each set $H^{(k)}$ do
6: &nbsp;&nbsp;&nbsp; apply the Riemannian translation $\tilde h^{(k)}_i = \Gamma_{\bar h^{(k)}\to\bar h}(h^{(k)}_i)$ &nbsp;// Eq. (2)
7: &nbsp;&nbsp;&nbsp; apply the Riemannian scaling $\hat h^{(k)}_i = s_{\bar h}(\tilde h^{(k)}_i)$ with $s = 1/\sqrt{d^{(k)}}$ &nbsp;// Eq. (8)
8: &nbsp;&nbsp;&nbsp; apply the wrapped rotation $\check h^{(k)}_i = R_{\bar h}(\hat h^{(k)}_i)$ with $U = U^{(1)}(U^{(k)})^\top$ &nbsp;// Eq. (11)
9: end for
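Putting the components together, the following is a compact sketch of Algorithm 1 under the same assumptions as the snippets above; convergence checks and the fused efficient implementation of Appendix C are omitted, and the array conventions are our own.

```python
def hyperbolic_procrustes(sets):
    """Align K sets of points on L^d (each an (N_k, d+1) array); the first set
    is the rotation reference, mirroring Algorithm 1."""
    means = [frechet_mean(X) for X in sets]                       # lines 1-3
    disps = [dispersion(X, m) for X, m in zip(sets, means)]
    h_bar = frechet_mean(np.stack(means))                         # line 4
    aligned = []
    for X, m, d2 in zip(sets, means, disps):
        Xt = [riemannian_translation(h, m, h_bar) for h in X]     # line 6
        Xs = [riemannian_scaling(h, h_bar, 1.0 / np.sqrt(d2)) for h in Xt]  # line 7
        aligned.append(np.stack(Xs))
    # line 8: wrapped rotation of every set toward the first (reference) set
    S = [np.stack([P(log_map(h_bar, h)) for h in X], axis=1) for X in aligned]
    S = [s - s.mean(axis=1, keepdims=True) for s in S]            # re-center
    out = [aligned[0]]
    for k in range(1, len(aligned)):
        U = wrapped_rotation_matrix(S[0], S[k])
        out.append(np.stack([wrapped_rotation(h, h_bar, U) for h in aligned[k]]))
    return out
```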
## 4 Experimental results

We apply HPA to simulations and to three biomedical datasets.¹ In addition, we test HPA on the MNIST [31] and USPS [23] datasets, which arguably do not have a distinct hierarchical structure. Nonetheless, we demonstrate in Appendix D that our HPA is highly effective in aligning these two datasets. All the experiments are label-free. We compare the obtained results to the following alignment methods: (i) PAH [51], which is applied only to the simulated data since it requires the existence of a one-to-one correspondence between the sets, (ii) only the Riemannian translation (RT), (iii) OT in hyperbolic space with the weighted Fréchet mean (HOT-F), extended to an unsupervised setting according to [54], (iv) OT with the W-linear map (HOT-L) [22], and (v) hyperbolic mapping estimation (HOT-ME) [22]. As a baseline, we present the results obtained before the alignment (Baseline). For more details on the experimental setting, see Appendix C.

¹ Our code is available at https://github.com/RonenTalmonLab/HyperbolicProcrustesAnalysis.

### 4.1 Simulations

The synthetic data in $\mathbb{L}^d$ are generated using the sampling scheme described in Section 2 based on [39]. Given an arbitrary point $\mu \in \mathbb{L}^d$ and an arbitrary SPD matrix $\Sigma \in \mathbb{R}^{d\times d}$, we generate a set of $N$ points $Q^{(1)} = \{q^{(1)}_i\}_{i=1}^{N}$ centered at $\mu$ by $\mathbb{L}^d \ni q^{(1)}_i = \mathrm{Exp}_\mu\big(\mathrm{PT}_{\mu_0\to\mu}(v^{(1)}_i)\big)$, where $\mu_0 = [1, \mathbf{0}^\top]^\top$ is the origin, $v^{(1)}_i = [0, \tilde v^{(1)\top}_i]^\top$, and $\tilde v^{(1)}_i \sim \mathcal{N}(0, \Sigma)$. Next, we generate three noisy and distorted versions of $Q^{(1)}$. The first noisy set $Q^{(2)} = \{q^{(2)}_i\}_{i=1}^{N}$ is generated as proposed in [51] by $q^{(2)}_i = L\, T_i\, q^{(1)}_i$, where $T_i$ is a hyperbolic translation defined by $T_i = \big[\sqrt{1 + \|b_i\|^2},\ b_i^\top;\ b_i,\ (I + b_i b_i^\top)^{\frac{1}{2}}\big]$, $b_i$ is sampled from $\mathcal{N}(0, \sigma^2 I)$, $\sigma^2$ is the noise variance, and $L$ is a random H-unitary matrix [16]. Another noisy set, denoted $Q^{(3)} = \{q^{(3)}_i\}_{i=1}^{N}$, is generated by $q^{(3)}_i = L\big(\mathrm{Exp}_\mu(\mathrm{PT}_{\mu_0\to\mu}(u^{(1)}_i))\big)$, where $u^{(1)}_i = [0, (\tilde v^{(1)}_i + \epsilon_i)^\top]^\top$ and $\epsilon_i$ is sampled from $\mathcal{N}(0, \sigma^2 I)$. Here, the noise is added in the tangent space at $\mu_0$. Finally, let $Q^{(4)} = \{q^{(4)}_i\}_{i=1}^{N}$ be a distorted set, given by $q^{(4)}_i = f_{\mu'}(q^{(3)}_i)$, where $f_{\mu'}(x) = \cosh(\|u\|_{\mathcal{L}}\, t)\,\mu' + \sinh(\|u\|_{\mathcal{L}}\, t)\,\frac{u}{\|u\|_{\mathcal{L}}}$ with $u = \mathrm{Log}_{\mu'}(x)$, for an arbitrary (fixed) $\mu' \in \mathbb{L}^d$ and $t > 0$.

We apply Algorithm 1 to align the three pairs of sets $\{Q^{(1)}, Q^{(2)}\}$, $\{Q^{(1)}, Q^{(3)}\}$, and $\{Q^{(1)}, Q^{(4)}\}$, setting $N = 100$, $\sigma = 1$, and $d \in \{3, 5, 10, 20, \ldots, 40\}$. Each experiment is repeated 10 times with different values of $\mu$, $\Sigma$, $\mu'$, and $t$. To evaluate the alignment, we use the pairwise discrepancy based on the hidden one-to-one correspondence, given by $\varepsilon(Q^{(1)}, Q^{(j)}) = \frac{1}{N}\sum_{i=1}^{N} d_{\mathbb{L}^d}\big(q^{(1)}_i, \check q^{(j)}_i\big)$, where $\check q^{(j)}_i$ denotes the aligned version of $q^{(j)}_i$ and $j \in \{2, 3, 4\}$.

[Figure 2: Alignment results of the simulated data; the panels show the discrepancy as a function of the dimension $d$ for the pairs $\{Q^{(1)}, Q^{(2)}\}$, $\{Q^{(1)}, Q^{(3)}\}$, and $\{Q^{(1)}, Q^{(4)}\}$.]

The discrepancy as a function of the dimension $d$ is shown in Fig. 2. We observe that the proposed HPA attains a lower discrepancy than the other label-free methods. Specifically, it outperforms OT-based methods that are designed to match the densities. Furthermore, the proposed HPA is stable, in contrast to HOT-L, which is highly sensitive to the noise and distortion introduced in $Q^{(3)}$ and $Q^{(4)}$. Interestingly, we remark that the discrepancies of RT and PAH are very close, empirically showing that RT alone is comparable to PAH. In addition, note that HPA is permutation-invariant and does not require a one-to-one correspondence as PAH does. We report the running time in Appendix D and demonstrate that HPA is more efficient than HOT-F and HOT-ME.

### 4.2 Batch effect removal

We consider bioinformatics datasets consisting of gene expression data and CyTOF. Representing such data in hyperbolic spaces was shown to be informative and useful [25], implying that such data have an underlying inherent hierarchical structure. Batch effects [43] arise from experimental variations that can be attributed to the measurement device or other environmental factors. Batch correction is typically a critical precursor to any subsequent analysis and processing.

Three batch effect removal tasks are examined. The first task involves breast cancer (BC) gene expression data. We consider two publicly available datasets: METABRIC [8] and TCGA [26], consisting of samples from five breast cancer subtypes. The batch effect stems from different profiling techniques: gene expression microarray and RNA sequencing. In the second task, three cohorts of lung cancer (LC) gene expression data [21] are considered, consisting of samples from three lung cancer subtypes. The data were collected using gene expression microarrays at three different sites (a likely source of batch effects): Stanford University (ST), University of Michigan (UM), and the Dana-Farber Cancer Institute (D-F). The last task involves CyTOF data [48] consisting of peripheral blood mononuclear cells (PBMCs) collected from two multiple sclerosis patients on four days: two days before treatment (BT) and two days after treatment (AT). These $8 = 2 \times 2 \times 2$ batches were collected with or without PMA/ionomycin-stimulated PBMCs. We aim to remove the batch effects between two different days from the same condition (BT/AT) and from the same patient.

In each batch removal task, we first learn an embedding of the data from all the batches into the Lorentz model $\mathbb{L}^d$ [42]. Then, HPA is applied to the embedded points in $\mathbb{L}^d$.
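As an illustration of the simulation protocol of Section 4.1, the following sketch draws a wrapped-normal set, creates a noisy copy with a simplified perturbation (a stand-in for the generators of $Q^{(2)}$-$Q^{(4)}$, not the paper's exact construction), aligns the pair with the pipeline above, and reports the pairwise discrepancy $\varepsilon$.

```python
def sample_wrapped_normal(mu, Sigma, n, rng):
    # Sec. 2 / [39]: v' ~ N(0, Sigma), lift to T_{mu0} L^d, PT to mu, Exp onto L^d
    d = Sigma.shape[0]
    mu0_ = np.concatenate([[1.0], np.zeros(d)])
    V = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    return np.stack([exp_map(mu, parallel_transport(mu0_, mu,
                     np.concatenate([[0.0], v0]))) for v0 in V])

def pairwise_discrepancy(Q1, Q2_aligned):
    # eps = (1/N) sum_i d_L(q1_i, aligned q2_i), using the hidden correspondence
    return float(np.mean([geodesic_dist(a, b) for a, b in zip(Q1, Q2_aligned)]))

rng = np.random.default_rng(1)
d = 3
mu0_ = np.concatenate([[1.0], np.zeros(d)])
mu = exp_map(mu0_, np.concatenate([[0.0], rng.normal(size=d)]))
Q1 = sample_wrapped_normal(mu, np.eye(d), 100, rng)
# Simplified perturbation: small tangent noise transported to each point
noise = lambda q: parallel_transport(mu0_, q,
                  0.05 * np.concatenate([[0.0], rng.normal(size=d)]))
Q2 = np.stack([exp_map(q, noise(q)) for q in Q1])
A1, A2 = hyperbolic_procrustes([Q1, Q2])
print(pairwise_discrepancy(A1, A2))
```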
[Figure 3: Visualization of HPA applied to the two breast cancer gene expression datasets, before (left) and after (right) alignment.]

Table 1: The $k$-NN AUC-ROC obtained with the best $k$ for each method.

| Datasets | Space | S-Baseline | Baseline | HPA | RT | HOT-F | HOT-L | HOT-ME |
|---|---|---|---|---|---|---|---|---|
| BC | $\mathbb{L}^{10}$ | 0.7716 ± 0.0088 | 0.5254 ± 0.0091 | 0.7410 ± 0.0354 | 0.6001 ± 0.0658 | 0.5838 ± 0.0121 | 0.4594 ± 0.0428 | 0.5600 ± 0.0147 |
| LC | $\mathbb{L}^{20}$ | 0.9137 ± 0.0532 | 0.5521 ± 0.0991 | 0.8316 ± 0.0904 | 0.5318 ± 0.0370 | 0.5400 ± 0.0396 | 0.5150 ± 0.0654 | 0.4901 ± 0.0516 |
| P1 BT | $\mathbb{L}^{20}$ | 0.9723 ± 0.0058 | 0.6646 ± 0.1556 | 0.9401 ± 0.0068 | 0.9020 ± 0.0086 | 0.8603 ± 0.0142 | 0.8855 ± 0.0057 | 0.8919 ± 0.0089 |
| P1 AT | $\mathbb{L}^{20}$ | 0.9590 ± 0.0150 | 0.7656 ± 0.1564 | 0.9329 ± 0.0011 | 0.8270 ± 0.0880 | 0.8469 ± 0.0873 | 0.8914 ± 0.0175 | 0.8848 ± 0.0306 |
| P2 BT | $\mathbb{L}^{20}$ | 0.9686 ± 0.0114 | 0.6971 ± 0.1335 | 0.9329 ± 0.0186 | 0.8830 ± 0.0142 | 0.8762 ± 0.0215 | 0.8566 ± 0.1720 | 0.8810 ± 0.0040 |
| P2 AT | $\mathbb{L}^{20}$ | 0.9045 ± 0.0070 | 0.5688 ± 0.0688 | 0.8453 ± 0.0798 | 0.7190 ± 0.0439 | 0.7012 ± 0.0912 | 0.7136 ± 0.0012 | 0.7291 ± 0.0113 |

Table 2: The mean and standard deviation of the MMD computed over ten random subsets (of size 300 for BC, 15 for LC, and 1000 for CyTOF) from the two datasets.

| Datasets | Space | Baseline | HPA | RT | HOT-F | HOT-L | HOT-ME |
|---|---|---|---|---|---|---|---|
| BC | $\mathbb{L}^{10}$ | 0.2089 ± 0.0027 | 0.0013 ± 0.0004 | 0.0072 ± 0.0011 | 0.0009 ± 0.0001 | 0.0019 ± 0.0004 | 0.0007 ± 0.0002 |
| ST&UM | $\mathbb{L}^{20}$ | 0.1072 ± 0.0051 | 0.0162 ± 0.0048 | 0.0250 ± 0.0049 | 0.0023 ± 0.0002 | 0.0014 ± 0.0007 | 0.0011 ± 0.0006 |
| ST&D-F | $\mathbb{L}^{20}$ | 0.3213 ± 0.0152 | 0.0122 ± 0.0042 | 0.0150 ± 0.0106 | 0.0019 ± 0.0002 | 0.0010 ± 0.0004 | 0.0015 ± 0.0003 |
| UM&D-F | $\mathbb{L}^{20}$ | 0.0790 ± 0.0071 | 0.0168 ± 0.0090 | 0.0168 ± 0.0032 | 0.0017 ± 0.0003 | 0.0012 ± 0.0006 | 0.0029 ± 0.0005 |
| P1 BT | $\mathbb{L}^{20}$ | 0.0638 ± 0.0024 | 0.0012 ± 0.0002 | 0.0020 ± 0.0002 | 0.0004 ± 0.0002 | 0.0002 ± 0.0001 | 0.0009 ± 0.0002 |
| P1 AT | $\mathbb{L}^{20}$ | 0.0598 ± 0.0014 | 0.0006 ± 0.0001 | 0.0015 ± 0.0001 | 0.0003 ± 0.0001 | 0.0001 ± 0.0001 | 0.0002 ± 0.0001 |
| P2 BT | $\mathbb{L}^{20}$ | 0.0424 ± 0.0021 | 0.0012 ± 0.0001 | 0.0015 ± 0.0001 | 0.0002 ± 0.0001 | 0.0020 ± 0.0008 | 0.0009 ± 0.0003 |
| P2 AT | $\mathbb{L}^{20}$ | 0.0758 ± 0.0053 | 0.0011 ± 0.0002 | 0.0013 ± 0.0002 | 0.0002 ± 0.0001 | 0.0015 ± 0.0008 | 0.0007 ± 0.0003 |

Fig. 3 shows a visualization of the embedding of the two breast cancer datasets before and after HPA. For visualization, we project the points in $\mathbb{L}^3$ to the 3D Poincaré ball. Before the alignment, the dominant factor separating the patients' samples (points) is the batch. In contrast, after the alignment, the batch effect is substantially suppressed (visually), and the factors separating the points are dominated by the cancer subtype.

We evaluate the quality of the alignment in two respects using objective measures: (i) $k$-NN classification, with leave-one-batch-out cross-validation, is utilized for assessing the alignment of the intrinsic structure, and (ii) MMD [19] is used for assessing the distribution alignment quality. For the classification, we view the five subtypes of BC, the three subtypes of LC, and the presence of stimulated cells in CyTOF as the labels in the respective tasks. In addition to the results of the different alignment methods, we report the $k$-NN classification based only on a single batch (S-Baseline), which indicates the adequacy of the representation in hyperbolic space for the task at hand. Table 1 reports the $k$-NN classification obtained for the best $k$ per method, and Table 2 shows the MMD. In each task, we set the dimension $d$ of the Lorentz model to the dimension in which the best empirical single-batch performance (S-Baseline) is obtained. We note that similar results and trends are obtained for various dimensions. Additional results for various $k$ values, as well as an ablation study showing that the combination of all three components yields the best classification results, are reported in Appendix D. Although the OT-based methods obtain the best matching between the distributions of the batches, HPA outperforms them in all three tasks in terms of classification (see Table 1).
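For reference, a minimal MMD estimate in the spirit of Table 2 is sketched below. The Gaussian kernel on geodesic distances and the bandwidth are our own choices (the paper does not specify the kernel in this section), and this kernel is not guaranteed to be positive definite on $\mathbb{L}^d$.

```python
def mmd2(X, Y, sigma=1.0):
    # Biased (V-statistic) squared MMD with k(a, b) = exp(-d_L(a, b)^2 / (2 sigma^2))
    def gram(A, B):
        return np.array([[np.exp(-geodesic_dist(a, b)**2 / (2.0 * sigma**2))
                          for b in B] for a in A])
    return float(gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean())

# e.g., mmd2(A1, A2) for the aligned simulated sets from the previous snippet
```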
In the two gene expression tasks, where the data have multiple labels and we align multiple batches, the advantage of HPA compared to the other methods is particularly significant. In Appendix D, we demonstrate HPA's out-of-sample capabilities on the CyTOF data by learning the batch correction map between the different days from one patient and applying it to the data of the other patient.

### 4.3 Discussion

Alignment methods based on density matching, such as OT-based methods, often overlook an important aspect of purely unsupervised settings. Although sample density is the main data property that can be, and needs to be, aligned, preserving the intrinsic structure/geometry of the sets is important as well, as it might be tightly related to the hidden labels. Indeed, we see in our experiments that OT-based methods provide a good density alignment (reducing the inter-set variability), as demonstrated by small MMD values (see Table 2). However, the intrinsic structure of the sets (the intra-set variability) is not preserved, as evident from the resulting poor (hidden) label matching, conveyed by the $k$-NN classification performance (see Table 1). This is also illustrated in the right panel of Fig. 1: the three OT-based methods provide a good global alignment of the sets, yet the intrinsic structure is not kept, as implied by the poor color matching.

In contrast to OT-based methods, HPA does not explicitly aim to match densities, and thus it obtains slightly worse MMD performance compared to OT-based methods. However, HPA matches the first two moments of the density and includes the rotation component, whose absence was shown to be one of the fundamental limitations of OT-based methods for alignment, as OT cannot recover volume-preserving maps [4, 36]. As seen in the simulation and experimental results and illustrated in the right panel of Fig. 1, we still obtain a good global alignment and simultaneously preserve the intrinsic structure, allowing for high classification performance. We remark that in the synthetic examples, there is a (hidden) one-to-one correspondence between the sets, and therefore a one-to-one discrepancy can be computed (instead of, or in addition to, MMD). When there is such a correspondence, OT still cannot recover volume-preserving maps, while HPA can mitigate noise and distortions.

## 5 Conclusion

We introduced HPA for label-free alignment of data in the Lorentz model. Based on Riemannian geometry, we presented new translation and scaling operations that align the first and second Riemannian moments, as well as a wrapped rotation that aligns the orientation in the hyperboloid model. Our theoretical analysis provides further insight and highlights properties that may be useful for practitioners. We empirically showed in simulations that HPA is stable under noise and distortions, and we demonstrated purely unsupervised batch correction on multiple bioinformatics datasets with multiple labels. Beyond alignment and batch effect removal, our method can be viewed as a type of domain adaptation or a precursor of transfer learning that relies on purely geometric considerations, exploiting the geometric structure of the data as well as the geometric properties of the space of the data. In addition, it can be utilized for multimodal data fusion and geometric registration of shapes with hierarchical structure.
## Acknowledgments and Disclosure of Funding

We thank the reviewers for their important comments and suggestions, and we thank Thomas Dagès for the helpful discussion. The work of YEL and RT was supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 802735-ERC-DIFFOP. The work of YK was supported by the National Institutes of Health grants R01GM131642, UM1PA05141, P50CA121974, and U01DA053628.

## References

[1] B. Afsari. Riemannian $L^p$ center of mass: existence, uniqueness, and convexity. Proceedings of the American Mathematical Society, 139(2):655–673, 2011.
[2] D. Alvarez-Melis, Y. Mroueh, and T. Jaakkola. Unsupervised hierarchy matching with optimal transport over hyperbolic spaces. In International Conference on Artificial Intelligence and Statistics, pages 1606–1617. PMLR, 2020.
[3] Y. Brenier. Décomposition polaire et réarrangement monotone des champs de vecteurs. CR Acad. Sci. Paris Sér. I Math., 305:805–808, 1987.
[4] Y. Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375–417, 1991.
[5] R. Chakraborty, D. Seo, and B. C. Vemuri. An efficient exact-PGA algorithm for constant curvature manifolds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3976–3984, 2016.
[6] I. Chami, Z. Ying, C. Ré, and J. Leskovec. Hyperbolic graph convolutional neural networks. Advances in Neural Information Processing Systems, 32:4868–4879, 2019.
[7] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2016.
[8] C. Curtis, S. P. Shah, S.-F. Chin, G. Turashvili, O. M. Rueda, M. J. Dunning, D. Speed, A. G. Lynch, S. Samarajiwa, Y. Yuan, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403):346–352, 2012.
[9] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26:2292–2300, 2013.
[10] J. Euzenat and P. Shvaiko. Ontology matching: State of the art and future challenges. IEEE Transactions on Knowledge and Data Engineering, 25(1):158–176, 2013.
[11] L. Falorsi, P. de Haan, T. R. Davidson, and P. Forré. Reparameterizing distributions on Lie groups. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3244–3253. PMLR, 2019.
[12] R. Flamary and N. Courty. POT: Python optimal transport library. GitHub: https://github.com/rflamary/POT, 2017.
[13] M. Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. In Annales de l'institut Henri Poincaré, volume 10, pages 215–310, 1948.
[14] O. Ganea, G. Bécigneul, and T. Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. In International Conference on Machine Learning, pages 1646–1655. PMLR, 2018.
[15] O. Ganea, G. Bécigneul, and T. Hofmann. Hyperbolic neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 5350–5360, 2018.
[16] I. Gohberg, P. Lancaster, and L. Rodman. Matrices and Indefinite Scalar Products. 1983.
[17] J. C. Gower. Generalized Procrustes analysis. Psychometrika, 40(1):33–51, 1975.
[18] J. C. Gower, G. B. Dijksterhuis, et al. Procrustes Problems, volume 30. Oxford University Press on Demand, 2004.
[19] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
[20] D. Groisser. Newton's method, zeroes of vector fields, and the Riemannian center of mass. Advances in Applied Mathematics, 33(1):95–135, 2004.
[21] D. N. Hayes, S. Monti, G. Parmigiani, C. B. Gilks, K. Naoki, A. Bhattacharjee, M. A. Socinski, C. Perou, and M. Meyerson. Gene expression profiling reveals reproducible human lung adenocarcinoma subtypes in multiple independent patient cohorts. Journal of Clinical Oncology, 24(31):5079–5090, 2006.
[22] A. Hoyos-Idrobo. Aligning hyperbolic representations: an optimal transport-based approach. arXiv preprint arXiv:2012.01089, 2020.
[23] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.
[24] H. Karcher. Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30(5):509–541, 1977.
[25] A. Klimovskaia, D. Lopez-Paz, L. Bottou, and M. Nickel. Poincaré maps for analyzing complex hierarchies in single-cell data. Nature Communications, 11(1):1–9, 2020.
[26] D. Koboldt, R. Fulton, M. McLellan, H. Schmidt, J. Kalicki-Veizer, J. McMichael, L. Fulton, D. Dooling, L. Ding, E. Mardis, et al. Comprehensive molecular portraits of human breast tumours. Nature, 490(7418):61–70, 2012.
[27] D. Kohli, A. Cloninger, and G. Mishne. LDLE: Low distortion local eigenmaps. In ICLR 2021 Workshop on Geometrical and Topological Representation Learning, 2021.
[28] I. Korsunsky, N. Millard, J. Fan, K. Slowikowski, F. Zhang, K. Wei, Y. Baglaenko, M. Brenner, P.-R. Loh, and S. Raychaudhuri. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 16(12):1289–1296, 2019.
[29] A. Kumagai and T. Iwata. Unsupervised domain adaptation by matching distributions based on the maximum mean discrepancy via unilateral transformations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4106–4113, 2019.
[30] M. Law, R. Liao, J. Snell, and R. Zemel. Lorentzian distance learning for hyperbolic representations. In International Conference on Machine Learning, pages 3672–3681. PMLR, 2019.
[31] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[32] Q. Liu, M. Nickel, and D. Kiela. Hyperbolic graph neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 8228–8239, 2019.
[33] A. Lou, I. Katsman, Q. Jiang, S. Belongie, S.-N. Lim, and C. De Sa. Differentiating through the Fréchet mean. In International Conference on Machine Learning (ICML), pages 6393–6403. PMLR, 2020.
[34] B. Luo and E. Hancock. Feature matching with Procrustes alignment and graph editing. 1999.
[35] H. Maron, N. Dym, I. Kezurer, S. Kovalsky, and Y. Lipman. Point registration via efficient convex relaxation. ACM Transactions on Graphics (TOG), 35(4):1–12, 2016.
[36] R. J. McCann. Polar factorization of maps on Riemannian manifolds. Geometric & Functional Analysis GAFA, 11(3):589–608, 2001.
[37] H. H. Milioli, R. Vimieiro, I. Tishchenko, C. Riveros, R. Berretta, and P. Moscato. Iteratively refining breast cancer intrinsic subtypes in the METABRIC dataset. BioData Mining, 9(1):1–8, 2016.
[38] G. Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences de Paris, 1781.
[39] Y. Nagano, S. Yamaguchi, Y. Fujita, and M. Koyama. A wrapped normal distribution on hyperbolic space for gradient-based learning. In International Conference on Machine Learning (ICML), pages 4693–4702. PMLR, 2019.
[40] I. Naim, Y. Song, Q. Liu, H. Kautz, J. Luo, and D. Gildea. Unsupervised alignment of natural language instructions with video segments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.
[41] M. Nickel and D. Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems (NIPS), pages 6341–6350, 2017.
[42] M. Nickel and D. Kiela. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In International Conference on Machine Learning (ICML), pages 3779–3788. PMLR, 2018.
[43] V. Nygaard, E. A. Rødland, and E. Hovig. Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses. Biostatistics, 17(1):29–39, 2016.
[44] X. Pennec. Barycentric subspace analysis on manifolds. The Annals of Statistics, 46(6A):2711–2746, 2018.
[45] G. Peyré and M. Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
[46] J. G. Ratcliffe, S. Axler, and K. Ribet. Foundations of Hyperbolic Manifolds, volume 149. Springer, 1994.
[47] P. L. C. Rodrigues, C. Jutten, and M. Congedo. Riemannian Procrustes analysis: Transfer learning for brain-computer interfaces. IEEE Transactions on Biomedical Engineering, 66(8):2390–2401, 2018.
[48] U. Shaham, K. P. Stanton, J. Zhao, H. Li, K. Raddassi, R. Montgomery, and Y. Kluger. Removal of batch effects using distribution-matching residual networks. Bioinformatics, 33(16):2539–2546, 2017.
[49] D. Spohr, L. Hollink, and P. Cimiano. A machine learning approach to multilingual and cross-lingual ontology matching. In International Semantic Web Conference, pages 665–680. Springer, 2011.
[50] T. Stuart, A. Butler, P. Hoffman, C. Hafemeister, E. Papalexi, W. M. Mauck III, Y. Hao, M. Stoeckius, P. Smibert, and R. Satija. Comprehensive integration of single-cell data. Cell, 177(7):1888–1902, 2019.
[51] P. Tabaghi and I. Dokmanic. On Procrustes analysis in hyperbolic space. arXiv preprint arXiv:2102.03723, 2021.
[52] C. Wang and S. Mahadevan. Manifold alignment using Procrustes analysis. In International Conference on Machine Learning (ICML), pages 1120–1127, 2008.
[53] O. Yair, M. Ben-Chen, and R. Talmon. Parallel transport on the cone manifold of SPD matrices for domain adaptation. IEEE Transactions on Signal Processing, 67(7):1797–1811, 2019.
[54] O. Yair, F. Dietrich, R. Talmon, and I. G. Kevrekidis. Domain adaptation with optimal transport on the manifold of SPD matrices. arXiv e-prints, 2019.