# ReSync: Riemannian Subgradient-based Robust Rotation Synchronization

Huikang Liu (School of Information Management and Engineering, Shanghai University of Finance and Economics, liuhuikang@shufe.edu.cn), Xiao Li (School of Data Science, The Chinese University of Hong Kong, Shenzhen, lixiao@cuhk.edu.cn), Anthony Man-Cho So (Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, manchoso@se.cuhk.edu.hk)

This work presents ReSync, a Riemannian subgradient-based algorithm for solving the robust rotation synchronization problem, which arises in various engineering applications. ReSync solves a least-unsquared minimization formulation over the rotation group, which is nonsmooth and nonconvex, and aims at recovering the underlying rotations directly. We provide strong theoretical guarantees for ReSync under the random corruption setting. Specifically, we first show that the initialization procedure of ReSync yields a proper initial point that lies in a local region around the ground-truth rotations. We next establish the weak sharpness property of the aforementioned formulation and then utilize this property to derive the local linear convergence of ReSync to the ground-truth rotations. By combining these guarantees, we conclude that ReSync converges linearly to the ground-truth rotations under appropriate conditions. Experimental results demonstrate the effectiveness of ReSync.

1 Introduction

Rotation synchronization (RS) is a fundamental problem in many engineering applications. For instance, RS (also known as rotation averaging) is an important subproblem of structure from motion (SfM) and simultaneous localization and mapping (SLAM) in computer vision [20, 22, 17], where the goal is to compute the absolute orientations of objects from relative rotations between pairs of objects. RS has also been applied to sensor network localization [41, 12], signal recovery from phaseless observations [2], digital communications [36], and cryo-EM imaging [35, 33]. Practical measurements of relative rotations are often incomplete and corrupted, leading to the problem of robust rotation synchronization (RRS) [28, 21, 22, 39, 9, 31]. The goal of RRS is to reconstruct a set of ground-truth rotations $X_1^*, \dots, X_i^*, \dots, X_n^* \in \mathrm{SO}(d)$ from measurements of relative rotations represented as

$$
Y_{ij} = \begin{cases}
X_i^* X_j^{*\top}, & (i,j) \in A, \ \text{with ratio } pq, \\
O_{ij}, & (i,j) \in E \setminus A, \ \text{with ratio } (1-p)q, \\
0, & (i,j) \in E^c, \ \text{otherwise.}
\end{cases}
\tag{1}
$$

Here, $\mathrm{SO}(d) := \{R \in \mathbb{R}^{d \times d} : R^\top R = I, \ \det(R) = 1\}$ denotes the rotation group (also known as the special orthogonal group), $E$ represents the indices of all available observations, $A$ denotes the indices of true observations, $A^c := E \setminus A$ is the set of indices of outliers, $O_{ij} \in \mathrm{SO}(d)$ is an outlying observation, and the missing observations are set to $0$ by convention; see, e.g., [23, Section 2.1]. We use $q \in (0,1)$ to denote the observation ratio and $p \in (0,1)$ to denote the ratio of true observations.
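The measurement model (1) under the RCM (Section 2.2) is straightforward to simulate. Below is a minimal Python sketch, assuming NumPy; the helper names (`random_rotation`, `generate_rcm`) are illustrative and not part of the authors' released code.

```python
import numpy as np

def random_rotation(d, rng):
    """Draw a rotation uniformly from SO(d) via QR of a Gaussian matrix."""
    A = rng.standard_normal((d, d))
    Q, R = np.linalg.qr(A)
    Q = Q @ np.diag(np.sign(np.diag(R)))   # fix signs to make the QR factor unique
    if np.linalg.det(Q) < 0:               # flip a column to land in SO(d)
        Q[:, -1] *= -1
    return Q

def generate_rcm(n, d, p, q, seed=0):
    """Ground-truth rotations X*_i and observations Y_ij following model (1)."""
    rng = np.random.default_rng(seed)
    X_true = [random_rotation(d, rng) for _ in range(n)]
    Y = np.zeros((n * d, n * d))
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < q:                       # edge observed, (i, j) in E
                if rng.random() < p:                   # true observation, (i, j) in A
                    Yij = X_true[i] @ X_true[j].T
                else:                                  # outlier, uniform on SO(d)
                    Yij = random_rotation(d, rng)
                Y[i*d:(i+1)*d, j*d:(j+1)*d] = Yij
                Y[j*d:(j+1)*d, i*d:(i+1)*d] = Yij.T    # symmetric block structure
    return X_true, Y
```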
Related works. Due to the vast amount of research in this field, our overview will necessarily focus on theoretical investigations of RS. In the case where no outliers exist in the measurement model (1), i.e., $p = 1$, a natural formulation is to minimize the smooth least-squares function $\sum_{(i,j) \in E} \|X_i X_j^\top - Y_{ij}\|_F^2$ over $X_i \in \mathrm{SO}(d)$, $1 \le i \le n$. Spectral relaxation and semidefinite relaxation (SDR) are typical approaches for addressing this problem [34, 3, 6, 5, 4, 30], and they come with strong recovery guarantees. However, these results cannot be directly applied to the corrupted model (1) due to the existence of outliers (i.e., $p < 1$) and the sensitivity of the least-squares solution to outlying observations.

Theoretical understanding of RRS is still rather limited. One typical setting for the theoretical analysis of RRS is the random corruption model (RCM); see Section 2.2. The work [39] introduces a least-unsquared formulation and applies the SDR method to tackle it. Under the RCM and in the full observation case where $q = 1$, it is shown that the minimizer of the SDR reformulation exactly recovers the underlying Gram matrix (hence the ground-truth rotations) under the conditions that the true observation ratio satisfies $p \ge 0.46$ for $\mathrm{SO}(2)$ (and $p \ge 0.49$ for $\mathrm{SO}(3)$) and $n \to \infty$. In [23], the authors established the relationship between cycle-consistency and exact recovery and introduced a message-passing algorithm. Their method is tailored to find the corruption levels in the graph, rather than recovering the ground-truth rotations directly. They provided linear convergence guarantees for their algorithm once the ratios satisfy $p^8 q^2 = \Omega(\log n / n)$ under the RCM. However, it is unclear how this message-passing algorithm is related to other optimization procedures for solving the problem. Let us mention that they also provided guarantees for other compact groups and corruption settings. Following partly the framework established in [23], the work [32] presents an interesting nonconvex quadratic programming formulation of RRS. It is shown that the global minimizer of the nonconvex formulation recovers the true corruption level (still not the ground-truth rotations directly) when $p^2 q^2 = \Omega(\log n / n)$ under the RCM. Unfortunately, that work does not provide a concrete algorithm that provably finds a global minimizer of the nonconvex formulation. In [29], the authors introduced and analyzed a depth descent algorithm for recovering the underlying rotation matrices. In the context of the RCM, they showed asymptotic convergence of their algorithm to the underlying rotations without providing a specific rate. The result is achieved under the conditions that the algorithm is initialized near $X^*$, $q = \Omega(\log n / n)$, and $p \ge 1 - 1/(d(d-1) + 2)$. The latter requirement translates to $p \ge 3/4$ for $\mathrm{SO}(2)$ and $p \ge 7/8$ for $\mathrm{SO}(3)$. It is important to note, however, that the primary focus of their research lies in the adversarial corruption setup rather than the RCM.

Main contributions. Towards tackling the RRS problem under the measurement model (1), we consider the following least-unsquared formulation, which was introduced in [39] as the initial step for applying the SDR method:

$$
\underset{X \in \mathbb{R}^{nd \times d}}{\operatorname{minimize}} \; f(X) := \sum_{(i,j) \in E} \| X_i X_j^\top - Y_{ij} \|_F \quad \text{subject to} \quad X_i \in \mathrm{SO}(d), \; 1 \le i \le n.
\tag{2}
$$

Note that this problem is nonsmooth and nonconvex due to the unsquared Frobenius-norm loss and the rotation group constraint, respectively. We design a Riemannian Subgradient synchronization algorithm (ReSync) for addressing problem (2); see Algorithm 1. ReSync first calls an initialization procedure named SpectrIn (see Algorithm 2), which is a spectral relaxation method. It then implements an iterative Riemannian subgradient procedure. ReSync aims to directly recover the ground-truth rotations $X^* \in \mathrm{SO}(d)^n$ rather than the Gram matrix or the corruption levels.
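For concreteness, here is a minimal sketch of evaluating the least-unsquared objective $f$ in (2), assuming the block layout produced by `generate_rcm` above, where zero blocks mark unobserved pairs; the function name is illustrative.

```python
import numpy as np

def resync_objective(X_blocks, Y, d):
    """Least-unsquared loss f(X) of (2): sum of ||X_i X_j^T - Y_ij||_F over observed pairs.

    X_blocks: list of n rotation matrices; Y: (nd x nd) observation matrix whose
    zero blocks encode missing edges, as in model (1)."""
    n = len(X_blocks)
    val = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            Yij = Y[i*d:(i+1)*d, j*d:(j+1)*d]
            if np.any(Yij):                       # (i, j) in E: observed block
                val += np.linalg.norm(X_blocks[i] @ X_blocks[j].T - Yij, 'fro')
    return val
```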
Under the RCM (see Section 2.2), we provide the following strong theoretical guarantees for ReSync:

(S.1) Initialization. The first step of ReSync is to call SpectrIn for computing the initial point $X^0$. Theoretically, we establish that $X^0$ can be relatively close to $X^*$ depending on $p$ and $q$; see Theorem 2.

(S.2) Weak sharpness. We then establish a problem-intrinsic property of the formulation (2) called weak sharpness; see Theorem 3. This property characterizes the geometry of problem (2) and is of independent interest.

(S.3) Convergence analysis. Finally, we derive the local linear rate of convergence of ReSync based on the established weak sharpness property; see Theorem 4.

The main idea is that the weak sharpness property in (S.2) helps to show linear convergence of ReSync to $X^*$ in (S.3). However, this result only holds locally. Thus, we need the initialization guarantee in (S.1) to initialize our algorithm in this local region and then argue that it will not leave this region once initialized. We refer to Sections 3.1 to 3.3 for more technical challenges and our proof ideas. Combining the above theoretical results yields our overall guarantee: ReSync converges linearly to the ground-truth rotations $X^*$ when $p^7 q^2 = \Omega(\log n / n)$; see Theorem 1.

Notation. Our notation is mostly standard. We use $\mathbb{R}^{nd \times d} \ni X = (X_1; \dots; X_n) \in \mathrm{SO}(d)^n$ to represent the Cartesian product of all the variables $X_i \in \mathrm{SO}(d)$, $1 \le i \le n$. The same applies to the ground-truth rotations $X^* = (X_1^*; \dots; X_n^*)$. Let $E_i = \{j \mid (i,j) \in E\}$, $A_i = \{j \mid (i,j) \in A\}$, and $A_i^c = E_i \setminus A_i$. We also define $A_{ij} = A_i \cap A_j$ for simplicity. For a set $S$, we use $|S|$ to denote its cardinality. For any matrices $X, Y \in \mathbb{R}^{nd \times d}$, we define the following distance up to a global rotation:

$$
\mathrm{dist}(X, Y) = \|X R^* - Y\|_F, \quad \text{where } R^* = \underset{R \in \mathrm{SO}(d)}{\arg\min}\, \|XR - Y\|_F^2 = \mathcal{P}_{\mathrm{SO}(d)}(X^\top Y).
$$

Besides, we introduce the following distances up to the global rotation $R^*$ defined above:

$$
\mathrm{dist}_1(X, Y) = \sum_{i=1}^n \|X_i R^* - Y_i\|_F, \qquad \mathrm{dist}_\infty(X, Y) = \max_{1 \le i \le n} \|X_i R^* - Y_i\|_F.
$$

2 Algorithm and Setup

2.1 ReSync: Algorithm Development

In this subsection, we present ReSync for tackling the nonsmooth nonconvex formulation (2); see Algorithm 1. Our algorithm has two main parts, i.e., initialization and an iterative Riemannian subgradient procedure.

Initialization. ReSync first calls a procedure SpectrIn (see Algorithm 2) for initialization. SpectrIn is a spectral relaxation-based initialization technique. It computes the $d$ leading unit eigenvectors of the data matrix to form $\Phi \in \mathbb{R}^{nd \times d}$. We multiply those eigenvectors by $\sqrt{n}$ to ensure that their norm matches that of an element of $\mathrm{SO}(d)^n$. We also construct $\Psi$, which reverses the sign of the last column of $\Phi$ so that the determinants of the blocks of $\Phi$ and $\Psi$ differ by a sign.

Figure 1: The average over 100 simulations of the initial distance $\mathrm{dist}(X^0, X^*)$ computed by Algorithm 2 versus naive spectral initialization (i.e., outputting $X^0 = \widetilde{\Phi}$ directly) with $p = 0.2$, $q = 0.2$, and $d = 3$.

Then, we compute the projections of $\Phi$ and $\Psi$ onto $\mathrm{SO}(d)^n$. The projection is computed in a block-wise manner, namely $\widetilde{\Phi}_i = \mathcal{P}_{\mathrm{SO}(d)}(\Phi_i)$, $1 \le i \le n$, where $\Phi_i, \widetilde{\Phi}_i \in \mathbb{R}^{d \times d}$ are the $i$-th blocks of $\Phi$ and $\widetilde{\Phi}$, respectively. The projection can be explicitly evaluated as

$$
\widetilde{\Phi}_i = \begin{cases} P_i Q_i^\top, & \text{if } \det(\Phi_i) > 0, \\ \widehat{P}_i Q_i^\top, & \text{otherwise}, \end{cases} \qquad 1 \le i \le n.
$$

Here, $P_i, Q_i \in \mathbb{R}^{d \times d}$ are the left and right singular vectors of $\Phi_i$ (with singular values in descending order), respectively, and $\widehat{P}_i$ is obtained by reversing the sign of the last column of $P_i$. The initial point $X^0$ is chosen as $\widetilde{\Phi}$ or $\widetilde{\Psi}$, depending on whether $\Phi$ or $\Psi$ is closer to $\mathrm{SO}(d)^n$.
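A minimal sketch of the block-wise projection onto $\mathrm{SO}(d)$ and of a SpectrIn-style initialization as just described, assuming a symmetric data matrix $Y$ as above; helper names are illustrative.

```python
import numpy as np

def project_so_d(M):
    """Project a d x d matrix onto SO(d): P diag(1,...,1, det(P Q^T)) Q^T from the SVD M = P S Q^T."""
    P, _, Qt = np.linalg.svd(M)
    D = np.eye(M.shape[0])
    D[-1, -1] = np.sign(np.linalg.det(P @ Qt))   # flip the last direction if det < 0
    return P @ D @ Qt

def spectral_init(Y, n, d):
    """Spectral initialization in the spirit of SpectrIn (Algorithm 2): d leading
    eigenvectors of Y, scaled by sqrt(n), then projected block-wise onto SO(d)."""
    vals, vecs = np.linalg.eigh(Y)
    Phi = np.sqrt(n) * vecs[:, -d:]              # d leading eigenvectors of symmetric Y
    Psi = Phi.copy()
    Psi[:, -1] *= -1                             # reverse the sign of the last column
    def proj_blocks(Z):
        return np.vstack([project_so_d(Z[i*d:(i+1)*d, :]) for i in range(n)])
    Phi_t, Psi_t = proj_blocks(Phi), proj_blocks(Psi)
    # keep whichever of Phi, Psi is closer to its projection onto SO(d)^n
    return Phi_t if np.linalg.norm(Phi_t - Phi) <= np.linalg.norm(Psi_t - Psi) else Psi_t
```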
Let us mention that the computation of $\widetilde{\Psi}$ and Steps 5-9 in SpectrIn can practically improve the approximation error $\mathrm{dist}(X^0, X^*)$. We demonstrate this phenomenon in Figure 1, in which "Naive SpectrIn" refers to outputting $X^0 = \widetilde{\Phi}$ directly in Algorithm 2.

Riemannian subgradient update. ReSync then implements an iterative Riemannian subgradient procedure after obtaining the initial point $X^0$. The key is to compute the search direction (Riemannian subgradient) $\widetilde{\nabla}_R f(X_i^k)$ and the retraction $\mathrm{Retr}_{X_i^k}(\cdot)$ onto $\mathrm{SO}(d)$ for $1 \le i \le n$. Towards providing concrete formulas for the Riemannian subgradient update, let us impose the Euclidean inner product $\langle A, B \rangle = \mathrm{trace}(A^\top B)$ as the inherent Riemannian metric. Consequently, the tangent space to $\mathrm{SO}(d)$ at $R \in \mathrm{SO}(d)$ is given by $T_R := \{RS : S \in \mathbb{R}^{d \times d},\ S + S^\top = 0\}$. The Riemannian subgradient $\widetilde{\nabla}_R f(X_i)$ can be computed as [40, Theorem 5.1]

$$
\widetilde{\nabla}_R f(X_i) = \mathcal{P}_{T_{X_i}}\big(\widetilde{\nabla} f(X_i)\big), \quad 1 \le i \le n,
\tag{3}
$$

where the projection can be computed as $\mathcal{P}_{T_{X_i}}(B) = X_i\big(X_i^\top B - B^\top X_i\big)/2$ for any $B \in \mathbb{R}^{d \times d}$ and $\widetilde{\nabla} f(X_i)$ is a Euclidean subgradient of $f$ with respect to the $i$-th block variable $X_i$. Let us define $f_{i,j}(X) := \|X_i X_j^\top - Y_{ij}\|_F$. The Euclidean subdifferential $\partial f(X_i)$ with respect to the block variable $X_i$ is given by $\partial f(X_i) = 2 \sum_{j:(i,j) \in E} \partial f_{i,j}(X_i)$, with

$$
\partial f_{i,j}(X_i) = \begin{cases} \dfrac{X_i - Y_{ij} X_j}{\|X_i X_j^\top - Y_{ij}\|_F}, & \text{if } \|X_i X_j^\top - Y_{ij}\|_F \neq 0, \\[2mm] \big\{V \in \mathbb{R}^{d \times d} : \|V\|_F \le 1\big\}, & \text{otherwise.} \end{cases}
$$

Algorithm 1 ReSync: Riemannian Subgradient Synchronization
Require: Initialize $X^0 = \mathrm{SpectrIn}(Y)$ (Algorithm 2), where $Y \in \mathbb{R}^{nd \times nd}$ and its $(i,j)$-th block is $Y_{ij} \in \mathbb{R}^{d \times d}$;
1: Set iteration count $k = 0$;
2: while stopping criterion not met do
3: Update the step size $\mu_k$;
4: Riemannian subgradient update: $X_i^{k+1} = \mathrm{Retr}_{X_i^k}\big(-\mu_k \widetilde{\nabla}_R f(X_i^k)\big)$ for $1 \le i \le n$;
5: Update iteration count $k = k + 1$;
6: end while

Algorithm 2 SpectrIn: Spectral Initialization
1: Input: $Y \in \mathbb{R}^{nd \times nd}$;
2: Compute the $d$ leading unit eigenvectors of $Y$: $\{u_1, \dots, u_d\}$;
3: Set $\Phi = \sqrt{n}\,[u_1, u_2, \dots, u_d] \in \mathbb{R}^{nd \times d}$ and $\Psi = \sqrt{n}\,[u_1, u_2, \dots, u_{d-1}, -u_d]$;
4: Compute $\widetilde{\Phi} = \mathcal{P}_{\mathrm{SO}(d)^n}(\Phi)$ and $\widetilde{\Psi} = \mathcal{P}_{\mathrm{SO}(d)^n}(\Psi)$;
5: if $\|\widetilde{\Phi} - \Phi\|_F \le \|\widetilde{\Psi} - \Psi\|_F$ then
6: $X^0 = \widetilde{\Phi}$;
7: else
8: $X^0 = \widetilde{\Psi}$;
9: end if
10: Output: Initial point $X^0$.

Any element $\widetilde{\nabla} f(X_i) \in \partial f(X_i)$ is called a Euclidean subgradient. In ReSync, one can choose an arbitrary subgradient $\widetilde{\nabla} f(X_i) \in \partial f(X_i)$ at $X_i$. Mimicking the gradient method to update along the search direction $-\widetilde{\nabla}_R f(X_i)$ provides a point $X_i^+ = X_i - \mu \widetilde{\nabla}_R f(X_i)$ on the (affine) tangent space at $X_i$, which may violate the manifold constraint $X_i^+ \in \mathrm{SO}(d)$. One common approach in Riemannian optimization is to employ a retraction operator to address the feasibility issue. For $\mathrm{SO}(d)$, we can use a QR decomposition-based retraction and implement the Riemannian subgradient step as

$$
X_i^+ = \mathrm{Retr}_{X_i}\big(-\mu \widetilde{\nabla}_R f(X_i)\big) = \mathrm{Qr}\big(X_i - \mu \widetilde{\nabla}_R f(X_i)\big), \quad 1 \le i \le n.
\tag{4}
$$

Here, $\mathrm{Qr}(B)$ returns the Q-factor in the thin QR decomposition of $B$, where the diagonal entries of the R-factor are restricted to be positive [7]. Finally, setting $X_i = X_i^k$, $X_j = X_j^k$ for all $j$ such that $(i,j) \in E$, and $\mu = \mu_k$ in (3) and (4) yields a concrete implementation of Step 4 in ReSync and leads to $\mathrm{SO}(d) \ni X_i^{k+1} = X_i^+$ for $1 \le i \le n$. This completes the description of one full iteration of ReSync. Note that the per-iteration complexity of the Riemannian subgradient procedure is $O(n^2 q)$, and Algorithm 2 has computational cost $O(n^3)$.
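Putting (3) and (4) together, here is a minimal sketch of one Riemannian subgradient iteration, assuming the block layout used in the sketches above; names are illustrative, the zero subgradient is taken at nondifferentiable terms, and the factor 2 from the symmetric edge set is absorbed into the step size.

```python
import numpy as np

def riemannian_subgrad_step(X_blocks, Y, d, mu):
    """One ReSync-style update sketch: Euclidean subgradient of the least-unsquared
    loss, projection onto the tangent space of SO(d), and a QR-based retraction."""
    n = len(X_blocks)
    X_new = []
    for i in range(n):
        Xi = X_blocks[i]
        G = np.zeros((d, d))
        for j in range(n):
            Yij = Y[i*d:(i+1)*d, j*d:(j+1)*d]
            if i == j or not np.any(Yij):
                continue
            r = np.linalg.norm(Xi @ X_blocks[j].T - Yij, 'fro')
            if r > 1e-12:                                  # subgradient of ||X_i X_j^T - Y_ij||_F
                G += (Xi - Yij @ X_blocks[j]) / r
        rg = Xi @ (Xi.T @ G - G.T @ Xi) / 2                # tangent-space projection, cf. (3)
        Q, R = np.linalg.qr(Xi - mu * rg)                  # QR retraction, cf. (4)
        Q = Q @ np.diag(np.sign(np.diag(R)))               # fix signs so that diag(R) > 0
        X_new.append(Q)                                    # stays in SO(d) for small steps
    return X_new
```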
2.2 RCM Setup for Theoretical Analysis

We develop our theoretical analysis of ReSync by adopting the random corruption model (RCM). The RCM was previously used in many works to analyze the performance of various synchronization algorithms; see, e.g., [39, 19, 23, 32]. Specifically, we can represent our measurement model (1) on a graph $G(V, E)$, where $V$ is a set of $n$ nodes representing $\{X_1^*, \dots, X_n^*\}$ and $E$ is a set of edges containing all the available measurements $\{Y_{ij}, (i,j) \in E\}$. We assume that the graph $G$ follows the well-known Erdős–Rényi model $\mathcal{G}(n, q)$, which implies that each edge $(i,j)$ is observed (i.e., $(i,j) \in E$) with probability $q$, independently of every other edge. Each edge $(i,j) \in E$ is a true observation (i.e., $(i,j) \in A$) with probability $p$ and an outlier (i.e., $(i,j) \in A^c$) with probability $1 - p$. Furthermore, the outliers $\{O_{ij}\}_{(i,j) \in A^c}$ are assumed to be independently and uniformly distributed on $\mathrm{SO}(d)$.

3 Main Results

In this section, we present our theoretical results for ReSync. Our main results are summarized in the following theorem, which states that our proposed algorithm converges at a linear rate to the underlying rotations $X^*$. Our standing assumption in this section is the RCM; that is, all theoretical results in this section are established under the RCM described in Section 2.2.

Theorem 1 (overall). Suppose that the ratios $p$ and $q$ satisfy $p^7 q^2 = \Omega(\log n / n)$. With probability at least $1 - O(1/n)$, ReSync with $\mu_k = \mu_0 \gamma^k$, where $\mu_0 = \Theta(p^2/n)$ and $\gamma = 1 - \frac{pq}{16}$, converges linearly to the ground-truth rotations $X^*$ (up to a global rotation), i.e.,

$$
\mathrm{dist}(X^k, X^*) \le \xi_0 \gamma^k, \qquad \mathrm{dist}_\infty(X^k, X^*) \le \delta_0 \gamma^k, \qquad \forall\, k \ge 0.
$$

Here, $\xi_0 = \Theta(\sqrt{n p^5 q})$ and $\delta_0 = \Theta(p^2)$.

The basic idea of the proof is to establish the problem-intrinsic property of weak sharpness and then use it to derive a linear convergence result. However, the result only holds locally. Thus, we develop a procedure to initialize the algorithm in this local region and argue that ReSync will not leave this region afterwards. In the remaining parts of this section, we implement the above ideas and highlight the challenges and our approaches to overcoming them.

3.1 Analysis of SpectrIn with the Leave-One-Out Technique

Theorem 2 (initialization). Let $X^0$ be generated by SpectrIn (see Algorithm 2). Suppose that the ratios $p$ and $q$ satisfy $p^2 q = \Omega(\log n / n)$. Then, with probability at least $1 - O(1/n)$, we have

$$
\mathrm{dist}(X^0, X^*) = O\!\left(\sqrt{\frac{\log n}{p^2 q}}\right) \quad \text{and} \quad \mathrm{dist}_\infty(X^0, X^*) = O\!\left(\sqrt{\frac{\log n}{n p^2 q}}\right).
$$

The works [34] and [11] show that exact reconstruction of $X^*$ is information-theoretically possible if the condition $p^2 q = \Omega(\log n / n)$ holds for the cases $d = 2$ and $d = 3$, respectively. Though Theorem 2 does not provide exact recovery, it achieves an optimal sample complexity for reconstructing an approximate solution in the infinity norm. Specifically, Theorem 2 shows that, as long as $p^2 q \ge C \log n / n$ for some constant $C > 0$ large enough, the $\ell_\infty$-distance $\mathrm{dist}_\infty(X^0, X^*)$ (i.e., the largest blockwise error $\max_{1 \le i \le n} \|X_i^0 R^* - X_i^*\|_F$) can be made relatively small. However, the $\ell_2$-distance $\mathrm{dist}(X^0, X^*)$ is of the order $\Omega(\sqrt{n})$ under such a sample complexity. The work [26] considers orthogonal and permutation group synchronization and shows that spectral relaxation-based methods achieve near-optimal performance bounds. Our result differs from that of [26] in two respects: 1) Our approach follows the standard leave-one-out analysis based on the distance Dist (up to $\mathrm{O}(d)$ invariance) defined above Lemma 3 in the Appendix. Nonetheless, we have to transfer the results to dist in Lemma 5, which is a nontrivial step due to the specific structure of $\mathrm{SO}(d)$. 2) Our result can handle incomplete observations (i.e., $q < 1$). In the case of incomplete observations, the construction in (17) in the Appendix becomes more intricate; it has an additional third case, rendering the analysis of our Lemma 2 more involved.
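For reference, the overall procedure analyzed in Theorem 1 (spectral initialization followed by Riemannian subgradient steps with geometrically decaying step sizes $\mu_k = \mu_0 \gamma^k$) can be sketched by combining the helpers introduced above; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def resync(Y, n, d, mu0, gamma, iters=300):
    """ReSync sketch: spectral initialization followed by Riemannian subgradient
    steps with geometrically decaying step sizes mu_k = mu0 * gamma**k."""
    X = spectral_init(Y, n, d)                    # Algorithm 2 sketch from above
    X = [X[i*d:(i+1)*d, :] for i in range(n)]     # split into d x d blocks
    for k in range(iters):
        X = riemannian_subgrad_step(X, Y, d, mu0 * gamma**k)
    return X

def dist_to_truth(X_blocks, X_true):
    """dist(X, X*): Frobenius error after aligning with the best global rotation R*."""
    X = np.vstack(X_blocks)
    Xs = np.vstack(X_true)
    R = project_so_d(X.T @ Xs)                    # R* = P_SO(d)(X^T X*)
    return np.linalg.norm(X @ R - Xs)
```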
We prove Theorem 2 with some matrix concentration bounds and the leave-one-out technique. We provide the proof sketch below and refer to Appendix A for the full derivations.

Proof outline of Theorem 2. According to (1) and the fact that $\mathbb{E}(O_{ij}) = 0$ since the outliers are assumed to be independently and uniformly distributed on $\mathrm{SO}(d)$ in the RCM (see Appendix A), we know that $\mathbb{E}(Y_{ij}) = pq\, X_i^* X_j^{*\top}$ for all $(i,j) \in [n] \times [n]$. This motivates us to introduce the noise matrix $W_{ij} = Y_{ij} - pq\, X_i^* X_j^{*\top}$, i.e.,

$$
Y = pq\, X^* X^{*\top} + W.
\tag{6}
$$

The condition $p^2 q = \Omega(\log n / n)$ in Theorem 2 ensures that the expectation $pq\, X^* X^{*\top}$ dominates the noise matrix $W$ in the decomposition (6). We first discuss how to bound $\mathrm{dist}(X^0, X^*)$. Notice that $X^0$ and $X^*$ are, up to scaling and projection onto $\mathrm{SO}(d)^n$, formed by the $d$ leading eigenvectors of $Y$ and of $pq\, X^* X^{*\top}$, respectively. We can then use matrix perturbation theory (see Lemma 3) to bound $\mathrm{dist}(X^0, X^*)$. Towards this end, we need to estimate the operator norm $\|W\|_2$, which can be done by applying the standard matrix Bernstein concentration inequality [38] since the blocks $\{W_{ij}\}$ are independent, zero-mean, and have bounded operator norms and variances; see Lemma 2.

We next turn to bound the initialization error in the infinity norm, i.e., $\mathrm{dist}_\infty(X^0, X^*)$. Let us use $(W X^0)_m \in \mathbb{R}^{d \times d}$ to denote the $m$-th block of $W X^0 \in \mathbb{R}^{nd \times d}$ for $1 \le m \le n$. The main technical challenge lies in deriving a sharp bound for the term $\max_{1 \le m \le n} \|(W X^0)_m\|_F$, as it involves two dependent random quantities, i.e., the noise matrix $W$ and the initial $X^0$ that is obtained by projecting the $d$ leading eigenvectors of $Y$ onto $\mathrm{SO}(d)^n$. To overcome this statistical dependence, we utilize the leave-one-out technique. This technique was utilized in [42] to analyze the phase synchronization problem and was later applied to many other synchronization problems [1, 10, 15, 18, 26]. Let us define

$$
Y^{(m)} = pq\, X^* X^{*\top} + W^{(m)} \quad \text{with} \quad W^{(m)}_{kl} = W_{kl}\, \mathbf{1}\{k \neq m\}\, \mathbf{1}\{l \neq m\}.
\tag{7}
$$

That is, we construct $W^{(m)} \in \mathbb{R}^{nd \times nd}$ by setting the $m$-th block-wise row and column of $W$ to $0$. Then, it is easy to see that $Y^{(m)}$ is statistically independent of $W_{m\cdot} \in \mathbb{R}^{d \times nd}$, where the latter denotes the $m$-th block-wise row of $W$. Let $X^{(m)}$ be the $d$ leading eigenvectors of $Y^{(m)}$. Consequently, $X^{(m)}$ is also independent of $W_{m\cdot}$. Based on the above discussions, we can bound each $\|(W X^0)_m\|_F$ in the following way:

$$
\|(W X^0)_m\|_F = \|W_{m\cdot} X^0\|_F \le \|W_{m\cdot} X^{(m)}\|_F + \|W_{m\cdot} (X^0 - X^{(m)})\|_F.
\tag{8}
$$

The first term $\|W_{m\cdot} X^{(m)}\|_F$ can be bounded using an appropriate concentration inequality thanks to the statistical independence between $W_{m\cdot}$ and $X^{(m)}$. The second term can be bounded as $\|W_{m\cdot} (X^0 - X^{(m)})\|_F \le \|W_{m\cdot}\|_2 \|X^0 - X^{(m)}\|_F$, in which $\|W_{m\cdot}\|_2$ can be further bounded by a matrix concentration inequality (see Lemma 2) and $\|X^0 - X^{(m)}\|_F$ can be bounded using standard matrix perturbation theory (see Lemma 4).

3.2 Weak Sharpness and Exact Recovery

We next present a property that is intrinsic to problem (2) in the following theorem.

Theorem 3 (weak sharpness). Suppose that the ratios $p$ and $q$ satisfy $p^2 q^2 = \Omega(\log n / n)$. Then, with probability at least $1 - O(1/n)$, for any $X \in \mathrm{SO}(d)^n$ satisfying $\mathrm{dist}_\infty(X, X^*) = O(p)$, we have

$$
f(X) - f(X^*) \ge \frac{npq}{8}\, \mathrm{dist}_1(X, X^*).
$$

Some remarks on Theorem 3 are in order. This theorem shows that problem (2) possesses the weak sharpness property [8], which is intrinsic to the problem and independent of the algorithm used to solve it. It is known that with this property, various subgradient-type methods can achieve linear convergence [14, 25]. We will establish a similar linear convergence result for ReSync in the next subsection based on Theorem 3.
The weak sharpness property shown in Theorem 3 is of independent interest, as it could be helpful when analyzing other optimization algorithms (not just ReSync) for solving problem (2). Currently, only a few applications are known to produce sharp optimization problems, such as robust low-rank matrix recovery [25], robust phase retrieval [16], and robust subspace recovery [24]. Furthermore, sharp instances of manifold optimization problems are especially scarce. Hence, Theorem 3 extends the list of optimization problems that possess the weak sharpness property and contributes to the growing literature on the geometry of structured nonsmooth nonconvex optimization problems.

It is worth noting that Theorem 3 also establishes the exact recovery property of the formulation (2). Specifically, up to a global rotation, the ground truth $X^*$ is guaranteed to be the unique global minimizer of $f$ over the region $\mathrm{SO}(d)^n \cap \{X : \mathrm{dist}_\infty(X, X^*) = O(p)\}$. Consequently, recovering the underlying $X^*$ reduces to finding the global minimizer of $f$ over the aforementioned region. As we will show in the next subsection, ReSync converges linearly to the global minimizer $X^*$ when initialized in this region. However, the initialization requirement is subject to the stronger condition $p^4 q = \Omega(\log n / n)$ on the ratios $p$ and $q$, which is ensured by Theorem 2.

We list our main ideas for proving Theorem 3 below. The full proof can be found in Appendix B.

Proof outline of Theorem 3. Note that the objective function $f$ can be decomposed into two parts:

$$
f(X) = g(X) + h(X) := \sum_{(i,j) \in A} \|X_i X_j^\top - X_i^* X_j^{*\top}\|_F + \sum_{(i,j) \in A^c} \|X_i X_j^\top - O_{ij}\|_F.
\tag{9}
$$

It is easy to see that $g(X^*) = 0$ and $g(X) \ge 0$. Based on the fact that the true observations are uniformly distributed over all index pairs, we have $\mathbb{E}(g(X)) = pq \sum_{1 \le i,j \le n} \|X_i X_j^\top - X_i^* X_j^{*\top}\|_F \ge \frac{npq}{2}\, \mathrm{dist}_1(X, X^*)$; see Appendix B.2 for the last inequality. A traditional way to lower bound $g(X)$ using $\mathbb{E}(g(X))$ for all $X \in \mathrm{SO}(d)^n$ is to apply a concentration inequality together with an epsilon-net covering argument. Unfortunately, the sample complexity condition $p^2 q^2 = \Omega(\log n / n)$ does not lead to a high-probability result in this way. Instead, our approach is to apply concentration to the cardinalities of the index sets rather than to $X$ directly; see the following lemma.

Lemma 1 (concentration of cardinalities of index sets). Given any $\epsilon = \Omega\big(\sqrt{\log n / (npq)}\big)$ with $\epsilon \in (0,1)$, with probability at least $1 - O(1/n)$, we have

$$
(1-\epsilon) nq \le |E_i| \le (1+\epsilon) nq, \quad (1-\epsilon) npq \le |A_i| \le (1+\epsilon) npq,
$$
$$
(1-\epsilon) npq^2 \le |E_i \cap A_j| \le (1+\epsilon) npq^2, \quad (1-\epsilon) np^2q^2 \le |A_{ij}| \le (1+\epsilon) np^2q^2
$$

for any $1 \le i, j \le n$. See Section 1 for the notation.

We then provide a sharp lower bound on $g(X)$ based on Lemma 1.

Proposition 1. Under the conditions of Theorem 3, with probability at least $1 - O(1/n)$, we have

$$
g(X) \ge \frac{3npq}{16}\, \mathrm{dist}_1(X, X^*), \quad \forall\, X \in \mathrm{SO}(d)^n.
\tag{10}
$$

Next, to lower bound $h(X) - h(X^*) = \sum_{(i,j) \in A^c} \big( \|X_i X_j^\top - O_{ij}\|_F - \|X_i^* X_j^{*\top} - O_{ij}\|_F \big)$, we first bound

$$
h(X) - h(X^*) \ge \sum_{(i,j) \in A^c} \left\langle \frac{X_i^* X_j^{*\top} - O_{ij}}{\|X_i^* X_j^{*\top} - O_{ij}\|_F},\; X_i X_j^\top - X_i^* X_j^{*\top} \right\rangle,
$$

where the inequality comes from the convexity of the norm function $U \mapsto \|U\|_F$ whenever $X_i^* X_j^{*\top} - O_{ij} \neq 0$. Then, using the orthogonality of each block of $X^*$, we further have

$$
h(X) - h(X^*) \ge \sum_{(i,j) \in A^c} \left\langle \frac{I - X_i^{*\top} O_{ij} X_j^*}{\|I - X_i^{*\top} O_{ij} X_j^*\|_F},\; X_i^{*\top} X_i X_j^\top X_j^* - I \right\rangle.
\tag{11}
$$

Recall that since the outliers $\{O_{ij}\}_{(i,j) \in A^c}$ are independently and uniformly distributed on $\mathrm{SO}(d)$, so are $\{X_i^{*\top} O_{ij} X_j^*\}_{(i,j) \in A^c}$. This observation indicates that $\big\{(I - X_i^{*\top} O_{ij} X_j^*)/\|I - X_i^{*\top} O_{ij} X_j^*\|_F\big\}_{(i,j) \in A^c}$ are i.i.d. random matrices. Hence, by invoking concentration results that utilize the randomness of the outliers $\{O_{ij}\}_{(i,j) \in A^c}$ and the cardinalities of the outlier index sets, we obtain the following result.

Proposition 2. Under the conditions of Theorem 3, with probability at least $1 - O(1/n)$, we have

$$
h(X) - h(X^*) \ge -\frac{npq}{16}\, \mathrm{dist}_1(X, X^*)
\tag{12}
$$

for all $X \in \mathrm{SO}(d)^n$ satisfying $\mathrm{dist}_\infty(X, X^*) = O(p)$.

Combining Proposition 1 and Proposition 2 gives Theorem 3.
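As a quick numerical sanity check (not a proof) of the weak sharpness inequality in Theorem 3, one can compare $f(X) - f(X^*)$ against $\frac{npq}{8}\,\mathrm{dist}_1(X, X^*)$ for random perturbations of the ground truth, reusing the helpers sketched earlier; all names and parameter values here are illustrative.

```python
import numpy as np

def dist1(X_blocks, X_true):
    """dist_1(X, X*): sum of blockwise errors after the best global alignment R*."""
    X, Xs = np.vstack(X_blocks), np.vstack(X_true)
    R = project_so_d(X.T @ Xs)
    d = X_true[0].shape[0]
    return sum(np.linalg.norm(X[i*d:(i+1)*d] @ R - Xs[i*d:(i+1)*d])
               for i in range(len(X_true)))

def check_weak_sharpness(n=80, d=3, p=0.7, q=0.5, trials=5, seed=1):
    """Empirically compare f(X) - f(X*) with (npq/8) * dist_1(X, X*) for small random
    perturbations of the ground truth (a sanity check of Theorem 3, not a proof)."""
    rng = np.random.default_rng(seed)
    X_true, Y = generate_rcm(n, d, p, q, seed)
    f_star = resync_objective(X_true, Y, d)
    for _ in range(trials):
        # perturb each block slightly and re-project onto SO(d)
        X = [project_so_d(Xi + 0.05 * rng.standard_normal((d, d))) for Xi in X_true]
        lhs = resync_objective(X, Y, d) - f_star
        rhs = n * p * q / 8 * dist1(X, X_true)
        print(f"f(X)-f(X*) = {lhs:.3f}  vs  (npq/8)*dist1 = {rhs:.3f}")
```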
3.3 Convergence Analysis and Proof of Theorem 1

Let us now utilize the weak sharpness property shown in Theorem 3 to establish the local linear convergence of ReSync. As a quick corollary of Theorem 3, we have the following result.

Corollary 1. Under the conditions of Theorem 3, with probability at least $1 - O(1/n)$, for any $X \in \mathrm{SO}(d)^n$ satisfying $\mathrm{dist}_\infty(X, X^*) = O(p)$, we have

$$
\big\langle \widetilde{\nabla}_R f(X),\, X - X^* \big\rangle \ge \frac{npq}{16}\, \mathrm{dist}_1(X, X^*), \qquad \forall\, \widetilde{\nabla}_R f(X) \in \partial_R f(X).
\tag{13}
$$

This condition indicates that any Riemannian subgradient $\widetilde{\nabla}_R f(X)$ provides a descent direction pointing towards $X^*$. However, it only holds for $X \in \mathrm{SO}(d)^n$ satisfying $\mathrm{dist}_\infty(X, X^*) = O(p)$. Our key idea for establishing local convergence is to show that the Riemannian subgradient update in ReSync is a contraction operator in both the Euclidean and infinity norm-induced distances using Corollary 1, i.e., if $X^k$ lies in the local region, then $X^{k+1}$ also lies in the region. This idea motivates us to define two sequences of neighborhoods as follows:

$$
\mathcal{N}_F^k = \{X \mid \mathrm{dist}(X, X^*) \le \xi_k\} \quad \text{and} \quad \mathcal{N}_\infty^k = \{X \mid \mathrm{dist}_\infty(X, X^*) \le \delta_k\}.
\tag{14}
$$

Here, $\xi_k = \xi_0 \gamma^k$ and $\delta_k = \delta_0 \gamma^k$, where $\xi_0$, $\delta_0$, and $\gamma \in (0,1)$ will be specified later. Thus, these two sequences of sets $\{\mathcal{N}_F^k\}$ and $\{\mathcal{N}_\infty^k\}$ shrink linearly to the ground truth. It remains to show that if $X^k \in \mathcal{N}_F^k \cap \mathcal{N}_\infty^k$, then $X^{k+1} \in \mathcal{N}_F^{k+1} \cap \mathcal{N}_\infty^{k+1}$, which is summarized in the following theorem.

Theorem 4 (convergence analysis). Suppose that $\delta_0 = O(p^2)$ and $\xi_0 = O(\sqrt{npq}\, \delta_0)$. Set $\gamma = 1 - \frac{pq}{16}$ and $\mu_k = \Theta(\delta_k / n)$ in ReSync. If $X^k \in \mathcal{N}_F^k \cap \mathcal{N}_\infty^k$ for any $k \ge 0$, then with probability at least $1 - O(1/n)$, we have $X^{k+1} \in \mathcal{N}_F^{k+1} \cap \mathcal{N}_\infty^{k+1}$.

Proof outline of Theorem 4. The proof consists of two parts. On the one hand, we need to show that $X^{k+1} \in \mathcal{N}_F^{k+1}$, which can be achieved by applying Corollary 1. On the other hand, in order to show that $X^{k+1} \in \mathcal{N}_\infty^{k+1}$, we need a good estimate of each block of $\widetilde{\nabla}_R f(X)$. See Appendix C.

Having developed the necessary tools, we are now ready to prove Theorem 1.

Proof of Theorem 1. Based on Theorem 2, we know that $X^0 \in \mathcal{N}_F^0 \cap \mathcal{N}_\infty^0$ if $\xi_0$ and $\delta_0$ satisfy

$$
\sqrt{\frac{\log n}{p^2 q}} = O(\xi_0) \quad \text{and} \quad \sqrt{\frac{\log n}{n p^2 q}} = O(\delta_0).
\tag{15}
$$

According to Theorem 4, by choosing $\delta_0 = \Theta(p^2)$ and $\xi_0 = \Theta(\sqrt{n p^5 q})$, condition (15) holds when $p^7 q^2 = \Omega(\log n / n)$. This completes the proof of Theorem 1.

4 Experiments

In this section, we conduct experiments with ReSync for solving the RRS problem on both synthetic and real data, providing empirical support for our theoretical findings. Our experiments are conducted on a personal computer with a 2.90GHz 8-core CPU and 32GB memory. All our experiment results are averaged over 20 independent trials. Our code is available at https://github.com/Huikang2019/ReSync.

4.1 Synthetic Data

We consider the rotation group $\mathrm{SO}(3)$ in all our experiments. We generate $X_1^*, \dots, X_n^*$ by first generating matrices of the same dimension with i.i.d. standard Gaussian entries and then projecting each of them onto $\mathrm{SO}(3)$.
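Following the recipe just described (Gaussian matrices projected onto $\mathrm{SO}(3)$), a minimal sketch reusing `project_so_d` from above; the function name is illustrative.

```python
import numpy as np

def ground_truth_so3(n, seed=0):
    """Ground-truth rotations as described in Section 4.1: i.i.d. standard Gaussian
    3 x 3 matrices projected onto SO(3)."""
    rng = np.random.default_rng(seed)
    return [project_so_d(rng.standard_normal((3, 3))) for _ in range(n)]
```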
The underlying graph, outliers, and relative rotations in the measurement model (1) are generated according to the RCM as described in Section 2.2. In our experiments, we also consider the case where the true observations are contaminated by additive noise, namely, $\{Y_{ij}\}_{(i,j) \in A}$ in (1) is generated using the formula

$$
Y_{ij} = \mathcal{P}_{\mathrm{SO}(3)}\big(X_i^* X_j^{*\top} + \sigma G_{ij}\big) \quad \text{for } (i,j) \in A,
\tag{16}
$$

where $G_{ij}$ consists of i.i.d. entries following the standard Gaussian distribution and $\sigma \ge 0$ controls the variance level of the noise.

Convergence verification of ReSync. We evaluate the convergence performance of ReSync with the noise level $\sigma = 0$ in (16). We set $p = q = (\log n / n)^{1/3}$ in the measurement model (1), which satisfies $p^2 q = \log n / n$. We use the initial step size $\mu_0 = 1/(npq)$ and the decaying factor $\gamma \in \{0.7, 0.8, 0.85, 0.90, 0.95, 0.98\}$ in ReSync. We test the performance for various $n$ selected from $\{400, 600, 800, 1000\}$. Figure 2 displays the experiment results. It can be observed that (i) ReSync converges linearly to the ground-truth rotations for a wide range of $\gamma$ and (ii) a smaller $\gamma$ often leads to faster convergence. These corroborate our theoretical findings. However, it is worth noting that excessively small $\gamma$ values may result in an early stopping phenomenon (e.g., $\gamma \le 0.8$ when $n = 400$). In addition, ReSync performs better with a larger $n$, as it allows for a smaller $\gamma$ (e.g., $\gamma = 0.7$ when $n = 1000$) and hence converges to the ground-truth rotations faster.

Figure 2: Convergence of ReSync with $p = q = (\log n / n)^{1/3}$; panels (a)-(d) correspond to $n = 400, 600, 800, 1000$.

Comparison with state-of-the-art methods. We next compare ReSync with state-of-the-art synchronization algorithms, including IRLS_L12 [9], MPLS [31], CEMP_GCW [23, 32], DESC [32], and LUD [39]. We obtain the implementations of the first four algorithms from https://github.com/ColeWyeth/DESC, while LUD's implementation is obtained through private communication with its authors. In our comparisons, we use their default parameter settings. For ReSync, we set the initial step size to $\mu_0 = 1/(npq)$ and the decaying factor to $\gamma = 0.95$, as suggested by the previous experiment. We fix $n = 200$ and vary the true observation ratio $p$ (or the observation ratio $q$) while keeping $q = 0.2$ (or $p = 0.2$) fixed. We display the experiment results for $\sigma = 0$ and $\sigma = 1$ in Figures 3a and 3b, respectively, where $p$ is selected from $\{0.2, 0.3, 0.4, \dots, 1\}$. When $\sigma = 0$, ReSync achieves competitive performance compared to other robust synchronization algorithms. When the additive noise level is $\sigma = 1$, ReSync outperforms the other algorithms. In Figures 3c and 3d, we present the results with varying $q$ chosen from $\{0.2, 0.3, 0.4, \dots, 1\}$ for the noise-free ($\sigma = 0$) and noisy ($\sigma = 1$) cases, respectively. In the noise-free case, DESC performs best when $q < 0.5$, while ReSync slightly outperforms the others when $q \ge 0.5$. In the noisy case, it is clear that ReSync achieves the best performance for a large range of $q$.

Figure 3: Comparison with state-of-the-art synchronization algorithms (MPLS, CEMP_GCW, IRLS_L12, DESC, ReSync, LUD); panels (a)-(d) correspond to ($q = 0.2$, $\sigma = 0.0$), ($q = 0.2$, $\sigma = 1.0$), ($p = 0.2$, $\sigma = 0.0$), and ($p = 0.2$, $\sigma = 1.0$).

4.2 Real Data

We consider the global alignment problem of three-dimensional scans from the Lucy dataset, which is a down-sampled version of the dataset containing 368 scans with a total of 3.5 million triangles. We refer to [39] for more details about the experiment setting. We apply three algorithms, LUD [39], DESC [32], and our ReSync, to this dataset since they have the best performance on noisy synthetic data. As Figure 4 shows, ReSync outperforms the other two methods.

Figure 4: Histogram of the unsquared residuals of LUD, DESC, and ReSync for the Lucy dataset.
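Returning to the synthetic setup of Section 4.1, the noisy observation model (16) and the convergence experiment can be sketched as follows, reusing the helpers from the earlier sketches; names and the usage example's parameters are illustrative.

```python
import numpy as np

def generate_noisy_rcm(X_true, p, q, sigma, seed=0):
    """Observations per model (1) with the noisy true measurements of (16):
    Y_ij = P_SO(d)(X*_i X*_j^T + sigma * G_ij) on A, uniform outliers on the
    remaining observed pairs."""
    rng = np.random.default_rng(seed)
    n, d = len(X_true), X_true[0].shape[0]
    Y = np.zeros((n * d, n * d))
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < q:
                if rng.random() < p:
                    Yij = project_so_d(X_true[i] @ X_true[j].T
                                       + sigma * rng.standard_normal((d, d)))
                else:
                    Yij = random_rotation(d, rng)
                Y[i*d:(i+1)*d, j*d:(j+1)*d] = Yij
                Y[j*d:(j+1)*d, i*d:(i+1)*d] = Yij.T
    return Y

# Example usage mirroring the convergence experiment of Section 4.1:
# n, d = 400, 3
# p = q = (np.log(n) / n) ** (1 / 3)
# X_true = ground_truth_so3(n)
# Y = generate_noisy_rcm(X_true, p, q, sigma=0.0)
# X_hat = resync(Y, n, d, mu0=1.0 / (n * p * q), gamma=0.95)
# print(dist_to_truth(X_hat, X_true))
```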
5 Conclusion and Discussions on Limitations

In this work, we introduced ReSync, a Riemannian subgradient-based algorithm with spectral initialization for solving RRS. We established strong theoretical results for ReSync under the RCM. In particular, we first presented an initialization guarantee for SpectrIn, which is the procedure embedded in ReSync for initialization. Then, we established a problem-intrinsic property called weak sharpness for our nonsmooth nonconvex formulation, which is of independent interest. Based on the established weak sharpness property, we derived linear convergence of ReSync to the underlying rotations once it is initialized in a local region. Combining these theoretical results demonstrates that ReSync converges linearly to the ground-truth rotations under the RCM.

Limitations. Our overall guarantee in Theorem 1 requires the sample complexity $p^7 q^2 = \Omega(\log n / n)$, which does not match the currently best known lower bound $p^2 q = \Omega(\log n / n)$ for exact recovery [34, 11]. We showed in Theorem 2 that approximate recovery with an optimal sample complexity is possible. Moreover, we showed in Theorem 3 that exact recovery with $p^2 q^2 = \Omega(\log n / n)$ is possible if we have a global minimizer of the objective function of problem (2) within a certain local region. However, due to the nonconvexity of problem (2), it is non-trivial to obtain the said minimizer. We circumvented this difficulty by establishing the linear convergence of ReSync to the desired minimizer in Theorem 4. Nevertheless, a strong requirement on initialization is needed, which translates to the weaker final complexity result of $p^7 q^2 = \Omega(\log n / n)$. Although our theory allows for $p \to 0$ as $n \to \infty$, our argument relies heavily on the randomness of the outliers $\{O_{ij}\}$ and the absence of additive noise. In practice, adversarial outliers that arbitrarily corrupt a measurement and additive noise contamination are prevalent. It remains unknown how well ReSync performs in such scenarios. The above challenges are significant areas for future research and improvement.

Acknowledgments and Disclosure of Funding

The authors thank Dr. Zengde Deng (Cainiao Network) and Dr. Shixiang Chen (University of Science and Technology of China) for providing helpful advice. They also thank the reviewers for their insightful comments, which have helped greatly to improve the quality and presentation of the manuscript. Huikang Liu is supported in part by the National Natural Science Foundation of China (NSFC) Grant 72192832. Xiao Li is supported in part by the National Natural Science Foundation of China (NSFC) under grants 12201534 and 72150002, and in part by the Shenzhen Science and Technology Program under grants RCBS20210609103708017 and RCYX20221008093033010. Anthony Man-Cho So is supported in part by the Hong Kong Research Grants Council (RGC) General Research Fund (GRF) project CUHK 14205421.

References

[1] Emmanuel Abbe, Jianqing Fan, Kaizheng Wang, and Yiqiao Zhong. Entrywise eigenvector analysis of random matrices with low expected rank. Annals of Statistics, 48(3):1452-1474, 2020.
[2] Boris Alexeev, Afonso S Bandeira, Matthew Fickus, and Dustin G Mixon. Phase retrieval with polarization. SIAM Journal on Imaging Sciences, 7(1):35-66, 2014.
[3] Mica Arie-Nachimson, Shahar Z Kovalsky, Ira Kemelmacher-Shlizerman, Amit Singer, and Ronen Basri. Global motion estimation from point matches. In 2012 2nd International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission, pages 81-88. IEEE, 2012.
[4] Afonso S Bandeira.
Random Laplacian matrices and convex relaxations. Foundations of Computational Mathematics, 18(2):345 379, 2018. [5] Afonso S Bandeira, Nicolas Boumal, and Amit Singer. Tightness of the maximum likelihood semidefinite relaxation for angular synchronization. Mathematical Programming, 163(1 2):145 167, 2017. [6] Afonso S Bandeira, Amit Singer, and Daniel A Spielman. A Cheeger inequality for the graph connection Laplacian. SIAM Journal on Matrix Analysis and Applications, 34(4):1611 1630, 2013. [7] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15(1):1455 1459, 2014. [8] James V Burke and Michael C Ferris. Weak sharp minima in mathematical programming. SIAM Journal on Control and Optimization, 31(5):1340 1359, 1993. [9] Avishek Chatterjee and Venu Madhav Govindu. Robust relative rotation averaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):958 972, 2017. [10] Yuxin Chen, Jianqing Fan, Cong Ma, and Kaizheng Wang. Spectral method and regularized MLE are both optimal for top-K ranking. Annals of Statistics, 47(4):2204 2235, 2019. [11] Yuxin Chen and Andrea J Goldsmith. Information recovery from pairwise measurements. In 2014 IEEE International Symposium on Information Theory, pages 2012 2016. IEEE, 2014. [12] Mihai Cucuringu, Yaron Lipman, and Amit Singer. Sensor network localization by eigenvector synchronization over the Euclidean group. ACM Transactions on Sensor Networks, 8(3):1 42, 2012. [13] Chandler Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1 46, 1970. [14] Damek Davis, Dmitriy Drusvyatskiy, Kellie J Mac Phee, and Courtney Paquette. Subgradient methods for sharp weakly convex functions. Journal of Optimization Theory and Applications, 179(3):962 982, 2018. [15] Shaofeng Deng, Shuyang Ling, and Thomas Strohmer. Strong consistency, graph Laplacians, and the stochastic block model. Journal of Machine Learning Research, 22(1):5210 5253, 2021. [16] John C Duchi and Feng Ruan. Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval. Information and Inference: A Journal of the IMA, 8(3):471 529, 2019. [17] Anders Eriksson, Carl Olsson, Fredrik Kahl, and Tat-Jun Chin. Rotation averaging with the chordal distance: Global minimizers and strong duality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):256 268, 2019. [18] Jianqing Fan, Weichen Wang, and Yiqiao Zhong. An l eigenvector perturbation bound and its application to robust covariance estimation. Journal of Machine Learning Research, 18(207):1 42, 2018. [19] Tingran Gao and Zhizhen Zhao. Multi-frequency phase synchronization. In Proceedings of the 36th International Conference on Machine Learning, pages 2132 2141. PMLR, 2019. [20] Venu Madhav Govindu. Lie-algebraic averaging for globally consistent motion estimation. In CVPR 2004, volume 1, pages I I. IEEE, 2004. [21] Richard Hartley, Khurrum Aftab, and Jochen Trumpf. L1 rotation averaging using the Weiszfeld algorithm. In CVPR 2011, pages 3041 3048. IEEE, 2011. [22] Richard Hartley, Jochen Trumpf, Yuchao Dai, and Hongdong Li. Rotation averaging. International Journal of Computer Vision, 103(3):267 305, 2013. [23] Gilad Lerman and Yunpeng Shi. Robust group synchronization via cycle-edge message passing. Foundations of Computational Mathematics, 22(6):1665 1741, 2022. 
[24] Xiao Li, Shixiang Chen, Zengde Deng, Qing Qu, Zhihui Zhu, and Anthony Man-Cho So. Weakly convex optimization over Stiefel manifold using Riemannian subgradient-type methods. SIAM Journal on Optimization, 31(3):1605 1634, 2021. [25] Xiao Li, Zhihui Zhu, Anthony Man-Cho So, and Ren e Vidal. Nonconvex robust low-rank matrix recovery. SIAM Journal on Optimization, 30(1):660 686, 2020. [26] Shuyang Ling. Near-optimal performance bounds for orthogonal and permutation group synchronization via spectral methods. Applied and Computational Harmonic Analysis, 60:20 52, 2022. [27] Huikang Liu, Man-Chung Yue, and Anthony Man-Cho So. A unified approach to synchronization problems over subgroups of the orthogonal group. Applied and Computational Harmonic Analysis, 66:320 372, 2023. [28] Daniel Martinec and Tomas Pajdla. Robust rotation and translation estimation in multiview reconstruction. In CVPR 2007, pages 1 8. IEEE, 2007. [29] Tyler Maunu and Gilad Lerman. Depth descent synchronization in SO (D). International Journal of Computer Vision, 131(4):968 986, 2023. [30] David M Rosen, Luca Carlone, Afonso S Bandeira, and John J Leonard. SE-Sync: A certifiably correct algorithm for synchronization over the special euclidean group. International Journal of Robotics Research, 38(2 3):95 125, 2019. [31] Yunpeng Shi and Gilad Lerman. Message passing least squares framework and its application to rotation synchronization. In Proceedings of the 37th International Conference on Machine Learning, pages 8796 8806. PMLR, 2020. [32] Yunpeng Shi, Cole M Wyeth, and Gilad Lerman. Robust group synchronization via quadratic programming. In Proceedings of the 39th International Conference on Machine Learning, pages 20095 20105. PMLR, 2022. [33] Yoel Shkolnisky and Amit Singer. Viewing direction estimation in cryo-EM using synchronization. SIAM Journal on Imaging Sciences, 5(3):1088 1110, 2012. [34] Amit Singer. Angular synchronization by eigenvectors and semidefinite programming. Applied and Computational Harmonic Analysis, 30(1):20 36, 2011. [35] Amit Singer, Zhizhen Zhao, Yoel Shkolnisky, and Ronny Hadani. Viewing angle classification of cryo-electron microscopy images using eigenvectors. SIAM Journal on Imaging Sciences, 4(2):723 759, 2011. [36] Anthony Man-Cho So. Probabilistic analysis of the semidefinite relaxation detector in digital communications. In Proceedings of the 21st annual ACM-SIAM Symposium on Discrete Algorithms, pages 698 711. SIAM, 2010. [37] Gilbert W Stewart and Ji-guang Sun. Matrix Perturbation Theory. Academic Press, 1990. [38] Joel A Tropp. An Introduction to Matrix Concentration Inequalities. Foundations and Trends in Machine Learning, 8(1 2):1 230, 2015. [39] Lanhui Wang and Amit Singer. Exact and stable recovery of rotations for robust synchronization. Information and Inference: A Journal of the IMA, 2(2):145 193, 2013. [40] Wei Hong Yang, Lei-Hong Zhang, and Ruyi Song. Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pacific Journal of Optimization, 10(2):415 434, 2014. [41] Stella Yu. Angular embedding: A robust quadratic criterion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1):158 173, 2011. [42] Yiqiao Zhong and Nicolas Boumal. Near-optimal bounds for phase synchronization. SIAM Journal on Optimization, 28(2):989 1016, 2018. 1 Introduction 1 2 Algorithm and Setup 3 2.1 Re Sync: Algorithm Development . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 RCM Setup for Theoretical Analysis . . . . . . . . . . . . . . . . . . 
. . . . . . . 4 3 Main Results 5 3.1 Analysis of Spectr In with Leave-One-Out Technique . . . . . . . . . . . . . . . . 5 3.2 Weak Sharpness and Exact Recovery . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.3 Convergence Analysis and Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . 8 4 Experiments 8 4.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 4.2 Real Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 5 Conclusion and Discussions on Limitations 10 A Full Proof of Theorem 2 15 A.1 Initialization Error in Euclidean Norm . . . . . . . . . . . . . . . . . . . . . . . . 16 A.2 Initialization Error in Infinity Norm . . . . . . . . . . . . . . . . . . . . . . . . . 17 B Full Proof of Theorem 3 18 B.1 Proof of Lemma 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B.2 Proof of Proposition 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B.3 Proof of Proposition 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 C Full Proof of Theorem 4 21 C.1 Proof of Corollary 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 C.2 Proof of Contraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 A Full Proof of Theorem 2 In this section, we present the full proof of Theorem 2. We will use the notation defined in the proof outline of Theorem 2 in Section 3.1. Firstly, we know that (1 pq)X i X j , with probablity pq, Oij pq X i X j , with probablity (1 p)q, pq X i X j , otherwise. (17) Since Oij is assumed to be uniformly distributed on SO(d) in the RCM, given any matrix Q SO(d), it is easy to see that Oij Q is also uniformly distributed on SO(d), so we have E(Oij) = E(Oij Q) = E(Oij) Q, Q SO(d). Let Ekl SO(d), k = l denote the diagonal matrix whose k-th and l-th diagonal entries are 1 and others are 1, then we have E(Oij) = E(Oij) Ekl, which implies d E(Oij) = E(Oij) (E12 + E23 + + Ed1) = (d 4)E(Oij). Thus, we have E(Oij) = 0. Then, it is easy to see that E(Wij) = 0 and Var(Wij) = (1 pq)2I + (1 q)p2q2I + (1 p)q(1 + p2q2)I = q(1 p2q)I. (18) Based on the above calculations, we can derive the following Lemma, which is a direct result of the matrix Bernstein inequality [38]. Similar results can also be found in [42, Lemma 9], [27, Proposition 3], and [26, Eq. (5.12)]. Lemma 2. With probability at least 1 O(1/n), the following holds for any m [n]: nq log n , W (m) 2 = O p nq log n , Wm 2 = O p Proof. Note that W = P i 0 be the gap between the d-th eigenvalue and d + 1-th eigenvalue of A for some 1 d n 1. Furthermore, let U, U be the d-leading eigenvectors of A and A, respectively, which are normalized such that U F = U F = nd. Then, we have 2 EU F δ E 2 . (19) A.1 Initialization Error in Euclidean Norm Based on Lemma 2 and Lemma 3, we have the following bound on Dist. Note that the notations Φ, Ψ, eΦ, eΨ used in the following analysis are defined in Spectr In (i.e., Algorithm 2). Lemma 4. Let Φ and Φ(m) be the d-leading eigenvectors of Y and Y (m), respectively, which are normalized such that Φ F = Φ(m) F = nd. Then, we have Dist(Φ, X ) = O log n and Dist(Φ, Φ(m)) = O(1) (20) hold with probability at least 1 O(1/n). Proof. Let us Choose A = Y (m) and E = W (m) = W W (m) in Lemma 3, then A = Y , U = Φ(m) and U = Φ. 
Since Φ(m) is independent of W (m), similar to Lemma 2, we apply the matrix Bernstein inequality [38] to obtain, with probability at least 1 O(1/n2), that EU F = W (m)Φ(m) F = O( p In addition, Y (m) = pq X X + W (m) implies that δ = λd(Y (m)) λd+1(Y (m)) λd(pq X X ) W (m) 2 npq O( p where the second inequality holds due to λd(X X ) = n and the last inequality is from λd(X X ) = n and Lemma 2. Based on the condition that p2q = Ω log n n , as long as p2q C log n n for some large enough constant C, we have δ E 2 npq O( p nq log n) E 2 npq O( p nq log n) 1 where the second inequality holds because of E 2 W 2 + W (m) 2 = O( nq log n). Hence, by applying Lemma 3, we get Dist(Φ, Φ(m)) 2 EU F δ E 2 O( nq log n) npq = O(1) (21) where the last inequality is because of p q = Ω( p log n/n). Similarly, by choosing A = pq X X and E = W , we can show that Dist(Φ, X ) 2 W X F npq O( nq log n) 1 2npq = O log n Here, the last inequality holds because W 2 = O( nq log n) (see Lemma 2) and the fact that X F = Following the same analysis as in Lemma 4, we are also able to show that Dist(Ψ, X ) = p q , where Ψ reverses the sign of the last column of Φ so that the determinants of Φ and Ψ differ by a sign. However, Lemma 4 only provides the upper bound on Dist , i.e., the distance up to an orthogonal matrix. The following lemma translates the result to that on dist . Lemma 5. Suppose that eΦ Φ F eΨ Ψ F , where eΦ = PSO(d)n(Φ) and eΨ = PSO(d)n(Ψ), then we have dist(Φ, X ) = Dist(Φ, X ) = O log n Proof. Based on the structure of O(d) and SO(d), it is easy to see that Dist(Φ, X ) = min {dist(Φ, X ), dist(Ψ, X )} . Our remaining task is to prove Dist(Φ, X ) = dist(Φ, X ) based on the condition that eΦ Φ F eΨ Ψ F . If Dist(Φ, X ) = dist(Ψ, X ), then we have = Dist(Φ, X ) = dist(Ψ, X ) eΨ Ψ F eΦ Φ F where the first inequality holds because eΨ is the projection of Ψ on SO(d)n. It is easy to see that eΦ Φ F + eΨ Ψ F = Φ PSO(d)n(Φ) F + Φ P(O(d)\SO(d))n(Φ) F 2 n. The equality holds because the mapping that reverses the sign of the last column of a matrix is a bijection between SO(d) and O(d) \ SO(d), and the inequality holds since the minimum distance between SO(d) and O(d) \ SO(d) is 2. Under the condition that p2q = Ω log n n , as long as p2q C log n n for some large enough constant C, we could have eΦ Φ F + eΨ Ψ F = p q < 2 n, which contradict to the above inequality. Thus, we have dist(Φ, X ) = Dist(Φ, X ) = O log n A.2 Initialization Error in Infinity Norm Next, based on the Davis-Kahan theorem, we can bound the distance between f X0 and X in the infinity norm, which is stated in the following result. Lemma 6. Let Φ be the d-leading eigenvectors of Y . Then, we have dist (Φ, X ) = O log n holds with probability at least 1 O(1/n). Proof. Let Πm = argminΠ SO(d) Φ Φ(m)Π F . We can compute (W Φ)m F = W mΦ F W mΦ(m)Πm F + W m(Φ Φ(m)Πm) F W mΦ(m) F + Wm 2 Φ Φ(m)Πm F . (23) The fact that Φ(m) is independent from Wm implies W mΦ(m) F = O( nq log n), so we have (W Φ)m F = O( p nq log n) (1 + O(1)) = O( p nq log n). (24) Next, let Π = argminΠ SO(d) Φ X Π F , then we have (Y Φ)m = pq X m X Φ + (W Φ)m = npq X mΠ + pq X m X (Φ X Π ) + (W Φ)m. Since Φ is the d-leading eigenvector of Y , we have Y Φ = ΦΣ with Σ Rd d consisting of the d-leading eigenvalues of Y . Applying the standard eigenvalue perturbation theory, e.g., [37, Theorem 4.11], we have Σ npq I 2 = O( W 2) = O( p Based on the assumption that p2q = Ω log n n , we can show that Σ 3 4npq I. 
Note that Y Φ = ΦΣ implies Φm = (Y Φ)mΣ 1 for each m [n], then we have Φm F 4 3npq (Y Φ)m F 4 3npq pq X m X Φm F + 4 3npq (W Φ)m F 2 where the last inequality holds because 4 3npq pq X m X Φm F = 4 3n X Φm F 4 d 3 and (24). Therefore, for each m [n], npq(Φm X mΠ ) = Φm(npq I Σ) + ΦmΣ npq X mΠ = Φm(npq I Σ) + (Y Φ)m npq X mΠ . This further implies npq Φm X mΠ F Σ npq I 2 Φm F + pq X m X (Φ X Π ) F + (W Φ)m F d W 2 + npq Φ X Π F + O p nq log n), where the last inequality holds because of Lemma 2 and Lemma 4. This completes the proof. Finally, since X0 = PSO(d)n(Φ), according to Lemma 2 in [27], we have dist(X0, X ) 2 dist(Φ, X ) and dist (X0, X ) 2 dist (Φ, X ), which complete the proof of Theorem 2. B Full Proof of Theorem 3 In this section, we provide the full proof of Lemma 1, Proposition 1, and Proposition 2, which finishes the proof of Theorem 3. To simplify the notation and the theoretical derivations, we assume without loss of generality that X i = I for all 1 i n as one can separately rotate the space that each variable Xi lies in such that the corresponding ground-truth X i is rotated to identity [39, Lemma 4.1]. Consequently, we have Yij = I for (i, j) A. B.1 Proof of Lemma 1 For each 1 i n, |Ei| = P 1 j n 1Ei(j), where 1Ei( ) denotes the indicator function w.r.t Ei. Based on our model, P 1 j n 1Ei(j) follows the binomial distribution B(n, q). According to the Bernstein inequality [38], for any constant ϵ (0, 1), we have 1 j n 1Ei(j) nq 1 2ϵ2n2q2 P 1 j n E{12 Ei(j)} + ϵnq/3 The last inequality holds because of ϵ < 1. Therefore, Pr 1 i n {|Ei| nq| ϵnq} 2n exp 3 where the last inequality holds because we assume that ϵ 8 log n npq . Similarly, we have Pr 1 i n {|Ai| npq| ϵnpq} 2n exp 3 Pr 1 i,j n Ei Aj| npq2 ϵnpq2 2n2 exp 3 Pr 1 i,j n Aij| np2q2 ϵnp2q2 2n2 exp 3 Hence, we complete the proof of Lemma 1 once n 4. B.2 Proof of Proposition 1 We can first compute X 1 i,j n Xi Xj F X k Aij ( Xi Xk F + Xj Xk F ) 1 (1 ϵ)np2q2 X k Aij ( Xi Xk F + Xj Xk F ). (29) Here, the first inequality comes from the triangle inequality, while the second one follows from Lemma 1. Now, invoking Lemma 1, which tells |Ak| (1 + ϵ)npq, gives X k Aij Xi Xk F = X j Ak Xi Xk F (1 + ϵ)npq X k Ai Xi Xk F = (1 + ϵ)npq X (i,k) A Xi Xk F . By symmetry, we conclude that X k Aij ( Xi Xk F + Xj Xk F ) 2(1 + ϵ)npq X (i,j) A Xi Xj F . (30) Furthermore, we claim the following bound for any X SO(d)n: X 1 i,j n Xi Xj F n 2 dist1(X, X ). (31) Combining (29), (30), (31), and the fact g(X) = P (i,j) A Xi Xj F establishes Proposition 1. Hence, it remains to show (31). First of all, according to the triangle inequality, we have 1 i,j n Xi Xj F X 1 i n Xi X F , (32) where X = 1 1 j n Xj can be taken as a diagonal matrix (since the Frobenius norm is invariant up to a global rotation) with its first (d 1) diagonal entries being positive. Finally, applying the following lemma to (32) provides (31). Lemma 7. For any A SO(d) and B = Diag(b1, . . . , bd) satisfying b1, . . . , bd 1 [0, 1] and bd [ 1, 1], we have Proof. It is equivalent to show that A B 2 F 1 4 A I 2 F . Since A SO(d), we simplify as d + A, I 4B + 2 B 2 F = X 1 i d 1 + Aii(1 4bi) + 2b2 i 0. (33) To prove (33) holds for all A SO(d), we choose A = argmin A SO(d) A, I 4B = PSO(d)(4B I). It is easy to see that A is also a diagonal matrix. For any 1 i d 1, we have 1 + Aii(1 4bi) + 2b2 i = 2(bi 1)2, Aii = 1; 2b2 i + 4bi, Aii = 1. So it is always nonnegative since bi 0 for any 1 i d 1. 
For the last summation in (33), on the one hand, if Add = 1, then 1 + Add(1 4bd) + 2b2 d = 2(bd 1)2 0. On the other hand, if Add = 1, then there exist another 1 k d 1 such that Akk = 1. So we have (1 + Akk(1 4bk) + 2b2 k) + (1 + Add(1 4bd) + 2b2 d) = 2b2 k + 2b2 d + 4(bk + bd) 0. The last inequality holds because |bd| bk. We complete the proof. B.3 Proof of Proposition 2 Based on (11) and our simplification that X i = I, 1 i n, we have h(X) h(X ) X I Oij I Oij F , X i Xj I . (34) Let A = E n I Oij I Oij F o , Zij = I Oij I Oij F 1Ac(ij) (1 p)q A, and Z Rnd nd collects each Zij in its (i, j)-th block. We can compute I Oij I Oij F , X i Xj I = Zij + (1 p)q A, X i Xj I (1 p)q A, X i Xj I (35) Zij, (Xi I) (Xj I) + (Xi I) + (Xj I) . To bound the first term in (35), we can proceed as (1 p)q A, X i Xj I = i=1 Xi + n I i=1 Xi + n I i=1 Xi + n I 2(1 p)qn A F (36) Here, the second equality is true since Pn i=1 Xi can be taken as a diagonal matrix. According to [39, Lemma A.1], we know that A F = E n I Oij I Oij F 2 for all d 2. On the other hand, we know that dist(X, X )2 = i=1 Xi I 2 F = 2 trace where the last inequality holds because n I Pn i=1 Xi is a nonnegative diagonal matrix. Therefore, we obtain (1 p)q A, X i Xj I (1 p)qn 2 dist(X, X )2 2 max i Xi I F i=1 Xi I F . To further bound the second term in (35), we can proceed as Zij, (Xi I) (Xj I) + (Xi I) + (Xj I) (X In)Z(X In) + 2 1 j n Zij, Xi I Z op X In 2 F + 2 X d Z op + 2 max 1 i n 1 i n Xi I F , where In Rnd d collects n identity matrix together and the last inequality holds because X In 2 F = P 1 i n Xi I 2 F P 1 i n( Xi F + I F ) Xi I F = 2 1 i n Xi I F . Using the randomness of Oij, we claim that with probability at least 1 4d/n, we have 8n(1 p)q log n, and max 1 i n 8n(1 p)q log n. (40) Combining the above bounds gives h(X) h(X ) 8n(1 p)q log n + (1 p)qn 2 max i Xi I F 1 i n Xi I F 16 dist1(X, X ), where the last inequality holds because we assume p2q2 = Ω log n n , maxi Xi I F = dist (X, X ) = O(p), and P 1 i n Xi I F = dist1(X, X ). Finally, it remains to show that the two inequalities in (40) holds with probability at least 1 4d/n. It is quick to verify that E(Zij) = 0, Zij op 1 + (1 p)q A F 2, and E(Z2) = Blk Diag j E(Z1j Z 1j), . . . , X j E(Znj Z nj) j E(Z1j Z 1j) In = n(1 p)q E (I Oij)(I Oij) I Oij 2 F (1 p)2q2AA In, where Blk Diag( ) means the block diagonal matrix, means the Kronecker product and In denotes the n-by-n identity matrix. Thus, we have E(Z2) op n(1 p)q. According to the Matrix Bernstein inequality [38], we have 8n(1 p)q log n 1 2nd exp 4n(1 p)q log n n(1 p)q + 2 p 8n(1 p)q log n/3 n . (42) Here, the last inequality holds because we assume 2 p 8n(1 p)q log n/3 n(1 p)q, which is true as long as p = Ω(log n/n). For any fixed 1 i n, a similar argument based on Matrix Bernstein inequality [38] shows that 8n(1 p)q log n 4n(1 p)q log n n(1 p)q + p 8n(1 p)q log n/3 n2 , (43) which implies Pr maxi P 8n(1 p)q log n 1 2d/n. C Full Proof of Theorem 4 C.1 Proof of Corollary 1 Combining the convexity of f and Theorem 3, we have 1 i n Xi X i F f(X ) f(X) D e f(X), X X E , e f(X) f(X). for all X SO(d)n satisfying dist (X, X ) = O(p). For any e Rf(X) Rf(X), we can further compute D e Rf(X), X X E = X D e Rf(Xi), PT Xi(Xi X i ) E 1 i n e Rf(Xi) F PT Xi(Xi X i ) F . On the one hand, notice that PT Xi(Xi X i ) =Xi X i PTXi(Xi X i ) =Xi X i (Xi X i ) + (Xi X i ) Xi /2 =Xi(Xi X i ) (Xi X i )/2, which implies PT X(Xi X i ) F = 1 2 Xi(Xi X i ) (Xi X i ) F 1 2 Xi X i 2 F . 
On the other hand, according to Lemma 1, with probability at least 1 O(1/n), for any i [n], e Rf(Xi) F e f(Xi) F = e fi,j(Xi) F |Ei| 2nq, where e fi,j(Xi) fi,j(Xi) satisfies e fi,j(Xi) F 1. Hence, we have D e Rf(X), X X E nq X 1 i n Xi X i 2 F npq for any X such that dist (X, X ) p 16. Invoking the above bounds into (44) yields the desired result D e Rf(X), X X E = D e f(X) e Rf(X), X X E npq 1 i n Xi X i F . C.2 Proof of Contraction Let us first present some preliminary results, which will be used in our later derivations. By noticing that e f(Xi) = P j Ei e fi,j(Xi) where e fi,j(Xi) fi,j(Xi), we define e g(Xi) = X e fi,j(Xi), e h(Xi) = X e fi,j(Xi). Recall that e Rg(Xi) = PTXi(e g(Xi)) and e Rh(Xi) = PTXi(e h(Xi)). Similarly, we have e f(Xi) = e g(Xi) + e h(Xi) and e Rf(Xi) = e Rg(Xi) + e Rh(Xi). Furthermore, the QR decomposition-based retraction satisfies the second-order boundedness property, i.e., there exists some M 1 such that Xk+1 i X i F = Retr Xk i µk e Rf(Xk i ) X i F Xk i µk e Rf(Xk i ) X i F + M µ2 k e Rf(Xk i ) 2 F . (45) Recall that e Rf(Xk i ) = e Rg(Xk i ) + e Rh(Xk i ), we have e Rf(Xk i ) F e Rg(Xk i ) F + e Rh(Xk i ) F 2δ0nq + (1 + ϵ)npq (1 + 2ϵ)npq, (46) where the second inequality comes from e Rg(Xk i ) F = e Rfi,j(Xi) e Rfi,j(Xi) F |Ai| (1 + ϵ)npq, and Lemma 10 and the last inequality is due to the choice δ0 ϵ2p2/50 (i.e., δ0 = O(p2)). Thus, by choosing µk = O( δk n ), and δ0 = O(p2), the second-order term M µ2 k e Rf(Xk i ) 2 F = O(p6q2), which is a very high-order error. In the following analysis, we will ignore this term to simplify our derivations. Using the above preliminaries and Corollary 1, we are ready to establish two key lemmas, which show that if Xk N k F N k , then Xk+1 N k+1 F (Lemma 8) and Xk+1 N k+1 (Lemma 9), respectively. This completes the proof of Theorem 4. Lemma 8. With high probability, suppose that Xk N k F N k , µk = O( δk δ0 = O(p2), and ξ0 = Θ( npqδ0), (47) then Xk+1 N k+1 F . Proof. By ignoring the high-order error term in (45), in order to bound Xk+1 X 2 F we can first compute Xk+1 X 2 F = X 1 i n Xk+1 i X i 2 F X 1 i n Xk i µk e Rf(Xk i ) X i 2 F = Xk X 2 F 2µk D e Rf(Xk), Xk X E + µ2 k e Rf(Xk) 2 F . Then, according to Corollary 1, we have D e Rf(Xk), Xk X E npq 1 i n Xk i X i F npq 1 i n Xk i X i 2 F 16δk Xk X 2 F , where the second inequality holds because Xk i X i F δk (i.e., Xk N k ). Combining the above two inequalities gives Xk+1 X 2 F 1 µk npq Xk X 2 F + µ2 k e Rf(Xk) 2 F Xk X 2 F + (1 + 2ϵ)2n3p2q2µ2 k, where the last inequality is due to (46). Since µk = O( δk n ) and ξ0 = Θ( npqδ0) (i.e., ξk = Θ( npqδk), we have n3p2q2µ2 k = O(pqξ2 k), which implies Xk+1 X 2 F 1 pq 16ξ2 k = 1 pq ξ2 k = ξ2 k+1. This completes the proof. Lemma 9. With high probability, suppose that Xk N k F N k , µk = O( δk δ0 = O(p2) and ξ0 = O( npqδ0), (48) then Xk+1 N k+1 . Proof. As stated at the beginning of Appendix B, we can assume without loss of generality that R = I and X i = I for all 1 i n. We divide the index set [n] into three sets I1 = i | Xk i I F δk , I2 = i | δk 4 < Xk i I F 3δk and I3 = i | 3δk 4 < Xk i I F δk For any i I1 S I2, we have Xk i µk e Rf(Xk i ) I F Xk i I + µk e Rf(Xk i ) 3δk 4 + 2µknpq δk+1, (49) where the last inequality holds because we choose µk = O( δk n ) δk 16n. It remains to consider the case i I3. Firstly, it is easy to see that dist(Xk, X )2 X i I2 S I3 Xk X (δk) 2 F δ2 k 16|I2 [ I3|. Note that we have |I2 S I3| dist(Xk,X )2 δ2 k/16 16ξ2 k δ2 k = O(npq) according to the assumption ξ0 = O( npqδ0). 
Hence, for any i I3, we have e f(Xi) = X Xi Xj Xi Xj F + X Xi Xj Xi Xj F + e h(Xi). (50) Since |I2 S I3| = O(npq), we can choose ξ0 properly such that |I2 S I3| ϵnpq. Then, we have |Ai \ I1| |I2 [ I3| ϵnpq and |Ai I1| |Ai| |Ai \ I1| (1 2ϵ)npq. Let us use e g1(Xi) to denote the first term on the RHS of (50). We have e g1(Xi) = X Xi Xj Xi Xj F = X Xi I Xi Xj F + I Xj Xi Xj F = σ(Xi I) + X I Xj Xi Xj F , (51) where σ = P j Ai I1 1 Xi Xj F . For the last term in the above equation, we have I Xj Xi Xj F I Xj F Xi Xj F δk 1 Xi Xj F = δkσ In addition, the projection of Xi I onto the cotangent space can be bounded as PT Xi(Xi I) F = (Xi I) 2 F /2 δ2 k/2, which implies that e Rg1(Xi) = PTXi(e g1(Xi)) satisfies e Rg1(Xi) σ(Xi I) F δkσ Then, the fact e Rf(Xi) = e Rg1(Xi) + PT Xi j Ai\I1 Xi Xj Xi Xj F + e Rh(Xi) implies e Rf(Xi) σ(Xi I) F e Rg1(Xi) σ(Xi I) F + |Ai \ I1| + e Rh(Xi) F 2 + ϵnpq + 5 p Next, motivated by the update of Re Sync, we can construct Xk i µk e Rf(Xk i ) I = (1 µkσ)(Xk i I) + µk(e Rf(Xi) σ(Xi I)), which implies Xk i µk e Rf(Xk i ) I F (1 µkσ) Xk i I + µk 2 + ϵnpq + 5 p (1 µkσ)δk + µk 2 + ϵnpq + 5 p (52) In order to further upper bound the above inequality, we can compute 1 Xi Xj F X 1 Xi I F + Xj I F 4 5δk |Ai I1| = 4(1 2ϵ)npq where the second inequality holds because Xi I F + Xj I F 5δk/4 as j I1. It implies 2δ0nq 2(1 2ϵ)npq By plugging the above bound into (52) and ignoring the high-order error term in (45), we complete the proof. Finally, we present Lemma 10 and its proof. Lemma 10. With high probability, the following holds for all X0 N 0 : e Rh(Xi) F 5 p 2δ0nq i [n]. (53) Proof of Lemma 10. Firstly, define Oij = Oij Xj X i , then e h(Xi) = X Xi Oij Xj Xi Oij Xj F = X I Oij Xj X i Xi Oij Xj F Xi = X I Oij I Oij F Xi The fact that X N 0 implies that Xi Xj F 2δ0, i.e., I Xj X i F 2δ0. For any Oij I 2δ0, we have I Oij I Oij F I Oij I Oij F I Oij I Oij F I Oij I Oij F I Oij I Oij F I Oij I Oij F where the last inequality holds because I Oij I Oij F I Oij I Oij F F = I Oij F 1 I Oij F 1 I Oij F I Oij F I Oij F I Oij F Oij Oij F = I Xj X i F I Oij F p and I Oij I Oij F I Oij I Oij F F = Oij Oij F I Oij F = I Xj X i F I Oij F p Let Φi = {j Ei/Ai | Oij I 2δ0}. According to the fact that |Ei/Ai| (1 + ϵ)nq and the randomness of Oij, it is easy to show that |Φi| 2δ0nq hold for all 1 i n with high probability. Thus, by splitting the sum P j Ei/Ai in to two parts: j Φi and j / Φi, we have I Oij I Oij F X I Oij I Oij F 2δ0nq. (54) Besides, since Oij is uniformly distributed on SO(d), according to Lemma A.1 in [39], we know that E n I Oij I Oij F o = c(d)I. Then, the matrix Bernstein s inequality [38] tells us that, with high probability, I Oij I Oij F |Ei/Ωi| c(d)I 2δ0nq. (55) This, together with (54), implies that I Oij I Oij F |Ei/Ωi| c(d)I The fact that e h(Xi) = P j Ei/Ai I Oij I Oij F Xi implies that e h(Xi) |Ei/Ωi| c(d)Xi F 5 p 2δ0nq. (56) We complete the proof by taking the projection operator e Rh(Xi) = PTXi(e h(Xi)) and the fact that PTXi(Xi) = 0.