# GYROGROUP BATCH NORMALIZATION

Published as a conference paper at ICLR 2025

Ziheng Chen¹, Yue Song¹, Xiao-Jun Wu² & Nicu Sebe¹
¹ University of Trento, ² Jiangnan University

Several Riemannian manifolds in machine learning, such as Symmetric Positive Definite (SPD), Grassmann, spherical, and hyperbolic manifolds, have been proven to admit gyro structures, thus enabling a principled and effective extension of Euclidean Deep Neural Networks (DNNs) to manifolds. Inspired by this, this study introduces a general Riemannian Batch Normalization (RBN) framework on gyrogroups, termed Gyro BN. We identify the minimal requirements that give Gyro BN theoretical control over sample statistics, namely pseudo-reduction and gyroisometric gyrations, which are satisfied by all the existing gyrogroups in machine learning. Besides, our Gyro BN incorporates several existing normalization methods, including the one on general Lie groups and different types of RBN on the non-group SPD geometry. Lastly, we instantiate our Gyro BN on the Grassmannian and hyperbolic spaces. Experiments on the Grassmannian and hyperbolic networks demonstrate the effectiveness of our Gyro BN. The code is available at https://github.com/GitZH-Chen/GyroBN.git.
1 INTRODUCTION

Deep Neural Networks (DNNs) on Riemannian manifolds have gained increasing interest in various machine learning applications, such as computer vision (Huang et al., 2017; Huang & Van Gool, 2017; Huang et al., 2018; Skopek et al., 2019; Wang et al., 2022b;a; Chen et al., 2023c; Gao et al., 2023; Wang et al., 2024b; Chen et al., 2025), natural language processing (Ganea et al., 2018; Shimizu et al., 2020), drone classification (Brooks et al., 2019; Chen et al., 2024a), human neuroimaging (Pan et al., 2022; Kobler et al., 2022a; Ju et al., 2024; Wang et al., 2024a), medical imaging (Huang et al., 2019; Chakraborty et al., 2020), and node and graph classification (Chami et al., 2019; Dai et al., 2021; Zhao et al., 2023; Chen et al., 2023b; Nguyen et al., 2024; Chen et al., 2024c). As core techniques in DNNs, normalization techniques (Ioffe & Szegedy, 2015; Ba et al., 2016; Ulyanov et al., 2016; Wu & He, 2018; Chen et al., 2023a) have also been extended into different geometries.

However, most existing Riemannian normalization methods are designed for a selected few geometries or fail to normalize the sample statistics. Brooks et al. (2019); Kobler et al. (2022b;a) introduced Riemannian Batch Normalization (RBN) on SPD manifolds under the specific Affine-Invariant Metric (AIM). Chakraborty (2020) generalized this idea and proposed a Riemannian normalization framework over homogeneous spaces. However, this approach cannot generally normalize the sample statistics. A similar formulation and issue can also be found in Lou et al. (2020, Alg. 2) and Bdeir et al. (2024, Sec. 4.2). Besides, Chakraborty (2020) also developed a Riemannian normalization for matrix Lie groups. Although this approach can control first- and second-order statistics, it is limited to a specific type of distance (Chakraborty, 2020, Sec. 3.2). Chen et al. (2024b) took one step further and developed RBN over general Lie groups, referred to as Lie BN.
Although Lie BN can effectively normalize sample statistics, many geometries do not admit a group structure. In summary, the existing RBN methods fail to normalize manifold-valued samples in a principled manner.

Recently, building Riemannian networks based on gyro structures has shown notable success across various geometries, including Symmetric Positive Definite (SPD) (Nguyen, 2022a;b), Grassmannian (Nguyen, 2022b), hyperbolic (Ganea et al., 2018), and spherical (Skopek et al., 2019) manifolds. Gyro structures, natural extensions of vector structures, offer powerful mathematical tools for building Riemannian neural networks. Moreover, gyrogroups naturally encompass Lie groups and extend to non-group geometries. For instance, the SPD manifold under AIM, as well as the Grassmannian, hyperbolic, and spherical manifolds, do not form Lie groups but do form gyrogroups.

Correspondence to ziheng_ch@163.com

Figure 1: Illustration of Gyro BN on the Grassmannian and hyperbolic spaces. As $\mathrm{Gr}(1, 3)$ is homeomorphic to the real projective space $\mathbb{RP}^2$, we illustrate $\mathrm{Gr}(1, 3)$ as the unit hemisphere with antipodal points identified. We set the bias to $I_{1,3} = (1, 0, 0)^\top$ and the scaling scalar to 0.2 for better illustration. For the hyperbolic space, we visualize the Gyro BN on the Poincaré ball model $\mathbb{P}^3_{-1}$, which is the interior of the unit sphere in $\mathbb{R}^3$. We set the bias and shift to the zero vector and 0.7, respectively.

Table 1: Comparison of previous RBN methods with our Gyro BN, where M and V denote the sample mean and variance. Compared with the existing RBN methods, our Gyro BN can normalize sample statistics in a principled manner. Besides, several previous RBN methods with theoretical control over sample statistics are special cases of our Gyro BN.
| Method | Controllable Statistics | Applied Geometries | Incorporated by Gyro BN |
| --- | --- | --- | --- |
| SPDBN (Brooks et al., 2019) | M | SPD manifolds under AIM | |
| SPDBN (Kobler et al., 2022b) | M+V | SPD manifolds under AIM | |
| SPDDSMBN (Kobler et al., 2022a) | M+V | SPD manifolds under AIM | |
| Manifold Norm (Chakraborty, 2020, Algs. 1-2) | N/A | Riemannian homogeneous space | |
| Manifold Norm (Chakraborty, 2020, Algs. 3-4) | M+V | Matrix Lie groups under the distance $d(X, Y) = \lVert \mathrm{mlog}(X^{-1}Y) \rVert$ | |
| RBN (Lou et al., 2020, Alg. 2) | N/A | Geodesically complete manifolds | |
| Lie BN (Chen et al., 2024b) | M+V | General Lie groups | |
| Gyro BN | M+V | Pseudo-reductive gyrogroups with gyroisometric gyrations | N/A |

Based on the above analysis, this paper introduces Gyro BN, an RBN framework for general gyrogroups. We use the gyro addition, subtraction, and scalar product to extend the centering (vector subtraction), biasing (vector addition), and scaling (vector scalar product) of the Euclidean BN to manifolds in a principled manner. For broader applicability and in-depth theoretical analysis, we relax the existing gyrogroup into a weaker structure, termed the pseudo-reductive gyrogroup. Our theoretical analysis shows that pseudo-reductive gyrogroups with gyroisometric gyrations give Gyro BN theoretical control over sample statistics. More importantly, these requirements are satisfied by all the existing gyrogroups in machine learning. Therefore, compared with the existing RBN methods, our Gyro BN can normalize sample statistics in a principled manner. Besides, several existing RBN methods are incorporated by our Gyro BN as special cases, including the Lie BN on Lie groups, such as three SPD Lie groups and rotation matrices, and different types of RBN on the non-group SPD geometry. We provide a detailed comparison in Tab. 1.

Empirically, we instantiate our Gyro BN on the Grassmannian and hyperbolic spaces, as illustrated in Fig. 1. To the best of our knowledge, our Grassmannian Gyro BN is the first implementation of Grassmannian RBN.
Experiments on the Grassmannian and hyperbolic networks validate the effectiveness of our Gyro BN.

Our main contributions are summarized as follows:
(1) We propose the pseudo-reductive gyrogroup, a relaxed structure of the gyrogroup, and present relevant theoretical analyses;
(2) We identify the requirements that give our Gyro BN theoretical control over sample statistics, i.e., pseudo-reduction and gyroisometric gyrations;
(3) We propose a Gyro BN framework for RBN over general gyrogroups, which can be instantiated on various geometries in a plug-and-play manner;
(4) We implement our Gyro BN on the Grassmannian and hyperbolic spaces. Extensive experiments on popular Grassmannian and hyperbolic networks validate the effectiveness of our framework.

Main theoretical results: Def. 3.1 relaxes the existing gyrogroup into the pseudo-reductive gyrogroup. Prop. 3.2 reveals that the non-reductive Grassmannian gyrogroup is, in fact, pseudo-reductive. Thms. 3.3 and 3.5 highlight that the invariance of the gyronorm under gyrations is crucial for enabling several operators to act as gyroisometries. Prop. 3.6 confirms that all gyrogroups listed in Tab. 2 are pseudo-reductive and their gyrations are gyroisometries. These two properties are essential for enabling Gyro BN in Alg. 1 to normalize sample statistics, as formalized in Thm. 4.1. Sec. 5 discusses how several existing RBN methods are special cases of our Gyro BN. Lastly, Sec. 6 instantiates our Gyro BN on the Grassmannian and hyperbolic spaces, where Prop. 6.1 discusses the efficient implementation on the Grassmannian. Due to page limits, all the proofs are presented in App. G.

2 PRELIMINARIES

This section recaps gyrogroups (Ungar, 2009) and several concrete gyrogroups in machine learning.

Definition 2.1 (Gyrogroups (Ungar, 2009)).
Given a nonempty set $G$ with a binary operation $\oplus : G \times G \to G$, $\{G, \oplus\}$ forms a gyrogroup if its binary operation satisfies the following axioms for any $a, b, c \in G$:

(G1) There is at least one element $e \in G$, called a left identity (or neutral element), such that $e \oplus a = a$.
(G2) There is an element $\ominus a \in G$, called a left inverse of $a$, such that $\ominus a \oplus a = e$.
(G3) There is an automorphism $\mathrm{gyr}[a, b] : G \to G$ for each $a, b \in G$ such that
$$a \oplus (b \oplus c) = (a \oplus b) \oplus \mathrm{gyr}[a, b]c \quad \text{(Left Gyroassociative Law)}. \qquad (1)$$
The automorphism $\mathrm{gyr}[a, b]$ is called the gyroautomorphism, or the gyration of $G$ generated by $a, b$.
(G4) Left reduction law: $\mathrm{gyr}[a, b] = \mathrm{gyr}[a \oplus b, b]$.

Definition 2.2 (Gyrocommutative Gyrogroups (Ungar, 2009)). A gyrogroup $\{G, \oplus\}$ is gyrocommutative if it satisfies
$$a \oplus b = \mathrm{gyr}[a, b](b \oplus a) \quad \text{(Gyrocommutative Law)}. \qquad (2)$$

Definition 2.3 (Nonreductive Gyrogroups (Nguyen, 2022a)). A groupoid $\{G, \oplus\}$ is a nonreductive gyrogroup if it satisfies axioms (G1), (G2), and (G3).

Intuitively, gyrogroups are natural generalizations of groups. Unlike groups, gyrogroups are nonassociative but have gyroassociativity characterized by gyrations. Since all gyrations in any (Lie) group are the identity map, every (Lie) group is automatically a gyrogroup.

As shown by Nguyen & Yang (2023), given $P$, $Q$ and $R$ in a manifold $\mathcal{M}$ and $t \in \mathbb{R}$, the gyro structures can be defined as:

Gyro addition: $P \oplus Q = \mathrm{Exp}_P\left(\mathrm{PT}_{E \to P}\left(\mathrm{Log}_E(Q)\right)\right)$, (3)
Gyro scalar product: $t \odot P = \mathrm{Exp}_E\left(t\,\mathrm{Log}_E(P)\right)$, (4)
Gyro inverse: $\ominus P = -1 \odot P = \mathrm{Exp}_E\left(-\mathrm{Log}_E(P)\right)$, (5)
Gyration: $\mathrm{gyr}[P, Q]R = \left(\ominus(P \oplus Q)\right) \oplus \left(P \oplus (Q \oplus R)\right)$, (6)
Gyro inner product: $\langle P, Q \rangle_{gr} = \langle \mathrm{Log}_E(P), \mathrm{Log}_E(Q) \rangle_E$, (7)
Gyro norm: $\|P\|_{gr} = \sqrt{\langle P, P \rangle_{gr}}$, (8)
Gyrodistance: $d_{\mathrm{gry}}(P, Q) = \|\ominus P \oplus Q\|_{gr}$, (9)

where $E$ is the gyro identity element, $\mathrm{Exp}$ and $\mathrm{PT}$ are the Riemannian exponential map and parallel transport, and $\mathrm{Log}_E$ and $\langle \cdot, \cdot \rangle_E$ are the Riemannian logarithm and metric at $E$. A bijection $\omega : G \to G$ is called a gyroisometry if it preserves the gyrodistance:
$$d_{\mathrm{gry}}(\omega(P), \omega(Q)) = d_{\mathrm{gry}}(P, Q). \qquad (10)$$

Note that the gyro scalar product is required for a gyrogroup to form a gyrovector space (Nguyen, 2022b).
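To make the axioms concrete, the following minimal sketch checks (G1)-(G4) and the gyrocommutative law numerically on the unit Poincaré ball, where the gyro addition is the Möbius addition. It assumes curvature $K = -1$ by default, uses only the Python standard library, and all helper names (`mobius_add`, `gyr`, `close`) are hypothetical rather than taken from the authors' code.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def neg(x):
    """Gyro inverse on the ball: Eq. (5) reduces to plain negation here."""
    return [-a for a in x]


def mobius_add(x, y, c=1.0):
    """Gyro addition for curvature K = -c < 0 (Moebius addition)."""
    xy, xx, yy = dot(x, y), dot(x, x), dot(y, y)
    den = 1 + 2 * c * xy + c * c * xx * yy
    cx = (1 + 2 * c * xy + c * yy) / den
    cy = (1 - c * xx) / den
    return [cx * a + cy * b for a, b in zip(x, y)]


def gyr(p, q, r, c=1.0):
    """Gyration gyr[p, q] r, computed literally from Eq. (6)."""
    return mobius_add(neg(mobius_add(p, q, c)),
                      mobius_add(p, mobius_add(q, r, c), c), c)


def close(u, v, tol=1e-12):
    return all(abs(a - b) < tol for a, b in zip(u, v))


# Three arbitrary points inside the unit ball (illustrative values).
a, b, w = [0.1, 0.2, -0.3], [-0.4, 0.1, 0.2], [0.3, -0.2, 0.1]
e = [0.0, 0.0, 0.0]

g1 = close(mobius_add(e, a), a)                             # (G1) left identity
g2 = close(mobius_add(neg(a), a), e)                        # (G2) left inverse
g3 = close(mobius_add(a, mobius_add(b, w)),
           mobius_add(mobius_add(a, b), gyr(a, b, w)))      # (G3) Eq. (1)
g4 = close(gyr(a, b, w), gyr(mobius_add(a, b), b, w))       # (G4) left reduction
gc = close(mobius_add(a, b), gyr(a, b, mobius_add(b, a)))   # Eq. (2)
print(g1, g2, g3, g4, gc)
```

A numerical check on a few points is of course not a proof; it only illustrates what each axiom asserts for one concrete gyrogroup.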
Although this paper only involves gyrogroups, we also recap gyrovector spaces in App. C.

Several geometries in machine learning admit a gyro structure defined in Eq. (3)-Eq. (9) and form a (nonreductive) gyrogroup, such as the Affine-Invariant Metric (AIM) (Pennec et al., 2006), Log-Euclidean Metric (LEM) (Arsigny et al., 2005), and Log-Cholesky Metric (LCM) (Lin, 2019) on the SPD manifold $\mathcal{S}^n_{++}$ (Nguyen, 2022a), the Orthonormal Basis (ONB) perspective $\mathrm{Gr}(p, n)$ (Bendokat et al., 2024) and projector perspective $\widetilde{\mathrm{Gr}}(p, n)$ (Bendokat et al., 2024) for the Grassmannian (Nguyen, 2022a; Nguyen & Yang, 2023), the Poincaré ball $\mathbb{P}^n_K$ for the hyperboloid (Ungar, 2009; Ganea et al., 2018), and the projected hypersphere $\mathbb{D}^n_K$ for the hypersphere (Skopek et al., 2019). Besides, the gyrogroups proposed by Nguyen (2022b) on the SPD manifold under the LEM and LCM coincide with the Lie groups proposed by Arsigny et al. (2005); Lin (2019). We denote $\mathcal{M}_K$ as $\mathbb{P}^n_K$ ($K < 0$), $\mathbb{D}^n_K$ ($K > 0$) and $\mathbb{R}^n$ ($K = 0$), respectively. $\mathcal{M}_K$ is known as the Constant Curvature Space (CCS) (Do Carmo & Flaherty Francis, 1992). We summarize all the necessary gyro properties in Tab. 2.

Table 2: Gyrogroup properties on several geometries. Related notations are defined in App. C.3.2.

| Geometry | Symbol | $P \oplus Q$ or $x \oplus y$ | $E$ | $\ominus P$ or $\ominus x$ | Lie group | Gyrogroup | References |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AIM on $\mathcal{S}^n_{++}$ | $\oplus^{AI}$ | $P^{\frac{1}{2}} Q P^{\frac{1}{2}}$ | $I_n$ | $P^{-1}$ | No | Yes | (Nguyen, 2022b) |
| LEM on $\mathcal{S}^n_{++}$ | $\oplus^{LE}$ | $\mathrm{mexp}(\mathrm{mlog}(P) + \mathrm{mlog}(Q))$ | $I_n$ | $P^{-1}$ | Yes | Yes | (Arsigny et al., 2005) (Nguyen, 2022b) |
| LCM on $\mathcal{S}^n_{++}$ | $\oplus^{LC}$ | $\psi_{LC}^{-1}(\psi_{LC}(P) + \psi_{LC}(Q))$ | $I_n$ | $\psi_{LC}^{-1}(-\psi_{LC}(P))$ | Yes | Yes | (Lin, 2019) (Nguyen, 2022b) (Chen et al., 2024e) |
| $\widetilde{\mathrm{Gr}}(p, n)$ | $\oplus^{\widetilde{Gr}}$ | $\mathrm{mexp}(\Omega)\, Q\, \mathrm{mexp}(-\Omega)$ | $\tilde{I}_{p,n}$ | $\mathrm{mexp}(-\Omega)\, \tilde{I}_{p,n}\, \mathrm{mexp}(\Omega)$ | No | Non-reductive | (Nguyen, 2022a) |
| $\mathrm{Gr}(p, n)$ | $\oplus^{Gr}$ | $\mathrm{mexp}(\Omega) V$ | $I_{p,n}$ | $\mathrm{mexp}(-\Omega) I_{p,n}$ | No | Non-reductive | (Nguyen & Yang, 2023) |
| $\mathcal{M}_K$ | $\oplus_K$ | $\frac{(1 - 2K\langle x, y \rangle - K \lVert y \rVert^2)x + (1 + K \lVert x \rVert^2)y}{1 - 2K\langle x, y \rangle + K^2 \lVert x \rVert^2 \lVert y \rVert^2}$ | $0$ | $-x$ | No (Yes for $K = 0$) | Yes | (Ungar, 2009) (Ganea et al., 2018) (Skopek et al., 2019) |

3 PSEUDO-REDUCTIVE GYROGROUPS

As shown in Tab. 2, the Grassmannian does not satisfy left reduction (G4) in Def. 2.1.
However, we find that a relaxed version of (G4) is necessary to guarantee the sample normalization over gyrogroups. Therefore, this section proposes an intermediate between the gyrogroup and the nonreductive one, called pseudo-reductive gyrogroups. Unless specifically emphasized, the gyro structure in this paper is defined as Eq. (3)-Eq. (9).

Given a gyrogroup $\{G, \oplus\}$, the left gyrotranslation by $P \in G$ is defined as
$$L_P : G \to G, \text{ with } L_P(Q) = P \oplus Q, \ \forall Q \in G. \qquad (11)$$
If any left gyrotranslation is a gyroisometry, we can use gyrotranslation to center samples. Nguyen & Yang (2023) show that any left gyrotranslation on the SPD and Grassmannian manifolds is a gyroisometry. However, the proof relies on the left cancellation law of gyrogroups, which does not hold for nonreductive gyrogroups. Therefore, the proof is questionable for the Grassmannian. This section proposes an intermediate structure, referred to as pseudo-reductive gyrogroups, which supports the left cancellation law in general and, therefore, the gyroisometry of left gyrotranslation. We illustrate the derivation logic in Fig. 2.

[Figure 2 diagram nodes: Axiom (G1-3); Pseudo-reduction; Left reduction (G4); Left cancellation law; Left gyrotranslation law; Invariance of gyronorm under any gyration; Gyrocommutativity; Gyroisometries of any gyration and left gyrotranslation; Gyroisometry of the gyroinverse.]

Figure 2: The conceptual comparison of the derivation logic of gyroisometries in our work against previous work (Nguyen & Yang, 2023), where the left gyrotranslation law is presented in Lem. G.1. The previous work proves the results on the SPD and Grassmannian manifolds in a case-by-case manner. In contrast, we relax the left reduction into pseudo-reduction and give a general theoretical framework. Our framework also corrects the proof for the Grassmannian cases.

Definition 3.1 (Pseudo-reductive Gyrogroups).
A groupoid $\{G, \oplus\}$ is a pseudo-reductive gyrogroup if it satisfies axioms (G1), (G2), (G3) and the following pseudo-reductive law:
$$\mathrm{gyr}[X, P] = \mathbb{1}, \text{ for any left inverse } X \text{ of } P \text{ in } G, \qquad (12)$$
where $\mathbb{1}$ is the identity map.

Eq. (12) can be intuitively viewed as the intermediate between reduction and non-reduction. For gyrogroups, Eq. (12) can be directly obtained from the left gyroassociativity (G3) and reduction (G4) (Ungar, 2009, p. 12). However, there is no theoretical guarantee that Eq. (12) holds for general non-reductive gyrogroups. Therefore, we name Eq. (12) pseudo-reduction. Nevertheless, the specific non-reductive Grassmannian is indeed pseudo-reductive.

Proposition 3.2. $\mathrm{Gr}(p, n)$ and $\widetilde{\mathrm{Gr}}(p, n)$ form pseudo-reductive and gyrocommutative gyrogroups.

Our pseudo-reductive gyrogroup naturally generalizes the vanilla gyrogroup, as it shares most of the basic properties of gyrogroups (Ungar, 2009, Thms. 1.13-1.14), which are detailed in Thm. D.1. The most relevant property in Thm. D.1 is the left cancellation law, one of the key prerequisites for gyrotranslation to be a gyroisometry. Note that the left cancellation on the gyrogroup comes from left gyroassociativity and Eq. (12) (Ungar, 2009, p. 12). Therefore, left cancellation does not generally hold for non-reductive gyrogroups, but does hold in pseudo-reductive gyrogroups.

Next, we point out an iff statement about gyroisometry, which will be useful in the following.

Theorem 3.3. Given a pseudo-reductive gyrogroup $\{G, \oplus\}$, $\mathrm{gyr}[P, Q]$ preserves the gyronorm for any $P, Q \in G$ iff $\mathrm{gyr}[P, Q]$ is a gyroisometry for any $P, Q \in G$.

We find that the isometry of any gyration is a key prerequisite for other gyro operators to be isometries.

Definition 3.4 (Gyro Left-invariance). The gyrodistance or gyrogroup is gyro left-invariant if any left gyrotranslation is a gyroisometry.

Theorem 3.5 (Gyroisometries).
Given a pseudo-reductive gyrogroup $\{G, \oplus\}$ with any $\mathrm{gyr}[\cdot, \cdot]$ as a gyroisometry, we have the following:
1. The gyrodistance (Eq. (9)) is gyro left-invariant;
2. Symmetry of the gyrodistance: $\forall P, Q \in G$, $d_{\mathrm{gry}}(P, Q) = d_{\mathrm{gry}}(Q, P)$;
3. If $\{G, \oplus\}$ is gyrocommutative, then the gyroinverse is a gyroisometry.

Proposition 3.6. For every (pseudo-reductive) gyrogroup in Tab. 2, the gyrodistance is identical to the geodesic distance (and is therefore symmetric). The gyroinverse, any gyration, and any left gyrotranslation are gyroisometries.

Credit of the proof. For the Grassmannian and SPD manifolds, the isometries of gyration, gyroinverse, and left gyrotranslation have been proven by Nguyen & Yang (2023, Thms. 2.12-2.14 and 2.16-2.18). Nevertheless, in the proof of Thms. 2.12-2.14, the authors view the non-reductive Grassmannian as a gyrogroup, as they use the left cancellation property of the gyrogroup. Fortunately, our Prop. 3.2 and Thm. D.1 show that the Grassmannian still enjoys left cancellation. Therefore, all the results on isometries in their Thms. 2.12-2.14 are correct. Moreover, these results on the SPD and Grassmannian manifolds can be directly obtained from our Thms. 3.3 and 3.5, as these gyrogroups are pseudo-reductive and the gyrations preserve the gyronorm. Besides, we show the expressions for the gyrodistance of all gyrogroups in Tab. 2, and the associated gyroisometries on $\mathcal{M}_K$, both of which are non-trivial. The detailed proof is presented in App. G.4.

4 GYROBN ON GENERAL PSEUDO-REDUCTIVE GYROGROUPS

Prop. 3.6 shows that several geometries enjoy isometric gyrotranslation, offering a theoretical foundation for normalizing samples over gyrogroups in a principled manner. Inspired by this, this section develops Riemannian Batch Normalization (RBN) for general pseudo-reductive gyrogroups, referred to as Gyro BN. In the following, $\{\mathcal{M}, \oplus\}$ is assumed to be a pseudo-reductive gyrogroup with the gyro structure defined as Eq. (3)-Eq. (9).¹
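The isometry statements of Thm. 3.5 and Prop. 3.6 can be spot-checked numerically for one of the listed geometries. The sketch below assumes the unit Poincaré ball ($K = -1$) and compares the gyrodistance of Eq. (9) with the well-known closed-form Poincaré geodesic distance; the function names and sample points are illustrative assumptions, not part of the paper.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def neg(x):
    return [-a for a in x]


def mobius_add(x, y):
    """Gyro addition on the unit Poincare ball (curvature K = -1)."""
    xy, xx, yy = dot(x, y), dot(x, x), dot(y, y)
    den = 1 + 2 * xy + xx * yy
    return [((1 + 2 * xy + yy) * a + (1 - xx) * b) / den for a, b in zip(x, y)]


def gyro_dist(p, q):
    """Gyrodistance of Eq. (9): the gyronorm of (-p) + q, i.e. 2 * artanh(||.||)."""
    d = mobius_add(neg(p), q)
    return 2.0 * math.atanh(math.sqrt(dot(d, d)))


def geodesic_dist(p, q):
    """Closed-form geodesic distance on the unit Poincare ball."""
    diff = [a - b for a, b in zip(p, q)]
    ratio = dot(diff, diff) / ((1 - dot(p, p)) * (1 - dot(q, q)))
    return math.acosh(1 + 2 * ratio)


# Arbitrary points inside the unit ball (illustrative values).
p, q, b = [0.3, -0.1, 0.2], [-0.2, 0.4, 0.1], [0.5, 0.2, -0.3]
tol = 1e-9

same_as_geodesic = abs(gyro_dist(p, q) - geodesic_dist(p, q)) < tol
symmetric = abs(gyro_dist(p, q) - gyro_dist(q, p)) < tol
left_invariant = abs(gyro_dist(mobius_add(b, p), mobius_add(b, q))
                     - gyro_dist(p, q)) < tol
inverse_isometry = abs(gyro_dist(neg(p), neg(q)) - gyro_dist(p, q)) < tol
print(same_as_geodesic, symmetric, left_invariant, inverse_isometry)
```

Left invariance is the property that makes centering by $\ominus M \oplus (\cdot)$ meaningful: translating the whole batch does not change any pairwise gyrodistance.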
4.1 EUCLIDEAN BATCH NORMALIZATION REVISITED

As the core operations of different Euclidean normalization variants (Ioffe & Szegedy, 2015; Ba et al., 2016; Ulyanov et al., 2016; Wu & He, 2018) are similar, this paper focuses on BN. Given a batch of activations $\{x_{i \ldots N}\}$, the core operations of BN can be expressed as:
$$\forall i \le N, \quad x_i \leftarrow \gamma \frac{x_i - \mu}{\sqrt{v^2 + \epsilon}} + \beta \qquad (13)$$
where $\mu$, $v^2$, $\gamma$, and $\beta$ are the sample mean, sample variance, scaling parameter, and biasing parameter, respectively.

¹In our Gyro BN, $\odot$ is not required to comply with the axioms of the gyrovector space (Def. C.4).

To generalize the Euclidean BN into gyrogroups, we first define sample mean, sample variance, centering, biasing, and scaling over gyrogroups. Then, we introduce our Gyro BN framework with a theoretical analysis of the ability to normalize sample statistics.

We define the gyromean as the Fréchet mean under the gyrodistance:
$$M = \mathrm{FM}(\{P_i\}) = \operatorname*{argmin}_{Q \in \mathcal{M}} \frac{1}{N} \sum_{i=1}^{N} d^2_{\mathrm{gry}}(P_i, Q) \qquad (14)$$
The gyrovariance is the Fréchet variance, i.e., the minimum value of the objective on the right-hand side of Eq. (14). When the distance in Eq. (14) is the geodesic distance, the Fréchet mean and variance are known as the Riemannian mean and variance. Although the gyromean and gyrovariance are not necessarily the same as the Riemannian ones, Prop. 3.6 indicates the equivalence for the gyrogroups in Tab. 2.

An easy computation shows that the centering and biasing in the Euclidean BN (Eq. (13)) can be viewed as gyro addition (Eq. (3)) in $\mathbb{R}^n$, and the scaling can be viewed as the gyro scalar product (Eq. (4)), i.e., scaling in the tangent space at the identity element. Inspired by this, we define the normalization over gyrogroups by gyro addition and gyro scalar product.
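The computation referred to above can be spelled out as follows. This is a sketch under two assumptions that the text does not state explicitly: $\mathbb{R}^n$ carries the standard Euclidean metric with identity element $E = 0$, and the scaling parameter $\gamma$ is a scalar (Euclidean BN usually learns a per-feature $\gamma$, in which case the match holds only feature-wise).

```latex
\begin{aligned}
&\mathrm{Exp}_P(V) = P + V, \qquad \mathrm{Log}_E(Q) = Q - E = Q, \qquad \mathrm{PT}_{E \to P}(V) = V, \\
&P \oplus Q = \mathrm{Exp}_P\big(\mathrm{PT}_{E \to P}(\mathrm{Log}_E(Q))\big) = P + Q, \qquad
 \ominus P = -P, \qquad
 t \odot P = \mathrm{Exp}_E\big(t\,\mathrm{Log}_E(P)\big) = tP, \\
&\beta \oplus \Big( \tfrac{\gamma}{\sqrt{v^2 + \epsilon}} \odot \big( \ominus \mu \oplus x_i \big) \Big)
 = \gamma\,\frac{x_i - \mu}{\sqrt{v^2 + \epsilon}} + \beta .
\end{aligned}
```

The last line is exactly the right-hand side of Eq. (13), so centering, scaling, and biasing correspond to $\ominus \mu \oplus (\cdot)$, $t \odot (\cdot)$, and $\beta \oplus (\cdot)$, respectively.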
Given a batch of activations $\{P_{i \ldots N} \in \mathcal{M}\}$, we define the core operations of Gyro BN as
$$\forall i \le N, \quad \tilde{P}_i = \underbrace{B \oplus}_{\text{Biasing}} \Big( \underbrace{\tfrac{s}{\sqrt{v^2 + \epsilon}} \odot}_{\text{Scaling}} \big( \underbrace{\ominus M \oplus P_i}_{\text{Centering}} \big) \Big) \qquad (15)$$
where $M \in \mathcal{M}$ and $v^2$ are the gyromean and gyrovariance, $B \in \mathcal{M}$ is the biasing parameter, $s \in \mathbb{R}$ is the scaling parameter, and $\epsilon$ is a small value for numerical stability.

Theorem 4.1 (Homogeneity). Suppose $\{\mathcal{M}, \oplus\}$ is a pseudo-reductive gyrogroup with any gyration $\mathrm{gyr}[\cdot, \cdot]$ as a gyroisometry. For $N$ samples $\{P_{i \ldots N} \in \mathcal{M}\}$, we have the following properties:

Homogeneity of gyromean: $\mathrm{FM}(\{B \oplus P_i\}) = B \oplus \mathrm{FM}(\{P_i\}), \ \forall B \in \mathcal{M}$, (16)
Homogeneity of dispersion from $E$: $\frac{1}{N} \sum_{i=1}^{N} d^2_{\mathrm{gry}}(t \odot P_i, E) = \frac{t^2}{N} \sum_{i=1}^{N} d^2_{\mathrm{gry}}(P_i, E)$. (17)

The most important property of the Euclidean BN (Eq. (13)) lies in its ability to normalize the data distribution by controlling the sample mean and variance. Similarly, our formulation in Eq. (15) can also normalize the gyromean and gyrovariance. Specifically, given a pseudo-reductive gyrogroup with isometric gyrations, Eq. (16) indicates that centering and biasing translate the gyromean, while Eq. (17) indicates that scaling rescales the sample variance, since after centering the resulting gyromean is the identity element $E$.

To finalize our Gyro BN, we define the running mean updates over gyrogroups as the binary barycenter based on the gyrodistance:
$$\mathrm{Bar}_\eta(P_1, P_2) = \operatorname*{argmin}_{Q \in \mathcal{M}} \ \eta\, d^2_{\mathrm{gry}}(P_1, Q) + (1 - \eta)\, d^2_{\mathrm{gry}}(P_2, Q), \text{ with } \eta \in [0, 1]. \qquad (18)$$
Notably, when the gyrodistance is identical to the geodesic distance, the binary barycenter can be computed along the geodesic connecting the two points.

With all the above ingredients, the general framework for our Gyro BN is presented in Alg. 1. Thm. 4.1 indicates that given a pseudo-reductive gyrogroup with any gyration as a gyroisometry, our Gyro BN enjoys a theoretical guarantee of control over the gyro statistics. Specifically, for the gyrogroups in Tab. 2, our Gyro BN can control the gyromean and gyrovariance.
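As an illustration of Eq. (15)-(18), the following sketch runs a Gyro BN forward pass on the unit Poincaré ball ($K = -1$) and checks that the output gyromean and gyrovariance behave as Thm. 4.1 predicts. It is not the authors' implementation: the helper names, the Karcher-flow solver with a fixed iteration count, and the sample batch are all assumptions made for this example.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def neg(x):
    return [-a for a in x]


def add(x, y):
    """Moebius gyro addition on the unit Poincare ball (K = -1)."""
    xy, xx, yy = dot(x, y), dot(x, x), dot(y, y)
    den = 1 + 2 * xy + xx * yy
    return [((1 + 2 * xy + yy) * a + (1 - xx) * b) / den for a, b in zip(x, y)]


def log0(x):
    """Riemannian logarithm at the origin, in Euclidean coordinates."""
    n = math.sqrt(dot(x, x))
    return list(x) if n == 0 else [math.atanh(n) * a / n for a in x]


def exp0(v):
    """Riemannian exponential at the origin, inverse of log0."""
    n = math.sqrt(dot(v, v))
    return list(v) if n == 0 else [math.tanh(n) * a / n for a in v]


def scal(t, x):
    """Gyro scalar product, Eq. (4)."""
    return exp0([t * a for a in log0(x)])


def dist(p, q):
    """Gyrodistance, Eq. (9)."""
    d = add(neg(p), q)
    return 2 * math.atanh(math.sqrt(dot(d, d)))


def frechet_mean(xs, iters=200):
    """Gyromean of Eq. (14) via a Karcher flow written with gyro operations."""
    m = [0.0] * len(xs[0])
    for _ in range(iters):
        logs = [log0(add(neg(m), x)) for x in xs]
        step = [sum(col) / len(xs) for col in zip(*logs)]
        m = add(m, exp0(step))
    return m


def variance(xs, m):
    return sum(dist(m, x) ** 2 for x in xs) / len(xs)


def gyro_bn(xs, bias, s, eps=1e-5):
    """Eq. (15): center by the gyromean, scale at the identity, then bias."""
    m = frechet_mean(xs)
    t = s / math.sqrt(variance(xs, m) + eps)
    return [add(bias, scal(t, add(neg(m), x))) for x in xs]


def barycenter(p1, p2, eta):
    """Eq. (18), computed along the geodesic from p2 to p1 (weight eta on p1)."""
    return add(p2, scal(eta, add(neg(p2), p1)))


batch = [[0.30, 0.10], [-0.20, 0.35], [0.05, -0.40], [0.25, 0.30]]
bias, s = [0.2, -0.1], 0.5
out = gyro_bn(batch, bias, s)
out_mean = frechet_mean(out)
out_var = variance(out, out_mean)
print(out_mean, out_var)
```

Under these assumptions the output gyromean should coincide with the bias `bias` and the output gyrovariance with $s^2 v^2 / (v^2 + \epsilon)$, mirroring how Euclidean BN sets the mean to $\beta$ and the variance to roughly $\gamma^2$. The `barycenter` helper shows one way to realize the running-mean update when the gyrodistance equals the geodesic distance.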
Besides, as the gyromean and gyrovariance are identical to the Riemannian counterparts, the Gyro BNs on these gyrogroups also normalize the Riemannian statistics. In particular, a simple computation shows that our Gyro BN recovers the standard Euclidean BN (Ioffe & Szegedy, 2015) when $\mathcal{M} = \mathbb{R}^n$.

Algorithm 1: Gyrogroup Batch Normalization (Gyro BN)
Require: batch of activations $\{P_{1 \ldots N} \in \mathcal{M}\}$, small positive constant $\epsilon$, momentum $\eta \in [0, 1]$, running mean $M_r$, running variance $v_r^2$, biasing parameter $B \in \mathcal{M}$, scaling parameter $s \in \mathbb{R}$.
Return: normalized batch $\{\tilde{P}_{1 \ldots N} \in \mathcal{M}\}$
1 if training then
2   Compute batch mean $M_b$ and variance $v_b^2$ of $\{P_{1 \ldots N}\}$;
3   Update running statistics $M_r = \mathrm{Bar}_\eta(M_b, M_r)$, $v_r^2 = \eta v_b^2 + (1 - \eta) v_r^2$;
4 end
5 $(M, v^2) = (M_b, v_b^2)$ if training else $(M_r, v_r^2)$
6 $\forall i \le N$, $\tilde{P}_i = B \oplus \left( \frac{s}{\sqrt{v^2 + \epsilon}} \odot (\ominus M \oplus P_i) \right)$

5 A GYRO PERSPECTIVE FOR THE EXISTING RIEMANNIAN NORMALIZATIONS

Several existing Riemannian normalization methods on different geometries enjoy theoretical control of sample mean and variance, including Lie BN on general Lie groups (Chen et al., 2024b), and SPDBNs based on the specific AIM geometry (Brooks et al., 2019; Kobler et al., 2022b;a). This section further reveals that they are concrete implementations of our Gyro BN.

5.1 LIEBN AS A SPECIAL CASE OF GYROBN

Chakraborty (2020, Algs. 3-4) first proposed Riemannian normalization on matrix Lie groups under a specific distance. Chen et al. (2024b) extended their framework into general Lie groups, referred to as Lie BN, with theoretical control over the Riemannian mean and variance. This subsection shows that Lie BN is indeed a special case of our Gyro BN. Lie BN is established under a left-invariant metric on the Lie group. The centering and biasing are defined by left group translation, and scaling is defined by the scaling on the tangent space at the identity element (Chen et al., 2024b, Eq. 13-15).
As every Lie group is automatically a gyrogroup, the gyrotranslation is exactly the group translation. Therefore, the centering, biasing, and scaling are the same under Gyro BN and Lie BN. However, the mean, variance, and running mean updates in Lie BN are defined based on the geodesic distance. In contrast, the counterparts in Gyro BN are based on the gyrodistance. Nevertheless, the following proposition demonstrates the equivalence of these operators under Gyro BN and Lie BN.

Proposition 5.1. Given a Lie group with a left-invariant metric, the gyrodistance and geodesic distance are identical. The Gyro BN is, therefore, identical to the Lie BN (Chen et al., 2024b, Alg. 1).

Chen et al. (2024b) implemented Lie BN on four left-invariant geometries, including the SPD manifold with AIM², LEM and LCM, and rotation matrices. According to Prop. 5.1, these implementations are immediately special cases of our Gyro BN.

5.2 AIM-BASED SPDBNS AS SPECIAL CASES OF GYROBN

Several RBNs on the SPD manifold were developed based on AIM (Brooks et al., 2019; Kobler et al., 2022b;a). The core operations of these approaches can be expressed as follows:
$$\text{Normalization: } \forall i \le N, \quad \tilde{P}_i = B^{\frac{1}{2}} \left( M^{-\frac{1}{2}} P_i M^{-\frac{1}{2}} \right)^{\frac{s}{\sqrt{v^2 + \epsilon}}} B^{\frac{1}{2}} \qquad (19)$$
where $M$ and $v^2$ are the Riemannian mean and variance, i.e., the Fréchet mean and variance under the geodesic distance. The running mean is updated by the binary barycenter under the geodesic distance.

Prop. 3.6 indicates that, under the AIM geometry, the gyrodistance is identical to the geodesic distance. Therefore, the gyromean and gyrovariance are identical to the Riemannian mean and variance. The running mean updates are also identical under the gyrodistance and geodesic distance. Besides, simple computations show that Eq. (19) is exactly the specific implementation of Eq. (15) under $\{\mathcal{S}^n_{++}, \oplus^{AI}, g^{AI}\}$, where $g^{AI}$ denotes AIM. Therefore, the SPDBNs developed by Brooks et al. (2019); Kobler et al. (2022b;a) are also special cases of our Gyro BN.

²AIM is left-invariant w.r.t.
the Lie group operation $(P, Q) \mapsto LQL^\top$ with $L$ as the Cholesky factor of $P = LL^\top$ (Thanwerdas & Pennec, 2022). This group structure differs from the one presented in Tab. 2.

Remark 5.2. Brooks et al. (2019) only consider centering and biasing. Kobler et al. (2022b) use the running mean for centering during training. Kobler et al. (2022a) use different momentum values to update running statistics for training and testing, and multi-channel mechanisms for domain adaptation. Nevertheless, all of them are based on Eq. (19). Therefore, tricks such as multi-channel and separate momentum can also be applied to our Gyro BN. This is what we mean by claiming that our Gyro BN incorporates their approaches.

6 GYROBNS ON GRASSMANNIAN AND HYPERBOLIC SPACES

As indicated by Prop. 3.6 and Thm. 4.1, our Gyro BN can be implemented on the Grassmannian and hyperbolic spaces, with the ability to normalize sample statistics. Our Alg. 1 allows us to implement Gyro BN in a plug-and-play manner. This section clarifies additional technical details regarding its application in these two spaces. As Prop. 3.6 has demonstrated the equivalence of the gyrodistance and geodesic distance on these spaces, we use the terms "gyromean" and "gyrovariance" interchangeably with their Riemannian counterparts.

6.1 GRASSMANNIAN GYROBN

We focus on the ONB perspective. To the best of our knowledge, this is the first RBN on the Grassmannian. Given a batch of activations $\{U_{1 \ldots N}\}$ over $\{\mathrm{Gr}(p, n), \oplus^{Gr}\}$, Eq. (15) can be expressed as

Centering to the identity $I_{p,n}$: $U_i^1 = \mathrm{mexp}\left(-\left[\overline{MM^\top}, \tilde{I}_{p,n}\right]\right) U_i$, (20)
Scaling the dispersion from $I_{p,n}$: $U_i^2 = \mathrm{mexp}\left(\frac{s}{\sqrt{v^2 + \epsilon}} \left[\overline{U_i^1 (U_i^1)^\top}, \tilde{I}_{p,n}\right]\right) I_{p,n}$, (21)
Biasing towards parameter $B \in \mathcal{M}$: $U_i^3 = \mathrm{mexp}\left(\left[\overline{B}, \tilde{I}_{p,n}\right]\right) U_i^2$, (22)

where $[\cdot, \cdot]$ is the matrix commutator, $\overline{(\cdot)} = \widetilde{\mathrm{Log}}_{\tilde{I}_{p,n}}(\cdot)$ with $\widetilde{\mathrm{Log}}$ as the Riemannian logarithm under the projector perspective, $M$ is the Riemannian mean of $\{U_{1 \ldots N}\}$, and $\tilde{I}_{p,n} = I_{p,n} I_{p,n}^\top$ is the identity element under the projector perspective.
The Riemannian mean can be calculated by the Karcher flow (Karcher, 1977). We use Alg. 5.3 by Bendokat et al. (2024) for a stable and efficient computation of the Riemannian logarithm required in the Karcher flow. For $[\overline{MM^\top}, \tilde{I}_{p,n}]$ and $[\overline{U_i^1 (U_i^1)^\top}, \tilde{I}_{p,n}]$, inspired by Bendokat et al. (2024, Alg. 5.3) and Nguyen et al. (2024, Prop. 3.12), we have the following result for fast computation.

Proposition 6.1. Given $U = (U_1^\top, U_2^\top)^\top \in \mathrm{Gr}(p, n)$ with $U_1 \in \mathbb{R}^{p \times p}$ and $U_2 \in \mathbb{R}^{(n-p) \times p}$, we have
$$\left[\overline{UU^\top}, \tilde{I}_{p,n}\right] = \begin{pmatrix} 0 & -\tilde{U}_2^\top \\ \tilde{U}_2 & 0 \end{pmatrix},$$
where $\tilde{U}_2 = U_2 Q \frac{\arcsin(\hat{S})}{\hat{S}} R^\top$ and $U_1^\top \overset{\mathrm{SVD}}{:=} QSR^\top$. Here $S$ is in ascending order, $Q$ and $R$ are column-wise flipped accordingly, and $\hat{S} = \sqrt{1 - S^2}$.

Remark 6.2. Although we focus on the Gyro BN under the ONB perspective, the Gyro BN under the projector perspective can be calculated via the ONB perspective by the following process: (1) mapping data into the ONB perspective by $\pi^{-1} : \widetilde{\mathrm{Gr}}(p, n) \to \mathrm{Gr}(p, n)$; (2) normalizing data by the Gyro BN under $\mathrm{Gr}(p, n)$; (3) mapping normalized data back to $\widetilde{\mathrm{Gr}}(p, n)$ by $\pi$. Technical details are presented in App. E.

6.2 HYPERBOLIC GYROBN

We focus on the Poincaré ball model of the hyperbolic space, i.e., $\mathbb{P}^n_K$. The specific instantiation can be obtained in a plug-in manner: we simply need to plug the required operators from Tab. 2 and Tab. 8 into Alg. 1. Besides, the Poincaré Fréchet mean can be calculated by Lou et al. (2020, Alg. 1).

Remark 6.3. As shown by Cannon et al. (1997), there are five isometric models that one can work with for the hyperbolic space. Although we focus on the Poincaré ball, the Gyro BN under other isometric models can also be easily constructed via the Poincaré Gyro BN. The overall process is similar to Rmk. 6.2. For more detail, please refer to Lem. E.1 and Thm. E.2.

7 EXPERIMENTS

Our Gyro BN layers are model-agnostic and can be seamlessly integrated into any network operating over the gyrospaces listed in Tab. 2.
This section evaluates the effectiveness of our Gyro BN on Grassmannian and hyperbolic neural networks.

7.1 EXPERIMENTS ON THE GRASSMANNIAN

Implementation. We focus on a newly developed Grassmannian network, Gyro Gr (Nguyen & Yang, 2023), which replaces the non-intrinsic transformation block (FRMap + ReOrth layers) in the classic GrNet (Huang et al., 2018) with Grassmannian left gyrotranslation. This modification has demonstrated improved numerical performance and stability (Nguyen & Yang, 2023). Gyro Gr is constructed over the ONB Grassmannian and consists of three basic blocks: left gyrotranslation, pooling (Huang et al., 2018), and the projection map (ProjMap) (Huang et al., 2018). The projection map functions as an activation layer that maps data into symmetric matrices for classification. Following Nguyen & Yang (2023), we evaluate our method on skeleton-based action recognition tasks, including the HDM05 (Müller et al., 2007), NTU60 (Shahroudy et al., 2016), and NTU120 (Liu et al., 2019) datasets, focusing on mutual actions for NTU60 and NTU120. For a fair comparison, we also extend Manifold Norm (Chakraborty, 2020, Alg. 1-2) and RBN (Lou et al., 2020, Alg. 2) to the Grassmannian. Although these two BN methods were not originally implemented over the Grassmannian, they can be adapted by leveraging Riemannian operators such as geodesics, exponential and logarithmic maps, and parallel transport. The key difference is that our Gyro BN can normalize data distributions across different geometries, whereas the other two methods cannot. Further details on datasets, implementation, and training efficiency are provided in App. F.2.

Table 3: Comparison of Gyro BN against other Grassmannian BNs under the Gyro Gr backbone (accuracy, Mean±std and Max).

| Dataset | None Mean±std | None Max | Manifold Norm-Gr Mean±std | Manifold Norm-Gr Max | RBN-Gr Mean±std | RBN-Gr Max | Gyro BN-Gr Mean±std | Gyro BN-Gr Max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HDM05 | 48.97±0.24 | 49.23 | 49.67±0.76 | 50.41 | 48.64±0.77 | 49.49 | 51.89±0.37 | 52.43 |
| NTU60 | 70.13±0.16 | 70.32 | 68.56±0.43 | 69.14 | 67.77±0.52 | 68.35 | 72.60±0.04 | 72.65 |
| NTU120 | 53.76±0.18 | 53.96 | 51.41±0.38 | 51.92 | 50.56±0.22 | 50.82 | 55.47±0.10 | 55.59 |

Table 4: Ablation of Grassmannian Gyro BN under various network architectures.

| Dataset | Architecture | 1Block | 2Block | 3Block | 4Block |
| --- | --- | --- | --- | --- | --- |
| HDM05 | Gyro Gr | 49.23 | 49.09 | 47.02 | 27.36 |
| HDM05 | Gyro Gr BN | 52.43 | 50.62 | 51.56 | 30.29 |
| NTU60 | Gyro Gr | 70.32 | 70.14 | 70.23 | 65.03 |
| NTU60 | Gyro Gr BN | 72.65 | 71.93 | 72.25 | 66.67 |
| NTU120 | Gyro Gr | 53.96 | 54.1 | 54.59 | 47.59 |
| NTU120 | Gyro Gr BN | 55.59 | 56.15 | 54.63 | 48.9 |

Main results. We compare our Gyro BN with previous BN methods under the 1Block Gyro Gr backbone, which consists of one block of gyrotranslation and pooling layers followed by a ProjMap layer. We apply the BN after the pooling layer. The 5-fold results are presented in Tab. 3. Gyro BN consistently delivers improved performance, raising the average accuracy of the vanilla Gyro Gr by 2.92, 2.47, and 1.71 percentage points on HDM05, NTU60, and NTU120, respectively. In contrast, Grassmannian Manifold Norm and RBN can degrade the vanilla Gyro Gr network, particularly on the NTU60 and NTU120 datasets. This is primarily due to Gyro BN's theoretical guarantee of normalizing sample statistics, a capability lacking in previous methods such as Manifold Norm and RBN, as summarized in Tab. 1. Additionally, we observe that Gyro BN mitigates the performance gap between training and testing, indicating its ability to enhance the model's generalization. Detailed discussions are provided in App. F.5. Overall, the above findings highlight the effectiveness of our Gyro BN in facilitating network training.

Ablations on the architecture. We further validate our Gyro BN across different architectures within the Gyro Gr baseline, which includes up to four blocks of gyrotranslation and pooling. Gyro BN is applied after the first pooling layer, and we denote Gyro Gr with Gyro BN as Gyro Gr BN. As implied by Tab.
3, both Gyro Gr and Gyro Gr BN exhibit relatively small variances, allowing us to conduct ablations using a single trial. Tab. 4 reports the results across all three datasets. We observe that Gyro BN consistently improves the performance of the vanilla Gyro Gr baseline, highlighting the effectiveness of the Gyro BN framework. Notably, as the network depth increases, the performance of the Gyro Gr backbone, with or without Gyro BN, degrades. This is because the dimensionality of the final feature becomes excessively low, leading to underfitting. For instance, in the deepest architecture with four blocks on the HDM05 dataset, the final dimension is 12 × 10, which is insufficient to effectively capture discriminative information. Despite this, our Gyro BN still generally enhances the performance of the Gyro Gr backbone.

Figure 3: Validation AUC of HNN with or without Gyro BN-H or RBN-H on the Disease, Airport and Pubmed datasets. The results indicate that Gyro BN facilitates the training of HNNs, improving both convergence and performance, and surpasses the previous RBN-H.

7.2 EXPERIMENTS ON THE HYPERBOLIC SPACE

Table 5: Comparison of HNN with or without Gyro BN-H or RBN-H on the link prediction task.

| Dataset | HNN | HNN-RBN-H | HNN-Gyro BN-H |
|---|---|---|---|
| Cora | 89.0 ± 0.1 | 93.5 ± 0.5 | 94.3 ± 0.2 |
| Disease | 75.1 ± 0.3 | 76.6 ± 2.2 | 81.2 ± 0.9 |
| Airport | 90.8 ± 0.2 | 94.2 ± 0.4 | 95.4 ± 0.2 |
| Pubmed | 94.9 ± 0.1 | 93.4 ± 0.2 | 95.8 ± 0.1 |

Implementation. Following Lou et al. (2020), we validate our hyperbolic Gyro BN, referred to as Gyro BN-H, on the Hyperbolic Neural Network (HNN) (Ganea et al., 2018) backbone for the link prediction task using the Cora (Sen et al., 2008), Disease (Anderson & May, 1991), Airport (Zhang & Chen, 2018), and Pubmed (Namata et al., 2012) datasets. We also compare Gyro BN-H with the hyperbolic RBN proposed by Lou et al. (2020, Alg. 2), denoted as RBN-H.
Notably, the core difference between RBN and our Gyro BN lies in the normalization: our Gyro BN can normalize sample statistics across different geometries, while RBN lacks this guarantee. More details can be found in App. F.3.

Results. Tab. 5 reports the 5-fold average testing AUC on the four datasets. Although both RBN and our Gyro BN generally improve the vanilla HNN, our Gyro BN shows consistently larger improvements than RBN. We also observe that RBN degrades HNN on the Pubmed dataset, while our Gyro BN consistently facilitates network training. Empowered by the normalization of sample statistics, our Gyro BN also shows a smaller standard deviation than RBN on all four datasets. The above empirical results demonstrate the superiority of our Gyro BN framework. Additionally, the validation AUC curves shown in Fig. 3 further highlight the effectiveness of our Gyro BN in facilitating network training.

8 CONCLUSION

We propose a novel gyro structure on manifolds, termed the pseudo-reductive gyrogroup. This structure establishes the minimal prerequisites for developing valid normalization techniques over gyro spaces. Building on this gyro structure, we introduce a novel framework for batch normalization over gyrogroups, called Gyro BN, which enables the normalization of non-Euclidean statistics. Several previous RBN methods, such as Lie BN on general Lie groups (including three groups on SPD manifolds and the group of rotation matrices) and various AIM-based SPDBN methods, are revealed to be special cases of our Gyro BN. Additionally, we instantiate Gyro BN on the Grassmannian and hyperbolic spaces. To the best of our knowledge, our Grassmannian Gyro BN is the first RBN method on the Grassmannian manifold with a theoretical capability to normalize sample statistics. Experiments on Grassmannian and hyperbolic networks confirm the effectiveness of Gyro BN. As a future avenue, we will implement our Gyro BN over other gyro spaces.
We hope that our work will facilitate the development of deep networks for data with non-trivial geometries.

ACKNOWLEDGMENTS

This work was partly supported by the MUR PNRR project FAIR (PE00000013) funded by the Next Generation EU, and the EU Horizon project ELIAS (No. 101120237). The authors also gratefully acknowledge the financial support from the China Scholarship Council (CSC), as well as the CINECA award under the ISCRA initiative for the availability of partial HPC resources support.

REFERENCES

Roy M Anderson and Robert M May. Infectious diseases of humans: dynamics and control. Oxford University Press, 1991.
Vincent Arsigny, Pierre Fillard, Xavier Pennec, and Nicholas Ayache. Fast and simple computations on tensors with log-Euclidean metrics. PhD thesis, INRIA, 2005.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
Ekkehard Batzies, Knut Hüper, Luis Machado, and F Silva Leite. Geometric mean and geodesic regression on Grassmannians. Linear Algebra and its Applications, 466:83–101, 2015.
Ahmad Bdeir, Kristian Schwethelm, and Niels Landwehr. Fully hyperbolic convolutional neural networks for computer vision. In The Twelfth International Conference on Learning Representations, 2024.
Thomas Bendokat, Ralf Zimmermann, and P-A Absil. A Grassmann manifold handbook: Basic geometry and computational aspects. Advances in Computational Mathematics, 50(1):1–51, 2024.
Daniel Brooks, Olivier Schwander, Frédéric Barbaresco, Jean-Yves Schneider, and Matthieu Cord. Riemannian batch normalization for SPD neural networks. In Advances in Neural Information Processing Systems, volume 32, 2019.
James W Cannon, William J Floyd, Richard Kenyon, Walter R Parry, et al. Hyperbolic geometry. Flavors of geometry, 31(59-115):2, 1997.
Rudrasis Chakraborty. Manifold Norm: Extending normalizations on Riemannian manifolds. arXiv preprint arXiv:2003.13869, 2020.
Rudrasis Chakraborty, Jose Bouza, Jonathan Manton, and Baba C Vemuri. Manifoldnet: A deep neural network for manifold-valued data with applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
Ines Chami, Zhitao Ying, Christopher Ré, and Jure Leskovec. Hyperbolic graph convolutional neural networks. Advances in Neural Information Processing Systems, 32, 2019.
Kaixuan Chen, Shunyu Liu, Tongtian Zhu, Ji Qiao, Yun Su, Yingjie Tian, Tongya Zheng, Haofei Zhang, Zunlei Feng, Jingwen Ye, et al. Improving expressivity of GNNs with subgraph-specific factor embedded normalization. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 237–249, 2023a.
Kaixuan Chen, Jie Song, Shunyu Liu, Na Yu, Zunlei Feng, Gengshi Han, and Mingli Song. Distribution knowledge embedding for graph pooling. IEEE Transactions on Knowledge and Data Engineering, 35(8):7898–7908, 2023b.
Ziheng Chen, Tianyang Xu, Xiao-Jun Wu, Rui Wang, Zhiwu Huang, and Josef Kittler. Riemannian local mechanism for SPD neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7104–7112, 2023c.
Ziheng Chen, Yue Song, Gaowen Liu, Ramana Rao Kompella, Xiaojun Wu, and Nicu Sebe. Riemannian multinomial logistics regression for SPD neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024a.
Ziheng Chen, Yue Song, Yunmei Liu, and Nicu Sebe. A Lie group approach to Riemannian batch normalization. In The Twelfth International Conference on Learning Representations, 2024b.
Ziheng Chen, Yue Song, Rui Wang, Xiaojun Wu, and Nicu Sebe. RMLR: Extending multinomial logistic regression into general geometries. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024c.
Ziheng Chen, Yue Song, Xiao-Jun Wu, and Nicu Sebe. Product geometries on Cholesky manifolds with applications to SPD manifolds.
arXiv preprint arXiv:2407.02607, 2024d.
Ziheng Chen, Yue Song, Tianyang Xu, Zhiwu Huang, Xiao-Jun Wu, and Nicu Sebe. Adaptive log-Euclidean metrics for SPD matrix learning. IEEE Transactions on Image Processing, 2024e.
Ziheng Chen, Yue Song, Xiaojun Wu, Gaowen Liu, and Nicu Sebe. Understanding matrix function normalizations in covariance pooling from the lens of Riemannian geometry. In The Thirteenth International Conference on Learning Representations, 2025.
Jindou Dai, Yuwei Wu, Zhi Gao, and Yunde Jia. A hyperbolic-to-hyperbolic graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 154–163, 2021.
Manfredo Perdigao Do Carmo and J Flaherty Francis. Riemannian Geometry, volume 6. Springer, 1992.
Alan Edelman, Tomás A Arias, and Steven T Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
Jean Gallier and Jocelyn Quaintance. Differential geometry and Lie groups, volume 12. Springer, 2020.
Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks. Advances in Neural Information Processing Systems, 31, 2018.
Zhi Gao, Chen Xu, Feng Li, Yunde Jia, Mehrtash Harandi, and Yuwei Wu. Exploring data geometry for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24325–24334, 2023.
Uwe Helmke and John B Moore. Optimization and dynamical systems. Springer Science & Business Media, 2012.
Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge University Press, 2012.
Zhiwu Huang and Luc Van Gool. A Riemannian network for SPD matrix learning. In Thirty-first AAAI Conference on Artificial Intelligence, 2017.
Zhiwu Huang, Chengde Wan, Thomas Probst, and Luc Van Gool. Deep learning on Lie groups for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6099–6108, 2017.
Zhiwu Huang, Jiqing Wu, and Luc Van Gool. Building deep networks on Grassmann manifolds. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
Zhiwu Huang, Jiqing Wu, and Luc Van Gool. Manifold-valued image generation with Wasserstein generative adversarial nets. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3886–3893, 2019.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. PMLR, 2015.
Ce Ju, Reinmar J Kobler, Liyao Tang, Cuntai Guan, and Motoaki Kawanabe. Deep geodesic canonical correlation analysis for covariance-based neuroimaging data. In The Twelfth International Conference on Learning Representations, 2024.
Hermann Karcher. Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30(5):509–541, 1977.
Isay Katsman, Eric Chen, Sidhanth Holalkere, Anna Asch, Aaron Lou, Ser Nam Lim, and Christopher M De Sa. Riemannian residual neural networks. Advances in Neural Information Processing Systems, 36, 2024.
Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Reinmar Kobler, Jun-ichiro Hirayama, Qibin Zhao, and Motoaki Kawanabe. SPD domain-specific batch normalization to crack interpretable unsupervised domain adaptation in EEG. Advances in Neural Information Processing Systems, 35:6219–6235, 2022a.
Reinmar J Kobler, Jun-ichiro Hirayama, and Motoaki Kawanabe. Controlling the Fréchet variance improves batch normalization on the symmetric positive definite manifold. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3863–3867. IEEE, 2022b.
John M Lee. Introduction to smooth manifolds. Springer, 2013.
John M Lee. Introduction to Riemannian manifolds, volume 2. Springer, 2018.
Mario Lezcano Casado. Trivializations for gradient-based optimization on manifolds. Advances in Neural Information Processing Systems, 32, 2019.
Zhenhua Lin. Riemannian geometry of symmetric positive definite matrices via Cholesky decomposition. SIAM Journal on Matrix Analysis and Applications, 40(4):1353–1370, 2019.
Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C Kot. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2684–2701, 2019.
Aaron Lou, Isay Katsman, Qingxuan Jiang, Serge Belongie, Ser-Nam Lim, and Christopher De Sa. Differentiating through the Fréchet mean. In International Conference on Machine Learning, pp. 6393–6403. PMLR, 2020.
Meinard Müller, Tido Röder, Michael Clausen, Bernhard Eberhardt, Björn Krüger, and Andreas Weber. Documentation mocap database HDM05. Technical report, Universität Bonn, 2007.
Galileo Namata, Ben London, Lise Getoor, Bert Huang, and U Edu. Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs, volume 8, pp. 1, 2012.
Xuan Son Nguyen. The Gyro-structure of some matrix manifolds. In Advances in Neural Information Processing Systems, volume 35, pp. 26618–26630, 2022a.
Xuan Son Nguyen. A Gyrovector space approach for symmetric positive semi-definite matrix learning. In Proceedings of the European Conference on Computer Vision, pp. 52–68, 2022b.
Xuan Son Nguyen and Shuo Yang. Building neural networks on matrix manifolds: A gyrovector space approach. arXiv preprint arXiv:2305.04560, 2023.
Xuan Son Nguyen, Shuo Yang, and Aymeric Histace. Matrix manifold neural networks++. In The Twelfth International Conference on Learning Representations, 2024.
Yue-Ting Pan, Jing-Lun Chou, and Chun-Shu Wei. Matt: A manifold attention network for EEG decoding. Advances in Neural Information Processing Systems, 35:31116–31129, 2022.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
Xavier Pennec, Pierre Fillard, and Nicholas Ayache. A Riemannian framework for tensor computing. International Journal of Computer Vision, 66(1):41–66, 2006.
Peter Petersen. Riemannian geometry, volume 171. Springer, 2006.
Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019, 2016.
Ryohei Shimizu, Yusuke Mukuta, and Tatsuya Harada. Hyperbolic neural networks++. arXiv preprint arXiv:2006.08210, 2020.
Ondrej Skopek, Octavian-Eugen Ganea, and Gary Bécigneul. Mixed-curvature variational autoencoders. arXiv preprint arXiv:1911.08411, 2019.
Yann Thanwerdas and Xavier Pennec. Theoretically and computationally convenient geometries on full-rank correlation matrices. SIAM Journal on Matrix Analysis and Applications, 43(4):1851–1872, 2022.
Loring W. Tu. An introduction to manifolds. Springer, 2011.
Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
Abraham A Ungar. Analytic hyperbolic geometry: Mathematical foundations and applications. World Scientific, 2005.
Abraham A Ungar. Beyond the Einstein addition law and its gyroscopic Thomas precession: The theory of gyrogroups and gyrovector spaces. Springer Science & Business Media, 2012.
Abraham Albert Ungar.
A Gyrovector Space Approach to Hyperbolic Geometry. Springer, 2009.
Abraham Albert Ungar. Analytic hyperbolic geometry in n dimensions: An introduction. CRC Press, 2014.
Rui Wang, Xiao-Jun Wu, Ziheng Chen, Tianyang Xu, and Josef Kittler. DreamNet: A deep Riemannian manifold network for SPD matrix learning. In Proceedings of the Asian Conference on Computer Vision, pp. 3241–3257, 2022a.
Rui Wang, Xiao-Jun Wu, Ziheng Chen, Tianyang Xu, and Josef Kittler. Learning a discriminative SPD manifold neural network for image set classification. Neural Networks, 151:94–110, 2022b.
Rui Wang, Chen Hu, Ziheng Chen, Xiao-Jun Wu, and Xiaoning Song. A Grassmannian manifold self-attention network for signal classification. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024a.
Rui Wang, Xiao-Jun Wu, Ziheng Chen, Cong Hu, and Josef Kittler. SPD manifold deep metric learning for image set classification. IEEE Transactions on Neural Networks and Learning Systems, 2024b.
Yung-Chow Wong. Differential geometry of Grassmann manifolds. Proceedings of the National Academy of Sciences, 57(3):589–594, 1967.
Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19, 2018.
Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, volume 31, 2018.
Wei Zhao, Federico Lopez, J Maxwell Riestenberg, Michael Strube, Diaaeldin Taha, and Steve Trettel. Modeling graphs beyond hyperbolic: Graph neural networks in symmetric positive definite matrices. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 122–139. Springer, 2023.
Vladimir Antonovich Zorich and Octavio Paniagua. Mathematical analysis II, volume 220. Springer, 2016.

APPENDIX CONTENTS

A Limitation
B Notations
C Preliminaries
    C.1 Riemannian geometry
    C.2 Gyrovector spaces
    C.3 Matrix and vector manifolds
        C.3.1 Definitions
        C.3.2 Gyro and Riemannian structures
    C.4 Intuitive explanations of gyrogroups
D First pseudo-reductive gyrogroups properties
E Grassmannian Gyro BN under the projector perspective
F Experimental details and additional ablations
    F.1 Summary of operators in Grassmannian and hyperbolic Gyro BNs
    F.2 Details on the Grassmannian experiments
        F.2.1 Datasets and preprocessing
        F.2.2 Implementation details
    F.3 Details on the hyperbolic experiments
        F.3.1 Datasets
        F.3.2 Implementation details
    F.4 Hardware
    F.5 Gyro BN and model generalization ability
    F.6 Training efficiency
        F.6.1 Training efficiency on the Grassmannian networks
        F.6.2 Training efficiency on the hyperbolic networks
    F.7 Experiments on other hyperbolic baselines
        F.7.1 Experiments on HNN++
        F.7.2 Experiments on RRes Net
    F.8 Experiments on covariate shifts
        F.8.1 Covariate shifts on the Grassmannian network
        F.8.2 Covariate shifts on the hyperbolic network
    F.9 Experiments on condition numbers
        F.9.1 Condition numbers of weight matrices of the transformation layer
        F.9.2 Condition number of network Jacobian
G Proofs
    G.1 Proof of Prop. 3.2
    G.2 Proof of Thm. 3.3
    G.3 Proof of Thm. 3.5
    G.4 Proof of Prop. 3.6
    G.5 Proof of Thm. 4.1
    G.6 Proof of Prop. 5.1
    G.7 Proof of Prop. 6.1

A LIMITATION

As shown in Prop. 3.6, several geometries, including SPD, Grassmannian, hyperbolic, and hyperspherical manifolds, have gyro structures that enable Gyro BN with theoretical control over sample statistics. However, some geometries do not have gyro structures, or their gyro structures remain unexplored. Therefore, Gyro BN cannot currently be established on these manifolds. We will explore other techniques to establish an RBN framework for these non-gyro geometries.

B NOTATIONS

Tab. 6 summarizes all the notations in the main paper.

Table 6: Summary of notations.
| Notation | Explanation |
|---|---|
| $\{G, \oplus\}$ or abbreviated as $G$ | A gyrogroup |
| $\{\mathcal{M}, \oplus, g\}$ or abbreviated as $\mathcal{M}$ | A Riemannian manifold $\{\mathcal{M}, g\}$ with a gyrogroup structure induced by $g$ |
| $\mathbb{1}$ | Identity map |
| $T_P\mathcal{M}$ | Tangent space at $P \in \mathcal{M}$ |
| $g_P(\cdot,\cdot)$ or $\langle\cdot,\cdot\rangle_P$ | Riemannian metric at $P \in \mathcal{M}$ |
| $\lVert\cdot\rVert_P$ | The norm induced by $\langle\cdot,\cdot\rangle_P$ on $T_P\mathcal{M}$ |
| $d_{geo}(\cdot,\cdot)$ | Geodesic distance |
| $\mathrm{Log}_P$ | Riemannian logarithm at $P$ |
| $\mathrm{Exp}_P$ | Riemannian exponentiation at $P$ |
| $f_{*,P}$ | The differential map of the smooth map $f$ at $P \in \mathcal{M}$ |
| $\mathrm{PT}_{P \to Q}$ | Parallel transportation along the geodesic connecting $P$ and $Q$ |
| $E$ | Gyro identity of $\{\mathcal{M}, \oplus\}$ |
| $\ominus P$ | Group inverse of $P \in \mathcal{M}$ |
| $\odot$ | Gyro scalar product |
| $\mathrm{gyr}[\cdot,\cdot]$ | Gyration |
| $\langle\cdot,\cdot\rangle_{gr}$ | Gyro inner product |
| $\lVert\cdot\rVert_{gr}$ | Gyronorm |
| $d_{gry}(\cdot,\cdot)$ | Gyrodistance |
| FM | Fréchet mean under a gyrodistance |
| $\mathrm{Bar}_\eta(\cdot,\cdot)$ | Binary barycenter based on a gyrodistance |
| $\langle\cdot,\cdot\rangle$ | The standard Frobenius inner product |
| $\lVert\cdot\rVert$ | Norm induced by the standard Frobenius inner product |
| mlog | Matrix logarithm |
| mexp | Matrix exponentiation |
| $\mathscr{L}$ | Cholesky decomposition |
| Dlog | The diagonal element-wise logarithm |
| $\psi_{LC}$ | $\mathrm{Dlog} \circ \mathscr{L}$ |
| $\mathcal{S}^n_{++}$ | The SPD manifold |
| $\mathcal{S}^n$ | The Euclidean space of symmetric matrices |
| $\mathbb{P}^n_K$ | $n$-dimensional Poincaré ball with curvature $K < 0$ |
| $\mathbb{R}^n$ | $n$-dimensional Euclidean space |
| $\mathbb{D}^n_K$ | $n$-dimensional projected hypersphere with curvature $K > 0$ |
| $\mathcal{M}_K$ | Constant Curvature Spaces (CCS) |
| $\mathrm{Gr}(p, n)$ | The Grassmannian under the ONB perspective |
| $\widetilde{\mathrm{Gr}}(p, n)$ | The Grassmannian under the projector perspective |
| $I_{p,n}$ | The Grassmannian identity under the ONB perspective |
| $\tilde{I}_{p,n}$ | The Grassmannian identity under the projector perspective |
| $\pi$ | The Riemannian isometry from $\mathrm{Gr}(p, n)$ onto $\widetilde{\mathrm{Gr}}(p, n)$ |
| $[\cdot,\cdot]$ | Matrix commutator |
| $[\cdot]$ | An element in $\mathrm{Gr}(p, n)$, which is an equivalence class |
| $I_n$ | $n \times n$ identity matrix |
| $P^\theta$ | Matrix power for SPD matrix $P$ |
| $\oplus_{AI}$, $\oplus_{LE}$ and $\oplus_{LC}$ | Gyro additions on the SPD manifold under AIM, LEM and LCM |
| $\widetilde{\oplus}_{Gr}$ and $\oplus_{Gr}$ | Gyro additions on the Grassmannian under the ONB and projector perspectives |
| $\oplus_K$ | Gyro additions on the CCS |
| $\overline{(\cdot)}$ | $\overline{(\cdot)} = \widetilde{\mathrm{Log}}_{\tilde{I}_{p,n}}(\cdot)$ with $\widetilde{\mathrm{Log}}$ as the Riemannian logarithm on $\widetilde{\mathrm{Gr}}(p, n)$ |
| $g^{AI}$, $g^{LE}$ and $g^{LC}$ | AIM, LEM, LCM on the SPD manifold |
| $g^{Gr}$ and $\tilde{g}^{Gr}$ | Riemannian metrics on the Grassmannian under the ONB and projector perspectives |
| $\tan_K$ | $\tan_K = \tan$ if $K > 0$, $\tan_K = \tanh$ if $K < 0$ |

C PRELIMINARIES

C.1 RIEMANNIAN GEOMETRY

Manifolds can be intuitively understood as locally Euclidean spaces. Differentials generalize the concept of classical derivatives. For a detailed introduction to smooth manifolds, we refer readers to Tu (2011); Lee (2013). A Riemannian manifold is a manifold equipped with a Riemannian metric, which can be intuitively interpreted as a point-wise inner product. This metric allows for the adaptation of various Euclidean operators to the manifold setting. For an in-depth discussion on Riemannian manifolds, see Do Carmo & Flaherty Francis (1992); Lee (2018).

Definition C.1 (Riemannian Manifolds (Do Carmo & Flaherty Francis, 1992)). A Riemannian metric on $\mathcal{M}$ is a smooth symmetric covariant 2-tensor field on $\mathcal{M}$, which is positive definite at every point. A Riemannian manifold is a pair $\{\mathcal{M}, g\}$, where $\mathcal{M}$ is a smooth manifold and $g$ is a Riemannian metric.

Isometries carry the set-theoretic notion of a bijection over to Riemannian geometry. If two manifolds are isometric, they can be viewed as equivalent. The Riemannian operators on these two manifolds are also closely related (Chen et al., 2024d, App. A.2). The following defines the Riemannian isometry.

Definition C.2 (Isometries (Lee, 2018)). If $\{\mathcal{M}, g\}$ and $\{\widetilde{\mathcal{M}}, \tilde{g}\}$ are both Riemannian manifolds, a smooth map $f : \mathcal{M} \to \widetilde{\mathcal{M}}$ is called a (Riemannian) isometry if it is a diffeomorphism that satisfies

$$g_p(V, W) = \tilde{g}_{f(p)}\big(f_{*,p}(V), f_{*,p}(W)\big), \tag{24}$$

where $f_{*,p}(\cdot) : T_p\mathcal{M} \to T_{f(p)}\widetilde{\mathcal{M}}$ is the differential map of $f$ at $p \in \mathcal{M}$, and $V, W \in T_p\mathcal{M}$ are two tangent vectors.

The exponential and logarithmic maps and parallel transportation are also crucial for Riemannian approaches in machine learning. To bypass the notation burdens caused by their definitions, we review the geometric reinterpretations of these operators (Pennec et al., 2006; Do Carmo & Flaherty Francis, 1992).
In detail, in a manifold $\mathcal{M}$, geodesics correspond to straight lines in Euclidean space. A tangent vector $\overrightarrow{xy} \in T_x\mathcal{M}$ can be locally identified with a point $y$ on the manifold via the geodesic starting at $x$ with initial velocity $\overrightarrow{xy}$, i.e., $y = \mathrm{Exp}_x(\overrightarrow{xy})$. Conversely, the logarithmic map is the inverse of the exponential map, generating the initial velocity of the geodesic connecting $x$ and $y$, i.e., $\overrightarrow{xy} = \mathrm{Log}_x(y)$. These two operators generalize the idea of addition and subtraction in Euclidean space. The parallel transportation $\mathrm{PT}_{x \to y}(V)$ generalizes parallelly moving a vector along a curve in Euclidean space. We summarize the reinterpretation in Tab. 7.

Table 7: The geometric reinterpretations of Riemannian operators.

| Operations | Euclidean spaces | Riemannian manifolds |
|---|---|---|
| Straight line | Straight line | Geodesic |
| Subtraction | $\overrightarrow{xy} = y - x$ | $\overrightarrow{xy} = \mathrm{Log}_x(y)$ |
| Addition | $y = x + \overrightarrow{xy}$ | $y = \mathrm{Exp}_x(\overrightarrow{xy})$ |
| Parallelly moving | $V \to V$ | $\mathrm{PT}_{x \to y}(V)$ |

A Lie group is a manifold with a smooth group structure. It is a combination of algebra and geometry.

Definition C.3 (Lie Groups). A manifold is a Lie group if it forms a group with a group operation $\cdot$ such that $m(x, y) \mapsto x \cdot y$ and $i(x) \mapsto x^{-1}$ are both smooth, where $x^{-1}$ is the group inverse of $x$.

The following are some simple examples of Lie groups:
1. The set of real numbers $\mathbb{R}$, whose group operation is addition.
2. The set of $n \times n$ invertible matrices $\mathrm{GL}(n)$, whose group operation is the matrix product. This group is known as the general linear group.
3. The set of $n \times n$ orthogonal matrices $\mathrm{O}(n)$, whose group operation is the matrix product. This group is a subgroup of $\mathrm{GL}(n)$, known as the orthogonal group.

C.2 GYROVECTOR SPACES

Gyrogroups in Def. 2.1 generalize groups to non-associative algebraic systems by gyrations. Similarly, the gyrovector space generalizes the vector space, which has shown impressive success in hyperbolic geometry (Ungar, 2005; 2009; 2012; 2014).
Definition C.4 (Gyrovector Spaces (Nguyen, 2022a)). A gyrocommutative gyrogroup $\{G, \oplus\}$ equipped with a scalar multiplication $\odot : \mathbb{R} \times G \to G$ is called a gyrovector space if it satisfies the following axioms for $s, t \in \mathbb{R}$ and $a, b, c \in G$:
(V1) $1 \odot a = a$, $0 \odot a = t \odot e = e$, and $(-1) \odot a = \ominus a$.
(V2) $(s + t) \odot a = s \odot a \oplus t \odot a$.
(V3) $(st) \odot a = s \odot (t \odot a)$.
(V4) $\mathrm{gyr}[a, b](t \odot c) = t \odot \mathrm{gyr}[a, b]c$.
(V5) $\mathrm{gyr}[s \odot a, t \odot a] = \mathbb{1}$, where $\mathbb{1}$ is the identity map.

Gyrovector spaces generalize vector spaces to curved geometries, such as the SPD and Grassmannian manifolds (Nguyen, 2022a). While retaining familiar properties like distributivity (V2) and associativity (V3), they incorporate the complexities of gyrations.

C.3 MATRIX AND VECTOR MANIFOLDS

C.3.1 DEFINITIONS

SPD: The set $\mathcal{S}^n_{++}$ of $n \times n$ SPD matrices forms a manifold, named the SPD manifold (Pennec et al., 2006). We focus on three popular Riemannian metrics on the SPD manifold: Affine-Invariant Metric (AIM) (Pennec et al., 2006), Log-Euclidean Metric (LEM) (Arsigny et al., 2005), and Log-Cholesky Metric (LCM) (Lin, 2019).

Grassmannian: The Grassmannian is the set of $p$-dimensional subspaces of an $n$-dimensional vector space (Tu, 2011), which has two matrix representations (Bendokat et al., 2024):

$$\begin{aligned} \text{Projector perspective:} \quad & \widetilde{\mathrm{Gr}}(p, n) = \{P \in \mathcal{S}^n : P^2 = P,\ \mathrm{rank}(P) = p\}, \\ \text{ONB perspective:} \quad & \mathrm{Gr}(p, n) = \{[U] : [U] := \{\widetilde{U} \in \mathrm{St}(p, n) \mid \widetilde{U} = UR,\ R \in \mathrm{O}(p)\}\}, \end{aligned} \tag{25}$$

where $\mathcal{S}^n$ is the Euclidean space of symmetric matrices, $\mathrm{St}(p, n)$ is the Stiefel manifold, and $\mathrm{O}(p)$ is the orthogonal group. By abuse of notation, we use $[U]$ and $U$ interchangeably for the element of $\mathrm{Gr}(p, n)$. As shown by Helmke & Moore (2012), the ONB perspective $\mathrm{Gr}(p, n)$ is diffeomorphic to $\widetilde{\mathrm{Gr}}(p, n)$ by

$$\pi(U) = UU^\top, \quad \forall U \in \mathrm{Gr}(p, n), \tag{26}$$

where the $n \times p$ column-wise orthonormal matrix $U$ should be more precisely understood as a representative of an equivalence class (Bendokat et al., 2024).
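The relation between the two Grassmannian representations in Eqs. (25) and (26) can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation; the helper names `random_onb` and `to_projector` are hypothetical and introduced only for illustration. It shows that $\pi(U) = UU^\top$ does not depend on the chosen representative of $[U]$ and that the result is a symmetric, idempotent, rank-$p$ matrix.

```python
import numpy as np


def random_onb(n, p, rng):
    """Sample an n x p matrix with orthonormal columns via a reduced QR factorization."""
    q, _ = np.linalg.qr(rng.standard_normal((n, p)))
    return q


def to_projector(u):
    """Map an ONB representative U to the projector pi(U) = U U^T (Eq. (26))."""
    return u @ u.T


rng = np.random.default_rng(0)
n, p = 6, 2
U = random_onb(n, p, rng)      # a representative of a point [U] in Gr(p, n)
R = random_onb(p, p, rng)      # an orthogonal p x p matrix, i.e. an element of O(p)
P = to_projector(U)
P_same = to_projector(U @ R)   # another representative of the same equivalence class

# Both representatives give the same projector.
print(np.allclose(P, P_same))
# The projector is symmetric, idempotent, and has rank p.
print(np.allclose(P, P.T), np.allclose(P @ P, P), np.linalg.matrix_rank(P))
```

The invariance holds because $(UR)(UR)^\top = U R R^\top U^\top = UU^\top$ for any orthogonal $R$, which is why the ONB perspective has to be read modulo $\mathrm{O}(p)$.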
CCS: The Poincaré ball $\mathbb{P}^n_K$ for the hyperbolic space (Ungar, 2009; Ganea et al., 2018), projected hypersphere $\mathbb{D}^n_K$ for the hypersphere (Skopek et al., 2019), and standard Euclidean space $\mathbb{R}^n$ (Zorich & Paniagua, 2016) are more generally called the Constant Curvature Space (CCS) (Do Carmo & Flaherty Francis, 1992), as they have constant sectional curvature $K$. The $n$-dimensional Poincaré ball and projected hypersphere are represented as $\mathbb{P}^n_K = \{x \in \mathbb{R}^n : \langle x, x\rangle < -\tfrac{1}{K}\}$ with $K < 0$ and $\mathbb{D}^n_K = \mathbb{R}^n$ with $K > 0$. When $K = 0$, the CCS becomes the standard Euclidean space $\mathbb{R}^n$.

C.3.2 GYRO AND RIEMANNIAN STRUCTURES

For a matrix manifold $\mathcal{M}$ and a CCS $\mathcal{M}_K$, we make the following notations:

$$\mathcal{M} = \begin{cases} \{\mathcal{S}^n_{++}, g^{AI}\} & \text{(the SPD manifold under AIM)} \\ \{\mathcal{S}^n_{++}, g^{LE}\} & \text{(the SPD manifold under LEM)} \\ \{\mathcal{S}^n_{++}, g^{LC}\} & \text{(the SPD manifold under LCM)} \\ \{\mathrm{Gr}(p, n), g^{Gr}\} & \text{(the Grassmannian under the ONB perspective)} \\ \{\widetilde{\mathrm{Gr}}(p, n), \tilde{g}^{Gr}\} & \text{(the Grassmannian under the projector perspective)} \end{cases}$$

$$\mathcal{M}_K = \begin{cases} \mathbb{P}^n_K, & \text{for } K < 0 \\ \mathbb{R}^n, & \text{for } K = 0 \\ \mathbb{D}^n_K, & \text{for } K > 0 \end{cases} \tag{28}$$

$$\tan_K = \begin{cases} \tan & \text{if } K > 0 \\ \tanh & \text{if } K < 0 \end{cases} \tag{29}$$

Notations in Tab. 2: We summarize all the necessary group operations in Tab. 2 with the following notations. Let $P, Q \in \mathcal{M}$ with $\mathcal{M}$ as $\mathcal{S}^n_{++}$ or $\widetilde{\mathrm{Gr}}(p, n)$, and $x, y \in \mathcal{M}_K$ with $\mathcal{M}_K$ as $\mathbb{P}^n_K$ ($K < 0$), $\mathbb{D}^n_K$ ($K > 0$) or $\mathbb{R}^n$ ($K = 0$). For the Grassmannian, $U = \pi^{-1}(P)$ and $V = \pi^{-1}(Q)$ are the ONB representations. We denote matrix exponential, matrix logarithm, and Cholesky decomposition as mexp, mlog, and $\mathscr{L}$, respectively. We denote $\psi_{LC} = \mathrm{Dlog} \circ \mathscr{L}$, where Dlog is the diagonal element-wise logarithm. As shown by Chen et al. (2024e), LCM and the associated Lie group on $\mathcal{S}^n_{++}$ are pulled back by $\psi_{LC}$ from the Euclidean space of lower triangular matrices. $I_n$ is the $n \times n$ identity matrix, $I_{p,n} = (I_p, 0)^\top \in \mathbb{R}^{n \times p}$, and $\tilde{I}_{p,n} = \pi(I_{p,n})$. For the Grassmannian $\widetilde{\mathrm{Gr}}(p, n)$, $\Omega = [\overline{P}, \tilde{I}_{p,n}]$, where $\overline{P} = \widetilde{\mathrm{Log}}_{\tilde{I}_{p,n}}(P)$ and $[\cdot,\cdot]$ is the matrix commutator. We denote $\langle\cdot,\cdot\rangle$ and $\lVert\cdot\rVert$ as the standard (matrix and vector) inner product and norm.
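To make the map $\psi_{LC} = \mathrm{Dlog} \circ \mathscr{L}$ concrete, the following NumPy sketch computes it and its inverse. It assumes, in line with the Log-Cholesky construction of Lin (2019), that Dlog takes the logarithm of the diagonal of the Cholesky factor and leaves the strictly lower-triangular part unchanged. The function names `psi_lc` and `psi_lc_inv` are hypothetical, and this is an illustrative sketch rather than the authors' code.

```python
import numpy as np


def psi_lc(P):
    """psi_LC(P): Cholesky factor of P with its diagonal replaced by its logarithm."""
    chol = np.linalg.cholesky(P)                    # lower triangular, positive diagonal
    out = np.tril(chol, k=-1)                       # copy of the strictly lower part
    np.fill_diagonal(out, np.log(np.diag(chol)))    # Dlog acts on the diagonal only
    return out


def psi_lc_inv(X):
    """Inverse map: exponentiate the diagonal, then rebuild the SPD matrix L L^T."""
    chol = np.tril(X, k=-1)
    np.fill_diagonal(chol, np.exp(np.diag(X)))
    return chol @ chol.T


rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
P = A @ A.T + 4.0 * np.eye(4)   # a well-conditioned SPD matrix
X = psi_lc(P)                   # a lower-triangular matrix (a point in Euclidean space)

print(np.allclose(psi_lc_inv(X), P))   # round trip recovers P
```

Because the image of $\psi_{LC}$ is the whole vector space of lower-triangular matrices, ordinary Euclidean operations there can be pulled back to $\mathcal{S}^n_{++}$ through `psi_lc` and `psi_lc_inv`.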
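We further make the following notations for the related Riemannian operators in Tab. 8. Given $P \in \mathcal{M}$ ($x \in \mathcal{M}_K$), we denote the tangent vector as $V \in T_P\mathcal{M}$ ($v \in T_x\mathcal{M}_K$). We denote the geodesic distance, Riemannian logarithm, and Riemannian exponential as $d_{geo}$, Log and Exp, respectively.

Table 8: Riemannian geometries of several matrix and vector manifolds. For CCS $\mathcal{M}_K$, we present the operators for $K \neq 0$, as when $K = 0$, all the Riemannian operators are reduced to the familiar vector operators.

| Manifolds | $d_{geo}(P, Q)$ or $d_{geo}(x, y)$ | $\mathrm{Log}_P Q$ or $\mathrm{Log}_x y$ | $\mathrm{Exp}_P V$ or $\mathrm{Exp}_x v$ | References |
|---|---|---|---|---|
| $\{\mathcal{S}^n_{++}, g^{LE}\}$ | $\lVert \mathrm{mlog}(P) - \mathrm{mlog}(Q) \rVert$ | $(\mathrm{mlog}_{*,P})^{-1}(\mathrm{mlog}(Q) - \mathrm{mlog}(P))$ | $\mathrm{mexp}(\mathrm{mlog}(P) + \mathrm{mlog}_{*,P}(V))$ | (Arsigny et al., 2005) |
| $\{\mathcal{S}^n_{++}, g^{AI}\}$ | $\lVert \mathrm{mlog}(Q^{-\frac{1}{2}} P Q^{-\frac{1}{2}}) \rVert$ | $P^{\frac{1}{2}}\, \mathrm{mlog}(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}})\, P^{\frac{1}{2}}$ | $P^{\frac{1}{2}}\, \mathrm{mexp}(P^{-\frac{1}{2}} V P^{-\frac{1}{2}})\, P^{\frac{1}{2}}$ | (Pennec et al., 2006) |
| $\{\mathcal{S}^n_{++}, g^{LC}\}$ | $\lVert \psi_{LC}(P) - \psi_{LC}(Q) \rVert$ | $(\mathscr{L}^{-1})_{*,L}(\psi_{LC}(Q) - \psi_{LC}(P))$ | $\psi_{LC}^{-1}(\psi_{LC}(P) + (\psi_{LC})_{*,P}(V))$ | (Lin, 2019); (Chen et al., 2024e) |
| $\mathrm{Gr}(p, n)$ | $\lVert \arccos(\Sigma) \rVert$, with $P^\top Q \overset{\mathrm{SVD}}{:=} O \Sigma R^\top$ | $O \arctan(\Sigma) R^\top$, with $(I_n - PP^\top) Q (P^\top Q)^{-1} \overset{\mathrm{SVD}}{:=} O \Sigma R^\top$ | $P R \cos(\Sigma) R^\top + O \sin(\Sigma) R^\top$, with $V \overset{\mathrm{SVD}}{:=} O \Sigma R^\top$ | (Edelman et al., 1998); (Bendokat et al., 2024) |
| $\widetilde{\mathrm{Gr}}(p, n)$ | $\lVert \tfrac{1}{2} \mathrm{mlog}((I_n - 2Q)(I_n - 2P)) \rVert$ | $\tfrac{1}{2}[\mathrm{mlog}((I_n - 2Q)(I_n - 2P)), P]$ | $\mathrm{mexp}([V, P])\, P\, \mathrm{mexp}(-[V, P])$ | (Bendokat et al., 2024); (Batzies et al., 2015) |
| $\mathcal{M}_K$ | $\frac{2}{\sqrt{\lvert K \rvert}} \tan_K^{-1}\big(\sqrt{\lvert K \rvert}\, \lVert -x \oplus_K y \rVert\big)$ | $\frac{2}{\sqrt{\lvert K \rvert}\, \lambda^K_x} \tan_K^{-1}\big(\sqrt{\lvert K \rvert}\, \lVert -x \oplus_K y \rVert\big) \frac{-x \oplus_K y}{\lVert -x \oplus_K y \rVert}$ | $x \oplus_K \Big(\tan_K\big(\sqrt{\lvert K \rvert}\, \tfrac{\lambda^K_x \lVert v \rVert}{2}\big) \frac{v}{\sqrt{\lvert K \rvert}\, \lVert v \rVert}\Big)$ | (Do Carmo & Flaherty Francis, 1992); (Petersen, 2006); (Skopek et al., 2019) |

Remark C.5 (The Grassmannian and cut locus). Due to the cut locus of the Grassmannian, the logarithm map does not exist globally (Bendokat et al., 2024). In this paper, when we use $\mathrm{Log}_P(Q)$ on the Grassmannian, we implicitly assume $P$ and $Q$ are not in each other's cut locus. Moreover, the gyro addition and scalar multiplication on the Grassmannian are, strictly speaking, also not globally defined (Nguyen, 2022a), due to the cut locus. Following Nguyen (2022a) and Nguyen & Yang (2023), we implicitly assume the gyro operations are well-defined on the Grassmannian.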
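The cut-locus caveat in Remark C.5 can be illustrated with principal angles. The sketch below is a NumPy illustration under stated assumptions, not the authors' implementation: it uses the ONB distance $\lVert \arccos(\Sigma) \rVert$ from Tab. 8 and the standard characterization that $Q$ lies in the cut locus of $P$ when some principal angle equals $\pi/2$, i.e. when $P^\top Q$ is singular, so that $(P^\top Q)^{-1}$ in the ONB logarithm does not exist. The function names and the tolerance `tol` are hypothetical.

```python
import numpy as np


def principal_angles(U, V):
    """Principal angles between span(U) and span(V); U, V have orthonormal columns."""
    sigma = np.linalg.svd(U.T @ V, compute_uv=False)
    return np.arccos(np.clip(sigma, -1.0, 1.0))   # clip guards against round-off


def geodesic_distance(U, V):
    """ONB-perspective geodesic distance: the 2-norm of the principal angles."""
    return np.linalg.norm(principal_angles(U, V))


def in_cut_locus(U, V, tol=1e-8):
    """True if U^T V is numerically singular, i.e. some principal angle is pi/2."""
    return bool(np.min(np.linalg.svd(U.T @ V, compute_uv=False)) < tol)


# Lines through the origin in R^3, i.e. points of Gr(1, 3).
t = 0.3
U = np.array([[1.0], [0.0], [0.0]])
V = np.array([[np.cos(t)], [np.sin(t)], [0.0]])   # rotated by angle t from U
W = np.array([[0.0], [1.0], [0.0]])               # orthogonal to U

print(geodesic_distance(U, V))               # the rotation angle between the two lines
print(in_cut_locus(U, V), in_cut_locus(U, W))
```

In practice, a check of this kind could be used to detect batches for which the Grassmannian logarithm is ill-conditioned before applying operators that rely on it.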
C.4 INTUITIVE EXPLANATIONS OF GYROGROUPS

This subsection explains gyrogroups intuitively by contrasting them with the familiar real line $\mathbb{R}$. Gyrogroups generalize the concept of addition in $\mathbb{R}$ or $\mathbb{R}^n$ to non-Euclidean spaces. Gyrogroups in Def. 2.1 extend the concept of groups to non-associative algebraic systems, i.e., in general $a \oplus (b \oplus c) \neq (a \oplus b) \oplus c$. The gyrogroups over the manifolds in Tab. 2 are defined by Eqs. (3), (5) and (6). More importantly, these definitions are natural generalizations of Euclidean vector operations. Take the gyro addition in Eq. (3) as an example. When the manifold $\mathcal{M}$ is the Euclidean space $\mathbb{R}^n$, the gyro addition defined by Eq. (3) becomes the familiar vector addition.

Gyration & gyroassociativity: The gyration is an automorphism $\mathrm{gyr}[a,b] : G \to G$ for each $a, b \in G$. It is called an automorphism because it preserves the gyro addition:

$$\mathrm{gyr}[a,b](c \oplus d) = \mathrm{gyr}[a,b](c) \oplus \mathrm{gyr}[a,b](d), \quad \forall c, d \in G. \tag{30}$$

The gyration is used to generalize the associativity:

$$a + (b + c) = (a + b) + c, \quad \forall a, b, c \in \mathbb{R}. \tag{31}$$

Take the Grassmannian gyrogroup $\{\mathrm{Gr}(p,n), \oplus_{\mathrm{Gr}}\}$ as an example. For any $U, V, R \in \mathrm{Gr}(p,n)$, we have

$$\text{Non-associativity:} \quad V \oplus_{\mathrm{Gr}} (U \oplus_{\mathrm{Gr}} R) \neq (V \oplus_{\mathrm{Gr}} U) \oplus_{\mathrm{Gr}} R, \tag{32}$$

$$\text{Left gyroassociativity:} \quad V \oplus_{\mathrm{Gr}} (U \oplus_{\mathrm{Gr}} R) = (V \oplus_{\mathrm{Gr}} U) \oplus_{\mathrm{Gr}} \mathrm{gyr}_{\mathrm{Gr}}[V, U](R). \tag{33}$$

Eq. (33) differs from Eq. (31) only by a gyration. The concrete expression of the gyration over the Grassmannian is presented by Nguyen (2022a, Def. 3.18).

Left reduction is defined as $\mathrm{gyr}[a,b] = \mathrm{gyr}[a \oplus b, b]$ for any $a$ and $b$ in the gyrogroup $G$. It is called a "left reduction" because $b$ in $a \oplus b$ can be canceled out.

Gyrocommutativity in Def. 2.2 generalizes the commutative property ($a + b = b + a$) by the gyration operator $\mathrm{gyr}[a,b]$.

Nonreductive gyrogroups in Def. 2.3 are relaxed gyrogroups, allowing structures where the left reduction law (G4) does not hold. Their prototype comes from the gyrogroup of the Grassmannian (Nguyen, 2022a, Thm. 3.20), where (G4) does not hold.
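The notions above can be checked numerically on the Poincaré ball. The sketch below uses the Möbius addition $\oplus_K$ from Tab. 9 with $K = -1$ and evaluates the gyration through the gyrator identity $\mathrm{gyr}[a,b]c = \ominus(a \oplus b) \oplus (a \oplus (b \oplus c))$, a standard identity for gyrogroups (Ungar, 2009). The function names and the sample points are illustrative assumptions.

```python
import numpy as np

def mobius_add(x, y, K=-1.0):
    """Gyro addition x (+)_K y on the constant curvature space (Tab. 9)."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 - 2 * K * xy - K * y2) * x + (1 + K * x2) * y
    den = 1 - 2 * K * xy + K ** 2 * x2 * y2
    return num / den

def gyr(a, b, c, K=-1.0):
    """gyr[a, b]c via the gyrator identity; the gyro inverse of x is -x."""
    return mobius_add(-mobius_add(a, b, K), mobius_add(a, mobius_add(b, c, K), K), K)

# Arbitrary sample points inside the unit ball (K = -1).
a = np.array([0.3, 0.1, -0.2])
b = np.array([-0.1, 0.4, 0.2])
c = np.array([0.2, -0.3, 0.1])
d = np.array([0.1, 0.2, 0.3])

lhs = mobius_add(a, mobius_add(b, c))             # a + (b + c)
naive = mobius_add(mobius_add(a, b), c)           # (a + b) + c: differs in general
fixed = mobius_add(mobius_add(a, b), gyr(a, b, c))  # (a + b) + gyr[a, b]c
```

Here `lhs` and `fixed` agree (left gyroassociativity), whereas `naive` does not match `lhs`, which is the non-associativity that the gyration corrects.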
Relation between left reduction, non-reduction, & pseudo-reduction: The left reduction can induce many basic properties of gyrogroups (see Thm. D.1 and Ungar (2009, Thms. 1.13 and 1.14)). However, non-reduction does not guarantee several of these basic properties, which undermines the rationality of non-reductive gyrogroups. Therefore, as an intermediate notion, we propose pseudo-reduction. The associated pseudo-reductive gyrogroups maintain most of the basic properties of gyrogroups. Please refer to Thm. D.1 and its remark for more details.

D FIRST PSEUDO-REDUCTIVE GYROGROUPS PROPERTIES

Theorem D.1 (First Pseudo-reductive Gyrogroups Properties). Let $\{G, \oplus\}$ be a pseudo-reductive gyrogroup. For any elements $P, Q, R, X \in G$, we have:

1. If $P \oplus Q = P \oplus R$, then $Q = R$ (General Left Cancellation law; see item 8 below).
2. $\mathrm{gyr}[E, P] = \mathbb{1}$ for any left identity $E$ in $G$.
3. $\mathrm{gyr}[X, P] = \mathbb{1}$ for any left inverse $X$ of $P$ in $G$.
4. There is a left identity which is a right identity.
5. There is only one left identity.
6. Every left inverse is a right inverse.
7. There is only one left inverse, $\ominus P$, of $P$, and $\ominus(\ominus P) = P$.
8. The left cancellation law: $\ominus P \oplus (P \oplus Q) = Q$.
9. The gyrator identity: $\mathrm{gyr}[P, Q]X = \ominus(P \oplus Q) \oplus \{P \oplus (Q \oplus X)\}$.
10. $\mathrm{gyr}[P, Q]E = E$.
11. $\mathrm{gyr}[P, Q](\ominus X) = \ominus\mathrm{gyr}[P, Q]X$.
12. $\mathrm{gyr}[P, E] = \mathbb{1}$.
13. The gyrosum inversion law: $\ominus(P \oplus Q) = \mathrm{gyr}[P, Q](\ominus Q \ominus P)$.

Credit of the proof. Most of the proof is borrowed from Ungar (2009, Thms. 1.13 and 1.14), which presents the corresponding properties for gyrogroups. After re-analyzing these two theorems, we find that most of the proof can be readily extended to pseudo-reductive gyrogroups.

Proof. This theorem follows Ungar (2009, Thms. 1.13 and 1.14), which presents some useful properties for gyrogroups. We argue that all the properties except $\mathrm{gyr}[a, a] = \mathbb{1}$ are independent of the left reduction law (G4), and are therefore satisfied on pseudo-reductive gyrogroups.
All the properties can be proven in the same way as the ones in Thms. 1.13 and 1.14 by Ungar (2009). We summarize the logic in the following:

- left gyroassociativity $\Rightarrow$ 1
- left gyroassociativity + 1 $\Rightarrow$ 2
- definition $\Rightarrow$ 3
- left gyroassociativity + 1 + 3 $\Rightarrow$ 4
- definition $\Rightarrow$ 5
- left gyroassociativity + (G2) + 1 + 3 + 4 + 5 $\Rightarrow$ 6
- left gyroassociativity + 3 $\Rightarrow$ 8
- left gyroassociativity + left cancellation in 8 $\Rightarrow$ 9
- gyrator identity in 9 $\Rightarrow$ 10
- left cancellation in 8 + gyrator identity in 9 $\Rightarrow$ 12
- left cancellation in 8 + gyrator identity in 9 $\Rightarrow$ 13

Remark D.2. For non-reductive gyrogroups, it is unknown whether 2 and 3 hold. Therefore, every property based on 2 or 3 is not guaranteed, such as 4 and 6 to 10. The absence of these basic properties undermines the rationality of non-reductive gyrogroups. In contrast, most basic properties of gyrogroups are preserved in our pseudo-reductive gyrogroups.

E GRASSMANNIAN GYROBN UNDER THE PROJECTOR PERSPECTIVE

Given two isometric manifolds $\{\mathcal{M}_1, g^1\}$ and $\{\mathcal{M}_2, g^2\}$, the induced gyro structures in Eq. (3)-Eq. (9) have the following relations. We first present a useful lemma.

Lemma E.1. Given manifolds $\{\mathcal{M}_1, g^1\}$ and $\{\mathcal{M}_2, g^2\}$ and a Riemannian isometry $f : \mathcal{M}_1 \to \mathcal{M}_2$, we have the following:

1. The groupoid $\{\mathcal{M}_1, \oplus_1\}$ induced by $g^1$ is pseudo-reductive (left-invariant), iff the groupoid $\{\mathcal{M}_2, \oplus_2\}$ induced by $g^2$ is pseudo-reductive (left-invariant);
2. Any gyration in $\{\mathcal{M}_1, g^1\}$ preserves the gyronorm iff any gyration in $\{\mathcal{M}_2, g^2\}$ preserves the gyronorm;
3. $f$ preserves the gyrodistance.

Proof. First, we review some facts about gyrogroups under the Riemannian isometry. As shown by Nguyen & Yang (2023, Thm. 2.5), $\{\mathcal{M}_1, \oplus_1\}$ satisfies (G1-G3) in Def. 2.1 iff $\{\mathcal{M}_2, \oplus_2\}$ satisfies (G1-G3). Besides, $f^{-1}$ is an isomorphism satisfying

$$f^{-1}(P \oplus_2 Q) = f^{-1}(P) \oplus_1 f^{-1}(Q), \tag{34}$$

$$f^{-1}(t \odot_2 P) = t \odot_1 f^{-1}(P), \tag{35}$$

$$f^{-1}(\ominus_2 P) = \ominus_1 f^{-1}(P), \tag{36}$$

$$\mathrm{gyr}_2[P, Q](R) = f\left(\mathrm{gyr}_1[f^{-1}(P), f^{-1}(Q)](f^{-1}(R))\right),$$
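(37)

where $P, Q, R \in \mathcal{M}_2$ are arbitrary points, $t \in \mathbb{R}$ is a real scalar, and $\ominus_i$, $\mathrm{gyr}_i$ are the gyro inverses and gyrations on $\mathcal{M}_i$ for $i = 1, 2$. $f$ is also an isomorphism with similar properties.

We only need to prove one direction of the iff condition for the pseudo-reduction and the norm invariance. We focus on the direction from $\mathcal{M}_1$ to $\mathcal{M}_2$ and follow the above notations in the following.

Pseudo-reduction:

$$\begin{aligned} \mathrm{gyr}_2[\ominus_2 P, P] &= f \circ \mathrm{gyr}_1[f^{-1}(\ominus_2 P), f^{-1}(P)] \circ f^{-1} && \text{(Eq. (37))} \\ &= f \circ \mathrm{gyr}_1[\ominus_1 f^{-1}(P), f^{-1}(P)] \circ f^{-1} && \text{(Eq. (36))} \\ &= \mathbb{1}. \end{aligned}$$

Gyronorm invariance under gyrations: For simplicity, we denote the gyronorm, identity element, and Riemannian logarithm on $\mathcal{M}_i$ as $\|\cdot\|_i$, $E_i$, and $\mathrm{Log}^i$, respectively. First, the following demonstrates that the Riemannian isometry $f$ preserves the gyronorm:

$$\begin{aligned} \|P\|_2 &= \left\|\mathrm{Log}^2_{E_2}(P)\right\|_{E_2} \\ &\overset{(1)}{=} \left\|\mathrm{Log}^1_{f^{-1}(E_2)} f^{-1}(P)\right\|_{f^{-1}(E_2)} \\ &\overset{(2)}{=} \left\|\mathrm{Log}^1_{E_1} f^{-1}(P)\right\|_{E_1} \\ &= \left\|f^{-1}(P)\right\|_1. \end{aligned} \tag{39}$$

The above derivation comes from the following.

(1) As $f : \mathcal{M}_1 \to \mathcal{M}_2$ is a Riemannian isometry, we have the following equations:

$$\mathrm{Log}^2_P Q = f_{*, f^{-1}(P)}\left(\mathrm{Log}^1_{f^{-1}(P)} f^{-1}(Q)\right), \quad \forall P, Q \in \mathcal{M}_2, \tag{40}$$

$$g^2_P(V, W) = g^1_{f^{-1}(P)}\left((f^{-1})_{*,P}(V), (f^{-1})_{*,P}(W)\right), \quad \forall V, W \in T_P\mathcal{M}_2, \tag{41}$$

where $(\cdot)_*$ is the differential map;

(2) $E_2 = f(E_1)$.

Then we have the following:

$$\begin{aligned} \left\|\mathrm{gyr}_2[P, Q](R)\right\|_2 &\overset{(1)}{=} \left\|f\left(\mathrm{gyr}_1[f^{-1}(P), f^{-1}(Q)](f^{-1}(R))\right)\right\|_2 \\ &\overset{(2)}{=} \left\|\mathrm{gyr}_1[f^{-1}(P), f^{-1}(Q)](f^{-1}(R))\right\|_1 \\ &\overset{(3)}{=} \left\|f^{-1}(R)\right\|_1 \\ &\overset{(4)}{=} \|R\|_2. \end{aligned}$$

The above derivation comes from the following. (1) Eq. (37); (2) Eq. (39); (3) any gyration on $\mathcal{M}_1$ preserves the gyronorm; (4) Eq. (39).

Invariance of gyrodistance under $f$: Denoting the gyrodistance on $\mathcal{M}_i$ as $d_i$, for any $P, Q \in \mathcal{M}_1$, we have the following:

$$\begin{aligned} d_1(P, Q) &= \left\|\ominus_1 P \oplus_1 Q\right\|_1 \\ &\overset{(1)}{=} \left\|f(\ominus_1 P \oplus_1 Q)\right\|_2 \\ &\overset{(2)}{=} \left\|\ominus_2 f(P) \oplus_2 f(Q)\right\|_2 \\ &= d_2(f(P), f(Q)). \end{aligned}$$

The above derivation comes from the following. (1) $f$ preserves the gyronorm; (2) $f$ is an isomorphism.

Given a batch of activations $\{P_{1 \ldots N}\}$ on a gyrogroup $\{\mathcal{M}, \oplus\}$, we denote the GyroBN as

$$\mathrm{GyroBN}(\{P_i\}; B, s, \epsilon, \eta), \tag{44}$$

where $B \in \mathcal{M}$ and $s \in \mathbb{R}$ are biasing and scaling parameters, $\epsilon$ is a small positive value, and $\eta$ is the momentum.

Theorem E.2.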
Given manifolds $\{\mathcal{M}_1, g^1\}$ and $\{\mathcal{M}_2, g^2\}$ and a Riemannian isometry $f : \mathcal{M}_1 \to \mathcal{M}_2$, for a batch of activations $\{P_{1 \ldots N}\}$ in $\mathcal{M}_1$, $\mathrm{GyroBN}_1(P_i; B, s, \epsilon, \gamma)$ in $\mathcal{M}_1$ can be calculated as

$$\mathrm{GyroBN}_1(P_i; B, s, \epsilon, \gamma) = f^{-1}\left(\mathrm{GyroBN}_2(f(P_i); f(B), s, \epsilon, \gamma)\right), \tag{45}$$

where $\mathrm{GyroBN}_2$ is the GyroBN in $\mathcal{M}_2$.

Proof. This theorem is inspired by Chen et al. (2024b, Thm. 5.3), which characterizes the LieBNs under isometric manifolds. As gyrogroups are natural generalizations of Lie groups, our GyroBN is expected to have similar results. The following proof follows a similar logic to the one by Chen et al. (2024b, Thm. 5.3), except that all operations are gyro operations.

For $i = 1, 2$, we denote Eq. (15) and the binary gyro barycenter on $\mathcal{M}_i$ as $\xi^i(\cdot \mid M, v^2, B, s)$ and $\mathrm{Bar}^i_\eta(\cdot, \cdot)$. Let $\mathcal{B} = \{P_{1 \ldots N}\}$ and $f(\mathcal{B}) = \{f(P_{1 \ldots N})\}$. We only need to show the following:

$$M_2 = f(M_1), \quad v_1^2 = v_2^2, \tag{46}$$

$$\xi^1(P_i \mid M_1, v^2, B, s) = f^{-1}\left(\xi^2(f(P_i) \mid M_2, v^2, f(B), s)\right), \tag{47}$$

$$\mathrm{Bar}^1_\eta(P, Q) = f^{-1}\left(\mathrm{Bar}^2_\eta(f(P), f(Q))\right), \quad \forall P, Q \in \mathcal{M}_1, \tag{48}$$

where $M_i$ and $v_i^2$ are the batch Fréchet mean and variance over $\mathcal{M}_i$ for $i = 1, 2$. Eqs. (46) and (48) can be directly obtained by the invariance of the gyrodistance under $f$ (Lem. E.1). We only need to show Eq. (47). We have the following:

$$\begin{aligned} f^{-1}\left(\xi^2(f(P_i) \mid M_2, v^2, f(B), s)\right) &= f^{-1}\left(f(B) \oplus_2 \left(t \odot_2 (\ominus_2 M_2 \oplus_2 f(P_i))\right)\right) \\ &\overset{(1)}{=} f^{-1}\left(f\left(B \oplus_1 \left(t \odot_1 (\ominus_1 M_1 \oplus_1 P_i)\right)\right)\right) \\ &= B \oplus_1 \left(t \odot_1 (\ominus_1 M_1 \oplus_1 P_i)\right) \\ &= \xi^1(P_i \mid M_1, v^2, B, s), \end{aligned}$$

where $t = \frac{s}{\sqrt{v^2 + \epsilon}}$. The above derivation comes from the following. (1) $f$ is an isomorphism preserving gyro operations.

As $\pi^{-1} : \widetilde{\mathrm{Gr}}(p,n) \to \mathrm{Gr}(p,n)$ is a Riemannian isometry, Thm. E.2 indicates that the GyroBN under the projector perspective can be calculated via the ONB perspective by the following process:

1. mapping data into the ONB perspective by $\pi^{-1} : \widetilde{\mathrm{Gr}}(p,n) \to \mathrm{Gr}(p,n)$;
2. normalizing data by the GyroBN under $\mathrm{Gr}(p,n)$;
3. mapping normalized data back to $\widetilde{\mathrm{Gr}}(p,n)$ by $\pi$.

Besides, either Lem. E.1 or Prop. 3.6 guarantees theoretical control over the gyromean and gyrovariance under the projector perspective.
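The map-normalize-map-back recipe of Thm. E.2 can be sketched as a generic wrapper. The toy below is not the Grassmannian setting of the paper: as an assumption for illustration, $\mathcal{M}_2 = \mathbb{R}^n$ carries the vector gyro structure ($K = 0$), and $\mathcal{M}_1$ is the positive orthant with the metric pulled back by the element-wise logarithm, so that $f = \log$ is an isometry by construction. Running statistics and momentum are omitted, and all function names are hypothetical.

```python
import numpy as np

def gyrobn_euclidean(X, b, s, eps=1e-5):
    """Batch-statistics GyroBN on R^n (K = 0): b + s / sqrt(v^2 + eps) * (x - mean)."""
    mean = X.mean(axis=0)                            # Frechet mean in R^n
    var = np.mean(np.sum((X - mean) ** 2, axis=1))   # scalar Frechet variance
    return b + s / np.sqrt(var + eps) * (X - mean)

def gyrobn_via_isometry(X, B, s, f, f_inv, gyrobn2, eps=1e-5):
    """Eq. (45): GyroBN_1(P_i; B, s) = f^{-1}(GyroBN_2(f(P_i); f(B), s))."""
    return f_inv(gyrobn2(f(X), f(B), s, eps))

rng = np.random.default_rng(0)
X = np.exp(rng.standard_normal((32, 3)))  # batch of points in the positive orthant
B = np.array([2.0, 0.5, 1.0])             # biasing parameter in M_1
s = 1.5                                   # scaling parameter

Y = gyrobn_via_isometry(X, B, s, np.log, np.exp, gyrobn_euclidean)

# Statistics measured in M_2 coordinates (i.e., after applying f = log).
out_mean = np.log(Y).mean(axis=0)
out_var = np.mean(np.sum((np.log(Y) - out_mean) ** 2, axis=1))
```

In this toy setting, the normalized batch has mean `B` and variance close to `s ** 2` when measured through `f`, mirroring the control over the gyromean and gyrovariance discussed above.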
F EXPERIMENTAL DETAILS AND ADDITIONAL ABLATIONS

F.1 SUMMARY OF OPERATORS IN GRASSMANNIAN AND HYPERBOLIC GYROBNS

The specific implementation of Alg. 1 on a gyrogroup can be carried out in a plug-in manner. This involves simply substituting the required operators from Tabs. 2 and 8 into Alg. 1. To streamline this process, we summarize the discussion in Sec. 6 in Tab. 9, where we present all the required operators for the Grassmannian and hyperbolic GyroBNs.

Table 9: Key operators in calculating GyroBN on the Grassmannian and hyperbolic manifolds. Here $P, Q \in \mathrm{Gr}(p,n)$ are two ONB Grassmannian points, while $x, y \in \mathbb{P}^n_K$ are two Poincaré vectors. Other notations follow from Tabs. 2 and 8.

| Operator | $\mathrm{Gr}(p,n)$ | $\mathbb{P}^n_K$ |
|---|---|---|
| Identity element | $I_{p,n}$ | $0 \in \mathbb{R}^n$ |
| $P \oplus_{\mathrm{Gr}} Q$ or $x \oplus_K y$ | $\mathrm{mexp}(\Omega)V$ | $\frac{(1 - 2K\langle x, y \rangle - K\lVert y \rVert^2)x + (1 + K\lVert x \rVert^2)y}{1 - 2K\langle x, y \rangle + K^2 \lVert x \rVert^2 \lVert y \rVert^2}$ |
| $\ominus_{\mathrm{Gr}} P$ or $\ominus_K x$ | $\mathrm{mexp}(-\Omega) I_{p,n}$ | $-x$ |
| $t \odot_{\mathrm{Gr}} P$ or $t \odot_K x$ | $\mathrm{mexp}(t\Omega) I_{p,n}$ | $\frac{1}{\sqrt{\lvert K \rvert}} \tanh\left(t \tanh^{-1}(\sqrt{\lvert K \rvert}\, \lVert x \rVert)\right) \frac{x}{\lVert x \rVert}$ |
| $\mathrm{Bar}^{\mathrm{Gr}}_\gamma(Q, P)$ or $\mathrm{Bar}^K_\gamma(y, x)$ | $\mathrm{Exp}^{\mathrm{Gr}}_P(\gamma \mathrm{Log}^{\mathrm{Gr}}_P(Q))$ | $x \oplus_K (-x \oplus_K y) \odot_K t$ |
| Fréchet mean | Karcher Flow (Karcher, 1977) | (Lou et al., 2020, Alg. 1) |

[Figure 4 plot: four panels (NTU60 Training, NTU60 Testing, NTU120 Training, NTU120 Testing); x-axis: Epoch (0 to 150); curves: GyroGr and GyroGr-BN.]

Figure 4: Training and testing performance on the NTU datasets of 1Block GyroGr. Our BN improves the generalization abilities of GyroGr.

F.2 DETAILS ON THE GRASSMANNIAN EXPERIMENTS

F.2.1 DATASETS AND PREPROCESSING

HDM05³ (Müller et al., 2007). It consists of 2,273 skeleton-based motion capture sequences executed by different actors. Each frame consists of 3D coordinates of 31 joints. We remove the under-represented clips, trimming the dataset down to 2,086 instances scattered throughout 117 classes. Following Nguyen & Yang (2023), we model each sequence as a $93 \times 10$ Grassmannian matrix.

NTU60 (Shahroudy et al., 2016).
It has 56,880 sequences of 3D skeleton data classified into 60 classes, where each frame contains the 3D coordinates of 25 or 50 body joints. We use mutual actions and follow the cross-view protocol (Shahroudy et al., 2016). Following Nguyen & Yang (2023), we model each sequence as a $150 \times 10$ Grassmannian matrix.

NTU120⁴ (Liu et al., 2019). This dataset contains 114,480 sequences in 120 action classes. We use mutual actions and follow the cross-setup protocol (Liu et al., 2019). Following Nguyen & Yang (2023), we model each sequence as a $150 \times 10$ Grassmannian matrix.

³ https://resources.mpi-inf.mpg.de/HDM05/
⁴ https://github.com/shahroudy/NTURGB-D

F.2.2 IMPLEMENTATION DETAILS

Architectures. The 1Block GyroGr is structured as: gyrotranslation → pooling → [GyroBN] → ProjMap → classification. The KBlock GyroGr consists of $K$ blocks of gyrotranslation and pooling, followed by a final ProjMap layer for classification. Since each pooling operation halves the dimensionality, we omit pooling in the last block when $K > 1$. Following Huang et al. (2018), the number of channels is set to 8. The momentum for GyroBN is the default value of 0.1. Additionally, in line with Nguyen & Yang (2023), we use the Cayley transform to approximate matrix exponentiation over skew-symmetric matrices.

Optimization. Following Nguyen & Yang (2023), we adopt the trivialization tricks (Lezcano Casado, 2019) for the Grassmannian parameters in the gyrotranslation and our GyroBN layers. Given a parameter $U \in \mathrm{Gr}(p,n)$, we parameterize it by a matrix $\hat{U} \in \mathbb{R}^{(n-p) \times p}$ such that

$$\begin{pmatrix} 0 & -\hat{U}^\top \\ \hat{U} & 0 \end{pmatrix} = \left[\overline{UU^\top}, \tilde{I}_{p,n}\right]. \tag{50}$$

The biasing parameter $U$ can be retrieved by

$$U = \mathrm{mexp}\left(\left[\overline{UU^\top}, \tilde{I}_{p,n}\right]\right) I_{p,n} = \mathrm{mexp}\begin{pmatrix} 0 & -\hat{U}^\top \\ \hat{U} & 0 \end{pmatrix} I_{p,n}. \tag{51}$$

In this way, all parameters lie in Euclidean spaces. We can therefore directly use the optimizers in PyTorch (Paszke et al., 2019). In all experiments, we use an SGD optimizer with a learning rate of 5e-2 and zero weight decay.
The batch size is 30, and the training epochs are 400, 200, and 200 for the HDM05, NTU60, and NTU120 datasets, respectively.

F.3 DETAILS ON THE HYPERBOLIC EXPERIMENTS

F.3.1 DATASETS

Cora (Sen et al., 2008). It is a citation network where nodes represent scientific papers in the area of machine learning, edges are citations between them, and node labels are academic (sub)areas.

Disease (Anderson & May, 1991). It represents a disease propagation tree, simulating the SIR disease transmission model, with each node representing either an infection or a non-infection state.

Airport (Zhang & Chen, 2018). It is a transductive dataset where nodes represent airports and edges represent the airline routes from OpenFlights.org.

Pubmed (Namata et al., 2012). This is a standard benchmark describing citation networks where nodes represent scientific papers in the area of medicine, edges are citations between them, and node labels are academic (sub)areas.

F.3.2 IMPLEMENTATION DETAILS

We use the official implementations of HGCN⁵ (Chami et al., 2019) and Lou et al. (2020)⁶ to conduct the experiments. We follow all the settings of Lou et al. (2020, Sec. H.1). Specifically, the baseline encoder is an HNN (Ganea et al., 2018) with two layers: the first maps the input feature dimension to 128, and the second maps 128 to 128. Hyperbolic BN layers, such as GyroBN or RBN, follow each layer. We use the Adam optimizer (Kingma, 2014), with a learning rate of 1e-2 and a weight decay of 1e-3, except for the Cora dataset, where the weight decay is set to 0.

F.4 HARDWARE

All experiments except the ones on the Pubmed dataset use an Intel Core i9-7960X CPU with 32GB RAM and an NVIDIA GeForce RTX 2080 Ti GPU. The experiments on the Pubmed dataset are conducted on a single NVIDIA Quadro RTX A6000 48GB GPU.

F.5 GYROBN AND MODEL GENERALIZATION ABILITY

We denote GyroGr with our GyroBN as GyroGr-BN. The training and testing curves on the NTU datasets are presented in Fig. 4.
We can observe that the gap between training and testing performance is much smaller when GyroGr is endowed with the Grassmannian GyroBN. This phenomenon indicates that our GyroBN can improve the generalization ability of the Grassmannian neural networks.

⁵ https://github.com/HazyResearch/hgcn
⁶ https://github.com/CUAI/Differentiable-Frechet-Mean

F.6 TRAINING EFFICIENCY

F.6.1 TRAINING EFFICIENCY ON THE GRASSMANNIAN NETWORKS

Table 10: Average training time (seconds/epoch) for GyroGr with and without different Grassmannian BNs. The dimensions of the input Grassmannian matrices for the BN layer are also reported.

| Methods | HDM05 ($47 \times 10$) | NTU60 ($75 \times 10$) | NTU120 ($75 \times 10$) |
|---|---|---|---|
| GyroGr | 2.19 | 50.92 | 80.72 |
| GyroGr-ManifoldNorm | 4.98 | 242.12 | 409.48 |
| GyroGr-RBN | 5.16 | 242.63 | 410.08 |
| GyroGr-GyroBN | 3.10 | 59.55 | 108.92 |

As the Riemannian computations over the ONB Grassmannian $\mathrm{Gr}(p,n)$ involve matrix decompositions, the disparity in efficiency is more evident.

Empirical Results. Tab. 10 shows the training efficiency of GyroGr with and without different Grassmannian BNs. Compared to other Grassmannian BNs, GyroBN is significantly more efficient due to the simplicity of the gyro operation.

Analysis. The key difference between GyroBN (Alg. 1) and ManifoldNorm (Chakraborty, 2020, Algs. 1-2) or RBN (Lou et al., 2020, Alg. 2) lies in their methods for centering, biasing, and scaling. GyroBN uses gyro operations, while ManifoldNorm and RBN rely on Riemannian operators, such as parallel transport and the Riemannian logarithmic and exponential maps. This distinction underpins the efficiency of GyroBN. The three primary contributing factors, ranked by importance, are as follows:

1. Riemann vs. Gyro. The Riemannian operators over the Grassmannian involve computationally expensive processes like SVD or matrix inversion (see Tab. 8 for Riemannian exp and log, and (Edelman et al., 1998, Thm. 2.4) for parallel transport).
Consequently, ManifoldNorm and RBN require multiple SVD or matrix inversion operations. In contrast, GyroBN is relatively simpler. As discussed in Sec. 6.1, the gyro operation can be further simplified, and the involved SVD is performed on a reduced $p \times p$ matrix instead of an $n \times p$ one. Additionally, as noted in App. F.2.2, computationally intensive matrix exponentiation is efficiently approximated using the Cayley map.

2. Reduced Matrix Products. As shown in Tab. 8, each Riemannian operator involves several matrix products over $n \times p$ matrices. GyroBN reduces these to matrix products over $(n-p) \times p$ or $p \times p$ matrices, which are computationally more efficient (see, e.g., Prop. 6.1).

3. Optimization. The biasing parameter optimization in GyroBN is simpler. ManifoldNorm uses an $n \times n$ orthogonal matrix for biasing, and RBN employs an $n \times p$ Grassmannian matrix, both requiring Riemannian optimization. In contrast, GyroBN applies trivialization tricks (parameterizing manifold data via the Riemannian exp (Lezcano Casado, 2019), as shown in Eq. (51)), making the biasing parameter an $(n-p) \times p$ Euclidean matrix. Furthermore, Eq. (51) and Eq. (22) share a similar form, allowing them to be jointly simplified. Although RBN could adopt similar trivialization for biasing, this would introduce an additional Riemannian exp step, leaving little advantage over the Riemannian optimization. In summary, GyroBN benefits from joint simplification with trivialization, whereas the other two Grassmannian BN methods require additional Riemannian optimization.

F.6.2 TRAINING EFFICIENCY ON THE HYPERBOLIC NETWORKS

Table 11: Average training time (seconds/epoch) for HNN with and without different hyperbolic BNs.

| Methods | Cora | Disease | Airport | Pubmed |
|---|---|---|---|---|
| HNN | 0.0323 | 0.0271 | 0.054 | 0.1253 |
| HNN-RBN-H | 0.0905 | 0.0883 | 0.1215 | 0.3416 |
| HNN-GyroBN-H | 0.0757 | 0.0842 | 0.119 | 0.3351 |

Recalling Tab. 8, the Riemannian logarithmic and exponential maps over the hyperbolic space contain gyro operations.
Therefore, it is expected that our GyroBN is more efficient than RBN. Tab. 11 reports the average training time per epoch, indicating that our GyroBN is more efficient than RBN (Lou et al., 2020) in the hyperbolic space.

F.7 EXPERIMENTS ON OTHER HYPERBOLIC BASELINES

This subsection validates our GyroBN on the HNN++ (Shimizu et al., 2020) and RResNet (Katsman et al., 2024) backbones. We follow the HNN experiments and use these two backbones as the encoder for the link prediction task.

F.7.1 EXPERIMENTS ON HNN++

Table 12: Comparison of HNN++ with or without RBN-H and GyroBN-H on the Cora, Disease, and Airport datasets.

| Dataset | Cora | Disease | Airport |
|---|---|---|---|
| HNN++ | 91.06 ± 0.47 | 77.83 ± 1.39 | 94.93 ± 0.23 |
| HNN++-RBN-H | 88.22 ± 0.67 | 59.50 ± 1.67 | 94.91 ± 0.23 |
| HNN++-GyroBN-H | 92.83 ± 0.32 | 79.53 ± 0.74 | 95.89 ± 0.10 |

We evaluate HNN++ with and without RBN-H (Lou et al., 2020) and our GyroBN-H on the Cora, Disease, and Airport datasets. As in the HNN experiments, we use a two-layer HNN++ backbone as the encoder, with hyperbolic hidden feature dimensions set to 128. Tab. 12 reports the five-fold average results, leading to the following findings:

1. GyroBN can improve and stabilize HNN++. While RBN can degrade HNN++'s performance, particularly on the Cora dataset, GyroBN consistently enhances it. Additionally, we observed that HNN++ could be sensitive to weight decay. For instance, on the Cora dataset, setting the weight decay to 1e-3 reduces the performance to 62.55, prompting us to use zero weight decay for HNN++ on this dataset. In contrast, HNN++ with GyroBN is robust to weight decay. Furthermore, HNN++-GyroBN achieves the smallest performance variance across all three datasets, indicating that GyroBN facilitates more robust learning.

2. GyroBN can accelerate convergence. On the Cora dataset, HNN++ requires over 400 epochs to converge. In contrast, HNN++-GyroBN converges in approximately 150 epochs, a substantially faster convergence.
F.7.2 EXPERIMENTS ON RRESNET

Table 13: Comparison of RResNet with or without RBN-H and GyroBN-H under different hidden feature dimensions on the Cora dataset.

| Dim | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|
| RResNet | 71.77 ± 6.67 | 77.38 ± 10.82 | 77.14 ± 7.48 | 66.73 ± 11.85 | 80.75 ± 4.12 |
| RResNet-RBN-H | 61.94 ± 2.28 | 63.54 ± 3.01 | 62.26 ± 3.48 | 60.38 ± 2.97 | 87.92 ± 2.67 |
| RResNet-GyroBN-H | 84.58 ± 5.44 | 87.67 ± 2.53 | 88.08 ± 2.90 | 86.50 ± 1.59 | 89.52 ± 3.42 |

We focus on the RResNet Horo (Katsman et al., 2024), which leverages a vector field induced by the horosphere projection feature map. Following the settings of HNN and HNN++, we use 2 residual blocks as the backbone. For a more comprehensive comparison, we evaluate hidden feature dimensions ranging from 8 to 128. Tab. 13 reports the five-fold average results, leading to the following findings:

1. GyroBN consistently improves performance. While RBN performs well only under the 128-dimensional setting, it fails under other dimensions, performing worse than the baseline RResNet. In contrast, GyroBN consistently improves RResNet's performance across all hidden feature dimensions.

2. GyroBN stabilizes performance and preserves representation power. Both RResNet and RResNet-RBN exhibit significant fluctuations in performance across different hidden dimensions. However, RResNet-GyroBN shows a more stable performance. This indicates that our GyroBN could maintain the representation power of the backbone network.

F.8 EXPERIMENTS ON COVARIATE SHIFTS

The covariate shift serves as the main motivation for the classical Euclidean BN (Ioffe & Szegedy, 2015). Similar challenges arise in Riemannian networks, where the complexity of the manifold often disrupts data distributions, underscoring the utility of our proposed GyroBN. To demonstrate covariate shifts, we perform numerical experiments on the Grassmannian and hyperbolic networks.
F.8.1 COVARIATE SHIFTS ON THE GRASSMANNIAN NETWORK

Table 14: Ten-fold results for the geodesic distance $d(M_{\mathrm{out}}, I_{p,n})$ and $\mathrm{shift} = \frac{d(M_{\mathrm{out}}, I_{p,n})}{b} \times 100$, where $b = \sqrt{p}\,\frac{\pi}{2}$ is the upper bound of the geodesic distance over $\mathrm{Gr}(p,n)$.

| | Pooling | Transformation |
|---|---|---|
| $d(M_{\mathrm{out}}, I_{p,n})$ | 2.90 ± 0.11 | 4.18 ± 0.04 |
| shift (%) | 58.47% ± 2.28% | 84.22% ± 0.72% |

The transformation and pooling layers in the GyroGr baseline can be expressed as

$$f_{\mathrm{trans}} : \mathrm{Gr}(p,n) \to \mathrm{Gr}(p,n), \tag{52}$$

$$f_{\mathrm{pooling}} : \mathrm{Gr}(p,n) \to \mathrm{Gr}(p, n/2). \tag{53}$$

For simplicity, we assume $n$ is even for the pooling layer. Since the ONB Grassmannian $\mathrm{Gr}(p,n)$ is a quotient manifold and pooling changes dimensions, we use the geodesic distance between the batch mean and the identity element $I_{p,n} = (I_p, 0)^\top \in \mathbb{R}^{n \times p}$ as a consistent measure. Note that the geodesic distance on $\mathrm{Gr}(p,n)$ is bounded by $\sqrt{p}\,\frac{\pi}{2}$ (Wong, 1967, Thm. 8). We randomly generate 30 Grassmannian matrices of size $100 \times 10$ with the initial batch mean being the identity element $I_{p,n}$. We denote the resulting batch mean after transformation or pooling as $M_{\mathrm{out}}$. If the distribution were preserved, the geodesic distance $d(M_{\mathrm{out}}, I_{p,n})$ (or $d(M_{\mathrm{out}}, I_{p,n/2})$ for pooling) would be zero. Tab. 14 shows that $M_{\mathrm{out}}$ significantly deviates from $I_{p,n}$, indicating the covariate shift.

F.8.2 COVARIATE SHIFTS ON THE HYPERBOLIC NETWORK

We examine the covariate shift in the transformation layer (Ganea et al., 2018, Lem. 6) in HNN. We focus on the canonical Poincaré ball (curvature $K = -1$). We randomly generate 30 five-dimensional Poincaré vectors. The batch mean vectors of the input and output of the transformation layer are

$$M_{\mathrm{in}} = [ 0.0055, 0.0503, 0.0913, 0.0493, 0.0652], \tag{54}$$

$$M_{\mathrm{out}} = [ 0.0969, 0.0443, 0.0385, 0.0936, 0.0770]. \tag{55}$$

The geodesic distance between $M_{\mathrm{in}}$ and $M_{\mathrm{out}}$ is 0.55, indicating the covariate shift in HNN.

F.9 EXPERIMENTS ON CONDITION NUMBERS

In the GyroGr network, the transformation layer outputs $n \times p$ Grassmannian matrices, whose condition numbers are trivial since all singular values of a Grassmannian matrix are 1.
Similarly, for the HNN model, the transformation layer outputs a vector, leading to a trivial condition number as well. Thus, we focus on evaluating the condition numbers of the weight matrices in the hidden transformation layer and of the overall network Jacobian matrix. Without loss of generality, we evaluate the one-block GyroGr network, both with and without different BN layers, using each trained model. The results demonstrate that GyroBN consistently reduces the condition numbers of the weight matrices and the network Jacobian.

F.9.1 CONDITION NUMBERS OF WEIGHT MATRICES OF THE TRANSFORMATION LAYER

As shown in Tab. 15, GyroBN consistently reduces the condition numbers of the weight matrices across datasets. Interestingly, although ManifoldNorm achieves the lowest condition numbers, its performance could be even worse than the GyroGr baseline, suggesting that excessively reducing the condition number could limit the model's capacity to capture sufficiently expressive features.

Table 15: Statistics of condition numbers for the weight matrices in the GyroGr transformation layer along with different normalization methods. As weight matrices have 8 channels, we report the mean, standard deviation (std), minimum, and maximum values of their condition numbers. The dimensions of the 8-channel transformation matrices are $83 \times 10$ on the HDM05 dataset and $140 \times 10$ on the other datasets.

| Dataset | BN | Mean ± std | Min | Max |
|---|---|---|---|---|
| HDM05 | None | 3.81 ± 0.23 | 3.46 | 4.17 |
| HDM05 | ManifoldNorm-Gr | 1.97 ± 0.15 | 1.83 | 2.30 |
| HDM05 | RBN-Gr | 3.49 ± 0.46 | 2.96 | 4.43 |
| HDM05 | GyroBN-Gr | 2.37 ± 0.18 | 2.12 | 2.67 |
| NTU60 | None | 2.77 ± 0.12 | 2.61 | 2.94 |
| NTU60 | ManifoldNorm-Gr | 1.76 ± 0.10 | 1.63 | 1.92 |
| NTU60 | RBN-Gr | 2.17 ± 0.20 | 1.94 | 2.51 |
| NTU60 | GyroBN-Gr | 2.02 ± 0.14 | 1.82 | 2.28 |
| NTU120 | None | 3.35 ± 0.28 | 2.97 | 3.72 |
| NTU120 | ManifoldNorm-Gr | 1.91 ± 0.10 | 1.80 | 2.14 |
| NTU120 | RBN-Gr | 2.22 ± 0.11 | 2.00 | 2.36 |
| NTU120 | GyroBN-Gr | 2.16 ± 0.11 | 2.00 | 2.33 |

F.9.2 CONDITION NUMBER OF NETWORK JACOBIAN

We further analyze the condition number of the network Jacobian on the HDM05 and NTU120 datasets.
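We randomly select 100 samples to calculate the statistics of the associated network condition numbers, as shown in Tab. 16. Our GyroBN consistently lowers condition numbers compared to other methods across both datasets, indicating that GyroBN stabilizes network training. Additionally, the reduced maximum condition numbers, especially on the NTU120 dataset, suggest that GyroBN avoids extreme outliers in conditioning, further highlighting its advantage over previous methods. Notably, on the HDM05 dataset, both ManifoldNorm and RBN increase the network condition number.

Table 16: Condition numbers of the network Jacobian (output w.r.t. input) across different normalization methods on the GyroGr baseline. The statistics come from 100 random samples. The dimensions of the Jacobian matrices are $117 \times 930$ on the HDM05 dataset and $11 \times 9600$ on the NTU120 dataset.

| Dataset | BN | Mean ± std | Min | Max |
|---|---|---|---|---|
| HDM05 | None | 61.05 ± 4.15 | 52.19 | 74.29 |
| HDM05 | ManifoldNorm-Gr | 67.38 ± 6.82 | 51.15 | 88.81 |
| HDM05 | RBN-Gr | 64.16 ± 6.44 | 50.54 | 80.07 |
| HDM05 | GyroBN-Gr | 59.85 ± 4.33 | 48.46 | 70.47 |
| NTU120 | None | 122.24 ± 68.45 | 61.8 | 399.28 |
| NTU120 | ManifoldNorm-Gr | 116.75 ± 65.17 | 75.46 | 443.01 |
| NTU120 | RBN-Gr | 108.75 ± 52.27 | 63.19 | 349.25 |
| NTU120 | GyroBN-Gr | 97.01 ± 41.88 | 61.19 | 262.55 |

G.1 PROOF OF PROP. 3.2

Proof of Prop. 3.2. By Nguyen & Yang (2023, Lem. 2.3), easy computations show that Eq. (12) holds for $\mathrm{Gr}(p,n)$ iff it holds for $\widetilde{\mathrm{Gr}}(p,n)$. Without loss of generality, we prove the case for the projector perspective. Given any $P, Q \in \widetilde{\mathrm{Gr}}(p,n)$, Def. 3.18 by Nguyen (2022a) gives the expression for the gyration:

$$\mathrm{gyr}[\ominus P, P]Q = F(\ominus P, P)\, Q\, (F(\ominus P, P))^{-1}, \tag{56}$$

with $F(\ominus P, P)$ defined as

$$F(\ominus P, P) = \mathrm{mexp}\left(-\left[\overline{\ominus P \oplus P}, \tilde{I}_{p,n}\right]\right) \mathrm{mexp}\left(\left[\overline{\ominus P}, \tilde{I}_{p,n}\right]\right) \mathrm{mexp}\left(\left[\overline{P}, \tilde{I}_{p,n}\right]\right), \tag{57}$$

where $\overline{(\cdot)} = \mathrm{Log}_{\tilde{I}_{p,n}}(\cdot)$. This equation can be further simplified as

$$\begin{aligned} F(\ominus P, P) &\overset{(1)}{=} \mathrm{mexp}(0)\, \mathrm{mexp}\left(\left[\overline{\ominus P}, \tilde{I}_{p,n}\right]\right) \mathrm{mexp}\left(\left[\overline{P}, \tilde{I}_{p,n}\right]\right) \\ &\overset{(2)}{=} \mathrm{mexp}\left(\left[\overline{\ominus P}, \tilde{I}_{p,n}\right]\right) \mathrm{mexp}\left(\left[\overline{P}, \tilde{I}_{p,n}\right]\right) \\ &\overset{(3)}{=} I_n. \end{aligned}$$

The above derivation follows from

(1) $\overline{\ominus P \oplus P} = \overline{\tilde{I}_{p,n}} = 0 \in \mathbb{R}^{n \times n}$.

(2) $\mathrm{mexp}(0) = I_n$.

(3) $\overline{\ominus P} = -\overline{P}$ and $\mathrm{mexp}\left(\left[-\overline{P}, \tilde{I}_{p,n}\right]\right) = \mathrm{mexp}\left(\left[\overline{P}, \tilde{I}_{p,n}\right]\right)^{-1}$.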
Therefore, $\mathrm{gyr}[\ominus P, P]$ is the identity map.

G.2 PROOF OF THM. 3.3

Proof of Thm. 3.3. We first prove a useful lemma.

Lemma G.1 (Left Gyrotranslation Law). Every pseudo-reductive gyrogroup $\{G, \oplus\}$ satisfies the left gyrotranslation law:

$$\ominus(P \oplus Q) \oplus (P \oplus R) = \mathrm{gyr}[P, Q](\ominus Q \oplus R), \quad \forall P, Q, R \in G. \tag{59}$$

Proof. This lemma generalizes Lems. I.1 and L.1 by Nguyen & Yang (2023), which prove the left gyrotranslation law on the specific gyrogroups of the SPD and Grassmannian manifolds. Their proofs only rely on the left cancellation and the basic axioms (G1-G3). Note that the original proof of the left gyrotranslation law on the Grassmannian (Nguyen & Yang, 2023, Lem. I.1) is questionable, as it relies on the left cancellation of gyrogroups, whereas the Grassmannian is not a gyrogroup but a non-reductive gyrogroup. Fortunately, as we show in Thm. D.1, general pseudo-reductive gyrogroups, including the Grassmannian, enjoy left cancellation. Therefore, the proof by Nguyen & Yang (2023, Lem. I.1) can be readily generalized to general pseudo-reductive gyrogroups.

From gyronorm invariance to gyrodistance invariance: For any $R, S \in G$, the gyroautomorphism can be expressed by the gyrator identity in Thm. D.1:

$$\mathrm{gyr}[P, Q]R = \ominus X \oplus R', \tag{60}$$

$$\mathrm{gyr}[P, Q]S = \ominus X \oplus S', \tag{61}$$

where $X = (P \oplus Q)$, $R' = P \oplus (Q \oplus R)$, and $S' = P \oplus (Q \oplus S)$. Then Eq. (30) in (Nguyen & Yang, 2023) for the specific Grassmannian can be directly extended to pseudo-reductive gyrogroups, as it only relies on the left gyrotranslation law, the invariance of the norm under gyroautomorphisms, and the axioms (G1-G3).

From gyrodistance invariance to gyronorm invariance:

$$\begin{aligned} \left\|\mathrm{gyr}[P, Q](R)\right\|_{\mathrm{gr}} &= \left\|\mathrm{gyr}[P, Q](\ominus E \oplus R)\right\|_{\mathrm{gr}} && \text{(7 in Thm. D.1 indicates } \ominus E = E\text{)} \\ &= \left\|\ominus\mathrm{gyr}[P, Q](E) \oplus \mathrm{gyr}[P, Q](R)\right\|_{\mathrm{gr}} && \text{(automorphism)} \\ &= d_{\mathrm{gry}}\left(\mathrm{gyr}[P, Q](E), \mathrm{gyr}[P, Q](R)\right) \\ &= d_{\mathrm{gry}}(E, R) \\ &= \left\|\ominus E \oplus R\right\|_{\mathrm{gr}} \\ &= \|R\|_{\mathrm{gr}}. \end{aligned}$$

G.3 PROOF OF THM. 3.5

Proof of Thm. 3.5. Given any $P, Q, R \in G$, we make the following proof.

Gyroisometry of the left gyrotranslation: This property generalizes Thms. 2.12 and 2.16 by Nguyen & Yang (2023), which deal with the gyrotranslations in the SPD and Grassmannian manifolds.
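We have the following:

$$\begin{aligned} d_{\mathrm{gry}}(L_P(Q), L_P(R)) &= d_{\mathrm{gry}}(P \oplus Q, P \oplus R) \\ &= \left\|\ominus(P \oplus Q) \oplus (P \oplus R)\right\|_{\mathrm{gr}} \\ &= \left\|\mathrm{gyr}[P, Q](\ominus Q \oplus R)\right\|_{\mathrm{gr}} && \text{(left gyrotranslation)} \\ &= \left\|\ominus Q \oplus R\right\|_{\mathrm{gr}} && \text{(gyroisometry of the automorphism)} \\ &= d_{\mathrm{gry}}(Q, R). \end{aligned}$$

Symmetry of the gyrodistance:

$$\begin{aligned} d_{\mathrm{gry}}(P, Q) &= \left\|\ominus P \oplus Q\right\|_{\mathrm{gr}} \\ &= \left\|\ominus(\ominus P \oplus Q)\right\|_{\mathrm{gr}} && \text{(Eq. (5) and Eq. (7))} \\ &= \left\|\mathrm{gyr}[\ominus P, Q](\ominus Q \oplus P)\right\|_{\mathrm{gr}} && \text{(gyrosum inversion law)} \\ &= \left\|\ominus Q \oplus P\right\|_{\mathrm{gr}} && \text{(gyroisometry of the automorphism)} \\ &= d_{\mathrm{gry}}(Q, P). \end{aligned}$$

Gyroisometry of the gyroinverse:

$$\begin{aligned} d_{\mathrm{gry}}(\ominus P, \ominus Q) &= \left\|P \ominus Q\right\|_{\mathrm{gr}} \\ &= \left\|\ominus Q \oplus P\right\|_{\mathrm{gr}} && \text{(gyrocommutativity and gyroisometry of the automorphism)} \\ &= d_{\mathrm{gry}}(Q, P) \\ &= d_{\mathrm{gry}}(P, Q) && \text{(symmetry of the gyrodistance)}. \end{aligned}$$

G.4 PROOF OF PROP. 3.6

Proof of Prop. 3.6. We first show the expressions of the gyrodistance on $\mathcal{M}$. Then we proceed to show the gyrodistance and gyroisometries on $\mathcal{M}_K$. We follow all the notations in Tab. 2 and denote the geodesic distance under a specific geometry as $d_{\mathrm{geo}}$.

Expressions of gyrodistance on $\mathcal{M}$: For $\{\mathcal{S}^n_{++}, \oplus_{\mathrm{AI}}\}$, we have the following:

$$\begin{aligned} d_{\mathrm{gry}}(P, Q) &= \left\|P^{-\frac{1}{2}} Q P^{-\frac{1}{2}}\right\|_{\mathrm{gr}} \\ &= \left\|\mathrm{mlog}\left(P^{-\frac{1}{2}} Q P^{-\frac{1}{2}}\right)\right\| && (\mathrm{Log}_I = \mathrm{mlog}) \\ &= d_{\mathrm{geo}}(P, Q). \end{aligned}$$

For $\{\mathcal{S}^n_{++}, \oplus_{\mathrm{LE}}\}$, we have the following:

$$\begin{aligned} d_{\mathrm{gry}}(P, Q) &= \left\|\mathrm{mexp}(-\mathrm{mlog}(P) + \mathrm{mlog}(Q))\right\|_{\mathrm{gr}} \\ &= \left\|\mathrm{mlog}^{-1}_{*,I}(-\mathrm{mlog}(P) + \mathrm{mlog}(Q))\right\|_I && (\mathrm{Log}_I(P) = \mathrm{mlog}^{-1}_{*,I}(\mathrm{mlog}(P))) \\ &= \left\|\mathrm{mlog}(P) - \mathrm{mlog}(Q)\right\| && \text{(the pullback of LEM (Chen et al., 2024e))} \\ &= d_{\mathrm{geo}}(P, Q), \end{aligned}$$

where $\mathrm{mlog}^{-1}_{*,I}$ is the inverse of the differential map of $\mathrm{mlog}$, and $\|\cdot\|_I$ is the norm induced by the LEM at $I$.

As LCM is also a pullback metric (Chen et al., 2024e), $\{\mathcal{S}^n_{++}, \oplus_{\mathrm{LC}}\}$ follows the same logic:

$$\begin{aligned} d_{\mathrm{gry}}(P, Q) &= \left\|\psi_{\mathrm{LC}}^{-1}(-\psi_{\mathrm{LC}}(P) + \psi_{\mathrm{LC}}(Q))\right\|_{\mathrm{gr}} \\ &= \left\|(\psi_{\mathrm{LC}})^{-1}_{*,I}(-\psi_{\mathrm{LC}}(P) + \psi_{\mathrm{LC}}(Q))\right\|_I && (\mathrm{Log}_I(P) = (\psi_{\mathrm{LC}})^{-1}_{*,I}\psi_{\mathrm{LC}}(P)) \\ &= \left\|\psi_{\mathrm{LC}}(P) - \psi_{\mathrm{LC}}(Q)\right\| && \text{(the pullback of LCM (Chen et al., 2024e))} \\ &= d_{\mathrm{geo}}(P, Q). \end{aligned}$$

For $\{\widetilde{\mathrm{Gr}}(p,n), \widetilde{\oplus}_{\mathrm{Gr}}\}$, we have the following:

$$\begin{aligned} d_{\mathrm{gry}}(P, Q) &= \left\|\mathrm{mexp}\left(-[\overline{P}, \tilde{I}_{p,n}]\right) Q\, \mathrm{mexp}\left([\overline{P}, \tilde{I}_{p,n}]\right)\right\|_{\mathrm{gr}} && (\overline{\ominus P} = -\overline{P}) \\ &= \left\|\tilde{P}^\top Q \tilde{P}\right\|_{\mathrm{gr}} && (\tilde{P} = \mathrm{mexp}([\overline{P}, \tilde{I}_{p,n}]) \in \mathrm{SO}(n)) \\ &= \frac{1}{\sqrt{2}} \left\|\mathrm{Log}_{\tilde{I}_{p,n}}\left(\tilde{P}^\top Q \tilde{P}\right)\right\| && (\|\cdot\|_{\tilde{I}_{p,n}} = \tfrac{1}{\sqrt{2}} \|\cdot\|) \\ &= \frac{1}{\sqrt{2}} \left\|[\widetilde{\Omega}, \tilde{I}_{p,n}]\right\|, \end{aligned} \tag{69}$$

where $[\widetilde{\Omega}, \tilde{I}_{p,n}] = \mathrm{Log}_{\tilde{I}_{p,n}}\left(\tilde{P}^\top Q \tilde{P}\right)$. For $\widetilde{\Omega}$, we have the following:

$$\begin{aligned} \widetilde{\Omega} &= \frac{1}{2} \mathrm{mlog}\left(\left(I_n - 2\tilde{P}^\top Q \tilde{P}\right)\left(I_n - 2\tilde{I}_{p,n}\right)\right) && \text{((Bendokat et al., 2024, Prop. 5.6))} \\ &= \frac{1}{2} \mathrm{mlog}\left(\tilde{P}^\top (I_n - 2Q) \tilde{P} \tilde{P}^\top \left(I_n - 2\tilde{P} \tilde{I}_{p,n} \tilde{P}^\top\right) \tilde{P}\right) \\ &= \frac{1}{2} \mathrm{mlog}\left(\tilde{P}^\top (I_n - 2Q) \left(I_n - 2\tilde{P} \tilde{I}_{p,n} \tilde{P}^\top\right) \tilde{P}\right) \\ &\overset{(1)}{=} \tilde{P}^\top\, \frac{1}{2} \mathrm{mlog}\left((I_n - 2Q)\left(I_n - 2\tilde{P} \tilde{I}_{p,n} \tilde{P}^\top\right)\right) \tilde{P} \\ &\overset{(2)}{=} \tilde{P}^\top\, \frac{1}{2} \mathrm{mlog}\left((I_n - 2Q)(I_n - 2P)\right) \tilde{P} \\ &\overset{(3)}{=} \tilde{P}^\top \Omega \tilde{P}. \end{aligned} \tag{70}$$

Eq. (70) follows from the following facts:

(1) When $\mathrm{mlog}(B)$ is well-defined and $A$ is non-singular, $A^{-1} \mathrm{mlog}(B) A = \mathrm{mlog}(A^{-1} B A)$ (Horn & Johnson, 2012).

(2) $P = \mathrm{mexp}([\overline{P}, \tilde{I}_{p,n}])\, \tilde{I}_{p,n}\, \mathrm{mexp}(-[\overline{P}, \tilde{I}_{p,n}])$ (Nguyen, 2022a, Eq. (36)).

(3) Let $\Omega = \frac{1}{2} \mathrm{mlog}\left((I_n - 2Q)(I_n - 2P)\right)$.

Combining Eqs. (69) and (70), we have

$$\begin{aligned} d_{\mathrm{gry}}(P, Q) &= \frac{1}{\sqrt{2}} \left\|[\widetilde{\Omega}, \tilde{I}_{p,n}]\right\| \\ &= \frac{1}{\sqrt{2}} \left\|[\tilde{P}^\top \Omega \tilde{P}, \tilde{I}_{p,n}]\right\| \\ &\overset{(1)}{=} \frac{1}{\sqrt{2}} \left\|\tilde{P}^\top [\Omega, \tilde{P} \tilde{I}_{p,n} \tilde{P}^\top] \tilde{P}\right\| \\ &\overset{(2)}{=} \frac{1}{\sqrt{2}} \left\|[\Omega, \tilde{P} \tilde{I}_{p,n} \tilde{P}^\top]\right\| \\ &\overset{(3)}{=} \frac{1}{\sqrt{2}} \left\|[\Omega, P]\right\| \\ &\overset{(4)}{=} \frac{1}{\sqrt{2}} \left\|\mathrm{Log}_P(Q)\right\| \\ &\overset{(5)}{=} \left\|\mathrm{Log}_P(Q)\right\|_P \\ &\overset{(6)}{=} d_{\mathrm{geo}}(P, Q). \end{aligned} \tag{71}$$

Eq. (71) follows from the following facts:

(1) $[\tilde{P}^\top \Omega \tilde{P}, \tilde{I}_{p,n}] = \tilde{P}^\top \Omega \tilde{P} \tilde{I}_{p,n} - \tilde{I}_{p,n} \tilde{P}^\top \Omega \tilde{P} = \tilde{P}^\top \left(\Omega \tilde{P} \tilde{I}_{p,n} \tilde{P}^\top - \tilde{P} \tilde{I}_{p,n} \tilde{P}^\top \Omega\right) \tilde{P} = \tilde{P}^\top [\Omega, \tilde{P} \tilde{I}_{p,n} \tilde{P}^\top] \tilde{P}$.

(2) The Euclidean norm is invariant under the action $A \mapsto OAO^\top$, $\forall A \in \mathbb{R}^{n \times n}$, $\forall O \in \mathrm{O}(n)$.

(3) $P = \mathrm{mexp}([\overline{P}, \tilde{I}_{p,n}])\, \tilde{I}_{p,n}\, \mathrm{mexp}(-[\overline{P}, \tilde{I}_{p,n}])$.

(4) $\mathrm{Log}_P(Q) = [\Omega, P]$.

(5) $\|\cdot\|_P = \frac{1}{\sqrt{2}} \|\cdot\|$, $\forall P \in \widetilde{\mathrm{Gr}}(p,n)$.

(6) $d_{\mathrm{geo}}(P, Q) = \|\mathrm{Log}_P(Q)\|_P$ for any $P, Q \in \widetilde{\mathrm{Gr}}(p,n)$ not in each other's cut locus.

For $\{\mathrm{Gr}(p,n), \oplus_{\mathrm{Gr}}\}$, we first make the following notations. We denote the geodesic distance, gyrodistance, Riemannian logarithm at $I_{p,n}$, and norm induced by the Riemannian metric at $I_{p,n}$ on $\mathrm{Gr}(p,n)$ as $d_{\mathrm{geo}}$, $d_{\mathrm{gry}}$, $\mathrm{Log}_{I_{p,n}}$, and $\|\cdot\|^{\mathrm{ONB}}_{I_{p,n}}$. The counterparts on $\widetilde{\mathrm{Gr}}(p,n)$ are $\tilde{d}_{\mathrm{geo}}$, $\tilde{d}_{\mathrm{gry}}$, $\widetilde{\mathrm{Log}}_{\tilde{I}_{p,n}}$, and $\|\cdot\|^{\mathrm{PP}}_{\tilde{I}_{p,n}}$. As shown by Nguyen et al. (2024, App. N), $\pi : \mathrm{Gr}(p,n) \to \widetilde{\mathrm{Gr}}(p,n)$ is a Riemannian isometry. Then, for any $U, V \in \mathrm{Gr}(p,n)$ with $\pi(U) = P$ and $\pi(V) = Q$, we have the following:

$$\begin{aligned} d_{\mathrm{gry}}(U, V) &= \left\|\mathrm{Log}_{I_{p,n}}\left(\ominus_{\mathrm{Gr}} U \oplus_{\mathrm{Gr}} V\right)\right\|^{\mathrm{ONB}}_{I_{p,n}} \\ &\overset{(1)}{=} \left\|\mathrm{Log}_{I_{p,n}}\left(\pi^{-1}\left(\widetilde{\ominus}_{\mathrm{Gr}} P\, \widetilde{\oplus}_{\mathrm{Gr}}\, Q\right)\right)\right\|^{\mathrm{ONB}}_{I_{p,n}} \\ &\overset{(2)}{=} \left\|(\pi^{-1})_{*, \tilde{I}_{p,n}}\left(\widetilde{\mathrm{Log}}_{\tilde{I}_{p,n}}\left(\widetilde{\ominus}_{\mathrm{Gr}} P\, \widetilde{\oplus}_{\mathrm{Gr}}\, Q\right)\right)\right\|^{\mathrm{ONB}}_{I_{p,n}} \\ &\overset{(3)}{=} \left\|\widetilde{\mathrm{Log}}_{\tilde{I}_{p,n}}\left(\widetilde{\ominus}_{\mathrm{Gr}} P\, \widetilde{\oplus}_{\mathrm{Gr}}\, Q\right)\right\|^{\mathrm{PP}}_{\tilde{I}_{p,n}} \\ &= \tilde{d}_{\mathrm{gry}}(P, Q) \\ &= \tilde{d}_{\mathrm{geo}}(P, Q) \\ &\overset{(4)}{=} d_{\mathrm{geo}}(U, V). \end{aligned}$$

The above comes from the following.

(1) Gyro additions under the Riemannian isometry (Nguyen & Yang, 2023, Lem. 2.1).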
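(2,3,4) Riemannian operators under the Riemannian isometry (Gallier & Quaintance, 2020).

Constant Curvature Spaces: When $\mathcal{M}_K = \mathbb{R}^n$ ($K = 0$), the gyro structures defined in Eq. (3)-Eq. (9) reduce to the vector structures, and the claim can be proved directly. In the following, we present the proof for $K \neq 0$. We first show the expression for the gyrodistance on $\mathcal{M}_K$, then the isometry of gyration, and finally the results on the gyroisometry of gyrotranslation and gyroinverse. In the following, $a, b, x, y$ are arbitrary points in $\mathcal{M}_K$.

Gyrodistance and geodesic distance: The Riemannian metric, logarithm, and geodesic distance on the CCS (Skopek et al., 2019) are
$$g^K_x = (\lambda^K_x)^2 g^E, \tag{74}$$
$$\mathrm{Log}^K_x(y) = \frac{2}{\sqrt{|K|}\,\lambda^K_x} \tan_K^{-1}\left(\sqrt{|K|}\, \|{-x} \oplus_K y\|\right) \frac{-x \oplus_K y}{\|{-x} \oplus_K y\|}, \tag{75}$$
$$d_{\mathrm{geo}}(x, y) = \frac{2}{\sqrt{|K|}} \tan_K^{-1}\left(\sqrt{|K|}\, \|{-x} \oplus_K y\|\right), \tag{76}$$
where $\lambda^K_x = \frac{2}{1 + K\|x\|^2}$, $g^E$ is the standard Euclidean inner product, and $\oplus_K$ is the gyro addition in Tab. 2. In particular, when $x = 0$, we have
$$g^K_0 = 2^2 g^E, \tag{77}$$
$$\mathrm{Log}^K_0(y) = \frac{1}{\sqrt{|K|}} \tan_K^{-1}\left(\sqrt{|K|}\, \|y\|\right) \frac{y}{\|y\|}. \tag{78}$$
For the gyrodistance, we have the following:
$$\begin{aligned}
d_{\mathrm{gry}}(x, y) &= \|{-x} \oplus_K y\|_{\mathrm{gr}}\\
&= \|\mathrm{Log}^K_0({-x} \oplus_K y)\|_0\\
&= 2\left\|\frac{1}{\sqrt{|K|}} \tan_K^{-1}\left(\sqrt{|K|}\, \|{-x} \oplus_K y\|\right) \frac{-x \oplus_K y}{\|{-x} \oplus_K y\|}\right\|\\
&= \frac{2}{\sqrt{|K|}} \tan_K^{-1}\left(\sqrt{|K|}\, \|{-x} \oplus_K y\|\right) && (\forall s > 0,\ \tan_K^{-1}(s) > 0)\\
&= d_{\mathrm{geo}}(x, y),
\end{aligned} \tag{79}$$
where $\|\cdot\|_0$ is the norm induced by $g^K_0$.

Norm invariance under gyration: As $\mathcal{M}_K$ forms a real inner product gyrovector space (Ungar, 2009, Def. 3.2), any gyration on the CCS preserves the norm induced by the standard inner product:
$$\|\mathrm{gyr}[a, b](x)\| = \|x\|, \quad \forall x. \tag{80}$$
For the gyronorm, we have the following:
$$\begin{aligned}
\|\mathrm{gyr}[a, b]x\|_{\mathrm{gr}} &= \|\mathrm{Log}^K_0(\mathrm{gyr}[a, b]x)\|_0\\
&= \frac{2}{\sqrt{|K|}} \tan_K^{-1}\left(\sqrt{|K|}\, \|\mathrm{gyr}[a, b]x\|\right)\\
&= \frac{2}{\sqrt{|K|}} \tan_K^{-1}\left(\sqrt{|K|}\, \|x\|\right)\\
&= \|x\|_{\mathrm{gr}}.
\end{aligned} \tag{81}$$

Isometry of left gyrotranslation and gyroinverse: Note that $\mathcal{M}_K$ forms a gyrocommutative gyrogroup. The results then follow from Thm. 3.5.

G.5 PROOF OF THM. 4.1

Proof of Thm. 4.1. According to Thm. 3.3 and Thm. 3.5, any left gyrotranslation is a gyroisometry. Therefore, for any $Q \in \mathcal{M}$, we have the following:
$$\begin{aligned}
d_{\mathrm{gry}}(B \oplus P_i, Q) &\overset{(1)}{=} d_{\mathrm{gry}}\left(\ominus B \oplus (B \oplus P_i), \ominus B \oplus Q\right)\\
&\overset{(2)}{=} d_{\mathrm{gry}}\left((\ominus B \oplus B) \oplus \mathrm{gyr}[\ominus B, B](P_i), \ominus B \oplus Q\right)\\
&\overset{(3)}{=} d_{\mathrm{gry}}(P_i, \ominus B \oplus Q).
\end{aligned} \tag{82}$$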
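The above comes from the following.
(1) Any left gyrotranslation is a gyroisometry.
(2) Left gyroassociative law.
(3) $\ominus B \oplus B = E$ and pseudo-reduction.

Denoting the gyromeans of $\{P_i\}$ and $\{B \oplus P_i\}$ as $M$ and $\widetilde{M}$, we have the following:
$$B \oplus M \overset{(1)}{=} B \oplus (\ominus B \oplus \widetilde{M}) \overset{(2)}{=} \mathrm{gyr}[B, \ominus B](\widetilde{M}) \overset{(3)}{=} \widetilde{M}. \tag{83}$$
The above comes from the following.
(1) Eq. (82) indicates that $M = \ominus B \oplus \widetilde{M}$.
(2) Left gyroassociative law.
(3) Pseudo-reduction.

Now, we proceed to the second property. We have the following:
$$\begin{aligned}
d_{\mathrm{gry}}(t \odot P_i, E) &\overset{(1)}{=} \|\ominus E \oplus (t \odot P_i)\|_{\mathrm{gr}}\\
&\overset{(2)}{=} \|t \odot P_i\|_{\mathrm{gr}}\\
&= \|t\, \mathrm{Log}_E P_i\|_E\\
&= |t|\, \|\mathrm{Log}_E P_i\|_E\\
&= |t|\, \|P_i\|_{\mathrm{gr}}\\
&\overset{(3)}{=} |t|\, \|\ominus E \oplus P_i\|_{\mathrm{gr}}\\
&= |t|\, d_{\mathrm{gry}}(E, P_i)\\
&\overset{(4)}{=} |t|\, d_{\mathrm{gry}}(P_i, E).
\end{aligned} \tag{84}$$
The above follows from the following.
(1) Symmetry of gyrodistance (Thm. 3.5).
(2) $t \odot P_i = \ominus E \oplus (t \odot P_i)$.
(3) $P_i = \ominus E \oplus P_i$.
(4) Symmetry of gyrodistance.
The last equation in Eq. (84) indicates the homogeneity of dispersion from $E$.

G.6 PROOF OF PROP. 5.1

Proof of Prop. 5.1. We only need to prove the equivalence of the gyrodistance and the geodesic distance. Note that every Lie group is automatically a gyrogroup with each gyration being the identity map. We denote $\{\mathcal{M}, \oplus, g\}$ as a Lie group with left-invariant metric $g$. For any $P$ and $Q$ in $\mathcal{M}$, we have the following:
$$\begin{aligned}
d_{\mathrm{gry}}(P, Q) &\overset{(1)}{=} \|\ominus P \oplus Q\|_{\mathrm{gr}}\\
&\overset{(2)}{=} \|\mathrm{Log}_E(\ominus P \oplus Q)\|_E\\
&= d_{\mathrm{geo}}(E, \ominus P \oplus Q)\\
&\overset{(3)}{=} d_{\mathrm{geo}}(P, P \oplus (\ominus P \oplus Q))\\
&\overset{(4)}{=} d_{\mathrm{geo}}(P, Q).
\end{aligned} \tag{85}$$
The above derivation comes from the following.
1. Definition of gyrodistance, Eq. (9).
2. Definition of gyronorm, Eq. (8).
3. Under a left-invariant metric, any left Lie group translation is a Riemannian isometry.
4. By the associativity of the group addition,
$$P \oplus (\ominus P \oplus Q) = (P \ominus P) \oplus Q = E \oplus Q = Q. \tag{86}$$
Therefore, the gyromean and gyrovariance are exactly the Fréchet mean and variance under the geodesic distance, while the running mean updates are also identical under the gyrodistance and the geodesic distance.

G.7 PROOF OF PROP. 6.1

We first review a fast and stable algorithm for the ONB Grassmannian logarithm (Bendokat et al., 2024, Alg. 5.3), and the calculation of the Grassmannian logarithm under the projector perspective via the ONB Grassmannian logarithm (Nguyen et al., 2024, Prop. 3.12).

Algorithm 2: Grassmann logarithm under the ONB perspective (Bendokat et al., 2024, Alg. 5.3)
Input: $U, Y \in \mathrm{Gr}(p, n)$, given as Stiefel representatives under the ONB perspective.
1: $Q S R^\top \overset{\mathrm{SVD}}{:=} Y^\top U$, with $S$ in ascending order, and $Q$ and $R$ column-wise flipped accordingly;
2: $\hat{S} = \sqrt{I_p - S^2}$;
3: $\Delta = (I_n - UU^\top)\, Y Q\, \frac{\arcsin(\hat{S})}{\hat{S}}\, R^\top$;
Output: $\mathrm{Log}_U(Y) = \Delta$.

Alg. 2 reviews a fast and stable algorithm for the Grassmannian Riemannian logarithm under the ONB perspective $\mathrm{Gr}(p, n)$. The vanilla Riemannian logarithm in Tab. 8 requires an $n \times p$ SVD and a $p \times p$ matrix inverse, while Alg. 2 only requires a $p \times p$ SVD. Therefore, Alg. 2 is more efficient than the vanilla logarithm. Besides, Alg. 2 can also return a unique tangent vector when $Y$ is in the cut locus of $U$. For more details, please refer to Bendokat et al. (2024, Sec. 5.2).

As the projector perspective is isometric to the ONB perspective, the Grassmannian logarithm under the projector perspective can be calculated by the ONB Grassmannian logarithm (Nguyen et al., 2024, Prop. 3.12).

Proposition G.2 ((Nguyen et al., 2024)). Given any $P, Q \in \widetilde{\mathrm{Gr}}(p, n)$ with $U = \pi^{-1}(P)$ and $V = \pi^{-1}(Q)$, the Riemannian logarithm $\widetilde{\mathrm{Log}}_P(Q)$ on $\widetilde{\mathrm{Gr}}(p, n)$ is given as
$$\widetilde{\mathrm{Log}}_P(Q) = \pi_{*,U}(\mathrm{Log}_U V), \tag{87}$$
where $\mathrm{Log}$ is the Riemannian logarithm under the ONB perspective, and $\pi_{*,U} : T_U \mathrm{Gr}(p, n) \to T_P \widetilde{\mathrm{Gr}}(p, n)$ is the differential map of $\pi$ at $U$, which is defined as
$$\pi_{*,U}(\Delta) = \Delta U^\top + U \Delta^\top, \quad \forall \Delta \in T_U \mathrm{Gr}(p, n). \tag{88}$$

Now, we begin to present the proof.

Proof of Prop. 6.1. We first show the expressions for $\mathrm{Log}_{I_{p,n}}$ and $\widetilde{\mathrm{Log}}_{\widetilde{I}_{p,n}}$. First note the following:
$$I_n - I_{p,n} I_{p,n}^\top = \begin{pmatrix} 0 & 0 \\ 0 & I_{n-p} \end{pmatrix}, \tag{89}$$
$$U^\top I_{p,n} = \left(U_1^\top, U_2^\top\right) \begin{pmatrix} I_p \\ 0 \end{pmatrix} = U_1^\top. \tag{90}$$
By the above two equations, the ONB Grassmannian logarithm at $I_{p,n}$ is
$$\mathrm{Log}_{I_{p,n}}(U) = \begin{pmatrix} 0 & 0 \\ 0 & I_{n-p} \end{pmatrix} U Q\, \frac{\arcsin(\hat{S})}{\hat{S}}\, R^\top \ \text{(Alg. 2)} = \begin{pmatrix} 0 \\ U_2 Q\, \frac{\arcsin(\hat{S})}{\hat{S}}\, R^\top \end{pmatrix} = \begin{pmatrix} 0 \\ \widetilde{U}_2 \end{pmatrix}, \tag{91}$$
where $Q S R^\top \overset{\mathrm{SVD}}{:=} U_1^\top$ with $S$ in ascending order, and $Q$ and $R$ column-wise flipped accordingly, $\hat{S} = \sqrt{I_p - S^2}$, and $\widetilde{U}_2 = U_2 Q\, \frac{\arcsin(\hat{S})}{\hat{S}}\, R^\top$.

For $\widetilde{\mathrm{Log}}_{\widetilde{I}_{p,n}}$, we have
$$\widetilde{\mathrm{Log}}_{\widetilde{I}_{p,n}}(UU^\top) \overset{(1)}{=} \pi_{*,I_{p,n}}\left(\mathrm{Log}_{I_{p,n}}(U)\right) \overset{(2)}{=} \pi_{*,I_{p,n}}\begin{pmatrix} 0 \\ \widetilde{U}_2 \end{pmatrix} \overset{(3)}{=} \begin{pmatrix} 0 & \widetilde{U}_2^\top \\ \widetilde{U}_2 & 0 \end{pmatrix}.$$
The above derivation comes from the following.
(1) Prop. G.2.
(2) Eq. (91).
(3) For any $\Delta = (\Delta_1^\top, \Delta_2^\top)^\top \in T_{I_{p,n}} \mathrm{Gr}(p, n)$, where $\Delta_1$ is $p \times p$, we have the following:
$$\pi_{*,I_{p,n}}(\Delta) = \begin{pmatrix} \Delta_1 \\ \Delta_2 \end{pmatrix} (I_p, 0) + \begin{pmatrix} I_p \\ 0 \end{pmatrix} \left(\Delta_1^\top, \Delta_2^\top\right) = \begin{pmatrix} \Delta_1 + \Delta_1^\top & \Delta_2^\top \\ \Delta_2 & 0 \end{pmatrix}.$$

Combining all the above results together, we have the following:
$$\begin{aligned}
\left[\overline{UU^\top}, \widetilde{I}_{p,n}\right] &= \left[\widetilde{\mathrm{Log}}_{\widetilde{I}_{p,n}}(UU^\top), \widetilde{I}_{p,n}\right]\\
&= \left[\begin{pmatrix} 0 & \widetilde{U}_2^\top \\ \widetilde{U}_2 & 0 \end{pmatrix}, \widetilde{I}_{p,n}\right]\\
&= \begin{pmatrix} 0 & \widetilde{U}_2^\top \\ \widetilde{U}_2 & 0 \end{pmatrix} \widetilde{I}_{p,n} - \widetilde{I}_{p,n} \begin{pmatrix} 0 & \widetilde{U}_2^\top \\ \widetilde{U}_2 & 0 \end{pmatrix}\\
&= \begin{pmatrix} 0 & -\widetilde{U}_2^\top \\ \widetilde{U}_2 & 0 \end{pmatrix}.
\end{aligned}$$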
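The derivation above can be checked numerically. The following NumPy sketch is not the authors' code: it implements Alg. 2 as read from this section, compares it with the closed form of $\mathrm{Log}_{I_{p,n}}(U)$, and evaluates the commutator with $\widetilde{I}_{p,n}$. All function names are hypothetical. The helper `grassmann_exp_onb` uses the standard SVD-based Grassmann exponential, which is not part of this section and is included only as a round-trip check. The square root in $\hat{S}$ and the $p \times p$ identity are assumptions made when reading the algorithm.

```python
import numpy as np


def _arcsin_ratio(s_hat):
    """arcsin(s) / s entrywise, with the limit value 1 where s is numerically zero."""
    ratio = np.ones_like(s_hat)
    nz = s_hat > 1e-12
    ratio[nz] = np.arcsin(s_hat[nz]) / s_hat[nz]
    return ratio


def grassmann_log_onb(U, Y):
    """Sketch of Alg. 2: Log_U(Y) for n x p orthonormal bases U and Y."""
    n = U.shape[0]
    # p x p SVD: Y^T U = Q S R^T. The ordering of S does not change the
    # product below, so NumPy's descending order is kept.
    Q, S, Rt = np.linalg.svd(Y.T @ U)
    S_hat = np.sqrt(np.clip(1.0 - S**2, 0.0, None))  # assumed: sqrt(I_p - S^2)
    return (np.eye(n) - U @ U.T) @ Y @ Q @ np.diag(_arcsin_ratio(S_hat)) @ Rt


def grassmann_exp_onb(U, Delta):
    """Standard ONB Grassmann exponential (assumption, used only as a check)."""
    W, sig, Vt = np.linalg.svd(Delta, full_matrices=False)
    return U @ Vt.T @ np.diag(np.cos(sig)) @ Vt + W @ np.diag(np.sin(sig)) @ Vt


def log_at_identity(U, p):
    """Closed form at I_{p,n}: returns (Log_{I_{p,n}}(U), U2_tilde)."""
    U1, U2 = U[:p], U[p:]
    Q, S, Rt = np.linalg.svd(U1.T)  # Q S R^T = U_1^T
    S_hat = np.sqrt(np.clip(1.0 - S**2, 0.0, None))
    U2_tilde = U2 @ Q @ np.diag(_arcsin_ratio(S_hat)) @ Rt
    return np.vstack([np.zeros((p, p)), U2_tilde]), U2_tilde


def commutator_with_identity(U, p):
    """[pi_*(Log_{I_{p,n}}(U)), I_tilde], with pi_* as in Eq. (88)."""
    n = U.shape[0]
    I_pn = np.eye(n)[:, :p]
    I_tilde = I_pn @ I_pn.T
    Delta, _ = log_at_identity(U, p)
    P_bar = Delta @ I_pn.T + I_pn @ Delta.T
    return P_bar @ I_tilde - I_tilde @ P_bar


# Small random example: an orthonormal basis of a 2-plane in R^5.
rng = np.random.default_rng(0)
n, p = 5, 2
U, _ = np.linalg.qr(rng.standard_normal((n, p)))
I_pn = np.eye(n)[:, :p]

Delta_alg = grassmann_log_onb(I_pn, U)
Delta_closed, U2_tilde = log_at_identity(U, p)
print("Alg. 2 vs closed form:", np.abs(Delta_alg - Delta_closed).max())
```

The round trip through `grassmann_exp_onb` is compared at the level of projectors $YY^\top$ rather than bases, since Alg. 2 determines the subspace but not a particular orthonormal representative.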