MATRIX MANIFOLD NEURAL NETWORKS++

Xuan Son Nguyen, Shuo Yang, Aymeric Histace
ETIS, UMR 8051, CY Cergy Paris University, ENSEA, CNRS, France
{xuan-son.nguyen,shuo.yang,aymeric.histace}@ensea.fr

ABSTRACT

Deep neural networks (DNNs) on Riemannian manifolds have garnered increasing interest in various applied areas. For instance, DNNs on spherical and hyperbolic manifolds have been designed to solve a wide range of computer vision and natural language processing tasks. One of the key factors contributing to the success of these networks is that spherical and hyperbolic manifolds possess the rich algebraic structures of gyrogroups and gyrovector spaces. This enables principled and effective generalizations of the most successful DNNs to these manifolds. Recently, some works have shown that many concepts in the theory of gyrogroups and gyrovector spaces can also be generalized to matrix manifolds such as Symmetric Positive Definite (SPD) and Grassmann manifolds. As a result, some building blocks for SPD and Grassmann neural networks, e.g., isometric models and multinomial logistic regression (MLR), can be derived in a way that is fully analogous to their spherical and hyperbolic counterparts. Building upon these works, we design fully-connected (FC) and convolutional layers for SPD neural networks. We also develop MLR on Symmetric Positive Semi-definite (SPSD) manifolds, and propose a method for performing backpropagation with the Grassmann logarithmic map in the projector perspective. We demonstrate the effectiveness of the proposed approach in the human action recognition and node classification tasks.

1 INTRODUCTION

In recent years, deep neural networks on Riemannian manifolds have achieved impressive performance in many applications (Ganea et al., 2018; Skopek et al., 2020; Cruceru et al., 2021; Shimizu et al., 2021). The most popular neural networks in this family operate on hyperbolic spaces. Such spaces of constant sectional curvature, like spherical spaces, have the rich algebraic structure of gyrovector spaces. The theory of gyrovector spaces (Ungar, 2002; 2005; 2014) offers an elegant and powerful framework from which natural generalizations (Ganea et al., 2018; Shimizu et al., 2021) of essential building blocks in DNNs are constructed for hyperbolic neural networks (HNNs). Matrix manifolds such as SPD and Grassmann manifolds offer a convenient trade-off between structural richness and computational tractability (Cruceru et al., 2021; López et al., 2021). Therefore, in many applications, neural networks on matrix manifolds are attractive alternatives to their hyperbolic counterparts. However, unlike the approaches in Ganea et al. (2018); Shimizu et al. (2021), most existing approaches for building SPD and Grassmann neural networks (Dong et al., 2017; Huang & Gool, 2017; Huang et al., 2018; Nguyen et al., 2019; Brooks et al., 2019; Nguyen, 2021; Wang et al., 2021) do not provide the techniques and mathematical tools needed to generalize a broad class of DNNs to the considered manifolds. Recently, the authors of Kim (2020); Nguyen (2022b) have shown that SPD and Grassmann manifolds have the structure of gyrovector spaces or that of nonreductive gyrovector spaces (Nguyen, 2022b), which share remarkable analogies with gyrovector spaces.
The work in Nguyen & Yang (2023) takes one step forward in that direction by generalizing several notions in gyrovector spaces, e.g., the inner product and gyrodistance (Ungar, 2014), to SPD and Grassmann manifolds. This allows one to characterize certain gyroisometries of these manifolds and to construct MLR on SPD manifolds. Although some useful notions in gyrovector spaces have been generalized to SPD and Grassmann manifolds (Nguyen, 2022a;b; Nguyen & Yang, 2023), setting the stage for an effective way of building neural networks on these manifolds, many questions remain open. In this paper, we aim at addressing some limitations of existing works using a gyrovector space approach. Our contributions can be summarized as follows:

1. We generalize FC and convolutional layers to the SPD manifold setting.
2. We propose a method for performing backpropagation with the Grassmann logarithmic map in the projector perspective (Bendokat et al., 2020) without resorting to any approximation schemes. We then show how to construct graph convolutional networks (GCNs) on Grassmann manifolds.
3. We develop MLR on SPSD manifolds.
4. We showcase our approach in the human action recognition and node classification tasks.

2 PRELIMINARIES

2.1 SPD MANIFOLDS

The space of $n \times n$ SPD matrices, when provided with some geometric structures like a Riemannian metric, forms the SPD manifold $\mathrm{Sym}^+_n$ (Arsigny et al., 2005). Data lying on SPD manifolds are commonly encountered in various domains (Huang & Gool, 2017; Brooks et al., 2019; Nguyen, 2021; Sukthanker et al., 2021; Nguyen, 2022b;a; Nguyen & Yang, 2023). In many applications, the use of Euclidean calculus on SPD manifolds often leads to unsatisfactory results (Arsigny et al., 2005). To tackle this issue, many Riemannian structures for SPD manifolds have been introduced. In this work, we focus on two widely used Riemannian metrics, i.e., the Affine-Invariant (AI) (Pennec et al., 2004) and Log-Euclidean (LE) (Arsigny et al., 2005) metrics, and on the recently introduced Log-Cholesky (LC) metric (Lin, 2019), which offers some advantages over the Affine-Invariant and Log-Euclidean metrics.

2.2 GRASSMANN MANIFOLDS

Grassmann manifolds $\mathrm{Gr}_{n,p}$ are the collection of linear subspaces of fixed dimension $p$ of the Euclidean space $\mathbb{R}^n$ (Edelman et al., 1998). Data lying on Grassmann manifolds arise naturally in many applications (Absil et al., 2007; Bendokat et al., 2020). Points on Grassmann manifolds can be represented from different perspectives (Bendokat et al., 2020). Two typical approaches use projection matrices or matrices with orthonormal columns. Each of them can be effective in some problems but might be inappropriate in other contexts (Nguyen, 2022b). Although geometrical descriptions of Grassmann manifolds have been given in numerous works (Edelman et al., 1998), some computational issues remain to be addressed. For instance, the question of how to effectively perform backpropagation with the Grassmann logarithmic map in the projector perspective remains open.

2.3 NEURAL NETWORKS ON SPD AND GRASSMANN MANIFOLDS

2.3.1 NEURAL NETWORKS ON SPD MANIFOLDS

The work in Huang & Gool (2017) introduces SPDNet with three novel layers, i.e., BiMap, LogEig, and ReEig layers, and it has become one of the most successful architectures in the field. In Brooks et al. (2019), the authors further improve SPDNet by developing Riemannian versions of batch normalization layers.
Following these works, some works (Nguyen et al., 2019; Nguyen, 2021; Wang et al., 2021; Kobler et al., 2022; Ju & Guan, 2023) design variants of BiMap and batch normalization layers in SPD neural networks. The work in Chakraborty et al. (2020) presents a different approach based on intrinsic operations on SPD manifolds. Their proposed layers have nice theoretical properties. A common limitation of the above works is that they do not provide the mathematical tools needed for constructing many essential building blocks of DNNs on SPD manifolds. Recently, some works (Nguyen, 2022a;b; Nguyen & Yang, 2023) take a gyrovector space approach that enables natural generalizations of some building blocks of DNNs, e.g., MLR for SPD neural networks.

2.3.2 NEURAL NETWORKS ON GRASSMANN MANIFOLDS

In Huang et al. (2018), the authors propose GrNet, which exploits the same matrix backpropagation rules (Ionescu et al., 2015) as SPDNet. Some existing works (Wang & Wu, 2020; Souza et al., 2020) are also inspired by GrNet. Like their SPD counterparts, most existing Grassmann neural networks are not built upon a mathematical framework that allows one to generalize a broad class of DNNs to Grassmann manifolds. Using a gyrovector space approach, Nguyen & Yang (2023) has shown that some concepts in Euclidean spaces can be naturally extended to Grassmann manifolds.

3 PROPOSED APPROACH

3.1 NOTATION

Let $\mathcal{M}$ be a homogeneous Riemannian manifold and $T_P\mathcal{M}$ be the tangent space of $\mathcal{M}$ at $P \in \mathcal{M}$. Denote by $\exp(P)$ and $\log(P)$ the usual matrix exponential and logarithm of $P$, by $\operatorname{Exp}_P(W)$ the exponential map at $P$ that associates to a tangent vector $W \in T_P\mathcal{M}$ a point of $\mathcal{M}$, by $\operatorname{Log}_P(Q)$ the logarithmic map of $Q \in \mathcal{M}$ at $P$, and by $\mathcal{T}_{P \to Q}(W)$ the parallel transport of $W$ from $P$ to $Q$ along geodesics connecting $P$ and $Q$. For simplicity of exposition, we will concentrate on real matrices. Denote by $\mathcal{M}_{n,m}$ the space of $n \times m$ matrices, $\mathrm{Sym}^+_n$ the space of $n \times n$ SPD matrices, $\mathrm{Sym}_n$ the space of $n \times n$ symmetric matrices, $S^+_{n,p}$ the space of $n \times n$ SPSD matrices of rank $p \le n$, and $\mathrm{Gr}_{n,p}$ the manifold of $p$-dimensional subspaces of $\mathbb{R}^n$ in the projector perspective. For clarity of presentation, let $\widetilde{\mathrm{Gr}}_{n,p}$ be the manifold of $p$-dimensional subspaces of $\mathbb{R}^n$ in the ONB (orthonormal basis) perspective (Bendokat et al., 2020). For notations related to SPD manifolds, we use the letter $g \in \{ai, le, lc\}$ as a subscript (superscript) to indicate the considered Riemannian metric, unless otherwise stated. Other notations will be introduced in appropriate paragraphs. Our notations are summarized in Appendix A.

3.2 NEURAL NETWORKS ON SPD MANIFOLDS

In Nguyen (2022a;b), the author has shown that SPD manifolds with Affine-Invariant, Log-Euclidean, and Log-Cholesky metrics form gyrovector spaces referred to as AI, LE, and LC gyrovector spaces, respectively. We adopt the notations in these works and consider the case where $r = 1$ (see Nguyen (2022b), Definition 3.1). Let $\oplus_{ai}$, $\oplus_{le}$, and $\oplus_{lc}$ be the binary operations in AI, LE, and LC gyrovector spaces, respectively. Let $\ominus_{ai}$, $\ominus_{le}$, and $\ominus_{lc}$ be the inverse operations in AI, LE, and LC gyrovector spaces, respectively. These operations are given in Appendix G.

3.2.1 FC LAYERS IN SPD NEURAL NETWORKS

Our method for generalizing FC layers to the SPD manifold setting relies on a reformulation of SPD hypergyroplanes (Nguyen & Yang, 2023). We first recap the definition of SPD hypergyroplanes.

Definition 3.1 (SPD Hypergyroplanes (Nguyen & Yang, 2023)).
For $P \in \mathrm{Sym}^{+,g}_n$ and $W \in T_P\mathrm{Sym}^{+,g}_n$, SPD hypergyroplanes are defined as
$$\mathcal{H}^{spd,g}_{W,P} = \{Q \in \mathrm{Sym}^{+,g}_n : \langle \operatorname{Log}^g_P(Q), W \rangle^g_P = 0\},$$
where $\langle \cdot, \cdot \rangle^g_P$ denotes the inner product at $P$ given by the considered Riemannian metric.

Proposition 3.2 gives an equivalent definition for SPD hypergyroplanes.

Proposition 3.2. Let $P \in \mathrm{Sym}^{+,g}_n$, $W \in T_P\mathrm{Sym}^{+,g}_n$, and $\mathcal{H}^{spd,g}_{W,P}$ be the SPD hypergyroplanes defined in Definition 3.1. Then
$$\mathcal{H}^{spd,g}_{W,P} = \{Q \in \mathrm{Sym}^{+,g}_n : \langle \ominus_g P \oplus_g Q, \operatorname{Exp}^g_{I_n}(\mathcal{T}^g_{P \to I_n}(W)) \rangle^g = 0\},$$
where $I_n$ denotes the $n \times n$ identity matrix, and $\langle \cdot, \cdot \rangle^g$ is the SPD inner product in $\mathrm{Sym}^{+,g}_n$ (Nguyen & Yang, 2023) (see Appendix G.7 for the definition of the SPD inner product).

Proof. See Appendix I.

In DNNs, an FC layer linearly transforms the input in such a way that the $k$-th dimension of the output corresponds to the signed distance from the output to the hyperplane that contains the origin and is orthogonal to the $k$-th axis of the output space. This interpretation has proven useful in generalizing FC layers to the hyperbolic setting (Shimizu et al., 2021). Notice that the equation of SPD hypergyroplanes in Proposition 3.2 has the form $\langle \ominus_g P \oplus_g Q, W \rangle^g = 0$, where $W \in \mathrm{Sym}^{+,g}_n$. This equation can be seen as a generalization of the hyperplane equation $\langle w, x \rangle + b = \langle -p + x, w \rangle = 0$, where $w, x, p \in \mathbb{R}^n$, $b \in \mathbb{R}$, and $\langle p, w \rangle = -b$. Therefore, Proposition 3.2 suggests that any linear function of an SPD matrix $X \in \mathrm{Sym}^{+,g}_n$ can be written as $\langle \ominus_g P \oplus_g X, W \rangle^g$, where $P, W \in \mathrm{Sym}^{+,g}_n$. The above interpretation of FC layers can now be applied to our case for constructing FC layers in SPD neural networks. For convenience of presentation, in Definition 3.3, we will index the dimensions (axes) of the output space using two subscripts corresponding to the row and column indices in a matrix.

Definition 3.3. Let $E^g_{(i,j)}$, $i \le j$, $i,j = 1,\ldots,m$ be the $(i,j)$-th axis of the output space. An SPD hypergyroplane that contains the origin and is orthogonal to the $E^g_{(i,j)}$ axis can be defined as
$$\mathcal{H}^{spd,g}_{\operatorname{Log}^g_{I_m}(E^g_{(i,j)}),\, I_m} = \{Q \in \mathrm{Sym}^{+,g}_m : \langle Q, E^g_{(i,j)} \rangle^g = 0\}.$$

It remains to specify an orthonormal basis for each family of the considered Riemannian metrics of SPD manifolds. Proposition 3.4 gives such an orthonormal basis for AI gyrovector spaces along with the expression for the output of FC layers with Affine-Invariant metrics.

Proposition 3.4 (FC layers with Affine-Invariant Metrics). Let $(e_1,\ldots,e_m)$, $\|e_i\| = 1$, $i = 1,\ldots,m$, be an orthonormal basis of $\mathbb{R}^m$. Let $\langle \cdot,\cdot \rangle^{ai}_P$ be the Affine-Invariant metric computed at $P \in \mathrm{Sym}^{+,ai}_m$ as
$$\langle V, W \rangle^{ai}_P = \operatorname{Tr}(V P^{-1} W P^{-1}) + \beta \operatorname{Tr}(V P^{-1}) \operatorname{Tr}(W P^{-1}),$$
where $\beta > -\frac{1}{m}$. An orthonormal basis $E^{ai}_{(i,j)}$, $i \le j$, $i,j = 1,\ldots,m$ of $\mathrm{Sym}^{+,ai}_m$ can be given as
$$E^{ai}_{(i,j)} = \begin{cases} \exp\!\left(e_i e_j^T - \frac{1}{m}\Big(1 - \frac{1}{\sqrt{1+m\beta}}\Big) I_m\right), & \text{if } i = j, \\ \exp\!\left(\frac{e_i e_j^T + e_j e_i^T}{\sqrt{2}}\right), & \text{if } i < j. \end{cases}$$
Denote by $v_{(i,j)}(X) = \langle \ominus_{ai} P_{(i,j)} \oplus_{ai} X, W_{(i,j)} \rangle^{ai}$, where $P_{(i,j)}, W_{(i,j)} \in \mathrm{Sym}^{+,ai}_n$, $i \le j$, $i,j = 1,\ldots,m$. Let $\alpha = \frac{1}{m}\Big(\frac{1}{\sqrt{1 + m\beta}} - 1\Big)$. Then the output of an FC layer is computed as $Y = \exp\big([y_{(i,j)}]_{i,j=1}^m\big)$, where $[y_{(i,j)}]_{i,j=1}^m$ is the matrix having $y_{(i,j)}$ as the element at the $i$-th row and $j$-th column, and $y_{(i,j)}$ is given by
$$y_{(i,j)} = \begin{cases} v_{(i,j)}(X) + \alpha \sum_{k=1}^{m} v_{(k,k)}(X), & \text{if } i = j, \\ \frac{1}{\sqrt{2}}\, v_{(i,j)}(X), & \text{if } i < j, \\ \frac{1}{\sqrt{2}}\, v_{(j,i)}(X), & \text{if } i > j. \end{cases}$$

Proof. See Appendix J.
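To make the construction concrete, the following is a minimal PyTorch sketch of an FC layer with Affine-Invariant metrics following Proposition 3.4. It assumes the AI gyro-operations of Appendix G.2 (so that $\ominus_{ai} P \oplus_{ai} X = P^{-1/2} X P^{-1/2}$) and the SPD inner product of Appendix G.7; the helper names, the loop-based implementation, and the toy dimensions are ours and are only meant to illustrate the computation, not to reproduce the authors' implementation.

```python
import torch

def sym_apply(f, S):
    """Apply a scalar function to the eigenvalues of a symmetric matrix S."""
    vals, vecs = torch.linalg.eigh(S)
    return vecs @ torch.diag(f(vals)) @ vecs.T

def spd_log(S):        # matrix logarithm of an SPD matrix
    return sym_apply(torch.log, S)

def spd_exp(S):        # matrix exponential of a symmetric matrix (result is SPD)
    return sym_apply(torch.exp, S)

def spd_inv_sqrt(S):   # P^{-1/2} of an SPD matrix
    return sym_apply(lambda v: v.rsqrt(), S)

def ai_inner(A, B, beta):
    """SPD inner product <A, B>^ai = <log A, log B>_{I} under the AI metric at I."""
    la, lb = spd_log(A), spd_log(B)
    return torch.trace(la @ lb) + beta * torch.trace(la) * torch.trace(lb)

def ai_fc_layer(X, P, W, m, beta):
    """Sketch of an FC layer with Affine-Invariant metrics (Proposition 3.4).

    X: input SPD matrix (n x n).
    P, W: dicts indexed by (i, j), i <= j, holding SPD parameters P_(i,j), W_(i,j) (n x n).
    Returns an m x m SPD matrix Y.
    """
    alpha = ((1.0 + m * beta) ** -0.5 - 1.0) / m
    v = {}
    for (i, j), Pij in P.items():
        Pis = spd_inv_sqrt(Pij)
        A = Pis @ X @ Pis                       # (ominus_ai P) oplus_ai X = P^{-1/2} X P^{-1/2}
        v[(i, j)] = ai_inner(A, W[(i, j)], beta)
    diag_sum = sum(v[(k, k)] for k in range(m))
    y = torch.zeros(m, m, dtype=X.dtype)
    for i in range(m):
        for j in range(m):
            if i == j:
                y[i, j] = v[(i, i)] + alpha * diag_sum
            elif i < j:
                y[i, j] = v[(i, j)] / 2 ** 0.5
            else:
                y[i, j] = v[(j, i)] / 2 ** 0.5
    return spd_exp(y)                           # map the symmetric coefficient matrix to Sym^+_m

# toy usage with random SPD input and parameters
def rand_spd(n):
    A = torch.randn(n, n)
    return A @ A.T + n * torch.eye(n)

n, m, beta = 5, 3, 0.0
X = rand_spd(n)
P = {(i, j): rand_spd(n) for i in range(m) for j in range(i, m)}
W = {(i, j): rand_spd(n) for i in range(m) for j in range(i, m)}
Y = ai_fc_layer(X, P, W, m, beta)
print(Y.shape, torch.linalg.eigvalsh(Y).min() > 0)   # m x m, positive definite
```

Note that the output dimension $m$ is independent of the input dimension $n$, which is what allows the same construction to act as a dimensionality-reducing convolution in Section 3.2.2.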
As shown in Arsigny et al. (2005), a Log-Euclidean metric on $\mathrm{Sym}^{+,le}_n$ can be obtained from any inner product on $\mathrm{Sym}_n$. In this work, we consider a metric that is invariant under all similarity transformations, i.e., the metric $\langle W, V \rangle^{le}_{I_n} = \operatorname{Tr}(WV)$. We have the following result.

Proposition 3.5 (FC layers with Log-Euclidean Metrics). An orthonormal basis $E^{le}_{(i,j)}$, $i \le j$, $i,j = 1,\ldots,m$ of $\mathrm{Sym}^{+,le}_m$ can be given by
$$E^{le}_{(i,j)} = \begin{cases} \exp\big(e_i e_j^T\big), & \text{if } i = j, \\ \exp\!\left(\frac{e_i e_j^T + e_j e_i^T}{\sqrt{2}}\right), & \text{if } i < j. \end{cases}$$
Let $v_{(i,j)}(X) = \langle \ominus_{le} P_{(i,j)} \oplus_{le} X, W_{(i,j)} \rangle^{le}$, where $P_{(i,j)}, W_{(i,j)} \in \mathrm{Sym}^{+,le}_n$, $i \le j$, $i,j = 1,\ldots,m$. Then the output of an FC layer is computed as $Y = \exp\big([y_{(i,j)}]_{i,j=1}^m\big)$, where $y_{(i,j)}$ is given by
$$y_{(i,j)} = \begin{cases} v_{(i,j)}(X), & \text{if } i = j, \\ \frac{1}{\sqrt{2}}\, v_{(i,j)}(X), & \text{if } i < j, \\ \frac{1}{\sqrt{2}}\, v_{(j,i)}(X), & \text{if } i > j. \end{cases}$$

Proof. See Appendix K.

Finally, we give the characterization of an orthonormal basis for LC gyrovector spaces and the expression for the output of FC layers with Log-Cholesky metrics.

Proposition 3.6 (FC layers with Log-Cholesky Metrics). An orthonormal basis $E^{lc}_{(i,j)}$, $i \le j$, $i,j = 1,\ldots,m$ of $\mathrm{Sym}^{+,lc}_m$ can be given by
$$E^{lc}_{(i,j)} = \begin{cases} (e - 1)\, e_i e_j^T + I_m, & \text{if } i = j, \\ \big(e_j e_i^T + I_m\big)\big(e_i e_j^T + I_m\big), & \text{if } i < j. \end{cases}$$
Let $v_{(i,j)}(X) = \langle \ominus_{lc} P_{(i,j)} \oplus_{lc} X, W_{(i,j)} \rangle^{lc}$, where $P_{(i,j)}, W_{(i,j)} \in \mathrm{Sym}^{+,lc}_n$, $i \le j$, $i,j = 1,\ldots,m$. Then the output of an FC layer is computed as $Y = \widetilde{Y}\widetilde{Y}^T$, where $\widetilde{Y} = [y_{(i,j)}]_{i,j=1}^m$, and $y_{(i,j)}$ is given by
$$y_{(i,j)} = \begin{cases} \exp\big(v_{(i,j)}(X)\big), & \text{if } i = j, \\ v_{(i,j)}(X), & \text{if } i < j, \\ 0, & \text{if } i > j. \end{cases}$$

Proof. See Appendix L.

3.2.2 CONVOLUTIONAL LAYERS IN SPD NEURAL NETWORKS

Consider applying a 2D convolutional layer to a multi-channel image. Let $N_{in}$ and $N_{out}$ be the numbers of input and output channels, respectively. Denote by $y^k_{(i,j)}$, $i = 1,\ldots,N_{row}$, $j = 1,\ldots,N_{col}$, $k = 1,\ldots,N_{out}$ the value of the $k$-th output channel at pixel $(i,j)$. Then
$$y^k_{(i,j)} = \sum_{l=1}^{N_{in}} \langle w^{(l,k)}, x^l_{(i,j)} \rangle + b^k, \tag{1}$$
where $x^l_{(i,j)}$ is a receptive field of the $l$-th input channel, $w^{(l,k)}$ is the filter associated with the $l$-th input channel and the $k$-th output channel, and $b^k$ is the bias for the $k$-th output channel. Let $X_{(i,j)} = \operatorname{concat}(x^1_{(i,j)},\ldots,x^{N_{in}}_{(i,j)})$ and $W^k = \operatorname{concat}(w^{(1,k)},\ldots,w^{(N_{in},k)})$, where the operation $\operatorname{concat}(\cdot)$ concatenates all of its arguments. Then Eq. (1) can be rewritten (Shimizu et al., 2021) as
$$y^k_{(i,j)} = \langle W^k, X_{(i,j)} \rangle + b^k. \tag{2}$$
Note that Eq. (2) has the form $\langle w, x \rangle + b$ and thus the computations discussed in Section 3.2.1 can be applied to implement convolutional layers in SPD neural networks. Specifically, given a set of SPD matrices $P_i \in \mathrm{Sym}^{+,g}_n$, $i = 1,\ldots,N$, the operation $\operatorname{concat}_{spd}(P_1,\ldots,P_N)$ produces a block diagonal matrix having $P_i$ as diagonal blocks. In Chakraborty et al. (2020), the authors design a convolution operation for SPD neural networks. However, their method is based on the concept of weighted Fréchet mean, while ours is built upon the concepts of SPD hypergyroplane and SPD pseudo-gyrodistance from an SPD matrix to an SPD hypergyroplane (Nguyen & Yang, 2023). Also, our convolution operation can be used for dimensionality reduction, while theirs always produces an output of the same dimension as the inputs.

3.3 MLR IN STRUCTURE SPACES

Motivated by the works in Nguyen (2022a); Nguyen & Yang (2023), in this section, we aim to build MLR on SPSD manifolds. For any $P \in S^+_{n,p}$, we consider the decomposition $P = U_P S_P U_P^T$, where $U_P \in \widetilde{\mathrm{Gr}}_{n,p}$ and $S_P \in \mathrm{Sym}^+_p$. Each element of $S^+_{n,p}$ can be seen as a flat $p$-dimensional ellipsoid in $\mathbb{R}^n$ (Bonnabel et al., 2013). The flat ellipsoid belongs to a $p$-dimensional subspace spanned by the columns of $U_P$, while the $p \times p$ SPD matrix $S_P$ defines the shape of the ellipsoid in $\mathrm{Sym}^+_p$. A canonical representation of $P$ in structure space $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^+_p$ is computed by identifying a common subspace and then rotating $U_P$ to this subspace. The SPD matrix $S_P$ is rotated accordingly to reflect the changes of $U_P$. Details of these computations are given in Appendix H.
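As an illustration, a rank-$p$ SPSD matrix can be factorized in this form with a truncated SVD, as in lines 2 and 3 of Algorithm 1 (Appendix B). The sketch below is ours; it does not include the canonicalization to a common subspace described in Appendix H, and the helper name spsd_decompose is hypothetical.

```python
import torch

def spsd_decompose(P, p):
    """Factor a rank-p SPSD matrix P (n x n) as P = U S U^T, with U in the ONB
    Grassmannian Gr~_{n,p} and S in Sym^+_p (truncated SVD, as in Algorithm 1)."""
    U_full, _, _ = torch.linalg.svd(P)   # P symmetric PSD: left singular vectors = eigenvectors
    U = U_full[:, :p]                    # orthonormal basis of the dominant p-dim subspace
    S = U.T @ P @ U                      # p x p SPD block expressed in that basis
    return U, S

# toy usage: build a rank-p SPSD matrix and check the reconstruction
n, p = 6, 3
A = torch.randn(n, p)
P = A @ A.T                                                # rank-p SPSD
U, S = spsd_decompose(P, p)
print(torch.allclose(U @ S @ U.T, P, atol=1e-4))           # True: P = U S U^T
print(torch.allclose(U.T @ U, torch.eye(p), atol=1e-5))    # U has orthonormal columns
```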
Assuming that a canonical representation in structure space $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^+_p$ is obtained for each point in $S^+_{n,p}$, we now discuss how to build MLR in this space. As one of the first steps for developing network building blocks in a gyrovector space approach is to construct some basic operations in the considered manifold, we give the definitions of the binary and inverse operations in the following.

Definition 3.7 (The Binary Operation in Structure Spaces). Let $(U_P, S_P), (U_Q, S_Q) \in \widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$. Then the binary operation $\oplus_{psd,g}$ in structure space $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$ is defined as
$$(U_P, S_P) \oplus_{psd,g} (U_Q, S_Q) = \big(U_P \,\widetilde{\oplus}_{gr}\, U_Q,\; S_P \oplus_g S_Q\big),$$
where $\widetilde{\oplus}_{gr}$ is the binary operation in $\widetilde{\mathrm{Gr}}_{n,p}$ (see Appendix G.6 for the definition of $\widetilde{\oplus}_{gr}$).

Definition 3.8 (The Inverse Operation in Structure Spaces). Let $(U_P, S_P) \in \widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$. Then the inverse operation $\ominus_{psd,g}$ in structure space $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$ is defined as
$$\ominus_{psd,g}(U_P, S_P) = \big(\widetilde{\ominus}_{gr} U_P,\; \ominus_g S_P\big),$$
where $\widetilde{\ominus}_{gr}$ is the inverse operation in $\widetilde{\mathrm{Gr}}_{n,p}$ (see Appendix G.6 for the definition of $\widetilde{\ominus}_{gr}$).

Our construction of the binary and inverse operations in $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$ is clearly advantageous compared to the method in Nguyen (2022a), since the latter does not preserve the information about the subspaces of the terms involved in these operations. In addition to the binary and inverse operations, we also need to define the inner product in structure spaces.

Definition 3.9 (The Inner Product in Structure Spaces). Let $(U_P, S_P), (U_Q, S_Q) \in \widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$. Then the inner product in structure space $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$ is defined as
$$\big\langle (U_P, S_P), (U_Q, S_Q) \big\rangle^{psd,g} = \lambda \big\langle U_P U_P^T, U_Q U_Q^T \big\rangle^{gr} + \big\langle S_P, S_Q \big\rangle^{g},$$
where $\lambda > 0$ and $\langle \cdot,\cdot \rangle^{gr}$ is the Grassmann inner product (Nguyen & Yang, 2023) (see Appendix G.7).

The key idea to generalize MLR to a Riemannian manifold is to change the margin to reflect the geometry of the considered manifold (a formulation of MLR from the perspective of distances to hyperplanes is given in Appendix C). This requires the notions of hyperplanes and margin in the considered manifold, which are referred to as hypergyroplanes and pseudo-gyrodistances (Nguyen & Yang, 2023), respectively. In our case, the definition of hypergyroplanes in structure spaces, suggested by Proposition 3.2, is given below.

Definition 3.10 (Hypergyroplanes in Structure Spaces). Let $P, W \in \widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$. Then hypergyroplanes in structure space $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$ are defined as
$$\mathcal{H}^{psd,g}_{W,P} = \{Q \in \widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p : \langle \ominus_{psd,g} P \oplus_{psd,g} Q, W \rangle^{psd,g} = 0\}.$$

Pseudo-gyrodistances in structure spaces can be defined in the same way as SPD pseudo-gyrodistances. We refer the reader to Appendix G.8 for all related notions. Theorem 3.11 gives an expression for the pseudo-gyrodistance from a point to a hypergyroplane in a structure space.

Theorem 3.11 (Pseudo-gyrodistances in Structure Spaces). Let $W = (U_W, S_W)$, $P = (U_P, S_P)$, $X = (U_X, S_X) \in \widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$, and $\mathcal{H}^{psd,g}_{W,P}$ be a hypergyroplane in structure space $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$. Then the pseudo-gyrodistance from $X$ to $\mathcal{H}^{psd,g}_{W,P}$ is given by
$$d\big(X, \mathcal{H}^{psd,g}_{W,P}\big) = \frac{\Big| \lambda \big\langle (\widetilde{\ominus}_{gr} U_P \,\widetilde{\oplus}_{gr}\, U_X)(\widetilde{\ominus}_{gr} U_P \,\widetilde{\oplus}_{gr}\, U_X)^T,\; U_W U_W^T \big\rangle^{gr} + \big\langle \ominus_g S_P \oplus_g S_X,\; S_W \big\rangle^{g} \Big|}{\sqrt{\lambda \big(\| U_W U_W^T \|^{gr}\big)^2 + \big(\| S_W \|^{g}\big)^2}},$$
where $\|\cdot\|^{gr}$ and $\|\cdot\|^{g}$ are the norms induced by the Grassmann and SPD inner products, respectively.

Proof. See Appendix M.

The algorithm for computing the pseudo-gyrodistances is given in Appendix B.
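For concreteness, the SPD-factor terms appearing in Theorem 3.11 admit a simple closed form when $g = le$. Below is a minimal PyTorch sketch using the Log-Euclidean operations of Appendix G.3 and the SPD inner product of Appendix G.7; the helper names and the toy data are ours, and the Grassmann-factor terms, which additionally require the operations of Appendices G.5 and G.6, are omitted here.

```python
import torch

def spd_log(S):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    vals, vecs = torch.linalg.eigh(S)
    return vecs @ torch.diag(torch.log(vals)) @ vecs.T

def le_gyro_difference(S_P, S_X):
    """(ominus_le S_P) oplus_le S_X = exp(log S_X - log S_P) (Appendix G.3)."""
    return torch.matrix_exp(spd_log(S_X) - spd_log(S_P))

def le_inner(A, B):
    """SPD inner product <A, B>^le = <Log_I(A), Log_I(B)>_I = Tr(log A log B)."""
    return torch.trace(spd_log(A) @ spd_log(B))

def le_norm(A):
    return torch.sqrt(le_inner(A, A))

# SPD-factor terms of the pseudo-gyrodistance in Theorem 3.11 (g = le):
#   numerator contribution  <(ominus_le S_P) oplus_le S_X, S_W>^le
#   denominator contribution (||S_W||^le)^2
def rand_spd(p):
    A = torch.randn(p, p)
    return A @ A.T + p * torch.eye(p)

S_P, S_X, S_W = rand_spd(4), rand_spd(4), rand_spd(4)
num_spd = le_inner(le_gyro_difference(S_P, S_X), S_W)
den_spd = le_norm(S_W) ** 2
print(float(num_spd), float(den_spd))
```

The same two quantities are what Algorithm 1 (line 10) evaluates for each class, together with the analogous Grassmann-factor terms.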
3.4 NEURAL NETWORKS ON GRASSMANN MANIFOLDS

In this section, we present a method for computing the Grassmann logarithmic map in the projector perspective. We then propose GCNs on Grassmann manifolds.

3.4.1 GRASSMANN LOGARITHMIC MAP IN THE PROJECTOR PERSPECTIVE

The Grassmann logarithmic map is given (Batzies et al., 2015; Bendokat et al., 2020) by
$$\operatorname{Log}^{gr}_P(Q) = [\Omega, P], \quad \text{where } P, Q \in \mathrm{Gr}_{n,p} \text{ and } \Omega = \tfrac{1}{2}\log\big((I_n - 2Q)(I_n - 2P)\big).$$
Notice that the matrix $(I_n - 2Q)(I_n - 2P)$ is generally not an SPD matrix. This raises an issue when one needs to implement an operation that requires the Grassmann logarithmic map in the projector perspective using popular deep learning frameworks like PyTorch and TensorFlow, since the general matrix logarithm is not available as a differentiable operation in these frameworks. To deal with this issue, we rely on the following result, which allows us to compute the Grassmann logarithmic map in the projector perspective from the Grassmann logarithmic map in the ONB perspective.

Proposition 3.12. Let $\tau$ be the mapping $\tau : \widetilde{\mathrm{Gr}}_{n,p} \to \mathrm{Gr}_{n,p}$, $U \mapsto UU^T$. Let $\widetilde{\operatorname{Log}}{}^{gr}_U(V)$, $U, V \in \widetilde{\mathrm{Gr}}_{n,p}$, be the logarithmic map of $V$ at $U$ in the ONB perspective. Then
$$\operatorname{Log}^{gr}_P(Q) = \tau^{-1}(P)\,\big(\widetilde{\operatorname{Log}}{}^{gr}_{\tau^{-1}(P)}(\tau^{-1}(Q))\big)^T + \widetilde{\operatorname{Log}}{}^{gr}_{\tau^{-1}(P)}(\tau^{-1}(Q))\,\tau^{-1}(P)^T.$$

Proof. See Appendix N.

Note that the Grassmann logarithmic map $\widetilde{\operatorname{Log}}{}^{gr}_U(V)$ can be computed via singular value decomposition (SVD), which is a differentiable operation in PyTorch and TensorFlow (see Appendix E.2.2). Therefore, Proposition 3.12 provides an effective implementation of the Grassmann logarithmic map in the projector perspective for gradient-based learning.

3.4.2 GRAPH CONVOLUTIONAL NETWORKS ON GRASSMANN MANIFOLDS

We propose to extend GCNs to Grassmann geometry using an approach similar to Chami et al. (2019); Zhao et al. (2023). Let $G = (V, E)$ be a graph with vertex set $V$ and edge set $E$, $x^l_i$, $i \in V$ be the embedding of node $i$ at layer $l$ ($l = 0$ indicates input node features), $N(i) = \{j : (i,j) \in E\}$ be the set of neighbors of $i \in V$, $W^l$ and $b^l$ be the weight and bias for layer $l$, and $\sigma(\cdot)$ be a non-linear activation function. A basic GCN message-passing update (Zhao et al., 2023) can be expressed as
$$p^l_i = W^l x^{l-1}_i \quad \text{(feature transformation)}$$
$$q^l_i = \sum_{j \in N(i)} w_{ij}\, p^l_j \quad \text{(aggregation)}$$
$$x^l_i = \sigma\big(q^l_i + b^l\big) \quad \text{(bias and nonlinearity)}$$
For the aggregation operation, the weights $w_{ij}$ can be computed using different methods (Kipf & Welling, 2017; Hamilton et al., 2017).

Let $X^l_i \in \mathrm{Gr}_{n,p}$, $i \in V$ be the Grassmann embedding of node $i$ at layer $l$. For feature transformation on Grassmann manifolds, we use isometry maps based on left Grassmann gyrotranslations (Nguyen & Yang, 2023), i.e.,
$$\phi_M(X^l_i) = \exp\big([\operatorname{Log}^{gr}_{I_{n,p}}(M), I_{n,p}]\big)\, X^l_i\, \exp\big(-[\operatorname{Log}^{gr}_{I_{n,p}}(M), I_{n,p}]\big),$$
where $I_{n,p} = \begin{pmatrix} I_p & 0 \\ 0 & 0 \end{pmatrix} \in \mathcal{M}_{n,n}$, and $M \in \mathrm{Gr}_{n,p}$ is a model parameter. Let $\operatorname{Exp}^{gr}(\cdot)$ be the exponential map in $\mathrm{Gr}_{n,p}$. Then the aggregation process is performed as
$$Q^l_i = \operatorname{Exp}^{gr}_{I_{n,p}}\Big( \sum_{j \in N(i)} k_{i,j}\, \operatorname{Log}^{gr}_{I_{n,p}}(P^l_j) \Big),$$
where $P^l_i$ and $Q^l_i$ are the input and output node features of the aggregation operation, and $k_{i,j} = |N(i)|^{-\frac{1}{2}}$ represents the relative importance of node $j$ to node $i$.

Figure 1: The pipelines of GyroSpd++ (left) and Gr-GCN++ (right).
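The logarithmic maps at $I_{n,p}$ used above can be obtained from Proposition 3.12 together with the SVD-based ONB logarithmic map of Appendix E.2.2. The following is a minimal PyTorch sketch; the function names onb_log and proj_log are ours, and $\tau^{-1}$ is realized here by taking the top-$p$ eigenvectors of a projector, which is one admissible choice rather than the authors' reference implementation.

```python
import torch

def onb_log(U, V):
    """Grassmann logarithmic map in the ONB perspective (Edelman et al., 1998):
    Log~_U(V) = U_hat arctan(Sigma) V_hat^T, where
    (I - U U^T) V (U^T V)^{-1} = U_hat Sigma V_hat^T (thin SVD).
    U, V: n x p matrices with orthonormal columns."""
    n = U.shape[0]
    M = (torch.eye(n) - U @ U.T) @ V @ torch.linalg.inv(U.T @ V)
    U_hat, sigma, V_hat_t = torch.linalg.svd(M, full_matrices=False)
    return U_hat @ torch.diag(torch.atan(sigma)) @ V_hat_t

def proj_log(P, Q, p):
    """Grassmann logarithmic map in the projector perspective (Proposition 3.12):
    Log^gr_P(Q) = U D^T + D U^T, with U = tau^{-1}(P) and D = Log~_U(tau^{-1}(Q)).
    P, Q: n x n rank-p projection matrices."""
    U = torch.linalg.eigh(P)[1][:, -p:]   # tau^{-1}(P): orthonormal basis of range(P)
    V = torch.linalg.eigh(Q)[1][:, -p:]   # tau^{-1}(Q)
    D = onb_log(U, V)
    return U @ D.T + D @ U.T

# toy usage: the result is a symmetric tangent vector at P, computed only with
# SVD, eigh, and matrix products, so it is differentiable end to end in PyTorch.
n, p = 5, 2
U0, _ = torch.linalg.qr(torch.randn(n, p))
V0, _ = torch.linalg.qr(torch.randn(n, p))
P, Q = U0 @ U0.T, V0 @ V0.T
W = proj_log(P, Q, p)
print(torch.allclose(W, W.T, atol=1e-5))            # tangent vectors at P are symmetric
print(torch.allclose(P @ W + W @ P, W, atol=1e-4))  # and satisfy P W + W P = W
```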
For any $X \in \mathrm{Gr}_{n,p}$, let $\exp\big([\operatorname{Log}^{gr}_{I_{n,p}}(X), I_{n,p}]\big)\, \widetilde{I}_{n,p} = VU$ be a QR decomposition, where $\widetilde{I}_{n,p} = \begin{pmatrix} I_p \\ 0 \end{pmatrix} \in \mathcal{M}_{n,p}$, $V \in \mathcal{M}_{n,p}$ is a matrix with orthonormal columns, and $U \in \mathcal{M}_{p,p}$ is an upper-triangular matrix. Then the non-linear activation function (Nair & Hinton, 2010; Huang et al., 2018) is given by $\sigma(X) = VV^T$. Let $B^l \in \mathrm{Gr}_{n,p}$ be the bias for layer $l$. Then the message-passing update of our network can be summarized as
$$P^l_i = \phi_{M^l}(X^{l-1}_i) \quad \text{(feature transformation)}$$
$$Q^l_i = \operatorname{Exp}^{gr}_{I_{n,p}}\Big( \sum_{j \in N(i)} k_{i,j}\, \operatorname{Log}^{gr}_{I_{n,p}}(P^l_j) \Big) \quad \text{(aggregation)}$$
$$X^l_i = \sigma\big(B^l \oplus_{gr} Q^l_i\big) \quad \text{(bias and nonlinearity)}$$
The Grassmann logarithmic maps in the aggregation operation are obtained using Proposition 3.12.

Another approach for embedding graphs on Grassmann manifolds has also been proposed in Zhou et al. (2022). However, unlike our method, this method creates a Grassmann representation for a graph via an SVD of the matrix formed from node embeddings previously learned by a Euclidean neural network. Therefore, it is not designed to learn node embeddings on Grassmann manifolds.

4 EXPERIMENTS

4.1 HUMAN ACTION RECOGNITION

We use three datasets, i.e., HDM05 (Müller et al., 2007), FPHA (Garcia-Hernando et al., 2018), and NTU RGB+D 60 (NTU60) (Shahroudy et al., 2016). We compare our networks against the following state-of-the-art models: SPDNet (Huang & Gool, 2017)^1, SPDNetBN (Brooks et al., 2019)^2, SPSD-AI (Nguyen, 2022a), GyroAI-HAUNet (Nguyen, 2022b), and MLR-AI (Nguyen & Yang, 2023).

^1 https://github.com/zhiwu-huang/SPDNet
^2 https://papers.nips.cc/paper/2019/hash/6e69ebbfad976d4637bb4b39de261bf7-Abstract.html

4.1.1 ABLATION STUDY

Convolutional layers in SPD neural networks. Our network GyroSpd++ has an MLR layer stacked on top of a convolutional layer (see Fig. 1). The motivation for using a convolutional layer is that it can extract global features from local ones (covariance matrices computed from joint coordinates within sub-sequences of an action sequence). We use Affine-Invariant metrics for the convolutional layer and Log-Euclidean metrics for the MLR layer. Results in Tab. 1 show that GyroSpd++ consistently outperforms the SPD baselines in terms of mean accuracy. Results of GyroSpd++ with different designs of Riemannian metrics for its layers are given in Appendix D.4.1.

Table 1: Results (mean accuracy ± standard deviation) and model sizes (MB) of various SPD neural networks on the three datasets (computed over 5 runs).

Method              HDM05          #HDM05  FPHA           #FPHA  NTU60          #NTU60
SPDNet              71.36 ± 1.49   6.58    88.79 ± 0.36   0.99   76.14 ± 1.43   1.80
SPDNetBN            75.05 ± 1.38   6.68    91.02 ± 0.25   1.03   78.35 ± 1.34   2.06
GyroAI-HAUNet       77.05 ± 1.35   0.31    95.65 ± 0.23   0.11   93.27 ± 1.29   0.02
SPSD-AI             79.64 ± 1.54   0.31    95.72 ± 0.44   0.11   93.92 ± 1.55   0.03
MLR-AI              78.26 ± 1.37   0.60    95.70 ± 0.26   0.21   94.27 ± 1.32   0.05
GyroSpd++ (Ours)    79.78 ± 1.42   0.76    96.84 ± 0.27   0.27   95.28 ± 1.37   0.07
GyroSpsd++ (Ours)   78.52 ± 1.34   0.75    97.90 ± 0.24   0.27   96.64 ± 1.35   0.07

Table 2: Results and computation times (seconds) per epoch of Gr-GCN++ and its variant based on the ONB perspective. Node embeddings are learned on $\widetilde{\mathrm{Gr}}_{14,7}$ and $\mathrm{Gr}_{14,7}$ for Gr-GCN-ONB and Gr-GCN++, respectively. Results are computed over 5 runs.

Dataset  Metric                          Gr-GCN-ONB   Gr-GCN++
Airport  Accuracy ± standard deviation   81.9 ± 1.2   82.8 ± 0.7
         Training                        0.49         0.97
         Testing                         0.40         0.69
Pubmed   Accuracy ± standard deviation   76.2 ± 1.5   80.3 ± 0.5
         Training                        3.40         6.48
         Testing                         2.76         4.47
Cora     Accuracy ± standard deviation   68.1 ± 1.0   81.6 ± 0.4
         Training                        0.57         0.77
         Testing                         0.46         0.52
MLR in structure spaces. We build GyroSpsd++ by replacing the MLR layer of GyroSpd++ with the MLR layer proposed in Section 3.3. Results of GyroSpsd++ are given in Tab. 1. Except for SPSD-AI, GyroSpsd++ outperforms the other baselines on the HDM05 dataset in terms of mean accuracy. Furthermore, GyroSpsd++ outperforms GyroSpd++ and all the baselines on the FPHA and NTU60 datasets in terms of mean accuracy. These results show that MLR is effective when designed in structure spaces from a gyrovector space perspective.

4.2 NODE CLASSIFICATION

We use three datasets, i.e., Airport (Zhang & Chen, 2018), Pubmed (Namata et al., 2012a), and Cora (Sen et al., 2008), each of which contains a single graph with thousands of labeled nodes. We compare our network Gr-GCN++ (see Fig. 1) against its variant Gr-GCN-ONB (see Appendix E.2.4) based on the ONB perspective. Results are shown in Tab. 2. Both networks give the best performance for n = 14 and p = 7. It can be seen that Gr-GCN++ outperforms Gr-GCN-ONB in all cases. The performance gaps are significant on the Pubmed and Cora datasets.

5 CONCLUSION

In this paper, we develop FC and convolutional layers for SPD neural networks, and MLR on SPSD manifolds. We show how to perform backpropagation with the Grassmann logarithmic map in the projector perspective. Based on this method, we extend GCNs to Grassmann geometry. Finally, we present our experimental results demonstrating the efficacy of our approach in the human action recognition and node classification tasks.

REFERENCES

Pierre-Antoine Absil, Robert E. Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2007.

Vincent Arsigny, Pierre Fillard, Xavier Pennec, and Nicholas Ayache. Fast and Simple Computations on Tensors with Log-Euclidean Metrics. Technical Report RR-5584, INRIA, 2005.

E. Batzies, K. Hüper, L. Machado, and F. Silva Leite. Geometric Mean and Geodesic Regression on Grassmannians. Linear Algebra and its Applications, 466:83-101, 2015.

Thomas Bendokat, Ralf Zimmermann, and P.-A. Absil. A Grassmann Manifold Handbook: Basic Geometry and Computational Aspects. CoRR, abs/2011.13699, 2020. URL https://arxiv.org/abs/2011.13699.

Silvère Bonnabel, Anne Collard, and Rodolphe Sepulchre. Rank-preserving Geometric Means of Positive Semi-definite Matrices. Linear Algebra and its Applications, 438:3202-3216, 2013.

Daniel A. Brooks, Olivier Schwander, Frédéric Barbaresco, Jean-Yves Schneider, and Matthieu Cord. Riemannian Batch Normalization for SPD Neural Networks. In NeurIPS, pp. 15463-15474, 2019.

Rudrasis Chakraborty, Jose Bouza, Jonathan H. Manton, and Baba C. Vemuri. ManifoldNet: A Deep Neural Network for Manifold-valued Data with Applications. TPAMI, 44(2):799-810, 2020.

Ines Chami, Rex Ying, Christopher Ré, and Jure Leskovec. Hyperbolic Graph Convolutional Neural Networks. CoRR, abs/1910.12933, 2019. URL https://arxiv.org/abs/1910.12933.

Weize Chen, Xu Han, Yankai Lin, Hexu Zhao, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. Fully Hyperbolic Neural Networks. In ACL, pp. 5672-5686, 2022.

Calin Cruceru, Gary Bécigneul, and Octavian-Eugen Ganea. Computationally Tractable Riemannian Manifolds for Graph Embeddings. In AAAI, pp. 7133-7141, 2021.

Jindou Dai, Yuwei Wu, Zhi Gao, and Yunde Jia. A Hyperbolic-to-Hyperbolic Graph Convolutional Network. In CVPR, pp. 154-163, 2021.

Zhen Dong, Su Jia, Chi Zhang, Mingtao Pei, and Yuwei Wu.
Deep Manifold Learning of Symmetric Positive Definite Matrices with Application to Face Recognition. In AAAI, pp. 4009 4015, 2017. Alan Edelman, Tom as A. Arias, and Steven T. Smith. The Geometry of Algorithms with Orthogonality Constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303 353, 1998. Octavian-Eugen Ganea, Gary B ecigneul, and Thomas Hofmann. Hyperbolic neural networks. In Neur IPS, pp. 5350 5360, 2018. Guillermo Garcia-Hernando, Shanxin Yuan, Seungryul Baek, and Tae-Kyun Kim. First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations. In CVPR, pp. 409 419, 2018. William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive Representation Learning on Large Graphs. In NIPS, pp. 1025 1035, 2017. Mehrtash Harandi, Mathieu Salzmann, and Richard Hartley. Dimensionality Reduction on SPD Manifolds: The Emergence of Geometry-Aware Methods. TPAMI, 40:48 62, 2018. Zhiwu Huang and Luc Van Gool. A Riemannian Network for SPD Matrix Learning. In AAAI, pp. 2036 2042, 2017. Zhiwu Huang, Jiqing Wu, and Luc Van Gool. Building Deep Networks on Grassmann Manifolds. In AAAI, pp. 3279 3286, 2018. Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Matrix Backpropagation for Deep Networks with Structured Layers. In ICCV, pp. 2965 2973, 2015. Published as a conference paper at ICLR 2024 Ce Ju and Cuntai Guan. Graph Neural Networks on SPD Manifolds for Motor Imagery Classification: A Perspective From the Time-Frequency Analysis. IEEE Transactions on Neural Networks and Learning Systems, pp. 1 15, 2023. Qiyu Kang, Kai Zhao, Yang Song, Sijie Wang, and Wee Peng Tay. Node Embedding from Neural Hamiltonian Orbits in Graph Neural Networks. In ICML, pp. 15786 15808, 2023. Sejong Kim. Ordered Gyrovector Spaces. Symmetry, 12(6), 2020. Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. Co RR, abs/1609.02907, 2017. URL https://arxiv.org/abs/1609.02907. Reinmar J. Kobler, Jun ichiro Hirayama, Qibin Zhao, and Motoaki Kawanabe. SPD Domainspecific Batch Normalization to Crack Interpretable Unsupervised Domain Adaptation in EEG. In Neur IPS, pp. 6219 6235, 2022. Guy Lebanon and John Lafferty. Hyperplane Margin Classifiers on the Multinomial Manifold. In ICML, pp. 66, 2004. Zhenhua Lin. Riemannian Geometry of Symmetric Positive Definite Matrices via Cholesky Decomposition. SIAM Journal on Matrix Analysis and Applications, 40(4):1353 1370, 2019. Qi Liu, Maximilian Nickel, and Douwe Kiela. Hyperbolic Graph Neural Networks. In Neur IPS, pp. 8228 8239, 2019. Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, and Wanli Ouyang. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition. In CVPR, pp. 143 152, 2020. Federico L opez, Beatrice Pozzetti, Steve Trettel, Michael Strube, and Anna Wienhard. Vectorvalued Distance and Gyrocalculus on the Space of Symmetric Positive Definite Matrices. In Neur IPS, pp. 18350 18366, 2021. Meinard M uller, Tido R oder, Michael Clausen, Bernhard Eberhardt, Bj orn Kr uger, and Andreas Weber. Documentation Mocap Database HDM05. Technical Report CG-2007-2, Universit at Bonn, June 2007. Vinod Nair and Geoffrey E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML, pp. 807 814, 2010. Galileo Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven Active Surveying for Collective Classification. In 10th International Workshop on Mining and Learning with Graphs, volume 8, pp. 1, 2012a. Galileo Namata, Ben London, Lise Getoor, and Bert Huang. 
Query-driven Active Surveying for Collective Classification. In Workshop on Mining and Learning with Graphs, 2012b. Xuan Son Nguyen. Geom Net: A Neural Network Based on Riemannian Geometries of SPD Matrix Space and Cholesky Space for 3D Skeleton-Based Interaction Recognition. In ICCV, pp. 13379 13389, 2021. Xuan Son Nguyen. A Gyrovector Space Approach for Symmetric Positive Semi-definite Matrix Learning. In ECCV, pp. 52 68, 2022a. Xuan Son Nguyen. The Gyro-Structure of Some Matrix Manifolds. In Neur IPS, pp. 26618 26630, 2022b. Xuan Son Nguyen and Shuo Yang. Building Neural Networks on Matrix Manifolds: A Gyrovector Space Approach. Co RR, abs/2305.04560, 2023. URL https://arxiv.org/abs/2305. 04560. Xuan Son Nguyen, Luc Brun, Olivier L ezoray, and S ebastien Bougleux. A Neural Network Based on SPD Manifold Learning for Skeleton-based Hand Gesture Recognition. In CVPR, pp. 12036 12045, 2019. Published as a conference paper at ICLR 2024 Xavier Pennec, Pierre Fillard, and Nicholas Ayache. A Riemannian Framework for Tensor Computing. Technical Report RR-5255, INRIA, 2004. Xavier Pennec, Stefan Horst Sommer, and Tom Fletcher. Riemannian Geometric Statistics in Medical Image Analysis. Academic Press, 2020. Chiara Plizzari, Marco Cannici, and Matteo Matteucci. Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks. Computer Vision and Image Understanding, 208: 103219, 2021. Prithviraj Sen, Galileo Mark Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective Classification in Network Data. AI Magazine, 29(3):93 106, 2008. Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In CVPR, pp. 1010 1019, 2016. Ryohei Shimizu, Yusuke Mukuta, and Tatsuya Harada. Hyperbolic Neural Networks++. Co RR, abs/2006.08210, 2021. URL https://arxiv.org/abs/2006.08210. Ondrej Skopek, Octavian-Eugen Ganea, and Gary B ecigneul. Mixed-curvature Variational Autoencoders. Co RR, abs/1911.08411, 2020. URL https://arxiv.org/abs/1911.08411. Lincon S. Souza, Naoya Sogi, Bernardo B. Gatto, Takumi Kobayashi, and Kazuhiro Fukui. An Interface between Grassmann Manifolds and Vector Spaces. In CVPRW, pp. 3695 3704, 2020. Rhea Sanjay Sukthanker, Zhiwu Huang, Suryansh Kumar, Erik Goron Endsjo, Yan Wu, and Luc Van Gool. Neural Architecture Search of SPD Manifold Networks. In IJCAI, pp. 3002 3009, 2021. Abraham Albert Ungar. Beyond the Einstein Addition Law and Its Gyroscopic Thomas Precession: The Theory of Gyrogroups and Gyrovector Spaces. Fundamental Theories of Physics, vol. 117, Springer, Netherlands, 2002. Abraham Albert Ungar. Analytic Hyperbolic Geometry: Mathematical Foundations and Applications. World Scientific Publishing Co. Pte. Ltd., Hackensack, NJ, 2005. Abraham Albert Ungar. Analytic Hyperbolic Geometry in N Dimensions: An Introduction. CRC Press, 2014. Petar Veli ckovi c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li, and Yoshua Bengio. Graph Attention Networks. Co RR, abs/1710.10903, 2018. URL https://arxiv. org/abs/1710.10903. Rui Wang and Xiao-Jun Wu. Gras Net: A Simple Grassmannian Network for Image Set Classification. Neural Processing Letters, 52(1):693 711, 2020. Rui Wang, Xiao-Jun Wu, and Josef Kittler. Sym Net: A Simple Symmetric Positive Definite Manifold Deep Learning Method for Image Set Classification. IEEE Transactions on Neural Networks and Learning Systems, pp. 1 15, 2021. Muhan Zhang and Yixin Chen. Link Prediction Based on Graph Neural Networks. 
Co RR, abs/1802.09691, 2018. URL https://arxiv.org/abs/1802.09691. Yiding Zhang, Xiao Wang, Chuan Shi, Nian Liu, and Guojie Song. Lorentzian Graph Convolutional Networks. In Proceedings of the Web Conference 2021, pp. 1249 1261, 2021. Yiding Zhang, Xiao Wang, Chuan Shi, Xunqiang Jiang, and Yanfang Ye. Hyperbolic Graph Attention Network. IEEE Transactions on Big Data, 8(6):1690 1701, 2022. Wei Zhao, Federico Lopez, J. Maxwell Riestenberg, Michael Strube, Diaaeldin Taha, and Steve Trettel. Modeling Graphs Beyond Hyperbolic: Graph Neural Networks in Symmetric Positive Definite Matrices. Co RR, abs/2306.14064, 2023. URL https://arxiv.org/abs/2306. 14064. Bingxin Zhou, Xuebin Zheng, Yu Guang Wang, Ming Li, and Junbin Gao. Embedding Graphs on Grassmann Manifold. Neural Networks, 152:322 331, 2022. Huanyu Zhou, Qingjie Liu, and Yunhong Wang. Learning Discriminative Representations for Skeleton Based Action Recognition. In CVPR, pp. 10608 10617, 2023. Published as a conference paper at ICLR 2024 Symbol Name Mn,m Space of n m matrices Sym+ n Space of n n SPD matrices Sym+,ai n Space of n n SPD matrices with AI geometry Symn Space of n n symmetric matrices Grn,p Grassmannian in the projector perspective f Grn,p Grassmannian in the ONB perspective S+ n,p Space of n n SPSD matrices of rank p n M Matrix manifold TPM Tangent space of M at P exp(P) Matrix exponential of P log(P) Matrix logarithm of P Expai P (W) Exponential map of W at P in Sym+,ai n Logai P (Q) Logarithmic map of Q at P in Sym+,ai n T ai P Q(W) Parallel transport of W from P to Q in Sym+,ai n Expgr P (W) Exponential map of W at P in Grn,p Loggr P (Q) Logarithmic map of Q at P in Grn,p g Log gr P (Q) Logarithmic map of Q at P in f Grn,p ai, ai Binary and inverse operations in Sym+,ai n gr, gr Binary and inverse operations in Grn,p e gr, e gr Binary and inverse operations in f Grn,p psd,ai Binary operation in f Grn,p Sym+,ai p psd,ai Inverse operation in f Grn,p Sym+,ai p Hspd,ai W,P Hypergyroplane in Sym+,ai n Hpsd,ai W,P Hypergyroplane in f Grn,p Sym+,ai p Eai (i,j) Orthonormal basis of Sym+,ai m ., . ai Inner product in Sym+,ai n ., . gr Inner product in Grn,p ., . psd,ai Inner product in f Grn,p Sym+,ai p ., . ai P Affine-Invariant metric at P In n n identity matrix Table 3: The main notations used in the paper. For the notations related to SPD manifolds, only those associated with Affine-Invariant geometry are shown. A NOTATIONS Tab. 3 presents the main notations used in our paper. B MLR IN STRUCTURE SPACES Algorithm 1 summarizes all steps for the computation of pseudo-gyrodistances in Theorem 3.11. Details of some steps are given below: Published as a conference paper at ICLR 2024 Algorithm 1: Computation of Pseudo-gyrodistances Input: A batch of SPSD matrices Xi S+ n,p, i = 1, . . . , N The number of classes C The parameters for each class Uc P , Uc W f Grn,p, Sc P , Sc W Sym+ p , c = 1, . . . , C A constant γ [0, 1] Output: An array d MN,C of pseudo-gyrodistances 1 Um e In,p; 2 (Ui,Σi, Vi)i=1,...,N SVD((Xi)i=1,...,N); /* Xi = UiΣi VT i */ 3 (Ui)i=1,...,N (Ui[:, : p])i=1,...,N; 4 if training then 5 U Gr Mean((Ui)i=1,...,N); 6 Um Gr Geodesic(Um, U, γ); 8 (Ui X, Si X)i=1,...,N = Gr Canonicalize((Xi, Ui)i=1,...,N, Um); 9 for c 1 to C do 10 d((X)i=1,...,N, c) = |λ (e gr Uc P e gr UX)(e gr Uc P e gr UX)T ,Uc W (Uc W )T gr+ g Sc P g SX,Sc W g| λ( Uc W (Uc W )T gr)2+( Sc W g)2 ; SVD((Xi)i=1,...,N) performs singular value decompositions for a batch of matrices. Gr Mean((Ui)i=1,...,N) computes the Fr echet mean of its arguments. 
Gr Geodesic(Um, U, γ) computes a point on a geodesic from Um to U at step γ (γ = 0.1 in our experiments). Gr Canonicalize((Xi, Ui)i=1,...,N, V) computes the canonical representations of Xi, i = 1, . . . , N using V as a common subspace (see Appendix H). Line 10: UX = (Ui X)i=1,...,N, SX = (Si X)i=1,...,N, and the computation of pseudogyrodistances is performed in batches. C FORMULATION OF MLR FROM THE PERSPECTIVE OF DISTANCES TO HYPERPLANES Given K classes, MLR computes the probability of each of the output classes as p(y = k|x) = exp(w T k x + bk) PK i=1 exp(w T i x + bi) exp(w T k x + bk), (3) where x is an input sample, bi R, x, wi Rn, i = 1, . . . , K. As shown in Lebanon & Lafferty (2004), Eq. (3) can be rewritten as p(y = k|x) exp(sign(w T k x + bk) wk d(x, Hwk,bk)), where d(x, Hwk,bk) is the distance from point x to a hyperplane Hwk,bk defined as Hw,b = {x Rn : w, x + b = 0}, where w Rn \ {0}, and b R. D HUMAN ACTION RECOGNITION D.1 DATASETS HDM05 (M uller et al., 2007) It has 2337 sequences of 3D skeleton data classified into 130 classes. Each frame contains the 3D coordinates of 31 body joints. We use all the action classes and follow the experimental protocol in Harandi et al. (2018) in which 2 subjects are used for training and the remaining 3 subjects are used for testing. Published as a conference paper at ICLR 2024 FPHA (Garcia-Hernando et al., 2018) It has 1175 sequences of 3D skeleton data classified into 45 classes. Each frame contains the 3D coordinates of 21 hand joints. We follow the experimental protocol in Garcia-Hernando et al. (2018) in which 600 sequences are used for training and 575 sequences are used for testing. NTU60 (Shahroudy et al., 2016) It has 56880 sequences of 3D skeleton data classified into 60 classes. Each frame contains the 3D coordinates of 25 or 50 body joints. We use the mutual actions and follow the cross-subject experimental protocol in Shahroudy et al. (2016) in which data from 20 subjects are used for training, and those from the other 20 subjects are used for testing. D.2 IMPLEMENTATION DETAILS D.2.1 SETUP We use the Py Torch framework to implement our networks and those from previous works. These networks are trained using cross-entropy loss and Adadelta optimizer for 2000 epochs. The learning rate is set to 10 3. The factors β (see Proposition 3.4) and λ (see Definition 3.9) are set to 0 and 1, respectively. For Gyro Spd++, the sizes of output matrices of the convolutional layer are set to 34 34, 21 21, and 11 11 for the experiments on HDM05, FPHA, and NTU60 datasets, respectively. For Gyro Spsd++, the sizes of SPD matrices in structure spaces are set to 20 20, 14 14, and 8 8 for the experiments on HDM05, FPHA, and NTU60 datasets, respectively. We use a batch size of 32 for HDM05 and FPHA datasets, and a batch size of 256 for NTU60 dataset. D.2.2 INPUT DATA We use a similar method as in Nguyen (2022b) to compute the input data for our network Gyro Spd++. We first identify a closest left (right) neighbor of every joint based on their distance to the hip (wrist) joint, and then combine the 3D coordinates of each joint and those of its left (right) neighbor to create a feature vector for the joint. For a given frame t, a mean vector µt and a covariance matrix Σt are computed from the set of feature vectors of the frame and then combined to create an SPD matrix as Yt = Σt + µt(µt)T µt (µt)T 1 The lower part of matrix log(Yt) is flattened to obtain a vector vt. 
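For illustration, here is a minimal PyTorch sketch of the per-frame construction just described. It assumes the standard Gaussian embedding $Y_t = \begin{pmatrix} \Sigma_t + \mu_t\mu_t^T & \mu_t \\ \mu_t^T & 1 \end{pmatrix}$ built from the frame's mean vector and covariance matrix; the covariance normalization, the small diagonal jitter, and the helper name frame_embedding are our assumptions and are not taken from the paper.

```python
import torch

def frame_embedding(F):
    """Per-frame SPD feature: build Y_t from the joint feature vectors of one frame
    and return the flattened lower-triangular part of log(Y_t) as the vector v_t.

    F: (num_joints, d) matrix of per-joint feature vectors for one frame."""
    mu = F.mean(dim=0)                               # mean feature vector
    Fc = F - mu
    d = mu.shape[0]
    Sigma = Fc.T @ Fc / F.shape[0] + 1e-6 * torch.eye(d)   # covariance (with small jitter)
    Y = torch.zeros(d + 1, d + 1)
    Y[:d, :d] = Sigma + torch.outer(mu, mu)
    Y[:d, d] = mu
    Y[d, :d] = mu
    Y[d, d] = 1.0
    # matrix logarithm of the SPD matrix Y via eigendecomposition
    vals, vecs = torch.linalg.eigh(Y)
    logY = vecs @ torch.diag(torch.log(vals)) @ vecs.T
    idx = torch.tril_indices(d + 1, d + 1)
    return logY[idx[0], idx[1]]                      # vector v_t

# toy usage: 21 joints with 6-dimensional per-joint features
v_t = frame_embedding(torch.randn(21, 6))
print(v_t.shape)   # ((d+1)(d+2)/2,) = (28,)
```

The vectors $v_t$ obtained this way are then pooled over a time window into the covariance matrix of Eq. (4) described next.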
All vectors vt within a time window [t, t+c 1], where c is determined from a temporal pyramid representation of the sequence (the number of temporal pyramids is set to 2 in our experiments), are used to compute a covariance matrix as i=t ( vi vt)( vi vt)T , (4) where vt = 1 c Pt+c 1 i=t vi. For Gyro AI-HAUNet, SPSD-AI, MLR-AI, Gyro Spd++, and Gyro Spsd++, the input data are the set of matrices obtained in Eq. (4). For SPDNet and SPDNet BN, each sequence is represented by a covariance matrix (Huang & Gool, 2017; Brooks et al., 2019). The sizes of the covariance matrices are 93 93, 60 60, and 150 150 for HDM05, FPHA, and NTU60 datasets, respectively. For SPDNet, the same architecture as the one in Huang & Gool (2017) is used with three Bimap layers. For SPDNet BN, the same architecture as the one in Brooks et al. (2019) is used with three Bimap layers. The sizes of the transformation matrices for the experiments on HDM05, FPHA, and NTU60 datasets are set to 93 93, 60 60, and 150 150, respectively. D.2.3 CONVOLUTIONAL LAYERS In order to reduce the number of parameters and the computational cost for the convolutional layer in Gyro Spd++, we assume a diagonal structure for the parameter P(i,j) (see Propositions 3.4, 3.5, and 3.6), i.e., P(i,j) = concatspd(P1 (i,j), . . . , PL (i,j)), where L is the number of input SPD matrices of operation concatspd(.). Published as a conference paper at ICLR 2024 D.2.4 OPTIMIZATION For parameters that are SPD matrices, we model them on the space of symmetric matrices, and then apply the exponential map at the identity. For any parameter P f Grn,p, we parameterize it by a matrix B Mp,n p such that 0 B BT 0 = [Loggr In,p(PPT ), In,p]. Then parameter P can be computed by P = exp([Loggr In,p(PPT ), In,p])e In,p = exp 0 B BT 0 Thus, we can optimize all parameters on Euclidean spaces without having to resort to techniques developed on Riemannian manifolds. D.3 TIME COMPLEXITY ANALYSIS Let nin nin be the size of input SPD matrices, nout nout be the size of output matrices of the convolutional layer in Gyro Spd++, nrank nrank be the size of SPD matrices in structure spaces, nc be the number of action classes, ns be the number of SPD matrices encoding a sequence. Gyro Spd++: The convolutional layer has time complexity O(nsn2 outn3 in). The MLR layer has time complexity O(ncn3 out). Gyro Spsd++: The convolutional layer has time complexity O(nsn2 outn3 in). The MLR layer has time complexity O(n3 out + ncn3 rank). D.4 MORE EXPERIMENTAL RESULTS D.4.1 ABLATION STUDY Impact of the factor β in Affine-Invariant metrics To study the impact of the factor β in Affine Invariant metrics on the performance of Gyro Spd++, we follow the approach in Nguyen (2021). Denote by (µ,Σ) the Gaussian distribution where µ Rn and Σ Mn,n are its mean and covariance. We can identify the Gaussian distribution (µ,Σ) with the following matrix: (detΣ) 1 n+k Σ + kµµT µ(k) µ(k)T Ik where k 1, µ(k) is a matrix with k identical column vectors µ. The natural symmetric Riemannian metric resulting from the above embedding is given (Nguyen, 2021) by V, W P = Tr(VP 1WP 1) 1 n + k Tr(VP 1) Tr(WP 1), where P is an SPD matrix, V and W are two tangent vectors at point P of the manifold. This Riemannian metric belongs to the family of Affine-Invariant metrics where β = 1 n+k > 1 For this study, we replace matrix Zt in Eq. (4) with the following matrix: e Zt = (det Zt) 1 n1+k Zt + k vt v T t vt(k) vt(k)T Ik where n1 n1 is the size of matrix Zt. The input data for Gyro Spd++ are then computed from e Zt as before. 
Tab. 4 reports the mean accuracies and standard deviations of Gyro Spd++ with respect to different settings of β on the three datasets. Gyro Spd++ with the setting β = 0 generally works well on all the datasets. Setting k = 3 improves the accuracy of Gyro Spd++ on NTU60 dataset. We also observe that setting k to a high value, e.g., k = 10 lowers the accuracies of Gyro Spd++ on the datasets. Published as a conference paper at ICLR 2024 Table 4: Results (mean accuracy standard deviation) of Gyro Spd++ with respect to different settings of β on the three datasets (computed over 5 runs). Dataset HDM05 FPHA NTU60 β = 0 79.78 1.42 96.84 0.27 95.28 1.37 k = 3 79.12 1.37 96.16 0.25 96.32 1.33 k = 10 78.25 1.39 95.91 0.29 94.44 1.34 Table 5: Results and computation times (seconds) of Gyro Spd++ with respect to different settings of the output dimension of the convolutional layer on FPHA dataset (computed over 5 runs). Experiments are conducted on a machine with Intel Core i7-8565U CPU 1.80 GHz 24GB RAM. m Accuracy standard deviation Computation time/epoch Training Testing 10 94.53 0.31 30.08 12.07 21 96.84 0.27 129.30 50.84 30 96.80 0.26 182.52 71.49 Output dimension of convolutional layers Tab. 5 presents results and computation times of Gyro Spd++ with respect to different settings of the output dimension of the convolutional layer on FPHA dataset. Results show that the setting m = 21 clearly outperforms the setting m = 10 in terms of mean accuracy and standard deviation. However, compared to the setting m = 21, the setting m = 30 only increases the training and testing times without improving the mean accuracy of Gyro Spd++. Design of Riemannian metrics for network blocks The use of different Riemannian metrics for the convolutional and MLR layers of Gyro Spd++ results in different variants of the same architecture. Results of some of these variants on FPHA dataset are shown in Tab. 6. It is noted that our architecture gives the best performance in terms of mean accuracy, while the architecture with Log-Cholesky geometry for the MLR layer performs the worst in terms of mean accuracy. D.4.2 COMPARISON OF GYROSPD++ AGAINST STATE-OF-THE-ART METHODS Here we present more comparisons of our networks against state-of-the-art networks. These networks belong to one of the following families of neural networks: (1) Hyperbolic neural networks: Hyp GRU (Ganea et al., 2018)3; (2) Graph neural networks: MS-G3D (Liu et al., 2020)4, TGN (Zhou et al., 2023)5; (3) Transformers: ST-TR (Plizzari et al., 2021)6. Note that MS-G3D, TGN, and ST-TR are specifically designed for skeleton-based action recognition. We use default parameter settings for these networks. Results of our networks and their competitors on HDM05, FPHA, and NTU60 datasets are shown in Tabs. 7, 8, and 9, respectively. On HDM05 dataset, Gyro Spd++ outperforms Hyp GRU, MS-G3D, ST-TR, and TGN by 25.6%, 10.8%, 11.9%, and 9.5% points in terms of mean accuracy, respectively. On FPHA dataset, Gyro Spd++ outperforms Hyp GRU, MS-G3D, ST-TR, and TGN by 38.6%, 8.5%, 10.9%, and 6.0% points in terms of mean accuracy, respectively. On NTU60 dataset, Gyro Spd++ outperforms Hyp GRU, MS-G3D, ST-TR, and TGN by 7.0%, 3.1%, 4.2%, and 2.2% points in terms of mean accuracy, respectively. Overall, our networks are superior to their competitors in all cases. Finally, we present a comparison of computation times of SPD neural networks in Tab. 10. 
Published as a conference paper at ICLR 2024 Table 6: Results (mean accuracy standard deviation) of Gyro Spd++ with different designs of Riemannian metrics for its layers on FPHA dataset (computed over 5 runs). AI-LE (Gyro Spd++) LE-LE AI-AI LE-AI AI-LC 96.84 0.27 94.72 0.25 94.35 0.29 95.21 0.26 89.16 0.26 Table 7: Results of our networks and some state-of-the-art methods on HDM05 dataset (computed over 5 runs). Method Accuracy Standard deviation Hyp GRU (Ganea et al., 2018) 54.18 1.51 MS-G3D (Liu et al., 2020) 68.92 1.72 ST-TR (Plizzari et al., 2021) 67.84 1.66 TGN (Zhou et al., 2023) 70.26 1.48 Gyro Spd++ (Ours) 79.78 1.42 Gyro Spsd++ (Ours) 78.52 1.34 E NODE CLASSIFICATION E.1 DATASETS Airport (Chami et al., 2019) It is a flight network dataset from Open Flights.org where nodes represent airports, edges represent the airline Routes, and node labels are the populations of the country where the airport belongs. Pubmed (Namata et al., 2012b) It is a standard benchmark describing citation networks where nodes represent scientific papers in the area of medicine, edges are citations between them, and node labels are academic (sub)areas. Cora (Sen et al., 2008) It is a citation network where nodes represent scientific papers in the area of machine learning, edges are citations between them, and node labels are academic (sub)areas. The statistics of the three datasets are summarized in Tab. 11. E.2 IMPLEMENTATION DETAILS E.2.1 SETUP Our network is implemented using the Py Torch framework. We set hyperparameters as in Zhao et al. (2023) that are found via grid search for each graph architecture on the development set of a given dataset. The best settings of n and p are found from (n, p) {(2k, k)}, k = 2, 3, . . . , 10. The batch size is set to the total number of graph nodes in a dataset (Chami et al., 2019; Zhao et al., 2023). The networks are trained using cross-entropy loss and Adam optimizer for a maximum of 500 epochs. The learning rate is set to 10 2. Early stopping is used when the loss on the development set has not decreased for 200 epochs. Each network has two layers that perform message passing twice at one iteration (Zhao et al., 2023). We use the 70/15/15 percent splits (Chami et al., 2019) for Airport dataset, and standard splits in GCN Kipf & Welling (2017) for Pubmed and Cora datasets. 3https://github.com/dalab/hyperbolic_nn. 4https://github.com/kenziyuliu/MS-G3D. 5https://github.com/zhysora/FR-Head. 6https://github.com/Chiaraplizz/ST-TR. Published as a conference paper at ICLR 2024 Table 8: Results of our networks and some state-of-the-art methods on FPHA dataset (computed over 5 runs). Method Accuracy Standard deviation Hyp GRU (Ganea et al., 2018) 58.24 0.29 MS-G3D (Liu et al., 2020) 88.26 0.67 ST-TR (Plizzari et al., 2021) 85.94 0.46 TGN (Zhou et al., 2023) 90.81 0.53 Gyro Spd++ (Ours) 96.84 0.27 Gyro Spsd++ (Ours) 97.90 0.24 Table 9: Results of our networks and some state-of-the-art methods on NTU60 dataset (computed over 5 runs). Method Accuracy Standard deviation Hyp GRU (Ganea et al., 2018) 88.26 1.40 MS-G3D (Liu et al., 2020) 92.15 1.60 ST-TR (Plizzari et al., 2021) 91.04 1.52 TGN (Zhou et al., 2023) 93.02 1.56 Gyro Spd++ (Ours) 95.28 1.37 Gyro Spsd++ (Ours) 96.64 1.35 E.2.2 GRASSMANN LOGARITHMIC MAP IN THE ONB PERSPECTIVE The Grassmann logarithmic map in the ONB perspective is given (Edelman et al., 1998) by g Log gr P (Q) = U arctan(Σ)VT , where P, Q f Grn,p, U, Σ, and V are obtained from the SVD (In PPT )Q(PT Q) 1 = UΣVT . 
E.2.3 GR-GCN++ To create Grassmann embeddings as input node features, we first transform d-dimensional input features into p(n p)-dimensional vectors via a linear map. We then reshape each resulting vector to a matrix B Mp,n p. The input Grassmann embedding X0 i , i V is computed as X0 i = exp 0 B BT 0 In,p exp 0 B BT 0 E.2.4 GR-GCN-ONB To create Grassmann embeddings as input node features, we first transform d-dimensional input features into p(n p)-dimensional vectors via a linear map. We then reshape each resulting vector to a matrix B Mp,n p. The input Grassmann embedding X0 i , i V is computed as X0 i = exp 0 B BT 0 Feature transformation is performed by first mapping the input to a projection matrix (using the mapping τ in Section 3.4.1), then applying an isometry map based on left Grassmann gyrotranslations (Nguyen & Yang, 2023), and finally mapping the result back to a matrix with orthonormal columns. This is equivalent to performing the following mapping: φM(Xl i) = Me gr Xl i = exp([Loggr In,p(MMT ), In,p])Xl i, where Xl i f Grn,p and M f Grn,p is a model parameter. Published as a conference paper at ICLR 2024 Table 10: Computation times (seconds) per epoch of our networks and some state-of-the-art SPD neural networks on FPHA dataset. Experiments are conducted on a machine with Intel Core i78565U CPU 1.80 GHz 24GB RAM. Method SPDNet SPDNet BN Gyro AI-HAUNet SPSD-AI MLR-AI Gyro Spd++ Gyro Spsd++ Training 17.52 40.08 62.21 73.73 102.58 129.30 126.08 Testing 3.48 6.22 30.83 35.54 46.28 50.84 48.06 Table 11: Description of the datasets for node classification. Dataset #Nodes #Edges #Classes #Features Airport 3188 18631 4 4 Pubmed 19717 44338 3 500 Cora 2708 5429 7 1433 For any X f Grn,p, let X = VU be a QR decomposition of X, where V Mn,p is a matrix with orthonormal columns, and U Mp,p is an upper-triangular matrix. Then the non-linear activation function is given by σ(X) = V. Bias addition is performed using operation e gr instead of operation gr. The output of Gr-GCNONB is mapped to a projection matrix for node classification. E.2.5 OPTIMIZATION For any parameter P Grn,p, we parameterize it by a matrix B Mp,n p such that 0 B BT 0 = [Loggr In,p(P), In,p]. Then parameter P can be computed by P = exp 0 B BT 0 In,p exp 0 B BT 0 E.3 MORE EXPERIMENTAL RESULTS E.3.1 ABLATION STUDY Projector vs. ONB perspective More results of Gr-GCN++ and Gr-GCN-ONB are presented in Tabs. 12 and 13. As can be observed, Gr-GCN++ outperforms Gr-GCN-ONB in all cases. In particular, the former outperforms the latter by large margins on Airport and Cora datasets. Results show that while both the networks learn node embeddings on Grassmann manifolds, the choice of perspective for representing these embeddings and the associated parameters can have a significant impact on the network performance. E.3.2 COMPARISON OF GR-GCN++ AGAINST STATE-OF-THE-ART METHODS Tab. 14 shows results of Gr-GCN++ and some state-of-the-art methods on the three datasets. The hyperbolic networks outperform their SPD and Grassmann counterparts on Airport dataset with high hyperbolicity (Chami et al., 2019). This agrees with previous works (Chami et al., 2019; Zhang et al., 2022) that report good performances of hyperbolic embeddings on tree-like datasets. However, our network and its SPD counterpart SPD-GCN outperform their competitors on Pubmed and Cora datasets with low hyperbolicities. Compared to SPD-GCN, Gr-GCN++ always gives more consistent results. 
E.3 MORE EXPERIMENTAL RESULTS

E.3.1 ABLATION STUDY

Projector vs. ONB perspective. More results of Gr-GCN++ and Gr-GCN-ONB are presented in Tabs. 12 and 13. As can be observed, Gr-GCN++ outperforms Gr-GCN-ONB in all cases. In particular, the former outperforms the latter by large margins on the Airport and Cora datasets. These results show that while both networks learn node embeddings on Grassmann manifolds, the choice of perspective for representing these embeddings and the associated parameters can have a significant impact on network performance.

E.3.2 COMPARISON OF GR-GCN++ AGAINST STATE-OF-THE-ART METHODS

Tab. 14 shows the results of Gr-GCN++ and some state-of-the-art methods on the three datasets. The hyperbolic networks outperform their SPD and Grassmann counterparts on the Airport dataset, which has high hyperbolicity (Chami et al., 2019). This agrees with previous works (Chami et al., 2019; Zhang et al., 2022) that report good performance of hyperbolic embeddings on tree-like datasets. However, our network and its SPD counterpart SPD-GCN outperform their competitors on the Pubmed and Cora datasets, which have low hyperbolicities. Compared to SPD-GCN, Gr-GCN++ always gives more consistent results.

Table 12: Results and computation times (seconds) per epoch of Gr-GCN++ and its variant Gr-GCN-ONB based on the ONB perspective. Node embeddings are learned on $\widetilde{\operatorname{Gr}}_{4,2}$ and $\operatorname{Gr}_{4,2}$ for Gr-GCN-ONB and Gr-GCN++, respectively. Results are computed over 5 runs. Experiments are conducted on a machine with an Intel Core i7-9700 CPU @ 3.00 GHz and 15GB RAM.

Dataset | Metric | Gr-GCN-ONB | Gr-GCN++
Airport | Accuracy ± standard deviation | 53.2 ± 1.9 | 60.1 ± 1.3
Airport | Training | 0.07 | 0.21
Airport | Testing | 0.05 | 0.12
Pubmed | Accuracy ± standard deviation | 75.7 ± 2.1 | 77.5 ± 1.1
Pubmed | Training | 0.50 | 0.90
Pubmed | Testing | 0.38 | 0.54
Cora | Accuracy ± standard deviation | 33.9 ± 2.3 | 64.4 ± 1.4
Cora | Training | 0.10 | 0.12
Cora | Testing | 0.07 | 0.08

Table 13: Results and computation times (seconds) per epoch of Gr-GCN++ and its variant Gr-GCN-ONB based on the ONB perspective. Node embeddings are learned on $\widetilde{\operatorname{Gr}}_{6,3}$ and $\operatorname{Gr}_{6,3}$ for Gr-GCN-ONB and Gr-GCN++, respectively. Results are computed over 5 runs. Experiments are conducted on a machine with an Intel Core i7-9700 CPU @ 3.00 GHz and 15GB RAM.

Dataset | Metric | Gr-GCN-ONB | Gr-GCN++
Airport | Accuracy ± standard deviation | 65.8 ± 1.5 | 74.1 ± 0.9
Airport | Training | 0.19 | 0.34
Airport | Testing | 0.14 | 0.21
Pubmed | Accuracy ± standard deviation | 75.8 ± 2.0 | 78.5 ± 0.9
Pubmed | Training | 0.90 | 1.76
Pubmed | Testing | 0.75 | 1.05
Cora | Accuracy ± standard deviation | 41.4 ± 2.2 | 70.5 ± 1.1
Cora | Training | 0.16 | 0.22
Cora | Testing | 0.12 | 0.16

F LIMITATIONS OF OUR WORK

Our SPD network GyroSpd++ relies on different Riemannian metrics across its layers: the convolutional layer is based on Affine-Invariant metrics, while the MLR layer is based on Log-Euclidean metrics. Although our experiments demonstrate that GyroSpd++ achieves good performance on all the datasets compared to state-of-the-art methods, it is not clear whether this design is optimal for the human action recognition task. When building a deep SPD architecture, it would be useful to have insights into which Riemannian metrics should be used for each network block in order to obtain good performance on a target task. In our Grassmann network Gr-GCN++, the feature transformation, bias, and nonlinearity operations are performed on Grassmann manifolds, while the aggregation operation is performed in tangent spaces. Previous works (Dai et al., 2021; Chen et al., 2022) on HNNs have shown that such hybrid methods limit the modeling ability of networks. Therefore, it is desirable to develop GCNs in which all operations are formalized on Grassmann manifolds.

Table 14: Results (mean accuracy ± standard deviation) of Gr-GCN++ and some state-of-the-art methods on the three datasets. The best and second best results in terms of mean accuracy are highlighted in red and blue, respectively.

Method | Airport | Pubmed | Cora
GCN (Kipf & Welling, 2017) | 82.2 ± 0.6 | 77.8 ± 0.8 | 80.2 ± 2.3
GAT (Veličković et al., 2018) | 92.9 ± 0.8 | 77.6 ± 0.8 | 80.3 ± 0.6
HGNN (Liu et al., 2019) | 84.5 ± 0.7 | 76.6 ± 1.4 | 79.5 ± 0.9
HGCN (Chami et al., 2019) | 85.3 ± 0.6 | 76.4 ± 0.8 | 78.7 ± 0.9
LGCN (Zhang et al., 2021) | 88.2 ± 0.2 | 77.3 ± 1.4 | 80.6 ± 0.9
HGAT (Zhang et al., 2022) | 87.5 ± 0.9 | 78.0 ± 0.5 | 80.9 ± 0.7
SPD-GCN (Zhao et al., 2023) | 82.6 ± 1.5 | 78.7 ± 0.5 | 82.3 ± 0.5
HamGNN (Kang et al., 2023) | 95.9 ± 0.1 | 78.3 ± 0.6 | 80.1 ± 1.6
Gr-GCN++ (Ours) | 82.8 ± 0.7 | 80.3 ± 0.5 | 81.6 ± 0.4

G SOME RELATED DEFINITIONS

G.1 GYROGROUPS AND GYROVECTOR SPACES

Gyrovector spaces form the setting for hyperbolic geometry in the same way that vector spaces form the setting for Euclidean geometry (Ungar, 2002; 2005; 2014). We recap the definitions of gyrogroups and gyrocommutative gyrogroups proposed in Ungar (2002; 2005; 2014).
For greater mathematical detail and in-depth discussion, we refer the interested reader to these papers.

Definition G.1 (Gyrogroups (Ungar, 2014)). A pair $(G, \oplus)$ is a groupoid in the sense that it is a nonempty set, $G$, with a binary operation, $\oplus$. A groupoid $(G, \oplus)$ is a gyrogroup if its binary operation satisfies the following axioms for $a, b, c \in G$:
(G1) There is at least one element $e \in G$, called a left identity, such that $e \oplus a = a$.
(G2) There is an element $\ominus a \in G$, called a left inverse of $a$, such that $\ominus a \oplus a = e$.
(G3) There is an automorphism $\operatorname{gyr}[a, b] : G \rightarrow G$ for each $a, b \in G$ such that $a \oplus (b \oplus c) = (a \oplus b) \oplus \operatorname{gyr}[a, b]c$ (Left Gyroassociative Law). The automorphism $\operatorname{gyr}[a, b]$ is called the gyroautomorphism, or the gyration of $G$ generated by $a, b$.
(G4) $\operatorname{gyr}[a, b] = \operatorname{gyr}[a \oplus b, b]$ (Left Reduction Property).

Definition G.2 (Gyrocommutative Gyrogroups (Ungar, 2014)). A gyrogroup $(G, \oplus)$ is gyrocommutative if it satisfies $a \oplus b = \operatorname{gyr}[a, b](b \oplus a)$ (Gyrocommutative Law).

The following definition of gyrovector spaces is slightly different from Definition 3.2 in Ungar (2014).

Definition G.3 (Gyrovector Spaces). A gyrocommutative gyrogroup $(G, \oplus)$ equipped with a scalar multiplication $(t, x) \mapsto t \otimes x : \mathbb{R} \times G \rightarrow G$ is called a gyrovector space if it satisfies the following axioms for $s, t \in \mathbb{R}$ and $a, b, c \in G$:
(V1) $1 \otimes a = a$, $0 \otimes a = t \otimes e = e$, and $(-1) \otimes a = \ominus a$.
(V2) $(s + t) \otimes a = s \otimes a \oplus t \otimes a$.
(V3) $(st) \otimes a = s \otimes (t \otimes a)$.
(V4) $\operatorname{gyr}[a, b](t \otimes c) = t \otimes \operatorname{gyr}[a, b]c$.
(V5) $\operatorname{gyr}[s \otimes a, t \otimes a] = \operatorname{Id}$, where $\operatorname{Id}$ is the identity map.

G.2 AI GYROVECTOR SPACES

For $P, Q \in \operatorname{Sym}^+_n$, the binary operation (Nguyen, 2022a) is given as
$$P \oplus_{ai} Q = P^{\frac{1}{2}} Q P^{\frac{1}{2}}.$$
The inverse operation (Nguyen, 2022a) is given by $\ominus_{ai} P = P^{-1}$.

G.3 LE GYROVECTOR SPACES

For $P, Q \in \operatorname{Sym}^+_n$, the binary operation (Nguyen, 2022a) is given as
$$P \oplus_{le} Q = \exp(\log(P) + \log(Q)).$$
The inverse operation (Nguyen, 2022a) is given as $\ominus_{le} P = P^{-1}$.

G.4 LC GYROVECTOR SPACES

For $P, Q \in \operatorname{Sym}^+_n$, the binary operation (Nguyen, 2022a) is given as
$$P \oplus_{lc} Q = \big(\lfloor \mathcal{L}(P) \rfloor + \lfloor \mathcal{L}(Q) \rfloor + D(\mathcal{L}(P)) D(\mathcal{L}(Q))\big)\big(\lfloor \mathcal{L}(P) \rfloor + \lfloor \mathcal{L}(Q) \rfloor + D(\mathcal{L}(P)) D(\mathcal{L}(Q))\big)^T,$$
where $\lfloor Y \rfloor$ is a matrix of the same size as matrix $Y \in M_{n,n}$ whose $(i,j)$ element is $Y_{(i,j)}$ if $i > j$ and is zero otherwise, $D(Y)$ is a diagonal matrix of the same size as matrix $Y$ whose $(i,i)$ element is $Y_{(i,i)}$, and $\mathcal{L}(P)$ denotes the Cholesky factor of $P$, i.e., $\mathcal{L}(P)$ is a lower-triangular matrix with positive diagonal entries such that $P = \mathcal{L}(P)\mathcal{L}(P)^T$. The inverse operation (Nguyen, 2022a) is given by
$$\ominus_{lc} P = \big(-\lfloor \mathcal{L}(P) \rfloor + D(\mathcal{L}(P))^{-1}\big)\big(-\lfloor \mathcal{L}(P) \rfloor + D(\mathcal{L}(P))^{-1}\big)^T.$$

G.5 GRASSMANN MANIFOLDS IN THE PROJECTOR PERSPECTIVE

For $P, Q \in \operatorname{Gr}_{n,p}$, the binary operation (Nguyen, 2022b) is given as
$$P \oplus_{gr} Q = \exp\big(\big[\operatorname{Log}^{gr}_{I_{n,p}}(P), I_{n,p}\big]\big)\, Q\, \exp\big(-\big[\operatorname{Log}^{gr}_{I_{n,p}}(P), I_{n,p}\big]\big),$$
where $[\cdot, \cdot]$ denotes the matrix commutator. The inverse operation (Nguyen, 2022b) is defined as
$$\ominus_{gr} P = \operatorname{Exp}^{gr}_{I_{n,p}}\big(-\operatorname{Log}^{gr}_{I_{n,p}}(P)\big).$$

G.6 GRASSMANN MANIFOLDS IN THE ONB PERSPECTIVE

For $U, V \in \widetilde{\operatorname{Gr}}_{n,p}$, the binary operation (Nguyen & Yang, 2023) is defined as
$$U \,\widetilde{\oplus}_{gr}\, V = \exp\big(\big[\operatorname{Log}^{gr}_{I_{n,p}}(UU^T), I_{n,p}\big]\big)\, V.$$
The inverse operation can be defined using the approach in Nguyen & Yang (2023) (see Section 2.3.1), i.e.,
$$\widetilde{\ominus}_{gr}\, U = \tau^{-1}\big(\ominus_{gr}(UU^T)\big),$$
where the mapping $\tau$ is defined in Proposition 3.12, i.e., $\tau : \widetilde{\operatorname{Gr}}_{n,p} \rightarrow \operatorname{Gr}_{n,p}$, $U \mapsto UU^T$.
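As a concrete illustration of the operations in Sections G.2-G.4, the following sketch implements the three SPD binary operations with standard matrix functions. The helper names and the eigendecomposition-based matrix square root and logarithm are our own choices and are not taken from the paper's implementation.

```python
import torch

def _sym_fun(P: torch.Tensor, fun) -> torch.Tensor:
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = torch.linalg.eigh(P)
    return V @ torch.diag(fun(w)) @ V.T

def oplus_ai(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Affine-Invariant gyroaddition: P^{1/2} Q P^{1/2}."""
    P_half = _sym_fun(P, torch.sqrt)
    return P_half @ Q @ P_half

def oplus_le(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Log-Euclidean gyroaddition: exp(log(P) + log(Q))."""
    return torch.matrix_exp(_sym_fun(P, torch.log) + _sym_fun(Q, torch.log))

def oplus_lc(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Log-Cholesky gyroaddition built from the Cholesky factors of P and Q."""
    LP, LQ = torch.linalg.cholesky(P), torch.linalg.cholesky(Q)
    strict = lambda L: torch.tril(L, diagonal=-1)    # strictly lower part (the floor operator)
    diag = lambda L: torch.diag(torch.diagonal(L))   # diagonal part D(.)
    L = strict(LP) + strict(LQ) + diag(LP) @ diag(LQ)
    return L @ L.T

# The identity matrix acts as a left identity for all three operations.
P = torch.diag(torch.tensor([2.0, 3.0]))
I = torch.eye(2)
for oplus in (oplus_ai, oplus_le, oplus_lc):
    assert torch.allclose(oplus(I, P), P, atol=1e-5)
```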
G.7 THE SPD AND GRASSMANN INNER PRODUCTS

Definition G.4 (The SPD Inner Product). Let $P, Q \in \operatorname{Sym}^{+,g}_n$. Then the SPD inner product of $P$ and $Q$ is defined as
$$\langle P, Q \rangle_g = \big\langle \operatorname{Log}^g_{I_n}(P),\ \operatorname{Log}^g_{I_n}(Q) \big\rangle^g_{I_n}.$$

Definition G.5 (The Grassmann Inner Product). Let $P, Q \in \operatorname{Gr}_{n,p}$. Then the Grassmann inner product of $P$ and $Q$ is defined as
$$\langle P, Q \rangle_{gr} = \big\langle \operatorname{Log}^{gr}_{I_{n,p}}(P),\ \operatorname{Log}^{gr}_{I_{n,p}}(Q) \big\rangle_{I_{n,p}},$$
where $\langle \cdot, \cdot \rangle_{I_{n,p}}$ denotes the inner product at $I_{n,p}$ given by the canonical metric of $\operatorname{Gr}_{n,p}$.

G.8 THE GYROCOSINE FUNCTION AND GYROANGLES IN STRUCTURE SPACES

Definition G.6 (The Gyrocosine Function and Gyroangles). Let $P$, $Q$, and $R$ be three distinct gyropoints in structure space $\widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$. The gyrocosine of the measure of the gyroangle $\alpha$, $0 \leq \alpha \leq \pi$, between $\ominus_{psd,g} P \oplus_{psd,g} Q$ and $\ominus_{psd,g} P \oplus_{psd,g} R$ is given by the equation
$$\cos \alpha = \frac{\big\langle \ominus_{psd,g} P \oplus_{psd,g} Q,\ \ominus_{psd,g} P \oplus_{psd,g} R \big\rangle_{psd,g}}{\big\| \ominus_{psd,g} P \oplus_{psd,g} Q \big\|_{psd,g}\, \big\| \ominus_{psd,g} P \oplus_{psd,g} R \big\|_{psd,g}},$$
where $\| \cdot \|_{psd,g}$ is the norm induced by the inner product in structure spaces. The gyroangle $\alpha$ is denoted by $\alpha = \angle QPR$.

G.9 THE GYRODISTANCE FUNCTION IN STRUCTURE SPACES

Definition G.7 (The Gyrodistance Function in Structure Spaces). Let $P, Q \in \widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$. Then the gyrodistance function in structure spaces $\widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$ is defined as
$$d(P, Q) = \big\| \ominus_{psd,g} P \oplus_{psd,g} Q \big\|_{psd,g}.$$

G.10 THE PSEUDO-GYRODISTANCE FUNCTION IN STRUCTURE SPACES

Definition G.8 (The Pseudo-gyrodistance Function in Structure Spaces). Let $H^{psd,g}_{W,P}$ be a hypergyroplane in structure space $\widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$, and $X \in \widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$. Then the pseudo-gyrodistance from $X$ to $H^{psd,g}_{W,P}$ is defined as
$$d\big(X, H^{psd,g}_{W,P}\big) = \sin(\angle X P \overline{Q})\, d(X, P),$$
where $\overline{Q}$ is given by
$$\overline{Q} = \operatorname*{arg\,max}_{Q \in H^{psd,g}_{W,P} \setminus \{P\}} \cos(\angle QPX).$$
By convention, $\sin(\angle X P \overline{Q}) = 0$ for any $X, \overline{Q} \in H^{psd,g}_{W,P}$.

H COMPUTATION OF CANONICAL REPRESENTATIONS

Let $V_{n,p}$ be the space of $n \times p$ matrices with orthonormal columns. For any $P \in S^+_{n,p}$, let $U_P \in \widetilde{\operatorname{Gr}}_{n,p}$ and $S_P \in \operatorname{Sym}^+_p$ be such that $P = U_P S_P U_P^T$. Denote by $W$ the common subspace used for computing a canonical representation of $P$. We first compute two bases of $\operatorname{span}(U_P)$ and $\operatorname{span}(W)$, denoted respectively by $\overline{U}$ and $\overline{W}$, such that
$$d_{V_{n,p}}\big(\overline{U}, \overline{W}\big) = d_{\widetilde{\operatorname{Gr}}_{n,p}}\big(\operatorname{span}(U_P), \operatorname{span}(W)\big),$$
where $d_{V_{n,p}}(\cdot, \cdot)$ and $d_{\widetilde{\operatorname{Gr}}_{n,p}}(\cdot, \cdot)$ are the distances between two points in $V_{n,p}$ and $\widetilde{\operatorname{Gr}}_{n,p}$, respectively. These two bases can be computed as
$$\overline{U} = U_P Y, \quad \overline{W} = W V,$$
where $Y$ and $V$ are obtained from an SVD of $(U_P)^T W$, i.e., $(U_P)^T W = Y (\cos \Sigma) V^T$. The SPD matrix $S_P$ in the canonical representation of $P$ is then computed as
$$S_P = V\, \overline{U}^T P\, \overline{U}\, V^T.$$
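The canonical representation described above requires only one SVD and a few matrix products. The sketch below follows the recipe of this section; the function name and the use of explicit $(U_P, S_P)$ inputs are our own illustrative choices.

```python
import torch

def canonical_representation(U_P: torch.Tensor, S_P: torch.Tensor, W: torch.Tensor):
    """Canonical representation of P = U_P S_P U_P^T w.r.t. the common subspace spanned by W.

    U_P, W: (n, p) matrices with orthonormal columns; S_P: (p, p) SPD matrix.
    Returns the aligned basis U_bar and the SPD matrix of the canonical representation.
    """
    # SVD of U_P^T W = Y cos(Sigma) V^T.
    Y, cos_sigma, Vh = torch.linalg.svd(U_P.T @ W, full_matrices=False)
    V = Vh.T
    U_bar = U_P @ Y                        # basis of span(U_P) aligned with the basis W V of span(W)
    P = U_P @ S_P @ U_P.T
    S_bar = V @ U_bar.T @ P @ U_bar @ V.T  # SPD part of the canonical representation
    return U_bar, S_bar
```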
I PROOF OF PROPOSITION 3.2

Proof. We first recall the definition of the binary operation $\oplus_g$ in Nguyen (2022b).

Definition I.1 (The Binary Operation (Nguyen, 2022b)). Let $P, Q \in \operatorname{Sym}^+_n$. Then the binary operation $\oplus_g$ is defined as
$$P \oplus_g Q = \operatorname{Exp}^g_P\big(T^g_{I_n \rightarrow P}\big(\operatorname{Log}^g_{I_n}(Q)\big)\big),$$
where $T^g_{I_n \rightarrow P}$ denotes the parallel transport from $I_n$ to $P$.

We have
$$\big\langle \operatorname{Log}^g_P(Q), W \big\rangle^g_P \overset{(1)}{=} \big\langle T^g_{P \rightarrow I_n}\big(\operatorname{Log}^g_P(Q)\big),\ T^g_{P \rightarrow I_n}(W) \big\rangle^g_{I_n} \overset{(2)}{=} \Big\langle \operatorname{Exp}^g_{I_n}\big(T^g_{P \rightarrow I_n}\big(\operatorname{Log}^g_P(Q)\big)\big),\ \operatorname{Exp}^g_{I_n}\big(T^g_{P \rightarrow I_n}(W)\big) \Big\rangle_g, \quad (5)$$
where (1) follows from the invariance of the inner product under parallel transport, and (2) follows from Definition G.4.

Let $R = \operatorname{Exp}^g_{I_n}\big(T^g_{P \rightarrow I_n}\big(\operatorname{Log}^g_P(Q)\big)\big)$. Then $\operatorname{Log}^g_{I_n}(R) = T^g_{P \rightarrow I_n}\big(\operatorname{Log}^g_P(Q)\big)$, which results in $T^g_{I_n \rightarrow P}\big(\operatorname{Log}^g_{I_n}(R)\big) = \operatorname{Log}^g_P(Q)$. Hence
$$\operatorname{Exp}^g_P\big(T^g_{I_n \rightarrow P}\big(\operatorname{Log}^g_{I_n}(R)\big)\big) = Q.$$
By the Left Cancellation Law, $Q = P \oplus_g (\ominus_g P \oplus_g Q)$. Therefore,
$$\operatorname{Exp}^g_P\big(T^g_{I_n \rightarrow P}\big(\operatorname{Log}^g_{I_n}(R)\big)\big) = P \oplus_g (\ominus_g P \oplus_g Q) = \operatorname{Exp}^g_P\big(T^g_{I_n \rightarrow P}\big(\operatorname{Log}^g_{I_n}(\ominus_g P \oplus_g Q)\big)\big),$$
where the last equality follows from Definition I.1. We thus have
$$\ominus_g P \oplus_g Q = R = \operatorname{Exp}^g_{I_n}\big(T^g_{P \rightarrow I_n}\big(\operatorname{Log}^g_P(Q)\big)\big). \quad (6)$$
Combining Eqs. (5) and (6), we get
$$\big\langle \operatorname{Log}^g_P(Q), W \big\rangle^g_P = \big\langle \ominus_g P \oplus_g Q,\ \operatorname{Exp}^g_{I_n}\big(T^g_{P \rightarrow I_n}(W)\big) \big\rangle_g,$$
which concludes the proof of Proposition 3.2.

J PROOF OF PROPOSITION 3.4

Proof. The first part of Proposition 3.4 can be easily verified using the definition of the SPD inner product (see Definition G.4) and that of Affine-Invariant metrics (Pennec et al., 2020) (see Chapter 3).

To prove the second part of Proposition 3.4, we use the notion of SPD pseudo-gyrodistance (Nguyen & Yang, 2023) in our interpretation of FC layers on SPD manifolds, i.e., the signed distance is replaced with the signed SPD pseudo-gyrodistance in the interpretation given in Section 3.2.1. First, we need the following result from Nguyen & Yang (2023).

Theorem J.1 (The SPD Pseudo-gyrodistance from an SPD Matrix to an SPD Hypergyroplane in an AI Gyrovector Space (Nguyen & Yang, 2023)). Let $H_{W,P}$ be an SPD hypergyroplane in a gyrovector space $(\operatorname{Sym}^+_n, \oplus_{ai}, \otimes_{ai})$, and $X \in \operatorname{Sym}^+_n$. Then the SPD pseudo-gyrodistance from $X$ to $H_{W,P}$ is given by
$$d(X, H_{W,P}) = \frac{\big| \big\langle \log\big(P^{-\frac{1}{2}} X P^{-\frac{1}{2}}\big),\ P^{-\frac{1}{2}} W P^{-\frac{1}{2}} \big\rangle_F \big|}{\big\| P^{-\frac{1}{2}} W P^{-\frac{1}{2}} \big\|_F}.$$
By Theorem J.1, the signed SPD pseudo-gyrodistance from $Y$ to an SPD hypergyroplane that contains the origin and is orthogonal to the $E^{ai}_{(i,j)}$ axis is given by
$$d\big(Y, H_{\operatorname{Log}^{ai}_{I_m}(E^{ai}_{(i,j)}), I_m}\big) = \frac{\big\langle \log(Y),\ \operatorname{Log}^{ai}_{I_m}\big(E^{ai}_{(i,j)}\big) \big\rangle_F}{\big\| \operatorname{Log}^{ai}_{I_m}\big(E^{ai}_{(i,j)}\big) \big\|_F}.$$
According to our interpretation of FC layers,
$$v_{(i,j)}(X) = \frac{\big\langle \log(Y),\ \operatorname{Log}^{ai}_{I_m}\big(E^{ai}_{(i,j)}\big) \big\rangle_F}{\big\| \operatorname{Log}^{ai}_{I_m}\big(E^{ai}_{(i,j)}\big) \big\|_F}.$$
We consider two cases.

Case 1: $i < j$.
$$v_{(i,j)}(X) = \frac{\big\langle \log(Y),\ \tfrac{1}{\sqrt{2}}\big(e_i e_j^T + e_j e_i^T\big) \big\rangle_F}{\big\| \tfrac{1}{\sqrt{2}}\big(e_i e_j^T + e_j e_i^T\big) \big\|_F} = \Big\langle \log(Y),\ \tfrac{1}{\sqrt{2}}\big(e_i e_j^T + e_j e_i^T\big) \Big\rangle_F = \frac{\log(Y)_{(i,j)} + \log(Y)_{(j,i)}}{\sqrt{2}} = \sqrt{2}\, \log(Y)_{(i,j)}.$$
We thus deduce that $\log(Y)_{(i,j)} = \frac{1}{\sqrt{2}} v_{(i,j)}(X)$.

Case 2: $i = j$.
$$v_{(i,i)}(X) = \frac{\Big\langle \log(Y),\ e_i e_i^T - \frac{1}{m}\Big(1 - \frac{1}{1 + m\beta}\Big) I_m \Big\rangle_F}{\Big\| e_i e_i^T - \frac{1}{m}\Big(1 - \frac{1}{1 + m\beta}\Big) I_m \Big\|_F} = \Big\langle \log(Y),\ e_i e_i^T - \frac{1}{m}\Big(1 - \frac{1}{1 + m\beta}\Big) I_m \Big\rangle_F.$$
This leads to
$$v_{(i,i)}(X) = \log(Y)_{(i,i)} - \frac{1}{m}\Big(1 - \frac{1}{1 + m\beta}\Big) \sum_{j=1}^m \log(Y)_{(j,j)}, \quad (7)$$
for $i = 1, \ldots, m$. By summing up $v_{(i,i)}(X)$, $i = 1, \ldots, m$, we get
$$\sum_{i=1}^m v_{(i,i)}(X) = \frac{1}{1 + m\beta} \sum_{i=1}^m \log(Y)_{(i,i)},$$
or equivalently,
$$\sum_{i=1}^m \log(Y)_{(i,i)} = (1 + m\beta) \sum_{i=1}^m v_{(i,i)}(X). \quad (8)$$
Replacing the term $\sum_{j=1}^m \log(Y)_{(j,j)}$ in Eq. (7) with the expression on the right-hand side of Eq. (8) results in
$$\log(Y)_{(i,i)} = v_{(i,i)}(X) + \beta \sum_{j=1}^m v_{(j,j)}(X).$$
Note that $Y = \exp\big([\log(Y)_{(i,j)}]_{i,j=1}^m\big)$. This concludes the proof of Proposition 3.4.

K PROOF OF PROPOSITION 3.5

Proof. This proposition is a direct consequence of Proposition 3.4 for $\beta = 0$.
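The closed form derived in the proof of Proposition 3.4 can be evaluated directly: given the values $v_{(i,j)}(X)$ produced by the FC layer, $\log(Y)$ is assembled entry-wise and exponentiated. The sketch below is our own illustration of this reconstruction (with $\beta = 0$ recovering Proposition 3.5) and assumes the values are stored in the upper triangle of a square tensor; it is not the paper's implementation.

```python
import torch

def spd_fc_output_ai(v: torch.Tensor, beta: float = 0.0) -> torch.Tensor:
    """Assemble Y from the FC-layer values v_{(i,j)}(X) under the AI metric (Proposition 3.4).

    v: (m, m) tensor holding v_{(i,j)}(X) for i <= j (the strictly lower part is ignored).
    """
    diag_v = torch.diagonal(v)
    # Off-diagonal entries: log(Y)_{(i,j)} = v_{(i,j)}(X) / sqrt(2), symmetrized.
    upper = torch.triu(v, diagonal=1) / (2.0 ** 0.5)
    log_Y = upper + upper.T
    # Diagonal entries: log(Y)_{(i,i)} = v_{(i,i)}(X) + beta * sum_j v_{(j,j)}(X).
    log_Y = log_Y + torch.diag(diag_v + beta * diag_v.sum())
    return torch.matrix_exp(log_Y)
```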
L PROOF OF PROPOSITION 3.6

Proof. The first part of Proposition 3.6 can be easily verified using the definition of the SPD inner product (see Definition G.4) and that of Log-Cholesky metrics (Lin, 2019).

To prove the second part of Proposition 3.6, we first recall the following result from Nguyen & Yang (2023).

Theorem L.1 (The SPD Gyrodistance from an SPD Matrix to an SPD Hypergyroplane in an LC Gyrovector Space (Nguyen & Yang, 2023)). Let $H_{W,P}$ be an SPD hypergyroplane in a gyrovector space $(\operatorname{Sym}^+_n, \oplus_{lc}, \otimes_{lc})$, and $X \in \operatorname{Sym}^+_n$. Then the SPD pseudo-gyrodistance from $X$ to $H_{W,P}$ is equal to the SPD gyrodistance from $X$ to $H_{W,P}$ and is given by
$$d(X, H_{W,P}) = \frac{|\langle A, B \rangle_F|}{\| B \|_F},$$
where
$$A = -\lfloor \phi(P) \rfloor + \lfloor \phi(X) \rfloor + \log\big(D(\phi(P))^{-1} D(\phi(X))\big), \quad B = \lfloor \widetilde{W} \rfloor + D(\phi(P))^{-1} D(\widetilde{W}), \quad \widetilde{W} = \phi(P)\, \big\lfloor \phi(P)^{-1} W \big(\phi(P)^{-1}\big)^T \big\rfloor,$$
where $\lfloor Y \rfloor$ and $D(Y)$, $Y \in M_{n,n}$, are defined in Section G.4, and $\phi(P) = \mathcal{L}(P)$.

By Theorem L.1, the signed SPD pseudo-gyrodistance from $Y$ to an SPD hypergyroplane that contains the origin and is orthogonal to the $E^{lc}_{(i,j)}$ axis is given by
$$d\big(Y, H_{\operatorname{Log}^{lc}_{I_m}(E^{lc}_{(i,j)}), I_m}\big) = \frac{\big\langle \lfloor \phi(Y) \rfloor + \log(D(\phi(Y))),\ \operatorname{Log}^{lc}_{I_m}\big(E^{lc}_{(i,j)}\big) \big\rangle_F}{\big\| \operatorname{Log}^{lc}_{I_m}\big(E^{lc}_{(i,j)}\big) \big\|_F}.$$
According to our interpretation of FC layers,
$$v_{(i,j)}(X) = \frac{\big\langle \lfloor \phi(Y) \rfloor + \log(D(\phi(Y))),\ \operatorname{Log}^{lc}_{I_m}\big(E^{lc}_{(i,j)}\big) \big\rangle_F}{\big\| \operatorname{Log}^{lc}_{I_m}\big(E^{lc}_{(i,j)}\big) \big\|_F}.$$
We consider two cases.

Case 1: $i < j$.
$$v_{(i,j)}(X) = \frac{\big\langle \lfloor \phi(Y) \rfloor + \log(D(\phi(Y))),\ e_j e_i^T \big\rangle_F}{\| e_j e_i^T \|_F} = \big\langle \lfloor \phi(Y) \rfloor + \log(D(\phi(Y))),\ e_j e_i^T \big\rangle_F = \phi(Y)_{(j,i)}.$$
We thus have $\phi(Y)_{(j,i)} = v_{(i,j)}(X)$.

Case 2: $i = j$.
$$v_{(i,i)}(X) = \frac{\big\langle \lfloor \phi(Y) \rfloor + \log(D(\phi(Y))),\ e_i e_i^T \big\rangle_F}{\| e_i e_i^T \|_F} = \big\langle \lfloor \phi(Y) \rfloor + \log(D(\phi(Y))),\ e_i e_i^T \big\rangle_F = \log\big(\phi(Y)_{(i,i)}\big).$$
Hence $\phi(Y)_{(i,i)} = \exp\big(v_{(i,i)}(X)\big)$. Setting $\phi(Y) = [y_{(i,j)}]_{i,j=1}^m$, the entries $y_{(i,j)}$ are given by
$$y_{(i,j)} = \begin{cases} \exp\big(v_{(i,i)}(X)\big), & \text{if } i = j, \\ v_{(j,i)}(X), & \text{if } i > j, \\ 0, & \text{if } i < j. \end{cases}$$
Since $\phi(Y)$ is the Cholesky factor of $Y$, we have $Y = \phi(Y)\phi(Y)^T$, which concludes the proof of Proposition 3.6.

M PROOF OF THEOREM 3.11

Proof. Let $H^{psd,g}_{W,P}$ be a hypergyroplane in structure space $\widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$ and $X \in \widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$. By the definition of the pseudo-gyrodistance function,
$$d\big(X, H^{psd,g}_{W,P}\big) = \sin(\angle X P \overline{Q})\, d(X, P),$$
where $\overline{Q}$ is given by
$$\overline{Q} = \operatorname*{arg\,max}_{Q \in H^{psd,g}_{W,P} \setminus \{P\}} \cos(\angle QPX) = \operatorname*{arg\,max}_{Q \in H^{psd,g}_{W,P} \setminus \{P\}} \frac{\big\langle \ominus_{psd,g} P \oplus_{psd,g} Q,\ \ominus_{psd,g} P \oplus_{psd,g} X \big\rangle_{psd,g}}{\big\| \ominus_{psd,g} P \oplus_{psd,g} Q \big\|_{psd,g}\, \big\| \ominus_{psd,g} P \oplus_{psd,g} X \big\|_{psd,g}}.$$
By the definitions of the binary and inverse operations in structure spaces,
$$\ominus_{psd,g} P \oplus_{psd,g} X = \big(\widetilde{\ominus}_{gr} U_P \,\widetilde{\oplus}_{gr}\, U_X,\ \ominus_g S_P \oplus_g S_X\big), \quad \ominus_{psd,g} P \oplus_{psd,g} Q = \big(\widetilde{\ominus}_{gr} U_P \,\widetilde{\oplus}_{gr}\, U_Q,\ \ominus_g S_P \oplus_g S_Q\big).$$
Therefore,
$$\big\langle \ominus_{psd,g} P \oplus_{psd,g} X,\ \ominus_{psd,g} P \oplus_{psd,g} Q \big\rangle_{psd,g} = \lambda \big\langle (\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_X)(\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_X)^T,\ (\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_Q)(\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_Q)^T \big\rangle_{gr} + \big\langle \ominus_g S_P \oplus_g S_X,\ \ominus_g S_P \oplus_g S_Q \big\rangle_g.$$
Let
$$A_1 = \operatorname{Log}^{gr}_{I_{n,p}}\big((\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_X)(\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_X)^T\big), \quad B_1 = \operatorname{Log}^{gr}_{I_{n,p}}\big((\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_Q)(\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_Q)^T\big),$$
$$A_2 = \operatorname{Log}^g_{I_p}\big(\ominus_g S_P \oplus_g S_X\big), \quad B_2 = \operatorname{Log}^g_{I_p}\big(\ominus_g S_P \oplus_g S_Q\big).$$
Then we have
$$\overline{Q} = \operatorname*{arg\,max}_{Q \in H^{psd,g}_{W,P} \setminus \{P\}} \frac{\lambda \langle A_1, B_1 \rangle_F + \langle A_2, B_2 \rangle_F}{\sqrt{\lambda \| A_1 \|_F^2 + \| A_2 \|_F^2}\ \sqrt{\lambda \| B_1 \|_F^2 + \| B_2 \|_F^2}} = \operatorname*{arg\,max}_{Q \in H^{psd,g}_{W,P} \setminus \{P\}} \frac{\big\langle [\sqrt{\lambda} A_1 \,\Vert\, A_2],\ [\sqrt{\lambda} B_1 \,\Vert\, B_2] \big\rangle_F}{\big\| [\sqrt{\lambda} A_1 \,\Vert\, A_2] \big\|_F\ \big\| [\sqrt{\lambda} B_1 \,\Vert\, B_2] \big\|_F}, \quad (9)$$
where $[\cdot \,\Vert\, \cdot]$ is the concatenation operation similar to operation $\operatorname{concat}_{spd}(\cdot)$.

From the equation of hypergyroplanes in structure space $\widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$,
$$\big\langle \ominus_{psd,g} P \oplus_{psd,g} Q,\ W \big\rangle_{psd,g} = 0.$$
Let $W = (U_W, S_W)$. Then we have
$$\lambda \big\langle (\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_Q)(\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_Q)^T,\ U_W (U_W)^T \big\rangle_{gr} + \big\langle \ominus_g S_P \oplus_g S_Q,\ S_W \big\rangle_g = 0. \quad (10)$$
Let $W_1 = \operatorname{Log}^{gr}_{I_{n,p}}\big(U_W (U_W)^T\big)$ and $W_2 = \operatorname{Log}^g_{I_p}(S_W)$. Then Eq. (10) can be rewritten as
$$\lambda \langle B_1, W_1 \rangle_F + \langle B_2, W_2 \rangle_F = 0,$$
which is equivalent to
$$\big\langle [\sqrt{\lambda} B_1 \,\Vert\, B_2],\ [\sqrt{\lambda} W_1 \,\Vert\, W_2] \big\rangle_F = 0. \quad (11)$$
Now, the problem in (9) is to find the minimum angle between the vector $[\sqrt{\lambda} A_1 \,\Vert\, A_2]$ and the Euclidean hyperplane described by Eq. (11). The pseudo-gyrodistance from $X$ to $H^{psd,g}_{W,P}$ can thus be obtained as
$$d\big(X, H^{psd,g}_{W,P}\big) = \frac{\big| \big\langle [\sqrt{\lambda} A_1 \,\Vert\, A_2],\ [\sqrt{\lambda} W_1 \,\Vert\, W_2] \big\rangle_F \big|}{\big\| [\sqrt{\lambda} W_1 \,\Vert\, W_2] \big\|_F} = \frac{\big| \lambda \langle A_1, W_1 \rangle_F + \langle A_2, W_2 \rangle_F \big|}{\sqrt{\lambda \| W_1 \|_F^2 + \| W_2 \|_F^2}}.$$
Some simple manipulations lead to
$$d\big(X, H^{psd,g}_{W,P}\big) = \frac{\big| \lambda \big\langle (\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_X)(\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_X)^T,\ U_W U_W^T \big\rangle_{gr} + \big\langle \ominus_g S_P \oplus_g S_X,\ S_W \big\rangle_g \big|}{\sqrt{\lambda \big(\| U_W U_W^T \|_{gr}\big)^2 + \big(\| S_W \|_g\big)^2}},$$
which concludes the proof of Theorem 3.11.
N PROOF OF PROPOSITION 3.12

Proof. We need the following result from Nguyen & Yang (2023).

Proposition N.1. Let $\mathcal{M}$ and $\mathcal{N}$ be two Riemannian manifolds and let $\varphi : \mathcal{M} \rightarrow \mathcal{N}$ be an isometry. Then
$$\operatorname{Log}_P(Q) = \big(D\varphi^{-1}_{\varphi(P)}\big)\big(\widetilde{\operatorname{Log}}_{\varphi(P)}(\varphi(Q))\big),$$
where $P, Q \in \mathcal{M}$, $D\tau_R(W)$ denotes the directional derivative of a mapping $\tau$ at a point $R \in \mathcal{N}$ along direction $W \in T_R\mathcal{N}$, and $\operatorname{Log}(\cdot)$ and $\widetilde{\operatorname{Log}}(\cdot)$ are the logarithmic maps in manifolds $\mathcal{M}$ and $\mathcal{N}$, respectively.

We adopt the notations in Bendokat et al. (2020). The Riemannian metric $g^O_Q(\cdot, \cdot)$ on $O_n$ is the standard inner product given (Edelman et al., 1998; Bendokat et al., 2020) as
$$g^O_Q(\Omega_1, \Omega_2) = \operatorname{Tr}\big(\Omega_1^T \Omega_2\big),$$
where $Q \in O_n$ and $\Omega_1, \Omega_2 \in T_Q O_n$.

Let $U \in \widetilde{\operatorname{Gr}}_{n,p}$ and $D_1, D_2 \in T_U \widetilde{\operatorname{Gr}}_{n,p}$. The canonical metric $g^{\widetilde{\operatorname{Gr}}}_U(D_1, D_2)$ on $\widetilde{\operatorname{Gr}}_{n,p}$ is the restriction of the Riemannian metric $g^O_Q(\cdot, \cdot)$ to the horizontal space of $T_Q O_n$ (multiplied by $1/2$) and is given (Edelman et al., 1998; Bendokat et al., 2020) by
$$g^{\widetilde{\operatorname{Gr}}}_U(D_1, D_2) = \operatorname{Tr}\Big(D_1^T \Big(I_n - \frac{1}{2} U U^T\Big) D_2\Big). \quad (12)$$

Let $P \in \operatorname{Gr}_{n,p}$ and $\Delta_1, \Delta_2 \in T_P \operatorname{Gr}_{n,p}$. The canonical metric $g^{\operatorname{Gr}}_P(\Delta_1, \Delta_2)$ on $\operatorname{Gr}_{n,p}$ is the restriction of the Riemannian metric $g^O_Q(\cdot, \cdot)$ to the horizontal space of $T_Q O_n$ (multiplied by $1/2$) and is given (Edelman et al., 1998; Bendokat et al., 2020) by
$$g^{\operatorname{Gr}}_P(\Delta_1, \Delta_2) = \frac{1}{2} \operatorname{Tr}\big((\Delta^{hor}_{1,Q})^T \Delta^{hor}_{2,Q}\big),$$
where $\Delta^{hor}_{1,Q}$ and $\Delta^{hor}_{2,Q}$ are the horizontal lifts of $\Delta_1$ and $\Delta_2$ to $Q$, respectively. Here, $Q$ is related to $P$ by $Q = (U\ \ U_\perp)$ and $P = UU^T$, where $U \in \widetilde{\operatorname{Gr}}_{n,p}$ and $U_\perp$ is the orthogonal completion of $U$.

Denote by $\operatorname{Hor}_U \widetilde{\operatorname{Gr}}_{n,p}$ the horizontal space of $T_U \widetilde{\operatorname{Gr}}_{n,p}$. This subspace is characterized by
$$\operatorname{Hor}_U \widetilde{\operatorname{Gr}}_{n,p} = \{U_\perp B \mid B \in M_{n-p,p}\}.$$
From Eq. (3.2) in Bendokat et al. (2020),
$$g^{\operatorname{Gr}}_P(\Delta_1, \Delta_2) = \operatorname{Tr}\big((\Delta^{hor}_{1,U})^T \Delta^{hor}_{2,U}\big), \quad (13)$$
where $\Delta^{hor}_{1,U}$ and $\Delta^{hor}_{2,U}$ are the horizontal lifts of $\Delta_1$ and $\Delta_2$ to $U$, respectively. Therefore, by Eq. (12),
$$
\begin{aligned}
g^{\widetilde{\operatorname{Gr}}}_U\big(\Delta^{hor}_{1,U}, \Delta^{hor}_{2,U}\big) &= \operatorname{Tr}\Big((\Delta^{hor}_{1,U})^T \Big(I_n - \frac{1}{2} U U^T\Big) \Delta^{hor}_{2,U}\Big) \\
&= \operatorname{Tr}\big((\Delta^{hor}_{1,U})^T \Delta^{hor}_{2,U}\big) - \frac{1}{2} \operatorname{Tr}\big((\Delta^{hor}_{1,U})^T U U^T \Delta^{hor}_{2,U}\big) \\
&= \operatorname{Tr}\big((\Delta^{hor}_{1,U})^T \Delta^{hor}_{2,U}\big) - \frac{1}{2} \operatorname{Tr}\big((U_\perp B_1)^T U U^T U_\perp B_2\big) \\
&= \operatorname{Tr}\big((\Delta^{hor}_{1,U})^T \Delta^{hor}_{2,U}\big) - \frac{1}{2} \operatorname{Tr}\big(B_1^T U_\perp^T U U^T U_\perp B_2\big) \\
&= \operatorname{Tr}\big((\Delta^{hor}_{1,U})^T \Delta^{hor}_{2,U}\big), \quad (14)
\end{aligned}
$$
where the last equality follows from the fact that $U_\perp^T U = 0$. Combining Eqs. (13) and (14), we get
$$g^{\operatorname{Gr}}_P(\Delta_1, \Delta_2) = g^{\widetilde{\operatorname{Gr}}}_U\big(\Delta^{hor}_{1,U}, \Delta^{hor}_{2,U}\big).$$
By Proposition N.1,
$$\operatorname{Log}^{gr}_P(F) = \big(D\tau_{\tau^{-1}(P)}\big)\big(\widetilde{\operatorname{Log}}^{gr}_{\tau^{-1}(P)}\big(\tau^{-1}(F)\big)\big),$$
where $P, F \in \operatorname{Gr}_{n,p}$. From Eq. (3.15) in Bendokat et al. (2020),
$$D\tau_R(W) = R W^T + W R^T.$$
Hence
$$\operatorname{Log}^{gr}_P(F) = \tau^{-1}(P)\, \big(\widetilde{\operatorname{Log}}^{gr}_{\tau^{-1}(P)}\big(\tau^{-1}(F)\big)\big)^T + \widetilde{\operatorname{Log}}^{gr}_{\tau^{-1}(P)}\big(\tau^{-1}(F)\big)\, \tau^{-1}(P)^T,$$
which concludes the proof of Proposition 3.12.
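Proposition 3.12 gives a direct, approximation-free route to the projector-perspective logarithmic map: extract ONB representatives, apply the ONB logarithmic map of Section E.2.2, and push the result through $D\tau$. The sketch below is our own illustration; in particular, the helper `onb_from_projector`, which recovers $\tau^{-1}(P)$ from the top-$p$ eigenvectors of $P$, is an assumption about one possible way to realize $\tau^{-1}$.

```python
import torch

def onb_from_projector(P: torch.Tensor, p: int) -> torch.Tensor:
    """tau^{-1}(P): an orthonormal basis of the rank-p projector P (its top-p eigenvectors)."""
    w, V = torch.linalg.eigh(P)                # eigenvalues in ascending order
    return V[:, -p:]

def grassmann_log_onb(U: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """ONB-perspective logarithmic map (Section E.2.2)."""
    n = U.shape[0]
    M = (torch.eye(n, dtype=U.dtype) - U @ U.T) @ V @ torch.linalg.inv(U.T @ V)
    Us, S, Vh = torch.linalg.svd(M, full_matrices=False)
    return Us @ torch.diag(torch.arctan(S)) @ Vh

def grassmann_log_projector(P: torch.Tensor, F: torch.Tensor, p: int) -> torch.Tensor:
    """Projector-perspective Log^gr_P(F) via Proposition 3.12."""
    U = onb_from_projector(P, p)               # tau^{-1}(P)
    V = onb_from_projector(F, p)               # tau^{-1}(F)
    D = grassmann_log_onb(U, V)                # ONB-perspective tangent vector at U
    return U @ D.T + D @ U.T                   # apply D tau: R W^T + W R^T
```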