MATRIX MANIFOLD NEURAL NETWORKS++

Xuan Son Nguyen, Shuo Yang, Aymeric Histace
ETIS, UMR 8051, CY Cergy Paris University, ENSEA, CNRS, France
{xuan-son.nguyen,shuo.yang,aymeric.histace}@ensea.fr

ABSTRACT

Deep neural networks (DNNs) on Riemannian manifolds have garnered increasing interest in various applied areas. For instance, DNNs on spherical and hyperbolic manifolds have been designed to solve a wide range of computer vision and natural language processing tasks. One of the key factors contributing to the success of these networks is that spherical and hyperbolic manifolds possess the rich algebraic structures of gyrogroups and gyrovector spaces. This enables principled and effective generalizations of the most successful DNNs to these manifolds. Recently, some works have shown that many concepts in the theory of gyrogroups and gyrovector spaces can also be generalized to matrix manifolds such as Symmetric Positive Definite (SPD) and Grassmann manifolds. As a result, some building blocks for SPD and Grassmann neural networks, e.g., isometric models and multinomial logistic regression (MLR), can be derived in a way that is fully analogous to their spherical and hyperbolic counterparts. Building upon these works, we design fully-connected (FC) and convolutional layers for SPD neural networks. We also develop MLR on Symmetric Positive Semi-definite (SPSD) manifolds, and propose a method for performing backpropagation with the Grassmann logarithmic map in the projector perspective. We demonstrate the effectiveness of the proposed approach in the human action recognition and node classification tasks.

1 INTRODUCTION

In recent years, deep neural networks on Riemannian manifolds have achieved impressive performance in many applications (Ganea et al., 2018; Skopek et al., 2020; Cruceru et al., 2021; Shimizu et al., 2021). The most popular neural networks in this family operate on hyperbolic spaces. Such spaces of constant sectional curvature, like spherical spaces, have the rich algebraic structure of gyrovector spaces. The theory of gyrovector spaces (Ungar, 2002; 2005; 2014) offers an elegant and powerful framework from which natural generalizations (Ganea et al., 2018; Shimizu et al., 2021) of essential building blocks in DNNs are constructed for hyperbolic neural networks (HNNs). Matrix manifolds such as SPD and Grassmann manifolds offer a convenient trade-off between structural richness and computational tractability (Cruceru et al., 2021; López et al., 2021). Therefore, in many applications, neural networks on matrix manifolds are attractive alternatives to their hyperbolic counterparts. However, unlike the approaches in Ganea et al. (2018); Shimizu et al. (2021), most existing approaches for building SPD and Grassmann neural networks (Dong et al., 2017; Huang & Gool, 2017; Huang et al., 2018; Nguyen et al., 2019; Brooks et al., 2019; Nguyen, 2021; Wang et al., 2021) do not provide the techniques and mathematical tools needed to generalize a broad class of DNNs to the considered manifolds. Recently, the authors of Kim (2020); Nguyen (2022b) have shown that SPD and Grassmann manifolds have the structure of gyrovector spaces or that of nonreductive gyrovector spaces (Nguyen, 2022b), which share remarkable analogies with gyrovector spaces.
The work in Nguyen & Yang (2023) takes one step forward in that direction by generalizing several notions in gyrovector spaces, e.g., the inner product and gyrodistance (Ungar, 2014), to SPD and Grassmann manifolds. This allows one to characterize certain gyroisometries of these manifolds and to construct MLR on SPD manifolds. Although some useful notions in gyrovector spaces have been generalized to SPD and Grassmann manifolds (Nguyen, 2022a;b; Nguyen & Yang, 2023), setting the stage for an effective way of building neural networks on these manifolds, many questions remain open. In this paper, we aim at addressing some limitations of existing works using a gyrovector space approach. Our contributions can be summarized as follows:

1. We generalize FC and convolutional layers to the SPD manifold setting.
2. We propose a method for performing backpropagation with the Grassmann logarithmic map in the projector perspective (Bendokat et al., 2020) without resorting to any approximation schemes. We then show how to construct graph convolutional networks (GCNs) on Grassmann manifolds.
3. We develop MLR on SPSD manifolds.
4. We showcase our approach in the human action recognition and node classification tasks.

2 PRELIMINARIES

2.1 SPD MANIFOLDS

The space of $n \times n$ SPD matrices, when provided with some geometric structures like a Riemannian metric, forms the SPD manifold $\mathrm{Sym}^+_n$ (Arsigny et al., 2005). Data lying on SPD manifolds are commonly encountered in various domains (Huang & Gool, 2017; Brooks et al., 2019; Nguyen, 2021; Sukthanker et al., 2021; Nguyen, 2022b;a; Nguyen & Yang, 2023). In many applications, the use of Euclidean calculus on SPD manifolds often leads to unsatisfactory results (Arsigny et al., 2005). To tackle this issue, many Riemannian structures for SPD manifolds have been introduced. In this work, we focus on two widely used Riemannian metrics, i.e., the Affine-Invariant (AI) (Pennec et al., 2004) and Log-Euclidean (LE) (Arsigny et al., 2005) metrics, and on the recently introduced Log-Cholesky (LC) metric (Lin, 2019), which offers some advantages over the Affine-Invariant and Log-Euclidean metrics.

2.2 GRASSMANN MANIFOLDS

Grassmann manifolds $\mathrm{Gr}_{n,p}$ are the collection of linear subspaces of fixed dimension $p$ of the Euclidean space $\mathbb{R}^n$ (Edelman et al., 1998). Data lying on Grassmann manifolds arise naturally in many applications (Absil et al., 2007; Bendokat et al., 2020). Points on Grassmann manifolds can be represented from different perspectives (Bendokat et al., 2020). Two typical approaches use projection matrices or matrices with orthonormal columns. Each of them can be effective in some problems but might be inappropriate in other contexts (Nguyen, 2022b). Although geometrical descriptions of Grassmann manifolds have been given in numerous works (Edelman et al., 1998), some computational issues remain to be addressed. For instance, the question of how to effectively perform backpropagation with the Grassmann logarithmic map in the projector perspective remains open.

2.3 NEURAL NETWORKS ON SPD AND GRASSMANN MANIFOLDS

2.3.1 NEURAL NETWORKS ON SPD MANIFOLDS

The work in Huang & Gool (2017) introduces SPDNet with three novel layers, i.e., BiMap, LogEig, and ReEig layers, and it has become one of the most successful architectures in the field. In Brooks et al. (2019), the authors further improve SPDNet by developing Riemannian versions of batch normalization layers.
Following these works, some works (Nguyen et al., 2019; Nguyen, 2021; Wang et al., 2021; Kobler et al., 2022; Ju & Guan, 2023) design variants of BiMap and batch normalization layers in SPD neural networks. The work in Chakraborty et al. (2020) presents a different approach based on intrinsic operations on SPD manifolds. Their proposed layers have nice theoretical properties. A common limitation of the above works is that they do not provide the mathematical tools needed for constructing many essential building blocks of DNNs on SPD manifolds. Recently, some works (Nguyen, 2022a;b; Nguyen & Yang, 2023) take a gyrovector space approach that enables natural generalizations of some building blocks of DNNs, e.g., MLR for SPD neural networks.

2.3.2 NEURAL NETWORKS ON GRASSMANN MANIFOLDS

In Huang et al. (2018), the authors propose GrNet, which exploits the same matrix backpropagation rules (Ionescu et al., 2015) as SPDNet. Some existing works (Wang & Wu, 2020; Souza et al., 2020) are also inspired by GrNet. Like their SPD counterparts, most existing Grassmann neural networks are not built upon a mathematical framework that allows one to generalize a broad class of DNNs to Grassmann manifolds. Using a gyrovector space approach, Nguyen & Yang (2023) has shown that some concepts in Euclidean spaces can be naturally extended to Grassmann manifolds.

3 PROPOSED APPROACH

3.1 NOTATION

Let $\mathcal{M}$ be a homogeneous Riemannian manifold and $T_P\mathcal{M}$ be the tangent space of $\mathcal{M}$ at $P \in \mathcal{M}$. Denote by $\exp(P)$ and $\log(P)$ the usual matrix exponential and logarithm of $P$, by $\operatorname{Exp}_P(W)$ the exponential map at $P$ that associates to a tangent vector $W \in T_P\mathcal{M}$ a point of $\mathcal{M}$, by $\operatorname{Log}_P(Q)$ the logarithmic map of $Q \in \mathcal{M}$ at $P$, and by $\mathcal{T}_{P \to Q}(W)$ the parallel transport of $W$ from $P$ to $Q$ along geodesics connecting $P$ and $Q$. For simplicity of exposition, we will concentrate on real matrices. Denote by $\mathcal{M}_{n,m}$ the space of $n \times m$ matrices, $\mathrm{Sym}^+_n$ the space of $n \times n$ SPD matrices, $\mathrm{Sym}_n$ the space of $n \times n$ symmetric matrices, $S^+_{n,p}$ the space of $n \times n$ SPSD matrices of rank $p \le n$, and $\mathrm{Gr}_{n,p}$ the manifold of $p$-dimensional subspaces of $\mathbb{R}^n$ in the projector perspective. For clarity of presentation, let $\widetilde{\mathrm{Gr}}_{n,p}$ be the manifold of $p$-dimensional subspaces of $\mathbb{R}^n$ in the ONB (orthonormal basis) perspective (Bendokat et al., 2020). For notations related to SPD manifolds, we use the letter $g \in \{ai, le, lc\}$ as a subscript (superscript) to indicate the considered Riemannian metric, unless otherwise stated. Other notations will be introduced in appropriate paragraphs. Our notations are summarized in Appendix A.

3.2 NEURAL NETWORKS ON SPD MANIFOLDS

In Nguyen (2022a;b), the author has shown that SPD manifolds with Affine-Invariant, Log-Euclidean, and Log-Cholesky metrics form gyrovector spaces referred to as AI, LE, and LC gyrovector spaces, respectively. We adopt the notations in these works and consider the case where $r = 1$ (see Nguyen (2022b), Definition 3.1). Let $\oplus_{ai}$, $\oplus_{le}$, and $\oplus_{lc}$ be the binary operations in AI, LE, and LC gyrovector spaces, respectively. Let $\ominus_{ai}$, $\ominus_{le}$, and $\ominus_{lc}$ be the inverse operations in AI, LE, and LC gyrovector spaces, respectively. These operations are given in Appendix G.

3.2.1 FC LAYERS IN SPD NEURAL NETWORKS

Our method for generalizing FC layers to the SPD manifold setting relies on a reformulation of SPD hypergyroplanes (Nguyen & Yang, 2023). We first recap the definition of SPD hypergyroplanes.

Definition 3.1 (SPD Hypergyroplanes (Nguyen & Yang, 2023)).
For $P \in \mathrm{Sym}^{+,g}_n$ and $W \in T_P\mathrm{Sym}^{+,g}_n$, SPD hypergyroplanes are defined as
$$\mathcal{H}^{spd,g}_{W,P} = \{Q \in \mathrm{Sym}^{+,g}_n : \langle \operatorname{Log}^g_P(Q), W \rangle^g_P = 0\},$$
where $\langle \cdot, \cdot \rangle^g_P$ denotes the inner product at $P$ given by the considered Riemannian metric.

Proposition 3.2 gives an equivalent definition for SPD hypergyroplanes.

Proposition 3.2. Let $P \in \mathrm{Sym}^{+,g}_n$, $W \in T_P\mathrm{Sym}^{+,g}_n$, and $\mathcal{H}^{spd,g}_{W,P}$ be the SPD hypergyroplanes defined in Definition 3.1. Then
$$\mathcal{H}^{spd,g}_{W,P} = \{Q \in \mathrm{Sym}^{+,g}_n : \langle \ominus_g P \oplus_g Q, \operatorname{Exp}^g_{I_n}(\mathcal{T}^g_{P \to I_n}(W)) \rangle^g = 0\},$$
where $I_n$ denotes the $n \times n$ identity matrix, and $\langle \cdot, \cdot \rangle^g$ is the SPD inner product in $\mathrm{Sym}^{+,g}_n$ (Nguyen & Yang, 2023) (see Appendix G.7 for the definition of the SPD inner product).

Proof. See Appendix I.

In DNNs, an FC layer linearly transforms the input in such a way that the $k$-th dimension of the output corresponds to the signed distance from the output to the hyperplane that contains the origin and is orthogonal to the $k$-th axis of the output space. This interpretation has proven useful in generalizing FC layers to the hyperbolic setting (Shimizu et al., 2021). Notice that the equation of SPD hypergyroplanes in Proposition 3.2 has the form $\langle \ominus_g P \oplus_g Q, W \rangle^g = 0$, where $W \in \mathrm{Sym}^{+,g}_n$. This equation can be seen as a generalization of the hyperplane equation $\langle w, x \rangle + b = \langle -p + x, w \rangle = 0$, where $w, x, p \in \mathbb{R}^n$, $b \in \mathbb{R}$, and $\langle p, w \rangle = -b$. Therefore, Proposition 3.2 suggests that any linear function of an SPD matrix $X \in \mathrm{Sym}^{+,g}_n$ can be written as $\langle \ominus_g P \oplus_g X, W \rangle^g$, where $P, W \in \mathrm{Sym}^{+,g}_n$. The above interpretation of FC layers can now be applied to our case for constructing FC layers in SPD neural networks. For convenience of presentation, in Definition 3.3, we will index the dimensions (axes) of the output space using two subscripts corresponding to the row and column indices in a matrix.

Definition 3.3. Let $E^g_{(i,j)}$, $i \le j$, $i,j = 1,\ldots,m$ be the $(i,j)$-th axis of the output space. An SPD hypergyroplane that contains the origin and is orthogonal to the $E^g_{(i,j)}$ axis can be defined as
$$\mathcal{H}^{spd,g}_{\operatorname{Log}^g_{I_m}(E^g_{(i,j)}),\, I_m} = \{Q \in \mathrm{Sym}^{+,g}_m : \langle Q, E^g_{(i,j)} \rangle^g = 0\}.$$

It remains to specify an orthonormal basis for each family of the considered Riemannian metrics of SPD manifolds. Proposition 3.4 gives such an orthonormal basis for AI gyrovector spaces along with the expression for the output of FC layers with Affine-Invariant metrics.

Proposition 3.4 (FC layers with Affine-Invariant Metrics). Let $(e_1,\ldots,e_m)$, $\|e_i\| = 1$, $i = 1,\ldots,m$, be an orthonormal basis of $\mathbb{R}^m$. Let $\langle \cdot,\cdot \rangle^{ai}_P$ be the Affine-Invariant metric computed at $P \in \mathrm{Sym}^{+,ai}_m$ as
$$\langle V, W \rangle^{ai}_P = \operatorname{Tr}(V P^{-1} W P^{-1}) + \beta \operatorname{Tr}(V P^{-1}) \operatorname{Tr}(W P^{-1}),$$
where $\beta > -\frac{1}{m}$. An orthonormal basis $E^{ai}_{(i,j)}$, $i \le j$, $i,j = 1,\ldots,m$ of $\mathrm{Sym}^{+,ai}_m$ can be given as
$$E^{ai}_{(i,j)} = \begin{cases} \exp\!\left(e_i e_j^T - \frac{1}{m}\Big(1 - \frac{1}{\sqrt{1+m\beta}}\Big) I_m\right), & \text{if } i = j, \\ \exp\!\left(\frac{e_i e_j^T + e_j e_i^T}{\sqrt{2}}\right), & \text{if } i < j. \end{cases}$$
Denote by $v_{(i,j)}(X) = \langle \ominus_{ai} P_{(i,j)} \oplus_{ai} X, W_{(i,j)} \rangle^{ai}$, where $P_{(i,j)}, W_{(i,j)} \in \mathrm{Sym}^{+,ai}_n$, $i \le j$, $i,j = 1,\ldots,m$. Let $\alpha = \frac{1}{m}\Big(\frac{1}{\sqrt{1 + m\beta}} - 1\Big)$. Then the output of an FC layer is computed as $Y = \exp\big([y_{(i,j)}]_{i,j=1}^m\big)$, where $[y_{(i,j)}]_{i,j=1}^m$ is the matrix having $y_{(i,j)}$ as the element at the $i$-th row and $j$-th column, and $y_{(i,j)}$ is given by
$$y_{(i,j)} = \begin{cases} v_{(i,j)}(X) + \alpha \sum_{k=1}^{m} v_{(k,k)}(X), & \text{if } i = j, \\ \frac{1}{\sqrt{2}}\, v_{(i,j)}(X), & \text{if } i < j, \\ \frac{1}{\sqrt{2}}\, v_{(j,i)}(X), & \text{if } i > j. \end{cases}$$

Proof. See Appendix J.
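To make the construction concrete, the following is a minimal PyTorch sketch of an FC layer with Affine-Invariant metrics following Proposition 3.4. It assumes the AI gyro-operations of Appendix G.2 (so that $\ominus_{ai} P \oplus_{ai} X = P^{-1/2} X P^{-1/2}$) and the SPD inner product of Appendix G.7; the helper names, the loop-based implementation, and the toy dimensions are ours and are only meant to illustrate the computation, not to reproduce the authors' implementation.

```python
import torch

def sym_apply(f, S):
    """Apply a scalar function to the eigenvalues of a symmetric matrix S."""
    vals, vecs = torch.linalg.eigh(S)
    return vecs @ torch.diag(f(vals)) @ vecs.T

def spd_log(S):        # matrix logarithm of an SPD matrix
    return sym_apply(torch.log, S)

def spd_exp(S):        # matrix exponential of a symmetric matrix (result is SPD)
    return sym_apply(torch.exp, S)

def spd_inv_sqrt(S):   # P^{-1/2} of an SPD matrix
    return sym_apply(lambda v: v.rsqrt(), S)

def ai_inner(A, B, beta):
    """SPD inner product <A, B>^ai = <log A, log B>_{I} under the AI metric at I."""
    la, lb = spd_log(A), spd_log(B)
    return torch.trace(la @ lb) + beta * torch.trace(la) * torch.trace(lb)

def ai_fc_layer(X, P, W, m, beta):
    """Sketch of an FC layer with Affine-Invariant metrics (Proposition 3.4).

    X: input SPD matrix (n x n).
    P, W: dicts indexed by (i, j), i <= j, holding SPD parameters P_(i,j), W_(i,j) (n x n).
    Returns an m x m SPD matrix Y.
    """
    alpha = ((1.0 + m * beta) ** -0.5 - 1.0) / m
    v = {}
    for (i, j), Pij in P.items():
        Pis = spd_inv_sqrt(Pij)
        A = Pis @ X @ Pis                       # (ominus_ai P) oplus_ai X = P^{-1/2} X P^{-1/2}
        v[(i, j)] = ai_inner(A, W[(i, j)], beta)
    diag_sum = sum(v[(k, k)] for k in range(m))
    y = torch.zeros(m, m, dtype=X.dtype)
    for i in range(m):
        for j in range(m):
            if i == j:
                y[i, j] = v[(i, i)] + alpha * diag_sum
            elif i < j:
                y[i, j] = v[(i, j)] / 2 ** 0.5
            else:
                y[i, j] = v[(j, i)] / 2 ** 0.5
    return spd_exp(y)                           # map the symmetric coefficient matrix to Sym^+_m

# toy usage with random SPD input and parameters
def rand_spd(n):
    A = torch.randn(n, n)
    return A @ A.T + n * torch.eye(n)

n, m, beta = 5, 3, 0.0
X = rand_spd(n)
P = {(i, j): rand_spd(n) for i in range(m) for j in range(i, m)}
W = {(i, j): rand_spd(n) for i in range(m) for j in range(i, m)}
Y = ai_fc_layer(X, P, W, m, beta)
print(Y.shape, torch.linalg.eigvalsh(Y).min() > 0)   # m x m, positive definite
```

Note that the output dimension $m$ is independent of the input dimension $n$, which is what allows the same construction to act as a dimensionality-reducing convolution in Section 3.2.2.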
As shown in Arsigny et al. (2005), a Log-Euclidean metric on $\mathrm{Sym}^{+,le}_n$ can be obtained from any inner product on $\mathrm{Sym}_n$. In this work, we consider a metric that is invariant under all similarity transformations, i.e., the metric $\langle W, V \rangle^{le}_{I_n} = \operatorname{Tr}(WV)$. We have the following result.

Proposition 3.5 (FC layers with Log-Euclidean Metrics). An orthonormal basis $E^{le}_{(i,j)}$, $i \le j$, $i,j = 1,\ldots,m$ of $\mathrm{Sym}^{+,le}_m$ can be given by
$$E^{le}_{(i,j)} = \begin{cases} \exp\big(e_i e_j^T\big), & \text{if } i = j, \\ \exp\!\left(\frac{e_i e_j^T + e_j e_i^T}{\sqrt{2}}\right), & \text{if } i < j. \end{cases}$$
Let $v_{(i,j)}(X) = \langle \ominus_{le} P_{(i,j)} \oplus_{le} X, W_{(i,j)} \rangle^{le}$, where $P_{(i,j)}, W_{(i,j)} \in \mathrm{Sym}^{+,le}_n$, $i \le j$, $i,j = 1,\ldots,m$. Then the output of an FC layer is computed as $Y = \exp\big([y_{(i,j)}]_{i,j=1}^m\big)$, where $y_{(i,j)}$ is given by
$$y_{(i,j)} = \begin{cases} v_{(i,j)}(X), & \text{if } i = j, \\ \frac{1}{\sqrt{2}}\, v_{(i,j)}(X), & \text{if } i < j, \\ \frac{1}{\sqrt{2}}\, v_{(j,i)}(X), & \text{if } i > j. \end{cases}$$

Proof. See Appendix K.

Finally, we give the characterization of an orthonormal basis for LC gyrovector spaces and the expression for the output of FC layers with Log-Cholesky metrics.

Proposition 3.6 (FC layers with Log-Cholesky Metrics). An orthonormal basis $E^{lc}_{(i,j)}$, $i \le j$, $i,j = 1,\ldots,m$ of $\mathrm{Sym}^{+,lc}_m$ can be given by
$$E^{lc}_{(i,j)} = \begin{cases} (e - 1)\, e_i e_j^T + I_m, & \text{if } i = j, \\ \big(e_j e_i^T + I_m\big)\big(e_i e_j^T + I_m\big), & \text{if } i < j. \end{cases}$$
Let $v_{(i,j)}(X) = \langle \ominus_{lc} P_{(i,j)} \oplus_{lc} X, W_{(i,j)} \rangle^{lc}$, where $P_{(i,j)}, W_{(i,j)} \in \mathrm{Sym}^{+,lc}_n$, $i \le j$, $i,j = 1,\ldots,m$. Then the output of an FC layer is computed as $Y = \widetilde{Y}\widetilde{Y}^T$, where $\widetilde{Y} = [y_{(i,j)}]_{i,j=1}^m$, and $y_{(i,j)}$ is given by
$$y_{(i,j)} = \begin{cases} \exp\big(v_{(i,j)}(X)\big), & \text{if } i = j, \\ v_{(i,j)}(X), & \text{if } i < j, \\ 0, & \text{if } i > j. \end{cases}$$

Proof. See Appendix L.

3.2.2 CONVOLUTIONAL LAYERS IN SPD NEURAL NETWORKS

Consider applying a 2D convolutional layer to a multi-channel image. Let $N_{in}$ and $N_{out}$ be the numbers of input and output channels, respectively. Denote by $y^k_{(i,j)}$, $i = 1,\ldots,N_{row}$, $j = 1,\ldots,N_{col}$, $k = 1,\ldots,N_{out}$ the value of the $k$-th output channel at pixel $(i,j)$. Then
$$y^k_{(i,j)} = \sum_{l=1}^{N_{in}} \langle w^{(l,k)}, x^l_{(i,j)} \rangle + b^k, \tag{1}$$
where $x^l_{(i,j)}$ is a receptive field of the $l$-th input channel, $w^{(l,k)}$ is the filter associated with the $l$-th input channel and the $k$-th output channel, and $b^k$ is the bias for the $k$-th output channel. Let $X_{(i,j)} = \operatorname{concat}(x^1_{(i,j)},\ldots,x^{N_{in}}_{(i,j)})$ and $W^k = \operatorname{concat}(w^{(1,k)},\ldots,w^{(N_{in},k)})$, where the operation $\operatorname{concat}(\cdot)$ concatenates all of its arguments. Then Eq. (1) can be rewritten (Shimizu et al., 2021) as
$$y^k_{(i,j)} = \langle W^k, X_{(i,j)} \rangle + b^k. \tag{2}$$
Note that Eq. (2) has the form $\langle w, x \rangle + b$ and thus the computations discussed in Section 3.2.1 can be applied to implement convolutional layers in SPD neural networks. Specifically, given a set of SPD matrices $P_i \in \mathrm{Sym}^{+,g}_n$, $i = 1,\ldots,N$, the operation $\operatorname{concat}_{spd}(P_1,\ldots,P_N)$ produces a block diagonal matrix having $P_i$ as diagonal blocks. In Chakraborty et al. (2020), the authors design a convolution operation for SPD neural networks. However, their method is based on the concept of weighted Fréchet mean, while ours is built upon the concepts of SPD hypergyroplane and SPD pseudo-gyrodistance from an SPD matrix to an SPD hypergyroplane (Nguyen & Yang, 2023). Also, our convolution operation can be used for dimensionality reduction, while theirs always produces an output of the same dimension as the inputs.

3.3 MLR IN STRUCTURE SPACES

Motivated by the works in Nguyen (2022a); Nguyen & Yang (2023), in this section, we aim to build MLR on SPSD manifolds. For any $P \in S^+_{n,p}$, we consider the decomposition $P = U_P S_P U_P^T$, where $U_P \in \widetilde{\mathrm{Gr}}_{n,p}$ and $S_P \in \mathrm{Sym}^+_p$. Each element of $S^+_{n,p}$ can be seen as a flat $p$-dimensional ellipsoid in $\mathbb{R}^n$ (Bonnabel et al., 2013). The flat ellipsoid belongs to a $p$-dimensional subspace spanned by the columns of $U_P$, while the $p \times p$ SPD matrix $S_P$ defines the shape of the ellipsoid in $\mathrm{Sym}^+_p$. A canonical representation of $P$ in structure space $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^+_p$ is computed by identifying a common subspace and then rotating $U_P$ to this subspace. The SPD matrix $S_P$ is rotated accordingly to reflect the changes of $U_P$. Details of these computations are given in Appendix H.
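As an illustration, a rank-$p$ SPSD matrix can be factorized in this form with a truncated SVD, as in lines 2 and 3 of Algorithm 1 (Appendix B). The sketch below is ours; it does not include the canonicalization to a common subspace described in Appendix H, and the helper name spsd_decompose is hypothetical.

```python
import torch

def spsd_decompose(P, p):
    """Factor a rank-p SPSD matrix P (n x n) as P = U S U^T, with U in the ONB
    Grassmannian Gr~_{n,p} and S in Sym^+_p (truncated SVD, as in Algorithm 1)."""
    U_full, _, _ = torch.linalg.svd(P)   # P symmetric PSD: left singular vectors = eigenvectors
    U = U_full[:, :p]                    # orthonormal basis of the dominant p-dim subspace
    S = U.T @ P @ U                      # p x p SPD block expressed in that basis
    return U, S

# toy usage: build a rank-p SPSD matrix and check the reconstruction
n, p = 6, 3
A = torch.randn(n, p)
P = A @ A.T                                                # rank-p SPSD
U, S = spsd_decompose(P, p)
print(torch.allclose(U @ S @ U.T, P, atol=1e-4))           # True: P = U S U^T
print(torch.allclose(U.T @ U, torch.eye(p), atol=1e-5))    # U has orthonormal columns
```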
Assuming that a canonical representation in structure space $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^+_p$ is obtained for each point in $S^+_{n,p}$, we now discuss how to build MLR in this space. As one of the first steps for developing network building blocks in a gyrovector space approach is to construct some basic operations in the considered manifold, we give the definitions of the binary and inverse operations in the following.

Definition 3.7 (The Binary Operation in Structure Spaces). Let $(U_P, S_P), (U_Q, S_Q) \in \widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$. Then the binary operation $\oplus_{psd,g}$ in structure space $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$ is defined as
$$(U_P, S_P) \oplus_{psd,g} (U_Q, S_Q) = \big(U_P \,\widetilde{\oplus}_{gr}\, U_Q,\; S_P \oplus_g S_Q\big),$$
where $\widetilde{\oplus}_{gr}$ is the binary operation in $\widetilde{\mathrm{Gr}}_{n,p}$ (see Appendix G.6 for the definition of $\widetilde{\oplus}_{gr}$).

Definition 3.8 (The Inverse Operation in Structure Spaces). Let $(U_P, S_P) \in \widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$. Then the inverse operation $\ominus_{psd,g}$ in structure space $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$ is defined as
$$\ominus_{psd,g}(U_P, S_P) = \big(\widetilde{\ominus}_{gr} U_P,\; \ominus_g S_P\big),$$
where $\widetilde{\ominus}_{gr}$ is the inverse operation in $\widetilde{\mathrm{Gr}}_{n,p}$ (see Appendix G.6 for the definition of $\widetilde{\ominus}_{gr}$).

Our construction of the binary and inverse operations in $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$ is clearly advantageous compared to the method in Nguyen (2022a), since the latter does not preserve the information about the subspaces of the terms involved in these operations. In addition to the binary and inverse operations, we also need to define the inner product in structure spaces.

Definition 3.9 (The Inner Product in Structure Spaces). Let $(U_P, S_P), (U_Q, S_Q) \in \widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$. Then the inner product in structure space $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$ is defined as
$$\big\langle (U_P, S_P), (U_Q, S_Q) \big\rangle^{psd,g} = \lambda \big\langle U_P U_P^T, U_Q U_Q^T \big\rangle^{gr} + \big\langle S_P, S_Q \big\rangle^{g},$$
where $\lambda > 0$ and $\langle \cdot,\cdot \rangle^{gr}$ is the Grassmann inner product (Nguyen & Yang, 2023) (see Appendix G.7).

The key idea to generalize MLR to a Riemannian manifold is to change the margin to reflect the geometry of the considered manifold (a formulation of MLR from the perspective of distances to hyperplanes is given in Appendix C). This requires the notions of hyperplanes and margin in the considered manifold, which are referred to as hypergyroplanes and pseudo-gyrodistances (Nguyen & Yang, 2023), respectively. In our case, the definition of hypergyroplanes in structure spaces, suggested by Proposition 3.2, is given below.

Definition 3.10 (Hypergyroplanes in Structure Spaces). Let $P, W \in \widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$. Then hypergyroplanes in structure space $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$ are defined as
$$\mathcal{H}^{psd,g}_{W,P} = \{Q \in \widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p : \langle \ominus_{psd,g} P \oplus_{psd,g} Q, W \rangle^{psd,g} = 0\}.$$

Pseudo-gyrodistances in structure spaces can be defined in the same way as SPD pseudo-gyrodistances. We refer the reader to Appendix G.8 for all related notions. Theorem 3.11 gives an expression for the pseudo-gyrodistance from a point to a hypergyroplane in a structure space.

Theorem 3.11 (Pseudo-gyrodistances in Structure Spaces). Let $W = (U_W, S_W)$, $P = (U_P, S_P)$, $X = (U_X, S_X) \in \widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$, and $\mathcal{H}^{psd,g}_{W,P}$ be a hypergyroplane in structure space $\widetilde{\mathrm{Gr}}_{n,p} \times \mathrm{Sym}^{+,g}_p$. Then the pseudo-gyrodistance from $X$ to $\mathcal{H}^{psd,g}_{W,P}$ is given by
$$d\big(X, \mathcal{H}^{psd,g}_{W,P}\big) = \frac{\Big| \lambda \big\langle (\widetilde{\ominus}_{gr} U_P \,\widetilde{\oplus}_{gr}\, U_X)(\widetilde{\ominus}_{gr} U_P \,\widetilde{\oplus}_{gr}\, U_X)^T,\; U_W U_W^T \big\rangle^{gr} + \big\langle \ominus_g S_P \oplus_g S_X,\; S_W \big\rangle^{g} \Big|}{\sqrt{\lambda \big(\| U_W U_W^T \|^{gr}\big)^2 + \big(\| S_W \|^{g}\big)^2}},$$
where $\|\cdot\|^{gr}$ and $\|\cdot\|^{g}$ are the norms induced by the Grassmann and SPD inner products, respectively.

Proof. See Appendix M.

The algorithm for computing the pseudo-gyrodistances is given in Appendix B.
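For concreteness, the SPD-factor terms appearing in Theorem 3.11 admit a simple closed form when $g = le$. Below is a minimal PyTorch sketch using the Log-Euclidean operations of Appendix G.3 and the SPD inner product of Appendix G.7; the helper names and the toy data are ours, and the Grassmann-factor terms, which additionally require the operations of Appendices G.5 and G.6, are omitted here.

```python
import torch

def spd_log(S):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    vals, vecs = torch.linalg.eigh(S)
    return vecs @ torch.diag(torch.log(vals)) @ vecs.T

def le_gyro_difference(S_P, S_X):
    """(ominus_le S_P) oplus_le S_X = exp(log S_X - log S_P) (Appendix G.3)."""
    return torch.matrix_exp(spd_log(S_X) - spd_log(S_P))

def le_inner(A, B):
    """SPD inner product <A, B>^le = <Log_I(A), Log_I(B)>_I = Tr(log A log B)."""
    return torch.trace(spd_log(A) @ spd_log(B))

def le_norm(A):
    return torch.sqrt(le_inner(A, A))

# SPD-factor terms of the pseudo-gyrodistance in Theorem 3.11 (g = le):
#   numerator contribution  <(ominus_le S_P) oplus_le S_X, S_W>^le
#   denominator contribution (||S_W||^le)^2
def rand_spd(p):
    A = torch.randn(p, p)
    return A @ A.T + p * torch.eye(p)

S_P, S_X, S_W = rand_spd(4), rand_spd(4), rand_spd(4)
num_spd = le_inner(le_gyro_difference(S_P, S_X), S_W)
den_spd = le_norm(S_W) ** 2
print(float(num_spd), float(den_spd))
```

The same two quantities are what Algorithm 1 (line 10) evaluates for each class, together with the analogous Grassmann-factor terms.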
3.4 NEURAL NETWORKS ON GRASSMANN MANIFOLDS

In this section, we present a method for computing the Grassmann logarithmic map in the projector perspective. We then propose GCNs on Grassmann manifolds.

3.4.1 GRASSMANN LOGARITHMIC MAP IN THE PROJECTOR PERSPECTIVE

The Grassmann logarithmic map is given (Batzies et al., 2015; Bendokat et al., 2020) by
$$\operatorname{Log}^{gr}_P(Q) = [\Omega, P], \quad \text{where } P, Q \in \mathrm{Gr}_{n,p} \text{ and } \Omega = \tfrac{1}{2}\log\big((I_n - 2Q)(I_n - 2P)\big).$$
Notice that the matrix $(I_n - 2Q)(I_n - 2P)$ is generally not an SPD matrix. This raises an issue when one needs to implement an operation that requires the Grassmann logarithmic map in the projector perspective using popular deep learning frameworks like PyTorch and TensorFlow, since the general matrix logarithm is not available as a differentiable operation in these frameworks. To deal with this issue, we rely on the following result, which allows us to compute the Grassmann logarithmic map in the projector perspective from the Grassmann logarithmic map in the ONB perspective.

Proposition 3.12. Let $\tau$ be the mapping $\tau : \widetilde{\mathrm{Gr}}_{n,p} \to \mathrm{Gr}_{n,p}$, $U \mapsto UU^T$. Let $\widetilde{\operatorname{Log}}{}^{gr}_U(V)$, $U, V \in \widetilde{\mathrm{Gr}}_{n,p}$, be the logarithmic map of $V$ at $U$ in the ONB perspective. Then
$$\operatorname{Log}^{gr}_P(Q) = \tau^{-1}(P)\,\big(\widetilde{\operatorname{Log}}{}^{gr}_{\tau^{-1}(P)}(\tau^{-1}(Q))\big)^T + \widetilde{\operatorname{Log}}{}^{gr}_{\tau^{-1}(P)}(\tau^{-1}(Q))\,\tau^{-1}(P)^T.$$

Proof. See Appendix N.

Note that the Grassmann logarithmic map $\widetilde{\operatorname{Log}}{}^{gr}_U(V)$ can be computed via singular value decomposition (SVD), which is a differentiable operation in PyTorch and TensorFlow (see Appendix E.2.2). Therefore, Proposition 3.12 provides an effective implementation of the Grassmann logarithmic map in the projector perspective for gradient-based learning.

3.4.2 GRAPH CONVOLUTIONAL NETWORKS ON GRASSMANN MANIFOLDS

We propose to extend GCNs to Grassmann geometry using an approach similar to Chami et al. (2019); Zhao et al. (2023). Let $G = (V, E)$ be a graph with vertex set $V$ and edge set $E$, $x^l_i$, $i \in V$ be the embedding of node $i$ at layer $l$ ($l = 0$ indicates input node features), $N(i) = \{j : (i,j) \in E\}$ be the set of neighbors of $i \in V$, $W^l$ and $b^l$ be the weight and bias for layer $l$, and $\sigma(\cdot)$ be a non-linear activation function. A basic GCN message-passing update (Zhao et al., 2023) can be expressed as
$$p^l_i = W^l x^{l-1}_i \quad \text{(feature transformation)}$$
$$q^l_i = \sum_{j \in N(i)} w_{ij}\, p^l_j \quad \text{(aggregation)}$$
$$x^l_i = \sigma\big(q^l_i + b^l\big) \quad \text{(bias and nonlinearity)}$$
For the aggregation operation, the weights $w_{ij}$ can be computed using different methods (Kipf & Welling, 2017; Hamilton et al., 2017).

Let $X^l_i \in \mathrm{Gr}_{n,p}$, $i \in V$ be the Grassmann embedding of node $i$ at layer $l$. For feature transformation on Grassmann manifolds, we use isometry maps based on left Grassmann gyrotranslations (Nguyen & Yang, 2023), i.e.,
$$\phi_M(X^l_i) = \exp\big([\operatorname{Log}^{gr}_{I_{n,p}}(M), I_{n,p}]\big)\, X^l_i\, \exp\big(-[\operatorname{Log}^{gr}_{I_{n,p}}(M), I_{n,p}]\big),$$
where $I_{n,p} = \begin{pmatrix} I_p & 0 \\ 0 & 0 \end{pmatrix} \in \mathcal{M}_{n,n}$, and $M \in \mathrm{Gr}_{n,p}$ is a model parameter. Let $\operatorname{Exp}^{gr}(\cdot)$ be the exponential map in $\mathrm{Gr}_{n,p}$. Then the aggregation process is performed as
$$Q^l_i = \operatorname{Exp}^{gr}_{I_{n,p}}\Big( \sum_{j \in N(i)} k_{i,j}\, \operatorname{Log}^{gr}_{I_{n,p}}(P^l_j) \Big),$$
where $P^l_i$ and $Q^l_i$ are the input and output node features of the aggregation operation, and $k_{i,j} = |N(i)|^{-\frac{1}{2}}$ represents the relative importance of node $j$ to node $i$.

Figure 1: The pipelines of GyroSpd++ (left) and Gr-GCN++ (right).
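The logarithmic maps at $I_{n,p}$ used above can be obtained from Proposition 3.12 together with the SVD-based ONB logarithmic map of Appendix E.2.2. The following is a minimal PyTorch sketch; the function names onb_log and proj_log are ours, and $\tau^{-1}$ is realized here by taking the top-$p$ eigenvectors of a projector, which is one admissible choice rather than the authors' reference implementation.

```python
import torch

def onb_log(U, V):
    """Grassmann logarithmic map in the ONB perspective (Edelman et al., 1998):
    Log~_U(V) = U_hat arctan(Sigma) V_hat^T, where
    (I - U U^T) V (U^T V)^{-1} = U_hat Sigma V_hat^T (thin SVD).
    U, V: n x p matrices with orthonormal columns."""
    n = U.shape[0]
    M = (torch.eye(n) - U @ U.T) @ V @ torch.linalg.inv(U.T @ V)
    U_hat, sigma, V_hat_t = torch.linalg.svd(M, full_matrices=False)
    return U_hat @ torch.diag(torch.atan(sigma)) @ V_hat_t

def proj_log(P, Q, p):
    """Grassmann logarithmic map in the projector perspective (Proposition 3.12):
    Log^gr_P(Q) = U D^T + D U^T, with U = tau^{-1}(P) and D = Log~_U(tau^{-1}(Q)).
    P, Q: n x n rank-p projection matrices."""
    U = torch.linalg.eigh(P)[1][:, -p:]   # tau^{-1}(P): orthonormal basis of range(P)
    V = torch.linalg.eigh(Q)[1][:, -p:]   # tau^{-1}(Q)
    D = onb_log(U, V)
    return U @ D.T + D @ U.T

# toy usage: the result is a symmetric tangent vector at P, computed only with
# SVD, eigh, and matrix products, so it is differentiable end to end in PyTorch.
n, p = 5, 2
U0, _ = torch.linalg.qr(torch.randn(n, p))
V0, _ = torch.linalg.qr(torch.randn(n, p))
P, Q = U0 @ U0.T, V0 @ V0.T
W = proj_log(P, Q, p)
print(torch.allclose(W, W.T, atol=1e-5))            # tangent vectors at P are symmetric
print(torch.allclose(P @ W + W @ P, W, atol=1e-4))  # and satisfy P W + W P = W
```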
For any $X \in \mathrm{Gr}_{n,p}$, let $\exp\big([\operatorname{Log}^{gr}_{I_{n,p}}(X), I_{n,p}]\big)\, \widetilde{I}_{n,p} = VU$ be a QR decomposition, where $\widetilde{I}_{n,p} = \begin{pmatrix} I_p \\ 0 \end{pmatrix} \in \mathcal{M}_{n,p}$, $V \in \mathcal{M}_{n,p}$ is a matrix with orthonormal columns, and $U \in \mathcal{M}_{p,p}$ is an upper-triangular matrix. Then the non-linear activation function (Nair & Hinton, 2010; Huang et al., 2018) is given by $\sigma(X) = VV^T$. Let $B^l \in \mathrm{Gr}_{n,p}$ be the bias for layer $l$. Then the message-passing update of our network can be summarized as
$$P^l_i = \phi_{M^l}(X^{l-1}_i) \quad \text{(feature transformation)}$$
$$Q^l_i = \operatorname{Exp}^{gr}_{I_{n,p}}\Big( \sum_{j \in N(i)} k_{i,j}\, \operatorname{Log}^{gr}_{I_{n,p}}(P^l_j) \Big) \quad \text{(aggregation)}$$
$$X^l_i = \sigma\big(B^l \oplus_{gr} Q^l_i\big) \quad \text{(bias and nonlinearity)}$$
The Grassmann logarithmic maps in the aggregation operation are obtained using Proposition 3.12.

Another approach for embedding graphs on Grassmann manifolds has also been proposed in Zhou et al. (2022). However, unlike our method, this method creates a Grassmann representation for a graph via an SVD of the matrix formed from node embeddings previously learned by a Euclidean neural network. Therefore, it is not designed to learn node embeddings on Grassmann manifolds.

4 EXPERIMENTS

4.1 HUMAN ACTION RECOGNITION

We use three datasets, i.e., HDM05 (Müller et al., 2007), FPHA (Garcia-Hernando et al., 2018), and NTU RGB+D 60 (NTU60) (Shahroudy et al., 2016). We compare our networks against the following state-of-the-art models: SPDNet (Huang & Gool, 2017)^1, SPDNetBN (Brooks et al., 2019)^2, SPSD-AI (Nguyen, 2022a), GyroAI-HAUNet (Nguyen, 2022b), and MLR-AI (Nguyen & Yang, 2023).

^1 https://github.com/zhiwu-huang/SPDNet
^2 https://papers.nips.cc/paper/2019/hash/6e69ebbfad976d4637bb4b39de261bf7-Abstract.html

4.1.1 ABLATION STUDY

Convolutional layers in SPD neural networks. Our network GyroSpd++ has an MLR layer stacked on top of a convolutional layer (see Fig. 1). The motivation for using a convolutional layer is that it can extract global features from local ones (covariance matrices computed from joint coordinates within sub-sequences of an action sequence). We use Affine-Invariant metrics for the convolutional layer and Log-Euclidean metrics for the MLR layer. Results in Tab. 1 show that GyroSpd++ consistently outperforms the SPD baselines in terms of mean accuracy. Results of GyroSpd++ with different designs of Riemannian metrics for its layers are given in Appendix D.4.1.

Table 1: Results (mean accuracy ± standard deviation) and model sizes (MB) of various SPD neural networks on the three datasets (computed over 5 runs).

Method              HDM05          #HDM05  FPHA           #FPHA  NTU60          #NTU60
SPDNet              71.36 ± 1.49   6.58    88.79 ± 0.36   0.99   76.14 ± 1.43   1.80
SPDNetBN            75.05 ± 1.38   6.68    91.02 ± 0.25   1.03   78.35 ± 1.34   2.06
GyroAI-HAUNet       77.05 ± 1.35   0.31    95.65 ± 0.23   0.11   93.27 ± 1.29   0.02
SPSD-AI             79.64 ± 1.54   0.31    95.72 ± 0.44   0.11   93.92 ± 1.55   0.03
MLR-AI              78.26 ± 1.37   0.60    95.70 ± 0.26   0.21   94.27 ± 1.32   0.05
GyroSpd++ (Ours)    79.78 ± 1.42   0.76    96.84 ± 0.27   0.27   95.28 ± 1.37   0.07
GyroSpsd++ (Ours)   78.52 ± 1.34   0.75    97.90 ± 0.24   0.27   96.64 ± 1.35   0.07

Table 2: Results and computation times (seconds) per epoch of Gr-GCN++ and its variant based on the ONB perspective. Node embeddings are learned on $\widetilde{\mathrm{Gr}}_{14,7}$ and $\mathrm{Gr}_{14,7}$ for Gr-GCN-ONB and Gr-GCN++, respectively. Results are computed over 5 runs.

Dataset  Metric                          Gr-GCN-ONB   Gr-GCN++
Airport  Accuracy ± standard deviation   81.9 ± 1.2   82.8 ± 0.7
         Training                        0.49         0.97
         Testing                         0.40         0.69
Pubmed   Accuracy ± standard deviation   76.2 ± 1.5   80.3 ± 0.5
         Training                        3.40         6.48
         Testing                         2.76         4.47
Cora     Accuracy ± standard deviation   68.1 ± 1.0   81.6 ± 0.4
         Training                        0.57         0.77
         Testing                         0.46         0.52
MLR in structure spaces. We build GyroSpsd++ by replacing the MLR layer of GyroSpd++ with the MLR layer proposed in Section 3.3. Results of GyroSpsd++ are given in Tab. 1. Except for SPSD-AI, GyroSpsd++ outperforms the other baselines on the HDM05 dataset in terms of mean accuracy. Furthermore, GyroSpsd++ outperforms GyroSpd++ and all the baselines on the FPHA and NTU60 datasets in terms of mean accuracy. These results show that MLR is effective when designed in structure spaces from a gyrovector space perspective.

4.2 NODE CLASSIFICATION

We use three datasets, i.e., Airport (Zhang & Chen, 2018), Pubmed (Namata et al., 2012a), and Cora (Sen et al., 2008), each of which contains a single graph with thousands of labeled nodes. We compare our network Gr-GCN++ (see Fig. 1) against its variant Gr-GCN-ONB (see Appendix E.2.4) based on the ONB perspective. Results are shown in Tab. 2. Both networks give the best performance for n = 14 and p = 7. It can be seen that Gr-GCN++ outperforms Gr-GCN-ONB in all cases. The performance gaps are significant on the Pubmed and Cora datasets.

5 CONCLUSION

In this paper, we develop FC and convolutional layers for SPD neural networks, and MLR on SPSD manifolds. We show how to perform backpropagation with the Grassmann logarithmic map in the projector perspective. Based on this method, we extend GCNs to Grassmann geometry. Finally, we present our experimental results demonstrating the efficacy of our approach in the human action recognition and node classification tasks.

REFERENCES

Pierre-Antoine Absil, Robert E. Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2007.

Vincent Arsigny, Pierre Fillard, Xavier Pennec, and Nicholas Ayache. Fast and Simple Computations on Tensors with Log-Euclidean Metrics. Technical Report RR-5584, INRIA, 2005.

E. Batzies, K. Hüper, L. Machado, and F. Silva Leite. Geometric Mean and Geodesic Regression on Grassmannians. Linear Algebra and its Applications, 466:83-101, 2015.

Thomas Bendokat, Ralf Zimmermann, and P.-A. Absil. A Grassmann Manifold Handbook: Basic Geometry and Computational Aspects. CoRR, abs/2011.13699, 2020. URL https://arxiv.org/abs/2011.13699.

Silvère Bonnabel, Anne Collard, and Rodolphe Sepulchre. Rank-preserving Geometric Means of Positive Semi-definite Matrices. Linear Algebra and its Applications, 438:3202-3216, 2013.

Daniel A. Brooks, Olivier Schwander, Frédéric Barbaresco, Jean-Yves Schneider, and Matthieu Cord. Riemannian Batch Normalization for SPD Neural Networks. In NeurIPS, pp. 15463-15474, 2019.

Rudrasis Chakraborty, Jose Bouza, Jonathan H. Manton, and Baba C. Vemuri. ManifoldNet: A Deep Neural Network for Manifold-valued Data with Applications. TPAMI, 44(2):799-810, 2020.

Ines Chami, Rex Ying, Christopher Ré, and Jure Leskovec. Hyperbolic Graph Convolutional Neural Networks. CoRR, abs/1910.12933, 2019. URL https://arxiv.org/abs/1910.12933.

Weize Chen, Xu Han, Yankai Lin, Hexu Zhao, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. Fully Hyperbolic Neural Networks. In ACL, pp. 5672-5686, 2022.

Calin Cruceru, Gary Bécigneul, and Octavian-Eugen Ganea. Computationally Tractable Riemannian Manifolds for Graph Embeddings. In AAAI, pp. 7133-7141, 2021.

Jindou Dai, Yuwei Wu, Zhi Gao, and Yunde Jia. A Hyperbolic-to-Hyperbolic Graph Convolutional Network. In CVPR, pp. 154-163, 2021.

Zhen Dong, Su Jia, Chi Zhang, Mingtao Pei, and Yuwei Wu.
Deep Manifold Learning of Symmetric Positive Definite Matrices with Application to Face Recognition. In AAAI, pp. 4009 4015, 2017. Alan Edelman, Tom as A. Arias, and Steven T. Smith. The Geometry of Algorithms with Orthogonality Constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303 353, 1998. Octavian-Eugen Ganea, Gary B ecigneul, and Thomas Hofmann. Hyperbolic neural networks. In Neur IPS, pp. 5350 5360, 2018. Guillermo Garcia-Hernando, Shanxin Yuan, Seungryul Baek, and Tae-Kyun Kim. First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations. In CVPR, pp. 409 419, 2018. William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive Representation Learning on Large Graphs. In NIPS, pp. 1025 1035, 2017. Mehrtash Harandi, Mathieu Salzmann, and Richard Hartley. Dimensionality Reduction on SPD Manifolds: The Emergence of Geometry-Aware Methods. TPAMI, 40:48 62, 2018. Zhiwu Huang and Luc Van Gool. A Riemannian Network for SPD Matrix Learning. In AAAI, pp. 2036 2042, 2017. Zhiwu Huang, Jiqing Wu, and Luc Van Gool. Building Deep Networks on Grassmann Manifolds. In AAAI, pp. 3279 3286, 2018. Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Matrix Backpropagation for Deep Networks with Structured Layers. In ICCV, pp. 2965 2973, 2015. Published as a conference paper at ICLR 2024 Ce Ju and Cuntai Guan. Graph Neural Networks on SPD Manifolds for Motor Imagery Classification: A Perspective From the Time-Frequency Analysis. IEEE Transactions on Neural Networks and Learning Systems, pp. 1 15, 2023. Qiyu Kang, Kai Zhao, Yang Song, Sijie Wang, and Wee Peng Tay. Node Embedding from Neural Hamiltonian Orbits in Graph Neural Networks. In ICML, pp. 15786 15808, 2023. Sejong Kim. Ordered Gyrovector Spaces. Symmetry, 12(6), 2020. Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. Co RR, abs/1609.02907, 2017. URL https://arxiv.org/abs/1609.02907. Reinmar J. Kobler, Jun ichiro Hirayama, Qibin Zhao, and Motoaki Kawanabe. SPD Domainspecific Batch Normalization to Crack Interpretable Unsupervised Domain Adaptation in EEG. In Neur IPS, pp. 6219 6235, 2022. Guy Lebanon and John Lafferty. Hyperplane Margin Classifiers on the Multinomial Manifold. In ICML, pp. 66, 2004. Zhenhua Lin. Riemannian Geometry of Symmetric Positive Definite Matrices via Cholesky Decomposition. SIAM Journal on Matrix Analysis and Applications, 40(4):1353 1370, 2019. Qi Liu, Maximilian Nickel, and Douwe Kiela. Hyperbolic Graph Neural Networks. In Neur IPS, pp. 8228 8239, 2019. Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, and Wanli Ouyang. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition. In CVPR, pp. 143 152, 2020. Federico L opez, Beatrice Pozzetti, Steve Trettel, Michael Strube, and Anna Wienhard. Vectorvalued Distance and Gyrocalculus on the Space of Symmetric Positive Definite Matrices. In Neur IPS, pp. 18350 18366, 2021. Meinard M uller, Tido R oder, Michael Clausen, Bernhard Eberhardt, Bj orn Kr uger, and Andreas Weber. Documentation Mocap Database HDM05. Technical Report CG-2007-2, Universit at Bonn, June 2007. Vinod Nair and Geoffrey E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML, pp. 807 814, 2010. Galileo Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven Active Surveying for Collective Classification. In 10th International Workshop on Mining and Learning with Graphs, volume 8, pp. 1, 2012a. Galileo Namata, Ben London, Lise Getoor, and Bert Huang. 
Query-driven Active Surveying for Collective Classification. In Workshop on Mining and Learning with Graphs, 2012b. Xuan Son Nguyen. Geom Net: A Neural Network Based on Riemannian Geometries of SPD Matrix Space and Cholesky Space for 3D Skeleton-Based Interaction Recognition. In ICCV, pp. 13379 13389, 2021. Xuan Son Nguyen. A Gyrovector Space Approach for Symmetric Positive Semi-definite Matrix Learning. In ECCV, pp. 52 68, 2022a. Xuan Son Nguyen. The Gyro-Structure of Some Matrix Manifolds. In Neur IPS, pp. 26618 26630, 2022b. Xuan Son Nguyen and Shuo Yang. Building Neural Networks on Matrix Manifolds: A Gyrovector Space Approach. Co RR, abs/2305.04560, 2023. URL https://arxiv.org/abs/2305. 04560. Xuan Son Nguyen, Luc Brun, Olivier L ezoray, and S ebastien Bougleux. A Neural Network Based on SPD Manifold Learning for Skeleton-based Hand Gesture Recognition. In CVPR, pp. 12036 12045, 2019. Published as a conference paper at ICLR 2024 Xavier Pennec, Pierre Fillard, and Nicholas Ayache. A Riemannian Framework for Tensor Computing. Technical Report RR-5255, INRIA, 2004. Xavier Pennec, Stefan Horst Sommer, and Tom Fletcher. Riemannian Geometric Statistics in Medical Image Analysis. Academic Press, 2020. Chiara Plizzari, Marco Cannici, and Matteo Matteucci. Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks. Computer Vision and Image Understanding, 208: 103219, 2021. Prithviraj Sen, Galileo Mark Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective Classification in Network Data. AI Magazine, 29(3):93 106, 2008. Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In CVPR, pp. 1010 1019, 2016. Ryohei Shimizu, Yusuke Mukuta, and Tatsuya Harada. Hyperbolic Neural Networks++. Co RR, abs/2006.08210, 2021. URL https://arxiv.org/abs/2006.08210. Ondrej Skopek, Octavian-Eugen Ganea, and Gary B ecigneul. Mixed-curvature Variational Autoencoders. Co RR, abs/1911.08411, 2020. URL https://arxiv.org/abs/1911.08411. Lincon S. Souza, Naoya Sogi, Bernardo B. Gatto, Takumi Kobayashi, and Kazuhiro Fukui. An Interface between Grassmann Manifolds and Vector Spaces. In CVPRW, pp. 3695 3704, 2020. Rhea Sanjay Sukthanker, Zhiwu Huang, Suryansh Kumar, Erik Goron Endsjo, Yan Wu, and Luc Van Gool. Neural Architecture Search of SPD Manifold Networks. In IJCAI, pp. 3002 3009, 2021. Abraham Albert Ungar. Beyond the Einstein Addition Law and Its Gyroscopic Thomas Precession: The Theory of Gyrogroups and Gyrovector Spaces. Fundamental Theories of Physics, vol. 117, Springer, Netherlands, 2002. Abraham Albert Ungar. Analytic Hyperbolic Geometry: Mathematical Foundations and Applications. World Scientific Publishing Co. Pte. Ltd., Hackensack, NJ, 2005. Abraham Albert Ungar. Analytic Hyperbolic Geometry in N Dimensions: An Introduction. CRC Press, 2014. Petar Veli ckovi c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li, and Yoshua Bengio. Graph Attention Networks. Co RR, abs/1710.10903, 2018. URL https://arxiv. org/abs/1710.10903. Rui Wang and Xiao-Jun Wu. Gras Net: A Simple Grassmannian Network for Image Set Classification. Neural Processing Letters, 52(1):693 711, 2020. Rui Wang, Xiao-Jun Wu, and Josef Kittler. Sym Net: A Simple Symmetric Positive Definite Manifold Deep Learning Method for Image Set Classification. IEEE Transactions on Neural Networks and Learning Systems, pp. 1 15, 2021. Muhan Zhang and Yixin Chen. Link Prediction Based on Graph Neural Networks. 
Co RR, abs/1802.09691, 2018. URL https://arxiv.org/abs/1802.09691. Yiding Zhang, Xiao Wang, Chuan Shi, Nian Liu, and Guojie Song. Lorentzian Graph Convolutional Networks. In Proceedings of the Web Conference 2021, pp. 1249 1261, 2021. Yiding Zhang, Xiao Wang, Chuan Shi, Xunqiang Jiang, and Yanfang Ye. Hyperbolic Graph Attention Network. IEEE Transactions on Big Data, 8(6):1690 1701, 2022. Wei Zhao, Federico Lopez, J. Maxwell Riestenberg, Michael Strube, Diaaeldin Taha, and Steve Trettel. Modeling Graphs Beyond Hyperbolic: Graph Neural Networks in Symmetric Positive Definite Matrices. Co RR, abs/2306.14064, 2023. URL https://arxiv.org/abs/2306. 14064. Bingxin Zhou, Xuebin Zheng, Yu Guang Wang, Ming Li, and Junbin Gao. Embedding Graphs on Grassmann Manifold. Neural Networks, 152:322 331, 2022. Huanyu Zhou, Qingjie Liu, and Yunhong Wang. Learning Discriminative Representations for Skeleton Based Action Recognition. In CVPR, pp. 10608 10617, 2023. Published as a conference paper at ICLR 2024 Symbol Name Mn,m Space of n m matrices Sym+ n Space of n n SPD matrices Sym+,ai n Space of n n SPD matrices with AI geometry Symn Space of n n symmetric matrices Grn,p Grassmannian in the projector perspective f Grn,p Grassmannian in the ONB perspective S+ n,p Space of n n SPSD matrices of rank p n M Matrix manifold TPM Tangent space of M at P exp(P) Matrix exponential of P log(P) Matrix logarithm of P Expai P (W) Exponential map of W at P in Sym+,ai n Logai P (Q) Logarithmic map of Q at P in Sym+,ai n T ai P Q(W) Parallel transport of W from P to Q in Sym+,ai n Expgr P (W) Exponential map of W at P in Grn,p Loggr P (Q) Logarithmic map of Q at P in Grn,p g Log gr P (Q) Logarithmic map of Q at P in f Grn,p ai, ai Binary and inverse operations in Sym+,ai n gr, gr Binary and inverse operations in Grn,p e gr, e gr Binary and inverse operations in f Grn,p psd,ai Binary operation in f Grn,p Sym+,ai p psd,ai Inverse operation in f Grn,p Sym+,ai p Hspd,ai W,P Hypergyroplane in Sym+,ai n Hpsd,ai W,P Hypergyroplane in f Grn,p Sym+,ai p Eai (i,j) Orthonormal basis of Sym+,ai m ., . ai Inner product in Sym+,ai n ., . gr Inner product in Grn,p ., . psd,ai Inner product in f Grn,p Sym+,ai p ., . ai P Affine-Invariant metric at P In n n identity matrix Table 3: The main notations used in the paper. For the notations related to SPD manifolds, only those associated with Affine-Invariant geometry are shown. A NOTATIONS Tab. 3 presents the main notations used in our paper. B MLR IN STRUCTURE SPACES Algorithm 1 summarizes all steps for the computation of pseudo-gyrodistances in Theorem 3.11. Details of some steps are given below: Published as a conference paper at ICLR 2024 Algorithm 1: Computation of Pseudo-gyrodistances Input: A batch of SPSD matrices Xi S+ n,p, i = 1, . . . , N The number of classes C The parameters for each class Uc P , Uc W f Grn,p, Sc P , Sc W Sym+ p , c = 1, . . . , C A constant γ [0, 1] Output: An array d MN,C of pseudo-gyrodistances 1 Um e In,p; 2 (Ui,Σi, Vi)i=1,...,N SVD((Xi)i=1,...,N); /* Xi = UiΣi VT i */ 3 (Ui)i=1,...,N (Ui[:, : p])i=1,...,N; 4 if training then 5 U Gr Mean((Ui)i=1,...,N); 6 Um Gr Geodesic(Um, U, γ); 8 (Ui X, Si X)i=1,...,N = Gr Canonicalize((Xi, Ui)i=1,...,N, Um); 9 for c 1 to C do 10 d((X)i=1,...,N, c) = |λ (e gr Uc P e gr UX)(e gr Uc P e gr UX)T ,Uc W (Uc W )T gr+ g Sc P g SX,Sc W g| λ( Uc W (Uc W )T gr)2+( Sc W g)2 ; SVD((Xi)i=1,...,N) performs singular value decompositions for a batch of matrices. Gr Mean((Ui)i=1,...,N) computes the Fr echet mean of its arguments. 
Gr Geodesic(Um, U, γ) computes a point on a geodesic from Um to U at step γ (γ = 0.1 in our experiments). Gr Canonicalize((Xi, Ui)i=1,...,N, V) computes the canonical representations of Xi, i = 1, . . . , N using V as a common subspace (see Appendix H). Line 10: UX = (Ui X)i=1,...,N, SX = (Si X)i=1,...,N, and the computation of pseudogyrodistances is performed in batches. C FORMULATION OF MLR FROM THE PERSPECTIVE OF DISTANCES TO HYPERPLANES Given K classes, MLR computes the probability of each of the output classes as p(y = k|x) = exp(w T k x + bk) PK i=1 exp(w T i x + bi) exp(w T k x + bk), (3) where x is an input sample, bi R, x, wi Rn, i = 1, . . . , K. As shown in Lebanon & Lafferty (2004), Eq. (3) can be rewritten as p(y = k|x) exp(sign(w T k x + bk) wk d(x, Hwk,bk)), where d(x, Hwk,bk) is the distance from point x to a hyperplane Hwk,bk defined as Hw,b = {x Rn : w, x + b = 0}, where w Rn \ {0}, and b R. D HUMAN ACTION RECOGNITION D.1 DATASETS HDM05 (M uller et al., 2007) It has 2337 sequences of 3D skeleton data classified into 130 classes. Each frame contains the 3D coordinates of 31 body joints. We use all the action classes and follow the experimental protocol in Harandi et al. (2018) in which 2 subjects are used for training and the remaining 3 subjects are used for testing. Published as a conference paper at ICLR 2024 FPHA (Garcia-Hernando et al., 2018) It has 1175 sequences of 3D skeleton data classified into 45 classes. Each frame contains the 3D coordinates of 21 hand joints. We follow the experimental protocol in Garcia-Hernando et al. (2018) in which 600 sequences are used for training and 575 sequences are used for testing. NTU60 (Shahroudy et al., 2016) It has 56880 sequences of 3D skeleton data classified into 60 classes. Each frame contains the 3D coordinates of 25 or 50 body joints. We use the mutual actions and follow the cross-subject experimental protocol in Shahroudy et al. (2016) in which data from 20 subjects are used for training, and those from the other 20 subjects are used for testing. D.2 IMPLEMENTATION DETAILS D.2.1 SETUP We use the Py Torch framework to implement our networks and those from previous works. These networks are trained using cross-entropy loss and Adadelta optimizer for 2000 epochs. The learning rate is set to 10 3. The factors β (see Proposition 3.4) and λ (see Definition 3.9) are set to 0 and 1, respectively. For Gyro Spd++, the sizes of output matrices of the convolutional layer are set to 34 34, 21 21, and 11 11 for the experiments on HDM05, FPHA, and NTU60 datasets, respectively. For Gyro Spsd++, the sizes of SPD matrices in structure spaces are set to 20 20, 14 14, and 8 8 for the experiments on HDM05, FPHA, and NTU60 datasets, respectively. We use a batch size of 32 for HDM05 and FPHA datasets, and a batch size of 256 for NTU60 dataset. D.2.2 INPUT DATA We use a similar method as in Nguyen (2022b) to compute the input data for our network Gyro Spd++. We first identify a closest left (right) neighbor of every joint based on their distance to the hip (wrist) joint, and then combine the 3D coordinates of each joint and those of its left (right) neighbor to create a feature vector for the joint. For a given frame t, a mean vector µt and a covariance matrix Σt are computed from the set of feature vectors of the frame and then combined to create an SPD matrix as Yt = Σt + µt(µt)T µt (µt)T 1 The lower part of matrix log(Yt) is flattened to obtain a vector vt. 
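For illustration, here is a minimal PyTorch sketch of the per-frame construction just described. It assumes the standard Gaussian embedding $Y_t = \begin{pmatrix} \Sigma_t + \mu_t\mu_t^T & \mu_t \\ \mu_t^T & 1 \end{pmatrix}$ built from the frame's mean vector and covariance matrix; the covariance normalization, the small diagonal jitter, and the helper name frame_embedding are our assumptions and are not taken from the paper.

```python
import torch

def frame_embedding(F):
    """Per-frame SPD feature: build Y_t from the joint feature vectors of one frame
    and return the flattened lower-triangular part of log(Y_t) as the vector v_t.

    F: (num_joints, d) matrix of per-joint feature vectors for one frame."""
    mu = F.mean(dim=0)                               # mean feature vector
    Fc = F - mu
    d = mu.shape[0]
    Sigma = Fc.T @ Fc / F.shape[0] + 1e-6 * torch.eye(d)   # covariance (with small jitter)
    Y = torch.zeros(d + 1, d + 1)
    Y[:d, :d] = Sigma + torch.outer(mu, mu)
    Y[:d, d] = mu
    Y[d, :d] = mu
    Y[d, d] = 1.0
    # matrix logarithm of the SPD matrix Y via eigendecomposition
    vals, vecs = torch.linalg.eigh(Y)
    logY = vecs @ torch.diag(torch.log(vals)) @ vecs.T
    idx = torch.tril_indices(d + 1, d + 1)
    return logY[idx[0], idx[1]]                      # vector v_t

# toy usage: 21 joints with 6-dimensional per-joint features
v_t = frame_embedding(torch.randn(21, 6))
print(v_t.shape)   # ((d+1)(d+2)/2,) = (28,)
```

The vectors $v_t$ obtained this way are then pooled over a time window into the covariance matrix of Eq. (4) described next.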
All vectors vt within a time window [t, t+c 1], where c is determined from a temporal pyramid representation of the sequence (the number of temporal pyramids is set to 2 in our experiments), are used to compute a covariance matrix as i=t ( vi vt)( vi vt)T , (4) where vt = 1 c Pt+c 1 i=t vi. For Gyro AI-HAUNet, SPSD-AI, MLR-AI, Gyro Spd++, and Gyro Spsd++, the input data are the set of matrices obtained in Eq. (4). For SPDNet and SPDNet BN, each sequence is represented by a covariance matrix (Huang & Gool, 2017; Brooks et al., 2019). The sizes of the covariance matrices are 93 93, 60 60, and 150 150 for HDM05, FPHA, and NTU60 datasets, respectively. For SPDNet, the same architecture as the one in Huang & Gool (2017) is used with three Bimap layers. For SPDNet BN, the same architecture as the one in Brooks et al. (2019) is used with three Bimap layers. The sizes of the transformation matrices for the experiments on HDM05, FPHA, and NTU60 datasets are set to 93 93, 60 60, and 150 150, respectively. D.2.3 CONVOLUTIONAL LAYERS In order to reduce the number of parameters and the computational cost for the convolutional layer in Gyro Spd++, we assume a diagonal structure for the parameter P(i,j) (see Propositions 3.4, 3.5, and 3.6), i.e., P(i,j) = concatspd(P1 (i,j), . . . , PL (i,j)), where L is the number of input SPD matrices of operation concatspd(.). Published as a conference paper at ICLR 2024 D.2.4 OPTIMIZATION For parameters that are SPD matrices, we model them on the space of symmetric matrices, and then apply the exponential map at the identity. For any parameter P f Grn,p, we parameterize it by a matrix B Mp,n p such that 0 B BT 0 = [Loggr In,p(PPT ), In,p]. Then parameter P can be computed by P = exp([Loggr In,p(PPT ), In,p])e In,p = exp 0 B BT 0 Thus, we can optimize all parameters on Euclidean spaces without having to resort to techniques developed on Riemannian manifolds. D.3 TIME COMPLEXITY ANALYSIS Let nin nin be the size of input SPD matrices, nout nout be the size of output matrices of the convolutional layer in Gyro Spd++, nrank nrank be the size of SPD matrices in structure spaces, nc be the number of action classes, ns be the number of SPD matrices encoding a sequence. Gyro Spd++: The convolutional layer has time complexity O(nsn2 outn3 in). The MLR layer has time complexity O(ncn3 out). Gyro Spsd++: The convolutional layer has time complexity O(nsn2 outn3 in). The MLR layer has time complexity O(n3 out + ncn3 rank). D.4 MORE EXPERIMENTAL RESULTS D.4.1 ABLATION STUDY Impact of the factor β in Affine-Invariant metrics To study the impact of the factor β in Affine Invariant metrics on the performance of Gyro Spd++, we follow the approach in Nguyen (2021). Denote by (µ,Σ) the Gaussian distribution where µ Rn and Σ Mn,n are its mean and covariance. We can identify the Gaussian distribution (µ,Σ) with the following matrix: (detΣ) 1 n+k Σ + kµµT µ(k) µ(k)T Ik where k 1, µ(k) is a matrix with k identical column vectors µ. The natural symmetric Riemannian metric resulting from the above embedding is given (Nguyen, 2021) by V, W P = Tr(VP 1WP 1) 1 n + k Tr(VP 1) Tr(WP 1), where P is an SPD matrix, V and W are two tangent vectors at point P of the manifold. This Riemannian metric belongs to the family of Affine-Invariant metrics where β = 1 n+k > 1 For this study, we replace matrix Zt in Eq. (4) with the following matrix: e Zt = (det Zt) 1 n1+k Zt + k vt v T t vt(k) vt(k)T Ik where n1 n1 is the size of matrix Zt. The input data for Gyro Spd++ are then computed from e Zt as before. 
Tab. 4 reports the mean accuracies and standard deviations of Gyro Spd++ with respect to different settings of β on the three datasets. Gyro Spd++ with the setting β = 0 generally works well on all the datasets. Setting k = 3 improves the accuracy of Gyro Spd++ on NTU60 dataset. We also observe that setting k to a high value, e.g., k = 10 lowers the accuracies of Gyro Spd++ on the datasets. Published as a conference paper at ICLR 2024 Table 4: Results (mean accuracy standard deviation) of Gyro Spd++ with respect to different settings of β on the three datasets (computed over 5 runs). Dataset HDM05 FPHA NTU60 β = 0 79.78 1.42 96.84 0.27 95.28 1.37 k = 3 79.12 1.37 96.16 0.25 96.32 1.33 k = 10 78.25 1.39 95.91 0.29 94.44 1.34 Table 5: Results and computation times (seconds) of Gyro Spd++ with respect to different settings of the output dimension of the convolutional layer on FPHA dataset (computed over 5 runs). Experiments are conducted on a machine with Intel Core i7-8565U CPU 1.80 GHz 24GB RAM. m Accuracy standard deviation Computation time/epoch Training Testing 10 94.53 0.31 30.08 12.07 21 96.84 0.27 129.30 50.84 30 96.80 0.26 182.52 71.49 Output dimension of convolutional layers Tab. 5 presents results and computation times of Gyro Spd++ with respect to different settings of the output dimension of the convolutional layer on FPHA dataset. Results show that the setting m = 21 clearly outperforms the setting m = 10 in terms of mean accuracy and standard deviation. However, compared to the setting m = 21, the setting m = 30 only increases the training and testing times without improving the mean accuracy of Gyro Spd++. Design of Riemannian metrics for network blocks The use of different Riemannian metrics for the convolutional and MLR layers of Gyro Spd++ results in different variants of the same architecture. Results of some of these variants on FPHA dataset are shown in Tab. 6. It is noted that our architecture gives the best performance in terms of mean accuracy, while the architecture with Log-Cholesky geometry for the MLR layer performs the worst in terms of mean accuracy. D.4.2 COMPARISON OF GYROSPD++ AGAINST STATE-OF-THE-ART METHODS Here we present more comparisons of our networks against state-of-the-art networks. These networks belong to one of the following families of neural networks: (1) Hyperbolic neural networks: Hyp GRU (Ganea et al., 2018)3; (2) Graph neural networks: MS-G3D (Liu et al., 2020)4, TGN (Zhou et al., 2023)5; (3) Transformers: ST-TR (Plizzari et al., 2021)6. Note that MS-G3D, TGN, and ST-TR are specifically designed for skeleton-based action recognition. We use default parameter settings for these networks. Results of our networks and their competitors on HDM05, FPHA, and NTU60 datasets are shown in Tabs. 7, 8, and 9, respectively. On HDM05 dataset, Gyro Spd++ outperforms Hyp GRU, MS-G3D, ST-TR, and TGN by 25.6%, 10.8%, 11.9%, and 9.5% points in terms of mean accuracy, respectively. On FPHA dataset, Gyro Spd++ outperforms Hyp GRU, MS-G3D, ST-TR, and TGN by 38.6%, 8.5%, 10.9%, and 6.0% points in terms of mean accuracy, respectively. On NTU60 dataset, Gyro Spd++ outperforms Hyp GRU, MS-G3D, ST-TR, and TGN by 7.0%, 3.1%, 4.2%, and 2.2% points in terms of mean accuracy, respectively. Overall, our networks are superior to their competitors in all cases. Finally, we present a comparison of computation times of SPD neural networks in Tab. 10. 
Published as a conference paper at ICLR 2024 Table 6: Results (mean accuracy standard deviation) of Gyro Spd++ with different designs of Riemannian metrics for its layers on FPHA dataset (computed over 5 runs). AI-LE (Gyro Spd++) LE-LE AI-AI LE-AI AI-LC 96.84 0.27 94.72 0.25 94.35 0.29 95.21 0.26 89.16 0.26 Table 7: Results of our networks and some state-of-the-art methods on HDM05 dataset (computed over 5 runs). Method Accuracy Standard deviation Hyp GRU (Ganea et al., 2018) 54.18 1.51 MS-G3D (Liu et al., 2020) 68.92 1.72 ST-TR (Plizzari et al., 2021) 67.84 1.66 TGN (Zhou et al., 2023) 70.26 1.48 Gyro Spd++ (Ours) 79.78 1.42 Gyro Spsd++ (Ours) 78.52 1.34 E NODE CLASSIFICATION E.1 DATASETS Airport (Chami et al., 2019) It is a flight network dataset from Open Flights.org where nodes represent airports, edges represent the airline Routes, and node labels are the populations of the country where the airport belongs. Pubmed (Namata et al., 2012b) It is a standard benchmark describing citation networks where nodes represent scientific papers in the area of medicine, edges are citations between them, and node labels are academic (sub)areas. Cora (Sen et al., 2008) It is a citation network where nodes represent scientific papers in the area of machine learning, edges are citations between them, and node labels are academic (sub)areas. The statistics of the three datasets are summarized in Tab. 11. E.2 IMPLEMENTATION DETAILS E.2.1 SETUP Our network is implemented using the Py Torch framework. We set hyperparameters as in Zhao et al. (2023) that are found via grid search for each graph architecture on the development set of a given dataset. The best settings of n and p are found from (n, p) {(2k, k)}, k = 2, 3, . . . , 10. The batch size is set to the total number of graph nodes in a dataset (Chami et al., 2019; Zhao et al., 2023). The networks are trained using cross-entropy loss and Adam optimizer for a maximum of 500 epochs. The learning rate is set to 10 2. Early stopping is used when the loss on the development set has not decreased for 200 epochs. Each network has two layers that perform message passing twice at one iteration (Zhao et al., 2023). We use the 70/15/15 percent splits (Chami et al., 2019) for Airport dataset, and standard splits in GCN Kipf & Welling (2017) for Pubmed and Cora datasets. 3https://github.com/dalab/hyperbolic_nn. 4https://github.com/kenziyuliu/MS-G3D. 5https://github.com/zhysora/FR-Head. 6https://github.com/Chiaraplizz/ST-TR. Published as a conference paper at ICLR 2024 Table 8: Results of our networks and some state-of-the-art methods on FPHA dataset (computed over 5 runs). Method Accuracy Standard deviation Hyp GRU (Ganea et al., 2018) 58.24 0.29 MS-G3D (Liu et al., 2020) 88.26 0.67 ST-TR (Plizzari et al., 2021) 85.94 0.46 TGN (Zhou et al., 2023) 90.81 0.53 Gyro Spd++ (Ours) 96.84 0.27 Gyro Spsd++ (Ours) 97.90 0.24 Table 9: Results of our networks and some state-of-the-art methods on NTU60 dataset (computed over 5 runs). Method Accuracy Standard deviation Hyp GRU (Ganea et al., 2018) 88.26 1.40 MS-G3D (Liu et al., 2020) 92.15 1.60 ST-TR (Plizzari et al., 2021) 91.04 1.52 TGN (Zhou et al., 2023) 93.02 1.56 Gyro Spd++ (Ours) 95.28 1.37 Gyro Spsd++ (Ours) 96.64 1.35 E.2.2 GRASSMANN LOGARITHMIC MAP IN THE ONB PERSPECTIVE The Grassmann logarithmic map in the ONB perspective is given (Edelman et al., 1998) by g Log gr P (Q) = U arctan(Σ)VT , where P, Q f Grn,p, U, Σ, and V are obtained from the SVD (In PPT )Q(PT Q) 1 = UΣVT . 
E.2.3 GR-GCN++ To create Grassmann embeddings as input node features, we first transform d-dimensional input features into p(n p)-dimensional vectors via a linear map. We then reshape each resulting vector to a matrix B Mp,n p. The input Grassmann embedding X0 i , i V is computed as X0 i = exp 0 B BT 0 In,p exp 0 B BT 0 E.2.4 GR-GCN-ONB To create Grassmann embeddings as input node features, we first transform d-dimensional input features into p(n p)-dimensional vectors via a linear map. We then reshape each resulting vector to a matrix B Mp,n p. The input Grassmann embedding X0 i , i V is computed as X0 i = exp 0 B BT 0 Feature transformation is performed by first mapping the input to a projection matrix (using the mapping τ in Section 3.4.1), then applying an isometry map based on left Grassmann gyrotranslations (Nguyen & Yang, 2023), and finally mapping the result back to a matrix with orthonormal columns. This is equivalent to performing the following mapping: φM(Xl i) = Me gr Xl i = exp([Loggr In,p(MMT ), In,p])Xl i, where Xl i f Grn,p and M f Grn,p is a model parameter. Published as a conference paper at ICLR 2024 Table 10: Computation times (seconds) per epoch of our networks and some state-of-the-art SPD neural networks on FPHA dataset. Experiments are conducted on a machine with Intel Core i78565U CPU 1.80 GHz 24GB RAM. Method SPDNet SPDNet BN Gyro AI-HAUNet SPSD-AI MLR-AI Gyro Spd++ Gyro Spsd++ Training 17.52 40.08 62.21 73.73 102.58 129.30 126.08 Testing 3.48 6.22 30.83 35.54 46.28 50.84 48.06 Table 11: Description of the datasets for node classification. Dataset #Nodes #Edges #Classes #Features Airport 3188 18631 4 4 Pubmed 19717 44338 3 500 Cora 2708 5429 7 1433 For any X f Grn,p, let X = VU be a QR decomposition of X, where V Mn,p is a matrix with orthonormal columns, and U Mp,p is an upper-triangular matrix. Then the non-linear activation function is given by σ(X) = V. Bias addition is performed using operation e gr instead of operation gr. The output of Gr-GCNONB is mapped to a projection matrix for node classification. E.2.5 OPTIMIZATION For any parameter P Grn,p, we parameterize it by a matrix B Mp,n p such that 0 B BT 0 = [Loggr In,p(P), In,p]. Then parameter P can be computed by P = exp 0 B BT 0 In,p exp 0 B BT 0 E.3 MORE EXPERIMENTAL RESULTS E.3.1 ABLATION STUDY Projector vs. ONB perspective More results of Gr-GCN++ and Gr-GCN-ONB are presented in Tabs. 12 and 13. As can be observed, Gr-GCN++ outperforms Gr-GCN-ONB in all cases. In particular, the former outperforms the latter by large margins on Airport and Cora datasets. Results show that while both the networks learn node embeddings on Grassmann manifolds, the choice of perspective for representing these embeddings and the associated parameters can have a significant impact on the network performance. E.3.2 COMPARISON OF GR-GCN++ AGAINST STATE-OF-THE-ART METHODS Tab. 14 shows results of Gr-GCN++ and some state-of-the-art methods on the three datasets. The hyperbolic networks outperform their SPD and Grassmann counterparts on Airport dataset with high hyperbolicity (Chami et al., 2019). This agrees with previous works (Chami et al., 2019; Zhang et al., 2022) that report good performances of hyperbolic embeddings on tree-like datasets. However, our network and its SPD counterpart SPD-GCN outperform their competitors on Pubmed and Cora datasets with low hyperbolicities. Compared to SPD-GCN, Gr-GCN++ always gives more consistent results. 
E.3 MORE EXPERIMENTAL RESULTS

E.3.1 ABLATION STUDY

Projector vs. ONB perspective. More results of Gr-GCN++ and Gr-GCN-ONB are presented in Tabs. 12 and 13. As can be observed, Gr-GCN++ outperforms Gr-GCN-ONB in all cases. In particular, the former outperforms the latter by large margins on the Airport and Cora datasets. These results show that while both networks learn node embeddings on Grassmann manifolds, the choice of perspective for representing these embeddings and the associated parameters can have a significant impact on network performance.

E.3.2 COMPARISON OF GR-GCN++ AGAINST STATE-OF-THE-ART METHODS

Tab. 14 shows the results of Gr-GCN++ and some state-of-the-art methods on the three datasets. The hyperbolic networks outperform their SPD and Grassmann counterparts on the Airport dataset, which has high hyperbolicity (Chami et al., 2019). This agrees with previous works (Chami et al., 2019; Zhang et al., 2022) that report good performance of hyperbolic embeddings on tree-like datasets. However, our network and its SPD counterpart SPD-GCN outperform their competitors on the Pubmed and Cora datasets, which have low hyperbolicities. Compared to SPD-GCN, Gr-GCN++ always gives more consistent results.

Table 12: Results and computation times (seconds) per epoch of Gr-GCN++ and its variant Gr-GCN-ONB based on the ONB perspective. Node embeddings are learned on $\widetilde{\operatorname{Gr}}_{4,2}$ and $\operatorname{Gr}_{4,2}$ for Gr-GCN-ONB and Gr-GCN++, respectively. Results are computed over 5 runs. Experiments are conducted on a machine with an Intel Core i7-9700 CPU @ 3.00 GHz and 15GB RAM.

Dataset | Metric | Gr-GCN-ONB | Gr-GCN++
Airport | Accuracy ± standard deviation | 53.2 ± 1.9 | 60.1 ± 1.3
Airport | Training | 0.07 | 0.21
Airport | Testing | 0.05 | 0.12
Pubmed | Accuracy ± standard deviation | 75.7 ± 2.1 | 77.5 ± 1.1
Pubmed | Training | 0.50 | 0.90
Pubmed | Testing | 0.38 | 0.54
Cora | Accuracy ± standard deviation | 33.9 ± 2.3 | 64.4 ± 1.4
Cora | Training | 0.10 | 0.12
Cora | Testing | 0.07 | 0.08

Table 13: Results and computation times (seconds) per epoch of Gr-GCN++ and its variant Gr-GCN-ONB based on the ONB perspective. Node embeddings are learned on $\widetilde{\operatorname{Gr}}_{6,3}$ and $\operatorname{Gr}_{6,3}$ for Gr-GCN-ONB and Gr-GCN++, respectively. Results are computed over 5 runs. Experiments are conducted on a machine with an Intel Core i7-9700 CPU @ 3.00 GHz and 15GB RAM.

Dataset | Metric | Gr-GCN-ONB | Gr-GCN++
Airport | Accuracy ± standard deviation | 65.8 ± 1.5 | 74.1 ± 0.9
Airport | Training | 0.19 | 0.34
Airport | Testing | 0.14 | 0.21
Pubmed | Accuracy ± standard deviation | 75.8 ± 2.0 | 78.5 ± 0.9
Pubmed | Training | 0.90 | 1.76
Pubmed | Testing | 0.75 | 1.05
Cora | Accuracy ± standard deviation | 41.4 ± 2.2 | 70.5 ± 1.1
Cora | Training | 0.16 | 0.22
Cora | Testing | 0.12 | 0.16

F LIMITATIONS OF OUR WORK

Our SPD network GyroSpd++ relies on different Riemannian metrics across its layers: the convolutional layer is based on Affine-Invariant metrics, while the MLR layer is based on Log-Euclidean metrics. Although our experiments demonstrate that GyroSpd++ achieves good performance on all the datasets compared to state-of-the-art methods, it is not clear whether this design is optimal for the human action recognition task. When building a deep SPD architecture, it would be useful to have insights into which Riemannian metrics should be used for each network block in order to obtain good performance on a target task. In our Grassmann network Gr-GCN++, the feature transformation, bias, and nonlinearity operations are performed on Grassmann manifolds, while the aggregation operation is performed in tangent spaces. Previous works (Dai et al., 2021; Chen et al., 2022) on HNNs have shown that such hybrid methods limit the modeling ability of networks. Therefore, it is desirable to develop GCNs in which all operations are formalized on Grassmann manifolds.

Table 14: Results (mean accuracy ± standard deviation) of Gr-GCN++ and some state-of-the-art methods on the three datasets. The best and second best results in terms of mean accuracy are highlighted in red and blue, respectively.

Method | Airport | Pubmed | Cora
GCN (Kipf & Welling, 2017) | 82.2 ± 0.6 | 77.8 ± 0.8 | 80.2 ± 2.3
GAT (Veličković et al., 2018) | 92.9 ± 0.8 | 77.6 ± 0.8 | 80.3 ± 0.6
HGNN (Liu et al., 2019) | 84.5 ± 0.7 | 76.6 ± 1.4 | 79.5 ± 0.9
HGCN (Chami et al., 2019) | 85.3 ± 0.6 | 76.4 ± 0.8 | 78.7 ± 0.9
LGCN (Zhang et al., 2021) | 88.2 ± 0.2 | 77.3 ± 1.4 | 80.6 ± 0.9
HGAT (Zhang et al., 2022) | 87.5 ± 0.9 | 78.0 ± 0.5 | 80.9 ± 0.7
SPD-GCN (Zhao et al., 2023) | 82.6 ± 1.5 | 78.7 ± 0.5 | 82.3 ± 0.5
HamGNN (Kang et al., 2023) | 95.9 ± 0.1 | 78.3 ± 0.6 | 80.1 ± 1.6
Gr-GCN++ (Ours) | 82.8 ± 0.7 | 80.3 ± 0.5 | 81.6 ± 0.4

G SOME RELATED DEFINITIONS

G.1 GYROGROUPS AND GYROVECTOR SPACES

Gyrovector spaces form the setting for hyperbolic geometry in the same way that vector spaces form the setting for Euclidean geometry (Ungar, 2002; 2005; 2014). We recap the definitions of gyrogroups and gyrocommutative gyrogroups proposed in Ungar (2002; 2005; 2014).
For greater mathematical detail and in-depth discussion, we refer the interested reader to these papers.

Definition G.1 (Gyrogroups (Ungar, 2014)). A pair $(G, \oplus)$ is a groupoid in the sense that it is a nonempty set, $G$, with a binary operation, $\oplus$. A groupoid $(G, \oplus)$ is a gyrogroup if its binary operation satisfies the following axioms for $a, b, c \in G$:
(G1) There is at least one element $e \in G$, called a left identity, such that $e \oplus a = a$.
(G2) There is an element $\ominus a \in G$, called a left inverse of $a$, such that $\ominus a \oplus a = e$.
(G3) There is an automorphism $\operatorname{gyr}[a, b] : G \rightarrow G$ for each $a, b \in G$ such that $a \oplus (b \oplus c) = (a \oplus b) \oplus \operatorname{gyr}[a, b]c$ (Left Gyroassociative Law). The automorphism $\operatorname{gyr}[a, b]$ is called the gyroautomorphism, or the gyration of $G$ generated by $a, b$.
(G4) $\operatorname{gyr}[a, b] = \operatorname{gyr}[a \oplus b, b]$ (Left Reduction Property).

Definition G.2 (Gyrocommutative Gyrogroups (Ungar, 2014)). A gyrogroup $(G, \oplus)$ is gyrocommutative if it satisfies $a \oplus b = \operatorname{gyr}[a, b](b \oplus a)$ (Gyrocommutative Law).

The following definition of gyrovector spaces is slightly different from Definition 3.2 in Ungar (2014).

Definition G.3 (Gyrovector Spaces). A gyrocommutative gyrogroup $(G, \oplus)$ equipped with a scalar multiplication $(t, x) \mapsto t \otimes x : \mathbb{R} \times G \rightarrow G$ is called a gyrovector space if it satisfies the following axioms for $s, t \in \mathbb{R}$ and $a, b, c \in G$:
(V1) $1 \otimes a = a$, $0 \otimes a = t \otimes e = e$, and $(-1) \otimes a = \ominus a$.
(V2) $(s + t) \otimes a = s \otimes a \oplus t \otimes a$.
(V3) $(st) \otimes a = s \otimes (t \otimes a)$.
(V4) $\operatorname{gyr}[a, b](t \otimes c) = t \otimes \operatorname{gyr}[a, b]c$.
(V5) $\operatorname{gyr}[s \otimes a, t \otimes a] = \operatorname{Id}$, where $\operatorname{Id}$ is the identity map.

G.2 AI GYROVECTOR SPACES

For $P, Q \in \operatorname{Sym}^+_n$, the binary operation (Nguyen, 2022a) is given as
$$P \oplus_{ai} Q = P^{\frac{1}{2}} Q P^{\frac{1}{2}}.$$
The inverse operation (Nguyen, 2022a) is given by $\ominus_{ai} P = P^{-1}$.

G.3 LE GYROVECTOR SPACES

For $P, Q \in \operatorname{Sym}^+_n$, the binary operation (Nguyen, 2022a) is given as
$$P \oplus_{le} Q = \exp(\log(P) + \log(Q)).$$
The inverse operation (Nguyen, 2022a) is given as $\ominus_{le} P = P^{-1}$.

G.4 LC GYROVECTOR SPACES

For $P, Q \in \operatorname{Sym}^+_n$, the binary operation (Nguyen, 2022a) is given as
$$P \oplus_{lc} Q = \big(\lfloor \mathcal{L}(P) \rfloor + \lfloor \mathcal{L}(Q) \rfloor + D(\mathcal{L}(P)) D(\mathcal{L}(Q))\big)\big(\lfloor \mathcal{L}(P) \rfloor + \lfloor \mathcal{L}(Q) \rfloor + D(\mathcal{L}(P)) D(\mathcal{L}(Q))\big)^T,$$
where $\lfloor Y \rfloor$ is a matrix of the same size as matrix $Y \in M_{n,n}$ whose $(i,j)$ element is $Y_{(i,j)}$ if $i > j$ and is zero otherwise, $D(Y)$ is a diagonal matrix of the same size as matrix $Y$ whose $(i,i)$ element is $Y_{(i,i)}$, and $\mathcal{L}(P)$ denotes the Cholesky factor of $P$, i.e., $\mathcal{L}(P)$ is a lower-triangular matrix with positive diagonal entries such that $P = \mathcal{L}(P)\mathcal{L}(P)^T$. The inverse operation (Nguyen, 2022a) is given by
$$\ominus_{lc} P = \big(-\lfloor \mathcal{L}(P) \rfloor + D(\mathcal{L}(P))^{-1}\big)\big(-\lfloor \mathcal{L}(P) \rfloor + D(\mathcal{L}(P))^{-1}\big)^T.$$

G.5 GRASSMANN MANIFOLDS IN THE PROJECTOR PERSPECTIVE

For $P, Q \in \operatorname{Gr}_{n,p}$, the binary operation (Nguyen, 2022b) is given as
$$P \oplus_{gr} Q = \exp\big(\big[\operatorname{Log}^{gr}_{I_{n,p}}(P), I_{n,p}\big]\big)\, Q\, \exp\big(-\big[\operatorname{Log}^{gr}_{I_{n,p}}(P), I_{n,p}\big]\big),$$
where $[\cdot, \cdot]$ denotes the matrix commutator. The inverse operation (Nguyen, 2022b) is defined as
$$\ominus_{gr} P = \operatorname{Exp}^{gr}_{I_{n,p}}\big(-\operatorname{Log}^{gr}_{I_{n,p}}(P)\big).$$

G.6 GRASSMANN MANIFOLDS IN THE ONB PERSPECTIVE

For $U, V \in \widetilde{\operatorname{Gr}}_{n,p}$, the binary operation (Nguyen & Yang, 2023) is defined as
$$U \,\widetilde{\oplus}_{gr}\, V = \exp\big(\big[\operatorname{Log}^{gr}_{I_{n,p}}(UU^T), I_{n,p}\big]\big)\, V.$$
The inverse operation can be defined using the approach in Nguyen & Yang (2023) (see Section 2.3.1), i.e.,
$$\widetilde{\ominus}_{gr}\, U = \tau^{-1}\big(\ominus_{gr}(UU^T)\big),$$
where the mapping $\tau$ is defined in Proposition 3.12, i.e., $\tau : \widetilde{\operatorname{Gr}}_{n,p} \rightarrow \operatorname{Gr}_{n,p}$, $U \mapsto UU^T$.
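As a concrete illustration of the operations in Sections G.2-G.4, the following sketch implements the three SPD binary operations with standard matrix functions. The helper names and the eigendecomposition-based matrix square root and logarithm are our own choices and are not taken from the paper's implementation.

```python
import torch

def _sym_fun(P: torch.Tensor, fun) -> torch.Tensor:
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = torch.linalg.eigh(P)
    return V @ torch.diag(fun(w)) @ V.T

def oplus_ai(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Affine-Invariant gyroaddition: P^{1/2} Q P^{1/2}."""
    P_half = _sym_fun(P, torch.sqrt)
    return P_half @ Q @ P_half

def oplus_le(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Log-Euclidean gyroaddition: exp(log(P) + log(Q))."""
    return torch.matrix_exp(_sym_fun(P, torch.log) + _sym_fun(Q, torch.log))

def oplus_lc(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Log-Cholesky gyroaddition built from the Cholesky factors of P and Q."""
    LP, LQ = torch.linalg.cholesky(P), torch.linalg.cholesky(Q)
    strict = lambda L: torch.tril(L, diagonal=-1)    # strictly lower part (the floor operator)
    diag = lambda L: torch.diag(torch.diagonal(L))   # diagonal part D(.)
    L = strict(LP) + strict(LQ) + diag(LP) @ diag(LQ)
    return L @ L.T

# The identity matrix acts as a left identity for all three operations.
P = torch.diag(torch.tensor([2.0, 3.0]))
I = torch.eye(2)
for oplus in (oplus_ai, oplus_le, oplus_lc):
    assert torch.allclose(oplus(I, P), P, atol=1e-5)
```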
G.7 THE SPD AND GRASSMANN INNER PRODUCTS

Definition G.4 (The SPD Inner Product). Let $P, Q \in \operatorname{Sym}^{+,g}_n$. Then the SPD inner product of $P$ and $Q$ is defined as
$$\langle P, Q \rangle_g = \big\langle \operatorname{Log}^g_{I_n}(P),\ \operatorname{Log}^g_{I_n}(Q) \big\rangle^g_{I_n}.$$

Definition G.5 (The Grassmann Inner Product). Let $P, Q \in \operatorname{Gr}_{n,p}$. Then the Grassmann inner product of $P$ and $Q$ is defined as
$$\langle P, Q \rangle_{gr} = \big\langle \operatorname{Log}^{gr}_{I_{n,p}}(P),\ \operatorname{Log}^{gr}_{I_{n,p}}(Q) \big\rangle_{I_{n,p}},$$
where $\langle \cdot, \cdot \rangle_{I_{n,p}}$ denotes the inner product at $I_{n,p}$ given by the canonical metric of $\operatorname{Gr}_{n,p}$.

G.8 THE GYROCOSINE FUNCTION AND GYROANGLES IN STRUCTURE SPACES

Definition G.6 (The Gyrocosine Function and Gyroangles). Let $P$, $Q$, and $R$ be three distinct gyropoints in structure space $\widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$. The gyrocosine of the measure of the gyroangle $\alpha$, $0 \leq \alpha \leq \pi$, between $\ominus_{psd,g} P \oplus_{psd,g} Q$ and $\ominus_{psd,g} P \oplus_{psd,g} R$ is given by the equation
$$\cos \alpha = \frac{\big\langle \ominus_{psd,g} P \oplus_{psd,g} Q,\ \ominus_{psd,g} P \oplus_{psd,g} R \big\rangle_{psd,g}}{\big\| \ominus_{psd,g} P \oplus_{psd,g} Q \big\|_{psd,g}\, \big\| \ominus_{psd,g} P \oplus_{psd,g} R \big\|_{psd,g}},$$
where $\| \cdot \|_{psd,g}$ is the norm induced by the inner product in structure spaces. The gyroangle $\alpha$ is denoted by $\alpha = \angle QPR$.

G.9 THE GYRODISTANCE FUNCTION IN STRUCTURE SPACES

Definition G.7 (The Gyrodistance Function in Structure Spaces). Let $P, Q \in \widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$. Then the gyrodistance function in structure spaces $\widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$ is defined as
$$d(P, Q) = \big\| \ominus_{psd,g} P \oplus_{psd,g} Q \big\|_{psd,g}.$$

G.10 THE PSEUDO-GYRODISTANCE FUNCTION IN STRUCTURE SPACES

Definition G.8 (The Pseudo-gyrodistance Function in Structure Spaces). Let $H^{psd,g}_{W,P}$ be a hypergyroplane in structure space $\widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$, and $X \in \widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$. Then the pseudo-gyrodistance from $X$ to $H^{psd,g}_{W,P}$ is defined as
$$d\big(X, H^{psd,g}_{W,P}\big) = \sin(\angle X P \overline{Q})\, d(X, P),$$
where $\overline{Q}$ is given by
$$\overline{Q} = \operatorname*{arg\,max}_{Q \in H^{psd,g}_{W,P} \setminus \{P\}} \cos(\angle QPX).$$
By convention, $\sin(\angle X P \overline{Q}) = 0$ for any $X, \overline{Q} \in H^{psd,g}_{W,P}$.

H COMPUTATION OF CANONICAL REPRESENTATIONS

Let $V_{n,p}$ be the space of $n \times p$ matrices with orthonormal columns. For any $P \in S^+_{n,p}$, let $U_P \in \widetilde{\operatorname{Gr}}_{n,p}$ and $S_P \in \operatorname{Sym}^+_p$ be such that $P = U_P S_P U_P^T$. Denote by $W$ the common subspace used for computing a canonical representation of $P$. We first compute two bases of $\operatorname{span}(U_P)$ and $\operatorname{span}(W)$, denoted respectively by $\overline{U}$ and $\overline{W}$, such that
$$d_{V_{n,p}}\big(\overline{U}, \overline{W}\big) = d_{\widetilde{\operatorname{Gr}}_{n,p}}\big(\operatorname{span}(U_P), \operatorname{span}(W)\big),$$
where $d_{V_{n,p}}(\cdot, \cdot)$ and $d_{\widetilde{\operatorname{Gr}}_{n,p}}(\cdot, \cdot)$ are the distances between two points in $V_{n,p}$ and $\widetilde{\operatorname{Gr}}_{n,p}$, respectively. These two bases can be computed as
$$\overline{U} = U_P Y, \quad \overline{W} = W V,$$
where $Y$ and $V$ are obtained from an SVD of $(U_P)^T W$, i.e., $(U_P)^T W = Y (\cos \Sigma) V^T$. The SPD matrix $S_P$ in the canonical representation of $P$ is then computed as
$$S_P = V\, \overline{U}^T P\, \overline{U}\, V^T.$$
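The canonical representation described above requires only one SVD and a few matrix products. The sketch below follows the recipe of this section; the function name and the use of explicit $(U_P, S_P)$ inputs are our own illustrative choices.

```python
import torch

def canonical_representation(U_P: torch.Tensor, S_P: torch.Tensor, W: torch.Tensor):
    """Canonical representation of P = U_P S_P U_P^T w.r.t. the common subspace spanned by W.

    U_P, W: (n, p) matrices with orthonormal columns; S_P: (p, p) SPD matrix.
    Returns the aligned basis U_bar and the SPD matrix of the canonical representation.
    """
    # SVD of U_P^T W = Y cos(Sigma) V^T.
    Y, cos_sigma, Vh = torch.linalg.svd(U_P.T @ W, full_matrices=False)
    V = Vh.T
    U_bar = U_P @ Y                        # basis of span(U_P) aligned with the basis W V of span(W)
    P = U_P @ S_P @ U_P.T
    S_bar = V @ U_bar.T @ P @ U_bar @ V.T  # SPD part of the canonical representation
    return U_bar, S_bar
```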
I PROOF OF PROPOSITION 3.2

Proof. We first recall the definition of the binary operation $\oplus_g$ in Nguyen (2022b).

Definition I.1 (The Binary Operation (Nguyen, 2022b)). Let $P, Q \in \operatorname{Sym}^+_n$. Then the binary operation $\oplus_g$ is defined as
$$P \oplus_g Q = \operatorname{Exp}^g_P\big(T^g_{I_n \rightarrow P}\big(\operatorname{Log}^g_{I_n}(Q)\big)\big),$$
where $T^g_{I_n \rightarrow P}$ denotes the parallel transport from $I_n$ to $P$.

We have
$$\big\langle \operatorname{Log}^g_P(Q), W \big\rangle^g_P \overset{(1)}{=} \big\langle T^g_{P \rightarrow I_n}\big(\operatorname{Log}^g_P(Q)\big),\ T^g_{P \rightarrow I_n}(W) \big\rangle^g_{I_n} \overset{(2)}{=} \Big\langle \operatorname{Exp}^g_{I_n}\big(T^g_{P \rightarrow I_n}\big(\operatorname{Log}^g_P(Q)\big)\big),\ \operatorname{Exp}^g_{I_n}\big(T^g_{P \rightarrow I_n}(W)\big) \Big\rangle_g, \quad (5)$$
where (1) follows from the invariance of the inner product under parallel transport, and (2) follows from Definition G.4.

Let $R = \operatorname{Exp}^g_{I_n}\big(T^g_{P \rightarrow I_n}\big(\operatorname{Log}^g_P(Q)\big)\big)$. Then $\operatorname{Log}^g_{I_n}(R) = T^g_{P \rightarrow I_n}\big(\operatorname{Log}^g_P(Q)\big)$, which results in $T^g_{I_n \rightarrow P}\big(\operatorname{Log}^g_{I_n}(R)\big) = \operatorname{Log}^g_P(Q)$. Hence
$$\operatorname{Exp}^g_P\big(T^g_{I_n \rightarrow P}\big(\operatorname{Log}^g_{I_n}(R)\big)\big) = Q.$$
By the Left Cancellation Law, $Q = P \oplus_g (\ominus_g P \oplus_g Q)$. Therefore,
$$\operatorname{Exp}^g_P\big(T^g_{I_n \rightarrow P}\big(\operatorname{Log}^g_{I_n}(R)\big)\big) = P \oplus_g (\ominus_g P \oplus_g Q) = \operatorname{Exp}^g_P\big(T^g_{I_n \rightarrow P}\big(\operatorname{Log}^g_{I_n}(\ominus_g P \oplus_g Q)\big)\big),$$
where the last equality follows from Definition I.1. We thus have
$$\ominus_g P \oplus_g Q = R = \operatorname{Exp}^g_{I_n}\big(T^g_{P \rightarrow I_n}\big(\operatorname{Log}^g_P(Q)\big)\big). \quad (6)$$
Combining Eqs. (5) and (6), we get
$$\big\langle \operatorname{Log}^g_P(Q), W \big\rangle^g_P = \big\langle \ominus_g P \oplus_g Q,\ \operatorname{Exp}^g_{I_n}\big(T^g_{P \rightarrow I_n}(W)\big) \big\rangle_g,$$
which concludes the proof of Proposition 3.2.

J PROOF OF PROPOSITION 3.4

Proof. The first part of Proposition 3.4 can be easily verified using the definition of the SPD inner product (see Definition G.4) and that of Affine-Invariant metrics (Pennec et al., 2020) (see Chapter 3).

To prove the second part of Proposition 3.4, we use the notion of SPD pseudo-gyrodistance (Nguyen & Yang, 2023) in our interpretation of FC layers on SPD manifolds, i.e., the signed distance is replaced with the signed SPD pseudo-gyrodistance in the interpretation given in Section 3.2.1. First, we need the following result from Nguyen & Yang (2023).

Theorem J.1 (The SPD Pseudo-gyrodistance from an SPD Matrix to an SPD Hypergyroplane in an AI Gyrovector Space (Nguyen & Yang, 2023)). Let $H_{W,P}$ be an SPD hypergyroplane in a gyrovector space $(\operatorname{Sym}^+_n, \oplus_{ai}, \otimes_{ai})$, and $X \in \operatorname{Sym}^+_n$. Then the SPD pseudo-gyrodistance from $X$ to $H_{W,P}$ is given by
$$d(X, H_{W,P}) = \frac{\big| \big\langle \log\big(P^{-\frac{1}{2}} X P^{-\frac{1}{2}}\big),\ P^{-\frac{1}{2}} W P^{-\frac{1}{2}} \big\rangle_F \big|}{\big\| P^{-\frac{1}{2}} W P^{-\frac{1}{2}} \big\|_F}.$$
By Theorem J.1, the signed SPD pseudo-gyrodistance from $Y$ to an SPD hypergyroplane that contains the origin and is orthogonal to the $E^{ai}_{(i,j)}$ axis is given by
$$d\big(Y, H_{\operatorname{Log}^{ai}_{I_m}(E^{ai}_{(i,j)}), I_m}\big) = \frac{\big\langle \log(Y),\ \operatorname{Log}^{ai}_{I_m}\big(E^{ai}_{(i,j)}\big) \big\rangle_F}{\big\| \operatorname{Log}^{ai}_{I_m}\big(E^{ai}_{(i,j)}\big) \big\|_F}.$$
According to our interpretation of FC layers,
$$v_{(i,j)}(X) = \frac{\big\langle \log(Y),\ \operatorname{Log}^{ai}_{I_m}\big(E^{ai}_{(i,j)}\big) \big\rangle_F}{\big\| \operatorname{Log}^{ai}_{I_m}\big(E^{ai}_{(i,j)}\big) \big\|_F}.$$
We consider two cases.

Case 1: $i < j$.
$$v_{(i,j)}(X) = \frac{\big\langle \log(Y),\ \tfrac{1}{\sqrt{2}}\big(e_i e_j^T + e_j e_i^T\big) \big\rangle_F}{\big\| \tfrac{1}{\sqrt{2}}\big(e_i e_j^T + e_j e_i^T\big) \big\|_F} = \Big\langle \log(Y),\ \tfrac{1}{\sqrt{2}}\big(e_i e_j^T + e_j e_i^T\big) \Big\rangle_F = \frac{\log(Y)_{(i,j)} + \log(Y)_{(j,i)}}{\sqrt{2}} = \sqrt{2}\, \log(Y)_{(i,j)}.$$
We thus deduce that $\log(Y)_{(i,j)} = \frac{1}{\sqrt{2}} v_{(i,j)}(X)$.

Case 2: $i = j$.
$$v_{(i,i)}(X) = \frac{\Big\langle \log(Y),\ e_i e_i^T - \frac{1}{m}\Big(1 - \frac{1}{1 + m\beta}\Big) I_m \Big\rangle_F}{\Big\| e_i e_i^T - \frac{1}{m}\Big(1 - \frac{1}{1 + m\beta}\Big) I_m \Big\|_F} = \Big\langle \log(Y),\ e_i e_i^T - \frac{1}{m}\Big(1 - \frac{1}{1 + m\beta}\Big) I_m \Big\rangle_F.$$
This leads to
$$v_{(i,i)}(X) = \log(Y)_{(i,i)} - \frac{1}{m}\Big(1 - \frac{1}{1 + m\beta}\Big) \sum_{j=1}^m \log(Y)_{(j,j)}, \quad (7)$$
for $i = 1, \ldots, m$. By summing up $v_{(i,i)}(X)$, $i = 1, \ldots, m$, we get
$$\sum_{i=1}^m v_{(i,i)}(X) = \frac{1}{1 + m\beta} \sum_{i=1}^m \log(Y)_{(i,i)},$$
or equivalently,
$$\sum_{i=1}^m \log(Y)_{(i,i)} = (1 + m\beta) \sum_{i=1}^m v_{(i,i)}(X). \quad (8)$$
Replacing the term $\sum_{j=1}^m \log(Y)_{(j,j)}$ in Eq. (7) with the expression on the right-hand side of Eq. (8) results in
$$\log(Y)_{(i,i)} = v_{(i,i)}(X) + \beta \sum_{j=1}^m v_{(j,j)}(X).$$
Note that $Y = \exp\big([\log(Y)_{(i,j)}]_{i,j=1}^m\big)$. This concludes the proof of Proposition 3.4.

K PROOF OF PROPOSITION 3.5

Proof. This proposition is a direct consequence of Proposition 3.4 for $\beta = 0$.
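The closed form derived in the proof of Proposition 3.4 can be evaluated directly: given the values $v_{(i,j)}(X)$ produced by the FC layer, $\log(Y)$ is assembled entry-wise and exponentiated. The sketch below is our own illustration of this reconstruction (with $\beta = 0$ recovering Proposition 3.5) and assumes the values are stored in the upper triangle of a square tensor; it is not the paper's implementation.

```python
import torch

def spd_fc_output_ai(v: torch.Tensor, beta: float = 0.0) -> torch.Tensor:
    """Assemble Y from the FC-layer values v_{(i,j)}(X) under the AI metric (Proposition 3.4).

    v: (m, m) tensor holding v_{(i,j)}(X) for i <= j (the strictly lower part is ignored).
    """
    diag_v = torch.diagonal(v)
    # Off-diagonal entries: log(Y)_{(i,j)} = v_{(i,j)}(X) / sqrt(2), symmetrized.
    upper = torch.triu(v, diagonal=1) / (2.0 ** 0.5)
    log_Y = upper + upper.T
    # Diagonal entries: log(Y)_{(i,i)} = v_{(i,i)}(X) + beta * sum_j v_{(j,j)}(X).
    log_Y = log_Y + torch.diag(diag_v + beta * diag_v.sum())
    return torch.matrix_exp(log_Y)
```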
L PROOF OF PROPOSITION 3.6

Proof. The first part of Proposition 3.6 can be easily verified using the definition of the SPD inner product (see Definition G.4) and that of Log-Cholesky metrics (Lin, 2019).

To prove the second part of Proposition 3.6, we first recall the following result from Nguyen & Yang (2023).

Theorem L.1 (The SPD Gyrodistance from an SPD Matrix to an SPD Hypergyroplane in an LC Gyrovector Space (Nguyen & Yang, 2023)). Let $H_{W,P}$ be an SPD hypergyroplane in a gyrovector space $(\operatorname{Sym}^+_n, \oplus_{lc}, \otimes_{lc})$, and $X \in \operatorname{Sym}^+_n$. Then the SPD pseudo-gyrodistance from $X$ to $H_{W,P}$ is equal to the SPD gyrodistance from $X$ to $H_{W,P}$ and is given by
$$d(X, H_{W,P}) = \frac{|\langle A, B \rangle_F|}{\| B \|_F},$$
where
$$A = -\lfloor \phi(P) \rfloor + \lfloor \phi(X) \rfloor + \log\big(D(\phi(P))^{-1} D(\phi(X))\big), \quad B = \lfloor \widetilde{W} \rfloor + D(\phi(P))^{-1} D(\widetilde{W}), \quad \widetilde{W} = \phi(P)\, \big\lfloor \phi(P)^{-1} W \big(\phi(P)^{-1}\big)^T \big\rfloor,$$
where $\lfloor Y \rfloor$ and $D(Y)$, $Y \in M_{n,n}$, are defined in Section G.4, and $\phi(P) = \mathcal{L}(P)$.

By Theorem L.1, the signed SPD pseudo-gyrodistance from $Y$ to an SPD hypergyroplane that contains the origin and is orthogonal to the $E^{lc}_{(i,j)}$ axis is given by
$$d\big(Y, H_{\operatorname{Log}^{lc}_{I_m}(E^{lc}_{(i,j)}), I_m}\big) = \frac{\big\langle \lfloor \phi(Y) \rfloor + \log(D(\phi(Y))),\ \operatorname{Log}^{lc}_{I_m}\big(E^{lc}_{(i,j)}\big) \big\rangle_F}{\big\| \operatorname{Log}^{lc}_{I_m}\big(E^{lc}_{(i,j)}\big) \big\|_F}.$$
According to our interpretation of FC layers,
$$v_{(i,j)}(X) = \frac{\big\langle \lfloor \phi(Y) \rfloor + \log(D(\phi(Y))),\ \operatorname{Log}^{lc}_{I_m}\big(E^{lc}_{(i,j)}\big) \big\rangle_F}{\big\| \operatorname{Log}^{lc}_{I_m}\big(E^{lc}_{(i,j)}\big) \big\|_F}.$$
We consider two cases.

Case 1: $i < j$.
$$v_{(i,j)}(X) = \frac{\big\langle \lfloor \phi(Y) \rfloor + \log(D(\phi(Y))),\ e_j e_i^T \big\rangle_F}{\| e_j e_i^T \|_F} = \big\langle \lfloor \phi(Y) \rfloor + \log(D(\phi(Y))),\ e_j e_i^T \big\rangle_F = \phi(Y)_{(j,i)}.$$
We thus have $\phi(Y)_{(j,i)} = v_{(i,j)}(X)$.

Case 2: $i = j$.
$$v_{(i,i)}(X) = \frac{\big\langle \lfloor \phi(Y) \rfloor + \log(D(\phi(Y))),\ e_i e_i^T \big\rangle_F}{\| e_i e_i^T \|_F} = \big\langle \lfloor \phi(Y) \rfloor + \log(D(\phi(Y))),\ e_i e_i^T \big\rangle_F = \log\big(\phi(Y)_{(i,i)}\big).$$
Hence $\phi(Y)_{(i,i)} = \exp\big(v_{(i,i)}(X)\big)$. Setting $\phi(Y) = [y_{(i,j)}]_{i,j=1}^m$, the entries $y_{(i,j)}$ are given by
$$y_{(i,j)} = \begin{cases} \exp\big(v_{(i,i)}(X)\big), & \text{if } i = j, \\ v_{(j,i)}(X), & \text{if } i > j, \\ 0, & \text{if } i < j. \end{cases}$$
Since $\phi(Y)$ is the Cholesky factor of $Y$, we have $Y = \phi(Y)\phi(Y)^T$, which concludes the proof of Proposition 3.6.

M PROOF OF THEOREM 3.11

Proof. Let $H^{psd,g}_{W,P}$ be a hypergyroplane in structure space $\widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$ and $X \in \widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$. By the definition of the pseudo-gyrodistance function,
$$d\big(X, H^{psd,g}_{W,P}\big) = \sin(\angle X P \overline{Q})\, d(X, P),$$
where $\overline{Q}$ is given by
$$\overline{Q} = \operatorname*{arg\,max}_{Q \in H^{psd,g}_{W,P} \setminus \{P\}} \cos(\angle QPX) = \operatorname*{arg\,max}_{Q \in H^{psd,g}_{W,P} \setminus \{P\}} \frac{\big\langle \ominus_{psd,g} P \oplus_{psd,g} Q,\ \ominus_{psd,g} P \oplus_{psd,g} X \big\rangle_{psd,g}}{\big\| \ominus_{psd,g} P \oplus_{psd,g} Q \big\|_{psd,g}\, \big\| \ominus_{psd,g} P \oplus_{psd,g} X \big\|_{psd,g}}.$$
By the definitions of the binary and inverse operations in structure spaces,
$$\ominus_{psd,g} P \oplus_{psd,g} X = \big(\widetilde{\ominus}_{gr} U_P \,\widetilde{\oplus}_{gr}\, U_X,\ \ominus_g S_P \oplus_g S_X\big), \quad \ominus_{psd,g} P \oplus_{psd,g} Q = \big(\widetilde{\ominus}_{gr} U_P \,\widetilde{\oplus}_{gr}\, U_Q,\ \ominus_g S_P \oplus_g S_Q\big).$$
Therefore,
$$\big\langle \ominus_{psd,g} P \oplus_{psd,g} X,\ \ominus_{psd,g} P \oplus_{psd,g} Q \big\rangle_{psd,g} = \lambda \big\langle (\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_X)(\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_X)^T,\ (\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_Q)(\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_Q)^T \big\rangle_{gr} + \big\langle \ominus_g S_P \oplus_g S_X,\ \ominus_g S_P \oplus_g S_Q \big\rangle_g.$$
Let
$$A_1 = \operatorname{Log}^{gr}_{I_{n,p}}\big((\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_X)(\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_X)^T\big), \quad B_1 = \operatorname{Log}^{gr}_{I_{n,p}}\big((\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_Q)(\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_Q)^T\big),$$
$$A_2 = \operatorname{Log}^g_{I_p}\big(\ominus_g S_P \oplus_g S_X\big), \quad B_2 = \operatorname{Log}^g_{I_p}\big(\ominus_g S_P \oplus_g S_Q\big).$$
Then we have
$$\overline{Q} = \operatorname*{arg\,max}_{Q \in H^{psd,g}_{W,P} \setminus \{P\}} \frac{\lambda \langle A_1, B_1 \rangle_F + \langle A_2, B_2 \rangle_F}{\sqrt{\lambda \| A_1 \|_F^2 + \| A_2 \|_F^2}\ \sqrt{\lambda \| B_1 \|_F^2 + \| B_2 \|_F^2}} = \operatorname*{arg\,max}_{Q \in H^{psd,g}_{W,P} \setminus \{P\}} \frac{\big\langle [\sqrt{\lambda} A_1 \,\Vert\, A_2],\ [\sqrt{\lambda} B_1 \,\Vert\, B_2] \big\rangle_F}{\big\| [\sqrt{\lambda} A_1 \,\Vert\, A_2] \big\|_F\ \big\| [\sqrt{\lambda} B_1 \,\Vert\, B_2] \big\|_F}, \quad (9)$$
where $[\cdot \,\Vert\, \cdot]$ is the concatenation operation similar to operation $\operatorname{concat}_{spd}(\cdot)$.

From the equation of hypergyroplanes in structure space $\widetilde{\operatorname{Gr}}_{n,p} \times \operatorname{Sym}^+_p$,
$$\big\langle \ominus_{psd,g} P \oplus_{psd,g} Q,\ W \big\rangle_{psd,g} = 0.$$
Let $W = (U_W, S_W)$. Then we have
$$\lambda \big\langle (\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_Q)(\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_Q)^T,\ U_W (U_W)^T \big\rangle_{gr} + \big\langle \ominus_g S_P \oplus_g S_Q,\ S_W \big\rangle_g = 0. \quad (10)$$
Let $W_1 = \operatorname{Log}^{gr}_{I_{n,p}}\big(U_W (U_W)^T\big)$ and $W_2 = \operatorname{Log}^g_{I_p}(S_W)$. Then Eq. (10) can be rewritten as
$$\lambda \langle B_1, W_1 \rangle_F + \langle B_2, W_2 \rangle_F = 0,$$
which is equivalent to
$$\big\langle [\sqrt{\lambda} B_1 \,\Vert\, B_2],\ [\sqrt{\lambda} W_1 \,\Vert\, W_2] \big\rangle_F = 0. \quad (11)$$
Now, the problem in (9) is to find the minimum angle between the vector $[\sqrt{\lambda} A_1 \,\Vert\, A_2]$ and the Euclidean hyperplane described by Eq. (11). The pseudo-gyrodistance from $X$ to $H^{psd,g}_{W,P}$ can thus be obtained as
$$d\big(X, H^{psd,g}_{W,P}\big) = \frac{\big| \big\langle [\sqrt{\lambda} A_1 \,\Vert\, A_2],\ [\sqrt{\lambda} W_1 \,\Vert\, W_2] \big\rangle_F \big|}{\big\| [\sqrt{\lambda} W_1 \,\Vert\, W_2] \big\|_F} = \frac{\big| \lambda \langle A_1, W_1 \rangle_F + \langle A_2, W_2 \rangle_F \big|}{\sqrt{\lambda \| W_1 \|_F^2 + \| W_2 \|_F^2}}.$$
Some simple manipulations lead to
$$d\big(X, H^{psd,g}_{W,P}\big) = \frac{\big| \lambda \big\langle (\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_X)(\widetilde{\ominus}_{gr} U_P \widetilde{\oplus}_{gr} U_X)^T,\ U_W U_W^T \big\rangle_{gr} + \big\langle \ominus_g S_P \oplus_g S_X,\ S_W \big\rangle_g \big|}{\sqrt{\lambda \big(\| U_W U_W^T \|_{gr}\big)^2 + \big(\| S_W \|_g\big)^2}},$$
which concludes the proof of Theorem 3.11.
N PROOF OF PROPOSITION 3.12

Proof. We need the following result from Nguyen & Yang (2023).

Proposition N.1. Let $\mathcal{M}$ and $\mathcal{N}$ be two Riemannian manifolds and let $\varphi : \mathcal{M} \rightarrow \mathcal{N}$ be an isometry. Then
$$\operatorname{Log}_P(Q) = \big(D\varphi^{-1}_{\varphi(P)}\big)\big(\widetilde{\operatorname{Log}}_{\varphi(P)}(\varphi(Q))\big),$$
where $P, Q \in \mathcal{M}$, $D\tau_R(W)$ denotes the directional derivative of a mapping $\tau$ at a point $R \in \mathcal{N}$ along direction $W \in T_R\mathcal{N}$, and $\operatorname{Log}(\cdot)$ and $\widetilde{\operatorname{Log}}(\cdot)$ are the logarithmic maps in manifolds $\mathcal{M}$ and $\mathcal{N}$, respectively.

We adopt the notations in Bendokat et al. (2020). The Riemannian metric $g^O_Q(\cdot, \cdot)$ on $O_n$ is the standard inner product given (Edelman et al., 1998; Bendokat et al., 2020) as
$$g^O_Q(\Omega_1, \Omega_2) = \operatorname{Tr}\big(\Omega_1^T \Omega_2\big),$$
where $Q \in O_n$ and $\Omega_1, \Omega_2 \in T_Q O_n$.

Let $U \in \widetilde{\operatorname{Gr}}_{n,p}$ and $D_1, D_2 \in T_U \widetilde{\operatorname{Gr}}_{n,p}$. The canonical metric $g^{\widetilde{\operatorname{Gr}}}_U(D_1, D_2)$ on $\widetilde{\operatorname{Gr}}_{n,p}$ is the restriction of the Riemannian metric $g^O_Q(\cdot, \cdot)$ to the horizontal space of $T_Q O_n$ (multiplied by $1/2$) and is given (Edelman et al., 1998; Bendokat et al., 2020) by
$$g^{\widetilde{\operatorname{Gr}}}_U(D_1, D_2) = \operatorname{Tr}\Big(D_1^T \Big(I_n - \frac{1}{2} U U^T\Big) D_2\Big). \quad (12)$$

Let $P \in \operatorname{Gr}_{n,p}$ and $\Delta_1, \Delta_2 \in T_P \operatorname{Gr}_{n,p}$. The canonical metric $g^{\operatorname{Gr}}_P(\Delta_1, \Delta_2)$ on $\operatorname{Gr}_{n,p}$ is the restriction of the Riemannian metric $g^O_Q(\cdot, \cdot)$ to the horizontal space of $T_Q O_n$ (multiplied by $1/2$) and is given (Edelman et al., 1998; Bendokat et al., 2020) by
$$g^{\operatorname{Gr}}_P(\Delta_1, \Delta_2) = \frac{1}{2} \operatorname{Tr}\big((\Delta^{hor}_{1,Q})^T \Delta^{hor}_{2,Q}\big),$$
where $\Delta^{hor}_{1,Q}$ and $\Delta^{hor}_{2,Q}$ are the horizontal lifts of $\Delta_1$ and $\Delta_2$ to $Q$, respectively. Here, $Q$ is related to $P$ by $Q = (U\ \ U_\perp)$ and $P = UU^T$, where $U \in \widetilde{\operatorname{Gr}}_{n,p}$ and $U_\perp$ is the orthogonal completion of $U$.

Denote by $\operatorname{Hor}_U \widetilde{\operatorname{Gr}}_{n,p}$ the horizontal space of $T_U \widetilde{\operatorname{Gr}}_{n,p}$. This subspace is characterized by
$$\operatorname{Hor}_U \widetilde{\operatorname{Gr}}_{n,p} = \{U_\perp B \mid B \in M_{n-p,p}\}.$$
From Eq. (3.2) in Bendokat et al. (2020),
$$g^{\operatorname{Gr}}_P(\Delta_1, \Delta_2) = \operatorname{Tr}\big((\Delta^{hor}_{1,U})^T \Delta^{hor}_{2,U}\big), \quad (13)$$
where $\Delta^{hor}_{1,U}$ and $\Delta^{hor}_{2,U}$ are the horizontal lifts of $\Delta_1$ and $\Delta_2$ to $U$, respectively. Therefore, by Eq. (12),
$$
\begin{aligned}
g^{\widetilde{\operatorname{Gr}}}_U\big(\Delta^{hor}_{1,U}, \Delta^{hor}_{2,U}\big) &= \operatorname{Tr}\Big((\Delta^{hor}_{1,U})^T \Big(I_n - \frac{1}{2} U U^T\Big) \Delta^{hor}_{2,U}\Big) \\
&= \operatorname{Tr}\big((\Delta^{hor}_{1,U})^T \Delta^{hor}_{2,U}\big) - \frac{1}{2} \operatorname{Tr}\big((\Delta^{hor}_{1,U})^T U U^T \Delta^{hor}_{2,U}\big) \\
&= \operatorname{Tr}\big((\Delta^{hor}_{1,U})^T \Delta^{hor}_{2,U}\big) - \frac{1}{2} \operatorname{Tr}\big((U_\perp B_1)^T U U^T U_\perp B_2\big) \\
&= \operatorname{Tr}\big((\Delta^{hor}_{1,U})^T \Delta^{hor}_{2,U}\big) - \frac{1}{2} \operatorname{Tr}\big(B_1^T U_\perp^T U U^T U_\perp B_2\big) \\
&= \operatorname{Tr}\big((\Delta^{hor}_{1,U})^T \Delta^{hor}_{2,U}\big), \quad (14)
\end{aligned}
$$
where the last equality follows from the fact that $U_\perp^T U = 0$. Combining Eqs. (13) and (14), we get
$$g^{\operatorname{Gr}}_P(\Delta_1, \Delta_2) = g^{\widetilde{\operatorname{Gr}}}_U\big(\Delta^{hor}_{1,U}, \Delta^{hor}_{2,U}\big).$$
By Proposition N.1,
$$\operatorname{Log}^{gr}_P(F) = \big(D\tau_{\tau^{-1}(P)}\big)\big(\widetilde{\operatorname{Log}}^{gr}_{\tau^{-1}(P)}\big(\tau^{-1}(F)\big)\big),$$
where $P, F \in \operatorname{Gr}_{n,p}$. From Eq. (3.15) in Bendokat et al. (2020),
$$D\tau_R(W) = R W^T + W R^T.$$
Hence
$$\operatorname{Log}^{gr}_P(F) = \tau^{-1}(P)\, \big(\widetilde{\operatorname{Log}}^{gr}_{\tau^{-1}(P)}\big(\tau^{-1}(F)\big)\big)^T + \widetilde{\operatorname{Log}}^{gr}_{\tau^{-1}(P)}\big(\tau^{-1}(F)\big)\, \tau^{-1}(P)^T,$$
which concludes the proof of Proposition 3.12.
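Proposition 3.12 gives a direct, approximation-free route to the projector-perspective logarithmic map: extract ONB representatives, apply the ONB logarithmic map of Section E.2.2, and push the result through $D\tau$. The sketch below is our own illustration; in particular, the helper `onb_from_projector`, which recovers $\tau^{-1}(P)$ from the top-$p$ eigenvectors of $P$, is an assumption about one possible way to realize $\tau^{-1}$.

```python
import torch

def onb_from_projector(P: torch.Tensor, p: int) -> torch.Tensor:
    """tau^{-1}(P): an orthonormal basis of the rank-p projector P (its top-p eigenvectors)."""
    w, V = torch.linalg.eigh(P)                # eigenvalues in ascending order
    return V[:, -p:]

def grassmann_log_onb(U: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """ONB-perspective logarithmic map (Section E.2.2)."""
    n = U.shape[0]
    M = (torch.eye(n, dtype=U.dtype) - U @ U.T) @ V @ torch.linalg.inv(U.T @ V)
    Us, S, Vh = torch.linalg.svd(M, full_matrices=False)
    return Us @ torch.diag(torch.arctan(S)) @ Vh

def grassmann_log_projector(P: torch.Tensor, F: torch.Tensor, p: int) -> torch.Tensor:
    """Projector-perspective Log^gr_P(F) via Proposition 3.12."""
    U = onb_from_projector(P, p)               # tau^{-1}(P)
    V = onb_from_projector(F, p)               # tau^{-1}(F)
    D = grassmann_log_onb(U, V)                # ONB-perspective tangent vector at U
    return U @ D.T + D @ U.T                   # apply D tau: R W^T + W R^T
```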