# Building Deep Networks on Grassmann Manifolds

Zhiwu Huang, Jiqing Wu, Luc Van Gool
Computer Vision Lab, ETH Zurich, Switzerland; VISICS, KU Leuven, Belgium
{zhiwu.huang, jiqing.wu, vangool}@vision.ee.ethz.ch

## Abstract

Learning representations on Grassmann manifolds is popular in quite a few visual recognition tasks. In order to enable deep learning on Grassmann manifolds, this paper proposes a deep network architecture that generalizes the Euclidean network paradigm to Grassmann manifolds. In particular, we design full rank mapping layers to transform input Grassmannian data into more desirable ones, exploit re-orthonormalization layers to normalize the resulting matrices, study projection pooling layers to reduce the model complexity in the Grassmannian context, and devise projection mapping layers to respect Grassmannian geometry while achieving Euclidean forms for regular output layers. To train the Grassmann networks, we exploit a stochastic gradient descent setting on manifolds for the connection weights, and study a matrix generalization of backpropagation to update the structured data. Evaluations on three visual recognition tasks show that our Grassmann networks have clear advantages over existing Grassmann learning methods, and achieve results comparable with state-of-the-art approaches.

## Introduction

This paper introduces a deep network architecture on Grassmannians, which are manifolds of linear subspaces. In many computer vision applications, linear subspaces have become a core representation. For example, for face verification (Huang et al. 2015b), emotion estimation (Liu et al. 2014b) and activity recognition (Cherian et al. 2017), the image sets of a single person are often modeled by low-dimensional subspaces that are then compared on Grassmannians. Besides, for video classification, it is also very common to use the autoregressive and moving average (ARMA) model (Vemulapalli, Pillai, and Chellappa 2013), whose parameters are known to be modeled by a high-dimensional linear subspace. For shape analysis, the widely-used affine and linear shape spaces for specific configurations can also be identified with points on the Grassmann manifold (Anirudh et al. 2017). Applications involving dynamic environments and autonomous agents often perform online visual learning by using subspace tracking techniques such as incremental principal component analysis (PCA) to dynamically learn a better representational model as the appearance of the moving target changes (Turaga et al. 2011).

The popular applications of Grassmannian data motivate us to build a deep neural network architecture for Grassmannian representation learning. To this end, the new network architecture is designed to accept Grassmannian data directly as input, and to learn new, more favorable Grassmannian representations that improve the final visual recognition tasks. In other words, the new network aims to deeply learn Grassmannian features on their underlying Riemannian manifolds in an end-to-end learning architecture. In summary, two main contributions are made by this paper:

- We explore a novel deep network architecture in the context of Grassmann manifolds, where it has not been possible to apply deep neural networks. More generally, treating Grassmannian data in deep networks can be very valuable in a variety of machine learning applications.
- We generalize backpropagation to train the proposed network by deriving a connection weight update rule on a specific Riemannian manifold. Furthermore, we incorporate QR decomposition into backpropagation, which might prove useful in other applications since QR decomposition is a very common linear algebra operation.

## Background

### Grassmannian Geometry

A Grassmann manifold $Gr(q, D)$ is a $q(D-q)$-dimensional compact Riemannian manifold, which is the set of $q$-dimensional linear subspaces of $\mathbb{R}^D$. Thus, each point on $Gr(q, D)$ is a linear subspace that is spanned by a related orthonormal basis matrix $X$ of size $D \times q$ such that $X^T X = I_q$, where $I_q$ is the identity matrix of size $q \times q$.

One of the most popular approaches to represent linear subspaces and approximate the true Grassmannian geodesic distance is the projection mapping framework $\Phi(X) = XX^T$ proposed by (Edelman, Arias, and Smith 1998). As the projection $\Phi(X)$ is a $D \times D$ symmetric matrix, a natural choice of inner product is $\langle X_1, X_2 \rangle_\Phi = \mathrm{tr}(\Phi(X_1)^T \Phi(X_2))$. The inner product induces a distance metric named the projection metric:

$d_p(X_1, X_2) = 2^{-1/2} \| X_1 X_1^T - X_2 X_2^T \|_F$,   (1)

where $\|\cdot\|_F$ denotes the matrix Frobenius norm. As proved in (Harandi et al. 2013), the projection metric can approximate the true geodesic distance up to a scale of $\sqrt{2}$.
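To make the projection-based geometry concrete, the following NumPy sketch evaluates the projection mapping and the projection metric of Eqn. 1 for two randomly generated subspaces. It is an illustrative example of ours, not code from the paper; the helper `orthonormal_basis` is a hypothetical convenience function.

```python
import numpy as np

def orthonormal_basis(A):
    """Return an orthonormal basis (D x q) of the column span of A via thin QR."""
    Q, _ = np.linalg.qr(A)
    return Q

def projection_metric(X1, X2):
    """Projection metric d_p(X1, X2) = 2^(-1/2) * ||X1 X1^T - X2 X2^T||_F (Eqn. 1)."""
    P1, P2 = X1 @ X1.T, X2 @ X2.T
    return np.linalg.norm(P1 - P2, 'fro') / np.sqrt(2)

# Two random points on Gr(q=10, D=400), the setting used later for AFEW.
rng = np.random.default_rng(0)
X1 = orthonormal_basis(rng.standard_normal((400, 10)))
X2 = orthonormal_basis(rng.standard_normal((400, 10)))
print(projection_metric(X1, X2))
```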
### Grassmann Learning

To perform discriminant learning on Grassmann manifolds, many works (Hamm and Lee 2008; Hamm and Lee 2009; Cetingul and Vidal 2009; Harandi et al. 2011; 2013; 2014) either adopt a tangent space approximation of the underlying manifolds, or exploit positive definite kernel functions to embed the manifolds into reproducing kernel Hilbert spaces. In both cases, any existing Euclidean technique can then be applied to the embedded data, since Hilbert spaces respect Euclidean geometry as well. For example, (Hamm and Lee 2008) first embeds the Grassmannian into a high-dimensional Hilbert space, and then applies traditional Fisher analysis methods. Obviously, most of these methods are limited to Mercer kernels, and hence are restricted to kernel-based classifiers. Moreover, their computational complexity increases steeply with the growing number of training samples.

More recently, a new learning scheme was proposed by (Huang et al. 2015b) to perform a geometry-aware dimensionality reduction from the original Grassmann manifold to another lower-dimensional, more discriminative Grassmann manifold. This better preserves the original Riemannian data structure, which commonly leads to more favorable classification performance, as studied in classical manifold learning. While (Huang et al. 2015b) has reached some success, it merely adopts a shallow learning scheme on Grassmann manifolds, which is still far from the best solution to the problem of representation learning on nonlinear manifolds. Accordingly, this paper attempts to open up the possibility of deep learning on Grassmannians.

### Manifold Network

By leveraging the paradigm of traditional neural networks, an increasing number of networks (Masci et al. 2015; Ionescu, Vantzos, and Sminchisescu 2015; Huang and Van Gool 2017) have been built over general manifold domains. For instance, (Masci et al. 2015) proposed a geodesic convolution on local geodesic coordinate systems to extract local patches on the shape manifold for shape analysis. In particular, the method implements convolutions by sliding a window over the shape manifold, and local geodesic coordinates are employed instead of image patches. In (Huang and Van Gool 2017), to deeply learn appropriate features on the manifolds of symmetric positive definite (SPD) matrices, a deep network structure was developed with some spectral layers, which can be trained by a variant of backpropagation. Nevertheless, to the best of our knowledge, this is the first work that studies a deep network architecture on Grassmann manifolds.

## Grassmann Network Architecture

In analogy to convolutional networks (ConvNets), the proposed Grassmann network (GrNet) also devises a Projection block containing fully connected convolution-like layers and normalization layers, named full rank mapping (FRMap) layers and re-orthonormalization (ReOrth) layers, respectively. Inspired by the geometry-aware manifold learning idea (Huang et al. 2015b), the FRMap layers are proposed to first perform transformations on the input orthonormal matrices of subspaces to generate new matrices by adopting a full rank mapping scheme. Then, the ReOrth layers are developed to normalize the output matrices of the FRMap layers so that they keep the basic orthogonality. In other words, the normalized data become orthonormal matrices that reside on a Stiefel manifold (a Stiefel manifold $St(d_k, d_{k-1})$ is the set of $d_k$-dimensional orthonormal matrices in $\mathbb{R}^{d_{k-1}}$).

As well studied in (Edelman, Arias, and Smith 1998; Hamm and Lee 2008), the projection metric operating on orthonormal matrices can represent linear subspaces and respects the geometry of Grassmann manifolds, each of which is actually a quotient manifold of the Stiefel manifold. Accordingly, we develop projection mapping (ProjMap) layers to maintain the Grassmannian property of the resulting data. Meanwhile, the ProjMap layers are able to transfer the resulting Grassmannian data into Euclidean data, which enables regular Euclidean layers such as softmax layers for classification. The ProjMap and softmax layers form the Output block of the GrNet. Additionally, since traditional pooling layers can reduce the network complexity, we also study projection pooling (ProjPooling) layers on the projection matrix form of the resulting orthonormal matrices. As it is non-trivial to perform pooling on non-Euclidean data directly, we develop a Pooling block that combines ProjMap, ProjPooling and orthonormal mapping (OrthMap) layers, which respectively achieve a Euclidean representation, perform pooling on the resulting Euclidean data, and transform the results back to orthonormal data. The proposed GrNet structure is illustrated in Fig. 1.

Figure 1: Conceptual illustration of the proposed Grassmann Network (GrNet) architecture. The rectangles in blue represent the three basic blocks, i.e., the Projection, Pooling and Output blocks, respectively.

### FRMap Layer

To learn compact and discriminative Grassmannian representations for better classification, we design the FRMap layers to first transform the input orthonormal matrices of subspaces into new matrices by a linear mapping function $f_{fr}$ as

$X_k = f^{(k)}_{fr}(X_{k-1}; W_k) = W_k X_{k-1}$,   (2)

where $X_{k-1} \in Gr(q, d_{k-1})$ is the input of the $k$-th layer (for consistency, $k$ denotes the relative index of each involved layer in the sequel), $W_k \in \mathbb{R}^{d_k \times d_{k-1}}$ ($d_k < d_{k-1}$) is the transformation matrix (connection weights), which is basically required to be a row full rank matrix, and $X_k \in \mathbb{R}^{d_k \times q}$ is the resulting matrix. Generally, the transformation result $W_k X_{k-1}$ is not an orthonormal basis matrix. To tackle this problem, we exploit a normalization strategy based on QR decomposition in the following ReOrth layer. In addition, as in classical deep networks, multiple projections $\{W_k^1, \ldots, W_k^m\}$ per FRMap layer can be applied to each input orthonormal matrix as well, where $m$ is the number of transformation matrices.
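As an illustration of Eqn. 2, the short sketch below (our own, with hypothetical helper names such as `frmap_forward`) applies m full-rank transformation matrices to one input basis; the outputs are in general no longer orthonormal, which is exactly what the ReOrth layer described next repairs.

```python
import numpy as np

def init_frmap_weights(d_out, d_in, m, rng):
    """m random full-rank d_out x d_in matrices (d_out < d_in), as the paper initializes FRMap."""
    # A Gaussian matrix is full rank with probability one.
    return [rng.standard_normal((d_out, d_in)) for _ in range(m)]

def frmap_forward(X_prev, weights):
    """Eqn. 2: X_k^i = W_k^i X_{k-1} for every transformation matrix W_k^i."""
    return [W @ X_prev for W in weights]

rng = np.random.default_rng(0)
X0, _ = np.linalg.qr(rng.standard_normal((400, 10)))       # a point on Gr(10, 400)
Ws = init_frmap_weights(100, 400, m=16, rng=rng)            # 16 projections, 400 -> 100
outputs = frmap_forward(X0, Ws)
print(len(outputs), outputs[0].shape)                       # 16 matrices of size 100 x 10
print(np.allclose(outputs[0].T @ outputs[0], np.eye(10)))   # False: orthonormality is lost
```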
The weight space $\mathbb{R}^{d_k \times d_{k-1}}$ of full rank matrices on each FRMap layer is a non-compact Stiefel manifold, on which the geodesic distance has no closed form. In view of gradient-steepest-descent learning by geodesic stepping for criterion optimization, it is necessary to have a closed form of the geodesic distance to derive natural gradients over a smooth manifold. Hence, optimizing on the non-compact manifold directly is infeasible unless the manifold is endowed with a pseudo-Riemannian metric. To handle this problem, one feasible solution is to impose an orthogonality constraint on the weight matrix $W_k$ so that it resides on a compact Stiefel manifold $St(d_k, d_{k-1})$. Obviously, such an orthogonal solution space is smaller than the original solution space $\mathbb{R}^{d_k \times d_{k-1}}$, making the optimization theoretically yield a suboptimal solution for $W_k$. To achieve a more faithful solution, following (Huang et al. 2015b), we alternatively perform the optimization over the manifold $PSD(d_k, d_{k-1})$ of the conjugates $P_k = W_k^T W_k$, which are positive semidefinite (PSD) matrices (a PSD manifold $PSD(d_k, d_{k-1})$ is the set of rank-$d_k$ positive semidefinite matrices of size $d_{k-1} \times d_{k-1}$ (Bonnabel and Sepulchre 2009; Journee et al. 2010)). As studied in (Bonnabel and Sepulchre 2009; Journee et al. 2010; Meyer, Bonnabel, and Sepulchre 2011) and implemented in the popular manopt toolbox (http://www.manopt.org/), a PSD manifold is actually a quotient space, and thus optimizing on it actually pursues the optimal full rank matrix $W_k$ directly. In this sense, optimizing on PSD manifolds actually minimizes $f(W_k)$ instead of $f(W_k^T W_k)$, where $f$ denotes the involved layer's loss function that will be introduced in the next section.

### ReOrth Layer

Inspired by (Kim, Kittler, and Cipolla 2007; Huang et al. 2015b), which use QR decomposition to transform a regular matrix into an orthonormal matrix, we design the ReOrth layers to perform QR decomposition on the input matrix $X_{k-1}$:

$X_{k-1} = Q_{k-1} R_{k-1}$,   (3)

where $Q_{k-1} \in \mathbb{R}^{d_{k-1} \times q}$ is the orthonormal matrix composed of the first $q$ columns, and $R_{k-1} \in \mathbb{R}^{q \times q}$ is the invertible upper-triangular matrix. Since $R$ is invertible and $Q$ is orthonormal, we can make $X_k$ an orthonormal basis matrix by normalizing $X_{k-1}$ in the $k$-th layer as

$X_k = f^{(k)}_{ro}(X_{k-1}) = X_{k-1} R_{k-1}^{-1} = Q_{k-1}$.   (4)

In the context of ConvNets, (Cybenko 1989; Jarrett et al. 2009; Nair and Hinton 2010; Goodfellow et al. 2013; He et al. 2015) have presented various nonlinear activation functions, e.g., rectified linear units (ReLU), to improve discriminative performance. Accordingly, exploiting this kind of layer to introduce non-linearity into the domain of the proposed GrNet is also necessary. In light of this, to some extent, the function in Eqn. 4 also plays the role of a non-linear activation via the QR factorization.
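A minimal sketch of Eqns. 3–4, assuming NumPy's reduced (thin) QR as the factorization routine (the paper does not prescribe an implementation): the layer keeps the Q factor, which is equivalent to right-multiplying the input by R^{-1}; R is also returned because the backward pass discussed later needs it.

```python
import numpy as np

def reorth_forward(X_prev):
    """ReOrth layer (Eqns. 3-4): X_k = X_{k-1} R^{-1} = Q via thin QR decomposition."""
    Q, R = np.linalg.qr(X_prev)          # X_prev = Q R, with Q: d x q and R: q x q upper triangular
    return Q, R                          # R is kept for the backward pass (Eqns. 13-15)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))       # e.g. an FRMap output, no longer orthonormal
Q, R = reorth_forward(X)
print(np.allclose(Q.T @ Q, np.eye(10)))  # True: the result is again an orthonormal basis
print(np.allclose(Q @ R, X))             # True: consistency of the factorization
```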
### ProjMap Layer

The ProjMap layer is designed to perform the Grassmannian computation on the resulting orthonormal matrices. As well studied in (Edelman, Arias, and Smith 1998; Hamm and Lee 2008; Hamm and Lee 2009; Harandi et al. 2011; Huang et al. 2015b), the projection metric is one of the most popular Grassmannian metrics, and is able to endow the specific Riemannian manifold with an inner product structure so that the manifold is reduced to a flat space. In the flat space, classical Euclidean computations can be applied to the projection domain of the orthonormal matrices. Formally, we apply the projection mapping (Edelman, Arias, and Smith 1998) to an orthonormal matrix $X_{k-1}$ in the $k$-th layer as

$X_k = f^{(k)}_{pm}(X_{k-1}) = X_{k-1} X_{k-1}^T$.   (5)

As for other Riemannian computations on Grassmann manifolds, please refer to (Le 1991; Edelman, Arias, and Smith 1998; Srivastava and Klassen 2004; Absil, Mahony, and Sepulchre 2009; Helmke, Hüper, and Trumpf 2007).

### ProjPooling Layer

It is known that classical pooling layers with max, min and mean pooling functions reduce the sizes of the representations to lower the model complexity, and therefore improve regular ConvNets. Motivated by this, we also design pooling layers for the feature maps of Grassmannian data. Without loss of generality, we here study mean pooling for Grassmannian points. There actually exist several approaches (Absil, Mahony, and Sepulchre 2004; Dodge and Rousson 1999; Srivastava and Klassen 2002; Marrinan et al. 2014) to compute mean points on Grassmannians. Inspired by the idea of (Srivastava and Klassen 2002), which keeps a balance between computational time and calculation accuracy, we propose three layers to implement mean pooling for Grassmannian data. In particular, the Grassmannian data are first mapped to the space of projection matrices by the ProjMap layer presented before. As the resulting $m$ projection matrices $\{X^i_{k-1} \mid 1 \le i \le m\}$ of size $d_{k-1} \times d_{k-1}$ are Euclidean, we then design a regular mean pooling layer for them. Lastly, we devise an orthonormal mapping (OrthMap) layer to transform the mean projection matrix back to orthonormal data. Formally, the functions for the ProjPooling and OrthMap layers are respectively defined as

$X_k = f^{(k)}_{pp}(\{\hat{X}^1_{k-1}, \ldots, \hat{X}^n_{k-1}\}) = \frac{1}{n} \sum_i \hat{X}^i_{k-1}$,   (6)

$X_{k+1} = f^{(k)}_{om}(X_k) = U_{k,1:q}$,   (7)

where $U_{k,1:q}$ consists of the $q$ leading eigenvectors obtained by eigenvalue (EIG) decomposition of the input projection matrix $X_k = U_k \Sigma_k U_k^T$, and $n$ is the number of instances $\hat{X}^i_{k-1}$ used for the pooling. The instances can be either $n$ projection matrices or the $n$ entries within a square patch of size $\sqrt{n} \times \sqrt{n}$ located in one projection matrix. In other words, the first type of pooling is performed across the projection matrices (A-ProjPooling), while the second one is executed by sliding the mean filter over each square patch within one projection matrix (W-ProjPooling). As a result, A-ProjPooling with OrthMap finally yields $m/n$ orthonormal matrices of size $d_{k-1} \times q$, while W-ProjPooling with OrthMap outputs $m$ orthonormal matrices of size $(d_{k-1}/\sqrt{n}) \times q$.
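The sketch below strings the three Pooling-block operations together for the within-matrix variant (W-ProjPooling), following our reading of Eqns. 5–7 with n = 4, i.e., 2 × 2 mean-pooling patches as in the experiments; the function names are our own and the block is an illustration rather than the authors' implementation.

```python
import numpy as np

def projmap(X):
    """ProjMap layer (Eqn. 5): X -> X X^T."""
    return X @ X.T

def w_projpooling(P, patch=2):
    """W-ProjPooling: mean-pool non-overlapping patch x patch blocks of one projection matrix."""
    d = P.shape[0]
    assert d % patch == 0
    return P.reshape(d // patch, patch, d // patch, patch).mean(axis=(1, 3))

def orthmap(P, q):
    """OrthMap layer (Eqn. 7): q leading eigenvectors of a symmetric matrix."""
    _, U = np.linalg.eigh(P)             # eigenvalues in ascending order
    return U[:, ::-1][:, :q]             # keep the top-q eigenvectors

rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((100, 10)))   # orthonormal output of a ReOrth layer
P = projmap(X)                                        # 100 x 100 projection matrix
P_pooled = w_projpooling(P, patch=2)                  # 50 x 50 after 2x2 mean pooling (n = 4)
X_next = orthmap(P_pooled, q=10)                      # back to a 50 x 10 orthonormal basis
print(P_pooled.shape, X_next.shape, np.allclose(X_next.T @ X_next, np.eye(10)))
```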
### Output Layers

As shown in Fig. 1, after applying the ProjMap layer, the outputs (i.e., the projection matrices) lie in a Euclidean space, and thus can be converted into vector forms. Hence, on top of the ProjMap layer, classical Euclidean network layers can be employed. For instance, a regular fully connected (FC) layer can be used after the ProjMap layer. The dimensionality of the filters in the FC layer is typically set to $d_k \times d_{k-1}$, where $d_k$ and $d_{k-1}$ are the number of classes and the dimensionality of the input vector forms, respectively. Finally, the common softmax layer can be used for visual classification.

## Training the Grassmann Network

As most layers in the GrNet model are expressed with complex matrix factorization functions, they cannot simply be constructed bottom-up from element-wise calculations. In other words, the matrix backpropagation (backprop) cannot be derived by the traditional treatment of element-wise operations in matrix form. As a result, simply using the traditional backprop breaks down in this setting. To solve the problem, (Huang and Van Gool 2017; Ionescu, Vantzos, and Sminchisescu 2015) introduced a manifold-valued connection weight update rule and matrix backprop, respectively. Furthermore, the convergence of the stochastic gradient descent (SGD) algorithm on Riemannian manifolds has also been well studied in (Bottou 2010; Bonnabel 2013). Accordingly, we build the training procedure for the proposed GrNet upon these existing works.

To begin with, we represent the proposed GrNet model as a sequence of successive function compositions $f = f^{(l)} \circ f^{(l-1)} \circ \cdots \circ f^{(2)} \circ f^{(1)}$ with a parameter tuple $W = (W_l, W_{l-1}, \ldots, W_1)$, where $f^{(k)}$ and $W_k$ are the function and the weight matrix, respectively, of the $k$-th layer, and $l$ is the number of layers. The loss of the $k$-th layer can be denoted by the function $L^{(k)} = \ell \circ f^{(l)} \circ \cdots \circ f^{(k)}$, where $\ell$ is the loss function of the last output layer.

Then, we recall the definition of the matrix backprop and its properties studied in (Ionescu, Vantzos, and Sminchisescu 2015). In particular, (Ionescu, Vantzos, and Sminchisescu 2015) exploits a function $\mathcal{F}$ to describe the variation of the upper layer variables with respect to the lower layer variables, i.e., $dX_k = \mathcal{F}(dX_{k-1})$. Consequently, a new version of the chain rule for the matrix backprop is defined as

$\frac{\partial L^{(k)}(X_{k-1}, y)}{\partial X_{k-1}} = \mathcal{F}^*\!\left(\frac{\partial L^{(k+1)}(X_k, y)}{\partial X_k}\right)$,   (8)

where $y$ is the desired output, $X_k = f^{(k)}(X_{k-1})$, and $\mathcal{F}^*$ is a non-linear adjoint operator of $\mathcal{F}$, i.e., $A : \mathcal{F}(B) = \mathcal{F}^*(A) : B$ with the matrix inner product $A : B = \mathrm{tr}(A^T B)$.

In the sequel, we detail the connection weight update on the specific PSD manifold and the matrix backprop process through some key layers of the proposed GrNet. For simplicity, we uniformly write $L^{(k)}$ for $L^{(k)}(X_{k-1}, y)$, $Q$ for $Q_{k-1}$, and $R$ for $R_{k-1}$.

### FRMap Layer

For the FRMap layers, we propose a new way of updating the weights appearing in Eqn. 2 by exploiting an SGD setting on PSD manifolds. As studied in (Absil, Mahony, and Sepulchre 2009), the steepest descent direction for the corresponding loss function $L^{(k)}(X_{k-1}, y)$ with respect to $W_k$ on a Riemannian manifold is the Riemannian gradient $\tilde{\nabla}_{W_k} L^{(k)}$. In particular, following the standard optimization on Riemannian manifolds (Absil, Mahony, and Sepulchre 2009), we first apply parallel transport to transfer the Euclidean gradient in the tangent space at the current weight $W_k^t$ to the one in the tangent space at the next status $W_k^{t+1}$. Then the normal component of the Euclidean gradient $\nabla_{W_k^t} L^{(k)}$ is subtracted from the Euclidean gradient. After this operation, searching along the tangential direction yields the update in the tangent space of the PSD manifold. Finally, the resulting update is mapped back to the PSD manifold with a retraction operation $\Gamma$. For more details about the geometry of PSD manifolds and the retraction operation on Riemannian manifolds, the reader is referred to (Bonnabel and Sepulchre 2009; Journee et al. 2010; Meyer, Bonnabel, and Sepulchre 2011; Absil, Mahony, and Sepulchre 2009).

Accordingly, the update of the current connection weight $W_k^t$ on the PSD manifold adheres to the following form:

$\tilde{\nabla}_{W_k^t} L^{(k)} = \nabla_{W_k^t} L^{(k)} - \nabla_{W_k^t} L^{(k)} (W_k^t)^T W_k^t$,   (9)

$W_k^{t+1} = \Gamma\!\left(W_k^t - \lambda \tilde{\nabla}_{W_k^t} L^{(k)}\right)$,   (10)

where $\lambda$ is the learning rate and $\nabla_{W_k^t} L^{(k)} (W_k^t)^T W_k^t$ is the normal component of the Euclidean gradient $\nabla_{W_k^t} L^{(k)}$. By employing the conventional backprop, $\nabla_{W_k^t} L^{(k)}$ is computed by

$\nabla_{W_k^t} L^{(k)} = \frac{\partial L^{(k+1)}}{\partial X_k} \frac{\partial f^{(k)}(X_{k-1})}{\partial W_k^t} = \frac{\partial L^{(k+1)}}{\partial X_k} X_{k-1}^T$.   (11)
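The update of Eqns. 9–11 can be sketched as follows. Two points are assumptions of ours rather than specifications from the paper: the Euclidean gradient is formed as ∂L/∂X_k · X_{k-1}^T (our reading of Eqn. 11), and the abstract retraction Γ is replaced by a simple additive step on the full-rank factor, a common first-order retraction in the cited PSD-quotient framework.

```python
import numpy as np

def frmap_weight_update(W, dL_dXk, X_prev, lr=0.01):
    """One SGD step for an FRMap weight (Eqns. 9-11), sketched under stated assumptions.

    W       : current weight W_k^t, shape (d_k, d_{k-1}), full row rank
    dL_dXk  : gradient of the loss w.r.t. the layer output X_k = W X_{k-1}, shape (d_k, q)
    X_prev  : layer input X_{k-1}, shape (d_{k-1}, q)
    """
    euc_grad = dL_dXk @ X_prev.T                      # Eqn. 11: Euclidean gradient
    rie_grad = euc_grad - euc_grad @ W.T @ W          # Eqn. 9: subtract the normal component
    W_new = W - lr * rie_grad                         # tangent step; Eqn. 10 with an additive
    return W_new                                      # retraction in place of Gamma (assumption)

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 400))
X_prev, _ = np.linalg.qr(rng.standard_normal((400, 10)))
dL_dXk = rng.standard_normal((100, 10))               # would come from the layers above
print(frmap_weight_update(W, dL_dXk, X_prev).shape)   # (100, 400)
```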
### ReOrth Layer

The ReOrth layers involve the QR decomposition of Eqn. 3 and the non-linear operation of Eqn. 4. Firstly, for Eqn. 3 we introduce a virtual layer $k'$, which receives $X_{k-1}$ as input and produces the tuple $X_{k'} = (Q, R)$. Following (Ionescu, Vantzos, and Sminchisescu 2015) to handle the case of a tuple output, we apply the new chain rule of Eqn. 8 with the identities $A : \mathcal{F}(B) = \mathcal{F}^*(A) : B$ and $dX_{k'} = \mathcal{F}(dX_{k-1})$ to obtain the update rule for the structured data:

$\frac{\partial L^{(k')}}{\partial X_{k-1}} : dX_{k-1} = \frac{\partial L^{(k')}}{\partial Q} : dQ + \frac{\partial L^{(k')}}{\partial R} : dR$,   (12)

where the two variations $dQ$ and $dR$ are derived from the variation of the QR operation, $dX_{k-1} = dQ\,R + Q\,dR$, as

$dQ = S\,dX_{k-1} R^{-1} + Q\,(Q^T dX_{k-1} R^{-1})_{\mathrm{asym}}$,   (13)

$dR = Q^T dX_{k-1} - (Q^T dX_{k-1} R^{-1})_{\mathrm{asym}} R$,   (14)

where $S = I - QQ^T$, $I$ is an identity matrix, $A_{\mathrm{asym}} = A_{\mathrm{tril}} - (A_{\mathrm{tril}})^T$, and $A_{\mathrm{tril}}$ extracts the elements below the main diagonal of $A$. For more details on the derivation of Eqn. 13 and Eqn. 14, please refer to part I of the Appendix. As derived in part II of the Appendix, plugging Eqn. 13 and Eqn. 14 into Eqn. 12 yields the partial derivative of the loss function for the ReOrth layers:

$\frac{\partial L^{(k')}}{\partial X_{k-1}} = \left( S\,\frac{\partial L^{(k')}}{\partial Q} + Q\left( Q^T \frac{\partial L^{(k')}}{\partial Q} - \frac{\partial L^{(k')}}{\partial R}\,R^T \right)_{\mathrm{bsym}} \right) (R^{-1})^T + Q\,\frac{\partial L^{(k')}}{\partial R}$,   (15)

where $A_{\mathrm{bsym}} = A_{\mathrm{tril}} - (A^T)_{\mathrm{tril}}$ and $A_{\mathrm{tril}}$ extracts the elements below the main diagonal of $A$. $\partial L^{(k')}/\partial Q$ and $\partial L^{(k')}/\partial R$ can then be obtained from the function of Eqn. 4 employed in the ReOrth layers. Specifically, its variation becomes $dX_k = dQ$. Therefore, the involved partial derivatives with respect to $Q$ and $R$ are computed by $\partial L^{(k')}/\partial Q = \partial L^{(k+1)}/\partial X_k$ and $\partial L^{(k')}/\partial R = 0$.

### OrthMap Layer

As presented before, the OrthMap layers involve eigenvalue (EIG) decomposition. Thus we adopt the proposition in (Ionescu, Vantzos, and Sminchisescu 2015) to calculate the partial derivatives for the EIG computation.

Proposition 1. Let $X_{k-1} = U \Sigma U^T$ with $X_{k-1} \in \mathbb{R}^{D \times D}$, such that $U^T U = I$ and $\Sigma$ has a diagonal structure. The resulting partial derivative for the EIG layer $k'$ is given by

$\frac{\partial L^{(k')}}{\partial X_{k-1}} = U \left( \hat{K}^T \circ \left( U^T \frac{\partial L^{(k')}}{\partial U} \right) + \left( \frac{\partial L^{(k')}}{\partial \Sigma} \right)_{\mathrm{diag}} \right) U^T$,   (16)

where $\hat{K}_{ij} = 1/(\sigma_i - \sigma_j)$ if $i \neq j$ and $\hat{K}_{ij} = 0$ if $i = j$ ($\sigma_i$ is the $i$-th diagonal element of $\Sigma$), and $\circ$ denotes the Hadamard product. The partial derivatives with respect to $\Sigma$ and $U$ in Eqn. 7 for the OrthMap layers can be achieved by $\partial L^{(k')}/\partial U = \left[\, \partial L^{(k+1)}/\partial X_k \;\; 0 \,\right]$ and $\partial L^{(k')}/\partial \Sigma = 0$, where $0$ is the matrix of size $D \times (D - q)$ with all elements equal to zero.
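For the case actually used by the ReOrth layer ($\partial L^{(k')}/\partial R = 0$ and $\partial L^{(k')}/\partial Q = \partial L^{(k+1)}/\partial X_k$), Eqn. 15 reduces to $\partial L/\partial X_{k-1} = (S\,G + Q\,(Q^T G)_{\mathrm{bsym}})(R^{-1})^T$ with $G = \partial L/\partial Q$. The sketch below implements this reduced formula; since the closed form follows from our own derivation rather than being quoted verbatim, the block also includes a central finite-difference comparison the reader can run as a sanity check.

```python
import numpy as np

def bsym(A):
    """A_bsym = A_tril - (A^T)_tril, with tril keeping strictly lower-triangular entries."""
    return np.tril(A, -1) - np.tril(A.T, -1)

def reorth_backward(X, dL_dQ):
    """Gradient of a loss L(Q) w.r.t. X, where X = Q R is the thin QR used by the ReOrth layer."""
    Q, R = np.linalg.qr(X)
    S = np.eye(X.shape[0]) - Q @ Q.T
    Rinv_T = np.linalg.inv(R).T
    return (S @ dL_dQ + Q @ bsym(Q.T @ dL_dQ)) @ Rinv_T

# Finite-difference sanity check on a small example with the scalar loss L(Q) = <C, Q>.
rng = np.random.default_rng(0)
X = rng.standard_normal((7, 3))
C = rng.standard_normal((7, 3))
loss = lambda X: np.sum(C * np.linalg.qr(X)[0])

analytic = reorth_backward(X, C)                      # here dL/dQ = C
numeric = np.zeros_like(X)
eps = 1e-6
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        numeric[i, j] = (loss(X + E) - loss(X - E)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))             # small if the closed form above is correct
```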
## Empirical Evaluation

We compare four groups of existing methods to evaluate the proposed GrNet on three visual classification tasks: emotion recognition, action recognition and face verification.

Comparing methods:
1) General manifold learning methods: Expressionlets on Spatio-Temporal Manifold (STM-ExpLet) (Liu et al. 2014a) and Riemannian Sparse Representation combined with Manifold Learning on the manifold of SPD matrices (RSR-SPDML) (Harandi, Salzmann, and Hartley 2014);
2) Grassmann learning methods: Discriminative Canonical Correlations (DCC) (Kim, Kittler, and Cipolla 2007), Grassmann Discriminant Analysis (GDA) (Hamm and Lee 2008), Grassmannian Graph-Embedding Discriminant Analysis (GGDA) (Hamm and Lee 2009) and Projection Metric Learning (PML) (Huang et al. 2015b);
3) Regular convolutional networks: VGGDeepFace (Parkhi, Vedaldi, and Zisserman 2015) and Deep Second-order Pooling (DeepO2P) (Ionescu, Vantzos, and Sminchisescu 2015);
4) Manifold networks: the network on SPD manifolds (SPDNet) (Huang and Van Gool 2017) and DeepO2P, which trains a standard ConvNet with a manifold layer end-to-end.

We use the source code of all comparing methods from the authors and tune their parameters as in the original papers. For our GrNet, we build the architecture with i Projection-Pooling block(s) (named GrNet-iBlock) and one Output block, all of which are illustrated in Fig. 1. The learning rate $\lambda$ and the batch size are set to 0.01 and 30, respectively. The FRMap matrices are all initialized as random full rank matrices, and the number of them per layer is set to 16. For all the ProjPooling layers, the number $n$ of instances for pooling is fixed to 4. For training the GrNet, we just use an i7-2600K (3.40GHz) PC without any GPUs (as the matrix factorizations are implemented well in CUDA, we will provide a GPU version of our GrNet for speedups). Note that readers can follow the DeepO2P method to implement an end-to-end learning of ConvNet+GrNet. Since the focus of this paper is on deep learning for Grassmannian inputs, we leave this study for future work.

Emotion Recognition: We utilize the popular Acted Facial Expression in the Wild (AFEW) (Dhall et al. 2014) dataset. The dataset contains 1,345 sequences of facial expressions acted by 330 actors in close to real-world settings. The standard protocol designed by (Dhall et al. 2014) splits the dataset into three sets, i.e., training, validation and test sets. In the training and validation sets, each video is classified into one of seven expressions, while the ground truth of the test set has not been released. As a result, we follow (Liu et al. 2014a; Huang and Van Gool 2017) and present the results on the validation set. As done in many works such as (Huang and Van Gool 2017) for augmenting the training data, we split the training videos into 1,747 small subvideos. For the evaluation, each facial frame is normalized to an image of size 20 × 20. Then, following (Liu et al. 2013; 2014b), we model the sequences of facial expressions with a set of linear subspaces of order 10, which lie on the Grassmann manifold Gr(10, 400). In this task, the sizes of the GrNet-1Block weights are set to 400 × 100, while those of GrNet-2Blocks are set to 400 × 300 and 150 × 100.

Action Recognition: We use the HDM05 database (Müller et al. 2007), one of the largest-scale skeleton-based human action datasets. The dataset consists of 2,337 sequences of 130 action classes, and provides the 3D locations of 31 joints of the subjects. Following the protocol designed in (Huang and Van Gool 2017), we conduct 10 random evaluations, each of which randomly selects half of the sequences for training and the rest for testing. For data augmentation, the training sequences are divided into around 18,000 small subsequences in each random evaluation. As done in (Harandi, Salzmann, and Hartley 2014; Huang and Van Gool 2017), we represent each sequence by a covariance descriptor of size 93 × 93, which is computed from the second-order statistics of the 3D coordinates of the 31 joints in each frame. Then, we apply SVD to the covariance descriptors to obtain linear subspaces of order 10, which form the data on the Grassmannian Gr(10, 93). For our GrNet-1Block, the sizes of the connection weights are set to 93 × 60, while those of GrNet-2Blocks are fixed to 93 × 80 and 40 × 30, respectively.
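To make the data preparation concrete, here is a small sketch (ours, not the authors' preprocessing code) showing how an order-10 subspace can be obtained in both settings: from a covariance descriptor of the 31 joint coordinates as used for HDM05, or, more generally, from any matrix of vectorized frames via a thin SVD as used for AFEW.

```python
import numpy as np

def covariance_descriptor(F):
    """F: (N_frames, 93) matrix of stacked 3D coordinates of 31 joints -> 93 x 93 covariance."""
    return np.cov(F, rowvar=False)

def subspace_from_cov(C, q=10):
    """Top-q eigenvectors of the covariance descriptor -> a point on Gr(q, 93)."""
    _, U = np.linalg.eigh(C)
    return U[:, ::-1][:, :q]

def subspace_from_frames(D, q=10):
    """D: (dim, N_frames) matrix of vectorized frames -> top-q left singular vectors, Gr(q, dim)."""
    U, _, _ = np.linalg.svd(D, full_matrices=False)
    return U[:, :q]

rng = np.random.default_rng(0)
F = rng.standard_normal((200, 93))                   # a stand-in skeleton sequence (200 frames)
X_hdm05 = subspace_from_cov(covariance_descriptor(F))
D = rng.standard_normal((400, 60))                   # a stand-in AFEW clip: 60 frames of 20x20 pixels
X_afew = subspace_from_frames(D)
print(X_hdm05.shape, X_afew.shape)                   # (93, 10) and (400, 10)
```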
Face Verification: We employ the standard Point-and-Shoot Challenge (PaSC) dataset (Beveridge et al. 2013). For 256 subjects, it contains 1,401 videos taken by control cameras and 1,401 videos from handheld cameras. For this dataset, (Beveridge et al. 2013) designs control and handheld face verification experiments. As done in (Beveridge, Zhang, et al. 2015; Huang and Van Gool 2017), we use its 280 training videos and the COX data (Huang et al. 2015a) with 900 videos for training. Similarly, the training data are split into 12,529 small clips. To extract state-of-the-art deep features, we apply the approach of (Parkhi, Vedaldi, and Zisserman 2015) to the normalized face images of size 224 × 224. To speed up the training, we employ PCA to reduce the deep features to 400 dimensions. Following (Huang and Van Gool 2017), an SPD matrix of size 401 × 401 is computed by fusing the covariance matrix and the mean of each video. As done in (Huang et al. 2015b), for each video we finally compute a linear subspace of order 10, which lies on Gr(10, 401). We set the sizes of the GrNet-1Block weights to 401 × 100, while setting those of GrNet-2Blocks to 401 × 300 and 150 × 100.
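For orientation, the toy sketch below chains the layer functions into a single GrNet-1Block forward pass with the PaSC input size (401 → 100); it uses one FRMap matrix, random weights and a two-way output purely for illustration, so it reflects the data flow of Fig. 1 under our simplifying assumptions rather than the trained model.

```python
import numpy as np

def grnet_1block_forward(X, W_fr, W_fc, q=10, patch=2):
    """Toy GrNet-1Block: FRMap -> ReOrth -> ProjMap -> W-ProjPooling -> OrthMap -> ProjMap -> FC -> softmax."""
    X1 = W_fr @ X                                        # FRMap (Eqn. 2): 401 x 10 -> 100 x 10
    Q, _ = np.linalg.qr(X1)                              # ReOrth (Eqns. 3-4)
    P = Q @ Q.T                                          # ProjMap (Eqn. 5): 100 x 100
    d = P.shape[0] // patch
    P = P.reshape(d, patch, d, patch).mean(axis=(1, 3))  # W-ProjPooling (Eqn. 6): 50 x 50
    _, U = np.linalg.eigh(P)
    X2 = U[:, ::-1][:, :q]                               # OrthMap (Eqn. 7): 50 x 10
    feat = (X2 @ X2.T).ravel()                           # Output block: ProjMap + vectorization
    logits = W_fc @ feat                                 # FC layer
    e = np.exp(logits - logits.max())
    return e / e.sum()                                   # softmax probabilities

rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((401, 10)))      # one Gr(10, 401) input as used for PaSC
W_fr = rng.standard_normal((100, 401))                   # FRMap weight, 401 -> 100
W_fc = 0.01 * rng.standard_normal((2, 50 * 50))          # two-way output for illustration only
print(grnet_1block_forward(X, W_fr, W_fc))
```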
### Experimental Analysis

Table 1 presents the performance of the comparing methods on the three datasets. The results show that our GrNet with 2 blocks outperforms the existing general manifold learning methods, the Grassmann learning methods and the standard ConvNets (i.e., VGGDeepFace and DeepO2P). In particular, on HDM05 our GrNet outperforms the state-of-the-art Grassmann learning methods by a large margin (more than 11%). This verifies that the proposed GrNet yields great improvements when the training data is large enough. For PaSC, although the softmax output layer used in the GrNet does not suit the verification task well, we find that it still reaches the highest performance in the case of 2 blocks, which learn a more favorable Grassmannian representation. As studied in existing state-of-the-art Grassmann learning methods, the GrNet without Riemannian computing (i.e., ProjMap) and without geometry-aware learning (i.e., ReOrth) performs very badly (17.62% and 26.15%, respectively, on AFEW). Besides, the consistent improvement from stacking more GrNet blocks verifies that the network learns more discriminative Grassmannian representations and finally improves the classification performance.

Compared with DeepO2P, one of the manifold networks, which trains a ConvNet with a single SPD manifold layer end-to-end, our GrNets achieve better performance. In contrast, our GrNets fail to surpass the other manifold network, SPDNet, which uses multiple SPD manifold layers. Nevertheless, our GrNets perform deep learning on a different type of manifold and exhibit many considerable differences in terms of both application range and intrinsic properties, some of which are enumerated below.

1) The proposed FRMap layers learn full rank projections, while the BiMap layers in SPDNets pursue bi-linear orthogonal projections. Besides, the GrNets exploit both single and multiple projections in each FRMap layer (S-FRMap, M-FRMap). The results in Fig. 2(a) show the benefit of using M-FRMap. Furthermore, our method alternatively studies a new connection weight update rule on PSD manifolds rather than the one on Stiefel manifolds, whose performances (i.e., 32.13%, 57.25%, 80.15%, 71.28%) on the datasets are often worse.

2) The GrNets design brand new ReOrth, ProjMap and ProjPooling layers. Particularly, for the ReOrth layers, the exploration of the QR decomposition in backpropagation is a very important theoretical contribution. Besides, the GrNet devises pooling layers across and within projection matrices (A-ProjPooling, W-ProjPooling), both of which work on top of the M-FRMap layers. As studied in Fig. 2(a), W-ProjPooling typically outperforms A-ProjPooling. The unfavorable performance of A-ProjPooling would be caused by the extrinsic mean calculation on the Grassmannian and the weak relationship across multiple projection matrices.

| Method | AFEW | HDM05 | PaSC1 | PaSC2 |
|---|---|---|---|---|
| STM-ExpLet | 31.73% | – | – | – |
| RSR-SPDML | 30.12% | 48.01%±3.38 | – | – |
| DCC | 25.78% | 41.74%±1.92 | 75.83% | 67.04% |
| GDA | 29.11% | 46.25%±2.71 | 71.38% | 67.49% |
| GGDA | 29.45% | 46.87%±2.19 | 66.71% | 68.41% |
| PML | 28.98% | 47.25%±2.78 | 73.45% | 68.32% |
| VGGDeepFace | – | – | 78.82% | 68.24% |
| DeepO2P | 28.54% | – | 68.76% | 60.14% |
| SPDNet | 34.23% | 61.45%±1.12 | 80.12% | 72.83% |
| GrNet-0Block | 25.34% | 51.12%±3.55 | 68.52% | 63.92% |
| GrNet-1Block | 32.08% | 57.73%±2.24 | 80.15% | 72.51% |
| GrNet-2Blocks | 34.23% | 59.23%±1.78 | 80.52% | 72.76% |

Table 1: Results for the AFEW, HDM05 and PaSC datasets. PaSC1/PaSC2 are the control/handheld testings.

Figure 2: (a) Results of using single and multiple FRMap (S-FRMap, M-FRMap) and ProjPoolings across or within projections (A-ProjPooling, W-ProjPooling) for the three used databases. (b)(c) Convergence and accuracy curves of SPDNet and the proposed GrNet on AFEW.

3) Training GrNet-1Block per epoch costs about 10, 9 and 13 minutes, respectively, on the three datasets, while training SPDNet (w/o complex pooling) takes 2, 4 and 15 minutes. However, in theory, our GrNet actually runs much faster than the existing SPDNet under the same setting. This is because the GrNet handles much lower-dimensional orthonormal matrices of size $d \times q$ (the order $q$ is often set to 10), while the SPDNet treats SPD matrices of size $d \times d$.

4) Fig. 2(b)(c) show that the GrNet needs far fewer epochs than the SPDNet to converge on AFEW, and the validation accuracy improves by nearly 12% after training. For larger datasets like HDM05, the improvement is even up to 40%.

## Conclusion

This paper introduced the first network architecture that performs deep learning over Grassmann manifolds. Essentially, it is a natural exploration of convolutional networks to perform fully connected convolution, normalization, pooling and Riemannian computing on Grassmannian data. In three typical visual classification evaluations, our Grassmann networks significantly outperformed existing Grassmann learning methods, and performed comparably with state-of-the-art methods. Directions for future work include extending the current GrNet to an end-to-end ConvNet+GrNet training system and applying it to other computer vision problems.

Acknowledgement: This work is supported by EU Framework Seven project ReMeDi (grant 610902).

## Appendix

I. Regarding the gradient computation of the QR decomposition, we first differentiate its implicit system

$X_{k-1} = QR, \quad Q^T Q = I, \quad 0 = R_{\mathrm{tril}}$,   (17)

where $R_{\mathrm{tril}}$ returns the elements below the main diagonal of $R$, and we obtain

$dX_{k-1} = dQ\,R + Q\,dR, \quad dQ^T Q = -Q^T dQ$.   (18)
Multiplying the first equation of Eqn. 18 from the left with $Q^T$ and from the right with $R^{-1}$ derives

$dR = Q^T dX_{k-1} - Q^T dQ\,R$,   (19)

$dQ = dX_{k-1} R^{-1} - Q\,dR\,R^{-1}$.   (20)

The multiplication of Eqn. 19 from the right with the inverse of $R$ yields the equation

$0 = Q^T dX_{k-1} R^{-1} - Q^T dQ - dR\,R^{-1}$.   (21)

As $(dR\,R^{-1})_{\mathrm{tril}} = 0$, we further derive $(Q^T dQ)_{\mathrm{tril}} = (Q^T dX_{k-1} R^{-1})_{\mathrm{tril}}$. Since $Q^T dQ = (Q^T dQ)_{\mathrm{tril}} - ((Q^T dQ)_{\mathrm{tril}})^T$ is antisymmetric (see Eqn. 18), we have

$Q^T dQ = (Q^T dX_{k-1} R^{-1})_{\mathrm{tril}} - ((Q^T dX_{k-1} R^{-1})_{\mathrm{tril}})^T = (Q^T dX_{k-1} R^{-1})_{\mathrm{asym}}$.   (22)

Substituting Eqn. 22 into Eqn. 19 derives the gradient of the QR decomposition w.r.t. $R$ as in Eqn. 14. By plugging Eqn. 14 into Eqn. 20, we can derive the gradient of the QR decomposition w.r.t. $Q$ as in Eqn. 13.

II. When plugging Eqn. 13 and Eqn. 14 into Eqn. 12 to obtain Eqn. 15, we employ the properties of the matrix inner product $A : B = \mathrm{tr}(A^T B)$, which were studied in (Ionescu, Vantzos, and Sminchisescu 2015), to derive the following equivalent equation

$A : B_{\mathrm{asym}} = A_{\mathrm{bsym}} : B$,   (23)

where $B_{\mathrm{asym}} = B_{\mathrm{tril}} - (B_{\mathrm{tril}})^T$, $A_{\mathrm{bsym}} = A_{\mathrm{tril}} - (A^T)_{\mathrm{tril}}$, and $A_{\mathrm{tril}}$ extracts the elements below the main diagonal of $A$.

## References

Absil, P.-A.; Mahony, R.; and Sepulchre, R. 2004. Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Applicandae Mathematica 80(2):199–220.
Absil, P.-A.; Mahony, R.; and Sepulchre, R. 2009. Optimization Algorithms on Matrix Manifolds. Princeton University Press.
Anirudh, R.; Turaga, P.; Su, J.; and Srivastava, A. 2017. Elastic functional coding of Riemannian trajectories. IEEE TPAMI 39(5):922–936.
Beveridge, J.; Phillips, P.; Bolme, D.; Draper, B.; Given, G.; Lui, Y.; Teli, M.; Zhang, H.; Scruggs, W.; Bowyer, K.; et al. 2013. The challenge of face recognition from digital point-and-shoot cameras. In BTAS.
Beveridge, J. R.; Zhang, H.; et al. 2015. Report on the FG 2015 video person recognition evaluation. In FG.
Bonnabel, S., and Sepulchre, R. 2009. Riemannian metric and geometric mean for positive semidefinite matrices of fixed rank. SIAM Journal on Matrix Analysis and Applications 31(3):1055–1070.
Bonnabel, S. 2013. Stochastic gradient descent on Riemannian manifolds. IEEE T-AC 58(9):2217–2229.
Bottou, L. 2010. Large-scale machine learning with stochastic gradient descent. In COMPSTAT.
Cetingul, H., and Vidal, R. 2009. Intrinsic mean shift for clustering on Stiefel and Grassmann manifolds. In CVPR.
Cherian, A.; Fernando, B.; Harandi, M.; and Gould, S. 2017. Generalized rank pooling for activity recognition. In CVPR.
Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2(4):303–314.
Dhall, A.; Goecke, R.; Joshi, J.; Sikka, K.; and Gedeon, T. 2014. Emotion recognition in the wild challenge 2014: Baseline, data and protocol. In ICMI.
Dodge, Y., and Rousson, V. 1999. Multivariate L1 mean. Metrika 49(2):127–134.
Edelman, A.; Arias, T.; and Smith, S. 1998. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications 20(2):303–353.
Goodfellow, I.; Warde-Farley, D.; Mirza, M.; Courville, A.; and Bengio, Y. 2013. Maxout networks. In ICML.
Hamm, J., and Lee, D. D. 2008. Grassmann discriminant analysis: A unifying view on subspace-based learning. In ICML.
Hamm, J., and Lee, D. D. 2009. Extended Grassmann kernels for subspace-based learning. In NIPS.
Harandi, M.; Sanderson, C.; Shirazi, S.; and Lovell, B. C. 2011. Graph embedding discriminant analysis on Grassmannian manifolds for improved image set matching. In CVPR.
Harandi, M.; Sanderson, C.; Shen, C.; and Lovell, B. 2013. Dictionary learning and sparse coding on Grassmann manifolds: An extrinsic solution. In ICCV.
Harandi, M. T.; Salzmann, M.; Jayasumana, S.; Hartley, R.; and Li, H. 2014. Expanding the family of Grassmannian kernels: An embedding perspective. In ECCV.
Harandi, M. T.; Salzmann, M.; and Hartley, R. 2014. From manifold to manifold: Geometry-aware dimensionality reduction for SPD matrices. In ECCV.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV.
Helmke, U.; Hüper, K.; and Trumpf, J. 2007. Newton's method on Grassmann manifolds. arXiv preprint arXiv:0709.2205.
Huang, Z., and Van Gool, L. 2017. A Riemannian network for SPD matrix learning. In AAAI.
Huang, Z.; Shan, S.; Wang, R.; Zhang, H.; Lao, S.; Kuerban, A.; and Chen, X. 2015a. A benchmark and comparative study of video-based face recognition on COX face database. IEEE T-IP 24(12):5967–5981.
Huang, Z.; Wang, R.; Shan, S.; and Chen, X. 2015b. Projection metric learning on Grassmann manifold with application to video based face recognition. In CVPR.
Ionescu, C.; Vantzos, O.; and Sminchisescu, C. 2015. Training deep networks with structured layers by matrix backpropagation. arXiv:1509.07838.
Jarrett, K.; Kavukcuoglu, K.; Ranzato, M.; and LeCun, Y. 2009. What is the best multi-stage architecture for object recognition? In ICCV.
Journee, M.; Bach, F.; Absil, P.-A.; and Sepulchre, R. 2010. Low-rank optimization on the cone of positive semidefinite matrices. SIOPT.
Kim, T.-K.; Kittler, J.; and Cipolla, R. 2007. Discriminative learning and recognition of image set classes using canonical correlations. IEEE TPAMI 29(6):1005–1018.
Le, H. 1991. On geodesics in Euclidean shape spaces. J. Lond. Math. Soc. 360–372.
Liu, M.; Wang, R.; Huang, Z.; Shan, S.; and Chen, X. 2013. Partial least squares regression on Grassmannian manifold for emotion recognition. In ICMI.
Liu, M.; Shan, S.; Wang, R.; and Chen, X. 2014a. Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. In CVPR.
Liu, M.; Wang, R.; Li, S.; Shan, S.; Huang, Z.; and Chen, X. 2014b. Combining multiple kernel methods on Riemannian manifold for emotion recognition in the wild. In ICMI.
Marrinan, T.; Beveridge, J. R.; Draper, B.; Kirby, M.; and Peterson, C. 2014. Finding the subspace mean or median to fit your need. In CVPR.
Masci, J.; Boscaini, D.; Bronstein, M.; and Vandergheynst, P. 2015. Geodesic convolutional neural networks on Riemannian manifolds. In ICCV Workshops.
Meyer, G.; Bonnabel, S.; and Sepulchre, R. 2011. Regression on fixed-rank positive semidefinite matrices: A Riemannian approach. JMLR 12:593–625.
Müller, M.; Röder, T.; Clausen, M.; Eberhardt, B.; Krüger, B.; and Weber, A. 2007. Documentation: Mocap database HDM05. Tech. Rep. CG-2007-2.
Nair, V., and Hinton, G. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML.
Parkhi, O.; Vedaldi, A.; and Zisserman, A. 2015. Deep face recognition. In BMVC.
Srivastava, A., and Klassen, E. 2002. Monte Carlo extrinsic estimators of manifold-valued parameters. IEEE Transactions on Signal Processing 50(2):299–308.
Srivastava, A., and Klassen, E. 2004. Bayesian and geometric subspace tracking. Advances in Applied Probability 43–56.
Turaga, P.; Veeraraghavan, A.; Srivastava, A.; and Chellappa, R. 2011. Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. IEEE TPAMI 33(11):2273–2286.
Vemulapalli, R.; Pillai, J.; and Chellappa, R. 2013. Kernel learning for extrinsic classification of manifold features. In CVPR.