The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Learning Neural Bag-of-Matrix-Summarization with Riemannian Network

Hong Liu, Jie Li, Yongjian Wu, Rongrong Ji
Fujian Key Laboratory of Sensing and Computing for Smart City, Department of Cognitive Science, School of Information Science and Engineering, Xiamen University, Xiamen, China
Peng Cheng Laboratory, Shenzhen, China
Tencent Youtu Lab, Tencent Technology (Shanghai) Co., Ltd, Shanghai, China
lynnliu.xmu@gmail.com, lijie32@stu.xmu.edu.cn, littlekenwu@tencent.com, rrji@xmu.edu.cn
Contributed Equally, Corresponding Author

Abstract

The symmetric positive definite (SPD) matrix has attracted increasing research attention in image/video analysis, owing to its merit of capturing Riemannian geometry in a structured 2D feature representation. However, computation in the vector space over SPD matrices cannot capture these geometric properties, which corrupts the classification performance. To this end, Riemannian deep networks have become a promising solution for SPD matrix classification, because of their excellence in performing non-linear learning over SPD matrices. Moreover, Riemannian metric learning typically adopts a kNN classifier that cannot be extended to large-scale datasets, which limits its application in many time-critical scenarios. In this paper, we propose a Bag-of-Matrix-Summarization (BoMS) method to be combined with the Riemannian network, which handles the above issues towards a highly efficient and scalable SPD feature representation. Our key innovation lies in the idea of summarizing data in a Riemannian geometric space instead of the vector space. First, the whole training set is compressed with a small number of matrix features to ensure high scalability. Second, given such a compressed set, a constant-length vector representation is extracted by efficiently measuring the distribution variations between the summarized data and the latent feature of the Riemannian network. Finally, the proposed BoMS descriptor is integrated into the Riemannian network, upon which the whole framework is trained end-to-end via matrix back-propagation. Experiments on four different classification tasks demonstrate the superior performance of the proposed method over the state-of-the-art methods.

Introduction

The symmetric positive definite (SPD) matrix has recently become popular for feature representation in various computer vision and artificial intelligence applications, e.g., image classification (Fathy and Chellappa 2017), action recognition (Huang et al. 2017a), image retrieval (Ji et al. 2017), and brain-computer interface (BCI) data analysis (Lotte et al. 2018). Existing works on feature representation with SPD matrices can be categorized into those using covariance matrices (Wang et al. 2012) and those using Gaussian distribution matrices (Wang et al. 2015b). The former preserves the second-order statistics of a set of vectors, while the latter targets capturing the overall probability of data variations. By preserving such non-Euclidean geometric properties, SPD matrices can measure nearness on a specific Riemannian manifold (Sra 2012) rather than in the Euclidean space, which considers the geodesic distance between two points on the Riemannian manifold. As observed in (Arsigny et al.
2007), by using SPD matrices the Riemannian space eliminates the large swelling effect that exists in the Euclidean space, which brings significant advantages in handling various problems suited to non-Euclidean metrics. However, directly applying traditional machine learning algorithms with Euclidean geometry to SPD matrices often results in poor performance (Huang et al. 2017b). To overcome this problem, Riemannian metric based learning models have received increasing attention, which can directly conduct non-linear learning when fed with SPD matrix representations (Pennec, Fillard, and Ayache 2006; Wang et al. 2015a).

Recently, deep learning methods have received much attention in visual feature representation (Lin et al. 2018). However, most schemes consider solely first-order statistics using traditional neural networks. More recently, second-order statistics, such as covariance, have been further considered to construct better regional descriptors for challenging problems like fine-grained visual recognition (Lin, RoyChowdhury, and Maji 2015; Lin and Maji 2017). These works do not use dimensionality reduction layers to obtain effective second-order statistics; instead, they directly apply multiple fully-connected (FC) layers after vectorizing the SPD matrices. However, such a non-Euclidean representation lacks familiar properties such as global parameterization, common coordinates, vector space structure, and shift-invariance. Consequently, basic operations like FC layers cannot be well defined on non-Euclidean domains (Bronstein et al. 2017). Moreover, such a vanilla deep structure destroys the Riemannian geometry and corrupts the classification results, as demonstrated in our experiments.

To overcome these problems, the Riemannian SPD Matrix Network (SPDNet) (Huang and Van Gool 2017) was proposed, which receives SPD matrices as inputs. It aims to preserve the SPD structure across layers while non-linearly mapping the input into a latent space, and then performs tasks like classification in this latent space.

Figure 1: The framework of our proposed Bag-of-Matrix-Summarization layer with SPDNet, composed of a Feature Extractor Block (BiMap and ReEig layers), a BoMS Block, and a Classifier Block. (Best viewed in color)

In particular, SPDNet is composed of two traditional layers (i.e., the fully-connected layer and the softmax layer) and three newly defined layers (i.e., the BiMap layer, ReEig layer, and LogEig layer). A special matrix back-propagation method with stochastic gradient descent (SGD) was proposed to train the deep SPDNet in an end-to-end way. Note that the LogEig layer in SPDNet is the key to transforming the geometric space into the Euclidean space, after which traditional neural layers can be plugged in. This layer needs to compute matrix logarithms, which dramatically increases the computational cost, and its moments do not have closed forms (Cherian et al. 2013). Moreover, existing works (Guo, Ishwar, and Konrad 2013; Anirudh et al. 2017) typically adopt a flattened vector representation via tangent approximation or rolling maps, and then use an SVM or kNN classifier to learn features in the resulting flattened space. These shallow learning schemes lead to suboptimal solutions on the specific nonlinear manifolds, and they also often require significantly more time to conduct online predictions due to complex calculations.
However, to preserve the original geometric relation that is captured by such a matrix structure, we argue that a better feature representation, e.g., from a statistical perspective, is needed for various real-world applications. To handle this issue, we propose to embed SPD matrices by using a Bag-of-Visual-Words (BoVW) model under a metric learning framework. Recently, BoVW has been integrated into convolutional neural networks to perform image classification and simultaneously compress the model (Passalis and Tefas 2017). However, due to the SPD constraints, directly using BoVW is intractable, which is the first problem to be tackled in this paper. Although recent works (Sivalingam et al. 2015; Cherian et al. 2017) have extended dictionary learning to SPD matrix representations with an encoding model, the integration of BoVW into deep Riemannian networks remains an open problem due to the difficulty of optimization, which is the second problem.

To solve the above two problems, we propose a novel Bag-of-Matrix-Summarization (BoMS) model, which can be efficiently inferred, scales well to large datasets, and significantly improves the classification accuracy. BoMS is based on supervised learning and is designed to learn nonlinear transformations that preserve the neighbor structure of the labeled data. It simultaneously learns an extremely small set of codewords, a procedure that can be seen as nonlinear Riemannian metric learning, i.e., a summarized version of the low-dimensional SPD matrices. The summarized data can be viewed as the codewords, each of which carries the corresponding semantic information such as class labels. Then, the output of the BiMap layer in SPDNet can be embedded into a vector representation, each dimension of which captures the divergence to the respective codeword. Finally, the data summarization and feature learning are trained jointly through matrix back-propagation with stochastic gradient descent (SGD) (Ionescu, Vantzos, and Sminchisescu 2015; Liu et al. 2018), which ensures that our model is scalable to large training sets. In particular, BoMS acts as a trainable encoding layer, which can be plugged between the BiMap layer and the FC layer to replace the LogEig layer in the original SPDNet. A sequence of BiMap and ReEig layers is further used to construct the feature extractor, forming a low-dimensional SPD matrix input to the corresponding classification/recognition model.

We term the proposed method BoMS+SPDNet, the framework of which is shown in Fig. 1. Quantitatively, we compare the proposed model against various state-of-the-art SPD matrix based classification methods on four benchmarks, i.e., AFEW, HDM05, YTC, and BCI. Experiments demonstrate that the proposed BoMS+SPDNet outperforms the existing classification methods in terms of both accuracy and efficiency.

The rest of this paper is organized as follows: In Sec. 2, we briefly overview SPDNet, which serves as the basis for SPD matrix classification. Sec. 3 describes the proposed BoMS+SPDNet, and the experiments are shown in Sec. 4. Finally, we conclude this paper in Sec. 5.

Preliminaries of SPDNet

We first briefly present SPDNet (Huang and Van Gool 2017), which serves as the basis for the proposed BoMS model. SPDNet is the first deep Riemannian network and has four different kinds of layers, i.e., the BiMap layer, ReEig layer, LogEig layer, and other layers.
It receives SPD matrices as inputs, preserves the Riemannian manifold structure across layers, and non-linearly maps an input matrix into a vector representation. Let $X_{k-1} \in Sym^+_{d_{k-1}}$ be the input SPD matrix of the $k$-th layer, $W_k \in \mathbb{R}^{d_k \times d_{k-1}}$ ($d_k < d_{k-1}$) be the transformation matrix of the $k$-th layer, and $X_k \in \mathbb{R}^{d_k \times d_k}$ be the resulting matrix of the $k$-th layer, where $Sym^+_{d_{k-1}}$ is the space of real SPD $d_{k-1} \times d_{k-1}$ matrices.

The BiMap layer is similar to the linear transformation layer in an auto-encoder, which transforms the input SPD matrices to low-dimensional SPD matrices by a bilinear mapping $f_b$ as:

$X_k = f_b^{(k)}(X_{k-1}; W_k) = W_k X_{k-1} W_k^T.$ (1)

To make the output $X_k$ remain an SPD matrix, the transformation $W_k$ should be constrained to be a row full-rank matrix.

The ReEig layer utilizes a non-linear activation to improve discrimination, which is similar to the ReLU layer in convolutional neural networks (ConvNets) (Nair and Hinton 2010). Accordingly, the ReEig layer is devised as a non-linear function $f_r$, which rectifies the SPD matrices by tuning up their small positive eigenvalues:

$X_k = f_r^{(k)}(X_{k-1}) = U_{k-1} \max(\epsilon I, \Sigma_{k-1}) U_{k-1}^T,$ (2)

where $\max(\cdot,\cdot)$ is the element-wise maximum, $U_{k-1}$ and $\Sigma_{k-1}$ are obtained from the eigenvalue decomposition $X_{k-1} = U_{k-1} \Sigma_{k-1} U_{k-1}^T$, $\epsilon$ is a threshold parameter, and $I$ is the identity matrix.

The LogEig layer is similar to Log-Euclidean Riemannian metric learning (Huang et al. 2015), in which the matrix logarithm $\log(\cdot)$ is applied to the SPD matrices, and the resulting matrix is then flattened into a vector representation. As a result, classical Euclidean computations can be applied to the logarithms of SPD matrices. Formally, the layer is defined as a function $f_l$:

$X_k = f_l^{(k)}(X_{k-1}) = \log(X_{k-1}) = U_{k-1} \log(\Sigma_{k-1}) U_{k-1}^T,$ (3)

where $X_{k-1} = U_{k-1} \Sigma_{k-1} U_{k-1}^T$ is the eigenvalue decomposition.

Finally, the other layers are composed of a sequence of neural blocks from traditional neural networks, i.e., the fully-connected (FC) layer and the softmax layer. The FC layer is inserted after the LogEig layer and is set to be a projection matrix $W_{fc} \in \mathbb{R}^{d_k \times d_{k-1}}$, where $d_k$ is the class number and $d_{k-1}$ is the dimension of the output of the LogEig layer. The final output for classification is produced by a softmax layer. To learn SPDNet, inspired by matrix back-propagation (Ionescu, Vantzos, and Sminchisescu 2015), back-propagation with an SGD setting on Stiefel manifolds was proposed, which makes SPDNet trainable in an end-to-end manner.

The Proposed Method

As illustrated in Fig. 1, the proposed BoMS+SPDNet is composed of three blocks: a) a Feature Extraction Layer block (composed of several BiMap layers and ReEig layers), b) a BoMS Layer block, and c) a Classification Layer block. We depict the details below.

Feature Extraction Layer Block

As shown in the left part of Fig. 1, feature extraction is the fundamental block of the proposed method, which aims to extract a low-dimensional SPD matrix that serves as the input for the subsequent Riemannian feature learning process. Inspired by the removal of fully-connected layers in the recent deep feature extractor (Passalis and Tefas 2017), for the SPD feature extractor we remove the LogEig layer and the other layers in SPDNet, and the rest is composed into a block sequence of BiMap and ReEig layers. In our feature extractor, we use an architecture similar to SPDNet with three BiMap layers $f_b^{(k)}$ and two ReEig layers $f_r^{(k)}$, the exemplar structure of which is $X_0 \to f_b^{(1)} \to f_r^{(2)} \to f_b^{(3)} \to f_r^{(4)} \to f_b^{(l)} \to Z$.
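For illustration, the following is a minimal PyTorch sketch of the three SPD layers defined in Eq. 1 to Eq. 3. It is not the authors' released implementation: the class names, the threshold value, and the initialization are our own assumptions, and the Stiefel-manifold constraint on the BiMap weight is only noted in a comment rather than enforced.

```python
import torch
import torch.nn as nn

class BiMap(nn.Module):
    """Bilinear mapping X_k = W X_{k-1} W^T (Eq. 1); W should stay row
    full-rank (optimized on a Stiefel manifold in the paper, not enforced here)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        w = torch.empty(d_out, d_in)
        nn.init.orthogonal_(w)          # random semi-orthogonal initialization
        self.W = nn.Parameter(w)

    def forward(self, X):               # X: (batch, d_in, d_in), SPD
        return self.W @ X @ self.W.t()  # (batch, d_out, d_out)

class ReEig(nn.Module):
    """Eigenvalue rectification (Eq. 2): clamp small eigenvalues to eps."""
    def __init__(self, eps=1e-4):       # eps is an assumed threshold
        super().__init__()
        self.eps = eps

    def forward(self, X):
        S, U = torch.linalg.eigh(X)     # X = U diag(S) U^T
        S = torch.clamp(S, min=self.eps)
        return U @ torch.diag_embed(S) @ U.transpose(-1, -2)

class LogEig(nn.Module):
    """Matrix logarithm (Eq. 3), mapping SPD matrices to a flat vector space."""
    def forward(self, X):
        S, U = torch.linalg.eigh(X)
        logX = U @ torch.diag_embed(torch.log(S)) @ U.transpose(-1, -2)
        return logX.flatten(start_dim=1)  # vectorized for subsequent FC layers
```

In this sketch, PyTorch's automatic differentiation through `torch.linalg.eigh` plays the role of the structured matrix back-propagation derived analytically in (Ionescu, Vantzos, and Sminchisescu 2015).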
The output of the last BiMap layer, i.e., the $l$-th layer, is used as the extracted feature and is subsequently fed into the BoMS block. Without loss of generality, we define the output of the feature extractor as a function $F(X_0) = Z$. Note that the output $Z$ is still an SPD matrix, which therefore still holds the Riemannian geometric properties.

BoMS Layer Block

Similar to the BoVW model, the proposed BoMS layer is an encoding layer that captures the statistics of the feature matrix $Z$. The goal of learning the BoMS layer is two-fold: a) learn a dictionary set $B = \{B_1, B_2, ..., B_m\}$ of $m$ SPD matrices, where each dictionary atom $B_i \in Sym^+_d$; b) learn an accumulating scalar on each dictionary atom to best represent the SPD matrix feature $Z$ for classification. We denote $Z_i \in Sym^+_d$ as the $i$-th feature produced by the feature extractor from the input SPD matrix $X_i$, and collect a set of $N$ matrix features $Z = \{Z_1, Z_2, ..., Z_N\}$ with associated labels $y_i \in Y = \{y_1, y_2, ..., y_N\}$. BoMS aims to output a fixed-length vector representation $v$ based on the dictionary set $B$. The proposed layer can be viewed as a unified processing layer, whose output is sent to a subsequent classifier. The output of the BoVW encoding can be defined by a nonlinear function $f_p$ as:

$v_i = f_p(Z_i) = \big[ D(Z_i, B_1), ..., D(Z_i, B_m) \big]^T \in \mathbb{R}^m,$ (4)

where $D(\cdot,\cdot)$ is the Riemannian metric that measures two given SPD matrices.

Therefore, the key issue is how to define such a dictionary set $B$, which is typically obtained by k-means clustering. However, the input features here are matrices, so traditional clustering algorithms are not applicable. On one hand, although some methods have been proposed to cluster SPD matrix features (Cherian, Morellas, and Papanikolopoulos 2016), they are unsuitable for integration into a deep learning architecture, in which the dictionary updating would be separated from the feature learning, leading to suboptimal learning. On the other hand, a clustering algorithm is unsupervised and cannot utilize the label information to improve the quality of the dictionary. To solve this problem, inspired by data summarization technology (Kusner et al. 2014), the goal of our dictionary learning is to find a set of summarized samples $\hat{Z} = \{\hat{Z}_1, ..., \hat{Z}_m\}$ ($m \ll N$) with labels $\hat{Y} = \{\hat{y}_1, ..., \hat{y}_m\}$ to replace the dictionary $B$, so that the original training data $Z$ and $Y$ can be best approximated via k-nearest neighbors. Different from the traditional BoVW, the summarization set $\hat{Z}$ needs to be learned from the whole dataset, and the summarized label set $\hat{Y}$ should have the same proportional distribution as the original label set (that is, if one category accounts for 60% of the original label collection, it should occupy the same percentage of the summarization collection). Note that the data summarization can be viewed as a special case of supervised dictionary learning, which aims to correctly classify as many training inputs as possible in the deep matrix space.

However, two further issues of data summarization need to be solved: 1) the learned metric should maximize the margin between different classes; 2) all the summarized data with the same label may converge into a single point, since we target maximizing the classification accuracy. To solve the first problem, we propose a margin-based loss function for the matrix summarization learning, which is defined as:

$\mathcal{L}^1_{ms}(Z, \hat{Z}) = \sum_{i,j,k} \max\big(0,\ \alpha - D(Z_i, \hat{Z}_j) + D(Z_i, \hat{Z}_k)\big), \quad \text{s.t. } y_i \neq \hat{y}_j \text{ and } y_i = \hat{y}_k.$ (5)
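As an illustration only, the sketch below computes a triplet-style version of this summarization margin loss under the Log-Euclidean distance (introduced as Eq. 9 below). The hard-positive/hard-negative atom selection, the margin value, and all function names are our own assumptions rather than the authors' implementation.

```python
import torch

def logm_spd(X):
    """Matrix logarithm of a batch of SPD matrices."""
    S, U = torch.linalg.eigh(X)
    return U @ torch.diag_embed(torch.log(S)) @ U.transpose(-1, -2)

def summarization_margin_loss(Z, y, Z_hat, y_hat, alpha=1.0):
    """Margin loss in the spirit of Eq. 5: a feature Z_i should be at least
    `alpha` closer to a summarized atom of its own class than to an atom of a
    different class. Uses the hardest atoms instead of all triplets."""
    log_Z_hat = logm_spd(Z_hat)                    # (m, d, d)
    loss, count = Z.new_zeros(()), 0
    for i in range(Z.shape[0]):
        log_zi = logm_spd(Z[i].unsqueeze(0))       # (1, d, d)
        # Log-Euclidean distances to every summarized atom (Eq. 9): (m,)
        d = torch.linalg.norm(log_zi - log_Z_hat, dim=(-2, -1))
        same, diff = (y_hat == y[i]), (y_hat != y[i])
        if same.any() and diff.any():
            # farthest same-class atom vs. closest different-class atom
            loss = loss + torch.clamp(alpha + d[same].max() - d[diff].min(), min=0)
            count += 1
    return loss / max(count, 1)
```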
For the second problem, the summarized data with the same label should be dissociated, so that such samples can best represent the diversity of the training data. To this end, the target is to maximize the pairwise distance between two summarized SPD matrices with the same label, that is:

$\mathcal{L}^2_{ms}(\hat{Z}) = \sum_{i=1}^{m} \sum_{j=1}^{m} \delta(\hat{y}_i, \hat{y}_j)\, D(\hat{Z}_i, \hat{Z}_j),$ (6)

where $\delta(\hat{y}_i, \hat{y}_j) = 1$ if $\hat{y}_i = \hat{y}_j$, and $0$ otherwise. Then, the final objective function of the proposed data summarization combines Eq. 5 and Eq. 6 as follows:

$\mathcal{L}_{ms}(Z, \hat{Z}) = \lambda_1 \mathcal{L}^1_{ms}(Z, \hat{Z}) - \lambda_2 \mathcal{L}^2_{ms}(\hat{Z}),$ (7)

where $\lambda_1$ and $\lambda_2$ are two tradeoff parameters that control the weights of Eq. 5 and Eq. 6. As a result, the proposed Bag-of-Matrix-Summarization (BoMS) layer can be defined as:

$v_i = \big[ D(F(X_i), \hat{Z}_1), ..., D(F(X_i), \hat{Z}_m) \big]^T.$ (8)

Distance Metric

To learn a better vector representation for the corresponding SPD matrix, the key issue is to define an appropriate distance metric $D(\cdot,\cdot)$ in Eq. 8. To this end, we introduce three representative distance metrics below.

Inspired by the study in (Arsigny et al. 2006), we first use the Log-Euclidean metric (LEM) to define the distance function $D(\cdot,\cdot)$ in Eq. 4, which exploits the Lie group structure under the matrix exponential and logarithm operators. The Riemannian distance between two SPD matrices is then defined by LEM as:

$D(Z_i, Z_j) = \|\log(Z_i) - \log(Z_j)\|_F,$ (9)

where $\|\cdot\|_F$ is the Frobenius norm. However, as mentioned in (Cherian et al. 2013), the flattening of the manifold in LEM often leads to less accurate distance computations and therefore affects the performance. To address this, an intuitive method is to use metric learning to reduce such computation error, which serves as our second distance measure, i.e.,

$D(Z_i, Z_j) = \|W \mathrm{vec}(\log(Z_i)) - W \mathrm{vec}(\log(Z_j))\|_2,$ (10)

where $\mathrm{vec}(\cdot)$ denotes matrix vectorization. On the other hand, LEM is interpreted as a Euclidean distance between the matrices mapped into the tangent space at the identity, which implies a deformation. Therefore, a more natural measure should be considered to hold the Riemannian geometry. To this end, we consider an effective and efficient divergence measure, termed the Jensen-Bregman LogDet Divergence (JBLD), which redefines Eq. 4 as:

$D(Z_i, Z_j) = \log\big|(Z_i + Z_j)/2\big| - 0.5 \log|Z_i Z_j|,$ (11)

where $|\cdot|$ denotes the determinant.

Classification Block

The final block performs the classification, which is formulated via the following objective function:

$\mathcal{L}_c(Z, W) = \sum_{i=1}^{N} f(v_i, y_i; W),$ (12)

where the function $f(\cdot)$ with parameter set $W$ learns the classifier on $v_i$ according to the provided class labels $y_i$, and $v_i$ is the vector representation produced by the proposed BoMS layer. There are several choices for the definition of $f$; we resort to the cross-entropy loss with an FC layer and a softmax layer. Finally, combining $\mathcal{L}_{ms}$ and $\mathcal{L}_c$, we obtain the final loss function for SPD classification as:

$\mathcal{L} = \mathcal{L}_{ms}\big(F(X), \hat{Z}\big) + \mathcal{L}_c\big(f_p(F(X)), W\big).$ (13)
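To make the encoding concrete, the sketch below computes the JBLD divergence of Eq. 11 and the resulting BoMS vector of Eq. 8 for a batch of SPD features in PyTorch. The function names and the random test matrices are our own illustration and only assume standard PyTorch operations.

```python
import torch

def jbld(A, B):
    """Jensen-Bregman LogDet divergence (Eq. 11) between SPD matrices:
    log|(A+B)/2| - 0.5*log|AB|, using log|AB| = log|A| + log|B|."""
    return torch.logdet((A + B) / 2) - 0.5 * (torch.logdet(A) + torch.logdet(B))

def boms_encode(Z, Z_hat):
    """BoMS encoding (Eq. 8): each SPD feature Z_i is represented by its
    vector of divergences to the m summarized atoms in Z_hat."""
    return jbld(Z.unsqueeze(1), Z_hat.unsqueeze(0))   # (n, m) representation v

def random_spd(k, d):
    """Random SPD matrices, only for the usage example below."""
    M = torch.randn(k, d, d)
    return M @ M.transpose(-1, -2) + 1e-3 * torch.eye(d)

# Usage sketch (dimensions follow the 22x22 BCI setting): n features, m atoms.
v = boms_encode(random_spd(8, 22), random_spd(5, 22))
print(v.shape)   # torch.Size([8, 5])
```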
Learning with BoMS

For SPD matrix based classification, the proposed BoMS can be directly integrated into SPDNet, which can then be written as a composition of sequentially connected functions with an input SPD matrix $X$ and an output predicted class label. To train such a deep network, one can use matrix back-propagation (Ionescu, Vantzos, and Sminchisescu 2015) together with stochastic gradient descent. Fortunately, the gradients of the parameters in the FC layer and the softmax layer can be easily calculated in the traditional way, as these layers lie in the Euclidean space. The major problem is that the matrices must hold the SPD constraint in the BiMap layer, the ReEig layer, and the proposed BoMS layer. For the gradients of the function $F$ in Eq. 8, which contains the BiMap and ReEig layers, similar to SPDNet we use a customized updating scheme on Stiefel manifolds. For the proposed BoMS layer, the gradients of the corresponding parameters come from two information flows: one is the gradient from the classification block, and the other is from the data summarization loss in Eq. 7. Specifically, there are three components that need to be updated: the layer parameters (e.g., the metric matrix $W$ in Eq. 10), the summarized data $\hat{Z}$, and the gradient propagated to the feature extraction block $F$. For each summarized sample $\hat{Z}_i$ and the function $F$, the updating schemes are obtained by the chain rule:

$\frac{\partial \mathcal{L}}{\partial \hat{Z}_i} = \frac{\partial \mathcal{L}^1_{ms}}{\partial f_p}\frac{\partial f_p}{\partial \hat{Z}_i} - \frac{\partial \mathcal{L}^2_{ms}}{\partial f_p}\frac{\partial f_p}{\partial \hat{Z}_i} + \frac{\partial \mathcal{L}_c}{\partial f_p}\frac{\partial f_p}{\partial \hat{Z}_i}, \qquad \frac{\partial \mathcal{L}}{\partial F} = \frac{\partial \mathcal{L}^1_{ms}}{\partial f_p}\frac{\partial f_p}{\partial F} + \frac{\partial \mathcal{L}_c}{\partial f_p}\frac{\partial f_p}{\partial F},$

where $v_i = f_p\big(D(F(X), \hat{Z}_i)\big)$ is the $i$-th dimension of the BoMS feature. Due to the different distance measures in Eq. 9 to Eq. 11, we use different updating schemes to calculate the gradients. For Eq. 9, the function $f_p$ can be replaced by $f_l$ in Eq. 3, and the gradients $\partial f_p / \partial F$ follow the same updating rules as in SPDNet. Compared to Eq. 9, Eq. 10 has an additional parameter $W$, whose updating scheme is:

$\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}^1_{ms}}{\partial W} - \frac{\partial \mathcal{L}^2_{ms}}{\partial W} + \frac{\partial \mathcal{L}_c}{\partial W}.$

For Eq. 11, the gradient is $\partial D / \partial Z_i = (Z_i + Z_j)^{-1} - 0.5\, Z_i^{-1}$. Therefore, the gradients of $\hat{Z}_i$ and $F$ can be easily calculated in a way similar to the LogEig layer.

Discussion

The proposed feature learning layer can better represent the SPD matrix as a vector for the classification task. We now show the relationship between the BoMS layer and the LogEig layer: if we replace the summarized data $\hat{Z}$ in Eq. 9 with the identity matrix, the loss function in Eq. 7 reduces to a scalar constant. As a result, when $\hat{Z}$ is the identity matrix, the proposed BoMS degenerates into the original LogEig layer. Therefore, the LogEig layer can be viewed as a special case of our BoMS. Moreover, the summarized data are generated by supervised dictionary learning, which not only influences the preceding feature network through the predefined metric, but also helps learn a better classifier. Adding these two points to the Riemannian network can further improve the classification accuracy, for which quantitative evidence will be shown in our experiments. Finally, the proposed BoMS is more flexible and more scalable, since the distance function can be replaced with better metrics that reflect the Riemannian property, as also demonstrated in our experiments.

Experiments

In this section, we evaluate our BoMS model on SPDNet against state-of-the-art SPD matrix based classification methods on four different tasks, i.e., emotion recognition, action recognition, face recognition, and brain-computer interface.

1) Emotion Recognition. We use the Acted Facial Expression in the Wild (AFEW) dataset, which collects 1,345 video sequences of facial expressions acted by 330 actors in movies. This dataset has been divided into training, validation, and test sets, where each video is classified into one of seven expressions. Since the ground truth of the test set has not been released, we follow the setting in (Liu et al. 2014; Huang and Van Gool 2017) to evaluate the performance on the validation set. To augment the training set, we also segment the training videos into 1,747 small clips. Each facial frame is normalized to an image of size 20×20, and we then compute a covariance matrix feature of size 400×400.
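As a rough sketch (not the authors' exact pipeline), the snippet below shows how a set-based covariance descriptor of this kind can be computed from the frames of a clip; the small ridge term added for strict positive definiteness is an assumption.

```python
import torch

def covariance_descriptor(frames, eps=1e-4):
    """Build an SPD covariance descriptor from a video clip.

    frames: tensor of shape (num_frames, feature_dim), e.g. 20x20 gray faces
            flattened to 400-dim vectors, yielding a 400x400 matrix.
    """
    mean = frames.mean(dim=0, keepdim=True)
    centered = frames - mean
    cov = centered.t() @ centered / max(frames.shape[0] - 1, 1)
    # Small ridge keeps the matrix strictly positive definite (assumption).
    return cov + eps * torch.eye(frames.shape[1])

clip = torch.rand(120, 400)          # e.g. 120 frames of 20x20 faces, flattened
X0 = covariance_descriptor(clip)     # 400x400 SPD input for the network
```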
2) Action Recognition. We evaluate our model on the task of skeleton-based human action recognition using the HDM05 dataset, which is a large-scale dataset for SPD matrix based representation. This dataset contains 2,337 sequences of 130 action classes and provides the 3D locations of 31 joints of the subjects. To preprocess this dataset, we divide the training sequence set into around 18,000 small subsequences. We then represent each sequence by a covariance descriptor of size 93×93, which is calculated as the second-order statistics of the 3D coordinates of the 31 joints in each frame.

3) Face Recognition. We use the YouTube Celebrities (YTC) dataset to perform video face recognition, which contains 1,910 video clips of 47 subjects collected from YouTube, and most of the clips contain hundreds of frames. The dataset is randomly split into a training set and a testing set with a splitting ratio of 1:2. Each face image in a video is cropped into a 20×20 intensity image and is then histogram-equalized to eliminate lighting effects. We extract the set-based covariance matrix for each video sequence in this dataset, the size of which is 401×401.

4) BCI Classification. We further evaluate the classification performance on the BCI Competition IV dataset 2a (BCI, http://www.bbci.de/competition/iv/), which is a 22-electrode EEG motor-imagery dataset. It consists of 9 subjects and 2 sessions, where each subject has 288 four-second trials of imagined movements. To preprocess this dataset, we train the models for each subject on the first session and test on the corresponding subject's second session. We report the average precision over the 9 subjects. Following a preprocessing similar to (Schirrmeister et al. 2017), for each channel, all the EEG data are first band-pass filtered with a bandwidth of 4-38 Hz, and electrode-wise exponential moving standardization is then performed to compute exponential moving means and variances, both of which are used to standardize the continuous data. As a result, each EEG signal is represented by a 22×22 SPD matrix.
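For concreteness, here is a rough sketch of per-channel exponential moving standardization followed by the covariance computation described above; the decay factor, the numerical floor, and the regularization are assumed values, not settings reported in the paper.

```python
import torch

def exp_moving_standardize(x, decay=0.999, eps=1e-4):
    """Electrode-wise exponential moving standardization of a continuous EEG
    recording x of shape (channels, time); `decay` and `eps` are assumptions."""
    mean = x[:, 0].clone()
    var = torch.ones_like(mean)
    out = torch.empty_like(x)
    for t in range(x.shape[1]):
        mean = decay * mean + (1 - decay) * x[:, t]
        var = decay * var + (1 - decay) * (x[:, t] - mean) ** 2
        out[:, t] = (x[:, t] - mean) / torch.sqrt(torch.clamp(var, min=eps))
    return out

def trial_to_spd(trial, eps=1e-3):
    """Turn one standardized trial (22 channels x time) into a 22x22 SPD matrix."""
    trial = trial - trial.mean(dim=1, keepdim=True)
    cov = trial @ trial.t() / (trial.shape[1] - 1)
    return cov + eps * torch.eye(trial.shape[0])

trial = torch.randn(22, 1000)                       # one 4-second trial at 250 Hz
spd = trial_to_spd(exp_moving_standardize(trial))   # 22x22 SPD network input
```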
Compared Methods. We mainly compare against six state-of-the-art SPD matrix learning methods: Covariance Discriminative Learning (CDL) (Wang et al. 2012), Log-Euclidean Metric Learning (LEML) (Huang et al. 2015), SPD Manifold Learning (SPDML) (Harandi, Salzmann, and Hartley 2017) with the affine-invariant metric (SPDML-AIM) (Pennec, Fillard, and Ayache 2006) and the Stein divergence (SPDML-Stein) (Sra 2016), Riemannian Sparse Representation (RSR) (Harandi et al. 2012), Matrix Square Root Normalization (MSRN) (Lin and Maji 2017), and the Riemannian Network (SPDNet) (Huang and Van Gool 2017). For the above methods, we use the source code kindly provided by the authors and tune the parameters according to the original settings. For SPDNet, we use the best-performing settings from the original work, which uses three blocks of BiMap/ReEig layers for AFEW and HDM05, and one BiMap layer and one ReEig layer for the BCI competition. To verify the efficiency of the Riemannian-based network, we also compare against a vanilla neural network, which is composed of multiple fully-connected layers and one softmax layer. For MSRN, we use the best structure in (Lin and Maji 2017), which contains a matrix square root, element-wise signed square-root normalization, an FC layer, and a softmax layer. For the vanilla NN and MSRN, three FC layers are used on AFEW and HDM05, and one FC layer is used on BCI.

The Proposed Riemannian Network. BoMS mainly contains three types of measures, named BoMS-1, BoMS-2, and BoMS-3 according to Eq. 9 to Eq. 11, respectively. To further validate the proposed model, we also compare against abbreviated versions based on the proposed BoMS-2: we first delete the metric learning part in Eq. 5, termed BoMS-ML, which treats BoMS as the traditional BoVW model, similar to (Passalis and Tefas 2017). We then delete the loss accounting for the data divergence in Eq. 6, while preserving the metric learning loss; we name this method BoMS-D. Finally, we delete all parts of the data summarization in Eq. 7 to evaluate the importance of the BoMS model, named BoMS-A.

Table 1: Results on the AFEW, HDM05, BCI IV 2a, and YTC datasets. Note that the baseline results on AFEW and HDM05 are cited from (Huang and Van Gool 2017), which we have also reproduced. All accuracy rates are averages. The last column shows the testing time of each method, measured on the whole AFEW testing set.

Method        AFEW     HDM05    BCI      YTC      Time
CDL           31.81%   41.74%   45.02%   83.99%   2243 s
LEML          25.13%   46.87%   40.39%   81.93%   1823 s
SPDML-AIM     26.72%   47.25%   57.72%   80.49%   5366 s
SPDML-Stein   24.55%   46.21%   53.47%   74.52%   1849 s
RSR           27.49%   41.12%   45.49%   81.8%    4841 s
MSRN          N/A      59.92%   47.72%   -        N/A
SPDNet        34.23%   61.45%   55.59%   89.01%   6.29 s
BoMS-1        35.04%   71.03%   61.65%   -        6.39 s
BoMS-2        38.81%   71.79%   65.20%   89.81%   6.49 s
BoMS-3        36.93%   70.42%   62.23%   -        11.53 s
BoMS-ML       34.23%   68.84%   61.11%   83.04%   -
BoMS-D        36.93%   70.80%   65.08%   89.04%   -
BoMS-A        32.88%   68.61%   60.42%   80.02%   -
Vanilla NN    N/A      49.29%   49.88%   -        N/A

The Setting of Network Architecture. The architecture of the proposed BoMS+SPDNet is $X \to F \to f_p \to f_{fc} \to f_s \to \hat{y}$, where $F$, $f_p$, $f_{fc}$, $f_s$, and $\hat{y}$ denote the feature extractor, the BoMS layer, the FC layer, the softmax layer, and the predicted label, respectively. For the first two datasets, we use an architecture similar to SPDNet with three BiMap layers $f_b^{(k)}$ and two ReEig layers $f_r^{(k)}$, the exemplar structure of which is $X_0 \to f_b^{(1)} \to f_r^{(2)} \to f_b^{(3)} \to f_r^{(4)} \to f_b^{(5)} \to Z$. The BiMap sizes on AFEW are set to 400×200, 200×100, and 100×50, respectively. The sizes on HDM05 are set to 93×70, 70×50, and 50×30, respectively, and the sizes on YTC are set to 401×200, 200×100, and 100×50, respectively. For the BCI dataset, we verify the performance with just one BiMap layer and one ReEig layer as the feature extractor, whose size is set to 22×15.

Parameter Setting. We implement our Riemannian network with BoMS using PyTorch on a single PC with a dual-core i7-3421 CPU and 128 GB memory. We use stochastic gradient descent to update the network parameters, with the learning rate set to $1 \times 10^{-3}$ and a weight decay of $5 \times 10^{-4}$. The batch size is set to 30, and the weights are initialized as random semi-orthogonal matrices, similar to SPDNet. As described before, the summarized data are uniformly sampled according to the label ratio; for unbalanced label ratios, we keep at least one sample for each category. For all three benchmarks, the scale of the summarized data set is selected based on a randomly sampled validation set. In all our experiments, we empirically set $\lambda_1 = 1.2$ and $\lambda_2 = 0.7$ according to parameter tuning.
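A minimal sketch of the corresponding optimizer setup and combined objective follows, assuming the layers from the earlier sketches are collected in a standard nn.Module; the weight-identification heuristic and helper names are our own, and the Stiefel-manifold re-projection of the BiMap weights is only indicated in a comment.

```python
import torch
import torch.nn.functional as F

def configure_training(model):
    """Reported hyper-parameters: SGD, lr 1e-3, weight decay 5e-4, batch 30,
    random semi-orthogonal BiMap weights (crude shape heuristic below)."""
    for p in model.parameters():
        if p.dim() == 2 and p.shape[0] < p.shape[1]:
            torch.nn.init.orthogonal_(p)   # semi-orthogonal initialization
    return torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=5e-4)

def total_loss(logits, y, l1_ms, l2_ms, lam1=1.2, lam2=0.7):
    """Combined objective of Eq. 13 with the reported trade-offs lambda_1 = 1.2
    and lambda_2 = 0.7; l1_ms and l2_ms are the summarization terms of Eq. 5
    and Eq. 6, computed elsewhere. BiMap weights would additionally be
    re-projected onto the Stiefel manifold after each step (not shown)."""
    return F.cross_entropy(logits, y) + lam1 * l1_ms - lam2 * l2_ms
```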
Results and Analysis.

As shown in Tab. 1, the proposed model with the three different distance measures obtains accuracy improvements of 10.8%, 12.8%, and 10.1%, respectively, over the second-best methods such as SPDNet and SPDML-AIM.

Figure 2: Subfigures (a) and (b) visualize the summarized data on the AFEW dataset; each colored point represents a category, (a) shows the initial distribution, (b) shows the distribution after training, and the black circle points denote the summarized data. Subfigure (c) shows the detailed evaluation (accuracy vs. label frequency) for SPDNet and BoMS-2 on the imbalanced HDM05 dataset. (Best viewed in color)

It is worth noting that Riemannian-based neural networks, i.e., both SPDNet and ours, not only achieve competitive results but also use less time for online prediction. When compared with the vanilla NN and MSRN that are widely used in bilinear models (Lin, RoyChowdhury, and Maji 2015; Lin and Maji 2017), our proposed models achieve an average accuracy improvement of 35.3%. Tab. 1 also reports the testing time of the different models. All the traditional Riemannian-based models require significantly more time for online prediction, since the kNN classifier requires pairwise distance computation and comparison. Since the scale of the SPD feature on AFEW is 400×400, the vectorized dimension of such a matrix is too high and makes the training of both the vanilla NN and MSRN intractable; therefore, their accuracy and testing time are not reported. Moreover, HDM05 is a dataset with imbalanced labels. We calculate the variance of the precision for each label: the average variance score of SPDNet is 0.1017 and that of BoMS-2 is 0.0935. More detailed experimental results are given in Fig. 2(c). In conclusion, our work still achieves the best statistical results, which demonstrates that BoMS is more robust in the imbalanced classification task.

In addition, our proposed BoMS-1 calculates the LEM between the features and the summarized set multiple times; nevertheless, compared to SPDNet, the method is still efficient. It is worth noting that the testing time of BoMS-1 is faster than that of BoMS-3, which is due to the framework we used, which can handle batch data directly. When we test BoMS-1 using the same calculation scheme as BoMS-3, the testing time is 15.55 s, which is slower than BoMS-3. As a result, the Riemannian-based NNs are all effective and efficient for online testing, which verifies the importance of Riemannian-based NNs for handling SPD matrix inputs. Comparing BoMS-1 with SPDNet, although BoMS-1 also uses the Log-Euclidean metric in its calculation, its accuracies on all three benchmarks are better than those of SPDNet. To analyze this, each dimension of the vector $v$ reflects how likely the SPD feature belongs to the corresponding category, where a small value in the $i$-th dimension means the feature is close to the category of the $i$-th summarized sample.

Figure 3: (a) The convergence curve on the representative HDM05 dataset; (b) the parameter analysis of $\lambda_1$ vs. $\lambda_2$ on the representative AFEW dataset. (Best viewed in color)
On the other hand, the summarized data can be regarded as the important samples of their corresponding categories, which leads to a higher information entropy for each dimension of the vector representation. When this part of Eq. 7 is ignored, the performance decreases significantly on all three benchmarks. This demonstrates that the data summarization, which can be viewed as a supervised pooling layer, is very important for improving the classification accuracy. Besides, compared to SPDNet, BoMS-A is always better or competitive, which supports our argument that a better statistical feature representation is needed.

To demonstrate the divergence among the summarized data, we first plot the summarized data by class before and after training, as shown in Fig. 2(a-b). Before training, the distribution of data features is chaotic and the categories are difficult to distinguish. After training, the category information can be easily separated. Besides, the summarized data (black circle points) are not aggregated at the center of each category but are relatively dispersed within the category.

However, the Log-Euclidean metric has some inherent disadvantages. We therefore propose two other solutions to further improve the accuracy, i.e., BoMS-2 and BoMS-3. When metric learning is introduced into the Log-Euclidean metric, the performance is the best. Although BoMS-3 only achieves the second-best performance on all three benchmarks, its training is very efficient. It is worth noting that the accuracy of the Jensen-Bregman LogDet divergence can also be improved with metric learning. The results on the three tasks show that the proposed BoMS has superior classification accuracy.

The convergence curve of our network is shown in Fig. 3(a), which suggests that our Riemannian network converges quickly. In addition, we analyze the validity of the different parts of the proposed model, from which we find that the representative model BoMS-2 achieves the best result. Consequently, we evaluate the classification results by simultaneously tuning the parameters $\lambda_1$ and $\lambda_2$ on the validation set for all three datasets; the results on the representative AFEW dataset are shown in Fig. 3(b). We find that the best accuracy is achieved when empirically setting $\lambda_1 = 1.2$ and $\lambda_2 = 0.7$, which is consistent across all datasets.

Table 2: The training time per epoch (Time), compression ratio (CR), and accuracy (Acc) for different scales of the summarized set, evaluated on the representative AFEW dataset.

# Ẑ        11     18     25     31     38     46     53
Time (s)   43.7   47.0   51.1   53.5   55.9   63.2   67.9
CR (%)     0.62   1.03   1.43   1.78   2.18   2.63   3.03
Acc (%)    35.3   36.4   36.4   38.8   35.9   34.8   34.5

Compared to BoMS-D, which sets $\lambda_2 = 0$, adding the divergence term brings a certain performance improvement, which indicates that the divergence should be considered in the data summarization. When we delete the triplet metric learning part, BoMS-ML shows a significant performance decrease compared to BoMS-2. As mentioned before, LEM often leads to a less accurate distance measure, so metric learning needs to be considered to reduce this loss. The same phenomenon also appears in the comparison of BoMS-1 and BoMS-3.

The number $m$ reflects the scale of the summarized set, whose relation to the classification accuracy is shown in Tab. 2. The results show that either a too large or a too small number leads to poor performance.
A small number means that the representative information contained in the summarized data is limited and therefore cannot sufficiently cover the training data. In contrast, a large number means that more unrelated data are combined, bringing more noise. Moreover, Tab. 2 shows the training time with an increasing number of summarized data; the results show that a larger number needs a longer training time. Therefore, a suitable size not only improves the classification performance but also improves the training efficiency. According to the results, this number is set to 31, 258, 210, and 32 for AFEW, HDM05, YTC, and BCI, respectively, according to the accuracies on the validation set.

Conclusion

This paper has proposed a Bag-of-Matrix-Summarization method, which is combined with SPDNet for SPD matrix based classification. The proposed BoMS addresses the Riemannian codebook learning and Riemannian NN optimization issues in existing approaches, and is based on the idea of summarizing data via a metric learning scheme to compress the whole training set into a predefined feature set. The low-dimensional SPD matrix produced by the Riemannian network is then quantized into the predefined matrix summarization bins. Finally, a constant-length vector representation is extracted for each SPD matrix by calculating the divergence between the data feature and the matrix summarization. The proposed method can be integrated into the Riemannian network, and the whole framework can be trained end-to-end via regular matrix back-propagation. The experiments on four benchmarks demonstrate that the proposed method outperforms the existing state-of-the-art methods in SPD matrix classification. In future work, we will mainly consider integrating other divergences for Riemannian geometry, such as the α-β divergence.

Acknowledgments

This work is supported by the National Key R&D Program (No. 2017YFC0113000 and No. 2016YFB1001503), the Natural Science Foundation of China (No. U1705262, No. 61772443, and No. 61572410), the Postdoctoral Innovative Talent Support Program under Grant BX201600094, the China Postdoctoral Science Foundation under Grant 2017M612134, the Scientific Research Project of the National Language Committee of China (Grant No. YB135-49), and the Natural Science Foundation of Fujian Province, China (No. 2017J01125 and No. 2018J01106).

References

Anirudh, R.; Turaga, P.; Su, J.; and Srivastava, A. 2017. Elastic functional coding of Riemannian trajectories. TPAMI.
Arsigny, V.; Fillard, P.; Pennec, X.; and Ayache, N. 2006. Log-Euclidean metrics for fast and simple calculus on diffusion tensors. Magnetic Resonance in Medicine.
Arsigny, V.; Fillard, P.; Pennec, X.; and Ayache, N. 2007. Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM JMAA.
Bronstein, M. M.; Bruna, J.; LeCun, Y.; Szlam, A.; and Vandergheynst, P. 2017. Geometric deep learning: going beyond Euclidean data. SPM.
Cherian, A.; Sra, S.; Banerjee, A.; and Papanikolopoulos, N. 2013. Jensen-Bregman LogDet divergence with application to efficient similarity search for covariance matrices. TPAMI.
Cherian, A.; Stanitsas, P.; Harandi, M.; Morellas, V.; and Papanikolopoulos, N. 2017. Learning discriminative α-β divergences for positive definite matrices. ICCV.
Cherian, A.; Morellas, V.; and Papanikolopoulos, N. 2016. Bayesian nonparametric clustering for positive definite matrices. TPAMI.
Fathy, M. E., and Chellappa, R. 2017. Image set classification using sparse Bayesian regression. WACV.
Guo, K.; Ishwar, P.; and Konrad, J. 2013. Action recognition from video using feature covariance matrices. TIP.
Harandi, M. T.; Sanderson, C.; Hartley, R.; and Lovell, B. C. 2012. Sparse coding and dictionary learning for symmetric positive definite matrices: A kernel approach. ECCV.
Harandi, M.; Salzmann, M.; and Hartley, R. 2017. Dimensionality reduction on SPD manifolds: The emergence of geometry-aware methods. TPAMI.
Huang, Z., and Van Gool, L. J. 2017. A Riemannian network for SPD matrix learning. AAAI.
Huang, Z.; Wang, R.; Shan, S.; Li, X.; and Chen, X. 2015. Log-Euclidean metric learning on symmetric positive definite manifold with application to image set classification. ICML.
Huang, Z.; Wan, C.; Probst, T.; and Van Gool, L. 2017a. Deep learning on Lie groups for skeleton-based action recognition. CVPR.
Huang, Z.; Wang, R.; Li, X.; Liu, W.; Shan, S.; Van Gool, L.; and Chen, X. 2017b. Geometry-aware similarity learning on SPD manifolds for visual recognition. TCSVT.
Ionescu, C.; Vantzos, O.; and Sminchisescu, C. 2015. Matrix backpropagation for deep networks with structured layers. ICCV.
Ji, R.; Liu, H.; Cao, L.; Liu, D.; Wu, Y.; and Huang, F. 2017. Toward optimal manifold hashing via discrete locally linear embedding. TIP.
Kusner, M.; Tyree, S.; Weinberger, K.; and Agrawal, K. 2014. Stochastic neighbor compression. ICML.
Lin, T.-Y., and Maji, S. 2017. Improved bilinear pooling with CNNs. BMVC.
Lin, S.; Ji, R.; Chen, C.; Tao, D.; and Luo, J. 2018. Holistic CNN compression via low-rank decomposition with knowledge transfer. TPAMI.
Lin, T.-Y.; RoyChowdhury, A.; and Maji, S. 2015. Bilinear CNN models for fine-grained visual recognition. ICCV.
Liu, M.; Shan, S.; Wang, R.; and Chen, X. 2014. Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. CVPR.
Liu, H.; Ji, R.; Wang, J.; and Shen, C. 2018. Ordinal constraint binary coding for approximate nearest neighbor search. TPAMI.
Lotte, F.; Bougrain, L.; Cichocki, A.; Clerc, M.; Congedo, M.; Rakotomamonjy, A.; and Yger, F. 2018. A review of classification algorithms for EEG-based brain-computer interfaces: a 10 year update. Journal of Neural Engineering.
Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted Boltzmann machines. ICML.
Passalis, N., and Tefas, A. 2017. Learning bag-of-features pooling for deep convolutional neural networks. ICCV.
Pennec, X.; Fillard, P.; and Ayache, N. 2006. A Riemannian framework for tensor computing. IJCV.
Schirrmeister, R. T.; Springenberg, J. T.; Fiederer, L. D. J.; Glasstetter, M.; Eggensperger, K.; Tangermann, M.; Hutter, F.; Burgard, W.; and Ball, T. 2017. Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping.
Sivalingam, R.; Boley, D.; Morellas, V.; and Papanikolopoulos, N. 2015. Tensor dictionary learning for positive definite matrices. TIP.
Sra, S. 2012. A new metric on the manifold of kernel matrices with application to matrix geometric means. NIPS.
Sra, S. 2016. Positive definite matrices and the S-divergence. AMS.
Wang, R.; Guo, H.; Davis, L. S.; and Dai, Q. 2012. Covariance discriminative learning: A natural and efficient approach to image set classification. CVPR.
Wang, L.; Zhang, J.; Zhou, L.; Tang, C.; and Li, W. 2015a. Beyond covariance: Feature representation with nonlinear kernel matrices. ICCV.
Wang, W.; Wang, R.; Huang, Z.; Shan, S.; and Chen, X. 2015b. Discriminant analysis on Riemannian manifold of Gaussian distributions for face recognition with image sets. CVPR.