# Semi-Orthogonal Multilinear PCA with Relaxed Start

Qiquan Shi and Haiping Lu
Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
csqqshi@comp.hkbu.edu.hk, haiping@hkbu.edu.hk

## Abstract

Principal component analysis (PCA) is an unsupervised method for learning low-dimensional features with orthogonal projections. Multilinear PCA methods extend PCA to deal with multidimensional data (tensors) directly via tensor-to-tensor projection or tensor-to-vector projection (TVP). However, under the TVP setting, it is difficult to develop an effective multilinear PCA method with the orthogonality constraint. This paper tackles this problem by proposing a novel Semi-Orthogonal Multilinear PCA (SO-MPCA) approach. SO-MPCA learns low-dimensional features directly from tensors via TVP by imposing the orthogonality constraint in only one mode. This formulation results in more captured variance and more learned features than full orthogonality. For better generalization, we further introduce a relaxed start (RS) strategy to get SO-MPCA-RS by fixing the starting projection vectors, which increases the bias and reduces the variance of the learning model. Experiments on both face (2D) and gait (3D) data demonstrate that SO-MPCA-RS outperforms other competing algorithms on the whole, and the relaxed start strategy is also effective for other TVP-based PCA methods.

## Introduction

Principal component analysis (PCA) is a classical unsupervised dimensionality reduction method [Jolliffe, 2002]. It transforms input data into a new feature space of lower dimension via orthogonal projections, while keeping most of the variance of the original data. PCA is widely used in areas such as data compression [Kusner et al., 2014], computer vision [Ke and Sukthankar, 2004], and pattern recognition [Anaraki and Hughes, 2014; Deng et al., 2014].

Many real-world data are multi-dimensional, taking the form of tensors rather than vectors [Kolda and Bader, 2009]. The number of dimensions of a tensor is its order, and each dimension is a mode. For example, gray-level images are second-order tensors (matrices) and video sequences are third-order tensors [Lu et al., 2013]. Tensor data are also common in applications such as data center monitoring, social network analysis, and network forensics [Faloutsos et al., 2007]. However, applying PCA to multi-dimensional data requires reshaping tensors into vectors first. This vectorization breaks the original data structure, leads to more complex models with many parameters, and incurs high computational and memory demands [Lu et al., 2013]. Many researchers address this problem via multilinear extensions of PCA that deal with tensors directly, and there are two main approaches.

One approach is based on Tensor-to-Tensor Projection (TTP), which learns low-dimensional tensors from high-dimensional tensors. The two-dimensional PCA (2DPCA) [Yang et al., 2004] is probably the first PCA extension to deal with images without vectorization. The generalized low rank approximation of matrices (GLRAM) [Ye, 2005] and the generalized PCA (GPCA) [Ye et al., 2004] further generalize 2DPCA from single-sided projections to two-sided projections, via reconstruction error minimization and variance maximization, respectively. Concurrent subspace analysis (CSA) [Xu et al., 2005] and multilinear PCA (MPCA) [Lu et al., 2008] extend GLRAM and GPCA to general higher-order tensors, respectively.
Another approach is based on Tensor-to-Vector Projection (TVP), which learns low-dimensional vectors from high-dimensional tensors in a successive way. The tensor rank-one decomposition (TROD) [Shashua and Levin, 2001] minimizes reconstruction error via (greedy) successive residue calculation. The uncorrelated multilinear PCA (UMPCA) [Lu et al., 2009] maximizes variance with the zero-correlation constraint, following the successive derivation of PCA. However, the number of features that UMPCA can extract is upper-bounded by the lowest mode dimension. For example, for a tensor of size 300 × 200 × 3, UMPCA can extract only three features, which is of very limited use.

The orthogonality constraint is popular in feature extraction [Hua et al., 2007; Kokiopoulou and Saad, 2007; Gao et al., 2013], tensor decomposition [Kolda, 2001], and low-rank tensor approximation [Edelman et al., 1998; Wang et al., 2015]. PCA also obtains orthogonal projections, and the TTP-based PCA methods produce orthogonal projection vectors in each mode. However, none of the existing TVP-based PCA methods derive orthogonal projections. Our study found that it is indeed ineffective to impose full orthogonality in all modes for TVP-based PCA, due to the low captured variance and the limited number of extracted features.

In this paper, we present a new TVP-based multilinear PCA algorithm, Semi-Orthogonal Multilinear PCA (SO-MPCA) with Relaxed Start, or SO-MPCA-RS, to be detailed in Sec. 3. There are two main contributions:

- We propose a novel SO-MPCA approach to maximize the captured variance via TVP with the orthogonality constraint in only one mode, which is called semi-orthogonality according to [Wang et al., 2015]. The semi-orthogonality results in more captured variance and more learned features than full orthogonality. For the same tensor of size 300 × 200 × 3 discussed earlier, SO-MPCA can extract 300 features, while full-orthogonal multilinear PCA can extract only three features (similar to UMPCA).
- We introduce a Relaxed Start (RS) strategy to get SO-MPCA-RS by fixing the starting projection vectors for better generalization [Abu-Mostafa et al., 2012]. This strategy constrains the hypothesis space to a smaller set, leading to increased bias and reduced variance of the learning model. The experimental results in Sec. 4 show that SO-MPCA-RS outperforms other competing PCA-based methods on the whole. In addition, this is a new strategy for tensor-based algorithms, and we show its effectiveness for other TVP-based PCA methods.

In the following, we cover the necessary background first.

**Notations and basic operations:** We follow the notations in [Lathauwer et al., 2000] and denote vectors by lowercase boldface letters, e.g., $\mathbf{x}$; matrices by uppercase boldface letters, e.g., $\mathbf{X}$; and tensors by calligraphic letters, e.g., $\mathcal{X}$. We denote their elements with indices in parentheses, and indices by lowercase letters spanning the range from 1 to the uppercase letter of the index, e.g., $n = 1, \ldots, N$. An $N$th-order tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is addressed by $N$ indices $\{i_n\}$, with each $i_n$ addressing the $n$-mode of $\mathcal{A}$. The $n$-mode product of an $N$th-order tensor $\mathcal{A}$ by a vector $\mathbf{u} \in \mathbb{R}^{I_n}$, denoted by $\mathcal{B} = \mathcal{A} \times_n \mathbf{u}^T$, is a tensor with entries:

$$\mathcal{B}(i_1, \ldots, i_{n-1}, 1, i_{n+1}, \ldots, i_N) = \sum_{i_n} \mathcal{A}(i_1, \ldots, i_N) \cdot \mathbf{u}(i_n). \quad (1)$$
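To make the $n$-mode product concrete, here is a minimal NumPy sketch. The function name `mode_n_vector_product` and the example sizes are illustrative assumptions rather than anything from the paper; note that `numpy.tensordot` drops the contracted mode instead of keeping it as a singleton dimension, which is equivalent to (1) up to a reshape.

```python
import numpy as np

def mode_n_vector_product(A, u, n):
    """n-mode product of tensor A with vector u, i.e. B = A x_n u^T.

    Sums A(..., i_n, ...) * u(i_n) over i_n; the contracted mode is dropped
    rather than kept as a singleton dimension.
    """
    return np.tensordot(A, u, axes=([n], [0]))

# Example: contract mode 1 (0-indexed) of a 4 x 3 x 2 tensor with a length-3 vector.
A = np.random.rand(4, 3, 2)
u = np.random.rand(3)
B = mode_n_vector_product(A, u, 1)
print(B.shape)  # (4, 2)
```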
**Tensor-to-vector projection:** Elementary multilinear projections (EMPs) are the building blocks of a TVP. We denote an EMP as $\{\mathbf{u}^{(1)}, \mathbf{u}^{(2)}, \ldots, \mathbf{u}^{(N)}\}$, consisting of one unit projection vector in each mode, i.e., $\|\mathbf{u}^{(n)}\| = 1$ for $n = 1, \ldots, N$, where $\|\cdot\|$ is the Euclidean norm for vectors. It projects a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ to a scalar $y$ through the $N$ unit projection vectors as [Lu et al., 2013]:

$$y = \mathcal{X} \times_1 \mathbf{u}^{(1)T} \times_2 \mathbf{u}^{(2)T} \cdots \times_N \mathbf{u}^{(N)T}. \quad (2)$$

The TVP of a tensor $\mathcal{X}$ to a vector $\mathbf{y} \in \mathbb{R}^P$ consists of $P$ EMPs $\{\mathbf{u}_p^{(1)}, \ldots, \mathbf{u}_p^{(N)}\}$, $p = 1, \ldots, P$, which can be written concisely as $\{\mathbf{u}_p^{(n)}, n = 1, \ldots, N\}_{p=1}^{P}$ or $\{\mathbf{u}_p^{(n)}\}_{p=1}^{P}$:

$$\mathbf{y} = \mathcal{X} \times_{n=1}^{N} \{\mathbf{u}_p^{(n)}, n = 1, \ldots, N\}_{p=1}^{P}, \quad (3)$$

where the $p$th component of $\mathbf{y}$ is obtained from the $p$th EMP as:

$$y_p = \mathbf{y}(p) = \mathcal{X} \times_1 \mathbf{u}_p^{(1)T} \cdots \times_N \mathbf{u}_p^{(N)T} = \mathcal{X} \times_{n=1}^{N} \{\mathbf{u}_p^{(n)}\}. \quad (4)$$

## SO-MPCA with Relaxed Start

This section presents the proposed SO-MPCA-RS by first formulating the SO-MPCA problem, then deriving the solutions with a successive and conditional approach, and finally introducing the relaxed start strategy for better generalization.

### Formulation of Semi-Orthogonal MPCA

We define the SO-MPCA problem with the orthogonality constraint in only one mode, i.e., semi-orthogonality [Wang et al., 2015], as follows:

**The SO-MPCA problem:** A set of $M$ tensor data samples $\{\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_M\}$ is available for training. Each sample $\mathcal{X}_m \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ can be viewed as a point in the tensor space $\mathbb{R}^{I_1} \otimes \mathbb{R}^{I_2} \otimes \cdots \otimes \mathbb{R}^{I_N}$, where $I_n$ is the $n$-mode dimension and $\otimes$ denotes the Kronecker product. SO-MPCA considers a TVP, which consists of $P$ EMPs $\{\mathbf{u}_p^{(n)} \in \mathbb{R}^{I_n \times 1}, n = 1, \ldots, N\}_{p=1}^{P}$, that projects the input tensor space $\mathbb{R}^{I_1} \otimes \mathbb{R}^{I_2} \otimes \cdots \otimes \mathbb{R}^{I_N}$ into a vector subspace $\mathbb{R}^{P}$, i.e.,

$$\mathbf{y}_m = \mathcal{X}_m \times_{n=1}^{N} \{\mathbf{u}_p^{(n)}, n = 1, \ldots, N\}_{p=1}^{P} \quad (5)$$

for $m = 1, \ldots, M$. The objective is to find a TVP that maximizes the variance of the projected samples in each projection direction, subject to the orthogonality constraint in only one mode, denoted as the $\nu$-mode. The variance is measured by the total scatter $S_p$ defined as:

$$S_p = \sum_{m=1}^{M} (y_{mp} - \bar{y}_p)^2, \quad (6)$$

where $y_{mp} = \mathcal{X}_m \times_{n=1}^{N} \{\mathbf{u}_p^{(n)}\}$ and $\bar{y}_p = \frac{1}{M} \sum_{m} y_{mp}$. In other words, the objective of SO-MPCA is to obtain the $P$ EMPs, with the $p$th EMP determined as:

$$\{\mathbf{u}_p^{(n)}, n = 1, \ldots, N\} = \arg\max \sum_{m=1}^{M} (y_{mp} - \bar{y}_p)^2, \quad (7)$$
$$\text{s.t. } \mathbf{u}_p^{(n)T} \mathbf{u}_p^{(n)} = 1 \text{ for } n = 1, \ldots, N, \text{ and} \quad (8)$$
$$\mathbf{u}_p^{(\nu)T} \mathbf{u}_q^{(\nu)} = 0 \text{ for } p > 1 \text{ and } q = 1, \ldots, p-1, \quad (9)$$

where the orthogonality constraint (9) is imposed only in the $\nu$-mode and there is no such constraint for the other modes ($n = 1, \ldots, N$, $n \neq \nu$). The normalization constraint (8) is imposed for all modes.

**Bound on the number of features:** Based on the proof of Corollary 1 in [Lu et al., 2009], we can derive that the number of features $P$ that can be extracted by SO-MPCA is upper-bounded by the $\nu$-mode dimension $I_\nu$: $P \le I_\nu$. Since we can choose any mode $n$ as $\nu$, the upper bound of $P$ is $P \le \max_n I_n$ (i.e., the highest mode dimension).

**Selection of mode $\nu$:** Although we are free to choose any mode $n$ as $\nu$ to impose the orthogonality constraint (9), it is often good to have more features in practice. Thus, in this paper, we choose the mode with the highest dimension as $\nu$:

$$\nu = \arg\max_n I_n, \quad (10)$$

such that $P = \max_n I_n = I_\nu$. On the other hand, we can also obtain a total of $\sum_n I_n$ features by running SO-MPCA $N$ times with $\nu = 1, \ldots, N$. In this paper, we focus only on SO-MPCA with $\nu$ determined by (10).
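For concreteness, the following NumPy sketch evaluates the TVP (5) for a batch of samples and the total scatter (6) of each projected feature. All function names, shapes, and the random data here are illustrative assumptions, not part of the paper.

```python
import numpy as np

def emp_project(X, emp):
    """Project one tensor to a scalar with one EMP (Eqs. (2)/(4))."""
    y = X
    for u in emp:  # contract the current leading mode repeatedly
        y = np.tensordot(y, u, axes=([0], [0]))
    return float(y)

def tvp_project(X, emps):
    """Project one tensor to a P-dimensional feature vector with P EMPs (Eq. (5))."""
    return np.array([emp_project(X, emp) for emp in emps])

def total_scatter(Xs, emps):
    """Total scatter S_p of each feature over all samples (Eq. (6))."""
    Y = np.stack([tvp_project(X, emps) for X in Xs])   # M x P feature matrix
    return ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)

def unit(v):
    return v / np.linalg.norm(v)

# Toy example: M = 20 random 5 x 4 x 3 tensors projected with P = 2 unit-norm EMPs.
rng = np.random.default_rng(0)
Xs = rng.random((20, 5, 4, 3))
emps = [[unit(rng.random(d)) for d in (5, 4, 3)] for _ in range(2)]
print(total_scatter(Xs, emps))  # two scatter values, one per EMP
```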
**Semi-orthogonality vs. full-orthogonality:** If we impose the orthogonality constraint (9) in all modes, we get Full-Orthogonal Multilinear PCA (FO-MPCA). However, our study found that FO-MPCA is not effective, primarily for two reasons:

- Due to the heavy constraints, the variance captured by FO-MPCA is quite low, even lower than that of UMPCA. In contrast, SO-MPCA can capture more variance than both FO-MPCA and UMPCA, as illustrated in Fig. 1 in Sec. 4.
- Similar to UMPCA, the number of features that can be extracted by FO-MPCA is upper-bounded by the lowest mode dimension $\min_n I_n$, which can be quite limited. For instance, FO-MPCA can extract only three features for a tensor of size 300 × 200 × 3, while SO-MPCA can extract 300 features by choosing $\nu = 1$ for the same tensor. This can be observed in Fig. 1 as well.

### Successive Derivation of SO-MPCA

To solve the SO-MPCA problem, we follow the successive derivation in [Jolliffe, 2002; Lu et al., 2009] and determine the EMPs one by one in $P$ steps:

- Step 1 ($p = 1$): Determine the first EMP $\{\mathbf{u}_1^{(n)}, n = 1, \ldots, N\}$ by maximizing $S_1$ with the constraint (8).
- Step $p$ ($p = 2, \ldots, P$): Determine the $p$th EMP $\{\mathbf{u}_p^{(n)}, n = 1, \ldots, N\}$ by maximizing $S_p$ with the constraints (8) and (9).

**Conditional subproblem:** To obtain the $p$th EMP $\{\mathbf{u}_p^{(n)}, n = 1, \ldots, N\}$, we need to determine $N$ vectors. We follow the approach of alternating least squares [Harshman, 1970], so we can only obtain locally optimal solutions, as in many other tensor-based methods. For the $p$th EMP, the parameters of the $n$-mode projection vector $\mathbf{u}_p^{(n)}$ are estimated one mode at a time, conditioned on the projection vectors in all the other modes. Assuming the $p$th projection vectors in all but the $n$-mode are given, we project the input tensor samples in these $(N-1)$ modes to obtain the partial multilinear projections as in [Lu et al., 2013]:

$$\tilde{\mathbf{y}}_{mp}^{(n)} = \mathcal{X}_m \times_1 \mathbf{u}_p^{(1)T} \cdots \times_{n-1} \mathbf{u}_p^{(n-1)T} \times_{n+1} \mathbf{u}_p^{(n+1)T} \cdots \times_N \mathbf{u}_p^{(N)T}, \quad (11)$$

where $\tilde{\mathbf{y}}_{mp}^{(n)} \in \mathbb{R}^{I_n}$. This conditional subproblem then becomes determining the $\mathbf{u}_p^{(n)}$ that projects the vector samples $\{\tilde{\mathbf{y}}_{mp}^{(n)}, m = 1, \ldots, M\}$ onto a line so as to maximize the variance captured. The total scatter matrix $\tilde{\mathbf{S}}_p^{(n)}$ corresponding to $\{\tilde{\mathbf{y}}_{mp}^{(n)}, m = 1, \ldots, M\}$ is then:

$$\tilde{\mathbf{S}}_p^{(n)} = \sum_{m=1}^{M} \big(\tilde{\mathbf{y}}_{mp}^{(n)} - \bar{\tilde{\mathbf{y}}}_p^{(n)}\big)\big(\tilde{\mathbf{y}}_{mp}^{(n)} - \bar{\tilde{\mathbf{y}}}_p^{(n)}\big)^T, \quad (12)$$

where $\bar{\tilde{\mathbf{y}}}_p^{(n)} = \frac{1}{M} \sum_{m=1}^{M} \tilde{\mathbf{y}}_{mp}^{(n)}$.

For $p = 1$ (step 1), the solution for $\mathbf{u}_1^{(n)}$, $n = 1, \ldots, N$, is obtained as the unit eigenvector of $\tilde{\mathbf{S}}_1^{(n)}$ associated with the largest eigenvalue. For $p \ge 2$, we need to treat the $\nu$-mode and the other modes differently. For modes other than $\nu$, the solution for $\mathbf{u}_p^{(n)}$, $n = 1, \ldots, N$, $n \neq \nu$, is obtained as the unit eigenvector of $\tilde{\mathbf{S}}_p^{(n)}$ associated with the largest eigenvalue.

**Constrained optimization for the $\nu$-mode and $p \ge 2$:** When $p \ge 2$, we determine $\mathbf{u}_p^{(\nu)}$ by solving the following constrained optimization problem:

$$\mathbf{u}_p^{(\nu)} = \arg\max \ \mathbf{u}_p^{(\nu)T} \tilde{\mathbf{S}}_p^{(\nu)} \mathbf{u}_p^{(\nu)}, \quad (13)$$
$$\text{s.t. } \mathbf{u}_p^{(\nu)T} \mathbf{u}_p^{(\nu)} = 1 \text{ and } \mathbf{u}_p^{(\nu)T} \mathbf{u}_q^{(\nu)} = 0, \ q = 1, \ldots, p-1.$$

We solve this problem with the following theorem:

**Theorem 1.** The solution to problem (13) is the (unit-length) eigenvector corresponding to the largest eigenvalue of the following eigenvalue problem:

$$\mathbf{\Gamma}_p^{(\nu)} \tilde{\mathbf{S}}_p^{(\nu)} \mathbf{u}_p^{(\nu)} = \lambda \mathbf{u}_p^{(\nu)}, \quad (14)$$

where

$$\mathbf{\Gamma}_p^{(\nu)} = \mathbf{I}_{I_\nu} - \sum_{q=1}^{p-1} \mathbf{u}_q^{(\nu)} \mathbf{u}_q^{(\nu)T}, \quad (15)$$

and $\mathbf{I}_{I_\nu}$ is an identity matrix of size $I_\nu \times I_\nu$.

*Proof.* First, we use Lagrange multipliers to transform problem (13) so that it includes all the constraints:

$$L_\nu = \mathbf{u}_p^{(\nu)T} \tilde{\mathbf{S}}_p^{(\nu)} \mathbf{u}_p^{(\nu)} - \lambda \big(\mathbf{u}_p^{(\nu)T} \mathbf{u}_p^{(\nu)} - 1\big) - \sum_{q=1}^{p-1} \mu_q \mathbf{u}_p^{(\nu)T} \mathbf{u}_q^{(\nu)}, \quad (16)$$

where $\lambda$ and $\{\mu_q, q = 1, \ldots, p-1\}$ are Lagrange multipliers. Then we set the partial derivative of $L_\nu$ with respect to $\mathbf{u}_p^{(\nu)}$ to zero:

$$\frac{\partial L_\nu}{\partial \mathbf{u}_p^{(\nu)}} = 2 \tilde{\mathbf{S}}_p^{(\nu)} \mathbf{u}_p^{(\nu)} - 2\lambda \mathbf{u}_p^{(\nu)} - \sum_{q=1}^{p-1} \mu_q \mathbf{u}_q^{(\nu)} = 0. \quad (17)$$

Premultiplying (17) by $\mathbf{u}_p^{(\nu)T}$, the third term vanishes and we get

$$2 \mathbf{u}_p^{(\nu)T} \tilde{\mathbf{S}}_p^{(\nu)} \mathbf{u}_p^{(\nu)} - 2\lambda \mathbf{u}_p^{(\nu)T} \mathbf{u}_p^{(\nu)} = 0 \ \Longrightarrow \ \lambda = \mathbf{u}_p^{(\nu)T} \tilde{\mathbf{S}}_p^{(\nu)} \mathbf{u}_p^{(\nu)}, \quad (18)$$

which indicates that $\lambda$ is exactly the criterion to be maximized, with the orthogonality constraint. Next, a set of $(p-1)$ equations is obtained by premultiplying (17) by $\mathbf{u}_q^{(\nu)T}$, $q = 1, \ldots, p-1$, respectively:

$$2 \mathbf{u}_q^{(\nu)T} \tilde{\mathbf{S}}_p^{(\nu)} \mathbf{u}_p^{(\nu)} - 2\lambda \mathbf{u}_q^{(\nu)T} \mathbf{u}_p^{(\nu)} - \sum_{s=1}^{p-1} \mu_s \mathbf{u}_q^{(\nu)T} \mathbf{u}_s^{(\nu)} = 0. \quad (19)$$

The second term vanishes, and the summand in the third term is non-zero only for $s = q$. Thus, we get

$$2 \mathbf{u}_q^{(\nu)T} \tilde{\mathbf{S}}_p^{(\nu)} \mathbf{u}_p^{(\nu)} - \mu_q = 0 \ \Longrightarrow \ \mu_q = 2 \mathbf{u}_q^{(\nu)T} \tilde{\mathbf{S}}_p^{(\nu)} \mathbf{u}_p^{(\nu)}. \quad (20)$$

Substituting (20) into (17), we get

$$2 \tilde{\mathbf{S}}_p^{(\nu)} \mathbf{u}_p^{(\nu)} - 2\lambda \mathbf{u}_p^{(\nu)} - \sum_{q=1}^{p-1} \mathbf{u}_q^{(\nu)} \, 2 \mathbf{u}_q^{(\nu)T} \tilde{\mathbf{S}}_p^{(\nu)} \mathbf{u}_p^{(\nu)} = 0$$
$$\Longrightarrow \ \lambda \mathbf{u}_p^{(\nu)} = \tilde{\mathbf{S}}_p^{(\nu)} \mathbf{u}_p^{(\nu)} - \sum_{q=1}^{p-1} \mathbf{u}_q^{(\nu)} \mathbf{u}_q^{(\nu)T} \tilde{\mathbf{S}}_p^{(\nu)} \mathbf{u}_p^{(\nu)} \quad (21)$$
$$\Longrightarrow \ \lambda \mathbf{u}_p^{(\nu)} = \Big[ \mathbf{I}_{I_\nu} - \sum_{q=1}^{p-1} \mathbf{u}_q^{(\nu)} \mathbf{u}_q^{(\nu)T} \Big] \tilde{\mathbf{S}}_p^{(\nu)} \mathbf{u}_p^{(\nu)}. \quad (22)$$

Using the definition in (15), (22) can be rewritten as:

$$\mathbf{\Gamma}_p^{(\nu)} \tilde{\mathbf{S}}_p^{(\nu)} \mathbf{u}_p^{(\nu)} = \lambda \mathbf{u}_p^{(\nu)}. \quad (23)$$

Since $\lambda$ is the criterion to be maximized, the maximization is achieved by setting $\mathbf{u}_p^{(\nu)}$ to the (unit) eigenvector of $\mathbf{\Gamma}_p^{(\nu)} \tilde{\mathbf{S}}_p^{(\nu)}$ associated with its largest eigenvalue. ∎

### Relaxed Start for Better Generalization

When we use SO-MPCA features for classification, we find that the performance is limited. Therefore, we further introduce a simple relaxed start (RS) strategy to obtain SO-MPCA-RS by fixing the first EMP $\{\mathbf{u}_1^{(n)}, n = 1, \ldots, N\}$ (the starting projection vectors) without variance maximization. In this paper, we set this starting EMP $\mathbf{u}_1^{(n)}$ (for $n = 1, \ldots, N$) to the normalized uniform vector $\mathbf{1}/\sqrt{I_n}$ for simplicity, where $\mathbf{1}$ is the all-ones vector of length $I_n$.

This idea is motivated by the theoretical studies in Chapter 4 of [Abu-Mostafa et al., 2012], which show that constraining a learning model can lead to better generalization. By fixing the first EMP to simple vectors, the following EMPs have less freedom due to the imposed semi-orthogonality, which increases the bias and reduces the variance of the learning model. Thus, the SO-MPCA-RS model has a smaller hypothesis set than the SO-MPCA model. The two algorithms differ only in how the first (starting) EMP is determined, though all the following EMPs will differ as well due to their dependency on the first EMP.

This relaxed start strategy is not specific to SO-MPCA but is generally applicable to any TVP-based subspace learning algorithm. We run controlled experiments in Sec. 4 to show that it improves the performance of not only SO-MPCA but also TROD and UMPCA.

Algorithm 1 summarizes the SO-MPCA-RS algorithm (Matlab code is available at http://www.comp.hkbu.edu.hk/~haiping/codedata.html). The SO-MPCA algorithm can be obtained from Algorithm 1 by removing line 3, changing $p = 2$ in line 4 to $p = 1$, and setting $\mathbf{\Gamma}_1^{(\nu)}$ ($p = 1$) in line 10 to an identity matrix.

**Algorithm 1** Semi-Orthogonal Multilinear PCA with Relaxed Start (SO-MPCA-RS)

```
 1: Input: a set of tensor samples {X_m ∈ R^{I_1 × ... × I_N}, m = 1, ..., M}
    and the maximum number of iterations K.
 2: Set ν = arg max_n I_n.
 3: Set the first EMP: u_1^(n) = 1/√I_n · 1 for n = 1, ..., N.
 4: for p = 2 to P do
 5:   Initialize u_p^(n) = 1/√I_n · 1 for n = 1, ..., N.
 6:   for k = 1 to K do
 7:     for n = 1 to N do
 8:       Calculate the partial multilinear projections {ỹ_mp^(n)} for
          m = 1, ..., M according to (11).
 9:       if n == ν then
10:         Calculate Γ_p^(ν) and S̃_p^(ν) according to (15) and (12), respectively.
            Then set u_p^(ν) to the eigenvector of Γ_p^(ν) S̃_p^(ν) associated
            with the largest eigenvalue.
11:       else
12:         Calculate S̃_p^(n) by (12). Set u_p^(n) to the eigenvector of S̃_p^(n)
            associated with the largest eigenvalue.
13:       end if
14:     end for
15:   end for
16: end for
17: Output: the TVP {u_p^(n), n = 1, ..., N}_{p=1}^P.
```
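Below is a compact NumPy sketch of Algorithm 1, intended as a reading aid rather than a substitute for the authors' released Matlab code. It uses zero-indexed modes, and for line 10 it takes the leading eigenvector of the symmetric matrix Γ S̃ Γ, which coincides with the leading eigenvector of Γ S̃ in (14) on the feasible subspace; all function names and the toy data are illustrative assumptions.

```python
import numpy as np

def partial_projection(X, emp, skip):
    """Project tensor X with all EMP vectors except mode `skip` (Eq. (11))."""
    y = X
    # Contract modes from the highest index down so remaining axis indices stay valid.
    for n in sorted(range(len(emp)), reverse=True):
        if n != skip:
            y = np.tensordot(y, emp[n], axes=([n], [0]))
    return y  # vector of length I_skip

def scatter_matrix(Y):
    """Total scatter matrix of the row vectors in Y (Eq. (12))."""
    Yc = Y - Y.mean(axis=0)
    return Yc.T @ Yc

def top_eigvec(S):
    """Unit eigenvector of a symmetric matrix associated with the largest eigenvalue."""
    _, V = np.linalg.eigh(S)
    return V[:, -1]

def so_mpca_rs(Xs, P, K=20):
    """SO-MPCA-RS sketch. Xs is an M x I_1 x ... x I_N array; returns P EMPs."""
    dims = Xs.shape[1:]
    N = len(dims)
    nu = int(np.argmax(dims))                       # line 2: mode with orthogonality
    U = [[np.ones(d) / np.sqrt(d) for d in dims]]   # line 3: fixed (relaxed) first EMP
    for p in range(1, P):                           # line 4
        emp = [np.ones(d) / np.sqrt(d) for d in dims]   # line 5: uniform initialization
        for _ in range(K):                          # line 6
            for n in range(N):                      # line 7
                Y = np.stack([partial_projection(X, emp, n) for X in Xs])  # line 8
                S = scatter_matrix(Y)
                if n == nu:                         # lines 9-10
                    Uq = np.stack([e[nu] for e in U], axis=1)   # previous nu-mode vectors
                    Gamma = np.eye(dims[nu]) - Uq @ Uq.T        # Eq. (15)
                    emp[n] = top_eigvec(Gamma @ S @ Gamma)      # symmetric form of Eq. (14)
                else:                               # lines 11-12
                    emp[n] = top_eigvec(S)
        U.append(emp)
    return U

# Toy usage: 30 random third-order samples of size 10 x 8 x 3, learning P = 5 EMPs.
Xs = np.random.rand(30, 10, 8, 3)
emps = so_mpca_rs(Xs, P=5, K=5)
print(len(emps), [u.shape for u in emps[0]])  # 5 [(10,), (8,), (3,)]
```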
## Experiments

This section evaluates the proposed methods on both second-order and third-order tensor data in terms of recognition rate, number of extracted features, captured variance, and convergence. In addition, we study the effectiveness of the relaxed start strategy on other TVP-based PCA algorithms.

**Data:** For second-order tensors, we use the same subset of the FERET database [Phillips et al., 2000] as in [Lu et al., 2009], with 721 face images from 70 subjects. Each face image is normalized to 80 × 60 gray-level pixels. For third-order tensors, we use a subset of the USF Human ID Gait Challenge database [Sarkar et al., 2005]. We use the same gallery set (731 samples from 71 subjects) and probe A (727 samples from 71 subjects) as in [Lu et al., 2009], and we also test probe B (423 samples from 41 subjects) and probe C (420 samples from 41 subjects). Each gait sample is a (binary) silhouette sequence of size 32 × 22 × 10. Both the face and gait data are downloaded from http://www.dsp.utoronto.ca/~haiping/MSL.html.

**Experiment setup:** In the face recognition experiments, we randomly select L = 1, 2, 3, 4, 5, 6, 7 samples from each subject as the training data and use the rest for testing. We repeat such random splits (repetitions) ten times and report the mean correct recognition rates. In the gait recognition experiments, we follow the standard setting and use the gallery set as the training data and probes A, B, and C as the test data (so there are no random splits/repetitions), and we report the rank 1 and rank 5 recognition rates [Sarkar et al., 2005].

**Algorithms and their settings:** We first evaluate SO-MPCA and SO-MPCA-RS against five existing PCA-based methods: PCA [Jolliffe, 2002], CSA [Xu et al., 2005], MPCA [Lu et al., 2008], TROD [Shashua and Levin, 2001], and UMPCA [Lu et al., 2009] (for second-order tensors, CSA and MPCA are equivalent to GLRAM [Ye, 2005] and GPCA [Ye et al., 2004], respectively). CSA and MPCA produce tensorial features, so their features need to be vectorized. MPCA uses the full projection. For TROD and UMPCA, we use the uniform initialization [Lu et al., 2009]. For SO-MPCA and SO-MPCA-RS, we set the selected mode ν = 1 for the maximum number of features. For iterative algorithms, we set the number of iterations to 20. All features are sorted according to their scatters (captured variance) in descending order for classification. We use the nearest neighbor classifier with the Euclidean distance measure to classify the top P features; a minimal sketch of this protocol is given after Table 1. We test up to P = 80 features in face recognition and up to P = 32 features in gait recognition. The performance of FO-MPCA is much worse than that of SO-MPCA, so it is not included in the comparisons (except in the variance study) to save space.

Table 1: Face recognition rates in percentage (mean ± std) by the nearest neighbor classifier on the FERET subset; '-' indicates that not enough features can be extracted.

| L | P | PCA | CSA | MPCA | TROD | UMPCA | SO-MPCA | SO-MPCA-RS | TROD-RS | UMPCA-RS |
|---|---|-----|-----|------|------|-------|---------|------------|---------|----------|
| 1 | 1 | 2.60 ± 0.66 | 3.87 ± 1.02 | 2.52 ± 0.76 | 2.70 ± 0.44 | 5.98 ± 2.65 | 2.73 ± 0.69 | 6.85 ± 1.44 | 2.63 ± 0.82 | 6.04 ± 2.00 |
| 1 | 5 | 15.12 ± 1.31 | 11.90 ± 1.25 | 16.65 ± 1.82 | 15.88 ± 1.20 | 23.23 ± 4.49 | 20.06 ± 2.34 | 27.27 ± 2.36 | 16.68 ± 1.50 | 24.78 ± 4.76 |
| 1 | 10 | 22.69 ± 2.21 | 21.35 ± 2.76 | 22.24 ± 1.94 | 21.52 ± 2.83 | 31.83 ± 5.17 | 28.77 ± 2.72 | 36.34 ± 3.56 | 21.86 ± 3.03 | 33.16 ± 5.36 |
| 1 | 20 | 27.62 ± 2.57 | 26.16 ± 2.38 | 27.16 ± 1.47 | 26.30 ± 2.49 | 35.94 ± 5.65 | 31.94 ± 2.95 | 40.32 ± 3.40 | 26.51 ± 1.97 | 36.65 ± 5.46 |
| 1 | 50 | 31.38 ± 2.58 | 31.37 ± 1.92 | 31.29 ± 1.71 | 29.63 ± 2.21 | 36.14 ± 5.73 | 32.33 ± 2.78 | 40.48 ± 3.09 | 29.80 ± 1.48 | 37.05 ± 5.46 |
| 1 | 80 | - | 31.95 ± 1.84 | 32.17 ± 2.09 | 31.14 ± 2.42 | - | 32.26 ± 2.71 | 40.41 ± 3.09 | 31.21 ± 1.94 | - |
| 2 | 1 | 2.69 ± 0.73 | 3.36 ± 0.54 | 2.63 ± 0.55 | 2.65 ± 0.77 | 7.28 ± 2.44 | 2.69 ± 0.46 | 7.97 ± 1.10 | 2.81 ± 0.79 | 6.82 ± 1.56 |
| 2 | 5 | 20.17 ± 1.25 | 15.15 ± 1.03 | 21.53 ± 0.90 | 21.34 ± 1.56 | 26.90 ± 5.23 | 24.34 ± 1.59 | 33.82 ± 1.45 | 21.62 ± 1.93 | 29.19 ± 3.55 |
| 2 | 10 | 32.03 ± 2.49 | 29.45 ± 1.95 | 28.04 ± 1.69 | 30.05 ± 1.70 | 40.17 ± 6.76 | 36.94 ± 2.58 | 46.63 ± 1.95 | 30.83 ± 1.93 | 44.01 ± 3.22 |
| 2 | 20 | 39.07 ± 1.87 | 37.07 ± 2.27 | 38.86 ± 2.12 | 36.87 ± 2.37 | 44.51 ± 6.54 | 41.70 ± 2.48 | 52.19 ± 2.11 | 37.64 ± 2.19 | 48.67 ± 3.57 |
| 2 | 50 | 43.86 ± 2.53 | 44.61 ± 2.34 | 44.54 ± 2.74 | 42.67 ± 2.16 | 45.47 ± 6.65 | 42.24 ± 2.39 | 52.22 ± 1.73 | 42.99 ± 2.28 | 49.07 ± 3.61 |
| 2 | 80 | 45.28 ± 2.39 | 45.82 ± 2.76 | 46.02 ± 2.67 | 44.46 ± 2.53 | - | 42.20 ± 2.39 | 52.19 ± 1.80 | 44.78 ± 2.48 | - |
| 3 | 1 | 2.72 ± 0.45 | 4.07 ± 0.80 | 2.25 ± 0.44 | 2.95 ± 0.61 | 7.42 ± 1.17 | 2.56 ± 0.56 | 7.55 ± 1.17 | 2.74 ± 0.81 | 7.34 ± 1.08 |
| 3 | 5 | 23.89 ± 1.64 | 16.58 ± 0.95 | 25.95 ± 1.26 | 24.48 ± 1.79 | 33.86 ± 3.65 | 28.30 ± 2.05 | 36.93 ± 1.76 | 26.14 ± 2.17 | 32.68 ± 4.67 |
| 3 | 10 | 37.20 ± 1.91 | 36.05 ± 1.50 | 34.91 ± 2.40 | 34.83 ± 2.96 | 49.39 ± 2.83 | 43.31 ± 1.51 | 54.38 ± 3.09 | 34.64 ± 3.01 | 50.55 ± 4.81 |
| 3 | 20 | 46.05 ± 2.11 | 43.87 ± 1.98 | 45.48 ± 2.46 | 43.37 ± 2.51 | 55.83 ± 3.32 | 49.49 ± 2.16 | 61.25 ± 2.85 | 42.99 ± 2.65 | 55.89 ± 4.16 |
| 3 | 50 | 51.35 ± 2.53 | 51.60 ± 2.50 | 52.00 ± 2.70 | 48.92 ± 2.69 | 56.42 ± 3.11 | 49.80 ± 2.29 | 61.08 ± 2.83 | 49.47 ± 2.76 | 56.38 ± 4.62 |
| 3 | 80 | 52.66 ± 2.67 | 52.84 ± 2.70 | 53.31 ± 2.59 | 51.17 ± 2.65 | - | 49.77 ± 2.36 | 60.98 ± 2.83 | 51.45 ± 2.63 | - |
| 4 | 1 | 2.68 ± 0.85 | 3.92 ± 1.04 | 2.22 ± 0.62 | 3.11 ± 0.83 | 7.96 ± 2.15 | 3.13 ± 0.90 | 8.34 ± 1.37 | 3.08 ± 0.66 | 7.03 ± 1.68 |
| 4 | 5 | 25.26 ± 1.77 | 18.93 ± 1.29 | 28.71 ± 1.91 | 27.37 ± 2.34 | 37.66 ± 4.88 | 29.61 ± 2.16 | 40.25 ± 1.52 | 28.25 ± 2.82 | 38.84 ± 2.97 |
| 4 | 10 | 41.54 ± 2.02 | 40.39 ± 2.36 | 39.43 ± 2.05 | 38.82 ± 3.91 | 55.10 ± 4.55 | 47.10 ± 2.88 | 59.30 ± 2.49 | 39.25 ± 2.95 | 57.30 ± 4.86 |
| 4 | 20 | 49.34 ± 1.69 | 49.39 ± 2.35 | 50.18 ± 3.03 | 47.57 ± 2.70 | 62.40 ± 4.49 | 53.85 ± 2.89 | 66.60 ± 3.07 | 47.73 ± 2.96 | 64.08 ± 4.94 |
| 4 | 50 | 56.85 ± 2.09 | 57.51 ± 2.98 | 57.48 ± 2.72 | 54.58 ± 2.56 | 63.13 ± 4.15 | 54.56 ± 3.14 | 67.03 ± 2.86 | 54.42 ± 2.89 | 64.85 ± 5.01 |
| 4 | 80 | 58.16 ± 2.46 | 59.05 ± 2.65 | 58.91 ± 2.50 | 57.30 ± 2.46 | - | 54.47 ± 3.17 | 66.89 ± 2.90 | 57.48 ± 2.46 | - |
| 5 | 1 | 2.91 ± 0.91 | 4.37 ± 1.06 | 2.72 ± 0.88 | 2.59 ± 0.64 | 7.41 ± 2.14 | 3.07 ± 0.60 | 8.38 ± 0.97 | 2.96 ± 0.75 | 7.20 ± 1.35 |
| 5 | 5 | 28.95 ± 2.07 | 20.75 ± 1.77 | 32.99 ± 2.47 | 31.75 ± 2.79 | 40.78 ± 5.82 | 34.20 ± 2.67 | 42.35 ± 3.04 | 33.45 ± 1.41 | 41.67 ± 1.95 |
| 5 | 10 | 47.06 ± 1.54 | 45.77 ± 2.17 | 43.29 ± 3.07 | 43.80 ± 3.51 | 60.49 ± 6.37 | 53.45 ± 2.75 | 63.23 ± 3.37 | 44.69 ± 2.71 | 62.88 ± 2.59 |
| 5 | 20 | 55.66 ± 1.94 | 56.01 ± 2.19 | 56.79 ± 2.14 | 54.47 ± 1.66 | 66.90 ± 6.23 | 61.40 ± 2.43 | 69.97 ± 2.55 | 54.64 ± 1.96 | 70.19 ± 3.25 |
| 5 | 50 | 63.91 ± 1.71 | 64.58 ± 2.13 | 64.37 ± 2.27 | 61.54 ± 2.75 | 67.71 ± 6.31 | 62.26 ± 2.88 | 70.70 ± 2.47 | 61.51 ± 1.92 | 70.81 ± 3.08 |
| 5 | 80 | 64.61 ± 1.67 | 65.58 ± 1.92 | 65.85 ± 2.02 | 64.02 ± 2.40 | - | 62.18 ± 2.83 | 70.70 ± 2.45 | 64.02 ± 2.03 | - |
| 6 | 1 | 2.86 ± 0.89 | 3.89 ± 0.67 | 2.49 ± 1.01 | 2.86 ± 1.01 | 9.07 ± 0.83 | 2.56 ± 0.74 | 8.97 ± 0.97 | 2.56 ± 0.99 | 7.21 ± 1.35 |
| 6 | 5 | 30.30 ± 2.17 | 21.89 ± 2.04 | 33.42 ± 2.52 | 33.59 ± 2.63 | 42.52 ± 4.99 | 35.18 ± 1.32 | 43.32 ± 1.82 | 35.18 ± 2.52 | 44.65 ± 4.14 |
| 6 | 10 | 48.97 ± 2.96 | 49.14 ± 2.57 | 45.65 ± 2.85 | 45.88 ± 2.97 | 63.16 ± 5.32 | 56.15 ± 2.36 | 65.75 ± 2.76 | 47.24 ± 2.55 | 66.51 ± 3.10 |
| 6 | 20 | 58.57 ± 2.84 | 58.97 ± 2.60 | 59.73 ± 2.96 | 57.11 ± 3.22 | 70.73 ± 5.39 | 64.72 ± 3.11 | 74.39 ± 2.79 | 57.81 ± 2.18 | 74.52 ± 3.10 |
| 6 | 50 | 66.88 ± 2.31 | 67.84 ± 2.48 | 67.84 ± 2.63 | 64.05 ± 2.95 | 72.09 ± 5.18 | 65.32 ± 2.86 | 74.75 ± 2.60 | 65.12 ± 2.92 | 75.12 ± 2.79 |
| 6 | 80 | 68.31 ± 2.33 | 69.44 ± 2.49 | 69.70 ± 2.35 | 66.78 ± 2.94 | - | 65.08 ± 2.73 | 74.72 ± 2.56 | 68.01 ± 2.90 | - |
| 7 | 1 | 2.68 ± 1.19 | 4.55 ± 1.07 | 2.12 ± 1.07 | 2.81 ± 0.82 | 11.39 ± 1.85 | 2.38 ± 1.33 | 10.91 ± 1.37 | 1.99 ± 1.19 | 8.57 ± 1.68 |
| 7 | 5 | 29.52 ± 1.38 | 22.51 ± 1.29 | 34.72 ± 2.86 | 32.03 ± 2.46 | 44.98 ± 5.32 | 35.58 ± 2.04 | 45.89 ± 2.34 | 35.02 ± 1.82 | 45.93 ± 1.93 |
| 7 | 10 | 51.21 ± 2.11 | 49.39 ± 3.18 | 46.19 ± 2.43 | 46.10 ± 2.60 | 65.67 ± 5.82 | 56.84 ± 2.07 | 67.53 ± 2.34 | 48.44 ± 3.20 | 66.93 ± 2.01 |
| 7 | 20 | 59.57 ± 2.64 | 60.91 ± 2.75 | 61.69 ± 2.57 | 57.58 ± 2.75 | 73.16 ± 4.28 | 65.37 ± 2.24 | 74.89 ± 1.97 | 58.31 ± 2.62 | 75.80 ± 2.24 |
| 7 | 50 | 68.10 ± 2.21 | 69.35 ± 1.89 | 69.26 ± 2.22 | 65.54 ± 2.79 | 74.11 ± 4.52 | 65.37 ± 2.02 | 75.24 ± 2.12 | 66.06 ± 3.10 | 76.88 ± 1.72 |
| 7 | 80 | 69.70 ± 2.84 | 70.39 ± 1.76 | 70.65 ± 1.97 | 67.97 ± 2.45 | - | 65.37 ± 2.02 | 75.19 ± 2.18 | 68.14 ± 2.96 | - |
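As a concrete illustration of the classification protocol above (features sorted by captured scatter, then nearest neighbor matching on the top P features), here is a small NumPy sketch; the helper names and the synthetic data are illustrative assumptions, not part of the experimental code.

```python
import numpy as np

def sort_by_scatter(Y_train):
    """Indices of features sorted by captured variance (total scatter), descending."""
    scatters = ((Y_train - Y_train.mean(axis=0)) ** 2).sum(axis=0)
    return np.argsort(scatters)[::-1]

def nn_classify(Y_train, labels_train, Y_test, order, P):
    """Nearest neighbor classification (Euclidean distance) on the top-P features."""
    A, B = Y_train[:, order[:P]], Y_test[:, order[:P]]
    d2 = ((B[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)  # squared distances
    return labels_train[np.argmin(d2, axis=1)]

# Toy usage: 80-dimensional TVP features for 70 subjects, one training sample each.
rng = np.random.default_rng(0)
Y_tr, y_tr = rng.normal(size=(70, 80)), np.arange(70)
Y_te = Y_tr + 0.1 * rng.normal(size=Y_tr.shape)  # noisy "test" copies
order = sort_by_scatter(Y_tr)
print((nn_classify(Y_tr, y_tr, Y_te, order, P=20) == y_tr).mean())
```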
**Face recognition results:** Table 1 shows the face recognition results for P = 1, 5, 10, 20, 50, 80 and L = 1, 2, 3, 4, 5, 6, 7, including both the mean and the standard deviation (std) over ten repetitions. Only SO-MPCA-RS consistently achieves a top-two result in all cases. Compared with the existing methods (PCA, CSA, MPCA, TROD and UMPCA), SO-MPCA-RS outperforms the best performing existing algorithm (UMPCA) by 3.79% on average. Furthermore, for larger L = 5, 6, 7, SO-MPCA-RS outperforms the other five methods by at least 2.26% on average. For smaller L = 1, 2, 3, SO-MPCA-RS achieves a greater improvement of at least 5.28% over the existing methods, indicating that SO-MPCA-RS is superior in dealing with the small sample size (overfitting) problem.

**Gait recognition results:** Similarly, the gait recognition results are reported in Table 2. Again, only SO-MPCA-RS consistently achieves a top-two result in all cases. In rank 1 rate, the best performing existing algorithm is PCA, which outperforms SO-MPCA-RS by 3.73% on average. In rank 5 rate, however, SO-MPCA-RS outperforms the best performing existing algorithm (still PCA) by 6.76% on average.

**Number of features:** In the tables, '-' indicates that not enough features can be extracted. For PCA, there are at most 69 features for the face data when L = 1, since there are only 70 samples for training. UMPCA can extract only 60 or 10 features for the face and gait data, respectively. In contrast, SO-MPCA and SO-MPCA-RS (with ν = 1) can learn 80 features for the face data and 32 features for the gait data.

Table 2: Rank 1 and rank 5 gait recognition rates in percentage by the nearest neighbor classifier on the USF subset; '-' indicates that not enough features can be extracted.

| Rank | Probe | P | PCA | CSA | MPCA | TROD | UMPCA | SO-MPCA | SO-MPCA-RS | TROD-RS | UMPCA-RS |
|------|-------|---|-----|-----|------|------|-------|---------|------------|---------|----------|
| 1 | A | 5 | 30.99 | 22.54 | 32.39 | 28.17 | 39.44 | 30.99 | 40.85 | 18.31 | 39.44 |
| 1 | A | 10 | 52.11 | 43.66 | 49.30 | 42.25 | 57.75 | 49.30 | 59.15 | 33.80 | 63.38 |
| 1 | A | 20 | 67.61 | 57.75 | 60.56 | 53.52 | - | 54.93 | 67.61 | 53.52 | - |
| 1 | A | 32 | 71.83 | 59.15 | 61.97 | 60.56 | - | 54.93 | 69.01 | 64.79 | - |
| 1 | B | 5 | 26.83 | 17.07 | 24.39 | 19.51 | 26.83 | 29.27 | 41.46 | 19.51 | 39.02 |
| 1 | B | 10 | 48.78 | 39.02 | 46.34 | 39.02 | 46.34 | 53.66 | 63.41 | 36.59 | 51.22 |
| 1 | B | 20 | 65.85 | 53.66 | 58.54 | 53.66 | - | 58.54 | 65.85 | 43.90 | - |
| 1 | B | 32 | 68.29 | 60.98 | 58.54 | 65.85 | - | 60.98 | 68.29 | 63.41 | - |
| 1 | C | 5 | 12.20 | 9.76 | 14.63 | 4.88 | 24.39 | 12.20 | 14.63 | 7.32 | 14.63 |
| 1 | C | 10 | 29.27 | 14.63 | 19.51 | 14.63 | 29.27 | 29.27 | 29.27 | 19.51 | 21.95 |
| 1 | C | 20 | 34.15 | 31.71 | 29.27 | 21.95 | - | 31.71 | 34.15 | 24.39 | - |
| 1 | C | 32 | 46.34 | 34.15 | 29.27 | 31.71 | - | 31.71 | 39.02 | 39.02 | - |
| 5 | A | 5 | 57.75 | 56.34 | 66.20 | 54.93 | 73.24 | 67.61 | 84.51 | 42.25 | 73.24 |
| 5 | A | 10 | 80.28 | 74.65 | 77.46 | 77.46 | 83.10 | 77.46 | 88.73 | 59.15 | 85.92 |
| 5 | A | 20 | 87.32 | 80.28 | 81.69 | 76.06 | - | 80.28 | 92.96 | 77.46 | - |
| 5 | A | 32 | 87.32 | 81.69 | 83.10 | 78.87 | - | 80.28 | 92.96 | 81.69 | - |
| 5 | B | 5 | 48.78 | 48.78 | 53.66 | 48.78 | 58.54 | 63.41 | 68.29 | 53.66 | 58.54 |
| 5 | B | 10 | 73.17 | 70.73 | 70.73 | 60.98 | 65.85 | 70.73 | 75.61 | 75.61 | 68.29 |
| 5 | B | 20 | 78.05 | 75.61 | 78.05 | 75.61 | - | 70.73 | 80.49 | 78.05 | - |
| 5 | B | 32 | 78.05 | 73.17 | 78.05 | 80.49 | - | 70.73 | 80.49 | 80.49 | - |
| 5 | C | 5 | 51.22 | 34.15 | 41.46 | 36.59 | 41.46 | 43.90 | 56.10 | 29.27 | 36.59 |
| 5 | C | 10 | 53.66 | 46.34 | 43.90 | 48.78 | 43.90 | 48.78 | 70.73 | 43.90 | 56.10 |
| 5 | C | 20 | 65.85 | 56.10 | 60.98 | 48.78 | - | 48.78 | 78.05 | 46.34 | - |
| 5 | C | 32 | 65.85 | 60.98 | 58.54 | 56.10 | - | 48.78 | 78.05 | 56.10 | - |

Figure 1: The captured variance on the face data with L = 1 for PCA, UMPCA, FO-MPCA, SO-MPCA, and SO-MPCA-RS (variance captured versus feature index): (a) sorted variance; (b) unsorted variance.

Figure 2: Convergence of the SO-MPCA-RS algorithm on the face data with L = 1 (variance captured versus number of iterations; panel (b) shows p = 5).
**Feature variance:** We illustrate the variance captured by PCA, UMPCA, FO-MPCA, SO-MPCA, and SO-MPCA-RS in Fig. 1 for the face data with L = 1 (not all methods are shown, for clarity). Figure 1(a) shows the sorted variance. It is clear that semi-orthogonality captures more variance than full-orthogonality, as discussed in Sec. 3.1. Moreover, both SO-MPCA and SO-MPCA-RS capture more variance than UMPCA, but less than PCA (and also CSA, MPCA, and TROD, which are not shown). Though capturing less variance, SO-MPCA-RS achieves better overall classification performance than the other PCA-based methods, with results consistently in the top two in all experiments. We also show the unsorted captured variance in Fig. 1(b). The variance captured by the first (fixed) EMP of SO-MPCA-RS is much less than that of the other EMPs, which is not surprising since it is not obtained by variance maximization.

**Convergence:** We demonstrate the convergence of SO-MPCA-RS in Fig. 2 for the face data with L = 1. SO-MPCA-RS converges in just a few iterations; SO-MPCA has a similar convergence rate.

**Effectiveness of relaxed start:** To evaluate the proposed relaxed start strategy, we apply it to two other TVP-based methods, TROD and UMPCA, obtaining TROD-RS and UMPCA-RS, respectively. Their performance is summarized in the last two columns of Tables 1 and 2 for the face and gait recognition experiments, respectively. Both tables show that relaxed start helps both TROD and UMPCA achieve better recognition rates. From Table 1, TROD-RS improves over TROD by 0.32% and UMPCA-RS improves over UMPCA by 2.15% on average. From Table 2, TROD-RS achieves a 3.03% improvement over TROD and UMPCA-RS achieves a 1.07% improvement over UMPCA in rank 1 rate; in rank 5 rate, TROD-RS improves over TROD by 0.94% and UMPCA-RS improves over UMPCA by 5.28%.

The relaxed start is most effective for our SO-MPCA. SO-MPCA-RS improves over SO-MPCA by 9.97% on the face data. On the gait data, SO-MPCA-RS outperforms SO-MPCA by 9.56% in rank 1 rate and 17.26% in rank 5 rate on average. In addition, SO-MPCA-RS has better face recognition performance than TROD-RS and UMPCA-RS, with improvements of 8.08% and 1.64%, respectively. On the gait data, SO-MPCA-RS improves the rank 1 recognition rate by 3.03% over TROD-RS and 13.25% over UMPCA-RS on average, and improves the rank 5 recognition rate by 11.07% and 13.73% over TROD-RS and UMPCA-RS, respectively. These controlled experiments show the effectiveness of relaxed start on SO-MPCA and other TVP-based multilinear PCA methods (UMPCA and TROD). One possible explanation is that RS increases the bias and reduces the variance of the learning model, though further investigation is needed.

## Conclusion

This paper proposes a novel multilinear PCA algorithm under the TVP setting, named semi-orthogonal multilinear PCA with relaxed start (SO-MPCA-RS). The proposed SO-MPCA approach learns features directly from tensors via TVP to maximize the captured variance with the orthogonality constraint imposed in only one mode. This semi-orthogonality can capture more variance and learn more features than full-orthogonality. Furthermore, the introduced relaxed start strategy can achieve better generalization by fixing the starting projection vectors to uniform vectors, which increases the bias and reduces the variance of the learning model. Experiments on face (2D data) and gait (3D data) recognition show that SO-MPCA-RS achieves the best overall performance compared with the competing algorithms. In addition, relaxed start is also effective for other TVP-based PCA methods.
In this paper, we studied semi-orthogonality in only one mode. A possible future direction is to learn SO-MPCA-RS features from each mode separately and then perform a feature-level or score-level fusion.

## Acknowledgments

This research was supported by the Research Grants Council of the Hong Kong Special Administrative Region (Grant 22200014) and the Hong Kong PhD Fellowship Scheme.

## References

[Abu-Mostafa et al., 2012] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin. Learning from Data, chapter 4, pages 119–166. AMLBook, 2012.

[Anaraki and Hughes, 2014] F. P. Anaraki and S. Hughes. Memory and computation efficient PCA via very sparse random projections. In Proc. 31st Int. Conf. on Machine Learning, pages 1341–1349, 2014.

[Deng et al., 2014] W. Deng, J. Hu, J. Lu, and J. Guo. Transform-invariant PCA: A unified approach to fully automatic face alignment, representation, and recognition. IEEE Trans. Pattern Anal. Mach. Intell., 36(6):1275–1284, 2014.

[Edelman et al., 1998] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20(2):303–353, 1998.

[Faloutsos et al., 2007] C. Faloutsos, T. G. Kolda, and J. Sun. Mining large time-evolving data using matrix and tensor tools. In Int. Conf. on Data Mining Tutorial, 2007.

[Gao et al., 2013] Q. Gao, J. Ma, H. Zhang, X. Gao, and Y. Liu. Stable orthogonal local discriminant embedding for linear dimensionality reduction. IEEE Trans. Image Processing, 22(7):2521–2531, 2013.

[Harshman, 1970] R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an explanatory multi-modal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.

[Hua et al., 2007] G. Hua, P. A. Viola, and S. M. Drucker. Face recognition using discriminatively trained orthogonal rank one tensor projections. In Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition, pages 1–8, 2007.

[Jolliffe, 2002] I. T. Jolliffe. Principal Component Analysis. Springer Series in Statistics, second edition, 2002.

[Ke and Sukthankar, 2004] Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition, volume II, pages 506–513, 2004.

[Kokiopoulou and Saad, 2007] E. Kokiopoulou and Y. Saad. Orthogonal neighborhood preserving projections: A projection-based dimensionality reduction technique. IEEE Trans. Pattern Anal. Mach. Intell., 29(12):2143–2156, 2007.

[Kolda and Bader, 2009] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Rev., 51(3):455–500, 2009.

[Kolda, 2001] T. G. Kolda. Orthogonal tensor decompositions. SIAM J. Matrix Anal. Appl., 23(1):243–255, 2001.

[Kusner et al., 2014] M. Kusner, S. Tyree, K. Q. Weinberger, and K. Agrawal. Stochastic neighbor compression. In Proc. 31st Int. Conf. on Machine Learning, pages 622–630, 2014.

[Lathauwer et al., 2000] L. De Lathauwer, B. De Moor, and J. Vandewalle. On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl., 21(4):1324–1342, 2000.

[Lu et al., 2008] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos. MPCA: Multilinear principal component analysis of tensor objects. IEEE Trans. Neural Networks, 19(1):18–39, 2008.

[Lu et al., 2009] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos. Uncorrelated multilinear principal component analysis for unsupervised multilinear subspace learning. IEEE Trans. Neural Networks, 20(11):1820–1836, 2009.
[Lu et al., 2013] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos. Multilinear Subspace Learning: Dimensionality Reduction of Multidimensional Data. CRC Press, 2013.

[Phillips et al., 2000] P. J. Phillips, H. Moon, S. A. Rizvi, and P. Rauss. The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. Pattern Anal. Mach. Intell., 22(10):1090–1104, 2000.

[Sarkar et al., 2005] S. Sarkar, P. J. Phillips, Z. Liu, I. R. Vega, P. Grother, and K. W. Bowyer. The human ID gait challenge problem: Data sets, performance, and analysis. IEEE Trans. Pattern Anal. Mach. Intell., 27(2):162–177, 2005.

[Shashua and Levin, 2001] A. Shashua and A. Levin. Linear image coding for regression and classification using the tensor-rank principle. In Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition, volume I, pages 42–49, 2001.

[Wang et al., 2015] L. Wang, M. T. Chu, and B. Yu. Orthogonal low rank tensor approximation: Alternating least squares method and its global convergence. SIAM J. Matrix Anal. Appl., 36(1):1–19, 2015.

[Xu et al., 2005] D. Xu, S. Yan, L. Zhang, H.-J. Zhang, Z. Liu, and H.-Y. Shum. Concurrent subspaces analysis. In Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition, volume II, pages 203–208, 2005.

[Yang et al., 2004] J. Yang, D. Zhang, A. F. Frangi, and J. Yang. Two-dimensional PCA: A new approach to appearance-based face representation and recognition. IEEE Trans. Pattern Anal. Mach. Intell., 26(1):131–137, 2004.

[Ye et al., 2004] J. Ye, R. Janardan, and Q. Li. GPCA: An efficient dimension reduction scheme for image compression and retrieval. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 354–363, 2004.

[Ye, 2005] J. Ye. Generalized low rank approximations of matrices. Machine Learning, 61(1-3):167–191, 2005.