# Online Robust Low-Rank Tensor Learning

Ping Li1,2, Jiashi Feng2, Xiaojie Jin2, Luming Zhang3, Xianghua Xu1, Shuicheng Yan4,2
1School of Computer Science and Technology, Hangzhou Dianzi University
2Department of Electrical and Computer Engineering, National University of Singapore
3Department of Computer and Information, Hefei University of Technology
4Qihoo 360 Artificial Intelligence Institute
patriclouis.lee@gmail.com, elefjia@nus.edu.sg, xiaojie.jin@u.nus.edu, zglumg@gmail.com, xhxu@hdu.edu.cn, eleyans@nus.edu.sg

Abstract

The rapid increase of multidimensional data (a.k.a. tensors) such as videos brings new challenges for low-rank data modeling approaches: dynamic data size, complex high-order relations, and multiplicity of low-rank structures. Resolving these challenges requires a tensor analysis method that can perform tensor data analysis online, which is still absent. In this paper, we propose an Online Robust Low-rank Tensor Modeling (ORLTM) approach to address these challenges. ORLTM dynamically explores the high-order correlations across all tensor modes for low-rank structure modeling. To analyze mixture data from multiple subspaces, ORLTM introduces a new dictionary learning component. ORLTM processes data in a streaming fashion and thus has a low memory cost that is independent of the data size, which makes it well suited to large-scale tensor data. Empirical studies validate the effectiveness of the proposed method on both synthetic data and one practical task, i.e., video background subtraction. In addition, we provide theoretical analysis of the computational complexity and memory cost, rigorously demonstrating the efficiency of ORLTM.

1 Introduction

The explosion of multidimensional data raises significant demand for efficient data analysis techniques. In recent years, low-rank tensor modeling methods have received increasing attention [Goldfarb and Qin, 2014; Lu et al., 2016] as they can discover the intrinsic structure of tensors and provide more transparent and consolidated knowledge of the data. However, much realistic multidimensional data (e.g., surveillance video) is generated in a streaming way from dynamic environments, which poses the following new challenges: (1) how to develop scalable methods that can process large-scale dynamic tensors efficiently; (2) how to deal with data contamination such as noise or malicious outliers in the modeling process.

To handle noisy tensors, two batch robust tensor methods were developed: High-Order Robust Principal Component Analysis (HORPCA) [Goldfarb and Qin, 2014] and Tensor RPCA [Lu et al., 2016]. Both generalize RPCA [Candès et al., 2011] from matrices to tensors, i.e., they solve the tensor RPCA problem

$$\min_{\mathcal{Z},\mathcal{E}} \|\mathcal{Z}\|_* + \lambda \|\mathcal{E}\|_1, \quad \text{s.t. } \mathcal{X} = \mathcal{Z} + \mathcal{E},$$

to learn the inherent low-rank structure. Although they can handle noisy data, these methods are not scalable to large-scale data: their memory cost grows rapidly with the data size, and they always seek a static low-rank component that cannot model the dynamics of streaming data, thus providing inferior performance. A proper method for analyzing large-scale noisy tensor data is still absent. In this work, we propose an Online Robust Low-rank Tensor Modeling (ORLTM) method for learning low-rank structures from streaming noisy tensor data.
Different from batch approaches, ORLTM factorizes the low-rank component in each mode m into two factors, i.e., $Z_m = W_m R_m$, where $W_m$ models the low-rank subspace basis and $R_m$ collects the representation coefficients of the tensor data w.r.t. $W_m$. Here, the basis and the data representations are decoupled: the basis is typically small and can be stored with low memory cost, while the data representations can be updated online. In this way, ORLTM reduces the memory cost dramatically without a performance drop.

Existing tensor RPCA approaches generally assume the data reside in a single low-rank subspace. This may not hold for many realistic tensor data with complex inherent structures, e.g., data drawn from a union of multiple subspaces. Previous methods, as well as the vanilla ORLTM, are less competent in such circumstances. Hence, we further introduce a dictionary learning component into ORLTM. The dictionary provides a more flexible set of bases to represent the low-rank component of complex tensors. Instead of seeking a single set of bases, by learning proper dictionaries ORLTM can capture sophisticated low-rank structures (e.g., a mixture of multiple low-rank components) for better tensor modeling.

ORLTM has wide applications in large-scale and dynamic tensor data analysis. In this paper, we empirically examine its performance on both synthetic data and video background subtraction; the results justify the effectiveness of ORLTM.

2 Related Work

Low-rank models are useful for robust data recovery. A typical one is Robust Principal Component Analysis (RPCA) [Candès et al., 2011], which decomposes a given matrix into a low-rank component and a sparse component. Another is Low-Rank Representation (LRR) [Liu et al., 2013], which considers data drawn from a union of multiple subspaces. Many variants have since emerged, e.g., [Liu and Yan, 2011] employed both observed and hidden data to enhance LRR, and [Yin et al., 2015] developed a dual graph regularized LRR. However, these methods require tremendous memory to store all the samples. Hence, several online learning methods were developed: [Feng et al., 2013] proposed Online Robust PCA (ORPCA) via stochastic optimization; [Zhan et al., 2016] studied online RPCA based on the recursive projected compressive sensing framework; [Shen et al., 2016] designed an online LRR implementation with convergence guarantees. Nevertheless, they fail to exploit low-dimensional tensor structures.

Robust tensor models can utilize information across all the modes. Two principled forms are the CANDECOMP/PARAFAC (CP) decomposition and the Tucker decomposition [Sun et al., 2008]. For example, [Zhou et al., 2016] proposed an accelerated CP method for dynamic tensors. Based on the latter, Higher-Order RPCA (HORPCA) was presented in [Goldfarb and Qin, 2014]. Moreover, [Zhang et al., 2013] proposed a low-rank third-order tensor recovery method for rectification and alignment, while [Lu et al., 2016] proposed a tensor RPCA (TRPCA) method for image denoising using the framework in [Kilmer and Martin, 2011]. Nonetheless, these methods need large memory for batch optimization and cannot handle samples sequentially.
In addition, few works concern robust online tensor analysis; the only one we are aware of is Online Stochastic Tensor Decomposition (OSTD) [Sobral et al., 2015], which neglects data with a mixture structure of multiple subspaces.

3 Notations and Preliminaries

Tensors are denoted by calligraphic letters, e.g., $\mathcal{A}$; matrices by boldface uppercase letters, e.g., A; vectors by boldface lowercase letters, e.g., a. The order (or number of modes) of a tensor equals its number of dimensions. For a third-order tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, its (i, j, k)-th entry is $x_{ijk}$, and X(i, j, :) is a fiber (a column vector); X(i, :, :) is the i-th horizontal slice, X(:, j, :) is the j-th lateral slice, and X(:, :, k) (or X_k) is the k-th frontal slice. For a matrix, A_ij denotes its (i, j)-th entry, and A_m without brackets denotes an ordinary matrix.

Tensor Unfolding [Kolda and Bader, 2009]. The mode-m unfolding of $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_M}$, denoted $A_{(m)}$, is obtained by arranging the mode-m fibers of $\mathcal{A}$ into a matrix, i.e., $\mathrm{unfold}_m(\mathcal{A}) = A_{(m)} \in \mathbb{R}^{n_{\setminus m} \times n_m}$, where $n_{\setminus m} = \prod_{j=1, j\neq m}^{M} n_j$ and $\prod_{j=1}^{M} n_j = n_1 n_2 \cdots n_M$.

Tensor Folding [Kolda and Bader, 2009]. Folding the mode-m unfolding of tensor $\mathcal{A}$ is defined as $\mathrm{fold}(A_{(m)}) = \mathcal{A}$, i.e., it returns the corresponding tensor from its mode-m unfolding.

Tensor Vectorization [Kolda and Bader, 2009]. The vectorization of $\mathcal{A}$, denoted $\mathrm{vec}(\mathcal{A})$, stacks the elements into a column vector. For the mode-m unfolding, $\mathrm{vec}(A_{(m)})$ is obtained by stacking the columns of the mode-m unfolding into one long column vector.

Mode-m Product [Goldfarb and Qin, 2014]. The mode-m product of tensor $\mathcal{A}$ and matrix $U_m \in \mathbb{R}^{n_m \times n_c}$ is $\mathcal{A} \times_m U_m$, defined by $(\mathcal{A} \times_m U_m)_{(m)} = A_{(m)} U_m$.

Tensor Norms [Goldfarb and Qin, 2014]. The ℓ1-norm is $\|\mathcal{A}\|_1 = \|\mathrm{vec}(\mathcal{A})\|_1 = \sum_{ijk} |a_{ijk}|$, and the Frobenius norm is $\|\mathcal{A}\|_F = \sqrt{\mathrm{vec}(\mathcal{A})^\top \mathrm{vec}(\mathcal{A})} = \sqrt{\sum_{ijk} a_{ijk}^2}$. These norms degenerate to the usual matrix or vector norms.

Tensor Rank [Goldfarb and Qin, 2014]. The mode-m rank of a tensor $\mathcal{A}$, denoted $\mathrm{rank}_m(\mathcal{A})$, is the column rank of $A_{(m)}$, and the set of M mode-m ranks is called the Tucker rank. Since directly minimizing tensor rank is intractable [Hillar and Lim, 2013], one often uses its convex surrogate $\mathrm{CTrank}(\mathcal{A})$ in practice [Yu et al., 2015], which sums the M nuclear norms of the mode-m unfoldings, i.e., $\mathrm{CTrank}(\mathcal{A}) := \sum_{m=1}^{M} \|A_{(m)}\|_*$.
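As a concrete illustration of the unfolding and folding operators above, a minimal NumPy sketch follows. The shape (product of the remaining dimensions) x n_m matches the convention used in this paper; the exact ordering of the fibers inside the matrix is one self-consistent choice of ours, since the paper fixes only the shape, and the function names are illustrative.

```python
import numpy as np

def unfold(A, m):
    """Mode-m unfolding with shape (prod of other dims) x n_m.

    Rows enumerate the remaining modes; the row ordering is one
    self-consistent convention, not necessarily the authors'.
    """
    return np.moveaxis(A, m, -1).reshape(-1, A.shape[m])

def fold(Am, m, shape):
    """Inverse of unfold: rebuild the tensor of the given full shape."""
    moved = list(shape[:m]) + list(shape[m + 1:]) + [shape[m]]
    return np.moveaxis(Am.reshape(moved), -1, m)

# Round-trip check on a random third-order tensor.
X = np.random.randn(4, 5, 6)
for m in range(3):
    assert unfold(X, m).shape == (X.size // X.shape[m], X.shape[m])
    assert np.allclose(fold(unfold(X, m), m, X.shape), X)
```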
4 Online Robust Low-Rank Tensor Modeling

4.1 Formulation of ORLTM

We introduce the objective by first reviewing the batch High-Order RPCA (HORPCA) [Goldfarb and Qin, 2014]. This method factorizes the input tensor into a low-rank component plus a sparse noise component accounting for gross corruption or outliers, and its slice-wise model is formulated as

$$\min_{\mathcal{L},\mathcal{E}} \sum_{m=1}^{M} \|L_{(m)}\|_* + \lambda_1 \|\mathcal{E}\|_1, \quad \text{s.t. } \mathcal{X} = \mathcal{L} + \mathcal{E}, \qquad (1)$$

where $\lambda_1 > 0$, $\mathcal{L} = \mathrm{fold}(L_{(m)})$, $\sum_{m=1}^{M}\|L_{(m)}\|_* = \mathrm{CTrank}(\mathcal{L})$, and $\mathcal{X}, \mathcal{L}, \mathcal{E} \in \mathbb{R}^{n_1\times n_2\times\cdots\times n_M}$ are the observed noisy tensor, its low-rank component, and the sparse component, respectively. On different mode-m unfoldings $X_{(m)}$, problem (1) leads to different low-rank and sparse components. We therefore adopt the variable-splitting technique and introduce sets of auxiliary tensor variables $\{\mathcal{L}_m\}_{m=1}^{M}$ and $\{\mathcal{E}_m\}_{m=1}^{M}$ for $\mathcal{L}$ and $\mathcal{E}$, respectively. Problem (1) can then be reformulated as

$$\min_{\{\mathcal{L}_m,\mathcal{E}_m\}_{m=1}^{M}} \sum_{m=1}^{M} \|L_{(m)}\|_* + \lambda_1 \|E_{(m)}\|_1, \quad \text{s.t. } X_{(m)} = L_{(m)} + E_{(m)},\ \mathcal{L}_m = \mathrm{fold}(L_{(m)}),\ \mathcal{E}_m = \mathrm{fold}(E_{(m)}),\ m = 1,\dots,M, \qquad (2)$$

where $\mathcal{X} = \mathrm{fold}(X_{(m)})$, i.e., each unfolding $X_{(m)}$ returns the common tensor $\mathcal{X}$ through the fold(·) operator. The tensors $\mathcal{L}$ and $\mathcal{E}$ can be approximated by average pooling, i.e., $\mathcal{L} = \frac{1}{M}\sum_{m=1}^{M}\mathcal{L}_m$ and $\mathcal{E} = \frac{1}{M}\sum_{m=1}^{M}\mathcal{E}_m$.

However, the constraints in (2) force each unfolding matrix to reside in a single low-rank subspace and fail to cover the situation where the data have a mixture structure of several subspaces. This strict constraint hardly holds when data are drawn from a union of multiple subspaces, leading to degraded performance. To better discover the robust low-rank structure, we therefore introduce a dictionary $D_{(m)} \in \mathbb{R}^{n_{\setminus m} \times n_m}$. The low-rank property is preserved by enforcing a nuclear norm on $Z_m \in \mathbb{R}^{n_m \times n_m}$, i.e., $\|Z_m\|_*$. Since the inequality $\mathrm{rank}(D_{(m)} Z_m) \le \mathrm{rank}(Z_m)$ always holds, minimizing the nuclear norm of $Z_m$ bounds the rank of the clean data $L_{(m)}$ provided that $L_{(m)} = D_{(m)} Z_m$, hence a low-rank structure. We thus arrive at

$$\min_{\{Z_m, E_{(m)}\}_{m=1}^{M}} \sum_{m=1}^{M} \|Z_m\|_* + \lambda_1 \|E_{(m)}\|_1, \quad \text{s.t. } X_{(m)} = D_{(m)} Z_m + E_{(m)},\ m = 1,\dots,M, \qquad (3)$$

where the product of $D_{(m)}$ and $Z_m$ yields the unfolding $L_{(m)}$. To handle streaming data dynamically, we relax the constraints for online optimization by treating them as quadratic penalty terms, leading to

$$\min_{\{Z_m, E_{(m)}\}_{m=1}^{M}} \sum_{m=1}^{M} \frac{1}{2}\|X_{(m)} - D_{(m)} Z_m - E_{(m)}\|_F^2 + \lambda_1 \|E_{(m)}\|_1 + \lambda_2 \|Z_m\|_*, \qquad (4)$$

where $\lambda_2 > 0$ governs the role of the nuclear norm term. To process samples on the fly, we employ the bi-factor form $Z_m = W_m R_m$ [Srebro et al., 2004], where $W_m \in \mathbb{R}^{n_m \times p}$ and $R_m \in \mathbb{R}^{p \times n_m}$ with $p \ll \min(n_m, n_{\setminus m})$, so the rank of $Z_m$ is upper bounded by p. As shown in [Fazel et al., 2001; Recht et al., 2010], minimizing $\|Z_m\|_*$ is equivalent to minimizing $\frac{1}{2}(\|W_m\|_F^2 + \|R_m\|_F^2)$ over all factorizations $Z_m = W_m R_m$. The unconstrained problem (4) can therefore be reformulated as the non-convex problem

$$\min_{\{W_m, R_m, E_{(m)}\}_{m=1}^{M}} \sum_{m=1}^{M} \frac{1}{2}\|X_{(m)} - D_{(m)} W_m R_m - E_{(m)}\|_F^2 + \lambda_1 \|E_{(m)}\|_1 + \frac{\lambda_2}{2}\big(\|W_m\|_F^2 + \|R_m\|_F^2\big), \qquad (5)$$

which is the objective function of ORLTM; updating the entries of $Z_m$ amounts to sequentially updating the rows of $W_m$ and the columns of $R_m$.

The objective (5) reveals that the sizes of the unfolding matrices grow with $n_M$ and that the dictionary $D_{(m)}$ can only be partially accessed at each iteration. Moreover, the rows of $W_m$ are coupled together because $W_m$ is left-multiplied by the dictionary. To tackle these issues, we introduce another set of auxiliary variables $B_{(m)} = D_{(m)} W_m \in \mathbb{R}^{n_{\setminus m} \times p}$, m = 1, 2, ..., M, and approximate the recovered part $(X_{(m)} - E_{(m)})$ by $B_{(m)} R_m$. The $\{B_{(m)}\}_{m=1}^{M}$ can thus be treated as reinforced basis dictionaries, while $\{R_m\}_{m=1}^{M}$ are the coefficients, or low-dimensional representations. Compared with $D_{(m)}$, the dictionary $B_{(m)}$ is iteratively strengthened by respecting the two factored low-rank components of $Z_m$. Therefore, we obtain the equivalent reformulation

$$\min_{\{B_{(m)}, W_m, R_m, E_{(m)}\}_{m=1}^{M}} \sum_{m=1}^{M} \frac{1}{2}\|X_{(m)} - B_{(m)} R_m - E_{(m)}\|_F^2 + \lambda_1 \|E_{(m)}\|_1 + \frac{\lambda_2}{2}\big(\|W_m\|_F^2 + \|R_m\|_F^2\big) + \frac{\lambda_3}{2}\|B_{(m)} - D_{(m)} W_m\|_F^2, \qquad (6)$$

where $\lambda_3 > 0$ controls the reconstruction quality of the basis $B_{(m)}$. This formulation is not only more suitable for learning from streaming data but also more informative, since it explicitly models the bases of the union of multiple subspaces in all modes of the tensor data. It thus achieves more accurate tensor subspace recovery and better tensor low-rank modeling.
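The step from (4) to (5) rests on the variational identity $\|Z\|_* = \min_{Z = WR} \frac{1}{2}(\|W\|_F^2 + \|R\|_F^2)$ [Srebro et al., 2004; Recht et al., 2010], which is why (5) penalizes $\frac{\lambda_2}{2}(\|W_m\|_F^2 + \|R_m\|_F^2)$. The short NumPy check below is our own illustration (not from the paper): the balanced factorization built from the SVD attains the bound with equality, while any unbalanced refactorization of the same matrix only increases it.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
Z = rng.standard_normal((8, p)) @ rng.standard_normal((p, 10))   # rank-p matrix

nuclear = np.linalg.norm(Z, ord='nuc')

# Balanced factorization from the SVD attains the bound with equality.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
W = U[:, :p] * np.sqrt(s[:p])
R = np.sqrt(s[:p])[:, None] * Vt[:p, :]
assert np.allclose(W @ R, Z)
balanced = 0.5 * (np.linalg.norm(W, 'fro')**2 + np.linalg.norm(R, 'fro')**2)

# An unbalanced refactorization of the same Z can only increase the bound.
D = np.diag([4.0, 1.0, 0.25])
unbalanced = 0.5 * (np.linalg.norm(W @ D, 'fro')**2
                    + np.linalg.norm(np.linalg.inv(D) @ R, 'fro')**2)

print(nuclear, balanced, unbalanced)   # nuclear == balanced <= unbalanced
```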
4.2 Online Optimization for ORLTM

This section develops an efficient online algorithm, based on stochastic optimization, for the objective in (6). First, two functions in mode m are defined as

$$\ell(x^m, B_{(m)}, r^m, e^m) \triangleq \frac{1}{2}\|x^m - B_{(m)} r^m - e^m\|_2^2 + \lambda_1\|e^m\|_1 + \frac{\lambda_2}{2}\|r^m\|_2^2, \quad \ell(x^m, B_{(m)}) = \min_{r^m, e^m} \ell(x^m, B_{(m)}, r^m, e^m), \qquad (7)$$

$$h(D_{(m)}, B_{(m)}, W_m) \triangleq \frac{\lambda_2}{2}\|W_m\|_F^2 + \frac{\lambda_3}{2}\Big\|B_{(m)} - \sum_{i=1}^{n_m} d_i^m w_i^m\Big\|_F^2, \quad h(D_{(m)}, B_{(m)}) = \min_{W_m} h(D_{(m)}, B_{(m)}, W_m), \qquad (8)$$

where $x^m$, $e^m$, $d^m$, $r^m$ are column vectors of the matrices $X_{(m)}$, $E_{(m)}$, $D_{(m)} \in \mathbb{R}^{n_{\setminus m}\times n_m}$, $R_m \in \mathbb{R}^{p\times n_m}$, respectively, and $w^m$ is a row vector of $W_m \in \mathbb{R}^{n_m\times p}$. These formulations allow us to rewrite (6) as

$$\min_{\{B_{(m)}, W_m, R_m, E_{(m)}\}_{m=1}^{M}} \sum_{m=1}^{M}\Big(\sum_{i=1}^{N} \ell(x_i^m, B_{(m)}, r_i^m, e_i^m) + h(D_{(m)}, B_{(m)}, W_m)\Big), \qquad (9)$$

which amounts to minimizing the empirical loss function

$$f_N(B_{(m)}) \triangleq \frac{1}{N}\sum_{i=1}^{N} \ell(x_i^m, B_{(m)}) + \frac{1}{N} h(D_{(m)}, B_{(m)}).$$

We now describe how to sequentially optimize the variables in (6) using the alternating method, i.e., solving one variable while holding the rest fixed. The whole algorithmic framework is given in Algorithm 1 and consists of the following variable updates at each iteration t.

Computing $r_t^m$, $e_t^m$. Given $B_{(m)}^{t-1}$ from the previous iteration for the mode-m unfolding, we obtain the optimal $\{r_t^m, e_t^m\}$ by solving

$$\min_{r^m, e^m} \ell(x_t^m, B_{(m)}^{t-1}, r^m, e^m). \qquad (10)$$

This problem has a closed-form solution for $r^m$ when $e^m$ is fixed, i.e.,

$$r^m = \big(B_{(m)}^{t-1\,\top} B_{(m)}^{t-1} + \lambda_2 I_p\big)^{-1} B_{(m)}^{t-1\,\top}\big(x_t^m - e^m\big), \qquad (11)$$

and, for fixed $r^m$, the minimizer of $e^m$ is given by the soft-thresholding operator [Hale et al., 2008]:

$$e^m = \mathcal{S}_{\lambda_1}\big[x_t^m - B_{(m)}^{t-1} r^m\big]. \qquad (12)$$

Both $r_t^m$ and $e_t^m$ can be optimized efficiently by the coordinate descent algorithm [Wright, 2015].

Computing $w_t^m$. Define the accumulation matrix $G_{(m)}^{t-1} = \sum_{i=1}^{t-1} d_i^m w_i^m \in \mathbb{R}^{n_{\setminus m} \times p}$, with $G_{(m)}^{0} = 0$. Then $w_t^m$ is found by minimizing

$$\ell_2(d_t^m, B_{(m)}^{t-1}, G_{(m)}^{t-1}, w^m) \triangleq \frac{\lambda_2}{2}\|w^m\|_2^2 + \frac{\lambda_3}{2}\big\|B_{(m)}^{t-1} - G_{(m)}^{t-1} - d_t^m w^m\big\|_F^2, \qquad (13)$$

which has the closed-form solution

$$w_t^m = \big(\|d_t^m\|_2^2 + \lambda_2/\lambda_3\big)^{-1} (d_t^m)^\top\big(B_{(m)}^{t-1} - G_{(m)}^{t-1}\big). \qquad (14)$$

Although $w_t^m$ depends on the entire dictionary $D_{(m)}$, we only have access to the current atom $d_t^m$, so a proper estimate is needed to approximately solve for each $w_t^m$. The intuition behind the solution is that, after observing the t-th atom $d_t^m$, we update only $w_t^m$ while keeping the others fixed, and each $w_t^m$ is updated once after its atom is revealed. This strategy is essentially a one-pass block coordinate descent and can be flexibly refined into a multi-pass version in demanding practical applications.

Computing $B_{(m)}^t$. The reinforced basis dictionary is updated by minimizing the surrogate function

$$h_t(B_{(m)}) \triangleq \frac{1}{t}\sum_{i=1}^{t} \ell(x_i^m, B_{(m)}, r_i^m, e_i^m) + \frac{\lambda_2}{2t}\sum_{i=1}^{t}\|w_i^m\|_2^2 + \frac{\lambda_3}{2t}\big\|B_{(m)} - G_{(m)}^{t}\big\|_F^2. \qquad (15)$$

A key to efficient dictionary updates is that the surrogate $h_t(\cdot)$ can be summarized by a few sufficient statistics updated iteratively, i.e., it can be described without explicitly storing previous samples [Mensch et al., 2016]. Concretely, we define two accumulation matrices

$$\Phi_m^t = \sum_{i=1}^{t} r_i^m (r_i^m)^\top, \qquad \Theta_{(m)}^t = \sum_{i=1}^{t} (x_i^m - e_i^m)(r_i^m)^\top.$$

The solution to (15) is then

$$B_{(m)}^t = \big(\Theta_{(m)}^t + \lambda_3 G_{(m)}^t\big)\big(\Phi_m^t + \lambda_3 I_p\big)^{-1}, \qquad (16)$$

where $\Phi_m^t \in \mathbb{R}^{p\times p}$ and $\Theta_{(m)}^t \in \mathbb{R}^{n_{\setminus m} \times p}$ are independent of the sample size N. This makes our approach scalable to large data sizes.
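A minimal NumPy sketch of one mode-m iteration for a single unit sample (n_M = 1) is given below, tying Eqs. (11)-(16) together. It is our own illustration under stated assumptions: the function and variable names are ours, and a fixed number of alternations between (11) and (12) stands in for the coordinate-descent solver used in the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Entrywise soft-thresholding S_tau[v] used in Eq. (12)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def orltm_mode_step(x, d, B, G, Theta, Phi, lam1, lam2, lam3, n_alt=20):
    """One mode-m update sketch for ORLTM (Eqs. (11)-(16)).

    x : sample vec(X_(m)^t), shape (n_notm,);  d : atom vec(D_(m)^t)
    B : current basis (n_notm, p);  G, Theta : (n_notm, p);  Phi : (p, p)
    """
    p = B.shape[1]
    r, e = np.zeros(p), np.zeros_like(x)
    A = np.linalg.inv(B.T @ B + lam2 * np.eye(p)) @ B.T   # reused in Eq. (11)
    for _ in range(n_alt):                                # simple alternation
        r = A @ (x - e)                                   # Eq. (11)
        e = soft_threshold(x - B @ r, lam1)               # Eq. (12)
    w = d @ (B - G) / (d @ d + lam2 / lam3)               # Eq. (14)
    G = G + np.outer(d, w)                                # accumulators
    Theta = Theta + np.outer(x - e, r)
    Phi = Phi + np.outer(r, r)
    B = (Theta + lam3 * G) @ np.linalg.inv(Phi + lam3 * np.eye(p))  # Eq. (16)
    return r, e, w, B, G, Theta, Phi
```

In a full implementation this step would be run for every mode m at every iteration t, exactly as Algorithm 1 below prescribes.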
In summary, the pipeline of our ORLTM method is described in Algorithm 1. For an M-th order tensor $\mathcal{X}$, each unit sub-tensor sample is one (M-1)-th order sub-tensor, and $n_M$ denotes the number of unit sub-tensor samples in each streamed sub-tensor. If the total number of unit sub-tensor samples (i.e., the size of the entire tensor along the streaming mode) is n, then the number of sub-tensor samples is $N = n / n_M$. At each iteration, ORLTM admits one sub-tensor, containing either a single unit sub-tensor sample ($n_M = 1$) or multiple unit sub-tensor samples.

Algorithm 1 Online Robust Low-Rank Tensor Modeling
Input: observed M-th order tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_{M-1} \times n}$ and pre-given dictionary $\mathcal{D}$, tradeoff parameters $\{\lambda_1, \lambda_2, \lambda_3\}$, target rank p, sub-tensor size $n_M$.
Output: $\mathcal{L}$, $\mathcal{B}$, $\mathcal{E}$, $\{W_m, R_m\}_{m=1}^{M}$.
1: Initialize: set all entries of $\{\mathcal{L}, \mathcal{E}\} \in \mathbb{R}^{n_1 \times \cdots \times n_{M-1} \times n}$, $W_m \in \mathbb{R}^{n_m \times p}$, $R_m \in \mathbb{R}^{p \times n_m}$, $G_{(m)}^0$, $\Theta_{(m)}^0$, $\Phi_m^0$ to zero; initialize $\mathcal{B} \in \mathbb{R}^{n_1 \times \cdots \times n_{M-1} \times p}$; set $N = n / n_M$.
2: for t = 1 to N do
3:   for m = 1 to M do
4:     Access the t-th sample $x_t^m$ via $\mathrm{vec}(X_{(m)}^t)$.
5:     Reveal the t-th atom $d_t^m$ via $\mathrm{vec}(D_{(m)}^t)$.
6:     Obtain the coefficients $r_t^m$ and sparse vector $e_t^m$ by optimizing $\ell(x_t^m, B_{(m)}^{t-1}, r^m, e^m)$, using coordinate descent on Eqs. (11)-(12).
7:     Compute the coefficients $w_t^m$ by minimizing $\ell_2(d_t^m, B_{(m)}^{t-1}, G_{(m)}^{t-1}, w^m)$, leading to Eq. (14).
8:     Update the accumulation matrices: $G_{(m)}^t \leftarrow G_{(m)}^{t-1} + d_t^m w_t^m$, $\Theta_{(m)}^t \leftarrow \Theta_{(m)}^{t-1} + (x_t^m - e_t^m)(r_t^m)^\top$, $\Phi_m^t \leftarrow \Phi_m^{t-1} + r_t^m (r_t^m)^\top$.
9:     Update the dictionary $B_{(m)}^t$ by Eq. (16).
10:    Update the low-rank and sparse components by $L_{(m)}^t \leftarrow B_{(m)}^t r_t^m$ and $E_{(m)}^t \leftarrow e_t^m$, respectively.
11:   end for
12: end for
13: Obtain the low-rank tensor $\mathcal{L}$, sparse tensor $\mathcal{E}$, and reinforced dictionary tensor $\mathcal{B}$ by average pooling over the mode-m foldings, i.e., $\mathcal{L} = \frac{1}{M}\sum_{m=1}^{M}\mathrm{fold}(L_{(m)})$, $\mathcal{E} = \frac{1}{M}\sum_{m=1}^{M}\mathrm{fold}(E_{(m)})$, $\mathcal{B} = \frac{1}{M}\sum_{m=1}^{M}\mathrm{fold}(B_{(m)})$.

4.3 Complexity Analysis

Computational Complexity. Four variables are updated in each iteration. First, computing $\{r_t^m, e_t^m\}$ is cheap, e.g., via the block coordinate descent method of [Richtárik and Takáč, 2014], which enjoys linear convergence. Second, $w_t^m$ needs $O(n_{\setminus m} p)$ for matrix-vector multiplications, where the rank $p \ll \min(n_{\setminus m}, n_m)$ and the sub-tensor size $n_M$ is usually quite small. Third, updating the accumulation matrices $G_{(m)}^t$, $\Theta_{(m)}^t$, $\Phi_m^t$ requires $O(n_{\setminus m} p)$. Fourth, computing $B_{(m)}^t$ costs $O(n_{\setminus m} p^2)$. The total computational cost per iteration is therefore limited.

Memory Cost. Low memory cost is a key advantage of ORLTM. On the one hand, we need $O(n_{\setminus m} p)$ to load $B_{(m)}^t$ and $x_t^m$ when computing $\{r_t^m, e_t^m\}$, and also $O(n_{\setminus m} p)$ to use $B_{(m)}^t$ and $G_{(m)}^t$ when calculating $w_t^m$. On the other hand, since the history information of $h_t(B_{(m)})$ in (15) is recorded by the accumulation matrices $G_{(m)}^t$, $\Theta_{(m)}^t$, $\Phi_m^t$, it costs at most $O(n_{\setminus m} p)$. Hence, ORLTM needs only $O(n_{\setminus m} p)$ memory per iteration.

Figure 1: RRSE under different corrupted entries on synthetic data. Left: RRSE vs. corruption percentage (H = 1); right: RRSE vs. outlier interval 2|H| (η = 50%).

5 Experiments

5.1 Synthetic Data

To generate synthetic data with low-rank structure, we construct a ground-truth tensor $\mathcal{S}$ of size 50 x 50 x 30 from rank-3 factor matrices as in [Sobral et al., 2015], e.g., $S_m \in \mathbb{R}^{50\times 3}$ (m denotes the mode). Each factor matrix consists of the three components $[\sin(\frac{2\pi m i_m}{50}), \cos(\frac{2\pi m i_m}{50}), \mathrm{sgn}(\sin(0.5\pi i_m))]$, where $i_m$ denotes the element index in mode m, and the product of two factor matrices yields a frontal slice of mode 3. We randomly corrupt entries of the tensor with small noise from $\mathcal{N}(0, 0.01)$ and outliers from the uniform distribution $\mathcal{U}(-|H|, |H|)$ (H is the interval bound magnitude). For evaluation, we use the Root Relative Square Error (RRSE), $\|\hat{\mathcal{S}} - \mathcal{S}\|_F / \|\mathcal{S}\|_F$, where $\hat{\mathcal{S}}$ is the estimated low-rank tensor.
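A minimal sketch of this synthetic setup is shown below, assuming a standard rank-3 CP construction, i.e., $S_{ijk} = \sum_{c} S_1[i,c]\,S_2[j,c]\,S_3[k,c]$, so each frontal slice is a product of the (reweighted) mode-1 and mode-2 factors; the authors' exact construction and corruption scheme may differ, and the names here are illustrative.

```python
import numpy as np

def factor(n, m):
    """Mode-m factor matrix with the sin/cos/sign columns described above."""
    i = np.arange(1, n + 1)
    return np.stack([np.sin(2 * np.pi * m * i / 50),
                     np.cos(2 * np.pi * m * i / 50),
                     np.sign(np.sin(0.5 * np.pi * i))], axis=1)

def make_synthetic(dims=(50, 50, 30), eta=0.5, H=1.0, seed=0):
    rng = np.random.default_rng(seed)
    S1, S2, S3 = (factor(dims[m], m + 1) for m in range(3))
    # Rank-3 tensor: S[i,j,k] = sum_c S1[i,c] S2[j,c] S3[k,c]
    S = np.einsum('ic,jc,kc->ijk', S1, S2, S3)
    X = S + rng.normal(0.0, 0.1, S.shape)                 # N(0, 0.01) noise
    mask = rng.random(S.shape) < eta                      # corrupted entries
    X[mask] += rng.uniform(-abs(H), abs(H), mask.sum())   # gross outliers
    return S, X

def rrse(S_hat, S):
    """Root Relative Square Error ||S_hat - S||_F / ||S||_F."""
    return np.linalg.norm(S_hat - S) / np.linalg.norm(S)
```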
In total, we evaluate four batch methods, i.e., RPCA [Candès et al., 2011], LRR [Liu et al., 2013; 2017], HOSVD [Goldfarb and Qin, 2014], and HORPCA [Goldfarb and Qin, 2014], and four online methods ($n_M = 1$), i.e., ORPCA [Feng et al., 2013], OLRSC [Shen et al., 2016], OSTD [Sobral et al., 2015], and our ORLTM approach. Note that the batch approaches have hindsight knowledge of all the streaming data and thus provide a performance upper bound for their online counterparts. For the compared methods, we use the default parameters from their papers, and the target rank p is set to the ground truth 3. For ORLTM, $\lambda_1 = 0.5/\sqrt{\log(n_1 n_2)}$, $\lambda_2 = 1$, $\lambda_3 = \lambda_1 \sqrt{\log t}$.

The comparison results, with the corruption percentage η varying from 0 to 80% and the outlier interval 2|H| varying over [0 : 0.2 : 2], are plotted in Fig. 1. The figure shows that ORLTM outperforms all online methods regardless of how the corruption percentage and the outlier interval vary, and is comparable to the batch approaches. Moreover, our method is consistently better than LRR and better than RPCA once the corruption exceeds 55%, which justifies the benefit of exploiting the low-rank structure in all tensor modes. In addition, we provide a quantitative analysis with different sizes ($n_3$) for the online methods in Table 1. As the table shows, the performance of the compared methods improves as the tensor size increases, indicating that more robust tensor recovery can be achieved by taking in more samples sequentially. Meanwhile, our method still performs best regardless of the tensor size.

Table 1: RRSE of online methods with different tensor sizes (η = 50%, H = 1, n_1 = 50, n_2 = 50).

| n_3 | ORPCA | OLRSC | OSTD | ORLTM |
|---|---|---|---|---|
| 100 | 0.5456 | 0.7104 | 0.1183 | 0.0667 |
| 500 | 0.6571 | 0.5685 | 0.0580 | 0.0324 |
| 1000 | 0.5220 | 0.5352 | 0.0430 | 0.0234 |
| 5000 | 0.4316 | 0.4786 | 0.0224 | 0.0130 |

5.2 Video Background Subtraction

An input video is treated as a long third-order tensor, in which each frontal slice is a frame and multiple frames constitute a sub-tensor. We model the background (BG, i.e., the static information) by the low-rank component $\mathcal{L}$, while the foreground (FG, i.e., moving objects) is pushed into the sparse component $\mathcal{E}$. We test on the I2R database [Li et al., 2004], which covers various indoor and outdoor environments. We use eight video sequences containing over 15,000 frames with sizes ranging from 120 x 160 to 256 x 320. Each sequence provides 20 frames with ground-truth foreground masks (FM).

We compare ORLTM with the online methods ORPCA [Feng et al., 2013], OLRSC [Shen et al., 2016], and OSTD [Sobral et al., 2015], and the batch methods RPCA [Candès et al., 2011], LRR [Liu et al., 2013], and HORPCA [Goldfarb and Qin, 2014], with parameters set as indicated in their papers. For our method, the parameters are fixed as $\lambda_1 = \alpha/\sqrt{\log(n_1 n_2)}$ (α is 0.02 for Curtain, Lobby, and Watersurface, and 0.1 for the rest), $\lambda_2 = 1$, $\lambda_3 = \lambda_1\sqrt{\log t}$ (t is the iteration number), and the sub-tensor size is set to 1. For all methods, the target rank p is empirically set to 10, and the foreground masks are obtained by thresholding the sparse component, i.e., $\mathrm{FM}_{ij} = 1$ if $E_{ij}^2/2 \ge (\mathrm{std}(\mathrm{vec}(\mathcal{E})))^2$ and 0 otherwise, where std(·) denotes the standard deviation. For the online methods, all frames are passed through for two epochs; the batch methods are applied to mini-batches of 500 frames at a time owing to the limited memory. To initialize $B_{(m)}$ in ORLTM, we adopt Bilateral Random Projections (BRP) [Zhou and Tao, 2012]: given a dense matrix $\Gamma \in \mathbb{R}^{d\times c}$, its rank-r approximation is built as $\tilde{\Gamma} = \Gamma Y_1 (Y_2^\top \Gamma Y_1)^{-1} Y_2^\top \Gamma$, where $Y_1 \in \mathbb{R}^{c\times r}$ and $Y_2 \in \mathbb{R}^{d\times r}$ are random matrices and $r \ll \min(d, c)$.
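A minimal sketch of the BRP initialization just described is given below; the function name and the Gaussian choice of random matrices are our own illustrative assumptions, while the formula follows [Zhou and Tao, 2012] as quoted above.

```python
import numpy as np

def brp_lowrank(Gamma, r, seed=0):
    """Rank-r Bilateral Random Projection approximation of a dense matrix.

    Implements Gamma_tilde = Gamma Y1 (Y2^T Gamma Y1)^{-1} Y2^T Gamma with
    random matrices Y1 (c x r) and Y2 (d x r).
    """
    rng = np.random.default_rng(seed)
    d, c = Gamma.shape
    Y1 = rng.standard_normal((c, r))
    Y2 = rng.standard_normal((d, r))
    mid = np.linalg.inv(Y2.T @ Gamma @ Y1)       # (r, r), invertible w.h.p.
    return Gamma @ Y1 @ mid @ (Y2.T @ Gamma)

# Sanity check: an exactly rank-3 matrix is recovered up to round-off.
rng = np.random.default_rng(1)
A = rng.standard_normal((300, 3)) @ rng.standard_normal((3, 200))
print(np.linalg.norm(A - brp_lowrank(A, 3)) / np.linalg.norm(A))
```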
In our tests, we found that LRR performs very poorly when the data set itself is used as the dictionary, so we use its mean matrix instead. For evaluation, we employ the F-score [Brutzer et al., 2011], computed by comparing each sequence against its available ground-truth frames. The results are recorded in Table 2.

Table 2: F-score (%) comparisons for background subtraction. Best in boldface and second best underlined.

| Methods | Bootstrap | Campus | Curtain | Fountain | Lobby | Shopmall | Watersurface | Hall | Average |
|---|---|---|---|---|---|---|---|---|---|
| RPCA | 63.86 | 59.88 | 66.33 | 79.28 | 72.07 | 75.87 | 44.39 | 53.61 | 64.41 |
| LRR | 55.02 | 32.75 | 84.92 | 62.87 | 35.47 | 70.30 | 61.14 | 55.60 | 57.26 |
| HORPCA | 60.27 | 32.58 | 72.79 | 31.31 | 53.12 | 74.95 | 43.39 | 58.02 | 53.30 |
| ORPCA | 55.48 | 37.26 | 82.29 | 56.76 | 62.20 | 49.53 | 89.86 | 56.53 | 61.24 |
| OLRSC | 44.75 | 23.01 | 77.76 | 26.96 | 15.54 | 21.49 | 79.16 | 32.53 | 40.15 |
| OSTD | 58.06 | 60.00 | 56.11 | 71.18 | 40.08 | 74.47 | 48.36 | 61.95 | 58.78 |
| ORLTM | 64.33 | 59.20 | 88.85 | 82.76 | 77.78 | 74.40 | 89.59 | 65.02 | 75.24 |

From the table, we observe that ORLTM has the most encouraging overall performance across all scenes, about 10% higher than the second-best method, RPCA (a batch method). In particular, our method performs much better than the rest on Bootstrap, Curtain, Fountain, Lobby, and Hall, and it also does well on the other sequences, achieving almost the same F-score as the best methods. The superior performance of ORLTM can be attributed to two facts: first, the dictionary learning in ORLTM can better model the video background even in noisy and cluttered environments; second, learning the low-rank structures of the tensor in all modes sufficiently exploits the underlying information of the video sequences. Note that the batch methods process 500 frames at a time, which might slightly degrade their performance.

To examine efficiency, we also report the frames per second (fps) of our method and the two batch methods in Table 3.

Table 3: Speed comparison (fps). The value in brackets is the acceleration rate of ORLTM over the corresponding method.

| Video | n_1 x n_2 | n_3 | RPCA | HORPCA | ORLTM |
|---|---|---|---|---|---|
| Bootstrap | 120 x 160 | 3055 | 0.546 (12.16) | 0.352 (18.87) | 6.642 |
| Campus | 128 x 160 | 1439 | 2.220 (4.82) | 0.358 (29.87) | 10.694 |
| Curtain | 128 x 160 | 2964 | 2.740 (1.96) | 0.404 (13.31) | 5.379 |
| Fountain | 128 x 160 | 523 | 2.556 (6.31) | 0.346 (46.65) | 16.140 |
| Lobby | 128 x 160 | 1551 | 2.621 (3.45) | 0.254 (35.60) | 9.042 |
| Shopmall | 256 x 320 | 1286 | 0.321 (8.50) | 0.101 (27.01) | 2.728 |
| Watersurface | 128 x 160 | 641 | 3.456 (2.67) | 0.412 (22.37) | 9.217 |
| Hall | 144 x 176 | 3584 | 1.567 (2.76) | 0.289 (14.94) | 4.318 |

As shown in the table, ORLTM is very efficient: it is over 12 times faster than RPCA on Bootstrap and up to 46 times faster than HORPCA on Fountain. This is because ORLTM, as an online method, needs only a small, constant amount of memory, and its per-iteration computational cost is low. In contrast, the batch methods have to store all frames in memory in advance and require expensive singular value decompositions over batches of frames. All tests were run on a 3.06 GHz Core X5675 processor with 24 GB RAM.

In addition, we display the foreground masks of some randomly selected frames in Fig. 2.

Figure 2: Sample frame masks for video background subtraction in a variety of scenes. (a) Input frame; (b) ORLTM (background); (c) ORLTM (foreground); (d) ORLTM; (e) OSTD; (f) OLRSC; (g) ORPCA; (h) HORPCA; (i) LRR; (j) RPCA.
The masks show that ORLTM outperforms the others in most scenes, e.g., curtain swinging (row 2), running fountain (row 3), and illumination change (row 4). This demonstrates that our method can model the background and foreground well in a rich variety of scenes.

6 Conclusion

We presented an Online Robust Low-rank Tensor Modeling (ORLTM) method for robustly learning the low-rank structures of streaming noisy tensor data. By developing an equivalent bi-factor formulation of the nuclear norm, it processes samples sequentially and is thus scalable to large-scale tensor data. The objective function is solved by stochastic optimization, and in each iteration the memory cost is reduced by a factor of n compared to batch methods. Synthetic studies suggest that ORLTM outperforms well-established online methods and that its performance is comparable to batch methods even when the data are corrupted by up to 80% gross noise of large magnitude. In video background subtraction, ORLTM attains the highest average F-score, over 10% higher than the state of the art, and is significantly faster than batch methods, up to 46 times faster than HORPCA.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 61502131, 61572169, and 61472266, the Zhejiang Provincial Natural Science Foundation of China under Grant LQ15F020012, and the China Scholarship Council. The work of Jiashi Feng was supported in part by National University of Singapore Startup Grant R-263-000-C08-133 and Ministry of Education of Singapore AcRF Tier One Grant R-263-000-C21-112.

References

[Brutzer et al., 2011] Sebastian Brutzer, Benjamin Höferlin, and Gunther Heidemann. Evaluation of background subtraction techniques for video surveillance. In CVPR, pages 1937-1944, 2011.
[Candès et al., 2011] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):No. 11, 2011.
[Fazel et al., 2001] Maryam Fazel, Haitham Hindi, and Stephen P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, volume 6, pages 4734-4739, 2001.
[Feng et al., 2013] Jiashi Feng, Huan Xu, and Shuicheng Yan. Online robust PCA via stochastic optimization. In NIPS, pages 404-412, 2013.
[Goldfarb and Qin, 2014] Donald Goldfarb and Zhiwei Qin. Robust low-rank tensor recovery: models and algorithms. SIAM Journal on Matrix Analysis and Applications, 35(1):225-253, 2014.
[Hale et al., 2008] Elaine T. Hale, Wotao Yin, and Yin Zhang. Fixed-point continuation for ℓ1-minimization: methodology and convergence. SIAM Journal on Optimization, 19(3):1107-1130, 2008.
[Hillar and Lim, 2013] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. Journal of the ACM, 60(6):No. 45, 2013.
[Kilmer and Martin, 2011] Misha E. Kilmer and Carla D. Martin. Factorization strategies for third-order tensors. Linear Algebra and its Applications, 435(3):641-658, 2011.
[Kolda and Bader, 2009] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.
[Li et al., 2004] Liyuan Li, Weimin Huang, Irene Yu-Hua Gu, and Qi Tian. Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing, 13(11):1459-1472, 2004.
[Liu and Yan, 2011] Guangcan Liu and Shuicheng Yan. Latent low-rank representation for subspace segmentation and feature extraction. In ICCV, pages 1615-1622, 2011.
[Liu et al., 2013] Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171-184, 2013.
[Liu et al., 2017] Guangcan Liu, Qingshan Liu, and Ping Li. Blessing of dimensionality: recovering mixture data via dictionary pursuit. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1):47-60, 2017.
[Lu et al., 2016] Canyi Lu, Jiashi Feng, Yudong Chen, Wei Liu, Zhouchen Lin, and Shuicheng Yan. Tensor robust principal component analysis: exact recovery of corrupted low-rank tensors via convex optimization. In CVPR, pages 5249-5257, 2016.
[Mensch et al., 2016] Arthur Mensch, Julien Mairal, Bertrand Thirion, and Gaël Varoquaux. Dictionary learning for massive matrix factorization. In ICML, volume 48, pages 1737-1746, 2016.
[Recht et al., 2010] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52:471-501, 2010.
[Richtárik and Takáč, 2014] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1-38, 2014.
[Shen et al., 2016] Jie Shen, Ping Li, and Huan Xu. Online low-rank subspace clustering by basis dictionary pursuit. In ICML, volume 48, pages 622-631, 2016.
[Sobral et al., 2015] Andrews Sobral, Sajid Javed, Soon Ki Jung, Thierry Bouwmans, and El-hadi Zahzah. Online stochastic tensor decomposition for background subtraction in multispectral video sequences. In ICCV Workshop, pages 106-113, 2015.
[Srebro et al., 2004] Nathan Srebro, Jason D. M. Rennie, and Tommi S. Jaakkola. Maximum-margin matrix factorization. In NIPS, pages 1329-1336, 2004.
[Sun et al., 2008] Jimeng Sun, Dacheng Tao, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos. Incremental tensor analysis: theory and applications. ACM Transactions on Knowledge Discovery from Data (TKDD), 2(3):11, 2008.
[Wright, 2015] Stephen J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3-34, 2015.
[Yin et al., 2015] Ming Yin, Junbin Gao, Zhouchen Lin, Qinfeng Shi, and Yi Guo. Dual graph regularized latent low-rank representation for subspace clustering. IEEE Transactions on Image Processing, 24(12):4918-4933, 2015.
[Yu et al., 2015] Rose Yu, Dehua Cheng, and Yan Liu. Accelerated online low-rank tensor learning for multivariate spatio-temporal streams. In ICML, pages 238-247, 2015.
[Zhan et al., 2016] Jinchun Zhan, Brian Lois, Han Guo, and Namrata Vaswani. Online (and offline) robust PCA: novel algorithms and performance guarantees. In AISTATS, pages 1488-1496, 2016.
[Zhang et al., 2013] Xiaoqin Zhang, Di Wang, Zhengyuan Zhou, and Yi Ma. Simultaneous rectification and alignment via robust recovery of low-rank tensors. In NIPS, pages 1637-1645, 2013.
[Zhou and Tao, 2012] Tianyi Zhou and Dacheng Tao. Bilateral random projections. In Proceedings of the IEEE International Symposium on Information Theory, pages 1286-1290, 2012.
[Zhou et al., 2016] Shuo Zhou, Nguyen Xuan Vinh, James Bailey, Yunzhe Jia, and Ian Davidson. Accelerating online CP decompositions for higher order tensors. In SIGKDD, pages 1375-1384, 2016.