# fast_and_provable_nonconvex_tensor_rpca__70e7d472.pdf Fast and Provable Nonconvex Tensor RPCA Haiquan Qiu 1 2 3 Yao Wang 1 Shaojie Tang 4 Deyu Meng 1 5 Quanming Yao 2 In this paper, we study nonconvex tensor robust principal component analysis (RPCA) based on the t-SVD. We first propose an alternating projection method, i.e., APT, which converges linearly to the ground-truth under the incoherence conditions of tensors. However, as the projection to the low-rank tensor space in APT can be slow, we further propose to speedup such a process by utilizing the property of the tangent space of low-rank. The resulting algorithm, i.e., EAPT, is not only more efficient than APT but also keeps the linear convergence. Compared with existing tensor RPCA works, the proposed method, especially EAPT, is not only more effective due to the recovery guarantee and adaption in the transformed (frequency) domain but also more efficient due to faster convergence rate and lower iteration complexity. These benefits are also empirically verified both on synthetic data, and real applications, e.g., hyperspectral image denoising and video background subtraction. 1. Introduction A tensor is a multidimensional array that can model the linear and multilinear relationships in the data. Tensor related methods have been widely used in many areas such as recommender systems (Cand es & Recht, 2009), computer vision (Zhang et al., 2014), and signal processing (Cichocki et al., 2015). For example, a hyperspectral image with multiple bands can be naturally represented as a three-way tensor with the column, row, and spectral bands; a grayscale video is indexed by two spatial variables and one temporal variable. All these examples are the three-way tensors, which 1Xi an Jiaotong University, Xi an, China 2Tsinghua University, Beijing, China 34Paradigm Inc., Beijing, China 4The University of Texas at Dallas, Texas, USA 5Macau University of Science and Technology, Macau, China. Correspondence to: Yao Wang , Quanming Yao . Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s). are the focus of this paper. Analogous to robust PCA (Cand es et al., 2011), tensor robust PCA (Huang et al., 2015; Lu et al., 2016; Anandkumar et al., 2016; Lu et al., 2019) attempts to recover a low-rank tensor that best approximates grossly corrupted observations. While there are many works for matrix RPCA (Wright et al., 2009; Netrapalli et al., 2014; Yi et al., 2016; Xu et al., 2010), these matrix techniques cannot be directly adopted for tensors as the tensor rank is more complicated than matrices. Typically, to impose a low-rank structure in the tensor RPCA, CP and Tucker decompositions (Kolda & Bader, 2009) factorize a tensor into low-rank matrices and are used in methods such as RTD (Anandkumar et al., 2016). Besides, Tensor-Train decomposition (Oseledets, 2011), which approximates a higher order tensor with a collection of small three-way tensors, has been applied to tensor RPCA in TTNN (Yang et al., 2020). Furthermore, a convex-optimization-based tensor decomposition approach, called overlapped approach (Gandy et al., 2011; Liu et al., 2012; Tomioka et al., 2010), which penalizes each unfolding matrices from the original tensor using the nuclear norm, is considered in SNN (Huang et al., 2015). All the above decompositions consider low-rank structure in the time domain. 
However, many applications (Chang, 1966; Rao et al., 2010) show that structures in the frequency domain are important as well. For example, both SVD truncation and high-frequency filtering can remove noise from an image. For tensor decomposition, there exists a method called t-SVD (Kilmer & Martin, 2011) that can take advantage of structures in both the time domain and the frequency domain. t-SVD first conducts a fast Fourier transform (FFT) on the tube fibers of a tensor. This operation converts the tensor from the time domain to the frequency domain. In this way, the high- and low-frequency information of a tensor is separated, while its low-rank structure is preserved. Recently, transformed t-SVD (Kernfeld et al., 2015; Kilmer et al., 2021) extends the FFT in t-SVD to a general unitary transformation, whose main advantage is that the transformed tensor may have a lower multi-rank under a suitable unitary transformation. Extensive numerical examples have shown its effectiveness in many applications (Zhang & Ng, 2021; Lu, 2021; Song et al., 2020a).

Table 1. Comparison of various tensor RPCA methods based on (transformed) t-SVD. "transformed": whether the method is based on transformed t-SVD; "✓": the method has such a property or uses such a technique; "✗": the method does not have such a property or does not use such a technique. The computational complexity is calculated on a tensor in $\mathbb{R}^{n\times n\times q}$ with multi-rank $\mathbf{r}$ ($s_r = \sum_{i=1}^{q} r_i$).

| t-SVD methods | transformed | rank | recovery guarantee (effectiveness) | convergence rate (efficiency) | iteration complexity: FFT | iteration complexity: DCT |
| TRPCA (Lu et al., 2019) | ✗ | tubal | ✓ | $O(1/\epsilon)$ | $O(n^2 q \log q + n^3 q)$ | ✗ |
| ETRPCA (Gao et al., 2020) | ✗ | tubal | ✗ | ✗ | $O(n^2 q \log q + n^3 q)$ | ✗ |
| T-TRPCA (Lu, 2021) | ✓ | tubal | ✓ | $O(1/\epsilon)$ | ✗ | $O(n^2 q^2 + n^3 q)$ |
| APT | ✓ | multi | ✓ | $O(\log(1/\epsilon))$ | $O(n^2 q \log q + n^3 q)$ | $O(n^2 q^2 + n^3 q)$ |
| EAPT | ✓ | multi | ✓ | $O(\log(1/\epsilon))$ | $O(n^2 q \log q + n^2 s_r)$ | $O(n^2 q^2 + n^2 s_r)$ |

Because of the aforementioned benefits, several works have emerged that solve tensor RPCA based on t-SVD (Lu et al., 2019; Gao et al., 2020; Lu, 2021). TRPCA (Lu et al., 2019) proposes a new tensor nuclear norm, solves a convex optimization objective for tensor RPCA with the alternating direction method of multipliers (ADMM) (Gabay & Mercier, 1976), and possesses a statistical recovery guarantee. However, TRPCA penalizes all singular values equally, which does not fully utilize the information in large singular values. ETRPCA (Gao et al., 2020) therefore introduces a weighted tensor Schatten p-norm to make large singular values shrink less. Transformed TRPCA (T-TRPCA) (Lu, 2021) extends TRPCA with a new nuclear norm derived from transformed t-SVD. Like TRPCA, T-TRPCA also has a statistical recovery guarantee. All these works, summarized in Table 1, are based on the tubal rank, which cannot comprehensively capture the difference between low- and high-frequency information in transformed tensors. Also, these methods cannot be efficient and effective at the same time. In this paper, we propose two alternating projection algorithms for tensor RPCA, i.e., APT and EAPT, based on transformed t-SVD. Both of our methods have a recovery guarantee and a linear convergence rate. Moreover, by utilizing properties of the tangent space of low-rank tensors, EAPT avoids t-SVD on a full-sized tensor. When considering multi-rank, our methods can adaptively keep important information in the frequency domain. The main contributions of our method are as follows.
We study nonconvex tensor robust PCA based on transformed t-SVD and propose two alternating projection algorithms, i.e., APT and EAPT. Specifically, EAPT is more efficient since it utilizes the tangent space of lowrank tensor to reduce iteration complexity. As far as we know, our work is the first to prove the exact recovery guarantee for nonconvex tensor RPCA with t-SVD decomposition. Specially, we show that linear convergence to the ground-truth can be guaranteed under suitable tensor incoherence conditions. Experiments on synthetic data and real data applications, e.g., hyperspectral image denoising and video background subtraction, demonstrate both efficiency and effectiveness of our methods. Notations Scalars and vectors are denoted as lowercase letters and bold lowercase letters respectively, e.g., x and x. Bold capital letters and calligraphic letters denote the matrices and tensors, respectively, e.g., X and X. The real and complex Euclidean spaces are denoted as R and C, respectively. Superscript H and T denote conjugate transpose and transpose respectively. A B means A c B for some positive number c. For a three-way tensor X Cm n q, its (i, j, k)-th entry is denoted as Xijk and the MATLAB notations X(i, : , :), X(:, i, :) and X(:, :, i) are used to denote the i-th horizontal, lateral, and frontal slice of X, respectively. A tube fiber of a three-way tensor is defined as X(i, j, :) where the first two indices of X are fixed. The inner product between two tensors X and Y in Cm n q is defined as X, Y = P ijt Xijt Yijt. We denote the Frobenius norm of a tensor as X F = p X, X . The ℓ1-norm is defined as X 1 = P ijt |Xijt| and the infinity norm X is denoted as X = maxijt |Xijt|. The Euclidean projection operator P is defined as PΩ(X) = arg min Y Ω Y X F . 2. Preliminaries In this section, we introduce basic facts about t-SVD (Kernfeld et al., 2015; Kilmer et al., 2021), which will be used in the subsequent sections. Specifically, we focus on three-way tensor in this paper. More related facts about tensor-tensor product are given in the Appendix B. For any three-way tensor X Cm n q and any unitary matrix Φ Cq q, ˆ XΦ denotes a three-way tensor with each tube fiber multiplied by Φ, i.e., ˆ XΦ(i, j, :) = Φ X(i, j, :), i = 1 . . . m, j = 1 . . . n. (1) Also, ˆ XΦ is often denoted as Φ[X] for simplicity. The block-diagonal matrix representation of a three-way tensor Fast and Provable Nonconvex Tensor RPCA X defined as ˆ XΦ(:, :, 1) ... ˆ XΦ(:, :, q) The conjugate transpose of a tensor X Cm n q is denoted as X H with X H = X H. The spectral norm of X is defined as X Φ = X and denoted as X for simplicity. Similar to the t-product in Fourier domain (Kilmer & Martin, 2011), the result Z Cn1 n3 q of tensor-tensor product (Kilmer et al., 2021; Kernfeld et al., 2015) between two tensors X Cn1 n2 q and Y Cn2 n3 q with any unitary transformation Φ is given by Z = X Φ Y = ΦH fold X Y where fold X = ˆ XΦ = Φ[X]. Transformed t-SVD and tensor rank are described in the following Definitions 2.1 and 2.2 respectively. Note that t-SVD is a special case of transformed t-SVD by setting Φ = F/ q where F is a DFT matrix. Definition 2.1 (Kernfeld et al. 2015). For any X Cm n q, the transformed t-SVD is given by X = U ΦS Φ VH, where U Cm m q, V Cn n q are unitary tensors with respect to the transformation Φ, and S Cm n q is a diagonal tensor. And the singular value of X is defined as ˆSΦ(i, i, c), where 1 i min(m, n) and 1 c q. Definition 2.2 (tensor rank (Kilmer et al., 2021)). 
The multirank of a tensor X Cm n q is a vector r Rq, i.e., rankm(X) = r, with its i-th entry being the rank of the i-th frontal slice ˆXΦ(:, :, i) of ˆ XΦ, i.e., ri = rank( ˆXΦ(:, :, i)). Let X = U Φ S Φ VH. The tubal rank of X, denoted by ranktt(X), is defined as the number of nonzero singular value tubes of S, i.e., ranktt(X) = #{i|S(i, i, :) = 0}. The relationship between multi-rank and tubal rank is ranktt(X) = max{ri}, i.e., tubal rank is equal to be maximum element in multi-rank. Previous works such as TRPCA and T-TRPCA are based on tubal rank while our methods are based on multi-rank. In real applications (Hao et al., 2013; Kilmer et al., 2021), using multi-rank can adaptively utilize the highand low-frequency information in the frequency (transformed) domain. Because for a tensor converted into frequency domain, the high-frequency slices are more likely to be noise, then applying lower rank for the high-frequency slices can effectively remove the noise on the original tensor. 3. Related Work Given the observed tensor D Rn n q, tensor RPCA attempts to separate D into a low-rank tensor part L and a sparse tensor part S. However, different from matrices, there are many rank definitions for tensors because a tensor can be factorized in many ways. Common tensor decompositions include CP, Tucker (Kolda & Bader, 2009), Tensor-Train (TT) (Oseledets, 2011), overlapped/latent tensor nuclear norm (Gandy et al., 2011; Tomioka et al., 2010), and t SVD (Kilmer & Martin, 2011; Kilmer et al., 2021). With these, there are many methods (Lu et al., 2019; Lu, 2021; Gao et al., 2020; Cai et al., 2021; Yang et al., 2020; Huang et al., 2015; Driggs et al., 2019; Anandkumar et al., 2016) for tensor RPCA (see Appendix A for details). Tensor RPCA (Lu et al., 2019; Lu, 2021; Gao et al., 2020) based on tensor-tensor product can utilize information in both time and frequency domains simultaneously. Among them, T-TRPCA (Lu, 2021) constructs two convex surrogates, i.e., L Φ and S 1, of the low-rank and sparse constraints. The optimization objective of T-TRPCA is min L,S L Φ + λ S 1, subject to D = L + S. When Φ = F/ q, TRPCA (Lu et al., 2019) becomes a special case of T-TRPCA. Both TRPCA and T-TRPCA converge sub-linearly with recovery guarantee. The computational complexities of TRPCA and T-TRPCA are O(n2q log q + n3q) and O(n2q2 + n3q) respectively. Different from TRPCA, T-TRPCA replaces the DFT matrix with a DCT matrix, which achieves better performance in real applications (Lu, 2021). Besides, ETRPCA (Gao et al., 2020) is a nonconvex method based on weighted tensor Schatten p-norm with only algorithmic convergence guarantee. ETRPCA improves TRPCA by making large singular values shrink less while TRPCA and T-TRPCA penalize all singular values equally. The computational complexity of ETRPCA is O(n2q log q + n3q). However, all these works, summarized in Table 1, use tubal rank that ignores the difference between the highand lowfrequency information in transformed tensor while multirank can specify different ranks for the frontal slices at different frequencies. Another line of tensor RPCA methods is based on alternating projection (Cai et al., 2021; Anandkumar et al., 2016). These works alternatively project the tensor onto the lowrank and sparse spaces. Among them, RTCUR (Cai et al., 2021) is based on a new decomposition called Fiber CUR without a recovery guarantee. Anandkumar et al. 2016 proposes an alternating projection algorithm based on CP decomposition with a recovery guarantee. 
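Before moving on, the core (transformed) t-SVD operations used throughout the paper can be made concrete with a small NumPy sketch. This is our own illustration rather than code from the paper: `transform` applies Φ to the tube fibers as in Eq. (1), `t_product` multiplies matching frontal slices in the transformed domain, and `multi_rank`/`tubal_rank` follow Definition 2.2; choosing Φ as the normalized DFT matrix recovers the original t-product of Kilmer & Martin (2011). All function names and defaults are ours.

```python
import numpy as np

def transform(X, Phi):
    """Tube-fiber transform Phi[X]: Xhat(i, j, :) = Phi @ X(i, j, :), cf. Eq. (1)."""
    return np.einsum('ct,ijt->ijc', Phi, X)

def inverse_transform(Xhat, Phi):
    """Inverse tube-fiber transform with Phi^H (Phi is unitary)."""
    return np.einsum('ct,ijt->ijc', Phi.conj().T, Xhat)

def t_product(X, Y, Phi):
    """Tensor-tensor product Z = X *_Phi Y: multiply matching frontal slices in the
    transformed domain, then transform back."""
    Zhat = np.einsum('ikc,kjc->ijc', transform(X, Phi), transform(Y, Phi))
    return inverse_transform(Zhat, Phi)

def multi_rank(X, Phi, tol=1e-10):
    """Multi-rank r (Definition 2.2): r_i = rank of the i-th transformed frontal slice."""
    Xhat = transform(X, Phi)
    return np.array([np.linalg.matrix_rank(Xhat[:, :, i], tol=tol)
                     for i in range(X.shape[2])])

def tubal_rank(X, Phi, tol=1e-10):
    """Tubal rank = maximum entry of the multi-rank."""
    return int(multi_rank(X, Phi, tol).max())

if __name__ == "__main__":
    q = 4
    Phi = np.fft.fft(np.eye(q)) / np.sqrt(q)   # normalized DFT: recovers the classical t-SVD
    X = np.random.randn(6, 3, q)
    Y = np.random.randn(3, 5, q)
    Z = t_product(X, Y, Phi)                   # complex dtype; imaginary part ~0 for the DFT
    print(multi_rank(Z, Phi), tubal_rank(Z, Phi))
```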
To better utilize the tensor-tensor product and frequency information, we will propose two methods based on alternating projection and transformed t-SVD.

4. Fast Nonconvex Tensor RPCA

Here, we first introduce APT, an alternating projection algorithm with an exact recovery guarantee and a linear convergence rate (Section 4.1). Then, EAPT improves the low-rank projection in APT while keeping the exact recovery guarantee and the linear convergence rate (Section 4.2). For simplicity, our methods are presented for symmetric tensors, i.e., $\mathcal{X} = \mathcal{X}^H \in \mathbb{C}^{n\times n\times q}$. The extension of our methods to asymmetric tensors is given in Appendix B. All proofs are presented in Appendix D.

4.1. Alternating Projection with Exact Recovery

To better utilize the information in both the time and frequency domains, the rank constraint of our objective is based on the multi-rank (see Definition 2.2). Our optimization objective is
$$\min_{\mathcal{L}\in\mathbb{L},\,\mathcal{S}\in\mathbb{S}} \|\mathcal{D}-\mathcal{L}-\mathcal{S}\|_F^2, \quad (2)$$
$$\mathbb{L} = \{\mathcal{X} \,|\, \mathrm{rank}_m(\mathcal{X}) \le \mathbf{r}\}, \qquad \mathbb{S} = \{\mathcal{X} \,|\, \|\mathcal{X}\|_0 \le K\}, \quad (3)$$
where we assume that its optimal solution $\mathcal{L}^\star$ and $\mathcal{S}^\star$ satisfies the incoherence and sparsity conditions in Section 5. More importantly, if $\Phi$ is a normalized DFT matrix or a DCT matrix, we can set the multi-rank $\mathbf{r}$ as a descending sequence, so that our methods adaptively take advantage of the information in the frequency (transformed) domain, which is not possible with TRPCA and T-TRPCA as shown in Table 1.

The basic idea for solving (2) is to alternately project between the low multi-rank space $\mathbb{L}$ and the sparse space $\mathbb{S}$. For example, at the $k$-th iteration, $\mathcal{L}_{k+1}$ is the projection of $\mathcal{D}-\mathcal{S}_k$ onto the low-rank space $\mathbb{L}$. Let $\mathcal{X} = \mathcal{U} *_\Phi \mathcal{S} *_\Phi \mathcal{V}^H$ be the transformed t-SVD of $\mathcal{X}$. By the Eckart-Young theorem (Theorem B.1), the projection onto $\mathbb{L}$ can be conducted with the truncated transformed t-SVD, i.e., $\mathcal{L}_{k+1} = P_\mathbb{L}(\mathcal{D}-\mathcal{S}_k)$, where $P_\mathbb{L}(\mathcal{X}) = \Phi^H[\hat{\mathcal{X}}_\mathbf{r}]$ with
$$\hat{\mathcal{X}}_\mathbf{r}(:,:,i) = \hat{\mathcal{U}}(:,1{:}r_i,i)\,\hat{\mathcal{S}}(1{:}r_i,1{:}r_i,i)\,\hat{\mathcal{V}}(:,1{:}r_i,i)^H. \quad (4)$$
Computing $\mathcal{L}_{k+1}$ is time-consuming, costing $O(n^2q^2 + n^3q)$.

A direct implementation would obtain $\mathcal{S}_{k+1}$ by projecting $\mathcal{D}-\mathcal{L}_{k+1}$ onto the sparse space $\mathbb{S}$. The projection onto $\mathbb{S}$ with $\mathcal{S}_{k+1} = P_\mathbb{S}(\mathcal{D}-\mathcal{L}_{k+1})$ is equivalent to keeping the top-$K$ elements of $\mathcal{D}-\mathcal{L}_{k+1}$. However, the top-$K$ operation is unstable with respect to changes in the magnitude of the sparse tensor $\mathcal{S}$, so this method cannot guarantee exact recovery. Instead, we pick an appropriate threshold that keeps the elements which are likely to be in the support set of $\mathcal{S}^\star$. At the $k$-th iteration, the projection onto $\mathbb{S}$ is $\mathcal{S}_{k+1} = T_{\zeta_{k+1}}(\mathcal{D}-\mathcal{L}_{k+1})$, where
$$T_\zeta(\mathcal{Z})_{ijt} = \begin{cases} \mathcal{Z}_{ijt} & |\mathcal{Z}_{ijt}| > \zeta \\ 0 & \text{otherwise} \end{cases}$$
is the hard thresholding operator, $\{\zeta_k\}$ is a descending sequence to ensure convergence, and $\zeta_{k+1} = \beta\gamma^k\,\sigma_1(\mathcal{D}-\mathcal{S}_k)$.¹

The complete procedure of APT is presented in Algorithm 2, which can be divided into two phases: initialization and alternating projection. The first phase finds a good initialization for the alternating projection. The same as (Netrapalli et al., 2014; Cai et al., 2019), the initialization phase (Algorithm 1) constructs an initial guess that is close to the ground-truth; our recovery theory gives the values of $\beta_0$ and $\beta_{\text{init}}$ in Algorithm 1. Next, our method conducts alternating projection. At the $k$-th iteration, the first step projects $\mathcal{D}-\mathcal{S}_k$ onto $\mathbb{L}$, i.e., $\mathcal{L}_{k+1} = P_\mathbb{L}(\mathcal{D}-\mathcal{S}_k)$. The second step projects $\mathcal{D}-\mathcal{L}_{k+1}$ onto $\mathbb{S}$, i.e., $\mathcal{S}_{k+1} = T_{\zeta_{k+1}}(\mathcal{D}-\mathcal{L}_{k+1})$. A minimal sketch of these two projection steps and the resulting alternating loop is given below.
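The two projections just described, together with the initialization and alternating loop of Algorithms 1-2 below, can be sketched in a few lines of NumPy. This is our own illustrative sketch, not the authors' MATLAB implementation: the helper names, the decay factor `gamma`, and the default values of `T`, `tol`, and the fallback `beta0 = beta` are assumptions, and the theory in Section 5 prescribes the actual `beta_init`, `beta0`, and `beta`.

```python
import numpy as np

def _to_hat(X, Phi):
    # Tube-fiber transform: Xhat[i, j, c] = sum_t Phi[c, t] * X[i, j, t]
    return np.einsum('ct,ijt->ijc', Phi, X)

def _from_hat(Xhat, Phi):
    # Inverse transform with Phi^H (Phi is unitary); take the real part for real data
    return np.real(np.einsum('ct,ijt->ijc', Phi.conj().T, Xhat))

def project_low_multirank(X, r, Phi):
    """P_L from Eq. (4): slice-wise truncated SVD in the transformed domain,
    keeping rank r[i] on the i-th frontal slice."""
    Xhat = _to_hat(X, Phi).astype(complex)
    out = np.zeros_like(Xhat)
    for i in range(X.shape[2]):
        U, s, Vh = np.linalg.svd(Xhat[:, :, i], full_matrices=False)
        k = int(r[i])
        out[:, :, i] = (U[:, :k] * s[:k]) @ Vh[:k, :]
    return _from_hat(out, Phi)

def hard_threshold(Z, zeta):
    """T_zeta: keep entries whose magnitude exceeds zeta, zero out the rest."""
    return np.where(np.abs(Z) > zeta, Z, 0.0)

def sigma_1(X, Phi):
    """Largest singular value over all transformed frontal slices (tensor spectral norm)."""
    Xhat = _to_hat(X, Phi)
    return max(np.linalg.norm(Xhat[:, :, i], 2) for i in range(X.shape[2]))

def apt(D, r, Phi, beta, beta_init, beta0=None, gamma=0.7, T=100, tol=1e-4):
    """Sketch of APT (Algorithms 1-2): thresholding-based initialization followed by
    alternating projection onto the multi-rank-r set and the sparse set."""
    beta0 = beta if beta0 is None else beta0      # theory prescribes specific beta_init, beta0, beta
    # Algorithm 1: initialization
    S = hard_threshold(D, beta_init * sigma_1(D, Phi))           # S_{-1}
    L = project_low_multirank(D - S, r, Phi)                     # L_0
    S = hard_threshold(D - L, beta0 * sigma_1(D - S, Phi))       # S_0
    # Algorithm 2: alternating projection
    for k in range(T):
        zeta = beta * gamma**k * sigma_1(D - S, Phi)             # zeta_{k+1}, with S = S_k
        L = project_low_multirank(D - S, r, Phi)                 # L_{k+1} = P_L(D - S_k)
        S = hard_threshold(D - L, zeta)                          # S_{k+1} = T_zeta(D - L_{k+1})
        if np.linalg.norm(D - L - S) / np.linalg.norm(D) < tol:  # stopping rule of Section 6.1
            break
    return L, S
```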
Algorithm 1 Initialization 1: S 1 = Tζ 1(D) where ζ 1 = βinit σ1(D) 2: L0 = PL(D S 1) 3: S0 = Tζ0(D L0) where ζ0 = β0 σ1(D S 1) 4: Return: L0 and S0 Algorithm 2 APT: Alternating Projection Algorithm for Tensor RPCA 1: Run Algorithm 1 for initialization 2: for k = 0 to T 1 do 3: Lk+1 = PL (D Sk) 4: Sk+1 = Tζk+1(D Lk+1) where ζk+1 = βγk σ1(D Sk) 5: end for 6: Return: LT and ST In Section 5, we can see that APT converges linearly to the ground-truth under suitable tensor incoherence conditions. However, the computational complexity of APT is O(n2q2+ n3q), which mainly comes from the transformed t-SVD on a full-sized tensor. Compared with previous works, APT converges faster but does not reduce iteration complexity (Table 1). 4.2. Speedup with Efficient Projection For low-rank projection in matrix, SVD on a full-sized matrix can be avoided by projection onto the tangent space of a low-rank matrix (Vandereycken, 2013; Cai et al., 2019; Wei et al., 2016). This idea motivates us to make use of the property of tangent space of low-rank tensors to reduce the computational complexity in low-rank projection of tensors. 4.2.1. EFFICIENT PROJECTION D Sk ONTO L We begin with the projection onto the low multi-rank tensor space. In step 3 of Algorithm 2, Lk+1 is obtained by truncated transformed t-SVD on a full-sized tensor. To reduce the computational complexity of transformed t-SVD, we 1 σ1(X) := max{ ˆS(i, i, c)| ˆS(i, i, c) > 0, i ri} and β = 2µsr/nq where X = U Φ S Φ VH Fast and Provable Nonconvex Tensor RPCA incorporate the property of the tangent of low multi-rank tensor space. We divide this subsection into two parts: efficient projection and selection of tangent space. Efficient projection. The efficient projection consists of two steps: the first step is to project D Sk onto a tangent space of the low multi-rank space, the second step is to project the tensor in the tangent space onto the low multirank space. Next proposition gives the tangent space of multi-rank r space at any point L and how to project to this tangent space. Proposition 4.1 (Proposition 2.3 in Song et al. 2020b). Let L Rn n q be a multi-rank r tensor. The tangent of multi-rank r tensor space at L is T = U Φ X H + Y Φ VH|X, Y Rn r q , where L = U Φ Σ Φ VH is the transformed t-SVD of L. The Euclidean projection of Z Rn n q onto T is PT (Z) =U Φ VH Φ Z + Z Φ V Φ VH U Φ UH Φ Z Φ V Φ VH. (5) Based on this proposition, the projection of PT (Z) onto the low multi-rank space can be done as follows. Proposition 4.2. Projection of PT(Z) onto low multi-rank space L can be executed by PL PT (Z) = U Q1 Φ PL (M) Φ M = UH Φ Z Φ V RH 2 R1 0 Q1 Φ R1 = (IΦ U Φ UH) Φ Z Φ V and Q2 Φ R2 = (IΦ V Φ VH) Φ Z Φ U are t-QR (Theorem B.2). Thus, instead of performing directly transformed t-SVD on PT(D Sk), this proposition allows performing t-SVD on a smaller-sized tensor M. Hence, we have obtained a way for efficiently projecting onto the low multi-rank space. Selection of tangent space. Next, we show which tangent space to be used in Proposition 4.1, so that exact recovery of the whole algorithm can still be ensured. Similar to APT, we will project D Sk to L. So we need to give the tensor where the tangent space is defined. Such a tensor Lk is obtained by Algorithm 3. This algorithm trims Lk to a specific incoherence level µ (Assumption 5.1) with Lk remaining a multi-rank r tensor. In order to get the t-SVD formulation of Lk, we conduct t QR on Ak and Bk, i.e., Ak = Q1 Φ R1 and Bk = Q2 Φ R2. 
Hence, the t-SVD formulation of Lk is Lk = Ak Φ Σ ΦBH k Algorithm 3 Trim for tensor Require: Lk = U Φ Σ Φ VH Rn n q: tensor to be trimmed; µ: target incoherence level. nq , bµ = q nq 2: for i = 1 to n do 3: AH k (:, i, :) = min{1, aµ UH(:,i,:) F }UH(:, i, :) 4: BH k (:, i, :) = min{1, bµ VH(:,i,:) F }VH(:, i, :) 5: end for 6: Return: Lk = Ak Φ Σ Φ BH k where Ak = Ak Φ U, Bk = Bk Φ V with R1 ΦΣ Φ RH 2 = U Φ Σ Φ VH being the transformed t-SVD. So we only need to conduct t-SVD on a tensor with size r r q. According to Proposition 4.1, the projection onto the tangent space defined on Lk is PTk (Z) =Ak Φ AH k Φ Z + Z Φ Bk Φ BH k Ak Φ AH k Φ Z Φ Bk Φ BH k , where Tk is the tangent space at Lk. Finally, we have Lk+1 = PL PTk(D Sk). Because all operations of tensor here can be converted into the operations of block diagonal matrix, we do not need multiple Φ to transform tensor in each operation and the complexity of tensor transformation in low-rank projection counts once. In t-QR decomposition, they have six times multiplications and twice t-QR decompositions, which costs Pq i=1 6n2ri + O(nr2 i ) . The projection onto tangent T has four times as many multiplications and once transformed t-SVD, which costs Pq i=1 n2ri + 9nr2 i + O(r3 i ) . Adding the computational complexity of transformation into the block diagonal formulation, the overall computational complexity is O(n2q2 + n2sr) with sr = Pq i=1 ri, which is less than the computational complexity of the T-TRPCA: O(n2q2 + n3q). 4.2.2. THE COMPLETE ALGORITHM The complete procedure of EAPT is described in Algorithm 4. The difference from APT is that the projection onto L can be efficiently executed. Like APT, EAPT can be divided into two parts: initialization and alternating projection. Here, we use the same initialization method as APT. After initialization, EAPT performs alternating projection. At the k-th iteration, the first step projecting D Sk onto L is presented in Section 4.2.1. The second step follows the idea in Section 4.1 which projects onto S with Sk+1 = Tζk+1 (D Lk+1) and ζk+1 = β σsr+1(X) + γk+1 σ1(X) , (7) where X = PTk(D Sk+1), β = µsr/2nq, σsr+1(X) := max{ ˆS(i, i, c)| ˆS(i, i, c) > 0, i > ri} with X = U Φ S Φ Fast and Provable Nonconvex Tensor RPCA VH being the transformed t-SVD. Algorithm 4 EAPT: Efficient Alternating Projection Algorithm for Tensor RPCA 1: Run Algorithm 1 for initialization 2: for k = 0 to T 1 do 3: Lk = Trim(Lk, µ) 4: Lk+1 = PL PTk (D Sk) 5: Sk+1 = Tζk+1(D Lk+1) where ζk+1 is in (7) 6: end for 7: Return: LT and ST In conclusion, EAPT utilizes the low multi-rank tangent space property to reduce computational complexity. Compared to other methods, EAPT has computational complexity as O(n2q log q + n2sr) when using FFT transformation and O(n2q2 + n2sr) when using common unitary transformation, while keeping the exact recovery guarantee (Theorem 5.5) and linear convergence rate as shown in Table 1. 5. Theoretical Guarantees Following (Zhang & Aeron, 2016), we first have the below tensor incoherence condition: Assumption 5.1. Given the transformed t-SVD of a tensor L = U Φ Σ Φ VH Rn n q with multi-rank r, L is said to satisfy the tensor incoherence condition, if there exists µ > 0 such that Tensor-column: nq sr max i [n] UH Φ ei 2 F µ; Tensor-row: nq sr max j [n] VH Φ ej 2 F µ. Here ei is defined as the n 1 q column basis with Φ[ ei](i, 1, :) = 1. Basically, Assumption 5.1 shows that the tensor columns U(:, i, :) s and V(:, i, :) s need to be uncorrelated with the standard tensor basis. 
The second assumption is the sparse condition for S shown in below. Assumption 5.2. A sparse tensor S Rn n q is α-sparse, i.e., S(:, i, :) 0 αnq and S(i, :, :) 0 αnq for i [n]. This assumption ensures the sparse elements are uniformly distributed across different slices so that S and L will not be low-rank tensors simultaneously. We do not require sparsity on S(:, :, i), i [q] here because such condition cannot guarantee S is not low-rank. Let D = L + S be the observed tensor. Proposition 5.3 guarantees that the initialization obtained by Algorithm 1 is close to the ground-truth. Proposition 5.3 (Algorithm 1 for initialization). Assume that a low multi-rank r tensor L satisfies Assumption 5.1 and a sparse tensor S satisfies Assumption 5.2 with αµ 1 srκ q. For hyperparameters obeying µsr σ1(L ) nq σ1(D) βinit 3µsr σ1(L ) nq σ1(D) and β0 = µsr 2nq, the outputs of Algorithm 1 satisfy L L0 8αµsr σ1(L ) and S S0 µsr Based on above assumptions, linear convergence to the ground-truth of Algorithm 2 is ensured in Theorem 5.4. Theorem 5.4 (Exact recovery of Algorithm 2). Under the assumption of Proposition 5.3, for any ϵ > 0, we have LT L 8αϵ and ST S 4ϵ/nq with T = O (log (1/ϵ)) , β = 2µsr/nq. Different from Theorem 5.4, proof of the recovery guarantee of EAPT is more difficult because its low-rank projection is more complicated than APT. Specifically, we need to construct an inequality that reflects the relationship between Lk and Lk while APT has no such intermediate variable. However, linear convergence to the ground-truth of Algorithm 4 is still ensured as stated in Theorem 5.5. Theorem 5.5 (Exact recovery of Algorithm 4). Under the assumption of Proposition 5.3 except that α min{ 1/µs2 rκ3, q0.5/µ1.5s2 rκ2, q0.5/µ2s2 rκ}, for any ϵ > 0, we have LT L 8αϵ and ST S ϵ/nq with T = O (log (1/ϵ)) , β = µsr/2nq. By choosing appropriate unitary transformation matrix, a transformed tensor will have a lower multi-rank and its prominent information will be concentrated in some specific slices (Zhang & Ng, 2021). For example, by using DCT matrix for HSI denoising, the prominent information will be concentrated in the low-frequency slices. While the goodness of multi-rank truncation cannot be directly seen from these two theorems, empirically it leads to better performance than tubal rank in Section 6.2. Existing works that are most technically similar to ours are Alt Proj (Netrapalli et al., 2014) and Acc Alt Proj (Cai et al., 2019). However, they are designed for matrices. As explained in Zhang & Aeron 2016 and Lu et al. 2019, the extension from matrices to tensors are not trivial as different mathematical tools are required. For example, sparse tensor S is not sprase any more in the frequency domain and some bounds of norms have to use the property of tensortensor product and interconvert between time and frequency domains in proofs. The detailed proofs is presented in Appendix D. Theorems 5.4 and 5.5 indicate that βinit and β depend on unknown parameters about the ground truth tensor S . Similar RPCA methods namely Alt Proj (Netrapalli et al., 2014) and Acc Alt Proj (Cai et al., 2019) also rely on these unknown parameters about their ground truth matrix. How to remove such dependency is still an open issue. Also, unknown µ, sr, κ and σ1(L ) is not a practical issue. 
As we can see in Table 2, good empirical performance can still be obtained by setting βinit and β with inaccurate µ, sr, κ, and σ1(L⋆).

[Figure 1: four panels showing log relative error versus iteration and relative error versus clock time, for the DCT case (APT-DCT, EAPT-DCT, T-TRPCA) and the FFT case (APT-FFT, EAPT-FFT, TRPCA, ETRPCA).]
Figure 1. Comparison between different t-SVD based tensor RPCA methods.

[Figure 2: four panels showing clock time versus n, q, r, and α for APT, EAPT, and T-TRPCA.]
Figure 2. Consistent advantages over T-TRPCA with various parameters. α = 0.6 fails to recover successfully.

6. Experiments

In this section, experiments are conducted on both synthetic and real data. All experiments are conducted on a PC with two Intel Xeon Silver 4215R 3.20GHz CPUs and 187GB memory, using MATLAB R2021b.

6.1. Synthetic Data

We generate a symmetric low-rank tensor $\mathcal{L}^\star = \mathcal{Q} *_\Phi \mathcal{Q}^H$ in $\mathbb{R}^{n\times n\times q}$, where $\mathcal{Q} \in \mathbb{R}^{n\times r\times q}$ is a random tensor with entries sampled i.i.d. from the standard normal distribution. We use the normalized DFT and the DCT as the unitary transformation for the tensor-tensor product because these two transformations have clear physical meanings: noise in the time domain is transformed into high-frequency terms in the frequency domain (Rao et al., 2010). Since the DFT and DCT cannot concentrate information in synthetic data, we do not consider multi-rank in this subsection. Whether an element of the sparse tensor $\mathcal{S}^\star$ is zero is determined by a Bernoulli distribution with parameter α. The values of the nonzero entries $\mathcal{S}^\star_{ijt}$ are drawn uniformly from the interval $[-c\,\mathbb{E}[|\mathcal{L}^\star|_{ijt}],\, c\,\mathbb{E}[|\mathcal{L}^\star|_{ijt}]]$ for some constant c > 0; we set c = 3 in our synthetic experiments (a sketch of this generation procedure is given below). The observed tensor is $\mathcal{D} = \mathcal{L}^\star + \mathcal{S}^\star$. The stopping criterion of all methods is $\|\mathcal{D}-\mathcal{L}_k-\mathcal{S}_k\|_F / \|\mathcal{D}\|_F \le 10^{-4}$. The relative error (RelErr for short) is defined as $\|\mathcal{L}_k-\mathcal{L}^\star\|_F / \|\mathcal{L}^\star\|_F$.

The convergence and efficiency of our methods are shown in Figure 1. We conduct experiments on the tensor-tensor product with both FFT and DCT, and compare with other methods based on the tensor-tensor product, namely TRPCA (Lu et al., 2019), T-TRPCA (Lu, 2021), and ETRPCA (Gao et al., 2020). We set n = 500, q = 5, r = 5, α = 0.3 in Figure 1. The hyperparameters of our methods are set as suggested by Proposition 5.3 and Theorems 5.4 and 5.5; the hyperparameters of the other methods are listed in Appendix C.1. The experimental results in Figure 1 show that both APT and EAPT converge linearly. Also, EAPT saves a large amount of time compared to the others.

Then, we conduct experiments to show the performance of our methods with various parameters. Figure 2 shows the running time of different methods with different parameters. Parameters are varied around n = 500, q = 5, r = 5, α = 0.3; that is, each panel changes one parameter and leaves the others unchanged. The experimental results show that our methods have consistent advantages over T-TRPCA with various parameters.
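For reference, the data-generation procedure just described can be sketched as follows. This is our illustration, not the authors' script: the orthogonal DCT-II construction and the use of the empirical mean of |L⋆| as a stand-in for E[|L⋆|_ijt] are assumptions.

```python
import numpy as np

def make_synthetic(n=500, q=5, r=5, alpha=0.3, c=3.0, use_dct=False, seed=0):
    """Generate D = L* + S* as in Section 6.1: L* = Q *_Phi Q^H with Q ~ N(0,1) entries,
    S* with Bernoulli(alpha) support and uniform values scaled by c * E[|L*|]."""
    rng = np.random.default_rng(seed)
    if use_dct:
        # Orthogonal DCT-II matrix (an assumption for the "DCT" transform in the paper)
        k, t = np.meshgrid(np.arange(q), np.arange(q), indexing='ij')
        Phi = np.sqrt(2.0 / q) * np.cos(np.pi * (2 * t + 1) * k / (2 * q))
        Phi[0, :] /= np.sqrt(2.0)
    else:
        Phi = np.fft.fft(np.eye(q)) / np.sqrt(q)            # normalized DFT
    Q = rng.standard_normal((n, r, q))
    Qhat = np.einsum('ct,ijt->ijc', Phi, Q)                  # transform tube fibers
    Lhat = np.einsum('ikc,jkc->ijc', Qhat, Qhat.conj())      # slice-wise Q_hat @ Q_hat^H
    L = np.real(np.einsum('ct,ijt->ijc', Phi.conj().T, Lhat))
    scale = c * np.mean(np.abs(L))                           # proxy for c * E[|L*|_ijt]
    support = rng.random((n, n, q)) < alpha                  # Bernoulli(alpha) support
    S = np.where(support, rng.uniform(-scale, scale, size=(n, n, q)), 0.0)
    return L + S, L, S                                       # observation D, ground truth L*, S*

def relative_error(L_hat, L_star):
    """RelErr = ||L_hat - L*||_F / ||L*||_F."""
    return np.linalg.norm(L_hat - L_star) / np.linalg.norm(L_star)
```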
We also conduct experiments to show the robustness of our methods in Appendix C.1.2.

6.2. Real Data

In this subsection, we compare the performance of TRPCA (Lu et al., 2019), T-TRPCA (Lu, 2021), ETRPCA (Gao et al., 2020), TTNN (Yang et al., 2020), SNN (Huang et al., 2015), Atomic Norm (Driggs et al., 2019), and RTD (Anandkumar et al., 2016) on real datasets. The differences between these methods are presented in Table 3.

6.2.1. HSI Denoising

Here, we compare the performance of different methods on hyperspectral image denoising. We use the CAVE dataset (Yasuma et al., 2008) for the experiments. For a hyperspectral image $\mathcal{H} \in \mathbb{R}^{m\times n\times q}$, $\alpha mnq$ pixels of $\mathcal{H}$ are randomly selected as noise by a Bernoulli distribution, and the values of these $\alpha mnq$ pixels are sampled from a uniform distribution on the interval [0, 1] (see the sketch below). Since EAPT has comparable performance with APT but is more efficient, we only involve EAPT in the comparison. For detailed hyperparameter settings, please refer to Appendix C.2. We randomly select 6 hyperspectral images from the CAVE dataset for comparison. The performance, evaluated by final PSNR and clock time, is presented in Table 2. EAPT-FFT and EAPT-DCT in the table mean that the tensor-tensor product with respect to a normalized DFT matrix and a DCT matrix, respectively, is used in the experiments.

Table 2. Performance and clock time in seconds of HSI denoising. Each cell reports PSNR / clock time (s). The best result is in bold and the second best is underlined.

| | TTNN (TT) | SNN (Tucker) | Atomic Norm (CP) | RTD (CP) | TRPCA (t-SVD) | T-TRPCA (t-SVD) | ETRPCA (t-SVD) | EAPT-FFT (t-SVD) | EAPT-DCT (t-SVD) |
| toys | 29.39 / 85.4 | 28.47 / 297.3 | 16.96 / 223.1 | 24.18 / 368.6 | 34.04 / 63.6 | 34.09 / 75.5 | 38.47 / 121.7 | 39.95 / 36.5 | 41.87 / 35.8 |
| feathers | 29.00 / 84.5 | 28.00 / 297.8 | 17.89 / 331.4 | 24.42 / 486.8 | 31.62 / 59.9 | 31.36 / 70.7 | 36.72 / 120.8 | 38.01 / 33.4 | 39.61 / 32.3 |
| sponges | 36.90 / 83.3 | 37.38 / 305.1 | 19.48 / 337.4 | 28.24 / 249.8 | 31.52 / 59.9 | 30.28 / 70.8 | 34.81 / 122.2 | 38.88 / 33.6 | 40.05 / 22.5 |
| watercolors | 28.74 / 85.9 | 28.31 / 284.7 | 18.33 / 377.2 | 23.49 / 353.1 | 36.28 / 61.6 | 36.30 / 71.3 | 40.59 / 121.6 | 41.43 / 34.2 | 41.66 / 35.9 |
| paints | 30.35 / 83.1 | 30.33 / 291.5 | 18.98 / 336.0 | 25.16 / 457.5 | 33.83 / 61.1 | 33.72 / 71.5 | 38.15 / 123.7 | 39.45 / 34.3 | 39.53 / 35.6 |
| sushi | 31.60 / 84.0 | 31.57 / 312.2 | 17.42 / 320.9 | 29.96 / 492.3 | 33.40 / 62.5 | 33.50 / 72.5 | 35.67 / 120.3 | 36.03 / 36.3 | 39.30 / 32.3 |

[Figure 3: visual background-subtraction results of each method on the three videos, annotated with the per-method reconstruction clock times.]
Figure 3. Video background subtraction results of different methods and their corresponding reconstruction clock time. (a) Escalator with 200 frames; (b) Hall with 100 frames; (c) Shopping Mall with 50 frames.

In Table 2, methods based on t-SVD have better performance than those based on other decompositions because t-SVD can utilize information in both the time and frequency (transformed) domains. Also, the performance of our methods is better than that of the other methods, while ours save a large amount of time compared to the others. Besides, EAPT-DCT has better performance than EAPT-FFT. The reason is probably the spectral compaction property of the DCT, i.e., a transformed tensor in the frequency domain tends to have its values concentrated in a small number of slices compared to other transformations like the FFT. Multi-rank truncation can better exploit this property by setting a large rank for the low-frequency slices and a small rank for the high-frequency slices.
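A minimal sketch of the corruption model and the PSNR metric used above follows. This is our illustration; pixel values are assumed to be rescaled to [0, 1], so the PSNR peak is taken as 1.

```python
import numpy as np

def corrupt_hsi(H, alpha, seed=0):
    """Replace roughly alpha*m*n*q pixels (Bernoulli(alpha) support) of a hyperspectral
    image H with Uniform[0, 1] noise, as in the CAVE experiments above."""
    rng = np.random.default_rng(seed)
    mask = rng.random(H.shape) < alpha
    noisy = H.copy()
    noisy[mask] = rng.uniform(0.0, 1.0, size=int(mask.sum()))
    return noisy, mask

def psnr(clean, recovered, peak=1.0):
    """PSNR (dB) between the clean image and the recovered low-rank part."""
    mse = np.mean((clean - recovered) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```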
Also, we want to clarify that the improvement in terms of the time cost here is consistent with our theoretical result. Because the hyperspectral images are reshaped into a different size (see Figure 4) and the number of the bands of CAVE is not too larger than the rank of the tensor so that the improvement of our method on HSI denoising becomes less significant. 6.2.2. VIDEO BACKGROUND SUBTRACTION In this section, we compare the performance of various methods on video background subtraction. The task is to separate the moving foreground objects from a static background. Three videos are used for comparison, i.e., Escalator, Hall, and Shopping Mall. We choose 200, 100 and 50 frames from Escalator, Hall and Shopping Mall respectively for comparison. The visual comparison in Figure 3 shows that our methods can successfully extract the background from these videos with less reconstruction time. 7. Conclusion In this paper, we study nonconvex tensor robust PCA based on transformed t-SVD and propose two algorithms, i.e., APT and EAPT, based on alternating projection. Our methods converge linearly and guarantee exact recovery. Also, EAPT utilizes the property of tangent space to reduce the computational complexity. Experimental results show that our proposed methods are more effective and efficient than existing methods. As a future work, it will be interesting to extend our work to higher order tensors and consider the noise in theoretical analysis. Acknowledgements This research was supported in part by National Key R&D Program of China (2020YFA0713900), the China NSFC projects under contract 11971374, 61721002, U1811461, and the Macao Science and Technology Development Fund under Grant 061/2020/A2. Fast and Provable Nonconvex Tensor RPCA Anandkumar, A., Jain, P., Shi, Y., and Niranjan, U. N. Tensor vs. matrix methods: Robust tensor decomposition under block sparse perturbations. In Artificial Intelligence and Statistics, pp. 268 276. PMLR, 2016. Cai, H., Cai, J.-F., and Wei, K. Accelerated alternating projections for robust principal component analysis. Journal of Machine Learning Research, 20(1):685 717, 2019. Cai, H., Chao, Z., Huang, L., and Needell, D. Fast robust tensor principal component analysis via fiber cur decomposition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 189 197, 2021. Cand es, E. J. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717 772, 2009. Cand es, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? Journal of the ACM (JACM), 58(3): 1 37, 2011. Chang, R. W. Synthesis of band-limited orthogonal signals for multichannel data transmission. Bell System Technical Journal, 45(10):1775 1796, 1966. Cichocki, A., Mandic, D., De Lathauwer, L., Zhou, G., Zhao, Q., Caiafa, C., and Phan, H. A. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Processing Magazine, 32(2):145 163, 2015. Driggs, D., Becker, S., and Boyd-Graber, J. Tensor robust principal component analysis: Better recovery with atomic norm regularization. ar Xiv preprint ar Xiv:1901.10991, 2019. Gabay, D. and Mercier, B. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17 40, 1976. Gandy, S., Recht, B., and Yamada, I. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2):025010, 2011. 
Gao, Q., Zhang, P., Xia, W., Xie, D., Gao, X., and Tao, D. Enhanced tensor rpca and its application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43 (6):2133 2140, 2020. Hao, N., Kilmer, M. E., Braman, K., and Hoover, R. C. Facial recognition using tensor-tensor decompositions. SIAM Journal on Imaging Sciences, 6(1):437 463, 2013. Huang, B., Mu, C., Goldfarb, D., and Wright, J. Provable models for robust low-rank tensor completion. Pacific Journal of Optimization, 11(2):339 364, 2015. Kernfeld, E., Kilmer, M., and Aeron, S. Tensor tensor products with invertible linear transforms. Linear Algebra and its Applications, 485:545 570, 2015. Kilmer, M. E. and Martin, C. D. Factorization strategies for third-order tensors. Linear Algebra and its Applications, 435(3):641 658, 2011. Kilmer, M. E., Horesh, L., Avron, H., and Newman, E. Tensor-tensor algebra for optimal representation and compression of multiway data. Proceedings of the National Academy of Sciences, 118(28), 2021. Kolda, T. G. and Bader, B. W. Tensor decompositions and applications. SIAM review, 51(3):455 500, 2009. Liu, J., Musialski, P., Wonka, P., and Ye, J. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208 220, 2012. Lu, C. Transforms based tensor robust pca: Corrupted low-rank tensors recovery via convex optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1145 1152, 2021. Lu, C., Feng, J., Chen, Y., Liu, W., Lin, Z., and Yan, S. Tensor robust principal component analysis: Exact recovery of corrupted low-rank tensors via convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5249 5257, 2016. Lu, C., Feng, J., Chen, Y., Liu, W., Lin, Z., and Yan, S. Tensor robust principal component analysis with a new tensor nuclear norm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4):925 938, 2019. Netrapalli, P., Niranjan, U., Sanghavi, S., Anandkumar, A., and Jain, P. Provable non-convex robust pca. In Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 1, pp. 1107 1115, 2014. Oseledets, I. V. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295 2317, 2011. Rao, K. R., Kim, D. N., and Hwang, J. J. Fast fourier transform: algorithms and applications. 2010. Song, G., Ng, M. K., and Zhang, X. Robust tensor completion using transformed tensor singular value decomposition. Numerical Linear Algebra with Applications, 27(3): e2299, 2020a. Song, G.-J., Wang, X.-Z., and Ng, M. K. Riemannian conjugate gradient descent method for third-order tensor completion. ar Xiv preprint ar Xiv:2011.11417, 2020b. Fast and Provable Nonconvex Tensor RPCA Tomioka, R., Hayashi, K., and Kashima, H. Estimation of low-rank tensors via convex optimization. ar Xiv preprint ar Xiv:1010.0789, 2010. Vandereycken, B. Low-rank matrix completion by riemannian optimization. SIAM Journal on Optimization, 23(2): 1214 1236, 2013. Wei, K., Cai, J.-F., Chan, T. F., and Leung, S. Guarantees of riemannian optimization for low rank matrix recovery. SIAM Journal on Matrix Analysis and Applications, 37 (3):1198 1222, 2016. Wright, J., Ganesh, A., Rao, S. R., Peng, Y., and Ma, Y. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In NIPS, volume 58, pp. 289 298, 2009. Xu, H., Caramanis, C., and Sanghavi, S. Robust pca via outlier pursuit. 
ar Xiv preprint ar Xiv:1010.4237, 2010. Yang, J.-H., Zhao, X.-L., Ji, T.-Y., Ma, T.-H., and Huang, T.-Z. Low-rank tensor train for tensor robust principal component analysis. Applied Mathematics and Computation, 367:124783, 2020. Yasuma, F., Mitsunaga, T., Iso, D., and Nayar, S. Generalized Assorted Pixel Camera: Post-Capture Control of Resolution, Dynamic Range and Spectrum. Technical report, Nov 2008. Yi, X., Park, D., Chen, Y., and Caramanis, C. Fast algorithms for robust pca via gradient descent. In Advances in Neural Information Processing Systems, pp. 4152 4160, 2016. Zhang, X. and Ng, M. K.-P. Low rank tensor completion with poisson observations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. Zhang, Z. and Aeron, S. Exact tensor completion using t-svd. IEEE Transactions on Signal Processing, 65(6): 1511 1526, 2016. Zhang, Z., Ely, G., Aeron, S., Hao, N., and Kilmer, M. Novel methods for multilinear data completion and denoising based on tensor-svd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3842 3849, 2014. Fast and Provable Nonconvex Tensor RPCA A. Different Tensor RPCA Methods Here, we present the comparison of tensor RPCA methods based on different tensor decomposition in Table 3. Table 3. Comparison of various tensor RPCA methods. : exist; : doesn t exist; AP : alternating projection; ALM : augmented Lagrange Multiplier; T t-SVD : Transformed t-SVD. The computational complexities are calculated on tensors in Rn n n. decomposition Convergence Iteration complexity Algorithm Recovery TRPCA (Lu et al., 2019) t-SVD O(1/ϵ) O n3 log n + n4 ADMM Transformed TRPCA (Lu, 2021) T t-SVD O(1/ϵ) O n4 ADMM ETRPCA (Gao et al., 2020) t-SVD O n3 log n + n4 ALM RTCUR (Cai et al., 2021) Fiber CUR O nr2 log2(n) + r4 log4(n) AP TTNN (Yang et al., 2020) TT O n4 ADMM SNN (Huang et al., 2015) Tucker O(1/ϵ) O n4 ADMM Atomic Norm (Driggs et al., 2019) CP O(n4r) LBFGS RTD (Anandkumar et al., 2016) CP O (log(1/ϵ)) O n4+cr2 AP APT T t-SVD O (log(1/ϵ)) O(n4) AP EAPT T t-SVD O (log(1/ϵ)) O(n4 + n2sr) AP B. Extension of Preliminaries With the tensor-tensor product and the block-diagonal matrix representation, the following relation is obvious for any tensor X Cn1 n2 q and Y Cn2 n3 q: X Φ Y = C X Y = C. Next, we recall the definitions of the identity tensor, unitary tensor, diagonal tensor, and conjugate transpose of a tensor. The identity tensor I Cn n q respect to Φ is denoted as I = ΦH[IΦ], where each frontal slice of IΦ Cn n q is the n n identity matrix. Based on the identity tensor, the unitary tensor with respect to Φ is defined as U Φ UH = UH Φ U = I. A tensor with each frontal slice being diagonal is called to be diagonal. The conjugate transpose of any tensor X Cm n q with respect to a unitary matrix Φ Cq q, denoted by X H Cn m q, is defined as X H = ΦH fold X H . Now we define the basis for tensor-tensor product. ei is defined as the m 1 q column basis with Φ[ ei](i, 1, :) = 1 and ej is defined as the n 1 q column basis with Φ[ ej](j, 1, :) = 1. Another tensor basis is called tubal basis ek of size 1 1 q with ei(1, 1, k) = 1 and the rest entries equal 0. Tensor Eijk with its (i, j, k)-th element equal 1 and others equal 0 can be denoted by Eijk = ei Φ ek Φ e H j . Next theorem gives the best multi-rank r approximation of X: Theorem B.1 (Eckart-Young theorem, Theorem 11 in Kilmer et al. 2021). If X = U Φ S Φ VH Cm n q is the transformed t-SVD of X. 
Define Hr(X) to be the approximation having multi-rank r, that is, ˆ XΦ(:, :, i) = ˆUΦ(:, 1 : ri) ˆSΦ(1 : ri, 1 : ri, i) ˆV H Φ (:, 1 : ri, i). Then, Hr(X) is the best multi-rank r approximation to X in the Frobenius norm and Hr = arg minrankm( X) r X X F . The following theorem is transformed t-QR decomposition of tensors. Theorem B.2 (transformed t-QR). Let X Cm n q. Then it can be factorized as X = Q Φ R, where Q Cm m q is a orthogonal tensor, and R Cm n q is an f-upper triangle tensor (each frontal slice is an upper triangular matrix). The tensor spectral norm X with respect to a unitary matrix Φ is defined as X = X . With block-diagonal representation, the inner product of two tensors X, Y Cm n q can also be calculated with the following property X, Y = X, Y . With above equality, the Frobenius of a tensor X Cm n q can be represented as X F = X F . Fast and Provable Nonconvex Tensor RPCA Some important singular values of tensor X are denoted as σ1(X) := max{ ˆS(i, i, c)| ˆS(i, i, c) > 0, i ri}, σsr(X) := min{ ˆSΦ(i, i, c)| ˆSΦ(i, i, c) > 0, i ri}, σsr+1(X) := max{ ˆS(i, i, c)| ˆS(i, i, c) > 0, i > ri}, σrq(X) := min{ ˆSΦ(i, i, c)| ˆSΦ(i, i, c) > 0, i r}, and σrq+1(X) := max{ ˆS(i, i, c)| ˆS(i, i, c) > 0, i > r} where X = U Φ S Φ VH, ri is the i-th element of multi-rank r, and r = max{ri} is the tubal rank. A partitioned tensor is a tensor that is interpreted as having been broken into sections called subtensors or blocks. A partitioned tensor A with q row partitions and s column partitions with subtensors Aij can be denoted by its i-th frontal slice as A(i) 11 A(i) 12 A(i) 1s A(i) 21 A(i) 22 A(i) 2s ... ... ... ... A(i) q1 A(i) q2 Ai qs where A(i) ij is the i-th frontal slice of Aij. For convenient, we also represent A as A11 A12 A1s A21 A22 A2s ... ... ... ... Aq1 Aq2 Aqs Our analysis are based on symmetric tensors, but similar results can be obtained for asymmetric matrix by casting the general tensor problem to symmetric augmented tensors. Without loss of generality, assume m n and dm n < (d + 1)m for some d 1 and construct symmetric tensors ˆL and ˆS as 0 0 ... ... ... 0 0 LH LH | {z } d times d times and ˆS := 0 0 ... ... ... 0 0 SH SH | {z } d times It can be easily verified that ˆL is O(µ)-incoherence, and ˆS is O(α)-sparse where O(µ) (resp. O(α)) means to hide the constant in front of µ (resp. α). Then our methods can be applied to asymmetric tensors with the same guarantee. C. Extension of Experiments C.1. Synthetic Data C.1.1. HYPERPARAMETER SETTINGS We set β = 2µr n , βinit = 2µr σ1(L ) n σ1(D) for APT and set β = µr 2n, βinit = 2µr σ1(L ) n σ1(D) for EAPT. Transformed TRPCA and TRPCA use default hyperparameter λ = 1/ n and λ = 1/ nq respectively according to their theory. The weight of ETRPCA is set as [1, , 1 | {z } 167 times , 1.1, , 1.1 | {z } 167 times , 1.5, , 1.5 | {z } 166 times ]. C.1.2. ROBUSTNESS We also conduct experiments to show the robustness of our methods under Gaussian noisy observation with different derivation σ in Table 4, i.e., the noisy observation Dn = D + E where Eijt N(0, σ2). We set n = 500, q = 5, r = 5, α = 0.3 in Table 4. As can be seen, our methods have smaller relative error and are more efficient than T-TRPCA. C.2. HSI Denoising Given a hyperspectral image H Rm n q, we permute the image into ˆH Rq m n as shown in Figure 4 since the mode-1 unfolding matrix of ˆH has best low-rank property based on tensor-tensor product. So we apply this permutation to all methods based on tensor-tensor product. Fast and Provable Nonconvex Tensor RPCA Table 4. 
Relative error and running time of different methods under Gaussian noise with zero mean and different standard deviation. σ = 0.5 σ = 1 σ = 1.5 Rel Err Time Rel Err Time Rel Err Time T-TRPCA 0.1864 21.87 0.3413 21.54 0.4761 20.77 APT-DCT 0.0476 6.94 0.0987 7.11 0.1483 6.45 EAPT-DCT 0.0476 1.76 0.0987 1.66 0.1480 1.80 C.2.1. HYPERPARAMETER SETTINGS We set λ = 1/ nq of TRPCA and λ = 1/ n for Transformed TRPCA. For ETRPCA, we set set p = 0.9, w = [1, , 1 | {z } 5 times , 1.1, , 1.1 | {z } 5 times , 1.5, , 1.5 | {z } 21 times ]. We do not apply ket augmentation scheme to TTNN because the shape of our tensor is 512 512 31 and set hyperparameters as λ = 0.01, f = 0.01, γ = 0.001 and δ = 0.001. The parameter of SNN is set to [40, 40, 40] because we find it performs well in most cases. λ and µ are set to be 0.5 and 0.1 in Atomic Norm. The rank of RTD is set to be 20 in this experiment. Here we empirically set the hyperparameters of EAPT as β = 2µr m+n, βinit = 4β, where we set µ = 10. The multi-rank of our methods is set to be a descending sequence. For EAPT-FFT, the descending sequence of multi-rank is set to be [4, , 4 | {z } 50 times , 2, , 2 | {z } 50 times , 0, ]. For EAPT-DCT, the descending sequence of multi-rank is set to be [4, , 4 | {z } 100 times , 1, , 1 | {z } 200 times , 0, ] and set to be [5, , 5 | {z } 100 times , 1, , 1 | {z } 200 times , 0, ] for sponges data. Figure 4. Illustration of the reshape scheme for hyperspectral image. The rank of the frontal slice for the right tensor is much lower than that of the frontal slice for the left tensor. According to Lemma D.2, t-SVD can better model the structure in the right tensor. D.1. Proof of Proposition 4.2 Here, we give the proof of Proposition 4.2. PT (Z) = U Φ UH Φ Z + Z Φ V Φ VH U Φ UH Φ Z Φ V Φ VH =U Φ UH Φ Z Φ (I V Φ VH) + (I U Φ UH) Φ Z Φ V Φ VH + U Φ UH Φ Z Φ V Φ VH =U Φ RH 2 Φ QH 2 + Q1 Φ R1 Φ VH + U Φ UH Φ Z Φ V Φ VH UH Φ Z Φ B RH 2 R1 0 := U Q1 Φ M Φ Then we have PL PT (Z) = U Q1 Φ PL (M) Φ Fast and Provable Nonconvex Tensor RPCA Table 5. Sections and its corresponding lemmas / theorems / functions. Section Function Lemma & Theorem Basic lemmas rank equivalence Lemma D.2 Weyl s inequality Lemma D.3 tensor sparse Lemma D.4 initialization Lemma D.5 and Theorem D.6 singular value bound Lemma D.21 bound L Lk Lemma D.22 bound S Sk Lemma D.23 convergence Theorem D.24 projection & Trim Lemma D.7 - D.13 Some bounds Lemma D.14 - D.16 bound L Lk Lemma D.17 bound L Lk Lemma D.18 bound S Sk Lemma D.19 convergence Theorem D.20 D.2. Proof of Recovery Guarantees The proof outline of our two methods follows as: In the t-th iteration, if [S]ijk = 0, then [St]ijk = 0. If [S]ijk = 0, then Tζt (D Lt) = Tζt (L Lt). So we need ζt L Lt . [S St]ijk is bounded by L Lt and ζt. Positive feedback in optimization process. In (t 1)-th iteration, the hard thresholding parameter ζt 1 can select St 1 which includes all the zero elements in S. In the t-th iteration, the chosen St 1 can make sure that L Lt is momently decreasing w.r.t. t. Because L Lt is momently decreasing w.r.t. t, we can select smaller ζt. Because [S St]ijk is bounded by L Lt and ζt, we have smaller [S St]ijk. We also present what the following theorems and lemmas do in the Table 5. First, we give some notations which will be used in the proof. The optimal solution L and S are denoted here as L and S for simplicity. The (i, j)-th element of a matrix S is denote as Si,j. D(c) is shorthand for D(:, :, c). 
σX i,c denotes S(i, i, c) where X = U Φ S Φ VH is transformed t-SVD of X and σi,c := σL i,c. We next defined the extension of eigenvalue in matrix to tensors as: Definition D.1. Eigen fiber can been seen as an extension of eigenvalue in tensors. The definition of eigen fiber λ is: D Φ u = u Φ λ where D Rn n q, u Rn 1 q, and λ R1 1 q. D.2.1. BASIC LEMMAS Lemma D.2 (equivalence of block diagonal matrix and mode-1 unfolding matrix / rank equivalence). Let X Rm n q, the transformed t-product between two tensors is defined with unitary matrix Φ. Then ˆ XΦ(:, :, t) = X(:, 1, :) X(:, 2, :) X(:, n, :) Φ(t, :)H 0 0 0 0 Φ(t, :)H 0 0 ... ... ... ... 0 0 0 Φ(t, :)H := X (1) Φt. (8) Fast and Provable Nonconvex Tensor RPCA Proof. Assume X Rm n q with ˆ XΦ(j, k, t) = c=1 X(j, k, c)Φ(t, c) = X(j, k, :)Φ(t, :)H. Then, we have ˆ XΦ(:, :, t) =X (1) Φt = X(:, 1, :) X(:, 2, :) X(:, n, :) Φ(t, :)H 0 0 0 0 Φ(t, :)H 0 0 ... ... ... ... 0 0 0 Φ(t, :)H where X (1) Rm nq and Φt Cnq n. Lemma D.3 (Tensor Weyl s inequality). Let A, B, C Rn n q be the symmetric tensors such that A = B + C. Then, the inequality | σA i σB i | C holds for all i, where σA i and σB i represent the ith singular values of A and B respectively. Proof. Because A, B, C are symmetric tensors, we know AΦ, BΦ, CΦ are symmetric matrices such that AΦ = BΦ + CΦ. Following Weyl s inequality, we have | σA i σB i | CΦ = C . Lemma D.4. Let S Rn n q satisfy assumption 1. Then, S αnq S . Proof. Assume two unit vector x Rn, y Rnq, S = S = max c=1, ,q ˆSΦ(:, :, c) = max c=1, ,q S(1) Φc = S(1) = xi S(1) ijcy(j 1)q+c (x2 i + y2 (j 1)q+c S(1) ijc 1 2 (αnq + αnq) S = αnq S , where the last inequality follows from Assumption 5.2. D.2.2. INITIALIZATION Lemma D.5. Let S Rn n q be a sparse tensor satisfying Assumption 5.2. Let U Rn r q be an orthogonal tensor with µ-incoherence, i.e., e H i U F q nq for all i. Then e H i Sa U F q nq (αnq q S )a for all i and a 0. Proof. This proof is done by mathematical induction. Base case: When a = 0, e H i Φ U F q nq is satisfied following from the assumption. Induction Hypothesis: e H i Φ Sa Φ U F maxl q nq (αnq q e H l Φ S F )a for all i at the ath power. Fast and Provable Nonconvex Tensor RPCA Induction Step: We have e H i Φ Sa+1 Φ U 2 F = e H i Φ S Φ Sa Φ U 2 F = e H i S Sa U 2 F = k=1 [ S]cn+i,cn+k[ Sa U]cn+k,cr+j k1,k2 [ S]cn+i,cn+k1[ S]cn+i,cn+k2 j=1 [ Sa U]cn+k1,cr+j[ Sa U]cn+k2,cr+j k1,k2 [ S]cn+i,cn+k1[ S]cn+i,cn+k2 e H cn+k1 Sa U(:, cr + 1 : cr + r), e H cn+k2 Sa U(cr + 1 : cr + r) [ S]cn+i,cn+k1[ S]cn+i,cn+k2 e H cn+k1 Sa U(:, cr + 1 : cr + r) 2 e H cn+k2 Sa U(:, cr + 1 : cr + r) 2 [ S]cn+i,cn+k1[ S]cn+i,cn+k2 e H cn+k1 Sa U 2 e H cn+k2 Sa U 2 e H l Φ Sa Φ U 2 F [ S]cn+i,cn+k1[ S]cn+i,cn+k2 = e H l Φ Sa Φ U 2 F e H cn+i S(:, cn + 1 : cn + n) 1 2 = e H l Φ Sa Φ U 2 F S(1)(i, :) Φc+1 1 = e H l Φ Sa Φ U 2 F t=1 S(i, j, t)Φ(c + 1, t) e H l Φ Sa Φ U 2 F t=1 |S(i, j, t)| e H l Φ Sa Φ U 2 F c=0 (αnq S )2 e H l Φ Sa Φ U 2 F (αnq q S )2 µsr nq (αnq q S )2(a+1) . Now, we have UH Φ Sa+1 Φ ei F maxl q nq αnq q S a+1. In the proof, we use the inequality e T cn+k1 Sa U 2 e T cn+k2 Sa U 2 e T k1 Sa U F e T k2 Sa U F = e T k1 Φ Sa Φ U F e T k2 Φ Sa Φ U F µsr nq (αnq q S )2a . Theorem D.6. With the condition of Proposition 5.3, let L Rn n q and S Rn n q be two symmetric tensors satisfying Assumptions 5.1 and 5.2. If the thresholding parameters obey µsr σL 1 nq σD 1 βinit 3µsr σL 1 nq σD 1 and β = µsr 2nq, then the outputs of Algorithm 1 satisfy L L0 8αµsr σ1 nq σ1, and supp(S0) Ω. Proof. The proof can be partitioned into several parts. 
(i) Note that L 1 = 0 and L L 1 = L = max i,j,k U Φ Σ Φ UH, Eijk = max i,j,k e H i U Σ UH ej, ek max i,j,k e H i U Σ UH ej F ek F = max i,j,k e H i U F Σ UH ej F max i,j,k e H i Φ U F Σ UH Φ ej F µsr where the last inequality follows from Assumption 5.1, i.e., L is µ-incoherence. Thus, with the choice of βinit µsr σ1 nq σD 1 , we have L L 1 βinit σD 1 = ζ 1. Since [S 1]ijk = [Tζ 1(S + L L 1)]ijk = Tζ 1([S + L L 1]ijk) (i, j, k) Ω Tζ 1([L L 1]ijk) (i, j, k) Ωc, Fast and Provable Nonconvex Tensor RPCA it follows that [S 1]ijk = 0 for all (i, j, k) Ωc, i.e., Ω 1 := supp(S 1) Ω. Moreover, for any entries of S S 1, we have [S S 1]ijk = 0 [L 1 L]ijk [S]ijk 0 L L 1 L L 1 + ζ 1 0 (i, j, k) Ωc nq σ1 (i, j, k) Ω 1 4µsr nq σ1 (i, j, k) Ω\Ω 1 , where the last inequality follows from βinit 3µsr σ1 nq σD 1 , so that ζ 1 3µsr nq σ1. Therefore, we get supp(S 1) Ω and S S 1 4µsr By Lemma D.4, we also have S S 1 2 αnq S S 1 4αµsr σ1. (ii) To bound the approximation error of L0 to L in terms of the spectral norm, note that L L0 L (D S 1) + (D S 1) L0 2 L (D S 1) = 2 L (L + S S 1) = 2 S S 1 , where the second inequality follows from the fact L0 = Hr(D S 1) is the best multi-rank r approximation of D S 1. It follows immediately that L L0 8αµsr σ1. (iii) Since D = L + S, we have D S 1 = L + S S 1. Let λi denotes the (i, c)th eigenvalue of D S 1 ordered by | λ1,c| | λ2,c| | λn,c|. The application of Tensor Weyl s inequality together with the bound of α implies that | σi,c | λi,c|| S S 1 σsr holds for all i. Consequently, we have 7 8 σi,c | λi,c| 9 8 σi,c, 1 i rq, S S 1 Let D S 1 = h U0, U0 i Φ Φ h U0, U0 i H = U0 Φ Λ Φ UH 0 + U0 Φ Λ Φ UH 0 be its eigenvalue decomposition, where Λ has the multi-rank r largest eigenvalues in magnitude and Λ contains the rest eigenvalues. Also, U0 contains the first r eigenvectors, and U0 has the rest. Notice L0 = Hr(D S 1) = U0 Φ Λ Φ UH 0 due to the symmetric setting. Denote Z = D S 1 L = S S 1. Let ui = U(:, i, :) be the ith eigenvector of D S 1 = L + Z and λs = Λ(s, s, :) be the sth eigen fiber of D S 1 = L + Z. Then, we have (D S 1) Φ ui = ui Φ λi. We denote D S 1 as M, then M ui = ui λi M(1) 0 0 0 M(2) 0 0 0 0 0 0 M(q) u(1) i 0 0 0 u(2) i 0 0 0 0 0 0 u(q) i u(1) i 0 0 0 u(2) i 0 0 0 0 0 0 u(q) i λ(1) i 0 0 0 λ(2) i 0 0 0 0 0 0 λ(q) i This means M(c) u(c) i = λ(c) i u(c) i . Because PTk(D Sk)(c) = Z(c) + L(c), we get ( λ(c) i I Z(c)) u(c) i = L(c) u(c) i . Then, we have λ(c) i u(c) i = λ(c) i u(c) i . Combining all q slices, we have ui = I + Ei Z + ( Ei Z)2 + Ei L ui where 1 λ(1) i I 0 0 0 1 λ(2) i I 0 0 0 0 0 0 1 λ(q) i I Z(1) 0 0 0 Z(2) 0 0 0 0 0 0 Z(q) Fast and Provable Nonconvex Tensor RPCA The above inequality is valid because of Ei Z Z 1 λ(q) i Z Then, we get ui = I + Ei Φ Z + (Ei Φ Z)2 + Φ Ei Φ L Φ ui for each ui which implies U0 Φ Λ Φ UH 0 = i=1 ui Φ λi Φ u H i a 0 (Ei Φ Z)a Φ Ei Φ L Φ ui Φ λi Φ u H i b 0 (Ei Φ Z)b Φ Ei Φ L a 0 Za Φ L Φ Ea+1 i Φ ui Φ λi Φ u H i Φ Eb+1 i Φ L Φ X To simplify the above formula, we have 1 λ(1) i I 0 0 0 1 λ(2) i I 0 0 0 0 0 0 1 λ(q) i I u(1) i 0 0 0 u(2) i 0 0 0 0 0 0 u(q) i u(1) i 0 0 0 u(2) i 0 0 0 0 0 0 u(q) i 1 λ(1) i 0 0 0 1 λ(2) i 0 0 0 0 0 0 1 λ(q) i Then, we have Ea+1 i ui λi u H i Eb+1 i 0 0 1 λ(q) i λ(1) i 0 0 0 λ(2) i 0 0 0 0 0 0 λ(q) i 0 0 1 λ(q) i 0 0 1 λ(q) i u H i = ui Γ(a+b+1) i u H i , where we introduce new notation Γ(a+b+1) i . Hence we have Ea+1 i Φ ui Φ λi Φ u H i Φ Eb+1 i = ui Φ Γ(a+b+1) i Φ u H i . 
Now, we get
$$U_0*_\Phi\Lambda*_\Phi U_0^H = \sum_{a,b\ge0} Z^{*a}*_\Phi L*_\Phi\Big(\sum_i u_i*_\Phi\Gamma^{(a+b+1)}_i*_\Phi u_i^H\Big)*_\Phi L*_\Phi Z^{*b} = \sum_{a,b\ge0} Z^{*a}*_\Phi L*_\Phi U_0*_\Phi\Gamma^{(a+b+1)}*_\Phi U_0^H*_\Phi L*_\Phi Z^{*b}.$$
Thus,
$$\|L_0 - L\|_\infty = \big\|U_0*_\Phi\Lambda*_\Phi U_0^H - L\big\|_\infty \le \underbrace{\big\|L*_\Phi U_0*_\Phi\Gamma^{(1)}*_\Phi U_0^H*_\Phi L - L\big\|_\infty}_{Y_0} + \sum_{a+b>0}\underbrace{\big\|Z^{*a}*_\Phi L*_\Phi U_0*_\Phi\Gamma^{(a+b+1)}*_\Phi U_0^H*_\Phi L*_\Phi Z^{*b}\big\|_\infty}_{Y_{ab}}.$$
We handle $Y_0$ first. Recall that $L = U*_\Phi\Sigma*_\Phi V^H$ is the t-SVD of the symmetric tensor $L$, which obeys $\mu$-incoherence, i.e., $U*_\Phi U^H = V*_\Phi V^H$ and $\|e_i^H*_\Phi U*_\Phi U^H\|_F\le\sqrt{\frac{\mu s_r}{nq}}$ for all $i$. So for each $(i,j,k)$ entry of $Y_0$ one has
$$Y_0 = \max_{i,j,k}\big\langle L*_\Phi U_0*_\Phi\Gamma^{(1)}*_\Phi U_0^H*_\Phi L - L,\;E_{ijk}\big\rangle = \max_{i,j,k}\big\langle e_i^H*_\Phi U*_\Phi U^H*_\Phi\big(L*_\Phi U_0*_\Phi\Gamma^{(1)}*_\Phi U_0^H*_\Phi L - L\big)*_\Phi U*_\Phi U^H*_\Phi e_j,\;e_k\big\rangle \le \max_{i,j}\|e_i^H*_\Phi U*_\Phi U^H\|_F\,\big\|L*_\Phi U_0*_\Phi\Gamma^{(1)}*_\Phi U_0^H*_\Phi L - L\big\|\,\|U*_\Phi U^H*_\Phi e_j\|_F \le \frac{\mu s_r}{nq}\,\big\|L*_\Phi U_0*_\Phi\Gamma^{(1)}*_\Phi U_0^H*_\Phi L - L\big\|,$$
where the second equality uses the fact $U*_\Phi U^H*_\Phi L = L*_\Phi U*_\Phi U^H = L$. Since $L = U_0*_\Phi\Lambda*_\Phi U_0^H + \widetilde U_0*_\Phi\widetilde\Lambda*_\Phi\widetilde U_0^H - Z$, there holds
$$L*_\Phi U_0*_\Phi\Gamma^{(1)}*_\Phi U_0^H*_\Phi L - L = U_0*_\Phi\Lambda*_\Phi\Gamma^{(1)}*_\Phi\Lambda*_\Phi U_0^H - L - U_0*_\Phi\Lambda*_\Phi\Gamma^{(1)}*_\Phi U_0^H*_\Phi Z - Z*_\Phi U_0*_\Phi\Gamma^{(1)}*_\Phi\Lambda*_\Phi U_0^H + Z*_\Phi U_0*_\Phi\Gamma^{(1)}*_\Phi U_0^H*_\Phi Z,$$
and hence
$$\big\|L*_\Phi U_0*_\Phi\Gamma^{(1)}*_\Phi U_0^H*_\Phi L - L\big\| \le \big\|Z - \widetilde U_0*_\Phi\widetilde\Lambda*_\Phi\widetilde U_0^H\big\| + 2\|Z\| + \frac{\|Z\|^2}{|\lambda_{s_r}|} \le \big\|\widetilde U_0*_\Phi\widetilde\Lambda*_\Phi\widetilde U_0^H\big\| + 4\|Z\| \le |\lambda_{s_r+1}| + 4\|Z\| \le 5\|Z\|,$$
where the last inequalities use the facts $\frac{\|Z\|}{|\lambda_{s_r}|}\le\frac17$ and $|\lambda_{s_r+1}|\le\|Z\|$, the latter because $\sigma_{s_r+1} = 0$ and Weyl's inequality. Thus $Y_0 \le \frac{5\mu s_r}{nq}\,\|Z\| \le 5\alpha\mu s_r\,\|Z\|_\infty$.

Next, we derive an upper bound for the remaining part. Note that
$$Y_{ab} = \max_{i,j,k}\big\langle Z^{*a}*_\Phi L*_\Phi U_0*_\Phi\Gamma^{(a+b+1)}*_\Phi U_0^H*_\Phi L*_\Phi Z^{*b},\;E_{ijk}\big\rangle = \max_{i,j,k}\big\langle (e_i^H*_\Phi Z^{*a}*_\Phi U*_\Phi U^H)*_\Phi L*_\Phi U_0*_\Phi\Gamma^{(a+b+1)}*_\Phi U_0^H*_\Phi L*_\Phi(U*_\Phi U^H*_\Phi Z^{*b}*_\Phi e_j),\;e_k\big\rangle \le \max_{i,j}\|e_i^H*_\Phi Z^{*a}*_\Phi U\|_F\,\big\|L*_\Phi U_0*_\Phi\Gamma^{(a+b+1)}*_\Phi U_0^H*_\Phi L\big\|\,\|U^H*_\Phi Z^{*b}*_\Phi e_j\|_F \le \frac{\mu s_r}{nq}\,\big(\alpha nq\sqrt q\,\|Z\|_\infty\big)^{a+b}\,\big\|L*_\Phi U_0*_\Phi\Gamma^{(a+b+1)}*_\Phi U_0^H*_\Phi L\big\| \le \alpha\mu s_r\sqrt q\,\|Z\|_\infty\,\Big(\frac{\sigma_{s_r}}{8}\Big)^{a+b-1}\big\|L*_\Phi U_0*_\Phi\Gamma^{(a+b+1)}*_\Phi U_0^H*_\Phi L\big\|,$$
where the last inequality uses the bound on $\alpha$ ($\alpha\le\frac{1}{32\mu s_r\kappa\sqrt q}$ in Proposition 5.3). Furthermore, using $L = U_0*_\Phi\Lambda*_\Phi U_0^H + \widetilde U_0*_\Phi\widetilde\Lambda*_\Phi\widetilde U_0^H - Z$ again, we get
$$\big\|L*_\Phi U_0*_\Phi\Gamma^{(a+b+1)}*_\Phi U_0^H*_\Phi L\big\| = \big\|U_0*_\Phi\Lambda*_\Phi\Gamma^{(a+b+1)}*_\Phi\Lambda*_\Phi U_0^H - U_0*_\Phi\Lambda*_\Phi\Gamma^{(a+b+1)}*_\Phi U_0^H*_\Phi Z - Z*_\Phi U_0*_\Phi\Gamma^{(a+b+1)}*_\Phi\Lambda*_\Phi U_0^H + Z*_\Phi U_0*_\Phi\Gamma^{(a+b+1)}*_\Phi U_0^H*_\Phi Z\big\| \le |\lambda_{s_r}|^{-(a+b-1)} + 2\,|\lambda_{s_r}|^{-(a+b)}\|Z\| + |\lambda_{s_r}|^{-(a+b+1)}\|Z\|^2 = |\lambda_{s_r}|^{-(a+b-1)}\Big(1 + \frac{\|Z\|}{|\lambda_{s_r}|}\Big)^2 \le 2\Big(\frac78\,\sigma_{s_r}\Big)^{-(a+b-1)}.$$
Summing the geometric series then gives
$$\sum_{a+b>0}Y_{ab} \le \sum_{a+b>0} 2\alpha\mu s_r\sqrt q\,\|Z\|_\infty\,\Big(\frac{\sigma_{s_r}/8}{7\sigma_{s_r}/8}\Big)^{a+b-1} \le 3\alpha\mu s_r\sqrt q\,\|Z\|_\infty.$$
Finally, combining them together gives
$$\|L_0 - L\|_\infty \le Y_0 + \sum_{a+b>0}Y_{ab} \le 5\alpha\mu s_r\,\|Z\|_\infty + 3\alpha\mu s_r\sqrt q\,\|Z\|_\infty \le \frac{\mu s_r}{4nq}\,\sigma_1,$$
where the last inequality uses $\|Z\|_\infty\le\frac{4\mu s_r}{nq}\sigma_1$ and the bound on $\alpha$ ($\alpha\le\frac{1}{128\mu s_r\sqrt q}$ in Proposition 5.3).

(iv) From the thresholding rule, we know that
$$[S_0]_{ijk} = [\mathcal T_{\zeta_0}(S + L - L_0)]_{ijk} = \begin{cases}\mathcal T_{\zeta_0}([S + L - L_0]_{ijk}) & (i,j,k)\in\Omega,\\\mathcal T_{\zeta_0}([L - L_0]_{ijk}) & (i,j,k)\in\Omega^c.\end{cases}$$
The above inequality and $\zeta_0 = \frac{\mu s_r}{2nq}|\lambda_1|$ imply $[S_0]_{ijk} = 0$ for all $(i,j,k)\in\Omega^c$, i.e., $\Omega_0 := \operatorname{supp}(S_0)\subseteq\Omega$. Also, for any entry of $S - S_0$, there holds
$$\big|[S - S_0]_{ijk}\big| \le \begin{cases} 0 & (i,j,k)\in\Omega^c,\\ \|L - L_0\|_\infty \le \frac{\mu s_r}{4nq}\sigma_1 & (i,j,k)\in\Omega_0,\\ \|L - L_0\|_\infty + \zeta_0 \le \frac{\mu s_r}{nq}\sigma_1 & (i,j,k)\in\Omega\setminus\Omega_0,\end{cases}$$
where the last inequality uses $\zeta_0 = \frac{\mu s_r}{2nq}|\lambda_1|\le\frac{3\mu s_r}{4nq}\sigma_1$. Therefore, $\operatorname{supp}(S_0)\subseteq\Omega$ and $\|S - S_0\|_\infty\le\frac{\mu s_r}{nq}\sigma_1$, which completes the proof.
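The initialization analyzed above is, operationally, one hard-thresholding step followed by a multi-rank-$r$ truncation and a re-thresholding. The following hedged sketch (our own; function names, the FFT transform, and how the per-slice ranks $r_c$ are supplied are assumptions, and Algorithm 1 in the main text is the authoritative procedure) illustrates that structure.

```python
# Hedged sketch of the initialization analyzed in Theorem D.6: hard threshold
# T_zeta, multi-rank-r truncation H_r (slice-wise in the transformed domain),
# then a re-threshold of the residual.
import numpy as np

def hard_threshold(X, zeta):
    # T_zeta keeps entries with magnitude above zeta and zeroes the rest.
    return X * (np.abs(X) > zeta)

def multirank_truncate(X, ranks):
    # H_r: truncate each frontal slice in the transformed domain to rank r_c.
    q = X.shape[2]
    Xh = np.fft.fft(X, axis=2)
    out = np.zeros_like(Xh)
    for c in range(q):
        U, s, Vh = np.linalg.svd(Xh[:, :, c], full_matrices=False)
        rc = ranks[c]
        out[:, :, c] = (U[:, :rc] * s[:rc]) @ Vh[:rc, :]
    return np.real(np.fft.ifft(out, axis=2))

def spectral_norm(X):
    Xh = np.fft.fft(X, axis=2)
    return max(np.linalg.norm(Xh[:, :, c], 2) for c in range(X.shape[2]))

def initialize(D, ranks, beta_init, beta):
    # S_{-1}: threshold the observation itself (L_{-1} = 0); L_0: best
    # multi-rank-r approximation of D - S_{-1}; S_0: re-threshold the residual.
    S_m1 = hard_threshold(D, beta_init * spectral_norm(D))
    L_0 = multirank_truncate(D - S_m1, ranks)
    S_0 = hard_threshold(D - L_0, beta * spectral_norm(D - S_m1))
    return L_0, S_0
```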
D.2.3. LOCAL CONVERGENCE OF ALTERNATING PROJECTION IN EAPT

Lemma D.7 (Bounds for projections). Let $L_k = U_k*_\Phi\Sigma_k*_\Phi V_k^H$ be a multi-rank $r$ tensor and let $T_k$ be the tangent space of the multi-rank $r$ tensor manifold at $L_k$; the sum of the multi-rank of $L_k$ is $s_r$ and its tubal rank is $r$. Let $L = U*_\Phi\Sigma*_\Phi V^H$ be another multi-rank $r$ tensor and $T$ the corresponding tangent space. Then
$$\|U_k*_\Phi U_k^H - U*_\Phi U^H\| \le \frac{\|L_k - L\|}{\sigma_{\min}(L)},\qquad \|V_k*_\Phi V_k^H - V*_\Phi V^H\| \le \frac{\|L_k - L\|}{\sigma_{\min}(L)},\tag{9}$$
$$\|U_k*_\Phi U_k^H - U*_\Phi U^H\|_F \le \frac{\sqrt2\,\|L_k - L\|_F}{\sigma_{\min}(L)},\qquad \|V_k*_\Phi V_k^H - V*_\Phi V^H\|_F \le \frac{\sqrt2\,\|L_k - L\|_F}{\sigma_{\min}(L)},\tag{10}$$
$$\|(\mathcal I - \mathcal P_{T_k})L\|_F \le \frac{\|L_k - L\|_F^2}{\sigma_{\min}(L)},\qquad \|\mathcal P_{T_k} - \mathcal P_{T}\| \le \frac{2\,\|L_k - L\|_F}{\sigma_{\min}(L)}.\tag{11}$$

Proof. We only prove the left inequalities of Eqs. (9) and (10), because the right inequalities can be established in the same way. The left inequality of Eq. (9) follows from the calculation
$$\|U_k*_\Phi U_k^H - U*_\Phi U^H\| = \|(\bar I - \bar U_k\bar U_k^H)\,\bar U\bar U^H\| = \|(\bar I - \bar U_k\bar U_k^H)\,\bar L\,\bar V\bar\Sigma^{-1}\bar U^H\| = \|(\bar I - \bar U_k\bar U_k^H)(\bar L - \bar L_k)\,\bar V\bar\Sigma^{-1}\bar U^H\| \le \|\bar I - \bar U_k\bar U_k^H\|\,\|\bar L_k - \bar L\|\,\|\bar V\bar\Sigma^{-1}\bar U^H\| \le \frac{\|L_k - L\|}{\sigma_{\min}(L)}.$$
To prove the left inequality of Eq. (10), we first show that $\|(\mathcal I - U_k*_\Phi U_k^H)*_\Phi U*_\Phi U^H\|_F = \|U_k*_\Phi U_k^H*_\Phi(\mathcal I - U*_\Phi U^H)\|_F$. Indeed,
$$\|(\mathcal I - U_k*_\Phi U_k^H)*_\Phi U*_\Phi U^H\|_F^2 = \big\langle(\bar I - \bar U_k\bar U_k^H)\bar U\bar U^H,\,(\bar I - \bar U_k\bar U_k^H)\bar U\bar U^H\big\rangle = s_r - \big\langle\bar U_k\bar U_k^H,\,\bar U\bar U^H\big\rangle,$$
$$\|U_k*_\Phi U_k^H*_\Phi(\mathcal I - U*_\Phi U^H)\|_F^2 = \big\langle\bar U_k\bar U_k^H(\bar I - \bar U\bar U^H),\,\bar U_k\bar U_k^H(\bar I - \bar U\bar U^H)\big\rangle = s_r - \big\langle\bar U_k\bar U_k^H,\,\bar U\bar U^H\big\rangle.$$
Then it follows that
$$\|U_k*_\Phi U_k^H - U*_\Phi U^H\|_F \le \sqrt2\,\|(\mathcal I - U_k*_\Phi U_k^H)*_\Phi U*_\Phi U^H\|_F = \sqrt2\,\|(\bar I - \bar U_k\bar U_k^H)(\bar L - \bar L_k)\bar V\bar\Sigma^{-1}\bar U^H\|_F \le \sqrt2\,\|\bar I - \bar U_k\bar U_k^H\|\,\|\bar L_k - \bar L\|_F\,\|\bar V\bar\Sigma^{-1}\bar U^H\| \le \frac{\sqrt2\,\|L_k - L\|_F}{\sigma_{\min}(L)}.$$
Next, we prove the left inequality of Eq. (11). First note that $\mathcal P_T(L) = L$. So
$$(\mathcal I - \mathcal P_{T_k})L = (\mathcal P_T - \mathcal P_{T_k})L = (U*_\Phi U^H - U_k*_\Phi U_k^H)*_\Phi L*_\Phi(\mathcal I - V_k*_\Phi V_k^H) + (\mathcal I - U*_\Phi U^H)*_\Phi L*_\Phi(V*_\Phi V^H - V_k*_\Phi V_k^H) = (U*_\Phi U^H - U_k*_\Phi U_k^H)*_\Phi L*_\Phi(\mathcal I - V_k*_\Phi V_k^H) = (U*_\Phi U^H - U_k*_\Phi U_k^H)*_\Phi(L - L_k)*_\Phi(\mathcal I - V_k*_\Phi V_k^H),$$
where the last two equalities follow from the facts $(\mathcal I - U*_\Phi U^H)*_\Phi L = 0$ and $L_k*_\Phi(\mathcal I - V_k*_\Phi V_k^H) = 0$. Taking the Frobenius norm on both sides gives
$$\|(\mathcal I - \mathcal P_{T_k})L\|_F \le \|U*_\Phi U^H - U_k*_\Phi U_k^H\|\,\|L_k - L\|_F\,\|\mathcal I - V_k*_\Phi V_k^H\| \le \frac{\|L_k - L\|_F^2}{\sigma_{\min}(L)}.$$
Now we give the proof of the right-hand side of Eq. (11). For any tensor $Z$,
$$(\mathcal P_{T_k} - \mathcal P_T)Z = (U_k*_\Phi U_k^H - U*_\Phi U^H)*_\Phi Z + Z*_\Phi(V_k*_\Phi V_k^H - V*_\Phi V^H) - (U_k*_\Phi U_k^H - U*_\Phi U^H)*_\Phi Z*_\Phi V*_\Phi V^H - U_k*_\Phi U_k^H*_\Phi Z*_\Phi(V_k*_\Phi V_k^H - V*_\Phi V^H) = (U_k*_\Phi U_k^H - U*_\Phi U^H)*_\Phi Z*_\Phi(\mathcal I - V*_\Phi V^H) + (\mathcal I - U_k*_\Phi U_k^H)*_\Phi Z*_\Phi(V_k*_\Phi V_k^H - V*_\Phi V^H).$$
Taking the Frobenius norm on both sides gives
$$\|(\mathcal P_{T_k} - \mathcal P_T)Z\|_F \le \|U_k*_\Phi U_k^H - U*_\Phi U^H\|\,\|Z\|_F\,\|\mathcal I - V*_\Phi V^H\| + \|\mathcal I - U_k*_\Phi U_k^H\|\,\|Z\|_F\,\|V_k*_\Phi V_k^H - V*_\Phi V^H\| \le \frac{2\,\|L_k - L\|}{\sigma_{\min}(L)}\,\|Z\|_F.$$

Lemma D.8 (Chordal and projection distances; the matrix version is in (Wei et al., 2016)). Let $U_k, U \in \mathbb{R}^{n\times r\times q}$ be two orthogonal tensors of multi-rank $r$. Then there exists an $r\times r\times q$ unitary tensor $Q$ such that
$$\|U_k - U*_\Phi Q\|_F \le \|U_k*_\Phi U_k^H - U*_\Phi U^H\|_F.$$

Proof. Since
$$\|U_k - U*_\Phi Q\|_F^2 = 2s_r - 2\,\langle \bar U_k,\,\bar U\bar Q\rangle \qquad\text{and}\qquad \|U_k*_\Phi U_k^H - U*_\Phi U^H\|_F^2 = 2s_r - 2\,\langle\bar U_k\bar U_k^H,\,\bar U\bar U^H\rangle,$$
it suffices to show that there exists a $Q$ such that $\langle U_k,\,U*_\Phi Q\rangle \ge \langle U_k*_\Phi U_k^H,\,U*_\Phi U^H\rangle$, which is equivalent to showing $\langle U^H*_\Phi U_k,\,Q\rangle \ge \langle U^H*_\Phi U_k,\,U^H*_\Phi U_k\rangle$ for some unitary tensor $Q\in\mathbb{R}^{r\times r\times q}$. Let $U^H*_\Phi U_k = Q_1*_\Phi\Gamma*_\Phi Q_2^H$ be the transformed t-SVD of $U^H*_\Phi U_k$. Then $\Gamma(i,i,s)\le1$ for all $i$ and $s$, and we can choose $Q = Q_1*_\Phi Q_2^H$.
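The tangent-space projection $\mathcal P_T$ that drives this section can be computed slice-wise in the transformed domain. The sketch below is ours (the FFT transform and helper names are assumptions); it only illustrates the operator $\mathcal P_T(Z) = U*_\Phi U^H*_\Phi Z + Z*_\Phi V*_\Phi V^H - U*_\Phi U^H*_\Phi Z*_\Phi V*_\Phi V^H$ used in Lemma D.7 and afterwards.

```python
# Hedged sketch of the tangent-space projection P_T, computed slice-wise in
# the transformed (frequency) domain. U, V are t-SVD factors of the current
# multi-rank-r iterate, stored in the time domain.
import numpy as np

def tangent_project(Z, U, V):
    q = Z.shape[2]
    Zh = np.fft.fft(Z, axis=2)
    Uh = np.fft.fft(U, axis=2)
    Vh = np.fft.fft(V, axis=2)
    out = np.zeros_like(Zh)
    for c in range(q):
        Pu = Uh[:, :, c] @ Uh[:, :, c].conj().T   # projector onto column span
        Pv = Vh[:, :, c] @ Vh[:, :, c].conj().T   # projector onto row span
        Zc = Zh[:, :, c]
        out[:, :, c] = Pu @ Zc + Zc @ Pv - Pu @ Zc @ Pv
    return np.real(np.fft.ifft(out, axis=2))
```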
Lemma D.9. Let $Z_k = U_k*_\Phi\Sigma_k*_\Phi V_k^H$ be a multi-rank $r$ tensor such that
$$\|Z_k - L\|_F \le \frac{\sigma_{\min}(L)}{10\sqrt2}.$$
Then the tensor $\hat Z_k$ returned by the Trim algorithm satisfies $\|\hat U_k^H*_\Phi e_i\|_F \le \frac{10}{9}\sqrt{\frac{\mu s_r}{nq}}$ and $\|\hat V_k^H*_\Phi e_j\|_F \le \frac{10}{9}\sqrt{\frac{\mu s_r}{nq}}$ for $1\le i,j\le n$, where $\hat Z_k = \hat U_k*_\Phi\hat\Sigma_k*_\Phi\hat V_k^H$. Furthermore, $\|\hat Z_k - L\|_F \le 8\kappa\,\|Z_k - L\|_F$.

Proof. For simplicity, let $d = \|Z_k - L\|_F$. By Lemma D.7, $\|U_k*_\Phi U_k^H - U*_\Phi U^H\|_F \le \frac{\sqrt2\,d}{\sigma_{\min}(L)}$ and $\|V_k*_\Phi V_k^H - V*_\Phi V^H\|_F \le \frac{\sqrt2\,d}{\sigma_{\min}(L)}$. Together with Lemma D.8, this implies that there exist two unitary tensors $Q_u, Q_v\in\mathbb{R}^{r\times r\times q}$ such that $\|U_k - U*_\Phi Q_u\|_F \le \frac{\sqrt2\,d}{\sigma_{\min}(L)}$ and $\|V_k - V*_\Phi Q_v\|_F \le \frac{\sqrt2\,d}{\sigma_{\min}(L)}$. It follows that
$$\|\Sigma_k - Q_u^H*_\Phi\Sigma*_\Phi Q_v\|_F = \|\bar U_k^H\bar Z_k\bar V_k - (\bar U\bar Q_u)^H\bar L(\bar V\bar Q_v)\|_F \le \|\bar U_k^H\bar Z_k\bar V_k - (\bar U\bar Q_u)^H\bar Z_k\bar V_k\|_F + \|(\bar U\bar Q_u)^H\bar Z_k\bar V_k - (\bar U\bar Q_u)^H\bar L\bar V_k\|_F + \|(\bar U\bar Q_u)^H\bar L\bar V_k - (\bar U\bar Q_u)^H\bar L(\bar V\bar Q_v)\|_F \le \|U_k - U*_\Phi Q_u\|_F\,\|Z_k\| + \|Z_k - L\|_F + \|L\|\,\|V_k - V*_\Phi Q_v\|_F \le \frac{\sqrt2\,d\,(\sigma_{\max}(L) + d)}{\sigma_{\min}(L)} + d + \frac{\sqrt2\,d\,\sigma_{\max}(L)}{\sigma_{\min}(L)} \le 4\kappa d,$$
where $\kappa$ is the condition number of $L$; the last two inequalities use the assumption $d\le\frac{\sigma_{\min}(L)}{10\sqrt2}\le\frac{\sigma_{\max}(L)}{10\sqrt2}$ and the fact $\|Z_k\|\le\|L\| + \|Z_k - L\|\le\sigma_{\max}(L) + d$.

Recall that $A_k$ and $B_k$ in the Trim algorithm are defined as
$$A_k^H(:,i,:) = \min\Big\{\|U_k^H(:,i,:)\|_F,\;\sqrt{\tfrac{\mu s_r}{nq}}\Big\}\,\frac{U_k^H(:,i,:)}{\|U_k^H(:,i,:)\|_F},\qquad B_k^H(:,i,:) = \min\Big\{\|V_k^H(:,i,:)\|_F,\;\sqrt{\tfrac{\mu s_r}{nq}}\Big\}\,\frac{V_k^H(:,i,:)}{\|V_k^H(:,i,:)\|_F}.$$
Since $\|e_i^H*_\Phi U*_\Phi Q_u\|_F \le \sqrt{\frac{\mu s_r}{nq}}$ and $\|e_i^H*_\Phi V*_\Phi Q_v\|_F \le \sqrt{\frac{\mu s_r}{nq}}$, we have
$$\|e_i^H*_\Phi(A_k - U*_\Phi Q_u)\|_F \le \|e_i^H*_\Phi(U_k - U*_\Phi Q_u)\|_F,\qquad \|e_i^H*_\Phi(B_k - V*_\Phi Q_v)\|_F \le \|e_i^H*_\Phi(V_k - V*_\Phi Q_v)\|_F,$$
and hence
$$\|A_k - U*_\Phi Q_u\|_F \le \|U_k - U*_\Phi Q_u\|_F \le \frac{\sqrt2\,d}{\sigma_{\min}(L)},\qquad \|B_k - V*_\Phi Q_v\|_F \le \|V_k - V*_\Phi Q_v\|_F \le \frac{\sqrt2\,d}{\sigma_{\min}(L)}.$$
Since $\hat Z_k = A_k*_\Phi\Sigma_k*_\Phi B_k^H$ by the Trim algorithm, we have
$$\|\hat Z_k - L\|_F = \|A_k*_\Phi\Sigma_k*_\Phi B_k^H - (U*_\Phi Q_u)*_\Phi(Q_u^H*_\Phi\Sigma*_\Phi Q_v)*_\Phi(V*_\Phi Q_v)^H\|_F \le \|A_k - U*_\Phi Q_u\|_F\,\|\Sigma_k\|\,\|B_k\| + \|U*_\Phi Q_u\|\,\|\Sigma_k - Q_u^H*_\Phi\Sigma*_\Phi Q_v\|_F\,\|B_k\| + \|U*_\Phi Q_u\|\,\|Q_u^H*_\Phi\Sigma*_\Phi Q_v\|\,\|B_k - V*_\Phi Q_v\|_F \le \frac{\sqrt2\,d}{\sigma_{\min}(L)}\,(\sigma_{\max}(L)+d)\Big(1+\frac{\sqrt2\,d}{\sigma_{\min}(L)}\Big) + 4\kappa d\,\Big(1+\frac{\sqrt2\,d}{\sigma_{\min}(L)}\Big) + \sigma_{\max}(L)\,\frac{\sqrt2\,d}{\sigma_{\min}(L)} \le 8\kappa d,$$
where we use the fact $\|\Sigma_k\| = \|Z_k\| \le \|L\| + \|Z_k - L\| \le \sigma_{\max}(L) + d$.

It remains to estimate the incoherence of $\hat Z_k$. Because $A_k$ and $B_k$ are not necessarily orthogonal, consider their QR factorizations $A_k = \hat U_k*_\Phi R_u$ and $B_k = \hat V_k*_\Phi R_v$. First note that
$$\sigma_{\min}(\bar A_k) \ge 1 - \|A_k - U*_\Phi Q_u\| \ge 1 - \frac{\sqrt2\,d}{\sigma_{\min}(L)} \ge \frac{9}{10},\qquad \sigma_{\min}(\bar B_k) \ge 1 - \|B_k - V*_\Phi Q_v\| \ge 1 - \frac{\sqrt2\,d}{\sigma_{\min}(L)} \ge \frac{9}{10}.$$
We prove the first of these inequalities. Take $X_0\in\mathbb{R}^{r\times r\times q}$ with $\|X_0\|_F = 1$ achieving $\|A_k*_\Phi X_0\|_F = \sigma_{\min}(\bar A_k)$; then
$$1 = \|U*_\Phi Q_u*_\Phi X_0\|_F = \|(U*_\Phi Q_u - A_k + A_k)*_\Phi X_0\|_F \le \|(U*_\Phi Q_u - A_k)*_\Phi X_0\|_F + \|A_k*_\Phi X_0\|_F \le \|U*_\Phi Q_u - A_k\| + \sigma_{\min}(\bar A_k).$$
Hence $\|R_u^{-1}\| \le \frac{10}{9}$ and $\|R_v^{-1}\| \le \frac{10}{9}$. Consequently,
$$\|e_i^H*_\Phi\hat U_k\|_F = \|e_i^H*_\Phi A_k*_\Phi R_u^{-1}\|_F \le \|e_i^H*_\Phi A_k\|_F\,\|R_u^{-1}\| \le \frac{10}{9}\sqrt{\frac{\mu s_r}{nq}},\qquad \|e_i^H*_\Phi\hat V_k\|_F = \|e_i^H*_\Phi B_k*_\Phi R_v^{-1}\|_F \le \|e_i^H*_\Phi B_k\|_F\,\|R_v^{-1}\| \le \frac{10}{9}\sqrt{\frac{\mu s_r}{nq}}.$$

Lemma D.10 (Trim property). Let Trim be the algorithm defined in Algorithm 3. If $L_k\in\mathbb{R}^{n\times n\times q}$ is a multi-rank $r$ tensor with
$$\|L_k - L\| \le \frac{\sigma_{s_r}}{20\sqrt{s_r}},\tag{12}$$
then the trim output $\tilde L_k$ with level $\sqrt{\frac{\mu s_r}{nq}}$ satisfies
$$\|\tilde L_k - L\|_F \le 8\kappa\,\|L_k - L\|_F,\tag{13}$$
$$\max_{i\in[n]}\|A_k^H*_\Phi e_i\|_F \le \frac{10}{9}\sqrt{\frac{\mu s_r}{nq}},\qquad \max_{j\in[n]}\|B_k^H*_\Phi e_j\|_F \le \frac{10}{9}\sqrt{\frac{\mu s_r}{nq}},\tag{14}$$
where $\tilde L_k = A_k*_\Phi\Sigma_k*_\Phi B_k^H$ is the transformed t-SVD of $\tilde L_k$. Furthermore, it follows that
$$\|\tilde L_k - L\| \le 8\sqrt{2s_r}\,\kappa\,\|L_k - L\|.\tag{15}$$

Proof. Since both $L$ and $L_k$ are multi-rank $r$ tensors, $L_k - L$ has multi-rank at most $2r$. So
$$\|L_k - L\|_F \le \sqrt{2s_r}\,\|L_k - L\| \le \sqrt{2s_r}\,\frac{\sigma_{s_r}}{20\sqrt{s_r}} = \frac{\sigma_{s_r}}{10\sqrt2}.$$
The trim incoherence (14) and the bound (13) then follow directly from Lemma D.9. Furthermore,
$$\|\tilde L_k - L\| \le \|\tilde L_k - L\|_F \le 8\kappa\,\|L_k - L\|_F \le 8\sqrt{2s_r}\,\kappa\,\|L_k - L\|.$$
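A plausible sketch of the Trim step whose properties Lemmas D.9 and D.10 establish is given below: clip every row fiber of the t-SVD factors whose norm exceeds the incoherence level and rescale, then rebuild the factorization. The exact procedure is Algorithm 3 in the main text; the level argument and helper names here are our own illustrative assumptions.

```python
# Hedged sketch of a Trim step: row-fiber clipping of the t-SVD factors.
import numpy as np

def trim_factor(Uk, level):
    # Uk: n x r x q factor in the time domain; clip row fibers Uk(i, :, :)
    # whose Frobenius norm exceeds `level`, preserving their direction.
    out = Uk.copy()
    for i in range(Uk.shape[0]):
        norm_i = np.linalg.norm(Uk[i, :, :])
        if norm_i > level:
            out[i, :, :] *= level / norm_i
    return out

def trim(Uk, Sk, Vk, mu_sr):
    n, _, q = Uk.shape
    level = np.sqrt(mu_sr / (n * q))
    Ak, Bk = trim_factor(Uk, level), trim_factor(Vk, level)
    return Ak, Sk, Bk   # trimmed tensor is Ak * Sk * Bk^H (t-product)
```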
Lemma D.11. Let $L = U*_\Phi\Sigma*_\Phi V^H$ and $L_k = A_k*_\Phi\Sigma_k*_\Phi B_k^H$ be the transformed t-SVDs of two multi-rank $r$ tensors. Then
$$\|U*_\Phi U^H - A_k*_\Phi A_k^H\| \le \frac{\|L_k - L\|}{\sigma_{s_r}},\qquad \|V*_\Phi V^H - B_k*_\Phi B_k^H\| \le \frac{\|L_k - L\|}{\sigma_{s_r}},\tag{16}$$
$$\|(\mathcal I - \mathcal P_{T_k})L\| \le \frac{\|L_k - L\|^2}{\sigma_{s_r}}.\tag{17}$$

Proof. Eq. (16) follows from Lemma D.7. Since $L = U*_\Phi U^H*_\Phi L$ and $L_k*_\Phi(\mathcal I - B_k*_\Phi B_k^H) = 0$, we have
$$(\mathcal I - \mathcal P_{T_k})L = (\mathcal I - A_k*_\Phi A_k^H)*_\Phi L*_\Phi(\mathcal I - B_k*_\Phi B_k^H) = (\mathcal I - A_k*_\Phi A_k^H)*_\Phi U*_\Phi U^H*_\Phi L*_\Phi(\mathcal I - B_k*_\Phi B_k^H) = (U*_\Phi U^H - A_k*_\Phi A_k^H)*_\Phi L*_\Phi(\mathcal I - B_k*_\Phi B_k^H) = (U*_\Phi U^H - A_k*_\Phi A_k^H)*_\Phi(L - L_k)*_\Phi(\mathcal I - B_k*_\Phi B_k^H),$$
so that
$$\|(\mathcal I - \mathcal P_{T_k})L\| \le \|U*_\Phi U^H - A_k*_\Phi A_k^H\|\,\|L - L_k\|\,\|\mathcal I - B_k*_\Phi B_k^H\| \le \frac{\|L_k - L\|^2}{\sigma_{s_r}}.$$

Lemma D.12. Let $S\in\mathbb{R}^{n\times n\times q}$ be a symmetric tensor satisfying Assumption 5.2, and let $L_k\in\mathbb{R}^{n\times n\times q}$ be a multi-rank $r$ tensor with $\frac{100}{81}\mu$-incoherence, that is,
$$\max_{i\in[n]}\|A_k^H*_\Phi e_i\|_F \le \frac{10}{9}\sqrt{\frac{\mu s_r}{nq}}\qquad\text{and}\qquad \max_{j\in[n]}\|B_k^H*_\Phi e_j\|_F \le \frac{10}{9}\sqrt{\frac{\mu s_r}{nq}},$$
where $L_k = A_k*_\Phi\Sigma_k*_\Phi B_k^H$ is the transformed t-SVD of $L_k$. If $\operatorname{supp}(S_k)\subseteq\Omega$, then
$$\|\mathcal P_{T_k}(S - S_k)\|_\infty \le 4\alpha\mu s_r\,\|S - S_k\|_\infty.$$

Proof. By the incoherence assumption on $L_k$ and the sparsity of $S - S_k$, each entry satisfies
$$\big|[\mathcal P_{T_k}(S - S_k)]_{ijt}\big| = \big|\big\langle S - S_k,\;A_k*_\Phi A_k^H*_\Phi E_{ijt} + E_{ijt}*_\Phi B_k*_\Phi B_k^H - A_k*_\Phi A_k^H*_\Phi E_{ijt}*_\Phi B_k*_\Phi B_k^H\big\rangle\big| \le \big|\big\langle(S - S_k)*_\Phi e_j,\;A_k*_\Phi A_k^H*_\Phi e_i*_\Phi e_t\big\rangle\big| + \big|\big\langle e_i^H*_\Phi(S - S_k),\;e_t*_\Phi e_j^H*_\Phi B_k*_\Phi B_k^H\big\rangle\big| + \big|\big\langle S - S_k,\;A_k*_\Phi A_k^H*_\Phi E_{ijt}*_\Phi B_k*_\Phi B_k^H\big\rangle\big|.$$
For the first two terms, only entries of $S - S_k$ supported on $\Omega$ contribute; by Assumption 5.2 and the $\frac{100}{81}\mu$-incoherence of $L_k$, each of them is bounded by
$$\|S - S_k\|_\infty\sum_{a:(a,j,t)\in\Omega}\big\|e_a^H*_\Phi A_k*_\Phi A_k^H*_\Phi e_i*_\Phi e_t\big\|_F \le \frac{100\,\alpha\mu s_r}{81\,q}\,\|S - S_k\|_\infty.$$
For the third term, Lemma D.4 and the inequality below give
$$\big|\big\langle S - S_k,\;A_k*_\Phi A_k^H*_\Phi E_{ijt}*_\Phi B_k*_\Phi B_k^H\big\rangle\big| \le \|S - S_k\|\,\big\|A_k*_\Phi A_k^H*_\Phi E_{ijt}*_\Phi B_k*_\Phi B_k^H\big\|_F \le \alpha nq\,\|S - S_k\|_\infty\cdot\frac{100\,\mu s_r}{81\,nq} = \frac{100\,\alpha\mu s_r}{81}\,\|S - S_k\|_\infty.$$
Collecting the three terms,
$$\big|[\mathcal P_{T_k}(S - S_k)]_{ijt}\big| \le \Big(\frac{200\,\alpha\mu s_r}{81\,q} + \frac{100\,\alpha\mu s_r}{81}\Big)\|S - S_k\|_\infty \le 4\alpha\mu s_r\,\|S - S_k\|_\infty.$$
The proof uses standard comparisons between the $\ell_1$, Frobenius, and spectral norms of low-rank matrices, as well as the inequality
$$\big\|A_k*_\Phi A_k^H*_\Phi E_{ijt}*_\Phi B_k*_\Phi B_k^H\big\|_F = \|\bar B_k^H\bar E_{ijt}^H\bar A_k\|_F \le \|B_k^H*_\Phi e_j\|_F\,\|e_t\|_F\,\|e_i^H*_\Phi A_k\|_F \le \frac{100\,\mu s_r}{81\,nq}.$$

Lemma D.13. Under the symmetric setting, i.e., $U*_\Phi U^H = V*_\Phi V^H$ where $U\in\mathbb{R}^{n\times r\times q}$ is an orthogonal tensor, the spectral norm $\|\mathcal P_T(Z)\|$ is bounded by a constant multiple of $\|Z\|$ for any symmetric tensor $Z\in\mathbb{R}^{n\times n\times q}$. Moreover, the bound is tight.

Proof. First note that $\mathcal P_T(Z) = U*_\Phi U^H*_\Phi Z + Z*_\Phi U*_\Phi U^H - U*_\Phi U^H*_\Phi Z*_\Phi U*_\Phi U^H$. Then
$$\|\mathcal P_T(Z)\| = \|\bar U\bar U^H\bar Z + \bar Z\bar U\bar U^H - \bar U\bar U^H\bar Z\bar U\bar U^H\|,$$
and the claimed bound follows from Lemma 8 in Cai et al. (2019) applied to the block-diagonal matrices.
Lemma D.14. Let $U\in\mathbb{R}^{n\times r\times q}$ be an orthogonal tensor with $\mu$-incoherence, i.e., $\|U^H*_\Phi e_i\|_F\le\sqrt{\frac{\mu s_r}{nq}}$ for all $i\in[n]$. Then, for any $Z\in\mathbb{R}^{n\times n\times q}$, the inequality
$$\|U^H*_\Phi Z^{*a}*_\Phi e_i\|_F \le \max_l\sqrt{\frac{\mu s_r}{nq}}\,\big(\sqrt n\,\|Z*_\Phi e_l\|_F\big)^a$$
holds for all $i$ and $a\ge0$.

Proof. The proof is by mathematical induction.
Base case: when $a = 0$, $\|e_i^H*_\Phi U\|_F\le\sqrt{\frac{\mu s_r}{nq}}$ holds by assumption.
Induction hypothesis: $\|e_i^H*_\Phi Z^{*a}*_\Phi U\|_F\le\max_l\sqrt{\frac{\mu s_r}{nq}}\,(\sqrt n\,\|e_l^H*_\Phi Z\|_F)^a$ for all $i$ at the $a$th power.
Induction step: we have
$$\|e_i^H*_\Phi Z^{*(a+1)}*_\Phi U\|_F^2 = \|e_i^H*_\Phi Z*_\Phi Z^{*a}*_\Phi U\|_F^2 = \|\bar e_i^H\bar Z\,\overline{Z^{*a}*_\Phi U}\|_F^2 \le \sum_c\sum_{k_1,k_2}\big|[\bar Z]_{cn+i,cn+k_1}\big|\,\big|[\bar Z]_{cn+i,cn+k_2}\big|\,\big\|\bar e_{cn+k_1}^H\overline{Z^{*a}*_\Phi U}\big\|_2\,\big\|\bar e_{cn+k_2}^H\overline{Z^{*a}*_\Phi U}\big\|_2 \le \max_l\frac{\mu s_r}{nq}\big(\sqrt n\,\|e_l^H*_\Phi Z\|_F\big)^{2a}\sum_c\Big(\sum_k\big|[\bar Z]_{cn+i,cn+k}\big|\Big)^2 \le \max_l\frac{\mu s_r}{nq}\big(\sqrt n\,\|e_l^H*_\Phi Z\|_F\big)^{2a}\;\sum_c n\,\big\|\bar e_{cn+i}^H\bar Z(:,cn+1:cn+n)\big\|_2^2 = \max_l\frac{\mu s_r}{nq}\big(\sqrt n\,\|e_l^H*_\Phi Z\|_F\big)^{2a}\; n\,\|e_i^H*_\Phi Z\|_F^2 \le \max_l\frac{\mu s_r}{nq}\big(\sqrt n\,\|e_l^H*_\Phi Z\|_F\big)^{2a+2},$$
where $\|\cdot\|_2$ is the $\ell_2$ norm of vectors, we used $\|x\|_1\le\sqrt n\,\|x\|_2$ for $x\in\mathbb{R}^n$, and we used the inequality
$$\big\|\bar e_{cn+k_1}^H\overline{Z^{*a}*_\Phi U}\big\|_2\,\big\|\bar e_{cn+k_2}^H\overline{Z^{*a}*_\Phi U}\big\|_2 \le \max_l\frac{\mu s_r}{nq}\big(\sqrt n\,\|e_l^H*_\Phi Z\|_F\big)^{2a}.$$
Hence $\|U^H*_\Phi Z^{*(a+1)}*_\Phi e_i\|_F\le\max_l\sqrt{\frac{\mu s_r}{nq}}\,(\sqrt n\,\|Z*_\Phi e_l\|_F)^{a+1}$.

Lemma D.15. Under the conditions of Theorem 5.5, let $L, S\in\mathbb{R}^{n\times n\times q}$ be two symmetric tensors satisfying Assumptions 5.1 and 5.2, and let $\tilde L_k\in\mathbb{R}^{n\times n\times q}$ be the trim output of $L_k$. If
$$\|L - \tilde L_k\| \le 8\alpha\mu s_r\gamma^k\,\sigma_1,\qquad \|S - S_k\|_\infty \le \frac{\mu s_r}{nq}\gamma^k\,\sigma_1,\qquad \operatorname{supp}(S_k)\subseteq\Omega,$$
then
$$\|(\mathcal P_{T_k} - \mathcal I)L + \mathcal P_{T_k}(S - S_k)\| \le \tau\gamma^{k+1}\sigma_{s_r}\qquad\text{and}\qquad \max_l\sqrt n\,\big\|e_l^H*_\Phi\big[(\mathcal P_{T_k} - \mathcal I)L + \mathcal P_{T_k}(S - S_k)\big]\big\|_F \le v\gamma^k\sigma_{s_r}$$
hold for all $k\ge0$, provided $1 > \gamma \ge 512\tau s_r\kappa^2 + \frac1{12}$. Here $\tau = 4\alpha\mu s_r\kappa$ and $v = \tau\big(48\sqrt q\,\mu\sqrt{s_r}\,\kappa + \mu s_r\sqrt q\big)$.

Proof. Since $D = L + S$, the triangle inequality gives
$$\|(\mathcal P_{T_k} - \mathcal I)L + \mathcal P_{T_k}(S - S_k)\| \le \|(\mathcal P_{T_k} - \mathcal I)L\| + \|\mathcal P_{T_k}(S - S_k)\| \le \frac{\|\tilde L_k - L\|^2}{\sigma_{s_r}} + \frac43\,\alpha nq\,\|S - S_k\|_\infty,$$
where the first term is bounded by Lemmas D.10 and D.11 and the second by Lemmas D.13 and D.4. Substituting the induction hypotheses, collecting terms, and using the bound on $\alpha$ together with $\gamma\ge512\tau s_r\kappa^2+\frac1{12}$ yields
$$\|(\mathcal P_{T_k} - \mathcal I)L + \mathcal P_{T_k}(S - S_k)\| \le 4\alpha\mu s_r\gamma^{k+1}\sigma_1 = \tau\gamma^{k+1}\sigma_{s_r}.$$
To bound $\max_l\sqrt n\,\|e_l^H*_\Phi[(\mathcal P_{T_k}-\mathcal I)L+\mathcal P_{T_k}(S-S_k)]\|_F$, first note that
$$\max_l\|e_l^H*_\Phi(\mathcal I - \mathcal P_{T_k})L\|_F = \max_l\|e_l^H*_\Phi(U*_\Phi U^H - A_k*_\Phi A_k^H)*_\Phi(L - \tilde L_k)*_\Phi(\mathcal I - A_k*_\Phi A_k^H)\|_F \le \max_l\|e_l^H*_\Phi(U*_\Phi U^H - A_k*_\Phi A_k^H)\|_F\,\|L - \tilde L_k\|\,\|\mathcal I - A_k*_\Phi A_k^H\| \le \frac{19}{9}\sqrt{\frac{\mu s_r}{nq}}\,\|L - \tilde L_k\|,$$
where we used the fact that $L$ is $\mu$-incoherent and $\tilde L_k$ is $\frac{100}{81}\mu$-incoherent, together with $\|X\|\le\|X\|_F$. Hence, for all $k\ge0$,
$$\max_l\sqrt n\,\big\|e_l^H*_\Phi\big[(\mathcal P_{T_k}-\mathcal I)L+\mathcal P_{T_k}(S-S_k)\big]\big\|_F \le \max_l\sqrt n\,\|e_l^H*_\Phi(\mathcal P_{T_k}-\mathcal I)L\|_F + \sqrt n\,\|e_l^H*_\Phi\mathcal P_{T_k}(S-S_k)\|_F \le \frac{19}{9}\sqrt{\frac{\mu s_r}{q}}\,\|L - \tilde L_k\| + n\sqrt q\,\|\mathcal P_{T_k}(S-S_k)\|_\infty \le \frac{19}{9}\sqrt{\frac{\mu s_r}{q}}\cdot8\alpha\mu s_r\gamma^k\sigma_1 + 4\alpha\mu s_r\,n\sqrt q\cdot\frac{\mu s_r}{nq}\gamma^k\sigma_1 \le v\gamma^k\sigma_{s_r},$$
where we have used Lemmas D.10 and D.12.
Notice that $L$ is a multi-rank $r$ tensor, which implies $\sigma_{i,c} = 0$ for $i > r_c$, so we have
$$\Big|\,|\lambda^{(k)}_{i,c}| + \gamma^j|\lambda^{(k)}_{t,c}| - \gamma^j\sigma_{t,c}\,\Big| \le \big||\lambda^{(k)}_{i,c}| - \sigma_{i,c}\big| + \gamma^j\big||\lambda^{(k)}_{t,c}| - \sigma_{t,c}\big| \le \tau\gamma^{k+1}\sigma_{s_r} + \tau\gamma^{j+k+1}\sigma_{s_r} \le (1+\gamma^{k+1})\,\tau\gamma^j\sigma_{s_r} \le 2\tau\gamma^j\sigma_{s_r} \le 2\tau\gamma^j\sigma_{t,c},$$
which proves the claim.

Lemma D.17. Under the conditions of Theorem 5.5, let $L, S\in\mathbb{R}^{n\times n\times q}$ be two symmetric tensors satisfying Assumptions 5.1 and 5.2, and let $\tilde L_k\in\mathbb{R}^{n\times n\times q}$ be the trim output of $L_k$. If $\|L - \tilde L_k\| \le 8\alpha\mu s_r\gamma^k\sigma_1$, $\|S - S_k\|_\infty \le \frac{\mu s_r}{nq}\gamma^k\sigma_1$, and $\operatorname{supp}(S_k)\subseteq\Omega$, then $\|L - L_{k+1}\| \le 8\alpha\mu s_r\gamma^{k+1}\sigma_1$, provided $1>\gamma\ge512\tau s_r\kappa^2+\frac1{12}$.

Proof. A direct calculation yields
$$\|L - L_{k+1}\| \le \|L - \mathcal P_{T_k}(D - S_k)\| + \|\mathcal P_{T_k}(D - S_k) - L_{k+1}\| \le 2\,\|L - \mathcal P_{T_k}(D - S_k)\| = 2\,\|L - \mathcal P_{T_k}(L + S - S_k)\| = 2\,\|(\mathcal P_{T_k} - \mathcal I)L + \mathcal P_{T_k}(S - S_k)\| \le 2\tau\gamma^{k+1}\sigma_{s_r} = 8\alpha\mu s_r\kappa\gamma^{k+1}\sigma_{s_r} = 8\alpha\mu s_r\gamma^{k+1}\sigma_1,$$
where the second inequality follows from the fact that $L_{k+1} = \mathcal H_r(\mathcal P_{T_k}(D - S_k))$ is the best multi-rank-$r$ approximation of $\mathcal P_{T_k}(D - S_k)$, and the last inequality follows from Lemma D.15.

Lemma D.18. Under the conditions of Theorem 5.5, let $L, S\in\mathbb{R}^{n\times n\times q}$ be two symmetric tensors satisfying Assumptions 5.1 and 5.2, and let $\tilde L_k\in\mathbb{R}^{n\times n\times q}$ be the trim output of $L_k$. If $\|L - \tilde L_k\| \le 8\alpha\mu s_r\gamma^k\sigma_1$, $\|S - S_k\|_\infty \le \frac{\mu s_r}{nq}\gamma^k\sigma_1$, and $\operatorname{supp}(S_k)\subseteq\Omega$, then
$$\|L - L_{k+1}\|_\infty \le \frac{(1-2\tau)\,\mu s_r}{2nq}\gamma^{k+1}\sigma_1,$$
provided $1>\gamma\ge\max\Big\{512\tau s_r\kappa^2+\frac1{12},\ \frac{2v}{(1-12\tau)(1-\tau-v)^2}\Big\}$ and $\tau<\frac1{12}$.

Proof. Let
$$\mathcal P_{T_k}(D - S_k) = \big[U_{k+1},\,\widetilde U_{k+1}\big]*_\Phi\begin{bmatrix}\Lambda&\\&\widetilde\Lambda\end{bmatrix}*_\Phi\big[U_{k+1},\,\widetilde U_{k+1}\big]^H = U_{k+1}*_\Phi\Lambda*_\Phi U_{k+1}^H + \widetilde U_{k+1}*_\Phi\widetilde\Lambda*_\Phi\widetilde U_{k+1}^H$$
be its eigenvalue decomposition. We use the lighter notation $\lambda_i\in\mathbb{R}^{1\times1\times q}$ ($1\le i\le n$) for the eigen fibers of $\mathcal P_{T_k}(D - S_k)$ at the $k$th iteration and assume they are ordered so that $\|\lambda_1\|_F\ge\|\lambda_2\|_F\ge\cdots\ge\|\lambda_n\|_F$. Moreover, $\Lambda$ collects the multi-rank-$r$ largest eigenvalues in Frobenius norm, $U_{k+1}$ contains the corresponding leading eigenvectors, and $\widetilde U_{k+1}$ the rest; the multi-rank of $U_{k+1}*_\Phi\Lambda*_\Phi U_{k+1}^H$ is $r$, so that $L_{k+1} = \mathcal H_r(\mathcal P_{T_k}(D - S_k)) = U_{k+1}*_\Phi\Lambda*_\Phi U_{k+1}^H$.

Denote $Z = \mathcal P_{T_k}(D - S_k) - L = (\mathcal P_{T_k}-\mathcal I)L + \mathcal P_{T_k}(S - S_k)$ and let $u_i = U_{k+1}(:,i,:)$ be the $i$th eigenvector of $\mathcal P_{T_k}(D - S_k)$, so that $\mathcal P_{T_k}(D - S_k)*_\Phi u_i = u_i*_\Phi\lambda_i$. Writing $M = \mathcal P_{T_k}(D - S_k)$, exactly as in part (iii) of the proof of Theorem D.6 we obtain slice-wise $M^{(c)}u_i^{(c)} = \lambda_i^{(c)}u_i^{(c)}$ with $M^{(c)} = Z^{(c)} + L^{(c)}$, and hence the Neumann expansion
$$\bar u_i = \big(\bar I + \bar E_i\bar Z + (\bar E_i\bar Z)^2 + \cdots\big)\,\bar E_i\,\bar L\,\bar u_i,\qquad \bar E_i = \operatorname{diag}\Big(\tfrac1{\lambda_i^{(1)}}I,\dots,\tfrac1{\lambda_i^{(q)}}I\Big),\quad \bar Z = \operatorname{diag}\big(Z^{(1)},\dots,Z^{(q)}\big).$$
The expansion is valid because, by Lemmas D.15 and D.16 (since $Z$ is symmetric, its eigenvalues and singular values coincide in magnitude),
$$\frac{\|Z\|}{|\lambda_{s_r}|} \le \frac{\tau\gamma^{k+1}\sigma_{s_r}}{(1-\tau)\,\sigma_{s_r}} = \frac{\tau}{1-\tau} < 1.$$
Then, again as in the proof of Theorem D.6, $u_i = (I + E_i*_\Phi Z + (E_i*_\Phi Z)^2 + \cdots)*_\Phi E_i*_\Phi L*_\Phi u_i$, which implies
$$U_{k+1}*_\Phi\Lambda*_\Phi U_{k+1}^H = \sum_i u_i*_\Phi\lambda_i*_\Phi u_i^H = \sum_{a,b\ge0} Z^{*a}*_\Phi L*_\Phi\Big(\sum_i E_i^{a+1}*_\Phi u_i*_\Phi\lambda_i*_\Phi u_i^H*_\Phi E_i^{b+1}\Big)*_\Phi L*_\Phi Z^{*b},$$
and, with the same simplification as before, $E_i^{a+1}*_\Phi u_i*_\Phi\lambda_i*_\Phi u_i^H*_\Phi E_i^{b+1} = u_i*_\Phi\Gamma^{(a+b+1)}_i*_\Phi u_i^H$ with $\bar\Gamma^{(a+b+1)}_i = \operatorname{diag}\big((\lambda_i^{(1)})^{-(a+b+1)},\dots,(\lambda_i^{(q)})^{-(a+b+1)}\big)$.
Now, we get
$$U_{k+1}*_\Phi\Lambda*_\Phi U_{k+1}^H = \sum_{a,b\ge0} Z^{*a}*_\Phi L*_\Phi U_{k+1}*_\Phi\Gamma^{(a+b+1)}*_\Phi U_{k+1}^H*_\Phi L*_\Phi Z^{*b}.$$
Thus,
$$\|L_{k+1} - L\|_\infty = \|U_{k+1}*_\Phi\Lambda*_\Phi U_{k+1}^H - L\|_\infty \le \underbrace{\|L*_\Phi U_{k+1}*_\Phi\Gamma^{(1)}*_\Phi U_{k+1}^H*_\Phi L - L\|_\infty}_{Y_0} + \sum_{a+b>0}\underbrace{\|Z^{*a}*_\Phi L*_\Phi U_{k+1}*_\Phi\Gamma^{(a+b+1)}*_\Phi U_{k+1}^H*_\Phi L*_\Phi Z^{*b}\|_\infty}_{Y_{ab}}.$$
We handle $Y_0$ first. Recall that $L = U*_\Phi\Sigma*_\Phi V^H$ is the transformed t-SVD of the symmetric tensor $L$, which obeys $\mu$-incoherence, i.e., $U*_\Phi U^H = V*_\Phi V^H$ and $\|e_i^H*_\Phi U*_\Phi U^H\|_F\le\sqrt{\frac{\mu s_r}{nq}}$ for all $i$. Exactly as in the proof of Theorem D.6,
$$Y_0 \le \frac{\mu s_r}{nq}\,\|L*_\Phi U_{k+1}*_\Phi\Gamma^{(1)}*_\Phi U_{k+1}^H*_\Phi L - L\|,$$
using the fact $U*_\Phi U^H*_\Phi L = L*_\Phi U*_\Phi U^H = L$. Since $L = U_{k+1}*_\Phi\Lambda*_\Phi U_{k+1}^H + \widetilde U_{k+1}*_\Phi\widetilde\Lambda*_\Phi\widetilde U_{k+1}^H - Z$, there holds
$$\|L*_\Phi U_{k+1}*_\Phi\Gamma^{(1)}*_\Phi U_{k+1}^H*_\Phi L - L\| \le \|Z - \widetilde U_{k+1}*_\Phi\widetilde\Lambda*_\Phi\widetilde U_{k+1}^H\| + 2\|Z\| + \frac{\|Z\|^2}{|\lambda_{s_r}|} \le \|\widetilde U_{k+1}*_\Phi\widetilde\Lambda*_\Phi\widetilde U_{k+1}^H\| + 4\|Z\| \le |\lambda_{s_r+1}| + 4\|Z\| \le 5\|Z\| \le 5\tau\gamma^{k+1}\sigma_1,$$
where the last inequalities follow from Lemma D.15 and the facts $\frac{\|Z\|}{|\lambda_{s_r}|}\le\frac{\tau}{1-\tau}<1$ (since $\tau<\frac12$) and $|\lambda_{s_r+1}|\le\|Z\|$, the latter because $L$ is a multi-rank $r$ tensor (tensor Weyl's inequality). Thus $Y_0 \le \frac{\mu s_r}{nq}\,5\tau\gamma^{k+1}\sigma_1$.

Next, we derive an upper bound for the remaining part. Note that
$$Y_{ab} \le \max_{i,j}\|e_i^H*_\Phi Z^{*a}*_\Phi U\|_F\,\|L*_\Phi U_{k+1}*_\Phi\Gamma^{(a+b+1)}*_\Phi U_{k+1}^H*_\Phi L\|\,\|U^H*_\Phi Z^{*b}*_\Phi e_j\|_F \le \frac{\mu s_r}{nq}\,\big(\sqrt n\,\max_l\|e_l^H*_\Phi Z\|_F\big)^{a+b}\,\|L*_\Phi U_{k+1}*_\Phi\Gamma^{(a+b+1)}*_\Phi U_{k+1}^H*_\Phi L\|,$$
where the last inequality follows from Lemma D.14. Furthermore, using $L = U_{k+1}*_\Phi\Lambda*_\Phi U_{k+1}^H + \widetilde U_{k+1}*_\Phi\widetilde\Lambda*_\Phi\widetilde U_{k+1}^H - Z$ again, we get
$$\|L*_\Phi U_{k+1}*_\Phi\Gamma^{(a+b+1)}*_\Phi U_{k+1}^H*_\Phi L\| \le |\lambda_{s_r}|^{-(a+b-1)} + 2\,|\lambda_{s_r}|^{-(a+b)}\|Z\| + |\lambda_{s_r}|^{-(a+b+1)}\|Z\|^2 = |\lambda_{s_r}|^{-(a+b-1)}\Big(1 + \frac{\|Z\|}{|\lambda_{s_r}|}\Big)^2 \le \Big(\frac1{1-\tau}\Big)^2\big((1-\tau)\,\sigma_{s_r}\big)^{-(a+b-1)},$$
where the last inequality uses $\frac{\|Z\|}{|\lambda_{s_r}|}\le\frac{\tau}{1-\tau}$ and $|\lambda_{s_r}|\ge(1-\tau)\sigma_{s_r}$ (Lemma D.16). Together with Lemma D.15 ($\max_l\sqrt n\,\|e_l^H*_\Phi Z\|_F\le v\gamma^k\sigma_{s_r}$), we have
$$\sum_{a+b>0}Y_{ab} \le \frac{\mu s_r}{nq}\sum_{a+b>0}\frac{\big(v\gamma^k\sigma_{s_r}\big)^{a+b}}{(1-\tau)^2\big((1-\tau)\sigma_{s_r}\big)^{a+b-1}}.$$
Finally, combining the two parts gives
$$\|L_{k+1} - L\|_\infty \le Y_0 + \sum_{a+b>0}Y_{ab} \le \frac{\mu s_r}{nq}\,5\tau\gamma^{k+1}\sigma_1 + \sum_{a+b>0}Y_{ab} \le \frac{(1-2\tau)\,\mu s_r}{2nq}\gamma^{k+1}\sigma_1,$$
where the last inequality follows from $\gamma\ge\frac{2v}{(1-12\tau)(1-\tau-v)^2}$ and $\tau<\frac1{12}$.
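The lemmas so far control one EAPT-style update: project $D - S_k$ onto the tangent space at the (trimmed) iterate, take the best multi-rank-$r$ approximation, and hard-threshold the residual. The sketch below is our own illustration of that update map; for clarity it forms $\mathcal H_r(\mathcal P_{T_k}(D - S_k))$ directly, whereas Algorithm 2 in the main text realizes the same update more efficiently by exploiting the tangent-space structure (and applies the Trim step sketched earlier to $L_k$ before forming $T_k$). The transform (FFT), helper names, and threshold handling are assumptions.

```python
# Hedged sketch of one EAPT-style iteration analyzed in Lemmas D.15-D.19.
import numpy as np

def _slice_svd_truncate(Mh, ranks):
    out = np.zeros_like(Mh)
    for c in range(Mh.shape[2]):
        U, s, Vh = np.linalg.svd(Mh[:, :, c], full_matrices=False)
        rc = ranks[c]
        out[:, :, c] = (U[:, :rc] * s[:rc]) @ Vh[:rc, :]
    return out

def eapt_step(D, L_k, S_k, ranks, zeta):
    q = D.shape[2]
    Rh = np.fft.fft(D - S_k, axis=2)
    Lh = np.fft.fft(L_k, axis=2)
    Ph = np.zeros_like(Rh)
    for c in range(q):
        # Tangent space of the fixed multi-rank manifold at L_k, slice-wise.
        U, _, Vh = np.linalg.svd(Lh[:, :, c], full_matrices=False)
        U, V = U[:, :ranks[c]], Vh[:ranks[c], :].conj().T
        Pu, Pv = U @ U.conj().T, V @ V.conj().T
        Rc = Rh[:, :, c]
        Ph[:, :, c] = Pu @ Rc + Rc @ Pv - Pu @ Rc @ Pv
    L_next = np.real(np.fft.ifft(_slice_svd_truncate(Ph, ranks), axis=2))
    S_next = (D - L_next) * (np.abs(D - L_next) > zeta)   # T_zeta
    return L_next, S_next
```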
Lemma D.19. Let $L, S\in\mathbb{R}^{n\times n\times q}$ be two symmetric tensors satisfying Assumptions 5.1 and 5.2, respectively, and let $\tilde L_k\in\mathbb{R}^{n\times n\times q}$ be the trim output of $L_k$. Recall that $\beta = \frac{\mu s_r}{2nq}$. If $\|L - \tilde L_k\| \le 8\alpha\mu s_r\gamma^k\sigma_1$, $\|S - S_k\|_\infty \le \frac{\mu s_r}{nq}\gamma^k\sigma_1$, and $\operatorname{supp}(S_k)\subseteq\Omega$, then $\operatorname{supp}(S_{k+1})\subseteq\Omega$ and $\|S - S_{k+1}\|_\infty \le \frac{\mu s_r}{nq}\gamma^{k+1}\sigma_1$, provided $1>\gamma\ge\max\big\{512\tau s_r\kappa^2+\frac1{12},\ \frac{2v}{(1-12\tau)(1-\tau-v)^2}\big\}$ and $\tau<\frac1{12}$.

Proof. We first notice that
$$[S_{k+1}]_{ijk} = [\mathcal T_{\zeta_{k+1}}(D - L_{k+1})]_{ijk} = [\mathcal T_{\zeta_{k+1}}(S + L - L_{k+1})]_{ijk} = \begin{cases}\mathcal T_{\zeta_{k+1}}([S + L - L_{k+1}]_{ijk}) & (i,j,k)\in\Omega,\\\mathcal T_{\zeta_{k+1}}([L - L_{k+1}]_{ijk}) & (i,j,k)\in\Omega^c.\end{cases}$$
Let $|\lambda^{(k)}_i|$ denote the $i$th singular value of $\mathcal P_{T_k}(D - S_k)$. By Lemmas D.16 and D.18, every entry of $L - L_{k+1}$ satisfies
$$\big|[L - L_{k+1}]_{ijk}\big| \le \|L - L_{k+1}\|_\infty \le \frac{(1-2\tau)\,\mu s_r}{2nq}\gamma^{k+1}\sigma_1 \le \frac{\mu s_r}{2nq}\big(|\lambda^{(k)}_{s_r+1}| + \gamma^{k+1}|\lambda^{(k)}_1|\big) = \zeta_{k+1}.$$
Hence $[S_{k+1}]_{ijk} = 0$ for all $(i,j,k)\in\Omega^c$, i.e., $\operatorname{supp}(S_{k+1})\subseteq\Omega$. Denote $\Omega_{k+1} := \operatorname{supp}(S_{k+1}) = \{(i,j,k) : |[D - L_{k+1}]_{ijk}| > \zeta_{k+1}\}$. Then, for any entry of $S - S_{k+1}$,
$$\big|[S - S_{k+1}]_{ijk}\big| \le \begin{cases}0 & (i,j,k)\in\Omega^c,\\\|L - L_{k+1}\|_\infty \le \frac{(1-2\tau)\,\mu s_r}{2nq}\gamma^{k+1}\sigma_1 & (i,j,k)\in\Omega_{k+1},\\\|L - L_{k+1}\|_\infty + \zeta_{k+1} \le \frac{\mu s_r}{nq}\gamma^{k+1}\sigma_1 & (i,j,k)\in\Omega\setminus\Omega_{k+1},\end{cases}$$
where the last step follows from Lemma D.16, which implies $\zeta_{k+1} = \frac{\mu s_r}{2nq}\big(|\lambda^{(k)}_{s_r+1}| + \gamma^{k+1}|\lambda^{(k)}_1|\big) \le \frac{(1+2\tau)\,\mu s_r}{2nq}\gamma^{k+1}\sigma_1$. Therefore, $\|S - S_{k+1}\|_\infty \le \frac{\mu s_r}{nq}\gamma^{k+1}\sigma_1$.

Theorem D.20 (Local convergence). Under the conditions of Theorem 5.5, let $L, S\in\mathbb{R}^{n\times n\times q}$ be two symmetric tensors satisfying Assumptions 5.1 and 5.2. If the initial guesses $L_0$ and $S_0$ obey
$$\|L - L_0\| \le 8\alpha\mu s_r\,\sigma_1,\qquad \|S - S_0\|_\infty \le \frac{\mu s_r}{nq}\,\sigma_1,\qquad\text{and}\qquad \operatorname{supp}(S_0)\subseteq\Omega,$$
then the iterates of Algorithm 2 with parameters $\beta = \frac{\mu s_r}{2nq}$ and $\gamma\in(\frac1{12},1)$ satisfy
$$\|L - L_k\| \le 8\alpha\mu s_r\gamma^k\,\sigma_1,\qquad \|S - S_k\|_\infty \le \frac{\mu s_r}{nq}\gamma^k\,\sigma_1,\qquad\text{and}\qquad \operatorname{supp}(S_k)\subseteq\Omega.$$

Proof. The theorem is proved by mathematical induction.
Base case: when $k = 0$, the claim holds by the assumption on the initialization.
Induction step: assume $\|L - L_k\| \le 8\alpha\mu s_r\gamma^k\sigma_1$, $\|S - S_k\|_\infty \le \frac{\mu s_r}{nq}\gamma^k\sigma_1$, and $\operatorname{supp}(S_k)\subseteq\Omega$ at the $k$th iteration. At the $(k+1)$th iteration, it follows directly from Lemmas D.17 and D.19 that $\|L - L_{k+1}\| \le 8\alpha\mu s_r\gamma^{k+1}\sigma_1$, $\|S - S_{k+1}\|_\infty \le \frac{\mu s_r}{nq}\gamma^{k+1}\sigma_1$, and $\operatorname{supp}(S_{k+1})\subseteq\Omega$.
Additionally, notice that overall we require $1>\gamma\ge512\tau s_r\kappa^2+\frac1{12}$. By the definitions of $\tau$ and $v$, one can see that this lower bound approaches $\frac1{12}$ when the constant hidden in the bound on $\alpha$ in Theorem 5.5 is sufficiently large. Therefore, the theorem holds for any $\gamma\in(\frac1{12},1)$.

D.2.4. LOCAL CONVERGENCE OF ALTERNATING PROJECTION IN APT

Lemma D.21. Under the conditions of Theorem 5.4, let $L$ and $S$ be symmetric tensors satisfying Assumptions 5.1 and 5.2, and let $S_k$ be the $k$th iterate of our algorithm. Suppose (1) $\|S - S_k\|_\infty \le \frac{4\mu s_r}{nq}\gamma^{k-1}\sigma_1$ and (2) $\operatorname{supp}(S - S_k)\subseteq\operatorname{supp}(S)$. Then $\frac{127}{128}\sigma_1 \le |\lambda_1| \le \frac{129}{128}\sigma_1$.

Proof. Note that $D - S_k = L + (S - S_k)$. With Lemmas D.3 and D.4, we have
$$\big|\lambda_{i,c} - \sigma_{i,c}\big| \le \|S - S_k\| \le \alpha nq\,\|S - S_k\|_\infty \le 4\alpha\mu s_r\gamma^{k-1}\sigma_1 \le \frac1{128}\,\sigma_1,\tag{18}$$
where the last inequality follows from the bound on $\alpha$ (the hidden constant in Theorem 5.4 is $\frac1{512}$). If $\sigma_{i,c} = \sigma_1$, we have $\frac{127}{128}\sigma_1 \le |\lambda_{i,c}| \le |\lambda_1|$; if $\lambda_{i,c} = \lambda_1$, we have $|\lambda_1| = |\lambda_{i,c}| \le \frac{4}{512}\sigma_1 + \sigma_{i,c} \le \frac{129}{128}\sigma_1$.

Lemma D.22. Suppose $\gamma\ge\frac1{16}$, (1) $\|S - S_k\|_\infty \le \frac{4\mu s_r}{nq}\gamma^{k-1}\sigma_1$, and (2) $\operatorname{supp}(S - S_k)\subseteq\operatorname{supp}(S)$. Then $\|L_{k+1} - L\|_\infty \le \frac{\mu s_r}{nq}\gamma^k\sigma_1$.

Proof. Let
$$D - S_k = \big[U_{k+1},\,\widetilde U_{k+1}\big]*_\Phi\begin{bmatrix}\Lambda&\\&\widetilde\Lambda\end{bmatrix}*_\Phi\big[U_{k+1},\,\widetilde U_{k+1}\big]^H = U_{k+1}*_\Phi\Lambda*_\Phi U_{k+1}^H + \widetilde U_{k+1}*_\Phi\widetilde\Lambda*_\Phi\widetilde U_{k+1}^H$$
be its eigenvalue decomposition, where $\Lambda$ collects the multi-rank-$r$ largest eigenvalues in magnitude and $\widetilde\Lambda$ contains the rest; $U_{k+1}$ contains the corresponding leading eigenvectors and $\widetilde U_{k+1}$ the rest. Notice that $L_{k+1} = \mathcal H_r(D - S_k) = U_{k+1}*_\Phi\Lambda*_\Phi U_{k+1}^H$ due to the symmetric setting. Also, denote $Z = S - S_k$, let $u_i = U_{k+1}(:,i,:)$ be the $i$th eigenvector of $D - S_k = L + Z$ and $\lambda_i = \Lambda(i,i,:)$ the $i$th eigen fiber of $D - S_k = L + Z$.
Writing $M = D - S_k$, we have $M*_\Phi u_i = u_i*_\Phi\lambda_i$, i.e., slice-wise $M^{(c)}u_i^{(c)} = \lambda_i^{(c)}u_i^{(c)}$, exactly as in part (iii) of the proof of Theorem D.6. Because $M^{(c)} = Z^{(c)} + L^{(c)}$, we get $(\lambda_i^{(c)}I - Z^{(c)})\,u_i^{(c)} = L^{(c)}u_i^{(c)}$, and combining all $q$ slices,
$$\bar u_i = \big(\bar I + \bar E_i\bar Z + (\bar E_i\bar Z)^2 + \cdots\big)\,\bar E_i\,\bar L\,\bar u_i,\qquad \bar E_i = \operatorname{diag}\Big(\tfrac1{\lambda_i^{(1)}}I,\dots,\tfrac1{\lambda_i^{(q)}}I\Big),\quad \bar Z = \operatorname{diag}\big(Z^{(1)},\dots,Z^{(q)}\big).$$
The Neumann series is valid because $\|\bar E_i\bar Z\| \le \frac{\|Z\|}{\min_c|\lambda_i^{(c)}|} < 1$. To see this, by Lemma D.3 we have $|\lambda_{i,c} - \sigma_{i,c}| \le \|Z\|$, so $\sigma_{i,c} - \|Z\| \le |\lambda_{i,c}|$ and hence
$$\frac{\|Z\|}{|\lambda_{i,c}|} \le \frac{\|Z\|}{\sigma_{i,c} - \|Z\|} \le \frac{4\alpha\mu s_r\,\sigma_1}{\sigma_{s_r} - 4\alpha\mu s_r\,\sigma_1} < 1,$$
since $8\alpha\mu s_r\,\sigma_1 < \sigma_{s_r}$ by the bound on $\alpha$ in Theorem 5.4. Then we get $u_i = (I + E_i*_\Phi Z + (E_i*_\Phi Z)^2 + \cdots)*_\Phi E_i*_\Phi L*_\Phi u_i$ for each $u_i$, which, with the same simplification as in the proof of Theorem D.6 ($E_i^{a+1}*_\Phi u_i*_\Phi\lambda_i*_\Phi u_i^H*_\Phi E_i^{b+1} = u_i*_\Phi\Gamma^{(a+b+1)}_i*_\Phi u_i^H$), implies
$$U_{k+1}*_\Phi\Lambda*_\Phi U_{k+1}^H = \sum_{a,b\ge0}Z^{*a}*_\Phi L*_\Phi U_{k+1}*_\Phi\Gamma^{(a+b+1)}*_\Phi U_{k+1}^H*_\Phi L*_\Phi Z^{*b}.$$
Thus,
$$\|L_{k+1} - L\|_\infty \le \underbrace{\|L*_\Phi U_{k+1}*_\Phi\Gamma^{(1)}*_\Phi U_{k+1}^H*_\Phi L - L\|_\infty}_{Y_0} + \sum_{a+b>0}\underbrace{\|Z^{*a}*_\Phi L*_\Phi U_{k+1}*_\Phi\Gamma^{(a+b+1)}*_\Phi U_{k+1}^H*_\Phi L*_\Phi Z^{*b}\|_\infty}_{Y_{ab}}.$$
We handle $Y_0$ first. Recall that $L = U*_\Phi\Sigma*_\Phi V^H$ is the t-SVD of the symmetric tensor $L$, which obeys $\mu$-incoherence, i.e., $U*_\Phi U^H = V*_\Phi V^H$ and $\|e_i^H*_\Phi U*_\Phi U^H\|_F\le\sqrt{\frac{\mu s_r}{nq}}$ for all $i$. Exactly as in the proof of Theorem D.6,
$$Y_0 \le \frac{\mu s_r}{nq}\,\|L*_\Phi U_{k+1}*_\Phi\Gamma^{(1)}*_\Phi U_{k+1}^H*_\Phi L - L\|,$$
using the fact $U*_\Phi U^H*_\Phi L = L*_\Phi U*_\Phi U^H = L$. Since $L = U_{k+1}*_\Phi\Lambda*_\Phi U_{k+1}^H + \widetilde U_{k+1}*_\Phi\widetilde\Lambda*_\Phi\widetilde U_{k+1}^H - Z$, there holds
$$\|L*_\Phi U_{k+1}*_\Phi\Gamma^{(1)}*_\Phi U_{k+1}^H*_\Phi L - L\| \le \|Z - \widetilde U_{k+1}*_\Phi\widetilde\Lambda*_\Phi\widetilde U_{k+1}^H\| + 2\|Z\| + \frac{\|Z\|^2}{|\lambda_{s_r}|} \le \|\widetilde U_{k+1}*_\Phi\widetilde\Lambda*_\Phi\widetilde U_{k+1}^H\| + 4\|Z\| \le |\lambda_{s_r+1}| + 4\|Z\| \le 5\|Z\|,$$
where the last inequalities use the facts $\frac{\|Z\|}{|\lambda_{s_r}|}\le\frac17$ and $|\lambda_{s_r+1}|\le\|Z\|$, the latter because $\sigma_{s_r+1} = 0$ and the tensor Weyl inequality. Thus we have $Y_0 \le \frac{5\mu s_r}{nq}\,\|Z\|$.

Next, we derive an upper bound for the remaining part.
Note that
$$Y_{ab} \le \max_{i,j}\|e_i^H*_\Phi Z^{*a}*_\Phi U\|_F\,\|L*_\Phi U_{k+1}*_\Phi\Gamma^{(a+b+1)}*_\Phi U_{k+1}^H*_\Phi L\|\,\|U^H*_\Phi Z^{*b}*_\Phi e_j\|_F \le \frac{\mu s_r}{nq}\,\big(\alpha nq\sqrt q\,\|Z\|_\infty\big)^{a+b}\,\|L*_\Phi U_{k+1}*_\Phi\Gamma^{(a+b+1)}*_\Phi U_{k+1}^H*_\Phi L\|,$$
where the last inequality follows from Lemma D.5 and the bound on $\alpha$. Furthermore, using $L = U_{k+1}*_\Phi\Lambda*_\Phi U_{k+1}^H + \widetilde U_{k+1}*_\Phi\widetilde\Lambda*_\Phi\widetilde U_{k+1}^H - Z$ again, we get
$$\|L*_\Phi U_{k+1}*_\Phi\Gamma^{(a+b+1)}*_\Phi U_{k+1}^H*_\Phi L\| \le |\lambda_{s_r}|^{-(a+b-1)} + 2\,|\lambda_{s_r}|^{-(a+b)}\|Z\| + |\lambda_{s_r}|^{-(a+b+1)}\|Z\|^2 = |\lambda_{s_r}|^{-(a+b-1)}\Big(1 + \frac{\|Z\|}{|\lambda_{s_r}|}\Big)^2 \le 2\,|\lambda_{s_r}|^{-(a+b-1)}.$$
Together, we have
$$Y_{ab} \le 2\alpha\mu s_r\sqrt q\,\|Z\|_\infty\,\Big(\frac{\alpha nq\sqrt q\,\|Z\|_\infty}{|\lambda_{s_r}|}\Big)^{a+b-1},$$
and summing the geometric series (whose ratio is smaller than one by the bound on $\alpha$) gives
$$\sum_{a+b>0}Y_{ab} \le 3\alpha\mu s_r\sqrt q\,\|Z\|_\infty.$$
Finally, combining them together gives
$$\|L_{k+1} - L\|_\infty \le Y_0 + \sum_{a+b>0}Y_{ab} \le 5\alpha\mu s_r\,\|Z\|_\infty + 3\alpha\mu s_r\sqrt q\,\|Z\|_\infty \le 8\alpha\mu s_r\sqrt q\,\|Z\|_\infty \le \frac{\mu s_r}{nq}\gamma^k\sigma_1,$$
where $Y_0$ is bounded via Lemma D.4 ($\|Z\|\le\alpha nq\,\|Z\|_\infty$), and the last inequality uses $\|Z\|_\infty = \|S - S_k\|_\infty \le \frac{4\mu s_r}{nq}\gamma^{k-1}\sigma_1$, $\gamma\ge\frac1{16}$, and the bound on $\alpha$ in Theorem 5.4.

Lemma D.23. Suppose $\|L - L_{k+1}\|_\infty \le \frac{\mu s_r}{nq}\gamma^k\sigma_1$. Then (1) $\operatorname{supp}(S - S_{k+1})\subseteq\operatorname{supp}(S)$ and (2) $\|S - S_{k+1}\|_\infty \le \frac{4\mu s_r}{nq}\gamma^k\sigma_1$.

Proof. We first prove the first conclusion. Recall that $S_{k+1} = \mathcal T_{\zeta}(D - L_{k+1}) = \mathcal T_{\zeta}(L - L_{k+1} + S)$, where $\zeta = \frac{2\mu s_r}{nq}\gamma^k|\lambda_1|$. If $S_{ijt} = 0$, then $(S - S_{k+1})_{ijt} = -(L - L_{k+1})_{ijt}$ only when $|(L - L_{k+1})_{ijt}| > \zeta$, so the first part follows from the assumption
$$\|L_{k+1} - L\|_\infty \le \frac{\mu s_r}{nq}\gamma^k\sigma_1 \le \frac{2\mu s_r}{nq}\gamma^k|\lambda_1| = \zeta.$$
We now come to the second conclusion:
$$\big|[S - S_{k+1}]_{ijk}\big| \le \begin{cases}0 & (i,j,k)\in\Omega^c,\\\|L - L_{k+1}\|_\infty \le \frac{\mu s_r}{nq}\gamma^k\sigma_1 & (i,j,k)\in\Omega_{k+1},\\\|L - L_{k+1}\|_\infty + \zeta \le \frac{4\mu s_r}{nq}\gamma^k\sigma_1 & (i,j,k)\in\Omega\setminus\Omega_{k+1},\end{cases}$$
where $\Omega := \operatorname{supp}(S)$ and $\Omega_{k+1} := \operatorname{supp}(S_{k+1})$. Hence $\|S - S_{k+1}\|_\infty \le \frac{4\mu s_r}{nq}\gamma^k\sigma_1$.

Theorem D.24. Under the conditions of Theorem 5.4, let $L, S\in\mathbb{R}^{n\times n\times q}$ be two symmetric tensors satisfying Assumptions 5.1 and 5.2. If the initial guesses $L_1$ and $S_1$ obey
$$\|L - L_1\|_\infty \le \frac{\mu s_r}{nq}\,\sigma_1,\qquad \|S - S_1\|_\infty \le \frac{4\mu s_r}{nq}\,\sigma_1,\qquad\text{and}\qquad \operatorname{supp}(S_1)\subseteq\Omega,$$
then the iterates of Algorithm 1 with parameters $\beta = \frac{\mu s_r}{2nq}$ and $\gamma\in(\frac1{16},1)$ satisfy
$$\|L - L_{k+1}\|_\infty \le \frac{\mu s_r}{nq}\gamma^k\sigma_1,\qquad \|S - S_{k+1}\|_\infty \le \frac{4\mu s_r}{nq}\gamma^k\sigma_1,\qquad\text{and}\qquad \operatorname{supp}(S_k)\subseteq\Omega.$$

Proof. The theorem is proved by mathematical induction.
Base case: when $k = 1$, we use the initialization in our efficient implementation, and the condition is satisfied.
Induction step: assume $\|L - L_k\|_\infty \le \frac{\mu s_r}{nq}\gamma^{k-1}\sigma_1$, $\|S - S_k\|_\infty \le \frac{4\mu s_r}{nq}\gamma^{k-1}\sigma_1$, and $\operatorname{supp}(S_k)\subseteq\Omega$ at the $k$th iteration. At the $(k+1)$th iteration, it follows directly from Lemmas D.22 and D.23 that $\|L - L_{k+1}\|_\infty \le \frac{\mu s_r}{nq}\gamma^k\sigma_1$, $\|S - S_{k+1}\|_\infty \le \frac{4\mu s_r}{nq}\gamma^k\sigma_1$, and $\operatorname{supp}(S_{k+1})\subseteq\Omega$.

D.2.5. PROOF OF PROPOSITION 5.3

Proof. The proof of Proposition 5.3 follows directly from Theorem D.6.

D.2.6. PROOF OF THEOREM 5.4

Proof. The initialization $L_1$ and $S_1$ of Algorithm 1 satisfies
$$\|L - L_1\|_\infty \le \frac{\mu s_r}{nq}\,\sigma_1,\qquad \|S - S_1\|_\infty \le \frac{\mu s_r}{nq}\,\sigma_1,\qquad\text{and}\qquad \operatorname{supp}(S_1)\subseteq\Omega.\tag{19}$$
By Theorem D.24, we have
$$\|L - L_k\|_\infty \le \frac{\mu s_r}{nq}\gamma^{k-1}\sigma_1,\qquad \|S - S_k\|_\infty \le \frac{4\mu s_r}{nq}\gamma^{k-1}\sigma_1,\qquad\text{and}\qquad \operatorname{supp}(S_k)\subseteq\Omega.$$
To get the spectral-norm bound on $L - L_k$, we have
$$\|L - L_k\| \le \|L - (D - S_{k-1})\| + \|(D - S_{k-1}) - L_k\| \le 2\,\|L - (D - S_{k-1})\| = 2\,\|S - S_{k-1}\| \le 2\alpha nq\,\|S - S_{k-1}\|_\infty \le 8\alpha\mu s_r\gamma^{k-2}\sigma_1.$$
By setting $\epsilon = \mu s_r\gamma^{k-1}\sigma_1$, we prove this theorem.

D.2.7. PROOF OF THEOREM 5.5

Proof. The proof of Theorem 5.5 follows directly from Theorem D.20 by setting $\epsilon = \mu s_r\gamma^k\sigma_1$.
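For completeness, the geometric decay established above translates into the $O(\log(1/\epsilon))$ iteration complexity reported in Table 1 by the following short calculation (ours, obtained directly by inverting the decay): if the error after $k$ iterations is at most $C\,\gamma^{k-1}$ for some fixed constant $C>0$ and $\gamma\in(0,1)$, then it falls below $\epsilon$ as soon as
$$C\,\gamma^{k-1}\le\epsilon\ \iff\ k \ge 1 + \log_{1/\gamma}\!\Big(\frac{C}{\epsilon}\Big) = O\big(\log(1/\epsilon)\big),$$
so both APT and EAPT reach $\epsilon$-accuracy in $O(\log(1/\epsilon))$ iterations, each iteration costing the per-iteration complexity listed in Table 1.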