# Tensor Decomposition with Smoothness

Masaaki Imaizumi¹, Kohei Hayashi²³

¹Institute of Statistical Mathematics, ²National Institute of Advanced Industrial Science and Technology, ³RIKEN. Correspondence to: Masaaki Imaizumi.

*Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).*

**Abstract.** Real data tensors are typically high dimensional; however, their intrinsic information is preserved in a low-dimensional space, which motivates the use of tensor decompositions such as Tucker decomposition. Frequently, real data tensors are also smooth in addition to being low dimensional, which implies that adjacent elements are similar or change continuously. Such tensors typically appear as spatial or temporal data. We propose smoothed Tucker decomposition (STD) to incorporate the smoothness property. STD leverages smoothness using the sum of a few basis functions, which reduces the number of parameters. The objective function is formulated as a convex problem, and an algorithm based on the alternating direction method of multipliers is derived to solve it. We theoretically show that, under the smoothness assumption, STD achieves a better error bound. The theoretical results and the performance of STD are numerically verified.

## 1. Introduction

A tensor (i.e., a multi-way array) is a data structure that generalizes a matrix and can represent higher-order relationships. Tensors appear in various applications such as image analysis (Jia et al., 2014), data mining (Kolda & Sun, 2008), and medical analysis (Zhou et al., 2013). For instance, functional magnetic resonance imaging (fMRI) records brain activity in each time period as voxels, which are represented as 4-way tensors (X-axis × Y-axis × Z-axis × time). Frequently, data tensors in the real world contain several missing elements and/or are corrupted by noise, which leads to the tensor completion problem of predicting missing elements and the tensor recovery problem of removing noise.

To solve these problems, the low-rank assumption, i.e., that a given tensor is generated from a small number of latent factors, is widely used. If the number of observed elements is sufficiently large relative to the number of latent factors (i.e., the rank) and the noise level, we can estimate the latent factors and reconstruct the entire structure. The methods for estimating latent factors are collectively referred to as tensor decompositions. There are several formulations of tensor decompositions, such as Tucker decomposition (Tucker, 1966) and the CANDECOMP/PARAFAC (CP) decomposition (Harshman, 1970). While these methods were originally formulated as nonconvex problems, several authors have studied their convex relaxations in recent years (Liu et al., 2009; Tomioka et al., 2010; Signoretto et al., 2011; Gandy et al., 2011).

Another important, yet less explored, assumption is the smoothness property. Consider fMRI data as a tensor $X$. As fMRI data are spatiotemporal, each element of $X$ is expected to be similar to its adjacent elements along every mode, i.e., $x_{i,j,k,t}$ should be close to $x_{i\pm 1,j,k,t}$, $x_{i,j\pm 1,k,t}$, $x_{i,j,k\pm 1,t}$, and $x_{i,j,k,t\pm 1}$. In statistics, this kind of smoothness property has been studied through functional data analysis (Ramsay, 2006; Hsing & Eubank, 2015). These studies show that the smoothness assumption increases sample efficiency, i.e., estimation is more accurate with a small sample size.
Another advantage is that the smoothness assumption makes interpolation possible, i.e., we can impute an unobserved value using its adjacent observed values. This interpolation ability is particularly useful for solving a specific tensor completion problem referred to as the tensor interpolation problem, also known as the cold-start problem (Gantner et al., 2010). Consider a case in which the fMRI tensor $X$ is completely missing at $t = t_0$. In this case, standard tensor decompositions cannot predict the missing elements because there is no information to estimate the latent factor at $t = t_0$. However, using the smoothness property, we can estimate the missing elements from the elements at $t = t_0 - 1$ and $t = t_0 + 1$.

A fundamental challenge for tensor completion and recovery methods is to analyze their performance. Tomioka et al. (2011) extensively studied the statistical performance of low-rank tensor decompositions. On the contrary, the performance of tensor decompositions incorporating smoothness (Yokota et al., 2015b;a; Amini et al., 2013) has never been addressed. The most important barrier is that all such methods are formulated as nonconvex problems, which hinders the use of the tools developed for convex tensor decompositions (Liu et al., 2009; Signoretto et al., 2010; Tomioka et al., 2010).

**Contributions.** In this paper, we propose a simple tensor decomposition model incorporating the smoothness property, which we refer to as Smoothed Tucker Decomposition (STD). Following the notions of functional data analysis, STD approximates an observed tensor by a small number of basis functions, such as Fourier series, and decomposes them through Tucker decomposition (Figure 1). STD is formulated as a convex optimization problem that regularizes the ranks and the degree of smoothness. To solve this problem, we derive an algorithm based on the alternating direction method of multipliers (ADMM), which always finds the global optimum. Based on the convex formulation, we provide several theoretical guarantees for STD; namely, we derive error bounds for the tensor recovery and interpolation problems. We show that the error bounds for smooth tensors are improved and better than those for other methods. In addition, to the best of our knowledge, this is the first analysis that establishes an error bound for tensor interpolation. These results are empirically confirmed through experiments using synthetic and real data. To summarize, STD has the following advantages.

- **Sample efficiency:** STD achieves the same error with a smaller sample size.
- **Interpolation ability:** STD can solve the tensor interpolation problem.
- **Convex formulation:** STD ensures that a global solution is obtained.

**Related Works.** A few authors have investigated the smoothness property for tensor decompositions. Amini et al. (2013) proposed a kernel method, and Yokota et al. (2015a;b) developed a smooth decomposition method for matrices and tensors using basis functions. These studies demonstrated that the smoothness assumption significantly improves the performance of tensor decompositions in actual applications such as noisy image reconstruction (Yokota et al., 2015b). However, these performance gains were confirmed only in an empirical manner. Several authors have addressed the tensor interpolation problem by extending tensor decompositions; however, instead of smoothness, these methods utilize additional information such as network structures (Hu et al., 2015) or side information (Gantner et al., 2010; Narita et al., 2011).
Moreover, the performance of the tensor interpolation problem has never been analyzed theoretically.

*Figure 1. Comparison of Tucker decomposition and STD. In STD, the mode-wise smoothness of the observed tensor is preserved via basis functions.*

## 2. Preliminaries

Given $K \in \mathbb{N}$ and natural numbers $I_1, \dots, I_K \in \mathbb{N}$, let $\mathcal{X} := \mathbb{R}^{I_1 \times \dots \times I_K}$ be the space of $K$-way tensors, and let $X \in \mathcal{X}$ be a $K$-way tensor. For convenience, we define $I_{\backslash k} := \prod_{k' \neq k} I_{k'}$. Each way of a tensor is referred to as a mode; $I_k$ is the dimensionality of the $k$-th mode for $k = 1, \dots, K$. For a vector $Y \in \mathbb{R}^d$, $[Y]_j$ denotes its $j$-th element. Similarly, $[X]_{j_1 j_2 \dots j_K}$ denotes the $(j_1, j_2, \dots, j_K)$-th element of $X$. The inner product on $\mathcal{X}$ is defined as $\langle X, X' \rangle = \sum_{j_1, \dots, j_K = 1}^{I_1, \dots, I_K} [X]_{j_1, \dots, j_K} [X']_{j_1, \dots, j_K}$ for $X, X' \in \mathcal{X}$; it induces the Frobenius norm $|||X|||_F = \sqrt{\langle X, X \rangle}$. For a vector $Z \in \mathbb{R}^d$, $\|Z\| = \sqrt{Z^T Z}$ denotes the Euclidean norm. In addition, we introduce the $L^2$ norm for functions, $\|f\|_{L^2}^2 := \int_I f(t)^2 \, dt$, for a function $f : I \to \mathbb{R}$ with some domain $I \subset \mathbb{R}$. $C^{\alpha}(I)$ denotes the set of $\alpha$-times differentiable functions on $I$.

### 2.1. Tucker Decomposition

With a set of finite positive integers $(R_1, \dots, R_K)$, the Tucker decomposition of $X$ is defined as

$$X = \sum_{r_1, \dots, r_K = 1}^{R_1, \dots, R_K} g_{r_1 \dots r_K} \, u^{(1)}_{r_1} \otimes u^{(2)}_{r_2} \otimes \dots \otimes u^{(K)}_{r_K}, \tag{1}$$

where $g_{r_1 \dots r_K} \in \mathbb{R}$ is a coefficient for each $r_k$, $\otimes$ denotes the tensor product, and $u^{(k)}_{r_k} \in \mathbb{R}^{I_k}$ is a vector for each $r_k$ $(k = 1, \dots, K)$; the vectors are orthogonal to each other for $r_k = 1, \dots, R_k$. Here, we refer to $(R_1, \dots, R_K)$ as the Tucker rank, and $X$ is an $(R_1, \dots, R_K)$-rank tensor. In addition, we let the tensor $G \in \mathbb{R}^{R_1 \times \dots \times R_K}$ with $[G]_{r_1, \dots, r_K} = g_{r_1 \dots r_K}$ be the core tensor, and the matrix $U^{(k)} = (u^{(k)}_1 \cdots u^{(k)}_{R_k}) \in \mathbb{R}^{I_k \times R_k}$ collect the vectors for each $k = 1, \dots, K$. Using this notation, Tucker decomposition (1) can be written as

$$X = G \times_1 U^{(1)} \times_2 U^{(2)} \times_3 \dots \times_K U^{(K)}, \tag{2}$$

where $\times_k$ denotes the $k$-mode matrix product (see Kolda & Bader (2009) for more details).
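To make the notation concrete, the following NumPy sketch (ours, not from the paper) implements the mode-$k$ unfolding $X_{(k)}$ and the $k$-mode product $\times_k$, and reconstructs a tensor from a core tensor and orthonormal factor matrices as in (2). The function names `unfold` and `mode_product` are our own.

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding X_(k): arrange the mode-k fibers of X as columns."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_product(X, U, k):
    """k-mode product X x_k U, where U has shape (J, I_k)."""
    Xk = unfold(X, k)                      # shape (I_k, prod of other dims)
    Y = U @ Xk                             # shape (J, prod of other dims)
    other_dims = [X.shape[i] for i in range(X.ndim) if i != k]
    return np.moveaxis(Y.reshape([U.shape[0]] + other_dims), 0, k)

# Tucker reconstruction (2): X = G x_1 U^(1) x_2 U^(2) x_3 U^(3).
I, R = (10, 10, 20), (3, 3, 2)
G = np.random.randn(*R)                                                  # core tensor
Us = [np.linalg.qr(np.random.randn(I[k], R[k]))[0] for k in range(3)]    # orthonormal factors
X = G
for k in range(3):
    X = mode_product(X, Us[k], k)
print(X.shape)  # (10, 10, 20)
```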
### 2.2. Application Problems for Tensors

Let $S \subset \{(j_1, j_2, \dots, j_K)\}_{j_1, j_2, \dots, j_K = 1}^{I_1, I_2, \dots, I_K}$ be an index set and $n := |S|$. Let $j(i)$ be the $i$-th element of $S$ for $i = 1, \dots, n$. Then, we consider the following observation model:

$$y_i = [X^*]_{j(i)} + \epsilon_i, \tag{3}$$

where $X^* \in \mathcal{X}$ is an unobserved true tensor, $\{y_i\}_{i=1}^n$ is the set of observed values, and $\epsilon_i$ is noise with mean zero and variance $\sigma^2$. We define an observation vector $Y := (y_1 \cdots y_n)^T \in \mathbb{R}^n$ and a noise vector $E := (\epsilon_1 \cdots \epsilon_n)^T \in \mathbb{R}^n$. Additionally, we define a rearranging operator $\mathfrak{X} : \mathcal{X} \to \mathbb{R}^n$ via $[\mathfrak{X}(X)]_i = [X]_{j(i)}$. Using this notation, observation model (3) is written as

$$Y = \mathfrak{X}(X^*) + E. \tag{4}$$

When all the elements are observed, i.e., $n = \prod_k I_k$, the problem of estimating $X^*$ is referred to as the tensor recovery problem. When a few elements of $X^*$ are missing, i.e., $n < \prod_k I_k$, the problem is referred to as the tensor completion problem. Specifically, if for some mode $k$ there exists an index $j'_k \in [I_k]$ that $S$ does not contain, we refer to the problem of estimating the elements $[X^*]_{j_1, \dots, j'_k, \dots, j_K}$ for all $j_1, \dots, j_K$ as the tensor interpolation problem.

Using observation model (4), we provide an estimator for the unknown true tensor $X^*$. The estimator $\hat{X}$ is obtained by solving the following optimization problem:

$$\hat{X} \in \operatorname*{argmin}_{X \in \mathcal{X}'} \; \frac{1}{2n} \|Y - \mathfrak{X}(X)\|^2 + \Omega(X), \tag{5}$$

where $\mathcal{X}'$ is a convex subset of $\mathcal{X}$ and $\Omega : \mathcal{X}' \to \mathbb{R}_+$ is a regularization term.

For the regularization $\Omega(X)$, the overlapped Schatten 1-norm is frequently used (Liu et al., 2009; Tomioka et al., 2010; Signoretto et al., 2011; Gandy et al., 2011); it is defined as

$$|||X|||_s := \frac{1}{K} \sum_{k=1}^{K} \|X_{(k)}\|_s := \frac{1}{K} \sum_{k=1}^{K} \sum_{r_k} \sigma_{r_k}(X_{(k)}),$$

where $X_{(k)} \in \mathbb{R}^{I_k \times \prod_{k' \neq k} I_{k'}}$ denotes the unfolding matrix obtained by concatenating the mode-$k$ fibers of $X$ as column vectors, and $\sigma_{r_k}(X_{(k)})$ denotes the $r_k$-th largest singular value of $X_{(k)}$. This penalty term regularizes the Tucker rank of $X$ (Negahban & Wainwright, 2011; Tomioka et al., 2011).

To solve problem (5) with the Schatten regularization, ADMM is frequently employed (Boyd et al., 2011; Tomioka et al., 2010). ADMM generates a sequence of variables and Lagrangian multipliers by iteratively minimizing the augmented Lagrangian function. It is known that ADMM can easily handle optimization problems with non-differentiable regularization terms such as $|||\cdot|||_s$.

## 3. STD: Smoothed Tucker Decomposition

### 3.1. Smoothness on Tensors

Before explaining the proposed approach, we introduce the notion of smoothness on tensors. We start with the idea that a data tensor is obtained by discretizing a multivariate function. For example, consider an observation model for the wind power on a land surface. Suppose that the land surface is described by a plane $[0, 1]^2$ (i.e., longitude and latitude) and the observation model is given by a function $f : [0, 1]^2 \to \mathbb{R}$. Assume that we have infinite memory space, so that we can record the wind power $y = f(a, b)$ for any points $a, b \in [0, 1]$. In such an unrealistic case, it is possible to handle the entire information about $f$. However, only finite memory space is available, so we must resort to retaining finite observations $\{f(a_i, b_i)\}_{i=1}^n$. If the points $(a_i, b_i)$ lie on a grid, the observations can be regarded as a matrix.

This idea is generalized to tensors as follows. Consider a $K$-variate function $f_X : [0, 1]^K \to \mathbb{R}$ and a set of grid points $\{(g_{j_1}, \dots, g_{j_K}) \in [0, 1]^K\}_{j_1, \dots, j_K = 1}^{I_1, \dots, I_K}$. Then, each element of $X$ is represented as

$$[X]_{j_1 \dots j_K} = f_X(g_{j_1}, \dots, g_{j_K}), \tag{6}$$

for $j_k = 1, 2, \dots$ with each $k = 1, \dots, K$. As the smoothness of the function, we assume that $f_X$ is differentiable with respect to all $K$ arguments, which allows $f_X$ to be expanded by a few useful basis functions (see, for example, Tsybakov (2008)) and allows multivariate functions to be decomposed by the basis (see Hackbusch (2012) for details). Let $\{\phi^{(k)}_m : [0, 1] \to \mathbb{R}\}_m$ be a set of orthonormal basis functions, such as Fourier series or wavelet series, and let $\{w_{m_1, \dots, m_K} \in \mathbb{R}\}_{m_1, \dots, m_K}$ be a set of coefficients. Because of the differentiability, $f_X$ is written as the weighted sum of the basis functions:

$$f_X(x_1, \dots, x_K) = \sum_{m_1, \dots, m_K = 1}^{\infty} w_{m_1 \dots m_K} \, \phi^{(1)}_{m_1}(x_1) \cdots \phi^{(K)}_{m_K}(x_K). \tag{7}$$

Combining (6) and (7) yields a formulation of the elements of the smooth tensor:

$$[X]_{j_1, \dots, j_K} = \sum_{m_1, \dots, m_K = 1}^{\infty} w_{m_1 \dots m_K} \, \phi^{(1)}_{m_1}(g_{j_1}) \cdots \phi^{(K)}_{m_K}(g_{j_K}). \tag{8}$$

Hereafter, we say that $X$ is smooth if it follows (8).
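As a concrete illustration of representation (8) with a finite number of terms (anticipating the truncation in (9) below), the following NumPy sketch builds a smooth 3-way tensor from a truncated cosine basis evaluated on equispaced grids. The cosine basis, the grid $g_j = j / I_k$, and the random coefficient tensor `W` are our own illustrative choices, not prescribed by the paper.

```python
import numpy as np

def cosine_basis(M, I):
    """Evaluate an orthonormal cosine basis phi_1, ..., phi_M on the grid g_j = j / I."""
    g = np.arange(1, I + 1) / I                   # grid points in (0, 1]
    Phi = np.ones((I, M))
    for m in range(1, M):
        Phi[:, m] = np.sqrt(2.0) * np.cos(np.pi * m * g)
    return Phi                                    # shape (I_k, M^(k))

I = (30, 30, 20)          # tensor dimensions I_1, I_2, I_3
M = (5, 5, 4)             # truncation levels M^(1), M^(2), M^(3)
W = np.random.randn(*M)   # coefficient tensor W_X (random here, for illustration)
Phis = [cosine_basis(M[k], I[k]) for k in range(3)]

# [X]_{j1 j2 j3} = sum_{m1 m2 m3} W_{m1 m2 m3} phi_{m1}(g_{j1}) phi_{m2}(g_{j2}) phi_{m3}(g_{j3})
X = np.einsum('abc,ia,jb,kc->ijk', W, Phis[0], Phis[1], Phis[2])
print(X.shape)  # (30, 30, 20): a smooth tensor in the sense of (8)
```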
### 3.2. Objective Function

Model (8) is not directly applicable because it requires an infinite number of basis functions. To avoid this, we consider a truncation. Let $M^{(k)} < \infty$ be the number of basis functions for mode $k$, which represents the degree of smoothness of $X$ in terms of mode $k$. For example, when $M^{(k)}$ is large, such as $M^{(k)} = I_k$ for all $k$, the basis function formulation can represent any $X$, which implies that it neglects the smooth structure. Then, we consider $X$ that satisfies the following relation:

$$[X]_{j_1, \dots, j_K} = \sum_{m_1, \dots, m_K = 1}^{M^{(1)}, \dots, M^{(K)}} w_{m_1 \dots m_K} \, \phi^{(1)}_{m_1}(g_{j_1}) \cdots \phi^{(K)}_{m_K}(g_{j_K}). \tag{9}$$

For practical use, let $W_X \in \mathbb{R}^{M^{(1)} \times \dots \times M^{(K)}}$ be the coefficient tensor that satisfies $[W_X]_{m_1 \dots m_K} = w_{m_1 \dots m_K}$ for a given $X$.

Using this representation, we propose the objective function of STD. Based on the same convex optimization approach as (5), we define the objective function as

$$\hat{X} \in \operatorname*{argmin}_{X} \; \frac{1}{2n} \|Y - \mathfrak{X}(X)\|^2 + \lambda_n |||W_X|||_s + \mu_n |||W_X|||_F^2, \tag{10}$$

where $\lambda_n, \mu_n \geq 0$ are regularization coefficients that depend on $n$, with $\lambda_n, \mu_n \to 0$ as $n \to \infty$. Here, the regularization terms $|||W_X|||_s$ and $|||W_X|||_F^2$ are employed.

There are three primary advantages of the formulation given by (10). Firstly, (10) is a convex optimization problem; thus, the global solution for $W_X$ is guaranteed to be obtained. Secondly, the regularization term $|||W_X|||_s$ determines the Tucker rank of $W_X$ appropriately. Even though we must select $\lambda_n$, this is considerably easier than selecting the ranks directly. Thirdly, the regularization $|||W_X|||_F^2$ penalizes the smoothness of $f_X$. Note that the smoothness of $f_X$ is related to $M^{(k)}$, and we introduce $|||W_X|||_F^2$ to select an appropriate degree of smoothness.

### 3.3. Algorithm

To optimize (10), we first reformulate it through vectorization and matricization. We define $x \in \mathbb{R}^{\prod_k I_k}$ as the vectorization of $X$, and let $Q$ be an $n \times \prod_k I_k$ matrix that is the matricized version of the rearranging operator $\mathfrak{X}$. We define $M^{(\backslash k)} := \prod_{k' \neq k} M^{(k')}$, and $Z_k \in \mathbb{R}^{M^{(k)} \times M^{(\backslash k)}}$ is the mode-$k$ unfolding matrix of $W_X$. Let $w \in \mathbb{R}^{\prod_k M^{(k)}}$ be the vectorization of $W_X$. Let $\Phi$ be the operator that converts $W_X$ into $X$ as given by (9) using $\{\phi_m(g_j)\}_{m,j}$, and let $\Gamma$ be a $\prod_k I_k \times \prod_k M^{(k)}$ matrix that satisfies $\Phi(w) = \Gamma w$. As $\Phi$ is a linear mapping by (9), the existence of $\Gamma$ is ensured. Then, (10) is rewritten as follows:

$$\min_{x, w, \{Z_k\}_{k=1}^K} \; \frac{1}{2n} \|Y - Qx\|^2 + \frac{\lambda_n}{K} \sum_{k=1}^{K} \|Z_k\|_s + \mu_n \|w\|^2, \quad \text{s.t. } x = \Phi(w), \; P_k(w) = Z_k, \; \forall k, \tag{11}$$

where $P_k : \mathbb{R}^{\prod_k M^{(k)}} \to \mathbb{R}^{M^{(k)} \times M^{(\backslash k)}}$ is a rearranging operator from the vector to the unfolding matrix. Note that $|||W_X|||_F = \|w\|$ holds by the definition of $w$.

We use the ADMM approach to solve (11). Optimizing the augmented Lagrangian function for (11) alternately, we obtain the following iteration steps. Here, $\eta > 0$ is a step size, and $\{\alpha_k \in \mathbb{R}^{\prod_k M^{(k)}}\}_{k=1}^K$ and $\beta \in \mathbb{R}^{\prod_k I_k}$ are Lagrangian multipliers. Let $(x^0, w^0, \{Z_k^0\}_{k=1}^K, \{\alpha_k^0\}_{k=1}^K, \beta^0)$ be an initial point. The ADMM step at the $\tau$-th iteration is written as follows:

$$x^{\tau+1} = (Q^T Q + n\eta I)^{-1} \big( Q^T Y - n\beta^{\tau} + n\eta \, \Gamma w^{\tau} \big),$$

$$w^{\tau+1} = (2\mu_n I + K\eta I + \eta \, \Gamma^T \Gamma)^{-1} \Big( \sum_{k=1}^{K} \big( \eta \, P_k^{-1}(Z_k^{\tau}) - \alpha_k^{\tau} \big) + \Gamma^T \beta^{\tau} + \eta \, \Gamma^T x^{\tau+1} \Big),$$

$$Z_k^{\tau+1} = \mathrm{prox}_{\lambda_n / (K\eta)} \big( P_k(w^{\tau+1} + \eta^{-1} \alpha_k^{\tau}) \big),$$

$$\alpha_k^{\tau+1} = \alpha_k^{\tau} + \eta \big( w^{\tau+1} - P_k^{-1}(Z_k^{\tau+1}) \big),$$

$$\beta^{\tau+1} = \beta^{\tau} + \eta \big( x^{\tau+1} - \Gamma w^{\tau+1} \big),$$

where $\mathrm{prox}_{c}(\cdot)$ denotes the shrinkage operation of the singular values, defined as $\mathrm{prox}_{c}(Z) = U \max(S - cI, 0) V^T$, where $S$, $U$, and $V$ are obtained through the singular value decomposition $Z = USV^T$. Note that the ADMM steps for $Z_k$ and $\alpha_k$ are required for every $k = 1, \dots, K$. For $\eta$, Tomioka et al. (2010) suggest setting $\eta = \eta_0 / \mathrm{Var}(y_i)$ with some constant $\eta_0$. As the regularization terms are convex, the sequence of ADMM variables is ensured to converge to the optimal solution of (10) (Gandy et al., 2011, Theorem 5.1).
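A minimal NumPy sketch of the singular-value shrinkage operator $\mathrm{prox}_c$ used in the $Z_k$ update above; it is the standard soft-thresholding of singular values. The function name `prox_schatten` and the example matrix are ours.

```python
import numpy as np

def prox_schatten(Z, c):
    """Singular-value soft-thresholding: U max(S - c, 0) V^T for Z = U S V^T."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - c, 0.0)) @ Vt

# Example: shrink a noisy rank-3 unfolding with threshold c = lambda_n / (K * eta).
rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 40)) \
    + 0.01 * rng.standard_normal((8, 40))
Z_shrunk = prox_schatten(Z, c=0.5)
print(np.linalg.matrix_rank(Z_shrunk, tol=1e-8))  # 3: small singular values are set to zero
```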
### 3.4. Practical Issues

STD has several hyperparameters, i.e., $\lambda_n$, $\mu_n$, $\{M^{(k)}\}_k$, and $\{\phi_m\}_m$. $\lambda_n$ and $\mu_n$ can be determined through cross-validation. $\{M^{(k)}\}_k$ is initialized with a large value and reduced during the algorithm depending on $\mu_n$. As $M^{(k)}$ does not exceed $I_k$ for identifiability reasons, the initial value of $M^{(k)}$ is bounded. In practice, $M^{(k)}$ is considerably less than $I_k$, so we can start the iteration with small values.

One might argue that some data tensors are smooth along one mode but not along others. We emphasize that STD can address such a situation by controlling $M^{(k)}$ for each $k$. As STD can represent tensors without the smooth structure when $M^{(k)} = I_k$, setting $M^{(k)} = I_k$ for the non-smooth modes and $M^{(k)} < I_k$ for the other modes is sufficient to address the situation.

In this study, selecting the form of the basis functions $\{\phi_m\}_m$ is not our primary interest because it does not specifically affect the theoretical result. However, there are a few typical choices. For instance, when the data tensor is periodic, such as audio, the Fourier basis is appropriate, even though other functions, such as wavelet or spline functions, also provide theoretical guarantees for approximating $f$.

## 4. Theoretical Analysis

We introduce a few notations for convenience. Let $|||X|||_m := \frac{1}{K} \sum_{k=1}^{K} \max_r \sigma_r(X_{(k)})$ be a norm of a tensor, which is necessary for evaluating the penalty parameters. Let $\mathfrak{X}^*$ be the adjoint operator of $\mathfrak{X}$, namely, $\langle \mathfrak{X}(X), z \rangle = \langle X, \mathfrak{X}^*(z) \rangle$ holds for all $X \in \mathcal{X}$ and $z \in \mathbb{R}^n$. As a theoretical requirement, we let the basis functions $\{\phi_j : [0, 1] \to \mathbb{R}\}_j$ be uniformly bounded for all $j \geq 1$. All proofs for this section are provided in the supplemental material.

### 4.1. Error Bound with $X^*$

First, we impose the following assumption on $\mathfrak{X}$.

**Assumption 1** (Restricted Strong Convexity (RSC) condition). There exists a finite constant $C_{\mathfrak{X}} > 0$ depending on $\{I_k\}_k$ such that the rearranging operator $\mathfrak{X}$ satisfies

$$\frac{1}{2n} \|\mathfrak{X}(X)\|^2 \geq C_{\mathfrak{X}} \, |||X|||_F^2$$

for all $X \in \mathcal{X}$.

Intuitively, this assumption requires that $\mathfrak{X}$ is sufficiently sensitive to perturbations of $X$. A similar type of condition, the restricted isometry condition, has been used in previous studies on sparse regression such as LASSO (Bickel et al., 2009; Raskutti et al., 2010). The RSC condition is weaker than the isometry condition because the RSC condition requires only the lower bound.

We provide the following lemma on the error bound when the true tensor $X^*$ may be neither smooth nor low-rank. Let $(R^W_1, \dots, R^W_K)$ be the Tucker rank of $W_X$.

**Lemma 2.** Consider $X^* \in \mathcal{X}$ and a rearranging operator $\mathfrak{X}$ that satisfies the RSC condition. Suppose there exist sequences $\lambda_n$, $\mu_n$, and $\delta_n$ that satisfy $\delta_n \geq c \big( \frac{2}{n} |||\mathfrak{X}^*(E)|||_m + \lambda_n \big)$ and $c \mu_n < 3/4$ with a constant $c > 0$. Then, with some constants $C_1, C_2, C_3 > 0$, we have

$$|||\hat{X} - X^*|||_F \leq \max \{ \mathrm{I}, \mathrm{II}, \mathrm{III} \},$$

where the terms $\mathrm{I}$, $\mathrm{II}$, and $\mathrm{III}$ depend on $C_1$, $C_2$, and $C_3$, respectively, and $\mathrm{III}$ involves the tail coefficients $\{w^*_{m_1 \dots m_K} : m_k > M^{(k)}\}_k$, where $w^*_{m_1 \dots m_K}$ denotes the coefficients of $X^*$.

Lemma 2 states that the estimation error of $\hat{X}$ is bounded by three types of terms: (I) indicates the error resulting from estimating a tensor that is smooth and low-rank; (II) indicates the error resulting from introducing the low-rank structure; and (III) indicates the error resulting from approximating $X^*$ by a smooth tensor. From Lemma 2, we see that (II) and (III) disappear and only (I) remains when $X^*$ is low-rank and smooth, which we show in the next proposition. Here, we define $\Theta$ as the set of tensors represented by (9) with $\{M^{(k)}\}_k$ and a coefficient tensor $W_X$ of rank $(R^W_1, \dots, R^W_K)$.

**Proposition 3.** Suppose the same conditions as in Lemma 2 hold and $X^* \in \Theta$ is smooth and low-rank. Then, with some constant $C_f > 0$, we have

$$|||\hat{X} - X^*|||_F \leq C_f \, \delta_n.$$

### 4.2. Error Bound with $f_X$

One of the advantages of STD is that it can estimate not only $X^*$ but also the smooth function $f_{X^*}$ defined by (6), which allows for the interpolation of $X^*$. Here, we evaluate the estimation error of STD with respect to the norm $\|\cdot\|_{L^2}$ on the functional space.
Let us define

$$f_{X^*} := \sum_{m_1, \dots, m_K = 1}^{M^{(1)}, \dots, M^{(K)}} w^*_{m_1 \dots m_K} \, \phi^{(1)}_{m_1} \otimes \dots \otimes \phi^{(K)}_{m_K},$$

which is the smooth function obtained as the limit of $X^*$ as $I_k \to \infty$ for all $k \in \{1, \dots, K\}$. We define the estimator of $f_{X^*}$ as

$$f_{\hat{X}} := \sum_{m_1, \dots, m_K = 1}^{M^{(1)}, \dots, M^{(K)}} \hat{w}_{m_1 \dots m_K} \, \phi^{(1)}_{m_1} \otimes \dots \otimes \phi^{(K)}_{m_K},$$

where $\hat{w}_{m_1 \dots m_K}$ is an element of $W_{\hat{X}}$. The estimation error is bounded as follows.

**Lemma 4.** Suppose that the rearranging operator $\mathfrak{X}$ satisfies the RSC condition and the true tensor $X^* \in \Theta$. Then, with some constant $C_F > 0$, we have

$$\sup_{g \in [0, 1]^K} |f_{\hat{X}}(g) - f_{X^*}(g)| \leq C_F \, \delta_n.$$

When $\delta_n \to 0$ under the chosen setting, Lemma 4 shows that $f_{\hat{X}}$ estimated by STD uniformly converges to $f_{X^*}$.

### 4.3. Applications and Comparison

To discuss the result of Lemma 2 more precisely, we consider the following two practical settings: tensor recovery and tensor interpolation. For each setting, we derive rigorous error bounds.

#### 4.3.1. Tensor Recovery

We consider the case in which all the elements of $X^*$ are observed and affected by noise, i.e., we set $n = \prod_{k=1}^K I_k$ and $\mathfrak{X}$ is a vectorization operator. Then, by applying Lemma 2, we obtain the following result.

**Theorem 5.** Suppose that $X^* \in \Theta$, the rearranging operator $\mathfrak{X}$ satisfies the RSC condition, and the noise is i.i.d. Gaussian. Let $C_{10}, C_{11} > 0$ be finite constants. By setting $\delta_n = C_{10} \, \sigma \, \frac{1}{n} \sum_{k=1}^{K} \big( \sqrt{I_k} + \sqrt{I_{\backslash k}} \big)$, with high probability, we have

$$|||\hat{X} - X^*|||_F^2 \leq C_{11} \, \delta_n^2.$$

Note that, in the bound of Theorem 5, the part $\sqrt{I_k} + \sqrt{I_{\backslash k}}$ comes from the noise and the remaining part comes from the Tucker rank of $W_X$.

#### 4.3.2. Tensor Interpolation

Lemma 4 shows that STD can estimate the value of $f_{X^*}$ for all $g \in [0, 1]^K$, not only on the given grid $\{(g_{j_1}, \dots, g_{j_K})\} \subset [0, 1]^K$. By tuning $\delta_n$ in Lemma 4, we obtain the error bound for the tensor interpolation problem.

**Theorem 6.** Suppose that $X^* \in \Theta$, the rearranging operator $\mathfrak{X}$ satisfies the RSC condition, and the noise is i.i.d. Gaussian. Let $C_{20}, C_{21} > 0$ be finite constants. By setting $\delta_n = \frac{C_{20}}{\sqrt{n} \, K} \, \sigma \sum_{k=1}^{K} \big( \sqrt{I_k} + \sqrt{I_{\backslash k}} \big)$, with high probability, we have

$$\sup_{g \in [0, 1]^K} |f_{\hat{X}}(g) - f_{X^*}(g)| \leq C_{21} \, \delta_n.$$

### 4.4. Comparison to Related Studies

Several studies have derived error bounds for $|||\hat{X} - X^*|||_F^2 / n$ in related settings. Tomioka et al. (2011) investigated the tensor decomposition problem with the overlapped Schatten norm and derived an error bound, referred to as (12) below, whose leading factor is determined by the Tucker rank $(R^X_1, \dots, R^X_K)$ of $X^*$. The error bound in Proposition 3 is obtained by replacing the part $R^X_k$ in (12) by $R^W_k$. Tomioka & Suzuki (2013) and Wimalawarne et al. (2014) introduced modified norms based on the Schatten norm and derived other error bounds. Table 1 compares the main coefficients governing the convergence. When the tensor is sufficiently smooth, i.e., when $R^W_k$ is smaller than $R^X_k$, the bound of STD is tighter than those of the other methods.

| Method | Recovery | Interpolation | Assumption |
| --- | --- | --- | --- |
| Tomioka et al. (2011) | $O(R^X I^2)$ | N/A | low-rank |
| Tomioka & Suzuki (2013) | $O(R^X I^2)$ | N/A | low-rank |
| Wimalawarne et al. (2014) | $O(R^X I^2)$ | N/A | low-rank |
| STD | $O(R^W I^2)$ | $O(\cdot)$ (Theorem 6) | low-rank & smooth |

*Table 1. Comparison of error bounds under the low-rank and smoothness assumptions. For clarity, special cases of the error bounds are shown, where the shape of $X^*$ is symmetric, i.e., $I := I_1 = \dots = I_K$, $R^X := R^X_1 = \dots = R^X_K$, and $R^W := R^W_1 = \dots = R^W_K$.*

## 5. Experiment

### 5.1. Theoretical Validation

Firstly, we verify the theoretical bound derived in Section 4 through experiments on the tensor recovery problem. We generate data tensors by the following data-generating process and investigate the relation between the mean squared error (MSE) and other factors. We set $K = 3$ and prepare two different sizes: $(I_1, I_2, I_3) = (10, 10, 20)$ and $(50, 50, 20)$. We set the Tucker rank as $(R_1, R_2, R_3)$ and select $R_k$ from $\{2, 3, 4\}$ for each $k = 1, 2, 3$. In addition, we generate the core tensor with elements drawn from the standard normal distribution. Then, we generate the vectors $u^{(k)}_{r_k}$ in the following manner and obtain $X^*$ using (1). To make $X^*$ smooth, we set $u^{(k)}_{r_k}$ as a discretized smooth function $f_k$, i.e., $[u^{(k)}_{r_k}]_i = f_k(i / I_k)$, where $\{f_k\}_{k=1,2,3}$ are defined as $f_1(z) = \theta_1 z$, $f_2(z) = \theta_2 z^2$, and $f_3(z) = \theta_3 z^{0.5}$ with random parameters $(\theta_1, \theta_2, \theta_3)$. The scale of the noise is varied as $\sigma \in \{0.01, 0.1\}$.
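A NumPy sketch of the data-generating process described above: a Gaussian core tensor, factor vectors obtained by discretizing the smooth functions $f_1$, $f_2$, $f_3$, and additive Gaussian noise. Drawing one random scale per factor column and the particular parameter range are our own assumptions, since the exact randomization is not specified above.

```python
import numpy as np

rng = np.random.default_rng(1)
I, R = (10, 10, 20), (3, 3, 3)      # tensor size and Tucker rank
sigma = 0.1                          # noise scale

# Core tensor with standard normal entries.
G = rng.standard_normal(R)

# Smooth factors: [u^(k)_r]_i = f_k(i / I_k), with f_1(z) = t*z, f_2(z) = t*z^2, f_3(z) = t*sqrt(z).
def factor(Ik, Rk, f):
    z = np.arange(1, Ik + 1) / Ik
    return np.column_stack([f(z, rng.uniform(0.5, 2.0)) for _ in range(Rk)])

U1 = factor(I[0], R[0], lambda z, t: t * z)
U2 = factor(I[1], R[1], lambda z, t: t * z ** 2)
U3 = factor(I[2], R[2], lambda z, t: t * np.sqrt(z))

# X* via Tucker decomposition (1), then noisy observations as in (3).
X_true = np.einsum('abc,ia,jb,kc->ijk', G, U1, U2, U3)
Y = X_true + sigma * rng.standard_normal(I)
```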
To investigate the MSE $|||\hat{X} - X^*|||_F^2 / n$, we define the STD rank as the rank-dependent factor that scales the error bound of Theorem 5. According to the theoretical result, the upper bounds of the MSEs for STD have a linear relation with the STD rank (see Theorem 5).

*Figure 2. Result for the tensor recovery problem. $\lambda_n$ and $\mu_n$ are 0.2 (red circle), 0.4 (green triangle), 0.6 (blue square), 0.8 (purple diamond), and 1.0 (yellow star). Each dashed line is the linear fit to the errors.*

Figure 2 plots the MSEs against the STD rank. The results show that the MSE and the STD rank have a linear relationship in each panel and for each value of the penalty parameters. This result supports Theorem 5; the bound on the MSE in Theorem 5 varies linearly with the STD rank. In addition, we can see that the slope of the MSEs against the STD rank increases with the regularization parameter and decreases as the size of the tensor increases. This behavior is explained by the theoretical results, as the MSE is scaled by the regularization parameter and divided by $n$.

### 5.2. Comparison with Other Convex Methods

We compare the performance of convex tensor decompositions on the tensor recovery problem. To investigate the effect of smoothness, we generate two types of tensors, i.e., a smooth tensor and a non-smooth tensor. The smooth tensor, which is a discretized smooth function, is generated using basis vectors. For the non-smooth tensor, we generate the vectors $u^{(k)}_{r_k}$ using a multivariate normal distribution and construct $X^*$ through (1). The scale of the noise is varied as $\sigma \in \{0.1, 0.2, \dots, 1.0\}$.

In the experiment, we compare the following four methods: STD, Tucker decomposition with Schatten regularization, Tucker decomposition with latent Schatten regularization, and matrix decomposition with unfolding of $X$, where the last three methods were proposed by Tomioka et al. (2011). For each method, the regularization parameters are selected to minimize the generalization error $|||\hat{X} - X^*|||_F$ via grid search over the interval $[0.1, 8.0]$.

*Figure 3. MSEs for the tensor recovery problem plotted against the noise level. Each symbol shows Schatten (blue circle), STD (red diamond), latent Schatten (purple left triangle), and matrix completion (green triangle).*

The MSEs and their standard deviations over 100 replications are shown in Figure 3. For the small tensor size ($10 \times 10 \times 20$), STD performs better when the tensor is smooth, and the latent Schatten approach is better when the tensor is not smooth. For the large tensor ($50 \times 50 \times 20$) with the smooth structure, STD outperforms the other methods. When the large tensor is non-smooth, the advantage of STD shrinks, even though it still exhibits good performance. When the tensor is small, the optimization of STD is close to that of Tucker decomposition, as $M^{(k)}$ and $I_k$ are similar; thus, the performances of the methods are similar for the small tensor. In contrast, when the tensor is large, STD can provide a different estimator by letting $M^{(k)} \ll I_k$, and STD successfully reduces the MSE. This difference becomes evident when the tensor has the smooth structure.
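For reference, a hedged sketch of the parameter selection described in Section 5.2: a grid search over the regularization parameters that minimizes the generalization error $|||\hat{X} - X^*|||_F$. Here `std_fit` is a hypothetical stand-in for an STD solver (e.g., the ADMM of Section 3.3), and the grid spacing of 0.1 over $[0.1, 8.0]$ is our own assumption.

```python
import numpy as np
from itertools import product

def gen_error(X_hat, X_true):
    """Generalization error |||X_hat - X*|||_F."""
    return np.linalg.norm(X_hat - X_true)

def grid_search(Y, X_true, std_fit, grid=np.arange(0.1, 8.1, 0.1)):
    """Pick (lambda_n, mu_n) minimizing the generalization error, as in Section 5.2."""
    best_params, best_err = None, np.inf
    for lam, mu in product(grid, grid):
        X_hat = std_fit(Y, lam, mu)          # hypothetical STD solver
        err = gen_error(X_hat, X_true)
        if err < best_err:
            best_params, best_err = (lam, mu), err
    return best_params, best_err

# Usage (with a trivial placeholder solver that just returns the noisy observation):
# params, err = grid_search(Y, X_true, std_fit=lambda Y, lam, mu: Y)
```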
### 5.3. Analysis of Real Data

#### 5.3.1. Amino Acid Data

We conduct tensor completion and interpolation using the amino acid dataset (Kiers, 1998). The dataset contains amounts of tyrosine dissolved in phosphate water, measured with a spectrofluorometer at 1 nm intervals, and the data are represented by $201 \times 61$ matrices. We make some elements of the dataset missing and complete them using the Schatten method (Tomioka et al., 2011) and STD. We consider the following four missing patterns: (A) element-wise missing (20%), (B) element-wise missing (50%), (C) element-wise missing (80%), and (D) column-wise and row-wise missing (50%). We employ the trigonometric basis functions for STD, and the hyperparameters are determined through cross-validation.

Figure 4 shows the result. For the random missing cases (A), (B), and (C), both the Schatten method and STD can complete the missing elements. In contrast, when rows and columns are completely missing (D), only STD can interpolate the missing values and recover the data.

*Figure 4. Completion of missing elements in the amino acid data.*

#### 5.3.2. Human Activity Video

We conduct an experiment with a human activity video dataset (Schuldt et al., 2004). It contains a human running action; each video has a resolution of $160 \times 120$ pixels and an average length of 4 seconds with a frame rate of 25 fps. First, we downscale the resolution of each frame to one-fourth so that a $40 \times 30 \times 100$ tensor is obtained. We make some pixels (50% or 100%) of the frame at $t = 0.18$ s missing and complete them from the other frames. The experiments are conducted using matrix completion, the Schatten method (Tomioka et al., 2011), and STD. For STD, we set the basis functions as trigonometric series and $\lambda_n = \mu_n = 0.1$.

Figure 5 shows the results of the experiments. We observe that matrix completion and the Schatten method work well in the 50% missing case. However, they recover nothing when 100% of the pixels are missing. On the contrary, STD successfully recovers the background. In addition, the man's body and shadow are interpolated at the correct position at time $t$, even though they are blurred. Note that the completion result of STD for 50% missing pixels contains block noise, possibly because of over-fitting by the basis functions.

*Figure 5. Completion of missing pixels in the human activity video (observed frames, target frame, and completed results for the 50% and 100% missing cases).*

## 6. Discussion

The smoothness we focus on is closely related to studies of matrix completion. When a tensor is expanded by basis functions that are nearly independent of each other, the tensor satisfies the incoherence property, which is frequently used as a requirement for the matrix completion problem (Candes & Plan, 2010; Candes & Recht, 2012). Using this similarity, we may apply the formulation of STD to the tensor completion problem and analyze properties such as the sample complexity.

Note that the focus of this study is primarily on the theoretical aspect, which leaves scope for addressing more practical requirements. First, the ADMM algorithm is not scalable when the size of the tensor is large.
The primary computational burden is caused by a matrix whose size scales with $\prod_k M^{(k)}$, which is essential for the convex formulation. We may use a reduction technique such as that proposed by Cheng et al. (2016); this is an important challenge for future work. Second, the assumed smoothness (i.e., differentiability) may be too restrictive for some actual applications. For example, images possibly contain solid edges, such as the boundaries of objects, which do not fit the smoothness assumption. Exploring more domain-specific notions of smoothness is an open problem.

## References

Amini, Arash A., Levina, Elizaveta, and Shedden, Kerby A. Structured functional regression models for high-dimensional spatial spectroscopy data. arXiv preprint arXiv:1311.0416, 2013.

Bickel, Peter J., Ritov, Ya'acov, and Tsybakov, Alexandre B. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, pp. 1705–1732, 2009.

Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, and Eckstein, Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Candes, Emmanuel and Recht, Benjamin. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.

Candes, Emmanuel J. and Plan, Yaniv. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.

Cheng, Hao, Yu, Yaoliang, Zhang, Xinhua, Xing, Eric, and Schuurmans, Dale. Scalable and sound low-rank tensor learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 1114–1123, 2016.

Gandy, Silvia, Recht, Benjamin, and Yamada, Isao. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2):025010, 2011.

Gantner, Zeno, Drumond, Lucas, Freudenthaler, Christoph, Rendle, Steffen, and Schmidt-Thieme, Lars. Learning attribute-to-feature mappings for cold-start recommendations. In 2010 IEEE International Conference on Data Mining, pp. 176–185. IEEE, 2010.

Hackbusch, Wolfgang. Tensor spaces and numerical tensor calculus, volume 42. Springer Science & Business Media, 2012.

Harshman, Richard A. Foundations of the parafac procedure: Models and conditions for an explanatory multimodal factor analysis. UCLA Working Papers in Phonetics, 316:1–84, 1970.

Hsing, Tailen and Eubank, Randall. Theoretical foundations of functional data analysis, with an introduction to linear operators. John Wiley & Sons, 2015.

Hu, Changwei, Rai, Piyush, and Carin, Lawrence. Zero-truncated poisson tensor factorization for massive binary tensors. arXiv preprint arXiv:1508.04210, 2015.

Jia, Chengcheng, Zhong, Guoqiang, and Fu, Yun. Low-rank tensor learning with discriminant analysis for action classification and image recovery. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pp. 1228–1234. AAAI Press, 2014.

Kiers, Henk A. L. A three-step algorithm for candecomp/parafac analysis of large data sets with multicollinearity. Journal of Chemometrics, 12(3):155–171, 1998.

Kolda, Tamara G. and Bader, Brett W. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

Kolda, Tamara G. and Sun, Jimeng. Scalable tensor decompositions for multi-aspect data mining. In 2008 Eighth IEEE International Conference on Data Mining, pp. 363–372. IEEE, 2008.

Liu, Ji, Musialski, Przemyslaw, Wonka, Peter, and Ye, Jieping. Tensor completion for estimating missing values in visual data. In ICCV, 2009.
Narita, Atsuhiro, Hayashi, Kohei, Tomioka, Ryota, and Kashima, Hisashi. Tensor factorization using auxiliary information. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 501–516. Springer, 2011.

Negahban, Sahand and Wainwright, Martin J. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.

Ramsay, James O. Functional data analysis. Wiley Online Library, 2006.

Raskutti, Garvesh, Wainwright, Martin J., and Yu, Bin. Restricted eigenvalue properties for correlated gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.

Schuldt, Christian, Laptev, Ivan, and Caputo, Barbara. Recognizing human actions: A local svm approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pp. 32–36. IEEE, 2004.

Signoretto, M., Van de Plas, R., De Moor, B., and Suykens, J. A. K. Tensor versus matrix completion: A comparison with application to spectral data. IEEE Signal Processing Letters, 18(7):403–406, 2011.

Signoretto, Marco, De Lathauwer, Lieven, and Suykens, Johan A. K. Nuclear norms for tensors and their use for convex multilinear estimation. Submitted to Linear Algebra and Its Applications, 43, 2010.

Tomioka, Ryota and Suzuki, Taiji. Convex tensor decomposition via structured schatten norm regularization. In Advances in Neural Information Processing Systems, pp. 1331–1339, 2013.

Tomioka, Ryota, Hayashi, Kohei, and Kashima, Hisashi. Estimation of low-rank tensors via convex optimization. arXiv preprint arXiv:1010.0789, 2010.

Tomioka, Ryota, Suzuki, Taiji, Hayashi, Kohei, and Kashima, Hisashi. Statistical performance of convex tensor decomposition. In Advances in Neural Information Processing Systems, pp. 972–980, 2011.

Tsybakov, Alexandre B. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387790519, 9780387790510.

Tucker, Ledyard R. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.

Wimalawarne, Kishan, Sugiyama, Masashi, and Tomioka, Ryota. Multitask learning meets tensor factorization: task imputation via convex optimization. In Advances in Neural Information Processing Systems, pp. 2825–2833, 2014.

Yokota, Tatsuya, Zdunek, Rafal, Cichocki, Andrzej, and Yamashita, Yukihiko. Smooth nonnegative matrix and tensor factorizations for robust multi-way data analysis. Signal Processing, 113:234–249, 2015a.

Yokota, Tatsuya, Zhao, Qibin, and Cichocki, Andrzej. Smooth parafac decomposition for tensor completion. arXiv preprint arXiv:1505.06611, 2015b.

Zhou, Hua, Li, Lexin, and Zhu, Hongtu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552, 2013.