# Fitting Low-Rank Tensors in Constant Time

Kohei Hayashi (National Institute of Advanced Industrial Science and Technology / RIKEN AIP), hayashi.kohei@gmail.com¹
Yuichi Yoshida (National Institute of Informatics), yyoshida@nii.ac.jp²

¹ Supported by ONR N62909-17-1-2138.
² Supported by JSPS KAKENHI Grant Number JP17H04676 and JST ERATO Grant Number JPMJER1305, Japan.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

**Abstract.** In this paper, we develop an algorithm that approximates the residual error of Tucker decomposition, one of the most popular tensor decomposition methods, with a provable guarantee. Given an order-$K$ tensor $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$, our algorithm randomly samples a constant number $s$ of indices for each mode and creates a mini-tensor $\tilde{X} \in \mathbb{R}^{s \times \cdots \times s}$, whose elements are given by the intersection of the sampled indices on $X$. Then, we show that the residual error of the Tucker decomposition of $\tilde{X}$ is sufficiently close to that of $X$ with high probability. This result implies that we can figure out how well a low-rank tensor can be fitted to $X$ in constant time, regardless of the size of $X$. This is useful for guessing a favorable Tucker rank. Finally, we demonstrate how quickly and accurately the sampling method works using multiple real datasets.

## 1 Introduction

Tensor decomposition is a fundamental tool for dealing with array-structured data. Using tensor decomposition, a tensor (or a multidimensional array) is approximated with multiple tensors in lower-dimensional spaces using a multilinear operation. This drastically reduces disk and memory usage. We say that a tensor is of order $K$ if it is a $K$-dimensional array; each dimension is called a mode in tensor terminology.

Among the many existing tensor decomposition methods, Tucker decomposition [18] is a popular choice. To some extent, Tucker decomposition is analogous to singular-value decomposition (SVD): as SVD decomposes a matrix into left and right singular vectors that interact via singular values, Tucker decomposition of an order-$K$ tensor consists of $K$ factor matrices that interact via the so-called core tensor. The key difference between SVD and Tucker decomposition is that, with the latter, the core tensor need not be diagonal and its rank can differ for each mode $k = 1, \dots, K$. In this paper, we refer to the size of the core tensor, which is a $K$-tuple, as the Tucker rank of a Tucker decomposition.

We are usually interested in obtaining the factor matrices and the core tensor that minimize the residual error: the error between the input and low-rank approximated tensors. Sometimes, however, knowing the residual error itself is an important task. The residual error tells us how well the low-rank approximation suits the input tensor, and it is particularly useful for predetermining the Tucker rank. In real applications, Tucker ranks are not explicitly given, and we must select them by considering the balance between space usage and approximation accuracy. For example, if the selected Tucker rank is too small, we risk losing essential information in the input tensor. On the other hand, if the selected Tucker rank is too large, the computational cost of computing the Tucker decomposition (even if we allow for approximation methods) increases considerably, along with the space usage.
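To make the core-and-factor interaction concrete, the multilinear product that defines a Tucker approximation can be written as a single tensor contraction. The following minimal NumPy sketch (our illustration, not the authors' code; the names `G`, `U1`, `U2`, `U3` are hypothetical) reconstructs an order-3 tensor from a core tensor and three factor matrices:

```python
import numpy as np

# Hypothetical order-3 example: an (R1, R2, R3) core tensor and three factor
# matrices reconstruct an (N1, N2, N3) tensor via the multilinear (Tucker) product.
R1, R2, R3 = 3, 4, 5
N1, N2, N3 = 30, 40, 50
G = np.random.randn(R1, R2, R3)    # core tensor
U1 = np.random.randn(N1, R1)       # factor matrix for mode 1
U2 = np.random.randn(N2, R2)       # factor matrix for mode 2
U3 = np.random.randn(N3, R3)       # factor matrix for mode 3

# [G; U1, U2, U3]_{ijk} = sum_{abc} G_{abc} * U1_{ia} * U2_{jb} * U3_{kc}
X_lowrank = np.einsum('abc,ia,jb,kc->ijk', G, U1, U2, U3)
print(X_lowrank.shape)  # (30, 40, 50)
```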
As with the case of the matrix rank, one might think that a reasonably good Tucker rank can be found using a grid search. Unfortunately, grid searches over Tucker ranks are challenging because, for an order-$K$ tensor, the Tucker rank consists of $K$ free parameters, and the search space grows exponentially in $K$. Hence, we want to evaluate each grid point as quickly as possible. However, although several practical algorithms have been proposed, such as the higher-order orthogonal iteration (HOOI) [7], they are not sufficiently scalable. For each mode, HOOI iteratively applies SVD to an unfolded tensor, a matrix that is reshaped from the input tensor. Given an $N_1 \times \cdots \times N_K$ tensor, the computational cost is hence $O(K \max_k N_k \prod_k N_k)$, which depends crucially on the input size $N_1, \dots, N_K$. Although there are several approximation algorithms [8, 21, 17], their computational costs are still intensive. Consequently, we cannot search for good Tucker ranks; rather, we can only check a few candidates.

### 1.1 Our Contributions

When finding a good Tucker rank with a grid search, we need only the residual error. More specifically, given an order-$K$ tensor $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ and integers $R_k \le N_k$ ($k = 1, \dots, K$), we consider the following rank-$(R_1, \dots, R_K)$ Tucker-fitting problem. For an integer $n \in \mathbb{N}$, let $[n]$ denote the set $\{1, 2, \dots, n\}$. Then, we want to compute the following normalized residual error:

$$\ell_{R_1,\dots,R_K}(X) := \min_{\mathcal{G} \in \mathbb{R}^{R_1 \times \cdots \times R_K},\ \{U^{(k)} \in \mathbb{R}^{N_k \times R_k}\}_{k \in [K]}} \frac{\bigl\|X - [\mathcal{G};\, U^{(1)}, \dots, U^{(K)}]\bigr\|_F^2}{\prod_{k \in [K]} N_k}, \tag{1}$$

where $[\mathcal{G};\, U^{(1)}, \dots, U^{(K)}] \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ is an order-$K$ tensor, defined as

$$[\mathcal{G};\, U^{(1)}, \dots, U^{(K)}]_{i_1 \cdots i_K} = \sum_{r_1 \in [R_1], \dots, r_K \in [R_K]} \mathcal{G}_{r_1 \cdots r_K} \prod_{k \in [K]} U^{(k)}_{i_k r_k}$$

for every $i_1 \in [N_1], \dots, i_K \in [N_K]$. Here, $\mathcal{G}$ is the core tensor, and $U^{(1)}, \dots, U^{(K)}$ are the factor matrices. Note that we are not concerned with computing the minimizer; rather, we only want to compute the minimum value. In addition, we do not need the exact minimum. Indeed, a rough estimate still helps to narrow down promising rank candidates.

The question here is how quickly we can compute the normalized residual error $\ell_{R_1,\dots,R_K}(X)$ with moderate accuracy. We shed light on this question by considering a simple sampling-based algorithm. Given an order-$K$ tensor $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$, a Tucker rank $(R_1, \dots, R_K)$, and a sample size $s \in \mathbb{N}$, we sample a sequence of indices $S_k = (x^k_1, \dots, x^k_s)$ uniformly and independently from $\{1, \dots, N_k\}$ for each mode $k \in [K]$. Then, we construct a mini-tensor $X|_{S_1,\dots,S_K} \in \mathbb{R}^{s \times \cdots \times s}$ such that $(X|_{S_1,\dots,S_K})_{i_1,\dots,i_K} = X_{x^1_{i_1}, \dots, x^K_{i_K}}$. Finally, we compute $\ell_{R_1,\dots,R_K}(X|_{S_1,\dots,S_K})$ using a solver, such as HOOI, and output the obtained value. The details are provided in Algorithm 1.

**Algorithm 1**
Input: Random access to a tensor $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$, Tucker rank $(R_1, \dots, R_K)$, and $\epsilon, \delta \in (0, 1)$.
1. For each mode $k = 1, \dots, K$, let $S_k$ be a sequence of $s = s(\epsilon, \delta)$ indices sampled uniformly and independently from $[N_k]$.
2. Construct the mini-tensor $X|_{S_1,\dots,S_K}$.
3. Return $\ell_{R_1,\dots,R_K}(X|_{S_1,\dots,S_K})$.

In this paper, we show that Algorithm 1 achieves our ultimate goal: with a provable guarantee, its time complexity remains constant. Assume that each rank parameter $R_k$ is sufficiently smaller than the dimension of each mode $N_k$. Then, given error and confidence parameters $\epsilon, \delta \in (0, 1)$, there exists a constant $s = s(\epsilon, \delta)$ such that the approximated residual $\ell_{R_1,\dots,R_K}(X|_{S_1,\dots,S_K})$ is close to the original one $\ell_{R_1,\dots,R_K}(X)$, to within $\epsilon$, with a probability of at least $1 - \delta$.
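As an illustration of Algorithm 1, here is a minimal Python sketch. It assumes TensorLy's `tucker` routine as the off-the-shelf solver (the paper's experiments use an HOOI implementation instead), and the helper name `approx_residual` is ours:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

def approx_residual(X, ranks, s, seed=None):
    """Estimate the normalized Tucker residual of X from an s x ... x s mini-tensor."""
    rng = np.random.default_rng(seed)
    # Sample s indices uniformly and independently for each mode
    # (with replacement, as in Algorithm 1).
    S = [rng.integers(0, n, size=s) for n in X.shape]
    # Mini-tensor: the intersection of the sampled indices.
    X_mini = X[np.ix_(*S)]
    # Fit a Tucker decomposition of the requested rank to the mini-tensor.
    core, factors = tucker(tl.tensor(X_mini), rank=list(ranks))
    residual = X_mini - tl.tucker_to_tensor((core, factors))
    # Normalized residual error, as in Eq. (1).
    return np.linalg.norm(residual) ** 2 / X_mini.size

# Example: estimate how well a rank-(10, 10, 10) model fits a large tensor X.
# X = np.load('some_large_tensor.npy')   # hypothetical input
# print(approx_residual(X, ranks=(10, 10, 10), s=40))
```

Because the mini-tensor has shape $s \times \cdots \times s$, the cost of such a call does not grow with the shape of $X$ (apart from reading the sampled entries).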
Note that the time complexity for computing $\ell_{R_1,\dots,R_K}(X|_{S_1,\dots,S_K})$ does not depend on the input size $N_1, \dots, N_K$ but rather on the sample size $s$, meaning that the algorithm runs in constant time, regardless of the input size.

The main component in our proof is the weak version of Szemerédi's regularity lemma [9], which roughly states that any tensor can be well approximated by a tensor consisting of a constant number of blocks whose entries in the same block are equal. Then, we can show that $X|_{S_1,\dots,S_K}$ is a good sketch of the original tensor, because by sampling $s$ indices for each mode, we hit each block a sufficient number of times. It follows that $\ell_{R_1,\dots,R_K}(X)$ and $\ell_{R_1,\dots,R_K}(X|_{S_1,\dots,S_K})$ are close. To formalize this argument, we want to measure the distance between $X$ and $X|_{S_1,\dots,S_K}$ and show that it is small. To this end, we exploit graph limit theory, first described by Lovász and Szegedy [13] (see also [12]), in which the distance between two graphs on different numbers of vertices is measured by considering continuous versions called dikernels. Hayashi and Yoshida [10] used graph limit theory to develop a constant-time algorithm that minimizes quadratic functions described by matrices and vectors. We further extend this theory to tensors in order to analyze the Tucker fitting problem.

With both synthetic and real datasets, we numerically evaluate our algorithm. The results show that our algorithm overwhelmingly outperforms other approximation methods in terms of both speed and accuracy.

## 2 Preliminaries

**Tensors.** Let $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ be a tensor. We define the Frobenius norm of $X$ as $\|X\|_F = \sqrt{\sum_{i_1,\dots,i_K} X_{i_1 \cdots i_K}^2}$, the max norm of $X$ as $\|X\|_{\max} = \max_{i_1 \in [N_1],\dots,i_K \in [N_K]} |X_{i_1 \cdots i_K}|$, and the cut norm of $X$ as $\|X\|_\square = \max_{S_1 \subseteq [N_1],\dots,S_K \subseteq [N_K]} \bigl|\sum_{i_1 \in S_1,\dots,i_K \in S_K} X_{i_1 \cdots i_K}\bigr|$. We note that these norms satisfy the triangle inequality.

For a vector $v \in \mathbb{R}^n$ and a sequence $S = (x_1, \dots, x_s)$ of indices in $[n]$, we define the restriction $v|_S \in \mathbb{R}^s$ of $v$ such that $(v|_S)_i = v_{x_i}$ for $i \in [s]$. Let $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ be a tensor, and let $S_k = (x^k_1, \dots, x^k_s)$ be a sequence of indices in $[N_k]$ for each mode $k \in [K]$. Then, we define the restriction $X|_{S_1,\dots,S_K} \in \mathbb{R}^{s \times \cdots \times s}$ of $X$ to $S_1 \times \cdots \times S_K$ such that $(X|_{S_1,\dots,S_K})_{i_1 \cdots i_K} = X_{x^1_{i_1}, \dots, x^K_{i_K}}$ for each $i_1, \dots, i_K \in [s]$.
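Because the cut norm is a maximization over all combinations of index subsets, it is expensive to evaluate in general, but the definition is easy to spell out on a tiny example. The following brute-force NumPy sketch is our own illustration of the definition (exponential in the dimensions, so only for toy inputs):

```python
import numpy as np
from itertools import chain, combinations, product

def subsets(n):
    """All subsets of {0, ..., n-1}; exponential, so only for tiny examples."""
    return list(chain.from_iterable(combinations(range(n), r) for r in range(n + 1)))

def cut_norm_bruteforce(X):
    """Cut norm of a small tensor by brute force:
    max over S_1 x ... x S_K of |sum of the entries of X restricted to it|."""
    best = 0.0  # empty subsets contribute a sum of 0
    for S in product(*(subsets(n) for n in X.shape)):
        if all(len(Sk) > 0 for Sk in S):
            best = max(best, abs(X[np.ix_(*S)].sum()))
    return best

X = np.array([[1.0, -2.0],
              [3.0,  0.5]])
print(cut_norm_bruteforce(X))  # 4.0, attained by S_1 = {0, 1}, S_2 = {0}
```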
**Hyper-dikernels.** We call a (measurable) function $\mathcal{W}: [0,1]^K \to \mathbb{R}$ a (hyper-)dikernel of order $K$. We can regard a dikernel as a tensor whose indices are specified by real values in $[0,1]$. We stress that the term "dikernel" has nothing to do with kernel methods used in machine learning.

For two functions $f, g: [0,1] \to \mathbb{R}$, we define their inner product as $\langle f, g \rangle = \int_0^1 f(x) g(x)\, \mathrm{d}x$. For a sequence of functions $f^{(1)}, \dots, f^{(K)}$, we define their tensor product $\bigotimes_{k \in [K]} f^{(k)}: [0,1]^K \to \mathbb{R}$ as $\bigl(\bigotimes_{k \in [K]} f^{(k)}\bigr)(x_1, \dots, x_K) = \prod_{k \in [K]} f^{(k)}(x_k)$, which is an order-$K$ dikernel.

Let $\mathcal{W}: [0,1]^K \to \mathbb{R}$ be a dikernel. We define the Frobenius norm of $\mathcal{W}$ as $\|\mathcal{W}\|_F = \sqrt{\int_{[0,1]^K} \mathcal{W}(\boldsymbol{x})^2\, \mathrm{d}\boldsymbol{x}}$, the max norm of $\mathcal{W}$ as $\|\mathcal{W}\|_{\max} = \max_{\boldsymbol{x} \in [0,1]^K} |\mathcal{W}(\boldsymbol{x})|$, and the cut norm of $\mathcal{W}$ as $\|\mathcal{W}\|_\square = \sup_{S_1,\dots,S_K \subseteq [0,1]} \bigl|\int_{S_1 \times \cdots \times S_K} \mathcal{W}(\boldsymbol{x})\, \mathrm{d}\boldsymbol{x}\bigr|$. Again, we note that these norms satisfy the triangle inequality. For two dikernels $\mathcal{W}$ and $\mathcal{W}'$, we define their inner product as $\langle \mathcal{W}, \mathcal{W}' \rangle = \int_{[0,1]^K} \mathcal{W}(\boldsymbol{x}) \mathcal{W}'(\boldsymbol{x})\, \mathrm{d}\boldsymbol{x}$.

Let $\lambda$ be the Lebesgue measure. A map $\pi: [0,1] \to [0,1]$ is said to be measure-preserving if the pre-image $\pi^{-1}(A)$ is measurable for every measurable set $A$, and $\lambda(\pi^{-1}(A)) = \lambda(A)$. A measure-preserving bijection is a measure-preserving map whose inverse map exists and is also measurable (and, in turn, also measure-preserving). For a measure-preserving bijection $\pi: [0,1] \to [0,1]$ and a dikernel $\mathcal{W}: [0,1]^K \to \mathbb{R}$, we define a dikernel $\pi(\mathcal{W}): [0,1]^K \to \mathbb{R}$ as $\pi(\mathcal{W})(x_1, \dots, x_K) = \mathcal{W}(\pi(x_1), \dots, \pi(x_K))$.

For a tensor $\mathcal{G} \in \mathbb{R}^{R_1 \times \cdots \times R_K}$ and vector-valued functions $\{F^{(k)}: [0,1] \to \mathbb{R}^{R_k}\}_{k \in [K]}$, we define an order-$K$ dikernel $[\mathcal{G};\, F^{(1)}, \dots, F^{(K)}]: [0,1]^K \to \mathbb{R}$ as

$$[\mathcal{G};\, F^{(1)}, \dots, F^{(K)}](x_1, \dots, x_K) = \sum_{r_1 \in [R_1], \dots, r_K \in [R_K]} \mathcal{G}_{r_1,\dots,r_K} \prod_{k \in [K]} F^{(k)}(x_k)_{r_k}.$$

We note that $[\mathcal{G};\, F^{(1)}, \dots, F^{(K)}]$ is a continuous analogue of Tucker decomposition.

**Tensors and hyper-dikernels.** We can construct a dikernel $\mathcal{X}: [0,1]^K \to \mathbb{R}$ from a tensor $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ as follows. For an integer $n \in \mathbb{N}$, let $I^n_1 = [0, \frac{1}{n}]$, $I^n_2 = (\frac{1}{n}, \frac{2}{n}]$, $\dots$, $I^n_n = (\frac{n-1}{n}, 1]$. For $x \in [0,1]$, we define $i_n(x) \in [n]$ as the unique integer such that $x \in I^n_{i_n(x)}$. Then, we define $\mathcal{X}(x_1, \dots, x_K) = X_{i_{N_1}(x_1) \cdots i_{N_K}(x_K)}$. The main motivation for creating a dikernel from a tensor is that, in doing so, we can define the distance between two tensors $X$ and $Y$ of different sizes via the cut norm, that is, $\|\mathcal{X} - \mathcal{Y}\|_\square$.

Let $\mathcal{W}: [0,1]^K \to \mathbb{R}$ be a dikernel and $S_k = (x^k_1, \dots, x^k_s)$ for $k \in [K]$ be sequences of elements in $[0,1]$. Then, we define a dikernel $\mathcal{W}|_{S_1,\dots,S_K}: [0,1]^K \to \mathbb{R}$ as follows: we first extract a tensor $W \in \mathbb{R}^{s \times \cdots \times s}$ by setting $W_{i_1 \cdots i_K} = \mathcal{W}(x^1_{i_1}, \dots, x^K_{i_K})$. Then, we define $\mathcal{W}|_{S_1,\dots,S_K}$ as the dikernel constructed from $W$.

## 3 Correctness of Algorithm 1

In this section, we prove the correctness of Algorithm 1. The following sampling lemma states that dikernels and their sampled versions are close in the cut norm with high probability.

**Lemma 3.1.** Let $\mathcal{W}_1, \dots, \mathcal{W}_T: [0,1]^K \to [-L, L]$ be dikernels. Let $S_1, \dots, S_K$ be sequences of $s$ elements uniformly and independently sampled from $[0,1]$. Then, with a probability of at least $1 - \exp\bigl(-\Omega_K(s^2 (T/\log_2 s)^{1/(K-1)})\bigr)$, there exists a measure-preserving bijection $\pi: [0,1] \to [0,1]$ such that, for every $t \in [T]$, we have
$$\|\mathcal{W}_t - \pi(\mathcal{W}_t|_{S_1,\dots,S_K})\|_\square = L \cdot O_K\bigl((T/\log_2 s)^{1/(2K-2)}\bigr),$$
where $O_K(\cdot)$ and $\Omega_K(\cdot)$ hide factors depending on $K$.

We now consider the dikernel counterpart to the Tucker fitting problem, in which, for a dikernel $\mathcal{X}: [0,1]^K \to \mathbb{R}$, we want to compute the following:

$$\ell_{R_1,\dots,R_K}(\mathcal{X}) := \inf_{\mathcal{G} \in \mathbb{R}^{R_1 \times \cdots \times R_K},\ \{f^{(k)}: [0,1] \to \mathbb{R}^{R_k}\}_{k \in [K]}} \bigl\|\mathcal{X} - [\mathcal{G};\, f^{(1)}, \dots, f^{(K)}]\bigr\|_F^2. \tag{2}$$

The following lemma states that the Tucker fitting problem and its dikernel counterpart have the same optimal values.

**Lemma 3.2.** Let $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ be a tensor, and let $R_1, \dots, R_K \in \mathbb{N}$ be integers. Then, we have $\ell_{R_1,\dots,R_K}(X) = \ell_{R_1,\dots,R_K}(\mathcal{X})$, where $\mathcal{X}$ is the dikernel constructed from $X$.

For a set of vector-valued functions $F = \{f^{(k)}: [0,1] \to \mathbb{R}^{R_k}\}_{k \in [K]}$, we define $\|F\|_{\max} = \max_{k \in [K],\, r \in [R_k],\, x \in [0,1]} |f^{(k)}_r(x)|$. For real values $a, b, c \in \mathbb{R}$, $a = b \pm c$ is shorthand for $b - c \le a \le b + c$. For a dikernel $\mathcal{X}: [0,1]^K \to \mathbb{R}$, we define a dikernel $\mathcal{X}^2: [0,1]^K \to \mathbb{R}$ as $\mathcal{X}^2(\boldsymbol{x}) = \mathcal{X}(\boldsymbol{x})^2$ for every $\boldsymbol{x} \in [0,1]^K$. The following lemma states that if $\mathcal{X}$ and $\mathcal{Y}$ are close in the cut norm, then the optimal values of the Tucker fitting problem on them are also close.

**Lemma 3.3.** Let $\mathcal{X}, \mathcal{Y}: [0,1]^K \to \mathbb{R}$ be dikernels with $\|\mathcal{X} - \mathcal{Y}\|_\square \le \epsilon$ and $\|\mathcal{X}^2 - \mathcal{Y}^2\|_\square \le \epsilon$. For integers $R_1, \dots, R_K \in \mathbb{N}$, we have
$$\ell_{R_1,\dots,R_K}(\mathcal{X}) = \ell_{R_1,\dots,R_K}(\mathcal{Y}) \pm 2\epsilon \Bigl(1 + R \bigl(\|\mathcal{G}_\mathcal{X}\|_{\max} \|F_\mathcal{X}\|_{\max}^K + \|\mathcal{G}_\mathcal{Y}\|_{\max} \|F_\mathcal{Y}\|_{\max}^K\bigr)\Bigr),$$
where $(\mathcal{G}_\mathcal{X}, F_\mathcal{X} = \{f^{(k)}_\mathcal{X}\}_{k \in [K]})$ and $(\mathcal{G}_\mathcal{Y}, F_\mathcal{Y} = \{f^{(k)}_\mathcal{Y}\}_{k \in [K]})$ are solutions to problem (2) on $\mathcal{X}$ and $\mathcal{Y}$, respectively, whose objective values exceed the infima by at most $\epsilon$, and $R = \prod_{k \in [K]} R_k$.
It is well known that the Tucker fitting problem has a minimizer for which the factor matrices are orthonormal. Thus, we have the following guarantee for the approximation error of Algorithm 1.

**Theorem 3.4.** Let $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ be a tensor, $R_1, \dots, R_K$ be integers, and $\epsilon, \delta \in (0, 1)$. For $s(\epsilon, \delta) = 2^{\Theta(1/\epsilon^{2K-2})} + \Theta\bigl(\log\frac{1}{\delta} \cdot \log\log\frac{1}{\delta}\bigr)$, we have the following. Let $S_1, \dots, S_K$ be sequences of indices as defined in Algorithm 1. Let $(\mathcal{G}^*, U^*_1, \dots, U^*_K)$ and $(\tilde{\mathcal{G}}^*, \tilde{U}^*_1, \dots, \tilde{U}^*_K)$ be minimizers of problem (1) on $X$ and $X|_{S_1,\dots,S_K}$, respectively, for which the factor matrices are orthonormal. Then, with a probability of at least $1 - \delta$, we have
$$\ell_{R_1,\dots,R_K}(X|_{S_1,\dots,S_K}) = \ell_{R_1,\dots,R_K}(X) \pm O\bigl(\epsilon L^2 (1 + 2MR)\bigr),$$
where $L = \|X\|_{\max}$, $M = \max\{\|\mathcal{G}^*\|_{\max}, \|\tilde{\mathcal{G}}^*\|_{\max}\}$, and $R = \prod_{k \in [K]} R_k$.

We remark that, for the matrix case (i.e., $K = 2$), $\|\mathcal{G}^*\|_{\max}$ and $\|\tilde{\mathcal{G}}^*\|_{\max}$ are equal to the maximum singular values of the original and sampled matrices, respectively.

*Proof.* We apply Lemma 3.1 to $\mathcal{X}$ and $\mathcal{X}^2$. Then, with a probability of at least $1 - \delta$, there exists a measure-preserving bijection $\pi: [0,1] \to [0,1]$ such that $\|\mathcal{X} - \pi(\mathcal{X}|_{S_1,\dots,S_K})\|_\square \le \epsilon L$ and $\|\mathcal{X}^2 - \pi(\mathcal{X}^2|_{S_1,\dots,S_K})\|_\square \le \epsilon L^2$. In what follows, we assume that this has happened. Then, by Lemma 3.3 and the fact that $\ell_{R_1,\dots,R_K}(\mathcal{X}|_{S_1,\dots,S_K}) = \ell_{R_1,\dots,R_K}(\pi(\mathcal{X}|_{S_1,\dots,S_K}))$, we have
$$\ell_{R_1,\dots,R_K}(\mathcal{X}|_{S_1,\dots,S_K}) = \ell_{R_1,\dots,R_K}(\mathcal{X}) \pm \epsilon L^2 \Bigl(1 + 2R\bigl(\|\tilde{\mathcal{G}}\|_{\max} \|\tilde{F}\|_{\max}^K + \|\mathcal{G}\|_{\max} \|F\|_{\max}^K\bigr)\Bigr),$$
where $(\mathcal{G}, F = \{f^{(k)}\}_{k \in [K]})$ and $(\tilde{\mathcal{G}}, \tilde{F} = \{\tilde{f}^{(k)}\}_{k \in [K]})$ are as in the statement of Lemma 3.3. From the proof of Lemma 3.2, we can assume that $\|\mathcal{G}\|_{\max} = \|\mathcal{G}^*\|_{\max}$, $\|\tilde{\mathcal{G}}\|_{\max} = \|\tilde{\mathcal{G}}^*\|_{\max}$, $\|F\|_{\max} \le 1$, and $\|\tilde{F}\|_{\max} \le 1$ (owing to the orthonormality of $U^*_1, \dots, U^*_K$ and $\tilde{U}^*_1, \dots, \tilde{U}^*_K$). It follows that
$$\ell_{R_1,\dots,R_K}(\mathcal{X}|_{S_1,\dots,S_K}) = \ell_{R_1,\dots,R_K}(\mathcal{X}) \pm \epsilon L^2 \bigl(1 + 2R(\|\mathcal{G}^*\|_{\max} + \|\tilde{\mathcal{G}}^*\|_{\max})\bigr). \tag{3}$$
Then, we have
$$
\begin{aligned}
\ell_{R_1,\dots,R_K}(X|_{S_1,\dots,S_K}) &= \ell_{R_1,\dots,R_K}(\mathcal{X}|_{S_1,\dots,S_K}) && \text{(by Lemma 3.2)} \\
&= \ell_{R_1,\dots,R_K}(\mathcal{X}) \pm \epsilon L^2 \bigl(1 + 2R(\|\mathcal{G}^*\|_{\max} + \|\tilde{\mathcal{G}}^*\|_{\max})\bigr) && \text{(by (3))} \\
&= \ell_{R_1,\dots,R_K}(X) \pm \epsilon L^2 \bigl(1 + 2R(\|\mathcal{G}^*\|_{\max} + \|\tilde{\mathcal{G}}^*\|_{\max})\bigr). && \text{(by Lemma 3.2)}
\end{aligned}
$$
Hence, we obtain the desired result.

## 4 Related Work

Several randomized algorithms have been proposed for computing Tucker decompositions. A popular approach involves using a truncated or randomized SVD. For example, Zhou et al. [21] proposed a variant of HOOI with randomized SVD. Another approach is based on tensor sparsification. Tsourakakis [17] proposed MACH, which randomly picks each element of the input tensor and replaces it with zero with probability $1 - p$, where $p \in (0, 1]$ is an approximation parameter. Moreover, several authors proposed CUR-type Tucker decomposition, which approximates the input tensor by sampling tensor tubes [6, 8].

Unfortunately, these methods do not significantly reduce the computational cost. Randomized SVD approaches reduce the computational cost of the multiple SVDs from $O(K \max_k N_k \prod_k N_k)$ to $O(K \max_k R_k \prod_k N_k)$, but they still depend on $\prod_k N_k$. CUR-type approaches require the same time complexity. In MACH, to obtain accurate results, we need to set $p$ to a constant, for instance $p = 0.1$ [17]. Although this improves the runtime by a constant factor, the dependency on $\prod_k N_k$ does not change.

*Figure 1: Synthetic data: computed residual errors for various Tucker ranks. Panels correspond to $N = 100, 200, 400,$ and $800$; the horizontal axis is the rank parameter $R$ (from 11 to 20), and the vertical axis is the approximated residual error $\ell_{R_1,\dots,R_K}(X|_{S_1,\dots,S_K})$. Error bars indicate the standard deviation over ten trials with different random seeds, which affected both data generation and sampling.*

## 5 Experiments

For the experimental evaluation, we slightly modified our sampling algorithm. In Algorithm 1, the indices are sampled with replacement (i.e., the same index can be sampled more than once).
Although this sampling method is theoretically sound, we risk obtaining redundant information by sampling the same index several times. To avoid this issue, we used sampling without replacement, i.e., each index was sampled at most once. Furthermore, if the dimension of a mode was smaller than the sample size, we used all of its coordinates. That is, we sampled $\min(s, N_k)$ indices for each mode $k \in [K]$. Note that the two sampling methods, with and without replacement, are almost equivalent when the input size $N_1, \dots, N_K$ is sufficiently larger than $s$ (i.e., the probability that a previously sampled index is sampled again approaches zero).

### 5.1 Synthetic Data

We first demonstrate the accuracy of our method using synthetic data. We prepared $N \times N \times N$ tensors for $N \in \{100, 200, 400, 800\}$ with a Tucker rank of $(15, 15, 15)$. Each element of the core $\mathcal{G} \in \mathbb{R}^{15 \times 15 \times 15}$ and of the factor matrices $U^{(1)}, U^{(2)}, U^{(3)} \in \mathbb{R}^{N \times 15}$ was drawn from a standard normal distribution. We set $Y = [\mathcal{G};\, U^{(1)}, U^{(2)}, U^{(3)}]$. Then, we generated $X \in \mathbb{R}^{N \times N \times N}$ as $X_{ijk} = Y_{ijk} / \|Y\|_F + 0.1 \epsilon_{ijk}$, where $\epsilon_{ijk}$ follows the standard normal distribution for $i, j, k \in [N]$. Namely, $X$ had a low-rank structure, though some small noise was added. Subsequently, $X$ was decomposed using our method with various Tucker ranks $(R, R, R)$ for $R \in \{11, 12, \dots, 20\}$ and sample sizes $s \in \{20, 40, 80\}$.

The results (see Figure 1) show that our method behaved ideally: the error was high when $R$ was less than the true rank, 15, and it was almost zero when $R$ was greater than or equal to the true rank. Note that the scale of the estimated residual error seems to depend on $s$; i.e., a small $s$ tends to yield a small residual error. This implies that our method underestimates the residual error when $s$ is small.

### 5.2 Real Data

To evaluate how our method works on real data tensors, we used the eight datasets [1, 2, 4, 11, 14, 19] described in Table 1, where the fluor dataset is order-4 and the others are order-3 tensors. Details regarding the data are provided in the Supplementary material.

Table 1: Real datasets.

| Dataset | Size | Total # of elements |
|---|---|---|
| movie_gray | 120 × 160 × 107 | 2.0M |
| EEM | 28 × 13324 × 8 | 2.9M |
| fluorescence | 299 × 301 × 41 | 3.6M |
| bonnie | 89 × 97 × 549 | 4.7M |
| fluor | 405 × 136 × 19 × 5 | 5.2M |
| wine | 44 × 2700 × 200 | 23.7M |
| BCI_Berlin | 4001 × 59 × 1400 | 0.3G |
| visor | 16818 × 288 × 384 | 1.8G |

Before the experiment, we normalized each data tensor by its norm $\|X\|_F$. To evaluate the approximation accuracy, we used HOOI, implemented in Python by Nickel,³ to compute the "true" residual error.⁴ As baselines, we used the two randomized methods introduced in Section 4: randomized SVD [21] and MACH [17]. We denote our method by sample*s*, where *s* indicates the sample size (e.g., sample40 denotes our method with $s = 40$). Similarly, mach*p* refers to MACH with the sparsification probability set to *p*. For all the approximation methods, we used the HOOI implementation to solve the Tucker decomposition. Every data tensor was decomposed with Tucker rank $(R_1, \dots, R_K)$ on the grid $R_k \in \{5, 10, 15, 20\}$ for $k \in [K]$.

³ https://github.com/mnick/scikit-tensor
⁴ Note that, though no approximation is used in HOOI, the objective function (1) is nonconvex, and HOOI is not guaranteed to converge to the global minimum. The obtained solution can therefore differ from the ground truth.
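Because every grid point reuses the same mini-tensor, the whole rank grid can be evaluated at the cost of repeatedly fitting an $s \times \cdots \times s$ tensor. The sketch below is our own illustration of that loop, not the paper's code (TensorLy assumed as the solver, sampling without replacement as described above, and all helper names hypothetical):

```python
import itertools
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

def sample_mini_tensor(X, s, seed=None):
    rng = np.random.default_rng(seed)
    # Sampling without replacement; use all indices if a mode is smaller than s.
    S = [rng.choice(n, size=min(s, n), replace=False) for n in X.shape]
    return X[np.ix_(*S)]

def rank_grid_errors(X, grid, s=40):
    X_mini = sample_mini_tensor(X, s)
    errors = {}
    # Note: each R_k should not exceed the corresponding mini-tensor dimension.
    for ranks in itertools.product(grid, repeat=X.ndim):
        core, factors = tucker(tl.tensor(X_mini), rank=list(ranks))
        res = X_mini - tl.tucker_to_tensor((core, factors))
        errors[ranks] = np.linalg.norm(res) ** 2 / X_mini.size
    return errors

# errors = rank_grid_errors(X, grid=(5, 10, 15, 20))
# best = min(errors, key=errors.get)   # smallest estimated residual on the grid
```

Only the estimated error values are needed for rank selection; the fitted cores and factors of the mini-tensor can be discarded.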
*Figure 2: Real data: (approximated) residual errors for various Tucker ranks. Panels correspond to the movie_gray, fluorescence, bonnie, wine, BCI_Berlin, and visor datasets; the horizontal axis is the Tucker rank, and the vertical axis is the residual error.*

Figure 2 shows the residual errors for the order-3 data.⁵ It shows that the random projection tends to overestimate the decomposition error. On the other hand, except for the wine dataset, our method stably estimated the residual error with reasonable approximation errors. For the wine dataset, our method estimated a very small value, far from the correct value. This result makes sense, however, because the wine dataset is sparse (90% of its elements are zero) and the residual error is itself very small. Table 2 shows the absolute error from HOOI averaged over all rank settings. In most of the datasets, our methods achieved the lowest error.

⁵ Here we exclude the results for the EEM dataset because its size is too small and we were unable to run the experiment with all the Tucker rank settings. Also, the results of MACH on some datasets are excluded owing to considerable errors.

Table 2: Real data: absolute error from HOOI's residual error, averaged over ranks (mean ± standard deviation). The best and the second-best results are shown in bold and italic, respectively.

| Dataset | mach0.1 | mach0.3 | randsvd | sample40 | sample80 |
|---|---|---|---|---|---|
| movie_gray | 0.084 ± 0.038 | 0.020 ± 0.010 | 0.004 ± 0.003 | *0.001 ± 0.001* | **0.000 ± 0.000** |
| EEM | 2.370 ± 0.792 | 0.587 ± 0.210 | 0.018 ± 0.029 | **0.003 ± 0.003** | **0.003 ± 0.003** |
| fluorescence | 0.569 ± 0.204 | 0.129 ± 0.053 | 0.024 ± 0.023 | *0.004 ± 0.005* | **0.002 ± 0.002** |
| bonnie | 1.170 ± 0.412 | 0.300 ± 0.121 | 0.012 ± 0.011 | *0.004 ± 0.002* | **0.003 ± 0.001** |
| fluor | 0.611 ± 0.307 | 0.148 ± 0.083 | 0.009 ± 0.007 | *0.003 ± 0.001* | **0.002 ± 0.001** |
| wine | 6.826 ± 0.733 | 1.417 ± 0.191 | 0.012 ± 0.009 | *0.008 ± 0.006* | **0.007 ± 0.006** |
| BCI_Berlin | 0.193 ± 0.039 | **0.048 ± 0.013** | 0.057 ± 0.020 | 0.065 ± 0.022 | *0.055 ± 0.007* |
| visor | 0.002 ± 0.001 | **0.000 ± 0.000** | 0.007 ± 0.003 | 0.003 ± 0.001 | *0.001 ± 0.001* |

Next, we evaluated the correctness of the ordering of Tucker ranks. For rank determination, it is important that the rankings of Tucker ranks in terms of residual errors are consistent between the original and the sampled tensors. For example, if the rank-(15, 15, 5) Tucker decomposition of the original tensor achieves a lower error than the rank-(5, 15, 15) Tucker decomposition, this order relation should be preserved in the sampled tensor. We evaluated this using Kendall's tau coefficient between the rankings of Tucker ranks obtained by HOOI and by the other methods; Kendall's tau takes the value +1 when the two rankings are identical and -1 when they are opposite (a minimal computation sketch is given at the end of this section). Table 3 shows the results. We can see that, again, our method outperformed the others.

Table 3: Real data: Kendall's tau against the ranking of Tucker ranks obtained by HOOI.

| Dataset | mach0.1 | mach0.3 | randsvd | sample40 | sample80 |
|---|---|---|---|---|---|
| movie_gray | -0.07 | 0.04 | 0.10 | 0.71 | 0.73 |
| EEM | 0.64 | 0.68 | 0.77 | 0.79 | 0.91 |
| fluorescence | 0.08 | 0.02 | 0.28 | 0.61 | 0.77 |
| bonnie | -0.05 | -0.01 | 0.33 | 0.27 | 0.67 |
| fluor | 0.77 | 0.73 | 0.83 | 0.93 | 0.89 |
| wine | 0.12 | 0.12 | -0.02 | 0.04 | 0.15 |
| BCI_Berlin | 0.08 | 0.09 | 0.02 | 0.18 | 0.45 |
| visor | 0.07 | 0.18 | 0.11 | 0.64 | 0.70 |

Table 4 shows the runtime averaged over all the rank settings. It shows that our method is consistently the fastest. Note that MACH was slower than normal Tucker decomposition. This is possibly because it must create an additional sparse tensor, which requires $O(\prod_k N_k)$ time.

Table 4: Real data: runtime averaged over Tucker ranks (in seconds).

| Dataset | hooi | mach0.1 | mach0.3 | randsvd | sample40 | sample80 |
|---|---|---|---|---|---|---|
| movie_gray | 0.71 | 32.19 | 85.13 | 0.33 | 0.13 | 0.25 |
| EEM | 3447.97 | 7424.80 | 7938.75 | 2212.54 | 0.11 | 0.11 |
| fluorescence | 2.67 | 30.05 | 73.52 | 1.47 | 0.13 | 0.23 |
| bonnie | 9.13 | 25.99 | 56.56 | 2.32 | 0.11 | 0.41 |
| fluor | 3.20 | 34.54 | 98.63 | 1.43 | 0.20 | 0.43 |
| wine | 142.34 | 95.28 | 212.19 | 41.94 | 0.12 | 0.23 |
| BCI_Berlin | 428.13 | 2765.88 | 7830.00 | 82.43 | 0.20 | 0.45 |
| visor | 10034.96 | 27897.85 | 27769.53 | 1950.45 | 0.13 | 0.26 |
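The ranking comparison reported in Table 3 can be computed directly from two lists of residual errors evaluated on the same rank grid. The following minimal sketch assumes SciPy's `kendalltau`; the numbers are made up for illustration, not taken from the paper:

```python
from scipy.stats import kendalltau

# Residual errors per Tucker-rank grid point (same grid order in both lists);
# these values are invented for illustration only.
errors_hooi    = [0.31, 0.22, 0.18, 0.15, 0.12]
errors_sampled = [0.30, 0.24, 0.17, 0.16, 0.11]

tau, p_value = kendalltau(errors_hooi, errors_sampled)
print(tau)  # 1.0 here: the two rankings of rank candidates agree exactly
```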
## 6 Discussion

One might point out, as a criticism, that the residual error is not a satisfying measure for determining the rank. In machine learning and statistics, it is common to choose hyperparameters based on the generalization error or an estimator of it, such as the cross-validation (CV) error, rather than the training error (i.e., the residual error in Tucker decomposition). Unfortunately, our approach cannot be used to estimate the CV error, because what we can obtain is the minimum of the training error, whereas CV requires us to plug in the minimizers. An alternative is to use information criteria such as the Akaike information criterion [3] and the Bayesian information criterion [15]. These criteria consist of a penalty term, which depends on the number of parameters and samples,⁶ and the maximum log-likelihood. Because the maximum log-likelihood is equivalent to the residual error, our method can approximate these criteria.

Python code of our algorithm is available at: https://github.com/hayasick/CTFT.

## References

[1] E. Acar, R. Bro, and B. Schmidt. New exploratory clustering tool. Journal of Chemometrics, 22(1):91, 2008.
[2] E. Acar, E. E. Papalexakis, G. Gürdeniz, M. A. Rasmussen, A. J. Lawaetz, M. Nilsson, and R. Bro. Structure-revealing data fusion. BMC Bioinformatics, 15(1):239, 2014.
[3] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
[4] B. Blankertz, G. Dornhege, M. Krauledat, K.-R. Müller, and G. Curio. The non-invasive Berlin brain-computer interface: Fast acquisition of effective performance in untrained subjects. NeuroImage, 37(2):539–550, 2007.
[5] C. Borgs, J. T. Chayes, L. Lovász, V. T. Sós, and K. Vesztergombi. Convergent sequences of dense graphs I: Subgraph frequencies, metric properties and testing. Advances in Mathematics, 219(6):1801–1851, 2008.
[6] C. F. Caiafa and A. Cichocki. Generalizing the column-row matrix decomposition to multi-way arrays. Linear Algebra and its Applications, 433(3):557–573, 2010.
[7] L. De Lathauwer, B. De Moor, and J. Vandewalle. On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4):1324–1342, 2000.
[8] P. Drineas and M. W. Mahoney. A randomized algorithm for a tensor-based generalization of the singular value decomposition. Linear Algebra and its Applications, 420(2):553–571, 2007.
[9] A. Frieze and R. Kannan. The regularity lemma and approximation schemes for dense problems. In FOCS, pages 12–20, 1996.
[10] K. Hayashi and Y. Yoshida. Minimizing quadratic functions in constant time. In NIPS, pages 2217–2225, 2016.
[11] A. J. Lawaetz, R. Bro, M. Kamstrup-Nielsen, I. J. Christensen, L. N. Jørgensen, and H. J. Nielsen. Fluorescence spectroscopy as a potential metabonomic tool for early detection of colorectal cancer. Metabolomics, 8(1):111–121, 2012.
[12] L. Lovász. Large Networks and Graph Limits. American Mathematical Society, 2012.
[13] L. Lovász and B. Szegedy. Limits of dense graph sequences. Journal of Combinatorial Theory, Series B, 96(6):933–957, 2006.
[14] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR, volume 3, pages 32–36, 2004.
[15] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
[16] R. J. Steele and A. E. Raftery. Performance of Bayesian model selection criteria for Gaussian mixture models. Frontiers of Statistical Decision Making and Bayesian Analysis, 2:113–130, 2010.
[17] C. E. Tsourakakis. MACH: Fast randomized tensor decompositions. In ICDM, pages 689–700, 2010.
[18] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
[19] R. Vezzani and R. Cucchiara. Video surveillance online repository (ViSOR): an integrated framework. Multimedia Tools and Applications, 50(2):359–380, 2010.
[20] S. Watanabe. Algebraic Geometry and Statistical Learning Theory, volume 25. Cambridge University Press, 2009.
[21] G. Zhou, A. Cichocki, and S. Xie. Decomposition of big tensors with low multilinear rank. arXiv preprint arXiv:1412.1885, 2014.

⁶ For models with multiple solutions, such as Tucker decomposition, the penalty term can differ from the standard form [20]. Still, these criteria are useful in practice (see, e.g., [16]).