# Singleshot: a scalable Tucker tensor decomposition

Abraham Traoré, LITIS EA4108, University of Rouen Normandy, abraham.traore@etu.univ-rouen.fr
Maxime Bérar, LITIS EA4108, University of Rouen Normandy, maxime.berar@univ-rouen.fr
Alain Rakotomamonjy, LITIS EA4108, University of Rouen Normandy; Criteo AI Lab, Criteo Paris, alain.rakoto@insa-rouen.fr

This paper introduces a new approach for the scalable Tucker decomposition problem. Given a tensor $\mathcal{X}$, the proposed algorithm, named Singleshot, performs the inference task by processing one subtensor drawn from $\mathcal{X}$ at a time. The key principle of our approach is the recursive computation of the gradient and a cyclic update of the latent factors involving only one single step of gradient descent. We further improve the computational efficiency of Singleshot by proposing an inexact-gradient version named Singleshotinexact. The two algorithms are backed with theoretical guarantees of convergence and convergence rates under mild conditions. The scalability of the proposed approaches, which can easily be extended to handle common constraints encountered in tensor decomposition (e.g. non-negativity), is demonstrated via numerical experiments on both synthetic and real data sets.

1 Introduction

The recovery of information-rich and task-relevant variables hidden behind observation data (commonly referred to as latent variables) is a fundamental task that has been extensively studied in machine learning. In many applications, the dataset we are dealing with naturally presents different modes (or dimensions) and thus can be represented by multidimensional arrays (also called tensors). The recent interest in efficient techniques to deal with such datasets is motivated by the fact that methodologies that matricize the data and then apply matrix factorization give a flattened view of the data and often cause a loss of internal structure information. Hence, to mitigate the extent of this loss, it is more favorable to process a multimodal data set in its own domain, i.e. the tensor domain, to obtain a multiple-perspective view of the data rather than a flattened one.

Tensors are a generalization of matrices, and the related decomposition techniques are promising tools for exploratory analysis of multidimensional data in diverse disciplines including signal processing [11], social network analysis [28], etc. The two most common decompositions used for tensor analysis are the Tucker decomposition [43] and the Canonical Polyadic Decomposition, also named CPD [16, 6]. These decompositions are used to infer multilinear relationships from multidimensional datasets, as they allow the extraction of hidden (latent) components and the investigation of the relationships among them.

In this paper, we focus on the Tucker decomposition, motivated by the fact that this decomposition and its variants have been successfully used in many real applications [24, 19]. Our technical goal is to develop a scalable Tucker decomposition technique in a static setting (the tensor is fixed). Such an objective is relevant when it is not possible to load the tensor of interest in memory, or when the decomposition process may result in memory overflow generated by intermediate computations [20, 31].

1.1 Related work and main limitations
Divide-and-conquer type methods (i.e. methods which divide the data set into sub-parts) have already been proposed for the scalable Tucker decomposition problem, with the goal of efficiently decomposing a large fixed tensor (static setting). There are mainly three trends for these methods: distributed methods, sequential processing of small subsets of entries drawn from the tensor, or the computation of the tensor-matrix product in a piecemeal fashion by adaptively selecting the order of the computations. A variant of Tucker-ALS has been proposed in [31]: it solves each alternate step of the Tucker decomposition by processing intermediate data on-the-fly, thereby reducing the memory footprint of the algorithm. Several other approaches following the same principles are given in [5, 9, 4, 33], while others consider sampling strategies [29, 36, 14, 39, 48, 18, 47, 35, 27, 10, 25] or distributed approaches [49, 7, 34]. One major limitation of these algorithms is their lack of genericity (i.e. they cannot be extended to incorporate constraints such as non-negativity).

Another set of techniques for large-scale Tucker decomposition in a static setting focuses on designing both deterministic and randomized algorithms in order to alleviate the computational burden of the decomposition. An approach proposed by [4] performs an alternate minimization and reduces the scale of the intermediate problems via the incorporation of sketching operators. In the same vein, one can reduce the computational burden of the standard HOSVD method through randomization, by estimating the orthonormal bases via the so-called range finder algorithm [51]. This class of approaches encompasses other methods that can be either randomized [8, 30, 13, 37, 42, 46] or deterministic [40, 2, 38, 3, 50, 17, 26, 32]. The main limitation of these methods stems from the fact that they use the whole data set at once (instead of dividing it), which makes them inapplicable when the tensor does not fit in the available memory.

From a theoretical point of view, among all these works, some algorithms are backed up with convergence results [4] or with approximation-quality guarantees in the form of a recovery bound [1]. However, there is still a lack of convergence rate analysis for the scalable Tucker problem.

1.2 Main contributions

In contrast to the works described above, our contributions are the following ones.

We propose a new approach for the scalable Tucker decomposition problem, denoted Singleshot, which leverages coordinate gradient descent [41] and sequential processing of data chunks, and which is amenable to constrained optimization. In order to improve the computational efficiency of Singleshot, we introduce an inexact-gradient variant, denoted Singleshotinexact. This inexact approach can be further extended so as to decompose, in an online fashion, a tensor growing in every mode.

From a theoretical standpoint, we establish for Singleshot an ergodic convergence rate of $O(1/K)$ ($K$: maximum number of iterations) to a stationary point, and for Singleshotinexact a convergence rate of $O(1/k)$ ($k$ being the iteration number) to a minimizer.

We provide experimental analyses showing that our approaches are able to decompose bigger tensors than competitors without compromising efficiency. From a streaming tensor decomposition point of view, our Singleshot extension is competitive with its competitor.

2 Notations & Definitions

An $N$-order tensor is denoted by a boldface Euler script letter $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$.
The matrices are denoted by bold capital letters (e.g. $\mathbf{A}$). The identity matrix is denoted by $\mathrm{Id}$. The $j$-th row of a matrix $\mathbf{A} \in \mathbb{R}^{J \times L}$ is denoted by $\mathbf{A}_{j,:}$ and the transpose of a matrix $\mathbf{A}$ by $\mathbf{A}^\top$. Matricization is the process of reordering all the elements of a tensor into a matrix. The mode-$n$ matricization of a tensor $\mathcal{X}$ arranges the mode-$n$ fibers to be the columns of the resulting matrix $\mathbf{X}_{(n)} \in \mathbb{R}^{I_n \times (\prod_{m \neq n} I_m)}$. The mode-$n$ product of a tensor $\mathcal{G} \in \mathbb{R}^{J_1 \times \cdots \times J_N}$ with a matrix $\mathbf{A} \in \mathbb{R}^{I_n \times J_n}$, denoted by $\mathcal{G} \times_n \mathbf{A}$, yields a tensor of the same order $\mathcal{B} \in \mathbb{R}^{J_1 \times \cdots \times J_{n-1} \times I_n \times J_{n+1} \times \cdots \times J_N}$ whose mode-$n$ matricized form is defined by $\mathbf{B}_{(n)} = \mathbf{A}\mathbf{G}_{(n)}$. For a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, its $i_n$-th subtensor with respect to mode $n$ is denoted by $\mathcal{X}^n_{i_n} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times 1 \times I_{n+1} \times \cdots \times I_N}$. This subtensor is an $N$-order tensor defined via the mapping between its mode-$n$ matricization $(\mathcal{X}^n_{i_n})_{(n)}$ and the $i_n$-th row of $\mathbf{X}_{(n)}$, i.e. the tensor $\mathcal{X}^n_{i_n}$ is obtained by reshaping the $i_n$-th row of $\mathbf{X}_{(n)}$ to the target shape $(I_1, \ldots, I_{n-1}, 1, I_{n+1}, \ldots, I_N)$. The set of integers from $n$ to $N$ is denoted by $\mathbb{I}_n^N = \{n, \ldots, N\}$; if $n = 1$, the set is simply denoted by $\mathbb{I}_N$. The set of integers from $1$ to $N$ with $n$ excluded is denoted by $\mathbb{I}_{N \neq n} = \{1, \ldots, n-1, n+1, \ldots, N\}$. Let us define the tensor $\mathcal{G} \in \mathbb{R}^{J_1 \times \cdots \times J_N}$ and $N$ matrices $\mathbf{A}^{(m)} \in \mathbb{R}^{I_m \times J_m}$, $m \in \mathbb{I}_N$. The product of $\mathcal{G}$ with the matrices $\mathbf{A}^{(m)}$, $m \in \mathbb{I}_N$, denoted by $\mathcal{G} \times_1 \mathbf{A}^{(1)} \times_2 \cdots \times_N \mathbf{A}^{(N)}$, will alternatively be expressed as:
$$\mathcal{G} \times_{m \in \mathbb{I}_N} \mathbf{A}^{(m)} = \mathcal{G} \times_{m \in \mathbb{I}_{n-1}} \mathbf{A}^{(m)} \times_n \mathbf{A}^{(n)} \times_{q \in \mathbb{I}_{n+1}^N} \mathbf{A}^{(q)} = \mathcal{G} \times_{m \in \mathbb{I}_{N \neq n}} \mathbf{A}^{(m)} \times_n \mathbf{A}^{(n)}.$$
The Frobenius norm of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, denoted by $\|\mathcal{X}\|_F$, is defined by $\|\mathcal{X}\|_F = \big(\sum_{1 \le i_n \le I_n, 1 \le n \le N} \mathcal{X}^2_{i_1, \ldots, i_N}\big)^{1/2}$. The same definition holds for matrices.

3 Piecewise tensor decomposition: Singleshot

3.1 Tucker decomposition and problem statement

Given a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, the Tucker decomposition aims at the following approximation:
$$\mathcal{X} \approx \mathcal{G} \times_{m \in \mathbb{I}_N} \mathbf{A}^{(m)}, \quad \mathcal{G} \in \mathbb{R}^{J_1 \times \cdots \times J_N}, \ \mathbf{A}^{(m)} \in \mathbb{R}^{I_m \times J_m}.$$
The tensor $\mathcal{G}$ is generally named the core tensor and the matrices $\{\mathbf{A}^{(m)}\}_{m \in \mathbb{I}_N}$ the loading matrices. With orthogonality constraints on the loading matrices, this decomposition can be seen as the multidimensional version of the singular value decomposition [23]. A natural way to tackle this problem is to infer the latent factors $\mathcal{G}$ and $\{\mathbf{A}^{(m)}\}_{m \in \mathbb{I}_N}$ in such a way that the discrepancy between $\mathcal{X}$ and its approximation is low. Thus, the decomposition of $\mathcal{X}$ is usually obtained by solving the following optimization problem:
$$\min_{\mathcal{G}, \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}} f\big(\mathcal{G}, \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}\big) \triangleq \frac{1}{2}\Big\|\mathcal{X} - \mathcal{G} \times_{m \in \mathbb{I}_N} \mathbf{A}^{(m)}\Big\|_F^2 \quad (1)$$
Our goal in this work is to solve the above problem for large tensors while addressing two potential issues: processing a tensor that does not fit into the available memory, and avoiding the memory overflow problem generated by intermediate operations during the decomposition process [21]. For this objective, we leverage a reformulation of problem (1) in terms of subtensors drawn from $\mathcal{X}$ with respect to one mode (which we suppose to be predefined), the final objective being to set up a divide-and-conquer type approach for the inference task. Let us consider a fixed integer $n$ (in the sequel, $n$ will be referred to as the splitting mode). The objective function can be rewritten in the following form (see supplementary material, property 2):
$$f\big(\mathcal{G}, \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}\big) = \frac{1}{2} \sum_{i_n=1}^{I_n} \Big\|\mathcal{X}^n_{i_n} - \mathcal{G} \times_{m \in \mathbb{I}_{N \neq n}} \mathbf{A}^{(m)} \times_n \mathbf{A}^{(n)}_{i_n,:}\Big\|_F^2 \quad (2)$$
More generally, the function $f$ given by (1) can be expressed in terms of subtensors drawn with respect to every mode (see supplementary material, property 3). For simplicity, we only address the case of subtensors drawn with respect to one mode; the general case can be derived following the same principle (see supplementary material, section 5).
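To make these notations concrete, here is a minimal NumPy sketch (our illustration, not the authors' code; the helper names unfold, mode_n_product and tucker_reconstruct are ours). It implements the mode-$n$ matricization and the mode-$n$ product, and numerically checks that the objective (1) decomposes over the mode-$n$ subtensors as stated in (2).

```python
# Minimal sketch, assuming nothing beyond NumPy: mode-n matricization,
# mode-n product, Tucker reconstruction, and the slice-wise split of (2).
import numpy as np

def unfold(T, n):
    """Mode-n matricization: move axis n to the front and flatten the rest."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def fold(M, n, shape):
    """Inverse of unfold for a tensor of the given target shape."""
    full_shape = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape(full_shape), 0, n)

def mode_n_product(T, A, n):
    """Mode-n product T x_n A, defined by (T x_n A)_(n) = A T_(n)."""
    shape = list(T.shape)
    shape[n] = A.shape[0]
    return fold(A @ unfold(T, n), n, shape)

def tucker_reconstruct(G, factors):
    """G x_1 A^(1) x_2 ... x_N A^(N)."""
    X_hat = G
    for n, A in enumerate(factors):
        X_hat = mode_n_product(X_hat, A, n)
    return X_hat

rng = np.random.default_rng(0)
I, J = (6, 5, 4), (3, 2, 2)
X = rng.standard_normal(I)
G = rng.standard_normal(J)
factors = [rng.standard_normal((I[n], J[n])) for n in range(3)]

# Objective (1): 0.5 * ||X - G x_1 A^(1) ... x_N A^(N)||_F^2
f_full = 0.5 * np.linalg.norm(X - tucker_reconstruct(G, factors)) ** 2

# Objective (2): sum over the mode-n subtensors (here n = 0); each slice
# only involves the corresponding row of A^(n).
n = 0
f_split = 0.0
for i in range(I[n]):
    A_row = factors[n][i:i + 1, :]                # i_n-th row of A^(n)
    slice_hat = tucker_reconstruct(G, [A_row] + factors[1:])
    f_split += 0.5 * np.linalg.norm(X[i:i + 1] - slice_hat) ** 2

assert np.isclose(f_full, f_split)
```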
3.2 Singleshot

Since problem (1) does not admit any analytic solution, we propose a numerical resolution based on coordinate gradient descent [41]. The underlying idea is a cyclic update over each of the variables $\mathcal{G}, \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}$, while fixing the others at their last updated values, each update being performed via a single iteration of gradient descent.

Algorithm 1 Singleshot
Inputs: $\mathcal{X}$ tensor of interest, $n$ splitting mode, $\{\mathbf{A}^{(m)}_0\}_{1 \le m \le N}$ initial loading matrices
Output: $\mathcal{G}$, $\{\mathbf{A}^{(m)}\}_{1 \le m \le N}$
Initialization: $k = 0$
1: while a predefined stopping criterion is not met do
2:   Compute optimal step $\eta^G_k$
3:   $\mathcal{G}_{k+1} \leftarrow \mathcal{G}_k - \eta^G_k D^G_k$ with $D^G_k$ given by (4)
4:   for $p$ from 1 to $N$ do
5:     Compute optimal step $\eta^p_k$
6:     $\mathbf{A}^{(p)}_{k+1} \leftarrow \mathbf{A}^{(p)}_k - \eta^p_k D^p_k$ with $D^p_k$ given by (5), (6)
7:   end for
8: end while

More formally, given at iteration $k$ the values $\mathcal{G}_k, \mathbf{A}^{(1)}_k, \ldots, \mathbf{A}^{(N)}_k$ of the latent factors, the derivatives $D^G_k$ and $D^p_k$ of $f$ with respect to the core tensor and the $p$-th loading matrix, respectively evaluated at $\big(\mathcal{G}_k, \mathbf{A}^{(1)}_k, \ldots, \mathbf{A}^{(N)}_k\big)$ and $\big(\mathcal{G}_{k+1}, \mathbf{A}^{(1)}_{k+1}, \ldots, \mathbf{A}^{(p-1)}_{k+1}, \mathbf{A}^{(p)}_k, \ldots, \mathbf{A}^{(N)}_k\big)$, are given by:
$$D^G_k = \nabla_{\mathcal{G}} f\big(\mathcal{G}_k, \{\mathbf{A}^{(m)}_k\}_1^N\big), \qquad D^p_k = \nabla_{\mathbf{A}^{(p)}} f\big(\mathcal{G}_{k+1}, \{\mathbf{A}^{(m)}_{k+1}\}_1^{p-1}, \{\mathbf{A}^{(q)}_k\}_p^N\big) \quad (3)$$
The resulting cyclic update algorithm, named Singleshot, is summarized in Algorithm 1. A naive implementation of the gradient computation would result in a memory overflow problem. In what follows, we show that the derivatives $D^G_k$ and $D^p_k$, $1 \le p \le N$, given by equation (3) can be computed by processing a single subtensor at a time, making Algorithm 1 amenable to sequential processing of subtensors. Discussions on how the step sizes are obtained will be provided in Section 4.

Derivative with respect to $\mathcal{G}$. The derivative with respect to the core tensor is given by (details in Property 7 of the supplementary material):
$$D^G_k = \sum_{i_n=1}^{I_n} \mathcal{R}_{i_n} \times_{m \in \mathbb{I}_{N \neq n}} \big(\mathbf{A}^{(m)}_k\big)^\top \times_n \big((\mathbf{A}^{(n)}_k)_{i_n,:}\big)^\top, \quad \mathcal{R}_{i_n} = -\mathcal{X}^n_{i_n} + \mathcal{G}_k \times_{m \in \mathbb{I}_{N \neq n}} \mathbf{A}^{(m)}_k \times_n (\mathbf{A}^{(n)}_k)_{i_n,:} \quad (4)$$
It is straightforward to see that $D^G_k$ (given by equation (4)) is the last term of the recursive sequence $\big((D^G_k)_j\big)_{1 \le j \le I_n}$ defined as $(D^G_k)_j = (D^G_k)_{j-1} + \theta_j$, with $(D^G_k)_0$ being the null tensor and $\theta_j$ the $j$-th term of the sum in (4). An important observation is that the additive term $\theta_j$ depends only on one single subtensor $\mathcal{X}^n_j$. This is the key of our approach, since it allows the computation of $D^G_k$ through the sequential processing of a single subtensor $\mathcal{X}^n_j$ at a time (see the sketch below).

Derivatives with respect to $\mathbf{A}^{(p)}$, $p \neq n$ ($n$ being the splitting mode). For those derivatives, we can follow exactly the same reasoning, given in detail in Property 9 of the supplementary material, and obtain for $p < n$ (the case $p > n$ yields a similar formula):
$$D^p_k = \sum_{i_n=1}^{I_n} \Big(-\big(\mathbf{X}^n_{i_n}\big)_{(p)} + \mathbf{A}^{(p)}_k \mathbf{B}^{(p)}_{i_n}\Big)\big(\mathbf{B}^{(p)}_{i_n}\big)^\top \quad (5)$$
The matrices $(\mathbf{X}^n_{i_n})_{(p)}$ and $\mathbf{B}^{(p)}_{i_n}$ represent the mode-$p$ matricized forms of, respectively, the $i_n$-th subtensor $\mathcal{X}^n_{i_n}$ and the tensor $\mathcal{B}_{i_n}$ defined by $\mathcal{B}_{i_n} = \mathcal{G}_{k+1} \times_{m \in \mathbb{I}_{p-1}} \mathbf{A}^{(m)}_{k+1} \times_p \mathrm{Id} \times_{q \in \mathbb{I}_{p+1}^N, q \neq n} \mathbf{A}^{(q)}_k \times_n (\mathbf{A}^{(n)}_k)_{i_n,:}$, with $\mathrm{Id} \in \mathbb{R}^{J_p \times J_p}$ being the identity matrix. With a similar reasoning as for the derivative with respect to the core, it is straightforward to see that $D^p_k$ can be computed by processing a single subtensor at a time.
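The recursion for the core gradient can be sketched as follows. This is our own hedged reading of equation (4), not the released implementation; the names mode_n_product and core_gradient are illustrative. Each mode-$n$ subtensor contributes one additive term $\theta_j$, so $D^G_k$ is accumulated while streaming the subtensors, without ever holding the full tensor in memory.

```python
# Hedged sketch of the recursive computation of D^G_k in (4), accumulated one
# mode-n subtensor at a time (assumption: this mirrors, not reproduces, the
# authors' routine).
import numpy as np

def mode_n_product(T, A, n):
    # T x_n A: contract mode n of T with the columns of A, result placed at axis n.
    return np.moveaxis(np.tensordot(A, T, axes=(1, n)), 0, n)

def core_gradient(subtensor_stream, G, factors, n):
    """Accumulate D^G over mode-n subtensors; `subtensor_stream` yields pairs
    (i, X_i) where X_i is the i-th subtensor (size 1 along mode n)."""
    DG = np.zeros_like(G)
    for i, X_i in subtensor_stream:
        # residual R_i = -X^n_i + G x_{m != n} A^(m) x_n A^(n)_{i,:}
        recon = G
        for m, A in enumerate(factors):
            recon = mode_n_product(recon, A[i:i + 1, :] if m == n else A, m)
        R = -X_i + recon
        # theta_i = R_i x_{m != n} (A^(m))^T x_n (A^(n)_{i,:})^T, added to DG
        theta = R
        for m, A in enumerate(factors):
            theta = mode_n_product(theta, (A[i:i + 1, :] if m == n else A).T, m)
        DG += theta
    return DG

# Example usage with an in-memory tensor split along mode n = 0; in practice
# each subtensor X_i could instead be loaded from disk one at a time.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 5, 4))
G = rng.standard_normal((3, 2, 2))
factors = [rng.standard_normal((d, r)) for d, r in zip(X.shape, G.shape)]
stream = ((i, X[i:i + 1]) for i in range(X.shape[0]))
DG = core_gradient(stream, G, factors, n=0)
```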
Derivative with respect to $\mathbf{A}^{(n)}$ ($n$ being the splitting mode). The derivative with respect to the matrix $\mathbf{A}^{(n)}$ can be computed via the row-wise stacking of independent terms, namely the derivatives with respect to the rows $\mathbf{A}^{(n)}_{j,:}$, and the derivative of $f$ with respect to $\mathbf{A}^{(n)}_{j,:}$ depends only on $\mathcal{X}^n_j$. Indeed, let us consider $1 \le j \le I_n$. In the expression of the objective function $f$ given by equation (2), the only term that depends on $\mathbf{A}^{(n)}_{j,:}$ is $\frac{1}{2}\big\|\mathcal{X}^n_j - \mathcal{G} \times_{m \in \mathbb{I}_{n-1}} \mathbf{A}^{(m)} \times_n \mathbf{A}^{(n)}_{j,:} \times_{q \in \mathbb{I}_{n+1}^N} \mathbf{A}^{(q)}\big\|_F^2$; thus the derivative of $f$ with respect to $\mathbf{A}^{(n)}_{j,:}$ depends only on $\mathcal{X}^n_j$ and is given by (see Property 8 in the supplementary material):
$$\nabla_{\mathbf{A}^{(n)}_{j,:}} f\big(\mathcal{G}, \{\mathbf{A}^{(m)}\}_1^N\big) = \Big(-\big(\mathbf{X}^n_j\big)_{(n)} + \mathbf{A}^{(n)}_{j,:} \mathbf{B}_{(n)}\Big)\big(\mathbf{B}_{(n)}\big)^\top \quad (6)$$
The matrices $(\mathbf{X}^n_j)_{(n)} \in \mathbb{R}^{1 \times \prod_{k \neq n} I_k}$ and $\mathbf{B}_{(n)}$ respectively represent the mode-$n$ matricized forms of the tensors $\mathcal{X}^n_j$ and $\mathcal{B}$, with $\mathcal{B} = \mathcal{G} \times_{p \in \mathbb{I}_{n-1}} \mathbf{A}^{(p)} \times_n \mathrm{Id} \times_{q \in \mathbb{I}_{n+1}^N} \mathbf{A}^{(q)}$ and $\mathrm{Id} \in \mathbb{R}^{J_n \times J_n}$ the identity matrix.

Remark 1. For one-mode subtensors, it is relevant to choose $n$ such that $I_n$ is the largest dimension, since this yields the smallest subtensors. We stress that all entries of the tensor $\mathcal{X}$ are processed when running Algorithm 1; our key point is the sequential processing of the subtensors $\mathcal{X}^n_{i_n}$. In addition, if one subtensor does not fit in the available memory, the recursion, as shown in section 5 of the supplementary material, can still be applied to subtensors of the form $\mathcal{X}_{\theta_1, \ldots, \theta_N}$, $\theta_m \subset \{1, 2, \ldots, I_m\}$, with $(\mathcal{X}_{\theta_1, \ldots, \theta_N})_{i_1, \ldots, i_N} = \mathcal{X}_{i_1, \ldots, i_N}$, $(i_1, \ldots, i_N) \in \theta_1 \times \cdots \times \theta_N$, $\times$ referring to the Cartesian product.

3.3 Singleshotinexact

While all of the subtensors $\mathcal{X}^n_{i_n}$, $1 \le i_n \le I_n$, are considered in the Singleshot algorithm for the computation of the gradient, in Singleshotinexact we propose to use only a subset of them for the sake of reducing computational time. The principle is to use only $B_k < I_n$ subtensors for the gradient computation. Let us consider the set $\mathrm{SET}_k$ (of cardinality $B_k$) composed of the integers representing the indexes of the subtensors that are going to be used at iteration $k$. The numerical resolution scheme is identical to the one described by Algorithm 1, except for the definitions of $D^G_k$ and $D^p_k$, which are respectively replaced by $\widehat{D}^G_k$ and $\widehat{D}^p_k$, $p \neq n$, defined by:
$$\widehat{D}^G_k = \sum_{i_n \in \mathrm{SET}_k} \mathcal{R}_{i_n} \times_{m \in \mathbb{I}_{N \neq n}} \big(\mathbf{A}^{(m)}_k\big)^\top \times_n \big((\mathbf{A}^{(n)}_k)_{i_n,:}\big)^\top \quad (7)$$
$$\widehat{D}^p_k = \sum_{i_n \in \mathrm{SET}_k} \Big(-\big(\mathbf{X}^n_{i_n}\big)_{(p)} + \mathbf{A}^{(p)}_k \mathbf{B}^{(p)}_{i_n}\Big)\big(\mathbf{B}^{(p)}_{i_n}\big)^\top \quad (8)$$
For the theoretical convergence analysis, the descent steps $\eta^G_k$ and $\eta^p_k$, $1 \le p \le N$, are scaled according to $B_k$. It is worth highlighting that the derivative $\widehat{D}^n_k$ ($n$ being the mode with respect to which the subtensors are drawn) is sparse: Singleshotinexact amounts to minimizing $f$ defined by (2) while dropping the terms $\frac{1}{2}\big\|\mathcal{X}^n_j - \mathcal{G} \times_{m \in \mathbb{I}_{N \neq n}} \mathbf{A}^{(m)} \times_n \mathbf{A}^{(n)}_{j,:}\big\|_F^2$ with $j \notin \mathrm{SET}_k$; thus, the rows $(\widehat{D}^n_k)_{j,:}$, $j \notin \mathrm{SET}_k$, are all equal to zero.

3.4 Discussions

First, we discuss the space complexity needed by our algorithms, supposing that the subtensors are drawn with respect to one mode. Let us denote by $n$ the splitting mode. For Singleshot and Singleshotinexact, at any given time we only need to have in memory the subtensor $\mathcal{X}^n_j$ of size $\prod_{m \in \mathbb{I}_{N \neq n}} I_m = I_1 \cdots I_{n-1} I_{n+1} \cdots I_N$, the matrices $\{\mathbf{A}^{(m)}\}_{m \in \mathbb{I}_{N \neq n}}$, the row $\mathbf{A}^{(n)}_{i_n,:}$ and the previous iterate of the gradient. Thus, the space complexity is $\prod_{m \in \mathbb{I}_{N \neq n}} I_m + \sum_{m \neq n} I_m J_m + J_n + A_T$, with $A_T$ being the space complexity of the previous gradient iterate: for the core update, $A_T = \prod_{m \in \mathbb{I}_N} J_m$, and for a loading matrix $\mathbf{A}^{(m)}$, $A_T = I_m J_m$. If the recursion used for the derivative computations is applied to subtensors of the form $\mathcal{X}_{\theta_1, \ldots, \theta_N}$, the space complexity is smaller than these figures.
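To make the sampled update concrete, the sketch below (ours, under stated assumptions) draws a set $\mathrm{SET}_k$ of $B_k$ indices without replacement and accumulates the inexact core gradient (7) over those subtensors only. The helpers accumulate_gradient (expected to behave like the core_gradient routine sketched in Section 3.2) and load_subtensor are hypothetical and passed in explicitly.

```python
# Hedged sketch of one Singleshotinexact core update (our illustration, not the
# authors' code): only B_k < I_n subtensor indices enter the gradient.
import numpy as np

def inexact_core_step(accumulate_gradient, load_subtensor, G, factors, n,
                      B_k, step, rng):
    """accumulate_gradient(stream, G, factors, n) sums the terms of eq. (7);
    load_subtensor(i) returns the i-th mode-n subtensor (size 1 along mode n)."""
    I_n = factors[n].shape[0]
    SET_k = rng.choice(I_n, size=B_k, replace=False)      # indices used at iteration k
    stream = ((i, load_subtensor(i)) for i in SET_k)
    DG_hat = accumulate_gradient(stream, G, factors, n)   # \hat{D}^G_k, eq. (7)
    # Scaling the step with B_k mirrors the paper's choice of tying the descent
    # steps to B_k; the exact 1/B_k factor here is our illustrative assumption.
    return G - (step / B_k) * DG_hat
```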
Another variant of Singleshotinexact, named Singleshotonline, can be derived to address an interesting problem that has received little attention so far [4], namely the decomposition of a tensor streaming in every mode under a single-pass constraint (i.e. each chunk of data is processed only once). This is enabled by the inexact gradient computation, which only uses the subtensors that are needed. In the streaming context, the gradient is computed based only on the available subtensor. Positivity is one of the most frequently encountered constraints in tensor computation, and we can simply handle it via the so-called projected gradient descent [45]. This operation does not alter the space complexity with respect to the unconstrained case, since no additional storage is required, but it increases the time complexity. See section 3 of the supplementary material for the algorithmic details of the proposed variants.

4 Theoretical result

Let us consider the minimization problem (1): $\min_{\mathcal{G}, \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}} f\big(\mathcal{G}, \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}\big)$. Denoting the block-wise derivative by $\nabla_x f$, the derivative of $f$, denoted $\nabla f$ and defined by $\big(\nabla_{\mathcal{G}} f, \nabla_{\mathbf{A}^{(1)}} f, \ldots, \nabla_{\mathbf{A}^{(N)}} f\big)$, is an element of $\mathbb{R}^{J_1 \times \cdots \times J_N} \times \mathbb{R}^{I_1 \times J_1} \times \cdots \times \mathbb{R}^{I_N \times J_N}$ endowed with the norm defined as the sum of the Frobenius norms. Besides, let us consider, for writing simplicity, the alternative notations of $f(\mathcal{G}, \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)})$ given by $f\big(\mathcal{G}, \{\mathbf{A}^{(m)}\}_1^N\big)$ and $f\big(\mathcal{G}, \{\mathbf{A}^{(m)}\}_1^{p}, \{\mathbf{A}^{(q)}\}_{p+1}^N\big)$. For the theoretical guarantees, whose details are reported in the supplementary material, we consider the following assumptions.

Assumption 1 (Uniform boundedness). The mode-$n$ subtensors are bounded: $\|\mathcal{X}^n_{i_n}\|_F \le \rho$.

Assumption 2 (Boundedness of factors). We consider the domains $\mathcal{G} \in D_g$, $\mathbf{A}^{(m)} \in D_m$ with $D_g = \{\mathcal{G}_a : \|\mathcal{G}_a\|_F \le \alpha\}$ and $D_m = \{\mathbf{A}^{(m)}_a : \|\mathbf{A}^{(m)}_a\|_F \le \alpha\}$.

4.1 Convergence result of Singleshot

For the convergence analysis, we consider the following definitions of the descent steps $\eta^G_k$ and $\eta^p_k$ at the $(k+1)$-th iteration:
$$\eta^G_k = \arg\min_{\eta \in [\frac{\delta_1}{K}, \frac{\delta_2}{K}]} \varphi_g(\eta), \qquad \eta^p_k = \arg\min_{\eta \in [\frac{\delta_1}{K}, \frac{\delta_2}{K}]} \varphi_p(\eta) \quad (9)$$
$$\varphi_g(\eta) = f\big(\mathcal{G}_k - \eta D^G_k, \{\mathbf{A}^{(m)}_k\}_1^N\big) - f\big(\mathcal{G}_k, \{\mathbf{A}^{(m)}_k\}_1^N\big)$$
$$\varphi_p(\eta) = f\big(\mathcal{G}_{k+1}, \{\mathbf{A}^{(m)}_{k+1}\}_1^{p-1}, \mathbf{A}^{(p)}_k - \eta D^p_k, \{\mathbf{A}^{(q)}_k\}_{p+1}^N\big) - f\big(\mathcal{G}_{k+1}, \{\mathbf{A}^{(m)}_{k+1}\}_1^{p-1}, \{\mathbf{A}^{(q)}_k\}_{p}^N\big)$$
with $\delta_2 > \delta_1 > 0$ being user-defined parameters. The motivation for the problems given by equation (9) is to ensure a decrease of the objective function after each update. Also note that the minimization problems related to $\eta^G_k$ and $\eta^p_k$ are well defined, since all the factors involved in their definitions are known at the computation stage of $\mathcal{G}_{k+1}$ and $\mathbf{A}^{(p)}_{k+1}$, and they correspond to the minimization of a continuous function on a compact set. Along with Assumption 1 and Assumption 2, as well as the definitions given by (9), we assume that
$$\frac{\delta_1}{K} < \eta^G_k \le \frac{\delta_2}{K}, \qquad \frac{\delta_1}{K} < \eta^p_k \le \frac{\delta_2}{K}.$$
This simply amounts to considering that the solutions of the minimization problems defined by equation (9) are not attained at the lower bound of the interval. Under this framework, we establish, as in standard non-convex settings [12], an ergodic convergence rate. Precisely, we prove that:
$$\exists K_0 \ge 1, \ \forall K \ge K_0, \quad \frac{1}{K} \sum_{k=0}^{K-1} \Big\|\nabla f\big(\mathcal{G}_k, \{\mathbf{A}^{(m)}_k\}_1^N\big)\Big\| \le \frac{C}{K}$$
with $\big\|\nabla f\big(\mathcal{G}_k, \{\mathbf{A}^{(m)}_k\}_1^N\big)\big\| = \big\|\nabla_{\mathcal{G}} f\big(\mathcal{G}_k, \{\mathbf{A}^{(m)}_k\}_1^N\big)\big\|_F + \sum_{p=1}^N \big\|\nabla_{\mathbf{A}^{(p)}} f\big(\mathcal{G}_k, \{\mathbf{A}^{(m)}_k\}_1^N\big)\big\|_F$, $C$ a constant proportional to $2\Gamma + \alpha^{2N}\Gamma_g^2\delta_2^2 + \sum_{p=1}^N\big(1 + 2\Gamma + \alpha^{2N}\Gamma_p^2\delta_2^2\big)$, and $\Gamma, \Gamma_p, \Gamma_g \ge 0$ being respectively the suprema of $f$, $\|\nabla_{\mathbf{A}^{(p)}} f(\mathcal{G}, \{\mathbf{A}^{(m)}\})\|_F$ and $\|\nabla_{\mathcal{G}} f(\mathcal{G}, \{\mathbf{A}^{(m)}\})\|_F$ on the compact set $D_g \times D_1 \times \cdots \times D_N$. This result proves that Singleshot converges ergodically, at the rate $O(1/K)$, to a point where the gradient is equal to 0.

4.2 Convergence result for Singleshotinexact

Let us consider $\ell_j(\mathbf{A}^{(N)}) \triangleq \frac{1}{2}\big\|\mathcal{X}^n_j - \mathcal{G}_{k+1} \times_{m \in \mathbb{I}_{n-1}} \mathbf{A}^{(m)}_{k+1} \times_n (\mathbf{A}^{(n)}_{k+1})_{j,:} \times_{q \in \mathbb{I}_{n+1}^{N-1}} \mathbf{A}^{(q)}_{k+1} \times_N \mathbf{A}^{(N)}\big\|_F^2$, and let the step $\eta^N_k$ for $\mathbf{A}^{(N)}$ be defined by the following minimization problem:
$$\eta^N_k = \arg\min_{\eta \in [\frac{1}{4K^\gamma}, \frac{1}{K^\gamma}]} \Big\{ f\big(\mathcal{G}_{k+1}, \{\mathbf{A}^{(m)}_{k+1}\}_1^{N-1}, \phi(\eta)\big) - f\big(\mathcal{G}_k, \{\mathbf{A}^{(m)}_k\}_1^N\big) + \lambda f\big(\mathcal{G}_{k+1}, \{\mathbf{A}^{(m)}_{k+1}\}_1^{N-1}, \phi(\eta)\big) \Big\} \quad (12)$$
with $\phi(\eta) = \mathbf{A}^{(N)}_k - \frac{\eta}{B_k} \sum_{j \in \mathrm{SET}_k} \nabla_{\mathbf{A}^{(N)}} \ell_j$, $\nabla_{\mathbf{A}^{(N)}} \ell_j$ being the derivative of $\ell_j$ evaluated at $\mathbf{A}^{(N)}_k$. The parameters $\lambda > 0$ and $\gamma > 1$ are user-defined.
In addition to Assumption 1 and Assumption 2, we consider the following three additional assumptions.

1. Descent step related to the update of $\mathbf{A}^{(N)}$. We assume that $\frac{1}{4K^\gamma} < \eta^N_k \le \frac{1}{\alpha^{2N}}$, which means that the solution of problem (12) is not attained at the lower bound of the interval.

2. Non-vanishing gradient with respect to $\mathbf{A}^{(N)}$: $\nabla_{\mathbf{A}^{(N)}} f\big(\mathcal{G}_{k+1}, \{\mathbf{A}^{(m)}_{k+1}\}_1^{N-1}, \mathbf{A}^{(N)}_k\big) \neq 0$. This condition ensures the existence of a set $\mathrm{SET}_k$ such that $\sum_{j \in \mathrm{SET}_k} \nabla_{\mathbf{A}^{(N)}} \ell_j \neq 0$, and the set considered for problem (12) is one of such sets.

3. Choice of the number of subtensors $B_k$. We suppose that $I_n \sqrt{\frac{1}{2} + \frac{1}{I_n}} \le B_k$ and $I_n > 2$. The condition $I_n > 2$ ensures that $I_n \sqrt{\frac{1}{2} + \frac{1}{I_n}} < I_n$.

With these assumptions at hand, the sequence $\Delta_k = f\big(\mathcal{G}_k, \{\mathbf{A}^{(m)}_k\}_1^N\big) - f_{\min}$ verifies:
$$\forall k > k_0 = 1 + \frac{1}{\log(1+\lambda)} \log\Big(\frac{1}{\log(1+\lambda)}\Big), \quad \Delta_k \le \frac{1 + \zeta(\lambda, \rho, \alpha, I_n)}{k}$$
with $\log$ being the logarithmic function, $f_{\min}$ representing the minimum of $f$ (a continuous function defined on the compact set $D_g \times D_1 \times \cdots \times D_N$) and $\zeta$ a function of $\lambda, \rho, \alpha, I_n$. The parameter $k_0$ is well-defined since $\lambda > 0$. This result ensures that the sequence $\big\{\mathcal{G}_k, \mathbf{A}^{(1)}_k, \ldots, \mathbf{A}^{(N)}_k\big\}$ generated by Singleshotinexact converges to the set of minimizers of $f$ at the rate $O(1/k)$.

Remark 2. The problems defined by equations (9) and (12), whose solutions are global and can be obtained by simple heuristics (e.g. golden-section search), are not in contradiction with our approach, since they can be solved by processing a single subtensor at a time thanks to the expression of $f$ given by (2).
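Since Remark 2 only points to generic heuristics, here is a plain golden-section search sketch (ours, not the authors' routine) that could serve to solve the one-dimensional step-size problems (9) and (12) on their intervals. The argument phi stands for the corresponding objective, e.g. the map $\eta \mapsto \varphi_g(\eta)$, which can itself be evaluated subtensor by subtensor thanks to (2).

```python
# Generic golden-section search on an interval [lo, hi]; a sketch assuming the
# objective phi is unimodal on the interval, as required by the method.
import math

def golden_section(phi, lo, hi, tol=1e-8, max_iter=100):
    """Approximate minimizer of a unimodal phi on [lo, hi]."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0          # inverse golden ratio
    a, b = lo, hi
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    fc, fd = phi(c), phi(d)
    for _ in range(max_iter):
        if b - a < tol:
            break
        if fc < fd:                                # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = phi(c)
        else:                                      # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = phi(d)
    return 0.5 * (a + b)
```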
Figure 1: Approximation error and running time for the unconstrained decomposition algorithms. From left to right: the first and second panels correspond to Movielens, the third and fourth to Enron. As M grows, a missing marker for a given algorithm means that it ran out of memory. The core $\mathcal{G}$ rank is (5, 5, 5).

Figure 2: Approximation error and running time for the non-negative decomposition algorithms. From left to right: the first and second panels correspond to Movielens, the third and fourth to Enron. As M grows, a missing marker for a given algorithm means that it ran out of memory. The core $\mathcal{G}$ rank is (5, 5, 5).

5 Numerical experiments

Our goal here is to illustrate that for small tensors, our algorithms Singleshot and Singleshotinexact, as well as their positive variants, are comparable to some state-of-the-art decomposition methods. Then, as the tensor size grows, we show that they are the only ones that remain scalable. The competitors we have considered include an SVD-based iterative algorithm [44] (denoted Greedy HOOI), a very recent alternate minimization approach based on sketching [4] (named Tensorsketch) and randomization-based methods [51] (Algorithm 2 in [51], named Scalrandomtucker, and Algorithm 1 in [51] with positivity constraints, named pos Scalrandomtucker). Other materials related to the numerical experiments are provided in section 4 of the supplementary material. For the tensor computations, we have used the TensorLy toolbox [22].

The experiments are performed on the Movielens dataset [15], from which we construct a 3-order tensor whose modes represent timesteps, movies and users, and on the Enron email dataset, from which a 3-order tensor is constructed, the first and second modes representing the senders and recipients of the emails and the third one denoting the most frequent words used in the miscellaneous emails. For Movielens, we set up a recommender system for which we report a mean average precision (MAP) obtained over a test set (averaged over five 50/50 train-test random splits), and for Enron, we report an approximation error (AE) on a test set (of the same size as the training one) for a regression problem. As our goal is to analyze the scalability of the different methods as the tensor to decompose grows, we have arbitrarily set the sizes of the Movielens and Enron tensors to $M \times M \times 200$ and $M \times M \times 610$, M being a user-defined parameter. Experiments have been run on macOS with 32 GB of memory.

Another important objective is to assess the robustness of our approach with respect to the assumptions and the step-size definitions laid out in Section 4, which is of primary importance since the minimization problems defining these steps can be time-consuming in practice for large tensors. This motivates the fact that, for our algorithms, the descent steps are fixed in advance. For Singleshot, the steps are fixed to $10^{-6}$. For Singleshotinexact, the steps are fixed to $10^{-6}$ and $10^{-8}$ for Enron and Movielens respectively. Regarding the computation of the inexact gradient for Singleshotinexact, the elements in $\mathrm{SET}_k$ are generated randomly without replacement, with the same cardinality for every $k$. The number of slices is chosen according to the third assumption in Section 4.2. In the charts, the unconstrained versions of Singleshot and Singleshotinexact are labeled "unconstrained", and the positive variants "positive".

Figure 3: Comparing the online versions of Tensorsketch and Singleshot with positivity constraints on the Enron dataset. (left) Approximation error. (right) Running time.

Figure 1 presents the results we obtain for these two datasets. At first, we can note that the performances, in terms of MAP or AE, are rather equivalent for the different methods. Regarding the running time, Scalrandomtucker is the best-performing algorithm, being an order of magnitude more efficient than the other approaches. However, all competing methods struggle to decompose tensors with dimension M = 4000 and M ≥ 2800, respectively for Enron and Movielens, due to memory errors. In contrast, our Singleshot methods are still able to decompose those tensors, although the running time is large.
As expected, Singleshotinexact is more computationally efficient than Singleshot. Figure 2 displays the approximation error and the running time for Singleshot and Singleshotinexact with positivity constraints, and for a randomized decomposition approach with non-negativity constraints denoted here as pos Scalrandomtucker, on Enron and Movielens. The quality of the decomposition is in favor of Singleshotpositive for both Movielens and Enron. In addition, when the tensor size is small enough, pos Scalrandomtucker is very computationally efficient, being one order of magnitude faster than our Singleshot approaches on Enron. However, pos Scalrandomtucker is not able to decompose very large tensors and runs out of memory, contrary to Singleshot.

For illustrating the online capability of our algorithm, we have considered a tensor of size $20000 \times 2000 \times 50$ constructed from Enron, which is artificially divided into slices drawn with respect to the first and the second modes. The core rank is (R, R, R). We compare the online variant of our approach with positivity constraints, named Singleshotonlinepositive, to the online version of Tensorsketch [4], denoted Tensorsketchonline. Figure 3 shows the running time for both algorithms. While of equivalent performance, our method is faster, as our proposed update schemes, based on one single step of gradient descent, are more computationally efficient than a full alternate minimization.

Remark 3. Other assessments are provided in the supplementary material: comparisons with other recent divide-and-conquer type approaches are provided, the non-nullity of the gradient with respect to $\mathbf{A}^{(n)}$ is numerically shown, and finally, we demonstrate the expected behavior of Singleshotinexact, i.e. the higher the number of subtensors in the gradient approximation, the better the performance.

6 Conclusion

We have introduced two new algorithms, named Singleshot and Singleshotinexact, for scalable Tucker decomposition with convergence rate guarantees: for Singleshot, we have established an ergodic convergence rate of $O(1/K)$ to a stationary point ($K$ being the maximum number of iterations), and for Singleshotinexact, a convergence rate of $O(1/k)$ to the set of minimizers ($k$ being the iteration number). Besides, we have proposed a new approach for a problem that has drawn little attention so far, that is, the Tucker decomposition, under a single-pass constraint (with no need to revisit past data), of a tensor streaming with respect to every mode. In future work, we aim to apply the principle of Singleshot to decomposition problems other than Tucker.

Acknowledgments

This work was supported by grants from the Normandie Projet GRR-DAISI, European funding FEDER DAISI and the LEAUDS ANR-18-CE23-0020 project of the French National Research Agency (ANR).

[1] Woody Austin, Grey Ballard, and Tamara G. Kolda. Parallel tensor compression for large-scale scientific data. 2016 IEEE International Parallel and Distributed Processing Symposium, 2016.
[2] Grey Ballard, Alicia Klin, and Tamara G. Kolda. TuckerMPI: A parallel C++/MPI software package for large-scale data compression via the Tucker tensor decomposition. arXiv, 2019.
[3] Muthu Manikandan Baskaran, Benoît Meister, Nicolas Vasilache, and Richard Lethin. Efficient and scalable computations with sparse tensors. 2012 IEEE Conference on High Performance Extreme Computing, pages 1–6, 2012.
[4] Stephen Becker and Osman Asif Malik. Low-rank Tucker decomposition of large tensors using TensorSketch. Advances in Neural Information Processing Systems, pages 10117–10127, 2018.
[5] Cesar F. Caiafa and Andrzej Cichocki. Generalizing the column-row matrix decomposition to multi-way arrays. Linear Algebra and its Applications, 433:557–573, 2010.
[6] Raymond B. Cattell. Parallel proportional profiles and other principles for determining the choice of factors by rotation. Psychometrika, 9(4):267–283, 1944.
[7] Venkatesan T. Chakaravarthy, Jee W. Choi, Douglas J. Joseph, Prakash Murali, Shivmaran S. Pandian, Yogish Sabharwal, and Dheeraj Sreedhar. On optimizing distributed Tucker decomposition for sparse tensors. In Proceedings of the 2018 International Conference on Supercomputing, pages 374–384, 2018.
[8] Maolin Che and Yimin Wei. Randomized algorithms for the approximations of Tucker and the tensor train decompositions. Advances in Computational Mathematics, pages 1–34, 2018.
[9] Dongjin Choi, Jun-Gi Jang, and Uksong Kang. Fast, accurate, and scalable method for sparse coupled matrix-tensor factorization. CoRR, 2017.
[10] Dongjin Choi and Lee Sael. SNeCT: Scalable network constrained Tucker decomposition for integrative multi-platform data analysis. CoRR, 2017.
[11] Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shun-ichi Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley Publishing, 2009.
[12] Dan Alistarh et al. The convergence of sparsified gradient methods. Advances in Neural Information Processing Systems, pages 5977–5987, 2018.
[13] Petros Drineas and Michael W. Mahoney. A randomized algorithm for a tensor-based generalization of the singular value decomposition. Linear Algebra and its Applications, 420:553–571, 2007.
[14] Dóra Erdös and Pauli Miettinen. Walk 'n' merge: A scalable algorithm for boolean tensor factorization. 2013 IEEE 13th International Conference on Data Mining, pages 1037–1042, 2013.
[15] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. TiiS, 5:19:1–19:19, 2015.
[16] F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. J. Math. Phys., 6(1):164–189, 1927.
[17] Inah Jeon, Evangelos E. Papalexakis, Uksong Kang, and Christos Faloutsos. HaTen2: Billion-scale tensor decompositions. 2015 IEEE 31st International Conference on Data Engineering, pages 1047–1058, 2015.
[18] Oguz Kaya and Bora Uçar. High performance parallel algorithms for the Tucker decomposition of sparse tensors. 2016 45th International Conference on Parallel Processing (ICPP), pages 103–112, 2016.
[19] Tamara Kolda and Brett Bader. The TOPHITS model for higher-order web link analysis. Workshop on Link Analysis, Counterterrorism and Security, 7, 2006.
[20] Tamara G. Kolda and Jimeng Sun. Scalable tensor decompositions for multi-aspect data mining. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 363–372, 2008.
[21] Tamara G. Kolda and Jimeng Sun. Scalable tensor decompositions for multi-aspect data mining. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, pages 363–372, 2008.
[22] Jean Kossaifi, Yannis Panagakis, and Maja Pantic. TensorLy: Tensor learning in Python. arXiv, 2018.
[23] Lieven De Lathauwer et al. A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl., 21(4):1253–1278, 2000.
[24] Lieven De Lathauwer and Joos Vandewalle. Dimensionality reduction in higher-order signal processing and rank-(r1, r2, ..., rn) reduction in multilinear algebra. Linear Algebra and its Applications, 391:31–55, 2004.
[25] Dongha Lee, Jaehyung Lee, and Hwanjo Yu. Fast Tucker factorization for large-scale tensor completion. 2018 IEEE International Conference on Data Mining (ICDM), pages 1098–1103, 2018.
[26] Xiaoshan Li, Hua Zhou, and Lexin Li. Tucker tensor regression and neuroimaging analysis. Statistics in Biosciences, 2013.
[27] Xinsheng Li, K. Selçuk Candan, and Maria Luisa Sapino. M2TD: Multi-task tensor decomposition for sparse ensemble simulations. 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 1144–1155, 2018.
[28] Ching-Yung Lin, Nan Cao, Shi Xia Liu, Spiros Papadimitriou, J. Sun, and Xifeng Yan. SmallBlue: Social network analysis for expertise search and collective intelligence. ICDE, pages 1483–1486, 2009.
[29] Michael W. Mahoney, Mauro Maggioni, and Petros Drineas. Tensor-CUR decompositions for tensor-based data. In SIGKDD, pages 327–336, 2006.
[30] Carmeliza Navasca and Deonnia N. Pompey. Random projections for low multilinear rank tensors. In Visualization and Processing of Higher Order Descriptors for Multi-Valued Data, pages 93–106, 2015.
[31] Jinoh Oh, Kijung Shin, Evangelos E. Papalexakis, Christos Faloutsos, and Hwanjo Yu. S-HOT: Scalable high-order Tucker decomposition. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 761–770, 2017.
[32] Sejoon Oh, Namyong Park, Lee Sael, and Uksong Kang. Scalable Tucker factorization for sparse tensors: algorithms and discoveries. 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 1120–1131, 2018.
[33] Moonjeong Park, Jun-Gi Jang, and Sael Lee. VeST: Very sparse Tucker factorization of large-scale tensors. 2019.
[34] Namyong Park, Sejoon Oh, and U Kang. Fast and scalable method for distributed boolean tensor factorization. The VLDB Journal, pages 1–26, 2019.
[35] Ioakeim Perros, Robert Chen, Richard Vuduc, and J. Sun. Sparse hierarchical Tucker factorization and its application to healthcare. pages 943–948, 2015.
[36] Kijung Shin, Lee Sael, and U Kang. Fully scalable methods for distributed tensor factorization. IEEE Trans. on Knowl. and Data Eng., 29(1):100–113, January 2017.
[37] Nicholas D. Sidiropoulos, Evangelos E. Papalexakis, and Christos Faloutsos. Parallel randomly compressed cubes (PARACOMP): A scalable distributed architecture for big tensor decomposition. 2014.
[38] Shaden Smith and George Karypis. Accelerating the Tucker decomposition with compressed sparse tensors. In Euro-Par, 2017.
[39] Jimeng Sun, Spiros Papadimitriou, Ching-Yung Lin, Nan Cao, Mengchen Liu, and Weihong Qian. MultiVis: Content-based social network exploration through multi-way visual analysis. In SDM, 2009.
[40] Jimeng Sun, Dacheng Tao, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos. Incremental tensor analysis: Theory and applications. TKDD, 2:11:1–11:37, 2008.
[41] Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387–423, 2007.
[42] Charalampos E. Tsourakakis. MACH: Fast randomized tensor decompositions. In SDM, 2009.
[43] L. R. Tucker. Implications of factor analysis of three-way matrices for measurement of change. In C.W. Harris (Ed.), Problems in Measuring Change, University of Wisconsin Press, pages 122–137, 1963.
[44] Yangyang Xu. On the convergence of higher-order orthogonal iteration. Linear and Multilinear Algebra, pages 2247–2265, 2017.
[45] Rafal Zdunek et al. Fast nonnegative matrix factorization algorithms using projected gradient approaches for large-scale problems. Intell. Neuroscience, 2008:3:1–3:13, 2008.
[46] Qibin Zhao, Liqing Zhang, and Andrzej Cichocki. Bayesian sparse Tucker models for dimension reduction and tensor completion. CoRR, 2015.
[47] Shandian Zhe, Yuan Qi, Youngja Park, Zenglin Xu, Ian Molloy, and Suresh Chari. DinTucker: Scaling up Gaussian process models on large multidimensional arrays. In AAAI, 2016.
[48] Shandian Zhe, Zenglin Xu, Xinqi Chu, Yuan Qi, and Youngja Park. Scalable nonparametric multiway data analysis. In AISTATS, 2015.
[49] Shandian Zhe, Kai Zhang, Pengyuan Wang, Kuang-chih Lee, Zenglin Xu, Yuan Qi, and Zoubin Ghahramani. Distributed flexible nonlinear tensor factorization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 928–936, 2016.
[50] Guoxu Zhou et al. Efficient nonnegative Tucker decompositions: Algorithms and uniqueness. IEEE Transactions on Image Processing, 24(12):4990–5003, 2015.
[51] Guoxu Zhou, Andrzej Cichocki, and Shengli Xie. Decomposition of big tensors with low multilinear rank. CoRR, 2014.