# Many-body Approximation for Non-negative Tensors

Kazu Ghalamkari (RIKEN AIP) kazu.ghalamkari@riken.jp; Mahito Sugiyama (National Institute of Informatics; SOKENDAI) mahito@nii.ac.jp; Yoshinobu Kawahara (Osaka University; RIKEN AIP) kawahara@ist.osaka-u.ac.jp

We present an alternative approach to decompose non-negative tensors, called many-body approximation. Traditional decomposition methods assume low-rankness in the representation, resulting in difficulties in global optimization and target rank selection. We avoid these problems by energy-based modeling of tensors, where a tensor and its mode correspond to a probability distribution and a random variable, respectively. Our model can be globally optimized in terms of the KL divergence minimization by taking into account the interaction between variables (that is, modes), which can be tuned more intuitively than ranks. Furthermore, we visualize interactions between modes as tensor networks and reveal a nontrivial relationship between many-body approximation and low-rank approximation. We demonstrate the effectiveness of our approach in tensor completion and approximation.

1 Introduction

Tensors are generalizations of vectors and matrices. Data in fields such as neuroscience [8], bioinformatics [22], signal processing [7], and computer vision [30] are often stored in the form of tensors, and features are extracted from them. Tensor decomposition and its non-negative version [33] are popular methods to extract features by approximating tensors by the sum of products of smaller tensors. These methods usually seek to minimize the reconstruction error, the difference between an original tensor and the tensor reconstructed from the obtained smaller tensors. In tensor decomposition approaches, a low-rank structure is typically assumed, where a given tensor is essentially represented by a linear combination of a small number of bases. Such decomposition requires two kinds of information: the structure and the number of bases used in the decomposition. The structure specifies the type of decomposition, such as CP decomposition [15] and Tucker decomposition [37]. Tensor networks, which were initially introduced in physics, have recently become popular in the machine learning community because they enable intuitive and flexible model design, including tensor train decomposition [29], tensor ring decomposition [41], and tensor tree decomposition [25]. Traditional tensor structures in physics, such as MERA [32] and PEPS [4], are also used in machine learning. The number of bases is often referred to as the rank. Larger ranks increase the capability of the model while increasing the computational cost, resulting in a tradeoff problem of rank tuning that requires careful treatment by the user. Since these low-rank structure-based decompositions via reconstruction error minimization are non-convex, which causes initial value dependence [17, Chapter 3], the problem of finding an appropriate setting of the low-rank structure is highly nontrivial in practice, as it is hard to locate the cause if the decomposition does not perform well. As a result, in order to find a proper structure and rank, the user must often perform decomposition multiple times with various settings, which is time- and memory-consuming.

Figure 1: Interaction representations corresponding to (a) second- and third-order energy, (b) two-body, and (c) three-body approximation.
Although there is an attempt to avoid rank tuning by approximating the rank with trace norms [2], it also requires another hyperparameter: the weights of the unfolded tensors. Hence, such an approach does not fundamentally solve the problem. Instead of the low-rank structure that has been the focus of attention in the past, in the present paper we propose a novel formulation of non-negative tensor decomposition, called many-body approximation, that focuses on the relationship among modes of tensors. The structure of the decomposition can be naturally determined based on the existence of interactions between modes. The proposed method requires only the decomposition structure and does not require a rank value, which traditional decomposition methods require as a hyperparameter and often struggle to determine.

To describe interactions between modes, we follow the standard strategy in statistical mechanics that uses an energy function H(·) to treat interactions and considers the corresponding distribution exp(−H(·)). This model is known as an energy-based model in machine learning [19] and is exploited in tensor decomposition as Legendre decomposition [36]. Technically, it parameterizes a tensor as a discrete probability distribution and reduces the number of parameters by enforcing some of them to be zero in optimization. We explore this energy-based approach further and discover the family of parameter sets that represent interactions between modes in the energy function H(·). How to choose non-zero parameters in Legendre decomposition has been an open problem; we start by addressing this problem and propose many-body approximation as a special case of Legendre decomposition. Moreover, although Legendre decomposition is not a factorization of tensors in general, our proposal always offers a factorization, which can reveal patterns in tensors. Since the advantage of Legendre decomposition is inherited by our proposal, many-body approximation can be achieved by convex optimization that globally minimizes the Kullback–Leibler (KL) divergence [18]. Furthermore, we introduce a way of representing mode interactions that visualizes the presence or absence of interactions between modes as a diagram. We discuss the relation to tensor networks and point out that an operation called coarse-graining transformation [20], in which multiple tensors are viewed as a new tensor, reveals an unexpected relationship between the proposed method and existing methods such as tensor ring and tensor tree decomposition. We summarize our contributions as follows:
- By focusing on the interaction between modes of tensors, we introduce a rank-free tensor decomposition, called many-body approximation. This decomposition is achieved by convex optimization.
- We present a way of describing many-body approximation, called interaction representation, which is a diagram that shows interactions within a tensor and can be transformed into tensor networks.
- Many-body approximation can perform tensor completion via the em-algorithm, which empirically shows better performance than existing low-rank-based methods.

2 Many-body Approximation for Tensors

Our proposal, many-body approximation, is based on the formulation of Legendre decomposition for tensors, which we first review in Section 2.1. We then introduce interactions between modes and their visual representation in Section 2.2, many-body approximation in Section 2.3, and the transformation of interaction representations into tensor networks in Section 2.4.
In the following discussion, we consider D-th order non-negative tensors of size (I_1, ..., I_D), and we denote [K] = {1, 2, ..., K} for a positive integer K. We assume that the sum of all elements in a given tensor is 1 for simplicity, while this assumption can be eliminated using the general property of the Kullback–Leibler (KL) divergence, λ D_KL(P, Q) = D_KL(λP, λQ), for any λ ∈ ℝ_{>0}.

2.1 Reminder of Legendre decomposition and its optimization

Legendre decomposition is a method for decomposing a non-negative tensor by regarding the tensor as a discrete distribution and representing it with a limited number of parameters. We describe a non-negative tensor P using parameters θ = (θ_{2,...,1}, ..., θ_{I_1,...,I_D}) and its energy function H as

P_{i_1,\dots,i_D} = \exp(-H_{i_1,\dots,i_D}), \qquad H_{i_1,\dots,i_D} = -\sum_{i'_1=1}^{i_1} \cdots \sum_{i'_D=1}^{i_D} \theta_{i'_1,\dots,i'_D},   (1)

where θ_{1,...,1} has the role of normalization. It is clear that the index set of a tensor corresponds to the sample space of a distribution, and the value of each entry of the tensor is regarded as the probability of realizing the corresponding index [35]. As we see in Equation (1), we can uniquely identify a tensor from the parameters θ. Inversely, we can compute θ from a given tensor as

\theta_{i_1,\dots,i_D} = \sum_{i'_1=1}^{i_1} \cdots \sum_{i'_D=1}^{i_D} \mu^{i'_1,\dots,i'_D}_{i_1,\dots,i_D} \log P_{i'_1,\dots,i'_D}   (2)

using the Möbius function μ : S × S → {−1, 0, +1}, where S is the set of indices. The Möbius function factorizes over the modes as

\mu^{i'_1,\dots,i'_D}_{i_1,\dots,i_D} = \prod_{d=1}^{D} \mu^{i'_d}_{i_d}, \qquad \mu^{i'_d}_{i_d} = \begin{cases} 1 & \text{if } i'_d = i_d, \\ -1 & \text{if } i'_d = i_d - 1, \\ 0 & \text{otherwise.} \end{cases}

Since the distribution described by Equation (1) belongs to the exponential family, θ corresponds to the natural parameters of the exponential family, and we can also identify each tensor by the expectation parameters η = (η_{2,...,1}, ..., η_{I_1,...,I_D}) using Möbius inversion as

\eta_{i_1,\dots,i_D} = \sum_{i'_1=i_1}^{I_1} \cdots \sum_{i'_D=i_D}^{I_D} P_{i'_1,\dots,i'_D}, \qquad P_{i_1,\dots,i_D} = \sum_{i'_1=i_1}^{I_1} \cdots \sum_{i'_D=i_D}^{I_D} \mu^{i_1,\dots,i_D}_{i'_1,\dots,i'_D}\, \eta_{i'_1,\dots,i'_D},   (3)

where η_{1,...,1} = 1 because of normalization. See Section A in the Supplement for examples of the above calculations. Since a distribution is determined by specifying either its θ-parameters or its η-parameters, they form two coordinate systems, called the θ-coordinate system and the η-coordinate system, respectively.

2.1.1 Optimization

Legendre decomposition approximates a tensor by setting some θ values to zero, which corresponds to dropping some parameters for regularization. It achieves convex optimization using the dual flatness of the θ- and η-coordinate systems. Let B be the set of indices of θ-parameters that are not imposed to be 0. Then Legendre decomposition coincides with a projection of a given non-negative tensor P onto the subspace B = {Q | θ_{i_1,...,i_D} = 0 if (i_1, ..., i_D) ∉ B for the θ-parameters of Q}. Let us consider the projection of a given tensor P onto B. The space of probability distributions, which is not a Euclidean space, is studied in information geometry. By taking the geometry of probability distributions into account, we can introduce the concept of flatness for a set of tensors. The subspace B is e-flat if the logarithmic combination, called the e-geodesic, {R_t | log R_t = (1−t) log Q_1 + t log Q_2 − ϕ(t), 0 < t < 1} of any two points Q_1, Q_2 ∈ B is included in the subspace B, where ϕ(t) is a normalizer. There is always a unique point \overline{P} on the e-flat subspace B that minimizes the KL divergence from any point P:

\overline{P} = \arg\min_{Q \in B} D_{KL}(P, Q).   (4)

This projection is called the m-projection. The m-projection onto an e-flat subspace is a convex optimization problem.
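To make the (θ, η) coordinates concrete, the following is a minimal numpy sketch for the case D = 2 with 0-based indexing; the variable names are ours, not from the authors' implementation. It computes θ via the closed form implied by Equations (1)–(2), computes η via Equation (3), and checks that the cumulative sum of θ recovers log P.

```python
import numpy as np

# Minimal sketch for D = 2 (0-based indexing): treat a normalized non-negative
# matrix P as a discrete distribution and compute its natural parameters theta
# (Eq. 2) and expectation parameters eta (Eq. 3).
rng = np.random.default_rng(0)
P = rng.random((3, 4))
P /= P.sum()                                  # normalize so the tensor sums to 1

log_pad = np.zeros((P.shape[0] + 1, P.shape[1] + 1))
log_pad[1:, 1:] = np.log(P)                   # zero border encodes log P_{0,j} = log P_{i,0} = 0

# theta_{i,j} = log P_{i,j} - log P_{i-1,j} - log P_{i,j-1} + log P_{i-1,j-1}
theta = log_pad[1:, 1:] - log_pad[:-1, 1:] - log_pad[1:, :-1] + log_pad[:-1, :-1]

# eta_{i,j} = sum of P_{i',j'} over i' >= i and j' >= j
eta = np.flip(np.cumsum(np.cumsum(np.flip(P), axis=0), axis=1))

# Sanity checks: the 2D cumulative sum of theta recovers log P (Eq. 1),
# and eta at the first index equals 1 by normalization.
assert np.allclose(np.cumsum(np.cumsum(theta, axis=0), axis=1), np.log(P))
assert np.isclose(eta[0, 0], 1.0)
```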
We define two vectors θ_B = (θ_b)_{b∈B} and η_B = (η_b)_{b∈B} and write the dimensionality of these vectors as |B|, since it is equal to the cardinality of B. The derivative of the KL divergence and the Hessian matrix G ∈ ℝ^{|B|×|B|} are given as

\nabla_{\theta_B} D_{KL}(P, Q) = \eta_B - \hat{\eta}_B, \qquad G_{u,v} = \eta_{\max(i_1,j_1),\dots,\max(i_D,j_D)} - \eta_{i_1,\dots,i_D}\,\eta_{j_1,\dots,j_D},   (5)

where η_B and \hat{η}_B are the expectation parameters of Q and P, respectively, and u = (i_1, ..., i_D), v = (j_1, ..., j_D) ∈ B. This matrix G is also known as the Fisher information matrix. Using gradient descent with the second-order derivative, we can update θ_B in each iteration t as

\theta_B^{t+1} = \theta_B^{t} - G^{-1}(\eta_B^{t} - \hat{\eta}_B).   (6)

The distribution Q_{t+1} is calculated from the updated natural parameters θ_{t+1}. This step finds a point Q_{t+1} ∈ B that is closer to the destination \overline{P} along the e-geodesic from Q_t. We can also calculate the expectation parameters η_{t+1} from the distribution. By repeating this process until convergence, we can always find the globally optimal solution satisfying Equation (4). See Section A in the Supplement for more detail on the optimization. The problem of how to design and interpret B has not been explored before; we are the first to interpret B as the presence or absence of particular interactions between modes, which naturally leads to a design guideline for B. The computational complexity of Legendre decomposition is O(γ|B|^3), where γ is the number of iterations.

2.2 Interaction and its representation of tensors

In this subsection, we introduce interactions between modes and their visual representation to prepare for many-body approximation. The following discussion enables us to intuitively describe relationships between modes and formulate our novel rank-free tensor decomposition. First, we introduce n-body parameters, a generalized concept of the one-body and two-body parameters in [11]. In any n-body parameter, there are n non-one indices; for example, θ_{1,2,1,1} is a one-body parameter, θ_{4,3,1,1} is a two-body parameter, and θ_{1,2,4,3} is a three-body parameter. We also use the following notation for n-body parameters:

\theta^{(k)}_{i_k} = \theta_{1,\dots,1,i_k,1,\dots,1}, \qquad \theta^{(k,m)}_{i_k,i_m} = \theta_{1,\dots,1,i_k,1,\dots,1,i_m,1,\dots,1}, \qquad \theta^{(k,m,p)}_{i_k,i_m,i_p} = \theta_{1,\dots,i_k,\dots,i_m,\dots,i_p,\dots,1},

for n = 1, 2, and 3, respectively. We write the energy function H with n-body parameters as

H_{i_1,\dots,i_D} = H_0 + \sum_{m} H^{(m)}_{i_m} + \sum_{k<m} H^{(k,m)}_{i_k,i_m} + \sum_{p<k<m} H^{(p,k,m)}_{i_p,i_k,i_m} + \dots + H^{(1,\dots,D)}_{i_1,\dots,i_D},   (7)

where the n-th order energy is introduced as

H^{(l_1,\dots,l_n)}_{i_{l_1},\dots,i_{l_n}} = -\sum_{i'_{l_1}=2}^{i_{l_1}} \cdots \sum_{i'_{l_n}=2}^{i_{l_n}} \theta^{(l_1,\dots,l_n)}_{i'_{l_1},\dots,i'_{l_n}}.   (8)

For simplicity, we suppose that 1 ≤ l_1 < l_2 < ... < l_n ≤ D holds. We set H_0 = −θ_{1,...,1}. We say that an n-body interaction exists between modes l_1, ..., l_n if there are indices i_{l_1}, ..., i_{l_n} satisfying H^{(l_1,...,l_n)}_{i_{l_1},...,i_{l_n}} ≠ 0. The first term, H_0 in Equation (7), is called the normalization factor or the partition function. The terms H^{(k)} are called the bias in machine learning and the magnetic field or self-energy in statistical physics. The terms H^{(k,m)} are called the weights of a Boltzmann machine in machine learning and two-body interactions or electron–electron interactions in physics. To visualize the existence of interactions within a tensor, we introduce a diagram called interaction representation, which is inspired by factor graphs in graphical modeling [3, Chapter 8]. The graphical representation of a product of tensors is widely known as a tensor network. However, displaying the relations between modes of a tensor as a factor graph is our novel approach.
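Before turning to the diagrams, the update rule of Equations (5)–(6) can be made concrete on a toy example. The sketch below, with our own helper names rather than the authors' implementation, runs the Newton-style update for a small matrix with B restricted to one-body parameters only, so the optimum should be the product of the marginals of P; it is a bare-bones illustration without damping or convergence safeguards.

```python
import numpy as np

# Sketch of the m-projection via the update in Eq. (6), for D = 2 and B = one-body
# parameters only (0-based indices; (0, 0) plays the role of the normalizer).
I1, I2 = 3, 4
rng = np.random.default_rng(1)
P = rng.random((I1, I2))
P /= P.sum()

B = [(i, 0) for i in range(1, I1)] + [(0, j) for j in range(1, I2)]

def tensor_from_theta(theta_B):
    """Build Q from theta restricted to B (Eq. 1); normalization fixes theta_{1,...,1}."""
    theta = np.zeros((I1, I2))
    for idx, val in zip(B, theta_B):
        theta[idx] = val
    Q = np.exp(np.cumsum(np.cumsum(theta, axis=0), axis=1))
    return Q / Q.sum()

def eta_of(Q):
    """Expectation parameters eta_{i,j} = sum of Q over indices >= (i, j) (Eq. 3)."""
    return np.flip(np.cumsum(np.cumsum(np.flip(Q), axis=0), axis=1))

eta_hat = eta_of(P)
eta_hat_B = np.array([eta_hat[b] for b in B])

theta_B = np.zeros(len(B))                    # start from the uniform distribution
for _ in range(100):
    eta = eta_of(tensor_from_theta(theta_B))
    eta_B = np.array([eta[b] for b in B])
    # Fisher information G_{u,v} = eta_{max(u,v)} - eta_u * eta_v   (Eq. 5)
    G = np.array([[eta[tuple(np.maximum(u, v))] - eta[u] * eta[v] for v in B] for u in B])
    theta_B = theta_B - np.linalg.solve(G, eta_B - eta_hat_B)     # Eq. (6)

Q = tensor_from_theta(theta_B)
# The one-body model is the independence model, so the m-projection of P is the
# outer product of its marginals.
assert np.allclose(Q, np.outer(P.sum(axis=1), P.sum(axis=0)), atol=1e-6)
```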
We represent an n-body interaction as a black square connected with n modes. We describe examples of the two-body interaction between modes (k, m) and the three-body interaction among modes (k, m, p) in Figure 1(a). Combining these interactions, the diagram for the energy function including all two-body interactions is shown in Figure 1(b), and all two-body and three-body interactions are included in Figure 1(c) for D = 4. This visualization allows us to intuitively understand the relationship between modes of tensors. For simplicity, we abbreviate one-body interactions in the diagrams, while we always assume them. Once an interaction representation is given, we can determine the corresponding decomposition of tensors. In Boltzmann machines, we usually consider binary (two-level) variables and their second-order interactions. In our proposal, we consider D multi-level variables, each of which can take a natural number from 1 to I_d for d ∈ [D]. Moreover, higher-order interactions among them are allowed. Therefore, our proposal is a multi-level extension of Boltzmann machines with higher-order interactions, where each node of the Boltzmann machine corresponds to a tensor mode. In the following section, we remove some of the n-body interactions, that is, we set H^{(l_1,...,l_n)}_{i_{l_1},...,i_{l_n}} = 0 by fixing each parameter θ^{(l_1,...,l_n)}_{i_{l_1},...,i_{l_n}} = 0 for all indices (i_{l_1}, ..., i_{l_n}) ∈ {2, ..., I_{l_1}} × ... × {2, ..., I_{l_n}}.

Figure 2: (a)(left) Interaction representation of an example of cyclic two-body approximation and (a)(right) its transformed tensor network for D = 4. Each tensor is enclosed by a square and each mode is enclosed by a circle. A black circle is a hyper-diagonal tensor. Edges between modes indicate the existence of interactions. (b) Tensor network of tensor ring decomposition. (c) The relationship among the solution spaces of tensor ring decomposition B_ring, two-body approximation B^{≤2}, and cyclic two-body approximation B_cyc.

2.3 Many-body approximation

We approximate a given tensor by assuming the existence of dominant interactions between the modes of the tensor and ignoring other interactions. Since this operation can be understood as enforcing some natural parameters of the distribution to be zero, it can be achieved by convex optimization through the theory of Legendre decomposition. We discuss how we can choose dominant interactions in Section 3.1. As an example, we consider two types of approximations of a non-negative tensor P by the tensors represented in Figure 1(b, c). If all energies greater than second order, or those greater than third order, in Equation (7) are ignored, that is, H^{(l_1,...,l_n)}_{i_{l_1},...,i_{l_n}} = 0 for n > 2 or n > 3, the tensor P_{i_1,i_2,i_3,i_4} is approximated as follows:

P_{i_1,i_2,i_3,i_4} \approx P^{\leq 2}_{i_1,i_2,i_3,i_4} = X^{(1,2)}_{i_1,i_2} X^{(1,3)}_{i_1,i_3} X^{(1,4)}_{i_1,i_4} X^{(2,3)}_{i_2,i_3} X^{(2,4)}_{i_2,i_4} X^{(3,4)}_{i_3,i_4},

P_{i_1,i_2,i_3,i_4} \approx P^{\leq 3}_{i_1,i_2,i_3,i_4} = \chi^{(1,2,3)}_{i_1,i_2,i_3} \chi^{(1,2,4)}_{i_1,i_2,i_4} \chi^{(1,3,4)}_{i_1,i_3,i_4} \chi^{(2,3,4)}_{i_2,i_3,i_4},

where each of the matrices and tensors on the right-hand side is represented as

X^{(k,m)}_{i_k,i_m} = \frac{1}{\sqrt[6]{Z}} \exp\left( -\frac{1}{3}H^{(k)}_{i_k} - H^{(k,m)}_{i_k,i_m} - \frac{1}{3}H^{(m)}_{i_m} \right),

\chi^{(k,m,p)}_{i_k,i_m,i_p} = \frac{1}{\sqrt[4]{Z}} \exp\left( -\frac{H^{(k)}_{i_k}}{3} - \frac{H^{(m)}_{i_m}}{3} - \frac{H^{(p)}_{i_p}}{3} - \frac{1}{2}H^{(k,m)}_{i_k,i_m} - \frac{1}{2}H^{(m,p)}_{i_m,i_p} - \frac{1}{2}H^{(k,p)}_{i_k,i_p} - H^{(k,m,p)}_{i_k,i_m,i_p} \right).

The partition function, or the normalization factor, is given as Z = exp(−θ_{1,1,1,1}), which does not depend on the indices (i_1, i_2, i_3, i_4). Each X^{(k,m)} (resp. χ^{(k,m,p)}) is a factorized representation of the relationship between the k-th and m-th (resp. k-th, m-th, and p-th) modes.
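As a small illustration of how the parameter set shrinks as interactions are removed, the sketch below (hypothetical helper names, 0-based indices standing in for the paper's 1-based ones) enumerates, for a given tensor shape, the θ-indices that are at most m-body, i.e., that have at most m coordinates different from the first index value.

```python
import itertools

# Count how many theta-parameters survive when only interactions up to order m
# are kept (the normalizer theta_{1,...,1} is excluded from the count).
def n_body(index):
    """An index is an n-body parameter when exactly n coordinates differ from 1 (0 here)."""
    return sum(1 for i in index if i != 0)

def m_body_indices(dims, m):
    all_indices = itertools.product(*(range(I) for I in dims))
    return [idx for idx in all_indices if 1 <= n_body(idx) <= m]

dims = (3, 3, 3, 3)
for m in (1, 2, 3, 4):
    print(m, len(m_body_indices(dims, m)))   # prints 8, 32, 64, 80 for a 3x3x3x3 tensor
```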
Although our model can be transformed into a linear model by taking the logarithm, our convex formulation enables us to find the optimal solution more stably than traditional linear low-rank-based non-convex approaches. Since we do not impose any low-rankness, factorized representations such as X^{(k,m)} and χ^{(k,m,p)} can be full-rank matrices or tensors. For a given tensor P, we define its m-body approximation P^{≤m} as the optimal solution of Equation (4), that is, P^{≤m} = \arg\min_{Q \in B^{≤m}} D_{KL}(P, Q), where the solution space B^{≤m} is given as

B^{≤m} = { Q | θ_{i_1,...,i_D} = 0 if θ_{i_1,...,i_D} is an n(>m)-body parameter, for the θ-parameters of Q }.

Note that P^{≤D} = P always holds. For any natural number m < D, it holds that B^{≤m} ⊆ B^{≤m+1}. Interestingly, the two-body approximation for a non-negative tensor with I_1 = ... = I_D = 2 is equivalent to approximating the empirical distribution with a fully connected Boltzmann machine.

In the above discussion, we consider many-body approximation with all the n-body parameters, while we can relax this constraint and allow the use of only a part of the n-body parameters in the approximation. Let us consider the situation where only one-body interactions and the two-body interactions between modes (d, d+1) exist for all d ∈ [D] (D+1 means 1 for simplicity). Figure 2(a) shows the interaction representation of the approximated tensor. As we can confirm by substituting 0 for H^{(k,l)}_{i_k,i_l} if l ≠ k+1, we can describe the approximated tensor as

P_{i_1,\dots,i_D} \approx P^{\mathrm{cyc}}_{i_1,\dots,i_D} = X^{(1)}_{i_1,i_2} X^{(2)}_{i_2,i_3} \cdots X^{(D)}_{i_D,i_1},   (9)

X^{(k)}_{i_k,i_{k+1}} = \frac{1}{\sqrt[D]{Z}} \exp\left( -\frac{1}{2}H^{(k)}_{i_k} - H^{(k,k+1)}_{i_k,i_{k+1}} - \frac{1}{2}H^{(k+1)}_{i_{k+1}} \right)

with the normalization factor Z = exp(−θ_{1,...,1}). When the tensor P is approximated by P^{cyc}, the set B contains only all one-body parameters and the two-body parameters θ^{(d,d+1)}_{i_d,i_{d+1}} for d ∈ [D]. We call this approximation cyclic two-body approximation since the order of indices in Equation (9) is cyclic. It holds that B_cyc ⊆ B^{≤2} for the solution space B_cyc of cyclic two-body approximation (Figure 2(c)).

2.4 Connection to tensor networks

Our tensor interaction representation is an undirected graph that focuses on the relationship between modes. By contrast, tensor networks, which are well known as diagrams that focus on the smaller tensors after decomposition, represent a tensor as an undirected graph whose nodes correspond to matrices or tensors and whose edges correspond to summation over a mode in tensor products [6]. We provide examples in our representation that can be converted to tensor networks, which implies a tight connection between our representation and tensor networks. For the conversion, we use a hyper-diagonal tensor I defined as I_{ijk} = δ_{ij}δ_{jk}δ_{ki}, where δ_{ij} = 1 if i = j and 0 otherwise. The tensor I is often represented by a black circle in tensor networks. In the community of tensor networks, the tensor I appears in the CNOT gate and as a special case of the Z spider [28]. The tensor network in Figure 2(a) represents Equation (9) for D = 4. A remarkable finding is that the converted tensor network representation of cyclic two-body approximation and the tensor network of tensor ring decomposition, shown in Figure 2(b), have a similar structure, despite their different modeling. If we consider the region enclosed by the dotted line in Figure 2(a) as a new tensor, the tensor network of the cyclic two-body approximation coincides with the tensor network of the tensor ring decomposition shown in Figure 2(b).
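The factorized form of Equation (9) is easy to reproduce numerically. The following sketch, with our own shapes and names, builds a cyclic two-body tensor for D = 4 from four non-negative factor matrices, which corresponds to the contraction pattern depicted in Figure 2(a).

```python
import numpy as np

# Eq. (9) for D = 4: the cyclic two-body tensor is an entrywise cyclic product of
# factor matrices X^(d) of shape (I_d, I_{d+1}), with indices wrapping around.
rng = np.random.default_rng(2)
I = (3, 4, 5, 2)
X1 = rng.random((I[0], I[1]))
X2 = rng.random((I[1], I[2]))
X3 = rng.random((I[2], I[3]))
X4 = rng.random((I[3], I[0]))

# P_cyc[i1, i2, i3, i4] = X1[i1, i2] * X2[i2, i3] * X3[i3, i4] * X4[i4, i1]
P_cyc = np.einsum('ab,bc,cd,da->abcd', X1, X2, X3, X4)
P_cyc /= P_cyc.sum()        # normalize so that the result is a distribution
print(P_cyc.shape)          # (3, 4, 5, 2)
```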
This operation, in which multiple tensors are regarded as a new tensor in a tensor network, is called coarse-graining transformation [9]. Formally, cyclic two-body approximation coincides with tensor ring decomposition with a specific constraint, as described below. Non-negative tensor ring decomposition approximates a given tensor P ∈ ℝ^{I_1×...×I_D}_{≥0} with D core tensors χ^{(1)}, ..., χ^{(D)}, where χ^{(d)} ∈ ℝ^{R_{d−1}×I_d×R_d}_{≥0}, as

P_{i_1,\dots,i_D} \approx \sum_{r_1=1}^{R_1} \cdots \sum_{r_D=1}^{R_D} \chi^{(1)}_{r_D,i_1,r_1} \chi^{(2)}_{r_1,i_2,r_2} \cdots \chi^{(D)}_{r_{D-1},i_D,r_D},   (10)

where (R_1, ..., R_D) is called the tensor ring rank. The cyclic two-body approximation also approximates the tensor P in the form of Equation (10), imposing the additional constraint that each core tensor χ^{(d)} is decomposed as

\chi^{(d)}_{r_{d-1},i_d,r_d} = \sum_{m_d=1}^{I_d} X^{(d)}_{r_{d-1},m_d} \, I_{m_d,i_d,r_d}   (11)

for each d ∈ [D]. We assume r_0 = r_D for simplicity. We obtain Equation (9) by substituting Equation (11) into Equation (10). This constraint enables us to perform convex optimization. This means that we have found a subclass B_cyc of tensor ring decomposition, which has suffered from the difficulty of non-convex optimization, that can be solved by convex optimization. In addition, this is simultaneously a subclass of two-body approximation, as shown in Figure 2(c). From the Kronecker delta δ, r_d = i_d holds in Equation (11); thus, χ^{(d)} is a tensor of size (I_{d−1}, I_d, I_d). The tensor ring rank after cyclic two-body approximation is (I_1, ..., I_D), since the size of the core tensors coincides with the tensor ring rank. This result reveals for the first time the relationship between Legendre decomposition and low-rank approximation via tensor networks.

Here we compare the number of parameters of cyclic two-body approximation with that of tensor ring decomposition. The number of elements of an input tensor is I_1 I_2 ... I_D. After cyclic two-body approximation, the number of parameters |B| is given as

|B| = 1 + \sum_{d=1}^{D} (I_d - 1) + \sum_{d=1}^{D} (I_d - 1)(I_{d+1} - 1),   (12)

where we assume I_{D+1} = I_1. The first term is for the normalizer, the second for the number of one-body parameters, and the final term for the number of two-body parameters. In contrast, in tensor ring decomposition with the target rank (R_1, ..., R_D), the number of parameters is \sum_{d=1}^{D} R_d I_d R_{d+1}. The ratio of the numbers of parameters of these two methods is proportional to I/R^2 if we assume R_d = R and I_d = I for all d ∈ [D] for simplicity. Therefore, when the target rank is small and the size of the input tensor is large, the proposed method has more parameters than tensor ring decomposition. The correspondence of many-body approximation to existing low-rank approximations is not limited to tensor ring decomposition. We also provide another example of the correspondence, for tensor tree decomposition, in Section C.2 in the Supplement.

2.5 Many-body approximation as a generalization of mean-field approximation

Any tensor P can be represented by vectors x^{(1)} ∈ ℝ^{I_1}, ..., x^{(D)} ∈ ℝ^{I_D} as P_{i_1,...,i_D} = x^{(1)}_{i_1} x^{(2)}_{i_2} ... x^{(D)}_{i_D} if and only if all n(≥2)-body θ-parameters are 0 [10, 12]. The right-hand side is equal to the Kronecker product of the D vectors x^{(1)}, ..., x^{(D)}; therefore, this approximation is equivalent to rank-1 approximation, since the rank of a tensor that can be represented by a Kronecker product is always 1, which is also known to correspond to mean-field approximation. In this study, we propose many-body approximation by relaxing the condition of the mean-field approximation that ignores all n(≥2)-body interactions. Therefore, many-body approximation is a generalization of rank-1 approximation and mean-field approximation.
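A quick way to see the crossover described above is to compute both counts. The helper below (hypothetical names) evaluates Equation (12) and the corresponding tensor ring parameter count for a uniform rank R.

```python
# Number of parameters of cyclic two-body approximation (Eq. 12) versus
# non-negative tensor ring decomposition with a uniform target rank R.
def params_cyclic_two_body(dims):
    D = len(dims)
    one_body = sum(I - 1 for I in dims)
    two_body = sum((dims[d] - 1) * (dims[(d + 1) % D] - 1) for d in range(D))
    return 1 + one_body + two_body            # "+1" accounts for the normalizer

def params_tensor_ring(dims, R):
    return sum(R * I * R for I in dims)

dims = (20,) * 6                              # example shape
print(params_cyclic_two_body(dims))           # 2281
print(params_tensor_ring(dims, 10))           # 12000
```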
The discussion above argues that the one-body approximation, that is, the rank-1 approximation that minimizes the KL divergence from a given tensor, is a convex problem. Although finding the rank-1 tensor that minimizes the Frobenius norm from a given tensor is known to be an NP-hard problem [14], it has been reported that finding the rank-1 tensor that minimizes the KL divergence from a given tensor is a convex problem [16]. Therefore, our claim is not inconsistent with those existing studies. It should also be noted that, except for the rank-1 case, there is no guarantee that traditional low-rank decomposition that optimizes the KL divergence is a convex problem [5, 13].

2.6 Low-body tensor completion

Since our proposal is an alternative to existing low-rank approximation, we can replace the latter with many-body approximation in practical applications. As a representative application, we demonstrate missing value estimation by the em-algorithm [26], where we replace the low-rank approximation in the m-step with many-body approximation. We call this approach low-body tensor completion (LBTC). The em-algorithm with tensor low-rank approximation does not guarantee the uniqueness of the m-step because the model space is not flat, while LBTC ensures that the model space is flat, and thus the m-step of LBTC is unique. See more discussion of the information geometry of LBTC in Section C.4. We provide its pseudo-code in Algorithm 1. Instead of the rank-related hyperparameters that traditional tensor completion methods require, LBTC requires only the information about which interactions to use, which can be tuned intuitively, as also seen in Section 3.1. We examine the performance of LBTC in Section 3.2.

Algorithm 1: Low-body tensor completion LBTC(T, B, Ω)   // Ω is the set of observed indices
  t ← 1
  initialize P_t   // e.g., a random tensor
  (P_t)_Ω ← T_Ω
  repeat
    P_{t+1} ← ManyBodyApproximation(P_t, B)   // m-step, see Algorithm 2
    (P_{t+1})_Ω ← T_Ω   // e-step
    res_t ← ‖P_{t+1} − P_t‖_F / ‖P_t‖_F
    t ← t + 1
  until res_t − res_{t−1} < ϵ and t > 2   // we set ϵ = 10^{-5} in our implementation
  return P_t

Table 1: Recovery fit score for tensor completion (mean ± standard error where applicable).

| Method | Scenario 1 | Scenario 2 | Scenario 3 |
| --- | --- | --- | --- |
| LBTC | 0.948 ± 0.00152 | 0.917 ± 0.000577 | 0.874 ± 0.00153 |
| HaLRTC | 0.864 | 0.842 | 0.825 |
| SiLRTC | 0.863 | 0.841 | 0.820 |
| SiLRTC-TT | 0.948 | 0.915 | 0.868 |
| PTRCRW | 0.945 ± 0.0000 | 0.901 ± 0.0000 | 0.844 ± 0.0000 |

3 Experiments

We conducted three experiments to examine the usefulness and effectiveness of many-body approximation. Datasets and implementation details are available in the Supplement.

3.1 An example of tuning interactions

We extracted 10 color images from the Columbia Object Image Library (COIL-100) [27] and constructed a tensor P ∈ ℝ^{40×40×3×10}_{≥0}, where the modes represent width, height, color, and image index, respectively. We then decomposed P with various choices of interactions and showed the reconstructed images. Figure 3(a) shows P^{≤4}, which coincides with the input tensor P itself since its order is four. Reconstruction quality gradually dropped as we reduced interactions from Figure 3(a) to Figure 3(d). We can explicitly control the trend of the reconstruction by manually choosing interactions, which is one of the characteristics of the proposal. In Figure 3(c), each reconstructed image has a monotone taste because the color does not directly interact with the pixel position (width and height). In Figure 3(d), although the whole tensor is monotone because the color is independent of any other modes, the reconstruction keeps the objects' shapes since the interaction among width, height, and image index is retained.
See Figure 7 in the Supplement for the role of this interaction. Interactions between modes are interpretable, in contrast to the rank of a tensor, which is often difficult to interpret. Therefore, tuning interactions is more intuitive than tuning ranks. We also visualize the interaction between color and image index as a heatmap of a matrix X^{(c,i)} ∈ ℝ^{3×10} in Figure 3(e), which is obtained in the case of Figure 3(c). We normalize each column of X^{(c,i)} with the L2 norm for better visualization. As expected, the indices corresponding to red objects (that is, 1, 4, and 7) have large values in the row corresponding to red. We also visualize the interaction among width, height, and image index, which corresponds to a tensor χ^{(w,h,i)} ∈ ℝ^{40×40×10} without any color information, as a series of heatmaps in Figure 3(f). They provide gray-scale reconstructions of the images. The above discussion implies that many-body approximation captures patterns or features of the input.

Figure 3: (a, b) Four-body and three-body approximation for the COIL dataset. (c, d) Appropriate selection of interactions controls the richness of color. Visualization of the interaction between (e) color and image index, and (f) width, height, and image index.

3.2 Tensor completion by many-body approximation

To examine the performance of LBTC, we conducted experiments on a real-world dataset: the traffic speed dataset of District 7, Los Angeles County, collected from PeMS [31]. This data was obtained from fixed-point observations of vehicle speeds every five minutes for 28 days. We constructed a 28 × 24 × 12 × 4 tensor T, whose modes correspond to date, hour, minute, and lane. As a practical situation, we assumed that a malfunction of a lane's speedometer causes random and continuous deficiencies. We prepared three patterns of missing cases: a case with a malfunctioning speedometer in Lane 1 (Scenario 1), a case with malfunctioning speedometers in Lanes 1, 2, and 3 (Scenario 2), and a case with malfunctions in all speedometers (Scenario 3). The missing rates for Scenarios 1, 2, and 3 were 9 percent, 27 percent, and 34 percent, respectively. We evaluate the recovery fit score 1 − ‖T_τ − T̄_τ‖_F / ‖T_τ‖_F for the reconstruction T̄ and the set of missing indices τ. In the proposal, we need to choose the interactions to use. Intuitively, the interaction between date and minute and that between minute and lane seem irrelevant. Therefore, we used all the two-body interactions except for these two. In addition, the interaction among date, hour, and minute is also included because it is intuitively relevant. As baseline methods, we use SiLRTC, HaLRTC [21], SiLRTC-TT [2], and PTRCRW [38]. SiLRTC-TT and PTRCRW are based on tensor train and tensor ring decompositions, respectively, and PTRCRW is known as the state of the art. The reconstructions are shown in Figure 8 in the Supplement together with the ground truth. We tuned the hyperparameters of the baseline methods and compared their best results with LBTC. The resulting scores (see Table 1) show that our method is superior to the other methods. Moreover, we can intuitively tune interactions in the proposal, while the baselines need typical hyperparameter tuning, as seen in Figure 14 in the Supplement. In our method, interactions among non-associated modes could worsen the performance, as seen in Table 2 in the Supplement.
Since the proposal and PTRCRW involve non-convex optimization, we ran each of them 10 times with random initialization and report mean values and standard errors. Many-body approximation itself is always convex, while our tensor completion algorithm LBTC is non-convex due to the use of the em-algorithm.

3.3 Comparison with ring decomposition

As seen in Section 2.4, many-body approximation has a close connection to low-rank approximation. For example, in tensor ring decomposition, if we impose the constraint that the decomposed smaller tensors can be represented as products with hyper-diagonal tensors I, the decomposition is equivalent to cyclic two-body approximation. Therefore, to examine our conjecture that cyclic two-body approximation has approximation capability comparable to tensor ring decomposition, we empirically compared the effectiveness of cyclic two-body approximation with tensor ring decomposition. As baselines, we used five existing methods for non-negative tensor ring decomposition: NTR-APG, NTR-HALS, NTR-MU, NTR-MM, and NTR-lraMM [39, 40]. We also report the efficiency in the Supplement. We evaluated the approximation performance by the relative error ‖T − T̄‖_F / ‖T‖_F for an input tensor T and a reconstructed tensor T̄. Since all the existing methods are based on non-convex optimization, we plotted the best score (minimum relative error) among five restarts with random initialization. In contrast, the score of our method is obtained by a single run, as it is convex optimization and such restarts are fundamentally unnecessary.

Figure 4: Experimental results for (a) synthetic data and (b) real datasets. The vertical red dotted line is |B| (see Equation (12)).

Synthetic data. We performed experiments on two synthetic datasets with low tensor ring rank. We created sixth- and seventh-order tensors with ring ranks (15, ..., 15) and (10, ..., 10), respectively. In Figure 4(a), relative errors are plotted while gradually increasing the target rank of the tensor ring decomposition, compared with the score of our method, which is plotted as the crossing point of the horizontal and vertical red dotted lines. Note that our method does not have the concept of rank, so its score is invariant to changes of the target rank, unlike the other methods. If the crossing point of the red dotted lines is lower than the other lines, the proposed method is better than the other methods. In both experiments, the proposed method is superior to the baselines in terms of reconstruction error.

Real data. Next, we evaluated our method on real data. TT_ChartRes and TT_Origami are seventh-order tensors produced from the Tokyo Tech Hyperspectral Image Dataset [23, 24]. Each tensor has been reshaped to reduce the computational complexity. As seen in Figure 4(b), the proposed method keeps competitive relative errors. For the baselines, a slight change of the target rank can induce a significant increase of the reconstruction error due to the non-convexity. These results mean that our convex formulation eliminates the instability of non-negative tensor ring decomposition.

4 Conclusion

We propose many-body approximation for tensors, which decomposes tensors by focusing on the relationships between modes represented by an energy-based model.
This method approximates and regularizes tensors by ignoring the energy corresponding to certain interactions, which can be viewed as a generalized formulation of mean-field approximation, which considers only one-body interactions. Our novel formulation enables us to achieve convex optimization of the model, while the existing approaches based on the low-rank structure are non-convex and empirically unstable with respect to the rank selection. Furthermore, we introduce a way of visualizing activated interactions between modes, called interaction representation. We have demonstrated the transformation between our representation and tensor networks, which reveals a nontrivial connection between many-body approximation and classical low-rank tensor decomposition. We have empirically shown the intuitive designability of the model and its usefulness in tensor completion and approximation tasks compared to baseline methods.

Limitation: The proposal works only on non-negative tensors. We have examined the effectiveness of LBTC only on traffic datasets.

Acknowledgments and Disclosure of Funding

This work was supported by RIKEN, Special Postdoctoral Researcher Program (KG), JST, CREST Grant Number JPMJCR22D3, Japan, and JSPS KAKENHI Grant Number JP21H03503 (MS), and JST, CREST Grant Number JPMJCR1913, Japan (YK).

[1] Amari, S. (2016). Information Geometry and Its Applications. Springer.
[2] Bengua, J. A., Phien, H. N., Tuan, H. D., and Do, M. N. (2017). Efficient tensor completion for color image and video recovery: Low-rank tensor train. IEEE Transactions on Image Processing, 26(5):2466–2479.
[3] Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern Recognition and Machine Learning, volume 4. Springer.
[4] Cheng, S., Wang, L., and Zhang, P. (2021). Supervised learning with projected entangled pair states. Physical Review B, 103(12):125117.
[5] Chi, E. C. and Kolda, T. G. (2012). On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications, 33(4):1272–1299.
[6] Cichocki, A., Lee, N., Oseledets, I., Phan, A.-H., Zhao, Q., Mandic, D. P., et al. (2016). Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions. Foundations and Trends in Machine Learning, 9(4-5):249–429.
[7] Cichocki, A., Mandic, D., De Lathauwer, L., Zhou, G., Zhao, Q., Caiafa, C., and Phan, H. A. (2015). Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Processing Magazine, 32(2):145–163.
[8] Erol, A. and Hunyadi, B. (2022). Tensors for neuroimaging: A review on applications of tensors to unravel the mysteries of the brain. Tensors for Data Processing, pages 427–482.
[9] Evenbly, G. and Vidal, G. (2015). Tensor network renormalization. Physical Review Letters, 115(18):180405.
[10] Ghalamkari, K. and Sugiyama, M. (2021). Fast Tucker rank reduction for non-negative tensors using mean-field approximation. In Advances in Neural Information Processing Systems, volume 34, pages 443–454, Virtual Event.
[11] Ghalamkari, K. and Sugiyama, M. (2022). Fast rank-1 NMF for missing data with KL divergence. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, pages 2927–2940, Virtual Event.
[12] Ghalamkari, K. and Sugiyama, M. (2023). Non-negative low-rank approximations for multidimensional arrays on statistical manifold. Information Geometry.
[13] Hansen, S., Plantenga, T., and Kolda, T. G. (2015).
Newton-based optimization for Kullback–Leibler nonnegative tensor factorizations. Optimization Methods and Software, 30(5):1002–1029.
[14] Hillar, C. J. and Lim, L.-H. (2013). Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):1–39.
[15] Hitchcock, F. L. (1927). The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189.
[16] Huang, K. and Sidiropoulos, N. D. (2017). Kullback–Leibler principal component for tensors is not NP-hard. In 2017 51st Asilomar Conference on Signals, Systems, and Computers, pages 693–697. IEEE.
[17] Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3):455–500.
[18] Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.
[19] LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F. (2006). A tutorial on energy-based learning. Predicting Structured Data, 1(0).
[20] Levin, M. and Nave, C. P. (2007). Tensor renormalization group approach to two-dimensional classical lattice models. Physical Review Letters, 99(12):120601.
[21] Liu, J., Musialski, P., Wonka, P., and Ye, J. (2013). Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208–220.
[22] Luo, Y., Wang, F., and Szolovits, P. (2017). Tensor factorization toward precision medicine. Briefings in Bioinformatics, 18(3):511–514.
[23] Monno, Y., Kiku, D., Tanaka, M., and Okutomi, M. (2017). Adaptive residual interpolation for color and multispectral image demosaicking. Sensors, 17(12):2787.
[24] Monno, Y., Kikuchi, S., Tanaka, M., and Okutomi, M. (2015). A practical one-shot multispectral imaging system using a single image sensor. IEEE Transactions on Image Processing, 24(10):3048–3059.
[25] Murg, V., Verstraete, F., Legeza, Ö., and Noack, R. M. (2010). Simulating strongly correlated quantum systems with tree tensor networks. Physical Review B, 82(20):205105.
[26] Narita, A., Hayashi, K., Tomioka, R., and Kashima, H. (2012). Tensor factorization using auxiliary information. Data Mining and Knowledge Discovery, 25:298–324.
[27] Nayar and Murase, H. (1996). Columbia Object Image Library: COIL-100. Technical Report CUCS-006-96, Department of Computer Science, Columbia University.
[28] Nielsen, M. A. and Chuang, I. L. (2010). Quantum Computation and Quantum Information: 10th Anniversary Edition. Cambridge University Press.
[29] Oseledets, I. V. (2011). Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317.
[30] Panagakis, Y., Kossaifi, J., Chrysos, G. G., Oldfield, J., Nicolaou, M. A., Anandkumar, A., and Zafeiriou, S. (2021). Tensor methods in computer vision and deep learning. Proceedings of the IEEE, 109(5):863–890.
[31] Ran, B., Tan, H., Feng, J., Liu, Y., and Wang, W. (2015). Traffic speed data imputation method based on tensor completion. Computational Intelligence and Neuroscience, 2015:22–22.
[32] Reyes, J. A. and Stoudenmire, E. M. (2021). Multi-scale tensor network architecture for machine learning. Machine Learning: Science and Technology, 2(3):035036.
[33] Shashua, A. and Hazan, T. (2005). Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 792–799, Bonn, Germany.
[34] Shcherbakova, E. and Tyrtyshnikov, E. (2020). Nonnegative tensor train factorizations and some applications.
In Large-Scale Scientific Computing: 12th International Conference, LSSC 2019, Sozopol, Bulgaria, June 10–14, 2019, Revised Selected Papers 12, pages 156–164. Springer.
[35] Sugiyama, M., Nakahara, H., and Tsuda, K. (2017). Tensor balancing on statistical manifold. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, pages 3270–3279, Sydney, Australia.
[36] Sugiyama, M., Nakahara, H., and Tsuda, K. (2018). Legendre decomposition for tensors. In Advances in Neural Information Processing Systems 31, pages 8825–8835, Montréal, Canada.
[37] Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311.
[38] Yu, J., Zhou, G., Li, C., Zhao, Q., and Xie, S. (2021a). Low tensor-ring rank completion by parallel matrix factorization. IEEE Transactions on Neural Networks and Learning Systems, 32(7):3020–3033.
[39] Yu, Y., Xie, K., Yu, J., Jiang, Q., and Xie, S. (2021b). Fast nonnegative tensor ring decomposition based on the modulus method and low-rank approximation. Science China Technological Sciences, 64(9):1843–1853.
[40] Yu, Y., Zhou, G., Zheng, N., Qiu, Y., Xie, S., and Zhao, Q. (2022). Graph-regularized nonnegative tensor-ring decomposition for multiway representation learning. IEEE Transactions on Cybernetics.
[41] Zhao, Q., Zhou, G., Xie, S., Zhang, L., and Cichocki, A. (2016). Tensor ring decomposition. arXiv preprint arXiv:1606.05535.

Table of Contents
A Projection theory in information geometry
B Formal definitions
C Theoretical remarks
  C.1 Convexity and uniqueness of many-body approximation
  C.2 Another example: Tensor tree decomposition
  C.3 Reducing computational complexity by reshaping tensors
  C.4 Information geometric view of LBTC
D Related work
E Additional numerical results
  E.1 Additional results for the COIL dataset
  E.2 Additional results for LBTC
  E.3 Additional results for the comparison with ring decomposition
  E.4 Comparison with traditional low-rank approximation
F Dataset details
  F.1 For the experiment with RGB images
  F.2 For the experiments with LRTC
  F.3 For the comparison with non-negative tensor ring decomposition
G Implementation detail
  G.1 Proposed method
  G.2 Baselines
References for Supplementary Materials

A Projection theory in information geometry

We explain the concepts of information geometry used in this study, including natural parameters, expectation parameters, model flatness, and convexity of optimization. In the following discussion, we consider only discrete probability distributions.

(θ, η)-coordinates and geodesics. In this study, we treat a normalized D-th order non-negative tensor P ∈ ℝ^{I_1×...×I_D}_{≥0} as a discrete probability distribution with D random variables. Let U be the set of discrete probability distributions with D random variables. The entire space U is a non-Euclidean space with the Fisher information matrix G as the metric.
This metric measures the distance between two points. In Euclidean space, the shortest path between two points is a straight line. In a non-Euclidean space, such a shortest path is called a geodesic. In the space U, two kinds of geodesics can be introduced: e-geodesics and m-geodesics. For two points P_1, P_2 ∈ U, the e- and m-geodesics are defined as

\{ R_t \mid \log R_t = (1-t) \log P_1 + t \log P_2 - \phi(t) \}, \qquad \{ R_t \mid R_t = (1-t) P_1 + t P_2 \},

respectively, where 0 ≤ t ≤ 1 and ϕ(t) is a normalization factor that keeps R_t a distribution. We can parameterize distributions P ∈ U by parameters known as natural parameters. In Equation (1), we described the relationship between a distribution P and a natural parameter vector θ = (θ_{2,...,1}, ..., θ_{I_1,...,I_D}). The natural parameter θ serves as a coordinate system of U, since any distribution in U is specified by determining θ. Furthermore, we can also specify a distribution P by its expectation parameter vector η = (η_{2,...,1}, ..., η_{I_1,...,I_D}), which corresponds to expected values of the distribution and provides an alternative coordinate system of U. The definition of the expectation parameters η is given in Equation (3). The θ-coordinates and η-coordinates are orthogonal to each other, which means that the Fisher information matrix G has the following property: G_{u,v} = ∂η_u/∂θ_v and (G^{-1})_{u,v} = ∂θ_u/∂η_v. The e- and m-geodesics can also be described using these parameters as follows:

\{ \theta_t \mid \theta_t = (1-t)\,\theta^{P_1} + t\,\theta^{P_2} \}, \qquad \{ \eta_t \mid \eta_t = (1-t)\,\eta^{P_1} + t\,\eta^{P_2} \},

where θ^P and η^P are the θ- and η-coordinates of a distribution P ∈ U.

Flatness and projections. A subspace is called e-flat when any e-geodesic connecting two points in the subspace is included in the subspace. The vertical descent of an m-geodesic from a point P ∈ U onto an e-flat subspace B_e is called m-projection. Similarly, e-projection is obtained when we replace all e with m and m with e. The flatness of subspaces guarantees the uniqueness of the projection destination. The destinations of the m- and e-projections onto B_e and B_m minimize the following KL divergences:

\overline{P} = \arg\min_{Q \in B_e} D_{KL}(P, Q), \qquad \widehat{P} = \arg\min_{Q \in B_m} D_{KL}(Q, P).

The KL divergence from a discrete distribution P ∈ U to Q ∈ U is given as

D_{KL}(P, Q) = \sum_{i_1=1}^{I_1} \cdots \sum_{i_D=1}^{I_D} P_{i_1,\dots,i_D} \log \frac{P_{i_1,\dots,i_D}}{Q_{i_1,\dots,i_D}}.   (13)

A subspace with some of its natural parameters fixed at 0 is e-flat [1, Chapter 2], which is obvious from the definition of e-flatness. The proposed many-body approximation performs m-projection onto the subspace B ⊆ U with some natural parameters fixed to be 0. From this linear constraint, we know that B is e-flat. Therefore, the optimal solution of many-body approximation is always unique. When a space is e-flat and m-flat at the same time, we say that the space is dually flat. The set of discrete probability distributions U is dually flat.

Natural gradient method. The e(m)-flatness guarantees that the cost functions to be optimized in Equation (13) are convex. Therefore, m(e)-projection onto an e(m)-flat subspace can be implemented by a gradient method using the second-order gradient. This second-order gradient method is known as the natural gradient method [42]. The Fisher information matrix G appears as the second-order derivative of the KL divergence (see Equation (5)). We can perform fast optimization with the update formula in Equation (6), which uses the inverse of the Fisher information matrix.

Examples for the Möbius function. In the proposed method, we need to transform between the distribution P ∈ ℝ^{I_1×...×I_D}, θ, and η using the Möbius function defined in Section 2.1. We provide examples here.
In Equation (2), the Möbius function is used to find the natural parameters θ from a distribution P. For example, for D = 2 and D = 3 it holds that

\theta_{i_1,i_2} = \log P_{i_1,i_2} - \log P_{i_1-1,i_2} - \log P_{i_1,i_2-1} + \log P_{i_1-1,i_2-1},

\theta_{i_1,i_2,i_3} = \log P_{i_1,i_2,i_3} - \log P_{i_1-1,i_2,i_3} - \log P_{i_1,i_2-1,i_3} - \log P_{i_1,i_2,i_3-1} + \log P_{i_1-1,i_2-1,i_3} + \log P_{i_1-1,i_2,i_3-1} + \log P_{i_1,i_2-1,i_3-1} - \log P_{i_1-1,i_2-1,i_3-1},

where we assume P_{0,i_2} = P_{i_1,0} = 1 and P_{i_1,i_2,0} = P_{i_1,0,i_3} = P_{0,i_2,i_3} = 1. To identify the value of θ_{i_1,...,i_D}, we only need P_{i'_1,...,i'_D} with (i'_1, ..., i'_D) ∈ {i_1−1, i_1} × {i_2−1, i_2} × ... × {i_D−1, i_D}. In the same way, using Equation (3), we can find the distribution P from the expectation parameters η. For example, for D = 2 and D = 3 it holds that

P_{i_1,i_2} = \eta_{i_1,i_2} - \eta_{i_1+1,i_2} - \eta_{i_1,i_2+1} + \eta_{i_1+1,i_2+1},

P_{i_1,i_2,i_3} = \eta_{i_1,i_2,i_3} - \eta_{i_1+1,i_2,i_3} - \eta_{i_1,i_2+1,i_3} - \eta_{i_1,i_2,i_3+1} + \eta_{i_1+1,i_2+1,i_3} + \eta_{i_1+1,i_2,i_3+1} + \eta_{i_1,i_2+1,i_3+1} - \eta_{i_1+1,i_2+1,i_3+1},

where we assume η_{I_1+1,i_2} = η_{i_1,I_2+1} = 0 and η_{I_1+1,i_2,i_3} = η_{i_1,I_2+1,i_3} = η_{i_1,i_2,I_3+1} = 0.

Figure 5: (a) Interaction representation corresponding to Equation (14) and its transformed tensor network for D = 9. (b) Tensor network of a variant of tensor tree decomposition.

B Formal definitions

We give formal definitions of Legendre decomposition [35] as follows.

Definition 1. For a normalized non-negative tensor P and a set of indices B, the Legendre decomposition of P is \overline{Q} defined as \overline{Q} = \arg\min_{Q \in B} D_{KL}(P, Q), where B = {Q | θ_{i_1,...,i_D} = 0 if (i_1, ..., i_D) ∉ B for the θ-parameters of Q}.

As in the following definition, many-body approximation is a particular case of Legendre decomposition.

Definition 2. For a normalized non-negative tensor P and a natural number m, we define its m-body approximation P^{≤m} as \arg\min_{Q \in B} D_{KL}(P, Q), where the index set B used in B contains all the indices of n(≤m)-body parameters.

C Theoretical remarks

C.1 Convexity and uniqueness of many-body approximation

It is widely known that the maximum likelihood estimation of ordinary Boltzmann machines without hidden variables is a convex problem. Since we optimize the KL divergence, the proposed many-body approximation is also a maximum likelihood estimation, which approximates a non-negative tensor, regarded as an empirical distribution, by an extended Boltzmann machine without hidden variables, as described in Section 2.2. As with ordinary Boltzmann machines, the maximum likelihood estimation of these extended Boltzmann machines can be globally optimized. Theorem 1 is a general proof of this using information geometry.

Theorem 1. The solution of many-body approximation is always unique, and the objective function of many-body approximation is convex.

Proof. As Definition 2 shows, the objective function of the proposed many-body approximation is the KL divergence from an empirical distribution (a given normalized tensor) to a subspace B. We can immediately prove that the subspace B is e-flat from the definition of flatness [1, Chapter 2] for any B. The KL divergence minimization from a given distribution onto an e-flat space is called m-projection (see the second paragraph in Section A). The m-projection onto an e-flat subspace is known to be convex, and the destination is unique (see the third paragraph in Section A). Thus, the optimal solution of many-body approximation is always unique, and the objective function is convex.

Figure 6: The em-algorithm. Dashed (solid) arrows are e-projections (m-projections).

C.2 Another example: Tensor tree decomposition

As seen in Section 2.4, cyclic two-body approximation coincides with tensor ring decomposition with a constraint.
In this section, we additionally provide another example of the correspondence between many-body approximation and existing low-rank approximations. For D = 9, let us consider three-body and two-body interactions within (i_1, i_2, i_3), (i_4, i_5, i_6), and (i_7, i_8, i_9), and a three-body interaction among (i_3, i_6, i_9). We provide the interaction representation of the target energy function in Figure 5(a). In this approximation, the decomposed tensor can be described as

P_{i_1,\dots,i_9} = A_{i_1,i_2,i_3} B_{i_4,i_5,i_6} C_{i_7,i_8,i_9} G_{i_3,i_6,i_9}.   (14)

Its converted tensor network is described in Figure 5(a). It becomes the tensor network of a tensor tree decomposition, shown in Figure 5(b), when the region enclosed by the dotted line is replaced with a new tensor (shown with a tilde) by coarse-graining. Such tensor tree decompositions are used in generative modeling [43], computational chemistry [48], and quantum many-body physics [50].

C.3 Reducing computational complexity by reshaping tensors

As described in Section 2.1.1, the computational complexity of many-body approximation is O(γ|B|^3), where γ is the number of iterations, because the overall complexity is dominated by the update of θ, which includes the matrix inversion of G, and the complexity of computing the inverse of an n × n matrix is O(n^3). This complexity can be reduced if we reshape the tensor so that the size of each mode becomes small. For example, let us consider a third-order tensor of size (J^2, J^2, J^2) and its cyclic two-body approximation. In this case, the time complexity is O(γJ^{12}) since |B| = O(J^4) (see Equation (12)). By contrast, if we reshape the input tensor into a sixth-order tensor of size (J, J, J, J, J, J), the time complexity becomes O(γJ^6) since |B| = O(J^2). This technique of reshaping a tensor into a higher-order tensor is widely used in practice in various methods based on tensor networks, such as tensor ring decomposition [47]. We use this technique in the experiment in Section 3.3 to handle large tensors in both the proposal and the baselines.

Figure 7: (a) Many-body approximation with the three-body interaction among width, height, and image index can reconstruct the shapes of each image. (b) Without the interaction, many-body approximation cannot reconstruct the shapes of each image.

C.4 Information geometric view of LBTC

Traditional methods for tensor completion are based on the following iteration, called the em-algorithm [51]:

e-step: P_Ω ← T_Ω,   m-step: P ← low-rank-approximation(P),

with an initialized tensor P, where Ω is the set of observed indices of a given tensor T with missing values. From the information geometric view, the m-step performs m-projection onto the model manifold B consisting of low-rank tensors, and the e-step performs e-projection onto the data manifold D = {P | P_Ω = T_Ω}. It is straightforward to prove that the e-step finds the point \arg\min_{Q \in D} D_{KL}(Q, P) for the tensor P obtained by the previous m-step [52]. The iteration of e- and m-steps finds the closest points between the two manifolds B and D. Since we can immediately prove that t Q_1 + (1−t) Q_2 belongs to D for any Q_1, Q_2 ∈ D and 0 ≤ t ≤ 1, the data manifold D is m-flat, which guarantees the uniqueness of the e-step. See the definition of m-flatness in Section A. However, since a low-rank subspace is not e-flat, the m-projection is not unique. LBTC conducts tensor completion by replacing the low-rank approximation in the m-step with many-body approximation. Since the model manifold of many-body approximation is always e-flat and the data manifold is m-flat, each step in the algorithm is unique.
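To make the e/m iteration concrete, here is a hedged numpy sketch of the LBTC loop of Algorithm 1. For brevity the m-step uses the closed-form one-body (mean-field) approximation, i.e., the product of mode-wise marginals, rather than a general many-body approximation, so this is only an illustration of the alternating projections, not the authors' implementation; all names are ours.

```python
import numpy as np

def one_body_approximation(P):
    """Closed-form one-body (rank-1 / mean-field) m-step: product of the marginals of P."""
    D, total = P.ndim, P.sum()
    Q = np.ones_like(P)
    for d in range(D):
        marginal = P.sum(axis=tuple(a for a in range(D) if a != d)) / total
        shape = [1] * D
        shape[d] = P.shape[d]
        Q = Q * marginal.reshape(shape)
    return Q * total

def lbtc(T, observed, eps=1e-5, max_iter=200):
    """e/m iteration of Algorithm 1 with the one-body m-step above."""
    P = np.random.default_rng(0).random(T.shape)
    P[observed] = T[observed]                  # initialize with the observed entries
    prev_res = None
    for _ in range(max_iter):
        P_new = one_body_approximation(P)      # m-step (many-body approximation)
        P_new[observed] = T[observed]          # e-step: restore observed entries
        res = np.linalg.norm(P_new - P) / np.linalg.norm(P)
        if prev_res is not None and abs(res - prev_res) < eps:
            return P_new
        prev_res, P = res, P_new
    return P

T = np.random.default_rng(3).random((6, 5, 4)) + 0.1       # toy positive tensor
observed = np.random.default_rng(4).random(T.shape) > 0.3  # roughly 70% observed
completed = lbtc(T, observed)
print(np.abs(completed[~observed] - T[~observed]).mean())  # error on the missing entries
```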
D Related work

As discussed in Section 1, conventional decompositions assume low-rankness in the structure of the tensor. However, since tensors have many modes, there are many kinds of low-rank structures. Well-known examples that are popularly used in machine learning include the CP decomposition [15], which decomposes a tensor into a sum of Kronecker products of vectors; the Tucker decomposition [37], which decomposes a tensor into a core tensor and multiple matrices; the tensor train decomposition [29], which decomposes a tensor into multiple core tensors; and the tensor ring decomposition [41], which assumes a periodic structure on top of the tensor train decomposition. However, in general, it is difficult to find out the low-rank structure of a given dataset. Therefore, it is difficult to say how to choose a low-rank decomposition model from those options. To alleviate this difficulty in model selection and to provide intuitive modeling, we have introduced the concept of the body and formulated a decomposition that does not rely on ranks.

E Additional numerical results

E.1 Additional results for the COIL dataset

In Section 3.1, we saw that interactions related to color control the color richness of the reconstruction. We provide additional experimental results in Figure 7 to examine the role of the interaction among width, height, and image index. The figure shows the influence of this interaction on the reconstruction. While Figure 7(b), without the interaction, could not capture the shapes of the images, Figure 7(a) reconstructs each shape properly. Based on these results, it is reasonable to conclude that this interaction is responsible for the shapes.

E.2 Additional results for LBTC

We provide the reconstructions from the experiments in Section 3.2 in Figure 8 together with the ground truth. In traditional low-rank approximation, tuning hyper-parameters based on prior knowledge is difficult because their semantics are unclear. By contrast, the interactions in many-body approximation can be tuned easily by considering the expected correlations among modes of a given tensor. To support this claim, we also confirm that LBTC scores improve when we activate interactions between modes that are expected to be correlated based on prior knowledge. In this experiment, we use the same datasets as in Section 3.2. We prepare two sets of interactions, Interactions A and B, as shown on the left in Figure 9. In Interaction A, we activate interactions among modes such as (date, hour) and (hour, minute, lane), which we expect to be correlated. In Interaction B, we activate interactions among modes such as (date, minute) and (minute, lane), which we do not expect to be correlated. For a fair comparison, we chose these interactions so that the numbers of parameters used in Interactions A and B are as close as possible: the number of parameters for Interaction A was 1847, and the number of parameters for Interaction B was 1619. As seen in Table 2, LBTC with Interaction A always has a better score than with Interaction B, which implies that many-body approximation performs better when we choose interactions among correlated modes. We conducted the experiment 10 times and report mean values with standard errors. We also visualize the reconstructions in Figure 9 to show that LBTC with Interaction A can complete tensors more accurately than with Interaction B.
As seen in Figure 14 and Section G.2, existing methods require hyper-parameters that are difficult to tune, since we usually cannot find any correspondence between these values and prior knowledge about the given tensor. In contrast, as demonstrated in this section, our energy-based model is intuitively tunable, which lets us skip time-consuming hyper-parameter tuning.

Table 2: Dependency of the recovery fit score of LBTC on the choice of interaction (mean ± standard error over 10 runs).
                 Scenario 1          Scenario 2           Scenario 3
Interaction A    0.900 ± 0.00125     0.889 ± 0.000632     0.864 ± 0.001264
Interaction B    0.820 ± 0.00143     0.819 ± 0.000699     0.805 ± 0.000527

Figure 8: Ground truth and reconstruction by LBTC in the experiment in Section 3.2. Filled areas are missing parts, blue lines are the values estimated by the proposal, and green lines are the ground truth (axes: day vs. speed (mph)).

Figure 9: Interaction dependency of LBTC. Filled areas are missing parts, and blue lines are the values estimated by the proposal with Interactions A and B, respectively. Green lines are the ground truth. Due to space limitations, we show the reconstruction of Lane 1 from day 9 to 20, where many missing parts exist.

E.3 Additional results for comparison with ring decomposition
We additionally provide Figure 10, which shows the running time comparison in the experiment in Section 3.3. As described in the main text, we repeat each baseline method five times with random initialization, while the proposal runs only once, and we compare their total running times. As seen in Figure 10, for both synthetic and real datasets, the proposed method is always faster than the baselines while keeping competitive relative errors. The flatness of the solution space enables many-body approximation to be solved with the Newton method, whereas NTR is solved using only first-order derivatives. The quadratic convergence of the Newton method is well known, and it requires only a few iterations to find the best solution in the flat solution space Bcyc. That is why the proposed method is fast and stable.

Figure 10: Comparison of the reconstruction capability (top) and computational speed (bottom) of cyclic two-body approximation and tensor ring decomposition for synthetic low tensor-ring-rank datasets (a) and real datasets (b). The horizontal axis is the number of parameters; the vertical axes are the relative error and the running time in seconds. The vertical red dotted line indicates |B| (see Equation (12)).

E.4 Comparison with traditional low-rank approximation
In low-rank approximation, we often gradually increase the rank to enlarge the model capability. In the same way, if we have no prior or domain knowledge about a tensor, we can gradually increase m of the m-body approximation to obtain a more accurate reconstruction. To demonstrate this, we conducted additional experiments with the same dataset as in Section 3.2. We compared the reconstruction ability of the proposed m-body approximation with non-negative Tucker decomposition (NTD) [44] and non-negative tensor train decomposition (NTTF) [34]. In many applications based on matrix and tensor factorization, the objective function is the fit score (RMSE), and methods try to find good representations for the task by optimizing this objective. Therefore, for a fair evaluation that does not depend on a specific application, we examine the performance by the fit score. For m-body approximation, we gradually increased the order of the interaction by changing m to 1, 2, and 3. We ran the proposal only once since it does not depend on initial values. For the baselines, we gradually enlarged the model capacity by increasing the target rank, and repeated each baseline 20 times with random initialization. We defined the Tucker and train ranks for the baseline methods in two ways. First, we assumed that all values in the rank are the same, that is, (r, . . . , r), and increased the value of r. However, since the size of the input tensor is (28, 24, 12, 4) and the mode sizes are not uniform, this can be an unnatural rank setting, which may cause a worse RMSE. To avoid that, we also adjusted the target rank by taking the tensor size into account. In particular, the rank is set to (1,1,1,1), (6,2,2,1), (6,5,2,1), (8,5,4,2), (10,7,4,2), (12,9,4,2), (16,12,6,2), (18,15,8,2), (20,18,11,3) for NTD and (1,1,1), (1,2,1), (1,2,2), (2,2,2), (4,2,2), (6,2,2), (6,3,2), (8,3,2), (10,3,2), (10,4,2), (15,4,2), (20,5,2), (20,8,2), (30,10,2), (30,15,8) for NTTF. We evaluated the approximation performance with the fit score $1 - \|\mathcal{P} - \widehat{\mathcal{P}}\|_F / \|\mathcal{P}\|_F$ for the input $\mathcal{P}$ and the reconstructed tensor $\widehat{\mathcal{P}}$. We show these results in Figure 11. While the baselines are often trapped in worse local minima, the proposal reconstructs the original tensor better. We also show the reconstructions by the proposed method as heatmaps in Figure 12. Even the one-body approximation is able to reconstruct the rush hour around 17:00. As the number of interactions increases, we can more clearly observe the trend that the busy hours on weekends differ from those on weekdays.

Figure 11: Fit scores of the proposals P1, Pcyc, P2, and P3 compared with the baselines NTD and NTTF (horizontal axis: number of parameters). (a) The ranks for the baselines are assumed to be (r, . . . , r) with varying r. (b) The ranks for the baselines are tuned taking the tensor size into account.

Figure 12: Heatmaps of the reconstructed tensors P1, Pcyc, P2, and P3 (one-body, cyclic two-body, two-body, and three-body approximation) and the input tensor P (axes: hour vs. speed (mph)). We visualize the traffic on only the first lane at 00 minutes every hour, which corresponds to P[:, :, 1, 1].
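For reference, a minimal Julia sketch of the fit score used in the comparison above; the function name is ours.

    using LinearAlgebra  # norm of an array is its entrywise 2-norm, i.e., the Frobenius norm

    # Fit score 1 - ||P - Q||_F / ||P||_F for an input tensor P and its reconstruction Q.
    fit_score(P, Q) = 1 - norm(P .- Q) / norm(P)

A fit score of 1 means perfect reconstruction; smaller values indicate a larger relative error.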
To see the effectiveness of the proposal for data mining, we also plotted the two-body interaction between day and lane extracted by the two-body approximation of the above traffic data. This interaction is a matrix $X^{(d,l)} \in \mathbb{R}^{28 \times 4}$. As seen in Figure 13, the value in the first column is always the largest and the value in the fourth column is always the smallest. From this result, we can interpret that the data corresponding to P[:, :, :, 1] were taken in the overtaking lane and the data for P[:, :, :, 4] were taken in the lane being overtaken.
F Dataset details
We describe the details of each dataset below.
F.1 For the experiment with RGB images
We downloaded the COIL-100 dataset from the official website.1 We used 10 images with object numbers 4, 5, 7, 10, 17, 23, 28, 38, 42, and 78. This selection aimed to include both monochromatic and multi-colored objects so that we can see how the proposed method changes the reconstruction when we deactivate the interactions related to color. We resized each image to 40 × 40.
F.2 For the experiments with LBTC
We downloaded traffic speed records in District 7, Los Angeles County, from PeMS.2 The data are in the public domain. The whole period of the data lasts for 28 days, from February 4 to March 3, 2013. By measuring the speed of vehicles in four lanes at a certain point on the highway 12 times at five-minute intervals, a 12 × 4 matrix is generated per hour. Twenty-four of these matrices are generated per day, so one day yields a 24 × 12 × 4 tensor. Since we used these tensors for 28 days, the size of the resulting tensor is 28 × 24 × 12 × 4. We artificially generated missing entries in this tensor. As practical situations, we assumed that a disorder of a lane's speedometer causes random and continuous deficiencies. We prepared three missing patterns: a disorder in the speedometer of Lane 1 (Scenario 1), disorders in the speedometers of Lanes 1, 2, and 3 (Scenario 2), and disorders in all speedometers (Scenario 3). The missing rates for Scenarios 1, 2, and 3 were 9 percent, 27 percent, and 34 percent, respectively. The source code for imparting the deficits is available in the Supplementary Materials.
F.3 For the comparison with non-negative tensor ring decomposition
Synthetic datasets. We created D core tensors of size (R, I, R) by sampling from a uniform distribution. A tensor of size $I^D$ with tensor ring rank (R, . . . , R) was then obtained as the product of these D core tensors. Results for R = 15, D = 5, I = 30 are shown in the left column of Figure 4, and those for R = 10, D = 6, I = 20 in the right column of Figure 4. For all experiments on synthetic datasets, we changed the target ring rank as (r, . . . , r) for r = 2, 3, . . . , 9 for the baseline methods.
Real datasets. TT_ChartRes is originally a (736, 736, 31) tensor, produced from the Tokyo Tech 31-band Hyperspectral Image Dataset. We downloaded ChartRes.mat from the official website3 and reshaped the tensor to (23, 8, 4, 23, 8, 4, 31). For the baseline methods, we chose the target ring rank as (2, 2, 2, 2, 2, 2, 2), (2, 2, 2, 2, 2, 2, 5), (2, 2, 2, 2, 2, 2, 8), (3, 2, 2, 3, 2, 2, 5), (2, 2, 2, 2, 2, 2, 9), (3, 2, 2, 3, 2, 2, 6), (4, 2, 2, 2, 2, 2, 6), (3, 2, 2, 4, 2, 2, 8), (3, 2, 2, 3, 2, 2, 9), (3, 2, 2, 3, 2, 2, 10), (3, 2, 2, 3, 2, 2, 12), (3, 2, 2, 3, 2, 2, 15), or (3, 2, 2, 3, 2, 2, 16). TT_Origami is originally a (512, 512, 59) tensor, produced from the Tokyo Tech 59-band Hyperspectral Image Dataset.
We downloaded Origami.mat from the official website.4 In TT_Origami, 0.0016 percent of the elements were negative; hence we preprocessed TT_Origami by subtracting its smallest value, −0.000764, from all elements to make them non-negative. We reshaped the tensor to (8, 8, 8, 8, 8, 8, 59). For the baseline methods, we chose the target ring rank as (2, 2, 2, 2, 2, 2, r) for r = 2, 3, . . . , 15. These reshapings reduce the computational complexity, as described in Section C.3, so that the experiment completes in a reasonable time.
1 https://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php
2 http://pems.dot.ca.gov/
3 http://www.ok.sc.e.titech.ac.jp/res/MSI/MSIdata31.html
4 http://www.ok.sc.e.titech.ac.jp/res/MSI/MSIdata59.html

Figure 13: The obtained interaction between day and lane in the two-body approximation (horizontal axis: day; vertical axis: interaction intensity; one line per lane).

G Implementation details
We describe the implementation details of the methods in the following. All methods are implemented in Julia 1.8.
Environment. Experiments were conducted on Ubuntu 20.04.1 with a single core of a 2.1 GHz Intel Xeon Gold 5218 CPU and 128 GB of memory.
G.1 Proposed method
Many-body approximation. We use a natural gradient method for many-body approximation. The natural gradient method uses the inverse of the Fisher information matrix to perform second-order optimization in a non-Euclidean space. For non-normalized tensors, we conducted the following procedure. First, we computed the total sum of the elements of the input tensor. Then, we normalized the tensor. After that, we conducted Legendre decomposition on the normalized tensor. Finally, we multiplied the result of the previous step by the total sum computed initially. This procedure is justified by the general property of the KL divergence $\lambda D_{KL}(\mathcal{P}, \mathcal{Q}) = D_{KL}(\lambda\mathcal{P}, \lambda\mathcal{Q})$ for any $\lambda \in \mathbb{R}_{>0}$. The termination criterion is the same as in the original implementation of Legendre decomposition by [36]; that is, the algorithm terminates if $\|\eta^B_t - \hat{\eta}^B\| < 10^{-5}$, where $\eta^B_t$ is the expectation parameter at the t-th step and $\hat{\eta}^B$ is the expectation parameter of the input tensor, both defined in Section 2.1.1. The overall procedure is described in Algorithm 2. Note that this algorithm is based on Legendre decomposition by [36].
LBTC. The overall procedure is described in Algorithm 1 with our tolerance threshold. We randomly initialized each missing element of P(t=1) from a Gaussian distribution with mean 50 and variance 10.
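The normalization procedure above can be summarized by the following minimal Julia sketch; legendre_decomposition is a placeholder name for the decomposition of a normalized tensor with interaction set B, not the actual function name in our implementation.

    # Many-body approximation of a non-normalized tensor (cf. Section G.1 and Algorithm 2)
    function manybody_approx_unnormalized(T, B)
        s = sum(T)                           # total sum of the input tensor
        P = T ./ s                           # normalize so that the entries sum to one
        Q = legendre_decomposition(P, B)     # decompose the normalized tensor
        return s .* Q                        # rescale; justified by λ·D_KL(P, Q) = D_KL(λP, λQ)
    end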
G.2 Baselines
Baseline methods for tensor completion. We implemented the baseline methods for tensor completion by translating the pseudocode in the original papers into Julia code. As baseline methods, we used HaLRTC, SiLRTC [21], SiLRTC-TT [2], and PTRC-RW [38]. SiLRTC and HaLRTC are well-known classical methods based on HOSVD. SiLRTC-TT is based on tensor train decomposition. PTRC-RW is based on tensor ring decomposition and is recognized as the state of the art. HaLRTC requires a real value ρ and a weight vector α. We used the default setting of α described in the original paper and tuned ρ in the range ρ = 10^{-5}, 10^{-4}, . . . , 10^2. SiLRTC requires hyperparameters τd for d = 1, 2, 3, 4, which are the thresholds of the truncated SVD for the mode-d matricization of the input tensor. Following the original paper, we assumed τ = τ1 = · · · = τ4 and tuned τ in the range {10^{-2}, 10^{-1}, . . . , 10^4}. SiLRTC-TT requires a weight vector α and a real number f. We set α to the default setting described in the original paper and tuned f in the range {0.001, 0.01, 0.025, 0.1, 0.25, 1}. PTRC-RW requires two weight vectors α and β, a step length d, and a tensor ring rank as hyper-parameters. Following the original paper, we used the default settings for α, β, and d. We then tuned the tensor ring rank (R, R, R, R) within the range R ∈ {1, 2, . . . , 6}, assuming that all its elements are equal, which is also assumed in the original paper.

Algorithm 2: Many-body approximation
  MANYBODYAPPROXIMATION(T, B)
    s ← total sum of T
    P ← T ./ s                             // ./ denotes element-wise division
    Compute η̂^B of P using Equation (3)
    Initialize θ^B_{t=1}                   // e.g., θ_b = 0 for all b ∈ B
    t ← 1
    repeat
      Compute Q_t from the current parameter θ^B_t with Equation (1)
      Compute η^B_t from Q_t using Equation (3)
      Compute the inverse of the Fisher information matrix G using Equation (5)
      θ^B_{t+1} ← θ^B_t − G^{-1}(η^B_t − η̂^B)
      t ← t + 1
    until ‖η^B_t − η̂^B‖ < ε                // we set ε = 10^{-5} in our implementation
    T ← Q_t .* s                           // .* denotes element-wise multiplication
    return T

Since PTRC-RW is not a convex method, we repeated it 10 times for each R with initial values randomly sampled from a Gaussian distribution with mean 50 and variance 10. We provide Figure 14, which shows the sensitivity of the tensor completion performance with respect to changes of the above hyper-parameters. All of the above baseline methods include iterative procedures. These iterations were stopped when the relative error between the previous and the current iteration was less than 10^{-5} or when the iteration count exceeded 1500.
Baseline methods for tensor ring decomposition. We implemented the baseline methods for tensor ring decomposition by translating the MATLAB code provided by the authors into Julia code for a fair comparison. As described in their original papers, NTR-APG, NTR-HALS, NTR-MU, NTR-MM, and NTR-lraMM have an inner and an outer loop to find a local solution. We repeated the inner loop 100 times and stopped the outer loop when the difference between the relative errors of the previous and the current iteration was less than 10^{-4}. NTR-MM and NTR-lraMM require a diagonal parameter matrix Ξ. We set Ξ = ωI, where I is the identity matrix and ω = 0.1. The NTR-lraMM method performs a low-rank approximation of the matrix obtained by mode expansion of the input tensor; the target rank was set to 20, which is the default setting in the provided code. The initial positions of the baseline methods were sampled from the uniform distribution on (0, 1).
Baseline methods for low-rank approximation. We implemented NTTF by translating the pseudocode in the original paper [34] into Julia code. The algorithm requires NMF in each iteration, for which we also implemented the most standard method, multiplicative update rules, following [46]. We used the default hyperparameter values of scikit-learn [49] for NMF. For NTD, we translated the TensorLy implementation [45] into Julia code, whose cost function is the Frobenius norm between input and output.
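For reference, the following is a minimal Julia sketch of the multiplicative update rules of [46], the standard NMF subroutine mentioned above; the naming and details are ours, not the code used in the experiments.

    # Multiplicative update rules for NMF [46]: find W, H >= 0 with V ≈ W * H.
    function nmf_mu(V::AbstractMatrix, r::Int; iters::Int = 200, δ::Float64 = 1e-12)
        n, m = size(V)
        W = rand(n, r)                             # random non-negative initialization
        H = rand(r, m)
        for _ in 1:iters
            H .*= (W' * V) ./ (W' * W * H .+ δ)    # update H; δ avoids division by zero
            W .*= (V * H') ./ (W * H * H' .+ δ)    # update W; non-negativity is preserved
        end
        return W, H
    end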
Figure 14: Hyper-parameter dependencies of the performance of the baselines (HaLRTC: ρ; SiLRTC: τ; SiLRTC-TT: f; PTRC-RW: rank) in the experiment in Section 3.2, for Scenarios 1, 2, and 3.

References for Supplementary Materials
[42] Amari, S. 1998. Natural gradient works efficiently in learning. Neural Computation 10(2):251–276.
[43] Cheng, S., Wang, L., Xiang, T. and Zhang, P. 2019. Tree tensor networks for generative modeling. Physical Review B 99(15):155131.
[44] Kim, Y.-D. and Choi, S. 2007. Nonnegative Tucker decomposition. In 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 1–8.
[45] Kossaifi, J., Panagakis, Y., Anandkumar, A. and Pantic, M. 2019. TensorLy: Tensor Learning in Python. Journal of Machine Learning Research 20(26):1–6. URL: http://jmlr.org/papers/v20/18-277.html
[46] Lee, D. D. and Seung, H. S. 2000. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13. Denver, United States, pp. 556–562.
[47] Malik, O. A. and Becker, S. 2021. A sampling-based method for tensor ring decomposition. In International Conference on Machine Learning. PMLR, pp. 7400–7411.
[48] Murg, V., Verstraete, F., Schneider, R., Nagy, P. R. and Legeza, Ö. 2015. Tree tensor network state with variable tensor order: An efficient multireference method for strongly correlated systems. Journal of Chemical Theory and Computation 11(3):1027–1036.
[49] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830.
[50] Shi, Y.-Y., Duan, L.-M. and Vidal, G. 2006. Classical simulation of quantum many-body systems with a tree tensor network. Physical Review A 74(2):022320.
[51] Walczak, B. and Massart, D. L. 2001. Dealing with missing data: Part I. Chemometrics and Intelligent Laboratory Systems 58(1):15–27. URL: https://www.sciencedirect.com/science/article/pii/S0169743901001319
[52] Zhang, S., Wang, W., Ford, J. and Makedon, F. 2006. Learning from incomplete ratings using non-negative matrix factorization. In Proceedings of the 2006 SIAM International Conference on Data Mining. SIAM, Bethesda, United States, pp. 549–553.