Nonnegative Tensor Completion via Integer Optimization

Caleb Xavier Bugg, Chen Chen, Anil Aswani
University of California, Berkeley, {caleb_bugg,aaswani}@berkeley.edu
The Ohio State University, chen.8018@osu.edu

Abstract

Unlike matrix completion, tensor completion does not have an algorithm that is known to achieve the information-theoretic sample complexity rate. This paper develops a new algorithm for the special case of completion for nonnegative tensors. We prove that our algorithm converges in a linear (in numerical tolerance) number of oracle steps, while achieving the information-theoretic rate. Our approach is to define a new norm for nonnegative tensors using the gauge of a particular 0-1 polytope; integer linear programming can, in turn, be used to solve linear separation problems over this polytope. We combine this insight with a variant of the Frank-Wolfe algorithm to construct our numerical algorithm, and we demonstrate its effectiveness and scalability through computational experiments using a laptop on tensors with up to one hundred million entries.

1 Introduction

Tensors generalize matrices. A tensor $\psi$ of order $p$ is $\psi \in \mathbb{R}^{r_1 \times \cdots \times r_p}$, where $r_i$ is the dimension of the tensor in the $i$-th index, for $i = 1, \ldots, p$. Though related, many problems that are polynomial-time solvable for matrices are NP-hard for tensors. For instance, it is NP-hard to compute the rank of a tensor (Hillar and Lim, 2013). Tensor versions of the spectral norm, nuclear norm, and singular value decomposition are also NP-hard to compute (Hillar and Lim, 2013; Friedland and Lim, 2014).

1.1 Past Approaches to Tensor Completion

Tensor completion is the problem of observing (possibly with noise) a subset of entries of a tensor and then estimating the remaining entries based on an assumption of low-rankness. The tensor completion problem is encountered in a number of important applications, including computer vision (Liu et al., 2012; Zhang et al., 2019), regression with only categorical variables (Aswani, 2016), healthcare (Gandy et al., 2011; Dauwels et al., 2011), and many other application domains (Song et al., 2019).

Although the special case of matrix completion is now well understood, the tensor version of this problem has an unresolved tension. To date, no tensor completion algorithm has been shown to achieve the information-theoretic sample complexity rate. Namely, for a tensor completion problem on a rank-$k$ tensor with sample size $n$, the information-theoretic rate for estimation error is $\sqrt{\sum_i r_i/n}$ (Gandy et al., 2011). In fact, evidence suggests a computational barrier in which no polynomial-time algorithm can achieve this rate (Barak and Moitra, 2016). One set of approaches has polynomial-time computation but requires exponentially more samples than the information-theoretic rate (Gandy et al., 2011; Mu et al., 2014; Barak and Moitra, 2016; Montanari and Sun, 2018), whereas another set of approaches achieves substantially closer to the information-theoretic rate but (in order to attain global minima) requires solving NP-hard problems that lack numerical algorithms (Chandrasekaran et al., 2012; Yuan and Zhang, 2016, 2017; Rauhut and Stojanac, 2021).

However, algorithms that achieve the information-theoretic rate have been developed for a few special cases of tensors. Completion of nonnegative rank-1 tensors can be written as a convex optimization problem (Aswani, 2016).
For symmetric orthogonal tensors, a variant of the Frank-Wolfe algorithm has been proposed (Rao et al., 2015). Though that paper does not prove its algorithm achieves the information-theoretic rate, this can be shown using standard techniques (Gandy et al., 2011). That work is closely related to the approach we take in this paper, with one of the key differences being the design of a different separation oracle in order to support a different class of tensors.

1.2 Contributions

This paper proposes a numerical algorithm for completion of nonnegative tensors that provably converges to a global minimum in a linear (in numerical tolerance) number of oracle steps, while achieving the information-theoretic rate. Nonnegative tensors are encountered in many applications. For instance, image and video data usually consist of nonnegative tensors (Liu et al., 2012; Zhang et al., 2019). Notably, the image demosaicing problem that must be solved by nearly every digital camera (Li et al., 2008) is an instance of a nonnegative tensor completion problem, though it has not previously been interpreted as such. Nonnegative tensor completion is also encountered in specific instances of recommender systems (Song et al., 2019), healthcare applications (Gandy et al., 2011; Dauwels et al., 2011), and statistical regression contexts (Aswani, 2016).

Our approach is to define a new norm for nonnegative tensors using the gauge of a specific 0-1 polytope that we construct. We prove this norm acts as a convex surrogate for rank and has low statistical complexity (as measured by the Rademacher average for the tensor, viewed as a function from a set of indices to the corresponding entry), but is NP-hard to approximate to an arbitrary accuracy. Importantly, we prove that the tensor completion problem using this norm achieves the information-theoretic rate in terms of sample complexity. However, the resulting tensor completion problem is NP-hard to solve. Nonetheless, because our new norm is defined using a 0-1 polytope, we can use integer linear optimization as a practical means to solve linear separation problems over the polytope. We combine this insight with a variant of the Frank-Wolfe algorithm to construct our numerical algorithm, and we demonstrate its effectiveness and scalability through experiments.

2 Norm for Nonnegative Tensors

Consider a tensor $\psi \in \mathbb{R}^{r_1 \times \cdots \times r_p}$. To refer to a specific entry in the tensor we use the notation $\psi_x := \psi_{x_1,\ldots,x_p}$, where $x = (x_1, \ldots, x_p)$, and $x_i \in [r_i]$ denotes the value of the $i$-th index, with $[s] := \{1, \ldots, s\}$. Let $\rho = \sum_i r_i$, $\pi = \prod_i r_i$, and $R = [r_1] \times \cdots \times [r_p]$. A nonnegative rank-1 tensor is $\psi_x = \prod_{k=1}^{p} \theta^{(k)}_{x_k}$, where $\theta^{(k)} \in \mathbb{R}^{r_k}_+$ are nonnegative vectors indexed by the different values of $x_k \in [r_k]$. To simplify notation, we drop the superscript in $\theta^{(k)}_{x_k}$ and write this as $\theta_{x_k}$ when clear from the context. For a nonnegative tensor $\psi$, its nonnegative rank is

$$\mathrm{rank}_+(\psi) = \min\Big\{q \;\Big|\; \psi = \textstyle\sum_{k=1}^{q} \psi_k,\ \psi_k \in B_\infty \text{ for } k \in [q]\Big\}, \qquad (1)$$

where we define the ball of nonnegative rank-1 tensors whose maximum entry is at most $\lambda \in \mathbb{R}_+$ to be

$$B_\lambda = \Big\{\psi : \psi_x = \lambda \textstyle\prod_{k=1}^{p} \theta_{x_k},\ \theta_{x_k} \in [0, 1],\ \text{for } x \in R\Big\} \qquad (2)$$

and $B_\infty = \lim_{\lambda \to \infty} B_\lambda$. A nonnegative CP decomposition is given by the summation $\psi = \sum_{k=1}^{\mathrm{rank}_+(\psi)} \psi_k$, where $\psi_k \in B_\infty$ for $k \in [\mathrm{rank}_+(\psi)]$.

2.1 Convex Hull of Nonnegative Rank-1 Tensors

Now for $\lambda \in \mathbb{R}_+$ consider the finite set of points:

$$S_\lambda = \Big\{\psi : \psi_x = \lambda \textstyle\prod_{k=1}^{p} \theta_{x_k},\ \theta_{x_k} \in \{0, 1\}\ \text{for } x \in R\Big\}. \qquad (3)$$
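To make these definitions concrete, here is a minimal NumPy sketch (our own illustration, not part of the paper's implementation; all names are ours) that builds a rank-1 nonnegative tensor from factor vectors, a vertex of $S_1$ from binary factors, and a low-rank nonnegative tensor as a sum of rank-1 terms, as in the nonnegative CP decomposition above.

```python
# Illustrative only: constructing members of B_1, S_1, and a nonnegative CP sum.
import numpy as np
from functools import reduce

def rank1(factors, lam=1.0):
    """Rank-1 tensor psi_x = lam * prod_k theta^(k)_{x_k} from factor vectors."""
    return lam * reduce(np.multiply.outer, factors)

rng = np.random.default_rng(0)
dims = (4, 3, 5)                                   # r_1, r_2, r_3 (order p = 3)

# An element of B_1: continuous factors theta^(k) in [0, 1]^{r_k}.
b_elem = rank1([rng.uniform(0.0, 1.0, r) for r in dims])

# A vertex of S_1: binary factors theta^(k) in {0, 1}^{r_k}; entries lie in {0, 1}.
s_vert = rank1([rng.integers(0, 2, r).astype(float) for r in dims])

# A nonnegative tensor of nonnegative rank at most q, written as a sum of
# rank-1 terms (a nonnegative CP decomposition with q components).
q = 5
psi = sum(rank1([rng.uniform(0.0, 1.0, r) for r in dims]) for _ in range(q))

assert b_elem.shape == s_vert.shape == psi.shape == dims
assert set(np.unique(s_vert)) <= {0.0, 1.0}        # S_1 vertices are 0-1 tensors
```

Since every vertex of $S_1$ is binary, the sets $S_\lambda$ scale exactly as stated below, and enumerating them is only possible for very small $\rho$.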
Using standard techniques from integer optimization (Hansen, 1979; Padberg, 1989), the set $S_\lambda$ can be rewritten as a set of linear constraints on binary variables:

$$S_\lambda = \Big\{\psi : \lambda(1 - p) + \lambda \textstyle\sum_{k=1}^{p} \theta_{x_k} \le \psi_x,\ 0 \le \psi_x \le \lambda\,\theta_{x_k},\ \theta_{x_k} \in \{0, 1\},\ \text{for } k \in [p],\ x \in R\Big\}.$$

Our first result is that the convex hulls of the set of points $S_\lambda$ and of the ball $B_\lambda$ are equivalent.

Proposition 2.1. We have the relation that $C_\lambda := \mathrm{conv}(B_\lambda) = \mathrm{conv}(S_\lambda)$.

Proof. We prove this by showing the two set inclusions $\mathrm{conv}(S_\lambda) \subseteq \mathrm{conv}(B_\lambda)$ and $\mathrm{conv}(B_\lambda) \subseteq \mathrm{conv}(S_\lambda)$. The first inclusion is immediate since by definition we have $S_\lambda \subseteq B_\lambda$, and so we focus on proving the second inclusion. We prove this by contradiction: Suppose $\mathrm{conv}(B_\lambda) \not\subseteq \mathrm{conv}(S_\lambda)$. Then there exists a tensor $\psi' \in B_\lambda$ with $\psi' \notin \mathrm{conv}(S_\lambda)$. By the hyperplane separation theorem, there exist $\phi \in \mathbb{R}^{r_1 \times \cdots \times r_p}$ and $\delta > 0$ such that $\langle \phi, \psi' \rangle \ge \langle \phi, \psi \rangle + \delta$ for all $\psi \in \mathrm{conv}(S_\lambda)$, where $\langle \cdot, \cdot \rangle$ is the usual inner product that is defined as the summation of elementwise multiplication. Now consider the multilinear optimization problem

$$\begin{array}{ll} \max & \langle \phi, \psi \rangle \\ \text{s.t.} & \psi_x = \lambda \prod_{k=1}^{p} \theta_{x_k}, \ \text{for } x \in R \\ & \theta_{x_k} \in [0, 1], \ \text{for } x \in R \end{array} \qquad (4)$$

Proposition 2.1 of (Drenick, 1992) shows there exists a global optimum $\psi^*$ of (4) with $\psi^* \in S_\lambda$. By construction, we must have $\langle \phi, \psi^* \rangle \ge \langle \phi, \psi' \rangle$, which implies $\langle \phi, \psi^* \rangle \ge \langle \phi, \psi \rangle + \delta$ for all $\psi \in \mathrm{conv}(S_\lambda)$. But this last statement is a contradiction since $\psi^* \in S_\lambda \subseteq \mathrm{conv}(S_\lambda)$.

Remark 2.2. This result has two useful implications. First, $C_\lambda$ is a polytope since it is the convex hull of a finite number of bounded points. Second, the elements of $S_\lambda$ are the vertices of $C_\lambda$, since any individual element cannot be written as a convex combination of the other elements. We will call the set $C_\lambda$ the nonnegative tensor polytope. A useful observation is that the following relationships hold: $B_\lambda = \lambda B_1$, $S_\lambda = \lambda S_1$, and $C_\lambda = \lambda C_1$.

2.2 Constructing a Norm for Nonnegative Tensors

Since the set of nonnegative tensors forms a cone (Qi et al., 2014), we need to use a modified definition of a norm. A norm on a cone $K$ is a function $p : K \to \mathbb{R}_+$ such that for all $x, y \in K$ the function has the following three properties: $p(x) = 0$ if and only if $x = 0$; $p(\gamma x) = \gamma\, p(x)$ for $\gamma \in \mathbb{R}_+$; and $p(x + y) \le p(x) + p(y)$. The difference with the usual norm definition is subtle: The second property here is required to hold for $\gamma \in \mathbb{R}_+$, whereas in the usual norm definition we require $p(\gamma x) = |\gamma|\, p(x)$ for all $\gamma \in \mathbb{R}$.

We next use $C_\lambda$ to construct a new norm for nonnegative tensors using a gauge (or Minkowski functional) construction. Though constructing norms using a gauge is common in machine learning (Chandrasekaran et al., 2012), the convex sets used in these constructions are symmetric about the origin. Symmetry guarantees that scaling the ball eventually includes the entire space (Rockafellar and Wets, 2009). However, in our case $C_\lambda$ is not symmetric about the origin, and so without proof we do not a priori know whether scaling $C_1$ eventually includes the entire space of nonnegative tensors. Thus we have to explicitly prove the gauge is a norm.

Proposition 2.3. The function defined as

$$\|\psi\|_+ := \inf\{\lambda \ge 0 \mid \psi \in \lambda C_1\} \qquad (5)$$

is a norm for nonnegative tensors $\psi \in \mathbb{R}^{r_1 \times \cdots \times r_p}_+$.

Proof. We first prove the above function is finite. Consider any nonnegative tensor $\psi \in \mathbb{R}^{r_1 \times \cdots \times r_p}_+$, and note there exists a decomposition $\psi = \sum_{k=1}^{\pi} \psi_k$ with $\psi_k \in \|\psi\|_{\max} B_1$ (Qi et al., 2016). (This holds because we can choose each $\psi_k$ to have a single non-zero value that corresponds to a different entry of $\psi$.) Hence by Proposition 2.1, $\psi_k \in \|\psi\|_{\max} C_1$. Recalling the decomposition of $\psi$, this means $\psi \in \pi \|\psi\|_{\max} C_1$, which by definition means $\|\psi\|_+ \le \pi \|\psi\|_{\max}$. Thus $\|\psi\|_+$ must be finite.
Next we verify that the three properties of a norm on a cone are satisfied. To do so, we first observe that by definition: $C_1$ is convex; $0 \in C_1$; and $C_1$ is closed and bounded. Thus by Example 3.50 of (Rockafellar and Wets, 2009) we have $\{\psi : \|\psi\|_+ = 0\} = \{0\}$, which means the first norm property holds. Also by Example 3.50 of (Rockafellar and Wets, 2009) we have $\lambda C_1 \subseteq \lambda' C_1$ for all $0 \le \lambda \le \lambda'$, which means the second norm property holds. Last, note Example 3.50 of (Rockafellar and Wets, 2009) establishes sublinearity of the gauge, and so the third norm property holds.

2.3 Convex Surrogate for Nonnegative Tensor Rank

An important property of our norm $\|\psi\|_+$ is that it can be used as a convex surrogate for tensor rank, whereas the max and Frobenius norms cannot.

Proposition 2.4. If $\psi$ is a nonnegative tensor, then we have $\|\psi\|_{\max} \le \|\psi\|_+ \le \mathrm{rank}_+(\psi)\, \|\psi\|_{\max}$. If $\mathrm{rank}_+(\psi) = 1$, then $\|\psi\|_+ = \|\psi\|_{\max}$.

Proof. Consider any $\lambda \ge 0$. If $\psi \in S_\lambda$, then by definition $\|\psi\|_{\max} \le \lambda$. By the convexity of norms we have that: if $\psi \in C_\lambda$, then $\|\psi\|_{\max} \le \lambda$. This means that for all $\lambda \ge 0$ we have $C_\lambda \subseteq U_\lambda := \{\psi : \|\psi\|_{\max} \le \lambda\}$. Thus we have the relation $\inf\{\lambda \ge 0 \mid \psi \in \lambda U_1\} \le \inf\{\lambda \ge 0 \mid \psi \in \lambda C_1\}$. But the left side is $\|\psi\|_{\max}$ and the right side is $\|\psi\|_+$. This proves the left side of the inequality.

Next consider the case where $\mathrm{rank}_+(\psi) = 1$. By definition we have $\psi \in \|\psi\|_{\max} B_1$. Thus $\psi \in \|\psi\|_{\max} C_1$, and so $\|\psi\|_+ \le \|\psi\|_{\max}$. This proves the rank-1 case.

Last we prove the right side of the desired inequality. Consider any nonnegative tensor $\psi$. Recall its nonnegative CP decomposition $\psi = \sum_{k=1}^{\mathrm{rank}_+(\psi)} \psi_k$ with $\psi_k \in B_\infty$. The triangle inequality gives

$$\|\psi\|_+ \le \textstyle\sum_{k=1}^{\mathrm{rank}_+(\psi)} \|\psi_k\|_+ = \sum_{k=1}^{\mathrm{rank}_+(\psi)} \|\psi_k\|_{\max}, \qquad (6)$$

where the equality follows from the above-proved rank-1 case since $\mathrm{rank}_+(\psi_k) = 1$ by definition of the CP decomposition. But $\|\psi_k\|_{\max} \le \|\psi\|_{\max}$ since the tensors are nonnegative.

Remark 2.5. These bounds are tight. The lower and upper bounds are achieved by all nonnegative rank-1 tensors. The identity matrix with $k$ columns achieves the upper bound with $\mathrm{rank}_+(\psi) = k$.

3 Measures of Norm Complexity

3.1 Computational Complexity

We next characterize the computational complexity of our new norm on nonnegative tensors.

Proposition 3.1. It is NP-hard to approximate the nonnegative tensor norm $\|\cdot\|_+$ to arbitrary accuracy.

Proof. The dual norm is $\|\phi\|_+^* = \sup\{\langle \phi, \psi \rangle \mid \|\psi\|_+ \le 1\}$ for all tensors $\phi \in \mathbb{R}^{r_1 \times \cdots \times r_p}$. Theorems 3 and 10 of (Friedland and Lim, 2016) show that approximation of the dual norm is polynomial-time reducible to approximation of the norm $\|\cdot\|_+$. We proceed by showing a polynomial-time reduction of MAX CUT to approximation of $\|\cdot\|_+^*$. Without loss of generality assume $p = 2$ and $d := r_1 = r_2$. The appendix of (Witsenhausen, 1986) proves that MAX CUT is polynomial-time reducible to $\sup\{\langle \phi, \psi \rangle \mid \psi \in S_1\}$. However, since the objective function is linear and the set $S_1$ consists of the vertices of $C_1$, this optimization problem is equivalent to $\sup\{\langle \phi, \psi \rangle \mid \psi \in C_1\}$. This is the desired polynomial-time reduction of MAX CUT to approximation of the dual norm because $C_1 = \{\psi : \|\psi\|_+ \le 1\}$. The result now follows by recalling that approximately solving MAX CUT to arbitrary accuracy is NP-hard (Papadimitriou and Yannakakis, 1991; Arora et al., 1998).

Corollary 3.2. Given $K \in \mathbb{R}_+$ and $\psi \in \mathbb{R}^{r_1 \times \cdots \times r_p}_+$, it is NP-complete to determine whether $\|\psi\|_+ \le K$.

Proof. The problem is NP-hard by reduction from the problem of Proposition 3.1. In particular, a binary search can approximate the norm using this decision problem, $(\|\psi\|_+ \le K)$?, as an oracle. By Proposition 2.4, the search can be initialized over the interval $[0, \pi \|\psi\|_{\max}]$.
Furthermore, the decision problem is in NP because we can use the (polynomial-sized) $\theta$ from (2) as a certificate.

3.2 Stochastic Complexity of Norm

We next show that our norm $\|\psi\|_+$ has low stochastic complexity. Let $X = \{x^1, \ldots, x^n\}$, and suppose the $\sigma_i$ are independent and identically distributed (i.i.d.) Rademacher random variables (i.e., $\sigma_i = \pm 1$ each with probability $\tfrac{1}{2}$) (Bartlett and Mendelson, 2002; Srebro et al., 2010). The Rademacher complexity for a set of functions $H$ is $R(H) = \mathbb{E}\big(\sup_{h \in H} \tfrac{1}{n}\big|\sum_{i=1}^{n} \sigma_i\, h(x^i)\big|\big)$, and the worst-case Rademacher complexity of $H$ is $W(H) = \sup_X \mathbb{E}_\sigma\big(\sup_{h \in H} \tfrac{1}{n}\big|\sum_{i=1}^{n} \sigma_i\, h(x^i)\big|\big)$. These notions can be used to measure the complexity of sets of matrices (Srebro and Shraibman, 2005) or tensors (Aswani, 2016). The idea is to interpret each nonnegative tensor as a function $\psi : R \to \mathbb{R}_+$ from a set of indices $x \in R$ to the corresponding entry of the tensor $\psi_x$. This complexity notion is useful for the completion problem because it can be directly translated into generalization bounds.

Proposition 3.3. We have $R(C_\lambda) \le W(C_\lambda) \le 2\lambda\sqrt{\rho/n}$.

Proof. First note that from their definitions we get $R(C_\lambda) \le W(C_\lambda)$. Next define the set $P_\lambda = \{\pm\psi : \psi \in S_\lambda\}$, and recall that $C_\lambda = \mathrm{conv}(S_\lambda)$ by Proposition 2.1. This means $W(C_\lambda) = W(S_\lambda)$ (Ledoux and Talagrand, 1991; Bartlett and Mendelson, 2002). Next observe that

$$W(C_\lambda) = W(S_\lambda) = \sup_X \mathbb{E}_\sigma \sup_{\psi \in S_\lambda} \tfrac{1}{n}\Big|\textstyle\sum_{i=1}^{n} \sigma_i\, \psi_{x^i}\Big| = \sup_X \mathbb{E}_\sigma \max_{\psi \in P_\lambda} \tfrac{1}{n}\textstyle\sum_{i=1}^{n} \sigma_i\, \psi_{x^i} \le \sup_X \tfrac{r\sqrt{2 \log \#P_\lambda}}{n}, \qquad (7)$$

where we replaced the supremum with a maximum, since the set $P_\lambda$ is finite, and used the Finite Class Lemma (Massart, 2000) with $r = \max_{\psi \in P_\lambda} \sqrt{\sum_{i=1}^{n} (\psi_{x^i})^2} \le \lambda\sqrt{n}$. This inequality on $r$ is due to the fact that $P_\lambda$ consists of tensors whose entries are from $\{-\lambda, 0, \lambda\}$. Thus $W(C_\lambda) \le \lambda\sqrt{(2\log 2)(\rho+1)/n}$. The result follows by noting that $(\log 2)(\rho + 1) \le 2\rho$.

Remark 3.4. Because $\psi$ has $\pi = O(r^p)$ entries, the Rademacher complexity in a typical tensor norm (e.g., max and Frobenius norms) will be $O(\sqrt{\pi/n}) = O(\sqrt{r^p/n})$. However, the Rademacher complexity in our norm $\|\psi\|_+$ is $O(\sqrt{\rho/n}) = O(\sqrt{rp/n})$, which is exponentially smaller.

4 Tensor Completion

Suppose we have data $(x^i, y^i) \in R \times \mathbb{R}$ for $i = 1, \ldots, n$. Let $I = \{i_1, \ldots, i_u\} \subseteq [n]$ be any set of points that specify all the unique $x^i$ for $i = 1, \ldots, n$, meaning the set $U = \{x^{i_1}, \ldots, x^{i_u}\}$ does not have any repeated points and $x^i \in U$ for all $i = 1, \ldots, n$. The nonnegative tensor completion problem using our norm $\|\psi\|_+$ is given by

$$\hat\psi \in \arg\min_{\psi}\ \tfrac{1}{n}\textstyle\sum_{i=1}^{n} \big(y^i - \psi_{x^i}\big)^2 \quad \text{s.t.}\ \|\psi\|_+ \le \lambda. \qquad (8)$$

We will study the statistical properties of the above estimate and describe the elements of an algorithm to solve the above optimization problem. The purpose of defining the set $U$ is that it is used in constructing the algorithm used to solve the above problem.

4.1 Statistical Guarantees

Though in the previous section we calculated a Rademacher complexity for nonnegative tensors in $C_\lambda$ viewed as functions, here we use an alternative approach to derive generalization bounds. The reason is that generalization bounds using the Rademacher complexity are not tight here. Our approach is based on the observation that the nonnegative tensor completion problem (8) using our norm $\|\psi\|_+$ is equivalent to a convex aggregation problem (Nemirovski, 2000; Tsybakov, 2003; Lecué et al., 2013) for a finite set of functions. In particular, by Proposition 2.1 we have $\{\psi : \|\psi\|_+ \le \lambda\} = C_\lambda = \mathrm{conv}(S_\lambda)$. The implication is we can directly apply existing results for convex aggregation to provide a tight generalization bound for the solution of (8).
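Before stating the resulting guarantee, the convex-aggregation view can be checked directly on a tiny instance. The sketch below is our own illustration (not the paper's algorithm or code, and all names are ours): for a 2x2 tensor it enumerates the 16 vertices of $S_1$ and solves (8) by optimizing the convex-combination weights with scipy's SLSQP solver. This brute force is viable only when $\rho$ is very small, which is exactly why the oracle-based method of Sec. 4.3 is needed.

```python
# Illustrative brute force for problem (8) on a tiny 2x2 instance: solve the
# equivalent convex aggregation over the (enumerable) vertices of S_lambda.
import itertools
import numpy as np
from scipy.optimize import minimize

dims, lam = (2, 2), 1.0
# Enumerate the vertices of S_1: outer products of binary factor vectors.
verts = [np.outer(t1, t2).astype(float)
         for t1 in itertools.product([0, 1], repeat=dims[0])
         for t2 in itertools.product([0, 1], repeat=dims[1])]
V = np.stack([v.ravel() for v in verts])           # (num_vertices, 4)

# Synthetic observations: flattened indices x_i and noisy values y_i.
rng = np.random.default_rng(0)
true_mu = rng.dirichlet(np.ones(len(verts)))
psi_true = lam * true_mu @ V
obs = rng.integers(0, 4, size=20)
y = psi_true[obs] + 0.01 * rng.standard_normal(20)

def loss(mu):
    # Empirical squared loss of psi = lam * sum_k mu_k v_k at the observed cells.
    psi = lam * mu @ V
    return np.mean((y - psi[obs]) ** 2)

# Optimize over the simplex of weights mu, i.e., over lam * conv(S_1).
res = minimize(loss, np.full(len(verts), 1.0 / len(verts)),
               bounds=[(0.0, 1.0)] * len(verts),
               constraints=[{"type": "eq", "fun": lambda mu: mu.sum() - 1.0}],
               method="SLSQP")
psi_hat = (lam * res.x @ V).reshape(dims)          # estimate of the full tensor
```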
Proposition 4.1 ((Lecué et al., 2013)). Suppose $|y| \le b$ almost surely. Given any $\delta > 0$, with probability at least $1 - 4\delta$ we have that

$$\mathbb{E}\big(y - \hat\psi_x\big)^2 \le \min_{\phi \in C_\lambda} \mathbb{E}\big(y - \phi_x\big)^2 + c_0 \max\{b^2, \lambda^2\}\, \max\Big\{\zeta_n, \tfrac{\log(1/\delta)}{n}\Big\},$$

where $c_0$ is an absolute constant and

$$\zeta_n = \begin{cases} 2^{\rho}/n, & \text{if } 2^{\rho} \le \sqrt{n} \\ \sqrt{\tfrac{1}{n} \log\big(e\, 2^{\rho}/\sqrt{n}\big)}, & \text{if } 2^{\rho} > \sqrt{n} \end{cases} \qquad (10)$$

Remark 4.2. We make two comments. First, note that $\zeta_n = O(\sqrt{\rho/n})$. Second, in some regimes $\zeta_n$ can be considerably faster than the $\sqrt{\rho/n}$ rate.

Generalization bounds under specific noise models, such as an additive noise model, follow as a corollary to the above proposition combined with Proposition 2.4.

Corollary 4.3. Suppose $\phi$ is a nonnegative tensor with $\mathrm{rank}_+(\phi) = k$ and $\|\phi\|_{\max} \le \mu$. Suppose the $(x^i, y^i)$ are independent and identically distributed with $|y^i - \phi_{x^i}| \le e$ almost surely and $\mathbb{E}\, y^i = \phi_{x^i}$. Then given any $\delta > 0$, with probability at least $1 - 4\delta$ we have

$$\mathbb{E}\big(y - \hat\psi_x\big)^2 \le e^2 + c_0\, (\mu k + e)^2\, \max\Big\{\zeta_n, \tfrac{\log(1/\delta)}{n}\Big\},$$

where $\zeta_n$ is as in (10) and $c_0$ is an absolute constant.

Remark 4.4. The above result achieves the information-theoretic rate when the rank $k = O(1)$.

4.2 Computational Complexity

Though (8) is a convex optimization problem, our next result shows that solving it is NP-hard.

Proposition 4.5. It is NP-hard to solve the optimization problem (8) to an arbitrary accuracy.

Proof. Define the ball of radius $\delta > 0$ centered at a nonnegative tensor $\psi$ to be $B(\psi, \delta) = \{\phi : \|\phi - \psi\|_F \le \delta\}$. Next define $\overline{W}(C_1, \delta) = \bigcup_{\psi \in C_1} B(\psi, \delta)$ and $\underline{W}(C_1, \delta) = \{\psi \in C_1 : B(\psi, \delta) \subseteq C_1\}$. The weak membership problem for $C_1$ is: given a nonnegative tensor $\psi$ and a $\delta > 0$, decide whether $\psi \in \overline{W}(C_1, \delta)$ or $\psi \notin \underline{W}(C_1, \delta)$. Theorem 10 of (Friedland and Lim, 2016) shows that approximation of the norm $\|\cdot\|_+$ is polynomial-time reducible to the weak membership problem for $C_1$. Since Proposition 3.1 shows that approximation of the norm $\|\cdot\|_+$ is NP-hard, the result follows if we can reduce the weak membership problem to (8).

Suppose we are given inputs $\psi$ and $\delta$ for the weak membership problem. Choose $x^i$ for $i = 1, \ldots, \pi$ such that each element in $R$ is enumerated exactly once. Next choose $y^i = \psi_{x^i}$. Finally, note if we solve (8) and the minimum objective value is less than or equal to $\delta$, then we have $\psi \in \overline{W}(C_1, \delta)$; otherwise we have $\psi \notin \underline{W}(C_1, \delta)$. The result now follows since this was the desired reduction.

We note that the decision version of (8), that is, ascertaining whether a given tensor $\psi$ attains an objective less than a given value while also satisfying $\|\psi\|_+ \le \lambda$, is NP-complete. This follows directly from Corollary 3.2 and Proposition 4.5.

4.3 Numerical Computation

Although it is NP-hard, there is substantial structure that enables practical numerical computation of global minima of (8). The key observation is that $C_1$ is a 0-1 polytope, which implies the linear separation problem on this polytope can be solved using integer linear optimization. Integer optimization has well-established global algorithms that are guaranteed to solve the separation problem for this polytope. This is a critical feature that enables the use of the Frank-Wolfe algorithm or one of its variants to solve (8) to a desired numerical tolerance. In fact, the Frank-Wolfe variant known as Blended Conditional Gradients (BCG) (Braun et al., 2019) is particularly well-suited for calculating a solution to (8) for the following reasons.

The first reason is that our problem has structure such that the BCG algorithm will terminate in a linear (with respect to numerical tolerance) number of oracle steps.
A sufficient condition for this linear convergence is that the feasible set is a polytope and the objective function is strictly convex over the feasible set. For our problem (8), the objective function can be made strictly convex over the feasible set by an equivalent reformulation. Specifically, we use the equivalent reformulation in which we change the feasible set from $\{\psi : \|\psi\|_+ \le \lambda\} = C_\lambda$ to $\mathrm{Proj}_U(C_\lambda)$, where the projection is done over the unique indices specified by the set $U$. In fact, this projection is trivial to implement because it simply consists of discarding the entries of $\psi$ that are not observed by the $x^i$ indices.

The second reason is that the BCG algorithm uses weak linear separation, which accommodates early termination of the associated integer linear optimization. Integer optimization software tends to find optimal or near-optimal solutions considerably faster than certifying the optimality of such solutions. Furthermore, a variety of tuning parameters (Berthold et al., 2018) can be used to accelerate the search for good primal solutions when exact solutions are not needed. We also deploy a fast alternating maximization heuristic in order to avoid integer optimization oracle calls where possible. Thus, early termination allows us to deploy a globally convergent algorithm with practical solution times.

Hence we use the BCG algorithm to compute global minima of (8) to arbitrary numerical tolerance. For brevity we omit a full description of the algorithm, and instead focus on the separation oracle, which is the main component specifically adapted to our application.

Algorithm 1: Weak Separation Oracle for $C_\lambda$
Input: linear objective $c \in \mathbb{R}^{r_1 \times \cdots \times r_p}$, point $\psi \in C_\lambda$, accuracy $K \ge 1$, gap estimate $\Phi > 0$, norm bound $\lambda$
Output: Either (1) a vertex $\phi \in S_\lambda$ with $\langle c, \psi - \phi \rangle \ge \Phi/K$, or (2) false: $\langle c, \psi - \phi \rangle \le \Phi$ for all $\phi \in C_\lambda$

Algorithm 2: Alternating Maximization
Input: linear objective $c \in \mathbb{R}^{r_1 \times \cdots \times r_p}$, point $\psi \in C_\lambda$, norm bound $\lambda$, incumbent (binary) solution $\hat\theta$ corresponding to a vertex of $S_\lambda$
Output: Best known solution $\theta$
  $\theta \leftarrow \hat\theta$; $z \leftarrow z_M(\theta)$
  for $i = 1$ to $p$ do
    for $k = 1$ to $r_i$ do
      $\theta^{(i)}_k \leftarrow 1 - \theta^{(i)}_k$
      if $z_M(\theta) > z$ then $z \leftarrow z_M(\theta)$
      else $\theta^{(i)}_k \leftarrow 1 - \theta^{(i)}_k$
      end if
    end for
  end for

The oracle is described in Algorithm 1, where $\langle \cdot, \cdot \rangle$ is the dot product for tensors viewed as vectors; for notational simplicity we state the oracle in terms of the original space, ignoring projection onto $U$. Output condition (1) provides separation with some vertex $\phi$, while (2) requires certification that no such vertex exists.

In implementing the weak separation oracle, we first attempt separation with our alternating maximization procedure. Described as Algorithm 2, it involves the following objective function:

$$z_M(\theta) := \textstyle\sum_{x \in R} c_x \big(\psi_x - \lambda \prod_{k=1}^{p} \theta_{x_k}\big). \qquad (12)$$

The separation problem is thus treated as a multilinear binary optimization problem, and the algorithm successively optimizes over one index at a time. We apply this heuristic a fixed number of times, randomly complementing entries in the incumbent each time. The procedure runs in polynomial time, and is not guaranteed to separate nor can it certify that such separation is impossible. However, in our experiments it succeeds often, offering substantial practical speedups. If alternating maximization fails to separate, then we solve the integer programming problem:

$$\begin{array}{lll} \max_{\phi,\theta} & \langle c, \psi - \phi \rangle & \\ \text{s.t.} & \lambda(1 - p) + \lambda \sum_{k=1}^{p} \theta_{x_k} \le \phi_x, & x \in R \\ & 0 \le \phi_x \le \lambda\, \theta_{x_k}, & k \in [p],\ x \in R \\ & \theta_{x_k} \in \{0, 1\}, & k \in [p],\ x \in R \end{array} \qquad (13)$$

Note that early termination is deployed: we stop whenever an incumbent solution is found such that the objective is greater than $\Phi/K$.
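To make this concrete, below is a minimal sketch (our own illustration, not the paper's released code; the function and variable names such as weak_separation_ip and obs are hypothetical) of how the integer program (13), restricted to the observed index set $U$ as suggested above, could be formulated in Gurobi's Python interface, with Gurobi's BestObjStop parameter used as one way to implement the early termination just described.

```python
# Sketch of the weak-separation integer program (13) over the observed indices.
# Illustrative only: names and structure are ours, not the paper's implementation.
import gurobipy as gp
from gurobipy import GRB

def weak_separation_ip(c, psi, obs, dims, lam, Phi, K=1.0):
    """c, psi: dicts mapping an observed (0-based) index tuple x to c_x and psi_x.
    obs: list of observed index tuples (the set U); dims: (r_1, ..., r_p)."""
    p = len(dims)
    m = gp.Model("weak_separation")
    m.Params.OutputFlag = 0
    # Binary factor variables theta[k, j] for mode k and index value j.
    theta = m.addVars([(k, j) for k in range(p) for j in range(dims[k])],
                      vtype=GRB.BINARY, name="theta")
    # Entries of the candidate vertex phi, only on the observed cells.
    phi = m.addVars(obs, lb=0.0, ub=lam, name="phi")
    for x in obs:
        for k in range(p):
            # phi_x <= lam * theta[k, x_k]: phi_x is forced to 0 if any factor is 0.
            m.addConstr(phi[x] <= lam * theta[k, x[k]])
        # lam*(1-p) + lam*sum_k theta[k, x_k] <= phi_x: phi_x = lam if all factors are 1.
        m.addConstr(lam * (1 - p) +
                    lam * gp.quicksum(theta[k, x[k]] for k in range(p)) <= phi[x])
    # Maximize <c, psi - phi> over the observed entries.
    m.setObjective(gp.quicksum(c[x] * (psi[x] - phi[x]) for x in obs), GRB.MAXIMIZE)
    # Early termination: stop at the first incumbent with objective >= Phi / K.
    m.Params.BestObjStop = Phi / K
    m.optimize()
    if m.SolCount > 0 and m.ObjVal >= Phi / K:
        # Condition (1): return the observed entries of a separating vertex.
        return {x: phi[x].X for x in obs}
    # Otherwise the solve ran to optimality with value below Phi/K <= Phi,
    # so condition (2) holds: no vertex separates by more than Phi.
    return None
```

In practice one would warm-start such a solve from the incumbent produced by Algorithm 2 and tune the solver's primal heuristics as discussed above.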
If no incumbent with objective greater than $\Phi/K$ exists, then the integer programming solver is guaranteed to (eventually) establish a dual bound $z_U$ such that $\langle c, \psi - \phi \rangle \le z_U \le \Phi$ for all $\phi \in C_\lambda$.

5 Numerical Experiments

Here we present results that show the efficacy and scalability of our algorithm for nonnegative tensor completion. Our experiments were conducted on a laptop computer with 8 GB of RAM and an Intel Core i5 2.3 GHz processor with 2 cores/4 threads. The algorithms were coded in Python 3. We used Gurobi v9.1 (Gurobi Optimization, LLC, 2021) to solve the integer programs (13).

Figure 1: Results for order-3 nonnegative tensors with size $r \times r \times r$ and $n = 500$ samples (methods: BCG, ALS, SiLRTC, TNCP).
Figure 2: Results for increasing-order nonnegative tensors with size $10^{\times p}$ and $n = 10{,}000$ samples (methods: BCG, ALS, SiLRTC, TNCP).
Figure 3: Results for nonnegative tensors with size $10^{\times 6}$ and an increasing number of samples $n$, given as a sample percentage (methods: BCG, ALS, SiLRTC, TNCP).
Figure 4: Results for nonnegative tensors with size $10^{\times 7}$ and an increasing number of samples $n$, given as a sample percentage (methods: BCG, ALS, SiLRTC, TNCP).

As a benchmark, we use alternating least squares (ALS), which is often called a workhorse for numerical tensor problems (Kolda and Bader, 2009), as well as two state-of-the-art methods implemented by the PyTen package (Song et al., 2019), namely the simple low rank tensor completion (SiLRTC) algorithm (Liu et al., 2012) and the trace norm regularized CP decomposition (TNCP) algorithm (Liu et al., 2014). PyTen is available from https://github.com/datamllab/pyten under a GPL 2 license.

To minimize the impact of hyperparameter selection in our numerical results, we provided the ground truth values when possible. For instance, in our nonnegative tensor completion formulation (8) we chose $\lambda$ to be the smallest value for which we could certify that $\|\psi\|_+ \le \lambda$ for the true tensor $\psi$. This was accomplished by construction of the true tensor $\psi$. For ALS and TNCP, we used a nonnegative rank $k$ that was the smallest value for which we could certify that $\mathrm{rank}_+(\psi) \le k$. This was also accomplished by construction of the true tensor $\psi$. A last note is that ALS often works better when used with L2 regularization (Navasca et al., 2008). The hyperparameter for the L2 regularization for ALS was chosen in a way favorable to ALS in order to maximize its accuracy.

5.1 Experiments with Third-Order Tensors

Our first set of results concerns tensors of order $p = 3$ with increasing dimensions. In each experiment, the true tensor $\psi$ was constructed by randomly choosing 10 points from $S_1$ and then taking a random convex combination. This construction ensures $\|\psi\|_+ \le 1$ and $\mathrm{rank}_+(\psi) \le 10$. We used $n = 500$ samples (with indices sampled with replacement). Each experiment was repeated 100 times, and the results are shown in Figure 1. A table of these values and their standard error is found in the Appendix. We measured accuracy using the normalized mean squared error (NMSE) $\|\hat\psi - \psi\|_F^2 / \|\psi\|_F^2$, which is a more stringent measure than used in Corollary 4.3 because the statistical theoretical result does not include normalization. The results show that our approach provides modestly higher accuracy but requires more computation time. However, computation time remains on the order of seconds for the various tensor sizes.
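To make this experimental setup concrete, the following sketch (our own illustration; names are ours and this is not the paper's code) constructs a ground-truth tensor as a random convex combination of 10 vertices of $S_1$, samples entries with replacement, and evaluates NMSE.

```python
# Illustrative sketch of the synthetic setup in Sec. 5.1 (not the paper's code).
import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
dims, num_verts, n = (20, 20, 20), 10, 500

def random_s1_vertex(dims):
    """A random vertex of S_1: outer product of binary factor vectors."""
    factors = [rng.integers(0, 2, r).astype(float) for r in dims]
    return reduce(np.multiply.outer, factors)

# Ground truth: random convex combination of 10 vertices of S_1,
# so ||psi||_+ <= 1 and rank_+(psi) <= 10 by construction.
weights = rng.dirichlet(np.ones(num_verts))
psi_true = sum(w * random_s1_vertex(dims) for w in weights)

# n = 500 observed entries, with indices sampled with replacement.
idx = tuple(rng.integers(0, r, size=n) for r in dims)
y = psi_true[idx]                                   # noiseless observations

def nmse(psi_hat, psi):
    """Normalized mean squared error ||psi_hat - psi||_F^2 / ||psi||_F^2."""
    return np.sum((psi_hat - psi) ** 2) / np.sum(psi ** 2)

# Reference point: the NMSE of a tensor that is constant at the mean of y
# (the trivial baseline discussed in the text below).
print(nmse(np.full(dims, y.mean()), psi_true))
```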
We note that the NMSE value of approximately 0.48 that TNCP converges to is the value achieved by a tensor that is identically the average of the $y^i$ in all its entries.

5.2 Experiments with Increasing Tensor Order

Our second set of results concerns tensors with increasing order $p$, where each dimension takes the value $r_i = 10$ for $i = 1, \ldots, p$. In each experiment, the true tensor was constructed by the method in Sec. 5.1. We used $n = 10{,}000$ samples (with indices sampled with replacement). Each experiment was repeated 100 times, and the results are shown in Figure 2 with the full values and standard error given in the Appendix. SiLRTC and TNCP were unable to run for tensors with $10^8$ entries. The results show that our approach provides much higher accuracy but requires more computation time. However, computation remains on the order of minutes even for the largest tensor with $10^8$ entries.

5.3 Experiments with Increasing Sample Size

Our third and fourth sets of results concern tensors of size $10^{\times 6}$ and $10^{\times 7}$, respectively, where experiments are conducted with increasing sample size, given as a percentage of total entries. The true tensor $\psi$ was constructed as in Sec. 5.1. We begin with a sample percentage of 0.01% (with indices sampled with replacement), and increase the sample size by one order of magnitude in each subsequent set of experiments. Each experiment was repeated 100 times, and the results are shown in Figures 3 and 4 with the full values and standard error given in the Appendix. Again, the numerical results show that our approach provides substantially higher accuracy but requires more computation time. The computation time remains on the order of minutes for most of the sampling schemes, excepting the draw of $10^6$ entries for a tensor with $10^7$ entries (i.e., about 1.5 hours).

6 Conclusion

We defined a new norm for nonnegative tensors and used it to develop an algorithm for nonnegative tensor completion that provably converges in a linear number of oracle steps and meets the information-theoretic sample complexity rate. Its efficacy and scalability were demonstrated using experiments. The next step is to generalize this approach to all tensors. In fact, our norm definitions, optimization formulations, algorithm design, and theoretical analysis all extend to general tensors.

Acknowledgements

This material is based upon work partially supported by the NSF under grant CMMI-1847666.

References

Arora, S., Lund, C., Motwani, R., Sudan, M., and Szegedy, M. (1998). Proof verification and the hardness of approximation problems. Journal of the ACM (JACM), 45(3):501–555.
Aswani, A. (2016). Low-rank approximation and completion of positive tensors. SIAM Journal on Matrix Analysis and Applications, 37(3):1337–1364.
Barak, B. and Moitra, A. (2016). Noisy tensor completion via the sum-of-squares hierarchy. In Conference on Learning Theory, pages 417–445. PMLR.
Bartlett, P. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research.
Berthold, T., Hendel, G., and Koch, T. (2018). From feasibility to improvement to proof: three phases of solving mixed-integer programs. Optimization Methods and Software, 33(3):499–517.
Braun, G., Pokutta, S., Tu, D., and Wright, S. (2019). Blended conditional gradients. In International Conference on Machine Learning, pages 735–743. PMLR.
Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. (2012). The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849.
Dauwels, J., Garg, L., Earnest, A., and Pang, L. K. (2011). Handling missing data in medical questionnaires using tensor decompositions. In 2011 8th International Conference on Information, Communications and Signal Processing, pages 1–5.
Drenick, R. (1992). Multilinear programming: Duality theories. Journal of Optimization Theory and Applications, 72(3):459–486.
Friedland, S. and Lim, L.-H. (2014). Computational complexity of tensor nuclear norm. Submitted.
Friedland, S. and Lim, L.-H. (2016). The computational complexity of duality. SIAM Journal on Optimization, 26(4):2378–2393.
Gandy, S., Recht, B., and Yamada, I. (2011). Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2):025010.
Gurobi Optimization, LLC (2021). Gurobi Optimizer Reference Manual.
Hansen, P. (1979). Methods of nonlinear 0-1 programming. In Annals of Discrete Mathematics, volume 5, pages 53–70. Elsevier.
Hillar, C. and Lim, L.-H. (2013). Most tensor problems are NP-hard. Journal of the ACM, 60(6):45:1–45:39.
Kolda, T. and Bader, B. (2009). Tensor decompositions and applications. SIAM Review, 51(3):455–500.
Lecué, G. et al. (2013). Empirical risk minimization is optimal for the convex aggregation problem. Bernoulli, 19(5B):2153–2166.
Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer.
Li, X., Gunturk, B., and Zhang, L. (2008). Image demosaicing: A systematic survey. In Visual Communications and Image Processing 2008, volume 6822, page 68221J. International Society for Optics and Photonics.
Liu, J., Musialski, P., Wonka, P., and Ye, J. (2012). Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208–220.
Liu, Y., Shang, F., Jiao, L., Cheng, J., and Cheng, H. (2014). Trace norm regularized CANDECOMP/PARAFAC decomposition with missing data. IEEE Transactions on Cybernetics, 45(11):2437–2448.
Massart, P. (2000). Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse Sér. 6, 9(2):245–303.
Montanari, A. and Sun, N. (2018). Spectral algorithms for tensor completion. Communications on Pure and Applied Mathematics, 71(11):2381–2425.
Mu, C., Huang, B., Wright, J., and Goldfarb, D. (2014). Square deal: Lower bounds and improved relaxations for tensor recovery. In International Conference on Machine Learning, pages 73–81. PMLR.
Navasca, C., De Lathauwer, L., and Kindermann, S. (2008). Swamp reducing technique for tensor decomposition. In 2008 16th European Signal Processing Conference, pages 1–5. IEEE.
Nemirovski, A. (2000). Topics in non-parametric statistics. École d'Été de Probabilités de Saint-Flour, 28:85.
Padberg, M. (1989). The boolean quadric polytope: some characteristics, facets and relatives. Mathematical Programming, 45(1-3):139–172.
Papadimitriou, C. H. and Yannakakis, M. (1991). Optimization, approximation, and complexity classes. Journal of Computer and System Sciences, 43(3):425–440.
Qi, Y., Comon, P., and Lim, L.-H. (2014). Uniqueness of nonnegative tensor approximations. arXiv preprint arXiv:1410.8129.
Qi, Y., Comon, P., and Lim, L.-H. (2016). Uniqueness of nonnegative tensor approximations. IEEE Transactions on Information Theory, 62(4):2170–2183.
Rao, N., Shah, P., and Wright, S. (2015). Forward-backward greedy algorithms for atomic norm regularization. IEEE Transactions on Signal Processing, 63(21):5798–5811.
Rauhut, H. and Stojanac, Ž. (2021). Tensor theta norms and low rank recovery.
Numerical Algorithms, 88(1):25–66.
Rockafellar, R. and Wets, R. (2009). Variational Analysis. Springer.
Song, Q., Ge, H., Caverlee, J., and Hu, X. (2019). Tensor completion algorithms in big data analytics. ACM Transactions on Knowledge Discovery from Data (TKDD), 13(1):1–48.
Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In International Conference on Computational Learning Theory, pages 545–560. Springer.
Srebro, N., Sridharan, K., and Tewari, A. (2010). Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, pages 2199–2207.
Tsybakov, A. B. (2003). Optimal rates of aggregation. In Learning Theory and Kernel Machines, pages 303–313. Springer.
Witsenhausen, H. (1986). A simple bilinear optimization problem. Systems & Control Letters, 8(1):1–4.
Yuan, M. and Zhang, C.-H. (2016). On tensor completion via nuclear norm minimization. Foundations of Computational Mathematics, 16(4):1031–1068.
Yuan, M. and Zhang, C.-H. (2017). Incoherent tensor norms and their applications in higher order tensor completion. IEEE Transactions on Information Theory, 63(10):6753–6766.
Zhang, X., Wang, D., Zhou, Z., and Ma, Y. (2019). Robust low-rank tensor recovery with rectification and alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):238–255.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Sec. 5.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] This is foundational research and not tied to particular applications.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Supplemental Material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Sec. 5.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Appendix.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Sec. 5.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] See Sec. 5 for a citation to the PyTen package.
   (b) Did you mention the license of the assets? [Yes] See Sec. 5 for the license of the PyTen package.
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] There is new code in the Supplemental Material.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable?
[N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]