Orthogonal NMF through Subspace Exploration

Megasthenis Asteris, The University of Texas at Austin, megas@utexas.edu
Dimitris Papailiopoulos, University of California, Berkeley, dimitrisp@berkeley.edu
Alexandros G. Dimakis, The University of Texas at Austin, dimakis@austin.utexas.edu

Abstract

Orthogonal Nonnegative Matrix Factorization (ONMF) aims to approximate a nonnegative matrix as the product of two k-dimensional nonnegative factors, one of which has orthonormal columns. It yields potentially useful data representations as superpositions of disjoint parts, and it has been shown to work well for clustering tasks where traditional methods underperform. Existing algorithms rely mostly on heuristics which, despite their good empirical performance, lack provable performance guarantees. We present a new ONMF algorithm with provable approximation guarantees. For any constant dimension k, we obtain an additive EPTAS without any assumptions on the input. Our algorithm relies on a novel approximation to the related Nonnegative Principal Component Analysis (NNPCA) problem; given an arbitrary data matrix, NNPCA seeks k nonnegative components that jointly capture most of the variance. Our NNPCA algorithm is of independent interest and generalizes previous work that could only obtain guarantees for a single component. We evaluate our algorithms on several real and synthetic datasets and show that their performance matches or outperforms the state of the art.

1 Introduction

Orthogonal NMF. The success of Nonnegative Matrix Factorization (NMF) in a range of disciplines spanning data mining, chemometrics, signal processing and more has driven an extensive practical and theoretical study [1, 2, 3, 4, 5, 6, 7, 8]. Its power lies in its potential to generate meaningful decompositions of data into non-subtractive combinations of a few nonnegative parts. Orthogonal NMF (ONMF) [9] is a variant of NMF with an additional orthogonality constraint: given a real nonnegative $m \times n$ matrix $M$ and a target dimension $k$, typically much smaller than $m$ and $n$, we seek to approximate $M$ by the product of an $m \times k$ nonnegative matrix $W$ with orthogonal (w.l.o.g., orthonormal) columns and an $n \times k$ nonnegative matrix $H$. In the form of an optimization,

(ONMF)    $E^{\star} \triangleq \min_{W \ge 0,\ H \ge 0,\ W^\top W = I_k} \|M - W H^\top\|_F^2$.    (1)

Since $W$ is nonnegative, its columns are orthogonal if and only if they have disjoint supports. In turn, each row of $M$ is approximated by a scaled version of a single (transposed) column of $H$. Despite the admittedly limited representational power compared to NMF, ONMF yields sparser part-based representations that are potentially easier to interpret, while it naturally lends itself to certain applications. In a clustering setting, for example, $W$ serves as a cluster membership matrix and the columns of $H$ correspond to $k$ cluster centroids [9, 10, 11]. Empirical evidence shows that ONMF performs remarkably well in certain clustering tasks, such as document classification [6, 11, 12, 13, 14, 15]. In the analysis of textual data, where $M$ is a words-by-documents matrix, the orthogonal columns of $W$ can be interpreted as topics defined by disjoint subsets of words. In the case of an image dataset, with each column of $M$ corresponding to an image evaluated on multiple pixels, each of the orthogonal base vectors highlights a disjoint segment of the image area.

Nonnegative PCA. For any given factor $W \ge 0$ with orthonormal columns, the second ONMF factor $H$ is readily determined: $H = M^\top W \ge 0$.
This follows from the fact that $M$ is by assumption nonnegative. Based on the above, it can be shown that the ONMF problem (1) is equivalent to

(NNPCA)    $V^{\star} \triangleq \max_{W \in \mathcal{W}_k} \|M^\top W\|_F^2$,    (2)

where $\mathcal{W}_k \triangleq \{W \in \mathbb{R}^{m \times k} : W \ge 0,\ W^\top W = I_k\}$. For arbitrary (i.e., not necessarily nonnegative) matrices $M$, the non-convex maximization (2) coincides with the Nonnegative Principal Component Analysis (NNPCA) problem [16]. Similarly to vanilla PCA, NNPCA seeks $k$ orthogonal components that jointly capture most of the variance of the (centered) data in $M$. The nonzero entries of the extracted components, however, must be positive, which renders the problem NP-hard even in the case of a single component ($k = 1$) [17].

Our Contributions. We present a novel algorithm for NNPCA. Our algorithm approximates the solution to (2) for any real input matrix and is accompanied by global approximation guarantees. Using the above as a building block, we develop an algorithm to approximately solve the ONMF problem (1) on any nonnegative matrix. Our algorithm outputs a solution that strictly satisfies both the nonnegativity and the orthogonality constraints. Our main results are as follows:

Theorem 1. (NNPCA) For any $m \times n$ matrix $M$, desired number of components $k$, and accuracy parameter $\epsilon \in (0, 1)$, our NNPCA algorithm computes $W \in \mathcal{W}_k$ such that
$\|M^\top W\|_F^2 \ge (1 - \epsilon)\, V^{\star} - k\, \sigma_{r+1}^2(M)$,
where $\sigma_{r+1}(M)$ is the $(r+1)$-th singular value of $M$, in time $T_{\mathrm{SVD}}(r) + O\big((1/\epsilon)^{r k}\, k\, m\big)$.

Here, $T_{\mathrm{SVD}}(r)$ denotes the time required to compute a rank-$r$ approximation $\bar{M}$ of the input $M$ using the truncated singular value decomposition (SVD). Our NNPCA algorithm operates on the low-rank matrix $\bar{M}$. The parameter $r$ controls a natural trade-off: higher values of $r$ lead to tighter guarantees, but impact the running time of our algorithm. Finally, note that despite the exponential dependence on $r$ and $k$, the complexity scales polynomially in the ambient dimension of the input.

If the input matrix $M$ is nonnegative, as in any instance of the ONMF problem, we can compute an approximate orthogonal nonnegative factorization in two steps: first obtain an orthogonal factor $W$ by (approximately) solving the NNPCA problem on $M$, and subsequently set $H = M^\top W$.

Theorem 2. (ONMF) For any $m \times n$ nonnegative matrix $M$, target dimension $k$, and desired accuracy $\epsilon \in (0, 1)$, our ONMF algorithm computes an ONMF pair $W$, $H$ such that
$\|M - W H^\top\|_F^2 \le E^{\star} + \epsilon\, \|M\|_F^2$,
in time $T_{\mathrm{SVD}}(k/\epsilon) + O\big((1/\epsilon)^{k^2/\epsilon}\, k\, m\big)$.

For any constant dimension $k$, Theorem 2 implies an additive EPTAS for the relative ONMF approximation error. This is, to the best of our knowledge, the first general ONMF approximation guarantee, since we impose no assumptions on $M$ beyond nonnegativity. We evaluate our NNPCA and ONMF algorithms on synthetic and real datasets. As we discuss in Section 4, in several cases we show improvements compared to the previous state of the art.

Related Work. ONMF as a variant of NMF first appeared implicitly in [18]. The formulation in (1) was introduced in [9]. Several algorithms in a subsequent line of work [12, 13, 19, 20, 21, 22] approximately solve variants of that optimization problem. Most rely on modifying approaches for NMF to accommodate the orthogonality constraint; whether exploiting additional structural properties of the objective [13], introducing a penalization term [9], or updating the current estimate in suitable directions [12], they typically reduce to a multiplicative update rule which attains orthogonality only in a limit sense.
In [11], the authors suggest two alternative approaches: an EM algorithm motivated by connections to spherical k-means, and an augmented Lagrangian formulation that explicitly enforces orthogonality but only achieves nonnegativity in the limit. Despite their good performance in practice, existing methods only guarantee local convergence.

Figure 1: ONMF and Separable NMF, upon appropriate permutation of the rows of $M$. In the first case, each row of $M$ is approximated by a single row of $H^\top$, while in the second, by a nonnegative combination of all $k$ rows of $H^\top$.

A significant body of work [23, 24, 25, 26] has focused on Separable NMF, a variant of NMF partially related to ONMF. Sep. NMF seeks to decompose $M$ into the product of two nonnegative matrices $W$ and $H$, where $W$ contains a permutation of the $k \times k$ identity matrix. Intuitively, the geometric picture of Sep. NMF should be quite different from that of ONMF: in the former, the rows of $H^\top$ are the extreme rays of a convex cone enclosing all rows of $M$, while in the latter they should be scattered in the interior of that cone so that each row of $M$ has one representative within a small angular distance. Algebraically, ONMF factors approximately satisfy the structural requirement of Sep. NMF, but the converse is not true: a Sep. NMF solution is not a valid ONMF solution (Fig. 1).

On the NNPCA front, nonnegativity as a constraint on PCA first appeared in [16], which proposed a coordinate-descent scheme on a penalized version of (2) to compute a set of nonnegative components. In [27], the authors developed a framework stemming from Expectation-Maximization (EM) on a generative model of PCA to compute a nonnegative (and optionally sparse) component. In [17], the authors proposed an algorithm based on sampling points from a low-dimensional subspace of the data covariance and projecting them onto the nonnegative orthant. Both [27] and [17] focus on the single-component problem; multiple components can be computed sequentially by employing a heuristic deflation step. Our main theoretical result is a generalization of the analysis of [17] to multiple components. Finally, note that despite the connection between the two problems, existing algorithms for ONMF are not suitable for NNPCA, as they only operate on nonnegative matrices.

2 Algorithms and Guarantees

2.1 Overview

We first develop an algorithm to approximately solve the NNPCA problem (2) on an arbitrary (i.e., not necessarily nonnegative) $m \times n$ matrix $M$. The core idea is to solve the NNPCA problem not directly on $M$, but on a rank-$r$ approximation $\bar{M}$ instead. Our main technical contribution is a procedure that approximates the solution to the constrained maximization (2) on a rank-$r$ matrix within a multiplicative factor arbitrarily close to 1, in time exponential in $r$ but polynomial in the dimensions of the input. Our Low Rank NNPCA algorithm relies on generating a large number of candidate solutions, one of which provably achieves an objective value close to optimal.

The $k$ nonnegative components $W \in \mathcal{W}_k$ returned by our Low Rank NNPCA algorithm on the sketch $\bar{M}$ are used as a surrogate for the desired components of the original input $M$. Intuitively, the performance of the extracted nonnegative components depends on how well $M$ is approximated by the low-rank sketch $\bar{M}$; a higher-rank approximation leads to better results. However, the complexity of our low-rank solver depends exponentially on the rank of its input. A natural trade-off arises between the quality of the extracted components and the running time of our NNPCA algorithm.
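To make the sketching step concrete, the following minimal NumPy sketch (illustrative only; the function name and structure are ours, not from the paper's code) computes the rank-$r$ approximation $\bar{M}$ via the truncated SVD, together with the residual $\|M - \bar{M}\|_2 = \sigma_{r+1}(M)$ that enters the bound of Theorem 1.

```python
import numpy as np

def rank_r_sketch(M, r):
    """Best rank-r approximation of M via the truncated SVD (illustrative helper)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_bar = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
    # The spectral-norm residual ||M - M_bar||_2 equals the (r+1)-th singular value.
    residual = s[r] if r < s.size else 0.0
    return M_bar, residual

# Example: the sharper the spectral decay of M, the smaller the residual for a fixed r.
rng = np.random.default_rng(0)
M = rng.random((50, 80))
M_bar, sigma_rp1 = rank_r_sketch(M, r=4)
assert np.isclose(np.linalg.norm(M - M_bar, 2), sigma_rp1)
```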
Using our NNPCA algorithm as a building block, we propose a novel algorithm for the ONMF problem (1). In an ONMF instance, we are given an $m \times n$ nonnegative matrix $M$ and a target dimension $k < m, n$, and seek to approximate $M$ with a product $W H^\top$ of two nonnegative matrices, where $W$ additionally has orthonormal columns. Computing such a factorization is equivalent to solving the NNPCA problem on the nonnegative matrix $M$. (See Appendix A.1 for a formal argument.) Once a nonnegative orthogonal factor $W$ is obtained, the second ONMF factor is readily determined: $H = M^\top W$ minimizes the Frobenius approximation error in (1) for the given $W$. Under an appropriate configuration of the accuracy parameters, for any nonnegative $m \times n$ input $M$ and constant target dimension $k$, our algorithm yields an additive EPTAS for the relative approximation error, without any additional assumptions on the input data.

2.2 Main Results

Algorithm 1 Low Rank NNPCA
  input: real $m \times n$ rank-$r$ matrix $\bar{M}$; $k$; $\epsilon \in (0, 1)$
  output: $W \in \mathcal{W}_k \subseteq \mathbb{R}^{m \times k}$  {See Lemma 1}
  1: $\mathcal{C} \leftarrow \{\}$  {Candidate solutions}
  2: $U, \Sigma, V \leftarrow \mathrm{SVD}(\bar{M}, r)$  {Trunc. SVD}
  3: for each $C \in [\mathcal{N}_{\epsilon/2}(\mathcal{S}_2^{r-1})]^k$ do
  4:   $A \leftarrow U \Sigma C$  {$A \in \mathbb{R}^{m \times k}$}
  5:   $\widehat{W} \leftarrow \mathrm{LocalOptW}(A)$  {Alg. 3}
  6:   $\mathcal{C} \leftarrow \mathcal{C} \cup \{\widehat{W}\}$
  7: end for
  8: $W \leftarrow \arg\max_{W \in \mathcal{C}} \|\bar{M}^\top W\|_F^2$

Low Rank NNPCA. We develop an algorithm to approximately solve the NNPCA problem on an $m \times n$ real rank-$r$ matrix $\bar{M}$:

$W_{\star} \triangleq \arg\max_{W \in \mathcal{W}_k} \|\bar{M}^\top W\|_F^2$.    (3)

The procedure, which lies at the core of our subsequent developments, is encoded in Alg. 1. We describe it in detail in Section 3. The key observation is that, irrespective of the dimensions of the input, the maximization in (3) can be reduced to $k \cdot r$ unknowns. The algorithm generates a large number of $k$-tuples of $r$-dimensional points; the collection of tuples is denoted by $[\mathcal{N}_{\epsilon/2}(\mathcal{S}_2^{r-1})]^k$, the $k$-th Cartesian power of an $\epsilon/2$-net of the $r$-dimensional unit sphere. Using these points, we effectively sample the column space of the input $\bar{M}$. Each tuple yields a feasible solution $W \in \mathcal{W}_k$ through a computationally efficient subroutine (Alg. 3). The best among those candidate solutions is provably close to the optimal $W_{\star}$ with respect to the objective in (2). The approximation guarantees are formally established in the following lemma.

Lemma 1. For any real $m \times n$ matrix $\bar{M}$ with rank $r$, desired number of components $k$, and accuracy parameter $\epsilon \in (0, 1)$, Algorithm 1 outputs $W \in \mathcal{W}_k$ such that
$\|\bar{M}^\top W\|_F^2 \ge (1 - \epsilon)\, \|\bar{M}^\top W_{\star}\|_F^2$,
where $W_{\star}$ is the optimal solution defined in (3), in time $T_{\mathrm{SVD}}(r) + O\big((2/\epsilon)^{r k}\, k\, m\big)$.

Proof. (See Appendix A.2.)

Nonnegative PCA. Given an arbitrary real $m \times n$ matrix $M$, we can generate a rank-$r$ sketch $\bar{M}$ and solve the low-rank NNPCA problem on $\bar{M}$ using Algorithm 1. The output $W \in \mathcal{W}_k$ of the low-rank problem can be used as a surrogate for the desired components of the original input $M$. For simplicity, here we consider the case where $\bar{M}$ is the rank-$r$ approximation of $M$ obtained by the truncated SVD. Intuitively, the performance of the extracted components on the original data matrix $M$ will depend on how well the latter is approximated by $\bar{M}$, and in turn on the spectral decay of the input data. For example, if $M$ exhibits a sharp spectral decay, which is frequently the case in real data, a moderate value of $r$ suffices to obtain a good approximation. This leads to our first main theorem, which formally establishes the guarantees of our NNPCA algorithm.

Theorem 1. For any real $m \times n$ matrix $M$, let $\bar{M}$ be its best rank-$r$ approximation. Algorithm 1 with input $\bar{M}$ and parameters $k$ and $\epsilon \in (0, 1)$ outputs $W \in \mathcal{W}_k$ such that
$\|M^\top W\|_F^2 \ge (1 - \epsilon)\, \|M^\top W_{\star}\|_F^2 - k\, \|M - \bar{M}\|_2^2$,
where $W_{\star} \triangleq \arg\max_{W \in \mathcal{W}_k} \|M^\top W\|_F^2$, in time $T_{\mathrm{SVD}}(r) + O\big((1/\epsilon)^{r k}\, k\, m\big)$.

Proof. The proof follows from Lemma 1; it is formally provided in Appendix A.3.
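For intuition, here is a compact NumPy sketch of the candidate-generation loop of Algorithm 1. It is only an illustration: the helpers `eps_net_sphere` (an $\epsilon/2$-net of the unit sphere, cf. Appendix C) and `local_opt_w` (the subroutine of Algorithm 3, described in Section 3) are assumed to be supplied by the caller, and all names are ours.

```python
import itertools
import numpy as np

def low_rank_nnpca(M_bar, k, eps, eps_net_sphere, local_opt_w):
    """Sketch of Algorithm 1 on a rank-r matrix M_bar.

    eps_net_sphere(r, delta) is assumed to return a list of r-dimensional unit
    vectors forming a delta-net of the sphere; local_opt_w(A) is assumed to return
    the maximizer of sum_j <w_j, a_j>^2 over W in W_k (Algorithm 3).
    """
    U, s, Vt = np.linalg.svd(M_bar, full_matrices=False)
    r = int(np.sum(s > 1e-12))                 # numerical rank of the sketch
    U, S = U[:, :r], np.diag(s[:r])
    net = eps_net_sphere(r, eps / 2.0)         # points on the unit sphere S_2^{r-1}
    best_W, best_val = None, -np.inf
    for tup in itertools.product(net, repeat=k):   # k-th Cartesian power of the net
        C = np.column_stack(tup)               # r x k matrix of stacked net points
        A = U @ S @ C                          # A = U Sigma C, as in line 4 of Alg. 1
        W = local_opt_w(A)                     # candidate solution in W_k (Alg. 3)
        val = np.linalg.norm(M_bar.T @ W, 'fro') ** 2
        if val > best_val:                     # keep the best candidate (line 8)
            best_W, best_val = W, val
    return best_W
```

As discussed below, in practice the loop may be terminated early, returning the best candidate found so far at the cost of the formal guarantee.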
Theorem 1 establishes a trade-off between the computational complexity of the proposed NNPCA approach and the tightness of the approximation guarantees; higher values of $r$ imply a smaller $\|M - \bar{M}\|_2^2$ and in turn a tighter bound (assuming that the singular values of $M$ decay), but have an exponential impact on the running time. Despite the exponential dependence on $r$ and $k$, our approach is polynomial in the dimensions of the input $M$, dominated by the truncated SVD. In practice, Algorithm 1 can be terminated early, returning the best computed result at the time of termination, sacrificing the theoretical approximation guarantees. In Section 4 we empirically evaluate our algorithm on real datasets and demonstrate that even for small values of $r$, our NNPCA algorithm significantly outperforms existing approaches.

Orthogonal NMF. The NNPCA algorithm straightforwardly yields an algorithm for the ONMF problem (1). In an ONMF instance, the input matrix $M$ is by assumption nonnegative. Given any $m \times k$ orthogonal nonnegative factor $W$, the optimal choice for the second factor is $H = M^\top W$. Hence, it suffices to determine $W$, which can be obtained by solving the NNPCA problem on $M$.

Algorithm 2 ONMFS
  input: $m \times n$ real $M \ge 0$; $r$; $k$; $\epsilon \in (0, 1)$
  1: $\bar{M} \leftarrow \mathrm{SVD}(M, r)$
  2: $W \leftarrow \text{Low Rank NNPCA}(\bar{M}, k, \epsilon)$  {Alg. 1}
  3: $H \leftarrow M^\top W$
  output: $W$, $H$

The proposed ONMF algorithm is outlined in Alg. 2. Given a nonnegative $m \times n$ matrix $M$, we first obtain a rank-$r$ approximation $\bar{M}$ via the truncated SVD, where $r$ is an accuracy parameter. Using Alg. 1 on $\bar{M}$, we compute an orthogonal nonnegative factor $W \in \mathcal{W}_k$ that approximately maximizes (3) within the desired accuracy. The second ONMF factor $H$ is readily determined as described earlier. The accuracy parameter $r$ once again controls a trade-off between the quality of the ONMF factors and the complexity of the algorithm. We note, however, that for any target dimension $k$ and desired accuracy $\epsilon$, setting $r = k/\epsilon$ suffices to achieve an additive $\epsilon$ error on the relative approximation error of the ONMF problem. More formally,

Theorem 2. For any $m \times n$ real nonnegative matrix $M$, target dimension $k$, and desired accuracy $\epsilon \in (0, 1)$, Algorithm 2 with parameter $r = k/\epsilon$ outputs an ONMF pair $W$, $H$ such that
$\|M - W H^\top\|_F^2 \le E^{\star} + \epsilon\, \|M\|_F^2$,
in time $T_{\mathrm{SVD}}(k/\epsilon) + O\big((1/\epsilon)^{k^2/\epsilon}\, k\, m\big)$.

Proof. (See Appendix A.4.)

Theorem 2 implies an additive EPTAS¹ for the relative approximation error in the ONMF problem for any constant target dimension $k$; Algorithm 2 runs in time polynomial in the dimensions of the input $M$. Finally, note that it requires no assumption on $M$ beyond nonnegativity.

3 The Low Rank NNPCA Algorithm

In this section, we revisit Alg. 1, which plays a central role in our developments, as it is the key piece of our NNPCA and, in turn, our ONMF algorithm. Alg. 1 approximately solves the NNPCA problem (3) on a rank-$r$, $m \times n$ matrix $\bar{M}$. It operates by producing a large but tractable number of candidate solutions $W \in \mathcal{W}_k$ and returning the one that maximizes the objective value in (2). In the sequel, we provide a brief description of the ideas behind the algorithm.

We are interested in approximately solving the low-rank NNPCA problem (3). Let $\bar{M} = U \Sigma V^\top$ denote the truncated SVD of $\bar{M}$. For any $W \in \mathbb{R}^{m \times k}$,

$\|\bar{M}^\top W\|_F^2 = \|\Sigma U^\top W\|_F^2 = \sum_{j=1}^{k} \|\Sigma U^\top w_j\|_2^2 = \sum_{j=1}^{k} \max_{c_j \in \mathcal{S}_2^{r-1}} \langle w_j, U \Sigma c_j \rangle^2$,    (4)

where $\mathcal{S}_2^{r-1}$ denotes the $r$-dimensional $\ell_2$-unit sphere. Let $C$ denote the $r \times k$ variable formed by stacking the unit-norm vectors $c_j$, $j = 1, \ldots, k$.
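As a quick sanity check of identity (4) (an illustration only, not part of the algorithm), the following snippet verifies on random data that $\|\bar{M}^\top W\|_F^2 = \sum_j \|\Sigma U^\top w_j\|_2^2$, the maximum over unit-norm $c_j$ being attained at $c_j \propto \Sigma U^\top w_j$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r, k = 30, 40, 3, 2

# A random rank-r matrix and its (truncated) SVD.
M_bar = rng.random((m, r)) @ rng.random((r, n))
U, s, Vt = np.linalg.svd(M_bar, full_matrices=False)
U, S = U[:, :r], np.diag(s[:r])

# A feasible W in W_k: nonnegative, unit-norm columns with disjoint supports.
W = np.zeros((m, k))
W[:m // 2, 0] = rng.random(m // 2)
W[m // 2:, 1] = rng.random(m - m // 2)
W /= np.linalg.norm(W, axis=0)

lhs = np.linalg.norm(M_bar.T @ W, 'fro') ** 2
# max over unit c_j of <w_j, U S c_j>^2 equals ||S U^T w_j||^2 (Cauchy-Schwarz).
rhs = sum(np.linalg.norm(S @ U.T @ W[:, j]) ** 2 for j in range(k))
assert np.isclose(lhs, rhs)
```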
The key observation is that, for a given $C$, we can efficiently compute a $W \in \mathcal{W}_k$ that maximizes the right-hand side of (4). The procedure for that task is outlined in Alg. 3. Hence, the NNPCA problem (3) is reduced to determining the optimal value of the low-dimensional variable $C$. But first, let us provide a brief description of Alg. 3.

¹ Additive EPTAS (Efficient Polynomial Time Approximation Scheme [28, 29]) refers to an algorithm that can approximate the solution of an optimization problem within an arbitrarily small additive error $\epsilon$ and has complexity that scales polynomially in the input size $n$, but possibly exponentially in $1/\epsilon$. An EPTAS is more efficient than a PTAS because it enforces a polynomial dependence on $n$ for any $\epsilon$, i.e., a running time of the form $f(1/\epsilon) \cdot p(n)$, where $p(n)$ is a polynomial. For example, a running time of $O(n^{1/\epsilon})$ is considered a PTAS, but not an EPTAS.

Algorithm 3 LocalOptW
  input: real $m \times k$ matrix $A$
  output: $\widehat{W} = \arg\max_{W \in \mathcal{W}_k} \sum_{j=1}^{k} \langle w_j, a_j \rangle^2$
  1: $\mathcal{C}_W \leftarrow \{\}$
  2: for each $s \in \{\pm 1\}^k$ do
  3:   $A' \leftarrow A\, \mathrm{diag}(s)$
  4:   $I_j \leftarrow \{\}$, $j = 1, \ldots, k$
  5:   for $i = 1, \ldots, m$ do
  6:     $j_{\star} \leftarrow \arg\max_j A'_{ij}$
  7:     if $A'_{i j_{\star}} \ge 0$ then
  8:       $I_{j_{\star}} \leftarrow I_{j_{\star}} \cup \{i\}$
  9:     end if
  10:  end for
  11:  $W \leftarrow 0_{m \times k}$
  12:  for $j = 1, \ldots, k$ do
  13:    $[w_j]_{I_j} \leftarrow [a'_j]_{I_j} / \|[a'_j]_{I_j}\|_2$
  14:  end for
  15:  $\mathcal{C}_W \leftarrow \mathcal{C}_W \cup \{W\}$
  16: end for
  17: $\widehat{W} \leftarrow \arg\max_{W \in \mathcal{C}_W} \sum_{j=1}^{k} \langle w_j, a_j \rangle^2$

For a fixed $r \times k$ matrix $C$, Algorithm 3 computes

$\widehat{W} \triangleq \arg\max_{W \in \mathcal{W}_k} \sum_{j=1}^{k} \langle w_j, a_j \rangle^2$,    (5)

where $A \triangleq U \Sigma C$. The challenge is to determine the support of the optimal solution $\widehat{W}$; if an oracle revealed the optimal supports $I_j$, $j = 1, \ldots, k$, of its columns, then the exact values of the nonzero entries would be determined by the Cauchy-Schwarz inequality, and the contribution of the $j$-th summand in (5) would be equal to $\sum_{i \in I_j} A_{ij}^2$. Due to the nonnegativity constraints in $\mathcal{W}_k$, the optimal support $I_j$ of the $j$-th column must contain indices corresponding only to nonnegative or only to nonpositive entries of $a_j$, but not a combination of both. Algorithm 3 considers all $2^k$ possible sign combinations for the support sets implicitly, by solving (5) on all $2^k$ matrices $A' = A\, \mathrm{diag}(s)$, $s \in \{\pm 1\}^k$. Hence, we may assume without loss of generality that all support sets correspond to nonnegative entries of $A$. Moreover, if index $i \in [m]$ is assigned to $I_j$, then the contribution of the entire $i$-th row of $A$ to the objective is equal to $A_{ij}^2$. Based on the above, Algorithm 3 constructs the collection of the support sets by assigning index $i$ to $I_j$ if and only if $A_{ij}$ is nonnegative and the largest among the entries of the $i$-th row of $A$. The algorithm runs in time² $O(2^k\, k\, m)$ and guarantees that the output is the optimal solution to (5). A more formal analysis of Alg. 3 is provided in Section A.5. (A code sketch of this subroutine is given below, after the discussion of the discretization.)

Thus far, we have seen that any given value of $C$ can be associated with a feasible solution $W \in \mathcal{W}_k$ via the maximization (5) and Alg. 3. If we could efficiently consider all possible values in the (continuous) domain of $C$, we would be able to recover the pair that maximizes (4) and, in turn, the optimal solution of (3). However, that is not possible. Instead, we consider a fine discretization of the domain of $C$ and settle for an approximate solution. In particular, let $\mathcal{N}_{\epsilon}(\mathcal{S}_2^{r-1})$ denote a finite $\epsilon$-net of the $r$-dimensional $\ell_2$-unit sphere; for any point in $\mathcal{S}_2^{r-1}$, the net contains a point within distance $\epsilon$ from it. (See Appendix C for the construction of such a net.) Further, let $[\mathcal{N}_{\epsilon}(\mathcal{S}_2^{r-1})]^k$ denote the $k$-th Cartesian power of that net; the latter is a collection of $r \times k$ matrices $C$.
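The following is a minimal NumPy rendition of Algorithm 3 (a sketch with our own variable names): for each of the $2^k$ sign patterns it assigns every row index to the column with the largest nonnegative sign-adjusted entry, normalizes the selected entries, and keeps the best of the resulting candidates.

```python
import itertools
import numpy as np

def local_opt_w(A):
    """Sketch of Algorithm 3: maximize sum_j <w_j, a_j>^2 over W in W_k."""
    m, k = A.shape
    rows = np.arange(m)
    best_W, best_val = np.zeros((m, k)), -np.inf
    for signs in itertools.product([1.0, -1.0], repeat=k):
        A_s = A * np.asarray(signs)                 # A' = A diag(s)
        # Assign row i to the column j with the largest entry of A'[i, :],
        # provided that entry is nonnegative (lines 4-10 of Algorithm 3).
        j_star = np.argmax(A_s, axis=1)
        keep = A_s[rows, j_star] >= 0
        W = np.zeros((m, k))
        W[rows[keep], j_star[keep]] = A_s[rows[keep], j_star[keep]]
        # Normalize each nonzero column (line 13); supports are disjoint by construction.
        norms = np.linalg.norm(W, axis=0)
        W[:, norms > 0] /= norms[norms > 0]
        val = np.sum(np.sum(W * A, axis=0) ** 2)    # objective sum_j <w_j, a_j>^2
        if val > best_val:
            best_W, best_val = W, val
    return best_W
```

When called as a subroutine of Algorithm 1, the sign enumeration can be skipped (footnote 2), presumably because the net already contains points covering both $c_j$ and $-c_j$, reducing the cost to $O(k\, m)$.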
Algorithm 1 operates on this collection: for each $C$, it identifies a candidate solution $W \in \mathcal{W}_k$ via the maximization (5), using Algorithm 3. By the properties of $\epsilon$-nets, it can be shown that at least one of the computed candidate solutions must attain an objective value close to the optimum of (3). The guarantees of Alg. 1 are formally established in Lemma 1. A detailed analysis of the algorithm is provided in the corresponding proof in Appendix A.2. This completes the description of our algorithmic developments.

4 Experimental Evaluation

NNPCA. We compare our NNPCA algorithm against three existing approaches on real datasets: NSPCA [16], EM [27], and NNSPAN [17]. NSPCA computes multiple nonnegative, but not necessarily orthogonal, components; a parameter $\alpha$ penalizes the overlap among their supports. We set a high penalty ($\alpha = 10^{10}$) to promote orthogonality. EM and NNSPAN compute only a single nonnegative component; multiple components are computed consecutively, interleaving an appropriate deflation step. To ensure orthogonality, the deflation step effectively zeroes out the variables used in previously extracted components. Finally, note that both the EM and NSPCA algorithms are randomly initialized. All depicted values are the best results over multiple random restarts. For our algorithm, we use a sketch of rank $r = 4$ of the (centered) input data. Further, we apply an early-termination criterion: execution is terminated if no improvement is observed over a number of consecutive iterations (samples). This can only hurt the performance of our algorithm.

² When used as a subroutine in Alg. 1, Alg. 3 can be simplified into an $O(k\, m)$ procedure (lines 4-14).

Figure 2: Cumulative explained variance captured by $k$ nonnegative components; CBCL dataset [30]. In Fig. 2(a), we set $k = 8$ and plot the cumulative variance versus the number of components. EM and NNSPAN extract components greedily; the first components achieve high value, but subsequent ones contribute less to the objective. Our algorithm jointly optimizes the $k = 8$ components, achieving a 59.95% improvement over the second best method. Fig. 2(b) depicts the cumulative variance for various values of $k$; we note the percentage improvement of our algorithm over the second best method (the annotated improvements range from +32.96% to +59.95%).

CBCL Dataset. The CBCL dataset [30] contains 2429 gray-scale face images of 19 x 19 pixels each. It has been used in the evaluation of all three methods [16, 17, 27]. We extract $k$ orthogonal nonnegative components using all methods and compare the total explained variance, i.e., the objective in (2). We note that the input data has been centered and is hence not nonnegative. Fig. 2(a) depicts the cumulative explained variance versus the number of components for $k = 8$. EM and NNSPAN extract components greedily with a deflation step; the first component achieves a high value, but subsequent ones contribute less to the total variance. On the contrary, our algorithm jointly optimizes the $k = 8$ components, achieving an approximately 60% increase in the total variance compared to the second best method. We repeat the experiment for $k = 2, \ldots, 8$. Fig. 2(b) depicts the total variance captured by each method for each value of $k$. Our algorithm significantly outperforms the existing approaches.
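For reference, the comparison metric in Fig. 2 is the cumulative explained variance, i.e., the NNPCA objective restricted to the first $j$ extracted components; below is a small illustration (ours) of how such a curve can be computed from the centered data matrix $M$ and a components matrix $W$.

```python
import numpy as np

def cumulative_explained_variance(M, W):
    """Returns the curve ||M^T W[:, :j]||_F^2 for j = 1, ..., k."""
    col_contrib = np.sum((M.T @ W) ** 2, axis=0)   # per-component contribution
    return np.cumsum(col_contrib)
```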
Additional Datasets. We solve the NNPCA problem on various datasets obtained from [31]. We arbitrarily set the target number of components to $k = 5$ and configure our algorithm to use a rank-4 sketch of the input. Table 1 lists the total variance captured by the extracted components for each method. Our algorithm consistently outperforms the other approaches.

ONMF. We compare our algorithm with several state-of-the-art ONMF algorithms: i) the O-PNMF algorithm of [13] (run for 1000 iterations), and ii) the more recent ONP-MF and iii) EM-ONMF algorithms of [11, 32] (run for 1000 iterations). We also compare to clustering methods (namely, vanilla and spherical k-means), since such algorithms also yield an approximate ONMF.

  Dataset (#samples x #variables)     NSPCA      EM         NNSPAN     ONMFS
  AMZN COM. REV (1500 x 10000)        5.44e+01   7.32e+03   7.32e+03   7.86e+03 (+7.37%)
  ARCENE TRAIN (100 x 10000)          4.96e+04   3.01e+07   3.00e+07   3.80e+07 (+26.7%)
  ISOLET-5 (1559 x 617)               5.83e-01   3.54e+01   3.55e+01   4.55e+01 (+28.03%)
  LEUKEMIA (72 x 12582)               3.02e+07   7.94e+09   8.02e+09   1.04e+10 (+29.57%)
  MFEAT PIX (2000 x 240)              2.00e+01   3.20e+02   3.25e+02   5.24e+02 (+61.17%)
  LOW RES. SPEC. (531 x 100)          3.98e+06   2.29e+08   2.29e+08   2.41e+08 (+5.34%)
  BOW:KOS (3430 x 6906)               4.96e-02   2.96e+01   3.00e+01   4.59e+01 (+52.95%)

Table 1: Total variance captured by $k = 5$ nonnegative components on various datasets [31]. For each dataset, we list its size (#samples x #variables) and the variance captured by each method; higher values are better. Our algorithm (labeled ONMFS) operates on a rank-4 sketch in all cases and consistently achieves the best results. We note in parentheses the percentage improvement over the second best method.

Figure 3: Relative Frobenius approximation error $\|M - W H^\top\|_F^2 / \|M\|_F^2$ versus the noise power $p \in [10^{-2}, 10^{0}]$ on synthetic data, for K-means, Spherical K-means, O-PNMF, ONP-MF, EM-ONMF, and ONMFS. Data points (samples) are generated by randomly scaling and adding noise to one of five base points that have been randomly selected from the unit hypercube in 100 dimensions. We run the ONMF methods with target dimension $k = 5$. Our algorithm is labeled ONMFS.

Synthetic data. We generate a synthetic dataset as follows. We select five base vectors $c_j$, $j = 1, \ldots, 5$, randomly and independently from the unit hypercube in 100 dimensions. Then, we generate data points $x_i = a_i \cdot c_j + p \cdot n_i$ for some $j \in \{1, \ldots, 5\}$, where $a_i \sim U([0.1, 1])$, $n_i \sim N(0, I)$, and $p$ is a parameter controlling the noise variance. Any negative entries of $x_i$ are set to zero. We vary $p$ in $[10^{-2}, 1]$. For each value of $p$, we compute an approximate ONMF on 10 randomly generated datasets and measure the relative Frobenius approximation error. For the methods that involve random initialization, we run 10 averaging iterations per Monte Carlo trial. Our algorithm is configured to operate on a rank-5 sketch. Figure 3 depicts the relative error achieved by each method (averaged over the random trials) versus the noise variance $p$. Our algorithm, labeled ONMFS, achieves competitive or higher accuracy for most values in the range of $p$. (A code sketch of this data generator is given at the end of this section.)

Real Datasets. We apply the ONMF algorithms to various nonnegative datasets obtained from [31]. We arbitrarily set the target number of components to $k = 6$. Table 2 lists the relative Frobenius approximation error achieved by each algorithm. We note that on the text datasets (e.g., Bag of Words [31]) we run the algorithms on the uncentered word-by-document matrix. Our algorithm performs competitively compared to the other methods.
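A minimal NumPy sketch of the synthetic-data generator described above (our illustration of the stated model; the number of samples and all names are our own choices):

```python
import numpy as np

def synthetic_onmf_data(n_samples, dim=100, n_bases=5, p=0.1, seed=0):
    """Samples x_i = a_i * c_j + p * n_i with a_i ~ U([0.1, 1]) and n_i ~ N(0, I);
    negative entries are clipped to zero, as in the experiment of Fig. 3."""
    rng = np.random.default_rng(seed)
    C = rng.random((n_bases, dim))                  # base vectors in the unit hypercube
    labels = rng.integers(n_bases, size=n_samples)  # which base each sample uses
    a = rng.uniform(0.1, 1.0, size=n_samples)       # random scaling
    X = a[:, None] * C[labels] + p * rng.standard_normal((n_samples, dim))
    return np.maximum(X, 0.0), labels
```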
5 Conclusions

We presented a novel algorithm for approximately solving the ONMF problem on a nonnegative matrix. Our algorithm relies on a new method for solving the NNPCA problem, which jointly optimizes multiple orthogonal nonnegative components and provably achieves an objective value close to optimal. Our ONMF algorithm is the first to be equipped with theoretical approximation guarantees; for a constant target dimension $k$, it yields an additive EPTAS for the relative approximation error. Empirical evaluation on synthetic and real datasets demonstrates that our algorithms outperform or match existing approaches on both problems.

Acknowledgments

DP is generously supported by NSF awards CCF-1217058 and CCF-1116404 and MURI AFOSR grant 556016. This research has been supported by NSF Grants CCF 1344179, 1344364, 1407278, 1422549 and ARO YIP W911NF-14-1-0258.

Table 2: ONMF approximation error on nonnegative datasets [31]. For each dataset, we list its size (#samples x #variables) and the relative Frobenius approximation error achieved by K-MEANS, O-PNMF, ONP-MF, EM-ONMF, and ONMFS, in that order; lower values are better. We arbitrarily set the target dimension $k = 6$. Missing entries denote an invalid solution or non-convergence. For our method, we note in parentheses the approximation rank $r$ used.

  AMZN COM. REV (10000 x 1500):   0.0547, 0.1153, 0.1153, 0.0467, 0.0462 (5)
  ARCENE TRAIN (100 x 10000):     0.0837, 0.1250, 0.0856, 0.0788 (4)
  MFEAT PIX (2000 x 240):         0.2489, 0.2974, 0.3074, 0.2447, 0.2615 (4)
  PEMS TRAIN (267 x 138672):      0.1441, 0.1439, 0.1380, 0.1278, 0.1283 (5)
  BOW:KOS (3430 x 6906):          0.8193, 0.7692, 0.7671, 0.7671, 0.7609 (4)
  BOW:ENRON (28102 x 39861):      0.9946, 0.6728, 0.7148, 0.6540 (4)
  BOW:NIPS (1500 x 12419):        0.8137, 0.7277, 0.7277, 0.7375, 0.7252 (5)
  BOW:NYTIMES (102660 x 3e+05):   0.9199, 0.9238, 0.9199 (5)

References

[1] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556-562, 2000.
[2] Gershon Buchsbaum and Orin Bloch. Color categories revealed by non-negative matrix factorization of Munsell color spectra. Vision Research, 42(5):559-563, 2002.
[3] Farial Shahnaz, Michael W. Berry, V. Paul Pauca, and Robert J. Plemmons. Document clustering using nonnegative matrix factorization. Information Processing & Management, 42(2):373-386, 2006.
[4] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756-2779, 2007.
[5] Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shun-ichi Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. John Wiley & Sons, 2009.
[6] Victor Bittorf, Benjamin Recht, Christopher Re, and Joel A. Tropp. Factoring nonnegative matrices with linear programs. Advances in Neural Information Processing Systems, 25:1223-1231, 2012.
[7] Nicolas Gillis and Stephen A. Vavasis. Fast and robust recursive algorithms for separable nonnegative matrix factorization. arXiv preprint arXiv:1208.1237, 2012.
[8] K. Huang, N. D. Sidiropoulos, and A. Swami. NMF revisited: New uniqueness results and algorithms. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 4524-4528. IEEE, 2013.
[9] Chris Ding, Tao Li, Wei Peng, and Haesun Park. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 126-135. ACM, 2006.
[10] Tao Li and Chris Ding. The relationships among various nonnegative matrix factorization methods for clustering. In Data Mining, 2006. ICDM '06. Sixth International Conference on, pages 362-371. IEEE, 2006.
[11] Filippo Pompili, Nicolas Gillis, P.-A. Absil, and François Glineur. Two algorithms for orthogonal nonnegative matrix factorization with application to clustering. arXiv preprint arXiv:1201.0901, 2012.
[12] Seungjin Choi. Algorithms for orthogonal nonnegative matrix factorization. In Neural Networks, 2008. IJCNN 2008 (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on, pages 1828-1832. IEEE, 2008.
[13] Zhirong Yang and Erkki Oja. Linear and nonlinear projective nonnegative matrix factorization. Neural Networks, IEEE Transactions on, 21(5):734-749, 2010.
[14] Da Kuang, Haesun Park, and Chris H. Q. Ding. Symmetric nonnegative matrix factorization for graph clustering. In SDM, volume 12, pages 106-117. SIAM, 2012.
[15] Zhirong Yang, Tele Hao, Onur Dikmen, Xi Chen, and Erkki Oja. Clustering by nonnegative matrix factorization using graph random walk. In Advances in Neural Information Processing Systems, pages 1088-1096, 2012.
[16] Ron Zass and Amnon Shashua. Nonnegative sparse PCA. In Advances in Neural Information Processing Systems 19, pages 1561-1568, Cambridge, MA, 2007. MIT Press.
[17] Megasthenis Asteris, Dimitris Papailiopoulos, and Alexandros Dimakis. Nonnegative sparse PCA with provable guarantees. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1728-1736, 2014.
[18] Zhijian Yuan and Erkki Oja. Projective nonnegative matrix factorization for image compression and feature extraction. In Image Analysis, pages 333-342. Springer, 2005.
[19] Hualiang Li, Tülay Adalı, Wei Wang, Darren Emge, and Andrzej Cichocki. Non-negative matrix factorization with orthogonality constraints and its application to Raman spectroscopy. The Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 48(1-2):83-97, 2007.
[20] Bin Cao, Dou Shen, Jian-Tao Sun, Xuanhui Wang, Qiang Yang, and Zheng Chen. Detect and track latent factors with online nonnegative matrix factorization. In IJCAI, volume 7, pages 2689-2694, 2007.
[21] Xin Li, William K. W. Cheung, Jiming Liu, and Zhili Wu. A novel orthogonal NMF-based belief compression for POMDPs. In Proceedings of the 24th International Conference on Machine Learning, pages 537-544. ACM, 2007.
[22] Gang Chen, Fei Wang, and Changshui Zhang. Collaborative filtering using orthogonal nonnegative matrix tri-factorization. Information Processing & Management, 45(3):368-379, 2009.
[23] N. Gillis and S. Vavasis. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
[24] Sanjeev Arora, Rong Ge, Yoni Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. A practical algorithm for topic modeling with provable guarantees. arXiv preprint arXiv:1212.4777, 2012.
[25] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization provably. In Proceedings of the 44th Symposium on Theory of Computing, pages 145-162, 2012.
[26] Abhishek Kumar, Vikas Sindhwani, and Prabhanjan Kambadur. Fast conical hull algorithms for near-separable non-negative matrix factorization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 231-239, 2013.
[27] Christian D. Sigg and Joachim M. Buhmann. Expectation-maximization for sparse and non-negative PCA. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 960-967, New York, NY, USA, 2008. ACM.
[28] Liming Cai, Michael Fellows, David Juedes, and Frances Rosamond. On efficient polynomial-time approximation schemes for problems on planar structures. Journal of Computer and System Sciences, 2003.
[29] Marco Cesati and Luca Trevisan. On the efficiency of polynomial time approximation schemes. Information Processing Letters, 64(4):165-171, 1997.
[30] Kah-Kay Sung. Learning and Example Selection for Object and Pattern Recognition. PhD thesis, MIT, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Cambridge, MA, 1996.
[31] M. Lichman. UCI Machine Learning Repository, 2013.
[32] Filippo Pompili, Nicolas Gillis, Pierre-Antoine Absil, and François Glineur. ONP-MF: An orthogonal nonnegative matrix factorization algorithm with application to clustering. In ESANN 2013, 2013.
[33] K. Bache and M. Lichman. UCI Machine Learning Repository, 2013.